
There are four Latin1-to-ASCII conversions that the WWers routinely check for and correct as necessary:

- oë (o, e-umlaut) - the posting software converts this to "ooe". This makes no sense in conversions resulting in cooerdinate, cooeperate, zooelogy, and their derivatives. The WWers will search for "ooe" and correct as necessary. (The conversion is correct for some words, but off-hand I can't think of an example.)

- n-tilde - converted to plain "n". The only word I'm aware of where this is incorrect is cañon/canyon. It's OK in words like senor, senorita, and pinon.

- ° (degree) - converted to "deg.". OK most of the time, but results in "deg.." if at sentence end. The WWers remove the extra period. The conversion can also mess up tables. Minor messes can be fixed by the WWers, but major ones usually result in a request for an ASCII file (unless one's already been provided).

- § (Section) - converted to "Sec.". OK most of the time, but "§§" results in "Sec.Sec.", which the WWers convert to "Secs.".

The example Roger mentions in 35833 is a mistake by whoever WWed it (no, not me). (Yes, WWers make mistakes, too. We're human, so hopefully that's not a revelation.) I've made the correction, and new files will be on-line in about 45 minutes.

Al
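
For anyone who wants to script the four checks above, here is a rough Python sketch. It is not the tool the WWers actually use; the names CHECKS, flag_suspects, and fix_mechanical are made up for illustration. It only flags "ooe" and "canon" for a human to review (since both can be legitimate), and it auto-corrects only the two cases described above as routine ("deg.." and "Sec.Sec."). It assumes a "deg.." is never the start of an ellipsis, which may not hold for every text.

```python
import re

# Hypothetical check list: (name, pattern, note for the reviewer).
CHECKS = [
    ("ooe", re.compile(r"ooe", re.IGNORECASE),
     "possibly o + e-umlaut, e.g. cooeperate"),
    ("canon", re.compile(r"\bcanons?\b", re.IGNORECASE),
     "possibly cañon; consider canyon"),
    ("deg..", re.compile(r"deg\.\."),
     "degree sign at sentence end; extra period"),
    ("Sec.Sec.", re.compile(r"Sec\.Sec\."),
     "was a doubled section sign; usually Secs."),
]


def flag_suspects(text):
    """Return (line_number, check_name, note) tuples for manual review."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern, note in CHECKS:
            if pattern.search(line):
                hits.append((lineno, name, note))
    return hits


def fix_mechanical(text):
    """Apply only the two corrections that need no judgment call."""
    text = text.replace("Sec.Sec.", "Secs.")
    # Assumption: "deg.." is a sentence-final degree sign, not an ellipsis.
    text = text.replace("deg..", "deg.")
    return text


if __name__ == "__main__":
    sample = "They poured through the man-made canons\nIt was 40 deg.."
    for hit in flag_suspects(sample):
        print(hit)
    print(fix_mechanical(sample))
```

The "ooe" and "canon" hits are deliberately left unfixed, because the right spelling depends on the word and the context.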
-----Original Message-----
From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Roger Frank
Sent: Wednesday, January 25, 2012 1:05 PM
To: Project Gutenberg Volunteer Discussion
Subject: Re: [gutvol-d] UTF-8 TXT (was Producing epub ready HTML)
On Jan 25, 2012, at 9:52 AM, Carlo Traverso wrote:
I believe that the problem in handling UTF-8 submissions is unitame, that is, the tool that the WWers use to recode UTF-8 to ISO Latin-1. It cannot handle simple things like bullets and Greek, but it would be very easy to extend it to be able to handle these characters and more (I did).
Unitame is part of it. But PG wants to include a plain ASCII file. Look at the first paragraph of "The Black Star" (etext 35833).
It started in UTF-8 as 35833-0.txt:
    They poured through the man-made cañons
then with unitame into Latin-1 as 35833-8.txt:
    They poured through the man-made cañons
but the ASCII version of 35833.txt has
    They poured through the man-made canons
I would write that in the ASCII file as
    They poured through the man-made canyons
There are similar special cases for "oo" in a word and others. Is there a tool to catch these special cases that the WWers could use if they are given a Latin-1 file? It's not a discussion of should a UTF-8 file alone be sufficient--that's one the WWers, Marcello, and Greg should agree upon. Right now, the ASCII version is in the mix, and unitame alone isn't enough to get it done. Anybody up for writing a latin1tame program?
--Roger