Re: [gutvol-d] UTF-8 TXT (was Producing epub ready HTML)

25 Jan 2012

      There are four Latin1-to-ASCII conversions that the WWers routinely
check for and correct as necessary:

- oö (o, o-umlaut) - the posting software converts this to "ooe".
The result makes no sense in words such as cooerdinate, cooeperate,
zooelogy, and their derivatives.  The WWers will search for "ooe" and
correct as necessary.  (The conversion is correct for some words, but
off-hand I can't think of an example.)

- ñ (n-tilde) - converted to plain "n".  The only word I'm aware of
where this is incorrect is cañon/canyon.  It's OK in words like senor,
senorita, and pinon.

- ° (degree) - converted to "deg.".  OK most of the time, but results
in "deg.." if at sentence end.  The WWers remove the extra period.
The conversion can also mess up tables.  Minor messes can be fixed by
the WWers, but major ones usually result in a request for an ASCII
file (unless one's already been provided).

- § (Section) - converted to "Sec.".  OK most of the time, but "§§"
results in "Sec.Sec.", which the WWers convert to "Secs.".
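
As a rough illustration, the four searches above could be sketched in
Python along the following lines.  This is only a sketch under stated
assumptions: the pattern list, the function name find_suspects, and
the notes attached to each pattern are made up for illustration and
are not part of unitame or any existing PG tool.  It only flags
candidates; each hit still needs a human decision, since "ooe" is
sometimes correct and "canon" is a legitimate word.

```python
import re

# Hypothetical patterns for the four conversion artifacts described above.
# Names and notes are illustrative only, not taken from any PG tool.
CHECKS = [
    (re.compile(r"\w*ooe\w*", re.IGNORECASE), "possible o-umlaut artifact"),
    (re.compile(r"\bcanons?\b", re.IGNORECASE), "was this canyon (n-tilde lost)?"),
    (re.compile(r"deg\.\."), "doubled period after deg."),
    (re.compile(r"Sec\.Sec\."), "probably should be Secs."),
]


def find_suspects(text):
    """Return (line_number, matched_text, note) tuples for human review."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern, note in CHECKS:
            for match in pattern.finditer(line):
                hits.append((lineno, match.group(0), note))
    return hits


sample = (
    "We must cooerdinate.\n"
    "They poured through the man-made canons\n"
    "It was 40 deg..\n"
    "See Sec.Sec. 3 and 4.\n"
)
# Print each flagged span with its line number and the reason it was flagged.
for lineno, found, note in find_suspects(sample):
    print(f"line {lineno}: {found!r} - {note}")
```

Keeping the checks as a plain list of (pattern, note) pairs means a
new special case can be added without touching the scanning loop.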

The example Roger mentions in 35833 is a mistake by whoever WWed it
(no, not me).  (Yes, WWers make mistakes, too.  We're human, so
hopefully that's not a revelation.)  I've made the correction, and new
files will be on-line in about 45 minutes.  

Al
...
-----Original Message-----
From: gutvol-d-bounces@lists.pglaf.org 
[mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Roger Frank
Sent: Wednesday, January 25, 2012 1:05 PM
To: Project Gutenberg Volunteer Discussion
Subject: Re: [gutvol-d] UTF-8 TXT (was Producing epub ready HTML)
On Jan 25, 2012, at 9:52 AM, Carlo Traverso wrote:
...
I believe that the problem in handling UTF-8 submissions is
unitame, that is the tool that the WWers use to recode UTF-8 to
iso-Latin-1. It cannot handle simple things like bullets and Greek,
but it would be very easy to extend it to be able to handle these
characters and more (I did).

Unitame is part of it. But PG wants to include a plain ASCII file.
Look at the first paragraph of "The Black Star" (etext 35833).
It started in UTF-8 as 35833-0.txt:
  They poured through the man-made cañons
then with unitame into Latin-1 as 35833-8.txt:
  They poured through the man-made cañons
but the ASCII version of 35833.txt has
  They poured through the man-made canons
I would write that in the ASCII file as
  They poured through the man-made canyons
There are similar special cases for "oo" in a word and others.
Is there a tool to catch these special cases that the WWers
could use if they are given a Latin-1 file? This isn't a discussion
of whether a UTF-8 file alone should be sufficient--that's something
the WWers, Marcello, and Greg should agree upon.
Right now, the ASCII version is in the mix, and unitame
alone isn't enough to get it done. Anybody up for
writing a latin1tame program?
--Roger
_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d

Al Haines