Re: [gutvol-d] Producing epub ready HTML

Roger,
Someone is working on adapting gutcheck to handle UTF8 as a spinoff of the effort to improve Guiguts' handling of languages other than English (following up on a suggestion you made to me); see below. Could the WWers please liaise with him on whether this could be a sanctioned project, and also whether it should work under the gutcheck SourceForge project (if not, we can do it under the guiguts project)? Also, are better tools needed to convert UTF8 to Latin-1 and ASCII? The same individual has compiled aspell 0.6 for Windows, which offers Unicode normalization. All are welcome to contribute views in the fora below. Developers in C and Perl are needed.
Hunter
http://www.pgdp.net/phpBB2/viewtopic.php?p=820092#820092
http://www.pgdp.net/phpBB2/viewtopic.php?t=50082
From: Roger Frank <rfrank@rfrank.net>
I work in UTF-8. It would be easiest for me to send up UTF-8. But the WW tools to check the submission, like gutcheck, struggle with UTF-8.

I believe that the problem in handling UTF-8 submissions is unitame, that is, the tool that the WWers use to recode UTF-8 to iso-Latin-1. It cannot handle simple things like bullets and Greek, but it would be very easy to extend it to handle these characters and more (I did). Extending unitame is just a matter of editing an iso-latin file, unitame.dat. A slightly more complicated extension could let unitame accept a submission-specific patch file (looking first for a unitame.dat in the current location, then for the global unitame.dat). This could be much easier to finalize than fixing gutcheck. A UTF-8 gutcheck could hence be obtained by piping unitame output to "classic" gutcheck. Possibly even a quick-and-dirty version of unitame.dat, replacing unhandled characters with # instead of the current verbose output that suggests manual handling, could be enough to use gutcheck with UTF-8 files. Of course, manual tweaking would be needed if some WWers want to post iso-latin-1 or ASCII files anyway. I can contribute a version of unitame.dat that transliterates Greek (although with some suboptimal transliteration) and handles dingbats.
Carlo
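The layered lookup described above (a submission-specific table consulted before the global one, with # as the fallback for anything unmapped) can be sketched roughly as follows. This is a minimal Python sketch and not unitame itself: the real format of unitame.dat is not shown in this thread, so the two dictionaries, the function name `tame`, and the sample mappings are all assumptions made for illustration.

```python
# Hypothetical stand-ins for a global unitame.dat and a per-submission patch.
# The real unitame.dat format is not shown in the thread.
GLOBAL_TABLE = {
    "\u2022": "*",    # bullet
    "\u2026": "...",  # horizontal ellipsis
    "\u03b1": "a",    # Greek small alpha
    "\u03b2": "b",    # Greek small beta
}
LOCAL_TABLE = {
    "\u2022": "-",    # this submission prefers a hyphen for bullets
}


def tame(text, local=None, global_table=None, fallback="#"):
    """Recode text to Latin-1: local table first, then global, then fallback."""
    local = LOCAL_TABLE if local is None else local
    global_table = GLOBAL_TABLE if global_table is None else global_table
    out = []
    for ch in text:
        if ord(ch) < 256:          # already representable in Latin-1
            out.append(ch)
        elif ch in local:          # submission-specific override wins
            out.append(local[ch])
        elif ch in global_table:   # otherwise use the site-wide mapping
            out.append(global_table[ch])
        else:
            out.append(fallback)   # quick-and-dirty marker for unhandled characters
    return "".join(out)


if __name__ == "__main__":
    sample = "\u2022 ca\u00f1on \u03b1\u03b2 \u2603"
    print(tame(sample))
```

The output of such a step always encodes cleanly as Latin-1, so it could in principle be handed to an unmodified 8-bit checker; the # markers show where a character still needs a mapping.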

On Jan 25, 2012, at 9:52 AM, Carlo Traverso wrote:
I believe that the problem in handling UTF-8 submissions is unitame, that is the tool that the WWers use to recode UTF-8 to iso-Latin-1. It cannot handle simple things like bullets and greek, but it would be very easy to extend it to be able to handle these characters and more (I did).
Unitame is part of it. But PG wants to include a plain ASCII file. Look at the first paragraph of "The Black Star" (etext 35833).
It started in UTF-8 as 35833-0.txt:
    They poured through the man-made cañons
then with unitame into Latin-1 as 35833-8.txt:
    They poured through the man-made cañons
but the ASCII version of 35833.txt has
    They poured through the man-made canons
I would write that in the ASCII file as
    They poured through the man-made canyons
There are similar special cases for "oo" in a word and others. Is there a tool to catch these special cases that the WWers could use if they are given a Latin-1 file? It's not a discussion of should a UTF-8 file alone be sufficient--that's one the WWers, Marcello, and Greg should agree upon. Right now, the ASCII version is in the mix, and unitame alone isn't enough to get it done. Anybody up for writing a latin1tame program?
--Roger

There are four Latin1-to-ASCII conversions that the WWers routinely check for and correct as necessary:
- oë (o, e-umlaut) - the posting software converts this to "ooe". This makes no sense in conversions resulting in cooerdinate, cooeperate, zooelogy, and their derivatives. The WWers will search for "ooe" and correct as necessary. (The conversion is correct for some words, but off-hand I can't think of an example.)
- n-tilde - converted to plain "n". The only word I'm aware of where this is incorrect is cañon/canyon. It's OK in words like senor, senorita, and pinon.
- ° (degree) - converted to "deg.". OK most of the time, but results in "deg.." if at sentence end. The WWers remove the extra period. The conversion can also mess up tables. Minor messes can be fixed by the WWers, but major ones usually result in a request for an ASCII file (unless one's already been provided).
- § (Section) - converted to "Sec.". OK most of the time, but "§§" results in "Sec.Sec.", which the WWers convert to "Secs.".
The example Roger mentions in 35833 is a mistake by whoever WWed it (no, not me). (Yes, WWers make mistakes, too. We're human, so hopefully that's not a revelation.) I've made the correction, and new files will be on-line in about 45 minutes.
Al
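The manual checks listed above lend themselves to a small scanner that flags candidate spots in the converted ASCII file and leaves the decision to a human. The following is only a sketch in Python: the patterns are taken from the four conversions in this message plus the canon/canyon case, and the names `CHECKS` and `flag_conversions` are hypothetical. It deliberately flags rather than rewrites, since "ooe" is sometimes correct and "canon" is a legitimate English word.

```python
import re

# Patterns taken from the conversions described in the message above.
CHECKS = [
    (re.compile(r"ooe", re.IGNORECASE), 'possible o + e-umlaut converted to "ooe"'),
    (re.compile(r"\bcanons?\b", re.IGNORECASE), 'possible n-tilde word that should be "canyon"'),
    (re.compile(r"deg\.\."), 'degree sign at sentence end produced "deg.."'),
    (re.compile(r"Sec\.Sec\."), 'doubled section sign produced "Sec.Sec."; usually "Secs."'),
]


def flag_conversions(text):
    """Return (line_number, matched_text, message) for each suspicious spot."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern, message in CHECKS:
            for match in pattern.finditer(line):
                findings.append((lineno, match.group(0), message))
    return findings


if __name__ == "__main__":
    sample = (
        "They poured through the man-made canons\n"
        "We must cooeperate, see Sec.Sec. 4 and 5.\n"
        "It reached 40 deg..\n"
    )
    for lineno, found, message in flag_conversions(sample):
        print(f"line {lineno}: {found!r}: {message}")
```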
-----Original Message----- From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Roger Frank Sent: Wednesday, January 25, 2012 1:05 PM To: Project Gutenberg Volunteer Discussion Subject: Re: [gutvol-d] UTF-8 TXT (was Producing epub ready HTML)
On Jan 25, 2012, at 9:52 AM, Carlo Traverso wrote:
I believe that the problem in handling UTF-8 submissions is unitame, that is the tool that the WWers use to recode UTF-8 to iso-Latin-1. It cannot handle simple things like bullets and greek, but it would be very easy to extend it to be able to handle these characters and more (I did).
Unitame is part of it. But PG wants to include a plain ASCII file. Look at the first paragraph of "The Black Star" (etext 35833).
It started in UTF-8 as 35833-0.txt:
    They poured through the man-made cañons
then with unitame into Latin-1 as 35833-8.txt:
    They poured through the man-made cañons
but the ASCII version of 35833.txt has
    They poured through the man-made canons
I would write that in the ASCII file as
    They poured through the man-made canyons
There are similar special cases for "oo" in a word and others. Is there a tool to catch these special cases that the WWers could use if they are given a Latin-1 file? It's not a discussion of should a UTF-8 file alone be sufficient--that's one the WWers, Marcello, and Greg should agree upon. Right now, the ASCII version is in the mix, and unitame alone isn't enough to get it done. Anybody up for writing a latin1tame program?
--Roger
_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d

On 01/25/2012 10:04 PM, Roger Frank wrote:
It started in UTF-8 as 35833-0.txt:
    They poured through the man-made cañons
then with unitame into Latin-1 as 35833-8.txt:
    They poured through the man-made cañons
but the ASCII version of 35833.txt has
    They poured through the man-made canons
I would write that in the ASCII file as
    They poured through the man-made canyons
How about Señor? Wouldn't that become Senyor? -- Marcello Perathoner webmaster@gutenberg.org

On Wed, Jan 25, 2012 at 1:47 PM, Marcello Perathoner <marcello@perathoner.de> wrote:
On 01/25/2012 10:04 PM, Roger Frank wrote:
It started in UTF-8 as 35833-0.txt:
    They poured through the man-made cañons
then with unitame into Latin-1 as 35833-8.txt:
    They poured through the man-made cañons
but the ASCII version of 35833.txt has
    They poured through the man-made canons
I would write that in the ASCII file as
    They poured through the man-made canyons
How about Señor? Wouldn't that become Senyor?
Only if the program was stupid enough to insist on converting all instances of a character the same way, no matter what the context. That's pretty darn stupid. -- Kie ekzistas vivo, ekzistas espero.
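One way to avoid converting every instance of a character the same way is to look up whole words in an exception list before falling back to a per-character mapping. The sketch below, in Python, illustrates that idea only; it is not an existing PG tool. The exception table, the function names, and the accent-stripping fallback are assumptions, and only the cañon/canyon pair comes from the thread. The fallback handles letters that decompose into a base letter plus an accent; characters such as ß, æ, ° or § would need their own mapping.

```python
import re
import unicodedata

# Hypothetical word-level exceptions, checked before any per-character mapping.
# Only the canon/canyon case comes from the thread; keys are lowercase.
WORD_EXCEPTIONS = {
    "cañon": "canyon",
    "cañons": "canyons",
}


def strip_accents(word):
    """Per-character fallback: drop combining marks (n-tilde -> n, e-acute -> e)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def match_case(replacement, original):
    """Carry simple capitalisation over from the original word."""
    if original.isupper():
        return replacement.upper()
    if original[:1].isupper():
        return replacement.capitalize()
    return replacement


def to_ascii_word(match):
    """Replace one word: exception list first, blind mapping second."""
    word = match.group(0)
    exception = WORD_EXCEPTIONS.get(word.lower())
    if exception is not None:
        return match_case(exception, word)
    return strip_accents(word)


def latin1_to_ascii(text):
    """Convert word by word so exceptions can override the blind mapping."""
    text = unicodedata.normalize("NFC", text)  # keep accented letters as single characters
    return re.sub(r"[^\W\d_]+", to_ascii_word, text)


if __name__ == "__main__":
    print(latin1_to_ascii("They poured through the man-made cañons"))
    print(latin1_to_ascii("Señor"))
```

With this arrangement "cañons" becomes "canyons" while "Señor" still becomes "Senor", because only words in the exception list bypass the character mapping.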

Anybody up for writing a latin1tame program?
latin1tame exists, and is called 1252to7. And it sends ñ to n. To send ñ to ny you could use a custom unitame.dat that generates ASCII, as a replacement for 1252to7. This could also be used to convert ä to a instead of ae, as requested for French as opposed to German. The problem with the custom unitame.dat approach is that the conversion is character by character, and is the same for the whole file. That is already progress compared with the current situation, where the mapping is site-wide. In any case, if you submit RST or TEI, PG has to live with what epubmaker gives out, i.e. crappy iso-latin given by unitame and crappy ASCII. If this is good enough for RST sources, it is good enough for HTML+UTF-8 sources. My proposal is to allow a bit less crap in the unitame output, and it is simple enough. It also allows using unmodified gutcheck to verify the UTF-8 txt file before gutcheck is ported to Unicode, which will take time. Moreover, PG requires ASCII and iso-latin-1 just to conceal them. Spending too much work on them is a waste of time.
Carlo
PS. I see that a version of unitame using a local version of unitame.dat already exists, dated 2006.

My original question was really whether it isn't time to abandon 8-bit-only files except where their encoding is a strict subset of utf-8/unicode. Just autoconvert iso-8859-1 etc. at the beginning and be done with it. In the year 2525, if man is still alive, they'll look at a 7-bit ascii file and ask "What's this?" They'll be much more likely to understand a representation that includes most of the languages on earth at the time.

On Wed, Jan 25, 2012 at 2:37 PM, don kretz <dakretz@gmail.com> wrote:
In the year 2525, if man is still alive, they'll look at a 7-bit ascii file and ask "What's this?" They'll be much more likely to understand a representation that includes most of the languages on earth at the time.
The whole point of UTF-8 is that a 7-bit ASCII file is UTF-8. -- Kie ekzistas vivo, ekzistas espero.
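The claim can be checked directly: UTF-8 encodes code points U+0000 through U+007F as the same single byte that ASCII uses, so any pure ASCII file decodes identically under both. A minimal Python illustration, using a sample sentence from earlier in the thread (the variable names are arbitrary):

```python
ascii_bytes = "They poured through the man-made canyons".encode("ascii")

# Every ASCII byte is below 0x80, and UTF-8 encodes U+0000..U+007F
# as that same single byte, so both decoders give the same text.
assert all(b < 0x80 for b in ascii_bytes)
assert ascii_bytes.decode("utf-8") == ascii_bytes.decode("ascii")

# The reverse does not hold once a non-ASCII character appears:
# n-tilde takes two bytes in UTF-8 and cannot be decoded as ASCII.
utf8_bytes = "cañons".encode("utf-8")
print(len("cañons"), len(utf8_bytes))
```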

Exactly. So there's no conversion effort except to map all the hacks we've added to deal with characters we have to mangle because they aren't available in 7-bit ascii. So why require the extra hacking any longer?

On 01/25/2012 05:23 PM, Carlo Traverso wrote:
Anybody up for writing a latin1tame program?
latin1tame exists, and is called 1252to7. And it sends ñ to n.
Not exactly. Windows codepage 1252 is NOT Latin-1 (iso-8859-1). Microsoft replaced the control characters from 128-159 with other characters. The correct Microsoft codepage for Latin-1 (iso-8859-1) is Windows-28591. In any event, there's an obvious need for such conversion to correctly handle certain characters in context rather than using a blind mapping. David

On Thu, January 26, 2012 7:46 am, D Garcia wrote:
On 01/25/2012 05:23 PM, Carlo Traverso wrote:
Anybody up for writing a latin1tame program?
latin1tame exists, and is called 1252to7. And it sends ñ to n.
Not exactly. Windows codepage 1252 is NOT Latin-1 (iso-8859-1). Microsoft replaced the control characters from 128-159 with other characters. The correct Microsoft codepage for Latin-1 (iso-8859-1) is Windows-28591.
But the MSWindows additions are reserved in Latin-1, so if you have a valid Latin-1 file it will /not/ contain any of the Windows specific characters. Thus, 1252to7 is functionally equivalent to latin1tame; they will both behave identically on a Latin-1 file.
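Both points in this exchange can be checked by comparing the two encodings byte by byte. The sketch below uses Python's built-in `latin-1` and `cp1252` codecs; the helper names are hypothetical. Note that Python's `latin-1` codec maps bytes 0x80 to 0x9F to the C1 control characters, and that a few bytes in that range are undefined in its `cp1252` codec, which the helper reports as `None`.

```python
def decode_or_none(byte_value, encoding):
    """Decode a single byte, returning None where the code page leaves it undefined."""
    try:
        return bytes([byte_value]).decode(encoding)
    except UnicodeDecodeError:
        return None


def differing_bytes():
    """Byte values that ISO-8859-1 and Windows-1252 decode differently."""
    return [
        b for b in range(256)
        if decode_or_none(b, "latin-1") != decode_or_none(b, "cp1252")
    ]


if __name__ == "__main__":
    diff = differing_bytes()
    print(f"{len(diff)} differing bytes, from 0x{min(diff):02X} to 0x{max(diff):02X}")
```

If the only differences fall in the range 128-159, then a file that never uses those bytes reads the same under either encoding, which is the sense in which a 1252-based converter behaves like a Latin-1 one on such a file.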

On 01/25/2012 11:11 AM, hmonroe.pglaf@huntermonroe.com wrote:
Someone is working on adapting gutcheck to handle UTF8 as a spinoff of the effort to improve Guiguts' handling of languages other than English (following up on a suggestion you made to me); see below. Could the WWers please liaise with him on whether this could be a sanctioned project, and also whether it should work under the gutcheck SourceForge project (if not, we can do it under the guiguts project)? Also, are better tools needed to convert UTF8 to Latin-1 and ASCII? The same individual has compiled aspell 0.6 for Windows, which offers Unicode normalization. All are welcome to contribute views in the fora below. Developers in C and Perl are needed.
Hunter
Hunter, if you go this route I would recommend that you fork the gutcheck code and call it something different per standard practice, which clearly distinguishes the branches, and reduces potential end-user confusion. David

On Thu, Jan 26, 2012 at 10:04:56AM -0500, D Garcia wrote:
On 01/25/2012 11:11 AM, hmonroe.pglaf@huntermonroe.com wrote:
Someone is working on adapting gutcheck to handle UTF8 as a spinoff of the effort to improve Guiguts' handling of languages other than English (following up on a suggestion you made to me); see below. Could the WWers please liaise with him on whether this could be a sanctioned project, and also whether it should work under the gutcheck SourceForge project (if not, we can do it under the guiguts project)? Also, are better tools needed to convert UTF8 to Latin-1 and ASCII? The same individual has compiled aspell 0.6 for Windows, which offers Unicode normalization. All are welcome to contribute views in the fora below. Developers in C and Perl are needed.
Hunter
Hunter, if you go this route I would recommend that you fork the gutcheck code and call it something different per standard practice, which clearly distinguishes the branches, and reduces potential end-user confusion.
David
Jim Tinsley isn't really active as a WWer any more (though he's still on that mailing list; he is not on gutvol-d).
Jim Tinsley <jtinsley@pobox.com>
He will probably be thrilled to have someone do development. I don't know whether a fork is necessary, it might just be that we need some newer versions. I'll be happy to install updated versions on pglaf.org.
We have gutcheck online at http://upload.pglaf.org, and this could offer multiple versions (it already offers multiple options).
-- Greg

If Gutcheck's being upgraded, Jeebies and Gutspell should be, too. All three should remain (or be available) as stand-alone, command-line, no-GUI, .exe's, for local use. (It's easy to generate a wrapper batch file pointing to the text (or texts) being checked, send the report to a text file, and display the report file in a text editor.)
Al
-----Original Message----- From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Greg Newby Sent: Thursday, January 26, 2012 11:21 AM To: Project Gutenberg Volunteer Discussion Subject: Re: [gutvol-d] gutcheck fork? (Was: Re: Producing epub ready HTML)
On Thu, Jan 26, 2012 at 10:04:56AM -0500, D Garcia wrote:
On 01/25/2012 11:11 AM, hmonroe.pglaf@huntermonroe.com wrote:
Someone is working on adapting gutcheck to handle UTF8 as a spinoff of the effort to improve Guiguts' handling of languages other than English (following up on a suggestion you made to me); see below. Could the WWers please liaise with him on whether this could be a sanctioned project, and also whether it should work under the gutcheck SourceForge project (if not, we can do it under the guiguts project)? Also, are better tools needed to convert UTF8 to Latin-1 and ASCII? The same individual has compiled aspell 0.6 for Windows, which offers Unicode normalization. All are welcome to contribute views in the fora below. Developers in C and Perl are needed.
Hunter
Hunter, if you go this route I would recommend that you fork the gutcheck code and call it something different per standard practice, which clearly distinguishes the branches, and reduces potential end-user confusion.
David
Jim Tinsley isn't really active as a WWer any more (though he's still on that mailing list; he is not on gutvol-d).
Jim Tinsley <jtinsley@pobox.com>
He will probably be thrilled to have someone do development. I don't know whether a fork is necessary, it might just be that we need some newer versions. I'll be happy to install updated versions on pglaf.org.
We have gutcheck online at http://upload.pglaf.org, and this could offer multiple versions (it already offers multiple options).
-- Greg
participants (10)
- Al Haines
- D Garcia
- David Starner
- don kretz
- Greg Newby
- hmonroe.pglaf@huntermonroe.com
- Lee Passey
- Marcello Perathoner
- Roger Frank
- traverso@posso.dm.unipi.it