
-----Original Message-----
From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Roger Frank
Sent: Wednesday, January 25, 2012 6:30 AM
To: Project Gutenberg Volunteer Discussion
Subject: Re: [gutvol-d] Producing epub ready HTML
On Jan 25, 2012, at 3:28 AM, Marcello Perathoner wrote:
The PG website already offers UTF-8 text only.
For a while I was sending up UTF-8 text only along with HTML. I stopped when it seemed that was causing a lot of work for the WWers. As of now, I send ASCII if that's sufficient, Latin-1 if it has characters in the Latin-1 set, and UTF-8 if it has characters not in Latin-1. There are two consequences: (1) everything that goes up in Latin-1 could go up in UTF-8 instead but doesn't and (2) I don't send up plain text with curly quotes at all.
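For illustration only, the selection rule Roger describes (ASCII if sufficient, otherwise Latin-1, otherwise UTF-8) could be sketched in Python as below. The function name is hypothetical and is not part of any PG or DP tool; the sketch assumes the text has already been decoded into a Python string.

```python
def narrowest_encoding(text: str) -> str:
    """Return the smallest of ASCII, Latin-1, UTF-8 that can hold `text`."""
    for name in ("ascii", "latin-1"):
        try:
            text.encode(name)   # raises if any character falls outside the set
            return name
        except UnicodeEncodeError:
            continue
    return "utf-8"              # anything not covered above needs UTF-8


if __name__ == "__main__":
    for sample in ("plain text", "caf\u00e9", "\u201ccurly\u201d", "\u0153uvre"):
        print(repr(sample), "->", narrowest_encoding(sample))
```

Note that curly quotes and the oe-ligature both fall outside Latin-1, so a text containing them would come out as UTF-8 here unless they are first replaced with keyboard quotes and "oe" or "[oe]".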
I work in UTF-8. It would be easiest for me to send up UTF-8. But
Custom UTF8 files aren't difficult to handle. I'll assume English-language submissions for the following.
If a UTF8 file arrives *with* an accompanying Latin1 or ASCII file (DP's normal practice), I'll check the latter file(s) with Gutcheck/Jeebies/Gutspell, and if any problems are found, I (and the other WWers) will correct the submission's text/HTML files accordingly. (I use Windows Notepad for Latin1/ASCII text/HTML files, and SCUnipad for UTF8 text/HTML files.)
If a UTF8 file arrives *without* an accompanying Latin1/ASCII equivalent, I'll first use Unitame in its non-convert mode to see what Unicode characters are in the file. If there's nothing it can't convert, I'll use Unitame in its convert mode to generate a Latin1 file from the UTF8 file, and do the normal checks on the Latin1 file, making any corrections as above. Both text files will be posted. PG's posting software will generate an ASCII file from the Latin1 file, except for most foreign-language files, where it will ask whether an ASCII file should be generated.
As Roger mentions, curly quotes and oe-ligatures are insufficient reason for creating a UTF8 file. This is mentioned somewhere in DP's forums--I forget where. Use normal keyboard quotes, and "oe" or "[oe]" in text files; curly quotes and oe-lig (or their entities) in HTML files.
UTF8 files are pretty much mandatory only when a text is in languages such as Greek, Hebrew, Chinese, Japanese, Cyrillic, etc., and a Latin1/ASCII conversion isn't practical. For English texts containing smatterings of Greek, the submitter has two options: transliterate the Greek characters as per PG's Greek How-To, submitting only a Latin1 file, or submit a UTF8 text with the actual Greek characters along with a transliterated Latin1 file.
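A submitter who wants to know in advance which characters would force a UTF8 file could run a rough check like the one below. This is not Unitame and does not model its output or its conversion tables; it is only a minimal Python sketch, with a hypothetical function name, that lists every character above U+00FF (that is, outside Latin1) together with how often it occurs.

```python
import unicodedata
from collections import Counter


def non_latin1_chars(text: str):
    """List (code point, Unicode name, count) for characters outside Latin1."""
    counts = Counter(ch for ch in text if ord(ch) > 0xFF)
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"), n)
        for ch, n in sorted(counts.items())
    ]


if __name__ == "__main__":
    # The file name is only an example, following the "-utf8" naming advice.
    with open("somefile-utf8.txt", encoding="utf-8") as fh:
        for code, name, n in non_latin1_chars(fh.read()):
            print(code, name, n)
```

An empty result means the text fits in Latin1 as it stands. A list containing only quotes and ligatures suggests that keyboard quotes and "oe" would make a UTF8 text file unnecessary.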
As for UTF8 files in Oriental/Cyrillic/etc. languages, Roger is correct in saying that Gutcheck and the other normal tools don't work very well on these files, but in the cases of the aforesaid languages, it probably doesn't matter, since none of the WWers speak/read them anyway, and so have no hope of telling if, for example, something is misspelled.
I should mention that one thing that does slow things down for the WWers is text/HTML files with CR-only line endings. CR/LF is far preferable, since all the WWers use Windows. (I'm aware that there are fix utilities, but CR/LF's make things easier to start with.)
It's also helpful to the WWers if text files are identified as to their character set. UTF8 files should have "-utf8" embedded in their filename, e.g. "somefile-utf8.txt". Latin1 and ASCII files are handled much the same: "-ltn1", "-lt1", "-latin1", "-iso" or similar for Latin1 files, and "-asc" or "-ascii" for ASCII text files.
Al
the
WW tools to check the submission, like gutcheck, struggle with UTF-8. The HTML I send up is UTF-8 and that survives because the WWers don't have to check it. They check the text file, which should be ASCII if it can be, and only UTF-8 instead of Latin-1 if there are characters that absolutely are necessary and not in Latin-1. Curly quotes are not viewed as necessary to the text version. Even an oe ligature isn't strong enough to justify UTF-8.
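The two housekeeping requests in Al's reply above (CR/LF line endings and a character-set tag in the filename) could be handled before upload with something like the following. It is a sketch under stated assumptions: the function names and the choice of one suffix per character set are illustrative, picked from the variants Al lists, and nothing here is an official PG or DP utility.

```python
import re

# One suffix per character set, chosen from the variants mentioned in the reply.
SUFFIX = {"ascii": "-asc", "latin-1": "-ltn1", "utf-8": "-utf8"}


def to_crlf(data: bytes) -> bytes:
    """Rewrite CR-only, LF-only, or mixed line endings as CR/LF."""
    # "\r\n" is tried first, so an existing CR/LF pair is not doubled.
    return re.sub(rb"\r\n|\r|\n", b"\r\n", data)


def tagged_name(stem: str, encoding: str) -> str:
    """Build a text filename that states its character set."""
    return f"{stem}{SUFFIX[encoding]}.txt"


if __name__ == "__main__":
    print(to_crlf(b"line one\rline two\r"))
    print(tagged_name("somefile", "utf-8"))
```

Working on bytes rather than decoded text keeps the line-ending fix independent of the file's character set, which holds for ASCII, Latin1, and UTF8 alike.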
Seems to me it's all about the tools. The WWers always seem overloaded and sending UTF-8 up makes that worse. If they had better tools to do their job, actually getting UTF-8 up on the PG website wouldn't be as problematic.
This is an outside view of the problem. I am not a WWer. I'd love to hear from a WWer about Marcello's comment. Though correct as stated, perhaps it would be more accurate to say "The PG website already offers UTF-8 text only, but please don't send it to us unless absolutely necessary." I hope someday that statement will not be true, but I believe it is now.
--Roger
_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d