
Thanks Al -- this would be great information to keep sharing in a permanent, appropriate location, such as the general information section at upload.pglaf.org <https://upload.pglaf.org/>

________________________________
From: gutvol-d <gutvol-d-bounces@lists.pglaf.org> on behalf of Al Haines <ajhaines@shaw.ca>
Sent: Sunday, January 5, 2020 11:51 AM
To: gbnewby@pglaf.org <gbnewby@pglaf.org>; 'Project Gutenberg Volunteer Discussion' <gutvol-d@lists.pglaf.org>; Joseph E. Loewenstein, M.D. <loewenstein@sssnet.com>; 'Al Haines' <ajhaines@shaw.ca>; 'Chuck Greif' <cbgrf@yahoo.com>; 'David Widger' <cdwidger@gmail.com>
Subject: Re: [gutvol-d] UTF-8 File Names

UTF8 text files should have "-utf8" in their file names, e.g. "myfile-utf8.txt". This forces PG's posting software to treat text files as UTF8, rather than its defaulting to Latin1/ASCII. (This has been a de facto standard for some years now.) It's not necessary to do this for HTML files.

If necessary, the posting software will prompt for the correct character set, but it's an easy prompt to skip through, or give the wrong answer to. The presence of "-utf8" in the filename obviates the need for the prompt. Latin1/ASCII text files don't trigger the prompt at all.

If PG's upload check reports a UTF8 text file without the "-utf8", then I rename the file inside the zip file, before unzipping it. Conversely, if a text file arrives with "-lat1", "-ltn1", "-iso", "-asc", or some such Latin1/ASCII indicator, I rename the file to remove it, since addhd handles Latin1/ASCII files correctly.

Also, to avoid spurious files generated by the posting software, I make sure the base name of the zip file is the same as the base names of the text and HTML files, e.g. myfile.zip contains myfile.txt (or myfile-utf8.txt) and myfile.htm.

Re HTML files: PG's extension for them is ".htm". When an uploaded zip file contains an HTML file with the extension ".html", I rename it, as mentioned above.
If this isn't done, the posting software copies the file to a new file with the extension ".htm", e.g. "myfile.html" is copied to "myfile.htm", leaving a spurious ".html" file. (BTW, the posting software never, ever, modifies the submitted files--it copies them to new files with the required name, then works with the new files.)

While I'm at it, more on file names...

As mentioned above, I rename text and HTML files, and sometimes zip files, so that their base names are the same (except for the "-utf8" part of text files). There's no need for text/HTML files to have some versioning component. For example, I recently handled a zip file named diam-pg.zip, which contained diam-b.html and diam-cc-utf8.txt. I renamed the files to remove the "-b" and "-cc" parts, and the zip file to remove the "-pg" part, of their respective names.

I've also seen names like "myfile-text.txt" and "myfile-html.html", and "myfile-8.txt" and "myfile-h.html". There's no need for such double-indicators of a file's type. There have been any number of similar variations, some probably related to the PPer's versioning system as they work through the PPing process; some related to the PPing software.

In one of the DP posts are these fragments:

"All of PG uploads are currently UTF-8, even if the file is Latin-1."

No idea where this idea comes from, but it's wrong. See below.

"Or even do away with the appending altogether. The direct upload panel lets the White Washers know what encoding to expect. So they know to run the program to convert the Latin-1 characters to UTF-8."

Also wrong. The WWers don't see PG's upload screen (except for their own projects). They also never convert Latin1 to UTF8--that's done to posted Latin1/ASCII files by PG's behind-the-scenes software, which the WWers have no control over.

I think that's sufficient food for thought/discussion for now...

Al
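
For anyone who wants to check a submission against the naming conventions described above before uploading, here is a minimal sketch in Python. It is not PG's posting software or upload check, and the function names (looks_like_utf8, normalized_name) are hypothetical. It assumes you have already chosen the base name (e.g. "diam" for diam-pg.zip), and it uses a simple heuristic for UTF8: the bytes are not pure ASCII but do decode cleanly as UTF-8.

```python
from pathlib import PurePosixPath
from typing import Optional


def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if data is not pure ASCII but decodes as UTF-8."""
    try:
        data.decode("ascii")
        return False  # pure ASCII needs no "-utf8" marker
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False  # most likely Latin1 or some other 8-bit encoding


def normalized_name(name: str, base: str, data: Optional[bytes] = None) -> str:
    """Suggest a file name that follows the conventions in the post.

    name: the file name as it appears in the zip file
    base: the base name chosen for the whole submission
    data: optional file contents, used only to detect UTF8 text
    """
    path = PurePosixPath(name)
    suffix = path.suffix.lower()
    if suffix in (".htm", ".html"):
        # HTML files use ".htm" and never need "-utf8".
        return f"{base}.htm"
    if suffix == ".txt":
        # Keep (or add) "-utf8" for UTF8 text; any other marker such as
        # "-lat1", "-asc", "-b" or "-cc" is dropped with the old base name.
        is_utf8 = "-utf8" in path.stem.lower() or (
            data is not None and looks_like_utf8(data)
        )
        return f"{base}-utf8.txt" if is_utf8 else f"{base}.txt"
    return name  # images and other files are left alone


if __name__ == "__main__":
    for old in ("diam-b.html", "diam-cc-utf8.txt"):
        print(old, "->", normalized_name(old, "diam"))
```

The sketch only suggests names; the actual renaming inside the zip file is still up to you. Choosing the base name is deliberately left as a manual step, since deciding which parts of a name are versioning leftovers is a judgment call.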