
The subject of file names with xxxx-lt1.txt or xxxx-lat1.txt for Latin-1 and xxxx-utf8.txt for UTF-8 text files came up in a DP forum <https://www.pgdp.net/phpBB3/viewtopic.php?p=1189531#p1189531>.

The gist of the discussion (as I see it) is that we select the file type on the direct upload screen. Do we even need an indicator of the file type in the file name? If so, then do we need both? UTF8 is the default, so xxxx.txt without the (-UTF8) should be sufficient for UTF-8 files. The -lt1 or -lat1 could remain the backup signal that the file needs to go through the upconvert program to UTF-8.

Cheers,
Rick

Hi, Rick. The whitewashers get whatever you upload. No transformation or conversion is done automatically: it's just a .zip file that they download and process.

Informative filenames are fine. Any text should end in .txt
- Greg

On Sat, Jan 04, 2020 at 11:00:28PM -0800, Rick Tonsing wrote:
The subject of file names with xxxx-lt1.txt or xxxx-lat1.txt for Latin-1 and the xxxx-utf8.txt for UTF-8 text files came up in a DP forum <https://www.pgdp.net/phpBB3/viewtopic.php?p=1189531#p1189531>.
The gist of the discussion (as I see it) is that we select the file type on the direct upload screen. Do we even need an indicator of the file type in the file name? If so, then do we need both? UTF8 is the default, so xxxx.txt without the (-UTF8) should be sufficient for UTF-8 files. The -lt1 or -lat1 could remain the backup signal that the file needs to go through the upconvert program to UTF-8.
Cheers,
Rick
_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org https://lists.pglaf.org/mailman/listinfo/gutvol-d Unsubscribe: https://lists.pglaf.org/mailman/options/gutvol-d

UTF8 text files should have "-utf8" in their file names, e.g. "myfile-utf8.txt". This forces PG's posting software to treat text files as UTF8, rather than defaulting to Latin1/ASCII. (This has been a de facto standard for some years now.) It's not necessary to do this for HTML files.

If necessary, the posting software will prompt for the correct character set, but it's an easy prompt to skip through, or to give the wrong answer to. The presence of "-utf8" in the filename obviates the need for the prompt. Latin1/ASCII text files don't trigger the prompt at all.

If PG's upload check reports a UTF8 text file without the "-utf8", then I rename the file inside the zip file, before unzipping it. Conversely, if a text file arrives with "-lat1", "-ltn1", "-iso", "-asc", or some such Latin1/ASCII indicator, I rename the file to remove it, since addhd handles Latin1/ASCII files correctly.

Also, to avoid spurious files generated by the posting software, I make sure the base name of the zip file is the same as the base names of the text and HTML files, e.g. myfile.zip contains myfile.txt (or myfile-utf8.txt) and myfile.htm.

Re HTML files: PG's extension for them is ".htm". When an uploaded zip file contains an HTML file with the extension ".html", I rename it, as mentioned above. If this isn't done, the posting software copies the file to a new file with the extension ".htm", e.g. "myfile.html" is copied to "myfile.htm", leaving a spurious ".html" file. (BTW, the posting software never, ever, modifies the submitted files--it copies them to new files with the required name, then works with the new files.)

While I'm at it, more on file names...

As mentioned above, I rename text and HTML files, and sometimes zip files, so that their base names are the same (except for the "-utf8" part of text files). There's no need for text/HTML files to have some versioning component.

For example, I recently handled a zip file named diam-pg.zip, which contained diam-b.html and diam-cc-utf8.txt. I renamed the files to remove the "-b" and "-cc" parts, and the zip file to remove the "-pg" part, of their respective names.

I've also seen names like "myfile-text.txt" and "myfile-html.html", and "myfile-8.txt" and "myfile-h.html". There's no need for such double-indicators of a file's type.

There have been any number of similar variations, some probably related to the PPer's versioning system as they work through the PPing process; some related to the PPing software.

In one of the DP posts are these fragments:

"All of PG uploads are currently UTF-8, even if the file is Latin-1."

No idea where this idea comes from, but it's wrong. See below.

"Or even do away with the appending altogether. The direct upload panel lets the White Washers know what encoding to expect. So they know to run the program to convert the Latin-1 characters to UTF-8."

Also wrong. The WWers don't see PG's upload screen (except for their own projects). They also never convert Latin1 to UTF8--that's done to posted Latin1/ASCII files by PG's behind-the-scenes software, which the WWers have no control over.

I think that's sufficient food for thought/discussion for now...

Al
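The renaming conventions Al describes can be sketched in a few lines of Python. This is only an illustration of the rules as stated in the message, not PG's posting software or addhd: the function names `needs_utf8_suffix` and `normalized_name`, the `base` parameter, and the byte-level encoding check are all assumptions made for the example. The base name is passed in explicitly because versioning parts such as "-b", "-cc", or "-pg" cannot be detected reliably from the name alone.

```python
from pathlib import PurePosixPath


def needs_utf8_suffix(data: bytes) -> bool:
    """Guess whether a text file should carry "-utf8" in its name.

    Assumed heuristic (not PG's actual check): the file has at least one
    non-ASCII byte and the whole file decodes cleanly as UTF-8.
    """
    if data.isascii():
        return False  # plain ASCII: no suffix needed
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False  # probably Latin-1 or another 8-bit encoding
    return True


def normalized_name(filename: str, base: str, is_utf8: bool = False) -> str:
    """Return the name a submitted file would get under the conventions above.

    filename: name as found in the uploaded zip
    base:     the shared base name (hypothetical parameter), e.g. "diam"
    is_utf8:  whether a text file is UTF-8, e.g. from needs_utf8_suffix()
    """
    ext = PurePosixPath(filename).suffix.lower()
    if ext in (".htm", ".html"):
        return f"{base}.htm"  # PG's HTML extension is ".htm"
    if ext == ".txt":
        # UTF-8 text keeps "-utf8"; Latin-1/ASCII markers are dropped.
        return f"{base}-utf8.txt" if is_utf8 else f"{base}.txt"
    if ext == ".zip":
        return f"{base}.zip"
    return filename  # images and other files are left alone


# The diam-pg.zip example from the message:
for name, utf8 in [("diam-pg.zip", False),
                   ("diam-b.html", False),
                   ("diam-cc-utf8.txt", True)]:
    print(name, "->", normalized_name(name, "diam", utf8))
```

Run on the diam-pg.zip example, the sketch yields diam.zip, diam.htm, and diam-utf8.txt, matching the renames described in the message.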
-----Original Message-----
From: gutvol-d [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Greg Newby
Sent: Sunday, January 05, 2020 6:55 AM
To: Project Gutenberg Volunteer Discussion
Subject: Re: [gutvol-d] UTF-8 File Names

Thanks Al -- this would be great information to keep sharing in a permanent, appropriate location, such as the general information section at upload.pglaf.org <https://upload.pglaf.org/>.
________________________________
From: gutvol-d <gutvol-d-bounces@lists.pglaf.org> on behalf of Al Haines <ajhaines@shaw.ca>
Sent: Sunday, January 5, 2020 11:51 AM
To: gbnewby@pglaf.org <gbnewby@pglaf.org>; 'Project Gutenberg Volunteer Discussion' <gutvol-d@lists.pglaf.org>; Joseph E. Loewenstein, M.D. <loewenstein@sssnet.com>; 'Al Haines' <ajhaines@shaw.ca>; 'Chuck Greif' <cbgrf@yahoo.com>; 'David Widger' <cdwidger@gmail.com>
Subject: Re: [gutvol-d] UTF-8 File Names
participants (4)
- Al Haines
- Greg Newby
- James Adcock
- Rick Tonsing