UTF8 text files should have "-utf8" in their file names, e.g.
"myfile-utf8.txt". This forces PG's posting software to treat text
files as UTF8 rather than defaulting to Latin1/ASCII. (This has
been a de facto standard for some years now.) It's not necessary to do
this for HTML files.
If necessary, the posting software will prompt for the correct character
set, but it's an easy prompt to skip through, or give the wrong answer
to. The presence of "-utf8" in the filename obviates the need for the
prompt. Latin1/ASCII text files don't trigger the prompt at all.
If PG's upload check reports a UTF8 text file without the "-utf8", then
I rename the file inside the zip file, before unzipping it.
Conversely, if a text file arrives with "-lat1", "-ltn1", "-iso",
"-asc", or some such Latin1/ASCII indicator, I rename the file to remove
it, since addhd handles Latin1/ASCII files correctly.
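
The two renaming rules above can be sketched roughly as follows, in Python. This is only an illustration of the manual procedure, not part of addhd or PG's posting software. The function names, the suffix list, and the "try to decode it" encoding check are all assumptions made for the example.

```python
import os
import re

# Illustrative list of Latin1/ASCII indicators, taken from the examples
# in the text; real uploads may use other variants.
LATIN1_SUFFIX = re.compile(r"-(lat1|ltn1|iso|asc)$", re.IGNORECASE)


def is_non_ascii_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF8 and is not plain ASCII."""
    try:
        data.decode("ascii")
        return False  # plain ASCII needs no "-utf8" marker
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False  # assumed to be Latin1 or some other 8-bit encoding


def normalized_text_name(name: str, data: bytes) -> str:
    """Strip a Latin1/ASCII indicator; add "-utf8" if the content needs it."""
    stem, ext = os.path.splitext(name)
    stem = LATIN1_SUFFIX.sub("", stem)
    if is_non_ascii_utf8(data) and not stem.lower().endswith("-utf8"):
        stem += "-utf8"
    return stem + ext


print(normalized_text_name("myfile.txt", "café".encode("utf-8")))
print(normalized_text_name("myfile-lat1.txt", "café".encode("latin-1")))
```

The decode test is a heuristic: a file that decodes cleanly as UTF8 is treated as UTF8, which is usually but not always right, so the result is still worth a glance.
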
Also, to avoid spurious files generated by the posting software, I make
sure the base name of the zip file is the same as the base names of the
text and HTML files, e.g. myfile.zip contains myfile.txt (or
myfile-utf8.txt) and myfile.htm.
Re HTML files: PG's extension for them is ".htm". When an uploaded zip
file contains an HTML file with the extension ".html", I rename it, as
mentioned above. If this isn't done, the posting software copies the
file to a new file with the extension ".htm", e.g. "myfile.html" is
copied to "myfile.htm", leaving a spurious ".html" file. (BTW, the
posting software never, ever, modifies the submitted files--it copies
them to new files with the required name, then works with the new
files.)
While I'm at it, more on file names...
As mentioned above, I rename text and HTML files, and sometimes zip
files, so that their base names are the same (except for the "-utf8"
part of text files). There's no need for text/HTML files to have some
versioning component. For example, I recently handled a zip file named
diam-pg.zip, which contained diam-b.html and diam-cc-utf8.txt. I
removed the "-b" and "-cc" parts from the file names and the "-pg"
part from the zip file's name.
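
As a rough sketch of the base-name rule, the Python below works out the renames for a zip file's contents once a base name has been chosen. The base name is picked by hand here and passed in as a parameter. The helper names target_name and plan_renames are hypothetical, not anything in PG's software.

```python
import os


def target_name(base: str, member: str) -> str:
    """Return the name a zip member should have for the given base name."""
    stem, ext = os.path.splitext(member)
    ext = ext.lower()
    if ext in (".htm", ".html"):
        return base + ".htm"  # PG's extension for HTML files
    if ext == ".txt":
        # Keep the "-utf8" marker, drop any other trailing component.
        marker = "-utf8" if stem.lower().endswith("-utf8") else ""
        return base + marker + ".txt"
    return member  # leave images and other files alone


def plan_renames(base: str, members: list[str]) -> dict[str, str]:
    """Map each member that needs renaming to its new name."""
    plan = {}
    for member in members:
        new_name = target_name(base, member)
        if new_name != member:
            plan[member] = new_name
    return plan


print(plan_renames("diam", ["diam-b.html", "diam-cc-utf8.txt"]))
```

For the example above, this maps diam-b.html to diam.htm and diam-cc-utf8.txt to diam-utf8.txt. The zip file itself would be renamed to diam.zip separately.
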
I've also seen names like "myfile-text.txt" and "myfile-html.html", and
"myfile-8.txt" and "myfile-h.html". There's no need for such
double-indicators of a file's type.
There have been any number of similar variations, some probably related
to the PPer's versioning system as they work through the PPing process;
some related to the PPing software.
In one of the DP posts are these fragments:
> All of PG uploads are currently UTF-8, even if the file is Latin-1.
No idea where this idea comes from, but it's wrong. See below.
> Or even do away with the appending altogether. The direct upload panel
> lets the White Washers know what encoding to expect. So they know to
> run the program to convert the Latin-1 characters to UTF-8.
Also wrong. The WWers don't see PG's upload screen (except for their
own projects). They also never convert Latin1 to UTF8--that's done to
posted Latin1/ASCII files by PG's behind-the-scenes software, which the
WWers have no control over.
I think that's sufficient food for thought/discussion for now...
Al