
On Sat, 8 May 2010, Marcello Perathoner wrote:
Andrew Sly wrote:
If you want some history, basically you can blame Microsoft. They developed their own character sets for use with Windows, which were _close_ to already-established standards, but not quite identical.
No, you cannot blame Microsoft.
This is one of the few cases where they did it right: they registered their character sets with IANA, and this makes them as standard as any other character set, ISO or Unicode or whatever.
Yes. I am aware that it is a registered charset. I have read before that the general recommendation from Microsoft was to simply label your text as Latin-1 because it was close enough that there were no important differences. However, I don't have a source for that, so it is possible that it is merely unfounded Microsoft-bashing.
The blame lies with the whitewasher who mislabeled the file as ISO-8859-1 when it really is WINDOWS-1252.
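The practical difference between the two labels is narrow but real: bytes 0x80–0x9F are unused C1 control codes in ISO-8859-1, while Windows-1252 assigns them printable characters such as curly quotes and the em dash. A small, hypothetical Python sketch of how the same bytes come out under each label:

```python
# Sample bytes (hypothetical): 0x93/0x94 are curly quotes and 0x97 is
# an em dash in Windows-1252, but invisible C1 controls in ISO-8859-1.
data = b"\x93quoted\x94 \x97 caf\xe9"

# Under the Windows-1252 label the punctuation decodes as intended.
as_cp1252 = data.decode("windows-1252")

# Under the ISO-8859-1 label the same bytes become control characters,
# which is why a mislabeled file displays stray or missing punctuation.
as_latin1 = data.decode("iso-8859-1")

print(repr(as_cp1252))
print(repr(as_latin1))
```

Note that 0xE9 (é) decodes identically either way; only the 0x80–0x9F range distinguishes the two.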
Hmm... I could understand if this was some earlier PG text (from the time when the only encoding distinction made was 7-bit ASCII or "8-bit"), but it was more recent, and should have been caught at posting time.
Whatever. I fixed this by overriding the PG header in the database. Somebody should check all books by http://www.ebooksgratuits.com or all books with RTF files and see if they are correctly labelled.
Would it be possible to run some kind of automated check on all files labelled ISO-8859-1, searching for characters in the 0x80 to 0x9F range? --Andrew
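Such a check is straightforward, since a correct ISO-8859-1 text has no reason to contain bytes in that range. A minimal sketch of the scan Andrew proposes (the file path used here is hypothetical):

```python
# Scan a file labelled ISO-8859-1 for bytes in 0x80-0x9F. Any hit is
# a strong sign the file is really Windows-1252 (or corrupt), because
# that range holds only unused C1 control codes in Latin-1.

def suspicious_bytes(data: bytes):
    """Return (offset, byte) pairs for bytes in the 0x80-0x9F range."""
    return [(i, b) for i, b in enumerate(data) if 0x80 <= b <= 0x9F]

def check_file(path):
    with open(path, "rb") as f:
        hits = suspicious_bytes(f.read())
    if hits:
        offset, byte = hits[0]
        print(f"{path}: {len(hits)} suspicious byte(s), "
              f"first at offset {offset} (0x{byte:02X})")
    return hits

# Hypothetical usage:
# check_file("12345-8.txt")
```

Run over every file whose header claims ISO-8859-1, this would flag candidates for relabeling like the case above; files with zero hits are at least consistent with their label.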