Missing apostrophes in the generated HTML / ePub versions of Madame Bovary

Hi,

the ebook http://www.gutenberg.org/etext/14155 is missing all the apostrophes in the generated versions (at least HTML and ePub). The two hand-crafted files (plain text and rtf) contain the apostrophes, for instance one of the very first lines in the plain text file is:

Permettez-moi d’inscrire

This has been converted in the HTML version to:

<p id="id00015">Permettez-moi dinscrire

Is this due to a bug in the epub-maker used to convert the file, or is there something buggy in the original text?

Cheers,
-- Joaquin Cuenca Abela

Hmm... at a quick look, it appears to be a problem in the original file. At the top of the file, we have this:

Character set encoding: ISO-8859-1

All transformations are done according to character-encoding standards. However, this file uses a representation of the apostrophe character which is not included in the ISO-8859-1 standard.

If you want some history, basically you can blame Microsoft. They developed their own character sets for use with Windows, which were _close_ to already-established standards, but not quite identical. Most often you see the effects of this cropping up when people use "curly quote" characters.

For the Latin-1 texts in PG, we use just a plain ' character for an apostrophe. Automatic checking that is done when texts are submitted will flag this before a text is posted. However, there are some older texts (such as this one) where this problem can still crop up.

--Andrew
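To make the mismatch concrete, here is a minimal Python sketch. It assumes the apostrophe is stored as the single byte 0x92, which is where Windows-1252 puts the right single quotation mark; the thread does not state the actual byte value. The `strip_c1_controls` helper is hypothetical and only imitates what a converter might do with control codes; it is not the real PG conversion code.

```python
# Assumed raw bytes of the line in the source file (0x92 = curly apostrophe
# in Windows-1252; this byte value is an assumption, not taken from the file).
raw = b"Permettez-moi d\x92inscrire"

# Decoded with the encoding the file really uses: 0x92 -> U+2019.
as_cp1252 = raw.decode("cp1252")

# Decoded with the encoding named in the header: 0x92 -> U+0092,
# an invisible C1 control code rather than a printable character.
as_latin1 = raw.decode("latin-1")


def strip_c1_controls(text: str) -> str:
    """Drop code points U+0080..U+009F (hypothetical clean-up step)."""
    return "".join(ch for ch in text if not 0x80 <= ord(ch) <= 0x9F)


print(as_cp1252)                      # apostrophe preserved
print(strip_c1_controls(as_latin1))   # apostrophe gone: "dinscrire"
```

Under these assumptions the Latin-1 reading loses the apostrophe as soon as anything discards control characters, which matches the symptom reported for the HTML file.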

Andrew Sly wrote:
If you want some history, basically you can blame Microsoft. They developed their own character sets for use with Windows, which were _close_ to already-established standards, but not quite identical.
No, you cannot blame Microsoft. This is one of the few cases where they did right: They registered their character sets with IANA, and this makes them as standard as any other character set, ISO or UNICODE or whatever. The blame lies with the whitewasher who mislabeled the file as ISO-8859-1 when it really is WINDOWS-1252. Whatever. I fixed this by overriding the PG header in the database. Somebody should check all books by http://www.ebooksgratuits.com or all books with RTF files and see if they are correctly labelled.
-- Marcello Perathoner webmaster@gutenberg.org

On Sat, 8 May 2010, Marcello Perathoner wrote:
Andrew Sly wrote:
If you want some history, basically you can blame Microsoft. They developed their own character sets for use with Windows, which were _close_ to already-established standards, but not quite identical.
No, you cannot blame Microsoft.
This is one of the few cases where they did right: They registered their character sets with IANA, and this makes them as standard as any other character set, ISO or UNICODE or whatever.
Yes. I am aware that it is a registered charset. I have read before that the general recommendation from Microsoft was to simply label your text as Latin-1 because it was close enough that there were no important differences. However, I don't have a source for that, so it is possible that it is merely unfounded Microsoft-bashing.
The blame lies with the whitewasher who mislabeled the file as ISO-8859-1 when it really is WINDOWS-1252.
Hmm... I could understand if this was some earlier PG text (from the time when the only encoding distinction made was 7-bit ASCII or "8-bit"). But it was more recent, and should have been caught at posting time.
Whatever. I fixed this by overriding the PG header in the database. Somebody should check all books by http://www.ebooksgratuits.com or all books with RTF files and see if they are correctly labelled.
Would it be possible to run some kind of automated check on all files labelled ISO-8859-1, searching for characters in the 0x80 to 0x9F range? --Andrew
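A check along these lines could be sketched in Python as below. The function names are made up for illustration, and how to obtain the list of files labelled ISO-8859-1 is left open, since the thread does not describe the PG catalog. The only fact relied on is that bytes 0x80 to 0x9F are C1 control codes in ISO-8859-1 and should almost never occur in ordinary text.

```python
import re

# In ISO-8859-1 the bytes 0x80-0x9F are C1 control codes; their presence in a
# text file suggests the file is really in some other 8-bit encoding.
C1_BYTES = re.compile(rb"[\x80-\x9f]")


def find_c1_bytes(data: bytes, limit: int = 10):
    """Return up to `limit` (offset, byte value) pairs for bytes in 0x80..0x9F."""
    hits = []
    for match in C1_BYTES.finditer(data):
        hits.append((match.start(), match.group()[0]))
        if len(hits) >= limit:
            break
    return hits


def looks_mislabelled(data: bytes) -> bool:
    """True if data labelled ISO-8859-1 contains any byte in 0x80..0x9F."""
    return C1_BYTES.search(data) is not None


# Illustrative sample only; a real run would read each file's bytes from disk.
sample = b"Permettez-moi d\x92inscrire"
for offset, value in find_c1_bytes(sample):
    print(f"offset {offset}: 0x{value:02X}")
```

A hit only says the label is suspicious. It does not say which encoding the file actually uses, so flagged files would still need a manual look.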

Andrew Sly wrote:
Would it be possible to run some kind of automated check on all files labelled ISO-8859-1, searching for characters in the 0x80 to 0x9F range?
In theory yes. In practice I've found that there are very many mislabelled files, not always so simple a case as ISO vs. WIN. I'll see if I can extract a list from somewhere. -- Marcello Perathoner webmaster@gutenberg.org

On Sat, 8 May 2010, Marcello Perathoner wrote:
Andrew Sly wrote:
Would it be possible to run some kind of automated check on all files labelled ISO-8859-1, searching for characters in the 0x80 to 0x9F range?
In theory yes. In practice I've found that there are very many mislabelled files, not always so simple a case as ISO vs. WIN.
I could believe that. The one that jumps to my mind is the Swedish Bible that I prepared for re-posting back in 2005. It looked like it had been prepared on a computer using one of the old DOS code pages, only it didn't quite seem to match any standard that I could find. Possibly it had been mangled in a file transfer somewhere. I was able to find what the correct characters should be, do some global search/replace and repost it as ISO-8859-1. --Andrew
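For cases like this, one rough way to narrow down the candidates is to decode the same bytes under several encodings and count how many expected letters come out. The sketch below is only an illustration: the sample words, the candidate list, and the scoring rule are all assumptions, and it would not have identified a file that matches no standard code page.

```python
# Letters we expect to see often in Swedish text.
SWEDISH_LETTERS = set("åäöÅÄÖ")

# Candidate encodings to try (an arbitrary, illustrative selection).
CANDIDATES = ("latin-1", "cp1252", "cp437", "cp850")


def score_codecs(data: bytes, candidates=CANDIDATES):
    """Map each candidate codec to the number of Swedish letters it yields."""
    scores = {}
    for codec in candidates:
        # errors="replace" keeps going when a byte is undefined in the codec.
        text = data.decode(codec, errors="replace")
        scores[codec] = sum(ch in SWEDISH_LETTERS for ch in text)
    return scores


# Hypothetical sample: a few Swedish words stored in a DOS code page.
sample = "för att så är".encode("cp850")
for codec, score in score_codecs(sample).items():
    print(f"{codec:8} {score}")
```

Several DOS code pages share the same positions for these letters, so ties are expected; the score only separates plausible families of encodings from implausible ones.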
participants (3): Andrew Sly, Joaquin Cuenca Abela, Marcello Perathoner