
This refers to the standard PGHeader information we include at the beginning of all of our documents. For instance:

Title: The Rejuvenation of Aunt Mary
Author: Anne Warner
Release Date: May 6, 2005 [eBook #15775]
Language: English
Character set encoding: ISO-8859-1
***

The character set encoding line makes sense for text files. However, for HTML files it makes less sense.

First of all, an HTML file usually contains an encoding line in the HTML header itself:

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />

But, this information just refers to how the HTML file is encoded, not necessarily what character set is actually displayed in the browser. For instance, an HTML document encoded in ISO-8859-1 can still contain all sorts of UTF-8 characters. You just have to escape them (&#nnnn;) to get the browser to display the UTF-8 character.

So, in that case, if we put a character set encoding line in the PGHeader, which do we use? The file itself is ISO-8859-1 ... but the characters displayed in your browser include UTF-8. Or vice versa ... if you create an HTML doc encoded in UTF-8, but it contains nothing but ASCII characters, what do you say in the PGHeader?

My reason for asking this is because currently the TEI->HTML conversion doesn't list a character set encoding in the PGHeader. Should it? How should the automated system determine what to put there if we have that line?

I'm looking for opinions and hopefully a consensus can be reached.

Josh
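As a rough Python sketch of the escaping trick Josh describes (the to_entity helper is invented here purely for illustration): a file that must stay within ISO-8859-1 can still carry any Unicode character, because the numeric character reference itself is plain ASCII.

def to_entity(ch):
    # Pass Latin-1 characters through; escape anything else as &#nnnn;
    return ch if ord(ch) < 256 else "&#%d;" % ord(ch)

text = "Greek: αβγ"                         # not encodable as iso-8859-1
escaped = "".join(to_entity(c) for c in text)
print(escaped)                              # Greek: &#945;&#946;&#947;
escaped.encode("iso-8859-1")                # succeeds; text.encode would raise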

Joshua Hutchinson wrote:
My reason for asking this is because currently the TEI->HTML conversion doesn't list a character set encoding in the PGHeader. Should it? How should the automated system determine what to put there if we have that line?
The encoding is already handled very well by the browser, and we should not bother the user with things he does not need to know. But with Unicode we face a problem completely different from the character set encoding problem: the problem of character set coverage.

With iso-8859-1 and its beggarly 256 characters we can be pretty sure the user has at least one font installed which contains all these characters. The browser will find this font and display the characters, even if running on a Chinese PC. With Unicode we can be sure that all the fonts the user has installed, taken as a whole, don't cover the whole Unicode character set.

There is no solution to this problem. If you use Unicode characters in your text you are gambling on the user having an appropriate font installed. The only hint we could give the user is whether he can reasonably expect his browser to render this file correctly. As we cannot know which fonts the user has installed, we can just print a list of the Unicode blocks used in the file, like this:

Unicode blocks: Basic Latin, Latin-1 Supplement, Bengali, Greek and Coptic, General Punctuation

so a user who has no Bengali fonts will know some characters will not display. I think this will create more confusion than it solves, and I opt for leaving any character set encoding line out of the header in non-TXT files.

--
Marcello Perathoner
webmaster@gutenberg.org
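As a rough sketch of the block-report idea Marcello describes, assuming a hand-picked table of just a few Unicode blocks (a real tool would read the full list from the Unicode Blocks.txt data file):

BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0980, 0x09FF, "Bengali"),
    (0x2000, 0x206F, "General Punctuation"),
]

def unicode_blocks(text):
    # Collect the names of the (known) Unicode blocks used in text.
    found = set()
    for ch in text:
        for lo, hi, name in BLOCKS:
            if lo <= ord(ch) <= hi:
                found.add(name)
                break
    return sorted(found)

print("Unicode blocks: " + ", ".join(unicode_blocks("Ångström… αβγ")))
# Unicode blocks: Basic Latin, General Punctuation, Greek and Coptic, Latin-1 Supplement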

Joshua Hutchinson wrote:
This refers to the standard PGHeader information we include at the beginning of all of our documents.
For instance:
Title: The Rejuvenation of Aunt Mary
Author: Anne Warner
Release Date: May 6, 2005 [eBook #15775]
Language: English
Character set encoding: ISO-8859-1
***
The character set encoding line makes sense for text files. However, for HTML files it makes less sense.
First of all, an HTML file usually contains an encoding line in the HTML header itself:

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
But, this information just refers to how the HTML file is encoded, not necessarily what character set is actually displayed in the browser. For instance, an HTML document encoded in ISO-8859-1 can still contain all sorts of UTF-8 characters. You just have to escape them (&#nnnn;) to get the browser to display the UTF-8 character.
I think you may be confusing UTF-8 _encoding_ with Unicode character _mapping_. Unicode promises to provide a unique numerical mapping for every character in every language in the world (mappings for Klingon _were_ rejected as inappropriate; however, mappings for Tengwar are still under consideration).

ASCII is an encoding method using 7 bits that can encode the first 128 values of Unicode, but no more.

iso-8859-1 (aka latin-1) is an encoding method using 8 bits (one byte) that can encode the first 256 values of Unicode, but no more.

UTF-8 is an encoding method using one or more bytes that can encode all Unicode values up to 2,097,152 (currently, Unicode only defines mappings through 196,480).

UTF-16 is an encoding method using either two or four bytes per character that can encode Unicode values up to 1,073,741,824 (I think; I'm a bit fuzzy on UTF-16).

MacRoman and windows-1252 use the same _encoding_ method as iso-8859-1, but use non-Unicode character _mappings_ for characters in the range of 128-255.

Character entities, specifically numeric entities (&#nnnn;), are a method of representing Unicode values above 127 using ASCII encoding. To my knowledge, character entities are unique to XML/HTML.

UTF-8 works by using the high bit of a byte to indicate that it is part of a multi-byte group. iso-8859-1 uses this bit to indicate a mapping in the range of 128-255. Thus, without being told, there is no way for a browser to know whether two bytes, each with the high bit set, represent two Unicode values in the range 128-255 or one Unicode value in the range of 128-2047 (there are some heuristics, but they are not infallible).
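A small Python demonstration of that ambiguity: the same two high-bit bytes are two characters under iso-8859-1 but a single character under UTF-8, and only the declared encoding tells the browser which reading was meant.

raw = b"\xc3\xa9"                  # two bytes, both with the high bit set

print(raw.decode("iso-8859-1"))    # 'Ã©' -- two code points, U+00C3 U+00A9
print(raw.decode("utf-8"))         # 'é'  -- one code point,  U+00E9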
So, in that case, if we put a character set encoding line in the PGHeader, which do we use? The file itself is ISO-8859-1 ... but the characters displayed in your browser include UTF-8. Or vice versa ... if you create an HTML doc encoded in UTF-8, but it contains nothing but ASCII characters, what do you say in the PGHeader?
My reason for asking this is because currently the TEI->HTML conversion doesn't list a character set encoding in the PGHeader. Should it? How should the automated system determine what to put there if we have that line?
I think it is fair to assume that _all_ PG texts will use Unicode character _mappings_; indeed, I don't think the current PGHeader is any indication of the range of character mappings; it only indicates what _encoding_ was used. So to answer the question, "which encoding do we use?" for XML, it should be safe to use the same encoding that was in the PGHeader; the header promised a certain encoding, and we should be able to assume that the document will keep its promises.

Should the TEI to HTML conversion generate a character set encoding declaration? Absolutely. Which one should it generate? Whichever one it used. If the conversion uses UTF-8 output (and I'm betting it does) it should declare that in the XHTML header. Likewise, it would be perfectly acceptable to use "iso-8859-1" or "us-ascii", just so long as the resulting document matches the declaration. Some browsers rely on the content-encoding declaration in figuring out how to display the text; you can use any one you like, just don't lie to the browser.

For texts that truly are ASCII (i.e. limited to 7-bit characters) you can use any of "us-ascii", "utf-8", "iso-8859-1", "windows-1252" or "macroman", because these encodings are all identical for values less than 128. Personally, I like "iso-8859-1" for western European texts, and "utf-8" for all others, although it wouldn't hurt my feelings to use "utf-8" exclusively.
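A quick Python check of the claim that these encodings coincide for pure 7-bit text ("mac_roman" is Python's codec name for MacRoman):

ascii_text = "Title: The Rejuvenation of Aunt Mary"

encodings = ["us-ascii", "utf-8", "iso-8859-1", "windows-1252", "mac_roman"]
outputs = {enc: ascii_text.encode(enc) for enc in encodings}

assert len(set(outputs.values())) == 1     # byte-for-byte identical
print("all five encodings produce identical bytes")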
I'm looking for opinions and hopefully a consensus can be reached.
Josh