
Joshua Hutchinson wrote:
This refers to the standard PGheader information we include at the beginning of all of our documents.
For instance:
Title: The Rejuvenation of Aunt Mary
Author: Anne Warner
Release Date: May 6, 2005 [eBook #15775]
Language: English
Character set encoding: ISO-8859-1
***
The character set encoding line makes sense for text files. However, for HTML files it begins to make a little less sense.
First of all, an HTML file usually contains an encoding line in the HTML header itself. <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
But this information just refers to how the HTML file is encoded, not necessarily what character set is actually displayed in the browser. For instance, an HTML document encoded in ISO-8859-1 can still contain all sorts of UTF-8 characters. You just have to escape them out (&#xxx;) to get the browser to display the UTF-8 character.
I think you may be confusing UTF-8 _encoding_ with Unicode character _mapping_. Unicode promises to provide a unique numerical mapping for every character in every language in the world (mappings for Klingon _were_ rejected as inappropriate; mappings for Tengwar, however, are still under consideration).
ASCII is an encoding method using 7 bits that can encode the first 128 values of Unicode (0-127), but no more. iso-8859-1 (aka latin-1) is an encoding method using 8 bits (one byte) that can encode the first 256 values of Unicode, but no more. UTF-8 is an encoding method using one or more bytes that can encode Unicode values up to 2,097,151 in its four-byte form (currently, Unicode only defines mappings through 196,480). UTF-16 is an encoding method using either two or four bytes per character that can encode Unicode values up to 1,114,111 (0x10FFFF). MacRoman and windows-1252 use the same _encoding_ method as iso-8859-1, but use non-Unicode character _mappings_ for characters in the range of 128-255.
Character entities, specifically numeric entities (&#nnnn;), are a method of representing Unicode values above 127 using ASCII encoding. To my knowledge, character entities are unique to XML/HTML.
UTF-8 works by using the high bit of a byte to indicate that it is part of a multi-byte group. iso-8859-1 uses this bit to indicate a mapping in the range of 128-255. Thus, without being told, there is no way for a browser to know whether two bytes, each with the high bit set, represent two Unicode values in the range 128-255 or one Unicode value in the range of 128-2047 (there are some heuristics, but they are not infallible).
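To make the mapping/encoding distinction concrete, here is a minimal Python sketch. The sample character and the variable names are my own choices for illustration, not anything taken from the PG tools. It writes one Unicode value three ways and then shows the two-byte ambiguity described above:

```python
# One character, one Unicode mapping: U+00E9, value 233.
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# Three different byte-level representations of that same value.
latin1_bytes = ch.encode("iso-8859-1")  # one byte, high bit set
utf8_bytes = ch.encode("utf-8")         # two bytes, both with high bit set
entity_bytes = ch.encode("ascii", errors="xmlcharrefreplace")  # numeric entity in pure ASCII

# The ambiguity: the same two high-bit bytes, read under two declarations.
raw = b"\xc3\xa9"
as_utf8 = raw.decode("utf-8")         # one Unicode value (233)
as_latin1 = raw.decode("iso-8859-1")  # two Unicode values (195 and 169)

print(latin1_bytes, utf8_bytes, entity_bytes)
print([ord(c) for c in as_utf8], [ord(c) for c in as_latin1])
```

The bytes never change between the last two decodes; only the declared encoding does, which is why the browser has to be told.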
So, in that case, if we put a character set encoding line in the PGHeader, which do we use? The file itself is ISO-8859-1 ... but the characters displayed in your browser include UTF-8. Or vice versa ... if you create an HTML doc encoded in UTF-8, but it contains nothing but ASCII characters, which do you say in the PGHeader?
My reason for asking this is because currently the TEI->HTML conversion doesn't list a character set encoding in the PGHeader. Should it? How should the automated system determine what to put there if we have that line?
I think it is fair to assume that _all_ PG texts will use Unicode character _mappings_; indeed, I don't think the current PGheader is any indication of the range of character mappings, it only indicates what _encoding_ was used. So to answer the question, "which encoding do we use?" for XML, it should be safe to use the same encoding that was in the PGheader; the header promised a certain encoding, and we should be able to assume that the document will keep its promises.
Should the TEI to HTML conversion generate a character set encoding declaration? Absolutely. Which one should it generate? Whichever one it used. If the conversion uses UTF-8 output (and I'm betting it does) it should declare that in the XHTML header. Likewise, it would be perfectly acceptable to use "iso-8859-1" or "us-ascii", just so long as the resulting document matches the declaration. Some browsers rely on the character encoding declaration in figuring out how to display the text; you can use any one you like, just don't lie to the browser.
For texts that truly are ASCII (i.e. limited to 7-bit characters) you can use any of "us-ascii", "utf-8", "iso-8859-1", "windows-1252" or "macroman" because these encodings are all identical for values less than 128. Personally, I like "iso-8859-1" for western European texts, and "utf-8" for all others, although it wouldn't hurt my feelings to use "utf-8" exclusively.
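As a rough sketch of how an automated system might pick a declaration that does not lie, the Python below tries the narrowest encoding first and falls back to UTF-8. The function names narrowest_encoding and meta_line are hypothetical, and this is not how the actual TEI to HTML conversion works, only one way it could be done:

```python
def narrowest_encoding(text):
    """Return the first of us-ascii, iso-8859-1, utf-8 that can encode text.

    Hypothetical helper: the order reflects a preference, not a PG rule.
    """
    for name in ("us-ascii", "iso-8859-1"):
        try:
            text.encode(name)
            return name
        except UnicodeEncodeError:
            continue
    return "utf-8"


def meta_line(encoding):
    """Build an HTML declaration that matches the encoding actually used."""
    return ('<meta http-equiv="Content-Type" '
            f'content="text/html; charset={encoding}" />')


# Pure 7-bit text yields identical bytes under all five encodings named above.
plain = "The Rejuvenation of Aunt Mary"
names = ["us-ascii", "utf-8", "iso-8859-1", "windows-1252", "macroman"]
identical = len({plain.encode(n) for n in names}) == 1

for sample in (plain, "caf\u00e9", "\u03bb\u03cc\u03b3\u03bf\u03c2"):
    enc = narrowest_encoding(sample)
    data = sample.encode(enc)  # write these bytes, and declare exactly enc
    print(enc, len(data), meta_line(enc))
```

Whatever rule is used to choose, the important part is that the same name is used both to encode the output and to fill in the declaration.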
I'm looking for opinions and hopefully a consensus can be reached.
Josh