Re: [gutvol-d] Encoding statement in HTML PG Header

15 Sep 2005

      Joshua Hutchinson wrote:
...
This refers to the standard PGheader information we include at the beginning of all of our documents.
For instance:
Title: The Rejuvenation of Aunt Mary
Author: Anne Warner
Release Date: May 6, 2005  [eBook #15775]
Language: English
Character set encoding: ISO-8859-1
***
The character set encoding line makes sense for text files.  However, for HTML files it begins to make a little less sense.
First of all, an HTML file usually contains an encoding line in the HTML header itself.  <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
But, this information just refers to how the HTML file is encoded, not necessarily what character set is actually displayed in the browser.  For instance, an HTML document encoded in ISO-8859-1 can still contain all sorts of UTF-8 characters.  You just have to escape them out (&#xxxx;) to get the browser to display the UTF-8 character.
I think you may be confusing UTF-8 _encoding_ with Unicode character 
_mapping_. Unicode promises to provide a unique numerical mapping for 
every character in every language in the world (mappings for Klingon 
_were_ rejected as inappropriate; mappings for Tengwar are still 
under consideration). ASCII is an encoding method using 7 bits that can 
encode the first 128 values of Unicode (0-127), but no more. iso-8859-1 
(aka latin-1) is an encoding method using 8 bits (one byte) that can 
encode the first 256 values of Unicode, but no more. UTF-8 is an 
encoding method using one to four bytes per character. Its four-byte 
form has room for 2,097,152 values, but the Unicode code space ends at 
1,114,111 (U+10FFFF), and only a fraction of that range currently has 
characters assigned. UTF-16 is an encoding method using either two or 
four bytes per character that covers the same code space, up to 
1,114,111.
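
One way to see these ranges is to ask a language runtime to encode a 
few characters. The following is only a sketch using Python's built-in 
codecs; the sample characters and the helper name encoded_length are my 
own choices for illustration.

```python
# Report how many bytes each encoding needs for a few sample characters.
# The samples are arbitrary: A, e-acute, euro sign, musical G clef.
SAMPLES = ["A", "\u00e9", "\u20ac", "\U0001d11e"]
ENCODINGS = ["ascii", "iso-8859-1", "utf-8", "utf-16-be"]


def encoded_length(ch, encoding):
    """Return the byte count, or None if the encoding cannot represent ch."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None


for ch in SAMPLES:
    sizes = {enc: encoded_length(ch, enc) for enc in ENCODINGS}
    print(f"U+{ord(ch):04X}", sizes)
```

A None in the output means the encoding has no way to represent that 
value at all, which is the "but no more" limit described above.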

MacRoman and windows-1252 use the same one-byte _encoding_ method as 
iso-8859-1, but map some byte values to different characters: MacRoman 
differs throughout the range 128-255, while windows-1252 differs only 
in the range 128-159.
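
To see how the three interpret the same byte, here is a small sketch 
using Python's built-in codecs ("cp1252" is Python's name for 
windows-1252). The byte values 0x93 and 0xE9 are arbitrary examples I 
picked.

```python
# Decode the same raw bytes under three single-byte encodings.
CODECS = ("iso-8859-1", "cp1252", "mac_roman")

LOW = b"\x93"   # a byte in the range 128-159
HIGH = b"\xe9"  # a byte in the range 160-255

decoded_low = {name: LOW.decode(name) for name in CODECS}
decoded_high = {name: HIGH.decode(name) for name in CODECS}

for name in CODECS:
    print(f"{name:10} 0x93 -> U+{ord(decoded_low[name]):04X}  "
          f"0xE9 -> U+{ord(decoded_high[name]):04X}")
```

Under iso-8859-1 the byte 0x93 is a control code, while windows-1252 
turns it into a left curly quote (U+201C).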

Character entities, specifically numeric character references 
(&#nnnn;), are a method of representing Unicode values above 127 using 
only ASCII characters. To my knowledge, they are unique to SGML-derived 
markup such as XML/HTML.
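
As an illustration, Python's standard library can produce and reverse 
these numeric references. This is just a sketch; the sample string is 
made up.

```python
import html

# A string with two characters outside ASCII: e-acute and the euro sign.
text = "caf\u00e9 \u20ac5"

# Encode to pure ASCII, replacing anything above 127 with &#nnnn;
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_bytes)

# A browser does the reverse when it parses the page; html.unescape
# performs the same substitution here.
restored = html.unescape(ascii_bytes.decode("ascii"))
print(restored == text)
```

The resulting bytes are valid under "us-ascii", "iso-8859-1" and 
"utf-8" alike, since every byte is below 128.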

UTF-8 works by using the high bit of a byte to indicate that it is part 
of a multi-byte group. iso-8859-1 uses this bit to indicate a mapping in 
the range of 128-255. Thus, without being told, there is no way for a 
browser to know whether two bytes, each with the high bit set, represent 
two Unicode values in the range 128-255 or one Unicode value in the 
range of 128-2047 (there are some heuristics, but they are not infallible).
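
The ambiguity is easy to reproduce. In this sketch the two bytes 0xC3 
0xA9 (an example I chose) are decoded both ways:

```python
# Two bytes, each with the high bit set.
raw = b"\xc3\xa9"

as_utf8 = raw.decode("utf-8")         # one character: U+00E9 (e-acute)
as_latin1 = raw.decode("iso-8859-1")  # two characters: U+00C3, U+00A9

print(len(as_utf8), [f"U+{ord(c):04X}" for c in as_utf8])
print(len(as_latin1), [f"U+{ord(c):04X}" for c in as_latin1])
```

Both decodings succeed without error, so the bytes alone cannot tell 
the browser which one was intended.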
...
So, in that case, if we put a character set encoding line in the PGHeader, which do we use?  The file itself is ISO-8859-1 ... but the characters displayed in your browser include UTF-8.  Or vice versa ... if you create an HTML doc encoded in UTF-8, but it contains nothing but ASCII characters, which do you say in the PGHeader?
My reason for asking this is because currently the TEI->HTML conversion doesn't list a character set encoding in the PGHeader.  Should it?  How should the automated system determine what to put there if we have that line?
I think it is fair to assume that _all_ PG texts will use Unicode 
character _mappings_; indeed, I don't think the current PGheader is any 
indication of the range of character mappings, it only indicates what 
_encoding_ was used. So to answer the question, "which encoding do we 
use?" for XML, it should be safe to use the same encoding that was in 
the PGheader; the header promised a certain encoding, and we should be 
able to assume that the document will keep its promises.

Should the TEI to HTML conversion generate a character set encoding 
declaration? Absolutely. Which one should it generate? Whichever one it 
used. If the conversion uses UTF-8 output (and I'm betting it does) it 
should declare that in the XHTML header. Likewise, it would be perfectly 
acceptable to use "iso-8859-1" or "us-ascii", just so long as the 
resulting document matches the declaration. Some browsers rely on the 
charset declaration in figuring out how to display the text; you can 
use any one you like, just don't lie to the browser.
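
As a rough sketch of how an automated conversion could keep that 
promise, the code below picks the narrowest encoding that can represent 
the text and then writes the declaration and the bytes from the same 
value. I don't know how the actual TEI to HTML conversion is built; the 
names choose_encoding and render_html and the candidate list are 
hypothetical.

```python
import html

# Candidate encodings, narrowest first (an assumed policy, not PG's).
CANDIDATES = ("us-ascii", "iso-8859-1", "utf-8")


def choose_encoding(text):
    """Return the first candidate encoding that can represent text."""
    for name in CANDIDATES:
        try:
            text.encode(name)
            return name
        except UnicodeEncodeError:
            continue
    raise ValueError("text cannot be encoded with any candidate")


def render_html(title, body_text):
    """Build a page whose declared charset matches its actual bytes."""
    encoding = choose_encoding(title + body_text)
    page = (
        "<html><head>\n"
        '<meta http-equiv="Content-Type" '
        f'content="text/html; charset={encoding}" />\n'
        f"<title>{html.escape(title)}</title>\n"
        "</head><body>\n"
        f"<p>{html.escape(body_text)}</p>\n"
        "</body></html>\n"
    )
    # Encode with the same name that was written into the declaration.
    return encoding, page.encode(encoding)


encoding, data = render_html("The Rejuvenation of Aunt Mary", "caf\u00e9")
print(encoding, len(data))
```

Because the declaration and the call to encode() share one variable, 
the two cannot drift apart.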

For texts that truly are ASCII (i.e. limited to 7-bit characters) you 
can use any of "us-ascii", "utf-8", "iso-8859-1",  "windows-1252" or 
"macroman" because these encodings are all identical for values less 
than 128.
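
This is simple to check. The sketch below encodes a pure ASCII string 
(the title from the sample header) under each of the five names and 
compares the bytes; "mac_roman" is Python's name for MacRoman.

```python
# Encode a 7-bit-only string under each encoding and compare the bytes.
ENCODINGS = ("us-ascii", "utf-8", "iso-8859-1", "windows-1252", "mac_roman")

sample = "The Rejuvenation of Aunt Mary"
encoded = {name: sample.encode(name) for name in ENCODINGS}

# If every encoding produced the same bytes, the set has one member.
all_identical = len(set(encoded.values())) == 1
print(all_identical)
```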

Personally, I like "iso-8859-1" for western European texts, and "utf-8" 
for all others, although it wouldn't hurt my feelings to use "utf-8" 
exclusively.
...
I'm looking for opinions and hopefully a consensus can be reached.
Josh

Lee Passey