
This refers to the standard PGHeader information we include at the beginning of all of our documents. For instance:

Title: The Rejuvenation of Aunt Mary
Author: Anne Warner
Release Date: May 6, 2005 [eBook #15775]
Language: English
Character set encoding: ISO-8859-1
***

The character set encoding line makes sense for text files. However, for HTML files it makes less sense.

First of all, an HTML file usually contains an encoding line in the HTML header itself:

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />

But, this information just refers to how the HTML file is encoded, not necessarily what character set is actually displayed in the browser. For instance, an HTML document encoded in ISO-8859-1 can still contain all sorts of UTF-8 characters. You just have to escape them (&#nnnn;) to get the browser to display the UTF-8 character.

So, in that case, if we put a character set encoding line in the PGHeader, which do we use? The file itself is ISO-8859-1 ... but the characters displayed in your browser include UTF-8. Or vice versa ... if you create an HTML doc encoded in UTF-8, but it contains nothing but ASCII characters, what do you say in the PGHeader?

My reason for asking this is because currently the TEI->HTML conversion doesn't list a character set encoding in the PGHeader. Should it? How should the automated system determine what to put there if we have that line?

I'm looking for opinions and hopefully a consensus can be reached.

Josh
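As a rough Python sketch of the escaping trick Josh describes (the to_entity helper is invented here purely for illustration): a file that must stay within ISO-8859-1 can still carry any Unicode character, because the numeric character reference itself is plain ASCII.

def to_entity(ch):
    # Pass Latin-1 characters through; escape anything else as &#nnnn;
    return ch if ord(ch) < 256 else "&#%d;" % ord(ch)

text = "Greek: αβγ"                         # not encodable as iso-8859-1
escaped = "".join(to_entity(c) for c in text)
print(escaped)                              # Greek: &#945;&#946;&#947;
escaped.encode("iso-8859-1")                # succeeds; text.encode would raise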

Joshua Hutchinson wrote:
My reason for asking this is because currently the TEI->HTML conversion doesn't list a character set encoding in the PGHeader. Should it? How should the automated system determine what to put there if we have that line?
The encoding is already handled very well by the browser, and we should not bother the user with things he does not need to know. But with Unicode we face a problem completely different from the character set encoding problem: the problem of character set coverage.

With iso-8859-1 and its beggarly 256 characters we can be pretty sure the user has at least one font installed which contains all these characters. The browser will find this font and display the characters, even if running on a Chinese PC. With Unicode we can be sure that all the fonts the user has installed, taken as a whole, don't cover the whole Unicode character set.

There is no solution to this problem. If you use Unicode characters in your text you are gambling on the user having an appropriate font installed. The only hint we could give the user is whether he can reasonably expect his browser to render this file correctly. As we cannot know which fonts the user has installed, we can just print a list of the Unicode blocks used in the file, like this:

Unicode blocks: Basic Latin, Latin-1 Supplement, Bengali, Greek and Coptic, General Punctuation

so a user who has no Bengali fonts will know some characters will not display. I think this will create more confusion than it solves, and I opt for leaving any character set encoding line out of the header in non-TXT files.

--
Marcello Perathoner
webmaster@gutenberg.org
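As a rough sketch of the block-report idea Marcello describes, assuming a hand-picked table of just a few Unicode blocks (a real tool would read the full list from the Unicode Blocks.txt data file):

BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0980, 0x09FF, "Bengali"),
    (0x2000, 0x206F, "General Punctuation"),
]

def unicode_blocks(text):
    # Collect the names of the (known) Unicode blocks used in text.
    found = set()
    for ch in text:
        for lo, hi, name in BLOCKS:
            if lo <= ord(ch) <= hi:
                found.add(name)
                break
    return sorted(found)

print("Unicode blocks: " + ", ".join(unicode_blocks("Ångström… αβγ")))
# Unicode blocks: Basic Latin, General Punctuation, Greek and Coptic, Latin-1 Supplement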

Joshua Hutchinson wrote:
This refers to the standard PGHeader information we include at the beginning of all of our documents.
For instance:
Title: The Rejuvenation of Aunt Mary
Author: Anne Warner
Release Date: May 6, 2005 [eBook #15775]
Language: English
Character set encoding: ISO-8859-1
***
The character set encoding line makes sense for text files. However, for HTML files it makes less sense.
First of all, an HTML file usually contains an encoding line in the HTML header itself:

<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1" />
But, this information just refers to how the HTML file is encoded, not necessarily what character set is actually displayed in the browser. For instance, an HTML document encoded in ISO-8859-1 can still contain all sorts of UTF-8 characters. You just have to escape them (&#nnnn;) to get the browser to display the UTF-8 character.
I think you may be confusing UTF-8 _encoding_ with Unicode character _mapping_. Unicode promises to provide a unique numerical mapping for every character in every language in the world (mappings for Klingon _were_ rejected as inappropriate; however, mappings for Tengwar are still under consideration).

ASCII is an encoding method using 7 bits that can encode the first 128 values of Unicode, but no more.

iso-8859-1 (aka latin-1) is an encoding method using 8 bits (one byte) that can encode the first 256 values of Unicode, but no more.

UTF-8 is an encoding method using one or more bytes that can encode all Unicode values up to 2,097,152 (currently, Unicode only defines mappings through 196,480).

UTF-16 is an encoding method using either two or four bytes per character that can encode Unicode values up to 1,073,741,824 (I think; I'm a bit fuzzy on UTF-16).

MacRoman and windows-1252 use the same _encoding_ method as iso-8859-1, but use non-Unicode character _mappings_ for characters in the range of 128-255.

Character entities, specifically numeric entities (&#nnnn;), are a method of representing Unicode values above 127 using ASCII encoding. To my knowledge, character entities are unique to XML/HTML.

UTF-8 works by using the high bit of a byte to indicate that it is part of a multi-byte group. iso-8859-1 uses this bit to indicate a mapping in the range of 128-255. Thus, without being told, there is no way for a browser to know whether two bytes, each with the high bit set, represent two Unicode values in the range 128-255 or one Unicode value in the range of 128-2047 (there are some heuristics, but they are not infallible).
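A small Python demonstration of that ambiguity: the same two high-bit bytes are two characters under iso-8859-1 but a single character under UTF-8, and only the declared encoding tells the browser which reading was meant.

raw = b"\xc3\xa9"                  # two bytes, both with the high bit set

print(raw.decode("iso-8859-1"))    # 'Ã©' -- two code points, U+00C3 U+00A9
print(raw.decode("utf-8"))         # 'é'  -- one code point,  U+00E9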
So, in that case, if we put a character set encoding line in the PGHeader, which do we use? The file itself is ISO-8859-1 ... but the characters displayed in your browser include UTF-8. Or vice versa ... if you create an HTML doc encoded in UTF-8, but it contains nothing but ASCII characters, what do you say in the PGHeader?
My reason for asking this is because currently the TEI->HTML conversion doesn't list a character set encoding in the PGHeader. Should it? How should the automated system determine what to put there if we have that line?
I think it is fair to assume that _all_ PG texts will use Unicode character _mappings_; indeed, I don't think the current PGHeader is any indication of the range of character mappings; it only indicates what _encoding_ was used. So to answer the question, "which encoding do we use?" for XML, it should be safe to use the same encoding that was in the PGHeader; the header promised a certain encoding, and we should be able to assume that the document will keep its promises.

Should the TEI to HTML conversion generate a character set encoding declaration? Absolutely. Which one should it generate? Whichever one it used. If the conversion uses UTF-8 output (and I'm betting it does) it should declare that in the XHTML header. Likewise, it would be perfectly acceptable to use "iso-8859-1" or "us-ascii", just so long as the resulting document matches the declaration. Some browsers rely on the content-encoding declaration in figuring out how to display the text; you can use any one you like, just don't lie to the browser.

For texts that truly are ASCII (i.e. limited to 7-bit characters) you can use any of "us-ascii", "utf-8", "iso-8859-1", "windows-1252" or "macroman", because these encodings are all identical for values less than 128. Personally, I like "iso-8859-1" for western European texts, and "utf-8" for all others, although it wouldn't hurt my feelings to use "utf-8" exclusively.
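A quick Python check of the claim that these encodings coincide for pure 7-bit text ("mac_roman" is Python's codec name for MacRoman):

ascii_text = "Title: The Rejuvenation of Aunt Mary"

encodings = ["us-ascii", "utf-8", "iso-8859-1", "windows-1252", "mac_roman"]
outputs = {enc: ascii_text.encode(enc) for enc in encodings}

assert len(set(outputs.values())) == 1     # byte-for-byte identical
print("all five encodings produce identical bytes")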
I'm looking for opinions and hopefully a consensus can be reached.
Josh