
Joshua Hutchinson wrote:
I ask because the TEI->HTML conversion currently doesn't list a character set encoding in the PGHeader. Should it? If we add that line, how should the automated system determine what to put there?
The browser already handles the encoding very well, and we should not bother the user with things he does not need to know.

But with Unicode we face a problem quite different from character set encoding: the problem of character set coverage. With iso-8859-1 and its beggarly 256 characters we can be pretty sure the user has at least one font installed that contains all of them. The browser will find this font and display the characters, even when running on a Chinese PC. With Unicode we can be sure that all the fonts the user has installed, taken as a whole, don't cover the whole Unicode character set.

There is no solution to this problem. If you use Unicode characters in your text, you are gambling on the user having an appropriate font installed.

The only hint we could give the user is whether he can reasonably expect his browser to render this file correctly. As we cannot know which fonts the user has installed, all we can do is print a list of the Unicode blocks used in the file, like this:

Unicode blocks: Basic Latin, Latin-1 Supplement, Bengali, Greek and Coptic, General Punctuation

so a user who has no Bengali fonts will know that some characters will not display.

I think this will create more confusion than it resolves, so I opt for leaving any character set encoding line out of the header in non-TXT files.

-- 
Marcello Perathoner
webmaster@gutenberg.org
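
For illustration only, the "Unicode blocks:" line described above could be produced roughly as follows. This is a minimal sketch, not the actual PG conversion code: the function names and the small BLOCKS table are hypothetical, and the table covers only the five blocks named in the example. A real tool would need the full block list from the Unicode Character Database, since Python's standard library does not expose block names.

```python
# Hypothetical, deliberately incomplete table: (first, last, block name).
# Ranges are taken from the Unicode block list; only five blocks are included.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0980, 0x09FF, "Bengali"),
    (0x2000, 0x206F, "General Punctuation"),
]


def block_of(ch):
    """Return the block name for one character, per the table above."""
    cp = ord(ch)
    for first, last, name in BLOCKS:
        if first <= cp <= last:
            return name
    return "Unknown block"  # anything outside this sketch's small table


def blocks_used(text):
    """List the blocks occurring in text, in order of first appearance."""
    seen = []
    for ch in text:
        name = block_of(ch)
        if name not in seen:
            seen.append(name)
    return seen


def header_line(text):
    """Format the list as a single header line."""
    return "Unicode blocks: " + ", ".join(blocks_used(text))
```

Note that such a line only reports what the file contains; it still cannot tell whether the reader's installed fonts cover those blocks.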