
Jim Adcock wrote:
OK. My point being that IF PG were to accept a "proper" book INPUT encoding format that preserves the hard-won knowledge of the original encoding volunteer, then there would be no need for a future volunteer to have to completely scan that encoding against the original book scans in order to make another pass looking for errors, etc.
There's a misconception here. PG *does* allow you to post additional file formats *along* with TXT and HTML. TEI comes to mind as a format perfectly suited to preserving a lot that HTML cannot. The reason that there isn't a TEI file posted along with *every* ebook is that most PPers at DP don't care to produce one.
Both formats throw away the original volunteers' knowledge about the common parts of books: TOC, author info, pub info, copyright pages, index, chapters, etc.
TEI has elements for all these cases.
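To make that concrete, here is a minimal sketch of how such markup can be read back by a program. The sample document is invented for illustration (title, author and publisher are placeholders, and the `type` values on `div` are an assumed convention rather than anything PG or DP prescribes); the element names themselves (`teiHeader`, `titleStmt`, `publicationStmt`, `front`, `body`, `back`, `div`, `head`) are standard TEI. The helper `describe` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI_NS}

# Invented sample: every value below is a placeholder, not a real PG ebook.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>An Example Book</title>
        <author>A. N. Author</author>
      </titleStmt>
      <publicationStmt>
        <publisher>Example Press</publisher>
        <availability><p>Public domain.</p></availability>
      </publicationStmt>
      <sourceDesc><p>Transcribed from a printed edition.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <front>
      <div type="contents"><head>Contents</head></div>
    </front>
    <body>
      <div type="chapter" n="1"><head>Chapter I</head><p>Text.</p></div>
      <div type="chapter" n="2"><head>Chapter II</head><p>More text.</p></div>
    </body>
    <back>
      <div type="index"><head>Index</head></div>
    </back>
  </text>
</TEI>"""


def describe(tei_xml):
    """Collect the bibliographic fields and the structural divisions."""
    root = ET.fromstring(tei_xml)
    return {
        "title": root.findtext(".//tei:titleStmt/tei:title", namespaces=NS),
        "author": root.findtext(".//tei:titleStmt/tei:author", namespaces=NS),
        "publisher": root.findtext(
            ".//tei:publicationStmt/tei:publisher", namespaces=NS
        ),
        # Each <div> keeps its role (contents, chapter, index) in @type.
        "divisions": [
            (div.get("type"), div.findtext("tei:head", namespaces=NS))
            for div in root.iterfind(".//tei:div", NS)
        ],
    }


if __name__ == "__main__":
    for key, value in describe(SAMPLE).items():
        print(key, "->", value)
```

The point is only that the TOC, the publication info, the chapters and the index remain distinguishable after encoding, so a later volunteer or tool can find them without going back to the scans.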
TXT files seem to me to almost always have some glyphs outside of the 8-bit char set. Unicode text files would at least overcome this limitation.
I don't see any problem here: Produce utf-8 files. The whitewashers will create some work for themselves by converting the utf-8 to all sorts of embarrassing encodings and then waste more time at the helpdesk to explain to incredulous users what "encodings" are, but that need not be your problem.

--
Marcello Perathoner
webmaster@gutenberg.org
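As a rough illustration of the 8-bit limitation discussed above, the following sketch checks which characters of a text cannot be represented in a given encoding. The sample string and the helper names `unencodable` and `survives` are made up for this example; only standard Python string methods are used.

```python
def unencodable(text, encoding):
    """Return the distinct characters of text that encoding cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


def survives(text, encoding):
    """True if text can be encoded and decoded again without loss."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        return False


# Invented sample containing r-caron, a-acute, curly quotes and the OE ligature,
# written with escapes so the source itself stays pure ASCII.
sample = "Dvo\u0159\u00e1k\u2019s \u201c\u0152uvres\u201d"

if __name__ == "__main__":
    for enc in ("utf-8", "latin-1", "ascii"):
        print(enc, survives(sample, enc), unencodable(sample, enc))
```

UTF-8 round-trips the whole sample, while Latin-1 keeps only the accented "a" and loses the rest, which is the kind of loss a down-conversion to an 8-bit encoding introduces.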