[gutvol-d] Re: In search of a more-vanilla vanilla TXT

14 Sep 2009

Jim Adcock wrote:
...
OK.  My point being that IF PG were to accept a "proper" book INPUT encoding
format that preserves the hard-won knowledge of the original encoding
volunteer, then there would be no need for a future volunteer to have to
completely scan that encoding against the original book scans in order to
make another pass looking for errors, etc.

There's a misconception here.

PG *does* allow you to post additional file formats *alongside* TXT and 
HTML. TEI comes to mind as a format perfectly suited to preserving a lot 
that HTML cannot.

The reason that there isn't a TEI file posted along with *every* ebook 
is that most PPers at DP don't care to produce one.
...
Both formats throw away the original volunteers' knowledge about the common
parts of books: TOC, author info, pub info, copyright pages, index,
chapters, etc.

TEI has elements for all these cases.
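
To make that concrete, here is a rough sketch in Python that builds a bare
TEI skeleton covering the parts listed above. The element names (teiHeader,
fileDesc, titleStmt, publicationStmt, availability, front, body, back,
divGen, div) are taken from the TEI P5 Guidelines; the function name, its
parameters, and the sample values are made up for illustration, and the TEI
namespace is left out for brevity. It is not a description of how any PG or
DP tool actually produces TEI.

```python
import xml.etree.ElementTree as ET


def build_tei_skeleton(title, author, publisher, rights, chapters):
    """Return a minimal TEI-like element tree (hypothetical helper).

    Namespace declarations and most required metadata are omitted;
    this only shows where each part of a book would live.
    """
    tei = ET.Element("TEI")

    # Author, publication and copyright info go into the header.
    header = ET.SubElement(tei, "teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    ET.SubElement(title_stmt, "author").text = author
    pub_stmt = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(pub_stmt, "publisher").text = publisher
    availability = ET.SubElement(pub_stmt, "availability")
    ET.SubElement(availability, "p").text = rights

    text = ET.SubElement(tei, "text")

    # Front matter: a generated table of contents.
    front = ET.SubElement(text, "front")
    ET.SubElement(front, "divGen", type="toc")

    # Body: one typed division per chapter, each with a heading.
    body = ET.SubElement(text, "body")
    for heading in chapters:
        div = ET.SubElement(body, "div", type="chapter")
        ET.SubElement(div, "head").text = heading

    # Back matter: the index.
    back = ET.SubElement(text, "back")
    ET.SubElement(back, "div", type="index")

    return tei


if __name__ == "__main__":
    # Sample values are placeholders, not a real PG ebook.
    tree = build_tei_skeleton(
        "An Example Book", "A. N. Author", "Example Press",
        "Public domain.", ["Chapter I", "Chapter II"],
    )
    print(ET.tostring(tree, encoding="unicode"))
```

The point is only that chapters, TOC, index and publication details each get
their own marked-up place instead of being flattened into running text.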
...
TXT files seem to me to almost always have some glyphs outside of the 8-bit
char set.  Unicode text files would at least overcome this limitation.

I don't see any problem here: Produce utf-8 files.

The whitewashers will create some work for themselves by converting the 
utf-8 to all sorts of embarrassing encodings, and then waste more time at 
the helpdesk explaining to incredulous users what "encodings" are, but 
that need not be your problem.
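
For illustration, a minimal Python sketch of both halves of that: writing a
text file as utf-8, and checking which characters would not survive a
conversion to an 8-bit encoding such as latin-1. The function names and the
sample string are invented for this example; this is not the whitewashers'
actual tooling.

```python
def write_utf8(path, text):
    """Write text to path as utf-8 with LF line endings."""
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)


def unencodable_chars(text, encoding):
    """Return the distinct characters of text that the given
    encoding cannot represent, in order of first appearance."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in bad:
                bad.append(ch)
    return bad


if __name__ == "__main__":
    # Made-up sample: an oe ligature, curly quotes, and an i with diaeresis.
    sample = "cœur, “quoted”, naïve"
    print(unencodable_chars(sample, "utf-8"))    # nothing is lost
    print(unencodable_chars(sample, "latin-1"))  # ligature and curly quotes
```

Anything the second call reports has to be transliterated or dropped in the
8-bit version, which is where the helpdesk questions come from.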

-- 
Marcello Perathoner
webmaster@gutenberg.org