
1) Preserves the hard-won effort I have already put into content creation, such that a future volunteer can build on my work without having to "reverse engineer" the gratuitous errors currently introduced by PG's use of TXT and HTML.
Please give some real-world examples.
OK. My point being that IF PG were to accept a "proper" book INPUT encoding format that preserves the hard-won knowledge of the original encoding volunteer, then there would be no need for a future volunteer to completely re-check that encoding against the original book scans in order to make another pass looking for errors, etc. So what is "wrong" with TXT and HTML in this regard, as stored in the PG databases?

Both formats throw away the original volunteers' knowledge about the common parts of books: TOC, author info, pub info, copyright pages, index, chapters, etc. Yes, one can code this information in HTML, but there is no unambiguous way to do so, which means that PG HTML encodings all take different paths, as one rapidly discovers if one tries to automagically convert PG HTML into other reflow file formats. You could follow common h1, h2, h3 settings by convention -- if PG were to establish and require such -- but then you end up with really ugly rendered HTML on common displays. You can overcome this with style sheets -- but then you are defeating many tools which automagically convert HTML into a variety of other reflow file formats for the various e-readers.

Both formats as stored by PG gratuitously throw away hard-won line-by-line alignments between the scan text and the hand-scanno-corrected text. These alignments are needed if a future volunteer wants to make another pass at "fixing" errors in the text, for example by running it through DP again, or by running it against a future automagic tool that compares a new scan to the PG text. I submit my HTML to PG WITH the original line-by-line alignments -- because it doesn't in any way hurt the HTML and allows a future volunteer to make another pass on my work -- but then PG insists on throwing this information away anyway before posting their HTML files.

Both formats throw away page numbers and page breaks, which again are necessary to make another volunteer pass against the original scans, and also to make future passes against broken link info, etc. They would also be useful for some college courses, where you need page number refs even if reading on a reflow reader device. I'm NOT suggesting that page numbers should typically be displayed in an OUTPUT reflow file format rendering, rather that this represents hard-won information that ought to be retained in a well-designed INPUT file format encoding.

TXT files seem to me to almost always have some glyphs outside of the 8-bit char set. Unicode text files would at least overcome this limitation. HTML in theory doesn't have this limitation, but in practice, when submitting "acceptable" HTML to PG and running it through their battery of acceptance tools, I find some glyphs I can't get through, so I end up punting and throwing away "correct" glyph information, dumbing down the representation of some glyphs.

PG and DP *in practice* have a dumbed-down concept of punctuation, such that it's impossible to maintain and retain the "original author's intent" as expressed in the printed work. For example, the M-dash is commonly found in three contexts: lead-in, lead-out, and connecting, similar to how ellipses are used in at least three different ways: ...lead in, lead out, and ... connecting. But in practice all one can get through PG and DP is the connecting M-dash. Also consider all the [correct] variety of Unicode quotation marks which needlessly get reduced in PG and DP to only U+0022 or U+0027.
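
To make that last point concrete, here is a minimal sketch in Python -- purely illustrative, not any actual PG or DP tool, and every name in it is my own invention -- of the kind of one-way fold-to-ASCII reduction I am describing:

    # A sketch of the "dumbing down" step: folding typographically correct
    # Unicode punctuation into the small ASCII subset that survives in practice.

    ASCII_FOLD = {
        "\u2018": "'",   # left single quotation mark  -> U+0027
        "\u2019": "'",   # right single quotation mark -> U+0027
        "\u201C": '"',   # left double quotation mark  -> U+0022
        "\u201D": '"',   # right double quotation mark -> U+0022
        "\u2014": "--",  # em-dash (lead-in, lead-out, or connecting) -> "--"
        "\u2026": "...", # ellipsis (any of its three uses) -> "..."
    }

    def dumb_down(text: str) -> str:
        """Collapse distinct punctuation glyphs into a few ASCII stand-ins."""
        for glyph, ascii_form in ASCII_FOLD.items():
            text = text.replace(glyph, ascii_form)
        return text

    original = "\u201CWell\u2014\u201D she began. \u2026and then silence\u2026"
    print(dumb_down(original))   # "Well--" she began. ...and then silence...

    # Going the other way is ambiguous: a bare '"' no longer says whether it
    # opened or closed a quotation, and "--" no longer says which of the three
    # em-dash contexts it stood for.

The forward direction is a dozen lines and trivially automatic; the reverse direction requires a human with the page scans in hand.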
In general PG has a dumbed-down concept of punctuation -- that near enough is good enough -- and is actively hostile to accurately encoding the punctuation as rendered in the original print document. Again, it is EASY to dumb down an INPUT file format, for example if you need to output to a 7-bit or even a 5-bit teletypewriter, if that is what you want. So why insist that the input file encoder get it wrong in the first place? It is easy to throw away information when going from an INPUT file encoding to an OUTPUT file rendering. It is VERY DIFFICULT to correctly fix the introduced errors when going back from a reduced OUTPUT file rendering to a correctly encoded INPUT file encoding.

What I am imagining is some simple-to-use file encoding format where a volunteer can correctly and unambiguously code the common things and conventions one finds in everyday books, such that another volunteer can pick up the book and make another pass on it some years hence -- without having to reinvent or rediscover the work that the previous volunteer has already put into understanding and coding the book. Such an INPUT file encoding would have little or nothing to do with how the output will be displayed in an eventual OUTPUT file rendering.

DP already has much of this distinction in their work flow. Unfortunately, their page-by-page conventions and simplifications -- "dumbing down" for the sake of the multiple levels of volunteers -- guarantee loss of information. Not to mention that they also throw away the correctly encoded, hard-won INPUT file knowledge in favor of more ambiguous OUTPUT file renderings in HTML and TXT.

The end result is that both PG and DP end up being "write once" efforts that are hostile to future improvements by future volunteers -- instead of encouraging on-going efforts to improve what we've got. Which is also indicative of a general culture of quantity over quality. PG pretends that part of why we do what we do is to protect and preserve books in perpetuity. This implies in exchange that information gratuitously thrown away during input file encoding [or directly in an output file rendering] is potentially lost for eternity. Why insist via policy that volunteer input file encoders must throw away this information?
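
To be clear about what I mean by that separation, here is a rough sketch in Python, with entirely hypothetical names and no claim to be a concrete format proposal: the INPUT encoding carries page breaks and line-by-line scan alignment as data, and only the OUTPUT rendering step throws them away:

    # Sketch of an input encoding that retains scan alignment, and an output
    # renderer that is free to discard it. Hypothetical structures, not a spec.

    from dataclasses import dataclass

    @dataclass
    class Line:
        scan_page: int    # page number in the original scan
        scan_line: int    # line number on that scan page
        text: str         # hand-corrected text for that scan line

    @dataclass
    class Chapter:
        title: str
        lines: list[Line]

    def render_plain_text(chapters: list[Chapter]) -> str:
        """OUTPUT rendering: reflow to plain text, silently dropping the
        page/line alignment that the INPUT encoding still retains."""
        out = []
        for ch in chapters:
            out.append(ch.title)
            out.append(" ".join(line.text for line in ch.lines))
        return "\n\n".join(out)

    book = [Chapter("CHAPTER I.", [
        Line(scan_page=7, scan_line=1, text="It was a dark and"),
        Line(scan_page=7, scan_line=2, text="stormy night."),
    ])]

    print(render_plain_text(book))

The rendering threw the alignment away, as any output rendering is welcome to do -- but the input encoding did not, so a future volunteer could still diff the encoded book against a fresh scan, page by page and line by line.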