
As an example of how much of an author's semantic information is lost going from the author's writing to PG TXT format, I compared the differences between a recent HTML edition and the PG TXT edition I did of the same book. After finishing the TXT encoding, I went back and made three more passes over the page images to add back into the HTML the semantic distinctions that the PG TXT could not represent. In reality it would take something like TEI, not HTML, to capture all of the author's intent, but measuring the loss going from HTML back to TXT gives an order-of-magnitude estimate of how much author information we throw away by representing a work in PG TXT. For this book the answer was more than 1,000 "losses" -- an average of about 3 per page -- and that is NOT counting roughly an additional 1,000 losses in the representation of emphasis.

Now suppose all we have is the PG TXT, and some future volunteer wants to go back from that TXT and represent the text as correctly as possible in, say, PDF. How many "errors" does that volunteer need to find -- places where the TXT file loses the author's semantic information -- by carefully comparing the page images to the PG TXT file and reintroducing information that was known to the original transcribers but discarded as not representable in PG TXT? The answer is that this volunteer has to find and fix the text in literally about 2,000 places. Want to place a bet on how many of those 2,000 places the volunteer trying to create an accurate PDF file will actually catch? I can tell you that my own effort going from PG TXT to HTML in the first place took the better part of a week -- and I don't mean to imply that *I* caught them all either.
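
For what it's worth, you could get a crude ballpark of this kind of count automatically by tallying the markup in the HTML that has no real PG TXT equivalent. Below is a rough sketch in Python -- the tag list is just my guess at what tends to get lost, the "book.html" file name is made up, and counting one "loss" per element is obviously a simplification -- so treat the total it prints as an order-of-magnitude figure and nothing more.

    # Rough sketch: count HTML elements whose semantics a plain PG TXT
    # file cannot carry. The tag list below is an assumption, not a
    # complete inventory of what actually gets lost.
    from html.parser import HTMLParser
    from collections import Counter

    # Tags whose meaning mostly vanishes in plain text (assumed list)
    LOSSY_TAGS = {
        "i", "em", "b", "strong",        # emphasis (often approximated with _underscores_)
        "blockquote", "cite", "q",       # quotations
        "sup", "sub", "small",           # typographic distinctions
        "span", "div",                   # usually carry class-based semantics
        "table", "h1", "h2", "h3", "h4", # structure flattened in TXT
    }

    class LossCounter(HTMLParser):
        def __init__(self):
            super().__init__()
            self.counts = Counter()

        def handle_starttag(self, tag, attrs):
            # Tally every opening tag we consider "lossy" in plain text
            if tag in LOSSY_TAGS:
                self.counts[tag] += 1

    def count_losses(html_path):
        """Return a Counter of potentially lossy markup in an HTML file."""
        parser = LossCounter()
        with open(html_path, encoding="utf-8") as f:
            parser.feed(f.read())
        return parser.counts

    if __name__ == "__main__":
        counts = count_losses("book.html")   # hypothetical file name
        for tag, n in counts.most_common():
            print(f"<{tag}>: {n}")
        print("total:", sum(counts.values()))

A count like that can't tell you what the right fix is at each spot, but it would at least confirm whether we are talking about hundreds of places or thousands.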