
As an example of how much of an author's semantic information is lost going from the author's writing to PG TXT format, I compared the differences between a recent HTML edition and the PG TXT edition I did of the same book. After finishing the TXT encoding, I went back and made three more passes over the page images to add back into the HTML the semantic distinctions that the PG TXT could not represent. In reality it would take something like TEI, not HTML, to capture all of the author's intent, but measuring the loss going from HTML back to TXT gives an order-of-magnitude estimate of how much author information we throw away by representing a work in PG TXT. For this book the answer was more than 1,000 "losses" -- an average of about 3 per page -- and that is NOT counting roughly an additional 1,000 losses in the representation of emphasis.

Now suppose all we have is the PG TXT, and some future volunteer wants to go back from that TXT and represent the text as correctly as possible in, say, PDF. How many "errors" does that volunteer need to find -- places where the TXT file loses the author's semantic information -- by carefully comparing the page images to the PG TXT file and reintroducing information that was known to the original transcribers but discarded as not representable in PG TXT? The answer is that this volunteer has to find and fix the text in literally about 2,000 places. Want to place a bet on how many of those 2,000 places the volunteer trying to create an accurate PDF file will actually catch? I can tell you that my own effort going from PG TXT to HTML in the first place took the better part of a week -- and I don't mean to imply that *I* caught them all either.
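
For what it's worth, you could get a crude ballpark of this kind of count automatically by tallying the markup in the HTML that has no real PG TXT equivalent. Below is a rough sketch in Python -- the tag list is just my guess at what tends to get lost, the "book.html" file name is made up, and counting one "loss" per element is obviously a simplification -- so treat the total it prints as an order-of-magnitude figure and nothing more.

    # Rough sketch: count HTML elements whose semantics a plain PG TXT
    # file cannot carry. The tag list below is an assumption, not a
    # complete inventory of what actually gets lost.
    from html.parser import HTMLParser
    from collections import Counter

    # Tags whose meaning mostly vanishes in plain text (assumed list)
    LOSSY_TAGS = {
        "i", "em", "b", "strong",        # emphasis (often approximated with _underscores_)
        "blockquote", "cite", "q",       # quotations
        "sup", "sub", "small",           # typographic distinctions
        "span", "div",                   # usually carry class-based semantics
        "table", "h1", "h2", "h3", "h4", # structure flattened in TXT
    }

    class LossCounter(HTMLParser):
        def __init__(self):
            super().__init__()
            self.counts = Counter()

        def handle_starttag(self, tag, attrs):
            # Tally every opening tag we consider "lossy" in plain text
            if tag in LOSSY_TAGS:
                self.counts[tag] += 1

    def count_losses(html_path):
        """Return a Counter of potentially lossy markup in an HTML file."""
        parser = LossCounter()
        with open(html_path, encoding="utf-8") as f:
            parser.feed(f.read())
        return parser.counts

    if __name__ == "__main__":
        counts = count_losses("book.html")   # hypothetical file name
        for tag, n in counts.most_common():
            print(f"<{tag}>: {n}")
        print("total:", sum(counts.values()))

A count like that can't tell you what the right fix is at each spot, but it would at least confirm whether we are talking about hundreds of places or thousands.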