
Hi there,

On 16.09.2009 at 08:12, James Adcock wrote:

As an example of how much author semantic information is lost going from an author's writing to PG txt format, I went and compared differences between a recent HTML and PG TXT I did -- where after doing the TXT encoding I went back and did three more passes over the images to add back in semantic differences to the HTML that the PG TXT didn't represent.

The problem is that there are very few systems that truly represent semantic content. In order to truly represent such information you have to know about it. This requires additional information, which is known as "world knowledge". This information is provided by the reader of books.
Now the reality would be that it would take, say, TEI, not HTML, to represent all of the author's intent. But measuring the loss going from HTML back to TXT gives an order-of-magnitude estimate of how much author information we are throwing away by representing a work in PG TXT. In the case of this book, the answer was more than 1000 "losses" -- or an average of about 3 losses per page. And this is NOT counting about an additional 1000 losses in representation of emphasis.
This problem is a matter of complexity. That is, even in pure vanilla text one can represent these intentions, but one loses readability. Furthermore, one has to make assumptions about the true intent of the author!!

regards
Keith