On 16.09.2009 at 08:12, James Adcock wrote:

As an example of how much author semantic information is lost going from an
author's writing to PG txt format, I went and compared differences between a
recent HTML and PG TXT I did -- where after doing the TXT encoding I went
back and did three more passes over the images to add back in semantic
differences to the HTML that the PG TXT didn't represent.

The problem is that very few systems truly represent semantic content.
In order to truly represent such information, you have to know about it.
This requires additional information, known as "world knowledge", which
is supplied by the reader of the book.

Now the reality would be that it would take say TEI not HTML to represent
all of the author's intent. But measuring the loss going from HTML back to
TXT gives an order of magnitude estimate of how much author information we
are throwing away by representing a work in PG TXT. In the case of this
book, the answer was more than 1000 "losses" -- or an average of about 3
losses per page. And this is NOT counting about an additional 1000 losses in
representation of emphasis.
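
A rough way to reproduce this kind of count is to tally the HTML elements
that have no counterpart in plain text. The sketch below is only an
illustration and not the method used for the book above: the set of tags
treated as "semantic" and the names SEMANTIC_TAGS, SemanticTagCounter and
count_semantic_tags are assumptions made for this example.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed list of tags that carry author intent and vanish in plain text.
# This selection is hypothetical; adjust it for the book at hand.
SEMANTIC_TAGS = {"em", "i", "b", "strong", "cite", "blockquote",
                 "q", "abbr", "sup", "sub"}


class SemanticTagCounter(HTMLParser):
    """Counts opening tags that belong to SEMANTIC_TAGS."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag in SEMANTIC_TAGS:
            self.counts[tag] += 1


def count_semantic_tags(html_text):
    """Return a Counter of semantic tags found in html_text."""
    parser = SemanticTagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


sample = ("<p>He read <cite>Walden</cite> and was "
          "<em>very</em> <em>pleased</em>.</p>")
counts = count_semantic_tags(sample)
print(sum(counts.values()), dict(counts))
```

Dividing the total by the number of pages gives a per-page figure comparable
to the one quoted above, although a tag count can only approximate what a
human pass over the page images would find.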

This problem is a matter of complexity. That is, even in pure vanilla text
one can represent these intentions, but one loses readability. Furthermore,
one has to make assumptions about the true intent of the author!

regards

Keith