
I'm not seeing the value of this RTT thing (though I'll admit I'm not sure what it is - maybe an example would be helpful). As best I can tell, it's a linear representation of the graphical image, provided either by OCR software or possibly by previous work by proofers preparing texts - however completely, accurately, and unambiguously - for PG projects, with the formatting then removed. Yes? No?

My experience is that OCR does a pretty poor job of properly sequencing a text, and that much of a text isn't linear anyway. Page headings and footings are not well isolated from page text. Footnotes and sidenotes are problematic. Illustrations (possibly with captions, attributions, explanatory keys with subcolumns, etc.) are scattered around. Syntactic distinctions are mostly inferred by human readers from layout and formatting, but not detected by OCR software; poetry, correspondence, and mathematics are examples. Even explicitly identified elements like quotations are often enough ambiguous to OCR.

If there is to be a source text which is the canonical starting point for further work, it seems to me it needs to have been treated so that as much implicit syntactic structure as possible has been explicated and disambiguated with some documented form of markup. Which form doesn't matter much, because if it is sufficiently complete and unambiguous it can be converted into any other form. If there is a preparatory process, especially one involving people's time and attention, shouldn't it be spent disambiguating rather than removing many of the implicit clues needed to detect structure?
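To illustrate the conversion point: here's a minimal sketch of why the particular markup form doesn't matter much, so long as it is explicit and documented. The `[fn:N]` / `[fn-text:N ...]` tags below are invented for the example (not any real PG or DP convention); once a footnote is tagged explicitly rather than left implicit in page layout, a few lines of code can render it into HTML, plain text, or anything else.

```python
import re

# Hypothetical markup: footnotes tagged explicitly instead of being
# implied by page position. These tag names are made up for illustration.
marked_up = "The whale[fn:1] surfaced.[fn-text:1 A sperm whale, per ch. 32.]"

def to_html(text):
    # Convert explicit footnote markers into HTML anchors and paragraphs.
    text = re.sub(r"\[fn:(\d+)\]", r'<sup><a href="#fn\1">\1</a></sup>', text)
    text = re.sub(r"\[fn-text:(\d+) ([^\]]*)\]", r'<p id="fn\1">\1. \2</p>', text)
    return text

def to_plain(text):
    # The same unambiguous source converts just as easily to plain text.
    text = re.sub(r"\[fn:(\d+)\]", r"[\1]", text)
    text = re.sub(r"\[fn-text:(\d+) ([^\]]*)\]", r"[\1] \2", text)
    return text
```

The reverse direction is the hard part: recovering `[fn:1]` from an OCR stream where the footnote text has been interleaved with the body is exactly the disambiguation work the preparatory process ought to preserve.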