
I'm not seeing the value of this RTT thing (though I'll admit I'm not sure what it is - maybe an example would be helpful). As best I can tell, it's a linear representation of the graphical image, provided either by OCR software or possibly by previous work by proofers preparing texts - however completely, accurately, and unambiguously - for PG projects, with the formatting then removed. Yes? No?

My experience is that OCR does a pretty poor job of properly sequencing a text, and that much of a text isn't linear anyway. Page headings and footings are not well isolated from page text. Footnotes and sidenotes are problematic. Illustrations (possibly with captions, attributions, explanatory keys with subcolumns, etc.) are scattered around. Syntactic distinctions are mostly inferred by human readers from layout and formatting, but not detected by OCR software; poetry, correspondence, and mathematics are examples. Even explicitly identified elements like quotations are often enough ambiguous to OCR.

If there is to be a source text which is the canonical starting point for further work, it seems to me it needs to have been treated so that as much implicit syntactic structure as possible has been explicated and disambiguated with some documented form of markup. Which form doesn't matter much, because if it is sufficiently complete and unambiguous it can be converted into any other form. If there is a preparatory process, especially one involving people's time and attention, shouldn't it be spent disambiguating rather than removing many of the implicit clues needed to detect structure?
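To illustrate the conversion point: here's a minimal sketch of why the particular markup form doesn't matter much, so long as it is explicit and documented. The `[fn:N]` / `[fn-text:N ...]` tags below are invented for the example (not any real PG or DP convention); once a footnote is tagged explicitly rather than left implicit in page layout, a few lines of code can render it into HTML, plain text, or anything else.

```python
import re

# Hypothetical markup: footnotes tagged explicitly instead of being
# implied by page position. These tag names are made up for illustration.
marked_up = "The whale[fn:1] surfaced.[fn-text:1 A sperm whale, per ch. 32.]"

def to_html(text):
    # Convert explicit footnote markers into HTML anchors and paragraphs.
    text = re.sub(r"\[fn:(\d+)\]", r'<sup><a href="#fn\1">\1</a></sup>', text)
    text = re.sub(r"\[fn-text:(\d+) ([^\]]*)\]", r'<p id="fn\1">\1. \2</p>', text)
    return text

def to_plain(text):
    # The same unambiguous source converts just as easily to plain text.
    text = re.sub(r"\[fn:(\d+)\]", r"[\1]", text)
    text = re.sub(r"\[fn-text:(\d+) ([^\]]*)\]", r"[\1] \2", text)
    return text
```

The reverse direction is the hard part: recovering `[fn:1]` from an OCR stream where the footnote text has been interleaved with the body is exactly the disambiguation work the preparatory process ought to preserve.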