
On 2012-09-24, don kretz wrote:
I'm not seeing the value of this RTT thing (but I'll admit I'm not sure what it is - maybe an example would be helpful.)
The RTT is more or less DP's P3 output without the clothing of eol hyphens and dashes, so that a line in the RTT always corresponds directly to a line on the page. Rather than:

  [OCR]--p1-->[P1]--p2-->[P2]--p3-->[P3]

the proposal is to do:

  [OCR]--p1-->[P1]--p2-->[P2]--Diff against extant PG text-->[RTT]

possibly doing p1 and p2 repeats to improve accuracy. If you follow a workflow where you do formatting separately, you will likely at some stage in the process have something akin to the RTT.

The problem the RTT seeks to solve is this: given that we all have our own ideas about workflows and will defend those ideas to the death, what is the latest usable snapshot common to the vast majority of these workflows? What is the latest point that I can pick off from your workflow that will allow me to continue with my workflow, and vice versa?

For a subset of the workflows that can be based on the RTT, there will be other usable snapshots further on. If someone is carrying out such a workflow, there is nothing stopping these from being captured as well; other compatible workflows can start from that snapshot instead. There is also nothing precluding deriving from a derivative if that works for you. The RTT is just a low-level foundation for all these things that is there if you want to use it.

In general, the default answer to how any given thing is encoded into the RTT is "like DP does it", since that is what people know, and DP or a DP-like organisation will be needed to produce them. The exception is the eol clothing, which actively destroys data and line correspondence, and that exception is only possible because LOTE already uses it.

There are all sorts of things that can also be done in a "post-process to RTT" sort of way, such as converting to curly quotes and translating from "--" to "—", but, as you correctly point out, there is no point getting to that level of detail until we at least sort out how to select the right master scans.
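To make the "post-process to RTT" idea a little more concrete, here is a rough Python sketch of the two conversions mentioned above. It is only an illustration: the function names are made up for this example, the quote rule is deliberately naive (a straight quote at the start of a line or after whitespace or an opening bracket is treated as opening, anything else as closing), and it is not how DP or any existing tool does it. The one property it tries to show is that every change happens within a line, so the line-to-page correspondence of the RTT survives.

```python
import re

EM_DASH = "\u2014"
OPEN_QUOTE = "\u201c"
CLOSE_QUOTE = "\u201d"


def postprocess_line(line: str) -> str:
    """Convert one RTT line (hypothetical helper, not an existing tool)."""
    # Translate the ASCII double hyphen into a real em dash.
    line = line.replace("--", EM_DASH)
    # Naive assumption: a straight quote at line start, or after whitespace
    # or an opening bracket, is an opening quote.
    line = re.sub(r'(^|[\s(\[])"', r"\1" + OPEN_QUOTE, line)
    # Any straight double quote left over is treated as a closing quote.
    return line.replace('"', CLOSE_QUOTE)


def postprocess_rtt(text: str) -> str:
    """Apply the conversions line by line so the line count never changes."""
    lines = text.split("\n")
    converted = [postprocess_line(line) for line in lines]
    # Line correspondence with the page is the whole point of the RTT.
    assert len(converted) == len(lines)
    return "\n".join(converted)


if __name__ == "__main__":
    sample = '"Well--" he said,\n"no."'
    print(postprocess_rtt(sample))
```

A real converter would need far better quote handling (nested quotes, apostrophes, quotations continued across paragraphs), which is part of why that level of detail can wait.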
If there is to be a source text which is the canonical starting point for further work, it seems to me it needs to have been treated so as much implicit syntactical identification as possible has been explicated and disambiguated with some documented form of markup - which form doesn't matter much, because if it is sufficiently complete and unambiguous it can be converted into any other form.
This canonical starting point (CSP -- I love TLAs) is something like F2 output, although I'm sure you would come up with something a lot less random. I suggest Bowerbird wouldn't like it: he would suggest you might as well just use ZML. Marcello would probably point out that RST would be the solution. Jeroan would point to TEI's superior gamut. If you can all agree on a CSP format then that's brilliant: we can have an MS, an RTT and a CSP. But I don't want to get involved in the flame war (*cough* LaTeX).

Cheers
Jon