
----- Original Message -----
From: Steve Thomas <stephen.thomas@adelaide.edu.au>
A question (possibly better put over on the DP list):
Is it possible to OCR a scan directly to XML? Or is the output from OCR always going to be text?
That is a very DP-related question, but I'll answer here as best I understand the future plans (and let others correct me where needed).

The plan at DP is to move from the current 2-round proofing model to a (probably) 4-round proofing/markup model. The content provider will take the scans and OCR them normally. That part doesn't change, so as I understand it the OCR output is still plain text and the XML gets added afterward.

First, there are 2 rounds of proofing that concentrate on typos, spelling, etc., very similar to the 2 rounds we have now. Then there are 2 MORE rounds of markup. This is where all the markup gets done: poetry, italics/bold, footnotes, chapter headings, thoughtbreaks, etc. When the final result comes out of the 4 rounds, it is (in theory) nicely marked-up XML. The post-processor then does his/her normal magic: combining all the pages, running validators on the result, etc.

As for the OCR process itself, we currently run some pre-processors on the text to fix common scannos, etc. I'd be surprised if those pre-processors didn't improve/change as the XML world emerges at DP.

Josh
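
For anyone unfamiliar with the scanno pre-processing mentioned above, here is a rough sketch of the general idea in Python. It is not DP's actual code: the function name `fix_scannos` and the little `SCANNOS` table are made up for illustration, and a real pre-processor would presumably use a much larger, carefully vetted word list.

```python
import re

# Hypothetical scanno table (an assumption for illustration, not DP's list):
# each key is a common OCR misreading, each value is the intended word.
SCANNOS = {
    "tbe": "the",
    "tlie": "the",
    "aud": "and",
    "wbich": "which",
}


def fix_scannos(text, table=SCANNOS):
    """Replace whole-word scannos in OCR text, preserving a leading capital."""
    # Match only complete words so that longer words containing a scanno
    # as a substring are left alone.
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in table) + r")\b",
        re.IGNORECASE,
    )

    def replace(match):
        word = match.group(0)
        fixed = table[word.lower()]
        # Keep sentence-initial capitalisation ("Tbe" becomes "The").
        return fixed.capitalize() if word[0].isupper() else fixed

    return pattern.sub(replace, text)


if __name__ == "__main__":
    print(fix_scannos("Tbe cat aud tlie dog"))
```

Matching on word boundaries keeps the substitution conservative, which matters because anything the pre-processor changes wrongly becomes extra work for the proofing rounds.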