Re: [gutvol-d] Re: barriers to XML posting

22 Oct 2004

      ----- Original Message -----
From: Steve Thomas <stephen.thomas@adelaide.edu.au>
...
A question (possibly better put over on the DP list):
Is it possible to OCR a scan directly to XML? Or is the output 
from OCR always going to be text?

That is a very DP-related question, but I'll answer here as best I understand the future plans (and let others correct me where needed).

The plan at DP is to move from the current 2-round proofing model to a (probably) 4-round proofing/markup model.

The content provider will take the scans and OCR them normally.  That part doesn't change.

Then there are 2 rounds of proofing that concentrate on typos, spelling, etc., very similar to the 2 rounds we have now.

Then there are 2 MORE rounds, this time for markup.  This is where all the markup (poetry, italics/bold, footnotes, chapter headings, thoughtbreaks, etc.) is done.

Then, when the final result comes out of the 4 rounds, it is (in theory) nicely marked-up XML.  The post-processor does his/her normal magic: combining all the pages, running validators on the result, etc.
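
To make the post-processing step concrete, here is a rough sketch of combining per-page fragments into one document and checking that the result is well-formed.  It is only an illustration under stated assumptions: the element names (book, page, p) and the function combine_pages are made up for the example and are not DP's actual markup or tooling, and a well-formedness check is weaker than validating against a real schema.

```python
import xml.etree.ElementTree as ET


def combine_pages(page_fragments, root_tag="book"):
    """Wrap per-page XML fragments in one root element.

    Hypothetical helper: the tag names are assumptions for illustration.
    Raises ValueError naming the first page that is not well-formed.
    """
    root = ET.Element(root_tag)
    for number, fragment in enumerate(page_fragments, start=1):
        # Parse each page separately so a bad page can be reported by number.
        try:
            page = ET.fromstring(f"<page n='{number}'>{fragment}</page>")
        except ET.ParseError as err:
            raise ValueError(f"page {number} is not well-formed: {err}") from err
        root.append(page)
    # encoding="unicode" returns a str rather than bytes.
    return ET.tostring(root, encoding="unicode")
```

Calling combine_pages(["<p>a</p>", "<p>b</p>"]) returns a single string with both pages under one root, while a fragment with an unclosed tag is rejected with the page number in the error.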

As for the OCR process, we currently run some pre-processors on the text to fix common scannos, etc.  I'd be surprised if those pre-processors didn't improve/change as the XML world emerges at DP.
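
For anyone unfamiliar with the idea, a scanno pre-processor can be as simple as a whole-word replacement table.  The sketch below is a minimal example, not the actual DP pre-processor: the table entries and the name fix_scannos are assumptions chosen for illustration, and the table only lists misreadings that are not real words, since blindly replacing a real word would do more harm than good.

```python
import re

# Hypothetical scanno table: OCR misreading -> intended word.
SCANNOS = {
    "tbe": "the",
    "tlie": "the",
    "wbich": "which",
}

# Match any listed scanno as a whole word only (case-sensitive).
_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, SCANNOS)) + r")\b")


def fix_scannos(text):
    """Replace each whole-word scanno in text with its correction."""
    return _PATTERN.sub(lambda match: SCANNOS[match.group(1)], text)
```

Because the match is whole-word and case-sensitive, a capitalized "Tbe" or a longer word that merely contains "tbe" is left alone for the human proofers.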

Josh

Joshua Hutchinson