Re: [gutvol-d] Re: barriers to XML posting

Steve Thomas writes:
Is it possible to OCR a scan directly to XML? Or is the output from OCR always going to be text?
We don't usually scan to text; we scan to RTF, and guiprep extracts some of the markup and converts it to lightly marked-up text. guiprep could certainly convert the RTF to XML if we wanted, but DP plans to separate the markup and proofing rounds.

D. Starner wrote:
Steve Thomas writes:
Is it possible to OCR a scan directly to XML? Or is the output from OCR always going to be text?
We don't usually scan to text; we scan to RTF, and guiprep extracts some of the markup and converts it to lightly marked-up text. guiprep could certainly convert the RTF to XML if we wanted, but DP plans to separate the markup and proofing rounds.
It is certainly possible to OCR directly to "XML", but it won't be very useful XML. Unless breakthroughs in AI give us machines with human intelligence, it is nigh impossible to train an OCR program to unambiguously recognize and mark up the *structure* and *semantics* of documents and textual content (for example, using the TEI vocabulary designed for this purpose). Thus, there must be substantial human interaction to determine what any chunk of text represents, structurally and semantically.

Of course, if the goal is simply to "clone" the original printed text's visual presentation, then forget the above. But the resulting cloned text is then a lot less useful for repurposing, for accessibility, and for other advanced purposes.

Jon Noring
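To make the presentational-versus-semantic distinction concrete, here is a minimal Python sketch. It is not guiprep's or DP's actual code: the `ocr_runs` data, both function names, and the proofer's labels are hypothetical, made up for illustration. The element names (`hi`, `name`, `emph`) are borrowed from TEI. The first step is something a program can do mechanically; the second only works because a human supplies the meaning of each italic run.

```python
import xml.etree.ElementTree as ET

# Hypothetical OCR output: text runs tagged only with what the engine "saw".
ocr_runs = [
    ("The ship ", None),
    ("Beagle", "italic"),
    (" sailed; it was ", None),
    ("very", "italic"),
    (" cold.", None),
]

def runs_to_presentational_xml(runs):
    """Wrap each styled run in <hi rend="...">; no meaning is inferred."""
    p = ET.Element("p")
    last = None  # most recently added child element, if any
    for text, style in runs:
        if style is None:
            # Plain text goes before the first child or after the last one.
            if last is None:
                p.text = (p.text or "") + text
            else:
                last.tail = (last.tail or "") + text
        else:
            last = ET.SubElement(p, "hi", rend=style)
            last.text = text
    return p

def apply_human_decisions(p, decisions):
    """Retag each <hi> using (tag, attributes) pairs chosen by a proofer."""
    for hi, (tag, attrs) in zip(p.findall("hi"), decisions):
        hi.tag = tag
        hi.attrib.clear()
        hi.attrib.update(attrs)
    return p

p = runs_to_presentational_xml(ocr_runs)
presentational = ET.tostring(p, encoding="unicode")

# Assumed human judgement: the first italic run is a ship's name,
# the second is emphasis. The OCR engine cannot know this.
decisions = [("name", {"type": "ship"}), ("emph", {})]
semantic = ET.tostring(apply_human_decisions(p, decisions), encoding="unicode")

print(presentational)
print(semantic)
```

Both italic runs look identical to the OCR engine, so the first result tags them identically; only the second, human-informed pass can tell a ship's name from emphasis.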
participants (2)
- D. Starner
- Jon Noring