
noring said:
It is certainly possible to OCR directly to "XML", but it won't be very useful XML.
this is an important point here, folks, pay attention. although lots and lots of the time, the x.m.l. advocates talk about x.m.l. as if it were uniformly high-quality -- which it needs to be to spit out all those conversions, or magically transform into a different breed of x.m.l., two qualities that are often discussed as "automatic" -- the fact of the matter is that x.m.l. markup can be awful, and serve no real purpose that is of use to us... you should look at the x.m.l. that some apps churn out. so even after a decision is made as to which brand of t.e.i. to use (such as tei-lite), the real hard decisions about which markup strategy to adopt still exist. and after those are answered, the even more difficult decisions about how to actually implement the strategy will rear their ugly heads. it's gonna be a long road, folks.
It is nigh impossible to train an OCR program, unless we get breakthroughs with AI so we can build machines with human intelligence, to unambiguously recognize and mark up the *structure* and *semantics* of documents and textual content (such as using the TEI vocabulary designed for this purpose).
if you're waiting for "artificial intelligence" to come through, you're gonna be waiting for a really long time.
Thus, there must be substantial human interaction to determine what any chunk of text represents (structurally/semantically).
that is the common understanding. it is also wrong.

it might (or might not) be true of the _semantic_ nature of a "chunk". (but we can put that matter aside, because _that_ issue is gonna be difficult enough even when you have _humans_ work on it directly.)

but it is _definitely_ not true for the _structural_ role of a chunk. in any book that was prepared by a professional typographer, _presentation_ *is* _structure_, because that is _exactly_ what a good typographer does, uses _presentation_ to show _structure_. that's why humans don't have trouble figuring out a book's structure.

i don't blame people for telling you this. they don't know any better. but the mere fact that they don't know any better does _not_ mean that you have to believe what they tell you. because they are wrong. when they tell you the only way you can have that information is to have humans encode it in a complex markup system, don't believe it. they are wrong, and their mistake will waste _tons_ of your labor. their emperor is naked, and you must tell them they need to go away.

you can get that information, easily. it's right in front of your eyes. of course, when they willy-nilly flatten the o.c.r. results of a book to plain text, they throw away most of that valuable information. but _even_then_, it's possible to ascertain most all of its structure.

for an obvious example, people recognize headers because they are big and bold. strip away fontsize and styling, it gets more difficult. nonetheless, if you're smart, you can still locate headers accurately. you can even write computer routines that will do it for you. fast. i know, because i've written 'em. other people could write 'em too.

i repeat: in a well-laid-out book, presentation _is_ structure. and that is the message i have been communicating here for a year. but nobody here seemed to want to believe it. your advance notice period has expired now, so i will go and tell the rest of the world...

-bowerbird
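
to make the header claim concrete, here is a minimal sketch in python of the kind of routine the post describes. it is not the routine the post refers to. the rule it uses is an assumption made up for illustration: in flattened plain text, a header is taken to be a short line with at least two blank lines above it (or the top of the file), a blank line below it, and no sentence-style punctuation at its end. the name `find_headers` and the `max_len` cutoff are hypothetical.

```python
def find_headers(text, max_len=60):
    """Return (line_number, line) pairs that look like headers.

    Assumed heuristic, for illustration only: whitespace around a
    line stands in for the font size and bolding that plain text lost.
    """
    lines = text.split("\n")
    headers = []
    for i, line in enumerate(lines):
        stripped = line.strip()
        # headers are non-empty and short
        if not stripped or len(stripped) > max_len:
            continue
        # count the blank lines directly above this line
        blanks_above = 0
        j = i - 1
        while j >= 0 and not lines[j].strip():
            blanks_above += 1
            j -= 1
        at_top = j < 0
        # the line below must be blank (or the text must end here)
        blank_below = i + 1 >= len(lines) or not lines[i + 1].strip()
        # body text usually ends in punctuation; headers usually do not
        ends_like_sentence = stripped[-1] in ".,;:!?\"'"
        if (at_top or blanks_above >= 2) and blank_below and not ends_like_sentence:
            headers.append((i + 1, stripped))
    return headers


sample = (
    "CHAPTER I\n\n"
    "It was a dark and stormy night.\n"
    "The rain fell in torrents.\n\n\n"
    "CHAPTER II\n\n"
    "The next morning was clear.\n"
)
print(find_headers(sample))
```

a real book would need more cues than this (indentation, centering, all-caps runs, chapter numbering), but the point stands that the layout left behind in the text is what the routine reads.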