presentation *is* structure (it's right in front of your eyes)

noring said:
It is certainly possible to OCR directly to "XML", but it won't be very useful XML.
this is an important point here, folks, pay attention. although lots and lots of the time, the x.m.l. advocates talk about x.m.l. as if it were uniformly high-quality -- which it needs to be to spit out all those conversions, or magically transform into a different breed of x.m.l., two qualities that are often discussed as "automatic" -- the fact of the matter is that x.m.l. markup can be awful, and serve no real purpose that is of use to us... you should look at the x.m.l. that some apps churn out. so even after a decision is made as to which brand of t.e.i. to use (such as tei-lite), the real hard decisions about how to implement a markup strategy still exist. and after those are answered, the even more difficult decisions about how to actually implement the strategy will rear their ugly heads. it's gonna be a long road, folks.
It is nigh impossible to train an OCR program to unambiguously recognize and mark up the *structure* and *semantics* of documents and textual content (such as by using the TEI vocabulary designed for this purpose), unless we get breakthroughs with AI so we can build machines with human intelligence.
if you're waiting for "artificial intelligence" to come through, you're gonna be waiting for a really long time.
Thus, there must be substantial human interaction to determine what any chunk of text represents (structurally/semantically).
that is the common understanding. it is also wrong.

it might (or might not) be true of the _semantic_ nature of a "chunk". (but we can put that matter aside, because _that_ issue is gonna be difficult enough even when you have _humans_ work on it directly.) but it is _definitely_ not true for the _structural_ role of a chunk.

in any book that was prepared by a professional typographer, _presentation_ *is* _structure_, because that is _exactly_ what a good typographer does, uses _presentation_ to show _structure_. that's why humans don't have trouble figuring out a book's structure.

i don't blame people for telling you this. they don't know any better. but the mere fact that they don't know any better does _not_ mean that you have to believe what they tell you. because they are wrong. when they tell you the only way you can have that information is to have humans encode it in a complex markup system, don't believe it. they are wrong, and their mistake will waste _tons_ of your labor. their emperor is naked, and you must tell them they need to go away.

you can get that information, easily. it's right in front of your eyes. of course, when they willy-nilly flatten the o.c.r. results of a book to plain text, they throw away most of that valuable information. but _even_then_, it's possible to ascertain most all of its structure.

for an obvious example, people recognize headers because they are big and bold. strip away fontsize and styling, it gets more difficult. nonetheless, if you're smart, you can still locate headers accurately. you can even write computer routines that will do it for you. fast. i know, because i've written 'em. other people could write 'em too.

i repeat: in a well-laid-out book, presentation _is_ structure. and that is the message i have been communicating here for a year. but nobody here seemed to want to believe it. your advance notice period has expired now, so i will go and tell the rest of the world...

-bowerbird
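[To give a concrete flavor of the kind of routine bowerbird describes, here is a toy sketch in Python that flags probable headers in flat OCR text. The heuristics, thresholds, and sample text are invented for illustration; this is not his actual code.]

import re

def find_headers(lines, max_words=8):
    """Flag lines that look like headers in flat OCR text.
    Toy heuristics: short line, blank lines on both sides, and either
    all-caps or a chapter-style keyword/numeral at the start."""
    headers = []
    for i, line in enumerate(lines):
        text = line.strip()
        if not text or len(text.split()) > max_words:
            continue
        prev_blank = (i == 0) or not lines[i - 1].strip()
        next_blank = (i + 1 == len(lines)) or not lines[i + 1].strip()
        numbered = bool(re.match(r"(chapter|part|book)\b|[IVXLC]+\.?$|\d+\.?$",
                                 text, re.IGNORECASE))
        shouty = text.isupper()
        if prev_blank and next_blank and (numbered or shouty):
            headers.append((i, text))
    return headers

sample = ["", "CHAPTER I", "", "It was the best of times,",
          "it was the worst of times."]
print(find_headers(sample))   # -> [(1, 'CHAPTER I')]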

Bowerbird wrote:
Jon Noring said:
It is certainly possible to OCR directly to "XML", but it won't be very useful XML.
although lots and lots of the time, the x.m.l. advocates talk about x.m.l. as if it were uniformly high-quality -- which it needs to be to spit out all those conversions, or magically transform into a different breed of x.m.l., two qualities that are often discussed as "automatic" -- the fact of the matter is that x.m.l. markup can be awful, and serve no real purpose that is of use to us...
Definitely. The key is to use the right markup vocabulary and apply it consistently. Any system representing document structure (such as ZML) must be "right and sufficient" and be applied consistently.

Those who understand XML know that XML is not in and of itself a specific markup vocabulary; it is a rule-set, or framework, for how to apply markup to textual content. There are infinitely many possible markup vocabularies, and a good markup vocabulary depends upon the purpose of the markup. XML is used for both database and publishing applications, and there are many extraordinarily successful applications of XML. One of the most recent applications of XML which a lot of people recognize and use is RSS, used for blog feeds and the like. XHTML is used a lot on the Internet, and is no more complex (in fact it is simpler in many ways) than legacy HTML.

ZML is an example of a "regularized plain text" system which represents certain important textual document structures in a way that is fully machine-readable. I could easily create an XML-based markup vocabulary clone of the ZML system to represent the very same structures.
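[To illustrate the point that a "regularized plain text" convention has a direct XML analog, here is a toy sketch in Python. The blank-line convention and the element names are invented for illustration; they are not the actual ZML rules, nor any TEI vocabulary.]

import xml.etree.ElementTree as ET

# Made-up convention (not the real ZML rules): the first block is the
# chapter heading, later blocks are paragraphs, and blocks are
# separated by a blank line.
flat = """CHAPTER ONE

It was a dark and stormy night.

The rain fell in torrents."""

blocks = [b.strip() for b in flat.split("\n\n") if b.strip()]

chapter = ET.Element("chapter")                    # invented element names
ET.SubElement(chapter, "heading").text = blocks[0]
for para in blocks[1:]:
    ET.SubElement(chapter, "p").text = para

print(ET.tostring(chapter, encoding="unicode"))
# <chapter><heading>CHAPTER ONE</heading><p>It was a dark and stormy
# night.</p><p>The rain fell in torrents.</p></chapter>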
Thus, there must be substantial human interaction to determine what any chunk of text represents (structurally/semantically).
in any book that was prepared by a professional typographer, _presentation_ *is* _structure_, because that is _exactly_ what a good typographer does, uses _presentation_ to show _structure_. that's why humans don't have trouble figuring out a book's structure.
Definitely. But what we require is to be able to machine-read and machine-process the structure and semantics of a textual document. Even if humans can figure this out at a glance from a high-typographic-quality presentation, that does not automatically mean it is easy for machines to do likewise. It is also not easy to codify, because visual presentation is "fuzzy" (pun not intended), sometimes relying on surrounding context to precisely define the document structure.

We have to remember that there is a lot of variance in the conventions (both historical and geographical) used in typographic layouts to visually represent structure and semantics. Not only that, in some cases typographers don't even follow convention, especially when there are oddities in the content for which no convention has been firmly established. And as previously noted, sometimes the context must be factored in to fully ascertain structure and semantics.

The "Gedanken" test I use for the minimum requirements of machine-readable markup (or a system such as ZML) for textual documents is whether a text-to-speech engine is potentially capable of communicating the structure and semantics of the content to a blind listener (who is unfamiliar with any print conventions -- they've never heard the terms 'italic' or 'bold') so they can, in real time (i.e., a one-time linear audio presentation), gain the same level of comprehension as a sighted person (familiar with typographic conventions) would in reading a high-quality print version of the text. Pass this test, and the markup will likely be pretty good for just about any purpose in addition to accessibility.

Is ZML or another type of "regularized plain text" (or the XML-based ZML markup vocabulary analog) sufficient to pass this test?
when they tell you the only way you can have that information is to have humans encode it in a complex markup system, don't believe it.
The system only needs to be as complicated as necessary to represent the required document structures and content semantics in a machine-readable way such that it passes the test described above. The $64,000 question, therefore, is what structure and semantics need to be represented in a machine-readable way, and to what degree of precision. Maybe ZML (and its markup analog) is sufficient, maybe it isn't.

My reading of those here who have first-hand experience handling large numbers of the various types of texts in Project Gutenberg is that ZML (or any other type of "regularized plain text" system) does not have sufficient granularity to pass the "test." Of course, we can argue whether the test as I describe it above is too strict, or maybe not even on-target. But keep in mind this is what the *accessibility community* wants in machine-readable textual documents, and what they are working towards in their activities -- they've wholeheartedly embraced XML-based approaches, for example. To wave one's hand in dismissal and say they are being unrealistic or stupid, or that they don't really matter in our decision-making, is a pretty bigoted and "blind" position (pun intended) to take -- it is also foolish, since meeting their needs for structure and semantics has many other benefits as well.

I might ask a few text-to-speech experts I know at DAISY to look at the ZML system and tell me whether it has sufficient structural granularity for high-quality text-to-speech purposes. As far as I am concerned, if they come back and say "no, it doesn't", then I would recommend that PG not consider ZML for its Master format, but perhaps consider ZML for its plain text output versions.
for an obvious example, people recognize headers because they are big and bold. strip away fontsize and styling, it gets more difficult. nonetheless, if you're smart, you can still locate headers accurately. you can even write computer routines that will do it for you. fast. i know, because i've written 'em. other people could write 'em too.
Bold lines which appear by themselves in the flow of text are sometimes used for structures other than headers. There are many similar oddities involving italicized text, indented text, etc., in the visual layouts of texts. Context is often needed to unambiguously discern the structure behind a visual cue. For example, one common convention is that the names of ships are italicized. Thus, if a machine is to distinguish the name of a ship from linguistically emphasized text, it has to look at the context.
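[A toy illustration, in Python, of this point about context: deciding whether an italicized span (marked here with asterisks, as in plain text) is a ship name or ordinary emphasis by looking at the words around it. The cue-word list is invented for illustration; a real classifier would need far richer context.]

import re

SHIP_CUES = {"hms", "uss", "ship", "frigate", "sailed", "aboard", "sank"}

def classify_italics(sentence):
    """Guess whether each *italicized* span is a ship name or emphasis,
    using the words around it as context. Toy heuristic only."""
    results = []
    for m in re.finditer(r"\*(.+?)\*", sentence):
        window = sentence[max(0, m.start() - 40):m.end() + 40].lower()
        kind = "ship-name" if SHIP_CUES & set(window.split()) else "emphasis"
        results.append((m.group(1), kind))
    return results

print(classify_italics("They sailed aboard the *Titanic* that spring."))
# -> [('Titanic', 'ship-name')]
print(classify_italics("I *really* did not expect that."))
# -> [('really', 'emphasis')]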
i repeat: in a well-laid-out book, presentation _is_ structure.
No, I'd say it is more accurate to say "for reading by eyesight, structure is represented by visual presentation cues." Remember, there are different types of presentation of text, not only visual. To focus on visual as the only form of presentation that matters is being very short-sighted (pun intended). The only time we must give up and focus only on the visual is when visual presentation is an important and integral part of the content itself, such as "poetry as art" and similar avant-garde things. (Here SVG is of particular appeal, so we have an XML-based solution for this as well.)
and that is the message i have been communicating here for a year. but nobody here seemed to want to believe it. your advance notice period has expired now, so i will go and tell the rest of the world...
And I've stated that the core question to answer is: "Is ZML (or any other system of regularized plain text) sufficient to represent document structure and semantics for Project Gutenberg Master texts?" I assume Bowerbird is saying "yes", and many others here are saying "no". I answer the question with a "no".

Amusingly, Networker, a very insightful ebook expert who often posts to The eBook Community, calls ZML a type of ITF, "Impoverished Text Format", to indicate ZML has insufficient granularity -- it is "impoverished".

Jon Noring