Re: [gutvol-d] Heebee Jeebees on Gutenberg

jon noring said:
abbyy7 + your-sooper-dooper-tools = no-more-need-for-DP
well, let's do the equation right, ok?

    opticbook3600 (most other scanners will start you off wrong)
  + careful scanning (note that this is _oh_ so very important)
  + image correction (deskewing and zoning regularization)
  + abbyy v7 (and using the old-book version whenever needed)
  + super-duper-tools, used wisely, for about 4 hours/book
  = an error rate of 1 error every 10 pages, good enough for
  + continuous proofreading (with scans available for viewing)
  + freshly-informed-and-motivated end-users looking for errors
  + a comprehensive error-detection system
  + a comprehensive error-reporting system
  + a comprehensive error-correcting system
  + a comprehensive system designed to foster community
  = a steady march toward absolute perfection in the e-texts

you can sprinkle in as much or as little d.p. as you want. they are the _cooks_, and not an ingredient in the recipe. and they aren't _required_ for a meal, but they sure help! a ton of dedicated people with expertise and experience! responsible for _half_ of p.g. by the time it hits #20,000! without doubt, _the_ dynamic force in digitization today! more uplifting than google, and million-books, and kahle! (well, kahle is a large factor in their success, but still.)
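Read as a pipeline, the recipe above chains capture, image correction, OCR, and post-processing, with proofreading downstream. A minimal sketch in Python, where every function is a hypothetical stand-in for the tool named in the recipe (none of this is bowerbird's actual code):

    from typing import Callable, List

    Stage = Callable[[str], str]

    def deskew(page: str) -> str:
        # stand-in for image correction; here it just trims whitespace
        return page.strip()

    def regularize_zones(page: str) -> str:
        # stand-in for zoning regularization
        return page

    def ocr(page: str) -> str:
        # stand-in for abbyy v7
        return page

    def digitize(pages: List[str]) -> List[str]:
        # run every page through each stage, in order
        stages: List[Stage] = [deskew, regularize_zones, ocr]
        out = pages
        for stage in stages:
            out = [stage(p) for p in out]
        return out  # then on to continuous proofreading and error reporting

    print(digitize(["  CHAPTER I.  ", "It was a dark and stormy night."]))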
but it is *equally* important to understand, and mark up, the *structures* associated with all portions of the content. This is something that *has* to be done by human beings. It cannot be done automagically, at least with anything near acceptable accuracy (as shown by a few trivial examples I posted here a while back.)
blah blah blah. your examples are bull. you didn't send them here, you sent 'em to ockerbloom's list, and he would not allow my response to go through, since it destroyed your positions so thoroughly. (it was _complete_; he took it as _mean_.) repost your examples if you have the courage.

given the normal degree of consideration for consistent formatting, my apps can determine the structure of a text in a matter of seconds, even on my "legacy" mac, in less time than it takes the acrobat splash screen to come up... :+)

-bowerbird
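A toy version of that claim: when a book's formatting is consistent, structure can be recovered with simple pattern rules. The regex below encodes one hypothetical heading convention (a sketch, not bowerbird's actual rules):

    import re

    # one common plain-text convention: "CHAPTER IV." on a line by itself
    HEADING = re.compile(r'^(CHAPTER|BOOK|PART)\s+[IVXLC\d]+\.?\s*$')

    def label_lines(text: str):
        # tag each line as heading, blank, or body text
        for line in text.splitlines():
            if HEADING.match(line.strip()):
                yield ('heading', line)
            elif not line.strip():
                yield ('blank', line)
            else:
                yield ('body', line)

    sample = "CHAPTER I.\n\nIt was a dark and stormy night."
    print(list(label_lines(sample)))
    # [('heading', 'CHAPTER I.'), ('blank', ''),
    #  ('body', 'It was a dark and stormy night.')]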

Bowerbird wrote:
jon noring said:
but it is *equally* important to understand, and mark up, the *structures* associated with all portions of the content. This is something that *has* to be done by human beings. It cannot be done automagically, at least with anything near acceptable accuracy (as shown by a few trivial examples I posted here a while back.)
blah blah blah. your examples are bull. you didn't send them here, you sent 'em to ockerbloom's list, and he would not allow my response to go through, since it destroyed your positions so thoroughly. (it was _complete_; he took it as _mean_.) repost your examples if you have the courage.
Yes, you are correct. I thought I had sent the examples to this list. Courage? I'll append the whole message below! It is interesting that John Mark Ockerbloom disallowed your reply to it. So you could not argue rationally with what I wrote? Fascinating. I assume you will simply repost your reply. I look forward to seeing it, though I'm not sure the others here want to, knowing that Ockerbloom disallowed it on his Book People list. He's a pretty tough moderator, but I've known him to be fair.
given the normal degree of consideration for consistent formatting, my apps can determine the structure of a text in a matter of seconds, even on my "legacy" mac, in less time than it takes the acrobat splash screen to come up... :+)
<laugh type="rotfl"/>

Jon Noring

***************************************************************************

(O.k., here's what I posted to the Book People list on 27-Dec-2004. Of course, I hope that others here besides Bowerbird will rationally reply to the examples and general argument, whether it be criticism, support, or both.)

Bowerbird wrote:
Jon Noring said:
Anyway, no application can discern, with any high degree of accuracy, the underlying structure of texts -- only human eyeballs can do that.
this is even more silly. i've said over and over again that typography is the translation of structure into presentation.
True, but how it's been done in the real world is hit or miss, and requires the reader to also discern the context of the words surrounding the structure (example later.) And just because one can do structure --> typography does not mean the reverse process, typography --> structure, is just as easy (to the point where it can be done by machine), because oftentimes the same typographic construct is used to represent different structures and semantics.

For an example of this "multiple uses of the same typography", the venerable "italics" is a good one. Italics are used for:

    linguistic emphasis
    literal emphasis (not the same as linguistic emphasis!)
    names of ships
    titles of certain types of books and documents
    a word used as a word
    foreign phrases
    sometimes headers
    etc., etc., etc.

Here we map all kinds of different textual structures/semantics onto one output typographic construct -- italics -- since we know the reader will be able to untangle it all by understanding the textual context where the italics appear (the textual context being the actual meaning of the flow of words).

(This is an example only, since some would say to just duplicate the italics and let humans figure it out in the final product -- which those in the accessibility community will rightfully disagree with -- but it illustrates how the same typography is used to represent different structures/semantics, which only a contextual reading can discern -- see the example below.)
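The asymmetry can be put in miniature: structure --> typography is a function, but typography --> structure is not, because several structures share one rendering. A small illustration (the labels are hypothetical):

    # mapping semantics onto typography loses information
    SEMANTICS_TO_TYPOGRAPHY = {
        'linguistic emphasis': 'italic',
        'ship name':           'italic',
        'book title':          'italic',
        'word as word':        'italic',
        'foreign phrase':      'italic',
    }

    def invert(mapping):
        # group semantics by the typography they render to
        inv = {}
        for semantic, typography in mapping.items():
            inv.setdefault(typography, []).append(semantic)
        return inv

    print(invert(SEMANTICS_TO_TYPOGRAPHY)['italic'])
    # five candidate structures for one cue -- only context can pick one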
in a nutshell, that is the very _job_ that a typographer does! if you closely study the presentation of a well-prepared book, via its typography, you can discern its structure very accurately.
True (about the typographer), but again, some of the discerning of structure has to do with context. In addition, there is no one standard that persists across decades and centuries, and from country to country; even today there is a lot of experimentation with new ways to "signal" structure to the reader. I've seen some *really* odd stuff done in the last few years (such as chapter headers which run vertically within the left margin! -- do they represent a header, or a sidebar of some sort?). In most cases the reader, by knowing the textual context surrounding the typographic layout, can infer what the underlying document structure is. I've seen some pretty oddball stuff done which no program in the world is yet able to discern, but humans can figure it out in a few moments after reading a part of the text.

Here's one very small example to illustrate what I'm saying:

(Exhibit 1 -- I made this up, inspired by Sherlock Holmes)

**********************************************************************

... she walked up to the door, and on the door was a small sign with a message in stark, bold black letters which read:

NO SOLICITORS OR SALES PEOPLE

Ignoring the sign as if it wasn't there, she knocked on the door, intending to make the sale...

**********************************************************************

(Exhibit 2 -- adapted from an Encyclopaedia Britannica article)

**********************************************************************

...Government weakness allowed the mutiny to spread; and although order was eventually restored in Istanbul and more quickly elsewhere, a force from Macedonia (the Action Army) led by Mahmud Sevket Pasa marched on Istanbul and occupied the city (April 24).

DISSOLUTION OF THE EMPIRE

Abdulhamid was deposed and replaced by Sultan Mehmed V (ruled 1909-18), son of Abdulmecid. The constitution was amended to transfer real power to the Parliament...

**********************************************************************

In both exhibits we have a phrase all in capitals, centered between what look like two paragraphs. Without reading and understanding the context of the capitalized phrase and the surrounding text, what are they in a document-structural sense? Are they title headings to a new chapter or section? If not, what are they?

For Exhibit 1 it is obvious, by reading and understanding the context, that the capitalized line is not a header to a new section -- it is actually a snippet of text acting almost like a facsimile image of the sign on the door. (I notice this structural construct used a lot in the Sherlock Holmes works, as an example.) In Exhibit 2, the phrase in all capitals is truly a header to a new section of the article, like a new chapter in a book.

What if these exhibits were written in some unknown language using a strange non-Latin character set? Could anyone who doesn't understand the language know what the capitalized lines represent? They could be anything, really, when one does not know the meaning of what is being said...

Now, how do you write a program to accurately distinguish a header to a new chapter/section from some other sort of construct? And how do you do so not only in English, written left-to-right, but also in Arabic or Hebrew (Yiddish), written right-to-left, or traditional Han, written vertically? (One wonders whether those languages and scripts even use the same typographic conventions we use in the West to differentiate document structures? I doubt it.)
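To make the problem concrete in code: a layout-only rule -- "a short all-caps line set off between paragraphs is a section header" -- accepts both exhibits, even though only Exhibit 2 is a header. A hypothetical sketch of such a rule:

    def looks_like_header(line: str) -> bool:
        # layout-only test: non-empty, all capitals, no closing period
        s = line.strip()
        return bool(s) and s == s.upper() and not s.endswith('.')

    exhibit_1 = "NO SOLICITORS OR SALES PEOPLE"   # really a sign in the story
    exhibit_2 = "DISSOLUTION OF THE EMPIRE"       # really a section header

    print(looks_like_header(exhibit_1))  # True -- a false positive
    print(looks_like_header(exhibit_2))  # True -- correct
    # the rule cannot tell them apart; only reading the context can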
The all-caps construct above is only one of hundreds of different kinds of typographic constructs used for multiple purposes to represent document structure, where it is assumed a human being can tell what they mean from the textual context (the italicized-text example given previously illustrates this nicely.) So, in order for an automated system to truly understand the structure of documents, it has to understand the underlying meaning of the text -- it has to become a sentient reader. There is NO computer program yet with the required level of artificial intelligence to do this. That's not to say there won't be in the future, but billions of dollars have been pumped into AI research since the 1950s, and we are not much closer to true AI than we were 30 years ago. (Some believe the human brain is a quantum computer, and that quantum computing is a necessary prerequisite to true intelligence and sentience -- we'll see...)

One can certainly continue to write and refine a computer program (to "train" it, so to speak) to discern the structure of text documents from OCR with higher and higher accuracy. And over time the code gets to be unbelievably complex, to handle the thousands of "exceptions". But it will still take a human being to peruse the results and correct the places where the program got it wrong, just as OCR doesn't always get the characters right and so requires a human being to go through and proof the text for scanning errors.

I'm not sure what Bowerbird is trying to argue, but I assume he wants a system where perfectly structured (and repurposeable) books can be obtained today by some fancy push-button OCR program not requiring any human being to tweak the output at all, so we can get zillions of high-quality books and documents online by simply pressing a button. I'd love to see this too, but I don't believe it will be possible for several decades. Books are written, and typeset, by people for people using imperfect systems, and until programs reach the level of intelligence and sentience of human beings, we will not be able to bypass the final human proofing/hand-structuring stage and still get very high quality results -- all we can do is reduce the amount of time needed for human proofing and hand-structuring, by improving OCR programs and the post-processing programs that try to determine overall structure.

*****

The other point I want to make is that I really don't think it matters how good or how bad the Internet Archive's or Google's OCR package is, because the OCRing can be done at *any time*, by *anyone*, using *any kind* of OCR package, so long as the scans are available (which IA will make available.) There may be only one chance to scan a certain book (it is quite an effort to scan a book, and it may not be available for rescanning), but once the scans are made and put online, they can be duplicated and mirrored all around the world. At any time in the future someone (like Bowerbird) can grab any of these scans and OCR them using *their* nifty OCR engine with "sooper-dooper-intelligence".

To me the question is not the quality of Brewster's OCR package -- I don't really give a rip, to be honest, because the scans can be re-OCR'd at any time -- the real question is the quality of the scans themselves. Will they be of sufficient resolution, contrast, cleanness, and linearity (no funny curvature) to make it easier to get high-quality OCR results?
The complaints over OCR are a non-issue, and why Bowerbird even cares about the quality of Brewster's OCRing is beyond me. Ten years from now, if Bowerbird is correct, we'll have wonderful programs which will take the page scans and spit out perfectly structured, 100% proofed digital texts. No human being will be needed to tweak the final result. As noted above, *I'd like to see this*, but I'm not holding my breath that we will see it in 10 years, or even 50. But let's get the scans done right, and ready for that time when all we have to do is push a button and we get perfect rice every time. In the meanwhile, we'll have a nice pool of source material to use for human-driven proofing endeavors such as Distributed Proofreaders.

Jon Noring

FWIW, I am not envisioning coming up with anything that can replace human eyeballs. As has been mentioned by others (probably Jon, but I'm too lazy to go back and find out), I'm looking for tools to help those eyeballs work better. Geoff

opticbook3600 (most other scanners will start you off wrong)
After soliciting reviews of it on the DP forums and getting several positive ones, this is my mixed review of the opticbook.

(1) Scanning into the gutter: The opticbook's main selling point is the position of its glass. At the very edge of the device, you can open a book only 90 degrees and flatten half of it against the glass, in principle eliminating gutter shadow. In practice, though, it doesn't catch the outer half-centimeter of the glass anyway. While there are books you can scan with the opticbook that you couldn't scan with an ordinary flatbed, there are still books with gutters too narrow even for the opticbook.

(2) Book handling: With an ordinary flatbed, the book is pretty much held in place by its own weight, although you can get better results by applying some pressure from above. And you only have to reposition the book for every other page. With the opticbook, you have to turn the book every page, and, for the half of the book where the heavy side is hanging off the end, you have to hold the book in position more physically than in the usual flatbed configuration. While I have never destroyed a book with normal flatbed scanning, the second book I scanned with the opticbook did not survive the process. (To be fair, it wouldn't have survived the process with an ordinary flatbed either, but with an ordinary flatbed, I would not have even tried ...)

(3) Speed: This is where the opticbook really shines. Even doing only one page at a time, I have reached scanning speeds of about 300 octavo pages per hour. And if you have a book where you can use the opticbook like an ordinary flatbed, scanning two pages at a time, well, I'll leave the numbers as an exercise to the reader.

If you have the money and want another tool in your scanner arsenal, it's not a waste. But it's not a magic bullet, either. The most demanding scanning project I've done to date, the large-format illustrations in Robert Hooke's Micrographia, required using scans from both my opticbook and my HP scanjet 4600 (the thin, transparent series), followed by about 12 hours of image manipulation, to get the results I got. Neither scanner alone could have delivered those images. -- RS
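(Working that exercise under one stated assumption -- that repositioning the book, not the scan itself, dominates the cycle time -- 300 pages/hour at one page per placement is 300 placements/hour, so two pages per placement would approach 2 x 300 = 600 pages per hour.)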
participants (4)

- Bowerbird@aol.com
- Geoff Horton
- Jon Noring
- Robert Shimmin