
well, jon, i'd have thought you could have used "the last word" in that thread a bit more wisely. because i believe that you ain't gonna have a leg to stand on once my results come in... but that won't be until next week, so please enjoy your brief reprieve... :+)

before i get too deep into the o.c.r./correction process for "my antonia", though, i'd like to know how much time you spent, jon, on (a) scanning and (b) image-manipulation. because my general working rule-of-thumb will be that people should spend _less_ time on the post-o.c.r. steps than they did on the scanning and image-manipulation steps. now, until i get all my procedures hitting on all cylinders, that might be a pipe-dream, but that's my rule-of-thumb...

i'd estimate that you spent at least 4 hours on the project, jon. (probably more, since you were still on the learning curve, but if you had to repeat the whole thing, you could do it in 4.) that's for the scanning _as_well_as_ the image-manipulation. if i'm badly wrong, in either direction, do please let me know. otherwise, i will give myself a time-limit of 4 hours on this, and we'll see what i can come up with...

and jon, please allow me to say a few nice things to you... ;+)

first of all, you did a bang-up job on the "my antonia" scans. even though the world doesn't really have a place yet for high-resolution scans like these, it's very good to do them. you can always downsample to lower-resolution, if need be. i understand why many places aren't yet doing high-resolution -- like internet archive, distributed proofreaders, and google -- and i absolutely do _not_ fault them for the practical decision. at the same time, though, i applaud people doing high-resolution. it's not as if what you've done is unprecedented. bennett kobb, for instance, has high-res scans of _nearly_one_hundred_books_ (http://fax.libs.uga.edu), making your single one pale in comparison.
(his kick-ass scanner: http://fax.libs.uga.edu/abovevu/abovevu.html) but nonetheless, your quality output is rare enough to merit applause.

second, the image-manipulation you did on the scans is first-rate, as far as i can tell from cursory examination. the scans look great! they are straight! and their positioning is standardized very well! (these last two factors are _very_ important in getting good o.c.r.) there is no question in my mind that we'll get good o.c.r. out of 'em.

third, you used a reasonable naming-scheme for your image-files! the scan for page 3, for instance, is named 003.png! fantastic! and when you had a blank page, your image-file says "blank page"! please pardon me for making a big deal out of something so trivial -- and i'm sure some lurkers wrongly think i'm being sarcastic -- but most people have no idea how uncommon this common sense is! when you're working with hundreds of files, it _really_ helps you if you _know_ that 183.png is the image of page 183. immensely. even the people over at distributed proofreaders, in spite of their immense experience, haven't learned this first-grade lesson yet. (well, a few of 'em have, and won't go back to that stupidity, but an amazing number of others will even _argue_ with you about it!)

what this means, for those of you reading along at home, is that when you scan, start scanning at page 1. (and if the text starts on page 3, like "my antonia" did, then start 2 pages before that.) scan the blank pages. if there are picture "plates" in the book or other unnumbered pages, _skip_'em_, so numbers stay in sync; then do them later, at the _end_ of the regular numbered pages. that's also when you'll do the cover, and all of the front-matter. (this includes a foreword, preface, anything with roman numerals.)

fourth, jon, you scanned the headers and footers! again, bravo! some people don't, when they scan, and that is a big mistake. let the post-o.c.r. processing software eliminate them later.
for now, they are worthwhile to keep in your master images; also later, if you view the images as a book, they're a nice touch. they aren't really necessary, in most cases, but why delete 'em?

fifth, your dedication in driving the text to perfection is exemplary. you put together a team of a half-dozen people dedicated to the task, and it shows. while i don't think this approach can scale very well -- your team might well burn itself out after doing a couple books, while page-a-day people at distributed proofreaders go on and on, and an even better approach is to turn readers into proofreaders -- i do think that, as a special effort, what you've done is admirable. drawing attention to the importance of error-free e-texts is great. and setting a positive example, as you've done with your own file, is far superior to the vacuous criticism you make against p.g. files. you've put your time and energy where your mouth is, and i approve.

sixth, i understand that you are motivated by good intentions, and i respect your courage in standing up for them while some people (including myself) are kicking you in the teeth, because we disagree. (and _their_ intentions and motivations are just as good as yours.) in case you haven't noticed, i have the exact same type of fortitude, and whenever i see it in other people, i hold it in very high esteem.

seventh, i can't think of anything else, but i like to have 7 points, rather than 6, and i'm sure i'll think of the other when i hit "send".

anyway, i hope i haven't embarrassed you, saying nice things and all...

***

oh yeah, one more thing, just so nobody else wastes any time: jon suggested that people with a range of o.c.r. packages could run it on his scans. i do not think that's necessary, not at all. there's a ton of o.c.r. expertise here, all pointing the same way: abbyy finereader v7.x is superior to any other o.c.r. program. combined with proper post-o.c.r. processing, its recognition gives a level of accuracy that is as good as can be expected. until other o.c.r. programs can deliver to us near-perfection, or results equivalent to abbyy's for free, they waste our time.

***

anyway, off i go. i'll let you know when i have some results... :+)

-bowerbird

Bowerbird wrote:
> well, jon, i'd have thought you could have used "the last word" in that thread a bit more wisely.
laugh.
> because i believe that you ain't gonna have a leg to stand on once my results come in...
Well, I hope you get an error rate of one per ten pages for the "My Antonia" scans. And even if you do, I still believe a DP-like process is necessary to catch errors that OCR can't handle, and for someone to properly assemble the pages, structure the document, etc., after the OCRing/proofing is complete. I don't quite put the same level of faith in OCR as you seem to.

Btw, I believe as you do that an error reporting system is a good idea, so readers may submit errors they find in the texts they use -- sort of an ongoing post-DP proofing process. Obviously, it is necessary to make available the page scans of the source document to aid in this process. How can an error be properly verified and corrected when the source work is not available?
> i'd estimate that you spent at least 4 hours on the project, jon. (probably more, since you were still learning the curve, but if you had to repeat the whole thing, you could do it in 4.) that's for the scanning _as_well_as_ the image-manipulation. if i'm badly wrong, in either direction, do please let me know. otherwise, i will give myself a time-limit of 4 hours on this, and we'll see what i can come up with...
Scanning took quite a while (much more than four hours) since all I have at the moment is a flatbed scanner (an el cheapo and slow Microtek ScanMaker X6EL, to be exact), so I had to hand-place each page on the flatbed. Of course, 600 dpi optical resolution increases the per-page scanning time (4 times as many pixels to capture, which slows everything down). It would have gone a *lot faster* had I used a high-quality sheet-fed scanner, since I took apart the book to free the pages so as to get high-quality, flat scans. Someday...
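The "4 times as many pixels" point is easy to sketch in a few lines of arithmetic. The page dimensions below are a hypothetical 5.5 x 8.5 inch trim size, not measured from the actual book:

```python
# rough scan-size arithmetic, assuming a hypothetical
# 5.5 x 8.5 inch page (not the book's real trim size)

def pixels(width_in, height_in, dpi):
    """Total pixels captured for a page scanned at a given dpi."""
    return round(width_in * dpi) * round(height_in * dpi)

page = (5.5, 8.5)
at_300 = pixels(*page, 300)   # 1650 x 2550 = 4,207,500 pixels
at_600 = pixels(*page, 600)   # 3300 x 5100 = 16,830,000 pixels

# doubling the dpi doubles BOTH dimensions, so the data quadruples
print(at_600 / at_300)        # 4.0
```

Since capture time and file size both scale with pixel count, that factor of four is roughly what a 600 dpi pass costs over a 300 dpi one.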
> first of all, you did a bang-up job on the "my antonia" scans.
Thanks!
> even though the world doesn't really have a place yet for high-resolution scans like these, it's very good to do them. you can always downsample to lower-resolution, if need be.
Exactly. It is my vision for Distributed Scanners that it should achieve at least this quality.
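As a minimal sketch of what "downsample to lower-resolution" means, here is a 2:1 box-filter average over a grayscale pixel grid, the idea behind resampling a 600 dpi master down to 300 dpi. A real workflow would use an imaging library rather than hand-rolled loops; this is only to show the concept:

```python
def downsample_2x(pixels):
    """Halve resolution by averaging each 2x2 block of grayscale
    values -- the same idea as resampling 600 dpi to 300 dpi.
    Assumes even dimensions."""
    out = []
    for y in range(0, len(pixels), 2):
        row = []
        for x in range(0, len(pixels[y]), 2):
            block = (pixels[y][x] + pixels[y][x + 1] +
                     pixels[y + 1][x] + pixels[y + 1][x + 1])
            row.append(block // 4)
        out.append(row)
    return out

hi_res = [[0, 0, 255, 255],
          [0, 0, 255, 255],
          [255, 255, 0, 0],
          [255, 255, 0, 0]]
print(downsample_2x(hi_res))   # [[0, 255], [255, 0]]
```

The key point stands either way: a high-resolution master can always be reduced later, but a low-resolution scan can never be upsampled back to real detail.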
> i understand why many places aren't yet doing high-resolution -- like internet archive, distributed proofreaders, and google -- and i absolutely do _not_ fault them for the practical decision. at the same time, though, i applaud people doing high-resolution. it's not as if what you've done is unprecedented. bennett kobb, for instance, has high-res scans of _nearly_one_hundred_books_, (http://fax.libs.uga.edu) making your single one pale in comparison. (his kick-ass scanner: http://fax.libs.uga.edu/abovevu/abovevu.html) but nonetheless, your quality output is rare enough to merit applause.
Funny that I forgot about the UGA work. Quite an interesting and eclectic list of mostly 19th century works. Will need to contact Bennett one of these days.
> third, you used a reasonable naming-scheme for your image-files! the scan for page 3, for instance, is named 003.png! fantastic! and when you had a blank page, your image-file says "blank page"! please pardon me for making a big deal out of something so trivial -- and i'm sure some lurkers wrongly think i'm being sarcastic -- but most people have no idea how uncommon this common sense is!...
Yes, I deemed it important for processing purposes that the name of the image contain semantic information about what it represents, and that naming be consistent for file-sorting purposes. As an aside, it is interesting that in my copy of "My Antonia", which is a first edition, the Introduction starts on page 3. There are no pages 1 and 2 -- at all. I carefully took the book apart (cutting the sewing) before scanning and proved by this process (plus referring to other info) that pages 1 and 2 never existed. The publisher simply chose to start at page 3. Was this common? (Hmmm, I probably need to take a trip to Utah University's library to check their first edition copy of My Antonia to make sure that there wasn't an inserted page, maybe of an illustration -- but the UNL online Cather edition shows nothing. Maybe there was an intent to insert a page there which, after typesetting, was decided against.)
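The value of zero-padded, consistent names is easy to demonstrate: plain string sorting puts unpadded page numbers out of order, while padded names like 003.png sort exactly in page order. (The `page_name` helper below is purely illustrative, not part of anyone's actual tooling.)

```python
# why zero-padding matters: string sort vs. page order

unpadded = ["2.png", "10.png", "100.png", "3.png"]
padded   = ["002.png", "010.png", "100.png", "003.png"]

print(sorted(unpadded))  # ['10.png', '100.png', '2.png', '3.png']  -- wrong order
print(sorted(padded))    # ['002.png', '003.png', '010.png', '100.png']  -- page order

def page_name(page_number, width=3):
    """Hypothetical helper: 3 -> '003.png', 183 -> '183.png'."""
    return f"{page_number:0{width}d}.png"

print(page_name(3))      # 003.png
```

With padded names, a directory listing is the book, in order, with no lookup table needed to map file 183 to page 183.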
> fourth, jon, you scanned the headers and footers! again, bravo! some people don't, when they scan, and that is a big mistake. let the post-o.c.r. processing software eliminate them later. for now, they are worthwhile to keep in your master images; also later, if you view the images as a book, they're a nice touch. they aren't really necessary, in most cases, but why delete 'em?
It was my intent to reproduce each page for direct reading purposes -- that is, if somebody wanted to read the book as it was printed, then they could. I attempted *archival scanning*, not *scanning only for OCR*. That OCR benefits from archival quality scanning, though, is obvious.
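As a toy illustration of the "let post-o.c.r. software eliminate them later" step, here is a sketch that strips a running-header line from one page's OCR text. The header pattern is an assumption about how this book's running heads might come out of OCR, not a description of any real package's behavior:

```python
import re

# toy post-OCR cleanup: drop a running header of the form
# "14   MY ANTONIA" (or "MY ANTONIA   14") from the top of a page
HEADER = re.compile(r"^\s*(\d+\s+MY ANTONIA|MY ANTONIA\s+\d+)\s*$")

def strip_header(page_text):
    """Remove the first line if it matches the running-header pattern."""
    lines = page_text.splitlines()
    if lines and HEADER.match(lines[0]):
        lines = lines[1:]
    return "\n".join(lines)

page = ("14   MY ANTONIA\n"
        "I first heard of Antonia on what seemed\n"
        "to me an interminable journey.")
print(strip_header(page))
```

Because the headers survive in the archival scans, a pass like this can always be re-run or refined later; nothing is lost by scanning them.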
> fifth, your dedication in driving the text to perfection is exemplary. you put together a team of a half-dozen people dedicated to the task, and it shows. while i don't think this approach can scale very well -- your team might well burn itself out after doing a couple books, while page-a-day people at distributed proofreaders go on and on, and an even better approach is to turn readers into proofreaders --
It was not my intent to proof the way we did -- I still believe in the DP approach for proofing. But we had to get something out the door for demo purposes and did not have the time to submit it to the DP process. Maybe we should have. Hindsight is 20-20.

And thanks for the rest of your comments.

Jon
participants (2)
- Bowerbird@aol.com
- Jon Noring