
well, jon, i'd have thought you could have used "the last word" in that thread a bit more wisely. because i believe that you ain't gonna have a leg to stand on once my results come in... but that won't be until next week, so please enjoy your brief reprieve... :+)

before i get too deep into the o.c.r./correction process for "my antonia", though, i'd like to know how much time you spent, jon, on (a) scanning and (b) image-manipulation. because my general working rule-of-thumb will be that people should spend _less_ time on the post-o.c.r. steps than they did on the scanning and image-manipulation steps. now, until i get all my procedures hitting on all cylinders, that might be a pipe-dream, but that's my rule-of-thumb...

i'd estimate that you spent at least 4 hours on the project, jon. (probably more, since you were still learning the curve, but if you had to repeat the whole thing, you could do it in 4.) that's for the scanning _as_well_as_ the image-manipulation. if i'm badly wrong, in either direction, do please let me know. otherwise, i will give myself a time-limit of 4 hours on this, and we'll see what i can come up with...

and jon, please allow me to say a few nice things to you... ;+)

first of all, you did a bang-up job on the "my antonia" scans. even though the world doesn't really have a place yet for high-resolution scans like these, it's very good to do them. you can always downsample to lower-resolution, if need be. i understand why many places aren't yet doing high-resolution -- like internet archive, distributed proofreaders, and google -- and i absolutely do _not_ fault them for the practical decision. at the same time, though, i applaud people doing high-resolution.

it's not as if what you've done is unprecedented. bennett kobb, for instance, has high-res scans of _nearly_one_hundred_books_, (http://fax.libs.uga.edu) making your single one pale in comparison.
(his kick-ass scanner: http://fax.libs.uga.edu/abovevu/abovevu.html)
but nonetheless, your quality output is rare enough to merit applause.

second, the image-manipulation you did on the scans is first-rate, as far as i can tell from cursory examination. the scans look great! they are straight! and their positioning is standardized very well! (these last two factors are _very_ important in getting good o.c.r.) there is no question in my mind that we'll get good o.c.r. out of 'em.

third, you used a reasonable naming-scheme for your image-files! the scan for page 3, for instance, is named 003.png! fantastic! and when you had a blank page, your image-file says "blank page"! please pardon me for making a big deal out of something so trivial -- and i'm sure some lurkers wrongly think i'm being sarcastic -- but most people have no idea how uncommon this common sense is! when you're working with hundreds of files, it _really_ helps you if you _know_ that 183.png is the image of page 183. immensely. even the people over at distributed proofreaders, in spite of their immense experience, haven't learned this first-grade lesson yet. (well, a few of 'em have, and won't go back to that stupidity, but an amazing number of others will even _argue_ with you about it!)

what this means, for those of you reading along at home, is that when you scan, start scanning at page 1. (and if the text starts on page 3, like "my antonia" did, then start 2 pages before that.) scan the blank pages. if there are picture "plates" in the book or other unnumbered pages, _skip_'em_, so numbers stay in sync; then do them later, at the _end_ of the regular numbered pages. that's also when you'll do the cover, and all of the front-matter. (this includes a foreword, preface, anything with roman numerals.)

fourth, jon, you scanned the headers and footers! again, bravo! some people don't, when they scan, and that is a big mistake. let the post-o.c.r. processing software eliminate them later.
for now, they are worthwhile to keep in your master images; also later, if you view the images as a book, they're a nice touch. they aren't really necessary, in most cases, but why delete 'em?

fifth, your dedication in driving the text to perfection is exemplary. you put together a team of a half-dozen people dedicated to the task, and it shows. while i don't think this approach can scale very well -- your team might well burn itself out after doing a couple books, while page-a-day people at distributed proofreaders go on and on, and an even better approach is to turn readers into proofreaders -- i do think that, as a special effort, what you've done is admirable. drawing attention to the importance of error-free e-texts is great. and setting a positive example, as you've done with your own file, is far superior to the vacuous criticism you make against p.g. files. you've put your time and energy where your mouth is, and i approve.

sixth, i understand that you are motivated by good intentions, and i respect your courage in standing up for them while some people (including myself) are kicking you in the teeth, because we disagree. (and _their_ intentions and motivations are just as good as yours.) in case you haven't noticed, i have the exact same type of fortitude, and whenever i see it in other people, i hold it in very high esteem.

seventh, i can't think of anything else, but i like to have 7 points, rather than 6, and i'm sure i'll think of the other when i hit "send".

anyway, i hope i haven't embarrassed you, saying nice things and all...

***

oh yeah, one more thing, just so nobody else wastes any time:

jon suggested that people with a range of o.c.r. packages could run it on his scans. i do not think that's necessary, not at all. there's a ton of o.c.r. expertise here, all pointing the same way: abbyy finereader v7.x is superior to any other o.c.r. program. combined with proper post-o.c.r.
processing, its recognition gives a level of accuracy that is as good as can be expected. until other o.c.r. programs can deliver to us near-perfection, or results equivalent to abbyy's for free, they waste our time.

***

anyway, off i go. i'll let you know when i have some results... :+)

-bowerbird
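
a footnote on the naming-scheme described above (page 3 lives in 003.png, page 183 in 183.png): the rough sketch below shows one way to check a folder of scans against it. the function name `check_page_files`, the three-digit `.png` pattern, and the `last_page` parameter are all assumptions made for illustration, not anything jon or anyone else in the thread actually uses.

```python
import re

# assumed pattern: exactly three digits plus ".png", as in 003.png
PAGE_NAME = re.compile(r"(\d{3})\.png")


def check_page_files(filenames, last_page):
    """report numbered pages with no scan, and files outside the scheme.

    hypothetical helper: filenames is a list of bare file names,
    last_page is the highest numbered page in the book.
    """
    found = set()
    unexpected = []
    for name in filenames:
        match = PAGE_NAME.fullmatch(name)
        if match:
            found.add(int(match.group(1)))
        else:
            # plates, cover, front-matter, or a misnamed scan
            unexpected.append(name)
    missing = [n for n in range(1, last_page + 1) if n not in found]
    return missing, sorted(unexpected)


if __name__ == "__main__":
    names = ["001.png", "002.png", "003.png", "005.png", "cover.png"]
    missing, unexpected = check_page_files(names, last_page=5)
    print("missing pages:", missing)
    print("files outside the scheme:", unexpected)
```

in practice you would feed it something like `os.listdir()` on the scan folder. files reported as outside the scheme are not necessarily wrong, since plates, the cover, and front-matter are scanned at the end and named separately.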