
well, that was a lot easier than i thought it would be... :+)

i did o.c.r. on half of jon noring's page-scans for "my antonia", using abbyy finereader v7.x. the results were quite excellent. after doing a small number of global corrections to the o.c.r., i checked it against noring's "trustworthy" version of the text. except for the exceptions i'll discuss right after this paragraph, most results are given below; each pair of lines represents a difference found between abbyy and noring, with the last word in each line being the point of difference. the number listed at the start of each line is the word-number in the file, and the words shown are the ones preceding the point-of-difference in the file, so that you can easily pinpoint the correct location.

most of the o.c.r. errors were on _punctuation_, not _letters_. in particular, there were many instances where a _period_ was misrecognized as a comma. i did not bother to list these cases, mostly to avoid clutter. i do not know what caused these errors. i don't know if it's a _typical_ misrecognition that abbyy makes, if jon's manipulation of the images somehow caused confusion, if i set one of the options incorrectly, or what. help, anyone?

i have also not listed differences found in hyphenation, since i don't have the time to write a decent routine to check them. (i just accepted the dehyphenation abbyy did automatically.)

another set of differences not listed here is the _n't_ words. words like "couldn't" and "shouldn't" were set with the _n't_ part distinct from the first part. jon's version retained this. abbyy did not. personally, i find it an unnecessary distraction; the first thing i'd do with such a file is to change it globally. jon probably considers that "tampering with trustworthiness". i think it's common sense recognition of a changed convention. if you prefer jon's way, use his text. if not, you can use mine. (the change is global throughout the book, so it is easy to do.)

i note that jon _did_ close up some "there's" where the "'s" was set off distinctively, so he was a bit inconsistent in this arena. (i didn't check to see if there were other apostrophe-s words that were set apart, because i would've closed 'em up myself.)

i also changed high-bit characters to low, to ease comparison, so those differences are not listed. yes, the book _did_ print "antonia" with a squiggle over the "a". to me, it's unnecessary. (but i'm quite sure _that_ little detail gave jon wet dreams.) ;+) whichever way you like it, it is just one more global change. that's one beauty of plain-text -- it's so easy to manipulate it.

so, now back to the quality of the recognition... almost all of the words were correctly identified. the ones that were not would be flagged by a simple spell-check, with merely 2 stealth scanno exceptions: "cur" for "our" and "oven" for "over". i imagine that these pairs are on the lists of known scannos, and the variants appear just 5 times, total, so it's an easy test to do. (see the sketch after the list of differences below.)

most of the errors were of two types -- periods and quote-marks. it's easy to program routines to check for both of these error-types, even if they aren't flagged in spellcheck -- many of them would be. it's relatively easy to detect sentences, so as to check for periods. and quote-marks are usually nested in pairs, and thus easy to check. but my routines for checking these two items are still back in my prototyping test-app, awaiting migration to the current version; that's why i didn't bother doing o.c.r.
on the second half of the scans; once i've incorporated the routines, i'll refine 'em on the first half, and then do a solid test of them using the text from the second half. it's not surprising to me that my tools would find all the errors here. this is a relatively straightforward text, with very few complications.

total time to do the o.c.r. on this book, once i know what i'm doing? i'd estimate it at about an hour. and for all post-o.c.r. processing? i'd estimate that at about the same. total time for the book -- 2 hours. that's much less time than it took to scan and manipulate the images.

i'm guessing that those 2 hours of o.c.r. and post-o.c.r. work would make the accuracy level about 1 error for every 50-100 pages or so. and those errors would be in the less-serious arena of punctuation. i won't be able to say for sure until i've done the second-half test, of course, but given the highly accurate recognition of the words that i found on this half, i feel rather safe making that prediction.

in this half, of 200+ pages, the only errors that i might have missed -- but found because i had noring's version to compare it to -- were "layout/lay out" and "fairy tale/fairytale". i _might_ have caught "fairytale", because it's not contained in my spellcheck dictionary in its joined variant, and the split variant _is_ in the book (twice). i probably would not have caught "layout", since it's in my dictionary. (but i should take it out of the dictionary for checking older books. old-time typographers _did_ layout, but they didn't _call_ it that.) either way, i'm sure you'll agree that those two errors are trivial. if all the errors in our books were that meaningless, it'd be great. wait, i might have even caught _those_ errors, as they are _right_ in the _project_gutenberg_ e-text, which has been out for years!

well, that wraps up my report. for those who might be curious, i'll be releasing my post-o.c.r. tool in the late spring. look for it!

anyway... i believe this makes it very clear that i am correct when i say that if you do the scanning carefully, manipulate those scans correctly, use abbyy finereader v7.x to do the o.c.r., and subject its results to a good post-o.c.r. program, it is relatively quick and easy to process an o.c.r. text to the state where it can become a high-powered e-book. the notion that these procedures are difficult or time-consuming is just plain wrong. wrong, wrong, wrong. in one word -- _untrue_.

-bowerbird

p.s. although jon's highly accurate version of the text gave us little opportunity to find errors in his work, we _did_ find two. (one is an error in the text, i'd say, but jon did not preserve it.) if michael would like to have another "my antonia" in the library, i'll submit the _entirely_ correct version to project gutenberg, and maybe jon can use it to find the error that eluded his team. :+)

p.p.s. i _did_ just drop a hint. one i can use later to show that i did indeed find the one error that is non-equivocal. as for the other error, which might or might not be an air, i'll sack that one.

-----------------------------------------------------------------

524 a group of people stood huddied
524 a group of people stood huddled

2442 of Jacob whom He loved. SelahP
2442 of Jacob whom He loved. Selah."

4562 grandmother's hand. The oldest son, Ambro2,
4562 grandmother's hand. The oldest son, Ambrož,

5564 up like a hare. "Tatinek, Tatinekl"
5564 up like a hare. "Tatinek, Tatinek!"
6344 grumbled, but realized it was Important
6344 grumbled, but realized it was important

10749 was fixed for me by chance;
10749 was fixed for me by chance,

12887 the familiar road. "They still come? "he
12887 the familiar road. "They still come?" he

13132 they were always unfortunate. When PavePs
13132 they were always unfortunate. When Pavel's

16303 "You not mind my poor mamenka>
16303 "You not mind my poor mamenka,

17531 probably, in some deep Bohemian forest.....
17531 probably, in some deep Bohemian forest...

17718 would be lost ten times oven
17718 would be lost ten times over.

18300 the talking tree of the fairytale;
18300 the talking tree of the fairy tale;

21478 Ambrosch found him." "Krajiek could V
21478 Ambrosch found him." "Krajiek could 'a'

23282 that, too, Jelinek. But we beiieve
23282 that, too, Jelinek. But we believe

25309 and I went into the Shimerdas9
25309 and I went into the Shimerdas'

25594 of his long, shapely hands layout
25594 of his long, shapely hands lay out

26036 which is also Thy mercy seat,"
26036 which is also Thy mercy seat."

26157 While the tempest still is high."
26157 While the tempest still is high."...

27674 milk like what your grandpa s&y.
27674 milk like what your grandpa say.

29075 in a spiteful, crowing -- "Jake-y,
29075 in a spiteful, crowing voice: -- "Jake-y

29418 kept for hot applications when cur
29418 kept for hot applications when our

30016 Shimerda dropped the rope, ran aftet
30016 Shimerda dropped the rope, ran after

36061 misfortune, his wife, "Crazy Mary," iried
36061 misfortune, his wife, "Crazy Mary," tried

36513 fine, making eyes at the men!?.."
36513 fine, making eyes at the men!..."

37226 given him one of Tiny SoderbalPs
37226 given him one of Tiny Soderball's

38713 pump water for the cattle. '"Oh,
38713 pump water for the cattle. "'Oh,
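(and for anyone who wants to try that stealth-scanno test themselves, here's a bare-bones python sketch of the idea -- the pair-list has just the two pairs found here, "book.txt" is a stand-in filename, and this is an illustration only, _not_ my actual tool:)

    import re

    # just the two stealth pairs found in this test;
    # a real check would load a full known-scanno list.
    SCANNO_PAIRS = [("cur", "our"), ("oven", "over")]

    def flag_scannos(text):
        # report every occurrence of either word in a scanno pair,
        # with surrounding context, so a human can eyeball which
        # variant the book actually intended.
        for bad, good in SCANNO_PAIRS:
            for word in (bad, good):
                for m in re.finditer(r"\b%s\b" % word, text, re.IGNORECASE):
                    start = max(0, m.start() - 30)
                    context = text[start:m.end() + 30].replace("\n", " ")
                    yield word, context

    with open("book.txt", encoding="utf-8") as f:
        for word, context in flag_scannos(f.read()):
            print(word, "::", context)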

Bowerbird wrote:
i did o.c.r. on half of jon noring's page-scans for "my antonia", using abbyy finereader v7.x. the results were quite excellent.
Great!
after doing a small number of global corrections to the o.c.r., i checked it against noring's "trustworthy" version of the text.
What type of global corrections were these? One area is how to handle hyphenation, and whether there was a short dash in the compound word in the first place before the typesetter hyphenated the word.
except for the exceptions i'll discuss right after this paragraph, most results are given below; each pair of lines represents a difference found between abbyy and noring, with the last word in each line being the point of difference. the number listed at the start of each line is the word-number in the file, and the words shown are the ones preceding the point-of-difference in the file, so that you can easily pinpoint the correct location.
Great!
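(For the curious, a report in roughly that format can be sketched in a few lines of Python with difflib. This is only my guess at an approach, not bowerbird's actual tool, and the filenames are placeholders:)

    import difflib

    def word_diff_report(file_a, file_b, context=5):
        # split both texts into words and align them word-by-word;
        # at each difference, print one line per version: the 1-based
        # word-number, then a few preceding words ending at the
        # point of difference.
        a = open(file_a, encoding="utf-8").read().split()
        b = open(file_b, encoding="utf-8").read().split()
        for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
            if tag == "equal":
                continue
            print(i1 + 1, " ".join(a[max(0, i1 - context):i2]))
            print(j1 + 1, " ".join(b[max(0, j1 - context):j2]))

    word_diff_report("abbyy_ocr.txt", "noring_master.txt")  # placeholder names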
most of the o.c.r. errors were on _punctuation_, not _letters_. in particular, there were many instances where a _period_ was misrecognized as a comma. i did not bother to list these cases, mostly to avoid clutter. i do not know what caused these errors. i don't know if it's a _typical_ misrecognition that abbyy makes, if jon's manipulation of the images somehow caused confusion, if i set one of the options incorrectly, or what. help, anyone?
I originally scanned the pages at 600 dpi (optical) 24-bit color (which in the future I won't do for b&w works since I determined it is unnecessary overkill.) Then for the online scans they were reduced as follows:

original --> 600 dpi bitonal --> 120 dpi greyscale antialiased

I'm not sure which set of scans you used (you don't have the original since they occupy 5 gigs of space.) Hopefully you used the 600 dpi bitonal, which should OCR the best. Antialiasing actually causes problems (notwithstanding the much lower resolution.)

One thing you could do is to look at the 600 dpi pages at 100% size for which the punctuation was not correctly discerned. You probably will see some errant pixels that fooled the OCR into thinking it was some other punctuation mark than it is.

Regardless, punctuation is a toughie for OCR to exactly get right, from what I understand. 600 dpi *helps* resolve the fine detail of punctuation. 300 dpi is marginal for a lot of punctuation because the characters are so small and don't occupy enough pixels (while letters retain enough pixels to better identify them.)
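(Roughly, that reduction could be reproduced with something like the following Pillow sketch. The threshold, scale factor, and filenames are illustrative guesses, not the exact settings I used:)

    from PIL import Image

    page = Image.open("page_scan.png")   # stand-in for a 600 dpi color scan

    # 600 dpi bitonal: greyscale first, then a fixed threshold
    # down to pure black-and-white.
    grey = page.convert("L")
    bitonal = grey.point(lambda p: 255 if p > 160 else 0).convert("1")
    bitonal.save("page_600dpi_bitonal.png", dpi=(600, 600))

    # 120 dpi greyscale, antialiased: downscale 5x with a smoothing filter.
    w, h = grey.size
    small = grey.resize((w // 5, h // 5), Image.LANCZOS)
    small.save("page_120dpi_grey.png", dpi=(120, 120))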
i have also not listed differences found in hyphenation, since i don't have the time to write a decent routine to check them. (i just accepted the dehyphenation abbyy did automatically.)
Ah, ok (answering my comments at the beginning.) Resolving this usually requires a human being to go over it, especially for Works from the 18th and 19th centuries, where compound words with dashes were much more common than today (e.g., "to-morrow".) Sometimes one has to see what the author did elsewhere in the text, and in a few cases a guess is necessary based on understanding what the author did in similar cases. Some of this can be automated; in other cases it requires a human being to make a final decision. I followed the UNL Cather Edition here.
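(The automatable part looks something like this: when a word is broken across a line, see how the author set the same word elsewhere in the text, and punt to a human when there's no evidence either way. A minimal sketch; the filename is a placeholder:)

    import re
    from collections import Counter

    def resolve_hyphen_break(first, second, word_counts):
        # a line ended with e.g. "to-" and the next began "morrow":
        # prefer whichever form ("tomorrow" vs. "to-morrow") the
        # text itself uses elsewhere; otherwise ask a human.
        joined = first + second
        compound = first + "-" + second
        if word_counts[joined] > word_counts[compound]:
            return joined
        if word_counts[compound] > word_counts[joined]:
            return compound
        return None   # no evidence either way: human decision

    text = open("book.txt", encoding="utf-8").read()   # placeholder name
    counts = Counter(re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower()))
    print(resolve_hyphen_break("to", "morrow", counts))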
another set of differences not listed here is the _n't_ words. words like "couldn't" and "shouldn't" were set with the _n't_ part distinct from the first part. jon's version retained this. abbyy did not. personally, i find it an unnecessary distraction; the first thing i'd do with such a file is to change it globally. jon probably considers that "tampering with trustworthiness". i think it's common sense recognition of a changed convention. if you prefer jon's way, use his text. if not, you can use mine. (the change is global throughout the book, so it is easy to do.)
Whether or not it is an "unnecessary" distraction, it is better to preserve the original text in the master etext version. My thinking is that if someone wants to produce a derivative "modern reader" edition of "My Antonia", they are welcome to do so and add it to the collection, because the original faithful rendition is *already* there. The only requirements I would place (and this applies in general for any Work) are 1) the original textually faithful etext version has already been done and is in the collection, and 2) the types of modernizations done for the modern parallel editions are noted in the texts themselves (such as within an Editor's Introduction.)
i note that jon _did_ close up some "there's" where the "'s" was set off distinctively, so he was a bit inconsistent in this arena. (i didn't check to see if there were other apostrophe-s words that were set apart, because i would've closed 'em up myself.)
I spent some time looking at the " 's " issue last week. In many cases in the original print edition the spacing between the preceding word and the apostrophe s is quite small -- and for the same combination elsewhere was larger -- indicating this was more of a typesetter's convention than something Cather specified. [note]

In addition, the UNL Cather Edition closed up all the apostrophe s (no spaces), but kept the space for many of the " n't " words. So here again I followed the UNL Cather Edition. (Btw, I found quite a few errors in the online UNL Cather Edition of "My Antonia", which have been forwarded to the team overseeing it -- sadly, the professor overseeing the online project passed away a few months ago. We are in touch with other Cather scholars.) But I've put the " 's " issue on my "to look at again" list.

[note] Cather wanted the line length to be fairly short, which puts extra pressure on typesetters, who either have to extend character spacing for a particular line or scrunch it up more than usual, depending upon the situation with the rest of the typesetting on the page, and whether certain words can be hyphenated or not.
i also changed high-bit characters to low, to ease comparison,
You mean accented characters?
so those differences are not listed. yes, the book _did_ print "antonia" with a squiggle over the "a". to me, it's unnecessary.
But that's what is in the original, the "A acute". The squiggly is an 'acute', btw. :^) Accented characters are *always* important to preserve, under all situations. There's no need anymore, in these days of Unicode and the like, to stick with 7-bit ASCII.

I sense that you don't want to properly deal with accented characters since this poses extra problems with OCRing and proofing, something you are trying to avoid in your zeal to get everything to automagically work. To me, that's going too far in simplifying. Preserving accented characters is important.
(but i'm quite sure _that_ little detail gave jon wet dreams.) ;+) whichever way you like it, it is just one more global change. that's one beauty of plain-text -- it's so easy to manipulate it.
Unicode is plain text. Just more characters to play with. :^)

Btw, for those who are interested, here are the "non-Basic Latin" (non-ASCII) alphabetic characters used in "My Antonia":

A acute
AE ligature
ae ligature
e acute
e circumflex
i umlaut
n tilde
small z with caron
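(A list like that is easy to produce mechanically. Here's a short sketch that inventories every non-ASCII character in a text by its Unicode name; the filename is a placeholder:)

    import unicodedata

    text = open("my_antonia.txt", encoding="utf-8").read()  # placeholder
    for ch in sorted(set(c for c in text if ord(c) > 127)):
        print(repr(ch), unicodedata.name(ch, "<unnamed>"))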
almost all of the words were correctly identified. the ones that were not would be flagged by a simple spell-check, with merely 2 stealth scanno exceptions: "cur" for "our" and "oven" for "over". i imagine that these pairs are on the lists of known scannos, and the variants appear just 5 times, total, so it's an easy test to do.
most of the errors were of two types -- periods and quote-marks.
Which makes sense. But these are the toughest to correct sometimes, and punctuation changes can sometimes subtly affect the meaning. They are hopefully caught by human proofers/readers when grammar checkers don't. (I do use Word to help find both spelling and punctuation errors -- when it finds something, I then manually check it against the page scans and the master XML.)
it's easy to program routines to check for both of these error-types, even if they aren't flagged in spellcheck -- many of them would be.
They are "sometimes" easy to spot. Other times the automatic routines will not catch errors (e.g. ":" vs. ";")
it's relatively easy to detect sentences, so as to check for periods.
Usually true, but there are some rare exceptions where an abbreviation can be mistaken for the end of a sentence. Then there's the ellipsis issue, where sometimes an ellipsis is at the end of the sentence and sometimes it is not (and is sometimes used incorrectly.)
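(A sketch of the kind of check being discussed: flag each comma followed by a capitalized word, since in OCR output that often means a period was misread as a comma, and skip a small list of known abbreviations when hunting sentence ends. The abbreviation list here is only a sample, and proper nouns will produce false alarms, so the output is meant for human review:)

    import re

    ABBREVS = {"mr.", "mrs.", "dr.", "st.", "etc.", "e.g.", "i.e."}

    def suspect_commas(text):
        # a comma followed by a capitalized word is often a misread period.
        for m in re.finditer(r"\b\w+, +[A-Z]\w*", text):
            yield m.group(0)

    def sentence_ends(text):
        # naive sentence-end finder: any period-terminated token
        # that is not a known abbreviation.
        for m in re.finditer(r"[\w.]+\.", text):
            if m.group(0).lower() not in ABBREVS:
                yield m.end()

    sample = 'He fell, Then Mr. Shimerda rose.'
    print(list(suspect_commas(sample)))   # -> ['fell, Then']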
and quote-marks are usually nested in pairs, and thus easy to check.
This is also true, but as found in "My Antonia", there are exceptions to pure nesting, such as when a quotation spills over into several paragraphs where the intermediate paragraphs are not terminated by an end quotation mark (whether single or double.) Also, apostrophes are sometimes confused with single right quote marks.

Here's a fictional example (imagine the straight quotes and apostrophe marks being represented in print with the appropriate "curly" marks):

"And Harry told me, 'the voters' confidence in the candidate waned.' To which I replied to Harry, 'I don't believe so.'"

With a smart enough grammar and parser, the above might be properly parsed and the apostrophe correctly differentiated from the single right quote mark. But still, real-world texts tend to throw a lot of curve balls that are sometimes hard to correctly machine process.
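(Here's a sketch of a paragraph-aware balance check for double quotes that allows for exactly that convention -- an unclosed quotation is legal when the next paragraph re-opens with a quote mark. Single quotes are left alone for the apostrophe reason above, and anything flagged is meant for human review:)

    def check_double_quotes(paragraphs):
        # flag paragraphs whose straight double quotes don't pair up,
        # unless the quotation legitimately spills over: the paragraph
        # is left unclosed and the NEXT paragraph re-opens with a quote.
        suspects = []
        for i, para in enumerate(paragraphs):
            if para.count('"') % 2 == 1:
                spills_over = (i + 1 < len(paragraphs)
                               and paragraphs[i + 1].lstrip().startswith('"'))
                if not spills_over:
                    suspects.append(i)
        return suspects

    paras = ['"It goes on into the next paragraph,',
             '"and it ends here." All fine.',
             'But this " one is unbalanced.']
    print(check_double_quotes(paras))   # -> [2]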
but my routines for checking these two items are still back in my prototyping test-app, awaiting migration to the current version; that's why i didn't bother doing o.c.r. on the second half of the scans; once i've incorporated the routines, i'll refine 'em on the first half, and then do a solid test of them using the text from the second half.
Great!
total time to do the o.c.r. on this book, once i know what i'm doing?
OCR is quite fast. It's making and cleaning up the scans that is the human- and CPU-intensive part.
i'd estimate it at about an hour. and for all post-o.c.r. processing? i'd estimate that at about the same. total time for the book -- 2 hours. that's much less time than it took to scan and manipulate the images.
Yes.
p.s. although jon's highly accurate version of the text gave us little opportunity to find errors in his work, we _did_ find two. (one is an error in the text, i'd say, but jon did not preserve it.) if michael would like to have another "my antonia" in the library, i'll submit the _entirely_ correct version to project gutenberg, and maybe jon can use it to find the error that eluded his team. :+)
Well, not all of the pages have been doubly proofed. The team is not finished, and I plan to post a plea somewhere for more eyeballs to go over it. I would like to receive error reports as well for this text, since Brewster wants highly proofed texts for some experiments he plans to run similar to yours. But if I have to use the version you donate to PG, so be it. :^)
p.p.s. i _did_ just drop a hint. one i can use later to show that i did indeed find the one error that is non-equivocal. as for the other error, which might or might not be an air, i'll sack that one.
Oh, a clue. :^)

Anyway, great work!

Jon

(p.s., I did find one error in my text based on the list you gave. Thanks. There should be a comma after the first "Jake-y" in "Jake-y Jake-y". So that's been corrected already in the online and archive version.

I rechecked the PG edition, and they get the comma right in the text, which I oddly missed when doing my "diff" (probably because there were quite a few differences to pore over.) But then they enclose the surrounding sentence within a single quote mark (following the British convention), while the original first edition uses a double quote mark. The PG edition seems to be inconsistent with regards to quotation marks and to British/American spelling, which is why I surmise the PG edition is based on some non-Cather-approved British edition and might have subsequently been selectively and inconsistently edited in trying to "re-Americanize" it. I assume you discovered the several different paragraph breaks in the PG edition?)