
Bowerbird wrote:
i did o.c.r. on half of jon noring's page-scans for "my antonia", using abbyy finereader v7.x. the results were quite excellent.
Great!
after doing a small number of global corrections to the o.c.r., i checked it against noring's "trustworthy" version of the text.
What type of global corrections were these? One area is how to handle hyphenation, and whether there was a short dash in the compound word in the first place before the typesetter hyphenated the word.
except for exceptions i'll discuss right after this paragraph, most results are given below; each pair of lines represents a difference found between abbyy and noring, with the last word in each line being the point of difference. the number listed at the start of each line is the word-number in the file, and the string of words are the ones preceding the point-of-difference in the file, so that you can easily pinpoint the correct location.
Great!
most of the o.c.r. errors were on _punctuation_, not _letters_. in particular, there were many instances where a _period_ was misrecognized as a comma. i did not bother to list these cases, mostly to avoid clutter. i do not know what caused these errors. i don't know if it's a _typical_ misrecognition that abbyy makes, if jon's manipulation of the images somehow caused confusion, if i set one of the options incorrectly, or what. help, anyone?
I originally scanned the pages at 600 dpi (optical) 24-bit color (which in the future I won't do for b&w works since I determined it is unnecessary overkill.) Then for the online scans they were reduced as follows:

original --> 600 dpi bitonal --> 120 dpi greyscale antialiased

I'm not sure which set of scans you used (you don't have the originals since they occupy 5 gigs of space.) Hopefully you used the 600 dpi bitonal, which should OCR the best. Antialiasing actually causes problems (notwithstanding the much lower resolution.)

One thing you could do is to look at the 600 dpi pages at 100% size for the spots where the punctuation was not correctly discerned. You probably will see some errant pixels that fooled the OCR into thinking it was some other punctuation mark than it is.

Regardless, punctuation is a toughie for OCR to get exactly right, from what I understand. 600 dpi *helps* resolve the fine detail of punctuation. 300 dpi is marginal for a lot of punctuation because the characters are so small and don't occupy enough pixels (while letters retain enough pixels to better identify them.)
i have also not listed differences found in hyphenation, since i don't have the time to write a decent routine to check them. (i just accepted the dehyphenation abbyy did automatically.)
Ah, ok (answering my comments at the beginning.) Resolving hyphenation usually requires a human being to go over it, especially for Works from the 18th and 19th centuries, where compound words with dashes were much more common than today (e.g., "to-morrow".) Sometimes one has to see what the author did elsewhere in the text, and in a few cases a guess is necessary based on what the author did in similar cases. Some of this can be automated; in other cases it requires a human being to make a final decision. I followed the UNL Cather Edition here.
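a minimal sketch of the kind of hyphenation check being discussed, assuming Python; `join_hyphen` and its arguments are invented names for illustration, not anyone's actual routine. The idea is to decide each end-of-line hyphen by seeing which form the rest of the text actually uses, and to hand ties back to a human:

```python
def join_hyphen(first, second, corpus_words):
    """Decide whether an end-of-line break like "to-" / "morrow" should
    be joined solid or keep its hyphen, by counting which form appears
    elsewhere in the text; a tie is left for a human to settle."""
    solid = first + second
    hyphened = first + "-" + second
    n_solid = corpus_words.count(solid)
    n_hyph = corpus_words.count(hyphened)
    if n_solid != n_hyph:
        return solid if n_solid > n_hyph else hyphened
    return None  # undecided: needs a human decision

# the rest of the text uses "to-morrow" twice, so keep the hyphen
words = "to-morrow we start to-morrow at dawn".split()
print(join_hyphen("to", "morrow", words))  # -> to-morrow
```

this only automates the easy cases, which matches Jon's point that a human still has to make the final call for the rest.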
another set of differences not listed here is the _n't_ words. words like "couldn't" and "shouldn't" were set with the _n't_ part distinct from the first part. jon's version retained this. abbyy did not. personally, i find it an unnecessary distraction; the first thing i'd do with such a file is to change it globally. jon probably considers that "tampering with trustworthiness". i think it's common sense recognition of a changed convention. if you prefer jon's way, use his text. if not, you can use mine. (the change is global throughout the book, so it is easy to do.)
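the global change described here could be a one-line substitution; a hedged sketch in Python (the sample sentence is made up, and this assumes the set-apart form is always a single space before "n't"):

```python
import re

text = "He could n't say why she would n't go."
# join the typeset-apart "n't" back onto the preceding word
closed = re.sub(r"(\w) n't\b", r"\1n't", text)
print(closed)  # -> He couldn't say why she wouldn't go.
```

and since it is a pure mechanical rule, it is just as easy to run in reverse for anyone who prefers the original typesetting.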
Whether or not it is an "unnecessary" distraction, it is better to preserve the original text in the master etext version. My thinking is that if someone wants to produce a derivative "modern reader" edition of "My Antonia", they are welcome to do so and add it to the collection because the original faithful rendition is *already* there. The only requirements I would place (and this applies in general for any Work) are 1) the original textually faithful etext version has already been done and is in the collection, and 2) the type of modernizations done for the modern parallel editions are noted in the texts themselves (such as within an Editor's Introduction.)
i note that jon _did_ close up some "there's" where the "'s" was set off distinctively, so he was a bit inconsistent in this arena. (i didn't check to see if there were other apostrophe-s words that were set apart, because i would've closed 'em up myself.)
I spent some time looking at the " 's " issue last week. In many cases in the original print edition the spacing between the preceding word and the apostrophe s is quite small -- and for the same combination elsewhere was larger -- indicating this was more of a typesetter's convention than something Cather specified. [note]

In addition, the UNL Cather Edition closed off all the apostrophe s (no spaces), but kept the space for many of the " n't " words. So here again I followed the UNL Cather Edition. (Btw, I found quite a few errors in the online UNL Cather Edition of "My Antonia", which have been forwarded to the team overseeing it -- sadly, the professor overseeing the online project passed away a few months ago. We are in touch with other Cather scholars.) But I've put the "'s" issue on my "to look at again" list.

[note] Cather wanted the line length to be fairly short, and this puts extra pressure on the typesetter, who will either have to extend character spacing for a particular line or scrunch it up more than usual, depending upon the situation with the rest of the typesetting on the page and whether certain words can be hyphenated or not.
i also changed high-bit characters to low, to ease comparison,
You mean accented characters?
so those differences are not listed. yes, the book _did_ print "antonia" with a squiggle over the "a". to me, it's unnecessary.
But that's what is in the original, the "A acute". The squiggly is an 'acute', btw. :^) Accented characters are *always* important to preserve, under all circumstances. There's no need anymore, in these days of Unicode and the like, to stick with 7-bit ASCII. I sense that you don't want to properly deal with accented characters since this poses extra problems with OCRing and proofing, something you are trying to avoid in your zeal to get everything to automagically work. To me, that's going too far in simplifying. Preserving accented characters is important.
(but i'm quite sure _that_ little detail gave jon wet dreams.) ;+) whichever way you like it, it is just one more global change. that's one beauty of plain-text -- it's so easy to manipulate it.
Unicode is plain text. Just more characters to play with. :^)

Btw, for those who are interested, here are the "non-Basic Latin" (non-ASCII) alphabetic characters used in "My Antonia":

A acute
AE ligature
ae ligature
e acute
e circumflex
i umlaut
n tilde
small z with caron
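for what it's worth, the high-bit-to-low folding described above can be a small translation table over exactly the characters Jon lists; a sketch in Python (the `FOLD` name is invented, and the ASCII substitutions are the obvious ones, not anything either party specified):

```python
# fold the listed non-ASCII letters down to plain-ASCII stand-ins,
# purely to ease a mechanical comparison of two versions of the text
FOLD = str.maketrans({
    "\u00c1": "A",   # A acute
    "\u00c6": "AE",  # AE ligature
    "\u00e6": "ae",  # ae ligature
    "\u00e9": "e",   # e acute
    "\u00ea": "e",   # e circumflex
    "\u00ef": "i",   # i umlaut (diaeresis)
    "\u00f1": "n",   # n tilde
    "\u017e": "z",   # small z with caron
})

print("\u00c1ntonia".translate(FOLD))  # -> Antonia
```

the same table run in reverse is not possible (the fold is lossy), which is exactly Jon's argument for keeping the accented characters in the master version.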
almost all of the words were correctly identified. the ones that were not would be flagged by a simple spell-check, with merely 2 stealth scanno exceptions: "cur" for "our" and "oven" for "over". i imagine that these pairs are on the lists of known scannos, and the variants appear just 5 times, total, so it's an easy test to do.
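the test really is easy; a minimal sketch in Python, using only the two stealth-scanno pairs reported above (the function name and sample sentence are invented for illustration):

```python
import re

# the two stealth scannos reported: "cur" for "our", "oven" for "over";
# flag every whole-word occurrence so a human can check each one
STEALTH = ("cur", "oven")

def flag_scannos(text):
    hits = []
    for word in STEALTH:
        hits += [(m.start(), word) for m in re.finditer(rf"\b{word}\b", text)]
    return sorted(hits)

print(flag_scannos("He looked oven the fence at cur wagon."))
```

with the variants appearing only 5 times in the whole book, every flag can be eyeballed against the page scan in a minute or two.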
most of the errors were of two types -- periods and quote-marks.
Which makes sense. But these are the toughest to correct sometimes, and punctuation changes can sometimes subtly affect the meaning. They are hopefully caught by human proofers/readers when grammar checkers don't (I do use Word to help find both spelling and punctuation errors -- when they find something, I then manually check it in the page scans and the master XML.)
both these error-types are easy to check with programmed routines, even if they aren't flagged in spellcheck -- many of them would be.
They are "sometimes" easy to spot. Other times the automatic routines will not catch errors (e.g., ":" vs. ";").
it's relatively easy to detect sentences, so as to check for periods.
Usually true, but there are some rare exceptions where an abbreviation can be mistaken for the end of a sentence. Then there's the ellipsis issue, where sometimes an ellipsis is at the end of the sentence and sometimes it is not (and is sometimes incorrectly used.)
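a minimal sketch of the period-check idea, assuming Python: since the reported OCR error turns a sentence-ending period into a comma, a comma followed by a capitalized word is a candidate to flag (the function name and sample text are invented, and as the comment notes, proper nouns will false-positive, so every flag still goes to a human):

```python
import re

def flag_comma_caps(text):
    # a comma followed by a capitalized word is a candidate for a
    # period that OCR misread as a comma; proper nouns ("..., Jake")
    # will false-positive, so a human reviews each flag against the scan
    return [m.group() for m in re.finditer(r"\w+, +[A-Z]\w*", text)]

print(flag_comma_caps("He stopped, We waited by the gate, then moved on."))
```

this deliberately over-flags rather than auto-corrects, which sidesteps the abbreviation and ellipsis exceptions mentioned above.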
and quote-marks are usually nested in pairs, and thus easy to check.
This is also true, but as found in "My Antonia", there are exceptions to pure nesting, such as when a quotation spills over into several paragraphs where the intermediate paragraphs are not terminated by an end quotation mark (whether single or double.) Also, apostrophes are sometimes confused with single right quote marks.

Here's a fictional example (imagine the straight quotes and apostrophe marks being represented in print with the appropriate "curly" marks):

"And Harry told me, 'the voters' confidence in the candidate waned.' To which I replied to Harry, 'I don't believe so.'"

With a smart enough grammar and parser, the above might be properly parsed and the apostrophe correctly differentiated from the single right quote mark. But still, real-world texts tend to throw a lot of curve balls that are sometimes hard to correctly machine process.
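the multi-paragraph-quotation convention Jon describes can still be checked mechanically, just with a looser rule; a hedged sketch in Python (the function name is invented, and this only handles double quotes, not the single-quote and apostrophe ambiguity discussed above):

```python
def flag_unbalanced(paragraphs):
    """Flag paragraphs with an odd number of double-quote marks,
    EXCEPT when the quotation continues: by convention the next
    paragraph then reopens with its own quote mark, so an unclosed
    paragraph followed by one starting with a quote is fine."""
    flags = []
    for i, para in enumerate(paragraphs):
        if para.count('"') % 2 == 1:
            nxt = paragraphs[i + 1] if i + 1 < len(paragraphs) else ""
            if not nxt.lstrip().startswith('"'):
                flags.append(i)
    return flags

# a quotation spilling over three paragraphs: no flags
print(flag_unbalanced(['"He began,', '"and went on,', '"then stopped."']))
# a genuinely unclosed quote: paragraph 0 is flagged
print(flag_unbalanced(['He said, "come here.', 'Nothing followed.']))
```

like the period check, this flags suspects for a human rather than trying to fix them, so the curve balls cost review time instead of introducing new errors.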
but my routines for checking these two items are still back in my prototyping test-app, awaiting migration to the current version; that's why i didn't bother doing o.c.r. on the second half of the scans; once i've incorporated the routines, i'll refine 'em on the first half, and then do a solid test of them using the text from the second half.
great!
total time to do the o.c.r. on this book, once i know what i'm doing?
OCR is quite fast. It's making and cleaning up the scans which is the human and CPU intensive part.
i'd estimate it at about an hour. and for all post-o.c.r. processing? i'd estimate that about the same. total time for the book -- 2 hours. that's much less time than it took to scan and manipulate the images.
Yes.
p.s. although jon's highly accurate version of the text gave us little opportunity to find errors in his work, we _did_ find two. (one is an error in the text, i'd say, but jon did not preserve it.) if michael would like to have another "my antonia" in the library, i'll submit the _entirely_ correct version to project gutenberg, and maybe jon can use it to find the error that eluded his team. :+)
Well, not all of the pages have been doubly proofed. The team is not finished, and I plan to post a plea somewhere for more eyeballs to go over it. I would like to receive error reports as well for this text, since Brewster wants highly proofed texts for some experiments he plans to run similar to yours. But if I have to use the version you donate to PG, so be it. :^)
p.p.s. i _did_ just drop a hint. one i can use later to show that i did indeed find the one error that is non-equivocal. as for the other difference, which might or might not be an error, i'll sack that one.
Oh, a clue. :^) Anyway, great work!

Jon

(p.s. I did find one error in my text based on the list you gave. Thanks. There should be a comma after the first "Jake-y" in "Jake-y Jake-y". So that's been corrected already in the online and archive versions.

I rechecked the PG edition, and they get the comma right in the text, which I oddly missed in doing my "diff" (probably because there were quite a few differences to pore over.) But then they enclose the surrounding sentence within single quote marks (following the British convention), while the original first edition uses double quote marks. The PG edition seems to be inconsistent with regard to quotation marks and to British/American spelling, which is why I surmise the PG edition is based on some non-Cather-approved British edition that might have subsequently been selectively and inconsistently edited in trying to "re-Americanize" it. I assume you discovered the several different paragraph breaks in the PG edition?)