well, that was a lot easier than i thought it would be... :+)
i did o.c.r. on half of jon noring's page-scans for "my antonia",
using abbyy finereader v7.x. the results were quite excellent.
after doing a small number of global corrections to the o.c.r.,
i checked it against noring's "trustworthy" version of the text.
except for a few categories i'll discuss right after this,
the results are given below; each pair of lines represents a
difference found between abbyy and noring, with the last word
in each line being the point of difference. the number listed at
the start of each line is the word-number in the file, and the
string of words is the text preceding the point-of-difference
in the file, so that you can easily pinpoint the correct location.
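
for anyone who wants to reproduce a listing in this format, here is a minimal sketch in python. it is not the routine i used; the function name, the 1-based word-numbering, and the 6-word lead-in are all assumptions made for illustration. it splits both texts on whitespace and lets the standard difflib module line them up.

```python
import difflib

def word_diffs(ocr_text, ref_text, context=6):
    """Return (word_number, ocr_line, ref_line) for each point of difference.

    word_number is the 1-based position in the o.c.r. text (an assumption);
    each line is the preceding context followed by the differing word(s).
    """
    ocr_words = ocr_text.split()
    ref_words = ref_text.split()
    # autojunk=False so frequent words like "the" are not ignored in long texts
    matcher = difflib.SequenceMatcher(None, ocr_words, ref_words, autojunk=False)
    results = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        lead = ocr_words[max(0, i1 - context):i1]
        ocr_line = " ".join(lead + ocr_words[i1:i2])
        ref_line = " ".join(lead + ref_words[j1:j2])
        results.append((i1 + 1, ocr_line, ref_line))
    return results

if __name__ == "__main__":
    ocr = "a group of people stood huddied on the platform"
    ref = "a group of people stood huddled on the platform"
    for number, ocr_line, ref_line in word_diffs(ocr, ref):
        print(number, ocr_line)
        print(number, ref_line)
```

on a whole book, difflib can be slow, so comparing chapter by chapter is probably the practical way to run it.
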
most of the o.c.r. errors were on _punctuation_, not _letters_.
in particular, there were many instances where a _period_ was
misrecognized as a comma. i did not bother to list these cases,
mostly to avoid clutter. i do not know what caused these errors.
i don't know if it's a _typical_ misrecognition that abbyy makes,
if jon's manipulation of the images somehow caused confusion,
if i set one of the options incorrectly, or what. help, anyone?
i have also not listed differences found in hyphenation, since
i don't have the time to write a decent routine to check them.
(i just accepted the dehyphenation abbyy did automatically.)
another set of differences not listed here is the _n't_ words.
words like "couldn't" and "shouldn't" were set with the _n't_
part distinct from the first part. jon's version retained this.
abbyy did not. personally, i find it an unnecessary distraction;
the first thing i'd do with such a file is to change it globally.
jon probably considers that "tampering with trustworthiness".
i think it's common sense recognition of a changed convention.
if you prefer jon's way, use his text. if not, you can use mine.
(the change is global throughout the book, so it is easy to do.)
i note that jon _did_ close up some "there's" where the "'s" was
set off distinctively, so he was a bit inconsistent in this arena.
(i didn't check to see if there were other apostrophe-s words
that were set apart, because i would've closed 'em up myself.)
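
to show how small that global change is, here is a sketch in python. it assumes the set-apart form shows up in the text as a single space before the _n't_ (as in "could n't"); if a transcription marks it some other way, the pattern would need adjusting. the function name is made up for this example.

```python
import re

def close_up_nt(text):
    """Join a set-apart n't to the word before it: "could n't" -> "couldn't".

    Assumes the gap is one plain space; other spacing conventions are not handled.
    """
    return re.sub(r"(\w) n't\b", r"\1n't", text)

if __name__ == "__main__":
    print(close_up_nt("I could n't see, and she did n't care."))
```

the same one-line substitution, with the pattern swapped, would handle the apostrophe-s words too.
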
i also changed high-bit characters to low, to ease comparison,
so those differences are not listed. yes, the book _did_ print
"antonia" with a squiggle over the "a". to me, it's unnecessary.
(but i'm quite sure _that_ little detail gave jon wet dreams.) ;+)
whichever way you like it, it is just one more global change.
that's one beauty of plain-text -- it's so easy to manipulate it.
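
here is a sketch of the high-bit-to-low change in python, for comparison purposes only. it is an illustration, not the routine i ran; the punctuation table and the function name are assumptions. accented letters are decomposed with the standard unicodedata module and the accent marks are dropped.

```python
import unicodedata

# assumed mapping for typographic punctuation; extend as the text requires
PUNCT = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2014": "--", "\u2013": "-",  # long and short dashes
}

def to_low_bit(text):
    """Replace curly punctuation and strip accents from letters.

    Characters with no decomposition (for example "\u00f8") pass through unchanged.
    """
    for high, low in PUNCT.items():
        text = text.replace(high, low)
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

if __name__ == "__main__":
    print(to_low_bit("\u00c1ntonia"))
```
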
so, now back to the quality of the recognition...
almost all of the words were correctly identified. the ones that
were not would be flagged by a simple spell-check, with merely
2 stealth scanno exceptions: "cur" for "our" and "oven" for "over".
i imagine that these pairs are on the lists of known scannos, and
the variants appear just 5 times, total, so it's an easy test to do.
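
a test like that might look like the following python sketch. the pair list holds only the two scannos named above; a real list of known scannos would be much longer, and the function name and 5-word lead-in are assumptions for the example. every hit still needs a human look, since "cur" and "oven" are real words.

```python
import re

# only the two pairs mentioned in this report; a real list would be longer
SCANNO_PAIRS = {"cur": "our", "oven": "over"}

def find_scannos(text, pairs=SCANNO_PAIRS, context=5):
    """Return (word_number, lead-in ending at the suspect word, suggestion)."""
    words = text.split()
    hits = []
    for n, word in enumerate(words, start=1):
        # strip surrounding punctuation before looking the word up
        bare = re.sub(r"^\W+|\W+$", "", word).lower()
        if bare in pairs:
            lead = " ".join(words[max(0, n - 1 - context):n])
            hits.append((n, lead, pairs[bare]))
    return hits

if __name__ == "__main__":
    for hit in find_scannos("kept for hot applications when cur horses"):
        print(hit)
```
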
most of the errors were of two types -- periods and quote-marks.
both of these error-types are easy to check with programmed routines,
even where spellcheck doesn't flag them -- and it would flag many.
it's relatively easy to detect sentences, so as to check for periods.
and quote-marks are usually nested in pairs, and thus easy to check.
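
my own routines aren't ready to show, so here is a rough python sketch of the idea, not the real thing. it assumes paragraphs are separated by blank lines and that quotes are plain double-quote characters; the function name is invented. it flags a paragraph with an odd number of double-quotes, and a paragraph that does not end in a sentence-ending mark.

```python
import re

def check_paragraphs(text):
    """Return (paragraph_number, problem) pairs for two simple checks."""
    problems = []
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    for number, para in enumerate(paragraphs, start=1):
        body = " ".join(para.split())
        # quote-marks normally come in pairs within a paragraph
        if body.count('"') % 2:
            problems.append((number, "odd number of double-quotes"))
        # ignore closing quotes, then look at the last character
        tail = body.rstrip("\"'")
        if tail and tail[-1] not in ".!?:":
            problems.append((number, "no sentence-ending mark"))
    return problems

if __name__ == "__main__":
    sample = 'He said, "Come in.\n\nwould be lost ten times oven\n\n"All right," she said.'
    for problem in check_paragraphs(sample):
        print(problem)
```

these are flags, not verdicts. a speech that runs over several paragraphs opens each one with a quote-mark and closes only the last, so an odd count there is correct, and a person still has to look at each hit.
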
but my routines for checking these two items are still back in my
prototyping test-app, awaiting migration to the current version;
that's why i didn't bother doing o.c.r. on the second half of the scans;
once i've incorporated the routines, i'll refine 'em on the first half,
and then do a solid test of them using the text from the second half.
it's not surprising to me that my tools would find all the errors here.
this is a relatively straightforward text, with very few complications.
total time to do the o.c.r. on this book, once i know what i'm doing?
i'd estimate it at about an hour. and for all post-o.c.r. processing?
i'd estimate about the same. total time for the book -- 2 hours.
that's much less time than it took to scan and manipulate the images.
i'm guessing that those 2 hours of o.c.r. and post-o.c.r. work would
make the accuracy level about 1 error for every 50-100 pages or so.
and those errors would be in the less-serious arena of punctuation.
i won't be able to say for sure until i've done the second-half test,
of course, but given the highly accurate recognition of the words
that i found on this half, i feel rather safe making that prediction.
in this half, of 200+ pages, the only errors that i might have missed
-- but found because i had noring's version to compare it to -- were
"layout/lay out" and "fairy tale/fairytale". i _might_ have caught
"fairytale", because it's not contained in my spellcheck dictionary
in its joined variant, and the split variant _is_ in the book (twice).
i probably would not have caught "layout", since it's in my dictionary.
(but i should take it out of the dictionary for checking older books.
old-time typographers _did_ layout, but they didn't _call_ it that.)
either way, i'm sure you'll agree that those two errors are trivial.
if all the errors in our books were that meaningless, it'd be great.
wait, i might have even caught _those_ errors, as they are _right_
in the _project_gutenberg_ e-text, which has been out for years!
well, that wraps up my report. for those who might be curious,
i'll be releasing my post-o.c.r. tool in the late spring. look for it!
anyway...
i believe this makes it very clear that i am correct when i say that
if you do the scanning carefully, manipulate those scans correctly,
use abbyy finereader v7.x to do the o.c.r., and subject its results to
a good post-o.c.r. program, it is relatively quick and easy to process
an o.c.r. text to the state where it can become a high-powered e-book.
the notion that these procedures are difficult or time-consuming
is just plain wrong. wrong, wrong, wrong. in one word -- _untrue_.
-bowerbird
p.s. although jon's highly accurate version of the text gave us
little opportunity to find errors in his work, we _did_ find two.
(one is an error in the text, i'd say, but jon did not preserve it.)
if michael would like to have another "my antonia" in the library,
i'll submit the _entirely_ correct version to project gutenberg,
and maybe jon can use it to find the error that eluded his team. :+)
p.p.s. i _did_ just drop a hint. one i can use later to show that
i did indeed find the one error that is non-equivocal. as for the
other error, which might or might not be an air, i'll sack that one.
-----------------------------------------------------------------
524 a group of people stood huddied
524 a group of people stood huddled
2442 of Jacob whom He loved. SelahP
2442 of Jacob whom He loved. Selah."
4562 grandmother's hand. The oldest son, Ambro2,
4562 grandmother's hand. The oldest son, Ambroz@,
5564 up like a hare. "Tatinek, Tatinekl"
5564 up like a hare. "Tatinek, Tatinek!"
6344 grumbled, but realized it was Important
6344 grumbled, but realized it was important
10749 was fixed for me by chance;
10749 was fixed for me by chance,
12887 the familiar road. "They still come? "he
12887 the familiar road. "They still come?" he
13132 they were always unfortunate. When PavePs
13132 they were always unfortunate. When Pavel's
16303 "You not mind my poor mamenka>
16303 "You not mind my poor mamenka,
17531 probably, in some deep Bohemian forest.....
17531 probably, in some deep Bohemian forest...
17718 would be lost ten times oven
17718 would be lost ten times over.
18300 the talking tree of the fairytale;
18300 the talking tree of the fairy tale;
21478 Ambrosch found him." "Krajiek could V
21478 Ambrosch found him." "Krajiek could 'a'
23282 that, too, Jelinek. But we beiieve
23282 that, too, Jelinek. But we believe
25309 and I went into the Shimerdas9
25309 and I went into the Shimerdas'
25594 of his long, shapely hands layout
25594 of his long, shapely hands lay out
26036 which is also Thy mercy seat,"
26036 which is also Thy mercy seat."
26157 While the tempest still is high."
26157 While the tempest still is high."...
27674 milk like what your grandpa s&y.
27674 milk like what your grandpa say.
29075 in a spiteful, crowing -- "Jake-y,
29075 in a spiteful, crowing voice: -- "Jake-y
29418 kept for hot applications when cur
29418 kept for hot applications when our
30016 Shimerda dropped the rope, ran aftet
30016 Shimerda dropped the rope, ran after
36061 misfortune, his wife, "Crazy Mary," iried
36061 misfortune, his wife, "Crazy Mary," tried
36513 fine, making eyes at the men!?.."
36513 fine, making eyes at the men!..."
37226 given him one of Tiny SoderbalPs
37226 given him one of Tiny Soderball's
38713 pump water for the cattle. '"Oh,
38713 pump water for the cattle. "'Oh,