
well, that was a lot easier than i thought it would be... :+)

i did o.c.r. on half of jon noring's page-scans for "my antonia", using abbyy finereader v7.x. the results were quite excellent. after doing a small number of global corrections to the o.c.r., i checked it against noring's "trustworthy" version of the text. except for the exceptions i'll discuss right after this paragraph, most results are given below; each pair of lines represents a difference found between abbyy and noring, with the last word in each line being the point of difference. the number listed at the start of each line is the word-number in the file, and the words shown are the ones preceding the point-of-difference in the file, so that you can easily pinpoint the correct location.

most of the o.c.r. errors were on _punctuation_, not _letters_. in particular, there were many instances where a _period_ was misrecognized as a comma. i did not bother to list these cases, mostly to avoid clutter. i do not know what caused these errors. i don't know if it's a _typical_ misrecognition that abbyy makes, if jon's manipulation of the images somehow caused confusion, if i set one of the options incorrectly, or what. help, anyone?

i have also not listed differences found in hyphenation, since i don't have the time to write a decent routine to check them. (i just accepted the dehyphenation abbyy did automatically.)

another set of differences not listed here is the _n't_ words. words like "couldn't" and "shouldn't" were set with the _n't_ part distinct from the first part. jon's version retained this. abbyy did not. personally, i find it an unnecessary distraction; the first thing i'd do with such a file is to change it globally. jon probably considers that "tampering with trustworthiness". i think it's common sense recognition of a changed convention. if you prefer jon's way, use his text. if not, you can use mine. (the change is global throughout the book, so it is easy to do.)

i note that jon _did_ close up some "there's" where the "'s" was set off distinctively, so he was a bit inconsistent in this arena. (i didn't check to see if there were other apostrophe-s words that were set apart, because i would've closed 'em up myself.)

i also changed high-bit characters to low, to ease comparison, so those differences are not listed. yes, the book _did_ print "antonia" with a squiggle over the "a". to me, it's unnecessary. (but i'm quite sure _that_ little detail gave jon wet dreams.) ;+) whichever way you like it, it is just one more global change. that's one beauty of plain-text -- it's so easy to manipulate it.

so, now back to the quality of the recognition... almost all of the words were correctly identified. the ones that were not would be flagged by a simple spell-check, with merely 2 stealth scanno exceptions: "cur" for "our" and "oven" for "over". i imagine that these pairs are on the lists of known scannos, and the variants appear just 5 times, total, so it's an easy test to do. (see the sketch after the list of differences below.)

most of the errors were of two types -- periods and quote-marks. it's easy to program routines to check for both of these error-types, even if they aren't flagged in spellcheck -- many of them would be. it's relatively easy to detect sentences, so as to check for periods. and quote-marks are usually nested in pairs, and thus easy to check. but my routines for checking these two items are still back in my prototyping test-app, awaiting migration to the current version; that's why i didn't bother doing o.c.r.
on the second half of the scans; once i've incorporated the routines, i'll refine 'em on the first half, and then do a solid test of them using the text from the second half. it's not surprising to me that my tools would find all the errors here. this is a relatively straightforward text, with very few complications.

total time to do the o.c.r. on this book, once i know what i'm doing? i'd estimate it at about an hour. and for all post-o.c.r. processing? i'd estimate that at about the same. total time for the book -- 2 hours. that's much less time than it took to scan and manipulate the images.

i'm guessing that those 2 hours of o.c.r. and post-o.c.r. work would make the accuracy level about 1 error for every 50-100 pages or so. and those errors would be in the less-serious arena of punctuation. i won't be able to say for sure until i've done the second-half test, of course, but given the highly accurate recognition of the words that i found on this half, i feel rather safe making that prediction.

in this half, of 200+ pages, the only errors that i might have missed -- but found because i had noring's version to compare it to -- were "layout/lay out" and "fairy tale/fairytale". i _might_ have caught "fairytale", because it's not contained in my spellcheck dictionary in its joined variant, and the split variant _is_ in the book (twice). i probably would not have caught "layout", since it's in my dictionary. (but i should take it out of the dictionary for checking older books. old-time typographers _did_ layout, but they didn't _call_ it that.) either way, i'm sure you'll agree that those two errors are trivial. if all the errors in our books were that meaningless, it'd be great. wait, i might have even caught _those_ errors, as they are _right_ in the _project_gutenberg_ e-text, which has been out for years!

well, that wraps up my report. for those who might be curious, i'll be releasing my post-o.c.r. tool in the late spring. look for it!

anyway... i believe this makes it very clear that i am correct when i say that if you do the scanning carefully, manipulate those scans correctly, use abbyy finereader v7.x to do the o.c.r., and subject its results to a good post-o.c.r. program, it is relatively quick and easy to process an o.c.r. text to the state where it can become a high-powered e-book. the notion that these procedures are difficult or time-consuming is just plain wrong. wrong, wrong, wrong. in one word -- _untrue_.

-bowerbird

p.s. although jon's highly accurate version of the text gave us little opportunity to find errors in his work, we _did_ find two. (one is an error in the text, i'd say, but jon did not preserve it.) if michael would like to have another "my antonia" in the library, i'll submit the _entirely_ correct version to project gutenberg, and maybe jon can use it to find the error that eluded his team. :+)

p.p.s. i _did_ just drop a hint. one i can use later to show that i did indeed find the one error that is non-equivocal. as for the other error, which might or might not be an air, i'll sack that one.

-----------------------------------------------------------------

524 a group of people stood huddied
524 a group of people stood huddled

2442 of Jacob whom He loved. SelahP
2442 of Jacob whom He loved. Selah."

4562 grandmother's hand. The oldest son, Ambro2,
4562 grandmother's hand. The oldest son, Ambrož,

5564 up like a hare. "Tatinek, Tatinekl"
5564 up like a hare. "Tatinek, Tatinek!"
6344 grumbled, but realized it was Important
6344 grumbled, but realized it was important

10749 was fixed for me by chance;
10749 was fixed for me by chance,

12887 the familiar road. "They still come? "he
12887 the familiar road. "They still come?" he

13132 they were always unfortunate. When PavePs
13132 they were always unfortunate. When Pavel's

16303 "You not mind my poor mamenka>
16303 "You not mind my poor mamenka,

17531 probably, in some deep Bohemian forest.....
17531 probably, in some deep Bohemian forest...

17718 would be lost ten times oven
17718 would be lost ten times over.

18300 the talking tree of the fairytale;
18300 the talking tree of the fairy tale;

21478 Ambrosch found him." "Krajiek could V
21478 Ambrosch found him." "Krajiek could 'a'

23282 that, too, Jelinek. But we beiieve
23282 that, too, Jelinek. But we believe

25309 and I went into the Shimerdas9
25309 and I went into the Shimerdas'

25594 of his long, shapely hands layout
25594 of his long, shapely hands lay out

26036 which is also Thy mercy seat,"
26036 which is also Thy mercy seat."

26157 While the tempest still is high."
26157 While the tempest still is high."...

27674 milk like what your grandpa s&y.
27674 milk like what your grandpa say.

29075 in a spiteful, crowing -- "Jake-y,
29075 in a spiteful, crowing voice: -- "Jake-y

29418 kept for hot applications when cur
29418 kept for hot applications when our

30016 Shimerda dropped the rope, ran aftet
30016 Shimerda dropped the rope, ran after

36061 misfortune, his wife, "Crazy Mary," iried
36061 misfortune, his wife, "Crazy Mary," tried

36513 fine, making eyes at the men!?.."
36513 fine, making eyes at the men!..."

37226 given him one of Tiny SoderbalPs
37226 given him one of Tiny Soderball's

38713 pump water for the cattle. '"Oh,
38713 pump water for the cattle. "'Oh,
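(and for anyone who wants to try that stealth-scanno test themselves, here's a bare-bones python sketch of the idea -- the pair-list has just the two pairs found here, "book.txt" is a stand-in filename, and this is an illustration only, _not_ my actual tool:)

    import re

    # just the two stealth pairs found in this test;
    # a real check would load a full known-scanno list.
    SCANNO_PAIRS = [("cur", "our"), ("oven", "over")]

    def flag_scannos(text):
        # report every occurrence of either word in a scanno pair,
        # with surrounding context, so a human can eyeball which
        # variant the book actually intended.
        for bad, good in SCANNO_PAIRS:
            for word in (bad, good):
                for m in re.finditer(r"\b%s\b" % word, text, re.IGNORECASE):
                    start = max(0, m.start() - 30)
                    context = text[start:m.end() + 30].replace("\n", " ")
                    yield word, context

    with open("book.txt", encoding="utf-8") as f:
        for word, context in flag_scannos(f.read()):
            print(word, "::", context)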

Bowerbird wrote:
i did o.c.r. on half of jon noring's page-scans for "my antonia", using abbyy finereader v7.x. the results were quite excellent.
Great!
after doing a small number of global corrections to the o.c.r., i checked it against noring's "trustworthy" version of the text.
What type of global corrections were these? One area is how to handle hyphenation, and whether there was a short dash in the compound word in the first place before the typesetter hyphenated the word.
except for the exceptions i'll discuss right after this paragraph, most results are given below; each pair of lines represents a difference found between abbyy and noring, with the last word in each line being the point of difference. the number listed at the start of each line is the word-number in the file, and the words shown are the ones preceding the point-of-difference in the file, so that you can easily pinpoint the correct location.
Great!
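(For the curious, a report in roughly that format can be sketched in a few lines of Python with difflib. This is only my guess at an approach, not bowerbird's actual tool, and the filenames are placeholders:)

    import difflib

    def word_diff_report(file_a, file_b, context=5):
        # split both texts into words and align them word-by-word;
        # at each difference, print one line per version: the 1-based
        # word-number, then a few preceding words ending at the
        # point of difference.
        a = open(file_a, encoding="utf-8").read().split()
        b = open(file_b, encoding="utf-8").read().split()
        for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
            if tag == "equal":
                continue
            print(i1 + 1, " ".join(a[max(0, i1 - context):i2]))
            print(j1 + 1, " ".join(b[max(0, j1 - context):j2]))

    word_diff_report("abbyy_ocr.txt", "noring_master.txt")  # placeholder names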
most of the o.c.r. errors were on _punctuation_, not _letters_. in particular, there were many instances where a _period_ was misrecognized as a comma. i did not bother to list these cases, mostly to avoid clutter. i do not know what caused these errors. i don't know if it's a _typical_ misrecognition that abbyy makes, if jon's manipulation of the images somehow caused confusion, if i set one of the options incorrectly, or what. help, anyone?
I originally scanned the pages at 600 dpi (optical) 24-bit color (which in the future I won't do for b&w works since I determined it is unnecessary overkill.) Then for the online scans they were reduced as follows:

original --> 600 dpi bitonal --> 120 dpi greyscale antialiased

I'm not sure which set of scans you used (you don't have the original since they occupy 5 gigs of space.) Hopefully you used the 600 dpi bitonal, which should OCR the best. Antialiasing actually causes problems (notwithstanding the much lower resolution.)

One thing you could do is to look at the 600 dpi pages at 100% size for which the punctuation was not correctly discerned. You probably will see some errant pixels that fooled the OCR into thinking it was some other punctuation mark than it is.

Regardless, punctuation is a toughie for OCR to exactly get right, from what I understand. 600 dpi *helps* resolve the fine detail of punctuation. 300 dpi is marginal for a lot of punctuation because the characters are so small and don't occupy enough pixels (while letters retain enough pixels to better identify them.)
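(Roughly, that reduction could be reproduced with something like the following Pillow sketch. The threshold, scale factor, and filenames are illustrative guesses, not the exact settings I used:)

    from PIL import Image

    page = Image.open("page_scan.png")   # stand-in for a 600 dpi color scan

    # 600 dpi bitonal: greyscale first, then a fixed threshold
    # down to pure black-and-white.
    grey = page.convert("L")
    bitonal = grey.point(lambda p: 255 if p > 160 else 0).convert("1")
    bitonal.save("page_600dpi_bitonal.png", dpi=(600, 600))

    # 120 dpi greyscale, antialiased: downscale 5x with a smoothing filter.
    w, h = grey.size
    small = grey.resize((w // 5, h // 5), Image.LANCZOS)
    small.save("page_120dpi_grey.png", dpi=(120, 120))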
i have also not listed differences found in hyphenation, since i don't have the time to write a decent routine to check them. (i just accepted the dehyphenation abbyy did automatically.)
Ah, ok (answering my comments at the beginning.) Resolving this usually requires a human being to go over it, especially for Works from the 18th and 19th centuries, where compound words with dashes were much more common than today (e.g., "to-morrow".) Sometimes one has to see what the author did elsewhere in the text, and in a few cases a guess is necessary based on understanding what the author did in similar cases. Some of this can be automated; in other cases it requires a human being to make a final decision. I followed the UNL Cather Edition here.
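(The automatable part looks something like this: when a word is broken across a line, see how the author set the same word elsewhere in the text, and punt to a human when there's no evidence either way. A minimal sketch; the filename is a placeholder:)

    import re
    from collections import Counter

    def resolve_hyphen_break(first, second, word_counts):
        # a line ended with e.g. "to-" and the next began "morrow":
        # prefer whichever form ("tomorrow" vs. "to-morrow") the
        # text itself uses elsewhere; otherwise ask a human.
        joined = first + second
        compound = first + "-" + second
        if word_counts[joined] > word_counts[compound]:
            return joined
        if word_counts[compound] > word_counts[joined]:
            return compound
        return None   # no evidence either way: human decision

    text = open("book.txt", encoding="utf-8").read()   # placeholder name
    counts = Counter(re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower()))
    print(resolve_hyphen_break("to", "morrow", counts))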
another set of differences not listed here is the _n't_ words. words like "couldn't" and "shouldn't" were set with the _n't_ part distinct from the first part. jon's version retained this. abbyy did not. personally, i find it an unnecessary distraction; the first thing i'd do with such a file is to change it globally. jon probably considers that "tampering with trustworthiness". i think it's common sense recognition of a changed convention. if you prefer jon's way, use his text. if not, you can use mine. (the change is global throughout the book, so it is easy to do.)
Whether or not it is an "unnecessary" distraction, it is better to preserve the original text in the master etext version. My thinking is that if someone wants to produce a derivative "modern reader" edition of "My Antonia", they are welcome to do so and add it to the collection, because the original faithful rendition is *already* there. The only requirements I would place (and this applies in general for any Work) are 1) the original textually faithful etext version has already been done and is in the collection, and 2) the types of modernizations done for the modern parallel editions are noted in the texts themselves (such as within an Editor's Introduction.)
i note that jon _did_ close up some "there's" where the "'s" was set off distinctively, so he was a bit inconsistent in this arena. (i didn't check to see if there were other apostrophe-s words that were set apart, because i would've closed 'em up myself.)
I spent some time looking at the " 's " issue last week. In many cases in the original print edition the spacing between the preceding word and the apostrophe s is quite small -- and for the same combination elsewhere was larger -- indicating this was more of a typesetter's convention than something Cather specified. [note]

In addition, the UNL Cather Edition closed up all the apostrophe s (no spaces), but kept the space for many of the " n't " words. So here again I followed the UNL Cather Edition. (Btw, I found quite a few errors in the online UNL Cather Edition of "My Antonia", which have been forwarded to the team overseeing it -- sadly, the professor overseeing the online project passed away a few months ago. We are in touch with other Cather scholars.) But I've put the " 's " issue on my "to look at again" list.

[note] Cather wanted the line length to be fairly short, which puts extra pressure on typesetters, who either have to extend character spacing for a particular line or scrunch it up more than usual, depending upon the situation with the rest of the typesetting on the page, and whether certain words can be hyphenated or not.
i also changed high-bit characters to low, to ease comparison,
You mean accented characters?
so those differences are not listed. yes, the book _did_ print "antonia" with a squiggle over the "a". to me, it's unnecessary.
But that's what is in the original, the "A acute". The squiggly is an 'acute', btw. :^) Accented characters are *always* important to preserve, under all situations. There's no need anymore, in these days of Unicode and the like, to stick with 7-bit ASCII.

I sense that you don't want to properly deal with accented characters since this poses extra problems with OCRing and proofing, something you are trying to avoid in your zeal to get everything to automagically work. To me, that's going too far in simplifying. Preserving accented characters is important.
(but i'm quite sure _that_ little detail gave jon wet dreams.) ;+) whichever way you like it, it is just one more global change. that's one beauty of plain-text -- it's so easy to manipulate it.
Unicode is plain text. Just more characters to play with. :^)

Btw, for those who are interested, here are the "non-Basic Latin" (non-ASCII) alphabetic characters used in "My Antonia":

A acute
AE ligature
ae ligature
e acute
e circumflex
i umlaut
n tilde
small z with caron
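(A list like that is easy to produce mechanically. Here's a short sketch that inventories every non-ASCII character in a text by its Unicode name; the filename is a placeholder:)

    import unicodedata

    text = open("my_antonia.txt", encoding="utf-8").read()  # placeholder
    for ch in sorted(set(c for c in text if ord(c) > 127)):
        print(repr(ch), unicodedata.name(ch, "<unnamed>"))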
almost all of the words were correctly identified. the ones that were not would be flagged by a simple spell-check, with merely 2 stealth scanno exceptions: "cur" for "our" and "oven" for "over". i imagine that these pairs are on the lists of known scannos, and the variants appear just 5 times, total, so it's an easy test to do.
most of the errors were of two types -- periods and quote-marks.
Which makes sense. But these are the toughest to correct sometimes, and punctuation changes can sometimes subtly affect the meaning. They are hopefully caught by human proofers/readers when grammar checkers don't. (I do use Word to help find both spelling and punctuation errors -- when it finds something, I then manually check it against the page scans and the master XML.)
it's easy to program routines to check for both of these error-types, even if they aren't flagged in spellcheck -- many of them would be.
They are "sometimes" easy to spot. Other times the automatic routines will not catch errors (e.g. ":" vs. ";")
it's relatively easy to detect sentences, so as to check for periods.
Usually true, but there are some rare exceptions where an abbreviation can be mistaken for the end of a sentence. Then there's the ellipsis issue, where sometimes an ellipsis is at the end of the sentence and sometimes it is not (and is sometimes used incorrectly.)
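(A sketch of the kind of check being discussed: flag each comma followed by a capitalized word, since in OCR output that often means a period was misread as a comma, and skip a small list of known abbreviations when hunting sentence ends. The abbreviation list here is only a sample, and proper nouns will produce false alarms, so the output is meant for human review:)

    import re

    ABBREVS = {"mr.", "mrs.", "dr.", "st.", "etc.", "e.g.", "i.e."}

    def suspect_commas(text):
        # a comma followed by a capitalized word is often a misread period.
        for m in re.finditer(r"\b\w+, +[A-Z]\w*", text):
            yield m.group(0)

    def sentence_ends(text):
        # naive sentence-end finder: any period-terminated token
        # that is not a known abbreviation.
        for m in re.finditer(r"[\w.]+\.", text):
            if m.group(0).lower() not in ABBREVS:
                yield m.end()

    sample = 'He fell, Then Mr. Shimerda rose.'
    print(list(suspect_commas(sample)))   # -> ['fell, Then']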
and quote-marks are usually nested in pairs, and thus easy to check.
This is also true, but as found in "My Antonia", there are exceptions to pure nesting, such as when a quotation spills over into several paragraphs where the intermediate paragraphs are not terminated by an end quotation mark (whether single or double.) Also, apostrophes are sometimes confused with single right quote marks.

Here's a fictional example (imagine the straight quotes and apostrophe marks being represented in print with the appropriate "curly" marks):

"And Harry told me, 'the voters' confidence in the candidate waned.' To which I replied to Harry, 'I don't believe so.'"

With a smart enough grammar and parser, the above might be properly parsed and the apostrophe correctly differentiated from the single right quote mark. But still, real-world texts tend to throw a lot of curve balls that are sometimes hard to correctly machine process.
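(Here's a sketch of a paragraph-aware balance check for double quotes that allows for exactly that convention -- an unclosed quotation is legal when the next paragraph re-opens with a quote mark. Single quotes are left alone for the apostrophe reason above, and anything flagged is meant for human review:)

    def check_double_quotes(paragraphs):
        # flag paragraphs whose straight double quotes don't pair up,
        # unless the quotation legitimately spills over: the paragraph
        # is left unclosed and the NEXT paragraph re-opens with a quote.
        suspects = []
        for i, para in enumerate(paragraphs):
            if para.count('"') % 2 == 1:
                spills_over = (i + 1 < len(paragraphs)
                               and paragraphs[i + 1].lstrip().startswith('"'))
                if not spills_over:
                    suspects.append(i)
        return suspects

    paras = ['"It goes on into the next paragraph,',
             '"and it ends here." All fine.',
             'But this " one is unbalanced.']
    print(check_double_quotes(paras))   # -> [2]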
but my routines for checking these two items are still back in my prototyping test-app, awaiting migration to the current version; that's why i didn't bother doing o.c.r. on the second half of the scans; once i've incorporated the routines, i'll refine 'em on the first half, and then do a solid test of them using the text from the second half.
Great!
total time to do the o.c.r. on this book, once i know what i'm doing?
OCR is quite fast. It's making and cleaning up the scans that is the human- and CPU-intensive part.
i'd estimate it at about an hour. and for all post-o.c.r. processing? i'd estimate that at about the same. total time for the book -- 2 hours. that's much less time than it took to scan and manipulate the images.
Yes.
p.s. although jon's highly accurate version of the text gave us little opportunity to find errors in his work, we _did_ find two. (one is an error in the text, i'd say, but jon did not preserve it.) if michael would like to have another "my antonia" in the library, i'll submit the _entirely_ correct version to project gutenberg, and maybe jon can use it to find the error that eluded his team. :+)
Well, not all of the pages have been doubly proofed. The team is not finished, and I plan to post a plea somewhere for more eyeballs to go over it. I would like to receive error reports as well for this text, since Brewster wants highly proofed texts for some experiments he plans to run similar to yours. But if I have to use the version you donate to PG, so be it. :^)
p.p.s. i _did_ just drop a hint. one i can use later to show that i did indeed find the one error that is non-equivocal. as for the other error, which might or might not be an air, i'll sack that one.
Oh, a clue. :^)

Anyway, great work!

Jon

(p.s., I did find one error in my text based on the list you gave. Thanks. There should be a comma after the first "Jake-y" in "Jake-y Jake-y". So that's been corrected already in the online and archive version.

I rechecked the PG edition, and they get the comma right in the text, which I oddly missed when doing my "diff" (probably because there were quite a few differences to pore over.) But then they enclose the surrounding sentence within a single quote mark (following the British convention), while the original first edition uses a double quote mark. The PG edition seems to be inconsistent with regards to quotation marks and to British/American spelling, which is why I surmise the PG edition is based on some non-Cather-approved British edition and might have subsequently been selectively and inconsistently edited in trying to "re-Americanize" it. I assume you discovered the several different paragraph breaks in the PG edition?)