
Bowerbird wrote:
i did o.c.r. on half of jon noring's page-scans for "my antonia", using abbyy finereader v7.x. the results were quite excellent.
Great!
after doing a small number of global corrections to the o.c.r., i checked it against noring's "trustworthy" version of the text.
What type of global corrections were these? One area is how to handle hyphenation, and whether there was a short dash in the compound word in the first place before the typesetter hyphenated the word.
except for exceptions i'll discuss right after this paragraph, most results are given below; each pair of lines represents a difference found between abbyy and noring, with the last word in each line being the point of difference. the number listed at the start of each line is the word-number in the file, and the string of words are the ones preceding the point-of-difference in the file, so that you can easily pinpoint the correct location.
Great!
most of the o.c.r. errors were on _punctuation_, not _letters_. in particular, there were many instances where a _period_ was misrecognized as a comma. i did not bother to list these cases, mostly to avoid clutter. i do not know what caused these errors. i don't know if it's a _typical_ misrecognition that abbyy makes, if jon's manipulation of the images somehow caused confusion, if i set one of the options incorrectly, or what. help, anyone?
I originally scanned the pages at 600 dpi (optical) 24-bit color (which in the future I won't do for b&w works since I determined it is unnecessary overkill.) Then for the online scans they were reduced as follows:

original --> 600 dpi bitonal --> 120 dpi greyscale antialiased

I'm not sure which set of scans you used (you don't have the originals since they occupy 5 gigs of space.) Hopefully you used the 600 dpi bitonal, which should OCR the best. Antialiasing actually causes problems (notwithstanding the much lower resolution.)

One thing you could do is to look at the 600 dpi pages at 100% size for the spots where the punctuation was not correctly discerned. You probably will see some errant pixels that fooled the OCR into thinking it was some other punctuation mark than it is.

Regardless, punctuation is a toughie for OCR to get exactly right, from what I understand. 600 dpi *helps* resolve the fine detail of punctuation. 300 dpi is marginal for a lot of punctuation because the characters are so small and don't occupy enough pixels (while letters retain enough pixels to better identify them.)
i have also not listed differences found in hyphenation, since i don't have the time to write a decent routine to check them. (i just accepted the dehyphenation abbyy did automatically.)
Ah, ok (answering my comments at the beginning.) Resolving hyphenation usually requires a human being to go over it, especially for Works from the 18th and 19th centuries, where compound words with dashes were much more common than today (e.g., "to-morrow".) Sometimes one has to see what the author did elsewhere in the text, and in a few cases a guess is necessary based on what the author did in similar cases. Some of this can be automated; in other cases it requires a human being to make a final decision. I followed the UNL Cather Edition here.
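a minimal sketch of the kind of hyphenation check being discussed, assuming Python; `join_hyphen` and its arguments are invented names for illustration, not anyone's actual routine. The idea is to decide each end-of-line hyphen by seeing which form the rest of the text actually uses, and to hand ties back to a human:

```python
def join_hyphen(first, second, corpus_words):
    """Decide whether an end-of-line break like "to-" / "morrow" should
    be joined solid or keep its hyphen, by counting which form appears
    elsewhere in the text; a tie is left for a human to settle."""
    solid = first + second
    hyphened = first + "-" + second
    n_solid = corpus_words.count(solid)
    n_hyph = corpus_words.count(hyphened)
    if n_solid != n_hyph:
        return solid if n_solid > n_hyph else hyphened
    return None  # undecided: needs a human decision

# the rest of the text uses "to-morrow" twice, so keep the hyphen
words = "to-morrow we start to-morrow at dawn".split()
print(join_hyphen("to", "morrow", words))  # -> to-morrow
```

this only automates the easy cases, which matches Jon's point that a human still has to make the final call for the rest.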
another set of differences not listed here is the _n't_ words. words like "couldn't" and "shouldn't" were set with the _n't_ part distinct from the first part. jon's version retained this. abbyy did not. personally, i find it an unnecessary distraction; the first thing i'd do with such a file is to change it globally. jon probably considers that "tampering with trustworthiness". i think it's common sense recognition of a changed convention. if you prefer jon's way, use his text. if not, you can use mine. (the change is global throughout the book, so it is easy to do.)
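the global change described here could be a one-line substitution; a hedged sketch in Python (the sample sentence is made up, and this assumes the set-apart form is always a single space before "n't"):

```python
import re

text = "He could n't say why she would n't go."
# join the typeset-apart "n't" back onto the preceding word
closed = re.sub(r"(\w) n't\b", r"\1n't", text)
print(closed)  # -> He couldn't say why she wouldn't go.
```

and since it is a pure mechanical rule, it is just as easy to run in reverse for anyone who prefers the original typesetting.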
Whether or not it is an "unnecessary" distraction, it is better to preserve the original text in the master etext version. My thinking is that if someone wants to produce a derivative "modern reader" edition of "My Antonia", they are welcome to do so and add it to the collection because the original faithful rendition is *already* there. The only requirements I would place (and this applies in general for any Work) are 1) the original textually faithful etext version has already been done and is in the collection, and 2) the type of modernizations done for the modern parallel editions are noted in the texts themselves (such as within an Editor's Introduction.)
i note that jon _did_ close up some "there's" where the "'s" was set off distinctively, so he was a bit inconsistent in this arena. (i didn't check to see if there were other apostrophe-s words that were set apart, because i would've closed 'em up myself.)
I spent some time looking at the " 's " issue last week. In many cases in the original print edition the spacing between the preceding word and the apostrophe s is quite small -- and for the same combination elsewhere was larger -- indicating this was more of a typesetter's convention than something Cather specified. [note]

In addition, the UNL Cather Edition closed off all the apostrophe s (no spaces), but kept the space for many of the " n't " words. So here again I followed the UNL Cather Edition. (Btw, I found quite a few errors in the online UNL Cather Edition of "My Antonia", which have been forwarded to the team overseeing it -- sadly, the professor overseeing the online project passed away a few months ago. We are in touch with other Cather scholars.) But I've put the "'s" issue on my "to look at again" list.

[note] Cather wanted the line length to be fairly short, and this puts extra pressure on the typesetter, who will either have to extend character spacing for a particular line or scrunch it up more than usual, depending upon the situation with the rest of the typesetting on the page and whether certain words can be hyphenated or not.
i also changed high-bit characters to low, to ease comparison,
You mean accented characters?
so those differences are not listed. yes, the book _did_ print "antonia" with a squiggle over the "a". to me, it's unnecessary.
But that's what is in the original, the "A acute". The squiggly is an 'acute', btw. :^) Accented characters are *always* important to preserve, under all circumstances. There's no need anymore, in these days of Unicode and the like, to stick with 7-bit ASCII. I sense that you don't want to properly deal with accented characters since this poses extra problems with OCRing and proofing, something you are trying to avoid in your zeal to get everything to automagically work. To me, that's going too far in simplifying. Preserving accented characters is important.
(but i'm quite sure _that_ little detail gave jon wet dreams.) ;+) whichever way you like it, it is just one more global change. that's one beauty of plain-text -- it's so easy to manipulate it.
Unicode is plain text. Just more characters to play with. :^)

Btw, for those who are interested, here are the "non-Basic Latin" (non-ASCII) alphabetic characters used in "My Antonia":

A acute
AE ligature
ae ligature
e acute
e circumflex
i umlaut
n tilde
small z with caron
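for what it's worth, the high-bit-to-low folding described above can be a small translation table over exactly the characters Jon lists; a sketch in Python (the `FOLD` name is invented, and the ASCII substitutions are the obvious ones, not anything either party specified):

```python
# fold the listed non-ASCII letters down to plain-ASCII stand-ins,
# purely to ease a mechanical comparison of two versions of the text
FOLD = str.maketrans({
    "\u00c1": "A",   # A acute
    "\u00c6": "AE",  # AE ligature
    "\u00e6": "ae",  # ae ligature
    "\u00e9": "e",   # e acute
    "\u00ea": "e",   # e circumflex
    "\u00ef": "i",   # i umlaut (diaeresis)
    "\u00f1": "n",   # n tilde
    "\u017e": "z",   # small z with caron
})

print("\u00c1ntonia".translate(FOLD))  # -> Antonia
```

the same table run in reverse is not possible (the fold is lossy), which is exactly Jon's argument for keeping the accented characters in the master version.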
almost all of the words were correctly identified. the ones that were not would be flagged by a simple spell-check, with merely 2 stealth scanno exceptions: "cur" for "our" and "oven" for "over". i imagine that these pairs are on the lists of known scannos, and the variants appear just 5 times, total, so it's an easy test to do.
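the test really is easy; a minimal sketch in Python, using only the two stealth-scanno pairs reported above (the function name and sample sentence are invented for illustration):

```python
import re

# the two stealth scannos reported: "cur" for "our", "oven" for "over";
# flag every whole-word occurrence so a human can check each one
STEALTH = ("cur", "oven")

def flag_scannos(text):
    hits = []
    for word in STEALTH:
        hits += [(m.start(), word) for m in re.finditer(rf"\b{word}\b", text)]
    return sorted(hits)

print(flag_scannos("He looked oven the fence at cur wagon."))
```

with the variants appearing only 5 times in the whole book, every flag can be eyeballed against the page scan in a minute or two.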
most of the errors were of two types -- periods and quote-marks.
Which makes sense. But these are the toughest to correct sometimes, and punctuation changes can sometimes subtly affect the meaning. They are hopefully caught by human proofers/readers when grammar checkers don't (I do use Word to help find both spelling and punctuation errors -- when they find something, I then manually check it in the page scans and the master XML.)
both these error-types are easy to check with programmed routines, even if they aren't flagged in spellcheck -- many of them would be.
They are "sometimes" easy to spot. Other times the automatic routines will not catch errors (e.g., ":" vs. ";").
it's relatively easy to detect sentences, so as to check for periods.
Usually true, but there are some rare exceptions where an abbreviation can be mistaken for the end of a sentence. Then there's the ellipsis issue, where sometimes an ellipsis is at the end of the sentence and sometimes it is not (and is sometimes incorrectly used.)
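a minimal sketch of the period-check idea, assuming Python: since the reported OCR error turns a sentence-ending period into a comma, a comma followed by a capitalized word is a candidate to flag (the function name and sample text are invented, and as the comment notes, proper nouns will false-positive, so every flag still goes to a human):

```python
import re

def flag_comma_caps(text):
    # a comma followed by a capitalized word is a candidate for a
    # period that OCR misread as a comma; proper nouns ("..., Jake")
    # will false-positive, so a human reviews each flag against the scan
    return [m.group() for m in re.finditer(r"\w+, +[A-Z]\w*", text)]

print(flag_comma_caps("He stopped, We waited by the gate, then moved on."))
```

this deliberately over-flags rather than auto-corrects, which sidesteps the abbreviation and ellipsis exceptions mentioned above.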
and quote-marks are usually nested in pairs, and thus easy to check.
This is also true, but as found in "My Antonia", there are exceptions to pure nesting, such as when a quotation spills over into several paragraphs where the intermediate paragraphs are not terminated by an end quotation mark (whether single or double.) Also, apostrophes are sometimes confused with single right quote marks.

Here's a fictional example (imagine the straight quotes and apostrophe marks being represented in print with the appropriate "curly" marks):

"And Harry told me, 'the voters' confidence in the candidate waned.' To which I replied to Harry, 'I don't believe so.'"

With a smart enough grammar and parser, the above might be properly parsed and the apostrophe correctly differentiated from the single right quote mark. But still, real-world texts tend to throw a lot of curve balls that are sometimes hard to correctly machine process.
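the multi-paragraph-quotation convention Jon describes can still be checked mechanically, just with a looser rule; a hedged sketch in Python (the function name is invented, and this only handles double quotes, not the single-quote and apostrophe ambiguity discussed above):

```python
def flag_unbalanced(paragraphs):
    """Flag paragraphs with an odd number of double-quote marks,
    EXCEPT when the quotation continues: by convention the next
    paragraph then reopens with its own quote mark, so an unclosed
    paragraph followed by one starting with a quote is fine."""
    flags = []
    for i, para in enumerate(paragraphs):
        if para.count('"') % 2 == 1:
            nxt = paragraphs[i + 1] if i + 1 < len(paragraphs) else ""
            if not nxt.lstrip().startswith('"'):
                flags.append(i)
    return flags

# a quotation spilling over three paragraphs: no flags
print(flag_unbalanced(['"He began,', '"and went on,', '"then stopped."']))
# a genuinely unclosed quote: paragraph 0 is flagged
print(flag_unbalanced(['He said, "come here.', 'Nothing followed.']))
```

like the period check, this flags suspects for a human rather than trying to fix them, so the curve balls cost review time instead of introducing new errors.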
but my routines for checking these two items are still back in my prototyping test-app, awaiting migration to the current version; that's why i didn't bother doing o.c.r. on the second half of the scans; once i've incorporated the routines, i'll refine 'em on the first half, and then do a solid test of them using the text from the second half.
great!
total time to do the o.c.r. on this book, once i know what i'm doing?
OCR is quite fast. It's making and cleaning up the scans which is the human and CPU intensive part.
i'd estimate it at about an hour. and for all post-o.c.r. processing? i'd estimate that about the same. total time for the book -- 2 hours. that's much less time than it took to scan and manipulate the images.
Yes.
p.s. although jon's highly accurate version of the text gave us little opportunity to find errors in his work, we _did_ find two. (one is an error in the text, i'd say, but jon did not preserve it.) if michael would like to have another "my antonia" in the library, i'll submit the _entirely_ correct version to project gutenberg, and maybe jon can use it to find the error that eluded his team. :+)
Well, not all of the pages have been doubly proofed. The team is not finished, and I plan to post a plea somewhere for more eyeballs to go over it. I would like to receive error reports as well for this text, since Brewster wants highly proofed texts for some experiments he plans to run similar to yours. But if I have to use the version you donate to PG, so be it. :^)
p.p.s. i _did_ just drop a hint. one i can use later to show that i did indeed find the one error that is non-equivocal. as for the other difference, which might or might not be an error, i'll sack that one.
Oh, a clue. :^) Anyway, great work!

Jon

(p.s. I did find one error in my text based on the list you gave. Thanks. There should be a comma after the first "Jake-y" in "Jake-y Jake-y". So that's been corrected already in the online and archive versions.

I rechecked the PG edition, and they get the comma right in the text, which I oddly missed in doing my "diff" (probably because there were quite a few differences to pore over.) But then they enclose the surrounding sentence within single quote marks (following the British convention), while the original first edition uses double quote marks. The PG edition seems to be inconsistent with regard to quotation marks and to British/American spelling, which is why I surmise the PG edition is based on some non-Cather-approved British edition that might have subsequently been selectively and inconsistently edited in trying to "re-Americanize" it. I assume you discovered the several different paragraph breaks in the PG edition?)