Re: [gutvol-d] one more thing, for jon noring

27 Feb 2005

      Bowerbird wrote:
...
jon noring said:
...
did you do o.c.r. on it?  if you can retrieve the output, that would
be good. it would allow people to do research on assessing/improving
o.c.r. quality, and assist programmers in developing post-o.c.r.
text-cleanup programs.
No, I did not OCR the scans for producing My Antonia (I did experiment
with scanning though). But since the scans exist, any OCR package can
import them and OCR them. So nothing is "lost". There's no law that
says one must OCR them at the same time they are scanned -- they are
separate processes and can be decoupled with no loss of anything to
anyone at any time.

If you need OCRing done, you can probably post a "plea for help" and
find someone who has the OCR software packages you'd like to try (I
don't have the robust, up-to-date ones -- like Abbyy which I asked a
friend to help out with for my experiments -- I just have the cheapo
freebies.) The scans are available online for download, as you know.
...
(but, from later posts, it looks like you grabbed the text from
elsewhere. so what you've done is "blessed" somebody else's work as
"trustworthy", presumably after checking it, and maybe correcting
it.  you could also have done that same thing using project
gutenberg's version of the text, since my comparison of the two
files shows them to be very similar, so much so that i expect they
were indeed based on the same version.)
I won't go into the gory details, but yes, I took two versions and
then combined/diffed them. I then did a very thorough comparison
page-by-page to the original source page scans, a la DP. We are now in
the process of having several people do the same (a page-by-page
comparison of the XHTML version to the page scans -- I want at least
two people to go over each page) -- anyone reading this, you are
welcome to volunteer and help us -- do a few pages just like for DP!

The error rate in the XHTML version (still beta) is now very low, and
the text can be considered, for all practical purposes, a very accurate
and textually faithful reproduction of the 1918 1st edition. (But then,
maybe I'll be surprised and find a serious error in the text.)

In retrospect, this process should have been done via DP instead. But
there was a deadline to finish the first beta of the cleaned-up text,
so there was no time to have this done at DP. However, I do plan to
post a request to the relevant DP forums for final proofing help, as
well as to seek help from the DP folk on other matters (such as TEI
markup). If DP wishes to go over it in some fashion and incorporate
it into their "archive" as well as submit it to PG, that's fine by me
(I will not directly contribute the text to PG as I've noted on TeBC.)

Regarding the PG version of My Antonia compared to the 1918 1st
edition, there are a *lot* of differences. I regularized both texts
and ran 'diff' between them, and found over 200 differences, mostly
spelling (the PG version uses mostly British spelling but even here it
is strangely inconsistent!), but also oddities in punctuation, wrong
paragraph breaks, some missing accented characters, a couple places
with changed wording, a few misspellings, etc.
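
The regularize-then-diff step described above can be sketched roughly as
follows. This is only an illustration, not the procedure actually used:
the regularization rules (Unicode normalization, straightening curly
quotes, collapsing whitespace) and the names regularize and word_diffs
are assumptions made for the example.

```python
import difflib
import re
import unicodedata

# Assumed regularization rule: map curly quotes to straight ones.
QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'",
                        "\u201c": '"', "\u201d": '"'})

def regularize(text):
    """Normalize a text so that only substantive differences survive a diff."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(QUOTES)
    # Collapse every run of whitespace (including line breaks) to one space.
    return re.sub(r"\s+", " ", text).strip()

def word_diffs(a, b):
    """Return (tag, words_from_a, words_from_b) for each differing span."""
    wa, wb = regularize(a).split(), regularize(b).split()
    matcher = difflib.SequenceMatcher(a=wa, b=wb, autojunk=False)
    return [(tag, wa[i1:i2], wb[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]
```

For example, word_diffs("The colour  of\nthe sky", "The color of the sky")
reports a single "replace" of 'colour' by 'color'. Because this sketch
collapses all whitespace, it would not catch differences in paragraph
breaks; those would need a separate pass that keeps blank lines.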

Of course, whenever I encountered a difference, I went to the
original page scans of the 1918 1st edition to verify what was printed
there. In all 200+ cases it was the PG version that departed from the
original text. I surmise the PG version was derived from the British
edition of My Antonia, which is noted to have been mangled in editing
(Willa Cather was supposedly furious over the quality of editing in
the British edition, which went beyond just using British spelling for
words, such as 'colour' instead of 'color'.)

Anyway, when the final proofing is done, I believe the textual error
rate will be very low, near zero (but one cannot say it is perfect.)
So I think it will be useful for OCR accuracy experiments (which I
assume is what you want to perform?)
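
One common way to run such an OCR accuracy experiment is to compute a
character error rate: the edit distance between the OCR output and the
proofed text, divided by the length of the proofed text. The sketch
below is a minimal illustration of that idea; the function names are
made up for the example, and it is best applied page by page, since the
comparison cost grows with the product of the two lengths.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def char_error_rate(ocr_text, proofed_text):
    """Edit distance from OCR output to proofed text, per proofed character."""
    if not proofed_text:
        raise ValueError("proofed text must not be empty")
    return edit_distance(ocr_text, proofed_text) / len(proofed_text)
```

With this definition, an OCR page reading "Ant0nia" against a proofed
"Antonia" has one substitution in seven characters.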

Of course, there's always the issue of hyphenated compound words,
figuring out whether a hyphenated compound word should have a hyphen in
it or not, but that's another matter. I believe we did pretty well on
this, with help from the UNL information as well as textual analysis.
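
One form this textual analysis can take, assuming the hard cases are
compound words split by a hyphen at the end of a printed line, is to
check how the same word is spelled elsewhere in the text. The sketch
below is a hypothetical illustration of that idea, not the method used
for My Antonia: it keeps the hyphen when the hyphenated form is more
common elsewhere, joins the halves otherwise, and lists the words it
had no evidence for so a person can check them against the scans.

```python
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")
SPLIT = re.compile(r"([A-Za-z]+)-\n([A-Za-z]+)")  # word broken at a line end

def rejoin_line_end_hyphens(text):
    """Return (fixed_text, unresolved) for words hyphenated across lines."""
    # Count spellings elsewhere in the text; the split halves are counted
    # as separate words, so they do not bias either candidate form.
    counts = Counter(w.lower() for w in WORD.findall(SPLIT.sub(r"\1 \2", text)))
    unresolved = []

    def fix(match):
        joined = match.group(1) + match.group(2)
        hyphenated = match.group(1) + "-" + match.group(2)
        if counts[hyphenated.lower()] > counts[joined.lower()]:
            return hyphenated
        if counts[joined.lower()] == 0:
            unresolved.append(hyphenated)  # no evidence: flag for review
        return joined  # the line break at the split is dropped

    return SPLIT.sub(fix, text), unresolved
```

Defaulting to the joined form when there is no evidence is a design
choice for the sketch only; the unresolved list is what would be
checked by hand.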
...
...
I'll zip up the 600 dpi 2-color (B&W) scans which have already gone
through a clean-up stage (they will be PNG files, and occupy, if
memory serves me right, about 50 megs of space
...
those are too big for my purposes, and for me to download.
Oops, sorry. They are pretty large for downloading by modem (but with
DSL/Cable they can be downloaded pretty quickly.)
...
but if i could reimburse you for sending them to me on a cd?
It's on me. In private email send me your address and I'll burn and
mail you a disk of the 600 dpi and 120 dpi scans. I do have the
original 600 dpi 24-bit color scans (which is overkill -- next time
I'll do the raw scans for B&W pages at greyscale), but in PNG they
occupy over 5 gigs of disk space! (Don't have a DVD burner yet
otherwise I'd send those, too.)
...
or the 120-dpi versions would work just fine for my project,
the same ones that are on the website, just zipped together.
Unfortunately, since the 120-dpi scans are antialiased greyscale
(while the 600 dpi are bitonal), the difference in size is surprisingly
small.
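
A rough back-of-the-envelope check shows why. The figures below are raw
(uncompressed) bits per square inch of page, and they assume the
greyscale scans use 8 bits per pixel, which is not stated above.

```python
def raw_bits_per_square_inch(dpi, bits_per_pixel):
    """Uncompressed image data for one square inch of page."""
    return dpi * dpi * bits_per_pixel

bitonal_600 = raw_bits_per_square_inch(600, 1)    # 1 bit per pixel
grey_120 = raw_bits_per_square_inch(120, 8)       # assumed 8 bits per pixel
pixel_ratio = (600 * 600) / (120 * 120)           # 25.0 times the pixels
data_ratio = bitonal_600 / grey_120               # 3.125 times the raw data
```

So the 600 dpi bitonal scans have 25 times as many pixels but only about
3 times the raw data, and bitonal text pages generally compress better
in PNG than antialiased greyscale, which narrows the gap further.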

I updated the My Antonia index page to include downloading all the
120-dpi scans in a ZIP file, which is still over 30 megs in size:

   http://www.openreader.org/myantonia/

Bowerbird, I'll be happy to put up a ZML regularized text version of
My Antonia. If I put up plain text, I want the plain text to follow
some regularization rules, and ZML is the only game in town actively
working with etexts (as far as I know at least -- I do recall two
other text regularization schemas, but don't know if the authors are
doing anything with them.)

Jon

Jon Noring