
Geoff wrote:
I've been mulling over ideas for applying Natural Language Processing to catch hard-to-find errors in e-texts. I have made little practical progress, but for some reason it occurred to me to try a few carefully chosen Google searches, all restricted to site:www.gutenberg.org.
"around the comer" returns 17 hits. "turn the comer" returns no hits. "to he" returns 10,700 hits, a fair number of them not representing typos. "have clone", 13 hits. "will bo", 1 hit. "to bo", 38, some legit (often using "Bo" as a proper noun) "went borne" (for "went home"), 5 hits, one of which is legit and the other being four different editions of the same work, all with the same error. "fax away", 1 hit. "coining to", 23 hits, some legit. "he docs", 7, with some repeat editions. "it docs", 9, with repeats, but offset by two hits in one work. "she docs", none.
I don't know what all that proves, but I found it interesting nonetheless.
Fascinating work! I'm not sure what it proves either, but it does show that even though automated post-processing based on NLP can clean up a lot of OCR errors, such processing cannot catch them all, and in more than a few cases it may even "correct" what was not an error (a false positive). (Such tools can still be used to find possible errors, with a human being deciding what to do about each one. But then we are back in the human proofing realm, using tools to make the proofreader's life easier rather than trying to replace the proofreader.) In short, it shows the need for good ol' human proofing, as exemplified by DP, to convert raw OCR texts into finished, high-quality digital texts. The focus should be on giving the proofers better tools to do their job more effectively, not on trying to replace them.

Jon Noring
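The tool-assisted approach described above can be sketched very simply: scan a text against a hand-curated list of known OCR confusion patterns and flag candidate lines for a human to review, rather than changing anything automatically. This is only a minimal illustration; the pattern list below is a hypothetical one drawn from the searches Geoff ran, not an actual DP tool.

```python
import re

# Hypothetical confusion list based on the searches above; each entry
# maps a suspect phrase to its likely intended form. A real tool would
# use a much larger, curated list.
SUSPECT_PHRASES = {
    r"\bthe comer\b": "the corner",
    r"\bhave clone\b": "have done",
    r"\bwill bo\b": "will be",
    r"\bwent borne\b": "went home",
    r"\bhe docs\b": "he does",
}

def flag_suspects(text):
    """Return (line_number, pattern, suggestion) tuples for human review.

    This only *flags* candidates -- a proofreader decides whether each
    hit is a genuine OCR error or a legitimate reading (e.g. "Bo" as a
    proper noun), avoiding automated false "corrections".
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern, suggestion in SUSPECT_PHRASES.items():
            if re.search(pattern, line, flags=re.IGNORECASE):
                hits.append((lineno, pattern, suggestion))
    return hits

sample = "He turned the comer and went borne.\nThey have clone it."
for lineno, pattern, suggestion in flag_suspects(sample):
    print(f"line {lineno}: {pattern!r} -> consider {suggestion!r}")
```

Note that the human stays in the loop: the script produces a review queue, not edits, which matches the point about assisting rather than replacing the proofreader.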