Heebee Jeebees on Gutenberg

I've been mulling over ideas for applying Natural Language Processing to catch hard-to-find errors in e-texts. I have made little practical progress, but for some reason it occurred to me to try a few carefully chosen Google searches, all restricted to site:www.gutenberg.org.

"around the comer" returns 17 hits. "turn the comer" returns no hits. "to he" returns 10,700 hits, a fair number of them not representing typos. "have clone", 13 hits. "will bo", 1 hit. "to bo", 38, some legit (often using "Bo" as a proper noun). "went borne" (for "went home"), 5 hits, one of which is legit; the other four are different editions of the same work, all with the same error. "fax away", 1 hit. "coining to", 23 hits, some legit. "he docs", 7, with some repeat editions. "it docs", 9, with repeats, but offset by two hits in one work. "she docs", none.

I don't know what all that proves, but I found it interesting nonetheless.
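To make the search technique concrete, here is a minimal sketch in Python of running the same kind of scanno hunt over a locally downloaded e-text rather than through Google. The pattern list is illustrative, drawn from the pairs above, and the script is one assumed way of doing this, not a tool anyone in the thread describes using.

    import re
    import sys

    # Common OCR confusion patterns, anchored by a neighbouring word so that
    # legitimate uses (e.g. "Bo" as a proper noun) are less likely to match.
    SCANNO_PATTERNS = [
        r"\bthe comer\b",           # "corner" misread as "comer"
        r"\bhave clone\b",          # "done" misread as "clone"
        r"\b(?:will|to) bo\b",      # "be" misread as "bo"
        r"\bwent borne\b",          # "home" misread as "borne"
        r"\bfax away\b",            # "far" misread as "fax"
        r"\bcoining to\b",          # "coming" misread as "coining"
        r"\b(?:he|she|it) docs\b",  # "does" misread as "docs"
    ]

    def find_scannos(path):
        """Print each suspect line, with its line number, for human review."""
        combined = re.compile("|".join(SCANNO_PATTERNS), re.IGNORECASE)
        with open(path, encoding="utf-8", errors="replace") as f:
            for lineno, line in enumerate(f, start=1):
                if combined.search(line):
                    print(f"{path}:{lineno}: {line.rstrip()}")

    if __name__ == "__main__":
        for filename in sys.argv[1:]:
            find_scannos(filename)

Saved as, say, find_scannos.py (a hypothetical name), it would be run as "python find_scannos.py etext.txt". The output is a list of candidates, not corrections; as the replies below stress, a human still has to judge each hit.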

Geoff wrote:
[...] I don't know what all that proves, but I found it interesting nonetheless.
Fascinating work! I'm not sure what it proves, either, but it does show that even though automated post-processing based on NLP can clean up a lot of OCR errors, such processing cannot catch them all, and in more than a few cases it may even "correct" what was not an error (a false positive).

(Such tools can, however, be used to find possible errors, with a human being then deciding what to do about each one. But that puts us back in the human proofing realm, using tools to make the human proofreader's life easier rather than trying to replace the human.)

In short, it shows the need for good ol' human proofing, as exemplified by DP, to convert raw OCR texts into finished, high-quality digital texts. The focus should be on giving the proofers better tools to do their job better and more easily, not on trying to replace them.

Jon Noring

Interesting analysis, thanks. So where to go from here? With our current set-up, merely generating a list of these and submitting them as corrections would somewhat overload our volunteers. Also, where you have one error in a text, there are usually others close by, so it would be worthwhile to take a closer look at the texts indicated by these searches and fix more errors at once; a sketch of one way to triage the texts follows this message. However, this process can be rather tedious, and it does not attract volunteers' attention the way that digitizing a new text does.

Andrew

On Fri, 13 May 2005, Geoff Horton wrote:
[...] I don't know what all that proves, but I found it interesting nonetheless.
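One way to act on Andrew's suggestion of fixing many errors in a text at once: count candidates per file, so that a volunteer opening a text can clean up all of its hits in one pass, starting with the worst offenders. This is a sketch under the assumption that the e-texts are available locally as *.txt files; the "etexts" directory name is hypothetical, and the short pattern list stands in for the fuller one above.

    import re
    from collections import Counter
    from pathlib import Path

    # Illustrative subset; in practice this would be the same pattern
    # list used in the earlier sketch.
    SCANNO_PATTERNS = [
        r"\bthe comer\b",
        r"\bhave clone\b",
        r"\b(?:he|she|it) docs\b",
    ]

    def rank_texts(directory):
        """Count scanno candidates per file so the worst texts are triaged first."""
        combined = re.compile("|".join(SCANNO_PATTERNS), re.IGNORECASE)
        counts = Counter()
        for path in Path(directory).glob("*.txt"):
            text = path.read_text(encoding="utf-8", errors="replace")
            counts[path.name] = len(combined.findall(text))
        for name, n in counts.most_common():
            if n:
                print(f"{n:4d}  {name}")

    rank_texts("etexts")  # hypothetical directory of downloaded e-texts

A text with dozens of hits is probably worth a full human re-read, which fits Andrew's point that errors tend to cluster.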
participants (3)
- Andrew Sly
- Geoff Horton
- Jon Noring