
Geoff wrote:
I've been mulling over ideas for applying Natural Language Processing to catch hard-to-find errors in e-texts. I have made little practical progress, but for some reason it occurred to me to try a few carefully chosen Google searches, all restricted to site:www.gutenberg.org.
"around the comer" returns 17 hits. "turn the comer" returns no hits. "to he" returns 10,700 hits, a fair number of them not representing typos. "have clone", 13 hits. "will bo", 1 hit. "to bo", 38, some legit (often using "Bo" as a proper noun) "went borne" (for "went home"), 5 hits, one of which is legit and the other being four different editions of the same work, all with the same error. "fax away", 1 hit. "coining to", 23 hits, some legit. "he docs", 7, with some repeat editions. "it docs", 9, with repeats, but offset by two hits in one work. "she docs", none.
I don't know what all that proves, but I found it interesting nonetheless.
Fascinating work! I'm not sure what it proves either, but it does show that even though automated post-processing based on NLP can clean up a lot of OCR errors, such processing cannot catch them all, and in more than a few cases it may even "correct" what was not an error (a false positive). (Such tools can still be used to find possible errors, with a human being deciding what to do about each one. But then we are back in the human proofing realm, using tools to make the proofreader's life easier rather than trying to replace the proofreader.) In short, it shows the need for good ol' human proofing, as exemplified by DP, to convert raw OCR texts into finished, high-quality digital texts. The focus should be on giving the proofers better tools to do their job more effectively, not on trying to replace them.

Jon Noring
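The tool-assisted approach described above can be sketched very simply: scan a text against a hand-curated list of known OCR confusion patterns and flag candidate lines for a human to review, rather than changing anything automatically. This is only a minimal illustration; the pattern list below is a hypothetical one drawn from the searches Geoff ran, not an actual DP tool.

```python
import re

# Hypothetical confusion list based on the searches above; each entry
# maps a suspect phrase to its likely intended form. A real tool would
# use a much larger, curated list.
SUSPECT_PHRASES = {
    r"\bthe comer\b": "the corner",
    r"\bhave clone\b": "have done",
    r"\bwill bo\b": "will be",
    r"\bwent borne\b": "went home",
    r"\bhe docs\b": "he does",
}

def flag_suspects(text):
    """Return (line_number, pattern, suggestion) tuples for human review.

    This only *flags* candidates -- a proofreader decides whether each
    hit is a genuine OCR error or a legitimate reading (e.g. "Bo" as a
    proper noun), avoiding automated false "corrections".
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pattern, suggestion in SUSPECT_PHRASES.items():
            if re.search(pattern, line, flags=re.IGNORECASE):
                hits.append((lineno, pattern, suggestion))
    return hits

sample = "He turned the comer and went borne.\nThey have clone it."
for lineno, pattern, suggestion in flag_suspects(sample):
    print(f"line {lineno}: {pattern!r} -> consider {suggestion!r}")
```

Note that the human stays in the loop: the script produces a review queue, not edits, which matches the point about assisting rather than replacing the proofreader.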