Heebee Jeebees on Gutenberg

I've been mulling over ideas for applying Natural Language Processing to catch hard-to-find errors in e-texts. I have made little practical progress, but for some reason it occurred to me to try a few carefully chosen Google searches, all restricted to site:www.gutenberg.org.

"around the comer" returns 17 hits. "turn the comer" returns no hits. "to he" returns 10,700 hits, a fair number of them not representing typos. "have clone", 13 hits. "will bo", 1 hit. "to bo", 38, some legit (often using "Bo" as a proper noun). "went borne" (for "went home"), 5 hits, one of which is legit; the other four are different editions of the same work, all with the same error. "fax away", 1 hit. "coining to", 23 hits, some legit. "he docs", 7, with some repeat editions. "it docs", 9, with repeats, but offset by two hits in one work. "she docs", none.

I don't know what all that proves, but I found it interesting nonetheless.
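To make the search technique concrete, here is a minimal sketch in Python of running the same kind of scanno hunt over a locally downloaded e-text rather than through Google. The pattern list is illustrative, drawn from the pairs above, and the script is one assumed way of doing this, not a tool anyone in the thread describes using.

    import re
    import sys

    # Common OCR confusion patterns, anchored by a neighbouring word so that
    # legitimate uses (e.g. "Bo" as a proper noun) are less likely to match.
    SCANNO_PATTERNS = [
        r"\bthe comer\b",           # "corner" misread as "comer"
        r"\bhave clone\b",          # "done" misread as "clone"
        r"\b(?:will|to) bo\b",      # "be" misread as "bo"
        r"\bwent borne\b",          # "home" misread as "borne"
        r"\bfax away\b",            # "far" misread as "fax"
        r"\bcoining to\b",          # "coming" misread as "coining"
        r"\b(?:he|she|it) docs\b",  # "does" misread as "docs"
    ]

    def find_scannos(path):
        """Print each suspect line, with its line number, for human review."""
        combined = re.compile("|".join(SCANNO_PATTERNS), re.IGNORECASE)
        with open(path, encoding="utf-8", errors="replace") as f:
            for lineno, line in enumerate(f, start=1):
                if combined.search(line):
                    print(f"{path}:{lineno}: {line.rstrip()}")

    if __name__ == "__main__":
        for filename in sys.argv[1:]:
            find_scannos(filename)

Saved as, say, find_scannos.py (a hypothetical name), it would be run as "python find_scannos.py etext.txt". The output is a list of candidates, not corrections; as the replies below stress, a human still has to judge each hit.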

Geoff wrote:
[...] I don't know what all that proves, but I found it interesting nonetheless.
Fascinating work! I'm not sure what it proves, either, but it does show that even though automated post-processing based on NLP can clean up a lot of OCR errors, such processing cannot catch them all, and in more than a few cases it may even "correct" what was not an error (a false positive).

(Such tools can, however, be used to find possible errors, with a human being then deciding what to do about each one. But that puts us back in the human proofing realm, using tools to make the human proofreader's life easier rather than trying to replace the human.)

In short, it shows the need for good ol' human proofing, as exemplified by DP, to convert raw OCR texts into finished, high-quality digital texts. The focus should be on giving the proofers better tools to do their job better and more easily, not on trying to replace them.

Jon Noring

Interesting analysis, thanks. So where to go from here? With our current set-up, merely generating a list of these and submitting them as corrections would somewhat overload our volunteers. Also, where you have one error in a text, there are usually others close by, so it would be worthwhile to take a closer look at the texts indicated by these searches and fix more errors at once; a sketch of one way to triage the texts follows this message. However, this process can be rather tedious, and it does not attract volunteers' attention the way that digitizing a new text does.

Andrew

On Fri, 13 May 2005, Geoff Horton wrote:
[...] I don't know what all that proves, but I found it interesting nonetheless.
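One way to act on Andrew's suggestion of fixing many errors in a text at once: count candidates per file, so that a volunteer opening a text can clean up all of its hits in one pass, starting with the worst offenders. This is a sketch under the assumption that the e-texts are available locally as *.txt files; the "etexts" directory name is hypothetical, and the short pattern list stands in for the fuller one above.

    import re
    from collections import Counter
    from pathlib import Path

    # Illustrative subset; in practice this would be the same pattern
    # list used in the earlier sketch.
    SCANNO_PATTERNS = [
        r"\bthe comer\b",
        r"\bhave clone\b",
        r"\b(?:he|she|it) docs\b",
    ]

    def rank_texts(directory):
        """Count scanno candidates per file so the worst texts are triaged first."""
        combined = re.compile("|".join(SCANNO_PATTERNS), re.IGNORECASE)
        counts = Counter()
        for path in Path(directory).glob("*.txt"):
            text = path.read_text(encoding="utf-8", errors="replace")
            counts[path.name] = len(combined.findall(text))
        for name, n in counts.most_common():
            if n:
                print(f"{n:4d}  {name}")

    rank_texts("etexts")  # hypothetical directory of downloaded e-texts

A text with dozens of hits is probably worth a full human re-read, which fits Andrew's point that errors tend to cluster.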
participants (3)
- Andrew Sly
- Geoff Horton
- Jon Noring