
Interesting analysis, thanks. So where do we go from here? With our current set-up, merely generating a list of these and submitting them as corrections would somewhat overload our volunteers. Also, where you find one error in a text, there are usually others close by, so it would be worthwhile to take a closer look at the texts these searches turn up and fix more errors at once. However, that process can be rather tedious, and it does not attract volunteers the way digitizing a new text does.

Andrew

On Fri, 13 May 2005, Geoff Horton wrote:
I've been mulling over ideas for applying Natural Language Processing to catch hard-to-find errors in e-texts. I have made little practical progress, but it occurred to me to try a few carefully chosen Google searches, all restricted to site:www.gutenberg.org.
"around the comer" returns 17 hits. "turn the comer" returns no hits.
"to he" returns 10,700 hits, a fair number of them not representing typos.
"have clone": 13 hits.
"will bo": 1 hit.
"to bo": 38, some legit (often using "Bo" as a proper noun).
"went borne" (for "went home"): 5 hits, one of which is legit; the other four are different editions of the same work, all with the same error.
"fax away": 1 hit.
"coining to": 23 hits, some legit.
"he docs": 7, with some repeat editions.
"it docs": 9, with repeats, but offset by two hits in one work.
"she docs": none.
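Searches like these could also be run locally against downloaded e-texts instead of through Google. Below is a minimal sketch in Python; the pattern list is illustrative (drawn from the searches above, not exhaustive), and the function name `find_scannos` is my own invention.

```python
import re

# Common OCR "scannos" seen in the searches above (illustrative, not exhaustive).
# Each pattern pairs a misrecognized word with enough context to cut false positives.
SCANNOS = [
    r"\bcomer\b",       # "corner" misread
    r"\bto he\b",       # "to be"
    r"\bhave clone\b",  # "have done"
    r"\bwill bo\b",     # "will be"
    r"\bwent borne\b",  # "went home"
    r"\bfax away\b",    # "far away"
    r"\bcoining to\b",  # "coming to"
    r"\b(he|she|it) docs\b",  # "does" misread
]

def find_scannos(text):
    """Return (line_number, line, pattern) for each suspected scanno in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pat in SCANNOS:
            if re.search(pat, line, flags=re.IGNORECASE):
                hits.append((lineno, line.strip(), pat))
    return hits

sample = "They walked around the comer.\nIt docs not matter."
for lineno, line, pat in find_scannos(sample):
    print(f"line {lineno}: {pat!r} matched in: {line}")
```

Since the false-positive rate varies a lot by pattern ("to he" versus "have clone", say), the output is best treated as a review queue for a human proofreader rather than as automatic corrections.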
I don't know what all that proves, but I found it interesting nonetheless.

_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/listinfo.cgi/gutvol-d