
Interesting analysis, thanks. So where do we go from here? With our current set-up, merely generating a list of these and submitting them as corrections would somewhat overload our volunteers. Also, where you find one error in a text, there are usually others close by, so it would be worthwhile to take a closer look at the texts these searches turn up and fix more errors at once. However, that process can be rather tedious, and it does not attract volunteers the way digitizing a new text does.

Andrew

On Fri, 13 May 2005, Geoff Horton wrote:
I've been mulling over ideas for applying Natural Language Processing to catch hard-to-find errors in e-texts. I have made little practical progress, but it occurred to me to try a few carefully chosen Google searches, all restricted to site:www.gutenberg.org.
"around the comer" returns 17 hits. "turn the comer" returns no hits.
"to he" returns 10,700 hits, a fair number of them not representing typos.
"have clone": 13 hits.
"will bo": 1 hit.
"to bo": 38, some legit (often using "Bo" as a proper noun).
"went borne" (for "went home"): 5 hits, one of which is legit; the other four are different editions of the same work, all with the same error.
"fax away": 1 hit.
"coining to": 23 hits, some legit.
"he docs": 7, with some repeat editions.
"it docs": 9, with repeats, but offset by two hits in one work.
"she docs": none.
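Searches like these could also be run locally against downloaded e-texts instead of through Google. Below is a minimal sketch in Python; the pattern list is illustrative (drawn from the searches above, not exhaustive), and the function name `find_scannos` is my own invention.

```python
import re

# Common OCR "scannos" seen in the searches above (illustrative, not exhaustive).
# Each pattern pairs a misrecognized word with enough context to cut false positives.
SCANNOS = [
    r"\bcomer\b",       # "corner" misread
    r"\bto he\b",       # "to be"
    r"\bhave clone\b",  # "have done"
    r"\bwill bo\b",     # "will be"
    r"\bwent borne\b",  # "went home"
    r"\bfax away\b",    # "far away"
    r"\bcoining to\b",  # "coming to"
    r"\b(he|she|it) docs\b",  # "does" misread
]

def find_scannos(text):
    """Return (line_number, line, pattern) for each suspected scanno in text."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for pat in SCANNOS:
            if re.search(pat, line, flags=re.IGNORECASE):
                hits.append((lineno, line.strip(), pat))
    return hits

sample = "They walked around the comer.\nIt docs not matter."
for lineno, line, pat in find_scannos(sample):
    print(f"line {lineno}: {pat!r} matched in: {line}")
```

Since the false-positive rate varies a lot by pattern ("to he" versus "have clone", say), the output is best treated as a review queue for a human proofreader rather than as automatic corrections.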
I don't know what all that proves, but I found it interesting nonetheless.

_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/listinfo.cgi/gutvol-d