
jeff said:
I've been mulling over ideas
alright! someone around here is thinking for once! :+)
for applying Natural Language Processing to catch hard-to-find errors in e-texts.
oh-oh. a lapse into acronym-land. not a good sign of quality thinking... ;+)
I have made little practical progress
i suggest you ditch the acronym; those things just weigh you down... :+)
but for some reason it occurred to me to try a few carefully chosen Google searches, all restricted to site:www.gutenberg.org.
good idea. so ok, let's take a look.
"around the comer" returns 17 hits.
seems like that's a stealth scanno that should be in the dictionary already, yep. is jim tinsley taking these notes down as error-reports?
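(and for anyone who wants to try the dictionary approach at home, here's a minimal sketch in python that runs a phrase list over local files instead of google. the phrase list and the ./etexts/ directory are my own inventions, not jim's actual tool:)

    import re
    from pathlib import Path

    # a few of the stealth-scanno phrases from this thread
    # (illustrative list, not a canonical dictionary)
    PHRASES = ["around the comer", "turn the comer", "have clone",
               "will bo", "went borne", "coining to"]

    def scan_file(path):
        """yield (phrase, context) for each suspect phrase in one e-text."""
        text = path.read_text(encoding="utf-8", errors="replace")
        for phrase in PHRASES:
            for m in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
                start = max(0, m.start() - 30)
                context = text[start:m.end() + 30].replace("\n", " ")
                yield phrase, context

    # assumes plain .txt e-texts sit in ./etexts/ (my assumption)
    for path in sorted(Path("etexts").glob("*.txt")):
        for phrase, context in scan_file(path):
            print(f"{path.name}: [{phrase}] ...{context}...")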
"turn the comer" returns no hits.
i guess we turned the corner on that one...
"to he" returns 10,700 hits, a fair number of them not representing typos.
false alarms are counterproductive; that's the big problem with gutcheck. every false alarm wastes your time. however -- in contrast to misses, which are invisible, and therefore cannot be instructive -- false alarms _do_ give you information about how to improve your process... that is why you need to treasure them, and to study them. (yes, for the acronym-lovers among us, this is signal-detection theory in action. s.d.t. to you.)
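(to make that concrete, here's a toy tally in python. the counts of confirmed errors below are invented for illustration; the point is that you track precision per phrase and let the false alarms steer the refinement:)

    # invented counts for illustration: phrase -> (total hits, confirmed errors)
    counts = {
        "around the comer": (17, 17),     # every hit a real scanno
        "to he":            (10700, 450), # mostly false alarms -- study them
        "to bo":            (38, 20),     # mixed -- refine the query
    }

    for phrase, (hits, true_errors) in counts.items():
        precision = true_errors / hits
        print(f"{phrase!r}: {hits} hits, precision {precision:.1%}")
        # low precision means "refine this query", not "abandon the idea"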
"have clone", 13 hits.
have clone, will reproduce.
"will bo", 1 hit.
roger wilco.
"to bo", 38, some legit (often using "Bo" as a proper noun)
again, you need to examine the false alarms closely, to see if they reveal clues about how to make your search more fine-grained. here, for instance, the "b" in the false alarms is capitalized when it otherwise would not be, indicating a proper noun -- which gives you a reason to weight those instances less heavily, or, if one e-text keeps repeating them, to ignore 'em entirely. this is where most people go wrong. they hit a few false alarms, think "oh no, this isn't working", and give up. that is so wrong! as long as the number of false alarms isn't overwhelming you, it's much more cost-effective to keep the hits you got and then go to work on improving performance on the false alarms...
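(here's that fine-graining sketched in python -- a case-sensitive pass that down-weights a capitalized "Bo" as a likely proper noun. the function name and the sample sentence are mine:)

    import re

    def find_to_bo(text):
        """split "to bo" matches into flagged scannos vs likely proper nouns."""
        flagged, downweighted = [], []
        for m in re.finditer(r"\bto\s+(bo|Bo)\b", text):
            # a capitalized "Bo" mid-sentence is probably a name, not a scanno
            (downweighted if m.group(1) == "Bo" else flagged).append(m.group(0))
        return flagged, downweighted

    sample = "He went to Bo Peep's farm. It seemed to bo a mistake."
    flagged, down = find_to_bo(sample)
    print(len(flagged), "flagged;", len(down), "down-weighted as proper nouns")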
"went borne" (for "went home"), 5 hits, one of which is legit and the other being four different editions of the same work, all with the same error.
at least the _errors_ are consistent! ;+) but do you see a pattern emerging here?
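(and you can catch that consistency automatically: group hits by their surrounding context, so repeat editions of one work collapse into a single distinct error. a sketch, with made-up data:)

    from collections import defaultdict

    def dedupe_hits(hits):
        """hits: (filename, context) pairs; identical context across files
        usually means repeat editions of the same work."""
        by_context = defaultdict(list)
        for filename, context in hits:
            by_context[" ".join(context.lower().split())].append(filename)
        return by_context

    hits = [("ed1.txt", "and so they went borne that night"),
            ("ed2.txt", "and so they went borne that night"),
            ("other.txt", "the sailor went borne on the tide")]  # the legit one
    for context, files in dedupe_hits(hits).items():
        print(f"1 distinct hit in {len(files)} file(s): ...{context}...")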
"fax away", 1 hit.
what should it be? although "fax" _could_ go into a scanno dictionary, standing in for "far", i'd leave it out, because very few public-domain books have the word "fax" in them, do they? ;+) so just look at every instance of "fax", pure and simple.
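(in other words, go word-level instead of phrase-level for this one. a sketch that flags anachronisms outright; the word list is my own guess, not a canonical one:)

    import re

    # words that almost never appear legitimately in public-domain books
    # (this short list is my own guess, not a canonical one)
    ANACHRONISMS = {"fax", "docs"}

    def flag_anachronisms(text):
        words = re.findall(r"[a-z]+", text.lower())
        return sorted(w for w in set(words) if w in ANACHRONISMS)

    print(flag_anachronisms("He looked fax away across the moor."))  # ['fax']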
"coining to", 23 hits, some legit.
any clues as to how to tell the machine to tell the difference? start digging, and dig deep.
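(one clue: legitimate "coining" is nearly always "coining money" or "coining a phrase", so check the very next word. a crude sketch; the whitelist is my guess, and real data should extend it:)

    import re

    # words that legitimately follow "coining" (my guesses; extend from data)
    LEGIT_NEXT = {"money", "a", "an", "new", "words", "phrases", "the"}

    def suspicious_coining(text):
        hits = []
        for m in re.finditer(r"\bcoining\s+(\w+)", text, flags=re.IGNORECASE):
            if m.group(1).lower() not in LEGIT_NEXT:
                hits.append(m.group(0))  # likely "coming" misread as "coining"
        return hits

    print(suspicious_coining("He was coining to see her. She kept coining money."))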
"he docs", 7, with some repeat editions.
"docs" might be in a list of stealth scannos too. but again, let's be realistic. how often does it come up legitimately? and in the rare case it does come up legitimately...
"it docs", 9, with repeats, but offset by two hits in one work.
...it might come up in that book more than one time, right? and by now, you should definitely see a certain pattern...
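(that pattern, in code: a suspect word that keeps recurring inside one e-text argues for its own legitimacy there. a sketch; the threshold of 3 is my guess -- tune it on real data:)

    import re

    def classify_word(text, word, threshold=3):
        """count a suspect word in one e-text and judge it by recurrence."""
        n = len(re.findall(rf"\b{re.escape(word)}\b", text, flags=re.IGNORECASE))
        if n == 0:
            return "absent"
        # repeated use in the same book suggests the author meant it
        return "probably legit" if n >= threshold else "probable scanno"

    print(classify_word("What it docs, it docs well.", "docs"))  # probable scanno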
"she docs", none.
does she? i guess she doesn't...
I don't know what all that proves, but I found it interesting nonetheless.
it proves you're getting smarter about how to find errors in an e-text. (inform jon noring you've started brewing some artificial intelligence!) um, and yes, i know the impetus for this comes from jim's approach. and i could tell you, and jim, how to go about making it even smarter. but my impression is that jim just doesn't much care to listen... so, meanwhile, can you say -- explicitly and consciously -- what you have done here that extends jim's notions in a meaningful and important way? and how is that related to the pattern that revealed itself? (and what is that pattern?) -bowerbird