Re: [gutvol-d] Heebee Jeebees on Gutenberg

13 May 2005

On Fri, May 13, 2005 at 03:23:28PM -0400, Geoff Horton wrote:
...
> The key to the original program, and to what I was doing playing with
> the archives, is coming up with the phrases that will find the
> problem. For example, just searching for "clone" turns up 107 hits,
> many of which are legit. Searching for "be clone" turns up 19,
> including repeats, none of which are legit. So I think the basic idea
> has merit, but darned if I know how to move it into a more practical
> stage.

Hint: look hard at the source of jeebies and my discussion of the half
I dropped on the floor in writing my current version. Apply both 
the dumb-brick logic and the too-clever-for-its-own-good logic, 
and you're much closer -- or should that be doser? :-)
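
To make the word-versus-phrase comparison concrete, here is a minimal sketch of that kind of search. It is not code from jeebies or gutcheck; the function name count_hits and the sample text are invented for illustration, and the counts it prints have nothing to do with the 107 and 19 hits quoted from the archives.

```python
import re

def count_hits(text, phrase):
    """Count case-insensitive whole-word occurrences of a word or phrase."""
    # Allow any run of whitespace (including a line break) between words.
    words = [re.escape(w) for w in phrase.split()]
    pattern = r"\b" + r"\s+".join(words) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Invented sample text, not drawn from the archives.
sample = (
    "It must be clone by morning. The clone stepped out of the vat.\n"
    "That will be\nclone soon enough."
)

print(count_hits(sample, "clone"))     # every occurrence, legit or not
print(count_hits(sample, "be clone"))  # only the suspicious context
```

The bare word catches the legitimate "The clone" along with the scannos, while the two-word phrase only catches the contexts where "done" was almost certainly meant.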

Gutcheck and GuiGuts would both have flagged all of your original
stealth scannos -- with, admittedly, a considerable helping of
false positives, depending on the text. Getting rid of the false
positives is, as you've found, more difficult. "clone" and "modem"
are very safe to flag as queries. ear / car, eat / cat and he / be
are more difficult, requiring either heuristics or patterns,
preferably both, and I will bow to anyone who manages tram / train!
and / arid is a particularly interesting case. "and" is about
seven and a half thousand times as common as "arid". Incidentally,
if you follow this through, you'll find that conjunctions are the
bane of any analysis scheme for _all_ words, so there is no
benefit in trying to distinguish the two either by patterns or
by heuristics: just flag "arid" every time it appears, and move on.
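
The "flag it every time" policy is simple enough to sketch. This is an illustration only, not how gutcheck or GuiGuts do it: the ALWAYS_FLAG list and the always_flag function are hypothetical, and the harder pairs such as he / be are deliberately left alone here because they need patterns or heuristics.

```python
import re

# Hypothetical list for illustration: words rare enough to query every time.
ALWAYS_FLAG = {"clone", "modem", "arid"}

def always_flag(text):
    """Return (line_number, word) for every occurrence of an always-flag word."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in re.finditer(r"[A-Za-z]+", line):
            word = match.group().lower()
            if word in ALWAYS_FLAG:
                hits.append((line_number, word))
    return hits

# Invented sample text: one legitimate "arid", one scanno for "and".
sample = "The desert was hot arid dry.\nHe came arid went, and be was tired."
print(always_flag(sample))
```

Both occurrences of "arid" are queried, the legitimate one included, which is the price of the policy. The "be" in the second line is not touched: that is the part that needs the cleverer logic.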

And so it goes. By the way, jeebies also works for other stealth
scanno pairs, such as hut / but, tom / torn and eat / cat, if you
feed it their databases. But as I said in the forums, my
disappointment once I got the current scheme going was that OCR
quality has improved so much that jeebies is not as effective as it
would have been 10 years ago. However, I will notch your interest
up as another vote for me to finish it. :-)
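
For anyone wondering what "feeding it a database" for another pair might look like, here is a rough sketch of the general idea of comparing phrase counts for the two members of a pair. It is an assumption-laden illustration, not the jeebies algorithm or its data format: the function suspicious, the Counter called hut_but and the counts in it are all made up.

```python
from collections import Counter

def suspicious(phrase, word, alternative, counts):
    """True if swapping `word` for `alternative` gives a more common phrase."""
    tokens = phrase.lower().split()
    swapped = [alternative if t == word else t for t in tokens]
    seen = counts[" ".join(tokens)]           # Counter gives 0 if unseen
    seen_swapped = counts[" ".join(swapped)]
    return seen_swapped > seen

# Hypothetical phrase counts; a real database would come from a large,
# clean corpus rather than two hand-typed entries.
hut_but = Counter({"nothing but trouble": 40, "a little hut": 12})

print(suspicious("nothing hut trouble", "hut", "but", hut_but))
print(suspicious("a little hut", "hut", "but", hut_but))
```

The first phrase is queried because the "but" version is far more common in the made-up counts; the second is left alone because "a little hut" is the commoner form.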

jim

Jim Tinsley