
On Fri, May 13, 2005 at 03:23:28PM -0400, Geoff Horton wrote:
The key to the original program, and to what I was doing playing with the archives, is coming up with the phrases that will find the problem. For example, just searching for "clone" turns up 107 hits, many of which are legit. Searching for "be clone" turns up 19, including repeats, none of which are legit. So I think the basic idea has merit, but darned if I know how to move it into a more practical stage.
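As a rough illustration of the word-versus-phrase search described above, here is a minimal Python sketch. It is not the original program; the function name `count_hits` and the sample text are invented for the example. It only shows why a phrase such as "be clone" narrows the hits compared with the bare word "clone".

```python
import re

def count_hits(text, phrase):
    """Count case-insensitive whole-word occurrences of a word or phrase.

    Words in the phrase may be separated by any whitespace, so a phrase
    that is split across a line break in the text is still found.
    """
    words = [re.escape(w) for w in phrase.split()]
    pattern = r"\b" + r"\s+".join(words) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Invented sample: one scanno ("be clone" for "be done") and one
# legitimate use of "clone".
sample = (
    "The work would soon be clone. "
    "Each clone of the plant was grown from a cutting."
)

print(count_hits(sample, "clone"))     # bare word: scanno and legit use alike
print(count_hits(sample, "be clone"))  # phrase: only the scanno context
```

The bare-word search cannot tell the two uses apart, while the phrase search picks out only the suspicious context, which is the effect behind the 107-versus-19 hit counts.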
Hint: look hard at the source of jeebies and my discussion of the half I dropped on the floor in writing my current version. Apply both the dumb-brick logic and the too-clever-for-its-own-good logic, and you're much closer -- or should that be doser? :-)

Gutcheck and GuiGuts would both have flagged all of your original stealth scannos -- with, admittedly, a considerable helping of false positives, depending on the text. Getting rid of the false positives is, as you've found, more difficult.

"clone" and "modem" are very safe to flag as queries. ear / car, eat / cat and he / be are more difficult, requiring either heuristics or patterns, preferably both, and I will bow to anyone who manages tram / train!

and / arid is a particularly interesting case: "and" is about seven and a half thousand times as common as "arid". Incidentally, if you follow this through, you'll find that conjunctions are the bane of any analysis scheme for _all_ words. So there is no benefit in trying to distinguish these either by patterns or heuristics: just flag "arid" every time it appears, and move on. And so it goes.

Jeebies also works for other stealth scanno pairs, if you feed it their databases, by the way: hut / but, tom / torn, eat / cat and so on. But as I said in the forums, my disappointment once I got the current scheme going was that OCR quality has improved so much that it's not as effective as it would have been 10 years ago.

However, I will notch your interest up as another vote for me to finish it. :-)

jim
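
To make the two tiers described above concrete, here is a minimal Python sketch. It is not jeebies' or gutcheck's actual logic, and the word lists are hypothetical examples chosen for illustration. Words that are almost never legitimate are flagged unconditionally, while a pair like he / be is flagged only in a suspicious context.

```python
import re

# Hypothetical "always flag" list: words rare enough in ordinary texts
# that every occurrence is worth a query.
ALWAYS_FLAG = {"clone", "modem", "arid"}

# Hypothetical (previous word, word) patterns suggesting a he / be mix-up.
# A real checker would draw these from a phrase database, not a short list.
SUSPECT_BIGRAMS = {("to", "he"), ("be", "said"), ("be", "was")}

def flag_scannos(text):
    """Return (word index, matched text, reason) for each query raised."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    queries = []
    for i, word in enumerate(words):
        # Tier 1: flag unconditionally, no context analysis.
        if word in ALWAYS_FLAG:
            queries.append((i, word, "always flag"))
        # Tier 2: flag only when the preceding word makes it suspicious.
        if i > 0 and (words[i - 1], word) in SUSPECT_BIGRAMS:
            queries.append((i, words[i - 1] + " " + word, "suspect pattern"))
    return queries

for query in flag_scannos("It was an arid place, and he wanted to he alone."):
    print(query)
```

In this sketch the first "he" passes unflagged because "and he" is not a listed pattern, while "to he" is queried. A pair like tram / train would need far more context than a two-word pattern can provide.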