
And so it goes. Jeebies also works for other stealth scanno pairs, if you feed it their databases, by the way: hut / but, tom / torn, eat / cat, and so on.
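For anyone wondering what "feeding it a database" for a pair might look like in principle, here is a rough sketch. To be clear, this is not Jeebies' actual code or data format: the function names, the three-word context window, and the `ratio` threshold are all assumptions made up for illustration. The idea is just to count how often each word of a pair appears between a given previous and next word in known-good text, then flag spots where the other word of the pair is much more common in that same context.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase words only; punctuation is ignored (a simplification).
    return re.findall(r"[a-z']+", text.lower())

def build_context_counts(corpus, pair):
    """Count (previous, word, next) contexts for each word of the pair."""
    words = tokenize(corpus)
    counts = Counter()
    for prev, word, nxt in zip(words, words[1:], words[2:]):
        if word in pair:
            counts[(prev, word, nxt)] += 1
    return counts

def flag_suspects(text, pair, counts, ratio=3.0):
    """Return (index, found, suggested) where the other word fits far better."""
    a, b = pair
    words = tokenize(text)
    suspects = []
    for i in range(1, len(words) - 1):
        word = words[i]
        if word not in pair:
            continue
        other = b if word == a else a
        seen = counts[(words[i - 1], word, words[i + 1])]
        alt = counts[(words[i - 1], other, words[i + 1])]
        # Hypothetical rule: flag only when the alternative has been seen
        # in this context and clearly outweighs the word actually present.
        if alt > 0 and alt >= ratio * (seen + 1):
            suspects.append((i, word, other))
    return suspects

# Toy "database" built from a tiny stand-in for known-good text.
corpus = "she said but he was late " * 4
counts = build_context_counts(corpus, ("hut", "but"))
print(flag_suspects("she said hut he was late", ("hut", "but"), counts))
```

A real database would need a large corpus, and the catch is the one discussed in this thread: for pairs where both words are common, the two context counts can be close enough that a threshold like this stops being decisive.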
I was trying/hoping to come up with a way to catch such things without having to build the databases (Larry Wall says laziness is a programming virtue, after all). In particular, I was (and am) looking for a way to deal with the scannos where both words are common--the thought of going through a text looking at each instance of "is" to determine whether it should be "as", and vice versa, is markedly unappealing. I really can't see vocab lists picking that up. I will go back and look at the source, though I'm not a C expert by any stretch.
But as I said in the forums, my disappointment once I got the current scheme going was that OCR quality has improved so much that it's not as effective as it would have been 10 years ago. However, I will notch your interest up as another vote for me to finish it. :-)
Please do. I think the better OCR makes the problem worse, not better, because it makes the signal-to-noise ratio (viewing errors as the signal, which admittedly is weird) so low that it's really, really easy to see what _should_ be there rather than what actually as. Is. :)
Geoff