Re: [gutvol-d] Heebee Jeebees on Gutenberg

Geoff, I find your approach fascinating because I usually think of error-catching in posted texts as a smoothreading sort of task (I can't read anything from PG without ending up sending back a list of errors...I keep Jim busy), or at least a text-by-text verification.

But rather than searching for all errors in one book, you're searching for particular errors across the whole set. The obvious extrapolation is some sort of tool that would blast through the whole archive and create a list of candidate errors, which could then be checked over and eventually turned into a list of corrections for each text. It'd be a lot of work (to develop the tool, sift through the output, and eventually apply the corrections to the texts), but it could result in a significant one-time jump in the quality of the texts that are already posted. And a serious bandwidth hit on the PG server the first time they're all pulled for checking...
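A rough sketch of what such an archive-wide pass might look like, assuming the texts are already mirrored locally; the directory name and the query list here are purely illustrative, not PG conventions:

    # Hypothetical sketch of an archive-wide candidate-error scan.
    # The directory name and the query list are illustrative only.
    import re
    from pathlib import Path

    # Each query is a regex for a phrase that is almost never legitimate.
    QUERIES = [r"\bbe clone\b", r"\barid\b", r"\bmodem\b"]

    def scan_text(path):
        """Return (line number, matched query, line) for every hit in one text."""
        hits = []
        with path.open(encoding="latin-1", errors="replace") as f:
            for lineno, line in enumerate(f, start=1):
                for query in QUERIES:
                    if re.search(query, line, flags=re.IGNORECASE):
                        hits.append((lineno, query, line.rstrip()))
        return hits

    def scan_archive(root):
        """Walk a local mirror of the archive and print candidate errors."""
        for path in sorted(Path(root).rglob("*.txt")):
            for lineno, query, line in scan_text(path):
                print(f"{path}:{lineno}: [{query}] {line}")

    if __name__ == "__main__":
        scan_archive("pg-mirror")   # scanning a local mirror avoids the bandwidth hit

The candidate list it prints would still need human sifting before any corrections went back to the texts.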

Geoff, I find your approach fascinating because I usually think of error-catching in posted texts as a smoothreading sort of task
That's how I do it when I'm working for PG (or PGDP). But I get frustrated when gutcheck finds stuff I should have caught, and that also causes me to worry that I've missed something else of the same sort.

My original plan (which is probably what put me onto this idea in the first place) was inspired by a he/be finding program mentioned in the PGDP forum. That led me to wonder whether I could let a work proof itself (so to speak) by building a list of all pairs of words in the book and then looking at the ones which appeared only once or twice. Unfortunately, applying this to a text I just post-processed (_The King's Achievement_, not yet posted to PG) resulted in 61,920 pairs of words--46,354 of which appeared only once. So much for that theory. I could cut down the totals a bit by being smarter about the end of sentences, but I don't think it would make enough difference to make the idea workable.

The key to the original program, and to what I was doing playing with the archives, is coming up with the phrases that will find the problem. For example, just searching for "clone" turns up 107 hits, many of which are legit. Searching for "be clone" turns up 19, including repeats, none of which are legit. So I think the basic idea has merit, but darned if I know how to move it into a more practical stage.

Geoff
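A minimal reconstruction of the pair-counting experiment described above (my own sketch, not the actual script; the filename is illustrative): it tallies adjacent word pairs and lists the rare ones.

    # Sketch of the "let the work proof itself" idea: count adjacent word
    # pairs and report the ones that appear only once or twice.
    import re
    from collections import Counter

    def rare_pairs(text, max_count=2):
        """Return (pair, count) for word pairs occurring at most max_count times."""
        words = re.findall(r"[A-Za-z']+", text.lower())
        pairs = Counter(zip(words, words[1:]))
        return sorted(
            ((pair, n) for pair, n in pairs.items() if n <= max_count),
            key=lambda item: item[1],
        )

    if __name__ == "__main__":
        with open("kings_achievement.txt", encoding="latin-1") as f:  # illustrative filename
            book = f.read()
        suspects = rare_pairs(book)
        print(f"{len(suspects)} pairs appear once or twice")
        for (first, second), count in suspects[:20]:
            print(f"{count:2d}  {first} {second}")

As the 46,354 singletons show, rarity alone is far too weak a signal: almost every book is full of perfectly good word pairs that occur only once.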

On Fri, 13 May 2005, Geoff Horton wrote:
Geoff, I find your approach fascinating because I usually think of error-catching in posted texts as a smoothreading sort of task
That's how I do it when I'm working for PG (or PGDP). But I get frustrated when gutcheck finds stuff I should have caught, and that also causes me to worry that I've missed something else of the same sort.
Geoff, don't feel bad. In my experience, there are _always_ at least one or two small things that you've overlooked that someone else can find, if they try hard enough.

A while ago, Jim oversaw a "cooperative reproofing" experiment. He would select a text already in the archive, and everyone who wanted to participate would proofread it in whatever way they wished and correct any errors found. Jim would then collate the results, make a single improved text, and assign a "score" to each contributor based on the number of errors found minus the number of "false positives". We went through a few texts this way.

The interesting point is that, apparently, with a group of 5-7 people individually examining the same text, each person usually found at least one error that none of the others did. (Although I suppose you could look more closely at the fine line between what constitutes an "error" and what is only a formatting difference, etc.)

Andrew

On Fri, May 13, 2005 at 03:23:28PM -0400, Geoff Horton wrote:
The key to the original program, and to what I was doing playing with the archives, is coming up with the phrases that will find the problem. For example, just searching for "clone" turns up 107 hits, many of which are legit. Searching for "be clone" turns up 19, including repeats, none of which are legit. So I think the basic idea has merit, but darned if I know how to move it into a more practical stage.
Hint: look hard at the source of jeebies and my discussion of the half I dropped on the floor in writing my current version. Apply both the dumb-brick logic and the too-clever-for-its-own-good logic, and you're much closer -- or should that be doser? :-)

Gutcheck and GuiGuts would both have flagged all of your original stealth scannos -- with, admittedly, a considerable helping of false positives, depending on the text. Getting rid of the false positives is, as you've found, more difficult. "clone" and "modem" are very safe to flag as queries; ear / car, eat / cat and he / be are more difficult, requiring either heuristics or patterns, preferably both; and I will bow to anyone who manages tram / train!

and / arid is a particularly interesting case: "and" is about seven and a half thousand times as common as "arid", and, incidentally, if you follow this through, you'll find that conjunctions are the bane of any analysis scheme for _all_ words. So there is no benefit in trying to distinguish these two either by patterns or heuristics: just flag "arid" every time it appears, and move on.

And so it goes. Jeebies also works for other stealth scanno pairs, if you feed it their databases, by the way; hut / but, tom / torn, eat / cat and so on.

But as I said in the forums, my disappointment once I got the current scheme going was that OCR quality has improved so much, it's not as effective as it would have been 10 years ago. However, I will notch your interest up as another vote for me to finish it. :-)

jim
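The distinction drawn here -- some words are safe to flag unconditionally, while others only make sense with context -- can be sketched roughly as follows. The word lists and the "be"-for-"he" patterns are examples only, not the actual gutcheck or jeebies data.

    # Illustrative two-tier scanno check: some words are rare enough to flag
    # on sight, others (he/be) need context from the surrounding words.
    import re

    # Tier 1: words so unlikely in most books that every hit is worth a query.
    ALWAYS_FLAG = {"clone", "modem", "arid"}

    # Tier 2: patterns where "be" is very likely a misread "he".
    BE_FOR_HE_PATTERNS = [r"\bbe said\b", r"\bbe was\b", r"\bbe had\b"]

    def flag_line(line):
        """Return a list of query strings for one line of text."""
        queries = []
        for word in re.findall(r"[A-Za-z']+", line.lower()):
            if word in ALWAYS_FLAG:
                queries.append(f'always-flag word "{word}"')
        for pattern in BE_FOR_HE_PATTERNS:
            if re.search(pattern, line, flags=re.IGNORECASE):
                queries.append(f"possible he/be: {pattern}")
        return queries

    if __name__ == "__main__":
        for query in flag_line("Then be said it was an arid stretch of road."):
            print(query)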

And so it goes. Jeebies also works for other stealth scanno pairs, if you feed it their databases, by the way; hut / but, tom / torn, eat / cat and so on.
I was trying/hoping to come up with a way to catch such things without having to build the databases (Larry Wall says laziness is a programming virtue, after all). In particular, I was (and am) looking for a way to deal with the scannos where both words are common--the thought of going through a text looking at each instance of "is" to determine whether it should be "as", and vice-versa, is markedly unappealing. I really can't see vocab lists picking that up. I will go back and look at the source, though I'm not a C expert by any stretch.
But as I said in the forums, my disappointment once I got the current scheme going was that OCR quality has improved so much, it's not as effective as it would have been 10 years ago. However, I will notch your interest up as another vote for me to finish it. :-)
Please do. I think the better OCR makes the problem worse, not better, because it makes the signal to noise ratio (viewing errors as the signal, which admittedly is weird) so low that it's really, really easy to see what _should_ be there rather than what actually as. Is. :)

Geoff

On Fri, May 13, 2005 at 04:38:31PM -0400, Geoff Horton wrote:
I will go back and look at the source, though I'm not a C expert by any stretch.
Well, the point is that it uses a three-word, not two-word, phrasebook where possible, and (what I forgot was not in that version of the source) clues from the sentence structure and "nearby" words.
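In rough outline -- a guess at the shape of the approach from the description above, not the actual jeebies data structures -- a three-word phrasebook check for he/be might look like this; the phrase counts are invented for illustration.

    # Rough outline of a three-word phrasebook check for he/be.
    # The phrase counts below are invented; real data would be gathered
    # from a large body of known-good texts.
    import re

    HE_PHRASES = {("then", "he", "said"): 412, ("when", "he", "was"): 387}
    BE_PHRASES = {("to", "be", "sure"): 95, ("would", "be", "a"): 310}

    def query_he_be(text):
        """Flag he/be occurrences whose three-word context fits the other word better."""
        words = re.findall(r"[A-Za-z']+", text.lower())
        queries = []
        for i in range(1, len(words) - 1):
            if words[i] not in ("he", "be"):
                continue
            context = (words[i - 1], words[i], words[i + 1])
            other = "be" if words[i] == "he" else "he"
            swapped = (context[0], other, context[2])
            seen_as_is = (HE_PHRASES if words[i] == "he" else BE_PHRASES).get(context, 0)
            seen_swapped = (HE_PHRASES if other == "he" else BE_PHRASES).get(swapped, 0)
            if seen_swapped > seen_as_is:   # the other word fits this context better
                queries.append(context)
        return queries

    if __name__ == "__main__":
        print(query_he_be("Then be said nothing at all."))   # flags ('then', 'be', 'said')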
But as I said in the forums, my disappointment once I got the current scheme going was that OCR quality has improved so much, it's not as effective as it would have been 10 years ago. However, I will notch your interest up as another vote for me to finish it. :-)
Please do. I think the better OCR makes the problem worse, not better, because it makes the signal to noise ratio (viewing errors as the signal, which admittedly is weird) so low that it's really, really easy to see what _should_ be there rather than what actually as. Is. :)
Ah! So you're a fan of Pauline's "assisi"! :-)

jim

Jim Tinsley wrote:
Ah! So you're a fan of Pauline's "assisi"! :-)
:) For less frequently occurring scannos, I am a huge fan of guiguts (gooey front end to gutcheck) which has inbuilt & completely customisable scanno highlighting along with so many other features which assist processing texts that I wouldn't process a text without it.

You can have a look at a guiguts screen shot here, showing the scanno highlighting: http://www.pgdp.net/squirrels/guiguts_screen.jpg

Guiguts home page: http://mywebpages.comcast.net/thundergnat/guiguts.html

& there's lots of guiguts discussion & help on the DP Forums, along with a collection thread for the scannos which post-processors find when working on texts. Collected common scannos are added to the lists which guiguts uses by default (or you can modify the scanno lists yourself if you wish).

Cheers,
P

--
Help digitise public domain books: Distributed Proofreaders: http://www.pgdp.net
"Preserving history one page at a time."
Set free dead-tree books: http://bookcrossing.com/referral/servalan
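The scanno-highlighting idea -- a user-editable list of suspect words checked against the text -- is simple enough to sketch in a few lines. This is only an illustration of the concept; guiguts itself is written in Perl and works differently.

    # Concept sketch of customisable scanno highlighting, loosely modelled
    # on the feature described above (not guiguts' actual implementation).
    import re

    # A user-editable scanno list: each suspect word maps to its likely intended word.
    SCANNO_LIST = {"arid": "and", "modem": "modern", "tbe": "the", "bis": "his"}

    def highlight_scannos(line):
        """Wrap suspected scannos in >>...<< markers so they stand out."""
        def mark(match):
            word = match.group(0)
            return f">>{word}<< (for '{SCANNO_LIST[word.lower()]}'?)"
        pattern = r"\b(" + "|".join(map(re.escape, SCANNO_LIST)) + r")\b"
        return re.sub(pattern, mark, line, flags=re.IGNORECASE)

    if __name__ == "__main__":
        print(highlight_scannos("He crossed tbe arid plain with bis brother."))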

:) For less frequently occurring scannos, I am a huge fan of guiguts (gooey front end to gutcheck) which has inbuilt & completely customisable scanno highlighting along with so many other features which assist processing texts that I wouldn't process a text without it.
I use it and second your endorsement. The infrequent scannos aren't really the problem. It's the ones where the potential mis-scan is itself a common word that are the problem.

Geoff
participants (5)
- Andrew Sly
- Geoff Horton
- Jim Tinsley
- Jon Niehof
- Pauline