
ok, jim, here's some sample files for your tool... i'm using the book "sitka" that rfrank used on his test-site.

here's the original text uploaded by rfrank for his proofers: http://z-m-l.com/go/jimad/sitka0-ocr.txt

and here's the text after the proofers were done with it:

if you can run that through your tool and share its output, that would be great. or i'll do it, when i next encounter a windows box. :+) -bowerbird

OK, I put the output at: http://www.freekindlebooks.org/Dev/StringMatch/BBoutput.txt where I have changed the page separators in the two files to be identically named, because I am assuming you wouldn't want the tool to flag every file-name change.

Please note that the problem domain you are applying the tool to is not the problem domain intended for the tool, so one shouldn't be surprised if you consider the results in some sense "suboptimal." Even these "simple" outputs, however, show how often it really isn't simply a problem of "Choose Word A" or "Choose Word B"; rather, there are often lots of other issues involved at the same time, such as whitespace issues, punctuation issues, line-break issues, etc., which complicate the design of the editor interface (assuming one *wants* to design a custom editor).

Again, the problem this tool was designed to address is when you have two "independent" OCR outputs and you want to compare them to find those words or sections where a human being needs to perform an edit. Or for versioning. The results after human editing would then be expected to be about the quality of the output of a "P1" pass, which would then have to be further carefully checked by more passes. And it is envisioned that even during the "P1" pass the editor is comparing against the page images. When the tool is applied to the problem domain envisioned, you have at least 2X as many errors to deal with, and the resulting errors are more difficult than the ones in your example input files.

Please see http://www.freekindlebooks.org/Dev/StringMatch/hkdiff.txt for what I think is a more reasonable example of the kinds of problems this tool is designed to address, here being used for versioning: an OCR from one edition of a text is compared against an existing but old copy of a human-corrected PG text. On this example, ideally a smart de-hyphenator ought to be run before making the comparison, but it's still interesting to see what happens when this isn't done.
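[For readers following along: the workflow James describes, diffing two versions of a text at the word level after rejoining words hyphenated across line breaks, can be sketched in a few lines of Python using the standard difflib module. This is a minimal illustration under stated assumptions, not the actual tool; the function names and the deliberately naive de-hyphenation regex are invented here for the example.]

    import difflib
    import re

    def dehyphenate(text):
        # Rejoin words broken across line ends, e.g. "care-\nful" -> "careful".
        # A "smart" de-hyphenator would consult a wordlist so that genuine
        # hyphenated compounds are preserved; this regex version is naive.
        return re.sub(r'(\w+)-\n(\w+)', r'\1\2', text)

    def word_diff(text_a, text_b):
        # Compare two versions word by word and report each hunk where
        # they disagree; these are the spots a human needs to inspect.
        a, b = text_a.split(), text_b.split()
        matcher = difflib.SequenceMatcher(None, a, b)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != 'equal':
                print(tag, '| A:', ' '.join(a[i1:i2]), '| B:', ' '.join(b[j1:j2]))

Run over two page-separator-normalized text files, this reports each disagreement as a replace, insert, or delete hunk at word granularity.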

PS RE: http://www.freekindlebooks.org/Dev/StringMatch/hkdiff.txt Attempting to hand-score the kinds of edits one would need to do on hkdiff.txt, it seems to me that an intelligent editor could present "Choose A" vs. "Choose B" alternatives about 85% of the time, whereas the other 15% of the time a more complicated interface would have to be presented, or else the editor just punts, points at the text, and says "You Fix It!" (which is basically the approach my current choice of editor takes 100% of the time ;-) However, if the editor gives a "Choose A" vs. "Choose B" interface, sometimes the editor (and/or the user) is going to be deceived, because what looks like an A/B choice really ISN'T. A hypothetical example: .. one { must | MUST } be careful! And the correct answer is neither A nor B but rather C == _must_
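[A hypothetical sketch of the hunk classification described above: only a clean one-word-for-one-word replacement is offered as a simple A/B choice, and even then the user must be allowed to type in a third answer C, as in the { must | MUST } -> _must_ case. The heuristic and names are illustrative assumptions, not a description of any existing editor.]

    def classify_hunk(tag, a_words, b_words):
        # Offer a "Choose A" vs. "Choose B" dialog only for a clean
        # one-word-for-one-word replacement (roughly the 85% case);
        # anything involving whitespace, multiple words, or pure
        # insertions/deletions falls through to full manual editing.
        if tag == 'replace' and len(a_words) == 1 and len(b_words) == 1:
            return 'A/B choice'   # but always allow a free-typed answer C
        return 'manual edit'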

Hi All,

It is interesting how most people here are in love with their tools. I have noticed how statistics and proofs are stated to show what makes THEIR tools the best. No, James, this is not aimed directly at you, but at all the others. I could easily show you all that diff is the wrong tool; it has been proven inefficient since the 60s, or was that the 70s. But who cares. Your example can only be handled by an ABLE PROOFER. Neither your tool nor anybody else's is better. Come on, people! Get productive.

regards, Keith