>>here's the original text uploaded by rfrank for his proofers:
>> http://z-m-l.com/go/jimad/sitka0-ocr.txt
>
>and here's the text after the proofers were done with it:
>> http://z-m-l.com/go/jimad/sitka1-pp.txt
>
>if you can run that through your tool and share its output,
>that would be great.

OK, I put the output at:

http://www.freekindlebooks.org/Dev/StringMatch/BBoutput.txt

where I have changed the page separators in the two files to be identically named, because I am assuming you wouldn’t want to see every file name change flagged as a difference. Please note that the problem domain you are applying the tool to is not the problem domain the tool was intended for, so one shouldn’t be surprised if you consider the results in some sense “suboptimal.” Even these “simple” outputs, however, show how often it really isn’t simply a problem of “Choose Word A” or “Choose Word B”; rather, there are often lots of other issues involved at the same time, such as whitespace issues, punctuation issues, line break issues, etc., which complicate the design of the editor interface, assuming one *wants* to design a custom editor.

Again, the problem this tool was designed to address is the one where you have two “independent” OCR outputs and you want to compare them to find those words or sections where a human being needs to perform an edit. It can also be used for versioning. The results after human editing would then be expected to be about the quality of the output of a “P1” pass, which would then have to be carefully checked by further passes. It is also envisioned that even during the “P1” pass the editor is comparing against the page images. When the tool is applied to the problem domain envisioned, you have at least 2X as many errors to deal with, and the resulting errors are more difficult than the ones in your example input files.
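
To make the idea of comparing two OCR outputs concrete, here is a minimal sketch in Python. It is not the code of the tool discussed here; it only assumes that a word-level alignment (done with the standard library's difflib) is a reasonable stand-in, and the function name flag_disagreements and the sample strings are made up for illustration.

```python
import difflib

def flag_disagreements(text_a, text_b):
    """Return (words_from_a, words_from_b) pairs where the two texts disagree.

    Hypothetical helper for illustration only. Splitting on whitespace
    throws away line-break and spacing differences, which a real
    comparison tool would still have to deal with.
    """
    a = text_a.split()
    b = text_b.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    flagged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "equal" runs are where both OCR outputs agree; anything else
        # (replace, delete, insert) is a spot a human should look at.
        if tag != "equal":
            flagged.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return flagged

# Made-up sample data standing in for two independent OCR runs.
ocr_a = "The quick brown fox jumps over the lazy dog"
ocr_b = "The quick hrown fox jumps over the 1azy dog"

for left, right in flag_disagreements(ocr_a, ocr_b):
    print(repr(left), "<->", repr(right))
```

Note that where the two outputs agree on the same wrong word nothing is flagged, which is one reason the result would still need further passes against the page images.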

Please see at:

http://www.freekindlebooks.org/Dev/StringMatch/hkdiff.txt

what I think is a more reasonable example of the kinds of problems this tool is designed to address. Here it is being used for versioning: an OCR from one edition of a text is being compared to an existing but old copy of a human-corrected PG text. For this example, ideally a smart de-hyphenator ought to be run before making the comparison, but it’s still interesting to see what happens when this isn’t done.
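
For what it is worth, here is a rough sketch of what a de-hyphenation step before the comparison might look like. It is only an illustration under stated assumptions, not something that was run on the files above: the function name dehyphenate and the optional known_words set are hypothetical, and the "smart" part is reduced to a simple word-list lookup.

```python
import re

# A word fragment, a hyphen at end of line, then the next line's first fragment.
_HYPHEN_BREAK = re.compile(r"(\w+)-[ \t]*\n[ \t]*(\w+)")

def dehyphenate(text, known_words=None):
    """Rejoin words that were split by a hyphen at a line break.

    Hypothetical helper for illustration only. If known_words (a set of
    lowercase words) is given, the fragments are fused only when the
    result is a known word; otherwise the hyphen is kept, on the guess
    that it belongs to a real compound such as "well-known". With no
    word list, every end-of-line hyphen is simply removed.
    """
    def join(match):
        first, second = match.group(1), match.group(2)
        joined = first + second
        if known_words is None or joined.lower() in known_words:
            return joined
        return first + "-" + second

    return _HYPHEN_BREAK.sub(join, text)

# Made-up sample data.
sample = "a well-\nknown inter-\nesting case"
print(dehyphenate(sample, {"interesting"}))
```

The line break itself is dropped in both cases, so this only suits a comparison that works word by word rather than line by line.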