>>here's the original text uploaded by rfrank for his proofers:
>> http://z-m-l.com/go/jimad/sitka0-ocr.txt
>
>and here's the text after the proofers were done with it:
>> http://z-m-l.com/go/jimad/sitka1-pp.txt
>
>if you can run that through your tool and share its output,
>that would be great.
OK, I put the output at:
http://www.freekindlebooks.org/Dev/StringMatch/BBoutput.txt
where I have changed the page separators in the two files to be
identically named, because I am assuming you wouldn’t want to see every
file name change reported as a difference. Please note that the problem domain you are
applying the tool to is not the problem domain the tool was intended for –
so one shouldn’t be surprised if you consider the results in some
sense “suboptimal.” Even these “simple” outputs,
however, show how often it really isn’t simply a problem of “Choose
Word A” or “Choose Word B”; rather, there are often lots of
other issues involved at the same time, such as whitespace issues, punctuation issues,
line break issues, etc., which complicate the design of the editor interface –
assuming one *wants* to design a custom editor.
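
To make the "not just Word A or Word B" point concrete, here is a minimal Python sketch that labels how two differing fragments differ. It is not the tool's actual code; the function name and the category labels are made up for illustration, and it only knows about ASCII punctuation.

```python
import string

# ASCII punctuation only; curly quotes and dashes would need to be added.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def _drop_whitespace(s):
    """Remove every whitespace character, including line breaks."""
    return "".join(s.split())


def classify_difference(a, b):
    """Return a rough label for how fragment a differs from fragment b."""
    if a == b:
        return "same"
    # Identical once spaces and line breaks are ignored.
    if _drop_whitespace(a) == _drop_whitespace(b):
        return "whitespace"
    # Identical once punctuation (and whitespace) is ignored.
    if _drop_whitespace(a.translate(_PUNCT_TABLE)) == _drop_whitespace(
        b.translate(_PUNCT_TABLE)
    ):
        return "punctuation"
    # Anything else is a genuine word-level disagreement.
    return "word"
```

An editor interface could use labels like these to decide which differences to resolve automatically and which to show to a human.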
Again, the problem this tool was designed to address is the case where
you have two “independent” OCR outputs and you want to compare them
to find those words or sections where a human being needs to perform an
edit, or where you are doing versioning. The results after human editing would then
be expected to be about the quality of the output of a “P1” pass,
which would then have to be carefully checked by further passes. And
it is envisioned that even during the “P1” pass the editor is
comparing to the page images. When applied to the problem domain envisioned,
you have at least 2X as many errors to deal with, and the resulting errors are
more difficult than the ones in your example input files.
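
As a rough illustration of the two-OCR comparison described above, the following Python sketch uses the standard library's difflib to list the places where two texts disagree word by word. This is an assumption about one way to do it, not how the tool itself works, and the function name is hypothetical.

```python
import difflib


def flag_disagreements(ocr_a, ocr_b):
    """Return (text_from_a, text_from_b) pairs wherever the two inputs differ."""
    words_a = ocr_a.split()
    words_b = ocr_b.split()
    # autojunk=False so that frequent words are not treated as junk on long texts.
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    flagged = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag != "equal":
            # One side is empty for a pure insertion or deletion.
            flagged.append((" ".join(words_a[a0:a1]), " ".join(words_b[b0:b1])))
    return flagged
```

Each flagged pair is a spot where a human would need to look at the page image, since neither OCR output can be trusted on its own. Splitting on whitespace also hides the line break differences, which may or may not be what you want.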
Please see:
http://www.freekindlebooks.org/Dev/StringMatch/hkdiff.txt
for what I think is a more reasonable example of the kinds of problems
this tool is designed to address – here being used for versioning –
where an OCR from one edition of a text is being compared to an existing but old copy
of a human-corrected PG text. For this example, ideally a smart de-hyphenator
ought to be run before making the comparison, but it’s still interesting to see
what happens when this isn’t done.
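
For what a "smart" de-hyphenator might look like, here is a small Python sketch under stated assumptions: it rejoins a word split across a line end only when the joined form appears in a word list, and otherwise keeps the hyphen. The function name and the `known_words` parameter are hypothetical, and a real one would need a proper dictionary.

```python
import re

# A word fragment, a hyphen at the end of a line, then the rest of the word.
_SPLIT_WORD = re.compile(r"(\w+)-\n(\w+)")


def dehyphenate(text, known_words):
    """Rejoin words hyphenated across line ends, guided by a set of lowercase words."""

    def join(match):
        left, right = match.group(1), match.group(2)
        if (left + right).lower() in known_words:
            return left + right  # e.g. "won-\nderful" becomes "wonderful"
        return left + "-" + right  # unknown joined form: keep the hyphen

    # The line break inside each match is dropped, so the two lines merge.
    return _SPLIT_WORD.sub(join, text)
```

Running something like this on both inputs before the comparison should remove many of the differences that come only from one edition breaking lines differently from the other.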