[gutvol-d] Re: [SPAM] re: New Tool "pgdiff"

17 Mar 2010

here's the original text uploaded by rfrank for his proofers:
  http://z-m-l.com/go/jimad/sitka0-ocr.txt
and here's the text after the proofers were done with it:
http://z-m-l.com/go/jimad/sitka1-pp.txt
if you can run that through your tool and share its output,
that would be great.

OK, I put the output at:

http://www.freekindlebooks.org/Dev/StringMatch/BBoutput.txt

where I have changed the page separators on the two files to be identically
named, because I am assuming you wouldn't want to find all the file name
changes.  Please note that the problem domain you are applying the tool to
is not the same problem domain intended for the tool - so one shouldn't be
surprised then if you consider the results in some sense "suboptimal."  Even
these "simple" outputs, however, show how often it really isn't simply a
problem of "Choose Word A" or "Choose Word B"; rather, there are often lots
of other issues involved at the same time, such as whitespace issues,
punctuation issues, line break issues, etc., which complicate the design of
the editor interface - assuming one *wants* to design a custom editor.
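
To make the "other issues" concrete, here is a rough Python sketch that sorts a
single difference into line-break, whitespace, punctuation, or actual word
change.  This is only an illustration, not how pgdiff works; the function name
and the category labels are made up for the example.

```python
import string

# Translation table that deletes ASCII punctuation characters.
_STRIP_PUNCT = str.maketrans("", "", string.punctuation)

def classify_difference(a: str, b: str) -> str:
    """Return a rough label for how snippet a differs from snippet b.

    Hypothetical helper for illustration only; labels are assumptions.
    """
    if a == b:
        return "same"
    # Identical once every newline is treated as a single space.
    if a.replace("\n", " ") == b.replace("\n", " "):
        return "line-break"
    # Identical word sequence, so only the spacing differs.
    if a.split() == b.split():
        return "whitespace"
    # Identical word sequence once ASCII punctuation is removed.
    if a.translate(_STRIP_PUNCT).split() == b.translate(_STRIP_PUNCT).split():
        return "punctuation"
    # Anything else needs a human to choose between the readings.
    return "word"
```

Only the last category is a true "Choose Word A or Word B" case; an editor
interface would presumably want to handle the others differently.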

Again, the problem this tool was designed to address is the case where you
have two "independent" OCR outputs and you want to compare them to find those
words or sections where a human being needs to perform an edit, or where you
want to do versioning.  The results after human editing would then be expected
to be about the quality of the output of a "P1" pass, which would then have to
be carefully checked by further passes.  And it is envisioned that even
during the "P1" pass the editor is comparing to the page images.  When
applied to the problem domain envisioned, you have at least 2X as many errors
to deal with, and the resulting errors are more difficult than the ones in
your example input files.
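
As a minimal sketch of that kind of comparison, the Python below uses the
standard library's difflib to line up two OCR outputs word by word and report
only the spots where they disagree.  It is not the algorithm pgdiff uses; the
function name is hypothetical, and splitting on whitespace is a simplifying
assumption.

```python
from difflib import SequenceMatcher

def flag_disagreements(ocr_a: str, ocr_b: str):
    """Return (text_a, text_b) pairs where two OCR outputs disagree.

    Hypothetical helper: words are whatever str.split() yields, so
    whitespace and line-break differences are ignored by construction.
    """
    words_a, words_b = ocr_a.split(), ocr_b.split()
    # autojunk=False keeps difflib from discarding very common words
    # as "junk" on long inputs.
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    flagged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Keep both readings so a human can pick one against the image.
            flagged.append((" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2])))
    return flagged
```

Each returned pair is a place where a human would look at the page image; a
word present in only one output shows up with an empty string on the other side.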

Please see at:

http://www.freekindlebooks.org/Dev/StringMatch/hkdiff.txt

what I think is a more reasonable example of the kinds of problems this tool is
designed to address - here being used for versioning - an OCR from one
edition of a text is being compared to an existing but old copy of a
human-corrected PG text.  For this example, ideally a smart de-hyphenator
ought to be run before making the comparison, but it's still interesting to
see what happens when this isn't done.
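
For what a de-hyphenation step might look like, here is a small Python sketch,
assuming end-of-line hyphens are the only case handled.  The function name and
the optional word list are hypothetical; a genuinely "smart" de-hyphenator
would need a real dictionary to tell "exam-ple" from "well-known".

```python
import re

# A word fragment, a hyphen at end of line, then the rest of the word.
_BROKEN_WORD = re.compile(r"(\w+)-\n\s*(\w+)")

def dehyphenate(text: str, known_words=frozenset()) -> str:
    """Rejoin words split by a hyphen at a line break.

    Hypothetical helper.  If known_words (lowercase) is supplied, the
    hyphen is kept whenever the joined form is not in it, as a crude
    way to preserve true compounds.  The line break itself is dropped.
    """
    def join(match):
        left, right = match.group(1), match.group(2)
        joined = left + right
        if known_words and joined.lower() not in known_words:
            return f"{left}-{right}"
        return joined
    return _BROKEN_WORD.sub(join, text)
```

With no word list every end-of-line hyphen is removed, which will wrongly merge
real compounds, so the result is best used only as input to the comparison.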

James Adcock