
Jim, here are 55 cases where your tool seems to give us more than just the 2 choices that I would expect to see...
See my discussion below of what the Levenshtein Distance is and how pgdiff implements it.
In some cases, such as the second one listed (hot springs), it's because one of the proofer's notes contained a "|" in it. You'll want to screen the input for your significant characters, i.e., any "{", "}", or "|", and eliminate them to avoid confusion.
Agreed that this would be a problem if my tool's output is used as input to another "smart editor" tool that wants to present "Choose A" vs. "Choose B" type choices. Since the tool instead targets a regex editor driven by a real human being, who can recognize from context whether the "{|}" chars are being used to highlight differences vs. being used as part of the input text, it hasn't been a problem for me in the intended problem domain.

======

Levenshtein Distance is the measure of the number of changes needed to transform one string of tokens into a different string of tokens, where the allowable edits are "insert", "delete", or "substitute". Different implementations of the algorithm have different interpretations of what constitutes a "token" and what constitutes a "string". One obvious interpretation is that a "token" is an ascii char and a "string" is a line of text (dictionary lookups of misspelled words). Another obvious interpretation is that a "token" is a line of text and the "string" is the list of lines of text within a file (diff).

pgdiff implements neither of these. Instead it takes a "token" to be a "word", where a "word" is a non-white sequence of chars followed by a white sequence of chars. The white sequence is considered not significant for the purposes of the Levenshtein Distance, but IS significant for the display of output. pgdiff considers the "string" to be the entire list of words in the input file.

The typical importance of the white part is whether words are separated by a space or by a linebreak. pgdiff doesn't care about the white part in terms of the Levenshtein Distance, so the two input files can have different line lengths and different linebreak locations and still be comparable. This also means that including page break information in the input files, such as the "====== filename.101 ====" type stuff, would typically NOT be a good idea, since the input files may have their page breaks in different locations relative to their word content -- unless the two input files are from the same identical edition.

So here are some answers to some implied questions or assumptions:

Does pgdiff look for word differences within a line of text? No.
Does pgdiff look for single word changes? No.
OK, what does pgdiff do?

What pgdiff does is calculate a best match of words across the two entire files. Assuming you set the input options large enough, one input file could contain an entire chapter that the other input file doesn't contain, and the algorithm would sync up just fine. Or, in the case of a book I've worked on previously, the US version had paragraphs removed by a censor, whereas the European version of the text had them intact.

When the words do not match exactly, the mismatches are categorized three ways: 1) insert this missing word; 2) delete this extraneous word; or 3) substitute this one word for a different word. By reversing the input order, options 1) and 2) obviously become symmetrical -- an insertion in one case becomes a deletion in the other. So in either case an isolated word difference is displayed like

    { this }

or, if a bunch of words in a row are deleted or inserted, like

    { this is in one text but not the other }

In case 3), if only one word in a row is different, it displays the output choice like

    { this | that }
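To make the word tokenization and the three edit categories concrete, here is a minimal Python sketch of the same idea. This is illustrative code, not pgdiff's actual source; the function names are mine, but the scheme -- compare only the non-white part of each token, keep the white part around for display -- is as described above.

    import re

    def tokenize(text):
        # Split into (word, trailing_whitespace) pairs. Only the word part
        # feeds the Levenshtein comparison; the whitespace part is kept so
        # output can reproduce the original spacing and linebreaks.
        return re.findall(r'(\S+)(\s*)', text)

    def word_edits(a, b):
        # Classic O(n*m) dynamic-programming Levenshtein over words, with
        # backpointers walked afterwards so we recover the actual
        # insert/delete/substitute operations, not just the distance.
        n, m = len(a), len(b)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i
        for j in range(m + 1):
            d[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = 0 if a[i-1][0] == b[j-1][0] else 1   # words only
                d[i][j] = min(d[i-1][j] + 1,               # delete
                              d[i][j-1] + 1,               # insert
                              d[i-1][j-1] + sub)           # match/substitute
        ops, i, j = [], n, m
        while i > 0 or j > 0:
            if (i > 0 and j > 0 and
                    d[i][j] == d[i-1][j-1] + (a[i-1][0] != b[j-1][0])):
                kind = 'match' if a[i-1][0] == b[j-1][0] else 'substitute'
                ops.append((kind, a[i-1][0], b[j-1][0]))
                i, j = i - 1, j - 1
            elif i > 0 and d[i][j] == d[i-1][j] + 1:
                ops.append(('delete', a[i-1][0], None))
                i -= 1
            else:
                ops.append(('insert', None, b[j-1][0]))
                j -= 1
        return list(reversed(ops))

Note that "test" followed by a space and "test" followed by a linebreak come out as a 'match' here, exactly because only the word part is compared.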
But in case 3), if a bunch of words in a row are different, how should they be displayed? If the differences are due to scannos, it is probably best to display the words next to each other:

    { this | th*s is | iz a | u test | tost }

whereas if the differences are due to human editing, it would probably be best to display them as "sentences":

    { THIS IS A TEST | _this is a test_ }

If you are implementing a "smart editor" then clearly you can choose to display them whichever way you want. In practice what one normally sees is some weird mixture of the two possible situations, and it isn't clear to me which display technique is best, so so far I have chosen the easiest approach to implement -- the first pattern of display:

    { this | th*s is | iz a | u test | tost }
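Continuing the sketch above (again hypothetical code, not pgdiff's), here is what the two display patterns look like for a run of substitutions, given the (old, new) word pairs from the alignment:

    def show_interleaved(pairs):
        # Pattern 1: word-by-word, best for scannos.
        return '{ ' + ' '.join('%s | %s' % (a, b) for a, b in pairs) + ' }'

    def show_phrases(pairs):
        # Pattern 2: "sentence" style, best for human edits.
        old = ' '.join(a for a, b in pairs)
        new = ' '.join(b for a, b in pairs)
        return '{ %s | %s }' % (old, new)

    pairs = [('this', 'th*s'), ('is', 'iz'), ('a', 'u'), ('test', 'tost')]
    print(show_interleaved(pairs))  # { this | th*s is | iz a | u test | tost }
    print(show_phrases(pairs))      # { this is a test | th*s iz u tost }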
From the BBoutput.txt file, for example, consider:
    { <i>Seattle, | Seattle, Washington</i> | Washington
    }

which is of the first pattern. The ending } is on a newline because the two tokens differ in whitespace: space vs. linebreak. Taking that whitespace diff back out, one gets:

    { <i>Seattle, | Seattle, Washington</i> | Washington }

which one can read as:

    Choose one of: <i>Seattle, OR Seattle,
    Followed by: Choose one of: Washington</i> OR Washington

In this case, if one KNEW the differences were due to humans rather than scannos, then it is "obvious" that the better display pattern would be the second one:

    { <i>Seattle, Washington</i> | Seattle, Washington }

i.e., Choose one of: "<i>Seattle, Washington</i>" OR "Seattle, Washington". But in general the tool doesn't know whether differences are due to human edits or scannos, and in general what one sees is a mixture of both problems happening at the same time.

PS: OK, pgdiff doesn't REALLY match across ENTIRE files, since for huge files Levenshtein is an n^2 algorithm in space and time. What it does do is break a file into large overlapping chunks of text and calculate the measure across the chunks. The size of the chunks can be specified as an input parm if you prefer, and the chunks get sewn back together using an invariant of choosing places in the match where words DO match, then sanity-checking that match to make sure we haven't lost sync. What this means in practice is that if you specify a parm of -10000 as an input setting, the algorithm can "only" handle about 10000 word mismatches adjacent to each other in a row without erroring out. In practice this parm is important for versioning, where two editions of a book have large chunks of text which don't match each other, i.e. a chapter has been edited out or in, or a censor has taken a knife to the text. A common problem is that two texts from different editions have entire book prefixes (introductions) or entire book suffixes (postscripts or indexes) which don't match -- these are better to explicitly remove and deal with separately, but the algorithm will try to handle them if you set the input parm large enough.
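To round out the PS, here is a rough sketch of the chunk-and-resync idea. This is my reconstruction of the approach described above, not pgdiff's actual code; the chunk size and error handling are illustrative. It reuses word_edits() from the earlier sketch.

    def diff_in_chunks(a, b, chunk=10000):
        # Align one window at a time, then cut each window's alignment
        # back to its last exact word match, so the next window starts
        # from a position where the two files are known to be in sync.
        ops, i, j = [], 0, 0
        while i < len(a) or j < len(b):
            window = word_edits(a[i:i + chunk], b[j:j + chunk])
            if i + chunk >= len(a) and j + chunk >= len(b):
                ops.extend(window)          # final window: keep it all
                break
            last = max((k for k, op in enumerate(window)
                        if op[0] == 'match'), default=None)
            if last is None:
                # More than ~chunk mismatches in a row: we've lost sync.
                raise RuntimeError('no matching word in window; '
                                   'try a larger chunk parm')
            kept = window[:last + 1]
            ops.extend(kept)
            # Advance past the words each kept op consumed in each file.
            i += sum(1 for op in kept if op[0] != 'insert')
            j += sum(1 for op in kept if op[0] != 'delete')
        return ops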