
I've put up a new copy of the tool pgdiff that contains an option "-smarted" which outputs the text in a form similar to what I think you want, BB, for your "smart editor" tool. It is similar to what pgdiff originally output, but I had found that output too tedious and verbose for my taste when editing it in a regex editor. I also updated the example output file "BBoutput.txt" to show the new output. Your suggestions work in simple cases, but I think you will find that they fail quite spectacularly on difficult cases, such as when performing versioning across different editions.

"Non-diff" material will show up in the output if that material appears in a mixed order. For example, if one file has:

    The quick dog jumps...

and the other file has:

    The dog quick jumps...

then "dog" and/or "quick" will show up in the edits, because there is no Levenshtein edit script that avoids both of them: the Levenshtein measure has no notion of "reverse the order of these two tokens."

Also, you may THINK two tokens are identical, but they aren't identical unless they ARE identical. The measure likewise has no notion of "these two tokens look really similar, so I want them to match up." Either tokens match or they don't. So in the case of:

    The quick dogs jumps...

vs.

    The dog quick jumps...

the algorithm isn't going to try to match up "dog" and "dogs", because it has no notion of token "similarity" - "dog" and "dogs" are simply two different tokens, and they don't match.

Further, even when two tokens DO match, they still may not be paired with each other if there are nearby edits that don't match, such that the total number of "insert", "delete", and "substitute" edits is minimized by NOT making the two identical tokens match up. If you look carefully at the output of diff, you will see it has the same problem (where a "token" is a line of text, not a word): diff does NOT always "successfully" match up two identical lines of text, because, like pgdiff, diff isn't trying to maximize the number of token matches; it is trying to minimize the number of Levenshtein edits. (A small sketch at the end of this note demonstrates both behaviors.)

Again, the problem is basically that the domain you are interested in working on and the domain I am interested in working on are very different. You want a tool that catches small changes within a line of text, and I want a tool that catches large changes within a file. It is easy to hypothesize what the "answer" is if you are not the one doing the work. But if you are the one doing the work, you rapidly find "oops, that idea doesn't work after all!"

The real goal of the tool is to find the places in the text where a human bean needs to step in to fix the problem, and that it does extremely well when the human bean is driving a regex editor and looking at a copy of the original bitmap page. If one wants to try to do a "smart editor", sometimes it's going to work and other times it's going to fail spectacularly - beyond identifying that there IS a problem - and then, again, the human bean is going to have to sort out and fix the problem. In the worst case this involves deleting the text being questioned and typing in the text seen on the bitmap page - which again is not typically a terrible situation if you have a tool that will point you to the problem in the first place, which pgdiff certainly does.
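
To make the Levenshtein points above concrete, here is a minimal Python sketch of token-level alignment (Wagner-Fischer with a backtrace). This is NOT pgdiff's actual code - whitespace tokenization and unit edit costs are assumptions for illustration - but run on the examples above it shows both effects: the transposed "quick"/"dog" pair both land in the edit script, and an exactly matching token ("quick" in the second pair) can still go unmatched when that minimizes the total edit count.

    # Sketch only: token-level Levenshtein alignment with a backtrace.
    def align(a, b):
        """Return one minimal edit script turning token list a into b."""
        m, n = len(a), len(b)
        # cost[i][j] = minimum edits to turn a[:i] into b[:j]
        cost = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            cost[i][0] = i
        for j in range(n + 1):
            cost[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                sub = 0 if a[i - 1] == b[j - 1] else 1  # exact equality only
                cost[i][j] = min(cost[i - 1][j] + 1,        # delete a[i-1]
                                 cost[i][j - 1] + 1,        # insert b[j-1]
                                 cost[i - 1][j - 1] + sub)  # match/substitute
        # Walk back from cost[m][n] to recover one minimal edit script.
        edits, i, j = [], m, n
        while i > 0 or j > 0:
            if (i > 0 and j > 0 and a[i - 1] == b[j - 1]
                    and cost[i][j] == cost[i - 1][j - 1]):
                i, j = i - 1, j - 1                     # tokens match; no edit
            elif i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + 1:
                edits.append(("substitute", a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
            elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
                edits.append(("delete", a[i - 1]))
                i -= 1
            else:
                edits.append(("insert", b[j - 1]))
                j -= 1
        return list(reversed(edits))

    # No "swap" operation, so both transposed tokens land in the edits:
    print(align("The quick dog jumps".split(), "The dog quick jumps".split()))
    # [('substitute', 'quick', 'dog'), ('substitute', 'dog', 'quick')]

    # No token similarity, so "dogs" never pairs with "dog". Note also that
    # the identical token "quick" goes unmatched here, because matching it
    # would cost more total edits:
    print(align("The quick dogs jumps".split(), "The dog quick jumps".split()))
    # [('substitute', 'quick', 'dog'), ('substitute', 'dogs', 'quick')]

The same dynamic program, with lines as tokens instead of words, is essentially what produces the analogous behavior in diff.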