I’ve put up a new copy of the tool pgdiff that contains an
option “-smarted” which outputs the text in a form similar to what
I think you want, BB, for your “smart editor” tool. It is
similar to what pgdiff originally output, but I had found that output too
tedious and verbose for my taste when editing with a regex editor.
Your suggestions work in simple cases, but I think you will find that they
fail spectacularly on difficult cases, such as when versioning across
different editions. I have also updated the example output file
“BBoutput.txt” to show the new output.
“Non-diff” material will show up in the output if
it appears in a mixed order in the two files. For example, if
one file has:
The quick dog jumps…
and the other file has:
The dog quick jumps…
Then “dog” and/or “quick” will show up in the edits, because there is
no Levenshtein edit script that avoids touching both “dog”
and “quick”: the Levenshtein measure has no
notion of “reverse the order of these two tokens.” Also, you
may THINK two tokens are identical, but they aren’t identical unless they ARE
identical; the measure likewise has no notion of “these
two tokens look really similar, so I want them to match up.” Either
tokens match or they don’t. So in the case of:
The quick dogs jumps…
vs.
The dog quick jumps…
The algorithm isn’t going to try to match up “dog”
and “dogs”, because it has no notion of token “similarity”:
“dog” and “dogs” are simply two different
tokens, and they don’t match. Further, even when two tokens ARE identical, they
still may not be paired with each other if there are nearby edits that don’t
match, such that the total number of “insert,” “delete,”
and “substitute” edits is minimized by NOT matching the two identical
tokens up. If you look carefully at the output of diff you will see
it has the same behavior (where a “token” is a line of text, not a
word): diff DOES NOT always “successfully” match up two
identical lines of text, because, like pgdiff, diff isn’t trying to
maximize the number of token matches; rather, it is trying to minimize the number
of Levenshtein edits.
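
To make this concrete, here is a minimal sketch of word-level Levenshtein
alignment in Python. This is NOT pgdiff’s actual code; the function name
word_edits and the scripts it prints are purely illustrative. Run on the
examples above, it shows that reordering forces edits on both tokens, and
that “dogs” never pairs with “dog”:

    # A minimal sketch of word-level Levenshtein alignment -- NOT pgdiff's
    # actual code; names and output format are purely illustrative.

    def word_edits(a, b):
        """Return one minimal insert/delete/substitute script turning
        token list a into token list b."""
        m, n = len(a), len(b)
        # cost[i][j] = minimum edits to turn a[:i] into b[:j]
        cost = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            cost[i][0] = i
        for j in range(n + 1):
            cost[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                differ = 0 if a[i - 1] == b[j - 1] else 1  # exact match only
                cost[i][j] = min(cost[i - 1][j] + 1,            # delete
                                 cost[i][j - 1] + 1,            # insert
                                 cost[i - 1][j - 1] + differ)   # keep/substitute
        # Walk the table backwards to recover one minimal edit script.
        i, j, script = m, n, []
        while i > 0 or j > 0:
            if (i > 0 and j > 0 and a[i - 1] == b[j - 1]
                    and cost[i][j] == cost[i - 1][j - 1]):
                i, j = i - 1, j - 1                 # identical tokens, no edit
            elif i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + 1:
                script.append("substitute %r -> %r" % (a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
            elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
                script.append("delete %r" % a[i - 1])
                i -= 1
            else:
                script.append("insert %r" % b[j - 1])
                j -= 1
        return list(reversed(script))

    print(word_edits("The quick dog jumps".split(), "The dog quick jumps".split()))
    # ["substitute 'quick' -> 'dog'", "substitute 'dog' -> 'quick'"]
    # Both tokens land in the edits even though each occurs in both files:
    # there is no "swap these two tokens" edit.

    print(word_edits("The quick dogs jumps".split(), "The dog quick jumps".split()))
    # ["substitute 'quick' -> 'dog'", "substitute 'dogs' -> 'quick'"]
    # "dogs" never pairs with "dog"; they are simply different tokens.

Note that the substitute/substitute script it prints for the first pair is
only one of several equally minimal answers (a delete plus an insert of
“quick” costs the same), which is exactly why identical tokens are not
guaranteed to be matched up.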
Again, the problem is basically that the domain you are interested in
working on and the domain I am interested in working on are very
different. You want a tool that catches small changes within a line of
text, and I want a tool that catches large changes within a file. It is
easy to hypothesize what the “answer” is if you are not the one
doing the work; if you are the one doing the work, you rapidly find “oops,
that idea doesn’t work after all!” The real goal of the tool
is to find places in the text where a human bean needs to step in to fix the
problem, and that it does extremely well when the human bean is driving a regex
editor and looking at a copy of the original bitmap page. If one wants to try
to do a “smart editor,” sometimes it’s going to work and other times
it’s going to fail spectacularly (beyond identifying that there IS a
problem), and then again the human bean is going to have to sort out and
fix the problem. In the worst case this involves deleting the text being
questioned and typing in the text seen on the bitmap page, which again
is not typically a terrible situation if you have a tool that will
point you to the problem in the first place, which pgdiff certainly does.