I’ve put up a new copy of the tool pgdiff that contains an
option “-smarted” which outputs the text in a form similar to what
I think you want, BB, for your “smart editor” tool. It is
similar to what pgdiff originally output, but I had found that output too
tedious and verbose for my taste when editing with a regex editor.
Your suggestions work in simple cases, but I think you will find that they
fail spectacularly on difficult cases, such as when versioning across
different editions. I have also updated the example output file
“BBoutput.txt” to show the new output.
“Non-diff” material will show up in the output if
it appears in a mixed order in the two files. For example, if
one file has:
The quick dog jumps…
and the other file has:
The dog quick jumps…
Then “dog” and/or “quick” will show up in the edits, because there is
no Levenshtein edit script that avoids touching both “dog”
and “quick”: the Levenshtein measure has no
notion of “reverse the order of these two tokens.” Also, you
may THINK two tokens are identical, but they aren’t identical unless they ARE
identical; the measure likewise has no notion of “these
two tokens look really similar, so I want them to match up.” Either
tokens match or they don’t. So in the case of:
The quick dogs jumps…
vs.
The dog quick jumps…
The algorithm isn’t going to try to match up “dog”
and “dogs”, because it has no notion of token “similarity”:
“dog” and “dogs” are simply two different
tokens, and they don’t match. Further, even when two tokens ARE identical, they
still may not be paired with each other if there are nearby edits that don’t
match, such that the total number of “insert,” “delete,”
and “substitute” edits is minimized by NOT matching the two identical
tokens up. If you look carefully at the output of diff you will see
it has the same behavior (where a “token” is a line of text, not a
word): diff DOES NOT always “successfully” match up two
identical lines of text, because, like pgdiff, diff isn’t trying to
maximize the number of token matches; rather, it is trying to minimize the number
of Levenshtein edits.
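
To make this concrete, here is a minimal sketch of word-level Levenshtein
alignment in Python. This is NOT pgdiff’s actual code; the function name
word_edits and the scripts it prints are purely illustrative. Run on the
examples above, it shows that reordering forces edits on both tokens, and
that “dogs” never pairs with “dog”:

    # A minimal sketch of word-level Levenshtein alignment -- NOT pgdiff's
    # actual code; names and output format are purely illustrative.

    def word_edits(a, b):
        """Return one minimal insert/delete/substitute script turning
        token list a into token list b."""
        m, n = len(a), len(b)
        # cost[i][j] = minimum edits to turn a[:i] into b[:j]
        cost = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            cost[i][0] = i
        for j in range(n + 1):
            cost[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                differ = 0 if a[i - 1] == b[j - 1] else 1  # exact match only
                cost[i][j] = min(cost[i - 1][j] + 1,            # delete
                                 cost[i][j - 1] + 1,            # insert
                                 cost[i - 1][j - 1] + differ)   # keep/substitute
        # Walk the table backwards to recover one minimal edit script.
        i, j, script = m, n, []
        while i > 0 or j > 0:
            if (i > 0 and j > 0 and a[i - 1] == b[j - 1]
                    and cost[i][j] == cost[i - 1][j - 1]):
                i, j = i - 1, j - 1                 # identical tokens, no edit
            elif i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + 1:
                script.append("substitute %r -> %r" % (a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
            elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
                script.append("delete %r" % a[i - 1])
                i -= 1
            else:
                script.append("insert %r" % b[j - 1])
                j -= 1
        return list(reversed(script))

    print(word_edits("The quick dog jumps".split(), "The dog quick jumps".split()))
    # ["substitute 'quick' -> 'dog'", "substitute 'dog' -> 'quick'"]
    # Both tokens land in the edits even though each occurs in both files:
    # there is no "swap these two tokens" edit.

    print(word_edits("The quick dogs jumps".split(), "The dog quick jumps".split()))
    # ["substitute 'quick' -> 'dog'", "substitute 'dogs' -> 'quick'"]
    # "dogs" never pairs with "dog"; they are simply different tokens.

Note that the substitute/substitute script it prints for the first pair is
only one of several equally minimal answers (a delete plus an insert of
“quick” costs the same), which is exactly why identical tokens are not
guaranteed to be matched up.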
Again, the problem is basically that the domain you are interested in
working on and the domain I am interested in working on are very
different. You want a tool that catches small changes within a line of
text, and I want a tool that catches large changes within a file. It is
easy to hypothesize what the “answer” is if you are not the one
doing the work; if you are the one doing the work, you rapidly find “oops,
that idea doesn’t work after all!” The real goal of the tool
is to find places in the text where a human bean needs to step in to fix the
problem, and that it does extremely well when the human bean is driving a regex
editor and looking at a copy of the original bitmap page. If one wants to try
to do a “smart editor,” sometimes it’s going to work and other times
it’s going to fail spectacularly (beyond identifying that there IS a
problem), and then again the human bean is going to have to sort out and
fix the problem. In the worst case this involves deleting the text being
questioned and typing in the text seen on the bitmap page, which again
is not typically a terrible situation if you have a tool that will
point you to the problem in the first place, which pgdiff certainly does.