Re: jim, i have some questions about pgdiff output

jim said:
In practice what one normally sees is some weird mixture of the two possible situations, and it isn't clear to me which display technique is best, so so far I have chosen the easiest approach to implement
in view of the frank admission, let me make some suggestions. i believe these would make your tool's output more workable for the end-user who has to resolve the diffs, no matter _what_ method they use, including the reg-ex editor you use yourself.

***

let me pull out the last 2 of the 55 anomalies i posted for you...

***

here's the first:
drew their stores of { <i>krasnia | Jcrasnia ruiba</i> | ruiba } (the red
some people might prefer the version as it was in your file:
stores of { <i>krasnia | Jcrasnia ruiba</i> | ruiba } (the red
rather than showing this as a single diff, i'd present it as two... the first would be:
{ <i>krasnia | Jcrasnia }.
the second would be
{ ruiba</i> | ruiba }
*** here's the second example:
the trough of the watering place of { the | the" "Jamestown," | Jamestown,"} came to the beach. This place may be
or, more in keeping with how it's displayed in your output:
place of { the | the" "Jamestown," | Jamestown,"} came to
again, i would present this as two diffs... the first would be:
{ the | the" }
the second would be:
{ "Jamestown," | Jamestown,"}
***

in both of these examples, i think combining the 2 diffs into one bracket-bound item confuses the item unnecessarily, and confuses the end-user in the process, making the resolution much more difficult than it needs to be...

in many of these "multiple diff" brackets, i could have my tool pull apart the various diffs, and display them appropriately... so, you know, if you think the output you are showing now is done the way you _want_ to have it done, that's your decision. but i think it will be more clear if you did it slightly differently.

***

another confusion i had with your output was that there were several bracketed items that contained some separator-lines... since all of those separator-lines were standardized by you before you ran your pgdiff, it seems to me that none of them should've been included in any of the brackets. should they? if you start bringing non-diff material into the edit process, you're asking for problems, it would seem to me, so i would rework that code to try to avoid such problems if i were you.

anyway, just a few suggestions, hopefully helpful ones... :+)

-bowerbird
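
For readers following along, the "pull apart" step suggested above can be sketched in a few lines of Python. This is only an illustration, not code from pgdiff or from bowerbird's tool: the function name `split_combined_diff` is hypothetical, and it assumes that every side of every diff is exactly one whitespace-delimited token, which happens to hold for the two examples quoted above but will not hold in general.

```python
def split_combined_diff(item):
    """Split '{ a1 | b1 a2 | b2 }' into ['{ a1 | b1 }', '{ a2 | b2 }'].

    Assumption (hypothetical, for illustration only): each side of each
    diff is a single whitespace-delimited token containing no '|'.
    When that does not hold, the item is returned unchanged in a list.
    """
    inner = item.strip()
    if not (inner.startswith("{") and inner.endswith("}")):
        return [item]
    # Each '|'-separated segment becomes a list of tokens.
    parts = [segment.split() for segment in inner[1:-1].split("|")]
    if len(parts) < 3:
        return [item]  # already a single two-way diff
    if len(parts[0]) != 1 or len(parts[-1]) != 1:
        return [item]
    if any(len(p) != 2 for p in parts[1:-1]):
        return [item]  # ambiguous: cannot tell where one diff ends
    # A middle segment holds the right side of one diff followed by
    # the left side of the next diff.
    lefts = [parts[0][0]] + [p[1] for p in parts[1:-1]]
    rights = [p[0] for p in parts[1:-1]] + [parts[-1][0]]
    return ["{ %s | %s }" % (l, r) for l, r in zip(lefts, rights)]


first = split_combined_diff("{ <i>krasnia | Jcrasnia ruiba</i> | ruiba }")
second = split_combined_diff('{ the | the" "Jamestown," | Jamestown,"}')
print(first)
print(second)
```

The single-token assumption is exactly where such a splitter becomes fragile: as soon as one side of a diff spans several words, the middle segment no longer says where one diff stops and the next begins.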

I've put up a new copy of the tool pgdiff that contains an option "-smarted" which outputs the text in a form similar to what I think you want, BB, for your "smart editor" tool. It is similar to what pgdiff originally output, but I had found that output too tedious and verbose for my taste when editing it with a regex editor. Your suggestions work in simple cases, but I think you will find that they fail relatively spectacularly on difficult cases, such as when performing versioning across different editions. I also updated the example output file "BBoutput.txt" to show the new output.

"Non-diff" material will show up in the output if the "non-diff" material is in a mixed order. For example, if one file has:

    The quick dog jumps...

And the other file has:

    The dog quick jumps.

Then "dog" and/or "quick" will show up in the edits, because there is no way you can do a Levenshtein edit that leaves both "dog" and "quick" untouched: the Levenshtein measure doesn't include a notion of "reverse the order of these two tokens."

Also, you may THINK two tokens are identical, but they aren't identical unless they ARE identical - the measure also doesn't have a notion of "these two tokens look really similar so I want them to match up." Either tokens match or they don't. So in the case of:

    The quick dogs jumps..

Vs.

    The dog quick jumps..

The algorithm isn't going to try to match up "dog" and "dogs" because it has no notion of token "similarity" - "dog" and "dogs" are simply two different tokens and they don't match. Further, even if two tokens do match, they still may not be paired with each other if there are nearby edits that also don't match, such that the total number of "insert", "delete" and "substitute" edits is minimized by NOT making the two identical tokens match up.
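
To make the swapped-token case concrete, here is a minimal sketch of a token-level Levenshtein alignment in Python. It is not pgdiff's code: the function name `token_edit_script`, the whitespace tokenization, and the tie-breaking order in the backtrace are all assumptions made for illustration. Tokens match only when they are exactly equal, and there is no "swap" operation.

```python
def token_edit_script(a, b):
    """Return (distance, ops) for one minimal token-level alignment.

    a, b: lists of tokens. ops is a list of (kind, left, right) tuples
    where kind is 'match', 'substitute', 'delete' or 'insert'.
    """
    n, m = len(a), len(b)
    # d[i][j] = minimal edits turning a[:i] into b[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # delete a[i-1]
                          d[i][j - 1] + 1,          # insert b[j-1]
                          d[i - 1][j - 1] + cost)   # match / substitute
    # Walk back from the corner to recover one minimal edit script.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        cost = 1
        if i > 0 and j > 0:
            cost = 0 if a[i - 1] == b[j - 1] else 1
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + cost:
            kind = "match" if cost == 0 else "substitute"
            ops.append((kind, a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(("delete", a[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, b[j - 1]))
            j -= 1
    ops.reverse()
    return d[n][m], ops


left = "The quick dog jumps".split()
right = "The dog quick jumps".split()
distance, ops = token_edit_script(left, right)
edited = [op for op in ops if op[0] != "match"]
print(distance)
for op in edited:
    print(op)

# "dogs" and "dog" are simply different tokens to this measure.
plural_distance, _ = token_edit_script("The quick dogs jumps".split(), right)
print(plural_distance)
```

On the swapped sentences the distance is 2, and every non-matching step touches "quick" or "dog" even though both words occur in both inputs. The "dogs" variant also costs 2 edits, since no partial credit is given for near-identical tokens.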
If you look carefully at the output of diff you will see it has the same problem (where a "token" is a line of text, not a word) - diff DOES NOT always "successfully" match up two lines of identical text - because, like pgdiff, diff isn't trying to maximize the number of token matches; rather, it is trying to minimize the number of Levenshtein edits.

Again, the problem is basically that the domain you are interested in working on and the domain I am interested in working on are very different. You want a tool that catches small changes within a line of text, and I want a tool that catches large changes within a file. It is easy to hypothesize what the "answer" is if you are not the one doing the work. But if you are the one doing the work you rapidly find "oops, that idea doesn't work after all!"

The real goal of the tool is to find places in the text where a human bean needs to step in to fix the problem, and that it does extremely well when the human bean is driving a regex editor and looking at a copy of the original bitmap page. If one wants to try to do a "smart editor", sometimes it's going to work and other times it's going to fail spectacularly - other than identifying that there IS a problem - and then again the human bean is going to have to sort out and fix the problem. In the worst case this involves deleting the text being questioned and typing in the text seen on the bitmap page - which again is not typically a terrible situation - if you have a tool that will point you to the problem in the first place, which pgdiff certainly does.
participants (2)
-
Bowerbird@aol.com
-
James Adcock