Re: jim, i have some questions about pgdiff output

jim, here are 55 cases where your tool seems to give us more than just the 2 choices that i would expect to see...
See my discussion below of what the Levenshtein Distance is and how pgdiff implements it.
in some cases, such as the second one listed (hot springs), it's because one of the proofer's notes contained a "|" in it. you'll want to screen the input for your significant characters, i.e., any "{" or "}" or "|", and eliminate them to avoid confusion.
Agreed that this would be a problem if my tool is used as input to another "smart editor" tool that wants to present "Choose A" vs. "Choose B" type choices. Since instead the tool was targeting a regex editor driven by a real human being, who can recognize from context whether the "{|}" chars are being used to highlight differences or are part of the input text, it hasn't been a problem for me in the intended problem domain.

======

Levenshtein Distance is the measure of the number of changes needed to transform one string of tokens into a different string of tokens, where the allowable edits are "insert", "delete" or "substitute." Different implementations of the algorithm have different interpretations of what constitutes a "token" and what constitutes a "string". One obvious interpretation is that a "token" is an ASCII char and a "string" is a line of text (dictionary lookups of misspelled words). Another obvious interpretation is that a "token" is a line of text and the "string" is the list of lines of text within a file (diff).

pgdiff implements neither of these. Rather, it takes a "token" to be a "word", where a "word" is a non-white sequence of chars followed by a white sequence of chars. The white sequence of chars is considered not significant for the purposes of the Levenshtein Distance, but IS significant for the display of output. pgdiff considers the "string" to be the entire list of words in the input file. The typical importance of the white part is whether words are separated by a space or by a linebreak. pgdiff doesn't care about the white part in terms of the Levenshtein Distance, so the two input files can have different line lengths and different linebreak locations and still be comparable. This also means that including page break information in the input files, such as the "====== filename.101 ====" type stuff, would NOT be a good idea, since the input files will typically have their page breaks in different locations relative to their word content -- unless the two input files are from the same identical edition.

So here are some answers to some implied questions or assumptions:

Does pgdiff look for word differences within a line of text? No.

Does pgdiff look for single word changes? No.

OK, what does pgdiff do?

What pgdiff does is calculate a best match of words across two entire files. Assuming you set the input options large enough, for example, one input file could contain an entire chapter that the other input file doesn't contain and the algorithm would sync up just fine. Or, in the case of a book I've worked on previously, the US version had paragraphs removed by a censor, whereas the European version of the text had them intact.

When the words do not match exactly, the mismatches are categorized three ways: 1) Insert this missing word. 2) Delete this extraneous word. Or 3) Substitute this one word for a different word. By reversing the input order, options 1) and 2) obviously become symmetrical -- an insertion in one case becomes a deletion in the other case. So in either case an isolated word difference is displayed like { this } or, if a bunch of words in a row are deleted or inserted, like { this is in one text but not the other } In case 3), if only one word in a row is different, it displays the output choice like { this | that } But in case 3), if a bunch of words are different in a row, how to display them?

If the differences are due to scannos it is probably best to display the words next to each other, { this | th*s is | iz a | u test | tost }, whereas if the differences are due to human editing it would probably be best to display them as "sentences", { THIS IS A TEST | _this is a test_ }. If you are implementing a "smart editor" then clearly you can choose to display them whichever way you want. In practice what one normally sees is some weird mixture of the two situations, and it isn't clear to me which display technique is best, so so far I have chosen the easiest approach to implement -- the first pattern of display: { this | th*s is | iz a | u test | tost }
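To make the word-token idea concrete, here is a minimal sketch in Python -- not pgdiff's actual code, just the general technique described above: split each file into (word, trailing-whitespace) pairs, run the classic Levenshtein dynamic program over the words only, and mark the mismatches with the "{", "|", "}" convention. For simplicity this sketch braces each differing word pair separately instead of grouping a whole run of substitutions the way pgdiff's output does.

    import re

    def tokenize(text):
        # Each token is a non-white run plus the white run that follows it;
        # only the non-white part counts toward the distance.
        return re.findall(r'(\S+)(\s*)', text)

    def align(a, b):
        # O(n*m) dynamic program over words: insert / delete / substitute.
        n, m = len(a), len(b)
        cost = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            cost[i][0] = i
        for j in range(m + 1):
            cost[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                diff = 0 if a[i - 1][0] == b[j - 1][0] else 1
                cost[i][j] = min(cost[i - 1][j] + 1,        # delete a word
                                 cost[i][j - 1] + 1,        # insert a word
                                 cost[i - 1][j - 1] + diff) # keep or substitute
        ops, i, j = [], n, m           # trace back to recover the edits
        while i > 0 or j > 0:
            if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (0 if a[i - 1][0] == b[j - 1][0] else 1):
                ops.append(('same' if a[i - 1][0] == b[j - 1][0] else 'subst', a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
            elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
                ops.append(('delete', a[i - 1], None))
                i -= 1
            else:
                ops.append(('insert', None, b[j - 1]))
                j -= 1
        return list(reversed(ops))

    def markup(ops):
        # First display pattern: each mismatch shown where it occurs, with
        # the whitespace of the first file preserved around it.
        out = []
        for kind, ta, tb in ops:
            if kind == 'same':
                out.append(ta[0] + ta[1])
            elif kind == 'subst':
                out.append('{ %s | %s }%s' % (ta[0], tb[0], ta[1]))
            elif kind == 'delete':
                out.append('{ %s }%s' % (ta[0], ta[1]))
            else:
                out.append('{ %s }%s' % (tb[0], tb[1]))
        return ''.join(out)

    print(markup(align(tokenize("this is a test"), tokenize("th*s iz u tost"))))
    # -> { this | th*s } { is | iz } { a | u } { test | tost }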
From the BBoutput.txt file, for example, consider:
{ <i>Seattle, | Seattle, Washington</i> | Washington
}

Which is of the first pattern. The ending } is on a newline since the two tokens differ in whitespace: space vs. linebreak. Taking that whitespace difference back out one gets:

{ <i>Seattle, | Seattle, Washington</i> | Washington }

which one can read as:

Choose one of: <i>Seattle, OR Seattle,
Followed by:
Choose one of: Washington</i> OR Washington

In this case, if one KNEW the differences were due to humans rather than scannos, then it is "obvious" that the better display pattern would be the second one:

{ <i>Seattle, Washington</i> | Seattle, Washington }

IE Choose one of: "<i>Seattle, Washington</i>" OR "Seattle, Washington"

But in general the tool doesn't know whether differences are due to human edits or scannos, and in general what one sees is a mixture of both problems happening at the same time.

PS: OK, pgdiff doesn't REALLY match across ENTIRE files, since if the files are huge Levenshtein is an n^2 algorithm in space and time. What it does do is break a file into large overlapping chunks of text and calculate the measure across the chunks, where the size of the chunks can be specified as an input parm if you prefer, and where the chunks get sewn back together using an invariant of choosing places in the match where words DO match, and checking the sanity of that match to make sure we haven't lost sync. What this means in practice is that if you specify a parm of -10000 as an input setting then the algorithm can "ONLY" handle about 10000 mismatched words adjacent to each other in a row without erroring out. This parm in practice is important for versioning, where two editions of a book have large chunks of text which don't match each other -- IE a chapter is edited out or edited in, or a censor has taken their knife to the text. Common problems are that two texts from different editions have entire book prefixes (introductions) or entire book suffixes (postscripts or indexes) which don't match -- which it is better to explicitly remove and deal with separately, but which the algorithm will try to handle if you set the input parm large enough.
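A rough sketch of that chunking trick -- my reading of the idea, not pgdiff's actual code -- reusing the tokenize() and align() helpers from the earlier sketch: align fixed-size windows of words instead of whole files, trust each alignment only up to the last run of exactly-matching words, and restart the next window from there, erroring out if no matching run can be found within a window.

    def chunked_ops(a, b, chunk=1000, resync_run=5):
        # Assumes tokenize() and align() from the sketch above are in scope.
        # Each window costs O(chunk^2) rather than O(n^2) over the whole file.
        ia = ib = 0
        all_ops = []
        while ia < len(a) or ib < len(b):
            ops = align(a[ia:ia + chunk], b[ib:ib + chunk])
            if ia + chunk >= len(a) and ib + chunk >= len(b):
                all_ops.extend(ops)          # final window: keep everything
                break
            run, cut = 0, None               # find the last resync point
            for k, (kind, _, _) in enumerate(ops):
                run = run + 1 if kind == 'same' else 0
                if run >= resync_run:
                    cut = k + 1
            if cut is None:                  # no matching run: we lost sync
                raise RuntimeError("too many mismatched words in a row for chunk size %d" % chunk)
            kept = ops[:cut]
            all_ops.extend(kept)
            # Advance each file by however many of its words the kept ops consumed.
            ia += sum(1 for kind, _, _ in kept if kind != 'insert')
            ib += sum(1 for kind, _, _ in kept if kind != 'delete')
        return all_ops

    # e.g. (hypothetical filenames):
    # markup(chunked_ops(tokenize(open("us.txt").read()),
    #                    tokenize(open("uk.txt").read()), chunk=10000))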

Hi,

The more I think about the tools discussed here and the use of "diff"s, the more I get the feeling that the use of diff is actually overkill. diff is basically n^2. It was developed when text/string processing was not efficient, and it was designed for revisioning and compression. It works best for frequent and large differences. Furthermore, it aids in analysis.

Proofing is per se linear, has relatively few differences, and is aided by humans, and a new version is to be created and not a merge. The process is simple: compare text A and B as long as they are equal, then gather the information as long as they differ, present the difference, offer possible changes, continue. Without much analysis one can see that this process is linear. So maybe a more direct approach could be viable. Of course, other problems of the collaboration have to be dealt with elsewhere.

O.K., this approach may seem simplistic and primitive, yet it solves a few problems:
1) equality checking and proofing are done in one pass
2) works with files of any size
3) works with text divided among several files
4) can be easily integrated into different editor models
5) presentation of the two versions is part of the tool and not dependent on other EXTERNAL representations
6) the processing of metadata and formatting is controlled by the proofing/editor tool -- no more worry about pollution for the external diff-tool

Caveats:
a) you would need a logging system for changes
b) higher storage requirements for the entire system
c) it would have to be programmed from scratch
d) highly adjustable

regards
Keith.
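A minimal sketch of the linear pass described above, with an invented resync heuristic filling in the part the description leaves open (which, as comes up later in the thread, is exactly where the alignment problem hides): walk both word lists in lockstep while they agree, and on a mismatch scan a small window ahead for the nearest point where they agree again, presenting everything in between as one difference.

    def linear_compare(a, b, window=20):
        # A sketch, not a finished tool: resyncing on a single common word
        # can of course resync in the wrong place.
        i = j = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                i, j = i + 1, j + 1          # equal: keep going
                continue
            best = None                      # nearest re-agreement point
            for di in range(window):
                for dj in range(window):
                    if i + di < len(a) and j + dj < len(b) and a[i + di] == b[j + dj]:
                        if best is None or di + dj < best[0] + best[1]:
                            best = (di, dj)
            if best is None:
                raise RuntimeError("could not resync within %d words" % window)
            di, dj = best
            yield (a[i:i + di], b[j:j + dj]) # the difference to present
            i, j = i + di, j + dj
        if i < len(a) or j < len(b):
            yield (a[i:], b[j:])             # whatever is left over in one file

    for left, right in linear_compare("this is a test".split(), "this iz a test".split()):
        print("{ %s | %s }" % (" ".join(left), " ".join(right)))
    # -> { is | iz }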

Proofing is per se linear, has relatively few differences, and is aided by humans, and a new version is to be created and not a merge. The process is simple: compare text A and B as long as they are equal, then gather the information as long as they differ, present the difference, offer possible changes, continue. Without much analysis one can see that this process is linear.
Agreed -- although again you run into problems when your assumptions break down. Pgdiff wasn't intended for these simple "change a couple letters within a line of text" problems. It was intended for problems of the nature of "I have two different editions of the text from two different continents, one using English spellings and one using American spellings, with different linebreaks and different pagebreaks and different intros and censorship and different indexes, and I want to use one to help find scannos in the other." Yes, it can be used for simpler tasks, but if you have a simpler task you might be better off figuring out exactly what that task is and writing a tool to match it. Human edits within a line tend to be char-by-char, and you might be better off using a Levenshtein measure with the "token" set to be a char and the "string" set to be a line of text -- to give an obvious example -- since it's not obvious to me how someone uses a mouse and a keyboard to make changes other than "insert a char", "delete a char" or "substitute a char" -- unless one uses cut and paste, in which case all assumptions are off again....
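To give that obvious example some shape, here is a minimal sketch of the char-level variant -- the "token" is a character, the "string" is one line -- using the standard two-row Levenshtein dynamic program; the sample lines are made up.

    def char_distance(line_a, line_b):
        # Rolling two-row Levenshtein over characters: insert, delete, or
        # substitute a single char, which matches how keyboard edits
        # within a line tend to happen.
        n, m = len(line_a), len(line_b)
        prev = list(range(m + 1))
        for i in range(1, n + 1):
            cur = [i] + [0] * m
            for j in range(1, m + 1):
                diff = 0 if line_a[i - 1] == line_b[j - 1] else 1
                cur[j] = min(prev[j] + 1,        # delete a char
                             cur[j - 1] + 1,     # insert a char
                             prev[j - 1] + diff) # keep or substitute
            prev = cur
        return prev[m]

    print(char_distance("The quick brown fox", "The quikc brown Fox"))   # -> 3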

Hi James,

I do understand the Levenshtein measure, and I actually do not think we need to discuss its caveats as far as precision and successfulness are concerned. Using the English and American versions against each other is an interesting approach. Yet, that makes pgdiff specific to one set of languages. On the other side, if you take out the problem of the forewords, TOCs, indices, et al., you could simply try adding a component that rewrites one text with the other's spelling conventions. That, I know, is no trivial task.

As for my suggestion of not using diff and just using a simple comparison method which is linear, the problem of alignment does remain. I admit I have not done the math, nor do I have an exact algorithm, but it does seem to me that it would be polynomial and still far better than n^2.

regards
Keith.
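Just to show the shape of such a rewriting component -- a toy sketch only: the mapping table here is invented and nowhere near complete, and real text would also need care with punctuation, capitalization, and words that differ legitimately.

    # Hypothetical, hand-picked mapping for illustration only; a real tool
    # would need a proper American/British lexicon.
    US_TO_UK = {
        "color": "colour",
        "honor": "honour",
        "center": "centre",
        "traveled": "travelled",
    }

    def normalize(words, table=US_TO_UK):
        # Lower-case each word and map it through the table before the
        # comparison, so spelling-only differences stop showing up as
        # substitutions.
        return [table.get(w.lower(), w.lower()) for w in words]

    print(normalize("The color of the center".split()))
    # -> ['the', 'colour', 'of', 'the', 'centre']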
participants (2)
- James Adcock
- Keith J. Schultz