
Jim, here are 55 cases where your tool seems to give us more than just the 2 choices that I would expect to see...
See my discussion below of what the Levenshtein Distance is and how pgdiff implements it.
In some cases, such as the second one listed (hot springs), it's because one of the proofer's notes contained a "|" in it. You'll want to screen the input for your significant characters, i.e., any "{", "}", or "|", and eliminate them to avoid confusion.
Agreed that this would be a problem if my tool's output is used as input to another "smart editor" tool that wants to present "Choose A" vs. "Choose B" type choices. Since the tool instead targets a regex editor driven by a real human being, who can recognize from context whether the "{|}" chars are being used to highlight differences vs. being used as part of the input text, it hasn't been a problem for me in the intended problem domain.

======

Levenshtein Distance is the measure of the number of changes needed to transform one string of tokens into a different string of tokens, where the allowable edits are "insert", "delete", or "substitute". Different implementations of the algorithm have different interpretations of what constitutes a "token" and what constitutes a "string". One obvious interpretation is that a "token" is an ascii char and a "string" is a line of text (dictionary lookups of misspelled words). Another obvious interpretation is that a "token" is a line of text and the "string" is the list of lines of text within a file (diff).

pgdiff implements neither of these. Instead it takes a "token" to be a "word", where a "word" is a non-white sequence of chars followed by a white sequence of chars. The white sequence is considered not significant for the purposes of the Levenshtein Distance, but IS significant for the display of output. pgdiff considers the "string" to be the entire list of words in the input file.

The typical importance of the white part is whether words are separated by a space or by a linebreak. pgdiff doesn't care about the white part in terms of the Levenshtein Distance, so the two input files can have different line lengths and different linebreak locations and still be comparable. This also means that including page break information in the input files, such as the "====== filename.101 ====" type stuff, would typically NOT be a good idea, since the input files may have their page breaks in different locations relative to their word content -- unless the two input files are from the same identical edition.

So here are some answers to some implied questions or assumptions:

Does pgdiff look for word differences within a line of text? No.
Does pgdiff look for single word changes? No.
OK, what does pgdiff do?

What pgdiff does is calculate a best match of words across the two entire files. Assuming you set the input options large enough, one input file could contain an entire chapter that the other input file doesn't contain, and the algorithm would sync up just fine. Or, in the case of a book I've worked on previously, the US version had paragraphs removed by a censor, whereas the European version of the text had them intact.

When the words do not match exactly, the mismatches are categorized three ways: 1) insert this missing word; 2) delete this extraneous word; or 3) substitute this one word for a different word. By reversing the input order, options 1) and 2) obviously become symmetrical -- an insertion in one case becomes a deletion in the other. So in either case an isolated word difference is displayed like

    { this }

or, if a bunch of words in a row are deleted or inserted, like

    { this is in one text but not the other }

In case 3), if only one word in a row is different, it displays the output choice like

    { this | that }
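To make the word tokenization and the three edit categories concrete, here is a minimal Python sketch of the same idea. This is illustrative code, not pgdiff's actual source; the function names are mine, but the scheme -- compare only the non-white part of each token, keep the white part around for display -- is as described above.

    import re

    def tokenize(text):
        # Split into (word, trailing_whitespace) pairs. Only the word part
        # feeds the Levenshtein comparison; the whitespace part is kept so
        # output can reproduce the original spacing and linebreaks.
        return re.findall(r'(\S+)(\s*)', text)

    def word_edits(a, b):
        # Classic O(n*m) dynamic-programming Levenshtein over words, with
        # backpointers walked afterwards so we recover the actual
        # insert/delete/substitute operations, not just the distance.
        n, m = len(a), len(b)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i
        for j in range(m + 1):
            d[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = 0 if a[i-1][0] == b[j-1][0] else 1   # words only
                d[i][j] = min(d[i-1][j] + 1,               # delete
                              d[i][j-1] + 1,               # insert
                              d[i-1][j-1] + sub)           # match/substitute
        ops, i, j = [], n, m
        while i > 0 or j > 0:
            if (i > 0 and j > 0 and
                    d[i][j] == d[i-1][j-1] + (a[i-1][0] != b[j-1][0])):
                kind = 'match' if a[i-1][0] == b[j-1][0] else 'substitute'
                ops.append((kind, a[i-1][0], b[j-1][0]))
                i, j = i - 1, j - 1
            elif i > 0 and d[i][j] == d[i-1][j] + 1:
                ops.append(('delete', a[i-1][0], None))
                i -= 1
            else:
                ops.append(('insert', None, b[j-1][0]))
                j -= 1
        return list(reversed(ops))

Note that "test" followed by a space and "test" followed by a linebreak come out as a 'match' here, exactly because only the word part is compared.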
But in case 3), if a bunch of words in a row are different, how should they be displayed? If the differences are due to scannos, it is probably best to display the words next to each other:

    { this | th*s is | iz a | u test | tost }

whereas if the differences are due to human editing, it would probably be best to display them as "sentences":

    { THIS IS A TEST | _this is a test_ }

If you are implementing a "smart editor" then clearly you can choose to display them whichever way you want. In practice what one normally sees is some weird mixture of the two possible situations, and it isn't clear to me which display technique is best, so so far I have chosen the easiest approach to implement -- the first pattern of display:

    { this | th*s is | iz a | u test | tost }
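Continuing the sketch above (again hypothetical code, not pgdiff's), here is what the two display patterns look like for a run of substitutions, given the (old, new) word pairs from the alignment:

    def show_interleaved(pairs):
        # Pattern 1: word-by-word, best for scannos.
        return '{ ' + ' '.join('%s | %s' % (a, b) for a, b in pairs) + ' }'

    def show_phrases(pairs):
        # Pattern 2: "sentence" style, best for human edits.
        old = ' '.join(a for a, b in pairs)
        new = ' '.join(b for a, b in pairs)
        return '{ %s | %s }' % (old, new)

    pairs = [('this', 'th*s'), ('is', 'iz'), ('a', 'u'), ('test', 'tost')]
    print(show_interleaved(pairs))  # { this | th*s is | iz a | u test | tost }
    print(show_phrases(pairs))      # { this is a test | th*s iz u tost }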
From the BBoutput.txt file, for example, consider:
    { <i>Seattle, | Seattle, Washington</i> | Washington
    }

which is of the first pattern. The ending } is on a newline because the two tokens differ in whitespace: space vs. linebreak. Taking that whitespace diff back out, one gets:

    { <i>Seattle, | Seattle, Washington</i> | Washington }

which one can read as:

    Choose one of: <i>Seattle, OR Seattle,
    Followed by: Choose one of: Washington</i> OR Washington

In this case, if one KNEW the differences were due to humans rather than scannos, then it is "obvious" that the better display pattern would be the second one:

    { <i>Seattle, Washington</i> | Seattle, Washington }

i.e., Choose one of: "<i>Seattle, Washington</i>" OR "Seattle, Washington". But in general the tool doesn't know whether differences are due to human edits or scannos, and in general what one sees is a mixture of both problems happening at the same time.

PS: OK, pgdiff doesn't REALLY match across ENTIRE files, since for huge files Levenshtein is an n^2 algorithm in space and time. What it does do is break a file into large overlapping chunks of text and calculate the measure across the chunks. The size of the chunks can be specified as an input parm if you prefer, and the chunks get sewn back together using an invariant of choosing places in the match where words DO match, then sanity-checking that match to make sure we haven't lost sync. What this means in practice is that if you specify a parm of -10000 as an input setting, the algorithm can "only" handle about 10000 word mismatches adjacent to each other in a row without erroring out. In practice this parm is important for versioning, where two editions of a book have large chunks of text which don't match each other, i.e. a chapter has been edited out or in, or a censor has taken a knife to the text. A common problem is that two texts from different editions have entire book prefixes (introductions) or entire book suffixes (postscripts or indexes) which don't match -- these are better to explicitly remove and deal with separately, but the algorithm will try to handle them if you set the input parm large enough.
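To round out the PS, here is a rough sketch of the chunk-and-resync idea. This is my reconstruction of the approach described above, not pgdiff's actual code; the chunk size and error handling are illustrative. It reuses word_edits() from the earlier sketch.

    def diff_in_chunks(a, b, chunk=10000):
        # Align one window at a time, then cut each window's alignment
        # back to its last exact word match, so the next window starts
        # from a position where the two files are known to be in sync.
        ops, i, j = [], 0, 0
        while i < len(a) or j < len(b):
            window = word_edits(a[i:i + chunk], b[j:j + chunk])
            if i + chunk >= len(a) and j + chunk >= len(b):
                ops.extend(window)          # final window: keep it all
                break
            last = max((k for k, op in enumerate(window)
                        if op[0] == 'match'), default=None)
            if last is None:
                # More than ~chunk mismatches in a row: we've lost sync.
                raise RuntimeError('no matching word in window; '
                                   'try a larger chunk parm')
            kept = window[:last + 1]
            ops.extend(kept)
            # Advance past the words each kept op consumed in each file.
            i += sum(1 for op in kept if op[0] != 'insert')
            j += sum(1 for op in kept if op[0] != 'delete')
        return ops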