
On 9/26/2012 12:11 PM, Robert Gibbins wrote:
> Somewhere in the recent spate of discussion, several contributors, including BB and/or Jim Adcock I think, described diffing two different (OCRed) editions of the same work, and also finding many scannos.
> These appends seem to me to imply that diffing parallel/independently produced texts is a faster and more efficient way to correct the OCR process than sequential manual checking.
Double blind scanning has come up frequently, but it has never made it past the discussion phase. In my own experimentation, I have encountered two major impediments.

The first is that automated diff programs are good, but they are not infallible. When diffing two files which are significantly different, the diff point in the two files needs to be regularly re-synced. For example, suppose you are using the standard GNU diff program, and one file contains a section of text that the other does not. The program cannot just go on diffing line for line, because once the additional text is encountered every line thereafter will be reported as different. Because of this problem, all diff programs have the capability to "look ahead" and try to establish a point where the lines again become identical. There will always be files that exceed the look-ahead capability, and thus cannot be diffed. Differing line lengths exacerbate this problem, because words at the beginning or end of a line may move to a different line, causing a line-based diff to get out of sync for entire paragraphs (and if this happens for one paragraph, it is likely to happen for all of them).

There are, of course, ways to ameliorate this problem. One way is for each OCR to start with, if not the same scan, at least scans of the same DT edition, and then to be sure that the resultant text is always saved with line endings (LF or CR/LF) intact. Another approach (which I believe Mr. Traverso uses) is to use wdiff. wdiff is a front end to standard diff which takes each input file and breaks it down into lines consisting of a single word each (a rough sketch of the same idea appears below). Diffs using wdiff are immune to line-break/length issues, but they increase the risk of exceeding the look-ahead limit, and sometimes make it hard to find the context of the word in the candidate text file. Both problems can be minimized by careful normalization of the input text (but "standard" is a four-letter word at PG).

Of course, all of these methods still require a human to resolve the differences found. A more automated approach would be desirable. One idea is to do a /triple/ blind scan/diff, where a voting algorithm is used to resolve differences (also sketched below). The assumption is that if two OCR passes agree on a result, the remaining result must be incorrect. That is not always true, but the exceptions can usually be caught during smooth proofing. Another method, suggested by BowerBird, is to build a dictionary based on the OCR text, containing a count of word instances. Every word difference is then looked up in the dictionary, and the word selected is the one with the greatest number of instances in the dictionary, the assumption being that OCR errors are relatively rare, so if a spelling is used frequently in the remainder of the text it is probably correct (a third sketch below illustrates this). Again, smooth proofing can be used to catch the odd errors this introduces (assuming availability of page scans).

The second problem I have encountered results from my own idiosyncratic commitment to text markup. When I save OCR text I always save it in a format that preserves all the markup possible. To run a diff, normalizing the text also means removing the markup. Then, when textual differences are found, I have to go back and make the changes in the marked-up file. I have simply found that the overhead of stripping markup, normalizing files, running the diff, and then manually changing the marked-up version tends to be greater than the time required to simply do a page-by-page, image-against-result proofreading inside FineReader.
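To make the word-per-line idea concrete, here is a rough Python sketch of a word-level diff. It uses difflib rather than wdiff itself, and the file names are placeholders, so take it as an illustration of the approach, not a description of what wdiff actually does:

    # Split each text into words so line breaks stop mattering, then diff
    # the word sequences. Illustrative only -- not how wdiff is implemented.
    import difflib

    def words(path):
        with open(path, encoding="utf-8") as f:
            return f.read().split()

    a = words("edition_a.txt")   # hypothetical input files
    b = words("edition_b.txt")

    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Print a little leading context so a human can find the spot.
            context = " ".join(a[max(0, i1 - 5):i1])
            print(tag, "...", context, "|", " ".join(a[i1:i2]), "<->", " ".join(b[j1:j2]))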
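The triple blind scan/diff voting could look something like the toy below. It assumes the three OCR outputs have already been aligned word for word, which is of course the hard part, and the sample triples are made up:

    # Majority vote across three OCR readings of the same word position.
    from collections import Counter

    def vote(w1, w2, w3):
        word, n = Counter([w1, w2, w3]).most_common(1)[0]
        if n >= 2:
            return word, True        # at least two readings agree
        return w1, False             # no majority -- flag for smooth proofing

    aligned = [("the", "the", "tbe"), ("quick", "quiok", "quick"), ("fox", "f0x", "for")]
    for triple in aligned:
        choice, resolved = vote(*triple)
        print(triple, "->", choice, "" if resolved else "(needs a human)")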
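And BowerBird's word-count idea, again as a hedged sketch with a made-up file name: when two readings disagree, prefer the spelling that occurs more often in the rest of the text, and punt to a human on ties.

    # Build a frequency dictionary from the OCR text and use it to choose
    # between two disputed spellings.
    from collections import Counter
    import re

    def word_counts(text):
        return Counter(re.findall(r"[A-Za-z']+", text.lower()))

    def pick(word_a, word_b, counts):
        ca, cb = counts[word_a.lower()], counts[word_b.lower()]
        if ca == cb:
            return None              # tie -- leave it for smooth proofing
        return word_a if ca > cb else word_b

    counts = word_counts(open("edition_a.txt", encoding="utf-8").read())
    print(pick("arid", "and", counts))   # "and" should win on frequency alone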
Ideas as to how to increase automation while reducing overhead are always welcome.