
On 9/26/2012 12:11 PM, Robert Gibbins wrote:
> Somewhere in the recent spate of discussion, several contributors, including BB and/or Jim Adcock I think, described diffing two different (OCRed) editions of the same work, and also finding many scannos.
> These appends seem to me to imply that diffing parallel/independently produced texts is a faster and more efficient way to correct the OCR process than sequential manual checking.
Double blind scanning has come up frequently, but it has never made it past the discussion phase. In my own experimentation, I have encountered two major impediments.

The first is that automated diff programs are good, but they are not infallible. When diffing two files which are significantly different, the diff point in the two files needs to be regularly re-synced. For example, suppose you are using the standard GNU diff program, and one file contains a section of text that the other does not. The program cannot just go on diffing line for line, because once the additional text is encountered every line thereafter will be reported as different. Because of this problem, all diff programs have the capability to "look ahead" and try to establish a point where the lines again become identical. There will always be files that exceed the look-ahead capability, and thus cannot be diffed. Differing line lengths exacerbate this problem, because words at the beginning or end of a line may move to a different line, causing a line-based diff to get out of sync for entire paragraphs (and if this happens for one paragraph, it is likely to happen for all of them).

There are, of course, ways to ameliorate this problem. One way is for each OCR to start with, if not the same scan, at least scans of the same DT edition, and then to be sure that the resultant text is always saved with line endings (LF or CR/LF) intact. Another approach (which I believe Mr. Traverso uses) is to use wdiff. wdiff is a front end to standard diff which takes each input file and breaks it down into lines consisting of a single word each (a rough sketch of the same idea appears below). Diffs using wdiff are immune to line-break/length issues, but they increase the risk of exceeding the look-ahead limit, and sometimes make it hard to find the context of the word in the candidate text file. Both problems can be minimized by careful normalization of the input text (but "standard" is a four-letter word at PG).

Of course, all of these methods still require a human to resolve the differences found. A more automated approach would be desirable. One idea is to do a /triple/ blind scan/diff, where a voting algorithm is used to resolve differences (also sketched below). The assumption is that if two OCR passes agree on a result, the remaining result must be incorrect. That is not always true, but the exceptions can usually be caught during smooth proofing. Another method, suggested by BowerBird, is to build a dictionary based on the OCR text, containing a count of word instances. Every word difference is then looked up in the dictionary, and the word selected is the one with the greatest number of instances in the dictionary, the assumption being that OCR errors are relatively rare, so if a spelling is used frequently in the remainder of the text it is probably correct (a third sketch below illustrates this). Again, smooth proofing can be used to catch the odd errors this introduces (assuming availability of page scans).

The second problem I have encountered results from my own idiosyncratic commitment to text markup. When I save OCR text I always save it in a format that preserves all the markup possible. To run a diff, normalizing the text also means removing the markup. Then, when textual differences are found, I have to go back and make the changes in the marked-up file. I have simply found that the overhead of stripping markup, normalizing files, running the diff, and then manually changing the marked-up version tends to be greater than the time required to simply do a page-by-page, image-against-result proofreading inside FineReader.
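To make the word-per-line idea concrete, here is a rough Python sketch of a word-level diff. It uses difflib rather than wdiff itself, and the file names are placeholders, so take it as an illustration of the approach, not a description of what wdiff actually does:

    # Split each text into words so line breaks stop mattering, then diff
    # the word sequences. Illustrative only -- not how wdiff is implemented.
    import difflib

    def words(path):
        with open(path, encoding="utf-8") as f:
            return f.read().split()

    a = words("edition_a.txt")   # hypothetical input files
    b = words("edition_b.txt")

    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Print a little leading context so a human can find the spot.
            context = " ".join(a[max(0, i1 - 5):i1])
            print(tag, "...", context, "|", " ".join(a[i1:i2]), "<->", " ".join(b[j1:j2]))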
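The triple blind scan/diff voting could look something like the toy below. It assumes the three OCR outputs have already been aligned word for word, which is of course the hard part, and the sample triples are made up:

    # Majority vote across three OCR readings of the same word position.
    from collections import Counter

    def vote(w1, w2, w3):
        word, n = Counter([w1, w2, w3]).most_common(1)[0]
        if n >= 2:
            return word, True        # at least two readings agree
        return w1, False             # no majority -- flag for smooth proofing

    aligned = [("the", "the", "tbe"), ("quick", "quiok", "quick"), ("fox", "f0x", "for")]
    for triple in aligned:
        choice, resolved = vote(*triple)
        print(triple, "->", choice, "" if resolved else "(needs a human)")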
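And BowerBird's word-count idea, again as a hedged sketch with a made-up file name: when two readings disagree, prefer the spelling that occurs more often in the rest of the text, and punt to a human on ties.

    # Build a frequency dictionary from the OCR text and use it to choose
    # between two disputed spellings.
    from collections import Counter
    import re

    def word_counts(text):
        return Counter(re.findall(r"[A-Za-z']+", text.lower()))

    def pick(word_a, word_b, counts):
        ca, cb = counts[word_a.lower()], counts[word_b.lower()]
        if ca == cb:
            return None              # tie -- leave it for smooth proofing
        return word_a if ca > cb else word_b

    counts = word_counts(open("edition_a.txt", encoding="utf-8").read())
    print(pick("arid", "and", counts))   # "and" should win on frequency alone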
Ideas as to how to increase automation while reducing overhead are always welcome.