
Somewhere in the recent spate of discussion, several contributors, including BB and/or Jim Adcock I think, described diffing two different (OCRed) editions of the same work and finding many scannos in the process. These appends seem to me to imply that diffing parallel, independently produced texts is a faster and more efficient way of correcting OCR output than sequential manual checking.

I wonder what happens if, starting from the same physical text, you do one or more of the following and then diff the results:

- scanning with two different scanners
- OCRing with two different algorithms/programs
- manual correction (PPing?) independently by two different people

Has anyone (possibly including DP) ever tried any of the above and documented the results in a scientifically valid way? Does DP work that way anyway?

It makes sense to me that if each of the above processes is roughly 95%-99% accurate, then automatically comparing two independent results to find errors might be a lot more reliable than manually refining something that is already so good that humans can't see the remaining faults.

It's not quite the same thing, but I've spent enough of my life trying to see the errors in program code, i.e. to read what the code actually says rather than what I think it says, to know that humans have amazingly good subconscious error-correction algorithms which are impossible to turn off.

Bob Gibbins
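
P.S. For concreteness, here is a minimal sketch of the sort of automated comparison I have in mind, written in Python with the standard difflib module. The file names are invented, and it is only an illustration of the idea, not anything DP actually runs:

    # Compare two independently produced transcriptions of the same work and
    # report every place where they disagree, on the theory that independent
    # errors rarely coincide, so most disagreements mark a scanno in one
    # version or the other.
    import difflib

    def read_words(path):
        """Return the file's contents as a flat list of words."""
        with open(path, encoding="utf-8") as f:
            return f.read().split()

    version_a = read_words("edition_a_ocr.txt")   # hypothetical file names
    version_b = read_words("edition_b_ocr.txt")

    matcher = difflib.SequenceMatcher(None, version_a, version_b, autojunk=False)

    # A human then only has to inspect these few spots, not the whole text.
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            print(tag, "A:", " ".join(version_a[a_start:a_end]),
                  "| B:", " ".join(version_b[b_start:b_end]))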