Double blind OCRing?

26 Sep 2012

>These appends seem to me to imply that diffing parallel/independently
produced texts is a faster and more efficient way to correct the OCR process
than sequential manual checking.

And the results are more accurate.  The DP approach does not lead to
particularly accurate texts.

>I wonder what happens if, starting from the same physical text, you do one
or more of:
- scanning with two different scanners
- OCRing with two different algorithms/programs
- manual correction (PPing?) independently by two different people And then
diffing the results.

What I have done, more for convenience than anything else, includes:

*Scanning with two different scanners on two different editions, OCR'ing
using two different OCRS, and cross-diffing the results to find points of
disagreement, and then using a DJVU and/or PDF/a version of the images to
allow me to search on the questionable text and manually make a decisions
based on what my eyes actually claim to see there. 

*OCR'ing with two different algorithms/programs.

*manual correction by two different people, and then diffing the results.
In this case one of the manual corrections has typically been done more
rigorously than the other, but in any case the difference areas are found
and again rigorously checked against the original images.

If one makes stupid choices one figures this out very rapidly, simply
because stupid choices lead to an ungodly amount of work. [Well, I guess one
figures this out very rapidly if you are the one actually doing the work, I
guess if you can get someone else to do the work, then you never figure it
out.]

All of these can work well, but still require considerable work, meaning
1000s if not 10,000s of places in a text where there are differences found
which still need to be carefully scrutinized against the original images to
see with your own real eyes what those images actually "say."  The point is
to reduce the problem to an actual finite number of places that one ACTUALLY
needs to carefully scrutinize, and then ACTUALLY carefully scrutinize.  When
you ask people to find a "needle in a haystack" [which is what DP *is*
asking people to do] then people *reliably* make two kinds of errors:

a) They leave many needles in the haystack.

And perhaps more troublesome:

b) They insist to the death they have found a needle in the haystack when in
fact they have not found a needle in the haystack.  DP being a political
organization, these second problems cannot in practice be fixed.

>Has anyone (possibly including DP) ever tried any of the above, and
documented the results in a scientifically valid way?

The notion of "scientifically valid way" would pretend to understand that
any of this can be reduced to a fully "mechanized" process, and what I find
the more I do this stuff is how much actually involves judgment calls on the
part of the transcriber, someone who hopefully is trying both to make a work
which is incredibly faithful to the original author's intent, and also is
trying to make something which real world people can actually read today on
the computing devices they actually own this century.  Which at best might
be called an "engineering judgment" process not a "scientific" process.

>Does DP work that way anyway?

No.  The DP incentive process drives many people to short-change their
efforts, where they are effectively smooth-reading texts rather than
rigorously comparing them to the actual images.  Some of the old-timers are
the worst offenders in these matters, but then again so are some of the
newbies.  Some of the intermediates are really good, conscientious and
thoughtful workers.  You get results all over the place.

>It sort of makes sense to me that if the above processes are basically
95%-99% accurate, then automatically comparing two independent results to
find errors might be a lot more reliable than manually refining something
that's already so good that humans can't see the difference. 

Whenever you ask real flesh-and-blood people to do something their efforts
can never be better than about 99.9% accurate.  IE we analog computers tend
to blow it, often for totally inexplicable reasons, about 1 out of 1,000
times even when we are being incredibly conscientious. When we are not at
the top of our game, of course, all bets are off. A simple example of this
kind of problem is in email when we "think" a certain word in our head but
the typewriter-keyboard-process part of our brain "types" a different word
into the email -- and I'm not talking about the mechanistic "off by one key"
kind of mechanical error.  Just like when speaking sometimes a wrong word
slips out (one daughter's name for the other) sometimes when typing a wrong
word slips out, and sometimes when reading a wrong word slips in. [.And
getting more people involved does not necessarily make the process more
reliable -- note what happens when you send two outfielders to catch one
ball! ]

>It's not quite the same, but I've spent enough of my life trying to see the
errors in program code, i.e. read what the code actually says, rather than
what I think it says, to know that humans have amazingly good subconscious
error correction algorithms which it's impossible to turn off.

I can think of a situation where a typesetter error was "for sure"
interpreted by a WW'er one way and "so why don't you just fix it?" whereas I
"for sure" interpreted it the other way.  For example is a "he said he said"
typesetter error to be interpreted as "he said she said" or "she said he
said" ?  Well, that depends on your understanding of the state of the
relationship of he and she at that point in time, which depends on your
understanding of the plot development.  And that is not something that
scanner software is going to fix anytime soon.  [I ending up leaving the
typesetting error in place]

PS: Note that a typical DP text contains maybe literally 1 million
characters, so reducing that problem to "only" 1,000 places that need to be
carefully checked is a huge step in the right direction.

PPS:  Of course conscientious "dead tree" books have maybe 10 errors in
them, and less conscientious efforts 10 times that amount.  And god knows
how much gets lost between the "original author's intent" and the "first
edition."

PPPS: And neither Unicode nor HTML give us good tools to transcribe what one
actually finds in historical books in the first place.

Robert Gibbins

Lee Passey

James Adcock

Jon Hurst

James Adcock

David Starner

tags

participants (5)