
28 Sep 2012, 3:33 p.m.
> These appends seem to me to imply that diffing parallel/independently produced texts is a faster and more efficient way to correct the OCR process than sequential manual checking. And the results are more accurate.

The DP approach does not lead to particularly accurate texts.

> I wonder what happens if, starting from the same physical text, you do one or more of:
> - scanning with two different scanners
> - OCRing with two different algorithms/programs
> - manual correction (PPing?) independently by two different people
> And then diffing the results.

What I have done, more for convenience than anything else, includes:

* Scanning two different editions with two different scanners, OCRing them with two different OCR programs, and cross-diffing the results to find points of disagreement (there is a small code sketch of this below), then using a DjVu and/or PDF/A version of the images to let me search for the questionable text and manually make a decision based on what my eyes actually claim to see there.

* OCRing with two different algorithms/programs.

* Manual correction by two different people, then diffing the results. In this case one of the manual corrections has typically been done more rigorously than the other, but either way the areas of difference are found and again rigorously checked against the original images.

If one makes stupid choices, one figures this out very rapidly, simply because stupid choices lead to an ungodly amount of work. [Well, one figures this out very rapidly if one is the one actually doing the work; if you can get someone else to do the work, then you never figure it out.]

All of these can work well, but they still require considerable work, meaning thousands if not tens of thousands of places in a text where differences are found, each of which still needs to be carefully scrutinized against the original images to see with your own real eyes what those images actually "say." The point is to reduce the problem to an actual finite number of places that one ACTUALLY needs to carefully scrutinize, and then to ACTUALLY carefully scrutinize them.

When you ask people to find a "needle in a haystack" [which is what DP *is* asking people to do], people *reliably* make two kinds of errors:

a) They leave many needles in the haystack.

And, perhaps more troublesome:

b) They insist to the death that they have found a needle in the haystack when in fact they have not.

DP being a political organization, this second problem cannot in practice be fixed.

> Has anyone (possibly including DP) ever tried any of the above, and documented the results in a scientifically valid way?

The notion of a "scientifically valid way" presupposes that any of this can be reduced to a fully "mechanized" process, and what I find, the more I do this stuff, is how much of it actually involves judgment calls on the part of the transcriber: someone who is hopefully trying both to make a work which is incredibly faithful to the original author's intent, and to make something which real-world people can actually read today on the computing devices they actually own this century. At best this might be called an "engineering judgment" process, not a "scientific" process.

> Does DP work that way anyway?

No. The DP incentive process drives many people to short-change their efforts, so that they are effectively smooth-reading texts rather than rigorously comparing them to the actual images. Some of the old-timers are the worst offenders in these matters, but then again so are some of the newbies. Some of the intermediates are really good, conscientious, and thoughtful workers. You get results all over the place.
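For concreteness, cross-diffing is nothing exotic. Here is a minimal Python sketch of the idea; the file names are made up, and it assumes the two OCR outputs have been normalized to one page-line per file-line, so that a plain line diff localizes each disagreement:

    import difflib

    def disagreements(path_a, path_b):
        """Yield (line_no, lines_a, lines_b) wherever the two texts differ."""
        a = open(path_a, encoding="utf-8").read().splitlines()
        b = open(path_b, encoding="utf-8").read().splitlines()
        matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                yield i1 + 1, a[i1:i2], b[j1:j2]

    # Every spot reported here is a place to check by eye against the images.
    for line_no, got_a, got_b in disagreements("ocr_a.txt", "ocr_b.txt"):
        print("line %d: %r vs %r" % (line_no, got_a, got_b))

The whole point of the exercise is just to turn "reread everything" into a finite list of places to check.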
> It sort of makes sense to me that if the above processes are basically 95%-99% accurate, then automatically comparing two independent results to find errors might be a lot more reliable than manually refining something that's already so good that humans can't see the difference.

Whenever you ask real flesh-and-blood people to do something, their efforts can never be better than about 99.9% accurate. I.e., we analog computers tend to blow it, often for totally inexplicable reasons, about 1 time out of 1,000 even when we are being incredibly conscientious. When we are not at the top of our game, of course, all bets are off.

A simple example of this kind of problem happens in email, when we "think" a certain word in our head but the typewriter-keyboard-process part of our brain "types" a different word into the email, and I'm not talking about the simple "off by one key" kind of mechanical error. Just as when speaking a wrong word sometimes slips out (one daughter's name for the other), sometimes when typing a wrong word slips out, and sometimes when reading a wrong word slips in.

[And getting more people involved does not necessarily make the process more reliable; note what happens when you send two outfielders to catch one ball!]

> It's not quite the same, but I've spent enough of my life trying to see the errors in program code, i.e. read what the code actually says, rather than what I think it says, to know that humans have amazingly good subconscious error correction algorithms which it's impossible to turn off.

I can think of a situation where a typesetter error was "for sure" interpreted one way by a WW'er ("so why don't you just fix it?") whereas I "for sure" interpreted it the other way. For example, is a "he said he said" typesetter error to be interpreted as "he said she said" or as "she said he said"? Well, that depends on your understanding of the state of the relationship between him and her at that point in time, which depends on your understanding of the plot development. And that is not something that scanner software is going to fix anytime soon. [I ended up leaving the typesetting error in place.]

PS: Note that a typical DP text contains maybe literally 1 million characters, so reducing that problem to "only" 1,000 places that need to be carefully checked is a huge step in the right direction.

PPS: Of course conscientious "dead tree" books have maybe 10 errors in them, and less conscientious efforts 10 times that many. And god knows how much gets lost between the "original author's intent" and the "first edition."

PPPS: And neither Unicode nor HTML gives us good tools to transcribe what one actually finds in historical books in the first place.
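PPPPS: For the curious, here is the back-of-envelope arithmetic behind the PS, as a small Python sketch; both rates are the rough assumptions from above, not measurements from DP:

    chars = 1_000_000     # characters in a typical DP-sized text
    slip_rate = 1 / 1000  # ~99.9% accuracy: about the best a careful human manages

    # One careful solo pass still leaves on the order of 1,000 slips to find:
    solo_errors = chars * slip_rate        # 1000.0

    # Two independent passes only hide an error from the diff when BOTH slip
    # at the same character, and even then only if they slip identically:
    coincident = chars * slip_rate ** 2    # 1.0

    print(solo_errors, coincident)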