Double blind OCRing?

Somewhere in the recent spate of discussion, several contributors, including BB and/or Jim Adcock I think, described diffing two different (OCRed) editions of the same work, and also finding many scannos. These appends seem to me to imply that diffing parallel/independently produced texts is a faster and more efficient way to correct the OCR process than sequential manual checking.

I wonder what happens if, starting from the same physical text, you do one or more of:
- scanning with two different scanners
- OCRing with two different algorithms/programs
- manual correction (PPing?) independently by two different people
And then diffing the results.

Has anyone (possibly including DP) ever tried any of the above, and documented the results in a scientifically valid way? Does DP work that way anyway?

It sort of makes sense to me that if the above processes are basically 95%-99% accurate, then automatically comparing two independent results to find errors might be a lot more reliable than manually refining something that's already so good that humans can't see the difference.

It's not quite the same, but I've spent enough of my life trying to see the errors in program code, i.e. read what the code actually says, rather than what I think it says, to know that humans have amazingly good subconscious error correction algorithms which it's impossible to turn off.

Bob Gibbins

On 9/26/2012 12:11 PM, Robert Gibbins wrote:
Somewhere in the recent spate of discussion, several contributors, including BB and/or Jim Adcock I think, described diffing two different (OCRed) editions of the same work, and also finding many scannos.
These appends seem to me to imply that diffing parallel/independently produced texts is a faster and more efficient way to correct the OCR process than sequential manual checking.
Double blind scanning has come up frequently, but never made it past the discussion phase. In my own experimentation, I have encountered two major impediments.

Automated diff programs are good, but they are not infallible. When diffing two files which are significantly different, the diff point in the two files needs to be regularly re-synced. For example, suppose you are using the standard GNU diff program, and one file contains a section of text that the other does not. The program cannot just go on diffing line for line, because once the additional text is encountered every line thereafter will be reported as different. Because of this problem, all diff programs have the capability to "look ahead" and try to establish a point where the lines again become identical. There will always be files that exceed the "look-ahead" capability, and thus cannot be diffed. Differing line lengths exacerbate this problem, because words at the beginning or end of a line may move to a different line, causing a line-based diff to get out of sync for entire paragraphs (and if this happens for one paragraph, it is likely to happen for all paragraphs).

There are, of course, ways to ameliorate this problem. One way is for each OCR to start with, if not the same scan, at least scans of the same DT edition, and then to be sure that the resultant text is always saved with line endings (LF or CR/LF) intact. Another approach (which I believe Mr. Traverso uses) is to use wdiff. wdiff is a front-end to standard diff which takes each input file and breaks it down into lines consisting of a single word each. Diffs using wdiff are immune to line-break/line-length issues, but increase the risk of exceeding the look-ahead limit, and sometimes make it hard to find the context of the word in the candidate text file. Both problems can be minimized by careful normalization of the input text (but "standard" is a four-letter word at PG).

Of course, all of these methods still require a human to resolve the differences found. A more automated approach would be desirable. One idea is to do a /triple/ blind scan/diff, where a voting algorithm is used to resolve differences. The assumption is that if two OCR algorithms got one result, the remaining result must be incorrect. This is not always the case, but the exceptions can usually be caught during smooth proofing. Another method, suggested by BowerBird, is to build a dictionary based on the OCR text, containing a count of word instances. Every word difference is then looked up in the dictionary, and the word selected is the one that has the greatest number of instances in the dictionary, the assumption being that OCR errors are relatively rare, and if the word spelling is used frequently in the remainder of the text it is probably correct. Again, smooth proofing can be used to catch the odd error (assuming availability of page scans).

The second problem I have encountered results from my own idiosyncratic commitment to text markup. When I save OCR text I always save it in a format that preserves all the markup possible. To run a diff, normalizing the text also means removing the markup. Then, when textual differences are found, I have to go in and make the changes in the marked-up file. I have simply found that the overhead of stripping markup, normalizing files, running the diff, then manually changing the marked-up version tends to be greater than the time required to simply do a page-by-page, image-against-result proofreading inside FineReader.
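To make the wdiff idea above concrete, here is a rough word-level cross-diff sketch in Python. It is not wdiff itself (it uses the standard difflib module, so the look-ahead behaviour differs), the file names are placeholders, and it assumes markup has already been stripped; but, like wdiff, it is immune to line-break and line-length differences.

import re
import difflib

def normalize_words(path):
    """wdiff-style normalization: reduce a file to a flat sequence of words,
    so differing line breaks and line lengths cannot de-sync the comparison."""
    with open(path, encoding="utf-8") as f:
        return re.findall(r"\S+", f.read())

def cross_diff(path_a, path_b, context=4):
    """Print every word-level disagreement between two OCR outputs,
    with a few leading words of context to locate it in the text."""
    a, b = normalize_words(path_a), normalize_words(path_b)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        lead = " ".join(a[max(0, i1 - context):i1])
        print(f"...{lead} | A: {' '.join(a[i1:i2]) or '(nothing)'}"
              f" | B: {' '.join(b[j1:j2]) or '(nothing)'}")

cross_diff("ocr_run1.txt", "ocr_run2.txt")  # placeholder file names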
Ideas as to how to increase automation while reducing overhead are always welcome.
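As a starting point, here is one possible shape for the triple-blind voting and BowerBird's word-count idea. It is only a rough sketch: it assumes the three OCR outputs have already been aligned word-for-word (which, per the look-ahead discussion above, is the hard part), and the counts and disputed readings are invented for illustration.

from collections import Counter

def resolve(candidates, freq):
    """Pick a reading from three aligned candidates: majority vote first, then
    fall back to whichever spelling occurs most often in the rest of the text."""
    votes = Counter(candidates)
    word, count = votes.most_common(1)[0]
    if count >= 2:                       # at least two of the three OCRs agree
        return word, "vote"
    # No majority: prefer the candidate whose spelling is commonest elsewhere.
    return max(candidates, key=lambda w: freq.get(w.lower(), 0)), "frequency"

# In practice freq would be built by counting every word in the OCR text;
# these counts and disputed positions are invented.
freq = Counter({"received": 14, "moor": 3})
print(resolve(["received", "received", "recieved"], freq))  # ('received', 'vote')
print(resolve(["moor", "rnoor", "mo0r"], freq))             # ('moor', 'frequency')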

Automated diff programs are good, but they are not infallible. When diffing two files which are significantly different, the diff point in the two files needs to be regularly re-synced. For example, suppose you are using the standard GNU diff program, and one file contains a section of text that the other does not. The program cannot just go on diffing line for line, because once the additional text is encountered every line thereafter will be reported as different.
Depends on the diff program. The one I created specifically for the purpose of cross-diffing will correctly deal with not just line mismatches but with whole missing or entirely changed sections of text.
Because of this problem, all diff programs have the capability to "look-ahead" and try to establish a point where the lines again become identical. There will always be files that exceed the "look-ahead" capability, and thus cannot be diffed.
This statement is certainly not true, at least not the first part. Certainly files that contain sections of millions of words mismatched in a row will prove to be problematic.
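As a rough illustration of the point (this sketch uses Python's difflib rather than the cross-diff program described above, and the sample data are invented), even a generic word-level matcher will re-synchronize across a whole missing section instead of reporting everything after it as different.

import difflib

# Edition A contains a long passage (a preface, say) that edition B omits entirely.
a = ("one two three " + "preface " * 500 + "four five six").split()
b = "one two three four five six".split()

matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    print(tag, i2 - i1, "words in A vs", j2 - j1, "words in B")
# Prints an 'equal' block, one 500-word 'delete' block, then another 'equal' block:
# the text after the missing section stays aligned instead of cascading mismatches.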

On 2012-09-26, Robert Gibbins wrote:
- manual correction (PPing?) independently by two different people
And then diffing the results.
I had a play with this, and did a brief report at: http://www.pgdp.net/phpBB2/viewtopic.php?t=51418

There are links to a parallel proofing interface (hosted at DP.it) in the second paragraph. The twist is that the proofers themselves get the diffs as soon as they submit a page, so they can quickly check their own output against the image for errors.
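For illustration only (this is not the actual DP.it interface, and the sample page text is invented): because both proofers start from the same OCR text, a plain per-page line diff is enough, and it can be echoed back the moment a page is submitted.

import difflib

def page_feedback(mine: str, theirs: str) -> str:
    """Unified diff between two proofers' versions of the same page, shown to the
    proofer immediately after submission so they can recheck against the image."""
    return "\n".join(difflib.unified_diff(
        mine.splitlines(), theirs.splitlines(),
        fromfile="your page", tofile="parallel proofer", lineterm=""))

print(page_feedback("It was a dark and storniy night;\nthe rain fell in torrents.",
                    "It was a dark and stormy night;\nthe rain fell in torrents."))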

>These appends seem to me to imply that diffing parallel/independently produced texts is a faster and more efficient way to correct the OCR process than sequential manual checking.

And the results are more accurate. The DP approach does not lead to particularly accurate texts.

>I wonder what happens if, starting from the same physical text, you do one or more of: - scanning with two different scanners - OCRing with two different algorithms/programs - manual correction (PPing?) independently by two different people And then diffing the results.

What I have done, more for convenience than anything else, includes:

* Scanning two different editions with two different scanners, OCR'ing them using two different OCR programs, cross-diffing the results to find the points of disagreement, and then using a DJVU and/or PDF/A version of the images to let me search for the questionable text and manually make a decision based on what my eyes actually claim to see there (a sketch of the resulting worklist follows at the end of this message).

* OCR'ing with two different algorithms/programs.

* Manual correction by two different people, and then diffing the results. In this case one of the manual corrections has typically been done more rigorously than the other, but in any case the difference areas are found and again rigorously checked against the original images.

If one makes stupid choices one figures this out very rapidly, simply because stupid choices lead to an ungodly amount of work. [Well, one figures this out very rapidly if you are the one actually doing the work; I guess if you can get someone else to do the work, then you never figure it out.]

All of these can work well, but still require considerable work, meaning 1000s if not 10,000s of places in a text where differences are found which still need to be carefully scrutinized against the original images to see with your own real eyes what those images actually "say." The point is to reduce the problem to an actual finite number of places that one ACTUALLY needs to carefully scrutinize, and then to ACTUALLY carefully scrutinize them.

When you ask people to find a "needle in a haystack" [which is what DP *is* asking people to do] then people *reliably* make two kinds of errors: a) they leave many needles in the haystack; and, perhaps more troublesome, b) they insist to the death that they have found a needle in the haystack when in fact they have not. DP being a political organization, this second problem cannot in practice be fixed.

>Has anyone (possibly including DP) ever tried any of the above, and documented the results in a scientifically valid way?

The notion of a "scientifically valid way" presupposes that any of this can be reduced to a fully "mechanized" process, and what I find the more I do this stuff is how much of it actually involves judgment calls on the part of the transcriber: someone who hopefully is trying both to make a work which is incredibly faithful to the original author's intent, and also to make something which real-world people can actually read today on the computing devices they actually own this century. Which at best might be called an "engineering judgment" process, not a "scientific" process.

>Does DP work that way anyway?

No. The DP incentive process drives many people to short-change their efforts, so that they are effectively smooth-reading texts rather than rigorously comparing them to the actual images. Some of the old-timers are the worst offenders in these matters, but then again so are some of the newbies.
Some of the intermediates are really good, conscientious and thoughtful workers. You get results all over the place.

>It sort of makes sense to me that if the above processes are basically 95%-99% accurate, then automatically comparing two independent results to find errors might be a lot more reliable than manually refining something that's already so good that humans can't see the difference.

Whenever you ask real flesh-and-blood people to do something, their efforts can never be better than about 99.9% accurate. I.e., we analog computers tend to blow it, often for totally inexplicable reasons, about 1 out of 1,000 times even when we are being incredibly conscientious. When we are not at the top of our game, of course, all bets are off. A simple example of this kind of problem is in email, when we "think" a certain word in our head but the typewriter-keyboard-process part of our brain "types" a different word into the email -- and I'm not talking about the mechanistic "off by one key" kind of mechanical error. Just as when speaking a wrong word sometimes slips out (one daughter's name for the other), sometimes when typing a wrong word slips out, and sometimes when reading a wrong word slips in. [And getting more people involved does not necessarily make the process more reliable -- note what happens when you send two outfielders to catch one ball!]

>It's not quite the same, but I've spent enough of my life trying to see the errors in program code, i.e. read what the code actually says, rather than what I think it says, to know that humans have amazingly good subconscious error correction algorithms which it's impossible to turn off.

I can think of a situation where a typesetter error was "for sure" interpreted by a WW'er one way ("so why don't you just fix it?") whereas I "for sure" interpreted it the other way. For example, is a "he said he said" typesetter error to be interpreted as "he said she said" or as "she said he said"? Well, that depends on your understanding of the state of the relationship of he and she at that point in time, which depends on your understanding of the plot development. And that is not something that scanner software is going to fix anytime soon. [I ended up leaving the typesetting error in place.]

PS: Note that a typical DP text contains maybe literally 1 million characters, so reducing that problem to "only" 1,000 places that need to be carefully checked is a huge step in the right direction.

PPS: Of course, conscientious "dead tree" books have maybe 10 errors in them, and less conscientious efforts 10 times that amount. And god knows how much gets lost between the "original author's intent" and the "first edition."

PPPS: And neither Unicode nor HTML give us good tools to transcribe what one actually finds in historical books in the first place.
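To make the "finite number of places to actually scrutinize" point concrete, here is a rough sketch of the kind of worklist the cross-diff step produces. It assumes a word-level cross-diff (like the one sketched earlier in the thread) has already found the disagreement points; the sample entries and the output file name are invented.

# Assumes a word-level cross-diff has already produced the disagreement points;
# these sample entries are invented for illustration.
disagreements = [
    # (word position, edition A reading, edition B reading, surrounding context)
    (10482, "recieved", "received", "had recieved the letter that very morning"),
    (57311, "he said", "she said", "and then he said quietly that the house"),
]

def scrutiny_list(disagreements, out_path="check_against_images.txt"):
    """Write the finite worklist of places that actually need to be checked by eye;
    the context string is what you paste into the DJVU/PDF text search to find the page."""
    with open(out_path, "w", encoding="utf-8") as out:
        for pos, a, b, context in disagreements:
            out.write(f"word {pos}: A={a!r}  B={b!r}\n    search for: {context}\n")
    print(f"{len(disagreements)} places to scrutinize against the images")

scrutiny_list(disagreements)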

On Fri, Sep 28, 2012 at 8:33 AM, James Adcock <jimad@msn.com> wrote:
These appends seem to me to imply that diffing parallel/independently produced texts is a faster and more efficient way to correct the OCR process than sequential manual checking.
Independent scans and OCR with independent programs is not fast or efficient. Nor can hours simply be counted; the whole point of DP and similar projects is that it's easier to get many hours from many people than to get one volunteer to put in fewer hours.
And the results are more accurate. The DP approach does not lead to particularly accurate texts.
Right, whatever. Instead of sneering, how about some actual evidence and numbers? Show me texts, and let's see what you consider "not particularly accurate".
PPPS: And neither Unicode nor HTML give us good tools to transcribe what one actually finds in historical books in the first place.
Unicode doesn't? Whatever. Unless you're talking manuscripts and EETS books, it's pretty solid. -- Kie ekzistas vivo, ekzistas espero. [Where there is life, there is hope.]
participants (5):
- David Starner
- James Adcock
- Jon Hurst
- Lee Passey
- Robert Gibbins