
On 3/4/2010 11:50 AM, Marcello Perathoner wrote:
> Michael S. Hart wrote:
> [snip]
>> The bet is that a Xerox machine type of scanning and OCR will produce a 95% accurate copy of certain pages selected from an average set of books, magazines, etc. Just go to a library and ask for samples.
> Accuracy of OCR already exceeds 99%.
Absolutely. According to what I learned in typing class (yes, I really am that old), a standard typewritten sheet of paper averages 72 lines of 66 characters each, or 4752 characters per page. On a per-character basis, 99% accuracy would allow 47 errors per page. Modern OCR, even that POS that IA uses, gives better accuracy than that.

If you choose to look at words instead of characters, it is generally accepted that the average word length is 6 characters, for an average of about 9.4 words per line (I have allowed one space after each word, which is why it is not 11 words per line). This results in an average of 679 words per page, which at 99% accuracy would allow 6 or 7 misrecognized /words/ per page. That is still well within the recognition accuracy of modern OCR.

Personally, I find bowerbird's stated goal of 1 error per 10 pages a worthwhile goal. This is actually an accuracy rate (based upon words) of about 99.985%. So maybe the bet ought to be when automated OCR will do better than one word error per page (a word accuracy of roughly 99.85%; a full four 9s would be about one error per 15 pages). Some of the recent work I have done, from my own scans, already reaches that threshold. (Accuracy will, of course, vary depending on the quality of the scanned image. YMMV and all that jazz.)
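
For anyone who wants to check the per-page arithmetic, here is a minimal Python sketch. The function and constant names are my own, and it assumes the typing-class page (72 lines of 66 characters) and a 6-letter average word followed by one space; change the constants for other page sizes.

```python
# Rough page-level OCR accuracy arithmetic.
# Assumptions (illustrative, not measured): 72 lines x 66 characters per page,
# and an average word of 6 letters followed by one space.
LINES_PER_PAGE = 72
CHARS_PER_LINE = 66
AVG_WORD_LEN = 6


def chars_per_page():
    """Character positions on one typewritten page."""
    return LINES_PER_PAGE * CHARS_PER_LINE


def words_per_page():
    """Average words per page, allowing one space per word."""
    return chars_per_page() / (AVG_WORD_LEN + 1)


def errors_per_page(accuracy, units_per_page):
    """Expected errors on a page at a given accuracy (0.99 means 99%)."""
    return (1 - accuracy) * units_per_page


def word_accuracy(errors, pages):
    """Word-level accuracy implied by `errors` wrong words in `pages` pages."""
    return 1 - errors / (pages * words_per_page())


def pages_per_error(accuracy):
    """Average number of pages between word errors at a given accuracy."""
    return 1 / errors_per_page(accuracy, words_per_page())


if __name__ == "__main__":
    print("characters per page:", chars_per_page())
    print("words per page: %.0f" % words_per_page())
    print("char errors/page at 99%%: %.1f" % errors_per_page(0.99, chars_per_page()))
    print("word errors/page at 99%%: %.1f" % errors_per_page(0.99, words_per_page()))
    print("1 error per 10 pages: %.3f%%" % (100 * word_accuracy(1, 10)))
    print("1 error per page: %.2f%%" % (100 * word_accuracy(1, 1)))
    print("pages per error at 99.99%%: %.1f" % pages_per_error(0.9999))
```

The point of keeping characters and words separate is that the same percentage means very different things per page: 99% leaves dozens of wrong characters but only a handful of wrong words.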