
----- Original Message -----
From: Juliet Sutherland <vze3rknp@verizon.net>
Date: Thu, 27 Jan 2005 17:00:11 -0500
To: Project Gutenberg Volunteer Discussion <gutvol-d@lists.pglaf.org>
Subject: Re: [gutvol-d] million book project
Virtually none of the other image archives provide corrected text. It is simply cost-prohibitive to do so. What DP does as a volunteer effort would be extremely costly to replicate. For this reason alone, I believe that Google will be using raw OCR behind their scans. On new material, raw OCR from a good program can be very close to 100% correct. It is the older material that causes problems.
I actually find, within limits, the opposite to be true. Material from about 1905 through to about 1955 OCRs very, very well. Material before 1905 OCRs progressively worse the further back you go, although "bright" characters on pre-acid paper are almost as good as early 20th-century stuff. And after the 1950s, I find the OCRability goes down again. Many post-PC printed books, say from the mid-1980s on, OCR almost as poorly as pre-20th century stuff; there's just something about those typefaces, I guess.