
Does the 300 dpi bitonal rule apply for OCRing all text found in all books? What's the smallest point size text which 300 dpi bitonal will still allow reasonably accurate OCR (at least sufficient for the DP process)?
This is a font-dependent question. What I am about to say applies to Abbyy FineReader; I am not sure how well it applies to other OCR engines. When Abbyy receives grayscale data, before recognizing anything, it converts the image to a B/W image by applying a thresholding algorithm. (It chooses its own threshold based on what it thinks will give the best recognition.) The only advantage of having grayscale scans for OCR purposes is that Abbyy gets to choose its own threshold. If one is making one's own scans using Abbyy, even that doesn't matter, because Abbyy will choose the threshold for turning the grayscale scanner data into B/W images.
Text that is too small for the scan resolution is difficult to OCR, but once the resolution is high enough, going higher still does not, in my experience, noticeably improve OCR. "High enough" is font dependent; it seems to me that for Roman types, if the thinnest stems are reliably one pixel wide, then the resolution is high enough. Italics require higher resolution than Roman type at the same font size. 300 dpi is high enough for most books, but is often not high enough for the smaller print used for footnotes, especially if those footnotes contain italics. I would be surprised to find type too small to be recognized at 600 dpi.
One other issue with using Abbyy to acquire B/W images is that, on pages containing figures, a threshold that is proper for recognizing text is often improper for the finer features of engravings, or for the shades of gray in lithographs. Large white blocks result. A good practice, then, is to acquire images as grayscale, and then, if disk space is an issue, to save as B/W for non-illustrated pages and as grayscale for illustrated ones.
The last heavily illustrated project I worked on for PG was Robert Hooke's Micrographia. The images in that project as presented are 100-dpi grayscales, and on most screens appear approximately the size they do in the paper edition.
However, I could not have acquired images of that quality using the scanner directly; they were acquired at 600 dpi grayscale, heavily Photoshopped, and then reduced to manageable size. (Even simple operations on images of that size took a minute or two to complete.) -- RS
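The "thinnest stems at least one pixel wide" rule of thumb above can be turned into rough arithmetic. The sketch below is only an illustration: it assumes 72 points per inch (the desktop-publishing point; traditional printers' points differ slightly), and the parameter stroke_fraction (the thinnest stroke as a fraction of the type size) is a hypothetical value that has to be estimated for the actual typeface. The function names and the 0.03 figure are made up for this example and do not come from Abbyy.

```python
POINTS_PER_INCH = 72.0  # assumption: DTP point, 1 pt = 1/72 inch


def thin_stroke_pixels(point_size, dpi, stroke_fraction):
    """Approximate width, in pixels, of the thinnest stroke of a typeface.

    stroke_fraction is a hypothetical, typeface-specific ratio of the
    thinnest stroke width to the type size (for example 0.03).
    """
    if point_size <= 0 or dpi <= 0 or stroke_fraction <= 0:
        raise ValueError("all arguments must be positive")
    type_size_inches = point_size / POINTS_PER_INCH
    return dpi * type_size_inches * stroke_fraction


def min_dpi_for_one_pixel(point_size, stroke_fraction):
    """Lowest resolution at which the thinnest stroke spans one full pixel."""
    if point_size <= 0 or stroke_fraction <= 0:
        raise ValueError("all arguments must be positive")
    return POINTS_PER_INCH / (point_size * stroke_fraction)


if __name__ == "__main__":
    # Illustrative ratio only; measure the real typeface before relying on it.
    assumed_fraction = 0.03
    for size in (12, 10, 8, 6):
        px = thin_stroke_pixels(size, 300, assumed_fraction)
        need = min_dpi_for_one_pixel(size, assumed_fraction)
        print(f"{size} pt: {px:.2f} px at 300 dpi, one-pixel threshold {need:.0f} dpi")
```

Under that assumed ratio, body-size type clears one pixel at 300 dpi while footnote-size type does not, which matches the experience described above. Italics and typefaces with very fine hairlines would need a smaller stroke_fraction, and so a higher resolution.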