Scanning/OCR tips

As a side note to the BB-Jon discussion, I would like to pass on a warning from my experience with FineReader OCR: it is not true that higher-resolution scans always give better OCR. They often cause minute imperfections in the original print to be recognized as letters, punctuation, or diacritics. Sometimes it pays to reduce the scans to 300 DPI and use the reduced images; FineReader seems to expect 300 DPI scans. Higher resolution only helps with very small type.

I consider this a bug in FineReader (it should not recognize as letters image details that are much smaller than the surrounding characters, or incorrectly placed), but it is not something we can control, except by pre-processing the images. The higher-resolution scans often make different errors than the reduced-resolution scans, so a procedure that compares the OCR output at different resolutions might improve overall recognition.

Overall, and not unexpectedly, the OCR seems well tuned to recent print in contemporary language. My impression is that an effort to develop good free OCR software, in which knowledge of the source can be used in the recognition process, would be well spent for the needs of PG.

Another area where considerable progress could be made is spell-checking software, which is tuned much more to typing errors than to OCR errors, especially in old texts. It is common experience that the most frequent OCR errors sit far down the list of suggestions. Free software does exist in this area, however; the problem is one of a metric over the space of corrections that is tuned for OCR.

Carlo
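A minimal sketch of that reduction step, assuming Python with the Pillow imaging library (the file names and the 600 dpi source resolution are only examples, not something FineReader requires):

    from PIL import Image

    SRC_DPI = 600   # resolution of the original scan (assumed for the example)
    DST_DPI = 300   # resolution FineReader seems to be tuned for

    img = Image.open("page_042_600dpi.tif")
    scale = DST_DPI / SRC_DPI
    reduced = img.resize(
        (round(img.width * scale), round(img.height * scale)),
        Image.LANCZOS,   # high-quality downsampling filter
    )
    reduced.save("page_042_300dpi.tif", dpi=(DST_DPI, DST_DPI))

Running the OCR on both the original and the reduced image and diffing the two outputs would be one cheap way to do the cross-resolution comparison mentioned above.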

Carlo Traverso wrote:
As a side note to the BB-Jon discussion, I would like to pass on a warning from my experience with FineReader OCR: it is not true that higher-resolution scans always give better OCR. They often cause minute imperfections in the original print to be recognized as letters, punctuation, or diacritics. Sometimes it pays to reduce the scans to 300 DPI and use the reduced images; FineReader seems to expect 300 DPI scans. Higher resolution only helps with very small type.
Agreed. Higher resolution only improves recognition of small type. Once the resolution is high enough that the thinnest parts of the letters are reliably one pixel thick, if the software misrecognizes a character, it will also misrecognize a larger character of the same shape. At 300 dpi, "normal"-sized roman fonts usually seem to have thick stems 3-4 pixels wide, thin stems 1 pixel wide, and serifs also 1 pixel wide.

Also, greyscale images do not appear to improve OCR with Abbyy either. Although I'm not privy to their algorithms, certain aspects of the user interface suggest to me that the software only operates on black/white values, and even if you take greyscale scans, the software thresholds them for the purposes of recognition. You have the greyscales to save for whatever other purposes you wish to put them to, but the software itself seems to work on a B/W version. My usual scanning practice for DP is to make 300 dpi B/W scans for text, and 300 or 600 dpi greyscale scans for illustrations.
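A minimal sketch of that kind of global thresholding, assuming Python with the Pillow library (the fixed cut-off of 128 and the file names are assumptions; whatever Abbyy actually does internally is not public):

    from PIL import Image

    THRESHOLD = 128   # assumed cut-off; a real engine would likely pick this adaptively

    grey = Image.open("page_042_grey.tif").convert("L")                  # 8-bit greyscale
    bw = grey.point(lambda p: 255 if p > THRESHOLD else 0).convert("1")  # 1-bit black/white
    bw.save("page_042_bw.tif")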
Overall, and not unexpectedly, the OCR seems well tuned to recent print in contemporary language. My impression is that an effort to develop good free OCR software, in which knowledge of the source can be used in the recognition process, would be well spent for the needs of PG.
But it can be trained to recognize other fonts with some success. I have trained Abbyy 5 Pro on blackletter with results that were not stellar, but not exactly embarrassing either. There is a version of Abbyy 7 (unfortunately out of most people's price range) that is designed with oldstyle fonts in mind. If that software is just the Abbyy 7 engine specially trained on old text, it suggests that we might do well to set up a place to share our pre-trained user patterns for old printing styles. -- RS

Robert Shimmin wrote:
Also, greyscale images do not appear to improve OCR with Abbyy either. Although I'm not privy to their algorithms, certain aspects of the user interface suggest to me that the software only operates on black/white values, and even if you take greyscale scans, the software thresholds them for the purposes of recognition. You have the greyscales to save for whatever other purposes you wish to put them to, but the software itself seems to work on a B/W version.
I have found, using FineReader 6.0 Corporate, that for certain kinds of material I do get substantially better recognition results from greyscale. The best examples are some old medical journals from the 1820s that are severely foxed. FineReader is able to recognize most of the text on these in greyscale, where B&W scanning produced images that even humans can't read. In sizing these down for proofing at DP, I found I could not go to B&W but had to go to 2-bit greyscale, and even then a few pages needed the full 8-bit greyscale to be legible.

I always scan at 600 dpi B&W with the sheet-fed high-speed scanner because that slows it down enough for me to hand-feed it (which is often necessary with old paper). It doesn't seem to change the recognition quality much either way. JulietS
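One way to produce such reduced proofing images, sketched with Python and Pillow (the file names are assumptions; posterizing to 2 bits keeps four grey levels, i.e. "2-bit greyscale"):

    from PIL import Image, ImageOps

    grey = Image.open("foxed_page_grey.tif").convert("L")  # full 8-bit greyscale scan
    proof = ImageOps.posterize(grey, 2)                    # keep 2 bits per pixel = 4 grey levels
    proof.save("foxed_page_proof.png")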

I can confirm that FineReader stores the images internally as monochrome. Greyscale probably works better because of an optimized thresholding algorithm; but in general the quality of the B/W scans depends very much on the quality of the scanning software: my B/W scans with the Plustek OpticBook are very much better than the scans from a (low-end) Epson. Carlo
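As an illustration of what an "optimized" threshold might look like, here is a sketch of Otsu's method, assuming Python with numpy and Pillow (the file names are examples; FineReader's actual algorithm is of course not known):

    import numpy as np
    from PIL import Image

    def otsu_threshold(grey):
        # Pick the cut-off that maximizes the between-class variance
        # of the background and foreground grey levels.
        hist, _ = np.histogram(grey, bins=256, range=(0, 256))
        total = grey.size
        sum_all = np.dot(np.arange(256), hist)
        w_bg = 0        # pixel count at or below the candidate threshold
        sum_bg = 0.0
        best_t, best_var = 0, -1.0
        for t in range(256):
            w_bg += hist[t]
            if w_bg == 0:
                continue
            w_fg = total - w_bg
            if w_fg == 0:
                break
            sum_bg += t * hist[t]
            mean_bg = sum_bg / w_bg
            mean_fg = (sum_all - sum_bg) / w_fg
            var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
            if var_between > best_var:
                best_var, best_t = var_between, t
        return best_t

    grey = np.asarray(Image.open("page_042_grey.tif").convert("L"))
    t = otsu_threshold(grey)
    bw = Image.fromarray(np.where(grey > t, 255, 0).astype(np.uint8)).convert("1")
    bw.save("page_042_otsu_bw.tif")

Applying something like this to the greyscale before recognition is one plausible reason a greyscale workflow can beat the scanner's own built-in B/W conversion.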
participants (3)
- Carlo Traverso
- Juliet Sutherland
- Robert Shimmin