
Carlo Traverso wrote:
In the margin of the BB-Jon discussion, I would like to issue a warning from my experience with FineReader OCR: it is not true that higher resolution scans always provide better OCR. Often they cause minute imperfections of the original print to be recognized as letters, punctuation or diacritics. Sometimes it pays to reduce the resolution of the scans to 300 DPI and use the reduced images; FineReader seems to expect 300 DPI scans. Higher resolution is only better with very small type.
Agreed. Higher resolution only improves recognition of small type. Once the resolution is high enough that the thinnest parts of the letters are reliably one pixel thick, if the software misrecognizes the character, it will also misrecognize a larger character of the same shape. At 300 dpi, "normal" sized roman fonts usually seem to have thick stems 3-4 pixels wide, thin stems 1 pixel wide, and serifs also 1 pixel wide. Also, greyscale images do not appear to improve OCR with Abbyy either. Although I'm not privy to their algorithms, certain aspects of the user interface suggest to me that the software only operates on black / white values, and even if you take greyscale scans, the software thresholds them for the purposes of recognition. You have the greyscales to save for whatever other purposes you wish to put them to, but the software itself seems to make use of a B/W version. My usual scanning practice for DP is to make 300 dpi B/W scans for text, and 300 or 600 dpi greyscale scans for illustrations.
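To make the two points above concrete, here is a rough sketch in Python. It is only an illustration: I don't know how Abbyy actually thresholds, so the fixed cutoff of 128 is my assumption, and the function names and the tiny sample "page" are made up for the example.

```python
def scaled_stroke_px(px_at_src, src_dpi, dst_dpi):
    """Width in pixels a stroke would have if the same page were
    scanned at dst_dpi instead of src_dpi (simple linear scaling)."""
    return px_at_src * dst_dpi / src_dpi


def threshold(grey_rows, cutoff=128):
    """Reduce rows of 0-255 grey values to 1 (black) / 0 (white).
    The cutoff is an assumed value, not Abbyy's real one."""
    return [[1 if value < cutoff else 0 for value in row] for row in grey_rows]


# A thin stem that is 1 pixel wide at 300 dpi becomes 2 pixels at 600 dpi,
# but drops below a full pixel at 150 dpi.
thin_at_600 = scaled_stroke_px(1, 300, 600)
thin_at_150 = scaled_stroke_px(1, 300, 150)

# A made-up 2x5 patch of greyscale scan (0 = black, 255 = white).
page = [
    [250, 240, 30, 20, 245],
    [248, 120, 25, 130, 251],
]
bw = threshold(page)
```

Once a greyscale scan has been reduced like this, the in-between values (120 and 130 here) are gone, which would explain why feeding greyscale to the OCR doesn't seem to buy anything.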
Globally, and not unexpectedly, the OCR seems well tuned to recent print in contemporary language. My impression is that an effort of developing free OCR software of good quality, in which the knowledge of the source can be used in the recognition process, could be well spent for the needs of PG.
But it can be trained to recognize other fonts with some success. I have trained Abbyy 5 Pro on blackletter with not stellar, but not exactly embarrassing, results. There is a version of Abbyy 7 (unfortunately out of most people's price range) that is designed with old-style fonts in mind. If this software is the Abbyy 7 engine, specially trained on old text, it suggests that we might do well to set up a place to share our pre-trained user patterns for old printing styles. -- RS