Scanning/OCR tips

As a side note to the BB-Jon discussion, I would like to pass on a warning from my experience with FineReader OCR: it is not true that higher-resolution scans always give better OCR. They often cause minute imperfections in the original print to be recognized as letters, punctuation, or diacritics. Sometimes it pays to reduce the scans to 300 DPI and use the reduced images; FineReader seems to expect 300 DPI scans. Higher resolution only helps with very small type.

I consider this a bug in FineReader (it should not recognize as letters image details that are much smaller than the surrounding characters, or incorrectly placed), but it is not something we can control, except by pre-processing the images. The higher-resolution scans often make different errors than the reduced-resolution scans, so a procedure that compares the OCR output at different resolutions might improve overall recognition.

Overall, and not unexpectedly, the OCR seems well tuned to recent print in contemporary language. My impression is that an effort to develop good free OCR software, in which knowledge of the source can be used in the recognition process, would be well spent for the needs of PG.

Another area where considerable progress could be made is spell-checking software, which is tuned much more to typing errors than to OCR errors, especially in old texts. It is common experience that the most frequent OCR errors sit far down the list of suggestions. Free software does exist in this area, however; the problem is one of a metric over the space of corrections that is tuned for OCR.

Carlo
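A minimal sketch of that reduction step, assuming Python with the Pillow imaging library (the file names and the 600 dpi source resolution are only examples, not something FineReader requires):

    from PIL import Image

    SRC_DPI = 600   # resolution of the original scan (assumed for the example)
    DST_DPI = 300   # resolution FineReader seems to be tuned for

    img = Image.open("page_042_600dpi.tif")
    scale = DST_DPI / SRC_DPI
    reduced = img.resize(
        (round(img.width * scale), round(img.height * scale)),
        Image.LANCZOS,   # high-quality downsampling filter
    )
    reduced.save("page_042_300dpi.tif", dpi=(DST_DPI, DST_DPI))

Running the OCR on both the original and the reduced image and diffing the two outputs would be one cheap way to do the cross-resolution comparison mentioned above.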

Carlo Traverso wrote:
As a side note to the BB-Jon discussion, I would like to pass on a warning from my experience with FineReader OCR: it is not true that higher-resolution scans always give better OCR. They often cause minute imperfections in the original print to be recognized as letters, punctuation, or diacritics. Sometimes it pays to reduce the scans to 300 DPI and use the reduced images; FineReader seems to expect 300 DPI scans. Higher resolution only helps with very small type.
Agreed. Higher resolution only improves recognition of small type. Once the resolution is high enough that the thinnest parts of the letters are reliably one pixel thick, if the software misrecognizes a character, it will also misrecognize a larger character of the same shape. At 300 dpi, "normal"-sized roman fonts usually seem to have thick stems 3-4 pixels wide, thin stems 1 pixel wide, and serifs also 1 pixel wide.

Also, greyscale images do not appear to improve OCR with Abbyy either. Although I'm not privy to their algorithms, certain aspects of the user interface suggest to me that the software only operates on black/white values, and even if you take greyscale scans, the software thresholds them for the purposes of recognition. You have the greyscales to save for whatever other purposes you wish to put them to, but the software itself seems to work on a B/W version. My usual scanning practice for DP is to make 300 dpi B/W scans for text, and 300 or 600 dpi greyscale scans for illustrations.
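A minimal sketch of that kind of global thresholding, assuming Python with the Pillow library (the fixed cut-off of 128 and the file names are assumptions; whatever Abbyy actually does internally is not public):

    from PIL import Image

    THRESHOLD = 128   # assumed cut-off; a real engine would likely pick this adaptively

    grey = Image.open("page_042_grey.tif").convert("L")                  # 8-bit greyscale
    bw = grey.point(lambda p: 255 if p > THRESHOLD else 0).convert("1")  # 1-bit black/white
    bw.save("page_042_bw.tif")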
Overall, and not unexpectedly, the OCR seems well tuned to recent print in contemporary language. My impression is that an effort to develop good free OCR software, in which knowledge of the source can be used in the recognition process, would be well spent for the needs of PG.
But it can be trained to recognize other fonts with some success. I have trained Abbyy 5 Pro on blackletter with results that were not stellar, but not exactly embarrassing either. There is a version of Abbyy 7 (unfortunately out of most people's price range) that is designed with oldstyle fonts in mind. If that software is just the Abbyy 7 engine specially trained on old text, it suggests that we might do well to set up a place to share our pre-trained user patterns for old printing styles. -- RS

Robert Shimmin wrote:
Also, greyscale images do not appear to improve OCR with Abbyy either. Although I'm not privy to their algorithms, certain aspects of the user interface suggest to me that the software only operates on black/white values, and even if you take greyscale scans, the software thresholds them for the purposes of recognition. You have the greyscales to save for whatever other purposes you wish to put them to, but the software itself seems to work on a B/W version.
I have found, using FineReader 6.0 Corporate, that for certain kinds of material I do get substantially better recognition results from greyscale. The best examples are some old medical journals from the 1820s that are severely foxed. FineReader is able to recognize most of the text on these in greyscale, where B&W scanning produced images that even humans can't read. In sizing these down for proofing at DP, I found I could not go to B&W but had to go to 2-bit greyscale, and even then a few pages needed the full 8-bit greyscale to be legible.

I always scan at 600 dpi B&W with the sheet-fed high-speed scanner because that slows it down enough for me to hand-feed it (which is often necessary with old paper). It doesn't seem to change the recognition quality much either way. JulietS
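One way to produce such reduced proofing images, sketched with Python and Pillow (the file names are assumptions; posterizing to 2 bits keeps four grey levels, i.e. "2-bit greyscale"):

    from PIL import Image, ImageOps

    grey = Image.open("foxed_page_grey.tif").convert("L")  # full 8-bit greyscale scan
    proof = ImageOps.posterize(grey, 2)                    # keep 2 bits per pixel = 4 grey levels
    proof.save("foxed_page_proof.png")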

I can confirm that FineReader stores the images internally as monochrome. Greyscale probably works better because of an optimized thresholding algorithm; but in general the quality of the B/W scans depends very much on the quality of the scanning software: my B/W scans with the Plustek OpticBook are very much better than the scans from a (low-end) Epson. Carlo
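As an illustration of what an "optimized" threshold might look like, here is a sketch of Otsu's method, assuming Python with numpy and Pillow (the file names are examples; FineReader's actual algorithm is of course not known):

    import numpy as np
    from PIL import Image

    def otsu_threshold(grey):
        # Pick the cut-off that maximizes the between-class variance
        # of the background and foreground grey levels.
        hist, _ = np.histogram(grey, bins=256, range=(0, 256))
        total = grey.size
        sum_all = np.dot(np.arange(256), hist)
        w_bg = 0        # pixel count at or below the candidate threshold
        sum_bg = 0.0
        best_t, best_var = 0, -1.0
        for t in range(256):
            w_bg += hist[t]
            if w_bg == 0:
                continue
            w_fg = total - w_bg
            if w_fg == 0:
                break
            sum_bg += t * hist[t]
            mean_bg = sum_bg / w_bg
            mean_fg = (sum_all - sum_bg) / w_fg
            var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
            if var_between > best_var:
                best_var, best_t = var_between, t
        return best_t

    grey = np.asarray(Image.open("page_042_grey.tif").convert("L"))
    t = otsu_threshold(grey)
    bw = Image.fromarray(np.where(grey > t, 255, 0).astype(np.uint8)).convert("1")
    bw.save("page_042_otsu_bw.tif")

Applying something like this to the greyscale before recognition is one plausible reason a greyscale workflow can beat the scanner's own built-in B/W conversion.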
participants (3)
- Carlo Traverso
- Juliet Sutherland
- Robert Shimmin