
As an aside to the BB-Jon discussion, I would like to issue a warning from my experience with FineReader OCR: it is not true that higher-resolution scans always yield better OCR. Often they cause minute imperfections of the original print to be recognized as letters, punctuation marks or diacritics. Sometimes it pays to reduce the scans to 300 DPI and use the reduced images; FineReader seems to expect 300 DPI scans. Higher resolution helps only with very small type. I think this is a bug in FineReader (it should not recognize as letters image details that are much smaller than the other characters, or incorrectly placed), but it is something over which we have no control, except by pre-processing the images. The higher-resolution scans often produce different errors than the reduced ones; procedures that compare the OCR output at different resolutions might lead to better overall recognition.

Overall, and not unexpectedly, the OCR seems well tuned to recent print in contemporary language. My impression is that an effort to develop free OCR software of good quality, in which knowledge of the source can be used in the recognition process, would be well spent for the needs of PG.

Another area where considerable progress could be made is spell-checking software, which is much more tuned to typing errors than to OCR errors, especially for old texts. It is common experience that the most frequent OCR errors appear far down the list of suggestions. This, however, is a domain in which free software already exists; the problem is one of a metric tuned for OCR in the space of corrections.

Carlo
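The "metric tuned for OCR" idea in the last point can be sketched as a weighted edit distance, where substitutions between visually confusable glyphs cost less than arbitrary substitutions, so typical OCR misreads rank near the top of a suggestion list. The confusion pairs and cost values below are invented for illustration, not taken from any real OCR error statistics:

```python
# Sketch of an OCR-aware edit distance: substituting visually similar
# glyphs (and merging "rn" <-> "m") is cheaper than ordinary edits,
# so corrections of typical OCR misreads score better than typo-style
# corrections. Confusion table and costs are hypothetical examples.

CONFUSABLE = {          # (read, intended) pairs an OCR engine often swaps
    ("c", "e"), ("e", "c"),
    ("1", "l"), ("l", "1"),
    ("0", "o"), ("o", "0"),
    ("h", "b"), ("b", "h"),
}

def ocr_distance(a: str, b: str) -> float:
    """Levenshtein distance with reduced cost for confusable glyphs."""
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = float(i)
    for j in range(n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                sub = 0.0
            elif (a[i - 1], b[j - 1]) in CONFUSABLE:
                sub = 0.25              # visually similar: cheap substitution
            else:
                sub = 1.0
            d[i][j] = min(d[i - 1][j] + 1.0,        # deletion
                          d[i][j - 1] + 1.0,        # insertion
                          d[i - 1][j - 1] + sub)    # substitution
            # "rn" misread as "m" (and vice versa) at reduced cost
            if i >= 2 and a[i - 2:i] == "rn" and b[j - 1] == "m":
                d[i][j] = min(d[i][j], d[i - 2][j - 1] + 0.25)
            if j >= 2 and b[j - 2:j] == "rn" and a[i - 1] == "m":
                d[i][j] = min(d[i][j], d[i - 1][j - 2] + 0.25)
    return d[m][n]

# Ranking suggestions by this metric puts OCR-style corrections first:
# ocr_distance("modcrn", "modern") is 0.25, while an unrelated word
# like "margin" stays at full edit cost.
```

A real implementation would learn the confusion costs from aligned OCR output and ground truth for the period and typeface at hand, which is exactly where knowledge of the source could enter the process.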