
I would like that we archive the scans.
* The original print is more pleasant to read than the ascii or html text.
* Text with figures is better in its original layout.
* Math text is better in its original print than in the TeX or math-html equivalent.
* Many old books are simply too beautiful to be replaced with an ascii copy.
However you archive the scans, please avoid 1-bit scans. 16 grey levels with 200 dpi would be better than 1-bit with 600 dpi in my experience.
Juhana
--
http://music.columbia.edu/mailman/listinfo/linux-graphics-dev for developers of open source graphics software

On Tue, 19 Jul 2005 14:15:16 +0300, Juhana Sadeharju <kouhia@nic.funet.fi> wrote:
| I would like that we archive the scans.
| * The original print is more pleasant to read than the ascii or html text.
| * Text with figures is better in its original layout.
| * Math text is better in its original print than in the TeX or math-html equivalent.
| * Many old books are simply too beautiful to be replaced with an ascii copy.
The object of scanning is to produce a good OCRed result, with minimum proofreading. For myself, I would disregard the alleged beauty of the paper copies to save proofreading effort.
| However you archive the scans, please avoid 1-bit scans. 16 grey levels with 200 dpi would be better than 1-bit with 600 dpi in my experience.
As this is to be, in part, a retroactive exercise, we will have to use whatever exists.
--
Dave Fawthrop <dave hyphenologist co uk>
In Case of Emergency: store the word "ICE" in your mobile phone address book, and against it enter the number of the person you would want to be contacted "In Case of Emergency". http://tinyurl.com/79lz9

--- Dave Fawthrop <hyphen@hyphenologist.co.uk> wrote:
| However you archive the scans, please avoid 1-bit scans. 16 grey levels with 200 dpi would be better than 1-bit with 600 dpi in my experience.
As this is to be, in part, a retroactive exercise, we will have to use whatever exists.
Indeed. The vast majority of scanning done for DP has been bitonal 300DPI. This provides perfectly adequate images for OCR. While low-dpi grayscale images may look prettier, higher resolution black-and-white images generally OCR better, as well as taking up much less disk space.
--
Jon Ingram
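To put rough numbers on the disk-space point, here is a minimal sketch of the uncompressed size of one page at the scan settings mentioned in this thread. The 6 x 9 inch page size and the function name `raw_scan_bytes` are assumptions made for illustration only. Real archives store compressed files, so these figures only indicate raw pixel data, not actual file sizes.

```python
def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size in bytes of one scanned page (no row padding)."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8  # round up to whole bytes


# Assumed page size in inches; not taken from the thread.
PAGE = (6.0, 9.0)

settings = {
    "1-bit, 300 dpi": (300, 1),   # the usual DP scan
    "1-bit, 600 dpi": (600, 1),
    "16 greys, 200 dpi": (200, 4),  # 16 levels need 4 bits per pixel
}

for label, (dpi, bits) in settings.items():
    size = raw_scan_bytes(*PAGE, dpi, bits)
    print(f"{label}: {size / 1024:.0f} KiB uncompressed")
```

Compression changes the comparison considerably, and bitonal pages of text typically compress much better than grayscale ones, so this sketch probably understates the advantage of 1-bit scans on disk.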

Jon Ingram wrote:
Dave Fawthrop wrote:
Someone else wrote (attribution lost, was it Juhana?):
However you archive the scans, please avoid 1-bit scans. 16 grey levels with 200 dpi would be better than 1-bit with 600 dpi in my experience.
As this is to be, in part, a retroactive exercise, we will have to use whatever exists.
Indeed. The vast majority of scanning done for DP has been bitonal 300DPI. This provides perfectly adequate images for OCR. While low-dpi grayscale images may look prettier, higher resolution black-and-white images generally OCR better, as well as taking up much less disk space.
Does the 300 dpi bitonal rule apply for OCRing all text found in all books? What's the smallest point size text which 300 dpi bitonal will still allow reasonably accurate OCR (at least sufficient for the DP process)? Jon

Does the 300 dpi bitonal rule apply for OCRing all text found in all books? What's the smallest point size text which 300 dpi bitonal will still allow reasonably accurate OCR (at least sufficient for the DP process)?
This is a font-dependent question.
What I am about to say applies to Abbyy Finereader; I am not sure how well it applies to other OCR engines. When Abbyy receives grayscale data, before recognizing anything, it converts the image to a B/W image by applying a thresholding algorithm. (It chooses its own threshold based on what it thinks will give the best recognition.) The only advantage of having grayscale scans for OCR purposes is that Abbyy gets to choose its own threshold. If one is making one's own scans using Abbyy, even that doesn't matter, because Abbyy will choose the threshold for turning the grayscale scanner data into B/W images.
Too-small text is difficult to OCR, but once the resolution is high enough, going even higher does not, in my experience, noticeably improve OCR. High enough is font dependent; it seems to me that for Roman types, if the thinnest stems are reliably one pixel wide, then the resolution is high enough. Italics require higher resolution than Roman type for the same font size. 300 dpi is high enough for most books, but is often not high enough for the smaller print used for footnotes, esp. if those footnotes contain italics. I would be surprised to find type too small to be recognized at 600 dpi.
One other issue with using Abbyy to acquire B/W images is that when it applies the threshold to pages containing figures, often a threshold that is proper for recognizing text is improper for the finer features of engravings, or for the shades of gray in lithographs. Large white blocks result. A good practice, then, is to acquire images as grayscale, and then, if disk space is an issue, to save as B/W for non-illustrated pages and as grayscale for illustrated ones.
The last heavily illustrated project I worked on for PG was Robert Hooke's Micrographia. The images in that project as presented are 100-dpi grayscales, and on most screens appear approximately the size they do on the paper edition. However, I could not have acquired images of that quality using the scanner directly; they were acquired at 600 dpi grayscale, heavily Photoshopped, and then reduced to manageable size. (Even simple operations on images of that size took a minute or two to complete.)
-- RS
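The thresholding step described above can be sketched as follows. This is not Abbyy's algorithm, which is not documented in this thread; it uses Otsu's method, a common textbook way of picking a single global threshold, as a stand-in. The function names and the tiny sample image are made up for illustration.

```python
def otsu_threshold(pixels):
    """Pick a global threshold with Otsu's method.

    pixels: iterable of ints in 0..255. Returns t such that pixels <= t
    are treated as ink; t maximises the between-class variance.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    n_dark = 0    # pixels at or below the candidate threshold
    sum_dark = 0  # sum of their grey values
    for t in range(256):
        n_dark += hist[t]
        sum_dark += t * hist[t]
        n_light = total - n_dark
        if n_dark == 0 or n_light == 0:
            continue  # one class is empty, variance undefined
        mean_dark = sum_dark / n_dark
        mean_light = (sum_all - sum_dark) / n_light
        between = n_dark * n_light * (mean_dark - mean_light) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def to_bitonal(rows, threshold):
    """Convert a grayscale image (list of rows) to 1 = ink, 0 = paper."""
    return [[1 if p <= threshold else 0 for p in row] for row in rows]


# Hypothetical 2 x 5 grayscale patch: dark ink on light paper.
patch = [
    [210, 205, 35, 40, 215],
    [208, 30, 32, 212, 220],
]
t = otsu_threshold(p for row in patch for p in row)
print("threshold:", t)
print(to_bitonal(patch, t))
```

A threshold chosen this way separates ink from paper, but any mid-grey value above it becomes plain white, which is consistent with the large white blocks reported for engravings and lithographs.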

What I am about to say applies to Abbyy Finereader; I am not sure how well it applies to other OCR engines. When Abbyy receives grayscale data, before recognizing anything, it converts the image to a B/W image by applying a thresholding algorithm. (It chooses its own threshold based on what it thinks will give the best recognition.)
I believe ScanSoft works the same way. This is usually not a problem, but it can be with unevenly lit pages, pages which have faded at different rates, or pages where the printer goofed and didn't make an even impression.
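A toy illustration of why a single threshold struggles with an unevenly lit page: in the made-up row of pixels below, the left half is brightly lit and the right half is dim, so the ink on the left is lighter than the paper on the right. Comparing each pixel with the mean of its neighbourhood is one common workaround. This is only a sketch under those assumptions and does not describe what ScanSoft or Abbyy actually do; the function names, `radius`, and `offset` are hypothetical.

```python
def global_binarize(row, threshold):
    """1 = ink where the pixel is at or below one fixed threshold."""
    return [1 if p <= threshold else 0 for p in row]


def local_binarize(row, radius, offset):
    """1 = ink where a pixel is darker than its local mean by more than offset."""
    out = []
    for i, p in enumerate(row):
        lo = max(0, i - radius)
        hi = min(len(row), i + radius + 1)
        window = row[lo:hi]
        local_mean = sum(window) / len(window)
        out.append(1 if p < local_mean - offset else 0)
    return out


# Made-up scan line: bright half (paper 220, ink 120), dim half (paper 110, ink 30).
bright = [220, 220, 120, 220, 220, 220, 120, 220]
dim = [110, 110, 30, 110, 110, 110, 30, 110]
row = bright + dim
truth = [0, 0, 1, 0, 0, 0, 1, 0] * 2  # where the ink really is

# No single threshold recovers the ink in both halves.
print(any(global_binarize(row, t) == truth for t in range(256)))

# A neighbourhood-based threshold does, for this toy row.
print(local_binarize(row, radius=2, offset=20) == truth)
```

The same reasoning applies to pages that have faded unevenly or were printed with an uneven impression: the right cut-off differs from one region of the page to another.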

--- Jon Noring <jon@noring.name> wrote:
Does the 300 dpi bitonal rule apply for OCRing all text found in all books? What's the smallest point size text which 300 dpi bitonal will still allow reasonably accurate OCR (at least sufficient for the DP process)?
No, but it's a good rule of thumb. Indeed, the resolution may even be smaller, as some of the content providers (wrongly, in my opinion) resize the images before uploading to reduce the image size so that the images load at an acceptable speed for dialup users.
Those of you wanting to use DP's page images as the basis of an image archive must realise that DP's primary use for the images is as an aid for proofing... it would actually make life worse, not better, for the majority of DP's users and content providers if we started using higher resolution, grayscale images.
--
Jon
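A back-of-the-envelope sketch of how resizing interacts with the "thinnest stems about one pixel wide" criterion suggested earlier in the thread. The stem width is expressed here as a fraction of the point size; the value 0.03 is purely illustrative, not a measurement of any real typeface, and both function names are hypothetical. The conversion assumes 72 points per inch.

```python
def effective_dpi(scan_dpi, scale):
    """Resolution remaining after resizing an image by `scale` (0.5 = half size)."""
    return scan_dpi * scale


def stem_width_px(point_size, stem_fraction, dpi):
    """Approximate width, in pixels, of the thinnest stem of a typeface.

    stem_fraction: stem width as a fraction of the point size
    (an assumed, illustrative parameter).
    """
    stem_inches = point_size * stem_fraction / 72.0  # 72 points per inch
    return stem_inches * dpi


STEM_FRACTION = 0.03  # assumption for illustration only

for label, pts in [("body text", 10), ("footnotes", 8), ("small print", 6)]:
    for dpi in (effective_dpi(300, 0.5), 300, 600):
        px = stem_width_px(pts, STEM_FRACTION, dpi)
        print(f"{label} ({pts} pt) at {dpi:.0f} dpi: thinnest stem ~{px:.2f} px")
```

Under these assumed numbers, halving a 300 dpi scan pushes even body text below one pixel per stem, which is one way to see why resized page images are a poor basis for re-running OCR.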

On 7/19/05, Juhana Sadeharju <kouhia@nic.funet.fi> wrote:
* The original print is more pleasant to read than the ascii or html text.
In some cases, but that generally indicates you're handling it wrong. In other cases, the Gutenberg edition may be the only transcription of the work that isn't in black-letter fonts or is easily legible, very common things when you're working with books available only in facsimile reprint of 16th-18th century copies.
* Text with figures is better in its original layout. * Math text is better in its original print than in the TeX or math-html equivalent.
Typewritten text with equations added in in pen is better than TeX? I think there are good reasons why Knuth made TeX.
However you archive the scans, please avoid 1-bit scans. 16 grey levels with 200 dpi would be better than 1-bit with 600 dpi in my experience.
That's what the OCR program likes. Distributed Proofreaders are very likely to continue producing B&W 300 dpi scans in most cases for the near future.
participants (7)
- Dave Fawthrop
- David Starner
- Geoff Horton
- Jon Noring
- Jonathan Ingram
- Juhana Sadeharju
- Robert Shimmin