600 dpi vs. 300 dpi for text (a quickie visual experiment)

Everyone,
I've placed online several bitonal test images (lossless PNG format) derived from scans of 5 point, 6 point and 11 point text. There are both 600 dpi and 300 dpi versions for each point size.
http://www.openreader.org/600vs300/
(I won't go into detail about how I generated these, except to say that the original scans were done at 600 dpi optical, full color, and a 300 dpi full color version was then generated by high-quality resampling. Both the 600 dpi and the derived 300 dpi full color images were then converted to bitonal by thresholding, with the threshold values adjusted by eye to give consistent results (my eyes, for better and for worse). Nowhere in the process was a lossy format like JPEG used.)
For each point size, the best way to visually compare the 600 dpi with the 300 dpi is to print them out using Photoshop or Paint Shop Pro (or a similar higher-end graphics program), adjusting the scaling so the same block of text appears identical in size on paper.
From this quickie experiment, which certainly could be improved, I draw the following preliminary conclusions:
1) If the scans are to be used for OCR only (which is DP's focus at present), then from the visual test alone 300 dpi bitonal appears sufficient for 5 point and larger Latin character set text -- this applies to just about all text documents DP will ever encounter. Greyscale would certainly improve things, but bitonal appears to be minimally sufficient for OCR. Of course, this observation, based solely on eye, jibes with DP's OCR experience. So I'm not concluding anything revolutionary here.
2) Likewise, 300 dpi bitonal is *readable* by human beings for 5 point and larger Latin character set text.
3) However, for smooth, comfortable readability, 600 dpi is definitely better, even for the 11 point text. 300 dpi clearly looks ragged (especially the 5 point text). Of course, anti-aliasing during presentation will overcome some of this raggedness, but such anti-aliasing is strictly artificial and won't fix letters which are mangled in some manner due to the reduced resolution.
If the scans are intended for multiple use cases (and not only OCR), then it appears wise to scan text at 600 dpi, preferably 24-bit full color (which aids with image cleanup, and is of course necessary when we are dealing with colored text and color illustrations). These master scans can be resized and/or reduced in color depth using batch image processing for whatever purpose is required (e.g., direct online reading, OCR, etc.).
Of course, the huge downside of higher-rez, higher-color-depth images is much greater file sizes. This causes difficulty with online archival storage and transport. For the short term, these master images probably need to be stored offline on some sort of storage media (such as DVD-ROM, tape or removable high-capacity hard drives). [see note below]
At this point, since DP is not concerned with archivability and multiple use cases, and has limited bandwidth and disk storage, there is no reason for them to require 600 dpi for scanning of text. (Illustrations are another matter, as Juliet has noted.)
But I do believe that those who are submitting scans to DP should seriously consider doing all scans at 600 dpi full color (especially if the scans can be done without page distortion, such as when the book is chopped and run through a sheet-feed scanner), and then resample them to 300 dpi (bitonal or greyscale) for submission to DP. Back up the original master scans in lossless form until they can be donated to a future page scan archive.
Just some thoughts. Comments?
Jon Noring
[Note: Obviously, high-resolution, high-color-depth scans can be substantially compressed using a lossy algorithm, such as JPEG and a few others. However, lossy compression adds artifacts to the images, so I believe lossy algorithms should be avoided for all steps in the process of producing the master scan images. Rather, use PNG or another high-quality lossless compression algorithm for all steps in producing the master scan images.]
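The resample-then-threshold steps described in this message can be sketched roughly as follows. This is a minimal pure-Python illustration on a tiny greyscale pixel grid, not the procedure actually used for the test images: the 2x2 box average only stands in for the "high-quality resampling" of a graphics program, the pixels are greyscale rather than full color, and the cutoff of 128 is an arbitrary placeholder for threshold values that were adjusted by eye. The function names are made up for this example.

```python
def downsample_2x(pixels):
    """Halve resolution (e.g. 600 dpi -> 300 dpi) by averaging each 2x2 block.

    `pixels` is a list of rows of greyscale values (0 = black, 255 = white).
    An odd trailing row or column is dropped for simplicity.
    """
    out = []
    for y in range(0, len(pixels) - 1, 2):
        row = []
        for x in range(0, len(pixels[y]) - 1, 2):
            total = (pixels[y][x] + pixels[y][x + 1]
                     + pixels[y + 1][x] + pixels[y + 1][x + 1])
            row.append(total // 4)
        out.append(row)
    return out


def threshold(pixels, cutoff=128):
    """Convert greyscale to bitonal: 1 = ink (darker than cutoff), 0 = paper."""
    return [[1 if value < cutoff else 0 for value in row] for row in pixels]


# A tiny 4x4 "scan": a one-pixel-wide dark vertical stroke on white paper.
scan_600 = [
    [255, 0, 255, 255],
    [255, 0, 255, 255],
    [255, 0, 255, 255],
    [255, 0, 255, 255],
]

bitonal_600 = threshold(scan_600)   # thresholded at full resolution
scan_300 = downsample_2x(scan_600)  # the stroke is averaged with its white neighbour
bitonal_300 = threshold(scan_300)   # thresholded at half resolution
```

In this toy case the stroke becomes a mid-grey pixel after downsampling, so whether it survives thresholding depends entirely on the cutoff: it is kept at the default of 128 but disappears with `threshold(scan_300, 100)`. That is one simple way thin strokes in small type can end up mangled at the lower resolution.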

--- Jon Noring <jon@noring.name> wrote:
Of course, the huge downside of higher-rez, higher-color-depth images is much greater file sizes.
600dpi also takes about six times as long on my scanner as 300dpi. Remember some of us are poor and dealing with old equipment :)

Certainly, it takes a scanner longer to scan at 600 dpi than 300 dpi, so for major book scanning projects, scanning speed is a factor to consider. However, I would think that many projects submitted to DP are done by individuals who are only doing one to a few books in a year, not hundreds of them a week. So for them, choosing the slower approach may be acceptable to them, especially if they cherish the book and wish to assure the scans they make are of the highest reasonable quality for archival/preservation purposes. (To the DP folk: what is the breakdown of book scans submitted to DP by users -- do just a few supply most of the scans, or are most of the scans submitted by a lot of people?) In my experience scanning both My Antonia and the Kama Sutra, I took advantage of the slower speed of scanning (and I have a slow scanner to begin with) to do filenaming and other tasks that needed to be done anyway. After all, each scan needed to be looked at to determine scan quality and to read off the publisher supplied page number, so that info could be written into the filename when the image was saved. So in total time *to me*, it differed little whether it was 300 dpi or 600 dpi. I simply multitasked. In fact, at 300 dpi, where I could not multitask, it may have taken me longer to finish the job. YMMV... Speed also depends upon the speed and quality of the scanner one uses. My scanner is a Microtek Scanmaker X6EL. A pretty inexpensive but reasonable quality flat bed scanner with 600 dpi optical resolution. Not exactly a speed demon. (Most flatbeds today are 1200 dpi optical, which is pretty much the practical limit for such scanners due to mechanism vibration and other factors, so the experts have told me.) 
My plea is that those who are doing scans for DP seriously consider 600 dpi full color, especially if they've chopped the book or are using a scanner designed for book scanning, such as the Plustek Optibook, where there's no page and lighting distortion due to bending pages. (It's sort of silly to scan at 600 dpi when there's substantial page distortion due to scanning a bound book on an ordinary flatbed.) Then convert those master scans to DP requirements, and archive the master originals. So all I'm doing is suggesting that DP's scan contributors consider higher-rez/full color -- a few may choose to take this route as they assess it for themselves. Jon

On 7/20/05, Jon Noring <jon@noring.name> wrote:
Certainly, it takes a scanner longer to scan at 600 dpi than 300 dpi, so for major book scanning projects, scanning speed is a factor to consider.
However, I would think that many projects submitted to DP are done by individuals who are only doing one to a few books in a year, not hundreds of them a week. So for them, choosing the slower approach may be acceptable to them, especially if they cherish the book and wish to assure the scans they make are of the highest reasonable quality for archival/preservation purposes.
(To the DP folk: what is the breakdown of book scans submitted to DP by users -- do just a few supply most of the scans, or are most of the scans submitted by a lot of people?)
http://www.pgdp.net/c/stats/pm_stats.php is the closest available. Note that meta-PM accounts (or whatever they're called) like BEGIN will throw the numbers off, as will items scanned by providers-only (proofraided pages or non-PM CPs). In any case, the majority of the texts available are scanned by the high-volume PMs. Note that I am #39 on the list, despite only having created 67 projects. Very few image repositories supply page images in color.. archive.org being the notable exception. (And how many people will download 2 gigs of TIFFs for a single work?) Also, I believe you are over-estimating the quality of the high speed scanners.. Juliet is constantly cleaning it to get decent B/W images. It has a tendency to gather book dust on mirrors and calibration areas, and get lines through the output.
Speed also depends upon the speed and quality of the scanner one uses. My scanner is a Microtek Scanmaker X6EL. A pretty inexpensive but reasonable quality flat bed scanner with 600 dpi optical resolution. Not exactly a speed demon. (Most flatbeds today are 1200 dpi optical, which is pretty much the practical limit for such scanners due to mechanism vibration and other factors, so the experts have told me.)
Err.. not sure about that. I just purchased a 4800 DPI optical scanner for scanning microform works, for a fairly reasonable price. The standard resolutions now seem to be 2400 and 3600 DPI optical. (4800x9600, 3600x7200, and 2400x4800 in marketing speak.) As for scanning in full color 600 DPI.. I tried that on a few of the Beatrix Potter books, which are very dense with illustrations. It took significantly longer to scan and process them in that fashion than with my standard technique. (My standard technique for the Potters: first pass, scan in 300 DPI greyscale and let Finereader threshold to 1-bit. Second pass, for images, 600 DPI full color. Then I descreen, adjust the levels and color balance, downsample to 300 DPI and save as PNGs.) Uploading the unprocessed 600 DPI plates is impractical, even for works as small as the Potter books. For ordinary books I leave the image processing to the PPs, except perhaps descreening. R C
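To put rough numbers on why unprocessed 600 DPI color plates are awkward to upload, here is a back-of-the-envelope sketch of uncompressed image size. The page dimensions are a hypothetical example, not taken from any of the books mentioned, and real PNG files will be smaller by an amount that depends on the page content.

```python
def raw_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size of a scan: pixel count times bits per pixel, in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel / 8


# Hypothetical 4 x 5.5 inch page (an assumption for illustration only).
PAGE_W, PAGE_H = 4.0, 5.5

color_600 = raw_size_bytes(PAGE_W, PAGE_H, 600, 24)   # 24-bit full color
grey_300 = raw_size_bytes(PAGE_W, PAGE_H, 300, 8)     # 8-bit greyscale
bitonal_300 = raw_size_bytes(PAGE_W, PAGE_H, 300, 1)  # 1-bit bitonal

print(f"600 dpi color:   {color_600 / 1e6:.1f} MB")
print(f"300 dpi grey:    {grey_300 / 1e6:.1f} MB")
print(f"300 dpi bitonal: {bitonal_300 / 1e6:.2f} MB")
print(f"color/bitonal ratio: {color_600 / bitonal_300:.0f}x")
```

Whatever page size is plugged in, doubling the resolution quadruples the pixel count and going from 1-bit to 24-bit multiplies the bits per pixel by 24, so before compression a 600 DPI full-color scan is 4 x 24 = 96 times the size of a 300 DPI bitonal one.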

--- Jon Noring <jon@noring.name> wrote:
But I do believe that those who are submitting scans to DP should seriously consider doing all scans at 600 dpi full color (especially if the scans can be done without page distortion such as if the book is chopped and run through a sheet-feed scanner), and then resample them to 300 dpi (bitonal or greyscale) for submission to DP. Backup the original master scans in lossless form until they can be donated to a future page scan archive. Just some thoughts..
Comments?
Obviously 600DPI full colour looks better than 300DPI bitonal, but this extra quality comes at a high price. Personally, moving from one to the other would not happen until someone buys me a faster scanner, a faster computer, a new large hard disk, and pays me for the extra time it will take me to scan and process the material even with this upgraded equipment. Even then, the high quality scans will never get off my computer unless you buy me a faster internet connection. Even if some people do decide to make high resolution masters, there's little need to make colour scans of black-and-white originals. Grayscale, maybe, but all a colour scan will show you is how yellowed the paper is. You also blithely say 'backup the original master scans', perhaps ignoring just how large lossless full-colour 600DPI scans are. Because I've been scanning for OCR, I can store the over 900 items I've scanned for DP on my hard disk (the folder takes up a little over 33 GB). A single large and long book scanned at 600DPI full-colour could end up taking up that much space by itself... and I don't particularly want to have to burn multiple DVDs every single time I scan something (ah yes, you'll need to pay for the DVD recorder and media). In case you think this is hyperbole, I've just tested out the difference in size and speed, scanning a page of a quarto work (Lloyd's Encyclopaedic Dictionary) using my trusty Epson 1660: 300DPI bitonal: 10 seconds (plus 1 second to save the PNG), 216 KB. 600DPI full-colour: 58 seconds (plus another minute to save the PNG), 42.2 MB. There are 770 pages in this volume (which is one of seven). Assuming these times and sizes are indicative of the average (and it's an average page from the text), and allowing 8 seconds after each page to set the next scan going, scanning at 300DPI bitonal would take a touch over 4 hours without a break, and take up 162 MB. Scanning at 600DPI full-colour without a break would take 25 1/4 hours, and take up *31 GB*. 
-- Jon Ingram
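The totals in the message above can be rechecked with a few lines of arithmetic. This sketch uses only the per-page figures quoted there; it assumes "another minute" means 60 seconds and that sizes use binary units (1 MB = 1024 KB, 1 GB = 1024 MB). The function and variable names are made up for this example.

```python
PAGES = 770
HANDLING_S = 8  # allowance per page to set the next scan going


def totals(scan_s, save_s, size_kb, handling_s=HANDLING_S):
    """Return (hours, megabytes) for scanning the whole volume without a break."""
    hours = PAGES * (scan_s + save_s + handling_s) / 3600
    megabytes = PAGES * size_kb / 1024
    return hours, megabytes


# 300 DPI bitonal: 10 s scan, 1 s save, 216 KB per page.
bitonal_h, bitonal_mb = totals(scan_s=10, save_s=1, size_kb=216)

# 600 DPI full colour: 58 s scan, 60 s save, 42.2 MB per page.
color_h, color_mb = totals(scan_s=58, save_s=60, size_kb=42.2 * 1024)

# Same again, but leaving out the 8 s per-page allowance.
color_h_no_handling, _ = totals(58, 60, 42.2 * 1024, handling_s=0)

print(f"300 dpi bitonal: {bitonal_h:.2f} h, {bitonal_mb:.0f} MB")
print(f"600 dpi colour:  {color_h:.2f} h, {color_mb / 1024:.1f} GB")
print(f"600 dpi colour, no handling allowance: {color_h_no_handling:.2f} h")
```

Under these assumptions the bitonal figures come out as stated (a little over 4 hours, about 162 MB), as does the colour size (between 31 and 32 GB). The quoted 25 1/4 hours for colour appears to correspond to leaving out the 8 second allowance; with the allowance applied to both cases the colour total is closer to 27 hours. Either way the colour scan takes more than six times as long.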

On 20 Jul 2005, at 13:39, Jon Noring wrote:
But I do believe that those who are submitting scans to DP should seriously consider doing all scans at 600 dpi full color
DP should be as accessible as possible to content providers (those who provide scans) and every roadblock we put in their way is A Bad Thing, period. If you have use for our waste product (the scans), then more power to you! But as long as our main product serves a higher goal than the waste product, I think we should squarely focus on producing the main product, i.e. plain vanilla etexts of as many books as possible for as many people as possible for as long a time as possible. The one exception would be if you could somehow provide us with scans (as many projects already do) in as troublefree a manner as is humanly possible. But such a thing need not be done in the context of PG or DP, and I doubt it even needs to be discussed here (although, of course, here is where you will find like-minded people). -- branko collin collin@xs4all.nl

Branko wrote:
Jon Noring wrote:
But I do believe that those who are submitting scans to DP should seriously consider doing all scans at 600 dpi full color
DP should be as accessible as possible to content providers (those who provide scans) and every roadblock we put in their way is A Bad Thing, period.
Note carefully what I said above. I am not suggesting that DP increase their scan submission requirements, but *suggesting* that those who provide scans should scan them at higher resolution and color depth.
If you have use for our waste product (the scans), then more power to you! But as long as our main product serves a higher goal than the waste product, I think we should squarely focus on producing the main product, i.e. plain vanilla etexts of as many books as possible for as many people as possible for as long a time as possible.
But this begs the question -- are book scans a "waste product"? This is the crux of the issue: the value of the book scans themselves. I believe they are not a waste product, while others in the PG universe consider them solely as a necessary evil to get to the final structured digital text.
The one exception would be if you could somehow provide us with scans (as many projects already do) in as troublefree a manner as is humanly possible. But such a thing need not be done in the context of PG or DP, and I doubt it even needs to be discussed here (although, of course, here is where you will find like-minded people).
Definitely! There's only two communities really interested in scanning old books: PG and DP. (There's also some academic communities, but by and large they are either interested only in a very small subset, or take a closed and proprietary position to the availability of the scans to the public.) Jon

On 22 Jul 2005, at 8:31, Jon Noring wrote:
Branko wrote:
Jon Noring wrote:
But I do believe that those who are submitting scans to DP should seriously consider doing all scans at 600 dpi full color
DP should be as accessible as possible to content providers (those who provide scans) and every roadblock we put in their way is A Bad Thing, period.
Note carefully what I said above. I am not suggesting that DP increase their scan submission requirements, but *suggesting* that those who provide scans should scan them at higher resolution and color depth.
Unfortunately, people might take that to heart and start providing high-quality scans in the time that they could have provided four times as many low-quality scans. Good for you, bad for PG.
If you have use for our waste product (the scans), then more power to you! But as long as our main product serves a higher goal than the waste product, I think we should squarely focus on producing the main product, i.e. plain vanilla etexts of as many books as possible for as many people as possible for as long a time as possible.
But this begs the question -- are book scans a "waste product"?
To PG/DP: yes, most of the time. Don't take that as a negative thing: one man's waste product can be another man's gold.
This is the crux of the issue: the value of the book scans themselves. I believe they are not a waste product, while others in the PG universe consider them solely as a necessary evil to get to the final structured digital text.
I think it goes deeper than that, even to or near the core of PG's philosophy. If I had been Michael Hart, I might have set up a scan archive first, reasoning that once OCR quality had improved to the point that it would yield 99.8 % perfect texts, I could always convert images to text. But I am not. Of course, I am always free to start my own project, one that works exactly on the basis I just outlined, but I personally think that is not worth the bother. I prefer to create value now at PG than in the distant future at my own project.
The one exception would be if you could somehow provide us with scans (as many projects already do) in as troublefree a manner as is humanly possible. But such a thing need not be done in the context of PG or DP, and I doubt it even needs to be discussed here (although, of course, here is where you will find like-minded people).
Definitely! There's only two communities really interested in scanning old books: PG and DP. (There's also some academic communities, but by and large they are either interested only in a very small subset, or take a closed and proprietary position to the availability of the scans to the public.)
There's archive.org, the Million Books project, the Canadian Libraries, several PG-like projects (Runeberg, Project Madura), CCEL, Blackmask, Sacred Texts, and I am sure there are dozens of others (just think of all the author-related associations that scan books!). PG is just one of the biggest (and certainly oldest) fishes in the pond, but by no means the only one. -- branko collin collin@xs4all.nl

Branko wrote:
Jon wrote:
Definitely! There's only two communities really interested in scanning old books: PG and DP. (There's also some academic communities, but by and large they are either interested only in a very small subset, or take a closed and proprietary position to the availability of the scans to the public.)
There's archive.org, the Million Books project, the Canadian Libraries, several PG-like projects (Runeberg, Project Madura), CCEL, Blackmask, Sacred Texts, and I am sure there are dozens of others (just think of all the author-related associations that scan books!). PG is just one of the biggest (and certainly oldest) fishes in the pond, but by no means the only one.
Yes, you are right. When I mentioned "community", I was thinking of discussion communities/forums. And though I'm not that familiar with a couple of the above, I would assume the most active public discussion forums with regard to digitizing texts are gutvol-d and a couple of the DP forums. (Well, there is some discussion on Ockerbloom's "Book People" forum.) Do any of the above-mentioned projects have public discussion forums that match gutvol-d and the DP forums for volume and diversity of topics? I know that IA's public discussion forum ('archivists-talk') is pretty much dead (I know this because I moderate it for Brewster -- I haven't yet started to promote it, but am awaiting some decisions at IA's end.) Jon

On 22 Jul 2005, at 11:52, Jon Noring wrote:
Do any of the above mentioned projects have public discussion forums that match gutvol-d and the DP forums for volume and diversity of topics?
Size matters? Tsk tsk... Anyway, I do not know these communities well enough, but I figure that if they have a somewhat decent throughput, they will have thought about these matters. I know that CCEL has forums. -- branko collin collin@xs4all.nl
participants (5)
- Branko Collin
- Jon Niehof
- Jon Noring
- Jonathan Ingram
- Robert Cicconetti