
Bowerbird said:
geoff said:
The last, oh, five or six books I've submitted to PGDP have been photographed (on a 4 MP camera), not scanned. It's a lot less rough on books, and the results were as good as scanning once I figured out what I was doing.
i'm skeptical of that claim.
I do think it is possible to get good results using a digital camera to scan books (especially if one is not concerned with the archival quality of the scans). HOWEVER, there are issues that have to be dealt with, and only a few people have the necessary aptitude to deal with them. Thus Bowerbird's skepticism is valid if we look at "scanning by the masses". Here are some of the technical issues as I see them:
1) Focus. It is important to get a very good focus. Focus is affected by the focal length of the lens, the aperture (f-stop), and the lens quality. A higher f-stop improves focus (and increases depth of field), but it also requires more light and/or a longer exposure time, and longer exposures are much more susceptible to camera and tripod vibration. With even a low-end consumer scanner, none of the above are issues.
2) Minimizing optical distortion. Optical distortion is caused by two factors: a) astigmatism due to the poor optics found on cheaper cameras, and b) barrel distortion due to using too short a lens focal length (meaning the lens is closer to the page being photographed). Poor optics are fixed by $$$ (a better camera), and barrel distortion is fixed either by using a longer focal length lens or by digital post-processing to remove the distortion (a calibration grid helps with the correction). The downside of a longer focal length lens is that the camera must be further from the page, so the frame supporting the camera may have to be bigger (leading to greater vibration issues). Lighting needs also increase (unless one gets a *big* aperture lens, which is $$$ and not available for most non-SLR cameras -- in fact, the optimum lens for page scanning is probably not available for most consumer-level, non-SLR digital cameras).
Anyone who has experience in photography knows all of these issues. Those who don't are probably beginning to realize the non-triviality of "do-it-yourself scanning" using a digital camera.
Of course, reasonable-quality scanners don't have any of these issues.
3) Lighting. As should be obvious by now, good lighting is needed to allow a higher f-stop, a shorter exposure time, etc. However, it is also important to have the right kind of lighting (mostly diffuse) so that there are no errant reflections off the page and no variation in intensity across the page. Achieving good lighting is not a trivial exercise. Lighting is not an issue with scanners.
4) General complexity of the process. Unless one uses a well-designed frame built specifically for a higher-quality digital camera (preferably a professional or "prosumer" level SLR, which costs a minimum of $1000), using a digital camera to achieve good to excellent quality page scans is simply out of reach except for very mechanically adept DIY people. Some DIY solutions are likely to be very kludgy -- unstable and requiring "4 hands" to operate. Overall, scanners are simpler for the average Joe to run. If minimizing harm to the book is required, then, as Bowerbird recommends, get a Plustek OpticBook or similar scanner designed to be easy on book bindings.
Now, this should not dissuade people from using digital cameras for book scanning, but one cannot just run down to Wal-Mart, buy a $100 4-megapixel camera, and get perfect scans right out of the box. It requires not only a better-quality camera, but also well-done lighting, some kind of custom frame or tripod to hold the camera in the right position (plus a means of ensuring that the page of the book lies within the focal plane of the lens, another issue not mentioned above), lots of futzing to get the system and settings right (and to overcome engineering issues), etc., etc.
The Internet Archive, for example, has been working on just such a gentle-on-books scanner setup using digital cameras.
But they are engineering the system to overcome the deficiencies mentioned above, resulting in a system that is better and cheaper for the high-volume, reasonably high-quality scanning they want to do (compared to the ultra-expensive $100,000 "page turning" commercial scanners they have been using). I'm hoping they will "open source" their engineering effort to share it with the world and allow other engineers to continue to improve upon it.
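To put rough numbers on the focus trade-off in point 1 above, here is a minimal Python sketch using the standard thin-lens depth-of-field formulas. The 10 mm focal length, 0.005 mm circle of confusion, 500 mm camera-to-page distance, and 1/60 s base exposure are illustrative assumptions, not measurements from any particular camera, and the function names are made up for this example.

```python
import math


def hyperfocal_mm(focal_mm, f_number, coc_mm):
    """Hyperfocal distance for a thin lens, in millimetres."""
    return focal_mm ** 2 / (f_number * coc_mm) + focal_mm


def depth_of_field_mm(focal_mm, f_number, subject_mm, coc_mm):
    """Distance between the near and far limits of acceptable focus."""
    h = hyperfocal_mm(focal_mm, f_number, coc_mm)
    if subject_mm >= h:
        # Beyond the hyperfocal distance everything out to infinity is sharp.
        return math.inf
    near = subject_mm * (h - focal_mm) / (h + subject_mm - 2 * focal_mm)
    far = subject_mm * (h - focal_mm) / (h - subject_mm)
    return far - near


def equivalent_shutter_s(base_shutter_s, base_f_number, new_f_number):
    """Exposure time giving the same exposure at a different f-stop.

    Light reaching the sensor falls with the square of the f-number,
    so the exposure time must grow by the same factor.
    """
    return base_shutter_s * (new_f_number / base_f_number) ** 2


# Assumed setup: small-sensor compact camera, page 50 cm from the lens.
FOCAL_MM = 10.0
COC_MM = 0.005
SUBJECT_MM = 500.0
BASE_F = 2.8
BASE_SHUTTER_S = 1 / 60

for f_number in (2.8, 4.0, 5.6, 8.0):
    dof = depth_of_field_mm(FOCAL_MM, f_number, SUBJECT_MM, COC_MM)
    shutter = equivalent_shutter_s(BASE_SHUTTER_S, BASE_F, f_number)
    print(f"f/{f_number}: depth of field {dof:.0f} mm, exposure {shutter:.3f} s")
```

Under these assumptions, stopping down from f/2.8 to f/8 roughly triples the zone of acceptable focus but calls for about eight times the exposure time (or eight times the light), which is the vibration and lighting cost described above.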
i think if we tested the quality of the o.c.r. recognition results, using the best-in-class o.c.r. app, we would find a significant difference between images from a 4-m.p. camera and the best-in-class scanners, such as the opticbook3600. and dollar-for-dollar scanners give better images than cameras, though i'm not disputing your assertion that cameras might be "less rough" on the books.
Not sure about your first point *if* the person using the digital camera does things right. But as noted above, getting good results from a digital camera for page scanning is not a trivial exercise, and it is out of reach of the average Joe who just wants to scan books and does not have a Ph.D. in photography or mechanical engineering. You are right that, dollar-for-dollar, and for more consistent and easier-to-obtain results, it is much better for the average Joe to use a scanner rather than a digital camera for book scanning.
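For a rough sense of the gap being argued over, the sketch below estimates the pixel density a camera delivers over a page and the sensor size needed to match a given scanner resolution. The 4:3 sensor aspect ratio, the 6 x 9 inch page, and the assumption that the page fills the frame exactly are all illustrative; a real setup loses more to margins, lens softness, and sensor interpolation.

```python
import math


def sensor_pixels(megapixels, aspect_w=4, aspect_h=3):
    """Approximate (width, height) in pixels for a sensor of the given size."""
    total = megapixels * 1_000_000
    height = math.sqrt(total * aspect_h / aspect_w)
    width = total / height
    return round(width), round(height)


def effective_dpi(megapixels, page_w_in, page_h_in):
    """Pixels per inch on the page when the page exactly fills the frame.

    The long side of the sensor is aligned with the long side of the page,
    and the tighter of the two directions sets the usable resolution.
    """
    w_px, h_px = sensor_pixels(megapixels)
    long_px, short_px = max(w_px, h_px), min(w_px, h_px)
    long_in, short_in = max(page_w_in, page_h_in), min(page_w_in, page_h_in)
    return min(long_px / long_in, short_px / short_in)


def megapixels_for_dpi(dpi, page_w_in, page_h_in):
    """Pixel count (in MP) a scan of the page contains at the given dpi."""
    return (dpi * page_w_in) * (dpi * page_h_in) / 1_000_000


# Assumed page size: 6 x 9 inches.
PAGE_W_IN, PAGE_H_IN = 6.0, 9.0

print(f"4 MP camera: about {effective_dpi(4, PAGE_W_IN, PAGE_H_IN):.0f} dpi")
for dpi in (300, 600):
    needed = megapixels_for_dpi(dpi, PAGE_W_IN, PAGE_H_IN)
    print(f"{dpi} dpi scan of the same page: {needed:.2f} MP")
```

Under these assumptions a 4 MP camera yields a little over 250 pixels per inch on a 6 x 9 inch page, while matching a 600 dpi scan of the same page would take about 19 MP.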
just because most of the proofers over at d.p. are not aware that the images they are getting are less-than-the-best doesn't mean it isn't so..
My focus on scanning goes beyond just OCR purposes -- I think that if substantial work is being expended to acquire and scan a book, it takes only a little extra effort to scan at archival quality, which is at least 600 dpi optical (and 256-level greyscale for bitonal material, or even better, 24-bit color). It is also wise to scan a color/greyscale calibration chart before each book is scanned, so that the images can be post-processed should the scanner calibration be somewhat off. I am saddened and frustrated when I see all this scanning activity of Public Domain materials going on, but being done haphazardly and with needless throttling down of the scan quality.
Jon Noring
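One simple way a scanned greyscale chart can be used in post-processing is a linear levels correction: measure the chart's black and white patches in the scan, then stretch every pixel so those patches land on their reference values. The sketch below assumes 8-bit greyscale values and made-up patch readings (20 and 235); actual reference values would come from the chart's documentation, and a full color calibration is more involved than this.

```python
def levels_from_chart(measured_black, measured_white, ref_black=0, ref_white=255):
    """Return (gain, offset) mapping measured patch values to reference values."""
    if measured_white <= measured_black:
        raise ValueError("white patch must read brighter than black patch")
    gain = (ref_white - ref_black) / (measured_white - measured_black)
    offset = ref_black - gain * measured_black
    return gain, offset


def correct_pixel(value, gain, offset):
    """Apply the linear correction and clamp to the 8-bit range."""
    return max(0, min(255, round(gain * value + offset)))


def correct_row(row, gain, offset):
    """Correct one row of greyscale pixel values."""
    return [correct_pixel(v, gain, offset) for v in row]


# Hypothetical readings taken from the chart scanned before the book.
gain, offset = levels_from_chart(measured_black=20, measured_white=235)

sample_row = [10, 20, 128, 235, 250]
print(correct_row(sample_row, gain, offset))
```

Values darker than the black patch or brighter than the white patch are clamped rather than wrapped, so the correction never produces out-of-range pixels.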