Re: Enlightened Self Interest

Hello. The master format should be the digitized images of the original book pages. No font, footnote, or math problems; no problems in readability or in representing the original text.

I find the digitized images more pleasant than any ASCII, HTML, Word, or TeX text. I don't know the reason, but perhaps the art of typesetting and printing was better then than it is now! Any other format can be generated from the digitized images. If some conversion between HTML and TeX (say) does not go well, one can always check against the original typesetting from the images.

So, keep archiving the digitized images!! 200 dpi with 32 grey levels starts looking OK, but 300 dpi with 256 levels should be enough even for math texts. Forget 1-bit digitizations completely!!!

Best regards,
Juhana

Juhana wrote:
Hello. The master format should be the digitized images of the original book pages. No font, footnote, or math problems; no problems in readability or in representing the original text.
I find the digitized images more pleasant than any ASCII, HTML, Word, or TeX text. I don't know the reason, but perhaps the art of typesetting and printing was better then than it is now!...
So, keep archiving the digitized images!! 200 dpi with 32 grey levels starts looking OK, but 300 dpi with 256 levels should be enough even for math texts. Forget 1-bit digitizations completely!!!
If the only purpose of scanning books is for OCRing, whereupon the scans are either dumped or saved simply for "proving" provenance, then 300 dpi is *usually* sufficient: 8-bit greyscale for black and white, and 24-bit color for color pages. (If some type is very small, such as 5 point and less, then 600 dpi is usually required.)

However, based on my consultations with experts in the field and personal experimentation (My Antonia at http://www.openreader.org/myantonia/ ), if the scans are to be used for multiple purposes besides OCR, such as direct reading and other uses where sharpness is aesthetically important, then it is recommended to scan at 600 dpi (optical) -- and 1200 dpi (optical) if the print is *very* small. Unfortunately, the resulting scan images become quite large (unless one uses lossy compression, such as DjVu, which is not recommended for master archiving but is alright for end-user delivery). But if a job is worth doing, it is worth doing right.

If there is one area in which DP seems to fall short (let me know if I'm wrong here), it is page scan resolution and archiving (or the lack thereof). That is understandable considering the disk space and bandwidth requirements (to move the scans around), but IA is a place to donate page scans once proofing is done (maybe this is already being done), and I'm sure others can be found who will gladly set up a terabyte storage box to store DP's 600 dpi page scans -- just post a plea to Slashdot and there will probably be several volunteers who step forward with spare terabytes available.

Btw, if anyone here has made, or plans to make, 600 dpi (optical) greyscale or color scans of any public domain books, including the book covers (and this includes books printed between 1923 and 1963 which may be public domain), I'll gladly accept donations of them on CD-ROM and DVD-ROM. I will also gladly accept the source books themselves, even if they've been chopped. I will eventually build a multi-terabyte hard disk storage system to support various activities, including Distributed Scanners. Of course, the scans should be donated to IA as well so they can immediately be made available to the world.

Jon Noring
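For concreteness, here is a back-of-the-envelope sketch of the raw sizes involved. The 6 x 9 inch page dimensions and the roughly 4:1 lossless PNG compression ratio are assumptions; only the ~4.5 MB-per-page figure for 600 dpi greyscale comes from the My Antonia data mentioned above.

```python
# Rough scan-size estimator. Page dimensions and the PNG
# compression ratio are assumptions, not measured values.
PAGE_W_IN, PAGE_H_IN = 6.0, 9.0  # assumed trim size of a typical book page

def raw_size_mb(dpi, bytes_per_pixel):
    """Uncompressed image size in megabytes at a given resolution."""
    pixels = (PAGE_W_IN * dpi) * (PAGE_H_IN * dpi)
    return pixels * bytes_per_pixel / 1e6

for dpi in (300, 600, 1200):
    grey = raw_size_mb(dpi, 1)    # 8-bit greyscale
    color = raw_size_mb(dpi, 3)   # 24-bit color
    print(f"{dpi} dpi: {grey:6.1f} MB grey, {color:7.1f} MB color (raw)")

# At 600 dpi greyscale a page is ~19.4 MB raw; lossless PNG at
# roughly 4:1 lands near the ~4.5 MB/page reported for My Antonia.
```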

I for one am both a lurker on here AND a Slashdot reader =) How many terabytes do you think we'd need? Putting together a relatively cheap 3/4/5 TB NAS is fairly easy considering the price of 300/400 GB SATA drives has been dropping steadily. This may be something we'd want to talk to iBiblio about, though, as they already have the infrastructure in place. No point in re-inventing the wheel.

-brandon

Brandon wrote:
I for one am both a lurker on here AND a Slashdot reader =) How many terabytes do you think we'd need? Putting together a relatively cheap 3/4/5 TB NAS is fairly easy considering the price of 300/400 GB SATA drives has been dropping steadily. This may be something we'd want to talk to iBiblio about, though, as they already have the infrastructure in place. No point in re-inventing the wheel.
Since we are talking primarily about pre-1923 public domain books, most of them black and white, I'll restrict the analysis to those books. Color substantially adds to disk space requirements. (Also, many of the books published in the 1923-63 time frame, 90% of which are in the public domain, are black and white.)

Ideally, we would like to scan the books at 600 dpi (optical), 8-bit greyscale, and store the images in some lossless compressed format (such as PNG). The images should not have gone through any lossy stage, such as JPEG, to get to this point, since that adds annoying artifacts. Unfortunately, this results in some pretty large scans. Using the data I have for the "My Antonia" project, a typical 600 dpi (optical) greyscale page saved as PNG occupies about 4.5 megs. So for a typical 300-page book, this works out to about 1.5 gigs per book (rounding up some to cover incidentals). A terabyte hard disk storage system (optimized for data warehousing, since optimizing for server use increases the hardware cost) would thus hold about 700 books. That is not many when there are potentially several million public domain books out there (especially if we include the many public domain books in the 1923-1963 range).

What could be done in the next few years, until multi-terabyte hard disk data warehousing systems become dirt cheap, is to back up the lossless greyscale scans onto DVD-ROM (which, granted, is risky), or even press DVDs (this requires equipment -- maybe someone will donate access to their DVD presser?). Of course, we should donate copies of the DVDs to IA and to other groups (iBiblio?) and hope they will preserve them, even moving them to hard disk.

In the meanwhile, for public access and massive mirroring, we can convert the 600 dpi greyscale to 600 dpi bitonal (2-color black and white -- it is important to manually select the cutoff greyscale value for best quality). This will save a *lot* of space and will be *minimally* acceptable as archival copies should the original greyscale scans get lost or become unreadable. Using 2-color PNG, a typical page now scrunches down to about 125 Kbytes, or about 40 Mbytes per book. (Using CCITT lossless compression, which is optimized for bitonal scans of text, it is possible to get the size down to about 60 Kbytes per page -- but this is an obscure format: all web browsers will display PNG, while CCITT TIFFs require a plugin or a special graphics program. There may also be some proprietary problems with CCITT.) This way we can store about 25,000 books on a terabyte server, which is very doable and will be sufficient for Distributed Scanners (or a similar project) for a few years. (In the meanwhile, disk space should continue to get cheaper and cheaper, to the point where we might even begin migrating the biggie-size greyscale scans stored on DVD or other media back to mirrored hard disk servers.)

Some of my thinking -- no doubt there are other approaches to consider. Should I start a "Distributed Scanners" discussion group at Yahoo? It seems like there may be enough people interested in this project.

Jon
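A minimal sketch of the greyscale-to-bitonal step described above, with a manually chosen cutoff; the use of Pillow, the threshold value, and the file names are illustrative assumptions rather than anything the thread specifies.

```python
# Greyscale -> bitonal conversion with a manually selected cutoff.
# Pillow, the threshold of 128, and the file names are assumptions.
from PIL import Image

THRESHOLD = 128  # choose per book by inspecting sample pages

def to_bitonal(src_path, dst_path, threshold=THRESHOLD):
    """Map every pixel at or above the cutoff to white, the rest to black."""
    grey = Image.open(src_path).convert("L")  # 8-bit greyscale
    bw = grey.point(lambda p: 255 if p >= threshold else 0).convert("1")
    bw.save(dst_path, optimize=True)          # 1-bit PNG

to_bitonal("page_0042_600dpi.png", "page_0042_bitonal.png")

# Storage check against the figures above: 300 pages x ~125 KB
# bitonal is ~40 MB/book, so a 1 TB server holds ~25,000 books,
# versus ~700 books at ~1.5 GB each for 600 dpi greyscale.
```

Inspecting a few sample pages per book before fixing the threshold helps avoid filled-in letterforms (cutoff too high) or broken strokes (cutoff too low), which is presumably why manual selection matters for quality.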
participants (3)
- Brandon Galbraith
- Jon Noring
- Juhana Sadeharju