
Brandon wrote:
I for one am both a lurker on here AND a Slashdot reader =) How many terabytes do you think we'd need? Putting together a relatively cheap 3/4/5 TB NAS is fairly easy, considering the price of 300/400 GB SATA drives has been dropping steadily. This may be something we'd want to talk to iBiblio about, though, as they already have the infrastructure in place. No point in re-inventing the wheel.
Since we are talking primarily about pre-1923 public domain books, most of which are black and white, I'll restrict the analysis to those. Color substantially adds to disk space requirements. (Many of the books published in the 1923-1963 time frame, 90% of which are in the public domain, are also black and white.)

Ideally, we would scan the books at 600 dpi (optical), 8-bit greyscale, and store the images in a lossless compressed format such as PNG. The images should not pass through any lossy stage (such as JPEG) along the way, since that adds annoying artifacts.

Unfortunately, this results in some pretty large scans. Using the data I have from the "My Antonia" project, a typical 600 dpi (optical) greyscale page saved as PNG occupies about 4.5 megs. For a typical 300-page book, that works out to about 1.5 gigs per book (rounding up some to cover incidentals). A terabyte hard disk storage system (optimized for data warehousing, since optimizing for server use increases the hardware cost) would thus hold only about 700 books. That is not many when there are potentially several million public domain books out there (especially if we include the many public domain books in the 1923-1963 range).

What could be done in the next few years, until multi-terabyte hard disk data warehousing systems become dirt cheap, is to back up the lossless greyscale scans onto DVD-ROM (which, granted, is risky), or even press DVDs (this requires equipment -- maybe someone will donate access to their DVD presser?). Of course, we should donate copies of the DVDs to IA and to other groups (iBiblio?) and hope they will preserve them, eventually moving them to hard disk.

In the meanwhile, for public access and massive mirroring, we can convert the 600 dpi greyscale to 600 dpi bitonal (2-color black and white -- it is important to manually select the cutoff greyscale value for best quality). This saves a *lot* of space and will be *minimally* acceptable as archival copies should the original greyscale scans get lost or become unreadable. Using 2-color PNG, a typical page scrunches down to about 125 Kbytes, or about 40 Mbytes per book. (Using CCITT lossless compression, which is optimized for bitonal scans of text, it is possible to get a page down to about 60 Kbytes -- but that is an obscure format. All web browsers will display PNG, whereas a CCITT TIFF requires a plugin or a special graphics program. There may also be some proprietary problems with CCITT.) This way we can store about 25,000 books on a terabyte server, which is very doable and should be sufficient for Distributed Scanners (or a similar project) for a few years. In the meanwhile, disk space should keep getting cheaper, to the point where we might even begin migrating the biggie-size greyscale scans from DVD or other storage back to mirrored hard disk servers.

Some of my thinking -- no doubt there are other approaches to consider. Should I start a "Distributed Scanners" discussion group at Yahoo? It seems like there may be enough people interested in this project.

Jon
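
P.S. For anyone who wants to check or tweak the arithmetic, here is a rough back-of-the-envelope sketch in Python. The page sizes and book length are the figures quoted above; the decimal definition of a terabyte (1 TB = 1,000,000 MB) and the rounded per-book sizes are my own assumptions.

# Back-of-the-envelope storage estimates, using the figures above.
# Assumption on my part: decimal units (1 TB = 1,000,000 MB).

PAGES_PER_BOOK = 300            # typical book length
GREY_MB_PER_PAGE = 4.5          # 600 dpi 8-bit greyscale PNG ("My Antonia" data)
BITONAL_KB_PER_PAGE = 125       # 600 dpi 2-color (bitonal) PNG
TB_IN_MB = 1_000_000            # decimal terabyte (assumption)

grey_gb_per_book = PAGES_PER_BOOK * GREY_MB_PER_PAGE / 1000       # ~1.35 GB
bitonal_mb_per_book = PAGES_PER_BOOK * BITONAL_KB_PER_PAGE / 1000 # ~37.5 MB

# Round up to the per-book figures used above (1.5 GB and 40 MB).
books_per_tb_grey = TB_IN_MB / (1.5 * 1000)   # ~667, i.e. roughly 700 books
books_per_tb_bitonal = TB_IN_MB / 40          # 25,000 books

print("greyscale: %.2f GB/book, ~%d books per terabyte"
      % (grey_gb_per_book, books_per_tb_grey))
print("bitonal:   %.1f MB/book, ~%d books per terabyte"
      % (bitonal_mb_per_book, books_per_tb_bitonal))

Running it gives about 1.35 GB per greyscale book (which I round up to 1.5 GB) and 37.5 MB per bitonal book (rounded to 40 MB), hence the roughly 700 and 25,000 books-per-terabyte figures.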