
Brandon wrote:
I for one am both a lurker on here AND a Slashdot reader =) How many terabytes do you think we'd need? Putting together a relatively cheap 3/4/5 TB NAS is fairly easy, considering the price of 300/400 GB SATA drives has been dropping steadily. This may be something we'd want to talk to iBiblio about, though, as they already have the infrastructure in place. No point in re-inventing the wheel.
Since we are talking primarily about pre-1923 public domain books, most of which are black and white, I'll restrict the analysis to those. Color substantially adds to disk space requirements. (Many of the books published in the 1923-1963 time frame, 90% of which are in the public domain, are also black and white.)

Ideally, we would scan the books at 600 dpi (optical), 8-bit greyscale, and store the images in a lossless compressed format such as PNG. The images should not pass through any lossy stage (such as JPEG) along the way, since that adds annoying artifacts.

Unfortunately, this results in some pretty large scans. Using the data I have from the "My Antonia" project, a typical 600 dpi (optical) greyscale page saved as PNG occupies about 4.5 megs. For a typical 300-page book, that works out to about 1.5 gigs per book (rounding up some to cover incidentals). A terabyte hard disk storage system (optimized for data warehousing, since optimizing for server use increases the hardware cost) would thus hold only about 700 books. That is not many when there are potentially several million public domain books out there (especially if we include the many public domain books in the 1923-1963 range).

What could be done in the next few years, until multi-terabyte hard disk data warehousing systems become dirt cheap, is to back up the lossless greyscale scans onto DVD-ROM (which, granted, is risky), or even press DVDs (this requires equipment -- maybe someone will donate access to their DVD presser?). Of course, we should donate copies of the DVDs to IA and to other groups (iBiblio?) and hope they will preserve them, eventually moving them to hard disk.

In the meanwhile, for public access and massive mirroring, we can convert the 600 dpi greyscale to 600 dpi bitonal (2-color black and white -- it is important to manually select the cutoff greyscale value for best quality). This saves a *lot* of space and will be *minimally* acceptable as archival copies should the original greyscale scans get lost or become unreadable. Using 2-color PNG, a typical page scrunches down to about 125 Kbytes, or about 40 Mbytes per book. (Using CCITT lossless compression, which is optimized for bitonal scans of text, it is possible to get a page down to about 60 Kbytes -- but that is an obscure format. All web browsers will display PNG, whereas a CCITT TIFF requires a plugin or a special graphics program. There may also be some proprietary problems with CCITT.) This way we can store about 25,000 books on a terabyte server, which is very doable and should be sufficient for Distributed Scanners (or a similar project) for a few years. In the meanwhile, disk space should keep getting cheaper, to the point where we might even begin migrating the biggie-size greyscale scans from DVD or other storage back to mirrored hard disk servers.

Some of my thinking -- no doubt there are other approaches to consider. Should I start a "Distributed Scanners" discussion group at Yahoo? It seems like there may be enough people interested in this project.

Jon
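
P.S. For anyone who wants to check or tweak the arithmetic, here is a rough back-of-the-envelope sketch in Python. The page sizes and book length are the figures quoted above; the decimal definition of a terabyte (1 TB = 1,000,000 MB) and the rounded per-book sizes are my own assumptions.

# Back-of-the-envelope storage estimates, using the figures above.
# Assumption on my part: decimal units (1 TB = 1,000,000 MB).

PAGES_PER_BOOK = 300            # typical book length
GREY_MB_PER_PAGE = 4.5          # 600 dpi 8-bit greyscale PNG ("My Antonia" data)
BITONAL_KB_PER_PAGE = 125       # 600 dpi 2-color (bitonal) PNG
TB_IN_MB = 1_000_000            # decimal terabyte (assumption)

grey_gb_per_book = PAGES_PER_BOOK * GREY_MB_PER_PAGE / 1000       # ~1.35 GB
bitonal_mb_per_book = PAGES_PER_BOOK * BITONAL_KB_PER_PAGE / 1000 # ~37.5 MB

# Round up to the per-book figures used above (1.5 GB and 40 MB).
books_per_tb_grey = TB_IN_MB / (1.5 * 1000)   # ~667, i.e. roughly 700 books
books_per_tb_bitonal = TB_IN_MB / 40          # 25,000 books

print("greyscale: %.2f GB/book, ~%d books per terabyte"
      % (grey_gb_per_book, books_per_tb_grey))
print("bitonal:   %.1f MB/book, ~%d books per terabyte"
      % (bitonal_mb_per_book, books_per_tb_bitonal))

Running it gives about 1.35 GB per greyscale book (which I round up to 1.5 GB) and 37.5 MB per bitonal book (rounded to 40 MB), hence the roughly 700 and 25,000 books-per-terabyte figures.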