
--- Jon Noring <jon@noring.name> wrote:
1) There are scan sets all over the place, in various resolutions, color depths, quality, and completeness. (The variation in file size between sets will be significant.)
Take them as they are, or request certain standards for submitters. I'd lean towards the former for now, even though it might mean work later on.
2) Some to many scan sets comprise various derivative subsets (e.g., the original scans, a cleaned up set, etc.)
Every processing operation loses information. So, I'd suggest keeping original scans for archival and fully cleaned scans for viewing (I usually keep both around, although if the cleaning is entirely expressed in a series of shell scripts, with no human intervention, I'll just keep the originals and regenerate the cleans when necessary).
3) Many if not most scan sets have no external metadata.
I don't see this as a huge issue. I try to have PNG numbers match page numbers, but sometimes this causes problems (DP is fine with it; guiguts less so in some cases). You can leave it as a big blob for people to rifle through (like paper) or set up a simple workflow for capturing it. I'd say KISS for now, though. (Bonus: some of the metadata-capturing code might be useful to DP in the future, if handled carefully and with luck ;) )
4) Some scan sets we acquire may have license encumbrances not allowing them to be accessible to the general public.
I'd say leave 'em out.
5) There is likely to be interest by some outside of PG/DP to contribute scans of public domain texts to the raw archive.
Sure, why not? As long as we're allowed to use them later and produce PG texts as well ^^ Another way to attract content.
7) When the scans are stored, do we zip each set/subset up, or keep each page scan separate?
A simple zip of all scans would be useful, easy, and eliminate a lot of the metadata requirement. One can always add code for online viewing later.
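To make the "simple zip" idea concrete, here is a minimal sketch in Python. The function name `zip_scan_set` and the directory layout (one folder per scan set, one image file per page) are assumptions for illustration, not anything the archive has settled on. It stores the files without recompressing them, on the assumption that PNG and TIFF scans are already compressed and gain little from a second pass.

```python
import zipfile
from pathlib import Path


def zip_scan_set(scan_dir, zip_path):
    """Bundle every file in scan_dir into one zip, in sorted (page) order.

    Hypothetical helper: assumes one flat folder per scan set.
    Returns the list of file names written to the archive.
    """
    scan_dir = Path(scan_dir)
    # Sort so that 001.png, 002.png, ... land in the archive in page order.
    pages = sorted(p for p in scan_dir.iterdir() if p.is_file())
    # ZIP_STORED: no recompression, since the images are assumed to be
    # compressed already.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for page in pages:
            # Store only the file name so the zip unpacks into a flat folder.
            archive.write(page, arcname=page.name)
    return [p.name for p in pages]
```

If file names follow page numbers, the sorted listing inside the zip doubles as a rough page index, which is part of why a plain zip cuts down on the metadata needed.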
8) What are our space requirements, for now and in the future?
You know how to figure the amount of firewood you need for the night, right? Think of the fire burning. Think of the maximum amount of wood you could possibly need. Then gather triple that amount. (As a guess: at my old job we figured 100 dpi, legal-size, B&W TIFFs at about 100k a page. My 300 dpi scans of History of a Lie are 4.7MB for two pages as an 8-bit PNG at compression level 9. Somewhere in between those two :) )
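The firewood rule is easy to turn into arithmetic. A rough sketch follows, where the function name, the 1,000-page example, and reading "100k" as 100,000 bytes are all assumptions for illustration; only the two per-page figures and the factor of three come from the numbers above.

```python
def estimate_storage_bytes(pages, bytes_per_page, safety_factor=3):
    """Firewood rule: work out the most you could need, then triple it."""
    return pages * bytes_per_page * safety_factor


# Per-page sizes taken from the two data points above.
LOW_BYTES_PER_PAGE = 100_000          # 100 dpi B&W TIFF, "about 100k a page"
HIGH_BYTES_PER_PAGE = 4_700_000 // 2  # 300 dpi PNG, 4.7MB for two pages

# Hypothetical workload: 1,000 pages, chosen only to show the spread.
low = estimate_storage_bytes(1_000, LOW_BYTES_PER_PAGE)
high = estimate_storage_bytes(1_000, HIGH_BYTES_PER_PAGE)
print(f"{low / 1e9:.2f} GB to {high / 1e9:.2f} GB per 1,000 pages")
```

The spread between the two ends is more than a factor of twenty, so the real answer depends heavily on what resolution and color depth the submitted sets turn out to have.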
9) Do we consider conversion of all scans to DjVu?
No proprietary formats, please. It makes little sense to have a long-term repository dependent on the whim of a particular company.
10) Those possessing scan sets will have their own preferences for how they are submitted.
Get one way working first, and people can work around it as necessary until there are other ways. I'd be happy to spend some upload bandwidth if necessary, for example. (For some people in Europe, shipping a DVD may be cheaper than paying for the connect time, right?)
11) Access to the repository(ies). Only PG/DP volunteers, or the general public?
Ultimately I think it should be general public. How you want to start probably depends on who's providing the resources and support. Good luck, Jon; I'd like to see this happen.