
On 15 Jul 2005, at 11:58, Jon Noring wrote:
After reading the various replies, it is clear there is more interest in preserving and (sooner if not later) making the scans available to the public online than I originally perceived based on the statements of a few here. It is welcome to see the majority, including Greg and Juliet, come out in support of scan preservation and availability.
Now, with about 15 minutes of thought (so we're at a very early stage of conceptualization),
Very early stage of conceptualization? Yesterday you had a plan for implementation. What is wrong with the half-hour plan you had yesterday? Do not sink the momentum you have got in the morass of overplanning.
Having scans available is handy for:
- The PG errata folks who do not want to accidentally "fix" the wrong thing.
- Publishers who want to make richer versions of our texts based on the original layout.
- Gatekeepers who claim references cannot be made without looking at a page number (they will--of course--invent a new claim as to why PG etexts are unusable as soon as we provide them with the means to name page numbers).
For these purposes, a simple system that links a PG etext with its page scans suffices.
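To give an idea of how little such a linking system needs, here is a minimal sketch in Python. It assumes a purely hypothetical layout in which each etext has one directory, named after its PG etext number, holding one image file per page; the function name and the layout are illustrative, not anything PG or DP actually uses. Note that the numbers it produces are scan sequence numbers, which need not match the page numbers printed in the book.

```python
from pathlib import Path


def page_scan_index(scan_root, etext_number):
    """Map scan sequence numbers (starting at 1) to the scan files of one etext.

    Assumed (hypothetical) layout: <scan_root>/<etext_number>/<one file per page>.
    """
    etext_dir = Path(scan_root) / str(etext_number)
    # Sort by file name so that 001.png, 002.png, ... come out in page order.
    pages = sorted(p for p in etext_dir.iterdir() if p.is_file())
    return {number: page for number, page in enumerate(pages, start=1)}
```

Looking up `page_scan_index(root, 12345)[7]` would then give the seventh scan of that etext, which is all an errata checker needs in order to compare text against image.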
here's some of the issues and unknowns as I see them -- in no particular order -- a sort of stream of consciousness. Feel free to comment, counter-point, and to add more items.
1) There are scan sets all over the place, in various resolutions, color depths, quality, and completeness. (The variation in file size between sets will be significant.)
Yes and no. For the purposes outlined above, the scan set that was used to create DP's OCR from will suffice.
2) Some to many scan sets comprise various derivative subsets (e.g., the original scans, a cleaned up set, etc.)
Question: If a scan set contains derivatives of some sort, do we restrict what will be saved?
No. Why?
3) Many if not most scan sets have no external metadata. Thus their identification and source of metadata will be internal (title page info, etc.)
4) Some scan sets we acquire may have license encumbrances not allowing them to be accessible to the general public. (From a conversation with Juliet a while back.) This will impact upon the design of the scan repository(ies).
Not quite. The problem with some scan sets is that their sources claim some kind of ownership. I believe that PG believes that mere scans of public domain material are themselves in the public domain. However, there are many ways in which these sources could make things difficult for us (for instance, putting scans behind a login, so that we cannot get at them), so DP plays nice.
5) There is likely to be interest by some outside of PG/DP to contribute scans of public domain texts to the raw archive. Should we allow this?
I am not sure if PG wants to run a scan archive. Assume that this is going to take place outside of (but with the support of) PG. If this were a PG project, PG might require copyright clearances. But assuming it is not, you will probably have to work something out with the hosting folks so that they cannot be sued. Purely for practical reasons I think outside scans should be banned; if somebody wants to contribute scans, they can contribute them to DP, after which the scans will automatically trickle down to the scan archive. Bonus: we will have an accessible etext.
If so, keeping this straight is necessary. More metadata. (I note this thinking specifically of David Reed here in Utah who is scanning large numbers of books and doing most of the conversion himself -- I'm not sure if all the texts he is doing are being submitted to PG, but since he associates with PG do we disallow him from adding his scans to the mash?)
The great thing about metadata is that it can be added afterwards by those who actually care about that sort of thing.
6) We have to worry about scan sets of works still under copyright. (And there's the related difference in life+ countries versus the fixed 1923 date in the U.S., ignoring the renewal aspects. Will the repository be used for PG-affiliated projects world-wide?)
I like PG's philosophy, which is that PG itself does not worry about the rest of the world; let the rest of the world worry. There are no copyright concerns. A minor issue might be copyrighted texts in PG; but do we typically have the scans of those?
7) When the scans are stored, do we zip each set/subset up, or keep each page scan separate?
Separate at first. Software that will zip it up is trivial to write and add.
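As a rough illustration of how trivial, a minimal sketch in Python follows. It assumes one directory per scan set with one file per page; the function name is made up for the example. It stores the files without recompressing them, on the assumption that the scans are already compressed image formats.

```python
import zipfile
from pathlib import Path


def zip_scan_set(scan_dir, archive_path):
    """Bundle every file in scan_dir into one zip archive; return the stored names."""
    scan_dir = Path(scan_dir)
    # Collect and sort the pages first so they are stored in a predictable order.
    pages = sorted(p for p in scan_dir.iterdir() if p.is_file())
    # ZIP_STORED skips recompression (assumption: the images are already compressed).
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for page in pages:
            # Store only the file name, not the full path on the server.
            archive.write(page, arcname=page.name)
    return [page.name for page in pages]
```

The separate page files stay where they are, so the zip can be generated on demand or thrown away and rebuilt at any time.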
8) What are our space requirements, for now and in the future? (This probably requires a survey, plus we have to decide whether we save all derivative subsets, save the original scans or a cleaned up set, or whatever.)
I would first talk to Brewster Kahle. Mentioning space requirements to him is probably as useless as discussing snowballs with planets.
9) Do we consider conversion of all scans to DjVu? (If we ask Brewster for space, he may suggest this in order to cut down the size of the scans.)
First build the repository, then see how you can add value. Reminds me of something I keep hearing: apparently a lot of people won't use PG etexts, because these texts have errors in them. Funny how none of these people has bothered to report these errors to PG. The thing is, people who have a real need for extras will probably come to us, and may even contribute what is necessary to let these extras become a reality. People who merely have a perceived need will just whine that we do not fulfil that need.
10) Those possessing scan sets will have their own preferences for how they are submitted. Some will want to upload them, others will want to send them via CD/DVD-ROM. If some scan sets comprise a few gigs, it may make more sense to burn them on optical disk and send them. This now requires us to possibly save what is sent. It also requires someone to accept the disks and transfer the data to the repository. Who will do this depends upon various factors.
Again, I would outright ban external scan sets.
11) Access to the repository(ies). Only PG/DP volunteers, or the general public? The answer depends upon the structure of the repository(ies), who is hosting it, legal/technical issues, etc.
General public.
I could go on, but the above items are a good start at some of the issues/decisions that need to be considered/resolved before bulling ahead with establishing the repository(ies). Obviously a lot of the design of the system has to integrate with how things are now done in DP with respect to handling/processing scans, which I don't have a good handle on at the present. And should we consider a "two repository" model? The first repository, which will be the first one started, will simply be a "centralized dumping ground" for the scan sets (and derivatives) which are produced as part of PG/DP activities. Access to it will be limited to the PG/DP volunteers.
Why? DP already has a scan repository accessible to the volunteers; the only snag being that only scans of running projects are available.
--
branko collin
collin@xs4all.nl