
Banko wrote:
Juliet Sutherland wrote:
Making page images available both to the general public and for checking on reported errata will happen eventually. We are not erasing or losing any of the basic material. But I'm sure it won't happen nearly as quickly as Jon Noring would like. ;-)
I think even an incomplete archive would already be an asset. For a lot of projects, page scans simply exist. We know that, because we use them for proofing. :-)
Also, if somebody set up such an incomplete archive, they could document the problems they ran into. So I think Jon should go ahead and try and set something up.
After reading the various replies, it is clear there is more interest in preserving and (sooner if not later) making the scans available to the public online than I originally perceived based on the statements of a few here. It is welcome to see the majority, including Greg and Juliet, come out in support of scan preservation and availability.

Now, with about 15 minutes of thought (so we're at a very early stage of conceptualization), here are some of the issues and unknowns as I see them -- in no particular order -- a sort of stream of consciousness. Feel free to comment, counter-point, and to add more items.

1) There are scan sets all over the place, in various resolutions, color depths, quality, and completeness. (The variation in file size between sets will be significant.)

2) Some to many scan sets comprise various derivative subsets (e.g., the original scans, a cleaned up set, etc.). Question: If a scan set contains derivatives of some sort, do we restrict what will be saved?

3) Many if not most scan sets have no external metadata. Thus their identification and source of metadata will be internal (title page info, etc.).

4) Some scan sets we acquire may have license encumbrances not allowing them to be accessible to the general public. (From a conversation with Juliet a while back.) This will impact the design of the scan repository(ies).

5) There is likely to be interest by some outside of PG/DP in contributing scans of public domain texts to the raw archive. Should we allow this? If so, keeping this straight is necessary. More metadata. (I note this thinking specifically of David Reed here in Utah who is scanning large numbers of books and doing most of the conversion himself -- I'm not sure if all the texts he is doing are being submitted to PG, but since he associates with PG do we disallow him from adding his scans to the mash?)

6) We have to worry about scan sets of works still under copyright.
(And there's the related difference between life+ countries and the fixed 1923 date in the U.S., ignoring the renewal aspects. Will the repository be used for PG-affiliated projects world-wide?)

7) When the scans are stored, do we zip each set/subset up, or keep each page scan separate?

8) What are our space requirements, for now and in the future? (This probably requires a survey, plus we have to decide whether we save all derivative subsets, save the original scans or a cleaned up set, or whatever.)

9) Do we consider conversion of all scans to DjVu? (If we ask Brewster for space, he may suggest this in order to cut down the size of the scans.)

10) Those possessing scan sets will have their own preferences for how they are submitted. Some will want to upload them, others will want to send them via CD/DVD-ROM. If some scan sets comprise a few gigs, it may make more sense to burn them on optical disk and send them. This now requires us to possibly save what is sent. It also requires someone to accept the disks and transfer the data to the repository. Who will do this depends upon various factors.

11) Access to the repository(ies). Only PG/DP volunteers, or the general public? The answer depends upon the structure of the repository(ies), who is hosting it, legal/technical issues, etc.

I could go on, but the above items are a good start at some of the issues/decisions that need to be considered/resolved before bulling ahead with establishing the repository(ies). Obviously a lot of the design of the system has to integrate with how things are now done in DP with respect to handling/processing scans, which I don't have a good handle on at present.

And should we consider a "two repository" model? The first repository, the one started first, will simply be a "centralized dumping ground" for the scan sets (and derivatives) which are produced as part of PG/DP activities. Access to it will be limited to the PG/DP volunteers.
It has yet to be determined whether it should only include scans for finished projects, or also scans for ongoing projects and scans donated by those outside of PG.

The second repository could come later, where a group of volunteers would sort through the pile in the first repository, check for copyright/license aspects and completeness, sort the pages properly, then maybe convert the preferred derivative set to DjVu, PDF, and/or other more compact formats. Anyone who wants any of the stuff in the first repository can request it.

And finally, since DP is the major player, should we move this design discussion to the relevant forum at DP? Or keep it here on gutvol-d?

Jon