[gutvol-d] Initial thoughts on a PG/DP scan repository

15 Jul 2005

      Banko wrote:
...
Juliet Sutherland wrote:
...
...
Making page images available both to the general public and for
checking on reported errata will happen eventually. We are not erasing
or losing any of the basic material. But I'm sure it won't happen
nearly as quickly as Jon Noring would like. ;-)
...
I think even an incomplete archive would already be an asset. For a 
lot of projects, page scans simply exist. We know that, because we use
them for proofing. :-)
Also, if somebody set up such an incomplete archive, they could 
document the problems they ran into. So I think Jon should go ahead 
and try and set something up.

After reading the various replies, it is clear there is more interest
in preserving and (sooner if not later) making the scans available to
the public online than I originally perceived based on the statements
of a few here. It is welcome to see the majority, including Greg and
Juliet, come out in support of scan preservation and availability.

Now, with about 15 minutes of thought (so we're at a very early stage
of conceptualization), here are some of the issues and unknowns as I see
them -- in no particular order -- a sort of stream of consciousness.
Feel free to comment, counter-point, and to add more items.

1) There are scan sets all over the place, in various resolutions,
   color depths, quality, and completeness. (The variation in file
   size between sets will be significant.)

2) Some to many scan sets comprise various derivative subsets (e.g.,
   the original scans, a cleaned up set, etc.)

   Question: If a scan set contains derivatives of some sort, do we
             restrict what will be saved?

3) Many if not most scan sets have no external metadata. Thus their
   identification and source of metadata will be internal (title page
   info, etc.)

4) Some scan sets we acquire may have license encumbrances not
   allowing them to be accessible to the general public. (From a
   conversation with Juliet a while back.) This will impact upon
   the design of the scan repository(ies).

5) There is likely to be interest by some outside of PG/DP to
   contribute scans of public domain texts to the raw archive. Should
   we allow this? If so, we need to keep track of where each set came
   from, which means more metadata. (I am thinking specifically of
   David Reed here
   in Utah who is scanning large numbers of books and doing most of
   the conversion himself -- I'm not sure if all the texts he is
   doing are being submitted to PG, but since he associates with
   PG do we disallow him from adding his scans to the mash?)

6) We have to worry about scan sets of works still under copyright.
   (And there's the related difference in life+ countries versus the
   fixed 1923 date in the U.S., ignoring the renewal aspects. Will
   the repository be used for PG-affiliated projects world-wide?)

7) When the scans are stored, do we zip each set/subset up, or keep
   each page scan separate?

8) What are our space requirements, for now and in the future?
   (This probably requires a survey, plus we have to decide whether
   we save all derivative subsets, save the original scans or a
   cleaned up set, or whatever.)

9) Do we consider conversion of all scans to DjVu? (If we ask
   Brewster for space, he may suggest this in order to cut down
   the size of the scans.)

10) Those possessing scan sets will have their own preferences for
    how they are submitted. Some will want to upload them, others
    will want to send them via CD/DVD-ROM. If some scan sets
    comprise a few gigs, it may make more sense to burn them on
    optical disk and send them. This now requires us to possibly
    save what is sent. It also requires someone to accept the disks
    and transfer the data to the repository. Who will do this
    depends upon various factors.

11) Access to the repository(ies). Only PG/DP volunteers, or the
    general public? The answer depends upon the structure of the
    repository(ies), who is hosting it, legal/technical issues,
    etc.
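
To make items 1 through 8 a bit more concrete, here is a rough sketch
(in Python) of what a per-scan-set record might look like, and how a
space survey could be totalled from such records. Every name in it
(ScanSet, ScanSubset, the field names, the "original"/"cleaned" labels)
is made up for illustration, and the sizes are invented numbers. It is
not anything PG or DP uses today.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ScanSubset:
    """One subset of a scan set, e.g. the original scans or a cleaned-up set."""
    kind: str                        # hypothetical labels: "original", "cleaned"
    page_count: int
    total_bytes: int
    dpi: Optional[int] = None        # resolution, if known (item 1)
    bit_depth: Optional[int] = None  # color depth, if known (item 1)


@dataclass
class ScanSet:
    """Hypothetical record describing one scan set."""
    title: str       # often only available from the title page (item 3)
    source: str      # who contributed it: "DP", "PG", an outside scanner (item 5)
    public_ok: bool  # False if a license restricts public access (item 4)
    subsets: List[ScanSubset] = field(default_factory=list)

    def size_bytes(self, kinds=None) -> int:
        """Total size, optionally counting only subsets whose kind is in `kinds`."""
        return sum(s.total_bytes for s in self.subsets
                   if kinds is None or s.kind in kinds)


def survey_bytes(scan_sets, kinds=None) -> int:
    """Add up the space needed for a collection of scan sets (item 8)."""
    return sum(s.size_bytes(kinds) for s in scan_sets)


# Invented example data, only to show the shape of a record.
example = ScanSet(
    title="(title taken from the title page)",
    source="DP",
    public_ok=True,
    subsets=[
        ScanSubset("original", page_count=300, total_bytes=900_000_000,
                   dpi=600, bit_depth=24),
        ScanSubset("cleaned", page_count=300, total_bytes=60_000_000,
                   dpi=300, bit_depth=1),
    ],
)

print(survey_bytes([example]))                      # everything
print(survey_bytes([example], kinds={"original"}))  # originals only
```

The point of the `kinds` argument is that the space estimate in item 8
changes a lot depending on how we answer the question in item 2, so a
survey should be able to report both numbers.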

I could go on, but the above items are a good start at some of the
issues/decisions that need to be considered/resolved before bulling
ahead with establishing the repository(ies). Obviously a lot of the
design of the system has to integrate with how things are now done
in DP with respect to handling/processing scans, which I don't have a
good handle on at present.

And should we consider a "two repository" model? The first repository,
which will be the first one started, will simply be a "centralized
dumping ground" for the scan sets (and derivatives) which are produced
as part of PG/DP activities. Access to it will be limited to the PG/DP
volunteers. It has yet to be determined whether it should include only
scans for finished projects, or also scans for ongoing projects and scans
donated by those outside of PG. The second repository could come
later, where a group of volunteers sort through the pile in the first
repository, check for copyright/license aspects, completeness, sort
the pages properly, then maybe convert the preferred derivative set
to DjVu, PDF, and/or other more compact formats. Anyone who wants
any of the stuff in the first repository can request it.
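
As a rough illustration of the volunteer review step between the two
repositories, here is a small Python sketch. The record fields and
function names are hypothetical, invented just to show the idea: a scan
set stays in the first repository until every check passes, and the
list of outstanding problems tells volunteers what is left to do.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RawEntry:
    """Hypothetical record for a scan set sitting in the first repository."""
    name: str
    copyright_cleared: bool      # a volunteer has checked copyright status
    license_allows_public: bool  # no license restriction on public access
    complete: bool               # no pages missing
    pages_sorted: bool           # pages are in the proper order


def review_problems(entry: RawEntry) -> List[str]:
    """Return the reasons an entry cannot yet move to the second repository."""
    problems = []
    if not entry.copyright_cleared:
        problems.append("copyright not cleared")
    if not entry.license_allows_public:
        problems.append("license restricts public access")
    if not entry.complete:
        problems.append("pages missing")
    if not entry.pages_sorted:
        problems.append("pages not sorted")
    return problems


def ready_for_second_repository(entry: RawEntry) -> bool:
    """An entry is ready only when no problems remain."""
    return not review_problems(entry)


# Invented example: a set that still needs its pages sorted.
entry = RawEntry("example-set", copyright_cleared=True,
                 license_allows_public=True, complete=True,
                 pages_sorted=False)
print(review_problems(entry))
print(ready_for_second_repository(entry))
```

Conversion to DjVu, PDF, or other formats is left out on purpose,
since that would only happen after a set passes these checks.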

And finally, since DP is the major player, should we move this
design discussion to the relevant forum at DP? Or keep it here on
gutvol-d?

Jon