Re: [gutvol-d] Initial thoughts on a PG/DP scan repository

15 Jul 2005

      On 15 Jul 2005, at 11:58, Jon Noring wrote:
...
After reading the various replies, it is clear there is more interest
in preserving and (sooner if not later) making the scans available to
the public online than I originally perceived based on the statements
of a few here. It is welcome to see the majority, including Greg and
Juliet, come out in support of scan preservation and availability.
Now, with about 15 minutes of thought (so we're at a very early stage
of conceptualization),
Very early stage of conceptualization? Yesterday you had a plan for 
implementation. What is wrong with the half-hour-plan you had 
yesterday? Do not sink the momentum you have got in the morass of 
overplanning.

Having scans available is handy for:

- The PG errata folks who do not want to accidentally "fix" the wrong 
thing.

- Publishers who want to make richer versions of our texts based on 
the original lay-out.

- Gatekeepers who claim references cannot be made without looking at 
a page number (they will--of course--invent a new claim as to why PG 
etexts are unusable as soon as we provide them with the means to name 
page numbers).

For these purposes, a simple system that links a PG etext with its 
page scans suffices.
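
To show how little such a link needs, here is a rough sketch in Python. 
Everything in it is made up for illustration: the "scans" root, the 
one-directory-per-etext layout, the zero-padded PNG file names and the 
function names are assumptions, not anything PG or DP actually uses. 
"Page" here means the position of the scan in the set, which need not 
match the printed page number.

```python
from pathlib import PurePosixPath

# Hypothetical layout: <root>/<etext number>/<zero-padded scan number>.png
SCAN_ROOT = PurePosixPath("scans")


def scan_dir(etext_no: int) -> PurePosixPath:
    """Return the directory holding all page scans of one etext."""
    return SCAN_ROOT / str(etext_no)


def page_scan(etext_no: int, page: int) -> PurePosixPath:
    """Return the path of a single page scan (pages count from 1)."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    return scan_dir(etext_no) / f"{page:04d}.png"


print(page_scan(12345, 7))  # prints scans/12345/0007.png
```

With a convention like this, the etext number is the only key anybody 
has to know; no database is required to get started.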
...
here's some of the issues and unknowns as I see
them -- in no particular order -- a sort of stream of consciousness.
Feel free to comment, counter-point, and to add more items.
1) There are scan sets all over the place, in various resolutions,
   color depths, quality, and completeness. (The variation in file
   size between sets will be significant.)
Yes and no. For the purposes outlined above, the scan set from which 
DP's OCR was created will suffice.
...
2) Some to many scan sets comprise various derivative subsets (e.g.,
   the original scans, a cleaned up set, etc.)
Question: If a scan set contains derivatives of some sort, do we
             restrict what will be saved?
No. Why?
...
3) Many if not most scan sets have no external metadata. Thus their
   identification and source of metadata will be internal (title page
   info, etc.)
4) Some scan sets we acquire may have license encumbrances not
   allowing them to be accessible to the general public. (From a
   conversation with Juliet a while back.) This will impact upon
   the design of the scan repository(ies).
Not quite. 

The problem with some scan sets is that their sources claim some kind 
of ownership. I believe that PG believes that mere scans of public 
domain material are themselves in the public domain. However, there 
are many ways in which these sources could make things difficult for 
us (for instance, putting scans behind a login, so that we cannot get 
at them), so DP plays nice.
...
5) There is likely to be interest by some outside of PG/DP to
   contribute scans of public domain texts to the raw archive. Should
   we allow this?
I am not sure if PG wants to run a scan archive. Assume that this is 
going to take place outside of (but with the support of) PG. 

If this were a PG project, PG might require copyright clearances. But 
assuming it is not, you will probably have to work something out with 
the hosting folks so that they cannot be sued.

Purely for practical reasons I think outside scans should be banned; 
if somebody wants to contribute, they can contribute to DP, after 
which the scans will automatically trickle down to the scan archive. 
Bonus: we will have an accessible etext.
...
If so, keeping this straight is necessary. More
   metadata. (I note this thinking specifically of David Reed here in
   Utah who is scanning large numbers of books and doing most of the
   conversion himself -- I'm not sure if all the texts he is doing are
   being submitted to PG, but since he associates with PG do we
   disallow him from adding his scans to the mash?)
The great thing about metadata is that it can be added afterwards by 
those who actually care about that sort of thing.
...
6) We have to worry about scan sets of works still under copyright.
   (And there's the related difference in life+ countries versus the
   fixed 1923 date in the U.S., ignoring the renewal aspects. Will the
   repository be used for PG-affiliated projects world-wide?)
I like PG's philosophy, which is that PG itself does not worry about 
the rest of the world; let the rest of the world worry. There are no 
copyright concerns.

A minor issue might be copyrighted texts in PG; but do we typically 
have the scans of those?
...
7) When the scans are stored, do we zip each set/subset up, or keep
   each page scan separate?
Separate at first. Software that will zip it up is trivial to write 
and add.
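
As a rough illustration of "trivial", a sketch in Python using only 
the standard library. The function name and the assumption that one 
directory holds one scan set are mine; the files are stored 
uncompressed on the guess that the scans are already in a compressed 
image format.

```python
import zipfile
from pathlib import Path


def zip_scan_set(scan_dir: Path, archive: Path) -> int:
    """Bundle every file in scan_dir into one zip; return the file count."""
    # Sort so the pages end up in the archive in page order.
    pages = sorted(p for p in scan_dir.iterdir() if p.is_file())
    # ZIP_STORED: no recompression, assuming the images are already compressed.
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_STORED) as zf:
        for page in pages:
            # Store only the file name, not the full path on the server.
            zf.write(page, arcname=page.name)
    return len(pages)
```

Calling zip_scan_set(Path("scans/12345"), Path("12345.zip")) would 
then produce one downloadable file per scan set, and the separate page 
files stay where they are.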
...
8) What are our space requirements, for now and in the future?
   (This probably requires a survey, plus we have to decide whether we
   save all derivative subsets, save the original scans or a cleaned
   up set, or whatever.)
I would first talk to Brewster Kahle. Mentioning space requirements 
to him is probably as useless as discussing snowballs with planets.
...
9) Do we consider conversion of all scans to DjVu? (If we ask
   Brewster for space, he may suggest this in order to cut down
   the size of the scans.)
First build the repository, then see how you can add value.

Reminds me of something I keep hearing: apparently a lot of people 
won't use PG etexts, because these texts have errors in them. Funny 
how none of these people has bothered to report these errors to PG. 

The thing is, people who have a real need for extras, will probably 
come to us, and may even contribute what is necessary to let these 
extras become a reality. People who merely have a perceived need will 
just whine that we do not fulfil that need.
...
10) Those possessing scan sets will have their own preferences for
    how they are submitted. Some will want to upload them, others will
    want to send them via CD/DVD-ROM. If some scan sets comprise a few
    gigs, it may make more sense to burn them on optical disk and send
    them. This now requires us to possibly save what is sent. It also
    requires someone to accept the disks and transfer the data to the
    repository. Who will do this depends upon various factors.
Again, I would outright ban external scan sets.
...
11) Access to the repository(ies). Only PG/DP volunteers, or the
    general public? The answer depends upon the structure of the
    repository(ies), who is hosting it, legal/technical issues,
    etc.
General public.
...
I could go on, but the above items are a good start at some of the
issues/decisions that need to be considered/resolved before bulling
ahead with establishing the repository(ies). Obviously a lot of the
design of the system has to integrate with how things are now done in
DP with respect to handling/processing scans, which I don't have a
good handle on at the present.
And should we consider a "two repository" model? The first repository,
which will be the first one started, will simply be a "centralized
dumping ground" for the scan sets (and derivatives) which are produced
as part of PG/DP activities. Access to it will be limited to the PG/DP
volunteers.
Why? DP already has a scan repository accessible to the volunteers; 
the only snag being that only scans of running projects are 
available.

-- 
branko collin
collin@xs4all.nl
