
Juliet wrote:
> Jon Noring wrote:
>> Thanks for your detailed reply, Juliet. Some questions/comments on a couple of items:
>> Now, regarding Branko's comment that Brewster has disk space to burn (implying he doesn't care about the details of what we intend to do), this is NOT true. I've had several personal discussions with Brewster on other matters, and I know he will be concerned about the breadth, depth, and future of what we have in mind, the public access issues, the format, etc. He may also be concerned about how it will relate to his current book scanning project in Canada (not in a competitive sense, but in terms of compatibility). Some of his questions and concerns we cannot predict in advance. But the more we understand what we want to do (and what we don't want to do), the easier it will be to discuss this with him, hopefully resulting in his offer to host the PG/DP scan repository.
> DP is archiving projects on one of the Internet Archive servers already, with their knowledge. Storage space is not an issue in that sense. However, from the PG perspective, issues having to do with whether the scans are mirrored, and if so, how much of a load they will put on the mirrors, become important. I don't have answers to those questions; I just point them out.
I appreciate this. I suspect that, at least for the present, the large footprint of the original page scans makes it impractical for PG itself to archive (and mirror) them alongside the digital text versions. Other than linking to the page scans at the central scan repository, the only thing that might make sense for PG to do in the short term is to distribute whole-book DjVu or PDF encapsulations of the page scans. But this requires a lot of work, on a book-by-book basis, to assure that a scan set is complete and that the final product meets some minimum quality standard.
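To make the mechanical part of that concrete (the real work, of course, is the human QC), here is a rough Python sketch of such a whole-book PDF encapsulation using the Pillow imaging library. The directory and file names are only placeholders:

    from pathlib import Path
    from PIL import Image  # Pillow imaging library

    # Placeholder directory; scans assumed to sort into page order by name.
    pages = sorted(Path("scans/projectID0001").glob("*.png"))
    images = [Image.open(p).convert("RGB") for p in pages]

    # Pillow writes a multi-page PDF: save the first page, append the rest.
    images[0].save("projectID0001.pdf", save_all=True, append_images=images[1:])

DjVu would presumably require an external encoder, such as the DjVuLibre command-line tools, so I leave that out here.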
> We do have internal identifiers for all "projects" and once a project is posted to PG, the PG identifier is included in the information we keep about each project. The issue here is that there is not always a clear one-to-one mapping between PG books and DP projects.... [snip]
What is the nature/structure of the DP identifiers, and what metadata is collected for each identifier? Is the metadata machine-processable (such as in XML), or is it simply a written, human-readable-only summary of the book being digitized plus other project info?
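By machine-processable I mean a record a program could emit and parse directly. Purely as an illustration, and with every element name below invented for the example rather than taken from DP, such a record might be generated like this in Python:

    import xml.etree.ElementTree as ET

    # All element names here are hypothetical, for illustration only.
    record = ET.Element("dp_project", id="projectID0001")
    ET.SubElement(record, "title").text = "An Example Title"
    ET.SubElement(record, "author").text = "An Example Author"
    ET.SubElement(record, "pg_etext_number").text = "12345"
    ET.SubElement(record, "page_count").text = "312"

    print(ET.tostring(record, encoding="unicode"))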
> A final complication has to do with mis-scanned or missing pages. These are often handled at the end of the workflow and have usually not made it into our archiving process. A post-processor will find that text has been cut off, or is obscured for some reason, and will ask the CP for a rescan. The rescanned image (maybe) and text would probably go to the post-processor via email and never enter the DP system. This is another matter that we have recently addressed, and we are working to ensure that, as much as possible, missing pages do go back through the DP system and get merged into their original projects. But we still have ~7K projects for which that wasn't the case.
What percentage of all the projects had one or more missing pages, as you describe?
> It seems to me that there are LOTS of large image archives already out there, and when we are ready to address the issue of making our own, we will learn what we can from how they have handled things like directory structures, etc.
This is a very good point, and it definitely needs to be researched. Do you have a short list of scanned page image archives we should consult?
> What prevents us from going forward are not technical issues, as such, or physical resource issues, but rather human resources in the form of developers, people to organize and regularize the scans, and the overall energy to make it happen. The quickest way from here to there is to get the rest of DP built so that we can then focus our attention on something like an image archive.
This makes sense. But it is also unfortunate, since there is already a large number of completed DP projects, and that number steadily increases day by day. As noted before (and as you are already doing to some extent), it makes sense to tighten the requirements for scan submission and processing, file naming, QC, handling of illustrations, etc. For example, if those submitting the page scans for a project are required to name each image after the corresponding page number (if any), then for most works it is possible to quickly check for missing pages (a rough sketch of such a check appears below), and submitters are more likely to catch missing pages themselves this way. (Since the bottleneck for DP does not appear to be page scan submissions, greatly tightening up the submission requirements for scans makes sense. I believe most people will do their best to meet those requirements. Those who can't, because of hardware limitations or lack of the needed technical skill, can always find someone to do the scanning and submission for them; a contact list of volunteer scanners, a sort of 'Distributed Scanners', could be assembled.)

One crazy and maybe unworkable idea for DP to resolve the page scan issues in the long term is to establish a separate "clearing house" (CH) for page scans, with its own volunteers. In this system, DP would require all scans for a project to be submitted to CH. There the scans would be checked by volunteers, maybe using an online interface not unlike the one used for DP proofing, for quality, missing pages, file naming issues, and the like. The scans could even go through a volunteer-driven clean-up process to normalize them, and even to produce DjVu and PDF versions. CH would also convert the scans to meet the needs of the proofing process. CH could also produce the metadata (including MARC or similar catalog records, so one would try to find librarian volunteers), and even issue the DP identifier. If a scan set passes muster, copyright clearance from PG could then be obtained; since the page scans are online, those doing the clearance will be able to inspect the original page scans in deciding on a clearance. Once a scan set is copyright cleared, it is sent to DP for OCR/proofing, as well as deposited in the scan repository, where it will be flagged as an unfinished project. Once the structured digital text is completed by DP, the flag in the scan repository will be switched to finished. Anyway, just a crazy idea.

In the meanwhile, how to sort through the existing 7000 or so project scan sets and organize them as best as possible needs to be the focus of attention. It is rather like a can of worms; the main decision will be how far the worms are to be untangled. I suspect that, because of the lack of volunteer help, we may only get as far as creating a directory structure based on the DP identifier and simply dumping the existing scans for a DP identifier/project into that directory without any filename changes (or into subdirectories within that directory if there are multiple scan sets), along with a metadata/cataloging file. Even this migration looks like it will require substantial human intervention. Maybe with 3 volunteers, each of whom squeezes in 50 project transfers per week, we'll have this done in about a year (3 volunteers x 50 projects/week x ~47 weeks = ~7000 projects). (Those familiar with how the scans are currently stored may think this migration could go faster; I don't know.)
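Here is the rough sketch mentioned above of such a missing-page check, assuming each scan is named by its page number (e.g. 0042.png); the directory name is a placeholder:

    import re
    from pathlib import Path

    def find_missing_pages(scan_dir):
        """Report gaps in page-numbered scan filenames such as 0042.png."""
        numbers = sorted(
            int(f.stem)
            for f in Path(scan_dir).iterdir()
            if re.fullmatch(r"\d+", f.stem)
        )
        if not numbers:
            return []
        # Any number absent between the first and last scanned page is suspect.
        return sorted(set(range(numbers[0], numbers[-1] + 1)) - set(numbers))

    missing = find_missing_pages("scans/projectID0001")  # placeholder path
    if missing:
        print("Possibly missing pages:", missing)

Of course this only catches gaps in the numbering; unnumbered plates, inserts, and front matter would still need a human eye.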
At a future time, if there is a need, volunteers can revisit each project's scan set and do further QC and normalization, try to locate missing page scans as Juliet noted, and make other improvements, even producing DjVu and/or PDF versions. How does this sound?

Jon