
Jon Noring wrote:
We do have internal identifiers for all "projects" and once a project is posted to PG, the PG identifier is included in the information we keep about each project. The issue here is that there is not always a clear one-to-one mapping between PG books and DP projects.... [snip]
What's the nature/structure of the DP identifiers, and for each identifier what metadata is collected? Is the metadata machine-processible (such as in XML), or is it simply a written, human-readable-only summary of the book being digitized plus other project info?
DP works off a database. The projectid is a unique identifier (10 hex digits), intended only for internal, non-human use. It serves both to organize the data associated with the project in the db and to name the system/working directory that holds the scans and other info not in the db. We keep information relevant to our production process in the database, including things like title, author, genre (informally assigned), language, who scanned it, the names of the project manager and post-processor, etc. When the project is created, a small file with Dublin Core information is also created and lives in the working directory. At the moment that file is not used for anything. All of this information is kept when the project is archived: the contents of the working directory move off our server onto the one at TIA, along with most of the info in the db (e.g., the text from all rounds). We retain some project info in our production database for record-keeping purposes.
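For concreteness, the two artifacts described above might be sketched in Python roughly as follows. The identifier-generation scheme, the XML shape, and the Dublin Core field set are all assumptions for illustration, not the actual DP implementation:

```python
import secrets
import xml.etree.ElementTree as ET

def new_projectid() -> str:
    """Generate a 10-hex-digit identifier like a DP projectid.
    (The real DP generation scheme is not specified; this is illustrative.)"""
    return secrets.token_hex(5)  # 5 random bytes -> 10 hex digits

def dublin_core_stub(title: str, creator: str, language: str) -> str:
    """Build a minimal Dublin Core record like the small file created in
    each project's working directory (field set and XML form assumed)."""
    ns = "http://purl.org/dc/elements/1.1/"
    ET.register_namespace("dc", ns)
    root = ET.Element("metadata")
    for tag, value in [("title", title), ("creator", creator),
                       ("language", language)]:
        ET.SubElement(root, f"{{{ns}}}{tag}").text = value
    return ET.tostring(root, encoding="unicode")
```

A projectid like this would then key both the database rows and the working directory name, while the Dublin Core stub travels with the scans when the project is archived.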
A final complication has to do with mis-scanned or missing pages. These are often handled at the end of the process and usually have not made it into our archiving. A post-processor will find that text has been cut off, or is obscured for some reason, and will ask the CP for a rescan. The rescanned image (maybe) and text would typically go to the post-processor via email and never enter the DP system. This is another matter that we have recently addressed, and we are working to be sure that, as much as possible, missing pages do go back through the DP system and get merged into their original projects. But we still have ~7K projects for which that wasn't the case.
What percentage of all the projects had one or more missing pages as you mention?
Probably a relatively low percentage. But they end up taking far more time than they should.
It seems to me that there are LOTS of large image archives already out there, and when we are ready to address the issue of making our own, we will learn what we can from how they have handled things like directory structures, etc.
This is a very good point, and definitely needs to be researched.
Do you have a short list of scanned page image archives that we should consult with?
I refer you to the very long list in a post at the top of Content Providers forum at DP.
One crazy and maybe unworkable idea for DP to resolve the page scan issues in the long term is to establish a separate "clearing house" (CH) for page scans, with its own volunteers. In this system, DP would require all scans for a project to be submitted to CH. There the scans would be checked by its volunteers, perhaps using an online interface not unlike that used for DP proofing, for quality, missing pages, file name issues, and the like. The scans could even go through a volunteer-driven clean-up process to normalize them and produce DjVu and PDF versions, and CH would also convert them for the needs of the proofing process. CH could also produce the metadata (including MARC or similar catalog records, so one would try to find librarian volunteers) and even issue the DP identifier. If the scan set passes muster, copyright clearance from PG could then be obtained; since the page scans are online, those doing the clearance would be able to inspect the original page scans to decide on a clearance. Once a scan set is copyright cleared, it would be sent to DP for OCR/proofing and deposited in the scan repository, where it would be flagged as an unfinished project. Once the structured digital text is completed by DP, the flag in the scan repository would be switched to finished.
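The scan-set lifecycle in this proposal could be modeled, very roughly, as a small state machine. Every state and transition name below is hypothetical; the sketch only illustrates the submitted → checked → cleared → proofing → finished flow, including the "unfinished"/"finished" repository flag at the end:

```python
from enum import Enum, auto

class ScanSetState(Enum):
    """Hypothetical lifecycle stages for a scan set in the proposed CH."""
    SUBMITTED = auto()        # CP uploads scans to the clearing house
    QUALITY_CHECKED = auto()  # volunteers verify quality, completeness, file names
    CLEARED = auto()          # PG copyright clearance granted from the online scans
    PROOFING = auto()         # sent to DP; repository copy flagged "unfinished"
    FINISHED = auto()         # structured text complete; flag flipped to "finished"

# Allowed transitions in the sketched pipeline; a failed quality check
# sends the set back to SUBMITTED for rescans or renaming.
TRANSITIONS = {
    ScanSetState.SUBMITTED: {ScanSetState.QUALITY_CHECKED},
    ScanSetState.QUALITY_CHECKED: {ScanSetState.CLEARED, ScanSetState.SUBMITTED},
    ScanSetState.CLEARED: {ScanSetState.PROOFING},
    ScanSetState.PROOFING: {ScanSetState.FINISHED},
    ScanSetState.FINISHED: set(),
}

def advance(state: ScanSetState, nxt: ScanSetState) -> ScanSetState:
    """Move a scan set to its next stage, rejecting illegal jumps."""
    if nxt not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state.name} -> {nxt.name}")
    return nxt
```

The point of making the transitions explicit is the one the proposal relies on: a scan set cannot reach DP proofing without first passing the quality check and copyright clearance.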
Congratulations! You just discovered for yourself the next major addition to DP. We have been planning all along to have something we call "metadata collection". This will most likely be two rounds. The first round will collect whatever project-level metadata we decide is important and that can be derived directly from the scans; it will also check that all pages are present and that the scans are legible. The second round will look at each page, noting formatting features such as ToC, index, tables, poems, block quotes, musical notation, mathematical or chemical equations, etc. This information will be used to route pages to "specialist" rounds for things like musical notation markup, math markup, tables, and indexes. It will also be used as a quality check on the final formatting ("The page metadata says there should be 2 footnotes. Why is there only markup for one?"). I hope that we will also be able to send illustration scans through a separate production path as part of this process. We don't currently have any plans to do anything special with the scans themselves, other than being certain that they are all there and legible. There are a LOT of open questions about how we will implement this, but the basic idea has been in Charlz' plan from the beginning. JulietS
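The footnote example of that quality check could look something like the sketch below. The metadata key and the `[Footnote` markup convention are assumptions made for illustration, not a specification of the planned DP tooling:

```python
import re

def check_footnotes(page_meta: dict, formatted_text: str) -> list[str]:
    """Compare a count recorded in (hypothetical) page metadata against
    markup actually present in the formatted text, in the spirit of:
    'The page metadata says there should be 2 footnotes. Why is there
    only markup for one?'"""
    problems = []
    expected = page_meta.get("footnotes", 0)
    # Count footnote openings in the formatted page text.
    found = len(re.findall(r"\[Footnote", formatted_text))
    if found != expected:
        problems.append(f"expected {expected} footnotes, found markup for {found}")
    return problems
```

The same pattern would extend to any feature the second metadata round records per page: tables, poems, block quotes, and so on, each checked against the markup the formatters produced.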