Re: [gutvol-d] Some specific thoughts on the scan repository (directory, format, etc.)

16 Jul 2005

      Jon Noring wrote:
...
Now, regarding Branko's comment that Brewster has disk space to burn
(implying he doesn't care about the details of what we intend to do),
this is NOT true. I've had several personal discussions with Brewster
regarding other matters, and I know he will be concerned about the
breadth/depth/future of what we have in mind, the public access
issues, the format, etc. He may also be concerned how it will relate
to his current book scanning project in Canada (not in a competitive
sense, but in a compatibility sort-of-way). Some of his questions and
concerns we cannot predict in advance. But the more we understand what
we want to do (and what we don't want to do), the easier it will be to
discuss this with him, hopefully resulting in his offer to host the
PG/DP scan repository.

DP is archiving projects on one of the Internet Archive servers already,
with their knowledge. Storage space is not an issue in that sense. 
However, from the PG perspective, issues having to do with whether the 
scans are mirrored, and if so, how much of a load that will put on the 
mirrors, become important. I don't have answers to those questions; I
just point them out.
...
Probably the #1 problem as I see it is to assure that scan sets don't
become "disassociated" from PG/DP. That is, there needs to be
sufficient metadata so each scan set can be attached to a particular
DP produced digital text, and from that, attached to the associated PG
text.
So, does DP keep an internal identifier and associated metadata for
each project it works on? And will it be pretty easy to associate a
scan set with that DP identifier, or will it require some human
intervention to make the association? And does the DP identifier
correlate to a PG text identifier?

We do have internal identifiers for all "projects" and once a project is
posted to PG, the PG identifier is included in the information we keep 
about each project. The issue here is that there is not always a clear 
one-to-one mapping between PG books and DP projects. Historically 
several projects could make up one PG book, the book having been split 
up for internal DP reasons. Also historically there are DP projects that 
ended up being posted to PG as several separate pieces. I say 
historically because we have recently started to make a concerted effort 
to be sure that our "projects", as archived, do match up to the books as 
posted. This often means merging or splitting DP projects. We are doing 
this for the obvious reason of trying to make the whole image archive 
project easier when we get around to it. But there are lots of old 
projects that need to be sorted out.
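
To make the mapping problem concrete, here is a minimal sketch of how the DP-project-to-PG-book association could be recorded as a many-to-many link table, and how the old projects that still need merging or splitting could be picked out. It is only an illustration: the identifiers, the LINKS data, and the function names are all hypothetical and do not reflect DP's or PG's actual schema.

```python
from collections import defaultdict

# Hypothetical link records: (DP project id, PG ebook number).
# None of these identifiers are real; they only illustrate the cases.
LINKS = [
    ("projectA1", 1001),  # two DP projects that make up one PG book
    ("projectA2", 1001),
    ("projectB", 2001),   # one DP project posted to PG as two pieces
    ("projectB", 2002),
    ("projectC", 3001),   # clean one-to-one case
]


def build_indexes(links):
    """Return (project -> set of PG numbers, PG number -> set of projects)."""
    by_project = defaultdict(set)
    by_ebook = defaultdict(set)
    for project_id, ebook_no in links:
        by_project[project_id].add(ebook_no)
        by_ebook[ebook_no].add(project_id)
    return by_project, by_ebook


def needs_sorting_out(links):
    """List DP projects that are not in a one-to-one relation with a PG book."""
    by_project, by_ebook = build_indexes(links)
    flagged = set()
    for project_id, ebooks in by_project.items():
        if len(ebooks) != 1:
            # One project posted as several PG items.
            flagged.add(project_id)
        elif len(by_ebook[next(iter(ebooks))]) != 1:
            # Several projects that together make up one PG item.
            flagged.add(project_id)
    return sorted(flagged)


print(needs_sorting_out(LINKS))
```

Keeping the links as plain pairs, rather than a single PG field on each project, is what lets both historical cases (several projects per book, several books per project) be represented without special handling.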
...
Another issue, which affects directory design, metadata and access,
is serialization. Some Works were serialized (such as multiple parts
in periodicals), which are combined into one digital text. How do we
handle/organize the scan sets for this? There are no doubt other odd
"exceptions" we will have to handle.

Actually, all the cases that I can think of where we had one DP project
that got posted to PG as multiple "books" are periodicals. We have not 
yet addressed the issue of posting a "book" to PG that is derived
from multiple issues of a periodical, although I'm hoping that that will
happen eventually. We have Alcott's Under the Lilacs from St. Nicholas 
Magazine (in 1878) going through now. Not all the issues have been 
proofed yet, but when they are, I'm hoping that someone will pull out 
the entire book, with its illustrations, and post it to PG as a separate 
item, independent from the posting of the individual issues.
...
Another possible issue regards illustrations. Am I right that some
scan projects scan the whole book at one resolution, then return
and do the pages with illustrations at a higher resolution and crop
to the illustrations? This may have an impact on directory structure
and metadata fields.

Yes, you are correct about illustrations. It is my standard practice,
and that of many other content providers (CPs), to make one set of scans 
for proofing, and then to rescan the illustrations separately. This fits 
into our workflow very well, but does make it harder to make an image 
archive that includes those illustrations. Also, for a good many books 
when we first started to make html versions, the illustrations were 
handled directly between the content provider and the post-processor and 
never made it into our image/archiving system at all. So the 
illustrations will be available from the PG html version, but DP may not 
have copies of them.

Illustrations are also tricky because what is used for an illustrated 
html version can be quite different, in terms of dpi, size, etc., from
what was originally provided. I provide illustrations at either 400 or 
600 dpi, full-size, with a very minimum of processing (I've recently 
started rotating illustrations so that they are straight rather than 
crooked, for example), and those get archived with the project scans. 
But some CPs make the illustrations ready for the html version from the 
start and never upload the large, original scans. Getting good scans of 
illustrations is tricky. Getting the original scans into our system is 
also tricky, often because of the sizes involved.

While DP does have a very strict naming convention for proofing images 
and text, there is no consistent naming convention for illustrations. 
Most CPs have developed their own conventions and stick to them. But 
someone is going to have fun standardizing everything eventually.

A final complication has to do with mis-scanned or missing pages. These 
are often handled at the end of the process and have usually not made it 
into our archiving process. A post-processor will find that text has 
been cut off, or is obscured for some reason, and will ask the CP for a 
rescan. The rescanned image (maybe) and text would probably go to the 
post-processor via email and never enter the DP system. This is another 
matter that we have recently addressed and we are working to be sure 
that, as much as possible, missing pages do go back through the DP 
system and get merged into their original projects. But we still have 
~7K projects for which that wasn't the case.

So, DP is gradually moving in the direction of support for an image 
archive. We do always have in mind that it will happen eventually and 
we've been slowly changing our systems to make such a
thing easier in the long run. It seems to me that there are LOTS of
large image archives already out there, and when we are ready to address 
the issue of making our own, we will learn what we can from how they 
have handled things like directory structures, etc. What prevents us 
from going forward are not technical issues, as such, or physical 
resource issues, but rather human resources in the form of developers, 
people to organize and regularize the scans, and the overall energy to 
make it happen. The quickest way from here to there is to get the rest 
of DP built so that we can then focus our attention on something like an 
image archive.

JulietS

Juliet Sutherland