
Thanks to Branko (public message) and Jon Niehof (private message, I believe) for their quick replies to my initial set of questions/concerns/thoughts regarding the scan repository, which was posted earlier today. It is interesting that the answers differed in several ways, which I won't reply to in the usual manner. But it does show that there will not be uniformity of thought on all the details of the scan repository.

But filtering through both messages -- and despite the angst I expressed in an earlier message today agreeing with what Bowerbird had to say on this topic ("what should be") -- it is clear that we can and should go ahead with a "first generation" scan repository focused primarily on centralizing the preservation of the scans submitted to DP, and on making them publicly available pretty much "as is". (Expansion to allow submissions of scan sets outside of DP activities can be considered once the DP scan repository is up and running, and mostly debugged. We will certainly learn a lot from it. Thus my go-slow preference.)

Now, regarding Branko's comment that Brewster has disk space to burn (implying he doesn't care about the details of what we intend to do): this is NOT true. I've had several personal discussions with Brewster regarding other matters, and I know he will be concerned about the breadth/depth/future of what we have in mind, the public access issues, the format, etc. He may also be concerned with how it will relate to his current book scanning project in Canada (not in a competitive sense, but in a compatibility sort of way). Some of his questions and concerns we cannot predict in advance. But the better we understand what we want to do (and what we don't want to do), the easier it will be to discuss this with him, hopefully resulting in his offer to host the PG/DP scan repository.

Probably the #1 problem as I see it is to assure that scan sets don't become "disassociated" from PG/DP.
That is, there needs to be sufficient metadata so each scan set can be attached to a particular DP-produced digital text, and from that, attached to the associated PG text. So, does DP keep an internal identifier and associated metadata for each project it works on? Will it be fairly easy to associate a scan set with that DP identifier, or will it require some human intervention to make the association? And does the DP identifier correlate to a PG text identifier?

Another issue concerns the directory structure of the repository which will hold the scan sets. It has to take into account that there may be not only the original scan set which was submitted to DP, but also derivative sets produced by some processing (there's, of course, the 120 dpi set used for the DP proofing interface.) I tend to view that for the DP scans, the DP identifier should be used as part of the name of the directory holding the scan sets associated with that identifier.

Another issue is "zip vs. individual files". My view leans toward keeping each page scan image a separate file, so end-users need not download a whole gigabyte-size (or larger) ZIP file just to look at one page (some books may have scan sets this large.)

Another issue, which affects directory design, metadata, and access, is serialization. Some works were serialized (such as multiple parts in periodicals) and then combined into one digital text. How do we handle/organize the scan sets for this? There are no doubt other odd "exceptions" we will have to handle.

Format is another issue. Branko mentioned that DjVu is proprietary (or essentially so). Yet Brewster was, the last time I talked with him about scan formats (last year), enamored with it, and was using it for his scanning projects to greatly conserve disk space (he *is* concerned about the cost of storing book scans, thus his interest in DjVu.) If Brewster is still using DjVu, I can *guarantee* he will mention it to us -- we need an answer ready if we don't want to use DjVu compression.
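To make the metadata and directory-structure questions concrete, here's a rough sketch (in Python; the identifier formats, field names, and layout are all hypothetical, not anything DP or PG actually uses) of how a scan set might be keyed on a DP project identifier and tied back to a PG text:

```python
import json
import posixpath

def scanset_path(dp_project_id: str, derivative: str = "original") -> str:
    """Build a repository path for a scan set, keyed on the DP project
    identifier so the original and all derivatives stay grouped under
    one directory. (Hypothetical layout for illustration only.)"""
    return posixpath.join("scans", dp_project_id, derivative)

def scanset_metadata(dp_project_id: str, pg_etext_number: int,
                     title: str) -> str:
    """A minimal metadata record tying a scan set to its DP project and
    to the resulting PG text, returned as JSON. Field names are made up
    for this sketch."""
    record = {
        "dp_project_id": dp_project_id,
        "pg_etext_number": pg_etext_number,
        "title": title,
        "derivatives": ["original", "120dpi-proofing"],
    }
    return json.dumps(record, indent=2)

# Hypothetical project identifier; each page scan would live as an
# individual file under this directory, so one page can be fetched
# without downloading a giant ZIP.
print(scanset_path("projectID1234abcd", "original"))
# scans/projectID1234abcd/original
```

The point of the sketch is only that a single stable identifier does double duty: it names the directory and it appears in the metadata record, so a scan set cannot drift away from its DP project or its PG text.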
The "proprietary" issue will not sway him if he is using it. We may need to look at alternative "lossy" formats (I detest JPEG for compressing scanned images because of the significant artifacts it creates -- there must be something better which is also an open standard?) Obviously, I'd prefer to store the scan sets (and derivatives) in a format which preserves, in a lossless sense, the scans as they were submitted to DP. But we need not be wedded to keeping the identical format they were submitted in, so long as they are losslessly the same.

Another issue regards end-user compatibility. Web browsers are usually not TIFF- (nor DjVu-) ready without a plugin, while all contemporary browsers will handle PNG and JPEG. When it comes to lossless compression of bitonal images, some of the CCITT protocols (as used in a TIFF encapsulation) are significantly better than PNG (PNG is designed for general lossless compression of all kinds of images.) Of course, there is the option that if a "third party" interface is developed to access the repository, image conversion on the fly could be employed to translate the format in the repository into most any format an end-user can handle, even if the source is in some odd format.

Another possible issue regards illustrations. Am I right that some scan projects scan the whole book at one resolution, then return and do the pages with illustrations at a higher resolution, cropped to the illustrations? This may have an impact on directory structure and metadata fields.

Anyway, that's enough for now. Comments?

Jon
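P.S. To make the "image conversion on the fly" idea concrete, here's a rough sketch (in Python; the converter names and file extensions are purely illustrative, not real tools we've chosen) of how a repository front end might decide whether a stored page image can be served as-is or needs translating into a browser-friendly format first:

```python
import posixpath

# Hypothetical table mapping (stored format, requested format) to the
# converter a front end would invoke on the fly; names are made up.
CONVERTERS = {
    (".tif", ".png"): "tiff2png",
    (".tif", ".jpg"): "tiff2jpeg",
    (".djvu", ".png"): "djvu2png",
}

def plan_conversion(stored_file: str, requested_ext: str):
    """Return ("serve", filename) if the stored page image already
    matches the requested format, or ("convert", tool) naming the
    on-the-fly converter to run. Raises if no converter is known."""
    _stem, src_ext = posixpath.splitext(stored_file)
    if src_ext == requested_ext:
        return ("serve", stored_file)
    tool = CONVERTERS.get((src_ext, requested_ext))
    if tool is None:
        raise ValueError(f"no converter from {src_ext} to {requested_ext}")
    return ("convert", tool)

print(plan_conversion("p0042.tif", ".png"))
# ('convert', 'tiff2png')
```

This way the archival copies could stay in whatever lossless format best suits them (CCITT-in-TIFF, say), while browsers only ever see PNG or JPEG.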