Re: [gutvol-d] Scan file naming -- another comment

23 Jul 2005

      Robert wrote:
...
Jon Noring wrote:
...
...
But your system requires that each image, when it is first saved,
needs a human being to eyeball the page, determine the publisher
supplied page number (if any; may be implied), and then manually save
the page using the publisher number.
...
I think you are both going rather far afield. The majority of the
books scanned at DP are scanned directly into Abbyy Finereader, which
has the following limitations/features: [snip of summary of practice.]

Yes, agreed. As I was sitting outside thinking of the problem (and
sipping on a beer), it became clear that an important driver for the
file naming of the *existing scan sets* is how they are presently named
at DP, and how much effort will be required (read: volunteer effort) to
rename them if needed. (I think they will require a human being to
look them over and make needed changes if the scans are to be used in
the kind of linking environment that Marcello is thinking of building
for PG -- and which I'd like to see. We might consider a sort of
DP-like environment to assist with scan set QCing and filename changes.)

There is the issue of what a project to create high-quality book scans
would embrace for its naming system, and I think my system is the
better (but not necessarily the best) candidate for that. There are
certainly other possible systems which have not yet been proposed.

But the focus at present for PG and DP is what currently exists, and
how to best fit it in with both DP's and PG's needs and restrictions.

One question of Robert and the other DPers: how were blank pages handled?
...
After scanning, everything gets run through Guiprep (Excellent tool!)
which renumbers all of the illustrations to fit in with the DB
restrictions at DP (0001.png, 0002.png, etc.) consecutively. DP image
numbers have very little to do with actual page numbers, although in
most cases it is a fixed offset. As Juliet said before, there are few
conventions regarding illustration numbering; I generally use
pic[xxxx].png, where xxxx is the true page #, or a variant thereof.
No room for fancy metadata in the filename, although I believe DP now
accepts an XML file with each project; you may be able to store 'real'
page # information there. Although I believe the long-term plan is to
leave it to a metadata round.
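
To make the numbering described above concrete, here is a minimal Python
sketch of the relationship between DP's consecutive image names and the
printed page numbers. Everything in it is an assumption made for
illustration: the function names, the four-digit zero padding of the
pic[xxxx].png form, the sample offset of 12, and the overrides table for
the cases where the fixed offset does not hold. It does not reflect how
Guiprep or the DP site actually work.

```python
from typing import Dict, Optional


def image_name(seq: int) -> str:
    """Consecutive DP-style image name, e.g. 13 -> '0013.png'."""
    if seq < 1:
        raise ValueError("sequence numbers start at 1")
    return f"{seq:04d}.png"


def illustration_name(page: int) -> str:
    """Illustration name keyed on the true page number (padding is assumed)."""
    return f"pic{page:04d}.png"


def printed_page(
    seq: int,
    offset: int,
    overrides: Optional[Dict[int, Optional[str]]] = None,
) -> Optional[str]:
    """Return the printed page label for image number `seq`.

    `offset` is the fixed difference between image number and page number.
    `overrides` (hypothetical) maps image numbers to explicit labels for
    pages that break the rule, such as roman-numeral front matter; a value
    of None marks a page with no printed number.
    """
    if overrides is not None and seq in overrides:
        return overrides[seq]
    page = seq - offset
    # Images before printed page 1 have no label under the plain offset rule.
    return str(page) if page >= 1 else None


if __name__ == "__main__":
    # Hypothetical book: 12 images of front matter precede printed page 1.
    front_matter = {5: "v"}
    for seq in (3, 5, 13):
        print(image_name(seq), "->", printed_page(seq, 12, front_matter))
    print(illustration_name(23))
```

A table like `overrides` is the sort of 'real' page # information that
could live in a per-project metadata file rather than in the filename.
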
As an aside, I've always thought the best system would be to separate
the DP proofing system from the scanning portion. In essence, to set up
a separate (autonomous) "Distributed Scanners" (DS) which will encourage
the scanning of older books, set minimum quality requirements, do QC,
standardize cataloging (possibly MARC-XML), clean up the scans to
form working sets (deskewing, cropping, color depth reduction, etc.),
and do so in a semi-distributed environment akin to DP. Then the work
product would be archived at IA (with public access to some of the
derivative scan sets if not the masters). And of course DS would
generate a derivative scan set optimized for DP's process. If the
system works well, DP could encourage submitters to go through the
DS system for submitting scans.

Of course, DS would require a few dedicated and knowledgeable people
(in various areas of expertise) to get together and hammer out the
specifics of the system and do the necessary development work. I do
believe it will be possible to get equipment donations (such as sheet
feed scanners, Plustek OptiBook scanners or similar for scanning bound
books with gentle handling and no page distortion, heavy duty choppers,
etc.). I also believe it possible to get tax-deductible donations of old
books in poor condition which could be chopped (I'm looking into this
now and am encouraged -- the tax deductibility is a big issue to
bookstores and others.) And though I may be naive on this, I think it
possible to find some willing librarians at academic libraries who will
let us come in and scan some of their older books. Of course, we'd try
to reach out to the library community to find volunteers to assist with
the cataloging/metadata aspects, and maybe some will help with the scan
QC (and filenaming) as well.

Anyway, just sort of dreaming/musing here.

Thanks, Robert, for clarifying the current status of the DP scans and
filenaming system.

Jon