
Robert wrote:
Jon Noring wrote:
But your system requires a human being, when each image is first saved, to eyeball the page, determine the publisher-supplied page number (if any; it may be implied), and then manually save the page using the publisher's number.
I think you are both going rather far afield. The majority of the books scanned at DP are scanned directly into Abbyy Finereader, which has the following limitations/features: [snip of summary of practice.]
Yes, agreed. As I was sitting outside thinking of the problem (and sipping on a beer), it became clear that an important driver for the file naming of the *existing scan sets* is how they are presently named at DP, and how much effort (read: volunteer effort) will be required to rename them if needed. (I think they will need a human being to look them over and make the needed changes if the scans are to be used in the kind of linking environment that Marcello is thinking of building for PG -- and which I'd like to see. We might consider a sort of DP-like environment to assist with scan set QCing and filename changes.) There is also the separate issue of what naming system a project to create high-quality book scans would embrace, and I think my system is the better (but not necessarily the best) candidate for that. There are certainly other possible systems which have not yet been proposed. But the focus at present for PG and DP is what currently exists, and how best to fit it in with both DP's and PG's needs and restrictions. One question for Robert and the other DPers: how were blank pages handled?
After scanning, everything gets run through Guiprep (excellent tool!), which renumbers all of the images consecutively to fit in with the DB restrictions at DP (0001.png, 0002.png, etc.). DP image numbers have very little to do with actual page numbers, although in most cases the difference between the two is a fixed offset. As Juliet said before, there are few conventions regarding illustration numbering; I generally use pic[xxxx].png, where xxxx is the true page #, or a variant thereof.
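To make the numbering described above concrete, here is a minimal Python sketch. It is not Guiprep's code and not an official DP tool; the function names, the `offset` parameter, and the four-digit zero padding of the pic name are assumptions made for illustration only.

```python
from typing import Optional


def dp_image_name(seq: int) -> str:
    """Consecutive DP-style image name for a 1-based position in the scan set."""
    return f"{seq:04d}.png"


def page_for_image(seq: int, offset: int) -> Optional[int]:
    """Guess the printed page number, assuming a fixed offset (hypothetical helper).

    offset is the number of images that precede printed page 1.
    Returns None for images that come before page 1 (front matter and the like).
    """
    page = seq - offset
    return page if page >= 1 else None


def illustration_name(page: int) -> str:
    """pic[xxxx].png naming, where xxxx is the true page number.

    Four-digit zero padding is an assumption, not a stated convention.
    """
    return f"pic{page:04d}.png"


if __name__ == "__main__":
    # Example: 12 images of front matter come before printed page 1.
    for seq in (5, 13, 15):
        print(dp_image_name(seq), page_for_image(seq, offset=12))
```

A fixed offset only holds "in most cases": unnumbered plates, inserted blank pages, or restarted numbering would each need their own adjustment, which is why a human check is still needed.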
There is no room for fancy metadata in the filename, although I believe DP now accepts an XML file with each project; you may be able to store 'real' page # information there. I believe the long-term plan, however, is to leave it to a metadata round.
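As a rough illustration of keeping 'real' page numbers in a side file rather than in the filename, the sketch below writes and reads a small XML mapping using Python's standard library. The `pagemap` and `page` element names and the `image` and `label` attributes are invented for this example; they are not DP's actual project XML format, which is not described in this thread.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def build_page_map(pages: Dict[str, str]) -> str:
    """Serialize an image-name -> printed-page-label mapping (hypothetical schema)."""
    root = ET.Element("pagemap")
    for image, label in sorted(pages.items()):
        ET.SubElement(root, "page", {"image": image, "label": label})
    return ET.tostring(root, encoding="unicode")


def read_page_map(xml_text: str) -> Dict[str, str]:
    """Parse the mapping back into a dictionary."""
    root = ET.fromstring(xml_text)
    return {p.get("image"): p.get("label") for p in root.iter("page")}


if __name__ == "__main__":
    # Labels are strings so that roman numerals or blank pages can be recorded.
    mapping = {"0001.png": "i", "0002.png": "", "0013.png": "1"}
    xml_text = build_page_map(mapping)
    print(xml_text)
    print(read_page_map(xml_text))
```

Using free-text labels instead of integers is a deliberate choice here, since front matter and blank pages do not have plain numeric page numbers.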
As an aside, I've always thought the best system would be to separate the DP proofing system from the scanning portion. In essence, to set up a separate (autonomous) "Distributed Scanners" which would encourage the scanning of older books, set minimum quality requirements, perform QC, standardize cataloging (possibly MARC-XML), clean up the scans to form working sets (deskewing, cropping, color depth reduction, etc.), and do so in a semi-distributed environment akin to DP. Then the work product would be archived at IA (with public access to some of the derivative scan sets, if not the masters). And of course DS would generate a derivative scan set optimized for DP's process. If the system works well, DP could encourage submitters to go through the DS system for submitting scans.

Of course, DS would require a few dedicated and knowledgeable people (in various areas of expertise) to get together, hammer out the specifics of the system, and do the necessary development work.

I do believe it will be possible to get equipment donations (such as sheet-feed scanners, Plustek OptiBook scanners or similar for scanning bound books with gentle handling and no page distortion, heavy-duty choppers, etc.). I also believe it possible to get tax-deductible donations of old books in poor condition which could be chopped (I'm looking into this now and am encouraged -- the tax deductibility is a big issue to bookstores and others). And though I may be naive on this, I think it possible to find some willing librarians at academic libraries who will let us come in and scan some of their older books. Of course, we'd try to reach out to the library community to find volunteers to assist with the cataloging/metadata aspects, and maybe some will help with the scan QC (and filenaming) as well.

Anyway, just sort of dreaming/musing here. Thanks, Robert, for clarifying the current status of the DP scans and filenaming system.

Jon