
Juliet wrote:
> Jon Noring wrote:
>> Thanks for your detailed reply, Juliet. Some questions/comments on a couple of items:
>> Now, regarding Branko's comment that Brewster has disk space to burn (implying he doesn't care about the details of what we intend to do), this is NOT true. I've had several personal discussions with Brewster on other matters, and I know he will be concerned about the breadth, depth, and future of what we have in mind, the public access issues, the format, etc. He may also be concerned about how it will relate to his current book scanning project in Canada (not in a competitive sense, but in terms of compatibility). Some of his questions and concerns we cannot predict in advance. But the more we understand what we want to do (and what we don't want to do), the easier it will be to discuss this with him, hopefully resulting in his offer to host the PG/DP scan repository.
> DP is archiving projects on one of the Internet Archive servers already, with their knowledge. Storage space is not an issue in that sense. However, from the PG perspective, issues having to do with whether the scans are mirrored, and if so, how much of a load they will put on the mirrors, become important. I don't have answers to those questions; I just point them out.
I appreciate this. I suspect that, at least for the present, the large footprint of the original page scans makes it impractical for PG itself to archive (and mirror) them alongside the digital text versions. Other than linking to the page scans at the central scan repository, the only thing that might make sense for PG to do in the short term is to distribute whole-book DjVu or PDF encapsulations of the page scans. But this requires a lot of work, on a book-by-book basis, to assure that a scan set is complete and that the final product meets some minimum quality standard.
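To make the mechanical part of that concrete (the real work, of course, is the human QC), here is a rough Python sketch of such a whole-book PDF encapsulation using the Pillow imaging library. The directory and file names are only placeholders:

    from pathlib import Path
    from PIL import Image  # Pillow imaging library

    # Placeholder directory; scans assumed to sort into page order by name.
    pages = sorted(Path("scans/projectID0001").glob("*.png"))
    images = [Image.open(p).convert("RGB") for p in pages]

    # Pillow writes a multi-page PDF: save the first page, append the rest.
    images[0].save("projectID0001.pdf", save_all=True, append_images=images[1:])

DjVu would presumably require an external encoder, such as the DjVuLibre command-line tools, so I leave that out here.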
> We do have internal identifiers for all "projects" and once a project is posted to PG, the PG identifier is included in the information we keep about each project. The issue here is that there is not always a clear one-to-one mapping between PG books and DP projects.... [snip]
What is the nature/structure of the DP identifiers, and what metadata is collected for each identifier? Is the metadata machine-processable (such as in XML), or is it simply a written, human-readable-only summary of the book being digitized plus other project info?
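By machine-processable I mean a record a program could emit and parse directly. Purely as an illustration, and with every element name below invented for the example rather than taken from DP, such a record might be generated like this in Python:

    import xml.etree.ElementTree as ET

    # All element names here are hypothetical, for illustration only.
    record = ET.Element("dp_project", id="projectID0001")
    ET.SubElement(record, "title").text = "An Example Title"
    ET.SubElement(record, "author").text = "An Example Author"
    ET.SubElement(record, "pg_etext_number").text = "12345"
    ET.SubElement(record, "page_count").text = "312"

    print(ET.tostring(record, encoding="unicode"))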
> A final complication has to do with mis-scanned or missing pages. These are often handled at the end of the workflow and have usually not made it into our archiving process. A post-processor will find that text has been cut off, or is obscured for some reason, and will ask the CP for a rescan. The rescanned image (maybe) and text would probably go to the post-processor via email and never enter the DP system. This is another matter that we have recently addressed, and we are working to ensure that, as much as possible, missing pages do go back through the DP system and get merged into their original projects. But we still have ~7K projects for which that wasn't the case.
What percentage of all the projects had one or more missing pages, as you describe?
> It seems to me that there are LOTS of large image archives already out there, and when we are ready to address the issue of making our own, we will learn what we can from how they have handled things like directory structures, etc.
This is a very good point, and it definitely needs to be researched. Do you have a short list of scanned page image archives we should consult?
> What prevents us from going forward are not technical issues, as such, or physical resource issues, but rather human resources in the form of developers, people to organize and regularize the scans, and the overall energy to make it happen. The quickest way from here to there is to get the rest of DP built so that we can then focus our attention on something like an image archive.
This makes sense. But it is also unfortunate, since there is already a large number of completed DP projects, and that number steadily increases day by day. As noted before (and as you are already doing to some extent), it makes sense to tighten the requirements for scan submission and processing, file naming, QC, handling of illustrations, etc. For example, if those submitting the page scans for a project are required to name each image after the corresponding page number (if any), then for most works it is possible to quickly check for missing pages (a rough sketch of such a check appears below), and submitters are more likely to catch missing pages themselves this way. (Since the bottleneck for DP does not appear to be page scan submissions, greatly tightening up the submission requirements for scans makes sense. I believe most people will do their best to meet those requirements. Those who can't, because of hardware limitations or lack of the needed technical skill, can always find someone to do the scanning and submission for them; a contact list of volunteer scanners, a sort of 'Distributed Scanners', could be assembled.)

One crazy and maybe unworkable idea for DP to resolve the page scan issues in the long term is to establish a separate "clearing house" (CH) for page scans, with its own volunteers. In this system, DP would require all scans for a project to be submitted to CH. There the scans would be checked by volunteers, maybe using an online interface not unlike the one used for DP proofing, for quality, missing pages, file naming issues, and the like. The scans could even go through a volunteer-driven clean-up process to normalize them, and even to produce DjVu and PDF versions. CH would also convert the scans to meet the needs of the proofing process. CH could also produce the metadata (including MARC or similar catalog records, so one would try to find librarian volunteers), and even issue the DP identifier. If a scan set passes muster, copyright clearance from PG could then be obtained; since the page scans are online, those doing the clearance will be able to inspect the original page scans in deciding on a clearance. Once a scan set is copyright cleared, it is sent to DP for OCR/proofing, as well as deposited in the scan repository, where it will be flagged as an unfinished project. Once the structured digital text is completed by DP, the flag in the scan repository will be switched to finished. Anyway, just a crazy idea.

In the meanwhile, how to sort through the existing 7000 or so project scan sets and organize them as best as possible needs to be the focus of attention. It is rather like a can of worms; the main decision will be how far the worms are to be untangled. I suspect that, because of the lack of volunteer help, we may only get as far as creating a directory structure based on the DP identifier and simply dumping the existing scans for a DP identifier/project into that directory without any filename changes (or into subdirectories within that directory if there are multiple scan sets), along with a metadata/cataloging file. Even this migration looks like it will require substantial human intervention. Maybe with 3 volunteers, each of whom squeezes in 50 project transfers per week, we'll have this done in about a year (3 volunteers x 50 projects/week x ~47 weeks = ~7000 projects). (Those familiar with how the scans are currently stored may think this migration could go faster; I don't know.)
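Here is the rough sketch mentioned above of such a missing-page check, assuming each scan is named by its page number (e.g. 0042.png); the directory name is a placeholder:

    import re
    from pathlib import Path

    def find_missing_pages(scan_dir):
        """Report gaps in page-numbered scan filenames such as 0042.png."""
        numbers = sorted(
            int(f.stem)
            for f in Path(scan_dir).iterdir()
            if re.fullmatch(r"\d+", f.stem)
        )
        if not numbers:
            return []
        # Any number absent between the first and last scanned page is suspect.
        return sorted(set(range(numbers[0], numbers[-1] + 1)) - set(numbers))

    missing = find_missing_pages("scans/projectID0001")  # placeholder path
    if missing:
        print("Possibly missing pages:", missing)

Of course this only catches gaps in the numbering; unnumbered plates, inserts, and front matter would still need a human eye.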
At a future time, if there is a need, volunteers can revisit each project's scan set and do further QC and normalization, try to locate missing page scans as Juliet noted, and make other improvements, even producing DjVu and/or PDF versions. How does this sound?

Jon