
Jon Noring wrote:
We do have internal identifiers for all "projects" and once a project is posted to PG, the PG identifier is included in the information we keep about each project. The issue here is that there is not always a clear one-to-one mapping between PG books and DP projects.... [snip]
What's the nature/structure of the DP identifiers, and for each identifier what metadata is collected? Is the metadata machine-processible (such as in XML), or is it simply a written, human-readable-only summary of the book being digitized plus other project info?
DP works off a database. The projectid is a unique identifier (10 hex digits), intended only for internal, non-human use. It serves both to organize the data associated with the project in the db and to name the system/working directory that holds the scans and other info not in the db. We keep information relevant to our production process in the database, including things like title, author, genre (informally assigned), language, who scanned it, the names of the project manager and post-processor, etc. When the project is created, a small file with Dublin Core information is also created and lives in the working directory. At the moment that file is not used for anything. All of this information is kept when the project is archived: the contents of the working directory move off our server onto the one at TIA, along with most of the info in the db (e.g., the text from all rounds). We retain some project info in our production database for record-keeping purposes.
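For concreteness, the two artifacts described above might be sketched in Python roughly as follows. The identifier-generation scheme, the XML shape, and the Dublin Core field set are all assumptions for illustration, not the actual DP implementation:

```python
import secrets
import xml.etree.ElementTree as ET

def new_projectid() -> str:
    """Generate a 10-hex-digit identifier like a DP projectid.
    (The real DP generation scheme is not specified; this is illustrative.)"""
    return secrets.token_hex(5)  # 5 random bytes -> 10 hex digits

def dublin_core_stub(title: str, creator: str, language: str) -> str:
    """Build a minimal Dublin Core record like the small file created in
    each project's working directory (field set and XML form assumed)."""
    ns = "http://purl.org/dc/elements/1.1/"
    ET.register_namespace("dc", ns)
    root = ET.Element("metadata")
    for tag, value in [("title", title), ("creator", creator),
                       ("language", language)]:
        ET.SubElement(root, f"{{{ns}}}{tag}").text = value
    return ET.tostring(root, encoding="unicode")
```

A projectid like this would then key both the database rows and the working directory name, while the Dublin Core stub travels with the scans when the project is archived.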
A final complication has to do with mis-scanned or missing pages. These are often handled at the end of the process and usually have not made it into our archiving. A post-processor will find that text has been cut off, or is obscured for some reason, and will ask the CP for a rescan. The rescanned image (maybe) and text would typically go to the post-processor via email and never enter the DP system. This is another matter that we have recently addressed, and we are working to be sure that, as much as possible, missing pages do go back through the DP system and get merged into their original projects. But we still have ~7K projects for which that wasn't the case.
What percentage of all the projects had one or more missing pages as you mention?
Probably a relatively low percentage. But they end up taking far more time than they should.
It seems to me that there are LOTS of large image archives already out there, and when we are ready to address the issue of making our own, we will learn what we can from how they have handled things like directory structures, etc.
This is a very good point, and definitely needs to be researched.
Do you have a short list of scanned page image archives that we should consult with?
I refer you to the very long list in a post at the top of Content Providers forum at DP.
One crazy and maybe unworkable idea for DP to resolve the page scan issues in the long term is to establish a separate "clearing house" (CH) for page scans, with its own volunteers. In this system, DP would require all scans for a project to be submitted to CH. There the scans would be checked by its volunteers, perhaps using an online interface not unlike that used for DP proofing, for quality, missing pages, file name issues, and the like. The scans could even go through a volunteer-driven clean-up process to normalize them and produce DjVu and PDF versions, and CH would also convert them for the needs of the proofing process. CH could also produce the metadata (including MARC or similar catalog records, so one would try to find librarian volunteers) and even issue the DP identifier. If the scan set passes muster, copyright clearance from PG could then be obtained; since the page scans are online, those doing the clearance would be able to inspect the original page scans to decide on a clearance. Once a scan set is copyright cleared, it would be sent to DP for OCR/proofing and deposited in the scan repository, where it would be flagged as an unfinished project. Once the structured digital text is completed by DP, the flag in the scan repository would be switched to finished.
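The scan-set lifecycle in this proposal could be modeled, very roughly, as a small state machine. Every state and transition name below is hypothetical; the sketch only illustrates the submitted → checked → cleared → proofing → finished flow, including the "unfinished"/"finished" repository flag at the end:

```python
from enum import Enum, auto

class ScanSetState(Enum):
    """Hypothetical lifecycle stages for a scan set in the proposed CH."""
    SUBMITTED = auto()        # CP uploads scans to the clearing house
    QUALITY_CHECKED = auto()  # volunteers verify quality, completeness, file names
    CLEARED = auto()          # PG copyright clearance granted from the online scans
    PROOFING = auto()         # sent to DP; repository copy flagged "unfinished"
    FINISHED = auto()         # structured text complete; flag flipped to "finished"

# Allowed transitions in the sketched pipeline; a failed quality check
# sends the set back to SUBMITTED for rescans or renaming.
TRANSITIONS = {
    ScanSetState.SUBMITTED: {ScanSetState.QUALITY_CHECKED},
    ScanSetState.QUALITY_CHECKED: {ScanSetState.CLEARED, ScanSetState.SUBMITTED},
    ScanSetState.CLEARED: {ScanSetState.PROOFING},
    ScanSetState.PROOFING: {ScanSetState.FINISHED},
    ScanSetState.FINISHED: set(),
}

def advance(state: ScanSetState, nxt: ScanSetState) -> ScanSetState:
    """Move a scan set to its next stage, rejecting illegal jumps."""
    if nxt not in TRANSITIONS[state]:
        raise ValueError(f"illegal transition {state.name} -> {nxt.name}")
    return nxt
```

The point of making the transitions explicit is the one the proposal relies on: a scan set cannot reach DP proofing without first passing the quality check and copyright clearance.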
Congratulations! You just discovered for yourself the next major addition to DP. We have been planning all along to have something we call "metadata collection". This will most likely be two rounds. The first round will collect whatever project-level metadata we decide is important and that can be derived directly from the scans; it will also check that all pages are present and that the scans are legible. The second round will look at each page, noting formatting features such as ToC, index, tables, poems, block quotes, musical notation, mathematical or chemical equations, etc. This information will be used to route pages to "specialist" rounds for things like musical notation markup, math markup, tables, and indexes. It will also be used as a quality check on the final formatting ("The page metadata says there should be 2 footnotes. Why is there only markup for one?"). I hope that we will also be able to send illustration scans through a separate production path as part of this process. We don't currently have any plans to do anything special with the scans themselves, other than being certain that they are all there and legible. There are a LOT of open questions about how we will implement this, but the basic idea has been in Charlz' plan from the beginning. JulietS
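The footnote example of that quality check could look something like the sketch below. The metadata key and the `[Footnote` markup convention are assumptions made for illustration, not a specification of the planned DP tooling:

```python
import re

def check_footnotes(page_meta: dict, formatted_text: str) -> list[str]:
    """Compare a count recorded in (hypothetical) page metadata against
    markup actually present in the formatted text, in the spirit of:
    'The page metadata says there should be 2 footnotes. Why is there
    only markup for one?'"""
    problems = []
    expected = page_meta.get("footnotes", 0)
    # Count footnote openings in the formatted page text.
    found = len(re.findall(r"\[Footnote", formatted_text))
    if found != expected:
        problems.append(f"expected {expected} footnotes, found markup for {found}")
    return problems
```

The same pattern would extend to any feature the second metadata round records per page: tables, poems, block quotes, and so on, each checked against the markup the formatters produced.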