
Jon Noring wrote:
Now, regarding Branko's comment that Brewster has disk space to burn (implying he doesn't care about the details of what we intend to do), this is NOT true. I've had several personal discussions with Brewster regarding other matters, and I know he will be concerned about the breadth/depth/future of what we have in mind, the public access issues, the format, etc. He may also be concerned how it will relate to his current book scanning project in Canada (not in a competitive sense, but in a compatibility sort-of-way). Some of his questions and concerns we cannot predict in advance. But the more we understand what we want to do (and what we don't want to do), the easier it will be to discuss this with him, hopefully resulting in his offer to host the PG/DP scan repository.
DP is archiving projects on one of the Internet Archive servers already, with their knowledge. Storage space is not an issue in that sense. However, from the PG perspective, issues having to do with whether the scans are mirrored, and if so, how much of a load that will put on the mirrors, become important. I don't have answers to those questions; I just point them out.
Probably the #1 problem as I see it is to assure that scan sets don't become "disassociated" from PG/DP. That is, there needs to be sufficient metadata so each scan set can be attached to a particular DP-produced digital text, and from that, attached to the associated PG text.
So, does DP keep an internal identifier and associated metadata for each project it works on? And will it be pretty easy to associate a scan set with that DP identifier, or will it require some human intervention to make the association? And does the DP identifier correlate to a PG text identifier?
We do have internal identifiers for all "projects" and once a project is posted to PG, the PG identifier is included in the information we keep about each project. The issue here is that there is not always a clear one-to-one mapping between PG books and DP projects. Historically several projects could make up one PG book, the book having been split up for internal DP reasons. Also historically there are DP projects that ended up being posted to PG as several separate pieces. I say historically because we have recently started to make a concerted effort to be sure that our "projects", as archived, do match up to the books as posted. This often means merging or splitting DP projects. We are doing this for the obvious reason of trying to make the whole image archive project easier when we get around to it. But there are lots of old projects that need to be sorted out.
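To make the mapping problem concrete, here is a minimal sketch of a many-to-many index between DP projects and PG books. It is not how DP actually stores this information: the class name, the method names, and every identifier in the example are hypothetical, chosen only to illustrate the split and merged cases described above.

```python
from collections import defaultdict


class ProjectBookIndex:
    """Hypothetical many-to-many index between DP projects and PG books."""

    def __init__(self):
        self._books_by_project = defaultdict(set)
        self._projects_by_book = defaultdict(set)

    def link(self, dp_project_id, pg_book_id):
        """Record that a DP project contributed to a PG posting."""
        self._books_by_project[dp_project_id].add(pg_book_id)
        self._projects_by_book[pg_book_id].add(dp_project_id)

    def books_for(self, dp_project_id):
        return sorted(self._books_by_project.get(dp_project_id, ()))

    def projects_for(self, pg_book_id):
        return sorted(self._projects_by_book.get(pg_book_id, ()))

    def needs_sorting_out(self):
        """DP projects that are not in a clean one-to-one match with a PG book."""
        messy = set()
        for project, books in self._books_by_project.items():
            if len(books) != 1:
                messy.add(project)  # one project posted as several pieces
        for book, projects in self._projects_by_book.items():
            if len(projects) != 1:
                messy.update(projects)  # several projects merged into one book
        return sorted(messy)


# All identifiers below are made up for illustration.
index = ProjectBookIndex()
index.link("proj_a_part1", 10001)  # one book split into two DP projects
index.link("proj_a_part2", 10001)
index.link("proj_b", 10002)        # one DP project posted as two PG items
index.link("proj_b", 10003)
index.link("proj_c", 10004)        # clean one-to-one case

print(index.projects_for(10001))
print(index.needs_sorting_out())
```

Anything returned by `needs_sorting_out()` would be a candidate for the merging or splitting described above; a project that matches exactly one PG book, which in turn matches only that project, is left alone.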
Another issue, which affects directory design, metadata and access, is serialization. Some Works were serialized (such as multiple parts in periodicals), which are combined into one digital text. How do we handle/organize the scan sets for this? There are no doubt other odd "exceptions" we will have to handle.
Actually, all the cases that I can think of where we had one DP project that got posted to PG as multiple "books" are periodicals. We have not yet addressed the issue of posting a "book" to PG that is derived from multiple issues of a periodical, although I'm hoping that that will happen eventually. We have Alcott's Under the Lilacs from St. Nicholas Magazine (in 1878) going through now. Not all the issues have been proofed yet, but when they are, I'm hoping that someone will pull out the entire book, with its illustrations, and post it to PG as a separate item, independent from the posting of the individual issues.
Another possible issue regards illustrations. Am I right that some scan projects scan the whole book at one resolution, then return and do the pages with illustrations at a higher resolution and crop to the illustrations? This may have an impact on directory structure and metadata fields.
Yes, you are correct about illustrations. It is my standard practice, and that of many other content providers (CPs), to make one set of scans for proofing, and then to rescan the illustrations separately. This fits into our workflow very well, but does make it harder to make an image archive that includes those illustrations. Also, for a good many books when we first started to make html versions, the illustrations were handled directly between the content provider and the post-processor and never made it into our image/archiving system at all. So the illustrations will be available from the PG html version, but DP may not have copies of them. Illustrations are also tricky because what is used for an illustrated html version can be quite different, in terms of dpi, size, etc., from what was originally provided. I provide illustrations at either 400 or 600 dpi, full-size, with a bare minimum of processing (I've recently started rotating illustrations so that they are straight rather than crooked, for example), and those get archived with the project scans. But some CPs make the illustrations ready for the html version from the start and never upload the large, original scans. Getting good scans of illustrations is tricky. Getting the original scans into our system is also tricky, often because of the sizes involved. While DP does have a very strict naming convention for proofing images and text, there is no consistent naming convention for illustrations. Most CPs have developed their own conventions and stick to them. But someone is going to have fun standardizing everything eventually.
A final complication has to do with mis-scanned or missing pages. These are often handled at the end of the process and have usually not made it into our archiving process. A post-processor will find that text has been cut off, or is obscured for some reason, and will ask the CP for a rescan. The rescanned image (maybe) and text would probably go to the post-processor via email and never enter the DP system. This is another matter that we have recently addressed and we are working to be sure that, as much as possible, missing pages do go back through the DP system and get merged into their original projects. But we still have ~7K projects for which that wasn't the case.
So, DP is gradually moving in the direction of support for an image archive. We do always have in mind that it will happen eventually and we've been slowly changing our systems to make such a thing easier in the long run. It seems to me that there are LOTS of large image archives already out there, and when we are ready to address the issue of making our own, we will learn what we can from how they have handled things like directory structures, etc. What prevents us from going forward is not technical issues, as such, or physical resource issues, but rather human resources in the form of developers, people to organize and regularize the scans, and the overall energy to make it happen. The quickest way from here to there is to get the rest of DP built so that we can then focus our attention on something like an image archive.
JulietS