[gutvol-d] Some specific thoughts on the scan repository (directory, format, etc.)

Thanks to Branko (public message) and Jon Niehof (private message, I believe) for their quick replies to my initial set of questions/concerns/thoughts regarding the scan repository, which was posted earlier today. It is interesting that the answers differed in several ways, which I won't reply to in the usual manner. But it does show that there will not be uniformity of thought on all the details of the scan repository. Filtering through both messages -- and despite the angst I expressed in an earlier message today agreeing with what Bowerbird had to say on this topic ("what should be") -- it is clear that we can and should go ahead with a "first generation" scan repository focused primarily on centralizing the preservation of the scans submitted to DP, and on making them publicly available pretty much "as is". (Expansion to allow submissions of scan sets outside of DP activities can be considered once the DP scan repository is up and running, and mostly debugged. We will certainly learn a lot from it. Thus my go-slow preference.)

Now, regarding Branko's comment that Brewster has disk space to burn (implying he doesn't care about the details of what we intend to do), this is NOT true. I've had several personal discussions with Brewster regarding other matters, and I know he will be concerned about the breadth/depth/future of what we have in mind, the public access issues, the format, etc. He may also be concerned how it will relate to his current book scanning project in Canada (not in a competitive sense, but in a compatibility sort-of-way). Some of his questions and concerns we cannot predict in advance. But the more we understand what we want to do (and what we don't want to do), the easier it will be to discuss this with him, hopefully resulting in his offer to host the PG/DP scan repository.

Probably the #1 problem as I see it is to assure that scan sets don't become "disassociated" from PG/DP. That is, there needs to be sufficient metadata so each scan set can be attached to a particular DP-produced digital text, and from that, attached to the associated PG text. So, does DP keep an internal identifier and associated metadata for each project it works on? And will it be pretty easy to associate a scan set with that DP identifier, or will it require some human intervention to make the association? And does the DP identifier correlate to a PG text identifier?

Another issue concerns the directory structure of the repository which will hold the scan sets. It has to take into account that there may be not only the original scan set which was submitted to DP, but also derivative sets produced by some processing (of course there's the 120 dpi set used for the DP proofing interface). For the DP scans, I tend to think the DP identifier should be used as part of the directory name holding the scan sets associated with that identifier.

Another issue is "zip vs. individual files". My view leans toward keeping each page scan image a separate file, so end-users need not download a whole gigabyte-size (or larger) ZIP file just to look at one page (some books may have scan sets this large).

Another issue, which affects directory design, metadata and access, is serialization. Some works were serialized (such as multiple parts in periodicals), which are combined into one digital text. How do we handle/organize the scan sets for this? There are no doubt other odd "exceptions" we will have to handle.
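To make the directory questions above concrete, here is one hypothetical layout keyed on the DP project identifier (the identifier, subdirectory names, and file names are illustrative assumptions, not an agreed convention):

    scans/
      0ab1cd2ef3/                # one directory per DP project identifier
        metadata.xml             # catalog record, including the associated PG text number
        original/                # scans exactly as submitted to DP, losslessly preserved
          001.png
          002.png
        proofing-120dpi/         # derivative set used by the DP proofing interface
          001.png
        illustrations/           # higher-resolution rescans of illustrated pages
          045-illus1.png

A serialized work might add one subdirectory per installment under the same identifier, keeping the whole set attached to the single digital text.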
Format is another issue. Branko mentioned that DjVu is proprietary (or essentially so). Yet Brewster was, last I talked with him about scan formats last year, enamored with it, and was using it for his scanning projects to greatly conserve disk space (he *is* concerned about the cost of storing book scans, thus his interest in DjVu). If Brewster is still using DjVu, I can *guarantee* he will mention it to us -- we need an answer if we don't want to use DjVu compression. The "proprietary" issue will not sway him if he is using it. We may need to look at alternative "lossy" formats. (I detest JPEG for compressing scanned images because of the significant artifacts it creates -- there must be something better which is also an open standard?) Obviously, I'd prefer to store the scan sets (and derivatives) in a format which preserves, in a lossless sense, the scans as they were submitted to DP. But we need not be wedded to keeping the identical format they were submitted in, so long as they are losslessly the same.

Another issue regards end-user compatibility. Web browsers are usually not TIFF- (nor DjVu-) ready without a plugin, while all contemporary browsers will handle PNG and JPG. When it comes to lossless compression, some of the CCITT protocols (as used in a TIFF encapsulation) for bitonal images are significantly better than PNG (PNG is designed for general lossless compression of all kinds of images). Of course, there is the option that if a "third party" interface is developed to access the repository, image conversion on the fly could be employed to translate from the repository format to most any format the end-user can handle, even if the source is in some odd format.

Another possible issue regards illustrations. Am I right that some scan projects scan the whole book at one resolution, then return and do the pages with illustrations at a higher resolution and crop to the illustrations? This may have an impact on directory structure and metadata fields.

Anyway, that's enough for now. Comments?

Jon
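As a sketch of the on-the-fly conversion idea: a repository front end could losslessly re-encode an archived CCITT-compressed TIFF page as a browser-friendly PNG at request time. A minimal example in Python using the Pillow imaging library (the file path is a placeholder, and this assumes Pillow's TIFF support for reading Group 3/4 bitonal images):

    from io import BytesIO
    from PIL import Image

    def page_as_png(tiff_path):
        """Return the page scan re-encoded as PNG bytes for an HTTP response."""
        with Image.open(tiff_path) as img:
            buf = BytesIO()
            img.save(buf, format="PNG")  # PNG is lossless, so no image data is altered
            return buf.getvalue()

    # Hypothetical usage inside a repository web interface:
    # png_bytes = page_as_png("scans/0ab1cd2ef3/original/001.tif")

This would let the archive keep the best lossless format internally while serving whatever the end-user's browser handles.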

Jon Noring wrote:
So, does DP keep an internal identifier and associated metadata for each project it works on?
Yup.
And will it be pretty easy to associate a scan set with that DP identifier, or will it require some human intervention to make the association?
If the scan set knows its DP project id, then the association is there. If it doesn't, then someone will have to find out what it is. I'm not sure I understand the question.
And does the DP identifier correlate to a PG text identifier?
The DP database maintains the relation. Mostly it's one-to-one, sometimes many-to-one, and rarely one-to-many or many-to-many. The DP database can handle the first two. -Michael

Michael Dyck wrote:
Jon Noring wrote:
So, does DP keep an internal identifier and associated metadata for each project it works on?
Yup.
Well, as I asked in replying to Juliet's message, is the identifier simply a sequential integer, or does it contain some metadata? Also, what metadata is recorded for each project, and is it in some normalized (machine-readable) form, such as Dublin Core?
And will it be pretty easy to associate a scan set with that DP identifier, or will it require some human intervention to make the association?
If the scan set knows its DP project id, then the association is there. If it doesn't, then someone will have to find out what it is. I'm not sure I understand the question.
I apologize for not being more precise in my question. How many scan sets (or maybe what percentage of scan sets) will require a human being to intervene to determine the associated DP project ID? Intervention includes actually looking at the title page scan and then using that information to manually look up which DP project it is associated with.
And does the DP identifier correlate to a PG text identifier?
The DP database maintains the relation. Mostly it's one-to-one, sometimes many-to-one, and rarely one-to-many or many-to-many. The DP database can handle the first two.
Juliet went into even gorier detail on this! But despite the complexity of the PG <--> DP id mappings, once a DP project ID is associated with a particular scan set, then the association to a PG text is possible to machine trace. Am I right on this? Thanks. Jon

Jon Noring wrote:
Well, as I asked in replying to Juliet's message, is the identifier simply a sequential integer, or does it contain some metadata?
Also, what metadata is recorded for each project, and is it in some normalized (machine-readable) form, such as Dublin Core?
And I answered in my reply to that message.
Juliet went into even gorier detail on this! But despite the complexity of the PG <--> DP id mappings, once a DP project ID is associated with a particular scan set, then the association to a PG text is possible to machine trace. Am I right on this?
No, it isn't necessarily possible to machine trace. We can trace those mappings that we represent in our database, but our current structure doesn't allow for all possible mappings, in particular one-to-many (DP to PG) and many-to-many. These are relatively rare, but they do occur. The information is there for most cases. But the exceptions will be what slows down whatever we decide to do. As I've said, we've had the image archive in mind all along. But the metadata rounds, specialist rounds, improved proofing/formatting interface, and finding or developing tools to support creation of PGTEI texts are all higher priorities at the moment. JulietS

No, it isn't necessarily possible to machine trace. We can trace those mappings that we represent in our database, but our current structure doesn't allow for all possible mappings, in particular one-to-many (DP to PG) and many-to-many. These are relatively rare, but they do occur.
If it were possible to determine (algorithmically) which projects have a one-to-one mapping, we've captured most of the utility, even if the more complicated cases have to be saved for a later headache. -- RS
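A sketch of that algorithmic filter, assuming the DP-to-PG relation can be exported as (dp_id, pg_id) pairs (the pair data and function name are hypothetical):

    from collections import Counter

    def one_to_one(pairs):
        """Keep only pairs whose DP id and PG id each appear exactly once."""
        pairs = list(pairs)
        dp_count = Counter(dp for dp, _ in pairs)
        pg_count = Counter(pg for _, pg in pairs)
        return [(dp, pg) for dp, pg in pairs
                if dp_count[dp] == 1 and pg_count[pg] == 1]

    # Two DP projects mapping to PG text 22222 (many-to-one) are excluded:
    print(one_to_one([("0ab1cd2ef3", 11111),
                      ("1bc2de3f40", 22222), ("2cd3ef4a51", 22222)]))
    # -> [('0ab1cd2ef3', 11111)]

Everything the filter passes could be migrated mechanically; everything it rejects goes on the later-headache pile RS describes.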

Jon Noring wrote:
Now, regarding Branko's comment that Brewster has disk space to burn (implying he doesn't care about the details of what we intend to do), this is NOT true. I've had several personal discussions with Brewster regarding other matters, and I know he will be concerned about the breadth/depth/future of what we have in mind, the public access issues, the format, etc. He may also be concerned how it will relate to his current book scanning project in Canada (not in a competitive sense, but in a compatibility sort-of-way). Some of his questions and concerns we cannot predict in advance. But the more we understand what we want to do (and what we don't want to do), the easier it will be to discuss this with him, hopefully resulting in his offer to host the PG/DP scan repository.
DP is archiving projects on one of the Internet Archive servers already, with their knowledge. Storage space is not an issue in that sense. However, from the PG perspective, issues having to do with whether the scans are mirrored, and if so, how much of a load that will put on the mirrors, become important. I don't have answers to those questions, I just point them out.
Probably the #1 problem as I see it is to assure that scan sets don't become "disassociated" from PG/DP. That is, there needs to be sufficient metadata so each scan set can be attached to a particular DP-produced digital text, and from that, attached to the associated PG text.
So, does DP keep an internal identifier and associated metadata for each project it works on? And will it be pretty easy to associate a scan set with that DP identifier, or will it require some human intervention to make the association? And does the DP identifier correlate to a PG text identifier?
We do have internal identifiers for all "projects" and once a project is posted to PG, the PG identifier is included in the information we keep about each project. The issue here is that there is not always a clear one-to-one mapping between PG books and DP projects. Historically several projects could make up one PG book, the book having been split up for internal DP reasons. Also historically there are DP projects that ended up being posted to PG as several separate pieces. I say historically because we have recently started to make a concerted effort to be sure that our "projects", as archived, do match up to the books as posted. This often means merging or splitting DP projects. We are doing this for the obvious reason of trying to make the whole image archive project easier when we get around to it. But there are lots of old projects that need to be sorted out.
Another issue, which affects directory design, metadata and access, is serialization. Some works were serialized (such as multiple parts in periodicals), which are combined into one digital text. How do we handle/organize the scan sets for this? There are no doubt other odd "exceptions" we will have to handle.
Actually, all the cases that I can think of where we had one DP project that got posted to PG as multiple "books" are periodicals. We have not yet addressed the issue of posting a "book" to PG that is derived from multiple issues of a periodical, although I'm hoping that will happen eventually. We have Alcott's Under the Lilacs from St. Nicholas Magazine (in 1878) going through now. Not all the issues have been proofed yet, but when they are, I'm hoping that someone will pull out the entire book, with its illustrations, and post it to PG as a separate item, independent of the posting of the individual issues.
Another possible issue regards illustrations. Am I right that some scan projects scan the whole book at one resolution, then return and do the pages with illustrations at a higher resolution and crop to the illustrations? This may have an impact on directory structure and metadata fields.
Yes, you are correct about illustrations. It is my standard practice, and that of many other content providers (CPs), to make one set of scans for proofing, and then to rescan the illustrations separately. This fits into our workflow very well, but does make it harder to build an image archive that includes those illustrations. Also, for a good many books when we first started to make html versions, the illustrations were handled directly between the content provider and the post-processor and never made it into our image/archiving system at all. So the illustrations will be available from the PG html version, but DP may not have copies of them.

Illustrations are also tricky because what is used for an illustrated html version can be quite different, in terms of dpi, size, etc., from what was originally provided. I provide illustrations at either 400 or 600 dpi, full-size, with a bare minimum of processing (I've recently started rotating illustrations so that they are straight rather than crooked, for example), and those get archived with the project scans. But some CPs make the illustrations ready for the html version from the start and never upload the large, original scans. Getting good scans of illustrations is tricky. Getting the original scans into our system is also tricky, often because of the sizes involved. While DP does have a very strict naming convention for proofing images and text, there is no consistent naming convention for illustrations. Most CPs have developed their own conventions and stick to them. But someone is going to have fun standardizing everything eventually.

A final complication has to do with mis-scanned or missing pages. These are often handled at the end of the process and have usually not made it into our archiving process. A post-processor will find that text has been cut off, or is obscured for some reason, and will ask the CP for a rescan. The rescanned image (maybe) and text would probably go to the post-processor via email and never enter the DP system. This is another matter that we have recently addressed, and we are working to be sure that, as much as possible, missing pages do go back through the DP system and get merged into their original projects. But we still have ~7K projects for which that wasn't the case.

So, DP is gradually moving in the direction of support for an image archive. We do always have in mind that it will happen eventually, and we've been slowly changing our systems to try to make building such a thing easier in the long run. It seems to me that there are LOTS of large image archives already out there, and when we are ready to address the issue of making our own, we will learn what we can from how they have handled things like directory structures, etc. What prevents us from going forward are not technical issues, as such, or physical resource issues, but rather human resources in the form of developers, people to organize and regularize the scans, and the overall energy to make it happen. The quickest way from here to there is to get the rest of DP built so that we can then focus our attention on something like an image archive. JulietS
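Since illustration naming is the unstandardized piece, here is a sketch of what a standard convention and checker might look like (the pattern below is purely a hypothetical proposal, not DP's actual practice):

    import re

    # Hypothetical convention: <3-digit page number>-illus<counter>.<ext>,
    # e.g. "045-illus1.png" for the first illustration on page 45.
    ILLUS_NAME = re.compile(r"^\d{3}-illus\d+\.(png|jpg|tif)$")

    def nonconforming(filenames):
        """Return illustration filenames that do not follow the convention."""
        return [f for f in filenames if not ILLUS_NAME.match(f)]

    print(nonconforming(["045-illus1.png", "frontispiece.jpg"]))
    # -> ['frontispiece.jpg']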

Juliet wrote:
Jon Noring wrote:
Thanks for your detailed reply, Juliet. Some questions/comments on a couple items:
Now, regarding Branko's comment that Brewster has disk space to burn (implying he doesn't care about the details of what we intend to do), this is NOT true. I've had several personal discussions with Brewster regarding other matters, and I know he will be concerned about the breadth/depth/future of what we have in mind, the public access issues, the format, etc. He may also be concerned how it will relate to his current book scanning project in Canada (not in a competitive sense, but in a compatibility sort-of-way). Some of his questions and concerns we cannot predict in advance. But the more we understand what we want to do (and what we don't want to do), the easier it will be to discuss this with him, hopefully resulting in his offer to host the PG/DP scan repository.
DP is archiving projects on one of the Internet Archive servers already, with their knowledge. Storage space is not an issue in that sense. However, from the PG perspective, issues having to do with whether the scans are mirrored, and if so, how much of a load that will put on the mirrors, become important. I don't have answers to those questions, I just point them out.
I appreciate this. I suspect that for the present at least, because of the large footprint of original page scans, it is impractical for PG itself to archive (and mirror) the original page scans alongside the digital text versions. Other than linking to the page scans at the central scan repository, the only thing that might make sense for PG to do in the short term would be to distribute whole-book DjVu or PDF encapsulations of the page scans. But this requires a lot of work, on a book-by-book basis, to assure that a scan set is complete and that the final product meets some sort of minimum quality standard.
We do have internal identifiers for all "projects" and once a project is posted to PG, the PG identifier is included in the information we keep about each project. The issue here is that there is not always a clear one-to-one mapping between PG books and DP projects.... [snip]
What's the nature/structure of the DP identifiers, and for each identifier what metadata is collected? Is the metadata machine-processible (such as in XML), or is it simply a written, human-readable-only summary of the book being digitized plus other project info?
A final complication has to do with mis-scanned or missing pages. These are often handled at the end of the process and have usually not made it into our archiving process. A post-processor will find that text has been cut off, or is obscured for some reason, and will ask the CP for a rescan. The rescanned image (maybe) and text would probably go to the post-processor via email and never enter the DP system. This is another matter that we have recently addressed and we are working to be sure that, as much as possible, missing pages do go back through the DP system and get merged into their original projects. But we still have ~7K projects for which that wasn't the case.
What percentage of all the projects had one or more missing pages as you mention?
It seems to me that there are LOTS of large image archives already out there, and when we are ready to address the issue of making our own, we will learn what we can from how they have handled things like directory structures, etc.
This is a very good point, and definitely needs to be researched. Do you have a short list of scanned page image archives that we should consult with?
What prevents us from going forward are not technical issues, as such, or physical resource issues, but rather human resources in the form of developers, people to organize and regularize the scans, and the overall energy to make it happen. The quickest way from here to there is to get the rest of DP built so that we can then focus our attention on something like an image archive.
This makes sense. But it is also unfortunate, since there is already a large number of completed DP projects, and day by day that number steadily increases.

As noted before (and as you are already doing to a certain extent), tightening the requirements for scan submissions and processing, file naming, QC, handling illustrations, etc., makes sense to implement. For example, if one requires those submitting the page scans for a project to name each image corresponding to the page number (if any), then for most works it is possible to quickly check for missing pages -- those submitting scans are more likely to catch missing pages this way (a sketch of such a check follows below). (Since it appears the bottleneck for DP is not page scan submissions, greatly tightening up submission requirements for scans makes sense. I believe most people will do their best to meet those requirements. Those who can't, because of hardware limitations or lack of the needed technical skill, can always find someone to do the scanning and submission for them -- a contact list of volunteer scanners, a sort of 'Distributed Scanners', could be assembled.)

One crazy and maybe unworkable idea for DP to resolve the page scan issues in the long term is to establish a separate "clearing house" (CH) for page scans, which would have its own volunteers. In this system, DP would require all scans for a project to be submitted to CH. In CH the scans would be checked by its volunteers, maybe using some sort of online interface not unlike that used for DP proofing, for quality, missing pages, file name issues, and the like. The scans could even go through a volunteer-driven clean-up process to normalize them, and even to produce DjVu and PDF versions. CH would also convert them for the needs of the proofing process. CH could also produce the metadata (including MARC or similar catalog records -- thus one would try to find librarian volunteers), and even issue the DP identifier. If the scan set passes muster, copyright clearance from PG could then be obtained -- since the page scans are online, those doing the clearance will be able to inspect the original page scans to decide on a clearance. Once a scan set is copyright cleared, it is sent to DP for OCR/proofing, as well as deposited in the scan repository, where it will be flagged as an unfinished project. Once the structured digital text is completed by DP, the flag in the scan repository will be switched to finished. Anyway, just a crazy idea.

In the meanwhile, how to sort through the existing 7000 or so project scans and organize them as best as possible needs to be the focus of attention. It is sort of like a can of worms -- the main decision will be how much the worms will be untangled. I suspect, because of the lack of volunteer help, we may only get as far as creating a directory structure based on the DP identifier, and simply dumping the existing scans for a DP identifier/project into that directory without any filename changes (or into subdirectories of that directory if we have multiple scan sets), along with a metadata/cataloging file. Even here, it looks like the migration will require substantial human intervention. Maybe with 3 volunteers, each of whom squeezes in 50 project transfers/week, in one year we'll have this done. (Now, those familiar with how the scans are currently stored may think this migration could go faster -- I don't know.)
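A minimal sketch of the missing-page check described above, assuming scans are named by page number (the naming pattern and extensions are assumptions):

    import re

    def missing_pages(filenames):
        """Report gaps in page-number-named scans such as '001.png'."""
        pages = sorted(int(m.group(1)) for f in filenames
                       if (m := re.match(r"(\d+)\.(png|tif|jpg)$", f)))
        if not pages:
            return []
        have = set(pages)
        return [n for n in range(pages[0], pages[-1] + 1) if n not in have]

    print(missing_pages(["001.png", "002.png", "004.png"]))  # -> [3]

A submitter (or the hypothetical clearing house) could run this before upload, catching most gaps without any human page-by-page inspection.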
At a future time, if there is a need, volunteers can revisit each project scan set and do further QC and normalization, try to locate missing page scans as Juliet noted, and make other improvements, even producing DjVu and/or PDF versions. How does this sound? Jon

Jon Noring wrote:
We do have internal identifiers for all "projects" and once a project is posted to PG, the PG identifier is included in the information we keep about each project. The issue here is that there is not always a clear one-to-one mapping between PG books and DP projects.... [snip]
What's the nature/structure of the DP identifiers, and for each identifier what metadata is collected? Is the metadata machine-processible (such as in XML), or is it simply a written, human-readable-only summary of the book being digitized plus other project info?
DP works off a database. The projectid is a unique identifier (10 hex digits), intended only for internal, non-human use, which serves both to organize the data associated with the project in the db and to identify the system/working directory that holds the scans and other info that is not in the db. We keep information relevant to our production process in the database, including things like title, author, genre (informally assigned), language, who scanned it, the name of the project manager and postprocessor, etc. At the time the project is created, a small file with Dublin Core information is also created and lives in the working directory. At the moment that file is not used for anything. All of this information is kept when the project is archived. The contents of the working directory move off our server and onto the one at TIA, as does most of the info in the db (e.g., the text from all rounds). We retain some project info in our production database for record-keeping purposes.
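A small sketch of what that identifier implies for tooling, following the 10-hex-digit description above (the directory root and function are hypothetical placeholders):

    import re

    PROJECT_ID = re.compile(r"^[0-9a-f]{10}$")  # 10 hex digits, per the description above

    def working_directory(projectid, root="/path/to/projects"):
        """Map a DP project id to its working directory; reject malformed ids."""
        if not PROJECT_ID.match(projectid):
            raise ValueError("not a DP project id: " + repr(projectid))
        return root + "/" + projectid

    print(working_directory("0ab1cd2ef3"))  # -> /path/to/projects/0ab1cd2ef3

Because the id is opaque and fixed-width, a repository can derive directory names from it mechanically, with validation as the only failure mode.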
A final complication has to do with mis-scanned or missing pages. These are often handled at the end of the process and have usually not made it into our archiving process. A post-processor will find that text has been cut off, or is obscured for some reason, and will ask the CP for a rescan. The rescanned image (maybe) and text would probably go to the post-processor via email and never enter the DP system. This is another matter that we have recently addressed and we are working to be sure that, as much as possible, missing pages do go back through the DP system and get merged into their original projects. But we still have ~7K projects for which that wasn't the case.
What percentage of all the projects had one or more missing pages as you mention?
Probably a relatively low percentage. But they end up taking far more time than they should.
It seems to me that there are LOTS of large image archives already out there, and when we are ready to address the issue of making our own, we will learn what we can from how they have handled things like directory structures, etc.
This is a very good point, and definitely needs to be researched.
Do you have a short list of scanned page image archives that we should consult with?
I refer you to the very long list in a post at the top of the Content Providers forum at DP.
One crazy and maybe unworkable idea for DP to resolve the page scan issues in the long term is to establish a separate "clearing house" (CH) for page scans, which would have its own volunteers. In this system, DP would require all scans for a project to be submitted to CH. In CH the scans would be checked by its volunteers, maybe using some sort of online interface not unlike that used for DP proofing, for quality, missing pages, file name issues, and the like. The scans could even go through a volunteer-driven clean-up process to normalize them, and even to produce DjVu and PDF versions. CH would also convert them for the needs of the proofing process. CH could also produce the metadata (including MARC or similar catalog records -- thus one would try to find librarian volunteers), and even issue the DP identifier. If the scan set passes muster, copyright clearance from PG could then be obtained -- since the page scans are online, those doing the clearance will be able to inspect the original page scans to decide on a clearance. Once a scan set is copyright cleared, it is sent to DP for OCR/proofing, as well as deposited in the scan repository, where it will be flagged as an unfinished project. Once the structured digital text is completed by DP, the flag in the scan repository will be switched to finished.
Congratulations! You just discovered for yourself the next major addition to DP. We have been planning all along to have something that we call "metadata collection". This will most likely be two rounds. The first round will collect whatever project-level metadata we decide is important and that can be derived directly from the scans. It will also check that all pages are present and that the scans are legible. The second round will look at each page, noting formatting features such as ToC, index, tables, poems, block quotes, musical notation, mathematical or chemical equations, etc., etc. This information will be used to allow pages to go to "specialist" rounds for things like musical notation markup, math markup, tables, indexes, etc. It will also be used as a quality check for the final formatting ("The page metadata says there should be 2 footnotes. Why is there only markup for one?"). I hope that we will also be able to send illustration scans through a separate production path as part of this process.

We don't currently have any plans to do anything special with the scans themselves, other than being certain that they are all there and legible. There are a LOT of open questions about how we will implement this, but the basic idea has been in Charlz' plan from the beginning. JulietS
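A toy sketch of the footnote quality check quoted above (the metadata field and the "[Footnote" markup convention are assumptions for illustration):

    def check_footnotes(page_metadata, formatted_text):
        """Compare the footnote count recorded in page metadata with the markup."""
        expected = page_metadata.get("footnotes", 0)
        found = formatted_text.count("[Footnote")
        if found != expected:
            return "page metadata says %d footnote(s), markup has %d" % (expected, found)
        return None

    print(check_footnotes({"footnotes": 2}, "[Footnote 1: ...] text with one note"))
    # -> page metadata says 2 footnote(s), markup has 1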
participants (4): Jon Noring, Juliet Sutherland, Michael Dyck, Robert Shimmin