
Thanks to Branko (public message) and Jon Niehof (private message, I believe) for their quick replies to my initial set of questions/concerns/thoughts regarding the scan repository, which was posted earlier today. It is interesting that the answers differed in several ways, which I won't reply to in the usual manner. But it does show that there will not be uniformity of thought on all the details of the scan repository.

But filtering through both messages -- and despite the angst I expressed in an earlier message today agreeing with what Bowerbird had to say on this topic ("what should be") -- it is clear that we can and should go ahead with a "first generation" scan repository focused primarily on centralizing the preservation of the scans submitted to DP, and on making them publicly available pretty much "as is". (Expansion to allow submissions of scan sets outside of DP activities can be considered once the DP scan repository is up and running, and mostly debugged. We will certainly learn a lot from it. Thus my go-slow preference.)

Now, regarding Branko's comment that Brewster has disk space to burn (implying he doesn't care about the details of what we intend to do): this is NOT true. I've had several personal discussions with Brewster regarding other matters, and I know he will be concerned about the breadth/depth/future of what we have in mind, the public access issues, the format, etc. He may also be concerned with how it will relate to his current book scanning project in Canada (not in a competitive sense, but in a compatibility sort of way). Some of his questions and concerns we cannot predict in advance. But the better we understand what we want to do (and what we don't want to do), the easier it will be to discuss this with him, hopefully resulting in his offer to host the PG/DP scan repository.

Probably the #1 problem as I see it is to assure that scan sets don't become "disassociated" from PG/DP.
That is, there needs to be sufficient metadata so each scan set can be attached to a particular DP-produced digital text, and from that, attached to the associated PG text. So, does DP keep an internal identifier and associated metadata for each project it works on? Will it be fairly easy to associate a scan set with that DP identifier, or will it require some human intervention to make the association? And does the DP identifier correlate to a PG text identifier?

Another issue concerns the directory structure of the repository which will hold the scan sets. It has to take into account that there may be not only the original scan set which was submitted to DP, but also derivative sets produced by some processing (there's, of course, the 120 dpi set used for the DP proofing interface.) I tend to view that for the DP scans, the DP identifier should be used as part of the name of the directory holding the scan sets associated with that identifier.

Another issue is "zip vs. individual files". My view leans toward keeping each page scan image a separate file, so end-users need not download a whole gigabyte-size (or larger) ZIP file just to look at one page (some books may have scan sets this large.)

Another issue, which affects directory design, metadata, and access, is serialization. Some works were serialized (such as multiple parts in periodicals) and then combined into one digital text. How do we handle/organize the scan sets for this? There are no doubt other odd "exceptions" we will have to handle.

Format is another issue. Branko mentioned that DjVu is proprietary (or essentially so). Yet Brewster was, the last time I talked with him about scan formats (last year), enamored with it, and was using it for his scanning projects to greatly conserve disk space (he *is* concerned about the cost of storing book scans, thus his interest in DjVu.) If Brewster is still using DjVu, I can *guarantee* he will mention it to us -- we need an answer ready if we don't want to use DjVu compression.
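To make the metadata and directory-structure questions concrete, here's a rough sketch (in Python; the identifier formats, field names, and layout are all hypothetical, not anything DP or PG actually uses) of how a scan set might be keyed on a DP project identifier and tied back to a PG text:

```python
import json
import posixpath

def scanset_path(dp_project_id: str, derivative: str = "original") -> str:
    """Build a repository path for a scan set, keyed on the DP project
    identifier so the original and all derivatives stay grouped under
    one directory. (Hypothetical layout for illustration only.)"""
    return posixpath.join("scans", dp_project_id, derivative)

def scanset_metadata(dp_project_id: str, pg_etext_number: int,
                     title: str) -> str:
    """A minimal metadata record tying a scan set to its DP project and
    to the resulting PG text, returned as JSON. Field names are made up
    for this sketch."""
    record = {
        "dp_project_id": dp_project_id,
        "pg_etext_number": pg_etext_number,
        "title": title,
        "derivatives": ["original", "120dpi-proofing"],
    }
    return json.dumps(record, indent=2)

# Hypothetical project identifier; each page scan would live as an
# individual file under this directory, so one page can be fetched
# without downloading a giant ZIP.
print(scanset_path("projectID1234abcd", "original"))
# scans/projectID1234abcd/original
```

The point of the sketch is only that a single stable identifier does double duty: it names the directory and it appears in the metadata record, so a scan set cannot drift away from its DP project or its PG text.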
The "proprietary" issue will not sway him if he is using it. We may need to look at alternative "lossy" formats (I detest JPEG for compressing scanned images because of the significant artifacts it creates -- there must be something better which is also an open standard?) Obviously, I'd prefer to store the scan sets (and derivatives) in a format which preserves, in a lossless sense, the scans as they were submitted to DP. But we need not be wedded to keeping the identical format they were submitted in, so long as they are losslessly the same.

Another issue regards end-user compatibility. Web browsers are usually not TIFF- (nor DjVu-) ready without a plugin, while all contemporary browsers will handle PNG and JPEG. When it comes to lossless compression of bitonal images, some of the CCITT protocols (as used in a TIFF encapsulation) are significantly better than PNG (PNG is designed for general lossless compression of all kinds of images.) Of course, there is the option that if a "third party" interface is developed to access the repository, image conversion on the fly could be employed to translate the format in the repository into most any format an end-user can handle, even if the source is in some odd format.

Another possible issue regards illustrations. Am I right that some scan projects scan the whole book at one resolution, then return and do the pages with illustrations at a higher resolution, cropped to the illustrations? This may have an impact on directory structure and metadata fields.

Anyway, that's enough for now. Comments?

Jon
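P.S. To make the "image conversion on the fly" idea concrete, here's a rough sketch (in Python; the converter names and file extensions are purely illustrative, not real tools we've chosen) of how a repository front end might decide whether a stored page image can be served as-is or needs translating into a browser-friendly format first:

```python
import posixpath

# Hypothetical table mapping (stored format, requested format) to the
# converter a front end would invoke on the fly; names are made up.
CONVERTERS = {
    (".tif", ".png"): "tiff2png",
    (".tif", ".jpg"): "tiff2jpeg",
    (".djvu", ".png"): "djvu2png",
}

def plan_conversion(stored_file: str, requested_ext: str):
    """Return ("serve", filename) if the stored page image already
    matches the requested format, or ("convert", tool) naming the
    on-the-fly converter to run. Raises if no converter is known."""
    _stem, src_ext = posixpath.splitext(stored_file)
    if src_ext == requested_ext:
        return ("serve", stored_file)
    tool = CONVERTERS.get((src_ext, requested_ext))
    if tool is None:
        raise ValueError(f"no converter from {src_ext} to {requested_ext}")
    return ("convert", tool)

print(plan_conversion("p0042.tif", ".png"))
# ('convert', 'tiff2png')
```

This way the archival copies could stay in whatever lossless format best suits them (CCITT-in-TIFF, say), while browsers only ever see PNG or JPEG.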