
RS wrote:
Many of the tools currently assume that PNG numbers are strictly numeric and monotonically increasing. Perhaps this should be changed (I would argue yes :) ), but that does create complications with how page renumbering should work, for example.
I know of one fairly prominent commercial digital library (Eighteenth Century Online) that made the decision that the issues with using page numbers as unique identifiers are sufficiently hairy that they went with sequential *image* numbers (which are unique), and the database maintains as metadata the page number that goes with each image number (which may not be unique, sequential, numerical, or even present).
This seems to be the route DP is headed with the proposed "metadata collection" round.
This returns to my idea that one could embed some of the core metadata into the scan image filename itself, so that the metadata database need not be queried for pagination issues. (It is often an advantage not to have to query the database all the time for certain basic information -- it depends upon who will use the data and for what purpose(s), and whether there's a chance the user will not have access to the database at the time.)

To reiterate, we might consider the following syntax for scan image file naming. (This is not necessarily what one would use when the images are embedded within a DjVu for deeplinking purposes, as Marcello is proposing in his RFC -- we have to differentiate between general page image naming and image naming within DjVu encapsulation.)

    ScanFileName == BookID : SeqPage# : [PrintedPage#] : [OtherInfo] : FormatSuffix

The BookID could itself comprise multiple parts, depending upon the structure of the source BookID (we have the issue of multivolume sets, for example -- tbd). It would be an identifier that points to metadata elsewhere.

The SeqPage# is an integer giving the position of the scan in the full scan set -- it is not the same as the page numbers printed in the work. We'd include all blank pages (both sides) from cover to cover as part of the numbering.

PrintedPage# is a string comprising what the publisher used (or implied using) for the page number in that scan. It could be boring like '135' (for page 135) or something more bizarre, as in some of the examples Geoff gave.

OtherInfo will state various oddities, such as implied page numbering, illustrations, foldouts, etc., whatever is deemed necessary. This is where the system can be expanded when we run into some really odd and unforeseen stuff.

Example (using '-' as a field delimiter; there's probably a better delimiter that could be used):

    DP0003579-00313-296-IMP.png

0003579 is the BookID -- an identifier used in the system.
I have this as a 7-digit decimal (which is more human friendly), meaning that up to 10 million IDs (less one) are possible. But as noted above, this field might itself comprise multiple parts depending upon how everything is set up.

00313 is the sequential page number within the scan set for the book. Five decimal digits is more than enough for any single book volume under the sun that I can think of: 99998 pages, or 49,999 leaves! The insides of both covers can be considered pages as well, since sometimes they contain interesting stuff (like my copy of the Kama Sutra).

'296' is the string the publisher used to identify the page (here it is a straight page number: "page 296").

'IMP' means it is implied: the publisher did not actually print '296' on the page, but from looking at the page numbering sequence, it is clearly implied. If the '296' was printed, this field would be left blank, viz. DP0003579-00313-296.png.

Hmmm, as a final point, we may even want to put another field into the filename syntax which describes something about the scan image generation, so if the scan is resampled, cleaned up, etc., that would differentiate it from the others which represent the same page image but at a different resolution or stage of cleanup -- or mark it as the original image that came off the scanner.

I recall when I was handling the Kama Sutra scan set, I'd do some bulk image processing on the lot and generate a new scan set. I had to somehow keep track of filenaming issues so I didn't mix up the scan sets. It got to be an interesting exercise to keep everything straight, since I had not yet come up with a filenaming system that made sense (I was just ad-hoc'ing it as I went along, and didn't devote much thought to it).
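
To make the four-field syntax above concrete, here is a rough Python sketch of how such names might be split apart and put back together. It is only an illustration under assumptions of my own: '-' as the delimiter, a BookID made of an optional letter prefix plus seven digits, a five-digit SeqPage#, and no '-' or '.' inside PrintedPage# or OtherInfo. The names ScanName, parse_scan_name and format_scan_name are hypothetical, not part of any existing tool.

```python
import re
from typing import NamedTuple, Optional


class ScanName(NamedTuple):
    """Fields of a scan file name (hypothetical structure for illustration)."""
    book_id: str                 # e.g. "DP0003579"
    seq_page: int                # position of the scan in the full scan set
    printed_page: Optional[str]  # what the publisher printed or implied
    other_info: Optional[str]    # oddities such as "IMP" (implied numbering)
    suffix: str                  # image format suffix, e.g. "png"


# Assumed layout: BookID-SeqPage#[-PrintedPage#[-OtherInfo]].suffix
# PrintedPage# may be empty so that OtherInfo can appear without it.
_PATTERN = re.compile(
    r"(?P<book>[A-Za-z]*\d{7})"
    r"-(?P<seq>\d{5})"
    r"(?:-(?P<printed>[^-.]*))?"
    r"(?:-(?P<other>[^-.]+))?"
    r"\.(?P<suffix>[A-Za-z0-9]+)"
)


def parse_scan_name(name: str) -> ScanName:
    """Split a scan file name into its fields, or raise ValueError."""
    m = _PATTERN.fullmatch(name)
    if m is None:
        raise ValueError(f"not a recognised scan file name: {name!r}")
    return ScanName(
        book_id=m.group("book"),
        seq_page=int(m.group("seq")),
        printed_page=m.group("printed") or None,  # empty field becomes None
        other_info=m.group("other"),
        suffix=m.group("suffix"),
    )


def format_scan_name(s: ScanName) -> str:
    """Rebuild the file name from its fields (inverse of parse_scan_name)."""
    parts = [s.book_id, f"{s.seq_page:05d}"]
    if s.printed_page or s.other_info:
        # Keep an empty PrintedPage# slot when only OtherInfo is present.
        parts.append(s.printed_page or "")
    if s.other_info:
        parts.append(s.other_info)
    return "-".join(parts) + "." + s.suffix
```

With the example name, parse_scan_name("DP0003579-00313-296-IMP.png") gives a SeqPage# of 313, a PrintedPage# of '296' and OtherInfo 'IMP', and format_scan_name turns those fields back into the same string. A printed page string that itself contains '-' or '.' would break this sketch, which is one more reason to pick a better delimiter.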
So, yes, it does appear the filename should include a field about generation -- maybe a system such as "00", "01", etc., where "00" is the raw scan right from the scanner (and the metadata would give the details about the "00" scan set, such as resolution and color depth, type of scanner, etc.), and the other numbers denote various conversions and cleanups, also specified in detail in the metadata. Since image processing can be quite complicated and vary from job to job, it seems impossible to come up with a simple descriptive system that could be embedded right in the filename; thus the proposal to stick to a nondescript numbering to differentiate the various converted scan sets from the original scan set. I think if someone got hold of the various derivative scan sets and didn't have access to the database, by careful inspection of the images they could infer, in a general sense, most of what was done to produce each scan set. (Of course, most image formats allow one to write metadata within the header, but such information can be brittle. It is something worth looking into, however.)

I know what I recommend above is not the input Marcello wants on his proposed RFC, but before we tackle the specific application Marcello is discussing (linking to particular pages within a DjVu), I think it wise that we first look at the general issue of scan image file naming for a variety of purposes -- as part of the work flow, and for the end-use side. As noted before, the above syntax should be trivially remappable by machine processing to Marcello's proposed RFC syntax -- at least it appears the mapping (in one direction) is easily doable.

Jon