[gutvol-d] A general scan image filename syntax (conceptual)

19 Jul 2005

      RS wrote:
...
...
Many of the tools currently assume that PNG numbers are strictly
numeric and monotonically increasing. Perhaps this should be
changed (I would argue yes :) ), but that does create
complications with how page renumbering should work, for
example.
...
I know of one fairly prominent commercial digital library (Eighteenth
Century Online) that made the decision that the issues with using page
numbers as unique identifiers are sufficiently hairy that they went with
sequential *image* numbers (which are unique), and the database 
maintains as metadata the page number that goes with each image number
(which may not be unique, sequential, numerical, or even present).
This seems to be the route DP is headed with the proposed "metadata 
collection" round.
This returns to my idea that one could embed some of the core
metadata information into the scan image filename itself, so when it
comes to pagination issues the metadata database need not be queried.
(It is often an advantage to not have to query the database all the
time to find out certain basic information -- it depends upon who will
use the data and for what purpose(s), and if there's a chance the user
will not have access at the time to the database.)

To reiterate, we might consider the following syntax for scan image file
naming (this is not necessarily what one would use when the images are
embedded within a DjVu for deeplinking purposes, as Marcello is
proposing in his RFC -- we have to differentiate between general page
image naming and image naming within DjVu encapsulation.)

ScanFileName == BookID : SeqPage# : [PrintedPage#] : [OtherInfo] : FormatSuffix

The BookID could itself comprise multiple parts, depending upon the
structure of the source BookID (we have the issue of multivolume sets,
for example -- tbd.) It would be an identifier that will point to
metadata elsewhere.

The SeqPage# is an integer giving the position of the scan in the full
scan set -- it is not the same as the page numbers printed in the
work. We'd include all blank pages (both sides) from cover to cover as
part of the numbering.

PrintedPage# is a string comprising what the publisher used (or
implied) as the page number in that scan. It could be boring,
like '135' (for page 135), or something more bizarre, as in some of
the examples Geoff gave.

OtherInfo will state various oddities, such as implied page numbering,
illustrations, foldout, etc., whatever is deemed necessary. This is
where the system can be expanded when we run into some really odd and
unforeseen stuff.

Example (using '-' as a field delimiter; there's probably a better
delimiter that could be used):

   DP0003579-00313-296-IMP.png

0003579 is the BookID -- an identifier used in the system. I have this
as a 7-digit decimal (which is more human friendly) meaning that up to
10 million IDs (less one) are possible. But as noted above, this field
might itself comprise multiple parts depending upon how everything is
set up.

00313 is the sequential page number within the scan set for the book. Five
decimal digits is more than enough for any single book volume under
the sun that I can think of: 99,998 pages, or 49,999 leaves! The
insides of both covers can be considered pages as well, since sometimes
they contain interesting stuff (like my copy of the Kama Sutra.)

'296' is the string the publisher used to identify the page (here it
is a straight page number: "page 296".)

'IMP' means it is implied. The publisher did not actually print '296'
on the page, but from looking at the page numbering sequence, it is
clearly implied. If the '296' were printed, this field would be left
blank, viz. DP0003579-00313-296.png .
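
To make the field layout concrete, here is a minimal Python sketch of how a
tool might split such a filename back into its parts. It assumes the '-'
delimiter from the example; the names ScanName and parse_scan_name are
made up for illustration, and it will misparse any BookID or PrintedPage#
that itself contains a '-' (one more reason a better delimiter is probably
needed).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ScanName:
    """Fields of a scan image filename (hypothetical container)."""
    book_id: str
    seq_page: int                 # position in the full scan set
    printed_page: Optional[str]   # publisher's page string, if any
    other_info: Optional[str]     # e.g. 'IMP' for an implied page number
    suffix: str                   # format suffix, e.g. 'png'


def parse_scan_name(filename: str, delim: str = "-") -> ScanName:
    # Split off the format suffix at the last dot.
    stem, dot, suffix = filename.rpartition(".")
    if not dot or not stem:
        raise ValueError(f"no format suffix in {filename!r}")

    # At most four fields; anything after the third delimiter stays
    # in OtherInfo.
    parts = stem.split(delim, 3)
    if len(parts) < 2:
        raise ValueError(f"need at least BookID and SeqPage# in {filename!r}")
    parts += [""] * (4 - len(parts))      # pad the omitted optional fields
    book_id, seq, printed, other = parts

    if not (seq.isascii() and seq.isdigit()):
        raise ValueError(f"SeqPage# must be decimal digits, got {seq!r}")

    # Empty optional fields are reported as None.
    return ScanName(book_id, int(seq), printed or None, other or None, suffix)


if __name__ == "__main__":
    print(parse_scan_name("DP0003579-00313-296-IMP.png"))
    print(parse_scan_name("DP0003579-00313-296.png"))
```

Note that the pagination information comes out of the filename alone, with
no database lookup, which is the point of the exercise.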

Hmmm, as a final point, we may even want to put another field into the
filename syntax which describes something about the scan image
generation, so if the scan is resampled, cleaned-up, etc., that would
differentiate it from the others which represent the same page image
but at a different resolution or stage of cleanup -- or that it is the
original image that came off the scanner. I recall when I was handling
the Kama Sutra scan set, I'd do some bulk image processing on the lot,
and generate a new scan set. I had to somehow keep track of filenaming
issues so I didn't mix up the scan sets. It got to be an interesting
exercise to keep everything straight since I had not yet come up with
a filenaming system that made sense (I was just ad-hoc'ing it as I
went along, and didn't devote much thought to it.)

So, yes, it does appear the filename should include a field about
generation -- maybe here we have a system such as "00", "01", etc., where
"00" is the raw scan right from the scanner (and the metadata would
give the details about the "00" scan set, such as resolution and color
depth, type of scanner, etc.), and the other numbers denote various
conversions and cleanups also specified in detail in the metadata. Since
image processing can be quite complicated and vary from job to job, it
seems impossible to come up with a simple system that could be embedded
right in the filename, thus the proposal to stick to a nondescript
system to differentiate various converted scan sets from the original
scan set. I think if someone got hold of the various derivative scan
sets, and they didn't have access to the database, by careful inspection
of the images they could infer, in a general sense, most of what was
done to produce each scan set. (Of course, most image formats allow
one to write metadata within the header, but such information can be
brittle. It is something worth looking into, however.)
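
As a rough Python sketch of how a generation field might fit in: I have not
settled where the field would go, so the placement here (a two-digit field
right after SeqPage#, since both are fixed-width and always present) is
purely an assumption for illustration, as is the function name
build_scan_name.

```python
def build_scan_name(book_id: str, seq_page: int, generation: int = 0,
                    printed_page: str = "", other_info: str = "",
                    suffix: str = "png", delim: str = "-") -> str:
    """Assemble a scan filename with a hypothetical generation field.

    generation 0 ("00") denotes the raw scan right off the scanner;
    higher numbers denote derived scan sets whose processing details
    live in the metadata, not in the filename.
    """
    if not 0 <= seq_page <= 99999:
        raise ValueError("SeqPage# must fit in five decimal digits")
    if not 0 <= generation <= 99:
        raise ValueError("generation must fit in two decimal digits")

    fields = [book_id, f"{seq_page:05d}", f"{generation:02d}",
              printed_page, other_info]
    # Drop trailing empty optional fields, as in DP0003579-00313-296.png.
    while fields[-1] == "":
        fields.pop()
    return delim.join(fields) + "." + suffix


if __name__ == "__main__":
    # Raw scan and a first cleaned-up derivative of the same page.
    print(build_scan_name("DP0003579", 313, 0, "296", "IMP"))
    print(build_scan_name("DP0003579", 313, 1, "296", "IMP"))
```

With something like this, the raw scan and each derivative of the same page
differ only in the generation field, so the scan sets cannot get mixed up
even if they land in the same directory.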

I know what I recommend above is not the input Marcello wants on his
proposed RFC, but before we tackle the specific application Marcello
is discussing (linking to particular pages within a DjVu), I think it
wise we first look at the general issue of scan file image naming for a
variety of purposes -- as part of the work flow, and for the end-use
side. As noted before, the above syntax should be trivially remapped by
machine processing to Marcello's proposed RFC syntax -- at least it
appears the mapping (in one direction) is easily doable.

Jon

Jon Noring