[gutvol-d] Scan file naming -- another comment

22 Jul 2005

Bowerbird's thoughts on scanning are a good summary of some of the
issues. And despite his view that we've gotten off-track in the
discussion, his points about filenaming, image processing (deskewing),
etc., align pretty well with the ongoing discussion.

Regarding scan filenaming, he rightly notes that a source book
(or Work) identifier should be prepended to the filename. This is what
I have also proposed.

Where we differ in filename convention is that I believe the source ID
should be followed immediately by a sequential number which describes
where the page side is in the linear order of all the page sides in the
book (totally independent of how the publisher may have paginated the
book).

This way one will unambiguously and immediately know the position of
every page scan in the bound book (starting with the inside of the
front cover, which can be "side 1", and ending with the inside of the
back cover -- alternatively we can start with the front cover as
"side 1" which has some advantages with respect to the dominant
recto/verso page numbering convention.) All blank pages will be
included.

Now, this sequential number will generally not correlate with whatever
pagination the publisher uses to 'id' the pages. So, after the
sequential number we have a third field in the filename which gives
the actual publisher supplied page number (if any; can be implied).

This way we decouple the publisher pagination from the page sequence
in the book, thereby simplifying the system and making it more
flexible. It will be able to handle *any* bizarre pagination system
the publisher/author dreamed up (the publisher could number the pages
backwards for all we care, and this system will handle it without any
complications -- yet we preserve the publisher-supplied page "number"
in the filename which is important for referencing/citation.)

Example: DP0000239-00125-106.png

"DP0000239" is the source book identifier, here a DP identifier. If
the scan project is independent of DP, it could be 'PG0014239' to
associate the scan set with PG text number 14239.

"00125" says this is the 125th "side" in the full sequence of sides
in the book, starting from the front cover or wherever else is
considered the starting point.

"106" is the string (which can be more complicated like "A2", "5-4",
"ix", "ABCD" whatever), which the publisher printed on that page to
identify it (that's really what a publisher-supplied page "number" is:
a page identifier.)
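
To make the three fields concrete, here is a minimal Python sketch that
splits such a filename into its parts and builds one back. It is only an
illustration: the names ScanName, parse_scan_name and format_scan_name
are made up for this example, the five-digit zero padding is inferred
from the single example above, and an empty third field is my assumption
for a page with no printed identifier. It splits on the first two
hyphens only, so a page identifier like "5-4" survives intact, and it
does not model any fields beyond these three.

```python
from typing import NamedTuple


class ScanName(NamedTuple):
    """Hypothetical container for the three filename fields."""
    source_id: str   # source book identifier, e.g. "DP0000239"
    sequence: int    # 1-based position of the page side in the book
    page_label: str  # publisher-printed page identifier ("" if none; assumed)


def parse_scan_name(filename: str) -> ScanName:
    """Split 'SOURCE-SEQUENCE-LABEL.ext' into its three fields."""
    stem, dot, _ext = filename.rpartition(".")
    if not dot:
        raise ValueError(f"no file extension in {filename!r}")
    # Split on the first two hyphens only, so labels such as "5-4" stay whole.
    parts = stem.split("-", 2)
    if len(parts) != 3:
        raise ValueError(f"expected three fields in {filename!r}")
    source_id, seq_text, page_label = parts
    if not (seq_text.isascii() and seq_text.isdigit()):
        raise ValueError(f"sequence field is not a number in {filename!r}")
    return ScanName(source_id, int(seq_text), page_label)


def format_scan_name(name: ScanName, ext: str = "png") -> str:
    """Build a filename, zero-padding the sequence number to five digits."""
    return f"{name.source_id}-{name.sequence:05d}-{name.page_label}.{ext}"
```

With this sketch, parse_scan_name("DP0000239-00125-106.png") gives
source_id "DP0000239", sequence 125 and page_label "106", and
format_scan_name turns that back into the same filename.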

(My proposed system has a couple more fields after these three,
dealing with exceptions and generation of the scan set from the
original, which aid in keeping track of multiple derivative scan
sets and a few other oddities. The details are described in previous
messages.)

Jon Noring

[Note: In the "My Antonia" project, it is interesting that there is
no "Page 1" and "Page 2". The book starts (after the Roman numbered
foreword section) with Page 3! Now imagine getting a scan set of
"My Antonia" where we have defined page scan sequencing using the
page numbering the publisher used (which is what the systems proposed
by Marcello and Bowerbird do). The first question I will have is
"where are pages 1 and 2? Are they missing from the set?" However, if the
scans are sequentially numbered based on their position in the book
(with the knowledge the book passed QC checking), then I would know
that at least the project saw this, too, and likely there were no
missing pages, and thus it is likely Pages 1 and 2 never existed.

And for those who will ask, before scanning the book I took it apart
to determine if there was a page which got ripped out there, but there
definitely was no ripped out page. There might have been an inserted/
glued plate which "fell out", but according to the "My Antonia"
experts there definitely was no such insert in the First Edition. How
many other books start pagination of the body with something other
than 1?]
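
The completeness check described in the note can be sketched in a few
lines of Python. This is an illustration under assumptions, not part of
the proposal: the function name missing_sides is made up, the filenames
are assumed to follow the three-field SOURCE-SEQUENCE-LABEL.ext pattern,
and numbering is assumed to start at side 1. It only looks at the
sequence field, so a jump in the publisher's page numbers does not show
up as a gap.

```python
def missing_sides(filenames):
    """Return the sequence numbers absent from 1..highest number seen."""
    seen = set()
    for name in filenames:
        stem = name.rsplit(".", 1)[0]   # drop the file extension
        seq_text = stem.split("-", 2)[1]  # second field is the sequence
        seen.add(int(seq_text))
    if not seen:
        return []
    return [n for n in range(1, max(seen) + 1) if n not in seen]


# Invented filenames for illustration only: the printed page labels jump
# from a Roman numeral to "3", but the sequence numbers run unbroken.
scans = [
    "XX0000001-00001-.png",
    "XX0000001-00002-ix.png",
    "XX0000001-00003-3.png",
    "XX0000001-00004-4.png",
]
gaps = missing_sides(scans)
```

Here gaps is an empty list, which is the signal that no side was
skipped even though there is no printed "1" or "2". Note that a check
like this cannot tell whether sides are missing from the very end of
the set; that still needs the QC step.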