Re: [gutvol-d] Scan file naming -- another comment

22 Jul 2005

Marcello wrote:
...
Jon Noring wrote:
...
...
Example: DP0000239-00125-106.png
"DP0000239" is the source book identifier, here a DP identifier. If
the scan project is independent of DP, it could be 'PG0014239' to
associate the scan set with PG text number 14239.
"00125" says this is the 125th "side" in the full sequence of sides
in the book, starting from the front cover or wherever else is
considered the starting point.
"106" is the string (which can be more complicated like "A2", "5-4",
"ix", "ABCD" whatever), which the publisher printed on that page to
identify it (that's really what a publisher-supplied page "number" is:
a page identifier.)
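
The three-part name described above can be taken apart mechanically. The following is only a sketch under assumptions not stated in the proposal: that the book identifier is two capital letters followed by digits, that the sequence field is all digits, and that everything between the second hyphen and the final extension is the publisher's page string (so a page string such as "5-4" may itself contain a hyphen). The names `ScanName` and `parse_scan_name` are hypothetical.

```python
import re
from typing import NamedTuple


class ScanName(NamedTuple):
    """Hypothetical container for the parts of a scan filename."""
    book_id: str   # e.g. "DP0000239" or "PG0014239"
    seq: int       # position of this side in the full sequence of sides
    page_id: str   # the string the publisher printed on the page
    ext: str       # file extension without the dot


# Assumed layout: <2 letters + digits>-<digits>-<page string>.<extension>
_PATTERN = re.compile(
    r"^(?P<book>[A-Z]{2}\d+)-(?P<seq>\d+)-(?P<page>.+)\.(?P<ext>[A-Za-z0-9]+)$"
)


def parse_scan_name(filename: str) -> ScanName:
    """Split a scan filename into book id, sequence number, page string, extension."""
    m = _PATTERN.match(filename)
    if m is None:
        raise ValueError(f"not a recognized scan filename: {filename!r}")
    return ScanName(m["book"], int(m["seq"]), m["page"], m["ext"])


if __name__ == "__main__":
    print(parse_scan_name("DP0000239-00125-106.png"))
```

Because only the first two hyphens are treated as separators, a page string containing hyphens survives intact.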
...
Incredibly awkward and broken in several ways:
1.
While scanning you have no feedback on the correctness of your
scanning. You are scanning page "42" and saving to file "58.tif". There
is no immediate relation between the page you are putting on the scanner
and the filename you are saving it under.
But your system requires a human being, each time an image is first
saved, to eyeball the page, determine the publisher-supplied page
number (if any; it may be implied), and then manually save the page
using the publisher number.
...
2.
To add the real page number to the filename you need a second run over
all files. Errors galore!
Which is important to do anyway. Just think through how you would
construct a volunteer, multi-person effort to scan lots of books.
There is the need for people to scan, for people to look over the work
and determine if there are problems (QC), people to deskew and crop
the images, people to regularize the filenaming (to whatever system),
the making of derivative scan sets, etc.

It will look more and more like DP.
...
Proof: your example filename DP0000239-00125-106.png is bogus: page 125
"starting from the front cover" must be a right-hand side, but page 106
is surely a left-hand one. You got confused even with one file alone.
What about handling hundreds of them at once?
I did not get confused, but your observation is correct, and I am
revising where seq# starts.

I originally started my seq# scheme from the inside of the front cover,
but that side is better given an even number. The front of the front
cover should be given Seq#=1, the inside of the front cover Seq#=2, etc. One
can think of the cover like a very thick leaf of the book, and any
scan project should scan the covers anyway. (And maybe set Seq#=0
for the spine, which should also be scanned. It is important for
bibliographic purposes to get scans of the outside covers and spine. I
learned this in trying to date my particular copy of the Kama Sutra,
where the spacing and color of the lettering on the spine are critical
to determining both the edition and printing, and whether it is
original or a pirate copy.)
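
With the revised numbering (front of the front cover = 1, inside of the front cover = 2), odd sequence numbers fall on right-hand sides and even ones on left-hand sides. A small check along the following lines could flag mismatches such as the 125/106 pairing. It is a sketch only, and it assumes the usual convention that odd printed page numbers sit on right-hand pages, which a given book may not follow. The function names are hypothetical.

```python
from typing import Optional


def is_right_hand(seq: int) -> bool:
    """Under the revised scheme (front of front cover = 1), odd seq = right-hand side."""
    return seq % 2 == 1


def parity_consistent(seq: int, page_id: str) -> Optional[bool]:
    """Compare the parity of the sequence number with the printed page number.

    Returns None when the page string is not a plain decimal number
    (e.g. "ix" or "A2"), since no parity check is possible then.
    Assumes odd printed page numbers are on right-hand pages.
    """
    if not page_id.isdecimal():
        return None
    return (seq % 2) == (int(page_id) % 2)


if __name__ == "__main__":
    print(parity_consistent(125, "106"))  # the pairing objected to above
    print(parity_consistent(126, "106"))
    print(parity_consistent(5, "ix"))
```

A result of False only marks a file for a human to look at; it does not prove the name is wrong.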
...
Being composed of 2 keys, the probability that a link to this file
breaks is much higher than with either key alone.
As I noted before, for certain end-uses, one could do a conversion. I'm
thinking of the work flow and scan set archiving stages.
...
...
It will be able to handle *any* bizarre pagination system
the publisher/author dreamed up (the publisher could number the pages
backwards for all we care, and this system will handle it without any
complications -- yet we preserve the publisher-supplied page "number"
in the filename which is important for referencing/citation.)
...
Bogus claim.
The publisher might put something in the page "number" that doesn't work
as filename or url. What about page "4/2"? Makes a good filename, huh?
For these rare circumstances where the string contains a disallowed
character for filenaming (of course, we have to think internationally,
too), it would be possible to use an escape character, as is done for
URLs when they contain disallowed characters.

Page referencing as found in written literature will likely use
whatever system was used in the target book, such as "see 'Lust in
the Dust' by John Rust, page 4/2, ...". It is wise that the exact
string be preserved for linking purposes, rather than renaming it to
something else and losing that information. Failing to record the exact
character string the publisher used to number (or identify) a page
throws that information away. If it is occasionally necessary to escape
characters, then so be it. This is done all the time in URLs.
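
As an illustration of the escaping idea, here is a minimal sketch that borrows URL percent-encoding directly, using Python's standard `urllib.parse`. Choosing percent-encoding as the escape mechanism is an assumption made for the example; the proposal only says that some escape character could be used. The helper names are hypothetical.

```python
from urllib.parse import quote, unquote


def escape_page_id(page_id: str) -> str:
    """Percent-encode a publisher page string for use inside a filename.

    safe="" means "/" is escaped too; letters, digits, and "_.-~" pass through.
    """
    return quote(page_id, safe="")


def unescape_page_id(escaped: str) -> str:
    """Recover the exact publisher page string from its escaped form."""
    return unquote(escaped)


if __name__ == "__main__":
    escaped = escape_page_id("4/2")
    print(escaped)                    # 4%2F2
    print(unescape_page_id(escaped))  # 4/2
```

The round trip is lossless, so the exact publisher string can always be recovered from the filename for citation purposes.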

Jon