
Marcello wrote:
Jon Noring wrote:
Example: DP0000239-00125-106.png
"DP0000239" is the source book identifier, here a DP identifier. If the scan project is independent of DP, it could be 'PG0014239' to associate the scan set with PG text number 14239.
"00125" says this is the 125th "side" in the full sequence of sides in the book, starting from the front cover or wherever else is considered the starting point.
"106" is the string that the publisher printed on that page to identify it. It can be more complicated, like "A2", "5-4", "ix", "ABCD", whatever. (That's really what a publisher-supplied page "number" is: a page identifier.)
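To make the three-part layout concrete, here is a minimal sketch of how such a filename could be split back into its parts. The names `ScanName` and `parse_scan_name` are hypothetical, invented for this illustration, and the sketch assumes the book identifier and sequence number never contain a hyphen, so that everything after the second hyphen is the publisher's page string.

```python
from typing import NamedTuple


class ScanName(NamedTuple):
    """Hypothetical container for the parts of a scan filename."""
    book_id: str      # e.g. "DP0000239"
    seq: int          # position of this side in the full sequence of sides
    page_label: str   # the string the publisher printed on the page
    extension: str    # e.g. "png"


def parse_scan_name(filename: str) -> ScanName:
    """Split '<book_id>-<seq>-<page_label>.<ext>' into its parts."""
    stem, dot, extension = filename.rpartition(".")
    if not dot:
        raise ValueError(f"no file extension in {filename!r}")
    # Split on the first two hyphens only, so that a page label such as
    # "5-4" survives intact as the third part.
    parts = stem.split("-", 2)
    if len(parts) != 3:
        raise ValueError(f"expected three hyphen-separated parts in {filename!r}")
    book_id, seq_text, page_label = parts
    if not (seq_text.isascii() and seq_text.isdigit()):
        raise ValueError(f"sequence number is not numeric in {filename!r}")
    return ScanName(book_id, int(seq_text), page_label, extension)


if __name__ == "__main__":
    print(parse_scan_name("DP0000239-00125-106.png"))
    print(parse_scan_name("DP0000239-00007-5-4.png"))
```

Limiting the split to two hyphens is what lets labels like "5-4" pass through unchanged; a different separator would need a different rule.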
Incredibly awkward and broken in several ways:
1. While scanning you have no feedback on the correctness of your scanning. You are scanning page "42" and saving to file "58.tif". There is no immediate relation between the page you are putting on the scanner and the filename you are saving it under.
But your system requires that, when each image is first saved, a human being eyeball the page, determine the publisher-supplied page number (if any; it may be implied), and then manually save the page using the publisher number.
2. To add the real page number to the filename you need a second run over all files. Errors galore!
Which is important to do anyway. Just think through how you would construct a volunteer, multi-person effort to scan lots of books. There is the need for people to scan, for people to look over the work and determine if there are problems (QC), people to deskew and crop the images, people to regularize the filenaming (to whatever system), the making of derivative scan sets, etc. It will look more and more like DP.
Proof: your example filename DP0000239-00125-106.png is bogus: page 125 "starting from the front cover" must be a right-hand side, but page 106 is surely a left-hand one. You got confused even with one file alone. What about handling hundreds of them at once?
I did not get confused, but your observation is correct, and I am revising where seq# starts. I first started my seq# scheme from the inside of the front cover, which is better given an even number. The front of the front cover should be given Seq#=1, the inside of the front cover Seq#=2, etc. One can think of the cover as a very thick leaf of the book, and any scan project should scan the covers anyway. (And maybe set Seq#=0 for the spine, which should also be scanned. It is important for bibliographic purposes to get scans of the outside covers and spine. I learned this in trying to date my particular copy of the Kama Sutra, where the spacing and color of the lettering on the spine are critical to determining both the edition and printing, and whether it is original or a pirate copy.)
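With the front of the front cover at Seq#=1, odd sequence numbers fall on right-hand sides and even ones on left-hand sides, which allows a simple QC check against the printed number. The sketch below is only a heuristic under stated assumptions: it assumes a left-to-right book whose publisher put odd page numbers on right-hand pages, it can only judge labels that are plain decimal numbers, and the function names are hypothetical.

```python
def side_from_seq(seq: int) -> str:
    """Map a sequence number to the side it should fall on.

    Assumes the revised numbering: spine = 0, front of front cover = 1,
    inside of front cover = 2, and so on.
    """
    if seq < 0:
        raise ValueError("sequence numbers start at 0 (the spine)")
    if seq == 0:
        return "spine"
    return "right" if seq % 2 == 1 else "left"


def parity_mismatch(seq: int, page_label: str) -> bool:
    """Flag a filename whose seq# and printed number disagree in parity.

    Returns False for labels that are not plain decimal numbers
    ("ix", "A2", "5-4"), since nothing can be inferred from them here.
    """
    if not (page_label.isascii() and page_label.isdigit()):
        return False
    # Assumption: odd printed numbers sit on right-hand pages, so the
    # printed number and the seq# should be both odd or both even.
    return seq % 2 != int(page_label) % 2


if __name__ == "__main__":
    print(side_from_seq(125))            # right
    print(parity_mismatch(125, "106"))   # True
    print(parity_mismatch(126, "106"))   # False
```

A flagged file is not necessarily wrong, since a publisher may number pages unconventionally; the check only points a human at filenames worth a second look.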
Because the filename is composed of two keys, the probability that a link to this file breaks is much higher than with either key alone.
As I noted before, for certain end-uses, one could do a conversion. I'm thinking of the work flow and scan set archiving stages.
It will be able to handle *any* bizarre pagination system the publisher/author dreamed up (the publisher could number the pages backwards for all we care, and this system will handle it without any complications -- yet we preserve the publisher-supplied page "number" in the filename which is important for referencing/citation.)
Bogus claim.
The publisher might put something in the page "number" that doesn't work as filename or url. What about page "4/2"? Makes a good filename, huh?
For these rare circumstances where the string contains a character disallowed in filenames (of course, we have to think internationally, too), it would be possible to use an escape character, as is done for URLs when they contain disallowed characters. Page referencing as found in written literature will likely use whatever system was used in the target book, such as "see 'Lust in the Dust' by John Rust, page 4/2, ...". It is wise that the exact string be preserved for linking purposes, rather than renaming it to something else and losing that information. Not recording the exact character string the publisher used to number (or id) a page is not good. If it is occasionally necessary to escape characters, then so be it. This is done all the time in URLs.
Jon
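As a rough illustration of the escaping idea, the sketch below borrows URL percent-encoding from Python's standard `urllib.parse` module to make a page string such as "4/2" safe for a filename while keeping it recoverable. Percent-encoding is one possible choice of escape scheme, not one settled in this thread, and the helper names and the ".png" default are hypothetical.

```python
from urllib.parse import quote, unquote


def escape_page_label(label: str) -> str:
    """Percent-encode every character that is not a letter, digit, or one of '_.-~'."""
    # safe="" makes quote() escape "/" as well, which it leaves alone by default.
    return quote(label, safe="")


def unescape_page_label(escaped: str) -> str:
    """Recover the exact string the publisher printed."""
    return unquote(escaped)


def build_scan_name(book_id: str, seq: int, label: str, ext: str = "png") -> str:
    """Assemble '<book_id>-<seq, zero-padded to 5 digits>-<escaped label>.<ext>'."""
    return f"{book_id}-{seq:05d}-{escape_page_label(label)}.{ext}"


if __name__ == "__main__":
    print(escape_page_label("4/2"))                  # 4%2F2
    print(unescape_page_label("4%2F2"))              # 4/2
    print(build_scan_name("DP0000239", 125, "4/2"))  # DP0000239-00125-4%2F2.png
```

Because the encoding is reversible, the publisher's original string can always be reconstructed from the filename for citation purposes; non-ASCII labels are encoded as UTF-8 bytes and decode back the same way.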