Re: [gutvol-d] Scan file naming -- another comment

24 Jul 2005

      D Garcia wrote:
...
Books almost never start with page 1, even in the arabic numbered sections. 
Where's your front matter? Back matter? A book which starts at 1 and ends at 
(1+n) and has no other numbering in it is likely the exception rather than 
the rule, so it is unrealistic to design a scheme to address only this 
special case.

I guess you should really go and read my RFC before assuming it doesn't 
accommodate something. It has front, body and cover streams and 
accommodates up to a total of 26 page numbering streams. Unless you have 
a book with more than 26 number sequences, you have nothing to complain about.
...
1) Store the physical sequence of the pages, front cover to back cover by 
simply naming the scan files ascending from 1. Advantages: Sort order 
guaranteed identical across platforms, no parsing of file name segments 
required to determine information about the file.

Do you know of some platform that sorts the alphabet in a different way? 
No? Then sort order can be no problem in my scheme. (I intentionally 
avoided mixed case prefixes.)

The only information you should get from the filename is what page the 
file contains. My scheme does not use the filename to determine the 
order of the pages. The multi-page djvu file keeps track of the order of 
the pages. If the pages are numbered backwards, fine -- just insert them 
backwards into the multi-page file. (And they will stay that way on 
every platform too.)
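
To make the split concrete, here is a minimal sketch in Python. It is not taken from the RFC: the single-letter prefixes ("c", "f", "b"), the four-digit padding and the .tiff extension are assumptions chosen for illustration, and the plain list only stands in for the page order that the multi-page djvu file would keep.

```python
# Hypothetical prefixes, one lowercase letter per numbering stream.
# The actual RFC may assign letters differently.
STREAM_PREFIX = {"cover": "c", "front": "f", "body": "b"}


def scan_name(stream, page, ext="tiff"):
    """Build a filename that says only which printed page the file holds."""
    return f"{STREAM_PREFIX[stream]}{page:04d}.{ext}"


# The reading order is stored explicitly (here a list standing in for the
# multi-page djvu file). It is never derived by sorting the filenames.
reading_order = [
    scan_name("cover", 1),
    scan_name("front", 1),
    scan_name("front", 2),
    scan_name("body", 1),
    scan_name("body", 2),
]

if __name__ == "__main__":
    print(scan_name("body", 42))
    print(reading_order)
```

Sorting these names alphabetically would put the body before the cover, which is harmless here because nothing relies on the sort.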

Disadvantages of your format:

- introduces artificial numbering sequence with completely arbitrary 
relation to the printed page numbers.

- breaks the law of least astonishment: any user in their right mind who 
wants to look at page 42 will instinctively go and open 42.tiff. Boom!

- does not accommodate all the scans without covers and blank pages we 
already have gathered.

- is brittle and thus not adapted for archiving. The sequence changes if 
somebody goes back and adds the cover page and blank page scans to an 
existing set of scans.
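
The brittleness point can be shown with a small Python sketch. The page labels and the bare `<label>.tiff` naming are simplifications made up for this example, not either proposal's exact format.

```python
def sequential_names(pages):
    """Name each scan by its physical position, counting from 1."""
    return {page: f"{i}.tiff" for i, page in enumerate(pages, start=1)}


def page_based_names(pages):
    """Name each scan after the label of the page it contains."""
    return {page: f"{page}.tiff" for page in pages}


def renamed_pages(naming, before, after):
    """List the pages whose filename changes when the scan set grows."""
    old, new = naming(before), naming(after)
    return [page for page in before if old[page] != new[page]]


# An existing scan set, then the same set with cover and blank page added.
before = ["1", "2", "3"]
after = ["cover", "blank", "1", "2", "3"]

if __name__ == "__main__":
    print(renamed_pages(sequential_names, before, after))
    print(renamed_pages(page_based_names, before, after))
```

Under sequential naming every existing file gets a new name once the two scans are added at the front; under page-based naming none does.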

How do you handle hardcovers and paperbacks? The paperback editions 
usually have fewer pages around the covers than the hardcover but are 
otherwise quite exact copies. Which edition will you follow?

And don't get me started about collections where portions are still 
copyrighted and scanning those portions would be illegal. Every other 
year some portion will drop into the public domain (assuming life + N) 
and will have to be added to the scan set. A nightmare in your scheme 
and no problem at all in mine.

Your format is very simple, but too simple in many points. As Einstein 
said: make it as simple as possible but not simpler.
...
2) Create a corresponding metadata file named with a 1:1 correspondence to 
hold the other (in this case) numbering information about the scanned image. 
Advantages: Any additional information about the image file is trivially 
associated, modifiable and extendible. It could be loaded in a database or 
converted to XML or other formats trivially, making implementation of 
meaningful searching that much easier.

My RFC is about people who want to click on a link and have the right 
page open in their browsers. They don't want to fiddle with XML or 
databases just to look at some scans. And the browser surely will not 
look into any XML file to find out which url it should request.

Also, it's far simpler to keep the real page number in the filename and 
store the information about the sequence in metadata than the other way 
round. Djvu files support metadata. You can put all sorts of metadata 
into the djvu file and nobody will complain. You can build any amount of 
metadata processing software around my proposed format.
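
As a rough illustration of that division of labour, the Python sketch below opens a page straight from its number and consults the metadata only for sequence questions. The JSON manifest is merely a stand-in for whatever metadata the djvu file would carry, and the "b" prefix, the zero padding and the function names are assumptions made for this example.

```python
import json

# Stand-in for sequence metadata; a real setup would keep this in the
# djvu file. The filenames follow a hypothetical prefix-plus-number scheme.
manifest_text = json.dumps(
    {"order": ["c0001.tiff", "f0001.tiff", "b0001.tiff", "b0002.tiff"]}
)


def body_page_file(page):
    """Opening a given page needs no metadata at all, just its number."""
    return f"b{page:04d}.tiff"


def next_file(manifest, current):
    """Only sequence questions (what comes next?) consult the metadata."""
    order = json.loads(manifest)["order"]
    position = order.index(current)
    return order[position + 1] if position + 1 < len(order) else None


if __name__ == "__main__":
    print(body_page_file(2))
    print(next_file(manifest_text, "f0001.tiff"))
```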
...
Marcello, in your role at PG, you should realize more than many others that 
storing data in a file name makes it less accessible to programs which could 
automate much of the common work of maintaining a dataset. It's not 
impossible to manipulate, but it is cumbersome and has sort order and case 
issues across platforms.

LOL. You did fire all your heavy guns at the pumpkin patch again. My 
filenames are just designed to be:

  - mnemonic,
  - unique per book and
  - permanent.

No ordering is derived from the filenames at all. The djvu file keeps 
the ordering. And my RFC doesn't even prescribe what to do with 
empty pages. You may drop them or keep them at will (replaced by a 
notice, that is).

I'd have been very glad if PG had stored the etext no. in the filename 
for the books before #10k, instead of making up a completely arbitrary 
and unintelligible string. As it was, I had to go thru GUTINDEX to find 
out what was what. Storing the number would have saved me weeks of 
programming. And now you come 
along and propose the same error again: to use a completely arbitrary 
number as filename instead of the real page number and to have to grovel 
thru an XML file to find out which file to open for page 42.

-- 
Marcello Perathoner
webmaster@gutenberg.org