
D Garcia wrote:
Books almost never start with page 1, even in the arabic numbered sections. Where's your front matter? Back matter? A book which starts at 1 and ends at (1+n) and has no other numbering in it is likely the exception rather than the rule, so it is unrealistic to design a scheme to address only this special case.
I guess you should really go and read my RFC before assuming it doesn't accommodate something. It has front, body and cover streams and accommodates up to a total of 26 page numbering streams. Unless you have a book with more than 26 number sequences, you have nothing to complain about.
1) Store the physical sequence of the pages, front cover to back cover by simply naming the scan files ascending from 1. Advantages: Sort order guaranteed identical across platforms, no parsing of file name segments required to determine information about the file.
Do you know of some platform that sorts the alphabet in a different way? No? Then sort order cannot be a problem in my scheme. (I intentionally avoided mixed case prefixes.)

The only information you should get from the filename is what page the file contains. My scheme does not use the filename to determine the order of the pages. The multi-page djvu file keeps track of the order of the pages. If the pages are numbered backwards, fine -- just insert them backwards into the multi-page file. (And they will stay that way on every platform too.)

Disadvantages of your format:

- introduces an artificial numbering sequence with a completely arbitrary relation to the printed page numbers.
- breaks the law of least astonishment: any user in their right mind who wants to look at page 42 will instinctively go and open 42.tiff. Bumm!
- does not accommodate all the scans without covers and blank pages that we have already gathered.
- is brittle and thus not suited to archiving. The sequence changes if somebody goes back and adds the cover page and blank page scans to an existing set of scans.

How do you handle hardcovers and paperbacks? The paperback editions usually have fewer pages around the covers than the hardcover but are otherwise quite exact copies. Which edition will you follow?

And don't get me started about collections where portions are still copyrighted and scanning those portions would be illegal. Every other year some portion will drop into the public domain (assuming life + N) and will have to be added to the scan set. A nightmare in your scheme and no problem at all in mine.

Your format is very simple, but too simple in many points. As Einstein said: make it as simple as possible but not simpler.
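To make the brittleness point concrete, here is a minimal sketch in Python. The function names and the labels "cover" and "blank" are made up for illustration only and are not taken from the RFC. It compares what happens to the file holding printed page 1 when cover and blank page scans are added to an existing set.

```python
def sequential_names(pages):
    """Name each scan by its physical position in the book, counting from 1."""
    return {label: f"{position}.tiff"
            for position, label in enumerate(pages, start=1)}


def label_names(pages):
    """Name each scan after the label of the page it contains."""
    return {label: f"{label}.tiff" for label in pages}


# Hypothetical scan sets: the first pass skipped the cover and a blank page.
first_pass = ["1", "2", "3"]
second_pass = ["cover", "blank", "1", "2", "3"]

# Physical-sequence naming: printed page 1 moves from 1.tiff to 3.tiff.
seq_before = sequential_names(first_pass)
seq_after = sequential_names(second_pass)
print(seq_before["1"], "->", seq_after["1"])

# Label-based naming: printed page 1 stays in 1.tiff.
lab_before = label_names(first_pass)
lab_after = label_names(second_pass)
print(lab_before["1"], "->", lab_after["1"])
```

In the sequential variant every existing file has to be renamed, and every link pointing at it breaks, as soon as a page is added in front of it.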
2) Create a corresponding metadata file named with a 1:1 correspondence to hold the other (in this case) numbering information about the scanned image. Advantages: Any additional information about the image file is trivially associated, modifiable and extendible. It could be loaded in a database or converted to XML or other formats trivially, making implementation of meaningful searching that much easier.
My RFC is about people who want to click on a link and have the right page open in their browsers. They don't want to fiddle with XML or databases just to look at some scans. And the browser surely will not look into any XML file to find out which URL it should request. Also, it's far simpler to keep the real page number in the filename and store the information about the sequence in metadata than the other way round. Djvu files support metadata. You can put all sorts of metadata into the djvu file and nobody will complain. You can build any amount of metadata processing software around my proposed format.
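A rough sketch of that split, in Python: the filename carries only the stream and the printed page number, and the reading order lives in a separate list (standing in here for the page order kept by the multi-page djvu file). The prefixes "c", "f" and "b", the filename pattern and the names `scan_filename` and `reading_order` are assumptions for illustration; the RFC itself defines the actual prefixes and pattern.

```python
import string

# Assumption: one lowercase letter per numbering stream, hence at most 26.
VALID_STREAMS = set(string.ascii_lowercase)


def scan_filename(stream: str, page: int, ext: str = "tiff") -> str:
    """Build a filename from a stream prefix and a printed page number.

    The pattern <prefix><number>.<ext> is hypothetical.
    """
    if stream not in VALID_STREAMS:
        raise ValueError(f"unknown stream prefix: {stream!r}")
    if page < 1:
        raise ValueError("page numbers start at 1")
    return f"{stream}{page}.{ext}"


# The order is stored apart from the names; it is never derived by sorting.
# Hypothetical book: front cover, two front matter pages, two body pages,
# back cover.
reading_order = [("c", 1), ("f", 1), ("f", 2), ("b", 1), ("b", 2), ("c", 2)]
files_in_order = [scan_filename(stream, page) for stream, page in reading_order]

# Finding body page 42 needs no lookup table at all.
print(scan_filename("b", 42))
print(files_in_order)
```

Adding or dropping a page changes only the order list. No existing filename changes.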
Marcello, in your role at PG, you should realize more than many others that storing data in a file name makes it less accessible to programs which could automate much of the common work of maintaining a dataset. It's not impossible to manipulate, but it is cumbersome and has sort order and case issues across platforms.
LOL. You fired all your heavy guns at the pumpkin patch again. My filenames are just designed to be:

- mnemonic,
- unique per book and
- permanent.

No ordering is derived from the filenames at all. The djvu file keeps the ordering. And my RFC doesn't even prescribe what to do with empty pages. You may drop them or keep them at will (replaced by a notice, that is).

I would have been very glad if PG had stored the etext no. in the filename for the books before #10k, instead of making up a completely arbitrary and unintelligible string, so that I had to go through GUTINDEX to find out what was what. That would have saved me weeks of programming.

And now you come along and propose the same error again: to use a completely arbitrary number as the filename instead of the real page number, and to have to grovel through an XML file to find out which file to open for page 42.

--
Marcello Perathoner
webmaster@gutenberg.org