
On Sunday 24 July 2005 07:17 pm, Marcello Perathoner wrote:
I guess you should really go and read my RFC before assuming it doesn't accommodate something. It has front, body and cover streams and accommodates up to a total of 26 page numbering streams. Unless you have a book with more than 26 number sequences, you have nothing to complain of. Do you know of some platform that sorts the alphabet in a different way? No? Then sort order can be no problem in my scheme. (I intentionally avoided mixed case prefixes.)
Yes, I do, and clearly so do others here. Thanks to those for saving me the trouble of looking up examples.
The only information you should get from the filename is what page the file contains. My scheme does not use the filename to determine the order of the pages. The multi-page djvu file keeps track of the order of the pages. If the pages are numbered backwards, fine -- just insert them backwards into the multi-page file. (And they will stay that way on every platform too.)
Yes, but we're not talking about presentation formats, we're talking about storing data in archival formats to generate presentation formats. And consistency simplifies many of those tasks.
Disadvantages of your format:
- introduces artificial numbering sequence with completely arbitrary relation to the printed page numbers.
This is only an issue if you feel that the printed page numbers must be in the file name or be the file name. It's almost as if you've never heard of or completely ignored the programmatic advantages of things like hashes or linked lists.
- breaks the law of least astonishment: any user in their right mind who wants to look at page 42 will instinctively go and open 42.tiff. Bumm!
But in your scheme, the end user won't see 42.tif in the raw data, they will see a cryptic melange of alphanumerics which contains 42 somewhere in it.
- does not accommodate all the scans without covers and blank pages we already have gathered.
See below.
- is brittle and thus not adapted for archiving. The sequence changes if somebody goes back and adds the cover page and blank page scans to an existing set of scans.
This only applies to legacy data, and while it is true that adjustments would need to be made, they can be handled by tools which also update the metadata. It's not as if data inserts and record renumbering were virgin territory. Your "store it in the filename" convention is at least as brittle in this same respect, without the advantages of external metadata.
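To make the "handled by tools" point concrete, here is a minimal sketch of such an insert. Everything in it is an assumption for illustration: the record layout (`seq`, `printed`), the zero-padded `NNNN.tif` naming, and the function names are hypothetical, not part of either proposal.

```python
def name_for(seq):
    # Hypothetical naming convention: zero-padded sequence number.
    return f"{seq:04d}.tif"

def insert_scan(records, position, printed=""):
    """Insert a new scan at 1-based `position`.

    `records` is a list of dicts with 'seq' and 'printed' keys.
    Returns (updated_records, renames); the input list is not modified.
    """
    updated, renames = [], []
    for rec in sorted(records, key=lambda r: r["seq"]):
        if rec["seq"] >= position:
            # Pages at or after the insert point shift up by one.
            renames.append((name_for(rec["seq"]), name_for(rec["seq"] + 1)))
            updated.append({**rec, "seq": rec["seq"] + 1})
        else:
            updated.append(dict(rec))
    updated.append({"seq": position, "printed": printed})
    updated.sort(key=lambda r: r["seq"])
    # Apply renames highest-first so no file overwrites its neighbour.
    renames.reverse()
    return updated, renames
```

Adding a cover scan in front of a two-page set would return the shifted records plus a rename plan, so the files and the metadata are updated in one step rather than by hand.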
How do you handle hardcovers and paperbacks? The paperback editions usually have fewer pages around the covers than the hardcover but are otherwise quite exact copies. Which edition will you follow?
Seems to me that the one which was scanned (and we mostly DO keep publisher/edition information at DP) would be the logical one to follow. That's a straw man issue, Marcello. Most of us know how to handle multiple editions of books.
And don't get me started about collections where portions are still copyrighted and scanning those portions would be illegal. Every other year some portion will drop into the public domain (assuming life + N) and will have to be added to the scan set. A nightmare in your scheme and no problem at all in mine.
Hardly a nightmare at all. It is in fact the very reason to store the metadata OUTSIDE of the filename: it is volatile, even though you claim in several places that the filename under your scheme is permanent. In my scheme, you would simply have a field which indicates the date on which the material on the page enters the public domain (including future dates), and the tools can automatically insert either the page or a placeholder image explaining that the page isn't in PD until X. Since PG is legally a library, it is fine for them to store (but not distribute) copyrighted material. (IANAL) My suggestion would allow automatic inclusion of such material as it became PD without any intervention. Yours doesn't.
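A rough sketch of that public-domain field, under stated assumptions: the field name `public_domain_on`, the dict-based records, and the placeholder text are all invented here for illustration.

```python
from datetime import date

def resolve_page(record, today):
    """Return the image to publish, or a placeholder notice if still copyrighted."""
    pd_date = record.get("public_domain_on")  # hypothetical field: date or None
    if pd_date is None or pd_date <= today:
        return record["file"]
    return f"placeholder: not in the public domain until {pd_date.isoformat()}"

def assemble(records, today):
    """Build the page list for a presentation format, in sequence order."""
    ordered = sorted(records, key=lambda rec: rec["seq"])
    return [resolve_page(rec, today) for rec in ordered]
```

Run with today's date, `assemble` emits placeholders for the restricted pages; run again after the stored date has passed, the same records yield the real scans with no edits to the archive.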
Your format is very simple, but too simple in many points. As Einstein said: make it as simple as possible but not simpler.
Einstein also said "We can't solve problems by using the same kind of thinking we used when we created them." PG used to store file version information in the cryptic 8.3 filenames. You wish to store page numbering information in filenames. I can't see what makes your scheme any less susceptible to the problems that were (eventually) realized with the former.
2) Create a corresponding metadata file named with a 1:1 correspondence to hold the other (in this case) numbering information about the scanned image. Advantages: Any additional information about the image file is trivially associated, modifiable and extendible. It could be loaded in a database or converted to XML or other formats trivially, making implementation of meaningful searching that much easier.
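As an illustration only, the 1:1 metadata file could look something like the sketch below. JSON as the sidecar format, the keys `image`, `seq` and `printed`, and the table layout are assumptions made for the example; the proposal itself does not fix a format.

```python
import json
import sqlite3
from pathlib import PurePath

def sidecar_name(image_name):
    # 1:1 correspondence: same stem as the image, different extension.
    return str(PurePath(image_name).with_suffix(".json"))

def load_sidecars(sidecars):
    """Load a mapping of sidecar file name -> JSON text into an in-memory database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE page (image TEXT PRIMARY KEY, seq INTEGER, printed TEXT)"
    )
    for name, text in sidecars.items():
        meta = json.loads(text)
        if name != sidecar_name(meta["image"]):
            raise ValueError(f"{name} does not match image {meta['image']}")
        db.execute(
            "INSERT INTO page VALUES (?, ?, ?)",
            (meta["image"], meta["seq"], meta["printed"]),
        )
    return db

def image_for_printed_page(db, printed):
    """Find which scan holds a given printed page number, or None."""
    row = db.execute(
        "SELECT image FROM page WHERE printed = ?", (printed,)
    ).fetchone()
    return row[0] if row else None
```

With the records in a table, "which file holds printed page 42" becomes a single query made by the tools, not something a reader has to work out from a directory listing.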
My RFC is about people who want to click on a link and have the right page open in their browsers. They don't want to fiddle with XML or databases just to look at some scans. And the browser surely will not look into any XML file to find out which url it should request.
You appear to be talking at cross purposes with yourself. The archive format and the presentation format are completely separate things, though in a sense they do drive certain requirements of each other. The end user wouldn't have to "fiddle with XML" under my scheme. The data representation I suggested is for the archive of image scans and data, not the presentation of the data. Your error is the assumption that the raw data would be what is presented to the user. The scans and metadata are stored simply in my scheme to reduce the programming effort required to deliver files in whatever format is desired: someone codes a routine to assemble the data in that format, and that is all. No reader ever need see the metadata. You're attacking a strawman of your own creation, presumably because I said "XML", though I never said that it would or should be in that particular format. In fact, were it not for the fact that PG's database doesn't seem to support very many simultaneous connections, I'd recommend that the metadata be stored there (i.e., in a database).
Also it's far simpler to keep the real page number in the filename and store the information about the sequence in metadata than the other way round. Djvu files support metadata. You can put all sorts of metadata into the djvu file and nobody will complain. You can build any amount of metadata processing software around my proposed format.
Far simpler how? Filenames should not be data repositories, though they can be abused as such. Data which you wish to manipulate in arbitrary fashion is better stored in a format or schema which supports random access natively. See above.
LOL. You did fire all your heavy guns at the pumpkin patch again. My filenames are just designed to be:
- mnemonic,
- unique per book and
- permanent.
To which I counter: arbitrary and unintelligible, irrelevant if they are in the directory structure of the ebook (say scans/), and inflexible. These are not "heavy guns," though it is a different viewpoint from yours. I have no idea what you mean about a "pumpkin patch," and have never exhibited a desire to engage in armed combat with gourds, decorative, edible, or any other arbitrary vegetable and/or fruit. :)
No ordering is derived from the filenames at all. The djvu file keeps the ordering. And my RFC doesn't even prescribe you what to do with empty pages. You may drop them or keep them at will (replaced by a notice, that is).
And so could the software which assembles the presentation format that the end user sees, just as I described above concerning interspersed non-PD material.
I'd have been very glad if PG had stored the etext no. in the filename for the books before #10k, instead of making up a completely arbitrary and unintelligible string, so I had to go thru GUTINDEX to find out what was what. That would have saved me weeks of programming. And now you come along and propose the same error again: to use a completely arbitrary number as filename instead of the real page number and to have to grovel thru an XML file to find out which file to open for page 42.
Already addressed above, though you still disregard that you're making the exact same mistake by storing everything in the filename. As my Canadian friends might say, I'm sorry I don't have anything to apologize for today. I'm just trying to point out alternatives, and the strengths and weaknesses of various approaches, as I see them. It's not personal.