Re: [gutvol-d] Scan file naming -- another comment

25 Jul 2005

On Sunday 24 July 2005 07:17 pm, Marcello Perathoner wrote:
...
I guess you should really go and read my RFC before assuming it doesn't
accommodate something. It has front, body and cover streams and
accommodates up to a total of 26 page numbering streams. Unless you have
a book with more than 26 number sequences, you have nothing to complain of.
Do you know of some platform that sorts the alphabet in a different way?
No? Then sort order can be no problem in my scheme. (I intentionally
avoided mixed case prefixes.)
Yes, I do, and clearly so do others here. Thanks to those for saving me the 
trouble of looking up examples.
...
The only information you should get from the filename is what page the
file contains. My scheme does not use the filename to determine the
order of the pages. The multi-page djvu file keeps track of the order of
the pages. If the pages are numbered backwards, fine -- just insert them
backwards into the multi-page file. (And they will stay that way on
every platform too.)
Yes, but we're not talking about presentation formats, we're talking about 
storing data in archival formats to generate presentation formats. And 
consistency simplifies many of those tasks.
...
Disadvantages of your format:
- introduces artificial numbering sequence with completely arbitrary
relation to the printed page numbers.
This is only an issue if you feel that the printed page numbers must be in the 
file name or be the file name. It's almost as if you've never heard of or 
completely ignored the programmatic advantages of things like hashes or 
linked lists.
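
A minimal sketch of the hash idea, assuming sequentially named scans and a per-scan metadata record. The file names and the "file" and "printed" field names are hypothetical, invented here for illustration; neither proposal in this thread specifies them.

```python
# Hypothetical metadata: one record per scan, listed in physical order.
# "printed" is the page label as printed in the book, or None if the
# page carries no number (covers, blanks).
records = [
    {"file": "0001.tif", "printed": None},
    {"file": "0002.tif", "printed": "i"},
    {"file": "0003.tif", "printed": "ii"},
    {"file": "0004.tif", "printed": "1"},
    {"file": "0005.tif", "printed": "2"},
]


def build_index(records):
    """Map printed page label -> scan file name (a plain hash lookup)."""
    return {r["printed"]: r["file"] for r in records if r["printed"] is not None}


index = build_index(records)
second_body_page = index["2"]  # lookup by printed label, not by file name
```

The list keeps the physical order and the dictionary answers "which file holds page X", so neither has to be encoded in the file name.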
...
- breaks the law of least astonishment:  any user in their right mind who
wants to look at page 42 will instinctively go and open 42.tiff. Bumm!
But in your scheme, the end user won't see 42.tif in the raw data, they will 
see a cryptic melange of alphanumerics which contains 42 somewhere in it.
...
- does not accommodate all the scans without covers and blank pages we
already have gathered.
See below.
...
- is brittle and thus not adapted for archiving. The sequence changes if
somebody goes back and adds the cover page and blank page scans to an
existing set of scans.
This only applies to legacy data, and while it is true that adjustments would 
need to be made, they can be handled by tools which also update the metadata. 
It's not as if data inserts and record renumbering were virgin territory. Your 
"store it in the filename" convention is at least as brittle in this same 
respect, without the advantages of external metadata.
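
As a rough sketch of such a tool, assuming the scans are named by a zero-padded sequence number and described by a list of records (the naming pattern, the field names and the `insert_scan` helper are all hypothetical):

```python
def insert_scan(records, position, printed=None, width=4, ext="tif"):
    """Insert a new scan at `position` (0-based) and renumber the sequence.

    Returns (new_records, renames). `renames` lists (old_name, new_name)
    pairs, highest number first, so that applying them in order on disk
    never overwrites a file that has not been moved yet.
    """
    new = [dict(r) for r in records]          # leave the input untouched
    new.insert(position, {"file": None, "printed": printed})
    renames = []
    for i, rec in enumerate(new, start=1):
        name = f"{i:0{width}d}.{ext}"
        if rec["file"] is not None and rec["file"] != name:
            renames.append((rec["file"], name))
        rec["file"] = name
    renames.reverse()
    return new, renames


# A legacy scan set without its cover; the cover is added later.
legacy = [
    {"file": "0001.tif", "printed": "1"},
    {"file": "0002.tif", "printed": "2"},
]
updated, renames = insert_scan(legacy, 0, printed=None)
```

The printed labels travel with their records, so only the sequence numbers change and the metadata stays consistent with the files.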
...
How do you handle hardcovers and paperbacks? The paperback editions
usually have fewer pages around the covers than the hardcover but are
otherwise quite exact copies. Which edition will you follow?
Seems to me that the one which was scanned (and we mostly DO keep 
publisher/edition information at DP) would be the logical one to follow. 
That's a straw man issue, Marcello. Most of us know how to handle multiple 
editions of books.
...
And don't get me started about collections where portions are still
copyrighted and scanning those portions would be illegal. Every other
year some portion will drop into the public domain (assuming life + N)
and will have to be added to the scan set. A nightmare in your scheme
and no problem at all in mine.
Hardly a nightmare at all. It is in fact the very reason to store the metadata 
OUTSIDE of the filename: it is volatile, even though you claim in several 
places that the filename under your scheme is permanent. 
In my scheme, you would simply have a field which indicated the date the 
information on the page went into public domain (including future) and the 
tools can automatically insert either the page, or a placeholder image 
explaining that the page isn't in PD until X. Since PG is legally a library, 
it is fine for them to store (but not distribute) copyrighted material. 
(IANAL) My suggestion would allow automatic inclusion of such material as it 
became PD without any intervention. Yours doesn't.
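
A minimal sketch of that date field, assuming each page record carries an optional `pd_date` (a hypothetical field name) and that a notice image called `placeholder.tif` (also hypothetical) exists:

```python
from datetime import date

PLACEHOLDER = "placeholder.tif"  # hypothetical "not yet in the PD" notice


def pages_for_distribution(records, today):
    """Return, in order, the file to serve for each page on `today`."""
    served = []
    for rec in records:
        pd_date = rec.get("pd_date")  # None means already public domain
        if pd_date is None or pd_date <= today:
            served.append(rec["file"])
        else:
            served.append(PLACEHOLDER)
    return served


collection = [
    {"file": "0001.tif", "pd_date": None},
    {"file": "0002.tif", "pd_date": date(2030, 1, 1)},
    {"file": "0003.tif", "pd_date": date(2000, 1, 1)},
]
now = pages_for_distribution(collection, date(2005, 7, 25))
later = pages_for_distribution(collection, date(2030, 1, 1))
```

Nothing is renamed or re-archived when a page enters the public domain; the same records simply produce a different result once the date has passed.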
...
Your format is very simple, but too simple in many points. As Einstein
said: make it as simple as possible but not simpler.
Einstein also said "We can't solve problems by using the same kind of thinking 
we used when we created them." PG used to store file version information in 
the cryptic 8.3 filenames. You wish to store page numbering information in 
filenames. I can't see what makes your scheme any less susceptible to the 
problems that were (eventually) realized with the former.
...
...
2) Create a corresponding metadata file named with a 1:1 correspondence
to hold the other (in this case) numbering information about the scanned
image. Advantages: Any additional information about the image file is
trivially associated, modifiable and extensible. It could be loaded in a
database or converted to XML or other formats trivially, making
implementation of meaningful searching that much easier.
My RFC is about people who want to click on a link and have the right
page open in their browsers. They don't want to fiddle with XML or
databases just to look at some scans. And the browser surely will not
look into any XML file to find out which url it should request.
You appear to be talking at cross purposes with yourself. The archive format 
and the presentation format are completely separate things, though in a sense 
they do drive certain requirements of each other.

The end user wouldn't have to "fiddle with XML" under my scheme. The data 
representation I suggested is for the archive of image scans and data, not 
the presentation of the data. Your error is the assumption that the raw data 
would be what is presented to the user.

In my scheme the scans and metadata are stored simply, to reduce the 
programming effort required to deliver files in whatever format is desired: 
for each format, a routine is coded to assemble the data accordingly. No 
reader ever need see the metadata. You're attacking a strawman of your own 
creation, presumably because I said "XML", though I never said that it 
would/should ever be in that particular format. In fact, were it not for the 
fact that PG's database doesn't seem to support very many simultaneous 
connections, I'd recommend that the metadata be stored there (i.e., in a 
database).
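
To illustrate the database variant, here is a sketch using an in-memory SQLite table. The `scans` table, its columns and the `assemble` routine are assumptions made up for this example and say nothing about how PG's actual database is laid out.

```python
import sqlite3


def load_metadata(rows):
    """Load (seq, file, printed) rows into an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE scans (seq INTEGER PRIMARY KEY, file TEXT, printed TEXT)"
    )
    conn.executemany("INSERT INTO scans VALUES (?, ?, ?)", rows)
    return conn


def assemble(conn):
    """Return (label, file) pairs in reading order for a presentation format."""
    cursor = conn.execute("SELECT printed, file FROM scans ORDER BY seq")
    return [(printed or "[unnumbered]", name) for printed, name in cursor]


rows = [
    (1, "0001.tif", None),   # cover, no printed number
    (2, "0002.tif", "i"),
    (3, "0003.tif", "1"),
    (4, "0004.tif", "2"),
]
conn = load_metadata(rows)
reading_order = assemble(conn)
```

A routine like `assemble` is the only place that touches the metadata; whatever it emits (an HTML index, a multi-page file, a list of links) is what the reader actually sees.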
...
Also it's far simpler to keep the real page number in the filename and
store the information about the sequence in metadata than the other way
round. Djvu files support metadata. You can put all sorts of metadata
into the djvu file and nobody will complain. You can build any amount of
metadata processing software around my proposed format.
Far simpler how? Filenames should not be data repositories, though they can be 
abused as such. Data which you wish to manipulate in arbitrary fashion is 
better stored in a format or schema which supports random access natively. 
See above.
...
LOL. You did fire all your heavy guns at the pumpkin patch again. My
filenames are just designed to be:
- mnemonic,
  - unique per book and
  - permanent.
To which I counter: arbitrary and unintelligible, irrelevant if they are in 
the directory structure of the ebook (say scans/), and inflexible.

These are not "heavy guns," though it is a different viewpoint from yours.
I have no idea what you mean about a "pumpkin patch," and have never exhibited 
a desire to engage in armed combat with gourds, decorative, edible, or any 
other arbitrary vegetable and/or fruit. :)
...
No ordering is derived from the filenames at all. The djvu file keeps
the ordering. And my RFC doesn't even prescribe you what to do with
empty pages. You may drop them or keep them at will (replaced by a
notice, that is).
And so could the software which assembles the presentation format that the end 
user sees, just as I described above concerning interspersed non-PD material.
...
I'd have been very glad if PG had stored the etext no. in the filename for
the books before #10k, instead of making up a completely arbitrary and
unintelligible string, so I had to go thru GUTINDEX to find out what was
what. That would have saved me weeks of programming. And now you come
along and propose the same error again: to use a completely arbitrary
number as filename instead of the real page number and to have to grovel
thru an XML file to find out which file to open for page 42.
Already addressed above, though you still disregard that you're making the 
exact same mistake by storing everything in the filename.

As my Canadian friends might say, I'm sorry I don't have anything to apologize 
for today. I'm just trying to point out alternatives, and the strengths and 
weaknesses of various approaches, as I see them. It's not personal.

D Garcia