al said:
> Basic format:
>
> The prefix for the cover pages is: "c".
> The prefix for the roman pages is: "f".
> The prefix for the arabic pages is: "p".
>
> ***
>
> For blank pages there should be no file and
> the page number should be skipped.
> Optionally an image saying:
> "This page is blank in the original."
> may be inserted.
>
> ***
>
> Example of file naming:
>
> front cover c0001.png
> back cover c0002.png
> spine c0003.png
>
> i title page f0001.png
> ii title verso f0002.png
> iii dedication f0003.png
> iv is blank
> v contents f0005.png
>
> page 1 p0001.png
> page 2 p0002.png
> image on page 2 p0002-image1.png
> image on page 2 p0002-image2.png
> page 3 p0003.png
> page 4 is blank
> page 5 p0005.png
> ... ...
> page 9999 p9999.png
dkretz said:
> So far this "spec" seems to be primarily a legend.
> Is it documented anywhere?
al said:
> No. It was developed and used by Joshua Hutchinson
dkretz said:
> Here's an extensive forum thread on DP
> where we hashed this all out.
oh lord.
***
where do i begin?
seriously, this is such a mess.
where do i begin?
***
well, to start with the last comment first,
this wasn't "hashed out" at all. it was just
messed up, because josh and marcello are
too stubborn to take good advice from me.
and on a more general level, this all shows
that d.p. and p.g. can mess things up even
when they actually try to do the right thing.
***
so let's go back and examine the problems...
***
we'll need to start with a short history lesson.
years back, there was a push to get the scans
hosted at p.g. with the text, and p.g. said ok.
but when people started posting their scans,
i noticed they had been named very stupidly.
most stupid was that the filenames contained
_numbers_ that were _not_ the _pagenumbers_.
thus the file for page 123 might be "0128.png".
this didn't surprise me, because d.p. has been
naming their scans stupidly for many years...
i'd tried to wise them up, but they didn't listen.
but it's one thing to name _your_ files stupidly,
since you're the only one who works with 'em,
so you're the only one who pays the penalties
of the big costs that stupid filenames impose.
it is quite _another_ thing to name files that
you post in public using a stupid convention,
because the _public_ works with those files...
luckily, the most insane position did not prevail.
p.g. required that all scans must be named
using the same number as the pagenumber.
for a while, anyway, some d.p. people would
rename all the scan-files so they could then
be posted to p.g. yes, it's stupid to work with
stupidly-named files, because you pay all the
penalties of working with stupidly-named files,
only to rename them to smarter names _after_
you're done working with them, but that's what
d.p. was doing. for a little while. until it fizzled.
the good news is that most scans at p.g. are
named with a number that's the pagenumber.
the bad news is that the renaming requirement
essentially means not many scans get posted...
the ugly news is that the names are _still_not_
really intelligent. they're not _moronic_, but
they're not very intelligent either, not at all...
on an i.q. scale, they'd weigh in at about 87.
thus ends our history lesson to set context...
***
ok, what comes next?
first of all, let's remember the philosophy
that should be a fundamental cornerstone
of _any_ intelligent filenaming convention...
one important principle (the first?) which should
be at work here is that every filename is _unique._
that is, _each_and_every_ file should have a name
that identifies _that_file_ separate from all others.
now, there might be some cases where the same
file might have different names in different places.
(some would argue that; let's put that off for now.)
but an _iron-clad_rule,_ with _no_ exception, is that
different files must always have different names.
to say it another way, different files must _never_
have the same name. _never_, _never_, _never_.
so right at the _very_outset_, the dp/pg model
has failed us... all of their files are named with
the same p0001.png-p9999.png convention and
thus fail to meet the imperative to be _unique._
how can we tell one file named p0001.png from
_every_other_ file named p0001.png? we cannot.
and since every book has a p0001.png file, _bad_.
this isn't rocket-science. it's common sense.
_different_files_should_have_different_names!_
we're back in the same old boat where we need
to pay heed to the subdirectory name to know
with certainty which book each file represents.
if the filenames were unique, we could place
every one of our files in a single subdirectory,
and we would have no filename crashes and
we could identify each file as a unique entity,
just from its name, without looking inside it.
i mean, it's great that we know that p0001.png
is a scan of a page that was numbered as page 1
in the book in which it appeared, but the filename
doesn't tell us _which_ book that was, so we are
left out in the cold on the very first step we take.
how sad... how utterly and thoroughly pathetic...
***
to make my filenames _unique_ to a particular book,
i give each scan in a book a 5-letter unique prefix...
so, for the "sitka" book we've been analyzing lately,
the 5-letter prefix for all the filenames is "sitka"...
in case you're wondering, a 5-letter prefix gives us
26**5 possibilities for unique ones, which computes
to nearly 12 million possibilities. 11,881,376, to be exact,
but some of those might be voided as unusable...
if you feel a need to be able to label more books,
a 6-letter prefix gives 308,915,776. (308+ million.)
a 7-letter prefix gives 8 billion. 8-letter, 208 billion.
let me know when you've got 208 billion documents.
til then, an 8-letter prefix will work just fine, thanks.
indeed, i'm happy with a 5-letter prefix at the moment.
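if you'd like to check that arithmetic yourself,
here is a small python sketch. it assumes that
prefixes use only the 26 lower-case letters, and
the function name is made up for illustration.

```python
def prefix_count(length, alphabet_size=26):
    """number of distinct prefixes of the given length."""
    return alphabet_size ** length

# print the count for each prefix length discussed above
for n in (5, 6, 7, 8):
    print(n, format(prefix_count(n), ","))
```

it ignores any prefixes you might void as unusable,
so treat the numbers as an upper bound.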
***
ok, so let's go on...
al said:
> The prefix for the cover pages is: "c".
> The prefix for the roman pages is: "f".
> The prefix for the arabic pages is: "p".
the "c", "f", and "p" convention is one i created...
thankfully, this model was adopted by dp/pg.
but there was a _reason_ i picked those letters,
a good reason, and -- when it came to details --
dp/pg again screwed up with its implementation.
the "p" stands for "page", and that's obvious.
and "c" for "cover" is the obvious choice too.
but some people suggested the front-matter
should have an "r" prefix, for "roman numbers".
know why i rejected "r" in favor of "f", do you?
think about it for a minute, and see if you know.
if you said i chose "f" to stand for "front-matter"
or "forward-matter", you got an "f" on this quiz.
it's a nice mnemonic, sure, but the real reason
why i chose "f" is a much more pragmatic one...
(know any other words that start with "mne"
besides "mnemonic"? so what is its origin?)
so, did you think of the answer why i used "f"?
to explain why, think back to when i said that
-- in coding your app and getting a "map" of
the files within any specific book by reading
the directory to see what files were there --
a vital component of that strategy will be that
the filenames _sort_in_the_order_they_appear._
that is, we need to know not just the files that
comprise the book, but their appearance order.
so i chose "f" for front-matter pages because
those pages appear between "c" and "p" pages
-- the cover and the arabic-numbered pages --
so the prefix needed to fall between "c" and "p".
and "f" worked just fine.
you should also keep in mind that the letters
"d" and "e" can be used between "c" and "f",
if the idiosyncrasies of a certain book need it.
likewise, there are lots of letters that can be
used between "f" and "p", if a book needs 'em.
and similarly, there are lots of letters _after_
"p" that can be used, for material that might
come _after_ regular arabic-sequence "pages".
but yeah, that's why i chose "f" instead of "r"...
it was so the filenames would _sort_ correctly.
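a quick python sketch shows the difference.
the filenames here are hypothetical ones in the
"sitka" style, and it assumes a plain character-
by-character sort, which is what python's
sorted() does with strings.

```python
# hypothetical scans, listed in a scrambled order
scans = ["sitkap002.png", "sitkaf001.png", "sitkac001.png",
         "sitkap001.png", "sitkaf002.png"]

# with "f" for front-matter: cover, front-matter, then pages
print(sorted(scans))

# same files, but with "r" for the roman-numbered pages:
# the front-matter now sorts _after_ the arabic pages
with_r = [name.replace("sitkaf", "sitkar") for name in scans]
print(sorted(with_r))
```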
***
and speaking along these lines, it's just plain silly
that dp/pg pads their pagenumbers to 4 places...
the vast majority of books are under 1000 pages,
so padding the pagenumber to 3 places works well.
that fourth padding place just causes more work.
in those rare cases where you have pagenumbers
that run in 4 digits, one can summon the "r" prefix
to signify those pages, so "r000.png" is page 1000,
"r001.png" would be 1001, "r002.png" 1002, etc.
(yes, you could use "q" too. but as a general rule,
you will leave yourself more flexibility if you do not
choose to use prefixes that are directly adjoining.)
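as a rough sketch of that rule in python, under
some assumptions of mine: the function name is
made up, the book prefix is glued on the front as
in the "sitka" names, and anything past page 1999
is simply refused, since that case isn't covered here.

```python
def scan_name(book, page, ext="png"):
    """build a page-scan filename with 3-digit padding.

    pages 1-999 get the "p" prefix; pages 1000-1999 spill
    over to the "r" prefix, so page 1000 becomes r000.
    """
    if not 1 <= page <= 1999:
        raise ValueError("page outside the range this sketch handles")
    if page < 1000:
        return f"{book}p{page:03d}.{ext}"
    return f"{book}r{page - 1000:03d}.{ext}"

print(scan_name("sitka", 36))
print(scan_name("sitka", 1000))
```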
***
the insanity continues...
al says this:
> For blank pages there should be no file
> and the page number should be skipped.
that's just crazy talk. include a blank image-file
and name it appropriately, so the world doesn't
suspect that you screwed up and dropped a file.
because that's _exactly_ what they will suspect...
(and with good reason. skipped pages happen,
a lot, as the world learned from google's work.)
***
...and it goes on and on...
al said:
> front cover c0001.png
> back cover c0002.png
> spine c0003.png
um, no. bad idea. very bad idea. you know how
i said that the sort-order of the filenames should
be identical to their order of appearance, right?
so hopefully you understand that the back-cover
-- i.e., the last thing in the book -- should have
a filename that sorts to the end. not position #2.
that's assuming that you even need a back-cover.
and the spine? i suppose if you _must_ have it,
you will be determined to include it, but please
give it a name that sorts it to the end, too, since
for most people it will just be a cute little gesture.
consider it as the mint as you leave the restaurant.
you might also remember that i insisted the files
must reflect the recto/verso aspects of the book.
for every recto file and filename, there _must_ be
a verso file and filename. once again, if you fail to
maintain this nicety, the world will suspect that you
have lost a file, or that you just do not understand
one of the basic structural aspects of the p-book,
specifically that every piece of paper has two sides.
that's why you always include a blank-page file...
...and why, if you have a file named "c0003.png",
you must also have "c0004.png". don't forget it.
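a program can check for this, too. here is a
minimal python sketch, not anybody's official tool.
it assumes names shaped like "sitkap036.png"
(5-letter book prefix, one section letter, 3 digits)
and it assumes odd numbers are the rectos.

```python
import re

# assumed name shape: 5-letter book prefix, 1 section letter, 3 digits
NAME = re.compile(r"^[a-z]{5}([a-z])(\d{3})\.png$")

def unpaired_sides(names, section="p"):
    """return numbers in one section whose other side of the leaf has no file."""
    seen = set()
    for name in names:
        match = NAME.match(name)
        if match and match.group(1) == section:
            seen.add(int(match.group(2)))
    unpaired = []
    for n in sorted(seen):
        # assumption: odd numbers are rectos, so 3 pairs with 4, and 4 with 3
        partner = n + 1 if n % 2 else n - 1
        if partner not in seen:
            unpaired.append(n)
    return unpaired

# page 4 was blank and got skipped, so page 3 has no verso file
scans = ["sitkap001.png", "sitkap002.png", "sitkap003.png",
         "sitkap005.png", "sitkap006.png"]
print(unpaired_sides(scans))
```

an empty list back means every side has its partner.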
***
...and on and on...
al said:
> page 1 p0001.png
> page 2 p0002.png
> image on page 2 p0002-image1.png
> image on page 2 p0002-image2.png
> page 3 p0003.png
> page 4 is blank
> page 5 p0005.png
> ... ...
> page 9999 p9999.png
first off, you can tell this originated from me,
because of the all-lower-case look of it, _but_
i've always padded my numbers to just 3 digits.
i believe it was marcello who added that 4th one.
(and, as i just explained above, it's unnecessary.)
and gee. you know, like jim said, what i propose is
really -- at the very heart of it -- a simple system...
so it's honestly quite _amazing_ that dp/pg could
screw it up in so many different ways. _amazing_.
look at the lines there pointing to "image on page 2".
either marcello or josh must have added those too.
this is something of a nightmare happening here.
up to now, the files we've been talking about are
_page-scans_. that is, they represent a full page.
we all know why that's the case; it's because we
are doing _proofing_, so we need the page-scan.
now all of a sudden something different pops in,
namely "images" contained on the same page as
the page-scan (which, of course, is also an image).
ok, i won't pretend i don't know what these are.
they're higher-resolution versions of _pictures_
that were contained on that page in the p-book.
which is all well and good, but let's not mix them
in with the page-scans, which is what happens if
you name the hi-res files using the same model.
give those files names that are _quite_different_,
and which sort them completely out of our range.
it'd be good if you even stored them in a separate
directory. (luckily, this is exactly what p.g. does,
storing them in a subdirectory next to the .html file,
as these "subimages" are used by .html versions;
but we certainly don't need 'em to do proofing.)
better yet, examine if you need those files at all.
if a particular page had a picture on it that needs
to be scanned at a higher-resolution, then make
the actual page-scan at that higher-resolution...
there's no sense having a low-res version of it,
especially if it's just going to cause us problems.
then, in your e-book file, give instructions for the
viewer-program about the coordinates of the scan
that represent the picture that you want it to "clip".
the viewer-app will then load in the high-res scan,
clip out the picture, and then display it accordingly.
(ok, this is a little futuristic, since no viewer-apps
will do this currently, not even mine. but soon...)
***
al said:
> page 2 p0002.png
> image on page 2 p0002-image1.png
> image on page 2 p0002-image2.png
one more thing about this. even though, as i
mentioned above, these "subimage" filenames
have no ill effects, as they're stored elsewhere,
there is yet another problem presented here,
one which _does_ manifest in the posted scans.
you might get the idea, from that list there,
that dashes are an ok thing in your filenames.
the problem comes with unnumbered pages.
let's say we have an unnumbered illustration
facing page 36 in our "sitka" book, as we do.
so our names would run like this:
> sitkap035.png
> sitkap036.png
> sitkap036a.png
> sitkap036b.png
> sitkap037.png
at least that's how _i_ do it...
but if you looked at the policy as al wrote it,
you might well conclude the names should be:
> sitkap035.png
> sitkap036.png
> sitkap036-a.png
> sitkap036-b.png
> sitkap037.png
or maybe you'd even think they could be:
> sitkap035.png
> sitkap036.png
> sitkap036-1.png
> sitkap036-2.png
> sitkap037.png
either way, the problem becomes clear if you
once again recall that we want the filenames to
_sort_ correctly... al's names will sort this way:
> sitkap035.png
> sitkap036-a.png
> sitkap036-b.png
> sitkap036.png
> sitkap037.png
this would cause the viewer-program to believe
that it should place that unnumbered illustration
between pages 35 and 36 -- a recto and a verso!
this illustration either goes between 34 and 35,
or it goes between 36 and 37, but that is unclear,
and computer programs need things to be clear.
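you can see this for yourself with a few lines
of python. this assumes a plain character-code
sort, where "-" comes just before "." and the
letters come after both. a file-manager that
sorts some other way might show you otherwise.

```python
# the dashed names, as the policy might lead you to write them
with_dash = ["sitkap035.png", "sitkap036.png", "sitkap036-a.png",
             "sitkap036-b.png", "sitkap037.png"]

# the undashed names, with a bare letter suffix
no_dash = ["sitkap035.png", "sitkap036.png", "sitkap036a.png",
           "sitkap036b.png", "sitkap037.png"]

# the dashed plates jump ahead of page 36
print(sorted(with_dash))

# the undashed plates stay put after page 36
print(sorted(no_dash))
```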
***
if you are now asking "why do we need to be
concerned with how computer programs will
interpret these files?", then you're making the
same mistake that the dp/pg people have made.
you are failing to grasp the _larger_context_
in which these files will be used. and it is this
larger context that is necessary to help us hone
the conventions that we adopt in making e-texts.
the pagenumber f.a.q. failed to consider the
necessary linkage with the names of the scans,
and the scanfile-naming rules failed to consider
how those scans would be used by developers.
this inability -- and unwillingness sometimes --
to see the big picture is why dp/pg isn't creating
coherent policies on such matters, even when it
actually _tries_ to do so (which is relatively rare).
so their implementations will be short-sighted.
when you add in the stubborn way that people
like al and juliet and marcello and josh _refuse_
to take any advice from me, no matter how good,
the situation can look bleak. however, i remain
focused on the long-term, where i am confident
-- supremely confident -- that my ideas will win.
and in the short-term, i just remind myself, on
the infrequent occasions when the question will
present itself to me, that i am not the stupid one.
-bowerbird