ok, first i wasn't gonna even write this.
but then i decided that i had to write it.

and then i decided that i wouldn't send it.
but now i've decided that i have to send it.

but oh lord, i am so tired of the merry-go-round,
and i re-assert my resolution to get off of the thing.

i mean, i do love to say "i told you so", a lot, but
after a while, even _that_ gets old.  but you guys
need to be reminded of the facts of this matter...

so i must...

***

al said:
>   This is the convention used by Joshua Hutchinson,
>   when he provided page scans from DP:

oh please.  these lame attempts to rewrite history by
reporting it selectively are a demand for a slapdown.

that file-naming convention is one that _i_ developed,
which josh then bastardized because he didn't grok it.

and let's not forget that i had to rail for _years_ to
install this common-sense file-naming convention,
fighting _against_ josh and david "donovan" garcia
and _many_ others over at distributed proofreaders,
where some idiots _still_ use other naming schemes
since d.p. doesn't demand consistent common-sense.
(just in case you thought that flaw was unique to p.g.)

the archives will prove that this fight lasted for years.

**

so, what was the fight about?

the basic notion is that the filename for a scan should
include the pagenumber of the page the scan captured.

d.p. doesn't require this, so there was some question
about whether p.g. should also be lax in that regard,
and whether d.p. should be "encouraged" to change...

as i recently showed, d.p. still allows bad file-naming.

you might remember i pointed to scans from the first
8 pages of "when it was dark", which is now at d.p.:
>    http://zenmagiclove.com/pgstupidity/012.png
>    http://zenmagiclove.com/pgstupidity/013.png
>    http://zenmagiclove.com/pgstupidity/014.png
>    http://zenmagiclove.com/pgstupidity/015.png
>    http://zenmagiclove.com/pgstupidity/016.png
>    http://zenmagiclove.com/pgstupidity/017.png
>    http://zenmagiclove.com/pgstupidity/018.png
>    http://zenmagiclove.com/pgstupidity/019.png

so the scan for page 1 is named "012.png".
and the scan for page 2 is named "013.png".
and the scan for page 3 is named "014.png".
and so on.

it's smarter to name the scan for page 1 as "001.png".
page 2 will be "002.png", and "003.png" is page 3, etc.
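
for anyone who wants to try it, here is a minimal sketch of that
rename in python.  the function name and the "offset" parameter are
my own inventions for illustration, and the offset of 11 is simply
what the eight scans listed above imply (012 holds page 1)...

```python
from pathlib import Path

def renumbered_name(old_name: str, offset: int, width: int = 3) -> str:
    """Turn a sequence-numbered scan name into a page-numbered one.

    offset is how far the scan sequence runs ahead of the printed
    pagenumbers (an assumption the caller has to supply).
    """
    p = Path(old_name)
    page = int(p.stem) - offset
    if page < 1:
        raise ValueError(f"{old_name} falls before page 1")
    return f"{page:0{width}d}{p.suffix}"

# the eight scans from "when it was dark" listed above
for n in range(12, 20):
    old = f"{n:03d}.png"
    print(old, "->", renumbered_name(old, offset=11))
```

it only computes the new names.  the actual renaming (and whatever
you do with the front-matter scans before page 1) is left out here.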

and yes, people fought against me for _years_ on this.

***

that's right -- i had to _fight_, for _years_and_years_,
to get common-sense to prevail.  i consider this to be
a _victory_, since -- in this one case, anyway -- it did!
there are too many others where a fight still goes on...

but anyway, so we're in the middle of a years-long fight
about this file-naming convention, or the _lack_ of it...
over time, i've written up lots of careful documentation
for the way that the convention _should_ be construed.

and then one day, josh rewrote my careful system,
adding a few twists of his own (which screwed it up),
and everybody shifted to support "josh's proposal".

as if he had invented it.

yeah, right.  ok, boys, do it your way.  you assholes.

***

and just to give a flavor of some of the other history
around this issue, the page-scan-submission process
was the genesis for don kretz writing his "twisted" app.

since d.p. refused to require its content providers use
smart file-naming in the beginning, there were lots of
scan-sets around (then, and now) with bad filenames,
so don wrote this tool for people uploading scan-sets.

it was originally called "twister", and its purpose was
to aid in the file-renaming process.  after coding that,
don realized it was relatively simple to extend the app,
to the point he had turned it into a very good start on
a postprocessing program.  unfortunately, the "powers"
over at d.p. turned their noses up at it, so it never got
the adoption that it should have, and d.p. was the loser.
and the postprocessing queue lingers to this very day...

***

as for the issue of "omnibus" editions -- ones which
combine input from several p-books -- i'll say this...

in 1992, omnibus editions made sense.  for instance,
"alice in wonderland" has a few places where it says
"later editions added this" and it gives the addition...
in that place and time, that was the perfect solution.
it would've been stupid to create two different texts,
which differed by only a few lines, absolutely stupid.

15 years ago, 1997, omnibus editions still made sense.
scans were rare, and even when available, o.c.r. stunk.

10 years ago, 2002, they still made sense.  that's why
i staunchly defended you, greg, and michael hart, and
p.g. in general, when some people (like jon noring and
lee passey) were attacking you relentlessly because you
weren't being "faithful" to this or that canonical version.
o.c.r. had gotten better, and scans were more prevalent,
but bandwidth still presented a large logistical obstacle.

5 years ago, 2007, google books and archive.org were
still trying to establish a solid footprint on the ground,
so omnibus editions could still argue they made sense.
but we were starting to swim in scan-sets, and even the
longstanding bandwidth logjam was promising to clear.

so, today -- 2012 -- there is no argument that can be
made in support of an omnibus edition.  not any at all.

"our policy is what it always was -- it hasn't changed"
is hollow once we grant that the world _has_ changed.

i've always said that, if the text was indeed based on
a specific edition, and p.g. had scans for that edition,
you should mount them.  (again, just common sense!)

indeed, it was my support for _that_ position which led
jon noring to say "see, even bowerbird agrees with me"
at the 10,000 celebration in san francisco back in 2003.

(of course, in typical noring style, jon refused to see that
i did _not_ agree with his overall position, which was that
p.g. should cease making omnibus editions entirely, and
base every book on one specific scan-set, and mount it.
i only supported it _if_ a text was based on one p-book.)

now, for many books, we have scan-sets for all versions.

so project gutenberg needs to point to _one_ scan-set
that's "the one" for each and every particular p.g. e-text.

the alternative -- that a p.g. e-text is "another" version,
but one which has no relationship to any printed book --
dooms the p.g. text to being neglected, then ostracized,
because it pretends to document the past, but it cannot
point to anything tangible as a proof of its provenance,
while the scan-sets sitting online are self-documenting.

***

carlo said:
>   I have two books that are really weird:

weird is not a problem.


>   one in which the page numbers
>   are out of order in the original

the numbers which are actually printed on the pages can
be printed in error, just like anything else that's printed.

so, were the pages _bound_ in the incorrect order?
if so, you should fix that error by rearranging them.

but if the content appears in the proper sequence,
then it is the _pagenumbers_ which were incorrect,
and you would correct that error by changing them.

you might also leave a note, to explain the situation,
so users aren't confused by the apparent discrepancy.


>   and another in which a signature
>   repeats the numbers of a previous one
>   (after 1...208, we have 197a....208a,
>   then 209...285).

this one is clearly an error.  you should renumber
the second sequence.  in this case, since the extra
pages _followed_ page 208, rename them this way:
>   197a -> 208a
>   198a -> 208b
>   199a -> 208c
>   200a -> 208d
>   201a -> 208e
>   202a -> 208f
>   203a -> 208g
>   204a -> 208h
>   205a -> 208i
>   206a -> 208j
>   207a -> 208k
>   208a -> 208l

as i will discuss shortly, one purpose of the filenames
is to represent the binding order, via a sort of them...
and that is the principle which is behind this solution.
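
a rough sketch of that renumbering in python, for the curious.
the function and its parameter names are hypothetical, made up
just for this example, and it assumes one letter per extra page
(so no more than 26 of them)...

```python
import string

def renumber_repeated(first: int, last: int, anchor: int) -> dict:
    """Map the repeated pages first..last (tagged 'a') onto anchor + letter."""
    letters = string.ascii_lowercase
    count = last - first + 1
    if count > len(letters):
        raise ValueError("more than 26 extra pages; one letter won't cover it")
    return {f"{first + i}a": f"{anchor}{letters[i]}" for i in range(count)}

mapping = renumber_repeated(197, 208, anchor=208)
for old, new in mapping.items():
    print(old, "->", new)

# the new names sort into the binding order, between 208 and 209
names = ["207", "208", *mapping.values(), "209"]
assert sorted(names) == names
```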

***

al said:
>   This is the convention used by Joshua Hutchinson,
>   when he provided page scans from DP:

yeah, right.

ok, so let's take a look, shall we?

>   Basic format:
>   The prefix for the cover pages is: "c".
>   The prefix for the roman pages is: "f".
>   The prefix for the arabic pages is: "p".

so far, so good.

but, for the record, i had to fight like a dog to get
even something as basic as _this_ accepted.  really!

even after people accepted the general idea that
the pagenumber should be reflected in the name,
some of 'em wanted to name the cover as "cover",
and "back-cover", and "spine", and what have you.

they couldn't understand something as simple as
_a_need_for_names_to_reflect_the_binding_order_.

some of them wanted to prefix the front-matter
with "r", to represent the "roman" numerals there.
but of course then those files would sort _after_
the "p" prefix that everyone agreed on for "page".

but these idiots couldn't grok that simple notion.
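
you can check the sorting claim yourself.  this is a tiny python
sketch with made-up filenames (3-digit padding is my assumption),
using a plain character-by-character sort:

```python
# a small book: one cover scan, two front-matter pages, two body pages
with_f = ["p001.png", "f002.png", "c001.png", "f001.png", "p002.png"]

# the same files, but with "r" for the roman-numbered pages
with_r = [name.replace("f0", "r0") for name in with_f]

print(sorted(with_f))  # cover, front-matter, body: the binding order
print(sorted(with_r))  # cover, body, front-matter: not the binding order
```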

***

>    For blank pages there should be no file
>   and the page number should be skipped.

wrong.  wrong wrong wrong.  you need to include
a file for _every_ page in the book, or else you will
ruin the verso/recto left/right nature of the spread.

just goes to show how josh failed to grok the basics.

and "skipping" is just a big invitation for disaster.

because then when you lose a file for any reason,
people will just assume that it was a blank page...

if a book does _skip_ pagenumbers, you should
inject images into that range that inform users
"the book skipped pagenumbers at this point",
again taking care to preserve your verso/recto.
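
if every page has a file, a missing file is easy to detect.
here is a minimal python sketch of such a check.  the function
name is hypothetical, and it assumes a one-letter prefix, ".png"
files, and numbering that starts at 1.  lettered tip-ins are
simply ignored by this version...

```python
import re

def missing_pages(names, prefix="p"):
    """List the pagenumbers from 1 up to the highest seen that have no file."""
    pattern = re.compile(rf"^{re.escape(prefix)}(\d+)\.png$")
    seen = set()
    for name in names:
        match = pattern.match(name)
        if match:
            seen.add(int(match.group(1)))
    if not seen:
        return []
    return [n for n in range(1, max(seen) + 1) if n not in seen]

# a front-matter run where the blank page iv was "skipped"
print(missing_pages(["f001.png", "f002.png", "f003.png", "f005.png"], prefix="f"))
```

under a "skip the blanks" rule, that report can't tell you whether
page iv was blank or whether its file just got lost.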


>   Optionally an image saying:
>   "This page is blank in the original."
>    may be inserted.

well, this isn't a "bad" thing to do.  but neither is it
a _necessary_ one.  a blank scan speaks for itself...

indeed, if you do things right, the space used by
a blank-page scan should make it obvious that
that specific page was indeed blank in the book.

so a simple look at the size of your scans would
be enough to tell an app which pages are blank.
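
as a sketch of how an app might do that, here is one possible
approach in python.  the cutoff (one-fifth of the median size)
is purely my guess for illustration, not a tested figure, and
the sizes below are invented.  in real use you would read them
with something like os.path.getsize...

```python
from statistics import median

def likely_blank(sizes: dict, ratio: float = 0.2) -> list:
    """Flag files whose byte-size is far below the median for the scan-set."""
    if not sizes:
        return []
    cutoff = median(sizes.values()) * ratio
    return sorted(name for name, size in sizes.items() if size < cutoff)

# invented sizes, in bytes, for a four-page sample
sample = {
    "p001.png": 50_000,
    "p002.png": 52_000,
    "p003.png": 3_000,
    "p004.png": 48_000,
}
print(likely_blank(sample))
```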


>   Example of file naming:
>   front cover           c0001.png
>   back cover            c0002.png
>   spine                 c0003.png

again, more evidence of josh's utter stupidity.

the back cover should be given a name that
will sort it _after_ regular "p"-prefix pages.
(and also after all of the back-matter pages.)

c002 must be used for the inside front cover.

yes, folks, if you're going to scan the cover,
you must scan the verso of the cover as well.
indeed, for _any_ thing you scan, you _must_
scan the recto side first, and then the verso,
because we need to be able to show spreads.

which means that the only logical name for a
scan of the spine is the last one in the bunch.

(if, as usual, the inside front cover is blank,
you should substitute in a table-of-contents.
in general, you can substitute in _anything_
that'll be useful to people, for a blank page.)


>   i title page          f0001.png
>   ii title verso        f0002.png
>   iii dedication        f0003.png
>   iv is blank
>   v contents            f0005.png

idiots.

oh, and the title-page is usually _not_ roman i.

if you really have a roman i (for 1), and the
title-page comes before it, then you should
use the "c" prefix for the title and its verso.


>   page 1                p0001.png
>   page 2                p0002.png
>   image on page 2       p0002-image1.png
>   image on page 2       p0002-image2.png
>   page 3                p0003.png

wrong wrong wrong wrong wrong wrong wrong.

the image-files for images on a page must be
kept separate from the page-scans themselves.

not necessarily in another folder -- since that
spoils the good idea of all files in one folder --
but _definitely_ with a different naming scheme,
one which sorts those names to a different place,
out of the sorting for recto-verso binding-order.

***

al doesn't mention another shortcoming of the
system that josh "borrowed" (so badly) from me.

for any unnumbered "tip-in" illustration pages,
my systems had the filenames append a letter...

so, for a tip-in between, say, pages 198 and 199:
>   196.png
>   197.png
>   198.png
>   198a.png
>   198b.png
>   199.png
>   200.png
>   201.png

you'll notice this is what i suggested to carlo above.

pretty straightforward, eh?  hard to screw it up, yes?

well, no, not for josh, apparently.  he did it like so:
>   196.png
>   197.png
>   198.png
>   198-a.png
>   198-b.png
>   199.png
>   200.png
>   201.png

looks pretty close, don't you think?  well, you're wrong.

because if you sort those names, you'll find that they
don't sort in that order, not on most systems anyway.

try it, and you will see that they sort like this instead:
>   196.png
>   197.png
>   198-a.png
>   198-b.png
>   198.png
>   199.png
>   200.png
>   201.png

with that sort, we've destroyed the p197-p198 spread.
and, of course, we've rearranged the content's order...

it's funny how not understanding something fully
leads an amateur to make a fundamental mistake.
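
here is the test, in python, whose sorted() compares strings
character by character.  the hyphen (code 45) comes before the
period (code 46), which comes before the letter "a" (code 97).
a file manager with a "natural" or locale-aware sort might give
a different answer, so treat this as the plain-sort case only.

```python
no_hyphen = ["196.png", "197.png", "198.png", "198a.png", "198b.png", "199.png"]
hyphen = ["196.png", "197.png", "198.png", "198-a.png", "198-b.png", "199.png"]

# the letter-only names are already in binding order
print(sorted(no_hyphen) == no_hyphen)

# the hyphenated names are not: the tip-ins jump ahead of page 198
print(sorted(hyphen))
```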

***

and a couple other notes about stupidities here,
which are not just "after-thoughts", but actually
are aimed at the _most_ stupid things about this.

first, it's stupid to pad the numbers to _4_ digits.
since very, very few books go over 1,000 pages...
so padding to 4 digits is unnecessary.  it is also
unsightly.  but _worst_ of all, it's ungainly when
people have to _type_ it, when they enter a u.r.l.

in the extremely rare cases where pagenumbers
go over 1000, you just switch from "p" to "q" as
the prefix, and bingo, you're back in business.
(you might even wanna reserve "q" to mean that.)
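
one way this could work, sketched in python.  i am assuming here
that page 1000 becomes "q000", page 1001 becomes "q001", and so on,
which is only one reading of the idea.  the function name is
made up for the example...

```python
def page_filename(page: int, ext: str = "png") -> str:
    """Name a body page with 3 digits, switching to 'q' at page 1000."""
    if not 1 <= page <= 1999:
        raise ValueError("this sketch only covers pages 1 through 1999")
    prefix = "p" if page < 1000 else "q"
    return f"{prefix}{page % 1000:03d}.{ext}"

names = [page_filename(n) for n in (1, 999, 1000, 1234)]
print(names)

# "q" sorts after "p", so the binding order survives the switch
assert sorted(names) == names
```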

but the most stupid thing of all is something that
you can't even see here, because it's _not_ here...

one of the most important rules for file-naming
-- indeed, it's probably the _cardinal_ rule! --
is that a filename must be unique to its content.

a filename _must_ point unequivocally to one thing.

the reverse angle -- that each thing must have
one and only one filename -- is a good _goal_,
even though there are some worthy exceptions.

but there is _no_exception_ to the cardinal rule:
a filename must point unwaveringly to one thing.

or, to put this in a different way, different content
_must_always_ have a filename which reflects that.

or, yet another way:

_different_stuff_must_never_have_the_same_name._

if you look at p.g. image-files, however, you will
discover that it has tons of different files that all
have been given the same name -- p0001.png...

likewise, you will find a ton of p0002.png, and
p0003.png, and p0004.png, and p0005.png...

it's stupid.  it's ridiculous.  it's ridiculously stupid.

please don't demonstrate your stupidity by trying to
argue that this is acceptable, because "the files are
in different folders".  that shows you miss the point.

besides, it just so happens that the _folders_ have
_the_same_name_as_well_, for yet another violation.

so to differentiate one p0001.png from another, you
need the parent-folder name of the parent-folder...
holy batman, talk about abstracting the abstraction!

and the whole point is that you need to overcome the
possibility of confusion in the event that your files are
copied to the wrong folder.  or to the _same_ folder...

a solution is easy.  prepend the 5-digit pg# to each file.

so the files for pg#12345 might be named like this:

>   12345-f001.png
>   12345-f002.png
>   12345-f003.png
>   12345-p001.png
>   12345-p002.png
>   12345-p003.png
>   12345-p004.png

see how easy that is?  take a good close look at it...

so, did you take a good close look?

really?

if you did, you should have spotted a bad _error_...
i had "12345-f003.png", but no "12345-f004.png".
remember, we have to maintain the page-spreads...

back to the point, though.  this way if these files were
to be accidentally copied into the folder for pg#23456,
we'd know immediately, and nothing'd be overwritten.

with non-unique filenames, you will have a mess and
you won't even know right away that you screwed up...
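
a minimal python sketch of both halves, building the names and
then spotting files that landed in the wrong folder.  both
function names are hypothetical, and padding the pg# to 5 digits
is my assumption about how shorter numbers would be handled...

```python
def unique_name(pg_number: int, prefix: str, page: int) -> str:
    """Build a name like 12345-p001.png that is unique across all e-texts."""
    return f"{pg_number:05d}-{prefix}{page:03d}.png"

def strays(names, pg_number: int) -> list:
    """List the files in a folder that belong to some other e-text."""
    tag = f"{pg_number:05d}-"
    return [name for name in names if not name.startswith(tag)]

# the folder for pg#23456, after an accidental copy from pg#12345
folder = [
    unique_name(23456, "p", 1),
    unique_name(23456, "p", 2),
    unique_name(12345, "p", 1),
]
print(strays(folder, 23456))
```

with bare "p001.png" names, that copy would have overwritten a file
and the list of strays would have come back empty.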

***

like i said, getting the pagenumber put in the filename
was a _huge_ victory for me, one that i fought hard for,
so i'm glad there was _some_ benefit from all my work.

but if y'all think you did it right, you're badly mistaken.

***

and finally, a few more things, while i'm at it...

some of my antagonists try to have a field day with
"oh, that bowerbird, he thinks he's so damn smart".

and even some newcomers might be led to agree.

well, first off, much of this is just _common_sense_.
now, if you don't have enough basic intelligence to
recognize _common_sense_ kicking you in the shin,
don't try and blame me that you are such a retard...

and second, the things which aren't "common sense"
are things i learned from my _hard-won_ experience.
you think i haven't screwed up, and given non-unique
names to different stuff, and then overwritten the new
with the old?  think again.  i've done it.  several times.
enough times that i _learned_ it is a mistake to break
the cardinal rule, which is why it _is_ the cardinal rule.

if you guys were smart, you'd learn from my mistakes.
when i say "you sure don't wanna be doing it that way,"
you should be able to hear the pain of my experience.

and when you ignore me, i just laugh at you, because
i know the pain of _your_ experience _will_ teach you.

the other thing is "that bowerbird is such a rude guy".

and again, i can see newcomers feeling that way too...

well, listen up.  i explained _all_ of these things nicely.
i was polite the first time, and the second, the third,
fourth, fifth, and sixth.  i was always calm and careful.
i am _still_ calm and careful, to this very day, mind you.

but -- after a dozen careful and calm expositions that
explain little more than common sense and experience,
and which were met with knock-down-drag-out _fights_,
where i was insulted and my reputation was maligned --
it's no wonder that i've now developed the attitude that
i will call _stupidity_ by the word that best describes it...

and that is also why i rub it in and say "i told you so..."

so if any of you want to indignantly label that as "rude",
then i'll humbly suggest that you can take up the issue
with the goddesses of honesty and integrity and truth...

-bowerbird