i'm glad the topic of formulating a policy for scan-sets
is being discussed again. even though donovan made it
sound like it would be a while before scan-sets would be
made available, i think it's wise to decide on a policy now.
in that vein, here's a response i made to marcello's proposal.
(i probably made much the same response back when he
first presented his proposal, but i didn't go back and check,
i just wrote it up anew.)
anyway, like i said, i'm glad this is being discussed again,
and i offer this -- as usual -- with a constructive spirit...
****
marcello said:
> PG allows the posting of page images along an existing ebook.
> The only currently accepted format is tiff files collected in a zip
> archive.
i didn't read that "should" as a _requirement_ for .tiff files, but
it should be stated clearly what image-formats are "allowed".
my opinion is that any widely-used one should be permitted.
> This new format will not replace the old format. Page images
> for any book can be posted in whichever of the 2 formats
> the poster chooses.
i believe strongly in "lockss" -- "lots of copies keeps stuff safe".
i believe strongly p.g. should store its content in multiple formats.
one form of storage for the scans should be _as_individual_files,
and that form should be thought of as the _primary_ method...
but yes, i'm all in favor of bundling them up in other formats too,
such as .zip and .djvu, as well as .pdf and even as .quicktime movies,
so that they could be more easily handled on machines like the psp...
> The format specified here will permit online viewing of the page images
> (with djvu plugin). It is not required to download the file and unpack it
> as with the old format, although it is still possible to do so.
> The new format also allows for linking from an html document
> to an arbitrary page image, so that a click on a link will
> open the right page image in the djvu browser plugin.
the ability to deep-link to a specific page-scan in a browser is good.
in fact, it's absolutely necessary. but it doesn't go far enough.
browsers simply are _not_ the only way people will access this content.
every obstacle that you put between the user and the content is bad,
and wrapping the content up into a bundle _is_ an obstacle, because
it means that that bundle has to be unwrapped.
look around, and you'll see that websites with straightforward a.p.i.'s
are the ones that programmers are gravitating toward. understandably.
p.g. needs to make it _easy_ for programmers to access this content,
because that's what's going to drive independent development forward.
even if it's possible, down the line or even now, for programmers to
include a library that will negotiate a djvu file, it's not a good thing
to make them do. it just raises the level of difficulty, and bloats their
applications. and _really_, there's no good reason for it!
don't make programmers jump through hoops to get to your content!
remember those two lines of code that i posted that pull a file from the web?
that's how _easy_ it can be, but only if your site is free of access-obstacles...
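
for illustration, here is roughly what such a short download routine looks like in python. this is a sketch, not the two lines posted earlier: the function name `fetch_scan`, the `SCAN_URL` layout and the example.org host are all made up, and it assumes each scan lives at its own plain url.

```python
from urllib.request import urlopen

# hypothetical url layout: one plain file per page-scan, no bundle to unwrap.
# neither the host nor the path scheme comes from p.g.; both are invented here.
SCAN_URL = "https://example.org/scans/{book}/{page_file}"

def fetch_scan(url, dest_path):
    """download one page-scan and save it as a local file."""
    with urlopen(url) as response, open(dest_path, "wb") as out:
        out.write(response.read())
```

with that in place, grabbing a page is a single call, e.g. `fetch_scan(SCAN_URL.format(book="myant", page_file="myantp157.png"), "myantp157.png")`. nothing like that is possible if the scan first has to be dug out of a bundle.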
and when i'm talking about automatic processes to convert the scan-sets
into digital text, it's absolutely crucial that access to each individual
scan be unobstructed. to have it any other way is to demand unnecessary labor.
let me give an example. when you direct a browser to a google scan
-- in some books, anyway, and maybe all, i just haven't verified that --
the initial download from google to your website is a _redirect_ page,
which the browser automatically resolves by calling that other webpage.
what this "redirection" workflow means, though, is that any programs
that are used to grab the scans have to know how to resolve redirection.
this is an example of the unnecessary obstacle that should be removed.
we want to make it as easy as copying a 2-line routine to do a download;
we don't want every programmer who wants to access the library to have to
bloat their code with routines to resolve redirection.
> Numbering / Naming of page files
ok, here we go.
> A book usually contains 2 page number sequences, a roman one followed by
> an arabic one. We considered the cover pages as yet another sequence.
this is good.
> A filename for a single-page djvu file MUST follow this pattern:
> <prefix><page number>.djvu
> The prefix for the cover pages is: "c".
> The prefix for the roman pages is: "f".
> The prefix for the arabic pages is: "p".
so far, so good. it's not really necessary to _require_ these letters,
as a "strong recommendation" should do the job, but whatever...
> If there are more page number sequences in the book, they MUST
> be handled in a similar fashion, using an arbitrary free letter.
ok.
> The <page number> is the true page number as seen on the physical page
> (or inferred from the previous / next pages) expressed in arabic numerals
> and left-padded with zeroes to a length of 4 digits.
i don't see any need to require any particular number of digits.
i myself generally pad to 3, but there's no need to be dogmatic.
> For blank pages there should be no file
> and the page number should be skipped.
> Optionally an image saying:
> "This page is blank in the original."
> may be inserted.
> Missing pages MUST be replaced by
> an image saying: "This page is missing."
this is just wrong. you need to have a file for each page, even blank ones;
otherwise, there's no way of knowing if you have inadvertently lost a file...
also, there's really no reason to require _what_ a blank-page image says,
as long as it communicates the message adequately.
as an aside, sometimes it's a good thing to "put a blank page to good use",
for instance, by copying a graphic from elsewhere in the book to that spot.
it is sufficient to tell the user that the page was blank, then put it to work...
> A filename for a single-page djvu file containing an illustration
> scanned in a different resolution or color depth MUST follow this pattern:
> <prefix><page number>-<image position on the page>.djvu
there's no need to scan illustrations on a page in addition to the page itself.
some people might try to tell you that you need to do that for an .html version.
tell 'em that you'll have software crop out the illustration when that time comes.
(this practice would also screw up recto/verso order, which is discussed below.)
if a page has illustrations that require a different resolution or color depth,
just scan the page that way.
> If present, front cover, back cover and spine MUST be named as follows:
> front cover outside: c0001.djvu
> front cover inside: c0002.djvu
> back cover inside: c0003.djvu
> back cover outside: c0004.djvu
> spine: c0005.djvu
um, no. this breaks the rule of "alphabetic filename sort = print/bind order".
> Example of file naming:
> front cover c0001.djvu
> back cover c0004.djvu
> spine c0005.djvu
if you have a front-cover scan saved as "c0001", you _must_ have its verso
-- even if it was blank -- saved as "c0002". this must be an _ironclad_ rule,
because it is absolutely essential to the reprinting of the scans as a p-book
(which is one of the main things people will want to do with the scan-sets).
every recto has to have a verso. it is a basic, fundamental fact about paper.
> i title page f0001.djvu
> ii title verso f0002.djvu
> iii dedication f0003.djvu
> iv is blank
> v contents f0005.djvu
you need an "f0004" file, and an "f0006" one too.
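
a check like this can be automated. the sketch below lists the files a sequence still needs so that it runs without gaps and ends on a verso. the function name `missing_pages` is made up, and it assumes marcello's `<prefix><page number>.djvu` pattern and that every sequence starts at page 1 on a recto.

```python
import re

# one lowercase prefix letter, then digits, then ".djvu" (marcello's pattern)
NAME = re.compile(r"^([a-z])(\d+)\.djvu$")

def missing_pages(filenames, prefix, width=4):
    """names to add so the sequence runs 1..N with no gaps and N is even."""
    numbers = set()
    for name in filenames:
        m = NAME.match(name)
        if m and m.group(1) == prefix:
            numbers.add(int(m.group(2)))
    if not numbers:
        return []
    last = max(numbers)
    if last % 2 == 1:  # the last page is a recto, so its verso is required too
        last += 1
    return [f"{prefix}{n:0{width}d}.djvu"
            for n in range(1, last + 1) if n not in numbers]

front = ["f0001.djvu", "f0002.djvu", "f0003.djvu", "f0005.djvu"]
print(missing_pages(front, "f"))
```

run on the example above, it reports exactly the two files in question, "f0004.djvu" and "f0006.djvu". note that it can only do so because the rule is "one file per page"; if blank pages were allowed to have no file, a gap would be ambiguous.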
> page 1 p0001.djvu
> page 2 p0002.djvu
> image on page 2 p0002-1.djvu
> image on page 2 p0002-2.djvu
i've already said we don't need to scan illustrations separately.
but an issue that does arise involves unnumbered plate pages.
in such a case, i use naming that goes like this:
> myantp157.png
> myantp158.png
> myantp158x6a.png
> myantp158x6b.png
> myantp159.png
> myantp160.png
the "x6" indicates that this is the 6th illustration in the book.
i haven't convinced myself that that's a _necessary_ piece of info,
but for the sake of a confidence-doublecheck, it seems useful...
oh yeah, notice the "myant" prefix, which serves to make
the filenames for this book _unique_ across the library...
once again, naming files "f0001.tif" is _asking_ for trouble.
you want to know exactly what's in a file just by knowing
its _name_ -- you _need_ to know that -- because the
alternative (i.e., opening up the file to look at it to tell)
is -- for want of a more accurate description -- idiotic.
if people can't mix files from several books into one folder,
because that results in filename crashes, you've hobbled them.
oh yeah, as i've said for the 493rd time now, i need to retain the
recto/verso order, which is why i have a "myantp158x6b.png" file,
even though in this particular case the verso was blank...
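
to show how much a name like that tells a program, here is a sketch of a parser for it. the regular expression is only a reading of the examples above, not a spec: it assumes a lowercase book prefix, a one-letter sequence code, a page number, and an optional "x<plate><side>" part, with "a"/"b" taken to mean the recto and verso of the plate. `parse_scan_name` is a made-up name.

```python
import re

SCAN_NAME = re.compile(
    r"(?P<book>[a-z]+?)"                      # book prefix, e.g. "myant"
    r"(?P<sequence>[a-z])"                    # one-letter page sequence, e.g. "p"
    r"(?P<page>\d+)"                          # page number, any amount of padding
    r"(?:x(?P<plate>\d+)(?P<side>[a-z]))?"    # optional plate part, e.g. "x6a"
    r"\.(?P<ext>[a-z0-9]+)"                   # file extension
)

def parse_scan_name(name):
    """split a scan filename into its parts, or return None if it doesn't fit."""
    m = SCAN_NAME.fullmatch(name)
    if m is None:
        return None
    info = m.groupdict()
    info["page"] = int(info["page"])
    if info["plate"] is not None:
        info["plate"] = int(info["plate"])
    return info

names = ["myantp159.png", "myantp158x6b.png", "myantp157.png",
         "myantp158x6a.png", "myantp158.png", "myantp160.png"]
for name in sorted(names):   # a plain sort puts the plate right after page 158
    print(name, parse_scan_name(name))
```

a bare "f0001.tif" fails to parse here, since there is no book prefix in front of the sequence letter, which is the point: the name alone can't say which book the file belongs to.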
> page 9999 p9999.djvu
if you have a page 9999, you must have a page 10000. oops!
count=494.
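
a quick sketch of what the "oops" looks like in practice, assuming files are ordered by a plain character-by-character sort of their names. `page_name` is a made-up helper that follows marcello's pattern with a configurable padding width.

```python
def page_name(number, width=4, prefix="p", ext="djvu"):
    """build '<prefix><zero-padded number>.<ext>'."""
    return f"{prefix}{number:0{width}d}.{ext}"

# padded to 4 digits, page 10000 spills into a 5th digit
four = [page_name(n) for n in (9998, 9999, 10000)]
# padded to 5 digits, every name has the same length
five = [page_name(n, width=5) for n in (9998, 9999, 10000)]

print(sorted(four))   # "p10000.djvu" now sorts ahead of "p9998.djvu"
print(sorted(five))   # order is preserved
```

so a fixed width only keeps "filename sort = page order" as long as no page number outgrows it, and the verso of page 9999 does.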
> All cover pages and the book spine should appear at the front.
i disagree. only front-cover pages should go at the front.
the back-cover images should go at the end, naturally.
if the spine was digitized, it should be the very last image.
(and -- because every recto has to have a verso -- it must have one too.)
> The naming scheme was chosen so that saying:
> djvm -c 12345.djvu *djvu
> in a directory containing all single-page djvu files will
> assemble the multi-page djvu file in the correct sequence.
well, that's the right _idea_. (and kudos for picking it up
from me from our previous thread.) but it also shows you
why you need to have a verso file for every recto file, or else
your multi-page .djvu file will have some even-numbered
pages on the right, and some odd-numbered pages on
the left, which is a very big no-no in book typography...
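
here is a sketch of that check, under two stated assumptions: the first image in the assembled file is a right-hand page, and odd page numbers belong on the right. `misplaced_pages` is a made-up name, and the page number is simply read from the digits that follow the leading letters of the filename.

```python
import re

def misplaced_pages(ordered_names):
    """return the names whose page-number parity disagrees with their position."""
    wrong = []
    for position, name in enumerate(ordered_names):
        match = re.match(r"[a-z]+(\d+)", name)
        if match is None:
            continue  # not a page file under the naming assumed here
        page = int(match.group(1))
        lands_on_recto = position % 2 == 0   # 1st, 3rd, 5th... image is on the right
        is_odd_page = page % 2 == 1          # odd page numbers belong on the right
        if lands_on_recto != is_odd_page:
            wrong.append(name)
    return wrong

with_gap = sorted(["f0001.djvu", "f0002.djvu", "f0003.djvu", "f0005.djvu"])
print(misplaced_pages(with_gap))
```

with "f0004.djvu" left out, "f0005.djvu" slides onto the left-hand side and gets flagged; add the blank-page files back and the list comes out empty.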
***
greg said:
> At this point, let's try to get a few dozen done,
> in a few different ways, and take a look at
> accessibility/utility.
you don't need to "do experiments" to see that some of
the suggestions that have been made have flaws in them.
you just need to analyze them carefully.
***
jon said:
> it is best if both PG/DP align its practices as best as
> practicable with OCA-developed standards and conventions.
i do believe that it would be good if there was some consistency.
but the o.c.a. is doing some things wrong too. to "synchronize"
with them would be a mistake. to correct their policy is better.
i would also like to take the opportunity to learn from them,
because i'm sure they are aware of some gotchas i don't know.
i've been meaning to volunteer for the appropriate committee,
but i would _much_ rather they just held an open discussion,
because in order to formulate a wise policy, you have to be
knowledgeable about the _uses_ to which people will put the
content, and the only way to find out the wide range of ideas
out there in the world is to open up your ears and _listen_...
and when i look at the names of the o.c.a. heavy hitters --
microsoft, adobe, rlg, and so on -- it makes me _wonder_
if they are even capable of listening to the world at large...
i don't want to be a pessimist. but i do want to be a realist.
if anyone gets a sense they _are_ open to input, let me know,
and i'll be happy to tell 'em exactly what they should do... ;+)
-bowerbird