Oops Re: [gutvol-d] RFC: Posting Page Scans in DJVU Format

18 Jul 2005

Marcello wrote:
...
Jon Noring wrote:
...
...
It is common in a DP book scan job to scan the pages at one
resolution sufficient for text, then return and redo all the
illustrations at a higher resolution. (There can be multiple
illustrations per page, and an illustration can be embedded within
text.)
So the page scan filename system has to include this possibility.
...
Did you actually *read* my RFC before commenting on it? I ask, because
if you had read it, you would have noticed this section:
Oops, I apologize for not making it clear, but I was focusing not on
your particular RFC (and its purpose), but on page scan filenaming in
general (not embedded within DjVu or whatever). I renamed the Subject:
header line on most of my messages to reflect this, but not all of
them.

I have a book, I am scanning it, and I want to apply an appropriate
filename to each separate image so I (and others) can keep everything
straight. I also want the filename to be machine-processable so important
page-related info can be machine read at a future time.

I am not thinking of any particular application of the page scans,
such as DjVu. Your system for image naming within DjVu is interesting,
and certainly there can be a mapping between a system like I propose,
and the one to be used strictly within DjVu.
...
...
0003857-00035-28
Where '0003857' is the decimal identifier for the source book which
was scanned -- 7 digits gives us 10,000,000 books (hexadecimal would
be slightly more compact but not as human friendly) -- '00035' is the
sequential position of the page in the source book (independent of any
page numbering scheme, and counting unnumbered blank pages), and
'28' is the page number (or 'string') the publisher/author actually
printed on the page to identify it.
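
As a rough illustration of how such a name could be built and read back by machine, here is a minimal Python sketch. The names ScanName, format_scan_name and parse_scan_name are hypothetical, and the sketch assumes the three fields are joined by hyphens exactly as in the example, with the printed page label allowed to be any string (including one that itself contains a hyphen).

```python
from typing import NamedTuple


class ScanName(NamedTuple):
    """Hypothetical container for the three fields of a page scan name."""
    book_id: int      # decimal source book identifier (7 digits)
    sequence: int     # sequential position of the page in the book (5 digits)
    page_label: str   # what the publisher actually printed on the page


def format_scan_name(book_id: int, sequence: int, page_label: str) -> str:
    """Build a name such as '0003857-00035-28'."""
    if not 0 <= book_id < 10_000_000:
        raise ValueError("book_id must fit in 7 decimal digits")
    if not 0 <= sequence < 100_000:
        raise ValueError("sequence must fit in 5 decimal digits")
    return f"{book_id:07d}-{sequence:05d}-{page_label}"


def parse_scan_name(stem: str) -> ScanName:
    """Split a name back into its fields; the label keeps any extra hyphens."""
    parts = stem.split("-", 2)
    if len(parts) != 3:
        raise ValueError(f"expected three hyphen-separated fields: {stem!r}")
    book, seq, label = parts
    if len(book) != 7 or not book.isdigit():
        raise ValueError(f"bad book identifier: {book!r}")
    if len(seq) != 5 or not seq.isdigit():
        raise ValueError(f"bad sequence number: {seq!r}")
    return ScanName(int(book), int(seq), label)
```

Splitting on at most two hyphens is a design choice of this sketch, so that a printed label like 'A-3' would survive a round trip.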
...
This is more complicated and less robust than my proposal.
Well, we are sort of comparing apples and oranges. Sorry for not
making that clear.
...
1. You don't need the ebook number because the ebook number will be in
the filename of the multi-page djvu.
Of course, once the images are embedded within a DjVu, the source book
ID need not be, and probably should not be, part of the page image
naming. So you are right here.
...
2. You don't want the ebook number because at the time of scanning, the
ebook number is unknown.
True. However, for my different situation, when a scan set is submitted to
some repository, along with the metadata associated with the scan set, the
repository may prepend the source book ID that it assigns to
the filename. Now, if they produce a single DjVu file from the scan
set, then they can transform/remap the filenames to something
appropriate for that specific purpose.
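
A minimal sketch of that repository-side step, in Python. It assumes (this is not specified anywhere above) that the scanner submits files named with only the sequence number and printed page label, such as '00035-28.png', and that the repository pads its assigned ID to 7 digits; prepend_book_id is a hypothetical name.

```python
def prepend_book_id(filename: str, book_id: int) -> str:
    """Put the repository-assigned book ID on the front of a submitted name."""
    if not 0 <= book_id < 10_000_000:
        raise ValueError("book_id must fit in 7 decimal digits")
    # Zero-pad to 7 digits so all names from one repository sort consistently.
    return f"{book_id:07d}-{filename}"
```

For example, prepend_book_id("00035-28.png", 3857) gives "0003857-00035-28.png".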
...
3. You don't want the sequence number in the filename because it 
increases the probability that links to the page image break. If you
have to insert a page all subsequent files will have to be renamed, and
all links to them will break. In my proposal no link will break if you
insert or remove pages (except a link to the removed page).
Within DjVu, certainly!

You bring up an interesting point, though: if someone scans a
book and misses a page, the sequential page scan numbering (not
the same as the publisher page numbering) gets messed up. So once the
missing page is discovered and scanned, the sequential integer
numbering has to be fixed to "insert" that page.
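
To make the cost of that fix concrete, here is a rough Python sketch that plans the renames needed when a missed page is inserted. It assumes names of the form 'bookid-sequence-label' with a 5-digit sequence field, all from the same book; plan_insertion is a hypothetical helper, and it only computes the plan, it does not touch any files.

```python
from typing import List, Tuple


def plan_insertion(
    stems: List[str], new_sequence: int, new_label: str
) -> Tuple[List[Tuple[str, str]], str]:
    """Return (renames, new_stem) for inserting a page at new_sequence.

    renames is a list of (old, new) pairs ordered from the highest sequence
    number down, so applying them in order never overwrites an existing name.
    """
    if not stems:
        raise ValueError("need at least one existing scan name")
    book_id = stems[0].split("-", 2)[0]
    renames = []
    # Zero-padded fields mean reverse string order is reverse page order.
    for stem in sorted(stems, reverse=True):
        book, seq, label = stem.split("-", 2)
        if int(seq) >= new_sequence:
            # Every page at or after the insertion point shifts up by one.
            renames.append((stem, f"{book}-{int(seq) + 1:05d}-{label}"))
    new_stem = f"{book_id}-{new_sequence:05d}-{new_label}"
    return renames, new_stem
```

Every page after the insertion point gets a new name, which is exactly why this is best done before the filenames are finalized and linked to.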

I am thinking, though, that any book scanning project will go through
some kind of quality control checking, as well as generating a
metadata/catalog record. During this process the scan file name will
be finalized.

If the page scans are later incorporated into a DjVu, then the
filenames before embedding can be mapped into your proposed system.
...
4. Who wants to know about "unnumbered blank pages"? You are not going
to cite a blank page, are you?
Again, I am not thinking specifically of deeplinking into DjVu and
trying to maintain stable links, but rather the filenaming scheme for
a bunch of page scans. With respect to understanding how a source book
is laid out, it is a good idea to know where the blank pages were --
this also aids in knowing if the book scan set is complete. (There are
reasons why many official documents add the statement "this page
intentionally left blank" on blank pages. Also, it would not surprise
me if, in rare cases, a page which should have been printed turned
out to be unprinted. But this is a different problem.)
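
If every physical page, blank or not, gets its own sequence number, a completeness check becomes a simple gap search. Here is a minimal Python sketch, assuming names of the form 'bookid-sequence-label' and (my assumption) that sequence numbering starts at 1; missing_sequences is a hypothetical helper.

```python
from typing import Iterable, List


def missing_sequences(stems: Iterable[str]) -> List[int]:
    """List the sequence numbers absent between 1 and the highest one seen."""
    seen = {int(stem.split("-", 2)[1]) for stem in stems}
    if not seen:
        return []
    return [n for n in range(1, max(seen) + 1) if n not in seen]
```

Note that this can only detect gaps below the highest scanned page; pages missing from the end of the book still need the page count from the catalog record.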

For deeplinking into a finalized DjVu file, the blank pages can
be left out. You are right -- it is unlikely one will ever encounter a
reference in one book to a blank page in another book, except maybe as
some elaborate joke.

Jon