Re: [gutvol-d] RFC: Posting Page Scans in DJVU Format

17 Jul 2005

      David Starner wrote:
...
Marcello wrote:
...
...
DP scans already are single-page per file.
...
But my originals aren't. That's not a big deal, but missplit pages can
be a pain. So can pages the OCR badly despeckled.
[snip]
David brought up the issue of illustrations, which Juliet also
mentioned yesterday.

It is common in a DP book scan job to scan the pages at one resolution
sufficient for text, then return and redo all the illustrations at a
higher resolution. (There can be multiple illustrations per page, and
an illustration can be embedded within text.)

So the page scan filename system has to allow for this possibility.

Another thing I forgot to mention is whether the page scan filename
system should include an identifier pointing to the source book. We
might preface each filename with an ID associated with the source book.
This way a page scan can be associated with a particular book (and its
metadata) should it be copied somewhere else and stand alone. Or, we
could dump millions of page scans from thousands of books together,
and trivially identify those belonging to a particular book.

So in the prior example I gave of "00035-28", we might now have:

0003857-00035-28

Where '0003857' is the decimal identifier for the source book which
was scanned -- 7 digits gives us 10,000,000 books (hexadecimal would
be slightly more compact but not as human friendly) -- '00035' is the
sequential position of the page in the source book (counting every
page, including unnumbered and blank ones, independent of any printed
page numbering scheme), and '28' is the page number (or 'string') the
publisher/author actually printed on the page to identify it.
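
As a rough sketch of how such names could be built, taken apart, and
sorted back into books, here is some Python. The function names and the
regular expression are only illustrative assumptions on my part, not
part of the proposal; in particular it assumes the printed page string
is never empty and that no file extension is attached.

```python
import re
from collections import defaultdict

# Assumed layout: 7-digit book ID, 5-digit sequential page, printed label.
# The label is matched last, so it may itself contain hyphens.
SCAN_NAME = re.compile(r"^(?P<book>\d{7})-(?P<seq>\d{5})-(?P<label>.+)$")


def make_scan_name(book_id, seq_page, printed_label):
    """Build a name such as '0003857-00035-28' (hypothetical helper)."""
    if not 0 <= book_id <= 9_999_999:
        raise ValueError("book ID must fit in 7 decimal digits")
    if not 0 <= seq_page <= 99_999:
        raise ValueError("sequential page must fit in 5 decimal digits")
    if not printed_label:
        raise ValueError("printed label must not be empty in this sketch")
    return f"{book_id:07d}-{seq_page:05d}-{printed_label}"


def parse_scan_name(name):
    """Split a name into (book ID, sequential page, printed label)."""
    m = SCAN_NAME.match(name)
    if m is None:
        raise ValueError(f"not a recognized page scan name: {name!r}")
    return int(m["book"]), int(m["seq"]), m["label"]


def group_by_book(names):
    """Sort a mixed pile of scan names into per-book lists."""
    groups = defaultdict(list)
    for name in names:
        book_id, _, _ = parse_scan_name(name)
        groups[book_id].append(name)
    return dict(groups)
```

With names like these, pulling one book out of a big mixed directory
is just a matter of reading the first field.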

No doubt there are problems with this system (as noted, how to deal
with oddities such as foldouts and the like), but I am proposing it as
a sort of strawman.

Jon

Jon Noring