
David Starner wrote:
Marcello wrote:
DP scans already are single-page per file.
But my originals aren't. That's not a big deal, but mis-split pages can be a pain. So can pages that the OCR software despeckled badly.
[snip]
David brought up the issue of illustrations, which Juliet also mentioned yesterday. It is common in a DP book scan job to scan the pages at one resolution sufficient for text, then go back and rescan all the illustrations at a higher resolution. (There can be multiple illustrations per page, and an illustration can be embedded within text.) So the page scan filename system has to allow for this possibility.

Another thing I forgot to mention: should the page scan filename system include an identifier pointing to the source book? We might prefix each filename with an ID associated with the source book. That way a page scan can still be associated with a particular book (and its metadata) if it is copied somewhere else and stands alone. Or we could dump millions of page scans from thousands of books together and still trivially identify those belonging to a particular book.

So in the prior example I gave of "00035-28", we might now have:

0003857-00035-28

Here '0003857' is the decimal identifier for the source book which was scanned. Seven digits gives us 10,000,000 books (hexadecimal would be slightly more compact but not as human friendly). '00035' is the sequential position of the page in the source book, counted independently of any printed page numbering, so unnumbered and blank pages are included in the count. "28" is the page number (or 'string') the publisher/author actually printed on the page to identify it.

No doubt there are problems with this system (as noted, how to deal with oddities such as foldouts and the like), but I am proposing it as a sort of strawman.

Jon
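
To make the strawman concrete, here is a minimal Python sketch of building and parsing names of the form "0003857-00035-28". The function and field names (build_scan_name, parse_scan_name, ScanName) are invented for illustration and are not part of any DP tool. The sketch assumes the book ID is always 7 decimal digits, the sequential page is always 5, and the printed label is whatever follows the second hyphen. It does not attempt to handle the open questions above (separate illustration scans, foldouts).

```python
from typing import NamedTuple


class ScanName(NamedTuple):
    """Hypothetical record for one page-scan filename (no extension)."""
    book_id: int   # decimal ID of the source book, 0..9,999,999
    seq: int       # sequential position of the page in the book
    label: str     # page number or string as printed on the page


def build_scan_name(book_id: int, seq: int, label: str) -> str:
    """Format the three parts as '<7 digits>-<5 digits>-<label>'."""
    if not 0 <= book_id <= 9_999_999:
        raise ValueError("book_id must fit in 7 decimal digits")
    if not 0 <= seq <= 99_999:
        raise ValueError("seq must fit in 5 decimal digits")
    return f"{book_id:07d}-{seq:05d}-{label}"


def parse_scan_name(name: str) -> ScanName:
    """Split a name back into its parts.

    Only the first two hyphens are treated as separators, so a printed
    label that itself contains a hyphen survives (an assumption; the
    proposal does not say how such labels should be written).
    """
    book, seq, label = name.split("-", 2)
    if len(book) != 7 or len(seq) != 5:
        raise ValueError(f"unexpected field widths in {name!r}")
    if not (book.isdecimal() and seq.isdecimal()):
        raise ValueError(f"non-numeric book or page field in {name!r}")
    return ScanName(int(book), int(seq), label)


def scans_for_book(names: list[str], book_id: int) -> list[str]:
    """Pick out the scans of one book from a mixed pile of names."""
    return [n for n in names if parse_scan_name(n).book_id == book_id]
```

Because the book ID is a fixed-width prefix, a plain sorted directory listing already groups each book's scans together in page order, which is the "dump millions of page scans together" case.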