Re: [gutvol-d] Scan file naming -- another comment

23 Jul 2005

      On 7/22/05, Jon Noring <jon@noring.name> wrote:
...
Marcello wrote:
...
Jon Noring wrote:
...
Example: DP0000239-00125-106.png
"DP0000239" is the source book identifier, here a DP identifier. If
the scan project is independent of DP, it could be 'PG0014239' to
associate the scan set with PG text number 14239.
Incredibly awkward and broken in several ways:
1. While scanning you have no feedback on the correctitude of your
scanning. You are scanning page "42" and saving to file "58.tif". There
is no immediate relation between the page you are putting on the scanner
and the filename you are saving it under.
But your system requires that each image, when it is first saved,
needs a human being to eyeball the page, determine the publisher
supplied page number (if any; may be implied), and then manually save
the page using the publisher number.
I think you are both going rather far afield. The majority of the
books scanned at DP are scanned directly into Abbyy Finereader, which
has the following limitations/features:

*Page names are numbers only.
*Will split/deskew automatically and assign consecutive numbers. If
the page is upside down, both halves will be upside down and assigned
reversed numbers.
*Has a batch renumbering capability that will move a range of pages to
another range without affecting their order.
*Will threshold greyscale/color to b/w if told (it doesn't handle
pages left in grey well) and will despeckle if told; the despeckle is
fairly aggressive, and has been known to eat punctuation.
*Is not suited to making archival scans of illustrations; greyscale
images are quantized to about half the normal color space, and it
deskews using the shear method.

Generally I start the front matter at 1, run until I hit real page
numbers, then push the 'real' numbered pages so that they start at
101. I believe most other PMs use a variant of this.

Pages without a number often, but not always, do not have OCRable text
and get scanned separately. If they do have text I make a note where
they fit and put them up in the 900 range.
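
To make the scheme concrete, here is a rough Python sketch of how I
would write it down. The fixed offset of 100 for 'real' pages (printed
page 1 saved as 101), the 999 upper bound, and the names scan_number,
REAL_PAGE_OFFSET and UNNUMBERED_BASE are all assumptions for
illustration; nothing here is part of Finereader or Guiprep.

```python
# Hypothetical sketch of the numbering scheme described above.
REAL_PAGE_OFFSET = 100   # assumption: printed page 1 is saved as 101
UNNUMBERED_BASE = 900    # unnumbered pages with text go in the 900 range


def scan_number(kind: str, n: int) -> int:
    """Return the number to save a scan under.

    kind is 'front' (n = position in the front matter), 'real'
    (n = printed page number) or 'unnumbered' (n = running count of
    unnumbered pages with text). n counts from 1 in every case.
    """
    if n < 1:
        raise ValueError("n counts from 1")
    if kind == "front":
        low, high, number = 1, REAL_PAGE_OFFSET, n
    elif kind == "real":
        low, high = REAL_PAGE_OFFSET + 1, UNNUMBERED_BASE - 1
        number = REAL_PAGE_OFFSET + n
    elif kind == "unnumbered":
        low, high = UNNUMBERED_BASE, 999
        number = UNNUMBERED_BASE + n - 1
    else:
        raise ValueError(f"unknown kind: {kind!r}")
    if not low <= number <= high:
        # The ranges are only a convention; a long book would need
        # different bases, so refuse rather than silently collide.
        raise ValueError(f"{kind} page {n} does not fit in {low}-{high}")
    return number
```

With these assumptions, printed page 42 is saved as 142, and a book
with more than 799 numbered pages would need a different layout.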

After scanning, everything gets run through Guiprep (excellent tool!),
which renumbers all of the illustrations consecutively to fit in with
the DB restrictions at DP (0001.png, 0002.png, etc.). DP image
numbers have very little to do with actual page numbers, although in
most cases the two differ by a fixed offset. As Juliet said before,
there are few conventions regarding illustration numbering; I generally
use pic[xxxx].png, where xxxx is the true page #, or a variant thereof.
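
As a rough illustration of the consecutive renumbering and the
pic[xxxx].png convention, a Python sketch might look like the
following. This is not how Guiprep does it; dp_names and
illustration_name are hypothetical helpers, and the four-digit zero
padding for illustration names is my assumption.

```python
# Hypothetical sketch; not Guiprep's actual implementation.

def dp_names(scan_numbers):
    """Map each scan number to a consecutive DP-style file name.

    Scans are taken in ascending numeric order, so 900-range pages
    land at the end; moving them to where they were noted to fit is
    not handled here.
    """
    ordered = sorted(scan_numbers)
    return {n: f"{i:04d}.png" for i, n in enumerate(ordered, start=1)}


def illustration_name(true_page: int) -> str:
    """Name an illustration after the true page number it sits on."""
    return f"pic{true_page:04d}.png"
```

For example, scans numbered 1, 2, 101 and 102 become 0001.png through
0004.png, so the 'real' pages end up at a fixed offset from their DP
image numbers.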

There is no room for fancy metadata in the filename, although I believe
DP now accepts an XML file with each project; you may be able to store
'real' page # information there. I believe the long term plan, though,
is to leave it to a metadata round.

R C

Robert Cicconetti