
On 7/22/05, Jon Noring <jon@noring.name> wrote:
Marcello wrote:
Jon Noring wrote:
Example: DP0000239-00125-106.png
"DP0000239" is the source book identifier, here a DP identifier. If the scan project is independent of DP, it could be 'PG0014239' to associate the scan set with PG text number 14239.
Incredibly awkward and broken in several ways:
1. While scanning you have no feedback on the correctness of your scanning. You are scanning page "42" and saving to file "58.tif". There is no immediate relation between the page you are putting on the scanner and the filename you are saving it under.
But your system requires that, when each image is first saved, a human being eyeball the page, determine the publisher-supplied page number (if any; may be implied), and then manually save the page using the publisher number.
I think you are both going rather far afield. The majority of the books scanned at DP are scanned directly into Abbyy Finereader, which has the following limitations/features:

* Numbers only.
* Will split/deskew automatically, and assign consecutive numbers. If the page is upside down, both halves will be upside down and assigned reversed numbers.
* Has batch renumbering capability that will move a range of pages to another range, but not affect the order.
* Will threshold greyscale/color to b/w if told (it doesn't handle pages left in grey well) and will despeckle if told; the despeckle is fairly aggressive, and has been known to eat punctuation.
* Is not suited to making archival scans of illustrations; greyscale images are quantized to about half the normal color space, and it deskews using the shear method.

Generally I start the beginning material at 1, run until I hit real page numbers, then push 'real' numbered pages to 101. I believe most other PMs use a variant of this. Pages without a number often, but not always, do not have OCRable text and get scanned separately. If they do have text I make a note where they fit and put them up in the 900 range.

After scanning, everything gets run through Guiprep (Excellent tool!) which renumbers all of the illustrations to fit in with the DB restrictions at DP (0001.png, 0002.png, etc.) consecutively. DP image numbers have very little to do with actual page numbers, although in most cases it is a fixed offset.

As Juliet said before, there are few conventions regarding illustration numbering; I generally use pic[xxxx].png, where xxxx is the true page #, or a variant thereof. No room for fancy metadata in the filename, although I believe DP now accepts an XML file with each project; you may be able to store 'real' page # information there. Although I believe the long term plan is to leave it to a metadata round.

R C
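
For anyone who wants the numbering convention above spelled out, here is a rough sketch in Python. It is not how Guiprep or Finereader actually work internally. The function names, the `placements` mapping (my stand-in for the handwritten note of where an unnumbered page fits), and the assumption that parked pages start at exactly 900 are all hypothetical.

```python
# Rough sketch of the scan-numbering convention described above.
# All names here are hypothetical; this is not Guiprep's real logic.

BODY_OFFSET = 100        # printed page 1 is stored as scan number 101
PARKED_START = 900       # assumed start of the range for unnumbered pages


def scan_number(printed_page):
    """Scan number under which a page with a printed number is saved."""
    return printed_page + BODY_OFFSET


def reading_order(scan_numbers, placements=None):
    """Return scan numbers in reading order.

    placements maps a parked (900-range) scan number to the scan number
    it should follow, standing in for the scanner's handwritten note.
    Parked pages without a placement simply sort to the end.
    """
    placements = placements or {}
    present = set(scan_numbers)
    parked = {p for p in placements if p in present}
    ordered = []
    for n in sorted(present - parked):
        ordered.append(n)
        # Slot in any parked pages noted as following this one.
        ordered.extend(sorted(p for p in parked if placements[p] == n))
    return ordered


def renumber_consecutively(scan_numbers, placements=None):
    """Map each scan number to a consecutive name like 0001.png."""
    order = reading_order(scan_numbers, placements)
    return {n: f"{i:04d}.png" for i, n in enumerate(order, start=1)}


if __name__ == "__main__":
    # Three front-matter pages, printed pages 1-3, one unnumbered page
    # that belongs after printed page 2.
    scans = [1, 2, 3, scan_number(1), scan_number(2), scan_number(3), 900]
    for scan, name in renumber_consecutively(scans, {900: 102}).items():
        print(scan, "->", name)
```

With no unnumbered pages in the middle, the final image number differs from the printed page number by a fixed amount (here, the count of front-matter pages), which is the fixed offset mentioned above. Each inserted 900-range page shifts that offset by one for everything after it.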