re: [gutvol-d] Scan file naming -- another comment

jon, go back and read my post in 5 years. you will understand it much better then... the overarching rule is to make the system as complex as it needs to be, and no more. -bowerbird

Bowerbird wrote:
the overarching rule is to make the system as complex as it needs to be, and no more.
The system I propose for scan file naming is *simpler* than yours and more flexible -- it will handle *any* publisher page numbering/naming convention one throws at it (even backwards page numbering, or the same number used twice) without special prefix letters to indicate which major structure of the book the page numbers belong to. It also integrates better into the QC system.

Here's your two-part (really three-part) system as best I understand it:

  BookID : PublisherPageID(withalphabetprefixes)

Here's my system:

  BookID : ScanSeq# : PublisherPageID(asis)

In your system, you are asking the person who names the files not only to read the PublisherPageID, but also to add some letter prefix (as needed) to handle the different numbering schemes used in the book (typically two: Roman and Arabic).

In my system, the person only needs to read the PublisherPageID and enter it, without having to figure out any letter prefixes -- this is easier and more reliable. It will also handle cases your (and Marcello's) system won't handle, such as backward-numbered pages and repeated page numbers (this example was actually brought up.)

My system also integrates well (not saying yours doesn't) into the natural work flow of the scanning and QC process:

1) Scans are made sequentially from front to back, including all blank pages. Each scan is given a simple ScanSeq#.

2) During the next stage, where a human being looks at each scan, they append the *actual* publisher-supplied page number (or string) to the filename from (1). No need to add any letter prefixes or anything -- they use the *actual* string "as it is".

3) Then, if needed, the BookID (whether it is a DP database record ID or a PG text number) is prepended to the whole set. This is trivially done with a script. In Windows, I could run a command-line *.bat file to do this if I wanted.

Let's not forget that during post-processing (deskewing, cropping, color reduction, etc.) we are generating derivative scan sets whose names must be distinguishable from those of the other derivative scan sets (and of the master set). This was a problem I had. By adding a fourth field to the filename, we can differentiate between scan sets of the same project, which we may wish to preserve (at least in the working database.)

This is not complicated at all, and provides a lot of flexibility.

Jon
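For readers who want to see the BookID : ScanSeq# : PublisherPageID scheme as filenames, here is a minimal Python sketch. The post does not specify a separator, a padding width, or values for the fourth field, so the underscore separator, the four-digit sequence number, and the variant names "master" and "cropped" are all assumptions made for illustration; `build_name` and `parse_name` are hypothetical helpers, not part of any existing tool. The sketch also assumes the page label itself contains no underscore.

```python
from typing import NamedTuple

SEP = "_"  # assumed separator; the thread does not name one


class ScanName(NamedTuple):
    book_id: str      # DP record ID or PG text number
    seq: int          # ScanSeq#: position of the scan, front to back
    page_label: str   # publisher's page string copied as is ("" if unnumbered)
    variant: str      # hypothetical fourth field for derivative scan sets


def build_name(book_id, seq, page_label, variant="master", ext="tiff"):
    """Join the four fields into one filename (zero-padded seq is an assumption)."""
    return SEP.join([book_id, f"{seq:04d}", page_label, variant]) + "." + ext


def parse_name(filename):
    """Split a filename produced by build_name back into its fields."""
    stem, _, _ext = filename.rpartition(".")
    book_id, seq, page_label, variant = stem.split(SEP)
    return ScanName(book_id, int(seq), page_label, variant)


# Labels as a proofer might read them: a blank page, Roman numerals,
# a repeated "42", and a number that runs backwards.
labels = ["", "i", "ii", "42", "42", "41"]
names = [build_name("12345", n, label) for n, label in enumerate(labels, start=1)]
for name in names:
    print(name)
```

Because the sequence number comes before the page label, the repeated "42" still yields two distinct filenames, and a plain alphabetical sort returns the scans in the order they were made.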

Jon Noring wrote:
In my system, the person only needs to read the PublisherPageID and enter that without having to figure out any letter prefixes -- this is easier and more reliable. It will also handle cases your (and Marcello's) system won't handle, such as backward-numbered pages and where page numbers are repeated (this example was actually brought up.)
Wrong. My system will handle backward-numbered and duplicated pages with a vengeance.

But before I demonstrate this, I want to say that your better software architect (me) will design a system that handles 99% of the cases in a simple, intuitive and straightforward manner, and not a system that handles 100% of the cases -- but only theoretically, because it is so incredibly complicated and awkward that nobody can use it.

Your system is incredibly complicated, awkward and fundamentally broken, because you need too much information to successfully link to a page, and once the link is set it will not survive the slightest reorganisation of the files.

But now let me demonstrate how my system handles your abnormal cases with very little manual workaround.

BACKWARD-NUMBERED PAGES

1. Scan the book "backwards", i.e. starting with page 1. If you have a sheet-feeder this takes no more work than flipping the whole book over once.

2. Your scan software will save the pages as 1.tiff, 2.tiff etc., with every file containing the "real" page 1, 2, etc. You have instant feedback on the correctness of your scanning: if you put on page 314, your software should offer to save it to 314.tiff. If not, you know you have botched it and can go figure.

3. You run a perl-script that compresses 1.tiff to p0001.djvu, etc.

4. You assemble the multi-page djvu file backwards:

  djvm -c 12345.djvu `ls -r *djvu`

Done. A "backward" book will take you about 5 seconds longer than a straight one. (If you need help on the perl-script drop me a mail.)

DUPLICATED PAGE NUMBERS

The only thing my system doesn't handle gracefully right out of the box -- being based on the real page number as key -- is duplicated page numbers. But there is an easy workaround: you have to manually edit the filename of the second page "42" to "p0042a". After assembling the multi-page djvu file you have to insert the duplicate pages like this:

  djvm -c 12345.djvu *[0-9].djvu
  djvm -i 12345.djvu p0042a.djvu 44
  djvm -i 12345.djvu p0043a.djvu 45

Done. This will take you about 5 minutes longer than a "correct" book.

--
Marcello Perathoner
webmaster@gutenberg.org
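The renaming and assembly steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the post's actual script is in Perl and is not shown, the TIFF-to-DjVu compression itself is left out, and `djvu_name` and `assembly_command` are hypothetical helper names. The sketch builds the `djvm -c` argument list quoted in the post but does not run it.

```python
import re


def djvu_name(tiff_name, suffix=""):
    """Map a scan such as '42.tiff' to 'p0042.djvu'.

    suffix is the manual workaround for a duplicated page number,
    e.g. suffix='a' gives 'p0042a.djvu'.
    """
    match = re.fullmatch(r"(\d+)\.tiff", tiff_name)
    if match is None:
        raise ValueError(f"not a page-numbered scan: {tiff_name!r}")
    return f"p{int(match.group(1)):04d}{suffix}.djvu"


def assembly_command(book_id, djvu_files, backward=False):
    """Build the 'djvm -c' argument list for a multi-page file.

    backward=True reverses the sorted order, mirroring `ls -r` in the post.
    Zero-padded names make alphabetical order equal to page order.
    """
    ordered = sorted(djvu_files, reverse=backward)
    return ["djvm", "-c", f"{book_id}.djvu", *ordered]


pages = [djvu_name(f"{n}.tiff") for n in (1, 2, 3)]
print(" ".join(assembly_command("12345", pages, backward=True)))
```

The zero padding is what lets a plain sort (or `ls -r`) stand in for page order; without it, p10.djvu would sort before p2.djvu.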
participants (3)
- Bowerbird@aol.com
- Jon Noring
- Marcello Perathoner