re: [gutvol-d] Re: DPF images archives [Was: Re: Kevin Kelly ...]

27 May 2006

robert said:
   ...
   For one thing, the DP pages # rarely correspond to the physical page #s,
   ...
   and each book will require manual intervention to determine page numbers,
   short of some kind of re-ocr and automatic page # extraction from the headers.
   Personally, I'm betting on manual intervention. :)

actually, i have a program that aids greatly in that regard.

it'll display each scan so you can get the page-number from it.

if there's a mismatch between file-name and page-number,
you just enter the correct page-number, so it'll be renamed.

the time-saver is an auto-increment button, which you click
if the page-number is the next one in sequence, which both
fills in the rename-info and steps to the next page, meaning
you use that for most of the files, so the task goes fairly fast.

the actual renaming takes place as a batch process at the end.
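
for the scripting cats, here is a rough python sketch of that bookkeeping.
it is only an illustration, not the program itself: the names build_plan
and apply_plan, the zero-padded naming scheme, and the temporary file-names
are all assumptions made up for the example.

```python
from pathlib import Path

def build_plan(names, entries, width=3):
    """Pair each file-name with a new name based on its page-number.

    entries[i] is an int when the page-number was typed in by hand,
    or None when the auto-increment button was used (previous + 1).
    """
    plan = []
    page = None
    for name, entry in zip(names, entries):
        if entry is None:
            if page is None:
                raise ValueError("the first page needs an explicit number")
            page += 1
        else:
            page = entry
        ext = Path(name).suffix
        plan.append((name, f"{page:0{width}d}{ext}"))
    return plan

def apply_plan(folder, plan):
    """Do all the renames as one batch at the end.

    Every file first moves to a temporary name, so a new name that
    matches some other file's old name cannot clobber it.
    """
    folder = Path(folder)
    staged = []
    for i, (old, new) in enumerate(plan):
        tmp = folder / f"__tmp_{i}__"
        (folder / old).rename(tmp)
        staged.append((tmp, folder / new))
    for tmp, final in staged:
        tmp.rename(final)
```

something like build_plan(["a.png", "b.png", "c.png"], [5, None, None])
would pair the three files with 005.png, 006.png, and 007.png; nothing on
disk changes until apply_plan is called with that list.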

however...

i've also used a variation on the other process you mentioned.

if for some reason you screwed up the file-name/page-number
relationship when you scanned the book, you can fix it quickly.

since the text-file from the o.c.r. process has the _same_name_
as the image-file from which it came, you can have the program
read the text-file and do an "extraction" of the page-number --
it's almost always the number one higher than the previous, and
mostly, it'll be in the running-head, so in the first line of the file,
which is why you must always make sure to scan running-heads,
and if it's an even-number it will be on the left, odd on the right,
except on pages where a new chapter starts, when it's often gone
entirely, or (equally often) centered at the _bottom_ of the page,
which does mean, yes sir, that that is one of those 30 variables
that are an indication that this specific page starts a new chapter,
in which case the first line(s) of that page constitute the header,
and i do apologize for the run-on digression this has become --
thus you can use that information to rename both the image-file
and the text-file so that they reflect the correct page-number...
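
a minimal python sketch of that "extraction", under some loud assumptions:
the function name extract_page_number is hypothetical, the number is taken
to be plain arabic digits, the running-head is taken to be the first
non-blank line, and a bottom-of-page number is taken to sit alone on the
last non-blank line. real o.c.r. output will be messier than this.

```python
import re

def extract_page_number(text, previous=None):
    """Guess the page-number from one page of o.c.r. text.

    Looks at the left and right ends of the first non-blank line
    (the running-head) and at a last line made only of digits
    (a number centered at the bottom of a chapter-opening page).
    If 'previous' is given, a candidate equal to previous + 1 wins.
    Returns None when no candidate is found.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return None
    head, foot = lines[0], lines[-1]
    candidates = []
    m = re.match(r"(\d+)\b", head)      # even pages: number on the left
    if m:
        candidates.append(int(m.group(1)))
    m = re.search(r"\b(\d+)$", head)    # odd pages: number on the right
    if m:
        candidates.append(int(m.group(1)))
    if re.fullmatch(r"\d+", foot):      # chapter openers: number at the bottom
        candidates.append(int(foot))
    if previous is not None and previous + 1 in candidates:
        return previous + 1
    return candidates[0] if candidates else None
```

a page where this returns None (or something other than previous + 1) is
a page to look at by hand, and possibly the start of a new chapter.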

unlike the former process, which is mostly manual, this one is
mostly automatic, you just have to give the file-renaming list
a quick once-over to make sure that it got everything correct.
again, the renaming takes place as a batch process at the end.
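
the once-over can be made quicker by flagging only the suspicious rows.
this is a sketch, not how the program necessarily does it; the function
name review_list and the "MISSING" / "CHECK" labels are invented here.

```python
def review_list(names, pages):
    """Build (name, page, flag) rows for a quick visual check.

    flag is "MISSING" when no page-number was found for the file,
    "CHECK" when the number is not one higher than the last number
    seen, and "" when the page simply follows in sequence.
    """
    rows = []
    previous = None
    for name, page in zip(names, pages):
        if page is None:
            flag = "MISSING"
        elif previous is not None and page != previous + 1:
            flag = "CHECK"
        else:
            flag = ""
        rows.append((name, page, flag))
        if page is not None:
            previous = page
    return rows
```

unnumbered plate-pages would show up as "MISSING" rows, and the page right
after a gap as a "CHECK" row, so those are the only ones needing a look.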

even a book with unnumbered plate-pages intermixed can be
fixed in fairly short order with this program. it's pretty sweet...

it's best to run this correction _immediately_, before you even
start to do any other processing on the files, because it's simply
a waste of time to work with inaccurately-named files. however,
you can even use it after-the-fact on completely-processed books.
(unless you've discarded or combined all the individual text-files.)

i use this app most frequently, however, after scraping a scan-set
at google or distributed proofreaders. after displaying all the pages,
i copy all of those files from the browser-cache into a separate folder;
the program can sort the files according to their modification-time,
which puts them in the correct order (since the names the cache gives
them often don't sort into page order), and i get the first name
in sync, and then just keep on clicking the auto-increment button. slick.
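
the modification-time sort is the easy part to script. a small python
sketch, assuming the copied files still carry the time-stamps they had in
the cache (some copy tools reset them); files_by_mtime is a made-up name.

```python
from pathlib import Path

def files_by_mtime(folder):
    """Return the files in 'folder', oldest modification-time first.

    Ties are broken by file-name so the order is repeatable.
    """
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    return sorted(files, key=lambda p: (p.stat().st_mtime_ns, p.name))
```

the result is the order in which the pages were fetched, which is page
order only if the pages were displayed front-to-back in the first place.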

anyway, i'm sure you scripting cats can whip up a version of this program
on your own, now that i've described how it works. but if you do need this
program i've written, let me know what operating system you want it for,
and i'll compile you out a copy...

-bowerbird

Bowerbird＠aol.com
