re: [gutvol-d] Re: DPF images archives [Was: Re: Kevin Kelly ...]

robert said:
For one thing, the DP pages # rarely correspond to the physical page #s,
and each book will require manual intervention to determine page numbers, short of some kind of re-ocr and automatic page # extraction from the headers. Personally, I'm betting on manual intervention. :)
actually, i have a program that aids greatly in that regard. it'll display each scan so you can get the page-number from it. if there's a mismatch between file-name and page-number, you just enter the correct page-number, so it'll be renamed. the time-saver is an auto-increment button, which you click if the page-number is the next one in sequence, which both fills in the rename-info and steps to the next page, meaning you use that for most of the files, so the task goes fairly fast. the actual renaming takes place as a batch process at the end.

however... i've also used a variation on the other process you mentioned. if for some reason you screwed up the file-name/page-number relationship when you scanned the book, you can fix it quickly.

since the text-file from the o.c.r. process has the _same_name_ as the image-file from which it came, you can have the program read the text-file and do an "extraction" of the page-number -- it's almost always the number one higher than the previous, and mostly, it'll be in the running-head, so in the first line of the file, which is why you must always make sure to scan running-heads, and if it's an even-number it will be on the left, odd on the right, except on pages where a new chapter starts, when it's often gone entirely, or (equally often) centered at the _bottom_ of the page, which does mean, yes sir, that that is one of those 30 variables that are an indication that this specific page starts a new chapter, in which case the first line(s) of that page constitute the header, and i do apologize for the run-on digression this has become -- thus you can use that information to rename both the image-file and the text-file so that they reflect the correct page-number...

unlike the former process, which is mostly manual, this one is mostly automatic, you just have to give the file-renaming list a quick once-over to make sure that it got everything correct. again, the renaming takes place as a batch process at the end.
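for the scripting cats: the page-number "extraction" described above might be sketched roughly like this in python. this is not the program mentioned in this post, just a minimal illustration under some loud assumptions: the function names, the `p0012`-style naming scheme, and the status labels are all made up for the example, the running-head is assumed to be the first non-blank line of the o.c.r. text, and numbers centered at the bottom of the page are not looked for at all (those pages simply fall back to "previous plus one").

```python
import re

def extract_page_number(ocr_text):
    """Return the page number found in the running head, or None.

    Assumption: the running head is the first non-blank line, and the page
    number sits at its left edge (even pages) or right edge (odd pages).
    """
    for line in ocr_text.splitlines():
        head = line.strip()
        if head:
            match = re.match(r"(\d{1,4})\b", head) or re.search(r"\b(\d{1,4})$", head)
            return int(match.group(1)) if match else None
    return None

def plan_renames(pages):
    """Build a rename list from (file_stem, ocr_text) pairs given in scan order.

    Returns (old_stem, new_stem, status) tuples. Status is one of:
      "ok"      - number read from the running head and in sequence
      "guessed" - no number found, so previous + 1 was assumed
      "check"   - number found but out of sequence; review by hand
      "manual"  - nothing to go on yet; new_stem is None
    """
    plan = []
    previous = None
    for stem, text in pages:
        number = extract_page_number(text)
        if number is None:
            if previous is None:
                plan.append((stem, None, "manual"))
                continue
            number, status = previous + 1, "guessed"
        elif previous is not None and number != previous + 1:
            status = "check"
        else:
            status = "ok"
        # hypothetical naming scheme: "p" plus a zero-padded page number
        plan.append((stem, f"p{number:04d}", status))
        previous = number
    return plan
```

the plan is only a list. nothing gets renamed until you've given it the once-over, and since the image-file and text-file share a stem, each entry would apply to both files in the batch rename at the end.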
even a book with unnumbered plate-pages intermixed can be fixed in fairly short order with this program. it's pretty sweet...

it's best to run this correction _immediately_, before you even start to do any other processing on the files, because it's simply a waste of time to work with inaccurately-named files. however, you can even use it after-the-fact on completely-processed books. (unless you've discarded or combined all the individual text-files.)

i use this app most frequently, however, after scraping a scan-set at google or distributed proofreaders. after displaying all the pages, i copy all of those files from the browser-cache into a separate folder; the program can sort the files according to their modification-time, which puts them in the correct order (since their alphanumeric names from the cache are often not in alphabetic order), and i get the first name in sync, and then just keep on clicking the auto-increment button. slick.

anyway, i'm sure you scripting cats can whip up a version of this program on your own, now that i've described how it works. but if you do need this program i've written, let me know what operating system you want it for, and i'll compile you out a copy...

-bowerbird
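as a starting point for anyone whipping up their own version, the cache-scrape workflow above (sort by modification-time, get the first name in sync, then auto-increment) could look something like the following python sketch. again, this is not the program described in this post. the function names, the `overrides` argument standing in for the manual corrections, and the `p0005`-style names are all assumptions made for illustration.

```python
import os
from pathlib import Path

def files_by_mtime(folder):
    """Return the files in `folder`, oldest modification time first."""
    files = [p for p in Path(folder).iterdir() if p.is_file()]
    return sorted(files, key=lambda p: p.stat().st_mtime)

def number_pages(files, first_page, overrides=None):
    """Auto-increment page numbers over `files`, starting at `first_page`.

    `overrides` (hypothetical) maps a position in `files` to either
      an int - a manual correction; counting resumes from that number
      a str  - a label for an unnumbered plate; the counter does not advance
    Returns a list of (old_path, new_path) pairs; nothing is renamed here.
    """
    overrides = overrides or {}
    plan = []
    page = first_page
    for index, path in enumerate(files):
        entry = overrides.get(index)
        if isinstance(entry, str):
            name = entry
        else:
            if entry is not None:
                page = entry
            name = f"p{page:04d}"
            page += 1
        plan.append((path, path.with_name(name + path.suffix)))
    return plan

def apply_plan(plan):
    """Do the renaming as one batch, refusing to overwrite existing files."""
    for old, new in plan:
        if old == new:
            continue
        if new.exists():
            raise FileExistsError(f"{new} already exists")
        old.rename(new)
```

usage would be along the lines of `apply_plan(number_pages(files_by_mtime("scans"), first_page=1, overrides={7: "plate-a"}))`, where the folder name and the plate label are made up. keeping the plan separate from the rename step is what lets you review the list before anything on disk changes.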
participants (1): Bowerbird@aol.com