ok, on the "good news" front, it appears that rfrank has
finally decided to start naming his files more wisely, so
big respect to the people who steered in that direction.
there seemed to be some uncertainty from roger about
how to go about coding apps with those new filenames,
so i'll talk a little bit about that and hope it filters back...
but the initial info can be used by other people as well!
sure, if you're scanning your own books, you can name
the files intelligently from the get-go, and never worry.
(but, um, if you _are_ scanning your own books, please
ask me for advice on filenaming, and don't just do what
d.p. did when they tried to implement smart filenames,
because they got some of the "details" badly mangled.)
but sometimes, from other people, you might get files
which were named badly, and you'll have to rename 'em.
even some of the big scanning projects -- umichigan and
the internet archive and google (well, not so much google,
not any more, they wised up pretty quickly) -- have been
known to adopt some fairly stupid filenaming conventions,
so if you use their stuff, you'll have to clean up their mess.
so it behooves you to know how.
first things first: get yourself "twisted", the dkretz program.
> http://code.google.com/p/dp50/downloads/list
the initial impetus for this program was precisely this task
of renaming files intelligently, and it works very well for it.
so that's really all you need.
but i'll tell you a bit more...
let's say you're doing preprocessing. one of the things that
d.p. does is it strips the pagenumbers out of the .txt files...
that is just asinine! do not do that, folks. that is the info
that you _need_, so -- obviously -- do _not_ throw it away!
rfrank discards the pagenumber info from his .txt files too.
sometimes, though, for some books, the pagenumber info
sidesteps deletion. one such book was the "sitka" one that
jim and i have been working on. you can find the file here:
> http://z-m-l.com/go/jimad/sitka0-ocr.txt
you can see, at the bottom of each page, the pagenumber,
enclosed in brackets. and oh what a lovely sight they are!
because they tell exactly what the file _should_ be named!
for instance, go down to the start of chapter 1.
you will see that it occurs in the file rfrank named "011.txt".
but, as shown by the pagenumber at the bottom, it's page 7,
and _should_ be named "007.txt" or (better) "sitkap007.txt".
(in case you're wondering why chapter 1 starts on page 7,
it's because the _foreword_ starts on page 1, and runs to
page 5. page 6 is a blank verso that is opposite chapter 1.)
so we know the file "011.txt" should be "sitkap007.txt". great!
but remember the other wrinkle too -- the pagescan filename.
so if we know that "011.txt" should be named "sitkap007.txt",
we also know that "011.png" should be named "sitkap007.png".
now we're cooking...
***
so, to find out the pagenumbers in each of the text-files,
you can run a little perl program i've put up on the site:
> http://z-m-l.com/go/jimad/doglobal.pl
that program is a simple "find" program that pulls out any line
with the string ".txt" in it, or a right-bracket (i.e., "]"), as shown:
sitka0-ocr-001.txt -- [Illustration][**fine print verified by CP]
sitka0-ocr-002.txt --
sitka0-ocr-003.txt --
sitka0-ocr-004.txt -- [Illustration: Lovers' Lane, Sitka.]
sitka0-ocr-005.txt --
sitka0-ocr-006.txt --
sitka0-ocr-007.txt -- [3]
sitka0-ocr-008.txt -- [4]
sitka0-ocr-009.txt -- [5]
sitka0-ocr-010.txt -- [Blank Page]
sitka0-ocr-011.txt -- [7]
sitka0-ocr-012.txt -- [8]
sitka0-ocr-013.txt -- [9]
sitka0-ocr-014.txt -- [10]
sitka0-ocr-015.txt --
sitka0-ocr-016.txt -- [11]
sitka0-ocr-017.txt -- 112]
sitka0-ocr-018.txt -- [13]
sitka0-ocr-019.txt -- [14]
sitka0-ocr-020.txt -- [15]
sitka0-ocr-021.txt -- [16]
sitka0-ocr-022.txt -- [17]
sitka0-ocr-023.txt -- [18]
sitka0-ocr-024.txt -- [19]
sitka0-ocr-025.txt -- 120]
sitka0-ocr-026.txt -- [21]
sitka0-ocr-027.txt -- [22]
sitka0-ocr-028.txt -- [23]
sitka0-ocr-029.txt -- [24]
sitka0-ocr-030.txt -- [Blank Page]
sitka0-ocr-031.txt -- [25]
sitka0-ocr-032.txt -- [26]
sitka0-ocr-033.txt -- [27]
sitka0-ocr-034.txt -- [28]
sitka0-ocr-035.txt -- [29]
sitka0-ocr-036.txt -- [30]
sitka0-ocr-037.txt -- [31]
sitka0-ocr-038.txt -- [32]
sitka0-ocr-039.txt -- [33]
sitka0-ocr-040.txt -- [34]
sitka0-ocr-041.txt -- [35]
sitka0-ocr-042.txt -- [36]
sitka0-ocr-043.txt -- [Blank Page]
sitka0-ocr-044.txt -- [37]
sitka0-ocr-045.txt -- [38]
sitka0-ocr-046.txt -- [39]
sitka0-ocr-047.txt -- [40]
sitka0-ocr-048.txt -- [41]
sitka0-ocr-049.txt -- [42]
sitka0-ocr-050.txt -- [43]
sitka0-ocr-051.txt -- [44]
sitka0-ocr-052.txt -- [45]
sitka0-ocr-053.txt -- [46]
sitka0-ocr-054.txt -- [Blank Page]
sitka0-ocr-055.txt -- [47]
sitka0-ocr-056.txt -- [48]
sitka0-ocr-057.txt -- [49]
sitka0-ocr-058.txt -- [50]
sitka0-ocr-059.txt -- [51]
sitka0-ocr-060.txt -- [52]
sitka0-ocr-061.txt -- [53]
sitka0-ocr-062.txt -- [54]
sitka0-ocr-063.txt -- [Blank Page]
sitka0-ocr-064.txt -- [55]
sitka0-ocr-065.txt -- [56]
sitka0-ocr-066.txt --
sitka0-ocr-067.txt -- [57]
sitka0-ocr-068.txt --
sitka0-ocr-069.txt -- [59]
sitka0-ocr-070.txt -- [60]
sitka0-ocr-071.txt --
sitka0-ocr-072.txt -- [61]
sitka0-ocr-073.txt -- [62]
sitka0-ocr-074.txt --
sitka0-ocr-075.txt -- [63]
sitka0-ocr-076.txt -- [64]
sitka0-ocr-077.txt -- [65]
sitka0-ocr-078.txt -- [66]
sitka0-ocr-079.txt --
sitka0-ocr-080.txt -- [67]
sitka0-ocr-081.txt -- [68]
sitka0-ocr-082.txt -- [69]
sitka0-ocr-083.txt -- [70]
sitka0-ocr-084.txt -- [71]
sitka0-ocr-085.txt -- [72]
sitka0-ocr-086.txt -- [73]
sitka0-ocr-087.txt -- [74]
sitka0-ocr-088.txt -- [75]
sitka0-ocr-089.txt -- [76]
sitka0-ocr-090.txt --
sitka0-ocr-091.txt -- [77]
sitka0-ocr-092.txt -- [78]
sitka0-ocr-093.txt -- [79]
sitka0-ocr-094.txt -- [80]
sitka0-ocr-095.txt -- [81]
sitka0-ocr-096.txt -- [82]
sitka0-ocr-097.txt -- [83]
sitka0-ocr-098.txt -- 184]
sitka0-ocr-099.txt -- [85]
sitka0-ocr-100.txt -- [86]
sitka0-ocr-101.txt -- [87]
sitka0-ocr-102.txt -- [88]
sitka0-ocr-103.txt -- [89]
sitka0-ocr-104.txt -- [90]
sitka0-ocr-105.txt -- [91]
sitka0-ocr-106.txt -- [92]
sitka0-ocr-107.txt -- [Blank Page]
sitka0-ocr-108.txt -- [93]
sitka0-ocr-109.txt -- [94]
sitka0-ocr-110.txt -- [Blank Page]
sitka0-ocr-111.txt -- [95]
sitka0-ocr-112.txt -- [96]
sitka0-ocr-113.txt -- [97]
sitka0-ocr-114.txt -- [98]
sitka0-ocr-115.txt -- [99]
sitka0-ocr-116.txt -- [100]
sitka0-ocr-117.txt --
sitka0-ocr-118.txt -- [101]
sitka0-ocr-119.txt -- [102]
sitka0-ocr-120.txt -- [103]
sitka0-ocr-121.txt -- [104]
sitka0-ocr-122.txt -- [105]
sitka0-ocr-123.txt -- [106]
sitka0-ocr-124.txt -- [107]
sitka0-ocr-125.txt -- [108]
sitka0-ocr-126.txt --
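if you'd rather not run perl, the same kind of report can be sketched in python. this is not the doglobal.pl code, just a minimal stand-in: it assumes you have the text of each page paired with its filename, and the names `bracket_report` and `read_pages` are made up for illustration.

```python
from pathlib import Path

def bracket_report(pages):
    """pages: list of (filename, text) pairs, one per page.
    returns one report line per page: the filename, then every
    line of that page which contains a right-bracket."""
    report = []
    for name, text in pages:
        hits = [line.strip() for line in text.splitlines() if "]" in line]
        report.append(f"{name} -- {''.join(hits)}".rstrip())
    return report

def read_pages(folder):
    """load every .txt file in a folder (an assumed layout), sorted by name."""
    return [(p.name, p.read_text(encoding="utf-8", errors="replace"))
            for p in sorted(Path(folder).glob("*.txt"))]

# usage, assuming the page files sit in a folder called "sitka":
#   for line in bracket_report(read_pages("sitka")):
#       print(line)
```

a page with no bracketed line at all shows up as a bare filename, which is exactly what you want, because those are the pages you need to look at.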
***
i will do a detailed look at that list, and explain everything in it,
but you might wanna take a gander first, to see what _you_ see.
since it might be more fun for you to figure it out for yourself,
rather than plow through my pedantic bullshit...
***
now, we need to do a little repair on some pages, as follows:
the left-bracket was misrecognized as a "1" on 3 files, so fix that:
sitka017.txt -- change 112] to [12]
sitka025.txt -- change 120] to [20]
sitka098.txt -- change 184] to [84]
the first 4 pages are front-matter, so add some "f" pagenumbers:
sitka001.txt -- add [f001]
sitka002.txt -- add [f002]
sitka003.txt -- add [f003]
sitka004.txt -- add [f004]
the first 2 pagenumbers were deleted by early proofers, so add back:
sitka005.txt -- add [1]
sitka006.txt -- add [2]
page 6 really is a blank page, so let's add a pagenumber to it:
sitka010.txt -- add [6]
the pagenumber on 1 file wasn't picked up by scanner, so we'll add it:
sitka068.txt -- add [58]
the pagenumber on the last page, a map, wasn't there, so we'll add it:
sitka126.txt -- add [109]
the rest are illustration pages (even though some claim to be "blank"),
which we can tell because they exist outside of the page-sequencing,
so we'll add the "a" filenaming convention to slide them into place...
append "a" to these unnumbered pages, which had no pagenumber:
sitka015.txt -- add [10a]
sitka066.txt -- add [56a]
sitka074.txt -- add [62a]
sitka079.txt -- add [66a]
sitka090.txt -- add [76a]
sitka117.txt -- add [100a]
sitka030.txt -- change [Blank Page] to [24a]
sitka043.txt -- change [Blank Page] to [36a]
sitka054.txt -- change [Blank Page] to [46a]
sitka063.txt -- change [Blank Page] to [54a]
sitka107.txt -- change [Blank Page] to [92a]
sitka110.txt -- change [Blank Page] to [94a]
as i said in a short response to juliet yesterday, many of these
missing and misrecognized pagenumbers _could_ have been
"filled in" automatically, because of pagenumber redundancy.
but editing them wasn't too difficult for this particular book...
(i did the editing using my new editor interface, which i will be
revealing to all you excited fans out there next week. oh boy!)
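to make the "redundancy" idea concrete, here's a minimal python sketch. it is only one possible rule, and the function names are hypothetical: a label counts as good only if it is exactly "[digits]", and a bad or missing label gets filled in only when the pages on both sides pin it down.

```python
import re

def parse_label(line):
    """return the page number if the line is exactly "[digits]", else None."""
    m = re.fullmatch(r"\[(\d+)\]", line.strip())
    return int(m.group(1)) if m else None

def fill_from_neighbors(numbers):
    """fill a single missing number when both neighbors agree on it:
    n, ?, n+2  becomes  n, n+1, n+2.
    everything else stays None, for a human to look at."""
    filled = list(numbers)
    for i in range(1, len(filled) - 1):
        before, after = filled[i - 1], filled[i + 1]
        if filled[i] is None and before is not None and after == before + 2:
            filled[i] = before + 1
    return filled

labels = ["[11]", "112]", "[13]"]
print(fill_from_neighbors([parse_label(x) for x in labels]))
```

on the list above, that rule would repair 112], 120] and 184], and would fill in the missing [58] and [6]. it leaves the illustration pages alone (for example, the one between [10] and [11]), because there is no free number for them, and it can't help with the very first pages or the very last one, which have no neighbor on one side.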
***
once all of the pagenumbers in the files have been corrected,
output from the above doglobal.pl script would look like this:
sitka0-ocr-001.txt -- [f001]
sitka0-ocr-002.txt -- [f002]
sitka0-ocr-003.txt -- [f003]
sitka0-ocr-004.txt -- [f004]
sitka0-ocr-005.txt -- [1]
sitka0-ocr-006.txt -- [2]
sitka0-ocr-007.txt -- [3]
sitka0-ocr-008.txt -- [4]
sitka0-ocr-009.txt -- [5]
sitka0-ocr-010.txt -- [6]
sitka0-ocr-011.txt -- [7]
sitka0-ocr-012.txt -- [8]
...
sitka0-ocr-024.txt -- [19]
sitka0-ocr-025.txt -- [20]
sitka0-ocr-026.txt -- [21]
sitka0-ocr-027.txt -- [22]
sitka0-ocr-028.txt -- [23]
sitka0-ocr-029.txt -- [24]
sitka0-ocr-030.txt -- [24a]
sitka0-ocr-031.txt -- [25]
...
sitka0-ocr-126.txt -- [109]
***
then we can do a variant of that output, to do the renaming for us:
rename sitka0-ocr-001.txt as sitkaf001.txt
rename sitka0-ocr-002.txt as sitkaf002.txt
rename sitka0-ocr-003.txt as sitkaf003.txt
rename sitka0-ocr-004.txt as sitkaf004.txt
rename sitka0-ocr-005.txt as sitkap001.txt
rename sitka0-ocr-006.txt as sitkap002.txt
rename sitka0-ocr-007.txt as sitkap003.txt
rename sitka0-ocr-008.txt as sitkap004.txt
rename sitka0-ocr-009.txt as sitkap005.txt
rename sitka0-ocr-010.txt as sitkap006.txt
rename sitka0-ocr-011.txt as sitkap007.txt
rename sitka0-ocr-012.txt as sitkap008.txt
...
rename sitka0-ocr-024.txt as sitkap019.txt
rename sitka0-ocr-025.txt as sitkap020.txt
rename sitka0-ocr-026.txt as sitkap021.txt
rename sitka0-ocr-027.txt as sitkap022.txt
rename sitka0-ocr-028.txt as sitkap023.txt
rename sitka0-ocr-029.txt as sitkap024.txt
rename sitka0-ocr-030.txt as sitkap024a.txt
rename sitka0-ocr-031.txt as sitkap025.txt
...
rename sitka0-ocr-126.txt as sitkap109.txt
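the step from a corrected label to a new filename is mechanical, so here's a small python sketch of it. the function names and the "sitka" default are just for illustration; the pattern (an "f" or "p" prefix, three digits, an optional letter) is read off the list above.

```python
import re

def new_name(label, book="sitka", ext=".txt"):
    """turn a corrected label such as "f001", "7" or "24a" into a filename."""
    m = re.fullmatch(r"(f?)(\d+)([a-z]?)", label)
    if not m:
        raise ValueError(f"unrecognized page label: {label!r}")
    front, digits, suffix = m.groups()
    prefix = "f" if front else "p"          # front-matter versus regular page
    return f"{book}{prefix}{int(digits):03d}{suffix}{ext}"

def rename_plan(old_names, labels, book="sitka"):
    """pair each old filename with its new one, keeping the old extension."""
    plan = []
    for old, label in zip(old_names, labels):
        ext = old[old.rfind("."):]
        plan.append((old, new_name(label, book, ext)))
    return plan

for old, new in rename_plan(["sitka0-ocr-011.txt", "sitka0-ocr-030.txt"], ["7", "24a"]):
    print("rename", old, "as", new)
```

this only builds the list of pairs. actually renaming the files on disk is a separate step, and it's worth eyeballing the plan before you let anything touch your files.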
***
remember that we have to do the scan files as well.
(we'll just do a global change from ".txt" to ".png".)
rename sitka0-ocr-001.png as sitkaf001.png
rename sitka0-ocr-002.png as sitkaf002.png
rename sitka0-ocr-003.png as sitkaf003.png
rename sitka0-ocr-004.png as sitkaf004.png
rename sitka0-ocr-005.png as sitkap001.png
rename sitka0-ocr-006.png as sitkap002.png
rename sitka0-ocr-007.png as sitkap003.png
rename sitka0-ocr-008.png as sitkap004.png
rename sitka0-ocr-009.png as sitkap005.png
rename sitka0-ocr-010.png as sitkap006.png
rename sitka0-ocr-011.png as sitkap007.png
rename sitka0-ocr-012.png as sitkap008.png
...
rename sitka0-ocr-024.png as sitkap019.png
rename sitka0-ocr-025.png as sitkap020.png
rename sitka0-ocr-026.png as sitkap021.png
rename sitka0-ocr-027.png as sitkap022.png
rename sitka0-ocr-028.png as sitkap023.png
rename sitka0-ocr-029.png as sitkap024.png
rename sitka0-ocr-030.png as sitkap024a.png
rename sitka0-ocr-031.png as sitkap025.png
...
rename sitka0-ocr-126.png as sitkap109.png
***
this example makes it pretty clear that -- if you only
leave the pagenumbers in the o.c.r., just leave 'em! --
it's pretty easy to use them to name your files wisely...
pagenumbers in the runhead are easy to grab as well.
they're either at the right side of the runhead (if odd)
or at the left side of the runhead (on the even pages).
(the runhead is usually the first line in the file, right?,
but sometimes the pagenumber drops to the second.
still it's usually the first _number_ you find in the file,
so it's easy enough to code your script to look for that.)
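a minimal python sketch of that "first number in the file" idea, with the caveat that the two-line limit and the function name are my assumptions, and that the sample runheads in the usage lines are invented:

```python
import re

def runhead_pagenumber(text, max_lines=2):
    """return the first run of digits found in the first few non-blank
    lines of a page, or None if there isn't one.
    max_lines=2 is a guess that suits a one-line runhead, not a rule."""
    lines = [ln for ln in text.splitlines() if ln.strip()][:max_lines]
    for line in lines:
        m = re.search(r"\d+", line)
        if m:
            return int(m.group())
    return None

print(runhead_pagenumber("46 SOME RUNNING HEAD\nbody text starts here"))
print(runhead_pagenumber("SOME RUNNING HEAD\n\n23\nbody text starts here"))
```

it doesn't care whether the number sits on the left or the right of the runhead, so odd and even pages are handled the same way.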
again, you have to check them!, to make sure they were
recognized correctly, so you can fix 'em if they weren't.
but once you've got them all in place, you are golden...
and the beauty is that now your files are named wisely!
you'll always know page 23 is in the file named "p023",
and page 46 is in "p046", and page 123 is in "p123"...
moreover, when you want to go to page 46, you will
actually _end_up_ on page 46, not some other page
that is kinda close, depending on what the "offset" is!
***
and here's another nice thing. you'll notice that we
had some unnumbered pages that were named with
an appended "a"? well, we need to keep the recto and
the verso straight, if we want to make good e-books,
so we can't just add an "a" without a backside "b" too.
but hey, that's no problem at all! after each "a" page,
we just slide in a "b" name underneath it, and presto!,
our recto/verso is right again. and we didn't have to
_readjust_ all filenames that followed each "insertion",
because those files were wisely-named to begin with.
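here's a small python sketch of sliding the "b" names in. it assumes the names are held in a plain list, in book order, without extensions; the function name is hypothetical.

```python
import re

def add_versos(names):
    """after every name ending in digits plus "a", insert the matching
    "b" name, unless it is already the next item in the list."""
    out = []
    for i, name in enumerate(names):
        out.append(name)
        if re.search(r"\da$", name):
            verso = name[:-1] + "b"
            following = names[i + 1] if i + 1 < len(names) else None
            if following != verso:
                out.append(verso)
    return out

print(add_versos(["sitkap024", "sitkap024a", "sitkap025"]))
```

notice that nothing after the insertion has to change: "sitkap025" is still "sitkap025".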
***
there's one more thing to talk about: coding apps...
(if you don't do coding, you can leave now if you want;
but it probably won't hurt you to read the rest of this.
you made it _this_ far, so you must be a glutton for it.)
first let's get the necessary admission out of the way...
it's very easy to do your coding when you name your
files in a stupid 001.txt-999.txt way, because you can
simply code the number as a shortcut for the filename.
you use an integer for your pagenumber, and it's easy.
your _files_ go from 1 to 999, and so do your _names_.
it's easy to keep track of things; you just go up or down.
because of this ease, i can understand why you _might_
want to keep using those stupid filenames. but don't...
still, at first, it may not be immediately obvious to you
how to depart from this method. but it really is simple.
instead of thinking of each filename as a _number_
(i.e., an integer), think of it as a "name" (i.e., a string).
yes, the filename has a number _in_ it, and the number
is the _important_part_ (to your end-user), but do not
_think_ of it in this way, at least not for the time being.
think of the filename as a string, nothing but a string...
however, you will _load_ those strings into an _array_...
you'll have as many items in the array as you have files,
and the value of each item will be the _name_ of the file.
then you think of the _index_ for that array as an integer
-- because that's what it is! -- and you use _that_ in the
exact same way you used your pagenumber integer before.
so see, you didn't have to give up the easy convenience of
a number to keep on-track like you thought you'd have to.
your array index goes up and down, just like it did before.
in other words, you can still think of your _files_ as going
from 1 to 999, and increment your array index as before.
but whenever you want to know the _filename_ of a page,
you look-up the value of the array at that index-number.
so let's look at how this would work for our "sitka" book.
the string value of array item #1 would be "sitkaf001".
the string value of array item #2 would be "sitkaf002".
the string value of array item #5 would be "sitkap001",
because it's page 1, and that's where the foreword starts.
the string value of array item #11 would be "sitkap007",
because that's page 7, and that's where chapter 1 starts...
and the string value of array item #126 would be "sitkap109",
and that's the map that's on the last (recto) page in the book.
(of course, there will be a blank verso that'll be "sitkap110",
since a book cannot have an odd number of pages, can it?)
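in code, the whole trick might look like this minimal python sketch. the class and its method names are made up for illustration, and python counts from 0 where i've been counting from 1, so item #1 above is `names[0]` here.

```python
class PageList:
    """filenames are strings in a list; the position in the list is the integer."""

    def __init__(self, names):
        self.names = list(names)
        self.index = 0                      # python counts from 0, not 1

    def current(self):
        return self.names[self.index]

    def step(self, delta):
        """move forward or back, stopping at either end of the book."""
        self.index = max(0, min(len(self.names) - 1, self.index + delta))
        return self.current()

    def goto(self, name):
        """jump straight to a page by its name, e.g. "sitkap003"."""
        self.index = self.names.index(name)
        return self.current()

pages = PageList(["sitkaf001", "sitkaf002", "sitkaf003", "sitkaf004",
                  "sitkap001", "sitkap002", "sitkap003"])
print(pages.step(4))            # four files in, we land on page 1
print(pages.goto("sitkap003"))
```

"next page" and "previous page" are still plus-one and minus-one on an integer. the only new piece is the lookup.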
so the last question is "how do i populate the filename array?"
there are various ways you can do it, but two good ones are:
1. read the book's subdirectory to glean the graphic filenames.
2. create a "map" file intended to provide the graphic filenames.
you can also combine these 2 methods as "belt and suspenders";
you create a map file, but your viewer-app confirms the map by
reading the subdirectory to ensure all the graphic files are there.
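a minimal python sketch of that belt-and-suspenders check, assuming the map has already been read into a list of names without extensions, and that the scans are .png files in one folder (the function name is hypothetical):

```python
from pathlib import Path

def check_map(map_names, folder, ext=".png"):
    """compare the names in the map against the files actually on disk.
    returns (missing, unlisted): names in the map with no file,
    and files in the folder that the map doesn't mention."""
    on_disk = {p.stem for p in Path(folder).glob(f"*{ext}")}
    listed = set(map_names)
    return sorted(listed - on_disk), sorted(on_disk - listed)

# usage, assuming a folder called "sitka" next to the script:
#   missing, unlisted = check_map(["sitkap001", "sitkap002"], "sitka")
#   if missing or unlisted:
#       print("map and folder disagree:", missing, unlisted)
```

the map still supplies the page order; the folder only confirms that every file the map promises is really there.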
it's not nearly as difficult to create a "map" file as you might think.
for instance, look closely at the sitka file we're working on:
> http://z-m-l.com/go/jimad/sitka0-ocr.txt
just pull out the separator lines, and you've got your map file.
of course, the current version of that file is using the current
stupid filenames, but you can generate a new concatenated file
after you've renamed your .txt files, and your map will be fine.
you can also just view your subdirectory structure in a browser,
and copy out the filenames, and save them in a file, and bingo!,
there's your map file.
myself, with z.m.l., i use the separator-line method, as you can
see if you look at any paginated z.m.l. file. the lines which have
double-braces enclosing a graphic filename constitute the map.
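pulling the map out of such a file can be sketched in a few lines of python. the only thing this assumes is what's described above, a filename enclosed in double-braces on its own line; the sample text in the usage lines is invented, not copied from a real file.

```python
import re

def map_from_separators(text):
    """collect whatever sits inside {{ }} on each line, in file order."""
    return [m.group(1).strip() for m in re.finditer(r"\{\{(.*?)\}\}", text)]

sample = "{{sitkap007.png}}\nsome page text\n{{sitkap008.png}}\nmore page text\n"
print(map_from_separators(sample))
```

the list that comes back is already in book order, so it can be loaded directly as the filename array.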
***
all in all, if you start naming your files intelligently, you'll find
that the benefits far outweigh the costs of doing any rename...
still, i've tried here, in this post, to show you how to do a rename
in the easiest possible way. just remember not to do like d.p. --
_and_keep_the_darn_pagenumber_information_in_your_o.c.r._files_...
-bowerbird