
ok, on the "good news" front, it appears that rfrank has finally decided to start naming his files more wisely, so big respect to the people who steered in that direction. there seemed to be some uncertainty from roger about how to go about coding apps with those new filenames, so i'll talk a little bit about that and hope it filters back... but the initial info can be used by other people as well! sure, if you're scanning your own books, you can name the files intelligently from the get-go, and never worry. (but, um, if you _are_ scanning your own books, please ask me for advice on filenaming, and don't just do what d.p. did when they tried to implement smart filenames, because they got some of the "details" badly mangled.) but sometimes, from other people, you might get files which were named badly, and you'll have to rename 'em. even some of the big scanning projects -- umichigan and the internet archive and google (well, not so much google, not any more, they wised up pretty quickly) -- have been known to adopt some fairly stupid filenaming conventions, so if you use their stuff, you'll have to clean up their mess. so it behooves you to know how. first things first: get yourself "twisted", the dkretz program.
the initial impetus for this program was precisely this task of renaming files intelligently, and it works very well for it. so that's really all you need. but i'll tell you a bit more... let's say you're doing preprocessing. one of the things that d.p. does is it strips the pagenumbers out of the .txt files... that is just asinine! do not do that, folks. that is the info that you _need_, so -- obviously -- do _not_ throw it away! rfrank discards the pagenumber info from his .txt files too. sometimes, though, for some books, the pagenumber info sidesteps deletion. one such book was the "sitka" one that jim and i have been working on. you can find the file here:
you can see, at the bottom of each page, the pagenumber, enclosed in brackets. and oh what a lovely sight they are! because they tell exactly what the file _should_ be named! for instance, go down to the start of chapter 1. you will see that it occurs in the file rfrank named "011.txt". but, as shown by the pagenumber at the bottom, it's page 7, and _should_ be named "007.txt" or (better) "sitkap007.txt". (in case you're wondering why chapter 1 starts on page 7, it's because the _foreword_ starts on page 1, and runs to page 5. page 6 is a blank verso that is opposite chapter 1.) so we know the file "011.txt" should be "sitkap007.txt". great! but remember there's another wrinkle too -- the pagescan filename. so if we know that "011.txt" should be named "sitkap007.txt", we also know that "011.png" should be named "sitkap007.png". now we're cooking...

***

so, to find out the pagenumbers in each of the text-files, you can run a little perl program i've put up on the site:
that program is a simple "find" program that pulls out any line with the string ".txt" in it, or a right-bracket (i.e., "]"), as shown:

sitka0-ocr-001.txt -- [Illustration][**fine print verified by CP]
sitka0-ocr-002.txt --
sitka0-ocr-003.txt --
sitka0-ocr-004.txt -- [Illustration: Lovers' Lane, Sitka.]
sitka0-ocr-005.txt --
sitka0-ocr-006.txt --
sitka0-ocr-007.txt -- [3]
sitka0-ocr-008.txt -- [4]
sitka0-ocr-009.txt -- [5]
sitka0-ocr-010.txt -- [Blank Page]
sitka0-ocr-011.txt -- [7]
sitka0-ocr-012.txt -- [8]
sitka0-ocr-013.txt -- [9]
sitka0-ocr-014.txt -- [10]
sitka0-ocr-015.txt --
sitka0-ocr-016.txt -- [11]
sitka0-ocr-017.txt -- 112]
sitka0-ocr-018.txt -- [13]
sitka0-ocr-019.txt -- [14]
sitka0-ocr-020.txt -- [15]
sitka0-ocr-021.txt -- [16]
sitka0-ocr-022.txt -- [17]
sitka0-ocr-023.txt -- [18]
sitka0-ocr-024.txt -- [19]
sitka0-ocr-025.txt -- 120]
sitka0-ocr-026.txt -- [21]
sitka0-ocr-027.txt -- [22]
sitka0-ocr-028.txt -- [23]
sitka0-ocr-029.txt -- [24]
sitka0-ocr-030.txt -- [Blank Page]
sitka0-ocr-031.txt -- [25]
sitka0-ocr-032.txt -- [26]
sitka0-ocr-033.txt -- [27]
sitka0-ocr-034.txt -- [28]
sitka0-ocr-035.txt -- [29]
sitka0-ocr-036.txt -- [30]
sitka0-ocr-037.txt -- [31]
sitka0-ocr-038.txt -- [32]
sitka0-ocr-039.txt -- [33]
sitka0-ocr-040.txt -- [34]
sitka0-ocr-041.txt -- [35]
sitka0-ocr-042.txt -- [36]
sitka0-ocr-043.txt -- [Blank Page]
sitka0-ocr-044.txt -- [37]
sitka0-ocr-045.txt -- [38]
sitka0-ocr-046.txt -- [39]
sitka0-ocr-047.txt -- [40]
sitka0-ocr-048.txt -- [41]
sitka0-ocr-049.txt -- [42]
sitka0-ocr-050.txt -- [43]
sitka0-ocr-051.txt -- [44]
sitka0-ocr-052.txt -- [45]
sitka0-ocr-053.txt -- [46]
sitka0-ocr-054.txt -- [Blank Page]
sitka0-ocr-055.txt -- [47]
sitka0-ocr-056.txt -- [48]
sitka0-ocr-057.txt -- [49]
sitka0-ocr-058.txt -- [50]
sitka0-ocr-059.txt -- [51]
sitka0-ocr-060.txt -- [52]
sitka0-ocr-061.txt -- [53]
sitka0-ocr-062.txt -- [54]
sitka0-ocr-063.txt -- [Blank Page]
sitka0-ocr-064.txt -- [55]
sitka0-ocr-065.txt -- [56]
sitka0-ocr-066.txt --
sitka0-ocr-067.txt -- [57]
sitka0-ocr-068.txt --
sitka0-ocr-069.txt -- [59]
sitka0-ocr-070.txt -- [60]
sitka0-ocr-071.txt --
sitka0-ocr-072.txt -- [61]
sitka0-ocr-073.txt -- [62]
sitka0-ocr-074.txt --
sitka0-ocr-075.txt -- [63]
sitka0-ocr-076.txt -- [64]
sitka0-ocr-077.txt -- [65]
sitka0-ocr-078.txt -- [66]
sitka0-ocr-079.txt --
sitka0-ocr-080.txt -- [67]
sitka0-ocr-081.txt -- [68]
sitka0-ocr-082.txt -- [69]
sitka0-ocr-083.txt -- [70]
sitka0-ocr-084.txt -- [71]
sitka0-ocr-085.txt -- [72]
sitka0-ocr-086.txt -- [73]
sitka0-ocr-087.txt -- [74]
sitka0-ocr-088.txt -- [75]
sitka0-ocr-089.txt -- [76]
sitka0-ocr-090.txt --
sitka0-ocr-091.txt -- [77]
sitka0-ocr-092.txt -- [78]
sitka0-ocr-093.txt -- [79]
sitka0-ocr-094.txt -- [80]
sitka0-ocr-095.txt -- [81]
sitka0-ocr-096.txt -- [82]
sitka0-ocr-097.txt -- [83]
sitka0-ocr-098.txt -- 184]
sitka0-ocr-099.txt -- [85]
sitka0-ocr-100.txt -- [86]
sitka0-ocr-101.txt -- [87]
sitka0-ocr-102.txt -- [88]
sitka0-ocr-103.txt -- [89]
sitka0-ocr-104.txt -- [90]
sitka0-ocr-105.txt -- [91]
sitka0-ocr-106.txt -- [92]
sitka0-ocr-107.txt -- [Blank Page]
sitka0-ocr-108.txt -- [93]
sitka0-ocr-109.txt -- [94]
sitka0-ocr-110.txt -- [Blank Page]
sitka0-ocr-111.txt -- [95]
sitka0-ocr-112.txt -- [96]
sitka0-ocr-113.txt -- [97]
sitka0-ocr-114.txt -- [98]
sitka0-ocr-115.txt -- [99]
sitka0-ocr-116.txt -- [100]
sitka0-ocr-117.txt --
sitka0-ocr-118.txt -- [101]
sitka0-ocr-119.txt -- [102]
sitka0-ocr-120.txt -- [103]
sitka0-ocr-121.txt -- [104]
sitka0-ocr-122.txt -- [105]
sitka0-ocr-123.txt -- [106]
sitka0-ocr-124.txt -- [107]
sitka0-ocr-125.txt -- [108]
sitka0-ocr-126.txt --

***

i will do a detailed look at that list, and explain everything in it, but you might wanna take a gander first, to see what _you_ see. since it might be more fun for you to figure it out for yourself, rather than plow through my pedantic bullshit...
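for anyone who'd rather not use perl, here is a rough python sketch of the same kind of "find" pass. it is not the script mentioned above, just an approximation of what that script is described as doing. the separator-line format in the sample ("----- File: ... -----") and the function name are assumptions for illustration; all the sketch really relies on is that the filename ending in ".txt" appears somewhere on a separator line.

```python
import re

def scan_page_labels(text):
    """Return (filename, label) pairs from a concatenated o.c.r. file."""
    entries = []
    for line in text.splitlines():
        name = re.search(r"\S+\.txt", line)
        if name:
            # a separator line: start a new entry for this page file
            entries.append([name.group(0), ""])
        elif "]" in line and entries:
            # keep any line containing a right-bracket, garbled or not
            entries[-1][1] += line.strip()
    return [(name, label) for name, label in entries]

# hypothetical sample; the real separator lines may look different
sample = """\
----- File: sitka0-ocr-010.txt -----
[Blank Page]
----- File: sitka0-ocr-011.txt -----
CHAPTER I
some text here
[7]
----- File: sitka0-ocr-017.txt -----
more text
112]
"""

for name, label in scan_page_labels(sample):
    print(f"{name} -- {label}")
```

matching on the right-bracket rather than the left one is what lets a garbled label like "112]" still show up in the list, so you can spot it and fix it.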
***

now, we need to do a little repair on some pages, as follows:

the left-bracket was misrecognized on 3 files, so fix that:
sitka017.txt -- change 112] to [12]
sitka025.txt -- change 120] to [20]
sitka098.txt -- change 184] to [84]

the first 4 pages are front-matter, so add some "f" pagenumbers:
sitka001.txt -- add [f001]
sitka002.txt -- add [f002]
sitka003.txt -- add [f003]
sitka004.txt -- add [f004]

the first 2 pagenumbers were deleted by early proofers, so add back:
sitka005.txt -- add [1]
sitka006.txt -- add [2]

page 6 really is a blank page, so let's add a pagenumber to it:
sitka010.txt -- add [6]

the pagenumber on 1 file wasn't picked up by scanner, so we'll add it:
sitka068.txt -- add [58]

the pagenumber on the last page, a map, wasn't there, so we'll add it:
sitka126.txt -- add [109]

the rest are illustration pages (even though some claim to be "blank"), which we can tell because they exist outside of the page-sequencing, so we'll add the "a" filenaming convention to slide them into place...

append "a" to these unnumbered pages, which had no pagenumber:
sitka015.txt -- add [10a]
sitka066.txt -- add [56a]
sitka071.txt -- add [60a]
sitka074.txt -- add [62a]
sitka079.txt -- add [66a]
sitka090.txt -- add [76a]
sitka117.txt -- add [100a]
sitka030.txt -- change [Blank Page] to [24a]
sitka043.txt -- change [Blank Page] to [36a]
sitka054.txt -- change [Blank Page] to [46a]
sitka063.txt -- change [Blank Page] to [54a]
sitka107.txt -- change [Blank Page] to [92a]
sitka110.txt -- change [Blank Page] to [94a]

as i said in a short response to juliet yesterday, many of these missing and misrecognized pagenumbers _could_ have been "filled in" automatically, because of pagenumber redundancy. but editing them wasn't too difficult for this particular book... (i did the editing using my new editor interface, which i will be revealing to all you excited fans out there next week. oh boy!)
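to give a feel for what "filled in automatically" could mean, here is a small python sketch. it is only one guess at how the redundancy might be used, not a description of any existing tool, and the function names are made up for the example. the idea: if a page has no clean label but its two neighbors do, the neighbors pin it down. when the neighbors differ by 2, the page in between is the missing number; when they are consecutive, the page sits outside the sequence and gets an "a" name.

```python
import re

def parse_number(label):
    """Return the page number if the label is a clean [n], else None."""
    match = re.fullmatch(r"\[(\d+)\]", label.strip())
    return int(match.group(1)) if match else None

def fill_from_neighbors(labels):
    """Repair labels that the pages on either side make unambiguous."""
    numbers = [parse_number(label) for label in labels]
    repaired = list(labels)
    for i in range(1, len(labels) - 1):
        if numbers[i] is not None:
            continue  # already a clean [n]
        before, after = numbers[i - 1], numbers[i + 1]
        if before is None or after is None:
            continue  # not enough context; leave it for hand-editing
        if after == before + 2:
            repaired[i] = f"[{before + 1}]"   # a lost or garbled number
        elif after == before + 1:
            repaired[i] = f"[{before}a]"      # an unnumbered inserted page
    return repaired

# labels for files 013 through 018 from the list above
raw = ["[9]", "[10]", "", "[11]", "112]", "[13]"]
print(fill_from_neighbors(raw))
```

anything this sketch can't resolve (the front-matter, the first pages, the last page, two bad labels in a row) is left untouched, so a human still has to look at those.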
***

once all of the pagenumbers in the files have been corrected, output from the above doglobal.pl script would look like this:

sitka0-ocr-001.txt -- [f001]
sitka0-ocr-002.txt -- [f002]
sitka0-ocr-003.txt -- [f003]
sitka0-ocr-004.txt -- [f004]
sitka0-ocr-005.txt -- [1]
sitka0-ocr-006.txt -- [2]
sitka0-ocr-007.txt -- [3]
sitka0-ocr-008.txt -- [4]
sitka0-ocr-009.txt -- [5]
sitka0-ocr-010.txt -- [6]
sitka0-ocr-011.txt -- [7]
sitka0-ocr-012.txt -- [8]
...
sitka0-ocr-024.txt -- [19]
sitka0-ocr-025.txt -- [20]
sitka0-ocr-026.txt -- [21]
sitka0-ocr-027.txt -- [22]
sitka0-ocr-028.txt -- [23]
sitka0-ocr-029.txt -- [24]
sitka0-ocr-030.txt -- [24a]
sitka0-ocr-031.txt -- [25]
...
sitka0-ocr-126.txt -- [109]

***

then we can do a variant of that output, to do the renaming for us:

rename sitka0-ocr-001.txt as sitkaf001.txt
rename sitka0-ocr-002.txt as sitkaf002.txt
rename sitka0-ocr-003.txt as sitkaf003.txt
rename sitka0-ocr-004.txt as sitkaf004.txt
rename sitka0-ocr-005.txt as sitkap001.txt
rename sitka0-ocr-006.txt as sitkap002.txt
rename sitka0-ocr-007.txt as sitkap003.txt
rename sitka0-ocr-008.txt as sitkap004.txt
rename sitka0-ocr-009.txt as sitkap005.txt
rename sitka0-ocr-010.txt as sitkap006.txt
rename sitka0-ocr-011.txt as sitkap007.txt
rename sitka0-ocr-012.txt as sitkap008.txt
...
rename sitka0-ocr-024.txt as sitkap019.txt
rename sitka0-ocr-025.txt as sitkap020.txt
rename sitka0-ocr-026.txt as sitkap021.txt
rename sitka0-ocr-027.txt as sitkap022.txt
rename sitka0-ocr-028.txt as sitkap023.txt
rename sitka0-ocr-029.txt as sitkap024.txt
rename sitka0-ocr-030.txt as sitkap024a.txt
rename sitka0-ocr-031.txt as sitkap025.txt
...
rename sitka0-ocr-126.txt as sitkap109.txt

***

remember that we have to do the scan files as well. (we'll just do a global change from ".txt" to ".png".)
rename sitka0-ocr-001.png as sitkaf001.png
rename sitka0-ocr-002.png as sitkaf002.png
rename sitka0-ocr-003.png as sitkaf003.png
rename sitka0-ocr-004.png as sitkaf004.png
rename sitka0-ocr-005.png as sitkap001.png
rename sitka0-ocr-006.png as sitkap002.png
rename sitka0-ocr-007.png as sitkap003.png
rename sitka0-ocr-008.png as sitkap004.png
rename sitka0-ocr-009.png as sitkap005.png
rename sitka0-ocr-010.png as sitkap006.png
rename sitka0-ocr-011.png as sitkap007.png
rename sitka0-ocr-012.png as sitkap008.png
...
rename sitka0-ocr-024.png as sitkap019.png
rename sitka0-ocr-025.png as sitkap020.png
rename sitka0-ocr-026.png as sitkap021.png
rename sitka0-ocr-027.png as sitkap022.png
rename sitka0-ocr-028.png as sitkap023.png
rename sitka0-ocr-029.png as sitkap024.png
rename sitka0-ocr-030.png as sitkap024a.png
rename sitka0-ocr-031.png as sitkap025.png
...
rename sitka0-ocr-126.png as sitkap109.png

***

this example makes it pretty clear that -- if you only leave the pagenumbers in the o.c.r., just leave 'em! -- it's pretty easy to use them to name your files wisely...

pagenumbers in the runhead are easy to grab as well. they're either at the right side of the runhead (if odd) or at the left side of the runhead (on the even pages). (the runhead is usually the first line in the file, right?, but sometimes the pagenumber drops to the second. still it's usually the first _number_ you find in the file, so it's easy enough to code your script to look for that.) again, you have to check them!, to make sure they were recognized correctly, so you can fix 'em if they weren't. but once you've got them all in place, you are golden...

and the beauty is that now your files are named wisely! you'll always know page 23 is in the file named "p023", and page 46 is in "p046", and page 123 is in "p123"... moreover, when you want to go to page 46, you will actually _end_up_ on page 46, not some other page that is kinda close, depending on what the "offset" is!
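the rename lists above are mechanical enough to generate with a few lines of code. here is a minimal python sketch, assuming you already have the corrected (filename, label) pairs in hand. the function names are made up for this example, and the sketch only prints the plan; it does not touch any files.

```python
import re

def new_stem(book, label):
    """Turn a corrected label such as [7], [24a] or [f001] into a file stem."""
    match = re.fullmatch(r"\[(f?)(\d+)([a-z]?)\]", label)
    if not match:
        raise ValueError(f"unrecognized pagenumber label: {label!r}")
    front, digits, suffix = match.groups()
    prefix = "f" if front else "p"   # front-matter vs. regular pages
    return f"{book}{prefix}{int(digits):03d}{suffix}"

def rename_plan(book, entries, extensions=(".txt", ".png")):
    """List (old, new) filename pairs for the text files and the scans."""
    plan = []
    for ext in extensions:
        for old_name, label in entries:
            old_stem = old_name.rsplit(".", 1)[0]
            plan.append((old_stem + ext, new_stem(book, label) + ext))
    return plan

entries = [
    ("sitka0-ocr-004.txt", "[f004]"),
    ("sitka0-ocr-011.txt", "[7]"),
    ("sitka0-ocr-030.txt", "[24a]"),
]
for old, new in rename_plan("sitka", entries):
    print(f"rename {old} as {new}")
```

a label that doesn't fit the pattern (say, a leftover "112]") raises an error instead of producing a bad filename, which is the point: you want to find out before anything gets renamed. once the plan looks right, python's `os.rename(old, new)` can carry it out.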
*** and here's another nice thing. you'll notice that we had some unnumbered pages that were named with an appended "a"? well, we need to keep the recto and the verso straight, if we want to make good e-books, so we can't just add an "a" without a backside "b" too. but hey, that's no problem at all! after each "a" page, we just slide in a "b" name underneath it, and presto!, our recto/verso is right again. and we didn't have to _readjust_ all filenames that followed each "insertion", because those files were wisely-named to begin with. *** there's one more thing to talk about: coding apps... (if you don't do coding, you can leave now if you want; but it probably won't hurt you to read the rest of this. you made it _this_ far, so you must be a glutton for it.) first let's get the necessary admission out of the way... it's very easy to do your coding when you name your files in a stupid 001.txt-999.txt way, because you can simply code the number as a shortcut for the filename. you use an integer for your pagenumber, and it's easy. your _files_ go from 1 to 999, and so do your _names_. it's easy to keep track of things; you just go up or down. because of this ease, i can understand why you _might_ want to keep using those stupid filenames. but don't... still, at first, it may not be immediately obvious to you how to depart from this method. but it really is simple. instead of thinking of each filename as a _number_ (i.e., an integer), think of it as a "name" (i.e., a string). yes, the filename has a number _in_ it, and the number is the _important_part_ (to your end-user), but do not _think_ of it in this way, at least not for the time being. think of the filename as a string, nothing but a string... however, you will _load_ those strings into an _array_... you'll have as many items in the array as you have files, and the value of each item will be the _name_ of the file. then you think of the _index_ for that array as an integer -- because that's what it is! 
-- and you use _that_ in the exact same way you used your pagenumber integer before. so see, you didn't have to give up the easy convenience of a number to keep on-track like you thought you'd have to. your array index goes up and down, just like it did before. in other words, you can still think of your _files_ as going from 1 to 999, and increment your array index as before. but whenever you want to know the _filename_ of a page, you look-up the value of the array at that index-number.

so let's look at how this would work for our "sitka" book.

the string value of array item #1 would be "sitkaf001".
the string value of array item #2 would be "sitkaf002".
the string value of array item #5 would be "sitkap001", because it's page 1, and that's where the foreword starts.
the string value of array item #11 would be "sitkap007", because that's page 7, and that's where chapter 1 starts...
and the string value of array item #126 would be "sitkap109", and that's the map that's on the last (recto) page in the book.

(of course, there will be a blank verso that'll be "sitkap110", since a book cannot have an odd number of pages, can it?)

so the last question is "how do i populate the filename array?" there are various ways you can do it, but two good ones are:

1. read the book's subdirectory to glean the graphic filenames.
2. create a "map" file intended to provide the graphic filenames.

you can also combine these 2 methods as "belt and suspenders"; you create a map file, but your viewer-app confirms the map by reading the subdirectory to ensure all the graphic files are there.

it's not nearly as difficult to create a "map" file as you might think. for instance, look closely at the sitka file we're working on:
just pull out the separator lines, and you've got your map file. of course, the current version of that file is using the current stupid filenames, but you can generate a new concatenated file after you've renamed your .txt files, and your map will be fine. you can also just view your subdirectory structure in a browser, and copy out the filenames, and save them in a file, and bingo!, there's your map file. myself, with z.m.l., i use the separator-line method, as you can see if you look at any paginated z.m.l. file. the lines which have double-braces enclosing a graphic filename constitute the map. *** all in all, if you start naming your files intelligently, you'll find that the benefits far outweigh the costs of doing any rename... still, i've tried here, in this post, to show you how to do a rename in the easiest possible way. just remember not to do like d.p. -- _and_keep_the_darn_filename_information_in_your_o.c.r._files_... -bowerbird
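
p.s. for the coders: here is a minimal python sketch of the filename-array idea, the "belt and suspenders" check, and the slide-in-a-"b" trick. everything in it is illustrative. the map format (one file stem per line) and the function names are assumptions, not the z.m.l. separator-line format, and the directory listing is passed in as a plain list so the example stays self-contained.

```python
import re

def load_map(map_text):
    """One file stem per line, in book order; blank lines are ignored."""
    return [line.strip() for line in map_text.splitlines() if line.strip()]

def missing_scans(names, directory_listing, ext=".png"):
    """Belt and suspenders: which mapped pages have no scan in the directory?"""
    present = set(directory_listing)
    return [name for name in names if name + ext not in present]

def with_versos(names):
    """Slide a 'b' name in after each 'a' page to keep recto/verso straight."""
    result = []
    for name in names:
        result.append(name)
        if re.search(r"\da$", name):          # e.g. sitkap024a
            result.append(name[:-1] + "b")    # becomes sitkap024b
    return result

names = load_map(
    "sitkaf001\nsitkaf002\nsitkap001\nsitkap024\nsitkap024a\nsitkap025\n"
)

index = names.index("sitkap024")   # jump straight to page 24
index += 1                         # "next page" is still just index + 1
print(names[index])                # the filename is looked up, never computed
print(with_versos(names))
print(missing_scans(names, ["sitkaf001.png", "sitkap001.png"]))
```

note that python counts array items from 0, not 1, so "item #1" in the discussion above is `names[0]` here. in a real viewer you'd get the directory listing from something like `os.listdir()` on the book's subdirectory.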