March 2010 - gutvol-d - lists.pglaf.org

book flipping scanning -- 200 pages a minute
by Bowerbird＠aol.com 24 Mar '10

check it out...

> http://spectrum.ieee.org/automaton/robotics/robotics-software/book-flipping…
> The system, developed by lab members Takashi Nakashima
> and Yoshihiro Watanabe, lets you scan a book by
> rapidly flipping its pages in front of a high-speed camera.
> They call this method book flipping scanning.
> They told me they can digitize a 200-page book in one minute,
> and hope to make that even faster.

no, it's not actually here yet, but this isn't just a pile of david rothman bullshit hype, this has some reality to it.

-bowerbird

Re: his style grates on many and his egotism seems boundless
by Bowerbird＠aol.com 24 Mar '10

over on his fadedpage forums, rfrank said: > For those of you that read gutvol-d, > you should know that I do, also. ok. that clears that up, for anyone who was wondering. > Bowerbird seems very interested in the work > we are doing here you're darn tooting i am. :+) rfrank is currently doing the most interesting work in the whole arena of book distribution, and that's an arena where i have exhibited keen interest for a very long time, certainly long before rfrank ever became involved with it. > Bowerbird seems very interested in > the work we are doing here and posts his > observations and suggestions of the gutvol-d list. well, yeah, i've been posting on this listserve since 2003, except for that short time when the attack-pack got me "moderated", and i went on strike until it was rescinded. so again, nothing new there... > Though his style grates on many > and his egotism seems boundless, ok, let's break this down, shall we? > his style grates on many yes it does. and this is particularly so for those people who subscribe to the dale-carnegie school-of-thought on "how to win friends and influence people", of which i do firmly believe that rfrank is a very big follower... (and al haines too, as well as juliet sutherland.) so let me be perfectly clear on this matter once again. i _hate_ the dale-carnegie philosophy. with a passion. i consider it to be tremendously duplicitous, in that it zeros in on one of the most pathetic of human traits -- our insecurity about our worth -- and trades on it. it encourages one to feed people positive reinforcement, so as to make them feel good and overcome insecurity, so they will come to like you and be influenced by you... i'm not denying that it _works_. it works all too well! but it's cynical. and it's manipulative. and it's ugly. 
it tip-toes around the issue of flagrant dishonesty by informing its adherents to strive to be honest, and not to lie outright, but that's largely a cover-up which denies the fact one can't _always_ be positive, not if one feels any solid commitment to the truth, the whole truth, and nothing but the truth, as i do. so i will often go out of my way to do reverse-carnegie. dale says "never tell someone they are wrong". so when i am thoroughly convinced someone is wrong, when it's true, i say it, and with gusto: "you are wrong." and then i give all of the reasons _why_ they are wrong, which is also a reverse-carnegie, because dale says that you should always give people an out, a way to save face. those are just two quick examples. but that's enough, because we're not really here to talk about dale carnegie. the thing to remember is, i don't give a crap if anyone here becomes my "friend" or not. i have enough friends, and i don't even _want_ friends who i can't be honest with. and i'm not here attempting to "influence" anyone either. a lot of people get confused about this, because i'm often saying that things _should_be_done_ in a certain way, so people think i have some kind of personal interest about actually _having_ them done that way. i don't really care. you can do it however you want. because when i talk about "how something should be done", i'm talking about _logic_. i'm talking about the _arguments_ that dictate that decision. as shown here, frequently, a lot of people here don't seem to care about "logic" and "reasoned decisions" and stuff like that. which is fine by me. please make decisions however you want. the thing is, it really pisses off the carnegie adherents when you don't care whether you influence them or not, probably because they are willing to sell their soul to have influence, so your apathy (or hostility) about it contradicts their values. 
so when you fail to butter them up before you lobby them, like dale advises, they get all offended, and even _mean_... (that's right, they forget dale's advice to always be nice, which just goes to show they didn't absorb it very deeply; they only use it because it often works on a surface level.) so, yeah, my style "grates" on some people. so what? because a whole lot of other people -- who i actually like a lot better -- actually _appreciate_ and _respect_ someone who is willing to speak their mind honestly... > and his egotism seems boundless that's just a silly projection. i'm a humble person. i am honestly and truly humble. i'm unimposing, and i'm tremendously kind and gentle. and it's not just a phony act i put on to "win friends"... but there is something about truth. when you have truth on your side, you become strong. you become invincible. i work -- hard! -- to make sure i get to the bottom of a situation, and consider every angle, because it is _vitally_ important to me that i have truth on my side. if i'm on one side of an issue, and the strength of the argumentation suddenly flips truth to the _other_ side, i flip right along with it. because truth is important... yes, one of my biggest flaws is saying "i told you so." but one of my biggest assets is that i have absolutely no reluctance, at all, to say "i was wrong" when i was. a lot of people think i'm "egotistical" when i'm _really_ just extremely confident that i have truth on my side... so it actually has nothing to do with _me_, or my _ego_. instead, it has _everything_ to do with _truth_... > Though his style grates on many > and his egotism seems boundless, > at times there is something of worth in what he posts. of course there is. that's because i have truth on my side. it's also because i'm enough of a scientist that i'm willing -- nay, _eager_ -- to listen when someone says they think that i'm wrong. 
because if they're correct that i am wrong, i _want_ them to show me the light, so i can switch sides... but again, i don't really care if i "convince" anyone or not. it's an intellectual exercise for me, not a power struggle... > at times there is something of worth in what he posts. oh yeah, and the _other_ thing is that you can never trust a carnegie follower when they say anything nice about you, because they're probably just attempting to butter you up. so maybe roger doesn't even _believe_ what he said there. > He mistakenly reported that the SR version of the book > is being incrementally updated. well, the file that is now posted on your site, to which i gave the u.r.l., is _not_ the file that was posted yesterday. the pagination error i pointed out yesterday was corrected. so i'm not sure how you can use the term "mistakenly"... > He also shows he hasn't come to a complete understanding > of the unusual situation in the text regarding the > inconsistent usage of Wrangel and Wrangell spellings > as it applies to Barons, islands and native population. > It still isn't right and will be bimodally normalized > after smoothreading completes. i didn't really try to "come to a complete understanding". the p-book appears to me to be inconsistent in its usage, and you appear to be inconsistent too, and your usage does not achieve consistency with the p-book's usage... i pointed out the inconsistencies to show i'd found them. but there's no payoff for me to do any more work on that. > He did, however, correctly spot the effect of a superfluous > page transition marker after the last illustration > on a numbered page in the book. Since these books are > all generated from one source file, it was a simple fix > and it was regenerated in a heartbeat. ok. so the file that's up online was _not_ "updated", but it _was_ "regenerated". i'll try to remember this terminology. > He also believes that I may have post-processed > over 500 books, and I have not. 
well, i'd rather give rfrank _more_ credit than _less_... i know he's done _hundreds_and_hundreds_ of books. he's also programmed a lot of tools, and is now running the roundless experimental site, plus he's on the board at d.p., so it's clear that he's doing a lot, and i give him credit for it. > Though I could have a lot to say about his posts, I choose > not to engage him for historical and practical reasons. the "historical" reason might be that when he did engage me, he tried to deny reality, so i rubbed his nose in it, just like you rub a dog's nose in his pee when he urinates in your house... and the "practical" reason might be that he knows i will do that again if he tries to deny reality again, carnegie notwithstanding. but hey, i don't need for us to "engage". i'm self-motivated. i will say what i have to say, whether anyone listens or not... so he can say what he wants on his board, and i'll post here. -bowerbird

Re: save those pagenumber references
by Bowerbird＠aol.com 23 Mar '10

well, i guess i'll never know what an "uncompressed" filename would look like, or what a "telling filename" could possibly be...

my loss, apparently... :+)

-bowerbird

Re: save those pagenumber references
by Bowerbird＠aol.com 23 Mar '10

keith said: > I did get your point and they due have there merit, > yet no more than any other filenaming convention > where you overly compress the names. what do you mean by "overly compress the names"? what would a "noncompressed" filename look like? > I will not either go into how flawed they are. ...because you have no arguments of substance... > If you want telling filenames use them. again, what does this _mean_? > We are not living in a DOS world > where we are limited to 8 characters. the history of the u.r.l. in terms of its length, is rather interesting. everybody started with an ethic that they should be short and punchy. not just for convenience, but memorability too. gradually the u.r.l. began accumulating length, as websites got more extensive and files were segmented into subdirectories for convenience. then google started giving juice for content words in the u.r.l., and the length zoomed ridiculously, as everyone employed long names for s.e.o. purposes. things got so ludicrous that we had the emergence of u.r.l. "shorteners", web services that promised to end the scourge of a long u.r.l. by providing a much shorter one they maintained which rerouted people to the longer original, _plus_ furnished some stats, so you knew where the clicks were coming from, etc. what happened then was that twitter hit, and hit big. all of a sudden, people faced a 140-character limit. they didn't want to "waste" a substantial percentage of that limit every time they wanted to send a u.r.l., so the demand for shortener services skyrocketed... so before we could turn around, there were dozens such services, and not just 2 or 3 (bit.ly and tinyurl), and things got messy. first, the shortened u.r.l. 
is a pain in the ass for many people, because tweeters will often provide different shortened versions for the same long u.r.l., but your browser doesn't show them as already-visited links (since technically they _are_ different links, and your browser doesn't know that they all point to the same eventual destination). second, shortener services make the u.r.l. "brittle"... if the shortener service breaks down, so does their "rerouting" ability which points to the ultimate site, causing all those links to break for no good reason. as startups, with very little chance of "making it", the original shorteners had frequent down-time, so the problem was readily apparent, even then... but as more and more of these services started up -- hoping to hit the lottery by being "blessed" by twitter or google or anyone who would buy them for a boatload of money -- it was more and more clear that most of these services _would_fail_, and take all their short links with them when they did. and sure enough, then they did start closing down. and they continue to have cutbacks, to this very day. one of them -- http://tr.im/ -- just announced that it is no longer accepting u.r.l. shortening requests... luckily, they're still honoring their current redirects; but what happens when they go completely under? well, we're lucky once again, because google has come to the rescue. they have ensured that they will support a service designed to honor redirects for any shortener service that goes out of business. it makes sense, since they have a large degree of responsibility for this problem in the first place, since they give extra google juice to a long u.r.l. thankfully, though, the shortener services made us admit to ourselves that the long u.r.l. is a problem, bringing us to the current stage of u.r.l. history, where we are once again embracing the short u.r.l. 
many people are now voluntarily cutting back on the use of the long u.r.l.; google could help this effort by reversing its policy of giving juice to the long u.r.l.

because a short and clear u.r.l. is a better u.r.l.

because people _do_ have to occasionally type in a u.r.l., and can't just do a simple copy-and-paste.

because people often include u.r.l. in listserve posts, where there is an imposed length on the lines, and u.r.l. get printed in p-books, with limited line-length.

because people tweet u.r.l.

because people dislike the brittle shortened u.r.l.

so that's why i think my 5-letter prefix works just fine.

> Proog given that your naming convention is flawed
> and so now you can change it !!

huh? what? i guess you better run that by me again.

no, on second thought, never mind. this is a great example of why discussion here is one big waste.

i don't think you're _trying_ to sidetrack the dialog, keith, so i'm not going to scold you, but just tell you that you need to keep things moving _forward_, ok?

-bowerbird

Re: save those pagenumber references
by Bowerbird＠aol.com 23 Mar '10

keith said: > BB I can not believe you are serious. is that so? because i find your disbelief to be quite humorous! :+) > 1) Your critic fails all logic. it fails _all_ logic? i have a hard time believing that, keith... :+) > Why in Gods name would anybody intermix scans > from more than one book in the same directory. > Their are more than enough files just from one book ! i wrote that huge post, and _that's_ what you took from it? talk about missing the point. you missed it by a mile, keith. (a mile is about 1.6 kilometers, in case you are wondering.) for the record, not that i think anyone else missed the point, it might not be that you'd _want_ to put more than one book in a directory, it's that you _could_ if you ever _did_ want to, whereas, when all books are named p001-p999, you cannot. the more important point is that, given the files for a book, and for another book, you wanna be able to tell them apart. all files for a book should be named with a common element. and the name of every file should be unique from all others, across your entire system. this is nothing but common sense. > 2) How is a sequence of five arbitary characters anymore > informative. Or can you remeber 26^5 titles. the characters are not informative in and of themselves, but they become meaningful when all files from a book receive the same prefix, because then you see, just from the names, they go together. and no, there's no need to remember them, since the catalog will keep all of the information straight and make the appropriate information available to the end-users. but i suspect you knew all that. *** but in order to see how someone might do it another way, go look at the internet archive and their naming convention. they went for longer names, hoping for _some_ meaning... and, to a degree, they attained it, at a cost in convenience. 
for instance, here's a subdirectory name:
> http://www.archive.org/details/adventuresoftoms00twaiiala

that subdirectory maps onto another more-specific one:
> http://ia331317.us.archive.org/1/items/adventuresoftoms00twaiiala/

so their "name" for this book is "adventuresoftoms00twaiiala".

therefore, you might guess -- correctly -- that this book is "the adventures of tom sawyer". but it doesn't inform you _which_ edition of the book this is, or where it came from, or if it is one of the several copies from project gutenberg, or when it was published, or any number of details about it.

to get to that information, you'll have to visit their catalog, and if you're gonna visit a catalog anyway, you might as well visit the catalog to find out the 5-letter "prefix" of the book, a prefix that's much easier than "adventuresoftoms00twaiiala".

and you better believe me, because it has happened to me, once you get a lot of the archive.org files on your machine, it starts to become very hard to discriminate names such as:
> http://www.archive.org/details/adventuresoftoms00twaiiala
> http://www.archive.org/details/theadventuresoft00074gut
> http://www.archive.org/details/theadventuresoft07193gut
> http://www.archive.org/details/theadventuresoft07194gut
> http://www.archive.org/details/adventurestomsa02twaigoog
> http://www.archive.org/details/adventurestomsa00twaigoog
> http://www.archive.org/details/adventurestomsa00willgoog
> http://www.archive.org/details/adventurestomsa01twaigoog
> http://www.archive.org/details/adventurestomsa05twaigoog
> http://www.archive.org/details/tomsawyer00twain
> http://www.archive.org/details/adventuresoftoms20twai
> http://www.archive.org/details/adventuresoftoms99twai
> http://www.archive.org/details/adventuresoftoms00twai2
> http://www.archive.org/details/tomsawyeradv00twairich
> http://www.archive.org/details/advtomsawyer00twairich
> http://www.archive.org/details/booki-export-the-adventures-of-tom-sawyer

so, for me anyway, a 5-letter prefix
seems to do the job just fine.

***

likewise, we can look at the system used by project gutenberg, where the "prefix" for the book is essentially its 5-digit name.

digits are, in some ways, even more convenient than characters. the problem is, 5-digit names only work up to 99,999 books... that's enough for now, for project gutenberg, so that's fine, but i wanted more breathing room, so i chose 5-character names...

***

or let's take a look at youtube names. here are two sample u.r.l.:
> http://www.youtube.com/watch?v=sA_0cvd1EUM
> http://www.youtube.com/watch?v=qybUFnY7Y8w

first, i'm not sure why they need that "watch" in every u.r.l. surely "watching" a video would be the default action, no? so it seems to me they could have abstracted that out, but...

we find they're using an 11-character name, one that uses _both_ uppercase and lowercase letters (i only use lowercase), _and_ numbers, _and_ at least some other characters as well. that's going to give them _many_trillions_ of possible names, which i guess is how high you think if you sell for $1.6 billion.

***

speaking of google, let's see their book filename convention:
> http://www.google.com/books?id=3n4hAAAAMAAJ
> http://www.google.com/books?id=Y7sOAAAAIAAJ

they've got a 12-character name, uppercase and lowercase, plus numbers. which, again, will accommodate lots of files.

***

> Come On Man! Wake up.

well, it's after midnight my time, so i'm about to go to sleep; but i will wake up tomorrow morning, all ready to post again.

-bowerbird
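for anyone who wants to check the sizes of the name-spaces compared in this post, here is a minimal python sketch. the function name `name_space` is made up for illustration, and the 64-symbol alphabet used for the youtube-style name (26 uppercase + 26 lowercase + 10 digits + 2 other characters) is an assumption, since the post only says "at least some other characters".

```python
def name_space(alphabet_size: int, length: int) -> int:
    """Number of distinct fixed-length names over an alphabet of the given size."""
    return alphabet_size ** length

# 5-digit names, project-gutenberg style: 00000 through 99999
five_digits = name_space(10, 5)

# 5 lowercase letters, as in the "sitka" prefix
five_letters = name_space(26, 5)

# 11 characters over an ASSUMED 64-symbol alphabet (youtube-style)
eleven_mixed = name_space(64, 11)

print(f"5 digits:       {five_digits:,}")
print(f"5 letters:      {five_letters:,}")
print(f"11 mixed chars: {eleven_mixed:,}")
```

a 5-letter prefix gives roughly 119 times as many names as a 5-digit one, which is the "breathing room" mentioned above.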

Re: save those pagenumber references
by Bowerbird＠aol.com 23 Mar '10

al said: > Basic format: > > The prefix for the cover pages is: "c". > The prefix for the roman pages is: "f". > The prefix for the arabic pages is: "p". > > *** > > For blank pages there should be no file and > the page number should be skipped. > Optionally an image saying: > "This page is blank in the original." > may be inserted. > > *** > > Example of file naming: > > front cover c0001.png > back cover c0002.png > spine c0003.png > > i title page f0001.png > ii title verso f0002.png > iii dedication f0003.png > iv is blank > v contents f0005.png > > page 1 p0001.png > page 2 p0002.png > image on page 2 p0002-image1.png > image on page 2 p0002-image2.png > page 3 p0003.png > page 4 is blank > page 5 p0005.png > ... ... > page 9999 p9999.png dkretz said: > So far this "spec" seems to be primarily a legend. > Is it documented anywhere? al said: > No. It was developed and used by Joshua Hutchinson dkretz said: > Here's an extensive forum thread on DP > where we hashed this all out. oh lord. *** where do i begin? seriously, this is such a mess. where do i begin? *** well, to start with the last comment first, this wasn't "hashed out" at all. it was just messed up, because josh and marcello are too stubborn to take good advice from me. and on a more general level, this all shows that d.p. and p.g. can mess things up even when they actually try to do the right thing. *** so let's go back and examine the problems... *** we'll need to start with a short history lesson. years back, there was a push to get the scans hosted at p.g. with the text, and p.g. said ok. but when people started posting their scans, i noticed they had been named very stupidly. most stupid was that the filenames contained _numbers_ that were _not_ the _pagenumbers_. thus the file for page 123 might be "0128.png". this didn't surprise me, because d.p. has been naming their scans stupidly for many years... i'd tried to wise them up, but they didn't listen. 
but it's one thing to name _your_ files stupidly, since you're the only one who works with 'em, so you're the only one who pays the penalties of the big costs that stupid filenames impose.

it is quite _another_ thing to name files that you post in public using a stupid convention, because the _public_ works with those files...

luckily, the most insane position did not prevail. p.g. required that all scans must be named using the same number as the pagenumber.

for a while, anyway, some d.p. people would rename all the scan-files so they could then be posted to p.g.

yes, it's stupid to work with stupidly-named files, because you pay all the penalties of working with stupidly-named files, only to rename them to smarter names _after_ you're done working with them, but that's what d.p. was doing. for a little while. until it fizzled.

the good news is that most scans at p.g. are named with a number that's the pagenumber.

the bad news is that the renaming requirement essentially means not many scans get posted...

the ugly news is that the names are _still_not_ really intelligent. they're not _moronic_, but they're not very intelligent either, not at all... on an i.q. scale, they'd weigh in at about 87.

thus ends our history lesson to set context...

***

ok, what comes next?

first of all, let's remember the philosophy that should be a fundamental cornerstone of _any_ intelligent filenaming convention...

one important principle (the first?) which should be at work here is that every filename is _unique._

that is, _each_and_every_ file should have a name that identifies _that_file_ separate from all others.

now, there might be some cases where the same file might have different names in different places. (some would argue that; let's put that off for now.)

but an _iron-clad_rule,_ with _no_ exception, is different files must always have different names.

to say it another way, different files must _never_ have the same name. _never_, _never_, _never_.
so right at the _very_outset_, the dp/pg model has failed us...

all of their files are named with the same p0001.png-p9999.png convention and thus fail to meet the imperative to be _unique._

how can we tell one file named p0001.png from _every_other_ file named p0001.png? we cannot. and since every book has a p0001.png file, _bad_.

this isn't rocket-science. it's common sense. _different_files_should_have_different_names!_

we're back in the same old boat where we need to pay heed to the subdirectory name to know with certainty which book each file represents.

if the filenames were unique, we could place every one of our files in a single subdirectory, and we would have no filename clashes and we could identify each file as a unique entity, just from its name, without looking inside it.

i mean, it's great that we know that p0001.png is a scan of a page that was numbered as page 1 in the book in which it appeared, but the filename doesn't tell us _which_ book that was, so we are left out in the cold on the very first step we take. how sad... how utterly and thoroughly pathetic...

***

to make my filenames _unique_ to a particular book, i give each scan in a book a 5-letter unique prefix...

so, for the "sitka" book we've been analyzing lately, the 5-letter prefix for all the filenames is "sitka"...

in case you're wondering, a 5-letter prefix gives us 26**5 possibilities for unique ones, which computes to roughly 11.9 million possibilities. 11,881,376, to be exact, but some of those might be voided as unusable...

if you feel a need to be able to label more books, a 6-letter prefix gives 308,915,776. (308+ million.) a 7-letter prefix gives 8 billion. 8-letter, 208 billion.

let me know when you've got 208 billion documents. til then, an 8-letter prefix will work just fine, thanks.

indeed, i'm happy with a 5-letter prefix at the moment.

***

ok, so let's go on...

jim said:
> The prefix for the cover pages is: "c".
> The prefix for the roman pages is: "f".
> The prefix for the arabic pages is: "p".

the "c", "f", and "p" convention is one i created... thankfully, this model was adopted by dp/pg.

but there was a _reason_ i picked those letters, a good reason, and -- when it came to details -- dp/pg again screwed up with its implementation.

the "p" stands for "page", and that's obvious. and "c" for "cover" is the obvious choice too.

but some people suggested the front-matter should have an "r" prefix, for "roman numbers".

know why i rejected "r" in favor of "f", do you? think about it for a minute, and see if you know.

if you said i chose "f" to stand for "front-matter" or "forward-matter", you got an "f" on this quiz.

it's a nice mnemonic, sure, but the real reason why i chose "f" is a much more pragmatic one...

(know any other words that start with "mne" besides "mnemonic"? so what is its origin?)

so, did you think of the answer why i used "f"?

to explain why, think back to when i said that -- in coding your app and getting a "map" of the files within any specific book by reading the directory to see what files were there -- a vital component of that strategy will be that the filenames _sort_in_the_order_they_appear._

that is, we need to know not just the files that comprise the book, but their appearance order.

so i chose "f" for front-matter pages because those pages appear between "c" and "p" pages -- the cover and the arabic-numbered pages -- so the prefix needed to fall between "c" and "p". and "f" worked just fine.

you should also keep in mind that the letters "d" and "e" can be used between "c" and "f", if the idiosyncrasies of a certain book need it.

likewise, there are lots of letters that can be used between "f" and "p", if a book needs 'em.

and similarly, there are lots of letters _after_ "p" that can be used, for material that might come _after_ regular arabic-sequence "pages".

but yeah, that's why i chose "f" instead of "r"... it was so the filenames would _sort_ correctly.
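the sort-order argument above is easy to try out. this is a minimal python sketch, not the actual viewer code: the function names and the tiny six-file "book" are invented for illustration, and it assumes plain code-point sorting of names, which is what python's `sorted` does on strings.

```python
def book_files(front_prefix: str) -> list[str]:
    """A made-up, deliberately shuffled file list for a tiny 'sitka' book."""
    return [
        "sitkap001.png", "sitkap002.png",              # arabic-numbered pages
        f"sitka{front_prefix}001.png",                 # front-matter page i
        f"sitka{front_prefix}002.png",                 # front-matter page ii
        "sitkac001.png", "sitkac002.png",              # cover scans
    ]

def appearance_order(filenames: list[str]) -> list[str]:
    """What an app would get by reading the directory and sorting the names."""
    return sorted(filenames)

# with "f", front-matter lands between cover and pages, as in the p-book
print(appearance_order(book_files("f")))

# with "r", front-matter sorts *after* the arabic pages, which is wrong
print(appearance_order(book_files("r")))
```

with "f" the sorted list runs cover, front-matter, pages; with "r" the front-matter ends up at the back of the book.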
*** and speaking along these lines, it's just plain silly that dp/pg pads their pagenumbers to 4 places... the vast majority of books are under 1000 pages, so padding the pagenumber to 3 places works well. that fourth padding place just causes more work. in those rare cases where you have pagenumbers that run in 4 digits, one can summon the "r" prefix to signify those pages, so "r000.png" is page 1000, "r001.png" would be 1001, "r002.png" 1002, etc. (yes, you could use "q" too. but as a general rule, you will leave yourself more flexibility if you do not choose to use prefixes that are directly adjoining.) *** the insanity continues... al says this: > For blank pages there should be no file > and the page number should be skipped. that's just crazy talk. include a blank image-file and name it appropriately, so the world doesn't suspect that you screwed up and dropped a file. because that's _exactly_ what they will suspect... (and with good reason. skipped pages happen, a lot, as the world learned from google's work.) *** ...and it goes on and on... al said: > front cover c0001.png > back cover c0002.png > spine c0003.png um, no. bad idea. very bad idea. you know how i said that the sort-order of the filenames should be identical to their order of appearance, right? so hopefully you understand that the back-cover -- i.e., the last thing in the book -- should have a filename that sorts to the end. not position #2. that's assuming that you even need a back-cover. and the spine? i suppose if you _must_ have it, you will be determined to include it, but please give it a name that sorts it to the end, too, since for most people it will just be a cute little gesture. consider it as the mint as you leave the restaurant. you might also remember that i insisted the files must reflect the recto/verso aspects of the book. for every recto file and filename, there _must_ be a verso file and filename. 
once again, if you fail to maintain this nicety, the world will suspect that you have lost a file, or that you just do not understand one of the basic structural aspects of the p-book, specifically that every piece of paper has two sides. that's why you always include a blank-page file... ...and why, if you have a file named "c0003.png", you must also have "c0004.png". don't forget it. *** ...and on and on... al said: > page 1 p0001.png > page 2 p0002.png > image on page 2 p0002-image1.png > image on page 2 p0002-image2.png > page 3 p0003.png > page 4 is blank > page 5 p0005.png > ... ... > page 9999 p9999.png first off, you can tell this originated from me, because of the all-lower-case look of it, _but_ i've always padded my numbers to just 3 digits. i believe it was marcello who added that 4th one. (and, as i just explained above, it's unnecessary.) and gee. you know, like jim said, what i propose is really -- at the very heart of it -- a simple system... so it's honestly quite _amazing_ that dp/pg could screw it up in so many different ways. _amazing_. look at the lines there pointing to "image on page 2". either marcello or josh must have added those too. this is something of a nightmare happening here. up to now, the files we've been talking about are _page-scans_. that is, they represent a full page. we all know why that's the case; it's because we are doing _proofing_, so we need the page-scan. now all of a sudden something different pops in, namely "images" contained on the same page as the page-scan (which, of course, is also an image). ok, i won't pretend i don't know what these are. they're higher-resolution versions of _pictures_ that were contained on that page in the p-book. which is all well and good, but let's not mix them in with the page-scans, which is what happens if you name the hi-res files using the same model. give those files names that are _quite_different_, and which sort them completely out of our range. 
it'd be good if you even stored them in a separate directory. (luckily, this is exactly what p.g. does, storing them in a subdirectory of the .html file, as these "subimages" are used by .html versions; but we certainly don't need 'em to do proofing.) better yet, examine if you need those files at all. if a particular page had a picture on it that needs to be scanned at a higher-resolution, then make the actual page-scan at that higher-resolution... there's no sense having a low-res version of it, especially if it's just going to cause us problems. then, in your e-book file, give instructions for the viewer-program about the coordinates of the scan that represent the picture that you want it to "clip". the viewer-app will then load in the high-res scan, clip out the picture, and then display it accordingly. (ok, this is a little futuristic, since no viewer-apps will do this currently, not even mine. but soon...) *** al said: > page 2 p0002.png > image on page 2 p0002-image1.png > image on page 2 p0002-image2.png one more thing about this. even though, as i mentioned above, these "subimage" filenames have no ill effects, as they're stored elsewhere, there is yet another problem presented here, one which _does_ manifest in the posted scans. you might get the idea, from that list there, that dashes are an ok thing in your filenames. the problem comes with unnumbered pages. let's say we have an unnumbered illustration facing page 36 in our "sitka" book, as we do. so our names would run like this: > sitkap035.png > sitkap036.png > sitkap036a.png > sitkap036b.png > sitkap037.png at least that's how _i_ do it... 
but if you looked at the policy as al wrote it, you might well conclude the names should be:

> sitkap035.png
> sitkap036.png
> sitkap036-a.png
> sitkap036-b.png
> sitkap037.png

or maybe you'd even think they could be:

> sitkap035.png
> sitkap036.png
> sitkap036-1.png
> sitkap036-2.png
> sitkap037.png

either way, the problem becomes clear if you once again recall that we want the filenames to _sort_ correctly... al's names will sort this way:

> sitkap035.png
> sitkap036-a.png
> sitkap036-b.png
> sitkap036.png
> sitkap037.png

this would cause the viewer-program to believe that it should place that unnumbered illustration between pages 35 and 36 -- a recto and a verso! this illustration either goes between 34 and 35, or it goes between 36 and 37, but that is unclear, and computer programs need things to be clear.

***

if you are now asking "why do we need to be concerned with how computer programs will interpret these files?", then you're making the same mistake that the dp/pg people have made. you are failing to grasp the _larger_context_ in which these files will be used. and it is this larger context that is necessary to help us hone the conventions that we adopt in making e-texts.

the pagenumber f.a.q. failed to consider the necessary linkage with the names of the scans, and the scanfile-naming rules failed to consider how those scans would be used by developers.

this inability -- and unwillingness sometimes -- to see the big picture is why dp/pg isn't creating coherent policies on such matters, even when it actually _tries_ to do so (which is relatively rare). so their implementations will be short-sighted.

when you add in the stubborn way that people like al and juliet and marcello and josh _refuse_ to take any advice from me, no matter how good, the situation can look bleak. however, i remain focused on the long-term, where i am confident -- supremely confident -- that my ideas will win.
and in the short-term, i just remind myself, on the infrequent occasions when the question will present itself to me, that i am not the stupid one. -bowerbird
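
The sorting behavior described in the message above can be checked with a few lines of Python. This is only a minimal sketch: it assumes a plain byte-order sort, which is what Python's `sorted` does for ASCII strings, and the helper name `sorts_in_book_order` is made up for illustration. A particular viewer-program or file manager may sort differently.

```python
# Scan names using a bare letter suffix for the unnumbered pages.
letter_suffix = [
    "sitkap035.png",
    "sitkap036.png",
    "sitkap036a.png",
    "sitkap036b.png",
    "sitkap037.png",
]

# The same pages named with a dash before the suffix.
dash_suffix = [
    "sitkap035.png",
    "sitkap036.png",
    "sitkap036-a.png",
    "sitkap036-b.png",
    "sitkap037.png",
]


def sorts_in_book_order(names):
    """Return True if a plain string sort leaves the names in the given order."""
    return sorted(names) == list(names)


# "-" (0x2D) sorts before "." (0x2E), so the dashed names jump ahead of
# "sitkap036.png"; a letter such as "a" (0x61) sorts after ".", so the
# bare-letter names stay behind it.
print(sorts_in_book_order(letter_suffix))
print(sorts_in_book_order(dash_suffix))
print(sorted(dash_suffix))
```

The difference comes down to one character comparison: the byte right after `sitkap036` is `.` in the page-scan name, and whatever replaces it in the suffixed name decides which file sorts first.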

more sitka smoothreading glitches
by Bowerbird＠aol.com 23 Mar '10

i see rfrank is making improvements to his smoothreading version of the "sitka" book. i wasn't sure if he would do that as they came in, or if he would simply wait and do all the fixes one time at the end... > http://www.fadedpage.com/s/sitka/sitka.htm an incremental approach is just fine, but it means that no one has yet reported this next glitch, which is a rather amazing one, since it has survived the system through preprocessing and proofing and postprocessing, although it doesn't even pass a simple spellcheck: > Some looked with extreme disfavor upon the establishment, > while others wrere friendly. it's also unclear whether anyone has reported the inconsistencies in the spelling of the baron's name -- is it wrangel or wrangell? -- but perhaps rfrank decided to leave 'em as they are in the p-book. of course, if _that_ were the case, he wouldn't have changed the two cases of the baron's name on page 43, since they are clearly printed as "wrangel". but also there, two alaskan places which -- as the book directly states there -- "today perpetuate his name" are clearly printed as "wrangell", which is the cause for confusion, compounded by the fact that the name is spelled as "wrangell" on pages 54, 61, 63 (twice), and 102, but as "wrangel" on page 75... aside from the inconsistent-with-the-printed-page instances on page 43, rfrank was also inconsistent with the ink-on-paper on page 63 (the second instance), where he was not just inconsistent with the printed book, but with his own version on the same page. (in other words, the page was consistent itself, but rfrank was not.) *** all of this is not to criticize rfrank. indeed, i will tell you that he is an excellent postprocessor. he has a ton of experience; he's probably submitted over 500 books to p.g. by this time... 
what this _does_ show is that even an excellent postprocessor, with a ton of experience, can have errors that persist through preprocessing, proofing, and postprocessing, and maybe even through smoothreading. (at least this far, these glitches have.) so i think this is good evidence that "once and done" is _not_ a good strategy for a roundless system. that philosophy has _never_ been a part of the roundless system that _i_ preach... indeed, i believe any change should be reviewed and approved by two separate people before it is considered to be "golden"... it's also important to remind ourselves that we are not "short" of proofers. to the contrary, we have a huge _glut_ of proofers. distributed proofreaders has so many proofers that they are now actively considering ways to _throttle_ their p1 proofers! with an _abundance_ of proofers, there is no need to scrimp... we can have multiple proofers look at every page in every book. -bowerbird

Re: save those pagenumber references
by Bowerbird＠aol.com 22 Mar '10

al said:

> What bowerbird failed (or didn't bother)
> to mention was that
> using curly braces for page numbers
> and square brackets for footnotes
> are practices that are documented in
> PG's Volunteers' FAQ (V.98, V.99, V.103).
> As such, my "personal practice" is not
> an invention of my own, but are PG-standard,
> documented, practices that I've adopted for my projects.

discussions here are often so pointless it's not worth bothering. and yet i persist. sometimes i think _i_ must be the stupid one. but then, no, i realize, no, it's not me that's the stupid one at all.

***

so... what al failed (or didn't bother) to mention is that whether or not any contributor _follows_ the f.a.q. is entirely a personal matter up to them... and that's why the f.a.q. don't really matter much... not unless the whitewashers enforce a particular aspect. and this one has not been enforced. so the vast majority of the .txt files have no pagenumbers.

(well, actually, it _is_ enforced. because v.98 actually instructs producers that they should _not_ keep pagenumbers, except in "exceptional" cases. al tried to slip a fast one by us there, eh?)

so p.g. has failed to create a convention about how it is done, even inside of its own cyberlibrary, let alone _outside_ of itself.

and let me tell you that i respect michael hart's _principled_ decision not to enforce a standard much more than i respect a naive belief that -- just because it's in the f.a.q. -- you have established a convention. i don't respect that naivety at all...

on the other hand, michael's unwillingness to take a stand _has_ meant that the producers have overruled the f.a.q. d.p. postprocessors have taken to including pagenumber info in their .html versions over the course of the last few years... many now include the pagenumbers as a matter of _routine._ that's the good news.

the bad news is that the laissez faire attitude is paramount in d.p. postprocessors. they do things however they want.
and they change how they do things whenever they want to. so, over the course of those last few years, they've treated pagenumber info in countless ways, with zero consistency. so it will be difficult or impossible to construct a "standard" from the d.p. practices, especially since the information is buried in the source .html, and not evident on the surface. it's also the case that there continue to be major problems with _all_ of their implementations, for reasons that might well be unavoidable, such as browsers that do not support the kind of functionalities that might be necessary to walk that tightrope i talked about between "pro" and "anti" forces. but, for people who like to view the glass as being 2/10 full instead of 8/10 empty, please enjoy the fact that the people who finish off the e-texts at d.p. now value pagenumbers... yet al still remains clueless... and his cluelessness moves up to a higher level as well. because remember that the _reason_ we want a convention is so that the developers of viewer-programs will support it, by programming the necessary capabilities into their apps... does anyone know any app developers who have done that? i mean, besides _me_ with _my_ apps? yeah, i thought not... the convention, even if obtained, is just a means to an end. and the pointlessness continues... one of the useful aspects of pagenumbers, as don points out, is they allow us to refer back to the page-scans of the book... but the f.a.q. betrays no knowledge of this beneficial purpose, and thus fails to enlighten the e-text producers of this linkage. if it _was_ based on this broad goal, the f.a.q. would also show awareness that pagenumbers per se are but a small part of the overall needs, along with things like _the_original_linebreaks_ and _the_original_end-line-hyphenates_. without those other vital aspects, it's similar to "baking a cake" with sugar as your only ingredient; the thing you get out at the end won't be cake. 
i talk more about this in the reply i drafted over the weekend, which i still intend to send today, so i won't belabor it now... -bowerbird

Re: save those pagenumber references
by Bowerbird＠aol.com 22 Mar '10

jim said: > How do you propose to deal with texts that have a large number > of “prefix” pages numbered something like “iii” for example? > > How do you propose to deal with texts that have a large number > of “prefix” pages which are not numbered at all? > > How do you propose to deal with texts where the numbering > scheme was screwed up in the original text? > > How do you propose to deal with texts which do not count > illustration pages in their numbering scheme? > > Etc. > > Again, it’s great to have a simple system that works except > when it doesn’t work in which case it’s not so simple anymore. gee, jim. i just talked about how people get bamboozled by small issues, which can be hurdled quite easily if you just set your mind to it... and here you make a reply with a whole handful of small issues. not even "small", really... more like _tiny_... even _teeny-tiny_... indeed, if you really look at the example i discussed, you'll see that several of your questions were answered there _already_... so i'm not even going to go through the exercise of answering. if you really want answers, you can generate them yourself, or go back and look where i have been discussing this issue for _many_years_, and review any one of those exhausting threads. there _is_ such a thing as a stupid question. i've asked them myself, as have all of us. and jim, you just asked a _handful._ but you know, jim, the thing i'm wondering is this... i've held this position on intelligent filenaming conventions for _years_ now. and that's just counting on _this_listserve._ i've been practicing what i preach for about two decades now. if there was really some problem with my system, don't you think i would have discovered it by now? do you really think that you can come up with a reaction in your first 5 minutes that i haven't experienced in the years and years and years i've been doing intensely close analysis of book digitization? i mean, _seriously_... 
did you really think i just happened to "overlook" that books generally have forward-matter pages, and that those pages have a different pagenumber sequence? and do you really think i just hadn't ever noticed that some of the illustration-pages in books are unnumbered pages? really? so let me say this _again_, jim... if you want to have dialog with me, you _cannot_ say stupid things. you simply cannot. because i won't continue to talk with you if you do. capiche? -bowerbird

Re: jim, i have some questions about pgdiff output
by James Adcock 22 Mar '10

> jim, here are 55 cases where your tool seems to give us more than just the 2 choices that i would expect to see...

See my discussion below of what the Levenshtein Distance is and how pgdiff implements it.

> in some cases, such as the second one listed (hot springs), it's because one of the proofer's notes contained a "|" in it. you'll want to screen the input for your significant characters, i.e., any "{" or "}" or "|", and eliminate them to avoid confusion.

Agreed that this would be a problem if my tool is used as input to another "smart editor" tool that wants to present "Choose A" vs. "Choose B" type choices. Instead, the tool was targeting a regex editor driven by a real human being, who can recognize from context whether the "{|}" chars are being used to highlight differences or are part of the input text, so it hasn't been a problem for me in the intended problem domain.

======

Levenshtein Distance is the measure of the number of changes needed to transform one string of tokens into a different string of tokens, where the allowable edits are "insert", "delete" or "substitute." Different implementations of the algorithm have different interpretations of what constitutes a "token" and what constitutes a "string".

One obvious interpretation is that a "token" is an ascii char and a "string" is a line of text (dictionary lookups of misspelled words).

Another obvious interpretation is that a "token" is a line of text and the "string" is the list of lines of text within a file (diff).

pgdiff implements neither of these. Rather, it takes a "token" to be a "word", where a "word" is a non-white sequence of chars followed by a white sequence of chars, and where the white sequence of chars is considered not-significant for the purposes of the Levenshtein Distance, but IS significant for the display of output. pgdiff considers the "string" to be the entire list of words in the input file.
The typical importance of the white part is whether words are separated by a space or by a linebreak. Pgdiff doesn't care about the white part in terms of the Levenshtein Distance, so that the two input files can have different line lengths and different linebreak locations, and still be comparable. This also means that typically including page break information in the input files such as the "====== filename.101 ====" type stuff would NOT be a good idea, since typically the input files may have their page breaks in different locations re their word content -- unless the two input files are from the same identical edition. So here's some answers to some implied questions or assumptions: Does pgdiff look for word differences within a line of text? No. Does pgdiff look for single word changes? No. OK, what does pgdiff do? What pgdiff does is to calculate a best match of words across two entire files. Assuming you set the input options large enough, for example, one input file could contain an entire chapter that the other input file doesn't contain and the algorithm would sync up just fine. Or in the case of a book I've worked on previously the US version had paragraphs removed by a censor, whereas the European version of the text had them intact. When the words do not match exactly, the mismatches are categorized three ways 1) Insert this missing word. 2) Delete this extraneous word. Or 3) Substitute this one word for a different word. Now by reversing the input order options 1) and 2) obviously become symmetrical -- an insertion in one case becomes a deletion in the other case. So in either case an isolated word difference is displayed like { this } or if a bunch of words in a row are delete or insert like { this is in one text but not the other } In case 3) if only one word is different in a row it displays the output choice like { this | that } But in case three if a bunch of words are different in a row how to display them? 
If the differences are due to scannos it is probably best to display the words next to each other

{ this | th*s is | iz a | u test | tost }

whereas if the differences are due to human editing it would probably be best to display them as "sentences"

{ THIS IS A TEST | _this is a test_ }

If you are implementing a "smart editor" then clearly you can choose to display them whichever way you want. In practice what one normally sees is some weird mixture of the two possible situations, and it isn't clear to me which display technique is best, so for now I have chosen the easiest approach to implement -- which is the first pattern of display

{ this | th*s is | iz a | u test | tost }

From the BBoutput.txt file, for example, consider:

{ Seattle, | Seattle, Washington | Washington
}

Which is of the first pattern. The ending } is on a newline since the two tokens differ in whitespace, space vs. linebreak. Taking that diff back out one gets:

{ Seattle, | Seattle, Washington | Washington }

Which one can read as:

Choose one of: Seattle, OR Seattle,
Followed by:
Choose one of: Washington OR Washington

In this case if one KNEW the differences are due to humans rather than scannos, then it is "obvious" that the better display pattern would be the second one:

{ Seattle, Washington | Seattle, Washington }

IE Choose one of: "Seattle, Washington" OR "Seattle, Washington"

But in general the tool doesn't know if differences are due to human edits or scannos, and in general what one sees is a mixture of both problems happening at the same time.

PS: OK pgdiff doesn't REALLY match across ENTIRE files since if the files are huge Levenshtein is an n^2 algorithm in space and time.
What it does do is break a file into large overlapping chunks of text and calculate the measure across the chunks, where the size of the chunks can be specified as an input parm if you prefer, and where the chunks get sewn back together using an invariant of choosing places in the match where words DO match, and checking the sanity of that match to make sure we haven't lost sync.

What this means in practice is that if you specify a parm of -10000 as an input setting then the algorithm can "ONLY" handle about 10000 word mismatches adjacent to each other in a row without erroring out. This parm in practice is important for versioning, where two editions of a book have large chunks of text which don't match each other, i.e. a chapter is edited out or edited in, or a censor has taken their knife to the text.

Common problems are that two texts from different editions have entire book prefixes (introductions) or entire book suffixes (postscripts or indexes) which don't match. It is better to remove these explicitly and deal with them separately, but the algorithm will try to handle them if you set the input parm large enough.
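
The word-level matching described above can be illustrated with a short Python sketch. This is not pgdiff's code: the function name `word_diff` is hypothetical, the whole comparison is done in one pass with no chunking, and each difference gets its own braces, which is simpler than the run display shown in the message. It only assumes what the message states, namely that tokens are whitespace-separated words and that the whitespace itself does not count toward the distance.

```python
def word_diff(a_text, b_text):
    """Word-level Levenshtein distance plus a brace-marked rendering.

    Returns (distance, marked_text). Whitespace is only used to split
    the input into words, so line breaks do not count as differences.
    """
    a, b = a_text.split(), b_text.split()
    n, m = len(a), len(b)

    # dist[i][j] = fewest word edits needed to turn a[i:] into b[j:].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n:
                dist[i][j] = m - j          # insert the rest of b
            elif j == m:
                dist[i][j] = n - i          # delete the rest of a
            elif a[i] == b[j]:
                dist[i][j] = dist[i + 1][j + 1]
            else:
                dist[i][j] = 1 + min(
                    dist[i + 1][j + 1],     # substitute
                    dist[i + 1][j],         # delete a[i]
                    dist[i][j + 1],         # insert b[j]
                )

    # Walk the table from the start, emitting matches and differences.
    out = []
    i = j = 0
    while i < n or j < m:
        if i < n and j < m and a[i] == b[j]:
            out.append(a[i])
            i, j = i + 1, j + 1
        elif i < n and j < m and dist[i][j] == 1 + dist[i + 1][j + 1]:
            out.append("{ %s | %s }" % (a[i], b[j]))
            i, j = i + 1, j + 1
        elif i < n and dist[i][j] == 1 + dist[i + 1][j]:
            out.append("{ %s }" % a[i])
            i += 1
        else:
            out.append("{ %s }" % b[j])
            j += 1
    return dist[0][0], " ".join(out)


print(word_diff("this is a test", "th*s iz u tost"))
print(word_diff("Seattle,\nWashington", "Seattle, Washington"))
```

The table has one cell per pair of word positions, which is the n^2 cost in space and time mentioned in the message, and is the reason a real tool would compare large files in chunks rather than in one table.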
