pontifications from mount high horse -- #1497

another jumble of responses...

***

roger said:
Uh, no. I didn't put the project up for you to scrape and try to draw conclusions.
except that "drawing conclusions" by analyzing data after experimentation is simply "what i do", roger...
I don't have those numbers and I'm not going to try to derive them from the data. ... Right now, I don't care which way is better, technically. I believe the choice will be made on which approach is most comfortable to the user. I've heard from some people that solo process that actually like to go through the book page at a time because they enjoy following the story as they go, which doesn't happen when someone is in production mode at the book level. ... You chose to scrape it and open up the discussion.
yes i did choose to do that. because that's what i do -- analyze data, draw conclusions, and open up discussion.

but i realize that you didn't "ask" for this discussion, and you do not care to discuss some things i talk about, and if you don't want to continue with it, that's fine with me... i appreciate what i got from you this round, like always, but if you'd rather not be engaged, then i can accept that.

i think you could write better software if you did do tests, with well-designed research, intended to get clear data, from which you could "draw conclusions" that will make you smarter about the underlying dynamics at work, but i'm past the point of caring very much about your stuff... if you wanted to talk to me about these things, you had chances in 2007 or so. you didn't want to _then_, either.

as you can tell from my "pontifications", my willingness to persist in this quest for meaningful dialog is waning. so i'm just getting some things out of my system, so that when 2012 rolls in, i can start anew, with a clean slate... i feel a strong desire to stop treading all this old ground, and start moving the ball forward; the world is now ready.

if i have a lingering need to say stuff in the coming year on any of the particular topics instantiated here recently, i'll address messages to a fictionalized amalgamation of don, roger, alex, etc. -- a character dubbed "dodger-x".

and yes, i also realize my "tone" is often "disrespectful". it didn't start out that way, but the treatment i received from some people here often turned it in that direction.

i don't feel that people automatically "deserve" respect. respect is something you have to _earn_, with _quality_. my commitment toward honesty, integrity, and truth is far too strong to dole out _respect_ to shoddy bullshit...

you _can_ automatically expect _courtesy_ from me, yes.
although some people forfeited even that, when they seemed to think it was ok to treat _me_ discourteously, just because i didn't pander to their need for "respect".

and you certainly won't get "respect" by bullying me... i learned a long time ago not to let myself be bullied -- that you had to give bullies back their own "medicine", to expose their cowardice, or they would chew you up...

but it's also the case that i simply _have_ no "respect" for the kind of willful ignorance you expressed above. i mean, it's _smart_ to "not care" about things that are irrelevant to what you do. but relevant considerations? well, it's _stupid_ to "not care" about _those_ things...

and no, roger, don't get all upset, because i am _not_ "calling you stupid". precisely because two days later you were coming back to the list saying, "hey, i noticed that my accuracy at finding the italics really sucked..."

so here you are, dragging in data that made you wiser about a flaw in your workflow, so you really _did_ "care". and that's the kind of attention to detail that i _respect_. that's the kind of _wisdom_ which i _value_.

it's too bad that it's far too little, and far too late, to matter much... on the bright side, even though you're years behind me, you're _light-years_ ahead of almost everyone else here. so i give you the strongest encouragement possible.

but, like i said, you shoulda talked to me back in 2007. because starting in 2012, i'll move on to better things.

***

but since we're still here in 2011, let me clear my plate...

***

everyone agrees that some parts of the digitization workflow need to be done on a book-wide basis, by a single individual. they just _fail_to_recognize_ those parts for the discussion...

this typically includes steps at the beginning of the workflow -- e.g., obtaining the scans, photoediting them if necessary, getting the o.c.r. in order and the other preprocessing steps, including the naming of the files, and discarding the garbage.
nobody would give much consideration to the argument that these steps should be done "page by page", or collaboratively. sometimes you need just one person in charge of operations.

and typically, it also includes steps at the end of the workflow -- final checks to ensure the products are reasonably correct, mounting the final products, accepting any error-reports, etc. again, it would be silly not to have one person handle a book.

it is the middle where differences of opinion emerge on:
1. whether it's done by one person, or multiple people.
2. whether it's on a book-wide basis, or page-by-page.

those are the main divisions, but occasionally along the way some people have introduced even more confusions, such as the people who insisted -- when i started bringing this up -- that any book-level changes were those done "automatically", and thus "blindly", so they would then argue against _that_...

fortunately, we haven't heard that kind of blather here lately. good thing, too. talk about a straw-man... geez!

as another example, this one which _is_ extremely recent, roger said some things that gave me the strong impression that he defined the "book-level" versus the "page-level" by _how_the_text_was_stored_. using a concatenated-text-file meant that he was doing stuff at the level of the "book", but once he split the text for each page into its own file, then he was operating at the "page" level.

now, i refuse to believe that he thought something as arbitrary as his _file-storage_ determined the level. but that's how he seemed to be talking!

for the record, just to try to make lemonade out of this lemon, it's trivial to split the text of the book, slicing it various ways. you can split it into pages, and many times that slice is useful. you can split it into sections (e.g., chapter), also useful often... sometimes you'll need to split it into paragraphs, like we just discussed in the topic of checking for balanced doublequotes.
and of course, for spellcheck, you need to split into _words_. for curly-quotes and em-dashes, you go down to characters. i rarely find a need for it, but you can slice by _sentence_ too.

and then, last but not least, i often slice the file into _lines_... there's no question that lines are the best way to show diffs of the type that's most likely to occur when correcting o.c.r.

so, when you're writing code to handle the digitization tasks, you're constantly splitting, joining, resplitting, and rejoining, depending on what you are trying to accomplish at the time...

so, to a programmer, talking about the way the data is stored as your definition of the level of analysis is... bass-ackwards.

oh, and just to close this sub-topic, i store text as _books_. it's too inconvenient to store the text in so many "page" files. "betty lee" is under 300k, so that is a very manageable size... (a couple of its _individual_ page-images are that same size!)

at any rate, my point is that, on this overall thread involving how we think about an approach-level to digitization, there's a lot of confusion out there about how to conceive the topic.

and, needless to say, it's impossible to argue against _every_ type of misconstrual out there, so let's ignore the wacky ones, and focus our concentration on the top two contenders now. (because, as we'll see, even the top two end up meaningless.)

as per convention over at d.p., we'll call these middle parts "proofing" or "formatting". the difference of opinion is over:
1. whether it's done by one person, or multiple people.
2. whether it's on a book-wide basis, or page-by-page.

now, i fully recognize that d.p. has warped a lot of brains into thinking these middle parts are purely reductionistic. which is bullshit. we're not building automobiles. we don't need a factory, filled with employees, milling around the assembly line...

but let's go back in time, so we can give d.p. its due credit.
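
(a quick aside before the history lesson: the splitting and rejoining described above can be sketched in a few lines of python. this is only a minimal illustration, not anybody's actual tool. it assumes the book is held as one string, that a form-feed character marks each page boundary, and that paragraphs are separated by blank lines; the names `PAGE_BREAK`, `pages`, `paragraphs`, `lines`, `words`, and `unbalanced_quote_paragraphs` are all made up for the example.)

```python
import re

# assumption: a form feed marks each page boundary inside the single book file
PAGE_BREAK = "\f"


def pages(text):
    """slice the book into pages at the assumed page-break marker."""
    return text.split(PAGE_BREAK)


def paragraphs(text):
    """slice the book into paragraphs (runs of lines separated by blank lines).

    page markers are dropped first, so a paragraph that runs across
    a page boundary stays in one piece.
    """
    flat = text.replace(PAGE_BREAK, "")
    return [p for p in re.split(r"\n\s*\n", flat) if p.strip()]


def lines(text):
    """slice the book into lines, the handiest unit for showing diffs."""
    return text.replace(PAGE_BREAK, "").split("\n")


def words(text):
    """slice the text into words (letters, with internal apostrophes kept)."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)


def unbalanced_quote_paragraphs(text):
    """return paragraphs holding an odd number of straight doublequotes.

    these are only candidates for a human to look at: a quotation that
    legitimately continues into the next paragraph is flagged as well.
    """
    return [p for p in paragraphs(text) if p.count('"') % 2 == 1]
```

the same stored string gets sliced by page, paragraph, line, or word depending on the check being run, so the storage format says nothing about the "level" of the work.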
before charlz franks dreamed his brilliant little brainstorm, book digitization was a solitary and lonely path to tread on. michael hart had done the job of creating a force of volunteers, but by and large those individuals did a book _by_themselves_. they might proof each other's work, after a book was "finished", but they didn't split a book to do it -- not even split it in half!

so what charlz concocted was truly astoundingly revolutionary! it was one of the first projects which leveraged the connectivity -- the _social_ connectivity -- of the internet, believe it or not. and that alone marked it as a nifty piece of futurism.

but the idea that you could digitize a book _page-by-page_ was just as challenging to the "accepted wisdom" of the day. indeed, going in, everybody wondered if it would even work! not if it would work "well"... whether it would work _at_all!_ most people actually _doubted_ it would get off the ground.

hey, charlz himself wasn't sure. "it was just an experiment." so he was overjoyed -- along with the other people around -- when the first book was done, then the second, and a third... only after a few months (or maybe even more than "a few") did he actually come to believe that it was going to succeed. he told me that face-to-face in 2003, and even at that time, he appeared to be shocked and amused that it really worked.

and maybe we shouldn't have been surprised. after all, the main task is correcting the scannos, right?, and how many ways _are_ there to correct a misspelling?

but there _were_ differences of opinion on if it'd work, and perhaps it'll be instructive to consider the reasons.

many people believed that digitizing a book required a consistency of purpose, a unity of vision, which would be difficult (or impossible) to achieve in a collaboration. there was a belief too many cooks would spoil the soup.
and back then, it took a month, or 2 or 3, to do a book, and there was much doubt that a group would cohere for such an extended period, on a fairly difficult task. so there was also doubt cooks would even stick around long enough to ruin the soup! it'd be a revolving door, nothing more than a never-ending nightmare of churn.

and, even with the hindsight of retrospect, the concerns weren't that far-fetched. if everyone went about the job doing things "their own way", it woulda ruined the soup. and there have been enough other "distributed" failures -- including several book-digitization ones -- that the doubt that critical mass would form has been validated; d.p. ain't the rule, it's the exception that proves the rule.

thus it ended up that that very same social connectivity "solved" those problems, by creating a _community_...

the community did form a critical mass, which ensured that there would be plenty of cooks, for a long time, even if some of them would burn out of the kitchen...

and the community formed a social core that enforced some consistency on how its members performed, so a "unity of vision" was achieved via that peer pressure.

not that it was easy. nope, despite the obvious response to "how many ways can there possibly be to correct a simple misspelling?", there arose a plethora of other points of disagreement. a _lot_ of them.
disagreements about ellipses, and italics markup, and blank lines at the top of a page, and end-line hyphens, and how to solve the problems of persistent log-jams, and how many rounds to have, and how to skip rounds, and how to deal with the asshole known as bowerbird, and whether postprocessing can be distributed or not, and how come some people keep removing my notes?, and how such-and-such a rule should be "interpreted", and how to install guiguts, and what spellchecker to use, and eighty-seven possible options on that spellchecker, and how to handle greek, and whether to use utf-8, and how you hurt my feelings, now my feelings are hurt, and what are you gonna do about my hurt feelings, and why did you hurt my feelings, and now i feel so _hurt,_ and why are you such a mean person anyway?, meanie!, and how come my book is stuck in the p2 queue again, and what _do_ you do about a problem like maria, anyway?

d.p. people are like a family, and that's how they fight, just like a family, over everything, for years and years... and then they try to kiss and make up, and tell everyone how much they love them, and how much they love d.p., how important it is to them, and the world at large, and how much they love each other, and... back to fighting...

so no, it wasn't easy, not really. love and pain. family. and now a hardening of the arteries that makes it very difficult to improve because of a resistance to change. social connectivity ain't always all it's cracked up to be.

***

now... where was i?

oh yeah, so now, the pendulum swung the other way, because we found out that d.p. actually _would_ work, and now people think it's the _only_ way that can work. which is stupid, of course.

i mean, seriously, kiddies, we knew how to spellcheck before distributed proofreaders came along... really... and it's _possible_ to spellcheck a whole book. really. it's not _required_ that you have a hundred people doing two pages apiece. seriously, kids, it's not...
but that's what they know, and some people believe that what they know is the only thing that can possibly exist. and they refuse to peek out of their shell, because they don't wanna have that stupid belief become invalidated.

so now we have way too many people who "doubt" that a book-level approach to digitization will actually work.

but, as i pointed out, up at the top, there are many parts of the digitization process that are -- even now -- done by one individual working alone, on a book-wide basis...

even roger, with his focus on the page-by-page method, is -- quite ironically -- doing much of his books alone... he chooses books, he scans them, he mounts the scans, sometimes he has people help him with the _proofing_ and the _formatting_, and he has a couple dependable smoothreaders too, but he does all the postprocessing by himself, and he submits the products to d.p. himself. heck, folks, he even writes his own software for the job.

and once he learns how easy it is to do the whole book, with the right tool, watch how he will go off on his own. he's done about 600 books with the help of his friends. i won't be surprised if he does another 600 by himself.

by the way, i got christmas greetings from nick hodson, a 77-year-old from england who's digitized 750 books. all by himself... using software tools he wrote himself...

and for many years now, nick has used _text-to-voice_ as one of his best proofing tools, and he swears by it... i've played around with it myself, and yes, _it_works_. it is also convenient as can be, with iphone and ipad...

i don't mention it, since you guys don't deserve it, but mark my words, that's how it'll be done in the future. with siri and mobile, you're gonna see that a move to voice-oriented systems will become very accelerated.

in the meantime, you will still be hung up on things like "book-level" versus "page-level", which mean _nothing_.

those two methodologies are _not_ mutually exclusive.
they work _with_ each other, like a hand in a glove, in a synergistic manner, so you gotta be able to do both. at the same time. using the same tool. figure it out...

***

ironically, the series i interrupted in order to pursue this current thread was "a review of digitization tools", where i was discussing points that are quite relevant.

specifically, the ideal tool will run online and offline, and it will be designed so that the job can be done by one person all alone, or multiple people collaborating.

so, for instance, a python script would fit the bill. a web-app _might_, but only providing that it can be run offline, without any access to the internet. (or an offline alternative app would be acceptable, if functionality is substantially identical, or better.)

the truth is simple: certain tasks need to be done, in order for a book-digitization to be "successful"... and it doesn't matter if those tasks are done by one person or by multiple people, online or off. as long as they're done, the book gets digitized.

for instance, since roger brought it up recently, one of the tasks is "checking the paragraphing".

sometimes o.c.r. gets the paragraphs wrong... it either incorrectly splits a paragraph into two (or more) paragraphs with improper blank lines, or incorrectly joins two paragraphs into one by improperly deleting the blank line between 'em.

so you need to look at each and every paragraph in your text, and compare it to the relevant scan, to make sure there were no paragraphing errors.

it doesn't matter if you do that offline or online. and it doesn't matter if one person or many do it, as long as every paragraph on every page is done, and done correctly.

now, from a practical position, if you have multiple people doing different pages, then you must _coordinate_ their efforts, so as to make sure that every page actually gets checked... but as long as it _does_, that's what truly matters.
i don't know if you want to call that "page-level" or "book-level", but the label don't mean much. it is what it is. you have to check the whole book, and you do the check against each separate scan. call it whatever you like, it doesn't mean a thing.

and that's the way almost all of these tasks are. what matters is that they _get_ done, not _how_.

the other important consideration here, though, is that they are done _correctly_.

when you have one person looking at a page, and that is _all_, the odds are that, over the course of a full book, an average person is gonna make a few mistakes. to err is human. even on an easy task. especially!

so -- if you want to ensure each page is right -- you're gonna need to have more than one check. whether it's from a separate round of volunteers, or the beta smoothreaders, or your end-users, or even you yourself alone checking it over and over, that's what it's gonna take to have 100% accuracy. and even then ya might not reach that perfection.

there's no need to be pessimistic, however, since there is another factor that has an impact here... namely, we can often code programmatic checks that offer varying degrees of certainty that a task has been completed successfully. indeed, we can sometimes even have these tasks do a "first pass" on the text, and then simply verify what it's done.

let's take this paragraphing task as an example...

first of all, there's a very simple find operation to help find some improperly-split paragraphs: two-linebreaks-followed-by-a-lowercase-letter. boom! odds are you just located a few problems; not all the improper-splits, but _some_ of them.

now improve the routine, so you're searching for a-lowercase-letter-followed-by-two-linebreaks. boom! you might've located a few more glitches. (if you'd done this routine before the other one, it would have found some of the same problems. but both of them can detect some unique cases.)

you can improve the routine further, if you work.
or you can even try a whole different approach. that's what i've done, to check the paragraphing. to test my routine, i eliminate all the blank lines from an o.c.r. file, and have it put them back in. it does amazingly well. i'll show you sometime.

so, to cut a long story short, when roger implied that this job needed to be done by human eyes, he was wrong... the computer can do very well, providing you have programmed it intelligently.

heck, truth be told, even dumb programming does a fairly good job. if you put a blank line after every sentence-terminated line followed by a capital letter, you get a lot of them right. despite an excess of cases of improper splits, a big majority of the paragraphs are correct.

and improving that routine is also fairly easy, until you've got 98% of the paragraphs right. ain't gonna tell you the secret of the last 2%, but i'll show you an app proving it is doable. if you ask nice. otherwise just take my word.

it ends up that -- for most of the tasks that need to be done -- you can program a test, to see whether you think the file is adequate. i've already discussed code for doublequotes, but that's yet another perfect example of this.

again, just so everyone is perfectly clear here, there's always a need for good smoothreaders. every digitizer is gonna make some mistakes... we march to perfection, we don't just "arrive"... and we're never really sure we've gotten there. we always expect "at least 1 more" error-report.

but... there are book-digitization tasks which need to be accomplished... to say that a book is "finished", you must be able to certify -- with some degree of certainty -- that those tasks were done.

how those tasks get accomplished -- whether by one person or many, offline or online -- doesn't matter. what matters is that they got done, and that we can verify they're done.

that's the tool that we need to have, to do the job, a tool that allows us to
1. do the tasks, and
2. verify they're done.
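
to make the paragraphing example concrete, here is a rough python sketch of the two find operations and of the "dumb" blank-line rule described above. it is an illustration only, not the routine the post refers to (that one isn't shown here, and neither is the secret of the last 2%). the function names, the `SENTENCE_END` tuple, and the choice to skip a leading quote mark before testing for a capital letter are all assumptions made for the example.

```python
import re

# find 1: two linebreaks followed by a lowercase letter
SPLIT_BEFORE_LOWER = re.compile(r"\n\n(?=[a-z])")
# find 2: a lowercase letter followed by two linebreaks
LOWER_BEFORE_SPLIT = re.compile(r"[a-z](?=\n\n)")

# assumption: these line endings count as "sentence-terminated"
SENTENCE_END = (".", "!", "?", '."', '!"', '?"')


def suspect_splits(text):
    """return sorted offsets of blank lines that may be improper splits.

    each offset points at the first linebreak of the pair, so a break
    caught by both finds is reported only once.
    """
    hits = {m.start() for m in SPLIT_BEFORE_LOWER.finditer(text)}
    hits |= {m.end() for m in LOWER_BEFORE_SPLIT.finditer(text)}
    return sorted(hits)


def strip_blank_lines(text):
    """remove every blank line, as in the test setup described above."""
    return "\n".join(line for line in text.split("\n") if line.strip())


def naive_paragraphs(text):
    """put a blank line after each sentence-terminated line that is
    followed by a line starting with a capital letter.

    this over-splits whenever a sentence happens to end at the end of
    a line in mid-paragraph, so the output still needs checking.
    """
    lines = strip_blank_lines(text).split("\n")
    out = []
    for cur, nxt in zip(lines, lines[1:] + [""]):
        out.append(cur)
        starts_upper = nxt.lstrip().lstrip("\"'")[:1].isupper()
        if nxt and cur.rstrip().endswith(SENTENCE_END) and starts_upper:
            out.append("")
    return "\n".join(out)
```

one way to use it: run `suspect_splits` on the o.c.r. text and check each reported spot against the scan, or strip the blank lines, run `naive_paragraphs`, and diff the result against the original by lines to see where the two disagree.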
you can make up terminology if you like, and draw up some graphs in your mind which are as meaningless as the pictures your kid colored stuck to your fridge-door (albeit less captivating than their pictures), but when you boil digitization to its essence, what you find is "there are tasks to be done."

kudos to roger for building software to help.

and to don too, for ongoing advocacy of the importance of "checklists" to the workflow...

and let us not forget that don built an app. once again, folks, you should get "twister".
yeah, it's almost 2 years old now, and it has lots of features that are "not implemented", but you still get a feel for "how it should be".

and you reg-ex fans will be happy to know that you can install your own set of reg-ex, and have the app use it. (i couldn't get that particular thing to work, but maybe you will.)

but as you use "twister", ask yourself whether it's working at the "book" or the "page" level. if you can come up with an answer, tell me...

i also have a brand new thing to show you, but i'll save that for a later "pontification"...

have a nice day.

-bowerbird