December 2011 - gutvol-d - lists.pglaf.org

Copyright
by Ricardo F Diogo 27 Jan '12

Hi, Can anyone please tell me if PG has ever faced any trouble with any authors or publishers for copyright reasons? Did we ever have to remove any books from our catalog? Were there any copyright-related conflicts? Are they documented somewhere? (A postgraduate student needs this info for a college paper). Thanks, Ricardo F. Diogo

selling your own dogfood
by Bowerbird＠aol.com 06 Jan '12

once again, someone over at distributed proofreaders has raised the issue of people selling hard-copy of p.g. e-texts at amazon.com...

so a brand-new newcomer, "exilefromgroggs", posted this:
> I would suggest an alternative strategy.
> Set PG up as an Amazon seller (yes, seller!).
> Charge $1/£1 per book for books that are
> bought that way, and point out that 100%
> of the money that PG receives will be ploughed back
> into the project to make further books available.
> In the seller's blurb for the book, point out that
> the books can, in fact, be downloaded for free
> from PG's website, and include a link. I would imagine
> that being up-front and open, and being clearly linked
> to a "good cause" would mean that PG would rapidly
> become a high-profile seller in its own right through
> Amazon, and would take the wind out of the sails of
> people who are trying to make a fast buck for no effort.

because, you understand, that's always the exact way that those d.p. volunteers picture these amazon resellers -- as "people who are trying to make a fast buck for no effort."

i fully understand that most of these resellers put as little effort as possible into churning out the "product" they sell. but the truth of the matter is that it takes time and money to turn an e-text into a printed book, and even more work to handle distribution, and the profits aren't _that_ great...

but nobody bothered to tell that to the brand new newcomer. because doing so would force the existing volunteers there to confront the fact that their workflow doesn't have a way to create a decent print product.

the ascii text is unstyled. the .html product is formatted for a web-browser, not print. the e-book products are formless blobs that barely work in the machines for which they're intended, let alone for print.

that's why the vast majority of the resellers start with the text-file, instead of the .html file. so all of the work that post-processors put into making the .html file "look good" is just wasted energy.

every one of these html-books is a "snowflake" -- unique unto itself, per the post-processor -- and it's far too much work to try to figure out each one, so you could port it to print. or, for that matter, anything else, including e-books, which is why those turn out rather badly.

with a library this big, there has to be some standardization, or else you'll be unable to keep the files updated over time...

of course, it's _easy_ to say that now, with all the experience we have from having the books _not_ being _convertible_ to the formats we want. but some of us said it all _years_ ago, that letting each post-processor go off in their own direction was a sure-fire way to make sure their work'd be short-lived.

-bowerbird

p.s. there is also _another_ newcomer over there who wants to program an html5 web-app that interacts with the d.p. site, and nobody is bothering to tell him not to waste his time and energy, because even if he codes it up, the powers-that-be will ignore it.

p.p.s. ...waving back to lucy... :+)

Re: [gutvol-d] epubeditor.sourceforge.net
by Bowerbird＠aol.com 03 Jan '12

here's that post to lee which i started over a week ago, and part of what started making me feel that "despair". this post started out as a straightforward review of the difference between lee's tool and my methods. but i felt i needed to preface that with commentary on why a discussion with lee always is so frustrating, and why i eventually had to put him in my kill-folder, and how i wish that i wouldn't have reviewed his app, and the "preface" soon overwhelmed the "review"... so... if you want to skip the back-and-forth scratching, jump down to the two long lines of asterisks below, surrounding a section saying "take a deep breath"... *** lee's post (fri, oct 21st, at 17:47) can be found here: > http://lists.pglaf.org/mailman/private/gutvol-d/2011-October/008200.html dear lee... ok, first, lee, let me be perfectly clear to you... i understand all of your points -- every one -- about your program in your latest reply to me. indeed, i understood all of those points when you made them in your _previous_ reply to me. so you didn't need to make them _again_, and you won't need to make them again in a reply to _this_. because i understand 'em. honest! every one! totally and completely, lee! really! i had simply forgotten how tedious it can be to have a "conversation" with you, even when you're _not_ trying to spin it or sabotage me. but now i remember. so i will give you a few reminders, and other people here can see what i'm talking about. i said: > > or, ya know, you can always > > give 'em _your_ source-code. lee said: > But that's exactly what I did! yes, lee. i knew your code was open-source. i downloaded your code from sourceforge. and sourceforge is a host for open-source. anyone with minimal experience knows that. anyone who's been around a while knows that. anyone who can read the blurb that describes sourceforge, on the download page, knows that. so yes, i knew your code was open-source... 
and if you had thought about it for a second, or given me 1/10 of the credit i "deserve" for paying attention, or being a programmer, or putting in time, or being a web-surfer who isn't totally asleep, you woulda _known_ that i knew that your code was open-source, and you wouldn't have made the reply you made. so i wondered why you made that reply? but i stop wondering, after two seconds or so, because i've learned it simply doesn't matter... what it means, though, is that you missed that my suggestion was _ironic_, a bit _sarcastic_, and thus you missed the point i was making, which is that you -- and others just like you -- make noise about the "open-source" aspect, when -- in actuality -- the overwhelming mass of open-source projects _don't_ get treatment of the sort that you so-called "advocates" are so fond of talking about, namely that the code is worked over by a large number of people, who not only ensure that it is solid but also continually extend it to all kinds of new uses. oh sure, that happens with _some_ programs. but the vast majority of them are maintained by one person, who does all the work on it, until they tire of it, and then they respond to further requests with a "you can do it yourself". but nobody ever does. you know who's gonna work on your app, lee? you. and only you, lee. you. and nobody else. when i told d.p. i would code a _spellchecker_ for them, they told me they weren't interested, because "it won't be open-source"... so they went without a spellchecker instead, for years. and when they decided to take up the task of adapting an open-source spellchecker, it took a ton of time for them to get it to work, and it _still_, to this very day, doesn't always do what they would like it to do. and guess what? have they _ever_ gone in to rewrite that code, so it would behave like they want it to behave? no, they haven't. as it would be too difficult. is open-source a good idea? yes sir, it sure is! is free software an even better idea? 
you betcha! but let's not confuse the _real_ with the _ideal_. just because somebody else _can_ work on it does _not_ mean that they _will_ do that. ever. that's the long explanation of the point that i was making with my simple "suggestion"... i didn't want to have to type all of _that_, though, because then _i_ would've been the tedious one. but as you don't "get the joke", then even having any discussion with you becomes very tedious... *** here's another example, for you, and the others... i said: > > isn't it the "trivially easy" tasks that we want > > our computers to be performing _for_ us? lee said: > No, I don't think so. First you have to understand > that there are tasks that are > trivially easy /for a human bean/ that are > extraordinarily complex for a computer. > And there are tasks that are > enormously complex for a "human bean" > (primarily because they are so detailed) > that are trivially easy for a computer. well, gee, lee, thanks for the grand exposition there. i bet your friends think that you're really really smart. i grant that you're thorough, even as you manage to miss the point completely, and by 3.85 country miles. because we were talking about a task that is: 1. trivial for a human being. 2. trivial for a computer. 3. trivial for a human being to code a computer to do. and i think that we can all agree that your exposition is completely overblown in regard to that type of task. yet that's what we were talking about. (go look it up, if you need to, but the task was exactly like one that you'd just talked about by saying your program did it, so as "to relieve the tedium and avoid simple errors".) *** plowing through these diversions becomes very tiring. it's as if you're intentionally _trying_ to miss the point. (i'm not saying that you _are_ doing it "intentionally", mind you, because that _might_ be the "fundamental attributional error" raising its ugly head... 
but i have had to slash through the underbrush of these dodges so often that they sure do _seem_ to be "intentional". if they are not, then you would appear to have some serious problems when it comes to staying on-point.) i said: > one of the "themes" of the event is "beautiful books". lee said: > Hopefully, you have your tongue > planted firmly in your cheek. not only do you _not_ get my humor when i put it out, you _think_ i'm joking when i'm relating a simple fact, combined with a link which you must not have checked. these little misunderstandings cumulate to great frustration. i will say that yes, i did find that theme to be _ironic_... so maybe you can catch irony when i direct it at others, but not when i direct it at you. however, i wasn't poking _fun_ at that theme; i was appalled they would choose it! that being the case, though, no need for your "hopefully". their text is ugly, and thus they have no right to even use the term "beautiful" in conjunction with their text-versions. *** lee said: > I have no Mac, no access to a Mac, > and little interest in the Mac. > The promise of Java was "write once run everywhere." > Well, I wrote it once, now we'll just have to hope > that some Mac developer out there can > troubleshoot the problem (and tell me > what the solution is once s/he figures it out). this is exactly the type of attitude i was making fun of, in my post to which you wrote this response... for the record, let us note that "the promise of java" has once again gone unfulfilled in a real-life instance. *** i'll wrap this up, focusing on lee's post on wed oct 26 07:31. > http://lists.pglaf.org/mailman/private/gutvol-d/2011-October/008216.html > /Your/ file is > http://ia700600.us.archive.org/16/items/artofbook00holm/artofbook00holm_djv…. > There it is, the text, the whole text, and nothing but the text. 
i discussed why that text-file -- and all of the .djvu.txt files over at archive.org -- have problems, but my post might have _followed_ this one, in which case we couldn't blame lee for not knowing that. except that lee _should_ know all that. he has heard it before. nonetheless, he keeps trying to distort what i mean by "a text file". he keeps trying to talk about text-files as if _all_ of them had the deficiencies of the archive.org text-files, as if _all_ of them were lacking any structural information, and as if this was _required_... you can make a text-file "smart" if you want to, and it does _not_ require any angle-brackets at all. and anything that someone can do with angle-brackets can _also_ be achieved _some_other_way_, in a plain-text file, and it's just ridiculous to say that it can not... there's nothing magical about angle-brackets... nothing at all... > I'm just being a little more demanding. no, lee. you just _misinterpret_ what i am "demanding" as being much less than it really is, and then you think you have "more"... a direct and one-to-one correspondence can be made between what _you_ are asking for, and what _i_ am asking for. the job has some inherent demands, and if those demands are satisfied, then both of us can do the job... but if not, neither of us can... > What /I/ want is the output from FineReader > as though the "Save as HTML" option was selected, > with all the markup that FineReader was able to intuit if i get "all the markup that finereader was able to intuit", then i can do the job just as well as you can. maybe better. the point is that archive.org isn't giving us that information; they tell us we need to trawl their pile of x.m.l. crap to get it. > Does anyone want to furnish me > a *nix server with a fat pipe? i had pointed out that, although it would be _possible_ to run a script against all 3 million books at archive.org, the machinery and bandwidth required make it impractical. lee's solution? 
ask someone to "furnish" all that to him... i guess it never hurts to ask, 'eh? good luck with that, lee. *** anyway, there are your examples, folks... like i said, _tedious_. and then he repeats everything. this is why i put lee in my kill-file. now i remember. so let's bring this to a close. ************************************************************ take a deep breath to clear your system... take another deep breath to clear your system... take a third deep breath to clear your system... ************************************************************ i now direct the remainder of this post _back_ to the audience at large, not lee specifically... *** the main point of departure between lee and me is that he _starts_ with "the text is in .html form". _then_ his tool takes over. which is fine, i guess... except for the fact that it doesn't match the reality of how us regular humans actually make e-books. it doesn't describe the task that is being done by post-processors over at distributed proofreaders. it doesn't even reflect how the e-book designers who do the job _professionally_ go about the task. because _we_ all start with text. maybe the text is in a word-processing file, maybe it's raw ascii, but it's most assuredly not already marked up in .html. that's what _we_ have to do, to make it an _e-book_. if it was already in .html format, we'd call it "done", or mighty close to it. you might have noticed, up above, when i said the text _might_ be "in a word-processing file", yeah? so maybe you're just thinking that we could ask the word-processing app to convert the text into .html? well, yes, we could. and some of us novices do... but the professional book-designers don't do that. and they strongly advise even us amateurs not to... because what they have found is that the .html which is applied by word-processing apps is _very_crappy_. 
it gives poor results in most all the e-book viewers, and it is extremely difficult to work with, when you need to make changes. (and you almost always do.) so the admonition is fairly universal: don't do that! what do the professionals advise us amateurs to do? they advise us to save the file as plain-ascii text, and then to apply the .html to that plain text, including the reapplication of styling (e.g., italics) which gets _lost_ when the file is saved in plain-ascii format... indeed, that is precisely what those professionals do. preach what they practice, practice what they preach. (if you don't trust me, ask me to provide some links. or research it yourself. it's easy to find such advice. joshua tallent, liz castro, or thebookdesigner.com...) now, i think it's utterly ridiculous to strip away styling and then have to _reapply_ it. but that's what they do. the application of good solid .html, though, is wise, so _that_ part of the advice i can thoroughly second... even if you do it by hand, it's more economical than letting a word-processor apply crap, which you then waste more time -- long run -- trying to "improve". now, the truth is that those pros have "scripts" that apply the markup automatically. plus they _know_ .html already, well, so this comes naturally to 'em, even if they have to do some of the work manually. but their advice is still good advice for us amateurs, because we get _totally_ confused by crappy .html... without the slightest notions of how to "improve" it, or even to make those inevitably-required changes... but whether you are a professional or an amateur, the reality of making an e-book these days is that you _start_ with text, which you mark up in .html... (actually, for .epub, it's .xhtml, but we don't need to even bother making such fine-grained distinctions.) sometimes -- as with d.p. -- the text is from o.c.r. other times, it was "born digital". whatever the case, however, the reality is that we all start with _text_... 
and the nature of the _job_ is doing .html markup... it _is_ true that -- once you have done the markup into .html, there's _still_ a bit more work after that. and it's also true that this "bit more work" is often _very_ confusing and time-consuming, _especially_ to us amateurs, because the i.d.p.f. -- which is the organization that maintains the .epub standard -- has _never_ provided solid information concerning just exactly what this "bit more work" really entails. even the pros get confused, sometimes hopelessly. (i'd give a link, but i don't want to embarrass them.) however, if you do enough grunt work, and are ready (if not willing) to power through frustrating failures that can number in the dozens, or even _hundreds_, you too can eventually discover the things that work, and you can develop templates that ease future pain. after you've done that, the "bit more work" that is required _after_ you've done your .html markup is fairly easy -- it's basically just filling in information that's included in some "auxiliary" files in an .epub. two of those files are the .opf file and the .ncx file. you might recognize those extensions, since they're the files about which i've lately been speaking to lee. his epubeditor produces the auxiliary files for you, and helps you put the required information in them. so if you're one of the amateurs who are _struggling_ with the proper creation of these files, lee's program would be a _godsend_ to you, saving time and hassle. if you're a professional, you're not spending any time or energy producing these files anyway, because you already have scripts which make them automatically. so you might use lee's tool to do occasional reviews of your .epub files, or make minor corrections, but it probably won't be an app you consider as "crucial". 
more to the point, though, is that there are lots of programs out there that already create .epub files -- from text -- which generate the auxiliary files (like .opf and .ncx) required _inside_ the .epub file. they apply the .html _and_ create the auxiliary files. so, to sum up, there are two steps to making an .epub: 1. transform the text of your book into an .html file. 2. create the auxiliary files required inside an .epub. total novices, with no tools or experience, will spend _much_ time on the first, and _much_ on the second... professionals, operating with their pro tools, will spend a good amount of time on the first, little on the second. and amateurs, with the decent tools out now, will spend a good amount of time on the first, little on the second. in other words, lee's tool helps with the second step, but no one except inexperienced novices spends time on the second step to begin with. the second step is the "paperwork" that must be done to "finish the job", as the old expression puts it. lee's tool totally ignores the first step, .html markup, which is where everybody spends most of their time... this makes me suspect that lee simply doesn't know how real people in the real world make real e-books. namely, we start with text, and we mark it up in .html. then we do whatever little dance needed to turn it into an e-book file that's viewable on our e-book machines. now, if we only had some kind of a program that would take plain old text, and automagically turn it into .html, plus then create the auxiliary files required in an .epub, _then_ we'd have an app we could call "an epub editor". wait, isn't there such an app coming out real soon now? well, yes, son, there is. called "jaguar". real soon now. in the meantime, while you are eagerly awaiting that, if you're on a mac, you might want to buy a program called "multimarkdown composer", by fletcher penney, which is an editor that incorporates "multimarkdown". 
multimarkdown, also penney's, is a variant of markdown. markdown is a light-markup system which converts text into .html output that validates as standards-compliant. thus "composer" is a great tool to help with the first step listed above -- the hard step that takes most of the time. "composer" is new in the app store, and it's just $7.99... or, you know, you can use sigil. free/free. and it works. as far as i know, it works fine. couldn't be much better. -bowerbird
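
The two steps summed up above (mark the text up as .html, then create the auxiliary files) can be sketched in python. This is only a minimal illustration, not the method of epubeditor, sigil, or any other tool named in the post. The function names, the underscores-mean-italics convention, and the file names book.xhtml and toc.ncx are all assumptions made for the example, and the .opf it produces is a bare skeleton rather than a complete, validated package.

```python
import html
import re


def text_to_xhtml_body(text):
    """Step 1: turn plain text into paragraph markup.

    Assumed conventions (not from the post): blank lines separate
    paragraphs, and _underscores_ around a phrase mean italics.
    """
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]
    marked_up = []
    for paragraph in paragraphs:
        # Collapse hard line breaks, then escape &, < and > for xhtml.
        escaped = html.escape(" ".join(paragraph.split()), quote=False)
        # Reapply the styling that plain text can only hint at.
        styled = re.sub(r"_([^_]+)_", r"<i>\1</i>", escaped)
        marked_up.append("<p>" + styled + "</p>")
    return "\n".join(marked_up)


def make_opf(title, identifier, content_href="book.xhtml"):
    """Step 2: build a skeletal .opf listing the content file and the .ncx."""
    return f"""<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>{html.escape(title, quote=False)}</dc:title>
    <dc:language>en</dc:language>
    <dc:identifier id="bookid">{html.escape(identifier, quote=False)}</dc:identifier>
  </metadata>
  <manifest>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
    <item id="content" href="{content_href}" media-type="application/xhtml+xml"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="content"/>
  </spine>
</package>
"""


if __name__ == "__main__":
    sample = "it was a _very_ dark night.\n\nshe & i left early."
    print(text_to_xhtml_body(sample))
    print(make_opf("sample book", "urn:uuid:00000000-0000-0000-0000-000000000000"))
```

Even in this toy form, the first function is where the judgment calls live (what counts as a paragraph, what counts as italics), while the second is fill-in-the-blanks paperwork, which matches the split of effort described in the post.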

pontifications from mount high horse -- #1498
by Bowerbird＠aol.com 02 Jan '12

ok, i went back and did a better job on "betty lee", and then compared the product to the original o.c.r. well, what i _actually_ have is what i scraped from his editor demo, and some of that text had been edited... but i got to it pretty early, so i think i minimized that. (it would be nice if roger put out his _actual_ o.c.r. it would also be great if roger put out his .rtf copy. but i'm not sure how interested he is in this stuff.) at any rate, i will shy away from hard numbers, and just report the pattern of results, which is very clear. in general, i found this digitization looks exactly like the dozens of other ones that i have reported on here. as usual, the o.c.r. was good. quite surprisingly good. (i guess it's time that we should no longer be surprised. still, these scans were _murky_; but no, it didn't matter.) there were right around 256 errors, on a 256-page book, on the raw o.c.r., which is pretty much what you'd expect. considering the number of lines -- over 7000 of them -- those 256 errors constitute an accuracy rate of over 95%. roger hasn't released his text yet, so i can't yet do a compare, which will probably reveal more errors that i missed, but even if i missed twice as many as i found (quite unlikely), the accuracy of the raw o.c.r. is still gonna be about 90%. *** i spent more time finding and fixing errors in this book than i wanted to, more time than i would have otherwise, because i was using it as the content for a new program. so i can't give a good estimate of the cost-benefit ratio of the time i spent, but i can say that i did indeed catch a pretty good percentage of the errors on my first pass. i made errors, lots of them! -- many more than the 22 which roger reported a while back -- but i can also say that i woulda caught almost all of the errors originally _if_ i had done a careful job, and done it more fully... 
i did a rush job, because i didn't know how fast roger was going to act, and i wanted to get my stuff out first, so roger and everyone else would know that i _hadn't_ used his results to create mine, i did mine on my own. and i wasn't thorough, because i didn't know if people would care. heck, i didn't even know if _i_ would care. but the project ended up being fun. it was _nostalgic_, coming in at the end of the year, plus i hadn't done an analysis of a digitization in a long time. and i'm rarely able to assess my own performance, so that's a blast... maybe i'll make a list of the errors later, for you to see. or maybe not. either way, the results are unmistakable. the o.c.r. was good. many of the "errors" were due to scan-spots -- which o.c.r. is duty-bound to report -- or outright errors in the p-book. it was full of errors! this is one of those e-books that is _more_accurate_ -- out of the chute -- than the p-book it came from. and, to repeat, this result is _the_typical_finding_... across the board, i have demonstrated, over and over, that the o.c.r. is good, and the vast majority of errors can be fixed by using extremely simple preprocessing, the type that you can do in one hour for a simple book. correcting the o.c.r. -- and even doing the formatting -- for a book is easy. it doesn't take rounds and rounds of volunteers wasting time and energy poring over a book. all it takes is one or two people using a good tool, and a couple of smoothreaders to catch the stealthy stuff... if you want to split up the job, you can have 10 or 20 people using that same "good tool" to do the job, and 10 or 20 smoothreaders, probably giving better results. but even one person and one smoothreader can do fine. and if you solicit error-reports and act on 'em diligently, you can execute a very smart march toward perfection... *** the things i just said apply to d.p. and p.g., obviously, but they also apply to some points roger has made... 
for instance, roger said this: > I've heard from some people that solo process > that actually like to go through the book > page at a time because they enjoy > following the story as they go, which > doesn't happen when someone is > in production mode > at the book level. now, first of all, of course, this is another case where roger exhibits fundamental misunderstanding about the essence of "production mode at the book level"... rather, it is page-oriented systems like the one at d.p. which make it difficult for people to "follow the story". in a system like the ones i make, the entire book is available to a person at all times, so they can surely "follow the story" if they choose to read it in order. the main difference is that, in a book-oriented system, you will _begin_ by cleaning up the big errors first -- the ones that are simple for the system to auto-detect -- so that you can then "settle in" to read each page, during which process you can look for _subtle_ errors, without being distracted by a need to fix any big ones, as that does indeed detract from "following the story"... in a page-oriented system, an absence of preprocessing means you might need to fix a bug on nearly every page, and that hurts both your accuracy _and_ comprehension. so roger has not just "failed to get things right" here... he has actually gotten it _completely_backward_, sadly. and, like i said, he's one of the smarter guys here. sadly. *** if anyone wants to see my analysis of my performance on "betty lee", let me know. or view the product online. > http://z-m-l.com/go/betle/betle.zml > http://z-m-l.com/go/betle/betlep123.html oh yeah, i almost forgot to tell you... i've programmed yet another book-digitization editor. once again, it's in python, like the one i built recently. but it's rather full-fledged, like the one i built in perl, back in 2010, when i was working on roger's "sitka"... 
it's not all finished yet, but you can look at it here: > http://zenmagiclove.com/bettyedit.py that is targeted at the ipad right now, but i can also make it work on an iphone by sizing the text smaller. it's _increasingly_ important to offer people the chance to contribute to your digitization project when they are using a mobile form-factor, like the ipad or the iphone. *** have a nice day. -bowerbird
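
The accuracy figures quoted earlier in this post follow from simple arithmetic, sketched here in python. The sketch assumes a line count of exactly 7000 (the post says "over 7000") and at most one error per line, so both results are floors on line-level accuracy rather than exact measurements. The function name is made up for the example.

```python
def line_accuracy(errors, lines):
    """Fraction of lines with no error, assuming at most one error per line."""
    return 1 - errors / lines


LINES = 7000   # the post says "over 7000", so this is a floor
FOUND = 256    # errors found in the raw o.c.r.

# Counting only the errors actually found.
as_found = line_accuracy(FOUND, LINES)

# Pessimistic case from the post: twice as many missed as found.
pessimistic = line_accuracy(FOUND + 2 * FOUND, LINES)

print(f"{as_found:.1%}")     # 96.3%
print(f"{pessimistic:.1%}")  # 89.0%
```

Both numbers are consistent with the "over 95%" and "about 90%" claims in the post.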

Re: [gutvol-d] a review of some digitization tools -- 022
by Bowerbird＠aol.com 31 Dec '11

keith said:
> Why do you not ask john if pandoc can do it
> without human intervention!

"without human intervention"?

whatever in the world could that possibly mean?

-bowerbird

pontifications from mount high horse -- #1499
by Bowerbird＠aol.com 31 Dec '11

i don't think anyone will build another digitization community like distributed proofreaders. and d.p. will wither away soon... the digitizers of tomorrow will be people like paul flo williams, and james simmons, who take on the task as "a labor of love"... they'll do books that they love -- _because_ they love them -- and not just take part in some abstract "digitization" project... they'll be far more likely to do a half-dozen books end-to-end, than to take on some piece of dozens (or hundreds) of books... they might get a few people to help 'em out, or they might not. it won't matter, not much, because either way they'll trudge on. and they'll use tools that will fit their small, personal workflow. *** this brings up something roger said a while back: > I also feel it fits comfortably with > users that want to just get one page right. > There's a sense of accomplishment > resolving the warnings on one page > and knowing it is "done." first of all, let me give roger full credit for what is one of his most significant contributions over the past few years, and that's saying a lot because he's made plenty of them... but one of the most significant is that his actual tests showed that if a person _says_ they consider a page to be "finished", then the odds are that it really _is_ finished, i.e., error-free... i had always argued, based on my gut, that it took _2_ people to confirm a page as error-free before we could "trust" it was, _and_ that anyone who made a change to a page could _not_ be counted as one of them. in other words, 2 _confirmations._ roger's research showed that that's not necessarily the case... at least it _hinted_ at that... but there was one troubling confound in his test, which is that he didn't factor the _initial_state_ of the page. and we _know_ that -- typically in o.c.r. -- _many_ pages start out error-free. indeed, in some o.c.r. 
files, after a decent preprocessing run, the majority of the pages will be error-free, so a person could certify _all_ of them as "finished", without even looking at 'em, and be "correct" far more often than they were "wrong". but there's another aspect to this, a psychological aspect, and one that involves the _motivation_ to do digitization, especially in a large project where you're a cog in a wheel. i fully recognize that people get satisfaction from the feeling of "finishing" a page, and that we can use that for motivation. but at what point does that person begin to feel manipulated, because the vast majority of the pages were "finished" before the person even looked at them, and we neglected to tell 'em, or -- worse yet -- we implied that the page _did_ have errors. i mean, it's one thing to say, "we think this book is error-free, so go ahead and read it, and let us know if you catch anything", which is the way that i would pitch beta-reading to my people. but it's another thing to say, "proofread this word-by-word", when you darn well know _most_ of the pages are error-free. there is a dishonesty about that which causes me discomfort. or, in other words, what kind of satisfaction does a person get from certifying a page as "finished" if the page was "finished" before they even got it, but you led them to believe it wasn't? *** there's a flip side to this, too. what kind of satisfaction did roger get from calling his pages "done" when he later found that he had missed many of the italics? this cuts both ways... anyway, just giving you something to think about in 2012... *** have a nice year. -bowerbird

pontifications from mount high horse -- #1498
by Bowerbird＠aol.com 31 Dec '11

roger said: > my sampling of the kind of errors > left behind by BB's process as far as it went. thanks for the fuller exposition, roger... just as a note, before we begin, i'd like to issue one minor note of protest that these errors are labeled as "bb:", when, actually, they were errors that were in _the_o.c.r._... it's not as if i _introduced_ these errors... it's just that my preprocessing failed to find and fix them. there's a difference... i don't have a suggestion for an alternative, and i see the reason for that nomenclature, just want people to grok the bias in the label. *** on a continued note... it's also the case that my preprocessing _did_ find-and-fix some 250 errors in the o.c.r., so i think the fact that it missed 21 is acceptable, as a first-pass. it can probably be improved, too, but i believe a 92% reduction in errors is nothing to sneeze at. even if i missed _100_, that would be a 70% reduction in o.c.r. errors. *** now that we're done with the minor protest... for the folks who aren't paying full attention, the whole point behind my argument is that a minimum of effort can fix a _lot_ of errors. i have never claimed it can find _all_ of them -- indeed, that would be a form of magic -- or that it's a _sufficient_ mechanism, by itself. this "minimum of effort" absolutely _needs_ to be followed by a beta-reading component. although i have never failed to stress that need, sometimes people seem to forget it completely. please remember to consider the full picture... once you do, you'll understand my rejoinder: my beta-readers woulda caught these errors. (just like roger's smoothreader caught them.) having said that, it is of some concern to me that this seems to change the error-rate so it becomes worse than 1-error-per-10-pages. gonna have to see what i can do about that... > With the addition of a good smoothreader, > many of these diffs would have disappeared right. 
(well, i'd say _all_ of them, but i'll settle with "many of them" if you insist that it's true, as we're talking about imaginary beta-readers, in my case, so i can't be certain of their skills, but if you say they're not perfect, i won't argue.) these results are from a midpoint in the workflow. while all the errors you reported are very real -- or almost all of them, anyway -- they are from classes that i have oft-acknowledged as ones that are missed in my _preprocessing_... there's a good reason i use that particular word -- _preprocessing_ -- because it _emphasizes_ the point that this doesn't create a final product. let's look at those acknowledged weakspots: > stealth scannos > missed italics > splotches > spelling discrepancies "stealth scannos" are something my preprocessing will never catch. that's why it needs beta-readers. "missed italics" will depend on abbyy's .rtf quality. if you don't use abbyy, you'll do all of it manually. and if you suck at it, like me, the results will suck. it's not that i cannot imagine coding a routine that could check for stealth scannos, or italics. i could develop some ideas and take a stab at it. i just wouldn't expect that i'd be very successful, in cost-benefit terms. you can do lotsa checks for stealth scannos, but you get too many false alarms. and my whole point is that you utilize cost-benefit. it's easy -- and efficient -- for beta-readers to catch stealth scannos, and even missed italics... so let them do that job! that's the wisest course. "splotches" are something else that's hard to catch, sometimes even impossible. i'll look at these errors, and see if they are the type of thing that _could_ be detected with some programmed routines. but still, it's obvious that some can't, which is -- yet again -- why you need to have beta-readers in the workflow. as for "spelling discrepancies", i do a spellcheck, but if both forms pass, then i let both of them go. and i didn't do consistency checking on this book. 
i _will_ make it part of my process, when i finally formalize it, but i didn't do it on this book, nope. *** so this wasn't a fair fight. roger used a smoothreader, and i didn't. so _of_course_ his results will be better... roger might've even done a word-by-word proofing, similar to the kind done at distributed proofreading. that can give even _better_ results. sometimes not, but sometimes it _will_, there's no question about it. where the question comes in is whether it's worth it, in terms of the time and energy that it takes people. go take a look at a project page over at d.p., and see how much time it takes the proofers to step through the pages of a round. you can see a page being saved every minute, or two, or three, so a 256-page book -- like "betty lee" -- can take about 600 minutes to go through a single round. that's 10 hours of work! for one round! is this a wise use of time and energy? i sure don't think so. when a beta-reader does a book, at least they get the pleasure of actually _reading_ it... those proofers over at d.p. don't even get that benefit. i don't think the time and energy that they _volunteer_ is being used in the best possible cost-benefit manner. and _that_ is what this discussion should be about... *** all in all, these results don't surprise me one bit. this o.c.r. had more scannos and splotches than i usually find, but the scans weren't all that hot, as you'll readily see just by paging through 'em. maybe i'll run 'em through scan-tailor, just to see if i can then get better o.c.r. out of them... but if roger will promise to make available his data -- i.e., scans, r.t.f. -- for his future books, i will formalize my preprocessing into a system, so he could run regular tests of its efficiency... -bowerbird p.s. now, notes on each of the individual errors... stealth scannos -- i leave these for the beta-readers... 
> RF: some of them, and give Ramon's message,
> BB: some of them, and give Earn on's message,
shoulda caught this, as an unexpected mid-sentence cap.

> RF: Betty and next to Peggy Pollard, who, it
> BB: Betty and nest to Peggy Pollard, who, it
crap. i did catch one "nest/next" scanno, but forgot to check the rest of the file for another.

> RF: a thing to work for that being president
> BB: a tiling to work for that being president
i'd guess a check for "tiling" would be a worthy one.

> RF: the back. Mary Emma could not go with
> BB: the back. Mary Emma could hot go with
"could hot" and "can hot" checks will surely be worthy.

> RF: problems. From Lucia's manner, she
> BB: problems. From Lucia's manlier, she
a "manlier" check might well be worthy.

> RF: of the page and below was a brief resume
> BB: of the page and below war; a brief resume
a check for "war" might be good, or might give false alarms.

> RF: I'm the crossest girl you ever saw, so far as mere looks
> BB: I'm the Grossest girl you ever saw, so far as mere looks
shoulda caught this, as an unexpected mid-sentence cap.

***

spelling discrepancies -- these are just because i made mistakes

> RF: of those still, quiet stiletto exchanges
> BB: of those still, quiet stilletto
i probably gave this the o.k. because i don't know how to spell it.

> RF: tonsillitis. Betty saw her and overheard
> BB: tonsilitis. Betty saw her and overheard
i probably gave this the o.k. because i don't know how to spell it.

***

splotches -- not much i can do about these

> RF: packed a thin chiffon dress, while
> BB: packed, a thin chiffon dress, while
can't see how i could ever devise a test to catch that.

> RF: this, Miss Betty Lee!"
> BB: this,' Miss Betty Lee!"
didn't do a check of the balancing of single-quotemarks.

***

missed italics -- not much i can do about these

> RF: wouldn't do _one thing_. She is sweet
> BB: wouldn't do one thing. She is sweet
i don't know how to code a test for italics.

> RF: other times too, but _always_ then,
> BB: other times too, but always then, before
i suck at finding italics. i missed 40 cases, not?

***

missing quote marks -- my tests should've caught these

> RF: won't you?"
> BB: won't you?
shoulda found this, unless both quotemarks were missing.

> RF: who sat down. "How is your mother
> BB: who sat down. How is your mother
shoulda found this, unless both quotemarks were missing.

***

extra quote marks -- an ironic coincidence of splotches

> RF: little habit of dropping in when
> BB: little habit of 'dropping in' when
two splotches happened to do something that made sense.

***

levenshtein check -- i should include this in my process

> RF: are the Sevillas and where do they live?
> BB: are the Savillas and where do they live?
i got so confused on this name, but thought i checked 'em all.

***

guiguts-catchable errors -- i just plain forgot these tests

> RF: sometimes! I can't study! Come over here
> BB: sometimes! I can't study I Come over
shoulda caught this, as an unexpected mid-sentence cap.

> RF: who reads the sport page."
> BB: who reads the! sport page."
shoulda caught this, as an unexpected sentence-starting lower.

> RF: know."
> BB: know," (at end of paragraph)
shoulda caught this, as an improper paragraph-termination.

***

a bug in BB's generator -- not "a bug"; indicates a continued quote

> RF: like my residence here.
> BB: like my residence here." "
not a bug. will be deleted before the product goes "final".
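Several of the checks named in the notes above (a watch-list of stealth scannos, unexpected mid-sentence capitals, unbalanced doublequotes, and a levenshtein pass over names) are simple enough to sketch. What follows is only a rough illustration in Python, not the preprocessing code the message refers to, which was never posted. The function names, the watch-list contents, and the thresholds are all assumptions made for the example.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical watch-list, seeded from the scannos discussed above.
SUSPECT_PHRASES = ["could hot", "can hot", "tiling", "manlier", "nest to"]


def suspect_phrases(text):
    """Return (phrase, count) for each watch-list phrase present in text."""
    lowered = text.lower()
    hits = []
    for phrase in SUSPECT_PHRASES:
        n = len(re.findall(r"\b" + re.escape(phrase) + r"\b", lowered))
        if n:
            hits.append((phrase, n))
    return hits


def midsentence_caps(text, max_count=1):
    """Capitalized words that follow a lowercase word, comma, or semicolon.

    Character names recur many times, so only words seen at most
    max_count times in that position are reported (an assumed heuristic).
    """
    found = re.findall(r"(?<=[a-z,;] )([A-Z][a-z]+)\b", text)
    counts = Counter(found)
    return sorted(w for w, c in counts.items() if c <= max_count)


def unbalanced_quote_paragraphs(text):
    """Indexes of paragraphs holding an odd number of doublequotes.

    Quotes continued across paragraphs are legitimate and will also be
    flagged, so every hit still needs a human look.
    """
    paragraphs = re.split(r"\n\s*\n", text)
    return [i for i, p in enumerate(paragraphs) if p.count('"') % 2 == 1]


def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def near_duplicate_names(text, min_len=5):
    """Pairs of capitalized words that differ by exactly one edit."""
    names = {w for w in re.findall(r"\b[A-Z][a-z]+\b", text) if len(w) >= min_len}
    return sorted(
        (a, b) for a, b in combinations(sorted(names), 2) if edit_distance(a, b) == 1
    )
```

Each function only produces a list of candidates. As the message stresses, some of these checks (a bare "war", for instance) would raise too many false alarms to be worth running, and none of them replaces a beta-reader.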

1 0

Seven Million+ New Reader Devices activated on Xmas Day Alone
by James Adcock 30 Dec '11

30 Dec '11

http://www.the-digital-reader.com/2011/12/27/nearly-7-million-ios-and-android-devices-were-activated-on-christmas/#more-28013

7 8

pontifications from mount high horse -- #1497
by Bowerbird＠aol.com 30 Dec '11

30 Dec '11

another jumble of responses... *** roger said: > Uh, no. I didn't put the project up for you > to scrape and try to draw conclusions. except that "drawing conclusions" by analyzing data after experimentation is simply "what i do", roger... > I don't have those numbers and I'm not going to > try to derive them from the data. ... > Right now, I don't care which way is better, technically. > I believe the choice will be made on which approach > is most comfortable to the user. I've heard from > some people that solo process that actually > like to go through the book page at a time because > they enjoy following the story as they go, which > doesn't happen when someone is in production mode > at the book level. ... > You chose to scrape it and open up the discussion. yes i did choose to do that. because that's what i do -- analyze data, draw conclusions, and open up discussion. but i realize that you didn't "ask" for this discussion, and you do not care to discuss some things i talk about, and if you don't want to continue with it, that's fine with me... i appreciate what i got from you this round, like always, but if you'd rather not be engaged, then i can accept that. i think you could write better software if you did do tests, with well-designed research, intended to get clear data, from which you could "draw conclusions" that will make you smarter about the underlying dynamics at work, but i'm past the point of caring very much about your stuff... if you wanted to talk to me about these things, you had chances in 2007 or so. you didn't want to _then_, either. as you can tell from my "pontifications", my willingness to persist in this quest for meaningful dialog is waning. so i'm just getting some things out of my system, so that when 2012 rolls in, i can start anew, with a clean slate... i feel a strong desire to stop treading all this old ground, and start moving the ball forward; the world is now ready. 
if i have a lingering need to say stuff in the coming year on any of the particular topics instantiated here recently, i'll address messages to a fictionalized amalgamation of don, roger, alex, etc. -- a character dubbed "dodger-x". and yes, i also realize my "tone" is often "disrespectful". it didn't start out that way, but the treatment i received from some people here often turned it in that direction. i don't feel that people automatically "deserve" respect. respect is something you have to _earn_, with _quality_. my commitment toward honesty, integrity, and truth is far too strong to dole out _respect_ to shoddy bullshit... you _can_ automatically expect _courtesy_ from me, yes. although some people forfeited even that, when they seemed to think it was ok to treat _me_ discourteously, just because i didn't pander to their need for "respect". and you certainly won't get "respect" by bullying me... i learned a long time ago not to let myself be bullied -- that you had to give bullies back their own "medicine", to expose their cowardice, or they would chew you up... but it's also the case that i simply _have_ no "respect" for the kind of willful ignorance you expressed above. i mean, it's _smart_ to "not care" about things that are irrelevant to what you do. but relevant considerations? well, it's _stupid_ to "not care" about _those_ things... and no, roger, don't get all upset, because i am _not_ "calling you stupid". precisely because two days later you were coming back to the list saying, "hey, i noticed that my accuracy at finding the italics really sucked..." so here you are, dragging in data that made you wiser about a flaw in your workflow, so you really _did_ "care". and that's the kind of attention to detail that i _respect_. that's the kind of _wisdom_ which i _value_. it's too bad that it's far too little, and far too late, to matter much... on the bright side, even though you're years behind me, you're _light-years_ ahead of almost everyone else here. 
so i give you the strongest encouragement possible. but, like i said, you shoulda talked to me back in 2007. because starting in 2012, i'll move on to better things. *** but since we're still here in 2011, let me clear my plate... *** everyone agrees that some parts of the digitization workflow need to be done on a book-wide basis, by a single individual. they just _fail_to_recognize_ those parts for the discussion... this typically includes steps at the beginning of the workflow -- e.g., obtaining the scans, photoediting them if necessary, getting the o.c.r. in order and the other preprocessing steps, including the naming of the files, and discarding the garbage. nobody would give much consideration to the argument that these steps should be done "page by page", or collaboratively. sometimes you need just one person in charge of operations. and typically, it also includes steps at the end of the workflow -- final checks to ensure the products are reasonably correct, mounting the final products, accepting any error-reports, etc. again, it would be silly not to have one person handle a book. it is the middle where differences of opinion emerge on: 1. whether it's done by one person, or multiple people. 2. whether it's on a book-wide basis, or page-by-page. those are the main divisions, but occasionally along the way some people have introduced even more confusions, such as the people who insisted -- when i started bringing this up -- that any book-level changes were those done "automatically", and thus "blindly", so they would then argue against _that_... fortunately, we haven't heard that kind of blather here lately. good thing, too. talk about a straw-man... geez! as another example, this one which _is_ extremely recent, roger said some things that gave me the strong impression that he defined the "book-level" versus the "page-level" by _how_the_text_was_stored_. 
using a concatenated-text-file meant that he was doing stuff at the level of the "book", but once he split the text for each page into its own file, then he was operating at the "page" level. now, i refuse to believe that he thought something as arbitrary as his _file-storage_ determined the level. but that's how he seemed to be talking! for the record, just to try to make lemonade out of this lemon, it's trivial to split the text of the book, slicing it various ways. you can split it into pages, and many times that slice is useful. you can split it into sections (e.g., chapter), also useful often... sometimes you'll need to split it into paragraphs, like we just discussed in the topic of checking for balanced doublequotes. and of course, for spellcheck, you need to split into _words_. for curly-quotes and em-dashes, you go down to characters. i rarely find a need for it, but you can slice by _sentence_ too. and then, last but not least, i often slice the file into _lines_... there's no question that lines are the best way to show diffs of the type that's most likely to occur when correcting o.c.r. so, when you're writing code to handle the digitization tasks, you're constantly splitting, joining, resplitting, and rejoining, depending on what you are trying to accomplish at the time... so, to a programmer, talking about the way the data is stored as your definition of the level of analysis is... bass-ackwards. oh, and just to close this sub-topic, i store text as _books_. it's too inconvenient to store the text in so many "page" files. "betty lee" is under 300k, so that is a very manageable size... (a couple of its _individual_ page-images are that same size!) at any rate, my point is that, on this overall thread involving how we think about an approach-level to digitization, there's a lot of confusion out there about how to conceive the topic. 
and, needless to say, it's impossible to argue against _every_ type of misconstrual out there, so let's ignore the wacky ones, and focus our concentration on the top two contenders now. (because, as we'll see, even the top two end up meaningless.) as per convention over at d.p., we'll call these middle parts "proofing" or "formatting". the difference of opinion is over: 1. whether it's done by one person, or multiple people. 2. whether it's on a book-wide basis, or page-by-page. now, i fully recognize that d.p. has warped a lot of brains into thinking these middle parts are purely reductionistic. which is bullshit. we're not building automobiles. we don't need a factory, filled with employees, milling around the assembly line... but let's go back in time, so we can give d.p. its due credit. before charlz franks dreamed his brilliant little brainstorm, book digitization was a solitary and lonely path to tread on. michael hart had done the job of creating a force of volunteers, but by and large those individuals did a book _by_themselves_. they might proof each other's work, after a book was "finished", but they didn't split a book to do it -- not even split it in half! so what charlz concocted was truly astoundingly revolutionary! it was one of the first projects which leveraged the connectivity -- the _social_ connectivity -- of the internet, believe it or not. and that alone marked it as a nifty piece of futurism. but the idea that you could digitize a book _page-by-page_ was just as challenging to the "accepted wisdom" of the day. indeed, going in, everybody wondered if it would even work! not if it would work "well"... whether it would work _at_all!_ most people actually _doubted_ it would get off the ground. hey, charlz himself wasn't sure. "it was just an experiment." so he was overjoyed -- along with the other people around -- when the first book was done, then the second, and a third... 
only after a few months (or maybe even more than "a few") did he actually come to believe that it was going to succeed. he told me that face-to-face in 2003, and even at that time, he appeared to be shocked and amused that it really worked. and maybe we shouldn't have been surprised. after all, the main task is correcting the scannos, right?, and how many ways _are_ there to correct a misspelling? but there _were_ differences of opinion on if it'd work, and perhaps it'll be instructive to consider the reasons. many people believed that digitizing a book required a consistency of purpose, a unity of vision, which would be difficult (or impossible) to achieve in a collaboration. there was a belief too many cooks would spoil the soup. and back then, it took a month, or 2 or 3, to do a book, and there was much doubt that a group would cohere for such an extended period, on a fairly difficult task. so there was also doubt cooks would even stick around long enough to ruin the soup! it'd be a revolving door, nothing more than a never-ending nightmare of churn. and, even with the hindsight of retrospect, the concerns weren't that far-fetched. if everyone went about the job doing things "their own way", it woulda ruined the soup. and there have been enough other "distributed" failures -- including several book-digitization ones -- that the doubt that critical mass would form has been validated; d.p. ain't the rule, it's the exception that proves the rule. thus it ended up that that very same social connectivity "solved" those problems, by creating a _community_... the community did form a critical mass, which ensured that there would be plenty of cooks, for a long time, even if some of them would burn out of the kitchen... and the community formed a social core that enforced some consistency on how its members performed, so a "unity of vision" was achieved via that peer pressure. not that it was easy. 
nope, despite the obvious response to "how many ways can there possibly be to correct a simple misspelling?", there arose a plethora of other points of disagreement. a _lot_ of them. disagreements about ellipses, and italics markup, and blank lines at the top of a page, and end-line hyphens, and how to solve the problems of persistent log-jams, and how many rounds to have, and how to skip rounds, and how to deal with the asshole known as bowerbird, and whether postprocessing can be distributed or not, and how come some people keep removing my notes?, and how such-and-such a rule should be "interpreted", and how to install guiguts, and what spellchecker to use, and eighty-seven possible options on that spellchecker, and how to handle greek, and whether to use utf-8, and how you hurt my feelings, now my feelings are hurt, and what are you gonna do about my hurt feelings, and why did you hurt my feelings, and now i feel so _hurt,_ and why are you such a mean person anyway?, meanie!, and how come my book is stuck in the p2 queue again, and what _do_ you do about a problem like maria, anyway? d.p. people are like a family, and that's how they fight, just like a family, over everything, for years and years... and then they try to kiss and make up, and tell everyone how much they love them, and how much they love d.p., how important it is to them, and the world at large, and how much they love each other, and... back to fighting... so no, it wasn't easy, not really. love and pain. family. and now a hardening of the arteries that makes it very difficult to improve because of a resistance to change. social connectivity ain't always all it's cracked up to be. *** now... where was i? oh yeah, so now, the pendulum swung the other way, because we found out that d.p. actually _would_ work, and now people think it's the _only_ way that can work. which is stupid, of course. i mean, seriously, kiddies, we knew how to spellcheck before distributed proofreaders came along... 
really... and it's _possible_ to spellcheck a whole book. really. it's not _required_ that you have a hundred people doing two pages a piece. seriously, kids, it's not... but that's what they know, and some people believe that what they know is the only thing that can possibly exist. and they refuse to peek out of their shell, because they don't wanna have that stupid belief become invalidated. so now we have way too many people who "doubt" that a book-level approach to digitization will actually work. but, as i pointed out, up at the top, there are many parts of the digitization process that are -- even now -- done by one individual working alone, on a book-wide basis... even roger, with his focus on the page-by-page method, is -- quite ironically -- doing much of his books alone... he chooses books, he scans them, he mounts the scans, sometimes he has people help him with the _proofing_ and the _formatting_, and he has a couple dependable smoothreaders too, but he does all the postprocessing by himself, and he submits the products to d.p. himself. heck, folks, he even writes his own software for the job. and once he learns how easy it is to do the whole book, with the right tool, watch how he will go off on his own. he's done about 600 books with the help of his friends. i won't be surprised if he does another 600 by himself. by the way, i got christmas greetings from nick hodson, a 77-year-old from england who's digitized 750 books. all by himself... using software tools he wrote himself... and for many years now, nick has used _text-to-voice_ as one of his best proofing tools, and he swears by it... i've played around with it myself, and yes, _it_works_. it is also convenient as can be, with iphone and ipad... i don't mention it, since you guys don't deserve it, but mark my words, that's how it'll be done in the future. with siri and mobile, you're gonna see that a move to voice-oriented systems will become very accelerated. 
in the meantime, you will still be hung up on things like "book-level" versus "page-level", which mean _nothing_. those two methodologies are _not_ mutually exclusive. they work _with_ each other, like a hand in a glove, in a synergistic manner, so you gotta be able to do both. at the same time. using the same tool. figure it out... *** ironically, the series i interrupted in order to pursue this current thread was "a review of digitization tools", where i was discussing points that are quite relevant. specifically, the ideal tool will run online and offline, and it will be designed so that the job can be done by one person all alone, or multiple people collaborating. so, for instance, a python script would fit the bill. a web-app _might_, but only providing that it can be run offline, without any access to the internet. (or an offline alternative app would be acceptable, if functionality is substantially identical, or better.) the truth is simple: certain tasks need to be done, in order for a book-digitization to be "successful"... and it doesn't matter if those tasks are done by one person or by multiple people, online or off. as long as they're done, the book gets digitized. for instance, since roger brought it up recently, one of the tasks is "checking the paragraphing". sometimes o.c.r. gets the paragraphs wrong... it either incorrectly splits a paragraph into two (or more) paragraphs with improper blank lines, or incorrectly joins two paragraphs into one by improperly deleting the blank line between 'em. so you need to look at each and every paragraph in your text, and compare it to the relevant scan, to make sure there were no paragraphing errors. it doesn't matter if you do that offline or online. and it doesn't matter if one person or many do it, as long as every paragraph on every page is done, and done correctly. 
now, from a practical position, if you have multiple people doing different pages, then you must _coordinate_ their efforts, so as to make sure that every page actually gets checked... but as long as it _does_, that's what truly matters. i don't know if you want to call that "page-level" or "book-level", but the label don't mean much. it is what it is. you have to check the whole book, and you do the check against each separate scan. call it whatever you like, it doesn't mean a thing. and that's the way almost all of these tasks are. what matters is that they _get_ done, not _how_. the other important consideration here, though, is that they are done _correctly_. when you have one person looking at a page, and that is _all_, the odds are that, over the course of a full book, an average person is gonna make a few mistakes. to err is human. even on an easy task. especially! so -- if you want to ensure each page is right -- you're gonna need to have more than one check. whether it's from a separate round of volunteers, or the beta smoothreaders, or your end-users, or even you yourself alone checking it over and over, that's what it's gonna take to have 100% accuracy. and even then ya might not reach that perfection. there's no need to be pessimistic, however, since there is another factor that has an impact here... namely, we can often code programmatic checks that offer varying degrees of certainty that a task has been completed successfully. indeed, we can sometimes even have these tasks do a "first pass" on the text, and then simply verify what it's done. let's take this paragraphing task as an example... first of all, there's a very simple find operation to help find some improperly-split paragraphs: two-linebreaks-followed-by-a-lowercase-letter. boom! odds are you just located a few problems; not all the improper-splits, but _some_ of them. now improve the routine, so you're searching for a-lowercase-letter-followed-by-two-linebreaks. boom! 
you might've located a few more glitches. (if you'd done this routine before the other one, it would have found some of the same problems. but both of them can detect some unique cases.) you can improve the routine further, if you work. or you can even try a whole different approach. that's what i've done, to check the paragraphing. to test my routine, i eliminate all the blank lines from an o.c.r. file, and have it put them back in. it does amazingly well. i'll show you sometime. so, to cut a long story short, when roger implied that this job needed to be done by human eyes, he was wrong... the computer can do very well, providing you have programmed it intelligently. heck, truth be told, even dumb programming does a fairly good job. if you put a blank line after every sentence-terminated line followed by a capital letter, you get a lot of them right. despite an excess of cases of improper splits, a big majority of the paragraphs are correct. and improving that routine is also fairly easy, until you've got 98% of the paragraphs right. ain't gonna tell you the secret of the last 2%, but i'll show you an app proving it is doable. if you ask nice. otherwise just take my word. it ends up that -- for most of the tasks that need to be done -- you can program a test, to see whether you think the file is adequate. i've already discussed code for doublequotes, but that's yet another perfect example of this. again, just so everyone is perfectly clear here, there's always a need for good smoothreaders. every digitizer is gonna make some mistakes... we march to perfection, we don't just "arrive"... and we're never really sure we've gotten there. we always expect "at least 1 more" error-report. but... there are book-digitization tasks which need to be accomplished... to say that book is "finished", you must be able to certify -- with some degree of certainty -- that those tasks were done. 
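The paragraphing checks described above can be sketched in a few lines of Python. This is a minimal illustration of the two find operations and of the "dumb" re-paragraphing rule only. It is not the routine the message alludes to, and it makes no attempt at the last 2%. The function names and the exact definition of a "sentence-terminated line" are assumptions.

```python
import re


def splits_before_lowercase(text):
    """Offsets of blank lines followed by a lowercase letter.

    A paragraph that starts in lowercase was probably split by mistake.
    """
    return [m.start() for m in re.finditer(r"\n\n(?=[a-z])", text)]


def splits_after_lowercase(text):
    """Offsets of lowercase letters sitting right before a blank line.

    A paragraph that ends without terminal punctuation is suspect too.
    """
    return [m.start() for m in re.finditer(r"[a-z](?=\n\n)", text)]


# Assumed definition: a line "ends a sentence" if it finishes with . ! or ?
# optionally followed by closing quotes or a closing parenthesis.
SENTENCE_END = re.compile(r"""[.!?]["')]*$""")
STARTS_WITH_CAPITAL = re.compile(r"""["'(]*[A-Z]""")


def naive_reparagraph(lines):
    """Re-insert blank lines into text whose blank lines were all removed.

    A blank line goes after every sentence-terminated line that is
    followed by a line starting with a capital letter. This over-splits,
    as the message says, but it gets many paragraph breaks right.
    """
    out = []
    for cur, nxt in zip(lines, lines[1:] + [""]):
        out.append(cur)
        if SENTENCE_END.search(cur.rstrip()) and STARTS_WITH_CAPITAL.match(nxt.lstrip()):
            out.append("")
    return out
```

Neither find operation can see two paragraphs that were wrongly joined in mid-line, so a comparison against the scan is still needed for those.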
how those tasks get accomplished -- whether by one person or many, offline or online -- doesn't matter. what matters is that they got done, and that we can verify they're done. that's the tool that we need to have, to do the job, a tool that allows us to 1. do the tasks, and 2. verify they're done. you can make up terminology if you like, and draw up some graphs in your mind which are as meaningless as the pictures your kid colored stuck to your fridge-door (albeit less captivating than their pictures), but when you boil digitization to its essence, what you find is "there are tasks to be done." kudos to roger for building software to help. and to don too, for ongoing advocacy of the importance of "checklists" to the workflow... and let us not forget that don built an app. once again, folks, you should get "twister". > http://code.google.com/p/dp50/downloads/list yeah, it's almost 2 years old now, and it has lots of features that are "not implemented", but you still get a feel for "how it should be". and you reg-ex fans will be happy to know that you can install your own set of reg-ex, and have the app use it. (i couldn't get that particular thing to work, but maybe you will.) but as you use "twister", ask yourself whether it's working at the "book" or the "page" level. if you can come up with an answer, tell me... i also have a brand new thing to show you, but i'll save that for a later "pontification"... have a nice day. -bowerbird

1 0

book vs. page proofing
by Roger Frank 28 Dec '11

28 Dec '11

I've looked at this some more and I've come to the conclusion that working at the page-level is sufficient but inefficient, and that working at the book-level is efficient but insufficient. Here's what I base that on.

In my latest experiment on a new book, I concatenated all the text files first and did the global corrections. A lot of improvements were made quickly. When I was convinced I had done all I could, I burst that back into individual pages and loaded it into PPE, the page-at-a-time editor which presents the image, an edit window, and an analysis window simultaneously.

I found two types of errors that I could not have found at the book level and which would have also likely been missed by the smoothies.

The first was where the text of two paragraphs was combined into one, without an intervening blank line. This happened in the middle of two separate pages. Abbyy doesn't preserve the start-of-paragraph indent, so especially if the first paragraph ends its last line near the right margin, this error is invisible unless you are looking at the image.

The other is words incorrectly marked as italic. Typically they are short words ("I" and "we" most commonly), but not always. If you can't see the image, you can't expect to get these right. I had to undo about 8 of these.

For the next book, likely I'll do this same procedure: full text and then page at a time in PPE. The first makes it easier; the second makes it right.

--Roger
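The concatenate-then-burst step can be illustrated with a short Python sketch. This is not PPE or any of Roger's actual tooling. The separator format, the function names, and the use of a dict keyed by page file name are assumptions made for the example.

```python
import re

# Hypothetical separator line written before each page's text.
SEPARATOR = "-----File: {name}-----"
SEPARATOR_RE = re.compile(r"^-----File: (.+?)-----$", re.MULTILINE)


def concatenate(pages):
    """Join page texts (dict of name -> text) into one book-level string."""
    return "".join(
        SEPARATOR.format(name=name) + "\n" + text + "\n"
        for name, text in pages.items()
    )


def burst(book):
    """Split a concatenated book back into a dict of name -> page text.

    Assumes no page contains a line that looks like the separator.
    """
    parts = SEPARATOR_RE.split(book)
    # parts is ['', name1, chunk1, name2, chunk2, ...]; each chunk carries
    # the newline after its separator and the newline added on joining.
    return {name: chunk[1:-1] for name, chunk in zip(parts[1::2], parts[2::2])}


pages = {"001.txt": "Betty Lee\n\nChapter I", "002.txt": "Mary Emma could hot go\n"}
book = concatenate(pages)
book = book.replace("could hot", "could not")  # one global, book-level fix
fixed_pages = burst(book)
```

Here the global correction is made once on the joined text, and `fixed_pages` can then be checked page by page against the images.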

2 1