January 2012 - gutvol-d - lists.pglaf.org

pontifications from mount high horse -- #1500
by Bowerbird＠aol.com 02 Jan '12

remember i said that, in the early days of d.p., many people doubted that it would even work, because too many cooks would spoil the soup?

the same scenario is now playing again at d.p., as some people there believe postprocessing can't be distributed, giving similar reasoning. meanwhile, other people -- including don -- think it'd be wise to test to see if distribution will work well enough to clear a bad backlog.

plus, as i said, i believe that digitization will increasingly be done "for the love of a book", by people who'll do that book "end-to-end". while some of them -- like roger frank, or nick hodson -- will do hundreds of books, others might only do one or two, or a few...

whatever the case, it would be good if they had a website that guides them through it, and helps 'em attain a good level of quality. this project, the one i started christmas day, might eventually grow into that kind of site:

> http://zenmagiclove.com/bkdig.py

because although some people might get a sense of satisfaction by finishing a page, as roger recently argued, i think more will seek that feeling by finishing a whole book, especially a book that's meaningful to them.

as one aspect of "book-dig", i built an editor:

> http://zenmagiclove.com/bettyedit.py

the name is because i used the "betty lee" book that we've been using for sample content lately.

like the rest of the "book-dig" system, bettyedit is still "under construction", and thus a bit raw (e.g., edits aren't "sticky" yet, but will be soon). but as an example of the type of help it'll give, i've currently coded it to render italic highlights (in magenta) and doublequote-checks (in blue), with due apologies to the color-blind out there.

cleaning the text of books is a lot of fun for me. and coding apps to do that job is even more fun.

***

have a nice year in 2012. i will.

-bowerbird
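as a rough illustration of the two checks described above (italic highlights and doublequote-checks), here is a minimal python sketch. it is not bettyedit's code. it assumes italics are marked with _underscores_, that quotes are straight doublequotes, and that paragraphs are separated by blank lines; the function names are made up for this example.

```python
import re

# assumption: italics are written as _underscore-delimited_ spans
ITALIC = re.compile(r"_([^_\n]+)_")

def find_italics(paragraph):
    """Return the text of each _italic_ span found in a paragraph."""
    return ITALIC.findall(paragraph)

def has_unbalanced_doublequotes(paragraph):
    """Flag a paragraph containing an odd number of straight doublequotes."""
    return paragraph.count('"') % 2 == 1

def check_book(text):
    """Yield (paragraph_number, italics, quote_flag) for each paragraph."""
    # assumption: paragraphs are separated by one or more blank lines
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    for number, paragraph in enumerate(paragraphs, start=1):
        yield (number,
               find_italics(paragraph),
               has_unbalanced_doublequotes(paragraph))
```

an odd doublequote count is only a flag for a human to look at, not proof of an error, since older books often open each paragraph of a long speech with a quote mark and close only the last one.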

pontifications from mount high horse -- #1498
by Bowerbird＠aol.com 02 Jan '12

ok, i went back and did a better job on "betty lee", and then compared the product to the original o.c.r.

well, what i _actually_ have is what i scraped from his editor demo, and some of that text had been edited... but i got to it pretty early, so i think i minimized that.

(it would be nice if roger put out his _actual_ o.c.r. it would also be great if roger put out his .rtf copy. but i'm not sure how interested he is in this stuff.)

at any rate, i will shy away from hard numbers, and just report the pattern of results, which is very clear.

in general, i found this digitization looks exactly like the dozens of other ones that i have reported on here.

as usual, the o.c.r. was good. quite surprisingly good. (i guess it's time that we should no longer be surprised. still, these scans were _murky_; but no, it didn't matter.)

there were right around 256 errors, on a 256-page book, on the raw o.c.r., which is pretty much what you'd expect. considering the number of lines -- over 7000 of them -- those 256 errors constitute an accuracy rate of over 95%.

roger hasn't released his text yet, so i can't yet do a compare, which will probably reveal more errors that i missed, but even if i missed twice as many as i found (quite unlikely), the accuracy of the raw o.c.r. is still gonna be about 90%.

***

i spent more time finding and fixing errors in this book than i wanted to, more time than i would have otherwise, because i was using it as the content for a new program.

so i can't give a good estimate of the cost-benefit ratio of the time i spent, but i can say that i did indeed catch a pretty good percentage of the errors on my first pass.

i made errors, lots of them! -- many more than the 22 which roger reported a while back -- but i can also say that i woulda caught almost all of the errors originally _if_ i had done a careful job, and done it more fully...
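the accuracy figures quoted earlier in this message can be checked with a few lines of python. this is only a sketch of the arithmetic: it assumes accuracy is counted per line, that each error falls on a different line (the worst case), and it uses 7000 as the line count although the message says "over 7000". the function name is made up for this example.

```python
def line_accuracy(errors, lines):
    """Share of error-free lines, assuming every error is on its own line."""
    return 1 - errors / lines

found = 256    # errors found in the raw o.c.r.
lines = 7000   # "over 7000" lines, so this is a floor

# errors actually found
print(f"{line_accuracy(found, lines):.1%}")      # 96.3%

# pessimistic case: twice as many missed as found, so 3x the total
print(f"{line_accuracy(found * 3, lines):.1%}")  # 89.0%
```

both results line up with the "over 95%" and "about 90%" figures in the message.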
i did a rush job, because i didn't know how fast roger was going to act, and i wanted to get my stuff out first, so roger and everyone else would know that i _hadn't_ used his results to create mine, i did mine on my own.

and i wasn't thorough, because i didn't know if people would care. heck, i didn't even know if _i_ would care.

but the project ended up being fun. it was _nostalgic_, coming in at the end of the year, plus i hadn't done an analysis of a digitization in a long time. and i'm rarely able to assess my own performance, so that's a blast...

maybe i'll make a list of the errors later, for you to see. or maybe not. either way, the results are unmistakable.

the o.c.r. was good. many of the "errors" were due to scan-spots -- which o.c.r. is duty-bound to report -- or outright errors in the p-book. it was full of errors! this is one of those e-books that is _more_accurate_ -- out of the chute -- than the p-book it came from.

and, to repeat, this result is _the_typical_finding_...

across the board, i have demonstrated, over and over, that the o.c.r. is good, and the vast majority of errors can be fixed by using extremely simple preprocessing, the type that you can do in one hour for a simple book.

correcting the o.c.r. -- and even doing the formatting -- for a book is easy. it doesn't take rounds and rounds of volunteers wasting time and energy poring over a book. all it takes is one or two people using a good tool, and a couple of smoothreaders to catch the stealthy stuff...

if you want to split up the job, you can have 10 or 20 people using that same "good tool" to do the job, and 10 or 20 smoothreaders, probably giving better results. but even one person and one smoothreader can do fine.

and if you solicit error-reports and act on 'em diligently, you can execute a very smart march toward perfection...

***

the things i just said apply to d.p. and p.g., obviously, but they also apply to some points roger has made...
for instance, roger said this:

> I've heard from some people that solo process
> that actually like to go through the book
> page at a time because they enjoy
> following the story as they go, which
> doesn't happen when someone is
> in production mode
> at the book level.

now, first of all, of course, this is another case where roger exhibits fundamental misunderstanding about the essence of "production mode at the book level"...

rather, it is page-oriented systems like the one at d.p. which make it difficult for people to "follow the story".

in a system like the ones i make, the entire book is available to a person at all times, so they can surely "follow the story" if they choose to read it in order.

the main difference is that, in a book-oriented system, you will _begin_ by cleaning up the big errors first -- the ones that are simple for the system to auto-detect -- so that you can then "settle in" to read each page, during which process you can look for _subtle_ errors, without being distracted by a need to fix any big ones, as that does indeed detract from "following the story"...

in a page-oriented system, an absence of preprocessing means you might need to fix a bug on nearly every page, and that hurts both your accuracy _and_ comprehension.

so roger has not just "failed to get things right" here... he has actually gotten it _completely_backward_, sadly.

and, like i said, he's one of the smarter guys here. sadly.

***

if anyone wants to see my analysis of my performance on "betty lee", let me know. or view the product online.

> http://z-m-l.com/go/betle/betle.zml
> http://z-m-l.com/go/betle/betlep123.html

oh yeah, i almost forgot to tell you...

i've programmed yet another book-digitization editor. once again, it's in python, like the one i built recently. but it's rather full-fledged, like the one i built in perl, back in 2010, when i was working on roger's "sitka"...
it's not all finished yet, but you can look at it here:

> http://zenmagiclove.com/bettyedit.py

that is targeted at the ipad right now, but i can also make it work on an iphone by sizing the text smaller.

it's _increasingly_ important to offer people the chance to contribute to your digitization project when they are using a mobile form-factor, like the ipad or the iphone.

***

have a nice day.

-bowerbird

Happy Public Domain Day!
by Wallace J.McLean 01 Jan '12

http://t.co/As075vYh

pontifications from mount high horse -- #1501
by Bowerbird＠aol.com 01 Jan '12

it's time for e-books to leave their incunabula period, and nobody else is better suited to removing their swaddling...

tomorrow...

tonight, i'm quaffing scotch through a straw...

***

have a happy new year.

-bowerbird

pontifications from mount high horse -- #1498
by Bowerbird＠aol.com 01 Jan '12

don said:

> Al, Roger, Bird, anyone else,
> For your list of scannos,
> do you examine every instance
> of every word on the list?
> Many of those words
> would be more frequently correct
> than a misscan.

i recommend you not even consider testing for a specific scanno unless you suspect that it will return more hits than false-alarms.

then i would collect data from every test, to _ensure_ that it does, and stop using any that do not.

that's a very high bar to clear, but if a test returns more false-alarms than hits, it's wasting my time.

it is more important and valuable to build an infrastructure which acts quickly and efficiently on error-reports from your community than to waste the time and energy of your digitization volunteers.

sadly, p.g. never gave itself the opportunity to learn that lesson...

-bowerbird
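the rule described above (collect data from every scanno test, and drop any test that returns more false-alarms than hits) could be sketched in python roughly like this. it is an illustration under assumptions, not a description of any existing tool: the class name, its methods, and the sample scanno words are all hypothetical.

```python
from collections import defaultdict

class ScannoLog:
    """Tally hits and false-alarms for each scanno test."""

    def __init__(self):
        self.counts = defaultdict(lambda: {"hits": 0, "false_alarms": 0})

    def record(self, test, was_real_error):
        """Log one flagged instance after a human has judged it."""
        key = "hits" if was_real_error else "false_alarms"
        self.counts[test][key] += 1

    def worth_keeping(self, test):
        """Keep a test only while its hits outnumber its false-alarms."""
        tally = self.counts.get(test)
        if tally is None:
            # no data yet: the test stays in on suspicion alone
            return True
        return tally["hits"] > tally["false_alarms"]

    def prune(self, tests):
        """Return the tests that still clear the bar, in their original order."""
        return [t for t in tests if self.worth_keeping(t)]
```

a usage example, with made-up tallies for three made-up tests:

```python
log = ScannoLog()
for verdict in (True, True, True, False):
    log.record("tbe", verdict)      # 3 hits, 1 false-alarm
for verdict in (True, False, False, False, False, False):
    log.record("he", verdict)       # 1 hit, 5 false-alarms

print(log.prune(["tbe", "he", "arid"]))   # ['tbe', 'arid']
```

a tie counts as a failure here, since the bar stated above is _more_ hits than false-alarms.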
