ok, i went back and did a better job on "betty lee",
and then compared the product to the original o.c.r.
well, what i _actually_ have is what i scraped from his
editor demo, and some of that text had been edited...
but i got to it pretty early, so i think i minimized that.
(it would be nice if roger put out his _actual_ o.c.r.
it would also be great if roger put out his .rtf copy.
but i'm not sure how interested he is in this stuff.)
at any rate, i will shy away from hard numbers, and
just report the pattern of results, which is very clear.
in general, i found this digitization looks exactly like
the dozens of other ones that i have reported on here.
as usual, the o.c.r. was good. quite surprisingly good.
(i guess it's time that we should no longer be surprised.
still, these scans were _murky_; but no, it didn't matter.)
there were right around 256 errors, on a 256-page book,
on the raw o.c.r., which is pretty much what you'd expect.
considering the number of lines -- over 7000 of them --
those 256 errors constitute an accuracy rate of over 95%.
roger hasn't released his text yet, so i can't do a compare,
which will probably reveal more errors that i missed, but
even if i missed twice as many as i found (quite unlikely),
the accuracy of the raw o.c.r. is still gonna be about 90%.
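here's that arithmetic as a quick python sketch. it's for
illustration only: it assumes each error sits on its own line,
it treats "over 7000" as a flat 7000, and the function name
is made up for this example.

```python
def line_accuracy(errors, lines):
    """fraction of lines with no error, assuming one error per bad line."""
    return 1 - errors / lines

found = 256     # errors found in the raw o.c.r.
lines = 7000    # "over 7000" lines, rounded down (an assumption)

# accuracy if the 256 found errors are all there is
as_found = line_accuracy(found, lines)

# pessimistic case: missed twice as many as were found,
# so the total is 256 found + 512 missed = 768
worst_case = line_accuracy(found + 2 * found, lines)

print(f"as found:   {as_found:.1%}")
print(f"worst case: {worst_case:.1%}")
```

since the real line count is higher than 7000, both figures
would come out a little better than what this prints.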
***
i spent more time finding and fixing errors in this book
than i wanted to, more time than i would have otherwise,
because i was using it as the content for a new program.
so i can't give a good estimate of the cost-benefit ratio
of the time i spent, but i can say that i did indeed catch
a pretty good percentage of the errors on my first pass.
i made errors, lots of them! -- many more than the 22
which roger reported a while back -- but i can also say
that i woulda caught almost all of the errors originally
_if_ i had done a careful job, and done it more fully...
i did a rush job, because i didn't know how fast roger
was going to act, and i wanted to get my stuff out first,
so roger and everyone else would know that i _hadn't_
used his results to create mine; i did mine on my own.
and i wasn't thorough, because i didn't know if people
would care. heck, i didn't even know if _i_ would care.
but the project ended up being fun. it was _nostalgic_,
coming in at the end of the year, plus i hadn't done an
analysis of a digitization in a long time. and i'm rarely
able to assess my own performance, so that's a blast...
maybe i'll make a list of the errors later, for you to see.
or maybe not. either way, the results are unmistakable.
the o.c.r. was good. many of the "errors" were due to
scan-spots -- which o.c.r. is duty-bound to report --
or outright errors in the p-book. it was full of errors!
this is one of those e-books that is _more_accurate_
-- out of the chute -- than the p-book it came from.
and, to repeat, this result is _the_typical_finding_...
across the board, i have demonstrated, over and over,
that the o.c.r. is good, and the vast majority of errors
can be fixed by using extremely simple preprocessing,
the type that you can do in one hour for a simple book.
correcting the o.c.r. -- and even doing the formatting --
for a book is easy. it doesn't take rounds and rounds of
volunteers wasting time and energy poring over a book.
all it takes is one or two people using a good tool, and
a couple of smoothreaders to catch the stealthy stuff...
if you want to split up the job, you can have 10 or 20
people using that same "good tool" to do the job, and
10 or 20 smoothreaders, probably giving better results.
but even one person and one smoothreader can do fine.
and if you solicit error-reports and act on 'em diligently,
you can execute a very smart march toward perfection...
***
the things i just said apply to d.p. and p.g., obviously,
but they also apply to some points roger has made...
for instance, roger said this:
> I've heard from some people that solo process
> that actually like to go through the book
> page at a time because they enjoy
> following the story as they go, which
> doesn't happen when someone is
> in production mode
> at the book level.
now, first of all, of course, this is another case where
roger exhibits fundamental misunderstanding about
the essence of "production mode at the book level"...
rather, it is page-oriented systems like the one at d.p.
which make it difficult for people to "follow the story".
in a system like the ones i make, the entire book is
available to a person at all times, so they can surely
"follow the story" if they choose to read it in order.
the main difference is that, in a book-oriented system,
you will _begin_ by cleaning up the big errors first --
the ones that are simple for the system to auto-detect
-- so that you can then "settle in" to read each page,
during which process you can look for _subtle_ errors,
without being distracted by a need to fix any big ones,
as that does indeed detract from "following the story"...
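to give a feel for what "simple for the system to auto-detect"
might mean, here's a minimal python sketch. it is _not_ the code
from any actual tool; the four checks, the names, and the sample
text are all assumptions, picked just to illustrate the idea of
sweeping the whole book for big errors before you settle in to read.

```python
import re

# hypothetical checks; a real tool would have many more, tuned per book
CHECKS = [
    ("digit inside a word", re.compile(r"[A-Za-z]\d|\d[A-Za-z]")),
    ("space before punctuation", re.compile(r"\s[.,;:!?]")),
    ("capital inside a word", re.compile(r"\b[a-z]+[A-Z]")),
    ("stray symbol", re.compile(r"[|~^\\]")),
]

def flag_suspects(text):
    """return (line number, reason, line) for every check a line trips."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for reason, pattern in CHECKS:
            if pattern.search(line):
                hits.append((number, reason, line))
    return hits

# made-up sample, not text from the actual book
sample = (
    "Betty ran to the d0or .\n"
    "She smiled at her friend.\n"
    "the gIrls laughed | loudly"
)

hits = flag_suspects(sample)
for number, reason, line in hits:
    print(number, reason, "->", line)
```

checks this crude will throw false alarms (things like "1st",
or a deliberate ellipsis after a space), so a person still has to
look at each hit. the point is just that the whole book gets
swept at once, instead of one page at a time.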
in a page-oriented system, an absence of preprocessing
means you might need to fix a bug on nearly every page,
and that hurts both your accuracy _and_ comprehension.
so roger has not just "failed to get things right" here...
he has actually gotten it _completely_backward_, sadly.
and, like i said, he's one of the smarter guys here. sadly.
***
if anyone wants to see my analysis of my performance
on "betty lee", let me know. or view the product online.
> http://z-m-l.com/go/betle/betle.zml
> http://z-m-l.com/go/betle/betlep123.html
oh yeah, i almost forgot to tell you...
i've programmed yet another book-digitization editor.
once again, it's in python, like the one i built recently.
but it's rather full-fledged, like the one i built in perl,
back in 2010, when i was working on roger's "sitka"...
it's not all finished yet, but you can look at it here:
> http://zenmagiclove.com/bettyedit.py
that is targeted at the ipad right now, but i can also
make it work on an iphone by sizing the text smaller.
it's _increasingly_ important to offer people the chance
to contribute to your digitization project when they are
using a mobile form-factor, like the ipad or the iphone.
***
have a nice day.
-bowerbird