I am busy, but I do read these. Sadly I do not have enough time to actually go through and try to duplicate the examples, so no, I did not catch the errors. But if you want to continue, I'll keep reading, and if work ever slows down for me I will certainly go through the series and try to learn.

Thanks for the posts.

--
b

On Oct 3, 2011, at 4:08 PM, Bowerbird@aol.com wrote:

this thread explains how to digitize a book quickly and easily.
search archive.org for "booksculture00mabiuoft" for the book.

***

i said:
>   i left them in, because i was curious to see if anyone noticed.
>   evidently, nobody paid enough attention to notice.  oh well...

of course, i also presented the _errors_ incorrectly, with the
"error" and "fixed" labels reversed, and nobody commented
about that, either, so it appears nobody is paying attention.

but that's ok, folks.  i keep myself mightily amused with it all.

this is just good fodder for those future cyber-archeologists,
so they realize we weren't _all_ stupid idiots about e-books...

for the record, i've appended the correct version of the chart.

***

now, let's have a little more fun with the cleaning dimension.

if you're new here, you might've noticed something "curious"
about this last stage of the clean-up job, where i _evaluated_
how well we'd done by comparing the text to a "clean" copy...

and you might have thought to yourself, "well, that's great, but
the problem is that you usually don't _have_ a clean copy, which
is why you are cleaning the text to begin with, so that does not
work as some kind of 'general solution' for us".  and you would
be right.  or at least half-right.  for at least as far as you went...

but there are some interesting wrinkles just beneath the surface.

because it ends up that we can use another _incorrect_ copy of
the digitized text to help ourselves find the errors in our work;
i.e., the "comparison" copy doesn't necessarily have to be clean.

here's the strategy: you compare the second digitization, which
let us say (for the purpose of illustration) is another set of o.c.r.,
with the current digitization, on a line-by-line basis throughout.

for each pair of compared lines, there are 5 possibilities:
1.  the lines match, and each of them is correct.
2.  the lines match, and each of them is incorrect.
3.  the lines don't match, and your current copy is correct.
4.  the lines don't match, and the second digitization is correct.
5.  the lines don't match, and neither digitization is correct.

if o.c.r. is generally correct, and we know that yes, it is indeed,
the first scenario is the one that will happen most of the time...

for the time being, we'll put aside the second scenario.

for the remaining 3 scenarios, we can compare the line in our
current digitization to the pagescan, and correct it if required
(which will be in the cases which fall under scenarios 4 and 5).

in the case of scenario 3, we'll look at the line, see it's correct,
and leave it as it is.  it was a little work, but no harm was done.

the secret of this approach is that a _difference_ between the
two lines in our pair will flag that line so that we examine it...
whether it's right or wrong is immaterial, since we _look_ at it.

and since the number of nonmatching pairs of lines is _small_
(relative to the number that match because they're both correct)
this approach is relatively good at finding the _incorrect_ lines...
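
a minimal sketch of that flagging step, in python. it assumes both
digitizations are already split into lists of lines with the same
line breaks; the function name and the two toy lines (adapted from
the chart in the p.s.) are only for illustration, not part of any
actual tool used in this thread.

```python
from itertools import zip_longest

def flag_mismatches(current_lines, second_lines):
    """Return (line_number, current, second) for every pair that differs.

    Matching pairs (scenarios 1 and 2) are skipped; every non-matching
    pair (scenarios 3, 4 and 5) is returned, to be checked by eye
    against the pagescan.
    """
    flagged = []
    pairs = zip_longest(current_lines, second_lines, fillvalue="")
    for number, (ours, theirs) in enumerate(pairs, start=1):
        # ignore trailing whitespace, which is not a real difference
        if ours.rstrip() != theirs.rstrip():
            flagged.append((number, ours, theirs))
    return flagged

# toy data: line 1 differs between the copies, line 2 matches
current = ["The latest of them. Count Tolstoi's",
           "all they saw and knew a part of them-"]
second = ["The latest of them, Count Tolstoi's",
          "all they saw and knew a part of them-"]

for number, ours, theirs in flag_mismatches(current, second):
    print(number, repr(ours), repr(theirs))
```

the code never decides which copy is right. it only picks out the
lines a human should look at, which is the whole point.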

indeed, the only case that gives us any pause at all is scenario 2,
where the lines match, but each of them happens to be incorrect.

now, i can't blame you if you think scenario 2 will be common...

it kinda makes sense that if one o.c.r. makes an error on a line,
a second o.c.r. pass would make that identical error again...
and in such a case, they'd both match, but they'd both be wrong.

but how about if we do an empirical test of that assumption, eh?
indeed, how about doing _lots_and_lots_ of empirical tests of it?

i've done just that -- lots of empirical tests of that assumption --
and my research has shown very strongly that it is _not_ the case.

and i can demonstrate it once again, by using our current book...

***

if you look, you'll see there's another digitization of this book
done by internet archive.  it's from the university of california,
whereas the first copy was scanned at the university of toronto.

here's the term to search for in order to find the second copy:
>   booksandculture00mabirich

here's the landing page for it:
>   http://www.archive.org/details/booksandculture00mabirich

and here's the o.c.r. text for it:
http://ia700303.us.archive.org/10/items/booksandculture00mabirich/booksandculture00mabirich_djvu.txt

now, there are a lot of things to dislike about that second copy.

first and foremost, you'll notice that it's missing its em-dashes.
i'm serious; the em-dashes are missing in action, gone entirely!

it's _also_ missing many of the end-of-line hyphens, and some
apostrophes as well, but let's just focus on the em-dashes now.

look at page 30 to see all three of the missing-in-action cases:
>   http://z-m-l.com/go/mabie/mabiep030.jpg

this is a very common problem with books from internet archive.
it's a serious problem -- a fatal problem, really -- and yet it is
a problem that has infested far too many books in their library...

it's downright embarrassing...  i would be _ashamed_ to put out
books that were missing their em-dashes like that, but evidently
the people at internet archive do not have any problem with it...

because i informed them about these glitches many _years_ ago.

and -- when they ignored me -- i _continued_ to inform them...

and when they _still_ ignored me, i finally started to pester them.
and i pestered them for _years_, until finally they agreed to _fix_
those em-dash problems...  and so i waited for them to do that...

and i waited.  and i waited.  and i waited some more.  until finally,
i got tired of waiting, and i asked 'em why they hadn't fixed them.

and i was ignored.  and brushed off.  and they called me names,
and asked me why _i_ was being so difficult.  they blamed _me_!
hey, i wasn't the person who was losing their damn em-dashes...

so i continued to pester them...  until finally they _banned_ me...

and now, years and years and years later, some of their e-texts
-- like this one we're viewing right now -- have no em-dashes...

they should be ashamed of themselves.  they should be disgraced.

but instead, they are putting out press-releases, and trumpeting
how wonderful they are, now that kindle is grabbing their books.

evidently, they don't expect those kindle people to actually _read_
their e-books, and discover what a big load of crap their o.c.r. is.

and i would say it directly to their face:  you should be ashamed.

***

anyway...

though i wouldn't advocate using this second copy of the book
for a full-on comparison, since you'd have to "work around" all
the missing characters, we can certainly take a quick look at the
16 lines in our first book which had errors in them, to evaluate...

we find 12 of the 16 -- _75%_ -- were correct in the second book.

here are those 12:
>   plete expressions, in that concrete
>   The latest of them, Count Tolstoi s
>   The reality of this element of per
>   all they saw and knew a part of them
>   process, and culture and genius stand
>   there has always been, not only a
>   stories and constantly touched upon
>   exploration, travel, and discovery ; he
>   There are, it is true, a few men and
>   from the days of the earliest Greek
>   sion, thought, impulse, which never
>   grance, and growth which lie enfolded
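
those 12 lines also show the quirks of the second copy (no hyphen
at the end of a line, "Tolstoi s" for "Tolstoi's", a space before
the semicolon), so a plain string comparison would flag them for
the wrong reasons. here is one possible "work around", sketched in
python. the normalization rules are my assumption, drawn only from
the lines shown above; the missing em-dashes would need similar
treatment, which is not attempted here.

```python
import re

def normalize(line):
    """Reduce a line to a form that ignores the second copy's known quirks."""
    line = line.replace("'", " ")              # dropped apostrophes show up as spaces
    line = re.sub(r"-$", "", line.rstrip())    # dropped end-of-line hyphens
    line = re.sub(r"\s+([;:,.!?])", r"\1", line)  # stray space before punctuation
    return re.sub(r"\s+", " ", line).strip()   # collapse leftover whitespace

def same_after_normalizing(ours, theirs):
    """True if two lines agree once the known quirks are ignored."""
    return normalize(ours) == normalize(theirs)

print(same_after_normalizing("The latest of them, Count Tolstoi's",
                             "The latest of them, Count Tolstoi s"))
print(same_after_normalizing("The latest of them. Count Tolstoi's",
                             "The latest of them, Count Tolstoi s"))
```

the first call compares the corrected line to the second copy and
the lines agree; the second call uses the uncorrected line, and
the period-for-comma error still shows up as a difference.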

in other words, most of the time, the two sets of o.c.r. did _not_
have matching errors.  one got it right -- the other got it wrong.

so this comparison could've reduced our error-count for this book
to _4_ errors, out of 279 pages, which is an extremely good payoff
for a proofing procedure that -- unlike distributed proofreaders --
did not require a painful word-by-word comparison page-by-page.

which is _not_ to say a "smooth-reading" wouldn't be a good thing.
because it would be a good thing.  but let's have people who _want_
to read this particular book be the ones who do the smooth-reading.

***

so, all in all, this _comparison_ methodology is far superior to the
word-by-word method now employed by distributed proofreaders,
providing benefits which are comparable but at a _far_ lower cost.

this is a methodology that can _scale_ to the required dimensions,
unlike the d.p. way, which has stalled out at 2,500 e-texts a year...

and yes, i told this to the people over at distributed proofreaders.
you can still find it, still sitting there, in a thread on their forums:
>   http://www.pgdp.net/phpBB2/viewtopic.php?t=24008

what did d.p. do?  they banned me too.  seemed to be going around.

-bowerbird

=====================================

p.s.  all in all, the results were very good.
recall the book is over 200k, and 279 pages.

***

16 lines with o.c.r. errors going uncaught,
most of which were involving _punctuation_.
2 were on a name, which we shoulda checked.

wrong> plete expressions. In that concrete
fixed> plete expressions, in that concrete
chstr> =================^=^===============

wrong> and exhausts while he instructs, the
fixed> and exhausts while he instructs; the
chstr> ===============================^====

wrong> bought Gary's crib, and took it with
fixed> bought Cary's crib, and took it with
chstr> =======^============================

wrong> read my Gary's Plato. It so hap-
fixed> read my Cary's Plato. It so hap-
chstr> ========^=======================

wrong> The latest of them. Count Tolstoi's
fixed> The latest of them, Count Tolstoi's
chstr> ==================^================

wrong> mass the facts about any given period,
fixed> mass the facts about any given period;
chstr> =====================================^

wrong> The reality of this, element of per-
fixed> The reality of this element of per-
chstr> ===================^----------------

wrong> ail they saw and knew a part of them-
fixed> all they saw and knew a part of them-
chstr> =^===================================

wrong> process, and culture and genius stand,
fixed> process, and culture and genius stand
chstr> =====================================^

wrong> there has always been, not only a
fixed> there has always been not only a
chstr> =====================^-----------

wrong> stones and constantly touched upon
fixed> stories and constantly touched upon
chstr> ===^^------------------------------

wrong> exploration, travel, and discovery; ha
fixed> exploration, travel, and discovery; he
chstr> =====================================^

wrong> There are. It is true, a few men and
fixed> There are, it is true, a few men and
chstr> =========^=^========================

wrong> from the days of the earliest Greek,
fixed> from the days of the earliest Greek
chstr> ===================================^

wrong> sion, thought. Impulse, which never
fixed> sion, thought, impulse, which never
chstr> =============^=^===================

wrong> grance, and growth which he enfolded
fixed> grance, and growth which lie enfolded
chstr> =========================^^----------
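
for anyone who wants to build "chstr" lines like the ones above,
here is a python sketch. the rule is only my reading of the chart,
not a description of whatever tool produced it: if the two lines
have the same length, mark each position "=" (same) or "^"
(different); otherwise mark the common beginning with "=", the
changed stretch with "^", and the common ending with "-".

```python
def change_string(wrong, fixed):
    """Build a marker line showing where two versions of a line differ."""
    if len(wrong) == len(fixed):
        # same length: compare position by position
        return "".join("=" if a == b else "^" for a, b in zip(wrong, fixed))

    shorter = min(len(wrong), len(fixed))
    longer = max(len(wrong), len(fixed))

    # length of the common beginning
    prefix = 0
    while prefix < shorter and wrong[prefix] == fixed[prefix]:
        prefix += 1

    # length of the common ending, never overlapping the beginning
    suffix = 0
    while suffix < shorter - prefix and wrong[-1 - suffix] == fixed[-1 - suffix]:
        suffix += 1

    changed = longer - prefix - suffix
    return "=" * prefix + "^" * changed + "-" * suffix

print(change_string("ail they saw and knew a part of them-",
                    "all they saw and knew a part of them-"))
print(change_string("The reality of this, element of per-",
                    "The reality of this element of per-"))
```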