for 32 days, i am showing samples of the problems
with the text in e-books from the internet archive...

***

today's example is from "books and culture", the book
google offered as its very first public-domain scan-set.

here's the pagescan for page 14:
>   http://www.archive.org/stream/booksandculture00mabirich#page/14

here's the o.c.r. for the book:
>   http://ia341039.us.archive.org/2/items/booksandculture00mabirich/booksandculture00mabirich_djvu.txt

and here's the o.c.r. for page 14:
>
>   Material and Method.
>  
>   the wisest of modern readers has said
>   that the most important character-
>   istio of the real critic the man who
>   penetrates the secret of a work of
>   art is the ability to admire greatly;
>   and there is but a short step between
>   admiration and love. And as if to
>   emphasise the value of a quality so
>   rare among critics, the same wise
>   reader, who was also the greatest
>   writer of modern times, says also
>   that fc where keen perception unites
>   with good will and love, it gets at
>   the heart of man and the world ;
>   nay, it may hope to reach the high
>   est goal of all." To get at the heart
>   of that knowledge, life, and beauty
>   which are stored in books is surely
>   one way of reaching the highest
>   goal.
>  
>   That goal, in Goethe s thought,
>   was the complete development of
>   14

if you look at that text, you'll see a few o.c.r. errors...
specifically, "characteristio" instead of "characteristic",
and a double-quotemark that's misrecognized as "fc".

the bigger problems, however, are what you do not see.

comparing the text against its scan, you'll discover that
one end-line-hyphenate -- "highest" -- lost its hyphen.

also notice that the apostrophe in "goethe's" was lost...

both of these problems happened frequently in this book.
they're serious flaws, to be sure, but they lend themselves
to fixes that can be applied more-or-less automatically...

although that fact gives little solace once we consider the
offsetting fact that the auto-fixes have not been applied.
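
to make "more-or-less automatically" concrete, here is a minimal python sketch of what such fixes might look like. it is not the archive's code or any existing tool: the function names and the tiny word list are made up for illustration, and a real run would need a full dictionary plus a human checking what the rules changed.

```python
import re

# hypothetical mini word list; a real run would load a full dictionary
SAMPLE_WORDS = {"high", "highest", "goal", "reach", "the"}


def rejoin_lost_hyphens(lines, known_words):
    """Rejoin words split across a line break whose hyphen was dropped.

    The last word of a line is joined to the first word of the next line
    when the joined form is a known word and at least one of the two
    fragments is not a word by itself (e.g. "high" + "est").
    """
    out = list(lines)
    for i in range(len(out) - 1):
        stripped = out[i].rstrip()
        end = re.search(r"([A-Za-z]+)$", stripped)
        start = re.match(r"(\s*)([A-Za-z]+)", out[i + 1])
        if not end or not start:
            continue
        head, tail = end.group(1), start.group(2)
        joined_is_word = (head + tail).lower() in known_words
        fragment_is_odd = (head.lower() not in known_words
                           or tail.lower() not in known_words)
        if joined_is_word and fragment_is_odd:
            # move the orphaned fragment up to complete the word
            out[i] = stripped[: end.start(1)] + head + tail
            rest = out[i + 1][start.end():].lstrip()
            out[i + 1] = start.group(1) + rest
    return out


def restore_possessives(text):
    """Turn a stray lone "s" after a word back into "'s".

    Assumption: a free-standing "s" is almost never a real word,
    so "Goethe s" is read as "Goethe's".
    """
    return re.sub(r"\b([A-Za-z]+) s\b", r"\1's", text)
```

for example, `restore_possessives("That goal, in Goethe s thought,")` gives back the line with "Goethe's" restored, and `rejoin_lost_hyphens` pulls the orphaned "est" up to make "highest". both rules will misfire now and then, which is why the fixes are only "more-or-less" automatic.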

but there's yet another problem here, even more serious.

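as you see, em-dashes were lost in the o.c.r. of this page
(look around "the man who penetrates the secret of a work
of art", which now runs straight into the words beside it).
review of other pages shows this problem is _book-wide_,
and a look at other books reveals it's a common problem.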
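
one quick way to check a whole book for this is to count the dashes that survive in its text file. the sketch below is only an illustration, with a made-up function name; it assumes the book text has already been read into a string, and it counts both the real em-dash character and the double-hyphen that often stands in for it.

```python
def count_dashes(text):
    """Count surviving dash markers in a block of OCR text."""
    return {
        "em_dash": text.count("\u2014"),      # the true em-dash character
        "double_hyphen": text.count("--"),    # common ascii stand-in
    }
```

a full-length book of prose that comes back with zero for both counts is a strong hint that the dashes were dropped somewhere along the way.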

although it might not seem like that big of a deal, the loss
of em-dashes can make some passages weird and illogical.

as a solution, i programmed a tool that will let you easily
insert em-dashes into a text.  it works fine, but the job is
still time-consuming.  indeed, i found it much quicker to
simply re-do the o.c.r., a route that's clearly unacceptable.

the problem got more interesting, however, when i learned
-- after considerable sleuthing -- that this is a problem in
the _workflow_ of the internet archive, not the o.c.r. per se.

the o.c.r. records em-dashes, but somewhere along the line,
due to some bad coding, they are being lost from the file...
in other words, the employees at archive.org screwed it up.
this is human error, from people being paid to know better.

this problem is especially galling to me, because i thought
-- after i'd gone through some severe agitation to get it --
i had extracted a promise from the archive.org people that
they would go back and _fix_ this glitch on all the old files.

but as can be seen clearly, they reneged on their promise...

i guess they thought that once they banned me from their
listserv, they could forget the promises they made to me.

but i can't figure out why they don't care about their books.

-bowerbird