for 32 days, i am showing samples of the problems
with the text in e-books from the internet archive...

***

today's sample is "alice's adventures in wonderland",
the book that -- more than any other -- captured the
imagination of people, from martin gardner on down.

in the e-book arena, "alice" has been "the prototype",
in that it's the overwhelming choice to use in demos.
it's the text which every new system cuts its teeth on.

one warning: don't use any archive.org o.c.r. version!

here's the page-scan:
>   http://www.archive.org/stream/alicesadventur00carr#page/20

here's the o.c.r. for the full book:
>   http://ia311325.us.archive.org/3/items/alicesadventur00carr/alicesadventur00carr_djvu.txt

and here's the o.c.r. for that page (which is page 20, not "n35"):
>
>   20 THE POOL
>  
>   and four times six is thirteen, and four times
>   seven is — oh dear! I shall never get to twenty
>   at that rate ! However, the Multiplication Table
>   doesn't signify : let's try Geography. London is
>   the capital of Paris, and Pciris is the capital of
>   Eome, and Eome — no, that 's all wrong, I'm
>   certain ! I must have been changed for Mabel !
>   I'll try and say ' Hoiv doth the little — '" and she
>   crossed her hands on her lap as if she were
>   saying lessons, and began to repeat it, but her
>   voice sounded hoarse and strange, and the words
>   did not come the same as they used to do : —
>  
>   " Hoi'j doth the little crocodile
>   Improve Iris shining tail,
>   And 2^our the v:aters of the Site
>   On every golden sccde !
>  
>   " Hoir cheerfully he seems to grin,
>   Hoio neatly spreads his elavjs,
>   And welcomes little fishes in
>   With gently smiling jaws/"

gosh, what a litany of errors and glitches on just this one page.

first, that funny character after "seven is" in the second line is
an em-dash.  it's a utf8 em-dash.  that's right.  to "solve" the
problem of the em-dashes that their workflow had misplaced,
they stick a _utf8_ em-dash into what is otherwise 7-bit ascii...
i think that's _stupid_, but i guess it's what they call "progress".
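
for anyone who wants to check a file for this, here is a minimal python sketch that lists every character outside the 7-bit ascii range. the function name `find_non_ascii` and the sample string are my own inventions for illustration, and this is not anything from the archive.org workflow.

```python
def find_non_ascii(text):
    """return (line_number, column, character) for each non-ascii character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:  # anything past 7-bit ascii
                hits.append((line_no, col, ch))
    return hits


# two lines from the page above, with the em-dash written as an escape
sample = (
    "and four times six is thirteen, and four times\n"
    "seven is \u2014 oh dear! I shall never get to twenty"
)
print(find_non_ascii(sample))
```

run against the full text file (decoded as utf8), a report like this would show where the em-dashes and any other stray characters landed.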

but let's go on.  numerous examples of "floating" punctuation
dot the page, which is not all that unusual for o.c.r., and i note
that it _can_ be corrected automatically, except it _hasn't_ been.
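
as a rough idea of what "corrected automatically" could look like, here is a small python sketch. it assumes the only floating marks are `!`, `?`, `;` and `:`, plus detached endings like " 's". the names `fix_floating_punctuation`, `FLOATING_MARK` and `DETACHED_SUFFIX` are hypothetical, and a real cleanup pass would need more rules (spaced quotation marks, for one).

```python
import re

# assumption: these four marks should never have whitespace before them
FLOATING_MARK = re.compile(r"[ \t]+([!?;:])")
# assumption: a space before 's, 't, 'll, etc. is an o.c.r. glitch
DETACHED_SUFFIX = re.compile(r"(\w) '(s|t|d|m|ll|re|ve)\b")


def fix_floating_punctuation(line):
    """close up floating punctuation within a single line of text."""
    line = FLOATING_MARK.sub(r"\1", line)
    line = DETACHED_SUFFIX.sub(r"\1'\2", line)
    return line


print(fix_floating_punctuation("at that rate ! However, the Multiplication Table"))
print(fix_floating_punctuation("Eome, and Eome \u2014 no, that 's all wrong, I'm"))
```

the rules work one line at a time, so the line breaks of the original stay where they were.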

we also have misrecognitions on "paris" and "rome" (twice)...

moreover, the _italics_ were lost, both in the text for emphasis,
and on the poem.  the _centering_ of the poem was lost as well.

and there were more misrecognitions on the italicized text,
on "how" (4), "his", "pour", "waters", "nile", "scale", and "claws",
plus an exclamation-point that was misrecognized as a slash.
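
some of these misrecognitions can at least be flagged by machine. the python sketch below marks any word that still contains a digit or symbol after its surrounding punctuation is stripped. it assumes plain english text (so accented letters would be flagged too), and `suspect_tokens` is a name i made up. it catches "2^our" and "v:aters", but words like "Eome", "Iris" and "Site" look like ordinary words to it, so those would need a dictionary check or a human eye.

```python
import re
import string

# anything other than a letter, apostrophe, or hyphen inside a word is suspect
SUSPECT_CHAR = re.compile(r"[^A-Za-z'\-]")


def suspect_tokens(text):
    """return words that contain digits or symbols in their interior."""
    flagged = []
    for token in text.split():
        # drop punctuation (and em-dashes) hanging on either end of the word
        core = token.strip(string.punctuation + "\u2014")
        if not core or core.isdigit():
            continue  # skip bare punctuation and plain numbers like page "20"
        if SUSPECT_CHAR.search(core):
            flagged.append(core)
    return flagged


print(suspect_tokens("And 2^our the v:aters of the Site"))
```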

all in all, a pretty dismal performance, all on this single page...
a classic book like this deserves much better...

-bowerbird