the problem with the e-books from the internet archive -- 04 of 32

for 32 days, i am showing samples of the problems with the text in e-books from the internet archive...
***
today's sample is "alice's adventures in wonderland", the book that -- more than any other -- captured the imagination of people, from martin gardner on down.
in the e-book arena, "alice" has been "the prototype", in that it's the overwhelming choice to use in demos. it's the text which every new system cuts its teeth on.
one warning: don't use any archive.org o.c.r. version!
here's the page-scan:
here's the o.c.r. for the full book:
http://ia311325.us.archive.org/3/items/alicesadventur00carr/alicesadventur00...
and here's the o.c.r. for that page (which is page 20, not "n35"):
20 THE POOL
and four times six is thirteen, and four times seven is — oh dear! I shall never get to twenty at that rate ! However, the Multiplication Table doesn't signify : let's try Geography. London is the capital of Paris, and Pciris is the capital of Eome, and Eome — no, that 's all wrong, I'm certain ! I must have been changed for Mabel ! I'll try and say ' Hoiv doth the little — '" and she crossed her hands on her lap as if she were saying lessons, and began to repeat it, but her voice sounded hoarse and strange, and the words did not come the same as they used to do : —
" Hoi'j doth the little crocodile Improve Iris shining tail, And 2^our the v:aters of the Site On every golden sccde !
" Hoir cheerfully he seems to grin, Hoio neatly spreads his elavjs, And welcomes little fishes in With gently smiling jaws/"
gosh, what a litany of errors and glitches on just this one page.
first, that funny character after "seven is" in the second line is an em-dash. it's a utf8 em-dash. that's right. to "solve" the problem of the em-dashes that their workflow had misplaced, they stick a _utf8_ em-dash into what is otherwise 7-bit ascii... i think that's _stupid_, but i guess it's what they call "progress".
but let's go on. numerous examples of "floating" punctuation dot the page, which is not all that unusual for o.c.r., and i note that it _can_ be corrected automatically, except it _hasn't_ been.
we also have misrecognitions on "paris" and "rome" (twice)...
moreover, the _italics_ were lost, both in the text for emphasis, and on the poem. the _centering_ of the poem was lost as well.
and there were more misrecognitions on the italicized text, on "how" (4), "his", "pour", "waters", "nile", "scale", and "claws", plus an exclamation-point that was misrecognized as a slash.
all in all, a pretty dismal performance, all on this single page... a classic book like this deserves much better...
-bowerbird
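to make the "can be corrected automatically" point concrete, here is a minimal sketch in python. the function names are made up for illustration, and this is an assumption about how such a cleanup might look, not anything the archive's workflow is known to run. it pulls floating punctuation back against the preceding word, and lists any characters that fall outside 7-bit ascii (such as that em-dash).

```python
import re

# hypothetical helpers for illustration only; not part of any archive.org tool


def fix_floating_punctuation(text: str) -> str:
    """Close the stray space before ! ? ; : and inside common contractions."""
    # "rate !" -> "rate!", "signify :" -> "signify:"
    text = re.sub(r"(?<=\w) +([!?;:])", r"\1", text)
    # "that 's" -> "that's" (only for a short list of contraction endings)
    text = re.sub(r"(?<=\w) '(s|t|d|m|ll|re|ve)\b", r"'\1", text)
    return text


def find_non_ascii(text: str) -> list[tuple[int, str]]:
    """Return (offset, character) for every character outside 7-bit ascii."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 127]


if __name__ == "__main__":
    sample = "I shall never get to twenty at that rate ! no, that 's all wrong"
    print(fix_floating_punctuation(sample))
    # "\u2014" is the em-dash, written as an escape so this file stays ascii
    print(find_non_ascii("four times seven is \u2014 oh dear!"))
```

the contraction rule is deliberately narrow, since a single-quote that opens a quotation (as in "say ' Hoiv") has to be left alone. a real pass over a whole book would still want a human to look over whatever it changed.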