the problem with the e-books from the internet archive -- 01 of 32

for 32 days, i am showing samples of the problems with the text in e-books from the internet archive...

***

let me begin by saying that this is guided by _love_... there is much that i admire about the internet archive. right at the top is the man at the top -- brewster kahle. he crafted, and is being guided by, the right philosophy: "universal access to knowledge". at a time when _many_ people adopt a "pragmatic" approach to our cyberlibrary, it's useful and vital to offset it with a philosophical tone. because the unique opportunity has presented itself now, creating a cyberlibrary is the _right_ thing for us to do... to _fail_ to do so would be a massive failure on our part.

so i believe strongly in the mission of the internet archive. but in practice, there is much improvement to be desired... specifically, what good is a library when its text is garbled?

***

and now some background... as is often the case, the problem's origin is deep in the past. archive.org took a fork in the road a long time back that has ramified to its present, in a way that is not particularly good.

specifically, a decision was made to focus on _scans_ of books, a decision which made digitization of the text less important... moreover, it made digitization seem less costly than it truly is, because it ignored the costs associated with cleaning the o.c.r. but mostly, it was that the _attention_ got focused incorrectly; text was a second-class citizen, thus o.c.r. was largely ignored.

in some ways, this decision was "understandable" at the time... early efforts proved that correcting o.c.r. was time-consuming; plus then creating a "book-like" version of that corrected o.c.r. was a layout task that seemed too hard for archive.org people. meanwhile, brewster was impressed by a page-flipping demo by the british library. and the scans didn't require the "layout" that seemed so difficult. so archive.org headed in that direction.
as you might expect, a good many people (notably michael hart, with his firm insistence that "a picture of a book is _not_ a book") questioned this decision, and advocated on behalf of digital text. but the decision to focus on scanning stuck.

as a consequence, correction of the digital text has never been a priority, a decision that has come back to haunt archive.org...

end-users have found that archive.org's huge scan-sets are just too inconvenient. scan downloads are bulky, and take up space, and the absence of any searchable text is a distinct disadvantage, as is a total inability to reflow the book to differing form-factors.

moreover, our shift to _mobile_ machines magnifies the need for e-books to be nimble, and archive.org scans will never be nimble. even the low-resolution scan-sets can run to dozens of megabytes. (and the high-res scan-sets are so darn big that they're ridiculous.)

so, the demand is for digital text, now and into the future too. which means archive.org has been caught with its pants down, and is now paying the price for a bad decision made years ago. they find that their end-users are _demanding_ digital text, primarily in the form of .epub downloads and accessible text for blind and dyslexic users (which must be available by law), but the only digital text archive.org can provide is riddled with embarrassing o.c.r. errors, making the text practically unusable.

***

archive.org hasn't made _any_ efforts to clean up their o.c.r. consequently, it's quite easy to find examples of terrible o.c.r. in the archive.org library. indeed, it's so easy that i wouldn't ordinarily even bother to point out such _common_ examples. i'd say that 95% of the archive.org e-books exhibit problems in their digital text. finding e-books _without_ any problems is the difficult task here, not finding ones that have problems.
indeed, the problem is so pervasive that if _i_ were making the decision to release this text -- as it is -- to the general public, i would refuse to do so... that's how bad some of this text is...

but not only is archive.org _releasing_ the text, it's _promoting_ the release of its flawed text, as if it were _proud_ of the release. i don't really understand why they'd _do_ this. (it's almost as if they expect that nobody will actually even _look_ at their books, even though they're promoted with high-profile press releases.)

but -- after having tried for years and years and years to bring this problem to their attention and have them do something to correct it -- i can no longer sit by without making a public fuss.

and let there be no mistake about this. i am _willing_ to help archive.org correct their o.c.r. i've done lots of work on this, and i _have_ offered my assistance. i was largely ignored, so i persisted, and -- to my amazement -- i was _banned_ from their mailing lists. talk about shooting the messenger!

***

so here's today's example. for this one, i will pick a book that archive.org itself once trumpeted as an example of its work -- the adventures of tom sawyer. in keeping with its focus on its page-flipping scan-set viewer, archive.org showed one of the chapter-header pages, complete with an illustration that overlapped with the opening text there. here's where they promoted the page:
and here's the actual page:
http://www.archive.org/stream/adventuresoftoms00twaiiala#page/26
the reason (the scan of) this page was so impressive is that our current e-book formats are largely incapable of integrating text with pictures in such a sophisticated and pleasing mesh... but that intertwining of text and graphic has a deleterious effect on the o.c.r. so if you pull up the o.c.r. text from this very page, you'll see that it shows this:
morning was
come, and all the summer world was bright and fresh, and brimming with life. There was a song in every heart ;. and- if the heart was young the music issued at the lips. There was cheer in every face and a spring in every step. The locust trees were in bloom and the fragrance of the blossoms filled the air. Cardiff Hill, beyond the Village and above it, was green with vegetation, and it lay just far enough away to seem a Delectable Land, dreamy, reposeful, and inviting. Tom appeared on the sidewalk with a bucket of whitewash and a long- handled brush. He surveyed the fence, and all gladness left him and a deep melancholy settled down upon his spirit. Thirty yards of board fence nine feet high. Life to him seemed hollow, and existence but a burden. Sighing.
26
as you can see, the chapter-header has been lost, and so has the word "saturday" at the beginning of the chapter. so a person who is "reading" this o.c.r. text won't know that this is a new chapter, or that the day is a saturday. aside from a few other o.c.r. glitches, the rest of the text is ok. but not knowing that this is the start of a new chapter is a serious problem for a reader. and this is on a page that was actively promoted by archive.org!

-bowerbird
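
for anyone who wants to check a page like this mechanically, here is a minimal sketch in python, under some loud assumptions: it assumes you already have a trusted transcription of the page to compare against (the `reference` string below was typed by hand for illustration, and is _not_ something archive.org supplies), and the helper names `words` and `missing_words` are made up for this example. it simply lines up the two word sequences with the standard-library `difflib` module and reports the reference words that the o.c.r. dropped or altered. it is one possible check, not a description of any tool archive.org uses.

```python
import difflib
import re


def words(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def missing_words(reference, ocr):
    """Return reference words that were dropped or altered in the o.c.r. text."""
    ref, got = words(reference), words(ocr)
    matcher = difflib.SequenceMatcher(a=ref, b=got, autojunk=False)
    missing = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "delete": words only in the reference; "replace": reference words the o.c.r. changed
        if tag in ("delete", "replace"):
            missing.extend(ref[i1:i2])
    return missing


# hypothetical hand-typed reference for the top of the page (an assumption, not archive.org data)
reference = "CHAPTER II. SATURDAY morning was come, and all the summer world was bright and fresh"
# the start of the o.c.r. text quoted above
ocr = "morning was come, and all the summer world was bright and fresh"

print(missing_words(reference, ocr))
```

on this sample, the words it reports are the chapter-header and "saturday", which is the loss described above. the catch, of course, is that the comparison needs a clean reference text in the first place.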