
jim said:
the Google Books approach of simply digitizing the page images of a book is a perhaps reasonable approach where the readership for that book is
small compared to the effort required to take the Gutenberg approach of producing a clean digital text. If one is going to simply digitize the page images of a book then one needs a reader with a screen big enough
to display that page image legibly -- unless one imagines a technology
to slice and re-dice the page image in an intelligent manner so that snippets of the page image fit readably onto an iPhone.
ok, where do i begin here? just pick a spot and jump in, i guess...

i believe the google book "settlement" should be rejected. it gives far too much to google -- a virtual monopoly -- and would thus shut down experimentation prematurely...

but what you have said here, jim, doesn't do google justice. you seem to think that google is just scanning the books, and displaying those scans to people. that's not the case. google is doing o.c.r., and is using the results of that o.c.r. further, they show indications they'll be cleaning that o.c.r.

contrary to what you have implied here, it does _not_ take a great deal of time or effort to create "a clean digital text". i have shown here -- repeatedly -- that the vast majority of o.c.r. errors can be corrected with little or no human effort, using rigorously-applied computerized correction routines. and google knows how to create, test, and refine those routines.

i have also shown here -- again, repeatedly -- that one attains high accuracy by comparing different digitizations to each other, and minimizes the human attention required in the process... google can use this, as it is scanning lots of books multiple times.

so why does distributed proofreaders work so hard to do its books? because it's stupid, with its head in the sand, ignores what i prove, and doesn't mind wasting the energy and time of its volunteers...

next, one doesn't have to "imagine" a technology that will slice and re-dice a page-image to fit it onto a certain display-size... bill janssen over at parc demonstrated such a system long ago. and if you look, google is currently using its own variant of that.

finally, there are a lot of page-images that fit nicely on an iphone even without being sliced and diced. for instance, navigate here: this page-image displays quite nicely on an iphone, thank you.

-bowerbird
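A minimal sketch of the "compare different digitizations" idea mentioned above. This is not bowerbird's or Google's actual routine; the function name flag_disagreements and the sample strings are hypothetical, and the word-level alignment via Python's standard difflib module is an assumption chosen for illustration. The point is only that words on which two independent OCR passes agree need no human attention, while the disagreements form a short list for review.

```python
import difflib

def flag_disagreements(ocr_a: str, ocr_b: str):
    """Align two OCR passes of the same page word by word and
    return the spans where they disagree as (text_a, text_b) pairs."""
    words_a = ocr_a.split()
    words_b = ocr_b.split()
    # autojunk=False keeps frequent words (e.g. "the") from being ignored
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    disputed = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Only these spans would need a human (or a further rule) to resolve
            disputed.append((" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2])))
    return disputed

# Hypothetical output of two separate scans of the same line
scan_1 = "the quick brown fox jumps over the lazy dog"
scan_2 = "the quick hrown fox jumps over tbe lazy dog"

for text_a, text_b in flag_disagreements(scan_1, scan_2):
    print(f"{text_a!r} vs {text_b!r}")
```

In this toy case only two of nine words are flagged. A third digitization, or a dictionary check, could settle most such disputes automatically; that step is left out of the sketch.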

jimad said:
"Google simply digitizing"

Bowerbird said:
you seem to think that google is just scanning the books, and displaying those scans to people. that's not the case. google is doing o.c.r., and is using the results of that o.c.r.

Sorry, I know the broad strokes of what Google is doing. Rather, I was pandering to my PG audience to soften the point I was trying to make [which is also somewhat the point you're trying to make] -- which is that --perhaps-- at some point in the near future, using human beans to make txt files will no longer represent the best technological approach to making PD books available to the public -- and that, with the DX and Google "Page Image" PDFs as examples, maybe that day is getting pretty close.

Google is still making the page image primary, and making the OCR -- however cleaned up or not -- secondary. I.e., Google is using the OCR to make the book more-or-less searchable -- wonder why Google would bother to do that? Some Google books have very good OCR, others have very bad OCR, and some Google books have only page images and no OCR at all.

Which begs the question: what IS the bottom-line goal of PG, and/or of DP? What IS IT we are really trying to accomplish here?

Bowerbird said:
one doesn't have to "imagine" a technology that will slice and re-dice a page-image to fit it onto a certain display-size...google is currently using its own variant of that.

Sorry, where does Google do a "slice and dice" -- can you provide a pointer? -- I know they do pan and scan. I also know some OCRs will take a mixed OCR-text / word-image or char-image approach to digitizing a page, based on how confident they are in a recognized word -- as in "paperless offices"
participants (2)
- Bowerbird@aol.com
- Jim Adcock