
On 3/3/2010 6:03 PM, James Adcock wrote:
do you want to bet against google?
because i'll take that bet against you.
Sure, I'd be happy to take that bet, if I am allowed to win it or lose it in a finite amount of time -- such as a decade. What I think is much more likely in a decade is that Google either gives up or figures out how to post much more attractive page images. I actually don't think they have much interest in posting higher quality automatic OCR transcriptions.
Wrong again. Google is funding development of open source OCR software via a project called OCRopus. I believe a beta version is due out shortly.

Further, Google bought reCAPTCHA. That's the company and software that make you prove you are human on many websites. They provide two scanned words, one known and one not. The human types in both. This works well because what is hard for OCR software, that is, for a computer, is often easy for a human. Over millions of comparisons they are able to build up a pretty good version of the text. Since they don't address punctuation, and because capital and non-capital letters, and some blobs, can be hard to recognize out of context, they won't get the text perfect. But they can turn something from total gibberish into readable text.

I believe that there will always be a place for humans in preparing etext versions of some books. But, just as OCR eventually became good enough to start with, eventually technology will improve enough that humans will add value only on very difficult texts, or by contributing semantic information. I don't know when that will happen, but it is certainly coming.

Juliet Sutherland
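
For anyone curious how the two-word scheme described above could add up to usable text, here is a minimal sketch in Python. It is only an illustration under stated assumptions, not reCAPTCHA's actual implementation: the class name `WordVoting`, the thresholds `min_votes` and `min_agreement`, and the word ids are all hypothetical. It accepts a reading of the unknown word only when the typist got the known word right, and it ignores capitalization, in line with the limitation noted above.

```python
from collections import Counter, defaultdict


class WordVoting:
    """Toy model of the known-word / unknown-word scheme (names are hypothetical)."""

    def __init__(self, min_votes=3, min_agreement=0.6):
        # Assumed thresholds: how many trusted answers we need, and what
        # share of them must agree, before we accept a transcription.
        self.min_votes = min_votes
        self.min_agreement = min_agreement
        self.votes = defaultdict(Counter)  # unknown word id -> answer counts

    @staticmethod
    def _normalize(text):
        # Case is ignored, so capitalization is never recovered.
        return text.strip().lower()

    def submit(self, control_answer, control_truth, unknown_id, unknown_answer):
        """Count the unknown-word answer only if the known word was typed correctly.

        Returns True when the typist passed the known-word check.
        """
        if self._normalize(control_answer) != self._normalize(control_truth):
            return False
        self.votes[unknown_id][self._normalize(unknown_answer)] += 1
        return True

    def consensus(self, unknown_id):
        """Return the agreed transcription, or None if there is no agreement yet."""
        counts = self.votes.get(unknown_id)
        if not counts:
            return None
        total = sum(counts.values())
        word, top = counts.most_common(1)[0]
        if total >= self.min_votes and top / total >= self.min_agreement:
            return word
        return None


pool = WordVoting()
pool.submit("river", "river", "w1", "Morning")
pool.submit("river", "river", "w1", "morning")
pool.submit("rivr", "river", "w1", "mourning")  # failed the known word, not counted
pool.submit("river", "river", "w1", "moming")
print(pool.consensus("w1"))
```

With these assumed thresholds, a word that OCR could not read gets a transcription once enough trusted typists agree on it, which is how many small comparisons can turn gibberish into readable text.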