[gutvol-d] Re: DP/PG vs. Google

5 Mar 2010

On 3/3/2010 6:03 PM, James Adcock wrote:
>> ...
>> ...
>> do you want to bet against google?
>> ...
>> because i'll take that bet against you.
>
> Sure, I'd be happy to take that bet, if I am allowed to win it or lose
> it in a finite amount of time -- such as a decade.  What I think is
> much more likely in a decade is that Google either gives up or they
> figure out how to post much more attractive page images.  I actually
> don't think they have much of any interest in posting higher quality
> automatic OCR transcriptions.

Wrong again. Google is funding development of open source OCR software 
via a project called OCRopus. I believe a beta version is due out shortly.

Further, Google bought reCAPTCHA. That's the company and software that 
make you prove you are human on many websites. They show two scanned 
words, one already known and one not. The human types in both. This 
works well because what is hard for OCR software (that is, for a 
computer) is often easy for a human. Over millions of comparisons, they 
are able to build up a pretty good version of the text. Since they don't 
address punctuation, and because capital and lowercase letters, and some 
blobs, can be hard to recognize out of context, they won't get the text 
perfect. But they can turn something from total gibberish into readable 
text.
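
To make the two-word scheme concrete, here is a minimal Python sketch of 
the idea as I described it above. It is not reCAPTCHA's actual code. The 
class name, the vote thresholds, and the word IDs are all hypothetical, 
and lowercasing every answer is my own simplification, reflecting the 
point that case is hard to judge out of context.

```python
from collections import Counter, defaultdict


class WordVotes:
    """Collect human readings of unknown scanned words (illustrative only)."""

    def __init__(self, min_votes=3, min_share=0.6):
        # Both thresholds are assumed values, not anything reCAPTCHA publishes.
        self.min_votes = min_votes
        self.min_share = min_share
        self.votes = defaultdict(Counter)

    def submit(self, control_answer, control_truth, unknown_id, unknown_answer):
        """Record a reading of the unknown word only if the known word was typed correctly."""
        if control_answer.strip().lower() != control_truth.strip().lower():
            return False  # failed the known word, so the other reading is not trusted
        self.votes[unknown_id][unknown_answer.strip().lower()] += 1
        return True

    def consensus(self, unknown_id):
        """Return the agreed reading, or None if there are too few or too divided votes."""
        counts = self.votes.get(unknown_id, Counter())
        total = sum(counts.values())
        if total < self.min_votes:
            return None
        word, n = counts.most_common(1)[0]
        return word if n / total >= self.min_share else None


votes = WordVotes()
votes.submit("morning", "morning", "w17", "upon")
votes.submit("morning", "morning", "w17", "upon")
votes.submit("mourning", "morning", "w17", "upcn")  # rejected: known word is wrong
votes.submit("morning", "morning", "w17", "npon")
print(votes.consensus("w17"))  # prints "upon" (2 of 3 accepted votes)
```

The known word acts as the check on the human, and agreement among 
several humans acts as the check on the unknown word. Neither step needs 
the OCR software to have read the unknown word correctly.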

I believe that there will always be a place for humans in preparing 
etext versions of some books. But, just as OCR eventually became good 
enough to start with, eventually technology will improve enough that 
humans will add value only on very difficult texts, or by contributing 
semantic information. I don't know when that will happen, but it is 
certainly coming.

Juliet Sutherland