michael said:
> The bet is that a Xerox machine type of scanning and OCR
> will produce a 95% accurate copy of certain pages selected
> from an average set of books, magazines, etc.
> Just go to a library and ask for samples.
that's not the bet at all.
the bet is whether google can increase accuracy to 96% or 97%.
we're not talking about the limits of scanning and o.c.r., people.
we're talking about what a company with virtually unlimited funds
and lots and lots and lots and lots of expertise with handling text
can do _after_ they've scanned books and done o.c.r. on the scans,
in order to improve the accuracy of that text.
folks, we're talking about how well they can clean up their o.c.r.
and i'm being conservative in saying 96% or 97%... quite conservative.
i've shown how useful it can be to compare two book digitizations.
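here's a bare-bones python sketch of that two-copy comparison.
the sample strings and the word-level diff are just mine to show
the principle -- real page alignment is messier than this:

    import difflib

    def disagreements(text_a, text_b):
        """yield (words_from_a, words_from_b) wherever the copies differ."""
        a, b = text_a.split(), text_b.split()
        for op, i1, i2, j1, j2 in difflib.SequenceMatcher(None, a, b).get_opcodes():
            if op != "equal":
                yield a[i1:i2], b[j1:j2]

    copy_1 = "the barn door swung open in the rnorning light"
    copy_2 = "the barn door swung open in the morning light"
    for words_1, words_2 in disagreements(copy_1, copy_2):
        print("check this spot:", words_1, "vs", words_2)

every disagreement is a spot where at least one copy is wrong,
so a human (or a smarter program) knows exactly where to look.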
but for some editions of some books, google will have _many_
different digitizations, involving different physical copies taken
from different physical libraries throughout the country, scanned
by different machines, and perhaps processed using different o.c.r.
they will certainly experiment with despeckling and resolution,
and other variables, and should hit on a comparison combination
which -- for their particular scans -- works remarkably well.
they will also have tons of data on the types of errors that are made
by their equipment, and knowing that _will_ help them fix the errors.
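to make that concrete, here's a toy sketch of error-model cleanup.
the confusion pairs and the tiny lexicon are placeholders i made up;
google would mine these from its own mountains of scan data:

    # known o.c.r. confusions: (what the o.c.r. printed, what it really was)
    CONFUSIONS = [("rn", "m"), ("cl", "d"), ("0", "o"), ("1", "l")]
    LEXICON = {"modern", "morning", "the", "learn"}   # stand-in dictionary

    def propose_fixes(word):
        """return lexicon words reachable by undoing one known confusion."""
        fixes = set()
        for wrong, right in CONFUSIONS:
            for i in range(len(word)):
                if word.startswith(wrong, i):
                    candidate = word[:i] + right + word[i + len(wrong):]
                    if candidate in LEXICON:
                        fixes.add(candidate)
        return fixes

    print(propose_fixes("rnodern"))    # -> {'modern'}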
but mostly just having _multiple_digitizations_ of the same edition
of a book gives them the chance to raise accuracy through the roof.
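the payoff of multiple copies is simple majority voting. here's
a minimal sketch -- i'm assuming the texts are already aligned
word-for-word, which is the genuinely hard part in practice:

    from collections import Counter

    def vote(digitizations):
        """at each word position, keep the reading most copies agree on."""
        consensus = []
        for readings in zip(*(d.split() for d in digitizations)):
            word, _count = Counter(readings).most_common(1)[0]
            consensus.append(word)
        return " ".join(consensus)

    copies = ["the quick brovvn fox",
              "the qu1ck brown fox",
              "the quick brown fox"]
    print(vote(copies))    # -> "the quick brown fox"

with five or ten copies instead of three, a scanning error has to
repeat itself in the _majority_ of them to survive the vote.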
you guys want to tie google's hands in the same way yours are tied.
but google's money and expertise mean they are _miles_ ahead...
and eventually probably even light-years ahead...
-bowerbird
p.s. and the limitation on the bet that google can't use humans?
why not? they have billions of pageviews every single day, no?
why do you think they bought recaptcha and hired luis von ahn?
they're not limited by the shackles that you want them to wear...
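p.p.s. for anyone who hasn't seen how recaptcha closes the loop,
here's the concept in miniature. every name here is mine, and this
is the idea, not google's actual code:

    from collections import Counter

    KNOWN = {"control_001": "morning"}     # word image -> verified text
    VOTES = {"suspect_042": Counter()}     # word image -> human readings

    def record_answer(control_id, control_answer, suspect_id, suspect_answer):
        """count the human's reading only if they passed the control word."""
        if KNOWN[control_id] == control_answer.strip().lower():
            VOTES[suspect_id][suspect_answer.strip().lower()] += 1

    record_answer("control_001", "morning", "suspect_042", "quagmire")
    record_answer("control_001", "morning", "suspect_042", "quagmire")
    record_answer("control_001", "wrong!!", "suspect_042", "garbage")
    print(VOTES["suspect_042"].most_common(1))  # -> [('quagmire', 2)]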