>
> From: Jim Tinsley <jtinsley(a)pobox.com>
> Date: 2004/12/31 Fri PM 12:19:56 PST
> To: gutvol-d(a)lists.pglaf.org
> Subject: Re: [gutvol-d] !@!Googleberg eBooks
<snip>
> 3. Google cut the pages ('cos the scans are just _beautiful_!) and scan
> the pages of the books into images.
As I've previously noted, destructive scanning of modern reprints is easy and usually results in good images and good OCR.
> 4. Google run OCR on the pages. Along with every word, they store its
> position in the image. Like: the word "poorer" is on page 62, in a box
> 1.1 cm wide and 0.4cm high whose top left corner is 4.2 cm from the top
> of the page and 3.1 cm from the left margin, . . . except I'm sure
> they're not using cm. as their unit. Abbyy does this in its internal
> files it saves, so it wouldn't shock me to find that they're using Abbyy
> for OCR.
The folks at The Million Book Project and The Internet Archive are using a format called DjVu that does this. It creates bounding boxes around each word in the image, then stores that information along with the text. The OCR associated with DjVu is not ABBYY but another product that does not work quite as well.
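To make the idea concrete, here is a minimal sketch of what "store each word with its position" could look like. It is not Google's, ABBYY's, or DjVu's actual storage format; the OcrWord class, the find_boxes function, and the two extra words are made up for illustration. Only the "poorer" entry uses the numbers from Jim's example, with cm as the unit purely for readability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrWord:
    """One recognised word plus its box on the page image (hypothetical layout)."""
    text: str      # the word as the OCR engine read it (may contain errors)
    page: int      # page number in the book
    left: float    # distance of the box from the left margin, in cm
    top: float     # distance of the box from the top of the page, in cm
    width: float   # box width, in cm
    height: float  # box height, in cm


def find_boxes(words, query):
    """Return (page, left, top, width, height) for every word equal to query."""
    q = query.lower()
    return [
        (w.page, w.left, w.top, w.width, w.height)
        for w in words
        if w.text.lower() == q
    ]


# "poorer" follows Jim's example; the other two entries are invented filler.
words = [
    OcrWord("richer", 62, 1.2, 4.2, 1.0, 0.4),
    OcrWord("or", 62, 2.4, 4.2, 0.4, 0.4),
    OcrWord("poorer", 62, 3.1, 4.2, 1.1, 0.4),
]

if __name__ == "__main__":
    # Each returned box is what a viewer would paint yellow on the page image.
    for box in find_boxes(words, "poorer"):
        print(box)
```

A search then never has to look at the image itself: it matches against the stored text and uses the stored boxes only to draw the highlights.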
A DP volunteer posted the following in our forums:
----------------------------
Here's an interesting experiment...
Go to http://www.google.com/googleblog/.
Under "All booked up" (which talks about the Google/Library project), click on the link labelled "the survival of the fittest". This takes you to a beta of Google Print, for the specific book "Darwin, and After Darwin".
Under "Search within this book", type "Darwin" and hit "Go". You'll get a new window with 3 images, showing the first few occurrences of "Darwin" in the book, where "Darwin" is highlighted in yellow.
What's interesting is that in the third image, there are two occurrences of the word "Darwin", but the first is not highlighted.
Similarly, if you search for "Berkeley", one occurrence in the second image is missing its highlight.
This suggests that their searches are based on unproofed OCR results (where the unhighlighted occurrences correspond to uncorrected scannos).
... searching for "1 arwin" (one, space, arwin) and having it highlight "Darwin". (Try it, it's neat!)
---------------
All of the above would appear to confirm Jim's assessment about what Google has done to date.
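The missing highlights are what you would expect if the search runs over the raw OCR text. The sketch below is only an illustration of that behaviour, not a description of Google's code: the highlight function and the token list are hypothetical, and I am assuming the OCR misread one "Darwin" as "1 arwin", as the experiment suggests.

```python
def highlight(ocr_tokens, query):
    """Return (start, end) token index ranges whose OCR text matches the query.

    Matching is case-insensitive and done on the OCR output only; the
    printed page is never consulted.
    """
    q = query.lower().split()
    if not q:
        return []
    n = len(q)
    toks = [t.lower() for t in ocr_tokens]
    return [(i, i + n) for i in range(len(toks) - n + 1) if toks[i:i + n] == q]


# Assumed situation: the printed page reads "Darwin and after Darwin",
# but the OCR misread the second "Darwin" as "1 arwin" (a scanno).
ocr_tokens = ["Darwin", "and", "after", "1", "arwin"]

if __name__ == "__main__":
    # Only the correctly recognised occurrence is found.
    print(highlight(ocr_tokens, "Darwin"))
    # The scanno is found only by searching for the misread text.
    print(highlight(ocr_tokens, "1 arwin"))
```

Searching for "Darwin" finds only the first occurrence, while searching for "1 arwin" finds the second one, which is the same pattern the volunteer saw in the page images.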
JulietS