Re: Re: [gutvol-d] !@!Googleberg eBooks

From: Jim Tinsley <jtinsley@pobox.com>
Date: 2004/12/31 Fri PM 12:19:56 PST
To: gutvol-d@lists.pglaf.org
Subject: Re: [gutvol-d] !@!Googleberg eBooks
<snip>
3. Google cut the pages ('cos the scans are just _beautiful_!) and scan the pages of the books into images.
As I've previously noted, destructive scanning of modern reprints is easy and usually results in good images and good OCR.
4. Google run OCR on the pages. Along with every word, they store its position in the image. Like: the word "poorer" is on page 62, in a box 1.1 cm wide and 0.4 cm high whose top left corner is 4.2 cm from the top of the page and 3.1 cm from the left margin . . . except I'm sure they're not using cm as their unit. Abbyy does this in the internal files it saves, so it wouldn't shock me to find that they're using Abbyy for OCR.
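To make the "word plus position" idea concrete, here is a minimal sketch in Python. The `WordBox` record and `build_index` helper are hypothetical names invented for illustration. Nothing here is taken from Google's or Abbyy's actual file formats, and the units are left arbitrary.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WordBox:
    """One OCRed word and the box it occupies on the page image.

    Hypothetical record layout; units are whatever the scanner uses.
    """
    text: str
    page: int
    left: float
    top: float
    width: float
    height: float


def build_index(boxes):
    """Map each lowercased OCR word to every box where it was seen."""
    index = defaultdict(list)
    for box in boxes:
        index[box.text.lower()].append(box)
    return index


# The example from the message: "poorer" on page 62.
boxes = [WordBox("poorer", page=62, left=3.1, top=4.2, width=1.1, height=0.4)]
index = build_index(boxes)

# A search looks the word up and gets back the rectangles to paint yellow.
hits = index.get("poorer", [])
```

With something like this in place, a search only has to look up the word and draw a highlight over each returned box on the page image.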
The folks at The Million Book Project and The Internet Archive are using something called djvu that does this. It creates bounding boxes around each word in the image, then stores that information along with the text. The OCR associated with djvu is not ABBYY but another product that does not work quite as well.

A DP volunteer posted the following in our forums:

----------------------------
Here's an interesting experiment...

Go to http://www.google.com/googleblog/.

Under "All booked up" (which talks about the Google/Library project), click on the link labelled "the survival of the fittest". This takes you to a beta of Google Print, for the specific book "Darwin, and After Darwin".

Under "Search within this book", type "Darwin" and hit "Go". You'll get a new window with 3 images, showing the first few occurrences of "Darwin" in the book, where "Darwin" is highlighted in yellow.

What's interesting is that in the third image, there are two occurrences of the word "Darwin", but the first is not highlighted.

Similarly, if you search for "Berkeley", one occurrence in the second image is missing its highlight.

This suggests that their searches are based on unproofed OCR results (where the unhighlighted occurrences correspond to uncorrected scannos).

... searching for "1 arwin" (one, space, arwin) and having it highlight "Darwin". (Try it, it's neat!)
---------------

All of the above would appear to confirm Jim's assessment about what Google has done to date.

JulietS
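The missing highlights are what you would expect if the search matches the query against raw, unproofed OCR words. A small Python sketch of that behaviour follows. The `highlight` function and the sample OCR line are made up for illustration and are not Google's code or data.

```python
def highlight(ocr_words, query):
    """Return (word, highlighted) pairs using exact, case-insensitive matching."""
    wanted = query.lower()
    return [(word, word.lower() == wanted) for word in ocr_words]


# Hypothetical OCR output for one line: the first "Darwin" was misread
# as "1)arwin" (a scanno), the second came through correctly.
ocr_line = ["1)arwin", "and", "after", "Darwin"]

result = highlight(ocr_line, "Darwin")
```

Only the correctly recognised word is flagged. The scanno is on the page image as "Darwin" but stays unhighlighted, because the searchable text underneath says something else.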

On Fri, 31 Dec 2004 20:09:25 -0800, <juliet.sutherland@verizon.net> wrote:
Here's an interesting experiment...
Go to http://www.google.com/googleblog/.
Under "All booked up" (which talks about the Google/Library project), click on the link labelled "the survival of the fittest". This takes you to a beta of Google Print, for the specific book "Darwin, and After Darwin".
Under "Search within this book", type "Darwin" and hit "Go". You'll get a new window with 3 images, showing the first few occurrences of "Darwin" in the book, where "Darwin" is highlighted in yellow.
What's interesting is that in the third image, there are two occurrences of the word "Darwin", but the first is not highlighted.
Similarly, if you search for "Berkeley", one occurrence in the second image is missing its highlight.
This suggests that their searches are based on unproofed OCR results (where the unhighlighted occurrences correspond to uncorrected scannos).
... searching for "1 arwin" (one, space, arwin) and having it highlight "Darwin". (Try it, it's neat!) ---------------
Thanks for the cite, Juliet! I didn't know about that thread.

I read it, and the main thing that struck me was that bowerbird found the OCRed text, because it sure wasn't in the HTML sent back to me using Mozilla. Hm. Could they be tailoring their pages depending on User-Agent: or the Accept: line in the headers sent by the browser?

The answer is yes. When I search for "1 arwin" using Lynx, or Mozilla with images turned off (must be turned off before you start your initial Google Search), I get text instead of the images, like:

Darwin, and After Darwin Pages 1 - 1 of 1 in book for 1 arwin. (0.03 seconds) Page i 1)ARWIN, AND AFTER darwin A1V ¿xfositfoiv OF TIFF DARWINIAN TIFEOR V AND A discussion OF POST-DARWZNL4N QUES7IONS BY THE LATE GEORGE JOHN ROMANES , MA, LL. ...

This is obviously the text they're searching. Unfortunately, the whole text of a page is not similarly displayed when I do a page view.

Interestingly, both "I arwin" and "1 arwin" (capital "I" and digit "1") find the same passage. It seems that somebody in Google Print has decided to tweak its search to be tolerant of at least some common OCR errors.

jim
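One way a search could end up treating "I arwin" and "1 arwin" alike is to split text on punctuation and fold easily confused characters together before comparing. The Python sketch below is only a guess at that kind of tolerance. The confusion table, `normalize`, and `phrase_in` are assumptions for illustration, not anything known about Google Print's implementation.

```python
import re

# Assumed confusion classes: characters OCR often mixes up fold to one symbol.
# Applied after lowercasing, so capital "I" is covered by the "i" entry.
CONFUSABLE = str.maketrans({"1": "l", "i": "l", "0": "o"})


def normalize(text):
    """Lowercase, split on non-alphanumerics, and fold confusable characters."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token.translate(CONFUSABLE) for token in tokens]


def phrase_in(query, page_text):
    """True if the normalized query tokens occur consecutively in the page."""
    wanted = normalize(query)
    page = normalize(page_text)
    if not wanted:
        return False
    last_start = len(page) - len(wanted)
    return any(page[i:i + len(wanted)] == wanted for i in range(last_start + 1))


# Start of the OCR text quoted above.
ocr_title = "1)ARWIN, AND AFTER"

digit_query = phrase_in("1 arwin", ocr_title)   # digit one
letter_query = phrase_in("I arwin", ocr_title)  # capital I
proper_query = phrase_in("Darwin", ocr_title)   # the word as printed
```

Under these assumptions both odd-looking queries find the scanno, while the correctly spelled "Darwin" does not, which matches both the tolerant search and the missing highlights.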