
On Fri, 31 Dec 2004 20:09:25 -0800, <juliet.sutherland@verizon.net> wrote:
Here's an interesting experiment...
Go to http://www.google.com/googleblog/.
Under "All booked up" (which talks about the Google/Library project), click on the link labelled "the survival of the fittest". This takes you to a beta of Google Print, for the specific book "Darwin, and After Darwin".
Under "Search within this book", type "Darwin" and hit "Go". You'll get a new window with 3 images, showing the first few occurrences of "Darwin" in the book, where "Darwin" is highlighted in yellow.
What's interesting is that in the third image, there are two occurrences of the word "Darwin", but the first is not highlighted.
Similarly, if you search for "Berkeley", one occurrence in the second image is missing its highlight.
This suggests that their searches are based on unproofed OCR results (where the unhighlighted occurrences correspond to uncorrected scannos).
... searching for "1 arwin" (one, space, arwin) and having it highlight "Darwin". (Try it, it's neat!)
---------------
Thanks for the cite, Juliet! I didn't know about that thread. I read it, and the main thing that struck me was that bowerbird found the OCRed text, because it sure wasn't in the HTML sent back to me using Mozilla.

Hm. Could they be tailoring their pages depending on User-Agent: or the Accept: line in the headers sent by the browser?

The answer is yes. When I search for "1 arwin" using Lynx, or Mozilla with images turned off (must be turned off before you start your initial Google Search), I get text instead of the images, like:

    Darwin, and After Darwin
    Pages 1 - 1 of 1 in book for 1 arwin. (0.03 seconds)
    Page i
    1)ARWIN, AND AFTER darwin A1V ¿xfositfoiv OF TIFF DARWINIAN TIFEOR V AND A discussion OF POST-DARWZNL4N QUES7IONS BY THE LATE GEORGE JOHN ROMANES , MA, LL. ...

This is obviously the text they're searching. Unfortunately, the whole text of a page is not similarly displayed when I do a page view.

Interestingly, both "I arwin" and "1 arwin" (capital "I" and digit "1") find the same passage. It seems that somebody in Google Print has decided to tweak its search to be tolerant of at least some common OCR errors.

jim
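
P.S. For anyone curious how that kind of tolerance might work, here is a minimal sketch in Python. It is only a guess at the general idea, not what Google Print actually does: the confusion table (CONFUSABLE) and the helper names (normalize, phrase_match) are made up for illustration. It folds a few characters that OCR commonly mixes up, splits on punctuation, and then looks for the query tokens in sequence.

```python
import re

# Hypothetical confusion table: a few characters OCR engines commonly mix up.
# This is an illustrative assumption, not Google's actual list.
CONFUSABLE = str.maketrans({"1": "i", "l": "i", "|": "i", "0": "o"})

def normalize(text):
    """Lowercase, fold confusable characters, split into alphanumeric tokens."""
    folded = text.lower().translate(CONFUSABLE)
    return re.findall(r"[a-z0-9]+", folded)

def phrase_match(query, page_text):
    """Return True if the normalized query tokens occur consecutively in the page."""
    q = normalize(query)
    p = normalize(page_text)
    if not q:
        return False
    # Slide a window of len(q) tokens across the page tokens.
    return any(p[i:i + len(q)] == q for i in range(len(p) - len(q) + 1))

# The start of the OCR text quoted above.
page = "Page i 1)ARWIN, AND AFTER darwin"
print(phrase_match("1 arwin", page))
print(phrase_match("I arwin", page))
```

With this folding, "1 arwin" and "I arwin" both reduce to the same two tokens, which is consistent with the two queries finding the same passage. It says nothing about how the highlight gets mapped back onto the page image.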