Re: [gutvol-d] !@!Googleberg eBooks

31 Dec 2004

On Fri, 31 Dec 2004 10:59:51 -0800 (PST), Michael Hart <hart@pglaf.org> wrote:
...
Aren't the PG eBooks already in Google Print?
No. Definitively, no. That is one of the things my experiments
demonstrated (see "pickle-bottle").

Our texts, or at least, as Greg says, the first 100K or so of them, 
are indexed in Google, and Yahoo!, and other search engines. But that's 
Google, not Google Print.

Google Print is a NEW content source. The content for Google Print is
not directly available on the web now; it is held internally by Google.

I have no inside information, but I think that my reconstruction below,
based on my actually trying the thing, is pretty close.

1. Google agree with Penguin Classics, among others, that they can use
their publications in Google Print.

2. Penguin Classics, et al., ship Google a copy of every book they
currently have in print (which is covered by this agreement -- I imagine
there may be some restrictions).

3. Google cut the pages ('cos the scans are just _beautiful_!) and scan
the pages of the books into images.

4. Google run OCR on the pages. Along with every word, they store its
position in the image. Like: the word "poorer" is on page 62, in a box
1.1 cm wide and 0.4 cm high whose top left corner is 4.2 cm from the top
of the page and 3.1 cm from the left margin, . . . except I'm sure
they're not using cm as their unit. Abbyy does this in the internal
files it saves, so it wouldn't shock me to find that they're using Abbyy
for OCR.

5. Google resize and transform the images to JPEG for display. (I can't
prove that they didn't start with JPEGs of that size, but I think it's
likely that they scanned at 600 or higher initially.)

6. Google store the OCRed text, complete with the co-ordinates of each
word on the pages where it appears, and index that OCRed text. They also
store the JPEG images. Because they know that all the text in a book is
useful (and that a book is of a finite size!) they store _all_ of the
text of each book, not just the first 100K.

7. When a Google search is run, not only the main Google index is
searched, but also the Google Print OCR text.

8. If the search returns results from Google Print, they are displayed
on the search results page, along with the main Google results.

9. If a user clicks on a Google Print result, they are brought to the
first page image -- the JPEG file -- where that search term is found in
the OCRed text. When the page image is displayed, the search term is
highlighted in yellow, using the co-ordinates captured at OCR time.
(Actually, what is shown is the page image without the yellow, as I
demonstrated by viewing the page images directly, with the HTML created
dynamically to overlay yellow at the appropriate co-ordinates.)

10. The user can then browse back and forth, with limitations, through
the page images.

11. The text that Google OCRed is never actually displayed as text, or
HTML; it is used only to find the right page and highlight the search
term.
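
To make steps 4, 6 and 9 concrete, here is a minimal Python sketch of
how such a scheme could work. It is only an illustration of the
reconstruction above, not Google's actual code or data format: the
names WordBox, build_index, first_page_with and highlight_html, the
px_per_unit scale factor, and the file name used in the usage note are
all hypothetical. It stores each OCRed word with its page and box,
indexes the words, finds the first page containing a term, and builds
HTML that overlays yellow on an unmodified page image.

```python
from collections import defaultdict
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class WordBox:
    """One OCRed word plus where it sits on its page image (hypothetical layout)."""
    word: str
    page: int
    left: float    # distance from the left edge, in arbitrary page units
    top: float     # distance from the top edge, same units
    width: float
    height: float


def build_index(boxes):
    """Map each lower-cased word to every box where it occurs."""
    index = defaultdict(list)
    for box in boxes:
        index[box.word.lower()].append(box)
    return index


def first_page_with(index, term):
    """Return the lowest page number containing the term, or None if absent."""
    hits = index.get(term.lower(), [])
    return min((box.page for box in hits), default=None)


def highlight_html(index, term, page, image_url, px_per_unit=1.0):
    """Build HTML that lays translucent yellow boxes over an untouched page image."""
    hits = [box for box in index.get(term.lower(), []) if box.page == page]
    overlays = "".join(
        '<div style="position:absolute;'
        f"left:{round(box.left * px_per_unit)}px;"
        f"top:{round(box.top * px_per_unit)}px;"
        f"width:{round(box.width * px_per_unit)}px;"
        f"height:{round(box.height * px_per_unit)}px;"
        'background:yellow;opacity:0.4"></div>'
        for box in hits
    )
    src = escape(image_url, quote=True)
    return f'<div style="position:relative"><img src="{src}">{overlays}</div>'
```

With the "poorer" example from step 4 entered as
WordBox("poorer", 62, 3.1, 4.2, 1.1, 0.4), first_page_with returns 62,
and highlight_html(index, "poorer", 62, "page62.jpg", px_per_unit=100)
yields one yellow box over the image. The OCRed text itself never
appears in the output, which matches step 11.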
...
That's what I heard,
Then I feel quite certain that you heard wrong.
...
so I would have figured they would have
re-indexed them to make them complete???
I wonder if they left the old files, and just are making new ones,
still from PG eBooks?
If so, how would you tell the difference?
If they were using our texts, which I am quite sure they are not, we
could tell the difference by seeing whether their text was the same as
our text. I do that quite a lot when checking out corrections to our
texts, and I can actually reel off various errors in various editions
of e-texts around the web by now. Their page images, and their search
index, do not contain the same words as our texts. My "pickle-bottle"
example is the least demonstration of that: many of the Penguin Classics
they have in Google Print include introductions that we do not have.
And, remember, they never display text: they _only_ display page images.
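
The kind of comparison I mean can be sketched in a few lines of Python
using the standard difflib module. This is an illustration under
assumptions, not the procedure I actually follow: the function names
words, similarity and words_only_in are made up for the example, and it
presumes you have both editions as plain text, which Google Print does
not give you.

```python
import difflib
import re


def words(text):
    """Split text into lower-cased words, keeping hyphens and apostrophes."""
    return re.findall(r"[a-z0-9'-]+", text.lower())


def similarity(text_a, text_b):
    """Return a 0..1 ratio of how closely the two word sequences match."""
    matcher = difflib.SequenceMatcher(
        None, words(text_a), words(text_b), autojunk=False
    )
    return matcher.ratio()


def words_only_in(text_a, text_b):
    """List the words that occur in text_a but nowhere in text_b."""
    return sorted(set(words(text_a)) - set(words(text_b)))
```

Two copies of the same transcription score 1.0; a different edition
scores lower, and words_only_in points at the words, such as one found
only in an added introduction, that one edition has and the other
lacks.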

No, I conclude that Google Print overlaps not at all with PG, except
that we both have (different editions of) a large number of classic
books.

jim

Jim Tinsley