
On Fri, 31 Dec 2004 10:59:51 -0800 (PST), Michael Hart <hart@pglaf.org> wrote:
Aren't the PG eBooks already in Google Print?
No. Definitively, no. That is one of the things my experiments demonstrated (see "pickle-bottle"). Our texts, or at least, as Greg says, the first 100K or so of them, are indexed in Google, and Yahoo!, and other search engines. But that's Google, not Google Print.

Google Print is a NEW content source. The content for Google Print is not directly available on the web now; it is held internally by Google. I have no inside information, but I think that my reconstruction below, based on my actually trying the thing, is pretty close.

1. Google agree with Penguin Classics, among others, that they can use their publications in Google Print.

2. Penguin Classics, et al., ship Google a copy of every book they currently have in print (which is covered by this agreement -- I imagine there may be some restrictions).

3. Google cut the pages ('cos the scans are just _beautiful_!) and scan the pages of the books into images.

4. Google run OCR on the pages. Along with every word, they store its position in the image. Like: the word "poorer" is on page 62, in a box 1.1 cm wide and 0.4 cm high whose top left corner is 4.2 cm from the top of the page and 3.1 cm from the left margin . . . except I'm sure they're not using cm as their unit. Abbyy does this in the internal files it saves, so it wouldn't shock me to find that they're using Abbyy for OCR.

5. Google resize and transform the images to JPEG for display. (I can't prove that they didn't start with JPEGs of that size, but I think it's likely that they scanned at 600 dpi or higher initially.)

6. Google store the OCRed text, complete with the co-ordinates of each word on the pages where it appears, and index that OCRed text. They also store the JPEG images. Because they know that all the text in a book is useful (and that a book is of a finite size!) they store _all_ of the text of each book, not just the first 100K.

7. When a Google search is run, not only the main Google index is searched, but also the Google Print OCR text.

8. If the search returns results from Google Print, they are displayed on the search results page, along with the main Google results.

9. If a user clicks on a Google Print result, they are brought to the first page image -- the JPEG file -- where that search term is found in the OCRed text. When the page image is displayed, the search term is highlighted in yellow, using the co-ordinates captured at OCR time. (Actually, what is shown is the page image without the yellow, as I demonstrated by viewing the page images directly, with the HTML created dynamically to overlay yellow at the appropriate co-ordinates.)

10. The user can then browse back and forth, with limitations, through the page images.

11. The text that Google OCRed is never actually displayed as text, or HTML; it is used only to find the right page and highlight the search term.
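To make steps 4, 6 and 9 of the reconstruction above concrete, here is a minimal sketch in Python. It is purely illustrative and is not Google's code: the names WordBox, build_index, first_hit_page and highlight_html, the pixel units, and the sample co-ordinates and image file name are all my own assumptions. It only shows how storing a box per OCRed word is enough both to find the first matching page and to lay yellow over an untouched page image.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class WordBox:
    """One OCRed word and its box on the page image (units assumed to be pixels)."""
    word: str
    page: int
    left: int
    top: int
    width: int
    height: int


def build_index(boxes):
    """Map each lower-cased word to every box where it occurs (step 6)."""
    index = defaultdict(list)
    for box in boxes:
        index[box.word.lower()].append(box)
    return index


def first_hit_page(index, term):
    """Return the lowest page number containing the term, or None (step 9)."""
    hits = index.get(term.lower(), [])
    return min((b.page for b in hits), default=None)


def highlight_html(index, term, page, image_url):
    """Build HTML that overlays yellow boxes on an unmodified page image."""
    overlays = "".join(
        f'<div style="position:absolute;left:{b.left}px;top:{b.top}px;'
        f'width:{b.width}px;height:{b.height}px;'
        f'background:yellow;opacity:0.5"></div>'
        for b in index.get(term.lower(), [])
        if b.page == page
    )
    return (
        f'<div style="position:relative">'
        f'<img src="{image_url}">{overlays}</div>'
    )


# Hypothetical sample data: made-up words, pages and co-ordinates.
sample_boxes = [
    WordBox("richer", 61, 40, 90, 70, 18),
    WordBox("poorer", 62, 310, 420, 110, 40),
    WordBox("poorer", 97, 55, 130, 110, 40),
]
sample_index = build_index(sample_boxes)
hit_page = first_hit_page(sample_index, "Poorer")
print(hit_page)
print(highlight_html(sample_index, "poorer", hit_page, "page-0062.jpg"))
```

Note that the image file itself is never altered; the highlight exists only in the generated HTML, which is consistent with what I saw when I viewed the page images directly.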
That's what I heard,
Then I feel quite certain that you heard wrong.
so I would have figured they would have re-indexed them to make them complete???
I wonder if they left the old files, and just are making new ones, still from PG eBooks?
If so, how would you tell the difference?
If they were using our texts, which I am quite sure they are not, we could tell the difference by seeing whether their text was the same as our text. I do that quite a lot when checking out corrections to our texts, and I can actually reel off various errors in various editions of e-texts around the web by now.

Their page images, and their search index, do not contain the same words as our texts. My "pickle-bottle" example is the least demonstration of that: many of the Penguin Classics they have in Google Print include introductions that we do not have. And, remember, they never display text: they _only_ display page images.

No, I conclude that Google Print overlaps not at all with PG, except that we both have (different editions of) a large number of classic books.

jim