!@!Googleberg eBooks

Michael Hart

28 Dec 2004

11:49 a.m.

How many of you have tried Google Print? Have you noticed that the initial offering of eBooks strongly resembles the Project Gutenberg catalogue??? We'd love to hear your experiences with Google Print.

Thanks!!!

Michael S. Hart

Pauline

30 Dec

11:16 a.m.

Michael Hart wrote:

...

How many of you have tried Google Print?

Have you noticed that the initial offering of eBooks strongly resembles the Project Gutenberg catalogue???

Why not include info in all PG ebooks which makes it:

a) easy for readers to identify the source of the book (PG & the "Produced by" line)
b) easy for readers/mirror sites/republishers to send corrections back to the source (PG &/| the producers)
c) not OK to drop this info from PG ebooks when they are republished

As a reader, knowing the source of the book is exceedingly valuable.

Cheers,
P
--
Distributed Proofreaders: http://www.pgdp.net
"Preserving history one page at a time."

Robert Shimmin

4:35 p.m.

Pauline wrote:

...

c) not OK to drop this info from PG ebooks when they are republished

The idea of a public domain is that anyone can do anything they like with the text, including edit it, republish it, and package it however they wish. -- RS

Michael Hart

4:55 p.m.

On Thu, 30 Dec 2004, Robert Shimmin wrote:

...

Pauline wrote:

...
c) not OK to drop this info from PG ebooks when they are republished

The idea of a public domain is that anyone can do anything they like with the text, including edit it, republish it, and package it however they wish.

But you can't say you are the author. . .and perhaps other things. mh

Michael Hart

5 p.m.

On Thu, 30 Dec 2004, Pauline wrote:

...

Michael Hart wrote:

...
How many of you have tried Google Print?

Have you noticed that the initial offering of eBooks strongly resembles the Project Gutenberg catalogue???

Why not include info in all PG ebooks which makes it: a) easy for readers to identify the source of the book (PG & the "Produced by" line)

eBooks often have multiple paper sources.

...

b) easy for readers/mirror sites/republishers to send corrections back to the source (PG &/| the producers)

There is already an email address for errors in the eBooks, not to mention bugs@pglaf.org and my own email address. You can pretty much send error messages to ANY PG address and they will be fixed.

...

c) not OK to drop this info from PG ebooks when they are republished

As in earlier messages, we only have something to say if they use the PG trademark. mh

Jim Tinsley

9:03 p.m.

On Tue, 28 Dec 2004 03:49:41 -0800 (PST), Michael Hart <hart@pglaf.org> wrote:

...

How many of you have tried Google Print?

Have you noticed that the initial offering of eBooks strongly resembles the Project Gutenberg catalogue???

This and some responses made me think that some people are thinking along the lines that they are using our texts in some way, so I checked it out. I figure that the answer is no, to both the explicit and implied questions.

I started by searching for quotes from 20 etexts chosen at random from etext99, as follows:

book "cardinals, abbots, councillors, legates, bishops, princes"
book "indeed we be no fatted bullocks, we two"
book "Est-ce que je ne connais pas mon filleul?"
book "Suchet's head-quarters at that time was the old palace of the"
book "She always has this man of letters of hers on her"
book "Afterwards," he answered quickly. "A cursed gutta serena."
book "himself with the people, he partially recognizes the truth of his words."
book "Epistles are spurious, as that the Republic, the Timaeus, and the Laws"
book "You may recall that our mutual and dear friend, old Allan Quatermain,"
book "Where rose the husbandman's abode,"
book "the felicity of his fellow beings, and sit down darkling"
book "by a tub, artesian cold, and a loud and joyous singing of"
book "As desires of waking hours are answered in sleep,"
book "Even while speaking at random, perhaps the better to hide"
book "Calm and proud, Tartarin of Tarascon marched on in the night"
book "Another fallacy is produced which turns on the absoluteness of"
book "The evidence for the steadily growing danger of secession"
book "Morose-minded people may complain of this; for myself I regard it"
book "THAT old bell, presage of a train, had just"

All of them returned normal search results, including a few from PG, but only the second (Jungle Book 2) offered a Google Print link. (Incidentally, for those who want to try, I find that preceding your search term with "book" will often produce a Google Print link when the bare search term doesn't.)

A search for "book Tarzan" yielded, in Print results:

Tarzan of the Apes - by Edgar Rice Burroughs - 320 pages
Human Computer Interaction - edited by Julie Jacko, Constantine Stephanidis - 1348 pages
C Primer Plus - by Stephen Arata, Stephen Prata, Kathleen Prata - 970 pages

Not what I'd consider a typical PG search result! :-)

"book barsoom" and "book mars" did even less well. No sign of the ERB series.

Erewhon, Alice, Little Women, Oliver Twist, Tom Sawyer, Huck Finn, Zenda, Decline and Fall, at least some Sherlock Holmes, Last of the Mohicans, several from Plato and at least most of Shakespeare, are present. Richard Feverel is there, but Shagpat is nowhere. Tom Swift is AWOL. Tartarin of Tarascon can't be found. John Carter is once again mysteriously missing. Kai Lung has effaced himself into invisibility.

And in the process of searching for these, I turned up about twice as many modern as pre-23 book titles.

The page images I looked at are all from modern reprints, with "Copyrighted Material" tags on their sides. I imagine that the publishers would insist on this, which makes much sense of Google wanting to work with a collection of PD books from libraries.

This pattern is, I think, consistent with what book publishers might be willing to provide. Any list of books drawn up by English speakers is going to have the most popular classics on it. An awful lot of the search results I found were from Penguin Classics, so it may well be that they simply have the whole Penguin Classics range. If so, a significant overlap with PG is inevitable. And the Google Print entries seem to have a lot more modern books than classics.

Hmmm. Interesting. The only Tarzan link for Google Print is "Tarzan of the Apes", and the only Tarzan search result at the Penguin Classics site is, guess what? "Tarzan of the Apes". And Penguin Classics does not publish the Barsoom series. "Coincidence? I think not!"

Interesting: both the search

book "she could have seen through a pair of stove-lids just as well."

and

book "A robber is more high-toned"

find Tom Sawyer in Google Print, and

book "Christmas won't be Christmas without any presents,"

finds Little Women, but

book "Papa was a pickle bottle"

doesn't, and

book Little Women pickle

does find the book, but with the word pickle much further down in the book. Hmmm, I see. The text in the Google Print image reads "pa was a pickle-bottle" instead. So much for any thought of them using our text.

The larger reason that they can't be using our text is that their search results point to page images, with the search term highlighted in yellow. You really couldn't do that unless you had mapped your text to the dimensions and placing of the image: it would be vastly easier to do it programmatically from the OCR process than to use an outside text.
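The phrase test above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not anything Google or PG is known to run: the function names are made up, and the two one-line "editions" are stand-ins built from the phrases quoted in this message.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace, but keep punctuation such as hyphens."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def contains_phrase(edition_text: str, phrase: str) -> bool:
    """True if the phrase occurs verbatim (after normalization) in the edition."""
    return normalize(phrase) in normalize(edition_text)

# Stand-in excerpts (assumption: one line each, taken from the phrases quoted above).
pg_edition = "Papa was a pickle bottle"
print_edition = "pa was a pickle-bottle"

probe = "Papa was a pickle bottle"
print(contains_phrase(pg_edition, probe))     # the PG wording matches itself
print(contains_phrase(print_edition, probe))  # the page-image wording does not match
```

An exact-phrase probe that hits one edition and misses the other suggests the two texts come from different sources; a single probe proves little, so in practice you would try several.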

...

We'd love to hear your experiences with Google Print.

It will be handy, though probably not as handy as Amazon, for confirming unclear corrections in some older texts. They've somewhat protected their page images from downloading by the casual browser, but it's easy to bypass that. The more significant restriction is the number of pages any one session is allowed to download. This seems, to me, a reasonable compromise for genuinely-copyrighted books, though an annoyance on these reprints where the main story is in the PD and only the bookends are in copyright. It'll be interesting to see what they do with 100% pre-23 guaranteed content. jim

Greg Newby

11:39 p.m.

On Thu, Dec 30, 2004 at 04:03:18PM -0500, Jim Tinsley wrote:

...

On Tue, 28 Dec 2004 03:49:41 -0800 (PST), Michael Hart <hart@pglaf.org> wrote:

...
How many of you have tried Google Print?

Have you noticed that the initial offering of eBooks strongly resembles the Project Gutenberg catalogue???

This and some responses made me think that some people are thinking along the lines that they are using our texts in some way, so I checked it out. I figure that the answer is no, to both the explicit and implied questions.

I started by searching for quotes from 20 etexts chosen at random from etext99, as follows: ...

Fascinating analysis, thanks. Just a quick note that Google only indexes the first 100 or 150K of eBooks (they didn't give me a firm number, but confirmed there was a limit). This means that quotes from later parts of our eBooks > ~150K won't be found. -- Greg

Jim Tinsley

11:49 p.m.

On Thu, Dec 30, 2004 at 03:39:23PM -0800, Greg Newby wrote:

...

Just a quick note that Google only indexes the first 100 or 150K of eBooks (they didn't give me a firm number, but confirmed there was a limit). This means that quotes from later parts of our eBooks > ~150K won't be found.

This is true for our books, as searched for by Google in general, like any other page, but it is not true for the Google Print search results; when they search Google Print, they do search the whole text, regardless of length. I did confirm this by searching for quotes that were near the ends of books. jim
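A toy model of the difference, for anyone who wants to see it concretely. Everything here is an assumption for illustration: the 150,000-character cutoff is just one reading of "100 or 150K" (no firm number was given), the function names are invented, and the "book" is synthetic filler.

```python
INDEX_LIMIT = 150_000  # assumed cutoff in characters; the real limit was never confirmed

def found_by_general_index(book_text: str, quote: str, limit: int = INDEX_LIMIT) -> bool:
    """Simulate an index that only sees the first `limit` characters of a file."""
    return quote in book_text[:limit]

def found_by_full_text_index(book_text: str, quote: str) -> bool:
    """Simulate an index that covers the whole text, regardless of length."""
    return quote in book_text

# Synthetic "book": one quote near the start, one near the end, filler in between.
filler = "lorem ipsum " * 20_000          # 240,000 characters of padding
book = "early quote " + filler + "late quote"

print(found_by_general_index(book, "early quote"))    # within the cutoff
print(found_by_general_index(book, "late quote"))     # beyond the cutoff
print(found_by_full_text_index(book, "late quote"))   # whole text is searched
```

Searching for quotes near the ends of long books is the same check done by hand: a hit means the index is not truncated.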

Michael Hart

31 Dec

6:59 p.m.

On Thu, 30 Dec 2004, Jim Tinsley wrote:

...

On Thu, Dec 30, 2004 at 03:39:23PM -0800, Greg Newby wrote:

...
Just a quick note that Google only indexes the first 100 or 150K of eBooks (they didn't give me a firm number, but confirmed there was a limit). This means that quotes from later parts of our eBooks > ~150K won't be found.

This is true for our books, as searched for by Google in general, like any other page, but it is not true for the Google Print search results; when they search Google Print, they do search the whole text, regardless of length. I did confirm this by searching for quotes that were near the ends of books.

jim

Aren't the PG eBooks already in Google Print? That's what I heard, so I would have figured they would have re-indexed them to make them complete??? I wonder if they left the old files, and just are making new ones, still from PG eBooks? If so, how would you tell the difference? mh

Jim Tinsley

8:19 p.m.

On Fri, 31 Dec 2004 10:59:51 -0800 (PST), Michael Hart <hart@pglaf.org> wrote:

...

Aren't the PG eBooks already in Google Print?

No. Definitively, no. That is one of the things my experiments demonstrated (see "pickle-bottle").

Our texts, or at least, as Greg says, the first 100K or so of them, are indexed in Google, and Yahoo!, and other search engines. But that's Google, not Google Print. Google Print is a NEW content source. The content for Google Print is not directly available on the web now; it is held internally by Google.

I have no inside information, but I think that my reconstruction below, based on my actually trying the thing, is pretty close.

1. Google agree with Penguin Classics, among others, that they can use their publications in Google Print.

2. Penguin Classics, et al., ship Google a copy of every book they currently have in print (which is covered by this agreement -- I imagine there may be some restrictions).

3. Google cut the pages ('cos the scans are just _beautiful_!) and scan the pages of the books into images.

4. Google run OCR on the pages. Along with every word, they store its position in the image. Like: the word "poorer" is on page 62, in a box 1.1 cm wide and 0.4 cm high whose top left corner is 4.2 cm from the top of the page and 3.1 cm from the left margin, . . . except I'm sure they're not using cm. as their unit. Abbyy does this in its internal files it saves, so it wouldn't shock me to find that they're using Abbyy for OCR.

5. Google resize and transform the images to JPEG for display. (I can't prove that they didn't start with JPEGs of that size, but I think it's likely that they scanned at 600 or higher initially.)

6. Google store the OCRed text, complete with the co-ordinates of each word on the pages where it appears, and index that OCRed text. They also store the JPEG images. Because they know that all the text in a book is useful (and that a book is of a finite size!) they store _all_ of the text of each book, not just the first 100K.

7. When a Google search is run, not only the main Google index is searched, but also the Google Print OCR text.

8. If the search returns results from Google Print, they are displayed on the search results page, along with the main Google results.

9. If a user clicks on a Google Print result, they are brought to the first page image -- the JPEG file -- where that search term is found in the OCRed text. When the page image is displayed, the search term is highlighted in yellow, using the co-ordinates captured at OCR time. (Actually, what is shown is the page image without the yellow, as I demonstrated by viewing the page images directly, with the HTML created dynamically to overlay yellow at the appropriate co-ordinates.)

10. The user can then browse back and forth, with limitations, through the page images.

11. The text that Google OCRed is never actually displayed as text, or HTML; it is used only to find the right page and highlight the search term.
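Steps 4, 6 and 9 of the reconstruction can be sketched as follows. This is a guess at the general shape, not Google's or Abbyy's actual format: the `WordBox` record, the function names, the centimetre units and the overlay markup are all hypothetical, and only the "poorer" box on page 62 comes from the example above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WordBox:
    """One OCRed word plus where it sits on the page image (units assumed to be cm)."""
    word: str
    page: int
    left: float
    top: float
    width: float
    height: float

def build_index(boxes):
    """Map each lowercased word to every box where it occurs (step 6)."""
    index = {}
    for box in boxes:
        index.setdefault(box.word.lower(), []).append(box)
    return index

def first_hit_page(index, term):
    """Lowest page number containing the term, or None if it never occurs (step 9)."""
    hits = index.get(term.lower(), [])
    return min((b.page for b in hits), default=None)

def highlight_boxes(index, term, page):
    """Boxes to paint yellow when the given page image is shown."""
    return [b for b in index.get(term.lower(), []) if b.page == page]

def overlay_html(box):
    """Hypothetical markup for a translucent yellow rectangle over the page image."""
    return (f'<div style="position:absolute; left:{box.left}cm; top:{box.top}cm; '
            f'width:{box.width}cm; height:{box.height}cm; '
            f'background:yellow; opacity:0.4"></div>')

boxes = [
    WordBox("poorer", 62, left=3.1, top=4.2, width=1.1, height=0.4),  # from step 4
    WordBox("richer", 62, left=5.0, top=4.2, width=1.0, height=0.4),  # invented
    WordBox("poorer", 140, left=2.0, top=9.5, width=1.1, height=0.4), # invented
]
index = build_index(boxes)
page = first_hit_page(index, "Poorer")
for box in highlight_boxes(index, "poorer", page):
    print(page, overlay_html(box))
```

The point of the sketch is that the OCR text only ever drives page lookup and box placement; nothing in it needs to display the text itself, which matches step 11.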

...

That's what I heard,

Then I feel quite certain that you heard wrong.

...

so I would have figured they would have re-indexed them to make them complete???

I wonder if they left the old files, and just are making new ones, still from PG eBooks?

If so, how would you tell the difference?

If they were using our texts, which I am quite sure they are not, we could tell the difference by seeing whether their text was the same as our text. I do that quite a lot when checking out corrections to our texts, and I can actually reel off various errors in various editions of e-texts around the web by now.

Their page images, and their search index, do not contain the same words as our texts. My "pickle-bottle" example is the least demonstration of that: many of the Penguin Classics they have in Google Print include introductions that we do not have. And, remember, they never display text: they _only_ display page images.

No, I conclude that Google Print overlaps not at all with PG, except that we both have (different editions of) a large number of classic books.

jim

Philip Baker

1:12 a.m.

Jim Tinsley <jtinsley@pobox.com> wrote:

...

(Incidentally, for those who want to try, I find that preceding your search term with "book" will often produce a Google Print link when the bare search term doesn't.)

A few days ago Steve Thomas gave the following link to an article in the San Francisco Chronicle:

http://www.sfgate.com/cgi-bin/article.cgi?f=/c/a/2004/12/20/BUGROAD6QT1.DTL

In the article it says:

Typing in "book" and any search term within the Google window generates a "Book results" listing if a match of the search term is made within an indexed book. These results can be clicked to read excerpts from the book.

Looks as if this may develop into a 'book: key-words' type search which will only search Google Print.

--
Philip Baker
