Re: Re: [gutvol-d] !@!Googleberg eBooks

Of particular concern to PG volunteers will be the clarity of the page scans of Google Print's public domain works, which will mainly come from the academic libraries' rare books archives. As far as I am concerned this is the best potential of Google Print--to make works available that 99.999% of the population never had access to. (How important, really, is it to look at just a few pages of a book that is in most public libraries and many book stores?) I doubt those libraries will allow Google to cut up the books--and rightly so--therefore the quality of the images may not be as good. Although Google makes it difficult to download these page images, we all know that where there is a will, there is a way. And perhaps some PG volunteers will use these page scans for a real e-book. Better scans would make for easier transcriptions.
From the Google Print side, worse scans would probably cause more errors in their behind-the-scenes OCR database linked to each page scan--making searches of these pages less accurate. Hopefully for researchers, this increase in error rate will be a fraction of a percent, but who knows?
-----Original Message-----
From: juliet.sutherland@verizon.net
Sent: Dec 31, 2004 11:09 PM
To: Project Gutenberg Volunteer Discussion <gutvol-d@lists.pglaf.org>
Subject: Re: Re: [gutvol-d] !@!Googleberg eBooks
3. Google cut the pages ('cos the scans are just _beautiful_!) and scan the pages of the books into images.
As I've previously noted, destructive scanning of modern reprints is easy and usually results in good images and good OCR.
---------------------------
Dennis McCarthy
nihil_obstat@mindspring.com

From the Google Print side, worse scans would probably cause more errors in their behind-the-scenes OCR database linked to each page scan--making searches of these pages less accurate. Hopefully for researchers, this increase in error rate will be a fraction of a percent, but who knows?
Once upon a time, I had access to JSTOR, and frequently browsed through their scans of old (as in 17th-18th century) Philofophical Tranfactions. From the glimpses (mostly from the text surrounding my search terms) I got of the underlying OCR text, I came to the conclusion that even error-ridden OCR is good enough to return keyword search results of non-embarrassing calibre. And I can well imagine that some sort of fuzzy term matching to compensate for the most common known scanno themes could be employed to make raw OCR very suitable for keyword searches. -- RS
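A minimal sketch of the kind of fuzzy term matching suggested above, assuming a small hand-made table of common scannos. The table entries and the function names (normalize, build_index, search) are illustrative assumptions, not anything JSTOR or Google Print is known to use. Both the OCR tokens and the query are folded to the same normalized form, so a search for the correct spelling still finds the misread word.

```python
import re

# Hypothetical table of common OCR confusions ("scannos"); illustrative only.
# Each pair maps a frequently misread sequence to a canonical form.
CONFUSIONS = [
    ("rn", "m"),   # "rn" and "m" are easily confused
    ("vv", "w"),
    ("ſ", "s"),    # long s
    ("f", "s"),    # long s is often read as "f" in old print
    ("1", "l"),
    ("0", "o"),
]

def normalize(term):
    """Fold a term to a canonical form so known scanno variants collide."""
    folded = term.lower()
    for wrong, right in CONFUSIONS:
        folded = folded.replace(wrong, right)
    return folded

def build_index(ocr_text):
    """Map each normalized token to the word positions where it occurs."""
    index = {}
    for position, token in enumerate(re.findall(r"\w+", ocr_text)):
        index.setdefault(normalize(token), []).append(position)
    return index

def search(index, query):
    """Return word positions whose normalized form matches the query's."""
    return index.get(normalize(query), [])

raw_ocr = "Philofophical Tranfactions on rnodern optics"
index = build_index(raw_ocr)
print(search(index, "Philosophical"))
print(search(index, "modern"))
```

The folding is deliberately lossy: it trades precision for recall, so "modern" and "modem" (or "fame" and "same") would collide. For keyword search over page scans that may be an acceptable price, since the reader can check each hit against the page image.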

On Sat, 1 Jan 2005, Dennis McCarthy wrote:
Of particular concern to PG volunteers will be the clarity of the page scans of Google Print's public domain works, which will mainly come from the academic libraries' rare books archives.
Yes, one concern we all have is how good Googleberg's scans will be. Will they give us access to the best hi-res scans? Or only to something that is easy on their storage and bandwidth, and consequently not so good for OCR? [I'm guessing they will NOT make the best materials available to all, either in the case of raw scans or the OCRed full text files.]
As far as I am concerned this is the best potential of Google Print--to make works available that 99.999% of the population never had access to.
Of course, this raises the question of whether 99.999% of the population WANTS access to these books. . .a question I raised earlier. . .will the Googleberg collection be so stilted that it is mostly for scholars?
(How important, really, is it to look at just a few pages of a book that is in most public libraries and many book stores?)
Well. . .this brings us to the entire point of why have Project Gutenberg? Why give people an entire home library of eBooks that are "in most public libraries and many book stores?" * I say the answer is simply individual access rather than public access. Of course, Ray Bradbury VIOLENTLY disagrees with me here, and I understand why he does. . .he believes in the social experience of libraries.
I doubt those libraries will allow Google to cut up the books--and rightly so--therefore the quality of the images may not be as good. Although Google makes it difficult to download these page images, we all know that where there is a will, there is a way. And perhaps some PG volunteers will use these page scans for a real e-book. Better scans would make for easier transcriptions.
I'm betting Googleberg will store the hi-res scans offline, hidden behind some VERY powerful security. As for cutting up the books, some are, some aren't. Michael
participants (3): Dennis McCarthy, Michael Hart, Robert Shimmin