Re: [gutvol-d] Google print

12 Dec 2005


A little more fiddling around with Google Print. Although most of the
images seem to be png files, one book was very dark jpeg files, so there is
not perfect consistency. If you find a bad page there is a drop-down box to
check off the salient errors. Whether they do anything with that is of
course another question. Another feature is the "search in this document"
box. If you clear the box, put in say "243", and search, page 243 will
come up. Google has OCR'ed the book well enough for this kind of text
search to work. The OCR text itself is not visible, however. Similarly,
"IX" would find chapter IX, etc. Other than this you can only go forward
and back a page at a time.

Norm Wolcott  nwolcott2@post.harvard.edu
----- Original Message -----
From: "N Wolcott" <nwolcott@dsdial.net>
To: "Project Gutenberg Volunteer Discussion" <gutvol-d@lists.pglaf.org>
Sent: Monday, December 05, 2005 5:26 PM
Subject: Re: [gutvol-d] Google print
...
I found that I could search for an author (Verne) using the Advanced Search
on books.google.com and choosing dates 1850-1920. Got 5 hits including
Mathias Sandorf. Images seemed to be 72 dpi png, but they were in color, so
I imagine suitable smoothing etc. could modify them to 150 b/w. I believe
Capio does such things. Still lots of work. I'll stick with Brewster for a
while. I don't know if ABBYY OCR takes advantage of the color information.
I'll try one of the scripts for a test. Thanks.
Norm Wolcott  nwolcott2@post.harvard.edu
----- Original Message -----
From: "Jon Ingram" <jon.ingram@gmail.com>
To: "Project Gutenberg Volunteer Discussion" <gutvol-d@lists.pglaf.org>
Sent: Monday, December 05, 2005 4:30 AM
Subject: Re: [gutvol-d] Google print
...
On 12/5/05, N Wolcott <nwolcott@dsdial.net> wrote:
...
Now that the dust has cleared, can us proles have the final info: can one
download scans from google print, what is the best way, are they holding
back, what is the resolution, can one search (other than for the dejavu
image), etc. etc. Are there P2P networks to share stuff cribbed from google?
Talking about PD images of course.
Google does not use dejavu -- that's the Internet Archive. Google
presents a fairly small jpeg image for each page of the book. There's
no fixed resolution, but instead a fixed width. This means that the
images generated from small books are relatively easy to OCR (and
equate to around 100 dpi), while the images from books with large
pages are hard for even humans to read.  Those of you who can get
access to Google Print should be able to download these 'web
resolution' images from them just by right-clicking and saving. As far
as I know there's no way to access the higher resolution images they
must have made when they originally scanned the material; nor is there
any way to access the OCRed text they use for searching purposes.
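
To make the fixed-width point concrete, here is a small arithmetic sketch. The 500-pixel width and the page sizes are made-up numbers chosen only for illustration, not measurements of Google's images, and `effective_dpi` is a hypothetical helper name.

```python
def effective_dpi(image_width_px: int, page_width_in: float) -> float:
    """Resolution implied by a page image of fixed pixel width."""
    if page_width_in <= 0:
        raise ValueError("page width must be positive")
    return image_width_px / page_width_in


# Hypothetical figures: the same pixel width applied to two page sizes.
FIXED_WIDTH_PX = 500
for label, width_in in [("small book", 5.0), ("large book", 10.0)]:
    print(f"{label}: {effective_dpi(FIXED_WIDTH_PX, width_in):.0f} dpi")
```

With these assumed numbers the small book comes out at 100 dpi and the large one at 50 dpi, which is why the large-page images are so much harder to read.
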
Google provides no mechanism to download all the images for a book.
You'll have to roll your own download script, or use one of the
scripts written by others, such as the perl script gharvest, available
from
  http://www.zuhause.org/dp/gharvest
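
For anyone rolling their own, a minimal Python sketch of the download loop might look like the following. This is not how gharvest works (I have not reproduced it here); the function names, the file naming scheme, and the one-second delay are all assumptions, and you would have to supply the list of page-image URLs yourself, since no URL pattern is given above.

```python
import time
import urllib.request
from pathlib import Path
from typing import Callable, Iterable, List


def fetch_url(url: str) -> bytes:
    """Download one URL and return its raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def save_pages(
    page_urls: Iterable[str],
    dest_dir: str,
    fetch: Callable[[str], bytes] = fetch_url,
    delay_s: float = 1.0,
    extension: str = "png",
) -> List[Path]:
    """Save each page image as page_0001.<ext>, page_0002.<ext>, ...

    Files that already exist are skipped, so an interrupted run can be
    restarted without fetching everything again.
    """
    out = Path(dest_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for number, url in enumerate(page_urls, start=1):
        target = out / f"page_{number:04d}.{extension}"
        if not target.exists():
            target.write_bytes(fetch(url))
            time.sleep(delay_s)  # assumed pause between requests, to be polite
        saved.append(target)
    return saved
```

The `fetch` parameter is there only so the loop can be tried out with a stand-in function before pointing it at a real server.
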
Google also provides no index to the material they have scanned.
Several people have generated one by the crude means of searching for
many different phrases, and storing the results. The most extensive
list is probably also Bruce's, available from
  http://www.zuhause.org/dp/gfound1.html
I've used this as a basis for a page showing the DP harvesting status
of the material:
  http://homepage.ntlworld.com/jenjonliz/jon/tia/google.html
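
The "search for many phrases and store the results" approach to building an index can be sketched roughly as below. This is only an illustration of the idea, not the code behind the list above; `search` is a hypothetical stand-in for whatever actually queries the site, and the book ids and titles in the demo are made up.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Hit = Tuple[str, str]  # (book id, title); an assumed shape for a search hit


def build_index(
    phrases: Iterable[str],
    search: Callable[[str], List[Hit]],
) -> Dict[str, str]:
    """Run every phrase through `search` and merge the hits by book id."""
    found: Dict[str, str] = {}
    for phrase in phrases:
        for book_id, title in search(phrase):
            # The same book turns up under many phrases; keep one entry.
            found.setdefault(book_id, title)
    return found


def demo_search(phrase: str) -> List[Hit]:
    """Canned results standing in for a real query (made-up data)."""
    canned = {
        "the sea": [("id1", "Book A"), ("id2", "Book B")],
        "the island": [("id2", "Book B"), ("id3", "Book C")],
    }
    return canned.get(phrase, [])


index = build_index(["the sea", "the island", "no hits"], demo_search)
print(len(index), "distinct books found")
```

The more varied the phrases, the more of the collection such a list will cover, though it can never be shown to be complete.
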
--
Jon Ingram
_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/listinfo.cgi/gutvol-d