
A little more fiddling around with Google Print. Although most of the images seem to be PNG files, one book was made up of very dark JPEG files, so there is not perfect consistency. If you find a bad page there is a drop-down box to check off the salient errors. Whether they do anything with that is of course another question.

Another feature is the "search in this document" box. If you clear the box, put in say "243", and search, page 243 will come up. Google has OCR'ed the book to the point that this text search works. The OCR version is not visible, however. Similarly, "IX" would find chapter IX, etc. Other than this you can only go forward and back a page at a time.

Norm Wolcott
nwolcott2@post.harvard.edu

----- Original Message -----
From: "N Wolcott" <nwolcott@dsdial.net>
To: "Project Gutenberg Volunteer Discussion" <gutvol-d@lists.pglaf.org>
Sent: Monday, December 05, 2005 5:26 PM
Subject: Re: [gutvol-d] Google print
I found that I could search for an author (Verne) using the Advanced Search on books.google.com and choosing dates 1850-1920. Got 5 hits, including Mathias Sandorf. Images seemed to be 72 dpi PNG, but they were in color, so I imagine suitable smoothing etc. could modify them to 150 dpi b/w. I believe Capio does such things. Still lots of work. I'll stick with Brewster for a while. I don't know if ABBYY OCR takes advantage of the color information. I'll try one of the scripts for a test. Thanks.

Norm Wolcott
nwolcott2@post.harvard.edu

----- Original Message -----
From: "Jon Ingram" <jon.ingram@gmail.com>
To: "Project Gutenberg Volunteer Discussion" <gutvol-d@lists.pglaf.org>
Sent: Monday, December 05, 2005 4:30 AM
Subject: Re: [gutvol-d] Google print
On 12/5/05, N Wolcott <nwolcott@dsdial.net> wrote:
Now that the dust has cleared, can us proles have the final info -- can one download scans from Google Print, what is the best way, are they holding back, what is the resolution, can one search (other than for the DjVu image), etc. etc.? Are there P2P networks to share stuff cribbed from Google? Talking about PD images of course.
Google does not use DjVu -- that's the Internet Archive. Google presents a fairly small JPEG image for each page of the book. There's no fixed resolution, but instead a fixed width. This means that the images generated from small books are relatively easy to OCR (and equate to around 100 dpi), while the images from books with large pages are hard for even humans to read. Those of you who can get access to Google Print should be able to download these 'web resolution' images just by right-clicking and saving. As far as I know there's no way to access the higher resolution images they must have made when they originally scanned the material; nor is there any way to access the OCRed text they use for searching purposes.
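To make the fixed-width point concrete, here is a minimal Python sketch of the arithmetic: the effective resolution is just the image width in pixels divided by the physical page width in inches. The 500-pixel width and the two page sizes are assumptions chosen for illustration; the thread does not state the actual pixel width Google serves.

```python
def effective_dpi(image_width_px: int, page_width_in: float) -> float:
    """Pixels per inch when a page of the given physical width
    is rendered at a fixed pixel width."""
    if page_width_in <= 0:
        raise ValueError("page width must be positive")
    return image_width_px / page_width_in


# Hypothetical fixed width; the real value is not given in this thread.
FIXED_WIDTH_PX = 500

# Hypothetical page widths in inches for a small and a large book.
for label, width_in in [("small book", 5.0), ("large folio", 12.0)]:
    dpi = effective_dpi(FIXED_WIDTH_PX, width_in)
    print(f"{label}: {dpi:.0f} dpi")
```

With these assumed numbers the small book works out to 100 dpi and the large one to roughly 42 dpi, which matches the observation that large pages are much harder to read or OCR at the same pixel width.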
Google provides no mechanism to download all the images for a book. You'll have to roll your own download script, or use one of the scripts written by others, such as the Perl script gharvest, available from
http://www.zuhause.org/dp/gharvest

Google also provides no index to the material they have scanned. Several people have generated one by the crude means of searching for many different phrases and storing the results. The most extensive list is probably also Bruce's, available from
http://www.zuhause.org/dp/gfound1.html

I've used this as a basis for a page showing the DP harvesting status of the material:
http://homepage.ntlworld.com/jenjonliz/jon/tia/google.html
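For anyone rolling their own, the core of a download script might look something like the Python sketch below. This is not gharvest and does not reproduce how it works. It assumes you already have a list of per-page image URLs obtained by some other means; the names `save_pages`, `fetch_url`, and the `page_0001.png` naming scheme are hypothetical, and nothing here reflects Google's actual URL layout.

```python
import os
import time
import urllib.request
from typing import Callable, Iterable, List


def fetch_url(url: str) -> bytes:
    """Fetch one URL and return the raw bytes."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def save_pages(page_urls: Iterable[str], out_dir: str,
               fetch: Callable[[str], bytes] = fetch_url,
               ext: str = "png", delay_s: float = 1.0) -> List[str]:
    """Save each page image as a zero-padded, sequentially numbered file.

    page_urls is assumed to be in page order; how to obtain it is
    outside the scope of this sketch.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for number, url in enumerate(page_urls, start=1):
        path = os.path.join(out_dir, f"page_{number:04d}.{ext}")
        if not os.path.exists(path):  # skip pages already on disk, so a rerun resumes
            data = fetch(url)
            with open(path, "wb") as handle:
                handle.write(data)
            time.sleep(delay_s)  # pause between requests
        saved.append(path)
    return saved
```

Passing the fetch function as a parameter keeps the numbering and resume logic testable without touching the network, and the zero-padded names keep the pages in order for whatever OCR or proofing step comes next.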
-- Jon Ingram
_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/listinfo.cgi/gutvol-d