
My success rate with Google PD books is about 30%. Some books are so dark that they are unreadable, let alone fit for OCR; these seem to appear as JPEGs. Others have whole sides of pages clipped off throughout the entire book. When the images are pretty good, they seem to appear as PNGs, and I have found most of the books with the png extension to be pretty good. All seem to have the occasional missing page. I have sent many error reports to Google and get a nice canned reply, but no improvement in the output is visible, nor is there any further feedback.

I have found these books most useful when I already have a copy of the book and can use the Google scan to help speed up the scanning/OCR process. In fact, I don't see how DP is coping with these Google texts, given their now stricter requirement that a perfect scan of every page and illustration be provided before the book can even get into their processing queue. A missing part of a page or an illegible word cannot be corrected from another edition, due to their high standard of perfection. With the average book now requiring 2 years to go through their four levels of proofreading, one does wonder.

nwolcott2@post.harvard.edu

----- Original Message -----
From: "Frank van Drogen" <fvandrog@scripps.edu>
To: "Project Gutenberg Volunteer Discussion" <gutvol-d@pglaf.org>
Sent: Monday, May 22, 2006 5:19 PM
Subject: re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries
it's clear that google has gotten their legs under them in regard to doing the scanning. let's hope that they'll get their quality-control under control very soon too...
I have found fewer missing pages and other problems in books from Google than in those from the MBP and Canadian/IA. They are, however, still far from perfect. When they get a report regarding a missing or wrongly scanned page in a PD book, it is apparently up to the providing library to get the problem sorted out. I've heard reports of complete books being rescanned (with the risk of having another page missing in the end ;) ). I've also heard somebody mention that the full rescanned book was stuck behind the existing one (rather space consuming, but for DP purposes a lot safer).
What worries me in this is that Google doesn't seem to care whether pages are missing or not... as long as they get 99% of the pages from a book stored, chances are most search terms pointing to the particular book will be identified. Their interest lies in people purchasing the book via Amazon, Abe etc. after identifying it via book.google.com.
The best quality control I have encountered so far is on Gallica, where, apart from pages that were already missing in the original scanned manuscript, I've not encountered incomplete books. It'd actually be interesting to see how they perform their quality control.
Frank
_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/listinfo.cgi/gutvol-d