re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries

frank said:
Even that number is a misinterpretation
thanks for clearing that up for us, frank... it's clear that google has gotten their legs under them in regard to doing the scanning. let's hope that they'll get their quality-control under control very soon too... it is important to keep in mind that 100,000 books is <1% of the 10.5 million (or more) they'll do eventually; it's understandable if the process isn't up to speed yet. -bowerbird

it's clear that google has gotten their legs under them in regard to doing the scanning. let's hope that they'll get their quality-control under control very soon too...
I have found fewer missing pages and other problems in books from Google than in those from the MBP and Canadian/IA. They are, however, still far from perfect. When they get a report regarding a missing or wrongly scanned page in a PD book, it is apparently up to the providing library to get the problem sorted out. I've heard reports of complete books being rescanned (with the risk of having another page missing in the end ;) ). I've also heard somebody mention that the full rescanned book was stuck behind the existing one (rather space consuming, but for DP purposes a lot safer).

What worries me in this is that Google doesn't seem to care whether pages are missing or not... as long as they get 99% of the pages from a book stored, chances are most search terms pointing to the particular book will be identified. Their interest lies in people purchasing the book via Amazon, Abe, etc. after identifying them via book.google.com.

The best quality control I have encountered so far is on Gallica, where, apart from pages missing because they were missing in the original scanned manuscript, I've not encountered incomplete books. It would actually be interesting to see how they perform their quality control.

Frank

After demonstrating PGDP to some people, I got in touch with an NGO that would like to scan its entire holdings and make them available on the web. Has anybody on this list experience with outsourcing scanning jobs (on a larger scale)? I am looking at a project which includes about half a million pages that need to be digitized. Of course I am not going to scan that much myself, and I heard prices at that scale can be as low as a few cents per page when done in the Philippines. Has anybody prepared documents describing quality control processes, etc., for such a bulk process? Hopefully, much of the material will be made available on-line (although it will not be copyright-cleared with PG, I don't expect issues with copyright). I may even set up a 'Distributed Proofreading' system for it.

Jeroen.

Several people I know have tried outsourcing scanning, OCR, etc., but all with disappointing results. Sorry,

mh

On Mon, 22 May 2006, Jeroen Hellingman (Mailing List Account) wrote:
After demonstrating PGDP to some people, I got in touch with an NGO that would like to scan its entire holdings and make them available on the web. Has anybody on this list experience with outsourcing scanning jobs (on a larger scale)? I am looking at a project which includes about half a million pages that need to be digitized. Of course I am not going to scan that much myself, and I heard prices at that scale can be as low as a few cents per page when done in the Philippines. Has anybody prepared documents describing quality control processes, etc., for such a bulk process? Hopefully, much of the material will be made available on-line (although it will not be copyright-cleared with PG, I don't expect issues with copyright). I may even set up a 'Distributed Proofreading' system for it.
Jeroen.

On Mon, May 22, 2006 at 11:46:04PM +0200, Jeroen Hellingman (Mailing List Account) wrote:
After demonstrating PGDP to some people, I got in touch with an NGO that would like to scan its entire holdings and make them available on the web. Has anybody on this list experience with outsourcing scanning jobs (on a larger scale)? I am looking at a project which includes about half a million pages that need to be digitized. Of course I am not going to scan that much myself, and I heard prices at that scale can be as low as a few cents per page when done in the Philippines. Has anybody prepared documents describing quality control processes, etc., for such a bulk process? Hopefully, much of the material will be made available on-line (although it will not be copyright-cleared with PG, I don't expect issues with copyright). I may even set up a 'Distributed Proofreading' system for it.
Jeroen.
Brewster Kahle of the Internet Archive did this for many thousands of titles (we harvested quite a few). I'm sure he'd be able to provide more detail, though getting his attention can be difficult. It was not a cheap proposition, but the costs for scanning (done in India) were less than a few cents per page. Costs included book acquisition & shipping. Then, in India, there was oversight for quality control (which improved dramatically with more oversight). -- Greg
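For anyone budgeting a similar run, here is a rough back-of-envelope sketch of how such a per-page cost model adds up. Every rate and count below is a hypothetical placeholder for illustration, not the Internet Archive's (or anyone else's) actual figures:

    # Rough cost model for an outsourced scanning run.  All numbers are
    # hypothetical placeholders for illustration, not anyone's real rates.

    pages = 500_000            # roughly Jeroen's half-million-page estimate
    books = 2_000              # assuming an average of ~250 pages per book
    per_page_scan = 0.03       # assumed scanning + OCR rate, USD per page
    per_book_logistics = 2.00  # assumed acquisition + shipping, USD per book
    qc_overhead = 0.15         # assumed 15% surcharge for on-site QC oversight

    scanning = pages * per_page_scan
    logistics = books * per_book_logistics
    total = (scanning + logistics) * (1 + qc_overhead)

    print(f"scanning/OCR:             ${scanning:,.0f}")
    print(f"acquisition + shipping:   ${logistics:,.0f}")
    print(f"total incl. QC oversight: ${total:,.0f}")  # about $21,850 with these inputs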

On Mon, 22 May 2006, Frank van Drogen wrote:
it's clear that google has gotten their legs under them in regard to doing the scanning. let's hope that they'll get their quality-control under control very soon too...
After about 25% of their 6-year schedule to 10 million books, it would appear they are approaching 1%, or 100,000 total books, with perhaps half of those easily downloadable, but in varying states of completion and accuracy. If you presume they keep up with Moore's Law, 6 years looks like:

        Totals     Dates          Doublings   Years
             0     Dec 14, 2004       0        0
        50,000     Jun 14, 2006       1        1.5
       100,000     Dec 14, 2007       2        3
       200,000     Jun 14, 2009       3        4.5
       400,000     Dec 14, 2010       4        6

which continues as

       800,000     Jun 14, 2012       5        7.5
     1,600,000     Dec 14, 2013       6        9
     3,200,000     Jun 14, 2015       7       10.5
     6,400,000     Dec 14, 2016       8       12
    12,800,000     Jun 14, 2018       9       13.5

which would put them at over 12 years to their 10 million books in terms of downloadable eBooks. However, if you presume they have 100,000 by June 14, 2006, this would take 18 months off their total time, by counting non-downloadable and non-readable books.
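A minimal sketch of the doubling arithmetic behind that table, assuming 50,000 downloadable books 18 months after the December 14, 2004 start and a doubling of the running total every further 18 months (the figures are illustrative projections, not Google's actual counts):

    # Project a Moore's-Law-style doubling schedule for downloadable books.
    # Assumptions (illustrative only): 50,000 books at the 1.5-year mark,
    # doubling every further 18 months, with a 10,000,000-book target.

    TARGET = 10_000_000
    total, doublings = 50_000, 1

    while True:
        years = doublings * 1.5  # each doubling takes 18 months
        print(f"{total:>12,} books after {years:>4} years ({doublings} doublings)")
        if total >= TARGET:
            break
        total *= 2
        doublings += 1

    # The loop ends at 12,800,000 books after 13.5 years, so the 10-million
    # mark falls somewhere past the 12-year line -- the basis for the
    # "over 12 years" estimate above.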
I have found fewer missing pages and other problems in books from Google than in those from the MBP and Canadian/IA. They are, however, still far from perfect. When they get a report regarding a missing or wrongly scanned page in a PD book, it is apparently up to the providing library to get the problem sorted out. I've heard reports of complete books being rescanned (with the risk of having another page missing in the end ;) ). I've also heard somebody mention that the full rescanned book was stuck behind the existing one (rather space consuming, but for DP purposes a lot safer).
What worries me in this is that Google doesn't seem to care whether pages are missing or not... as long as they get 99% of the pages from a book stored, chances are most search terms pointing to the particular book will be identified. Their interest lies in people purchasing the book via Amazon, Abe, etc. after identifying them via book.google.com.
When your goal is simply the appearance of having a lot of books, 99% is a perfectly good business plan. And if your goal is to get people to BUY the books from your other business partners, then there is even less reason for moving to 99+%.
The best quality control I have encountered so far is on Gallica, where, apart from pages missing because they were missing in the original scanned manuscript, I've not encountered incomplete books. It would actually be interesting to see how they perform their quality control.
If you can give me any contact info on Gallica, I will see if I can find out for you. Thanks!!!

Give the world eBooks in 2006!!!

Michael S. Hart
Founder
Project Gutenberg
Blog at http://hart.pglaf.org

My success with Google PD books is about 30%. Some books are so dark they are unreadable, let alone OCRable; these seem to appear as JPEGs. Others have whole sides of pages clipped off throughout the entire book. When the images are pretty good they seem to appear as PNGs, and I have found most of the books with the PNG extension are pretty good. All seemed to have the occasional missing page. I have sent many errors in to Google and get a nice canned reply, but no improvement in the output is visible, nor any further feedback.

I have found these books most useful when I already have a copy of the book, and can use the Google scan to help speed up the scanning/OCR process. In fact I don't see how DP is coping with these Google texts, given their now stricter requirement that a perfect scan of every page and illustration must be provided before the book can even get into their processing queue. A missing part of a page or an illegible word cannot be corrected from another edition, due to their high standard of perfection. With the average book now requiring 2 years to go through their four levels of proofreading, one does wonder.

nwolcott2@post.harvard.edu

----- Original Message -----
From: "Frank van Drogen" <fvandrog@scripps.edu>
To: "Project Gutenberg Volunteer Discussion" <gutvol-d@pglaf.org>
Sent: Monday, May 22, 2006 5:19 PM
Subject: re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries
it's clear that google has gotten their legs under them in regard to doing the scanning. let's hope that they'll get their quality-control under control very soon too...
I have found fewer missing pages and other problems in books from Google than in those from the MBP and Canadian/IA. They are, however, still far from perfect. When they get a report regarding a missing or wrongly scanned page in a PD book, it is apparently up to the providing library to get the problem sorted out. I've heard reports of complete books being rescanned (with the risk of having another page missing in the end ;) ). I've also heard somebody mention that the full rescanned book was stuck behind the existing one (rather space consuming, but for DP purposes a lot safer).
What worries me in this is that Google doesn't seem to care whether pages are missing or not... as long as they get 99% of the pages from a book stored, chances are most search terms pointing to the particular book will be identified. Their interest lies in people purchasing the book via Amazon, Abe, etc. after identifying them via book.google.com.
The best quality control I have encountered so far is on Gallica, where, apart from pages missing because they were missing in the original scanned manuscript, I've not encountered incomplete books. It would actually be interesting to see how they perform their quality control.
Frank

On Mon, 22 May 2006 Bowerbird@aol.com wrote:
frank said:
Even that number is a misinterpretation
thanks for clearing that up for us, frank...
it's clear that google has gotten their legs under them in regard to doing the scanning. let's hope that they'll get their quality-control under control very soon too...
it is important to keep in mind that 100,000 books is <1% of the 10.5 million (or more) they'll do eventually; it's understandable if the process isn't up to speed yet.
I wonder how great a percentage of Google's six year plan will have to expire before Mr. Bowerbird will admit that it doesn't look as if Google is even trying to make it to 10 million in 6 years. My own projections show it taking about twice that long, if Mr. Bowerbird is correct, and they have indeed gotten their feet under them already. mh
participants (6)

- Bowerbird@aol.com
- Frank van Drogen
- Greg Newby
- Jeroen Hellingman (Mailing List Account)
- Michael Hart
- Norm Wolcott