
On Wed, 22 Dec 2004, Lloyd Benson wrote:
From: Jon Roland <jon.roland@constitution.org>
Date: Tue, 21 Dec 2004 14:50:22 -0600
Subject: Google Print questions
The announcements would seem to suggest that Google intends not only to scan the page images of all these books, but also to OCR them and correct the recognition errors, so that they can be made searchable, with the complete texts of the public domain works offered along with excerpts of the copyrighted ones (presumably under the fair use doctrine). One announcement also estimated a cost of $10 per volume.
Project Gutenberg has already produced and distributed nearly 15,000 eBooks, with a budget that has yet to reach a significant total over all 33+ years, and is projected to reach a million eBooks without undue expense or effort. We'll just have to wait and see if either Google Print, or any of the various "Million eBook Projects," will ever come up with even 1% of a million eBooks that you can carry with you on a one inch stack of plain homemade DVDs. If it hasn't been proofread, and if you can't take it with you, it is only of limited value. . .sort of like reading over someone's shoulder.

With Project Gutenberg eBooks, you OWN them. . .forever. . .and can save them in your own favorite formats, fonts, margination, pagination, or whatever, and you can search, quote, print, and do all the normal eBook functions. "A picture of a book is not an eBook."

The term eBook should not be used to describe raw scans or raw OCR, as has been tried by some of the Google and "Million eBook" participants over the past decade. I would say that an eBook has to be at least 99.9% accurate, and that there should then be a process, as people read the eBooks, of sending in corrections. Most of the Project Gutenberg and Distributed Proofreaders volunteers would say it has to be over 99.99%, and perhaps even over 99.999%. 99.999% would be one error perhaps every 100 pages or so, and I'm pretty sure the source materials we have are not that accurate. . .not that eBooks won't become more and more accurate, closer and closer to 100% accuracy, but I'm not sure they have to be all that much better than 99.9% before they can be made available.
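To put those percentages in perspective, here is some rough arithmetic; the figure of roughly 1,000 characters per page (and 300 pages per book) is an assumption for illustration, not anything from the announcements:

    # Rough arithmetic on character-level accuracy, assuming ~1,000 characters
    # per page and a 300-page book (both figures are assumptions).
    chars_per_page = 1000
    pages_per_book = 300
    for accuracy in (0.999, 0.9999, 0.99999):
        errors_per_page = chars_per_page * (1 - accuracy)
        errors_per_book = errors_per_page * pages_per_book
        print(f"{accuracy:.3%}: {errors_per_page:g} errors/page, "
              f"{errors_per_book:g} errors/book")
    # Under these assumptions, 99.9% is about one error per page,
    # 99.99% about one every ten pages, and 99.999% about one every
    # 100 pages, which matches the figure quoted above.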
This is highly ambitious, even just to scan the images. The experience of the U. of Michigan should show that it is not feasible to OCR these works accurately for that cost, or in that timeframe. While uncorrected OCR might still enable word search, since most words appear more than once in a work and at least one instance can be expected to be recognized accurately, searching on entire phrases can be expected to be much more problematic.
I have heard this described before. . .has anyone tried their test eBooks???
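On the phrase-search point above, a quick back-of-the-envelope sketch; the per-word recognition rates below are assumptions chosen only for illustration:

    # Probability that an exact n-word phrase survives uncorrected OCR,
    # assuming each word is recognized correctly and independently with
    # probability p (both p values are illustrative assumptions).
    for p in (0.99, 0.95):
        for n in (1, 4, 8):
            print(f"p = {p}: a {n}-word phrase matches with probability {p ** n:.2f}")
    # Single-word search degrades slowly, but at p = 0.95 an 8-word phrase
    # already has only about a 66% chance of matching exactly.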
As one who works with a lot of older works, not only scanning and OCRing them but correcting them, I know how much human labor is involved. There are volunteer efforts like Distributed Proofreaders ( http://www.pgdp.net/c/default.php ), but I have concluded that it takes me more time to set up a project for them than it would take for me to do the proofreading myself, and my work would likely be more accurate, since I understand the underlying content and know how to render obscure text.
While it does take a little time to set up one's first project with Distributed Proofreaders, it is usually quite a bit easier the second time, not to mention that we have volunteers who will walk you through the process the first few times around, which seems to do the trick for nearly everyone.
So my basic question and concern is, how do we ensure that this project does not release too many uncorrected texts into the world that never get corrected, and perhaps propagate errors that come to be accepted as accurate even when they are not?
I wonder how many of these will be "released into the world". . .I have a strong suspicion that the answer is "none," unless some outside source does it.
I would submit that it would be better to prioritize these works and release fully corrected and annotated digital editions of the most important ones first, going for quality rather than quantity. This has been the approach used by online collections such as ours at http://www.constitution.org/liberlib.htm . Although we do put some works up before the correcting and reformatting is finished, we always flag those that are still in progress, indicating the state of completion, and we stand by to quickly make corrections that outsiders may discover are needed.
I view all eBooks as "still in progress," as I have never proofread one in which I didn't find any mistakes. . . .

My own view is that I would prefer to have access to twice as many eBooks at the 99.95% accuracy level [the Library of Congress standard] than half as many at the 99.995% level I think is being suggested here. After all, the books that get read the most will be the ones that get the most corrections. . .an obvious way to aim effort at the proper targets!

Not only that, but, viewing the entire eBook effort as a 50 year process, of which I have walked 33+ years, I must state for the record that I think OCR, spellcheckers, grammar checkers, etc. will be so much better a decade from now that proofreading the more obscure works will require far less effort than it does today. . .a great trade-off.

I'm not at all sure why people want eBooks to be so perfect to start with. I would prefer to get all 10 million public domain works we can find. . .or at least a million of them. . .online and freely downloadable before we try to approach the 100% accuracy level. Of course, I don't believe in the "raw OCR" idea that seems to be what the Google Print idea has in mind, even with spelling and "scanno" checkers, and I also don't believe in going so far in the other direction that we try for such high accuracy levels that the number of eBooks only grows at half the rate it has been growing. The path is obviously somewhere in the middle. . .machine production is obviously not accurate enough [except in certain tests I have seen run with high contrast new materials], and after a certain point it becomes inefficient to keep proofreading before letting the public have access. After all, the public IS what this is all about, is it not? So let's let the public do the final proofreading, as a process, for all the years to come. . .at least until we have OCR that makes only 1 error in a million characters. . .and thus most of the errors we find are from the original publications.

[By the bye, this is one of the reasons for using more than one paper edition to produce an eBook, when multiple paper editions are available. Then the machine processes can compare the editions to find even more errors; a small sketch of such a comparison is appended after the signature.]

Well, enough now. . .let's make more eBooks!!!

Thanks!!!

So Nice To Hear From You!

Happy Holidays!!!

Michael

Give FreeBooks!!! In 39 Languages!!!

As of December 23, 2004 ~14,780 FreeBooks at:
http://www.gutenberg.org
http://www.gutenberg.net
~220 to go to 15,000
We are ~95% of the way from 10,000 to 15,000.

Now even more PG eBooks In 104 Languages!!!
http://gutenberg.cc
http://gutenberg.us

Michael S. Hart <hart@pobox.com>
Project Gutenberg Executive Coordinator
"*Internet User ~#100*"

If you do not receive a prompt reply, please resend, keep resending.
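A minimal sketch of the multi-edition comparison mentioned above: run OCR text from two printings of the same work through a plain diff and let a proofreader look only at the places where the two disagree. The file names here are hypothetical.

    import difflib

    # Compare OCR output from two different printings of the same work and
    # print only the lines where they disagree; a human then checks just
    # those spots against the page images. (File names are made up.)
    with open("edition_a.txt") as f:
        lines_a = f.readlines()
    with open("edition_b.txt") as f:
        lines_b = f.readlines()

    diff = difflib.unified_diff(lines_a, lines_b,
                                fromfile="edition_a.txt", tofile="edition_b.txt")
    for line in diff:
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---")):
            print(line, end="")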