Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries

michael said:
Actually, from what Brewster told us, and I think you were there, he really does want to take the cheap route, and would prefer to do mostly only raw scans if he could get away with it
the world will eventually explain to him that he can't get away with it.

if you can't search their text, you lose the vast majority of the benefits associated with having electronic-books.

however, as i have said many times here, if you do the scans correctly, and do the o.c.r. correctly, and run a good post-o.c.r. correction app, you will end up with text that is _highly_accurate_. so accurate that it's much more than good enough for search purposes, even good enough to move to the public for "continuous proofreading".

and again, i will "make this happen" _by_myself_ -- if it's necessary -- within 5 years, so this is no sticking point, none at all, really, i promise.

and because brewster's stuff will be open-access, and guaranteed so, there's no need to even worry about that content. we'll get it straight.

it is the stuff from _google_ that we have to be concerned about, because there's no assurance so far, let alone a guarantee, that we will be able to access that content easily, not unless we scrape it...

i've said this before, too -- we need a coordinated scraping campaign. every public-domain book we scrape is liberated -- forever and ever. o.c.r., clean-up, and format conversions can be done _automatically_.

michael, you asked me backchannel if scansets would need volunteers. for scraping, yes. but nothing else. just scraping.

i'd estimate it takes 15 minutes of human work to scrape a book. (the actual downloading takes longer, but it can be unattended.)

i'd then say it takes 30 minutes, on average, to do quality-control. (most books will take 10 minutes, but any books with problems take a disproportionate amount of time, so 30 minutes average.) quality-control involves renaming files according to solid policy.

and then i'd estimate it takes 15 minutes to upload those scans to their permanent home, where they'll be viewable immediately and ready for treatment by automated systems when those come online.

so this one hour of work is all it takes to liberate a book from google.

-bowerbird
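A minimal sketch of the "unattended downloading" step described above. The URL pattern, book identifier, and page count are hypothetical placeholders (the thread does not document Google's actual page-image interface), so read it as an outline of the idea rather than a working scraper.

# Sketch of the "unattended downloading" step: a human kicks it off and walks away.
# NOTE: the URL pattern, book identifier, and page count are hypothetical
# placeholders, not Google's actual page-image interface.
import os
import time
import urllib.request

def scrape_book(book_id, page_count, out_dir="scans"):
    """Fetch every page image of one public-domain book into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for page in range(1, page_count + 1):
        url = f"https://example.org/books/{book_id}/page-{page:04d}.png"  # placeholder
        dest = os.path.join(out_dir, f"{book_id}-{page:04d}.png")
        urllib.request.urlretrieve(url, dest)
        time.sleep(2)  # throttle politely; it runs unattended, so speed hardly matters

if __name__ == "__main__":
    scrape_book("exampleBook1906", page_count=320)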

On Wed, 31 May 2006 Bowerbird@aol.com wrote:
> michael said:
> Actually, from what Brewster told us, and I think you were there, he really does want to take the cheap route, and would prefer to do mostly only raw scans if he could get away with it
> the world will eventually explain to him that he can't get away with it.
I am always amazed at the number of people who claim raw scans equal eBooks.
> if you can't search their text, you lose the vast majority of the benefits associated with having electronic-books.
That's what I've been saying all along. Of course, I would prefer to use my own favorite search engines, etc.
> however, as i have said many times here, if you do the scans correctly, and do the o.c.r. correctly, and run a good post-o.c.r. correction app, you will end up with text that is _highly_accurate_.
I've seen examples of this happening for a few pages, not for whole books.
> so accurate that it's much more than good enough for search purposes, even good enough to move to the public for "continuous proofreading".
Depending on your proofreaders, this might have been largely true quite a number of years ago.
> and again, i will "make this happen" _by_myself_ -- if it's necessary --
More power to ya!
> within 5 years, so this is no sticking point, none at all, really, i promise.
I think the foundation of the rest of the history of eBooks will be laid down within 5 years, so I wouldn't wait if I were you; it might get to be too late. I would at least lay down an example set of a dozen or two this year, then at least twice as many next year, etc., just so the world has an example to look at.
> and because brewster's stuff will be open-access, and guaranteed so, there's no need to even worry about that content. we'll get it straight.
I've heard reports that many of Brewster's scans might have to be redone.
> it is the stuff from _google_ that we have to be concerned about, because there's no assurance so far, let alone a guarantee, that we will be able to access that content easily, not unless we scrape it...
Not to mention that it appears Google and Gallica both intentionally leave us with only reduced-resolution scans that might not be feasible to OCR.
> i've said this before, too -- we need a coordinated scraping campaign.
> every public-domain book we scrape is liberated -- forever and ever.
> o.c.r., clean-up, and format conversions can be done _automatically_.
> michael, you asked me backchannel if scansets would need volunteers.
> for scraping, yes. but nothing else. just scraping.
> i'd estimate it takes 15 minutes of human work to scrape a book. (the actual downloading takes longer, but it can be unattended.)
You don't think bots like "The Wayback Machine" can do this?
> i'd then say it takes 30 minutes, on average, to do quality-control. (most books will take 10 minutes, but any books with problems take a disproportionate amount of time, so 30 minutes average.) quality-control involves renaming files according to solid policy.
This would obviously be a human-intensive part of the work.
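The renaming half of that quality-control pass is the part most easily scripted. A rough sketch, assuming a hypothetical naming policy of book-id plus zero-padded page number (the thread only says the renaming should follow a solid policy, not what that policy is):

# Sketch of the renaming half of quality-control: map scraped page images onto
# one uniform scheme. The "<book_id>-0001.png" policy is a stand-in; the thread
# only calls for a solid policy without spelling one out.
import os
import re

def rename_pages(scan_dir, book_id):
    """Rename files like 'page12.png' or 'IMG_0012.jpg' to '<book_id>-0012.<ext>'."""
    for name in sorted(os.listdir(scan_dir)):
        digits = re.findall(r"\d+", name)
        if not digits:
            print("needs a human look:", name)  # doesn't fit any numbered pattern
            continue
        page = int(digits[-1])  # treat the last run of digits as the page number
        ext = os.path.splitext(name)[1].lower()
        os.rename(os.path.join(scan_dir, name),
                  os.path.join(scan_dir, f"{book_id}-{page:04d}{ext}"))

rename_pages("scans", "exampleBook1906")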
> and then i'd estimate it takes 15 minutes to upload those scans to their permanent home, where they'll be viewable immediately and ready for treatment by automated systems when those come online.
Again, why not just let bots take care of this part?
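A bot-friendly path for this step does exist now: for example, the Internet Archive's "internetarchive" Python package can push a directory of page images into an item in a few lines. The item identifier and metadata below are invented for illustration, and the choice of archive.org as the "permanent home" is an assumption.

# Sketch of a scripted upload to a permanent home, here assumed to be an
# archive.org item, using the "internetarchive" package (pip install
# internetarchive; run "ia configure" once to store credentials).
# The item identifier and metadata are invented for illustration.
from internetarchive import upload

responses = upload(
    "exampleBook1906-scans",   # hypothetical item identifier
    files="scans",             # directory of renamed page images from the previous step
    metadata={
        "mediatype": "texts",
        "title": "Example Book (1906), liberated page scans",
        "collection": "opensource",
    },
)
print([r.status_code for r in responses])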
> so this one hour of work is all it takes to liberate a book from google.
I'd like to think half an hour will be enough, when the time comes, and eventually, after it has all been done once, it will be trivial to do it all over again, better.
> -bowerbird
Thanks!!!

Give the world eBooks in 2006!!!

Michael S. Hart
Founder
Project Gutenberg

Blog at http://hart.pglaf.org