michael said:
> Actually, from what Brewster told us, and I think you were there,
> he really does want to take the cheap route, and would prefer to
> do mostly only raw scans if he could get away with it
the world will eventually explain to him that he can't get away with it.
if you can't search the text of scanned books, you lose the vast majority
of the benefits associated with having electronic-books.
however, as i have said many times here, if you do the scans correctly,
and do the o.c.r. correctly, and run a good post-o.c.r. correction app,
you will end up with text that is _highly_accurate_.
so accurate that it's much more than good enough for search purposes,
even good enough to move to the public for "continuous proofreading".
and again, i will "make this happen" _by_myself_ -- if it's necessary --
within 5 years, so this is no sticking point, none at all, really, i promise.
and because brewster's stuff will be open-access, and guaranteed so,
there's no need to even worry about that content. we'll get it straight.
it is the stuff from _google_ that we have to be concerned about,
because there's no assurance so far, let alone a guarantee, that we
will be able to access that content easily, not unless we scrape it...
i've said this before, too -- we need a coordinated scraping campaign.
every public-domain book we scrape is liberated -- forever and ever.
o.c.r., clean-up, and format conversions can be done _automatically_.
michael, you asked me backchannel if scansets would need volunteers.
for scraping, yes. but nothing else. just scraping.
i'd estimate it takes 15 minutes of human work to scrape a book.
(the actual downloading takes longer, but it can be unattended.)
i'd then say it takes 30 minutes, on average, to do quality-control.
(most books will take 10 minutes, but any books with problems
take a disproportionate amount of time, so 30 minutes average.)
quality-control involves renaming files according to solid policy.
and then i'd estimate it takes 15 minutes to upload those scans to
their permanent home, where they'll be viewable immediately and
ready for treatment by automated systems when those come online.
so this one hour of work (15 + 30 + 15 minutes) is all it takes to liberate a book from google.
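
for anyone who wants to plug in their own numbers, here is a rough sketch
of that time budget in python. the three per-step figures are the estimates
given above. the function names, the "problem_share" parameter, and any
campaign size you feed it are hypothetical, added only for illustration.

```python
# per-book human time in minutes, taken from the estimates in this post
SCRAPE_MIN = 15   # attended scraping work (downloading runs unattended)
QC_MIN = 30       # average quality-control time, including renaming
UPLOAD_MIN = 15   # uploading the scans to their permanent home


def minutes_per_book():
    """total human minutes needed to liberate one book."""
    return SCRAPE_MIN + QC_MIN + UPLOAD_MIN


def implied_problem_qc_minutes(problem_share, easy_min=10, average_min=QC_MIN):
    """qc minutes a problem book must take for the stated average to hold.

    problem_share is a hypothetical fraction of books with problems;
    the post gives only the 10-minute typical case and the 30-minute average.
    """
    if not 0 < problem_share <= 1:
        raise ValueError("problem_share must be in (0, 1]")
    return (average_min - (1 - problem_share) * easy_min) / problem_share


def volunteer_hours(n_books):
    """total volunteer hours for a campaign of n_books (hypothetical size)."""
    return n_books * minutes_per_book() / 60
```

as a usage example, minutes_per_book() gives 60, and if you assume one book
in four has problems, implied_problem_qc_minutes(0.25) gives 90.0, meaning
each problem book would eat about an hour and a half of quality-control.
that one-in-four share is a guess, not a measurement.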
-bowerbird