re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries

michael-
>   I've seen examples of this happening for a few pages, not for whole books.
you haven't seen anything even close to the array that i will throw at the problem.
>   Depending on your proofreaders, this might have been largely true quite a number of years ago.
my "proofreaders" are the general public, who are just "regular" readers, so they're not very good. but they will have at their disposal a very simple system for reporting errors, and will be properly informed that we _adore_ error-reports, and that's all you really need to move e-texts to perfection... that and a whole bunch of eyeballs. and if a book can't attract eyeballs, then maybe it doesn't matter if it happens to have errors in it, eh? but most of the text will be super-clean before it ever gets to these readers.
>   I think the foundation of the rest of the history of eBooks will be laid down within 5 years, so I wouldn't wait if I were you; it might get to be too late.
nah. no matter what norms get laid down in the next 5 years, if they aren't good ones, then mine will eventually replace them.
>   I would at least lay down an example set of a dozen or two this year
have already started. will continue to show developments.
>   I've heard reports that many of Brewster's scans might have to be redone.
might be. but he hasn't scanned enough yet for this to be a problem. even google hasn't done more than 1% of what they will do eventually, so even if they had to re-do everything they've done so far, no big deal.

realistically, it's still too early to even _ask_ if the workflow is correct. you should do that at about the 5% mark. and you can even profitably "start over" at the 10% mark -- which for google is 1 million books -- if you have identified and solved a significant flaw in your processes.

but until you've done 5% of the job (and hopefully a _random_ 5%, so your selection biases don't blind you to any substantial problems), you haven't adequately confronted the difficulty-factor of the task, so you can't even make an informed decision about your workflow.
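just to pin down what i mean by a _random_ 5%, here's a tiny sketch, assuming nothing more than a catalog file with one book-id per line (the file name and the function are illustrative, not anybody's actual pipeline):

    import random

    def evaluation_sample(catalog_path, fraction=0.05):
        """draw a uniform random fraction of the catalog, so selection bias
        can't hide the hard cases from the workflow review."""
        with open(catalog_path, encoding="utf-8") as f:
            book_ids = [line.strip() for line in f if line.strip()]
        k = max(1, int(len(book_ids) * fraction))
        return random.sample(book_ids, k)   # uniform, without replacement

    if __name__ == "__main__":
        for book_id in evaluation_sample("catalog.txt"):
            print(book_id)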
>   Not to mention that it appears Google and Gallica both intentionally leave us only with reduced-resolution scans that might not be feasible to OCR.
that just means our post-o.c.r. program has to work a little harder. again, you don't need to worry about text digitization. it is solved.
>   You don't think bots like "The Wayback Machine" can do this?
um, no. google excludes bots from its cyberlibrary. you must respect robots.txt, or you're not being responsible, so there simply must be a human at the machine. it's more-or-less a technicality, yes -- we _are_ scraping, make no mistake about it, and we say so out loud -- but it's a technicality that we absolutely have to meet. besides, we need someone to do the quality-control, and for _that_ you do have to have a sentient human. ditto for the re-upload to our own webspace.
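and respecting robots.txt isn't hard to do honestly, either; here's a minimal sketch using python's standard robotparser -- the user-agent string and the url are placeholders, not a real target:

    import urllib.robotparser
    from urllib.parse import urlsplit

    USER_AGENT = "pd-scanset-mirror/0.1"   # hypothetical; identify yourself honestly

    def allowed_to_fetch(url):
        """return True only if the site's robots.txt lets this user-agent fetch url."""
        parts = urlsplit(url)
        rp = urllib.robotparser.RobotFileParser()
        rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
        rp.read()
        return rp.can_fetch(USER_AGENT, url)

    if __name__ == "__main__":
        print(allowed_to_fetch("https://example.com/scans/page_0001.png"))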
>   I'd like to think half an hour will be enough, when the time comes,
um, not really. not if you're going the download/qc/upload route. you _could_ save a _lot_ of time if google just handed over their scans on a couple of big hard-drives, or pointed you to an open-access server. somebody needs to open up such negotiations with google.

google's biggest flaw in this whole arena is its tight-lipped nature. i understand that that's typical of the company at large, but still, at some point -- which i place at december 14th of this year -- google will _have_to_ start talking with us electronic-book people, or risk losing our continuing support in regard to their lawsuits... if they want us to continue to believe they're doing this on our behalf, they must speak with us... they simply cannot continue to stonewall...

i truly believe they could win themselves all the friends they'll need if they simply released all of their public-domain scan-sets for free, and i also truly believe they're smart enough to figure that out, but at a minimum they will need to start sharing their progress with us. again, somebody needs to open the negotiations that would let us take swift and convenient possession of their p-d scan-sets.

if we can save one-half hour each on 100,000 books, it adds up. do that on a million books, or on 10 million, and it really adds up. it'd also jack your count up considerably, and we could use that too, especially since this wouldn't be some meaningless inflation, but a solid increase in the general public's access to books...
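and "it adds up" is not hand-waving; the back-of-envelope arithmetic (assuming roughly 2000 working hours per person-year) looks like this:

    HOURS_SAVED_PER_BOOK = 0.5      # the half hour in question
    HOURS_PER_PERSON_YEAR = 2000    # assumed full-time working year

    for books in (100_000, 1_000_000, 10_000_000):
        hours = books * HOURS_SAVED_PER_BOOK
        person_years = hours / HOURS_PER_PERSON_YEAR
        print(f"{books:>10,} books -> {hours:>9,.0f} hours saved"
              f" (~{person_years:,.0f} person-years)")

that's 25 person-years at 100,000 books, and 2,500 person-years at 10 million.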
>   and eventually, after it has all been done once, it will be trivial to do it all over again, better.
actually, once it's been done once, it will be unnecessary to do it again.

-bowerbird