carel said:
> I've read your posts for many years
so _you're_ the one who's been reading them! ;+)
> I'm not sure the most efficient system on earth
> could keep up with the scan rate.
sure we can. just have 10 million people do 1 book each.
piece'a'cake.
> And, if you speed things up by skipping the manual proofing
> and use automation completely for the first release then you
> will have text little to no better than what google already has.
oh, don't kid yourself. google's text is _much_ better than
what they're showing to us members of the general public.
> Wikimedia then takes this further by adding the ability
> for humans to modify the text in a never-ending cycle.
i haven't been impressed by the wiki approach, honestly...
> In what manner do you suggest that this would be
> differentiated from the current wiki book editing
> process at wikimedia that justifies the creation of
> the project for the purpose of PG?
i'm not suggesting this for p.g. it could be for anybody.
including google. as for how it differs, please see below.
> it wouldn't be that hard to cull the data from google
well, it's not that "hard", but it's not that _easy_ either...
first, they've started rewrapping the lines, the dirty rats.
second, and more importantly, their terms prohibit it...
people on this listserv have been cut off for scraping...
you can still do it, of course, but you have to be careful.
> Then, someone can
yeah, what you've laid out is basically the standard plan.
i could bother to nitpick certain parts of it, like the x.m.l.,
but really, my objection is at the more fundamental level.
yes, i believe every page of every book should be online,
on its own webpage with unchanging u.r.l., text and scan,
with an error-report form that the general public can use
to detail any problems with the page, or ask questions, or
even make annotations and have dialog about that page...
but i also believe that the text for the vast majority of these
pages should be _perfect_ at the time that text is mounted.
and i've demonstrated, repeatedly, that this is fully possible.
one person can take an average book to near-perfection in
the space of a couple hours, if they are given a good tool...
("near-perfection" means 1-error-or-less-every-10-pages.)
i just showed how you can do this, using gardner's book...
so, if 9 out of 10 pages are perfect to begin with, then you
simply don't need a full-on wiki format. a wiki is good when
you are constantly editing something. but book digitization
isn't like that. corrections will be very few and far between.
meanwhile, there's no reason to put every page up for grabs.
even on a first pass, straight outta o.c.r., assuming good scans
and a good o.c.r. app, _half_ the pages have no errors on 'em.
in 2001, the idea for distributed proofreaders was brilliant.
(i was once a graduate student at u.c.l.a. in social psychology,
studying cooperation/competition in the arena of the internet,
so i'm probably the best person on the planet to understand
just precisely how brilliant the idea was, and it was brilliant.)
in 2001, scansets were rare, and o.c.r. stunk (and cost a lot),
and it was hard -- very hard -- to digitize a book by yourself.
moreover, the primary method of connection was slow-speed.
so doling out pages one-at-a-time to people online was
a relatively efficient way of doing book digitization in 2001,
especially as it allowed us to _cumulate_ the _partial_effort_
of many people. simply put, it made sense to go distributed.
a good number of people were willing to "proof a page a day".
in 2010, the situation has been turned completely on its head.
scansets are no longer rare; indeed, we are swimming in them.
o.c.r. no longer stinks, nor does it cost a lot. indeed, many of
those scansets now _come_ with the o.c.r. already done,
and -- after some preprocessing, which isn't difficult to do --
the level of accuracy of that o.c.r. is rather startling.
and perhaps most importantly, broadband is now widespread;
we think nothing of throwing tens of megabytes around now.
so, in 2010, doling out pages one-at-a-time to people online
is just a silly idea. i can download in the background now, but
even if i _watch_, i can download a whole scanset in minutes.
and a 300k text-file that represents a book? in just seconds.
these days, with the right tools at your disposal, it's possible
for you to digitize a book all by yourself in _one_hour_.
i proved this -- with a stopwatch -- and documented it fully.
so here's how i see things going...
when you find a scanset online of a book you want to digitize,
you'll download it, along with the o.c.r. (if it comes from a site
that won't give you the o.c.r., and you have no o.c.r. app, you
can upload the scans to archive.org, and they'll o.c.r. it for you.)
you'll use a handy-dandy tool to digitize it in an hour or two,
and then you'll make the results available to everyone online.
or... your alternative is waiting 3 years for d.p. to get it done.
and carel, since you're interested in programming such tools,
we'll have plenty to talk about...
> I am very interested in your text processing scripts
> though and always have been. I just haven't been
> processing any OCR text for a while.
well, like i told gardner just the other day, i really don't
_have_ any collection of scripts. i use a text-editor and
start looking at the book. i see an error, and devise a
global search-and-replace to deal with that type of error.
when they're all fixed, i look again to find the next error.
i think of this process as "listening to the book", in that
the book itself will tell you what kinds of errors it has...
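
for anyone who'd rather let a script do the first bit of that
"listening", here is a rough python sketch. it is not an existing
script from anyone's collection; the function name and the
"letters mixed with digits" heuristic are assumptions made up
for illustration. it only tallies odd-looking tokens, so that the
most common error types float to the top of the list:

```python
import re
from collections import Counter

def suspicious_tokens(text):
    """Tally tokens that mix letters and digits (e.g. '1ittle').

    Hypothetical heuristic: such tokens are often o.c.r. confusions
    (1/l, 0/O), but some are legitimate, so this only points at
    things worth looking at.
    """
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    mixed = [t for t in tokens
             if re.search(r"[A-Za-z]", t) and re.search(r"[0-9]", t)]
    return Counter(mixed)

if __name__ == "__main__":
    sample = "the 1ittle dog and tbe 1ittle cat in 1850"
    for token, count in suspicious_tokens(sample).most_common():
        print(count, token)
```

legitimate tokens like "1st" get flagged too, and a real-word
error like "tbe" is missed entirely, so the tally is a list of
things to look at, not a list of things to change.
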
and yeah, there are some things that you do to every book,
like fix spacey quotemarks, and pull in spacey contractions.
and it would be good to write macros or scripts or routines
to do these. but it's fairly trivial to type them in each time.
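
as a minimal sketch of what such routines might look like in
python: the two functions below are hypothetical, and they lean
on assumptions that won't hold for every book. "spacey quotemarks"
is taken to mean straight double-quotes with a stray space on the
inside, quotes are assumed to be balanced within a paragraph, and
only a short list of contraction endings is handled:

```python
import re

def fix_spacey_quotes(paragraph):
    """Pull stray spaces out from the inside of double quotes.

    Assumes straight quotes (") that are balanced in the paragraph:
    the 1st, 3rd, 5th... quote opens, the 2nd, 4th, 6th... closes.
    """
    count = 0

    def repl(match):
        nonlocal count
        count += 1
        lead, trail = match.group(1), match.group(2)
        if count % 2:              # opening quote: drop the space after it
            return lead + '"'
        return '"' + trail         # closing quote: drop the space before it

    return re.sub(r'( ?)"( ?)', repl, paragraph)

def pull_in_contractions(text):
    """Close up contractions such as "don 't" or "it' s".

    Only the endings listed in the pattern are handled (an assumption).
    """
    return re.sub(r"(\w) ?' ?(t|s|d|ll|re|ve|m)\b", r"\1'\2", text)

if __name__ == "__main__":
    print(fix_spacey_quotes('he said, " hello there. " and left.'))
    print(pull_in_contractions("i don 't think it' s hers"))
```

an unbalanced quote in a paragraph throws off the open/close
counting for the rest of that paragraph, which is one more reason
to look at the changes rather than apply them blind.
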
it's also the case that a lot of the time you _must_ examine
the text in order to approve or disapprove the change, and
-- while it is not impossible to code an interface for that --
your typical text-editor _has_ one, which you know and love.
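
if you did want to code that approve/disapprove step, a bare-bones
version might look like the python sketch below. the function names
and the line-by-line granularity are assumptions for illustration,
not a description of any existing tool. one function lists what a
search-and-replace _would_ change, and the other applies it only
to the lines a human has approved:

```python
import re

def propose_changes(text, pattern, replacement):
    """Return (line_number, before, after) for each line that would change."""
    proposals = []
    for number, line in enumerate(text.splitlines(), start=1):
        changed = re.sub(pattern, replacement, line)
        if changed != line:
            proposals.append((number, line, changed))
    return proposals

def apply_approved(text, pattern, replacement, approved):
    """Apply the replacement only on the approved line numbers."""
    out = []
    for number, line in enumerate(text.splitlines(), start=1):
        if number in approved:
            line = re.sub(pattern, replacement, line)
        out.append(line)
    return "\n".join(out)

if __name__ == "__main__":
    # "arid" is a common misreading of "and", but also a real word
    page = "bread arid butter\nthe arid desert"
    for number, before, after in propose_changes(page, r"\barid\b", "and"):
        print(number, repr(before), "->", repr(after))
    print(apply_approved(page, r"\barid\b", "and", approved={1}))
```

the "arid" example shows why the eyeballing matters: both lines
match the pattern, but only the first one is actually an error.
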
> I merely meant that we are not in a position to
> have any say in what DP does with itself.
> Nor in what PG does with itself.
but i love telling them what they should be doing, and why.
because once they decide to do it, and they _will_, because
i'm never wrong about these things, i can say "i told you so".
-bowerbird