michael said:
>   We work with all possible sources to get eBooks.

all of your sources, combined, won't be able to keep up.
not until the digitization becomes nearly-automatic.
which, as i've said, is not that far down the line anyway.

>   I seem to get plenty of messages from scholarly types
>   who think source scans will always be in high demand,
>   at the ivory tower level, at least.

i'm not sure they know what they want.

in fact, i'm almost sure that they don't...

>   after 10-20 years the actual hardware requirements will
>   appear so drastically reduced that the load will be nil.

it won't be the resources required (or not required) that
makes us take down the scans, it will be the lack of demand.
digital reprints will do the same job, better, with fewer resources.

>   Once you provide a better alternative, you force those
>   who should have done it originally to do it better too.

if you can scare up the $250-million budget, please do.
as i've said elsewhere, that's what we spend in _two_days_
on the war in iraq, so you wouldn't _think_ it's all that hard
to find the same amount for such a culturally important task.
but i don't see anyone except google stepping up to the plate.

>   I was under the impression that much of this low-quality
>   was intentional, so I don't think those will be improving,
>   at least until someone provides a better mousetrap.

that's exactly why i asked "what can we do about it?"

>   All depends on how much effort it is for the particular person
>   in question. . .if it's a lot of effort to get the materials,
>   but low effort to do the scanning, you may as well replace the
>   entire file with your better examples of what should be done.

i agree. and in the cases where we can't use google's scans,
that's what we'll have to do. let's just hope, though, that that
won't be the case for the bulk of those 10-million unique titles.

>   1. Makes for better OCR

i'm rooting for better o.c.r. i sincerely hope that it happens,
and i suspect the abbyy folks still have tricks up their sleeves.

and, just to remind everybody here, they have _already_ made
a version of their software that's specially-adapted for old books,
a version that nobody here, to my knowledge, has even _tried_,
so y'all will need to do some work in order to convince me
that you're really as concerned with the o.c.r. thing as you claim.

but as for me, i'm not counting on the o.c.r. much at all;
i'll take what is currently available in regard to o.c.r. tech.

my aim is to jack up the post-o.c.r. correction routines,
using a wide array of automagic.

>   2. The scholarly types, as above.

let 'em use their scholar dollars to create whatever they need.
i can't be bothered with their esotericism. i just love the books.

>   Yes, and we should.

well, i'm glad i finally got _that_ tooth pulled!         ;+)

>   It matters to the integrity of the eBook world.

my integrity does not turn on semantics.

>   _I_ have no intention of quitting
>   until I can give away a million books,

that's what i love about you, big boy, your dedication.

>   It will be interesting to see who can put a million eBooks
>   online first, and how good they are.

i agree.

>   Yep. . .scans are just one step, I say it's the easiest.

depends on how many you do.

>   I can only hope he meant something more worthwhile
>   to the masses than what most of the current scan-sets provide
>   and that he will be able to find some way to keep the ball rolling.

i'll be happy to help him out, just like i'm happy to help you out.

>   We'll see, and I am taking bets.

pizza. loser buys in the winner's city.
you can fly out to santa monica and
spend some time with me sometime
when it's wintry cold there in illinois.

>   I think we should work as though it all depends on us,
>   and hope that Google will get somewhere.

you're not the best person to do that negotiation anyway.

>   They claim all those million are spent on scanning, not OCR.

it is. and that's why it will take a _negotiation_ to get them to
release the public-domain scans. they won't do it "just because".

but i think there _are_ some things we can offer in negotiation.

one would be the quality-control that we're willing to do for 'em.
although i think they'll realize soon they need to do this themselves,
at the time when they've still got the book right there by the scanner,
it never hurts to have another entity take a look at your work later...

another would be an offer to serve as their "reading room", which
would mean we'd dish the pages to people for reading, so google
could instead concentrate completely on being "the search engine".
(this might mean we'd have to agree not to furnish our own search
capability, but as long as their engine is nicely integrated into our
presentation regime, i don't think that would be a problem at all;
many websites use google as their search-engine even at present.)

and perhaps most importantly, what we could offer is huge help
in the form of friend-of-the-court briefs that would be supportive
of google's scanning project in facing their various legal challenges.
public opinion will be very important when this comes to judgment,
and a good-faith effort like turning their public-domain scans loose
could go a _long_ way in drumming up public support for their work.
on the other hand, a selfish attitude on google's part would make 'em
look bad, and that appearance could be quite devastating to their case.

i assume all of these points are reasonably apparent to google already,
so the "negotiation" wouldn't have to be antagonistic in nature. indeed,
it might be very short and very sweet, and we could find ourselves with
100,000 scan-sets on our machines before we knew it. that possibility
sounds too good to me to pass up without giving it serious consideration.

>   Actually, you have it backwards there. . .think about it. . . .
>   Google's monster speciality is SEARCH ENGINES!!!
>   They are MUCH more interested in writing a search engine that will
>   read fuzzy OCR text than in increasing the accuracy of the text.

if you can search fuzzy text, you can correct fuzzy text. that's the point.
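
to make that concrete, here is a minimal sketch in python, under some
loud assumptions: the tiny word list, the 0.7 similarity cutoff, and the
function names are all made up for illustration, and nothing here is
what google or abbyy actually does. the same approximate lookup that
_finds_ a garbled word is the one that tells you what to _replace_ it with.

```python
import difflib

# hypothetical word list; a real one would be a full dictionary
LEXICON = ["scanning", "search", "engine", "public", "domain", "library", "modern"]

def fuzzy_find(token, lexicon=LEXICON, cutoff=0.7):
    """return the lexicon words that approximately match token, best first."""
    return difflib.get_close_matches(token.lower(), lexicon, n=3, cutoff=cutoff)

def correct(token, lexicon=LEXICON, cutoff=0.7):
    """replace token with its best fuzzy match, or leave it alone if there is none."""
    if token.lower() in lexicon:
        return token  # already a known word, nothing to fix
    matches = fuzzy_find(token, lexicon, cutoff)
    return matches[0] if matches else token

def correct_line(line):
    """apply correct() to every whitespace-separated token in a line."""
    return " ".join(correct(t) for t in line.split())

if __name__ == "__main__":
    print(fuzzy_find("scanninq"))                 # the "search" half
    print(correct_line("scanninq the 1ibrary"))   # the "correct" half
```

a real routine would need context and a lot more care with names and
rare words, which a bare word list would happily "correct" into errors.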

if google lets its text remain fuzzy, it will be because they _decided_to_.
and there are a couple reasons why they might well decide to do that,
but i'd rather not take a chance of making them real by discussing them.

still, as i've said, i myself will show the world how to correct fuzzy text,
within 5 years, assuming that abbyy hasn't already solved the problem.

-bowerbird