michael said:
>   I seem to recall an earlier report from someone
>   who did lots of searches for Google books and
>   determined that 88% of them were published after 1922.


i've posted this before, taken
from lorcan dempsey's weblog,
summarizing an article in d-lib.
i always find it again easily by
searching his site for "anatomy".

>   http://orweblog.oclc.org/archives/000800.html
>   The anatomy of an aggregate collection
>   September 17, 2005
>   Approximately half of the print books
>   in the combined Google 5 collection
>   were published after 1974.
>   Almost three-quarters were published
>   after the Second World War.
>   Using the year 1923 as a rough break-off
>   point between materials that are
>   out of copyright and materials that are
>   in copyright [16], more than 80 percent
>   of the materials in the Google 5 collections are
>   still in copyright (this is of course an upper bound).

if google has scanned roughly 100,000 pre-1923 items,
and they were taking books off the shelves randomly,
then we could assume they scanned 400,000 post-1923.

but if we assume they were doing the pre-1923 items first,
100,000 pre-1923 scanned means 100,000 total scanned.
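
a rough sketch of those two scenarios, just to show how far apart
they land. the function name is made up for illustration, and the
20% pre-1923 share is the approximate figure from the quote above,
assumed (not known) to hold for whatever has been scanned so far.

```python
def estimate_scans(pre_1923_scanned, pre_1923_share):
    """Return (total, post_1923) scan counts implied by a pre-1923 count.

    pre_1923_share is the ASSUMED fraction of scanned items that are
    pre-1923: 0.20 if books come off the shelves randomly,
    1.0 if the pre-1923 items are being done first.
    """
    if not 0 < pre_1923_share <= 1:
        raise ValueError("pre_1923_share must be in (0, 1]")
    total = round(pre_1923_scanned / pre_1923_share)
    return total, total - pre_1923_scanned


# the two scenarios from the text, both starting from 100,000 pre-1923 items
for label, share in [("random order", 0.20), ("pre-1923 first", 1.0)]:
    total, post = estimate_scans(100_000, share)
    print(f"{label}: total={total:,} post-1923={post:,}")
```

same input, and the estimate of the total swings by a factor of five
depending on which assumption you pick.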

seems to me assuming things does us absolutely no good.

but google is _going_ to scan 10+ million books, eventually,
so i'm not sure what difference it makes _how_many_ they've
done "so far".  are we really questioning their _resolve_ here?
seems to me that they've proven they are dedicated to this...

so attempts to figure out "how many books so far?" are silly.

especially since we know that many of the post-1923 items
did not have their copyrights renewed -- except that we do
_not_ know what percentage, and thus cannot even _assume_
the answer to that important question, not with any certainty.

if we say that half of the post-1923 books were not renewed,
then that means that 60% (20% and 40%) are not in copyright.
if we say that 1/3 of the post-1923 books were not renewed,
then that means that 47% (20% and 27%) are not in copyright.
if we say that 2/3 of the post-1923 books were not renewed,
then that means that 73% (20% and 53%) are not in copyright.
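
a minimal sketch of that arithmetic, for anyone who wants to try
other guesses. the non-renewal rate applies only to the post-1923
slice (80% of the collection), and the result gets added to the
pre-1923 slice. the names here are hypothetical, and the 20%/80%
split is the rough figure from the quote above, not a measurement.

```python
from fractions import Fraction

# ASSUMPTION: roughly 20% of the collection is pre-1923 (from the quoted summary)
PRE_1923_SHARE = Fraction(1, 5)


def not_in_copyright(non_renewed, pre_1923_share=PRE_1923_SHARE):
    """Share of the whole collection that is out of copyright.

    non_renewed is a guessed fraction of post-1923 books whose
    copyrights were not renewed; nobody knows the real value.
    """
    non_renewed = Fraction(non_renewed)
    if not 0 <= non_renewed <= 1:
        raise ValueError("non_renewed must be between 0 and 1")
    post_1923_share = 1 - pre_1923_share
    return pre_1923_share + post_1923_share * non_renewed


# try the three guesses discussed in the text
for rate in (Fraction(1, 3), Fraction(1, 2), Fraction(2, 3)):
    share = not_in_copyright(rate)
    print(f"{rate} not renewed -> {float(share):.0%} not in copyright")
```

exact fractions are used so the rounding only happens at print time.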

not that the answer would matter any, because due to the
litigious arena into which we have allowed the project to
be thrown, there's probably no way google would be likely
to take the risk of showing _any_ of the orphaned material.
so we're back to the original 20% that is pre-1923 and clear.

of course, the answer to this is to give google an immunity,
to let them serve as the "test-bed" that will act to bring out
any claims of copyrighted material that might be lurking...

in other words, let google show each book, in full, _until_
some _proof_ of copyright is rendered by another party.
(and i do mean proof, and not just some bullshit claim...)

-bowerbird