
On Wed, 22 Dec 2004, Lloyd Benson wrote:
From: Jon Roland <jon.roland@constitution.org>
Date: Tue, 21 Dec 2004 14:50:22 -0600
Subject: Google Print questions
The announcements would seem to suggest that Google intends not only to scan the page images of all these books, but also to OCR them and correct the recognition errors, so that they can be made searchable, with the complete texts of the public domain works offered along with excerpts of the copyrighted ones (presumably under the fair use doctrine). One announcement also estimated a cost of $10 per volume.
Project Gutenberg has already produced and distributed nearly 15,000 eBooks, with a budget that has yet to reach a significant total over all 33+ years, and is projected to reach a million eBooks without undue expense or effort. We'll just have to wait and see if either Google Print, or any of the various "Million eBook Projects," will ever come up with even 1% of a million eBooks that you can carry with you on a one inch stack of plain homemade DVDs. If it hasn't been proofread, and if you can't take it with you, it is only of limited value. . .sort of like reading over someone's shoulder.

With Project Gutenberg eBooks, you OWN them. . .forever. . .and can save them in your own favorite formats, fonts, margination, pagination, or whatever, and you can search, quote, print, and do all the normal eBook functions. "A picture of a book is not an eBook."

The term eBook should not be used to describe raw scans or raw OCR, as has been tried by some of the Google and "Million eBook" participants over the past decade. I would say that an eBook has to be at least 99.9% accurate, and that there should then be a process, as people read the eBooks, of sending in corrections. Most of the Project Gutenberg and Distributed Proofreaders volunteers would say it has to be over 99.99%, and perhaps even over 99.999%. 99.999% would be one error perhaps every 100 pages or so, and I'm pretty sure the source materials we have are not that accurate. . .not that eBooks won't become more and more accurate, closer and closer to 100% accuracy, but I'm not sure they have to be all that much better than 99.9% before they can be made available.
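To put those percentages in perspective, here is some rough arithmetic; the figure of roughly 1,000 characters per page (and 300 pages per book) is an assumption for illustration, not anything from the announcements:

    # Rough arithmetic on character-level accuracy, assuming ~1,000 characters
    # per page and a 300-page book (both figures are assumptions).
    chars_per_page = 1000
    pages_per_book = 300
    for accuracy in (0.999, 0.9999, 0.99999):
        errors_per_page = chars_per_page * (1 - accuracy)
        errors_per_book = errors_per_page * pages_per_book
        print(f"{accuracy:.3%}: {errors_per_page:g} errors/page, "
              f"{errors_per_book:g} errors/book")
    # Under these assumptions, 99.9% is about one error per page,
    # 99.99% about one every ten pages, and 99.999% about one every
    # 100 pages, which matches the figure quoted above.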
This is highly ambitious, even just to scan the images. The experience of the U. of Michigan should show that it is not feasible to OCR these works accurately for that cost, or in that timeframe. While uncorrected OCR might still enable word search, since most words appear more than once in a work and at least one instance can be expected to be recognized accurately, searching on entire phrases can be expected to be much more problematic.
I have heard this described before. . .has anyone tried their test eBooks???
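On the phrase-search point above, a quick back-of-the-envelope sketch; the per-word recognition rates below are assumptions chosen only for illustration:

    # Probability that an exact n-word phrase survives uncorrected OCR,
    # assuming each word is recognized correctly and independently with
    # probability p (both p values are illustrative assumptions).
    for p in (0.99, 0.95):
        for n in (1, 4, 8):
            print(f"p = {p}: a {n}-word phrase matches with probability {p ** n:.2f}")
    # Single-word search degrades slowly, but at p = 0.95 an 8-word phrase
    # already has only about a 66% chance of matching exactly.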
As one who works with a lot of older works, not only scanning and OCRing them but correcting them, I know how much human labor is involved. There are volunteer efforts like Distributed Proofreaders ( http://www.pgdp.net/c/default.php ), but I have concluded that it takes me more time to set up a project for them than it would take for me to do the proofreading myself, and my work would likely be more accurate, since I understand the underlying content and know how to render obscure text.
While it does take a little time to set up one's first project with Distributed Proofreaders, it is usually quite a bit easier the second time, not to mention that we have volunteers who will walk you through the process the first few times around, which seems to do the trick for nearly everyone.
So my basic question and concern is, how do we ensure that this project does not release too many uncorrected texts into the world that never get corrected, and perhaps propagate errors that come to be accepted as accurate even when they are not?
I wonder how many of these will be "released into the world". . .I have a strong suspicion that the answer is "none," unless some outside source does it.
I would submit that it would be better to prioritize these works and release fully corrected and annotated digital editions of the most important ones first, going for quality rather than quantity. This has been the approach used by online collections such as ours at http://www.constitution.org/liberlib.htm . Although we do put some works up before the correcting and reformatting is finished, we always flag those that are still in progress, indicating the state of completion, and we stand by to quickly make corrections that outsiders may discover are needed.
I view all eBooks as "still in progress," as I have never proofread one in which I didn't find any mistakes. . . .

My own view is that I would prefer to have access to twice as many eBooks at the 99.95% accuracy level [the Library of Congress standard] than half as many at the 99.995% level I think is being suggested here. After all, the books that get read the most will be the ones that get the most corrections. . .an obvious way to aim effort at the proper targets!

Not only that, but, viewing the entire eBook effort as a 50 year process, of which I have walked 33+ years, I must state for the record that I think OCR, spellcheckers, grammar checkers, etc. will be so much better a decade from now that proofreading the more obscure works will require far less effort than it does today. . .a great trade-off.

I'm not at all sure why people want eBooks to be so perfect to start with. I would prefer to get all 10 million public domain works we can find. . .or at least a million of them. . .online and freely downloadable before we try to approach the 100% accuracy level. Of course, I don't believe in the "raw OCR" idea that seems to be what the Google Print idea has in mind, even with spelling and "scanno" checkers, and I also don't believe in going so far in the other direction that we try for such high accuracy levels that the number of eBooks only grows at half the rate it has been growing. The path is obviously somewhere in the middle. . .machine production is obviously not accurate enough [except in certain tests I have seen run with high contrast new materials], and after a certain point it becomes inefficient to keep proofreading before letting the public have access. After all, the public IS what this is all about, is it not? So let's let the public do the final proofreading, as a process, for all the years to come. . .at least until we have OCR that makes only 1 error in a million characters. . .and thus most of the errors we find are from the original publications.

[By the bye, this is one of the reasons for using more than one paper edition to produce an eBook, when multiple paper editions are available. Then the machine processes can compare the editions to find even more errors; a small sketch of such a comparison is appended after the signature.]

Well, enough now. . .let's make more eBooks!!!

Thanks!!!

So Nice To Hear From You!

Happy Holidays!!!

Michael

Give FreeBooks!!! In 39 Languages!!!

As of December 23, 2004 ~14,780 FreeBooks at:
http://www.gutenberg.org
http://www.gutenberg.net
~220 to go to 15,000
We are ~95% of the way from 10,000 to 15,000.

Now even more PG eBooks In 104 Languages!!!
http://gutenberg.cc
http://gutenberg.us

Michael S. Hart <hart@pobox.com>
Project Gutenberg Executive Coordinator
"*Internet User ~#100*"

If you do not receive a prompt reply, please resend, keep resending.
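A minimal sketch of the multi-edition comparison mentioned above: run OCR text from two printings of the same work through a plain diff and let a proofreader look only at the places where the two disagree. The file names here are hypothetical.

    import difflib

    # Compare OCR output from two different printings of the same work and
    # print only the lines where they disagree; a human then checks just
    # those spots against the page images. (File names are made up.)
    with open("edition_a.txt") as f:
        lines_a = f.readlines()
    with open("edition_b.txt") as f:
        lines_b = f.readlines()

    diff = difflib.unified_diff(lines_a, lines_b,
                                fromfile="edition_a.txt", tofile="edition_b.txt")
    for line in diff:
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---")):
            print(line, end="")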