Re: [gutvol-d] Kevin Kelly in NYT on future of digital libraries

1 Jun 2006

      On Wed, 31 May 2006 Bowerbird@aol.com wrote:
...
sebastien said:
[snip] see previous message
...
...
ebooks are much more than photographs of regular analog books.
yes, but photographs of regular analog books
_might_ qualify as e-books, for _some_ people.
different people can disagree on that too.
...
3. is the top we are heading for. 2. is just a step on the way.
but #2 might serve the needs of person x just fine.
...
I did that and got
   20845628 bytes for 604 pages.
scans are resource hogs.   nobody disagrees about that.
one argument is that since these resources are now plentiful,
it doesn't matter that scans are resource hogs.
different people can disagree on that too.
as long as we can easily move scan-sets to digitized text,
i don't see much purpose in continuing to debate these two
as if they were competitors.   they're not.   they're complementary.
-bowerbird

Several issues worth thinking about here:

File size, bandwidth, storage:  important to whom?

Are all scans good for OCR?

Do raw scans qualify as eBooks?

File size, bandwidth, storage:  important to whom?

Perhaps the way to think about this is to consider
just how many more or fewer readers we would get if
the file sizes were that much larger or smaller.

In the end, I think we should provide both.
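
To put the quoted figure in perspective, here is a rough sketch of the
arithmetic.  The byte and page counts are the ones quoted above; the link
speeds are assumptions of mine, chosen only for illustration, and were not
part of the original report.

```python
# Figures quoted earlier in this thread.
SCAN_BYTES = 20_845_628
PAGES = 604

# Assumption: illustrative link speeds in bits per second (not from the thread).
LINK_SPEEDS_BPS = {
    "56k modem": 56_000,
    "1 Mbit/s broadband": 1_000_000,
}


def bytes_per_page(total_bytes, pages):
    """Average size of one scanned page, in bytes."""
    return total_bytes / pages


def download_seconds(total_bytes, bits_per_second):
    """Ideal transfer time, ignoring protocol overhead and congestion."""
    return total_bytes * 8 / bits_per_second


if __name__ == "__main__":
    print(f"average page: {bytes_per_page(SCAN_BYTES, PAGES):.0f} bytes")
    for name, speed in LINK_SPEEDS_BPS.items():
        minutes = download_seconds(SCAN_BYTES, speed) / 60
        print(f"{name}: about {minutes:.1f} minutes")
```

On these assumptions the scan set averages roughly 34.5 KB per page, and
the whole book is most of an hour on a dial-up line but only a few minutes
on the faster link, which may be why readers disagree about whether the
size matters.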

Are all scans good for OCR?

Some operations deliberately do not put their high
resolution scans online for downloading, rather an
automated process reduces the resolution, so these
scans are no longer suitable for OCRing.

Requests for those higher resolution scans seem to
have a very limited success rate.

The odds of being able to create a complete eBook,
using those scans that are usually made available,
are perhaps about 1/4 to 1/3, based on the reports
you have probably already seen.

Once you go through the effort of scanning missing
pages, rescanning the pages that did not work with
your OCR programs, etc., it often might seem worth
the effort simply to scan the entire book with the
higher resolution scans that you can then post for
others to use.

Do raw scans qualify as eBooks?

Obviously those who would prefer to claim a larger
number of eBooks with a smaller amount of effort
would prefer to be able to claim raw scans=eBooks.

As mentioned in the various steps above, scanning,
such as it is, can be nearly completely automated,
to the point of cutting off book bindings, feeding
the pages to the scanner in the same way as copier
machines let you feed in stacks of pages, and then
claiming the result of that minimal labor as eBook
output in the catalog.

This is the "quick and dirty approach" and doesn't
cost much in terms of time, effort or money and it
does provide a reasonably readable output if pages
go through smoothly.  Apparently they don't always
go so smoothly, as many of the books were reported
to have missing pages not to mention pages scanned
poorly enough to be a problem; the report I recall
mentioned some 30% as being acceptable:  but that
figure does not take into account some setups that
were intentionally created to be unsuitable for OCR.

***

I suppose the real question comes down to purposes
for making eBooks.

Obviously Google, Yahoo, Amazon, and those Library
of Congress projects all have different purposes:
and it remains to be seen how much of those purposes
will be revealed as they each start to move from a
single percentage point of their goals to counting
a majority of their collection as completed.

The various university projects still seem to be a
great deal concerned with keeping their eBooks out
of the hands of the public, as is Google, though the
Google philosophy may be in the process of change.
Right now it's hard to tell what Google has chosen
as their goal; will they really try to do millions
of books in the next 54 months after doing perhaps
0.1 million in the first 18 months?  Will Google
change their philosophy on downloading scans, and/or
on downloading their full text searching database?
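
To see how steep that climb would be, here is a small sketch of the rate
arithmetic.  The 0.1 million books, 18 months and 54 months come from the
paragraph above; the targets of 1, 3 and 10 million are purely my own
hypothetical stand-ins for "millions," since no actual goal is stated.

```python
# Figures from the paragraph above.
BOOKS_SO_FAR = 100_000      # "perhaps 0.1 million"
MONTHS_SO_FAR = 18
MONTHS_REMAINING = 54

# Assumption: hypothetical readings of "millions" (not stated anywhere).
HYPOTHETICAL_TARGETS = [1_000_000, 3_000_000, 10_000_000]


def monthly_rate(books, months):
    """Average number of books completed per month."""
    return books / months


def required_speedup(target_books):
    """How many times faster than the rate so far the next period must be
    in order to complete target_books within MONTHS_REMAINING."""
    needed = monthly_rate(target_books, MONTHS_REMAINING)
    current = monthly_rate(BOOKS_SO_FAR, MONTHS_SO_FAR)
    return needed / current


if __name__ == "__main__":
    current = monthly_rate(BOOKS_SO_FAR, MONTHS_SO_FAR)
    print(f"rate so far: about {current:.0f} books per month")
    for target in HYPOTHETICAL_TARGETS:
        print(f"{target:>10,} books needs {required_speedup(target):.1f}x that rate")
```

Under those assumed targets, even the most modest reading of "millions"
calls for several times the pace of the first 18 months.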

Until Google decides to actually proofread eBooks,
I don't think they will want anyone to see what an
eBook from Google looks like in full text:  simply
because it would be too obvious that proofreading,
even on a moderate basis, is not part of the plan.
However, I _DO_ think that the "second pass" eBook
collection, whether done by Google or others, will
be good enough:  simply due to advanced technology,
someone will do it all over again, 10 times better
and 10 times faster and 10 times cheaper.

However, I don't predict this before 2020.

So, there it is in a nutshell, what eBooks will be
in the near and distant future, as I see it.

Will raw scans ever be the default?

No.

Why?

Because full text will become easier to make, and
people will keep making more and more full text
eBooks in contrast to the raw scans.

Obviously raw scans will continue to be cheap/easy
for another few years, perhaps long enough for the
Google, Yahoo, etc., efforts to claim some success
in that area, but by the time they could claim any
real success we will find that full text is coming
along fast enough that the Google efforts would be
lost in the shuffle as better full text emerges.

My own goal has always been for the public to have
their own home eLibraries, just as they have their
own home computers.  These eLibraries should be an
entirely flexible set of products that can be read
in virtually any hardware/software combination for
the world at large to use.  Such libraries are not
dependent on particular search engines, or formats
or any other particular product.  Everyone will be
free to keep their own copies of these libraries--
the number of persons owning libraries from now on 
will rise on the same order as did people owning a
book after the invention of Gutenberg's Press.

Thanks!!!

Give the world eBooks in 2006!!!

Michael S. Hart
Founder
Project Gutenberg

Blog at http://hart.pglaf.org
