
On Wed, 31 May 2006 Bowerbird@aol.com wrote:
sebastien said:
[snip] see previous message
ebooks are much more than photographs of regular analog books.
yes, but photographs of regular analog books _might_ qualify as e-books, for _some_ people.
different people can disagree on that too.
3. is the top we are heading for. 2. is just a step on the way.
but #2 might serve the needs of person x just fine.
I did that and got 20845628 bytes for 604 pages.
scans are resource hogs. nobody disagrees about that.
one argument is that since these resources are now plentiful, it doesn't matter that scans are resource hogs.
different people can disagree on that too.
as long as we can easily move scan-sets to digitized text, i don't see much purpose in continuing to debate these two as if they were competitors. they're not. they're complementary.
-bowerbird
Several issues worth thinking about here:

File size, bandwidth, storage: important to whom?
Are all scans good for OCR?
Do raw scans qualify as eBooks?

File size, bandwidth, storage: important to whom?

Perhaps the way to think about this is to consider just how many more or fewer readers we would get if the file sizes were that much larger or smaller. In the end, I think we should provide both.

Are all scans good for OCR?

Some operations deliberately do not put their high resolution scans online for downloading; rather, an automated process reduces the resolution, so these scans are no longer suitable for OCRing. Requests for those higher resolution scans seem to have a very limited success rate. The odds of being able to create a complete eBook, using those scans that are usually made available, are perhaps about 1/4 to 1/3, based on the reports you have probably already seen. Once you go through the effort of scanning missing pages, rescanning the pages that did not work with your OCR programs, etc., it often might seem worth the effort simply to scan the entire book at a higher resolution and then post those scans for others to use.

Do raw scans qualify as eBooks?

Obviously those who would prefer to claim a larger number of eBooks while using a smaller amount of effort would prefer to be able to claim raw scans = eBooks. As mentioned in the various steps above, scanning, such as it is, can be nearly completely automated, to the point of cutting off book bindings, feeding the pages to the scanner in the same way as copier machines let you feed in stacks of pages, and then claiming the result of that minimal labor as eBook output in the catalog. This is the "quick and dirty approach": it doesn't cost much in terms of time, effort or money, and it does provide a reasonably readable output if pages go through smoothly.
Apparently they don't always go so smoothly, as many of the books were reported to have missing pages, not to mention pages scanned poorly enough to be a problem; the report I recall mentioned some 30% as being acceptable, but that figure does not take into account some setups intentionally created to be unsuitable for OCR.

***

I suppose the real question comes down to purposes for making eBooks. Obviously Google, Yahoo, Amazon, and those Library of Congress projects all have different purposes, and it remains to be seen how much of those purposes will be revealed as they each start to move from a single percentage point of their goals to counting a majority of their collection as completed.

The various university projects still seem to be a great deal concerned with keeping their eBooks out of the hands of the public, as Google has been, though the Google philosophy may be in the process of change.

Right now it's hard to tell what Google has chosen as their goal; will they really try to do millions of books in the next 54 months after reaching perhaps 0.1 million in the first 18 months? Will Google change their philosophy regarding downloading scans and/or downloading their full text searching database?

Until Google decides to actually proofread eBooks, I don't think they will want anyone to see what an eBook from Google looks like in full text, simply because it would be too obvious that proofreading, even on a moderate basis, is not part of the plan.

However, I _DO_ think that the "second pass" eBook collection, whether done by Google or others, will be good enough: simply due to advanced technology, someone will do it all over again, 10 times better and 10 times faster and 10 times cheaper. However, I don't predict this before 2020.

So, there it is in a nutshell, what eBooks will be in the near and distant future, as I see it.

Will raw scans ever be the default? No. Why?
Because full text will become easier to create, and people will keep making more and more full text eBooks in contrast to the raw scans. Obviously raw scans will continue to be cheap/easy for another few years, perhaps long enough for the Google, Yahoo, etc., efforts to claim some success in that area, but by the time they could claim any real success we will find that full text is coming along fast enough that the Google efforts would be lost in the shuffle as better full text emerges.

My own goal has always been for the public to have their own home eLibraries, just as they have their own home computers. These eLibraries should be an entirely flexible set of products that can be read in virtually any hardware/software combination for the world at large to use. Such libraries are not dependent on particular search engines, or formats, or any other particular product. Everyone will be free to keep their own copies of these libraries; the number of persons owning libraries from now on will rise on the same order as did the number of people owning a book after the invention of Gutenberg's Press.

Thanks!!!

Give the world eBooks in 2006!!!

Michael S. Hart
Founder
Project Gutenberg

Blog at http://hart.pglaf.org