
From what I understand, Google and the five libraries are going to put some serious DRM on all their sites, and these scans and files will NOT go out.
mh

On Wed, 15 Dec 2004, Jon Noring wrote:
Tony Kline wrote:
Bowerbird@aol.com wrote:
well, since google _is_ a search engine, they'll obviously o.c.r. the text. and clean up the text, because errors would muck up their search engine.
Did they say OCR, or did you deduce that? I got the impression they are imaging pages, and maybe adding some identifying keywords for each page. That is, you'll be able to Google to a title, chapter, and page, maybe, but you won't be able to Google within pages. Try OCR'ing some of the stuff in the Bodleian...there ain't no such fonts!! Does anyone know what they mean by digitizing?
My understanding, which may be wrong, is that Google will OCR the page scans, but do only cursory machine cleanup of the raw unstructured text that results (which I call "raw digital text" or RDT), and use the still-error-laden RDT in their search system to pull up the page scans (or simply to refer to the book title and page number).
[Obviously, RDT will have numerous scanning errors, and those who are familiar with the output of OCR engines know that RDT is overall one big ball of wax. Certainly Google can write some advanced program to try to clean up the more obvious scanning errors in the RDT, but it will only correct some of them; the result is probably still good enough for search purposes. I rather doubt they will do any human proofing (it is way too expensive, and anyway, it's better to turn the public domain stuff over to Distributed Proofreaders, who will do it *for free* via enthusiastic volunteer power. Any corporate entity that does not take advantage of free human labor to further its business is not serving its stockholders!)]
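Just to illustrate the general idea (my own back-of-the-envelope sketch in Python, not anything Google has described), a page-level index over error-laden RDT might work roughly like this, with hits pointing back to the page scans rather than to cleaned-up text, and with "cleanup" being nothing more than a couple of crude substitutions for common OCR confusions:

    import re
    from collections import defaultdict

    def normalize(word):
        """Crude cleanup of a couple of common OCR confusions (illustrative only)."""
        return (word.lower()
                    .replace("0", "o")   # digit zero misread for letter o
                    .replace("1", "l"))  # digit one misread for letter l

    def build_index(pages):
        """pages maps page number -> raw OCR text; returns term -> set of page numbers."""
        index = defaultdict(set)
        for page_no, text in pages.items():
            for word in re.findall(r"[A-Za-z0-9]+", text):
                index[normalize(word)].add(page_no)
        return index

    def search(index, query):
        """Return the pages whose (still error-laden) RDT contains every query term."""
        terms = [normalize(t) for t in query.split()]
        hits = index[terms[0]].copy() if terms else set()
        for t in terms[1:]:
            hits &= index[t]
        return hits

    # A hit would be answered with the page scan (an image), never the RDT itself.
    pages = {42: "Whan that Aprill with his shoures soote",
             43: "The droghte of March hath perced to the roote"}
    print(search(build_index(pages), "Aprill shoures"))   # -> {42}

The point is that the RDT only has to be good enough to land the reader on the right scanned page; it never has to be shown or proofed.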
Interestingly, this is what the University of Michigan (one of the Google partners, I believe) did in their "Making of America" collection, which has been around for a few years now. See:
http://www.hti.umich.edu/m/moagrp/
MoA scanned the books, placed the scanned page images online (they are freely available -- it's a cool collection that, strangely, hardly anyone has heard of), and built a search engine to search the resulting RDT from OCR. Then one by one they have been converting the RDT from selected books to highly-proofed SDT (structured digital text) using human proofers and TEI (I think) for structuring. So, the scans came first, and then the cleanup was (and is being) done at a later time.
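To give a feel for what that "structuring" step amounts to (a purely hypothetical, minimal sketch -- real MoA/TEI documents carry far more metadata and structure than this), proofed text might be wrapped in a bare-bones TEI-style skeleton along these lines:

    import xml.etree.ElementTree as ET

    def tei_skeleton(title, paragraphs):
        """Wrap proofed paragraphs in a bare-bones TEI-style document (illustrative only)."""
        tei = ET.Element("TEI", xmlns="http://www.tei-c.org/ns/1.0")
        header = ET.SubElement(tei, "teiHeader")
        title_stmt = ET.SubElement(ET.SubElement(header, "fileDesc"), "titleStmt")
        ET.SubElement(title_stmt, "title").text = title
        body = ET.SubElement(ET.SubElement(tei, "text"), "body")
        div = ET.SubElement(body, "div", type="chapter")
        for para in paragraphs:
            ET.SubElement(div, "p").text = para
        return ET.tostring(tei, encoding="unicode")

    print(tei_skeleton("Making of America sample",
                       ["First proofed paragraph.", "Second proofed paragraph."]))

That kind of structured, human-proofed text is what I mean by SDT, as opposed to the raw RDT the search engine runs over.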
It's entirely possible that Google will, upon request, give the page scans for any public-domain books they've scanned to established groups like Distributed Proofreaders for conversion into proofed SDT, so long as Google gets a copy of the resulting high-quality SDT. I hope they will do this. If not, it will be disappointing -- but at least we have the Internet Archive, which will make all of its scanned books available to the world. They may end up with over one million books, enough to feed Distributed Proofreaders for quite a while.
Jon Noring