
From what I understand, Google and the five libraries are going to put some serious DRM on all their sites, and these scans and files will NOT go out.
mh

On Wed, 15 Dec 2004, Jon Noring wrote:
Tony Kline wrote:
Bowerbird@aol.com wrote:
well, since google _is_ a search engine, they'll obviously o.c.r. the text. and clean up the text, because errors would muck up their search engine.
Did they say OCR, or did you deduce that? I got the impression they are imaging pages, and maybe adding some identifying keywords for each page. That is, you'll be able to Google to a title, chapter, and page, maybe, but you won't be able to Google within pages. Try OCR'ing some of the stuff in the Bodleian...there ain't no such fonts!! Does anyone know what they mean by digitizing?
My understanding, which may be wrong, is that Google will OCR the page scans, but do only cursory machine cleanup of the raw unstructured text that results (which I call "raw digital text" or RDT), and use the still-error-laden RDT in their search system to pull up the page scans (or simply to refer to the book title and page number).
[Obviously, RDT will have numerous scanning errors, and those who are familiar with the output of OCR engines know that RDT is overall one big ball of wax. Certainly Google can write some advanced program to try to clean up the more obvious scanning errors in the RDT, but it will only correct some of them; the result is probably still good enough for search purposes. I rather doubt they will do any human proofing (it is way too expensive, and anyway, it's better to turn the public domain stuff over to Distributed Proofreaders, who will do it *for free* via enthusiastic volunteer power. Any corporate entity that does not take advantage of free human labor to further its business is not serving its stockholders!)]
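Just to illustrate the general idea (my own back-of-the-envelope sketch in Python, not anything Google has described), a page-level index over error-laden RDT might work roughly like this, with hits pointing back to the page scans rather than to cleaned-up text, and with "cleanup" being nothing more than a couple of crude substitutions for common OCR confusions:

    import re
    from collections import defaultdict

    def normalize(word):
        """Crude cleanup of a couple of common OCR confusions (illustrative only)."""
        return (word.lower()
                    .replace("0", "o")   # digit zero misread for letter o
                    .replace("1", "l"))  # digit one misread for letter l

    def build_index(pages):
        """pages maps page number -> raw OCR text; returns term -> set of page numbers."""
        index = defaultdict(set)
        for page_no, text in pages.items():
            for word in re.findall(r"[A-Za-z0-9]+", text):
                index[normalize(word)].add(page_no)
        return index

    def search(index, query):
        """Return the pages whose (still error-laden) RDT contains every query term."""
        terms = [normalize(t) for t in query.split()]
        hits = index[terms[0]].copy() if terms else set()
        for t in terms[1:]:
            hits &= index[t]
        return hits

    # A hit would be answered with the page scan (an image), never the RDT itself.
    pages = {42: "Whan that Aprill with his shoures soote",
             43: "The droghte of March hath perced to the roote"}
    print(search(build_index(pages), "Aprill shoures"))   # -> {42}

The point is that the RDT only has to be good enough to land the reader on the right scanned page; it never has to be shown or proofed.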
Interestingly, this is what the University of Michigan (one of the Google partners, I believe) did in their "Making of America" collection, which has been around for a few years now. See:
http://www.hti.umich.edu/m/moagrp/
MoA scanned the books, placed the scanned page images online (they are freely available -- it's a cool collection that, strangely, hardly anyone has heard of), and built a search engine to search the resulting RDT from OCR. Then one by one they have been converting the RDT from selected books to highly-proofed SDT (structured digital text) using human proofers and TEI (I think) for structuring. So, the scans came first, and then the cleanup was (and is being) done at a later time.
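To give a feel for what that "structuring" step amounts to (a purely hypothetical, minimal sketch -- real MoA/TEI documents carry far more metadata and structure than this), proofed text might be wrapped in a bare-bones TEI-style skeleton along these lines:

    import xml.etree.ElementTree as ET

    def tei_skeleton(title, paragraphs):
        """Wrap proofed paragraphs in a bare-bones TEI-style document (illustrative only)."""
        tei = ET.Element("TEI", xmlns="http://www.tei-c.org/ns/1.0")
        header = ET.SubElement(tei, "teiHeader")
        title_stmt = ET.SubElement(ET.SubElement(header, "fileDesc"), "titleStmt")
        ET.SubElement(title_stmt, "title").text = title
        body = ET.SubElement(ET.SubElement(tei, "text"), "body")
        div = ET.SubElement(body, "div", type="chapter")
        for para in paragraphs:
            ET.SubElement(div, "p").text = para
        return ET.tostring(tei, encoding="unicode")

    print(tei_skeleton("Making of America sample",
                       ["First proofed paragraph.", "Second proofed paragraph."]))

That kind of structured, human-proofed text is what I mean by SDT, as opposed to the raw RDT the search engine runs over.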
It's entirely possible that Google will, upon request, give the page scans for any public-domain books they've scanned to established groups like Distributed Proofreaders for conversion into proofed SDT, so long as Google gets a copy of the resulting high-quality SDT. I hope they will do this. If not, it will be disappointing -- but at least we have the Internet Archive, which will make all of its scanned books available to the world. They may end up with over one million books, enough to feed Distributed Proofreaders for quite a while.
Jon Noring