re: [BP] Google Partners with Oxford, Harvard & Others to Digitize Libraries

On Tue, 14 Dec 2004, Tony Kline wrote:
> Bowerbird@aol.com wrote:
>> tony said:
>>> That's very good, though image files hardly meet the needs of those users who want digital text and the ability to download, cut and paste, etc.
>> well, since google _is_ a search engine, they'll obviously o.c.r. the text. and clean up the text, because errors would muck up their search engine.
> Did they say OCR, or did you deduce that? I got the impression they are imaging pages, and maybe adding some identifying keywords for each page. That is, you'll be able to Google to a title, chapter, and page, maybe, but you won't be able to Google within pages. Try OCR'ing some of the stuff in the Bodleian... there ain't no such fonts!! Does anyone know what they mean by digitizing?
Here's what I have gleaned from five TV network news shows and the various NYT, SF Chron, etc., articles: there will be one "full text" repository at Google, but users won't be able to access more than a "snippet" around any quotation they look up, much as with general Google searches today. If they want more, they will have to click on the item and will then arrive at a second database, this one provided by one of the five libraries [NYPL, Harvard, Michigan, Stanford, Oxford], where they will get a graphical representation of the non-printable page that contains the quotation. Why they chose to call it "Google Print" when printing is outlawed, I have no idea.

Michael
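P.S. To picture that two-step flow, here is a toy sketch in Python. Pure speculation on my part: every name, number, and data structure below is invented for illustration, not anything Google has described.

    # A "full text" repository keyed by (title, page); the page images
    # themselves stay on the owning library's server.
    index = {
        ("Some Old Book", 42): "it was the best of times, it was the worst of times ...",
        # ... one entry per scanned page ...
    }

    def search(query, radius=60):
        """Yield a short snippet around each match -- never the full
        page text -- plus a pointer to the library's page image."""
        q = query.lower()
        for (title, page), text in index.items():
            pos = text.lower().find(q)
            if pos != -1:
                snippet = "..." + text[max(0, pos - radius):pos + len(q) + radius] + "..."
                yield title, page, snippet  # click-through goes to the library's image viewer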

Tony Kline wrote:
> Did they say OCR, or did you deduce that? I got the impression they are imaging pages, and maybe adding some identifying keywords for each page. That is, you'll be able to Google to a title, chapter, and page, maybe, but you won't be able to Google within pages. Try OCR'ing some of the stuff in the Bodleian... there ain't no such fonts!! Does anyone know what they mean by digitizing?
I don't think so. Have you seen catalog.google.com?

David A. Desrosiers
desrod@gnu-designs.com
http://gnu-designs.com

Tony Kline wrote:
> Bowerbird@aol.com wrote:
>> well, since google _is_ a search engine, they'll obviously o.c.r. the text. and clean up the text, because errors would muck up their search engine.
> Did they say OCR, or did you deduce that? I got the impression they are imaging pages, and maybe adding some identifying keywords for each page. That is, you'll be able to Google to a title, chapter, and page, maybe, but you won't be able to Google within pages. Try OCR'ing some of the stuff in the Bodleian... there ain't no such fonts!! Does anyone know what they mean by digitizing?
My understanding, which may be wrong, is that Google will OCR the page scans, but do only cursory machine cleanup of the raw unstructured text that results (which I call "raw digital text" or RDT), and use the still-error-laden RDT in their search system to pull up the page scans (or simply to refer to book title and page number).

[Obviously, RDT will have numerous scanning errors, and those who are familiar with the output of OCR engines know that RDT is overall one big ball of wax. Certainly Google can write some advanced program to try to clean up the more obvious scanning errors in the RDT; it will only correct some of the errors, but the result is probably good enough for search purposes. I rather doubt they will do any human proofing (it is way too expensive, and anyway, it's better to turn the public domain stuff over to Distributed Proofreaders, who will do it *for free* via enthusiastic volunteer power. Any corporate entity that does not take advantage of free human labor to further its business is not serving its stockholders!)]

Interestingly, this is what the University of Michigan (one of the Google partners, I believe) did in their "Making of America" collection, which has been around for a few years now. See:

http://www.hti.umich.edu/m/moagrp/

MoA scanned the books, placed the scanned page images online (they are freely available -- it's a cool collection that, strangely, hardly anyone has heard of), and built a search engine to search the resulting RDT from OCR. Then, one by one, they have been converting the RDT from selected books to highly-proofed SDT (structured digital text), using human proofers and TEI (I think) for structuring. So the scans came first, and the cleanup was (and is being) done at a later time.

It's entirely possible that Google will give, upon request, the page scans for any public domain books they've scanned to established groups like Distributed Proofreaders for conversion into proofed SDT, so long as Google gets a copy of the resulting high-quality SDT. I hope they will do this. If not, it will be disappointing -- but at least we have the Internet Archive, who will make all their scanned books available to the world. They may end up with over one million books, enough to feed Distributed Proofreaders for quite a while.

Jon Noring
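P.S. For the curious, the sort of "cursory machine cleanup" I have in mind might look something like the sketch below (Python, and only a guess at the general technique; I have no idea what Google will actually run). It accepts a correction only when a common OCR confusion turns a non-word into a dictionary word, so most of the RDT passes through untouched:

    # Common OCR character confusions; real engines use far longer lists.
    SWAPS = [("rn", "m"), ("vv", "w"), ("1", "l"), ("0", "o")]

    def clean_word(word, dictionary):
        if word.lower() in dictionary:
            return word                   # already a known word, leave it
        for bad, good in SWAPS:
            fixed = word.replace(bad, good)
            if fixed.lower() in dictionary:
                return fixed              # e.g. "brovvn" -> "brown"
        return word                       # still error-laden RDT

    def clean_rdt(text, dictionary):
        return " ".join(clean_word(w, dictionary) for w in text.split())

A pass like this corrects some of the errors, leaves plenty behind, and is exactly the "good enough for search" level of quality I mean.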

Jon Noring wrote:
> It's entirely possible that Google will give, upon request, the page scans for any public domain books they've scanned to established groups like Distributed Proofreaders for conversion into proofed SDT, so long as Google gets a copy of the resulting high-quality SDT.
My guess is that part of the deal is that the libraries are going to get copies of those page scans, and they will probably make them available in various ways in addition to whatever Google does with them.

By the way, it's astonishing to me how far OCR has come in the last 10 years. I think the low cost of storage has made page-image storage of many historical documents feasible, relatively suddenly, and that means that the problem of OCR'ing handwritten text, odd fonts, early books, and other similar things has suddenly become a hot research topic.

Bill
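P.S. A quick back-of-the-envelope on the storage point, with rough numbers of my own choosing: at about 100 KB per compressed page image and 300 pages per book, a million books come to roughly 1,000,000 x 300 x 100 KB = 30 TB. At current disk prices of very roughly a dollar per gigabyte, that is on the order of $30,000 of raw disk; a decade ago the same store would have cost hundreds of times as much, which is why whole-page imaging only recently became practical.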

On Thu, 16 Dec 2004, Bill Janssen wrote:
>> It's entirely possible that Google will give, upon request, the page scans for any public domain books they've scanned to established groups like Distributed Proofreaders for conversion into proofed SDT, so long as Google gets a copy of the resulting high-quality SDT.
> My guess is that part of the deal is that the libraries are going to get copies of those page scans, and they will probably make them available in various ways in addition to whatever Google does with them.
AFAIK each library will keep the scans of their own books, and z/j/ealously guard them. . . .
> By the way, it's astonishing to me how far OCR has come in the last 10 years. I think the low cost of storage has made page image storage of many historical documents feasible, relatively suddenly, and that means that the problem of OCR'ing handwritten text, odd fonts, early books, and other similar things has suddenly become a hot research topic.
I heard they are still having huge troubles with older books. . . .

mh

From what I understand, Google and the five libraries are going to put some serious DRM on all their sites, and these scans and files will NOT go out.
mh

participants (4)
- Bill Janssen
- David A. Desrosiers
- Jon Noring
- Michael Hart