
What the Million Book Project does is fundamentally different from what PG does. MBP is in the business of providing scans of actual pages. To facilitate searching, they run OCR to obtain text for the search engine, the general idea being that any important word will turn up enough times in a book to have been OCR'd correctly at least once. The DjVu system allows them to take the user to the correct page and highlight the word(s) found. Virtually all of the image archives do something similar. The key words appear to be "full text search".

What MBP does NOT do is provide corrected OCR so that the book is readable in plain text. Unlike most of the other archives, which bury the OCR'd text so that only the search engine can see it, MBP allows access to the uncorrected OCR. Examples like the one you give show exactly what DP does: turn nonsense into real text. MBP is very friendly and encourages DP and PG to make as much use of their scans as possible. Having real text versions of their books is only to their advantage and that of their users.

Virtually none of the other image archives provide corrected text. It is simply cost-prohibitive to do so. What DP does as a volunteer effort would be extremely costly to replicate. For this reason alone, I believe that Google will be using raw OCR behind their scans. On new material, raw OCR from a good program can be very close to 100% correct. It is the older material that causes problems.

BTW, MBP does NOT destroy the books they are having scanned. The books are sent to India (or China) for scanning on orbital scanners (think a really good digital camera). Orbital scanning still requires a human to turn the pages, but it is much gentler on the books than flatbed scanning. Books that are on loan from US libraries will be returned to them. Other material is given to libraries and schools in India when it is no longer needed.
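The full-text-search idea mentioned earlier (an important word recurs often enough to be OCR'd correctly at least once) can be put into rough numbers. This is only a back-of-the-envelope sketch, not anything MBP has published: the function name `prob_found`, the 50% per-word accuracy, and the assumption that OCR errors on separate occurrences are independent are all mine, chosen for illustration.

```python
def prob_found(p_correct: float, occurrences: int) -> float:
    """Chance that at least one occurrence of a word is OCR'd correctly.

    Assumes (hypothetically) that each occurrence is recognised correctly
    with the same probability p_correct, independently of the others.
    """
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must be between 0 and 1")
    if occurrences < 0:
        raise ValueError("occurrences must be non-negative")
    # P(at least one correct) = 1 - P(every occurrence is wrong)
    return 1.0 - (1.0 - p_correct) ** occurrences


if __name__ == "__main__":
    # Illustrative only: a poor scan where just half the words come out right.
    for n in (1, 3, 5, 10):
        print(f"{n:2d} occurrences -> {prob_found(0.5, n):.4f}")
```

Even with that made-up 50% accuracy, a word that appears ten times is very likely to be found by the search engine, which is why raw OCR can be good enough for searching while still being unreadable as plain text.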
The Internet Archive, home of the MBP scans, is also working with the University of Toronto to use a robotic scanner to scan their books. The robot does not damage the books, and they are returned to the shelves after scanning. The process is still ramping up, but they are cranking out more books all the time. I'm told that a few DPers from that area are working (for money, lucky them) with the robot.

Someone posted on this list (I think) a while ago with a really good list of what true text (corrected OCR) books can be used for that is not possible with image scans. The difference is huge.

JulietS
DP Administrator

N Wolcott wrote:

Here is Chapter I of The Steam House by Jules Verne. I think they have a way to go. I wonder if Google is planning to do better? To be fair, the DjVu view of the rather bad images allows them to be mostly made out, so one could reconstruct most of this page. The first line is "TWO THOUSAND POUNDS FOR A HEAD". The images look as if they were taken with a 300 mp camera, so the poor OCR is not surprising. I think PG texts are easier to read. I understand the books are being destroyed after scanning: might as well put them right into the dumpster and save all the trouble.

CHAPTER L "•nv<» thousand rnuNns for a hkaiv,"
*
*
" A tst'.WAfct* of two thousand pounds will he paid to any Mjtr who will drhVrr up, dead or alive, oik* of the prime movrr» <»f thf Srpoy revolt, *at *present known to he in the ifMdrney, the Nabob Datulou Taut, nnnuionly
Jitu'h *vv»t'i *the ijotirr rratl hy the iiilialntants tif Aurun*
jUitMul, mi tltr rvniiiijj uf the* 6th of March, 1807,
A cojy iff the jil,u,;in! h«u! been recently affixed to the wall iif «i loiirly and ntiiuu! hunj;ulow on the hanks of llir iJitiuUttiu, *iud already the conu*r of the paper htvir- iiijj flu* j;r«'on*l fi*iiiir*-«a fiaiiir cx<,%nited hy tiuinCj .secretly Utltuii'rt) hy uthn^r ji'**"WiH jjone,
The fiiiiti^ had Urea there^ printal in Inrjjo letters, hut If wafi torn off hy the hand of a ^solitary fakir who IKij^ed t»y that Ucst^latc *sj>c,it, The name of the Governor

N Wolcott nwolcott2@post.harvard.edu

_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/listinfo.cgi/gutvol-d