
On 2/23/2010 10:47 AM, Jim Adcock wrote:
Begs the question why DP doesn't just institute a quality hosted OCR and let people just submit the page images. Ask people to test run a couple pages by the hosted OCR before settling on their digitization settings in order to make sure they know what they are doing.
Having done OCR on several thousand books, I can safely say that even with the most advanced OCR programs currently available, this is NOT a good idea for books of any complexity at all. It might be OK for straight fiction.

The big stumbling block is that ABBYY often segments the page incorrectly or orders the segments incorrectly. A classic example comes up in the Table of Contents, where it may group all of the chapter titles into one block and all of the page numbers into another. When this is saved as plain text for proofing, the result is lines of chapter titles followed by lines of page numbers, when what we really want is for each page number to appear on the same line as its chapter title. Not much fun for the proofers to clean up. ABBYY 10 does much better than the previous version I was using, but still sometimes gets things wrong. Getting blocks of text in the wrong order, as can sometimes happen when multiple illustrations on a page divide the text into separate blocks, is equally bad. Another common OCR error is missing the last word of a paragraph when it appears by itself on a line.

When I scan a book, I keep an eye out for any pages having anything other than a single solid block of text. If the book has any, I'll then go through page by page to make sure that the OCR got the text block segmentation and order correct. I often end up redrawing the text blocks, sometimes re-ordering them, and then running the OCR a second time on that page. I would not trust a "batch" or "remote" OCR program to do this correctly. Despite assertions to the contrary, the content providers at DP do go to considerable lengths to make things easier for the other volunteers.

There are other problems with providing a central OCR service, including expense and processing load. But to my mind, the definitive problem is what I outlined above.
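
To make the Table of Contents failure concrete, here is a minimal Python sketch of what the mis-segmented plain text looks like and how the simplest case might be detected and re-paired. Everything in it is an assumption made for illustration: the function name `rejoin_split_toc`, the sample titles, and the rule that a page number is a line of digits only. It is not part of ABBYY or of any DP tool.

```python
import re

# Assumption: a "page number" line consists of digits only.
PAGE_NUMBER = re.compile(r"\d+")


def rejoin_split_toc(lines):
    """Re-pair a block of titles followed by a block of page numbers.

    Returns a list of "title number" strings, or None when the page does
    not look like a title block followed by an equally long number block.
    """
    lines = [ln.strip() for ln in lines if ln.strip()]
    half, odd = divmod(len(lines), 2)
    if odd or half < 2:
        return None
    titles, numbers = lines[:half], lines[half:]
    # The first half must contain no bare numbers ...
    if any(PAGE_NUMBER.fullmatch(t) for t in titles):
        return None
    # ... and the second half must contain nothing else.
    if not all(PAGE_NUMBER.fullmatch(n) for n in numbers):
        return None
    return [f"{t} {n}" for t, n in zip(titles, numbers)]


# Hypothetical OCR output: titles in one block, page numbers in another.
ocr_lines = ["The Arrival", "A Storm at Sea", "The Island", "1", "17", "42"]
fixed = rejoin_split_toc(ocr_lines)
print(fixed)

# Ordinary prose is left alone.
plain = rejoin_split_toc(["It was a dark night.", "The wind howled."])
print(plain)
```

A check like this only covers the tidiest version of the problem. If the OCR drops a title or a number, the two blocks no longer line up and nothing in the text says which pairing is right, so someone still has to look at the page image.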
Without an interactive capability, OCR results are often not good enough for books of any complexity. And before someone says "so make the OCR engine on the server interactive", let me say that the communication and processing costs would be prohibitively expensive and, further, that the OCR engines sold for that kind of multi-user production use don't (as far as I know) make that kind of interaction easy to accomplish.

JulietS