[gutvol-d] Re: let us not be confused

23 Feb 2010

      On 2/23/2010 10:47 AM, Jim Adcock wrote:
...
> Begs the question why DP doesn't just institute a quality hosted OCR and let
> people just submit the page images. Ask people to test run a couple pages by
> the hosted OCR before settling on their digitization settings in order to
> make sure they know what they are doing.

Having done OCR on several thousand books, I can safely say that even 
with the most advanced OCR programs currently available, this is NOT a 
good idea for books of any complexity at all. It might be OK for 
straight fiction. The big stumbling block is that ABBYY often segments 
the page incorrectly or orders the segments incorrectly. A classic 
example often comes up in the Table of Contents where it may group all 
of the chapter titles into one block and then the page numbers into 
another block. When this is saved as plain text for proofing, the result 
is a run of chapter titles followed by a run of page numbers, when what 
we really want is each page number on the same line as its chapter 
title. Not much fun for the proofers to clean up. ABBYY 10 does much 
better than the previous version I was 
using, but still sometimes gets things wrong. Getting blocks of text in 
the wrong order, as can sometimes happen when there are multiple 
illustrations on a page dividing the text up into separate blocks, is 
equally bad. Another common OCR error is missing the last word of a 
paragraph when it appears by itself on a line.
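
To make the Table of Contents problem concrete, here is a minimal Python sketch of the kind of cleanup a proofer ends up doing by hand. It is not part of any DP tool or of ABBYY; the function name `rejoin_toc` and the rule "a page number is a line of digits only" are assumptions made for illustration. It only handles the simple case where every title has exactly one page number, and returns None otherwise so the page can be checked by a person.

```python
import re

# Assumption: a page number is a line consisting only of digits.
PAGE_NUMBER = re.compile(r"^\d+$")

def rejoin_toc(lines):
    """Pair a run of chapter titles with the run of page numbers that follows it.

    Returns the rejoined lines, or None when the two runs cannot be
    matched one-to-one and the page needs manual attention.
    """
    lines = [ln.strip() for ln in lines if ln.strip()]

    # Walk back from the end to find where the trailing run of numbers starts.
    split = len(lines)
    while split > 0 and PAGE_NUMBER.match(lines[split - 1]):
        split -= 1

    titles, numbers = lines[:split], lines[split:]
    if not titles or len(titles) != len(numbers):
        return None
    return [f"{title} {number}" for title, number in zip(titles, numbers)]

ocr_output = ["The Arrival", "A Storm", "Home Again", "1", "17", "42"]
print(rejoin_toc(ocr_output))
```

A heuristic like this breaks as soon as a title wraps onto two lines or a chapter has no page number, which is why checking the segmentation by eye remains the safer route.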

When I scan a book, I keep an eye out for any pages having anything 
other than a single solid block of text. If the book has any, I'll then 
go through page by page to make sure that the OCR got the text block 
segmentation and order correct. I often end up redrawing the text 
blocks, sometimes re-ordering them, and then running the OCR a second 
time on that page. I would not trust a "batch" or "remote" OCR program 
to do this correctly. Despite assertions to the contrary, the content 
providers at DP do go to some considerable lengths to make things easier 
for the other volunteers.
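
The "anything other than a single solid block of text" check can be sketched in a few lines of Python. This assumes the OCR program can export, for each page, the list of text blocks it found; the data layout used here (a dict mapping page name to a list of bounding boxes) and the function name `pages_needing_review` are hypothetical, not any real engine's export format.

```python
def pages_needing_review(blocks_by_page):
    """Return the pages whose segmentation is not one single text block.

    blocks_by_page maps a page name to a list of (left, top, right, bottom)
    boxes, in the order the OCR engine would output them (assumed format).
    """
    return sorted(
        page for page, blocks in blocks_by_page.items() if len(blocks) != 1
    )

# Hypothetical export: page 002 has two blocks, page 003 has none at all.
book = {
    "001.png": [(100, 120, 1400, 2200)],
    "002.png": [(100, 120, 900, 2200), (950, 120, 1400, 2200)],
    "003.png": [],
}
print(pages_needing_review(book))
```

This only produces the list of pages to look at. Deciding whether the blocks are drawn and ordered correctly still takes a person looking at the page image.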

There are other problems with providing a central OCR service, which 
include expense, processing load, etc. But to my mind, the definitive 
problem is what I outlined above. Without an interactive capability, OCR 
results are often not good enough for books of any complexity. And 
before someone says "so make the OCR engine on the server be 
interactive", let me say that communication and processing costs would 
be prohibitively expensive, and further, the OCR engines that are sold 
for that kind of multi-user production environment don't (as far as I 
know) make that kind of interaction easy to accomplish.

JulietS

Juliet Sutherland