improve scanning on color docs?

I'd love comments from anyone who knows about this:

"Improve OCR Accuracy on Color Documents. This white paper confirms that industry-standard practices to clean color document images can be improved to produce higher OCR accuracy. Image Detergent™ from Accusoft Pegasus improves OCR accuracy by 5-10% more than a standard Smoothing filter. This white paper leads the reader through the testing that proves it."

Looking at the details of their paper, they seem to be dealing with simple "modern" digitizations of simple "modern" documents, which ought to be duck-soup for any modern OCR -- except that they deliberately corrupt the image by applying a very lossy jpeg compression to the digitization and then set the binary threshold "wrong", so that the resulting characters lose important parts.

I suggest that instead of buying their software you just don't do that! Do not store your digitizations as jpeg; use a lossless format such as png. If your OCR requires binary images, spend some time playing with thresholding in software like Photoshop; otherwise send the OCR a grayscale digitization to begin with and let the OCR pick its own levels. A little Unsharp Masking can go a long way too, as does setting your dpi appropriately in the first place: 300 dpi for 12pt text "equals" 600 dpi for 6pt text, since both put the same number of pixels on each character.

Playing around a bit to figure out what works best can easily change your error rates by +/- 20%, which is a lot more than this software claims!

http://www.accusoft.com/Improve_OCR_Accuracy_on_Color_Documents.pdf
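To make the dpi rule of thumb above concrete, here is a small sketch of the arithmetic. It assumes the usual desktop-publishing point of 1/72 inch; the function names and the 300 dpi / 12pt reference values are only illustrative, taken from the example in the message rather than from any OCR package.

```python
def pixels_per_em(point_size, dpi):
    """Approximate height in pixels of the type body at a given scan resolution.

    Assumes 1 point = 1/72 inch (the desktop-publishing point).
    """
    return point_size * dpi / 72


def equivalent_dpi(point_size, ref_dpi=300, ref_point_size=12):
    """Resolution that gives `point_size` text the same pixel height
    that `ref_point_size` text has at `ref_dpi`."""
    if point_size <= 0:
        raise ValueError("point_size must be positive")
    return ref_dpi * ref_point_size / point_size


if __name__ == "__main__":
    for size in (12, 9, 6):
        dpi = equivalent_dpi(size)
        print(size, "pt ->", dpi, "dpi,", pixels_per_em(size, dpi), "px per em")
```

The point is only that halving the type size needs double the resolution to keep the same pixels per character; what resolution a given OCR engine actually prefers is something to find by experiment.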

I believe this is mostly gobbledygook. As Jim pointed out, the example is contrived. JPG artefacts are not, at least in my experience, the worst type of problem OCR systems have to contend with.

I plunked the image -- both before and after -- into Photoshop to see what's what. Simply extracting the blue channel from the sample yielded much improved results. After that, a high-pass filter yielded pretty well perfect results.

I do not believe that automated optimization of colour images for OCR is particularly interesting. There is likely the odd colour image out there, but I think they would be things that you would handle as a one-off, as Jim suggested.

I have had occasion to tinker with colour image manipulation in an effort to improve OCR. The thing with the use-case I had is that I was able to construct a customised series of filter steps to get what I needed, and then apply those to all the pages at once. I tinkered a lot with this one:

http://www.archive.org/stream/waunangeeormassa00rich#page/49/mode/1up

before attempting to clean up the rest of the pages. (BB, you should look at the IA OCR results on *that* page ;-))

What I found was that the foxing shows up most strongly in the blue channel, and if you select the red channel, you can get rid of it almost entirely. Next make a histogram of the pixel values. You expect a somewhat bi-modal distribution. A percentile cut is what I wound up with to pick the cutoff between black and white.

I think that a filter that performed principal component analysis on colourspace, used a statistical analysis of the results to find the colourspace axis that gives a pixel distribution resembling text, and then applied a little high-pass filtering would do the job on what accusoft is going for. I also think the machine vision folks are way out in front.

On 17-May-2010 19:44, Jim Adcock wrote:
http://www.accusoft.com/Improve_OCR_Accuracy_on_Color_Documents.pdf
============================================================
Gardner Buchanan <gbuchana@teksavvy.com>
Ottawa, ON
FreeBSD: Where you want to go. Today.
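For anyone who wants to try the channel-plus-percentile-cut approach described in the message above, here is a rough sketch in plain Python. It is not the actual filter chain used on those pages: the function names, the toy pixel values and the 20% ink fraction are all made up for illustration, and a real image would be loaded with an imaging library rather than typed in as a list of (r, g, b) tuples.

```python
def extract_channel(pixels, channel):
    """Return one colour channel from a list of (r, g, b) tuples."""
    index = {"red": 0, "green": 1, "blue": 2}[channel]
    return [p[index] for p in pixels]


def histogram(values, levels=256):
    """Count how many pixels fall on each grey level 0..levels-1."""
    counts = [0] * levels
    for v in values:
        counts[v] += 1
    return counts


def percentile_cutoff(values, dark_fraction):
    """Lowest grey level at or below which at least `dark_fraction`
    of the pixels lie (a percentile cut on the histogram)."""
    target = dark_fraction * len(values)
    running = 0
    for level, count in enumerate(histogram(values)):
        running += count
        if running >= target:
            return level
    return 255


def binarize(values, cutoff):
    """Map pixels at or below the cutoff to black (0), the rest to white (255)."""
    return [0 if v <= cutoff else 255 for v in values]


if __name__ == "__main__":
    # Toy page: 2 ink pixels, 6 paper pixels, 2 brownish "foxing" pixels.
    page = [(30, 30, 30)] * 2 + [(230, 220, 200)] * 6 + [(200, 150, 90)] * 2
    red = extract_channel(page, "red")    # foxing sits close to paper here
    blue = extract_channel(page, "blue")  # foxing is much darker here
    cutoff = percentile_cutoff(red, 0.2)  # assume about 20% of pixels are ink
    print(red, blue, cutoff, binarize(red, cutoff))
```

In the toy data the foxing pixels differ from paper by only 30 levels in red but by 110 in blue, which is the effect described above. The right percentile depends on how much of the page is ink, so it has to be tuned per book.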
participants (3): Gardner Buchanan, Jim Adcock, Kimo Crossman