
On Thursday 16 March 2006 03:53 pm, Michael Hart wrote: (a lot of things, but I wanted to keep the thread separated. Hi, Michael!)
On Thu, 16 Mar 2006, Holden wrote:
There are even a few Free Software / Open Source OCR systems which would gladly welcome any interested developers:
Ocrad: http://www.gnu.org/software/ocrad/ocrad.html
GOCR/JOCR: http://jocr.sourceforge.net/
ClaraOCR: http://www.geocities.com/claraocr/
I'm not a researcher in the field, but I have mucked in on ocrad (which is a single-developer project) and managed to get two minor patches accepted. Frankly, though, each of those packages uses a very different approach and native internal format, and most rely on simpler models to recognize characters. ocrad depends almost exclusively on feature recognition in black and white and has a very simplistic confidence model. I don't recall the details of the others off the top of my head, but I believe one of them was trying to use feature recognition plus same-page similarity modeling. I don't believe any of them use the "dictionary trick", and they all pretty much fail on merged and broken characters.

From black-box observation, FR seems to start with feature recognition and then uses similarity, curve reconstruction, adaptive thresholding, and even outline tracing for comparison against TTF font curves. I suspect they may also be using digraph and trigraph frequencies (at least for English) to improve their confidence scores. They probably also compare same-page word shapes to resolve cases where a character in a bounded word has a low confidence value.

At any rate, you'd have to be pretty damned dedicated and/or already fairly knowledgeable in several disciplines to contribute significantly to these projects. IMO, the single biggest improvement anyone could offer one of these open source projects is a better way to bound broken and merged characters. Feature recognition does a fairly good job up to that point.
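
To make the bounding problem concrete, here is a rough sketch of the naive approach to splitting merged characters: count the ink in each pixel column of a blob and propose a cut wherever the ink gets thin. This is not taken from ocrad, GOCR, ClaraOCR, or FR. The function names, the max_ink and margin parameters, and the toy bitmap are all made up for illustration. It only handles the easy case of two glyphs joined by a thin bridge.

```python
def parse(rows):
    """Turn strings of '0'/'1' into a 2D list of ints (1 = ink)."""
    return [[int(ch) for ch in row] for row in rows]


def column_ink(bitmap):
    """Count the ink pixels in each column of the bitmap."""
    return [sum(col) for col in zip(*bitmap)]


def split_candidates(bitmap, max_ink=1, margin=1):
    """Return column indices where a merged blob might be cut.

    A column is "thin" if it holds at most max_ink ink pixels.
    Each run of adjacent thin columns yields one cut at its middle.
    Columns within `margin` of either edge are never considered.
    Both thresholds are illustrative guesses, not tuned values.
    """
    ink = column_ink(bitmap)
    width = len(ink)
    cuts = []
    run_start = None
    for x in range(margin, width - margin):
        thin = ink[x] <= max_ink
        if thin and run_start is None:
            run_start = x                           # a thin run begins
        elif not thin and run_start is not None:
            cuts.append((run_start + x - 1) // 2)   # middle of the run
            run_start = None
    if run_start is not None:
        # The run reached the right-hand margin without closing.
        cuts.append((run_start + width - margin - 1) // 2)
    return cuts


# Two solid 3-pixel-wide blocks joined by a one-pixel-high bridge.
merged = parse([
    "111000111",
    "111000111",
    "111111111",
    "111000111",
    "111000111",
])

# A solid block with no thin column anywhere.
solid = parse(["11111"] * 5)

print(split_candidates(merged))
print(split_candidates(solid))
```

A column count like this has no way to tell a thin bridge between two letters from a thin stroke inside one letter (the bar of an "H", say), and it does nothing at all for broken characters. A real fix would have to try several candidate cuts and let the recognizer's confidence pick among them.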