
Hi! On Fri, 2006-03-10 at 12:33 +0100, Keith J. Schultz wrote:
Hello,
On 10.03.2006 at 11:24, Holden McGroin wrote:
On Fri, 2006-03-10 at 10:32 +0100, Keith J. Schultz wrote:
text. Today, dictionaries are used to guess which words are to be recognised. That is why the OCR systems today give us better results if the original has DECENT quality!!!
The pattern recognition systems have not gotten better, and the dictionary trick takes away the motivation to develop better OCR algorithms.
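For readers unfamiliar with the "dictionary trick" under discussion, here is a minimal sketch of the general idea: raw OCR output is snapped to the nearest word in a word list. It does not show how any commercial package works. The word list, the function names and the 0.8 similarity cutoff are all assumptions made for illustration.

```python
import difflib

# Hypothetical word list; a real system would load a full lexicon.
LEXICON = ["recognition", "character", "dictionary", "quality", "document"]


def correct_word(raw, lexicon=LEXICON, cutoff=0.8):
    """Return the lexicon word closest to the raw OCR token.

    If no lexicon word is similar enough (ratio below `cutoff`),
    the raw token is returned unchanged.
    """
    matches = difflib.get_close_matches(raw.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else raw


def correct_line(line):
    """Apply the dictionary correction to every whitespace-separated token."""
    return " ".join(correct_word(token) for token in line.split())


if __name__ == "__main__":
    # A misread such as "0" for "o" is repaired only because the
    # intended word happens to be in the lexicon.
    print(correct_line("charactcr recogniti0n"))
```

A correction of this kind can only help when the character recogniser is already close to the right answer, which is why it works best on originals of decent quality.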
I'm going to have to call bullshit here. As a researcher working in the field of document recognition, I've noticed tremendous improvements in OCR quality even just in the past five years.

Before you start to swear, read and understand! Maybe in the development labs, but not for the non-high-end user!!!!
OCR results are improving across the board. One only has to compare Finereader 8, a mainstream OCR product, with version 5 or so to see the improvement in standard OCR packages over the last 5 years. Recognition quality improves (where there is room for improvement) and so does the range of documents which can be recognised. Each passing year brings improvements in quality for older, noisy and lower quality documents. Again, I stress that this is *real-world* improvement in mainstream OCR products.

In your initial post, you stated that the "dictionary trick" takes away the motivation to develop better OCR algorithms. Yet, it is still an extremely active research subject. Perhaps you're not familiar with the research community around OCR but there are many major conferences, workshops and journals devoted entirely or mainly to the task of digitising documents.

And of course, where do you think the improvements in mainstream OCR applications come from? Yesterday's innovation in the research lab forms the basis of new features in today's commercial OCR packages. Likewise, the work that's going on now in the lab will improve tomorrow's OCR packages.
We have not seen any improvements in the field for the past five years!!! The improvements are mainly due to the use of dictionaries!! Not the improvement of character recognition!! Most systems in the field get their performance out of word recognition!!!
Well, that's a nice statement to make since the vast majority of systems in the field are black-box commercial systems. How do you know where the performance comes from? I'm a researcher in the field. I attend conferences and read journals and I don't know much about the internals of ABBYY. Unsurprisingly, it's something they keep under close wraps. So all you really have is the fact that commercial (and research) OCR systems are improving and your unfounded assertion that the improvements are mainly due to dictionaries.
I did not mean to say there is no improvement in Optical Character Recognition, but the improvement over the past 10 years is minimal at most. When I see an OCR system that just uses raw results, then I will bow my head in recognition of true achievement. Furthermore, when the image processing gets that far it will open up new possibilities in all kinds of sciences.
There are countless tools which can be used to improve OCR performance. Using dictionary lookups is just one tool in the box. OCR is improving using many different techniques. I've been observing improvements in many different areas over the last few years (as long as I've been in the area), including:

- Improvements in low-level image processing techniques
- Improvements in feature extraction from characters
- Improvements in character recognition based on those features

If you don't like dictionary lookups, don't use them. Raw OCR performance is improving in the lab and in the marketplace and is already great for a large proportion of documents.

I must apologise on behalf of the research community if you find the rate of progress to be inadequate. That said, if you don't like it, muck in. There are many research labs around the world working on improving OCR and related techniques and I'm sure they'd be glad to have someone as knowledgeable as yourself join. There are even a few Free Software / Open Source OCR systems which would gladly welcome any interested developers:

Ocrad: http://www.gnu.org/software/ocrad/ocrad.html
GOCR/JOCR: http://jocr.sourceforge.net/
ClaraOCR: http://www.geocities.com/claraocr/

Cheers,
Holden