Re: [gutvol-d] google and the translation thing

17 Mar 2006


      Hi Holden,

	Thank you for your kind and sober reply.
	I did not intend to offend the OCR developers
	or say that there is no improvement.
	Basically, all commercial products use
	some kind of "voodoo" for better results.
	That is their perfect right.

	As a researcher, I know that money is the motor
	of efficient progress. Companies want the
	results yesterday and do not care whether the improvements
	in their product are due to "voodoo" or to improvement
	in the fundamental technology.

	I have had to study the technology and decide whether to use it
	or not. I generally do not, as the results I require in my field
	take up too many resources for most of my goals. There are
	cheaper ways of getting things done, resource-wise.

	OCR would be just one tool that I use and is just the beginning of
	what I want and need to do.

	It took me 20 years to own my own scanner, and believe me I did
	not get it for OCR. Still waiting and willing to wait for the
	quality I consider adequate.

	Believe me, I would finance OCR researchers to get 99% recognition
	out of the box if I could. I do know how hard it is to get money
	for research.

	On a side track here: humans do not recognize characters, but words
	and phrases. That is how we learn to read!!!

		regards
			Keith.

On 16.03.2006 at 16:57, Holden McGroin wrote:
...
Hi!
On Fri, 2006-03-10 at 12:33 +0100, Keith J. Schultz wrote:
...
Hello,
On 10.03.2006 at 11:24, Holden McGroin wrote:
...
On Fri, 2006-03-10 at 10:32 +0100, Keith J. Schultz wrote:
...
text. Today, dictionaries are used to guess which words are
to be recognised. That is why the OCR systems today give us
better results if the original has DECENT quality!!!
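
For readers unfamiliar with the "dictionary trick" discussed in this thread, the following is a minimal sketch of the general idea: raw character-level output is snapped to the closest known word. The word list, the function names and the 0.8 similarity cutoff are all assumptions made for illustration, and nothing here reflects how any particular OCR product works internally.

```python
import difflib

# Hypothetical word list; a real system would load a full lexicon.
DICTIONARY = ["recognition", "character", "optical", "quality", "document"]


def correct_token(token, dictionary=DICTIONARY, cutoff=0.8):
    """Return the closest dictionary word, or the raw token if none is close."""
    lowered = token.lower()
    if lowered in dictionary:
        return token
    # get_close_matches ranks candidates by difflib's similarity ratio.
    matches = difflib.get_close_matches(lowered, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


def correct_line(raw_line):
    """Apply dictionary correction to each whitespace-separated token."""
    return " ".join(correct_token(t) for t in raw_line.split())


if __name__ == "__main__":
    # "0" misread for "o" is a typical character-level confusion.
    print(correct_line("0ptical character recogniti0n"))
```

Note that a token with no close match is passed through unchanged, which is also why this kind of correction helps little when the raw recognition is poor.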
...
The pattern recognition systems have not gotten better and
the dictionary trick takes the motivation away to
develop better OCR algorithms.
I'm going to have to call bullshit here. As a researcher working in
the field of document recognition, I've noticed tremendous
improvements in OCR quality even just in the past five years.
Before you start to swear, read and understand! Maybe in the
development labs, but not for the non-high end user!!!!
OCR results are improving across the board. One only has to compare
Finereader 8, a mainstream OCR product, with version 5 or so to see
the improvement in standard OCR packages over the last 5 years.
Recognition quality improves (where there is room for improvement)
and so does the range of documents which can be recognised. Each
passing year brings improvements in quality for older, noisy and
lower quality documents. Again, I stress that this is *real-world*
improvement in mainstream OCR products.
In your initial post, you stated that the "dictionary trick" takes
away the motivation to develop better OCR algorithms. Yet, it is
still an extremely active research subject. Perhaps you're not
familiar with the research community around OCR but there are many
major conferences, workshops and journals devoted entirely or mainly
to the task of digitising documents.
And of course, where do you think the improvements in mainstream OCR
applications come from? Yesterday's innovation in the research lab
forms the basis of new features in today's commercial OCR packages.
Likewise, the work that's going on now in the lab will improve
tomorrow's OCR packages.
...
We have not seen any improvements in the field for the past five
years!!! The improvements are mainly due to the use of dictionaries!!
Not the improvement of character recognition!! Most systems in the
field get their performance out of word recognition !!!
Well, that's a nice statement to make since the vast majority of
systems in the field are black-box commercial systems. How do you
know where the performance comes from? I'm a researcher in the field.
I attend conferences and read journals and I don't know much about
the internals of ABBYY. Unsurprisingly, it's something they keep
under close wraps.
So all you really have is the fact that commercial (and research) OCR
systems are improving and your unfounded assertion that the
improvements are mainly due to dictionaries.
...
I did not mean to say there is no improvement in Optical
Character Recognition, but the improvement over the past
10 years is minimal at most. When I see an OCR system that
just uses raw results, then I will bow my head in recognition
of true achievement. Furthermore, when the image processing
gets that far it will open up new possibilities in all kinds
of sciences.
There are countless tools which can be used to improve OCR
performance. Using dictionary lookups is just one tool in the box.
OCR is improving using many different techniques. I've been observing
improvements in many different areas over the last few years (as long
as I've been in the area), including:
 - Improvements in low-level image processing techniques
 - Improvements in feature extraction from characters
 - Improvements in character recognition based on those features
If you don't like dictionary lookups, don't use them. Raw OCR
performance is improving in the lab and in the marketplace and is
already great for a large proportion of documents. I must apologise on
behalf of the research community if you find the rate of progress to
be inadequate.
That said, if you don't like it, muck in. There are many research labs
around the world working on improving OCR and related techniques and
I'm sure they'd be glad to have someone as knowledgeable as yourself
join. There are even a few Free Software / Open Source OCR systems
which would gladly welcome any interested developers:
Ocrad:     http://www.gnu.org/software/ocrad/ocrad.html
GOCR/JOCR: http://jocr.sourceforge.net/
ClaraOCR:  http://www.geocities.com/claraocr/
Cheers,
Holden
_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/listinfo.cgi/gutvol-d