
http://www.dancohen.org/blog/posts/no_computer_left_behind said:
Google researchers have demonstrated (but not yet released to the general public) a powerful method for creating 'good enough' translations—not by understanding the grammar of each passage, but by rapidly scanning and comparing similar phrases on countless electronic documents in the original and second languages. Given large enough volumes of words in a variety of languages, machine processing can find parallel phrases and reduce any document into a series of word swaps. Where once it seemed necessary to have a human being aid in a computer's translating skills, or to teach that machine the basics of language, swift algorithms functioning on unimaginably large amounts of text suffice. Are such new computer translations as good as a skilled, bilingual human being? Of course not. Are they good enough to get the gist of a text? Absolutely. So good the National Security Agency and the Central Intelligence Agency
increasingly rely on that kind of technology to scan, sort, and mine gargantuan amounts of text and communications (whether or not the rest of us like it).
sounds like something you might find interesting, michael. of course, a "good enough" translation probably wouldn't be, not for literature, where the realm of creativity is instantiated, but could it work as a "first pass" that would do the bulk of the "heavy lifting", so a person knowledgeable in both languages could come in and spend relatively little time smoothing it out? well, it's certainly possible, i would think. and maybe probable. especially if progress on the technique proves to be forthcoming... -bowerbird
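The "series of word swaps" idea in the quoted post can be sketched as a greedy phrase-table lookup. This is a toy illustration only: the phrase table below is invented, and real statistical systems also weigh probabilities, reordering, and language models rather than doing literal swaps.

```python
# Toy sketch of phrase-based "word swap" translation: greedily match
# the longest known phrase from a table mined from parallel texts.
# The tiny English->German table below is invented for illustration.
PHRASE_TABLE = {
    ("good", "morning"): ["guten", "Morgen"],
    ("thank", "you"): ["danke"],
    ("morning",): ["Morgen"],
    ("good",): ["gut"],
}

def translate(words):
    out, i = [], 0
    while i < len(words):
        # Try the longest phrase starting at position i, then shorter ones.
        for n in range(len(words) - i, 0, -1):
            phrase = tuple(words[i:i + n])
            if phrase in PHRASE_TABLE:
                out.extend(PHRASE_TABLE[phrase])
                i += n
                break
        else:
            out.append(words[i])  # unknown word: pass through untranslated
            i += 1
    return out

print(translate(["good", "morning"]))            # ['guten', 'Morgen']
print(translate(["thank", "you", "good", "sir"]))  # ['danke', 'gut', 'sir']
```

Matching the longest phrase first is what lets such systems pick up idioms ("thank you" as a unit) instead of translating word by word.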

Hi There,

Let me chime in here. Yes, you can use these tools as a start and for casual use, but otherwise you can forget them as a professional tool.

- Due to the statistical model you get only 80-90% accuracy.
- I see a lot of sites on which the content differs between languages, so no comparison is possible.
- I have worked on and helped develop such tools and know that they give interesting results, in the range above. Yet, these methods are only good as an analysis tool.
- A system with fairly decent grammar models and lexicons gives better results using fewer resources. The problem here is that these are not publicly available.

The actual problem with translation is the so-called extra-linguistic part!! Culture-related facts, context, register, etc. To get the last 5% for a decent translation, the effort and resources rise exponentially. As proof: in the 80s the Japanese said they would have real-time translation for telephones on the market in 5 years. This was, and is, vaporware.

The method is not new. It was used successfully for weather reports already in the 80s. The method works only for small areas of knowledge/language. In the 70s word-for-word used to be good enough. Now they have something they say is "good enough".

Two Euro cents worth, Keith.

On 09.03.2006 at 09:41, Bowerbird@aol.com wrote:
-bowerbird

_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/listinfo.cgi/gutvol-d

Yes, this is part of what I have been talking about for a few years. Once OCR gets to an acceptable level for all, the next big thing, the killer app, so to speak, will be MT [Machine Translation], which will convert the 10 million eBooks that will be available into 100 different languages, for a billion free online eBooks.

mh

On Thu, 9 Mar 2006, Bowerbird@aol.com wrote:

On Thu, 9 Mar 2006 20:58:14 -0800 (PST), Michael Hart <hart@pglaf.org> wrote:

| Yes, this is part of what I have been talking about for a few years.

And will be *talked about* for decades/centuries to come.

| Once OCR gets to an acceptable level for all, the next big thing,
| the killer app, so to speak, will be MT [Machine Translation] which
| will convert the 10 million eBooks that will be available into 100
| different languages, for a billion free online eBooks.

In your dreams! Only a *tiny* fraction of the *human* population can produce acceptable translations ATM. Machines will have to become more "intelligent" than 99% of the population before MT becomes a reality. Machines can now compete on equal terms with an earwig.

--
Dave Fawthrop <dave hyphenologist co uk>
"Intelligent Design?" my knees say *not*. "Intelligent Design?" my back says *not*. More like "Incompetent design". Sig (C) Copyright Public Domain

On 10.03.2006 at 05:58, Michael Hart wrote:
Yes, this is part of what I have been talking about for a few years.
Once OCR gets to an acceptable level for all, the next big thing, the killer app, so to speak, will be MT [Machine Translation] which will convert the 10 million eBooks that will be available into 100 different languages, for a billion free online eBooks.
Not in the next 100 or so years.

In the 80s there were OCR systems that could (had to) be trained. They would give you 95 to 99% accuracy. But in order to get these results you would train the system a long time, and this training could basically be used on just one text. Today, dictionaries are used to guess which words are to be recognised. That is why the OCR systems today give us better results if the original has DECENT quality!!! The pattern recognition systems have not gotten better, and the dictionary trick takes away the motivation to develop better OCR algorithms.

Interesting, also: I had an Apple Newton and it recognized my handwriting with 98-99% accuracy. Yet most OCR systems today will fail!! They cannot be trained! I still have to find a system today with similar performance. So much for technological breakthroughs.

Keith.
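The dictionary-based guessing described above can be sketched roughly like this: snap each raw recognised token to the nearest dictionary word, if one is close enough. A minimal illustration of the idea only; real OCR engines use far more elaborate matching and language models, and the word list here is invented.

```python
from difflib import get_close_matches

# Minimal sketch of dictionary-based OCR post-correction: replace each
# raw recognised token with its closest dictionary entry, if any entry
# is similar enough. The word list is a stand-in for a real lexicon.
DICTIONARY = ["recognise", "systems", "results", "quality", "better"]

def correct(raw_tokens, cutoff=0.8):
    fixed = []
    for tok in raw_tokens:
        match = get_close_matches(tok.lower(), DICTIONARY, n=1, cutoff=cutoff)
        fixed.append(match[0] if match else tok)
    return fixed

# A near-miss like "resu1ts" (digit 1 for letter l) snaps to "results";
# a token with no close dictionary entry is left alone.
print(correct(["resu1ts", "qualitv", "xyzzy"]))
```

This also shows why the trick cuts both ways: it rescues near-misses on clean text, but it will just as happily "correct" a genuinely unusual word into a dictionary entry.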

On Fri, 2006-03-10 at 10:32 +0100, Keith J. Schultz wrote:
text. Today, dictionaries are used to guess which words are to be recognised. That is why the OCR systems today give us better results if the original has DECENT quality!!!
The pattern recognition systems have not gotten better and the dictionary trick takes the motivation away to develop better OCR algorithms.
I'm going to have to call bullshit here. As a researcher working in the field of document recognition, I've noticed tremendous improvements in OCR quality even just in the past five years.

The fact is, OCR and document recognition as a whole is a field of tremendous ongoing research. It's no secret that the problem of OCR is not "solved" yet, but for some types of document (particularly clean ones using Latin characters), results are already damn good. In other areas, particularly regarding degraded documents, results aren't as good but are steadily improving.

You state that the so-called "dictionary trick" takes away all motivation to research in the field. This is not what I observe going on in the research community. Dictionary-based lookups are one tool in the arsenal, but that's something that's well understood. Some of my colleagues are currently researching novel image processing and feature extraction techniques with the goal of improving raw OCR results.

OCR is improving. We're working on it.

Cheers, Holden

Hello,

On 10.03.2006 at 11:24, Holden McGroin wrote:
On Fri, 2006-03-10 at 10:32 +0100, Keith J. Schultz wrote:
text. Today, dictionaries are used to guess which words are to be recognised. That is why the OCR systems today give us better results if the original has DECENT quality!!!
The pattern recognition systems have not gotten better and the dictionary trick takes the motivation away to develop better OCR algorithms.
I'm going to have to call bullshit here. As a researcher working in the field of document recognition, I've noticed tremendous improvements in OCR quality even just in the past five years.

Before you start to swear, read and understand! Maybe in the development labs, but not for the non-high-end user!!!!
The fact is, OCR and document recognition as a whole is a field of tremendous ongoing research. It's no secret that the problem of OCR is not "solved" yet but for some types of document (particularly clean ones using Latin characters), results are already damn good. In other areas, particularly regarding degraded documents, results aren't as good but are steadily improving.
You state that the so-called "dictionary trick" takes away all motivation to research in the field. This is not what I observe going on in the research community. Dictionary-based lookups are one tool in the arsenal but that's something that's well understood. Some of my colleagues are currently researching novel image processing and feature extraction techniques with the goal of improving raw OCR results.
We have not seen any improvements in the field for the past five years!!! The improvements are mainly due to the use of dictionaries!! Not the improvement of character recognition!! Most systems in the field get their performance out of word recognition !!!
OCR is improving. We're working on it.
I did not mean to say that there is no improvement in Optical Character Recognition, but the improvement over the past 10 years is minimal at most. When I see an OCR system that just uses raw results, then I will bow my head in recognition of true achievement. Furthermore, when the image processing gets that far it will open up new possibilities in all kinds of sciences.

Hi! On Fri, 2006-03-10 at 12:33 +0100, Keith J. Schultz wrote:
Hello,
Am 10.03.2006 um 11:24 schrieb Holden McGroin:
On Fri, 2006-03-10 at 10:32 +0100, Keith J. Schultz wrote:
text. Today, dictionaries are used to guess which words are to be recognised. That is why the OCR systems today give us better results if the original has DECENT quality!!!
The pattern recognition systems have not gotten better and the dictionary trick takes the motivation away to develop better OCR algorithms.
I'm going to have to call bullshit here. As a researcher working in the field of document recognition, I've noticed tremendous improvements in OCR quality even just in the past five years.

Before you start to swear, read and understand! Maybe in the development labs, but not for the non-high-end user!!!!
OCR results are improving across the board. One only has to compare Finereader 8, a mainstream OCR product, with version 5 or so to see the improvement in standard OCR packages over the last 5 years. Recognition quality improves (where there is room for improvement) and so does the range of documents which can be recognised. Each passing year brings improvements in quality for older, noisy and lower quality documents. Again, I stress that this is *real-world* improvement in mainstream OCR products.

In your initial post, you stated that the "dictionary trick" takes away the motivation to develop better OCR algorithms. Yet, it is still an extremely active research subject. Perhaps you're not familiar with the research community around OCR, but there are many major conferences, workshops and journals devoted entirely or mainly to the task of digitising documents.

And of course, where do you think the improvements in mainstream OCR applications come from? Yesterday's innovation in the research lab forms the basis of new features in today's commercial OCR packages. Likewise, the work that's going on now in the lab will improve tomorrow's OCR packages.
We have not seen any improvements in the field for the past five years!!! The improvements are mainly due to the use of dictionaries!! Not the improvement of character recognition!! Most systems in the field get their performance out of word recognition !!!
Well, that's a nice statement to make since the vast majority of systems in the field are black-box commercial systems. How do you know where the performance comes from? I'm a researcher in the field. I attend conferences and read journals and I don't know much about the internals of ABBYY. Unsurprisingly, it's something they keep under close wraps. So all you really have is the fact that commercial (and research) OCR systems are improving and your unfounded assertion that the improvements are mainly due to dictionaries.
I did mean to say not there is no improvement in Optical Character Recognition, but the improvment over the past 10 years is minimal at most. When I see a OCR system that just uses raw results, then I will bow my head in recognition of true achieve meant. Furthermore, when the image processing gets that far it will open up new possiblities in all kinds of sciences.
There are countless tools which can be used to improve OCR performance. Using dictionary lookups is just one tool in the box. OCR is improving using many different techniques. I've been observing improvements in many different areas over the last few years (as long as I've been in the area), including:

- Improvements in low-level image processing techniques
- Improvements in feature extraction from characters
- Improvements in character recognition based on those features

If you don't like dictionary lookups, don't use them. Raw OCR performance is improving in the lab and in the marketplace and is already great for a large proportion of documents. I must apologise on behalf of the research community if you find the rate of progress to be inadequate.

That said, if you don't like it, muck in. There are many research labs around the world working on improving OCR and related techniques and I'm sure they'd be glad to have someone as knowledgeable as yourself join. There are even a few Free Software / Open Source OCR systems which would gladly welcome any interested developers:

Ocrad: http://www.gnu.org/software/ocrad/ocrad.html
GOCR/JOCR: http://jocr.sourceforge.net/
ClaraOCR: http://www.geocities.com/claraocr/

Cheers, Holden

The following messages give widely opposing points of view. The reason could be, as one stated, that the bottom-of-the-line scanner and OCR combinations are not yet good enough, at least for that person's particular needs. My own observation is that it might simply be the wrong tool for the wrong application.

We all see more and more features in calculators that are under $100, and even under $10, to the point where no one is really going to say the TI-84 has no improvements over the previous versions, even if you get it for $60, as the price was where I saw it last. If your applications are simple four-function arithmetic, there isn't much point in comparing any new calculators -- they will all do what you want, and the hardware may be a more important aspect than the software. . . how long the calculator and/or the batteries will last, etc. To those who really need a supercomputer, no difference.

The same is most likely true of scanners and OCR combos -- some improvements may not apply to what YOU are doing and others may be totally beyond any applications you have a mind to be using. The same is true for all those different kinds of cheaper calculators out there.

It sounds a little as if one person in this conversation (I didn't keep track of various portions and names) was an example of the person who says it does not matter at all, because none of them create perfect results. To this kind of person it doesn't matter how full a glass is getting; until that very last drop is added, that glass is not full, it is empty. The exact same thing has been said here and there via the error rate for eBooks: if a certain element of perfection is missing, then eBook value remains zero even though the paper book has errors.

By the way, I saw what appeared to be a perfect scan/OCR, at least 10 years ago, perhaps 15, on the original Apple flatbed scanner.
I forget the model and the OCR, but the demonstration certainly made me wake up to OCR more and I eventually talked Apple into giving me a Mac and scanner. Thanks Apple!!! Thanks Steve Cisler!!!

More to the point about the current topic is what a user wants out of the hardware/software combination. If you don't do your homework when buying these, you are not likely to get what you want. However, and I stress this, the people in these messages are VERY likely, given their positions, to find salesmen and saleswomen who would be MORE than happy to show your people their products and answer questions. Just contact them. . . your report of their demonstration will multiply the effect of their work! This would probably be of great interest to us all. I wonder, the next time we have some kind of meeting -- should we invite some demonstrations???

Michael

PS On the topic of calculators, I heard that even if it is not your thing to use something like Encarta, the current version includes a calculator program that may be worth more than the cost of the entire Encarta. Anyone seen it?

On Thu, 16 Mar 2006, Holden McGroin wrote:

On Thursday 16 March 2006 03:53 pm, Michael Hart wrote: (a lot of things, but I wanted to keep the thread separated. Hi, Michael!)
On Thu, 16 Mar 2006, Holden wrote:
There are even a few Free Software / Open Source OCR systems which would gladly welcome any interested developers:
Ocrad: http://www.gnu.org/software/ocrad/ocrad.html GOCR/JOCR: http://jocr.sourceforge.net/ ClaraOCR: http://www.geocities.com/claraocr/
I'm not a researcher in the field, but I have mucked in on ocrad (which is a single-developer project), and managed to get two minor patches accepted. Frankly, though, each of those packages uses very different approaches and native internal formats, and mostly relies on simpler models to recognize characters. ocrad almost exclusively depends on feature recognition in b/w and has a very simplistic confidence model. The others I don't recall details of off the top of my head, but I believe one of them was trying to use feature recognition plus same-page similarity modeling. I don't believe any of them use the "dictionary trick" and they all pretty much fail on merged and broken characters.

From black-box observation, FineReader seems to start with feature recognition, and uses similarity, curve reconstruction, adaptive thresholding, and even outline tracing for comparison/similarity against TTF font curves. I suspect they may also be using digraph and trigraph frequencies (at least for English) to improve their confidence scorings. Probably they also compare same-page word shapes to resolve cases where a character in a bounded word has a low confidence value.

At any rate, you'd have to be pretty damned dedicated and/or already fairly knowledgeable in several disciplines to contribute significantly to these projects. IMO, the single biggest improvement anyone could offer one of these open source projects is a better way to bound broken and merged characters. Feature recognition does a fairly good job up to that point.
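The digraph-frequency idea mentioned above can be sketched as follows: when the recogniser is torn between two readings of a character, prefer the reading whose letter pairs are more common in the language. A toy illustration only; the frequency table is invented, not real English statistics.

```python
# Sketch of using letter-pair (digraph) frequencies to rank ambiguous
# OCR readings: prefer the candidate word whose adjacent letter pairs
# are more common. The counts below are invented for illustration.
DIGRAPH_FREQ = {"th": 100, "he": 90, "in": 80, "er": 70}

def score(word):
    # Sum the weight of each adjacent letter pair; unseen pairs score 1.
    return sum(DIGRAPH_FREQ.get(word[i:i + 2], 1) for i in range(len(word) - 1))

def best_reading(candidates):
    return max(candidates, key=score)

# Suppose "h" was nearly misread as "b": digraph statistics strongly
# favour the reading "the" over "tbe".
print(best_reading(["the", "tbe"]))  # the
```

Unlike a full dictionary lookup, this kind of statistic never rejects an unknown word outright; it only nudges the confidence scores, which may be why a commercial engine would combine both.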

Hi Holden,

Thank you for your kind and sober reply. I did not intend to offend the OCR developers or say that there is no improvement. Basically, all commercial products use some kind of "voodoo" for better results. That is their perfect right. As researchers know, money is the motor of efficient progress. Companies want the results yesterday and do not care if the improvement in their product is due to "voodoo" or improvement in the fundamental technology.

I have had to study the technology and decide whether to use it or not. I generally do not, as the results I require in my field take up too many resources for most of my goals. There are cheaper ways of getting things done, resource-wise. OCR would be just one tool that I use, and is just the beginning of what I want and need to do. It took me 20 years to own my own scanner, and believe me, I did not get it for OCR. I am still waiting, and willing to wait, for the quality I consider adequate.

Believe me, I would finance OCR researchers to get 99% recognition out of the box if I could. I do know how hard it is to get money for research.

On a side track here: humans do not recognize characters, but words and phrases. That is how we learn to read!!!

regards Keith.

On 16.03.2006 at 16:57, Holden McGroin wrote:

On Fri, 10 Mar 2006 10:32:29 +0100, "Keith J. Schultz" <schultzk@uni-trier.de> wrote:

| text. Today, dictionaries are used to guess which words are
| to be recognised. That is why the OCR systems today give us
| better results if the original has DECENT quality!!!

And get it *wrong* <mumblehategrumble> very often. For my Yorkshire Dialect stuff, which includes "wor" many times, this gets changed into "war" most of the time. To the extent that I use an initial edit to put it right.

--
Dave Fawthrop <dave hyphenologist co uk>
Freedom of Speech, Expression, Religion, and Democracy are the keys to Civilization, together with legal acceptance of Fundamental Human rights.

Hi There,

On 10.03.2006 at 11:37, Dave Fawthrop wrote:
On Fri, 10 Mar 2006 10:32:29 +0100, "Keith J. Schultz" <schultzk@uni-trier.de> wrote:
| text. Today, dictionaries are used to guess which words are | to be recognised. That is why the OCR systems today give us | better results if the original has DECENT quality!!!
And get it *wrong* <mumblehategrumble> very often.

Exactly my point.
For my Yorkshire Dialect stuff, which includes "wor" many times, this gets changed into "war" most of the time. To the extent that I use an initial edit to put it right.

A small tip. Try using a custom dictionary. Or get a system that you can train!
Keith.
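Keith's custom-dictionary tip can be illustrated with a toy corrector (both word lists here are invented): once the dialect form is in the lexicon, the corrector stops "fixing" it into a standard word.

```python
from difflib import get_close_matches

def correct(token, dictionary, cutoff=0.8):
    # Snap a token to the nearest dictionary word, if one is close enough.
    match = get_close_matches(token, dictionary, n=1, cutoff=cutoff)
    return match[0] if match else token

STANDARD = ["war", "word", "work"]          # stand-in for a stock lexicon
CUSTOM = STANDARD + ["wor"]                 # add the Yorkshire dialect form

print(correct("wor", STANDARD))  # miscorrected to a nearby standard word
print(correct("wor", CUSTOM))    # wor -- left alone, exact match wins
```

The dialect problem Dave describes is exactly this: the stock lexicon contains only the standard forms, so the post-correction step sees "wor" as an error to repair.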
participants (6)

- Bowerbird@aol.com
- D Garcia
- Dave Fawthrop
- Holden McGroin
- Keith J. Schultz
- Michael Hart