
On 27 October 2011 18:13, <Bowerbird@aol.com> wrote: [snip nod + uhuh]
The OCR project I'm loosely affiliated with, Tesseract, is also a Google product. I was aware of that, and I was told a few things about it, but as I can't remember which of them I was asked not to repeat, I'd just prefer to stay on the safe side and not mention anything about it :)
i'll save you the trouble. tesseract is a piece of crap. :+)
I'll concede that it has had its problems, but it's still the only OCR software in existence that comes with all the tools required to support an entirely new language, which is where my interest began. (They might be available for FineReader, but the limb and/or offspring price is somewhat out of my range.)
but its lousy performance will mean that google has to do a lot of very hard work to write the routines that can correct the lousy output turned out by that piece of crap.
Not quite; they did quite a lot of work to bring the engine itself up to modern standards, and have been relatively successful: apparently, someone somewhere did a test and found that Tesseract outperforms FineReader Mobile, so we've had ABBYY employees on our mailing list trying to convince people otherwise.
which will turn out to be a _great_ thing in the long run.
but hey, i would _love_ an open-source o.c.r. program. love love love... so if you can turn it into a worthy app, i'm sure a lot of people like me will heap praise on you.
I'm not an 'app' guy, I'm a language technology guy. The best I can do at visual design is to discern when something is ugly, and my attempts generally fall into that category. So I'll stick to what I'm good at.
ok, yes, i admit it, i haven't actually evaluated tesseract in the last year or so, so maybe it improved immensely. _maybe_. but i really doubt it. it needed a lot of work.
but hey, let me know if i'm wrong, ok?, and i'll review it.
I'd say hold off for a while. It got a huge shot in the arm when the Android guys took notice (the handful of people who usually work on it in the Google Books building were mostly occupied with adding new language support), but since then Google Docs and Google Image Search have integrated it, so there should be a definite improvement in the next version (which, I think, is due to be released around Thanksgiving? I'm not American, so I only have a vague idea of when that is :)
Now that I didn't know. Seems counterproductive.
if my enemy buys my friend, my friend is no longer my friend.
The XML data is quite poor for that purpose, as it lacks word coordinates, so I chose not to mention it. (I think they made public the scripts they use to convert the data into a usable form for that purpose, but I'm not sure.)
hmm... i guess i'll have to take another look at that. as soon as i work up the nerve; i'm allergic to x.m.l.
Makes you break out in angle-bracket-shaped hives, eh? There's a script out there somewhere to convert it into the djvused format; I'll see if I can find it.
-- 
<Sefam> Are any of the mentors around?
<jimregan> yes, they're the ones trolling you