Re: precisely why i called the bluff

jim said:
As one who has actually worked professionally on a number of different recognition systems, I respectfully disagree with your prediction of the future. No OCR system will ever be "error free."
i didn't say the o.c.r. would be error-free, jim... i said their _text_ would eventually be error-free. as i have proven here, exhaustively, there are _plenty_ of post-o.c.r. fixes that are programmatically applied, first and foremost among these many measures being comparison of separate digitizations of the same book. (and google sometimes has half-a-dozen digitizations.) if i had their corpus, i could create hundreds of fixes... plus i expect google has invented a good number more of these effective techniques than i could ever dream of.
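To make the "comparison of separate digitizations" idea concrete, here is a minimal sketch in Python. It is an illustration only, not the tooling discussed in this thread: the function name `flag_disagreements` and the sample strings are hypothetical, and it uses the standard-library `difflib` module to align two OCR passes word by word and list the places where they disagree, which is where a human or a later rule would look first.

```python
import difflib

def flag_disagreements(text_a, text_b):
    """Return (words_from_a, words_from_b) pairs where two OCR passes differ.

    Hypothetical helper for illustration; assumes both texts come from
    independent digitizations of the same page.
    """
    words_a = text_a.split()
    words_b = text_b.split()
    # Align the two word sequences; autojunk is disabled so that frequent
    # words are not silently ignored on long pages.
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    suspects = []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag != "equal":
            # Any non-matching span is a candidate OCR error in one scan.
            suspects.append((" ".join(words_a[a0:a1]), " ".join(words_b[b0:b1])))
    return suspects

# Made-up sample data standing in for two scans of the same line.
scan_1 = "the quick brown fox jumps over the lazy dog"
scan_2 = "the quick brovvn fox jumps over the 1azy dog"
suspects = flag_disagreements(scan_1, scan_2)
print(suspects)
```

With only two digitizations this merely flags the disagreements; it cannot say which scan is right. With three or more, one plausible extension (again an assumption, not something described here) would be to take a majority vote at each flagged position.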
"Demos" often remain "demos" forever, because turning "demos" into "real world products accepted by real world users" is so danged hard.
well, yes, sometimes that charge has merit... and sometimes it's nothing but a cheap shot. which is it in this case? well, jim, you might have wanted to check the subject-header here. check out the rest of this reply, to benjamin...
***
benjamin said:
And would your lordship stoop to provide
cut the cutesy crap, ok? :+) i've done far more than enough grunt work here to prove that i don't consider myself as anything more than a grunt. no matter how loudly i yell... the only people i am superior to here are the stupid, who refuse to learn, who stick their head in the sand. (unfortunately, there are quite a few of 'em here, but the good news is most have put me in their kill-files, so they don't know i told the truth about them again.)
And would your lordship stoop to provide the location (as in URI) of these demos to a relative newcomer with the best of intentions?
here's the standard set of things that i point to:
http://z-m-l.com/go/myant/myantp123.html http://z-m-l.com/go/mabie/mabiep123.html http://z-m-l.com/go/sgfhb/sgfhbp123.html
these are all page 123 from respective books, nestled in the set of all pages from each book, shown with text on one side, scan on the other. here are the master files that generated the sets:
http://z-m-l.com/go/myant/myant.zml http://z-m-l.com/go/mabie/mabie.zml http://z-m-l.com/go/sgfhb/sgfhb.zml
each of the pages in the set contains a form at the bottom where errors can be reported... go ahead and make a sample report on a page; you'll find the report is appended to that page, and also collected on a separate page with all the reports that have been made for that book. as i said, though, error-correction is quite easy. the hard part -- or so people here seem to think -- is turning the text into a respectable e-book, but i've got that base covered quite thoroughly as well, as you see if you go to the main page at z-m-l.com and follow the link for examples. text-to-html, and the next step (.mobi and .epub) follows from there...
I always thought that that project was for independent producers producing ebooks on their own for PG, as opposed to the general public proposing fixes to PG volunteers.
well, that whole sentence shows you're very confused, but that's to be expected in "a relative newcomer", eh?
i have suggested all kinds of methodologies, ranging from independent producers to collaborative methods (e.g., d.p.) to encouraging and using feedback from the general public.
but slicing up the world that way isn't particularly productive.
the tools i've created can be used by independent producers, or in collaborative workflows, or after-the-fact by the public. they can exist offline or online, and the behavior is the same. the tools don't know (or care) how the humans split the tasks.
if you want more demos, i've got a ton. i'm also willing to program new ones, quite specific to exactly what you want, providing you agree to host them online for people to use... same offer goes for you, jim. note that i'm calling your bluff.
-bowerbird

Well. On Aug 31, 2011, at 10:38 PM, Bowerbird@aol.com wrote:
jim said:
As one who has actually worked professionally on a number of different recognition systems, I respectfully disagree with your prediction of the future. No OCR system will ever be "error free."
i didn't say the o.c.r. would be error-free, jim...
i said their _text_ would eventually be error-free.
as i have proven here, exhaustively, there are _plenty_ of post-o.c.r. fixes that are programmatically applied, first and foremost among these many measures being comparison of separate digitizations of the same book. (and google sometimes has half-a-dozen digitizations.)
if i had their corpus, i could create hundreds of fixes...
plus i expect google has invented a good number more of these effective techniques than i could ever dream of.
"Demos" often remain "demos" forever, because turning "demos" into "real world products accepted by real world users" is so danged hard.
well, yes, sometimes that charge has merit... and sometimes it's nothing but a cheap shot. which is it in this case? well, jim, you might have wanted to check the subject-header here. check out the rest of this reply, to benjamin...
***
benjamin said:
And would your lordship stoop to provide
cut the cutesy crap, ok? :+)
i've done far more than enough grunt work here to prove that i don't consider myself as anything more than a grunt. no matter how loudly i yell…
I use “your lordship” frequently, often to refer to persons whose actual lordship I totally do not respect. So I did not mean by it what you thought I meant. That only means that it’s a stupid place to use it given that that context is missing, but I digress (and so do you).
the only people i am superior to here are the stupid, who refuse to learn, who stick their head in the sand.
“Their” is plural; “head” is singular. :P / :-) (File this under “pointing out insignificant inconsistencies for the sake of pointing them out.” Also under “digress.”)
(unfortunately, there are quite a few of 'em here, but the good news is most have put me in their kill-files, so they don't know i told the truth about them again.)
And would your lordship stoop to provide the location (as in URI) of these demos to a relative newcomer with the best of intentions?
here's the standard set of things that i point to:
http://z-m-l.com/go/myant/myantp123.html http://z-m-l.com/go/mabie/mabiep123.html http://z-m-l.com/go/sgfhb/sgfhbp123.html
these are all page 123 from respective books, nestled in the set of all pages from each book, shown with text on one side, scan on the other.
here are the master files that generated the sets:
http://z-m-l.com/go/myant/myant.zml http://z-m-l.com/go/mabie/mabie.zml http://z-m-l.com/go/sgfhb/sgfhb.zml
each of the pages in the set contains a form at the bottom where errors can be reported... go ahead and make a sample report on a page; you'll find the report is appended to that page, and also collected on a separate page with all the reports that have been made for that book.
as i said, though, error-correction is quite easy.
the hard part -- or so people here seem to think -- is turning the text into a respectable e-book, but i've got that base covered quite thoroughly as well, as you see if you go to the main page at z-m-l.com and follow the link for examples. text-to-html, and the next step (.mobi and .epub) follows from there...
I always thought that that project was for independent producers producing ebooks on their own for PG, as opposed to the general public proposing fixes to PG volunteers.
well, that whole sentence shows you're very confused, but that's to be expected in "a relative newcomer", eh?
I am more than merely “very” confused. I am _totally_ confused. I am confused in the extreme. (I even question whether I am quite conscious of my own confusion, so confused am I.)
i have suggested all kinds of methodologies, ranging from independent producers to collaborative methods (e.g., d.p.) to encouraging and using feedback from the general public.
but slicing up the world that way isn't particularly productive.
the tools i've created can be used by independent producers, or in collaborative workflows, or after-the-fact by the public. they can exist offline or online, and the behavior is the same. the tools don't know (or care) how the humans split the tasks.
Up until now I only knew of one tool, and that is the one that was being discussed on this list back in February, of which new demos were being uploaded, and of which minor criticisms were being noted, and I never knew that any such demo allowed feedback “after-the-fact by the public.” I got the idea that it was more of an ebook producer with a text, wishing to convert it into various formats. Did I mention that I am (or at least certainly was) confused? Now I am off to see those demos.
-- b

i said their _text_ would eventually be error-free.
This is silly word gaming. Any programmatic "fixes" that can be applied without human input can be considered effectively part of the OCR algorithm.
After having looked at these examples, I stand by what I said earlier about the difficulty of moving from demoware to a product that "real world customers" actually accept. Insulting the customers is indicative of the problem, not the solution.
participants (3)
- Benjamin Klein
- Bowerbird@aol.com
- Jim Adcock