let's look at things a bit more closely

jon said:
Who says I've advocated "zero" errors? It's a nice goal, but not possible without a lot of eyeballs, some auto-processing, and some luck. But we can get the error-rate down to very low levels.
gee, jon, i thought that was _my_ position! i thought that was _everybody's_ position! you're the one complaining about _errors_! so let's stop talking in vague generalities and get down to some brass tacks, alright?
i reference jim tinsley, who recently said he suspects the average number of errors in p.g. e-texts released these days is _50_. since jim is the person who fields all of the error-reports, and a major whitewasher too, i'd think his estimate is better than anyone's. if we figure that a book has some 100-200 pages, that's one error every 2-4 pages. now of course, some of the errors are slight, barely noticeable.
that number should be lower, but the judgment of the people who are actually _doing_ the e-texts is that they would rather move on to the next one than spend more time driving down the error-rate. they are volunteers, and set their own priorities. and given that the processes are fairly thorough at present, significantly lowering the error-rate would probably require a substantial investment of volunteer time and energy, so i'm not sure that it would prove its worth on a cost-benefit basis, even if the volunteers _were_ willing to give it. (of course, the value of better tools here should not be neglected. but i feel i've made that point.)
so, the question is: how do we approach perfection? tell me, jon, do you have a solution for the problem? my answer is a system of "continuous proofreading", where we involve the end-users in the advancement by incorporating their efforts as valued contributors. and i've laid out an example showing how to do that: http://users.aol.com/bowerbird/proof_wiki.html
this technological part was fairly easy. what is hard is stimulating end-users out of their apathy of inertia.
what i'd do is release books initially only like this -- page-by-page within our error-reporting framework, with text on one side of the screen and scan on the other -- until the general public has found a few dozen errors (or until "enough" people certify each page error-free), and only _then_ release the e-text in the usual manner... if the only way people can read the e-text originally is on a website that actively encourages and _facilitates_ the reporting of errors, and where they're told explicitly that errors might still be present, so are to be expected, then i believe that they will join in the task with gusto... so i've actually designed a plan to move to perfect e-texts. meanwhile, you are soliciting for proofreader volunteers...
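a minimal sketch of the release rule described above, for illustration only. the thresholds and every name in it (`Book`, `Page`, `ready_for_general_release`) are hypothetical; the post only says "a few dozen errors" or "enough" certifications per page, and this is not the code behind the proof_wiki example.

```python
from dataclasses import dataclass, field

# hypothetical thresholds: the post gives no exact numbers
MIN_ERRORS_FOUND = 24              # "a few dozen errors"
MIN_CERTIFICATIONS_PER_PAGE = 3    # "enough" people per page


@dataclass
class Page:
    number: int
    certified_by: set = field(default_factory=set)  # readers who called this page error-free


@dataclass
class Book:
    pages: list
    errors_reported: int = 0

    def report_error(self):
        """Record one error report from a reader."""
        self.errors_reported += 1

    def certify(self, page_number, reader):
        """Record that a reader certified a page (1-based) as error-free."""
        self.pages[page_number - 1].certified_by.add(reader)

    def ready_for_general_release(self):
        """True once enough errors were found OR every page has enough certifications."""
        enough_errors = self.errors_reported >= MIN_ERRORS_FOUND
        all_certified = all(
            len(p.certified_by) >= MIN_CERTIFICATIONS_PER_PAGE for p in self.pages
        )
        return enough_errors or all_certified
```

until `ready_for_general_release()` returns true, the book would be readable only page-by-page inside the error-reporting site; after that it would go out in the usual manner.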
I am advocating DP's system (or a next-gen version of it), for digitizing texts. It is proven, it works
well, once again, it looks like we're advocating the same thing. the difference is that i've already built models, and you haven't. and i also suggest that we make tools available to people who, for one reason or another, prefer to digitize books independently, as individuals, because that system is "proven" and it "works" too. and the difference, here too, is i've programmed apps; you haven't...
It is proven, it works, and it can be improved.
this makes me smile, jon, a big smile. because, you see, i too firmly believe that distributed proofreaders can be "improved". i even spent a lot of time over on their forums, telling them exactly how. but you know what? they didn't wanna hear it. not a single word of it. true enough, they have already incorporated _many_ of my suggestions, once i left them alone for a while, and i expect that they'll eventually incorporate all of 'em, but let me tell you, they didn't want to hear one word of it! and anyone can read their forums and see it for themselves. the cult is warm and friendly until they sniff a nonbeliever, and then they turn on that person with a very ugly vengeance. so i'm not too sure i'd recommend you share thoughts with 'em. still, jon, _i_ would be interested in what you would suggest. so do feel free to post your recommendations here, would you? and no offense, jon, but you really know _very_ little about d.p. so i'm kind of curious about how you think you could make _any_ improvements on their processes. but hey, i'm willing to listen...
It also has positive social implications for getting the public involved with public domain texts, and without having anyone commit much time to help.
i agree. those are several of the beauties of distributed processes. that charles franks idea was pure genius. and people made it work...
I do intend to upload them at the Internet Archive as well.
please do let all of us know where we can upload books we scan too! i've got the scans from mabie's "books and culture" book from google.
You can say what you want, but unless you "show the pudding", no one's going to believe you. Your credibility is zero.
thanks for your report on my credibility, jon. :+) i've said there are 2 errors (at least) in your text. anytime you want me to reveal 2, just send the $75. or once you find those 2 errors yourself, and correct them, i'll be happy to confirm to you that you did indeed find them. and until that time, whatever my current rate for the 2 errors, i will pay you _4_times_as_much_ if you "call my bluff" and i cannot point out 2 errors. that would be a full $300 right now. so feel free to "call me" any time; we'll see who gets the cash... on the other hand, if i know where 12 errors are, then even if you catch 10 of them, i've still got 2 i can use to cover my bet. so unless you're very confident of what you've got, be careful! (or am i just bluffing again? hard to know if you don't call me.)
That is an important factor in digitizing public domain texts -- the process needs to be open and transparent to the public.
not really. as long as an entity delivers text or scans or both without conditions and for absolutely no cost, i don't care if the processes they use are "open and transparent" or not... i would strongly prefer it if the scans were of a high quality -- which, by the way, doesn't have to mean high resolution -- and for the text to be completely error-free, but, you know, since i am not _paying_ for it, i can't really make demands... i also want the exact book done by the time i want to read it. but again, beggars can't be choosers, eh? so if i really want that specific book badly enough, and no one will do it for me, i guess i'll have to do it myself, right? that is the situation that more and more people will find themselves in, over time. they'll find somebody -- probably the brewster kahle crew -- scanned the book, but they will want it in text form instead; that's why we need to give them tools to do that conversion... if you check the posts i did make to the bookpeople listserve after the turn of this year, you'll see i predicted exactly this; referencing "the avalanche of scanned books available soon"...
Yes, and I noted that in my prior reply (referring to what you probably did), which you seemed to have missed.
i just confirmed that you were right about that, and that yes, indeed, i had done o.c.r. on your scans, and retained linebreaks. i wouldn't have bothered to confirm that, except that you seemed to miss the importance of the fact that this created an independent work-product, which is precisely what serves to best check for differences. you also showed you don't grasp the significance of this when you implied that d.p. would work on your text-file, rather than starting from scratch with some fresh o.c.r.
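a minimal sketch of why an independent o.c.r. pass with retained linebreaks is useful as a cross-check: the two outputs can be compared line against line, and any disagreement points at a spot worth checking against the scan. the function names and the sample text are made up for illustration; this is not the tool mentioned in this post.

```python
import difflib
from itertools import zip_longest


def word_differences(line_a, line_b):
    """Return (words_in_a, words_in_b) pairs for each spot where two lines disagree."""
    words_a, words_b = line_a.split(), line_b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b)
    return [
        (" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def compare_outputs(text_a, text_b):
    """Compare two o.c.r. outputs line by line; assumes both kept the book's linebreaks."""
    report = []
    pairs = zip_longest(text_a.splitlines(), text_b.splitlines(), fillvalue="")
    for line_no, (a, b) in enumerate(pairs, start=1):
        if a != b:
            report.append((line_no, word_differences(a, b)))
    return report


if __name__ == "__main__":
    first = "the quick brown fox\njumped over the lazy dog"
    second = "the quick brown fox\njumped over tbe lazy dog"
    for line_no, diffs in compare_outputs(first, second):
        print(line_no, diffs)
```

a disagreement only says that at least one of the two versions is wrong at that spot; where both versions make the same mistake, this check stays silent, which is why the two outputs need to be produced independently.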
I'm not thinking of myself on this one. What do you plan to do with the $75 anyway? Donate it to PG? Or buy a little Happiness? <smile/>
i would put it into a savings account to purchase that version of abbyy finereader that works on older books... :+)
Anyone who volunteers will check out what's there and decide for themselves whether they want to help out or not. It's only abuse when something is unknowingly forced on someone.
um, no, i'm just telling you why you will not get many volunteers to do that job for you. do you really expect people to do proofing the way you described? and any volunteers that you _might_ get won't last for long after that first book, because your processes waste their time and energy, rather than show _respect_ for it... and when i offered to help you out by adapting my tool for you? well, i still haven't heard your reaction to that offer...
You seem to imply that people are basically stupid and easily duped, except you of course.
no, it's precisely because people are _not_ "basically stupid" _or_ "easily duped" that they won't let you waste their time.
Maybe that's not your intent, but words sometimes can be misconstrued.
and sometimes intentionally misconstrued, to misrepresent...
Except for the lack of consistent line-breaks, which would make proofing easier,
um, it makes it _significantly_ easier. probably an order of magnitude or so... and there are other benefits to retaining the line-breaks as well.
what we have is actually not that primitive based on how many proofers/copyeditors already work. DP is better, of course.
oh please. do you really think proofers and copyeditors consistently print out all the stuff they are proofing? only the ancient ones and the stupid ones. the smart ones are actively using the machine to help them do their jobs. no wonder you think this task is difficult. you have no clue. you're going about it in a wrongheaded way, so it _is_ hard.
I'm aware of how clean it is since I had a few pages (but not the whole book) run through abbyy 7.
put your o.c.r. output up, and we'll let people judge for themselves. if you put up half of the pages, i'll give you the other half to put up. it might be interesting to see if there are any differences based on any different settings we might have used on some of the features...
I'm aware of how clean it is since I had a few pages (but not the whole book) run through abbyy 7. It still contains numerous errors
i never said abbyy was perfect. my rough memory is that it averaged 1-2 errors per page, but that a simple spell-check would pinpoint most of them... of course, a simple spellcheck is _not_ the best tool in my chest...
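a minimal sketch of the "simple spell-check" idea, assuming a plain set of known words stands in for a real dictionary (the word list and function name are hypothetical). it only flags words that are not in the list, so it cannot catch a scanning error that happens to produce another real word.

```python
import re


def flag_suspect_words(page_text, known_words):
    """Return (line_number, word) for every word not found in known_words."""
    suspects = []
    for line_no, line in enumerate(page_text.splitlines(), start=1):
        for word in re.findall(r"[A-Za-z']+", line):
            # compare case-insensitively and ignore surrounding apostrophes
            if word.lower().strip("'") not in known_words:
                suspects.append((line_no, word))
    return suspects


if __name__ == "__main__":
    known = {"the", "prairie", "was", "red"}
    page = "The prairie was\nred, tbe prairie"
    print(flag_suspect_words(page, known))
```

keeping the line number with each flagged word makes it quick to jump to the matching line of the scan, provided the text kept the book's linebreaks.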
a significant number of which cannot be found/properly-auto-corrected by post-processing tools.
and that's where you're wrong. and i'll document the figures at some point down the line... also, when i release my tool, sometime at the end of spring, you will see how wrong you are... correcting this text to high accuracy -- again, 1 error every 10-20 pages -- should take a moderately-skilled user roughly 4 hours, tops. compare that to how long it took you to do the scanning? which reminds me: how about a tally of precisely how many hours have now gone into the making of your "my antonia" book? breaking it down by each specific task would be tremendously enlightening to a lot of people. and i think people will be downright amazed that your cost-per-book (for this one book!) is so frighteningly large, well beyond the pale of what most of us would consider reasonable, even for what "could" be considered a prototype.
This worship of abbyy 7 by you is perplexing.
of course, you have no experience with it, do you? i listen to people who do lots and lots of scanning. to me, the amazing thing is that you want to persist in your belief that this job is _difficult_, but then you are so resistant to factors that make it _easy_. it's almost as if you _want_ to believe that it's hard.
It is very, very good,
that's a very, very good description...
but it is *not perfect*
...but it's not a perfect description. i would say abbyy v7 is _first-rate_. and then, when it's combined with the right scanner -- again, for your info, the plustek opticbook3600 -- and the scans are subjected to intelligent clean-up, that combination can give you _excellent_ output... _then_, with the right post-o.c.r. tools, a person can create _superior_ results. (which i would define operationally as 1 error or less for every 20-40 pages.)
--it's just an imperfect tool
i again notice insistence that this tool is "imperfect". of course it's not perfect, nothing ever is, but it's still a _lot_ less imperfect than you lead people to believe...
--it's just an imperfect tool to be used as part of an overall process.
and you have very little idea what that overall process should look like.
Possibly. I made the decision back in Jan./Feb. to meet a deadline.
d.p. could have met your deadline a whole lot faster than you did. once the book got in their queue, anyway. sometimes that can take a while. which is why, as i mentioned above, many people will increasingly find it's faster just to do a book themselves...
The markup is mundane, but then most markup for most books is pretty mundane. Mundane is not the issue.
if that was true, you would have marked up my test-suite a long time ago. but that was a challenge that you ignored...
This document is being used as part of a bigger project dealing with the use of texts, including interpublication linking, community text annotation, blogs/vlogs, etc.
but you have none of that on your site.
You don't see that because that is not being shown on the particular entry point to the demo page.
i see, so it's a "proof of concept" for someone else, not for us.
There's a lot of behind-the-scenes stuff going on elsewhere.
perhaps you're selling some new clothes to the emperor? :+)
Look again at the "mundane" markup. There's a little more there than meets the eye.
i know exactly what's there, and why it's there. it's still mundane. i suggest you go and look at the markup manual they've developed over at d.p.
This demo project is just a small part of a much bigger non-profit project, which I've actually revealed a lot about if you know how to read between the lines of my many prior messages on this
yes, i know how to read between your lines very well, jon.
At this time I don't want to go into the full details of what we are doing since that would be premature.
so much for the "open" part of this "open project", eh?
That's why your playing games with "My Antonia" errata submission is probably seen by most here as self-centered and vindictive rather than helpful to the public and the goals of digitizing public domain texts.
that's the way you'd like to spin it, i'm quite sure. and -- as you constantly like to remind everyone, _you_ believe perception is as important as reality. so you're trying to create the _perception_ that i am "self-centered and vindictive". good luck. _i_ think people realize that i won't be your pawn. meanwhile, i'm not the one writing business plans.
Laugh, pay me $100 and I'll show them to you.
will you pay me $400 if they aren't really "new ideas"? :+)
it would have been *totally wrong* for me to even ask them to meet a time deadline of *mine*
again, you show you have no knowledge of the d.p. processes. a first-time producer gets to move to the head of the queue. you sacrificed the book, did the scans, and were willing to post-process the book. all you needed was the proofreading, and that would have been done in a matter of a couple hours. that's not "taking advantage" of d.p. that's why they're there. that's what pourlean has been trying to tell you, repeatedly...
The focus now is to finish the proofing of "My Antonia"
you have proofed it to within an inch or two of its life, jon. the fact that you still don't consider it "finished" leads me to believe that you really _do_ advocate zero errors. and that leads us back to where we started this post, right? it's one more trip i've made on the noring merry-go-round. you have no sense of hitting the point of diminishing returns.
Since you already submitted a few errata, your name has already been added to the list of proofers at the "My Antonia" web site at http://www.openreader.org/myantonia/ .
please remove it. i don't wish to be associated with the page.
You've stated many times that you believe your "system" will *soon* (if not now) make D.P. unnecessary.
no. that's a misrepresentation of my position. d.p. will always be a cool thing for people who want to give only a small amount of time in support of e-texts. it will always be a cool thing for the people who want to embed themselves in a community of like-minded others. it will always be a cool thing for books that need some specialized treatment -- math, tables, greek, and so on. it will always be a cool thing in the demonstration of how distributed processes can bring about great results. it will always be a cool thing because it's a proving ground for tools that seek to facilitate the digitization of books. d.p. will always be a cool thing for any number of reasons... if we give individual people powerful tools that help them digitize books on their own, then it will not be necessary for them to work through d.p. if they choose not to do that. but that doesn't make d.p. itself "unnecessary". with a thousand or more active proofers, and a total roster that's now approaching 20,000, d.p. will be around a long time. they will also be a big creator of e-texts, and i salute them...
Human proofers/final-assemblers will always be needed to do final proofing, cleanup and markup (or structuring if not working in XML) of digital texts, at least for the next couple decades
post-o.c.r. tools need human beings to operate them intelligently. what i've said is that for a simple book -- like "my antonia" -- a moderately-skilled person who has the kind of tool i'll release in 8 weeks or so can do the post-o.c.r. processing in one evening. 8 weeks is much sooner than "the next couple decades". it's like, you know, the next couple of _months_... :+) this tool will _not_ do the type of markup you think is necessary. part of what i will be proving -- with some excellent pudding -- over the course of the next 6 months, is that you are wrong, jon, when you maintain that your type of markup is necessary. it's not.
to make them high-quality, fully usable, repurposeable texts.
my e-books will be high-quality, fully usable, and "repurposeable" (what does that mean, anyway?) exemplars, fully capable of the benefits that you promise (but will have a hard time delivering), and they will do so without any of your horrendous markup costs. so i'd suggest you make that sale to the emperor real soon now. in a short few months, you will never be able to close the deal...
(hmmm, how do *you* have access to abbyy 7?
um, i think i'll decline to tell you, and let you use that as a way to demonstrate your i.q.
Does it run on your legacy Mac?
not v7, nope. not even close. i went to a computer rental place to do the o.c.r. and at just $3/hour, it wasn't too expensive either; it's cheap 'cause it caters to kids playing online games, meaning it has big screens, fast machines, and fat pipes. so i'm lucky i live here in l.a., where the kids are depraved; you might not be so lucky in utah, where the kids are chaste. anyway, i'll make my money back when you cough up the $75... :+) or maybe i'll have to eat my investment if somebody else reports my errors to you first. that's a danger you run when you decide to hold out. :+)
How do you afford it?
i don't mind paying a few bucks to know what i'm talking about. it saves me from saying some stupid things. you should try it...
But why not make public your abbyy 7 raw text?
first of all, multiple copies of the same product don't help much. you need independent products to cross-check against each other. which means that you should go and do the o.c.r. yourself, jon. second -- and more important -- why should i? you keep thinking i am gonna donate my time, money, and energy to help you build up material you can use in your "business plans", and to attack the "trustworthiness" of project gutenberg e-texts. as usual, you're wrong.
Now you may say we've arrived with current OCR, so his test suite is useless. He is very aware of abbyy 7 yet he wants to assemble the "test suite". What does he know about OCR that you don't?
well, i'm not sure. why doesn't brewster tell us himself? i'm also curious about why his crew isn't using abbyy v7, most especially that version geared towards older books. why use inferior technology? in the long run, it's _not_ cheaper. unless you don't consider the true costs of correcting the errors. maybe it is a big mistake to lead brewster to believe that he can get the errors corrected by distributed proofreaders for nothing.
-bowerbird