re: still more on "my antonia"

jon said:
Anyway, I think you mentioned that "consistency" error before
no, this is another one. it's a place where you've deviated from the text on the scan. you _might_ say you've done it for reasons of "consistency"; however, there is no annotation to that effect in the file, as i believe you _would_ put, if that were really the case...
I guess what you really meant to say is that "I will gladly submit my errata immediately after I am paid." <laugh/>
yep, that's it. :+) although, if you haven't found the errors i know about by august -- that's 6 months after you put the file up -- i'll submit my error-free version to project gutenberg. at no cost. :+) (but, since there _is_ a serious point here, let me address it: what you are advocating is a system that is quite well-done, but which is _extremely_ expensive. indeed, it's _so_ costly that it simply _can_not_be_justified_ on a cost-benefit basis, especially since many promised benefits will not materialize. i'm just driving that point home to you. from the standpoint of "every text must be perfect", it might be worth it to _you_ to pay $50 to attain that perfection. but to end-users? no way. they'd rather keep that $50, and live with that one little error -- since no one has even managed to notice it so far anyway. as for source metadata, users wouldn't pay _5_ bucks for that.)
And of course, I won't tell you here exactly what errors the proofers found
do you think i can't pinpoint them myself? (and here's a hint back for you: those weren't the ones.)
What you are implying by this statement is that *you know* what the error-free text is supposed to be.
i do. because i have the scans to refer to. :+)
Interesting that you apparently missed at least a couple errors. <laugh/>
what makes you think i missed those errors? :+) all i said was that i was sitting on 2 errors. i did not say i was sitting on 2 and only 2. i never show all my cards if i don't need to. i only expect to get paid the $50 for the _last_ error. (although if you wait too long, i'll soon be considering a pricetag of $25 for the _next-to-last_ one as well...)
You did bring up a good point about the line break issue. This leads to my answer to Pauline's question on why I didn't use DP, and why I'm hesitant now to submit it to DP:
just run the scans through o.c.r. again, and keep the linebreaks. that's what i did. that gives you an independent work-product, which is _really_ what you need to have to find all the errors.
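A minimal sketch of that comparison, assuming both transcriptions kept the printed line breaks so that line N in one corresponds to line N in the other. The function name and the sample strings are hypothetical and are not taken from either poster's tooling; the diffing uses Python's standard difflib.

```python
import difflib


def find_disagreements(text_a, text_b):
    """Return (line_number, words_a, words_b) for every word-level
    difference between two transcriptions with matching line breaks.

    Assumption: both texts have the same number of lines. zip() stops
    at the shorter text, so extra trailing lines are not compared.
    """
    disagreements = []
    pairs = zip(text_a.splitlines(), text_b.splitlines())
    for number, (line_a, line_b) in enumerate(pairs, start=1):
        if line_a == line_b:
            continue
        words_a, words_b = line_a.split(), line_b.split()
        matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
        for tag, a1, a2, b1, b2 in matcher.get_opcodes():
            if tag != "equal":
                disagreements.append(
                    (number, " ".join(words_a[a1:a2]), " ".join(words_b[b1:b2]))
                )
    return disagreements


# Hypothetical sample: one transcription misreads "m" as "rn".
existing_text = "I first heard of Antonia\non what seemed to me"
fresh_ocr = "I first heard of Antonia\non what seerned to me"
for hit in find_disagreements(existing_text, fresh_ocr):
    print(hit)
```

Each reported line is only a disagreement between the two sources; which reading is right still has to be settled against the page scan.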
it would be a waste to have DP do it all over again when it is now in very good shape.
except it might save you 75 bucks... :+)
If anyone reading this wants to help and proof a few pages using the "primitive" system I now have (e.g., just print out the pages from the page scans you would like to proof, and compare them with the online XHTML version which shows the page numbering and breaks. There are other ways to do this proofing by comparison, such as opening up two windows in your browser, one showing the page scan and the other the online XHTML text.)
it is an abuse of volunteers to make them endure processes like that...
As noted above, I would use a DP process for "mass production", but for this particular project did not for the reasons cited above.
well, if you run it through abbyy finereader v7.x, the output is clean. this book would fly through d.p. -- i'd be surprised if it lasted over 3 hours in the queue, because the scans are clear, the text clean, and the book easy and interesting -- and you'd have a separate product, which is worth gold. so you're just not thinking very clearly here...
"My Antonia" is, and has always been, a demonstration project
i guess i'm still not sure what it is you're "demonstrating" with it. you say it is a "proof of concept", but what exactly is the concept? this book is dirt-simple -- among the simplest possible 10%, i'd say. and the markup you did pales against that routinely coming from d.p. i don't see anything but ordinary-and-mundane in what you've done. (i sincerely don't mean anything derogatory in saying that, because the job you did is competent, and that is all that is really required. i just don't see anything over and above simple competence here. but if i'm missing something, please do feel free to enlighten me.)
to experiment with new ideas
i don't see any.
to get my hands "dirty" with the production process (although I've transcribed a dozen texts before),
but you've just told us that you used a one-time-only process here. (good thing, too, because, as we agree, your process was primitive.) and basically, you borrowed an e-text that someone else produced! that's a good trick, when you can do it. but it won't scale very far. and it can hardly be considered to be "proof of concept".
and to use it for showing to some people interested in this.
ok, well i hope it "worked"... there's one born every minute...
This will be a tough test for your "we-don't-need-DP" approach:
i never said anything close to that. i think d.p. is swell. many people digitize books individually and independently, so it certainly can be done. given the right tools, it can even be easy. nonetheless, i think cooperation is dandy... but as for your test, bring it on. if you have an old book, with hard-to-o.c.r. text, i highly recommend that you buy the abbyy version specialized for old, hard-to-o.c.r. books. without that, you won't know how successful o.c.r. can be.
Maybe we should do a competition (you seem to love competition!):
actually, i believe competition is so 20th-century... but i love to be challenged. and i love to expose the hype people try to spin. so bring on your test. but the way i'll work it is this: you put it through d.p., and then i'll see if i can find any errors they didn't find. if i do, i'll figure out what i'll charge you to reveal them. :+) (if _i_ were you, i'd use the money to buy finereader instead.)
Of course, you will have to promise not to do late hours hand proofing of the text -- that'd be cheating -- you have to do it all auto-magically as you've been advocating.
right. like i'm gonna spend a lot of my time volunteering for _you_. i take on your challenges because i can smash them so effortlessly...
-bowerbird

Bowerbird said:
jon said:
although, if you haven't found the errors i know about by august -- that's 6 months after you put the file up -- i'll submit my error-free version to project gutenberg. at no cost. :+)
No problem. I've done the scans and you can submit those, too, or I'll do it. We don't seek any credit for the text. What we do is for the benefit of the Public and for digital preservation of the Public Domain.
(but, since there _is_ a serious point here, let me address it: what you are advocating is a system that is quite well-done, but which is _extremely_ expensive. indeed, it's _so_ costly that it simply _can_not_be_justified_ on a cost-benefit basis, especially since many promised benefits will not materialize. i'm just driving that point home to you. from the standpoint of "every text must be perfect", it might be worth it to _you_ to pay $50 to attain that perfection. but to end-users? no way. they'd rather keep that $50, and live with that one little error -- since no one has even managed to notice it so far anyway. as for source metadata, users wouldn't pay _5_ bucks for that.)
Who says I've advocated "zero" errors? It's a nice goal, but not possible without a lot of eyeballs, some auto-processing, and some luck. But we can get the error-rate down to very low levels. I am advocating DP's system (or a next-gen version of it), for digitizing texts. It is proven, it works, and it can be improved. It also has positive social implications for getting the public involved with public domain texts, and without having anyone commit much time to help. DP can be improved, of course, but so can everything.
And of course, I won't tell you here exactly what errors the proofers found
do you think i can't pinpoint them myself? (and here's a hint back for you: those weren't the ones.)
I wrote a couple of comments in my prior reply which you somehow missed, acknowledging that you certainly could find them if you wanted to. So anyway, I guess you missed those two errors which I presented. <laugh/>
What you are implying by this statement is that *you know* what the error-free text is supposed to be.
i do. because i have the scans to refer to. :+)
As do I. As does everyone. The work that we've done is right there open to the public, warts and all, with calls for it to be improved. James Linden, who graciously hosts the openreader.org site (thanks, James!), said a lot of people have downloaded the page scans, which is great news. They are intended to be preserved for public access. I do intend to upload them at the Internet Archive as well.
Interesting that you apparently missed at least a couple errors. <laugh/>
what makes you think i missed those errors? :+)
<laugh/> You can say what you want, but unless you "show the pudding", no one's going to believe you. Your credibility is zero. Everything that we've done is totally open to the public -- it's right there at http://www.openreader.org/myantonia/ , just like DP is an open process. That is an important factor in digitizing public domain texts -- the process needs to be open and transparent to the public.
You did bring up a good point about the line break issue. This leads to my answer to Pauline's question on why I didn't use DP, and why I'm hesitant now to submit it to DP:
just run the scans through o.c.r. again, and keep the linebreaks. that's what i did. that gives you an independent work-product, which is _really_ what you need to have to find all the errors.
Yes, and I noted that in my prior reply (referring to what you probably did), which you seem to have missed.
it would be a waste to have DP do it all over again when it is now in very good shape.
except it might save you 75 bucks... :+)
I'm not thinking of myself on this one. What do you plan to do with the $75 anyway? Donate it to PG? Or buy a little Happiness? <smile/>
If anyone reading this wants to help and proof a few pages using the "primitive" system I now have (e.g., just print out the pages from the page scans you would like to proof, and compare them with the online XHTML version which shows the page numbering and breaks. There are other ways to do this proofing by comparison, such as opening up two windows in your browser, one showing the page scan and the other the online XHTML text.)
it is an abuse of volunteers to make them endure processes like that...
Anyone who volunteers will check out what's there and decide for themselves whether they want to help out or not. It's only abuse when something is unknowingly forced on someone. You seem to imply that people are basically stupid and easily duped, except you of course. Maybe that's not your intent, but words sometimes can be misconstrued. Except for the lack of consistent line-breaks, which would make proofing easier, what we have is actually not that primitive compared with how many proofers/copyeditors already work. DP is better, of course.
As noted above, I would use a DP process for "mass production", but for this particular project did not for the reasons cited above.
well, if you run it through abbyy finereader v7.x, the output is clean.
I'm aware of how clean it is since I had a few pages (but not the whole book) run through abbyy 7. The output still contains numerous errors, a significant number of which cannot be found or properly auto-corrected by post-processing tools. This worship of abbyy 7 by you is perplexing. It is very, very good, but it is *not perfect* -- it's just an imperfect tool to be used as part of an overall process.
this book would fly through d.p. -- i'd be surprised if it lasted over 3 hours in the queue, because the scans are clear, the text clean, and the book easy and interesting -- and you'd have a separate product, which is worth gold. so you're just not thinking very clearly here...
Possibly. I made the decision back in Jan./Feb. to meet a deadline. And as noted, it did not need to be ultra-clean for that demo, just clean enough. And not having the tools available to me (such as abbyy 7 OCRing for doing the whole text -- I now have some abbyy access via an acquaintance), I had to make do with what I could get my hands on at the time.
"My Antonia" is, and has always been, a demonstration project
i guess i'm still not sure what it is you're "demonstrating" with it. you say it is a "proof of concept", but what exactly is the concept? this book is dirt-simple -- among the simplest possible 10%, i'd say. and the markup you did pales against that routinely coming from d.p.
i don't see anything but ordinary-and-mundane in what you've done.
The markup is mundane, but then most markup for most books is pretty mundane. Mundane is not the issue. This document is being used as part of a bigger project dealing with the use of texts, including interpublication linking, community text annotation, blogs/vlogs, etc. You don't see that because that is not being shown on the particular entry point to the demo page. There's a lot of behind-the-scenes stuff going on elsewhere. Look again at the "mundane" markup. There's a little more there than meets the eye.
(i sincerely don't mean anything derogatory in saying that, because the job you did is competent, and that is all that is really required. i just don't see anything over and above simple competence here. but if i'm missing something, please do feel free to enlighten me.)
No offense is taken. This demo project is just a small part of a much bigger non-profit project, which I've actually revealed a lot about if you know how to read between the lines of my many prior messages on this. You are only seeing the tip of the iceberg, so it is understandable that you keep making wrong assumptions and beating the wrong-end of a dead horse. At this time I don't want to go into the full details of what we are doing since that would be premature. This is a project with over a half-dozen people working hard in the digital library arena. Will we succeed? I don't know, but there's a lot of activity going on behind-the-scenes, lots of people we are talking with, etc. If we succeed per our plan, PG and DP will benefit, but that's all I can say at present, not to get anyone's hopes up. If we don't succeed, then we will have learned a lot, and it will spawn a couple other more directed non-profit business ventures I'm now outlining, and which I've alluded to the last few months. I suppose you missed those hints since you seem not to have taken the time to read and comprehend what I have written. Fine, it's your right not to accurately read. In the meanwhile, finishing up "My Antonia" will benefit the public, even if it is only one text out of millions. Every text that PG/DP works on benefits the public -- each one is precious to someone in some way. That's why your playing games with "My Antonia" errata submission is probably seen by most here as self-centered and vindictive rather than helpful to the public and the goals of digitizing public domain texts.
to experiment with new ideas
i don't see any.
Laugh, pay me $100 and I'll show them to you. Of course, I'll donate the $100, half to PG and half to DP.
to get my hands "dirty" with the production process (although I've transcribed a dozen texts before),
but you've just told us that you used a one-time-only process here. (good thing, too, because, as we agree, your process was primitive.)
You seem to have trouble with simple reading comprehension. The focus of this project was not on the production process of the text (except maybe for producing the scans). I've noted many times that in "mass production" my process would NOT be used. Rather a DP or DP-like production/proofing process would be used. The process I chose was based on time and resource constraints plus focusing on the needs of the demo which are NOT production-related. Did I make the right decisions then? Maybe not. But I hit the deadline for the goals I had to hit. Now I want to finish the project for the good of the Public Domain, and we are where we are.
and basically, you borrowed an e-text that someone else produced! that's a good trick, when you can do it. but it won't scale very far. and it can hardly be considered to be "proof of concept".
I had a deadline. Yes, maybe submitting it to DP would have been the better approach. (On the other hand, I had to have full control over the process to make sure I met the deadline. DP does not work with such deadlines and it would have been *totally wrong* for me to even ask them to meet a time deadline of *mine*. DP produces digital texts as they produce them, on their timetable, as it should be. This is the core reason I was hesitant to even ask DP to help at the time. I respect what they do, and don't want to abuse their system.) But this is all water under the bridge. The focus now is to finish the proofing of "My Antonia" so it can be donated to PG (and elsewhere) as a finished product that we can back up. Your help, of course, is much appreciated. Since you already submitted a few errata, your name has already been added to the list of proofers at the "My Antonia" web site at http://www.openreader.org/myantonia/ .
This will be a tough test for your "we-don't-need-DP" approach:
i never said anything close to that. i think d.p. is swell. many people digitize books individually and independently, so it certainly can be done. given the right tools, it can even be easy. nonetheless, i think cooperation is dandy...
You've stated many times that you believe your "system" will *soon* (if not now) make D.P. unnecessary. And several have taken you to task on this assertion. Human proofers/final-assemblers will always be needed to do final proofing, cleanup and markup (or structuring if not working in XML) of digital texts, at least for the next couple decades, to make them high-quality, fully usable, repurposable texts. Yes, you can create digital texts using an automated system, and they could even look pretty clean (however that is defined), but they will still be insufficient, buggy, and not fully usable. I've given several examples already which prove this simple point, which interestingly you have not commented upon. Why? Because you can't. Until we get mature A.I. with systems that fully comprehend content in multiple languages and dialects (essentially sentient systems with the comprehension, intelligence, and *life experience* of today's proofers/copyeditors), there will still be a need for human proofing to finalize digital texts (of which D.P. is a great system to mobilize people to help with proofing/copyediting/final-assembly.) (As an aside, this brings up my thoughts to contact that guy in Utah who is the real "Rain Man". He has unbelievable talents that could be used for digitizing texts.)
but as for your test, bring it on. if you have an old book, with hard-to-o.c.r. text, i highly recommend that you buy the abbyy version specialized for old, hard-to-o.c.r. books. without that, you won't know how successful o.c.r. can be.
As I noted before, I have access to the latest version of the top-end abbyy 7 via an acquaintance. But since there's a per-page fee, I try to avoid asking too much help from him since he is paying for it (of course, I could pay him, I suppose -- he did volunteer OCRing, but I don't want to abuse his generosity; I'd rather save it up for when I really need it.) I suppose I can ask him to run OCR on "My Antonia" and then I'll use that as a third source to compare with the other two, as you have done (hmmm, how do *you* have access to abbyy 7? Does it run on your legacy Mac? How do you afford it?) But why not make public your abbyy 7 raw text? In addition, however, I still want human proofers to go through the "My Antonia" text in order to validate the work in a public setting. This is part (but not all) of the trustworthiness requirement -- to have several independent eyeballs go through the text to verify authenticity and completeness.
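One way a third source could be used alongside the other two is a simple per-line vote, sketched below. It assumes all three transcriptions share identical line breaks; the function name and sample data are hypothetical. Lines where the sources do not all agree are flagged for a human proofer rather than silently accepted, in keeping with the requirement that people still verify the text.

```python
from collections import Counter


def vote_lines(sources):
    """Compare three or more transcriptions line by line.

    Returns (provisional_lines, flagged), where flagged holds
    (line_number, {reading: votes}) for every line on which the
    sources disagree. Assumption: all sources have the same line
    breaks; zip() stops at the shortest source.
    """
    split_sources = [text.splitlines() for text in sources]
    provisional, flagged = [], []
    for number, variants in enumerate(zip(*split_sources), start=1):
        counts = Counter(variants)
        best_reading, votes = counts.most_common(1)[0]
        provisional.append(best_reading)
        if votes < len(variants):
            # Not unanimous: keep the most common reading for now,
            # but send the line to a human for checking.
            flagged.append((number, dict(counts)))
    return provisional, flagged


# Hypothetical sample with one OCR slip in the third source.
texts = ["a fine day\nfor a walk", "a fine day\nfor a walk", "a fine day\nfor a wa1k"]
lines, to_check = vote_lines(texts)
print(lines)
print(to_check)
```

A unanimous line can still be wrong if every source shares the same misreading, so the vote narrows where to look and does not replace proofing against the scans.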
and i love to expose the hype people try to spin....
Like your hype that DP-like systems are unnecessary?
but the way i'll work it is this: you put it through d.p., and then i'll see if i can find any errors they didn't find. if i do, i'll figure out what i'll charge you to reveal them. :+)
This "challenge" is really of interest to many. It's not just some personal interest of mine. In fact, Brewster Kahle at the Internet Archive mentioned to me a few months ago (and I assume he's already talked with Juliet and/or Charles -- didn't he also bring this up at the Dec. 2003 PG meeting which you and I attended?) that he wants a careful selection (cross-section) of already-scanned books to be highly-proofed by DP near to the *zero-error* level. These texts will then form a "test suite" to be used for future testing of various OCR and post-processing packages for text error rate estimation, and to improve open-source-based OCR applications (especially for better font handling to deal with the vagaries of older books.) Having essentially perfect digital texts to compare with creates a useful baseline for testing OCR. Now you may say we've arrived with current OCR, so his test suite is useless. He is very aware of abbyy 7 yet he wants to assemble the "test suite". What does he know about OCR that you don't?
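As a rough illustration of how such a baseline could be used, the sketch below scores an OCR output against a trusted reference text with a character error rate: the Levenshtein edit distance divided by the reference length. This is one common metric and is an assumption here, not something the test-suite proposal specifies; the function names and sample strings are hypothetical.

```python
def edit_distance(reference, candidate):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions turning reference into candidate."""
    previous = list(range(len(candidate) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, cand_char in enumerate(candidate, start=1):
            cost = 0 if ref_char == cand_char else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ref_char
                    current[j - 1] + 1,      # insert cand_char
                    previous[j - 1] + cost,  # match or substitute
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference, candidate):
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, candidate) / len(reference)


# Hypothetical sample: the OCR output reads "m" as "rn".
rate = character_error_rate("on what seemed to me", "on what seerned to me")
print(f"{rate:.3f}")
```

The pure-Python distance is quadratic in text length, so for whole books it would be more practical to score page by page.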
Of course, you will have to promise not to do late hours hand proofing of the text -- that'd be cheating -- you have to do it all auto-magically as you've been advocating.
right. like i'm gonna spend a lot of my time volunteering for _you_. i take on your challenges because i can smash them so effortlessly...
As noted above, this is not for *me* -- it's for starting to build a test suite to benefit everyone involved in digitizing Public Domain texts. With regard to "My Antonia", that will be donated to PG once I am satisfied the error rate is very low and can demonstrate trustworthiness. Since you've apparently already done OCR on it using abbyy 7 to create a raw text version that can be used to help locate a few remaining errors (by the procedure mentioned in my prior reply, which most already know about so it's nothing earth-shattering), then that is something I don't need to duplicate. Unnecessary duplication is a waste -- it is inefficient. Why you do not freely share the errata on "My Antonia" seems to have more to do with vindictiveness and selfishness than with trying to benefit the public. I guess I'll have to ask my friend to donate some abbyy OCR scans for the project if we don't get enough proofers to help out the next month or so (I haven't really begun advertising for them yet, pending a time when I can help those who do volunteer.) Of course, anyone reading this who is willing to donate some abbyy 7 OCR raw text, let me know!
Jon Noring
participants (2): Bowerbird@aol.com, Jon Noring