
jon said:
I didn't know you issued a challenge.
i sure did. :+) it came through at 4:53 pacific, on 2005/2/28. i have appended a copy for your convenience...

basically, you said "i doubt it" in direct response to my claim that correct execution of the whole process -- from scanning through a few hours of post-o.c.r. work -- could result in an accuracy-rate of 1 error per 10 pages, so i challenged you to a test with your "my antonia" scans. since you don't seem to want to have the o.c.r. done, for understandable reasons, i will do it myself.

by the way, i have done some extensive comparisons of the project gutenberg version of "my antonia" and yours. the more deeply i go into it, the more i become convinced most differences are due to intentional edits, and _not_ due to sloppiness in the original preparation of the work. so this appears to be exactly like the "frankenstein" case -- a simple use of a different edition as the source-text. in view of the insinuations you cast against the "accuracy" of the project gutenberg e-text, perhaps you should apologize?

-bowerbird

Subj: Re: [gutvol-d] Interesting message on TeBC from NetWorker (about fixing errors, Frankenstein, etc.)
Date: 2/28/2005 7:53:21 PM Eastern Standard Time
From: Bowerbird
To: gutvol-d@lists.pglaf.org, Bowerbird

jon said:
But the bigger issue is not constrained to errors (differences) with respect to the source text used, as you continue to focus on.
i think it was you who made "errors" the issue, revolving around the concept of "trustworthiness". if, once that house of cards falls down, you want to turn the issue into one of "which source-text to use", well then i think that michael's "i'm open to all of 'em" stance covers _that_ quite nicely, thank you very much.

if you don't like the version of my antonia that's in the library now, add your own! the same goes for all the versions of "frankenstein". casting aspersions on the edition that _is_ there isn't constructive. provide all the meta-data you want on the version that you furnish; heck, you can even put in a pointer to your project at librarycity.org; these days i see a lot of e-texts referencing an .rtf version in france.
the PG version of Frankenstein, which now exposes PG to legal liability.
i don't agree. but if the lawyers to whom "bantam classics" is paying good money decide to send a cease-and-desist, let 'em. going by the results obtained by the "gone with the wind" lawyers, the project gutenberg people will probably fold very quickly; without any money, you can't play poker against deep pockets.

but hey, i would like to hear the laughter that would resound if bantam's lawyers argued that the _errors_ are how they can _prove_ this e-text copied their book. (map-makers can pull that trick. but book-publishers? ha!)

who knows, jon, maybe the project gutenberg lawyers will call _you_ to the stand, to throw your arms in the air and rant about how those terrible mistakes are ruining the fragile public domain, and therefore bantam doesn't _deserve_ the protection of the law. wouldn't that be ironic? :+)
The lack of proper processes, procedures and guidelines
well, i don't agree with that either, jon. you might not agree with the procedures, but that doesn't mean there is a "lack" of them. maybe you don't agree with their choice of source-text for frankenstein. but it _was_ good enough for bantam.
is leading to serious questions about the integrity and trustworthiness of the whole PG library
not in my mind. and not in the minds of most people, i don't think. not any more so than with any paper-book i might find in a store. like the "frankenstein" version that was being _sold_ by bantam.
1) redoing most of the non-DP works using DP,
let's find out how many d.p. people want me to go over _their_ work with a fine-tooth comb. go ahead, speak up, i'd _love_ the challenge.
Well, at least you seem to indicate from your interest in very low error rate OCR that every etext PG includes in its archive should be a textually faithful reproduction of some known source.
not necessarily. if someone wants to play editor and combine editions, i don't have any problem with that. in some sense, that's what the public domain is about. i don't see it in black-and-white terms, as something frozen.

if you _are_ going to represent something as faithful, i think it should _be_ faithful. but even then, that is _to_the_best_of_your_ability_. as long as you do that, and give your end-users a means of "checking your work", including a solid mechanism for improving it to perfection, then i think you've done your job.

so yes, i agree with you that scans should absolutely be furnished to the end-users, for works that purport to replicate their source edition, certainly... however, i understand why they haven't been, up to this point, and so do you -- disk-space just hasn't been affordable enough. even now, if it were not for the largess of ibiblio and brewster, we couldn't even be entertaining the thought of posting the scans.
I doubt this error rate (let's say for even half of the public domain printings out there) is accomplishable without sentient-level AI.
i'm trying to get back off this listserve. i don't like contributing to the discourse in a place where my voice has been muffled before. so let me set up a place where you and i can fight... i mean, discuss...

but this doubt of yours is rather easy to dispel, and quickly. you did a pretty good job of scanning that copy of "my antonia", and it looks like you processed (e.g., straightened) the scans well. so now we need to put them through o.c.r., using abbyy finereader. please have that done as follows: save the results out to an .rtf file, one for each page, retaining line-breaks and paragraph indentation. do this for 20-50 pages, then zip the output up and e-mail it to me.

i will reply to you with feedback on whether the o.c.r. was done correctly. then i'll run it through programs that will soon be made available, at no cost, and we'll see what kind of an error-rate we end up with. or, if you prefer, follow this same procedure with some other book. then, if you still want to discuss this matter, we'll do it elsewhere.
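by the way, if zipping up 20-50 little .rtf files sounds tedious, a few lines of python will handle the bundling step -- this is just a rough sketch of my own, with made-up folder- and file-names, not any official tool:

    # bundle the per-page o.c.r. output (page-001.rtf, page-002.rtf, ...)
    # into one zip-file for e-mailing. all file-names here are made up.
    import zipfile
    from pathlib import Path

    def bundle_pages(folder: str, archive: str = "ocr-pages.zip") -> int:
        """zip every .rtf page-file found in `folder`; return how many."""
        pages = sorted(Path(folder).glob("*.rtf"))
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
            for page in pages:
                zf.write(page, arcname=page.name)  # store just the bare name
        return len(pages)

    if __name__ == "__main__":
        print(bundle_pages("antonia-rtf"), "pages zipped")  # hypothetical folder

(finereader itself does the per-page saving; this just bundles whatever it wrote out.)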
But if proofreading is to be done anyway by the public, as is *now done* by DP, what difference is there between an OCR error of one every 10 pages, and one every page?
when i talk about "the public", i mean _end-users_ who are reading the book for the purpose of reading the book, and _not_ specifically to be "proofreading" it per se. for that type of reader, one error on every page is too many, but one error on every tenth page is not. especially since -- if we give them an easy means of checking for errors and reporting them, and then reward readers for finding them -- errors won't persist for very long, and the e-text will instead progress very quickly on its merry way to a state of perfection.

in a practical sense, this means that before you turn an e-text loose for download in an all-in-one file, you make it available _page-by-page_ on the web. anyone who might want to read it has to do so in that form. right alongside the text for each page is the image, so the person can easily check any possible errors. you let 'em know you are asking for their help to find mistakes.

if they find one, they fill out a form right on the page, and their input is recorded -- wiki-style -- immediately. later readers can either confirm the error, question it, or make comments. the first person to find each error gets a credit in the final e-text.

you also give people a viewer-program that allows them to download the appropriate page-image if they suspect an error -- displaying it right there in the viewer-app next to the text -- and which simplifies the process of reporting it if they find one (by, for instance, filling out an e-mail they can send with a click).
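to show how little machinery the record-keeping part needs, here is a bare-bones sketch in python -- every name in it is invented for illustration, and a real system would sit behind the web-form on each page -- of recording reports wiki-style, with the first finder of each error keeping the credit:

    # bare-bones wiki-style error-report log; every name is invented.
    import json
    import time
    from pathlib import Path

    LOG = Path("reports.json")  # a flat file standing in for a real database

    def load_reports() -> list:
        """read back whatever has been recorded so far."""
        return json.loads(LOG.read_text()) if LOG.exists() else []

    def report_error(book, page, excerpt, correction, reader):
        """record a report immediately; the first reader to flag a given
        error keeps the credit, later matches become confirmations."""
        reports = load_reports()
        for r in reports:
            if (r["book"], r["page"], r["excerpt"]) == (book, page, excerpt):
                r["confirmations"].append(reader)  # someone got there first
                break
        else:
            reports.append({"book": book, "page": page, "excerpt": excerpt,
                            "correction": correction, "reader": reader,
                            "time": time.time(), "status": "open",
                            "confirmations": []})
        LOG.write_text(json.dumps(reports, indent=2))
        return reports

    # e.g. report_error("my antonia", 57, "Antomia", "Antonia", "jon")

later readers who flag the same excerpt land in the confirmations list instead, which is exactly the confirm-or-question behavior described above.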
The key is that for the aspect of building *trust* in the final product, it is a very good idea to involve the volunteer proofreaders to go over the texts, even if *you don't have to*.
what i just described does a good job of doing that. this is the system of "continuous proofreading" i outlined on this listserve a very long time ago. you recently mistakenly credited it to james linden. my offer to develop this system was largely snubbed. for _that_, the project gutenberg "people in charge" rightly deserve to be criticized. for the tiny stuff that you have been complaining about, they do not...
Having (and proving to anyone who asks) at least two independent people who proofed every page, adds to its trustworthiness.
not nearly as well as putting text and image side-by-side, and allowing any number of "volunteer proofreaders" to examine 'em. you might be surprised by the number of errors that "slip by" the proofreaders through two rounds of eyeballing over at d.p. (indeed, many even slip by the "third round" of post-processing and whitewashing, and sit there big and ugly in the final e-text.) even if a dozen people look at a page, an error might _still_ be there. but with eternal transparency, there is always hope it will be fixed.

anyway, jon, i hope you take up the friendly challenge i issued here. and if any d.p. people want to call me on the challenge i made to them, you just let me know. in the meantime, i'll let you get in the last word on this thread, jon, because i _really_ need to be going. use it wisely... ;+)

-bowerbird