
jon said:
I didn't know you issued a challenge.
i sure did. :+) it came through at 4:53 pacific, on 2005/2/28. i have appended a copy for your convenience... basically, you said "i doubt it" in direct response to my claim that correct processing of the whole process -- from scanning through a few hours of post-o.c.r. work -- could result in an accuracy-rate of 1 error per 10 pages, so i challenged you to a test with your "my antonia" scans. since you don't seem to want to have the o.c.r. done, for understandable reasons, i will do it myself. by the way, i have done some extensive comparisons of the project gutenberg version of "my antonia" and yours. the more deeply i go into it, the more i become convinced most differences are due to intentional edits, and _not_ due to sloppiness in the original preparation of the work. so this appears to be exactly like the "frankenstein" case -- a simple use of a different edition as the source-text. in view of the insinuations you cast against the "accuracy" of the project gutenberg e-text, perhaps you should apologize?
-bowerbird
Subj: Re: [gutvol-d] Interesting message on TeBC from NetWorker (about fixing errors, Frankenstein, etc.)
Date: 2/28/2005 7:53:21 PM Eastern Standard Time
From: Bowerbird
To: gutvol-d@lists.pglaf.org, Bowerbird
jon said:
But the bigger issue is not constrained to errors (differences) with respect to the source text used, as you continue to focus on.
i think it was you who made "errors" the issue, revolving around the concept of "trustworthiness". if, once that house of cards falls down, you want to turn the issue to one of "which source-text to use", well then i think that michael's "i'm open to all of 'em" stance covers _that_ quite nicely, thank you very much. if you don't like the version of my antonia that's in the library now, add your own! the same goes for all the versions of "frankenstein". casting aspersions on the edition that _is_ there isn't constructive. provide all the meta-data you want on the version that you furnish; heck, you can even put a pointer in to your project at librarycity.org; these days i see a lot of e-texts referencing an .rtf version in france.
the PG version of Frankenstein, which now exposes PG to legal liability.
i don't agree. but if the lawyers to whom "bantam classics" is paying good money decide to send a cease-and-desist, let 'em. going by results obtained by the "gone with the wind" lawyers, the project gutenberg people will probably fold very quickly; without any money, you can't play poker against deep pockets. but hey, i would like to hear the laughter that would resound when bantam's lawyers argued that the way they can _prove_ that this e-text copied their book is because of the _errors_ (map-makers can pull that trick. but book-publishers? ha!) who knows, jon, maybe the project gutenberg lawyers will call _you_ to the stand, to throw your arms in the air and rant about how those terrible mistakes are ruining the fragile public domain, and therefore bantam doesn't _deserve_ the protection of the law. wouldn't that be ironic? :+)
The lack of proper processes, procedures and guidelines
well, i don't agree with that either, jon. you might not agree with the procedures, but that doesn't mean there is a "lack" of them. maybe you don't agree with their choice of source-text for frankenstein. but it _was_ good enough for bantam.
is leading to serious questions about the integrity and trustworthiness of the whole PG library
not in my mind. and not in the minds of most people, i don't think. not any more so than with any paper-book i might find in a store. like the "frankenstein" version that was being _sold_ by bantam.
1) redoing most of the non-DP works using DP,
let's find out how many d.p. people want me to go over _their_ work with a fine-tooth comb. go ahead, speak up, i'd _love_ the challenge.
Well, at least you seem to indicate from your interest in very low error rate OCR that every etext PG includes in its archive should be a textually faithful reproduction of some known source.
not necessarily. if someone wants to play editor and combine editions, i don't have any problem with that. in some sense, that's what the public domain is about. i don't see it in black/white terms as something frozen. if you _are_ going to represent something as faithful, i think it should _be_ faithful. but even then, that is _to_the_best_of_your_ability_. as long as you do that, and give your end-users a means of "checking your work", including a solid mechanism for improving it to perfection, then i think you've done your job. so yes, i agree with you, that scans should absolutely be furnished to the end-users, for works that purport to replicate that edition, certainly... however, i understand why they haven't been, up to this point, and so do you -- disk-space just hasn't been affordable enough, even now, if it were not for the largess of ibiblio and brewster, we couldn't even be entertaining the thought of posting the scans.
I doubt this error rate (let's say for even half of the public domain printings out there) is accomplishable without sentient-level AI.
i'm trying to get back off this listserve. i don't like contributing to the discourse in a place where my voice has been muffled before. so let me set up a place where you and i can fight... i mean, discuss... but this doubt of yours is rather easy to dispel, and quickly. you did a pretty good job of scanning that copy of "my antonia". and it looks like you processed (e.g., straightened) the scans well. so now we need to put them through o.c.r., using abbyy finereader; please have that done as follows: save results out to an .rtf file, one for each page; retaining line-breaks and paragraph indentation. do this for 20-50 pages, and zip the output up and e-mail it to me. i will reply to you with feedback on if the o.c.r. was done correctly. then i'll run it through programs that will soon be made available, at no cost, and we'll see what kind of an error-rate we end up with. or, if you prefer, follow this same procedure with some other book. then, if you still want to discuss this matter, we'll do it elsewhere.
But if proofreading is to be done anyway by the public, as is *now done* by DP, what difference is there between an OCR error of one every 10 pages, and one every page?
when i talk about "the public", i mean _end-users_ who are reading the book for the purpose of reading the book, and _not_ specifically to be "proofreading" it per se. for that type of reader, one error on every page is too many, but one error on every tenth page is not. especially since -- if we give them an easy means of checking for errors and reporting them, and then reward readers for finding them -- errors won't persist for very long, and the e-text will instead progress very quickly on its merry way to a state of perfection. in a practical sense, this means that before you turn an e-text loose for download in an all-in-one file, you make it available _page-by-page_ on the web. anyone who might want to read it has to do so in that form. right alongside the text for each page is the image, so the person can easily check any possible errors. you let 'em know you are asking for their help to find mistakes. if they find one, they fill out a form right on the page, and their input is recorded -- wiki-style -- immediately. later readers can either confirm the error, or question it, or make comments. first person to find each error gets a credit in the final e-text. you also give people a viewer-program that allows them to download the appropriate page-image if they suspect an error -- displaying it right there in the viewer-app next to the text -- and which simplifies the process of reporting it if they find one. (by, for instance, filling out an e-mail they can send with a click.)
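The page-by-page reporting scheme described above could be sketched roughly as follows. This is only a minimal illustration, assuming an in-memory log; the names `ErrorReport`, `PageLog`, `report`, `question` and `credits` are hypothetical and do not come from any existing Project Gutenberg or D.P. software.

```python
from dataclasses import dataclass, field


@dataclass
class ErrorReport:
    """One suspected error on one page (all names here are hypothetical)."""
    page: int
    found_by: str
    description: str
    confirmations: list = field(default_factory=list)
    questions: list = field(default_factory=list)


class PageLog:
    """Records reports immediately, keyed by page number and description."""

    def __init__(self):
        self.reports = {}

    def report(self, page, reader, description):
        key = (page, description)
        if key not in self.reports:
            # The first person to report an error is stored as its finder.
            self.reports[key] = ErrorReport(page, reader, description)
        else:
            existing = self.reports[key]
            if reader != existing.found_by and reader not in existing.confirmations:
                # A later reader reporting the same thing counts as a confirmation.
                existing.confirmations.append(reader)
        return self.reports[key]

    def question(self, page, description, reader, comment):
        """Attach a later reader's doubt or comment to an existing report."""
        self.reports[(page, description)].questions.append((reader, comment))

    def credits(self):
        """Names to credit in the final e-text, in order of first report."""
        names = []
        for entry in self.reports.values():
            if entry.found_by not in names:
                names.append(entry.found_by)
        return names
```

A real system would also need persistent storage and a way to show the page image next to the text; both are left out of this sketch.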
The key is that for the aspect of building *trust* in the final product, it is a very good idea to involve the volunteer proofreaders to go over the texts, even if *you don't have to*.
what i just described does a good job of doing that. this is the system of "continuous proofreading" i outlined on this listserve a very long time ago. you recently mistakenly credited it to james linden. my offer to develop this system was largely snubbed. for _that_, the project gutenberg "people in charge" rightly deserve to be criticized. for the tiny stuff that you have been complaining about, they do not...
Having (and proving to anyone who asks) at least two independent people who proofed every page, adds to its trustworthiness.
not nearly as well as putting text and image side-by-side, and allowing any number of "volunteer proofreaders" to examine 'em. you might be surprised by the number of errors that "slip by" the proofreaders through two rounds of eyeballing over at d.p. (indeed, many even slip by the "third round" of post-processing and whitewashing, and sit there big and ugly in the final e-text.) even if a dozen people look at a page, an error might _still_ be there. but with eternal transparency, there is always hope it will be fixed. anyway, jon, i hope you take up the friendly challenge i issued here. and if any d.p. people want to call me on the challenge i made to them, you just let me know. in the meantime, i'll let you get in the last word on this thread, jon, because i _really_ need to be going. use it wisely... ;+) -bowerbird

Bowerbird wrote:
since you don't seem to want to have the o.c.r. done, for understandable reasons, i will do it myself.
Great! Hopefully others here will run it through their favorite OCR program and share the results with you and with gutvol-d. Please!, others, OCR the scans, which are available at: http://www.openreader.org/myantonia/
by the way, i have done some extensive comparisons of the project gutenberg version of "my antonia" and yours. the more deeply i go into it, the more i become convinced most differences are due to intentional edits, and _not_ due to sloppiness in the original preparation of the work.
How do we know? We don't know what source edition was used for PG's version of "My Antonia", but I now believe (but cannot prove until someone does the actual comparison) that the source was the "mangled" British edition, as noted below. So, the way to know for sure is to secure a copy of that "mangled" British edition and do the comparison. (Which I won't do because it is futile because the British edition is itself unacceptable.)
so this appears to be exactly like the "frankenstein" case -- a simple use of a different edition as the source-text.
Yes, and this is why I called the PG version of "My Antonia" "mangled", because it is -- it is based on a mangled British edition whose sloppy editing and printing Willa Cather herself was very unhappy about. She was very "painstaking" with regards to her books -- more than the average author (and she had the status to dictate the editing and typography of her books to her publisher -- most lesser authors didn't have this luxury.) Again, my focus on the problems with the PG collection goes beyond the error rates from some source -- it goes to the general aspects of trust: using the proper (acceptable) editions as source, properly identifying the source, and providing means for easier verification that the etext faithfully conforms to the source (primarily making the scans available, which is now possible -- I agree with you things were tougher a few years ago vis-a-vis providing page scans online.) For example, if NetWorker's analysis is correct (posted to The eBook Community), it now appears that PG's version of "Frankenstein" is based on a 1981 Bantam Classics Edition, which did significant editing of the text (in essence, creating a convenient "fingerprint"), and which NetWorker (who was an attorney at one time, I believe) surmises may border on a copyright infringement (and not just a "sweat of the brow" sort of thing.) Hopefully Bantam will not catch wind of this -- but if they do, they probably won't do anything anyway. Nevertheless, one wonders how many other earlier PG texts, where there's no source information given, were derived from post-1923 emended editions? Could those ebook publishers who today use PG texts be potentially liable because of the lack of source information and a means to verify provenance?
Even if the title page of a Work was photocopied and sent to PG for copyright clearance, how do we know that the person did not then use an easy-to-obtain and available modern edition for the actual scanning -- and simply photocopied the title page from a non-circulating, non-scannable copy of the rarer original edition? I believe most of those individuals who submitted etexts to PG's collection did it faithfully and followed common sense rules and expectations with regards to sources ---> But *how do we know*, and *how can we know*? We can't -- there's no mechanism to verify these things. This is where having the full source information, and having all the page scans of the source and making them available, builds trust in (and protects from copyright infringement claims) the particular etext and the associated collection it belongs to. It is also the morally right thing to do.
in view of the insinuations you cast against the "accuracy" of the project gutenberg e-text, perhaps you should apologize?
Why? The differences in the PG edition of "My Antonia" likely came from a mangled British edition which Willa Cather apparently was upset about. These changes are, in essence, errors. In addition, we have no idea as to what emendations may have been made to the first and subsequent PG etext editions since (until possibly now) we didn't know what edition was used as the original source! You certainly don't have access to the edition used to generate the PG edition of "My Antonia", do you? If not, then *how do you know* it is accurate to some original source edition? We can't talk about what is an error and what is not an error when we don't have the source information, and better yet page scans to immediately verify. That's why Michael Hart's interest in "correcting" the errors in the non-DP portion of the PG corpus is beyond futile and will not build trust in the collection -- how can one reliably correct an etext when the original source is not known/available to consult? It's ludicrous, and a complete waste of time. It's better to redo the etexts via DP where the source info is recorded and page scans are (hopefully) available, as well as having the proofing done by a number of independent proofers, rather than just one person. Multiple, independent proofers add trust to the process, in addition to having the source info and scans available. After all, intentional misspellings are common in many books (e.g., "My Antonia", Mark Twain's books, etc. -- and many pre-19th century books use variant spellings since rigorous spelling was not then an established norm) so how does one know if an "error" is really an error? And there are errors which cannot be caught by simple reading or even programs, such as missing (or added) accented characters, wrong punctuation (such as replacing an em-dash with a colon), and wrong paragraph breaks. (Most of which we see in "My Antonia".) Many of these "not discernible" errors can sometimes tweak the meaning of the etexts.
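Differences of the kinds listed above (accents, punctuation, paragraph breaks) only become visible when an etext is compared against a transcription of a known source. The following is a rough sketch of such a comparison using Python's standard `difflib` and `unicodedata` modules. The function names, the three labels and the pilcrow used as a paragraph marker are assumptions made for illustration, and the sketch presumes that a trusted source transcription already exists.

```python
import difflib
import unicodedata

PARA = "\u00b6"  # pilcrow, used here as a stand-in token for a paragraph break


def strip_accents(text):
    """Drop combining marks so accented and unaccented spellings compare equal."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def letters_only(text):
    """Keep only letters and digits, discarding punctuation and spaces."""
    return "".join(ch for ch in text if ch.isalnum())


def classify(old, new):
    """Give a rough label to one differing pair of word spans."""
    if old.count(PARA) != new.count(PARA):
        return "paragraph"
    if strip_accents(old) == strip_accents(new):
        return "accent"
    if letters_only(old) == letters_only(new):
        return "punctuation"
    return "wording"


def tokenize(text):
    """Split into words, turning each blank line into a paragraph token."""
    return text.replace("\n\n", f" {PARA} ").split()


def word_differences(source_text, etext):
    """Yield (label, source_words, etext_words) for each differing span."""
    a, b = tokenize(source_text), tokenize(etext)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            old, new = " ".join(a[i1:i2]), " ".join(b[j1:j2])
            yield classify(old, new), old, new
```

The labels say only how two texts differ, not which one is right; deciding that still requires the source edition or its page scans.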
We owe readers, even the casual readers, an excellent product with full disclosure. For example, the poll I'm conducting on this topic at The eBook Community indicates (but does not prove -- consider this a preliminary assessment) that a significant percentage of those who read public domain digital texts *prefer* (note carefully this word) the texts they use to come from acceptable, known editions, and be faithful renditions of those editions. This only makes common sense. To dismiss this is essentially saying that the vast majority of people don't give a damn about whether the public domain texts they spend hours and hours of their valuable time reading are reasonably faithful to the original. Does anyone want to make the claim that the vast majority of people (99% as it seems like PG's online info says) don't care one whit? And trying to prove that claim by pointing to the large number of people using PG texts, is not proof since I believe most people have innocent blind faith that PG did things correctly. Furthermore, anyone doing a major effort in delivering the public domain to the public has a moral responsibility to do it correctly and to state in sufficient detail the provenance and any edits of the texts. If it is a heavily emended text, then it should be specified to the public with sufficient detail *in that etext, not elsewhere* so the reader *knows* a text they are reading has been emended (one doesn't have to list the edits item by item, but it should be made clear the text has been substantially edited and to give a general overview of the types of edits done.) I've explained this on TeBC in more detail. This is a *responsibility*, which places restrictions on how PG and similar groups should conduct themselves. This is a serious endeavor: digitally transferring and preserving the public domain. This is not child's play.
It is true that the Public Domain exists for anyone to do anything with it as they see fit, but like any freedom, there are associated responsibilities. Full disclosure is one of them, and is a common sense responsibility. Trying to be faithful in transcribing texts is another one when no disclaimers are given in the texts themselves since people assume the texts they are reading are reasonably faithful to the original.
Jon
participants (2)
- Bowerbird@aol.com
- Jon Noring