Interesting message on TeBC from NetWorker (about fixing errors, Frankenstein, etc.)

[I'm forwarding the following message by "NetWorker" posted to The eBook Community. Any followup to the specific points NetWorker raises is probably better posted over there, especially if you'd like NetWorker to see your comments ("ebook-community" at YahooGroups). NetWorker is very thorough... Jon]

NetWorker wrote [a few days prior]:
Project Gutenberg e-texts may yet have a role to play in the production of high-quality e-books. Rising to the bait, I have placed a hold on my local library's one(!) copy of Frankenstein, which I will scan when it arrives. I will try to create a highly structured e-text (certainly not as fine as what Jon did with "My Antonia"). I will then try to find a way to preprocess both the OCR'ed text and the Project Gutenberg e-text so that the two files can be meaningfully "diffed." Hopefully, I can come up with a method that allows existing PG e-texts to serve as an automated "proofread" of next-generation public domain e-books.
Boy, _that_ was an interesting experience!

The project goals: As an e-book consumer, I want an e-book that contains _lots_ of metadata; the more the better. I want the metadata to be patterned, so that I can use automated tools to manage a collection: sorting by author, genre, publication date, publisher, contributors such as editors and illustrators, etc. I also want the actual text to be marked up in such a way that 1) I can view the text with all the presentational richness traditionally associated with a paper book, if I choose to do so, 2) I can convert unambiguously from one markup language to another, and 3) I can do a structural analysis of the book using automated tools. Finally, I want a mechanism for knowing whether apparent errors in the text are due to transcription errors or to the author's intent; this can be accomplished by including source information in the metadata, or by providing access to page scans (both would be preferable). Project Gutenberg e-texts satisfy none of these wishes; to create an e-book which _does_ satisfy them pretty much requires starting from scratch. Scanning technologies are quite advanced these days, but OCR is still not 100% accurate, and automated spell checking can only go so far. Clearly the most time-consuming, and most error-prone, part of producing a reasonably accurate e-book is proofreading by a human being. My goal was to discover whether Project Gutenberg e-texts, which are presumably fairly accurate as to the _words_, if nothing else, could be used as yet another automated preprocessing step to reduce typographical errors to a minimum before the actual proofreading begins.

The process: To test my theory, I decided to use the novel _Frankenstein_, by Mary Wollstonecraft Shelley. _Frankenstein_ is clearly in the public domain, is known to have at least two versions, and has been the subject of a fair amount of discussion on this list in the recent past. I obtained a copy of Frankenstein from the public library; it was published in the "Barnes & Noble Classics" series in 2000. I was fairly pleased with the edition, as it was printed in a rather old-seeming typeface which gave the appearance of being a photo reproduction of a much older text; it seemed likely that it had not gone through much in the way of re-editing to modern conventions. I scanned and OCR'ed the book using ABBYY FineReader, then did a spell-check of the book from within FineReader so I could compare "misspelled" words to the actual scanned image. I then saved the text as an HTML file.

In the past I have written a couple of programs to help in the creation of e-books. TidyeBook is based on the HTML Tidy code base; it fixes some of the inaccurate HTML produced by ABBYY, strips headers and footers but leaves page numbers intact (though invisible), and merges broken paragraphs when it can do so unambiguously. html2txt, based on an earlier C++ version of HTML Tidy, takes an HTML document and reduces it to simple text similar to that used by Project Gutenberg. I ran "frankenstein.html" through TidyeBook to clean up the HTML, then hand-edited the HTML to fix paragraph breaks the automated pass had missed, or which should not have been broken at all. I also fixed those instances where hyphenated words spanned a page break (very easy to do given the output of TidyeBook; see the sketch below). I then generated an Impoverished Text Format version of the HTML text using html2txt.
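For the curious, the de-hyphenation idea is simple enough to sketch. The following stand-alone program is illustrative only, not the actual TidyeBook code; it reads plain text on stdin and rejoins words broken by a trailing hyphen:

    /* dehyphen.cpp -- an illustrative sketch only, not the actual TidyeBook
       code. Rejoins a word hyphenated across a line (or page) break, e.g.
       "labora-" followed by "tory was empty" becomes "laboratory was empty".
       A real pass must also skip the invisible page-number markers TidyeBook
       leaves behind, and take care with true hyphenated compounds that
       happen to break at the hyphen. */
    #include <iostream>
    #include <string>

    int main()
    {
        std::string line, pending;
        while (std::getline(std::cin, line)) {
            if (!pending.empty()) {
                line = pending + line;   // glue the held fragment onto this line
                pending.clear();
            }
            if (!line.empty() && line[line.size() - 1] == '-') {
                // Hold the line, minus its trailing hyphen, until the
                // continuation arrives on the next line.
                pending = line.substr(0, line.size() - 1);
                continue;
            }
            std::cout << line << '\n';
        }
        if (!pending.empty())
            std::cout << pending << '\n';   // input ended on a hyphenated line
        return 0;
    }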
My strategy was to use the GNU "diff" program to detect differences between the simplified version of my work product and the Project Gutenberg version. Because "diff" is line-oriented, I needed to normalize the two texts so there was a greater likelihood that lines would be correctly matched. I did this by writing yet another program (this could probably have been done more efficiently by a Perl or AWK script, but I am not very familiar with scripting languages; as a highly proficient C/C++ programmer, it was easiest for me to use the tools at my disposal). The new program reduces each file to lines of no more than 60 characters (the shorter the line, the easier it is for a human to find a detected difference). Additionally, the program starts a new line whenever it encounters conventionally accepted sentence-ending punctuation (. ! ?) or two newline characters in a row, which signal the beginning of a new paragraph. All whitespace, including runs of multiple whitespace characters, is reduced to a single space. I used the new program to normalize the text produced by html2txt and that of frank14.txt from Project Gutenberg, then compared the two resultant files using GNU diff and Microsoft's WinDiff.
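The heart of such a normalizer is small. My actual program differs in its details; the following minimal sketch just illustrates the rules described above, reading stdin and writing stdout:

    /* normalize.cpp -- a minimal sketch of the normalizer described above
       (my actual program differs in details). Rules: collapse all whitespace
       runs to a single space, start a new line after sentence-ending
       punctuation (. ! ?), start a new line at a paragraph break (two
       newlines in a row), and otherwise wrap at 60 characters. A real pass
       would also look past closing quotation marks after a sentence ender. */
    #include <iostream>
    #include <string>

    int main()
    {
        const std::string::size_type kMax = 60;
        std::string line, word;
        int ch, newlines = 0;

        auto flush = [&] {
            if (!line.empty()) { std::cout << line << '\n'; line.clear(); }
        };

        while ((ch = std::cin.get()) != EOF) {
            if (ch == ' ' || ch == '\t' || ch == '\r' || ch == '\n') {
                if (ch == '\n') ++newlines;
                if (!word.empty()) {
                    if (line.size() + word.size() + 1 > kMax)
                        flush();   // keep lines at or under 60 characters
                                   // (a single overlong word is left intact)
                    line += line.empty() ? word : " " + word;
                    char last = word[word.size() - 1];
                    if (last == '.' || last == '!' || last == '?')
                        flush();   // sentence-ending punctuation
                    word.clear();
                }
            } else {
                if (newlines >= 2)
                    flush();       // two newlines in a row = new paragraph
                newlines = 0;
                word += static_cast<char>(ch);
            }
        }
        if (!word.empty())
            line += line.empty() ? word : " " + word;
        flush();
        return 0;
    }

Run over both the html2txt output and frank14.txt, this gives diff short, identically wrapped units to align.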
The results: I was quite surprised to find literally thousands of differences between the two texts. Most of the differences were changes in punctuation and capitalization. Many em-dashes were converted to semicolons or omitted altogether, and many semicolons were converted to commas. Some words capitalized in my scan (e.g., "Paradise") were converted to lower case ("paradise"). Some phrases were "fixed" ("our uncle Thomas's book" became "our Uncle Thomas' book"; "an European" became "a European"). Some words were Americanized ("tranquillise" became "tranquillize") yet others were not ("favourite" remained "favourite").

In an attempt to discover the source of these differences, I visited a number of not-so-local libraries and checked out a number of different printings of _Frankenstein_. Two of the most interesting are Leonard Wolf's _The Annotated Frankenstein_, Clarkson N. Potter, 1977, which claims that "In order to ensure the authenticity of the text, we arranged with the Library of Congress in Washington, D.C., to microfilm a copy of the first edition. That text has been reproduced in this volume by the photo-offset process," and the Penguin Classics edition, which includes an appendix identifying the differences between the 1818 and 1831 editions (while significant, they are neither as pervasive nor as substantive as has been earlier suggested). Neither of these editions contained the punctuation or spelling changes of the Project Gutenberg edition. One of the books I checked out, rather serendipitously as it turns out, was the Bantam Classic edition, first published in 1981. Of all the editions I consulted, only the Bantam edition contains virtually all of the changes I noted in the Project Gutenberg e-text. The PG edition is apparently based on Mary Shelley's revised 1831 edition (although it has lost both the "Author's Introduction to the Standard Novels Edition (1831)" by Mary Shelley and the "Preface" to the 1818 edition by Percy Shelley). I thus believe that the PG edition is based on the Bantam Classic edition of 1981.

<sidebar> Interestingly, copyright law provides protection to changed versions of public domain texts if those changes are of a nature that they are more than mechanical and provide some modicum of creativity. Clearly, the punctuation changes are not merely mechanical, and in some cases they subtly change the nuances of the text. Ironically, of all the textual bases that Project Gutenberg could have used for its e-text of _Frankenstein_, it chose the one which is apparently still protected by copyright! </sidebar>

I modified my text normalization program to discard all punctuation except hyphens and underscores (and, of course, the sentence-ending punctuation mentioned earlier); a sketch of the filter appears below. This reduced the noise-to-signal ratio enough that the differences started to become meaningful, although it still resulted in at least 500 differences. It allowed me to discover a handful of OCR errors that had been missed by the earlier automated methods, and I have so far also found a handful of errors in the PG text ("But must finish." should be "But I must finish.", "every sight ... seem still to" should be "every sight ... seems still to", "destroy radiant innocence" should be "destroy such radiant innocence", etc.)
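The added filter is conceptually a one-liner. Again, this is an illustrative sketch rather than my exact code:

    /* A sketch of the added pass: discard all punctuation except hyphens,
       underscores, and the sentence-ending marks the normalizer keys on. */
    #include <cctype>
    #include <string>

    std::string strip_punctuation(const std::string& in)
    {
        std::string out;
        for (std::string::size_type i = 0; i < in.size(); ++i) {
            char c = in[i];
            if (std::ispunct(static_cast<unsigned char>(c)) &&
                c != '-' && c != '_' && c != '.' && c != '!' && c != '?')
                continue;   // commas, semicolons, quotes, etc. are dropped
            out += c;
        }
        return out;
    }

Hyphens and underscores are kept, presumably because hyphens join compounds and underscores mark emphasis (_like this_) in the plain text.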
Conclusions: Of course, the goal of this exercise was not to establish the provenance of the Project Gutenberg e-text of _Frankenstein_, nor to discover whether there are errors in the PG e-text, but to determine if there is an automated method of reducing errors in newly scanned e-books for which a Project Gutenberg e-text already exists. I'm afraid the jury is still out on this question. If the texts are as different as the PG edition of _Frankenstein_ and virtually all other editions, the process of sorting through the chaff to find the grain of wheat may not be worthwhile; I believe that the OCR errors discovered so far were blatant enough that they would have been easily discovered in the first proofreading, and I believe that human proofreading will always be required no matter how good our automated tools become.

Some time ago I produced an HTML e-book version of Mark Twain's _Pudd'nhead Wilson_. I believe I will put that e-book through the same process. I will then attempt a new scan of some other, perhaps more obscure, PD work that already has a PG version. With at least three data points in hand, I will report again later.

p.s. -- If someone from Project Gutenberg wants my diff file to update the PG e-text, I will be happy to e-mail it to you; it is approximately 95k in size.