Interesting message on TeBC from NetWorker (about fixing errors, Frankenstein, etc.)

[I'm forwarding the following message by "NetWorker" posted to The eBook Community. Any followup to the specific points NetWorker raises is probably better posted over there, especially if you'd like NetWorker to see your comments ("ebook-community" at YahooGroups). NetWorker is very thorough... Jon]

NetWorker wrote [a few days prior]:
Project Gutenberg e-texts may yet have a role to play in the production of high-quality e-books. Rising to the bait, I have placed a hold on my local library's one(!) copy of Frankenstein, which I will scan when it arrives. I will try to create a highly structured e-text (certainly not as fine as what Jon did with "My Antonia"). I will then try to find a way to preprocess both the OCR'ed text and the Project Gutenberg e-text so that the two files can be meaningfully "diffed." Hopefully, I can come up with a method that allows existing PG e-texts to serve as an automated "proofread" of next-generation public domain e-books.
Boy, _that_ was an interesting experience!

The project goals: As an e-book consumer, I want an e-book that contains _lots_ of metadata; the more the better. I want the metadata to be patterned, so that I can use automated tools to manage a collection: sorting by author, genre, publication date, publisher, contributors such as editors and illustrators, etc. I also want the actual text to be marked up in such a way that 1) I can view the text with all the presentational richness traditionally associated with a paper book, if I choose to do so, 2) I can convert unambiguously from one markup language to another, and 3) I can do a structural analysis of the book using automated tools. Finally, I want a mechanism for knowing whether apparent errors in the text are due to transcription errors or to the author's intent; this can be accomplished by including source information in the metadata, or by providing access to page scans (both would be preferable). Project Gutenberg e-texts satisfy none of these wishes; to create an e-book which _does_ satisfy them pretty much requires starting from scratch. Scanning technologies are quite advanced these days, but OCR is still not 100% accurate, and automated spell checking can only go so far. Clearly the most time-consuming, and most error-prone, part of producing a reasonably accurate e-book is proofreading by a human being. My goal was to discover whether Project Gutenberg e-texts, which are presumably fairly accurate as to the _words_, if nothing else, could be used as yet another automated preprocessing step to reduce typographical errors to a minimum before the actual proofreading begins.

The process: To test my theory, I decided to use the novel _Frankenstein_, by Mary Wollstonecraft Shelley. _Frankenstein_ is clearly in the public domain, is known to have at least two versions, and has been the subject of a fair amount of discussion on this list in the recent past. I obtained a copy of Frankenstein from the public library; it was published in the "Barnes & Noble Classics" series in 2000. I was fairly pleased with the edition, as it was printed in a rather old-seeming typeface which gave the appearance of being a photo reproduction of a much older text; it seemed likely that it had not gone through much in the way of re-editing to modern conventions. I scanned and OCR'ed the book using ABBYY FineReader, then did a spell-check of the book from within FineReader so I could compare "misspelled" words to the actual scanned image. I then saved the text as an HTML file.

In the past I have written a couple of programs to help in the creation of e-books. TidyeBook is based on the HTML Tidy code base; it fixes some of the inaccurate HTML produced by ABBYY, strips headers and footers but leaves page numbers intact (though invisible), and merges broken paragraphs when it can do so unambiguously. html2txt, based on an earlier C++ version of HTML Tidy, takes an HTML document and reduces it to simple text similar to that used by Project Gutenberg. I ran "frankenstein.html" through TidyeBook to clean up the HTML, then hand-edited the HTML to fix paragraph breaks the automated pass had missed, or which should not have been broken at all. I also fixed those instances where hyphenated words spanned a page break (very easy to do given the output of TidyeBook; see the sketch below). I then generated an Impoverished Text Format version of the HTML text using html2txt.
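For the curious, the de-hyphenation idea is simple enough to sketch. The following stand-alone program is illustrative only, not the actual TidyeBook code; it reads plain text on stdin and rejoins words broken by a trailing hyphen:

    /* dehyphen.cpp -- an illustrative sketch only, not the actual TidyeBook
       code. Rejoins a word hyphenated across a line (or page) break, e.g.
       "labora-" followed by "tory was empty" becomes "laboratory was empty".
       A real pass must also skip the invisible page-number markers TidyeBook
       leaves behind, and take care with true hyphenated compounds that
       happen to break at the hyphen. */
    #include <iostream>
    #include <string>

    int main()
    {
        std::string line, pending;
        while (std::getline(std::cin, line)) {
            if (!pending.empty()) {
                line = pending + line;   // glue the held fragment onto this line
                pending.clear();
            }
            if (!line.empty() && line[line.size() - 1] == '-') {
                // Hold the line, minus its trailing hyphen, until the
                // continuation arrives on the next line.
                pending = line.substr(0, line.size() - 1);
                continue;
            }
            std::cout << line << '\n';
        }
        if (!pending.empty())
            std::cout << pending << '\n';   // input ended on a hyphenated line
        return 0;
    }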
My strategy was to use the GNU "diff" program to detect differences between the simplified version of my work product and the Project Gutenberg version. Because "diff" is line-oriented, I needed to normalize the two texts so there was a greater likelihood that lines would be correctly matched. I did this by writing yet another program (this could probably have been done more efficiently by a Perl or AWK script, but I am not very familiar with scripting languages; as a highly proficient C/C++ programmer, it was easiest for me to use the tools at my disposal). The new program reduces each file to lines of no more than 60 characters (the shorter the line, the easier it is for a human to find a detected difference). Additionally, the program starts a new line whenever it encounters conventionally accepted sentence-ending punctuation (. ! ?) or two newline characters in a row, which signal the beginning of a new paragraph. All whitespace, including runs of multiple whitespace characters, is reduced to a single space. I used the new program to normalize the text produced by html2txt and that of frank14.txt from Project Gutenberg, then compared the two resultant files using GNU diff and Microsoft's WinDiff.
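The heart of such a normalizer is small. My actual program differs in its details; the following minimal sketch just illustrates the rules described above, reading stdin and writing stdout:

    /* normalize.cpp -- a minimal sketch of the normalizer described above
       (my actual program differs in details). Rules: collapse all whitespace
       runs to a single space, start a new line after sentence-ending
       punctuation (. ! ?), start a new line at a paragraph break (two
       newlines in a row), and otherwise wrap at 60 characters. A real pass
       would also look past closing quotation marks after a sentence ender. */
    #include <iostream>
    #include <string>

    int main()
    {
        const std::string::size_type kMax = 60;
        std::string line, word;
        int ch, newlines = 0;

        auto flush = [&] {
            if (!line.empty()) { std::cout << line << '\n'; line.clear(); }
        };

        while ((ch = std::cin.get()) != EOF) {
            if (ch == ' ' || ch == '\t' || ch == '\r' || ch == '\n') {
                if (ch == '\n') ++newlines;
                if (!word.empty()) {
                    if (line.size() + word.size() + 1 > kMax)
                        flush();   // keep lines at or under 60 characters
                                   // (a single overlong word is left intact)
                    line += line.empty() ? word : " " + word;
                    char last = word[word.size() - 1];
                    if (last == '.' || last == '!' || last == '?')
                        flush();   // sentence-ending punctuation
                    word.clear();
                }
            } else {
                if (newlines >= 2)
                    flush();       // two newlines in a row = new paragraph
                newlines = 0;
                word += static_cast<char>(ch);
            }
        }
        if (!word.empty())
            line += line.empty() ? word : " " + word;
        flush();
        return 0;
    }

Run over both the html2txt output and frank14.txt, this gives diff short, identically wrapped units to align.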
The results: I was quite surprised to find literally thousands of differences between the two texts. Most of the differences were changes in punctuation and capitalization. Many em-dashes were converted to semicolons or omitted altogether, and many semicolons were converted to commas. Some words capitalized in my scan (e.g., "Paradise") were converted to lower case ("paradise"). Some phrases were "fixed" ("our uncle Thomas's book" became "our Uncle Thomas' book"; "an European" became "a European"). Some words were Americanized ("tranquillise" became "tranquillize") yet others were not ("favourite" remained "favourite").

In an attempt to discover the source of these differences, I visited a number of not-so-local libraries and checked out a number of different printings of _Frankenstein_. Two of the most interesting are Leonard Wolf's _The Annotated Frankenstein_, Clarkson N. Potter, 1977, which claims that "In order to ensure the authenticity of the text, we arranged with the Library of Congress in Washington, D.C., to microfilm a copy of the first edition. That text has been reproduced in this volume by the photo-offset process," and the Penguin Classics edition, which includes an appendix identifying the differences between the 1818 and 1831 editions (while significant, they are neither as pervasive nor as substantive as has been earlier suggested). Neither of these editions contained the punctuation or spelling changes of the Project Gutenberg edition. One of the books I checked out, rather serendipitously as it turns out, was the Bantam Classic edition, first published in 1981. Of all the editions I consulted, only the Bantam edition contains virtually all of the changes I noted in the Project Gutenberg e-text. The PG edition is apparently based on Mary Shelley's revised 1831 edition (although it has lost both the "Author's Introduction to the Standard Novels Edition (1831)" by Mary Shelley and the "Preface" to the 1818 edition by Percy Shelley). I thus believe that the PG edition is based on the Bantam Classic edition of 1981.

<sidebar> Interestingly, copyright law provides protection to changed versions of public domain texts if those changes are of a nature that they are more than mechanical and provide some modicum of creativity. Clearly, the punctuation changes are not merely mechanical, and in some cases they subtly change the nuances of the text. Ironically, of all the textual bases that Project Gutenberg could have used for its e-text of _Frankenstein_, it chose the one which is apparently still protected by copyright! </sidebar>

I modified my text normalization program to discard all punctuation except hyphens and underscores (and, of course, the sentence-ending punctuation mentioned earlier); a sketch of the filter appears below. This reduced the noise-to-signal ratio enough that the differences started to become meaningful, although it still resulted in at least 500 differences. It allowed me to discover a handful of OCR errors that had been missed by the earlier automated methods, and I have so far also found a handful of errors in the PG text ("But must finish." should be "But I must finish.", "every sight ... seem still to" should be "every sight ... seems still to", "destroy radiant innocence" should be "destroy such radiant innocence", etc.)
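The added filter is conceptually a one-liner. Again, this is an illustrative sketch rather than my exact code:

    /* A sketch of the added pass: discard all punctuation except hyphens,
       underscores, and the sentence-ending marks the normalizer keys on. */
    #include <cctype>
    #include <string>

    std::string strip_punctuation(const std::string& in)
    {
        std::string out;
        for (std::string::size_type i = 0; i < in.size(); ++i) {
            char c = in[i];
            if (std::ispunct(static_cast<unsigned char>(c)) &&
                c != '-' && c != '_' && c != '.' && c != '!' && c != '?')
                continue;   // commas, semicolons, quotes, etc. are dropped
            out += c;
        }
        return out;
    }

Hyphens and underscores are kept, presumably because hyphens join compounds and underscores mark emphasis (_like this_) in the plain text.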
Conclusions: Of course, the goal of this exercise was not to establish the provenance of the Project Gutenberg e-text of _Frankenstein_, nor to discover whether there are errors in the PG e-text, but to determine if there is an automated method of reducing errors in newly scanned e-books for which a Project Gutenberg e-text already exists. I'm afraid the jury is still out on this question. If the texts are as different as the PG edition of _Frankenstein_ and virtually all other editions, the process of sorting through the chaff to find the grain of wheat may not be worthwhile; I believe that the OCR errors discovered so far were blatant enough that they would have been easily discovered in the first proofreading, and I believe that human proofreading will always be required no matter how good our automated tools become.

Some time ago I produced an HTML e-book version of Mark Twain's _Pudd'nhead Wilson_. I believe I will put that e-book through the same process. I will then attempt a new scan of some other, perhaps more obscure, PD work that already has a PG version. With at least three data points in hand, I will report again later.

p.s. -- If someone from Project Gutenberg wants my diff file to update the PG e-text, I will be happy to e-mail it to you; it is approximately 95k in size.