[I'm forwarding the following message by "NetWorker" posted to The
eBook Community. Any followups to the specific points NetWorker raises
are probably better posted over there, especially if you'd like
NetWorker to see your comments ("ebook-community" at YahooGroups).
NetWorker is very thorough... Jon]
NetWorker wrote [a few days prior]:
> Project Gutenberg e-texts may yet have a role to play in the
> production of high-quality e-books. Rising to the bait, I have placed
> a hold on my local library's one(!) copy of Frankenstein, which I
> will scan when it arrives. I will try to create a highly structured
> e-text (certainly not as fine as what Jon did with "My Antonia"). I
> will then try to find a way to preprocess both the OCR'ed text and
> the Project Gutenberg e-text in such a way that the two files can be
> meaningfully "diffed." Hopefully, I can come up with a method that
> will allow existing PG e-texts to be an automated "proofread" of next
> generation Public Domain e-books.
Boy, _that_ was an interesting experience!
The project goals:
As an e-book consumer, I want an e-book that contains _lots_ of
metadata; the more the better. I want the metadata to be patterned, so
that I can use automated tools to manage a collection: sorting by
author, genre, publication date, publisher, contributors such as
editors and illustrators, etc. I also want the actual text to be
marked up in such a way that 1) I can view the text with all the
presentational richness traditionally associated with a paper book, if
I choose to do so, 2) I can convert unambiguously from one markup
language to another, and 3) I can do a structural analysis of the book
using automated tools. I also want a mechanism for knowing whether
apparent errors in the text are transcription errors or the author's
intent -- this can be accomplished by including source information in
the metadata or by providing access to page scans; ideally both.
Project Gutenberg e-texts satisfy none of these wishes; to create an
e-book which _does_ satisfy them pretty much requires starting from
scratch. Scanning technologies are quite advanced these days, but OCR is
still not 100% accurate, and automated spell checking can only go so
far. Clearly the most time-consuming -- and most error-prone -- part of
producing a reasonably accurate e-book is proof-reading by a human
being. My goal was to discover whether Project Gutenberg e-texts,
which are presumably fairly accurate as to the _words_, if nothing
else, could be used in yet another automated preprocessing step to
reduce typographical errors to a minimum before the actual
proof-reading begins.
The process:
To test my theory, I decided to use the novel _Frankenstein_, by Mary
Wollstonecraft Shelley. _Frankenstein_ is clearly in the public domain,
is known to have at least two versions, and has been the subject of a
fair amount of discussion on this list in the recent past.
I obtained a copy of Frankenstein from the public library; it was
published in the "Barnes & Noble Classics" series in 2000. I was fairly
pleased with the edition, as it was printed in a rather old-seeming
typeface which gave the appearance of being a photo reproduction of a
much older text; it seemed likely that it had not gone through much in
the way of re-editing to modern conventions.
I scanned and OCR'ed the book using ABBYY FineReader. I then did a
spell-check of the book from within FineReader so I could compare
"misspelled" words to the actual scanned image. I then saved the text
as an HTML file.
In the past I have written a couple of programs to help in the
creation of e-books. TidyeBook is based on the HTML Tidy code base. It
fixes some of the inaccurate HTML produced by ABBYY, strips headers
and footers (leaving page numbers intact, though invisible), and
merges broken paragraphs when it can do so without question (see the
sketch below). html2txt, based on an earlier C++ version of HTML Tidy,
takes an HTML document and reduces it to simple text similar to that
used by Project Gutenberg.
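For the curious, the "without question" test is essentially the
following (a simplified C++ sketch, not the actual TidyeBook code; the
real thing also has to cope with page numbers and the like):

    #include <cctype>
    #include <string>

    // A paragraph break between 'prev' and 'next' is almost certainly
    // spurious when 'prev' does not end with sentence-ending
    // punctuation and 'next' begins in lower case.
    bool shouldMerge(const std::string& prev, const std::string& next)
    {
        if (prev.empty() || next.empty())
            return false;
        char last = prev[prev.size() - 1];
        if (last == '.' || last == '!' || last == '?')
            return false;  // looks like a real paragraph break
        return std::islower((unsigned char)next[0]) != 0;
    }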
Next I ran "frankenstein.html" through TidyeBook to clean up the HTML.
I then hand-edited the HTML to fix paragraph breaks that the automated
process had missed, or that should not have been broken in the first
place. I also fixed those instances where hyphenated words spanned a
page break (very easy to do given the output of TidyeBook). I then
generated an Impoverished Text Format version of the HTML text using
html2txt.
My strategy was to use the GNU "diff" program to detect differences
between the simplified version of my work product and the Project
Gutenberg version. Because "diff" is line-oriented, I needed to
normalize the two texts so there was a greater likelihood that lines
would be correctly matched. I did this by writing yet another program
(this could probably have been done more efficiently by a Perl or AWK
script, but I am not very familiar with scripting languages; as a
highly proficient C/C++ programmer, it was easiest for me to use the
tools at my disposal). The new program reduces each file to lines of
no more than 60 characters (the shorter the line, the easier it is for
a human to find a detected difference). Additionally, the program
starts a new line whenever it encounters conventional sentence-ending
punctuation (. ! ?) or two newline characters in a row, which signal
the beginning of a new paragraph. All whitespace, including runs of
multiple whitespace characters, is reduced to a single space.
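The core of the normalizer is a simple character filter; in outline it
looks like this (a simplified sketch of my program, reading stdin and
writing stdout, with the real code's file handling and edge cases
omitted):

    #include <cctype>
    #include <iostream>

    // Collapse whitespace runs to a single space, start a new line
    // after sentence-ending punctuation or at a blank line (paragraph
    // boundary), and wrap lines at roughly 60 characters.
    int main()
    {
        int col = 0;               // characters on the current line
        bool pendingSpace = false; // whitespace seen since last word
        int newlines = 0;          // consecutive newlines seen
        char c;
        while (std::cin.get(c)) {
            if (std::isspace((unsigned char)c)) {
                if (c == '\n' && ++newlines == 2) {
                    std::cout << '\n'; // blank line: paragraph break
                    col = 0;
                    pendingSpace = false;
                } else {
                    pendingSpace = (col > 0);
                }
                continue;
            }
            newlines = 0;
            if (pendingSpace) {
                if (col >= 60) { std::cout << '\n'; col = 0; }
                else           { std::cout << ' ';  ++col;   }
                pendingSpace = false;
            }
            std::cout << c;
            ++col;
            if (c == '.' || c == '!' || c == '?') {
                std::cout << '\n'; // sentence end: new line
                col = 0;
                pendingSpace = false;
            }
        }
        return 0;
    }

Run both texts through this filter and the subsequent "diff" produces
short, sentence-sized hunks that a human can actually scan.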
I used the new program to normalize the text produced by html2txt and
the frank14.txt file from Project Gutenberg. I then compared the two
resulting files using GNU diff and Microsoft's WinDiff.
The results:
I was quite surprised to find literally thousands of differences
between the two texts. Most of the differences were changes in
punctuation and capitalization. Many em-dashes were converted to
semicolons or omitted altogether, and many semicolons were converted
to commas. Some words capitalized in my scan (e.g., Paradise) were
converted to lower case (paradise). Some phrases were "fixed" ("our
uncle Thomas's book" became "our Uncle Thomas' book"; "an European"
became "a European"). Some words were Americanized ("tranquillise"
became "tranquillize"), yet others were not ("favourite" remained
"favourite").
In an attempt to discover the source of these differences, I visited a
number of not-so-local libraries, and checked out a number of different
printings of _Frankenstein_. Two of the most interesting are Leonard
Wolf's _The Annotated Frankenstein_, Clarkson N. Potter, 1977, which
claims that "In order to ensure the authenticity of the text, we
arranged with the Library of Congress in Washington, D.C., to microfilm
a copy of the first edition. That text has been reproduced in this
volume by the photo-offset process," and the Penguin Classics edition
which includes an appendix identifying the differences between the 1818
and 1831 editions (while significant, they are neither as pervasive
nor as substantive as has been suggested earlier).
Neither of these editions contained the punctuation or spelling changes
of the Project Gutenberg edition. One of the books I checked out, rather
serendipitously as it turns out, was the Bantam Classic edition, which
was first published in 1981. Of all the editions I consulted, only the
Bantam edition contains virtually all of the changes I noted in the
Project Gutenberg e-text. The PG edition is apparently based on Mary
Shelley's revised 1831 edition (although it has lost both the
"Author's Introduction to the Standard Novels Edition (1831)" by Mary
Shelley and the "Preface" to the 1818 edition by Percy Shelley). I
thus believe that the PG edition is based on the Bantam Classic
edition of 1981.
<sidebar>
Interestingly, copyright law provides protection to changed versions
of public domain texts if those changes are more than mechanical and
provide some modicum of creativity. Clearly, the punctuation changes
are not merely mechanical, and in some cases they subtly change the
nuances of the text. Ironically, of all the textual bases that Project
Gutenberg could have used for its e-text of _Frankenstein_, it chose
the one which is apparently still protected by copyright!
</sidebar>
I modified my text normalization program to discard all punctuation
except hyphens and underscores (and, of course, the sentence-ending
punctuation mentioned earlier); a sketch of this pass appears below.
This reduced the noise-to-signal ratio enough that the differences
started to become meaningful, although it still resulted in at least
500 differences. It allowed me to discover a handful of OCR errors
that had been missed by the earlier automated methods, and so far I
have also found a handful of errors in the PG text ("But must
finish." should be "But I must finish.", "every sight ... seem still
to" should be "every sight ... seems still to", "destroy radiant
innocence" should be "destroy such radiant innocence", etc.).
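The modification amounts to one more filtering pass in front of the
normalizer; schematically (again a simplified sketch, not the
production code):

    #include <cctype>
    #include <string>

    // Keep letters, digits, whitespace, hyphens, underscores, and the
    // sentence-ending punctuation (which still drives line breaking);
    // drop all other punctuation before normalizing.
    std::string stripPunctuation(const std::string& in)
    {
        std::string out;
        for (std::string::size_type i = 0; i < in.size(); ++i) {
            char c = in[i];
            if (std::isalnum((unsigned char)c) ||
                std::isspace((unsigned char)c) ||
                c == '-' || c == '_' ||
                c == '.' || c == '!' || c == '?')
                out += c;
        }
        return out;
    }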
Conclusions:
Of course, the goal of this exercise was not to establish the
provenance of the Project Gutenberg e-text of _Frankenstein_, nor to
discover if there are any errors in the PG e-text, but to determine
whether there is an automated method of reducing errors in newly
scanned e-books for which a Project Gutenberg e-text already exists.
I'm afraid the jury is still out on this question. If the texts are as
different as the PG edition of _Frankenstein_ and virtually all other
editions, the process of sorting through the chaff to find the grains
of wheat may not be worthwhile; I believe that the OCR errors
discovered so far were blatant enough that they would have been easily
caught in the first proof-reading, and I believe that human
proof-reading will always be required no matter how good our automated
tools become.
Some time ago I produced an HTML e-book version of Mark Twain's
_Pudd'nhead Wilson_. I believe I will put that e-book through the same
process. I will then attempt a new scan of some other, perhaps more
obscure, PD work that already has a PG version. With at least three
data points in hand, I will report back later.
P.S. -- If someone from Project Gutenberg wants my diff file to update
the PG e-text, I will be happy to e-mail it to you; it is
approximately 95K in size.