Re: [gutvol-d] Interesting message on TeBC from NetWorker (about fixing errors, Frankenstein, etc.)

28 Feb 2005

      Bowerbird wrote:
...
jon said networker said:
...
maybe not.  but having done so, it is _refreshing_ to know that
-- when that's factored in -- only "a handful" of errors surface.
But the bigger issue is not constrained to errors (differences) with
respect to the source text used, as you continue to focus on. The
issues deal with the larger areas of trust, verifiability, proper
digital preservation of the Public Domain, and using acceptable
sources (with proper documentation), not just blindly grabbing
anything off the shelf, as appears to have happened with the PG version
of Frankenstein, which now exposes PG to legal liability.

The lack of proper processes, procedures and guidelines to build the
non-DP portion of the PG library (which comprises about half the
collection and is heavily skewed towards the more classic works), is
leading to serious questions about the integrity and trustworthiness
of the whole PG library (I've discussed this at length the last
couple weeks on The eBook Community.) It can certainly be fixed, but
the fix will require:

1) Redoing most of the non-DP works using DP,

2) Proper selection of sources so they are acceptable, both legally and
   in the judgment of those who know which sources are better to use, and

3) Proper documentation as to source, including making available all
   the original page scans (and not just the title page, which proves
   nothing.)

(Btw, NetWorker presented evidence in his message to indicate PG's
version of Frankenstein was taken from a copyrighted edition, that
itself had a significant number of emendations from the original,
which in essence act like a "fingerprint" as to the pedigree. This is
NOT good. It casts PG's archive in a negative light, and may even lead
to a legal demand by Bantam for PG to remove the current version of
Frankenstein. It also calls into question the provenance of a large
number of other pre-DP texts where there's no source metadata given
and no page scans to prove proper provenance. NetWorker himself is a
former attorney, and he has thoroughly researched copyright law the
last couple years as it relates to ebooks, so Michael and Greg should
seriously sit up and take notice of the problem with PG's version of
Frankenstein, and many of its other texts where an acceptable source
cannot be demonstrated. Even if PG is "right" in a legal sense, that
it could use the 1981 Bantam Classics edition as it *might* have done,
does it want to even fight this in court, or to try to explain it
away to the trusting public?)
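
To illustrate the "fingerprint" idea in general terms, here is a minimal
Python sketch. It is not NetWorker's method, which I am only summarizing
above. It assumes someone has already compiled, for each candidate print
edition, a list of readings distinctive to that edition. The function
name, the edition labels and the sample readings are all hypothetical
placeholders, invented for illustration and not taken from any real
edition of Frankenstein.

```python
def fingerprint_score(etext, readings):
    """Return the fraction of an edition's distinctive readings found in the etext."""
    # Normalize case and whitespace so line breaks do not hide a match.
    normalized = " ".join(etext.lower().split())
    hits = [r for r in readings if " ".join(r.lower().split()) in normalized]
    return len(hits) / len(readings) if readings else 0.0


# Placeholder data, invented for illustration only.
candidate_editions = {
    "edition A": ["the grey morning", "he travelled onward"],
    "edition B": ["the gray morning", "he traveled onward"],
}
etext = "In the gray morning he traveled onward alone."

scores = {name: fingerprint_score(etext, readings)
          for name, readings in candidate_editions.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

A high score for one edition only suggests a pedigree; it does not prove
one, which is why documented sources and page scans still matter.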
...
so once again, in spite of some very big noises, it ends up that
this fails to stand as a good example of an error-ridden e-text.
Well, at least you seem to indicate from your interest in very low
error rate OCR that every etext PG includes in its archive should be a
textually faithful reproduction of some known source. That is, if any
later emendations are made, they should be properly documented.
Otherwise, leave the text as it is in the print source.

Is this your thinking, or do you believe that textual faithfulness
and proper source identification and verification are not necessary
at all? That is, just let people take any text in the PG library and
then "edit it" as they see fit?
...
if you do the scanning properly, manipulate those scans correctly,
use abbyy in the best way, and subject its results to the right tools,
you will reduce the errors in your text to a relatively small number.
I don't believe anyone disagrees with you here in general. But
NetWorker was not only interested in OCR errors, but the bigger
issues as mentioned above -- they are all interlinked.
...
(the number we've been kickin' around is 1 error for every 10 pages,
and at that point, proofreading by the public becomes very viable.)
I doubt this error rate (let's say for even half of the public domain
printings out there) is accomplishable without sentient-level AI. But
if proofreading is to be done anyway by the public, as is *now done*
by DP, what difference is there between an OCR error of one every 10
pages, and one every page?

The key is that for the aspect of building *trust* in the final
product, it is a very good idea to involve the volunteer proofreaders
to go over the texts, even if *you don't have to*. Having at least two
independent people proof every page (and being able to prove that to
anyone who asks) adds to the text's trustworthiness. Include source
metadata and access to the original page scans used as the source, and
the highest level of trust is built (as well as greater immunity to
legal challenge).
That's what makes DP's system so powerful.

But look at PG's edition of Frankenstein:

1) Which original edition it represents is not documented (Mary Shelley
   issued two substantially different editions). I think the reader
   should know which one it is in the PG cataloging information. This
   lack of care about different editions is troubling.

2) The source document is not given at all. I'm not sure if the person
   who did the first etext version is even recorded anywhere (or even
   known.)

   (Btw, this person, should Bantam press the issue, which I hope they
   don't, would probably become a co-defendant. This shows that the
   lack of proper guidelines, processes and verification methods in
   the building of the non-DP portion of PG's collection exposes the
   volunteer donors of texts to potential legal liability! This is
   another demonstration that if a project is to do something, it
   needs to *do it right* from the start, and not just do the
   "ready-fire-aim" approach to everything.)

3) It is unknown what subsequent "edits" were done along the way --
   they are not documented, as far as I know. (How do we know whether
   whole paragraphs were removed or inserted?)

4) It now appears, but is not proven, that the source document was the
   1981 Bantam Classics edition.

This certainly does not give one warm fuzzies as to the trustworthiness
of the non-DP portion of the PG collection.

As a user of PG texts, I find it important, for moral, legal and
aesthetic reasons, that the texts are:

1) textually faithful reproductions of *known* sources,

2) provable as such (include access to the full page scans, and not just
   the title page), and

3) taken from sources that are themselves acceptable to use, both
   legally and in the judgment of those knowledgeable (both professional
   and amateur) about the Work in question. (For Works which were only
   published once and
   never republished by anyone, this last point does not apply provided
   the source is itself Public Domain.)
...
if you then have the rare luxury of evaluating your output against
an existing version of the book -- like a project gutenberg e-text --
with the right tool (which networker obviously does not yet have),
the comparison between the two, alongside the page-images, should
make the process of coming to an error-free version simply a breeze.
There will always be hand work necessary to compare two different
etexts of the same Work (note that oftentimes there are multiple
editions of multiple versions: The Work/Expression/Manifestation (WEM)
principle.)

Even the issue of hyphenation of compound words requires a human
being to ascertain what the author intended. Of course, if this is
not important to you, then what can I say?
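
As a rough sketch of what a comparison tool can and cannot do here, the
following Python uses the standard difflib module to list the word-level
differences between two etexts and to set apart those that differ only
in hyphenation. The function names and the two sample sentences are made
up for illustration. The code only flags the differences; deciding which
reading the author intended is still left to a human being with the page
scans in hand.

```python
import difflib


def word_differences(text_a, text_b):
    """Return (tag, words_a, words_b) for every span where the two texts differ."""
    words_a = text_a.split()
    words_b = text_b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    diffs = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tag, words_a[a1:a2], words_b[b1:b2]))
    return diffs


def is_hyphenation_only(words_a, words_b):
    """True when two spans differ only by hyphens or spacing ('to-day' vs 'today')."""
    def squash(words):
        return "".join(words).replace("-", "")
    return squash(words_a) == squash(words_b)


# Sample sentences invented for illustration only.
etext_1 = "I shall see you to-day upon the ice"
etext_2 = "I shall see you today upon the sea"

diffs = word_differences(etext_1, etext_2)
for tag, a, b in diffs:
    kind = "hyphenation" if is_hyphenation_only(a, b) else "wording"
    print(kind, a, b)
```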
...
since this is _exactly_ what will need to be done _increasingly_,
as the page-images from the internet archive and (we hope) google
-- plus the work done by individual people scanning everywhere --
emerge into cyberspace, that's where my tool-development efforts
are now being focused.  i suggest networker start reading my blog;
it should start being updated on a daily basis starting next week...
Tools such as yours will likely work for some types of texts, and
not work for others, where there'll be a need for human beings to
not only proof for errors, but to properly structure the document.

I'm now assessing the digitizing of records of historical and
genealogical significance, and these documents usually have quite
complex table layouts, very poor quality printing (and oftentimes
handwriting). Scans of these records are insufficient for use, so
having human beings read them and transcribe the information into
properly structured etext form is necessary.

I'll post an announcement to TeBC of your blog if you'd like me to
(although I don't know the address of your blog -- had it and then
lost it.)

Jon

Jon Noring