More on "My Antonia" (was Reply comments on US issues raised by 'orphan works')

13 May 2005

      [Pauline, I explain later on why I have not submitted "My Antonia" to
DP. But if someone from DP here says to submit the scans and the
current XHTML text for final-stage proofing, then I'll do so. I'll
take care of the final XHTML markup. Also, is anyone here interested in
converting the markup to TEI? (It should be relatively straightforward
since I use a structural/semantic markup which is relatively
compatible with TEI markup principles.)]

Bowerbird wrote:
...
Jon wrote:
...
...
I am looking for several proofreaders to finish the proofing of "My
Antonia" by comparison of the present XHTML to the original page
scans. It's been partially proofed already (errata has been
submitted by several people already, including Bowerbird.)
...
i'm sitting on 2 more errors in "my antonia", jon -- one outright
error, and one consistency error -- waiting until you offer a $50
bounty for the final error.    :+)
Don't hold your breath waiting for your moolah. <smile/>

Anyway, I think you mentioned that "consistency" error before. It is
an interpretation of what the typesetter did regarding spacing with
respect to two classes of contractions, and how they are done in the
current XHTML is *correct* based on analyzing the 1926 second edition,
in whose production Willa Cather herself was active.

What you've said before about the need for better errata reporting and
correction mechanisms is interesting: you want errata to be fixed
immediately in Public Domain texts as users find them. Nice to see you
are being consistent with your principles. I guess what you really
meant to say is that "I will gladly submit my errata immediately after
I am paid." <laugh/>

Anyway, the proofers did find a couple errors (and as I noted the text
has not been sufficiently proofed yet), which have been corrected in
the current online text. I don't know if they corrected the errors you
found, but I don't really care. And of course, I won't tell you here
exactly what errors the proofers found, but for $100 I'll be happy to
tell you (but I'll be nice and give you a clue: "didn't" and "liked".)
Otherwise you can download the XHTML document again and run a 'diff'
on it -- happy hunting!
...
i figure at some point, it will become worth that to you to be able
to say with confidence your text is error-free.
What you are implying by this statement is that *you know* what the
error-free text is supposed to be. Interesting that you apparently
missed at least a couple errors. <laugh/>

You did bring up a good point about the line break issue. This leads
to my answer to Pauline's question on why I didn't use DP, and why I'm
hesitant now to submit it to DP:

1) I had a project deadline where I had to get the text reasonably
   cleaned up and online *in a hurry* (with page scans) for demo
   purposes. I made that deadline, but it led me to not do it via DP.
   It was my intention to continue the clean-up process after the
   deadline -- it's only the right thing to do.

   However, for actual mass production of texts, DP or similar process
   would be used -- which I have noted a couple times before.

2) Additionally, because of the deadline, I decided to start with an
   HTML document which was already pretty well-proofed (it was NOT PG's
   version), and did various types of analysis and hand-proofing on it.
   This included a 'diff' against PG's version, which, though mangled,
   provided a reference point on the principle that most of the
   transcribed text in it is accurate. All discrepancies between PG's
   and my version were then compared to the original page scans (in
   most of the discrepancies the error was in PG's text, I might add.)
   If the same errors were in both, then those errors would not be
   caught except by proofing, by running some auto-checks, and maybe by
   creating a third text using OCR and 'diff'ing that with the others,
   which is what I assume Bowerbird did.

3) So, the source I used was already missing the original line breaks,
   another reason that it was probably too late to submit it to DP, since
   I don't believe the text I've done up to now could be efficiently
   injected "midway" into DP's workflow; it would be a waste to have
   DP do it all over again when it is now in very good shape.

   If anyone reading this wants to help, you can proof a few pages
   using the "primitive" system I now have: just print out the pages
   from the page scans you would like to proof, and compare them with
   the online XHTML version, which shows the page numbering and breaks.
   There are other ways to do this proofing by comparison, such as
   opening up two windows in your browser, one showing the page scan
   and the other the online XHTML text.

4) As noted above, I would use a DP process for "mass production", but
   for this particular project did not for the reasons cited above.
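
As a rough illustration of the 'diff' comparison described in item 2,
here is a minimal Python sketch. It assumes both transcriptions are
already available as plain-text strings (markup stripped), and the
function name `discrepancies` is made up for this example; it is not
the actual tooling used on "My Antonia". Comparing word by word rather
than line by line means the missing original line breaks noted in
item 3 do not produce false differences.

```python
import difflib


def discrepancies(text_a, text_b):
    """Return word-level differences between two transcriptions.

    Each item is (word_index_in_a, words_from_a, words_from_b).
    Splitting on whitespace makes the comparison ignore line breaks.
    """
    words_a = text_a.split()
    words_b = text_b.split()
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    found = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            found.append((a1, " ".join(words_a[a1:a2]), " ".join(words_b[b1:b2])))
    return found


if __name__ == "__main__":
    # Hypothetical sample strings, not taken from either real text.
    mine = "I liked the\ncountry"
    other = "I likcd the country"
    for index, left, right in discrepancies(mine, other):
        print(f"word {index}: {left!r} vs {right!r}")
```

Each reported difference still has to be checked by eye against the
page scan, and an error present in both texts will not show up at all.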

"My Antonia" is, and has always been, a demonstration project to
experiment with new ideas, to get my hands "dirty" with the production
process (although I've transcribed a dozen texts before), and to use
it for showing to some people interested in this. Yes, it is just
"one" text out of the 5000+ already done by DP, but "My Antonia" was
not intended to compete with anyone. It is a first small and
methodical step towards meeting several goals and proving some
principles. It is still "beta", too, and hopefully all the ideas I
want to experiment with will be implemented in a couple months (I've
just started to implement what I'd like to implement.) What's been
done so far has nicely assisted with the LibraryCity business plan,
though, and may lead to a parallel business plan to be written by
several of us in a couple months that's a little closer to the public
domain text area.

Btw, Bowerbird, I'm hoping to soon scan my copy of the *original*
Burton edition of the Kama Sutra of Vatsyayana, and the scans will be
submitted to DP as someone there requested. Like "My Antonia", I'm
going to do high quality scans (at 600 dpi) and carefully clean them.
This will be a tough test for your "we-don't-need-DP" approach: very
small font, poor late nineteenth century typesetting, and some poorly
printed pages (and this is one of the better copies of this book!)

Maybe we should do a competition (you seem to love competition!):
After I complete the scans, I'll submit them to both you and to DP.
After the magic is done by you and DP, we'll compare the results
(such as by 'diff'ing them) to look for inconsistencies which will
then be compared to the original page scans. Of course, you will
have to promise not to do late hours hand proofing of the text --
that'd be cheating -- you have to do it all auto-magically as you've
been advocating.

As some person whose name eludes me says, "The proof is in the
pudding." Are you up to the challenge? <smile/> Anyway, I think such
an experiment is good to gauge the current state of OCR and
auto-processing to clean up texts, so it is a test which will benefit
everyone. In addition, the careful screening of this text during the
"competition" (such as by comparing the result of the two approaches)
should lead to a *very low error*, possibly *zero error* digital text
version, which can then be used for quick assessment of the accuracy
of new OCR and auto-post-processing algorithms. I believe the Kama
Sutra scans plus zero error digital text will make a good addition for
a "test suite".
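
To make the "test suite" idea concrete, here is a minimal Python sketch
of one way to score an OCR or auto-post-processing result against a
zero-error reference text. The measure used (character error rate based
on edit distance) and the names `edit_distance` and `char_error_rate`
are assumptions chosen for illustration; no particular scoring method
has been agreed on.

```python
def edit_distance(reference, candidate):
    """Levenshtein distance: minimum single-character insertions,
    deletions, and substitutions turning reference into candidate."""
    previous = list(range(len(candidate) + 1))
    for i, ref_char in enumerate(reference, 1):
        current = [i]
        for j, cand_char in enumerate(candidate, 1):
            cost = 0 if ref_char == cand_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def char_error_rate(reference, candidate):
    """Edit distance divided by reference length, after collapsing all
    whitespace so differing line breaks are not counted as errors."""
    ref = " ".join(reference.split())
    cand = " ".join(candidate.split())
    if not ref:
        raise ValueError("reference text is empty")
    return edit_distance(ref, cand) / len(ref)


if __name__ == "__main__":
    # Hypothetical one-line sample; a real run would go page by page.
    print(char_error_rate("didn't like it", "didn't likc it"))
```

This simple version takes time proportional to the product of the two
text lengths, so it is better run page by page than on a whole book.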

Jon Noring
