More on "My Antonia" (was Reply comments on US issues raised by 'orphan works')

[Pauline, I explain later on why I have not submitted "My Antonia" to DP. But if someone from DP here says to submit the scans and the current XHTML text for final-stage proofing, then I'll do so. I'll take care of the final XHTML markup. Also, is anyone here interested in converting the markup to TEI? (It should be relatively straightforward since I use a structural/semantic markup which is relatively compatible with TEI markup principles.)]

Bowerbird wrote:
Jon wrote:
I am looking for several proofreaders to finish the proofing of "My Antonia" by comparison of the present XHTML to the original page scans. It's been partially proofed already (errata have already been submitted by several people, including Bowerbird.)
i'm sitting on 2 more errors in "my antonia", jon -- one outright error, and one consistency error -- waiting until you offer a $50 bounty for the final error. :+)
Don't hold your breath waiting for your moolah. <smile/>

Anyway, I think you mentioned that "consistency" error before. It is an interpretation of what the typesetter did regarding spacing with respect to two classes of contractions, and how they are done in the current XHTML is *correct* based on analyzing the 1926 second edition, in whose production Willa Cather herself was active.

It is interesting what you've said before about the need for better errata reporting and correction mechanisms, meaning you want errata to be immediately fixed for Public Domain texts as they are found by users. Nice to see you are being consistent with your principles. I guess what you really meant to say is that "I will gladly submit my errata immediately after I am paid." <laugh/>

Anyway, the proofers did find a couple errors (and as I noted the text has not been sufficiently proofed yet), which have been corrected in the current online text. I don't know if they corrected the errors you found, but I don't really care. And of course, I won't tell you here exactly what errors the proofers found, but for $100 I'll be happy to tell you (but I'll be nice and give you a clue: "didn't" and "liked".) Otherwise you can download the XHTML document again and run a 'diff' on it -- happy hunting!
i figure at some point, it will become worth that to you to be able to say with confidence your text is error-free.
What you are implying by this statement is that *you know* what the error-free text is supposed to be. Interesting that you apparently missed at least a couple errors. <laugh/>

You did bring up a good point about the line break issue. This leads to my answer to Pauline's question on why I didn't use DP, and why I'm hesitant now to submit it to DP:

1) I had a project deadline where I had to get the text reasonably cleaned up and online *in a hurry* (with page scans) for demo purposes. I made that deadline, but it led me to not do it via DP. It was my intention to continue the clean-up process after the deadline -- it's only the right thing to do. However, for actual mass production of texts, DP or a similar process would be used -- which I have noted a couple times before.

2) Additionally, because of the deadline, I decided to start with an HTML document which was already pretty well-proofed (it was NOT PG's version), and did various types of analysis and hand-proofing on it. This included a 'diff' against PG's version, which, though mangled, provided a reference point on the principle that most of the transcribed text in it is accurate. All discrepancies between PG's and my version were then compared to the original page scans (most of the discrepancies were with PG's text, I might add.) If the same errors were in both, then those errors would not be caught except by proofing, by running some auto-checks, and maybe by creating a third text using OCR and 'diff'ing that with the others, which is what I assume Bowerbird did.

3) So, the source I used was already missing the original line breaks, another reason that it was probably too late to submit it to DP, since I don't believe the text I've done up to now could be efficiently injected "midway" into DP's workflow; it would be a waste to have DP do it all over again when it is now in very good shape.
If anyone reading this wants to help and proof a few pages, you can do so using the "primitive" system I now have: just print out the pages from the page scans you would like to proof, and compare them with the online XHTML version, which shows the page numbering and breaks. There are other ways to do this proofing by comparison, such as opening up two windows in your browser, one showing the page scan and the other the online XHTML text.

4) As noted above, I would use a DP process for "mass production", but for this particular project did not for the reasons cited above. "My Antonia" is, and has always been, a demonstration project to experiment with new ideas, to get my hands "dirty" with the production process (although I've transcribed a dozen texts before), and to show to some people interested in this. Yes, it is just "one" text compared with the 5000+ already done by DP, but "My Antonia" was not intended to compete with anyone. It is a first small and methodical step towards meeting several goals and proving some principles. It is still "beta", too, and hopefully all the ideas I want to experiment with will be implemented in a couple months (I've just started to implement what I'd like to implement.) What's been done so far has nicely assisted with the LibraryCity business plan, though, and may lead to a parallel business plan to be written by several of us in a couple months that's a little closer to the public domain text area.

Btw, Bowerbird, I'm hoping to soon scan my copy of the *original* Burton edition of the Kama Sutra of Vatsyayana, and the scans will be submitted to DP as someone there requested. As with "My Antonia", I'm going to do high quality scans (at 600 dpi) and carefully clean them. This will be a tough test for your "we-don't-need-DP" approach: very small font, poor late nineteenth century typesetting, and some poorly printed pages (and this is one of the better copies of this book!)
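For anyone who wants to try the 'diff' comparison of two transcriptions described in point 2 above, here is a minimal sketch in Python. It is not the tooling actually used for "My Antonia"; the function names and the sample sentences are made up for illustration. It compares word by word rather than line by line, on the assumption that one or both texts have lost the original line breaks, and it only lists the places where the two versions disagree so that each one can be checked against the page scan.

```python
import difflib
import re


def tokenize(text):
    """Split text into words, ignoring line breaks and runs of whitespace."""
    return re.findall(r"\S+", text)


def find_discrepancies(text_a, text_b):
    """Return (word_index_in_a, words_from_a, words_from_b) for each differing span."""
    words_a = tokenize(text_a)
    words_b = tokenize(text_b)
    # autojunk=False so frequent words are not treated as junk in long texts.
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    discrepancies = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            discrepancies.append((i1, words_a[i1:i2], words_b[j1:j2]))
    return discrepancies


if __name__ == "__main__":
    # Hypothetical sample lines, not quotations from the book.
    version_a = "She did n't like the town,\nbut she liked the prairie."
    version_b = "She didn't like the town, but she liked the prarie."
    for position, from_a, from_b in find_discrepancies(version_a, version_b):
        print(f"word {position}: {from_a!r} vs {from_b!r}")
```

A comparison like this only flags places where the two versions differ. An error present in both will pass silently, which is why a third independent text (for example from fresh OCR) or ordinary proofing against the scans is still needed.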
Maybe we should do a competition (you seem to love competition!): After I complete the scans, I'll submit them to both you and to DP. After the magic is done by you and DP, we'll compare the results (such as by 'diff'ing them) to look for inconsistencies, which will then be compared to the original page scans. Of course, you will have to promise not to do late hours hand proofing of the text -- that'd be cheating -- you have to do it all auto-magically as you've been advocating. As some person whose name eludes me says, "The proof is in the pudding." Are you up to the challenge? <smile/>

Anyway, I think such an experiment is good to gauge the current state of OCR and auto-processing to clean up texts, so it is a test which will benefit everyone. In addition, the careful screening of this text during the "competition" (such as by comparing the results of the two approaches) should lead to a *very low error*, possibly *zero error* digital text version, which can then be used for quick assessment of the accuracy of new OCR and auto-post-processing algorithms. I believe the Kama Sutra scans plus zero error digital text will make a good addition for a "test suite".

Jon Noring