
Bowerbird wrote:
jon noring said:
did you do o.c.r. on it? if you can retrieve the output, that would be good. it would allow people to do research on assessing/improving o.c.r. quality, and assist programmers in developing post-o.c.r. text-cleanup programs.
No, I did not OCR the scans for producing My Antonia (I did experiment with scanning though). But since the scans exist, any OCR package will import them and OCR them. So nothing is "lost". There's no law that says one must OCR them at the same time they are scanned -- they are separate processes and can be decoupled with no loss of anything to anyone at any time. If you need OCRing done, you can probably post a "plea for help" and find someone who has the OCR software packages you'd like to try (I don't have the robust, up-to-date ones -- like ABBYY, which I asked a friend to help out with for my experiments -- I just have the cheapo freebies.) The scans are available online for download, as you know.
(but, from later posts, it looks like you grabbed the text from elsewhere. so what you've done is "blessed" somebody else's work as "trustworthy", presumably after checking it, and maybe correcting it. you could also have done that same thing using project gutenberg's version of the text, since my comparison of the two files shows them to be very similar, so much so that i expect they were indeed based on the same version.)
I won't go into the gory details, but yes, I took two versions and then combined/diffed them. I then did a very thorough comparison page-by-page to the original source page scans, a la DP. We are now in the process of having several people do the same (a page-by-page comparison of the XHTML version to the page scans -- I want at least two people to go over each page) -- anyone reading this, you are welcome to volunteer and help us -- do a few pages just like for DP! The error rate in the XHTML version (still beta) is now very low, and the text can be considered for all practical purposes a very accurate and textually faithful reproduction of the 1918 1st edition. (But then, maybe I'll be surprised and find a serious error in the text.)
In retrospect, this process should have been done via DP instead. But there was a deadline to finish the first beta of the cleaned-up text, so there was no time to have this done at DP. However, I do plan to post a request to the relevant DP forums for final proofing help, as well as to seek help from the DP folk on other matters (such as TEI markup). If DP wishes to go over it in some fashion and incorporate it into their "archive" as well as submit it to PG, that's fine by me (I will not directly contribute the text to PG as I've noted on TeBC.)
Regarding the PG version of My Antonia compared to the 1918 1st edition, there are a *lot* of differences. I regularized both texts and ran 'diff' between them, and found over 200 differences, mostly spelling (the PG version uses mostly British spelling but even here it is strangely inconsistent!), but also oddities in punctuation, wrong paragraph breaks, some missing accented characters, a couple of places with changed wording, a few misspellings, etc. Of course, whenever I encountered a difference, I went to the original page scans of the 1918 1st edition to verify what was done there.
In all 200+ cases, it was the PG version that departed from the original text. I surmise the PG version was derived from the British edition of My Antonia, which is noted to have been mangled in editing (Willa Cather was supposedly furious over the quality of editing in the British edition, which went beyond just using British spelling for words, such as 'colour' instead of 'color'.)
Anyway, when the final proofing is done, I believe the textual error rate will be very low, near zero (but one cannot say it is perfect.) So I think it will be useful for OCR accuracy experiments (which I assume is what you want to perform?) Of course, there's always the issue of compound words that are hyphenated across a line break, figuring out whether the compound keeps its hyphen or not, but that's another matter. I believe we did pretty well on this, with help from the UNL information as well as textual analysis.
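For anyone who wants to try a similar comparison on two transcriptions, here is a minimal sketch in Python of the "regularize, then diff" idea. The regularization rules shown (Unicode normalization, straightening curly quotes, collapsing whitespace, one paragraph per line) and the names `regularize` and `diff_texts` are assumptions chosen for illustration, not the actual rules or tools used for the My Antonia comparison.

```python
import difflib
import re
import unicodedata

# Hypothetical regularization rules, for illustration only.
QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})

def regularize(text):
    """Return the text as a list of paragraphs with incidental differences removed."""
    text = unicodedata.normalize("NFC", text)   # consistent accented characters
    text = text.translate(QUOTES)               # curly quotes -> straight quotes
    paragraphs = re.split(r"\n\s*\n", text.strip())  # blank line = paragraph break
    # Collapse line wraps and runs of spaces inside each paragraph.
    return [" ".join(p.split()) for p in paragraphs if p.strip()]

def diff_texts(a, b):
    """Paragraph-level unified diff of two regularized texts (no context lines)."""
    return list(difflib.unified_diff(
        regularize(a), regularize(b),
        "edition_a", "edition_b", lineterm="", n=0))

if __name__ == "__main__":
    first = "The color of\nthe   sky.\n\nSecond paragraph."
    second = "The colour of the sky.\n\nSecond paragraph."
    for line in diff_texts(first, second):
        print(line)
```

Because each paragraph becomes one line, differences in line wrapping disappear while spelling changes and misplaced paragraph breaks still show up. Every reported difference would still need to be checked by hand against the page scans.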
I'll zip up the 600 dpi 2-color (B&W) scans which have already gone through a clean-up stage (they will be PNG files, and occupy if memory serves me right, about 50 megs of space
those are too big for my purposes, and for me to download.
Oops, sorry. They are pretty large for downloading by modem (but with DSL/Cable they can be downloaded pretty quickly.)
but if i could reimburse you for sending them to me on a cd?
It's on me. In private email send me your address and I'll burn and mail you a disk of the 600 dpi and 120 dpi scans. I do have the original 600 dpi 24-bit color scans (which is overkill -- next time I'll do the raw scans for B&W pages in greyscale), but in PNG they occupy over 5 gigs of disk space! (Don't have a DVD burner yet, otherwise I'd send those, too.)
or the 120-dpi versions would work just fine for my project, the same ones that are on the website, just zipped together.
Unfortunately, since the 120-dpi scans are antialiased greyscale (while the 600 dpi are bitonal), the file sizes are surprisingly not that different. I updated the My Antonia index page to include downloading all the 120-dpi scans in a ZIP file, which is still over 30 megs in size: http://www.openreader.org/myantonia/
Bowerbird, I'll be happy to put up a ZML regularized text version of My Antonia. If I put up plain text, I want the plain text to follow some regularization rules, and ZML is the only game in town actively working with etexts (as far as I know at least -- I do recall two other text regularization schemas, but don't know if the authors are doing anything with them.)
Jon