jon said:
> What type of global corrections were these?
the type that is made easy by my tool. that's all i'll say for now.
> One area is how to handle hyphenation,
> and whether there was a short dash in the compound word
> in the first place before the typesetter hyphenated the word.
as i said, i ignored the issue of hyphenation for the time being.
my tool will give a number of ways to deal with hyphenation,
but the routines haven't been brought into the current version.
but i can give a general overview. end-line hyphenation is removed.
the hyphen in compound words is retained. to tell the difference,
when there is ambiguity, you look at the rest of the text, to see if
the word was handled consistently there. if it was, you match that.
if not, you have more work to do. that's where it gets interesting.
to go any further is to give too much information for here and now.
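that said, the general overview is harmless to sketch.
here's a stripped-down illustration in python of the end-line check
plus the consistency test (code made up for this message,
with made-up names, not the actual routines from my tool):

    import re

    def dehyphenate(text):
        # rejoin words broken across a line-end:  "some-\nthing"
        pattern = re.compile(r"(\w+)-\n(\w+)")

        def fix(match):
            head, tail = match.group(1), match.group(2)
            joined = head + tail            # e.g. "to" + "day"  -> "today"
            hyphened = head + "-" + tail    # e.g. "to-day"
            # consistency check: prefer whichever form the rest of the text uses
            if text.count(joined) >= text.count(hyphened) and text.count(joined) > 0:
                return joined + "\n"
            if text.count(hyphened) > 0:
                return hyphened + "\n"
            return joined + "\n"            # no other occurrence -- needs a closer look

        return pattern.sub(fix, text)

the interesting part -- what to do when neither form shows up
anywhere else in the text -- is the part i'm keeping to myself.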
> Hopefully you used the 600 dpi bitonal which should OCR the best.
i did.
> Antialiasing actually causes problems
> (notwithstanding the much lower resolution.)
right. i first thought the periods misrecognized as commas
were the effect of anti-aliasing, but i used the 600-dpi scans.
so it must be something else causing that problem.
> One thing you could do is to look at the 600 dpi pages at 100% size
> for which the punctuation was not correctly discerned. You probably
> will see some errant pixels that fooled the OCR into thinking
> it was some other punctuation mark than it is.
i didn't care that much, really.
the post-o.c.r. software can
solve the problem well enough.
i mentioned it for the record,
for the sake of full disclosure,
and to see if anybody knew why.
> punctuation is a toughie for OCR to exactly get right,
even if the recognition is admittedly somewhat difficult,
i expect abbyy to correct "mr," to "mr." and "mrs," to "mrs.", for instance.
but even if abbyy doesn't, that's easy for me to program.
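(to show just how easy, a toy version in python --
the title list is off the top of my head, not a real lexicon:)

    import re

    # titles whose trailing period the o.c.r. sometimes misreads as a comma;
    # the list is illustrative, not exhaustive
    TITLES = ("mr", "mrs", "dr", "st")

    def fix_title_commas(text):
        # "Mr," -> "Mr."   (word boundary on the left, case preserved)
        pattern = re.compile(r"\b(" + "|".join(TITLES) + r"),", re.IGNORECASE)
        return pattern.sub(r"\1.", text)

one pass of that and the "mr," class of error is gone,
for the titles in the list anyway.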
> Resolving this usually requires a human being to go over,
> especially for Works from the 18th and 19th century
> where compound words with dashes were much more common
if you want to retain those arcane spellings, it's difficult.
if you wanna update them, the computer does it very easily.
"to-day" and "to-morrow" become "today" and "tomorrow". instantly.
> Sometimes one has to see what the author did
> elsewhere in the text.
is there some reason you think the computer can't do that?
> In a few cases a guess is necessary based on understanding
> what the author did in similar cases in the text.
oh, i see. it takes "understanding".
one of those rare precious human-being things.
well then, i guess there's no way to program it.
> Some of this can be automated. In other cases
> it requires a human being to make a final decision.
> I followed the UNL Cather Edition here.
it's always easier to let other people make the decision, isn't it? ;+)
> Whether or not it is an "unnecessary" distraction,
> it is better to preserve the original text in the master etext version.
well see, jon, that's where i differ with you. and other people do too.
but like i said, as long as it's just one global change away, no big deal.
i see lots of other cases, as well, where you diverge from the paper.
a good many of the quotation-marks are set apart from their words.
you're making editorial decisions whether you acknowledge it or not.
> My thinking is that if someone wants to produce
> a derivative "modern reader" edition of "My Antonia",
> they are welcome to do so and add it to the collection
> because the original faithful rendition is *already* there.
whose "collection" are we talking about here jon?
yours? do you have any intention of adding more "my antonia" editions?
specifically a "derivative modern reader"? if so, i will submit mine.
but surely you don't mean michael hart's project gutenberg collection?
because, according to you anyway, he doesn't have a "faithful" rendition
in his library, not even one, not *already* anyway. just a mangled one.
another difference between your collection and michael's is
you have 1 book in your collection and he has 10-15 thousand
in his collection, depending on who is in charge of defining
how the official counting is tabulated these days, it appears.
whether you like it or not, that's a comment on the philosophies.
> indicating this was more of a typesetter's convention
> rather than something Cather specified.
well that's a convenient dodge, isn't it?
and of course you have no real _evidence_ that this is the case,
do you? so you _really_ should enter each case as it _appears_,
shouldn't you? at least if you want to stick to your philosophy?
> In addition, the UNL Cather Edition closed off all the apostrophe s
> (no spaces), but kept the space for many of " n't" words.
> So here again I followed the UNL Cather Edition.
and that's the difficulty with following an authority, ain't it?
there are often so many, it's hard to know which one to follow!
i know i can't keep up even with the editions of this one book!
so how would a person possibly keep up with tens of thousands!
and before you know it, you're having arguments about _that_!
and not reading the book, or digitizing it, or playing at the park.
and i don't know about you, jon, but i don't think you're being consistent.
you said you were reproducing what is right there in black-and-white
on the page itself, even made high-resolution scans to prove it to us,
and now you're making judgment calls that are easy to spot. and to justify it,
you're quoting some other figure of "authority". that's inconsistent.
but heck, i have to be honest here. even if you _were_ consistent,
and kept all of those quirks from the paper-book that _i_ consider
to be distracting, the first thing i'm gonna do is global-change 'em.
so all that hard work you did was for no good purpose to me.
> Cather wanted the line length to be fairly short,
> so this puts extra pressure on typesetters
> who will either have to extend character spacing
> for a particular line or scrunch it up more than usual,
> depending upon the situation with the rest of the typesetting
> on the page, and whether certain words can be hyphenated or not.
oh! hold it! wait! did i just hear you say what you just said?
i think i did! yes, i'm quite sure i did!
"cather wanted the line length to be fairly short".
wow. you mean author-intent can go to _the_length_of_lines_?
do you realize how significant that is to your philosophy, jon?
it means you will need to respect willa's wishes on the matter.
none of the long lines you might get in a web-browser! no sir!
willa wanted short lines! (is that why the book looks so narrow?)
> You mean accented characters?
if they aren't in the lower-128 of the ascii range ("true ascii"), yes.
> Accented characters are *always* important
> to preserve under all situations.
according to you, maybe. according to me, it depends.
in this case, i say no. that's my prerogative as an editor.
(and i _do_ consider myself an editor, not just a copyist.)
> There's no need anymore, in these days of
> Unicode and the like to stick with 7-bit ASCII.
until unicode works flawlessly on every machine used
by all the people i know, for texts like this that have
only the occasional character outside the lower-128,
where the meaning isn't changed, i'll stick to plain ascii.
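(and in case anybody wonders, the flattening itself is trivial --
a minimal sketch using python's standard unicodedata module;
it just drops the accent marks and keeps the base letters,
which is all a book like this needs:)

    import unicodedata

    def to_plain_ascii(text):
        # strip the accents, keep the base letters:  "Ántonia" -> "Antonia"
        decomposed = unicodedata.normalize("NFKD", text)
        return decomposed.encode("ascii", "ignore").decode("ascii")

and a reader who wants the accents back can run the same kind
of word-list global change in the other direction.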
> I sense that you don't want to
> properly deal with accented characters
first of all, jon, i define what "properly" means for me, you don't.
you can define it for yourself. but i won't let you define it for me.
> I sense that you don't want to
> properly deal with accented characters
> since this poses extra problems with OCRing and proofing,
nope. it's just that i see them as _unnecessary_ to this book.
if a reader thinks it _is_ necessary, make the global-change.
> something you are trying to avoid in your zeal to get everything
> to automagically work. To me, that's going too far in simplifying.
i'm not "simplifying". i'm consciously making a choice to use
something that will work on the broad range of machines out there,
as opposed to something that -- in far too many cases -- fails badly.
it's a pragmatic decision based on real-life knowledge of the actual
infrastructure of machines that exist out here in our real world.
it's the same pragmatic decision that michael made when he crafted
the philosophy guiding the building of this library of 10,000+ e-texts,
in sharp contrast to your philosophy, which has built a 1-book library.
> Preserving accented characters are important.
in some cases, i'd agree with you. in others, not. in this case, not.
> punctuation changes can sometimes subtly affect the meaning.
you know, as a writer, i'd really like to think that's possible.
as a person who uses a lot of commas, i _want_ to believe it.
but i'll be darned if i can think of that many good examples.
if you can, i would _love_ to hear them. and if you can show me
_any_ in "my antonia", any at all, i'd give you extra bonus points.
as it is, though, i just have to resign myself to the position that
o.c.r. punctuation errors are a distraction, but make no difference.
i'll still root them out, due to my sense of professionalism, but
i sure wish it felt _fun_, instead of feeling like _doing_chores_.
and to the extent that i can automate the chores, i'll be _happy_.
> They are hopefully caught by human proofers/readers
> when grammar checkers don't (I do use Word to
> help find both spelling and punctuation errors --
> when they find something, I then manually
> check it in the page scans and the master XML.)
oh, so you _do_ use an assist from your tools at times. that's good.
> They are "sometimes" easy to spot.
> Other times the automatic routines will not catch errors
maybe the automatic routines you are using are just inferior.
use my tool. if it doesn't spot something it should, let me know.
> Usually true, but there are some rare exceptions where
> an abbreviation can be mistaken for an end of a sentence.
not if your routines are as smart as mine are.
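(to give the bare minimum of what "smart" means here,
a toy sentence-splitter in python -- the abbreviation list is
invented for the example, and the real routines do a good deal more:)

    import re

    # abbreviations whose trailing period does not end a sentence -- made-up list
    ABBREVIATIONS = {"mr.", "mrs.", "dr.", "st.", "etc.", "vs."}

    def split_sentences(text):
        sentences, start = [], 0
        for match in re.finditer(r"[.!?][\"']?\s+", text):
            candidate = text[start:match.end()].strip()
            last_word = candidate.split()[-1].lower().strip("\"'")
            if last_word in ABBREVIATIONS:
                continue                      # "mr." and friends -- not a sentence end
            sentences.append(candidate)
            start = match.end()
        if text[start:].strip():
            sentences.append(text[start:].strip())
        return sentences

the real work is in how big and how smart that list gets,
plus the handful of genuinely ambiguous cases.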
> Then there's the ellipsis issue
i'm three-dozen layers deep on some of these issues,
and you want to talk about level 2. i'm not interested.
use my tool. if it doesn't give you the results you want,
let me know.
> This is also true, but as found in "My Antonia",
> there are exceptions to pure nesting, such as
> when a quotation spills over into several paragraphs
> where the intermediate paragraphs are not terminated
> by an end quotation mark (whether single or double.)
is it really your considered opinion that i don't know this?
that i haven't factored it into my thinking _and_ my tools?
maybe you're grandstanding to the lurkers, but my goodness,
jon, do you really think that _they_ are that stupid too?
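(for the record, the shape of that check is no mystery.
a toy version in python -- not the real routine -- that tolerates
the spill-over convention and flags only the leftovers:)

    def flag_unbalanced_quotes(paragraphs):
        """flag paragraphs whose double-quote count is odd and is not explained
        by the spill-over convention (the intermediate paragraphs of a long
        quotation reopen with a quote mark but never close it)."""
        flagged = []
        for i, para in enumerate(paragraphs):
            if para.count('"') % 2 == 0:
                continue                       # balanced -- nothing to question
            next_para = paragraphs[i + 1].lstrip() if i + 1 < len(paragraphs) else ""
            if next_para.startswith('"'):
                continue                       # quotation spills into the next paragraph
            flagged.append(i)                  # odd count, no continuation -- look at it
        return flagged

yes, it will let a genuinely missing close-quote slide whenever
the next paragraph happens to open with one. that's exactly the
ambiguity the deeper routines are built to sort out.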
> Also, apostrophes are sometimes confused with single right quote marks.
ditto.
> With a smart enough grammar and parser,
> the above might be properly parsed and the
blah blah blah. use my tool. if it doesn't figure out your stuff, let me
know.
> But still, real-world texts tend to throw a lot of curve balls
> that are sometimes hard to correctly machine process.
i know how to hit 87 different pitches, from both sides of the plate,
and you're telling me to "watch out for the curve balls". i laugh at you.
> OCR is quite fast. It's making and cleaning up the scans
> which is the human and CPU intensive part.
wait! i thought you said _proofreading_ and _mark-up_
were the steps that take up the most time. didn't you?
or do i have you confused with someone else?
> Well, not all of the pages have been doubly proofed.
> The team is not finished, and I plan to post a plea
> somewhere for more eyeballs to go over it.
have you heard about distributed proofreaders?
might be able to find some people there...
(ok, now you see what it feels like.)
> I would like to receive error reports as well for this text,
i'll tell you the same thing i told michael about project gutenberg:
set up a system for the checking, reporting, correction, and logging
of errors, a system that is transparent to the general public, and
i will be more than happy to report errors to you, and help you out.
otherwise, you waste my time, as i figure someone else can do it.
which, by the way, is what everyone else is thinking.
which is why errors in the texts are not being reported
at anywhere near the frequency they should be reported.
but i've got another message sitting here waiting to be sent
where i discuss that topic in more detail, so i'll stop here now.
> since Brewster wants highly proofed texts
> for some experiments he plans to run similar to yours.
i'll have to ask him about his tests.
> But if I have to use the version you donate to PG, so be it. :^)
probably, yep.
if michael wants it. they say he'll take just about anything...
> I did find one error in my text based on the list you gave. Thanks.
you're welcome. but that's not the one i was talking about. :+)
> I assume you discovered
> the several different paragraph breaks in the PG edition?
nope. i didn't even invoke the routines to examine paragraph-breaks.
i considered doing so, once you said that there were differences,
but decided it was just too inconsequential to even bother with it.
it's another one of those things: i would very much like to see a case
where it made a difference, because i'd love to believe it _could_,
but in the absence of a case (or even an _imaginary_ possibility,
which i confess i can't come up with, not off the top of my head),
i am forced to relegate it to the "too trivial to think about" pile.
as above, i'll make the corrections, but i ain't gonna sweat 'em...
-bowerbird