roger said:
> my sampling of the kind of errors
> left behind by BB's process as far as it went.
thanks for the fuller exposition, roger...
just as a note, before we begin, i'd like to
issue one minor note of protest that these
errors are labeled as "bb:", when, actually,
they were errors that were in _the_o.c.r._...
it's not as if i _introduced_ these errors...
it's just that my preprocessing failed to
find and fix them. there's a difference...
i don't have a suggestion for an alternative,
and i see the reason for that nomenclature,
just want people to grok the bias in the label.
***
on a continued note...
it's also the case that my preprocessing _did_
find-and-fix some 250 errors in the o.c.r., so
i think the fact that it missed 21 is acceptable,
as a first-pass. it can probably be improved,
too, but i believe a 92% reduction in errors is
nothing to sneeze at. even if i missed _100_, that
would still be about a 70% reduction in o.c.r. errors.
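as a quick check of that arithmetic, here is a small python sketch. the function name is made up for illustration, and the counts are the approximate ones given above (some 250 fixed, and either 21 or 100 missed).

```python
def reduction_percent(fixed, missed):
    """Share of the original o.c.r. errors that preprocessing removed, in percent."""
    total = fixed + missed  # errors present in the raw o.c.r.
    return 100.0 * fixed / total


print(round(reduction_percent(250, 21)))   # 250 fixed out of 271
print(round(reduction_percent(250, 100)))  # 250 fixed out of 350
```

with these counts the two cases come out near 92% and 71%, so the round figures quoted above hold up.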
***
now that we're done with the minor protest...
for the folks who aren't paying full attention,
the whole point behind my argument is that
a minimum of effort can fix a _lot_ of errors.
i have never claimed it can find _all_ of them
-- indeed, that would be a form of magic --
or that it's a _sufficient_ mechanism, by itself.
this "minimum of effort" absolutely _needs_
to be followed by a beta-reading component.
although i have never failed to stress that need,
sometimes people seem to forget it completely.
please remember to consider the full picture...
once you do, you'll understand my rejoinder:
my beta-readers woulda caught these errors.
(just like roger's smoothreader caught them.)
having said that, it is of some concern to me
that this seems to change the error-rate so
it becomes worse than 1-error-per-10-pages.
gonna have to see what i can do about that...
> With the addition of a good smoothreader,
> many of these diffs would have disappeared
right. (well, i'd say _all_ of them, but i'll settle
for "many of them" if you insist that it's true,
as we're talking about imaginary beta-readers,
in my case, so i can't be certain of their skills,
but if you say they're not perfect, i won't argue.)
these results are from a midpoint in the workflow.
while all the errors you reported are very real
-- or almost all of them, anyway -- they are
from classes that i have oft-acknowledged as
ones that are missed in my _preprocessing_...
there's a good reason i use that particular word
-- _preprocessing_ -- because it _emphasizes_
the point that this doesn't create a final product.
let's look at those acknowledged weakspots:
> stealth scannos
> missed italics
> splotches
> spelling discrepancies
"stealth scannos" are something my preprocessing
will never catch. that's why it needs beta-readers.
"missed italics" will depend on abbyy's .rtf quality.
if you don't use abbyy, you'll do all of it manually.
and if you suck at it, like me, the results will suck.
it's not that i cannot imagine coding a routine
that could check for stealth scannos, or italics.
i could develop some ideas and take a stab at it.
i just wouldn't expect that i'd be very successful,
in cost-benefit terms. you can do lotsa checks for
stealth scannos, but you get too many false alarms.
and my whole point is that you weigh cost against benefit.
it's easy -- and efficient -- for beta-readers to
catch stealth scannos, and even missed italics...
so let them do that job! that's the wisest course.
"splotches" are something else that's hard to catch,
sometimes even impossible. i'll look at these errors,
and see if they are the type of thing that _could_ be
detected with some programmed routines. but still,
it's obvious that some can't, which is -- yet again --
why you need to have beta-readers in the workflow.
as for "spelling discrepancies", i do a spellcheck,
but if both forms pass, then i let both of them go.
and i didn't do consistency checking on this book.
i _will_ make it part of my process, when i finally
formalize it, but i didn't do it on this book, nope.
***
so this wasn't a fair fight. roger used a smoothreader,
and i didn't. so _of_course_ his results will be better...
roger might've even done a word-by-word proofing,
similar to the kind done at distributed proofreading.
that can give even _better_ results. sometimes not,
but sometimes it _will_, there's no question about it.
where the question comes in is whether it's worth it,
in terms of the time and energy that it takes people.
go take a look at a project page over at d.p., and see
how much time it takes the proofers to step through
the pages of a round. you can see a page being saved
every minute, or two, or three, so a 256-page book
-- like "betty lee" -- can take about 600 minutes to
go through a single round. that's 10 hours of work!
for one round! is this a wise use of time and energy?
i sure don't think so. when a beta-reader does a book,
at least they get the pleasure of actually _reading_ it...
those proofers over at d.p. don't even get that benefit.
i don't think the time and energy that they _volunteer_
is being used in the best possible cost-benefit manner.
and _that_ is what this discussion should be about...
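the d.p. timing above is simple multiplication, and can be sketched like this. the function name is hypothetical, and the 1-to-3-minute pace is just the range observed above, not a measured average.

```python
def round_minutes(pages, minutes_per_page):
    """Total proofing time for one round, at a steady pace per page."""
    return pages * minutes_per_page


# a 256-page book at the three paces mentioned above
for pace in (1, 2, 3):
    total = round_minutes(256, pace)
    print(f"{pace} min/page: {total} minutes ({total / 60:.1f} hours)")
```

the "about 600 minutes" figure sits between the 2-minute and 3-minute paces, at roughly 2.3 minutes per page.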
***
all in all, these results don't surprise me one bit.
this o.c.r. had more scannos and splotches than
i usually find, but the scans weren't all that hot,
as you'll readily see just by paging through 'em.
maybe i'll run 'em through scan-tailor, just to
see if i can then get better o.c.r. out of them...
but if roger will promise to make available his
data -- i.e., scans, r.t.f. -- for his future books,
i will formalize my preprocessing into a system,
so he could run regular tests of its efficiency...
-bowerbird
p.s. now, notes on each of the individual errors...
stealth scannos -- i leave these for the beta-readers...
> RF: some of them, and give Ramon's message,
> BB: some of them, and give Earn on's message,
shoulda caught this, as an unexpected mid-sentence cap.
> RF: Betty and next to Peggy Pollard, who, it
> BB: Betty and nest to Peggy Pollard, who, it
crap. i did catch one "nest/next" scanno, but
forgot to check the rest of the file for another.
> RF: a thing to work for that being president
> BB: a tiling to work for that being president
i'd guess a check for "tiling" would be a worthy one.
> RF: the back. Mary Emma could not go with
> BB: the back. Mary Emma could hot go with
"could hot" and "can hot" checks will surely be worthy.
> RF: problems. From Lucia's manner, she
> BB: problems. From Lucia's manlier, she
a "manlier" check might well be worthy.
> RF: of the page and below was a brief resume
> BB: of the page and below war; a brief resume
a check for "war" might be good, or might give false alarms.
> RF: I'm the crossest girl you ever saw, so far as mere looks
> BB: I'm the Grossest girl you ever saw, so far as mere looks
shoulda caught this, as an unexpected mid-sentence cap.
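the two kinds of check mentioned in these notes (a short list of suspect words, and a mid-sentence capital) could be sketched roughly as below. this is only an illustration under stated assumptions: the function names, the phrase list and the `known_names` parameter are all made up here, and this is not the actual preprocessing code. as noted earlier, checks like these raise false alarms, so the hits are candidates for a human to look at, not automatic fixes.

```python
import re

# suspect words taken from the notes above; any real list would be longer
SUSPECT_PHRASES = ["tiling", "could hot", "can hot", "manlier"]


def find_suspect_phrases(line, phrases=SUSPECT_PHRASES):
    """Return the suspect phrases that occur as whole words in the line."""
    hits = []
    for phrase in phrases:
        if re.search(r"\b" + re.escape(phrase) + r"\b", line, re.IGNORECASE):
            hits.append(phrase)
    return hits


def find_midsentence_caps(line, known_names=frozenset()):
    """Return capitalized words that directly follow a lowercase word.

    A capital after sentence-ending punctuation is not flagged, and
    anything listed in known_names (character names, places) is skipped.
    """
    hits = []
    for match in re.finditer(r"\b[a-z]+,? ([A-Z][a-z]+)\b", line):
        word = match.group(1)
        if word not in known_names:
            hits.append(word)
    return hits


names = {"Betty", "Peggy", "Pollard", "Mary", "Emma", "Lucia"}
print(find_midsentence_caps("some of them, and give Earn on's message,", names))
print(find_midsentence_caps("Betty and nest to Peggy Pollard, who, it", names))
print(find_suspect_phrases("the back. Mary Emma could hot go with"))
```

the name list does most of the work in the second function: without it, every character name in the middle of a sentence would be flagged.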
***
spelling discrepancies -- these are just because i made mistakes
> RF: of those still, quiet stiletto exchanges
> BB: of those still, quiet stilletto
i probably gave this the o.k. because i don't know how to spell it.
> RF: tonsillitis. Betty saw her and overheard
> BB: tonsilitis. Betty saw her and overheard
i probably gave this the o.k. because i don't know how to spell it.
***
splotches -- not much i can do about these
> RF: packed a thin chiffon dress, while
> BB: packed, a thin chiffon dress, while
can't see how i could ever devise a test to catch that.
> RF: this, Miss Betty Lee!"
> BB: this,' Miss Betty Lee!"
didn't do a check of the balancing of single-quotemarks.
***
missed italics -- not much i can do about these
> RF: wouldn't do _one thing_. She is sweet
> BB: wouldn't do one thing. She is sweet
i don't know how to code a test for italics.
> RF: other times too, but _always_ then,
> BB: other times too, but always then, before
i suck at finding italics. i missed 40 cases, no?
***
missing quote marks -- my tests should've caught these
> RF: won't you?"
> BB: won't you?
shoulda found this, unless both quotemarks were missing.
> RF: who sat down. "How is your mother
> BB: who sat down. How is your mother
shoulda found this, unless both quotemarks were missing.
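a parity test of the kind implied here could look something like the sketch below. it assumes the text is already split into paragraphs and uses straight double-quotes; the function name and the sample paragraphs are invented for illustration. as the notes say, it cannot see a paragraph where _both_ quotemarks are missing, because the count is still even.

```python
def unbalanced_paragraphs(paragraphs):
    """Return indexes of paragraphs with an odd number of double-quotes.

    An odd count is allowed when the quote simply continues: the paragraph
    does not end with a quotemark and the next paragraph opens with one.
    """
    flagged = []
    for i, para in enumerate(paragraphs):
        if para.count('"') % 2 == 0:
            continue
        following = paragraphs[i + 1].lstrip() if i + 1 < len(paragraphs) else ""
        continued = following.startswith('"') and not para.rstrip().endswith('"')
        if not continued:
            flagged.append(i)
    return flagged


sample = [
    '"You will come, won\'t you?',   # closing quotemark lost
    "She nodded.",
    '"I like my residence here.',    # quote continues in the next paragraph
    '"It is quiet."',
]
print(unbalanced_paragraphs(sample))
```

single-quotemarks are harder to test this way, since apostrophes use the same character.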
***
extra quote marks -- an ironic coincidence of splotches
> RF: little habit of dropping in when
> BB: little habit of 'dropping in' when
two splotches happened to do something that made sense.
***
levenshtein check -- i should include this in my process
> RF: are the Sevillas and where do they live?
> BB: are the Savillas and where do they live?
i got so confused on this name, but thought i checked 'em all.
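a levenshtein check of this sort might be sketched as follows: collect the distinct words (or just the capitalized ones), and list every pair that is within one edit of each other. the function names and the `max_dist` default are assumptions for illustration, not part of any existing tool.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def near_pairs(words, max_dist=1):
    """Return pairs of distinct words that are within max_dist edits."""
    unique = sorted(set(words))
    return [(a, b) for a, b in combinations(unique, 2)
            if abs(len(a) - len(b)) <= max_dist
            and levenshtein(a, b) <= max_dist]


print(near_pairs(["Sevillas", "Savillas", "Betty", "Peggy"]))
```

this only helps when both spellings occur somewhere in the same text; a name that is misspelled consistently sails right through.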
***
guiguts-catchable errors -- i just plain forgot these tests
> RF: sometimes! I can't study! Come over here
> BB: sometimes! I can't study I Come over
shoulda caught this, as an unexpected mid-sentence cap.
> RF: who reads the sport page."
> BB: who reads the! sport page."
shoulda caught this, as an unexpected sentence-starting lowercase letter.
> RF: know."
> BB: know," (at end of paragraph)
shoulda caught this, as an improper paragraph-termination.
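the three tests named in these notes can each be written as one regular expression. the sketch below is an assumption about how such tests might look, not a description of what guiguts actually runs; the function name and the message strings are invented.

```python
import re


def check_paragraph(para):
    """Return a list of pattern-based complaints about one paragraph."""
    problems = []
    text = para.strip()
    # a paragraph should end in . ! or ? (optionally followed by quotemarks)
    if text and not re.search(r'[.!?]["\']*$', text):
        problems.append("improper paragraph-termination")
    # a lowercase letter right after "! " or "? "
    if re.search(r"[!?] [a-z]", text):
        problems.append("lowercase after ! or ?")
    # a lone "I" followed by a capitalized word, often a misread "!"
    if re.search(r"\bI [A-Z][a-z]", text):
        problems.append("capital after a lone I")
    return problems


print(check_paragraph('know,"'))
print(check_paragraph('who reads the! sport page."'))
print(check_paragraph("sometimes! I can't study I Come over"))
```

the last sample is a fragment cut from the middle of a paragraph, so it trips the termination test too; on whole paragraphs that complaint would only show up for cases like the `know,"` one.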
***
a bug in BB's generator -- not "a bug"; indicates a continued quote
> RF: like my residence here.
> BB: like my residence here." "
not a bug. will be deleted before the product goes "final".