
roger said:
my sampling of the kind of errors left behind by BB's process as far as it went.
thanks for the fuller exposition, roger...

just as a note, before we begin, i'd like to issue one minor note of protest that these errors are labeled as "bb:", when, actually, they were errors that were in _the_o.c.r._... it's not as if i _introduced_ these errors... it's just that my preprocessing failed to find and fix them. there's a difference... i don't have a suggestion for an alternative, and i see the reason for that nomenclature, just want people to grok the bias in the label.

***

on a continued note... it's also the case that my preprocessing _did_ find-and-fix some 250 errors in the o.c.r., so i think the fact that it missed 21 is acceptable, as a first-pass. it can probably be improved, too, but i believe a 92% reduction in errors is nothing to sneeze at. even if i missed _100_, that would be a 70% reduction in o.c.r. errors.

***

now that we're done with the minor protest... for the folks who aren't paying full attention, the whole point behind my argument is that a minimum of effort can fix a _lot_ of errors. i have never claimed it can find _all_ of them -- indeed, that would be a form of magic -- or that it's a _sufficient_ mechanism, by itself. this "minimum of effort" absolutely _needs_ to be followed by a beta-reading component.

although i have never failed to stress that need, sometimes people seem to forget it completely. please remember to consider the full picture... once you do, you'll understand my rejoinder: my beta-readers woulda caught these errors. (just like roger's smoothreader caught them.)

having said that, it is of some concern to me that this seems to change the error-rate so it becomes worse than 1-error-per-10-pages. gonna have to see what i can do about that...
With the addition of a good smoothreader, many of these diffs would have disappeared
right. (well, i'd say _all_ of them, but i'll settle for "many of them" if you insist that it's true, as we're talking about imaginary beta-readers, in my case, so i can't be certain of their skills, but if you say they're not perfect, i won't argue.)

these results are from a midpoint in the workflow. while all the errors you reported are very real -- or almost all of them, anyway -- they are from classes that i have oft-acknowledged as ones that are missed in my _preprocessing_... there's a good reason i use that particular word -- _preprocessing_ -- because it _emphasizes_ the point that this doesn't create a final product.

let's look at those acknowledged weakspots:
stealth scannos
missed italics
splotches
spelling discrepancies

"stealth scannos" are something my preprocessing will never catch. that's why it needs beta-readers.

"missed italics" will depend on abbyy's .rtf quality. if you don't use abbyy, you'll do all of it manually. and if you suck at it, like me, the results will suck.

it's not that i cannot imagine coding a routine that could check for stealth scannos, or italics. i could develop some ideas and take a stab at it. i just wouldn't expect that i'd be very successful, in cost-benefit terms. you can do lotsa checks for stealth scannos, but you get too many false alarms.

and my whole point is that you utilize cost-benefit. it's easy -- and efficient -- for beta-readers to catch stealth scannos, and even missed italics... so let them do that job! that's the wisest course.

"splotches" are something else that's hard to catch, sometimes even impossible. i'll look at these errors, and see if they are the type of thing that _could_ be detected with some programmed routines. but still, it's obvious that some can't, which is -- yet again -- why you need to have beta-readers in the workflow.

as for "spelling discrepancies", i do a spellcheck, but if both forms pass, then i let both of them go. and i didn't do consistency checking on this book. i _will_ make it part of my process, when i finally formalize it, but i didn't do it on this book, nope.

***

so this wasn't a fair fight. roger used a smoothreader, and i didn't. so _of_course_ his results will be better...

roger might've even done a word-by-word proofing, similar to the kind done at distributed proofreading. that can give even _better_ results. sometimes not, but sometimes it _will_, there's no question about it. where the question comes in is whether it's worth it, in terms of the time and energy that it takes people. go take a look at a project page over at d.p., and see how much time it takes the proofers to step through the pages of a round.
you can see a page being saved every minute, or two, or three, so a 256-page book -- like "betty lee" -- can take about 600 minutes to go through a single round. that's 10 hours of work! for one round!

is this a wise use of time and energy? i sure don't think so. when a beta-reader does a book, at least they get the pleasure of actually _reading_ it... those proofers over at d.p. don't even get that benefit. i don't think the time and energy that they _volunteer_ is being used in the best possible cost-benefit manner. and _that_ is what this discussion should be about...

***

all in all, these results don't surprise me one bit. this o.c.r. had more scannos and splotches than i usually find, but the scans weren't all that hot, as you'll readily see just by paging through 'em. maybe i'll run 'em through scan-tailor, just to see if i can then get better o.c.r. out of them...

but if roger will promise to make available his data -- i.e., scans, r.t.f. -- for his future books, i will formalize my preprocessing into a system, so he could run regular tests of its efficiency...

-bowerbird

p.s. now, notes on each of the individual errors...

stealth scannos -- i leave these for the beta-readers...
RF: some of them, and give Ramon's message,
BB: some of them, and give Earn on's message,
shoulda caught this, as an unexpected mid-sentence cap.
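for what it's worth, a check for an "unexpected mid-sentence cap" could be sketched roughly as below. this is only an illustration, not the actual preprocessing code: the function name `midsentence_caps` and the `common_words` set (a stand-in for a real lowercase wordlist) are assumptions. the idea is to flag a capitalized word that does not follow a sentence terminator and whose lowercase form is an ordinary dictionary word, so "Earn" and "Grossest" get flagged while a name like "Ramon's" does not.

```python
import re

def midsentence_caps(line, common_words):
    """Return capitalized words that appear mid-sentence and look like ordinary words.

    common_words is an assumed set of lowercase dictionary words; a real run
    would load a full wordlist instead of the tiny set used in the example.
    """
    hits = []
    # split into words (apostrophes kept) and sentence terminators
    tokens = re.findall(r"[A-Za-z']+|[.!?]", line)
    prev = None
    for tok in tokens:
        mid_sentence = prev is not None and prev not in ".!?"
        if (mid_sentence and tok[0].isupper() and tok != "I"
                and tok.lower() in common_words):
            hits.append(tok)
        prev = tok
    return hits

# tiny stand-in wordlist, purely for illustration
common = {"grossest", "earn", "come", "the", "girl"}
print(midsentence_caps("some of them, and give Earn on's message,", common))
print(midsentence_caps("I'm the Grossest girl you ever saw", common))
print(midsentence_caps("sometimes! I can't study I Come over", common))
```

the first word of each line is skipped, since a wrapped line can begin anywhere in a sentence. words that are both names and common words ("Miss", "Rose") will raise false alarms, so the hits are a list for a human to review, not automatic fixes.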
RF: Betty and next to Peggy Pollard, who, it
BB: Betty and nest to Peggy Pollard, who, it
crap. i did catch one "nest/next" scanno, but forgot to check the rest of the file for another.
RF: a thing to work for that being president
BB: a tiling to work for that being president
i'd guess a check for "tiling" would be a worthy one.
RF: the back. Mary Emma could not go with
BB: the back. Mary Emma could hot go with
"could hot" and "can hot" checks will surely be worthy.
RF: problems. From Lucia's manner, she
BB: problems. From Lucia's manlier, she
a "manlier" check might well be worthy.
RF: of the page and below was a brief resume
BB: of the page and below war; a brief resume
a check for "war" might be good, or might give false alarms.
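the word and phrase checks suggested in these notes ("tiling", "could hot", "can hot", "manlier") amount to a small watchlist scan. here is a minimal sketch of one, assuming the text is available as a list of lines. the names `WATCHLIST` and `find_suspects` are made up for the example, and "war" is left off the list on purpose, since it is a common real word and would likely bury the true hits in false alarms.

```python
import re

# suspect words and phrases taken from the notes above; extend as new scannos turn up
WATCHLIST = ["tiling", "could hot", "can hot", "manlier"]

def find_suspects(lines, watchlist=WATCHLIST):
    """Return (line_number, phrase, line) for every watchlist hit."""
    patterns = [
        (phrase, re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE))
        for phrase in watchlist
    ]
    hits = []
    for number, line in enumerate(lines, 1):
        for phrase, pattern in patterns:
            if pattern.search(line):
                hits.append((number, phrase, line))
    return hits

sample = [
    "a tiling to work for that being president",
    "the back. Mary Emma could hot go with",
    "it was a hot day, and nobody minded",
]
for number, phrase, line in find_suspects(sample):
    print(number, phrase, "|", line)
```

a two-word phrase that happens to be split across a line break will be missed by this line-by-line version. joining each paragraph into one string before scanning would cover that case.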
RF: I'm the crossest girl you ever saw, so far as mere looks
BB: I'm the Grossest girl you ever saw, so far as mere looks
shoulda caught this, as an unexpected mid-sentence cap.

***

spelling discrepancies -- these are just because i made mistakes

RF: of those still, quiet stiletto exchanges
BB: of those still, quiet stilletto
i probably gave this the o.k. because i don't know how to spell it.
RF: tonsillitis. Betty saw her and overheard
BB: tonsilitis. Betty saw her and overheard
i probably gave this the o.k. because i don't know how to spell it.

***

splotches -- not much i can do about these

RF: packed a thin chiffon dress, while
BB: packed, a thin chiffon dress, while
can't see how i could ever devise a test to catch that.
RF: this, Miss Betty Lee!"
BB: this,' Miss Betty Lee!"
didn't do a check of the balancing of single-quotemarks.

***

missed italics -- not much i can do about these

RF: wouldn't do _one thing_. She is sweet
BB: wouldn't do one thing. She is sweet
i don't know how to code a test for italics.
RF: other times too, but _always_ then,
BB: other times too, but always then, before
i suck at finding italics. i missed 40 cases, not?

***

missing quote marks -- my tests should've caught these

RF: won't you?"
BB: won't you?
shoulda found this, unless both quotemarks were missing.
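a quotemark test of this kind is usually just a per-paragraph count. the sketch below is one way to do it, under stated assumptions: the text uses straight double-quotes, paragraphs are already split into a list, and the function name `odd_quote_paragraphs` is made up for the example. it flags every paragraph with an odd number of double-quotes, and marks the ones that might be a legitimate continued quote (no closing mark, and the next paragraph opens with one).

```python
def odd_quote_paragraphs(paragraphs):
    """Return (index, possibly_continued) for paragraphs with an odd count of '"'.

    possibly_continued is True when the paragraph does not end with a quotemark
    and the next paragraph starts with one, which is how a quotation that runs
    across paragraphs is normally printed.
    """
    flagged = []
    for i, para in enumerate(paragraphs):
        if para.count('"') % 2 == 0:
            continue  # balanced, or both marks missing
        next_para = paragraphs[i + 1].lstrip() if i + 1 < len(paragraphs) else ""
        possibly_continued = (
            next_para.startswith('"') and not para.rstrip().endswith('"')
        )
        flagged.append((i, possibly_continued))
    return flagged

paras = [
    '"You will come, won\'t you?',
    'who sat down. How is your mother?"',
    '"I like my residence here.',
    '"And I mean to stay."',
]
print(odd_quote_paragraphs(paras))
```

as the note above says, a paragraph where _both_ quotemarks went missing still has an even count, so it passes untouched. and in dialogue-heavy text, a missing close-quote followed by another speaker's paragraph looks exactly like a continued quote, so the "possibly continued" cases still need a human glance.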
RF: who sat down. "How is your mother
BB: who sat down. How is your mother
shoulda found this, unless both quotemarks were missing.

***

extra quote marks -- an ironic coincidence of splotches

RF: little habit of dropping in when
BB: little habit of 'dropping in' when
two splotches happened to do something that made sense.

***

levenshtein check -- i should include this in my process

RF: are the Sevillas and where do they live?
BB: are the Savillas and where do they live?
i got so confused on this name, but thought i checked 'em all.

***

guiguts-catchable errors -- i just plain forgot these tests

RF: sometimes! I can't study! Come over here
BB: sometimes! I can't study I Come over
shoulda caught this, as an unexpected mid-sentence cap.
RF: who reads the sport page."
BB: who reads the! sport page."
shoulda caught this, as an unexpected sentence-starting lower.
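the "sentence-starting lower" test can be approximated with a single regular expression. this is a rough sketch only, not the guiguts routine or the actual preprocessing code; the names `LOWER_AFTER_STOP` and `lowercase_starts` are assumptions. it looks for a terminator (. ! ?) followed directly by whitespace and a lowercase letter.

```python
import re

# a terminator followed by whitespace and a lowercase letter.
# the lookbehind skips ellipses ("wait... and"), and a closing quotemark
# after the terminator ('you?" she asked') prevents a match, since speech
# tags legitimately start in lowercase.
LOWER_AFTER_STOP = re.compile(r"(?<!\.)[.!?]\s+[a-z]")

def lowercase_starts(lines):
    """Return (line_number, matched_text) for each suspicious lowercase start."""
    return [
        (number, match.group(0))
        for number, line in enumerate(lines, 1)
        for match in LOWER_AFTER_STOP.finditer(line)
    ]

sample = [
    'who reads the! sport page."',
    '"Won\'t you?" she asked.',
    "Wait... and then",
]
print(lowercase_starts(sample))
```

abbreviations such as "e.g." or "etc." in mid-sentence will trigger it, so a short exception list would probably be needed before it is worth running on a whole book.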
RF: know."
BB: know," (at end of paragraph)
shoulda caught this, as an improper paragraph-termination.

***

a bug in BB's generator -- not "a bug"; indicates a continued quote

RF: like my residence here.
BB: like my residence here." "
not a bug. will be deleted before the product goes "final".
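on the levenshtein check mentioned above for the Sevillas/Savillas case: here is a minimal sketch of how such a pass might work. the function names, the `min_len` cutoff, and the distance threshold are all assumptions made for illustration. it counts every word in the text, then reports pairs of distinct longer words that sit within one edit of each other, with their counts, so the rare form stands out. the same pass would also pair up "stiletto"/"stilletto" and "tonsillitis"/"tonsilitis" if both spellings occur in the book.

```python
import re
from collections import Counter
from itertools import combinations

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def near_variants(text, min_len=6, max_dist=1):
    """Return (word_a, count_a, word_b, count_b) for near-identical word pairs."""
    counts = Counter(re.findall(r"[A-Za-z]+", text))
    words = sorted(w for w in counts if len(w) >= min_len)
    pairs = []
    for a, b in combinations(words, 2):
        if abs(len(a) - len(b)) <= max_dist and levenshtein(a, b) <= max_dist:
            pairs.append((a, counts[a], b, counts[b]))
    return pairs

sample = ("Who are the Sevillas and where do they live? "
          "The Sevillas are new here. Are the Savillas coming?")
print(near_variants(sample))
```

comparing every pair of words is quadratic, which is fine for a single book's vocabulary but slow on anything much bigger. legitimate near-twins ("walked"/"talked") will show up too, so the output is a review list, not a fix list.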