ok, we've got a couple of different topics running around,
so let us take a minute to make sure we are not confused...
first of all, let's talk about my campaign for preprocessing...
i have demonstrated, over and over and over again, that d.p.
(and rfrank) should be doing _much_ better preprocessing...
i've shown how they can use _very_simple_means_ to do that,
and how -- if they did -- they could reduce the error-counts
in their books to a ridiculously small number, even _before_
their text went in front of proofers. i have talked about how
it is a huge _waste_ of the generous donations of volunteers
(in both time and energy) not to do aggressive preprocessing,
which automatically locates errors to make them easy to fix...
again, the crux of my argument -- and i have proven it to be
absolutely true, again and again -- is that it's _easy_ to do this.
indeed, when i have shown the steps taken to locate the errors,
it becomes painfully obvious how ridiculously simple they are...
they include obvious checks, like a number embedded in a word,
or a lowercase letter followed by a capital letter, or two commas
in a row, or a period at the beginning of a line. _obvious_ stuff!
this isn't rocket science. it's not even _hard_... it's dirt-simple!
and yet neither d.p. nor rfrank has instituted such preprocessing.
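just so nobody thinks there's any secret sauce involved, here is
a minimal sketch of those checks in python... the patterns, the
filenames, and the report format here are mine, purely to show
the idea -- they are _not_ d.p.'s or rfrank's actual tooling:

    import re
    import sys

    # a minimal sketch of the dirt-simple checks described above.
    # patterns and report format are illustrative only -- not the
    # actual tooling used by d.p. or rfrank.
    CHECKS = [
        ("digit embedded in a word",      re.compile(r"[A-Za-z]\d|\d[A-Za-z]")),
        ("lowercase then capital letter", re.compile(r"[a-z][A-Z]")),
        ("two commas in a row",           re.compile(r",\s*,")),
        ("period at start of a line",     re.compile(r"^\s*\.")),
    ]

    def preprocess_report(path):
        """print every line that trips one of the simple checks."""
        with open(path, encoding="utf-8") as f:
            for lineno, line in enumerate(f, start=1):
                for label, pattern in CHECKS:
                    if pattern.search(line):
                        print(f"{path}:{lineno}: {label}: {line.rstrip()}")

    if __name__ == "__main__":
        for filename in sys.argv[1:]:
            preprocess_report(filename)

run that over a raw o.c.r. text and every suspicious line comes
back with its line-number attached, ready for a human to eyeball...
that's the whole trick. nothing more to it than that.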
***
let's contrast this with gardner's request, which was to compile
a list of reg-ex tests that will locate all possible errors in any
random book. this request -- as worthy as it might seem -- is
_much_ more difficult to realize. in fact, it's almost impossible.
a friend of mine over in england, nick hodson, is a very prolific
digitizer. all by himself, he has done some 500 books or more.
nick collected an extensive set of checks over the years. i can't
remember exactly how many there were, but it was roughly 200.
however, once nick upgraded his o.c.r. program, he found that
about half of his checks were no longer required. they had been
necessary only as artifacts of an outdated o.c.r. program.
the type of books nick was digitizing hadn't changed, and neither
had the quality of the scans, or the resolution of the scans, or the
digital retouching that he performed on the scans -- none of that.
he was the same person, using the same computer and scanner,
and he was doing the same things exactly as he had done before.
the only thing that changed was the version of his o.c.r. program.
yet he found that many checks he had formerly needed became unnecessary.
so, for an operation like d.p., which takes in all kinds of scans
and uses a wide variety of o.c.r. programs, operated by users with
a huge range of expertise, the results will be all over the map.
they're _never_ gonna get a definitive list of checks to be made.
it would be _immensely_ difficult, to the point of being impossible.
but that's totally beside our other point, about preprocessing...
because the fact of the matter is that a few dozen _simple_ tests
are all that d.p. needs in order to reduce the number of errors to
a level where they can be handled easily by their human proofers.
they're never gonna get 100%. but they could find 90% so easily
that it's criminal negligence that they aren't doing that already...
heck, spell-check by itself will locate 50% of the errors for you...
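(and by "spell-check" i mean nothing fancier than flagging every
word that isn't in a wordlist. a bare-bones sketch along those
lines, where the wordlist path is just a placeholder for whatever
dictionary you happen to have on hand:)

    import re
    import sys

    # a bare-bones "spell-check" pass: flag every word that is not
    # in the wordlist. the wordlist path is only a placeholder --
    # point it at whatever dictionary you actually have.
    WORDLIST = "/usr/share/dict/words"

    def load_wordlist(path=WORDLIST):
        with open(path, encoding="utf-8") as f:
            return {word.strip().lower() for word in f if word.strip()}

    def flag_unknown_words(path, words):
        """print every word (with its line-number) not in the wordlist."""
        with open(path, encoding="utf-8") as f:
            for lineno, line in enumerate(f, start=1):
                for token in re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", line):
                    if token.lower() not in words:
                        print(f"{path}:{lineno}: unknown word '{token}'")

    if __name__ == "__main__":
        words = load_wordlist()
        for filename in sys.argv[1:]:
            flag_unknown_words(filename, words)

feed it a raw text and it hands you a big chunk of the errors
before a single proofer ever has to look at the page...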
-bowerbird