in discussing how to implement spell-check functionality,
i talked about the "bad-words" list as a part of that workflow.

i posted the "bad-words" list from the o.c.r. for the sitka book.
>   http://z-m-l.com/go/sitka/sitka-reversedictionary.txt

i talked about how the proofing process can be envisioned as
movement from the "bad-words" list to the "good-words" list.

now that rfrank has posted the "final" version at p.g., we can
do the same "dictionary-check" procedure on his finished book:
>   http://z-m-l.com/go/sitka/sitka-reverseposted.txt

you'll see i have introduced blank lines into these files, so as
to coordinate their lines so they can be merged into one file.
this will allow us to see how each word traversed the process.
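
here is a minimal sketch of how that blank-line alignment might be done.
it is not the script used for the files above.  it assumes both lists
are sorted by the same plain string ordering, and the function name
and the sample words are made up for illustration.

```python
def align_sorted(left, right):
    """walk two sorted word lists in step, padding with "" so rows line up.

    assumption: both lists are sorted with the same plain string ordering.
    """
    rows = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            # the word survived unchanged: it appears on both sides
            rows.append((left[i], right[j]))
            i += 1
            j += 1
        elif left[i] < right[j]:
            # only in the left list: blank on the right
            rows.append((left[i], ""))
            i += 1
        else:
            # only in the right list: blank on the left
            rows.append(("", right[j]))
            j += 1
    # whatever remains on either side has no partner
    rows.extend((w, "") for w in left[i:])
    rows.extend(("", w) for w in right[j:])
    return rows


# made-up sample data, not taken from the sitka files
ocr_list = ["Sitka", "tbe", "totem"]
final_list = ["Sitka", "totem"]
for ocr_word, final_word in align_sorted(ocr_list, final_list):
    print(f"{ocr_word:<12}{final_word}")
```

a word with a blank beside it either vanished during proofing or
was introduced by it, which is exactly what the merged view shows.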

the merged file is here:
>   http://z-m-l.com/go/sitka/sitka-reverse-review.html

***

go ahead and take a quick look at the words...

one thing you might notice is that a number of the words are
marked with asterisks.  those are the ones that are suspicious,
in that they appear to be _variants_ of each other, which might
be a good indication of either (1) an o.c.r. misrecognition, or
(2) an inconsistency in the p-book which should be corrected...

(some people use levenshtein edit-distance to find these words.
that's nice, but a plain old review of a sorted list works well too.)
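
for anyone who does want to try the edit-distance route, here is a
rough sketch.  the function names, the distance threshold of 1, and
the sample words are all assumptions of mine, and the pairwise loop
is only practical for lists of a few thousand words.

```python
def levenshtein(a, b):
    """classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete a character from a
                cur[j - 1] + 1,            # insert a character into a
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = cur
    return prev[-1]


def near_variants(words, max_dist=1):
    """return pairs of distinct words within max_dist edits of each other."""
    unique = sorted(set(words))
    pairs = []
    for i, w in enumerate(unique):
        for v in unique[i + 1:]:
            # a length gap bigger than max_dist can never be close enough
            if abs(len(w) - len(v)) <= max_dist and levenshtein(w, v) <= max_dist:
                pairs.append((w, v))
    return pairs


# made-up sample words, not taken from the sitka lists
print(near_variants(["today", "to-day", "Sitka"]))
```

each pair it reports is only a _candidate_.  you still have to check
it against the p-book to learn which spelling, if either, is wrong.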

i checked those variants against the p-book, and many of them
were indeed errors.  (the ones marked "ok" at the end were right.)

that's how i found many of these consistency errors rfrank missed.

oh, by the way, there is still at least one more of those errors that
i didn't mention earlier...  so if anyone wants to go looking for it...

ok, so let's go on to look at the list in other ways...

***

for each individual word, we're gonna see how it was handled...

the first group are some words that had garbage characters.
those words didn't have any direct equivalent in the final file,
at least none that were close enough to be sorted similarly...

once we get into the lowercase words, we get some matching.

and that continues as we get into the words with an initial cap,
and on into the compound-words, and then into the numbers...

focus on the initial-cap words -- which are primarily names --
and you'll see that the vast majority of these were recognized
correctly, in that they persisted through to the final version...
about 85% of the initial-cap words were recognized correctly.

the same is true of the compound-words, with few exceptions.
and most numbers seem to have been recognized correctly too.

but lowercase words are more of a mixed bag.  some were right,
it's true, but a relatively high percentage of them were incorrect,
as evidenced by the fact they were changed, one way or another.
only about half the lowercase words were recognized correctly.

you might remember that i had predicted precisely this pattern.

lowercase o.c.r. words that are in the "bad-words" list are
generally misrecognitions, while most initial-cap words are not.
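
the group-by-group tally described above can be sketched like this.
the grouping rules are my own rough guesses at what counts as
"lowercase", "initial-cap", "compound", "number", and "garbage", and a
word is counted as "recognized correctly" only if the identical string
shows up in the final list.  all names and sample words are made up.

```python
import re
from collections import Counter


def word_group(word):
    """assign a word to one rough group (assumed rules, for illustration)."""
    if re.fullmatch(r"[0-9][0-9,.]*", word):
        return "number"
    if re.fullmatch(r"[A-Za-z]+(-[A-Za-z]+)+", word):
        return "compound"
    if re.fullmatch(r"[A-Z][A-Za-z']*", word):
        return "initial-cap"
    if re.fullmatch(r"[a-z']+", word):
        return "lowercase"
    return "garbage"


def survival_by_group(ocr_words, final_words):
    """fraction of o.c.r. bad-words, per group, that persist in the final list."""
    final = set(final_words)
    total, kept = Counter(), Counter()
    for w in set(ocr_words):
        group = word_group(w)
        total[group] += 1
        if w in final:
            kept[group] += 1
    return {group: kept[group] / total[group] for group in total}


# made-up sample data, not the real sitka numbers
ocr = ["Sitka", "Kodiak", "tbe", "totem"]
final = ["Sitka", "Kodiak", "totem"]
print(survival_by_group(ocr, final))
```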

indeed, the percentage of correct lowercase words in this o.c.r.
was much higher than is normal, because this was not typical
o.c.r.  the scans were clean, for one thing, and for another,
rfrank probably did some preprocessing on the raw o.c.r. text,
which we can tell because it had very few garbage characters...

so what we see is that roughly 75% of these words were _correct_,
despite the fact they were "bad-words" (i.e., not in the dictionary).
they weren't in the dictionary, but they were "good" in this book...
that so many "bad-words" can be correct and unique to the book
is why it's important to use a "custom" book-specific dictionary.

only about 25% of the "bad-words" were actually misrecognitions.

to the extent that you can narrow down your _flagging_ of the
"bad-words" to the ones that are _really_ bad, you can relieve
your proofers of a _lot_ of unnecessary flags, which is _good_,
because false flags sap the attention of proofers unnecessarily.
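
as a sketch of what that narrowing looks like in code, assuming you
have a general dictionary plus a book-specific list of words already
confirmed as good (both are plain sets here, and every name and
sample word is hypothetical):

```python
def flag_words(page_words, base_dictionary, book_dictionary):
    """return only the words that neither dictionary vouches for.

    assumption: a capitalized word is also accepted if its lowercase
    form is known, so a sentence-initial "The" is not flagged.
    """
    known = set(base_dictionary) | set(book_dictionary)
    return [w for w in page_words if w not in known and w.lower() not in known]


# made-up sample data
base = {"the", "totem"}
book = {"Sitka"}          # confirmed good for this particular book
page = ["The", "Sitka", "tbe", "totem"]
print(flag_words(page, base, book))
```

without the book-specific set, "Sitka" would be flagged on every page
it appears on, and that is the kind of false flag you want to avoid.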

once you've done this analysis a number of times, like i have,
you'll come to recognize that it is a very important analysis...

-bowerbird

p.s.  hey, dkretz, thanks for the shoutout!  but one correction!
i wasn't "baited" into "crossing a line" that got me banned at d.p.
i never get "baited" into _anything_.  i always know what i'm doing.
and i've been banned from enough places that i know how it works,
so it wasn't that i "crossed a line".  again, i know what i'm doing...
no, if i get banned from somewhere, it's something i _anticipated_,
and after a consideration of that outcome, decided it didn't matter.
which is _not_ to say that i _like_ to get banned, or that i _try_ to,
but is _rather_ to say that i won't allow myself to be banned _unless_
i have decided that i don't really care whether i'm banned or not...
as for "crossing a line", there's no need for it.  even though people
will generally say that i broke some technical rule and that is why
i was banned, the truth of the matter is that one only gets banned
if one pisses off the person with the power to push the ban button.
it has nothing at all to do with "the rules"...  it's just raw emotion...
oh, and let me say one more thing; it's _nice_ that d.p. still lets me
come to their forums and read them.  if they tried to prevent that,
i _could_ get around it, but it's a hassle.  so i thank them for that...