hey, today is my "d.p. birthday"!               :+)

it's been 3 years now.  and 32 pages, meaning i am
averaging just under 11 pages per year...  woohoo!           ;+)

***

roger frank, over on the d.p. forums, said:
>   My example is this: in a recent book with a character "Ann"
>   there was one instance of "Ana" that I missed. The first thing
>   I did was to write code that goes through every word in the book,
>   checks them against a dictionary, finds the singletons or doubletons,
>   and then does an edit-distance check on those words against
>   every word used in the book. In this case, that code finds "Ana" once
>   and "Ann" many times. It's almost certainly an error that
>   needs to be flagged.  But how?

you might be overthinking (and overprogramming) this...

a simple ascii-based sort of the unique words in the book
(with occurrence counts) can quickly and easily be generated,
and a quick skim of that list will reveal instances like this one...

use an ascii sort, so that capitalized forms will sort separately:
>   Ana -- 1 ********** (asterisks flag entries with few occurrences)
>   And -- 6
>   Ann -- 22
>   Apple -- 2 **********
>   Ben -- 86
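
here's a minimal sketch of how a list like that could be generated,
in python.  the function names, the word pattern, and the "rare_max"
cutoff are my own assumptions for illustration, not anything from
roger's code.  python's default string sort goes by code point,
which gives the ascii ordering where capitals come first...

```python
import re
from collections import Counter

def word_counts(text):
    """count every distinct word, case-sensitively."""
    # assumption: a "word" is a run of letters, with optional internal apostrophes
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    return Counter(words)

def ascii_sorted_report(counts, rare_max=2):
    """one line per unique word, in ascii order, with rare words flagged."""
    lines = []
    # sorted() on str compares code points, so "Ann" sorts before "and"
    for word in sorted(counts):
        n = counts[word]
        # rare_max is a hypothetical cutoff: flag words seen this many times or fewer
        flag = " **********" if n <= rare_max else ""
        lines.append(f"{word} -- {n}{flag}")
    return lines

if __name__ == "__main__":
    sample = "Ann met Ben. Ann and Ana saw Ben. And Ann left."
    for line in ascii_sorted_report(word_counts(sample), rare_max=1):
        print(line)
```

the flagged rows are the ones to eyeball first, especially when
a flagged word sits right next to a similar word with a high count.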

you'll also find that pulling out all _names_ is a good idea;
"anything capitalized mid-sentence" is a working definition.
not only will that help you sort out the names very quickly,
but it will identify those pesky mid-sentence capitalizations
that can happen when an "i" was incorrectly upper-cased...
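
a rough sketch of that "capitalized mid-sentence" test, again in python.
it assumes a sentence ends at ".", "!" or "?", and it skips the
pronoun "i" (and forms like "i'm"), which is my own choice, since
that one is supposed to be upper-cased.  it will be fooled by
abbreviations and by dialogue that opens inside a sentence,
so treat the output as a list to skim, not a verdict...

```python
import re
from collections import Counter

SENTENCE_END = {".", "!", "?"}

def midsentence_capitals(text):
    """count capitalized words that do not start a sentence."""
    counts = Counter()
    # tokens are either words or sentence-ending punctuation marks
    tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*|[.!?]", text)
    at_start = True  # the first word of the text starts a sentence
    for tok in tokens:
        if tok in SENTENCE_END:
            at_start = True
            continue
        is_pronoun_i = tok.split("'")[0] == "I"  # assumption: ignore "I", "I'm", "I'll"
        if tok[0].isupper() and not at_start and not is_pronoun_i:
            counts[tok] += 1
        at_start = False
    return counts

if __name__ == "__main__":
    sample = "Ann met Ben today. Then It rained, and I went home. Ben stayed."
    for word, n in sorted(midsentence_capitals(sample).items()):
        print(f"{word} -- {n}")
```

in the sample, "Ben" shows up as a name and "It" shows up
as the kind of stray capital that wants a look.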


>   Since I go through the whole book anyway
>   a paragraph at a time before sending it to PPV,
>   I'm thinking why can't I have that error and indeed
>   any suspicious situation flagged in the file I review?
>   And what better way is there than to use color?
>   So as I go through the book a page at a time
>   I might use one color for questionable one-off errors
>   (like "Ana" for "Ann"), another color for apparent quote mark
>   mismatches, another for spelling suspects, and so forth.
>   So this idea is half-baked at best, but I'm going to explore it.

it's not a half-baked idea.  it works quite wonderfully for me...

the one thing i'd suggest is that you build this into a program
that lets you make the corrections right away, inside of itself,
rather than doing the display via .html and then having to go
to a separate program to do the required editing...

extra points if your program can analyze the list created above
and whisk you to the very place where the single "ana" is located,
complete with the scan placed next to it for a quick evaluation...
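
the "whisk you there" part can be sketched like this, assuming the
book is held as one plain-text string.  "locate" is a hypothetical
helper name.  it only finds the line; pairing that line with its
page scan depends on how your files mark page boundaries,
so that part is left out...

```python
import re

def locate(word, text):
    """return (line_number, line) for each whole-word, case-sensitive hit."""
    # \b keeps "Ana" from matching inside a longer word such as "Anatomy"
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if pattern.search(line):
            hits.append((number, line.strip()))
    return hits

if __name__ == "__main__":
    book = "Ann went out.\nBen saw Ana there.\nAnn came back."
    for number, line in locate("Ana", book):
        print(f"line {number}: {line}")
```

feed it each flagged word from the sorted list and you get
a short to-do list of places to jump to in the editor.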


>   If I'm crazy, tell me now.
>   If it might be worthwhile, I'd like to know. Thanks.

you're not crazy.  in fact, you're smarter than the average bear.

-bowerbird