james said:
>   One thing I spotted was the use of

i don't need details.  just show me the utf8.          :+)


>   However, I thought you should be aware that
>   your word extraction program is doing this
>   and it is wrong.

i know it's wrong.  that's the whole point.
don't explain.  show me the right version.


>   I have a thought on looking up the words.
>   PDFs and DjVus from archive.org
>   have text contained in them. 
>   I should be able to put in
>   a questionable word from the right column
>   and see what it should be on the left,
>   then fix it.

you're a programmer, right?

start thinking like one.

i don't know exactly what you mean by
"put in a questionable word",
but it sounds uncomfortably _manual_.

ditto with doing "find" in a .pdf or .djvu.

you have a list of the bad words in a file.
and you have the actual e-book in a file.
with page numbers pointing to the scans.

so...

think like a programmer, and write code
that _automates_ the process for you, so
you just have to click a button or two and
maybe -- in the extreme case -- edit text
in a text-field by using your (ick) keyboard.
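
here is a minimal sketch of what that automation might look like, in python.
it takes the list of bad words and the text of the e-book, and reports the
page each bad word sits on, so the matching scan can be pulled up directly.
the "[page 12]" marker format and the name find_bad_words are assumptions
made up for illustration; adjust them to whatever the real files use.

```python
import re

# assumption: the e-book marks each page with a line like "[page 12]".
# that marker format is hypothetical; change the pattern to fit the real file.
PAGE_MARKER = re.compile(r"^\[page (\d+)\]\s*$")


def find_bad_words(book_text, bad_words):
    """return a list of (word, page, line) hits, in reading order."""
    wanted = set(bad_words)
    hits = []
    page = None  # stays None until the first page marker is seen
    for line in book_text.splitlines():
        marker = PAGE_MARKER.match(line)
        if marker:
            page = int(marker.group(1))
            continue
        # split on whitespace, then trim punctuation around each token
        for token in line.split():
            word = token.strip(".,;:!?\"'()[]")
            if word in wanted:
                hits.append((word, page, line.strip()))
    return hits


if __name__ == "__main__":
    # made-up sample data, just to show the shape of the output
    sample_book = "[page 1]\nthe quick brovvn fox\n[page 2]\njumps ovcr the dog\n"
    for word, page, line in find_bad_words(sample_book, ["brovvn", "ovcr"]):
        print(f"page {page}: {word!r} in: {line}")
```

each hit carries the page number and the surrounding line, which is
what a button-driven tool would need to show the scan next to the word.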

think like a programmer.

i _will_ repeat this.  if i _have_to_ repeat it.

but james, i don't want to have to repeat it...

-bowerbird