i've got another dictionary -- i think it's the one rfrank uses in his routines -- that's 1.5 megs. bigger isn't always better. first, it takes more machine-time to use a bigger dictionary. second, and more importantly, you're more likely to miss a real o.c.r. error that just happens to spell some other word.
BB is correct that bigger isn't always better. In some of my software tools I use the SCOWL (Spell Checker Oriented Word Lists) lists, where you can select increasingly larger levels. For example, "american-words.10" holds only the most common words, while "american-words.95" reaches much further into uncommon ones. There are also Canadian and British word lists, contractions, and abbreviations. Learn more about SCOWL and other word lists at http://wordlist.sourceforge.net/

For code that doesn't use SCOWL, I use a dictionary that is almost exactly the same as the one BB put in his post. I compared the words starting with "e" and found our lists are identical, except that his has two words mine doesn't: "escaloped" and "escaloping." After doing a lot of research and testing, I ended up using the 2of12 list from http://wordlist.sourceforge.net/12dicts-readme.html for most of my projects. Like BB, I also use supplemental word lists.

I do several other things in my code that I didn't see in BB's process. First, if a word is capitalized and occurs often enough, it's deemed a proper name and is accepted. Second, if each part of a hyphenated word is a recognized word, the hyphenated version is accepted. (There's a rough sketch of both checks at the end of this post.) In another pass, the code looks for the hyphenated and unhyphenated forms of a word and reports when both appear in the text. This report from a recent run provides an example:

    snowshoes(1) and snow-shoes(19) both appear in text.
    today(1) and to-day(28) both appear in text.
    tomorrow(1) and to-morrow(31) both appear in text.
    tonight(1) and to-night(25) both appear in text.

I also flag any word with mixed case as a suspect.

BB mentioned that it takes more machine time to use a bigger dictionary, which is true but not a problem for me in practice. What really does add computer time is the Levenshtein distance check I run between all the words in the book, flagging any pairs that are close (also sketched below). Here are examples of short-edit-distance suspects from recent books:

    McDonald:Mcdonald (12:4)
    ker-choooo:ker-chooooo (1:1)
    Luke:Lukey (65:2)

I believe the process BB presented in lesson 12 would find all of these same problems; it would just take more effort to go through the suspect lists. If I can chase down suspect words using more than "not in the list" evaluations, then I'll let the computer do the additional checks.

I'm really enjoying BB's lessons on how to digitize a book. There's a lot of good information in there.

--Roger (rfrank)
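Here is a rough Python sketch of the two acceptance checks (proper names and hyphenated parts). It's a simplified illustration, not my actual code; the threshold of 4 capitalized occurrences and the input structures are assumptions made just for the sketch.

    from collections import Counter

    def build_checker(good_words, word_counts, name_threshold=4):
        # good_words: set of dictionary words (e.g. 2of12 plus supplemental lists)
        # word_counts: Counter of every word as it appears in the book
        # name_threshold: how many capitalized occurrences make a "proper name"
        #                 (4 is just a placeholder value)
        def is_accepted(word):
            if word.lower() in good_words:
                return True
            # Capitalized and frequent enough: treat it as a proper name.
            if word[:1].isupper() and word_counts[word] >= name_threshold:
                return True
            # Hyphenated word: accept it if every part is itself a good word.
            if "-" in word:
                parts = [p.lower() for p in word.split("-") if p]
                if parts and all(p in good_words for p in parts):
                    return True
            return False
        return is_accepted

Anything the checker rejects goes onto the suspect list.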
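The hyphenated/unhyphenated report comes from a pass along these lines (again just a sketch; word_counts is assumed to be a Counter over every word in the text, hyphens preserved):

    def hyphenation_report(word_counts):
        # Report words that appear both with and without a hyphen.
        lines = []
        for word in sorted(word_counts):
            if "-" not in word:
                continue
            joined = word.replace("-", "")
            if joined in word_counts:
                lines.append("%s(%d) and %s(%d) both appear in text."
                             % (joined, word_counts[joined], word, word_counts[word]))
        return lines

That's what produces lines like "today(1) and to-day(28) both appear in text."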
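The edit-distance pass is conceptually just a pairwise comparison over the book's distinct words, which is why it costs real machine time on a long book. A naive version might look like this (my real code may differ in details such as the distance threshold):

    from itertools import combinations

    def levenshtein(a, b):
        # Classic two-row dynamic-programming edit distance.
        if len(a) < len(b):
            a, b = b, a
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            current = [i]
            for j, cb in enumerate(b, 1):
                cost = 0 if ca == cb else 1
                current.append(min(previous[j] + 1,          # deletion
                                   current[j - 1] + 1,       # insertion
                                   previous[j - 1] + cost))  # substitution
            previous = current
        return previous[-1]

    def close_pairs(word_counts, max_distance=1):
        # Yield report lines for distinct word pairs within max_distance edits.
        words = sorted(word_counts)
        for a, b in combinations(words, 2):
            if abs(len(a) - len(b)) > max_distance:  # cheap pre-filter
                continue
            if levenshtein(a, b) <= max_distance:
                yield "%s:%s (%d:%d)" % (a, b, word_counts[a], word_counts[b])

The pairwise scan is quadratic in the number of distinct words, which is where the extra computer time goes.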