i've got another dictionary -- i think it's the one rfrank uses in his routines -- that's 1.5 megs. bigger isn't always better. first, it takes more machine-time to use a bigger dictionary. second, and more importantly, you're more likely to miss a real o.c.r. error that just happens to spell some other word.
BB is correct that bigger isn't always better. In some of my software tools I use the SCOWL (Spell Checker Oriented Word Lists) lists, where you can select increasingly larger levels. For example, "american-words.10" holds only the most common words, while "american-words.95" reaches much further into uncommon ones. There are also Canadian and British word lists, contractions, and abbreviations. Learn more about SCOWL and other word lists at http://wordlist.sourceforge.net/

For code that doesn't use SCOWL, I use a dictionary that is almost exactly the same as the one BB put in his post. I compared the words starting with "e" and found our lists are identical, except that his has two words mine doesn't: "escaloped" and "escaloping." After doing a lot of research and testing, I ended up using the 2of12 list from http://wordlist.sourceforge.net/12dicts-readme.html for most of my projects. Like BB, I also use supplemental word lists.

I do several other things in my code that I didn't see in BB's process. First, if a word is capitalized and occurs often enough, it's deemed a proper name and is accepted. Second, if each part of a hyphenated word is a recognized word, the hyphenated version is accepted. (There's a rough sketch of both checks at the end of this post.) In another pass, the code looks for the hyphenated and unhyphenated forms of a word and reports when both appear in the text. This report from a recent run provides an example:

    snowshoes(1) and snow-shoes(19) both appear in text.
    today(1) and to-day(28) both appear in text.
    tomorrow(1) and to-morrow(31) both appear in text.
    tonight(1) and to-night(25) both appear in text.

I also flag any word with mixed case as a suspect.

BB mentioned that it takes more machine time to use a bigger dictionary, which is true but not a problem for me in practice. What really does add computer time is the Levenshtein distance check I run between all the words in the book, flagging any pairs that are close (also sketched below). Here are examples of short-edit-distance suspects from recent books:

    McDonald:Mcdonald (12:4)
    ker-choooo:ker-chooooo (1:1)
    Luke:Lukey (65:2)

I believe the process BB presented in lesson 12 would find all of these same problems; it would just take more effort to go through the suspect lists. If I can chase down suspect words using more than "not in the list" evaluations, then I'll let the computer do the additional checks.

I'm really enjoying BB's lessons on how to digitize a book. There's a lot of good information in there.

--Roger (rfrank)
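Here is a rough Python sketch of the two acceptance checks (proper names and hyphenated parts). It's a simplified illustration, not my actual code; the threshold of 4 capitalized occurrences and the input structures are assumptions made just for the sketch.

    from collections import Counter

    def build_checker(good_words, word_counts, name_threshold=4):
        # good_words: set of dictionary words (e.g. 2of12 plus supplemental lists)
        # word_counts: Counter of every word as it appears in the book
        # name_threshold: how many capitalized occurrences make a "proper name"
        #                 (4 is just a placeholder value)
        def is_accepted(word):
            if word.lower() in good_words:
                return True
            # Capitalized and frequent enough: treat it as a proper name.
            if word[:1].isupper() and word_counts[word] >= name_threshold:
                return True
            # Hyphenated word: accept it if every part is itself a good word.
            if "-" in word:
                parts = [p.lower() for p in word.split("-") if p]
                if parts and all(p in good_words for p in parts):
                    return True
            return False
        return is_accepted

Anything the checker rejects goes onto the suspect list.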
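The hyphenated/unhyphenated report comes from a pass along these lines (again just a sketch; word_counts is assumed to be a Counter over every word in the text, hyphens preserved):

    def hyphenation_report(word_counts):
        # Report words that appear both with and without a hyphen.
        lines = []
        for word in sorted(word_counts):
            if "-" not in word:
                continue
            joined = word.replace("-", "")
            if joined in word_counts:
                lines.append("%s(%d) and %s(%d) both appear in text."
                             % (joined, word_counts[joined], word, word_counts[word]))
        return lines

That's what produces lines like "today(1) and to-day(28) both appear in text."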
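The edit-distance pass is conceptually just a pairwise comparison over the book's distinct words, which is why it costs real machine time on a long book. A naive version might look like this (my real code may differ in details such as the distance threshold):

    from itertools import combinations

    def levenshtein(a, b):
        # Classic two-row dynamic-programming edit distance.
        if len(a) < len(b):
            a, b = b, a
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            current = [i]
            for j, cb in enumerate(b, 1):
                cost = 0 if ca == cb else 1
                current.append(min(previous[j] + 1,          # deletion
                                   current[j - 1] + 1,       # insertion
                                   previous[j - 1] + cost))  # substitution
            previous = current
        return previous[-1]

    def close_pairs(word_counts, max_distance=1):
        # Yield report lines for distinct word pairs within max_distance edits.
        words = sorted(word_counts)
        for a, b in combinations(words, 2):
            if abs(len(a) - len(b)) > max_distance:  # cheap pre-filter
                continue
            if levenshtein(a, b) <= max_distance:
                yield "%s:%s (%d:%d)" % (a, b, word_counts[a], word_counts[b])

The pairwise scan is quadratic in the number of distinct words, which is where the extra computer time goes.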