these grapes are sweet -- lesson #12
this thread explains how to digitize a book quickly and easily. search archive.org for "booksculture00mabiuoft" for the book. i hope you enjoyed the brief interlude between our lessons...

***

ok, we've done all of our "first-pass" cleaning, so now it's time for us to move on to spellcheck. so let's code a spellchecker...

the first thing you need to know is that even though i call it "a spellchecker", it's very different from the spellchecker that you are used to, the one you have inside your wordprocessor. within that environment, a spellchecker has two components: 1) a routine to flag words that are not in the dictionary, and 2) a routine to provide a "suggestion" on the correct spelling. we need the first component. but we don't need the second, since the question about the "correctness" of the spelling is answered by viewing the pagescan, to see what was printed. luckily for us, the first routine is extremely easy to program, while the second is usually much more difficult and complex. so we get to do the simple part and not the hard part. great!

***

the first thing we need is a "dictionary" to use as our standard. again, this is not "a dictionary" per se, because there are no _definitions_ included in it, but merely a long list of words... you can find many dictionaries online. here's the one i use:
if you take a look at that, you'll see it's a simple list of words, each one on a line by itself, arranged in alphabetical order... it's reasonably sized, at just under 800k, with 81,500+ words. i've got another dictionary -- i think it's the one rfrank uses in his routines -- that's 1.5 megs. bigger isn't always better. first, it takes more machine-time to use a bigger dictionary. second, and more importantly, you're more likely to miss a real o.c.r. error that just happens to spell some other word. i even think my dictionary is too big... there are "words" in it that only a scrabble player could love. but it's too much work to refine it, and i don't think it's causing errors, so i keep it... as an interesting sidenote, many of the dictionaries online were formed originally by the "moby" research, which used e-texts from project gutenberg as their underlying corpus.

***

on the face of it, it's brain-dead simple to check a book. first, you need to split the book up into "words", which you can do by changing all the spaces into linebreaks, and then splitting the file by linebreaks. easy enough... then, a looped routine like this seems like it should work:
if eachword not in thedictionary then print eachword
that code has big bugs in it, though. can you spot them? first off, if you looked at the dictionary, it's all-lower-case, so we need to convert eachword into its lower-case version. second, the dictionary contains no punctuation characters, so we need to strip all the leading and trailing punctuation. (anything with internal punctuation is automatically "bad".) but the biggest bug is that misrecognitions like "ng" _will_ be found in the dictionary, e.g., in every word ending in "ing". to prevent such "subwords" from being found, we need to slap a linebreak at the start and at the end of eachword, so it will only register a hit if it matches a complete "line". after those little changes, this routine will "work" just fine. it's not a _good_ routine, though, because it has to search too many words -- from the top of the dictionary down -- for every bookword before finding a hit. and on bad words, it has to search the entire dictionary to know it's "not there".

***

now, because the dictionary is sorted alphabetically, we can improve our routine so that it is significantly more efficient. for instance, we could split the dictionary into an array, with all the words that start with each letter as a separate item... i.e., all "a" words go in one item, all "b" words, all "c" words... then, we'd just have to search that one specific array-item... (for a word starting with "c", we'd search the "c-word" item.) that would cut down on the unnecessary searching a bunch.

***

but we can do even better by sorting the words in our book, so that we will then be dealing with _two_ alphabetical lists. in addition, sorting the words in our book allows us to easily discard the duplicates, so we only search for each word once. here's how we do the check, and why it's so bloody efficient... we have 2 arrays -- our dictionarywords and our bookwords. we set our pointers to start at the top of each list, and we then compare the dictionaryword with the bookword. if the dictionaryword and the bookword match each other, we put the bookword on the list we keep of "found words", and we then move on to the next word in both of our lists. if the dictionaryword is "less than" the bookword -- that is, if it comes _before_ the bookword in an alphabetized list -- then we go on to the next word in our dictionaryword list... finally, if the dictionaryword is "greater than" the bookword, we know we've gone past where the bookword would've been -- if it _were_ in the dictionary -- so it must not be in there, thus we will place that bookword in our "not-found" list, and then move on to the next bookword in our list of bookwords. this is simple logic to code. (easier to program than explain.)

***

so now let's go on to the next step in our thinking about this. the "not-found" list is very important, of course. indeed, you could be forgiven for thinking it is the only one that matters. but you would be wrong; the "found" list is extremely valuable. once we have the initial list of "found" words, we're _cooking_. that's because we now have a "dictionary" for this unique book. this file is the one we will use in our spellchecker from now on. because this file is significantly smaller than the full dictionary, our spellcheck routine will work very speedily, which means we can incorporate it as a regular part of all our cleaning routines. so if/when we inadvertently introduce errors, we will catch them. you may find yourself resisting this idea. keep thinking about it, until you come to the realization it _is_ the smartest thing to do.
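here's a minimal python sketch of the two-list check described above -- the lower-casing, the punctuation-stripping, the de-duping, and the "found"/"not-found" split. this is just an illustration, _not_ the actual grapes110.py code, and the book/output filenames are made up:

import string

def load_words(path):
    # one word per line, already in alphabetical order
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def normalize(word):
    # lower-case it and strip leading/trailing punctuation
    return word.lower().strip(string.punctuation + "\u201c\u201d\u2018\u2019")

def spellcheck(bookfile, dictfile):
    dictionarywords = load_words(dictfile)
    with open(bookfile, encoding="utf-8") as f:
        text = f.read()
    # split the book into words, normalize them, discard duplicates, sort
    bookwords = sorted(set(normalize(w) for w in text.split()))
    bookwords = [w for w in bookwords if w]
    found, notfound = [], []
    d = 0
    for w in bookwords:
        # march the dictionary pointer forward while its word sorts before the bookword
        while d < len(dictionarywords) and dictionarywords[d] < w:
            d += 1
        if d < len(dictionarywords) and dictionarywords[d] == w:
            found.append(w)        # a match: the bookword is in the dictionary
        else:
            notfound.append(w)     # we went past where it would have been
    return found, notfound

found, notfound = spellcheck("book.txt", "regulardictionary.txt")
# the "found" list becomes this book's own little dictionary for later passes
with open("founddictionary.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(found))
print("\n".join(notfound))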
***

there are two other wrinkles i threw into the spellcheck routine. the first is that i included a check for _british_ spellings, which was necessary for this book. (if the file hadn't had any, it wouldn't have mattered. since it did, if i had _not_ included the british dictionary, those words would've been "not-found".) the british words get listed separately, so if there are only a few, you can check to see whether they were simple misrecognitions. when there are many, as you can see here, you are assured that the author utilised such spellings with the fullest of intentions... the british dictionary is here:
the other wrinkle is a file i call "specials.txt". this file contains "special" stuff that isn't in the regulardictionary.txt, but which is not unusual to encounter when digitizing a book, things like roman numerals, arabic numbers, famous names, and so forth, much of which requires that its check be done _case-sensitive_. the "specials" dictionary is here:
take a good look at the items there, so you get a feel for them...

***

ok, so let's run the code:
http://zenmarkuplanguage.com/grapes110.py
http://zenmarkuplanguage.com/grapes110.txt
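the real code is at the links above. purely as an illustration of the extra wrinkles described in this lesson -- the british list, the case-sensitive "specials" check, and the capital/lowercase split -- a sketch might look like the following. it assumes the not-found words kept their original casing; "specials.txt" is the file named above, but the british filename is made up:

def classify(notfound_words, britishset, specialset):
    # britishset holds lower-case british spellings;
    # specialset (names, roman numerals, compounds) is checked case-sensitively
    british, specials, leftovers = [], [], []
    for w in notfound_words:
        if w.lower() in britishset:
            british.append(w)
        elif w in specialset:
            specials.append(w)
        else:
            leftovers.append(w)
    # split what's left by initial capital -- the capitalized ones are usually names
    capitalized = sorted(x for x in leftovers if x[:1].isupper())
    lowercased = sorted(x for x in leftovers if not x[:1].isupper())
    return british, specials, capitalized, lowercased

britishset = set(open("britishdictionary.txt", encoding="utf-8").read().split())
specialset = set(open("specials.txt", encoding="utf-8").read().split())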
we get a list of "notfound", "british", "specials" and "found" words. i have found it useful to split the "notfound" list into 2 segments, depending upon whether the word starts with an initial capital... the ones which do are usually _names_, and (like most of the text) they generally tend to have been recognized by the o.c.r. correctly. and yes indeed, that appears to be exactly the case we have here:
Arcady Dante's De Emerson's Farge's Florio's Gary's Goethe's Hamilton Haymarket Hegel's Hfe Hke Hmits Holinshed's Hterature Hve Hving Lowell's MDCCCCII Mabie Melos Norse Plato's Quincey's Selfexpression Shakespeare's Spenser's Symonds's Tennyson's Text-books They'll Tolstoi's Wordsworth
we do find a few words which do not look like names, but look instead like o.c.r. errors, including a bunch of those starting with a capital-h... further, "selfexpression" clearly just needs to have its hyphen restored.

***

next, words which were notfound that start with a lowercase letter are most typically misrecognitions, and that was true in the current case...
acquirv ageo appletree bhnd certainlytrue clearlyenough comparacively conceit conceived concentration conception conceptions concern concerned concerning concert conclusion conclusions conditlons de dehvered eightysix enhghtenment en« escape.[1 farreaching generahsations h'abitually halfbelieved halfunderstood happyformulation hfe highminded hves illdirected largelydeveloped manysided mediaevalism nomena oftrepeated oi oijly onlylive onlysearching overmatched overweighted personahty phe pious-minded potentiahties powerfullyorganised rationahsing richment selfsacrifice selfsatisfaction throug uhimate unbroken- vitahty wellwrought whlch witliout wo;-ks
most of the words are o.c.r. errors. some of 'em are hyphenates that appear to have lost their hyphen. others are terms that were at the end of a line where a speck was misrecognized as a hyphen, so they were improperly joined (e.g., certainlytrue, clearlyenough). and still others are terms that were end-of-line hyphenates which must _retain_ their hyphen if/when they become rejoined together. but, as with the capitalized words, these words are rather typical. so, in many ways, this looks like the type of list you would expect...

but there were some things about this list which jumped out at me. the first is "they'll". you might have seen that i have _no_ contractions in my dictionary. that's because i've stored them in a separate file, one which i didn't have the current version of the program read in or utilize. just so you know, here's that list of contractions:
so i expected to see a lot more than one contraction appear on this list. i checked why, and found, to my surprise, that this book contains _no_ contractions -- save for that "they'll", which is embedded in a quote -- and that's why nothing else showed up. you learn every book is unique.

another thing which jumped out at me is that i expected more names... there are a few, but usually you see more. ditto with compound-words; i have _none_ in my dictionary, so every one should've been listed here.

the reason for these last two strange phenomena is quite humorous. remember that "specials" dictionary, which i suggested you review? it was formed, in large part, on the basis of my experience when i did this mabie book in an earlier experiment. indeed, look at the words that turned up in the "specials" list, and you'll see a number of names, as well as _all_ of the compound-words which are present in the book, from "book-lover" to "world-spirit".

i briefly considered taking those compound-words out of the "specials" file, just so it wouldn't look like i was "cheating". but i decided against it, because it _does_ illustrate that -- as you do more and more books -- your dictionaries become more and more complete, so you end up with more accurate results... besides, all of those compound-words can easily be checked _inside_ of the program, and verified, if you like, simply by checking each side of the dash to see if they are valid words, and they are, in this book... (the one "exception" might be "self-centred", a problem that we could solve quite easily simply by adding the word to our british dictionary.)

so much for the spellcheck lesson. do you have any questions for me?

-bowerbird
Hi All,

I hate to tell you this, but both of you are wrong: a big dictionary does not increase the look-up time all that much, just the time it takes to load. I will not argue this point, as it is standard Computer Science 101! The problem is how you do the look-up. A properly designed dictionary will only need linear time in relation to the length of the word being looked up!

Of course, if you are using simple lists and simple pattern matching, you are naturally correct. But then again, you are not worried about efficiency in the first place.

regards,
Keith

On 23.09.2011 at 13:32, Roger Frank wrote:
i've got another dictionary -- i think it's the one rfrank uses in his routines -- that's 1.5 megs. bigger isn't always better. first, it takes more machine-time to use a bigger dictionary. second, and more importantly, you're more likely to miss a real o.c.r. error that just happens to spell some other word.
BB is correct that bigger isn't always better. In some of my software tools, I use the SCOWL (Spell Checker Oriented Word Lists) lists. There, you can select increasingly comprehensive levels. For example, "american-words.10" contains only the most common words, while "american-words.95" includes much rarer ones. There are also Canadian and British word lists, plus contractions and abbreviations. Learn more about SCOWL and other word lists at http://wordlist.sourceforge.net/
For code that doesn't use SCOWL, I use a dictionary that is almost exactly the same as the one BB put in his post. I compared the words starting with "e" and found our lists are identical except his has two words mine doesn't: "escaloped" and "escaloping." After doing a lot of research and testing, I ended up using the 2of12 list from http://wordlist.sourceforge.net/12dicts-readme.html for most of my projects.
Like BB, I also use supplemental word lists. I do several other things in my code that I didn't see in BB's process. First, if a word is capitalized and occurs often enough, then it's deemed a proper name and is accepted. Second, if each part of a hyphenated word is a recognized word, then the hyphenated version is accepted. In another part, the code looks for the hyphenated and unhyphenated word and reports if both appear in the text. This report from a recent run provides an example:
snowshoes(1) and snow-shoes(19) both appear in text.
today(1) and to-day(28) both appear in text.
tomorrow(1) and to-morrow(31) both appear in text.
tonight(1) and to-night(25) both appear in text.
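A rough sketch of those two hyphen checks -- not Roger's actual code, just the idea, assuming the word counts have already been built from the cleaned text:

def hyphen_checks(wordcounts, dictionary):
    # wordcounts: a dict (or collections.Counter) mapping each word, as it
    #   appears in the text, to how many times it occurs
    # dictionary: a set of known lower-case words
    accepted, report = [], []
    for word in wordcounts:
        if "-" not in word:
            continue
        # accept the hyphenated word if each part is itself a recognized word
        if all(part in dictionary for part in word.lower().split("-") if part):
            accepted.append(word)
        # report when both the hyphenated and unhyphenated forms occur in the text
        joined = word.replace("-", "")
        if joined in wordcounts:
            report.append("%s(%d) and %s(%d) both appear in text."
                          % (joined, wordcounts[joined], word, wordcounts[word]))
    return accepted, report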
I also flag as a suspect word any word that has mixed case.
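One simple way to write that test (just a sketch):

def is_mixed_case(word):
    # any upper-case letter after the first character makes the word a suspect,
    # e.g. "wHich" or "oF" -- legitimate names like "McDonald" get flagged too,
    # and have to be cleared by another check
    return any(c.isupper() for c in word[1:])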
BB mentioned it takes more machine time to use a bigger dictionary, which is true but not a problem for me in practice. What really does add computer time is a Levenshtein distance check I do between all the words in the book, flagging any that are close. Here are examples of short-edit-distance suspects from recent books:
McDonald:Mcdonald (12:4)
ker-choooo:ker-chooooo (1:1)
Luke:Lukey (65:2)
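A sketch of that kind of all-pairs check -- again not Roger's actual code; the distance threshold of 1 simply matches the examples above:

def levenshtein(a, b):
    # classic dynamic-programming edit distance, computed row by row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def close_pairs(wordcounts, maxdist=1):
    # compare every pair of distinct words in the book; this O(n^2) pass
    # is the part that "really does add computer time"
    words = sorted(wordcounts)
    suspects = []
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if abs(len(a) - len(b)) <= maxdist and levenshtein(a, b) <= maxdist:
                suspects.append("%s:%s (%d:%d)" % (a, b, wordcounts[a], wordcounts[b]))
    return suspects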
I believe that the process BB presented in lesson 12 would find all of these same problems. It would just take more effort to go through the suspect lists. If I can chase down suspect words using more than "not in the list" evaluations, then I'll let the computer do the additional checks.
I'm really enjoying BB's lessons on how to digitize a book. There's a lot of good information in there.
--Roger (rfrank)
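To illustrate Keith's point about look-up cost: with a hash-based structure such as a Python set (or with a trie), checking one word does not get slower as the dictionary grows -- only loading it does. A minimal sketch, not taken from anyone's posted code:

def load_dictionary(path):
    # loading time grows with the size of the word list...
    with open(path, encoding="utf-8") as f:
        return set(line.strip() for line in f if line.strip())

def not_in_dictionary(bookwords, dictionary):
    # ...but each membership test against a set takes (on average) constant time,
    # whether the dictionary holds 80,000 words or 800,000
    return [w for w in bookwords if w not in dictionary]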