how to create a spellcheck workflow

ok, well gee... i had gotten the firm impression that rfrank had read the d.p. forum thread where the development of their "wordcheck" was discussed (thoroughly -- 30 pages). d.p. calls it "wordcheck", but it's basically spellcheck. so why did i think he had read it? because he incorporated several of the things which i had suggested (to no good effect) in that thread... i won't bother reciting all of those particulars, since he might well have come up with the ideas himself, and nobody really cares anyway, not even me...

but i mention it because now i have been convinced that roger must _not_ have read that thread, based on some current discussions on his fadedpage site... or, if he did read it, he didn't "get it", but given those current discussions, he might now be more receptive. so i will run through a quick little "how-to" refresher on how to design and build a spellcheck functionality, and incorporate it into the overall workflow of a book, so roger has the benefit of my wisdom here... ;+)

as you'll see, this will hit a very wide variety of topics. note that what we're talking about here is _primarily_ executed during _preprocessing_, but there are some follow-on thoughts that apply to the proofing as well.

***

0. set o.c.r. parameters correctly. don't dehyphenate! name your files wisely. look both ways before crossing.

1. the first thing you should do with your o.c.r. results is to make the few global changes you can do _blindly_. the very first one is to strip trailing spaces from all lines. another will be to replace two spaces with a single space. yet another is to change a linebreak-doublequote-space combination to eliminate the space, since it's superfluous. likewise, change all cases of space-doublequote-linebreak. you get the drift. oh, by the way, don't do what d.p. does! _retain_ the runheads, and pagenumbers; you need 'em...

2. the next thing you should do is clean up the runheads. this has little bearing on our general "spellcheck" topic, but i include it here because it's always your second step.

3. the third thing on your list is to fix all paragraphing... again, not much to do with spellcheck, but it _is_ step #3.

4. now we can focus quite specifically on spellcheck stuff. we will take the o.c.r. and run it through a program i wrote which pulls out all the words _not_ present in its dictionary. i think rfrank has his own program that does the same thing, or something similar enough. i'll make mine available too. this is the first draft of your "bad-words" list for this book. (note this is not the same as how d.p. defines "bad words".) in this regard, use a good dictionary in this first check here. (this is something that rfrank hasn't done correctly thus far.) the dictionary i use is quite good, and it can be found here:
http://z-m-l.com/go/regulardictionary.txt
just so you have a feel for the output from this program, i have posted it for the "sitka" book we've been discussing:
http://z-m-l.com/go/sitka/sitka-reversedictionary.txt
that output was generated in 5 seconds, so it's pretty fast...

5. you'll see, on viewing your list of supposed "bad-words", that a bunch are not "bad-words" at all. some of 'em will be character-names, or jargon specific to your particular book. some will be hyphenated fragments, some compound-words. you can delete these words from the list now, if you want, but you shouldn't necessarily feel a great need to do that. if you're looking at the list of words from the "sitka" book, you will also notice that i separated the initial-caps words from all-lowercase ones. there is a good reason for that. due to proper names, _most_ initial-cap words are correct, whereas most of the all-lowercase words are _incorrect_... separating the lists makes it easier to focus your attention.
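just to make step 4 concrete, here's a rough sketch of that dictionary-check logic in python. to be clear, this is _not_ the actual program i use, just an illustration of the idea, and the filenames ("book.txt", "badwords-capped.txt", and so on) are placeholders you'd swap for your own:

    import re
    from collections import Counter

    # the dictionary: one word per line
    with open("regulardictionary.txt") as f:
        dictionary = set(line.strip().lower() for line in f if line.strip())

    # pull every word out of the o.c.r. text and count occurrences
    with open("book.txt") as f:
        counts = Counter(re.findall(r"[A-Za-z][A-Za-z']*", f.read()))

    # anything not in the dictionary is a candidate "bad-word"
    bad = [w for w in counts if w.lower() not in dictionary]

    # separate the initial-caps words from the all-lowercase ones --
    # due to proper names, most of the initial-caps words will be fine
    capped = sorted(w for w in bad if w[0].isupper())
    lowered = sorted(w for w in bad if w[0].islower())

    with open("badwords-capped.txt", "w") as f:
        f.write("\n".join(capped) + "\n")
    with open("badwords-lowercase.txt", "w") as f:
        f.write("\n".join(lowered) + "\n")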
6. your dictionary-check program should also spit out the _frequency_ of each bad-word. you'll use that information to cull some words from this list of "bad-words". this you _will_ want to do, most definitely, so _sort_ on frequency. i fought tooth-and-nail on this with d.p. (i lost, of course), but you can take this to the bank: as long as there are a mere 4-plus occurrences of a specific string in the o.c.r., you can (i.e., should) delete it from this list of "bad-words". yes, some of the words _might_indeed_ be bad, but unless you're positive of that, you should delete 'em from the list. and yes, this means that those words will _not_ be flagged. but trust me, if there's 4 or more occurrences of a scanno, your proofers will find at least _one_. and whenever they find _one_ "bad-word" that wasn't on that "bad-word" list, you will automatically search the rest of the book, and thus find those _other_ occurrences as well, so you can fix them. i did _not_ include frequency information in my "sitka" list, because i didn't want to make everything so bloody obvious, and because i want you to discover the importance of that frequency data for yourself, so it burns itself in your brain.

7. when you narrow your focus to the words that are _not_ in the dictionary, and which occur only two or three times in the book, you'll find you can be very productive fixing errors. for many of the words, it'll be obvious what they should be... building a tool that will take you _immediately_ to each word, plus show the scan alongside, will turn you into a _machine,_ an awesome and devastatingly efficient error-fixing machine. for this very first pass, i recommend you look only at words you're confident are scanning errors. (they're easy to spot.) on your next pass, you can look at more questionable words. also pay attention to words with several variants that'll thus sort next to each other. (see the asterisks in the "sitka" list.) it'll almost certainly be the case that one variant is a scanno. (97 times out of 100, it is the one with fewer occurrences.) also, in a system where you're gonna have proofers doing a word-by-word proofing, don't even bother to look at words which look kinda reasonable. and don't ever bother to view any words with only _one_ occurrence. those will be flagged, and it's no less efficient to have the proofer see if it's correct than to have _you_ see if it's correct. you wanna be efficient in preprocessing. efficiency is _the_point_ of preprocessing!
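and here's about how little code the frequency business from step 6 takes. again, just a sketch, re-using the placeholder filenames from the sketch above; the 4-occurrence threshold is the rule of thumb i argued for:

    import re
    from collections import Counter

    with open("book.txt") as f:
        counts = Counter(re.findall(r"[A-Za-z][A-Za-z']*", f.read()))

    with open("badwords-lowercase.txt") as f:
        bad = [line.strip() for line in f if line.strip()]

    # anything with 4-plus occurrences gets culled from the list,
    # because if it really is a scanno, a proofer will catch one
    # occurrence and you'll search out the rest of them anyway
    kept = [w for w in bad if counts[w] < 4]

    # print what's left, most frequent first, with its frequency
    for word in sorted(kept, key=lambda w: (-counts[w], w)):
        print(counts[word], word)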
8. the next thing you want to look at is compound-words. my tool also separates compound-words into their own listing. as per usual, perusal of the compound-words will show that some are obviously correct, others are obviously incorrect, and a bunch where a judgment can only happen with a scan. the other thing about compound-words, which you will want your tools to handle, is a check against the rest of the book, to find any other instances of the compound where the parts are separated by a space (two words) or joined (one word)... that information will help you decide how to treat the word.

9. the other thing you're gonna check is end-line hyphenates. remember that i told you _not_ to have the o.c.r. rejoin them. my philosophy is to _retain_ the end-line hyphenates through my final product, but i'm not arguing for that position _here_. you can rejoin the end-line hyphenates if you want to do that. just don't do it until _after_ all of the proofing is done, because having original linebreaks makes a page much easier to proof. and besides, in determining whether the rejoined word contains a dash or not, we need to have uncompromised data. if your o.c.r. program destroys that data (by joining the word in the way that _its_ dictionary dictates), then you might just be doing a disservice to the way that the _book_ did things... for your spellcheck, however, you can ignore all of that stuff. internal to your spellcheck tool, rejoin the end-line hyphenate by eliminating the linebreak, and test the resultant compound. if it passes, fine. if not, try again with the dash removed too. if it _still_ doesn't pass, flag _both_ portions of the compound. (note that if you are using _my_ dictionary, mentioned above, it contains no compounds, so you would skip that first check.)

10. so at this point in time, you have a great "bad-words" list. that is half your battle. (that's right, just _half_.) so now you will make your "good-words" list. to do this, just run the text through your dictionary-checker using your "bad-words" list as the dictionary. thus, the output will be all of the words in your text which are _not_ included on your "bad-words" list. (or you can just use my tool, which can also create this list.) from now on, this "good-words" list will be your dictionary... got that? you're not using the huge dictionary file any more. you'll use the much-more-compact "good-words" list instead. so now you have a "bad-words" list and a "good-words" list, which -- taken together -- comprise all the words in your book. there are other jobs that you'll do during preprocessing, but this is all the spellcheck work that's needed during that stage.

11. so now we will move on from preprocessing to proofing, which takes us to "flagging" -- highlighting possible errors... a word should be flagged if it appears in the "bad-words" list. a word _might_ be flagged if it's not on the "good-words" list, perhaps in a different way; for instance, yellow instead of red. notice that this is a slightly more nuanced way to do flagging. rather than flag _everything_ that _might_ be wrong, we are gonna flag _only_ the things we really _suspect_ are wrong... underflagging is better than overflagging, because too many flags make us complacent; we start to check only the flags. it's impossible for your mind to ignore the fact that most of the flags are not actually errors, so it comes to expect that if something is _not_ flagged, it certainly won't be an error. but if you underflag, and the proofer spots an error that was _not_ flagged, it primes them to be attentive to everything...
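if it helps, the flagging decision in step 11 boils down to something about this small. (a sketch only; "red" and "yellow" here just stand for however your proofing interface actually highlights things.)

    def flag_for(word, bad_words, good_words):
        # bad_words and good_words are the two sets built during preprocessing
        if word in bad_words:
            return "red"      # we really suspect this one is wrong
        if word not in good_words:
            return "yellow"   # merely unknown, so flag it more gently
        return None           # it's on the good-words list; leave it alone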
12. the other thing that's extremely important here is that, when the book is finished, we should have resolved all the flags. every word in the book will be there on the "good-words" list, and the "bad-words" list will have shrunk until it disappeared. that is, every bad-word will have been checked, and if it was "ok", it will have been moved to the "good-words" list, and if it was not "ok", it will have been _changed_, which also removes it from the "bad-words" list. the words do _not_ have to be physically removed from the "bad-words" list, but once every word in the book checks out against the "good-words" list, the "bad-words" list has effectively been eliminated. this complete "good-words" list is _useful,_ because we can run the full book against spellcheck at any time, and it will come out totally clean. so we do that check periodically, so we know that we haven't compromised the book's accuracy.

oh, and just so you'll know, it's quite easy to write the code that does this check. you simply sort the words in the book, eliminating duplicates; then you sort the "good-words" list (if it's not already sorted), and eliminate its duplicates too (there shouldn't be any); then the 2 outputs should be _identical._

this lets us envision the proofing process as movement of all words on the "bad-words" list to the "good-words" list. (put that image in your head; the visualization has utility.) to help facilitate that movement, you need to make it easy for proofers to put words on the "good-words" list, which is why -- on my proofing site -- i let them add all the words for a single page to the "good-words" list with one button-click. (another option is a button-click for each individual word.) the flip-side is that, in order to have a page be considered as "finished", all flagged words on that page _must_ be cleared. remember, words move from "bad-words" to "good-words".

that's good enough for now. any questions on this? ;+)

-bowerbird
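p.s. just to show how easy that sort-and-compare check is, here's a sketch of it. (same caveat as the earlier sketches: the filenames are placeholders, not my actual setup.)

    import re

    with open("book.txt") as f:
        # every distinct word in the book, sorted, duplicates eliminated
        book_words = sorted(set(re.findall(r"[A-Za-z][A-Za-z']*", f.read())))

    with open("goodwords.txt") as f:
        # the good-words list, sorted, duplicates eliminated (there shouldn't be any)
        good_words = sorted(set(line.strip() for line in f if line.strip()))

    # when the proofing is truly done, the two outputs are identical
    if book_words == good_words:
        print("clean -- every word in the book is on the good-words list")
    else:
        for w in sorted(set(book_words) - set(good_words)):
            print("not on the good-words list:", w)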

On 08-Apr-2010 04:42, Bowerbird@aol.com wrote:
_retain_ the runheads, and pagenumbers; you need 'em...
I imagine you'll eventually explain, but what use is preserving or fixing the page headings? If they can be mechanically fixed, they were not worth much. Do you include or exclude running headings in your word-count dictionary analysis? Does it matter?

============================================================
Gardner Buchanan <gbuchana@teksavvy.com>
Ottawa, ON
FreeBSD: Where you want to go. Today.