these grapes are sweet -- lesson #12
this thread explains how to digitize a book quickly and easily. search archive.org for "booksculture00mabiuoft" for the book. i hope you enjoyed the brief interlude between our lessons...

***

ok, we've done all of our "first-pass" cleaning, so now it's time for us to move on to spellcheck. so let's code a spellchecker...

the first thing you need to know is that even though i call it "a spellchecker", it's very different from the spellchecker that you are used to, the one you have inside your wordprocessor. within that environment, a spellchecker has two components: 1) a routine to flag words that are not in the dictionary, and 2) a routine to provide a "suggestion" on the correct spelling. we need the first component. but we don't need the second, since the question about the "correctness" of the spelling is answered by viewing the pagescan, to see what was printed. luckily for us, the first routine is extremely easy to program, while the second is usually much more difficult and complex. so we get to do the simple part and not the hard part. great!

***

the first thing we need is a "dictionary" to use as our standard. again, this is not "a dictionary" per se, because there are no _definitions_ included in it, but merely a long list of words... you can find many dictionaries online. here's the one i use:
if you take a look at that, you'll see it's a simple list of words, each one on a line by itself, arranged in alphabetical order... it's reasonably sized, at just under 800k, with 81,500+ words. i've got another dictionary -- i think it's the one rfrank uses in his routines -- that's 1.5 megs. bigger isn't always better. first, it takes more machine-time to use a bigger dictionary. second, and more importantly, you're more likely to miss a real o.c.r. error that just happens to spell some other word. i even think my dictionary is too big... there are "words" in it that only a scrabble player could love. but it's too much work to refine it, and i don't think it's causing errors, so i keep it... as an interesting sidenote, many of the dictionaries online were formed originally by the "moby" research, which used e-texts from project gutenberg as their underlying corpus.

***

on the face of it, it's brain-dead simple to check a book. first, you need to split the book up into "words", which you can do by changing all the spaces into linebreaks, and then splitting the file by linebreaks. easy enough... then, a looped routine like this seems like it should work:
if eachword not in thedictionary then print eachword
that code has big bugs in it, though. can you spot them? first off, if you looked at the dictionary, it's all-lower-case, so we need to convert eachword into its lower-case version. second, the dictionary contains no punctuation characters, so we need to strip all the leading and trailing punctuation. (anything with internal punctuation is automatically "bad".) but the biggest bug is that misrecognitions like "ng" _will_ be found in the dictionary, e.g., in every word ending in "ing". to prevent such "subwords" from being found, we need to slap a linebreak at the start and at the end of eachword, so it will only register a hit if it matches a complete "line". after those little changes, this routine will "work" just fine. it's not a _good_ routine, though, because it has to search too many words -- from the top of the dictionary down -- for every bookword before finding a hit. and on bad words, it has to search the entire dictionary to know it's "not there".

***

now, because the dictionary is sorted alphabetically, we can improve our routine so that it is significantly more efficient. for instance, we could split the dictionary into an array, with all the words that start with each letter as a separate item... i.e., all "a" words go in one item, all "b" words, all "c" words... then, we'd just have to search that one specific array-item... (for a word starting with "c", we'd search the "c-word" item.) that would cut down on the unnecessary searching a bunch.

***

but we can do even better by sorting the words in our book, so that we will then be dealing with _two_ alphabetical lists. in addition, sorting the words in our book allows us to easily discard the duplicates, so we only search for each word once. here's how we do the check, and why it's so bloody efficient... we have 2 arrays -- our dictionarywords and our bookwords. we set our pointers to start at the top of each list, and we then compare the dictionaryword with the bookword. if the dictionaryword and the bookword match each other, we put the bookword on the list we keep of "found words", and we then move on to the next word in both of our lists. if the dictionaryword is "less than" the bookword -- that is, if it comes _before_ the bookword in an alphabetized list -- then we go on to the next word in our dictionaryword list... finally, if the dictionaryword is "greater than" the bookword, we know we've gone past where the bookword would've been -- if it _were_ in the dictionary -- so it must not be in there, thus we will place that bookword in our "not-found" list, and then move on to the next bookword in our list of bookwords. this is simple logic to code. (easier to program than explain.)

***

so now let's go on to the next step in our thinking about this. the "not-found" list is very important, of course. indeed, you could be forgiven for thinking it is the only one that matters. but you would be wrong; the "found" list is extremely valuable. once we have the initial list of "found" words, we're _cooking_. that's because we now have a "dictionary" for this unique book. this file is the one we will use in our spellchecker from now on. because this file is significantly smaller than the full dictionary, our spellcheck routine will work very speedily, which means we can incorporate it as a regular part of all our cleaning routines. so if/when we inadvertently introduce errors, we will catch them. you may find yourself resisting this idea. keep thinking about it, until you come to the realization it _is_ the smartest thing to do.
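here's a minimal python sketch of the two-list check described above -- the lower-casing, the punctuation-stripping, the de-duping, and the "found"/"not-found" split. this is just an illustration, _not_ the actual grapes110.py code, and the book/output filenames are made up:

import string

def load_words(path):
    # one word per line, already in alphabetical order
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

def normalize(word):
    # lower-case it and strip leading/trailing punctuation
    return word.lower().strip(string.punctuation + "\u201c\u201d\u2018\u2019")

def spellcheck(bookfile, dictfile):
    dictionarywords = load_words(dictfile)
    with open(bookfile, encoding="utf-8") as f:
        text = f.read()
    # split the book into words, normalize them, discard duplicates, sort
    bookwords = sorted(set(normalize(w) for w in text.split()))
    bookwords = [w for w in bookwords if w]
    found, notfound = [], []
    d = 0
    for w in bookwords:
        # march the dictionary pointer forward while its word sorts before the bookword
        while d < len(dictionarywords) and dictionarywords[d] < w:
            d += 1
        if d < len(dictionarywords) and dictionarywords[d] == w:
            found.append(w)        # a match: the bookword is in the dictionary
        else:
            notfound.append(w)     # we went past where it would have been
    return found, notfound

found, notfound = spellcheck("book.txt", "regulardictionary.txt")
# the "found" list becomes this book's own little dictionary for later passes
with open("founddictionary.txt", "w", encoding="utf-8") as out:
    out.write("\n".join(found))
print("\n".join(notfound))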
***

there are two other wrinkles i threw into the spellcheck routine. the first is that i included a check for _british_ spellings, which was necessary for this book. (if the file hadn't had any, it wouldn't have mattered. since it did, if i had _not_ included the british dictionary, those words would've been "not-found".) the british words get listed separately, so if there are only a few, you can check to see whether they were simple misrecognitions. when there are many, as you can see here, you are assured that the author utilised such spellings with the fullest of intentions... the british dictionary is here:
the other wrinkle is a file i call "specials.txt". this file contains "special" stuff that isn't in the regulardictionary.txt, but which is not unusual to encounter when digitizing a book, things like roman numerals, arabic numbers, famous names, and so forth, much of which requires that its check be done _case-sensitive_. the "specials" dictionary is here:
take a good look at the items there, so you get a feel for them...

***

ok, so let's run the code:
http://zenmarkuplanguage.com/grapes110.py
http://zenmarkuplanguage.com/grapes110.txt
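the real code is at the links above. purely as an illustration of the extra wrinkles described in this lesson -- the british list, the case-sensitive "specials" check, and the capital/lowercase split -- a sketch might look like the following. it assumes the not-found words kept their original casing; "specials.txt" is the file named above, but the british filename is made up:

def classify(notfound_words, britishset, specialset):
    # britishset holds lower-case british spellings;
    # specialset (names, roman numerals, compounds) is checked case-sensitively
    british, specials, leftovers = [], [], []
    for w in notfound_words:
        if w.lower() in britishset:
            british.append(w)
        elif w in specialset:
            specials.append(w)
        else:
            leftovers.append(w)
    # split what's left by initial capital -- the capitalized ones are usually names
    capitalized = sorted(x for x in leftovers if x[:1].isupper())
    lowercased = sorted(x for x in leftovers if not x[:1].isupper())
    return british, specials, capitalized, lowercased

britishset = set(open("britishdictionary.txt", encoding="utf-8").read().split())
specialset = set(open("specials.txt", encoding="utf-8").read().split())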
we get a list of "notfound", "british", "specials" and "found" words. i have found it useful to split the "notfound" list into 2 segments, depending upon whether the word starts with an initial capital... the ones which do are usually _names_, and (like most of the text) they generally tend to have been recognized by the o.c.r. correctly. and yes indeed, that appears to be exactly the case we have here:
Arcady Dante's De Emerson's Farge's Florio's Gary's Goethe's Hamilton Haymarket Hegel's Hfe Hke Hmits Holinshed's Hterature Hve Hving Lowell's MDCCCCII Mabie Melos Norse Plato's Quincey's Selfexpression Shakespeare's Spenser's Symonds's Tennyson's Text-books They'll Tolstoi's Wordsworth
we do find a few words which do not look like names, but look instead like o.c.r. errors, including a bunch of those starting with a capital-h... further, "selfexpression" clearly just needs to have its hyphen restored.

***

next, words which were notfound that start with a lowercase letter are most typically misrecognitions, and that was true in the current case...
acquirv ageo appletree bhnd certainlytrue clearlyenough comparacively conceit conceived concentration conception conceptions concern concerned concerning concert conclusion conclusions conditlons de dehvered eightysix enhghtenment en« escape.[1 farreaching generahsations h'abitually halfbelieved halfunderstood happyformulation hfe highminded hves illdirected largelydeveloped manysided mediaevalism nomena oftrepeated oi oijly onlylive onlysearching overmatched overweighted personahty phe pious-minded potentiahties powerfullyorganised rationahsing richment selfsacrifice selfsatisfaction throug uhimate unbroken- vitahty wellwrought whlch witliout wo;-ks
most of the words are o.c.r. errors. some of 'em are hyphenates that appear to have lost their hyphen. others are terms that were at the end of a line where a speck was misrecognized as a hyphen, so they were improperly joined (e.g., certainlytrue, clearlyenough). and still others are terms that were end-of-line hyphenates which must _retain_ their hyphen if/when they become rejoined together. but, as with the capitalized words, these words are rather typical. so, in many ways, this looks like the type of list you would expect...

but there were some things about this list which jumped out at me. the first is "they'll". you might have seen that i have _no_ contractions in my dictionary. that's because i've stored them in a separate file, one which i didn't have the current version of the program read in or utilize. just so you know, here's that list of contractions:
so i expected to see a lot more than one contraction appear on this list. i checked why, and found, to my surprise, that this book contains _no_ contractions -- save for that "they'll", which is embedded in a quote -- and that's why nothing else showed up. you learn every book is unique.

another thing which jumped out at me is that i expected more names... there are a few, but usually you see more. ditto with compound-words; i have _none_ in my dictionary, so every one should've been listed here.

the reason for these last two strange phenomena is quite humorous. remember that "specials" dictionary, which i suggested you review? it was formed, in large part, on the basis of my experience when i did this mabie book in an earlier experiment. indeed, look at the words that turned up in the "specials" list, and you'll see a number of names, as well as _all_ of the compound-words which are present in the book, from "book-lover" to "world-spirit".

i briefly considered taking those compound-words out of the "specials" file, just so it wouldn't look like i was "cheating". but i decided against it, because it _does_ illustrate that -- as you do more and more books -- your dictionaries become more and more complete, so you end up with more accurate results... besides, all of those compound-words can easily be checked _inside_ of the program, and verified, if you like, simply by checking each side of the dash to see if they are valid words, and they are, in this book... (the one "exception" might be "self-centred", a problem that we could solve quite easily simply by adding the word to our british dictionary.)

so much for the spellcheck lesson. do you have any questions for me?

-bowerbird
Hi All,

I hate to tell you this, but both of you are wrong: a big dictionary does not increase the look-up time all that much, just the time it takes to load. I will not argue this point, as it is standard Computer Science 101! The problem is how you do the look-up. A properly designed dictionary will only need linear time in relation to the length of the word being looked up!

Of course, if you are using simple lists and simple pattern matching, you are naturally correct. But then again, you are not worried about efficiency in the first place.

regards,
Keith

On 23.09.2011 at 13:32, Roger Frank wrote:
i've got another dictionary -- i think it's the one rfrank uses in his routines -- that's 1.5 megs. bigger isn't always better. first, it takes more machine-time to use a bigger dictionary. second, and more importantly, you're more likely to miss a real o.c.r. error that just happens to spell some other word.
BB is correct that bigger isn't always better. In some of my software tools, I use the SCOWL (Spell Checker Oriented Word Lists) lists. There, you can select increasingly comprehensive levels. For example, "american-words.10" contains only the most common words, while "american-words.95" includes much rarer ones. There are also Canadian and British word lists, plus contractions and abbreviations. Learn more about SCOWL and other word lists at http://wordlist.sourceforge.net/
For code that doesn't use SCOWL, I use a dictionary that is almost exactly the same as the one BB put in his post. I compared the words starting with "e" and found our lists are identical except his has two words mine doesn't: "escaloped" and "escaloping." After doing a lot of research and testing, I ended up using the 2of12 list from http://wordlist.sourceforge.net/12dicts-readme.html for most of my projects.
Like BB, I also use supplemental word lists. I do several other things in my code that I didn't see in BB's process. First, if a word is capitalized and occurs often enough, then it's deemed a proper name and is accepted. Second, if each part of a hyphenated word is a recognized word, then the hyphenated version is accepted. In another part, the code looks for the hyphenated and unhyphenated word and reports if both appear in the text. This report from a recent run provides an example:
snowshoes(1) and snow-shoes(19) both appear in text.
today(1) and to-day(28) both appear in text.
tomorrow(1) and to-morrow(31) both appear in text.
tonight(1) and to-night(25) both appear in text.
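A rough sketch of those two hyphen checks -- not Roger's actual code, just the idea, assuming the word counts have already been built from the cleaned text:

def hyphen_checks(wordcounts, dictionary):
    # wordcounts: a dict (or collections.Counter) mapping each word, as it
    #   appears in the text, to how many times it occurs
    # dictionary: a set of known lower-case words
    accepted, report = [], []
    for word in wordcounts:
        if "-" not in word:
            continue
        # accept the hyphenated word if each part is itself a recognized word
        if all(part in dictionary for part in word.lower().split("-") if part):
            accepted.append(word)
        # report when both the hyphenated and unhyphenated forms occur in the text
        joined = word.replace("-", "")
        if joined in wordcounts:
            report.append("%s(%d) and %s(%d) both appear in text."
                          % (joined, wordcounts[joined], word, wordcounts[word]))
    return accepted, report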
I also flag as a suspect word any word that has mixed case.
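One simple way to write that test (just a sketch):

def is_mixed_case(word):
    # any upper-case letter after the first character makes the word a suspect,
    # e.g. "wHich" or "oF" -- legitimate names like "McDonald" get flagged too,
    # and have to be cleared by another check
    return any(c.isupper() for c in word[1:])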
BB mentioned it takes more machine time to use a bigger dictionary, which is true but not a problem for me in practice. What really does add computer time is a Levenshtein distance check I do between all the words in the book, flagging any that are close. Here are examples of short-edit-distance suspects from recent books:
McDonald:Mcdonald (12:4)
ker-choooo:ker-chooooo (1:1)
Luke:Lukey (65:2)
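A sketch of that kind of all-pairs check -- again not Roger's actual code; the distance threshold of 1 simply matches the examples above:

def levenshtein(a, b):
    # classic dynamic-programming edit distance, computed row by row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def close_pairs(wordcounts, maxdist=1):
    # compare every pair of distinct words in the book; this O(n^2) pass
    # is the part that "really does add computer time"
    words = sorted(wordcounts)
    suspects = []
    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if abs(len(a) - len(b)) <= maxdist and levenshtein(a, b) <= maxdist:
                suspects.append("%s:%s (%d:%d)" % (a, b, wordcounts[a], wordcounts[b]))
    return suspects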
I believe that the process BB presented in lesson 12 would find all of these same problems. It would just take more effort to go through the suspect lists. If I can chase down suspect words using more than "not in the list" evaluations, then I'll let the computer do the additional checks.
I'm really enjoying BB's lessons on how to digitize a book. There's a lot of good information in there.
--Roger (rfrank)
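To illustrate Keith's point about look-up cost: with a hash-based structure such as a Python set (or with a trie), checking one word does not get slower as the dictionary grows -- only loading it does. A minimal sketch, not taken from anyone's posted code:

def load_dictionary(path):
    # loading time grows with the size of the word list...
    with open(path, encoding="utf-8") as f:
        return set(line.strip() for line in f if line.strip())

def not_in_dictionary(bookwords, dictionary):
    # ...but each membership test against a set takes (on average) constant time,
    # whether the dictionary holds 80,000 words or 800,000
    return [w for w in bookwords if w not in dictionary]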