these grapes are sweet -- lesson #15

this thread explains how to digitize a book quickly and easily. search archive.org for "booksculture00mabiuoft" for the book.

*** ok, time to start in on our "final checks" for this book... you might recall that we noticed -- right off the bat -- that the o.c.r. was having trouble with the letter "i", especially in conjunction with a lowercase "l" (where the pair was mistaken for a capital-h), and at the start of many words, where a lowercase "i" was often misrecognized as a capital...

we took care of all the "li"-misrecognized-as-capital-h errors when we reviewed the spellcheck, since those words got flagged. but the improperly-cased i-words passed spellcheck just fine, so we need another check to catch them and similar errors...

this check is a relatively easy one to do -- you just look for capitalized words which are not at the start of a sentence... the problem is that this check gives too many false-alarms, because it finds _names_, not just improperly-cased words. on the other hand, you can turn that _flaw_ into an _asset_. you might remember that we _want_ to find names, because we want to add 'em to our customized spellcheck dictionary. it turns out that this routine is a _great_ way to locate names.

what you do is search for mid-sentence capitalized words. when you find one, you check it against the regular dictionary. if it's not there, you put the word on the "possible names" list. if the word _is_ in the dictionary, you flag it as a "possible error". sometimes the error is due to inappropriate upper-casing. other times it's due to missing punctuation, or a period that was misrecognized as a comma, and so on... but most of the time in this book, it was inappropriate casing, so this refinement to our routine is an extremely welcome one. the refinement doesn't remove all of the false-alarms (or all of the "misses", for that matter, if you know signal-detection theory).
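here's a minimal sketch of that routine, in python -- a toy illustration of the idea, not the actual program for this lesson; the little dictionary below is just a stand-in for a real wordlist:

```python
import re

# sketch of the mid-sentence capitalization check: flag capitalized words
# that do not follow sentence-ending punctuation, then split them into
# "possible names" (absent from the dictionary) and "possible errors"
# (present in it). the toy dictionary is a stand-in for a real wordlist.

def check_capitalization(text, dictionary):
    possible_names, possible_errors = set(), set()
    # find a word, some whitespace, then a capitalized word
    for match in re.finditer(r"(\S+)\s+([A-Z][a-z]+)", text):
        prev_word, capped = match.groups()
        if prev_word[-1] in ".!?":        # start of a new sentence -- skip
            continue
        if capped.lower() in dictionary:
            possible_errors.add(capped)   # e.g. "If" from a misread "if"
        else:
            possible_names.add(capped)    # e.g. a surname like "Mabie"
    return possible_names, possible_errors

dictionary = {"we", "read", "today", "and", "if", "you", "ask",
              "the", "poet", "homer", "agrees"}
text = "we read Mabie today, and If you ask, the poet Homer agrees."
names, errors = check_capitalization(text, dictionary)
print(sorted(names))    # → ['Mabie']
print(sorted(errors))   # → ['Homer', 'If']
```

note how "homer" lands on the "possible error" list even though it's really a name -- exactly the false-alarm case discussed below.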
there are a good number of personal names that _do_ appear in a dictionary, such as "smith" and "brown" and "tanner", etc. (there's a humorous example in our book here, which dotes on the many contributions of the classic epic poet "homer", a name that passes spellcheck _nicely_, praise be to babe ruth, albeit probably not in a way the greek storyteller anticipated.) likewise, there are sometimes mid-sentence initial-caps that are _not_ o.c.r. errors. (you might recall that the current book gave honorific capitalization to "idealism", "realism", and such.) but all in all, this is a very worthy routine, which finds a good number of errors without too many misses or false-alarms... so that'll be the first part of our program today.

*** the second part will check the balancing of our doublequotes. you might remember that i fixed the doublequotes manually, in the initial clean-up, since there weren't very many of them. but it's always good to have the computer go back and check your work, to make sure you did it correctly. ("to err is human.") besides, there might be some that the o.c.r. just plain missed. or some that the _printer_ accidentally forgot.

the best check for doublequotes is to split on paragraphs, verify that each one starts with an open-doublequote (if it has any doublequotes at all), then alternates between close-doublequotes and open-doublequotes, preferably ending with a close-doublequote. it is grammatically correct to omit the close-doublequote when a quoted passage continues into the next paragraph. but, for the most part, throughout the file, we expect alternation between an open-doublequote and a close-doublequote, so we can write a routine that actively tests for that specific pattern... to be exact, we will set up a print "toggle", and _output_ any text between an open-doublequote and a close-doublequote.
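a rough sketch of that toggle routine, in python -- again just an illustration of the idea, with warnings folded in; it assumes curly quotes and blank-line-separated paragraphs:

```python
# sketch of the doublequote-balance check: walk each paragraph, toggle at
# every doublequote, print the quoted text, and warn when a quote mark
# arrives with the toggle already in the state it's trying to enter.

def check_doublequotes(text):
    warnings = []
    for num, para in enumerate(text.split("\n\n"), start=1):
        inside = False   # the print "toggle": are we inside a quotation?
        start = 0
        for pos, ch in enumerate(para):
            if ch == "“":
                if inside:
                    warnings.append((num, "two open-doublequotes in a row"))
                inside = True
                start = pos + 1
            elif ch == "”":
                if not inside:
                    warnings.append((num, "close-doublequote with no open"))
                else:
                    print(para[start:pos])   # output the quoted passage
                inside = False
        # note: a paragraph may legitimately end with the toggle still on,
        # when the quotation continues into the next paragraph.
    return warnings

good = "he said, “hello there.”"
bad = "she said, “wait. “hold on.”"
print(check_doublequotes(good + "\n\n" + bad))
# → [(2, 'two open-doublequotes in a row')]
```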
and we'll wave a flag if we encounter two open-doublequotes in a row, or (especially) two close-doublequotes in a row.

*** the third part of this miscellaneous clean-up program will check any ambiguous paragraphing at the top of a page. the o.c.r. at archive.org carries no information about whether a paragraph starts at the top of a page, but we need to know that, so we'll check anything that looks like it might involve one. there are three tip-offs: sentence-ending punctuation at the end of a page; a capital letter at the top of a page; and a short line at the end of a page. any of those tip-offs will prompt us to check the scan...

as you can tell from those tip-offs, this check involves text at the bottom of one page _and_ at the top of the next. for the purpose of getting proper paragraphing, it's enough to view the text at the top of "the next page" -- if it's indented, you have a new paragraph, and if not, you don't -- but you also want to check the text at the bottom of "the one page", because you sometimes find the o.c.r. made an error there. (indeed, it happened once, maybe twice, in the current book.) this check will tell you which pages you need to review; you check the pagescans to make your judgment, then edit the text-file appropriately, if there's a need.

*** the final check performed by today's program will be on the runheads. i include these in the z.m.l. version of an e-book, so i want them to be correct. we expect they'll be repetitive, so we ignore 'em if they are, and print them if/when they change.

*** here's the program:
http://zenmarkuplanguage.com/grapes112.py http://zenmarkuplanguage.com/grapes112.txt
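before we look at the output, here's a rough sketch of how the page-top check might be structured -- an illustration of the three tip-offs, not the program linked above; the page format (a list of lists of lines) and the 40-character short-line threshold are my own assumptions:

```python
# sketch of the page-top paragraphing check: each page is a list of lines,
# and any one of the three tip-offs flags the page-break for a look at
# the scan. the short-line threshold of 40 characters is a guess.

def flag_page_breaks(pages, short_line=40):
    flagged = []
    for i in range(len(pages) - 1):
        bottom = pages[i][-1].rstrip()     # last line of this page
        top = pages[i + 1][0].lstrip()     # first line of the next page
        ends_sentence = bottom.endswith((".", "!", "?"))  # tip-off one
        starts_capped = top[:1].isupper()                 # tip-off two
        is_short = len(bottom) < short_line               # tip-off three
        if ends_sentence or starts_capped or is_short:
            flagged.append((i + 1, i + 2))  # 1-based page-number pair
    return flagged

pages = [
    ["this line of text is certainly more than forty characters long"],
    ["and this line is also clearly more than forty characters long"],
    ["The capital letter here is the tip-off for the previous break."],
]
print(flag_page_breaks(pages))   # → [(2, 3)]
```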
the output is kinda sloppy, so i won't repeat it here. just run the program yourself if you want to see it... let's review our findings on the 4 things we checked...

suffice it to say that there were a lot of places where the o.c.r. misrecognized a lowercase-i as uppercase, and a few other places where we had an inappropriate mid-sentence capped word, for one reason or another.

what the program also revealed is that i'd made some errors when i manually "fixed" spacey-doublequotes... that's the glitch with "manual" fixes -- human error... just a couple of errors, yes. but errors nonetheless...

third, it listed _16_ possible page-top paragraphs, about half of which will require a reference to the scan. those on 4-5, 7-8, 45-46, 111-112, and 112-113 do not, because we can tell the trigger was fired unnecessarily. a few of the rest obviously constitute a new paragraph, specifically those on 168-169, 171-172, and 251-252. and we're tipped off to an o.c.r. error at the bottom of 208.

finally, we see that a couple of runheads have misrecognitions in them, so we'll have to fix those too...

*** based on this output, i did corrections to the text-file:
*** so here's the same program, run on the new "005" file:
http://zenmarkuplanguage.com/grapes113.py http://zenmarkuplanguage.com/grapes113.txt
as you can see from the output, i didn't catch all of the bugs when i did my edits. again with the human error. but this teaches us something that's extremely important. after you do edits -- automatic or manual, but especially the manual kind -- you need to re-do the check, to ensure that (1) you made all of the changes you intended to make, and (2) you didn't make any other unintended changes.

*** anyway, so i did the corrections _again_ to the text-file:
and here's the check:
http://zenmarkuplanguage.com/grapes114.py http://zenmarkuplanguage.com/grapes114.txt
ok, it finally looks clean. whew. thus ends today's lesson. -bowerbird