these grapes are sweet -- lesson #15

this thread explains how to digitize a book quickly and easily. search archive.org for "booksculture00mabiuoft" for the book.

*** ok, time to start in on our "final checks" for this book... you might recall that we noticed -- right off the bat -- that the o.c.r. was having trouble with the letter "i", especially in conjunction with a lowercase "l" (where the pair was mistaken for a capital-h), and at the start of many words, where a lowercase "i" was often misrecognized as a capital...

we took care of all the "li"-misrecognized-as-capital-h errors when we reviewed the spellcheck, since those words got flagged. but the improperly-cased i-words passed spellcheck just fine, so we need another check to catch them and similar errors...

this check is a relatively easy one to do -- you just look for capitalized words which are not at the start of a sentence... the problem is that this check gives too many false-alarms, because it finds _names_, not just improperly-cased words. on the other hand, you can turn that _flaw_ into an _asset_. you might remember that we _want_ to find names, because we want to add 'em to our customized spellcheck dictionary. it turns out that this routine is a _great_ way to locate names.

what you do is search for mid-sentence capitalized words. when you find one, you check it against the regular dictionary. if it's not there, you put the word on the "possible names" list. if the word _is_ in the dictionary, you flag it as a "possible error". sometimes the error is due to inappropriate upper-casing. other times it's due to missing punctuation, or a period that was misrecognized as a comma, and so on... but most of the time in this book, it was inappropriate casing, so this refinement to our routine is an extremely welcome one. the refinement doesn't remove all of the false-alarms (or all of the "misses", for that matter, if you know signal-detection theory).
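here's a minimal sketch of that routine, in python -- a toy illustration of the idea, not the actual program for this lesson; the little dictionary below is just a stand-in for a real wordlist:

```python
import re

# sketch of the mid-sentence capitalization check: flag capitalized words
# that do not follow sentence-ending punctuation, then split them into
# "possible names" (absent from the dictionary) and "possible errors"
# (present in it). the toy dictionary is a stand-in for a real wordlist.

def check_capitalization(text, dictionary):
    possible_names, possible_errors = set(), set()
    # find a word, some whitespace, then a capitalized word
    for match in re.finditer(r"(\S+)\s+([A-Z][a-z]+)", text):
        prev_word, capped = match.groups()
        if prev_word[-1] in ".!?":        # start of a new sentence -- skip
            continue
        if capped.lower() in dictionary:
            possible_errors.add(capped)   # e.g. "If" from a misread "if"
        else:
            possible_names.add(capped)    # e.g. a surname like "Mabie"
    return possible_names, possible_errors

dictionary = {"we", "read", "today", "and", "if", "you", "ask",
              "the", "poet", "homer", "agrees"}
text = "we read Mabie today, and If you ask, the poet Homer agrees."
names, errors = check_capitalization(text, dictionary)
print(sorted(names))    # → ['Mabie']
print(sorted(errors))   # → ['Homer', 'If']
```

note how "homer" lands on the "possible error" list even though it's really a name -- exactly the false-alarm case discussed below.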
there are a good number of personal names that _do_ appear in a dictionary, such as "smith" and "brown" and "tanner", etc. (there's a humorous example in our book here, which dotes on the many contributions of the classic epic poet "homer", a name that passes spellcheck _nicely_, praise be to babe ruth, albeit probably not in a way the greek storyteller anticipated.) likewise, there are sometimes mid-sentence initial-caps that are _not_ o.c.r. errors. (you might recall that the current book gave honorific capitalization to "idealism", "realism", and such.) but all in all, this is a very worthy routine, which finds a good number of errors without too many misses or false-alarms... so that'll be the first part of our program today.

*** the second part will check the balancing of our doublequotes. you might remember that i fixed the doublequotes manually, in the initial clean-up, since there weren't very many of them. but it's always good to have the computer go back and check your work, to make sure you did it correctly. ("to err is human.") besides, there might be some that the o.c.r. just plain missed. or some that the _printer_ accidentally forgot.

the best check for doublequotes is to split on paragraphs, verify that each one starts with an open-doublequote (if it has any doublequotes at all), then alternates between close-doublequotes and open-doublequotes, preferably ending with a close-doublequote. it is grammatically correct to omit the close-doublequote when a quoted passage continues into the next paragraph. but, for the most part, throughout the file, we expect alternation between an open-doublequote and a close-doublequote, so we can write a routine that actively tests for that specific pattern... to be exact, we will set up a print "toggle", and _output_ any text between an open-doublequote and a close-doublequote.
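a rough sketch of that toggle routine, in python -- again just an illustration of the idea, with warnings folded in; it assumes curly quotes and blank-line-separated paragraphs:

```python
# sketch of the doublequote-balance check: walk each paragraph, toggle at
# every doublequote, print the quoted text, and warn when a quote mark
# arrives with the toggle already in the state it's trying to enter.

def check_doublequotes(text):
    warnings = []
    for num, para in enumerate(text.split("\n\n"), start=1):
        inside = False   # the print "toggle": are we inside a quotation?
        start = 0
        for pos, ch in enumerate(para):
            if ch == "“":
                if inside:
                    warnings.append((num, "two open-doublequotes in a row"))
                inside = True
                start = pos + 1
            elif ch == "”":
                if not inside:
                    warnings.append((num, "close-doublequote with no open"))
                else:
                    print(para[start:pos])   # output the quoted passage
                inside = False
        # note: a paragraph may legitimately end with the toggle still on,
        # when the quotation continues into the next paragraph.
    return warnings

good = "he said, “hello there.”"
bad = "she said, “wait. “hold on.”"
print(check_doublequotes(good + "\n\n" + bad))
# → [(2, 'two open-doublequotes in a row')]
```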
and we'll wave a flag if we encounter two open-doublequotes in a row, or (especially) two close-doublequotes in a row.

*** the third part of this miscellaneous clean-up program will check any ambiguous paragraphing at the top of a page. the o.c.r. at archive.org carries no information about whether a paragraph starts at the top of a page, but we need to know that, so we'll check anything that looks like it might involve one. there are three tip-offs: sentence-ending punctuation at the end of a page; a capital letter at the top of a page; and a short line at the end of a page. any of those tip-offs will prompt us to check the scan...

as you can tell from those tip-offs, this check involves text at the bottom of one page _and_ at the top of the next. for the purpose of getting proper paragraphing, it's enough to view the text at the top of "the next page" -- if it's indented, you have a new paragraph, and if not, you don't -- but you also want to check the text at the bottom of "the one page", because you sometimes find the o.c.r. made an error there. (indeed, it happened once, maybe twice, in the current book.) this check will tell you which pages you need to review; you check the pagescans to make your judgment, then edit the text-file appropriately, if there's a need.

*** the final check performed by today's program will be on the runheads. i include these in the z.m.l. version of an e-book, so i want them to be correct. we expect they'll be repetitive, so we ignore 'em if they are, and print them if/when they change.

*** here's the program:
http://zenmarkuplanguage.com/grapes112.py http://zenmarkuplanguage.com/grapes112.txt
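before we look at the output, here's a rough sketch of how the page-top check might be structured -- an illustration of the three tip-offs, not the program linked above; the page format (a list of lists of lines) and the 40-character short-line threshold are my own assumptions:

```python
# sketch of the page-top paragraphing check: each page is a list of lines,
# and any one of the three tip-offs flags the page-break for a look at
# the scan. the short-line threshold of 40 characters is a guess.

def flag_page_breaks(pages, short_line=40):
    flagged = []
    for i in range(len(pages) - 1):
        bottom = pages[i][-1].rstrip()     # last line of this page
        top = pages[i + 1][0].lstrip()     # first line of the next page
        ends_sentence = bottom.endswith((".", "!", "?"))  # tip-off one
        starts_capped = top[:1].isupper()                 # tip-off two
        is_short = len(bottom) < short_line               # tip-off three
        if ends_sentence or starts_capped or is_short:
            flagged.append((i + 1, i + 2))  # 1-based page-number pair
    return flagged

pages = [
    ["this line of text is certainly more than forty characters long"],
    ["and this line is also clearly more than forty characters long"],
    ["The capital letter here is the tip-off for the previous break."],
]
print(flag_page_breaks(pages))   # → [(2, 3)]
```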
the output is kinda sloppy, so i won't repeat it here. just run the program yourself if you want to see it... let's review our findings on the 4 things we checked...

suffice it to say that there were a lot of places where the o.c.r. misrecognized a lowercase-i as uppercase, and a few other places where we had an inappropriate mid-sentence capped word, for one reason or another.

what the program also revealed is that i'd made some errors when i manually "fixed" spacey-doublequotes... that's the glitch with "manual" fixes -- human error... just a couple of errors, yes. but errors nonetheless...

third, it listed _16_ possible page-top paragraphs, about half of which will require a reference to the scan. those on 4-5, 7-8, 45-46, 111-112, and 112-113 do not, because we can tell the trigger was fired unnecessarily. a few of the rest obviously constitute a new paragraph, specifically those on 168-169, 171-172, and 251-252. and we're tipped off to an o.c.r. error at the bottom of 208.

finally, we see that a couple of runheads have misrecognitions in them, so we'll have to fix those too...

*** based on this output, i did corrections to the text-file:
*** so here's the same program, run on the new "005" file:
http://zenmarkuplanguage.com/grapes113.py http://zenmarkuplanguage.com/grapes113.txt
as you can see from the output, i didn't catch all of the bugs when i did my edits. again with the human error. but this teaches us something that's extremely important. after you do edits -- automatic or manual, but especially the manual kind -- you need to re-do the check, to ensure that (1) you made all of the changes you intended to make, and (2) you didn't make any other unintended changes.

*** anyway, so i did the corrections _again_ to the text-file:
and here's the check:
http://zenmarkuplanguage.com/grapes114.py http://zenmarkuplanguage.com/grapes114.txt
ok, it finally looks clean. whew. thus ends today's lesson. -bowerbird