these grapes are sweet -- lesson #09

this thread explains how to digitize a book quickly and easily. search archive.org for "booksculture00mabiuoft" for the book.

***

i showed you the list of edits which i made in my "first pass", all of which i found using various searches in my text-editor.

the auto-change that stripped the trailing space off every line touched... well... every line, quite obviously, so in that sense, every single line of the original o.c.r. text-file was changed, albeit in a terribly uninteresting way.

if we look at non-trivial auto-changes, there were far fewer... we auto-changed floating punctuation, but that was merely a couple lines for most of the marks, with the exception of (1) linebreak doublequotes, where 19 lines were fixed, and (2) semicolons, where some 444+ instances were corrected, which is a big number, but it remains an uninteresting edit.

mid-line floating doublequotes got changed manually, with roughly 62 of those corrected. in most books, i would _not_ correct those glitches by hand, but since there were so few... but again, all of these are boring in the extreme...

probably the most interesting set of edits were those which involved "weird" characters. using reg-ex, it's easy enough to locate such characters in a single search, if you choose... which means that it's easy to write a program to list the lines that contain any such characters. that's what we'll do today, so you can actually see the lines located with that routine...

this is a useful program that you'll want to run on any file when you begin to work on it, to see what problems it has. it helps you zero in on the most common o.c.r. problems... and at several points later in your workflow, you'll want to run it again, just to make sure that you haven't introduced any of these weird characters into your text by accident...

this program is a modification of the last one, adding in an array of characters that we want to flag, then looping through the lines of the file and listing any of those lines which contain a flag character. you can run this program, or grab the code, here:
http://zenmarkuplanguage.com/grapes103.py
http://zenmarkuplanguage.com/grapes103.txt
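the heart of that program is something like this little sketch... (my flag-list and filename here are merely illustrative, not the actual grapes103 code, so grab the real thing from the links above):

    # an array of characters that we want to flag -- illustrative only,
    # you will tune this list for each book that you work on...
    flagchars = ['$', '*', '^', '%', '~', '@', '#', '{', '}', '\\']

    # loop through the lines of the file, listing any line
    # which contains a flag character (with its linenumber)
    with open('booksculture.txt', errors='replace') as f:
        for linenumber, line in enumerate(f, start=1):
            line = line.rstrip('\n')
            if any(ch in line for ch in flagchars):
                print(linenumber, line)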
***

again, that code lists the lines by their order in the file... (it displays the linenumber for your convenience as well, since many text-editors will let you jump to a certain line. if you don't know of such a feature, you should look for it.)

now, we might also want to group the lines according to the flag character, so that we see all the lines that have a dollarsign (for instance) listed together. so we'll rework the code a bit. you can run this new program, or view the code, here:
http://zenmarkuplanguage.com/grapes104.py
http://zenmarkuplanguage.com/grapes104.txt
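conceptually, the rework is just an outer loop over the flag terms... a sketch (again, my flag-list is illustrative, and note that flag "terms" can be multi-character, like "11" or "v/"):

    # flag terms (multi-character terms are fine) -- illustrative only
    flagterms = ['$', '*', '^', '~', '11', 'ii', 'v/']

    with open('booksculture.txt', errors='replace') as f:
        lines = [line.rstrip('\n') for line in f]

    # group the output by flag term, and also report
    # any flag term which produced _no_ instances...
    for term in flagterms:
        hits = [(n, line) for n, line in enumerate(lines, start=1)
                if term in line]
        print('--- flag:', repr(term))
        if not hits:
            print('    (no instances)')
        for n, line in hits:
            print('   ', n, line)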
this produces a nice grouped list, plus it also shows us the flagged terms which produced _no_ instances from the file. as you can see, most of the listed lines _are_ o.c.r. errors. so, for the most part, once we've cleaned up these glitches, whenever we run this program again later in our workflow, it should produce very few lines.

it _will_, however, produce _some_ lines. at least it should. for instance, we're flagging "11", because in a lot of books, o.c.r. misrecognizes the end of words like _they'll_ or _we'll_ as an _11_... (you'd think it'd be smarter than that, but no.) indeed, as you can see from the output of our little program, there was one such misrecognition in this file, on line #4938. it's also the case, however, that there will be a lot of lines where "11" is the correct recognition, like the pagenumber for page 11, for instance. and for pages 111 and 211 too. d'uh. don't forget 112, 113, 114, 115, 116, 117, 118, and 119...

but we see no "111" in this file, as the pagenumber for 111 was misrecognized as "iii". we also have no 115, and we see (because we searched for "ii" as well) that 116 and 118 were misrecognized, with the "11" seen as a pair of lowercase i's. also, pagenumber 113 has a signature number we'll erase... (this informs us we need a search to find that phenomenon, which will be a reg-ex looking for number-space-number.)

likewise with that "ii". few english words have an "ii" pair, but imagine if you were digitizing a book about "skiing"... the goal is to catch o.c.r. errors but minimize false-alarms.

our grouped listing also shows, in this text, that asterisks were usually a misrecognized doublequote (or singlequote). and we see we had a few glitches where "w" was seen as "v/". also note that we had problems involving the insertioncaret. of specific import is that some of these caret problems were on pagenumbers, which will be particularly important to us...

***

the linenumbers of the file, which we've been listing, are good. but there's better positional information inside the text itself, i.e., pagenumbers (plus, by extension, per-page linenumbers). the pagenumbers give us the vital ability to _display_the_scan_ for any particular line of text, so we wanna correct 'em _now_, so we start benefiting immediately from that great capability.

so let us write some code to help us correct the pagenumbers... that's rather simple to do, because our expectation is that they will march along for us in a sequence that's entirely predictable. so we'll just look closely for any departures from that sequence.

***

the routine that helps clean up pagenumbers is found here:
http://zenmarkuplanguage.com/grapes105.py
http://zenmarkuplanguage.com/grapes105.txt
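before you look at the real code, here's the gist in sketch form... i'm assuming pagenumbers sit alone on short lines and start at 1, and i don't try to resynchronize after a bad pagenumber, so this is simpler than the actual grapes105:

    import re

    expected = 1    # assuming the first pagenumber in the file is 1
    with open('booksculture.txt', errors='replace') as f:
        for linenumber, line in enumerate(f, start=1):
            line = line.strip()
            if not line or len(line) > 8:
                continue    # pagenumbers live on short lines
            m = re.match(r'\d+', line)
            if m and m.group() == str(expected):
                if line != str(expected):
                    # line "reduces down to" the expected number, but
                    # carries garbage (e.g. a trailing signature number,
                    # the number-space-number pattern mentioned above)
                    print(linenumber, 'garbage on pagenumber:', repr(line))
                expected += 1
            else:
                # a departure from the next-expected-number,
                # or just a short line worth eyeballing...
                print(linenumber, 'xxxxxxxx expected', expected,
                      '-- got:', repr(line))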
you can puzzle through the code later if you like, but for now just look at the output. what you'll see is a line of "xxxxxxxx" wherever we get a departure from the next-expected-number. the program also shows us any short lines that it encounters, which helps us immensely to see where the problems might lie. it also lists the actual contents of the line, not just the number which it expected to find. sometimes a line "reduces down to" the expected number, but it has some garbage characters in it, and we're gonna wanna eliminate those garbage characters...

and as before, this is a program that we're gonna wanna re-run every so often, to make sure we didn't mess up any pagenumbers. indeed, we might want to leave this routine in all our programs, and have it be silent as long as everything checks out fine, but have it bring the error to our attention if it suddenly finds one.

so, let's run this code against the file that i edited, to check it.
http://zenmarkuplanguage.com/grapes106.py
http://zenmarkuplanguage.com/grapes106.txt
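and while we're at it, here's the kind of silent routine i suggested a moment ago, the sort you could paste into every later program... the function name is mine, and it only handles the naive case (bare numbers alone on a line), so footnote numbers will confuse it, as we're about to see:

    import re

    def check_pagenumbers(lines, start=1):
        """return a list of (linenumber, message) problems;
        an empty list means everything checked out fine."""
        problems = []
        expected = start
        for linenumber, line in enumerate(lines, start=1):
            line = line.strip()
            if re.fullmatch(r'\d+', line):   # a bare number on its own line
                if line == str(expected):
                    expected += 1
                else:
                    problems.append((linenumber,
                        'expected %s, got %s' % (expected, line)))
        return problems

    with open('booksculture.txt', errors='replace') as f:
        lines = f.readlines()
    for linenumber, message in check_pagenumbers(lines):
        print(linenumber, message)    # silent when everything is fine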
oops! as we can see from that run, i fixed most of the pagenumber problems, but not _all_ of them. there's still a bug on pagenumber 61, and pagenumber 183 is still missing-in-action, and pagenumber 228 was misrecognized by the o.c.r. as "328", so we have to fix that...

we also have a listing for linenumber 5957 and linenumber 5959, where a footnote number is confusing our routine. we can fix the code so that it ignores such footnote numbers in the future.

***

so far, we've taken the general approach of _searching_for_ the particular character(s) which might be an o.c.r. misrecognition. that's an ok approach if we know what we should look for, but sometimes we don't. so we can take the _opposite_ approach, and say "these are _good_ characters, so ignore all of them, but show me any characters which are _not_ these good characters". that's the code we've got here:
http://zenmarkuplanguage.com/grapes107.py
http://zenmarkuplanguage.com/grapes107.txt
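in sketch form, the opposite approach looks like this... the "good" set below is just my guess at a sensible whitelist, so tune it for your book:

    # characters we consider "good" -- illustrative, tune to taste
    good = set('abcdefghijklmnopqrstuvwxyz'
               'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
               '0123456789'
               ' .,;:!?\'"()[]-_')

    with open('booksculture.txt', encoding='utf-8', errors='replace') as f:
        for linenumber, line in enumerate(f, start=1):
            line = line.rstrip('\n')
            # show me any characters which are _not_ good characters
            bad = set(line) - good
            if bad:
                print(linenumber, sorted(bad), line)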
what we find when we run that is some high-bit characters, which we were not expecting to find in 7-bit o.c.r. output...

remember how i told you that archive.org was "losing" their em-dashes, and i had to badger them for years to fix that? well, when they "fixed" it, they did so by introducing a utf8 character into their otherwise-all-7-bit files. what idiots! these are technocrats of the very worst kind, ones who will introduce unnecessary complexity into the simplest of files. oh well, it's just another global change, so it's fairly easy to ignore this particular obstacle the technocrats threw at us...

besides em-dashes, there are 2 other lines that show up... those were also high-bit characters that got put in the file. but we've isolated them, so it's easy enough to delete 'em...

***

now that we've got search-related routines in the program, we might as well code a first pass at a general search capability. that's here:
http://zenmarkuplanguage.com/grapes108.py
http://zenmarkuplanguage.com/grapes108.txt
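stripped to its essence, the search is just this (the prompt is my own framing, not necessarily how grapes108 does it):

    with open('booksculture.txt', errors='replace') as f:
        lines = [line.rstrip('\n') for line in f]

    # simple-minded substring search, listing linenumber and line
    term = input('search for: ')
    for linenumber, line in enumerate(lines, start=1):
        if term in line:
            print(linenumber, line)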
now we can search the file for any word we want... this is pretty simplistic, but it's much farther along than we were before we coded it, so let's enjoy our progress...

-bowerbird