these grapes are sweet -- lesson #04

this thread explains how to quickly and easily digitize a book... i did the book "books and culture", by mabie, which you can find by searching archive.org for "booksculture00mabiuoft".

***

first off, a little note for the future lessons in this thread... in order to do your digitization, you're going to need to use a text-editor or word-processor that can do reg-ex searches. on the mac, i use a program called "tex-edit" (not "text-edit", but "tex-edit", where "tex" stands for texas) by tom bender...

if you've been confused by regular expressions in the past, and are still gun-shy, don't worry about it. i promise that we won't become obsessed with the overly-complex details... but you need to be able to search for specific simple patterns, such as "lowercase followed by space followed by uppercase". i'll specify that combo as "lowercase-space-uppercase", and you'll need to know how to translate it into a reg-ex search. for tex-edit, just to give you an example, that would be this:
[a-z] [A-Z]

and you would specify that the search is to be case-sensitive. (tex-edit has a button for that, but your system might differ.)
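if your editor can't do this, here's a minimal sketch of the same lowercase-space-uppercase search using python's `re` module. the function name is made up for illustration, and the sample lines are borrowed from the page-9 excerpt later in this lesson; none of this is how tex-edit works internally.

```python
import re

# lowercase letter, one literal space, uppercase letter.
# python's re is case-sensitive by default, which is what we want here.
LOWER_SPACE_UPPER = re.compile(r"[a-z] [A-Z]")

def find_hits(lines):
    """Return (line_number, line) pairs for lines containing the pattern."""
    return [(n, line) for n, line in enumerate(lines, start=1)
            if LOWER_SPACE_UPPER.search(line)]

sample = [
    "the relentless enemy of all that Is",   # hit: "t I"
    "supreme interest. Time, which is",      # no hit: a period precedes the space
    "ways, and for all men, the object of",  # no hit: no uppercase letter at all
]

for number, line in find_hits(sample):
    print(number, line)
```

only the first sample line is reported, since it's the only one where a lowercase letter sits directly before the space.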
you also need to learn how to search for "whitespace", which might be a regular space, but might also be a linebreak or tab. (it's very important that you be able to search for a linebreak; some word-processors, even advanced ones, will not do that.)

finally, another handy thing about a reg-ex search is that you can specify a search for multiple items, which is a time-saver. so you could do one search that would find a forward-slash and a backward-slash and a left-bracket and a right-bracket and a cent-sign and a tilde and a downward-sloping accent and an @-sign and a #-sign and a $-sign and a %-sign and a ^-sign and a & and an asterisk and "=" and an underbar. all in one search. which is faster than doing 16 searches...

my pledge, in this series, is to do all of the digitization with _only_ my word-processor. if i do need a specialized tool, i will program it, and make it available online for you to see, and use, if you would like to replicate the work i am doing. (which would be a great way to learn how to do it yourself!)

***

in the last lesson, we discussed "auto-cleaning" -- changes that are made to the o.c.r. file globally, without monitoring. auto-cleaning can fix a huge number of errors -- at times literally _thousands_ of them -- so it's incredibly efficient, and its great usefulness should never be underappreciated. for instance, in our mabie book, auto-cleaning made fixes to 543 lines, out of the 5700 in the body-text, right at 10%. those are fixes that cost you zero time or energy. right on! but that's just the first type of cleaning you will want to do...

***

so now let's discuss the second type of cleaning you can do. after having done auto-cleaning, i turn to "high-probability" searches that quickly locate o.c.r. errors which are common. over the years, i've posted messages to educate this list about an amazing number of routines that help find o.c.r. errors. once again, here i am, doing the same thing another time...
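circling back to the whitespace search and the one-search-for-many-characters idea mentioned above: in reg-ex terms those are `\s` and a character class. here's a rough sketch in python. reading "downward-sloping accent" as the backtick is my assumption, and the sample text is invented for the demo.

```python
import re

# one character class standing in for the separate searches:
# / \ [ ] cent-sign ~ ` @ # $ % ^ & * = _
SUSPECTS = re.compile(r"[/\\\[\]¢~`@#$%^&*=_]")

# \s matches a regular space, a tab, or a linebreak
LOWER_WS_UPPER = re.compile(r"[a-z]\s[A-Z]")

def suspect_lines(text):
    """Return the lines that contain at least one suspect character."""
    return [line for line in text.splitlines() if SUSPECTS.search(line)]

sample = "a clean line\nprice was 5¢ ~ or so\nsee page [12]\n"
print(suspect_lines(sample))

# the whitespace version also matches across a linebreak
print(bool(LOWER_WS_UPPER.search("end of line\nNext line")))
```

the class is easy to extend: any character you keep seeing as o.c.r. debris can be added between the brackets.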
right about now, you're probably starting to get pissed off with my pedagogical tone... you might be saying "boy, that bowerbird certainly thinks he's smart, doesn't he?" or maybe you'd like to take me down a few notches, knock me off my high horse.

that's why the people on this list _rarely_learn._ they're too busy getting all riled up because of my "tone"... if you woulda been paying attention, instead, you would've learned quite a lot over the years, and you'd be thanking me.

lucky for you, though, someone once created a collection of my tips, translating them into regular-expression searches, and posting them to the web somewhere... i'm not going to tell you where -- mostly because i never wrote that down -- but maybe someone did, and is willing to share it with you.

for now, as usual, i'll tell you the routines that would make the "minimal set" for finding the unique errors in this book.

***

there is a great assortment of cleaning routines that i do which are ones where i manually approve whether a change will be made or not.

a very good example of this is the search for "11" (eleven). the replacement-term will be "ll" (lowercase letter "l", twice). here are the lines this search locates in the mabie book:
Chapter 11.
112
8 113
114
117
119
"They *11 take me there some day,"
211
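a listing like this one can also be checked for what it _doesn't_ contain. the sketch below, in python, takes the eight hits above and asks which page numbers containing "11" never showed up on a line of their own. the function names are invented, and treating "a line made only of digits and spaces" as a pagenumber line is a simplifying assumption.

```python
import re

def standalone_numbers(lines):
    """Collect numbers from lines made up only of digits and spaces."""
    found = set()
    for line in lines:
        if re.fullmatch(r"[\d ]+", line.strip()):
            found.update(int(tok) for tok in line.split())
    return found

def silent_pages(hits, term):
    """Page numbers containing `term` that the search could have hit but did not."""
    found = standalone_numbers(hits)
    last = max(found)  # only check up to the highest page number actually seen
    return [p for p in range(1, last + 1)
            if term in str(p) and p not in found]

hits = [
    "Chapter 11.", "112", "8 113", "114", "117", "119",
    '"They *11 take me there some day,"', "211",
]
print(silent_pages(hits, "11"))
```

on this listing the sketch reports 11, 110, 111, 115, 116, and 118. note that it also flags 110; whether that page simply carries no printed number is something only the scan can settle.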
as i expect whenever i run this search, there are a number of "false-alarms" that turn up, where the number 11 is correct... nonetheless, this is still a very productive search, because it helps to locate some problems.

first of all, that "chapter 11" isn't actually chapter 11, it's chapter 2, in roman numerals, so that's an o.c.r. error that needs to be two uppercase "i" letters.

next, the "8 113" line should really just be "113", because it's the pagenumber for page 113. but the "8" _was_ on the page, yes indeed, as the signature-number, so this wasn't "an error". but since we don't need the signature-number, we'll delete it.

(books are usually printed as a collection of 16-page sections, which are called "signatures"... if you look closely at the spine of a book, along the top edge, you'll see these signatures and how they were sewn or glued together. back in the olden days, printers numbered the signatures, so as to keep 'em straight.)

next, in the "take me there" line, there _is_ a misrecognition of "11" (eleven) for "ll" (two lowercase "l"), which is _exactly_ what we were seeking with this search. but there is _also_ an error with the apostrophe plus a floating-space before it, so this line needs hand-editing, not just a search-and-replace...

finally, there are some "dogs that did not bark" in this search. where is the pagenumber for page 11? it shoulda turned up. likewise with the pagenumbers for 111, 115, 116, and 118... the fact that they are missing means we should go look there.

so this is an instance of a search yielding a rather high payoff, both in the lines that it found (3 of the 8 of them had errors) and also (once you grasp it) in the lines that it did _not_ find, since it indirectly pointed to 5 other errors in the book's text.

***

another neat example is the _mid-line_ floating-doublequote.

take a look at one of the sets of lines from our last lesson -- where we fixed floating-doublequotes at the start of the line.
we see that most also have mid-line floating-doublequotes:

"thrift of time," which brings ripe-
"Chronicles " and North's trans-
"Still studying Dante ? " said the
"Divine Comedy " or in Goethe's
"Faust " for the first time discovers
"Master and Man," is one of those
"Odyssey " are of more importance
"In any museum," says Mr. La
"Anna Karenina " leaves no reader
if we do a search for this space-doublequote-space combo, there will be two very-common replacement-terms, namely _either_ space-doublequote _or_ doublequote-space.

your first instinct might be just to find the next occurrence, and then edit it by hand. but that's not the right approach... if at all possible, you want to resist the urge to hand-edit. it's much smarter, faster, and easier to search-and-replace.

so, in this case, you do the search-and-replace _two_times_. first, you specify doublequote-space as the replacement-term, and you approve all cases where that is the correct replacement (this is the right fix for the lines given above), and skip the rest. then the second time, you specify space-doublequote instead, and approve all cases where _that_ is the correct replacement. (since you already fixed the other cases, it should be all cases, but there might be some exceptions, so you still need to look.)

some people here will undoubtedly recall that i have a program that can correctly fix such floating-doublequotes automatically. but, as i said above, i'm trying to work only in a word-processor. since there were only 59 cases of these floating-doublequotes, the task was manageable enough that i could handle it manually.

***

a lot of times, people ask me to provide a "full set of searches"; they want to have a list that can find every error in every book...

you _can_ get something that comes close. but you pay a price...

almost every search-routine will have some "false-alarms"... (remember that we had a couple when we searched for "11"?) if a routine finds enough errors, the false-alarms are "worth it". but still, it takes time and energy to process those false-alarms. so you only want to run the routines that will locate errors...

i know what you're thinking: "i'm running the routines so i can _find_ the errors, because i don't know what errors are there, so how do i know which routines will be the ones to find errors?"
it _does_ seem to be a bit of a knot, but the escape is simple: if you listen to the text, you will hear it telling you the errors.

let me explain... when you're correcting one error, you'll notice other errors...

when you are doing your non-automatic cleaning, you need to have a certain mindset. on the one hand, you must be focused. when you start a specific change-routine, you have to finish it; don't let yourself be distracted by other errors that will pop up.

on the other hand, don't put blinders on yourself. you must be open and attentive to the text. when you're correcting an error, be mindful of other text around the error, and any errors in it...

whether you correct the other errors -- or leave 'em for later -- is immaterial. what is important is that (1) you retain your focus on the process of continuing what you had been doing _and_that_ (2) you make a note of the nature of the other error you spotted, so that you can later do a specific search for that _type_ of error. because if an error exists, it's likely that other such errors exist. and if you "listen" to the text, it'll tell you about all of its errors...

another thing that people have asked is about any "order" they should perform the search-routines in. it doesn't matter much. because each search-routine will lead to other search-routines, if you're listening to the text. so eventually you'll find all errors.

in this regard, however, i will say that a simple _spellcheck_ is quite likely to find the greatest number of errors most quickly.

one thing about spellcheck is that it isn't much of a challenge... i like to clean up text because it's something akin to a _puzzle_, and i enjoy the game. a brain-dead spellcheck ruins all my fun...

but if you just want to speed up the job, do a spellcheck.

be aware, though, that spellcheck will _not_ find everything, so just because you spellchecked doesn't mean you're "done".
there are lots of digitization errors that can survive spellcheck, and i'll discuss one of them right after this.

before concluding with spellcheck, though, i should tell you that the secret of spellcheck is to add all the "correct" words to your dictionary, so subsequent occurrences are not flagged. this means proper names, of course, but it also means things like archaic spellings (if you're retaining them in your output), and slang, and dialect contractions, and all of that other crap.

this means that it's best if your spellcheck program can use a _customized_dictionary_. (because you don't wanna be adding archaic spellings to the dictionary you use for other documents.)

but once you've added all the "unique" words to your dictionary, spellcheck is a breeze, because -- unless you screw stuff up -- spellcheck will run on the whole document without a single stop. which means you can do it often, to ensure you didn't screw up.

but remember -- a clean spellcheck doesn't mean a clean text.

***

there was an interesting situation in this book. if you want to follow along, you can pull up the o.c.r. file:
http://ia700300.us.archive.org/1/items/booksculture00mabiuoft/booksculture00...

(that u.r.l. will probably be split, so you'll need to rejoin it.)

i'll draw your attention to page 9, where we find these lines:
thoughts, acts, dispositions, and pas-
sions of humanity. There is no
getting to the bottom of Shake-
speare, for Instance, or to the end of
his possibilities of enriching and In-
teresting us, because he deals habit-
ually with that primary substance
of human life which remains sub-
stantially unchanged through all the
mutations of racial, national, and
personal condition, and which Is al-
ways, and for all men, the object of
supreme interest. Time, which is
the relentless enemy of all that Is
partial and provisional. Is the friend
of Shakespeare, because It continu-
ally brings to the student of his
more specifically, look at these lines:
his possibilities of enriching and In-
...
the relentless enemy of all that Is
partial and provisional. Is the friend
of Shakespeare, because It continu-
you can see that we have a problem with the o.c.r. program turning small "i" words into capitalized ones. this is a glitch i've seen before, and i've even discussed it on this list before. it's easy to understand how the dot on the lowercase-i could mislead the o.c.r. into reading it as the solid line of an uppercase-i...

(and, of course, just to follow up on the point made above, spellcheck doesn't flag this as an error -- its spelled write!)

so we'll need to fix these glitches. specifically, we need to change lowercase-space-capital-i to lowercase-space-lowercase-i, replacing the mistaken capital. in most books, this replacement won't introduce many errors, if any at all, so i have auto-changed this before (albeit rarely).

so, since it was a common mistake in the o.c.r. for this book, with a count of 69 such instances, should we auto-change it?

i've appended the list of instances. take a look at it, and see whether this change would be good to do as an auto-change, or whether we'd need to approve each as a monitored change.

go ahead and make your assessment now, i'll wait...

so, did you think it would work as an auto-change?

starting at the top, we see auto-correction would apparently work just fine, including for all those in the upper two-thirds. so you might've got tired of looking, and said "auto-change".

however, toward the bottom, we see we may have a problem. and yes, a glance at one of the scans tells us we do indeed...

specifically, this author has given "honorific capitalization" to the terms "idealism", "idealist", "idealists", and "ideal"... (elsewhere in the book, he contrasts idealism with realism, so "realism", "realist", and so on, also receive capitalization. and, for the record, so does "nature"; mabie's a tree-hugger.)

so, in this book, this replacement as an auto-change would have meant more than a typical number of introduced errors.

and not just auto-change. sometimes even when you are "monitoring" a search-and-replace, you go on auto-pilot, and just start pressing "replace" without actually _looking_ to consciously confirm that the change is really necessary. so be aware of that, and guard against it. keep your focus, and remember to _also_ be aware of the surrounding text...

***

so this capital-i thing was an interesting twist in this book. but it gets even better! take a close look at the middle line:
the relentless enemy of all that Is
partial and provisional. Is the friend
of Shakespeare, because It continu-
this error would _not_ have been isolated by our find routine, because the capitalized-word "is" was preceded by a _period_.

it just so happens that -- in the reality of the print-book -- that "is" word was preceded by a _comma_ -- not a period -- but the o.c.r. _also_ misrecognized that comma as a period...

so my error-finding routine wouldn't have isolated this line, because it's looking for cases of lowercase-space-uppercase. we just happened to see the error in this line because we're looking at the identical error in the lines which bordered it.

such are the intricacies of correcting an o.c.r. file... :+)

***

in our next lesson, we'll start getting down to the nitty-gritty of finding and fixing the errors in the text, so stay tuned...

-bowerbird

69 instances of lowercase-space-capital-i
speare, for Instance, or to the end of
his possibilities of enriching and In-
personal condition, and which Is al-
the relentless enemy of all that Is
of Shakespeare, because It continu-
is to Invite an intimacy with It which
is to Invite an intimacy with It which
The man who Is bent on getting
is by no means Irresponsible ; It may
is by no means Irresponsible ; It may
of giving Itself up to idle reverie ;
"Spectator" and turned them Into
The man who is thrown Into constant
association with Inferior work either
of the genius of a race or an ageo It
to enter Into the thought of the un-
Idea and feeling so deftly woven to-
gether, and follow each back to Its
man of culture shares In no small
Demeter this Is precisely what Mr.
" TT Is undeniable," says Matthew
teachers Impart the breath of life
by giving us inspiration and Impulse.
paintings and of a gallery of Italian
he receives it into himself, and In the
and he is sufficient unto himself In
'X'HERE IS a general agreement
tion, " Home for Incurable Children."
other words, culture had wrought Its
tinct from himself. There Is no easy
because It Is a matter of growth.
because It Is a matter of growth.
and possessing them in the Imagina-
education ; but genius Is not only in-
the Immense majority of men any ef-
must not be Inferred that a man Is
must not be Inferred that a man Is
laws, habits, or Institutions. It is the
itself out In material and social rela-
a main or central movement In
prominence, Imbedded themselves
These events are described In narra-
tive form, with episodes, Incidents,
be found In the pages of Homer,
Idealism, that, like other spiritual
The true Idealist has his feet firmly
as much as Idealism from weak practi-
The essence of Idealism is the ap-
he defined the Ideal as the comple-
rational Idealism sees all these things,
that the Idealist asks is that life shall
This is true Idealism ; but it is
acterises true Idealism, and which
makes all the greater writers Idealists
of the representative Idealists of his
bits of true Idealism which has been
Idealists in the breadth of their vision
Idealism which not only discovers
the reality of the Ideal as Plato saw it is
A rational Idealism is, therefore,
of the Ideal in personal and social life
stroying a false Idealism, convention-
true Idealism which the world has yet
Arden and on Prosperous Island there
Idealism which grows out of the vision
such an Idealism, and no spiritual
great Idealists is one of the greatest
ter *'As You Like It" must give
of "As You Like It" is not only
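to make the auto-change question concrete, here's a hedged python sketch of the lowercase-space-capital-i replacement with a "keep" list for words that are meant to be capitalized. the keep list is illustrative only, picked by eye from the instances above; it is not a complete or authoritative set, and the function name is invented.

```python
import re

# a capital I (and the rest of its word) that follows lowercase-space
CAP_I = re.compile(r"(?<=[a-z] )I(\w*)")

# words to leave capitalized; an illustrative set, not an exhaustive one
KEEP = {"Ideal", "Idealism", "Idealist", "Idealists", "Italian", "Island"}

def lower_cap_i(line):
    """Lowercase each lowercase-space-capital-I hit unless the word is in KEEP."""
    def fix(m):
        word = m.group(0)
        return word if word in KEEP else "i" + m.group(1)
    return CAP_I.sub(fix, line)

samples = [
    "the relentless enemy of all that Is",
    "because It Is a matter of growth.",
    "The true Idealist has his feet firmly",
    "partial and provisional. Is the friend",
]
for line in samples:
    print(lower_cap_i(line))
```

the first two lines get fixed and the "Idealist" line is left alone, but the last line comes back unchanged, because a period (not a lowercase letter) sits before the space. a word list can't rescue everything either: the "As You Like It" lines would still be changed wrongly by this sketch, which is one more argument for monitoring the replacement.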