huck finn doan' be kickin mah ass no mo'

so huck finn, mah frien'... yer _dun_ kickin mah ass!

***

i've finally overcome the huckleberry monster.

that's the good news. the bad news is that the book isn't finished yet. the ugly? roughly 3,000 diffs in the line-pairs between my version and a "clean" comparison.

the reason is one that i've known all along, and which has multiplied my difficulty-factor greatly, but i chose to bear it, for precisely that difficulty.

the reason? my "clean" copy is a different edition. you can tell from the first sentence in the book:

my version:
You don’t know about me, without you have read a book by the name of “The Adventures of Tom Sawyer,” but that ain’t no matter.
the "clean" version:
You don’t know about me without you have read a book by the name of _The Adventures of Tom Sawyer,_ but that ain’t no matter.
notice the comma after "about me", and the quotemarks around the book-title... (the "clean" version sets the title in _italics_ instead.)

that was a big tip-off to massive differences in punctuation throughout the two editions... so i knew about the issue, but i persisted... because, again, i wanted to see how big of a problem this would pose for the comparison.

first, even _preparations_ for the comparison (mainly restoring linebreaks and pagebreaks) were hindered quite a bit by these differences. indeed, the prep may be the biggest problem. maybe i'll be able to ease the prep tangles, now that i have a better idea what they are. we'll see...

but even then, 3000 diffs is a lot to resolve...

of course, it's a lot easier to resolve the diffs than to check all the punctuation "manually", which is what the proofers over at d.p. do... (just so's we keep it in perspective, mind ya.) but still, that's a heaping bunch o' diffs.

now, the next thing to note is that _a_tool_ that's custom-built to do such comparisons, and resolve diffs, might well ease this task... i'll have some comments on that in a minute.

but even with a good tool to help the process, 3000 diffs is a lotta diffs to check, no kidding.

so it just might be the case that _eventually_, i will change over to another "clean" copy that _is_ from the same edition, such as this one:
http://www.marktwainproject.org/xtf/view?docId=works/MTDP10000.xml;chunk.id=...

as you see, it's from "the mark twain project", so we can expect that they took some care...

just as a sidenote, it's interesting that _most_ of the copies of this text floating in cyberspace are _not_ from the edition that i am using and which the "mark twain project" also is using...

i'd guess that this is the case because _most_ of those copies floating around were _based_ on the project gutenberg e-text to begin with. i haven't done the forensics research required to bolster that claim, but that is my suspicion.

***

so let's talk about a tool for doing comparisons.

the input for the tool will be the two text files: my in-progress o.c.r. file, and the "clean" text.

the prep got these files synchronized into lines which match those that we'll find on the scans. (between the slanguage and the punctuation, this synchronization was tremendously thorny.)

so our tool now walks through the line-pairs, stopping when it finds a diff to show the lines _plus_ the scan for the relevant page, so that we can see which of the lines (if either) is right.

so you know, you'll find a copy of this post here:
and here's a graphic of the basic framework:
you'll also notice, in that screenshot, that i create a "diff line" underneath the pair, to flag the diff(s).

this "diff line" is a hugely important functionality, because it spots the diff(s) _accurately_and_fast_, so i don't spend my valuable attention doing that. (that task can be time-consuming, and draining.)

so i see what the diff is, then i examine the scan to resolve the diff by seeing which line is correct.

it's a straightforward task. but let's speed it up.

first, it is troublesome to locate the appropriate _place_ where the line resides on the page-scan. but it would help some if we were to know that "the line is at the top of the page" (or the middle, or the bottom). even better, say the line-number!

so the first innovation is to give the line-number:
you'll see the big number "8" next to the text-pair. it indicates the diff is in the 8th line on the page... this is a chapter-opener, so there are "blank" lines; and line "8" means "toward the top, but down a bit." we can do better -- let's signal where the lines are:
oh, this helps a lot. in the margins, we show circles where each line will fall. (i did put in line-numbers, but that started to make the display a little "busy"; so i compromised by coloring the actual "hit" line.)

scans aren't always done uniformly, so the circles might be off, and aren't to be taken too seriously -- the other good reason i deleted the numbers -- but they're excellent at helping to focus the search.

but we can even do better. since we have an idea about _where_ the appropriate line resides, we can screenshot a portion of the scan around that point, and display it right at the top, under our line-pair. voila!
wow. that makes a really nice and useful display... my eyes don't have to travel anywhere at all. great! so, the next part of this job is indicating to the tool which of the lines is correct -- or if both are wrong. so here's the functionality for that:
you'll see we now have 3 buttons at upper-right...

the "1" button means the top line is the right one.
the "2" button indicates the bottom line is correct.
the "3" button says that we're _skipping_ this pair.

each of the buttons, once selected, stores the info, and then instructs the tool to locate the _next_ diff. this minimizes the interaction for each comparison to a single button-click, which makes the task easy.

and yes, you can keypress "1" or "2" or "3" instead. in addition, the cursor-up key says "top line is right", while cursor-down says "bottom line is right", and cursor-right key says "let's skip this pair for now..." (cursor-left will take you back to the previous pair, if you realize that you indicated the wrong choice.)

also note that the text-fields are _live_, so you can _edit_ them, in case neither one of them is correct. pick the one that's easiest to fix, repair it, and then click its button, and your edited line will get saved. (there are times, however, where you simply cannot make a decision; that's why there's a "skip" option.)

you can also do a "find" operation on a search-term, or jump to a specific line in the file, if you need to... other abilities which can be periodically useful are the ability to skim along chapter-heads, or pages.

all of this is very straightforward vis-a-vis coding; nothing here challenges a competent programmer.

there are some additional tricks that i will describe, once i've started in on those 3000 diffs, but _this_ is already more than enough for you to think about.

***

i haven't said as much, but i'm not trying to "hide" the fact that my clean copy is the one jim created, which is up at project gutenberg as e-text #32325.

i initially picked it because i thought a comparison between his copy and mine would be beneficial to both of us. but since i've had to bend his so much, to facilitate the comparison, that's probably not true at this time, at least not as it applies from his view...

so now, if jim wants to do it, he will have to "bend" my version so as to facilitate a comparison with his.

-bowerbird

p.s. here's some more info to detect the editions.
http://zenmagiclove.com/huckf/wisdom.py?whatpage=26
"Oh, she'll do, she'll do. That's all right. Huck can come in."

the "clean" copy, #32325 from project gutenberg.
"Oh, she'll do. That's all right. Huck can come in."
***
http://zenmagiclove.com/huckf/wisdom.py?whatpage=189
They swarmed up the street towards Sher-
burn's house, a-whooping and yelling and raging like Injuns, and every-
thing had to clear the way or get run

the "clean" copy, #32325 from project gutenberg.
They swarmed up towards Sherburn's house, a-whooping and raging
like Injuns, and everything had to clear the way or get run
and as the final note here, pg#76 is a little embarrassing, because it includes graphics from the edition _i'm_ using, like the one on page 189, from chapter 22, meaning that words in the graphic _contradict_ the digital text below it.
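
for anyone who wants to tinker with the line-pair walk and the "diff line" described earlier, here's a minimal python sketch. to be clear, this is _not_ the tool's actual code: the names walk_diffs and diff_line are made up for illustration, it assumes the two files are already synchronized line-for-line, and it leans on python's standard difflib module to find the differing columns. it also reports the line-number within the file, not within the page, since pagebreak handling isn't shown.

```python
from difflib import SequenceMatcher


def diff_line(top, bottom, mark="^"):
    """Build a marker line with `mark` under each column of `top` that differs."""
    # one spare column, so an insertion at the very end of `top` still shows up
    flags = [" "] * (max(len(top), len(bottom)) + 1)
    matcher = SequenceMatcher(None, top, bottom, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # a pure insertion has i1 == i2; flag the column where it would go
        for col in range(i1, max(i2, i1 + 1)):
            flags[col] = mark
    return "".join(flags).rstrip()


def walk_diffs(ocr_lines, clean_lines):
    """Yield (line_number, ocr_line, clean_line, diff_line) for each mismatched pair.

    Assumes both lists are already synchronized, one entry per printed line.
    """
    for number, (ocr, clean) in enumerate(zip(ocr_lines, clean_lines), start=1):
        if ocr != clean:
            yield number, ocr, clean, diff_line(ocr, clean)


if __name__ == "__main__":
    # hypothetical two-line sample; a real run would read the two text files
    ocr = ["You don't know about me, without you have read", "a book by the name of"]
    clean = ["You don't know about me without you have read", "a book by the name of"]
    for number, top, bottom, flags in walk_diffs(ocr, clean):
        print(number)
        print(top)
        print(bottom)
        print(flags)
```

the marker line is positioned against the _top_ line of the pair, so it lines up when the three lines are shown in a monospaced font. showing the scan, and storing the "1", "2", or "3" choice, would sit on top of this loop.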