huck finn doan' be kickin mah ass no mo'

19 Oct 2012

      so huck finn, mah frien'...
yer _dun_ kickin mah ass!

***

i've finally overcome the huckleberry monster.

that's the good news.

the bad news is that the book isn't finished yet.

the ugly?   roughly 3,000 diffs in the line-pairs
between my version and a "clean" comparison.

the reason is one that i've known about all along, and
which has multiplied my difficulty-factor greatly,
but i chose to bear it, precisely to gauge that difficulty.
the reason?   my "clean" copy is a different edition.

you can tell from the first sentence in the book:

my version:
You don’t know about me, 
   without you have read a book
   by the name of 
   “The Adventures of Tom Sawyer,”
   but that ain’t no matter.
the "clean" version:
You don’t know about me 
   without you have read a book 
   by the name of 
   _The Adventures of Tom Sawyer,_
   but that ain’t no matter.
notice the comma after "about me", and
the quotemarks around the book-title...
(the other one did it in _italics_ instead.)

that was a big tip-off to massive differences
in punctuation throughout the two editions...

so i knew about the issue, but i persisted...

because, again, i wanted to see how big of a
problem this would pose for the comparison.

first, even _preparations_ for the comparison
(mainly restoring linebreaks and pagebreaks)
were hindered quite a bit by these differences.

indeed, the prep may be the biggest problem.

maybe i'll be able to ease the prep tangles,
now that i have a better idea what they are.

we'll see...

but even then, 3000 diffs is a lot to resolve...

of course, it's a lot easier to resolve the diffs
than to check all the punctuation "manually",
which is what the proofers over at d.p. do...
(just so's we keep it in perspective, mind ya.)

but still, that's a heaping bunch o' diffs.

now, the next thing to note is that _a_tool_
that's custom-built to do such comparisons,
and resolve diffs, might well ease this task...

i'll have some comments on that in a minute.

but even with a good tool to help the process,
3000 diffs is a lotta diffs to check, no kidding.

so it just might be the case that _eventually_,
i will change over to another "clean" copy that
_is_ from the same edition, such as this one:
http://www.marktwainproject.org/xtf/view?docId=works/MTDP10000.xml;chunk.id=...

as you see, it's from "the mark twain project",
so we can expect that they took some care...

just as a sidenote, it's interesting that _most_
of the copies of this text floating in cyberspace
are _not_ from the edition that i am using and
which the "mark twain project" also is using...

i'd guess that this is the case because _most_
of those copies floating around were _based_
on the project gutenberg e-text to begin with.
i haven't done the forensic research required
to bolster that claim, but that is my suspicion.

***

so let's talk about a tool for doing comparisons.

the input for the tool will be the two text files:
my in-progress o.c.r. file, and the "clean" text.

the prep got these files synchronized into lines
which match those that we'll find on the scans.

(between the slanguage and the punctuation,
this synchronization was tremendously thorny.)

so our tool now walks through the line-pairs,
stopping when it finds a diff to show the lines
_plus_ the scan for the relevant page, so that
we can see which of the lines (if either) is right.
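
here's a minimal sketch of that walk, in python.
it assumes both files are already loaded as lists
of lines, synchronized one-to-one by the prep.
the name `find_diffs` is made up for illustration;
it isn't taken from the actual tool.

```python
from itertools import zip_longest

def find_diffs(ocr_lines, clean_lines):
    """Yield (index, ocr_line, clean_line) for every line-pair that differs."""
    # zip_longest pads the shorter file with "" so a length mismatch
    # shows up as diffs at the end instead of being silently dropped
    pairs = zip_longest(ocr_lines, clean_lines, fillvalue="")
    for index, (top, bottom) in enumerate(pairs):
        if top != bottom:
            yield index, top, bottom

ocr = ["You don't know about me,", "without you have read a book"]
clean = ["You don't know about me", "without you have read a book"]

for index, top, bottom in find_diffs(ocr, clean):
    print(index)
    print("  " + top)
    print("  " + bottom)
```

a real tool would pause at each hit and show the scan,
rather than printing, but the loop is the same idea.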

so you know, you'll find a copy of this post here:
http://zenmagiclove.com/misc/comptool0.txt
and here's a graphic of the basic framework:
http://zenmagiclove.com/misc/comptool1.png
you'll also notice, in that screenshot, that i create
a "diff line" underneath the pair, to flag the diff(s).

this "diff line" is a hugely important functionality,
because it spots the diff(s) _accurately_and_fast_,
so i don't spend my valuable attention doing that.
(that task can be time-consuming, and draining.)
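
one way to build such a "diff line" is sketched here.
this is my guess at a workable approach, using
python's standard `difflib.SequenceMatcher`, and
not necessarily how the tool in the screenshot does it.
the function name and the "^" marker are assumptions.

```python
from difflib import SequenceMatcher

def diff_line(top, bottom, mark="^"):
    """Return spaces with `mark` under each column of `top` that differs."""
    width = max(len(top), len(bottom))
    flags = [" "] * width
    matcher = SequenceMatcher(None, top, bottom, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # a pure insertion has i1 == i2, so flag the insertion point itself
        end = max(i2, i1 + 1)
        for col in range(i1, min(end, width)):
            flags[col] = mark
    return "".join(flags).rstrip()

top = "You don't know about me,"
bottom = "You don't know about me"
print(top)
print(bottom)
print(diff_line(top, bottom))
```

the marker lands under the comma, which is the
only character that differs in this pair.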

so i see what the diff is, then i examine the scan
to resolve the diff by seeing which line is correct.

it's a straightforward task.   but let's speed it up.

first, it is troublesome to locate the appropriate
_place_ where the line resides on the page-scan.

but it would help some if we were to know that
"the line is at the top of the page" (or the middle,
or the bottom).   even better, say the line-number!

so the first innovation is to give the line-number:
http://zenmagiclove.com/misc/comptool2.png
you'll see the big number "8" next to the text-pair.
it indicates the diff is in the 8th line on the page...
this is a chapter-opener, so there are "blank" lines;
and line "8" means "toward the top, but down a bit."
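
a sketch of how that line-number could be computed.
it assumes the prep left us a sorted list holding the
file-index where each page begins (i call it `page_starts`
here, a hypothetical name), and that "blank" lines
on a chapter-opener are counted as lines in the file.

```python
from bisect import bisect_right

def locate(line_index, page_starts):
    """Map a 0-based line index in the file to (page, line_on_page), both 1-based."""
    # number of pages that start at or before this line
    page = bisect_right(page_starts, line_index)
    if page == 0:
        raise ValueError("line comes before the first page")
    return page, line_index - page_starts[page - 1] + 1

page_starts = [0, 30, 62]   # made-up values: pages begin at lines 0, 30, 62
print(locate(37, page_starts))
```

with these made-up numbers, line 37 of the file is
the 8th line on the 2nd page.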

we can do better -- let's signal where the lines are:
http://zenmagiclove.com/misc/comptool3.png
oh, this helps a lot.   in the margins, we show circles
where each line will fall.   (i did put in line-numbers,
but that started to make the display a little "busy";
so i compromised by coloring the actual "hit" line.)

scans aren't always done uniformly, so the circles
might be off, and aren't to be taken too seriously
-- the other good reason i deleted the numbers --
but they're excellent at helping to focus the search.
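
placing the circles is simple arithmetic, if we assume
the lines are evenly spaced between the top and the
bottom of the text-block.   the pixel values below are
invented, and the function name is hypothetical.

```python
def line_centers(text_top, text_bottom, lines_on_page):
    """Estimate the y-coordinate (in pixels) of the center of each line slot."""
    slot = (text_bottom - text_top) / lines_on_page
    return [text_top + slot * (n + 0.5) for n in range(lines_on_page)]

# invented layout: text-block runs from y=100 to y=900, with 40 line slots
centers = line_centers(100, 900, 40)
hit = 8                      # 1-based line-number of the diff
print(centers[hit - 1])      # where to draw the colored circle
```

since scans vary, these positions are only estimates,
which is all the circles need to be.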

but we can even do better.   since we have an idea
about _where_ the appropriate line resides, we can
screenshot a portion of the scan around that point,
and display it right at the top, under our line-pair.

voila!
http://zenmagiclove.com/misc/comptool4.png
wow. that makes a really nice and useful display...
my eyes don't have to travel anywhere at all.   great!
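
the snippet to cut out can be computed from the
estimated line position.   this sketch only computes
the box; the names and the "2 lines of context" default
are my assumptions.   the tuple is ordered as
(left, upper, right, lower), which is the order that
pillow's `Image.crop` takes, if that's the library in use.

```python
def crop_box(center_y, page_width, page_height, slot_height, context_lines=2):
    """Return a (left, upper, right, lower) box around one line, clamped to the page."""
    # half the box height: the line itself plus some context above and below
    half = slot_height * (context_lines + 0.5)
    upper = max(0, int(center_y - half))
    lower = min(page_height, int(center_y + half))
    return (0, upper, page_width, lower)

print(crop_box(250, 600, 1000, 20))   # a line in the upper part of the page
print(crop_box(10, 600, 1000, 20))    # a line at the very top gets clamped
```

the extra context lines give some slack, in case
the estimated position is off by a line or two.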

so, the next part of this job is indicating to the tool
which of the lines is correct -- or if both are wrong.

so here's the functionality for that:
http://zenmagiclove.com/misc/comptool4.png
you'll see we now have 3 buttons at upper-right...

the "1" button means the top line is the right one.
the "2" button indicates the bottom line is correct.
the "3" button says that we're _skipping_ this pair.

each of the buttons, once selected, stores the info,
and then instructs the tool to locate the _next_ diff.

this minimizes the interaction for each comparison
to a single button-click, which makes the task easy.

and yes, you can keypress "1" or "2" or "3" instead.

in addition, the cursor-up key says "top line is right",
while cursor-down says "bottom line is right", and
cursor-right key says "let's skip this pair for now..."

(cursor-left will take you back to the previous pair,
if you realize that you indicated the wrong choice.)

also note that the text-fields are _live_, so you can
_edit_ them, in case neither one of them is correct.
pick the one that's easiest to fix, repair it, and then
click its button, and your edited line will get saved.

(there are times, however, where you simply cannot
make a decision; that's why there's a "skip" option.)
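
the bookkeeping behind those buttons and keys might
look like the sketch below.   the class, its method names,
and the key labels ("up", "down", and so on) are all
hypothetical; a real g.u.i. toolkit has its own key-codes.
it also assumes that a skipped pair keeps its top line.

```python
class Reviewer:
    """Tracks one pass through the differing line-pairs (hypothetical names)."""

    # each button or key maps to a choice; "left" is handled separately
    KEYS = {"1": "top", "up": "top",
            "2": "bottom", "down": "bottom",
            "3": "skip", "right": "skip"}

    def __init__(self, pairs):
        self.pairs = [list(p) for p in pairs]      # [top, bottom], editable
        self.diffs = [i for i, (a, b) in enumerate(self.pairs) if a != b]
        self.pos = 0                               # position within self.diffs
        self.choices = {}                          # pair index -> choice

    def current(self):
        """Index of the pair being shown, or None when past the last diff."""
        return self.diffs[self.pos] if self.pos < len(self.diffs) else None

    def edit(self, which, text):
        """Overwrite the top or bottom line of the current pair."""
        index = self.current()
        if index is not None:
            self.pairs[index][0 if which == "top" else 1] = text

    def press(self, key):
        if key == "left":                          # back up to the previous diff
            self.pos = max(0, self.pos - 1)
            return
        index = self.current()
        if index is None or key not in self.KEYS:
            return
        self.choices[index] = self.KEYS[key]       # store the info...
        self.pos += 1                              # ...then go to the next diff

    def resolved(self):
        """Final lines; skipped pairs keep the top line (an assumption)."""
        return [bottom if self.choices.get(i) == "bottom" else top
                for i, (top, bottom) in enumerate(self.pairs)]

r = Reviewer([("a,", "a"), ("b", "b"), ("c", "c.")])
r.press("1")       # first diff: top line is right
r.press("down")    # second diff: bottom line is right...
r.press("left")    # ...oops, go back
r.press("3")       # skip it for now
print(r.choices)
print(r.resolved())
```

going back simply overwrites the earlier choice,
so there is no separate "undo" state to maintain.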

you can also do a "find" operation on a search-term,
or jump to a specific line in the file, if you need to...

other features which can be periodically useful are
the ability to skim along chapter-heads, or pages.

all of this is very straightforward vis-a-vis coding;
nothing here challenges a competent programmer.

there are some additional tricks that i will describe,
once i've started in on those 3000 diffs, but _this_
is already more than enough for you to think about.

***

i haven't said as much, but i'm not trying to "hide"
the fact that my clean copy is the one jim created,
which is up at project gutenberg as e-text #32325.

i initially picked it because i thought a comparison
between his copy and mine would be beneficial to
both of us.   but since i've had to bend his so much,
to facilitate the comparison, that's probably not true
at this time, at least not from his point of view...

so now, if jim wants to do it, he will have to "bend"
my version so as to facilitate a comparison with his.

-bowerbird

p.s.   here's some more info to detect the editions.
http://zenmagiclove.com/huckf/wisdom.py?whatpage=26
   "Oh, she'll do, she'll do. That's all right. Huck can come in."
the "clean" copy, #32325 from project gutenberg.
   "Oh, she'll do. That's all right. Huck can come in."
***
http://zenmagiclove.com/huckf/wisdom.py?whatpage=189
   They swarmed up the street towards Sher-
   burn's house, a-whooping and yelling
   and raging like Injuns, and every-
   thing had to clear the way or get run
the "clean" copy, #32325 from project gutenberg.
   They swarmed up towards Sherburn's house,
   a-whooping and raging like Injuns, and 
   everything had to clear the way or get run
and as the final note here, p.g. e-text #76 is a little embarrassing,
because it includes graphics from the edition _i'm_ using,
like the one on page 189, from chapter 22, meaning that
words in the graphic _contradict_ the digital text below it.

Bowerbird＠aol.com
