well...  "the adventures of huck finn" has
certainly given me some "adventures" in
creating my new, accurate digitization...

and i'll document them all in a summary.

***

but for today, it's refreshing to know that
i am finally _powering_through_ the book.

you might remember that, as of yesterday,
i'd reduced the comparison diffs to ~2000.

and i was contemplating a shift away from
my current "clean" text, since it's based on
_a_different_edition_ from the one i'm using,
one whose punctuation was heavily edited.

that's why there are so many paired diffs...

usually, with a book this clean, you'd expect
maybe 1% to 3% of the lines to show diffs.
in this book, due entirely to punctuation,
it's almost _20%_ -- ~2,000 out of ~11,000.

and, in ~92% of those diffs, the o.c.r.
is correct, so the diff is being caused by
the _edit_, rather than an o.c.r. misrecognition.

the thing is, the comparison helps to catch
the _other_ ~8%, which are real o.c.r. errors.

so even though it's tedious to step through
~2,000 diffs to find ~160 o.c.r. errors,
it is _still_ a rather efficient way of doing it...

so i'm gonna stick with this different edition.

but i adopted some "tricks" to power through.

and those are going very well.  i'm now down
to ~1,000 diffs, so i'm sure i will finish soon.

so i thought i'd tell you about these "tricks"...

***

i didn't mention one of these "tricks" earlier:
it's the one where i focused on the
_words_and_letters_ instead of the punctuation.

i did that using the following trick...

after finding a diff -- a pair of corresponding
lines from the two files that were not the same --
i culled out the non-letters from each line...

if, after culling, the 2 lines were the same,
then that meant the letters were the same,
so the diff was entirely due to punctuation,
and since, in that pass, i was _focused_ on
resolving differences in the words/letters,
i could skip over such punctuation-only diffs.

but if the lines were _different_ after culling,
then i _would_ examine and resolve the diff.

in the first round of this pass, a refinement
was that i'd also skip a diff if the punctuation
differed too, so i was looking only at
diffs where _the-letters-and-only-the-letters_
were different, allowing a very-specific focus.

after all those diffs were resolved, i looked at
lines where letters _and_ punctuation differed.

thus, when all of _those_ were resolved too,
i knew that i'd resolved all letter-based diffs.

(a further refinement, which retains spaces
when the culled lines are created, helped to
find lines where the letters were the same,
and in the same order, but the _words_ differed,
such as when words were improperly joined,
which can happen on rare occasions in o.c.r.)
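
here's a minimal sketch, in python, of how
that culling might look -- not my actual
script, just an illustration, and all of the
names are made-up...

    import re

    def cull(line, keep_spaces=False):
        # reduce a line to its letters only; keeping
        # the spaces helps catch improperly-joined
        # words whose letters are otherwise identical
        pattern = r"[^a-zA-Z ]" if keep_spaces else r"[^a-zA-Z]"
        return re.sub(pattern, "", line)

    def classify(line1, line2):
        # label a diff pair by what actually differs
        if cull(line1) == cull(line2):
            return "punctuation-only"  # skip in the letters pass
        punct1 = re.sub(r"[a-zA-Z ]", "", line1)
        punct2 = re.sub(r"[a-zA-Z ]", "", line2)
        if punct1 == punct2:
            return "letters-only"      # the very-specific focus
        return "letters-and-punctuation"  # a later round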

***

this methodology -- narrowing the _type_ of
diff being displayed, so as to focus your
attention while resolving it, by letting you
know what to "expect" -- lends itself to a
variety of these "tricks".

for instance, earlier work on the files had
shown me that one of them _hyphenated_
the expression "by-and-by", but the other
did not, so i did a search just for that diff.
it was quite easy to see that every instance
(but one, which i considered "printer error")
was hyphenated in my edition, so i made a
global-change to the "clean" text to ensure
those instances would not be flagged again.
with my mind on the one track, it went fast.
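
in code-terms, that kind of global-change is
just a one-liner.  a sketch, with a made-up
filename, not the actual command i ran...

    import re

    # hypothetical filename for the "clean" text;
    # normalize the un-hyphenated form so that it
    # matches the hyphenated form in my edition
    # (capitalized instances would need their own pass)
    text = open("clean.txt").read()
    text = re.sub(r"\bby and by\b", "by-and-by", text)
    open("clean.txt", "w").write(text)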

***

another trick was to present only the diffs
where the diff lines were the same length.

this usually meant that a punctuation mark
had been misrecognized as another mark...
e.g., a semi-colon misrecognized as a colon.

this trick was very good at providing focus,
so i expanded it, and displayed only those
diffs where the line-length was off by one.
then by two, and so on, until all were done.

for instance, if the line-lengths were off by
one character, that often meant that one of
the two lines had a comma the other lacked.

once you know that this is the nature of the
diffs you are looking at, the resulting focus
of your attention can speed up the process...
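
here's a sketch of that length-delta filter,
assuming the diffs have already been collected
as pairs of lines (the names are made-up)...

    def by_length_delta(diff_pairs, delta):
        # show only the diffs whose line-lengths
        # differ by exactly `delta` characters
        return [(a, b) for (a, b) in diff_pairs
                if abs(len(a) - len(b)) == delta]

    pairs = [("stop; now", "stop: now"), ("yes, sir", "yes sir")]
    # delta=0 surfaces one-mark-for-another swaps;
    # delta=1 often surfaces a missing comma
    print(by_length_delta(pairs, 0))  # [('stop; now', 'stop: now')]
    print(by_length_delta(pairs, 1))  # [('yes, sir', 'yes sir')]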

some of you might see that this variable of
line-length serves as a _proxy_, of sorts, for
what is termed "edit distance" -- defined as
the number of edits that need to be made
to one string to change it into another.

so using "edit distance" might be a further
refinement that i could use the next time...
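
for anyone wanting to try that refinement,
here's a bare-bones levenshtein edit-distance
routine in python -- a sketch, nothing fancy...

    def edit_distance(s, t):
        # the number of single-character insertions,
        # deletions, and substitutions needed to
        # turn string s into string t
        prev = list(range(len(t) + 1))
        for i, cs in enumerate(s, 1):
            curr = [i]
            for j, ct in enumerate(t, 1):
                cost = 0 if cs == ct else 1
                curr.append(min(prev[j] + 1,          # delete
                                curr[j - 1] + 1,      # insert
                                prev[j - 1] + cost))  # substitute
            prev = curr
        return prev[-1]

    print(edit_distance("stop; now", "stop: now"))  # 1
    print(edit_distance("yes, sir", "yes sir"))     # 1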

and after that, i might examine the diffs,
to categorize them, and then present only
one specific type of diff, like when one line
has a comma while the second line doesn't.
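
a sketch of one such category-test -- the
"comma in one line but not the other" case --
using the difflib module that ships with python...

    import difflib

    def comma_only_diff(a, b):
        # true when the only difference between the
        # lines is one or more commas that appear in
        # one line but not in the other
        inserted, deleted = [], []
        sm = difflib.SequenceMatcher(None, a, b)
        for tag, i1, i2, j1, j2 in sm.get_opcodes():
            if tag == "delete":
                deleted.append(a[i1:i2])
            elif tag == "insert":
                inserted.append(b[j1:j2])
            elif tag == "replace":
                return False  # a real substitution
        changed = "".join(deleted + inserted)
        return bool(changed) and set(changed) <= {","}

    print(comma_only_diff("yes, sir", "yes sir"))     # True
    print(comma_only_diff("stop; now", "stop: now"))  # False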

another difficult situation is when the lines
differ at one point, and then at a later point,
and one line is correct at the first difference,
while the other line is correct at the second.
in these cases, it would be good to _split_
the diff lines into the two respective parts.
it will take a bit of coding, but it's doable...
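
a rough sketch of that splitting, again using
difflib -- it cuts each line at the midpoint of
the common stretch between the two spots...

    import difflib

    def split_diff(a, b):
        # when two lines differ in two separate spots,
        # split each line between the spots, so each
        # half of the diff can be judged on its own
        sm = difflib.SequenceMatcher(None, a, b)
        spots = [op for op in sm.get_opcodes() if op[0] != "equal"]
        if len(spots) < 2:
            return None  # zero or one spot: nothing to split
        # midpoint of the equal run between spots 1 and 2
        cut_a = (spots[0][2] + spots[1][1]) // 2
        cut_b = (spots[0][4] + spots[1][3]) // 2
        return (a[:cut_a], a[cut_a:]), (b[:cut_b], b[cut_b:])

    print(split_diff("stop; now, he said", "stop: now he said"))
    # (('stop; n', 'ow, he said'), ('stop: n', 'ow he said'))

(a smarter version would snap the cut to a
word-boundary, but this shows the idea.)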

anyway...

all of this new experience indicates to me
that there are many things i can do to make
the comparison process easier, speedier,
and much more efficient.

at any rate, we're close to finishing _huck_.

-bowerbird