here is a jumble of responses to various things that
were said recently.  i think it's got a pretty good flow,
so i haven't bothered to put it in chronological order.

but if you think that misrepresents anything, say so...

***

don said:
>   I've scanned and reviewed the proofing on a lot of books,
>   all ABBYY-OCR'ed, and one error per 10 pages exhibits
>   good book selection Abbyy management skills I wish I had,
>   or maybe books with 12 lines per page.

i was talking about _after_ the preprocessing, not before...

raw o.c.r. has a typical rate of about 1 error per page on
the relatively clean and simple and straightforward books.

on a simple book, like the "betty lee" that roger was using,
you can fix about 90% of those errors in roughly one hour.
(assuming you have a good tool, which ain't hard to build.)

another hour (or two) will fix 80% of the remaining errors,
meaning you are left with ~2% of the original o.c.r. errors.
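
here's the arithmetic behind that ~2%, as a minimal sketch. the function name is mine, and the 90% and 80% rates are just the rough figures quoted above, not measurements.

```python
def fraction_remaining(pass_fix_rates):
    """fraction of the original errors still left after each pass
    fixes the given share of whatever errors remain at that point."""
    remaining = 1.0
    for rate in pass_fix_rates:
        remaining *= (1.0 - rate)
    return remaining

# first pass fixes ~90%, second pass fixes ~80% of what is left
left = fraction_remaining([0.90, 0.80])
print(f"{left:.0%} of the original o.c.r. errors remain")
```

add a third rate to the list if you want to guess what another pass would buy you.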

at that point, the wise course of action is to send the book
to the smoothreaders, and count on them to find the rest...

for the encyclopedia stuff you do, don, it takes more work.
more time, and more passes, and less accuracy at the end.
how much, i can't say, because i don't have the experience.


>   I'm pretty sure we could do better work, quicker, if we
>   proofed in parallel, then compared (and merged) results

i don't think the "serial/parallel" distinction matters much.

as long as the work of both individuals gets incorporated
-- which it does, in both -- i can see no quality difference.

as for parallel being "quicker", it will be, _while_proofing,_
since person "a" can start at the same time as person "1".

but once you add in the time for "comparison and merge",
you've likely lost any advantage you had, and maybe more.

i don't think anyone here has summoned up more evidence
for the value of comparison/merge than i have, but if you
are arguing that it's quicker and better, i probably disagree.

(most of my research involved a pre-existing digitization,
meaning that there was _no_ cost to generate that corpus.
and sure, in _that_ type of situation, it is clearly "quicker".)

***

roger said:
>   there are several more regex checks that
>   would have helped and I'll bake those in.

a reg-ex path is almost certainly the wrong way to go.

yeah, a small number are necessary and efficient, but
as you get more and more, their signal-to-noise ratio
starts to rot, and you waste your time on false alarms.

the most effective means is also the easiest one of all,
namely a straightforward spellcheck.  i looked closely,
and documented it carefully and extensively on this list,
so i can tell you spellcheck does most of what you need.

where reg-ex shines is with punctuation abnormalities.
but if you're having those kinds of problems, then you
will need to have help from smoothreaders _anyway_...
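
to make the contrast concrete, here's a rough sketch of the two kinds of check. it is not anyone's actual tool: the function names, the tiny word-set, and the two punctuation patterns are all assumptions picked for illustration. a real run would load a full dictionary.

```python
import re

def flag_unknown_words(text, known_words):
    """return each word not found in known_words (lowercase set),
    in order of first appearance, without repeats."""
    flagged = []
    seen = set()
    for word in re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text):
        key = word.lower()
        if key not in known_words and key not in seen:
            seen.add(key)
            flagged.append(word)
    return flagged

# a deliberately small set of punctuation checks (hypothetical choices);
# the point above is that piling on more of these breeds false alarms
PUNCTUATION_CHECKS = {
    "space before punctuation": re.compile(r"\s[,;:!?]"),
    "doubled punctuation": re.compile(r"[,;:]{2,}"),
}

def flag_punctuation(line):
    """return the names of the punctuation checks that this line trips."""
    return [name for name, rx in PUNCTUATION_CHECKS.items() if rx.search(line)]

known = {"the", "cat", "sat", "on", "mat"}
print(flag_unknown_words("The cat sat on tlie mat", known))
print(flag_punctuation("Hello , world"))
```

the spellcheck half is a set lookup and nothing more, which is why it stays cheap as the book grows.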

***

don said:
>   Are you trying to proof and format
>   at the same time?  I did that experiment
>   several years ago with the same effect.

sadly, it will take a long time to erase the d.p. myths...

the reason proofing and formatting do not get along
at d.p. is because their system mixes pseudo-markup
with plain-text which is shown in an .html text-area...

the pseudo-markup _interferes_ with the plain-text
-- since, face it, markup is inherently obstructive --
so yes, that makes "proofing-while-formatting" hard.

but also causing the difficulty is the simple fact that
plain-text does not resemble the p-book page-scan.

the font is monospaced...  also, the paragraphs are
(a) not indented, and (b) separated by a blank line...
italics are not displayed.  blockquotes aren't set off.
headings aren't centered.  poetry isn't formatted right.

in short, there is _nothing_ about their interface that
aids in the task of _either_ proofing _or_ formatting!

so if you try to do _both_, plus deal with the obstacles
of in-line markup, you're really fighting a losing battle.

and the answer is as obvious as the nose on your face:

>   http://z-m-l.com/go/triple2010.png

look at the middle pane.  it looks similar to the scan.
it's not an "exact" match, but it's getting pretty close.

the paragraphs _look_like_ the ones on the page-scan.
the blocks of text between the two match quite nicely.

the fonts are fairly equivalent, the leading is identical,
there are curly-quotes, matching the ones on the scan.

and the italics, if there were any, would _be_ italicized.

so the only real discrepancy is the lack of justification.

with this interface, you can indeed proof-and-format
at the same time, and actually succeed.  at least i can.
i think results would be better than with the d.p. interface.

because there's no drag on proofing-and-formatting.

that doesn't mean i think it's a good idea to do both...
as i said earlier, it's important to keep a laser-focus...
but a person _could_ do both, if they really wanted to.

so yes, don, i agree with you that one shouldn't try to
proof-and-format at the same time.  but _not_ due to
the "it can't be done" reasoning that you have implied,
a message that's been promulgated by d.p. ineptitude.

d.p. just uses a bad interface that makes it difficult...

***

roger said:
>   Yes, because removing formatting
>   only to have to put it back in
>   seems counterproductive,
>   especially since Abbyy's output
>   has a higher catch percentage
>   than I believe a foofer does.

well, ok, first of all, part of your reasoning
is based on _your_ failure on this task, and
i don't think your performance is too useful.

as far as i know, you do too little formatting
on d.p. to be _experienced_ at it...  and yes,
spotting italics is a task that takes practice...

i'd also expect that, having been humbled,
you'd be careful next time, and thus better.

so before too long at all, you'd be great at it.
(even so, even the great ones do miss some.)

to be sure, i am _not_ saying that it helps to
"try harder" or "be careful" or "pay attention".

to me, it seems like explicit consciousness
might well hinder the task, not facilitate it...

spotting italics might be a right-brain thing;
i've noticed that sometimes i will view a page,
and "see nothing", so i click to the next page,
but then i "have a feeling" that i missed one,
so i'll turn back, and sure enough, there it is.

so, like i said, it feels like a right-brain thing.
and thus it might be better to be "unfocused".
you can just look at a page and get a gestalt
that tells you if the slant of an italic is there.

whatever the case, it's definitely the situation
that you get better the more often you do it.
further, when i was doing it frequently, i got
_faster_ at it, even as my accuracy improved.

(so i tried to force myself to go even faster,
but it didn't work.  had to happen naturally.)

but anyway, now i am out of practice again.

having said all of that, however, i recommend
that people use the .rtf version, since it does
do a fairly good job these days on the italics...

you still need to check, though, so on a book
like this one, with easy and infrequent italics,
it won't matter too much if you do .rtf or not.
it will be the same amount of work either way.

but on a book that has tons of difficult italics,
it's well worth running a test to see how well
abbyy recognizes them, because it could save
a lot of time and energy, and boost accuracy.

again, i am out of the habit these days, but
i went through the whole book, to see how
accurate i would be, and how long it takes.

i spent an hour doing it, and found only 110
or 120 of the 160 italics you said are there.
(that includes the ones i had marked earlier,
whenever i encountered 'em "along the way".)

so that's an accuracy-rate of a mere 70%-75%.

even with another go, i might not find 'em all.

so, from my perspective, it's a no-brainer to
at least let abbyy _try_ to find them for me...

moreover, i'd defer this task to collaborators
if i was working in a distributed environment.


>   And having two foofers go through
>   each page of a book like this
>   seems to be a waste of resources.

well, that's a separate question.  it depends on
how much energy you are willing to expend to
_guarantee_ that you've got an error-free book.

personally, i don't think italics are _that_ vital;
nonetheless, i wouldn't want to miss more than
a half-dozen in a 240-page book like this one.

and i cannot say, without doing some research,
how good smoothreaders are at catching these.
(although i hasten to point out that, unlike d.p.,
my smoothreaders view text and scan together,
so it'd be easier for them to detect-and-check.)

heck, to be honest, i can't even say that i know
what kind of accuracy the d.p. formatters have.
i just assume their 2 rounds must be adequate.
but i don't really have _research_ to back it up.
nobody does.  it's too hard to create a criterion.

***

don said:
>   It will be interesting to see
>   what your testing reveals.
>   I remember that, at the time,
>   abbyy's accuracy with formatting
>   location and boundaries was not good,
>   and that one could spend more time
>   fixing what's there than to
>   put it there in the first place.

gosh, more b.s. posing as "accumulated wisdom"
from that flaccid-fountain-of-fail known as d.p.

here's the truth, rather than a recycled d.p. myth.

abbyy is now _pretty_good_ at recognizing italics.

even back in the day, when this myth took shape,
abbyy did a _fair_ job, if you gave it decent scans.

the problem was that some people at d.p. threw
crap scans at abbyy, then said "it doesn't work"...

of course o.c.r. isn't gonna give you good results
from a scan that looks like your baby kutzed on it!

it's optical character recognition, not mind-reading!

on some of those scans, _i_ couldn't read the type.
so how could anyone expect o.c.r. to do the trick?

i am continually amazed at how _well_ abbyy does.

even on the current "betty lee" book, if you look at
the places where there were glitches in the o.c.r.,
then look at the corresponding place on the scan,
you usually see the splotch that caused the glitch.

this scan-set was not up to the usual quality level
for scans done by roger.  must've been a bad copy.

and here's another tip which many of you know:
use scan-tailor to clean up your scans for o.c.r.
(don knows -- as he always advises it to others.)

i don't know all the ins-and-outs of scan-tailor,
so i can't give any advice on optimizing results,
but it proved itself during my initial experiments.

so do a test-run of a few scans to see if your o.c.r.
gets good results.  if not, do mods with scan-tailor.


>   Also consider how the abbyy formatting data
>   could be used by software after your proofing,
>   to consider each location where abbyy found
>   formatting and whether you had put formatting
>   there; and resolve the differences efficiently.

now _this_ is a good suggestion.

myself, i would use the abbyy italic information
right off the bat, without any hesitation at all...

but if someone is reluctant to do that, they can
certainly use that information _after_the_fact_.

for instance, as i reported above, i located
113-119 places where "betty lee" has italics.
(119 places, to be exact, found in 113 lines.)

so i've appended a list of those lines with italics.

a similar list can be generated from abbyy info,
and the two lists then compared to test for diffs.

the diff results could also be used for _training_,
i.e., i could view lines where i missed the italics,
and perhaps pick up a clue on how to improve...
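
the list comparison could be sketched like this. it assumes both lists hold one line of text per entry, with italics marked by underscores as in my list below. the function names are made up, and which sample line sits in which list is invented purely to show the output.

```python
import re

def line_key(line):
    """normalize a listed line so that both lists compare on the words alone."""
    line = re.sub(r"^>\s*", "", line)   # drop a leading quote marker, if any
    return re.sub(r"[\s_]+", "", line)  # drop whitespace and underscore markup

def compare_italic_lists(mine, abbyy):
    """return (missed, extra): lines only abbyy flagged, lines only i flagged."""
    my_keys = {line_key(entry) for entry in mine}
    abbyy_keys = {line_key(entry) for entry in abbyy}
    missed = [entry for entry in abbyy if line_key(entry) not in my_keys]
    extra = [entry for entry in mine if line_key(entry) not in abbyy_keys]
    return missed, extra

# hypothetical sample: the split between the two lists is made up
mine = [
    "I _am_ expecting it to put them to flight! I thank",
    "to have _everything_ talked over in the family! I",
]
abbyy = [
    "I _am_ expecting it to put them to flight! I thank",
    "think that he owned her! She _liked_ Jack.",
]

missed, extra = compare_italic_lists(mine, abbyy)
print("abbyy found, i missed:", missed)
print("i found, abbyy missed:", extra)
```

stripping the whitespace and underscores before comparing means a line still matches even if the two lists mark the italics a bit differently. the "missed" list is the one to study for training.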

***

don said:
>   The best way to get past the tedium is to come up
>   with an objective means for comparing results
>   derived from different, and differences in, the process.

doesn't seem like that should be too difficult...


>   Until then we're stuck with our prejudices.

doesn't seem like a very appealing alternative...

-bowerbird

p.s.  cases of italics that i found in "betty lee"...

>   to have _everything_ talked over in the family! I
>   "True enough. But you don't have to tell _all_
>   Lucia (pronounced 'Lu _chee_ a,' in Italian
>   and as for her she was going to _choose_ her
>   I _am_ expecting it to put them to flight! I thank
>   most of them fine, and both as _worthy_ and as
>   _interested_ in their own team's winning. Do not
>   haven't lost it. Oh, wouldn't that be _awful?"_
>   Mother's that has _grand_ notes in it and I use it
>   "Oh, _do_ you?" laughed Betty. "This is so
>   "He gives the _full_ address inside. It's
>   haps just to joke me -- and while I _thought_ that
>   other things, I never found out till the _middle_of_
>   _my_sophomore_year_ that junior orchestra only
>   meant _second_ to the senior orchestra, sort of a
>   Probably it _was_ "nerve," but she had not meant
>   Clara probably thought that Betty was _gloating_
>   enough tact to be president of _anything"_
>   lunch room side by side. "I _thought_ it was
>   Woods, rather proudly. "As soon as _he_ gets
>   stuff _ing._ But I'd never get tired of it and I
>   "She'll copy Lucia, and it will do _her_ good to
>   dent of Lyon 'Y', _you_ would be going home
>   _junior_ for president!"
>   Come over here on the _chaise_longue_ and let me
>   broader than the ordinary _chaise_longue._ Golden
>   "If you only understood Italian, Betty! _Che_
>   _peccato!_ That means 'What a pity' -- for I'll
>   he is on an African _safari;_ he wrote me just
>   "I _casually_mention_ hearing from my father
>   mother. She doesn't care a _centime_ for him;
>   stuff that I'm most scared for fear she _will_ start
>   him and tell him just to come over and _get_ her!"
>   Betty. I believe I _will_ write and tell my father
>   _were_ connected with Ramon -- and she had
>   everybody could select her own doll. And _these_
>   _"I_ wouldn't care for such a friend," said Dotty.
>   I _knew,_ I should scarcely give any information
>   that died. Isn't it _pitiful?_ So I just sent Bessie
>   the kind that _she_ has to _help."_
>        the kind that _she_ has to _help."_
>   _rather_ you wouldn't tell Mother. I don't know
>   changed, provided she _could_ change it now. She
>   at first and they _are_ good folks, Betty -- at least
>   They were so _good_ to me! I'm going to give
>   in the _little_ basket till Christmas morning, Mrs.
>   chance that _he_ may get work the first of the
>   sang Latin hymns and songs, the _Adeste_Fidelis,_
>   "It's a shame you aren't going to _your_ grand-
>   to _make_anything,_ I don't see, but I'll appre-
>   _never_ forget you. Couldn't if I tried!"
>   called it in the old days when _she_ was a little
>   _"We_ read things," importantly said Amy
>   "Funny she wants _you_ Mumsy, when she has
>   in trouble. Well, Lucia wanted _her;_ perhaps
>   "Did you see that _tr-ragic_ face, Betty?" asked
>   _your_ side," she told Carolyn Gwynne.
>   "It wasn't _anything!_ I got knocked down and
>   "We certainly could, and _crede_mihi,_ Betty,
>   to do, and you can't do everything, _crede_mihi!"_
>   _"'Crede_mihi'_ -- I can't," laughed Betty. "Do
>   you suppose _'mihi'_ ought to come before
>   _'crede?'_ Oh, yes, imperative first!"
>   _"'O_tempora,_o_mores!'"_ replied Carolyn,
>   my word for it' isn't enough like 'believe _me'_
>   like _ubi_est?_ or _Quid_loquor?_
>        like _ubi_est?_ or _Quid_loquor?_
>   _Quid_agis?_--_O_miserabile_me!_--_horribile_
>        _Quid_agis?_--_O_miserabile_me!_--_horribile_
>   _dictu_--_age_vero_--_da_operam,_ and other expres-
>   centrated on basketball "to beat _us."_ And beat
>   think that he owned her! She _liked_ Jack.
>   couldn't help it. Jack _must_ be all right! Why,
>   Betty had wondered if her own parents _could_
>   where she was. It _was_ nonsense. She would go
>   home when she got ready. But she _would_
>   near dinner time it is, and Mother _will_ be
>   "Oh, you _did!"_ exclaimed Doris. "Tell us
>   facetious about it. Isn't that so, _mon_cher_
>   _papa?"_"  ""
>   _wild_set_ of the _society_bunch!_ What do you
>        _wild_set_ of the _society_bunch!_ What do you
>   "Don't _ask_ me, Doris. I don't like _any_ of
>        "Don't _ask_ me, Doris. I don't like _any_ of
>   cally. "Would she _let_ me, do you suppose?"
>   her. Perhaps it was Chet's remark about _not_
>   _forgetting_ that misled her! She was dressed,
>   if there _was_ a pretty good current in the river,
>   frowned again. _Could_ her father tell him? Then
>   "Oh, Father, you _told_ him! And I know he's
>   "Ramon Balinsky Sevilla is not _in_ Detroit!"
>   friend met him. _Now_ are you satisfied?"
>   new dress, _"so_ becoming," she said. "Betty, I
>   for it in others, and _have_a_care,_ Betty. That
>   _a_la_ victrola, though Jack apologized for not
>   de cook. Dat young lady in dere's done _passed_
>   _out!_ An' de butleh -- he gone, too."
>   "No, honey, not daid. No, you jus' _keep_out._
>   me, she would know the way around _this_ house
>   _Mother,_ was still up!
>   Then there _was_ something in the remarks
>   "I thought he _might_ call up to see if I had gotten
>   Betty was not quite sure just _how_ she could
>   think it was _terrible!"_"  "  "
>   cause _you_ didn't know what _was_ happening."
>        cause _you_ didn't know what _was_ happening."
>   just _saw_ going to pieces little by little and hav-
>   reach Italy and how he was having the _palazzo_
>   opened in Milan. Now _that_ may mean some-
>   so _lonely,_ Mrs. Lee!" And your mother asked
>   _"Would_ you, Lucia? I wish you would stay.
>   to have a real old Italian _palazzo_ to come to
>   them the Sevillas. At least _he_ had not found
>   daughter -- _perhaps,_ Lucia thought. But the
>   perhaps, some day. I thank you for _trusting_
>   annual Lyon High _Star._ It was the custom to
>   with her _Star_ under her arm, Lucia in the
>   thing. And she's _got_ to!"
>   not. Of course, we could not mention _that_ before