a little while back, i pointed out that a person can
compare two independent digitizations to find errors
in both of them, and that this method works very well.

carel said:
>   That depends on a lot of factors including the assumption
>   that two OCR programs would not make the same mistake

that's a good point.

if the two digitizations have errors in common, then
the comparison method won't be able to find them,
and thus its effectiveness will be lessened somewhat.
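
for anyone who wants to see that limitation concretely, here is a minimal sketch of the comparison method in python. it is not the tool i actually used; it assumes plain-text input, compares word-by-word with the standard-library difflib module, and the function name and sample sentences are made up for illustration.

```python
import difflib

def find_differences(text_a, text_b):
    """return (words_a, words_b) pairs wherever the two texts disagree."""
    a = text_a.split()
    b = text_b.split()
    # autojunk=False so frequent words are never treated as "junk"
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs

# hypothetical sample data; the true text is
# "the quick brown fox jumps over the lazy dog"
version_a = "the quick brovvn fox jumps over the 1azy dog"  # errors: brovvn, 1azy
version_b = "the quick brown fox jumps ovcr the 1azy dog"   # errors: ovcr, 1azy

for a_words, b_words in find_differences(version_a, version_b):
    print(repr(a_words), "vs", repr(b_words))
```

the one-sided errors ("brovvn" and "ovcr") are flagged, while the shared "1azy" is identical in both versions and so slips through. that is exactly the possible shortcoming under discussion; how often it happens in real digitizations is the empirical question.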

there's no argument with that.

what's surprising to me, however, is how many people
are completely defeated by this _possible_ shortcoming.
upon learning that there _might_ be a problem with the
comparison method, they dismiss it with no other thought.

not me. i set out to actually _test_ the assumption.

i documented the results in a thread in the d.p. forums.
you can search for "revolutionary o.c.r. proofing". it's at:
>   http://www.pgdp.net/phpBB2/viewtopic.php?t=24008

as i note there, i presented the data earlier elsewhere, at:
>   http://onlinebooks.library.upenn.edu/webbin/bparchive?year=2005&post=2005-10-03,3

so yeah, that's right, my findings are over 4 years old now.

***

what my research found is that there were virtually _no_
errors-found-in-common between the two digitizations.

and this finding was replicated, and replicated once again.

in other words, the effectiveness of the comparison method
is _not_ lessened by this possible shortcoming. no indeed,
the evidence says it is not even affected in the slightest way.

the clarity of the results was striking; they are unforgettable.

if you doubt the data, i encourage you to repeat the research.

because merely restating the possible problem, without any data,
won't get anyone very far in the future, not if i'm listening...

***

here's a quick-and-dirty experiment, for anyone willing...

i just used the comparison method on gardner's e-text,
and found 159 differences between his work and mine...
i then resolved the differences by consulting the scans...

79 differences were due to errors in his work.

77 were due to errors in mine.

3 were due to errors in _both_ his and mine (we each had a
_different_ error at the same spot, so the comparison flagged it).
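
the resolution step can be sketched like this, assuming each difference is reduced to three readings: his, mine, and what the scan shows. the classify function and the three sample readings are hypothetical; only the counts 79, 77 and 3 come from the experiment above.

```python
from collections import Counter

def classify(his, mine, scan):
    """say whose reading is wrong at a spot where the two versions differ."""
    if his == mine:
        raise ValueError("not a difference; the comparison would never flag it")
    if mine == scan:
        return "his"    # his reading disagrees with the scan
    if his == scan:
        return "mine"   # my reading disagrees with the scan
    return "both"       # two different errors at the same spot

# hypothetical readings: (his, mine, scan)
samples = [
    ("brovvn", "brown", "brown"),
    ("over", "ovcr", "over"),
    ("c1ock", "dock", "clock"),
]
tally = Counter(classify(*s) for s in samples)
print(dict(tally))

# the counts reported above should add up to the 159 differences
reported = {"his": 79, "mine": 77, "both": 3}
total = sum(reported.values())
for who, n in reported.items():
    print(f"{who}: {n} of {total} ({n / total:.1%})")
```

note that a spot where both versions carry the _same_ wrong reading never reaches classify at all, since the comparison sees no difference there.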

now, of course, any errors-in-common will still reside in
both his and mine. why don't you see if you can find any?

>   http://z-m-l.com/go/gardn/gardn.zml
>   http://z-m-l.com/go/gardn/gardnp123.html

i'll be waiting. but i won't be holding my breath...

***

carel said:
>   I feel that a human looking at
>   a smaller subset of a large document
>   is a good thing in the error finding process.
>   You apparently do not think it is.

if the comparison method has already found all the errors,
why waste the time and energy of a human on rechecking the text?

>   Neither of us is right or wrong:
>   It is a matter of perspective and opinion.

unless i get a good answer to the question that i just asked,
my opinion will continue to be that i am _absolutely_right_,
and you're wrong because you're wasting human resources.

that's my perspective, and i'm not changing it...          :+)

-bowerbird