[gutvol-d] Re: Processing eTexts

26 Feb 2010

      carel said:
...
Direct comparison to the scans is part of 
   the formal proofing process and, 
   although available during the 'quest for perfection,' 
   should not be a forced issue.
i disagree.

it's largely a matter of "framing" the issue and its perception.

i don't think we should ask the user to help with "proofing" per se.

i don't want them to think of this as "the formal proofing process".

i believe we need to put the books in front of them _to_read_...

_that_ should be our understanding with them, how we "frame" it.

they're there because _they_ want to _read_ this particular book.

now, as part of the interface by which we present the book to them,
we give them both the digital text _and_ the page-scan, together...

most interfaces make the user _choose_ one or the other, but
i see no reason (especially no _benefit_) in doing it that way...

i think it's wise to give them _both_, and have them understand that
there may have been some transcription errors during digitization,
so we're giving them the page-scan too, for any cases they _suspect_
might be errors.   in most cases, they'll see that we matched the scan.
or they'll understand why we made the change that we did, if we did.

_or_ -- in some rare cases, in which we request their assistance --
they'll find an actual transcription error that we made, and _tell_ us.

this is the thing that project gutenberg doesn't do well.   it doesn't
_invite_ the reader to become a part of the transcription process,
to become one of the instruments that marches texts to perfection.

indeed, you can _feel_ deep in your gut the _contempt_and_scorn_
that jim tinsley and al haines have for the stupid people who _do_
report errors, because _half_ of those "errors" are not, in fact, errors.

well, gee, big surprise.   you haven't given them the page-scans, so
how in the world are they gonna know if an apparent error _is_ real?
...
I think calling it a wiki makes it easier for quick comprehension 
   rather than explaining the whole process of storage, editing, etc.
we should never _have_to_ "explain the whole process" to the public.
for them, it should be as easy as clicking a button to say "check this!"
they shouldn't even need to explain the error.   (we can find it, right?)
...
Rather than working an editing interface 
   into an out-of-the-box wiki system
we don't need to make "an editing interface"...
...
Has anyone ever just asked google if they could 
   use the scans and OCR text for PG?
michael had a meeting with google when they were just starting out...

he reported that they treated him rather rudely, hustling him out of
their famous cafeteria before he had even finished his lunch.   it's sad,
but i don't think they really appreciate project gutenberg very much...
...
XML makes a nice standard for storage and conversion
it's obtuse, and unnecessarily obstructive to the text itself...
...
My only protest is that I believe 
   a human proofing stage is still a good choice.
if you mean a word-by-word reading of the page by a person
who has no real interest in reading the entire book, i disagree.

a comparison of different digitizations is a much better process,
in terms of finding errors, and it takes less work, and is more fun.

i can't think of one single reason for a word-by-word proofing,
let alone two or three rounds of it.   it's simply not efficient, and
anybody who explores the options can learn that for themselves.

between aggressive clean-up of o.c.r., and the comparison method,
and "smoothreading" by people from the general public, we're good.
...
Otherwise, I would say to have two people 
   run the same correction process, diff the results and
   then present the mis-matches to a third set of eyes.
close...   but it's unnecessary for "people" to do the digitizations.
two sets of cleaned-up o.c.r., diffed against each other, is enough...

(but yes, it's necessary to have a sentient human doing the clean-up.)
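
a minimal sketch of that comparison step, assuming each digitization
is just a plain-text string.   the function name `ocr_mismatches` and
the two sample strings are made up for illustration; this uses python's
standard difflib, and is not any particular tool's actual code.

```python
import difflib

def ocr_mismatches(text_a, text_b):
    """Return (words_a, words_b) pairs where two digitizations disagree."""
    words_a = text_a.split()
    words_b = text_b.split()
    # Compare word sequences; autojunk=False so common words are never skipped.
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    mismatches = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            # Each mismatch is a spot a human should check against the scan.
            mismatches.append((words_a[a1:a2], words_b[b1:b2]))
    return mismatches

# Hypothetical sample data: two cleaned-up o.c.r. passes over the same page.
scan_one = "it was the best of tirnes, it was the worst of times"
scan_two = "it was the best of times, it was the w0rst of times"

for left, right in ocr_mismatches(scan_one, scan_two):
    print(left, "<->", right)
```

only the places where the two versions differ come back, so the human
looks at a short list of suspect words instead of reading every word.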
...
the existence of a proofing round
   creates a sense of belonging to the project and 
   could assist in recruiting people to try the 'harder' process of 
   running the 'automation' scripts and submitting books
word-by-word proofing is a slow, boring, humdrum task.

o.c.r. clean-up is a fast-paced, active, and energetic task.
which does _not_ mean that it's "harder".   just more fun...

doing the former task won't help you at all with the latter.
(except perhaps make you _extremely_ grateful that you
don't have to do such a boring, miserable job ever again.)
...
People need a sense that they are not just 
   assisting tools to create book editions, 
   but are a fundamental part of the process.
is the hammer more important than the carpenter?

i doubt it, whether you ask the hammer _or_ the carpenter...

i get the impression you think these clean-up tools are just
a bunch of scripts that people turn on, and the results fall out.
but that's not anything like what actually goes on.
...
I'm not sure I'm ready to program an AI engine 
   capable of "listening to the book."
the software can't listen to the book.   _you_ have to do that.
...
Which is why I do not believe in leaving people out of the process
again, you have a profound misunderstanding of the process.

the hammer can't do a thing without the active involvement of
the carpenter.   even then, the carpenter has to know what s/he
is doing, or the hammer won't be able to do one bit of good...

the best thing you can do, carel, is to play around with twister...

-bowerbird

Bowerbird＠aol.com