[gutvol-d] Re: Processing eTexts

27 Feb 2010

      Bowerbird said:
...
i don't want them to think of this as "the formal proofing process".
...
i believe we need to put the books in front of them _to_read_...
Which is exactly why a book that is ready for release to PG should not
force upon an individual the original scan. When the reader spots
something they think is an error, they should be able to call up the
scan to see if it is an actual error before reporting it as such, etc.
Otherwise their reading is distracted by the scan. A proper interface
would allow users to select what they want to see. An interface should
be about options.

In any case, I have no desire to host an archive that duplicates the
content of PG and others. The purpose in coming to my project site would
be to prepare books for release to PG and to polish released books so
that they can be read for leisure and pleasure off of _my_ bandwidth.
...
...
I think calling it a wiki makes it easier for quick comprehension 
  rather than explaining the whole process of storage, editing, etc.
we should never _have_to_ "explain the whole process" to the public.
for them, it should be as easy as clicking a button to say "check this!"
they shouldn't even need to explain the error.  (we can find it, right?)
Of course, but it is hard to develop it and request assistance with the
development of it without explaining the basic concepts, etc. A project
must be explained and defined or it cannot be realized.
...
...
Rather than working an editing interface 
  into an out-of-the-box wiki system
...
we don't need to make "an editing interface"...
It's rather hard for people to make/submit corrections without having an
interface for doing so.
...
...
Has anyone ever just asked google if they could 
  use the scans and OCR text for PG?
michael had a meeting with google when they were just starting out...
he reported that they treated him rather rudely, hustling them out of
their famous cafeteria before he had even finished his lunch.  it's sad,
but i don't think they really appreciate project gutenberg very much...
That is too bad.
...
...
XML makes a nice standard for storage and conversion
it's obtuse, and unnecessarily obstructive to the text itself...
It's also one of many methods for the storage of 'intelligent' data and
is easy to do. Once put in place, it would never be seen again by anyone
except someone correcting the markup (which one hopes would be rare).
InDesign uses markup, MS Word uses markup, and on and on. I don't think
the users of this software find the markup obstructive to the text
itself.
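
To illustrate the point that stored markup need never be shown to a reader, here is a minimal sketch in Python. The element names (chapter, title, p, emph) and the function reading_text are made up for the example; they are not a proposed schema or part of any existing tool.

```python
import xml.etree.ElementTree as ET

# Hypothetical stored form of a book fragment; the tag names are assumptions.
STORED = (
    "<chapter>"
    "<title>Chapter I</title>"
    "<p>It was a <emph>dark</emph> night.</p>"
    "</chapter>"
)

def reading_text(xml_source):
    """Return one plain-text string per top-level block, with all markup removed."""
    root = ET.fromstring(xml_source)
    # itertext() walks an element and yields its text without the tags.
    return ["".join(block.itertext()) for block in root]

for line in reading_text(STORED):
    print(line)
```

The reader only ever gets the output of something like reading_text; the stored XML is touched only when the markup itself needs correcting.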
...
...
My only protest is that I believe 
  a human proofing stage is still a good choice.
[snip]
i can't think of one single reason for a word-by-word proofing,
let alone two or three rounds of it.  it's simply not efficient, and
anybody who explores the options can learn that for themselves.
I don't believe I said anything about multiple rounds of proofing, etc.
I believe I discussed the main processing of a document being done
through human interaction with scripts and then people actually looking
at the results of the process before the output is released (proofing).

I feel that a human looking at a smaller subset of a large document is a
good thing in the error finding process. You apparently do not think it
is. Neither of us is right or wrong: It is a matter of perspective and
opinion.
...
between aggressive clean-up of o.c.r., and the comparison method,
and "smoothreading" by people from the general public, we're good.
Yes and no. It depends on your definition of 'good.'
...
...
Otherwise, I would say to have two people 
  run the same correction process, diff the results and
  then present the mis-matches to a third set of eyes.
...
close...  but it's unnecessary for "people" to do the digitizations.
two sets of cleaned-up o.c.r., diffed against each other, is enough...
That depends on a lot of factors including the assumption that two OCR
programs would not make the same mistake and that, if they did, it would
be a mistake that could be caught by another process. Etc.
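
A rough sketch of the diff step being discussed, using Python's standard difflib. The function name ocr_mismatches and the sample strings are invented for illustration, and comparing word by word is only one assumption about how the two texts might be lined up.

```python
import difflib

def ocr_mismatches(text_a, text_b):
    """Return (words_from_a, words_from_b) pairs where two texts disagree."""
    a = text_a.split()
    b = text_b.split()
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    mismatches = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Keep both readings so a third person can check them against the scan.
            mismatches.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return mismatches

# Two hypothetical OCR runs that disagree on one word.
print(ocr_mismatches("the quick brovvn fox", "the quick brown fox"))

# Two runs that make the SAME mistake: nothing is flagged.
print(ocr_mismatches("tbe quick brown fox", "tbe quick brown fox"))
```

The second call shows the limitation raised above: when both sources share an error, the diff is silent, so some other check still has to catch it.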
...
(but yes, it's necessary to have a sentient human doing the clean-up.)
Exactly. And, if that human makes a major error, the entire text is now
a mess. If two people do the same process it reduces, but does not
eliminate, the risk for 'the human error factor.'
...
o.c.r. clean-up is a fast-paced, active, and energetic task.
which does _not_ mean that it's "harder".  just more fun...
People actually do enjoy proofing. You do not. Again, neither is wrong
in their viewpoint. Volunteers should be involved in a manner that gives
them pleasure and satisfaction, etc. I say to give people tools and
options and let them be the judge of what is best for them.
...
i get the impression you think these clean-up tools are just
a bunch of scripts that people turn on, and the results fall out.
but that's not anything like what actually goes on.
Since I am planning to write scripts for such an environment, one would
hope that I do not "think the results will just fall out." I think much
of this will be very difficult to program and that the software must be
capable of 'learning' and that it will require a human to interact with
it so that the job can be done well and the software upgraded to do the
job even better in future.

I actually wrote several scripts like this many years ago. They are
archived on my CDs, but my manifest was damaged in a fire, so I am not
sure where to even begin to retrieve them. It would be faster and easier
to just start from scratch than to dig up outdated software anyway, so I
do not regard it as a great loss.
...
...
I'm not sure I'm ready to program an AI engine 
  capable of "listening to the book."
...
the software can't listen to the book.  _you_ have to do that.
I pretty much said that in what you snipped. :)
...
...
Which is why I do not believe in leaving people out of the process
...
again, you have a profound misunderstanding of the process.
I'll just laugh and pretend you didn't say that....

Perhaps I have a profound misunderstanding of _your_ proposed process,
but I have a very firm understanding of my own. There is no _the_
process for me to understand. In any case, if you would like to create
your own project, feel free. "The road is free to all." And Michael and
PG would be happy to receive the texts submitted through any process.

And I would be happy to receive constructive feedback on my efforts as
well from any source that cares to give it. I am just interested in
creating a method for people to complete initial editions of books
rapidly, to avoid putting features in place that 'waste' the time of
volunteers in doing tasks that could be done more rapidly and easily
through the use of software, and to have an environment that offers the
opportunity to correct any errors that were missed in the initial
release. Also, it should offer a sense of community for volunteers to
interact. The _process_ is not a complex one by any means.

As a human, I expect that my project process will not be perfect nor
will it agree with the opinions of all others on how it _should_ be
done, but it will be a process that produces books. So, it will be part
of the solution no matter how many problems it has. :)

Nor do I feel that the process at DP is flawed. They are doing the
process the way they have chosen to do it and it is a project that
actually exists and is productive. They simply created a bottleneck in
their workflow that they need to deal with. I actually hope that some of
my tools are of assistance to those who use DP to process books. I am
about getting books into PG and care very little how one goes about
doing it so long as one _is_ going about doing it. :)

Carel

cmiske＠ashzfall.com