Processing eTexts (was RE: Re: d.p.'s undeserved superior attitude about "quality")

26 Feb 2010

      Bowerbird said:
...
carel said:
...
I've read your posts for many years
so _you're_ the one who's been reading them!           ;+)
I'm sure many people read them. :)
...
...
I'm not sure the most efficient system on earth 
  could keep up with the scan rate.
sure we can.  just have 10 million people do 1 book each.
piece'a'cake.
It sounds easy when you put it that way.... ;)
...
...
And, if you speed things up by skipping the manual proofing
  and use automation completely for the first release then you 
  will have text little to no better than what google already has.
oh, don't kid yourself.  google's text is _much_ better than
what they're showing to us members of the general public.
Their OCR software requires training, so I agree with you that they
always have a version at least one step better on the back burner. They
probably have a stepped-up version of everything we see....
...
i haven't been impressed by the wiki approach, honestly...
The standard wiki layout is not a proper environment for editing. Better
would be something that allows the vast majority of readers to just read
the page and click a button to be presented with an interface for
making/submitting corrections if such is required. Direct comparison to
the scans is part of the formal proofing process and, although available
during the 'quest for perfection,' should not be a forced issue.

I think calling it a wiki makes it easier for quick comprehension rather
than explaining the whole process of storage, editing, etc. Rather than
working an editing interface into an out-of-the-box wiki system, it
would be better to write a proprietary system with some of the social
networking features of a wiki. That part would take a very firm second
place to just getting some tools out there for people to produce decent
quality texts in a minimal amount of time. Because of that fact, I
haven't really given much thought to the 'wiki' interface as it will
evolve by itself from wants/needs.

[snip]

in re: google:
...
people on this listserve have been cut off for scraping...
you can still do it, of course, but you have to be careful.
Has anyone ever just asked google if they could use the scans and OCR
text for PG? In regards to the scans, I believe they just require you to
leave the branding.
...
i would bother to nitpick certain parts of it, like the x.m.l.,
but really, my objection is at the more fundamental level.
XML makes a nice standard for storage and conversion and can be human
readable. I'm not one of those people who believes that XML is the
answer to everything, but I do think it works well for this use-case
scenario. Opinions on this will vary, etc.
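
As a rough illustration of the storage-and-conversion point, here is a minimal sketch using Python's standard xml.etree.ElementTree. The element and attribute names (page, line, book, n) and the book id are made up for the example and are not an agreed format; the sketch only shows that a page can be stored as readable XML and converted back to plain text.

```python
import xml.etree.ElementTree as ET

def build_page(book_id, page_no, lines):
    """Build a <page> element holding one <line> child per text line.

    The element and attribute names are hypothetical, chosen for this sketch.
    """
    page = ET.Element("page", {"book": book_id, "n": str(page_no)})
    for text in lines:
        ET.SubElement(page, "line").text = text
    return page

def page_to_text(page):
    """Convert a <page> element back into plain text, one line per <line>."""
    return "\n".join(line.text or "" for line in page.findall("line"))

# Store a page as an XML string, then read it back and convert it to text.
page = build_page("example-book", 1,
                  ["It was the best of times,", "it was the worst of times,"])
xml_string = ET.tostring(page, encoding="unicode")
restored = ET.fromstring(xml_string)
print(xml_string)
print(page_to_text(restored))
```

The same stored page could feed other converters (HTML, plain text, and so on) without touching the source record.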
...
yes, i believe every page of every book should be online,
on its own webpage with unchanging u.r.l., text and scan,
with an error-report form that the general public can use
to detail any problems with the page, or ask questions, or
even make annotations and have dialog about that page...
Agreed.
...
but i also believe that the text for the vast majority of these
pages should be _perfect_ at the time that text is mounted.
We are in agreement on this point as well. The majority of the work
towards corrections should be done via automation and human interaction
with a software process so as to reduce the error level to a minimum
before it is put in the hands of a proofreader (or the general public).
...
and i've demonstrated, repeatedly, that this is fully possible.
It is very possible.
[snip]
...
so here's how i see things going...
...
when you find a scanset online of a book you want to digitize,
you'll download it, along with the o.c.r.  (if it comes from a site
that won't give you the o.c.r., you can upload it to archive.org,
and they'll o.c.r. it for you.  providing you have no o.c.r. app.)
...
you'll use a handy-dandy tool to digitize it in an hour or two,
and then you'll make the results available to everyone online.
My only protest is that I believe a human proofing stage is still a good
choice. For one thing, it would be a way of catching errors that could
be introduced by the human-assisted automation process. Otherwise, I would
say to have two people run the same correction process, diff the results
and then present the mismatches to a third set of eyes. If that doesn't
result in a book that is ready for release, I'm not sure what would.
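
The two-run comparison could be sketched roughly as follows, using Python's standard difflib. This is only a sketch under assumptions: the function name find_mismatches and the sample strings are invented for illustration, and the comparison is line-by-line, which may be too coarse for real page text.

```python
import difflib

def find_mismatches(text_a, text_b):
    """Return one (line_no_in_a, lines_from_a, lines_from_b) tuple for every
    region where the two corrected texts disagree."""
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    matcher = difflib.SequenceMatcher(None, lines_a, lines_b, autojunk=False)
    mismatches = []
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag != "equal":
            # Line numbers are reported 1-based for the human reviewer.
            mismatches.append((a1 + 1, lines_a[a1:a2], lines_b[b1:b2]))
    return mismatches

# Two hypothetical correction runs over the same page.
run_a = "It was the best of times,\nit was the worst of tirnes,\nit was the age of wisdom,"
run_b = "It was the best of times,\nit was the worst of times,\nit was the age of wisdom,"

# Only the disagreements go to the third set of eyes.
for line_no, from_a, from_b in find_mismatches(run_a, run_b):
    print("line", line_no)
    print("  run A:", from_a)
    print("  run B:", from_b)
```

Anything both runs agree on is passed through untouched, so the third person only looks at the lines in dispute. An error that both runs make identically would not be caught this way.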

Ready for release != perfect. :)

On a point in favor of the proofing round that really has nothing to do
with the potential for automation, the existence of a proofing round
creates a sense of belonging to the project and could assist in
recruiting people to try the 'harder' process of running the
'automation' scripts and submitting books, etc. People need a sense that
they are not just assisting tools to create book editions, but are a
fundamental part of the process. If you try to proof after a release has
been done, there will be little interest in it because you lose the
motivation to 'get the book out there.'
...
and carel, since you're interested in programming such tools,
we'll have plenty to talk about...
Yes. :)
It's very much a 'will do' at this point. I just have to survive my
mid-terms. I look forward to your feedback and your input on the
project.
...
well, like i told gardner just the other day, i really don't
_have_ any collection of scripts.  i use a text-editor and
start looking at the book.  i see an error, and devise a
global search-and-replace to deal with that type of error.
when they're all fixed, i look again to find the next error.
That brings up the point that pretty much all of what my scripts would
be doing could be done in word processing software....
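
For what it's worth, the see-an-error, fix-it-globally loop could look something like this in a script. It is a minimal sketch: the three rules are hypothetical examples of error classes, not a list anyone has actually proposed, and the de-hyphenation rule in particular would also join words that are legitimately hyphenated at a line end.

```python
import re

# Hypothetical rules. Each one targets a single class of OCR error and is
# applied to the whole text at once, like a global search-and-replace.
RULES = [
    (re.compile(r"\btbe\b"), "the"),        # "h" misread as "b"
    (re.compile(r"(\w)-\n(\w)"), r"\1\2"),  # rejoin words split across lines
    (re.compile(r" +([,.;:!?])"), r"\1"),   # stray space before punctuation
]

def apply_rules(text, rules=RULES):
    """Apply each rule in order; return the new text and per-rule hit counts."""
    counts = []
    for pattern, replacement in rules:
        text, hits = pattern.subn(replacement, text)
        counts.append((pattern.pattern, hits))
    return text, counts

sample = "tbe cat sat on tbe mat , quite con-\ntentedly ."
fixed, counts = apply_rules(sample)
print(fixed)
for pattern, hits in counts:
    print(hits, "replacement(s) for", pattern)
```

The hit counts give a quick check on whether a rule did what was expected before moving on to look for the next kind of error.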
...
i think of this process as "listening to the book", in that
the book itself will tell you what kinds of errors it has...
I like that. I'm not sure I'm ready to program an AI engine capable of
"listening to the book." Which is why I do not believe in leaving people
out of the process, but rather giving them tools to assist them with the
process. And, the tools can 'learn' from the people who use them.
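
One very simple way a tool could 'learn' from its users is sketched below: it just counts the corrections people make and starts suggesting a fix once enough people have made the same one. The class name CorrectionMemory, the min_votes threshold, and the sample words are assumptions made up for this example, and it is nowhere near an AI engine.

```python
from collections import Counter

class CorrectionMemory:
    """Remember corrections made by proofreaders and suggest the popular ones."""

    def __init__(self, min_votes=2):
        # A correction is only suggested after this many people have made it.
        self.min_votes = min_votes
        self.votes = Counter()

    def record(self, wrong, right):
        """Note that a proofreader changed `wrong` into `right`."""
        self.votes[(wrong, right)] += 1

    def suggest(self, word):
        """Return the most-voted correction for `word`, or None if there is
        no correction with at least `min_votes` votes."""
        candidates = [
            (count, right)
            for (wrong, right), count in self.votes.items()
            if wrong == word and count >= self.min_votes
        ]
        if not candidates:
            return None
        return max(candidates)[1]

memory = CorrectionMemory()
memory.record("tirnes", "times")
memory.record("tirnes", "times")
print(memory.suggest("tirnes"))
```

The person stays in the loop: the tool only offers a suggestion, and the proofreader decides whether to accept it.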

I'll be busy finishing up a group project for my classes, so it'll be
late March before I get any real time to work on scripts for this
project. I'll work on some of the research, logic, and flow in the
meantime so that I am ready to roll when I do have the time.

Back to my programming....

Carel

cmiske＠ashzfall.com
