New subject: lest the message be missed

8 Mar 2005

      carlo said:
...
How do you manage words that are written in the same way, 
  except the accent? This is quite common in french and italian.
i wouldn't strip away high-bit characters on a french or ltalian book;
they are an essential part of those languages.  i've said that repeatedly.
jon noring seems to turn every listserve into an endless merry-go-round.

***

robert said:
...
With some of the OCR I've seen latesly, 
  this is probably about 90% right, for 90% of books
if you want me to estimate some hard numbers,
i'd say 75% of the e-texts in the library now
could be done to our standard in one evening.

the "standard" we've been talking about thus far
is 1 error every 10 pages.  if an individual can
take an e-text to that level of accuracy, then
it can be turned over to a quasi-public process
i call "continuous proofreading", discussed below.

another 20% of the e-texts in the library now
would take two or three evenings, and might
require the help of a "specialist" of some kind,
someone knowledgeable about a certain arena,
like greek, or tables, or indexes, or graphics, etc.
here's where the value of distributed proofreaders
will most come into play in the future, in my opinion,
being able to "fix up" the work of independent proofers.

the remaining 5% of the  e-texts in the library now
(remember, i'm just guessing at the numbers here)
might be too difficult for any individual to take on,
for reasons of size, complexity, or what have you.
again, distributed proofreaders will shine here...
...
provided you can get good scans,
jon noring promises us good scans, at a high resolution.
i guess he's got a lotta money, to be able to afford the
maintenance costs of storage and bandwidth for them.
if he delivers on that promise, we don't need to worry...

if he can't -- and he has no track-record in this arena,
so maybe we shouldn't count on him -- then we have
the scans that internet archive is making for toronto.
they've promised us "thousands" of scanned books by
this spring, but the last thing i heard was that they
were "pausing to do an evaluation of their quality",
which probably means they've come to realize that
they must do a much better job than they have been.
i've heard that some books are done very well and
that others leave a lot to be desired.  consistency
in this regard is often a difficult goal to achieve.
but i believe brewster will eventually get it right.

and of course, there's always google.  we still don't know
what kind of job they'll do, or if we can use their scans.
but if they do a good job, and release their scans freely,
that will provide us with a ton of scanned-image books.

some people reading this undoubtedly work in an office
that has one of those "multi-functionality" machines.
some of those babies can scan over 60 pages a minute,
straighten the scans, and upload them to the internet,
all while you sit there and whistle and pick your nose.
at that rate, the 450 pages of "my antonia" would take
7.5 minutes.  i don't know how crafty _you_ might be,
but my fellow poets are _highly_ skilled at hijacking
office machinery for our own nefarious purposes...     ;+)
(heck, that's the only reason some of us get a job at all.)

finally, there are millions of home computers out there
that were sold with an all-in-one printer/scanner/fax.
and, as before, the quickest and easiest way to get to a
million scan-books is for a million people to scan _one_.

so, it's fairly easy to predict that, from one or more of the
above factors, there will soon be an _avalanche_ of books
that have been scanned and need to be converted into text.

and every scanned-book will have at least _one_ person
who will want its text badly enough to do a little work.
what we need to do is _give_that_person_a_good_tool_
that enables their little bit of work to get good results.
that's what i intend to give them.  all i ask of you is that
you stop telling people that this job is difficult.  it's not.
...
and provided you are willing to let a few 
  hard-to-detect classes of error
you'll have to explain what you mean by "hard-to-detect".

in my experience, if an error is serious (in any meaningful way),
then it'll be detected by a person who's actually reading the book.

some errors are unforgivable, such as an incorrect word that
won't even pass spellcheck.  those should _always_ be caught.

trivial punctuation errors, like a missing comma, are... well, trivial.

(although i haven't mentioned anything about it until now,
a great way to catch some errors is to have the computer
speak the text aloud to you as you follow along reading it;
stealth scannos, for instance, are handily exposed by this.)
...
and provided you are willing to let a few 
  hard-to-detect classes of error 
  go until post-production.
"post-production" has no meaning in my scenario.

i repeat myself, again and again, by saying that once a person
gets the error-level on an e-text down to 1 error in 10 pages,
we can make it available via "continuous proofreading" and 
let readers-from-the-general-public zoom it to perfection.

1.  scan.
2.  "fix" the scans.
3.  do the o.c.r. on them.
4.  run the post-o.c.r. tool.
5.  do quasi-public "continuous proofreading".
6.  consolidate the corrections into a public release.
7.  release the e-text out to the public as a single file.
8.  continue doing a full-public "continuous proofreading".
9.  take error-reports from the people reading the file offline.

i will also state, for the record, that i think step #4 can do
_far_ better than 1 error in 10 pages if we sharpen our tools.

look at the "my antonia" example.  jon had a team of _seven_
proofreading it.  i don't know how many looked at each page,
but, according to my analysis, they took the error-rate down
to about 1 every 70 pages.  when i used my bag of tricks on it,
i removed 3 errors from the 210 pages i subjected to scrutiny.
(which leads us to predict 3 more errors in the second half.)

to the best of my knowledge, there are no errors in my file,
i.e., the first half of the book.  (full book by friday, hopefully.)

i'm not saying it _is_ free of errors even now, since someone
with a different set of tricks in _their_ bag might be able to
locate 2 more remaining errors i couldn't find, but i will say
that it is more than clean enough to turn loose on the public...
(that is, i think we could skip steps #5 and #6 on this e-text.)

-bowerbird

re: [gutvol-d] lest the message be missed

Bowerbird＠aol.com

Jon Noring

David Starner

tags

participants (3)