[gutvol-d] a wiki-like mechanism for continuous proofreading and error-reporting

9 Mar 2005

      here's one from last week that never got mailed out...

i'll be leaving here again very shortly, since i have been
reminded just why i had stayed away, because this place
can be so negative and destructive and poisonous...  ick!

***

jon, you said the scanning took "much more than four hours".
so how long _did_ it take?  and if you were to do it again,
with your present scanner, how long would it take you?

also, how long did it take you to manipulate the images?
and how did you do that?  what specific steps did you take,
in what order, and what program did you use to do all that?
is there anything of all that which you'd do differently now?

***

jon said:
...
OCR is quite fast. It's making and cleaning up the scans 
  which is the human and CPU intensive part.

well, it all depends, jon, it all depends...

with the right hardware -- like office-level machinery --
60 pages a minute can get swallowed by the gaping maw.
that's right.  one page per second.  that seems fast to me.

that means your 450-page scan-job would take 7.5 minutes.
probably took you more time than that to cut the cover off.

and the machine will automatically straighten those pages,
o.c.r., and upload to the net, while you stare dumbfounded...

likewise with the kirtas 1200, geared to scanning books.
     http://www.kirtas-tech.com/
it does "only" 20 pages a minute, but hey, 1200 pages/hour
ain't nothing to sneeze at.  they estimate that in a full-scale
production environment, the price-per-scan is 3 cents a page.
sounds like brewster should buy a half-dozen of these babies.
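
for anyone who wants to plug in their own numbers, here is a minimal sketch of that arithmetic in python.  the function names are made up for illustration, and the 3-cents figure is just the vendor estimate quoted above, not something i have measured:

```python
def scan_minutes(pages, pages_per_minute):
    """minutes needed to feed `pages` through a scanner at a given rate."""
    return pages / pages_per_minute


def pages_per_hour(pages_per_minute):
    """convert a pages-per-minute rate into pages per hour."""
    return pages_per_minute * 60


def scan_cost_dollars(pages, cents_per_page=3):
    """cost in dollars at a flat per-page price (default: 3 cents)."""
    return pages * cents_per_page / 100


print(scan_minutes(450, 60))    # office-level feeder, 450-page book
print(scan_minutes(450, 20))    # book scanner at 20 pages a minute
print(pages_per_hour(20))       # hourly throughput at 20 pages a minute
print(scan_cost_dollars(450))   # the same book at 3 cents a page
```

this only counts feed time; cutting the cover off, loading pages, and checking the results are not included.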

so it all depends.

the bottom line, though, is that if a person has experience,
good equipment, solid software, and a concentrated focus,
they can open a paper-book to start scanning it and move it
all the way through to finished, high-power, full-on e-book
in one evening, maybe two.

***

i said:
...
third, you used a reasonable naming-scheme for your image-files!
  the scan for page 3, for instance, is named 003.png!  fantastic!
  and when you had a blank page, your image-file says "blank page"!
  please pardon me for making a big deal out of something so trivial
  -- and i'm sure some lurkers wrongly think i'm being sarcastic --
  but most people have no idea how uncommon this common sense is!
  when you're working with hundreds of files, it _really_ helps you
  if you _know_ that 183.png is the image of page 183.  immensely.
  even the people over at distributed proofreaders, in spite of their
  immense experience, haven't learned this first-grade lesson yet.

i forgot to mention earlier that my processing tool can automatically
rename your image and text-files, based on the page-numbers that it
finds right in the text-files (which it extends in sequence for those
files without a page-number -- usually the section-heading pages).

so even if you're dealing with someone else's scans, and _they_ didn't
name their files wisely, you don't have to deal with the consequences.
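
to give a feel for the renaming idea, here is a rough sketch in python.  it is not the code from my tool.  it assumes the printed page-number sits alone on the first or last non-blank line of each text-file, and the names `find_page_number` and `plan_renames` are invented for this example:

```python
import re


def find_page_number(text):
    """return the printed page number found in a page's text, or None.

    assumption: the number sits alone on the first or last non-blank line.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for candidate in lines[:1] + lines[-1:]:
        if re.fullmatch(r"\d{1,4}", candidate):
            return int(candidate)
    return None


def plan_renames(pages, ext=".png"):
    """map old file names to page-number names like 003.png.

    `pages` is a list of (old_name, ocr_text) pairs in scan order.
    a page with no visible number gets the previous page's number plus one.
    """
    plan = {}
    previous = None
    for old_name, text in pages:
        number = find_page_number(text)
        if number is None and previous is not None:
            number = previous + 1      # extend the sequence
        if number is None:
            continue                   # nothing to extend from yet
        plan[old_name] = f"{number:03d}{ext}"
        previous = number
    return plan


pages = [
    ("scan_a.png", "2\nsome text on page two"),
    ("scan_b.png", "CHAPTER ONE"),     # section-heading page, no number
    ("scan_c.png", "more text\n4"),
]
print(plan_renames(pages))
```

the sketch only builds a rename plan; actually renaming the files would be a loop over the plan calling `os.rename`, done once you've eyeballed the plan for surprises.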

***

jon said:
...
I believe as you do that an error reporting system is a good idea
  so readers may submit errors they find in the texts they use -- 
  sort of an ongoing post-DP proofing process.

i didn't elaborate earlier that it goes much deeper than that.

a very important point here is that an error-reporting system
-- over and above the obvious effect of getting errors fixed --
will actively incorporate readers into the entire infrastructure,
making them active participants in accumulating a world of e-books.

if you have ever edited a page on a wiki, you're likely aware that
the experience gives a very strong feeling of _empowerment_ --
because you can "leave your mark" right on a page, quite literally.

if we set up a wiki-page to collect the error-reports for an e-text,
in a system allowing people to check the text against a page-image,
they'll be much more motivated to report errors than they are now,
with the "send an e-mail" system.  the feedback is more immediate,
and compelling, with a wiki.  furthermore, by collecting the reports
in the change-log right on the wiki, you can avoid duplicate reports.
you can also give a rationale for rejecting any submitted error-report,
and/or engage people in a discussion about whether to act on a report.

all of this makes your readers feel _responsible_ for the e-texts.

a lifetime of experience with printed matter has made people very
_passive_ about typographic errors.  there's no reason to "report"
an error they find in a newspaper, for instance, because hey, it's
already been printed.  the same with a magazine or a printed book.
water under the bridge.  and they carry that same attitude over
to e-books, even though it _does_ do good to report errors there.
so we need to do something to shake them out of their passivity,
something to make them feel _responsible_ for helping fix errors.

(just for the record, although i use the term "wiki", i don't mean it
literally.  what i have in mind is more of a "guestbook" type method,
where people can _add_ their text to the page, but not necessarily
_delete_ what other people have added.  it's thus more like a blog,
where everyone can add their comments to the bottom of the page,
but the top part stays constant, to list the "official" information.
but i'll still use the term "wiki" to connote a free-flowing attitude.)
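
a minimal sketch of that guestbook idea, in python, might look like the following.  the class and field names are hypothetical; the point is just the shape: a constant header, an add-only list of reports, a duplicate check, and a place for a maintainer's note:

```python
class ReportPage:
    """guestbook-style page: a fixed header plus an add-only list of reports."""

    def __init__(self, header):
        self.header = header       # the "official" top part, stays constant
        self._reports = []         # only ever appended to, never deleted

    def add(self, page, bad, good, who="anonymous"):
        """append a report unless the same correction is already listed."""
        key = (page, bad.strip(), good.strip())
        if any(r["key"] == key for r in self._reports):
            return False           # duplicate report, nothing added
        self._reports.append({"key": key, "who": who, "note": None})
        return True

    def annotate(self, index, note):
        """attach a maintainer's note, e.g. the rationale for a rejection."""
        self._reports[index]["note"] = note

    def render(self):
        """return the whole page as plain text, header first."""
        lines = [self.header, ""]
        for n, r in enumerate(self._reports, 1):
            page, bad, good = r["key"]
            lines.append(f"{n}. page {page}: {bad!r} -> {good!r} ({r['who']})")
            if r["note"]:
                lines.append(f"   note: {r['note']}")
        return "\n".join(lines)


log_page = ReportPage("error-reports for some e-text")
first = log_page.add(183, "teh quick fox", "the quick fox", who="a reader")
second = log_page.add(183, "teh quick fox", "the quick fox", who="someone else")
log_page.annotate(0, "checked against the page-image; approved")
print(first, second)
print(log_page.render())
```

note there is no delete method at all, which is the whole difference from a literal wiki.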

in addition to the wiki, you can build an error-reporting capability
into the viewer-program that you give people to display the e-texts.
if they doubt something in the e-text, they click a button and boom!,
that page-image is downloaded into the program so they can see it.
if they have indeed found an error, they copy the line in its bad form,
correct it to its good form, and then click another button and boom!,
the error-report is e-mailed right off to the proper e-mail address.
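
here is one way the viewer-program could package such a report, sketched in python with the standard `email` and `json` modules.  the field names, the e-text id, and the address `errors@example.org` are placeholders i made up for the example; the message is built but not sent:

```python
import json
from email.message import EmailMessage


def build_error_report(etext_id, page, bad_line, good_line,
                       to_addr="errors@example.org"):
    """package one correction as an e-mail with a machine-readable body."""
    report = {
        "etext": etext_id,    # which e-text the reader was viewing
        "page": page,         # which page-image they checked against
        "bad": bad_line,      # the line as it appears in the e-text
        "good": good_line,    # the line as the reader corrected it
    }
    msg = EmailMessage()
    msg["To"] = to_addr
    msg["Subject"] = f"error-report: e-text {etext_id}, page {page}"
    msg.set_content(json.dumps(report, indent=2))
    return msg


msg = build_error_report("12345", 183,
                         "teh quick brown fox", "the quick brown fox")
print(msg["Subject"])
print(msg.get_content())
```

sending it would be a call to `smtplib.SMTP(...).send_message(msg)` with whatever mail server the project uses.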

this symbolic (and real!) incorporation of readers into our processes
is a rad thing to do.  but it's not the _only_ benefit of such a system;
it also facilitates the automation of the error-correction procedures.

the error-report can be formatted such that your software can
automatically summon the e-text _and_ the relevant page-scan.
so you see a screen with the page-scan _and_ the error-report.
you check its merit, and if it's good, click the "approve" button
and the e-text is automatically edited.  further, the change-log
is updated right on the wiki-page for that e-text, and anyone who
requested error-notification gets an e-mail describing the change.
auxiliary versions of the e-text -- like the .html and .pdf files --
are automatically updated.  and all you did was click one button...
face it, if you're dealing with 15,000+ e-texts, doing it manually
is a sure-fire way to burn yourself out.  who needs that hassle?
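
the "approve" step itself can be sketched in a few lines of python.  this assumes a report is a small record with `page`, `bad` and `good` fields (names invented here), and that the e-text is held as a list of lines; it refuses to act if the bad line is missing or shows up more than once:

```python
def approve(etext_lines, report, change_log):
    """apply one approved report to the e-text and record the change.

    returns True if the bad line was found exactly once and replaced.
    """
    matches = [i for i, line in enumerate(etext_lines)
               if line == report["bad"]]
    if len(matches) != 1:
        return False          # missing or ambiguous; needs a human look
    etext_lines[matches[0]] = report["good"]
    change_log.append(
        f'page {report["page"]}: "{report["bad"]}" -> "{report["good"]}"'
    )
    return True


etext = ["it was teh best of times,", "it was the worst of times,"]
report = {
    "page": 1,
    "bad": "it was teh best of times,",
    "good": "it was the best of times,",
}
change_log = []
applied = approve(etext, report, change_log)
print(applied)
print(etext[0])
print(change_log)
```

rebuilding the .html and .pdf versions and e-mailing the people who asked for notification would hang off the same function, right after the change-log line; they are left out here to keep the sketch short.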

i mocked up a demo of this, using a simple a.o.l. guestbook script.
i'm sure you versatile script-kiddies here could do something that
was much more sophisticated, but my version will give you the idea:
     http://users.aol.com/bowerbird/proof_wiki.html

-bowerbird

Bowerbird＠aol.com