another jumble of responses...

***

roger said:
>   Uh, no. I didn't put the project up for you
>   to scrape and try to draw conclusions.

except that "drawing conclusions" by analyzing data
after experimentation is simply "what i do", roger...


>   I don't have those numbers and I'm not going to
>   try to derive them from the data.
...
>   Right now, I don't care which way is better, technically.
>   I believe the choice will be made on which approach
>   is most comfortable to the user. I've heard from
>   some people that solo process that actually
>   like to go through the book page at a time because
>   they enjoy following the story as they go, which
>   doesn't happen when someone is in production mode
>   at the book level.
...
>   You chose to scrape it and open up the discussion.

yes i did choose to do that.  because that's what i do --
analyze data, draw conclusions, and open up discussion.

but i realize that you didn't "ask" for this discussion, and
you do not care to discuss some things i talk about, and
if you don't want to continue with it, that's fine with me...

i appreciate what i got from you this round, like always,
but if you'd rather not be engaged, then i can accept that.

i think you could write better software if you did do tests,
with well-designed research, intended to get clear data,
from which you could "draw conclusions" that will make
you smarter about the underlying dynamics at work, but
i'm past the point of caring very much about your stuff...
if you wanted to talk to me about these things, you had
chances in 2007 or so.  you didn't want to _then_, either.

as you can tell from my "pontifications", my willingness
to persist in this quest for meaningful dialog is waning.

so i'm just getting some things out of my system, so that
when 2012 rolls in, i can start anew, with a clean slate...
i feel a strong desire to stop treading all this old ground,
and start moving the ball forward; the world is now ready.

if i have a lingering need to say stuff in the coming year
on any of the particular topics instantiated here recently,
i'll address messages to a fictionalized amalgamation of
don, roger, alex, etc. -- a character dubbed "dodger-x".

and yes, i also realize my "tone" is often "disrespectful".

it didn't start out that way, but the treatment i received
from some people here often turned it in that direction.

i don't feel that people automatically "deserve" respect.
respect is something you have to _earn_, with _quality_.
my commitment toward honesty, integrity, and truth is
far too strong to dole out _respect_ to shoddy bullshit...

you _can_ automatically expect _courtesy_ from me, yes.

although some people forfeited even that, when they
seemed to think it was ok to treat _me_ discourteously,
just because i didn't pander to their need for "respect".

and you certainly won't get "respect" by bullying me...

i learned a long time ago not to let myself be bullied --
that you had to give bullies back their own "medicine",
to expose their cowardice, or they would chew you up...

but it's also the case that i simply _have_ no "respect"
for the kind of willful ignorance you expressed above.
i mean, it's _smart_ to "not care" about things that are
irrelevant to what you do.  but relevant considerations?
well, it's _stupid_ to "not care" about _those_ things...

and no, roger, don't get all upset, because i am _not_
"calling you stupid".  precisely because two days later
you were coming back to the list saying, "hey, i noticed
that my accuracy at finding the italics really sucked..."

so here you are, dragging in data that made you wiser
about a flaw in your workflow, so you really _did_ "care".
and that's the kind of attention to detail that i _respect_.
that's the kind of _wisdom_ which i _value_.  it's too bad
that it's far too little, and far too late, to matter much...

on the bright side, even though you're years behind me,
you're _light-years_ ahead of almost everyone else here.

so i give you the strongest encouragement possible.

but, like i said, you shoulda talked to me back in 2007.
because starting in 2012, i'll move on to better things.

***

but since we're still here in 2011, let me clear my plate...

***

everyone agrees that some parts of the digitization workflow
need to be done on a book-wide basis, by a single individual.

they just _fail_to_recognize_ those parts in this discussion...

this typically includes steps at the beginning of the workflow
-- e.g., obtaining the scans, photoediting them if necessary,
getting the o.c.r. in order and the other preprocessing steps,
including the naming of the files, and discarding the garbage.
nobody would give much consideration to the argument that
these steps should be done "page by page", or collaboratively.
sometimes you need just one person in charge of operations.

and typically, it also includes steps at the end of the workflow
--  final checks to ensure the products are reasonably correct,
mounting the final products, accepting any error-reports, etc.
again, it would be silly not to have one person handle a book.

it is in the middle where differences of opinion emerge, on:
1.  whether it's done by one person, or multiple people.
2.  whether it's on a book-wide basis, or page-by-page.

those are the main divisions, but occasionally along the way
some people have introduced even more confusions, such as
the people who insisted -- when i started bringing this up --
that any book-level changes were those done "automatically",
and thus "blindly", so they would then argue against _that_...
fortunately, we haven't heard that kind of blather here lately.
good thing, too.  talk about a straw-man...  geez!

as another example, this one which _is_ extremely recent,
roger said some things that gave me the strong impression
that he defined the "book-level" versus the "page-level" by
_how_the_text_was_stored_.  using a concatenated-text-file
meant that he was doing stuff at the level of the "book", but
once he split the text for each page into its own file, then
he was operating at the "page" level.  now, i refuse to believe
that he thought something as arbitrary as his _file-storage_
determined the level.  but that's how he seemed to be talking!

for the record, just to try to make lemonade out of this lemon,
it's trivial to split the text of the book, slicing it various ways.
you can split it into pages, and many times that slice is useful.
you can split it into sections (e.g., chapters), also useful often...
sometimes you'll need to split it into paragraphs, like we just
discussed in the topic of checking for balanced doublequotes.
and of course, for spellcheck, you need to split into _words_.
for curly-quotes and em-dashes, you go down to characters.
i rarely find a need for it, but you can slice by _sentence_ too.
and then, last but not least, i often slice the file into _lines_...
there's no question that lines are the best way to show diffs
of the type that's most likely to occur when correcting o.c.r.
so, when you're writing code to handle the digitization tasks,
you're constantly splitting, joining, resplitting, and rejoining,
depending on what you are trying to accomplish at the time...
so, to a programmer, talking about the way the data is stored
as your definition of the level of analysis is... bass-ackwards.
oh, and just to close this sub-topic, i store text as _books_.
it's too inconvenient to store the text in so many "page" files.
"betty lee" is under 300k, so that is a very manageable size...
(a couple of its _individual_ page-images are that same size!)
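
to make the slicing concrete, here's a minimal python sketch.
the function names are made up for illustration, and the
form-feed page separator is an assumption (any unique marker
would do); this is not taken from any actual tool...

```python
import re

def split_pages(book, separator="\f"):
    # assumption: a form-feed marks each page break in the stored text
    return book.split(separator)

def split_paragraphs(text):
    # paragraphs are runs of non-blank lines separated by blank lines
    return [p for p in re.split(r"\n\s*\n", text.strip()) if p]

def split_lines(text):
    return text.split("\n")

def split_words(text):
    # letters plus internal apostrophes; crude, but enough for a spellcheck pass
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)

sample = "It was late.\nBetty didn't mind.\n\n\"Come in,\" she said.\fPage two here."

pages = split_pages(sample)
first_page = pages[0]
paragraphs = split_paragraphs(first_page)
lines = split_lines(first_page)
words = split_words(first_page)
characters = list(first_page)

# joining the pages back together gives the stored book again
rejoined = "\f".join(pages)
```

the stored form stays one string, and every slice is derived
from it on demand, then thrown away when the task is done.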

at any rate, my point is that, on this overall thread involving
how we think about an approach-level to digitization, there's
a lot of confusion out there about how to conceive the topic.

and, needless to say, it's impossible to argue against _every_
type of misconstrual out there, so let's ignore the wacky ones,
and focus our concentration on the top two contenders now.
(because, as we'll see, even the top two end up meaningless.)

as per convention over at d.p., we'll call these middle parts
"proofing" or "formatting".  the difference of opinion is over:

1.  whether it's done by one person, or multiple people.
2.  whether it's on a book-wide basis, or page-by-page.

now, i fully recognize that d.p. has warped a lot of brains
into thinking these middle parts are purely reductionistic.

which is bullshit.

we're not building automobiles.  we don't need a factory,
filled with employees, milling around the assembly line...

but let's go back in time, so we can give d.p. its due credit.

before charlz franks dreamed his brilliant little brainstorm,
book digitization was a solitary and lonely path to tread on.

michael hart had done the job of creating a force of volunteers,
but by and large those individuals did a book _by_themselves_.
they might proof each other's work, after a book was "finished",
but they didn't split a book to do it -- not even split it in half!

so what charlz concocted was truly astoundingly revolutionary!

it was one of the first projects which leveraged the connectivity
-- the _social_ connectivity -- of the internet, believe it or not.

and that alone marked it as a nifty piece of futurism.

but the idea that you could digitize a book _page-by-page_
was just as challenging to the "accepted wisdom" of the day.

indeed, going in, everybody wondered if it would even work!
not if it would work "well"...  whether it would work _at_all!_
most people actually _doubted_ it would get off the ground.

hey, charlz himself wasn't sure.  "it was just an experiment."

so he was overjoyed -- along with the other people around --
when the first book was done, then the second, and a third...

only after a few months (or maybe even more than "a few")
did he actually come to believe that it was going to succeed.
he told me that face-to-face in 2003, and even at that time,
he appeared to be shocked and amused that it really worked.

and maybe we shouldn't have been surprised.

after all, the main task is correcting the scannos, right?,
and how many ways _are_ there to correct a misspelling?

but there _were_ differences of opinion on whether it'd work,
and perhaps it'll be instructive to consider the reasons.

many people believed that digitizing a book required a
consistency of purpose, a unity of vision, which would
be difficult (or impossible) to achieve in a collaboration.
there was a belief too many cooks would spoil the soup.

and back then, it took a month, or 2 or 3, to do a book,
and there was much doubt that a group would cohere
for such an extended period, on a fairly difficult task.
so there was also doubt cooks would even stick around
long enough to ruin the soup!  it'd be a revolving door,
nothing more than a never-ending nightmare of churn.

and, even with the benefit of hindsight, the concerns
weren't that far-fetched.  if everyone went about the job
doing things "their own way", it woulda ruined the soup.

and there have been enough other "distributed" failures
-- including several book-digitization ones -- that the
doubt that critical mass would form has been validated;
d.p. ain't the rule, it's the exception that proves the rule.

thus it ended up that that very same social connectivity
"solved" those problems, by creating a _community_...

the community did form a critical mass, which ensured
that there would be plenty of cooks, for a long time,
even if some of them would burn out of the kitchen...

and the community formed a social core that enforced
some consistency on how its members performed, so
a "unity of vision" was achieved via that peer pressure.

not that it was easy.

nope, despite the obvious response to "how many ways
can there possibly be to correct a simple misspelling?",
there arose a plethora of other points of disagreement.

a _lot_ of them.

disagreements about ellipses, and italics markup, and
blank lines at the top of a page, and end-line hyphens,
and how to solve the problems of persistent log-jams,
and how many rounds to have, and how to skip rounds,
and how to deal with the asshole known as bowerbird,
and whether postprocessing can be distributed or not,
and how come some people keep removing my notes?,
and how such-and-such a rule should be "interpreted",
and how to install guiguts, and what spellchecker to use,
and eighty-seven possible options on that spellchecker,
and how to handle greek, and whether to use utf-8, and
how you hurt my feelings, now my feelings are hurt, and
what are you gonna do about my hurt feelings, and why
did you hurt my feelings, and now i feel so _hurt,_ and
why are you such a mean person anyway?, meanie!, and
how come my book is stuck in the p2 queue again, and
what _do_ you do about a problem like maria, anyway?
d.p. people are like a family, and that's how they fight,
just like a family, over everything, for years and years...
and then they try to kiss and make up, and tell everyone
how much they love them, and how much they love d.p.,
how important it is to them, and the world at large, and
how much they love each other, and... back to fighting...

so no, it wasn't easy, not really.  love and pain.  family.

and now a hardening of the arteries that makes it very
difficult to improve because of a resistance to change.

social connectivity ain't always all it's cracked up to be.

***

now...  where was i?

oh yeah, so now, the pendulum swung the other way,
because we found out that d.p. actually _would_ work,
and now people think it's the _only_ way that can work.

which is stupid, of course.

i mean, seriously, kiddies, we knew how to spellcheck
before distributed proofreaders came along...  really...

and it's _possible_ to spellcheck a whole book.  really.

it's not _required_ that you have a hundred people
doing two pages apiece.  seriously, kids, it's not...

but that's what they know, and some people believe that
what they know is the only thing that can possibly exist.

and they refuse to peek out of their shell, because they
don't wanna have that stupid belief become invalidated.

so now we have way too many people who "doubt" that
a book-level approach to digitization will actually work.

but, as i pointed out, up at the top, there are many parts
of the digitization process that are -- even now -- done
by one individual working alone, on a book-wide basis...

even roger, with his focus on the page-by-page method,
is -- quite ironically -- doing much of his books alone...

he chooses books, he scans them, he mounts the scans,
sometimes he has people help him with the _proofing_
and the _formatting_, and he has a couple dependable
smoothreaders too, but he does all the postprocessing
by himself, and he submits the products to d.p. himself.
heck, folks, he even writes his own software for the job.

and once he learns how easy it is to do the whole book,
with the right tool, watch how he will go off on his own.

he's done about 600 books with the help of his friends.
i won't be surprised if he does another 600 by himself.

by the way, i got christmas greetings from nick hodson,
a 77-year-old from england who's digitized 750 books.
all by himself...  using software tools he wrote himself...

and for many years now, nick has used _text-to-voice_
as one of his best proofing tools, and he swears by it...

i've played around with it myself, and yes, _it_works_.
it is also convenient as can be, with iphone and ipad...
i don't mention it, since you guys don't deserve it, but
mark my words, that's how it'll be done in the future.

with siri and mobile, you're gonna see that a move to
voice-oriented systems will become very accelerated.

in the meantime, you will still be hung up on things like
"book-level" versus "page-level", which mean _nothing_.

those two methodologies are _not_ mutually exclusive.
they work _with_ each other, like a hand in a glove, in
a synergistic manner, so you gotta be able to do both.
at the same time.  using the same tool.  figure it out...

***

ironically, the series i interrupted in order to pursue
this current thread was "a review of digitization tools",
where i was discussing points that are quite relevant.

specifically, the ideal tool will run online and offline,
and it will be designed so that the job can be done by
one person all alone, or multiple people collaborating.

so, for instance, a python script would fit the bill.

a web-app _might_, but only providing that it can
be run offline, without any access to the internet.
(or an offline alternative app would be acceptable,
if functionality is substantially identical, or better.)

the truth is simple:  certain tasks need to be done,
in order for a book-digitization to be "successful"...

and it doesn't matter if those tasks are done by
one person or by multiple people, online or off.
as long as they're done, the book gets digitized.

for instance, since roger brought it up recently,
one of the tasks is "checking the paragraphing".

sometimes o.c.r. gets the paragraphs wrong...
it either incorrectly splits a paragraph into two
(or more) paragraphs with improper blank lines,
or incorrectly joins two paragraphs into one by
improperly deleting the blank line between 'em.

so you need to look at each and every paragraph
in your text, and compare it to the relevant scan,
to make sure there were no paragraphing errors.

it doesn't matter if you do that offline or online.

and it doesn't matter if one person or many do it,
as long as every paragraph on every page is done,
and done correctly.  now, from a practical position,
if you have multiple people doing different pages,
then you must _coordinate_ their efforts, so as to
make sure that every page actually gets checked...
but as long as it _does_, that's what truly matters.
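
the coordination itself can be very simple bookkeeping.
a minimal sketch, assuming each volunteer reports pages
as (checker, page) pairs; the names and the log format
are hypothetical, just for illustration...

```python
def unchecked_pages(all_pages, check_log):
    """return the pages that nobody has reported checking yet, in book order."""
    checked = {page for _checker, page in check_log}
    return [page for page in all_pages if page not in checked]

all_pages = ["p001", "p002", "p003", "p004"]
check_log = [("alice", "p001"), ("bob", "p003"), ("alice", "p004")]

remaining = unchecked_pages(all_pages, check_log)
print(remaining)
```

one person working alone would use the very same list;
the log would just have a single name in it.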

i don't know if you want to call that "page-level"
or "book-level", but the label don't mean much.
it is what it is.  you have to check the whole book,
and you do the check against each separate scan.
call it whatever you like, it doesn't mean a thing.

and that's the way almost all of these tasks are.
what matters is that they _get_ done, not _how_.

the other important consideration here, though,
is that they are done _correctly_.  when you have
one person looking at a page, and that is _all_,
the odds are that, over the course of a full book,
an average person is gonna make a few mistakes.
to err is human.  even on an easy task.  especially!

so -- if you want to ensure each page is right --
you're gonna need to have more than one check.

whether it's from a separate round of volunteers,
or the beta smoothreaders, or your end-users, or
even you yourself alone checking it over and over,
that's what it's gonna take to have 100% accuracy.
and even then ya might not reach that perfection.

there's no need to be pessimistic, however, since
there is another factor that has an impact here...
namely, we can often code programmatic checks
that offer varying degrees of certainty that a task
has been completed successfully.  indeed, we can
sometimes even have the program do a "first pass"
on the text, and then simply verify what it's done.

let's take this paragraphing task as an example...

first of all, there's a very simple find operation
to help find some improperly-split paragraphs:
two-linebreaks-followed-by-a-lowercase-letter.
boom!  odds are you just located a few problems;
not all the improper-splits, but _some_ of them.

now improve the routine, so you're searching for
a-lowercase-letter-followed-by-two-linebreaks.
boom!  you might've located a few more glitches.
(if you'd done this routine before the other one,
it would have found some of the same problems.
but both of them can detect some unique cases.)
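
those two searches can be sketched as regular expressions
in python.  this assumes plain "\n" line endings, and the
names are invented for the example...

```python
import re

# two linebreaks followed by a lowercase letter:
# the "new paragraph" starts in mid-sentence
BREAK_THEN_LOWER = re.compile(r"\n\n(?=[a-z])")

# a lowercase letter followed by two linebreaks:
# the "paragraph" ends without any punctuation
LOWER_THEN_BREAK = re.compile(r"(?<=[a-z])\n\n")

def suspect_breaks(text):
    """return sorted character offsets of blank lines flagged by either search."""
    hits = set()
    for pattern in (BREAK_THEN_LOWER, LOWER_THEN_BREAK):
        hits.update(m.start() for m in pattern.finditer(text))
    return sorted(hits)

text = (
    "He walked to the\n\ndoor and stopped,\n\nand then turned.\n\n"
    "She waited for him and\n\nThen he spoke."
)

suspects = suspect_breaks(text)
```

in this sample, the break after "stopped," is caught only
by the first search, the break after "him and" only by the
second, and the break after "to the" by both.  the break
after "turned." is left alone, as it should be.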

you can improve the routine further, if you work.

or you can even try a whole different approach.

that's what i've done, to check the paragraphing.
to test my routine, i eliminate all the blank lines
from an o.c.r. file, and have it put them back in.
it does amazingly well.  i'll show you sometime.
so, to cut a long story short, when roger implied
that this job needed to be done by human eyes,
he was wrong...  the computer can do very well,
providing you have programmed it intelligently.
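
the test setup described there (strip the blank lines, let
a routine restore them, compare) can be sketched like this.
it's only the scoring harness, under made-up names; the
routine being tested is whatever you plug in, and the
"rebuilt" text below is a hand-made stand-in for its output.

```python
def break_positions(text):
    """indices (counting non-blank lines only) of lines followed by a blank line."""
    positions, index = set(), -1
    for line in text.strip().split("\n"):
        if line.strip():
            index += 1
        else:
            positions.add(index)
    return positions

def strip_blank_lines(text):
    return "\n".join(line for line in text.split("\n") if line.strip())

def score(original, rebuilt):
    """return (correct, missed, spurious) counts of paragraph breaks."""
    want, got = break_positions(original), break_positions(rebuilt)
    return len(want & got), len(want - got), len(got - want)

original = "First para line one\nline two.\n\nSecond para.\n\nThird para."
flat = strip_blank_lines(original)

# stand-in for a routine's output: one break right, one missed, one spurious
rebuilt = "First para line one\n\nline two.\n\nSecond para.\nThird para."
result = score(original, rebuilt)
```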

heck, truth be told, even dumb programming
does a fairly good job.  if you put a blank line
after every sentence-terminated line followed
by a capital letter, you get a lot of them right.
despite an excess of cases of improper splits,
a big majority of the paragraphs are correct.
and improving that routine is also fairly easy,
until you've got 98% of the paragraphs right.
ain't gonna tell you the secret of the last 2%,
but i'll show you an app proving it is doable.
if you ask nice.  otherwise just take my word.
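
for the curious, the "dumb" rule might look something like
this in python.  it's a sketch of that one rule only, not
the improved routine; allowing quotes and brackets around
the sentence boundary is an added assumption, and the names
are invented...

```python
import re

# line ends with . ! or ? (optionally followed by closing quotes/brackets)
ENDS_SENTENCE = re.compile(r"[.!?][\"')\]]*\s*$")
# line starts with a capital (optionally preceded by opening quotes/brackets)
STARTS_CAPITAL = re.compile(r"^\s*[\"'(\[]*[A-Z]")

def insert_blank_lines(text):
    """put a blank line after each sentence-ending line that precedes a capital."""
    lines = [line for line in text.split("\n") if line.strip()]
    out = []
    for current, following in zip(lines, lines[1:] + [""]):
        out.append(current)
        if ENDS_SENTENCE.search(current) and STARTS_CAPITAL.match(following):
            out.append("")
    return "\n".join(out)

flat = (
    "It was a dark night and the\n"
    "rain fell hard.\n"
    "\"Who is there?\" she asked.\n"
    "Nobody answered. She\n"
    "waited a while.\n"
    "Then the door opened."
)

rebuilt = insert_blank_lines(flat)
```

here the rule inserts three breaks.  whether the one before
"Then the door opened." is right depends on the scan, which
is exactly the "excess of improper splits" problem.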

it ends up that -- for most of the tasks that
need to be done -- you can program a test,
to see whether you think the file is adequate.

i've already discussed code for doublequotes,
but that's yet another perfect example of this.
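
as a reminder of the general idea (this is a generic sketch,
not the code discussed earlier): count the straight
doublequotes in each paragraph, and flag the odd ones.

```python
import re

def unbalanced_paragraphs(text):
    """return (paragraph_number, paragraph) pairs with an odd count of doublequotes."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p]
    return [(n, p) for n, p in enumerate(paragraphs, start=1) if p.count('"') % 2]

text = (
    "\"Good morning,\" said Betty.\n\n"
    "\"Is it? I had not\nnoticed, said Lee.\n\n"
    "They walked on in silence."
)

flagged = unbalanced_paragraphs(text)
```

a flag is a prompt to look at the scan, not a verdict:
a quotation that runs over several paragraphs leaves
the earlier ones unclosed, quite legitimately.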

again, just so everyone is perfectly clear here,
there's always a need for good smoothreaders.
every digitizer is gonna make some mistakes...
we march to perfection, we don't just "arrive"...
and we're never really sure we've gotten there.
we always expect "at least 1 more" error-report.

but...

there are book-digitization tasks
which need to be accomplished...

to say that a book is "finished",
you must be able to certify --
with some degree of certainty
-- that those tasks were done.

how those tasks get accomplished
-- whether by one person or many,
offline or online -- doesn't matter.
what matters is that they got done,
and that we can verify they're done.

that's the tool that we need to have,
to do the job, a tool that allows us to
1.  do the tasks, and
2.  verify they're done.
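
the "verify" half can be as plain as a list of named checks
run against the text.  a minimal sketch, with two simple
checks standing in for a real checklist; every name here
is hypothetical...

```python
import re

def no_lowercase_after_break(text):
    # a paragraph should not start with a lowercase letter
    return re.search(r"\n\n[a-z]", text) is None

def doublequotes_balanced(text):
    # every paragraph should hold an even number of straight doublequotes
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return all(p.count('"') % 2 == 0 for p in paragraphs)

CHECKLIST = [
    ("paragraphing", no_lowercase_after_break),
    ("doublequotes", doublequotes_balanced),
]

def verify(text, checklist=CHECKLIST):
    """return the names of the tasks whose check fails on this text."""
    return [name for name, check in checklist if not check(text)]

rough = 'He said, "Hello.\n\nthen he left.'
clean = 'He said, "Hello."\n\nThen he left.'

rough_failures = verify(rough)
clean_failures = verify(clean)
```

an empty list doesn't prove the book is perfect; it only
says that these particular checks found nothing to report.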

you can make up terminology if you like,
and draw up some graphs in your mind
which are as meaningless as the pictures
your kid colored, now stuck to your fridge-door
(albeit less captivating than their pictures),
but when you boil digitization to its essence,
what you find is "there are tasks to be done."

kudos to roger for building software to help.
and to don too, for ongoing advocacy of the
importance of "checklists" to the workflow...

and let us not forget that don built an app.
once again, folks, you should get "twister".

>   http://code.google.com/p/dp50/downloads/list

yeah, it's almost 2 years old now, and it has
lots of features that are "not implemented",
but you still get a feel for "how it should be".
and you reg-ex fans will be happy to know
that you can install your own set of reg-ex,
and have the app use it.  (i couldn't get that
particular thing to work, but maybe you will.)

but as you use "twister", ask yourself whether
it's working at the "book" or the "page" level.
if you can come up with an answer, tell me...

i also have a brand new thing to show you,
but i'll save that for a later "pontification"...

have a nice day.

-bowerbird