aaron said:
>   LOL.  So much for
>   "_all_ these e-books are also available as text,
>   and have always been available in that format."

aaron, i don't have experience with the daisy format.
are you telling us that one can't get text out of that?
and are you claiming archive.org doesn't _start_ with
the o.c.r. output when it creates the daisy-format file?
if they give you daisy, and not text, then that must be
part of the d.r.m. they wrap around the file to prevent
the books from being "pirated" out to the seeing world.

what i said _is_ true about the public-domain material.
perhaps none of the public-domain material is counted
in their announcement of "more than 1 million books"...


>   I suspect that the quality of the public domain books
>   will be roughly the same as the copyrighted texts.

so we agree on that.


>   As to whether or not they will spend the time and
>   money to improve the quality of their books, I doubt it.

so we agree on that.


>   The cost would otherwise be prohibitive, unless
>   the improvement could be made via software.

well, we cannot know that unless we categorize the errors.
which is precisely why i am undertaking such categorization.

but yes, i've done more than enough research along the way
to say unequivocally that _many_ improvements can be made,
via software, with -- at most -- minimal human input needed.

which pinpoints some of the ridiculousness around this issue.

on the one hand, there are apologists who try to tell us that
"even bad o.c.r. is better than no o.c.r. at all."  well, that's true.

but if we're really willing to settle for that, then why do those
people over at distributed proofreaders even bother working?

obviously, some of us put a value on correcting bad o.c.r.

on the other side of the tightrope, we have those same people
at distributed proofreaders who try to tell us that their job of
correcting o.c.r. is horrendously difficult and time-consuming.

and this is equally ludicrous.  it takes them a long time to do it
because they're doing it in a way that takes a long time to do it.

i'm trying to walk the middle-ground, which seems to me like
it should be a very broad path that's obviously the best to take,
where we spend a sensible amount of time on a book (an hour),
and take it to a place where it meets almost all of our needs...

we can decide to ignore quality and focus on quantity alone,
and we can have millions of books; and all of 'em are flawed.

we can decide to ignore quantity and focus on quality alone,
and we can have 23,456 books, like project gutenberg does...

we can even do _both_, and have 23,456 high-quality books
and millions of flawed ones.  so, is that what we want to do?

or we can spend one hour per book, and have half-a-million
pretty-darn-good books, with all their obvious flaws corrected.

i don't know why it's so hard to see that this middle path is the best.

-bowerbird