if you want to skip directly to the polls, just scroll down...
***
ok, i haven't yet done an exhaustive search, so i'm not sure,
but i have been unable to find an 1818 frankenstein scanset.
which is sad. millions of books scanned, and not this one?
and i guess it _kinda_ absolves greg weeks for using a reprint.
because this book really _needed_ to get re-done, since it is
almost surely in the list of "lowest-quality p.g. digitizations",
especially if we factor in the download-count, which we
should, since errors in a widely-downloaded book do the
most damage.
***
and yes, after some brief checks, #84 definitely seems _not_
to be based on the 1818 edition, meaning that it _probably_
used the _1831_ printing, as that is the other major one for
this particular book. and there is controversy about _which_
of the two should be considered the "authoritative" edition,
so p.g. should most definitely include _both_ in its library...
(let's hope #84 wasn't based on a mixture of both editions.
or maybe it doesn't matter, as it needs to be re-done too.)
***
at any rate, #84 isn't going to help clean #41445, because
we find the "comma-toggle edits" which seem to be typical
with a reprint. (that is, if the sentence had a comma in the
previous edition, then _remove_ it; otherwise, _insert_ one;
either way, the publisher will believe they got their money's
worth from your editing, because "look at all the changes!")
in addition, however, there are some actual author-did-it
changes from one edition to the next. as i said, there was
a chapter seemingly added (or deleted, i can't remember),
and some paragraphs were rewritten rather extensively...
(one argument in the editions-controversy is that shelley
was older, and more concerned with respectability, by the
second printing, and thus "toned down" the aspects that
might have hurt her reputation.)
all of this argues _against_ any comparison being useful...
i have maintained for many years now that _comparisons_
are the future of digitization clean-up, but i've been clear
that you gotta have a substantial similarity to start with...
and if you've got different editions, that is a big red flag!
they might be similar enough that comparison still helps.
but they might not, and -- at the start -- you don't know.
so it helps to be aware of it, and to make sure before you
get in too deep that the benefits will outweigh the costs...
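to make that concrete, here is a rough sketch -- in python,
with placeholder filenames, and most definitely _not_ my
actual tools -- of the kind of up-front check i mean, using
nothing fancier than the standard difflib ratio:

    import difflib

    def word_similarity(text_a, text_b):
        # compare at the word level, so differences in line-wrapping
        # between the two sources don't count against the score
        matcher = difflib.SequenceMatcher(
            None, text_a.split(), text_b.split(), autojunk=False)
        return matcher.ratio()

    # running this on a whole book can be slow, so a sample chapter
    # from each version is usually enough to make the call
    sample_a = open("chapter01_version_a.txt").read()
    sample_b = open("chapter01_version_b.txt").read()
    print(round(word_similarity(sample_a, sample_b), 3))

if that number comes back very high, the diffs will mostly
be transcription noise, which is exactly what you want; if
it sags much below that, you are probably looking at two
different editions, and the red flag is flying.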
one of those costs is the task of prepping the comparison.
as jim has made clear, by his refusal to answer questions,
the job of rewrapping text is _not_ as simple as he says...
and that's especially true if you do everything you should,
like restore the end-line hyphenates to their original state.
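and here is a bare-bones sketch of that prep step (python
again, and the de-hyphenation rule is deliberately naive,
just to show exactly where the trouble lives):

    import re

    def prep_for_comparison(raw):
        # naive rule: a word broken across a line-end is rejoined
        # without the hyphen ("digi-" + "tization" -> "digitization").
        # the hard cases are compounds like "self-" + "respect",
        # where the hyphen has to _stay_ -- a blind rule like this
        # gets those wrong, which is why the prep job isn't trivial.
        text = re.sub(r"(\w)-\n(\w)", r"\1\2", raw)
        # unwrap the lines inside each paragraph, but keep the
        # blank-line paragraph breaks, so wrapping differences
        # between the two versions drop out of the comparison
        paragraphs = re.split(r"\n\s*\n", text)
        return "\n\n".join(" ".join(p.split()) for p in paragraphs)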
and the big job, of course, is actually _checking_ each diff.
i have tools to make the job just about as easy as possible,
but it _still_ takes time. especially with thousands of diffs.
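for the curious, the core of a diff-checking pass is nothing
exotic; this sketch (python, standard difflib, placeholder
filenames) just lists each spot where the two prepped texts
disagree, with a little context, so a human can eyeball it:

    import difflib

    def diffs_to_check(prepped_a, prepped_b):
        # walk the word-level opcodes and report each place
        # where the two texts disagree, plus some leading context
        a, b = prepped_a.split(), prepped_b.split()
        matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                continue
            context = " ".join(a[max(i1 - 4, 0):i1])
            yield "%s: ...%s [%s] vs [%s]" % (
                tag, context, " ".join(a[i1:i2]), " ".join(b[j1:j2]))

    text_a = open("frankenstein_41445_prepped.txt").read()
    text_b = open("frankenstein_84_prepped.txt").read()
    for number, diff in enumerate(diffs_to_check(text_a, text_b), 1):
        print(number, diff)

every line that prints is one judgment call for a human to
make, and that is where the hours go.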
so when most of the diffs are "real ones" -- i.e., genuine
differences between the editions, rather than transcription
errors that need to be corrected -- the cost/benefit ratio
worsens considerably, sometimes _badly_.
that is almost certainly the case here with "frankenstein"...
at least in _my_ opinion, based on my quick analysis.
but as i said, jon hurst can make up his own mind about
whether or not he is willing to waste his time doing that.
however, it does, i think, raise a good question for a poll.
*** *** ***
*** *** ***
*** *** ***
*** *** ***
*** *** ***
*** *** ***
*** *** ***
*** *** ***
*** *** ***
*** *** ***
*** *** ***
*** *** ***
the polls start here, folks!
*** *** ***
*** *** ***
*** *** ***
so here's the question:
> in general, is it worthwhile to do any further cleaning on
> a book which just came out of distributed proofreaders?
the last good estimate is that the average d.p. digitization
has approximately 50 errors in it, most of 'em punctuation.
that was a few years back, so maybe d.p. has gotten better.
or maybe, because they are losing a lot of their old-timers,
they've gotten worse. but we have no good data either way.
so it just boils down to what we "think" might be "true"...
so, what do you think?
again:
> in general, is it worthwhile to do any further cleaning on
> a book which just came out of distributed proofreaders?
my vote is _no_. most of those books are _clean_enough._
we should use our valuable resources to do _more_ books.
now, just so that i am understood completely: i am
also 100% supportive of building the capacity for
volunteers to _improve_ existing e-books, including those
from d.p. (and elsewhere) which were just posted _today_...
but, in the allocation of scarce resources, a book that has
just been put through 4-8 rounds is likely _clean_enough_.
so the vote so far is: no=1, yes=0.
(so if nobody answers the poll, i just decided the matter.)
***
and heck, if i'm gonna do one poll, might as well do two...
so here's the question for the second poll:
> for a list of "the bottom-20 p.g. digitizations",
> which books would you nominate for inclusion?
> (note: download-count acts as a multiplier here;
> so we want bad-quality plus high-downloads.)
here are a few "honorary nominations", which
have already been suggested _often_ in the past:
> peter pan
> tarzan
> frankenstein
> alice in wonderland
> pride and prejudice
not really eligible, due to negligible download-count:
> laieikawai
-bowerbird