notes, thoughts, and two polls, because we're all so chatty

if you want to skip directly to the polls, just scroll down...
***
ok, i haven't yet done an exhaustive search, so i'm not sure, but i have been unable to find an 1818 frankenstein scanset. which is sad. millions of books scanned, and not this one? and i guess it _kinda_ absolves greg weeks for using a reprint. because this book really _needed_ to get re-done, since it is almost surely in the list of "lowest-quality p.g. digitizations", especially if we factor in the download-count, which is smart.
***
and yes, after some brief checks, #84 definitely seems _not_ to be based on the 1818 edition, meaning that it _probably_ used the _1831_ printing, as that is the other major one for this particular book. and there is controversy about _which_ of the two should be considered the "authoritative" edition, so p.g. should most definitely include _both_ in its library... (let's hope #84 wasn't based on a mixture of both editions. or maybe it doesn't matter, as it needs to be re-done too.)
***
at any rate, #84 isn't going to help clean #41445, because we find the "comma-toggle edits" which seem to be typical with a reprint. (that is, if the sentence had a comma in the previous edition, then _remove_ it; otherwise, _insert_ one; either way, the publisher will believe they got their money's worth from your editing, because "look at all the changes!")
in addition, however, there are some actual author-did-it changes from one edition to the next. as i said, there was a chapter seemingly added (or deleted, i can't remember), and some paragraphs were rewritten rather extensively... (one argument in the editions-controversy is that shelley was older, with more respect, at the second printing, thus "toned down" the aspects that might hurt her reputation.) all of this argues _against_ any comparison being useful...
i have maintained for many years now that _comparisons_ are the future of digitization clean-up, but i've been clear that you gotta have a substantial similarity to start with... and if you've got different editions, that is a big red flag! they might be similar enough that comparison still helps. but they might not, and -- at the start -- you don't know. so it helps to be aware of it, and to make sure before you get in too deep that the benefits will outweigh the costs...
one of those costs is the task of prepping the comparison. as jim has made clear, by his refusal to answer questions, the job of rewrapping text is _not_ as simple as he says... and that's especially true if you do everything you should, like restore the end-line hyphenates to their original state.
and the big job, of course, is actually _checking_ each diff. i have tools to make the job just about as easy as possible, but it _still_ takes time. especially with thousands of diffs. so when most of the diffs are "real ones" -- i.e., ones that do not help to locate errors which need to be corrected -- the cost/benefit ratio worsens considerably, even _badly_.
that is almost certainly the case here with "frankenstein"... at least in _my_ opinion, based on my quick analysis. but as i said, jon hurst can make up his own mind about whether or not he is willing to waste his time doing that.
however, it does, i think, raise a good question for a poll.
*** *** *** *** *** *** *** *** ***
the polls start here, folks!
*** *** *** *** *** *** *** *** ***
so here's the question:
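One rough way to judge up front whether a comparison between two editions will pay off is to count how many word-level diffs are punctuation-only (the "comma-toggle" kind) versus changes to the words themselves. The sketch below is only an illustration using Python's standard difflib; the function names and the sample sentences are made up for the example and are not taken from either Frankenstein text or from any tool mentioned in this thread.

```python
import difflib
import string

def bare_words(words):
    """Strip leading/trailing punctuation; drop tokens that were only punctuation."""
    stripped = (w.strip(string.punctuation) for w in words)
    return [w for w in stripped if w]

def classify_diffs(text_a, text_b):
    """Count word-level diffs, split into punctuation-only and substantive."""
    a, b = text_a.split(), text_b.split()
    counts = {"punctuation": 0, "substantive": 0}
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # if the words match once punctuation is ignored, only punctuation changed
        same_words = bare_words(a[i1:i2]) == bare_words(b[j1:j2])
        counts["punctuation" if same_words else "substantive"] += 1
    return counts

# invented sample sentences, for illustration only
old = "it was a dreary night, and the rain fell"
new = "it was a dreary night and the rain poured"
print(classify_diffs(old, new))
```

If the "substantive" count dominates on a sample chapter, that is a hint the two texts are different editions and that most diffs will not point at digitization errors.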
in general, is it worthwhile to do any further cleaning on a book which just came out of distributed proofreaders?
the last good estimate is that the average d.p. digitization has approximately 50 errors in it, most of 'em punctuation. that was a few years back, so maybe d.p. has gotten better. or maybe, because they are losing a lot of their old-timers, they've gotten worse. but we have no good data either way. so it just boils down to what we "think" might be "true"... so, what do you think? again:
in general, is it worthwhile to do any further cleaning on a book which just came out of distributed proofreaders?
my vote is _no_. most of those books are _clean_enough._ we should use our valuable resources to do _more_ books. now, just because i want to be understood completely, i am also 100% supportive that we need to build the capacity for volunteers to _improve_ existing e-books, including those from d.p. (and elsewhere) which were just posted _today_... but, in the allocation of scarce resources, a book that has just been put through 4-8 rounds is likely _clean_enough_. so the vote so far is: no=1, yes=0. (so if nobody answers the poll, i just decided the matter.)
***
and heck, if i'm gonna do one poll, might as well do two...
so here's the question for the second poll:
for a list of "the bottom-20 p.g. digitizations", which books would you nominate for inclusion? (note: download-count acts as multiplier here; so we want bad-quality plus high-downloads.)
here are a couple "honorary nominations", which have already been suggested _often_ in the past:
peter pan
tarzan
frankenstein
alice in wonderland
pride and prejudice
not really eligible, due to negligible download-count:
laieikawai
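The weighting described in the second poll (download-count as a multiplier on badness) could be sketched as a simple product. This is only one possible reading of "multiplier". The field names and every number below are placeholders invented for the example, not real error counts or download figures for any p.g. book.

```python
def rank_worst(books):
    """Order books by estimated_errors * downloads, highest (worst) first."""
    return sorted(
        books,
        key=lambda b: b["estimated_errors"] * b["downloads"],
        reverse=True,
    )

# placeholder figures, invented purely for illustration
candidates = [
    {"title": "book a", "estimated_errors": 300, "downloads": 100},
    {"title": "book b", "estimated_errors": 50, "downloads": 20000},
    {"title": "book c", "estimated_errors": 500, "downloads": 10},
]

for book in rank_worst(candidates):
    print(book["title"], book["estimated_errors"] * book["downloads"])
```

Under this scoring a moderately flawed but heavily downloaded book outranks a badly flawed one that almost nobody downloads, which matches the "not really eligible" remark above.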
-bowerbird

On Thu, November 29, 2012 2:57 pm, Bowerbird@aol.com wrote: [snip]
and yes, after some brief checks, #84 definitely seems _not_ to be based on the 1818 edition, meaning that it _probably_ used the _1831_ printing, as that is the other major one for this particular book.
The first 7 chapters were based on the 1985 Penguin Classics edition which modernized some spelling and punctuation. It looked like the remainder was based on the 1831 edition (probably in a modern reprint). I may have scans around of a photo reprint of the 1818 edition, if anyone is interested.... [snip]
so here's the question:
in general, is it worthwhile to do any further cleaning on a book which just came out of distributed proofreaders?
No, not as to the text. The formatting/markup will obviously have to be improved in a different environment. [snip]
and heck, if i'm gonna do one poll, might as well do two...
so here's the question for the second poll:
for a list of "the bottom-20 p.g. digitizations", which books would you nominate for inclusion? (note: download-count acts as multiplier here; so we want bad-quality plus high-downloads.)
Dracula? A Tale of Two Cities?
here are a couple "honorary nominations", which have already been suggested _often_ in the past:
peter pan tarzan
Yes to Tarzan of the Apes, just because we know that #78 is the bowdlerized version, and no other versions have been produced.
frankenstein
Yes, although it appears that at least one replacement Frankenstein is in the works, so further attention may not be necessary.
alice in wonderland
Which Alice? The latest ones look pretty good, and came out of the DP process, so I would think that the basic text is pretty accurate.
pride and prejudice
Again, which one? And I haven't seen much comment about failures in P&P (other than markup). Let's get a complaint forum up and going, then see what people complain about most.

as jim has made clear, by his refusal to answer questions, the job of rewrapping text is _not_ as simple as he says...
This is more BB misrepresentation, which he constantly does to try to make the point that his favorite hobby-horses are "the only way to fly." I take off the headers and footers so that the two texts have more or less matching starts and ends, and then unwrapping takes a few seconds. It's not that I don't answer BB's questions; it's that he never listens. One can try my software, for better or for worse: I have posted it. Or send me a text you want unwrapped, and I will do it for you.
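For readers who want a feel for the two steps described here (trim the header and footer, then unwrap), a minimal sketch follows. It is not the posted software. It assumes the common PG convention of marker lines beginning with "*** START OF" and "*** END OF", assumes paragraphs are separated by blank lines, and makes no attempt to handle end-of-line hyphens. Both function names are hypothetical.

```python
import re

def strip_pg_boilerplate(text):
    """Keep only the lines between the '*** START OF' and '*** END OF' markers.

    If a marker is missing, that side of the text is left untrimmed.
    """
    lines = text.splitlines()
    start, end = 0, len(lines)
    for i, line in enumerate(lines):
        if line.startswith("*** START OF"):
            start = i + 1
        elif line.startswith("*** END OF"):
            end = i
            break
    return "\n".join(lines[start:end])

def unwrap(text):
    """Join hard-wrapped lines so that each paragraph becomes a single line."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n".join(" ".join(p.split()) for p in paragraphs)

sample = (
    "some header\n"
    "*** START OF THIS EBOOK ***\n"
    "\n"
    "first line\nsecond line\n"
    "\n"
    "new para\nmore\n"
    "*** END OF THIS EBOOK ***\n"
    "some footer\n"
)
print(unwrap(strip_pg_boilerplate(sample)))
```

Once both texts are one paragraph per line, they can be fed to any line- or word-based diff.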
like restore the end-line hyphenates to their original state.
Conversely, use dictionary-based conservative end-line dehyphenation/pull-up at the start of your normalization process, and then you never have to worry about it again. Just because DP does it wrong doesn't mean that everyone has to emulate them.
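A conservative dictionary-based pull-up of the kind described here might look roughly like the sketch below. It is an illustration under stated assumptions, not any particular tool's behavior: `dictionary` is a hypothetical set of lowercase known words supplied by the caller, and a split word is rejoined without its hyphen only when the joined form is in that set. Otherwise the hyphen is kept, and in both cases the fragment is pulled up onto the earlier line.

```python
import re

# word fragment, hyphen at end of line, rest of the word plus any trailing punctuation
SPLIT_WORD = re.compile(r"(\w+)-\n(\w+)(\S*)[ \t]*\n?")

def dehyphenate(text, dictionary):
    """Pull up words split across a line break, dropping the hyphen only if confirmed."""
    def fix(match):
        head, tail, trailing = match.groups()
        if (head + tail).lower() in dictionary:
            word = head + tail            # confirmed: an end-of-line hyphen only
        else:
            word = head + "-" + tail      # unconfirmed: keep the hyphen to be safe
        return word + trailing + "\n"
    return SPLIT_WORD.sub(fix, text)

# tiny stand-in word list, for illustration only
words = {"example", "monster"}
print(dehyphenate("an exam-\nple of a self-\nevident case\n", words))
```

The conservative part is the fallback: a compound like "self-evident" keeps its hyphen because the joined form is not in the word list, at the price of leaving some true end-of-line hyphens in place when the word list is incomplete.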
in general, is it worthwhile to do any further cleaning on a book which just came out of distributed proofreaders?
The simple answer is to select and read a few pages at random and see if you spot any errors. If yes, the book needs fixing. If no, the book is probably "OK." Let's try this approach on, say, 76.html. Bugs found in the first couple of pages:
Background: pink
Missing textual version of the title page
First line of Illustrations indented incorrectly
Repeats in list of illustrations
"BY ORDER OF THE AUTHOR" is textual information presented only as a graphic.
"EXPLANATORY" indentations don't make sense
Use of leading all-caps words doesn't match PG guidelines.
Images don't have accessible-friendly alt-tags.
Textual information in Bust presented only as a graphic.
Use of indentation-style paragraphs without following the convention of *not* indenting the first paragraph of a chapter. [100s of places]
Use of SHOUT emphasis carried over inappropriately from the archaic (and mistaken) practice of quasi-marking italic emphasis in historical PG "txt" files, whereas HTML actually HAS technology for correctly marking and rendering italics aka <i>italic emphasis</i>
Etc.
Again, 76.html has about 1500 errors, most of them of the SHOUT encoding variety.
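A quick way to get a rough count of SHOUT-style emphasis candidates in an HTML file is to strip the tags and look for all-caps words. The sketch below is a crude heuristic, not a validated checker: the function name is hypothetical, the tag removal is a simple regex rather than a real HTML parser, and legitimate capitals (roman numerals, headings, acronyms) will show up as false positives that still need a human look.

```python
import re

def find_shout_words(html):
    """Return all-caps words (two or more letters) found outside of tags."""
    text = re.sub(r"<[^>]+>", " ", html)        # crude tag removal, fine for a quick check
    return re.findall(r"\b[A-Z]{2,}\b", text)

# invented snippet, for illustration only
snippet = "<p>He was <i>very</i> tired, but I was NOT going to WAIT.</p>"
print(find_shout_words(snippet))
```

Each hit is only a candidate for conversion to <i> markup; whether it really was italic in the paper book has to be checked against the scans.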
participants (3)
- Bowerbird@aol.com
- James Adcock
- Lee Passey