
don said:
If PG is to have a hope of citizen-provided proofing, crowd or otherwise, there are two non-negotiable design requirements.
1. The page images must be available as the basis for resolving all questions about content (understanding the images aren't always unambiguous themselves, but having no images at all is hopeless).
2. The text must have the page numbers embedded as data, so the error-submission process can easily confirm the text against the image.
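
(For illustration only -- a minimal sketch of what requirement 2 could look like in practice: page numbers carried as inline data markers in the text, so an error report can point straight at the matching page image. The "[pg N]" marker syntax and the "page-NNNN.png" image naming are assumptions made up for this sketch, not anything PG or DP actually uses.)

import re

# minimal sketch (illustration only): resolve any line of a text back to the
# page it came from, using hypothetical "[pg N]" markers embedded as data.
PAGE_MARK = re.compile(r"^\[pg (\d+)\]$")

def page_for_line(text_lines, target_index):
    """Return the page number in effect at a given line index."""
    page = None
    for i, line in enumerate(text_lines):
        m = PAGE_MARK.match(line.strip())
        if m:
            page = int(m.group(1))
        if i == target_index:
            return page
    return page

def error_report(text_lines, line_index, suggested_fix):
    """Bundle an error report with the scan image needed to confirm it."""
    page = page_for_line(text_lines, line_index)
    return {
        "line": line_index,
        "text": text_lines[line_index],
        "suggested_fix": suggested_fix,
        "page": page,
        # hypothetical image-naming convention, purely for the sketch
        "scan_image": f"page-{page:04d}.png" if page else None,
    }

if __name__ == "__main__":
    sample = ["[pg 1]", "It was the best of tixes,",
              "[pg 2]", "it was the worst of times."]
    print(error_report(sample, 1, "It was the best of times,"))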
well, don, yes, your general point is certainly spot-on. if i wanted to quibble, i could mention that it would certainly be possible to build capabilities that could handle much of the task via a variety of other means, some maybe even more effective (such as auto-comparison of multiple digitizations; a rough sketch of that idea follows below). nonetheless, i would still have to admit that, at many points along the way, it is simply _necessary_ to be able to refer to the original, to quell some issue. it's an unavoidable need for a digitizing system.

***

and yes, i understand quite well that you are merely probing to see if the powers-that-be will "concede" that they will indeed have to install such capability. since if they don't agree to _that_, we just go home, knowing that the job before us would be impossible.

***

still, don, perhaps you're getting ahead of yourself. because even if they _do_ concede, would it matter? is there anyone to build such a correction system? (and by "build", i mean "program". it's obvious we have no shortage of "architects" here who're willing to draft up their favorite fancy blueprints. but without anyone to actually build the thing...) so, would anyone code a p.g. correction capability?

as someone who has volunteered -- many times -- to program a system for project gutenberg that did exactly that, i do believe i can answer that question. today, in 2012, i would never build that system for p.g., and not just because p.g. lost all its credit with me... i'd advise anyone else _not_ to do it either... the reason is simple: not enough bang for your buck.

first of all, the requirements are hard for p.g. to meet. even just the two don listed are hard, because p.g. never paid any attention -- at all -- to page-numbers. and even though they give a little lip-service to scans, those aren't part of their d.n.a. either. so the infrastructure is wholly and completely unprepared for such demands.

indeed, d.p. has the scans for most of the books they've digitized over the years -- online as we speak, i believe. (if a provider asks other sites not to repost their scans, d.p. will respect that request. that's a bad response, to my mind, because nobody should put that restriction on scans from a public-domain book, and we shouldn't honor such a restriction if anyone does try to impose it; but that's the d.p. policy.) at any rate, the fact remains that d.p. _has_ the scans. but p.g. doesn't mount them.

and yes, for all the p.g. books, you can probably go find a scan-set somewhere, at google or the internet archive or somewhere else. but who is it that'll do that legwork? and how do we decide which version, if there are many? and then there are the practical issues, like file-naming. and the p.g. linebreaks, which do not match the scan-set. and scan-sets will often manifest some glitches, which -- at d.p. -- requires a "hospital". who mans that? none of these are trivial issues. they can be surmounted, at least individually, but it's work. en masse? more work. and if you think _programmers_ are in short supply here, take a look around and see how many gophers you count.

but even if we ignore all of that, for now, we can't ignore that p.g. _could_ be mounting scans now. but it doesn't. which tells me the p.g. infrastructure can't handle scans, not with any volume, which means this thing won't scale.

so that's a difficulty facing you. but it's not the hardest one. no sir. not by a long shot. the hardest part would be the obstacle of the politics.
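
(a quick aside before getting to the politics: here's a rough sketch of that "auto-comparison of multiple digitizations" idea, in python, which also sidesteps the linebreak problem by comparing at the word level instead of the line level. this is just an illustration of the approach -- it's not anything p.g. or d.p. actually runs.)

import difflib

def words(text):
    """flatten a text to a word list, so linebreak differences don't matter."""
    return text.split()

def disagreements(text_a, text_b, context=3):
    """yield small word-windows around every spot where the two texts differ."""
    a, b = words(text_a), words(text_b)
    matcher = difflib.SequenceMatcher(None, a, b)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            yield (" ".join(a[max(0, i1 - context):i2 + context]),
                   " ".join(b[max(0, j1 - context):j2 + context]))

if __name__ == "__main__":
    # two digitizations of the same sentence, with different linebreaks and
    # one o.c.r. error ("tirnes") -- the disagreement is what gets flagged
    # for a human to check against the page image.
    digitization_1 = "It was the best\nof times, it was the worst of tirnes."
    digitization_2 = "It was the best of times,\nit was the worst of times."
    for a, b in disagreements(digitization_1, digitization_2):
        print("version 1:", a)
        print("version 2:", b)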
just navigating around marcello, all by itself, would impose such a huge cost on you that it wouldn't be worthwhile. no way to go anywhere without marcello blocking you. then add greg, and the whitewashers, plus all the flak you'd have to take from d.p. for invading "their" space, not to mention all the "cooks" in the gutvol-d "kitchen". all with the best of intentions, mind you. (well, maybe except for marcello.) but a huge obstacle nonetheless. heck, it'd be a _nightmare_ -- and that's putting it mildly.

***

but even if, by some miracle, you were prepared to deal with such immense costs, the benefits are feeble. the most you're gonna get by fixing the p.g. library is 40,000 books. and that is a total fantasy number. at least 10,000 of the "books" are repeat fragments, audio files, and various things which are not "books". so the _real_ number is closer to 30,000, if that many. but even if it _was_ 40,000, or 80,000, that number is _dwarfed_ by the collections now available elsewhere.

moreover, a very high percentage of the p.g. e-books are _already_ very clean, especially in a relative sense, which lessens further the bang you'd get for your buck. nobody cares much if you remove the last 100 errors from a book. and if they knew you spent _100 hours_ fixing those last 100 errors, they'd say you were crazy. they'd rather you spent that time taking 10 new books from "unacceptably dirty o.c.r." to "ok, this will work".

and where can you do that? well, over at the internet archive, of course. so, if any programmers want to build such a system, my advice to them is to "build it for the internet archive". their infrastructure _does_ support page-scans, and it _does_ keep track of page-numbers, and their stuff is exposed on the web, in a somewhat-documented way, so you can grab it from them without talking to them. (which is a good thing, because they're just as deaf as the p.g. powers-that-be are, so count your blessings.)

so the costs of building their system will be much lower, and with millions of books the benefits will keep comin'. plus correcting their text would be a real contribution, rather than simply gilding the lily on a few p.g. books. dollar-for-dollar, pound-for-pound, bang for the buck is gonna be _far_ greater at the internet archive than at p.g. much lower costs. much greater benefits. no-brainer.
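
(to make that "grab it from them without talking to them" point concrete, here's a small sketch of pulling an item's file list and its plain-text o.c.r. from archive.org over its public endpoints. the /metadata/ and /download/ url patterns are real; the item identifier below is just a placeholder, and the "_djvu.txt" filename convention is an assumption that holds for many items but not all -- check the file list for the item you actually want.)

import json
import urllib.request

IDENTIFIER = "someidentifier"  # placeholder: substitute any archive.org item id

def item_files(identifier):
    """return the names of the files an item exposes, via the metadata endpoint."""
    url = f"https://archive.org/metadata/{identifier}"
    with urllib.request.urlopen(url) as resp:
        meta = json.load(resp)
    return [f["name"] for f in meta.get("files", [])]

def fetch_plain_text(identifier):
    """download the item's o.c.r. text, if a *_djvu.txt file is present."""
    for name in item_files(identifier):
        if name.endswith("_djvu.txt"):
            url = f"https://archive.org/download/{identifier}/{name}"
            with urllib.request.urlopen(url) as resp:
                return resp.read().decode("utf-8", errors="replace")
    return None

if __name__ == "__main__":
    text = fetch_plain_text(IDENTIFIER)
    print(text[:500] if text else "no *_djvu.txt file found for this item")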
People will self-verify if they are given the tools. If they can't, the whitewashers are screwed again.
the whitewashers screwed themselves by not installing such a verification-system in the first place, and that's something i have been saying since december of 2003, when p.g. had 10,000 e-texts, because it was apparent even then that we needed a way to go back and repair the books.

a significant mark of the uselessness of this endeavor is the fact that -- more than 8 years after that date -- we are _still_ having a "dialog" about "how" we "could" perhaps "build" a "system" that "might" (or might not) "accomplish" the "goal" of "correcting" those "e-books".

it is a testament to michael hart's vision and achievement in building, from the grass-roots, an actual cyberlibrary, that so many good people have shown we're willing to waste valuable years of our lives in service to that goal. but maybe, just maybe, it's time to stop wasting our time. michael is gone, folks. there is no vision here any more...

i, for one, am not gonna keep discussing this until the year 2020; i can already see the p.g. powers-that-be are deaf _and_ blind.

-bowerbird