
ok, i talked about the positive parts of rfrank's roundless experiment. now it's time to review the "bad news" -- the not-so-positive parts...

many of the _negative_ parts are quibbles about the implementation of the positive aspects, so i will discuss them at the end of this post... there are, however, a few points that are still almost fully negative.

***

the very first thing i do with a set of files that i start working with is to make sure that they're named _correctly_and_intelligently_...

that is, a filename must explain, all by itself, the file's _contents_, and the filename must contain the pagenumber from the p-book. moreover, every file must have a _unique_ filename, and every file associated with the same page should share a similar name (e.g., the name will be the same, but with a different extension).

i've done extensive work with datasets that follow these rules and with datasets that do not, and i can say without any hesitation that the datasets which do not follow these rules are much more clumsy, and waste bits of my time that are small but accumulate to significance. and that's why i will no longer even work with badly-named files. it's just unnecessary frustration.

people who use badly-named files will tell you they have "adapted" to the naming. that's pure and simple crap. they don't know better because they haven't worked long and hard with both kinds of data. they are handicapped, and they just don't know they are handicapped.

rfrank names his files incorrectly. maybe someday he will learn...

***

rfrank does a lot of things right. his scans are extremely well-done, which indicates that he is very careful and meticulous when scanning. it's quite likely he also does some refinements on the scans, such as straightening them, centering them, and perhaps despeckling them. they look quite nice, and they are generally a pleasure to work with...

however, all this care seems to be dropped once he's done the o.c.r.
his preprocessing routines used to be abysmal. they're better now, but they still have considerable room for improvement. i'm hopeful that he's learned the lesson. he has pushed many of his checks back, from postprocessing to the proofing stage. so now he just needs to push them back from the proofing stage to the preprocessing stage.

it would perhaps be very helpful in this regard if _somebody_ who is working at the fadedpage.com site would _volunteer_ to do the step of nondistributed preprocessing, thus freeing rfrank from doing it... he's probably feeling very overwhelmed at the moment, so an offer like that would probably be something that he would accept readily, and it would make a remarkable difference in his progress.

***

we've already discussed recently that rfrank should submit his scans along with his postings to p.g. alas, he's picked up the bad habit of failing to do that from his distributed proofreaders upbringing. pity.

***

it would also be good if rfrank kept the linebreaks and pagebreaks of the original p-book when he submitted the book to project gutenberg. but hey, that's unlikely, isn't it?

what _is_ more likely, however, is that he would keep the linebreaks _consistent_ between the various versions that he submits to p.g. but on one file i checked, the 7-bit version was wrapped differently than the 8-bit version, which was wrapped differently than the .html. this is madness, if/when it comes to doing long-term version-control.

***

rfrank also picked up the bad habit from d.p. of "clothing" em-dashes. of all the stupid things d.p. does, this is among the most stupid of all. and yet rfrank, who showed the ability to rethink proofing/formatting and roundlessness per se, failed to grasp the basic stupidity of this...

***

ditto with unhyphenating the end-of-line hyphenates. i take it all back about "clothing em-dashes" being the most stupid thing...
dehyphenating has to be the _most_ stupid, because when the proofers do this, they actually destroy the evidence that a computerized routine would use to do the job _properly_, which is _on_a_book-wide_basis_... again, the failure of rfrank to rethink such an obvious stupidity is sad... (kudos, however, to one of his members, for spelling it out in a post. let's just hope that that reasoning will soak into rfrank's busy brain.)

***

again, repeating a d.p. flaw, rfrank strips runheads and pagenumbers from his o.c.r. and perhaps fate is trying to teach him a lesson on this, because he has had several problems where text on a page was deleted, or replaced with text from some other page. these types of problems can be detected and prevented when each page contains its pagenumber.

in general, you want to _retain_ this information because it "earmarks" each page of text, making it clear what book it comes from, and where. it also serves as the "suspenders" in a "belt and suspenders" approach along with the filename, which will contain the very same information, and thus the two make it very easy to crosscheck and confirm each other.

the silliness of naming all your scansets "001.png" through "999.png" and expecting their subdirectory name to distinguish them is _stark_... (and it has caused all kinds of grief for people in the past, i assure you.)

***

rfrank hasn't really installed any instructions of his own, just letting his members rely on their d.p. training, so he has no policy of his own on ellipses, at least that i've been able to detect. but it would be refreshing if he decided to avoid the merry-go-round of never-ending changes that sometimes happens at d.p., and went _exclusively_ with the 3-dot ellipsis. (it's funny, because many of his books don't even seem to _have_ ellipses!)

***

rfrank is putting a lot of stock in "c.i.p." except, to confuse _everyone_, to him, "c.i.p."
means "confidence in proofer", not "confidence in page", which is how everyone else defined the term, up to this point in time...

now, me?, i don't think you can put much stock in "confidence in proofer". even the best proofers miss errors, and they don't know when they miss, so i don't think that you can trust their judgment and get perfect pages.

rfrank's big mistake here is that he's not necessarily looking for "perfect", since he sees himself, as the postprocessor, as the last line of correction, and he's willing to take a non-perfect page if he can get it a little faster... even if that's fine for him, i don't think it's a good way to build a system.

but even then, i just don't think "confidence in proofer" will actually work. or, to be more accurate, i think it'll work just well enough that rfrank will put lots of energy into it before he finds out it doesn't work well enough. or, worst case scenario, he'll convince himself that it really _is_ working, and other people will believe him, and we'll all end up with non-perfect pages.

on the other hand, rfrank has shown in some cases that he _can_ learn from the data, and change his mind on something he held dearly, so...

***

ok, now we're down to the implementation quibbles...

***

first, i'll repeat that it's sad that rfrank is "archiving" his finished projects. it would help all of us learn more about roundlessness if he left them up. i offer webspace if rfrank needs it. and project gutenberg has offered too.

***

it's still "in-process", so i expect that it might improve, but the spellcheck display that rfrank is offering would benefit by retaining the linebreaks, so the search for "unresolved words" on the pagescan would be much easier to do.

i'd also like to see each unresolved word in _clickable-button_ form, for both the good-word and bad-word lists, so a button-push would add the word to the chosen list. (in the current form, a person has to copy-and-paste each of the words.)
i must add, though, that the ability to include these words immediately is a _tremendous_improvement_ over the d.p. method, one that shows its value to the proofer right away, and is thus very robust and valuable. empowering the proofers to benefit themselves is a remarkable asset...

***

in this regard, an ability for a _proofer_ to execute a global change would be a mind-blowing step, and thus a very brave thing to try...

of course, bear in mind that i believe that all global changes should have every occurrence verified, so take the suggestion appropriately. and i believe that any global changes that might be required _should_ be sussed out during the preprocessing, before proofers even see text.

but nonetheless, putting such a powerful tool in the hands of proofers would speak _volumes_ on the responsibilities you entrust them with, and thus make a tremendous statement that would _embolden_ them... even if they never ever used it...

***

as it is now, though, rfrank does the global changes himself, and he has been a little reluctant to do the job in the way that he really "should"... at least he was in one case -- where he declined to fix a contraction -- but perhaps that was not representative of his feelings more generally, so i'll let it go for now...

***

as i said before, i don't know whether the d.p. separation between proofing and formatting is a good thing or not. i see the arguments in favor of it, and they seem compelling to some degree, but i also know that the vast majority of pages have little or no formatting, so i'm reluctant to lay another step on the overall process for no benefit.

so, in cases like this where the answer is unclear, i'd do an experiment. luckily, rfrank is doing an experiment. it's not a well-controlled one, and we're not really privy to all of the data, so it's far from being ideal, but at least we're engaged in the active questioning of an unknown... still, it would be nicer if we were doing the experiment _properly_...
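(going back to the global-change idea a couple of sections up: here is a minimal sketch, in python, of what "every occurrence verified" could look like. to be clear, this is _not_ rfrank's code, and i have no idea how his system stores pages. the page dictionary, the names `find_hits` and `apply_verified`, and the sample text are all made up for illustration. the idea is just that the routine lists each hit with some context, a human approves hits one by one, and only the approved hits get changed.)

```python
import re
from dataclasses import dataclass


@dataclass
class Hit:
    page: str      # hypothetical page key, e.g. "p001"
    start: int     # offset where the match starts in that page's text
    end: int       # offset just past the match
    context: str   # a bit of surrounding text for the human to look at


def find_hits(pages, word, width=20):
    """list every whole-word occurrence of `word`, page by page."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    hits = []
    for page in sorted(pages):
        text = pages[page]
        for m in pattern.finditer(text):
            lo = max(0, m.start() - width)
            hi = min(len(text), m.end() + width)
            hits.append(Hit(page, m.start(), m.end(), text[lo:hi]))
    return hits


def apply_verified(pages, hits, approved, replacement):
    """change only the hits whose index is in `approved`; return a new dict."""
    result = dict(pages)
    # go from the end of each page backward, so earlier offsets stay valid
    order = sorted(approved,
                   key=lambda i: (hits[i].page, hits[i].start),
                   reverse=True)
    for i in order:
        h = hits[i]
        text = result[h.page]
        result[h.page] = text[:h.start] + replacement + text[h.end:]
    return result


# made-up sample pages, not from any real project
pages = {
    "p001": "he said dont go. dont!",
    "p002": "the word dont appears here too.",
}
hits = find_hits(pages, "dont")

# pretend a human looked at each hit's context and approved hits 0 and 2
fixed = apply_verified(pages, hits, approved={0, 2}, replacement="don't")
```

the unapproved hit stays exactly as it was, and the original `pages` are left untouched, so it's easy to diff before and after.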
***

rfrank does a pretty good job of showing proofers their diffs, _except_ that you must visit each project-page to see your diffs for that book... it would be far better if you were presented with all of 'em on one page. (and i would emphasize that page by presenting it to the user _first_, when they return for more proofing, so they'll realize its importance.)

there's also the slightly troubling aspect that if you mark a page "done", the odds are lowered that it will be proofed again, so you don't obtain the satisfaction of getting a "no-diff" result on that page. i do believe that's counterproductive, and i'd like to see every page reproofed once, even after the page was marked "done", even by a high-c.i.p. proofer...

and, of course, having the page reproofed, and having the "done" status confirmed by a "no-diff" by the subsequent proofer, would also raise the "confidence-in-page" for that page, and thereby serve a double benefit... (conversely, if the next proofer finds an error, they have rescued a false "done".)

***

the "page tweet" idea is a good one. (the astute observer might realize that this is the same idea i always use on the bottom of my web-pages, where a person can leave a comment about that specific p-book page.)

however, a way to _consolidate_ the tweets for a book would be useful. (and easy to code.) that way, a proofer could look at all of the "tweets" and perhaps answer some of the questions being posed, or fix some of the problems being reported, or take some other kind of positive action (such as finding a person who _can_ fix the problem if you cannot do it).

also, it would be good if there were some dedicated buttons on the page, so it would be easy to say things like "difficult formatting, please check" or "foreign language specialist needed on this page", or stuff like that...
again, that way a person perusing all of the tweets for a book will know exactly what needs to be done among this list of possible specific tasks.

***

and -- just to finish up this post by taking it back to the beginning -- i note with amusement that rfrank uses the term "page" throughout his system. he lists the "pages" that you've done, and calls the notes you attach "page tweets", and the diffs are listed by "page", and so on.

so it is ironic that when he talks about "page 123", he's not _really_ talking about _page_123_ at all! he's really talking about _.png_ 123! and the file named "123.png" probably isn't about page 123 at all! indeed, let's review the 7 files named "123.png" rfrank has up now:
http://fadedpage.com/p/201002140505/d/123.png
http://fadedpage.com/p/201002140533/d/123.png
http://fadedpage.com/p/201002270757/d/123.png
http://fadedpage.com/p/201002280257/d/123.png
http://fadedpage.com/p/201003020840/d/123.png
http://fadedpage.com/p/201003040537/d/123.png
http://fadedpage.com/p/201003070309/d/123.png
what we actually find are pages 120, 112, 118, 106, 82, 124!, and 118, respectively. that's quite a range of pages, but alas, none are page 123.

so any reference to a "page" number on the fadedpage.com website is gonna frustrate anyone who wants to know what _page_ was being talked about, once rfrank has gone and deleted all of those files. which is a real pity...

but alas, here i am talking about filenaming conventions again. help me! time to draw this to a close...

***

while i'm letting myself discuss the negatives without feeling any guilt, i might add that it'd be nice if rfrank shared data from his experiments. of particular interest are all the intermediate files, such as the various pages as saved by individuals, and the concatenated text file at various "checkpoints" along the way, notably before and after postprocessing... without such data, we really have no way of evaluating the experiment!

rfrank comes from a world of engineers working in private companies, where data is closely guarded, and he doesn't seem to have the attitude that is prevalent in the scientific world that data belongs to the public, and that sunshine is the best disinfectant, and open data is a positive...

in this regard, i read and liked this article:
http://flowingdata.com/2010/03/04/think-like-a-statistician-without-the-math...

i sure could learn a lot more if rfrank were open with sharing his data, and my guess is that lots of other people could learn lots more as well.

-bowerbird