so let's talk about my collaborative proofreading site, part 3

here's more info on my collaborative proofreading site...
***
to see what we're talking about, you can visit this u.r.l.:
***
we've talked about 4 main topics:
navigating the pages... certifying a page as clean... searching the book for a string... feel the power with the "command" field...
under the 4th topic -- the command field -- we've discussed these commands you can issue:
showmap... concat... showcustom... blubberbaby... pairsearch... end-page-hyphenates...
today we'll discuss a few more commands...
copyfootnotes... movefootnotes... show-end-line-hyphenates...
***
copyfootnotes...
some e-book formats want the footnotes collected together into their own section (a la "endnotes")... to accomplish this, enter "copyfootnotes" into the search-field, and then click the "find" button... all of the footnotes will be presented on a screen, so you can copy them en masse. this command leaves the footnotes unmolested on their pages...
***
movefootnotes...
movefootnotes is another command that does the same as "copyfootnotes", except "movefootnotes" also deletes each footnote from its original page...
i'll note that neither of these commands should be used until proofing has been completely finished. until that time, you want to leave the footnote in the one place where it can be most easily proofed, which is right there on that page, next to the scan.
for the moment, i have disabled this command... once i've programmed the "mass revert" ability, to reverse any sabotage effort, i will reinstate it.
***
show-end-line-hyphenates...
you'll probably recall that i encourage people to _retain_ original linebreaks from the paper-book, expressly including all the end-line-hyphenates... this makes it much easier to do proofing, as even distributed proofreaders and project gutenberg acknowledge when it comes to _them_ proofing. (so it's a bit disingenuous of them to rewrap their text before giving it to other people; but i digress.)
at any rate, one slight problem with this approach is that the hyphenated fragments often do _not_ pass spellcheck, and thus are unnecessarily flagged. for instance, you might have the first part of a frag- ment on the top line, and the second on the bottom, and neither "frag-" nor "ment" will pass spellcheck. this command helps you solve that little problem.
"show-end-line-hyphenates" will list all of them, as you might expect, but it does a little bit more. first, it tests if the rejoined form passes spellcheck. if so, then it gives you both fragments, so that you can include them in the book's custom dictionary...
this command also surveys the full book to see how many times the rejoined form appears in it -- with hyphen, without it, and as two words -- and informs you of the counts, which is good info.
i restored all of the end-line-hyphenates on many pages within the "sitka" book, and you can observe the output from "show-end-line-hyphenates" here:
***
while we're on the subject of end-line-hyphenates, i should briefly address one of the thorny matters...
i've always maintained that users should be able to unwrap the text themselves, any time they wanted. indeed, i've said we should give them tools to do it. even more than _that,_ i've _provided_ such a tool:
in most cases, an end-line-hyphenate is _easy_ to resolve. you eliminate the dash and then bring up the first string from the next line and concatenate it. simple enough.
the glitch happens when it was a _compound_word_ -- i.e., a word that includes a dash in it _normally._ in word-processing parlance, this is known as the difference between a "hard" and a "soft" hyphen...
so, in order to indicate to the unwrap routine that any particular dash at the end of a line is a "hard" hyphen, to be retained, we need to give it some kind of marker. i've decided -- tentatively, pending testing for problems -- this marker will be the "~" character, after the dash. you can see cases in the sitka book where this happens:
http://z-m-l.com/go/sitka/editr.pl?bpn=sitkap007
http://z-m-l.com/go/sitka/editr.pl?bpn=sitkap019
http://z-m-l.com/go/sitka/editr.pl?bpn=sitkap093
http://z-m-l.com/go/sitka/editr.pl?bpn=sitkap094
http://z-m-l.com/go/sitka/editr.pl?bpn=sitkap094
http://z-m-l.com/go/sitka/editr.pl?bpn=sitkap107
the lines from those 6 cases are listed here, respectively:
sions in America. The sails of ships from far-~ off Kronstadt on the Baltic brought Russian
during the winter the hunters took 40 sea-~ lions, and in the spring many seals were
of ancient Venice. The picturesque, dark-~ skinned Thlingit women sit at the doors of
Russian fur warehouse. Next is the three-~ story building used for courthouse and jail,
and later of the U.S. Marines from the Man-~ of-War which was stationed here. East of
sea. Eastward crest after crest of glacier-~ capped peaks rise for a hundred miles,
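A minimal sketch of how an unwrap routine could honor this marker, written in Python purely for illustration (the site's own code is not shown in this thread, and the function name `unwrap` is made up here). It assumes a line ending in "-~" carries a hard hyphen to keep, a line ending in a letter plus "-" carries a soft hyphen to drop, and every other line is joined with a space:

```python
import re

def unwrap(lines):
    """Join lines into one paragraph, resolving end-line hyphens.

    Assumed convention: "-~" at a line end marks a hard hyphen (kept),
    a letter followed by "-" at a line end is a soft hyphen (dropped).
    """
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if out.endswith("-~"):
            # Hard hyphen: drop only the "~" marker, keep the dash.
            out = out[:-1] + line
        elif re.search(r"[A-Za-z]-$", out):
            # Soft hyphen: drop the dash and rejoin the word.
            out = out[:-1] + line
        elif out:
            # Ordinary line break (including a line ending in "--").
            out += " " + line
        else:
            out = line
    return out

pair = [
    "sions in America. The sails of ships from far-~",
    "off Kronstadt on the Baltic brought Russian",
]
print(unwrap(pair))
```

With the first pair of lines listed above, the marker is dropped and "far-off" keeps its hyphen; an unmarked break such as "frag-" followed by "ment" comes out as "fragment".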
so when these are unwrapped, the words "far-off" and "sea-lion" and "dark-skinned" and "three-story" and "man-of-war" and "glacier-capped" will now be rendered as they should be -- as compound words...
***
based on my long observation, i'd say dehyphenation is one of the most _inelegant_ aspects of the d.p. system...
first of all, it causes unnecessary work for the proofers, because it's more difficult to proof when the linebreaks have been disturbed in any way. even though the effect is relatively small when it's just on end-line-hyphenates, it still cumulates. (and the dictum against "unclothed" em-dashes at line-ends adds to this cumulative effect.)
this shifting of original linebreaks causes line-lengths to become uneven, introducing a variety of problems in that some routines that _could_ be written to help process the text depend on line-lengths, and thus are sabotaged when we change the line-lengths arbitrarily.
second, dehyphenation itself is work, because proofers (who do not have access to any book-wide information) have to make a judgment about whether the hyphen is to be retained or not, which is fraught with ambiguity... this leads to diffs, which chew even more proofer time.
indeed, in the "perpetual" projects, we saw cases where one proofer would take out a hyphen, and another one would put it back with an asterisk (meaning "check it"). and then the third proofer would take out the asterisk!
and of course, if a proofer makes a bad decision, that pollutes the text, which can lead to more bad decisions.
decisions on all end-line-hyphenates should be made during preprocessing. then if the proofers challenge any of the decisions, the postprocessor can decide that. that's the only sensible workflow.
and this "show-end-line-hyphenates" command shows that it is indeed possible to handle end-line-hyphenates in a manner that is simple, yet adequately sophisticated.
***
so those are our 3 new commands for the weekend... more later...
-bowerbird
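The checks that "show-end-line-hyphenates" is described as making (rejoin the fragments, test the rejoined form against a spellchecker, then count the closed, hyphenated and two-word forms across the book) can be sketched roughly as follows. This is an illustrative Python sketch and not the site's code: the function names, the `Report` fields, and the plain set standing in for a spellchecker are all assumptions.

```python
import re
from collections import namedtuple

# Hypothetical record for one end-line hyphenate (field names are illustrative).
Report = namedtuple(
    "Report", "first second in_dictionary closed hyphenated two_words"
)

def count_form(form, text):
    """Count whole-word occurrences of `form` in lowercased `text`."""
    pattern = r"(?<![a-z-])" + re.escape(form) + r"(?![a-z-])"
    return len(re.findall(pattern, text))

def end_line_hyphenates(lines, dictionary):
    """Survey lines ending in "-" (or "-~") and report on each split word.

    `dictionary` is a plain set of lowercase words standing in for a real
    spellchecker, which this sketch does not attempt to model.
    """
    # Joining with spaces keeps each line-end split ("frag- ment") from
    # being counted as an occurrence of any of the three rejoined forms.
    text = " ".join(lines).lower()
    reports = []
    for top, bottom in zip(lines, lines[1:]):
        head = re.search(r"([A-Za-z]+)-~?$", top.rstrip())
        tail = re.match(r"\s*([A-Za-z]+)", bottom)
        if not (head and tail):
            continue
        first, second = head.group(1).lower(), tail.group(1).lower()
        reports.append(Report(
            first, second,
            (first + second) in dictionary,
            count_form(first + second, text),
            count_form(first + "-" + second, text),
            count_form(first + " " + second, text),
        ))
    return reports
```

Each report carries the two fragments (candidates for the custom dictionary when `in_dictionary` is true) together with the three book-wide counts, which is the kind of evidence a preprocessor would want before deciding whether a given end-line hyphen is hard or soft.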

Agreed that mindlessly changing hyphens to check-hyphens is one way some P3 automatons introduce more damage than they're worth. There is a general problem in DP that proofers introduce a check-hyphen when they mean "I really don't like the fact that the original book had a hyphen there." Well, too bad. If the original book had a hyphen there then the two options are:
1) Join with hyphen, or
2) Join without hyphen.
"Throw the hyphen away because I do not like it" is not an option. Typical example is something like:
..school- teacher.
where the two plausible answers could be:
.school-teacher.
or
.schoolteacher.
and of course the proofer automaton changes this to
.school-*teacher.
meaning "gee I wish the author had written this as:"
.school teacher.
which of course is not an option.

On Fri, Apr 16, 2010 at 2:36 PM, James Adcock <jimad@msn.com> wrote:
There is a general problem in DP that proofers introduce a check-hyphen when they mean “I really don’t like the fact that the original book had a hyphen there.”
And you know this HOW?
What the asterisk means is, "I don't know how the author usually spells this, whether closed (schoolteacher) or hyphenated (school-teacher). This hyphen comes at the end of a line, so I don't know whether to drop it or keep it. I'll put an asterisk there, so the PPer can check the usage in the rest of the text, to see what spelling the author usually uses, hyphenated or closed."
That's ALL it means. Uncertainty about the author's preferred spelling.
I know, as a professional copyeditor, that the open/hyphenated/closed continuum is extremely mutable, that words have changed over time (to-day becomes today), and that at any one time, different authors and different publishing houses may make different choices. (Copyeditor or copy-editor is just one example; you'll find it both ways.)
Sometimes newbie proofers over-asterisk. The same word may occur on the same page with the author's preferred spelling prominently on display. But the newbie is afraid of making a judgment call and asterisks anyway. No big deal. Better to be too careful than to drop the hyphen and rejoin words that should be hyphenated rather than closed up.
At times this list seems to function just as Encyclopedia Dramatica does for Wikipedia; all the malcontents gather and mutter about THEM over THERE doing it WRONG and THEY didn't listen to ME.
-- Karen Lofstrom

And you know this HOW?
Because
1) I have seen some P3s change EVERY hyphen to a check-hyphen.
2) As a PP I have attempted to "fix" check-hyphens and to do so one has to try to understand what it was that the P3 was complaining about. I've emailed some and said "what were you thinking?" and they say "oops, you're right, I was basically thinking that I wished the hyphen wasn't there."
Lots of people put a check-hyphen in there when they are feeling "uncomfortable." Feeling "uncomfortable" isn't *sufficient* reason to put a check-hyphen in, because if you do so then you make the PP uncomfortable too, and the PP has no recourse except to:
a) ignore the check-hyphen,
b) waste copious amounts of time trying to double check the hyphen against the author's published corpus, or
c) write the proofer an email and hope some day they will respond honestly and tell you what they were thinking, if they were thinking, when they entered the check-hyphen.
Many authors put hyphens in places which today feel uncomfortable to our modern tastes. That is not enough reason to insert a check-hyphen. A check-hyphen is basically a punt to the PP -- who is no better placed to resolve the issue.

On Fri, Apr 16, 2010 at 3:12 PM, James Adcock <jimad@msn.com> wrote:
Because 1) I have seen some P3s change EVERY hyphen to a check-hyphen. 2) As a PP I have attempted to "fix" check-hyphens and to do so one has to try to understand what it was that the P3 was complaining about. I've emailed some and said "what were you thinking?" and they say "oops, you're right, I was basically thinking that I wished the hyphen wasn't there."
Those are folks who don't understand the rules. You ran into some bad P3ers; that doesn't mean that all of us in P3 are like that.
You could have *corrected* the misconceptions, rather than deciding that we're all idiots.
-- Karen Lofstrom, not an idiot (at least in THIS area)

It's also possible that the P3er was responding appropriately by responding mindlessly to a process that encourages and rewards mindlessness. (Saying that more diplomatically, a system that encourages rote memorization and application of universal rules rather than thoughtful consideration of the text in the light of the available context.)
What's ironic is that the second easiest way to handle it is to let the postprocessor (or is that post-processor? let's say post-*processor. See what I mean?) use the available tools to simply list all the cases where hyphenated and dehyphenated versions of the same word appear in the text, check a page image, see which was actually used (I bet it's the most frequent), and fix 'em all at a stroke.
The first easiest way is to do this before posting the project in the first place. Then let the instructions say those dreaded DP words: "It doesn't matter", reducing the cognitive distinctions and requirements between new proofers and old proofers.
Somehow this concept is always a non-starter, unfortunately, especially among the old proofers who get to write the rules.
_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d

On Fri, Apr 16, 2010 at 5:26 PM, don kretz <dakretz@gmail.com> wrote:
The first easiest way is to do this before posting the project in the first place.
That sounds like a good idea. Why don't we add it as a step in the preparation process? -- Karen Lofstrom

Ummm ... I think it is. It's a standard feature of guiprep. It's *almost* a required activity for the Content Provider. But it's one that, per the guidelines, the proofer (who doesn't know this has probably happened) will nullify.
Here is the applicable Proofing Guideline:
Words like to-day and to-morrow that we don't commonly hyphenate now were often hyphenated in the old books we are working on. Leave them hyphenated the way the author did. If you're not sure if the author hyphenated it or not, leave the hyphen, put an * after it, and join the word together like this: to-*day. The asterisk will bring it to the attention of the post-processor, who has access to all the pages and can determine how the author typically wrote this word.
Now an only mildly conservative reading of that suggests that just about any word that could possibly be hyphenated should be "-*"ed unless there's another example showing the "right way" on the very same page. There's certainly no moderating language encouraging the proofer to do anything else. Especially considering the possible calumny if they should do the wrong thing. And it says right there to leave it for the PPer if it's not obvious to you in the context of the one page available to you at the time.
In fact, if the CPer has done what most CPers do, and left provably hyphenated words hyphenated and closed up the rest, the Guideline actually would lead the proofer to undo it all.
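The post-processor's side of that guideline (deciding each "-*" from how the author writes the word elsewhere in the book) can be sketched as below. This is a hypothetical Python illustration, not a DP or guiprep tool; the function name and the simple majority rule are assumptions, and a tie or a word with no other occurrence is left marked for a human to check against the page image.

```python
import re
from collections import Counter

# A check-hyphen as described in the guideline, e.g. "to-*day".
CHECK = re.compile(
    r"\b([A-Za-z]+(?:-[A-Za-z]+)*)-\*([A-Za-z]+(?:-[A-Za-z]+)*)\b"
)
# A word, optionally with internal hyphens, e.g. "to-day" or "today".
WORD = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*")

def resolve_check_hyphens(text):
    """Replace each "-*" with the form the rest of the text favors.

    Returns the new text and a list of check-hyphens left unresolved.
    """
    usage = Counter(w.lower() for w in WORD.findall(text))
    unresolved = []

    def decide(match):
        first, second = match.group(1), match.group(2)
        hyphenated = usage[(first + "-" + second).lower()]
        closed = usage[(first + second).lower()]
        if hyphenated > closed:
            return first + "-" + second
        if closed > hyphenated:
            return first + second
        # Tie or no evidence: keep the marker for manual checking.
        unresolved.append(match.group(0))
        return match.group(0)

    return CHECK.sub(decide, text), unresolved
```

Run over a whole book rather than a single page, a pass like this settles the cases where the author's usage is plain and leaves only the ambiguous ones for someone to look at.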

On Fri, Apr 16, 2010 at 6:03 PM, don kretz <dakretz@gmail.com> wrote:
Here is the applicable Proofing Guideline:
Words like to-day and to-morrow that we don't commonly hyphenate now were often hyphenated in the old books we are working on. Leave them hyphenated the way the author did. If you're not sure if the author hyphenated it or not, leave the hyphen, put an * after it, and join the word together like this: to-*day.
Ah, badly-written guideline. What it doesn't spell out is that there's ambiguity ONLY when words are hyphenated at the end of a line. I can see how someone would misread that guideline and add asterisks before every dang hyphen. The more so if the proofer weren't familiar with 18th and 19th century spellings and lacked any sense of how spellings might have changed.
I have been proofing for nearly seven years now, so I suppose some things seem clear to me that might be opaque to a less-experienced proofer.
You're also assuming that the proofer is doing only one page at a time. Many of us P3ers tend to do many pages in the same book, so begin to have some sense of what spellings the author uses.
It might make sense for the project comments to include a list of words that the au hyphenates that might be problematic. A note to the effect that au uses to-day and to-morrow might alleviate some anxiety and asterisks.
-- Karen Lofstrom

Ah, badly-written guideline. What it doesn't spell out is that there's ambiguity ONLY when words are hyphenated at the end of a line. I can see how someone would misread that guideline and add asterisks before every dang hyphen. The more so if the proofer weren't familiar with 18th and 19th century spellings and lacked any sense of how spellings might have changed.
I have been proofing for nearly seven years now, so I suppose some things seem clear to me that might be opaque to a less-experienced proofer.
You're also assuming that the proofer is doing only one page at a time. Many of us P3ers tend to do many pages in the same book, so begin to have some sense of what spellings the author uses.
It might make sense for the project comments to include a list of words that the au hyphenates that might be problematic. A note to the effect that au uses to-day and to-morrow might alleviate some anxiety and asterisks.
--
Yup. Or you could say that all the hyphenated words have already been checked once, they will all be checked again in post-processing, and it doesn't matter. :)

Why is it that people will have a long thread about hyphenation, yet not edit the darned screwed-up subject line to be clear and readable? I'm sorry that our pglaf spam filter tags some stuff as spam, but it doesn't mean we need to carry the tag forever!
Lovingly, Greg

It might make sense for the project comments to include a list of words that the au hyphenates that might be problematic. A note to the effect that au uses to-day and to-morrow might alleviate some anxiety and asterisks.
I have tried leaving project comments and the P1s and the P2s tend to read and follow the project comments whereas the P3s ignore them and undo the good work of the P1s and P2s.

On Sat, 17 Apr 2010, Jim Adcock wrote:
It might make sense for the project comments to include a list of words that the au hyphenates that might be problematic. A note to the effect that au uses to-day and to-morrow might alleviate some anxiety and asterisks.
I have tried leaving project comments and the P1s and the P2s tend to read and follow the project comments whereas the P3s ignore them and undo the good work of the P1s and P2s.
What I found was that there's a contingent in all rounds that seems to have never read the project comments. I don't know if they never read them, or just forgot them in the throes of proofing.
-- Greg Weeks http://durendal.org:8080/greg/

On Sat, 17 Apr 2010, Jim Adcock wrote:
It might make sense for the project comments to include a list of words that the au hyphenates that might be problematic. A note to the effect that au uses to-day and to-morrow might alleviate some anxiety and asterisks.
I have tried leaving project comments and the P1s and the P2s tend to read and follow the project comments whereas the P3s ignore them and undo the good work of the P1s and P2s.
Somehow in the context of the handful of messages Jim Adcock sent, and even in the context of this message, this does not seem to be a dim view. . .with two plusses and one minus. Of course, it would all work out better if the minus came first--

You could have *corrected* the misconceptions, rather than deciding that we're all idiots.
I did correct the misconceptions and I did not decide that "we" are all idiots. I have stated repeatedly that I have found extremely competent and dedicated volunteers at all levels of DP -- and the converse.

On Sat, 17 Apr 2010, Jim Adcock wrote:
You could have *corrected* the misconceptions, rather than deciding that we're all idiots. [Talk about a "Dim View". . . .]
I did correct the misconceptions and I did not decide that "we" are all idiots. I have stated repeatedly that I have found extremely competent and dedicated volunteers at all levels of DP -- and the converse.
participants (8)
- Bowerbird@aol.com
- don kretz
- Greg Newby
- Greg Weeks
- James Adcock
- Jim Adcock
- Karen Lofstrom
- Michael S. Hart