
On Sat, Dec 24, 2011 at 7:43 PM, Roger Frank <rfrank@rfrank.net> wrote:
By the way, if anyone ever wants to understand the analysis code, it starts in HTML, which onclicks to JavaScript, which ajaxs to PHP, which popens to Python, which is my language of choice where RE's are involved. That made it easy to reverse-engineer and bake-in the RE's that you probably used. (Don K, stop frowning!)
It sounds legit to me. But in the past the bird has expressed disdain for regexes for his own use, so it's probably more string search-and-replace. Whatever he's comfortable with.

In the Twister program I use for preparing new projects, I have a display with the text and page synchronized, and a set of external files with lists of regexes. The columns are search pattern, replacement pattern, flag-set, and descriptive text. I run one pass to get counts for all of them, and then work through what is found. Some are safe for global replacement ("ist" is almost always "1st" unless it's quoting German, "nth" is "11th" unless there is math, etc.). I figure if executing the default replacement improves the text 90% of the time, it's probably worth doing. I think I can correct over half the initial errors before the first proofer sees it, but not much better than that.

Double-quotes are an interesting case. The WordPress editor automatically converts them to curly quotes without even asking, and with great accuracy. It even does a great job on single-quotes. I need to look at the algorithms when I get some time.

Here's one I'll solicit community improvement suggestions for. It's the regex I use to find problematic quotes. It does a pretty reliable job of finding errors, and the logic to repair them is also reliable. But its ability to detect paragraph and other boundaries, and leading continuing quotes, could be better.

    /([^\s\(-]?)"(\s*)([^\\]*?(\\.[^\\]*)*)(\s*)("|^\r)([^\s\)\.\,\?;:-]?)/gim

This is in (my one experiment with) ActionScript, which is a pretty strict superset of JavaScript, including regex functionality.
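
For anyone who wants to try the rule-list pass (count everything first, then review), here is a minimal sketch in Python. It is not the Twister code. The names Rule, count_hits and apply_rule are made up for illustration, the two rules are just the "ist"/"nth" examples with word boundaries added as an assumption, and the flag letters are assumed to follow the JavaScript convention.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    # Hypothetical row layout mirroring the four columns described:
    # search pattern, replacement pattern, flag-set, descriptive text.
    pattern: str
    replacement: str
    flags: str
    description: str


# Assumed mapping from JavaScript-style flag letters to Python flags.
# "g" has no Python equivalent: finditer and sub are global by default.
FLAG_MAP = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL}


def compile_rule(rule):
    flags = 0
    for letter in rule.flags:
        flags |= FLAG_MAP.get(letter, 0)
    return re.compile(rule.pattern, flags)


def count_hits(text, rules):
    """First pass: count matches for every rule without changing the text."""
    return [(rule, sum(1 for _ in compile_rule(rule).finditer(text)))
            for rule in rules]


def apply_rule(text, rule):
    """Second pass: run the default replacement for one reviewed rule."""
    return compile_rule(rule).sub(rule.replacement, text)


# Illustrative rules only; the \b word boundaries are an assumption.
RULES = [
    Rule(r"\bist\b", "1st", "g", "OCR misread of 1st"),
    Rule(r"\bnth\b", "11th", "g", "OCR misread of 11th"),
]

sample = "On the ist of May and the nth of June, the ist train left."

for rule, hits in count_hits(sample, RULES):
    print(hits, rule.description)

fixed = sample
for rule in RULES:
    fixed = apply_rule(fixed, rule)
print(fixed)
```

Keeping the count pass separate from the replace pass means a rule with a suspiciously high count can be reviewed hit by hit instead of being run globally.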