On Sat, Dec 24, 2011 at 7:43 PM, Roger Frank <rfrank@rfrank.net> wrote:

By the way, if anyone ever wants to understand the
analysis code, it starts in HTML, which onclicks to
JavaScript, which ajaxs to PHP, which popens to Python,
which is my language of choice where RE's are involved.
That made it easy to reverse-engineer and bake-in the RE's
that you probably used. (Don K, stop frowning!)

It sounds legit to me. But in the past the bird has expressed
disdain for regexes for his own use, so it's probably more
string-search-and-replace. Whatever he's comfortable with.

In the Twister program I use for preparing new projects, I have a
display with the text and page synchronized, and a set of
external files with lists of regexes - columns are search pattern,
replacement pattern, flag-set, and descriptive text. I run one
pass to get counts for all of them, and then work through what
is found. Some are safe for global replacement ("ist" is almost
always "1st" unless it's quoting German, "nth" is "11th" unless
there is math, etc.). I figure if executing the default replacement
improves the text 90% of the time, it's probably worth doing. I think
I can correct over half the initial errors before the first proofer
sees it, but not much better than that.
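
A minimal JavaScript sketch of that count-first, replace-later
workflow. The tab-separated layout, the two sample rules, and the
names parseRules, countHits, and applyRule are assumptions made up
for illustration; this is not Twister's actual file format or code.

```javascript
// Hypothetical rule table. Columns follow the description above:
// search pattern, replacement pattern, flag-set, descriptive text.
const RULES_TSV = [
  '\\bist\\b\t1st\tg\t"ist" is usually a misread "1st"',
  '\\bnth\\b\t11th\tg\t"nth" is usually a misread "11th"',
].join("\n");

// Turn each non-empty line into a rule object.
function parseRules(tsv) {
  return tsv
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => {
      const [pattern, replacement, flags, description] = line.split("\t");
      return { pattern, replacement, flags, description };
    });
}

// Build a RegExp for a rule, forcing the global flag so every hit counts.
function compile(rule) {
  const flags = rule.flags.includes("g") ? rule.flags : rule.flags + "g";
  return new RegExp(rule.pattern, flags);
}

// First pass: count hits per rule without touching the text.
function countHits(text, rules) {
  return rules.map((rule) => {
    const hits = text.match(compile(rule));
    return { ...rule, count: hits ? hits.length : 0 };
  });
}

// Later pass: apply one rule globally once it has been judged safe.
function applyRule(text, rule) {
  return text.replace(compile(rule), rule.replacement);
}

const sample = "The ist of May and the nth of June; the ist edition.";
for (const r of countHits(sample, parseRules(RULES_TSV))) {
  console.log(`${r.count}\t${r.description}`);
}
```

Keeping the count pass separate from the replace pass means a rule
with a suspicious hit count can be stepped through by hand instead
of being applied blindly.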

Double quotes are an interesting case. The WordPress editor
automatically converts them to curly quotes without even asking,
and with great accuracy. It even does a great job on single
quotes. I need to look at the algorithms when I get some time.

Here's one I'll solicit community improvement suggestions for.
It's the regex I use to find problematic quotes. It does a pretty
reliable job finding errors, and the logic to repair them is
also reliable. But its ability to detect paragraph and other
boundaries, and leading continuing quotes, could be
better.

/([^\s\(-]?)"(\s*)([^\\]*?(\\.[^\\]*)*)(\s*)("|^\r)([^\s\)\.\,\?;:-]?)/gim

This is in (my one experiment with) ActionScript, which is
a pretty strict superset of JavaScript, including regex
functionality.
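
For anyone who wants to poke at it, here is a small JavaScript
harness around the pattern exactly as posted. The function name
findQuoteProblems and the issue labels are assumptions: they are
one reading of what each capture group is meant to flag, not the
repair logic mentioned above, which is not shown here.

```javascript
// The pattern as posted, unchanged.
const QUOTE_RE =
  /([^\s\(-]?)"(\s*)([^\\]*?(\\.[^\\]*)*)(\s*)("|^\r)([^\s\)\.\,\?;:-]?)/gim;

// Report quoted spans whose boundaries look wrong.
// The labels are a guess at the intent of each capture group.
function findQuoteProblems(text) {
  const problems = [];
  for (const m of text.matchAll(QUOTE_RE)) {
    // Group 4 (the escaped-character tail) is skipped on purpose.
    const [, before, lead, body, , trail, close, after] = m;
    const issues = [];
    if (before) issues.push("no space before opening quote");
    if (lead) issues.push("space after opening quote");
    if (trail && close === '"') issues.push("space before closing quote");
    if (close !== '"') issues.push("no closing quote before paragraph break");
    if (after) issues.push("no space after closing quote");
    if (issues.length > 0) problems.push({ index: m.index, body, issues });
  }
  return problems;
}

console.log(findQuoteProblems('He said,"hello " and left.'));
```

On that sample the harness reports a single span, "hello", with
two issues: the comma jammed against the opening quote and the
space before the closing quote. Note that the ^\r alternative only
fires on a carriage return at the start of a line, so text with
bare \n line endings would never trigger the paragraph-boundary
branch.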