gardner said:
> That's kind of my experience, I guess.
> Several fixes will suggest themselves,
> in the context of a given specific text.
> The next one might need different fixes.
right. a rule i always follow: when i find an error,
i search the rest of the text for other occurrences.
> But that doesn't mean a long list of fixups
> can't be tried when there's no cost
> to just adding tests/fixes to the list.
well, except for what i mentioned about false alarms.
it's a rare test that doesn't turn up any false alarms,
but when one turns up too many of them, it becomes
a liability instead of an asset. the question is always,
"how many false alarms is too many?", and the next
question is always, "how can i weed out false alarms?"
> Better stick to Windoze, if it's a GUI.
actually, they are both generated from the same code,
so they should act identically. whether they really do...
> Text in one file -- check.
> I favour marking page boundaries with "===00123"
> these days, but a global search/replace can fix that.
my app is looking for separator lines that look like this:
{{myantp123.png}} || the_runhead ||
(note that there is a space in the first column.)
that .png filename there is the name of the page-scan,
and the program assumes you name your files wisely...
so, for instance, if you want to jump to a certain page,
you simply type the page number and press enter, and
the program automatically jumps to that page. nifty...
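to make that concrete, here's a rough python sketch of how
a script could find those separator lines and build a page
index for the jump-to-page trick... the filename pattern and
the "myant.zml" filename are just assumptions drawn from the
example above, not what my app actually does internally.

    import re

    # separator lines look like:  {{myantp123.png}} || the_runhead ||
    # (with a space in the first column.) the digits just before
    # ".png" are assumed to be the page number, which only works
    # if the scans were named wisely in the first place.
    SEPARATOR = re.compile(r'^ \{\{(\w+?p(\d+)\.png)\}\} \|\| (.*?) \|\|')

    def build_page_index(path):
        """map page number -> (line number, scan filename, runhead)."""
        index = {}
        with open(path, encoding='utf-8') as f:
            for lineno, line in enumerate(f, start=1):
                m = SEPARATOR.match(line)
                if m:
                    index[int(m.group(2))] = (lineno, m.group(1), m.group(3).strip())
        return index

    if __name__ == '__main__':
        pages = build_page_index('myant.zml')   # hypothetical filename
        lineno, scan, runhead = pages[123]
        print(f'page 123 starts at line {lineno}, scan {scan}, runhead "{runhead}"')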
> Yes. Looking at that. I am not 100% sure
> I want to mess with Twister exactly, but
> the list of regular expressions looks interesting.
> I'm picturing building a perl script that
> applies all of these fixes, then creates a patch set
> based on the differences it has introduced.
> I could then edit the patch set as a file,
> nuking changes that are wrong, and
> finally apply the patches for the changes I like.
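(just so we are picturing the same thing, here is the skeleton
of that workflow, sketched in python rather than perl... the two
regexes are placeholders standing in for twister's list, and the
filenames are made up.)

    import re
    import subprocess

    # placeholder (pattern, replacement) pairs, not the real list.
    FIXES = [
        (re.compile(r'\bteh\b'), 'the'),
        (re.compile(r' +,'), ','),
    ]

    def make_patch(src='book.txt', fixed='book.fixed.txt', patch='book.patch'):
        with open(src, encoding='utf-8') as f:
            text = f.read()
        for pattern, replacement in FIXES:
            text = pattern.sub(replacement, text)
        with open(fixed, 'w', encoding='utf-8') as f:
            f.write(text)
        # write a unified diff to hand-edit; apply the surviving
        # hunks later with:  patch book.txt < book.patch
        with open(patch, 'w', encoding='utf-8') as f:
            subprocess.run(['diff', '-u', src, fixed], stdout=f)

    if __name__ == '__main__':
        make_patch()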
i would be very surprised if you can make that workflow
more efficient than simply editing text in the interface...
the beauty of my app, and twister too, is that you can
view the page-scan to help you make the edit decision.
i'm well aware that you don't _need_ to view the scan
in order to resolve the vast majority of questions, but
handling that thin minority becomes hugely inefficient
if the bureaucracy of viewing the scan is too convoluted.
> Jeebies and gutcheck reference specific line numbers.
try twister. seriously. the ability to jump right to the page
where the question occurs, and view the scan in context,
is a major boost to efficiency. i bet you will be surprised...
> I find it takes a good couple of passes before
> I am satisfied I have all the genuine hits covered.
> Invariably the WW finds things I've missed anyhow.
that's a sign of an inefficient workflow.
you want to accomplish things in one pass,
and you want to make sure you got all of it.
> Got those. Like I say -- I will turn it into
> a perl script and see where that takes me.
a perl script is operating blind. get a seeing-eye dog.
> Lots of choices there.
> http://www.canadiana.org/ECO/ItemRecord/48293?id=16c79d4f15394e51
> http://www.archive.org/details/advocateanovel00heavgoog
> http://books.google.com/books?id=ot4OAAAAYAAJ&oe=UTF-8
none of those options are all that useful to me, however...
what i need is to have the individual scans available online,
each of them individually addressable at its own url.
for instance, that "myantp123.png" file i referenced above,
the one that reflects the scan of page 123 in "my antonia"?
you can find that right here, in sequence with all the rest:
> http://z-m-l.com/go/myant/myantp123.png
this is the way the library of the future will be organized...
if you want your work in it, mount your files appropriately.
yes, i can download the .zip file of the scans from archive.org,
or pull 'em from the google .pdf, and then mount them myself,
but that's too much work for me to do, when you could have
mounted them correctly in the first place.
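the whole addressing scheme fits in a few lines of python...
the zero-padding width is an assumption taken from the
"myantp123.png" example, but the point is that every scan's
address is completely predictable:

    def scan_url(book: str, page: int, width: int = 3) -> str:
        """predictable address for one page-scan of one book."""
        return f'http://z-m-l.com/go/{book}/{book}p{page:0{width}d}.png'

    print(scan_url('myant', 123))  # http://z-m-l.com/go/myant/myantp123.png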
> There are no page numbers in the Gutenberg text though.
then you threw away some very crucial information, didn't you?
probably rewrapped the text too, am i right? and dehyphenated?
all these actions make any kind of reproofing an impossible task.
which is not to say that your proofing work was a waste of time.
no, in such situations, i'll download the o.c.r. from archive.org,
which _does_ still contain pagebreak info, and unwrapped text,
and end-line hyphenates. and then i will use your proofed text
to make the corrections to the archive.org o.c.r. and then i will
throw your text away, and keep the corrected, unwrapped and
page-marked text with the original end-line hyphenates in it...
and when i throw away your text, i throw away your credit-line.
had you kept all that valuable information which i need to have,
instead of tossing it out, i probably would keep your credit-line.
you know, just so you know...
-bowerbird
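p.s. for anyone curious what that comparison looks like, here is
a rough python sketch... it flattens both files to a stream of
words, naively rejoining end-line hyphenates and skipping my
separator lines, then reports where the proofed text disagrees
with the o.c.r., so the corrections can be carried back into the
page-marked file by hand. the filenames are hypothetical, and the
hyphen-rejoining is good enough for a diff, not for final output.

    import re
    import difflib

    def word_stream(path):
        """flatten a text to a list of words for comparison."""
        with open(path, encoding='utf-8') as f:
            text = f.read()
        # drop separator lines like  {{myantp123.png}} || runhead ||
        text = re.sub(r'^\s*\{\{.*$', '', text, flags=re.MULTILINE)
        # naively rejoin end-line hyphenates (fine for diffing,
        # not for producing the final text)
        text = re.sub(r'-\n(\w)', r'\1', text)
        return text.split()

    def report_differences(ocr_path, proofed_path):
        ocr, proofed = word_stream(ocr_path), word_stream(proofed_path)
        matcher = difflib.SequenceMatcher(None, ocr, proofed, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != 'equal':
                print(' '.join(ocr[i1:i2]), '->', ' '.join(proofed[j1:j2]))

    if __name__ == '__main__':
        report_differences('myant-ocr.txt', 'myant-proofed.txt')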