jim said:
>   http://freekindlebooks.org/Dev/PNP.txt

ok, that looks pretty good, at first glance.          :+)

i'm curious if you can also restore the end-line hyphenates.
it's alright if you can't, but if you can, i'd like to see that too.

of course, depending on what you'll wanna use this for, you
might not even want the end-line hyphenates to be restored.

for instance, to re-do this book at distributed proofreaders
-- which, face it, greg, will be your "crowdsourcing" solution,
because why should anyone re-invent that particular wheel? --
you don't need the end-line hyphenates; d.p. doesn't use 'em.

but for some purposes, they will be important.

for instance, my intention would be to mount the text with
the original scan-set, for easy comparison by the public, so
for that kind of project, the end-line hyphenates are needed.
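
to make the two cases above concrete, here is a minimal sketch
of the d.p.-style direction, where a word broken across a
line-end gets rejoined on the first line. it is only an
illustration, not anybody's actual code; the function name
join_hyphenates is made up, and it assumes that a letter plus
a hyphen at the end of a line, followed by a lowercase word,
always marks a broken word. that assumption is wrong for true
compounds like "well-known", which need a word-list check.

```python
import re

def join_hyphenates(lines):
    """Rejoin words split by an end-line hyphen (hypothetical helper).

    The tail of the broken word is pulled up from the next line, so the
    line count stays the same and only the break point moves.
    """
    lines = list(lines)
    out = []
    for i, line in enumerate(lines):
        # A letter followed by a hyphen at the very end of the line;
        # a trailing "--" dash does not match.
        if i + 1 < len(lines) and re.search(r"[A-Za-z]-$", line):
            head, _, rest = lines[i + 1].lstrip().partition(" ")
            # Only join when the next line starts with a lowercase word.
            if head[:1].islower():
                line = line[:-1] + head
                lines[i + 1] = rest
        out.append(line)
    return out
```

for the scan-set comparison you would keep the input lines
as they are instead, since they already match the page.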

***

still waiting for carlo to demonstrate his output...

or document his procedure. or _anything_, really.

***

i've written programs to do this job a half-dozen times,
each one with a slightly different approach, so i believe
i have enough strategies now to do the task thoroughly,
i just need to assemble the pieces in the correct order...

it's also the case that i'll compare the archive.org text
with the p.g. text, to create a superior end-result, so
i'm willing to take some time to pre-clean them both
-- most especially the archive.org version -- meaning
that it will be simpler to write the synthesizing code...
still, i'd like as much of it as possible to be automatic.
so i'll post my output as soon as carlo posts his, but
i'm also going to continue to work on this objective...
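
as a rough idea of what comparing the two texts can look like,
here is a small sketch using python's standard difflib module.
it is not the synthesizing code described above, just a
word-level diff that lists the places where the two versions
disagree. the name word_diffs is hypothetical, and it assumes
both texts are already pre-cleaned enough that splitting on
whitespace gives comparable words.

```python
import difflib

def word_diffs(ocr_text, pg_text):
    """List (ocr_words, pg_words) pairs where the two texts disagree."""
    a = ocr_text.split()
    b = pg_text.split()
    # autojunk=False keeps very common words from being ignored
    # when the inputs get long.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

each pair is a spot where a human, or a smarter rule, has to
decide which version is right. on whole books this will be slow,
so it probably wants to run a page or a chapter at a time.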

>   Try:
>   http://www.gutenberg.org/cache/epub/28948/pg28948.txt
>   vs.
>   http://www.archive.org/details/therainbowlawren00lawrrich

great. i'll tackle that next. unless you know that it's
a significantly different version. i'm only interested in
scan-sets that are a match for the particular text-file.
if they're too dissimilar, a comparison isn't worthwhile.
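
one cheap way to check "too dissimilar" before doing any real
work is to measure how many short word-sequences the two files
share. this is only a sketch; the name shingle_overlap and the
5-word window are arbitrary choices, and the score that counts
as "a match" would have to be found by experiment.

```python
def shingle_overlap(text_a, text_b, n=5):
    """Jaccard overlap of n-word sequences, from 0.0 to 1.0."""
    def shingles(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    a, b = shingles(text_a), shingles(text_b)
    if not a or not b:
        # One of the texts is shorter than n words.
        return 0.0
    return len(a & b) / len(a | b)
```

o.c.r. errors will pull the score down even for a true match,
so a low number is a warning and not a verdict.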

-bowerbird