focus, kids, focus.
think about the objective here.
so... why would we want to rewrap
a project gutenberg text to an existing scan-set?
well, for two main reasons:
1. to re-proof and correct the text with the scan-set.
2. to use the scan-set as the provenance for the text.
the good news is that we only have to do this rewrap
_once_ for a book, and we can assume that we'll have
volunteers with a reasonable level of skill for the task,
so it doesn't have to be idiot-proof or fully automatic.
the bad news is that we need to do it for 20,000 books,
so the job _can't_ require _too_ much time or energy...
moreover, in order to be used in a proofing interface,
the output of the rewrap needs to be in a certain form.
but the parameters are wide open; as long as you can
make your output work in your proofing system, fine.
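
to make the idea concrete, here is a minimal sketch of
one way a rewrap could work. it is not carlo's method
or anybody's actual tool, just an illustration in python.
it assumes the scan-set has already been o.c.r.'d into
lines, and the names (rewrap, scan_lines, text_words)
are hypothetical. it aligns the two word-streams with
difflib and then re-breaks the text where the scan breaks.

```python
from difflib import SequenceMatcher


def rewrap(text_words, scan_lines):
    """re-break text_words so its linebreaks follow scan_lines."""
    scan_words = [w for line in scan_lines for w in line.split()]
    matcher = SequenceMatcher(None, scan_words, text_words, autojunk=False)

    # pos[k] = number of text words that come before scan word k
    pos = [0] * (len(scan_words) + 1)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        for k in range(i1, i2 + 1):
            pos[k] = min(j1 + (k - i1), j2)
    # make sure no text words fall off either end
    pos[-1] = len(text_words)
    pos[0] = 0

    out, start = [], 0
    for line in scan_lines:
        end = start + len(line.split())
        out.append(" ".join(text_words[pos[start]:pos[end]]))
        start = end
    return out


# made-up sample data, for illustration only
scan_lines = ["it was the best of", "times, it was the", "worst of times"]
text_words = "it was the best of times; it was the worst of times".split()

for line in rewrap(text_words, scan_lines):
    print(line)
```

the words always come from the text-file, never from the
o.c.r., so the output is the old text with new linebreaks,
one line per scan line, which is the kind of shape that a
line-by-line proofing interface could put beside the scan.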
the perspectives offered so far, by jim and carlo, are
quite instructive, in terms of striking the right balance.
jim commented that the two versions are very similar,
while carlo noted that there are many small differences.
to my mind, this is exactly how to walk this tightrope...
while i would prefer a scan-set from the exact same edition
as the paper-book that was used to create the text-file,
we can never be sure with the project gutenberg files.
so if the editions are "reasonably close", that'll be fine.
indeed, if they're close enough, but clearly different,
then we've gained additional information concerning
edits that were made from one edition to the other...
(scholars, in particular, go nuts for that type of stuff.)
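
as a rough illustration of how those edition-to-edition
edits might be collected, here is a small python sketch.
it assumes both versions are already split into words,
and the function name edition_differences is made up
for this example; it simply lists the spans where the
two word-streams disagree, according to difflib.

```python
from difflib import SequenceMatcher


def edition_differences(text_words, scan_words):
    """list the spans where the text-file and the scan-set disagree."""
    matcher = SequenceMatcher(None, text_words, scan_words, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # keep the kind of change plus the wording on each side
            diffs.append((tag,
                          " ".join(text_words[i1:i2]),
                          " ".join(scan_words[j1:j2])))
    return diffs


# made-up sample data, for illustration only
old = "it was the best of times;".split()
new = "it was the best of times,".split()
print(edition_differences(old, new))
```

a real run would also turn up plain o.c.r. errors, so a
volunteer would still need to look at each difference.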
and even as we "correct" the text version into a file
which reflects the scan-set -- i.e., _change_it_ from
one edition into the other edition -- we'll still keep
the original text around, to be used as soon as the
scan-set from its actual parent-edition shows up...
(this assumes that it was not some mutant hybrid.)
the main takeaway here is that the output from the
rewrap must be amenable to a proofing interface...
and the process can require _some_ expertise, but
it can't demand too much time or energy.
so, i'll see if i can work through carlo's instructions,
and i'll prepare my output for posting, and then we
can discuss this some more, probably next week...
-bowerbird