
"Bowerbird" == Bowerbird <Bowerbird@aol.com> writes:
Bowerbird> it's a discussion over at distributed proofreaders
Bowerbird> about repurposing digitizations found elsewhere on the
Bowerbird> web into the d.p. workflow, jumpstarting the proofing
Bowerbird> process with a text that has already received a good
Bowerbird> amount of proofing. the catch? the other
Bowerbird> digitizations have linebreaks removed, making proofing
Bowerbird> more difficult for d.p. people...

Too easy to solve: OCR the images, preserving line breaks, add to
every end-of-line a character not otherwise appearing much, e.g. @,
run wdiff between the two versions, replace [-@-] with a linebreak,
remove the other differences with a regexp.

You might miss some linebreaks, if the OCR is very bad. But a better
regexp might help in this case.

Carlo
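
For anyone who wants to try the idea without wdiff, here is a rough sketch of the same steps in Python. It swaps in the standard library's difflib.SequenceMatcher for wdiff and walks the diff opcodes instead of applying a regexp, so it is an approximation of the recipe above rather than the recipe itself. The function name restore_linebreaks and the sample strings are made up for illustration, and it assumes "@" never occurs as a standalone word in either text.

```python
import difflib

# End-of-line marker. Assumption: "@" never appears as a standalone
# word in either the OCR text or the proofed text.
MARK = "@"


def restore_linebreaks(ocr_text, proofed_text):
    """Copy line breaks from a raw OCR text onto an already proofed text.

    Hypothetical helper: the OCR words are used only for alignment,
    the output always keeps the proofed words.
    """
    # Tokenize the OCR text and append the marker after every line.
    ocr_tokens = []
    for line in ocr_text.splitlines():
        ocr_tokens.extend(line.split())
        ocr_tokens.append(MARK)
    proofed_tokens = proofed_text.split()

    # Word-level diff, standing in for wdiff.
    matcher = difflib.SequenceMatcher(
        None, ocr_tokens, proofed_tokens, autojunk=False
    )

    lines, current = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Keep whatever the proofed text has for this stretch.
        current.extend(proofed_tokens[j1:j2])
        # A marker that exists only on the OCR side is a line break.
        # Inside a "replace" the break is placed after the proofed
        # words, which is only approximate when the OCR is bad.
        if tag in ("delete", "replace"):
            for _ in range(ocr_tokens[i1:i2].count(MARK)):
                lines.append(" ".join(current))
                current = []
    if current:
        lines.append(" ".join(current))
    return "\n".join(lines)


if __name__ == "__main__":
    ocr = "The quick brown\nfox jumps ovcr\nthe lazy dog"
    proofed = "The quick brown fox jumps over the lazy dog"
    print(restore_linebreaks(ocr, proofed))
```

As with the wdiff version, a break can land a word or two off where an OCR error sits right next to the end of a line, so the result still wants a quick look against the page images.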