
but in that case, your routines should have done what d.p. requires of its proofers, which is to move the second part of the end-of-line hyphenate to the previous line. i _believe_ that in some cases, you moved the first part down... (but i could be wrong on that, so do please let me know if i am.)

Yes, my routines would have been smart enough to move the second part up if I had been smart enough to remember to run my smart dehyphenation routines prior to running the linebreak recovery routine. But since you were only doing a visual “proof of concept”, I didn’t think it would matter to you whether the half-word was moved one line up or one line down, since the line length imbalance would on average remain the same in any case.

"in theory"? you just take them from the o.c.r. at archive.org, which is what you used to restore the linebreaks anyway, right?

I don’t “just” do anything. The linebreak recovery algorithm is more-or-less a whole-document best match using Levenshtein distance, tokenized on words+whitespace, followed by cloning the whitespace information from the OCR text to the PG text. Since it is the PG text that contributes the “word” part of the information, there are no hyphens to recover, because PG requires that we remove those hyphens. If I took not only the “whitespace” part of the information from the OCR but also the “word” part, then you would be back to the raw OCR. But hey, if that’s what you want…

I will see what it would take to also recover the hyphens, which also sounds like a reasonable thing to be able to do.
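
For anyone who wants to picture the whitespace-cloning idea, here is a rough Python sketch. It is not my actual routine: it aligns the two word lists with the standard library's `difflib.SequenceMatcher` as a stand-in for a true Levenshtein alignment, and the names `split_words` and `recover_linebreaks` are made up for illustration. Where the OCR and PG words disagree (an OCR error, or a word the OCR split across a line end), this sketch simply places the break after the last PG word in the mismatched span, which happens to move the second half of a hyphenated word up. That placement rule is an assumption of the sketch.

```python
import re
from difflib import SequenceMatcher


def split_words(text):
    """Return (words, breaks); breaks[i] is True if a newline follows words[i]."""
    words, breaks = [], []
    for m in re.finditer(r"(\S+)(\s*)", text):
        words.append(m.group(1))
        breaks.append("\n" in m.group(2))
    return words, breaks


def recover_linebreaks(ocr_text, pg_text):
    """Copy line-break positions from ocr_text onto the words of pg_text.

    Hypothetical sketch: word-level alignment via difflib, not Levenshtein.
    """
    ocr_words, ocr_breaks = split_words(ocr_text)
    pg_words, _ = split_words(pg_text)
    pg_breaks = [False] * len(pg_words)

    matcher = SequenceMatcher(None, ocr_words, pg_words, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Words agree one-to-one, so clone the break flags directly.
            for k in range(i2 - i1):
                pg_breaks[j1 + k] = ocr_breaks[i1 + k]
        elif any(ocr_breaks[i1:i2]):
            # Words disagree: collapse any OCR break in this span onto the
            # last PG word of the span, or onto the preceding PG word when
            # the span has no PG words at all.
            target = j2 - 1 if j2 > j1 else j1 - 1
            if target >= 0:
                pg_breaks[target] = True

    pieces = []
    for word, brk in zip(pg_words, pg_breaks):
        pieces.append(word)
        pieces.append("\n" if brk else " ")
    return "".join(pieces).rstrip()


if __name__ == "__main__":
    ocr = "The quick brovvn fox\njumps over the hyphen-\nate dog"
    pg = "The quick brown fox jumps over the hyphenate dog"
    print(recover_linebreaks(ocr, pg))
```

Note that the words in the output come entirely from the PG text, so the OCR typo and the end-of-line hyphen both disappear; only the break positions are taken from the OCR. Recovering the hyphens would require carrying the split point over from the OCR tokens as well.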