Re: Typesetting ("gods and fighting men")

jim said:
my understanding was that you were trying to work from the PG txt files – which do not retain the original hyphenations.
right. so those end-of-line hyphenates need to be restored.
Recovering the original hyphenations should in theory be possible too, but that is not work I have looked at yet.
"in theory"? you just take them from the o.c.r. at archive.org, which is what you used to restore the linebreaks anyway, right?
The linebreak recovery algorithm I worked on was intended to allow people at DP, for example, if they want to, to resubmit some of the early PG works and run them through DP again.
ok, fair enough. but in that case, your routines should have done what d.p. requires of its proofers, which is to move the second part of the end-of-line hyphenate to the previous line. i _believe_ that in some cases, you moved the first part down... (but i could be wrong on that, so do please let me know if i am.)
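To make the convention above concrete, here is a minimal sketch of moving the second part of an end-of-line hyphenate up to the previous line. It is not anyone's actual routine from this thread: the function name `join_eol_hyphenates` and the `keep_hyphen` flag are hypothetical, and the sketch assumes every single trailing hyphen marks a broken word (it cannot tell a genuinely hyphenated compound from a typesetting break).

```python
def join_eol_hyphenates(lines, keep_hyphen=False):
    """Move the tail of each end-of-line hyphenate up to the previous line.

    Hypothetical helper for illustration only. Leading indentation on the
    following line is dropped, and a trailing "--" is treated as a dash.
    """
    out = [line.rstrip() for line in lines]
    for i in range(len(out) - 1):
        # Skip lines with no trailing hyphen, or with a double-hyphen dash.
        if not out[i].endswith("-") or out[i].endswith("--"):
            continue
        nxt = out[i + 1].lstrip()
        if not nxt:
            continue  # blank line follows: nothing to pull up
        parts = nxt.split(None, 1)
        tail = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        head = out[i] if keep_hyphen else out[i][:-1]
        out[i] = head + tail   # second half moves up
        out[i + 1] = rest      # next line keeps the remainder
    return out


if __name__ == "__main__":
    page = ["the gods and fight-", "ing men came home"]
    print(join_eol_hyphenates(page))
```

Passing `keep_hyphen=True` keeps the hyphen in the rejoined word, which may be the safer choice when the text has not yet been checked against a word list.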
Without automatic recovery of linebreaks one has several days of extremely tedious work reintroducing the original linebreaks.
as i said above, i use the o.c.r. text from archive.org to restore them. it's pretty straightforward, and automatic. it didn't take long to code, and it runs very fast. but you're right; doing it manually is painful...
The other alternative for you is to leave healthy right margins and leave your PDFs “ragged right” [*very* ragged right!]
well, one object is to clone the pages of the printed book itself, so ragged-right isn't really an option.
-bowerbird

but in that case, your routines should have done what d.p. requires of its proofers, which is to move the second part of the end-of-line hyphenate to the previous line. i _believe_ that in some cases, you moved the first part down... (but i could be wrong on that, so do please let me know if i am.)
Yes, my routines would have been smart enough to move the second part up if I had been smart enough to remember to run my smart dehyphenation routines prior to running the linebreak recovery routine, but since you were only doing a visual “proof of concept” I didn’t think it would matter to you whether the half-word was moved one line up or one line down since the line length imbalance would on average remain the same in any case.
"in theory"? you just take them from the o.c.r. at archive.org, which is what you used to restore the linebreaks anyway, right?
I don’t “just” do anything. The linebreak recovery algorithm is more-or-less a whole document best match using Levenshtein distance tokenized on words+whitespace, and then cloning the whitespace information from the OCR text to the PG text. Since it is the PG text that contributes the “word” part of the information there are no hyphens to recover – because PG requires that we remove those hyphens. If I not only took the “whitespace” part of the information from the OCR but also the “word” part from the OCR, then you would be back to the raw OCR – but hey, if that’s what you want…
…I will see what it would take to also recover the hyphens, which also sounds like a reasonable thing to be able to do.
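As a rough illustration of the idea described above (words come from the PG text, line breaks come from the OCR), here is a small sketch. It is not the actual routine discussed in this thread: it swaps in Python's standard `difflib.SequenceMatcher` for the Levenshtein-based whole-document match, tokenizes on words only, and the names `dehyphenate` and `clone_linebreaks` are hypothetical. It also assumes the OCR is dehyphenated first, so that the second half of a broken word moves up a line.

```python
import difflib


def dehyphenate(ocr_lines):
    """Join end-of-line hyphenates by moving the second half up a line."""
    lines = [ln.strip() for ln in ocr_lines]
    for i in range(len(lines) - 1):
        if lines[i].endswith("-") and not lines[i].endswith("--") and lines[i + 1]:
            tail, _, rest = lines[i + 1].partition(" ")
            lines[i] = lines[i][:-1] + tail
            lines[i + 1] = rest.lstrip()
    return lines


def clone_linebreaks(ocr_lines, pg_text):
    """Re-break pg_text at the points where the OCR lines end.

    Words are taken from pg_text only; the OCR contributes positions.
    Blank lines (paragraph breaks) are ignored in this sketch, and a break
    is lost if the OCR word ending a line has no match in the PG text.
    """
    ocr_words, line_end = [], set()
    for line in dehyphenate(ocr_lines):
        words = line.split()
        if not words:
            continue
        ocr_words.extend(words)
        line_end.add(len(ocr_words) - 1)  # index of last word on this line

    pg_words = pg_text.split()
    matcher = difflib.SequenceMatcher(None, ocr_words, pg_words, autojunk=False)

    # Copy each OCR line end onto the PG word it was matched with.
    break_after = set()
    for a, b, size in matcher.get_matching_blocks():
        for k in range(size):
            if a + k in line_end:
                break_after.add(b + k)

    out_lines, current = [], []
    for j, word in enumerate(pg_words):
        current.append(word)
        if j in break_after:
            out_lines.append(" ".join(current))
            current = []
    if current:
        out_lines.append(" ".join(current))
    return out_lines


if __name__ == "__main__":
    ocr = ["The gods and fight-", "ing rnen of old", "came home"]
    pg = "The gods and fighting men of old came home"
    for line in clone_linebreaks(ocr, pg):
        print(line)
```

In the usage example the OCR misreads "men" as "rnen", but the output still carries the PG spelling, because only break positions are taken from the OCR. Recovering the hyphens themselves would need an extra step that records where `dehyphenate` joined a word and reinserts the split after alignment.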
participants (2)
- Bowerbird@aol.com
- James Adcock