>as jim has made clear, by his refusal to answer questions,
the job of rewrapping text is _not_ as simple as he says...

 

This is more misrepresentation from BB, which he constantly resorts to in order to make the point that his favorite hobby-horses are “the only way to fly.”

 

I take off the headers and footers so that the two texts have more or less matching starts and ends, and then unwrapping takes a few seconds.  It’s not that I don’t answer BB’s questions, it’s that he never listens.  Anyone can try my software, for better or for worse; I have posted it.  Or send me a text you want unwrapped, and I will do it for you.


>like restore the end-line hyphenates to their original state.

 

Conversely, use dictionary-based, conservative end-line dehyphenation/pull-up at the start of your normalization process, and then you never have to worry about it again.  Just because DP does it wrong doesn’t mean that everyone has to emulate them.
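For anyone wondering what "dictionary-based conservative" pull-up might look like, here is a minimal sketch in Python. It is not my posted software, just an illustration: the function name `dehyphenate` and the `known_words` set are made up for the example, and a real word list would be far larger. The rule is conservative in the sense that a line-ending hyphen is removed only when the joined word is in the dictionary; anything else is left alone.

```python
import re

# A word fragment followed by a single hyphen at the very end of a line.
END_HYPHEN = re.compile(r"([A-Za-z]+)-$")
# The first word on the next line, plus any punctuation stuck to it.
LEADING_WORD = re.compile(r"\s*([A-Za-z]+)([^\sA-Za-z]*)\s*")


def dehyphenate(lines, known_words):
    """Join end-of-line hyphenated words only when the result is a known word.

    `known_words` is a hypothetical set of lowercase dictionary words.
    """
    lines = [ln.rstrip() for ln in lines]
    for i in range(len(lines) - 1):
        head = END_HYPHEN.search(lines[i])
        tail = LEADING_WORD.match(lines[i + 1])
        if not head or not tail:
            continue
        joined = head.group(1) + tail.group(1)
        if joined.lower() in known_words:
            # Pull the second half (and its punctuation) up to this line.
            lines[i] = lines[i][:head.start()] + joined + tail.group(2)
            lines[i + 1] = lines[i + 1][tail.end():]
        # Otherwise leave the hyphen alone: it may be a real compound.
    return lines


text = ["The quick pro-", "gram ran well-", "known tests."]
print(dehyphenate(text, {"program"}))
```

Note that "well-" / "known" is left untouched because "wellknown" is not in the word list, which is the whole point of being conservative. A pull-up can leave the following line empty, so a real tool would need to clean that up as well.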


>   in general, is it worthwhile to do any further cleaning on
>   a book which just came out of distributed proofreaders?

 

The simple answer is to select and read a few pages at random and see if you spot any errors.  If yes, the book needs fixing.  If no, the book is probably “OK.”
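If you want the "at random" part to be honest rather than "the pages I happened to open," a few lines of Python will pick them for you. This is only a sketch: a plain-text file has no real pages, so the example assumes a "page" is just a fixed run of lines, and `sample_pages`, `page_len`, and `k` are names invented here.

```python
import random


def sample_pages(lines, page_len=30, k=3, seed=None):
    """Split `lines` into pseudo-pages of `page_len` lines and pick `k` at random.

    Returns a list of (page_index, page_lines) pairs in reading order.
    """
    pages = [lines[i:i + page_len] for i in range(0, len(lines), page_len)]
    rng = random.Random(seed)  # pass a seed to get a repeatable selection
    picks = sorted(rng.sample(range(len(pages)), min(k, len(pages))))
    return [(n, pages[n]) for n in picks]


book = [f"line {i}" for i in range(100)]
for index, page in sample_pages(book, page_len=10, k=3):
    print(f"--- page {index} ---")
    print("\n".join(page))
```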

 

Let’s try this approach on, say, 76.html.

 

Bugs found in the first couple pages:

 

Background: pink

 

Missing textual version of the title page

 

First line of Illustrations indented incorrectly

 

Repeats in list of illustrations

 

“BY ORDER OF THE AUTHOR” is textual information presented only as a graphic.

 

“EXPLANATORY” indentations don’t make sense

 

Use of leading all-caps words doesn’t match PG guidelines.

 

Images don’t have accessibility-friendly alt text.

 

Textual information in Bust presented only as a graphic.

 

Use of indentation-style paragraphs without following the convention of *not* indenting the first paragraph of a chapter.

 

[100s of places] Use of SHOUT emphasis carried over inappropriately from the archaic (and mistaken) practice of quasi-marking italic emphasis in historical PG “txt” files, whereas HTML actually HAS technology for correctly marking and rendering italics aka <i>italic emphasis</i>

 

Etc.
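Some of these can be checked mechanically. As one example, the alt-text item can be roughly screened with Python's standard `html.parser`. This is a sketch under assumptions: `images_missing_alt` is a name made up here, and it only reports images whose alt attribute is missing or empty. It cannot tell you whether alt text that is present is actually useful, and an empty alt is legitimate for purely decorative images, so the output is a list of places to look, not a verdict.

```python
from html.parser import HTMLParser


class ImgAltChecker(HTMLParser):
    """Collect the src of every <img> whose alt attribute is missing or blank."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        alt = attr.get("alt")
        # alt is None when the attribute is absent or has no value.
        if alt is None or not alt.strip():
            self.missing.append(attr.get("src", "?"))


def images_missing_alt(html):
    checker = ImgAltChecker()
    checker.feed(html)
    checker.close()
    return checker.missing


sample = '<p><img src="a.jpg" alt="Bust of the author"><img src="b.jpg"></p>'
print(images_missing_alt(sample))
```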

 

Again, 76.html has about 1500 errors, most of them of the SHOUT encoding variety.
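For the SHOUT cases, a crude count is easy to get. The sketch below is not how I arrived at the figure above; it is just an illustration, and the names `ShoutFinder` and `find_shouts` are invented for it. It pulls the text out of the HTML, skips headings (where all-caps is conventional), and lists every run of two or more capital letters. It will flag Roman numerals, acronyms, and the leading all-caps words at chapter starts too, so every hit still needs a human to look at it before anything gets changed to <i>.

```python
import re
from html.parser import HTMLParser

# Two or more consecutive capitals as a whole word; single letters like "I" are ignored.
SHOUT = re.compile(r"\b[A-Z]{2,}\b")
# Elements whose text is not checked (an assumption of this sketch).
SKIP = {"script", "style", "title", "h1", "h2", "h3", "h4", "h5", "h6"}


class ShoutFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.hits = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.hits.extend(SHOUT.findall(data))


def find_shouts(html):
    finder = ShoutFinder()
    finder.feed(html)
    finder.close()
    return finder.hits


sample = "<h2>CHAPTER I</h2><p>You don't KNOW about me. I said NO.</p>"
print(find_shouts(sample))
```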