rewrapping p.g. to an existing scan-set

carlo said:
Much simpler, one can use wdiff (or dwdiff) and preserve whitespace from one and non-whitespace from the other through regexp. I am interested in comparing your tool with my wdiff approach.
i'd love to see the results from both of you. and compare it to what my program can do, because, at least thus far in its development, my tool still needs considerable baby-sitting for the level of quality one needs to produce. i suggest everyone use their approach on the p.g. e-text for "pride and prejudice"...
let's mold it to this version at archive.org:
(or, if some other version was the basis for #1342, or you prefer it for any other reason, say so, and we can use that.) -bowerbird

BB> http://www.gutenberg.org/cache/epub/1342/pg1342.txt
BB> let's mold it to this version at archive.org:
BB> http://www.archive.org/details/prideprejudiceno00austuoft
See: http://freekindlebooks.org/Dev/PNP.txt

I have put almost no effort into this particular application of linebreak recovery (I use the software mainly for cross-versioning). Note that "all" the whitespace is being cloned from the IA file to the PG file, which, as you can see, needlessly introduces whitespace errors in non-linebreak situations. Presumably this issue needs to be thought out some more. "PAGEBREAK" represents page break recovery from the IA txt file.

But these two files are almost identical to begin with, so they really don't represent much of a test. My software is designed to work on more challenging situations. Try:

http://www.gutenberg.org/cache/epub/28948/pg28948.txt
vs.
http://www.archive.org/details/therainbowlawren00lawrrich
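
For anyone who wants to experiment with the whitespace question raised above, here is a rough sketch of one way to copy only the line breaks, rather than "all" the whitespace, from a scan text onto a PG text. It is not the wdiff/regexp approach Carlo describes, nor the software used to produce PNP.txt; it swaps in Python's standard difflib for the word alignment. The function name rewrap and its parameters are made up for illustration. It assumes both texts fit in memory, ignores blank lines (so paragraph breaks are not preserved), and does nothing about page breaks.

```python
from difflib import SequenceMatcher


def rewrap(pg_text: str, scan_text: str) -> str:
    """Return the words of pg_text, broken into lines where scan_text breaks.

    Hypothetical helper: only line-end positions are taken from the scan;
    every other gap between words becomes a single space.
    """
    # Flatten the scan into words, remembering which words end a line.
    scan_words = []
    ends_line = []
    for line in scan_text.splitlines():
        words = line.split()
        for i, word in enumerate(words):
            scan_words.append(word)
            ends_line.append(i == len(words) - 1)

    pg_words = pg_text.split()

    # Align the two word sequences; autojunk=False keeps common words
    # such as "the" from being discarded as junk in long texts.
    matcher = SequenceMatcher(None, scan_words, pg_words, autojunk=False)

    out = []  # pairs of [pg word, separator that follows it]
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Matching words: the PG word inherits the scan's line end.
            for k in range(i2 - i1):
                sep = "\n" if ends_line[i1 + k] else " "
                out.append([pg_words[j1 + k], sep])
        else:
            # Words differ (OCR error, edit, insertion, deletion): keep the
            # PG words, and if the scan broke a line somewhere in this
            # stretch, break after the last word emitted so far.
            for word in pg_words[j1:j2]:
                out.append([word, " "])
            if any(ends_line[i1:i2]) and out:
                out[-1][1] = "\n"

    return "".join(word + sep for word, sep in out).rstrip() + "\n"


if __name__ == "__main__":
    scan = "It is a truth universally\nacknowledged, that a single man in\npossession of a good fortune"
    pg = "It is a truth universally acknowledged, that a single\nman in possession of a good fortune"
    print(rewrap(pg, scan), end="")
```

Because the PG words are never altered, only the separators between them, differences in the scan text (OCR errors and the like) cannot leak into the output; the cost is that a line break falling inside a mismatched stretch is placed only approximately.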
participants (2)
- Bowerbird@aol.com
- James Adcock