BB> http://www.gutenberg.org/cache/epub/1342/pg1342.txt
let's mold it to this version at archive.org:
> http://www.archive.org/details/prideprejudiceno00austuoft
See:
http://freekindlebooks.org/Dev/PNP.txt
where I have put almost no effort into this particular application of linebreak recovery (I use the software mainly for cross-versioning).
Note that “all” the whitespace is being cloned from the IA file to the PG file, which, as you can see, needlessly introduces whitespace errors in non-linebreak situations. Presumably this issue needs to be thought out some more. And “PAGEBREAK” marks a page break recovered from the IA txt file.
But these two files are almost identical to begin with, so they really don’t represent much of a test. My software is designed to work on more challenging situations. Try:
http://www.gutenberg.org/cache/epub/28948/pg28948.txt
vs.
http://www.archive.org/details/therainbowlawren00lawrrich
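
For anyone who wants to experiment with the idea, here is a minimal sketch of linebreak cloning. It is not my software, just an illustration that assumes word-level alignment with Python's standard difflib. The function names (tokenize, clone_linebreaks) are made up for this example. It copies only linebreaks, not all whitespace, and it leaves the target's paragraph breaks alone. Page break recovery is not covered.

```python
import difflib
import re

# A token is a run of non-whitespace followed by its trailing whitespace.
TOKEN = re.compile(r"(\S+)(\s*)")


def tokenize(text):
    """Split text into (word, trailing_whitespace) pairs."""
    return TOKEN.findall(text)


def clone_linebreaks(target, reference):
    """Re-break `target` so its lines end where `reference` lines end.

    Hypothetical helper: words are aligned with difflib, and only for
    words that match in both texts is the linebreak decision copied.
    """
    tgt = tokenize(target)
    ref = tokenize(reference)
    matcher = difflib.SequenceMatcher(
        None,
        [word for word, _ in tgt],
        [word for word, _ in ref],
        autojunk=False,
    )
    out = [list(pair) for pair in tgt]
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            i = block.a + k
            tgt_ws = tgt[i][1]
            ref_ws = ref[block.b + k][1]
            # Keep empty whitespace (end of text) and paragraph breaks.
            if not tgt_ws or "\n\n" in tgt_ws:
                continue
            # Copy only the linebreak decision, not the exact whitespace.
            out[i][1] = "\n" if "\n" in ref_ws else " "
    return "".join(word + ws for word, ws in out)


if __name__ == "__main__":
    pg = "The colour of the\nsky was grey"
    ia = "The color of\nthe sky was gray"
    print(clone_linebreaks(pg, ia))
```

Words that do not match (here "colour" vs. "color") keep whatever whitespace the target already had, so differences between the editions do not derail the alignment of the words around them.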