
I am interested in comparing your tool with my wdiff approach.
See: http://freekindlebooks.org/Dev/pgdiff.cpp where this might be called using something like:

    pgdiff -linebreaks -w 1000 pg1342.txt ia.txt > rewrapped.txt

which clones the linebreaks from ia.txt onto pg1342.txt, creating rewrapped.txt. The -w 1000 parameter specifies that the longest continuous stretch of mismatched words to search for a match should be about 1000 words -- since this is an n^2 algorithm. This is useful in versioning, for example, where one version of the text may contain a paragraph not found in the other.

I haven't worked much on the issue of pagebreaks. Right now I just have a hard-wired assumption that pagebreaks are marked with "PAGEBREAK" in ia.txt, in which case those pagebreaks are also passed through to rewrapped.txt.

Long uncommon prefixes and suffixes, such as PG legalese and/or generated TOCs, should be removed. The first word of each text and the last word of each text should be identical to "get the algorithm off on the right foot." Often I simply put a dummy word at the start and end of each text, such as "START" and "END", to force this match. This requirement is a bug that I need to sort out.

More commonly I use this program in versioning to compare two different versions of "the same" text -- not necessarily identical editions -- by doing:

    pgdiff -w 1000 pg1342.txt ia.txt > rewrapped.txt

which marks areas of disagreement between pg1342.txt and ia.txt.
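To make the -linebreaks idea concrete, here is a minimal sketch of the linebreak-cloning step only. It is not pgdiff itself: it assumes the two texts contain the same words in the same order, whereas the real program runs a word-level diff (bounded by the -w window) to align mismatched stretches first. The function names are my own for illustration.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split text into words; optionally record, per word, whether a
// linebreak followed it in the original text.
static std::vector<std::string> tokenize(const std::string& text,
                                         std::vector<bool>* breakAfter) {
    std::vector<std::string> words;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream ws(line);
        std::string w;
        while (ws >> w) {
            words.push_back(w);
            if (breakAfter) breakAfter->push_back(false);
        }
        // Mark the last word seen as ending a line.
        if (breakAfter && !breakAfter->empty()) breakAfter->back() = true;
    }
    return words;
}

// Clone the linebreak positions of `wrapped` onto the words of `flat`.
// Simplifying assumption: identical word sequences; pgdiff instead
// aligns the two word streams with an n^2 search before this step.
std::string cloneLinebreaks(const std::string& flat,
                            const std::string& wrapped) {
    std::vector<bool> breakAfter;
    tokenize(wrapped, &breakAfter);
    std::vector<std::string> out = tokenize(flat, nullptr);
    std::string result;
    for (size_t i = 0; i < out.size(); ++i) {
        result += out[i];
        if (i < breakAfter.size() && breakAfter[i]) result += '\n';
        else if (i + 1 < out.size()) result += ' ';
    }
    return result;
}
```

With mismatch handling added on top (emitting unmatched words from the first text as-is, or marking them as disagreements), this grows into something like the pgdiff behavior described above.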