
I am interested in comparing your tool with my wdiff approach.
See: http://freekindlebooks.org/Dev/pgdiff.cpp where this might be called using something like:

    pgdiff -linebreaks -w 1000 pg1342.txt ia.txt > rewrapped.txt

which clones the linebreaks from ia.txt onto pg1342.txt, creating rewrapped.txt. The -w 1000 parameter specifies that the longest continuous stretch of mismatched words to search for a match should be about 1000 words -- since this is an n^2 algorithm. This is useful in versioning, for example, where one version of the text may contain a paragraph not found in the other.

I haven't worked much on the issue of pagebreaks. Right now I just have a hard-wired assumption that pagebreaks are marked with "PAGEBREAK" in ia.txt, in which case those pagebreaks are also passed through to rewrapped.txt.

Long uncommon prefixes and suffixes, such as PG legalese and/or generated TOCs, should be removed. The first word of each text and the last word of each text should be identical to "get the algorithm off on the right foot." Often I simply put a dummy word at the start and end of each text, such as "START" and "END", to force this match. This requirement is a bug that I need to sort out.

More commonly I use this program in versioning to compare two different versions of "the same" text -- not necessarily identical editions -- by doing:

    pgdiff -w 1000 pg1342.txt ia.txt > rewrapped.txt

which marks areas of disagreement between pg1342.txt and ia.txt.
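To make the -linebreaks idea concrete, here is a minimal sketch of the linebreak-cloning step only. It is not pgdiff itself: it assumes the two texts contain the same words in the same order, whereas the real program runs a word-level diff (bounded by the -w window) to align mismatched stretches first. The function names are my own for illustration.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Split text into words; optionally record, per word, whether a
// linebreak followed it in the original text.
static std::vector<std::string> tokenize(const std::string& text,
                                         std::vector<bool>* breakAfter) {
    std::vector<std::string> words;
    std::istringstream lines(text);
    std::string line;
    while (std::getline(lines, line)) {
        std::istringstream ws(line);
        std::string w;
        while (ws >> w) {
            words.push_back(w);
            if (breakAfter) breakAfter->push_back(false);
        }
        // Mark the last word seen as ending a line.
        if (breakAfter && !breakAfter->empty()) breakAfter->back() = true;
    }
    return words;
}

// Clone the linebreak positions of `wrapped` onto the words of `flat`.
// Simplifying assumption: identical word sequences; pgdiff instead
// aligns the two word streams with an n^2 search before this step.
std::string cloneLinebreaks(const std::string& flat,
                            const std::string& wrapped) {
    std::vector<bool> breakAfter;
    tokenize(wrapped, &breakAfter);
    std::vector<std::string> out = tokenize(flat, nullptr);
    std::string result;
    for (size_t i = 0; i < out.size(); ++i) {
        result += out[i];
        if (i < breakAfter.size() && breakAfter[i]) result += '\n';
        else if (i + 1 < out.size()) result += ' ';
    }
    return result;
}
```

With mismatch handling added on top (emitting unmatched words from the first text as-is, or marking them as disagreements), this grows into something like the pgdiff behavior described above.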