
I am working on and will continue to work on the file here:
http://dl.dropbox.com/u/8919415/AStudyOfTheBhagavataPurana.txt
I've put far too much work into it to just abandon it, and since it has been de-hyphenated, rewrapped and de-paged, there is no good way to get my work into your system.
If one has two reasonably clean versions of the txt files of the same images, file1 with page separators and file2 without, it is reasonably simple to merge the two, taking the separators from the first and the text from the second.

The trick is to use wdiff (GNU wdiff, possibly dwdiff, not the wdiff.com wdiff). Assume, for example, that each separator is a line that cannot appear in the text, e.g.

    <--------Page 123------>

Then run

    wdiff file1 file2 > file3

file3 consists of file2 in which the differing parts (sequences of different words, thus disregarding whitespace) are marked like this:

    This is the [-first-] {+second+} version

One just has to keep the common parts, the text of the second version, and the separators; that is, drop the [-...-] parts except for the separators, and unwrap the {+...+} parts. The line breaks and other whitespace are preserved from the second version.

Of course this is not quite as simple as I said, since one has to take care of extra spaces and newlines, and be sure that the difference marker strings [- , -] , {+ , +} do not appear in either text; these strings can be changed with the wdiff options -w, -x, -y, -z. Handling whitespace might be a bit trickier (maybe use wdiff -n), but at the end, once file3 has been cleaned up this way, you can run

    diff file2 file3

to see what went wrong. If everything is OK, you see just the separators and the lines that are split between pages.

The position of the separators in the result might be a bit off if there are differences in the words immediately before or after a separator. For example, a word split between pages goes to the second page.

I don't have much experience with this, or ready-to-use scripts, since I always dearly preserve my separators (I have simple scripts to join files, keeping the filenames in the separators, and to split them again, recovering the filenames), but I have sometimes helped others recover their lost page separators.

Carlo
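
As a rough illustration of the merge described above, here is a minimal Python sketch. It is not the wdiff pipeline itself: it substitutes Python's standard difflib for wdiff to do the word-by-word comparison, and it assumes the separators look like the <--------Page 123------> example. The name merge_separators and the SEPARATOR pattern are made up for this sketch. As with wdiff, a word split across a page break ends up on the second page.

```python
import difflib
import re

# Assumed separator format, modelled on "<--------Page 123------>".
SEPARATOR = re.compile(r"<-+Page \d+-+>")


def merge_separators(paged: str, clean: str) -> str:
    """Return `clean` with the separator lines of `paged` inserted."""
    # Tokenise the paged text into words; a separator line is one token.
    paged_tokens = []
    for line in paged.splitlines():
        stripped = line.strip()
        if SEPARATOR.fullmatch(stripped):
            paged_tokens.append(stripped)
        else:
            paged_tokens.extend(stripped.split())

    # Words of the clean text, together with their character offsets.
    clean_words = list(re.finditer(r"\S+", clean))
    matcher = difflib.SequenceMatcher(
        None, paged_tokens, [m.group() for m in clean_words], autojunk=False
    )

    # A separator never occurs in the clean text, so it always falls in a
    # "delete" or "replace" run; place it before clean word number j1.
    insertions = []
    for tag, i1, i2, j1, _j2 in matcher.get_opcodes():
        if tag in ("delete", "replace"):
            for token in paged_tokens[i1:i2]:
                if SEPARATOR.fullmatch(token):
                    insertions.append((j1, token))

    # Rebuild the clean text, breaking the line wherever a separator goes.
    out = []
    pos = 0
    for word_index, separator in insertions:
        if word_index < len(clean_words):
            start = clean_words[word_index].start()
        else:
            start = len(clean)
        chunk = clean[pos:start].rstrip()
        if chunk:
            out.append(chunk + "\n")
        out.append(separator + "\n")
        pos = start
    out.append(clean[pos:])
    return "".join(out)
```

Calling merge_separators(open("file1").read(), open("file2").read()) would give the merged text; apart from the inserted separator lines and the line breaks around them, the whitespace is that of file2. Running diff between file2 and the result is still the way to check what went wrong.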