I appreciate the suggestion, but the file I'm working on already has its pages combined, hyphenated words rejoined, and paragraphs rewrapped. All that is left is to correct spellings, add accents (mostly circumflexes), add [Illustration] tags, format ASCII-art family trees, and so on. If Bowerbird can give me a conversion utility I'll put in ZML markup; otherwise I'll do what I've done in the past. It's a 400+ page book and I'm nearing page 80 at the moment. The OCR at archive.org came out pretty well, so the corrections are not too numerous. I would have done the corrections one page at a time, with page images alongside, if I had known how to get a separate text file for each page.

On Fri, Dec 23, 2011 at 12:13 AM, Carlo Traverso <traverso@posso.dm.unipi.it> wrote:

> I am working on and will continue to work on the file here:
>
> http://dl.dropbox.com/u/8919415/AStudyOfTheBhagavataPurana.txt
>
> I've put in much too much work on it to just abandon it, and since it
> has been de-hyphened, rewrapped and de-paged there is no good way to get
> my work into your system.

If one has two reasonably clean versions of the txt files of the same
images, file1 with page separators and file2 without, it is reasonably
simple to merge the two taking the separators from the first one and
the text from the second one.

The trick is to use wdiff (GNU wdiff, possibly dwdiff, not wdiff.com
wdiff); assume for example that the separators are a line that cannot
appear in the text, e.g.

<--------Page 123------>

then do

wdiff file1 file2 > file3

file3 consists of file2 in which the differing parts (sequences of
differing words, so whitespace is disregarded) are represented like

This is the [-first-] {+second+} version

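One just has to keep the common parts, the text of the second version
(the {+ +} parts) and the separators, which show up inside the [- -]
parts; everything else from the first version is dropped. The line
breaks and other whitespace are preserved from the second version.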
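
The filtering step described above can be sketched in Python. This is
only a minimal sketch under assumptions: the name merge_wdiff and the
SEPARATOR pattern are hypothetical, the separator is assumed to look
like the example line above, and the wdiff output is assumed to use
the default [- -] and {+ +} markers.

```python
import re

# Assumed separator layout, modelled on "<--------Page 123------>".
SEPARATOR = re.compile(r"<-+Page \d+-+>")
# Default wdiff markers: [-deleted-] comes from file1, {+inserted+} from file2.
CHUNK = re.compile(r"\[-(.*?)-\]|\{\+(.*?)\+\}", re.DOTALL)


def merge_wdiff(wdiff_output):
    """Keep common text and file2 insertions; keep only separators from file1."""

    def replace(match):
        deleted, inserted = match.group(1), match.group(2)
        if inserted is not None:
            # Text that exists only in file2: keep it unchanged.
            return inserted
        # Text that exists only in file1: drop all of it except separators.
        separators = SEPARATOR.findall(deleted)
        if not separators:
            return ""
        # Put each recovered separator on a line of its own.
        return "\n" + "\n".join(separators) + "\n"

    return CHUNK.sub(replace, wdiff_output)
```

The result can still contain doubled spaces and blank lines around the
places where something was dropped, so it needs a cleanup pass.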

Of course it is not quite as simple as I said, since one has to take
care of extra spaces and newlines, and be sure that the difference
marker strings [-, -], {+ and +} do not appear in either text; but
these strings can be changed with the wdiff options -w, -x, -y and -z.

Handling whitespace might be a bit trickier (maybe use wdiff -n), but
at the end, once file3 has been cleaned up, you can run

diff file2 file3

to see what went wrong (if everything is OK you see just the
separators and the lines that are split between the pages).
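
A word-level check can complement the diff run. The following is a
rough sketch, not a replacement for diff: the function names and the
SEPARATOR pattern are hypothetical, and the separator layout is again
assumed to match the example line earlier in this message.

```python
import re

# Assumed separator layout, modelled on "<--------Page 123------>".
SEPARATOR = re.compile(r"<-+Page \d+-+>")


def words_without_separators(text):
    """Return the words of text after removing every page separator."""
    return SEPARATOR.sub(" ", text).split()


def first_mismatch(file2_text, merged_text):
    """Return the index of the first differing word, or None if they agree."""
    original = file2_text.split()
    merged = words_without_separators(merged_text)
    for index, (left, right) in enumerate(zip(original, merged)):
        if left != right:
            return index
    if len(original) != len(merged):
        # One text is a prefix of the other; report where the shorter ends.
        return min(len(original), len(merged))
    return None
```

If first_mismatch returns None, the merged text has the same words as
file2 in the same order, and only whitespace and separators differ.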

The position of the separators in the result might be a bit off if
there are differences in the words immediately before or after the
separators. For example, a word split between pages goes to the second
page.

I don't have much experience or any ready-to-use scripts for this,
since I always carefully preserve my separators (I have simple scripts
that join files, keeping the filenames in the separators, and split
them again, recovering the filenames), but I have sometimes helped
others recover their lost page separators.
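
For illustration, a join/split pair of that kind might look like the
sketch below. These are not the actual scripts: the separator layout
and the names join_pages and split_pages are hypothetical. The sketch
works on (filename, text) pairs in memory and leaves reading and
writing the files to the caller.

```python
import re

# Hypothetical separator layout that embeds the page's filename.
SEP_TEMPLATE = "<--------File {name}------>"
SEP_PATTERN = re.compile(r"^<-+File (.+?)-+>$", re.MULTILINE)


def join_pages(pages):
    """Join (filename, text) pairs into one string, one separator per page."""
    parts = []
    for name, text in pages:
        parts.append(SEP_TEMPLATE.format(name=name))
        parts.append(text.rstrip("\n"))
    return "\n".join(parts) + "\n"


def split_pages(joined):
    """Recover the (filename, text) pairs from a string made by join_pages."""
    pieces = SEP_PATTERN.split(joined)
    # pieces[0] is whatever precedes the first separator (normally empty);
    # after it, filenames and page texts alternate.
    names, texts = pieces[1::2], pieces[2::2]
    # Blank lines at the edges of a page are normalised to one final newline.
    return [(name, text.strip("\n") + "\n") for name, text in zip(names, texts)]
```

A filename that itself ends in a hyphen would be ambiguous with this
separator layout, so a real script would want a stricter format.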

Carlo

_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d