I have created a new command-line tool, “pgdiff”,
along the lines of what BB has been talking about. It compares two independently
OCR’ed texts on a word-by-word basis in order to find and flag
errors. In this regard it is similar to “worddiff”, as
opposed to “diff” (the per-line approach BB has been describing).
But my new tool has several tricks
that haven’t been seen before:
It can be used with two different versions or editions of the text,
as long as there are no really long runs of differences between them; i.e. the two texts
do not have to have their linebreaks at the same locations. It tries to retain
the linebreak locations of the first input text in preference to those of the second,
i.e. the first input text should represent the target text you are trying
to create.
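To illustrate the linebreak handling (just my sketch of the idea here, not the actual pgdiff source), you can remember which word indexes end a line in the first text and then re-wrap the merged word stream onto those same boundaries:

    # Sketch only -- illustrates retaining the first text's linebreak
    # locations; pgdiff itself may do this differently.
    def words_with_breaks(text):
        """Return the word list plus the set of word indexes that end a line."""
        words, line_ends = [], set()
        for line in text.splitlines():
            words.extend(line.split())
            if words:
                line_ends.add(len(words) - 1)   # last word on this line
        return words, line_ends

    def rewrap(words, line_ends):
        """Re-join words, breaking wherever the first text originally broke."""
        out, line = [], []
        for i, w in enumerate(words):
            line.append(w)
            if i in line_ends:
                out.append(" ".join(line))
                line = []
        if line:
            out.append(" ".join(line))
        return "\n".join(out)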
Because it retains the first text’s linebreaks, it can also be used for “versioning” –
for example, using a copy of a PG text of one version or edition to
help fix and create a text of a different version or edition.
It can also be used to recover linebreak information where
it has been lost, for example taking an older PG text and
recovering its linebreaks so that the text can be resubmitted
to DP for a clean-up pass.
In normal mode, when it finds a mismatch it outputs the mismatch
like this { it’ll | it’11 } within the body of the text, so that with
a regex-compatible editor it is very quick to search for and fix the errors
found.
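With that format, a single search pattern finds every flagged pair; for instance (my own illustration, and the output file name is just assumed):

    import re

    # Hypothetical example: jump to each { left | right } marker that pgdiff
    # embeds in its output.  "merged.txt" is an assumed file name.
    marker = re.compile(r"\{[^{}|]*\|[^{}|]*\}")
    with open("merged.txt", encoding="utf-8") as f:
        for m in marker.finditer(f.read()):
            print(m.group(0))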
As BB says, once you have tried this approach, the manual approach of
trying to spot errors visually seems pretty painful and silly.
I find that comparing on a word basis rather than a
line basis makes it quicker and easier to fix the errors in general.
You do want to do some regex punctuation normalization on the two OCRs
before running the tool, to remove the trivial differences and
cut down on the number of trivial mismatches it finds that you then have to fix.
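Something along these lines works as a pre-pass (my sketch, not part of pgdiff): map curly quotes and dashes to plain ASCII and collapse runs of whitespace, so the two OCRs don’t differ on trivia.

    import re

    # Sketch of a punctuation-normalization pre-pass -- adjust to taste.
    SUBS = [
        (r"[\u2018\u2019]", "'"),    # curly single quotes -> straight apostrophe
        (r"[\u201C\u201D]", '"'),    # curly double quotes -> straight quote
        (r"[\u2013\u2014]", "--"),   # en/em dash -> double hyphen
        (r"[ \t]+", " "),            # collapse runs of spaces and tabs
    ]

    def normalize(text):
        for pattern, repl in SUBS:
            text = re.sub(pattern, repl, text)
        return text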
Source and a compiled Windows version are at http://www.freekindlebooks.org/Dev/StringMatch
It is based on traditional Levenshtein distance, where the token
is taken to be the non-whitespace part of a “word”, as opposed to
measuring distances between lines of text or between individual characters.
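In outline (again, a sketch of the general technique rather than the pgdiff source), the distance is computed over word tokens instead of characters, and backtracking over the same table gives the word alignment used to flag mismatches:

    def word_levenshtein(a_words, b_words):
        """Levenshtein distance where each token is a whitespace-delimited word."""
        prev = list(range(len(b_words) + 1))
        for i, a in enumerate(a_words, 1):
            cur = [i]
            for j, b in enumerate(b_words, 1):
                cost = 0 if a == b else 1
                cur.append(min(prev[j] + 1,          # delete a word from text 1
                               cur[j - 1] + 1,       # insert a word from text 2
                               prev[j - 1] + cost))  # match or substitute
            prev = cur
        return prev[-1]

    # e.g. word_levenshtein("it’ll be fine".split(), "it’11 be fine".split()) == 1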