I have created a new command-line tool, “pgdiff”,
along the lines of what BB has been talking about. It compares two independently
OCR’ed texts on a word-by-word basis in order to find and flag
errors. In this regard it is similar to “worddiff”, as
opposed to “diff” (the per-line approach BB has been describing).
But my new tool has several tricks
that haven’t been seen before:
It can be used with two different versions or editions of the text,
as long as there are no really long runs of differences between them; i.e. the two texts
do not have to have their linebreaks at the same locations. It tries to retain
the linebreak locations of the first input text in preference to those of the second,
i.e. the first input text should represent the target text you are trying
to create.
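To illustrate the linebreak handling (just my sketch of the idea here, not the actual pgdiff source), you can remember which word indexes end a line in the first text and then re-wrap the merged word stream onto those same boundaries:

    # Sketch only -- illustrates retaining the first text's linebreak
    # locations; pgdiff itself may do this differently.
    def words_with_breaks(text):
        """Return the word list plus the set of word indexes that end a line."""
        words, line_ends = [], set()
        for line in text.splitlines():
            words.extend(line.split())
            if words:
                line_ends.add(len(words) - 1)   # last word on this line
        return words, line_ends

    def rewrap(words, line_ends):
        """Re-join words, breaking wherever the first text originally broke."""
        out, line = [], []
        for i, w in enumerate(words):
            line.append(w)
            if i in line_ends:
                out.append(" ".join(line))
                line = []
        if line:
            out.append(" ".join(line))
        return "\n".join(out)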
Because it retains the first text’s linebreaks, it can also be used for “versioning” –
for example, using a copy of a PG text of one version or edition to
help fix and create a text of a different version or edition.
It can also be used to recover linebreak information where
it has been lost, for example taking an older PG text and
recovering its linebreaks so that the text can be resubmitted
to DP for a clean-up pass.
In normal mode, when it finds a mismatch it outputs the mismatch
like this { it’ll | it’11 } within the body of the text, so that with
a regex-compatible editor it is very quick to search for and fix the errors
found.
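With that format, a single search pattern finds every flagged pair; for instance (my own illustration, and the output file name is just assumed):

    import re

    # Hypothetical example: jump to each { left | right } marker that pgdiff
    # embeds in its output.  "merged.txt" is an assumed file name.
    marker = re.compile(r"\{[^{}|]*\|[^{}|]*\}")
    with open("merged.txt", encoding="utf-8") as f:
        for m in marker.finditer(f.read()):
            print(m.group(0))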
As BB says, once you have tried this approach, the manual approach of
trying to spot errors visually seems pretty painful and silly.
I find that comparing on a word basis rather than a
line basis makes it quicker and easier to fix the errors in general.
You do want to do some regex punctuation normalization on the two OCRs
before running the tool, to remove the trivial differences and
cut down on the number of trivial mismatches it finds that you then have to fix.
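Something along these lines works as a pre-pass (my sketch, not part of pgdiff): map curly quotes and dashes to plain ASCII and collapse runs of whitespace, so the two OCRs don’t differ on trivia.

    import re

    # Sketch of a punctuation-normalization pre-pass -- adjust to taste.
    SUBS = [
        (r"[\u2018\u2019]", "'"),    # curly single quotes -> straight apostrophe
        (r"[\u201C\u201D]", '"'),    # curly double quotes -> straight quote
        (r"[\u2013\u2014]", "--"),   # en/em dash -> double hyphen
        (r"[ \t]+", " "),            # collapse runs of spaces and tabs
    ]

    def normalize(text):
        for pattern, repl in SUBS:
            text = re.sub(pattern, repl, text)
        return text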
Source and a compiled Windows version are at http://www.freekindlebooks.org/Dev/StringMatch
It is based on traditional Levenshtein distance, where the token
is taken to be the non-whitespace part of a “word”, as opposed to
measuring distances between lines of text or between individual characters.
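In outline (again, a sketch of the general technique rather than the pgdiff source), the distance is computed over word tokens instead of characters, and backtracking over the same table gives the word alignment used to flag mismatches:

    def word_levenshtein(a_words, b_words):
        """Levenshtein distance where each token is a whitespace-delimited word."""
        prev = list(range(len(b_words) + 1))
        for i, a in enumerate(a_words, 1):
            cur = [i]
            for j, b in enumerate(b_words, 1):
                cost = 0 if a == b else 1
                cur.append(min(prev[j] + 1,          # delete a word from text 1
                               cur[j - 1] + 1,       # insert a word from text 2
                               prev[j - 1] + cost))  # match or substitute
            prev = cur
        return prev[-1]

    # e.g. word_levenshtein("it’ll be fine".split(), "it’11 be fine".split()) == 1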