Re: [gutvol-d] Summary: Improving PG library

25 Sep 2012

      ...
According to Bowerbird, Step 3 is possible by simply diffing a
   fresh OCR of an MS against the extant version: most of the errors
   highlighted would be in the OCR, but it should also highlight the
   majority of the errors in the extant text. If this is correct, DP
   does not need to be involved.
See http://www.freekindlebooks.org/Dev/HuckDiff.txt for an example of this
kind of diff, performed on the end results after removing scannos.

Note that this is an example of "cross-diffing" in that 76 and 32325 both
have provenance and are acknowledged as coming from different editions.

When one does such a diff from a new scan one finds not only these "real"
differences but also a much larger set of scannos which need to be fixed.
This proceeds relatively quickly and easily [compared to DP], however, if one
has created a PDF/A or DjVu file as part of the OCR process, which allows
one to search on the text in question and so go directly to the section of
the scan image one needs to compare against.
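
For anyone who wants to try the comparison described above, here is a minimal
sketch using Python's standard difflib. It is not the tool used to produce
HuckDiff.txt; the function name word_diff and the sample strings are made up
for illustration. It compares word by word, so differences in line wrapping
between the extant text and the fresh OCR do not show up as changes.

```python
import difflib


def word_diff(extant: str, fresh_ocr: str):
    """Return (tag, extant_words, ocr_words) for every span where the texts differ.

    Hypothetical helper: both texts are split on whitespace first, so
    re-wrapped lines compare as equal and only word-level changes remain.
    """
    a = extant.split()
    b = fresh_ocr.split()
    # autojunk=False stops difflib from ignoring very frequent words
    # ("the", "and") when the sequences are long.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


if __name__ == "__main__":
    # Made-up sample strings, not taken from any PG text or scan.
    extant = "the quick brown fox\njumps over the lazy dog"
    fresh = "the quick hrown fox jumps over the lazy dog"
    for tag, old, new in word_diff(extant, fresh):
        print(f"{tag}: {old!r} -> {new!r}")
```

Each reported span still has to be checked against the scan image to decide
whether the error is in the OCR or in the extant text.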

I think the most important take-away from this diff is that one should NOT
assume that the golden-moldy texts are in good shape, nor should one assume
that the passage of time and having hundreds of thousands of people reading
a PG offering results in actually correcting the mistakes that are in the
golden-moldy texts.  It doesn't.  On the contrary, the golden-moldies are
more likely to be type-ins -- subject to the vagaries of the human mind --
and are more likely to come from texts of lower-quality provenance.  And
these golden-moldies are what customers download and read the most.

PS: Take a good hard look at http://www.gutenberg.org/files/76/76-h/76-h.htm
if you don't believe these diffs represent "real errors!"

James Adcock