
Back to RL work tomorrow, so I thought I'd summarise where this discussion stands, at least from my point of view:

(1) It is recognised by all concerned that a problem exists with the accuracy and/or typography of PG's top downloaded books. There is a desire to do something about this from Greg and at least a number of members of this list.

(2) The first vital step towards doing something is to link each title to a scan set. We do not know the editions of the extant texts, so choosing a scan set means choosing an edition. For new entries to the library this is the prerogative of the creator; how to choose one for the updates has not yet been answered. I would prefer it done by someone, or a group of someones, nominated by Greg with reference to the PG community at large. As I do not have the required level of knowledge to make executive decisions about editions, I can be of no help here.

(3) Once master scans are uploaded, it will be possible to make some progress. There would be no point producing final derivative versions that would only be buried by the current version. What _can_ be done is to create a clean text by whatever means and diff it against the extant text, thereby producing a comprehensive errata list for the WWs. This would at least take care of the accuracy side of things. (A sketch of this diff step follows below.)

(4) According to Bowerbird, step (3) is possible by simply diffing a fresh OCR of a master scan against the extant version: most of the errors highlighted would be in the OCR, but it should also highlight the majority of the errors in the extant text. If this is correct, DP does not need to be involved.

(5) The typography side of things cannot be addressed until a method is devised to determine whether a redo is superior to the current version of the same format, and should therefore replace it. The current method is simply to publish both versions, meaning that from a user's perspective the current version always appears superior to new versions, so there is no incentive to rework anything.

I am still keen to help with this work, and will be happy to attempt (3) for a pilot of 10 books over the winter when I have some time off. (2) is, however, outside my control, and nothing can be done until it is complete. I am also happy to create LaTeX versions of these books if (5) is resolved.

Cheers,
Jon
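
As a rough illustration of the diff described in (3) and (4), the sketch below performs a word-level diff of a fresh OCR transcript against the extant PG text, using Python's standard difflib. The filenames are hypothetical, and sorting OCR scannos from genuine errors in the extant text remains a manual step against the scan images.

    import difflib
    import re

    def tokens(path: str) -> list[str]:
        """Split a text into word tokens so the diff ignores line wrapping."""
        return re.sub(r"\s+", " ", open(path, encoding="utf-8").read()).split()

    ocr = tokens("fresh_ocr.txt")      # hypothetical: raw OCR of the master scans
    extant = tokens("pg_extant.txt")   # hypothetical: extant PG text of the same book

    # Each differing run of words is either an OCR scanno or a candidate
    # error in the extant text; print it with a little leading context.
    matcher = difflib.SequenceMatcher(None, ocr, extant, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            context = " ".join(ocr[max(0, i1 - 4):i1])
            print(f"{tag:>8}: ...{context} [{' '.join(ocr[i1:i2])}] -> [{' '.join(extant[j1:j2])}]")

Everything the matcher does not mark "equal" becomes a line in the errata candidate list for the WWs.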

According to Bowerbird, Step 3 is possible by simply diffing a fresh OCR of an MS against the extant version: most of the errors highlighted would be in the OCR, but it should also highlight the majority of the errors in the extant text. If this is correct, DP does not need to be involved.
See http://www.freekindlebooks.org/Dev/HuckDiff.txt for an example of this kind of diff performed on the end results after removing scannos. Note that this is an example of "cross-diffing", in that 76 and 32325 both have provenance and are acknowledged as coming from different editions.

When one does such a diff from a new scan, one finds not only these "real" differences but also a much larger set of scannos which need to be fixed. This proceeds relatively quickly and easily, however [compared to DP], if one has created a PDF/A or DjVu file as part of the OCR process, which allows one to search the text in question, thereby bringing you directly to the section of the scan image one needs to compare against.

I think the most important take-away from this diff is that one should NOT assume that the golden-moldy texts are in good shape, nor should one assume that the passage of time, and having hundreds of thousands of people reading a PG offering, results in actually correcting the mistakes that are in the golden-moldy texts. It doesn't. On the contrary, the golden-moldies are more likely to be type-ins -- subject to the vagaries of the human mind -- and are more likely to come from texts of lower-quality provenance. And these golden-moldies are what customers download and read the most.

PS: Take a good hard look at http://www.gutenberg.org/files/76/76-h/76-h.htm if you don't believe these diffs represent "real errors"!
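
James's point about clearing scannos before the cross-diff suggests folding typographic variation first, so the diff is not dominated by quote and dash styling. A minimal sketch, assuming straight-quote/ASCII-dash normalization is acceptable (the variant table here is illustrative, not drawn from the thread):

    # Fold common typographic variants before diffing; extend the table as needed.
    FOLD = str.maketrans({
        "\u201c": '"', "\u201d": '"',    # curly double quotes -> straight
        "\u2018": "'", "\u2019": "'",    # curly single quotes -> straight
        "\u2014": "--", "\u2013": "-",   # em/en dashes -> ASCII equivalents
    })

    def fold_typography(text: str) -> str:
        """Normalize quote and dash styles so a cross-diff surfaces textual
        differences rather than typography noise."""
        return text.translate(FOLD)

Running both files through fold_typography() before tokenizing keeps curly-versus-straight quote hunks out of the errata list, leaving the "real" differences James points to.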