
Back to RL work tomorrow, so I thought I'd summarise where this discussion stands, at least from my point of view:

(1) It is recognised by all concerned that a problem exists with the accuracy and/or typography of PG's top downloaded books. There is a desire to do something about this from Greg and at least a number of members of this list.

(2) The first vital step towards doing something is to link each title to a scan set. We do not know the editions of the extant texts, so choosing a scan set means choosing an edition. For new entries to the library this is the prerogative of the creator; how to choose one for the updates has not yet been answered. I would prefer it done by someone, or a group of someones, nominated by Greg with reference to the PG community at large. As I do not have the required level of knowledge to make executive decisions about editions, I can be of no help here.

(3) Once master scans are uploaded, it will be possible to make some progress. There would be no point producing final derivative versions that would only be buried by the current version. What _can_ be done is to create a clean text by whatever means and diff it against the extant text, thereby producing a comprehensive errata list for the WWs. This would at least take care of the accuracy side of things. (A sketch of this diff step follows below.)

(4) According to Bowerbird, step (3) is possible by simply diffing a fresh OCR of a master scan against the extant version: most of the errors highlighted would be in the OCR, but it should also highlight the majority of the errors in the extant text. If this is correct, DP does not need to be involved.

(5) The typography side of things cannot be addressed until a method is devised to determine whether a redo is superior to the current version of the same format, and should therefore replace it. The current method is simply to publish both versions, meaning that from a user's perspective the current version always appears superior to new versions, so there is no incentive to rework anything.

I am still keen to help with this work, and will be happy to attempt (3) for a pilot of 10 books over the winter when I have some time off. (2) is, however, outside my control, and nothing can be done until it is complete. I am also happy to create LaTeX versions of these books if (5) is resolved.

Cheers,
Jon
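
As a rough illustration of the diff described in (3) and (4), the sketch below performs a word-level diff of a fresh OCR transcript against the extant PG text, using Python's standard difflib. The filenames are hypothetical, and sorting OCR scannos from genuine errors in the extant text remains a manual step against the scan images.

    import difflib
    import re

    def tokens(path: str) -> list[str]:
        """Split a text into word tokens so the diff ignores line wrapping."""
        return re.sub(r"\s+", " ", open(path, encoding="utf-8").read()).split()

    ocr = tokens("fresh_ocr.txt")      # hypothetical: raw OCR of the master scans
    extant = tokens("pg_extant.txt")   # hypothetical: extant PG text of the same book

    # Each differing run of words is either an OCR scanno or a candidate
    # error in the extant text; print it with a little leading context.
    matcher = difflib.SequenceMatcher(None, ocr, extant, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            context = " ".join(ocr[max(0, i1 - 4):i1])
            print(f"{tag:>8}: ...{context} [{' '.join(ocr[i1:i2])}] -> [{' '.join(extant[j1:j2])}]")

Everything the matcher does not mark "equal" becomes a line in the errata candidate list for the WWs.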

According to Bowerbird, Step 3 is possible by simply diffing a fresh OCR of an MS against the extant version: most of the errors highlighted would be in the OCR, but it should also highlight the majority of the errors in the extant text. If this is correct, DP does not need to be involved.
See http://www.freekindlebooks.org/Dev/HuckDiff.txt for an example of this kind of diff performed on the end results after removing scannos. Note that this is an example of "cross-diffing", in that 76 and 32325 both have provenance and are acknowledged as coming from different editions.

When one does such a diff from a new scan, one finds not only these "real" differences but also a much larger set of scannos which need to be fixed. This proceeds relatively quickly and easily, however [compared to DP], if one has created a PDF/A or DjVu file as part of the OCR process, which allows one to search the text in question, thereby bringing you directly to the section of the scan image one needs to compare against.

I think the most important take-away from this diff is that one should NOT assume that the golden-moldy texts are in good shape, nor should one assume that the passage of time, and having hundreds of thousands of people reading a PG offering, results in actually correcting the mistakes that are in the golden-moldy texts. It doesn't. On the contrary, the golden-moldies are more likely to be type-ins -- subject to the vagaries of the human mind -- and are more likely to come from texts of lower-quality provenance. And these golden-moldies are what customers download and read the most.

PS: Take a good hard look at http://www.gutenberg.org/files/76/76-h/76-h.htm if you don't believe these diffs represent "real errors"!
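
James's point about clearing scannos before the cross-diff suggests folding typographic variation first, so the diff is not dominated by quote and dash styling. A minimal sketch, assuming straight-quote/ASCII-dash normalization is acceptable (the variant table here is illustrative, not drawn from the thread):

    # Fold common typographic variants before diffing; extend the table as needed.
    FOLD = str.maketrans({
        "\u201c": '"', "\u201d": '"',    # curly double quotes -> straight
        "\u2018": "'", "\u2019": "'",    # curly single quotes -> straight
        "\u2014": "--", "\u2013": "-",   # em/en dashes -> ASCII equivalents
    })

    def fold_typography(text: str) -> str:
        """Normalize quote and dash styles so a cross-diff surfaces textual
        differences rather than typography noise."""
        return text.translate(FOLD)

Running both files through fold_typography() before tokenizing keeps curly-versus-straight quote hunks out of the errata list, leaving the "real" differences James points to.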