Re: [gutvol-d] Improving the PG library

24 Sep 2012

      On 2012-09-23, David Starner wrote:
...
I don't see any value in that. Scans are a pain, and the only saving
grace is that they accurately represent a physical printed edition.
The value in the MS is in its unequivocal nature. It says: "After much
discussion amongst knowledgeable people, this, and only this, is what we
at PG consider to be a version of this book. Everything else we publish
is a derivation of this. If they don't match this, they are in error."

Example 1: As a PG customer I find something in an ebook that doesn't
look right. I want to contribute. I want to check it, and if it is
wrong, report it. Off I go to PG and I find... nothing. If I'm really
keen to help, I go to TIA and I find 50 versions, and I have no idea
what is the correct version. Oh well...

Example 2: We propose redoing "Pride and Prejudice". Without an MS we
immediately hit the buffers. What edition is the extant text? What scan
was used?  What edition should we use? Is there a scan we can use
_anywhere_? What is the copyright situation of the 1923 version? Should
we go for a first edition or the latest one we can get copyright
clearance for? With the MS: throw the MS at the DP OCR pool and off we
go.
...
What's gained by mangling the scans instead of recording typos
externally to them?
In example 1, I am in luck. PG has been adding definitive scans to the
PG archives and Marcello has done a really cool interface. I note a
difference between my version and the scan, so I report it... and the WW
writes back saying that it is not actually a difference: it is just a
typo that was corrected in the text and the definitive scan is actually
the thing that is incorrect. So much for the definitive scan...
...
As for universal, the image is not necessarily capable of encoding all
formatting nuances.
...
... fur (plenty of 20th/21st century examples for babies)...
Ah... so that's what the Kindle touch is about -- fur capability. :-)
...
Anything that requires you to look at every word on every page, which
italics and bold do, is time-consuming. And perhaps as important as
italics is superscript; 10<sup>30</sup> changes quite a bit when it
comes back as 1030.
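
To make the superscript point concrete, here is a minimal sketch in
Python. The function names are hypothetical, and the caret notation is
only an assumed plain-text convention for illustration, not anything PG
or DP prescribes. It compares naively stripping markup with rewriting
the superscript before stripping:

```python
import re

def strip_tags(text):
    """Naively drop every tag, keeping only the text between them."""
    return re.sub(r"<[^>]+>", "", text)

def flatten_sup(text):
    """Rewrite <sup>x</sup> as ^x, then drop any remaining tags.

    The caret form is an assumed convention, used here only to show
    that the exponent can survive in plain text.
    """
    text = re.sub(r"<sup>(.*?)</sup>", r"^\1", text)
    return strip_tags(text)

sample = "10<sup>30</sup>"
print(strip_tags(sample))   # 1030
print(flatten_sup(sample))  # 10^30
```

The first result has silently become a different number; the second
keeps the exponent visible without needing any markup.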
There are things that can help here, but I want to keep my sights for a
core format set as low as possible. I would re-iterate that my aim is
only to produce a foundation. You can build whatever you like on that
foundation, and some sort of optional formatting methodology would
likely be one of those things. For the books that I turn into Kindle
PDFs, you will have a micro-formatting highlighting overlay
available. If Bowerbird does some books you will have ZML. You may have
Don's Canonical Starting Point. You may have TEI, RST, Docbook or
LaTeX. If you don't like anything that is available, you will still
always have the option to go back to the MS and RTT and do your own
thing, even if that takes a little longer.

The main point of RTT is that the phrase "let's make
(X)HTML/TEI/RST/LaTeX/ZML (delete as appropriate) the PG master format"
_will_, quite rightly, start an unproductive flame war. RTT is inferior
to all these things by design, so that any of these things can be built
upon it. Your choice once you have, say, a ZML version available, might
be to base your work on a transformation from ZML, but the RTT will
hopefully have saved the person doing the ZML version a heap of work.

Cheers

Jon

Jon Hurst