
On Sun, Sep 23, 2012 at 2:55 AM, Jon Hurst <jon.a@hursts.eclipse.co.uk> wrote:
> The MS is not supposed to be the original; it is supposed to be PG's definitive version.
I don't see any value in that. Scans are a pain, and the only saving grace is that they accurately represent a physical printed edition.
> I agree that changes to the MS would have to be very carefully researched and controlled, and there may well be some benefit in storing the actual original alongside the MS if any changes of this nature are made.
What's gained by mangling the scans instead of recording typos externally to them?
I define "universal master format" as a format capable of encoding any and all characters and any and all formatting nuances. There are lots of "master formats" available, each capable of encoding a subset of these things, but only the image itself is "universal".
I would define "master format" as one that can be effectively used to derive other formats from. As for universal, the image is not necessarily capable of encoding all formatting nuances. For properties of physical books, scans can't handle transparencies (which I've had in a book I was thinking about scanning), paper changes, metallic inks (one illustration had an EETS book printed for it, but the metallic ink didn't reproduce well in my scans), fur (plenty of 20th/21st century examples for babies), mirrors (ditto), holes (again, ditto), or pop-ups (there's a beautiful 19th century edition of Euclid with them, for example). For properties of the text, it does not show line breaks at tops of pages (ambiguous in poetry) and it obscures spellings at end of lines (is it spelled to-night or tonight?). It's powerful, but not unlimited.
> This is true, although I wouldn't use the term "reproofread" as that implies checking the words and punctuation again (the time-consuming bit), which is exactly what I want to avoid.
Anything that requires you to look at every word on every page is time-consuming, and checking italics and bold does. And perhaps as important as italics is superscript; 10<sup>30</sup> changes quite a bit when it comes back as 1030.
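
To make the superscript point concrete, here's a rough sketch (my own throwaway code, not anything from DP's toolchain) of why that conversion is so easy to get wrong: naively stripping tags is exactly what turns 10<sup>30</sup> into 1030, and a converter has to translate the markup rather than drop it.

    import re

    marked_up = "about 10<sup>30</sup> atoms"

    # Naive tag stripping: the superscript distinction is silently lost.
    print(re.sub(r"</?sup>", "", marked_up))               # about 1030 atoms

    # Keeping the nuance means translating, not stripping (caret
    # notation here is just my arbitrary choice of plain-text marker).
    print(re.sub(r"<sup>(.*?)</sup>", r"^\1", marked_up))  # about 10^30 atoms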
> The point is that if you started with a scan of that complex book with the intention of producing an ebook using your markup language of choice, you would likely at some point in the process have something very similar to the RTT.
Not necessarily. If I were working on a book, I would format as I went along. Yes, DP has found that for its purposes it works better to separate proofreading from formatting, but I don't think that's what most people working on a book alone would do. Nor do most people make a line-for-line copy; without external systems like DP, it's easier to enter the text as paragraphs, each ended by a newline.
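
As a purely illustrative sketch of the difference (assuming blank lines mark the paragraph breaks, which is how I'd type it in): flowing a line-for-line transcription into paragraphs-ended-by-newlines is mechanical, which is part of why the line breaks don't seem worth keeping when you're working alone.

    def flow_paragraphs(line_for_line_text):
        # Join hard-wrapped lines into one paragraph per line, treating
        # blank lines as paragraph breaks.
        paragraphs, current = [], []
        for line in line_for_line_text.splitlines():
            if line.strip():
                current.append(line.strip())
            elif current:
                paragraphs.append(" ".join(current))
                current = []
        if current:
            paragraphs.append(" ".join(current))
        # Note this can't decide the to-night vs. tonight question from
        # above on its own; end-of-line hyphens still need a human.
        return "\n".join(paragraphs)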
> If someone else then decided that the book would be better in their markup language of choice they would also at some stage have something very similar to the RTT, but they would have had to repeat all the work that you did because they wouldn't have had access to your RTT.
I don't see why the RTT is the ideal level for that, though. For any large book, I'd rather have the TEI version than the RTT--even if you lock me in a cave without Internet and I have to figure out the TEI format by guesswork and write my own XML converters. The RTT makes me figure out what does and doesn't need rewrapping and redo all the microformatting, stuff that could be pulled automatically out of even the most stupid HTML (a quick sketch of what I mean is below). (Okay, so it would be a royal PIA to pull it out of "smart" HTML. Still, for a sufficiently large work, it'd be worth it.) If I have an RTT version of the text, and an HTML version or TEI version or sufficiently smart structured-text version, and I wanted to make a version in TeX or whatever, I'd start with the smart version instead of the RTT. Anything short of PostScript or PDF is going to be better than the RTT.

--
Kie ekzistas vivo, ekzistas espero.
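
To illustrate "pulled automatically out of even the most stupid HTML", here's a purely illustrative sketch (my own throwaway code, not any real PG or DP converter): the Python standard-library parser is enough to recover italics, bold, and superscripts from plain presentational markup.

    from html.parser import HTMLParser

    class Microformat(HTMLParser):
        # Map presentational tags to plain-text markers; the markers are
        # just my arbitrary choice, nothing standard.
        MARKS = {"i": "_", "em": "_", "b": "*", "sup": "^"}

        def __init__(self):
            super().__init__()
            self.out = []

        def handle_starttag(self, tag, attrs):
            if tag in self.MARKS:
                self.out.append(self.MARKS[tag])

        def handle_endtag(self, tag):
            if tag in self.MARKS:
                self.out.append(self.MARKS[tag])

        def handle_data(self, data):
            self.out.append(data)

    p = Microformat()
    p.feed("They sailed on the <i>Titanic</i>, all 10<sup>30</sup> of them.")
    print("".join(p.out))
    # They sailed on the _Titanic_, all 10^30^ of them.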