Re: [gutvol-d] Improving the PG library

22 Sep 2012


On Sat, Sep 22, 2012 at 2:11 AM,  <jon.a@hursts.eclipse.co.uk> wrote:
...
(typos in the original can be corrected
directly in the MS if required).
This is practically a deal-breaker for me. "Typos" have been fixed in
books uploaded to PG that weren't typos in the first place. Vandalizing the master
scans, too? It's easy to make mistakes in correcting the original; you
certainly don't want to engrave your mistakes into the original
images.
...
Step 2 is creation of a reference text transformation (RTT) from the
master scan. The RTT is a line by line pixel to UTF-8 transformation
without wrapping or additional markup.
I don't know why Roger Frank claims UTF-8 can't represent multiple
spaces; as a superset of ASCII, it does so in the exact same way ASCII
plain text does, as well as offering a variety of smaller and larger
spaces if you want to go that way. (I think anything here is
problematic, as the original typesetters did not have a fixed-width
space character; instead they had the concepts of aligning A to B and
adding more space (not more spaces) here.)
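To make the point about spaces concrete, here is a small Python sketch. The sample line is made up for illustration; the only claims are that a run of ASCII spaces encodes to the same bytes in UTF-8 as in ASCII, and that Unicode names several spaces of other widths.

```python
import unicodedata

# A made-up line with a run of five ordinary spaces (U+0020).
line = "Chapter I.     The Beginning"

# For pure-ASCII text, the UTF-8 bytes and the ASCII bytes are identical,
# so repeated spaces are stored exactly as they are in ASCII plain text.
utf8_bytes = line.encode("utf-8")
ascii_bytes = line.encode("ascii")
same_bytes = utf8_bytes == ascii_bytes

# Decoding gives back the original line, spaces included.
round_trip = utf8_bytes.decode("utf-8")

# A few of the wider and narrower space characters Unicode defines.
other_spaces = ["\u2002", "\u2003", "\u2009", "\u200a"]
space_names = [unicodedata.name(ch) for ch in other_spaces]

print(same_bytes)
print(space_names)
```

None of this addresses the typesetting problem in the parenthesis above; it only shows that the encoding itself is not the obstacle.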
...
The immediate benefit is that the RTT can be used to produce a
comprehensive list of errata for its extant text, thus allowing the WW
to eliminate the vast majority of errors in one hit.
That assumes there is such an issue. I redid an old work for
illustrations, and found there was but one error in the original--one
that DP missed on the second round through, too.
...
The future benefit is that the MS is a perfect universal master
format:
I don't think that word means what you think it means. In any case, in
the post-Google Books world, scans are easy to come across for many of
the works we do.
...
it does not include any introduced errors and it codes every
formatting nuance. All transformations from MS to either intermediate
master formats (TEI, RST, ZML, LaTeX etc.) or final formats (HTML,
PDF, epub, mobi etc.) initially follow much the same path, and the RTT
represents a fairly late divergence point.
I don't understand; the RTT misses out on a huge amount of important
information. Italics can have a large impact on meaning at times, for
one.
...
The point is you end up with the
foundations to do interesting work without having to first do a whole
lot of boring OCRing and proofreading
No; to turn the RTT into anything, you'll have to reproofread the
book, to catch italics and the rest of the formatting. Toss a few
sidenotes in there, and besides just catching stuff, you'd spend a lot
of time separating them out from the surrounding text. Two column
material? Trees?
...
Additional benefits: both the MS and RTT are usable ebook formats in
and of themselves,
The RTT is not good; you've thrown away important information. The MS
adds nothing to what IA or Google Books offers.
...
and the combination will allow a pretty nifty
errata system to be written, whereby a reader types in a suspect
phrase, gets taken to the line in the MS where that phrase is found,
and can deliver the errata by simply clicking on the line and clicking
a "Please Check" button.
You can't report errata for italics or many other important parts of
the book this way. It's not useless, but it's hardly complete.
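For what it's worth, the lookup half of the proposed errata system is simple enough to sketch in Python. This is only an illustration under my own assumptions: the RTT is taken to be a list of plain-text lines whose positions match the lines of the scan, and `find_phrase` and `sample_rtt` are hypothetical names, not part of any existing PG tool.

```python
def find_phrase(rtt_lines, phrase):
    """Return 1-based numbers of the RTT lines that contain the phrase.

    Matching ignores case and collapses runs of whitespace. A phrase
    that spans a line break in the RTT is not found by this sketch.
    """
    needle = " ".join(phrase.split()).lower()
    if not needle:
        return []
    hits = []
    for number, text in enumerate(rtt_lines, start=1):
        normalized = " ".join(text.split()).lower()
        if needle in normalized:
            hits.append(number)
    return hits


# Made-up sample text with a deliberate error on the second line.
sample_rtt = [
    "It was the best of times,",
    "it was the worst of tmies,",
    "it was the age of wisdom,",
]

hits = find_phrase(sample_rtt, "worst of tmies")
print(hits)
```

Because the input is plain text, a search like this can only locate errors in the characters themselves; it has nothing to match against for italics or other formatting.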

-- 
Kie ekzistas vivo, ekzistas espero.


David Starner