
On Sat, Sep 22, 2012 at 2:11 AM, <jon.a@hursts.eclipse.co.uk> wrote:
(typos in the original can be corrected directly in the MS if required).
This is practically a deal-breaker for me. "Typos" have been fixed in books uploaded to PG that weren't typos to begin with. Vandalizing the master scans, too? It's easy to make mistakes when correcting the original; you certainly don't want to engrave your mistakes into the original images.
Step 2 is creation of a reference text transformation (RTT) from the master scan. The RTT is a line-by-line, pixel-to-UTF-8 transformation without wrapping or additional markup.
I don't know why Roger Frank claims UTF-8 can't represent multiple spaces; as a superset of ASCII, it does so in exactly the same way ASCII plain text does, and it also offers a variety of narrower and wider spaces if you want to go that way. (I think any approach here is problematic, as the original typesetters did not have a fixed-width space character; they instead had the concepts of aligning A to B and of adding more space (not more spaces) here.)
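To make the point about spaces concrete, here is a minimal sketch in Python. The helper names `same_bytes` and `describe_spaces` are made up for illustration; the code only shows that a run of ordinary spaces encodes to identical bytes in ASCII and UTF-8, and lists a few of the other space characters Unicode defines.

```python
import unicodedata

def same_bytes(text):
    """True if pure-ASCII text encodes to identical bytes in ASCII and UTF-8."""
    return text.encode("ascii") == text.encode("utf-8")

# A few Unicode space characters (an illustrative selection, not a complete list).
SPACES = ["\u0020", "\u00a0", "\u2002", "\u2003", "\u2009", "\u200a"]

def describe_spaces():
    """Return (code point, Unicode name, UTF-8 byte length) for each space."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), len(ch.encode("utf-8")))
        for ch in SPACES
    ]

if __name__ == "__main__":
    # Four consecutive spaces survive unchanged in UTF-8.
    print(same_bytes("A    B"))
    for code_point, name, size in describe_spaces():
        print(code_point, name, size)
```

None of this settles how typeset spacing ought to be transcribed; it only shows that the encoding itself is not the obstacle.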
The immediate benefit is that the RTT can be used to produce a comprehensive list of errata for its extant text, thus allowing the WW to eliminate the vast majority of errors in one hit.
If there is such an issue, that is. I redid an old work for its illustrations and found there was but one error in the original, one that DP missed on the second round through, too.
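For what it's worth, the comparison the quoted proposal describes could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `errata_candidates` is hypothetical, both texts are assumed to be available as plain strings, and the comparison is done word by word (using the standard `difflib` module) because the RTT keeps the original line breaks while the extant text is usually re-wrapped.

```python
import difflib

def errata_candidates(rtt_text, extant_text):
    """List word-level differences between the RTT and the extant text.

    Splitting on whitespace ignores line breaks, so re-wrapping alone
    does not show up as a difference.
    """
    rtt_words = rtt_text.split()
    extant_words = extant_text.split()
    matcher = difflib.SequenceMatcher(a=rtt_words, b=extant_words, autojunk=False)
    candidates = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # (kind of change, words in the RTT, words in the extant text)
            candidates.append(
                (tag, " ".join(rtt_words[i1:i2]), " ".join(extant_words[j1:j2]))
            )
    return candidates

if __name__ == "__main__":
    rtt = "The quick brown fox\njumps over the lazy dog"
    extant = "The quick brown fox jumps ovcr the lazy dog"
    print(errata_candidates(rtt, extant))
```

A list like this only flags disagreements between the two texts; it cannot say which side is wrong, and it sees nothing that plain text does not carry.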
The future benefit is that the MS is a perfect universal master format:
I don't think that word means what you think it means. In any case, in the post-Google Books world, scans are easy to come across for many of the works we do.
it does not include any introduced errors and it codes every formatting nuance. All transformations from MS to either intermediate master formats (TEI, RST, ZML, LaTeX etc.) or final formats (HTML, PDF, epub, mobi etc.) initially follow much the same path, and the RTT represents a fairly late divergence point.
I don't understand; the RTT misses out on a huge amount of important information. Italics can have a large impact on meaning at times, for one.
The point is you end up with the foundations to do interesting work without having to first do a whole lot of boring OCRing and proofreading
No; to turn the RTT into anything, you'll have to re-proofread the book to catch italics and the rest of the formatting. Toss a few sidenotes in there and, besides just catching stuff, you'd spend a lot of time separating them out from the surrounding text. Two-column material? Trees?
Additional benefits: both the MS and RTT are usable ebook formats in and of themselves,
The RTT is not good; you've thrown away important information. The MS adds nothing to what IA or Google Books offers.
and the combination will allow a pretty nifty errata system to be written, whereby a reader types in a suspect phrase, gets taken to the line in the MS where that phrase is found, and can deliver the errata by simply clicking on the line and clicking a "Please Check" button.
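The lookup half of that errata idea might look something like the sketch below. It is a guess at one possible approach, not a description of any existing system: `find_phrase` is a hypothetical name, the RTT is assumed to be a list of strings with one entry per scanned line, and matching is a plain case-insensitive word comparison that allows a phrase to run across a line break.

```python
def find_phrase(rtt_lines, phrase):
    """Return the 1-based RTT line numbers on which the phrase starts.

    Words are compared case-insensitively and exactly as written, so
    attached punctuation counts as part of a word.
    """
    # Flatten the text into (word, line number) pairs.
    words = []
    for number, line in enumerate(rtt_lines, start=1):
        for word in line.split():
            words.append((word.lower(), number))

    target = phrase.lower().split()
    hits = []
    if not target:
        return hits
    for i in range(len(words) - len(target) + 1):
        window = [word for word, _ in words[i:i + len(target)]]
        if window == target:
            hits.append(words[i][1])
    return hits

if __name__ == "__main__":
    lines = ["It was the best of", "times, it was the worst", "of times"]
    print(find_phrase(lines, "worst of times"))
```

Because the RTT is line by line, a returned line number would point at the matching line of the scan, which is where a "Please Check" button could attach.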
You can't errata italics or many other important parts of the book. It's not unuseful, but it's hardly complete.
--
Where there is life, there is hope.