On 1/24/2012 10:01 AM, Jim Adcock wrote:
This combination of paragraphs doesn't make sense.
I think if you parsed it carefully it would.
Greg>Having numerous formats derived from a single master is a long-time goal.
Yes, there are many individuals at PG who have had this goal. In the past there have been institutional impediments to the goal, and much of the past effort to achieve the goal can best be characterized as "routing around damage." This is one of the reasons that the current mechanism is so convoluted.
Greg>We've had some success with RST and TEI, and I've encouraged new projects to consider RST.
I think it's indisputable that PG has had /some/ success with RST and TEI. The success is more along the lines of a proof of concept than of production-ready code, but it is there nonetheless.
Greg>There are still some limitations, though...
In my view, the biggest limitation of RST is the difficulty of producing it. While more mature, RST suffers from the same main drawbacks as BowerBird's s.m.l.: it requires the producer to have a strong understanding of subtle markup rules; the distinction between markup and content is not obvious and is therefore easily confounded; and there is no automated way to detect markup errors. Of course, both of these formats are susceptible to a flawed implementation of the tool chain that produces the other formats, but this potential exists no matter what format is chosen. It is a mistake to equate flawed tools with a flawed format.
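To illustrate that last point, here is a small sketch (the sample text and the spacing slip are invented for the example) of how a subtle RST mistake can pass through docutils without complaint: a stray space after an opening asterisk quietly turns intended emphasis into literal asterisks.

    # Sketch: a misplaced space defeats emphasis, and docutils stays silent.
    from docutils.core import publish_string

    intended = "The word *villain* is emphasised here.\n"
    slipped  = "The word * villain* is emphasised here.\n"  # stray space after '*'

    for label, source in (("intended", intended), ("slipped", slipped)):
        html = publish_string(source, writer_name="html").decode("utf-8")
        print(label, "-> emphasis rendered:", "<em>" in html)

As far as I can tell the second string comes back with the asterisks as literal text and no warning at all; a producer would only catch the slip by proofreading the generated output.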
Greg>On the many, many words on gutvol-d recently about poor results with auto-conversion from HTML to other formats (epub and mobi, among others): this is often due to choices that producers make about using HTML to impact layout, rather than just structure.
What Mr. Newby is talking about here is /not/ the HTML that results from Mr. Perathoner's design decisions, but the HTML posted by volunteers, primarily of late by DP. It is certainly true that bad decisions by volunteers can make it hard to use HTML as a master format, although to be fair bad decisions by volunteers can make RST and TEI hard to use as well. HTML is simply more likely to be flawed than RST or TEI, because anyone sophisticated enough to use one of those two formats probably knows enough to use it correctly, whereas anyone with Microsoft Word or Adobe Dreamweaver /thinks/ s/he knows how to use HTML, and often does not. Should RST gain the same popularity as HTML, I'm sure it would be just as problematic, if not more so. But because HTML is the source for all e-book file formats, that is where the focus should be. The solution (see the sketch below)?

1. Define what constitutes good HTML.
2. Judge the quality of conversion tools by how well they satisfy 1.
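To make point 1 concrete, here is a rough sketch of what an automated "good HTML" check could look like. The rules it enforces (no presentational tags, no layout done through inline styles) are my own illustrative placeholders, not any agreed PG policy, and it assumes the lxml library is available:

    # Sketch of a "good HTML" checker.  The rules below are placeholders;
    # the point is that whatever the definition is, it should be executable.
    import sys
    from lxml import html

    PRESENTATIONAL_TAGS = {"font", "center", "big", "small", "u", "strike"}

    def check_html(path):
        problems = []
        for el in html.parse(path).getroot().iter():
            if not isinstance(el.tag, str):
                continue  # skip comments and processing instructions
            if el.tag in PRESENTATIONAL_TAGS:
                problems.append("presentational tag <%s>" % el.tag)
            style = el.get("style") or ""
            if "position:" in style or "float:" in style:
                problems.append("layout via inline style on <%s>" % el.tag)
        return problems

    if __name__ == "__main__":
        for problem in check_html(sys.argv[1]):
            print(problem)

Point 2 then becomes measurable: run the conversion tools over HTML that passes the check and see how well the output holds up.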
I recently posted in this forum an analysis, based on a request from this forum, that indicated that RST in practice *is not* working as a "single master." To wit, the EPUB and MOBI being generated from RST are not particularly successful, possibly no more so than the DP "HTML" efforts.
Yes, /in practice/. So it seems that the reforms that need to be made are reforms to the /practice/. Mr. Perathoner has an advantage over all the rest of us in that he is the only one with access to the PG servers. Thus, he pretty much gets to do what he wants, and the tool chain largely reflects his tastes and biases. If you want any improvements to be made to the PG tool chain /in practice/, you effectively need to convince Mr. Perathoner, and no one else, that the improvements need to be made. As you may have noticed, Mr. Perathoner is as prickly as BowerBird, so suggestions need to be made with much more subtlety, tact and finesse than I am capable of.

On the other hand, it may be possible to take a page from PG's book and route around the damage. I've looked at the HTML output you've provided from PG and I haven't seen anything that can't be repaired. It should be possible to build a web interface that sits in front of PG, forwards requests, rewrites the HTML to meet industry standards, then either delivers /that/ HTML or compiles it into ePub or .mobi. I don't know that I have the time to build anything like that (I'm not particularly committed to the future of Project Gutenberg) but it would be interesting to see how much interest there is in such a project; a rough sketch of the idea follows at the end of this message.

On what is perhaps an unrelated note, is anyone capturing the output of Distributed Proofreaders and transferring it to the Internet Archive before it gets degraded for Project Gutenberg? Are there any private archives at Distributed Proofreaders that could be transferred as well?
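For what it's worth, here is a very rough sketch of the "sit in front of PG" idea, assuming Flask, requests and BeautifulSoup are available. The upstream URL pattern and the cleanup rules are illustrative guesses, and a real rewriter would need far more care (plus the ePub/.mobi compilation step):

    # Sketch of a rewriting proxy in front of PG.  The cleanup rules here
    # are placeholders for whatever "industry standard HTML" ends up meaning.
    from flask import Flask, Response, abort
    import requests
    from bs4 import BeautifulSoup

    app = Flask(__name__)
    PG_BASE = "https://www.gutenberg.org"  # assumed upstream; adjust to taste

    def clean(markup):
        soup = BeautifulSoup(markup, "html.parser")
        for tag in soup.find_all(["font", "center"]):
            tag.unwrap()                    # drop purely presentational wrappers
        for tag in soup.find_all(True):
            tag.attrs.pop("style", None)    # push layout decisions into a stylesheet
            tag.attrs.pop("align", None)
        return str(soup)

    @app.route("/ebooks/<path:rest>")
    def proxy(rest):
        upstream = requests.get("%s/ebooks/%s" % (PG_BASE, rest), timeout=30)
        if upstream.status_code != 200:
            abort(502)                      # bad gateway: pass the failure along
        return Response(clean(upstream.text), mimetype="text/html")

    if __name__ == "__main__":
        app.run(port=8080)

Nothing in this needs access to the PG servers, which is the whole point of routing around the damage.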