On 1/24/2012 10:01 AM, Jim Adcock wrote:
This combination of paragraphs doesn't make sense.
I think if you parsed it carefully it would.
Greg>Having numerous formats derived from a single master is a long-time goal.
Yes, there are many individuals at PG who have had this goal. In the past there have been institutional impediments to the goal, and much of the past effort to achieve the goal can best be characterized as "routing around damage." This is one of the reasons that the current mechanism is so convoluted.
Greg>We've had some success with RST and TEI, and I've encouraged new projects to consider RST.
I think it's indisputable that PG has had /some/ success with RST and TEI. The success is more along the lines of a proof of concept than of production-ready code, but it is there nonetheless.
Greg>There are still some limitations, though...
In my view, the biggest limitation of RST is the difficulty of producing it. While more mature, RST suffers from the same main drawbacks as BowerBird's s.m.l.: it requires the producer to have a strong understanding of subtle markup rules; the distinction between markup and content is not obvious and is therefore easily confounded; and there is no automated way to detect markup errors. Of course, both of these formats are susceptible to a flawed implementation of the tool chain that produces the other formats, but this potential exists no matter what format is chosen. It is a mistake to equate flawed tools with a flawed format.
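To illustrate that last point, here is a small sketch (the sample text and the spacing slip are invented for the example) of how a subtle RST mistake can pass through docutils without complaint: a stray space after an opening asterisk quietly turns intended emphasis into literal asterisks.

    # Sketch: a misplaced space defeats emphasis, and docutils stays silent.
    from docutils.core import publish_string

    intended = "The word *villain* is emphasised here.\n"
    slipped  = "The word * villain* is emphasised here.\n"  # stray space after '*'

    for label, source in (("intended", intended), ("slipped", slipped)):
        html = publish_string(source, writer_name="html").decode("utf-8")
        print(label, "-> emphasis rendered:", "<em>" in html)

As far as I can tell the second string comes back with the asterisks as literal text and no warning at all; a producer would only catch the slip by proofreading the generated output.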
Greg>On the many, many words on gutvol-d recently about poor results with auto-conversion from HTML to other formats (epub and mobi, among others): this is often due to choices that producers make about using HTML to impact layout, rather than just structure.
What Mr. Newby is talking about here is /not/ the HTML that results from Mr. Perathoner's design decisions, but the HTML posted by volunteers, primarily of late by DP. It is certainly true that bad decisions by volunteers can make it hard to use HTML as a master format, although to be fair bad decisions by volunteers can make RST and TEI hard to use as well. HTML is simply more likely to be flawed than RST or TEI, because anyone sophisticated enough to use one of those two formats probably knows enough to use it correctly, whereas anyone with Microsoft Word or Adobe Dreamweaver /thinks/ s/he knows how to use HTML, and often does not. Should RST gain the same popularity as HTML, I'm sure it would be just as problematic, if not more so. But because HTML is the source for all e-book file formats, that is where the focus should be. The solution (see the sketch below)?

1. Define what constitutes good HTML.
2. Judge the quality of conversion tools by how well they satisfy 1.
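To make point 1 concrete, here is a rough sketch of what an automated "good HTML" check could look like. The rules it enforces (no presentational tags, no layout done through inline styles) are my own illustrative placeholders, not any agreed PG policy, and it assumes the lxml library is available:

    # Sketch of a "good HTML" checker.  The rules below are placeholders;
    # the point is that whatever the definition is, it should be executable.
    import sys
    from lxml import html

    PRESENTATIONAL_TAGS = {"font", "center", "big", "small", "u", "strike"}

    def check_html(path):
        problems = []
        for el in html.parse(path).getroot().iter():
            if not isinstance(el.tag, str):
                continue  # skip comments and processing instructions
            if el.tag in PRESENTATIONAL_TAGS:
                problems.append("presentational tag <%s>" % el.tag)
            style = el.get("style") or ""
            if "position:" in style or "float:" in style:
                problems.append("layout via inline style on <%s>" % el.tag)
        return problems

    if __name__ == "__main__":
        for problem in check_html(sys.argv[1]):
            print(problem)

Point 2 then becomes measurable: run the conversion tools over HTML that passes the check and see how well the output holds up.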
I recently posted in this forum an analysis, based on a request from this forum, that indicated that RST in practice *is not* working as a "single master." To wit, the EPUB and MOBI being generated from RST are not particularly successful, possibly no more so than the DP "HTML" efforts.
Yes, /in practice/. So it seems that the reforms that need to be made are reforms to the /practice/. Mr. Perathoner has an advantage over all the rest of us in that he is the only one with access to the PG servers. Thus, he pretty much gets to do what he wants, and the tool chain largely reflects his tastes and biases. If you want any improvements to be made to the PG tool chain /in practice/, you effectively need to convince Mr. Perathoner, and no one else, that the improvements need to be made. As you may have noticed, Mr. Perathoner is as prickly as BowerBird, so suggestions need to be made with much more subtlety, tact and finesse than I am capable of.

On the other hand, it may be possible to take a page from PG's book and route around the damage. I've looked at the HTML output you've provided from PG and I haven't seen anything that can't be repaired. It should be possible to build a web interface that sits in front of PG, forwards requests, rewrites the HTML to meet industry standards, then either delivers /that/ HTML or compiles it into ePub or .mobi. I don't know that I have the time to build anything like that (I'm not particularly committed to the future of Project Gutenberg) but it would be interesting to see how much interest there is in such a project; a rough sketch of the idea follows at the end of this message.

On what is perhaps an unrelated note, is anyone capturing the output of Distributed Proofreaders and transferring it to the Internet Archive before it gets degraded for Project Gutenberg? Are there any private archives at Distributed Proofreaders that could be transferred as well?
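For what it's worth, here is a very rough sketch of the "sit in front of PG" idea, assuming Flask, requests and BeautifulSoup are available. The upstream URL pattern and the cleanup rules are illustrative guesses, and a real rewriter would need far more care (plus the ePub/.mobi compilation step):

    # Sketch of a rewriting proxy in front of PG.  The cleanup rules here
    # are placeholders for whatever "industry standard HTML" ends up meaning.
    from flask import Flask, Response, abort
    import requests
    from bs4 import BeautifulSoup

    app = Flask(__name__)
    PG_BASE = "https://www.gutenberg.org"  # assumed upstream; adjust to taste

    def clean(markup):
        soup = BeautifulSoup(markup, "html.parser")
        for tag in soup.find_all(["font", "center"]):
            tag.unwrap()                    # drop purely presentational wrappers
        for tag in soup.find_all(True):
            tag.attrs.pop("style", None)    # push layout decisions into a stylesheet
            tag.attrs.pop("align", None)
        return str(soup)

    @app.route("/ebooks/<path:rest>")
    def proxy(rest):
        upstream = requests.get("%s/ebooks/%s" % (PG_BASE, rest), timeout=30)
        if upstream.status_code != 200:
            abort(502)                      # bad gateway: pass the failure along
        return Response(clean(upstream.text), mimetype="text/html")

    if __name__ == "__main__":
        app.run(port=8080)

Nothing in this needs access to the PG servers, which is the whole point of routing around the damage.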