
"Greg" == Greg Newby <gbnewby@pglaf.org> writes:
Greg> On Sun, Apr 18, 2010 at 05:05:09PM +0200, Carlo Traverso wrote:
>> Is PG ready to accept ePub as a submission format? (i.e. one
>> submits a valid ePub from which the other formats are derived)?
>> If so, one can target ePub; otherwise, at best one is forced to
>> submit HTML or txt that converts not-too-badly with current PG
>> tools, and this might be extremely challenging.
>>
>> Carlo

Greg> I don't think we're ready for this except in rare cases
Greg> where ePub is the best format for display for a particular
Greg> item (we just released a book where PDF was the best format,
Greg> believe it or not).

Greg> The challenge is that when books are fixed, someone
Greg> (typically the whitewasher, seldom the original submitter)
Greg> needs to regenerate all the files for that book.

Greg> Since there is not yet any standard processing stream to
Greg> generate static ePub files, this makes it hard for fixes (to
Greg> HTML & text) to be applied to ePubs.

Greg> I would, of course, love to see something become our
Greg> "standard" conversion tool, usable by anyone. Right now,
Greg> the closest for PG is Marcello's software to build the
Greg> cached ePub files. It's wonderful and functional, but is it
Greg> ready for all envisioned purposes? I think not, due at
Greg> least in part to shortcomings of the input HTML.

That's the whole point of my proposal. Starting from hand-crafted HTML we are likely to end up with a poor ePub, since the inference of metadata might be wrong, and many features of the HTML need to be tuned to ePub and might not come out correctly; obtaining reasonable HTML from an ePub, on the other hand, is just a matter of unzipping and discarding the metadata. Maybe it will be harder to have "nicely handcrafted" HTML, but we have to give the best available product in the standard format that most users are likely to use (and of course a reasonable product in every other format). To maintain an ePub (e.g. to correct typos) one has to unzip the ePub, correct the HTML, and re-zip.
Another issue is automating the creation of txt from HTML. Currently, the output of w3m -dump (or links -dump, lynx -dump, etc.) is pretty good for txt, except that font changes (mainly, underscores for italics) are lost. It shouldn't be difficult to pre-process the HTML so that italics are rendered with underscores, in such a way that one obtains a reasonable PG txt file. This is likely to work better on the HTML unzipped from an ePub (in which the HTML is more constrained) than on hand-crafted HTML. It might be a bit more challenging to downgrade from UTF-8 (as generated by -dump) to iso-8859-1 or to ASCII, for example to handle the Unicode characters that are used to draw tables, but this too could very well be automated.

For my part, this is an offer to work towards the production of a toolchain along these lines, if it is not discarded a priori.

Carlo
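Both pre-processing steps suggested above can be sketched quickly. The function names are mine, and the regex pass is only plausible for the constrained HTML unzipped from an ePub; hand-crafted HTML would want a real parser. The box-drawing replacement table is likewise a hand-picked assumption about which characters a -dump tool emits.

```python
# Sketch of the two pre/post-processing passes: (1) rewrite <i>/<em>
# spans as _underscored_ text before feeding the HTML to w3m -dump,
# so italics survive in the txt; (2) downgrade the UTF-8 dump to
# ASCII, mapping Unicode box-drawing characters (used for tables)
# and curly quotes to ASCII equivalents.
import re
import unicodedata


def mark_italics(html):
    """Replace <i>/<em> tags with underscores so a text dumper keeps them."""
    html = re.sub(r"<(?:i|em)\b[^>]*>", "_", html)
    html = re.sub(r"</(?:i|em)>", "_", html)
    return html


# Hand-picked mapping; extend as needed for other dump output.
_ASCII_MAP = str.maketrans({
    "\u2500": "-", "\u2502": "|",                  # box lines
    "\u250c": "+", "\u2510": "+", "\u2514": "+",   # box corners
    "\u2518": "+", "\u251c": "+", "\u2524": "+",   # box tees
    "\u252c": "+", "\u2534": "+", "\u253c": "+",
    "\u2018": "'", "\u2019": "'",                  # curly quotes
    "\u201c": '"', "\u201d": '"',
})


def downgrade_to_ascii(text):
    """Map table-drawing characters, then strip accents via NFKD.
    Characters with no ASCII fallback are silently dropped."""
    text = text.translate(_ASCII_MAP)
    text = unicodedata.normalize("NFKD", text)
    return text.encode("ascii", "ignore").decode()
```

A pipeline would then be roughly: mark_italics on the HTML, run it through w3m -dump, and apply downgrade_to_ascii (or an iso-8859-1 variant) to the result.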