Re: [gutvol-d] The problems with paragraph formatting at PG

14 Dec 2011

      Lee Passey wrote:
...
On Tue, December 13, 2011 3:05 am, Paul Flo Williams wrote:
...
There is a fourth way: pre-process your (X)HTML to downgrade it to an HTML
3.2 tag soup + Kindle attributes so that the kindlegen step does nothing
other than wrapping up your HTML into a MOBI file.
"Tag soup" refers to formatted markup which does not consist of correct HTML
[snip]
...
That said, it is true that Kindle relies on
certain proprietary elements and attributes, which I guess qualifies it
generally in the "tag soup" category. But you definitely want to keep the
file as well-formed XML.
That's an extraordinarily verbose way of agreeing with me :-)
...
...
Kindlegen does a simple job badly, so bypass the bits that you don't
need.
What you are suggesting is essentially to re-write Kindlegen to do things
better. This is not a bad idea, but I think it may be a bit more complex than
you think (or maybe I'm being unfair in suggesting that you don't grasp the
complexity).
No, I'm suggesting doing some pre-processing to the HTML that you give to
Kindlegen, converting constructs that it will convert badly into ones that
fly straight through from your input to its mobi output.

Here's a concrete example:

Let's say I've got a poem with some long lines that I wish to mark up. I
don't know what font size or screen width the reader is using, so I want
to make the display as flexible as possible. In the book that I'm copying,
the poem already has wrapped lines:

  The first line of my poem
     The second line of my poem
  A longer line comes next, and goes on a bit
         but still does not rhyme
     The last line ends with a flourish!

I've decided that a flexible way of marking this up in HTML is this:

<html>
<head>
<style>
.poem { margin-left: 2em }
.line, .iline { display: block; text-indent: -2em }
.line { }
.iline { margin-left: 1em }
</style>
</head>
<body>
<div class="poem">
<p class="verse"><span class="line">The first line of my poem</span>
<span class="iline">The second line of my poem</span>
<span class="line">A longer line comes next, and goes on a bit, but still
does not rhyme</span>
<span class="iline">The last line ends with a flourish!</span></p>
<p class="verse"><span class="line">The first line of my poem</span>
<span class="iline">The second line of my poem</span>
<span class="line">A longer line comes next, and goes on a bit, but still
does not rhyme</span>
<span class="iline">The last line ends with a flourish!</span></p>
</div>
</body>
</html>

You'll note that this needs CSS to show up as I intended, but the display
in a modern browser works well. However, kindlegen does an awful job of
converting this.

So I decide to preprocess the HTML that I feed to kindlegen. I can strip
the CSS entirely and use the classes to do some substitutions, so that I
feed kindlegen this:

<html>
<head>
</head>
<body>
<p height="1em">The first line of my poem</p>
<p height="0" width="-2em">  The second line of my poem</p>
<p height="0">A longer line comes next, and goes on a bit, but still does
not rhyme</p>
<p height="0" width="-2em">  The last line ends with a
flourish!</p>
<p height="1em">The first line of my poem</p>
<p height="0" width="-2em">  The second line of my poem</p>
<p height="0">A longer line comes next, and goes on a bit, but still does
not rhyme</p>
<p height="0" width="-2em">  The last line ends with a
flourish!</p>
</body>
</html>

(I haven't got a copy of Tallent handy, so I might have got the
indentation trick wrong.)

In this case, I selected elements to process by their class attribute, and
performed some attribute and textual substitutions. I don't bother
touching all the parts of the document that already convert well, so I
don't have to fully process the CSS.

This part of the document goes through kindlegen verbatim because there
isn't anything it needs to convert.
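
As a rough illustration of the class-based substitution described above,
here is a minimal Python sketch. It is not the actual toolchain: the
function names (preprocess, convert_verse) are made up for this example, it
uses regular expressions rather than a real parser, and it assumes the
markup looks exactly like the poem above (class="verse" paragraphs holding
class="line" and class="iline" spans). The height and width attributes and
the two leading spaces simply copy the hand-written output shown earlier,
which may itself have the indentation trick wrong.

```python
import re

# Patterns for the markup vocabulary used in the poem example (assumed).
STYLE_RE = re.compile(r'<style>.*?</style>\s*', re.DOTALL)
POEM_DIV_RE = re.compile(r'<div class="poem">\s*(.*?)\s*</div>', re.DOTALL)
VERSE_RE = re.compile(r'<p class="verse">(.*?)</p>', re.DOTALL)
LINE_RE = re.compile(r'<span class="(line|iline)">(.*?)</span>', re.DOTALL)


def convert_verse(match):
    """Turn one verse paragraph into a run of attribute-only <p> elements."""
    out = []
    for i, (cls, text) in enumerate(LINE_RE.findall(match.group(1))):
        text = " ".join(text.split())  # undo line wrapping in the source
        # The first line of a verse gets the stanza gap; the rest get none.
        height = "1em" if i == 0 else "0"
        if cls == "iline":
            out.append(f'<p height="{height}" width="-2em">  {text}</p>')
        else:
            out.append(f'<p height="{height}">{text}</p>')
    return "\n".join(out)


def preprocess(html):
    """Strip the CSS and poem wrapper, then rewrite each verse."""
    html = STYLE_RE.sub("", html)
    html = POEM_DIV_RE.sub(lambda m: m.group(1), html)
    # Anything that is not a verse paragraph is left untouched.
    return VERSE_RE.sub(convert_verse, html)
```

Calling preprocess() on the first HTML listing should give something close
to the second one; markup that carries none of these classes is returned
unchanged, which is the point of only touching the constructs that convert
badly.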

Of course, this is part of my toolchain for my books because I have made
certain choices about the vocabulary I mark up with, but the principle is
simple enough. I don't need to understand the structure of mobi files or
throw away kindlegen wholesale.