Re: [gutvol-d] The problems with paragraph formatting at PG

14 Dec 2011

      On Wed, December 14, 2011 7:00 am, Paul Flo Williams wrote:
...
Lee Passey wrote:
[snipped discussion of "tag soup"]
That's an extraordinarily verbose way of agreeing with me :-)
Verbosity is only one of my many faults ;-). The point I was trying to make,
perhaps inartfully, is that the term "tag soup" covers a multitude of sins.

You might have tag soup where you have well-formed XML which is not valid HTML
because it contains elements or attributes in addition to those allowed by the
XHTML DTD. This kind of tag soup can still be parsed by an XML parser, and is
mostly harmless. I consider this to be class 3 tag soup.

You can also have tag soup which is valid HTML but not well-formed XML. This is
because HTML is derived from SGML, and some things like implicitly closed
tags and possibly unnested tags are allowed in SGML but not XML. While valid,
this kind of file cannot be parsed by an XML parser, and so cannot be
processed with XSLT. To me this is class 2 tag soup.

Lastly, you can have tag soup which is simply wrong by any standard: examples
would include using block elements inside inline elements, failing to escape
ampersands or angle brackets, or using invalid character entities. This is
class 1 tag soup.

Class 3 tag soup can be fixed by an appropriate XSLT script, but class 2 and
class 1 tag soup cannot. Abbyy FineReader produces class 2 tag soup, as does
the script at archive.org written by kenh, which means that many of my XML
tools cannot work until the file is "fixed."
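
A rough way to see which side of the class 3 boundary a file falls on is to
hand it to any XML parser. The Python sketch below is only an illustration:
the function name and the three sample strings are made up here, and the check
cannot tell class 2 from class 1, since both simply fail to parse.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup):
    """Return True if an XML parser accepts the markup (class 3 at worst)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Hypothetical samples, one per class described above.
class3 = '<html><body><p>One<foo bar="1"/></p></body></html>'  # extra element
class2 = '<html><body><p>One<p>Two</body></html>'              # implicit </p>
class1 = '<html><body><p>Fish & chips</p></body></html>'       # bare ampersand

for name, sample in [("class 3", class3), ("class 2", class2), ("class 1", class1)]:
    print(name, "parses as XML:", is_well_formed_xml(sample))
```

Only the first sample parses, which is why XML tooling can work on class 3
input directly and needs a repair step for the other two.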

I don't know the tolerance Kindlegen has for these different types of tag
soup, but from a practical perspective, I would think you should limit
yourself to class 3 tag soup as Kindlegen input. So when you say "downgrade it
to an HTML 3.2 tag soup," I simply wanted to caution that it's okay to
downgrade to class 3 tag soup, but probably not class 2, and certainly not
class 1.
...
What you are suggesting is essentially to re-write Kindlegen to do things
better. This is not a bad idea, but I think it may be a bit more complex
than you think (or maybe I'm being unfair in suggesting that you don't
grasp the complexity).
No, I'm suggesting doing some pre-processing to the HTML that you give to
Kindlegen, converting constructs that it will convert badly into ones that
fly straight through from your input to its mobi output.
Here's a concrete example:
I'm less interested in concrete examples than I am in abstract examples. It's
easy to write a script that converts <div class="iline">...</div> to
<div>&nbsp;&nbsp;...</div>. It's somewhat more difficult, but relatively
straightforward, to write a script that converts <div class="iline">...</div>
to <div style="margin-left:1em">...</div>, where the style value is derived
from the <style> definition. (I can't think of a way to do it with XSLT, which
means some other scripting language must be used.)
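
For illustration, the class-to-inline-style conversion described above could
be sketched in Python roughly as follows. This is a minimal sketch under
stated assumptions, not a real CSS parser: it only understands simple
".name { ... }" rules, it only rewrites single-valued class attributes written
with double quotes, and all function and variable names are invented for this
example.

```python
import re

RULE = re.compile(r"([^{}]+)\{([^{}]*)\}")
CLASS_SELECTOR = re.compile(r"\.[\w-]+")
STYLE_BLOCK = re.compile(r"<style[^>]*>(.*?)</style>", re.S | re.I)
CLASS_ATTR = re.compile(r'class="([\w-]+)"')


def parse_class_rules(css):
    """Map each simple class selector to its merged declarations."""
    rules = {}
    for selectors, body in RULE.findall(css):
        decls = [d.strip() for d in body.split(";") if d.strip()]
        for selector in selectors.split(","):
            selector = selector.strip()
            # Assumption: anything other than a bare ".name" is skipped.
            if decls and CLASS_SELECTOR.fullmatch(selector):
                rules.setdefault(selector[1:], []).extend(decls)
    return {name: "; ".join(decls) for name, decls in rules.items()}


def inline_class_styles(html):
    """Replace class="x" with style="..." taken from the first <style> block."""
    match = STYLE_BLOCK.search(html)
    rules = parse_class_rules(match.group(1)) if match else {}

    def replace(attr):
        style = rules.get(attr.group(1))
        return f'style="{style}"' if style else attr.group(0)

    return CLASS_ATTR.sub(replace, html)


page = """<style>
.poem { margin-left: 2em }
.line, .iline { display: block; text-indent: -2em }
.line { }
.iline { margin-left: 1em }
</style>
<div class="iline">The second line of my poem</div>"""

result = inline_class_styles(page)
```

Classes with no matching rule are left untouched, so the output can still be
fed to a later, more specific substitution step.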

But what I really want is a script/program that converts <div
style="margin-left:[some unpredictable value]"> to <div class="semantic">[a
number of non-breaking spaces calculated based on the unpredictable
value]...</div>. In other words, a generic transformation rather than a
specific transformation.
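
A hedged Python sketch of such a generic transformation might look like this.
The assumptions are loud ones: margins are expressed in em only, one em is
rendered as two non-breaking spaces (the ratio used in the poem example in
this thread), and the class name "indented" and the function name are
placeholders invented for the sketch.

```python
import re

NBSP_PER_EM = 2  # assumption: 1em of margin becomes two non-breaking spaces

# Assumption: the style attribute holds nothing but an em-based margin-left.
MARGIN_DIV = re.compile(r'<div style="margin-left:\s*(\d+(?:\.\d+)?)em;?\s*">')


def margin_to_nbsp(html, class_name="indented"):
    """Turn an inline left margin into a class plus leading &nbsp; entities."""

    def replace(match):
        count = round(float(match.group(1)) * NBSP_PER_EM)
        return f'<div class="{class_name}">' + "&nbsp;" * count

    return MARGIN_DIV.sub(replace, html)


print(margin_to_nbsp('<div style="margin-left:1em">The second line</div>'))
print(margin_to_nbsp('<div style="margin-left:1.5em">A deeper line</div>'))
```

Margins given in other units (px, %) are left alone here; handling them would
need a policy decision about how wide a space is on the target device.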

If Kindlegen does /most/ of the transformations correctly, then it makes the
most sense to provide a program that performs only those transformations that
Kindlegen handles badly, if at all. But if the number of identified
transformations that Kindlegen handles badly grows significantly, and if one
has designed a generic transformation engine, then it may make the most sense
to simply handle /all/ the required transformations in this new, open-source
transformation engine, and leave Kindlegen only the job of packaging the
(X)HTML with additions into the .mobi format.

Because we understand the .mobi format, it should be a small step to add the
packaging function to the transformation engine and replace Kindlegen
entirely.

I'm not saying that that's the way it /should/ be done, only that it's an
option that should not be dismissed.

[snip]
...
I've decided that a flexible way of marking this up in HTML is this:
<html>
<head>
<style>
.poem { margin-left: 2em }
.line, .iline { display: block; text-indent: -2em }
.line { }
.iline { margin-left: 1em }
</style>
<body>
<div class="poem">
<p class="verse"><span class="line">The first line of my poem</span>
So what will happen if the user agent you're using indents paragraphs by 50% of
the display width? And it's pretty clear to me that a verse is not a paragraph. So
why not use <div> for verses instead?

The default display mode for <div> is block, and the default display mode for
<span> is inline. But in your example you have changed the display mode for
<span class="line"> to block. Why not just use <div> in the first place, as
its default presentation is exactly what you wanted?

My version would have been:

<div class="poem">
  <div class="verse">
    <div class="line">The first line of my poem</div>
    <div class="iline">&nbsp;&nbsp;The second line of my poem</div>
    <div class="line">A longer line comes next, and goes on a bit, but still
does not rhyme</div>
    ...

You may note that I have used non-breaking spaces in the "master" version just
like in the Kindle version. This is consistent with my view that "master"
versions should look acceptable even when the User Agent can't handle CSS.
...
You'll note that this needs CSS to show up as I intended, but the display
in a modern browser works well. However, kindlegen does an awful job at
converting this.
So I decide to preprocess the HTML that I feed to kindlegen. I can strip
the CSS entirely and use the classes to do some substitutions, so that I
feed kindlegen this:
<html>
<head>
</head>
<body>
<p height="1em">The first line of my poem</p>
A line is not a paragraph; use <div> instead. You've also lost the association
of lines into verses, and verses into poems. Probably not a problem if this
markup is /derived/ from a "master" version, but there's really no reason not
to preserve the structure moving forward. (It's interesting to note that
Kindlegen also preserves styles it cannot convert, which the Kindle will
ignore.)

I believe Kindle supports the <blockquote> element which provides right/left
margin indentation, just like your <div class="poem"> does. Maybe for the
Kindle you would want to enclose your entire poem as a <blockquote
class="poem">, which may provide some of the display "goodness" you are
seeking.
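
If the input is at worst class 3 tag soup, that kind of structural rewrite can
be done with an ordinary XML parser while keeping the verse and line nesting
intact. The Python sketch below is one possible way to do it; the function
name is invented, and it assumes the markup is well-formed and uses numeric
character references rather than named entities such as &nbsp;, which a plain
XML parser would reject.

```python
import xml.etree.ElementTree as ET


def poems_to_blockquotes(xhtml):
    """Rename every <div class="poem"> to <blockquote class="poem">."""
    root = ET.fromstring(xhtml)
    # Collect matches first so renaming does not interfere with iteration.
    for element in list(root.iter("div")):
        if element.get("class") == "poem":
            element.tag = "blockquote"
    return ET.tostring(root, encoding="unicode")


master = (
    '<body><div class="poem"><div class="verse">'
    '<div class="line">The first line of my poem</div>'
    '</div></div></body>'
)
kindle_input = poems_to_blockquotes(master)
```

Because only the outer element is renamed, the association of lines into
verses and verses into poems survives in the Kindlegen input.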

[snip]
...
I don't need to understand the structure of mobi files or
throw away kindlegen wholesale.
Absolutely. My fundamental rule is don't do anything that doesn't need to be
done (even if it would be fun to do). But I can definitely envision that over
time a generic Kindlegen preprocessor might elbow out Kindlegen itself.