Re: [gutvol-d] The problems with paragraph formatting at PG

13 Dec 2011

On Tue, December 13, 2011 3:05 am, Paul Flo Williams wrote:
...
There is a fourth way: pre-process your (X)HTML to downgrade it to an HTML
3.2 tag soup + Kindle attributes so that the kindlegen step does nothing
other than wrapping up your HTML into a MOBI file.

"Tag soup" refers to markup that does not follow correct HTML syntax or
document structure. Web browsers are expected not to fail when presented with
invalid HTML; instead they render the content by making reasonable heuristic
"guesses".

"Tag soup" may collectively refer to a large number of common authoring
mistakes, such as malformed HTML tags, improperly nested HTML elements, and
unescaped special characters, especially ampersands (&) and less-than signs
(<).

See generally, http://en.wikipedia.org/wiki/Tag_soup.

I do not know whether the MobiPocket parser at the heart of the Kindle is a
tag soup parser, but there is no need to degrade valid (X)HTML to tag soup for
the Kindle to use it. That said, it is true that Kindle relies on certain
proprietary elements and attributes, which I guess puts its input generally in
the "tag soup" category. But you definitely want to keep the file as
well-formed XML.
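
To make the well-formedness point concrete, here is a minimal Python sketch
(not anything Kindlegen or the Kindle itself runs) that uses the standard
library's strict XML parser to reject the kinds of tag soup listed above. The
function name is_well_formed is made up for illustration. Note that with no
DTD in the document a bare parser like this also rejects named entities such
as &nbsp;, so they would need to be written numerically.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if `markup` parses as well-formed XML, else False.

    Hypothetical helper: it checks well-formedness only, not validity
    against any (X)HTML DTD or schema.
    """
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    samples = [
        "<p>Fish &amp; chips</p>",    # escaped ampersand
        "<p>Fish & chips</p>",        # unescaped ampersand
        "<p><b><i>text</b></i></p>",  # improperly nested elements
    ]
    for sample in samples:
        print(is_well_formed(sample), sample)
```

A check like this could run before any conversion step, so that tag soup is
caught early instead of being silently "guessed" at by a lenient parser.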

[snip]
...
So, use xsltproc or Perl plus one of the XML modules to do what kindlegen
does, except applying some of your own conversions.
[snip]
...
Kindlegen does a simple job badly, so bypass the bits that you don't need.

What you are suggesting is essentially to rewrite Kindlegen to do things
better. This is not a bad idea, but I suspect it is a bit more complex than
you expect (or maybe I'm being unfair in suggesting that you don't grasp the
complexity).

Using XSLT is a good idea, but it is not a complete solution. It should work
fine when the styles are already inline, but XSLT does not itself resolve the
rules in a CSS stylesheet against each element. What is needed is a program
that will apply CSS to the document tree, and perhaps do other transformations
as well, before using XSLT to produce the final output.
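
As a rough sketch of that CSS-application step, the Python below copies
matching declarations into each element's style attribute so that a later
XSLT pass sees only inline styles. Everything in it is an assumption made for
illustration: the RULES table stands in for a parsed stylesheet, the helper
names are invented, only "tag", ".class" and "tag.class" selectors are
handled, and later rules simply override earlier ones rather than following
real CSS specificity. It also assumes un-namespaced element names; with the
XHTML namespace the tags would need that prefix stripped first.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table standing in for a parsed stylesheet.
# Selectors are limited to "tag", ".class", or "tag.class".
RULES = [
    ("p", {"text-indent": "1em", "margin": "0"}),
    ("p.first", {"text-indent": "0"}),
]


def matches(elem, selector):
    """Return True if elem matches a simple tag/class selector."""
    tag, _, cls = selector.partition(".")
    if tag and elem.tag != tag:
        return False
    if cls and cls not in elem.get("class", "").split():
        return False
    return True


def parse_style(text):
    """Parse 'name: value; name: value' into a dict."""
    decls = {}
    for part in text.split(";"):
        if ":" in part:
            name, value = part.split(":", 1)
            decls[name.strip()] = value.strip()
    return decls


def inline_styles(root, rules):
    """Merge matching rule declarations into each element's style attribute.

    Rules are applied in order; an existing inline style wins over all rules.
    """
    for elem in root.iter():
        computed = {}
        for selector, decls in rules:
            if matches(elem, selector):
                computed.update(decls)
        computed.update(parse_style(elem.get("style", "")))
        if computed:
            elem.set("style", "; ".join(f"{k}: {v}" for k, v in computed.items()))
    return root


if __name__ == "__main__":
    doc = ET.fromstring(
        '<body><p class="first">One</p><p style="margin: 1em">Two</p></body>'
    )
    inline_styles(doc, RULES)
    print(ET.tostring(doc, encoding="unicode"))
```

A real tool would need a proper CSS parser, specificity, inheritance and
descendant selectors, which is a large part of why the job is more complex
than it first looks.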

Clearly Perl is an option, although I would favor Java due to its superior
performance and the JAXP APIs. Python seems to be the language /du jour/, and
it is the one favored by most people involved with e-books, so maybe that
would be the right solution. Fast hardware can clearly compensate for
Python's poor performance.

We all know, more or less, how .mobi files are created, so once you've gotten
to the point where the input is Kindle-ready, you could just close the circle
and replace Kindlegen altogether.

Sounds like a fun project.

Lee Passey