
On Tue, December 13, 2011 3:05 am, Paul Flo Williams wrote:
There is a fourth way: pre-process your (X)HTML to downgrade it to an HTML 3.2 tag soup + Kindle attributes so that the kindlegen step does nothing other than wrapping up your HTML into a MOBI file.
"Tag soup" refers to markup that does not follow correct HTML syntax or document structure. Web browsers are expected not to fail when presented with invalid HTML; instead they render the content by making reasonable heuristic "guesses". The term covers a large number of common authoring mistakes, such as malformed tags, improperly nested elements, and unescaped special characters (especially ampersands (&) and less-than signs (<)). See generally http://en.wikipedia.org/wiki/Tag_soup.
I do not know whether the MobiPocket parser at the heart of the Kindle is a tag soup parser, but there is no need to degrade valid (X)HTML to tag soup for the Kindle to use it. That said, the Kindle does rely on certain proprietary elements and attributes, which I suppose puts its markup loosely in the "tag soup" category. Either way, you definitely want to keep the file well-formed XML.
[snip]
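For what it is worth, checking that a file is still well-formed XML takes only a few lines. The following is a rough sketch in Python using the standard library; the function name is_well_formed and the two sample strings are my own illustrations, not anything Kindlegen provides. Note that this parser knows only the predefined XML entities, so a named HTML entity such as &nbsp; would also be rejected.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Illustrative inputs (hypothetical): the second has a bare ampersand
# and an unclosed <br>, two classic tag soup mistakes.
valid = "<html><body><p>Fish &amp; chips</p></body></html>"
soup = "<html><body><p>Fish & chips<br></body></html>"

print(is_well_formed(valid))
print(is_well_formed(soup))
```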
So, use xsltproc or Perl plus one of the XML modules to do what kindlegen does, except applying some of your own conversions.
[snip]
Kindlegen does a simple job badly, so bypass the bits that you don't need.
What you are suggesting is essentially to rewrite Kindlegen to do things better. This is not a bad idea, but I think it may be a bit more complex than you think (or maybe I'm being unfair in suggesting that you don't grasp the complexity). Using XSLT is a good idea, but it is not a complete solution. XSLT works fine on inline styles, but it does not first resolve the CSS rules that apply to each element. What is needed is a program that applies the CSS to the document tree, and perhaps does other transformations as well, before XSLT produces the final output.
Clearly Perl is an option, although I would favor Java for its superior performance and the JAXP APIs. Python seems to be the language /du jour/, and it seems to be the one favored by most people involved with e-books, so maybe that would be the right solution. Fast hardware can clearly compensate for Python's poor performance.
We all know, more or less, how .mobi files are created, so once you've gotten to the point where the input is Kindle-ready, you could just close the circle and replace Kindlegen altogether. Sounds like a fun project.
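To make the "apply CSS to the document tree first" step concrete, here is a minimal sketch in Python using only the standard library. It assumes a great deal: the stylesheet has already been parsed into the hypothetical RULES list, only bare tag and single-class selectors are handled, later rules simply override earlier ones (no real specificity or cascade), and the document has no XML namespace. The names RULES, matches and inline_styles are mine. The point is only that once every element carries its resolved style attribute, an XSLT pass can work from inline styles alone.

```python
import xml.etree.ElementTree as ET

# Hypothetical, pre-parsed stylesheet: (simple selector, declarations).
RULES = [
    ("p", {"text-indent": "1em"}),
    (".center", {"text-align": "center"}),
]


def matches(elem, selector):
    """Match a bare tag name or a single '.class' selector only."""
    if selector.startswith("."):
        return selector[1:] in elem.get("class", "").split()
    return elem.tag == selector


def inline_styles(root, rules):
    """Copy matching rule declarations into each element's style attribute."""
    for elem in root.iter():
        merged = {}
        for selector, declarations in rules:
            if matches(elem, selector):
                merged.update(declarations)
        # Declarations already inline on the element override the rules.
        for declaration in elem.get("style", "").split(";"):
            if ":" in declaration:
                name, value = declaration.split(":", 1)
                merged[name.strip()] = value.strip()
        if merged:
            elem.set("style", "; ".join(f"{k}: {v}" for k, v in merged.items()))
    return root


doc = ET.fromstring(
    '<body><p class="center">Title</p>'
    '<p style="text-indent: 0">Body</p></body>'
)
inline_styles(doc, RULES)
print(ET.tostring(doc, encoding="unicode"))
```

A real tool would need a proper CSS parser and selector engine, plus a mapping from the resolved properties to whatever attributes the Kindle actually honors, which is where most of the complexity lies.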