
On Wed, December 14, 2011 7:00 am, Paul Flo Williams wrote:
Lee Passey wrote:
[snipped discussion of "tag soup"]
That's an extraordinarily verbose way of agreeing with me :-)
Verbosity is only one of my many faults ;-). The point I was trying to make, perhaps inartfully, is that the term "tag soup" covers a multitude of sins.
You might have tag soup where you have well-formed XML which is not valid HTML because it contains elements or attributes in addition to those allowed by the XHTML DTD. This kind of tag soup can still be parsed by an XML parser, and is mostly harmless. I consider this to be class 3 tag soup.
You can also have tag soup which is valid HTML but not well-formed XML. This is because HTML is derived from SGML, and some things like implicitly closing tags and possibly unnested tags are allowed in SGML but not XML. While valid, this kind of file cannot be parsed by an XML parser, and therefore not by an XSLT processor either. To me this is class 2 tag soup.
Lastly, you can have tag soup which is simply wrong by any standard: examples would include using block elements inside inline elements, failing to escape ampersands or angle brackets, or using invalid character entities. This is class 1 tag soup.
Class 3 tag soup can be fixed by an appropriate XSLT script, but class 2 and class 1 tag soup cannot. Abbyy FineReader produces class 2 tag soup, as does the script at archive.org written by kenh, which means that many of my XML tools cannot work until the file is "fixed." I don't know the tolerance Kindlegen has for these different types of tag soup, but from a practical perspective, I would think you should limit yourself to class 3 tag soup as Kindlegen input. So when you say "downgrade it to an HTML 3.2 tag soup," I simply wanted to caution that it's okay to downgrade to class 3 tag soup, but probably not class 2, and certainly not class 1.
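As a rough illustration of the class 3 boundary described above, a well-formedness check is enough to tell whether an XML tool chain will accept a file at all. This is only a sketch under stated assumptions: the function name is_well_formed_xml and the two sample strings are hypothetical, and it relies on Python's standard xml.etree.ElementTree module. It cannot tell class 2 from class 1; it only reports whether an XML parser accepts the input.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup: str) -> bool:
    """Return True if an XML parser accepts the markup (class 3 candidate)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        # Unclosed or mis-nested tags, bare ampersands, and so on end up here.
        return False
    return True


# Hypothetical samples, not taken from any real book file.
# Well-formed XML with an element that the XHTML DTD does not allow.
class3_sample = '<html><body><p>One<br/>Two</p><foo bar="1"/></body></html>'
# HTML-style markup with implicitly closed <p> and an unclosed <br>.
class2_sample = "<html><body><p>One<br>Two<p>Three</body></html>"

print(is_well_formed_xml(class3_sample))
print(is_well_formed_xml(class2_sample))
```

One caveat: this parser also rejects named entities such as &nbsp; unless they are declared, so a file that uses them may need numeric character references before it passes.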
What you are suggesting is essentially to re-write Kindlegen to do things better. This is not a bad idea, but I think it may be a bit more complex than you think (or maybe I'm being unfair in suggesting that you don't grasp the complexity).
No, I'm suggesting doing some pre-processing to the HTML that you give to Kindlegen, converting constructs that it will convert badly into ones that fly straight through from your input to its mobi output.
Here's a concrete example:
I'm less interested in concrete examples than I am in abstract examples. It's easy to write a script that converts <div class="iline">...</div> to <div> ...</div>. It's somewhat more difficult, though still relatively straightforward, to write a script that converts <div class="iline">...</div> to <div style="margin-left:1em">...</div>, where the style attribute is derived from the <style> definition (I can't think of a way to do it with XSLT, which means some other scripting language must be used). But what I really want is a script/program that converts <div style="margin-left:[some unpredictable value]"> to <div class="semantic">[a number of non-breaking spaces calculated based on the unpredictable value]...</div>. In other words, a generic transformation rather than a specific transformation.
If Kindlegen does /most/ of the transformations correctly, then it makes the most sense to provide a program that performs only those transformations that Kindlegen handles badly, if at all. But if the number of identified transformations that Kindlegen handles badly grows significantly, and if one has designed a generic transformation engine, then it may make the most sense to simply handle /all/ the required transformations in this new, open-source transformation engine, and leave Kindlegen only the job of packaging the (X)HTML with additions into the .mobi format. Because we understand the .mobi format, it should be a small step to add the packaging function to the transformation engine and replace Kindlegen entirely. I'm not saying that that's the way it /should/ be done, only that it's an option that should not be dismissed. [snip]
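To make the "generic transformation" idea a little more concrete, here is a minimal sketch of the margin-to-spaces step. Everything in it is an assumption made for illustration: the function name margin_to_nbsp, the ratio of two non-breaking spaces per em, and the restriction to <div> tags whose only attribute is a margin-left style given in em units. A real engine would need a proper HTML and CSS parser rather than a regular expression.

```python
import re

NBSP = "&nbsp;"
SPACES_PER_EM = 2  # assumed ratio; tune for the target device

# Matches only <div style="margin-left:Nem"> with no other attributes.
_DIV_WITH_MARGIN = re.compile(
    r'<div\s+style="\s*margin-left\s*:\s*([0-9]*\.?[0-9]+)\s*em\s*;?\s*"\s*>',
    re.IGNORECASE,
)


def margin_to_nbsp(markup: str, spaces_per_em: int = SPACES_PER_EM) -> str:
    """Replace an em-based left margin with leading non-breaking spaces."""

    def replace(match: re.Match) -> str:
        ems = float(match.group(1))
        count = max(0, round(ems * spaces_per_em))
        return "<div>" + NBSP * count

    return _DIV_WITH_MARGIN.sub(replace, markup)


print(margin_to_nbsp('<div style="margin-left:1em">The second line</div>'))
```

Margins in other units (px, %) are left untouched by this sketch, since converting them needs knowledge of the display that the script does not have.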
I've decided that a flexible way of marking this up in HTML is this:
<html>
<head>
<style>
.poem { margin-left: 2em }
.line, .iline { display: block; text-indent: -2em }
.line { }
.iline { margin-left: 1em }
</style>
<body>
<div class="poem">
<p class="verse"><span class="line">The first line of my poem</span>
So what will happen if the user agent you're using indents paragraphs by 50% of the display width? And it's pretty clear to me that a verse is not a paragraph. So why not use <div> for verses instead? The default display mode for <div> is block, and the default display mode for <span> is inline. But in your example you have changed the display mode for <span class="line"> to block. Why not just use <div> in the first place, as its default presentation is exactly what you wanted? My version would have been:
<div class="poem">
<div class="verse">
<div class="line">The first line of my poem</div>
<div class="iline"> The second line of my poem</div>
<div class="line">A longer line comes next, and goes on a bit, but still does not rhyme</div>
...
You may note that I have used non-breaking spaces in the "master" version just like in the Kindle version. This is consistent with my view that "master" versions should look acceptable even when the User Agent can't handle CSS.
You'll note that this needs CSS to show up as I intended, but the display in a modern browser works well. However, kindlegen does an awful job at converting this.
So I decide to preprocess the HTML that I feed to kindlegen. I can strip the CSS entirely and use the classes to do some substitutions, so that I feed kindlegen this:
<html> <head> </head> <body> <p height="1em">The first line of my poem</p>
A line is not a paragraph; use <div> instead. You've also lost the association of lines into verses, and verses into poems. Probably not a problem if this markup is /derived/ from a "master" version, but there's really no reason not to preserve the structure moving forward. (It's interesting to note that Kindlegen also preserves styles it cannot convert, and which the Kindle will ignore). I believe Kindle supports the <blockquote> element which provides right/left margin indentation, just like your <div class="poem"> does. Maybe for the Kindle you would want to enclose your entire poem as a <blockquote class="poem">, which may provide some of the display "goodness" you are seeking. [snip]
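For what it's worth, here is a sketch of the structure-preserving substitution suggested above: the outer <div class="poem"> becomes <blockquote class="poem"> and the verse and line divisions pass through unchanged. The names PoemRewriter and poem_to_blockquote are hypothetical, the code uses only Python's standard html.parser module, and it silently drops comments and doctypes, so treat it as an illustration rather than a finished preprocessor. Whether Kindlegen renders the result well is untested here.

```python
from html import escape
from html.parser import HTMLParser


class PoemRewriter(HTMLParser):
    """Re-emit markup, turning <div class="poem"> into <blockquote>."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.stack = []  # replacement tag name for each open <div>

    @staticmethod
    def _attrs(attrs):
        return "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in attrs
        )

    def handle_starttag(self, tag, attrs):
        name = tag
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            name = "blockquote" if "poem" in classes else "div"
            self.stack.append(name)
        self.out.append(f"<{name}{self._attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(f"<{tag}{self._attrs(attrs)}/>")

    def handle_endtag(self, tag):
        name = tag
        if tag == "div" and self.stack:
            name = self.stack.pop()  # close with whatever we opened
        self.out.append(f"</{name}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")  # keep &nbsp; and friends as written

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def poem_to_blockquote(markup: str) -> str:
    rewriter = PoemRewriter()
    rewriter.feed(markup)
    rewriter.close()
    return "".join(rewriter.out)
```

The stack is what keeps the closing tags honest: each </div> is closed with whichever tag its opening <div> was rewritten to, so the nesting of poem, verse, and line survives.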
I don't need to understand the structure of mobi files or throw away kindlegen wholesale.
Absolutely. My fundamental rule is don't do anything that doesn't need to be done (even if it would be fun to do). But I can definitely envision that over time a generic Kindlegen preprocessor might elbow out Kindlegen itself.