
On 1/13/2012 6:48 PM, James Adcock wrote:
We all know that internally the Kindle supports only a slight variation of HTML 3.2.
Not sure what "we all know." Are you saying Kindle Fire is only internally HTML 3.2? Are you saying KF8 is only internally HTML 3.2?
Absolutely not. In fact, I disclaim any knowledge at all about the software on the Kindle Fire or the KF8 file format. I'm sure the community over at MobileRead will soon have it figured out, but so far I'm not pursuing that knowledge. I will admit that the Fire hardware seems very attractive at its price point, and now that procedures for "rooting" the device are well-established I will probably be buying one in the near future, and I hope to gain more knowledge about the format as time goes on.

So, what /do/ we know?

First and foremost, we know that the limiting factor for Kindles of all generations in displaying any particular format is /not/ the file format, but the installed software. /No/ improvement to KindleGen can or will create new capabilities that the installed reader software does not already support.

We know that Amazon has published documentation for exactly what markup the pre-Fire Kindle reader supports (see http://kindlegen.s3.amazonaws.com/AmazonKindlePublishingGuidelines.pdf). Examination of that document reveals that the supported HTML tags for the pre-Fire devices are almost a perfect match for HTML 3.2. (Especially interesting is http://www.mobipocket.com/dev/article.asp?BaseFolder=prcgen&File=TagRef_OEB.htm, which is referenced by the Kindle Publishing Guidelines.)

We know that the pre-Fire Kindle reader does not support HTML 4 tags, or CSS.

"But wait," you say. "The Kindle Publishing Guidelines specifically say that Kindles support some (not all) CSS and the <em> tag, which was not a part of HTML 3.2! How can you say you know what you claim to know?"

Good question. (It's easy to ask good questions when someone else is putting words in your mouth :-)). I started with a fairly simple HTML file which nonetheless contained a CSS style sheet, inline styles, and some HTML 4 tags. I used KindleGen to create a .mobi file from that HTML.
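For anyone who wants to repeat the experiment, a test file of that kind might look like the sketch below. The exact file used is not reproduced in this message, so the markup here is an assumption assembled from the examples mentioned in the discussion (an internal style sheet, an inline style, an align attribute, a class reference, and an <em> tag). The file name kindle_test.html and the helper write_test_file are hypothetical.

```python
from pathlib import Path

# Hypothetical test document: one construct per paragraph, so that each
# can be checked individually in the HTML extracted from the .mobi file.
TEST_HTML = """\
<html>
<head>
<title>Kindle markup test</title>
<style type="text/css">
p.center { text-align: center; }
</style>
</head>
<body>
<p align="center">Centered with an HTML attribute</p>
<p style="text-align:center">Centered with an inline style</p>
<p class="center">Centered through the internal style sheet</p>
<p>Emphasis with the <em>em</em> tag</p>
</body>
</html>
"""


def write_test_file(path: str = "kindle_test.html") -> Path:
    """Write the test document to disk and return its path."""
    target = Path(path)
    target.write_text(TEST_HTML, encoding="utf-8")
    return target


if __name__ == "__main__":
    print(write_test_file())
```

The resulting file is what would be fed to KindleGen; comparing it line by line with the extracted HTML shows which constructs were replaced and which were left alone.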
I then used the prc2html program I wrote to extract the HTML back from the .mobi file, and examined the resulting HTML source. What I discovered was that all CSS styles that the Publishing Guidelines claim are supported by the pre-Fire Kindle had been converted to HTML 3.2 tags (e.g. <p align="center"> and <p style="text-align:center"> were converted to <center>) and HTML 4 tags had been converted to their HTML 3.2 counterparts (e.g. <em> became <i>). Inline references to internal style sheets were ignored (e.g. <p class="center">). Just to be explicit, when I say converted I mean replaced; the new markup was inserted and the old markup removed. Interestingly, unsupported inline CSS styles and other attributes were left in place, as was the entire internal style sheet (that section between <style> and </style>).

Another thing that is well known is that the .mobi format is little more than an HTML file in a MobiPocket wrapper. I have been universally successful when taking HTML files, converting them to PalmDOC, renaming the .pdb to .prc, and opening them in the Kindle reader. I took the simple HTML file that I started the experiment with and converted it to .prc using the foregoing procedure. When I loaded my pseudo-Kindle file into the Kindle reader (for PC, as I don't own the hardware) I discovered that 3.2 markup was rendered, but 4.0 markup and styles were ignored.

Interestingly, the Kindle DXG boasted a web browser, but it obviously used the old MobiPocket rendering engine, because even the web browser was limited to HTML 3.2 markup. Almost immediately upon release the DX web browser was widely panned and dismissed.

Now I'm going to engage in some speculation.
I suspect that somewhere in the 2nd generation time frame Amazon engineers realized that the simple 3.2-based MobiPocket engine was obsolete, and began working on a new HTML display engine (it's possible, even likely, that when Amazon bought MobiPocket they bought it as a business unit, and the MobiPocket engineers are responsible for the ongoing Kindle reader development and maintenance). The 4th generation Kindles probably include this new, more powerful and compliant engine. This new engine probably supports most, if not all, of the HTML 4.01 specification.

If I were writing new e-reader software for an Android device, I would probably adopt the WebKit engine for HTML rendering. I don't believe that is what the Amazon engineers did. I think they probably ported the latest generation Kindle reader software to Android.

One of the most fundamental conclusions that should be apparent from all of this discussion is that the determining factor in what is supported, or not, is the e-book rendering software, not the e-book generating software. So the question is not really whether KindleGen supports HTML 5, but whether the Fire /rendering/ software supports HTML 5. It's possible that the software does /not/ support HTML 5, but KindleGen is smart enough to convert HTML 5 to HTML 4.01. It's also possible that the Fire (via the KF8 file format) supports HTML 5, but earlier versions of KindleGen were stripping those "unknown" tags out, and now it's leaving them in.

I also think it's possible, indeed likely, that with the Fire Amazon has a method to upgrade the Kindle software "behind the scenes" without the users even knowing that an upgrade occurred. In this case, older Fires may be in the process of upgrading to HTML 5, and some change to KindleGen was required to support the upgrade.
Agreed that the new kindlegen *does not* seem to "magically" fix all the limitations of the older generations of Kindle. Most notably the paragraph top-margin/bottom-margin problem which keeps killing so many of the PG MOBI files remains on older Kindles even using the new kindlegen.
Yes, KindleGen doesn't fix the limitations of earlier software, and indeed cannot. It /can/ modify/convert the input in such a way that older software can deal with it, but KindleGen can't make a silk purse out of a sow's ear. I think that the answer is to try to save your files using whatever markup will give you the highest fidelity possible, and then downgrade from there if necessary.

As to the issue of acceptance of files at Project Gutenberg, in my experience there is no such restriction on posting files at the Internet Archive. My advice would be to post the good stuff there, and then give PG just the impoverished text, with a transcriber's note that says better stuff is available at IA, with a URL.
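
On the "downgrade from there" step: it could be scripted. What follows is only a rough sketch in Python, not KindleGen's actual logic. It applies the two substitutions reported earlier in this message (<em> replaced by <i>, and centered paragraphs replaced by <center>) using regular expressions. The function name downgrade, the TAG_RENAMES table, and the choice to turn the whole paragraph into <center>...</center> are assumptions made for illustration; a real tool would want a proper HTML parser.

```python
import re

# Tag renames reported in this message: <em> was replaced by <i>.
# Further pairs could be added here; none are claimed by the source.
TAG_RENAMES = {"em": "i"}

# Matches a whole centered paragraph, written either as <p align="center">
# or as <p style="text-align:center">, and captures its contents.
CENTERED_P = re.compile(
    r'<p\s+(?:align\s*=\s*"center"'
    r'|style\s*=\s*"\s*text-align\s*:\s*center\s*;?\s*")\s*>'
    r"(.*?)</p>",
    re.IGNORECASE | re.DOTALL,
)


def downgrade(html: str) -> str:
    """Rewrite a few newer constructs using older-style markup."""
    # Assumption: the centered paragraph becomes <center>...</center>.
    html = CENTERED_P.sub(r"<center>\1</center>", html)
    # Rename opening and closing tags, leaving any attributes in place.
    for old, new in TAG_RENAMES.items():
        html = re.sub(
            rf"<(/?){old}\b", rf"<\g<1>{new}", html, flags=re.IGNORECASE
        )
    return html


if __name__ == "__main__":
    sample = (
        '<p style="text-align:center">Title</p>'
        '<p class="center">Some <em>emphasis</em></p>'
    )
    print(downgrade(sample))
```

Class references are deliberately left untouched, mirroring the observation that they were ignored rather than converted. Keeping the high-fidelity source as the master copy and generating the downgraded file from it means the conversion can be rerun whenever the target reader improves.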