
On 4/19/2010 9:00 AM, Jim Adcock wrote: [snip]
> It would be nice to have a portable version of the current tools, so that transcribers can see how their HTML is going to "officially" translate into ePub and MOBI prior to submission. I tried porting the tools, but got bogged down by the amount of stuff which wouldn't port easily.
Only half of this proposal is possible: the .mobi half. As others have pointed out recently, .epub is not really an e-book format.

For reasons both technical and practical, most people agree that HTML is the preferred markup for creating e-books. The primary drawback to HTML is that it is inherently a multi-file solution; the HTML file is distinct from the image files, CSS files, font files, etc. Moreover, if multiple HTML files make up the book (and sometimes there are good technical reasons for doing so) you need yet another metafile that describes how the different files relate to each other.

After about a year of wrangling, in September 2006 the IDPF officially released the "Open Container Format," which specifies how a collection of HTML files and the other files on which they depend are to be included in a ZIP archive. The specification recommends using the file extension ".epub" to identify files that are OCF containers. In other words, an ".epub" file is just a ".zip" file with a few additional metadata files added.

Software that purports to "convert" HTML to .epub should not do /anything/ to the source file, except perhaps to ensure that it is valid XHTML (for older HTML files). There is no need to validate an .epub conversion, as no conversion should have occurred. If a rendered .epub document does not look exactly like the same collection of files rendered by a browser from the file system, it is the fault of the .epub rendering software, not the "conversion."

Mobipocket, on the other hand, is a different can of worms. The original Mobipocket reader (which, I understand, became the basis for the Kindle software) used a subset of HTML markup, and in a few instances changed the meaning of tags (<hr /> does not create a horizontal rule, but starts a new page in the user agent). It did not recognize all of the named entities, and did not support CSS at all.
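
Coming back to the container point for a moment: the "just a .zip with a few metadata files" claim can be illustrated with a minimal Python sketch. This is an illustration under stated assumptions, not anybody's official tool. The function name build_epub and the package file name "content.opf" are made up for the example, and the placeholder OPF content is not a valid package document. As far as I know, the OCF rules used here are that the "mimetype" entry comes first and is stored uncompressed, and that META-INF/container.xml points at the package file.

```python
import io
import zipfile

# Minimal META-INF/container.xml. "content.opf" is a hypothetical file name
# chosen for this sketch; the container file is what points readers at it.
CONTAINER_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<container version="1.0" '
    'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">\n'
    '  <rootfiles>\n'
    '    <rootfile full-path="content.opf" '
    'media-type="application/oebps-package+xml"/>\n'
    '  </rootfiles>\n'
    '</container>\n'
)


def build_epub(target, files):
    """Write an OCF container to `target` (a path or a binary file object).

    `files` maps archive names to bytes; the bytes are stored untouched.
    """
    with zipfile.ZipFile(target, "w") as zf:
        # The mimetype entry goes first and is stored without compression.
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML,
                    compress_type=zipfile.ZIP_DEFLATED)
        # Everything else is copied in as-is; nothing is converted.
        for name, data in files.items():
            zf.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)


# Usage: the XHTML that comes out is byte-for-byte what went in.
book = {
    "content.opf": b"<package/>",  # placeholder only, not a valid OPF
    "chapter1.xhtml": b"<html><body><p>Hello</p></body></html>",
}
buf = io.BytesIO()
build_epub(buf, book)
with zipfile.ZipFile(buf) as zf:
    assert zf.namelist()[0] == "mimetype"
    assert zf.read("chapter1.xhtml") == book["chapter1.xhtml"]
```

The only "work" done is archiving, which is why there is nothing in the HTML itself to validate after packaging.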
A Mobipocket PRC file was simply this almost-HTML compressed using Rick Bram's PalmDOC compression scheme (which was actually quite elegant in its simplicity). The later ".mobi" format was the same almost-HTML file compressed across the entire package using Huffman encoding instead. This produces a somewhat smaller file; the contents of the archive are identical to those in the ".prc" format.

Mobipocket Publisher (which I assume is still what is used to create Kindle files) claimed that Mobipocket files supported CSS. In fact what happened was that Mobipocket Publisher would load a CSS file if one was specified in the source HTML, and would convert all the style attributes and computed CSS to the almost-HTML the Mobipocket reader recognized. Thus, a style like "style='font-size: larger;'" might be converted to "<font size='4'>", but a style like "style='margin-left: 10em;'" was simply discarded, because the Mobipocket almost-HTML did not recognize any way to change margin sizes.

If you wanted to test the Mobipocket conversion, I would think the way to do that would be to extract the modified HTML from the Mobipocket file, and then write whatever kind of tests you needed to be sure the conversion was correct. I have some 'C' code hanging around to extract HTML from ".mobi" files; if you want it, I could send it to you.
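
For a rough idea of what the extraction step involves, here is a minimal Python sketch of decompressing one PalmDOC-compressed text record. It is not the C code mentioned above, and it rests on assumptions: the byte-code layout follows the commonly described PalmDOC scheme as I recall it (literal bytes, short literal runs, space-plus-character codes, and two-byte back-references with an 11-bit distance and 3-bit length). The function name palmdoc_decompress is made up for the example. Pulling the records out of the container, and the Huffman-compressed variant, are not covered.

```python
def palmdoc_decompress(data: bytes) -> bytes:
    """Decompress one PalmDOC (LZ77-style) record. Sketch only."""
    out = bytearray()
    i = 0
    n = len(data)
    while i < n:
        c = data[i]
        i += 1
        if 1 <= c <= 8:
            # The next c bytes are copied through unchanged.
            out += data[i:i + c]
            i += c
        elif c < 0x80:
            # 0x00 and 0x09..0x7F stand for themselves.
            out.append(c)
        elif c >= 0xC0:
            # A space followed by the character (c XOR 0x80).
            out.append(0x20)
            out.append(c ^ 0x80)
        else:
            # 0x80..0xBF: two-byte back-reference into the output so far.
            if i >= n:
                raise ValueError("truncated back-reference")
            pair = ((c << 8) | data[i]) & 0x3FFF
            i += 1
            distance = pair >> 3
            length = (pair & 0x07) + 3
            if distance == 0 or distance > len(out):
                raise ValueError("back-reference outside output")
            # Copy byte by byte so overlapping references repeat correctly.
            for _ in range(length):
                out.append(out[-distance])
    return bytes(out)


# Usage: "abc" followed by a reference meaning "go back 3, copy 3".
print(palmdoc_decompress(b"abc\x80\x18"))
```

Once the almost-HTML is recovered this way, it can be compared against the source HTML with whatever tests suit the purpose.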