
On Thu, September 29, 2011 12:11 pm, Jim Adcock wrote:
Last I remembered, the "Kindle" format was identical to the MobiPocket ".mobi" format, which was just a dumbed-down ePub using a different compression format.
Not a bad overview, but missing the seminal PalmDOC format; IMO, no discussion of formats can be considered complete without including PalmDOC. (Maybe I ought to create a wikipedia account and go edit it myself.)
"Kindle" .azw format is "identical" to .mobi except that it uses a proprietary encryption format. Mobi *way* predates ePub, so if you will, ePub could be considered a "more modern" format than Mobi, not Mobi is a "dumbed down" version of ePub.
I apologize for any confusion. I didn't mean to imply that the Mobipocket format was in any way /developed/ from the ePub format. Rather, the two formats appear to have been independently and simultaneously created and are similar in many ways; the Mobipocket format is clearly the inferior of the two (although still quite capable). If you were to take the HTML portion of a very full-featured ePub file, convert it to .mobi format, and then extract the HTML from the .mobi version, you would likely see that it had lost some fidelity. Thus, .mobi can be said to be a "dumbed-down" version of ePub, just as GutenText can be said to be a "dumbed-down" version of nearly everything else, despite being first-in-time. I was talking about capability, not chronology.

(What follows is a somewhat pedantic history of e-book file formats, based upon my own experiences and observations. Stop reading now if you don't want to be bored out of your gourd. You have been warned.)

It all started back in 1996, when Rick Bram developed a method to compress text files for the Palm OS. The Palm Pilot didn't have what we would consider a file system, but instead stored all its data in database records (hence the extension .pdb, which was appropriate for any Palm DataBase). Rick broke the text files into 4k chunks (which was the database record size on the Palm) and compressed each chunk separately using a minor variation on the LZ77 compression method. He then added a header record which contained a minimal amount of book metadata, and a map of all the "chunks" that made up the complete file (file metadata). He called the format "Palm Doc".

Palm Doc was wildly successful, given the relatively small segment of the populace that was even using e-books at the time, but it didn't take long for people to run into the obvious limitations of simplistic text.
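As a rough illustration of the Palm Doc layout described above (fixed-size chunks, each compressed on its own, plus a header that maps the chunks), here is a minimal Python sketch. It is not the real file format: zlib stands in for Palm Doc's LZ77 variant, and the header dictionary and function names are made up for the example.

```python
import zlib

RECORD_SIZE = 4096  # the "4k" record size described above


def build_records(text, record_size=RECORD_SIZE):
    """Split text into fixed-size chunks and compress each chunk on its own."""
    chunks = [text[i:i + record_size] for i in range(0, len(text), record_size)]
    # zlib is only a stand-in; real Palm Doc files use their own LZ77 variant.
    records = [zlib.compress(chunk) for chunk in chunks]
    # Hypothetical header: minimal metadata plus a "map" of the chunks.
    header = {
        "text_length": len(text),
        "record_count": len(records),
        "record_sizes": [len(r) for r in records],
    }
    return header, records


def read_record(records, index):
    """Decompress one chunk without touching any of the others."""
    return zlib.decompress(records[index])


def read_all(header, records):
    """Reassemble the full text by decompressing every chunk in order."""
    parts = [read_record(records, i) for i in range(header["record_count"])]
    return b"".join(parts)
```

Because every chunk is compressed independently, a reader on a memory-starved device can jump to any part of the book by decompressing a single record, which is presumably the point of the design.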
Peanut Press (which later became Palm Digital Reader, which later became eReader, which was then bought out by Barnes & Noble) invented a markup language known as the Palm Markup Language (PML), which brought a certain amount of presentational "goodness" to its offerings. The Peanut (later eReader) format compressed an entire PML file using the LZ algorithm, and then encrypted the resulting file with the credit card information that was used to purchase the book, as a form of TPM.

At the world's first eBook conference in Gaithersburg, Md., sponsored by the National Institute of Standards and Technology (NIST) in October 1998, Microsoft Vice President for Emerging Technology Dick Brass proposed the creation of an open, nonproprietary standard for eBooks based on the HTML and Extensible Markup Language (XML) specifications. During 1999, the Open eBook Authoring Group worked to draft a publication structure, which was subsequently released to the public in August 1999 as OEBPS ver. 1.0. Following the release of OEBPS 1.0, the Open eBook Forum (OEBF, now the International Digital Publishing Forum, IDPF, an organization which was never open, nor a forum) was formally incorporated in January 2000.

Refinement of the Open eBook Publication Structure has continued in fits and starts ever since that time. Version 1.0.1 of the specification was released in July 2001, version 1.2 was released in August 2002, and version 2.0 was released in September 2007, with a maintenance release (2.0.1) in September 2010.

The biggest problem with the OEBPS (in my opinion) was that while it specified a set of files which in the aggregate could almost completely define an e-book, it didn't specify a consumer format, that is, a single file that could be presented by a software User Agent.
This failure was no doubt due to the fact that the OEBF was (and is) dominated by large commercial interests (Microsoft, Palm Digital, Gemstar, Adobe), each of which was already committed to a proprietary user format designed for their own hardware or software. Dick Brass' vision of an open, nonproprietary standard was thus corrupted to a set of specifications that commercial publishers could use to prepare a publication which could then be fed into a specific vendor's final conversion process.

In late 2004 and early 2005 Jon Noring and I began agitating for a truly open consumer-oriented e-book format, which we dubbed Open Reader. Our vision was to take the OEBPS suite of files, and a small amount of archive metadata, and archive them using the ZIP archive format, in a manner very similar to the Java JAR format.

No doubt coincidentally, the IDPF began a process in late 2005 to specify a consumer e-book format. On October 30, 2006, the IDPF announced the official release of the Open Container Format, which consisted of the OEBPS suite of files, together with a small amount of archive metadata, and possibly some encryption instructions, archived using the ZIP archive format, and which was intended to be very similar to the Open Office document format. The specification recommended that files built according to the specification be given the ".epub" file extension. Thus, ePub was born.

OPS 3.0, now under development, is virtually nothing more than the 2.0.1 version of the specification with the addition of techniques for embedding audio and video into an ePub file. One of the design goals of OPS 3.0 is that it will be _forward compatible_: that is, not only will new User Agents be able to parse and display ePub 2.0 files, old 2.0 User Agents should be able to parse and display ePub 3.0 files (minus, of course, the embedded audio and video).
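The container idea described above (the OEBPS files plus a small amount of archive metadata, wrapped in a ZIP archive) can be sketched in a few lines of Python. This only illustrates the container layout and is not a complete or validated ePub: the file names content.opf and chapter1.html and the function name are assumptions for the example, and a real publication needs a properly populated package file.

```python
import io
import zipfile

# Archive metadata: tells a User Agent where the package file lives.
CONTAINER_XML = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf"
              media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def build_container(publication_files):
    """Zip a dict of {name: text} publication files into ePub-style bytes."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # The mimetype entry goes first and is stored uncompressed.
        zf.writestr(zipfile.ZipInfo("mimetype"), "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML,
                    compress_type=zipfile.ZIP_DEFLATED)
        # The OEBPS suite of files (hypothetical names supplied by the caller).
        for name, data in publication_files.items():
            zf.writestr("OEBPS/" + name, data,
                        compress_type=zipfile.ZIP_DEFLATED)
    return buf.getvalue()
```

Calling build_container({"content.opf": "...", "chapter1.html": "..."}) returns bytes that could be written out with an .epub extension; whether a given reader accepts the result depends on the package file being valid.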
Almost as soon as version 1.0 of the specification was published, the e-book pirate community began creating e-books following that specification, and distributing them archived into a single .zip file. It could be argued that "ePubs" have been available since 2000, and the official owner of the specification, the IDPF, only acquiesced to what had become common practice.

Almost concurrently with the incorporation of the OEBF, Mobipocket SA was incorporated in France in 2000, but followed a somewhat different course in designing its e-book format. It also chose to use HTML 3.2 as its markup language, but encoded the entire file using Rick Bram's classic Palm Doc method. Some additional metadata was included, and as a means of TPM, each "chunk" was encrypted by a method which relied on access to a "Device ID." Early Mobipocket files used the .prc extension, which is technically incorrect as it supposedly stands for "Palm Resource Code", i.e. a /program/ for the Palm Pilot, but in the e-book world .prc is almost universally recognized as "Mobipocket v.1."

The follow-on .mobi format ("Mobipocket v.2") continued to use HTML/XHTML as its markup format, but abandoned the chunked LZ77 compression in favor of Huffman encoding. At this point in my career I quit paying attention to Mobipocket, so I have no information on what changes Amazon made to v.2 to create the .azw format ("Mobipocket v.3").

Years ago I was able to use an HTML file as input to a Palm Doc creation program, rename the resulting file from .pdb to .prc, and read it in Mobipocket Reader, with no indication that it was not recognized as a standard Mobipocket file. (You could probably still do that with the Kindle. Does anyone want to try?)

As BowerBird would be quick to point out (if he bothered to read anything I wrote), one of the problems with HTML is that writing a full-featured display engine is /hard/! (Just ask David Jean, the inventor of µBook.)
Luckily, for most purposes e-books can get away with using a subset of the HTML element set. So the Mobipocket engineers built a display engine that recognized only those HTML elements which they felt appropriate for e-books, and completely ignored style attributes, whether specified inline or through cascading style sheets. They also settled on the HTML 4 specification, relying on many deprecated elements.

Back in those days, Mobipocket claimed that it was completely CSS and OEBPS compliant. Nevertheless, when I converted an HTML file to PDB using an element such as <div style="text-align:center">some text</div>, Mobipocket Reader did not center the text. But when I ran the file through the Mobipocket Creator program, the text became centered. Extracting the text using the prc2html program (which also extracts files from unencrypted .mobi files), I discovered that the subject text had been converted to <center>some text</center> by the conversion program.

Further experimentation revealed that the Mobipocket reader did not recognize /any/ style attributes. When a style had a corresponding HTML 4 element or attribute, the Mobipocket Creator program converted the style attribute to the corresponding HTML 4 element; otherwise the styling was /left in the file/ but ignored by the reader.

For a relatively complete description of the .mobi format, see http://wiki.mobileread.com/wiki/MOBI. This article focuses primarily on the file format and discusses the Mobipocket markup only briefly, although it does point out that "you only get the full range of Mobipocket's formatting capabilities if you have markup written to use Mobipocket's non-standard, extended, and under-documented implementation of HTML 3.2. See: File tag reference on the mobipocket web site (http://www.mobipocket.com/dev/article.asp?BaseFolder=prcgen&File=TagRef_OEB.htm)."
The Mobipocket page referenced here has a good comparison of the markup supported by Mobipocket with the markup supported by the first version of the OEB Publication Structure.

Now, I quit paying much attention to Mobipocket a couple of years before it was acquired by Amazon, so I don't know how the reader, format and development suite have evolved since that time. This is one of the reasons I continue to ask whether the Kindle can parse/display HTML natively, other than by using the web browser, because that would tend to indicate that the old Mobipocket display engine may have been replaced/upgraded. Obviously, Amazon has added yet more metadata to the file, may be using a different compression scheme (although from what has been said here, maybe not) and has added a new TPM mechanism.

Perhaps no one else here is interested, but /I/ would be interested if anyone could provide me with more technical details about how the Mobipocket format has evolved since the Amazon acquisition.
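For reference, the style-to-element rewrite described earlier (a centered <div> coming back out of the conversion as <center>) can be imitated with a short Python sketch. This is only a guess at the observable behavior, not Mobipocket Creator's actual code: the regular expression and function name are invented for the example, it handles only the one centering case, and it does not cope with nested <div> elements.

```python
import re

# Matches only <div style="text-align:center">...</div> (hypothetical, narrow pattern).
CENTER_DIV = re.compile(
    r'<div\s+style="text-align:\s*center;?"\s*>(.*?)</div>',
    re.IGNORECASE | re.DOTALL,
)


def downgrade_centering(html):
    """Replace a centered <div> with the deprecated <center> element."""
    return CENTER_DIV.sub(r"<center>\1</center>", html)


# A style with no matching rule is simply left in the file untouched.
print(downgrade_centering('<div style="text-align:center">some text</div>'))
print(downgrade_centering('<div style="letter-spacing:2px">some text</div>'))
```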