these grapes are sweet -- lesson #18

this thread explains how to digitize a book quickly and easily. search archive.org for "booksculture00mabiuoft" for the book. *** again, here's spellcheck run against grapes006.txt:
http://zenmarkuplanguage.com/grapes115.py http://zenmarkuplanguage.com/grapes115.txt
this tells us what words we still need to add to our "specials" dictionary, so we'll overlay that file now:
and we'll run the spellcheck using that full "specials":
http://zenmarkuplanguage.com/grapes116.py http://zenmarkuplanguage.com/grapes116.txt
this shows us we've attained a null "notfound" list, so we can create a custom unique dictionary for the book. and here's the newest version of the text-file, which now includes the formatting and styling information:
i also modified the pagenumbers and runheads so they would take the form z.m.l. uses "traditionally". -bowerbird
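The "notfound" check described above can be sketched roughly as follows. This is not the contents of grapes115.py or grapes116.py; the function name, the word-splitting rule, and the case-insensitive matching are all assumptions made for illustration.

```python
import re

def find_notfound(text, dictionary, specials):
    """Return the sorted list of words found in neither word list.

    `dictionary` is the general word list and `specials` is the
    book-specific list; both names are hypothetical stand-ins.
    """
    known = {w.lower() for w in dictionary} | {w.lower() for w in specials}
    # Treat runs of letters (with internal apostrophes) as words.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)
    return sorted({w for w in words if w.lower() not in known})
```

An empty result corresponds to the null "notfound" list mentioned above: every word in the text is covered by the general dictionary plus the "specials".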

www.amazon.com
Pricing starts at $79
Includes touch-screen models
Includes a color touch-screen "iPad" like device [at least that is what WSJ has been claiming but I just don't see it -- Color Nook competition? Yes. iPad competition? Doesn't look like it to me.]
Claimed much-optimized web browsing on the color version. I think most of us can predict by now the advantages and disadvantages of that approach.
"Good News" -- from my point of view: Amazon is now touting the ability to use free books from free sites, including archive.org and gutenberg.org http://www.amazon.com/gp/b/?node=2245146011
Not sure exactly who is doing what re Kindle support on archive.org, but something new is happening there. I downloaded a "Kindle" book from archive.org "at random" and what I find is much more readable than what I have found there in the past -- clearly some new OCR technology, readable, but surprisingly the OCR doesn't bother to "heal" all the linebreaks found within words. Still, if I can't find a book anywhere else, I think I would be happy to read the "Kindle" archive.org version. -- The approach taken seems similar to Google Books OCR, but better.

Hi Everybody, I will chime in a bit. (OT?) Personally, I think there is more hype to the Fire than at first meets the eye. As for the other Kindles, there are some improvements, but read the specs carefully. On 28.09.2011 at 17:44, Jim Adcock wrote:
www.amazon.com
Pricing starts at $79
Special offer! Add $30 for advertisement-free! Battery life for 1 month when reading only for half an hour a day!! Why do they not say how long the battery lasts with continuous reading! I read hours at a time! Only 6" screen.
Includes touch-screen models
Includes a color touch-screen "iPad" like device [at least that is what WSJ has been claiming but I just don't see it -- Color Nook competition? Yes. iPad competition? Doesn't look like it to me.]
Claimed much-optimized web browsing on the color version. I think most of us can predict by now the advantages and disadvantages of that approach.
The Touch models naturally have better battery life and better screens, but nevertheless suffer from the same questions as the non-touch models!
The Fire seems interesting, but to state some facts: 1) 8 GB, half that of the iPad 2) 7" screen, smaller than the iPad 3) roughly similar battery life 4) missing a 3G option 5) processing power hard to say, they do not mention what is inside! They push the Silk browser for speed! But why? Sure it is nice, but you still have to push the data. So the processor must have some shortcomings. All in all, a somewhat rushed product! Price-wise, from looking at the specs, the same price-to-feature ratio as the iPad. On the other side, the iPad 2 is not ready for prime time for me either. regards Keith.

Why do they not say how long the battery lasts with continuous reading! I read hours at a time! Only 6" screen.
In practice in my experience the Kindles' batteries pretty much "last forever" with continuous reading. What runs the battery down, in practice, is leaving the send/receive transmit needlessly turned on. Unlike LED or LCD displays the black and white Kindles consume vanishingly small amounts of power while passively displaying a page of a book. In practice I do charge my Kindles about "once a month." as compared to my netbook, which needs to be charged about "once a day."

In practice I do charge my Kindles about "once a month." as compared to my netbook, which needs to be charged about "once a day."
This is consistent with what I've found with my K2 as well. I get about 3 days with the modem on, and around 25 days without it on, reading between 1.5 and 3 hours a day.

Hi Alex, Your Kindle is likely to be older. According to the specs, the newer readers should be getting about 10 days with the modem on. But that is my point, the advertised specs can be misleading. regards Keith. On 29.09.2011 at 20:28, Alex Buie wrote:
In practice I do charge my Kindles about "once a month." as compared to my netbook, which needs to be charged about "once a day."
This is consistent with what I've found with my K2 as well. I get about 3 days with the modem on, and around 25 days without it on, reading between 1.5 and 3 hours a day. _______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d

Hi Jim, My statement was aimed at the fact that Amazon states a 2 month battery life with just half an hour a day. Why not give a more revealing estimate of the battery life? Like 30 hours of continuous reading. Like I said, I often read 4-5 hours at a time. So that would give me only about a week of reading, but that is not that bad at all. Yet, as I mentioned in another post, batteries do "regenerate" when not in use. regards Keith. On 29.09.2011 at 20:23, Jim Adcock wrote:
Why do they not say how long the battery lasts with continuous reading! I read hours at a time! Only 6" screen.
In practice in my experience the Kindles' batteries pretty much "last forever" with continuous reading. What runs the battery down, in practice, is leaving the send/receive transmit needlessly turned on. Unlike LED or LCD displays the black and white Kindles consume vanishingly small amounts of power while passively displaying a page of a book. In practice I do charge my Kindles about "once a month." as compared to my netbook, which needs to be charged about "once a day."

Why not give a more revealing estimate of the battery life. Like 30 hours of continuous reading.
Because I don't keep track of how many hours I read a month, and because when the technology becomes good enough that one almost never has to recharge then it's time to start thinking about the problem in a totally different manner. Charging just isn't in practice an issue on a Kindle -- except that it happens infrequently enough that one has to ask: "Dang, now where did I leave that charging dongle?"
Like I said, I often read 4-5 hours at a time. So that would give me only about a week of reading.
Not necessarily true. You might find that the Kindle gives you two or three weeks of reading. Again, reading takes almost no power on a Kindle, it may be the other stuff that the Kindle does that consumes the power, in which case the amount of time the charge lasts still has little or no correlation to your "hours of reading per day metric." For example if I go a month or two without using a particular Kindle at all, I may well still find that it has discharged its battery "doing other stuff" even though I've been reading it "zero hours a day."

Hi Jim, I do not doubt that the Kindle has great battery life. I guess it comes down to how you read advertising. 30 hours, or 2 months at 1/2 hour a day. regards Keith. On 02.10.2011 at 01:45, Jim Adcock wrote:
Why not give a more revealing estimate of the battery life. Like 30 hours of continuous reading.
Because I don't keep track of how many hours I read a month, and because when the technology becomes good enough that one almost never has to recharge then it's time to start thinking about the problem in a totally different manner. Charging just isn't in practice an issue on a Kindle -- except that it happens infrequently enough that one has to ask: "Dang, now where did I leave that charging dongle?"
Like I said, I often read 4-5 hours at a time. So that would give me only about a week of reading.
Not necessarily true. You might find that the Kindle gives you two or three weeks of reading. Again, reading takes almost no power on a Kindle, it may be the other stuff that the Kindle does that consumes the power, in which case the amount of time the charge lasts still has little or no correlation to your "hours of reading per day metric." For example if I go a month or two without using a particular Kindle at all, I may well still find that it has discharged its battery "doing other stuff" even though I've been reading it "zero hours a day."

I do not doubt that the Kindle has great battery life. I guess it comes down to how you read advertising. 30 hours, or 2 months at 1/2 hour a day.
Sorry, but I think you didn't read my previous email. Again, battery life in the Kindle is not necessarily highly correlated to reading hours, as you continue to insist must be the case. Again, the Kindle battery can go to zero in about a month even if you are reading it 0 hours a day.
Reading a Kindle basically doesn't take any power. Page turns do. Turning the modem on to transmit data does. Turning the Kindle on and off does. Kindle auto-updates take power. Reading a PDF instead of a mobi file takes more power -- more computing power -- but once a PDF page is displayed it again basically takes no power.
Trying to relate Kindle reading time to battery life would be like having a battery-powered inkjet printer (these do exist) and asking how long one can spend reading before the inkjet printer's battery goes dead. Answer: You can read as many hours as you want with the inkjet printer turned off and the battery won't go dead, because the inkjet's battery life doesn't depend on whether you are reading or not. Reading the printed page made by the inkjet printer is passive, depending on overhead light "for power".
Same with the Kindle. It is pretty close to being a 100% passive device while you are reading a page, relying on overhead light "for power" so that you can see what is on the Kindle display. Page turns take some power [very little] but retaining that display on the Kindle once it has been displayed basically takes zero power -- so read as long as you like. Read eInk.com to get some feel for how the display technology works "without taking any power."

PCMag has some early reviews of these things: http://enews.pcmag.com/u.d?6YGgZm7YzcSrT78Kt6RQ=80 http://enews.pcmag.com/u.d?6YGgZm7YzcSrT78Kt6RV=90

Not sure exactly who is doing what re Kindle support on archive.org, but something new is happening there. I downloaded a "Kindle" book from archive.org "at random" and what I find is much more readable than what I have found there in the past....
Just double-checking, the "Kindle" books on archive.org are not encrypted, which is a good thing, and how it should be, so users can use them for other than just Kindles.

On Wed, September 28, 2011 10:10 am, Jim Adcock wrote: [snip]
Just double-checking, the "Kindle" books on archive.org are not encrypted, which is a good thing, and how it should be, so users can use them for other than just Kindles.
Last I remembered, the "Kindle" format was identical to the MobiPocket ".mobi" format, which was just a dumbed-down ePub using a different compression format. (Somewhere in this mess I have the source for a program I wrote that decompresses the ".mobi" format into its component parts...) Part of the ePub-to-mobi conversion process is to convert styles, both inline and CSS, into regular HTML tags, as MobiPocket didn't recognize styles in any form. My suspicion is that the Archive.org people simply grabbed KindleGen, or a similar command-line program, and did a mass conversion of all their ePubs to Kindle. As an alternative, a more sophisticated interface could generate the Kindle pubs on demand when a Kindle version did not already exist ... or even in every case if they have more processing power than disk space. So for me, this raises the interesting (although in no way important) questions: Are there any publications on archive.org where there is a Kindle version but no ePub? Are there any Kindle versions available which are more "refined" than their ePub counterparts? Are the new Kindles capable of reading ePub (or HTML?) files directly? Inquiring minds want to know ...

Hey guys, First time I've written to this list, but I figured now was just as good a time as any other. I'm a collections management intern at the Archive.
Last I remembered, the "Kindle" format was identical to the MobiPocket ".mobi" format,
Yep, this is correct. Amazon added a few extra headers to the file format, but the Kindle can read .mobi files without the .azw headers.
My suspicion is that the Archive.org people simply grabbed KindleGen, or a similar command-line program, and did a mass conversion of all their ePubs to Kindle. As an alternative, a more sophisticated interface could generate the Kindle pubs on demand when a Kindle version did not already exist ... or even in every case if they have more processing power than disk space.
Yep, this is mostly the case. Unless someone sends us a .mobi when they upload an item, we dynamically generate one from the epub at each request time.
Are there any publications on archive.org where there is a Kindle version but no ePub?
We do not currently generate epubs from mobis, so if someone uploads ONLY a .mobi file, that will be the only format it's available in. The preferred way of uploading a book is as a zip or tar of tiffs or jpegs, because then we can derive them into everything. You can see which formats will derive into what on this page: http://www.archive.org/help/derivatives.php
Are there any Kindle versions available which are more "refined" than their ePub counterparts?
Yes, if someone uploads a custom-made mobi with their item, that one will be served instead of a kindlegen conversion.
Are the new Kindles capable of reading ePub (or HTML?) files directly?
As far as we know from our contact at Amazon, no. (This is also confirmed by the product website.)
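The serving rule described above (prefer an uploaded .mobi, otherwise derive one from the epub at request time) might look something like the sketch below. This is not the Archive's actual code: the function name, the directory layout, and the injected `convert` callable (standing in for whatever conversion tool is really used) are assumptions for illustration.

```python
import tempfile
from pathlib import Path
from typing import Callable

def mobi_for_request(item_dir: Path,
                     convert: Callable[[Path, Path], None]) -> Path:
    """Return the path of the .mobi file to serve for one request."""
    # An uploaded, custom-made .mobi always wins over a conversion.
    uploaded = sorted(item_dir.glob("*.mobi"))
    if uploaded:
        return uploaded[0]
    epubs = sorted(item_dir.glob("*.epub"))
    if not epubs:
        raise FileNotFoundError(f"no .mobi or .epub in {item_dir}")
    # Derive outside the item directory, so that a generated file is
    # never mistaken for an uploaded one on a later request.
    out_path = Path(tempfile.mkdtemp()) / (epubs[0].stem + ".mobi")
    convert(epubs[0], out_path)  # hypothetical epub-to-mobi converter
    return out_path
```

Regenerating on every request trades processing time for disk space, which is the trade-off raised earlier in the thread.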
Let me know if you have any more questions! Alex -- Alex Buie Network Coordinator / Server Engineer KWD Services, Inc Media and Hosting Solutions +1(703)445-3391 +1(480)253-9640 +1(703)919-8090 abuie@kwdservices.com On Wed, Sep 28, 2011 at 3:10 PM, Lee Passey <lee@novomail.net> wrote:
On Wed, September 28, 2011 10:10 am, Jim Adcock wrote:
[snip]
Just double-checking, the "Kindle" books on archive.org are not encrypted, which is a good thing, and how it should be, so users can use them for other than just Kindles.
Last I remembered, the "Kindle" format was identical to the MobiPocket ".mobi" format, which was just a dumbed-down ePub using a different compression format. (Somewhere in this mess I have the source for a program I wrote that decompresses the ".mobi" format into its component parts...) Part of the ePub-to-mobi conversion process is to convert styles, both inline and CSS, into regular HTML tags, as MobiPocket didn't recognize styles in any form.
My suspicion is that the Archive.org people simply grabbed KindleGen, or a similar command-line program, and did a mass conversion of all their ePubs to Kindle. As an alternative, a more sophisticated interface could generate the Kindle pubs on demand when a Kindle version did not already exist ... or even in every case if they have more processing power than disk space.
So for me, this raises the interesting (although in no way important) questions:
Are there any publications on archive.org where there is a Kindle version but no ePub?
Are there any Kindle versions available which are more "refined" than their ePub counterparts?
Are the new Kindles capable of reading ePub (or HTML?) files directly?
Inquiring minds want to know ...

On Thu, September 29, 2011 7:14 am, Alex Buie wrote: [snip]
You can see which formats will derive into what on this page: http://www.archive.org/help/derivatives.php
[snip]
Let me know if you have any more questions!
I note that there is no derivation path starting from ePub. All of the supported formats can be derived from an ePub, and it seems to me that an ePub file has the greatest fidelity across the board. When will a derivation path from ePub be created?

ePub is never derived into any other format (currently). All our text formats (txt, html, epub [and consequently MOBI], and pdf) for books we scan are derived from the _images.tar/.zip (page scans) and the abbyy.gz/xml (OCR data). Alex -- Alex Buie Network Coordinator / Server Engineer KWD Services, Inc Media and Hosting Solutions +1(703)445-3391 +1(480)253-9640 +1(703)919-8090 abuie@kwdservices.com On Thu, Sep 29, 2011 at 10:18 AM, Lee Passey <lee@novomail.net> wrote:
On Thu, September 29, 2011 7:14 am, Alex Buie wrote:
[snip]
You can see which formats will derive into what on this page: http://www.archive.org/help/derivatives.php
[snip]
Let me know if you have any more questions!
I note that there is no derivation path starting from ePub. All of the supported formats can be derived from an ePub, and it seems to me that an ePub file has the greatest fidelity across the board. When will a derivation path from ePub be created?

Are the new Kindles capable of reading ePub (or HTML?) files directly?
Well, all of the Kindles have been capable of reading HTML files from day one -- if one accesses that HTML file via the included "experimental" HTML browser. Now, the early Kindle HTML browsers really stank, but the more recent HTML browsers seem to do what they need to do "pretty well." For example they have an "article reading mode" where you can go to a newspaper web site, find an article you want to read, switch to "article reading mode" and the article more-or-less shows up as if one were reading natively from a mobi file.
What you cannot do is, say, side-load an HTML file onto a Kindle using USB and expect the Kindle to read it as if it were a supported "native" file format. One would need to use Kindlegen, for example, to convert that HTML (or unencrypted ePub) file to .mobi format before side-loading it onto a Kindle.
Kindle "native" doc file support is only for: AZW, TXT, PDF, and unencrypted MOBI (aka PRC). They also claim support for a bunch of MSFT file formats, but that is really only via "on the fly" file format conversion using MSFT Office file format conversion tools (assuming you have MSFT Office).

Last I remembered, the "Kindle" format was identical to the MobiPocket ".mobi" format, which was just a dumbed-down ePub using a different compression format.
http://en.wikipedia.org/wiki/Comparison_of_e-book_formats
"Kindle" .azw format is "identical" to .mobi except that it uses a proprietary encryption format. Mobi *way* predates ePub, so if you will, ePub could be considered a "more modern" format than Mobi, rather than Mobi being a "dumbed down" version of ePub. In terms of rendering, Mobi format can be, and often is, faster, in part because it relies on "lazy evaluation" -- which may not always result in correct rendering at the start of a page when a document is opened "in the middle", but which potentially (and in practice) can lead to much faster loads of documents than ePub.
ePub in turn *does not* specify a "standard" encryption format. Many vendors follow the Adobe encryption standard [and license software renderers etc. from Adobe] which potentially allows their encrypted ePubs to be compatible and readable on each other's devices. Apple I believe uses their own incompatible encryption scheme.
While both Mobi and ePub are specified nowadays based on IDPF, in practice the rendering implementations interpret the guide structure standards differently, resulting in practical incompatibilities re display of "index" information, such that naïve conversion of ePub to Mobi doesn't result in a usable "index" on most Mobi rendering devices [read: Kindle, in practice, for most people today]. [My read of the IDPF standards is that most ePub authors don't implement index info correctly in the first place, relying instead on messed-up Adobe proprietary conventions for displaying guide info in lieu of a "correct" index implementation -- but we've had this argument before]
If you read the IDPF stuff carefully [and if one reads the HTML stuff carefully] it really doesn't say how most of this stuff needs to be "correctly" rendered, simply that it needs to be correctly parsed, allowing device manufacturers to simply ignore that which they don't want to implement. Examples being color support, or lack thereof, and what fonts, if any, are supported, and how. Device support for generic fonts, for example, tends to be horribly "messed up", ill-implemented, and ill-supported, such that HTML documents originally designed for the desktop, with "very carefully specified" font schemes, still end up being rendered horribly messed-up when auto-converted to ePub or Mobi and rendered on a particular device. Not to imply that even IF that "desktop" HTML document *were* "correctly" rendered on an ePub or Mobi device that the end result would necessarily represent any useful result. It probably wouldn't be -- which is why the device manufacturers tend to ignore font specifications in the first place.
"EPUB Straight to the Point" by Elizabeth Castro is one more-or-less useful intelligent book on these issues. "How to Create an eBook with Adobe® InDesign® CS5" by Rufus Deuchler also has some intelligent thoughts [not to imply that using InDesign is one of them]. His page http://rufus.deuchler.net/2010/10/css-and-xhtml-tags-for-epub.html is an intelligent discussion of the problems -- and a decent example ePub (that can also automagically be converted to mobi by Amazon Kindle on the Desktop Emulator program). He says something intelligent there which is frequently ignored by people who think they know HTML and/or eBook authoring: "The first lesson in ePub [and Mobi] is that designers and publishers have very little control over the formatting of their eBooks." [And they would be better off if they didn't try to assert such control in the first place!]
One common mistake one sees in practice at PG and other places is authors and implementers performing horrendous feats of HTML, "secure" [incorrectly] in the knowledge that they "know" the HTML standard and thus that what looks cool on their desktop browser and OS is going to automatically look cool on other people's desktops, browsers, laptops, netpads, tablets, and ebook readers. On the contrary, the more effort they put into making the HTML "look cool" on their desktop the more likely it will look horribly screwed up on almost everyone else's desktops, browsers, laptops, netpads, tablets, and ebook readers! KISS.
PS: Trying to implement "Drop Caps" in HTML, ePub, and/or Mobi being one common example of this disease.
PPS: Trying really fancy and complicated coding of Poetry being yet another example!
PPPS: Even trying to specify paragraph indent style, and/or blank line or lack thereof between paragraphs is frequently messed up!

On Thu, September 29, 2011 12:11 pm, Jim Adcock wrote:
Last I remembered, the "Kindle" format was identical to the MobiPocket ".mobi" format, which was just a dumbed-down ePub using a different compression format.
Not a bad overview, but missing the seminal PalmDOC format; IMO, no discussion of formats can be considered complete without including PalmDOC. (Maybe I ought to create a wikipedia account and go edit it myself.)
"Kindle" .azw format is "identical" to .mobi except that it uses a proprietary encryption format. Mobi *way* predates ePub, so if you will, ePub could be considered a "more modern" format than Mobi, not Mobi is a "dumbed down" version of ePub.
I apologize for any confusion. I didn't mean to imply that the Mobipocket format was in any way /developed/ from the ePub format. Rather, the two formats appear to have been independently and simultaneously created and are similar in many ways; the Mobipocket format is clearly the inferior of the two (although still quite capable). If you were to take the HTML portion of a very full-featured ePub file, convert it to .mobi format, and then extract the HTML from the .mobi version, you would see that it likely has lost some fidelity. Thus, .mobi can be said to be a "dumbed-down" version of ePub, just as GutenText can be said to be a "dumbed-down" version of nearly everything else, despite being first-in-time. I was talking about capability, not chronology.
(What follows is a somewhat pedantic history of e-book file formats, based upon my own experiences and observations. Stop reading now if you don't want to be bored out of your gourd. You have been warned.)
It all started back in 1996, when Rick Bram developed a method to compress text files for the Palm OS. The Palm Pilot didn't have what we would consider a file system, but instead stored all its data in database records (hence the extension .pdb, which was appropriate for any Palm DataBase). Rick broke the text files into 4k chunks (which was the database record size on the Palm) and compressed each chunk separately using a minor variation on the LZ77 compression method. He then added a header record which contained a minimal amount of book metadata, and a map of all the "chunks" that made up the complete file (file metadata). He called the format "Palm Doc".
Palm Doc was wildly successful, given the relatively small segment of the populace that was even using e-books at the time, but it didn't take long for people to run into the obvious limitations of simplistic text.
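The chunk-and-compress layout described above can be sketched in a few lines. This is only an illustration of the idea, not the real Palm Doc format: zlib stands in for the LZ77 variant (it is a different, though related, compression scheme), and the dictionary used for the header is a hypothetical layout rather than the binary .pdb record structure.

```python
import zlib

RECORD_SIZE = 4096  # the 4k record size mentioned above

def split_into_records(text, record_size=RECORD_SIZE):
    """Break the raw text bytes into fixed-size chunks."""
    return [text[i:i + record_size] for i in range(0, len(text), record_size)]

def pack_book(title, text):
    """Compress each chunk separately and build a header that maps them."""
    # zlib is a stand-in here, NOT the Palm Doc compression method.
    records = [zlib.compress(chunk) for chunk in split_into_records(text)]
    header = {
        "title": title,                           # minimal book metadata
        "text_length": len(text),
        "record_count": len(records),
        "record_sizes": [len(r) for r in records],  # the "map" of chunks
    }
    return {"header": header, "records": records}

def read_record(book, index):
    """Decompress one chunk without touching any of the others."""
    return zlib.decompress(book["records"][index])

def unpack_book(book):
    """Reassemble the full text from all records, in order."""
    count = book["header"]["record_count"]
    return b"".join(read_record(book, i) for i in range(count))
```

Because every chunk is compressed on its own, a reader can jump to any 4k stretch of the book by decompressing a single record, which matters on a device with very little memory.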
Peanut Press (which later became Palm Digital Reader, which later became eReader, which was then bought out by Barnes & Noble) invented a markup language known as the Palm Markup Language, which brought a certain amount of presentational "goodness" to its offerings. The Peanut (later eReader) format compressed an entire PML file using the LZ algorithm, and then encrypted the resulting file with the credit card information that was used to purchase the book, as a form of TPM.
At the world's first eBook conference in Gaithersburg, Md., sponsored by the National Institute of Standards and Technology (NIST) in October 1998, Microsoft Vice President for Emerging Technology Dick Brass proposed the creation of an open, nonproprietary standard for eBooks based on the HTML and Extensible Markup Language (XML) specifications. During 1999, the Open eBook Authoring Group worked to draft a publication structure, which was subsequently released to the public in August 1999 as OEBPS ver. 1.0. Following the release of OEBPS 1.0, the Open eBook Forum (OEBF, now the International Digital Publishing Forum, IDPF, an organization which was never open, nor a forum) was formally incorporated in January 2000.
Refinement of the Open eBook Publication Structure has continued in fits and starts ever since that time. Version 1.0.1 of the specification was released in July 2001, version 1.2 was released in August 2002, and version 2.0 was released in September 2007, with a maintenance release (2.0.1) in September 2010.
The biggest problem with the OEBPS (in my opinion) was that while it specified a set of files which in the aggregate could almost completely define an e-book, it didn't specify a consumer format, that is, a single file that could be presented by a software User Agent.
This failure was no doubt due to the fact that the OEBF was (and is) dominated by large commercial interests (Microsoft, Palm Digital, Gemstar, Adobe), each of which was already committed to a proprietary user format designed for their own hardware or software. Dick Brass' vision of an open, nonproprietary standard was thus corrupted to a set of specifications that commercial publishers could use to prepare a publication which could then be fed into a specific vendor's final conversion process. In late 2004 and early 2005 Jon Noring and I began agitating for a truly open consumer-oriented e-book format, which we dubbed Open Reader. Our vision was to take the OEBPS suite of files, and a small amount of archive metadata, and archive them using the ZIP archive format, in a manner very similar to the Java JAR format. No doubt coincidentally, the IDPF began a process in late 2005 to specify a consumer e-book format. On October 30, 2006, the IDPF announced the official release of the Open Container Format, which consisted of the OEBPS suite of files, together with a small amount of archive metadata, and possibly some encryption instructions, archived using the ZIP archive format, and which was intended to be very similar to the Open Office document format. The specification recommended that files built according to the specification be given the ".epub" file extension. Thus, ePub was born. OPS 3.0, now under development, is virtually nothing more than the 2.0.1 version of the specification with the addition of techniques for embedding audio and video into an ePub file. One of the design goals of OPS 3.0 is that it will be _forward compatible_: that is, not only will new User Agents be able to parse and display ePub 2.0 files, old 2.0 User Agents should be able to parse and display ePub 3.0 files (minus, of course, the embedded audio and video). 
Almost as soon as version 1.0 of the specification was published, the e-book pirate community began creating e-books following that specification, and distributing them archived into a single .zip file. It could be argued that "ePubs" have been available since 2000, and the official owner of the specification, the IDPF, only acquiesced to what had become common practice.
Almost concurrently with the incorporation of the OEBF, Mobipocket SA was incorporated in France in 2000, but followed a somewhat different course in designing its e-book format. It also chose to use HTML 3.2 as its markup language, but encoded the entire file using Rick Bram's classic Palm Doc method. Some additional metadata was included, and as a means of TPM, each "chunk" was encrypted by a method which relied on access to a "Device ID." Early Mobipocket files used the .prc extension, which is technically incorrect as it supposedly stands for "Palm Resource Code", i.e. a /program/ for the Palm Pilot, but in the e-book world .prc is almost universally recognized as "Mobipocket v.1."
The follow-on .mobi format ("Mobipocket v.2") continued to use HTML/XHTML as its markup format, but abandoned the chunked LZ77 compression in favor of Huffman encoding. At this point in my career I quit paying attention to Mobipocket, so I have no information on what changes Amazon made to v.2 to create the .azw format ("Mobipocket v.3").
Years ago I was able to use an HTML file as input to a Palm Doc creation program, rename the resulting file from .pdb to .prc, and read it in Mobipocket Reader, with no indication that it was not recognized as a standard Mobipocket file. (You could probably still do that with the Kindle. Does anyone want to try?)
As BowerBird would be quick to point out (if he bothered to read anything I wrote) one of the problems with HTML is that writing a full-featured display engine is /hard/! (Just ask David Jean, the inventor of µBook).
Luckily, for most purposes e-books can get away with using a subset of the HTML element set. So the Mobipocket engineers built a display engine that recognized only those HTML elements which they felt appropriate for e-books, and completely ignored style attributes, whether specified inline or through cascading style sheets. They also settled on the HTML 4 specification, relying on many deprecated elements.
Back in those days, Mobipocket claimed that it was completely CSS and OEBPS compliant. Nevertheless, when I created a PDB file from HTML that used an element such as <div style="text-align:center">some text</div>, Mobipocket Reader did not center the text. But when I ran the file through the Mobipocket Creator program, the text became centered. Extracting the text using the prc2html program (which also extracts files from unencrypted .mobi files), I discovered that the subject text had been converted to <center>some text</center> by the conversion program. Further experimentation revealed that the Mobipocket reader did not recognize /any/ style attributes. When a style had a corresponding HTML 4 element or attribute, the Mobipocket Creator program converted the style attribute to the corresponding HTML 4 element; otherwise the styling was /left in the file/ but ignored by the reader.
For a relatively complete description of the .mobi format, see http://wiki.mobileread.com/wiki/MOBI. This article focuses primarily on the file format and discusses the Mobipocket markup only briefly, although it does point out that "you only get the full range of Mobipocket's formatting capabilities if you have markup written to use Mobipocket's non-standard, extended, and under-documented implementation of HTML 3.2. See: File tag reference on the mobipocket web site (http://www.mobipocket.com/dev/article.asp?BaseFolder=prcgen&File=TagRef_OEB.htm)."
The Mobipocket page referenced here has a good comparison of the markup supported by Mobipocket with the markup supported by the first version of the OEB Publication Structure. I quit paying much attention to Mobipocket a couple of years before it was acquired by Amazon, so I don't know how the reader, format and development suite have evolved since that time. This is one of the reasons I continue to ask whether the Kindle can parse/display HTML natively, other than by using the web browser, because that would suggest that the old Mobipocket display engine may have been replaced/upgraded. Obviously, Amazon has added yet more metadata to the file, may be using a different compression scheme (although from what has been said here, maybe not) and has added a new TPM mechanism. Perhaps no one else here is interested, but /I/ would be interested if anyone could provide me with more technical details about how the Mobipocket format has evolved since the Amazon acquisition.
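For anyone who wants to poke at old .pdb/.prc files, here is a minimal sketch of a decoder for the Palm Doc byte-pair (LZ77-style) compression mentioned above. It assumes the commonly documented Palm Doc encoding and handles one already-extracted, unencrypted text record; the function name palmdoc_decompress is my own, and the sketch does not parse the PDB record headers or any Mobipocket metadata.

```python
def palmdoc_decompress(data: bytes) -> bytes:
    """Decode one Palm Doc compressed text record (sketch, not a full reader)."""
    out = bytearray()
    i = 0
    n = len(data)
    while i < n:
        c = data[i]
        i += 1
        if c == 0x00 or 0x09 <= c <= 0x7F:
            # Plain literal byte.
            out.append(c)
        elif 0x01 <= c <= 0x08:
            # The next c bytes are copied through unchanged.
            out += data[i:i + c]
            i += c
        elif 0x80 <= c <= 0xBF:
            # Two-byte back-reference: 11 bits of distance, 3 bits of length.
            if i >= n:
                raise ValueError("truncated back-reference")
            pair = ((c << 8) | data[i]) & 0x3FFF
            i += 1
            distance = pair >> 3
            length = (pair & 0x07) + 3
            if distance == 0 or distance > len(out):
                raise ValueError("back-reference points outside the output")
            # Copy byte by byte so overlapping references work.
            for _ in range(length):
                out.append(out[-distance])
        else:
            # 0xC0..0xFF encodes a space followed by (c XOR 0x80).
            out.append(0x20)
            out.append(c ^ 0x80)
    return bytes(out)
```

A real file is a sequence of such records (the "chunks" described above), so a complete reader would decode each text record in turn and concatenate the results.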

If you were to take the HTML portion of a very full-featured ePub file, convert it to .mobi format, and then extract the HTML from the .mobi version, you would see that it has likely lost some fidelity.
Contrary to what other people here might want to say, whenever you change eBook file formats you potentially, and almost always in practice, lose fidelity. In practice PG does use a path HTML -> ePub -> mobi, so mobi will "lose fidelity" in any case. But the major loss of fidelity is typically HTML -> ePub -- because the HTML has been written with an implied target of the original PG "author"'s favorite desktop or laptop machine running their favorite HTML browser, with their favorite font, display sizing, display gamma, HTML window opening size etc. Most authors don't seem to understand that they are accidentally or intentionally targeting one particular machine, which means that other "HTML" rendering machines -- including ePub and mobi -- often will not display this HTML as implicitly intended.
====
...I discovered that the subject text had been converted to
<center>some text</center> by the conversion program.
====
Careful reading of the IDPF and/or HTML specifications shows that the rendering process can include the "compiler" program. Thus for example "Kindlegen+Kindle" can be considered an HTML machine, and/or an ePub machine. The fact that Kindle doesn't support a particular style element isn't particularly important if Kindlegen *does.*
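To make the "compiler as part of the rendering machine" idea concrete, here is a toy sketch of the one conversion reported above: a centered div rewritten to the deprecated <center> element, with any other styling left in the file untouched. This is not how Mobipocket Creator or Kindlegen actually work internally (that is not documented here); the function name styles_to_elements and the regex approach are assumptions for illustration, and the pattern does not handle nested divs or styles with more than one property.

```python
import re

# Matches a div whose only style is text-align:center (hypothetical, simplified).
_CENTER_DIV = re.compile(
    r'<div\s+style="\s*text-align\s*:\s*center\s*;?\s*"\s*>(.*?)</div>',
    re.IGNORECASE | re.DOTALL,
)


def styles_to_elements(html: str) -> str:
    """Rewrite centered divs as <center>; leave all other markup as it is."""
    return _CENTER_DIV.sub(r"<center>\1</center>", html)


if __name__ == "__main__":
    print(styles_to_elements('<div style="text-align:center">some text</div>'))
    print(styles_to_elements('<div style="color:red">other text</div>'))
```

The second call returns its input unchanged, which mirrors the observation that styles without a corresponding element stay in the file and are simply ignored by the reader.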
Perhaps no one else here is interested, but /I/ would be interested if anyone could provide me with more technical details about how the Mobipocket format has evolved since the Amazon acquisition.
I think many people in the Mobi dev community would be very interested in the "technical details" of the Mobipocket format, but Amazon has been woefully unforthcoming, and furthermore the documentation they do provide is often woefully wrong. Thus, for example, discovering which glyphs a particular Kindle version actually supports has had to be done by exhaustive testing. This is about it when it comes to Amazon-provided documentation (and even much of that is wrong): http://kindlegen.s3.amazonaws.com/AmazonKindlePublishingGuidelines.pdf
participants (6)
- Alex Buie
- Bowerbird@aol.com
- James Adcock
- Jim Adcock
- Keith J. Schultz
- Lee Passey