Calibre: Open Source Software for Managing eBook Collections

http://chronicle.com/blogs/profhacker/calibre-revisited/30407 ". . .Calibre allows you to convert content from various internet sources such as Project Gutenberg into the appropriate format for your e-reader, whether it is a Kindle, a Nook, a Sony, or something else."

". . .Calibre allows you to convert content from various internet sources such as Project Gutenberg into the appropriate format for your e-reader, whether it is a Kindle, a Nook, a Sony, or something else."
What I find is: Calibre is very slow and takes a fair amount of work to use in practice. It converts the input file format to its own internal format, and then to the designated output file format, and in the process seems to apply a bunch of heuristics and assumptions which don't seem to work out very well in practice for me. For example if I send a large set of Unicode code points into it, a smaller set of Unicode code points comes back out of it -- which I don't understand. I would have thought that "Unicode is Unicode" and that Calibre would pass it through unmolested. It crashes or hangs for me on very big and complicated stuff. Not to imply that it doesn't work more-or-less on simple stuff, and many people are using it happily. (I used Calibre a lot to "check things out" but have decided I can't rely on it for that purpose, because it introduces a lot of its own bugs and assumptions while making file format transformations.)

On Sat, Feb 5, 2011 at 10:21 AM, Jim Adcock <jimad@msn.com> wrote:
For example if I send a large set of Unicode code points into it, a smaller set of Unicode code points comes back out of it -- which I don't understand. I would have thought that "Unicode is Unicode" and that Calibre would pass it through unmolested.
I don't know what you're seeing, but that's not unexpected in some cases. If it applied NFD and decomposed the characters, you would get the base characters + a set of combining characters out. It could also translate special spaces to normal ones, or even impose NFKC or NFKD, which would normalize all sorts of characters--that's questionable, because it turns things like ² into 2, but it may be expected by some programs. -- Kie ekzistas vivo, ekzistas espero.
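These normalization effects are easy to see for yourself; a minimal sketch using Python's standard unicodedata module (the sample string is an arbitrary illustration):

import unicodedata

s = "café ² x\u00a0y \ufb01n"  # precomposed é, superscript two, no-break space, fi ligature

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    t = unicodedata.normalize(form, s)
    print(form, len(t), " ".join(f"U+{ord(c):04X}" for c in t))

# NFD splits é (U+00E9) into e (U+0065) plus combining acute (U+0301);
# the string gets longer but should render the same.
# NFKC and NFKD additionally apply compatibility mappings: the superscript
# two (U+00B2) becomes a plain 2, the no-break space (U+00A0) becomes a
# plain space, and the fi ligature (U+FB01) becomes the two letters "fi".

A converter that imposes NFC shrinks the set of distinct code points by composing sequences back together; one that imposes NFKC or NFKD also throws away typographic distinctions, which matches the ²-into-2 behavior described above.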

I don't know what you're seeing, but that's not unexpected in some cases.
What I see is that a lot of code points which were rendered correctly going into Calibre come out not being rendered correctly. Like maybe it's trying to do some code page mapping on everything, or something, and the algorithm isn't working correctly?

On 02/05/2011 07:21 PM, Jim Adcock wrote:
Calibre is very slow and takes a fair amount of work to use in practice.
Calibre might be a good choice for the end user. There are a lot of knobs you can turn. For mass conversion it is too fat and too slow, and the knobs won't help you any unless you can turn them into a position that works for all books. And good luck with that.
It converts the input file format to its own internal format, and then to the designated output file format, and in the process seems to apply a bunch of heuristics and assumptions which don't seem to work out very well in practice for me. For example if I send a large set of Unicode code points into it, a smaller set of Unicode code points comes back out of it -- which I don't understand. I would have thought that "Unicode is Unicode" and that Calibre would pass it through unmolested.
Not necessarily. For every accented character there are a precomposed and a decomposed form. If you use the precomposed forms, your character repertoire needs a lot more code points (one distinct code point for each accented character), while in the decomposed form you need only one code point for each character stripped of its accent plus one for each accent. The decomposed form might get better results on the limited fonts you find in ereaders. OTOH, precomposed chars as a rule look better than decomposed chars because the font designer can put the accent in the exact right place, while the decomposed accents get placed algorithmically (and YMMV). -- Marcello Perathoner webmaster@gutenberg.org
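The repertoire difference is easy to demonstrate; a small Python sketch (the sample letters are chosen arbitrarily):

import unicodedata

precomposed = "éàüñ"                          # one code point per accented letter
decomposed = unicodedata.normalize("NFD", precomposed)

print(len(precomposed), len(decomposed))      # 4 vs 8: decomposed strings are longer...

# ...but they draw on a smaller repertoire: base letters plus a handful
# of combining marks that cover every letter they can attach to.
for ch in decomposed:
    if unicodedata.combining(ch):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# COMBINING ACUTE ACCENT, COMBINING GRAVE ACCENT,
# COMBINING DIAERESIS, COMBINING TILDE

Whether that smaller repertoire actually helps depends on the renderer placing combining marks at all, which is exactly the point disputed below.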

On Sat, Feb 5, 2011 at 10:40 AM, Marcello Perathoner <marcello@perathoner.de> wrote:
OTOH, precomposed chars as a rule look better than decomposed chars because the font designer can put the accent in the exact right place, while the decomposed accents get placed algorithmically (and YMMV).
I don't know about ebook readers, but many fonts will simply map a decomposed sequence that has a precomposed equivalent onto that precomposed character, so the two forms look the same. -- Kie ekzistas vivo, ekzistas espero.

On 2/5/2011 11:40 AM, Marcello Perathoner wrote:
The decomposed form might get better results on the limited fonts you find in ereaders.
I tend to doubt this, although I have little or no evidence one way or the other. My rather limited experience is that support in User Agents for re-composing characters is rare; rather than getting é you would get e or some such. I would love to be proven wrong, however. Can anyone offer evidence of how various User Agents treat decomposed characters?
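One way to gather such evidence is to generate a test page that shows the same text in both forms side by side; a quick sketch (the file name and sample text are arbitrary):

import unicodedata

sample = "naïve café façade éàüñç"
nfc = unicodedata.normalize("NFC", sample)
nfd = unicodedata.normalize("NFD", sample)

html = f"""<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>NFC vs NFD test</title></head>
<body>
<p>NFC (precomposed): {nfc}</p>
<p>NFD (decomposed): {nfd}</p>
</body></html>"""

# Load this page (or an EPUB built from it) on each User Agent under
# test: if the two lines render identically, it handles decomposed
# characters; missing or floating accents on the second line mean it
# does not.
with open("nfc_nfd_test.html", "w", encoding="utf-8") as f:
    f.write(html)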

For all but a quite small subset of end users, that reads like a non sequitur to me.

On Sat, Feb 5, 2011 at 10:40 AM, Marcello Perathoner <marcello@perathoner.de> wrote:
On 02/05/2011 07:21 PM, Jim Adcock wrote:
Calibre is very slow and takes a fair amount of work to use in practice.
Calibre might be a good choice for the end user. There are a lot of knobs you can turn.
For mass conversion it is too fat and too slow, and the knobs won't help you any unless you can turn them into a position that works for all books. And good luck with that.

I have found it works great if the input format has a fair amount of info to work with ... ie, mobi or epub. Failure tends to happen more often when the input format is text or pdf.

I have found it works great if the input format has a fair amount of info to work with ... ie, mobi or epub. Failure tends to happen more often when the input format is text or pdf.
Sorry, let me show y'all what I am talking about. I put a bunch of example files up at: http://www.freekindlebooks.org/Dev/compare

The file names are intended to be indicative of the format translations and the tools used to transform them:

html2html.html is the null-op transformation of html to html using file copy -- ie this is the source file I am using.
khtml2mobi.mobi is using Kindlegen to transform html to mobi (Kindle format).
chtml2mobi.mobi is using Calibre to transform html to mobi.
chtml2epub.epub is using Calibre to transform html to epub.
cmobi2epub.epub is using Calibre to transform mobi to epub.

Looking at html2html.html, khtml2mobi.mobi, and chtml2mobi.mobi on modern devices [Kindle 3rd Gen, Sony Pocket Reader] and/or desktop emulators shows that these files are "doing the right thing", ie they actually display most of the Unicode code points, which is what one would expect from a modern font implementation that covers most of Unicode. However, displaying on the same devices and/or desktop emulators shows that chtml2epub.epub and cmobi2epub.epub do not correctly display most of Unicode, but rather substitute the question mark char '?' -- NOT even [?], ie question-mark-in-a-box, which is the typical display of missing-glyph on Unicode-compatible devices.

Conclusion: Calibre seems to be breaking down re correctly outputting many [most?] Unicode code points when outputting in epub format. Outputting in mobi format it seems to do better.
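For anyone who wants to reproduce the matrix, both converters are scriptable from the command line; a sketch using Python's subprocess (the source file name is an assumption, and flags beyond the basic input/output arguments are deliberately avoided):

import shutil
import subprocess

SRC = "source.html"  # the Unicode-rich source file (hypothetical name)

# html2html.html: null-op copy, the baseline for comparison
shutil.copyfile(SRC, "html2html.html")

# khtml2mobi.mobi: Kindlegen html -> mobi; kindlegen returns a nonzero
# exit code on mere warnings, so don't treat its exit code as fatal
subprocess.run(["kindlegen", SRC, "-o", "khtml2mobi.mobi"], check=False)

# chtml2mobi.mobi / chtml2epub.epub: Calibre html -> mobi / epub
subprocess.run(["ebook-convert", SRC, "chtml2mobi.mobi"], check=True)
subprocess.run(["ebook-convert", SRC, "chtml2epub.epub"], check=True)

# cmobi2epub.epub: Calibre mobi -> epub, chained off Calibre's own mobi
subprocess.run(["ebook-convert", "chtml2mobi.mobi", "cmobi2epub.epub"], check=True)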

I have found it works great if the input format has a fair amount of info to work with ... ie, mobi or epub. Failure tends to happen more often when the input format is text or pdf.
I vaguely remember a thread somewhere discussing the fact that some of the e-readers do not handle anything beyond latin-1 characters, and that calibre therefore tried to deprecate certain code points. Might want to follow up as a bug in the calibre forums.

On 02/07/2011 06:10 PM, Jim Adcock wrote:
However, displaying on the same devices and/or desktop emulators shows that chtml2epub.epub and cmobi2epub.epub do not correctly display most of Unicode, but rather substitute the question mark char '?' -- NOT even [?], ie question-mark-in-a-box, which is the typical display of missing-glyph on Unicode-compatible devices.
Next step would be to unzip the epub and compare it with the html. -- Marcello Perathoner webmaster@gutenberg.org
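An EPUB is just a ZIP container, so the comparison can be done mechanically; a sketch that diffs the sets of code points in the source HTML against those found inside the EPUB (file names are the ones from the test matrix above, and the contents are assumed to be UTF-8):

import zipfile

def codepoints(text):
    """Return the set of code points used in a string."""
    return {ord(c) for c in text}

with open("html2html.html", encoding="utf-8") as f:
    src = codepoints(f.read())

out = set()
with zipfile.ZipFile("chtml2epub.epub") as z:
    for name in z.namelist():
        if name.endswith((".html", ".xhtml", ".htm")):
            out |= codepoints(z.read(name).decode("utf-8"))

# Code points present in the source but absent from the EPUB's markup
# were dropped or remapped during conversion. (Markup differences add
# noise; restricting the diff to code points above U+007F helps, and
# characters rewritten as numeric entities will show up as false
# positives unless the entities are unescaped first.)
for cp in sorted(c for c in src - out if c > 0x7F):
    print(f"U+{cp:04X}")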

Next step would be to unzip the epub and compare it with the html.
Good point. I looked at the html in the epub and the codes ARE there, and I can't see anything wrong with the opf and other files, but if I try to display this Calibre html->epub on a variety of epub devices, no go, and if I use kindlegen to cross-compile the Calibre EPUB to MOBI, that doesn't work either -- so I don't know what's wrong with the Calibre html->epub generation! Something not obvious, obviously.

Excuse me?? You unzip the Calibre output and the html was correct? And the codes still do not show up right on the devices? Then I WOULD say it is not Calibre's fault!! The devices are not doing their work right would be my assumption. Then again, I have nothing but what you have stated. Yet, I do refer to an earlier post by BB which I think wraps it up nicely. regards Keith. On 07.02.2011 at 20:13, Jim Adcock wrote:
Next step would be to unzip the epub and compare it with the html.
Good point. I looked at the html in the epub and the codes ARE there, and I can't see anything wrong with the opf and other files, but if I try to display this Calibre html->epub on a variety of epub devices, no go, and if I use kindlegen to cross-compile the Calibre EPUB to MOBI, that doesn't work either -- so I don't know what's wrong with the Calibre html->epub generation! Something not obvious, obviously.

You unzip the calibre output and the html was correct?
No, what I said was that I didn't see anything *obviously* wrong there. If you, or others, actually play with Calibre, and assign it real-world "reasonable" tasks, and see the large and interesting variety of ways that it fails, then you will understand my, and others', reluctance to assume that an output from Calibre is representative of anything. http://www.idpf.org/ is the source of the descriptions of what EPUB is and what the file format can and cannot do, if anyone actually cares to try to figure it all out. It would be a gift to the PG community if anyone could come up with even a simple tool that reliably takes even one HTML file and, correctly following all the IDPF rules, converts that HTML into an EPUB that even begins to work on most EPUB machines. It is NOT as simple as hacking an existing EPUB and inserting your unmangled HTML into the zip file -- as anyone who has tried this approach on non-trivial HTML files has found. Is there already such a tool on the internet? Not that I've found -- and god knows I've gone looking.
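To be fair, the raw container mechanics are the easy half; here is a minimal sketch of an EPUB 2 builder using only Python's standard library, which assumes away the genuinely hard part: the input must already be a single well-formed XHTML file with device-safe CSS (all file names, titles, and the identifier are placeholders):

import zipfile

CONTAINER = """<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>"""

OPF = """<?xml version="1.0" encoding="utf-8"?>
<package xmlns="http://www.idpf.org/2007/opf" unique-identifier="bookid" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>Placeholder Title</dc:title>
    <dc:language>en</dc:language>
    <dc:identifier id="bookid">urn:uuid:00000000-0000-0000-0000-000000000000</dc:identifier>
  </metadata>
  <manifest>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
    <item id="text" href="book.xhtml" media-type="application/xhtml+xml"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="text"/>
  </spine>
</package>"""

NCX = """<?xml version="1.0" encoding="utf-8"?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <head><meta name="dtb:uid" content="urn:uuid:00000000-0000-0000-0000-000000000000"/></head>
  <docTitle><text>Placeholder Title</text></docTitle>
  <navMap>
    <navPoint id="p1" playOrder="1">
      <navLabel><text>Start</text></navLabel>
      <content src="book.xhtml"/>
    </navPoint>
  </navMap>
</ncx>"""

def build_epub(xhtml_path, epub_path):
    with zipfile.ZipFile(epub_path, "w") as z:
        # The mimetype entry must be first in the archive and stored
        # uncompressed, per the OCF container spec.
        z.writestr("mimetype", "application/epub+zip", zipfile.ZIP_STORED)
        z.writestr("META-INF/container.xml", CONTAINER, zipfile.ZIP_DEFLATED)
        z.writestr("OEBPS/content.opf", OPF, zipfile.ZIP_DEFLATED)
        z.writestr("OEBPS/toc.ncx", NCX, zipfile.ZIP_DEFLATED)
        with open(xhtml_path, "rb") as f:
            z.writestr("OEBPS/book.xhtml", f.read(), zipfile.ZIP_DEFLATED)

build_epub("book.xhtml", "book.epub")

Something like this only works because everything interesting was assumed away; splitting long books into chapter files, repairing non-XHTML markup, and pruning unsupported CSS is where the real work is.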

On 12.02.2011 at 21:42, Jim Adcock wrote:
You unzip the calibre output and the html was correct?
No, what I said was that I didn't see anything *obviously* wrong there.
Took a long time for you to answer! What I was saying is that you yourself said the HTML was O.K. You complained that it was not being displayed on the devices correctly. So logically, the devices must not be handling the format correctly. Maybe they are just using a subset of the format?
HTML is always trivial. regards Keith.

Hi Jim, FIRST OFF, PLEASE learn to cite correctly. That is, at least write who you are citing! Yes, I have read them! The only problem is you have to be able to understand the formalistic language, just like reading legal documents: a lot of formalistic rigamarole. Now, to use HTML you do not need to read the standard, just as you do not need to read all the laws passed by Congress to abide by them! Now back to epub correctness. A file in epub format can be validated by an XML validator. If it validates, then it should show correctly on your devices! If not, then your device is doing something wrong! As far as Unicode is concerned: the full Unicode repertoire does not need to be displayed, nor all glyphs. The devices/readers should only display an alternative that represents that it cannot display that glyph. So the problems you are seeing do not necessarily reflect a wrong encoding by Calibre, but may be an inability of the device to display Unicode! regards Keith. On 13.02.2011 at 03:38, James Adcock wrote:
HTML is always trivial.
Excuse me? Have you read even ONE of the many HTML standards???

Now back to epub correctness. A file in epub format can be validated by an XML validator. If it validates, then it should show correctly on your devices! If not, then your device is doing something wrong!
Certainly not a true statement, unless one defines "shows correctly" in a vacuously broad manner. PG/DP HTML/XML authors can and often do write code that only "shows correctly" on at most one browser, and certainly not on epub devices. A common example of this is attempts to implement dropcaps or similar devices.
As far as Unicode is concerned: the full Unicode repertoire does not need to be displayed, nor all glyphs. The devices/readers should only display an alternative that represents that it cannot display that glyph.
The devices are supposed to display a "missing glyph" symbol which is unambiguously different from any valid glyph -- and this is certainly not true of epub implementations based on ADE, including the Sony Readers, which use the valid question mark glyph to also represent the missing-glyph symbol.
So the problems you are seeing do not necessarily reflect a wrong encoding by Calibre, but may be an inability of the device to display Unicode!
Could be true, but Calibre exhibits so many other failure modes, including crashing and hanging forever, as to make it not very usable as an encoding technique for researching these issues.
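For what it's worth, the two claims can at least be separated mechanically: XML well-formedness is cheap to check, while full EPUB conformance needs a real validator such as the IDPF's epubcheck. A sketch of the former, using only the standard library (the file name is from the test matrix above):

import zipfile
import xml.etree.ElementTree as ET

# Parse every XML-ish member of the EPUB; a parse error means the
# member is not even well-formed. Passing this proves nothing about
# conformance to the OPF/NCX/XHTML rules -- epubcheck tests those.
with zipfile.ZipFile("chtml2epub.epub") as z:
    for name in z.namelist():
        if name.endswith((".opf", ".ncx", ".xhtml", ".html", ".xml")):
            try:
                ET.fromstring(z.read(name))
                print(f"well-formed: {name}")
            except ET.ParseError as e:
                print(f"NOT well-formed: {name}: {e}")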

If you haven't already done so, I think you should report this to the Calibre development team: http://calibre-ebook.com/bugs

Walt
participants (10): David Starner, don kretz, James Adcock, Jim Adcock, Keith J. Schultz, Lee Passey, Marcello Perathoner, Me, Michael S. Hart, Walt Farrell