Convert doc to HTML to Kindle and ePub -- help!

I helped an ESL author with his book about Theravada Buddhism, and told him that I would convert his book to Kindle and ePub, and upload to Amazon and Smashwords, for free. Because I need the experience, and because I couldn't possibly charge him if I'm spending hours figuring out how to do it.

So far, I'm stymied. I figured that the appropriate path would be to convert the Word doc to HTML, and from there go to Kindle and ePub. I thought I'd start by using Dreamweaver rather than using Save as HTML in Word. However, my antique copy of Dreamweaver (Dreamweaver 4) is refusing to open the Word 2003 doc. It seems that I can copy and paste from the doc into Dreamweaver. Should I do that?

Should I preformat the work in Word, using Outline to produce various levels of headings?

What should I do about the XE codes that appeared in the document AFTER I sent the finished manuscript to the author? I think that they were introduced by the Sri Lankan publisher. They look like this: { XE "awakened" /i}. Some of them lack the /i. I believe that these are codes that generate an index, yes? An ebook doesn't need an index, does it? Just searching should suffice? Yes? No? Should I just jettison all these codes, or convert them to something in HTML?

What about footnotes? Insert anchors and link, right? Same for TOC, yes?

Dreamweaver output, check with HTMLTidy, then process with Mobipocket Creator and Sigil?

(Yes, I should have a later Dreamweaver. I maintain my zendo's website, and I'm producing it in deprecated code. However, I'm too poor to afford $400 for the latest Dreamweaver.)

I'm not a complete HTML newbie, but my skills are fairly basic. I would appreciate some help from the folks here. I figured I've earned it by proofing 78,000 pages at DP. You could send me to the best site for advice, or send me a private email, or post it here, in case this would be of interest to other volunteers.

Oh, and Happy New Year!

-- Karen Lofstrom Zora on DP

On 01/01/2012 08:03 AM, Karen Lofstrom wrote:
I helped an ESL author with his book about Theravada Buddhism, and told him that I would convert his book to Kindle and ePub, and upload to Amazon and Smashwords, for free. Because I need the experience, and because I couldn't possibly charge him if I'm spending hours figuring out how to do it.
Try this way:
- export html from msword
- run html thru tidy with the msword clean option
- convert to epub and kindle with calibre.
Some hand cleaning of the html may also be necessary. YMMV
-- Marcello Perathoner webmaster@gutenberg.org
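The three steps above can be sketched as a small Python wrapper around the two command-line tools. This is only a sketch under stated assumptions: that `tidy` and calibre's `ebook-convert` are installed and on the PATH, that the Word export was saved as UTF-8, and that tidy's `word-2000` option is the "msword clean option" meant here. The `-clean` suffix and the function names are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_commands(word_html):
    """Return the tidy and ebook-convert command lines for one exported file."""
    src = Path(word_html)
    clean = src.with_name(src.stem + "-clean.html")  # hypothetical naming scheme
    tidy = [
        "tidy", "-q",
        "-utf8",                 # assumes the export was saved as UTF-8
        "-asxhtml",
        "--word-2000", "yes",    # tidy's Word clean-up option
        "-o", str(clean),
        str(src),
    ]
    epub = ["ebook-convert", str(clean), str(src.with_suffix(".epub"))]
    mobi = ["ebook-convert", str(clean), str(src.with_suffix(".mobi"))]
    return [tidy, epub, mobi]


def run_pipeline(word_html):
    """Run tidy, then both conversions. Not called automatically."""
    tidy, *conversions = build_commands(word_html)
    # tidy exits with 1 for warnings and 2 for errors, so only 2 is fatal here.
    if subprocess.run(tidy).returncode >= 2:
        raise RuntimeError("tidy reported errors; hand cleaning may be needed")
    for cmd in conversions:
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("book.html")` would leave `book-clean.html`, `book.epub` and `book.mobi` next to the export. The hand cleaning mentioned above would happen on the cleaned file before the conversions.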

I would also not toss out the index, because as any professional indexer (which I am not) will tell you, an index is much more than just a list of keywords with page numbers. This is of course assuming that they put some thought into their index and it is not, in fact, just a list of keywords with page numbers.

Aaron

On 1/1/12, Marcello Perathoner <marcello@perathoner.de> wrote:
On 01/01/2012 08:03 AM, Karen Lofstrom wrote:
I helped an ESL author with his book about Theravada Buddhism, and told him that I would convert his book to Kindle and ePub, and upload to Amazon and Smashwords, for free. Because I need the experience, and because I couldn't possibly charge him if I'm spending hours figuring out how to do it.
Try this way:
- export html from msword
- run html thru tidy with the msword clean option
- convert to epub and kindle with calibre.
Some hand cleaning of the html may also be necessary. YMMV
-- Marcello Perathoner webmaster@gutenberg.org _______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d

On Sun, Jan 1, 2012 at 7:28 AM, Marcello Perathoner <marcello@perathoner.de> wrote:
- export html from msword
- run html thru tidy with the msword clean option
- convert to epub and kindle with calibre.
Some hand cleaning of the html may also be necessary. YMMV
I thought that msword generated sloppy html? What I've been reading online is that you get better results using something like Dreamweaver to do the reformatting. Also, will msword handle footnotes and indices? -- Karen Lofstrom

On 01/01/2012 09:26 PM, Karen Lofstrom wrote:
On Sun, Jan 1, 2012 at 7:28 AM, Marcello Perathoner <marcello@perathoner.de> wrote:
- export html from msword
- run html thru tidy with the msword clean option
- convert to epub and kindle with calibre.
Some hand cleaning of the html may also be necessary. YMMV
I thought that msword generated sloppy html? What I've been reading online is that you get better results using something like Dreamweaver to do the reformatting.
I don't know if you can export from msword to Dreamweaver.
Also, will msword handle footnotes and indices?
Why not just try? If the results are disappointing you still can go the long way. -- Marcello Perathoner webmaster@gutenberg.org

On Sun, Jan 1, 2012 at 10:52 AM, Marcello Perathoner <marcello@perathoner.de> wrote:
Why not just try? If the results are disappointing you still can go the long way.
Notes disappeared, links to index stayed as XE codes. As I look at them, I see that the publisher used an indexing program that looked for every occurrence of a word, and did not check for importance. Index is worthless without pruning. At least it's a short book ... -- Karen Lofstrom
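For pruning or jettisoning the leftover codes, a regular expression is probably enough. The sketch below assumes the codes survive in the exported text literally in the form quoted earlier in the thread, { XE "awakened" /i}, with or without the trailing switch. In a real Word export the quotes may be escaped or the code wrapped in extra markup, so the pattern would need adjusting. The function name is made up for illustration.

```python
import re

# Matches { XE "term" } with an optional switch such as /i before the brace.
XE_FIELD = re.compile(r'\{\s*XE\s+"([^"]*)"[^{}]*\}')


def strip_xe_codes(text):
    """Remove XE index codes and return (cleaned_text, list_of_terms)."""
    terms = XE_FIELD.findall(text)
    cleaned = XE_FIELD.sub("", text)
    return cleaned, terms
```

Keeping the list of terms means the index entries are not lost: the list can be deduplicated and pruned by hand later if a real index turns out to be worth building.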

On 1/1/2012 1:26 PM, Karen Lofstrom wrote:
On Sun, Jan 1, 2012 at 7:28 AM, Marcello Perathoner <marcello@perathoner.de> wrote:
- export html from msword
- run html thru tidy with the msword clean option
- convert to epub and kindle with calibre.
Some hand cleaning of the html may also be necessary. YMMV
I thought that msword generated sloppy html?
MSWord doesn't produce /sloppy/ HTML, it produces /excessive/ HTML. MSWord is designed to write stuff which will then be printed. When you save HTML from MSWord it assumes you're going to want to load it back into MSWord at some point to be printed. So it saves out /everything/ it needs to convert from MSHTML back to DOCX. This is why you run tidy immediately after exporting from Word. Don't use the "Compact", or "Filtered," or whatever they call it these days, HTML format because that removes the hints that Tidy uses to know that special Microsoft processing is required. Once you get this far, never look back--this process is a one-way street.
What I've been reading online is that you get better results using something like Dreamweaver to do the reformatting.
Forget about Dreamweaver. It has the same fundamentally flawed design philosophy that MSWord has: that you require that the output look exactly the same on every machine in the world. Dreamweaver has only two things working in its favor: it tends to be ubiquitous, and it's better than MSWord. This is what the bard was referring to when he coined the term "damning with faint praise." If you have even passing familiarity with HTML you can do any necessary cleanup by hand using a simple text editor. (On Windows, TextPad and the free Microsoft Visual Web Developer Express are among my favorites; for a more cross-platform solution the Eclipse HTML editor is not bad.)
Also, will msword handle footnotes and indices?
It should, although I couldn't say for sure without seeing the document. And don't omit any indices; the notion that full text search is a replacement for indices or a table of contents is a myth promulgated by those who are too lazy to create them.
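If the export drops the footnotes and they have to be re-linked by hand, the "anchors and links" pattern asked about at the top of the thread looks roughly like the sketch below. The id scheme (`ref1`/`note1`) and the function names are made up for illustration, and whether a given reader follows `id` targets (rather than the older `name` attribute) is an assumption that would need testing on the device.

```python
from html import escape


def note_ref(n):
    """Superscript marker in the running text, linking down to note n."""
    return '<sup><a id="ref%d" href="#note%d">%d</a></sup>' % (n, n, n)


def note_body(n, text):
    """The note itself, with a link back up to the marker."""
    return '<p id="note%d">%d. %s <a href="#ref%d">[back]</a></p>' % (
        n, n, escape(text), n)
```

Each note gets a pair of targets, so a reader can jump to the note and back again without losing the place in the text.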

Marcello>- export html from msword
Marcello>- run html thru tidy with the msword clean option

I'm not having luck getting tidy to work wonders with the msword clean option using word 2010. Is it because the "tidy --msword yes" option is designed for word 2000? I do have pretty good luck with the "filtered" html option from msword 2010 without using tidy.

For example, I take a "clean" html book file I am working on:
720kB utf-8 html

I import it into word 2010 and save it in docx format:
380kB docx

I open that (to make sure word isn't "cheating") and then save it as an unfiltered html:
1181kB html

I "tidy -utf8 --msword yes" on that and I get:
1156kB of hot mess (which still renders fine in an html browser)

Compared to saving filtered html:
1696kB unicode html
-- which sounds horrible, until one realizes that msword has expanded my utf-8 into Unicode, which is easily fixed, say in notepad++:
853kB utf-8 html

And after a half-hour of manual (regex) clean up I am back full circle [almost] to where I started from with "clean" HTML:
716kB utf-8 html

Now I think it is fair to argue that this is not a "fair test" in that I started with clean hand-written HTML to begin with before turning that into a word doc[x]. If you take something written in dreamweaver and export it to doc and from there to filtered html it seems to me that it is more-than-likely that some parts of the dreamweaver don't correspond happily to the html, in which case a greater effort will be required on those parts. And one still has to consider that kindlegen can still mess up even well-written html.
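The halving from 1696kB to 853kB is what one would expect if the "unicode" file is UTF-16 (two bytes for most characters) and is re-saved as UTF-8. The same re-encoding can be scripted instead of done in an editor. This sketch assumes the file really is UTF-16 with a byte-order mark, and that Word labels it `charset=unicode` in its meta tag; both should be checked against the actual file.

```python
def utf16_to_utf8(data: bytes) -> bytes:
    """Re-encode a UTF-16 HTML file as UTF-8 and fix the charset label."""
    text = data.decode("utf-16")  # the byte-order mark selects the endianness
    # Assumed label; harmless if the string is not present.
    text = text.replace("charset=unicode", "charset=utf-8")
    return text.encode("utf-8")
```

Typical use would be to read the export with `open(path, "rb").read()`, pass the bytes through `utf16_to_utf8`, and write the result to a new file before any regex clean-up.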

Karen,

If I was doing this I'd save from Word as filtered HTML, edit the HTML with Sea Monkey (a version of the Mozilla browser that contains an HTML editor) and use that to make the links for the table of contents and the footnotes. If the author used heading styles in his MS you can automatically generate an HTML table of contents using the Sea Monkey editor. If not, mark up the chapter headings as H2, H3, etc. and then generate the TOC.

Once you have the HTML looking good the next step is to import it into Sigil and split the file into multiple chapter files. This will generate an EPUB for you.

Create a cover image using The GIMP and import it into the EPUB.

View the EPUB on a PC using the Nook viewer and fix anything that doesn't look right. Run kindlegen on the EPUB to create a MOBI file. Check it out using the Kindle previewer.

Lather, rinse, repeat. It will take practice to learn to make a book that looks good on both Kindle and Nook. Both platforms have quirks.

Consider charging the guy *something* for your time. Maybe a flat rate rather than an hourly rate. Remember, do somebody a favor and they will always remember you--when they need another favor!

James Simmons

On Sun, Jan 1, 2012 at 1:03 AM, Karen Lofstrom <klofstrom@gmail.com> wrote:
I helped an ESL author with his book about Theravada Buddhism, and told him that I would convert his book to Kindle and ePub, and upload to Amazon and Smashwords, for free. Because I need the experience, and because I couldn't possibly charge him if I'm spending hours figuring out how to do it.
So far, I'm stymied. I figured that the appropriate path would be to convert the Word doc to HTML, and from there go to Kindle and ePub. I thought I'd start by using Dreamweaver rather than using Save as HTML in Word. However, my antique copy of Dreamweaver (Dreamweaver 4) is refusing to open the Word 2003 doc. It seems that I can copy and paste from the doc into Dreamweaver. Should I do that?
Should I preformat the work in Word, using Outline to produce various levels of headings?
What should I do about the XE codes that appeared in the document AFTER I sent the finished manuscript to the author? I think that they were introduced by the Sri Lankan publisher. They look like this: { XE "awakened" /i}. Some of them lack the /i. I believe that these are codes that generate an index, yes? An ebook doesn't need an index, does it? Just searching should suffice? Yes? No? Should I just jettison all these codes, or convert them to something in HTML?
What about footnotes? Insert anchors and link, right? Same for TOC, yes?
Dreamweaver output, check with HTMLTidy, then process with Mobipocket Creator and Sigil?
(Yes, I should have a later Dreamweaver. I maintain my zendo's website, and I'm producing it in deprecated code. However, I'm too poor to afford $400 for the latest Dreamweaver.)
I'm not a complete HTML newbie, but my skills are fairly basic. I would appreciate some help from the folks here. I figured I've earned it by proofing 78,000 pages at DP. You could send me to the best site for advice, or send me a private email, or post it here, in case this would be of interest to other volunteers.
Oh, and Happy New Year!
-- Karen Lofstrom Zora on DP

Thanks all, for the advice re converting the doc file to HTML, epub, and azw. No clear consensus as to the best way -- different folks have different strokes. I'm just going to have to experiment and find out which way works for me. I think I have enough info now to experiment. I may be back if I run into unforeseen problems. -- Karen Lofstrom who ate black-eyed peas for dinner (good luck, Southern US style) and fried mochi for dessert (good luck, Japanese style) and just watched the latest Sherlock, which was even more delicious

On 1/1/2012 9:26 PM, James Simmons wrote:
Karen,
If I was doing this I'd save from Word as filtered HTML, edit the HTML with Sea Monkey (a version of the Mozilla browser that contains an HTML editor) and use that to make the links for the table of contents and the footnotes. If the author used heading styles in his MS you can automatically generate an HTML table of contents using the Sea Monkey editor. If not, mark up the chapter headings as H2, H3, etc. and then generate the TOC.
This is pretty good advice, except save as regular HTML and let Tidy do the cleanup for you. Note especially that you can use heading styles in MS Word, which will convert to <h[n]> tags when you export the file.

Remember that in an e-book a "Table" of contents is in reality a "List" of pointers. In ePub 3 the working group acknowledged that specifically, deprecating the much-maligned .ncx file in favor of ordered, and potentially nested, lists. Your "Table" of contents should be composed of <ol> or <ul> and <li>. ePubEditor will recognize TOC lists and automagically generate an ePub .ncx file. It can also scan your publication for explicit headers and build a nested TOC list for you as well.

As one of the compromises MobiPocket made to make up for the fact that it didn't support styles, the reader starts a new page (screen) when it encounters the <h1> element (and when it encounters the <hr> element). Take these quirks into account when building your document.
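To make the "list of pointers" idea concrete, here is a rough Python sketch that scans an HTML file for h1 to h3 headings and builds a nested <ol> of links. It is not how ePubEditor or any other tool mentioned here does it. It assumes every heading that should appear in the TOC already carries an `id` attribute, and the file name `book.html` and all function names are made up for illustration.

```python
from html import escape
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, id, text) for every h1-h3 that has an id."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._current = None  # [level, id, text] while inside a heading

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self._current = [int(tag[1]), dict(attrs).get("id"), ""]

    def handle_data(self, data):
        if self._current is not None:
            self._current[2] += data

    def handle_endtag(self, tag):
        if self._current is not None and tag == "h%d" % self._current[0]:
            level, anchor, text = self._current
            if anchor:  # headings without an id cannot be linked; skip them
                self.headings.append((level, anchor, text.strip()))
            self._current = None


def build_toc(headings, page="book.html"):
    """Turn collected headings into one nested <ol> of links."""
    if not headings:
        return ""
    base = min(level for level, _, _ in headings)
    out, depth = [], 0  # depth = number of currently open <ol> elements
    for level, anchor, text in headings:
        # Never descend more than one level at a time, so lists stay valid.
        target = min(level - base + 1, depth + 1)
        if target > depth:
            out.append("<ol>")
            depth += 1
        else:
            while depth > target:
                out.append("</li></ol>")
                depth -= 1
            out.append("</li>")
        out.append('<li><a href="%s#%s">%s</a>' % (page, anchor, escape(text)))
    out.append("</li></ol>" * depth)
    return "".join(out)


def toc_from_html(html, page="book.html"):
    collector = HeadingCollector()
    collector.feed(html)
    return build_toc(collector.headings, page)
```

The nested list is closed inside its parent <li>, which is what keeps the markup valid when a chapter has sections beneath it.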
Once you have the HTML looking good the next step is to import it into Sigil and split the file into multiple chapter files. This will generate an EPUB for you.
This advice is not so good. The ePub working group is committed to replacing all inline styles with style sheets, preferably external style sheets. The author of Sigil started his project specifically because he could not find a good tool that conformed precisely to the ePub specification. As a result, Sigil will frequently re-write your HTML to move styles around. The Kindle/Mobi format is based on HTML 3.2, and does not deal well with style sheets. I think you will find that you will often get your HTML to a point where it looks good on the Kindle, but if you let Sigil change the files at all it will spoil the Kindle version.

This problem is not unique to the Kindle. One of my favorite android-based readers is the open-source CoolReader. Unfortunately, CoolReader is not yet far enough in its development that it deals well with out-of-line styles either.

This is why when I wrote ePubEditor I explicitly decided to defer to the end user's judgment as far as an HTML editor goes. ePubEditor can "clean" HTML files using user-specified style sheets, and can be used to remove or replace HTML elements without missing end tags, but it will never re-write an HTML file without the user's explicit permission. ePubEditor can also break up a single HTML file into small sections, based on the existence of specific elements (e.g. <h3 class="chapter">). If you choose to try ePubEditor, please complain to me vociferously about all the things you don't like about it.
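The idea of breaking one HTML body into sections at a marker element can be illustrated in a few lines of Python. This is a toy sketch, not ePubEditor's or Sigil's actual code. It works on the body text only, leaves every byte of markup untouched, and assumes chapters start with a literal `<h3 class="chapter"` as in the example above.

```python
import re


def split_chapters(body, marker=r'<h3\s+class="chapter"'):
    """Split body text into pieces that each begin at a chapter heading.

    Anything before the first heading (front matter) becomes its own piece.
    """
    starts = [m.start() for m in re.finditer(marker, body)]
    bounds = [0] + [s for s in starts if s > 0] + [len(body)]
    pieces = [body[a:b] for a, b in zip(bounds, bounds[1:])]
    return [p for p in pieces if p.strip()]
```

Each piece would still need to be wrapped in its own head and body elements before it could go into an ePub; that part is left out here.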
Create a cover image using The GIMP and import it into the EPUB.
Actually, use whatever painting tool you're most comfortable with -- you can even go out to OpenLibrary.org and grab one off their web site. I have discovered, however, that most ePub readers ignore it when you specify the cover image in the guide section. You need to put a link to the cover in a small HTML file, and then include that file as the first file in the spine. Perhaps a Kindle user out there could elaborate on the best way to include a cover image for Kindle?
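The small cover file and its two package entries might look like what the sketch below generates. The file names (`cover.xhtml`, `images/cover.jpg`), the id `cover-page`, and the function names are all made up for illustration; the manifest and spine lines follow the ePub 2 package format as I understand it and would go into the book's existing .opf file by hand.

```python
from html import escape


def cover_page(image_href, title="Cover"):
    """Return a minimal XHTML page that shows nothing but the cover image."""
    return (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        "<head><title>%s</title></head>\n"
        '<body><div><img src="%s" alt="%s"/></div></body>\n'
        "</html>\n"
    ) % (escape(title), escape(image_href), escape(title))


def opf_entries(page_href="cover.xhtml", page_id="cover-page"):
    """Return (manifest item, spine itemref) lines for the cover page."""
    manifest = '<item id="%s" href="%s" media-type="application/xhtml+xml"/>' % (
        page_id, page_href)
    spine_first = '<itemref idref="%s"/>' % page_id  # goes first in <spine>
    return manifest, spine_first
```

The image file itself also needs its own manifest item; that line is not generated here.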
View the EPUB on a PC using the Nook viewer ...
The Adobe Digital Editions reader is probably the gold standard for ePub, as it was written by the same team that participates (rather high-handedly) in the ePub working group. For previewing, ADE would be my first (but not only) choice.
... and fix anything that doesn't look right. Run kindlegen on the EPUB to create a MOBI file. Check it out using the Kindle previewer.
More good advice follows, which is why it is not snipped.
Lather, rinse, repeat. It will take practice to learn to make a book that looks good on both Kindle and Nook. Both platforms have quirks.
Consider charging the guy *something* for your time. Maybe a flat rate rather than an hourly rate. Remember, do somebody a favor and they will always remember you--when they need another favor!
James Simmons

Oh, by the way -- James Simmons said:
Consider charging the guy *something* for your time. Maybe a flat rate rather than an hourly rate. Remember, do somebody a favor and they will always remember you--when they need another favor!
He's already paid me a fair bit of cash to rewrite his book in good English (he's Sri Lankan and his English is erratic). The ebook work is lagniappe for him and experimentation for me. Also, we're both of us Buddhists, so a certain degree of sangha solidarity is appropriate. -- Karen Lofstrom

Hi Karen,

This may seem an old-fashioned way to do things, but I would suggest just converting the Word doc to HTML manually. I have done something similar many times, though going from Word to LaTeX. Basically, you use Word's find and replace to convert: find each kind of formatting, put the needed tags around the text, and remove the formatting from Word. You also have to convert certain characters to get things working smoothly. In the end you have marked-up text in Word, which you save as a plain text file. This method gives you full control of the HTML output, and you can even add CSS during conversion.

Hope this helps.

regards
Keith.

On 01.01.2012 at 08:03, Karen Lofstrom wrote:
So far, I'm stymied. I figured that the appropriate path would be to convert the Word doc to HTML, and from there go to Kindle and ePub. I thought I'd start by using Dreamweaver rather than using Save as HTML in Word. However, my antique copy of Dreamweaver (Dreamweaver 4) is refusing to open the Word 2003 doc. It seems that I can copy and paste from the doc into Dreamweaver. Should I do that?
Should I preformat the work in Word, using Outline to produce various levels of headings?
participants (8): Aaron Cannon, don kretz, James Adcock, James Simmons, Karen Lofstrom, Keith J. Schultz, Lee Passey, Marcello Perathoner