
On Thu, December 15, 2011 1:32 am, Paul Flo Williams wrote: [snip irrelevant context]
For me the tools are Vim, Perl, xsltproc, Calibre, kindlegen, git, xmllint and elbow grease. Thank goodness I'm not claiming to have produced a magic button that can produce nineteen formats before breakfast from an input language used by no one.
And, for the record, I'd also welcome any further discussion of, or pointers to, tools, techniques and strategies for producing ebooks.
Judging by the list of tools you're using, I would venture to say you're a *nix guy. I don't know if my tool set will help you out, because for e-book creation I work almost exclusively on Windows (much to my shame). But this is what I use:
Most important is ABBYY FineReader. Not only does FineReader do very good OCR, but the user interface has a side-by-side view where the page image is displayed next to the recognized text. FineReader has a global search and replace function, a stemming dictionary, and user-defined dictionaries, and it will highlight words that the OCR was "uncertain" about. I do all my spell checking in FineReader, and I do not export the document until I have paged through all of it checking the layout (FineReader sometimes gets confused about what is, and is not, a paragraph when there are a lot of really short paragraphs in a row).
Once I have done all the proofreading I can in FineReader, I export the document as simple HTML. The HTML produced by FineReader is class 2 tag soup (SGML), so my next step is to convert the FineReader output to XHTML. At the same time, it would be nice if I could guess at some of the structures in the book other than paragraphs, such as blockquotes and headers. FineReader can't seem to intuit these structures, but it does produce an inordinate number of <font> tags. If the original document set its chapter titles in a font larger than, or different from, the body font, then text marked that way is probably a header.
In the end I wrote a program that takes FineReader output, converts it to XHTML, and attempts to add some structure based on varying font sizes. It's not perfect and I'm sure it introduces errors, but the absolute number of errors does seem to be reduced. I named the program fr2html.exe. If you want the C code, I'd be happy to send it to you. Or, because it makes extensive use of DomCApi on SourceForge, I could add it as a sub-project there.
I then open the resultant HTML file in Microsoft Web Developer. This program has a split-screen view, so I can edit the HTML directly yet see the formatted output. It also validates the HTML as I work. I don't do degraded text.
To get my work product into Project Gutenberg, I post the result to some public repository on the web. I then used to send e-mail to Michael Hart to the effect of "there it is; if you want to add it to Project Gutenberg, go get it." I don't know if he ever did, but that's not my problem. Now that Mr. Hart is gone, I don't know who I should notify to do an end-run around the whitewashers.
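The font-size heuristic described above can be sketched roughly as follows. This is not the fr2html C code and does not use DomCApi; it is a minimal Python illustration using only the standard library. It assumes the OCR export marks sizes with <font size=N> attributes and starts paragraphs with <p>, which may not match real FineReader output. The names FontRunParser and to_xhtml_body are hypothetical, and the choice of h2 for every promoted paragraph is an arbitrary simplification.

```python
from collections import Counter
from html import escape
from html.parser import HTMLParser


class FontRunParser(HTMLParser):
    """Collect paragraphs as lists of (font_size, text) runs.

    Assumption: sizes appear as <font size=N> with a plain integer N.
    Relative sizes such as "+1" are treated as "no change".
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []   # finished paragraphs
        self._runs = None      # runs of the paragraph being read, or None
        self._sizes = [3]      # stack of active sizes; 3 is the HTML default

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # Tag soup often omits </p>, so a new <p> ends the previous one.
            self._close_paragraph()
            self._runs = []
        elif tag == "font":
            size = dict(attrs).get("size")
            current = self._sizes[-1]
            self._sizes.append(int(size) if size and size.isdigit() else current)

    def handle_endtag(self, tag):
        if tag == "p":
            self._close_paragraph()
        elif tag == "font" and len(self._sizes) > 1:
            self._sizes.pop()

    def handle_data(self, data):
        if self._runs is not None:
            self._runs.append((self._sizes[-1], data))

    def close(self):
        super().close()
        self._close_paragraph()

    def _close_paragraph(self):
        if self._runs and any(text.strip() for _, text in self._runs):
            self.paragraphs.append(self._runs)
        self._runs = None


def to_xhtml_body(soup):
    """Return well-formed <p>/<h2> elements guessed from font sizes."""
    parser = FontRunParser()
    parser.feed(soup)
    parser.close()

    # The size that covers the most characters is taken to be body text.
    weight = Counter()
    for runs in parser.paragraphs:
        for size, text in runs:
            weight[size] += len(text.strip())
    body_size = weight.most_common(1)[0][0] if weight else 3

    out = []
    for runs in parser.paragraphs:
        text = " ".join("".join(t for _, t in runs).split())
        # Promote only when every non-blank run is larger than body text.
        smallest = min(size for size, t in runs if t.strip())
        tag = "h2" if smallest > body_size else "p"
        out.append(f"<{tag}>{escape(text, quote=False)}</{tag}>")
    return "\n".join(out)
```

Fed a fragment such as `<P><FONT SIZE=5>Chapter One</FONT>` followed by ordinary size-3 paragraphs, the sketch emits the first paragraph as an h2 and the rest as p elements. A paragraph that merely contains one larger word stays a paragraph, because the rule requires every run in it to exceed the body size. Like the approach described above, this guesses, so the result still needs a manual pass.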