Re: [gutvol-d] let's get some tutorials from the xhtml crowd

19 Dec 2011

      On Thu, December 15, 2011 1:32 am, Paul Flo Williams wrote:

[snip irrelevant context]
...
For me the tools are Vim, Perl, xsltproc, Calibre, kindlegen, git, xmllint
and elbow grease. Thank goodness I'm not claiming to have produced a magic
button that can produce nineteen formats before breakfast from an input
language used by no one.
And, for the record, I'd also welcome any further discussion of, or
pointers to, tools, techniques and strategies for producing ebooks.

Judging by the list of tools you're using, I would venture to say you're a
*nix guy. I don't know if my tool set will help you out, because for e-book
creation I work almost exclusively on Windows (much to my shame). But this is
what I use:

Most important is ABBYY FineReader. Not only does FineReader do very good OCR,
but the user interface has a side-by-side feature where a page image is
displayed next to the recognized text. FineReader has a global search and
replace function, a stemming dictionary, user defined dictionaries, and will
highlight words that the OCR was "uncertain" about. I do all my spell checking
in FineReader, and do not export the document until I have paged through the
entire document checking the layout (sometimes FineReader gets confused about
what is, and is not, a paragraph when you have a lot of really short
paragraphs in a row).

Once I have done all the proof-reading I can in FineReader, I export the
document as simple HTML. The HTML produced by FineReader is class 2 tag soup
(SGML), so my next step is to convert the FineReader output to XHTML. At the
same time, it would be nice if I could guess at some of the structures in the
book other than paragraphs, such as blockquotes and headers. FineReader can't
seem to intuit these structures, but it does produce an inordinate number of
<font> tags. If the original document set chapter titles in a font larger than
or different from the body font, I figured those were probably headers. In the
end I wrote a program that takes FineReader output, converts it to XHTML, and
attempts to add some structure based on varying font sizes. It's not perfect,
and I'm sure it introduces errors of its own, but on balance it seems to
reduce the total number of errors.
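
To illustrate the font-size idea, here is a minimal sketch in Python. It is not
fr2html itself (which is written in C) and it makes several assumptions: that
the OCR export marks sizes as <font size="N"> with plain integer values, that
paragraphs start with <p>, and that anything set larger than the dominant body
size should become an <h2>. The names FontSizeCollector and to_xhtml_body are
made up for this example.

```python
from collections import Counter
from html import escape
from html.parser import HTMLParser


class FontSizeCollector(HTMLParser):
    """Collect (dominant_font_size, text) for each paragraph of tag-soup HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []      # list of (size, text) tuples
        self._sizes = [3]         # stack of active font sizes; 3 is the HTML default
        self._text = []           # raw text chunks of the current paragraph
        self._chars = Counter()   # characters seen per font size in this paragraph

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            # Tag soup often omits </p>, so a new <p> also ends the previous one.
            self._flush()
        elif tag == "font":
            size = dict(attrs).get("size") or ""
            # Assumption: sizes are plain integers; anything else inherits.
            self._sizes.append(int(size) if size.isdigit() else self._sizes[-1])

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
        elif tag == "font" and len(self._sizes) > 1:
            self._sizes.pop()

    def handle_data(self, data):
        self._text.append(data)
        self._chars[self._sizes[-1]] += len(data.strip())

    def _flush(self):
        text = " ".join("".join(self._text).split())  # normalize whitespace
        if text:
            size = self._chars.most_common(1)[0][0]
            self.paragraphs.append((size, text))
        self._text, self._chars = [], Counter()

    def close(self):
        super().close()
        self._flush()             # emit the last, possibly unclosed, paragraph


def to_xhtml_body(soup):
    """Return XHTML <h2>/<p> elements guessed from font sizes in `soup`."""
    collector = FontSizeCollector()
    collector.feed(soup)
    collector.close()
    # The body size is the one that covers the most characters overall.
    weight = Counter()
    for size, text in collector.paragraphs:
        weight[size] += len(text)
    if not weight:
        return ""
    body_size = weight.most_common(1)[0][0]
    lines = []
    for size, text in collector.paragraphs:
        tag = "h2" if size > body_size else "p"
        lines.append(f"<{tag}>{escape(text, quote=False)}</{tag}>")
    return "\n".join(lines)
```

Feeding it something like '<P><FONT SIZE=5>CHAPTER I</FONT><P><FONT SIZE=3>It
was a dark night.</FONT>' should yield an <h2> followed by a <p>. A real
converter would also have to deal with relative sizes, face changes, inline
markup and blockquotes, all of which this sketch ignores.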

I named the program fr2html.exe. If you want the C code, I'd be happy to
send it to you. Or, because it makes extensive use of DomCApi on SourceForge,
I could add it as a subproject there.

I then open the resultant HTML file in Microsoft Web Developer. This program
has a split screen view so I can edit the HTML directly yet see the formatted
output. It also does validation on the HTML as I work.
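
For anyone without such an editor, a rough stand-in for part of that check can
be scripted. The sketch below only tests XML well-formedness using Python's
standard library; it does not validate against the XHTML DTD, and it will
reject named entities such as &nbsp; because no DTD is loaded. The function
name well_formedness_error is invented for this example.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xhtml):
    """Return None if `xhtml` parses as XML, else a short error message."""
    try:
        ET.fromstring(xhtml)
    except ET.ParseError as err:
        line, column = err.position   # location reported by the parser
        return f"line {line}, column {column}: {err}"
    return None
```

Calling it on a document with an unclosed <br> inside a <p> returns a message
pointing at the offending line, while a properly nested document returns None.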

I don't do degraded text. To get my work product into Project Gutenberg I post
the result into some public repository on the web. Then I used to send e-mail
to Michael Hart to the effect of "there it is; if you want to add it to
Project Gutenberg, go get it." I don't know if he ever did, but that's not my
problem. Now that Mr. Hart is gone, I don't know who I should notify to do an
end-run around the whitewashers.

Lee Passey