[gutvol-p] converting from HTML to latex for printing

Harald Geyer harald at lefant.net
Thu Jul 29 12:37:20 PDT 2010


Hi!

I'm looking into semi-automatically converting gutenberg.org ebooks
from HTML to LaTeX for high-quality printing.

I think by now I've read pretty much every HTML howto on the
wiki. Alas, there doesn't seem to be any recommended standard
for structuring the hand-crafted HTML files. After a
(very short) survey, it seems that some people stick to proper
<H2> tags, while others do crazy stuff like <p CLASS=chapter>.

Obviously some manual work won't be avoidable, but I want to get
it as easy and automatic as possible. So I need to find or write
tools that:

* warn me about any bad practices in the file
* get the conversion mostly right in the default cases
* make it easy to fix problems manually without overlooking anything

The idea is that it is ok to tweak the conversion manually but I don't
want to have to proof-read the whole book to see if the conversion
was satisfactory.
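
As a rough illustration of the first requirement, a checker along the
following lines could flag tags a converter does not know about, as well
as paragraphs that are dressed up as headings. This is only a sketch
using Python's standard html.parser module. The KNOWN_TAGS whitelist,
the HEADING_LIKE_CLASSES set, and the names HtmlLinter and lint_html are
assumptions made up for this example, not an existing tool or a
gutenberg.org convention.

```python
from html.parser import HTMLParser

# Hypothetical whitelist: tags this sketch assumes the converter can map to LaTeX.
KNOWN_TAGS = {
    "html", "head", "title", "meta", "body",
    "h1", "h2", "h3", "h4", "p", "br", "hr",
    "i", "em", "b", "strong", "blockquote", "pre",
    "a", "img", "ul", "ol", "li",
}

# Hypothetical class names that suggest a heading disguised as a paragraph.
HEADING_LIKE_CLASSES = {"chapter", "title", "heading"}


class HtmlLinter(HTMLParser):
    """Collects (line number, message) warnings while parsing."""

    def __init__(self):
        super().__init__()
        self.warnings = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        line, _column = self.getpos()
        if tag not in KNOWN_TAGS:
            self.warnings.append((line, f"unknown tag <{tag}>"))
        if tag == "p":
            classes = (dict(attrs).get("class") or "").lower().split()
            for name in classes:
                if name in HEADING_LIKE_CLASSES:
                    self.warnings.append(
                        (line, f'<p class="{name}"> used where a heading tag is expected')
                    )


def lint_html(text):
    """Return the list of warnings for one HTML document given as a string."""
    linter = HtmlLinter()
    linter.feed(text)
    linter.close()
    return linter.warnings


if __name__ == "__main__":
    sample = "<h2>One</h2>\n<p CLASS=chapter>Two</p>\n<blink>x</blink>\n"
    for line, message in lint_html(sample):
        print(f"line {line}: {message}")
```

Reporting line numbers is the point of the exercise: each warning
points at one place to inspect, so only the flagged spots need a manual
look rather than the whole book.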

I guess I have two main questions:

1) Is anybody aware of any tool that might be helpful or could be
a starting point for future development? So far I've looked at
pandoc, which at least produces valid LaTeX (most other tools
failed at that) and is generally great for transforming text formats.
However, pandoc doesn't support any user interaction, like emitting
warnings about unknown tags, etc.
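
For reference, a plain pandoc run of the kind described here could be
scripted roughly as follows. This is a minimal sketch: the helper names
pandoc_command and convert and the file names book.html and book.tex are
placeholders invented for the example. The options used are pandoc's
-f (input format), -t (output format), -s (standalone document with a
preamble) and -o (output file).

```python
import shutil
import subprocess


def pandoc_command(html_path, tex_path):
    """Build the argument list for an HTML-to-LaTeX pandoc run."""
    return ["pandoc", "-f", "html", "-t", "latex", "-s", "-o", tex_path, html_path]


def convert(html_path, tex_path):
    """Run pandoc, failing loudly if it is missing or exits with an error."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    subprocess.run(pandoc_command(html_path, tex_path), check=True)


if __name__ == "__main__":
    # Placeholder file names; substitute the ebook being converted.
    convert("book.html", "book.tex")
```

Keeping the command in its own function makes it easy to run a separate
checking step before or after the conversion, since pandoc itself does
not report unknown tags.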

2) Which workflow would make the most sense in a broader gutenberg.org
context? Should I first do an automatic conversion and then fix the
result manually? That's what I'm doing now. Or should I first fix
the HTML, resubmit it to gutenberg.org, and then run an automatic
conversion to LaTeX? That might be useful for automatic conversion
to other formats too.

Of course, I plan to make the LaTeX files publicly available.
What are the chances that they would be accepted and made directly
available on gutenberg.org?

Harald


