converting from HTML to latex for printing
Hi!

I'm looking into semi-automatically converting gutenberg.org ebooks from HTML to LaTeX for high-quality printing. I think by now I've read pretty much every HTML how-to on the wiki page. Alas, there doesn't seem to be any recommended standard for structuring the manually crafted HTML files. After a (very short) survey, it seems that some people stick to proper <H2> tags, while others do crazy stuff like <p CLASS=chapter>.

Obviously some manual work can't be avoided, but I want to make the process as easy and automatic as possible. So I need to find or write tools that:

- warn me about any bad practices in the file
- get the conversion mostly right in default cases
- make it easy to fix problems manually without overlooking something

The idea is that it is OK to tweak the conversion manually, but I don't want to have to proofread the whole book to see whether the conversion was satisfactory.

I guess I have two main questions:

1) Is anybody aware of any tool that might be helpful or could serve as a starting point for future development? So far I've looked at pandoc, which at least produces valid LaTeX (most other tools failed at that) and is generally great for transforming text formats. However, pandoc doesn't support any user interaction, such as emitting warnings about unknown tags.

2) Which workflow would make the most sense in a broader gutenberg.org context? Should I first do an automatic conversion and then fix the result manually (that's what I'm doing now)? Or should I first fix the HTML, resubmit it to gutenberg.org, and then run an automatic conversion to LaTeX? The latter might be useful for automatic conversion to other formats too.

Of course I plan to make the LaTeX files publicly available. What are the chances that they would be accepted for direct availability on gutenberg.org?

Harald
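The first requirement above (warning about bad practices such as unknown tags or headings faked with <p CLASS=chapter>) could be sketched roughly as follows. This is only an illustration using Python's standard html.parser module, not an existing Gutenberg tool. The names KNOWN_TAGS, HEADING_LIKE_CLASSES, MarkupChecker and check_html are hypothetical, and the tag whitelist is an assumption that would have to match whatever the actual converter supports.

```python
from html.parser import HTMLParser

# Assumption: the tags our (hypothetical) converter knows how to handle.
KNOWN_TAGS = {
    "html", "head", "title", "meta", "style", "link", "body",
    "h1", "h2", "h3", "h4", "p", "div", "span", "br", "hr",
    "i", "em", "b", "strong", "blockquote", "pre", "a", "img",
}

# Assumption: class names that suggest a heading faked with a paragraph.
HEADING_LIKE_CLASSES = {"chapter", "heading", "section", "title"}


class MarkupChecker(HTMLParser):
    """Collects (line, message) warnings instead of converting anything."""

    def __init__(self):
        super().__init__()
        self.warnings = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        line, _column = self.getpos()
        if tag not in KNOWN_TAGS:
            self.warnings.append((line, f"unknown tag <{tag}>"))
        class_value = dict(attrs).get("class") or ""
        suspicious = set(class_value.lower().split()) & HEADING_LIKE_CLASSES
        if tag == "p" and suspicious:
            names = ", ".join(sorted(suspicious))
            self.warnings.append(
                (line, f"<p> with class '{names}' looks like a heading; "
                       "consider a real heading tag")
            )


def check_html(text):
    """Return a list of (line_number, message) warnings for the given HTML."""
    checker = MarkupChecker()
    checker.feed(text)
    checker.close()
    return checker.warnings


if __name__ == "__main__":
    sample = (
        "<html><body>\n"
        "<h2>Chapter I</h2>\n"
        "<p CLASS=chapter>Chapter II</p>\n"
        "<blink>text</blink>\n"
        "</body></html>"
    )
    for line, message in check_html(sample):
        print(f"line {line}: {message}")
```

Because every warning carries a line number, such a report could serve as a checklist of places to fix by hand, either in the HTML before conversion or in the generated LaTeX afterwards, without proofreading the whole book.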
Participants (1): Harald Geyer