From harald at lefant.net Thu Jul 29 12:37:20 2010
From: harald at lefant.net (Harald Geyer)
Date: Thu, 29 Jul 2010 21:37:20 +0200
Subject: [gutvol-p] converting from HTML to latex for printing
Message-ID:

Hi!

I'm looking into semi-automatically converting gutenberg.org ebooks from HTML to latex for high quality printing. I think by now I've read pretty much every howto on HTML on the wiki page. Alas it seems that there isn't any recommended standard on how to structure the manually crafted HTML files. After a (very short) survey it seems that some people stick to proper

tags, while others do crazy stuff like

. Obviously some manual work won't be avoidable, but I want to make it as easy and automatic as possible. So I need to find or write tools that:

* warn me about any bad practices in the file
* get the conversion mostly right in default cases
* make it easy to fix problems manually without overlooking something

The idea is that it is ok to tweak the conversion manually, but I don't want to have to proof-read the whole book to see if the conversion was satisfactory.

I guess I have two main questions:

1) Is anybody aware of any tool that might be helpful or can be a starting point for future development? - So far I've looked at pandoc, which at least produces valid latex (most other tools failed that) and is generally great for transforming text formats. However, pandoc doesn't support any user interaction, like emitting warnings about unknown tags, etc.

2) Which workflow would make most sense in a broader gutenberg.org context? Should I first try to do some automatic conversion and then try to fix the result manually - that's what I'm doing now. Or should I first fix the HTML, resubmit it to gutenberg.org and then have an automatic conversion to latex? That might be useful for automatic conversion to other formats too.

Of course I plan to make the latex files publicly available. What are the chances that they would be accepted and made directly available on gutenberg.org?

Harald

From Bowerbird at aol.com Thu Jul 29 14:59:56 2010
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Thu, 29 Jul 2010 17:59:56 -0400 (EDT)
Subject: [gutvol-p] Re: converting from HTML to latex for printing
Message-ID:

harald, you have set for yourself a task that is impossible.

i tell you that so as to warn you.

and to give you a marker, so that if you _do_ attain your goal, you will have an _extremely_ strong sense of accomplishment. because you will _deserve_ it. because your goal is impossible.

or close enough to impossible that it might as well be impossible.
***

someone "with authority" will probably come on here to tell you that they aren't really interested in receiving your reworked .html, or your latex files, for that matter, or even your high-quality .pdf.

scratch that. they'll probably backchannel you to tell you all that, because they don't particularly enjoy having to say it here in public.

but i'll let them tell you that. because even if they did say "ok", it's impossible for you to do what you want, so it doesn't matter.

i could explain why it's impossible -- because the idiosyncrasies of the handmade .html files are too impossible for anyone to tame -- but you probably won't believe it until you have tried, and failed, so i won't bother to elaborate.

so, even though i probably shouldn't do it, because it might serve to encourage you, i'll be the first to wish you luck on your endeavor.

-bowerbird

p.s. but i give very big kudos to you for having discovered pandoc...