
On Tue, November 15, 2011 3:05 am, a@aboq.org wrote:
On Monday, 14th November 2011 at 14:11:53 (GMT -0800), Jim Adcock wrote:
you would find that Open Office in HTML mode makes a decent WYSIWYG HTML editor that outputs decent HTML, unlike MS Word, which outputs horrible HTML if you try to use it to edit HTML.
Not if you choose "simplified HTML" as output in MS Word. I was shocked to find out how clean *that* HTML was; it almost passes W3C's validator check (in fact, passes it entirely after you manually adjust 2 or 3 details in the output).
It's been more than five years since I played with MS Word's HTML output, and I haven't looked at it since MS Word 2003 (I have to say that I absolutely /hate/ Word 2007 -- it seems that Microsoft's engineers are intentionally making the UI worse with every revision). But as you note, even the "simplified" HTML that MS Word produces is still not quite right.

HTML Tidy has a --word-2000 option, which was designed to remove the MS Word cruft from Word HTML files. Somewhat counter-intuitively, I found that Tidy was not as good at cleaning the "simplified" output as it was at cleaning the "horribly complex" output, because the indicators Tidy used to detect "bad" HTML had already been removed. Again, it's been many years since I went down that road, but running the "horribly complex" Word output through Tidy may be a better option than using the "simplified" Word output.
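
For anyone who wants to try the Tidy route, here is a minimal sketch in Python of how the command line might be put together. It assumes a `tidy` executable is installed and on the PATH; the function names and the file names are made up for illustration, and the extra `--clean yes` setting is my own addition rather than something tested against Word output.

```python
import shutil
import subprocess


def build_tidy_command(src, dst):
    """Return the argv list for cleaning a Word-exported HTML file with Tidy.

    src and dst are hypothetical file names supplied by the caller.
    """
    return [
        "tidy",
        "-q",                  # quiet: suppress non-essential output
        "--word-2000", "yes",  # ask Tidy to strip Word-specific markup
        "--clean", "yes",      # assumption: also replace presentational tags with CSS
        "-o", dst,             # write the cleaned document to dst
        src,                   # the file saved from Word
    ]


def clean_word_html(src, dst):
    """Run Tidy on src and return its exit status."""
    if shutil.which("tidy") is None:
        raise RuntimeError("tidy executable not found on PATH")
    result = subprocess.run(build_tidy_command(src, dst))
    # Tidy exits with 0 on success, 1 if there were warnings, 2 if there were errors.
    return result.returncode
```

Calling `clean_word_html("chapter-from-word.html", "chapter-clean.html")` on the full "horribly complex" export and then on the "simplified" export would let you compare the two results directly.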