
On 4/21/2010 1:41 PM, James Adcock wrote:
Why tidy?
As Mr. Perathoner has pointed out, it is because OCF requires that the interior text be valid XML, and it is certain that not all of the hand-crafted HTML in the PG repository is valid XHTML. Tidy should cause no harm, and /will/ guarantee XHTML output. It is not the only tool that could produce this result, but it is probably the best available, although it is not perfect. I suspect many of the DP post-processors already use tidy as part of their regular workflow.
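
To illustrate why hand-crafted HTML can fail the "valid XML" requirement, here is a minimal sketch in Python. It is not part of any PG tooling; the function name is_well_formed_xml is made up for this example, and it checks only XML well-formedness using the standard library parser, not full XHTML validity.

```python
import xml.etree.ElementTree as ET


def is_well_formed_xml(markup: str) -> bool:
    """Return True if the markup parses as XML (hypothetical helper)."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# HTML-style markup: <br> is never closed, so an XML parser rejects it.
html_style = "<p>first line<br>second line</p>"

# XHTML-style markup: <br /> is self-closing, so it parses as XML.
xhtml_style = "<p>first line<br />second line</p>"

print(is_well_formed_xml(html_style))
print(is_well_formed_xml(xhtml_style))
```

A browser will render both fragments the same way, which is why this kind of problem goes unnoticed until something insists on XML.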
Many people work hard to retain linebreaks in the HTML so the code can be gone over again at a future date and then PG throws away those linebreaks.
If your notion of "retaining linebreaks" is to put a newline in your HTML text, you have already lost the battle. According to the HTML specification, newlines are white space, and must be treated as such. HTML is an explicit markup language; i.e., any markup which is not part of the base text must be explicit, e.g., <br /> and not CR, LF, or CRLF. I do not believe that there is any HTML authoring/editing tool which will preserve newline characters as implicit markup. If you really are "work[ing] hard to retain linebreaks in HTML" then you will make them explicit. You can do this by adding explicit markup that user agents will ignore (e.g., <span class='linebreak'> </span>), by using an invalid HTML element (e.g., <lb>) which browsers will ignore, or by using the HTML break element in such a way that its display can be turned on or off by the use of CSS styles (e.g., <br class='lb' />). If you expect everyone to respect newline characters as line breaks in HTML, in direct contravention of the HTML spec, you are borrowing trouble. I agree with you that line breaks need to be preserved; I just think they should be preserved explicitly, and not implicitly.
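
As a rough sketch of the third option, the Python below turns newline-delimited source text into explicit <br class='lb' /> markup, and shows roughly what happens to the same text when the newlines are left implicit. The function names are made up for this example, and collapse_whitespace is only an approximation of how a browser treats white space in normal flow.

```python
import re
from html import escape


def collapse_whitespace(text: str) -> str:
    """Approximate normal-flow rendering: each run of white space
    (including CR, LF, or CRLF) becomes a single space."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip()


def mark_line_breaks(text: str) -> str:
    """Make every source line break explicit as <br class='lb' />.
    splitlines() handles CR, LF, and CRLF alike."""
    lines = text.splitlines()
    return "<br class='lb' />\n".join(escape(line, quote=False) for line in lines)


source = "Tyger Tyger, burning bright,\nIn the forests of the night;"

# Implicit newlines: the line break is gone once the text is rendered.
print(collapse_whitespace(source))

# Explicit markup: the line break survives any reflowing of the source.
print(mark_line_breaks(source))
```

With the breaks marked this way, a stylesheet rule along the lines of br.lb { display: none; } should let a reader turn them off where reflowed text is wanted, assuming the user agent honors that rule on the break element.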