
OK guys, we have a problem. When one uses the "--clean" option, Tidy removes any "<center>" elements, replaces them with "<div class='c1'>", and adds "div.c1 {text-align: center}" to the internal style sheet. This seems reasonable, because according to the HTML spec, "The CENTER element is exactly equivalent to specifying the DIV element with the align attribute set to 'center'." In a bit of a chained dependency, it turns out the "align" attribute is /also/ deprecated in favor of the CSS "text-align" style. So Tidy's behavior is completely consistent with the HTML spec, and in theory should cause no presentational differences before and after a page is Tidy'ed.

In theory, there is no difference between theory and reality; in reality, there is. Consider the following snippet:

    <center>
      <table>
        <tr>
          <td>
            line one<br />
            a longer line two<br />
            a very much longer line three
          </td>
        </tr>
      </table>
    </center>

Using my four test browsers, Firefox 3.5, IE 8, Opera 9 and Safari 4, in each case the above table was centered in the browser, but the text inside the table data element remained left justified. When I changed the "<center>" element to "<div style='text-align: center'>" the text inside the table data element became centered as well. This is the behavior I would expect; the whole notion of "Cascading" in CSS indicates that styles continue down the tree until changed. But it does illustrate the fact that there is a distinction between centering an /element/ (in this case the table), and centering the text /inside/ an element.

So while, in theory, the "<center>" element should be equivalent to "<div style='text-align:center'>", in practice it seems that not only are they not equivalent in /some/ browsers, they are not equivalent in /any/ browser.

I believe one of our design goals was that Tidy would make no change to otherwise valid HTML that would cause it to render differently using browser defaults after Tidying.
Thus, empty paragraphs, which are forbidden, are converted to /two/ "<br />" elements, to match the default paragraph presentation in browsers. Leaving aside the fact that the use of tables to control layout is simply morally reprehensible, the fact is that there are many, many pages 'in the wild' that do so. And Tidy's current behavior will cause those pages' presentations to change after running Tidy. I think that in this case we have not met our design goal.

Now I can fix the code so that this doesn't happen in the future, if only I knew what the right fix /is/. I could simply remove "center" from the list of elements that get 'cleaned', and print a warning that the resulting document contains elements that are deprecated (this warning probably ought to be there whenever deprecated elements remain in the output). Or I could focus more directly on this specific issue: whenever a "<table>" is a descendant of a "<center>" element, I could add "style='text-align:left'" to the "<table>" element (assuming a "text-align" style is not already attached to that element) /before/ cleaning (both styles should then be moved to the internal style sheet). Or perhaps there is yet another solution that I haven't thought of?

I don't think that simply telling the end user "your HTML doesn't follow the rules; we could fix it but we won't" is an option; after all, that's what Tidy is for, right? So, what should I do?

P.S. I don't like the behavior that the "--drop-font-tags" option also drops "<center>" elements; page layout is not in the same classification as font appearance, and I can envision situations where I would want to drop "<font>" elements but retain "<center>" elements. But that is an argument for another day.
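
For discussion, here is a minimal sketch of the second option (pinning "text-align: left" on any "<table>" under a "<center>" before cleaning). It is written in Python rather than in Tidy's own C, it assumes the input is already well-formed XHTML, and the name pin_table_alignment is hypothetical, not anything that exists in Tidy.

```python
import xml.etree.ElementTree as ET


def pin_table_alignment(root):
    """Add 'text-align: left' to each <table> that descends from a <center>
    and has no text-align declaration of its own. Returns the number of
    tables changed. Hypothetical helper, not part of Tidy."""
    changed = 0
    # iter() also yields root itself when root is a <center>.
    for center in root.iter("center"):
        for table in center.iter("table"):
            style = table.get("style", "").strip().rstrip(";").strip()
            if "text-align" in style.lower():
                continue  # the author already chose an alignment; leave it
            parts = [style] if style else []
            parts.append("text-align: left")
            table.set("style", "; ".join(parts))
            changed += 1
    return changed


SNIPPET = (
    "<center><table><tr><td>line one<br />a longer line two<br />"
    "a very much longer line three</td></tr></table></center>"
)

root = ET.fromstring(SNIPPET)
count = pin_table_alignment(root)
print(count, ET.tostring(root, encoding="unicode"))
```

Run before the "<center>" to "<div class='c1'>" rewrite, a pass like this would leave the table itself for the cleaner to handle while keeping the cell text left justified, which matches what the four browsers showed for the original markup.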

Why tidy? Many people work hard to retain linebreaks in the HTML so the code can be gone over again at a future date and then PG throws away those linebreaks.

James Adcock wrote:
Why tidy?
Because I have to convert all the crooked HTML that has been posted in 20 years into valid XHTML.
Many people work hard to retain linebreaks in the HTML so the code can be gone over again at a future date and then PG throws away those linebreaks.
It is simpler to fix the HTML than to fix the Epub, so why should the Epub retain the line breaks?
--
Marcello Perathoner
webmaster@gutenberg.org

It is simpler to fix the HTML than to fix the Epub, so why should the Epub retain the line breaks?
Sorry, if you say that tidy is only being used to generate epubs, not to modify the posted HTML, then fine. On one of my previous HTML submissions a WW said he had run tidy on it. Obviously the intent of retaining the linebreaks is to allow future DP'ers or PG'ers who have figured out a better scheme, TEI Lite or whatever (hypothetical), to make another DP pass or solo effort by extracting the already "corrected" txt matched against the original OCR rather than having to start again "from scratch." And again, pgdiff can extract linebreak info given a txt which has lost linebreaks and an OCR that retains them, but it's still cleaner and easier not to have lost them in the first place.

On 4/21/2010 1:41 PM, James Adcock wrote:
Why tidy?
As Mr. Perathoner has pointed out, it is because OCF requires that the interior text be valid XML, and it is certain that not all of the hand-crafted HTML in the PG repository is valid XHTML. Tidy should cause no harm, but /will/ guarantee XHTML output. It is not the only tool that could produce this result, but it is probably the best (although not perfect). I suspect many of the DP post-processors use tidy as part of their regular workflow.
Many people work hard to retain linebreaks in the HTML so the code can be gone over again at a future date and then PG throws away those linebreaks.
If your notion of "retaining linebreaks" is putting a newline in your HTML text, you have already lost the battle. According to the HTML specification, newlines are white space, and must be treated as such. HTML is an explicit markup language; i.e. any markup which is not part of the base text must be explicit, e.g. <br /> and not CR, LF, or CRLF. I do not believe that there is any HTML authoring/editing tool which will preserve newline characters as implicit markup. If you really are "work[ing] hard to retain linebreaks in HTML" then you will make them explicit. You can do this by adding explicit markup that user agents will ignore (e.g. <span class='linebreak'> </span>), using an invalid HTML element (e.g. <lb>) which browsers will ignore, or by using the HTML break element in such a way that its display can be turned on or off by the use of CSS styles (e.g. <br class='lb' />). If you expect everyone to respect newline characters as line breaks in HTML, in direct contravention of the HTML spec, you are borrowing trouble. I agree with you that line breaks need to be preserved; I just think they should be preserved explicitly, and not implicitly.
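
To make the third suggestion concrete, here is a rough sketch in Python that turns the newlines of a plain-text passage into explicit <br class='lb' /> markers. The function name mark_line_breaks is made up for illustration, and the sketch assumes the input is plain text, not text that already contains markup.

```python
from html import escape


def mark_line_breaks(text):
    """Replace each CR, LF, or CRLF line ending in plain text with an
    explicit <br class='lb' /> followed by a newline, escaping any
    characters that are special in HTML. Hypothetical helper."""
    lines = text.splitlines()  # splits on CR, LF, and CRLF alike
    return "<br class='lb' />\n".join(escape(line) for line in lines)


print(mark_line_breaks("line one\na longer line two\r\na very much longer line three"))
```

The newline kept after each marker means the words on either side stay separated by white space even when the breaks are hidden. A style rule along the lines of "br.lb { display: none }" is one way to switch them off for reflowed reading, assuming the user agent honors display on the break element.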

I do not believe that there is any HTML authoring/editing tool which will preserve newline characters as implicit markup.
Sorry, what I and others do is retain the original book's linebreaks in the coding of the HTML. Fortunately HTML *is* a reflow file format which ignores those linebreaks and treats them as whitespace, allowing the end user to use whatever size device, fonts, screen orientations etc. that they choose. If one later wants to make another pass at the book, one simply strips out the HTML markup, leaving the original text intact with the same linebreaks as were in the original book. Then, as a hypothetical example, one can resubmit that plaintext with the original linebreaks back through the DP process.
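
As a rough illustration of the stripping step, here is a sketch using Python's standard html.parser module. The names TextExtractor and strip_markup are hypothetical, and the sketch assumes the newlines in the HTML source still sit where the book's line ends were; it only works if no tool has rewrapped the file in the meantime.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect character data and discard every tag and comment.
    Note: the contents of <script> and <style> would be kept as text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # &amp; etc. arrive decoded
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_markup(html_text):
    """Return the text of html_text with markup removed and the source
    newlines left exactly where they were. Hypothetical helper."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered
    return "".join(parser.chunks)


SAMPLE = "<p>It was the <i>best</i> of\ntimes, it was the worst\nof times.</p>"
print(strip_markup(SAMPLE))
```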
participants (4)
- James Adcock
- Joshua Hutchinson
- Lee Passey
- Marcello Perathoner