The problems with paragraph formatting at PG

One of the things that make books ugly at PG is the problem of simple paragraph formatting, which sounds like a simple issue under HTML, but which in practice becomes a mess, particularly in EPUB and even more so in MOBI (Kindle).

For people who don't understand or don't believe this yet, example files are provided at the bottom of this posting.

People who believe they know HTML believe they know how this issue works: one can specify top and bottom margins for one's paragraphs using units one chooses, and similarly choose to indent the first line, or not, and then the HTML browser goes off and does what you ask. And if you specify both top and bottom margins then the browser "merges" those margins so that a paragraph followed by a paragraph doesn't end up with a double-wide margin between those paragraphs.

And that is how it works, more or less, in HTML browsers, except, not really, not necessarily, not even there, because even in HTML browsers style specifications are *hints*, not firm rendering commands, and need not be followed the way you specified them. Even in HTML browser land there are still people running browsers old enough not to merge top and bottom margins, for example, or who have overridden these specifications.

But the situation becomes much worse in EPUB, and especially MOBI land.

In EPUB land there is common EPUB rendering software out there that simply ignores paragraph formatting specifications and substitutes its own. Like Aldiko, for example. [Adobe Digital Editions and its derivatives such as Sony and Nook seem to get this right]

In MOBI land -- read Kindle and kindlegen (which is pretty much how everyone including PG is forced to make MOBI files) there are two big problems:

1) Top and bottom margins are NOT merged. When a paragraph follows a paragraph, the bottom margin of the first and the top margin of the second are added together.

2) Top and bottom margins are ROUNDED to the closest 1em.

So, say for example you take a "split the baby" approach of specifying a top-margin of 0.5em (or the equivalent in one or another of the measures) and a bottom-margin of 0.5em. What do you get displayed on a Kindle?

You get two full vertical "blank lines" between every two adjacent paragraphs.

Now if this doesn't sound too bad to you, consider what happens when a "dialog" happens between two characters in your book:

You get two full vertical "blank lines" between each line of dialog.

And what happens when you specify poetry using the "one paragraph per line" approach?

You get two full vertical "blank lines" between each line of poetry.

Does this look attractive? No it does not. It looks d*mned ugly. Barely readable, in fact.

How then does one work around these problems? Once one recognizes that there really is a problem, then three (partial) work-around solutions come to mind:

1) Do not even specify paragraph formatting, but rather allow the built-in paragraph formatting in each HTML, EPUB, or MOBI device to do its job.

2) Specify (say) a 1.0em top margin and a 0.0em bottom margin.

3) *Almost* "Split the Baby" by specifying a 0.51em top margin and a 0.49em bottom margin.

Finally, understand that all paragraph style specifications are *hints*, and if you are relying on details of paragraph formatting to make your book "work" then you are going to have disappointed readers on one or another device. IE: use the <p> tag to specify paragraphs, not things which are not paragraphs.

Example test files demonstrating the problem:

http://freekindlebooks.org/Dev/marginstestpg.mobi
http://freekindlebooks.org/Dev/marginstestpg.epub

-- not sure why, but my browser likes to rename marginstestpg.epub into marginstestpg.zip on downloads. Rename it back to marginstestpg.epub to make most epub display software happy playing with it again.

PS: I have verified that the problems discusses are not problems introduced by epubmaker per say.
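The arithmetic in the posting above can be sketched in a few lines of Python. This is only a model of the behaviour as described, not of any real renderer: the function names are made up for illustration, and it assumes that "rounded to the closest 1em" means that exactly 0.5em rounds up, which is what the reported two blank lines for the 0.5em/0.5em case suggest.

```python
import math

def round_half_up(value_em):
    """Round to the nearest whole em; a value of exactly .5 rounds up (assumption)."""
    return math.floor(value_em + 0.5)

def browser_gap(bottom_em, top_em):
    """Gap between adjacent paragraphs when margins are merged: the larger one wins."""
    return max(bottom_em, top_em)

def kindle_gap(bottom_em, top_em):
    """Gap as described in the posting: each margin is rounded, then both are added."""
    return round_half_up(bottom_em) + round_half_up(top_em)

# (top margin, bottom margin) for the three cases discussed in the posting
cases = {
    "split the baby": (0.5, 0.5),
    "top margin only": (1.0, 0.0),
    "almost split the baby": (0.51, 0.49),
}
for name, (top, bottom) in cases.items():
    print(f"{name}: browser {browser_gap(bottom, top)}em, "
          f"kindle {kindle_gap(bottom, top)}em")
```

Under these assumptions the 0.5em/0.5em case yields a 2em gap on the Kindle model, while work-arounds 2) and 3) both yield 1em there. In a margin-merging browser, work-around 3) gives 0.51em, which is visually the same as the 0.5em the author originally wanted.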

On Mon, December 12, 2011 8:02 am, Jim Adcock wrote:
One of the things that make books ugly at PG is the problem of simple paragraph formatting, which sounds like a simple issue under HTML, but which in practice becomes a mess particularly in EPUB and even more so in MOBI (Kindle).
[snip]
How then does one work around these problems? Once one recognizes that there really is a problem, then three (partial) work-around solutions come to mind:
1) Do not even specify paragraph formatting, but rather allow the built-in paragraph formatting in each HTML, EPUB, or MOBI device to do its job.
[snip]
IE: use the <p> tag to specify paragraphs, not things which are not paragraphs.
Excellent advice, but advice which is regularly ignored, probably because most automated tools assume /every/ division of text is a paragraph (which is probably a better assumption than assuming that every block of text is /not/ a paragraph, and an automated tool has to go one way or the other).

Almost all HTML tags have "semantic overload": when you use a particular tag it is assumed that the enclosed text carries that tag's semantics. While almost all HTML tags have this semantic overload, it is most apparent in the <p> tag.

Most of us can recognize what is, and is not, a paragraph, and most of us know that paragraphs are typically rendered as

1. a division of text beginning and ending on a new line where the first line is indented a perceptible amount and which has no space between paragraphs, or

2. a division of text beginning and ending on a new line without indentation, but with one blank line between paragraphs.

When you mark a block of text as a <p>aragraph, ask yourself "what will this look like if the user switches from presentation 1. to presentation 2., or vice versa?" If it makes a difference you probably don't have a paragraph.

When using automated tools which want to make everything a paragraph, I like to add a style rule at the beginning of my file:

    p {text-indent: 50%}

This produces a paragraph indentation that is blatantly excessive; but now I can scroll through the text and quickly identify those divisions that are not really paragraphs.

There are two tags in HTML that are specifically free from semantic overload: <div> and <span>. Any time you encounter a division of text which has been marked with a <p>, but which obviously isn't a paragraph based on the foregoing rules, replace the <p> with a <div>. You can go back later and figure out the semantics of the <div>, but in the short term you will get the result you want.

<p> is not the only tag which is often used counter to its semantics. Another example is the <h4> tag, which is intended to be used as a 4th level header or title and typically is rendered as a left-justified, bold textual division. Because of this, I have seen some producers use <h4> for table of contents items, e.g.:

    <div class="toc">
    <h4><a href="ch01.html">Chapter One</a></h4>
    <h4><a href="ch02.html">Chapter Two</a></h4>
    <h4><a href="ch03.html">Chapter Three</a></h4>
    ...

Frequently, people expect titles to be centered on a page. If you were to build a TOC like this, you should ask yourself, "what will this look like if the user switches to a centered presentation for titles?" This block obviously has the semantics of a list, and should be marked that way. At the very least, the list items should be changed to <div>, as <div> is free of semantic overload.

In ePubEditor I have a TOC builder which builds its list according to header tags. You can imagine what this kind of structure does to my TOC builder.

Generally, if you have added styling to /any/ HTML element other than <div> or <span>, you are probably using the wrong element. And if you have added styling to a <div> or a <span> you should ask yourself if there is not some HTML element which possesses the semantics of the styled element. Frequently there won't be, but asking the question helps.
PS: I have verified that the problems discusses are not problems introduced by epubmaker per say.
I apologize for being pedantic, but it's /per se/ "by itself", from the Latin /per/ ("by, through") and /se/ ("self, itself, himself," etc.).
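To see why <h4> entries in a contents block upset a header-driven TOC builder, here is a minimal sketch of such a builder using Python's standard html.parser. It is not ePubEditor's code; the class and function names are hypothetical and it simply collects every h1..h6 element in document order.

```python
from html.parser import HTMLParser

class HeadingCollector(HTMLParser):
    """Collects (level, text) for every h1..h6 element, in document order."""

    def __init__(self):
        super().__init__()
        self.entries = []
        self._level = None   # level of the heading currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._text = []

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.entries.append((self._level, "".join(self._text).strip()))
            self._level = None

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

def build_toc(html):
    """Return the list of headings a naive header-based TOC builder would see."""
    collector = HeadingCollector()
    collector.feed(html)
    collector.close()
    return collector.entries

SAMPLE = """
<div class="toc">
<h4><a href="ch01.html">Chapter One</a></h4>
<h4><a href="ch02.html">Chapter Two</a></h4>
</div>
<h2>Chapter One</h2>
<p>Text of the chapter.</p>
"""
toc = build_toc(SAMPLE)
for level, title in toc:
    print(level, title)
```

The hand-made contents entries come back as level-4 headings ahead of the real chapter title, so a generated TOC would list "Chapter One" twice, once at the wrong level. Marking the entries as list items or <div>s keeps them out of the heading outline.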

Hi Jim,

The problem you have described below is actually three:

1) First and foremost, most people do not understand the logic of layout and the proper use of spacing. The space between two paragraphs is just one space. It is not divided between top and bottom! As far as getting the correct space between the different entities of a book it is all in the math.

2) As you rightly mention, most do not understand the semantics of HTML and CSS. Also, in light of the above, how do you expect them to code the spacing correctly?

3) The inferior implementation of HTML and CSS in the ereaders. This is something which is beyond our control.

regards
Keith.

On 12.12.2011 at 16:02, Jim Adcock wrote:
[full quote of the original posting snipped]
_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d

Keith> 1) ... The space between two paragraphs is just one space. It is not divided between top and bottom! As far as getting the correct space between the different entities of a book it is all in the math.

Agreed that the space between paragraphs is just one space, but since not every paragraph is preceded and/or followed by a paragraph, it became easier for HTML writers to be able to specify top and bottom margins separately, and for HTML browsers to automagically "merge" those margins when a paragraph *was* adjacent to a paragraph. One can "do it all" by just specifying top margins (and that is what Amazon says you should be doing); it's just somewhat more painful to specify things that way.

Keith> 2) As you rightly mention most do not understand the semantics of HTML and CSS. Also, in the light the above how do expect them to code the spacing correctly.

This is a chicken-and-the-egg problem. The volunteer transcribers have to have *some* understanding of what they are doing [and why] in order to be able to do it. [And PG, institutionally speaking, is not providing that guidance.] Unfortunately, 99.999% of the HTML documentation and books (and tools) one can find out there have little or nothing to do with the "PG" and ebook tasks at hand, and more to do with "fixed layout" HTML "home pages" such as one sees if one goes, for example, to the home page of amazon.com. Again, there are a handful of "ebook" oriented texts out there by authors like Elizabeth Castro, Rufus Deuchler, and Joshua Tallent, but these tend to swing immediately into issues of Adobe InDesign hacking, which is also not helpful....

Keith> 3) The inferior implementation of HTML and CSS in the ereaders. This is something which is beyond our control.

My claim is that it is generally not necessary to exercise all the excesses of HTML, nor all the shortcomings in the ereaders, in order to create high-quality transcriptions of most of the books that are out there. I think the real problem is that the tendency of all of us to try to "prove" we are uber-geeks leads many to try to out-geek all the others when it comes to the complexity of the HTML code being submitted. Now mind you, there *are* legitimately *some* very complicated and technical books being submitted -- such as mathematical texts -- but neither txt70 nor HTML is really suitable for these books in the first place.

Hi Jim,

On 13.12.2011 at 14:38, Jim Adcock wrote:
Keith> 1) ... The space between two paragraphs is just one space. It is not divided between top and bottom! As far as getting the correct space between the different entities of a book it is all in the math.
Agreed that the space between paragraphs is just one space, but since not every paragraph is preceded and/or followed by a paragraph, it became easier for HTML writers to be able to specify top and bottom margins separately, and for HTML browsers to automagically "merge" those margins when a paragraph *was* adjacent to a paragraph. One can "do it all" by just specifying top margins (and that is what Amazon says you should be doing); it's just somewhat more painful to specify things that way.

I do care what Amazon says! It is simply wrong. As for top and bottom margins, it is simply wrong to use them! I said it is all in the math. That authors are too lazy, or do not know how to do it, does not change this. It is quite a bit of work to get layout correct.

This is not a problem of HTML but of all layout done by lay persons using diverse tools and programs. They have not learned how to do it properly, nor can they know how to implement proper layout.

What most do not understand is that there is not just one layout/style for a paragraph. You need multiple layouts. There is no one-size-fits-all. Like I have been saying, one has to know how to lay out a book in the first place before one can start laying out an ebook with a tool or program.

regards
Keith.

Jim Adcock wrote:
In MOBI land -- read Kindle and kindlegen (which is pretty much how everyone including PG is forced to make MOBI files) there are two big problems:
1) Top and bottom margins are NOT merged. When a paragraph follows a paragraph then the vertical whitespace between those two paragraphs is added to each other.
2) Top and bottom margins are ROUNDED to the closest 1em.
[snip]
How then does one work around these problems? Once one recognizes that there really is a problem, then three (partial) work-around solutions come to mind:
1) Do not even specify paragraph formatting, but rather allow the built-in paragraph formatting in each HTML, EPUB, or MOBI device to do its job.
2) Specify (say) a 1.0em top margin and a 0.0em bottom margin.
3) *Almost* "Split the Baby" by specifying a 0.51 top margin and a 0.49 bottom margin.
There is a fourth way: pre-process your (X)HTML to downgrade it to an HTML 3.2 tag soup + Kindle attributes so that the kindlegen step does nothing other than wrapping up your HTML into a MOBI file.

Granted, this won't help you push a nice MOBI through PG, but let's see how it might work for your own delivery.

Let's say you've marked up your book as XHTML with CSS, and that you're pretty much using this as your delivered EPUB, as well as for browser consumption. You find that kindlegen does a half-hearted job of converting your styles into kindle-specific paragraph attributes. You've used mobiunpack to examine just how poor the situation is. You know from reading Joshua Tallent's book that using width and height attributes on paragraphs, and liberal sprinklings of NBSP, will help you lay out your poetry in a way that works for the different font sizes that a user might pick.

So, use xsltproc or Perl plus one of the XML modules to do what kindlegen does, except applying some of your own conversions. You can use classes in your XHTML source to target the elements that need some conversion help, which will help you apply funky conversions to just poetry or just footnotes or just the first paragraph after a heading, without needing to understand XPath.

Any non-trivial book is already going to involve some divergence between your "ideal" source and the one you chuck through kindlegen for it to up-chuck (er, I mean convert), so perhaps making kindlegen do *less* for you is the solution.

Two concrete examples from the books I'm marking up at the moment:

1. The book uses unspaced em dashes, and I like them, so I'm marking them up as U+200B U+2014 U+200B (i.e. zero-width space, em dash, zero-width space), which allows lines to wrap either side of the em dash, exactly as the original book does. That's helpful because the original book is 150 years old and uses those long chapter headings that TEI calls "arguments". However, the Kindle doesn't understand ZWSP, but will do the right thing if I change them to ZWNJ, which is the wrong character but it works.

2. I have some genealogical data in the back which is a set of increasingly indented paragraphs, some of which are numbered. It looks rather like ordered lists except that the first son and first daughter both get numbered "1", and some other oddities, so I've just used paragraphs. The Kindle destroys the indentation, so again I have to pre-process the paragraphs in the fashion that Tallent demonstrates.

OK, so number (2) is rather specific to the needs of a single book, but the first conversion has to be done a lot as I'm working through the book, so it's scripted, along with any other help that kindlegen needs.

Kindlegen does a simple job badly, so bypass the bits that you don't need.
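The first conversion is the sort of thing that takes only a few lines to script. A minimal sketch in Python follows; the function name is made up, and restricting the swap to zero-width spaces that sit directly beside an em dash is a cautious assumption of this sketch rather than something the posting specifies.

```python
ZWSP = "\u200b"     # ZERO WIDTH SPACE, used in the "ideal" source
ZWNJ = "\u200c"     # ZERO WIDTH NON-JOINER, the stand-in the Kindle tolerates
EM_DASH = "\u2014"

def kindle_friendly_dashes(text):
    """Swap ZWSP for ZWNJ, but only where it sits directly beside an em dash."""
    return (text.replace(ZWSP + EM_DASH, ZWNJ + EM_DASH)
                .replace(EM_DASH + ZWSP, EM_DASH + ZWNJ))

source = "word" + ZWSP + EM_DASH + ZWSP + "word"
converted = kindle_friendly_dashes(source)
print([f"U+{ord(ch):04X}" for ch in converted if not ch.isalpha()])
```

Run over the Kindle-bound copy only, this leaves the EPUB and browser source with the correct character.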

On Tue, December 13, 2011 3:05 am, Paul Flo Williams wrote:
There is a fourth way: pre-process your (X)HTML to downgrade it to an HTML 3.2 tag soup + Kindle attributes so that the kindlegen step does nothing other than wrapping up your HTML into a MOBI file.
"Tag soup" refers to formatted markup which does not consist of correct HTML syntax or document structure. The expectation for web browsers is that they will not fail when presented with invalid HTML, but will present the content by making reasonable heuristic "guesses". "Tag soup" may collectively refer to a large number of common authoring mistakes, such as malformed HTML tags, improperly-nested HTML elements, and unescaped character entities (especially ampersands (&) and less-than signs (<)). See generally, http://en.wikipedia.org/wiki/Tag_soup.

I do not know if the MobiPocket parser which is at the heart of the Kindle is a tag soup parser or not, but there is no need to degrade valid (X)HTML to tag soup for the Kindle to use it.

That said, it is true that Kindle relies on certain proprietary elements and attributes, which I guess qualifies it generally in the "tag soup" category. But you definitely want to keep the file as well-formed XML.
[snip]
So, use xsltproc or Perl plus one of the XML modules to do what kindlegen does, except applying some of your own conversions.
[snip]
Kindlegen does a simple job badly, so bypass the bits that you don't need.
What you are suggesting is essentially to re-write Kindlegen to do things better. This is not a bad idea, but I think it may be a bit more complex than you think (or maybe I'm being unfair in suggesting that you don't grasp the complexity).

Using XSLT is a good idea, but it is not a complete solution. It should work fine for inline styles, but doesn't apply CSS styling to each element first. What is needed is a program that will apply CSS to the document tree, and perhaps do other transformations as well, before using XSLT to produce the final output.

Clearly Perl is an option, although I would favor Java due to its superior performance and the JAXP APIs. Python seems to be the language /du jour/, and it seems to be the one favored by most people involved with e-books, so maybe that would be the right solution. Fast hardware can clearly compensate for Python's poor performance.

We all know, more or less, how .mobi files are created, so once you've gotten to the point where the input is Kindle-ready, one could just close the circle and replace Kindlegen altogether.

Sounds like a fun project.

Lee Passey wrote:
On Tue, December 13, 2011 3:05 am, Paul Flo Williams wrote:
There is a fourth way: pre-process your (X)HTML to downgrade it to an HTML 3.2 tag soup + Kindle attributes so that the kindlegen step does nothing other than wrapping up your HTML into a MOBI file.
"Tag soup" refers to formatted markup which does not consist of correct HTML
[snip]
That said, it is true that Kindle relies on certain proprietary elements and attributes, which I guess qualifies it generally in the "tag soup" category. But you definitely want to keep the file as well-formed XML.
That's an extraordinarily verbose way of agreeing with me :-)
Kindlegen does a simple job badly, so bypass the bits that you don't need.
What you are suggesting is essentially to re-write Kindlegen to do things better. This is not a bad idea, but I think it may be a bit more complex than you think (or maybe I'm being unfair in suggesting that you don't grasp the complexity).
No, I'm suggesting doing some pre-processing to the HTML that you give to Kindlegen, converting constructs that it will convert badly into ones that fly straight through from your input to its mobi output.

Here's a concrete example:

Let's say I've got a poem with some long lines that I wish to mark up. I don't know what font size or screen width the reader is using, so I want to make the display as flexible as possible. In the book that I'm copying, the poem already has wrapped lines:

    The first line of my poem
    The second line of my poem
    A longer line comes next, and goes on a bit but still does not rhyme
    The last line ends with a flourish!

I've decided that a flexible way of marking this up in HTML is this:

    <html>
    <head>
    <style>
    .poem { margin-left: 2em }
    .line, .iline { display: block; text-indent: -2em }
    .line { }
    .iline { margin-left: 1em }
    </style>
    </head>
    <body>
    <div class="poem">
    <p class="verse"><span class="line">The first line of my poem</span>
    <span class="iline">The second line of my poem</span>
    <span class="line">A longer line comes next, and goes on a bit but still does not rhyme</span>
    <span class="iline">The last line ends with a flourish!</span></p>
    <p class="verse"><span class="line">The first line of my poem</span>
    <span class="iline">The second line of my poem</span>
    <span class="line">A longer line comes next, and goes on a bit, but still does not rhyme</span>
    <span class="iline">The last line ends with a flourish!</span></p>
    </div>
    </body>
    </html>

You'll note that this needs CSS to show up as I intended, but the display in a modern browser works well. However, kindlegen does an awful job at converting this.

So I decide to preprocess the HTML that I feed to kindlegen. I can strip the CSS entirely and use the classes to do some substitutions, so that I feed kindlegen this:

    <html>
    <head>
    </head>
    <body>
    <p height="1em">The first line of my poem</p>
    <p height="0" width="-2em"> The second line of my poem</p>
    <p height="0">A longer line comes next, and goes on a bit, but still does not rhyme</p>
    <p height="0" width="-2em"> The last line ends with a flourish!</p>
    <p height="1em">The first line of my poem</p>
    <p height="0" width="-2em"> The second line of my poem</p>
    <p height="0">A longer line comes next, and goes on a bit, but still does not rhyme</p>
    <p height="0" width="-2em"> The last line ends with a flourish!</p>
    </body>
    </html>

(I haven't got a copy of Tallent handy, so I might have got the indentation trick wrong.)

In this case, I selected elements to process by their class attribute, and performed some attribute and textual substitutions. I don't bother touching all the parts of the document that already convert well, so I don't have to fully process the CSS. This part of the document goes through kindlegen verbatim because there isn't anything it needs to convert.

Of course, this is part of my toolchain for my books because I have made certain choices about the vocabulary I mark up with, but the principle is simple enough. I don't need to understand the structure of mobi files or throw away kindlegen wholesale.
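A class-driven substitution of this kind might look roughly like the following Python sketch, using the standard xml.etree.ElementTree module. The function name is hypothetical, and the height and width values are copied from the example above, which its author flags as possibly wrong; treat them as placeholders rather than verified Kindle attributes. Any leading padding inside the indented lines is left out.

```python
import xml.etree.ElementTree as ET

def flatten_poem(poem):
    """Turn <p class="verse"><span class="line|iline">..</span></p> into flat <p>s."""
    paragraphs = []
    for verse in poem.iter("p"):
        for index, span in enumerate(verse.iter("span")):
            p = ET.Element("p")
            # The first line of a verse carries the blank line above it;
            # the remaining lines get no vertical space at all.
            p.set("height", "1em" if index == 0 else "0")
            if span.get("class") == "iline":
                # Value taken from the posting; unverified.
                p.set("width", "-2em")
            p.text = "".join(span.itertext())
            paragraphs.append(p)
    return paragraphs

SOURCE = """<div class="poem">
<p class="verse"><span class="line">The first line of my poem</span>
<span class="iline">The second line of my poem</span></p>
</div>"""

poem = ET.fromstring(SOURCE)
paragraphs = flatten_poem(poem)
for p in paragraphs:
    print(ET.tostring(p, encoding="unicode"))
```

Because the source is parsed as XML, this only works while the input stays well-formed, and only the elements carrying the targeted classes are touched.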

Paul Flo>No, I'm suggesting doing some pre-processing to the HTML that you give to Kindlegen, converting constructs that it will convert badly into ones that fly straight through from your input to its mobi output.

Just to be clear, Marcello is already doing some of this using his epubmaker software in "kindle" mode -- where he moves a bunch of stuff from CSS back to inline coding -- because in theory at least kindlegen handles more stuff correctly inline than it does in CSS. [epubmaker is part of the posting "sausage making" between your submission of HTML to PG and PG posting mobi on their site.]

If you install his epubmaker software on your computer, and then, at least if you accidentally on purpose leave off the Path to kindlegen, epubmaker leaves behind kindle versions of the kindle "epub" intermediate files, which you can unzip to examine the "HTML" transformations Marcello is already making in an attempt to better accommodate, in kindlegen, the HTML the volunteers are actually submitting.
.line, .iline { display: block; text-indent: -2em }
Kindle and some other reader devices don't in general like negative indents. Some epub devices allow the end user to adjust (read: reduce) margins from the "end user font choices" system dialog, and then once again negative indents make the devices pretty unhappy. Welcome to the wonders of poetry -- assuming you can discern the original author's intent in the first place.

On 12/14/2011 05:17 PM, Jim Adcock wrote:
If you install his epubmaker software on your computer, and then, at least if you accidentally on purpose leave off the Path to kindlegen, epubmaker leaves behind kindle versions of the kindle "epub"
Or if you say

    epubmaker -v -v ...

--
Marcello Perathoner
webmaster@gutenberg.org

For those interested in formatting poetry:

http://blog.epubbooks.com/898/formatting-poetry-for-small-screens

Can't say I agree entirely, but it's always good to get someone else's perspective.

Lee quoting>http://blog.epubbooks.com/898/formatting-poetry-for-small-screens

The author makes the common mistake of assuming that epub devices work the way we would like them to work. One might hope he would actually test his suggestions on a large number of different epub devices running differing rendering software to see what actually happens.

On Wed, December 14, 2011 7:00 am, Paul Flo Williams wrote:
Lee Passey wrote:
[snipped discussion of "tag soup"]
That's an extraordinarily verbose way of agreeing with me :-)
Verbosity is only one of my many faults ;-).

The point I was trying to make, perhaps inartfully, is that the term "tag soup" covers a multitude of sins.

You might have tag soup where you have valid XML which is not valid HTML because it contains elements or attributes in addition to those allowed by the XHTML DTD. This kind of tag soup can still be parsed by an XML parser, and is mostly harmless. I consider this to be class 3 tag soup.

You can also have tag soup which is valid HTML but not valid XML. This is because HTML is derived from SGML, and some things like implicitly closing tags and possibly unnested tags are allowed in SGML but not XML. While valid, this kind of file cannot be parsed by an XML parser, including an XSLT processor. To me this is class 2 tag soup.

Lastly, you can have tag soup which is simply wrong by any standard: examples would include using block elements inside inline elements, failing to escape ampersands or angle brackets, or using invalid character entities. This is class 1 tag soup.

Class 3 tag soup can be fixed by an appropriate XSLT script, but class 2 and class 1 tag soup cannot. Abbyy FineReader produces class 2 tag soup, as does the script at archive.org written by kenh, which means that many of my XML tools cannot work until the file is "fixed."

I don't know the tolerance Kindlegen has for these different types of tag soup, but from a practical perspective, I would think you should limit yourself to class 3 tag soup as Kindlegen input. So when you say "downgrade it to an HTML 3.2 tag soup," I simply wanted to caution that it's okay to downgrade to class 3 tag soup, but probably not class 2, and certainly not class 1.
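A quick way to tell whether a file is at worst class 3 in the sense above is to hand it to an XML parser and see whether it survives. The sketch below uses Python's standard xml.etree.ElementTree; the function name and sample strings are illustrative only. It checks well-formedness and nothing more: it does not validate against any DTD, and it cannot distinguish class 2 from class 1.

```python
import xml.etree.ElementTree as ET

def is_well_formed_xml(markup):
    """True if an XML parser accepts the markup, i.e. it is at worst 'class 3'."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

# Extra, non-XHTML attributes, but still well-formed XML.
extended = '<html><body><p height="0" width="-2em">line</p></body></html>'
# Implicitly closed <p> tags: legal in HTML, fatal for an XML parser.
implicit_close = "<html><body><p>one<p>two</body></html>"
# Unescaped ampersand: wrong by any standard.
unescaped = "<html><body><p>Fish & chips</p></body></html>"

for label, markup in [("extended", extended),
                      ("implicit_close", implicit_close),
                      ("unescaped", unescaped)]:
    print(label, is_well_formed_xml(markup))
```

Only the first sample parses; the other two would have to be repaired by a more tolerant tool before any XML-based step could touch them.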
What you are suggesting is essentially to re-write Kindlegen to do things better. This is not a bad idea, but I think it may be a bit more complex than you think (or maybe I'm being unfair in suggesting that you don't grasp the complexity).
No, I'm suggesting doing some pre-processing to the HTML that you give to Kindlegen, converting constructs that it will convert badly into ones that fly straight through from your input to its mobi output.
Here's a concrete example:
I'm less interested in concrete examples than I am in abstract examples. It's easy to write a script that converts <div class="iline">...</div> to <div> ...</div>. It's somewhat more difficult to write a script that converts <div class="iline">...</div> to <div style="margin-left:1em">...</div>, where the style attribute is derived from the <style> definition (I can't think of a way to do it with XSLT, which means some other script language must be used), but relatively straightforward. But what I really want is a script/program that converts <div style="margin-left:[some unpredictable value]"> to <div class="semantic">[a number of non-breaking spaces calculated based on the unpredictable value]...</div>. In other words, a generic transformation rather than a specific transformation.

If Kindlegen does /most/ of the transformations correctly, then it makes the most sense to provide a program that performs only those transformations that Kindlegen handles badly, if at all. But if the number of identified transformations that Kindlegen handles badly grows significantly, and if one has designed a generic transformation engine, then it may make the most sense to simply handle /all/ the required transformations in this new, open-source transformation engine, and leave Kindlegen only the job of packaging the (X)HTML with additions into the .mobi format. Because we understand the .mobi format, it should be a small step to add the packaging function to the transformation engine and replace Kindlegen entirely. I'm not saying that that's the way it /should/ be done, only that it's an option that should not be dismissed. [snip]
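The generic transformation described above, from an arbitrary margin-left value to a computed run of non-breaking spaces, could be sketched in Python along these lines. This is only an illustration under stated assumptions: the input must already be well-formed XML, only values given in em are handled, and the ratio of two non-breaking spaces per em is an arbitrary choice, not anything Kindlegen or the Kindle defines. The names indent_with_nbsp and SPACES_PER_EM are invented for the sketch.

```python
import re
import xml.etree.ElementTree as ET

NBSP = "\u00a0"
SPACES_PER_EM = 2  # assumption: arbitrary ratio, tune by testing on real devices

# Matches a single declaration such as "margin-left: 1.5em".
MARGIN_LEFT = re.compile(r"margin-left\s*:\s*(\d+(?:\.\d+)?)\s*em")


def indent_with_nbsp(xhtml: str) -> str:
    """Replace inline margin-left (in em) with leading non-breaking spaces."""
    root = ET.fromstring(xhtml)
    for element in root.iter():
        style = element.get("style")
        if not style:
            continue
        kept = []
        indent = 0
        for declaration in style.split(";"):
            declaration = declaration.strip()
            if not declaration:
                continue
            match = MARGIN_LEFT.fullmatch(declaration)
            if match:
                indent = round(float(match.group(1)) * SPACES_PER_EM)
            else:
                kept.append(declaration)  # leave unrelated declarations alone
        if indent:
            element.text = NBSP * indent + (element.text or "")
        if kept:
            element.set("style", "; ".join(kept))
        else:
            del element.attrib["style"]
    return ET.tostring(root, encoding="unicode")


sample = '<div class="poem"><div style="margin-left:1em">The second line</div></div>'
converted = indent_with_nbsp(sample)
```

The sketch drops the margin-left declaration once it has been turned into spaces and keeps any other inline declarations. Resolving class names to their values in a <style> block, the intermediate step mentioned above, is not attempted here.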
I've decided that a flexible way of marking this up in HTML is this:
<html>
<head>
<style>
.poem { margin-left: 2em }
.line, .iline { display: block; text-indent: -2em }
.line { }
.iline { margin-left: 1em }
</style>
<body>
<div class="poem">
<p class="verse"><span class="line">The first line of my poem</span>
So what will happen if the user agent you're using indents paragraphs 50% of the display? And it's pretty clear to me that a verse is not a paragraph. So why not use <div> for verses instead? The default display mode for <div> is block, and the default display mode for <span> is inline. But in your example you have changed the display mode for <span class="line"> to block. Why not just use <div> in the first place, as its default presentation is exactly what you wanted? My version would have been:

<div class="poem">
<div class="verse">
<div class="line">The first line of my poem</div>
<div class="iline"> The second line of my poem</div>
<div class="line">A longer line comes next, and goes on a bit, but still does not rhyme</div>
...

You may note that I have used non-breaking spaces in the "master" version just like in the Kindle version. This is consistent with my view that "master" versions should look acceptable even when the User Agent can't handle CSS.
You'll note that this needs CSS to show up as I intended, but the display in a modern browser works well. However, kindlegen does an awful job at converting this.
So I decide to preprocess the HTML that I feed to kindlegen. I can strip the CSS entirely and use the classes to do some substitutions, so that I feed kindlegen this:
<html> <head> </head> <body> <p height="1em">The first line of my poem</p>
A line is not a paragraph; use <div> instead. You've also lost the association of lines into verses, and verses into poems. Probably not a problem if this markup is /derived/ from a "master" version, but there's really no reason not to preserve the structure moving forward. (It's interesting to note that Kindlegen also preserves styles it cannot convert, and which the Kindle will ignore). I believe Kindle supports the <blockquote> element which provides right/left margin indentation, just like your <div class="poem"> does. Maybe for the Kindle you would want to enclose your entire poem as a <blockquote class="poem">, which may provide some of the display "goodness" you are seeking. [snip]
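A class-driven preprocessing pass that follows the advice above (keep the poem/verse/line structure, use <div> for lines, wrap the poem in <blockquote>, and drop the CSS) might look roughly like the following Python sketch. It assumes the master file is well-formed XML and uses the class names from the example; the function name, the mapping table, and the choice of two non-breaking spaces for an indented line are all assumptions made for illustration, and nothing here has been tested against Kindlegen.

```python
import xml.etree.ElementTree as ET

NBSP = "\u00a0"
# Assumption: leading non-breaking spaces to give each line class.
INDENT_FOR_CLASS = {"line": 0, "iline": 2}


def preprocess_poem(xhtml: str) -> str:
    """Rewrite class-based poem markup into CSS-free markup, keeping structure."""
    root = ET.fromstring(xhtml)
    for element in root.iter():
        css_class = element.get("class", "")
        if css_class == "poem":
            # Rely on the reader's default <blockquote> margins for the poem.
            element.tag = "blockquote"
        elif css_class == "verse":
            element.tag = "div"  # a verse is not a paragraph
        elif css_class in INDENT_FOR_CLASS:
            element.tag = "div"  # block by default, so no display:block needed
            element.text = NBSP * INDENT_FOR_CLASS[css_class] + (element.text or "")
    # Remove <style> blocks; snapshot the element list before mutating the tree.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "style":
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


master = (
    '<div class="poem"><p class="verse">'
    '<span class="line">The first line of my poem</span>'
    '<span class="iline">The second line of my poem</span>'
    '</p></div>'
)
for_kindlegen = preprocess_poem(master)
```

Because elements are selected by class rather than by element name, the same pass works whether the master uses <span>, <p>, or <div> for lines and verses.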
I don't need to understand the structure of mobi files or throw away kindlegen wholesale.
Absolutely. My fundamental rule is don't do anything that doesn't need to be done (even if it would be fun to do). But I can definitely envision that over time a generic Kindlegen preprocessor might elbow out Kindlegen itself.

Lee Passey wrote:
On Wed, December 14, 2011 7:00 am, Paul Flo Williams wrote:
Lee Passey wrote:
[snipped discussion of "tag soup"]
That's an extraordinarily verbose way of agreeing with me :-)
Verbosity is only one of my many faults ;-). The point I was trying to make, perhaps inartfully, is that the term "tag soup" covers a multitude of sins.
And then you made a taxonomy of them, compounding the sin :-)
I don't know the tolerance Kindlegen has for these different types of tag soup, but from a practical perspective, I would think you should limit yourself to class 3 tag soup as Kindlegen input. So when you say "downgrade it to an HTML 3.2 tag soup," I simply wanted to caution that it's okay to downgrade to class 3 tag soup, but probably not class 2, and certainly not class 1.
Granted. I'm processing XML throughout, else I'd be writing too many custom tools.
The default display mode for <div> is block, and the default display mode for <span> is inline. But in your example you have changed the display mode for <span class="line"> to block. Why not just use <div> in the first place, as its default presentation is exactly what you wanted?
OK, I'll have to shortcut this with a mea culpa: I both over- and under-thought my example. I spent a lot of time marking up a book where many paragraphs contained quoted verses or couplets, and I went backwards and forwards with thoughts of "it looks like two paragraphs, so should I mark it up as such, or should I mark it up as the single paragraph (thought) that it really is, and leave presentation for later?" When going with the latter, I had to use <span> because paragraphs can't contain other divisions of text. I then thought I'd based my example on the markup in the book, but did a poor job. The point about selecting elements to transform by class instead of element name, as we might do with microformats, got lost in the noise.
My fundamental rule is don't do anything that doesn't need to be done (even if it would be fun to do). But I can definitely envision that over time a generic Kindlegen preprocessor might elbow out Kindlegen itself.
Two years ago, I think you'd be right. With KF8 in the wings, I can't say I'm willing to do any more than provide crutches for the vocabulary I've chosen to use for markup.

On 12/15/2011 02:19 AM, Paul Flo Williams wrote:
Two years ago, I think you'd be right. With KF8 in the wings, I can't say I'm willing to do any more than provide crutches for the vocabulary I've chosen to use for markup.
Relatedly, here's some early info on that: http://www.the-digital-reader.com/2011/12/13/kindle-format-8-demo-now-availa...

David
participants (7): D Garcia, James Adcock, Jim Adcock, Keith J. Schultz, Lee Passey, Marcello Perathoner, Paul Flo Williams