[gutvol-d] Re: so what is so important about pagination?

22 Feb 2010

      On 2/22/2010 4:02 PM, Scott Olson wrote:
...
On Mon, Feb 22, 2010 at 3:45 PM, Lee Passey <lee@novomail.net> wrote:
When ABBYY FineReader saves its OCR output in HTML format it has the
    option of placing a break (<br>) at the end of each line, and a
    horizontal rule (<hr>) between each page (an alternative is to save
    each scanned page as a separate file, but I find that less
    convenient). I then wrote a short program (could probably be done
    just as easily with a perl script, or even sed) that replaced each
    <hr> with an anchor tag indicating the page number (<a name="page##"
    />), and replaced each <br> with <lb />. Now <lb> is not a valid
    HTML element (hence the cheat), but I know of no user agent that
    will fail to render an HTML file just because it has an invalid
    element in it.
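
A minimal sketch of the replacement pass described above, in Python rather than perl or sed. It is not the original program. It assumes the export puts a plain <hr> between pages and a <br> at the end of each line, and that the <hr> between two pages marks the start of the later page; the function name and the first_page parameter are made up for illustration.

```python
import re

# Assumption: pages are separated by <hr> and lines end with <br>,
# as in the FineReader HTML export described above.
HR = re.compile(r"<hr\b[^>]*>", re.IGNORECASE)
BR = re.compile(r"<br\b[^>]*>", re.IGNORECASE)


def mark_pages_and_lines(html, first_page=1):
    """Replace each <hr> with a page anchor and each <br> with <lb />."""
    page = first_page

    def anchor(_match):
        # Assumption: an <hr> sits between two pages, so it marks
        # the start of the next page.
        nonlocal page
        page += 1
        return '<a name="page%d" />' % page

    html = HR.sub(anchor, html)
    return BR.sub("<lb />", html)


if __name__ == "__main__":
    sample = "one<br>\ntwo<br>\n<hr>\nthree<BR />\n"
    print(mark_pages_and_lines(sample))
```

Since the first page has no <hr> in front of it, an anchor for it would have to be added separately.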
Since the user agent will take care of rewrapping, you could just leave
the linebreaks where they are.
I considered that, but I'm not in favor of invisible markup. What do I
mean by invisible? We know that the HTML spec says that runs of white
space can be collapsed unless they are specifically identified as
"non-breaking," and we know that spaces, tabs and newlines are all white
space and sometimes very hard to distinguish from each other. This means
that my HTML tools might have a tendency to rewrap these lines if I'm
not extremely diligent. And because it's still white space there's a
good likelihood that I would not even notice if it got screwed up.
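
To illustrate the risk, here is a small sketch of what a white-space-normalizing tool does to line breaks that exist only as newlines, compared with breaks carried by an explicit element. The collapse_whitespace function is a stand-in for whatever a real HTML tool might do, not any particular tool's behavior.

```python
import re


def collapse_whitespace(text):
    """Collapse every run of spaces, tabs and newlines to one space."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip()


with_newlines = "It was the best of times,\nit was the worst of times,"
with_lb = "It was the best of times,<lb />\nit was the worst of times,<lb />"

# The newline-only version loses its line breaks without a trace.
print(collapse_whitespace(with_newlines))

# The marked-up version still shows where each line ended.
print(collapse_whitespace(with_lb))
```

After collapsing, the first string is a single line with no record of where the original break was, while the second still carries both <lb /> markers.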

I like markup that's in my face, and obviously not part of the text. 
Markup rules like "three blank lines indicate a minor header, but four 
blank lines indicate a major header" and "one space at the beginning of 
a line means don't wrap this line, but two spaces means wrap this line 
but do a block indent" just make me shudder. If it's markup it should be 
markup, and if it's not it shouldn't pretend that it is.

This is kind of a specific instance of the general rule that a markup 
element should do one thing, and one thing only.
...
If you really want to have them encoded,
I'd opt for some CSS.
    br.lb {display: none}
in your <style> section
Then <br class="lb" /> wherever you're currently putting <lb />.
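
A rough sketch of how a file already marked up with <lb /> could be switched over to this CSS variant. It assumes the document already contains a <style> element to receive the rule; the function name is hypothetical.

```python
import re

LB = re.compile(r"<lb\s*/?>", re.IGNORECASE)
STYLE_OPEN = re.compile(r"(<style\b[^>]*>)", re.IGNORECASE)
STYLE_RULE = "br.lb { display: none }"


def lb_to_classed_br(html):
    """Turn each <lb /> into <br class="lb" /> and add the hiding rule."""
    html = LB.sub('<br class="lb" />', html)
    # Assumption: a <style> element exists; the rule is added right
    # after the first opening tag. Without one, nothing is inserted.
    return STYLE_OPEN.sub(lambda m: m.group(1) + "\n" + STYLE_RULE,
                          html, count=1)


if __name__ == "__main__":
    doc = ('<html><head><style type="text/css">\n</style></head>'
           '<body>one<lb />\ntwo<lb/></body></html>')
    print(lb_to_classed_br(doc))
```

The result is valid HTML, but it only hides the breaks in user agents that actually apply the style sheet.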
This is an option I have tried, and it is not a bad idea. I prefer the
invalid element idea simply because many user agents I'm familiar with,
particularly phones and handheld devices, simply have not yet figured out
how to do CSS. At one point Mobipocket claimed CSS support, but on
closer examination I discovered that their publisher tool simply went 
through and replaced CSS styles with elements that their UA actually 
recognized. Your "display:none" trick simply wouldn't have worked in the 
old Mobipocket reader.

Now I know that the Kindle software was based largely (if not entirely) 
on the old Mobipocket reader. What is the effect of trying to use CSS to 
turn off the display of line-breaks after the file has been converted to 
.mobi? Maybe someone with a Kindle could enlighten us?