
On 2/22/2010 4:02 PM, Scott Olson wrote:
On Mon, Feb 22, 2010 at 3:45 PM, Lee Passey <lee@novomail.net> wrote:
When ABBYY FineReader saves its OCR output in HTML format it has the option of placing a break (<br>) at the end of each line, and a horizontal rule (<hr>) between each page (an alternative is to save each scanned page as a separate file, but I find that less convenient). I then wrote a short program (could probably be done just as easily with a perl script, or even sed) that replaced each <hr> with an anchor tag indicating the page number (<a name="page##" />), and replaced each <br> with <lb />. Now <lb> is not a valid HTML element (hence the cheat), but I know of no user agent that will fail to render an HTML file just because it has an invalid element in it.
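A minimal sketch of that kind of replacement pass, in Python rather than whatever the original program was written in. The function name `mark_pages_and_lines` and the page-numbering rule (the first <hr> is assumed to open page 2, since the rule sits between pages) are illustrative assumptions, not details taken from the post.

```python
import re

# Match <hr> and <br> in any case, with or without attributes or a trailing slash.
HR_RE = re.compile(r"<hr\b[^>]*>", re.IGNORECASE)
BR_RE = re.compile(r"<br\b[^>]*>", re.IGNORECASE)


def mark_pages_and_lines(html, first_page=1):
    """Replace each <hr> with a page anchor and each <br> with <lb />.

    Assumption: each <hr> separates two pages, so the first one
    encountered marks the start of page first_page + 1.
    """
    page = first_page

    def anchor(match):
        nonlocal page
        page += 1
        return '<a name="page%d" />' % page

    html = HR_RE.sub(anchor, html)
    return BR_RE.sub("<lb />", html)


if __name__ == "__main__":
    sample = "line one<br>\nline two<BR />\n<hr>\nnext page<br>"
    print(mark_pages_and_lines(sample))
```

A regular-expression pass like this is only reasonable because the OCR output is machine-generated and regular; arbitrary HTML would call for a real parser.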
Since the user agent will take care of rewrapping, you could just leave the linebreaks where they are.
I considered that, but I'm not in favor of invisible markup. What do I mean by invisible? We know that the HTML spec says that multiple white space can be collapsed unless it is specifically identified as "non-breaking," and we know that spaces, tabs and newlines are all white space and sometimes very hard to distinguish from each other. This means that my HTML tools might have a tendency to wrap these lines up if I'm not extremely diligent. And because it's still white space there's a good likelihood that I may not even notice if it gets screwed up. I like markup that's in my face, and obviously not part of the text. Markup rules like "three blank lines indicate a minor header, but four blank lines indicate a major header" and "one space at the beginning of a line means don't wrap this line, but two spaces means wrap this line but do a block indent" just make me shudder. If it's markup it should be markup, and if it's not it shouldn't pretend that it is. This is kind of a specific instance of the general rule that a markup element should do one thing, and one thing only.
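To make the worry concrete, here is a small sketch of what a whitespace-normalizing tool does to each approach. The `collapse_whitespace` helper is a hypothetical stand-in for any HTML tool that rewraps text; it is not a specific product's behavior.

```python
import re


def collapse_whitespace(text):
    """Collapse every run of spaces, tabs and newlines to a single space,
    roughly what an HTML-aware tool is permitted to do."""
    return re.sub(r"\s+", " ", text).strip()


# Line breaks recorded only as newlines (invisible markup).
implicit = "It was the best\nof times, it was\nthe worst of times"
# Line breaks recorded with an explicit element.
explicit = "It was the best<lb />\nof times, it was<lb />\nthe worst of times"

if __name__ == "__main__":
    print(collapse_whitespace(implicit))
    print(collapse_whitespace(explicit))
```

After collapsing, the first string carries no trace of where the original lines ended, while the second still has both <lb /> markers in place.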
If you really want to have them encoded, I'd opt for some CSS: br.lb {display: none} in your <style> section, then <br class="lb" /> wherever you're currently putting <lb />.
This is an option I have tried, and it is not a bad idea. I prefer the invalid element idea simply because many user agents I'm familiar with, particularly on phones and handheld devices, have not yet figured out how to do CSS. At one point Mobipocket claimed CSS support, but on closer examination I discovered that their publisher tool simply went through and replaced CSS styles with elements that their UA actually recognized. Your "display:none" trick simply wouldn't have worked in the old Mobipocket reader. Now I know that the Kindle software was based largely (if not entirely) on the old Mobipocket reader. What is the effect of trying to use CSS to turn off the display of line breaks after the file has been converted to .mobi? Maybe someone with a Kindle could enlighten us?