
Love it!!!

Michael

On Mon, 22 Feb 2010, Scott Olson wrote:
On Mon, Feb 22, 2010 at 3:45 PM, Lee Passey <lee@novomail.net> wrote:

When ABBYY FineReader saves its OCR output in HTML format it has the option of placing a break (<br>) at the end of each line, and a horizontal rule (<hr>) between each page (an alternative is to save each scanned page as a separate file, but I find that less convenient).

I then wrote a short program (could probably be done just as easily with a perl script, or even sed) that replaced each <hr> with an anchor tag indicating the page number (<a name="page##" />), and replaced each <br> with <lb />. Now <lb> is not a valid HTML element (hence the cheat), but I know of no user agent that will fail to render an HTML file just because it has an invalid element in it.
Since the user agent will take care of rewrapping, you could just leave the linebreaks where they are. If you really want to have them encoded, I'd opt for some CSS:

    br.lb {display: none}

in your <style> section. Then <br class="lb" /> wherever you're currently putting <lb />.
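
For anyone who wants to try the conversion discussed above, here is a minimal sketch in Python. It is not Lee's original program, which is not shown in this thread. The function name `mark_pages_and_lines` and the parameters `first_page` and `line_tag` are hypothetical, and the page numbering is an assumption: the first <hr> is treated as the start of the page after `first_page`. Regex replacement is adequate only for simple, predictable markup like the OCR output described; it is not a general HTML parser.

```python
import re

# Match <hr>, <hr/>, <hr />, with or without attributes, in any letter case.
HR_RE = re.compile(r"<hr\b[^>]*>", re.IGNORECASE)
# Same idea for <br>.
BR_RE = re.compile(r"<br\b[^>]*>", re.IGNORECASE)


def mark_pages_and_lines(html, first_page=1, line_tag="<lb />"):
    """Replace page rules with numbered anchors and line breaks with line_tag.

    Assumption: the text before the first <hr> is page `first_page`, so the
    first <hr> becomes the anchor for page `first_page + 1`, and so on.
    """
    page = first_page

    def anchor(_match):
        nonlocal page
        page += 1
        return f'<a name="page{page}" />'

    html = HR_RE.sub(anchor, html)
    # A function replacement avoids backslash processing in line_tag.
    return BR_RE.sub(lambda _match: line_tag, html)


if __name__ == "__main__":
    sample = "one<br>two<hr>three<BR />four"
    # Lee's variant, using the invalid-but-harmless <lb /> element.
    print(mark_pages_and_lines(sample))
    # Scott's variant, using a valid <br> hidden by the CSS rule.
    print(mark_pages_and_lines(sample, line_tag='<br class="lb" />'))
```

Passing `line_tag='<br class="lb" />'` gives the valid-HTML variant, which relies on the `br.lb {display: none}` rule being present in the <style> section.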