At 03:39 PM 14/04/2010, you wrote:
Al Haines (shaw) wrote:

Actually, it's fairly common practice that if a paragraph/verse starts with some kind of graphical/illuminated character, the actual character it stands for is not included in the HTML version.

And that makes the HTML pretty useless for further processing like conversion to mobile formats.

It should be made a requirement that the stream of non-markup characters be identical in all versions of an ebook:

  lynx --dump

should produce a text that wdiffs equal with the text version.

And it does. Stripping the markup from this:

<p><span
class="dropcapc">&nbsp;</span><span
class="dropcap">T</span>he children were nestled all
snug in their beds,<br />
While visions of sugar-plums danced in their heads;<br />
And mamma in her kerchief, and I in my cap,<br />
Had just settled our brains for a long winter's nap,<br /><br
/>
</p>

gives you this:

The children were nestled all snug in their beds,
While visions of sugar-plums danced in their heads;
And mamma in her kerchief, and I in my cap,
Had just settled our brains for a long winter's nap,
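
For anyone without lynx and wdiff at hand, here is a rough Python sketch of the same check. It is only a stand-in, not a replacement for either tool: it drops every tag, keeps the character data, and compares the two versions word by word, ignoring line breaks and spacing. The names TextExtractor, words_from_html and same_words are made up for this example, and lynx itself does more than this (image alt text, link lists, and so on).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects character data and ignores every tag (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &nbsp; into characters
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def words_from_html(markup):
    """Return the words left over once the markup is stripped."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    # split() with no argument splits on any whitespace, including the
    # no-break space produced by &nbsp;, so layout differences vanish
    return "".join(parser.chunks).split()


def same_words(markup, plain_text):
    """True if the HTML and the text version carry the same word stream."""
    return words_from_html(markup) == plain_text.split()
```

Because the drop cap "T" sits in its own span as a real character, it joins up with "he" and the comparison passes. If the "T" were only an image with no text behind it, the first word would come out as "he" and the check would fail.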

JHowse


================================================================================
"Turning a Picture into a thousand words" Preserving History One Page at a Time!!
Celebrating more than 17,350 books posted to Project Gutenberg!
Join Project Gutenberg's Distributed Proofreaders http://www.pgdp.net/c/
================================================================================