Re: [gutvol-d] Producing epub ready HTML

23 Jan 2012

      On Mon, January 23, 2012 3:48 pm, Marcello Perathoner wrote:
...
Please show me how.
You do it the same way you do TEI, except you map the tags differently. We
know, for example, that XHTML can easily be converted to TEI, so from there it
seems the process could be the same.

What I did was load the HTML into an in-memory DOM (best to use a tag-soup
parser, because you can't guarantee that the input file is well-formed [unless
you've pre-processed it with Tidy]). Then walk the tree, emitting the
appropriate text as you go. Some tags get translated output both before and
after their children (e.g. <i> and <b>); other tags need something only before
or only after. If you want to be really careful about line lengths, buffer one
word at a time, and then decide whether a newline needs to be added before the
word.
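
To make the tree walk concrete, here is a minimal sketch in Python. It is not
the Tidy-based code I mentioned; it uses the standard library's html.parser
(which tolerates tag soup) to build a small tree, then walks it. All the names
(Node, TreeBuilder, Wrapper, html_to_text) are made up for illustration, and
the choice of _ for italics and * for bold is only an assumption about what
the plain text should look like.

```python
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "meta", "link", "input"}   # tags with no children
BLOCK = {"p", "div", "h1", "h2", "h3", "blockquote"}  # tags that end a paragraph
INLINE = {"i": ("_", "_"), "em": ("_", "_"),          # text emitted before/after
          "b": ("*", "*"), "strong": ("*", "*")}      # children (assumed markers)

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []

class TreeBuilder(HTMLParser):
    """Builds a simple in-memory tree; forgiving about unclosed/stray tags."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open ancestor with this name; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)

class Wrapper:
    """Buffers one word at a time and decides whether a newline precedes it."""
    def __init__(self, width=72):
        self.width, self.lines, self.line, self.word = width, [], "", ""

    def feed(self, text):
        for ch in text:
            if ch.isspace():
                self.flush_word()
            else:
                self.word += ch

    def flush_word(self):
        if not self.word:
            return
        if not self.line:
            self.line = self.word
        elif len(self.line) + 1 + len(self.word) > self.width:
            self.lines.append(self.line)      # word will not fit: start new line
            self.line = self.word
        else:
            self.line += " " + self.word
        self.word = ""

    def line_break(self):
        self.flush_word()
        self.lines.append(self.line)
        self.line = ""

    def paragraph_break(self):
        self.flush_word()
        if self.line:
            self.lines.append(self.line)
            self.line = ""
        if self.lines and self.lines[-1] != "":
            self.lines.append("")             # one blank line between paragraphs

    def result(self):
        self.paragraph_break()
        return "\n".join(self.lines).rstrip("\n") + "\n"

def walk(node, out):
    if isinstance(node, str):
        out.feed(node)
        return
    before, after = INLINE.get(node.tag, ("", ""))
    if node.tag in BLOCK:
        out.paragraph_break()
    out.feed(before)
    for child in node.children:
        walk(child, out)
    out.feed(after)
    if node.tag in BLOCK:
        out.paragraph_break()
    elif node.tag == "br":
        out.line_break()

def html_to_text(html, width=72):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    out = Wrapper(width)
    walk(builder.root, out)
    return out.result()
```

Calling html_to_text("<p>Some <i>italic</i> and <b>bold text</p><p>Second")
copes with the unclosed <b> and the missing end tags and yields two wrapped
paragraphs. Headers, footers and chapter spacing are left out on purpose; they
would be extra cases in walk().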
...
The generated text must be ready to post: word wrap, PG header, PG
footer, lines between chapters, etc. must all be there and adhere to the PG
standard. There must be no post-generation edits required.
Do you want me to send you the code (based on the 2005? Tidy code base)? It
doesn't spit out the PG garbage text, but that could easily be added. I can't
say that it adheres to the PG standard, because I am still unaware that there
/is/ any PG standard, but if you were to tell me explicitly what /you/ think
the standard is I could tell you whether it satisfies it.

Lee Passey