
On Mon, January 23, 2012 3:48 pm, Marcello Perathoner wrote:
> Please show me how.
You do it the same way you do TEI, except you map the tags differently. We know, for example, that XHTML can easily be converted to TEI, so from there the process could be the same. What I did was load the HTML into an in-memory DOM (best to use a tag-soup parser, because you can't guarantee that the input file is well-formed [unless you've pre-processed it with Tidy]). Then walk the tree, spitting out the appropriate text as you go. Some tags get translated output both before and after their children (e.g. <i> and <b>); other tags only need something before or after. If you want to be really careful about line lengths, buffer one word at a time, and then decide whether a newline needs to be added before the word.
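To make the tree walk concrete, here is a minimal sketch in Python. It is not the Tidy-based code mentioned in this message: Python's standard html.parser stands in for a tag-soup parser, and the names Node, TreeBuilder, Wrapper, walk and html_to_text, the underscore/asterisk mapping for <i> and <b>, and the 72-column default are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumed mapping: tag -> (text emitted before children, text emitted after).
MARKUP = {"i": ("_", "_"), "em": ("_", "_"), "b": ("*", "*"), "strong": ("*", "*")}
BLOCK = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "blockquote"}
VOID = {"br", "hr", "img", "meta", "link", "input"}   # tags that never have children
SKIP = {"script", "style"}                            # subtrees with no readable text


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []          # mix of Node objects and str (text)


class TreeBuilder(HTMLParser):
    """Builds a small in-memory tree, tolerating unclosed and stray tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close up to the nearest matching open element; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


class Wrapper:
    """Buffers one word at a time and decides where the newlines go."""

    def __init__(self, width=72):
        self.width, self.lines, self.line, self.word = width, [], "", ""

    def feed(self, text):
        for ch in text:
            if ch.isspace():
                self.flush_word()
            else:
                self.word += ch

    def flush_word(self):
        if not self.word:
            return
        if self.line and len(self.line) + 1 + len(self.word) > self.width:
            self.lines.append(self.line)    # word does not fit: start a new line
            self.line = self.word
        else:
            self.line = self.line + " " + self.word if self.line else self.word
        self.word = ""

    def break_line(self):
        self.flush_word()
        self.lines.append(self.line)
        self.line = ""

    def end_block(self):
        self.flush_word()
        if self.line:
            self.lines.append(self.line)
            self.line = ""
        if self.lines and self.lines[-1] != "":
            self.lines.append("")           # one blank line between blocks


def walk(node, out):
    for child in node.children:
        if isinstance(child, str):
            out.feed(child)
        elif child.tag == "br":
            out.break_line()
        elif child.tag not in SKIP:
            before, after = MARKUP.get(child.tag, ("", ""))
            if child.tag in BLOCK:
                out.end_block()
            out.feed(before)                # output before the children
            walk(child, out)
            out.feed(after)                 # output after the children
            if child.tag in BLOCK:
                out.end_block()


def html_to_text(html, width=72):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    out = Wrapper(width)
    walk(builder.root, out)
    out.end_block()
    while out.lines and out.lines[-1] == "":
        out.lines.pop()
    return "\n".join(out.lines)
```

Calling html_to_text("<p>Some <i>italic</i> text.</p><p>Next") copes with the unclosed second paragraph. Because the before/after strings go through the same word buffer as the text, a marker such as "_" stays attached to the word it decorates when a line is wrapped.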
> The generated text must be ready to post, e.g. word wrap, pg header, pg footer, lines between chapters, etc. all must be there and adhere to pg standard. There must be no post-generation edits required.
Do you want me to send you the code (based on the 2005? Tidy code base)? It doesn't spit out the PG garbage text, but that could easily be added. I can't say that it adheres to the PG standard, because I am still unaware that there /is/ any PG standard. If you tell me explicitly what /you/ think the standard is, I can tell you whether the code satisfies it.