Re: [gutvol-d] Producing epub ready HTML

24 Jan 2012


On 01/24/2012 12:25 AM, Lee Passey wrote:
...
On Mon, January 23, 2012 3:48 pm, Marcello Perathoner wrote:
...
Please show me how.
You do it the same way you do TEI, except you map the tags differently. We
know, for example, that XHTML can easily be converted to TEI, so from there it
seems the process could be the same.
What I did was to load the HTML into an in-memory DOM (best to use a tag-soup
parser, because you can't guarantee well-formedness of the input file [unless
you've pre-processed with Tidy]). Then walk the tree, spitting out appropriate
text as you go. Some tags get translated output both before and after their
children (e.g. <i> and <b>); other tags only need something before or after. If
you want to be really careful about line lengths, buffer words at a time, and
then decide whether a new-line needs to be added before the word.
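The approach described above can be sketched roughly as follows. This is only
an illustration, not the Tidy-based code mentioned in this thread: it swaps in
Python's standard html.parser as the forgiving parser, and the tag mapping
(underscores for <i>, asterisks for <b>), the set of block tags, and the names
TreeBuilder, walk, wrap and html_to_text are all assumptions made for the
example. It emits no PG header or footer.

```python
from html.parser import HTMLParser

# Hypothetical mapping: tag -> (text emitted before children, text emitted after).
MARKS = {"i": ("_", "_"), "em": ("_", "_"), "b": ("*", "*"), "strong": ("*", "*"),
         "br": ("\n", "")}  # the newline collapses to a space when re-wrapped
BLOCK = {"p", "div", "h1", "h2", "h3", "h4", "li", "blockquote"}
VOID = {"br", "hr", "img", "meta", "link", "input"}
SKIP = {"head", "script", "style"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []


class TreeBuilder(HTMLParser):
    """Builds a small in-memory tree and tolerates unclosed or stray tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = self.cur = Node("#root")

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray closes.
        node = self.cur
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.cur = node.parent

    def handle_data(self, data):
        self.cur.children.append(data)


def wrap(text, width):
    """Greedy wrap: take one word at a time, break before it if it will not fit."""
    lines, line = [], ""
    for word in text.split():
        if line and len(line) + 1 + len(word) > width:
            lines.append(line)
            line = word
        else:
            line = f"{line} {word}" if line else word
    if line:
        lines.append(line)
    return "\n".join(lines)


def flush(buf, blocks, width):
    """Turn the buffered text into one wrapped block, if it is not empty."""
    text = wrap("".join(buf), width)
    buf.clear()
    if text:
        blocks.append(text)


def walk(node, buf, blocks, width):
    """Depth-first walk, emitting text before and after each element's children."""
    if isinstance(node, str):
        buf.append(node)
        return
    if node.tag in SKIP:
        return
    is_block = node.tag in BLOCK
    if is_block:
        flush(buf, blocks, width)
    before, after = MARKS.get(node.tag, ("", ""))
    buf.append(before)
    for child in node.children:
        walk(child, buf, blocks, width)
    buf.append(after)
    if is_block:
        flush(buf, blocks, width)


def html_to_text(html, width=72):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    buf, blocks = [], []
    walk(builder.root, buf, blocks, width)
    flush(buf, blocks, width)
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(html_to_text("<p>Hello <i>brave</i> new <b>world</b>.<p>Second"), end="")
```

Blocks are separated by a single empty line here; any rule about how many
empty lines go between chapters or other blocks would have to be added to the
mapping.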
Ok. Now put that into code, runnable on an Ubuntu box, and give it to
the WWers to evaluate.
...
...
The generated text must be ready to post, e.g. word wrap, PG header, PG
footer, lines between chapters, etc. all must be there and adhere to the PG
standard. There must be no post-generation edits required.
Do you want me to send you the code (based on the 2005? Tidy code base)? It
doesn't spit out the PG garbage text, but that could easily be added. I can't
say that it adheres to the PG standard, because I am still unaware that there
/is/ any PG standard, but if you were to tell me explicitly what /you/ think
the standard is I could tell you whether it satisfies it.
Take a hundred random samples from the archive and pipe the HTML file
through your device and see if something very close to the posted txt file
comes out. (You may safely ignore where the lines break, but not the
number of empty lines between blocks.)
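
One way such a comparison could be scripted is sketched below. It is an
assumption about what "very close" might mean, not an agreed test: each text
is reduced to a list of (words, empty lines that follow) pairs, so line breaks
inside a block are ignored while the count of empty lines between blocks is
kept. The names block_signature and similarity are made up for the example.

```python
import difflib


def block_signature(text):
    """Reduce text to [(block words joined by spaces, empty lines after), ...]."""
    sig, words, blanks = [], [], 0
    for line in text.splitlines():
        if line.strip():
            if blanks and words:
                # A new block starts: close the previous one with its gap size.
                sig.append((" ".join(words), blanks))
                words = []
            blanks = 0
            words.extend(line.split())
        else:
            blanks += 1
    if words:
        sig.append((" ".join(words), 0))  # trailing empty lines are ignored
    return sig


def similarity(generated, posted):
    """Score from 0.0 to 1.0; 1.0 means same blocks with the same gaps."""
    a, b = block_signature(generated), block_signature(posted)
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


if __name__ == "__main__":
    print(similarity("one\ntwo\n\nthree\n", "one two\n\nthree\n"))
```

Run over a sample of HTML/txt pairs, the scores would show which books come
out close to the posted file and which do not.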


-- 
Marcello Perathoner
webmaster@gutenberg.org