On 01/24/2012 12:25 AM, Lee Passey wrote:
On Mon, January 23, 2012 3:48 pm, Marcello Perathoner wrote:
Please show me how.
You do it the same way you do TEI, except you map the tags differently. We know, for example, that XHTML can easily be converted to TEI, so from there it seems the process could be the same.
What I did was to load the HTML into an in-memory DOM (best to use a tag-soup parser, because you can't guarantee well-formedness of the input file [unless you've pre-processed with Tidy]). Then walk the tree, spitting out appropriate text as you go. Some tags get translated output both before and after their children (e.g. <i> and <b>); other tags only need something before or after. If you want to be really careful about line lengths, buffer one word at a time, and then decide whether a newline needs to be added before the word.
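A minimal sketch of the tree walk described above, in Python rather than on the Tidy code base mentioned in this thread. It assumes the standard library's tolerant `html.parser` in place of a real tag-soup parser. The tag-to-text mapping (`_` around italics, `*` around bold, one empty line after each block) is a hypothetical choice for illustration, not a PG rule, and the names `Node`, `TreeBuilder`, `Writer`, `walk` and `html_to_text` are made up here.

```python
from html.parser import HTMLParser

# Hypothetical mappings, for illustration only.
INLINE = {"i": ("_", "_"), "em": ("_", "_"), "b": ("*", "*"), "strong": ("*", "*")}
BLOCKS = {"p", "div", "h1", "h2", "h3", "h4", "blockquote", "li"}
VOID = {"br", "hr", "img", "meta", "link", "input"}
SKIP = {"head", "script", "style"}

class Node:
    def __init__(self, tag, parent=None):
        self.tag, self.parent, self.children = tag, parent, []

class TreeBuilder(HTMLParser):
    """Builds a small in-memory tree; tolerates unclosed and stray tags."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)

class Writer:
    """Buffers one word at a time and decides where the newline goes."""
    def __init__(self, width):
        self.width, self.lines, self.line, self.word = width, [], "", ""

    def add_text(self, text):
        for ch in text:
            if ch.isspace():
                self.flush_word()
            else:
                self.word += ch

    def add_marker(self, marker):
        self.word += marker  # markers stick to the adjacent word

    def flush_word(self):
        if not self.word:
            return
        if self.line and len(self.line) + 1 + len(self.word) > self.width:
            self.lines.append(self.line)  # word does not fit: break first
            self.line = self.word
        else:
            self.line = (self.line + " " + self.word) if self.line else self.word
        self.word = ""

    def end_block(self, blank_lines=1):
        self.flush_word()
        if self.line:
            self.lines.append(self.line)
            self.lines.extend([""] * blank_lines)
            self.line = ""

def walk(node, out):
    if isinstance(node, str):
        out.add_text(node)
        return
    if node.tag in SKIP:
        return
    before, after = INLINE.get(node.tag, ("", ""))
    if node.tag in BLOCKS:
        out.end_block()  # finish any text pending before this block
    out.add_marker(before)
    for child in node.children:
        walk(child, out)
    out.add_marker(after)
    if node.tag in BLOCKS:
        out.end_block()

def html_to_text(html, width=72):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    out = Writer(width)
    walk(builder.root, out)
    out.end_block()
    while out.lines and out.lines[-1] == "":
        out.lines.pop()
    return "\n".join(out.lines)

print(html_to_text("<p>Hello <i>brave</i> new <b>world</b></p><p>Second paragraph", width=20))
```

The second `<p>` in the usage line is deliberately left unclosed to show that the tree builder copes with it. The sketch does not produce a PG header or footer, and the number of empty lines per block type would have to be adjusted to whatever the posted files actually use.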
Ok. Now put that into code, runnable on an Ubuntu box, and give it to the WWers to evaluate.
The generated text must be ready to post, e.g. word wrap, PG header, PG footer, lines between chapters, etc. all must be there and adhere to the PG standard. There must be no post-generation edits required.
Do you want me to send you the code (based on the 2005? Tidy code base)? It doesn't spit out the PG garbage text, but that could easily be added. I can't say that it adheres to the PG standard, because I am still unaware that there /is/ any PG standard, but if you were to tell me explicitly what /you/ think the standard is I could tell you whether it satisfies it.
Take a hundred random samples from the archive and pipe the HTML file thru your device and see if something very close to the posted txt file comes out. (You may safely ignore where the lines break, but not the number of empty lines between blocks.)
--
Marcello Perathoner
webmaster@gutenberg.org
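One possible way to run the comparison proposed above, sketched in Python. It assumes that "ignore where the lines break" means each block of consecutive non-empty lines can be collapsed to a single line, while every empty line is kept as its own entry so that the count between blocks still matters. The function names `normalize` and `similarity` are hypothetical, and the `difflib` ratio is only one rough way to measure "very close".

```python
import difflib

def normalize(text):
    """Collapse each block to one line; keep every empty line between blocks."""
    out, words = [], []
    for line in text.splitlines():
        if line.strip():
            words.extend(line.split())
        else:
            if words:
                out.append(" ".join(words))
                words = []
            out.append("")
    if words:
        out.append(" ".join(words))
    # Leading and trailing empty lines are dropped (an assumption of this sketch).
    while out and out[0] == "":
        out.pop(0)
    while out and out[-1] == "":
        out.pop()
    return out

def similarity(generated, posted):
    """1.0 means identical apart from where the lines break."""
    matcher = difflib.SequenceMatcher(
        None, normalize(generated), normalize(posted), autojunk=False
    )
    return matcher.ratio()

posted = "one two\nthree\n\nfour"
print(similarity("one\ntwo three\n\nfour", posted))    # differs only in wrapping
print(similarity("one two three\n\n\nfour", posted))   # one extra empty line
```

Running this over a sample of HTML/txt pairs from the archive would still need a decision on what ratio counts as close enough; the sketch does not pick one.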