Re: [gutvol-d] XML version of some books of PG (and other formats)

I'm curious to see if your script can handle tables. That is our current biggest bugaboo when it comes to transforming to PG TXT format.

Josh

----- Original Message -----
From: "Sebastien Blondeel" <blondeel@clipper.ens.fr>
To: gutvol-d@lists.pglaf.org
Subject: [gutvol-d] XML version of some books of PG (and other formats)
Date: Fri, 3 Dec 2004 05:28:39 +0100

Hello,
I hacked together some scripts that do the following:
RTF -> XML
RTF: from Word, using a (very) simple stylesheet: just paragraphs, 3 title levels, footnotes, and italics. Meta-information is in the properties of the document. My script can also extract images, if wanted.
XML: using a personal and simple DTD (embedded), probably easy to port to any more complete DTD, such as TEI
The RTF -> XML step is the hard part, and I am never quite sure it will not break if the Word file is weird.
From that XML, I then wrote other (proof-of-concept) scripts to produce:
XML -> PG TXT
XML -> (LaTeX) -> PDF, DVI, PS (with hyperlinks)
XML -> valid HTML 4.01 (probably useless)
XML -> XHTML 1.0 Strict with some CSS (embedded)
The programming is very defensive, so when all the transforms finish without error I am fairly confident the output is right.
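
To make the XML -> PG TXT step concrete, here is a minimal sketch in Python (not the actual scripts, which are not shown in this thread). The element names `book`, `title`, `p` and `i` are assumptions standing in for the personal DTD, and rendering italics as _underscores_ and wrapping at 72 columns are illustrative choices. In the defensive spirit described above, any unexpected element stops the transform instead of being silently skipped.

```python
import textwrap
import xml.etree.ElementTree as ET


def inline_text(elem):
    """Flatten an element to text, marking <i> spans with underscores."""
    parts = [elem.text or ""]
    for child in elem:
        inner = inline_text(child)
        parts.append(f"_{inner}_" if child.tag == "i" else inner)
        parts.append(child.tail or "")
    return "".join(parts)


def xml_to_txt(xml_source, width=72):
    """Convert a tiny book document (hypothetical DTD) to wrapped plain text."""
    root = ET.fromstring(xml_source)
    blocks = []
    for elem in root:
        # Collapse runs of whitespace left over from the XML layout.
        text = " ".join(inline_text(elem).split())
        if elem.tag == "title":
            blocks.append(text.upper())
        elif elem.tag == "p":
            blocks.append(textwrap.fill(text, width=width))
        else:
            # Defensive: fail loudly on anything the sketch does not know.
            raise ValueError(f"unexpected element: {elem.tag}")
    return "\n\n".join(blocks) + "\n"
```

For example, `xml_to_txt("<book><title>Un titre</title><p>Du <i>texte</i> ici.</p></book>")` yields an upper-cased title line, a blank line, and the paragraph with `_texte_` marked as italic.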
You can find examples of those formats at http://www.eleves.ens.fr/home/blondeel/ebooksgratuits/ (most of the books there don't have the meta-info properly set up, so don't worry too much about that).
My scripts also clean up small typography mistakes (they are specialized in French rules but can of course be taught anything else). They will be used to help the ebooksgratuits team give PG nicer formats (until now their Word macros could only produce PG TXT, which is not very sexy to read for the end user).
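
As an illustration of the kind of typography cleanup meant here, the following Python sketch applies one common French rule: a no-break space before `;`, `:`, `!`, `?` and inside guillemets. The choice of rule and the function name `fix_french_spacing` are assumptions for the example; the real scripts may cover different or additional rules.

```python
import re

NBSP = "\u00a0"  # no-break space


def fix_french_spacing(text):
    """Replace ordinary spaces around French high punctuation with NBSP."""
    # One or more plain spaces before ; : ! ? or a closing guillemet.
    text = re.sub(r" +([;:!?»])", NBSP + r"\1", text)
    # One or more plain spaces after an opening guillemet.
    text = re.sub(r"« +", "«" + NBSP, text)
    return text
```

Only spaces that are already present are converted, so a colon with no space before it (as in a URL) is left untouched.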
Regards,

On Fri, Dec 03, 2004 at 08:46:10AM -0500, Joshua Hutchinson wrote:
I'm curious to see if your script can handle tables. That is our current biggest bugaboo when it comes to transforming to PG TXT format.
My DTD doesn't mention them (yet?). It focuses mainly on the French books of the ebooksgratuits site. I guess it could very easily be merged into a more complete DTD (TEI, DocBook, whatever).

I have already written Perl (not XSLT!) translations of XML tables (DocBook, for example) to other formats for other projects:
HTML: easy
LaTeX: harder...
TXT: w3m -dump of the HTML version is usually good enough

I heard there are now Perl modules able to deal with XML and XSLT, so it should be even easier to take care of. The XSLT style of programming is not for me...

How complex are your tables and what do you need to do with them? Do you have an example (input, desired output, and constraints [API, language...] of the transformation)?
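
For simple tables, a direct table -> TXT rendering can be sketched in a few lines of Python. This is only an illustration, not the Perl code mentioned above: the function name `render_table`, the grid style, and the optional `widths` parameter (for setting column widths by hand) are all assumptions made for the example.

```python
import textwrap


def render_table(rows, widths=None):
    """Render rows (lists of strings) as a plain-text grid table."""
    ncols = len(rows[0])
    if any(len(r) != ncols for r in rows):
        raise ValueError("all rows must have the same number of cells")
    if widths is None:
        # Default: each column is as wide as its longest cell (at least 1).
        widths = [max(1, max(len(r[c]) for r in rows)) for c in range(ncols)]
    rule = "+" + "+".join("-" * (w + 2) for w in widths) + "+"
    out = [rule]
    for row in rows:
        # Wrap each cell to its column width; an empty cell still takes one line.
        cells = [textwrap.wrap(cell, w) or [""] for cell, w in zip(row, widths)]
        height = max(len(c) for c in cells)
        for i in range(height):
            line = [(c[i] if i < len(c) else "").ljust(w)
                    for c, w in zip(cells, widths)]
            out.append("| " + " | ".join(line) + " |")
        out.append(rule)
    return "\n".join(out)
```

Calling `render_table(rows)` sizes the columns automatically, while `render_table(rows, widths=[20, 40])` forces the widths and wraps long cells over several lines.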

Joshua Hutchinson wrote:
I'm curious to see if your script can handle tables. That is our current biggest bugaboo when it comes to transforming to PG TXT format.
HTML, TXT and TEI versions of the 0.3 docs are up at:
http://www.gutenberg.org/tei/marcello/0.3/doc/

There are two tables in the docs, a small one and a bigger one that needs manual specifying of the column width.

--
Marcello Perathoner
webmaster@gutenberg.org

participants (3): Joshua Hutchinson, Marcello Perathoner, Sebastien Blondeel