Re: [gutvol-d] XML version of some books of PG (and other formats)

Anyone have a link to w3m in a windows executable (command line is fine, I just don't have access to a way to compile the source where I'm at right now)? This definitely looks interesting.

Josh

----- Original Message -----
From: "Sebastien Blondeel" <blondeel@clipper.ens.fr>
To: "Project Gutenberg Volunteer Discussion" <gutvol-d@lists.pglaf.org>
Subject: Re: [gutvol-d] XML version of some books of PG (and other formats)
Date: Fri, 3 Dec 2004 21:12:28 +0100
On Fri, Dec 03, 2004 at 12:07:31PM -0500, Joshua Hutchinson wrote:
> A web browser will generate line breaks within cells so that the table will end up looking very similar to the above. I haven't tried w3m ... will it handle the above scenario? I've tried lynx dumping to a
You should have. Yes it does:
-=-=-=
$ cat /tmp/toto.html
<table border="0">
<tr>
<td>Data label that is extremely long and is broken up accordingly over multiple lines.</td>
<td>Now we have a column of data that is also very long and broken up over multiple lines.</td>
<td>Now we have a column of data that is also very long and broken up over multiple lines.</td>
</tr>
</table>

$ w3m -cols 72 -dump /tmp/toto.html
Data label that is      Now we have a column of Now we have a column of
extremely long and is   data that is also very  data that is also very
broken up accordingly   long and broken up over long and broken up over
over multiple lines.    multiple lines.         multiple lines.

$ w3m -cols 48 -dump /tmp/toto.html
Data label that Now we have a   Now we have a
is extremely    column of data  column of data
long and is     that is also    that is also
broken up       very long and   very long and
accordingly     broken up over  broken up over
over multiple   multiple lines. multiple lines.
lines.
-=-=-=
Note: in the 72-column version I don't know why there is an extra space between columns 1 and 2. It is probably a w3m bug; it was already there in the baseball example. I guess this is easy to detect and fix: use ``border=1'' and clean out the frames:
$ w3m -cols 76 -dump /tmp/toto.html
+--------------------------------------------------------------------------+
|Data label that is      |Now we have a column of |Now we have a column of |
|extremely long and is   |data that is also very  |data that is also very  |
|broken up accordingly   |long and broken up over |long and broken up over |
|over multiple lines.    |multiple lines.         |multiple lines.         |
+--------------------------------------------------------------------------+
                      ^^^                        ^^
You can then detect those useless empty columns and remove them (or decide to keep 2 or 3 blanks between columns). Doing this without frames is more dangerous, because the columns are more difficult to detect:
$ w3m -cols 72 -dump /tmp/toto.html
Data la el that is      Now we have a column of Now we have a column of
extreme y long and is   data that is also very  data that is also very
brokenx p accordingly   long and broken up over long and broken up over
over mu tiple lines.    multiple lines.         multiple lines.
      ^^^ this is not a column break.
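The cleanup suggested above (dump with border=1, then strip the frames and choose the gap yourself) could be sketched roughly as follows in Python. This is only an illustration: the function name strip_frames and its gap parameter are made up, it assumes a single table in the dump, and it assumes that no cell text contains a literal `|`.

```python
def strip_frames(dump: str, gap: int = 2) -> str:
    """Turn a border=1 text dump into frameless columns `gap` blanks apart."""
    rows = []
    for line in dump.splitlines():
        line = line.strip()
        # Skip the +-----+ rules and anything that is not a framed row.
        if not (line.startswith("|") and line.endswith("|")):
            continue
        # Cells sit between the pipes; drop the padding the renderer added.
        rows.append([cell.strip() for cell in line[1:-1].split("|")])
    if not rows:
        return ""
    # Pad short rows so every row has the same number of cells.
    ncols = max(len(row) for row in rows)
    rows = [row + [""] * (ncols - len(row)) for row in rows]
    # Each column becomes exactly as wide as its longest cell.
    widths = [max(len(row[c]) for row in rows) for c in range(ncols)]
    sep = " " * gap
    lines = [
        sep.join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in rows
    ]
    return "\n".join(lines)
```

Feeding it the framed dump and printing `strip_frames(dump, gap=2)` gives every pair of columns the same two-blank gap, whatever padding the renderer had put inside the frames.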
> text file and IE/Mozilla dumping to a text, and they all fail miserably.
w3m is better than lynx for tables (and many other things: it is able to display images in console mode and inside xterms!). links is good too.
As for the example given in a later message:
-=-=-=
$ w3m -cols 72 -dump teioutput.html | head
150.    Ptolemy publishes his geography.
230.    The Peutinger Table pictures the Roman roads.
400-14. Fa-hien travels through and describes Afghanistan and India.
499.    Hoei-Sin said to have visited the kingdom of Fu-sang, 20,000
        furlongs east of China (identified by some with California).
518-21. Hoei-Sing and Sung-Yun visit and describe the Pamirs and the
        Punjab.
540.    Cosmas Indicopleustes visits India, and combats the sphericity
        of the globe.
629-46. Hiouen-Tshang travels through Turkestan, Afghanistan, India,
$ perl ~/work/PGDP/ebooksgratuits/Fmt.pl 72 teioutput.html | head
<table border="0">
<tr><td>150.</td><td>Ptolemy publishes his geography.</td></tr>
<tr><td>230.</td><td>The Peutinger Table pictures the Roman
roads.</td></tr>
<tr><td>400-14.</td><td>Fa-hien travels through and describes
Afghanistan and India.</td></tr>
<tr><td>499.</td><td>Hoei-Sin said to have visited the kingdom of
Fu-sang, 20,000 furlongs east of China (identified by some with
California).</td></tr>
<tr><td>518-21.</td><td>Hoei-Sing and Sung-Yun visit and describe the
-=-=-=
Note: sometimes the columns in w3m are, weirdly, unbalanced. I don't have an example at hand, but I guess you can help it along with percentage width attributes on the columns (if that is possible at all in TEI). I am also using an old version of w3m (Debian stable).
With links:
-=-=-=
$ links -dump teioutput.html | head
150.    Ptolemy publishes his geography.
230.    The Peutinger Table pictures the Roman roads.
400-14. Fa-hien travels through and describes Afghanistan and India.
499.    Hoei-Sin said to have visited the kingdom of Fu-sang, 20,000
        furlongs east of China (identified by some with California).
518-21. Hoei-Sing and Sung-Yun visit and describe the Pamirs and the
        Punjab.
540.    Cosmas Indicopleustes visits India, and combats the sphericity
        of the globe.
629-46. Hiouen-Tshang travels through Turkestan, Afghanistan, India,
$ links -dump toto.html
+------------------------------------------------------------------------+
| Data label that is    | Now we have a column of | Now we have a column |
| extremely long and is | data that is also very  | of data that is also |
| broken up accordingly | long and broken up over | very long and broken |
| over multiple lines.  | multiple lines.         | up over multiple     |
|                       |                         | lines.               |
+------------------------------------------------------------------------+
$ links -dump toto.html # without border
Data label that is         Now we have a column of Now we have a column of
extremely long and is      data that is also very  data that is also very
broken up accordingly over long and broken up over long and broken up over
multiple lines.            multiple lines.         multiple lines.
-=-=-=
_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/listinfo.cgi/gutvol-d

On Fri, Dec 03, 2004 at 03:36:36PM -0500, Joshua Hutchinson wrote:
> Anyone have a link to w3m in a windows executable (command line is fine, I just don't have access to a way to compile the source where I'm at right now)? This definitely looks interesting.
A friend of mine who knows Windows suggests using the following:

==== xml2txt.js ====
// Print every <p> of an XML file as plain text wrapped at 77 columns.
var x = new ActiveXObject("Msxml2.FreeThreadedDOMDocument");
x.async = false; // load synchronously so the document is ready below
x.load(WScript.Arguments(0));
var p = x.documentElement.selectNodes("//p");
for (var it = new Enumerator(p); !it.atEnd(); it.moveNext()) {
  // Collapse every run of whitespace into a single blank.
  var t = it.item().text.replace(/\s+/g, " ");
  if (t.charAt(0) == " ") t = t.slice(1);
  var l = t.length;
  var i = 0;
  var j = 0;
  while (i < l) {
    j = i + 77;
    if (j < l) {
      // Break at the last blank that keeps the line within 77 columns,
      j = t.lastIndexOf(" ", j);
      if (j < i) {
        // or at the next blank if a single word is longer than a line.
        j = t.indexOf(" ", i);
        if (j == -1) j = l;
      }
    } else {
      // The rest of the paragraph fits on one line.
      j = l;
    }
    WScript.Echo(t.slice(i, j));
    i = j + 1;
  }
  // Blank line between paragraphs.
  WScript.Echo("");
}
WScript.Quit(0);
====================

Run it with the following command:

-=-=-=
cscript //nologo xml2txt.js your_file.xml
-=-=-=

This assumes the paragraphs in the XML are marked up as <p></p> (cf. the selectNodes call).
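For anyone without Windows Script Host, roughly the same paragraph extraction and wrapping can be sketched in Python with only the standard library. This is an illustration rather than a drop-in replacement: the function name xml_to_text is made up, and it assumes un-namespaced <p> elements (like the //p query above), so a namespaced TEI file would need the tag name adjusted.

```python
import re
import textwrap
import xml.etree.ElementTree as ET

WIDTH = 77  # same line length as the JScript version


def xml_to_text(xml_source, width=WIDTH):
    """Return the text of every <p> element, wrapped, one blank line after each."""
    root = ET.fromstring(xml_source)
    blocks = []
    for p in root.iter("p"):
        # itertext() also picks up text nested inside child elements.
        text = re.sub(r"\s+", " ", "".join(p.itertext())).strip()
        # Never split inside a word, even one that is longer than a line.
        wrapped = textwrap.fill(
            text, width=width, break_long_words=False, break_on_hyphens=False
        )
        blocks.append(wrapped + "\n\n")
    return "".join(blocks)
```

To use it on a file, read the file in binary mode and pass the bytes to `xml_to_text`, then write the returned string to standard output or to a .txt file.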
participants (2)
- Joshua Hutchinson
- Sebastien Blondeel