Re: [gutvol-d] XML version of some books of PG (and other formats)

The problem I've always run into is where the table tries to grow beyond 80 characters wide. For instance, say that one row looks like this in the original book:

Data label that     Now we have a column    Now we have a column
is extremely long   of data that is also    of data that is also
and is broken up    very long and broken    very long and broken
accordingly over    up over multiple lines. up over multiple lines.
multiple lines.

Most automated text converters will put each cell on one line with no line breaks. A web browser will generate line breaks within cells so that the table will end up looking very similar to the above. I haven't tried w3m ... will it handle the above scenario? I've tried lynx dumping to a text file and IE/Mozilla dumping to a text, and they all fail miserably.

Josh

----- Original Message -----
From: "Sebastien Blondeel" <blondeel@clipper.ens.fr>
To: "Project Gutenberg Volunteer Discussion" <gutvol-d@lists.pglaf.org>
Subject: Re: [gutvol-d] XML version of some books of PG (and other formats)
Date: Fri, 3 Dec 2004 17:53:49 +0100
On Fri, Dec 03, 2004 at 10:52:25AM -0500, Joshua Hutchinson wrote:
> The hard part is getting the table info within PG text 80 column width.

It is not always possible of course.

> FYI, this table becomes this in TEI markup (NOTE: I made the second
> Club column just continue under the first for simplicity's sake):

That looks simple enough.
Change it to HTML:
-=-=-=
<table border="0">
<tr>
<td colspan="4" align="center">THE RECORD OF 1875.</td>
[...]
-=-=-=
then replace:
row -> tr
cell -> td
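The renaming step could be scripted. Here is a minimal sketch in Python (rather than Perl), assuming the TEI table uses plain <row> and <cell> elements. The function name tei_table_to_html is made up for illustration, and any TEI-specific attributes are passed through untranslated:

```python
import re

# Only these two renames come from the message; anything else is left alone.
TAG_MAP = {"row": "tr", "cell": "td"}

# Matches opening and closing <row>/<cell> tags, with optional attributes.
TAG_PATTERN = re.compile(r"<(/?)(row|cell)\b([^>]*)>")


def tei_table_to_html(tei):
    """Rename TEI <row>/<cell> tags to HTML <tr>/<td> (hypothetical helper)."""
    def rename(match):
        slash, name, rest = match.group(1), match.group(2), match.group(3)
        return "<" + slash + TAG_MAP[name] + rest + ">"
    return TAG_PATTERN.sub(rename, tei)


sample = "<table><row><cell>Boston</cell><cell>71</cell></row></table>"
print(tei_table_to_html(sample))
```

A regular expression is enough here because the change is a pure tag rename. A real conversion would probably want an XML parser.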
then "w3m -dump table.html" gives:
$ w3m -dump table.html
      THE RECORD OF 1875.
Club.           Won. Lost. P.C.
Boston          71   8     .809
Athletic        55   28    .756
Hartford        54   28    .639
St. Louis       29   39    .574
Philadelphia    37   31    .544
Chicago         30   37    .448
Mutual          29   38    .426
St. Louis Reds  4    14    .222
Washington      4    22    .156
New Haven       7    39    .152
Centennial      2    13    .133
Western         1    12    .077
Atlantic        2    42    .065
(the star after St. Louis has disappeared).
If you need it embedded in a program I can try to code the algorithm, depending on the programming language you want (Perl should be easy).
Then you can detect that cells containing only numbers should be right-aligned, etc.
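As a rough illustration of that idea, here is a sketch in Python rather than Perl. It wraps each cell to a fixed column width and right-aligns cells that contain only a number. The names is_numeric and render_row, the column widths, and the definition of "numeric" are all assumptions made for this example, not part of any existing tool:

```python
import re
import textwrap

# Assumed definition of "just numbers": optional sign, digits, optional decimals.
NUMERIC = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)$")


def is_numeric(cell):
    """Return True if the cell holds nothing but a number such as 71 or .809."""
    return bool(NUMERIC.match(cell.strip()))


def render_row(cells, widths, gap=1):
    """Wrap each cell to its column width and return the row as text lines."""
    wrapped = [textwrap.wrap(c, w) or [""] for c, w in zip(cells, widths)]
    height = max(len(lines) for lines in wrapped)
    out = []
    for i in range(height):
        parts = []
        for cell, lines, width in zip(cells, wrapped, widths):
            text = lines[i] if i < len(lines) else ""
            # Numeric cells are right-aligned, everything else left-aligned.
            parts.append(text.rjust(width) if is_numeric(cell) else text.ljust(width))
        out.append((" " * gap).join(parts).rstrip())
    return out


for line in render_row(["Boston", "71", "8", ".809"], [14, 4, 5, 4]):
    print(line)
```

The widths are passed in by hand here. Choosing them automatically so that the whole table fits in 80 columns is the harder part, and the sketch does not attempt it.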
It should also be easy to translate this to LaTeX for PDF/DVI/PS output.
_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/listinfo.cgi/gutvol-d

On Fri, Dec 03, 2004 at 12:07:31PM -0500, Joshua Hutchinson wrote:
> A web browser will generate line breaks within cells so that the table will end up looking very similar to the above. I haven't tried w3m ... will it handle the above scenario? I've tried lynx dumping to a
You should have. Yes it does:
-=-=-=
$ cat /tmp/toto.html
<table border="0">
<tr>
<td>Data label that is extremely long and is broken up accordingly over multiple lines.</td>
<td>Now we have a column of data that is also very long and broken up over multiple lines.</td>
<td>Now we have a column of data that is also very long and broken up over multiple lines.</td>
</tr>
</table>
$ w3m -cols 72 -dump /tmp/toto.html
Data label that is     Now we have a column of Now we have a column of
extremely long and is  data that is also very  data that is also very
broken up accordingly  long and broken up over long and broken up over
over multiple lines.   multiple lines.         multiple lines.
$ w3m -cols 48 -dump /tmp/toto.html
Data label that Now we have a   Now we have a
is extremely    column of data  column of data
long and is     that is also    that is also
broken up       very long and   very long and
accordingly     broken up over  broken up over
over multiple   multiple lines. multiple lines.
lines.
-=-=-=
Note: for the 72-column version I don't know why there is an extra space between columns 1 and 2. It is probably a bug in w3m: it was already there in the baseball example. This is easy to detect and fix, I guess: use ``border=1'' and clean out the frames:
$ w3m -cols 76 -dump /tmp/toto.html
+-------------------------------------------------------------------------+
|Data label that is     |Now we have a column of |Now we have a column of |
|extremely long and is  |data that is also very  |data that is also very  |
|broken up accordingly  |long and broken up over |long and broken up over |
|over multiple lines.   |multiple lines.         |multiple lines.         |
+-------------------------------------------------------------------------+
                      ^^^                       ^^
You can detect those useless empty columns and remove them (or decide to have 2 or 3 blanks between columns).
Doing this without frames is more dangerous, and the columns are more difficult to detect:
$ w3m -cols 72 -dump /tmp/toto.html
Data la el that is     Now we have a column of Now we have a column of
extreme y long and is  data that is also very  data that is also very
brokenx p accordingly  long and broken up over long and broken up over
over mu tiple lines.   multiple lines.         multiple lines.
      ^^^ this is not a column break.
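The frame-cleaning idea can be sketched as follows, in Python rather than Perl. This assumes the dump uses ASCII frames ("+---+" rules and "|" bars) as in the border=1 output, and that no cell contains a literal "|". The function name strip_frames and the gap parameter are made up for illustration:

```python
def strip_frames(dump, gap=1):
    """Remove table frames and rejoin columns with `gap` blanks between them."""
    rows = []
    for line in dump.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the +-----+ rules and any blank lines
        # Drop the outer bars, split on the inner ones, trim the padding.
        inner = line[1:-1] if line.endswith("|") else line[1:]
        rows.append([cell.strip() for cell in inner.split("|")])
    if not rows:
        return []
    ncols = max(len(r) for r in rows)
    rows = [r + [""] * (ncols - len(r)) for r in rows]
    # Each column is exactly as wide as its longest cell, so the useless
    # trailing blanks inside the frames disappear.
    widths = [max(len(r[i]) for r in rows) for i in range(ncols)]
    sep = " " * gap
    return [sep.join(c.ljust(w) for c, w in zip(r, widths)).rstrip() for r in rows]


framed = "+--------+\n|ab   |c  |\n|d    |ef |\n+--------+"
print("\n".join(strip_frames(framed, gap=2)))
```

Because the bars mark the column boundaries explicitly, a blank that happens to line up across rows inside a cell is never mistaken for a column break.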
> text file and IE/Mozilla dumping to a text, and they all fail miserably.
w3m is better than lynx for tables (and many other things: it is able to display images in console mode and inside xterms!). links is good too.

As for the example given in a later message:
-=-=-=
$ w3m -cols 72 -dump teioutput.html | head
150.    Ptolemy publishes his geography.
230.    The Peutinger Table pictures the Roman roads.
400-14. Fa-hien travels through and describes Afghanistan and India.
499.    Hoei-Sin said to have visited the kingdom of Fu-sang, 20,000
        furlongs east of China (identified by some with California).
518-21. Hoei-Sing and Sung-Yun visit and describe the Pamirs and the
        Punjab.
540.    Cosmas Indicopleustes visits India, and combats the sphericity
        of the globe.
629-46. Hiouen-Tshang travels through Turkestan, Afghanistan, India,
$ perl ~/work/PGDP/ebooksgratuits/Fmt.pl 72 teioutput.html | head
<table border="0">
<tr><td>150.</td><td>Ptolemy publishes his geography.</td></tr>
<tr><td>230.</td><td>The Peutinger Table pictures the Roman
roads.</td></tr>
<tr><td>400-14.</td><td>Fa-hien travels through and describes
Afghanistan and India.</td></tr>
<tr><td>499.</td><td>Hoei-Sin said to have visited the kingdom of
Fu-sang, 20,000 furlongs east of China (identified by some with
California).</td></tr>
<tr><td>518-21.</td><td>Hoei-Sing and Sung-Yun visit and describe the
-=-=-=
Note: sometimes the columns in w3m are, weirdly, unbalanced. I don't have an example right here but I guess you can help with percentage-width attributes in the columns (if that is possible at all in TEI). I am using an old version of w3m, too (Debian stable).

With links:
-=-=-=
$ links -dump teioutput.html | head
150.    Ptolemy publishes his geography.
230.    The Peutinger Table pictures the Roman roads.
400-14. Fa-hien travels through and describes Afghanistan and India.
499.    Hoei-Sin said to have visited the kingdom of Fu-sang, 20,000
        furlongs east of China (identified by some with California).
518-21. Hoei-Sing and Sung-Yun visit and describe the Pamirs and the
        Punjab.
540.    Cosmas Indicopleustes visits India, and combats the sphericity
        of the globe.
629-46. Hiouen-Tshang travels through Turkestan, Afghanistan, India,
$ links -dump toto.html
+------------------------------------------------------------------------+
| Data label that is    | Now we have a column of | Now we have a column |
| extremely long and is | data that is also very  | of data that is also |
| broken up accordingly | long and broken up over | very long and broken |
| over multiple lines.  | multiple lines.         | up over multiple     |
|                       |                         | lines.               |
+------------------------------------------------------------------------+
$ links -dump toto.html # without border
Data label that is         Now we have a column of Now we have a column of
extremely long and is      data that is also very  data that is also very
broken up accordingly over long and broken up over long and broken up over
multiple lines.            multiple lines.         multiple lines.
-=-=-=