
On 4/16/2010 1:03 PM, Michael McDermott wrote: [snip]
One of the last two would, of course, do in a pinch, but I was wondering whether anyone else here had any ideas/recipes on how to automatically or mostly-automatically typeset a PG etext for printing.
As bowerbird is unfailingly quick to point out, automatic processing of any document file relies on the file being regularized in such a way that any transformation you wish to make is unambiguously identifiable. If it is not possible to unambiguously identify a transformation you wish to make, the file must include unambiguously identifiable meta-information (information that is not part of the primary data) that identifies the transformation (this kind of meta-information is commonly known as "markup"). Project Gutenberg requires no textual regularization of any kind for its impoverished text files, and therefore these files are extremely difficult to transform automatically. Of course, some conventions have evolved, some of which are used more regularly than others. Thus, if you are content with the italicization of text set off by underscores (_) you will probably be successful with this transformation more than 90% of the time. On the other hand, if you want to start chapters on a new page, you will probably be successful with that transformation less than 50% of the time. The degree of success you have will depend to a large extent on the degree of transformation you want to achieve; if you are content to simply print the file as is, changing only the font face (don't try to change the font size, or you will run into reflowing problems), you can probably achieve 99+% success. If you want to make a PG file look like an ordinary paperback, your success rate will certainly be less than 50%. (This is, of course, assuming you are using "off-the-shelf" tools. If you're comfortable with scripting languages you could no doubt do better.) Your degree of success will also depend on the age of the PG file you want to transform. As time has gone on, and conventions have evolved, later texts are more "regular" than earlier texts; good luck converting _Pride and Prejudice_.
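For anyone who wants to try the scripting route, here is a rough Python sketch of the two transformations mentioned above: underscores to italics, and chapters starting on a new page. It assumes, hypothetically, that chapter headings are single lines beginning with "CHAPTER" and that paragraphs are separated by blank lines; plenty of PG files follow neither convention, so treat it as a starting point and not a recipe. The function names are mine.

```python
import html
import re

# Assumed convention: _text_ marks italics. Underscores inside a word
# (e.g. snake_case) are left alone.
ITALIC = re.compile(r"(?<!\w)_(?=\S)(.+?)(?<=\S)_(?!\w)", re.DOTALL)

# Assumed convention: a chapter heading is a single line starting with
# the word "chapter" (any case) followed by a number or numeral.
CHAPTER = re.compile(r"^\s*chapter\s+\S+.*$", re.IGNORECASE)


def italicize(text):
    """Wrap underscore-delimited runs in <i> tags."""
    return ITALIC.sub(r"<i>\1</i>", text)


def text_to_html(text):
    """Turn blank-line-separated blocks into <p> or chapter <h2> elements."""
    out = []
    for block in re.split(r"\n\s*\n", text.strip()):
        # Escape &, < and > before adding our own tags.
        body = italicize(html.escape(block, quote=False))
        if "\n" not in block and CHAPTER.match(block):
            # Ask the renderer to start each chapter on a new page.
            out.append('<h2 style="page-break-before: always">%s</h2>' % body)
        else:
            out.append("<p>%s</p>" % body)
    return "\n".join(out)
```

Calling text_to_html() on the body of an etext gives an HTML fragment that still needs a surrounding <html>/<head>/<body> wrapper before it goes to an HTML-to-PDF tool. Expect to hand-check the chapter detection on every book.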
You will probably have the most success by using the HTML version of a file, when it can be found (I do not believe that the majority of texts at Project Gutenberg are yet available in HTML versions); this is because while PG HTML texts are still not completely consistent in their use of markup, they are probably /more/ consistent than the impoverished text files. I am assuming you used html2ps or html2pdf version 2.0.43 available from http://www.tufat.com/s_html2ps_html2pdf.htm, and that you have completely read the documentation (BTW, I have not). According to the website, html2pdf almost completely supports CSS version 2, and the media parameter values of CSS3. Were it I (and it will not be, because I am completely happy reading HTML on my mobile device, and because I find PDF to be the one format which is actually worse than PG impoverished text format) I would find a CSS style sheet that has most of the features I like, then use it with the PG HTML files and html2pdf. The resulting PDF can then be printed using Acrobat Reader or equivalent (if you are committed to the destruction of the environment). I suspect that html2pdf will not consume a style sheet unless it is referenced by the HTML document itself, and DP/PG has been highly resistant to the notion of adding a reference to a generic style sheet in every HTML file, so you will probably have to edit each HTML file to add "<link href="pgstd.css" type="text/css" rel="stylesheet" />" to its <head> section, but I would think that would fall under the category of "semi-automated." If you cannot find an HTML version of the text you want (be sure to look outside of PG, as there are many other sources) you might want to try bowerbird's ZML2HTML converter; I suspect it may work about 75% of the time to get basic HTML out of PG impoverished text. FWIW, the style sheet I typically use for reading HTML files can be found at http://www.ebookcooperative.com/ebook.css.
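The "semi-automated" edit of each HTML file could be scripted along these lines. This is only a sketch: it assumes each file has a literal </head> tag, that pgstd.css sits next to the HTML files, and that reading and writing the files as Latin-1 is acceptable (that encoding round-trips every byte unchanged, whatever the file's real encoding is). The function names are my own, and I have not checked how html2pdf treats the result.

```python
import re
from pathlib import Path

# The tag from the message above; pgstd.css is assumed to live beside the files.
LINK_TAG = '<link href="pgstd.css" type="text/css" rel="stylesheet" />'
HEAD_CLOSE = re.compile(r"</head\s*>", re.IGNORECASE)


def add_stylesheet_link(html_text, link_tag=LINK_TAG):
    """Insert link_tag just before </head>, unless it is already present."""
    if link_tag in html_text:
        return html_text
    match = HEAD_CLOSE.search(html_text)
    if match is None:
        # No <head> section found: leave the file for manual editing.
        return html_text
    pos = match.start()
    return html_text[:pos] + link_tag + "\n" + html_text[pos:]


def process_directory(directory):
    """Patch every .htm/.html file in directory; return the names changed."""
    changed = []
    for path in sorted(Path(directory).glob("*.htm*")):
        # newline="" keeps the file's original line endings intact.
        with open(path, encoding="latin-1", newline="") as f:
            original = f.read()
        patched = add_stylesheet_link(original)
        if patched != original:
            with open(path, "w", encoding="latin-1", newline="") as f:
                f.write(patched)
            changed.append(path.name)
    return changed
```

Running process_directory() on a folder of downloaded PG HTML files (work on copies) returns the list of files it changed; anything missing from that list either already had the link or has no </head> tag to anchor on.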