[gutvol-d] Removing spurious break lines

24 Apr 2010

Hi,

Some books, like "Don Quijote" (http://www.gutenberg.org/etext/2000),
have spurious line breaks all over the text. From what I understand, PG
generates all the derived formats from the HTML, if there is one, or
from the raw text format otherwise.

In this case there is an HTML version, but it also contains the
spurious line breaks. My guess is that the HTML was automatically
generated from the text, and the text wraps its lines at about 79-80
characters.

Are there guidelines on how to format the raw text to make it more
amenable to automatic conversion to other formats by the PG tools? Is
it OK to reformat this text by removing the spurious line breaks in the
raw text?
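
To make the question concrete, the reformatting I have in mind is
roughly the following. This is only a sketch in Python, not anything
taken from the PG tools: the function name unwrap_paragraphs is made
up, and it assumes that blank lines separate paragraphs and that no
line break inside a paragraph is meaningful (which would be wrong for
verse or tables).

```python
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines; blank lines still separate paragraphs.

    Assumption: every line break inside a paragraph is spurious.
    """
    # Normalise line endings in case the file uses CRLF.
    text = text.replace("\r\n", "\n").strip()
    # One or more blank (or whitespace-only) lines mark a paragraph boundary.
    paragraphs = re.split(r"\n\s*\n", text)
    # Inside a paragraph, replace each line break with a single space.
    joined = [re.sub(r"[ \t]*\n[ \t]*", " ", p.strip()) for p in paragraphs]
    return "\n\n".join(joined)


if __name__ == "__main__":
    sample = "En un lugar de la Mancha,\nde cuyo nombre\n\nno quiero acordarme\n"
    print(unwrap_paragraphs(sample))
```

Anything that relies on its line breaks (poems, indented blocks) would
need to be left out of this pass, which is part of why I am asking
about guidelines first.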

Was the HTML automatically generated, or do I also have to fix the HTML?

How can I check the results in other formats before sending the fix to PG?

Also, are the conversion tools open source?

Cheers,

-- 
Joaquin Cuenca Abela