Removing spurious line breaks

Hi,

some books, like "Don Quijote" (http://www.gutenberg.org/etext/2000), have spurious line breaks all over the text. As I understand it, PG generates all the derived formats from the HTML if there is one, or from the raw text format otherwise.

In this case there is an HTML version, but it also contains the spurious line breaks. My guess is that the HTML was automatically generated from the text, and the text breaks lines at about 79-80 characters.

Are there guidelines on how to format the raw text to make it more amenable to automatic conversion to other formats by the PG tools? Is it OK to reformat this text by removing the spurious line breaks from the raw text?

Was the HTML automatically generated? Or do I also have to fix the HTML?

How can I check the results in other formats before sending them to PG?

Also, are the conversion tools open source?

Cheers,
-- Joaquin Cuenca Abela

Regarding "Don Quijote": the page claims the HTML was generated manually. I have generated an improved HTML version with a Python script and added a few manual fixes (such as some extra headers). The trickiest part was accurately identifying verses. The original text is inconsistent about where it breaks lines (although most of the text wraps at 75 characters).

How can I submit the modified HTML?

Thanks,

On Sat, Apr 24, 2010 at 11:18 AM, Joaquin Cuenca Abela <e98cuenc@gmail.com> wrote:
> Hi,
>
> some books, like "Don Quijote" (http://www.gutenberg.org/etext/2000), have spurious line breaks all over the text. As I understand it, PG generates all the derived formats from the HTML if there is one, or from the raw text format otherwise.
>
> In this case there is an HTML version, but it also contains the spurious line breaks. My guess is that the HTML was automatically generated from the text, and the text breaks lines at about 79-80 characters.
>
> Are there guidelines on how to format the raw text to make it more amenable to automatic conversion to other formats by the PG tools? Is it OK to reformat this text by removing the spurious line breaks from the raw text?
>
> Was the HTML automatically generated? Or do I also have to fix the HTML?
>
> How can I check the results in other formats before sending them to PG?
>
> Also, are the conversion tools open source?
>
> Cheers,
> -- Joaquin Cuenca Abela
-- Joaquin Cuenca Abela
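
For anyone attempting something similar, here is a minimal sketch of the unwrapping step. It is not the script mentioned in the message above, which was not posted. It assumes paragraphs are separated by blank lines, and it guesses that a block is verse when every line except the last is short. The names `unwrap` and `is_verse` and the threshold `VERSE_MAX_LEN` are hypothetical and would need tuning against the real text.

```python
import re

# Hypothetical threshold: prose wrapped at about 75 characters rarely has
# non-final lines this short, so shorter lines are guessed to be verse.
VERSE_MAX_LEN = 55


def is_verse(lines):
    """Guess that a block is verse if every line except the last is short."""
    if len(lines) < 2:
        return False
    return all(len(line) <= VERSE_MAX_LEN for line in lines[:-1])


def unwrap(text):
    """Join hard-wrapped prose lines; keep the line breaks of verse blocks."""
    # Assumption: one or more blank lines separate paragraphs.
    blocks = re.split(r"\n\s*\n", text.strip())
    result = []
    for block in blocks:
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        separator = "\n" if is_verse(lines) else " "
        result.append(separator.join(lines))
    return "\n\n".join(result)
```

A heuristic like this will misclassify some blocks (short dialogue lines, for example), so the output still needs a manual pass, as described above.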