[gutvol-p] Re: a question about the format about plain text ebooks
Al Haines (shaw)
ajhaines at shaw.ca
Mon Nov 2 14:10:49 PST 2009
Hello, Anya - the delimiter lines you mention have been the standard for PG
ebooks for some time now, especially those with an etext number of 10000 or
greater. Ebooks numbered below that may not use those lines consistently
enough for your purposes.
Note that the delimiter lines may or may not have a space between the
leading and trailing asterisks and the text between them, i.e. you may see
either
***START OF THE PROJECT GUTENBERG EBOOK THE LIFE OF REASON***
***END OF THE PROJECT GUTENBERG EBOOK THE LIFE OF REASON***
or
*** START OF THE PROJECT GUTENBERG EBOOK THE LIFE OF REASON ***
*** END OF THE PROJECT GUTENBERG EBOOK THE LIFE OF REASON ***
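A rough Python sketch of how both spacings could be handled with a single pattern follows. The names MARKER and extract_body are made up for illustration, and the optional "THE" is an assumption based on the variant quoted in the original question below; older files may need looser matching still.

```python
import re

# Assumed marker shape: "***", optional spaces, START or END,
# "OF [THE] PROJECT GUTENBERG EBOOK", an optional title, optional spaces, "***".
MARKER = re.compile(
    r"^\*\*\*[ \t]*(START|END) OF (?:THE )?PROJECT GUTENBERG EBOOK\b(.*?)[ \t]*\*\*\*[ \t]*$",
    re.MULTILINE,
)

def extract_body(text):
    """Return the text between the first START line and the next END line.

    Returns None if either delimiter is missing, so the caller can set
    the file aside for manual inspection instead of guessing.
    """
    start = end = None
    for m in MARKER.finditer(text):
        if m.group(1) == "START" and start is None:
            start = m.end()      # body begins right after the START line
        elif m.group(1) == "END" and start is not None:
            end = m.start()      # body stops right before the END line
            break
    if start is None or end is None:
        return None
    return text[start:end].strip()
```

Returning None rather than the whole file is a deliberate choice here: a file without recognizable delimiters is probably one of the older, less standardized ones and is better handled separately.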
Other things to note, depending on how you plan to harvest sample texts:
If an "etext" is actually a collection of MP3 files, only the readme.txt
file accompanying them will have the above delimiters, and its contents will
consist mainly of a list of the MP3 files, which may not be useful to your
If you randomly take texts from PG, most of them will be English, since
that's the most common language represented in PG, but you'll also get an
assortment of many other languages, for most, if not all, of which there may
not be US-ASCII versions.
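If you only want files that really are US-ASCII, one simple way to screen them is to try decoding the raw bytes, roughly as below. This is a minimal sketch; is_us_ascii is a hypothetical helper name, and it says nothing about which language the text is in.

```python
def is_us_ascii(data: bytes) -> bool:
    """Return True if every byte in data is a valid US-ASCII character."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True
```

Files that fail this check could be skipped or routed to a separate pass, depending on what your analysis needs.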
Other readers of this forum will undoubtedly have further
----- Original Message -----
From: Anya Kazantseva
To: gutvol-p at lists.pglaf.org
Sent: Monday, November 02, 2009 1:16 PM
Subject: [gutvol-p] a question about the format about plain text ebooks
I am doing some text analysis on a large subset of Project Gutenberg etexts
in us-ascii plain text format. I need to extract only the actual etext body
from each etext file; in other words, I need to be able to cut off any legal
fine print, notices to potential volunteers and information about donations.
I looked randomly at several files and it looks like *** START OF THE
PROJECT GUTENBERG EBOOK*** and “END OF PROJECT GUTENBERG EBOOK”
delimit the parts I need. My question is: can I count on these markers
appearing in each and every text? Or are there other delimiters and/or tags?
Thanks to anyone who can help.