[gutvol-p] Re: a question about the format about plain text ebooks

Al Haines (shaw) ajhaines at shaw.ca
Mon Nov 2 14:10:49 PST 2009


Hello, Anya - the delimiter lines you mention have been the standard for PG 
ebooks for some time now, especially those with an etext number of 10000 or 
greater.  In earlier etexts, those lines may not be standardized enough for 
your purposes.


Note that the delimiter lines may or may not have a space between the 
leading and trailing asterisks and the text between them, i.e. you may see 
this:

***START OF THE PROJECT GUTENBERG EBOOK THE LIFE OF REASON***
***END OF THE PROJECT GUTENBERG EBOOK THE LIFE OF REASON***

or this:

*** START OF THE PROJECT GUTENBERG EBOOK THE LIFE OF REASON ***
*** END OF THE PROJECT GUTENBERG EBOOK THE LIFE OF REASON ***
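
For what it's worth, here is a minimal Python sketch of how both variants 
above could be matched when cutting out the body. The function name 
extract_body and the two patterns are my own illustration, not an official PG 
tool, and they assume the exact "START OF THE"/"END OF THE" wording shown 
above; older files may use other wording and would simply return None here.

```python
import re

# Assumed patterns: three asterisks, an optional space, the START/END wording
# shown above, an optional title, then three closing asterisks.
START_RE = re.compile(
    r"^\*{3}[ \t]*START OF THE PROJECT GUTENBERG EBOOK\b.*?\*{3}[ \t\r]*$",
    re.MULTILINE,
)
END_RE = re.compile(
    r"^\*{3}[ \t]*END OF THE PROJECT GUTENBERG EBOOK\b.*?\*{3}[ \t\r]*$",
    re.MULTILINE,
)


def extract_body(text):
    """Return the text between the delimiter lines, or None if either is missing."""
    start = START_RE.search(text)
    if start is None:
        return None
    # Only look for the END line after the START line.
    end = END_RE.search(text, start.end())
    if end is None:
        return None
    return text[start.end():end.start()].strip("\r\n")
```

Returning None rather than guessing makes it easy to set aside the files 
that do not follow the standard and inspect them by hand.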


Other things to note, depending on how you plan to harvest sample texts:

If an "etext" is actually a collection of MP3 files, only the readme.txt 
file accompanying them will have the above delimiters, and its contents will 
consist mainly of a list of the MP3 files, which may not be useful to your 
analysis.

If you randomly take texts from PG, most of them will be English, since 
that's the most common language represented in PG, but you'll also get an 
assortment of other languages, and most, if not all, of those texts may not 
have US-ASCII versions.
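
If it helps, a file's bytes can be checked for US-ASCII before it goes into 
the sample. This is only a sketch; the helper name is_us_ascii is made up for 
illustration, and it says nothing about the language of the text, only about 
its encoding.

```python
def is_us_ascii(data: bytes) -> bool:
    """Return True if every byte is in the 7-bit US-ASCII range."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True
```

Reading each file in binary mode (open(path, "rb")) and passing the bytes to 
this helper would let you skip the files that are not plain US-ASCII.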


Other readers of this forum will undoubtedly have further 
comments/suggestions.

Al Haines
Project Gutenberg


----- Original Message ----- 
From: Anya Kazantseva
To: gutvol-p at lists.pglaf.org
Sent: Monday, November 02, 2009 1:16 PM
Subject: [gutvol-p] a question about the format about plain text ebooks


Hi everybody,

I am doing some text analysis on a large subset of Project Gutenberg etexts 
in us-ascii plain text format. I need to extract only the actual etext body 
from each etext file; in other words, I need to be able to cut off any legal 
fine print, notices to potential volunteers and information about donations. 
I looked randomly at several files and it looks like *** START OF THE 
PROJECT GUTENBERG EBOOK*** and “END OF PROJECT GUTENBERG EBOOK” 
delimit the parts I need. My question is: can I count on these markers 
appearing in each and every text? Or are there other delimiters and/or tags?

Thanks to anyone who can help.

Regards,

Anna




_______________________________________________
gutvol-p mailing list
gutvol-p at lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-p 




