A question about the format of plain text ebooks

2 Nov 2009

Hi everybody,

I am doing some text analysis on a large subset of Project Gutenberg etexts
in US-ASCII plain text format. I need to extract only the actual etext body
from each etext file; in other words, I need to be able to cut off any legal
fine print, notices to potential volunteers and information about donations.
I looked at several files at random, and it looks like "*** START OF THE
PROJECT GUTENBERG EBOOK***" and "END OF PROJECT GUTENBERG EBOOK"
delimit the parts I need. My question is: can I count on these markers
appearing in each and every text? Or are there other delimiters and/or tags?
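
To make the question concrete, here is a minimal sketch of the kind of extraction I have in mind. It assumes the two markers look like the ones quoted above, give or take spacing and a "THE"/"THIS" variation; the patterns and the function name extract_body are only illustrative guesses, not an official Project Gutenberg convention.

```python
import re

# Assumed marker patterns, based only on the two marker lines quoted above.
# Variable spacing and "THE"/"THIS" are tolerated as a guess, not a guarantee.
START_RE = re.compile(
    r"^.*START\s+OF\s+(?:THE|THIS)\s+PROJECT\s+GUTENBERG\s+EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^.*END\s+OF\s+(?:(?:THE|THIS)\s+)?PROJECT\s+GUTENBERG\s+EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_body(text):
    """Return the text between the start and end marker lines.

    Returns None when either marker is missing, so that files which do
    not follow the assumed format can be set aside and checked by hand.
    """
    start = START_RE.search(text)
    if start is None:
        return None
    # Look for the end marker only after the start marker line.
    end = END_RE.search(text, start.end())
    if end is None:
        return None
    return text[start.end():end.start()].strip()
```

Returning None instead of guessing means any file without both markers gets flagged, which is exactly the case I am unsure about.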

Thanks to anyone who can help.

Regards,

Anna

Anya Kazantseva
