Hi everybody,

I am doing some text analysis on a large subset of Project Gutenberg etexts in US-ASCII plain-text format. I need to extract only the actual body of each etext file; in other words, I need to cut off the legal fine print, the notices to potential volunteers, and the information about donations. I spot-checked several files at random, and it looks like *** START OF THE PROJECT GUTENBERG EBOOK*** and “END OF PROJECT GUTENBERG EBOOK” delimit the part I need. My question is: can I count on these markers appearing in each and every text, or are there other delimiters and/or tags?
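
To make the question concrete, here is a minimal Python sketch of this kind of extraction. The function name extract_body and both regular expressions are assumptions for illustration, based only on the two marker strings quoted above (with an optional "THE" or "THIS" allowed); they are not an official list of Project Gutenberg delimiters. A file that lacks either marker yields None, so it can be set aside and checked by hand instead of being silently mangled.

```python
import re

# Assumed marker patterns: each matches a whole line containing the marker
# text, tolerating surrounding asterisks and an optional "THE"/"THIS".
# These are guesses from a few sample files, not a confirmed specification.
START_RE = re.compile(r"^.*START OF (?:THE |THIS )?PROJECT GUTENBERG.*$", re.MULTILINE)
END_RE = re.compile(r"^.*END OF (?:THE |THIS )?PROJECT GUTENBERG.*$", re.MULTILINE)


def extract_body(text):
    """Return the text between the start and end marker lines.

    Returns None when either marker is missing, so the caller can
    flag the file for manual inspection.
    """
    start = START_RE.search(text)
    if start is None:
        return None
    # Look for the end marker only after the start marker line.
    end = END_RE.search(text, start.end())
    if end is None:
        return None
    return text[start.end():end.start()].strip()


sample = (
    "Legal fine print\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK***\n"
    "Chapter 1\n"
    "It was a dark night.\n"
    "END OF PROJECT GUTENBERG EBOOK\n"
    "Information about donations\n"
)
body = extract_body(sample)
missing = extract_body("A file with no markers at all.\n")
```

Counting how many files come back as None over the whole subset would also give a rough empirical answer to how consistently the markers appear.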

Thanks to anyone who can help.

Regards,

Anna