2 Nov
2009
9:16 p.m.
Hi everybody,

I am doing some text analysis on a large subset of Project Gutenberg etexts in US-ASCII plain-text format. I need to extract only the actual etext body from each file; in other words, I need to cut off any legal fine print, notices to potential volunteers, and information about donations.

I looked at several files at random, and it looks like *** START OF THE PROJECT GUTENBERG EBOOK*** and “END OF PROJECT GUTENBERG EBOOK” delimit the part I need.

My question is: can I count on these markers appearing in each and every text? Or are there other delimiters and/or tags?

Thanks to anyone who can help.

Regards,
Anna
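For concreteness, here is a minimal Python sketch of the extraction described above. It is only an illustration under stated assumptions: the function name `extract_body` and both regular expressions are hypothetical, and the patterns are deliberately loose (optional "THE"/"THIS", optional trailing title text) because it is not yet known how consistent the markers are. It does not assume the markers appear in every file; it returns `None` when either one is missing, so those files can be listed and checked by hand.

```python
import re

# Assumed marker patterns (not confirmed for the whole collection):
# tolerant of case, of an optional "THE"/"THIS", and of trailing text.
START_RE = re.compile(
    r"^\*{3}\s*START OF (?:THE |THIS )?PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^.*END OF (?:THE |THIS )?PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_body(text):
    """Return the text between the two marker lines, or None if either is missing."""
    start = START_RE.search(text)
    if start is None:
        return None
    # Look for the end marker only after the start marker line.
    end = END_RE.search(text, start.end())
    if end is None:
        return None
    return text[start.end():end.start()].strip()
```

Running this over every file and collecting the names for which it returns `None` would show directly which texts do not follow the two-marker layout.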