[gutvol-p] a question about the format of plain text ebooks
mmacondo at gmail.com
Mon Nov 2 13:16:42 PST 2009
I am doing some text analysis on a large subset of Project Gutenberg etexts
in US-ASCII plain text format. I need to extract only the actual etext body
from each etext file; in other words, I need to be able to cut off any legal
fine print, notices to potential volunteers and information about donations.
I looked at several files at random, and it looks like "*** START OF THE
PROJECT GUTENBERG EBOOK***" and "END OF PROJECT GUTENBERG EBOOK"
delimit the parts I need. My question is: can I count on these markers
appearing in each and every text? Or are there other delimiters and/or tags?
Thanks to anyone who can help.
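
For reference, the extraction described above could be sketched in Python roughly as follows. This is only a sketch under assumptions: the two patterns are based solely on the marker spellings quoted above, loosened slightly for case and an optional "THE"/"THIS", and the names START_RE, END_RE and extract_body are hypothetical. Since it is not yet known whether every file carries these markers, the function returns None when either one is missing, so such files can be set aside for manual inspection.

```python
import re

# Assumed marker patterns, based only on the two spellings quoted above.
# Real files may use other variants; adjust after inspecting more etexts.
START_RE = re.compile(
    r"^\s*\*{3}\s*START OF (?:THE |THIS )?PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\s*(?:\*{3}\s*)?END OF (?:THE |THIS )?PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_body(text):
    """Return the text between the start and end markers.

    Returns None if either marker is missing, so the caller can
    flag the file instead of silently keeping the fine print.
    """
    start = START_RE.search(text)
    if start is None:
        return None
    # Look for the end marker only after the start marker line.
    end = END_RE.search(text, start.end())
    if end is None:
        return None
    return text[start.end():end.start()].strip()


if __name__ == "__main__":
    sample = (
        "Legal header\n"
        "*** START OF THE PROJECT GUTENBERG EBOOK FOO ***\n"
        "\n"
        "Chapter 1\n"
        "Text here.\n"
        "\n"
        "END OF PROJECT GUTENBERG EBOOK FOO\n"
        "Donation information\n"
    )
    print(extract_body(sample))
```

Running this over the whole subset and counting the None results would give a quick measure of how many files do not fit the assumed markers.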