[gutvol-p] a question about the format of plain text ebooks

Anya Kazantseva mmacondo at gmail.com
Mon Nov 2 13:16:42 PST 2009


Hi everybody,

I am doing some text analysis on a large subset of Project Gutenberg etexts
in us-ascii plain text format. I need to extract only the actual etext body
from each etext file; in other words, I need to be able to cut off any legal
fine print, notices to potential volunteers and information about donations.
I looked at several files at random, and it looks like the lines
"*** START OF THE PROJECT GUTENBERG EBOOK***" and
"END OF PROJECT GUTENBERG EBOOK" delimit the parts I need. My question is:
can I count on these markers appearing in each and every text? Or are there
other delimiters and/or tags?
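
For illustration, here is a minimal Python sketch of the extraction described
above. It assumes the two markers quoted in this message each sit on a line of
their own, and it tolerates some variation in spacing, asterisks and the
article ("THE", "THIS" or none); that tolerance is a guess, not a documented
format. The name extract_body is hypothetical. Since it is not yet known
whether every file carries these markers, the function returns None when
either one is missing, so such files can be set aside and inspected by hand.

```python
import re

# Patterns modelled on the two markers quoted in the message. The allowed
# variations (spacing, optional THE/THIS, optional asterisks on the END line)
# are assumptions, not a specification of the Project Gutenberg format.
START_RE = re.compile(
    r"^\*{3}[ \t]*START[ \t]+OF[ \t]+(?:(?:THE|THIS)[ \t]+)?"
    r"PROJECT[ \t]+GUTENBERG[ \t]+EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\**[ \t]*END[ \t]+OF[ \t]+(?:(?:THE|THIS)[ \t]+)?"
    r"PROJECT[ \t]+GUTENBERG[ \t]+EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_body(text):
    """Return the text between the START and END marker lines, or None."""
    start = START_RE.search(text)
    if start is None:
        return None  # no START marker: leave the file for manual review
    # Look for the END marker only after the START marker line.
    end = END_RE.search(text, start.end())
    if end is None:
        return None  # no END marker: leave the file for manual review
    return text[start.end():end.start()].strip()
```

Counting how many files in the subset come back as None would give a rough
answer to the question of how reliable the markers are.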

Thanks to anyone who can help.

Regards,

Anna