[gutvol-p] Re: a question about the format about plain text ebooks
mmacondo at gmail.com
Tue Nov 3 08:38:22 PST 2009
Thanks a lot to both of you. I was worried that the delimiters were not cast
in stone and, alas, you confirm that... But I think if I use your
suggestions I should be relatively OK, since I am using only plain-text
ASCII ebooks, in English and on certain topics (as specified in the
On Mon, Nov 2, 2009 at 5:56 PM, <Bowerbird at aol.com> wrote:
> anna said:
> > can I count on these markers
> > appearing in each and every text?
> > Or are there other delimiters and/or tags?
> you can count on them appearing, except when they don't.
> but if you search for lines with three (or more) asterisks,
> which have the word "start" and "gutenberg" (uppercase),
> you're likely to have very good results across the corpus.
> however, you will then want to screen the first paragraph
> after that, for words like "distributed" and/or "produced"
> (and a handful of others that i don't remember right now),
> so as to get rid of the production-credit for the book.
> there are also some huge (literally) gotchas that you will
> _not_ want to download, including the human genome files
> and an innocuous e-text (whose number i have repressed)
> that includes the entire library up to that point. (i dunno
> whose idea that was, but i wish somebody else would have
> been smart enough to shoot it down as a very stupid idea.)
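
For anyone who wants to try the heuristic quoted above, here is a minimal sketch in Python. It is only one reading of the advice, not an official Project Gutenberg routine: the function names (find_start_marker, strip_header) and the CREDIT_WORDS tuple are made up for illustration, and the word list holds only the two words named in the reply, since the "handful of others" is not spelled out. It also assumes paragraphs are separated by blank lines.

```python
import re

# Only the two words named in the quoted reply; the "handful of others"
# is not listed there, so this tuple is an assumption to extend by hand.
CREDIT_WORDS = ("distributed", "produced")


def find_start_marker(lines):
    """Return the index of the first line that has three or more asterisks
    plus the uppercase words START and GUTENBERG, or None if none matches."""
    for i, line in enumerate(lines):
        if re.search(r"\*{3,}", line) and "START" in line and "GUTENBERG" in line:
            return i
    return None


def strip_header(text):
    """Drop everything up to the start marker, then drop the first paragraph
    after it if it looks like a production credit. If no marker is found,
    the text is returned unchanged."""
    lines = text.splitlines()
    start = find_start_marker(lines)
    if start is None:
        return text
    body = "\n".join(lines[start + 1:]).strip("\n")
    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", body)
    if paragraphs and any(w in paragraphs[0].lower() for w in CREDIT_WORDS):
        paragraphs = paragraphs[1:]
    return "\n\n".join(paragraphs)


if __name__ == "__main__":
    sample = (
        "Some licence text\n"
        "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
        "\n"
        "Produced by A. Volunteer\n"
        "\n"
        "Chapter 1\n"
        "It was a day.\n"
    )
    print(strip_header(sample))
```

Since the markers appear "except when they don't", it is probably worth logging every file where find_start_marker returns None and checking those by hand rather than trusting the fallback.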