Thanks a lot to both of you. I was worried that the delimiters are not cast in stone and, alas, you confirm that... But I think if I use your suggestions I should be relatively OK, since I am using only plain-text ASCII ebooks, only in English, and only on certain topics (as specified in the catalogue).
Cheers,
Anna
anna said:
> can I count on these markers
> appearing in each and every text?
> Or are there other delimiters and/or tags?
you can count on them appearing, except when they don't.
but if you search for lines with three (or more) asterisks,
which also contain the words "START" and "GUTENBERG" (in uppercase),
you're likely to have very good results across the corpus.
however, you will then want to screen the first paragraph
after that, for words like "distributed" and/or "produced"
(and a handful of others that i don't remember right now),
so as to get rid of the production-credit for the book.
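A minimal sketch of that search in Python might look like the following. It assumes the asterisks are consecutive, that paragraphs are separated by blank lines, and that the credit paragraph comes first after the marker. The names `is_start_marker`, `strip_header` and `CREDIT_WORDS` are made up for illustration, and the word list holds only the two words mentioned above, so it would need extending.

```python
import re

# Only "distributed" and "produced" come from the advice above;
# any further words are for the user to add (assumption: the list is incomplete).
CREDIT_WORDS = ("distributed", "produced")


def is_start_marker(line):
    """True for a line with three or more asterisks plus uppercase START and GUTENBERG."""
    return (
        re.search(r"\*{3,}", line) is not None
        and "START" in line
        and "GUTENBERG" in line
    )


def strip_header(text):
    """Return the text after the start marker, minus a leading credit paragraph.

    Returns None when no marker line is found, so the file can be checked by hand.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if is_start_marker(line):
            body = "\n".join(lines[i + 1:]).strip()
            break
    else:
        return None

    # Assumption: paragraphs are separated by one or more blank lines.
    parts = re.split(r"\n\s*\n", body, maxsplit=1)
    first_paragraph = parts[0].lower()
    if any(word in first_paragraph for word in CREDIT_WORDS):
        # Drop the production credit and keep whatever follows it.
        return parts[1].strip() if len(parts) > 1 else ""
    return body
```

Note that a genuine opening paragraph containing one of the screening words would be dropped as well, so it is probably worth logging what gets removed and spot-checking a few files.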
there are also some huge (literally) gotchas that you will
_not_ want to download, including the human genome files
and an innocuous e-text (whose number i have repressed)
that includes the entire library up to that point. (i dunno
whose idea that was, but i wish somebody else would have
been smart enough to shoot it down as a very stupid idea.)
-bowerbird
_______________________________________________
gutvol-p mailing list
gutvol-p@lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-p