anna said:
> can I count on these markers
> appearing in each and every text?
> Or are there other delimiters and/or tags?

you can count on them appearing, except when they don't.

but if you search for lines with three (or more) asterisks, which have the words "start" and "gutenberg" (uppercase), you're likely to have very good results across the corpus.

however, you will then want to screen the first paragraph after that, for words like "distributed" and/or "produced" (and a handful of others that i don't remember right now), so as to get rid of the production credit for the book.

there are also some huge (literally) gotchas that you will _not_ want to download, including the human genome files and an innocuous e-text (whose number i have repressed) that includes the entire library up to that point. (i dunno whose idea that was, but i wish somebody else would have been smart enough to shoot it down as a very stupid idea.)

-bowerbird
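
A minimal Python sketch of the search described above, offered as an illustration rather than a tested recipe for the whole corpus. The function names `is_start_marker` and `strip_header` are hypothetical. The sketch assumes that "three (or more) asterisks" means a count anywhere on the line, that paragraphs are separated by blank lines, and that only the two credit words named in the post are screened (the "handful of others" is left unfilled).

```python
import re

# Only the two words named in the post; the other credit words are unknown here.
CREDIT_WORDS = ("distributed", "produced")


def is_start_marker(line):
    """True if the line has 3+ asterisks and the uppercase words START and GUTENBERG."""
    return line.count("*") >= 3 and "START" in line and "GUTENBERG" in line


def strip_header(text):
    """Return the text after the first start marker, minus a leading credit paragraph.

    Returns None when no marker line is found (the "except when they don't" case),
    so the caller can set that file aside for manual inspection.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if is_start_marker(line):
            body = "\n".join(lines[i + 1:]).strip()
            break
    else:
        return None

    # Assumption: paragraphs are separated by one or more blank lines.
    paragraphs = re.split(r"\n\s*\n", body)

    # Screen only the first paragraph after the marker for credit words.
    if paragraphs and any(word in paragraphs[0].lower() for word in CREDIT_WORDS):
        paragraphs = paragraphs[1:]
    return "\n\n".join(paragraphs)
```

Returning `None` instead of guessing keeps the files without a marker visible, and checking the file size before downloading would be one way to steer clear of the oversized e-texts mentioned above.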