Re: a question about the format about plain text ebooks
anna said:
can I count on these markers appearing in each and every text? Or are there other delimiters and/or tags?
you can count on them appearing, except when they don't.
but if you search for lines with three (or more) asterisks, which have the word "start" and "gutenberg" (uppercase), you're likely to have very good results across the corpus.
however, you will then want to screen the first paragraph after that, for words like "distributed" and/or "produced" (and a handful of others that i don't remember right now), so as to get rid of the production-credit for the book.
there are also some huge (literally) gotchas that you will _not_ want to download, including the human genome files and an innocuous e-text (whose number i have repressed) that includes the entire library up to that point. (i dunno whose idea that was, but i wish somebody else would have been smart enough to shoot it down as a very stupid idea.)
-bowerbird
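The heuristic described above could be sketched roughly as follows. This is only an illustration in Python, not anyone's actual tooling: the function names (`is_start_marker`, `strip_header`) are made up for the example, the credit-word list is deliberately incomplete (the message itself notes there are "a handful of others"), and the paragraph splitting assumes blank-line-separated paragraphs.

```python
import re

# Assumption: a start marker is any line with three or more consecutive
# asterisks that also contains the uppercase words START and GUTENBERG.
ASTERISKS_RE = re.compile(r"\*{3,}")

# Assumption: an incomplete list of words that flag a production-credit
# paragraph; the thread mentions only these two by name.
CREDIT_WORDS = ("distributed", "produced")


def is_start_marker(line):
    """Return True if the line looks like a start-of-text marker."""
    return (
        bool(ASTERISKS_RE.search(line))
        and "START" in line
        and "GUTENBERG" in line
    )


def strip_header(text):
    """Return the text after the start marker, minus a leading credit
    paragraph, or None if no marker line is found."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if is_start_marker(line):
            body = "\n".join(lines[i + 1:]).strip()
            break
    else:
        # "you can count on them appearing, except when they don't"
        return None

    # Split on blank lines and screen only the first paragraph.
    paragraphs = re.split(r"\n\s*\n", body)
    if paragraphs and any(w in paragraphs[0].lower() for w in CREDIT_WORDS):
        paragraphs = paragraphs[1:]
    return "\n\n".join(paragraphs)
```

A `None` result is the cue to inspect that file by hand rather than trust the heuristic. The sketch does nothing about the oversized files mentioned above; those would have to be excluded before download.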
Thanks a lot to both of you. I was worried that the delimiters were not cast in stone and, alas, you confirm that... But I think that if I use your suggestions I should be relatively OK, since I am using only plain-text ASCII ebooks, only in English and only on certain topics (as specified in the catalogue).
Cheers,
Anna
On Mon, Nov 2, 2009 at 5:56 PM, <Bowerbird@aol.com> wrote:
anna said:
can I count on these markers appearing in each and every text? Or are there other delimiters and/or tags?
you can count on them appearing, except when they don't.
but if you search for lines with three (or more) asterisks, which have the word "start" and "gutenberg" (uppercase), you're likely to have very good results across the corpus.
however, you will then want to screen the first paragraph after that, for words like "distributed" and/or "produced" (and a handful of others that i don't remember right now), so as to get rid of the production-credit for the book.
there are also some huge (literally) gotchas that you will _not_ want to download, including the human genome files and an innocuous e-text (whose number i have repressed) that includes the entire library up to that point. (i dunno whose idea that was, but i wish somebody else would have been smart enough to shoot it down as a very stupid idea.)
-bowerbird
-- Anya...
You should also be aware of common Project-Gutenberg-isms that modify the author's (or at least the publishers') use of punctuation; that what was italic in the original text may be marked up using any of several different markers; and that the choice of spellings depends strongly on the date of original publication and on whether the original was published in England, the US, or Australia -- and/or whether that original's spelling was modified somewhere along the way. This assumes that you are doing some kind of linguistic analysis, so that you might care about these kinds of issues. "Plain Text ASCII" is actually very ill-defined.
participants (3)
- Anya Kazantseva
- Bowerbird@aol.com
- Jim Adcock