From mmacondo at gmail.com Mon Nov 2 13:16:42 2009
From: mmacondo at gmail.com (Anya Kazantseva)
Date: Mon, 2 Nov 2009 16:16:42 -0500
Subject: [gutvol-p] a question about the format about plain text ebooks
Message-ID:

Hi everybody,

I am doing some text analysis on a large subset of Project Gutenberg etexts in us-ascii plain text format. I need to extract only the actual etext body from each etext file; in other words, I need to be able to cut off any legal fine print, notices to potential volunteers and information about donations.

I looked randomly at several files and it looks like *** START OF THE PROJECT GUTENBERG EBOOK*** and "END OF PROJECT GUTENBERG EBOOK" delimit the parts I need. My question is: can I count on these markers appearing in each and every text? Or are there other delimiters and/or tags?

Thanks to anyone who can help.

Regards,
Anna

From ajhaines at shaw.ca Mon Nov 2 14:10:49 2009
From: ajhaines at shaw.ca (Al Haines (shaw))
Date: Mon, 2 Nov 2009 14:10:49 -0800
Subject: [gutvol-p] Re: a question about the format about plain text ebooks
References:
Message-ID: <918E6E9FF2BC44F08E647971BD20652F@alp2400>

Hello, Anya - the delimiter lines you mention have been the standard for PG ebooks for some time now, especially those with an etext number of 10000 or greater. Before that, there may not be sufficient standardization of those lines for your purposes.

Note that the delimiter lines may or may not have a space between the leading and trailing asterisks and the text between them, i.e.
you may see this:

***START OF THE PROJECT GUTENBERG EBOOK THE LIFE OF REASON***
***END OF THE PROJECT GUTENBERG EBOOK THE LIFE OF REASON***

or this:

*** START OF THE PROJECT GUTENBERG EBOOK THE LIFE OF REASON ***
*** END OF THE PROJECT GUTENBERG EBOOK THE LIFE OF REASON ***

Other things to note, depending on how you plan to harvest sample texts:

If an "etext" is actually a collection of MP3 files, only the readme.txt file accompanying them will have the above delimiters, and its contents will consist mainly of a list of the MP3 files, which may not be useful to your analysis.

If you randomly take texts from PG, most of them will be English, since that's the most common language represented in PG, but you'll also get an assortment of many other languages, for most, if not all, of which there may not be US-ASCII versions.

Other readers of this forum will undoubtedly have further comments/suggestions.

Al Haines
Project Gutenberg

_______________________________________________
gutvol-p mailing list
gutvol-p at lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-p

From Bowerbird at aol.com Mon Nov 2 14:56:13 2009
From: Bowerbird at aol.com (Bowerbird at aol.com)
Date: Mon, 2 Nov 2009 17:56:13 EST
Subject: [gutvol-p] Re: a question about the format about plain text ebooks
Message-ID:

anna said:
> can I count on these markers
> appearing in each and every text?
> Or are there other delimiters and/or tags?

you can count on them appearing, except when they don't.

but if you search for lines with three (or more) asterisks,
which have the word "start" and "gutenberg" (uppercase),
you're likely to have very good results across the corpus.

however, you will then want to screen the first paragraph
after that, for words like "distributed" and/or "produced"
(and a handful of others that i don't remember right now),
so as to get rid of the production-credit for the book.

there are also some huge (literally) gotchas that you will
_not_ want to download, including the human genome files
and an innocuous e-text (whose number i have repressed)
that includes the entire library up to that point. (i dunno
whose idea that was, but i wish somebody else would have
been smart enough to shoot it down as a very stupid idea.)

-bowerbird

From mmacondo at gmail.com Tue Nov 3 08:38:22 2009
From: mmacondo at gmail.com (Anya Kazantseva)
Date: Tue, 3 Nov 2009 11:38:22 -0500
Subject: [gutvol-p] Re: a question about the format about plain text ebooks
In-Reply-To:
References:
Message-ID:

Thanks a lot to both of you. I was worried that the delimiters are not cast in stone and, alas, you confirm that...
But I think if I use your suggestions I should be relatively OK, since I am using only plain-text ASCII ebooks, only in English, and only on certain topics (as specified in the catalogue).

Cheers,
Anna

--
Anya...

From jimad at msn.com Thu Nov 5 11:55:34 2009
From: jimad at msn.com (Jim Adcock)
Date: Thu, 5 Nov 2009 11:55:34 -0800
Subject: [gutvol-p] Re: a question about the format about plain text ebooks
In-Reply-To:
References:
Message-ID:

You should also be aware of common Project-Gutenberg-isms that modify the author's (or at least the publishers') use of punctuation, that what was italic in the original text may be marked up using one of a variety of differing markers, and that the choice of spellings depends strongly on the date of original publication, and on whether the original was published in England, the US, or Australia -- and/or whether the original's spelling was modified somewhere along the way. This assumes that you are doing some kind of linguistic analysis, so that you might care about these kinds of issues.

"Plain Text ASCII" is actually very ill-defined.
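
The marker heuristic suggested earlier in this thread (find lines with three or more asterisks that contain the uppercase words START and GUTENBERG, then screen the first paragraph for credit words) could be sketched in Python roughly as follows. This is only an illustration under assumptions: the function names extract_body and drop_credit_paragraph are hypothetical, "three or more asterisks" is read as three consecutive asterisks, the END marker is assumed to mirror the START marker, and the credit-word list holds only the two words named in the thread, which is known to be incomplete.

```python
import re

# Assumption: "three (or more) asterisks" means three consecutive asterisks.
_ASTERISKS = re.compile(r"\*{3,}")
# Only the two words named in the thread; the real list is longer.
_CREDIT_WORDS = ("distributed", "produced")


def _is_marker(line, keyword):
    """True if the line looks like a START/END delimiter line."""
    return (
        _ASTERISKS.search(line) is not None
        and re.search(r"\b" + keyword + r"\b", line) is not None
        and "GUTENBERG" in line
    )


def drop_credit_paragraph(body_lines):
    """Drop the first paragraph if it contains a production-credit word."""
    i = 0
    while i < len(body_lines) and not body_lines[i].strip():
        i += 1  # skip leading blank lines
    j = i
    while j < len(body_lines) and body_lines[j].strip():
        j += 1  # j ends just past the first paragraph
    first_paragraph = " ".join(body_lines[i:j]).lower()
    if any(word in first_paragraph for word in _CREDIT_WORDS):
        return body_lines[j:]
    return body_lines[i:]


def extract_body(text):
    """Return the text between the marker lines, or None if either is missing."""
    lines = text.splitlines()
    start = next((i for i, ln in enumerate(lines) if _is_marker(ln, "START")), None)
    if start is None:
        return None
    end = next(
        (i for i in range(start + 1, len(lines)) if _is_marker(lines[i], "END")),
        None,
    )
    if end is None:
        return None
    return "\n".join(drop_credit_paragraph(lines[start + 1:end])).strip()
```

Returning None when a marker is missing lets the caller set those files aside for manual inspection, which matters for the older etexts where the delimiter lines are not standardized. The credit-word screen can also misfire on a book whose real first paragraph happens to contain "produced" or "distributed", so its results are worth spot-checking.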