a question about the format about plain text ebooks
Hi everybody,

I am doing some text analysis on a large subset of Project Gutenberg etexts in us-ascii plain text format. I need to extract only the actual etext body from each etext file; in other words, I need to be able to cut off any legal fine print, notices to potential volunteers and information about donations.

I looked randomly at several files and it looks like *** START OF THE PROJECT GUTENBERG EBOOK*** and “END OF PROJECT GUTENBERG EBOOK” delimit the parts I need. My question is: can I count on these markers appearing in each and every text? Or are there other delimiters and/or tags?

Thanks to anyone who can help.

Regards,
Anna
Hello, Anya,

The delimiter lines you mention have been the standard for PG ebooks for some time now, especially those with an etext number of 10000 or greater. Before that, those lines may not be standardized enough for your purposes.

Note that the delimiter lines may or may not have a space between the leading and trailing asterisks and the text between them, i.e. you may see this:

***START OF THE PROJECT GUTENBERG EBOOK THE LIFE OF REASON***
***END OF THE PROJECT GUTENBERG EBOOK THE LIFE OF REASON***

or this:

*** START OF THE PROJECT GUTENBERG EBOOK THE LIFE OF REASON ***
*** END OF THE PROJECT GUTENBERG EBOOK THE LIFE OF REASON ***

Other things to note, depending on how you plan to harvest sample texts:

If an "etext" is actually a collection of MP3 files, only the readme.txt file accompanying them will have the above delimiters, and its contents will consist mainly of a list of the MP3 files, which may not be useful to your analysis.

If you randomly take texts from PG, most of them will be English, since that's the most common language represented in PG, but you'll also get an assortment of many other languages, most if not all of which may have no US-ASCII versions.

Other readers of this forum will undoubtedly have further comments/suggestions.

Al Haines
Project Gutenberg

----- Original Message -----
From: Anya Kazantseva
To: gutvol-p@lists.pglaf.org
Sent: Monday, November 02, 2009 1:16 PM
Subject: [gutvol-p] a question about the format about plain text ebooks
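
For anyone attempting the same extraction, the delimiter matching discussed in this thread could be sketched roughly as below. This is only an illustration, not an official PG tool: the function name extract_body and the two regular expressions are assumptions made for this example. The patterns allow optional spaces after the leading asterisks and an optional "THE", since both variations appear in the messages above. Files without both markers (likely among etexts numbered below 10000) return None so they can be set aside for manual review.

```python
import re

# Hypothetical patterns for this sketch: three asterisks, optional spaces,
# then the marker phrase, with "THE" treated as optional. The rest of the
# line (usually the title and trailing asterisks) is consumed by ".*".
START_RE = re.compile(
    r"^\*\*\*[ \t]*START OF (?:THE )?PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\*[ \t]*END OF (?:THE )?PROJECT GUTENBERG EBOOK.*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_body(text):
    """Return the text between the START and END marker lines,
    or None if either marker is missing."""
    start = START_RE.search(text)
    if start is None:
        return None
    # Look for the END marker only after the START marker line.
    end = END_RE.search(text, start.end())
    if end is None:
        return None
    return text[start.end():end.start()].strip()
```

A caller would read each file, pass its contents to extract_body, and log the filenames for which it returns None rather than guessing where the body begins.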