[gutvol-p] Re: a question about the format about plain text ebooks

Anya Kazantseva mmacondo at gmail.com
Tue Nov 3 08:38:22 PST 2009


Thanks a lot to both of you. I was worried that the delimiters were not cast
in stone and, alas, you confirm that... But I think if I use your
suggestions I should be relatively OK, since I am using only plain-text
ASCII ebooks, in English, and only on certain topics (as specified in the
catalogue).
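
A minimal sketch of the heuristic quoted below, in Python: find the first
line with three or more asterisks plus uppercase "START" and "GUTENBERG",
then drop the first paragraph after it if it looks like a production credit.
The function names and the CREDIT_WORDS list are only illustrative
assumptions. The advice names just "distributed" and "produced", so the
list is surely incomplete, and some texts will not match the marker at all.

```python
import re

# Assumed keyword list: only "distributed" and "produced" come from the
# quoted advice; other credit words would have to be added by hand.
CREDIT_WORDS = ("distributed", "produced")


def is_start_marker(line):
    """True if the line has 3+ consecutive asterisks, START and GUTENBERG."""
    has_stars = re.search(r"\*{3,}", line) is not None
    return has_stars and "START" in line and "GUTENBERG" in line


def strip_header(text):
    """Return the text after the start marker, minus a leading credit paragraph.

    If no marker line is found, the text is returned unchanged.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if is_start_marker(line):
            body = "\n".join(lines[i + 1:]).lstrip("\n")
            break
    else:
        return text

    # Split off the first paragraph (everything up to the first blank line).
    parts = re.split(r"\n\s*\n", body, maxsplit=1)
    if len(parts) == 2 and any(w in parts[0].lower() for w in CREDIT_WORDS):
        return parts[1].lstrip("\n")
    return body
```

Texts where no marker is found come back unchanged, so they can be set aside
and checked by hand rather than silently mangled.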

Cheers,

Anna


On Mon, Nov 2, 2009 at 5:56 PM, <Bowerbird at aol.com> wrote:

> anna said:
> >   can I count on these markers
> >   appearing in each and every text?
> >   Or are there other delimiters and/or tags?
>
> you can count on them appearing, except when they don't.
>
> but if you search for lines with three (or more) asterisks,
> which have the word "start" and "gutenberg" (uppercase),
> you're likely to have very good results across the corpus.
>
> however, you will then want to screen the first paragraph
> after that, for words like "distributed" and/or "produced"
> (and a handful of others that i don't remember right now),
> so as to get rid of the production-credit for the book.
>
> there are also some huge (literally) gotchas that you will
> _not_ want to download, including the human genome files
> and an innocuous e-text (whose number i have repressed)
> that includes the entire library up to that point.  (i dunno
> whose idea that was, but i wish somebody else would have
> been smart enough to shoot it down as a very stupid idea.)
>
> -bowerbird


-- 
Anya...