[gutvol-p] Re: a question about the format about plain text ebooks

Bowerbird at aol.com Bowerbird at aol.com
Mon Nov 2 14:56:13 PST 2009


anna said:
>    can I count on these markers
>    appearing in each and every text? 
>    Or are there other delimiters and/or tags?

you can count on them appearing, except when they don't.

but if you search for lines with three (or more) asterisks
that also contain the words "START" and "GUTENBERG" (uppercase),
you're likely to have very good results across the corpus.
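
a minimal python sketch of that search, assuming the text has
already been split into lines. the function names are made up
for illustration, and the test is only the heuristic described
above, not an official specification of the marker format.

```python
import re

def is_start_marker(line):
    """heuristic: 3+ asterisks in a row, plus the uppercase
    words START and GUTENBERG somewhere on the same line."""
    return (re.search(r"\*{3,}", line) is not None
            and "START" in line
            and "GUTENBERG" in line)

def find_start_marker(lines):
    """return the index of the first marker line, or None
    if no line matches (which does happen in the corpus)."""
    for i, line in enumerate(lines):
        if is_start_marker(line):
            return i
    return None
```

a caller should treat a None result as "check this file by hand"
rather than assume the whole file is book text.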

however, you will then want to screen the first paragraph
after that, for words like "distributed" and/or "produced"
(and a handful of others that i don't remember right now),
so as to get rid of the production-credit for the book.
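
the screening step could look something like the sketch below.
it assumes you pass in the lines that follow the start marker,
and it only knows the two words named above; the word list and
the function name are placeholders to extend against real files.

```python
# only the two words mentioned above; the list is an assumption
# and will need more entries to cover the whole corpus.
CREDIT_WORDS = ("distributed", "produced")

def drop_credit_paragraph(lines):
    """lines: the text lines that follow the start marker.
    drops the first paragraph if it looks like a production
    credit; otherwise returns the text from that paragraph on."""
    # skip blank lines before the first paragraph
    i = 0
    while i < len(lines) and not lines[i].strip():
        i += 1
    # find the end of the first paragraph (next blank line)
    j = i
    while j < len(lines) and lines[j].strip():
        j += 1
    paragraph = " ".join(lines[i:j]).lower()
    if any(word in paragraph for word in CREDIT_WORDS):
        return lines[j:]
    return lines[i:]
```

the match is case-insensitive and by substring, so it can give
false positives on a book whose real first paragraph happens to
use one of those words; spot-checking the dropped paragraphs is
a good idea.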

there are also some huge (literally) gotchas that you will
_not_ want to download, including the human genome files
and an innocuous e-text (whose number i have repressed)
that includes the entire library up to that point. (i dunno
whose idea that was, but i wish somebody else had been
smart enough to shoot it down as a very stupid idea.)

-bowerbird