[gutvol-p] Re: Programmatic fetching books from Gutenberg

John Mark Ockerbloom ockerblo at pobox.upenn.edu
Thu Jul 16 11:53:37 PDT 2009


(sending this time to the proper list address)

Prior to etext #10000, Gutenberg filenames were strings that
didn't have anything to do with the etext numbers (except that
the number indirectly determined which etext* directory files
would be in, since both the numbers and the etext* directories
were tied to release date).

As these early etexts get revised, they typically move into
the new numerically-based-filename system.  However, there are
still slightly more than 6000 Gutenberg etexts that use the old-style
filenames.

Information about the filenames, new and old, is included in
the RDF file.  For instance, if you are interested in etext #3167,
you'll find a metadata record for it early in the file in a
pgterms:etext element with ID "etext3167".  Later in the RDF file,
you'll see two pgterms:file elements that have an isFormatOf
relationship with the ID "etext3167".  Those elements in turn
specify information about the two files associated with that etext,
including name, MIME type, length, and last-modified date.
The name of the file, in particular, is in the rdf:about
attribute of the pgterms:file element.

In this case, you'll find that the two files associated with etext #3167
are in etext02/wsxpm10.txt and etext02/wsxpm10.zip (relative to the
top-level Gutenberg text directory).
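
As a rough illustration of the lookup just described, here is a minimal
Python sketch.  It runs against a small hand-written sample rather than
the real RDF file.  The pgterms namespace URI, the dcterms prefix on
isFormatOf, and the extra "etext9999" record are assumptions made up for
the example, so check them against the actual file before relying on this.
The code matches elements by local name only, so it does not depend on
the exact namespace URIs.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hand-written sample.  The pgterms namespace URI and the etext9999
# record are placeholders invented for this sketch.
SAMPLE_RDF = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:pgterms="http://example.org/pgterms#">
  <pgterms:etext rdf:ID="etext3167"/>
  <pgterms:file rdf:about="etext02/wsxpm10.txt">
    <dcterms:isFormatOf rdf:resource="#etext3167"/>
  </pgterms:file>
  <pgterms:file rdf:about="etext02/wsxpm10.zip">
    <dcterms:isFormatOf rdf:resource="#etext3167"/>
  </pgterms:file>
  <pgterms:file rdf:about="etext02/other10.txt">
    <dcterms:isFormatOf rdf:resource="#etext9999"/>
  </pgterms:file>
</rdf:RDF>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def files_for_etext(rdf_text, etext_number):
    """Return the rdf:about names of all files tied to one etext."""
    target = "#etext%d" % etext_number
    root = ET.fromstring(rdf_text)
    names = []
    for elem in root:
        if local_name(elem.tag) != "file":
            continue
        for child in elem:
            if (local_name(child.tag) == "isFormatOf"
                    and child.get("{%s}resource" % RDF_NS) == target):
                names.append(elem.get("{%s}about" % RDF_NS))
                break
    return names


print(files_for_etext(SAMPLE_RDF, 3167))
```

The returned names are relative paths, to be resolved against the
top-level Gutenberg text directory.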

If for some reason the RDF itself is too big for you
to handle easily, it looks like it's auto-generated, so you could probably
write a script in Perl or some other suitable text-crunching language
to extract only the information you're interested in, in some more compact
form.
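
For what it's worth, such an extraction script might look roughly like
the Python sketch below (Python here instead of Perl, purely as a
matter of taste).  It reads the RDF as a stream so the whole file never
has to be held as one tree, and emits one "etext ID, tab, file name"
line per file.  The sample data, the pgterms namespace URI, and the
function name compact_index are all assumptions for illustration, not
taken from the real file.

```python
import io
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hand-written sample standing in for the real RDF file; the pgterms
# namespace URI is a placeholder.
SAMPLE = b"""<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dcterms="http://purl.org/dc/terms/"
    xmlns:pgterms="http://example.org/pgterms#">
  <pgterms:etext rdf:ID="etext3167"/>
  <pgterms:file rdf:about="etext02/wsxpm10.txt">
    <dcterms:isFormatOf rdf:resource="#etext3167"/>
  </pgterms:file>
  <pgterms:file rdf:about="etext02/wsxpm10.zip">
    <dcterms:isFormatOf rdf:resource="#etext3167"/>
  </pgterms:file>
</rdf:RDF>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def compact_index(source):
    """Yield (etext_id, file_name) pairs from an RDF file or stream."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "file":
            continue
        name = elem.get("{%s}about" % RDF_NS)
        for child in elem:
            if local_name(child.tag) == "isFormatOf":
                ref = child.get("{%s}resource" % RDF_NS, "")
                yield ref.lstrip("#"), name
        # Discard this element's contents to keep memory use down.
        elem.clear()


lines = ["%s\t%s" % pair for pair in compact_index(io.BytesIO(SAMPLE))]
print("\n".join(lines))
```

To run it on a real file, pass a filename or an open binary file to
compact_index in place of the io.BytesIO wrapper.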

I have my own independent copy of some of this information, in my own
format, which I assembled before the RDF directory was made available.
But if I were to start over again, I'd probably just pull straight from
the RDF.  (And try to be more proactive about getting Gutenberg to
fix its metadata at various spots instead of just fixing it on my end.)

I hope this helps.

John


