[gutvol-p] Re: Programmatic fetching books from Gutenberg

Thu Jul 16 13:07:37 PDT 2009

Paulo Levi wrote:

> It appears that the format seems to follow a rule sort of
> #etext1802
> "&f;1/8/0/1802/1802.txt

&f; is an xml entity which is defined at the top of the rdf file. 
Retrieve the url &f;1/8/0/1802/1802.txt and you get the file.
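
To see the entity expansion in action, here is a minimal Python
sketch. The sample document, the element name "file", the attribute
name "about" and the base url http://example.org/dirs/ are all made up
for illustration; the real rdf file declares its own value for &f;
and uses its own element names.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of the catalog: the entity "f" is declared
# in the internal DTD subset, as it is at the top of the rdf file.
SAMPLE = """<?xml version="1.0"?>
<!DOCTYPE catalog [
  <!ENTITY f "http://example.org/dirs/">
]>
<catalog>
  <file about="&f;1/8/0/1802/1802.txt"/>
</catalog>
"""

root = ET.fromstring(SAMPLE)

# The parser has already replaced &f; with the declared value,
# so the attribute holds a complete url.
urls = [el.get("about") for el in root.iter("file")]
print(urls)
```

No string surgery on "&f;" is needed; the parser hands back the full
url.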

This is an xml file and you *really* should parse it with an xml parser. 
That will take care of all these problems for you. If you are concerned 
about memory, use a sax parser.
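
A rough sketch of the sax approach in Python, which streams the file
instead of building the whole tree in memory. As before, the element
name "file", the attribute name "about", the base url and the second
path are assumptions for the sake of the example, not the actual
layout of the rdf file.

```python
import xml.sax


class FileUrlHandler(xml.sax.ContentHandler):
    """Collects the url attribute of every <file> element seen."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def startElement(self, name, attrs):
        # "file" and "about" are hypothetical names; adjust them to
        # whatever the real rdf file uses.
        if name == "file":
            url = attrs.get("about")
            if url is not None:
                self.urls.append(url)


# Hypothetical sample input; entity "f" is declared in the DTD subset.
SAMPLE = b"""<?xml version="1.0"?>
<!DOCTYPE catalog [
  <!ENTITY f "http://example.org/dirs/">
]>
<catalog>
  <file about="&f;1/8/0/1802/1802.txt"/>
  <file about="&f;some/other/dir/example.txt"/>
</catalog>
"""

handler = FileUrlHandler()
xml.sax.parseString(SAMPLE, handler)

for url in handler.urls:
    print(url)
```

For a file on disk you would call xml.sax.parse(path, handler)
instead of parseString. Only the collected urls stay in memory, not
the document.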

> Thought it doesn't appear to be consistent. I saw something about a old indexing 
> scheme for files older than 10000. What is the scheme (can it be guessed from 
> the #number?) and is it going to disappear from the Gutenberg server ? Or are 
> you going to make redirects?

If you parse the rdf file the xml way you get the current url of the 
ebook files regardless of old/new directory scheme and any future file 
moves.