Paulo Levi wrote:
It appears that the format follows a rule, something like #etext1802 "&f;1/8/0/1802/1802.txt
&f; is an xml entity which is defined at the top of the rdf file. Expand the entity, retrieve the resulting url &f;1/8/0/1802/1802.txt, and you get the file. The rdf file is an xml file and you *really* should parse it with an xml parser. That will take care of all these problems for you, entity expansion included. If you are concerned about memory, use a sax parser.
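To illustrate, here is a minimal sketch of the sax approach in Python. The sample document, the element and attribute names (catalog, file, about, book) and the value of the f entity are made up for the example and are not the real layout of the rdf file; only the path 1/8/0/1802/1802.txt and the id #etext1802 come from the message above. The point is only that the parser expands &f; for you, so the handler sees a complete url.

```python
import xml.sax

# Hypothetical stand-in for the rdf file. The entity value and the
# element/attribute names are assumptions made for this sketch.
SAMPLE = b"""<?xml version="1.0"?>
<!DOCTYPE catalog [
  <!ENTITY f "http://example.org/files/">
]>
<catalog>
  <file about="&f;1/8/0/1802/1802.txt" book="#etext1802"/>
</catalog>
"""


class FileHandler(xml.sax.ContentHandler):
    """Collects the file url for each book id as elements stream past."""

    def __init__(self):
        super().__init__()
        self.urls = {}

    def startElement(self, name, attrs):
        # The parser has already replaced &f; with the entity's value here.
        if name == "file":
            self.urls[attrs.get("book")] = attrs.get("about")


handler = FileHandler()
xml.sax.parseString(SAMPLE, handler)
urls = handler.urls
print(urls["#etext1802"])
```

Because sax hands you one element at a time, memory use stays flat no matter how large the file is. For a file on disk, xml.sax.parse(filename, handler) works the same way.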
Though it doesn't appear to be consistent. I saw something about an old indexing scheme for files older than 10000. What is the scheme (can it be guessed from the #number?) and is it going to disappear from the Gutenberg server? Or are you going to make redirects?
If you parse the rdf file the xml way you get the current url of the ebook files regardless of old/new directory scheme and any future file moves.