[gutvol-p] Re: Programmatic fetching books from Gutenberg
Marcello Perathoner
marcello at perathoner.de
Thu Jul 16 13:07:37 PDT 2009
Paulo Levi wrote:
> It appears that the format follows a rule, something like
> #etext1802
> "&f;1/8/0/1802/1802.txt
&f; is an XML entity that is defined at the top of the RDF file.
Replace &f; with the entity's value, retrieve the resulting URL
(&f;1/8/0/1802/1802.txt after expansion), and you get the file.
This is an XML file and you *really* should parse it with an XML parser.
That will take care of all these problems for you. If you are concerned
about memory, use a SAX parser.
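
A minimal sketch of the SAX approach in Python, to show that the parser
expands &f; on its own. The sample document is a hypothetical, heavily
trimmed stand-in for the real RDF file: the entity value
(http://example.org/dirs/), the pgterms namespace URI, and the element
name pgterms:file are assumptions for illustration, as are the names
FileUrlCollector and collect_file_urls. Check them against the actual
file before relying on them.

```python
import xml.sax

# Hypothetical stand-in for the catalog file. The entity value, the
# pgterms namespace URI and the element name are assumptions.
SAMPLE_RDF = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rdf:RDF [
  <!ENTITY f "http://example.org/dirs/">
]>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:pgterms="http://example.org/pgterms#">
  <pgterms:file rdf:about="&f;1/8/0/1802/1802.txt"/>
</rdf:RDF>
"""


class FileUrlCollector(xml.sax.ContentHandler):
    """Collect the rdf:about attribute of every pgterms:file element."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def startElement(self, name, attrs):
        # Namespace processing is off by default, so element and
        # attribute names arrive as written, prefix included.
        if name == "pgterms:file":
            url = attrs.get("rdf:about")
            if url is not None:
                # The parser has already replaced &f; with its value.
                self.urls.append(url)


def collect_file_urls(rdf_bytes):
    """Stream through the document and return the expanded file URLs."""
    handler = FileUrlCollector()
    xml.sax.parseString(rdf_bytes, handler)
    return handler.urls


if __name__ == "__main__":
    for url in collect_file_urls(SAMPLE_RDF):
        print(url)
```

For a large file on disk, xml.sax.parse(path, handler) streams the
input instead of holding it all in memory.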
> Though it doesn't appear to be consistent. I saw something about an old indexing
> scheme for files older than 10000. What is the scheme (can it be guessed from
> the #number?) and is it going to disappear from the Gutenberg server? Or are
> you going to make redirects?
If you parse the RDF file with an XML parser, you get the current URL of
the ebook files regardless of the old or new directory scheme and of any
future file moves.
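
As a rough illustration, the lookup can be done by ebook number instead
of by guessing a path. The sketch below assumes that each file entry
points back to its ebook through a dcterms:isFormatOf element with an
rdf:resource of the form #etext1802. That structure, the pgterms
namespace URI, the entity value, and the second sample path are all
hypothetical. urls_for_etext is an illustrative name. ElementTree is
used for brevity and loads the whole document into memory, so a SAX
handler would be the better fit for a very large file.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCTERMS_NS = "http://purl.org/dc/terms/"
PGTERMS_NS = "http://example.org/pgterms#"  # placeholder, an assumption

# Hypothetical sample: one file in the numeric directory scheme and one
# with a made-up path standing in for an older layout.
SAMPLE_CATALOG = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rdf:RDF [
  <!ENTITY f "http://example.org/dirs/">
]>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:pgterms="http://example.org/pgterms#">
  <pgterms:file rdf:about="&f;1/8/0/1802/1802.txt">
    <dcterms:isFormatOf rdf:resource="#etext1802"/>
  </pgterms:file>
  <pgterms:file rdf:about="&f;old-scheme/example42.txt">
    <dcterms:isFormatOf rdf:resource="#etext42"/>
  </pgterms:file>
</rdf:RDF>
"""


def urls_for_etext(rdf_bytes, etext_id):
    """Return the file URLs whose isFormatOf points at '#<etext_id>'."""
    root = ET.fromstring(rdf_bytes)
    wanted = "#" + etext_id
    urls = []
    for file_el in root.iter("{%s}file" % PGTERMS_NS):
        ref = file_el.find("{%s}isFormatOf" % DCTERMS_NS)
        if ref is not None and ref.get("{%s}resource" % RDF_NS) == wanted:
            # The URL comes from the file itself, never from a
            # path rule derived from the ebook number.
            urls.append(file_el.get("{%s}about" % RDF_NS))
    return urls


if __name__ == "__main__":
    print(urls_for_etext(SAMPLE_CATALOG, "etext1802"))
    print(urls_for_etext(SAMPLE_CATALOG, "etext42"))
```

Because the URL is read from the file rather than computed, the caller
does not need to know which directory scheme a given ebook uses.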