On 10-05-26 16:22, Marcello Perathoner wrote:
catalog.rdf.bz2 gets updated every night.
Yes it does, but the information in there is very different from that in the individual files. I attach records from the catalogue and from the individual file. The biggest problem is that the URIs <http://www.gutenberg.org/feeds/catalog.rdf#etext27676> and <http://www.gutenberg.org/ebooks/27676> are completely different, and there's no way, other than hand coding special cases, to get from one to the other. So even if we were to initially populate with catalog.rdf.bz2, we couldn't then go and pull the detailed records. (Btw, content negotiation doesn't seem to work: curl -H "Accept: application/rdf+xml" http://www.gutenberg.org/ebooks/27676 gives a 406.)
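To illustrate the kind of hand-coded special case meant here, a minimal Python sketch follows. It assumes the two URI patterns can be read off the single example pair above; nothing confirms that they hold for every record, and the function name `catalog_to_ebook_uri` is made up for illustration.

```python
import re

# Assumption: both URI shapes are inferred from the one example pair
# (etext27676 / ebooks/27676); other records may not follow them.
CATALOG_RE = re.compile(
    r"^http://www\.gutenberg\.org/feeds/catalog\.rdf#etext(\d+)$"
)
EBOOK_PREFIX = "http://www.gutenberg.org/ebooks/"


def catalog_to_ebook_uri(catalog_uri):
    """Map a catalog.rdf URI to the per-book URI, or None if it does not match."""
    match = CATALOG_RE.match(catalog_uri)
    if match is None:
        return None
    # Reuse the numeric id captured from the "#etextNNNN" fragment.
    return EBOOK_PREFIX + match.group(1)
```

A rule like this only works while both naming schemes stay stable, which is why it is a workaround and not a real link between the two sets of records.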
The individual rdf files don't get updated after first creation. That's something we will do eventually but need to figure out an efficient way.
We'd be happy to help with that. We've done a lot of thinking about this and have a quite scalable way; see http://bibliographica.org/docs/ordf/. In fact, it would be pretty low overhead if you didn't use the reasoning and fancy indexing.
Piecing the individual files together is not as easy as it seems because we have to remove redundant information. We'll have to copy the entire database into a triple store and serialize it out again. Not likely to happen soon.
That's not so hard. We could even do it for you given a tar of all the individual files. In fact for our purposes this would be better because otherwise we'd have to break the big file out into many small graphs (for each distinct subject's bnode closure).
Cheers,
-w
--
William Waites           <william.waites@okfn.org>
Mob: +44 789 798 9965    Open Knowledge Foundation
Fax: +44 131 464 4948    Edinburgh, UK
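P.S. For reference, breaking one big file into a small graph per subject's bnode closure could look roughly like the Python sketch below. It is not the ORDF implementation. It assumes triples are plain `(subject, predicate, object)` string tuples and that blank nodes are written N-Triples style with a `_:` prefix; the names `is_bnode` and `split_by_subject` are invented for the example.

```python
from collections import defaultdict


def is_bnode(term):
    # Assumption: blank nodes are labelled N-Triples style, e.g. "_:b0".
    return term.startswith("_:")


def split_by_subject(triples):
    """Return {subject: [triples]} with one small graph per non-blank subject.

    Each graph holds the subject's own triples plus, recursively, the
    triples of every blank node reachable through an object position.
    """
    by_subject = defaultdict(list)
    for s, p, o in triples:
        by_subject[s].append((s, p, o))

    graphs = {}
    for subject in by_subject:
        if is_bnode(subject):
            continue  # blank nodes only appear inside another subject's graph
        closure, seen, stack = [], {subject}, [subject]
        while stack:
            node = stack.pop()
            for triple in by_subject.get(node, []):
                closure.append(triple)
                obj = triple[2]
                if is_bnode(obj) and obj not in seen:
                    seen.add(obj)
                    stack.append(obj)
        graphs[subject] = closure
    return graphs
```

Starting from a tar of the individual files avoids this step, since each file already is one such small graph.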