On 10-05-26 16:22, Marcello Perathoner wrote:
catalog.rdf.bz2 gets updated every night.
Yes it does, but the information in there is very different from that in the individual files. I attach records from the catalogue and from the individual file. The biggest problem is that the URIs http://www.gutenberg.org/feeds/catalog.rdf#etext27676 and http://www.gutenberg.org/ebooks/27676 are completely different, and there's no way to get from one to the other (other than hand coding special cases). So even if we were to initially populate with catalog.rdf.bz2 we couldn't then go and pull the detailed records. (Btw, content negotiation doesn't seem to work: curl -H "Accept: application/rdf+xml" http://www.gutenberg.org/ebooks/27676 gives a 406.)
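To make the "hand coding special cases" point concrete, here is a minimal sketch of such a mapping. It assumes that the number in the #etextNNNN fragment is always the same as the number in the /ebooks/NNNN path, which is an assumption drawn from this one example and not something the catalogue guarantees. The function name catalog_to_ebook_uri is made up for illustration.

```python
import re

# Assumed pattern: catalogue URIs look like .../feeds/catalog.rdf#etextNNNN
_CATALOG_URI = re.compile(
    r"^http://www\.gutenberg\.org/feeds/catalog\.rdf#etext(\d+)$"
)


def catalog_to_ebook_uri(uri):
    """Map a catalog.rdf URI to the matching /ebooks/ URI.

    Returns None when the URI does not follow the assumed pattern,
    so callers can spot records that need different handling.
    """
    match = _CATALOG_URI.match(uri)
    if match is None:
        return None
    return "http://www.gutenberg.org/ebooks/" + match.group(1)


print(catalog_to_ebook_uri(
    "http://www.gutenberg.org/feeds/catalog.rdf#etext27676"
))
```

This is exactly the kind of out-of-band knowledge a generic RDF client would not have, which is why the two sets of records don't link up on their own.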
The individual rdf files don't get updated after first creation. That's something we will do eventually but need to figure out an efficient way.
We'd be happy to help with that. We've done a lot of thinking about this and have a quite scalable way; see http://bibliographica.org/docs/ordf/. In fact it would be pretty low overhead if you didn't use the reasoning and fancy indexing.
Piecing the individual files together is not as easy as it seems because we have to remove redundant information. We'll have to copy the entire database into a triple store and serialize it out again. Not likely to happen soon.
That's not so hard. We could even do it for you given a tar of all the individual files. In fact for our purposes this would be better because otherwise we'd have to break the big file out into many small graphs (for each distinct subject's bnode closure).
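
For illustration, breaking one big graph into per-subject bnode closures could be sketched roughly as below. This is plain Python over (subject, predicate, object) string tuples, not the ordf code. It assumes blank nodes are written N-Triples style with a "_:" prefix, and the names is_bnode and bnode_closures are made up for the example.

```python
from collections import defaultdict


def is_bnode(term):
    # Assumption: blank nodes are written N-Triples style, e.g. "_:b0".
    return term.startswith("_:")


def bnode_closures(triples):
    """Split a flat list of triples into one small graph per named subject.

    Each graph holds the subject's own triples plus, recursively, the
    triples of every blank node reachable through an object position.
    """
    by_subject = defaultdict(list)
    for s, p, o in triples:
        by_subject[s].append((s, p, o))

    graphs = {}
    for subject in by_subject:
        if is_bnode(subject):
            continue  # blank nodes only appear inside some named subject's graph
        closure, seen, stack = [], set(), [subject]
        while stack:
            node = stack.pop()
            if node in seen:
                continue  # guards against cycles between blank nodes
            seen.add(node)
            for triple in by_subject.get(node, []):
                closure.append(triple)
                if is_bnode(triple[2]):
                    stack.append(triple[2])
        graphs[subject] = closure
    return graphs
```

A blank node shared by two subjects ends up copied into both graphs, which is the price of making each small graph self-contained.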
Cheers,
-w
--
William Waites