[gutvol-p] Re: Gutenberg Catalogue RDF

William Waites william.waites at okfn.org
Wed May 26 09:14:47 PDT 2010


On 10-05-26 16:22, Marcello Perathoner wrote:
>
> catalog.rdf.bz2 gets updated every night.

Yes it does, but the information in there is very different from what is
in the individual files. I attach records from the catalogue and from the
individual file.

The biggest problem is that

<http://www.gutenberg.org/feeds/catalog.rdf#etext27676>
<http://www.gutenberg.org/ebooks/27676>

are completely different, and there's no way (other than hand-coding
special cases) to get from one to the other. So even if we were to
initially populate with catalog.rdf.bz2, we couldn't then go and pull the
detailed records.
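
For illustration, the kind of hand-coded special case meant here might
look like the sketch below. It assumes the number after "etext" is the
same ebook number used in the /ebooks/ URI, which holds for the pair above
but is not stated anywhere in the data itself. The function name is made
up for this example.

```python
import re

# Assumption: catalogue URIs always end in "#etext<number>" and that number
# is the ebook number. This is a hand-coded rule, not something the RDF states.
CATALOG_URI = re.compile(
    r"^http://www\.gutenberg\.org/feeds/catalog\.rdf#etext(\d+)$"
)


def catalog_to_ebook_uri(uri):
    """Map a catalog.rdf subject URI to the per-book URI, or None if it does not fit."""
    match = CATALOG_URI.match(uri)
    if match is None:
        return None
    return "http://www.gutenberg.org/ebooks/" + match.group(1)
```

Anything that doesn't fit the pattern comes back as None, so the
unmappable records at least show up rather than being silently mangled.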

(Btw, content negotiation doesn't seem to work:
    curl -H "Accept: application/rdf+xml" http://www.gutenberg.org/ebooks/27676
gives a 406.)
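
The same check can be scripted. This is a minimal sketch using only the
Python standard library; the helper names are invented for the example,
and the timeout value is an arbitrary choice.

```python
import urllib.error
import urllib.request


def build_rdf_request(url):
    """Build a GET request that asks for RDF/XML via the Accept header."""
    return urllib.request.Request(url, headers={"Accept": "application/rdf+xml"})


def fetch_status(url):
    """Return the HTTP status code the server gives for an RDF/XML request."""
    try:
        with urllib.request.urlopen(build_rdf_request(url), timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the code (e.g. 406) is on the exception.
        return err.code
```

Calling fetch_status("http://www.gutenberg.org/ebooks/27676") should
return 406 for as long as the server behaves as described above.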

> The individual rdf files don't get updated after first creation.
> That's something we will do eventually but need to figure out an
> efficient way.

We'd be happy to help with that. We've done a lot of thinking about this
and have quite a scalable way; see http://bibliographica.org/docs/ordf/.
In fact it would be pretty low overhead if you didn't use the reasoning
and fancy indexing.

> Piecing the individual files together is not as easy as it seems
> because we have to remove redundant information. We'll have to copy
> the entire database into a triple store and serialize it out again.
> Not likely to happen soon.

That's not so hard. We could even do it for you given a tar of all the
individual files. In fact for our purposes this would be better, because
otherwise we'd have to break the big file out into many small graphs (one
for each distinct subject's bnode closure).
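
To make "bnode closure" concrete, here is a rough sketch of that split.
It is not the ORDF code: it assumes triples are plain (subject, predicate,
object) tuples of strings in N-Triples style, with blank nodes written as
"_:label", and the function names are made up for the example.

```python
from collections import defaultdict


def is_bnode(term):
    """Assumption: blank nodes are strings of the form '_:label'."""
    return term.startswith("_:")


def bnode_closure(triples, subject):
    """Triples about `subject`, plus triples about any blank nodes reachable from it."""
    by_subject = defaultdict(list)
    for triple in triples:
        by_subject[triple[0]].append(triple)

    closure, seen, stack = [], set(), [subject]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # guards against cycles between blank nodes
        seen.add(node)
        for triple in by_subject.get(node, []):
            closure.append(triple)
            if is_bnode(triple[2]):
                stack.append(triple[2])  # follow the blank node object
    return closure


def split_by_subject(triples):
    """One small graph per distinct non-blank subject, keyed by subject URI."""
    graphs = {}
    for subject, _, _ in triples:
        if not is_bnode(subject) and subject not in graphs:
            graphs[subject] = bnode_closure(triples, subject)
    return graphs
```

Starting from the individual files avoids this step altogether, since each
file is already one such graph.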

Cheers,
-w

-- 
William Waites           <william.waites at okfn.org>
Mob: +44 789 798 9965    Open Knowledge Foundation
Fax: +44 131 464 4948                Edinburgh, UK


