Gutenberg Catalogue RDF
Hi all,

I'm working on a project, http://bibliographica.org/ which involves annotating and enriching information about authors and works. We need some seed data! I would very much like to use the gutenberg catalogue for this and it seems like I have three options:

* generate RDF out of the marc dump (lossy, messy)
* use the catalog.rdf.bz2 (old rdf layout)
* use the individual RDF e.g. http://www.gutenberg.org/ebooks/12345.rdf (means crawling the site).

I'd really rather not crawl the site (and suspect you'd rather I not as well) but I would like to use the RDF generated for individual works (well, manifestations, but I digress).

Any chance of producing a dump like catalog.rdf.bz2 but with the updated schema?

Kind regards,
-w
--
William Waites <william.waites@okfn.org>
Mob: +44 789 798 9965 Open Knowledge Foundation
Fax: +44 131 464 4948 Edinburgh, UK
William Waites wrote:
Hi all,
I'm working on a project, http://bibliographica.org/ which involves annotating and enriching information about authors and works. We need some seed data! I would very much like to use the gutenberg catalogue for this and it seems like I have three options:
* generate RDF out of the marc dump (lossy, messy)
* use the catalog.rdf.bz2 (old rdf layout)
* use the individual RDF e.g. http://www.gutenberg.org/ebooks/12345.rdf (means crawling the site).
I'd really rather not crawl the site (and suspect you'd rather I not as well) but I would like to use the RDF generated for individual works (well, manifestations, but I digress).
Any chance of producing a dump like catalog.rdf.bz2 but with the updated schema?
catalog.rdf.bz2 gets updated every night.

The individual rdf files don't get updated after first creation. That's something we will do eventually but need to figure out an efficient way.

Piecing the individual files together is not as easy as it seems because we have to remove redundant information. We'll have to copy the entire database into a triple store and serialize it out again. Not likely to happen soon.

--
Marcello Perathoner
webmaster@gutenberg.org
On 10-05-26 16:22, Marcello Perathoner wrote:
catalog.rdf.bz2 gets updated every night.
Yes it does, but the information in there is very different from the individual files. I attach records from the catalogue and from the individual file.

The biggest problem is,

<http://www.gutenberg.org/feeds/catalog.rdf#etext27676>
<http://www.gutenberg.org/ebooks/27676>

are completely different. And there's no way (other than hand coding special cases) to make it from one to the other. So even if we were to initially populate with catalog.rdf.bz2 we couldn't then go and pull the detailed records.

(Btw, content-negotiation doesn't seem to work:

    curl -H "Accept: application/rdf+xml" http://www.gutenberg.org/ebooks/27676

gives a 406)
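The content-negotiation check above can also be scripted. The following is a minimal sketch using only the Python standard library; the helper names `build_rdf_request` and `negotiate` are made up for illustration, and nothing here says what the server returns today.

```python
import urllib.error
import urllib.request

RDF_XML = "application/rdf+xml"


def build_rdf_request(url):
    """Build a GET request whose Accept header asks for RDF/XML."""
    return urllib.request.Request(url, headers={"Accept": RDF_XML})


def negotiate(url, timeout=30):
    """Return (status, content_type) for an RDF/XML request to url."""
    try:
        with urllib.request.urlopen(build_rdf_request(url), timeout=timeout) as resp:
            return resp.status, resp.headers.get("Content-Type")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (such as the 406 reported above) are raised
        # as exceptions by urllib, so they are reported from here.
        return err.code, err.headers.get("Content-Type")


# Example call (performs a network request, so it is left commented out):
# print(negotiate("http://www.gutenberg.org/ebooks/27676"))
```

A status of 406 from `negotiate` would reproduce the behaviour described in the message.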
The individual rdf files don't get updated after first creation. That's something we will do eventually but need to figure out an efficient way.
We'd be happy to help with that. We've done a lot of thinking about this and have a quite scalable way of doing it; see http://bibliographica.org/docs/ordf/. In fact, it would be pretty low overhead if you didn't use the reasoning and fancy indexing.
Piecing the individual files together is not as easy as it seems because we have to remove redundant information. We'll have to copy the entire database into a triple store and serialize it out again. Not likely to happen soon.
That's not so hard. We could even do it for you given a tar of all the individual files. In fact for our purposes this would be better because otherwise we'd have to break the big file out into many small graphs (for each distinct subject's bnode closure).

Cheers,
-w
--
William Waites <william.waites@okfn.org>
Mob: +44 789 798 9965 Open Knowledge Foundation
Fax: +44 131 464 4948 Edinburgh, UK
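To make the "bnode closure" split concrete, here is a rough sketch that breaks one large set of triples into one small graph per named subject. It assumes triples are plain `(subject, predicate, object)` tuples and that blank nodes are written as strings starting with `_:`; both are conventions chosen for this example, not anything taken from the Gutenberg files or from ORDF.

```python
from collections import defaultdict


def is_bnode(term):
    """Blank nodes are written as '_:label' strings in this sketch."""
    return isinstance(term, str) and term.startswith("_:")


def split_by_subject(triples):
    """Map each named subject to its own triples plus its bnode closure."""
    by_subject = defaultdict(list)
    for triple in triples:
        by_subject[triple[0]].append(triple)

    graphs = {}
    for subject in by_subject:
        if is_bnode(subject):
            continue  # blank nodes only appear inside a named subject's graph
        graph, seen, stack = [], set(), [subject]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            for s, p, o in by_subject.get(node, []):
                graph.append((s, p, o))
                if is_bnode(o):
                    stack.append(o)  # follow blank-node objects only
        graphs[subject] = graph
    return graphs
```

Named objects (for example an author URI) are deliberately not followed, so each graph stays limited to the work and the anonymous structure hanging off it.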
William Waites wrote:
On 10-05-26 16:22, Marcello Perathoner wrote:
catalog.rdf.bz2 gets updated every night.
Yes it does, but the information in there is very different from the individual files.
Not at all. The information is quite the same. (There's also author birth and death dates in the individual files, but otherwise it's the same.)
I attach records from the catalogue and from the individual file.
The biggest problem is,
<http://www.gutenberg.org/feeds/catalog.rdf#etext27676>
<http://www.gutenberg.org/ebooks/27676>
are completely different. And there's no way (other than hand coding special cases) to make it from one to the other. So even if we were to initially populate with catalog.rdf.bz2 we couldn't then go and pull the detailed records.
Why should you do that? The big file contains the same information as the individual ones.

--
Marcello Perathoner
webmaster@gutenberg.org
On 10-05-26 19:24, Marcello Perathoner wrote:
Not at all. The information is quite the same. (There's also author birth and death dates in the individual files, but otherwise it's the same.)
Not quite. These are the differences I can see for the CIA Factbook entry (the examples I sent earlier; it happens to be the first entry in the catalog.rdf.bz2 that I have lying around):

* Different subject URI <-- very important, could kludge with owl:sameAs but shouldn't have to
* Different layout for dc:subject (uses a rdf:Bag in one, a simple bunch of bnodes in the other)
* Creator/Contributor/Publisher has a URI in the individual files but a text string in the catalog.rdf.gz. Using a URI is the right way to do it.
* Links to downloadable resources are absent in the catalog

So the first means that it is ambiguous which thing I am referring to if I use your URIs without going to the trouble of putting in owl:sameAs and then inferencing on that (resource intensive and messy).

The second means that when I create a lens (cf. the fresnel vocabulary) for looking at the data I can't do it in a consistent way, because sometimes dc:subject has one shape and sometimes another.

The third means that if I want to present all works by an author I have to resort to smooshing on a text string when you already have URIs minted for that purpose.

The fourth means that I can't provide links to the actual text, or download it automatically for indexing/text-mining purposes, if I use catalog.rdf.bz2.

The information is *similar* but not the same.

-w
--
William Waites <william.waites@okfn.org>
Mob: +44 789 798 9965 Open Knowledge Foundation
Fax: +44 131 464 4948 Edinburgh, UK
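For illustration, the owl:sameAs kludge for the first point could look roughly like the sketch below. It assumes the numeric id is the only thing that differs between the two URI forms quoted earlier in the thread; the function names are made up for this example, and any catalogue URI that does not fit the pattern is rejected rather than guessed at.

```python
import re

# Pattern for the catalogue-style URI quoted in the thread.
CATALOG_RE = re.compile(r"^http://www\.gutenberg\.org/feeds/catalog\.rdf#etext(\d+)$")
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def ebook_uri(catalog_uri):
    """Map a catalog.rdf subject URI to the /ebooks/<id> form."""
    match = CATALOG_RE.match(catalog_uri)
    if match is None:
        raise ValueError("not a catalog.rdf etext URI: %r" % catalog_uri)
    return "http://www.gutenberg.org/ebooks/" + match.group(1)


def same_as_triple(catalog_uri):
    """Return one N-Triples line linking the two URIs with owl:sameAs."""
    return "<%s> <%s> <%s> ." % (catalog_uri, OWL_SAME_AS, ebook_uri(catalog_uri))
```

Generating the links is cheap. The cost described in the message comes afterwards, when a store has to reason over owl:sameAs to treat the two subjects as one.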
William Waites wrote:
* Different subject URI <-- very important, could kludge with owl:sameAs but shouldn't have to
The DCMI changed their recommendations. The RDF files follow the recommendations that were current at the time I wrote the scripts. The syntax may be different but the semantic is the same.
* Different layout for dc:subject (uses a rdf:Bag in one, a simple bunch of bnodes in the other)
A Bag *is* just a bunch of nodes. The syntax may be different but the semantic is the same.
* Creator/Contributor/Publisher has a URI in the individual files but a text string in the catalog.rdf.gz. Using a URI is the right way to do it.
That's the only difference.

Actually using a URL is quite the wrong way. I did that only to make it possible for somebody to create an exact replica of our dataset (i.e. containing the exact same set of (wrong?) assumptions we made).

The semantic of the string literal is: the author of this book is spelled 'John Doe'.

The semantic of the URL is: the author of this book is spelled 'John Doe' *and* the authors of two books are the same person if they share the same URL.

Now the second statement is a very bold statement, especially if you don't find any LoC record for the book you are cataloguing or the LoC doesn't know either. (This happens quite often.)
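The distinction drawn above can be shown with a toy example. The records and the `http://example.org/agent/...` URIs below are invented for illustration and are not Gutenberg data: two different people share the spelling 'John Doe', and the cataloguer has decided they are distinct.

```python
from collections import defaultdict


def works_by_creator(records):
    """Group work ids by creator value (a string literal or a URI)."""
    groups = defaultdict(set)
    for work, creator in records:
        groups[creator].add(work)
    return dict(groups)


# Hypothetical records: the same two books, described two ways.
BY_LITERAL = [("book-1", "John Doe"), ("book-2", "John Doe")]
BY_URI = [
    ("book-1", "http://example.org/agent/1"),
    ("book-2", "http://example.org/agent/2"),
]

literal_groups = works_by_creator(BY_LITERAL)  # one group: same spelling only
uri_groups = works_by_creator(BY_URI)          # two groups: an identity decision
```

Grouping on the literal only says the names are spelled alike. Grouping on the URI encodes a decision about who is the same person, which is exactly the claim that can be wrong.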
* Links to downloadable resources are absent in the catalog
Look further down.

--
Marcello Perathoner
webmaster@gutenberg.org
Marcello,
On 10-05-26 21:36, Marcello Perathoner wrote:
William Waites wrote:
* Different subject URI <-- very important, could kludge with owl:sameAs but shouldn't have to
The DCMI changed their recommendations. The RDF files follow the recommendations that were current at the time I wrote the scripts.
The syntax may be different but the semantic is the same.
Different subject URIs mean different subjects. What is the canonical URI that is to be used to refer to one of gutenberg's texts? (I don't think it's the one in catalog.rdf)
* Different layout for dc:subject (uses a rdf:Bag in one, a simple bunch of bnodes in the other)
A Bag *is* just a bunch of nodes.
The syntax may be different but the semantic is the same.
This:

    dc:subject [ dcam:memberOf dc:LCSH;
            rdf:value "Geography -- Handbooks, manuals, etc.",
                "Political science -- Handbooks, manuals, etc.",
                "Political statistics -- Handbooks, manuals, etc.",
                "World politics -- Handbooks, manuals, etc." ],
        [ dcam:memberOf dc:LCC;
            rdf:value "G" ];

is different from this:

    dc:subject [ a rdf:Bag;
            rdf:_1 [ a dc:LCSH;
                    rdf:value "Geography -- Handbooks, manuals, etc." ];
            rdf:_2 [ a dc:LCSH;
                    rdf:value "World politics -- Handbooks, manuals, etc." ];
            rdf:_3 [ a dc:LCSH;
                    rdf:value "Political science -- Handbooks, manuals, etc." ];
            rdf:_4 [ a dc:LCSH;
                    rdf:value "Political statistics -- Handbooks, manuals, etc." ] ],
        [ a dc:LCC;
            rdf:value "G" ];

Try writing a script that yields the strings in there and you'll see that it is different.
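As a rough sketch of such a script: the two shapes are written out below as plain `(subject, predicate, object)` tuples with prefixed names instead of full URIs, so no RDF library is needed. The names `FLAT`, `BAG`, `WORK` and `subject_strings` are illustrative only, and `WORK` is just a placeholder subject.

```python
WORK = "http://www.gutenberg.org/ebooks/27676"  # placeholder subject
LCSH = [
    "Geography -- Handbooks, manuals, etc.",
    "Political science -- Handbooks, manuals, etc.",
    "Political statistics -- Handbooks, manuals, etc.",
    "World politics -- Handbooks, manuals, etc.",
]

# Shape 1: one bnode per scheme, carrying several rdf:value literals.
FLAT = [(WORK, "dc:subject", "_:a"), ("_:a", "dcam:memberOf", "dc:LCSH")]
FLAT += [("_:a", "rdf:value", heading) for heading in LCSH]
FLAT += [(WORK, "dc:subject", "_:b"), ("_:b", "dcam:memberOf", "dc:LCC"),
         ("_:b", "rdf:value", "G")]

# Shape 2: an rdf:Bag whose members each carry a single rdf:value.
BAG = [(WORK, "dc:subject", "_:bag"), ("_:bag", "rdf:type", "rdf:Bag")]
for i, heading in enumerate(LCSH, 1):
    member = "_:m%d" % i
    BAG += [("_:bag", "rdf:_%d" % i, member),
            (member, "rdf:type", "dc:LCSH"),
            (member, "rdf:value", heading)]
BAG += [(WORK, "dc:subject", "_:lcc"), ("_:lcc", "rdf:type", "dc:LCC"),
        ("_:lcc", "rdf:value", "G")]


def index(triples):
    """Build a subject -> predicate -> [objects] lookup."""
    idx = {}
    for s, p, o in triples:
        idx.setdefault(s, {}).setdefault(p, []).append(o)
    return idx


def objects(idx, s, p):
    return idx.get(s, {}).get(p, [])


def subject_strings(triples, work):
    """Collect the subject heading strings, whichever shape is used."""
    idx = index(triples)
    found = set()
    for node in objects(idx, work, "dc:subject"):
        if "rdf:Bag" in objects(idx, node, "rdf:type"):
            # Bag shape: the values sit one level down, under rdf:_1, rdf:_2, ...
            members = [o for p, objs in idx.get(node, {}).items()
                       if p.startswith("rdf:_") for o in objs]
        else:
            members = [node]  # flat shape: the values sit on the node itself
        for member in members:
            found.update(objects(idx, member, "rdf:value"))
    return found
```

The same strings come out of both inputs, but only because `subject_strings` special-cases the Bag. That extra branch is the inconsistency a lens or query would have to work around.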
* Creator/Contributor/Publisher has a URI in the individual files but a text string in the catalog.rdf.gz. Using a URI is the right way to do it.
That's the only difference.
Actually using a URL is quite the wrong way. I did that only to make it possible for somebody to create an exact replica of our dataset (i.e. containing the exact same set of (wrong?) assumptions we made).
The semantic of the string literal is: the author of this book is spelled 'John Doe'.
The semantic of the URL is: the author of this book is spelled 'John Doe' *and* the authors of two books are the same person if they share the same URL.
Now the second statement is a very bold statement, especially if you don't find any LoC record for the book you are cataloguing or the LoC doesn't know either. (This happens quite often.)
Since we want to be able to make statements about Authors, they need to have URIs. I agree it's a bold statement, and inferring this information is error-prone. However I could start from scratch or I could start from the work you've already done. I'd rather not have to start from scratch. So the information needed to make an exact replica of your dataset is contained in the individual works rdf, not the catalog.rdf. How does that work?
* Links to downloadable resources are absent in the catalog
Look further down.
ok.

Cheers,
-w
--
William Waites <william.waites@okfn.org>
Mob: +44 789 798 9965 Open Knowledge Foundation
Fax: +44 131 464 4948 Edinburgh, UK
William Waites wrote:
What is the canonical URI that is to be used to refer to one of gutenberg's texts? (I don't think it's the one in catalog.rdf)
http://www.gutenberg.org/ebooks/42
So the information needed to make an exact replica of your dataset is contained in the individual works rdf, not the catalog.rdf. How does that work?
You write a script that downloads all .rdf files, then you import them all into your software.

--
Marcello Perathoner
webmaster@gutenberg.org
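A minimal sketch of such a download script follows, using only the Python standard library. The URL pattern is the one given at the start of the thread (http://www.gutenberg.org/ebooks/12345.rdf). The function names, the pause between requests, and the choice to skip ids that fail are assumptions made for this example, not a documented Gutenberg policy.

```python
import time
import urllib.error
import urllib.request
from pathlib import Path


def rdf_url(ebook_id):
    """URL of the individual RDF file for one ebook id."""
    return "http://www.gutenberg.org/ebooks/%d.rdf" % ebook_id


def download_all(ebook_ids, dest, delay=2.0):
    """Fetch each RDF file into dest, pausing between requests."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for ebook_id in ebook_ids:
        target = dest / ("%d.rdf" % ebook_id)
        if not target.exists():  # lets an interrupted run resume
            try:
                with urllib.request.urlopen(rdf_url(ebook_id), timeout=30) as resp:
                    target.write_bytes(resp.read())
            except urllib.error.URLError:
                continue  # missing or unreachable id: skip it
            time.sleep(delay)  # be gentle with the server
        saved.append(target)
    return saved


# Example call (performs network requests, so it is left commented out):
# download_all(range(1, 11), "gutenberg-rdf")
```

The saved files can then be loaded into whatever store is in use, one graph per file.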
participants (2)
- Marcello Perathoner
- William Waites