Programmatically build a catalog database?

Hello,

(Hopefully this is the proper mailing list for such a topic. Let me know otherwise.)

I would like to build a local database of the Gutenberg catalog. The 'Gutenberg Feeds' page [1] lists the following resources to help achieve that programmatically:

(1) All books in one huge file, in "superseded DCMI recommendation" format
(2) A separate file for each book, in "current DCMI recommendation" format
(3) An RSS feed, in rss version="0.91" format

So far:

(1) sports all the Gutenberg assets and is handy for the initial database build, but looks like overkill for day-to-day synchronization.
(2) seems more appropriate than (1) for daily updates, but sports a different format: "current DCMI recommendation" vs. "superseded DCMI recommendation".
(3) is a bit of a blast from the past, but at least provides a list of new resources daily. Sadly there is no explicit link to (2), so one has to infer it from the <link> information.

Questions:

- Is there a version of (1) in the same format as (2)? Assuming the "current DCMI recommendation" is the canonical representation, that would save one from dealing with two different formats, or from hacking (1) to get all the references to (2) and then hammering PG to get the individual files in format (2).
- Why are (1) and (2) in different formats?
- Is there an alternative feed that lists the RDF resource explicitly? An Atom feed perhaps?

Apologies if these are FAQs, but I couldn't locate an unambiguous archive of this mailing list. Is Gmane a good proxy for the list postings?

http://dir.gmane.org/gmane.culture.literature.e-books.gutenberg.volunteers

Alternatively, is there a more straightforward way to build a local database of PG's assets? Perhaps I'm missing something :)

Thanks in advance for any pointers.

Cheers,

PA.

[1] http://www.gutenberg.org/wiki/Gutenberg:Feeds
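P.S. For what it's worth, here is the kind of thing I have in mind for (3): read the RSS feed and guess the per-book RDF URL from each item's <link>. This is only a rough sketch; both the feed URL and the "<link> + .rdf" convention are my own assumptions inferred from the feeds page, not anything documented.

import urllib.request
import xml.etree.ElementTree as ET

# Assumption: the daily RSS 0.91 feed of new books lives here; adjust to
# whatever URL the Gutenberg Feeds page [1] actually lists.
FEED_URL = "http://www.gutenberg.org/cache/epub/feeds/today.rss"

with urllib.request.urlopen(FEED_URL) as response:
    feed = ET.parse(response)

for item in feed.iter("item"):
    link = item.findtext("link", "").strip()   # e.g. http://www.gutenberg.org/ebooks/12345
    if not link:
        continue
    # Assumption: the per-book RDF file can be reached by appending ".rdf".
    rdf_url = link.rstrip("/") + ".rdf"
    print(item.findtext("title", "").strip(), rdf_url)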

I wrote an Activity for the One Laptop Per Child project which includes a text file listing all the books in PG and PG Australia. The child using the Activity can search through this list and download any of the books she finds. I didn't use the RDF feed but instead used the offline catalogs:

http://www.gutenberg.org/wiki/Gutenberg:Offline_Catalogs

I wrote the code in Python and you can check it out here:

http://git.sugarlabs.org/readetexts

James Simmons
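In rough outline, parsing such a GUTINDEX listing boils down to something like the sketch below. This is a simplification for illustration (not the actual readetexts code), and it assumes each entry ends with its etext number; real GUTINDEX files also contain headers, notes and continuation lines that a robust parser would need to skip or merge.

import re

# Assumed entry shape: "Some Title, by Some Author          12345"
ENTRY = re.compile(r"^(?P<title>.+?\S)\s{2,}(?P<num>\d+)\s*$")

def parse_gutindex(path):
    """Collect (etext number, title) pairs from a GUTINDEX listing."""
    books = []
    with open(path, encoding="utf-8", errors="replace") as listing:
        for line in listing:
            match = ENTRY.match(line.rstrip())
            if match:
                books.append((int(match.group("num")), match.group("title")))
    return books

# Usage: books = parse_gutindex("GUTINDEX.ALL")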

Hello,

On Feb 19, 2013, at 8:10 PM, James Simmons <nicestep@gmail.com> wrote:
I wrote an Activity for the One Laptop Per Child project which includes a text file listing all the books in PG and PG Australia. The child using the Activity can search through this list and download any of the books she finds. I didn't use the RDF feed but instead used the offline catalogs:
You mean the "The GUTINDEX Listings of EBooks"? At first glance these GUTINDEX look a bit less structured than the RDF format. Or are you referring to the MARC Records? Which, according to that page, are generated from the XML/RDF catalog. Also, that page refers to the "machine-readable format" as being the RDF catalog, so I don't mind using that.
I wrote the code in Python and you can check it out here:
Thanks for the pointer. Would you know of a list of PG related projects by any chance?

I did in fact use the GUTINDEX files. The RDF files seem to contain lots of markup and very little information. The GUTINDEX files are not perfect, but if you know how PG files are named and which formats you are likely to find for a given title, they do the job.

You can see my Activity in action here:

http://activities.sugarlabs.org/en-US/sugar/addon/4035

I do not have a list of PG-related projects.

James Simmons
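For context, the naming conventions referred to here are roughly as sketched below. The URL patterns are my reading of how plain-text editions are commonly laid out under /files/, not a guarantee that holds for every ebook.

# Assumption: plain-text editions of ebook <n> usually live under /files/<n>/,
# with a suffix indicating the character encoding. Not guaranteed for every book.
CANDIDATE_PATTERNS = [
    "http://www.gutenberg.org/files/{n}/{n}-0.txt",  # UTF-8
    "http://www.gutenberg.org/files/{n}/{n}-8.txt",  # ISO-8859-1
    "http://www.gutenberg.org/files/{n}/{n}.txt",    # US-ASCII
]

def candidate_text_urls(ebook_number):
    """Return the plain-text URLs worth trying, in order of preference."""
    return [pattern.format(n=ebook_number) for pattern in CANDIDATE_PATTERNS]

# Usage: try each of candidate_text_urls(12345) in turn until one responds.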

On 02/18/2013 09:48 PM, Petite Abeille wrote:
Questions:
- Is there a version of (1) in the same format as (2)? Assuming the "current DCMI recommendation" is the canonical representation. That would save one from dealing with two different formats, or hacking (1) to get all the references to (2) and then hammer PG to get the individual files in format (2).
No.
- Why are (1) and (2) in different formats?
Historical reasons. When the program generating (1) was written, that format was the current recommendation. When the program generating (2) was written, the recommendation had been updated.
- Is there an alternative feed that lists the rdf resource explicitly? An Atom feed perhaps?
No. But the rdf files are easy enough to find.

Regards

--
Marcello Perathoner
webmaster@gutenberg.org

On Feb 19, 2013, at 8:23 PM, Marcello Perathoner <marcello@perathoner.de> wrote:
Historical reasons. When the program generating (1) was written, that format was the current recommendation. When the program generating (2) was written, the recommendation had been updated.
Fair enough. Which one is the canonical form in PG? (1) or (2)? Thanks.

On 02/19/2013 08:40 PM, Petite Abeille wrote:
On Feb 19, 2013, at 8:23 PM, Marcello Perathoner <marcello@perathoner.de> wrote:
Historical reasons. When the program generating (1) was written, that format was the current recommendation. When the program generating (2) was written, the recommendation had been updated.
Fair enough. Which one is the canonical form in PG? (1) or (2)?
None. With us it is just an output format for others to ingest. We use a postgresql database.

Regards

--
Marcello Perathoner
webmaster@gutenberg.org

On Feb 19, 2013, at 9:06 PM, Marcello Perathoner <marcello@perathoner.de> wrote:
Fair enough. Which one is the canonical form in PG? (1) or (2)?
None. With us it is just an output format for others to ingest.
Ok. Which format would you recommend to use in terms of future support?

On 02/19/2013 09:10 PM, Petite Abeille wrote:
On Feb 19, 2013, at 9:06 PM, Marcello Perathoner <marcello@perathoner.de> wrote:
Fair enough. Which one is the canonical form in PG? (1) or (2)?
None. With us it is just an output format for others to ingest.
Ok. Which format would you recommend to use in terms of future support?
Probably (2). But getting too many of those at once will block your IP.

If you want, I can tar+bzip2 them all together in a cron job.

Regards

--
Marcello Perathoner
webmaster@gutenberg.org

On Feb 19, 2013, at 9:49 PM, Marcello Perathoner <marcello@perathoner.de> wrote:
Probably (2). But getting too many of those at once will block your IP.
Argh… overlooked your message… indeed I'm blocked now :/
If you want, I can tar+bzip2 them all together in a cron job.
Yes, that would be very handy, thanks.

Also, regarding the IP blocking… what's an acceptable rate of requests? I can easily throttle the requests, but to what rate?

Thanks again for your help.
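Throttling itself is easy enough; something along these lines is what I have in mind. The delay and the User-Agent string are placeholders, since I don't know what rate (or identification) is actually considered acceptable.

import time
import urllib.request

# Placeholder: adjust once an acceptable request rate is known.
SECONDS_BETWEEN_REQUESTS = 5.0

def fetch_throttled(urls, user_agent="SomeBot/0.1; +mailto:someone@example.org"):
    """Fetch each URL in turn, sleeping between requests."""
    for url in urls:
        request = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(request) as response:
            yield url, response.read()
        time.sleep(SECONDS_BETWEEN_REQUESTS)

# Usage:
# for url, body in fetch_throttled(["http://www.gutenberg.org/ebooks/12345.rdf"]):
#     ...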

On 02/19/2013 09:24 PM, Petite Abeille wrote:
We use a postgresql database.
Ah, also, if you store all the information in a relational database to start with, is the database schema available somewhere? I.e. the DDL?
No. The database schema changes very often and people importing database dumps would only get frustrated. The RDF file is an interface that stays put even when the database schema changes.

Regards

--
Marcello Perathoner
webmaster@gutenberg.org

On Feb 19, 2013, at 8:23 PM, Marcello Perathoner <marcello@perathoner.de> wrote:
No. But the rdf files are easy enough to find.
So… how does one actually get these individual rdf files then?

$ curl -IL http://www.gutenberg.org/ebooks/12345.rdf
HTTP/1.1 403 Forbidden

Ok, so www.gutenberg.org doesn't like curl.

$ curl -L -A 'foo' http://www.gutenberg.org/ebooks/12345.rdf
<p>You exceeded your rate. Come back later.</p>

Now even the regular site fainted:

http://www.gutenberg.org/ebooks/42137

503 Service Unavailable
We're sorry, but your computer or network may be sending automated requests. To protect our users, we can't process your request right now. See: http://www.gutenberg.org/terms_of_use/

So… how does one get these individual rdf files programmatically?

Thanks in advance.

On 02/20/2013 12:21 AM, Petite Abeille wrote:
On Feb 19, 2013, at 8:23 PM, Marcello Perathoner <marcello@perathoner.de> wrote:
No. But the rdf files are easy enough to find.
So… how does one actually get these individual rdf files then?
$ curl -IL http://www.gutenberg.org/ebooks/12345.rdf
HTTP/1.1 403 Forbidden
Ok, so www.gutenberg.org doesn't like curl.
No. And doesn't like any kind of automated access.

If you want to do that, use a mirror from the mirror list:

www.gutenberg.org/MIRRORS.ALL

I've tar'ed all rdf files together so you can get them with one request:

$ curl --user-agent "MyBotName/1.0; +mailto:contact@me.org" http://www.gutenberg.org/feeds/rdf-files.tar.bz2
...

Regards

--
Marcello Perathoner
webmaster@gutenberg.org
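For anyone ingesting that archive, processing it could look roughly like the sketch below: stream the tarball and pull a title out of each per-book RDF file. The namespace URIs and the pgterms:ebook/dcterms:title layout reflect my reading of the current per-book RDF files; treat them as assumptions rather than a documented contract.

import tarfile
import xml.etree.ElementTree as ET

# Namespaces as they appear, to my reading, in the current per-book RDF files.
DCTERMS = "{http://purl.org/dc/terms/}"
PGTERMS = "{http://www.gutenberg.org/2009/pgterms/}"

def iter_titles(archive_path):
    """Yield (member name, title) for every .rdf file in rdf-files.tar.bz2."""
    with tarfile.open(archive_path, "r:bz2") as archive:
        for member in archive:
            if not member.name.endswith(".rdf"):
                continue
            rdf = archive.extractfile(member)
            if rdf is None:
                continue
            root = ET.parse(rdf).getroot()
            ebook = root.find(PGTERMS + "ebook")
            title = ebook.findtext(DCTERMS + "title") if ebook is not None else None
            yield member.name, title

# Usage:
# for name, title in iter_titles("rdf-files.tar.bz2"):
#     print(name, title)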

On Feb 20, 2013, at 1:04 PM, Marcello Perathoner <marcello@perathoner.de> wrote:
No. And doesn't like any kind of automated access.
Ooohhh… the irony :D
If you want to do that, use a mirror from the mirror list:
www.gutenberg.org/MIRRORS.ALL
Right. I saw that. But these don't contain any of the RDF files, do they? Or? For the record though… what's an acceptable query rate for gutenberg.org?
I've tar'ed all rdf files together so you can get them with one request:
$ curl --user-agent "MyBotName/1.0; +mailto:contact@me.org" http://www.gutenberg.org/feeds/rdf-files.tar.bz2 …
Ah, much excellent. Thanks! Is that aggregation going to be maintained in the future? Or is it a one-off?
participants (3)

- James Simmons
- Marcello Perathoner
- Petite Abeille