Re: Fwd: Programmatic fetching books from Gutenberg

David A. Desrosiers wrote:
On Wed, Jul 15, 2009 at 8:28 PM, Paulo Levi<i30817@gmail.com> wrote:
So, I'd like to know how, from the etext1802 number, I can get a working URL to download the book, and how to construct variants for each format.
I do something very similar on the Plucker "samples" page:
I check HEAD on each resource (using an intelligent caching mechanism on my side), and then present either a working link or a struck-out link, depending on whether the format is available or not.
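The probe-and-cache flow described here (plus the etext-number-to-URL step Paulo asks about) can be sketched roughly as follows. This is not the actual samples-page code: the plucker cache URL pattern is taken from links that appear later in this thread, and the one-shot cache is a hypothetical simplification.

```python
# Rough sketch, not the actual samples-page code: build a candidate URL
# from an etext number, probe it once with HEAD, and cache the answer.
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

def plucker_url(etext_no):
    # URL pattern as seen elsewhere in this thread; other formats
    # would need their own patterns.
    return f"http://www.gutenberg.org/cache/plucker/{etext_no}/{etext_no}"

def head_ok(url):
    """True if a HEAD request for `url` answers with a 2xx status."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=10) as resp:
            return 200 <= resp.status < 300
    except (HTTPError, URLError):
        return False

_seen = {}  # url -> bool; the "check only the first time" cache

def link_available(url, probe=head_ok):
    """Probe a URL at most once; afterwards serve the cached verdict."""
    if url not in _seen:
        _seen[url] = probe(url)
    return _seen[url]
```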
That seems a horrible waste of resources, seeing that you only need to scan the RDF file to see what files we have.

On Mon, Jul 27, 2009 at 3:24 AM, Marcello Perathoner<marcello@perathoner.de> wrote:
That seems a horrible waste of resources, seeing that you only need to scan the RDF file to see what files we have.
Scanning the RDF file tells me absolutely nothing about the availability of the actual target format itself. Checking HEAD on each target link does, however. Since I'm caching it on the server-side, I only have to remotely check it the first time, which is not a "horrible waste of resources" at all.

On Jul 27, 2009, at 4:34 PM, David A. Desrosiers wrote:
On Mon, Jul 27, 2009 at 3:24 AM, Marcello Perathoner<marcello@perathoner.de> wrote:
That seems a horrible waste of resources, seeing that you only need to scan the RDF file to see what files we have.
Scanning the RDF file tells me absolutely nothing about the availability of the actual target format itself. Checking HEAD on each target link does, however. Since I'm caching it on the server-side, I only have to remotely check it the first time, which is not a "horrible waste of resources" at all.
My, can't we admit that XPath is a bit over our heads, so we prefer confronting the admin we're supposed to be cooperating with? Wrt resources, my guess is it's about on par traffic-wise (1-5k per book vs. megabytes of RDF) but much better CPU-wise. That is, if you don't also want the RDF for other fine things like metadata, etc.

ralf
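For what it's worth, the XPath-style scan alluded to here might look like the sketch below. The RDF fragment is a toy invented for illustration, loosely modeled on the old catalog.rdf; the real schema and namespaces differ in detail, so treat the element names as assumptions.

```python
# Toy example of scanning an RDF catalog for a given etext's formats.
# The fragment below is invented for illustration; the real catalog.rdf
# uses a richer schema.
import xml.etree.ElementTree as ET

SAMPLE_RDF = """\
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="etext29514">
    <dc:title>Verg\u00e4nglichkeit</dc:title>
    <dc:format>text/plain</dc:format>
    <dc:format>text/html</dc:format>
  </rdf:Description>
</rdf:RDF>
"""

NS = {"rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
      "dc": "http://purl.org/dc/elements/1.1/"}
ABOUT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about"

def formats_for(root, etext_id):
    """Collect the dc:format values recorded for one etext entry."""
    for desc in root.findall(".//rdf:Description", NS):
        if desc.get(ABOUT) == etext_id:
            return [el.text for el in desc.findall("dc:format", NS)]
    return []

root = ET.fromstring(SAMPLE_RDF)
```

Note that, as David goes on to point out, this only tells you what the catalog records, not whether a given derived format is actually in the on-the-fly cache right now.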

On Mon, Jul 27, 2009 at 1:45 PM, Ralf Stephan<ralf@ark.in-berlin.de> wrote:
My, can't we admit that XPath is a bit over our heads, so we prefer confronting the admin we're supposed to be cooperating with? Wrt resources, my guess is it's about on par traffic-wise (1-5k per book vs. megabytes of RDF) but much better CPU-wise. That is, if you don't also want the RDF for other fine things like metadata, etc.
I think you've missed my point. The RDF flat-out cannot tell me which of the target _formats_ are available for immediate download to the users. I'm not looking for which _titles_ are available in the catalog, I'm looking for which _formats_ are available. Also note that I'm already parsing the feeds to see what the top 'n' titles are, so parsing XML via whatever method I need is not the blocker here.

Let me give you an example of two titles available in the catalog:

Vergänglichkeit by Sigmund Freud
http://www.gutenberg.org/cache/plucker/29514/29514

The Lost Word by Henry Van Dyke
http://www.gutenberg.org/cache/plucker/4384/4384

Both of these _titles_ are available in the Gutenberg catalog, but the second one is not available in the Plucker _format_ for immediate download. Big difference from parsing title availability from the catalog.rdf file.

Make sense now?

I confirm that neither the Plucker nor the Mobile formats are mentioned in the catalog file. Do you have an explanation, Marcello?

ralf
Ralf Stephan http://www.ark.in-berlin.de pub 1024D/C5114CB2 2009-06-07 [expires: 2011-06-06] Key fingerprint = 76AE 0D21 C06C CBF9 24F8 7835 1809 DE97 C511 4CB2

On Tue, Jul 28, 2009 at 09:16:41AM +0200, Ralf Stephan wrote:
I confirm that neither the Plucker nor the Mobile formats are mentioned in the catalog file. Do you have an explanation, Marcello?
I believe Marcello is out on vacation for 2 weeks. But I know the explanation: the epub, mobi and a few other formats are not part of the Project Gutenberg collection's files, so they are not part of the database. They are generated on demand (or served from cache if they were generated recently enough), from HTML or text.

We are planning many more "on the fly" conversion options for the future. I have one for a mobile eBook format (for cell phones), and hope to have a PDF converter (with lots of options). We've been working on some text-to-speech converters, too, but that work has gone slowly.

The catalog file only tracks the actual files that are stored as part of the collection (stuff you can view while navigating the directory tree via FTP or other methods).

-- Greg
_______________________________________________ gutvol-d mailing list gutvol-d@lists.pglaf.org http://lists.pglaf.org/mailman/listinfo/gutvol-d

Ralf Stephan wrote:
I confirm that neither the Plucker nor the Mobile formats are mentioned in the catalog file. Do you have an explanation, Marcello?
Because the plucker files are not in the archive. They are generated on the fly (sort of) and then cached for some time. We don't know which files are in the cache at any given moment.

This design was changed for the epub and mobi files. They are generated manually and stay in the cache (until manually rebuilt). They could be added to the database as well, and to the RDF file too. Just a small matter of programming ...

Just now I'm concentrating on making better epubs. By the time I'm satisfied, I'll tackle the database code.

David A. Desrosiers wrote:
On Mon, Jul 27, 2009 at 1:45 PM, Ralf Stephan<ralf@ark.in-berlin.de> wrote:
My, can't we admit that XPath is a bit over our heads, so we prefer confronting the admin we're supposed to be cooperating with? Wrt resources, my guess is it's about on par traffic-wise (1-5k per book vs. megabytes of RDF) but much better CPU-wise. That is, if you don't also want the RDF for other fine things like metadata, etc.
I think you've missed my point.
The RDF flat-out cannot tell me which of the target _formats_ are available for immediate download to the users. I'm not looking for which _titles_ are available in the catalog, I'm looking for which _formats_ are available. Also note that I'm already parsing the feeds to see what the top 'n' titles are already, so parsing XML via whatever methods I need is not the blocker here.
Let me give you an example of two titles available in the catalog:
Vergänglichkeit by Sigmund Freud http://www.gutenberg.org/cache/plucker/29514/29514
The Lost Word by Henry Van Dyke http://www.gutenberg.org/cache/plucker/4384/4384
Both of these _titles_ are available in the Gutenberg catalog, but the second one is not available in the Plucker _format_ for immediate download. Big difference from parsing title availability from the catalog.rdf file.
So you are doing a HEAD on the cache location? I hope you don't have many of these in the field, because you're going to look very sorry whenever the location of the cache changes. (It will! I give you fair notice for free :-) )
Make sense now?
No. Why is that "immediate download" bit so important to you? You will get a completely random set of files. (A cached plucker file expires 7 days after *generation*, not after the last *access*. So all you get is the set of files generated in the last 7 days.) And a wrong set, too: the first file could have been deleted on the server long before you finished your barrage of HEAD requests.
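The expiry rule described here (counted from generation, not from last access) can be sketched as follows. The cache class and the injected clock are hypothetical; only the 7-days-after-generation semantics come from the message above.

```python
# Sketch of expiry-after-generation semantics: a read does NOT refresh
# the entry's lifetime, so entries vanish a fixed time after being built.
import time

TTL = 7 * 24 * 3600  # seconds; "expires 7 days after generation"

class GenerationCache:
    def __init__(self, clock=time.time):
        self._clock = clock
        self._store = {}  # key -> (generated_at, value)

    def put(self, key, value):
        """Record a freshly generated file; the expiry clock starts now."""
        self._store[key] = (self._clock(), value)

    def get(self, key):
        """Return the cached value, or None once TTL has elapsed
        since generation (accesses do not extend the lifetime)."""
        entry = self._store.get(key)
        if entry is None:
            return None
        generated_at, value = entry
        if self._clock() - generated_at >= TTL:
            del self._store[key]
            return None
        return value
```

Injecting the clock makes the time-based behavior easy to exercise without actually waiting seven days.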
participants (4)

- David A. Desrosiers
- Greg Newby
- Marcello Perathoner
- Ralf Stephan