[gutvol-d] Re: Fwd: Programmatic fetching books from Gutenberg

9 Aug 2009

David A. Desrosiers wrote:
...
On Mon, Jul 27, 2009 at 1:45 PM, Ralf Stephan<ralf@ark.in-berlin.de> wrote:
...
My, can't we admit that XPath is a bit over our head,
so we prefer confronting the admin we're supposed
to be cooperating with? Wrt resources, my guess is it's
about par traffic-wise (1-5k per book vs. megabytes
of RDF) but much better CPU-wise. That is, if you don't
want the RDF for other fine things like metadata etc.
I think you've missed my point.
The RDF flat-out cannot tell me which of the target _formats_ are
available for immediate download to the users. I'm not looking for
which _titles_ are available in the catalog, I'm looking for which
_formats_ are available. Also note that I'm already parsing the feeds
to see what the top 'n' titles are, so parsing XML via
whatever methods I need is not the blocker here.
Let me give you an example of two titles available in the catalog:
Vergänglichkeit by Sigmund Freud
http://www.gutenberg.org/cache/plucker/29514/29514
The Lost Word by Henry Van Dyke
http://www.gutenberg.org/cache/plucker/4384/4384
Both of these _titles_ are available in the Gutenberg catalog, but the
second one is not available in the Plucker _format_ for immediate
download. Big difference from parsing title availability from the
catalog.rdf file.
So you are doing a HEAD on the cache location? I hope you don't have 
many of these in the field, because you're going to look very sorry 
whenever the location of the cache changes. (It will! I give you fair 
notice for free :-) )
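For readers following along, the kind of HEAD check under discussion might look roughly like the sketch below. It is only an illustration: the URL pattern is inferred from the two example links quoted above and is not a documented interface, and the names plucker_cache_url, is_cached and CACHE_BASE are hypothetical. The base location is a parameter precisely because it may move.

```python
import urllib.error
import urllib.request

# Assumption: pattern inferred from the two example links in this thread.
# It is not a documented interface and may change without notice.
CACHE_BASE = "http://www.gutenberg.org/cache/plucker"


def plucker_cache_url(book_id, base=CACHE_BASE):
    """Build the cache URL for a book, e.g. .../plucker/4384/4384."""
    return f"{base.rstrip('/')}/{book_id}/{book_id}"


def is_cached(url, opener=urllib.request.urlopen, timeout=10):
    """Return True if a HEAD request for url answers 200, False on an HTTP error."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # 404 and friends: the file is not in the cache right now.
        return False
```

The opener argument exists only so the check can be exercised without touching the network; by default it is urllib.request.urlopen.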
...
Make sense now?
No.

Why is that "immediate download" bit so important for you? You will get 
a completely random set of files. (A cached plucker file expires 7 days 
after *generation* not after the last *access*. So all you get is the 
set of files generated in the last 7 days.)
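The expiry rule can be written down as a small sketch. The 7-day lifetime comes from the paragraph above; the function name is_expired and the choice to treat a file as expired at exactly 7 days are assumptions made for illustration, not a description of the server's actual code.

```python
from datetime import datetime, timedelta

# From the post: a cached file lives 7 days from generation.
CACHE_LIFETIME = timedelta(days=7)


def is_expired(generated_at, now, lifetime=CACHE_LIFETIME):
    """Expiry depends only on generation time; later accesses do not extend it."""
    # Assumption: the boundary (exactly 7 days old) counts as expired.
    return now - generated_at >= lifetime


# A file generated on 1 Aug is gone by 9 Aug, even if it was fetched on 8 Aug.
print(is_expired(datetime(2009, 8, 1), datetime(2009, 8, 9)))
# A file generated on 5 Aug is still cached on 9 Aug.
print(is_expired(datetime(2009, 8, 5), datetime(2009, 8, 9)))
```

Note that there is no last-access argument at all, which is why the set of cached files at any moment says nothing about popularity.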

And a wrong set too. The first file could have been deleted on the 
server long before you finished your barrage of HEAD requests.
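One way a client could cope with that staleness, sketched here under assumptions, is to stop trusting an earlier check and treat a 404 at download time as a normal outcome. The name fetch_if_present is hypothetical, and this is a generic pattern rather than anything the Gutenberg server prescribes.

```python
import urllib.error
import urllib.request


def fetch_if_present(url, opener=urllib.request.urlopen, timeout=10):
    """Attempt the download directly; return the bytes, or None on a 404."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            # Never cached, or deleted since any earlier check.
            return None
        raise  # other HTTP errors are left to the caller
```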

Marcello Perathoner