
Well, not knowing what to do, I went to the Robots Readme on the gutenberg.org web site and copied the wget command listed under the heading "Getting All EBook Files." I started this process on Sunday evening, at the end of a cable modem connection. Little did I realize that more than 24 hours later, the process would still be running.

In a private message, I was told to use rsync. OK. If rsync is the preferred method, then why is wget presented as the example? It appears that I'm storing a bunch of index.html files that will be redundant once I switch to rsync. I guess I can clean them up at my leisure. However, the web page also says to "keep the html files" to make re-roboting faster. Well, I'll be a mirror site for all of the ZIP and HTML files anyway. (A sample rsync invocation is sketched at the end of this message.)

Please post suggestions here or PM me. Thank you.

-----Original Message-----
From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Marcello Perathoner
Sent: Monday, November 15, 2004 11:13 AM
To: Project Gutenberg Volunteer Discussion
Subject: Re: [gutvol-d] [etext04|etext05]/index.html missing?

John Hagerson wrote:
I am using wget to download books from www.gutenberg.org. The process is stuck on etext04 in what appears to be a futile effort to download index.html.
The indexes are auto-generated on the fly by Apache. If the load on the file servers is too high, the connection times out before a full directory listing can be retrieved. You should not harvest at peak hours anyway.

--
Marcello Perathoner
webmaster@gutenberg.org

_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/listinfo.cgi/gutvol-d
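
For readers in the same situation, a minimal rsync invocation might look like the sketch below. The host, module name, and local path are placeholders, not values confirmed in this thread; check the mirroring how-to on gutenberg.org for the real ones.

    # host, module, and destination below are illustrative placeholders
    # -a preserves timestamps and permissions, -v reports progress,
    # --delete prunes local files that have disappeared upstream
    rsync -av --delete ftp.ibiblio.org::gutenberg /local/gutenberg-mirror

Because rsync compares file lists and transfers only the differences, repeated runs are cheap, which is why it is preferred over wget for keeping a mirror current.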
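
If wget must be used anyway, a few standard options make it gentler on the servers and more tolerant of the timeouts described above. This is only a sketch, not the exact command from the Robots Readme, and the URL is illustrative:

    # -m = recursive mirroring with timestamping; -np = do not ascend to the parent directory
    # -w 2 = pause two seconds between requests; --tries/--timeout retry a stalled fetch
    # instead of hanging on it indefinitely
    wget -m -np -w 2 --tries=5 --timeout=60 http://www.gutenberg.org/etext04/

Harvesting off peak hours, as suggested above, will likely help more than any flag.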