
On Mon, Nov 15, 2004 at 07:21:55PM -0600, John Hagerson wrote:
Well, not knowing what to do, I went to the Robots Readme on the Gutenberg.org web site and copied the wget command listed under the heading "Getting All EBook Files." I started this process on Sunday evening, at the end of a cable modem. Little did I realize that more than 24 hours later, the process would still be running.
In a private message, I was told to use rsync. OK. If rsync is the preferred method, then why is wget presented as the example?
It appears that I'm storing a bunch of index.html files that are redundant if I use rsync. I guess I can clean them up at my leisure. However, again the web page says "keep the html files" to make re-roboting faster.
Well, I'll be a mirror site for all of the ZIP and HTML files, anyway.
Please post suggestions here or pm me. Thank you.
John, please see the mirroring HOWTO (http://gutenberg.org/howto). Mirroring the entire site is different from harvesting a few directories or sets of files. The "index.html" is created by the remote server simply to list the files in a directory; you are right that it's transient/temporary/imaginary. Note that a 256Kbit DSL modem will take about 6 days to download the entire PG collection (it's 140GB). We do not recommend DSL or cable modems for setting up mirrors, and generally don't list them in our mirror list.
-- Greg
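As a rough check on the download-time figure quoted above, here is a minimal Python sketch of the arithmetic. It assumes decimal units (1 GB = 10^9 bytes), a fully saturated link, and no protocol overhead; the function name `transfer_days` is only illustrative. Under those assumptions, about 6 days corresponds to roughly 256 kilobytes per second, while a 256 kilobit per second link would need closer to 50 days. The thread does not say which unit was meant.

```python
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes, rate_bits_per_second):
    """Days needed to move size_bytes over a link running at the given bit rate."""
    seconds = size_bytes * 8 / rate_bits_per_second
    return seconds / SECONDS_PER_DAY


# Collection size from the message above: 140 GB (assumed decimal gigabytes).
COLLECTION_BYTES = 140 * 10**9

# Reading "256K" as kilobits per second.
days_kbit = transfer_days(COLLECTION_BYTES, 256_000)

# Reading "256K" as kilobytes per second (8 bits per byte).
days_kbyte = transfer_days(COLLECTION_BYTES, 256_000 * 8)

print(f"256 kbit/s:  {days_kbit:.1f} days")   # about 50.6 days
print(f"256 kbyte/s: {days_kbyte:.1f} days")  # about 6.3 days
```

Either way, the estimate supports the point made in the reply: a consumer DSL or cable link is a poor fit for mirroring the whole collection.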
-----Original Message-----
From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Marcello Perathoner
Sent: Monday, November 15, 2004 11:13 AM
To: Project Gutenberg Volunteer Discussion
Subject: Re: [gutvol-d] [etext04|etext05]/index.html missing?
John Hagerson wrote:
I am using wget to download books from www.gutenberg.org. The process is stuck on etext04 in what appears to be a futile effort to download index.html.
The indexes are auto-generated on the fly by Apache.
If the load on the fileservers is too high the connection times out before a full directory listing can be retrieved.
You should not harvest at peak hours anyway.
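Since a directory listing can time out while the fileservers are busy, a harvesting script would normally wait and try again rather than hammer the server. The following is a minimal Python sketch of that idea, not anything taken from the Project Gutenberg documentation: `fetch_with_retry` and its parameters are hypothetical names, and `fetch` stands for whatever function the script uses to download one URL and which raises `TimeoutError` when the connection times out.

```python
import time


def fetch_with_retry(fetch, url, tries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on TimeoutError with a doubling delay.

    fetch      -- hypothetical callable that downloads one URL
    tries      -- total number of attempts before giving up
    base_delay -- seconds to wait after the first failure
    sleep      -- injectable so the wait can be replaced in tests
    """
    for attempt in range(tries):
        try:
            return fetch(url)
        except TimeoutError:
            if attempt == tries - 1:
                # Out of attempts: let the caller see the timeout.
                raise
            # Wait 1x, 2x, 4x ... base_delay before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

Backing off like this keeps a stalled listing from blocking the whole run, but it does not replace the advice above to avoid harvesting at peak hours.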
-- Marcello Perathoner webmaster@gutenberg.org
_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/listinfo.cgi/gutvol-d