[etext04|etext05]/index.html missing?

I am using wget to download books from www.gutenberg.org. The process is stuck on etext04 in what appears to be a futile effort to download index.html. The file must have been there last night, because I didn't have this problem. Could the appropriate person please look into this? Thank you very much.

On Mon, Nov 15, 2004 at 08:26:05AM -0600, John Hagerson wrote:
I am using wget to download books from www.gutenberg.org. The process is stuck on etext04 in what appears to be a futile effort to download index.html.
The file must have been there last night, because I didn't have this problem.
Could the appropriate person please look into this?
There's no index.html currently.
-- gbn

John Hagerson wrote:
I am using wget to download books from www.gutenberg.org. The process is stuck on etext04 in what appears to be a futile effort to download index.html.
The indexes are auto-generated on the fly by Apache. If the load on the fileservers is too high, the connection times out before a full directory listing can be retrieved. You should not harvest at peak hours anyway.
-- Marcello Perathoner webmaster@gutenberg.org
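If the generated listing keeps timing out under load, wget's retry and pacing flags are one way to cope. A minimal sketch, assuming a harvest of the etext04 directory over HTTP (the directory URL here is illustrative, not taken from the thread):

    # Retry timed-out requests and pace them to go easy on the server.
    # The URL is an assumed example; substitute the directory you are after.
    wget --timeout=60 --tries=5 --wait=2 --random-wait \
         -m http://www.gutenberg.org/dirs/etext04/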

On Mon, Nov 15, 2004 at 06:13:06PM +0100, Marcello Perathoner wrote:
John Hagerson wrote:
I am using wget to download books from www.gutenberg.org. The process is stuck on etext04 in what appears to be a futile effort to download index.html.
The indexes are auto-generated on the fly by Apache.
If the load on the fileservers is too high the connection times out before a full directory listing can be retrieved.
You should not harvest at peak hours anyway.
One more thing (or two):
- You can't get the big directories via FTP. Use HTTP. (The FTP servers stop after 2K items.)
- Don't use HTTP, use rsync. See the mirroring HOWTO at gutenberg.org/howto for more info (yes, you can use rsync to get just particular directories, filename extensions, etc.).
But if things are still weird, send something we can replicate and we'll help fix it!
-- gbn
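A minimal sketch of the rsync route gbn suggests, pulling just one year's directory; the host and module names below are placeholders, so take the real ones from the mirroring HOWTO:

    # host::module is a placeholder -- see gutenberg.org/howto for the real one.
    # The include/exclude filters limit the transfer to etext04 only.
    rsync -av --include='etext04/' --include='etext04/**' --exclude='*' \
          rsync.example.org::gutenberg/ ./gutenberg/

Filtering by filename extension works the same way: swap the includes for patterns like --include='*/' --include='*.zip' so rsync descends every directory but copies only the ZIP files.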

Well, not knowing what to do, I went to the Robots Readme on the Gutenberg.org web site and copied the wget command listed under the heading "Getting All EBook Files" (a sketch of its shape appears after this message). I started this process on Sunday evening, at the end of a cable modem. Little did I realize that more than 24 hours later, the process would still be running.

In a private message, I was told to use rsync. OK. If rsync is the preferred method, then why is wget presented as the example?

It appears that I'm storing a bunch of index.html files that are redundant if I use rsync. I guess I can clean them up at my leisure. However, again the web page says "keep the html files" to make re-roboting faster.

Well, I'll be a mirror site for all of the ZIP and HTML files, anyway.

Please post suggestions here or PM me. Thank you.

-----Original Message-----
From: gutvol-d-bounces@lists.pglaf.org [mailto:gutvol-d-bounces@lists.pglaf.org] On Behalf Of Marcello Perathoner
Sent: Monday, November 15, 2004 11:13 AM
To: Project Gutenberg Volunteer Discussion
Subject: Re: [gutvol-d] [etext04|etext05]/index.html missing?

John Hagerson wrote:
I am using wget to download books from www.gutenberg.org. The process is stuck on etext04 in what appears to be a futile effort to download index.html.
The indexes are auto-generated on the fly by Apache. If the load on the fileservers is too high, the connection times out before a full directory listing can be retrieved. You should not harvest at peak hours anyway.
-- Marcello Perathoner webmaster@gutenberg.org
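For context, the Robots Readme command John mentions is a recursive wget against the harvest page. A rough sketch of its shape, with assumed parameters; copy the actual command from the "Getting All EBook Files" section rather than from here:

    # Illustrative only; the real command and parameters are in the Robots Readme.
    # -w 2 waits two seconds between requests, -m mirrors recursively.
    wget -w 2 -m "http://www.gutenberg.org/robot/harvest?filetypes[]=zip&langs[]=en"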

On Mon, Nov 15, 2004 at 07:21:55PM -0600, John Hagerson wrote:
Well, not knowing what to do, I went to the Robots Readme on the Gutenberg.org web site and copied the wget command listed under the heading "Getting All EBook Files." I started this process on Sunday evening, at the end of a cable modem. Little did I realize that more than 24 hours later, the process would still be running.
In a private message, I was told to use rsync. OK. If rsync is the preferred method, then why is wget presented as the example?
It appears that I'm storing a bunch of index.html files that are redundant if I use rsync. I guess I can clean them up at my leisure. However, again the web page says "keep the html files" to make re-roboting faster.
Well, I'll be a mirror site for all of the ZIP and HTML files, anyway.
Please post suggestions here or pm me. Thank you.
John, please see the mirroring HOWTO at http://gutenberg.org/howto

Mirroring the entire site is different from harvesting a few directories or sets of files. The "index.html" is created by the remote server, simply to list the files in a directory - you are right that it's transient/temporary/imaginary.

Note that a 256Kbit DSL modem will take about 6 days to download the entire PG collection (it's 140GB). We do not recommend DSL or cable modems for setting up mirrors, and generally don't list them in our mirror list.
-- Greg
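A quick back-of-the-envelope check on that figure, assuming the 140GB size quoted above: the six-day estimate works out if the link sustains roughly 256 KBytes/s (about 2 Mbit/s); at a literal 256 kbit/s it would be closer to 50 days.

    # 140 GB expressed in KB, divided by throughput in KB/s and 86400 s/day.
    echo $(( 140000000 / 256 / 86400 ))   # at 256 KB/s            => ~6 days
    echo $(( 140000000 / 32  / 86400 ))   # at 256 kbit/s (32 KB/s) => ~50 days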
participants (3)
- Greg Newby
- John Hagerson
- Marcello Perathoner