Problems downloading PG titles without a web browser

I had created an Activity for the One Laptop Per Child project which downloads and reads the Plain Text version of PG titles. I use the offline catalogue and some Python code to list available titles in the catalogue that match a search string. For instance the child might enter "Twain" and get a list of all the books by and about Twain. Then the child can download the book using the same program.
This has been working well for years but lately it has stopped working. When I try to debug it I see it trying to download this URL for instance:
http://www.gutenberg.org/dirs/1/1/119/119.zip
Put this URL in any web browser and it will download "A Tramp Abroad" by Mark Twain. However, try and download the same URL using my Python code and you get this:
Forbidden
You don't have permission to access /dirs/1/1/119/119.zip on this server.
------------------------------
Apache Server at www.gutenberg.org Port 80
My Activity is a much more convenient way of downloading books than a web browser. For one thing, it gives the downloaded file a more intelligent name than 119.zip. I would like for it to work again as it did before. Suggestions?
James Simmons
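For readers who have not seen the Activity, the search-and-rename step described above might look roughly like the sketch below. The catalogue layout (a list of dicts with "number", "author" and "title" keys) and both function names are assumptions made for illustration; the thread does not show the real offline catalogue format or the Activity's own code.

```python
import re


def search_catalogue(entries, query):
    """Return entries whose author or title contains query, ignoring case."""
    needle = query.lower()
    return [e for e in entries
            if needle in e["author"].lower() or needle in e["title"].lower()]


def friendly_filename(entry, extension=".zip"):
    """Build a readable file name such as 'Mark_Twain-A_Tramp_Abroad-119.zip'."""
    label = "{}-{}".format(entry["author"], entry["title"])
    # Replace runs of characters that are awkward in file names with one underscore.
    safe = re.sub(r"[^A-Za-z0-9-]+", "_", label).strip("_")
    return "{}-{}{}".format(safe, entry["number"], extension)


if __name__ == "__main__":
    # Hypothetical one-entry catalogue built from the example in the message.
    catalogue = [{"number": 119, "author": "Mark Twain", "title": "A Tramp Abroad"}]
    for match in search_catalogue(catalogue, "Twain"):
        print(friendly_filename(match))
```

Note that this simple pattern also replaces accented letters with underscores, so a real catalogue with non-English titles would probably want a gentler rule.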

On 02/10/2013 06:46 PM, James Simmons wrote:
I had created an Activity for the One Laptop Per Child project which downloads and reads the Plain Text version of PG titles. I use the offline catalogue and some Python code to list available titles in the catalogue that match a search string. For instance the child might enter "Twain" and get a list of all the books by and about Twain. Then the child can download the book using the same program.
This has been working well for years but lately it has stopped working. When I try to debug it I see it trying to download this URL for instance:
http://www.gutenberg.org/dirs/1/1/119/119.zip
Supply a user-agent that clearly defines your app and provides a way to contact you, e.g.
OLPCReader/1.0; +http://www.olpc-reader.org/app-info.html
OLPCReader/1.0; +mailto:me@example.com
The standard Python-urllib user agent will not do!
or
use a mirror site. See: www.gutenberg.org/MIRRORS.ALL
Regards
-- Marcello Perathoner webmaster@gutenberg.org
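A minimal sketch of the advice above, using Python 3's urllib.request (on Python 2 the equivalent module was urllib2). The user-agent string is the placeholder from Marcello's message, and the function names are hypothetical; a real app would substitute its own name and contact address.

```python
import urllib.request

# Placeholder taken from the message above; replace with your own app and contact.
USER_AGENT = "OLPCReader/1.0; +mailto:me@example.com"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that identifies the client with a custom User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, dest_path, user_agent=USER_AGENT):
    """Fetch url with the custom User-Agent and write the body to dest_path."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request) as response, open(dest_path, "wb") as out:
        out.write(response.read())
```

Calling download("http://www.gutenberg.org/dirs/1/1/119/119.zip", "119.zip") would then send the identifying header instead of the default Python-urllib one. Whether the server accepts a given string is up to the site's policy.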

Marcello (and everyone else who replied):
It looks like the mirrors have the same policy. I'm currently using this to download from the URL:
http://doc.sugarlabs.org/epydocs/sugar.network.GlibURLDownloader-class.html
I'm not seeing a way to put a user agent in using this code. It looks like urllib2 supports it. I used the other because I was already using it for collaboration and it gives me a way to do a progress report on the download. I appreciate everyone's detective work and suggestions.
James Simmons
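James's concern about keeping a progress report if he switches to urllib2 could be handled by reading the response in chunks. The sketch below is one possible approach, not the GlibURLDownloader API; copy_with_progress and its parameters are hypothetical names. It works on any file-like object, so the demonstration uses an in-memory buffer rather than the network.

```python
import io


def copy_with_progress(source, dest, total_size=None, on_progress=None,
                       chunk_size=8192):
    """Copy a file-like source to dest in chunks, reporting bytes copied so far."""
    copied = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        dest.write(chunk)
        copied += len(chunk)
        if on_progress is not None:
            # total_size may be None when the server sends no Content-Length.
            on_progress(copied, total_size)
    return copied


if __name__ == "__main__":
    # Stand-in for a network response: 20000 bytes held in memory.
    fake_response = io.BytesIO(b"x" * 20000)
    target = io.BytesIO()
    copy_with_progress(fake_response, target, total_size=20000,
                       on_progress=lambda done, total: print(done, "of", total))
```

The object returned by urllib.request.urlopen is file-like, so it can be passed as source, with the Content-Length header (when present) supplying total_size.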

I'm puzzled. I think PG is an organization about making it as easy as possible to download free ebooks. What problem is being addressed by making it more restrictive and difficult?
Why are we breaking existing software interfaces?
Wouldn't OLPC be about as close as one could get to an ideal PG consumer?

Maybe a guess, but an educated one (from running websites myself): the sheer number of robots that grab everything on the web without consideration, at a completely mind-boggling rate, fetching the same things over and over again. This accounts for a large fraction of any website's traffic -- and is utterly useless, similar to 99% of all email being spam.
Jeroen.
On 2013-02-11 03:31, don kretz wrote:
I'm puzzled. I think PG is an organization about making it as easy as possible to download free ebooks. What problem is being addressed by making it more restrictive and difficult?
Why are we breaking existing software interfaces?
Wouldn't OLPC be about as close as one could get to an ideal PG consumer?

On 02/11/2013 03:31 AM, don kretz wrote:
I'm puzzled. I think PG is an organization about making it as easy as possible to download free ebooks. What problem is being addressed by making it more restrictive and difficult?
By making it difficult for bots to access the main site, we give humans a chance to use the site. We require friendly "bots" to identify themselves so we can contact their owners if they misbehave, or block them altogether.
See also Wikipedia's reasoning:
http://meta.wikimedia.org/wiki/User-Agent_policy
Regards
-- Marcello Perathoner webmaster@gutenberg.org

By making it difficult for bots to access the main site, we give humans a chance to use the site.
Perhaps you are mistaking real human beings for robots? Because the people who have been complaining to me about broken PG downloads do not appear to be robots. Either that, or there have been major improvements in AI that I don't know about!

Recently some readers who use Magic Catalog have also complained about having problems.

On 02/11/2013 05:52 AM, James Adcock wrote:
Recently some readers who use Magic Catalog have also complained about having problems.
What problems?
Users downloading dozens of books in a short time might be caught by our robot detection algorithm, especially if they don't supply valid HTTP referrers. They should just wait until the block times out, usually a few minutes.
Regards
-- Marcello Perathoner webmaster@gutenberg.org
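A client that downloads many books could follow the "just wait" advice automatically. The sketch below retries after a pause when the server answers 403 Forbidden. The function name, the retry count and the 300-second wait are assumptions (the message only says "a few minutes"), and it is not known from the thread that the block always shows up as a 403.

```python
import time
import urllib.error
import urllib.request


def fetch_with_backoff(url, opener=urllib.request.urlopen, retries=3,
                       wait_seconds=300, sleep=time.sleep):
    """Fetch url, waiting and retrying when the server answers 403 Forbidden."""
    for attempt in range(retries + 1):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as err:
            # Re-raise anything other than 403, and give up after the last retry.
            if err.code != 403 or attempt == retries:
                raise
            # Assumed wait; the thread only says blocks last "a few minutes".
            sleep(wait_seconds)
```

The url argument may also be a urllib.request.Request object, which is where User-Agent and Referer headers would be set. The opener and sleep parameters exist only so the function can be exercised without a network or real delays.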

Recently some readers who use Magic Catalog have also complained about having problems.
What problems?
I have no clue what is going on, so I quote from an email below:
Quote (from a UK reader):
I am based in the UK. Having used Magic Catalog with my "old" Kindle Keyboard without problems, I tried it on my wife's new Kindle. The books downloaded (by Wi-Fi) but the titles appeared as "pg000" format, and when attempting to open them they would not open. I went back to my "old" Kindle and found the same problem there!
I discovered that the Web Page to which my browser defaulted was now amazon.co.uk (previously it had been Google). I think the kindle update (which happens automatically when I switch wi-fi on) had changed the default browser page, and that the Magic Catalog download was sending via Amazon - which was corrupting it (?!).
I changed my default browser page back to google.co.uk - and direct download from the Magic Catalog works again! I will keep an eye on what page my browser defaults to - especially after an automatic update of the Kindle.

On 2/10/2013 5:11 PM, James Simmons wrote:
Marcello (and everyone else who replied):
It looks like the mirrors have the same policy. I'm currently using this to download from the URL:
http://doc.sugarlabs.org/epydocs/sugar.network.GlibURLDownloader-class.html
I'm not seeing a way to put a user agent in using this code. It looks like urllib2 supports it. I used the other because I was already using it for collaboration and it gives me a way to do a progress report on the download. I appreciate everyone's detective work and suggestions.
James Simmons
Try downloading from http://gutenberg.readingroo.ms (e.g. http://gutenberg.readingroo.ms/1/1/119/119.zip). This is a mirror created, and presumably maintained, by Mr. Newby. I don't know if it is the mostest, mostest up-to-date mirror, but my examination shows that it is pretty darn close, and for the older works which have the most interest (and the most flaws) it should be highly stable.
I'm fairly certain that this server is not part of a server farm or behind a load balancer, so it may not scale under heavy load. On the other hand, it's kind of an unknown server so you won't be competing with the world at large.
I'm also fairly certain that this server has not implemented any of Mr. Perathoner's PRST -> PHTML -> PePub tool chain (which is IMO a good thing), so don't expect to get any auto-generated Perathoner ePubs. This link just takes you to the repository itself, without fanfare.
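Both URLs quoted in this thread follow the same directory pattern: every digit of the book number except the last becomes a directory, followed by the number itself. A sketch of building such a URL for either host is below. The function names are hypothetical, the "0" directory for single-digit numbers is an assumption not shown in the thread, and not every book necessarily has a .zip at this path.

```python
def book_directory(number):
    """Return the directory path for a book number, e.g. 119 -> '1/1/119'."""
    digits = str(number)
    # Assumption: single-digit numbers live under "0/"; the thread shows no such case.
    parents = list(digits[:-1]) or ["0"]
    return "/".join(parents + [digits])


def zip_url(base, number):
    """Join a mirror base URL with the book's directory and zip file name."""
    return "{}/{}/{}.zip".format(base.rstrip("/"), book_directory(number), number)


if __name__ == "__main__":
    print(zip_url("http://gutenberg.readingroo.ms", 119))
    print(zip_url("http://www.gutenberg.org/dirs", 119))
```

Keeping the base URL as a parameter makes it easy to switch between the main site and a mirror without touching the rest of the download code.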
participants (6)
- don kretz
- James Adcock
- James Simmons
- Jeroen Hellingman
- Lee Passey
- Marcello Perathoner