Apparently the PG archives from 2007 and 2008 have been lost, so I'm resending this message from November of 2007 for the benefit of Messrs. Hurst and Kretz.

[begin quoted text]

Quick ... what is the most commonly downloaded book from Project Gutenberg in the last three years?

I promised Jon Noring some data a few months back, and I thought I'd deliver it in this forum, because some other people might find it interesting.

As most people here know, TPTB at Project Gutenberg deny having any download statistics beyond the past 30 days. Fortunately, for years now the Internet Archive has been trolling the internet, making periodic snapshots of web sites, including Project Gutenberg.

So I went to the Internet Archive and captured the Project Gutenberg statistics pages since September 2004. I collated all the data, and came up with a list of 408 files which have appeared in the "30 day - Top 100" since that date. I added up the download counts and re-sorted the files, and now have a list of the most popular downloads from the PG web site since Sept. 2004.

And the most popular download during the past three years is:

(drumroll please)

The Notebooks of Leonardo Da Vinci - Complete by Leonardo da Vinci (etext 5000)

The rest of the top ten are:

 2 The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle (9551 & 1661)
 3 The Art of War by Sun Tzu (132, 17405 & 20594)
 4 Le Kamasutra by Vatsyayana (14609)
 5 Pride and Prejudice by Jane Austen (1342 & 20686)
 6 The War of the Worlds by H. G. Wells (36 & 8976)
 7 Ulysses by James Joyce (4300)
 8 Little Journeys to the Homes of the Great - Volume 01 of 14 by Elbert Hubbard (12933)
 9 Manual of Surgery by Alexander Miles and Alexis Thomson (17921)
10 Hand Shadows to Be Thrown upon the Wall by Henry Bursill (12962)
11 The Adventures of Huckleberry Finn by Mark Twain (76 & 19640)
12 Alice's Adventures in Wonderland by Lewis Carroll (11, 19573 & 928)

(I know this is 12, but I couldn't bear to leave out Alice and Huck.)
(Caveat: I was unable to get precise 30 day intervals, so this list is an approximation. A /very good/ approximation, but an approximation nonetheless.)

(Caveat bis: These data are derived from the figures reported on the PG web site. They are only as good as PG's reporting.)

Of course, because the PG corpus is always growing, this kind of linear analysis may over-weight early downloads. So I changed the collation algorithm a bit. I started with a 6 month baseline, and then as I added each 30 day list I increased the weighting by 4%. That is, the data as of Feb. 2005 was counted at 100%, but the data from Feb-Mar was counted at 104%, the data from Mar-Apr was counted at 108%, the data from Apr-May was counted at 112%, etc. Thus, more recent downloads got counted more heavily than more distant downloads.

So what is the adjusted top ten list?

 1 The Notebooks of Leonardo Da Vinci - Complete by Leonardo da Vinci (5000)
 2 The Adventures of Sherlock Holmes by Sir Arthur Conan Doyle (9551 & 1661)
 3 The Art of War by Sun Tzu (132, 17405 & 20594)
 4 Le Kamasutra by Vatsyayana (14609)
 5 Pride and Prejudice by Jane Austen (1342 & 20686)
 6 Manual of Surgery by Alexander Miles and Alexis Thomson (17921)
 7 How to Speak and Write Correctly by Joseph Devlin (6409)
 8 Ulysses by James Joyce (4300)
 9 Little Journeys to the Homes of the Great - Volume 01 of 14 by Elbert Hubbard (12933)
10 The War of the Worlds by H. G. Wells (36 & 8976)
11 The Adventures of Huckleberry Finn by Mark Twain (76 & 19640)

As you can see, the Manual of Surgery is more popular recently, and Hand Shadows less so. Alice dropped to 14, so I didn't feel like I could include her. What is interesting is that the addition of new files to the PG corpus has not had much effect on the most popular file downloads.

The data for all 400+ files can be found at http://www.passkeysoft.com/~lee/zero.txt and http://www.passkeysoft.com/~lee/four.txt.
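For anyone who wants to reproduce the adjusted ranking, the 4% weighting described above can be sketched roughly as follows. This is an illustration only, not the script actually used for these lists. It assumes each 30 day snapshot has already been parsed into a dictionary mapping etext number to download count, that the snapshots are ordered oldest first, and that the 6 month baseline corresponds to the first six snapshots. The names snapshot_weight and collate are made up for this sketch. Merging multiple editions of the same book is not handled here.

```python
from collections import defaultdict


def snapshot_weight(index, baseline=6, step=0.04):
    """Weight for the snapshot at 0-based position `index` (oldest first).

    The first `baseline` snapshots count at 100%; each later snapshot
    adds another `step` (104%, 108%, 112%, ...). The increase is
    additive, not compounding, matching the percentages in the message.
    """
    extra = max(0, index - baseline + 1)
    return 1.0 + step * extra


def collate(snapshots, baseline=6, step=0.04):
    """Sum weighted download counts per etext and sort, highest first.

    `snapshots` is a list of dicts {etext_number: downloads}.
    With step=0.0 this reduces to the plain (unweighted) totals.
    """
    totals = defaultdict(float)
    for index, counts in enumerate(snapshots):
        weight = snapshot_weight(index, baseline, step)
        for etext, downloads in counts.items():
            totals[etext] += weight * downloads
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Calling collate(snapshots, step=0.0) gives the linear list, and collate(snapshots) gives the adjusted one, so the two rankings can be compared from the same input.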
Bowerbird, if you want to know where to start in your conversion process to z.m.l., I would suggest the books on this list.

I hand-manipulated the first 50 entries in each file to try to count multiple editions of the same book as a single entry; the remaining data is raw.

Enjoy!

--
Nothing of significance below this line.
_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/listinfo.cgi/gutvol-d
participants (1)
-
Lee Passey