HTML and text statistics

Having recently downloaded the PG 2010 DVD (thank you very much PG), I started looking at the books:

- The DVD seems to contain about 34,000 zip files.
- A small number of these (about 50) seem to be zipped mp3 files (*m.zip).
- About 1800 files seem to be zipped html files (*h.zip).

Does this mean that PG (at that time) only had about 1800 'real' (as opposed to generated-from-the-text-file) html books, or have I misunderstood something?

Presumably many more books have been added since, I think I recall an append from Greg that the number is now more like 40,000, and I get the impression that most of these will come from DP and have html versions. But even so this seems to imply that there are about 28-30,000 books as text files for which no html version exists other than the generated one.

Is this number about right? or have I missed something obvious?

Bob Gibbins
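The tally described above can be reproduced with a short script. This is only a sketch: the function name `count_zip_kinds` is hypothetical, and it assumes (following the message, not any PG documentation) that a name ending in `h.zip` marks an HTML archive and a name ending in `m.zip` marks an mp3 archive.

```python
import os
from collections import Counter

def count_zip_kinds(root):
    """Tally zip files under root by filename suffix.

    Assumption taken from the message above: '*h.zip' holds HTML and
    '*m.zip' holds mp3. Every other zip file is counted as 'other'.
    """
    counts = Counter()
    for _dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            lower = name.lower()
            if not lower.endswith(".zip"):
                continue  # ignore anything that is not a zip archive
            if lower.endswith("h.zip"):
                counts["html"] += 1
            elif lower.endswith("m.zip"):
                counts["mp3"] += 1
            else:
                counts["other"] += 1
    return counts
```

Calling `count_zip_kinds` on the directory where the DVD is mounted would give the three totals. A plain suffix match like this can also catch unrelated names that happen to end in `h.zip` or `m.zip`, so the result is an estimate at best.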

In reply to Robert Gibbins's message of Wed, Feb 15, 2012 at 10:59:37AM -0000:

Hi, Bob.

The DVD was hand-selected (probably by me) to pack lots of files onto the DVD. Only a subset of the HTML was included, to save space. We had a lot more HTML at the time.

- Greg
Dr. Gregory B. Newby
Chief Executive and Director
Project Gutenberg Literary Archive Foundation www.gutenberg.org
A 501(c)(3) not-for-profit organization with EIN 64-6221541
gbnewby@pglaf.org

Greg>The DVD was hand-selected (probably by me) to pack lots of files onto the DVD. Only a subset of the HTML was included, to save space.

The CD and DVD project might let you get more like what you thought you were getting:

http://snowy.arsc.alaska.edu/pgiso/

I tried this a couple years ago and was happy with the results.

In reply to Jim Adcock's message of Wed, Feb 15, 2012 at 12:13 PM:

Another egg! This is the first one I've seen that would seem to most easily give you a copy of every book.

In reply to Robert Gibbins's message of Wed, February 15, 2012 3:59 am:

I haven't looked at the DVD, but I have looked at the mirrored file system, and have learned a few things.

Yesterday I grabbed, more or less randomly, 10 HTML files from the "Top 100 (Last 30 Days)" list. Of these 10, only one was generated from the Impoverished Text File. Of the remaining 9, only 2 were zipped HTML; the remaining 7 were single files handcrafted to the HTML v. 3.2 spec. These HTML files are found in the n/n/n/nnn-h folder, which is usually a peer to the zip file when it exists.

I don't think searching for *h.zip will give you a true picture of the non-generated HTML files. But I could be wrong; it's possible that the DVD holds more than, or differs from, the file system image.
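To check this against a local mirror, one could list the `-h` folders directly and note which of them have a zip file beside them. The sketch below is illustrative only: `find_html_folders` is a hypothetical name, the folder pattern `<digits>-h` is inferred from the "nnn-h" description above, and the peer zip is assumed to be named `<folder>.zip`.

```python
import os
import re

# Assumed folder naming, inferred from "nnn-h" above (for example "1234-h").
H_DIR = re.compile(r"^\d+-h$")

def find_html_folders(root):
    """Return sorted (folder_path, has_peer_zip) pairs for '-h' folders under root."""
    results = []
    for dirpath, dirnames, filenames in os.walk(root):
        for d in dirnames:
            if H_DIR.match(d):
                # Assumption: the peer zip, when present, is '<folder>.zip'
                # in the same parent directory as the folder itself.
                has_zip = (d + ".zip") in filenames
                results.append((os.path.join(dirpath, d), has_zip))
    return sorted(results)
```

Folders reported with `has_peer_zip` set to `False` are the ones a search for *h.zip alone would not find.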
participants (5)

- don kretz
- Greg Newby
- Jim Adcock
- Lee Passey
- Robert Gibbins