scraping the p.g. default .txt files

well, i have scraped the p.g. default .txt files -- http://www.gutenberg.org/files/#####/#####.txt -- from #10000 up, and surprisingly _quickly_. text is indeed compact, even when not zipped. a few notes.

circa #18644 is the most recent? really? i thought we were up close to #20000? i take it .aus and .eur are in that count? please relabel the human genome files! they're not really .txt!

out of each 1,000 e-texts, about 150 are a.w.o.l. -- different types (e.g., mp3) or something or other -- reducing these 8,644 down to some 7,000 or so. plus, before i process further, i will toss out the non-english and other pesky variants... let's get it working on the simple ones first. which might take the 7,000 down to 6,000. i'd thought of it initially as a mere pilot-test, but it's looking more like split-half reliability. (i chose 10,000+ only because those filenames can be generated with a one-line template.)

anyway, i chunked those files into folders of 1,000 e-texts each, because that was the size where my old machine started choking, but o.s.x. seems to handle folders just fine even when the number of files inside is 5,000+... so i might consolidate the folders further. in the meantime, the results lead to good news: each set of 1,000 e-texts takes roughly 300 megs, so the entire set of 20,000 would be about 6 gigs, meaning they will fit comfortably on today's dvd. and that is without any compression at all, baby. if we figure in compression, and tomorrow's dvd, we're talking an impressive library on a single disc. and a _huge_ library in a case containing 10 discs...

-bowerbird
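p.s. for anyone who wants to try the same thing, here is a minimal sketch of the kind of one-line-template scraping and 1,000-per-folder chunking i'm describing -- written in python purely as an illustration, not my actual script, and assuming the /files/#####/#####.txt layout holds for everything from #10000 up...

    # a rough sketch, not the real script -- assumes the default plain-text
    # file for e-text n lives at /files/n/n.txt on gutenberg.org
    import os
    import urllib.request

    # the one-line template that makes 10000+ so convenient: five digits, no surprises
    URL_TEMPLATE = "http://www.gutenberg.org/files/{n}/{n}.txt"

    def fetch_range(start=10000, stop=18644, out_root="etexts"):
        for n in range(start, stop + 1):
            # chunk into folders of 1,000 e-texts each (10000/, 11000/, 12000/, ...)
            folder = os.path.join(out_root, "{:05d}".format((n // 1000) * 1000))
            os.makedirs(folder, exist_ok=True)
            url = URL_TEMPLATE.format(n=n)
            try:
                with urllib.request.urlopen(url, timeout=30) as resp:
                    data = resp.read()
            except Exception:
                continue   # roughly 150 per 1,000 are a.w.o.l. -- just skip them
            with open(os.path.join(folder, "{:05d}.txt".format(n)), "wb") as f:
                f.write(data)

    if __name__ == "__main__":
        fetch_range()

swap in whatever range and output folder you like; the point is simply that a five-digit filename pattern plus a skip-the-missing-ones loop is all the cleverness required.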