
david said:
We could start with the results of stripping the header and the "footer", where most of the legalese is these days.
does anyone here know the best way to strip both of them?
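one possible approach, as a minimal sketch rather than a definitive answer: it assumes the text carries the "*** START OF ..." and "*** END OF ..." marker lines that many Project Gutenberg files use to fence off the book itself. older files use different boilerplate and would need other patterns. the function name strip_boilerplate is made up for this example.

```python
import re

# Assumption: the book body sits between a "*** START OF ..." line and a
# "*** END OF ..." line. Files without these markers are returned unchanged.
START_RE = re.compile(r"^\*\*\*\s*START OF.*$", re.IGNORECASE | re.MULTILINE)
END_RE = re.compile(r"^\*\*\*\s*END OF.*$", re.IGNORECASE | re.MULTILINE)


def strip_boilerplate(text):
    """Return the text between the start and end marker lines, if present."""
    start = START_RE.search(text)
    body_start = start.end() if start else 0

    # Look for the end marker only after the start marker.
    end = END_RE.search(text, body_start)
    body_end = end.start() if end else len(text)

    return text[body_start:body_end].strip()
```

if either marker is missing, the sketch keeps everything on that side, so it never throws away text it cannot identify as boilerplate.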
Also, the ten or twelve most common words in the book after stripping the ten or twelve most common words in the English language.
you'd need to strip more than a dozen. below is a list from wikipedia. there's a strong power-law in word usage. unless you strip 200-500 common words, it probably won't reveal anything very interesting... -bowerbird
Here are the top 100 words (from Project Gutenberg texts) in alphabetical order: a about after all an and any are as at be been before but by can could did do down first for from good great had has have he her him his I if in into is it its know like little made man may me men more mr much must my no not now of on one only or other our out over said see she should so some such than that the their them then there these they this time to two up upon us very was we were what when which who will with would you your
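
the counting step could be sketched like this, using the 100 words above as the strip list. the names TOP_100 and top_words and the simple tokenizing regex are assumptions made for this example. since 100 words is still short of the 200-500 suggested above, the strip list is a parameter so a longer one can be passed in.

```python
import re
from collections import Counter

# The 100 common words listed above, lowercased for matching.
TOP_100 = set("""
a about after all an and any are as at be been before but by can could did
do down first for from good great had has have he her him his I if in into
is it its know like little made man may me men more mr much must my no not
now of on one only or other our out over said see she should so some such
than that the their them then there these they this time to two up upon us
very was we were what when which who will with would you your
""".lower().split())


def top_words(text, stopwords=TOP_100, n=12):
    """Return the n most frequent words in text that are not in stopwords."""
    # Crude tokenizer (an assumption): runs of letters, with an optional
    # internal apostrophe so "don't" stays one word.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)
```

calling top_words(book_text, n=12) on a stripped book gives a list of (word, count) pairs, most frequent first.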