Automated readability scores for PG eBooks

Feedback/input would be valued.

I've been corresponding with Simon Ronald at RocketReader.com to see about integrating readability scores into the main PG book catalog. Because we don't have a lot of subject cataloging, one value of this is that it does a good job of identifying children's eBooks (they tend to be "easy"). This is also usable for people seeking to develop literacy or provide literacy instruction, by providing a way of reading something "harder" or "easier" as desired.

Take a look at the list below (hardest first, then easiest). The first score is overall, followed by the set of scores that make it up. I had provided some earlier feedback on how "hard" books were not necessarily prose, which is part of what Dr. Ronald is responding to.

If you have feedback on the results, or my idea for adding these scores as an element of the catalog search results, please chime in!

-- Greg

----- Forwarded message from "Dr. Simon Ronald" -----

Subject: Further Readability Results
Date: Tue, 20 Jun 2006 03:34:26 +0930

Hello Greg,

Here are some further "hardest and easiest" books based on a recent run. The run required 1 hour and 49 minutes to complete. This run classified 15,099 books, being a full scan of the English books.

We incorporated an ordered-list detection algorithm: some of the books contained (sometimes very noisy) lists of items, and we found 162 books in total that were list based. It should be noted that we classified the entire book as list or "not list" based on a threshold: if the book was a list, then each separate line was considered a sentence for the purposes of readability. In time we will incorporate intra-book list detection to allow the readability methodology to vary depending on the context within the book. It should also be noted that some of the HTML versions may well contain markup hints such as the <ol> or <ul> HTML tags; we could use these and other tags to improve the quality of sentence chunking.
Each entry has a series of 12 percentiles listed after the main readability percentile. These percentiles correspond to the 12 readability attributes in this order:

bigword density
short word density (-)
wordsPerSentences
syllablesPerWords
profainwordsPerWords
numbersPerWords
mostCommon1000WordsPerWord (-)
commascharsPerWords
wordsPerParagraphs
letterFrequencyDistributionError
adjacentLetterPairsFrequencyDistributionError
uniqueStemmedWordsPerWord

99.914 95 90 95 97 0 79 86 84 79 88 85 90 Note on the Resemblances and Differences in the Structure and the Development of the Brain in Man and Apes (etext2354)
99.907 96 93 90 98 0 71 96 94 49 80 70 96 Original Letters and Biographic Epitomes (etext13203)
99.907 89 86 96 88 95 82 71 94 81 80 67 67 The Great Conspiracy, Volume 7 (etext7139)
99.904 85 87 93 86 0 86 97 75 78 88 78 99 A Biography of Edmund Spenser (etext6937)
99.897 92 90 95 92 88 69 76 90 80 48 60 78 Memoirs of the Court of St. Cloud (Being secret letters from a gentleman at Paris to a nobleman in London) — Volume 1 (etext3892)
99.897 82 92 32 87 88 93 98 87 68 93 83 85 Graf von Loeben and the Legend of Lorelei (etext11066)
99.894 96 93 89 98 80 91 84 97 84 88 36 27 The Modern Regime, Volume 2 (etext2582)
99.887 92 88 73 90 92 82 77 89 45 97 67 76 An Enquiry Concerning the Principles of Taste, and of the Origin of our Ideas of Beauty, etc. (etext13485)
99.887 91 89 88 92 0 64 75 99 96 88 67 96 Giordano Bruno (etext4228)
99.884 99 95 92 99 88 79 84 34 87 66 70 74 Monism as Connecting Religion and Science (etext9199)
99.881 93 91 75 92 94 64 67 77 94 88 67 80 Rise of the Dutch Republic, the — Volume 22: 1574-76 (etext4824)
99.874 91 92 23 93 80 80 98 95 70 80 91 74 The Principal Navigations, Voyages, Traffiques and Discoveries of the English Nation — Volume 01 (etext7182)
99.868 97 95 74 99 97 81 91 77 64 27 36 78 Gilbertus Anglicus (etext16155)
99.868 86 95 63 86 0 98 90 89 50 93 89 80 Cessions of Land by Indian Tribes to the United States: Illustrated by Those in the State of Indiana (etext17148)
99.858 87 87 95 87 80 56 79 93 63 66 83 73 An Essay towards Fixing the True Standards of Wit, Humour, Railery, Satire, and Ridicule (1744) (etext16233)
99.858 91 96 71 90 95 99 99 99 3 2 85 65 Noteworthy Families (Modern Science) (etext17128)
99.858 99 99 5 99 95 92 99 99 68 0 81 87 Roget's Thesaurus of English Words and Phrases (etext10681)
99.854 97 92 89 98 0 83 81 79 95 97 60 57 Eighteenth Brumaire of Louis Bonaparte (etext1346)
99.844 99 99 84 99 80 98 93 14 61 88 83 55 Venereal Diseases in New Zealand (1922) (etext15352)
99.844 96 91 98 96 88 80 74 98 97 66 64 9 Act, Declaration, & Testimony for the Whole of our Covenanted Reformation, as Attained to, and Established in Britain and Ireland; Particularly Betwixt the Years 1638 and 1649, Inclusive (etext13200)
99.844 90 89 96 90 0 79 78 69 87 66 70 97 Dr. Bullivant (etext9249)
99.831 90 88 94 89 80 71 61 96 90 66 30 72 Memoirs of the Court of St. Cloud (Being secret letters from a gentleman at Paris to a nobleman in London) — Volume 7 (etext3898)
99.831 99 99 87 99 99 92 88 27 54 93 94 34 Three Contributions to the Theory of Sex (etext14969)
99.831 95 88 91 92 80 61 67 72 83 93 64 70 Superstition Unveiled (etext15696)
99.831 95 93 45 97 80 89 97 94 43 13 72 80 Aboriginal American Authors (etext9188)
99.824 89 92 88 88 0 90 92 56 54 80 94 84 Transactions of the American Society of Civil Engineers, Vol. LXVIII, Sept. 1910 (etext18012)
99.824 87 99 40 90 0 99 95 94 3 88 92 97 On the Origin of Species (etext8205)
99.798 91 89 93 90 88 69 60 89 86 48 47 74 Memoirs of the Court of St. Cloud (Being secret letters from a gentleman at Paris to a nobleman in London) — Volume 3 (etext3894)
99.798 90 87 85 88 0 81 64 96 84 48 81 94 The Lives of the Twelve Caesars, Volume 11: Titus (etext6396)
99.798 90 86 97 90 0 81 76 90 94 27 76 82 The evolution of English lexicography (etext11694)
99.798 73 81 96 78 98 56 54 95 72 66 89 96 A Modest Proposal (etext1080)
99.798 95 96 41 97 0 95 96 79 54 98 78 78 Webster's March 7th Speech/Secession (etext1663)
99.798 91 92 47 92 92 50 87 87 87 66 64 89 Rise of the Dutch Republic, the — Volume 01: Introduction I (etext4801)
99.798 88 86 77 88 95 50 73 73 97 80 72 83 Rise of the Dutch Republic, the — Volume 26: 1577, part III (etext4828)
99.785 95 85 98 90 88 75 71 92 94 66 56 32 The Auchensaugh Renovation of the National Covenant and (etext12381)
99.785 93 92 81 92 88 92 85 98 84 80 24 16 The Modern Regime, Volume 1 (etext2581)
99.785 70 78 92 75 92 81 79 94 87 48 52 82 The Mayflower and Her Log; July 15, 1620-May 6, 1621 — Volume 5 (etext4105)

Easiest

4.176 2 1 11 2 0 0 1 36 9 48 78 11 The Song of the Blood-Red Flower (etext12935)
4.176 0 0 11 0 0 0 15 48 8 2 94 6 Six Little Bunkers at Grandma Bell's (etext14623)
4.176 14 8 27 12 0 0 9 15 10 27 41 7 Melbourne House, Volume 2 (etext12964)
4.176 7 3 6 4 0 0 5 3 9 66 86 23 The Romantic (etext13292)
4.176 17 5 14 10 0 0 1 12 48 48 6 23 The Girl from Montana (etext15274)
4.176 7 6 13 7 0 0 11 27 22 27 56 9 Jess of the Rebel Trail (etext15382)
4.176 5 3 8 3 0 0 0 2 45 66 60 25 Stories of American Life and Adventure (etext15597)
4.176 3 8 13 6 0 0 12 3 72 1 78 13 Kazan (etext10084)
4.176 11 5 8 8 0 0 12 1 9 48 94 11 The Second Honeymoon (etext17446)
4.176 13 9 10 12 0 0 13 8 3 48 36 23 The Circus Boys on the Plains : or, the Young Advance Agents Ahead of the Show (etext2478)
4.176 19 11 11 15 0 0 10 26 42 27 9 1 The Captives (etext3601)
4.176 0 1 12 1 0 0 2 3 49 27 90 29 Old Granny Fox (etext4980)
4.176 0 0 6 0 0 0 2 0 19 27 89 54 Sleepy-Time Tales: the Tale of Fatty Coon (etext5701)
4.176 0 0 12 1 0 0 10 6 31 2 93 40 The Adventures of Johnny Chuck (etext5844)
4.176 2 3 11 3 0 0 7 1 12 5 85 54 Tale of Brownie Beaver (etext6754)
4.176 12 7 19 8 0 0 9 7 55 5 52 13 The City of Fire (etext7008)
4.176 23 9 10 15 0 0 13 16 18 13 3 29 The Man with Two Left Feet (etext7471)
4.176 4 5 13 5 0 0 8 15 40 5 80 21 Way of the Lawless (etext9903)
3.931 11 7 11 8 0 0 14 7 36 2 85 9 The Hunted Woman (etext11328)
3.931 4 1 14 2 0 0 1 34 44 27 64 3 Mary Marie (etext11143)
3.931 12 8 22 11 0 0 10 20 15 13 41 13 Contrary Mary (etext17938)
3.931 0 0 3 0 0 0 0 1 0 98 97 29 The New McGuffey First Reader (etext1489)
3.931 4 7 18 6 0 0 7 33 21 13 56 9 Martin Pippin in the Apple Orchard (etext2032)
3.931 8 9 8 8 0 0 1 12 48 13 64 23 Twenty-Two Goblins (etext2290)
3.931 9 8 9 8 0 0 7 5 43 27 72 13 God's Country—And the Woman (etext4585)
3.931 12 11 11 13 0 0 6 12 60 27 13 14 The Valley of Silent Men (etext4707)
3.931 5 2 23 3 0 0 6 16 10 13 80 23 The Boy Scout Camera Club, or, the Confession of a Photograph (etext7356)
3.931 8 9 15 10 0 0 17 4 6 5 81 21 Bob Cook and the German Spy (etext9899)
3.676 16 6 8 10 0 0 25 3 3 13 76 13 The Three Sisters (etext11876)
3.676 7 2 10 3 0 0 3 17 43 27 78 11 His Second Wife (etext17259)
3.676 6 5 17 6 0 0 14 33 24 5 60 3 Michael O'Halloran (etext9489)
3.384 12 7 6 10 0 0 28 0 20 27 13 32 The Sheriff's Son (etext17043)
3.384 1 0 17 1 0 0 9 36 11 5 88 9 The Bobbsey Twins in the Great West (etext5952)
3.384 6 3 7 4 0 0 3 24 20 13 52 34 Pan (etext7214)
3.384 0 0 5 0 0 0 2 1 20 5 93 59 Five Little Friends (etext7801)
3.384 0 1 7 1 0 0 13 0 11 1 88 53 The Tale of Sandy Chipmunk (etext9462)
3.109 0 0 6 0 0 0 0 19 0 2 93 50 Boy Blue and His Friends (etext16046)
3.109 8 3 10 4 0 0 1 30 19 27 75 6 Wanderers (etext7762)
2.874 1 2 56 2 0 0 0 33 32 2 9 5 Twilight Land (etext1751)
2.874 0 0 13 0 0 0 11 38 9 1 91 6 The Curlytops on Star Island (etext5989)
2.666 0 0 10 0 0 0 6 44 9 13 76 7 The Bobbsey Twins at Home (etext18420)
2.666 0 0 34 0 0 0 0 1 43 48 64 5 The King of Ireland's Son (etext3495)
2.460 7 11 12 11 0 0 8 3 70 5 19 16 Baree, Son of Kazan (etext4748)
2.460 10 9 11 10 0 0 2 15 9 13 52 21 Samuel the Seeker (etext5961)
2.255 6 1 2 2 0 0 28 1 7 80 30 14 Plays (etext10623)
2.255 12 13 14 14 0 0 11 4 23 5 30 14 King of the Khyber Rifles (etext6066)
2.255 3 3 19 3 0 0 6 11 13 13 64 23 Riders of the Silences (etext9867)
1.917 4 2 7 3 0 0 11 4 7 48 89 6 Anne Severn and the Fieldings (etext10817)
1.785 0 0 15 0 0 0 0 9 13 5 72 36 Fifty Famous Stories Retold (etext18442)
1.507 8 4 17 6 0 0 21 0 24 13 19 16 The Light in the Clearing (etext14150)
1.507 9 7 21 8 0 0 3 2 30 27 13 14 Voyages of Dr. Dolittle (etext1154)
1.507 0 0 2 0 0 0 43 22 3 27 30 2 Six Plays (etext5618)
1.391 4 6 18 6 0 0 9 0 24 5 67 6 The Secret Garden (etext17396)
1.391 12 9 8 10 0 0 16 7 21 13 2 19 Black Jack (etext9925)
1.080 10 4 14 6 0 0 8 9 14 5 24 19 The Gay Cockade (etext16433)
0.742 4 4 14 4 0 0 5 3 55 2 6 16 Isobel : a Romance of the Northern Trail (etext6715)
0.440 0 0 5 0 0 0 2 2 0 1 97 16 Bunny Rabbit's Diary (etext16982)
0.281 6 3 6 4 0 0 22 2 8 5 19 6 Mary Olivier: a Life (etext9366)

Cheers,

Dr. Simon Ronald
CEO
The Leader in High Performance Reading

Level 2, 25 Gresham Street, Adelaide, SA, Australia, 5000
GPO Box 944, Adelaide SA 5001
Ph. +61 8 8410 2771
Fax. +61 8 8125 6679

1133 Broadway, Suite 706
New York, NY 10010
Ph: (646) 736 7673 (New York)
Ph: (415) 992 5412 (California)
Fax: (877) 731 4410 (toll free)

----- End forwarded message -----
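The whole-book list classification described in the forwarded message could be sketched roughly as follows. This is only an illustration of the idea (classify the book against a threshold, then treat each line as a sentence if it is a list). The "short line without closing punctuation" heuristic, the 0.5 threshold, and all function names are assumptions made for this sketch, not RocketReader's actual method.

```python
import re

def looks_like_list_line(line, max_words=8):
    """Hypothetical heuristic: a short line that does not end a sentence."""
    words = line.split()
    ends_sentence = line.rstrip().endswith((".", "!", "?", '"'))
    return 0 < len(words) <= max_words and not ends_sentence

def is_list_book(text, threshold=0.5):
    """Classify the entire book as list / not list (threshold is assumed)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return False
    hits = sum(looks_like_list_line(ln) for ln in lines)
    return hits / len(lines) >= threshold

def split_sentences(text, threshold=0.5):
    """For a list book each line is one sentence; otherwise split naively."""
    if is_list_book(text, threshold):
        return [ln.strip() for ln in text.splitlines() if ln.strip()]
    flat = " ".join(text.split())
    return [s for s in re.split(r"(?<=[.!?])\s+", flat) if s]

def words_per_sentence(text):
    """Average sentence length in words, one of the attributes listed above."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)
```

Without the list branch, a thesaurus-style book with no sentence punctuation would be read as one enormous sentence, which may be part of why list-based books looked "hard" in earlier runs.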

> If you have feedback on the results, or my idea for adding these scores as an element of the catalog search results, please chime in!
I think that a readability score on every book is a super good idea.

And, the n easiest/hardest would make good lists for the site (well, formatted as an HTML table, and perhaps including author as well as title).

And, there's probably no harm in including the sub-scores, though the overall is certainly most important for public consumption.

Cheers,

Scott S. Lawton
http://blogsearch.com/ - a starting point
http://ProductArchitect.com/ - consulting

On 6/25/06, Greg Newby <gbnewby@pglaf.org> wrote:
> Because we don't have a lot of subject cataloging, one value of this is that it does a good job of identifying children's eBooks (they tend to be "easy").
If the problem is that we don't have a lot of subject cataloging, provide more subject cataloging. We could copy the LoC cataloging for most of the catalog without too much work. If we're going to a Wiki-type thing, lists of children's books, mysteries, sci-fi, etc. will be made, and will be superior to this.
> This is also usable for people seeking to develop literacy or provide literacy instruction, by providing a way of reading something "harder" or "easier" as desired.
If the problem is literacy instruction, then we should work on a list of books for literacy, not rely on some tool that can't tell the difference between a 17th century children's book and a 20th century one, or how much dialect is used. Again, a Wiki-tool is perfect for this.
> If you have feedback on the results, or my idea for adding these scores as an element of the catalog search results, please chime in!
I think that these are somewhat interesting, but they are far from the most interesting factoids. I've been drooling over Amazon's Statistically Improbable Phrases, personally. I surely wouldn't make them as prominent as the search page; I don't think it's the most important thing that most people look at.
> 0.281 6 3 6 4 0 0 22 2 8 5 19 6 Mary Olivier: a Life (etext9366)
This is surely a mistake; the second sentence in the book is "When old Jenny shook it the wooden rings rattled on the pole and grey men with pointed heads and squat, bulging bodies came out of the folds on to the flat green ground." The numbers are too hard to decipher in this form to really try to understand why.

I also wonder about "profainwordsPerWords"? The profanity of words has little to do with readability; from that perspective they're just adjectives and nouns.
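One way to make an entry like this easier to decipher is to pair each percentile with its attribute name, in the order Dr. Ronald listed them. The following is a minimal Python sketch; the function name parse_entry, the dictionary layout, and the regular expression are illustrative assumptions about the line format as posted, not part of any RocketReader tool.

```python
import re

# Attribute names exactly as given in the forwarded message, in order.
ATTRIBUTES = [
    "bigword density",
    "short word density (-)",
    "wordsPerSentences",
    "syllablesPerWords",
    "profainwordsPerWords",
    "numbersPerWords",
    "mostCommon1000WordsPerWord (-)",
    "commascharsPerWords",
    "wordsPerParagraphs",
    "letterFrequencyDistributionError",
    "adjacentLetterPairsFrequencyDistributionError",
    "uniqueStemmedWordsPerWord",
]

# Assumed layout: overall score, 12 integer percentiles, title, (etextNNNN).
ENTRY_RE = re.compile(
    r"^(\d+\.\d+)((?:\s+\d+){12})\s+(.*?)\s*\(etext(\d+)\s*\)$"
)

def parse_entry(line):
    """Split one posted result line into labeled fields (hypothetical helper)."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognized entry: %r" % line)
    percentiles = [int(n) for n in match.group(2).split()]
    return {
        "overall": float(match.group(1)),
        "title": match.group(3),
        "etext": int(match.group(4)),
        "attributes": dict(zip(ATTRIBUTES, percentiles)),
    }

if __name__ == "__main__":
    entry = parse_entry(
        "0.281 6 3 6 4 0 0 22 2 8 5 19 6 Mary Olivier: a Life (etext9366)"
    )
    for name, value in entry["attributes"].items():
        print("%-48s %3d" % (name, value))
```

Printed this way, the Mary Olivier entry shows a wordsPerSentences percentile of 6, so the long sentence quoted above is evidently not typical of how the book was chunked into sentences.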

> If the problem is that we don't have a lot of subject cataloging, provide more subject cataloging. We could copy the LoC cataloging for most of the catalog without too much work.
> If the problem is literacy instruction, then we should work on a list of books for literacy, not rely on some tool that can't tell the difference between a 17th century children's book and a 20th century one, or how much dialect is used.
While I agree that it would not be worth adding readability scores if doing so took much away from these and other worthy goals, I really don't see it as either/or. Granting of course that adding scores will take some time away from other projects (and that it's not my personal time at stake here), I still see this as relatively high gain for relatively low investment.

There are lots and lots of cool things that could be done with the catalog. And, any relatively "easy" (i.e. automated) method of adding readability scores will inevitably miscategorize a whole bunch of books. But, I think the 'signal' will far outweigh the 'noise'.

Even in the context of the above, the scores would provide a great starting point, to be improved with manual cataloging and literacy labeling. Don't let the perfect stand in the way of the good.

Plus, I think the scores (and miscategorizations) are interesting in and of themselves for those of us interested in words and language.

Cheers,

Scott S. Lawton
http://blogsearch.com/ - a starting point
http://ProductArchitect.com/ - consulting
participants (3)
- David Starner
- Greg Newby
- Scott Lawton