
I recently donated the etext "A Study Of The Bhagavata Purana" to PG, and my next project is to reissue this book in print using CreateSpace. I'm using OpenOffice to do the work, and it is up to the job. However, I see that OO has a feature that could make my CreateSpace book better than the original--I can create an index, which the original book did not have and really could use.

Before I can do this I need to identify the words I want in the index. It occurred to me that an easy way to get this list would be to find all the words in the book that are NOT in an English dictionary, then sort the list, eliminate duplicates, and delete any names too obscure to care about. My first thought was to use gutspell to generate a list that I could run through a series of filters to get what I want. The problem is that gutspell only works with plain ASCII text, not UTF-8; if it could handle UTF-8 it would be a good start.

So the question for the group is: do you know of a tool that can read through a UTF-8 text file and make a list of all the words that are not found in an English dictionary? If it produces extra wordage like gutspell does, I can filter that out. Using the ASCII version of the file with gutspell is not an answer--I need the accents in the word list. I have Linux, Windows, and a Mac, so a tool on any platform will be considered.

Thanks,

James Simmons
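
To make clear what kind of output I'm after, here is a rough Python sketch of the filtering step: tokenize the UTF-8 text into words (keeping accented letters intact), drop everything found in a dictionary word list, and print the sorted unique leftovers. The tiny in-code dictionary here is just for illustration; a real run would load something like /usr/share/dict/words instead.

```python
import re

def unknown_words(text, dictionary):
    """Return the sorted unique words in `text` not present in `dictionary`.

    The regex matches runs of Unicode letters, so accented words such as
    "Bhāgavata" survive as single tokens. Comparison is case-insensitive.
    """
    words = re.findall(r"[^\W\d_]+", text)  # Unicode letters only
    known = {w.lower() for w in dictionary}
    return sorted({w for w in words if w.lower() not in known})

# Miniature stand-in dictionary, for illustration only; a real run would
# read a full English word list from disk.
dictionary = {"a", "study", "of", "the"}
text = "A Study of the Bhāgavata Purāṇa"
print(unknown_words(text, dictionary))
# → ['Bhāgavata', 'Purāṇa']
```

Any existing tool that produces roughly this list from a UTF-8 file would save me writing the above myself.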