Creating a list of unknown words from a text document

24 Apr 2012

I recently donated the etext "A Study Of The Bhagavata Purana" to PG and my
next project is to reissue this book in print using CreateSpace.  I'm using
OpenOffice to do the work and it is up to the job.  However, I see that OO
has a feature that could make my CreateSpace book better than the
original: I can create an index, which the original book did not have and
really could use.

Before I can do this I need to identify the words that I want in my index.
It occurred to me that an easy way to get this list of words would be to
identify all the words in the book that are NOT in an English dictionary,
then sort, eliminate duplicates, and delete names that are too obscure to
care about.  My first thought was to use gutspell to generate a raw list
that I could run through a series of filters to get what I want.

The problem with gutspell is that it only works with plain ASCII text, not
UTF-8.  If it could deal with UTF-8 it would be a good start.

So the question for the group is, do you know of a tool that can read
through a UTF-8 text file and make a list of all the words that are not
found in an English dictionary?  If there is extra wordage like gutspell
gives I can filter that out.

Using the ASCII version of the file with gutspell is not an answer.  I need
the accents in the word list.
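
To make the request concrete, here is a minimal Python sketch of the kind of
filter described above.  It is only an illustration, not an existing tool:
the function names and the file names in the usage comment are hypothetical,
and it assumes the dictionary is a plain UTF-8 word list with one word per
line (many Linux and Mac systems ship one at /usr/share/dict/words, but that
path is an assumption).

```python
import re
import unicodedata

# A "word" here is a run of Unicode letters, so precomposed accented
# characters such as "ā" stay part of the word. Digits, underscores,
# apostrophes and hyphens all end a word.
WORD_RE = re.compile(r"[^\W\d_]+")


def load_dictionary(path):
    """Read a UTF-8 word list (one word per line) into a lowercase set."""
    with open(path, encoding="utf-8") as handle:
        return {
            unicodedata.normalize("NFC", line.strip()).lower()
            for line in handle
            if line.strip()
        }


def unknown_words(text, dictionary):
    """Return the sorted, de-duplicated words of text not found in dictionary.

    The text is normalized to NFC first so that a letter followed by a
    combining accent is treated as one precomposed letter where Unicode
    has one. Accents with no precomposed form would still split a word.
    """
    text = unicodedata.normalize("NFC", text)
    found = set()
    for match in WORD_RE.finditer(text):
        word = match.group(0)
        if word.lower() not in dictionary:
            found.add(word)
    # Plain code-point order: capitalized words sort before lowercase ones.
    return sorted(found)


# Hypothetical usage (file names are placeholders):
#   words = load_dictionary("/usr/share/dict/words")
#   with open("book-utf8.txt", encoding="utf-8") as book:
#       for word in unknown_words(book.read(), words):
#           print(word)
```

The output would still need the manual pass mentioned earlier, since
inflected forms and possessive fragments that are missing from the word list
will show up alongside the names.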

I have Linux, Windows, and a Mac so any tool on any platform will be
considered.

Thanks,

James Simmons
