Creating a list of unknown words from a text document

I recently donated the etext "A Study Of The Bhagavata Purana" to PG, and my next project is to reissue this book in print using CreateSpace. I'm using OpenOffice to do the work and it is up to the job. However, I see that OO has a feature that could make my CreateSpace book better than the original--I can create an index, which the original book did not have and really could use.

Before I can do this I need to identify the words that I want in my index. It occurred to me that an easy way to get this list would be to identify all the words in the book that are NOT in an English dictionary, then sort, eliminate duplicates, and delete names that are too obscure to care about.

My first thought was to use gutspell to generate a list that I could run through a series of filters to get what I want. The problem with gutspell is that it only works with plain ASCII text, not UTF-8. If it could deal with UTF-8 it would be a good start.

So the question for the group is: do you know of a tool that can read through a UTF-8 text file and make a list of all the words that are not found in an English dictionary? If there is extra wordage like gutspell gives, I can filter that out. Using the ASCII version of the file with gutspell is not an answer; I need the accents in the word list.

I have Linux, Windows, and a Mac, so any tool on any platform will be considered.

Thanks,

James Simmons

"James" == James Simmons <nicestep@gmail.com> writes:
James> I recently donated the etext "A Study Of The Bhagavata
James> Purana" to PG and my next project is to reissue this book
James> in print using CreateSpace. I'm using OpenOffice to do the
James> work and it is up to the job. However, I see that OO has a
James> feature that could make my CreateSpace book better than the
James> original--I can create an index, which the original book
James> did not have and really could use.

James> Before I can do this I need to identify the words that I
James> want in my index. It occurred to me that an easy way to
James> get this list of words would be to identify all the words
James> in the book that are NOT in an English dictionary, then
James> sort, eliminate duplicates, and delete names that are too
James> obscure to care about. My first thought to create such a
James> list was to use gutspell to generate a list that I could
James> run through a series of filters to get what I want.

...............

James> I have Linux, Windows, and a Mac so any tool on any
James> platform will be considered.

James> Thanks,
James> James Simmons

You can use a spell-checker in list mode. Since you have non-ISO-Latin-1 characters, aspell or ispell probably wouldn't work; just be sure that you have a spell-checker installed (aspell, ispell, myspell, hunspell, etc.) and enchant (a front-end to multiple spell-checkers).

Then, in an xterm, issue the command

enchant -l -d en myfile.txt | sort | uniq -c | sort -nr > myfile-words.txt

and in myfile-words.txt you'll get all the non-English words with the corresponding number of occurrences, in decreasing order.

It will work on Linux and Mac, and maybe on Windows too.

If you want to list the English words too (that is a very bad idea...) you can choose any other dictionary that you have, possibly one using another alphabet. I used Oriya (-d or) until I built an empty aspell dictionary for language undetermined (und) just for that (the smaller the dictionary, the faster it will be).

Carlo
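For readers who cannot install enchant (Carlo only says "maybe" for Windows), the same idea can be sketched in Python. This is an illustration under stated assumptions, not what Carlo ran: it checks each word against a plain word list, one word per line, instead of a real spell-checker, so it knows nothing about plurals or other inflected forms and will report more "unknown" words than enchant would. The function names and the file paths are hypothetical.

```python
import re
import unicodedata
from collections import Counter

# One or more Unicode letters; digits and underscores are excluded.
WORD_RE = re.compile(r"[^\W\d_]+")


def load_word_list(path):
    """Read a UTF-8 word list (one word per line) into a lowercase set."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def unknown_words(text, known_words):
    """Count the words in `text` whose lowercase form is not in `known_words`."""
    # NFC composes letter + combining accent pairs into single characters
    # where a precomposed form exists, so accented words stay in one piece.
    text = unicodedata.normalize("NFC", text)
    counts = Counter()
    for word in WORD_RE.findall(text):
        if word.lower() not in known_words:
            counts[word] += 1
    return counts


def print_unknown_words(text_path, word_list_path):
    """Print 'count word' lines, most frequent first, like `uniq -c | sort -nr`."""
    known = load_word_list(word_list_path)
    with open(text_path, encoding="utf-8") as handle:
        counts = unknown_words(handle.read(), known)
    for word, count in counts.most_common():
        print(f"{count:7d} {word}")
```

A call such as print_unknown_words("myfile.txt", "/usr/share/dict/words") would produce output in the same shape as myfile-words.txt. The word-list path is an assumption; many Linux and Mac systems ship such a file, but it is not guaranteed to exist.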

Carlo,

This is an excellent suggestion, much better than I had hoped for because it helps me figure out which words and names are worth indexing. I tried the command and it worked the first time. UTF-8 characters came through just fine too!

Thanks,

James Simmons

On Wed, Apr 25, 2012 at 12:35 AM, Carlo Traverso <traverso@posso.dm.unipi.it> wrote:

> You can use a spell-checker in list mode.
> ...............
> enchant -l -d en myfile.txt | sort | uniq -c | sort -nr > myfile-words.txt
> ...............

_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d
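James notes that the occurrence counts help decide which words are worth indexing. One way to act on that, sketched here in Python, is to read the "count word" lines that the pipeline writes to myfile-words.txt and keep only words at or above a threshold. The function name and the default threshold of 3 are assumptions for illustration; nothing in the thread suggests a particular cut-off.

```python
def frequent_words(lines, min_count=3):
    """Yield (count, word) pairs from `uniq -c` style lines.

    Each line is expected to look like '     12 word'. Lines that do not
    start with a number, or that have no word after it, are skipped.
    """
    for line in lines:
        parts = line.split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        count, word = int(parts[0]), parts[1].strip()
        if count >= min_count:
            yield count, word
```

To use it on the real file, open myfile-words.txt with encoding="utf-8" and pass the file object as `lines`; the order of the input, already sorted by decreasing count, is preserved.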