Carlo,

This is an excellent suggestion, much better than I had hoped for because it helps me figure out which words and names are worth indexing.  I tried the command and it worked the first time.  UTF-8 characters came through just fine too!

Thanks,

James Simmons


On Wed, Apr 25, 2012 at 12:35 AM, Carlo Traverso <traverso@posso.dm.unipi.it> wrote:
>>>>> "James" == James Simmons <nicestep@gmail.com> writes:


   James> I recently donated the etext "A Study Of The Bhagavata
   James> Purana" to PG and my next project is to reissue this book
   James> in print using CreateSpace.  I'm using OpenOffice to do the
   James> work and it is up to the job.  However, I see that OO has a
   James> feature that could make my CreateSpace book better than the
   James> original--I can create an index, which the original book
   James> did not have and really could use.

   James> Before I can do this I need to identify the words that I
   James> want in my index.  It occurred to me that an easy way to
   James> get this list of words would be to identify all the words
   James> in the book that are NOT in an English dictionary, then
   James> sort, eliminate duplicates, and delete names that are too
   James> obscure to care about.  My first thought to create such a
   James> list was to use gutspell to generate a list that I could
   James> run through a series of filters to get what I want.

...............

   James> I have Linux, Windows, and a Mac so any tool on any
   James> platform will be considered.

   James> Thanks,

   James> James Simmons


You can use a spell-checker in list mode. Since your text has non
ISO-Latin-1 characters, aspell or ispell probably wouldn't work
directly; just be sure that you have a spell-checker installed (aspell,
ispell, myspell, hunspell, etc.) and enchant (a front-end to multiple
spell-checkers).

Then, in an xterm, issue the command

enchant -l -d en myfile.txt | sort | uniq -c | sort -nr > myfile-words.txt

and in myfile-words.txt you'll get all the non-English words with the
corresponding number of occurrences, in decreasing order.
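
As a possible follow-up for the "delete names that are too obscure"
step, the counts in myfile-words.txt can be used as a rough filter.
This is only a sketch: the function name index_candidates and the
default threshold of 3 are hypothetical, and it assumes each line has
the "count word" layout that uniq -c produces.

```shell
# index_candidates FILE [MIN]
# Hypothetical helper: print the words from a "count word" list
# (as written by uniq -c) that occur at least MIN times (default 3),
# sorted alphabetically with duplicates removed.
index_candidates() {
    awk -v min="${2:-3}" '$1 >= min { print $2 }' "$1" | sort -u
}
```

For example, "index_candidates myfile-words.txt 5" would print only the
words flagged five or more times. A frequency cutoff is a crude proxy
for importance, so the resulting list would still need a manual pass.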

It will work on Linux and Mac, and maybe on Windows too.

If you want to list the English words too (that is a very bad idea...)
you can choose any other dictionary that you have, possibly one using
another alphabet. I used Oriya (-d or) until I built an empty aspell
dictionary for language undetermined (und) just for that (the smaller
the dictionary, the faster it will be).

Carlo
_______________________________________________
gutvol-d mailing list
gutvol-d@lists.pglaf.org
http://lists.pglaf.org/mailman/listinfo/gutvol-d