Re: [gutvol-d] Creating a list of unknown words from a text document

james said:
> I need to identify the words that I want in my index.
i handled this in my "grapes" series in september and october.
http://lists.pglaf.org/mailman/private/gutvol-d/2011-September/date.html
http://lists.pglaf.org/mailman/private/gutvol-d/2011-October/date.html
you'll find working python code, and pointers to dictionaries...

it might be a mistake to eliminate all dictionary words though, because you'll probably want to have a lot of them in your index. i'd first try eliminating common words (i point to a list of them), which account for a surprising percentage of the bulk of a book, then see how useful the list is at that point. my guess is that eliminating words from there will be a whole lot easier than trying to summon up the remaining good index words later on...
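
A minimal sketch of the "eliminate common words first" pass, assuming Python 3. This is not the code from the "grapes" threads linked above; the names `COMMON_WORDS` and `candidate_index_words` are hypothetical, and the tiny stop-word set is only a stand-in for a real common-words list.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical stand-in: a real run would load a much longer common-words list.
COMMON_WORDS = {"the", "of", "and", "a", "to", "in", "is", "it", "that", "was"}

# A word is a run of letters (accented letters included), optionally joined
# by internal apostrophes or hyphens. Digits and underscores are excluded.
WORD_RE = re.compile(r"[^\W\d_]+(?:['-][^\W\d_]+)*")


def candidate_index_words(text, common=COMMON_WORDS):
    """Count every word in text, then drop the ones on the common list."""
    # NFC keeps a letter and its accent together as one code point,
    # so the regex does not split a word at a combining mark.
    text = unicodedata.normalize("NFC", text)
    counts = Counter(w.lower() for w in WORD_RE.findall(text))
    return {w: n for w, n in counts.items() if w not in common}


if __name__ == "__main__":
    sample = "The cat sat in the caf\u00e9; the cat was asleep."
    for word, n in sorted(candidate_index_words(sample).items()):
        print(n, word)
```

Whatever survives this pass is the list to prune by hand; dictionary words stay in unless they are on the common list.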
> Using the ascii version of the file with gutspell is not an answer. I need the accents in the word list.
a first pass might be to extract all the words that have accents, because those are words which you'll definitely want to index... (even if they only occur once, consider putting 'em in the index, perhaps in their own special section, because that's useful info.) after that, you'll have only low-bit (plain 7-bit ascii) words left, so no need for utf8.

-bowerbird
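
The accent-first pass described above could be sketched roughly like this, assuming Python 3 and text already decoded to Unicode. The function name `split_by_accents` is hypothetical, and "accented" is taken here to mean "contains any character outside 7-bit ascii", which is an assumption rather than anything stated in the thread.

```python
import re
import unicodedata

# Runs of letters only; accented letters count as letters.
WORD_RE = re.compile(r"[^\W\d_]+")


def split_by_accents(text):
    """Return (accented, plain): two sorted lists of the distinct words."""
    # NFC turns "e" + combining accent into a single accented letter,
    # so decomposed input is not split in the middle of a word.
    text = unicodedata.normalize("NFC", text)
    words = set(WORD_RE.findall(text))
    # Assumption: any code point above 127 marks the word as accented.
    accented = sorted(w for w in words if any(ord(c) > 127 for c in w))
    plain = sorted(words.difference(accented))
    return accented, plain


if __name__ == "__main__":
    sample = "Zo\u00eb thought the caf\u00e9 was na\u00efve about the cat."
    accented, plain = split_by_accents(sample)
    print("accented:", accented)
    print("plain:", plain)
```

The accented list can go straight into its own section of the index, and the plain list is all ascii, so it can be fed to ascii-only tools afterwards.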