Re: [gutvol-d] Language free version of guiguts?

23 Jan 2006

On Sun, 22 Jan 2006 16:39:44 -0500, Jim Tinsley <jtinsley@pobox.com>
wrote:

|On Sun, Jan 22, 2006 at 09:26:24PM +0000, Dave Fawthrop wrote:
|>
|>IMO, with the advent of huge memory in even entry-level computers, all
|>tests should be in the one program, and the different language versions
|>should be handled by simple switches/radio buttons, as with the various
|>sorts of angle brackets ATM. OK, the switches will inevitably become
|>complex and difficult.
|
|You're right, of course.
|
|I _think_ you might even go one better. I've used the occurrence
|of 50 instances of something recognizable as the English word "the"
|as an indicator that a file is (at least partly) in English, and
|a high number of certain types of characters to suggest that the
|file is in ISO-8859 or UTF-8, and a high number of strings within
|<> to indicate some flavor of *ML.
|
|I suspect that a similar technique might be useful in multilingual
|checkers in general, and if I wrote one I would certainly consider it.

There has been a lot of academic work on detecting language by counting
frequently used short words. Each language has its own set of
frequently used short words. IIRC it is not particularly accurate, and
it naturally falls down on text in two or more languages. I have a book in
Yorkshire and English on my desk ATM.
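
For illustration only, the word-counting idea might be sketched roughly
as below. This is a minimal sketch, not anything taken from guiguts or
any existing checker: the word lists, the function name guess_language
and the min_hits threshold are all made up for the example, and a real
tool would need much larger lists derived from actual corpora.

```python
import re
from collections import Counter

# Hypothetical, hand-picked lists of frequent short words per language.
# These are assumptions for the sketch, not an established word list.
COMMON_WORDS = {
    "english": {"the", "and", "of", "to", "in", "is", "that", "it"},
    "french": {"le", "la", "les", "et", "de", "un", "une", "est"},
    "german": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu"},
}


def guess_language(text, min_hits=3):
    """Return (best_guess_or_None, scores) from frequent-short-word counts.

    min_hits is an arbitrary cut-off: below it we decline to guess.
    """
    # Split into runs of letters (Unicode-aware), lower-cased.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(words)
    # Score each language by how often its frequent words occur.
    scores = {
        lang: sum(counts[w] for w in vocab)
        for lang, vocab in COMMON_WORDS.items()
    }
    best = max(scores, key=scores.get)
    return (best if scores[best] >= min_hits else None), scores


if __name__ == "__main__":
    guess, scores = guess_language("The cat sat on the mat and it is in the house")
    print(guess, scores)
```

The returned scores show the weakness as well: on a text mixing two
languages both scores come out high, and the single "best" guess hides
that, which is one reason simply asking the user may be safer.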

IMO asking the user which language he/she is using would be easier and
more reliable.
-- 
Dave Fawthrop <dave hyphenologist co uk>
17,000 free e-books at Project Gutenberg! http://www.gutenberg.net
For Yorkshire Dialect go to www.hyphenologist.co.uk/songs/