
On Sun, 22 Jan 2006 16:39:44 -0500, Jim Tinsley <jtinsley@pobox.com> wrote:

|On Sun, Jan 22, 2006 at 09:26:24PM +0000, Dave Fawthrop wrote:
|>
|>IMO with the advent of huge memory in even the entry level computers, All
|>tests should be in the one program, and the different language versions
|>should be handled by simple switches/radio buttons, as with the various
|>sorts of angle brackets ATM. OK the switches will inevitably become
|>complex and difficult.
|
|You're right, of course.
|
|I _think_ you might even go one better. I've used the occurrence
|of 50 instances of something recognizable as the English word "the"
|as an indicator that a file is (at least partly) in English, and
|a high number of certain types of characters to suggest that the
|file is in ISO-8859 or UTF-8, and a high number of strings within
|<> to indicate some flavor of *ML.
|
|I suspect that a similar technique might be useful in multilingual
|checkers in general, and if I wrote one I would certainly consider it.

There has been a lot of academic work on detecting language by counting
frequently used short words. Each language has its own set of frequently
used short words. IIRC the method is not particularly accurate, and it
naturally falls down on text in two or more languages. I have a book in
Yorkshire and English on my desk ATM.

IMO asking the user which language he/she is using would be easier and
more reliable.

-- 
Dave Fawthrop <dave hyphenologist co uk>
17,000 free e-books at Project Gutenberg! http://www.gutenberg.net
For Yorkshire Dialect go to www.hyphenologist.co.uk/songs/
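
For anyone who wants to try the counting heuristics Jim describes in the quoted text, here is a rough Python sketch. It is not his actual code, which is not shown in this thread. The figure of 50 for "the" comes from his message; the tag threshold of 20, the function names, and the encoding check (try to decode as UTF-8, otherwise guess ISO-8859) are all assumptions made for illustration.

```python
import re

THE_THRESHOLD = 50   # figure quoted in the thread
TAG_THRESHOLD = 20   # hypothetical: the thread only says "a high number"


def looks_english(text, threshold=THE_THRESHOLD):
    """True if the standalone word "the" occurs at least `threshold` times."""
    hits = re.findall(r"\bthe\b", text, flags=re.IGNORECASE)
    return len(hits) >= threshold


def looks_like_markup(text, threshold=TAG_THRESHOLD):
    """True if at least `threshold` strings enclosed in <> are found."""
    tags = re.findall(r"<[^<>]+>", text)
    return len(tags) >= threshold


def guess_encoding(data):
    """Crude guess from raw bytes: ascii, utf-8, or (by elimination) iso-8859."""
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption: anything with high bytes that is not valid UTF-8
        # is treated as some ISO-8859 variant.
        return "iso-8859"
```

Note that a fixed count of 50 only says that a file is at least partly in English. A mixed Yorkshire/English book would pass the test just as easily, which is the weakness mentioned above.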