
google is releasing some very cool data into the wild,
based on a corpus of a trillion words from web-pages.

one trillion.

if you know anything about previous projects in this vein,
you'll know that a corpus this big is totally unprecedented.

as google points out --
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you....
-- this data is quite useful across a wide spectrum of arenas,
including "statistical machine translation, speech recognition,
spelling correction, entity detection, information extraction..."

you'll note that we've discussed some of these topics before...

-bowerbird

p.s. this entry to google's official blog was posted by two people
from the "google machine translation team", for whatever that might mean...
i've suspected their research labs are _way_ out front on these things,
and the near future will see a _lot_ of advances coming from them...

On Mon, 7 Aug 2006 16:50:31 EDT, Bowerbird@aol.com wrote:

|google is releasing some very cool data into the wild,
|based on a corpus of a trillion words from web-pages.
|
|one trillion.
|
|if you know anything about previous projects in this vein,
|you'll know that a corpus this big is totally unprecedented.
|
|as google points out --
|>
|http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you....
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

<<< All Our N-gram are Belong to You >>>

ROTFLMAO

If this is the quality of their data, I doubt it will be much use. :-)))

--
Dave Fawthrop <dave hyphenologist co uk>
"Intelligent Design?" my knees say *not*.
"Intelligent Design?" my back says *not*.
More like "Incompetent design".
Sig (C) Copyright Public Domain