
google is releasing some very cool data into the wild, based on a corpus of one trillion words from web pages. one trillion. if you know anything about previous projects in this vein, you'll know that a corpus this big is totally unprecedented. as google points out --
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you....
-- this data is quite useful across a wide spectrum of arenas, including "statistical machine translation, speech recognition, spelling correction, entity detection, information extraction..." you'll note that we've discussed some of these topics before...

-bowerbird

p.s. this entry on google's official blog was posted by two people from the "google machine translation team", for whatever that might mean... i've suspected their research labs are _way_ out front on these things, and that the near future will see a _lot_ of advances coming from them...
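to make the spelling-correction angle concrete: big web n-gram releases are typically distributed as plain-text lines of "token sequence, tab, count", and even raw counts are enough to rank rival spellings. here's a minimal sketch of that idea -- the sample lines and counts below are made-up stand-ins, not entries from google's actual data...

```python
# minimal sketch: using n-gram counts for spelling correction.
# the lines below are hypothetical stand-ins for a corpus
# distributed as "token sequence <TAB> count" entries.

SAMPLE_LINES = [
    "statistical machine translation\t50000",
    "statistical machine translatoin\t12",
    "speech recognition software\t34000",
]

def load_counts(lines):
    """parse 'ngram<TAB>count' lines into a dict of counts."""
    counts = {}
    for line in lines:
        ngram, count = line.rsplit("\t", 1)
        counts[ngram] = int(count)
    return counts

def pick_spelling(counts, candidates):
    """rank candidate phrases by corpus frequency; the most
    frequent variant is taken as the intended spelling."""
    return max(candidates, key=lambda c: counts.get(c, 0))

counts = load_counts(SAMPLE_LINES)
best = pick_spelling(counts, [
    "statistical machine translation",
    "statistical machine translatoin",
])
print(best)  # the higher-count variant wins
```

the same count-lookup trick generalizes: condition on the surrounding words and the counts become a crude language model, which is the core of the statistical machine translation and speech recognition uses google mentions...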