google is releasing some very cool data into the wild,
based on a corpus of a trillion words from web pages.
one trillion.
if you know anything about previous projects in this vein,
you'll know that a corpus this big is totally unprecedented.
as google points out --
> http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
-- this data is quite useful across a wide spectrum of arenas,
including "statistical machine translation, speech recognition,
spelling correction, entity detection, information extraction..."
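for a sense of what's in a release like this: the data is n-gram counts -- short word sequences paired with how often they occurred in the corpus. here's a minimal sketch (my own toy illustration, not google's actual pipeline or file format) of how such counts are computed from tokenized text:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams (as space-joined strings) in a token sequence."""
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )

# toy example: trigram counts over a tiny token stream
tokens = "all your base are belong to all your base".split()
counts = ngram_counts(tokens, 3)
print(counts["all your base"])  # this trigram appears twice
```

at google's scale the same idea runs over a trillion tokens, which is why these counts are useful as raw frequency statistics for things like translation and spelling correction.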
you'll note that we've discussed some of these topics before...
-bowerbird
p.s. this entry on google's research blog was posted by two people from
the "google machine translation team", for whatever that might mean...
i've suspected their research labs are _way_ out front on these things,
and the near future will see a _lot_ of advances coming from them...