google releases some cool data

7 Aug 2006

google is releasing some very cool data into the wild,
based on a corpus of a trillion words from web-pages.

one trillion.

if you know anything about previous projects in this vein,
you'll know that a corpus this big is totally unprecedented.

as google points out --
...
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you....
-- this data is quite useful across a wide spectrum of arenas,
including "statistical machine translation, speech recognition, 
spelling correction, entity detection, information extraction..."

you'll note that we've discussed some of these topics before...

-bowerbird

p.s.   this entry to google's official blog was posted by two people from
the "google machine translation team", for whatever that might mean...
i've suspected their research labs are _way_ out front on these things, 
and the near future will see a _lot_ of advances coming from them...

Bowerbird＠aol.com
