Here is a little demo video of some work I did with Sajit at MIT CSAIL this summer. The computer watches us play with a ball and produces live commentary. For now the detection of actions (like give, drop, move) is hand-coded; the next step would be to learn them from examples. The step after that is "tell and show", i.e. going from words to pictures. This would complete the imagination-perception loop that may underlie much understanding and problem solving.
I think one of the coolest things about the current implementation is how the computer starts a sentence and cuts it in half to say something more important. There are always tons of possible things to say and possible words to say them with, and a similar competition must be going on in our heads.
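To give a flavor of that competition, here is a minimal sketch (not the actual system; all names are hypothetical) of how a priority contest among candidate utterances can cut a sentence short: the commentator speaks word by word, and a more important event steals the floor mid-sentence.

```python
# Hypothetical sketch of priority-based interruptible commentary.
# Not the actual CSAIL implementation; class and method names are made up.

class Commentator:
    def __init__(self):
        self.spoken = []     # words actually uttered, in order
        self.current = None  # (priority, remaining words) of the sentence in progress

    def say(self, priority, sentence):
        """Propose a sentence; it wins the floor only if it outranks the current one."""
        words = sentence.split()
        if self.current is None or priority > self.current[0]:
            self.current = (priority, words)

    def tick(self):
        """Utter one word of the current sentence, if any."""
        if self.current:
            priority, words = self.current
            self.spoken.append(words.pop(0))
            if not words:
                self.current = None  # sentence finished

commentator = Commentator()
commentator.say(1, "the red ball is rolling slowly toward the wall")
for _ in range(3):
    commentator.tick()                 # gets as far as "the red ball"
commentator.say(5, "he dropped it!")   # higher priority: cuts the sentence in half
while commentator.current:
    commentator.tick()

print(" ".join(commentator.spoken))
# → the red ball he dropped it!
```

The interesting design choice is that the contest runs continuously, not just at sentence boundaries, which is what produces the abandoned half-sentences the post describes.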