>> Bill Dolan: So we're lucky to host Ray Mooney today. Ray is coming to us from the University of Texas, where he has been for 23 years now, graduating the same number of students and having an illustrious career that covers machine learning, natural language processing, and data mining. He has a bunch of Best Paper awards, he's the president of IMLS, and he comes from Illinois originally. I know lots of anecdotes about Ray that I would not dare tell here, but feel free to ask me; I can tell you about six years of very much suffering at Ray's hands. He'll tell us about some exciting recent stuff he's done in learning and grounding. >> Raymond Mooney: Okay. Thanks for all showing up. I know that this is sort of an awkward time; I'm skipping out on the tutorials at ICML. I want to talk about some of my recent work on learning concepts. Maybe some of you know my student David, who was here last summer working with Bill, and who just gave, I think both Bill and I agree, quite a nice talk at ACL on that work. He's actually published two papers: a workshop paper at the HCOMP Human Computation Workshop, and one at AAAI. But this is what he does for a living, and also my other Ph.D. student, Joohyun Kim, who is Korean; that's where all my Korean data comes from that you'll see. Okay. So, again, hopefully most of this is just blah, blah, blah for people here. In NLP things have really moved towards learning. You know, people tried to build systems manually in the '60s, '70s, '80s, and now almost all NLP systems are constructed by and large using machine learning methods trained on large corpora: syntactic parsing with things like the Penn Treebank, word sense disambiguation, and every area has its own little annotated corpora, whether you're doing semantic roles or coref or whatever. So the task I've worked on a lot, for actually quite a long time, more than 15 years certainly, is what I call semantic parsing.
Again, people use this term in slightly different ways. I have a particular interpretation, I think, of this term "semantic parsing," which is mapping a natural language sentence into what I call here a detailed formal semantic representation, what a linguist would most likely call a logical form. Since all of our stuff isn't strictly logical, I just use this very generic term, meaning representation, and sometimes this not very common acronym, MR. And for a lot of these applications, what I'm interested in is mapping language directly into some computer language that's actually executable. So one of the tasks we looked at, starting back in the mid-'90s -- and I actually saw papers this summer still using this data, which is sort of hopelessly outdated in my mind; actually I don't work with it anymore -- is answering questions about U.S. geography. I can tell you the story about why we used this data; it's a bit long. So you have questions in English: How many states does the Mississippi run through? With semantic parsing, I want to map that into some very precise logical form, a Prolog-like sort of thing, that I can then give to a database as a command -- and the answer happens to be 10. Another task that we worked on: I have a UT colleague, Peter Stone, who was one of the founders of RoboCup. He got me sucked into the RoboCup vortex; we'll see more of that later. This is about the coaching track. RoboCup is like 25 tracks -- it's a huge tent, a 20-ring circus these days -- but they actually have a track called coaching, where you have simulated players in this very simple 2D simulated soccer world, and you can give them instructions in a very formal way. So we'd like to take an English sentence like this: if the ball is in our penalty area, then all our players except player four should stay in our half. In the current RoboCup Coach competition, you have to give that in this formal form.
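To make the NL-to-MR idea concrete, here is a toy sketch. The nested-tuple MR syntax and the tiny river table are invented for illustration; the real GeoQuery meaning representations are Prolog-style queries, and the real database is much larger.

```python
# Toy illustration of an NL -> MR pair and of executing the MR against a
# database. The MR syntax and the mini-database are invented for this sketch.
nl = "How many states does the Mississippi run through?"
mr = ("count", ("state", ("traverse", "mississippi")))

# Hypothetical mini-database: river -> states it traverses (incomplete!).
TRAVERSE = {"mississippi": ["minnesota", "wisconsin", "iowa", "illinois"]}

def execute(expr):
    op, arg = expr
    if op == "count":
        return len(execute(arg))
    if op == "state":
        return execute(arg)      # restrict to states (trivial in this toy)
    if op == "traverse":
        return TRAVERSE[arg]

print(execute(mr))  # 4 on this toy table; the real answer is 10
```

The point of the executable MR is exactly this last step: once the parser produces the formal query, answering the question is just evaluation against the database.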
For those who have been around long enough, it looks like Lisp, and it's sort of an awkward formal language. So we ask: can't we just map a natural language sentence? And then you can actually give that formal expression to the simulator, and the players in the game will actually obey that instruction. So I have done a lot of work over the past 15, 17 years or so on learning semantic parsers. You want to just give the system a bunch of pairs of natural language paired with logical forms, just like I showed. That goes into a system which automatically produces a semantic parser. And in most of the work I've done, you use no prior knowledge of the language -- no parsers, no lexicon, nothing. It has to learn absolutely everything from the NL-MR pairs. And then you can hopefully give new sentences to the parser, and it will produce useful, correct meaning representations for them. So as I said, our first paper on this, which introduced this GeoQuery data, was published with my student John Zelle back at AAAI in '96, and I actively worked in this area, on many papers, up through an ACL paper a couple of years ago. That work did actually use an existing syntactic parser, but it's the only one of my systems that uses existing NLP resources; the rest have to learn everything from the NL-MR pairs. For the decade I worked on this, no one gave a damn about it. But suddenly there are more people working on it, thanks to Luke Zettlemoyer, who I assume you know because he's across the Sound, or what do I want to say. And he's actually in the et al. in these papers; he's the most active. Percy Liang, if you saw his paper at ACL, has been working in the area recently as well. Dan Roth had a student give a talk at the symposium yesterday on it. Also, who else should I mention? Mark Steedman mentioned my work in this area. So suddenly it's actually more popular, which means it's time for me to get out.
And actually a lot of current work doesn't use the standard supervised NL-MR pairs: Dan and Percy's work, and Ming-Wei, the student who gave the talk. So this is just a little slide saying that constructing these annotated corpora, particularly for semantic parsing, is very expensive and time-consuming. I have this sort of catch phrase: to a large extent, NLP has come to replace the burden of software engineering with the burden of corpus construction. So what I'd like to do is build a system that can learn a language more like a child learns language, by experiencing it in context. There's a growing amount of work on unsupervised learning, where you just try to learn from raw corpora without any annotation. That's very difficult, and I think all the information you need is really not in there. You can do a lot of interesting things with unsupervised language learning from just the text. But, sort of like an experiment the U.S. government wouldn't let you run: try to raise your child by just throwing the New York Times and Web pages at them and turning the radio on -- they're not going to learn language, right? And I think that's not just because of some human psychological limitation; it's because the information you really need to understand language in any deep sense isn't in the signal by itself. It's the relationship to what's happening in the world when that sentence is being said. So I think the natural way to learn language is to perceive language in the context of its use in the physical and social world, and that requires inferring some notion of the meaning of the sentence from the perceptual context in which that sentence was uttered. So this connects to this general area that I think maybe the best name for is language grounding.
That term, as far as I know, was first introduced by Harnad, who published a paper in Behavioral and Brain Sciences in 1990 on symbol grounding, where he argues the main problem with AI is that you can't have these simple symbols; they have to be somehow connected to the world, to perception. And a lot of obvious words are sort of grounded in the perception of the world: colors, objects like cup and ball, and verbs like hit and run. But, of course, there are a lot of terms in natural language that are very abstract and don't obviously relate to actual objects and events that you perceive in the world. But even then, a lot of those are metaphorical uses of language that originally was grounded, right? We use words like up and down and over and in, in all sorts of ways, but their initial meaning is a physical meaning, and I think a lot of the understanding and semantics you attach to those words does come from their original grounded interpretation. Of course, there's a very famous book back from the '70s sometime, I think, by Lakoff and Johnson, called Metaphors We Live By, which argues that a lot of language is used metaphorically, and a lot of that is metaphors to things that are actually grounded. I can say things like "I put my ideas into words": I'm treating ideas like objects and words like containers there, and I'm using sort of the semantics of that physically grounded language, but, of course, it's a metaphorical use. So I think most NLP work represents meaning without any connection to perception, circularly defining terms in terms of other words, with no firm grounding in actual perceptual reality. One little thing I like to do is look up words in the dictionary and see how circular they are. You look up sleep in WordNet, and it says: asleep, to be asleep, a state of sleep. Thank you, WordNet, I've learned a lot about sleep. But you and I don't have to be told what sleep is; we do it every night, hopefully, unless you're a grad student.
So that's sort of the high-level motivation. To try to make some concrete progress on that, I'm going to talk today about a couple of, I totally admit, very simple sort of toy cases that we've been looking at to try to move in this direction. The first one I looked at was learning to be a sportscaster. Here I want to learn from realistic data of natural language used in a representative context, where I can actually infer, or at least have a reasonable chance of inferring, the meanings of what these sentences are referring to, but I don't want to get bogged down in all the details of computer perception such as speech and vision. So what we've worked on is learning from textually annotated traces of activity in a simulated environment. In particular, taking this RoboCup simulator game, typing in textual sportscasting sort of commentary, and seeing if the system can learn to sportscast with no prior knowledge of the language whatsoever, just reading and processing the text that's been given to it and associating it with the actual events that are happening in the game. So what we do is we have four games in our dataset. We train it on three and test it on a fourth. It has to learn to sportscast the fourth game on its own after just watching three games being sportscast. So the overall system here is the simulator. We have a very simple sort of rule-based system that looks at the details of the simulator, which basically gives you the position and velocity of all the players and the direction they're facing; that little mark is sort of where they look. They do have a cone of vision. It's actually not a bad simulator; it simulates friction and vision and all these sorts of things, but it's a very simple 2D thing. It pulls out a set of perceived facts in a logical form about events that are happening. And that's not influenced by language at all.
I assume a lot of us here went to the very nice Whorfian talk at ACL, which I thought was a great talk. There's no Whorfian stuff going on here, because the facts are not perceived through language at all; they're just assumed to come out. Then we have the sportscaster saying something, and that goes into a grounded language learner. One of the semantic parsing systems I built is based on a probabilistic version of synchronous context-free grammar. One of the reasons I like that work -- a system called WASPER, best paper at ACL in Prague -- is that it supports both language generation and semantic parsing; language generation has to have a language model in addition. So we're interested in the language generation side of things. So you can play a new game, do the so-called simulated perception, pull out a bunch of facts, and the language generator can turn that into language. Okay. So to give you an idea of what this looks like: do we have any Korean speakers in the audience? Oh, boy, a couple. So you are no longer valid subjects for this part of the talk. My Korean student, Joohyun, was nice enough to comment this game. So imagine, for the non-Korean speakers, I showed you three of these games, and then I gave you a fourth game and you have to sportscast it in Korean. It takes a little while before it starts talking. The text-to-speech is just off-the-shelf. [Korean] Hopefully the Koreans can understand this. [Korean] Very nice script, Korean; a king designed it in the 1600s or something. [Korean] So, think: could you do what my system does, which is learn to sportscast in Korean? After I showed you three games of this you'd be bored to tears, but I don't know if you could do it. So what the system really deals with -- again, I don't do speech or vision or anything -- is this set of sentences, and then a string of these sort of logical terms.
Now, it's not using anything about the orthography of these. So really it's more like this: just a bunch of empty symbols as far as the program's concerned, because it can learn both English and Korean basically equally well. >>: Is it time-stamped or something? >> Raymond Mooney: Yes. So the way this works, sorry, is it takes everything that came out of the perception in the last five seconds. When it sees a sentence, it attaches these dotted lines saying: I don't know, but this might mean that. And here I have the green ones, which are ones that we've annotated after the fact with what we think is probably the correct match. Some things don't match, right? "Pink 11 looks around for a teammate": perception doesn't extract anything like that. "Purple team is very sloppy today": that's actually interesting. There, I think, you need to do some sort of Whorfian stuff, where it has to learn the concept of sloppy. What's happened there is purple turned the ball over several times in a row, so the commentator says they're being sloppy. Perception can't extract that, so it's a very noisy signal. The green lines we never give to the system; those are used purely for evaluation purposes. And, again, it's really just a bunch of empty symbols over there. So if we can figure out the green lines from all the dashed lines, then we can just train a supervised semantic parser, which is basically what we end up doing; then I can use all the existing stuff -- my stuff, or Luke's stuff, or Dan's stuff, whatever you want -- once you have it in supervised form. Another problem here that you may not have realized it has to address is what natural language generation people call either content selection or strategic generation: not only knowing how to say something -- I give you the logical form, generate the language -- but what to say; out of all the various facts that my perceptual system is observing, which are actually worth talking about. And so we have to actually also effectively choose.
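The ambiguous supervision just described -- every sentence linked by a dotted line to each event in the preceding five-second window -- can be sketched like this. The timestamps and the event and sentence strings are invented for illustration.

```python
# Sketch of building the ambiguous supervision: each commentary sentence is
# paired with every event extracted in the previous five seconds of the game.
def candidate_meanings(sentences, events, window=5.0):
    """sentences: list of (time, text); events: list of (time, logical_form)."""
    pairs = []
    for t_s, text in sentences:
        candidates = [lf for t_e, lf in events if t_s - window <= t_e <= t_s]
        pairs.append((text, candidates))
    return pairs

events = [(10.2, "pass(purple5, purple11)"),
          (12.0, "kick(purple11)"),
          (20.0, "turnover(purple11, pink6)")]
sentences = [(13.0, "Purple 5 passes to Purple 11")]
print(candidate_meanings(sentences, events))
```

Here the sentence at time 13.0 gets two dotted lines (the pass and the kick), while the later turnover falls outside the window; on average the real data attaches about two and a half candidates per sentence.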
So the perception pulls out all these sorts of facts, and the system has to learn a model of what's worth talking about; it pulls out these. Now, our current system is pretty simple -- it just focuses on which events seem to be highly mentioned -- but, of course, I have to learn that from this ambiguous supervision. If I had the green lines, it would be trivial to calculate the probability of mentioning each of these things, but it has to infer the green lines and then estimate those probabilities. Okay. So for the data that we played with, we collected four RoboCup championship games from the actual competition, from 2001 to 2004. The event extractor pulls out about 2,600 events in a given game. And Dave, whom you met last summer, did the English. I should get him to do Mandarin Chinese; I have all these Chinese students and never got a Chinese corpus. I have to fix that. And Joohyun commented the games in Korean. They both used pretty much the same number of sentences. Each sentence is matched to all events in the previous five seconds, which gives you a relatively small but real level of ambiguity: on average there are two and a half events associated with each sentence, but with a fair bit of variance, from one up to 12. And then we manually annotated those with the correct matching -- those green lines -- but that's not used by the program; it's only used in evaluation. The algorithm, I'm not going to talk -- >>: Is each sentence matched to just a single event, or is it sometimes multiple events? >> Raymond Mooney: That's a good question. We assumed here just a single match; it's one-to-one. Sometimes a sentence is not matched at all, but if it is matched, it's matched to one thing. There are other cases where that assumption won't work -- it could be many-to-one or one-to-many -- and we haven't specifically looked at that. It worked pretty well for this case, but I do believe it's a limitation. So we use a sort of EM iterative retraining process to try to resolve those ambiguities.
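Once the green-line matching is inferred, the content-selection probabilities mentioned above reduce to simple counts: how often does each event type actually get mentioned when it occurs? A toy sketch, with invented event types:

```python
from collections import Counter

# Toy sketch of the content-selection estimate: given a resolved matching,
# estimate how often each event type is mentioned versus how often it occurs.
def mention_probabilities(all_events, mentioned_events):
    occurred = Counter(all_events)
    mentioned = Counter(mentioned_events)
    return {etype: mentioned[etype] / occurred[etype] for etype in occurred}

probs = mention_probabilities(
    all_events=["pass", "pass", "pass", "kick", "ballstopped", "ballstopped"],
    mentioned_events=["pass", "pass", "kick"])
print(probs)  # pass: 2 of 3 mentioned; kick: 1 of 1; ballstopped: 0 of 2
```

The hard part in the actual system is that `mentioned_events` is not given; it has to come out of the same ambiguity-resolution loop that trains the parser.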
And then it calls supervised semantic parser learning; you can use whatever system you want to do the supervised learning. Basically it goes in a loop like this. First it just assumes all those dotted lines are actually real annotations. That's very noisy, right? Because the ambiguity is 2.5, most of that -- three-quarters of it or something -- is garbage. But it learns something from that, and then it goes into an iterative loop. Again, it's very EM-like. You train the supervised parser on the current, usually quite noisy, examples. You use the current trained parser to pick the best matching for each sentence -- it's what people usually call hard EM. We've tried softer versions where we actually keep the uncertainty as it goes through the loop; it actually did worse. I'm interested whether people have heard of other cases of hard EM, where you pick the best answer at each point rather than keeping a probability distribution. You create new training examples based on those assignments, and you iterate, and hopefully each time through it gets better. I don't have any convergence proofs because this isn't a real generative model, but in practice it converges in three or four iterations. If you're interested in more details, there are a lot of details in a journal paper we wrote a little over a year ago that appeared in JAIR, the Journal of Artificial Intelligence Research. Okay. So this is not a very algorithm-heavy talk. So here's the English demo. We tried different approaches -- there are a lot of different variations of the basic algorithm; you can read the paper for details -- and these are the outputs from the best-performing thing that we built. Again, the speech is just synthesized using an off-the-shelf text-to-speech system from CMU; the system outputs the text and then the speech synthesizer reads it. It's so much easier to watch these things with the speech.
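The hard-EM loop just described can be sketched minimally as follows. A toy word/meaning co-occurrence scorer stands in for the real supervised semantic parser, and the data is invented; the structure of the loop -- initialize with every dotted line as a noisy annotation, train, reassign each sentence its single best meaning, repeat -- is the point.

```python
from collections import Counter

# Hard-EM sketch for resolving ambiguous sentence/meaning matchings.
def train(assignments):
    """M-step stand-in: count word/meaning co-occurrences."""
    counts = Counter()
    for words, meaning in assignments:
        for w in words:
            counts[(w, meaning)] += 1
    return counts

def best_meaning(words, candidates, counts):
    """Hard E-step: pick the single best candidate, no distribution kept."""
    return max(candidates, key=lambda m: sum(counts[(w, m)] for w in words))

def hard_em(data, iterations=4):
    """data: list of (word list, candidate meaning list)."""
    # Initialization: treat every dotted line as a (noisy) annotation.
    assignments = [(ws, m) for ws, cands in data for m in cands]
    for _ in range(iterations):
        counts = train(assignments)
        assignments = [(ws, best_meaning(ws, cands, counts))
                       for ws, cands in data]
    return assignments

data = [(["purple5", "passes", "purple11"], ["pass", "kick"]),
        (["purple6", "passes", "purple8"], ["pass"]),
        (["pink3", "kicks"], ["kick"])]
print(hard_em(data)[0][1])  # the ambiguous first sentence resolves to "pass"
```

The unambiguous examples pull the shared word "passes" toward the pass meaning, which then disambiguates the first sentence; that mutual bootstrapping is what the iterations buy you.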
Like watching a soccer game reading the subtitles is not easy or fun. So this is on the test game, right? It has never seen this game before. It's trained on three games in English, and now it knows how to sportscast. It's sort of slowed down so you can follow it. [Video] Purple 5 passes to purple 11. It has to learn the names of the players and the syntax of the language; there's no built-in knowledge of language. It has to learn all of that from scratch. [Video] Purple 11 passes to purple 6. Purple 6 passes to purple 11. Purple 11 passes to purple 6. Purple 6 passes to purple 8. Purple 8 passes to purple 11. Purple 11 makes a bad pass that was intercepted by pink 6. Pink 6 passes back to pink 7. Pink 7 passes to pink 9. Pink 9 passes to pink 11. Pink 11 passes to pink 9. Team is off-side. [laughter] Okay. So the paper presents all sorts of evaluations of this. We looked at a lot of different factors: how well it's learning, how well it can figure out those green lines from the dashed lines -- matching sentences to the correct meanings; how well the semantic parser works when trained on the supervised data that results from that, parsing sentences into formal meanings; how well it generates sentences from those formal meanings; and how well it solves the content selection problem of picking which events are worth talking about. There are all sorts of numbers on that if you're interested, but I won't bore you with the details. The one experiment, and my favorite evaluation that we did, is what I've called the pseudo-Turing test. Of course, you've got to use Amazon Mechanical Turk to do all the work these days. So we recruited human judges, 36 in English. It's harder to get Korean speakers, but I think Joohyun called up his friends in Korea and said, check this thing out on Mechanical Turk. We had eight commented game clips: four-minute clips randomly selected from each of the four games, and each clip was commented once by the human and once by the machine.
Of course, the machine commentary for each game came from the run where that game was left out as the test game. The judges were not told which ones were human or machine, but we did tell them that half of them were generated by a machine and half by a human. Then we asked them basically four questions. We had them rate on a five-point scale the English fluency, from flawless to gibberish; the semantic correctness, where we tried to explain to the Turkers what we meant by this -- is it really saying something that actually happened in the game -- from always to never; and then this very vague question of how good a sportscaster this is, from excellent to terrible. The scores on that were all low, because it's very boring. And we also told them again that half are human and half are not, and said you have to guess which ones were generated by a human being. So here's a breakdown across all the games, with the averages for the English case at the bottom. You see the ratings of the machine are actually higher here, about the same there, a little higher there; and mostly the judges guessed that both humans and machines were not human -- only about 25 percent were judged human. But actually more of the machines were judged human than the humans were. This is partly -- if you look at the game-by-game results, you notice this huge discrepancy on the 2004 game. And that was because apparently the system randomly assigned some representation to an interesting sentence in the corpus, something like "this is the beginning of an exciting match in 2004," and it just happened to spit that out for this game, and it got lucky; it didn't do it for any good reason. It sounded to the judges much more human, and so it won decisively on that game. Now, another thing is I should have realized that David wasn't much of a soccer fan and Joohyun was, so Joohyun's games are actually commented much more interestingly.
So 62 percent of the Amazon Turkers actually believed that Joohyun was human, and only 30 percent believed that our learned sportscaster was human -- but that's still a third of the people. Again, the scores are correspondingly a little bit worse. But basically you can listen to these things, you can follow them; it works, basically. And there are more numbers in the paper if you're interested. Okay. So we stopped working on this a couple of years ago now, put the paper together, and published it. So a new problem that David's been working on for the last couple of years, and that we finally got some halfway decent results for -- it's a paper at AAAI, and I also gave a poster; was anybody at my poster at the symposium yesterday? -- is learning to follow directions in a virtual world. So the idea is to learn to interpret navigation instructions in a virtual environment by simply observing humans giving and following instructions. Again, no prior knowledge of the language. We've only done this in English so far, but you just watch people give directions to another person in a virtual environment, and that's all you get to see. And you have to now learn the language and learn to follow directions yourself. And, you know, I think even within the virtual world there are potentially interesting applications for this, where you can build virtual agents in video games and grow them, raise them like a baby, teach them language, and then interact with them. It's like you could have The Sims where you can actually raise your baby and teach it to talk and then have a conversation, rather than just mute Sims. And hopefully there could be some interesting applications for that. So what we did was use data from another student at UT -- he worked with Ben Kuipers -- who was interested in building a wheelchair that could follow directions in natural language for severely impaired people.
But to do experiments for his thesis, he built this little virtual environment, where these colors here are the tiles on the floor. It's just a bunch of connected hallways, and there's different wallpaper on the walls in each of the hallways, based on what these pictures are -- I don't know why he came up with these things. And then there are objects scattered around in the environment: hat rack, lamp, easel, sofa, barstool, and chair. So this is what it looks like to the actual people who are doing the task. This is the bird's-eye view, but this is what they see. The humans had to actually explore this environment from this perspective and learn it, and then they were asked to give instructions for getting from this place to that place to another human, and then another human followed those directions to see if they could get there. So here's an example, starting here, ending there. People generate very different descriptions, using different types of strategies to generate these directions. These are four different directions for exactly the same task, and you can see there's quite a bit of variation. "Take the first left, go all the way down, go towards a coat hanger and turn left" -- the graphics are pretty poor for reasons I don't want to go through. "Go forward; the intersection contains a hat rack; turn left; go forward three segments to an intersection with a bare concrete wall, passing a lamp." You really can hardly see the lamp, I think. So this is what the system experiences, and it has to now learn to follow directions in natural language based on just that experience, with no prior knowledge of the language. Okay. So, again, there are many different instructions for the same task. They describe different actions, they use different landmarks, and they talk about the landmarks in different ways. And the meaning representations are completely hidden from the system; it just sees the primitive actions.
It doesn't see the plan that the system would actually need to execute those actions; it has to figure out that that's what they need to do. So unlike the sportscasting, where each sentence, by looking at this five-second window, has a small fixed number of candidate meanings, here there's actually a combinatorial number, because there's a lot of stuff in the world, and the person could be referring to basically any subset of it. So you get a combinatorial explosion in the number of possible plans for describing the same sequence of primitive actions. So we couldn't take the same sort of EM approach that we took in the sportscasting case, because there are just too many possible MRs for each NL, to use my lingo. But at the end of the day, this is what it gets: triples of a natural language sentence; an observed sequence of actions that were executed after the person said that sentence; and the state of the world -- all the objects and the wallpaper and the tile in each of those states. So you get these triples of language, action, world state, and that's what it gets. And then it has to build a correct mapping from the natural language into the actions, given, of course, the knowledge of the world state. Okay. So the overall system we built looks a little bit like this. It first looks at the world state and the trace of actions that were executed, and constructs various ways of coming up with a possible plan for that; we'll look at those in a minute. But then it goes through what we call a plan refinement step, because these plans usually contain too much information. It's as if everyone told you everything you saw at every point, and that's not what the language describes, of course. So we do what we call plan refinement, which actually looks at the language.
It learns a lexicon about what it thinks the words mean -- the first one to do this sort of stuff was Jeff Siskind, if you know his work, and I had a student, Cindy Thompson, who did things like this too -- learning pieces of representation that seem to correlate with words. Then it uses those meanings of words to pull out the garbage from the plan and pare it down to what it thinks the language is actually talking about. Then it gets pairs of the instructions with the inferred plan, given the lexical knowledge that it's learned, and it just trains a supervised semantic parser on those pairs. Then, in testing, you get an instruction; the semantic parser maps it into a formal representation; that then goes to an execution module -- MARCO was this system that could execute those formal plan representations, which Matt MacMahon built as part of his thesis -- and of course it knows the world state, and then it produces an action. So a little bit about those two things. We start with two basic representations of a plan. One just mentions the primitive actions. The other, the landmark plan, after every action puts in everything in the world that you observe at that point. So you turn left, and then you verify that there's a wall to the left, a lamp to the back, a hat rack to the back, and a brick wall in front of you. This actually isn't even everything, because it doesn't include the tile, but that would be there too. And then you take a step and verify that you're in the wood hall. So this is very detailed -- much more than what the language is actually talking about. So we need to figure out how to pare this down to what we think the language is actually referring to. We need to remove those extraneous details in the landmark plans, and we do that by learning the lexicon and then removing the parts that don't seem to correspond to words that actually appear in the sentence. So let's go to the example here.
So here are two cases. Again, this is a simplified form to fit on a slide; the real landmark plans are somewhat more complicated than this. This one is turning and walking to the sofa; in the other case, walking to the sofa and turning left. So it takes all the sequences of actions it saw for every sentence where the person mentioned "sofa," say. I really should change this to something like "couch," because it's not using the fact that the string "sofa" means sofa -- it could be in Korean or Spanish or something else. And we take an intersection of these two graphs, because they both use the word "sofa." So presumably -- again, the word could be ambiguous, but in general it's not a bad heuristic -- "sofa" probably means something that's shared by these two examples. But there are multiple things it could be. It could mean turn left at the blue hallway, or it could mean travel and verify that you're at the sofa, because those are two maximal subgraphs shared by these two plans. So the lexicon learning looks a little bit like this. It collects all the landmark plans that co-occurred with a particular word and says maybe those little pieces of graph are things that word refers to, and it keeps doing that, adding new entries. And then it ranks the final set of candidates by a simple score which says you want to find a piece of graph that's very likely given the word but not so likely when the word's not there. And so it learns this sort of ranked lexicon: pieces of graph as the, quote, meaning of each of the words. Now, once it has that, it tries to simplify the plan by just looking at the words. So here it says, well, I think "turn left" means this, so pull that out; I think "walk to the sofa" means that, so pull that out.
And "and" is the only thing that's left that doesn't have any meaning in the lexicon that it learned, and then it reduces that more complicated plan to this subpiece, which it thinks is composed of pieces that correspond to the meanings of the words that actually appeared in the sentence. So there's a little bit of pseudocode here in the slide: it selects the highest-scoring entry in the lexicon at each point, removes that word from the sentence, and keeps doing that until it's exhausted the words in the sentence. Okay. So hopefully that gives you a little bit of an idea of how it pares down the whole big plan to the subplan that it actually thinks the person is talking about. If you're interested, the AAAI paper is up on the website if you want to read more of the details. A thing on the data again. We didn't collect this data; MacMahon, a previous student, did -- Michi, I think you said you knew Matt, right? He collected instructions with six instructors and one to fifteen followers, on three different maps. They all have the same types of tiles and objects, but they're all arranged differently in each of those three. And we do leave-one-map-out cross-validation, which means we train on two environments. We want to see, of course, whether it can generalize to new instructions in a new environment -- the same sorts of primitive objects and things in the world, but all arranged differently. So there are a few statistics. You can look at the entire instruction paragraph or you can look at the individual sentences. About 700 instructions with 660 words. Single sentences are, of course, a little bit shorter. For the entire instruction, you usually need to do about ten actions to follow the whole paragraph of instructions and get to the goal. We did a number of different evaluations. Again, I'd point you to the paper for some of these.
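The greedy refinement pseudocode on the slide might look something like this. Again, a sketch with assumed names, not the actual implementation: the lexicon is taken to map each word to a set of plan components plus a score, and plans are flattened to sets.

```python
def refine_plan(sentence_words, landmark_plan, lexicon, scores):
    """Sketch of the greedy refinement step: repeatedly pick the
    highest-scoring lexicon entry for a word still in the sentence,
    keep the plan components that entry explains, remove the word,
    and stop when the sentence's lexicon words are exhausted.
    lexicon: word -> set of components; scores: word -> float."""
    remaining = set(sentence_words)
    explained = set()
    while True:
        usable = [(scores[w], w) for w in remaining if w in lexicon]
        if not usable:
            break
        _, best = max(usable)            # highest-scoring entry first
        explained |= set(lexicon[best])
        remaining.discard(best)
    # keep only the parts of the big landmark plan some word explains
    return {c for c in landmark_plan if c in explained}
```

The verification step for the hat rack in the example above would be dropped here, because no word in the sentence accounts for it.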
I'll present results for what I call end-to-end execution, where you train it on two maps -- it gets to watch the instructions followed in those two maps -- then you give it a new map and a new instruction, you say go, and you see if the instruction is followed correctly and it gets to the right place. Again, we did leave-one-map-out cross-validation. It's a pretty strict metric: only if the final position exactly matches the goal do you get a point for that problem. But we looked at it both at the sentence level -- after each sentence, do you end up in the place you should be -- and after the entire paragraph for that whole instruction, do you end up in the right place. We built what I think is a reasonable lower baseline to give you an idea of how hard the problem is: a simple probabilistic generative model of the actions in the plans, so actions that were more likely in training are more likely to come out of this generative model, and you execute that generative model. It doesn't look at the language at all; it just does actions from a very simple general model of how the plans actually occurred in practice. We thought that was a pretty good lower baseline. For upper baselines, we asked what happens if we manually annotate sentences in this domain with their actual, correct plan representation. And Matt built the MARCO system totally by hand, no learning involved, but he also hand-engineered it for all of these maps -- three or four, how many? Three, I think. So it's really like testing on the training data. His scores will seem high, but take into account that that's basically testing on training, or development, data; Matt didn't do a good job of separating development from test. And, of course, how well the humans do when they follow these directions. So, a bunch of numbers here. First is how accurately it can follow a single sentence.
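The lower baseline just described -- actions sampled by frequency, ignoring the language entirely -- might be sketched like this. Function names are made up for illustration; the model is just a unigram distribution over primitive actions estimated from the training plans.

```python
import random
from collections import Counter

def train_action_model(training_plans):
    """Unigram model over primitive actions: each action's probability
    is its relative frequency in the observed training plans."""
    counts = Counter(a for plan in training_plans for a in plan)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

def sample_plan(model, n_actions, rng=random):
    """Generate a plan without looking at the instruction at all."""
    actions = list(model)
    weights = [model[a] for a in actions]
    return rng.choices(actions, weights=weights, k=n_actions)
```

A baseline like this measures how far you can get by just mimicking typical action statistics, which is exactly what the metric then scores.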
The simple generative model, the lower baseline, will actually do the right thing about 11 percent of the time. If we train the system on the basic plans, which just have the primitive actions, at the single-sentence level it doesn't actually do that badly: it gets about 56 percent. If we train it on the landmark plans, there's too much. I call this the Goldilocks problem: one is trained on too little of a representation of the meaning, the other on too much. So the basic plans actually do pretty well, and the landmark plans do quite a bit worse. If we put in our lexicon learner and refine the plans before we produce the gold standard that goes into training the supervised parser, then we actually don't even do as well as the basic plans for a single sentence, which is a little bit disappointing. If you train the system on human-annotated plans -- so this is giving the supervised parser gold-standard annotated data -- it only gets about 58 percent. MARCO, because it was sort of overfit to this data, is about 77, and human followers were never judged on their ability to follow a single sentence. If you go to the complete instruction, of course, everything gets quite a bit worse, because if you misparse any of the sentences in the instruction, you're going to screw up and you're not going to get to the right place. So this shows the simple generative model gets just two percent: the chance that you just mimic the types of actions that were performed during training and get it right is very low. If you train on basic plans -- and here's really the best result for our systems, not that great -- 16 percent of the time it can actually follow the entire directions and get to the right place.
And notice that here it actually is doing better than the basic plans, because of those verify steps: the executor can actually correct some errors. If the plan says that after you go one step you should see the hat rack, and you don't see it yet, it will keep going until it actually sees the hat rack and then go to it. So it can fix errors in the instructions themselves by using those verification steps. If you train on human-annotated plans, it gets about 26 percent; MARCO, again overfit to this data, 55. And human followers only get about 70 percent, because sometimes there are errors -- these are real instructions that people generate, and sometimes they're just wrong. So human followers only reach the correct destination about 70 percent of the time in this environment. Here's just a sample parse where the system learned to parse this pretty complicated instruction completely correctly -- this plan will actually get you to the right destination. "Place your back against the wall of the T intersection. Turn left. Go forward along the pink flowered carpet hall, two segments to the intersection with the brick wall." I wonder what the parser got for that; it's in here somewhere. "This intersection contains a hat rack. Turn left. Go forward three segments to an intersection with a bare concrete wall, passing the lamp. That's your goal." And it actually got a plan out of that. Again, it didn't understand everything right. Notice it picked up "hallway," but what didn't it get here? "Pink flowered carpet" -- it didn't even get that. But people give these instructions redundantly; there's more information here than you need. So the fact that it doesn't pick up all the information is okay; it's still good enough to work. Okay. So that's the quick tour of these two applications we've looked at to try to judge grounded language learning.
Again, the system starts with no language; it has to learn everything about the language just by watching things -- either watching a game being sportscast in the RoboCup simulator, or watching two people, one giving instructions to the other, and them executing a sequence of actions. The current systems are both very passive. They just sit there and watch; hopefully your child isn't like this when he or she is learning language. Children interact and do other things. So we'd like to move to a more interactive, active form of learning here, where the interactive learner acts as a follower -- it actually participates in the process and tries to follow the directions and has more interaction with the world, rather than just passively observing two people doing it -- or acts as an instructor: it tries to generate things, sees what people do, and learns more from generating language and seeing people's responses to it. And, of course, it would want to generate good things, things it's uncertain about. There are a lot of active-learning ideas that could go into this. And this would, I think, make it even a little bit more like how people learn language, rather than just passively observing the world. More generally, I'm interested in language grounding in general. Both of these tasks are in simulated worlds, because I'm not a robotics or computer vision person, but I am interested in this general area of integrating language and vision. There was a workshop NSF held about a month ago that I went to, where they threw together half computational linguists and half vision people, and we yelled at each other for a day. I don't understand vision and they don't understand language, but the thing is, there's actually an interlingua, which is called machine learning.
All vision people do is machine learning. All language people do is machine learning. So you just talk in SVMs and kernels and graphical models and EM, and you can actually talk to these guys. But, of course, generally they all have to do supervised learning. So they have Turkers -- they can get Turkers to do more fun tasks, because they have them identify objects in images and outline them. They have an easier job getting Turkers to do what they want than language people do. But even though Turking has made it cheaper, it's still costly. It would be nice if we didn't have to do that. So part of my vision is: can't we do cross-supervision of language and vision? Use naturally occurring perceptual input to supervise language learning, and use naturally occurring language to supervise visual learning. So we go back to the '60s at MIT and do the blocks world, where we have this scene and I give it a sentence: the blue cylinder is on top of a red cube. If I'm trying to learn language, I view the sentence as the input that I want to understand, and the scene is the supervision that I'm going to use to train that. If I'm doing vision, I treat the scene as my input and I treat the language as my supervision. So they each supervise each other, and I can learn both vision and language from the correspondence between them. Again, this is just a high-level vision about language and vision. I've done a little bit of work with real video and real images, and I'm learning a little bit more about vision, but I'm having a hard enough time keeping up with both machine learning and natural language, much less getting into vision or robotics.
My students love me, right, because I say, okay, if you want to work with me on grounded language, you have to take the computer vision class, the robotics class, the learning class, and the language class. And if they survive that, then it's like, okay, you can work with me on grounded language. So we've done a little bit of work with closed-captioned video, where you use the closed captions as a very weak sort of supervision to do what vision people call activity recognition. We automatically trained activity recognizers and used them to improve the precision of video retrieval. Of course, we did this in soccer. The other nice thing is this allowed me to use my grant money to buy a TiVo machine, because that's the best way to get the closed-caption text with the video; YouTube doesn't give you closed-caption text. And we tried to learn activity recognizers for four different verbs: kick, save, touch, something else -- I'm not a soccer person. I don't play video games and I hate sports, and somehow I'm doing video games and sports; I don't know how that happened. But it just uses the fact that these words co-occur with the activity -- and most of this is noise, because most of the time when the caption says "kick," no one's actually kicking in the image. But there's some probability, so you can actually use this as a very weak supervisory signal and learn an activity recognizer that's really bad; but if you add it on top of the text, it actually improves the precision of your image or video clip retrieval. This is an AAAI paper from last year, if people are interested. >>: What do you use for recognition of actions in videos? >> Raymond Mooney: We used off-the-shelf stuff, by Laptev. >>: State-of-the-art? >> Raymond Mooney: If you have, you know, John, someone taping their graduate students walking or waving, then it works pretty well. If you run it on real video, it's pretty crappy.
But the point is you can do the retrieval of events just by looking at the words and then filter that through the activity recognizers; the activity recognizer is just acting as a filter on the language-based retrieval. And we show -- the numbers are in the paper -- that you can improve the F-measure and MAP scores for your video retrieval by adding the activity recognizer that you learn from this crappy signal on top of the linguistic-based retrieval. Again, that's an AAAI paper from last year if you're interested. So one thing I learned from this whole thing is: don't write your good masters students too-good recommendation letters, because then they get into Stanford and then they leave. Some are now working with Christine's old advisor, Chris Manning. So I have to write crappier letters in the future. All right, so I'm wrapping up here. Current language learning approaches use unrealistic training data, blah, blah, blah. So we've been trying to focus on this idea of learning language from sentences paired with the actual environments that the language refers to. And right now we've explored a couple of admittedly simple but, I think, interesting challenge problems. Learning to sportscast simulated RoboCup games -- we're actually able to commentate games about as well as humans -- and learning to follow navigation instructions. It turns out that's much harder, but we're able to accurately follow individual sentences at about the 55 percent level, and complete instructions, as you saw, not too great: 15, 16 percent. But I think this whole area of grounded language learning is starting to take off a little bit more. Regina Barzilay has been doing things. Luke Zettlemoyer has been doing things. I've contaminated them with some of these ideas. And I hope more of you get interested in these things, because I think this is the future. Okay. That's just back-up.
[applause] >>: Do you get a digital feed of the closed captions from the TiVo, or did you have someone -- >> Raymond Mooney: Took a while to figure that out, but, yeah. Like I said, I had to write a special letter: yes, I'm actually buying the TiVo machine to do research. >>: So are there arbitrary amounts of that sort of data available? >> Raymond Mooney: We recorded it off cable TV; that's why we got the TiVo box. I don't know -- we didn't do a deep look at this -- but I don't think you can easily find much closed-captioned video online, as far as I know. I mean, there's a bunch of it out there, right, but it's broadcast and it's not on YouTube. So getting that data is a bit tricky unless you buy the TiVo machine and record it live off the TV yourself. Actually, when I first learned of closed captioning in the early '90s or so, I said, wow, I'd like to do language learning on that. It's just taken me 20 years to do something. >>: So for the sportscasting application, there are video games out there -- I'm kind of a sports and video games [inaudible] -- but I've seen them and they commentate on this -- >> Raymond Mooney: Sure, but I'd estimate with 99 percent probability those are all hand-engineered systems, right? My system starts with no knowledge of the language, no knowledge of anything; it has to learn it all from scratch. There are actually RoboCup sportscasters that were hand built, and they're actually pretty good. But they're hand built. You put in the patterns; it's a pretty simple template-based text generation problem. The point of this is not the sophistication of the language that it generates; it's that it's learning it from this weak signal.
>>: What about the live signal? I know there are setups where you can play multi-player gaming systems and you have a close-talk microphone headset and text and commentary. >> Raymond Mooney: You'd need to collect that data -- I'd also need a good speech recognition person, which I don't have. But, yeah, if we could do the speech -- those mic setups give much better speech recognition if you have the voices. >>: It's probably going to be cursing and saying all sorts of -- >> Raymond Mooney: That would be a good generator. [laughter] I'd have to put out a percent sign, backslash, bang, bang. [laughter] >>: In contemporary sportscasts, not a lot of it is very structured with respect to the video. >> Raymond Mooney: Right. Sometimes I call this sportscasting for Sesame Street, or maybe for Teletubbies, or whatever's lower than that -- what we put in is very literal. Real sportscasts assume you can watch the game. So what I'd like to get hold of is what's called SAP, the secondary audio program, and there I'd have to do the speech problem. I knocked my TV into SAP mode once back in the mid-nineties. I was watching some PBS nature show and there's this other voice coming in: "And now the lion approaches the cub." And I'm like, what the hell happened? I had flipped it into SAP. And that is the sort of signal I want. The other thing people have done is look at scripts typed in by fanatic fans. There's this famous paper in vision using Buffy the Vampire Slayer, where every episode of Buffy has been meticulously hand-coded for every action and every piece of dialogue in the entire series, and you can get this data online. People have aligned those with video, but they're detailed human annotations. So I think there are other sorts of interesting signals connecting language to video. Normal closed captioning is actually pretty bad, because it rarely talks about what's actually happening on the screen.
>>: Before you -- >> Raymond Mooney: So there are these sites that post online commentary on a game, on the Internet, while it's being played. I've talked with my students about that; it's an interesting source that we've considered. Another thing we considered is actually Percy's cooking shows, but he's doing it in a virtual environment. I just asked him about this last week in Portland. He's never gotten anything working on it, but he told me about it last year, and I thought it was a cool thing. He built a virtual kitchen, and he gives people recipes and they have to cook things in the virtual kitchen. And then he wants a system that can be a virtual cook and follow new recipes, but he hasn't actually gotten anything yet. Apparently he set all this up and then got off onto other things -- you know, Percy's doing ten different things at once. >>: I love the idea of grounded language, but it feels like we're suffering from a lack of data. Is there some way to get at it? Do you have ideas about how we can go forward into new domains and have people provide -- >> Raymond Mooney: I sort of like the virtual-environment idea: if we could build something like The Sims with language and get people to talk in it -- if we built interesting virtual worlds where you could talk about things -- and get people patient enough. It takes a long time to raise a baby to learn to talk. I just saw Misha's kid -- she's almost one and a half, and she doesn't say a single word. It's not an easy problem, right? >>: You should really be plotting performance on that navigation task against age: how do one-year-olds do it, two-year-olds, three-year-olds? >> Raymond Mooney: Raising a baby in the real world is tough. Raising it in the virtual world, giving it enough linguistic experience to actually learn the language -- I don't know if people would be willing to do that, because it's a lot of data.
>>: You're learning a different brain for RoboCup games versus the brain for following directions. Is there some way we can leverage this across domains somehow -- some generalization capability that might allow us to do more with less data? >> Raymond Mooney: I'm open to suggestions. >>: If you want sports data -- I think in India, and probably other countries, they still have actual radio broadcasts of games, right? So that would be much more literal. >> Raymond Mooney: I've thought about that. >>: The commentary [inaudible] exactly what's going on, because it's not assuming that you can see -- >> Raymond Mooney: My dad used to always listen to baseball games on the radio when I was a kid. >>: Presumably you could get the video feed as well, so you would have synchronization. >> Raymond Mooney: That's a good idea. We haven't thought about that. >>: Shortwave radio. >>: I was going to say, along the same lines, football coaches, right after a game, will go through the most literal analyses -- taking the tape, slowing it down, and really -- >> Raymond Mooney: Yes, if you did the alignment. But that's, I think, more distilled than a sportscaster trying to keep an audience. So I think there are a lot of possibilities, and we've only barely scratched the surface. There are other sources of data; are any of them perfect? Certainly not. >>: When you do the EM training, what features are you using in order to learn the best meaning representation? >> Raymond Mooney: Again, at the end of the day it's just training a supervised semantic parser, and different systems use different sets of features. What we use with the RoboCup data is a system called lambda-WASP; it won the Best Paper Award at ACL in Prague. It learns synchronous context-free grammar rules to do the mapping. But that, to me, is a separate problem. I worked on that for 16 years; now other people are finally working on it, too.
So there are a number of systems out there. Our code is available online. It's not documented or supported, but you can download it. I think a few people have actually got it working. Okay. Thank you. [applause]