>> Bill Dolan: So we're lucky to host Ray Mooney today. Ray is coming to us from the University of Texas, where he has been for 23 years now, graduating about the same number of students and having an illustrious career that has covered machine learning, natural language processing, and data mining.
He has a bunch of Best Paper awards, he's the president of IMLS, and he comes from Illinois originally. I know lots of anecdotes about Ray that I would not dare tell here, but feel free to ask me; I can tell you about six years of very much suffering at Ray's hands. He'll tell us about some exciting recent stuff he's done in learning and grounding.
>> Raymond Mooney: Okay. Thanks for all showing up. I know that this is sort of an awkward time; I'm skipping out on the tutorials at ICML. I want to talk about some of my recent work on learning concepts. Maybe some of you know my student David, who was here last summer working with Bill, and who just gave, I think both Bill and I would agree, quite a nice talk at ACL on that work.
He's actually published two papers, including a workshop paper at the HComp Human Computation Workshop at AAAI. But this is what he does for a living, and so does my other Ph.D. student, Joohyun Kim, who is Korean. That's where all my Korean data comes from that you'll see.
Okay. So, again, hopefully most of this is just blah, blah, blah for people here. So in NLP things have really moved towards learning. People tried to build systems manually in the '60s, '70s, and '80s, and now almost all NLP systems are by and large constructed using machine learning methods trained on large corpora: syntactic parsing with things like the Penn Treebank (I get tired of telling my class about that sentence), word sense disambiguation, and so on. Every area has its own little annotated corpora, whether you're doing semantic roles or coreference or whatever.
So the task I've worked on a lot, for actually quite a long time, more than 15 years certainly, is what I call semantic parsing. Again, people use this term in slightly different ways; I have a particular interpretation of it, which is mapping a natural language sentence into what I call here a detailed formal semantic representation, what a linguist would most likely call a logical form. Since all of our stuff isn't strictly logical, I just use the very generic term meaning representation, and sometimes this not-very-common acronym, MR, to mean that.
And for a lot of these applications what I'm interested in is mapping language directly into
some computer language that's actually executable.
So one of the tasks we looked at, starting back in the mid-'90s -- and I actually saw papers this summer still using this data, which is sort of hopelessly outdated in my mind; I don't work with it anymore -- is answering questions about U.S. geography. I can tell you the story about why we used this data, but it's a bit long. So you have questions in English: How many states does the Mississippi run through? With semantic parsing, I want to map that into some very precise logical form, a Prolog-y sort of thing, that I can then give to a database as a query -- and the answer happens to be 10.
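To make that setup concrete, here is a minimal sketch in Python of what the supervised training data looks like; the MR strings are only illustrative approximations, not the exact Geoquery syntax, and train_semantic_parser is just a placeholder for whichever learner is plugged in.

```python
from dataclasses import dataclass

@dataclass
class NLMRPair:
    sentence: str   # natural-language question
    meaning: str    # formal meaning representation (MR)

# Illustrative NL-MR pairs in the spirit of the Geoquery domain.
training_data = [
    NLMRPair("How many states does the Mississippi run through?",
             "answer(count(state(traverse(river(mississippi)))))"),
    NLMRPair("What is the capital of Texas?",
             "answer(capital(state(texas)))"),
]

def train_semantic_parser(pairs):
    """Placeholder for a learner that induces a sentence-to-MR mapping
    using only the NL-MR pairs, with no prior linguistic knowledge."""
    raise NotImplementedError

# At test time the learned parser maps a new question to an executable MR,
# which is then run against the geography database to get the answer (10).
```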
Another task that we worked on: I have a UT colleague, Peter Stone, who was one of the founders of RoboCup. He got me sucked into the RoboCup vortex; we'll see more of that later. This is about the coaching track. RoboCup has like 25 tracks -- it's a huge tent, a 20-ring circus these days -- but they actually have a track called coaching, where you have simulated players in this very simple 2D simulated soccer world, and you can give them instructions in a very formal way. So we'd like to take an English sentence like this: if the ball is in our penalty area, then all our players, except for player four, should stay in our half.
In the current RoboCup Coach competition, you have to give that in this form. For those who have been around long enough, it looks like Lisp; it's sort of an awkward formal language. So we ask: can't we just map a natural language sentence into it? Then you can actually give that formal expression to the simulator, and the players in the game will actually obey that instruction.
So I have done a lot of work over the past 15 or 17 years on learning semantic parsers. You just give the system a bunch of natural language sentences paired with logical forms, like I showed. That goes into a system which automatically produces a semantic parser. In most of the work I've done, you use no prior knowledge of the language -- no parsers, no lexicon, nothing. It has to learn absolutely everything from the NL-MR pairs. Then you can hopefully give it new sentences, and it will produce useful, correct meaning representations for them. So as I said, our first paper on this, which introduced this Geoquery data, was published with my student John Zelle back at AAAI in '96, and I actively worked on these areas in many papers up through an ACL paper a couple of years ago. That work did actually use an existing syntactic parser, but it's the only one of my systems that uses existing NLP tools; the rest have to learn everything from the NL-MR pairs.
As I worked on this for a decade, no one gave a damn about it. But suddenly there are more people, thanks to Luke Zettlemoyer, who I assume you know because he's across the Sound, or what do I want to say. And he's actually in the et al. of these papers; he's the most active. Percy Liang, if you saw his paper at ACL, has been working in the area recently as well. Dan Roth had a paper, and had a student give a talk at the symposium yesterday on it. Who else mentioned it? Dr. Steedman mentioned my work in this area. So suddenly it's actually more popular, which means it's time for me to get out.
And actually a lot of the current work doesn't use the standard supervised NL-MR pairs: Dan and Percy's work, and Ming-Wei, the student who gave the talk.
So this is just a little aside saying that constructing these annotated corpora, particularly for semantic parsing, is very expensive and time-consuming. I have this sort of catch phrase: to a large extent, NLP has replaced the burden of software engineering with the burden of corpus construction. So what I'd like to do is get a mechanism that can learn a language more like a child learns language, by experiencing it in context.
And there's a growing amount of work on unsupervised learning, where you just try to learn from raw corpora without any annotation. That's very difficult, and I think all the information you need really isn't in there. You can do a lot of interesting things with unsupervised language learning from just the text. But, sort of like -- the U.S. government wouldn't let you run this experiment, but try to raise your child by just throwing the New York Times and Web pages at them and turning the radio on; they're not going to learn language, right? And I think that's not just because of some human psychological limitation. It's because the information you really need to understand language in any deep sense isn't in the signal by itself; it's the relationship to what's happening in the world when that sentence is being said.
So I think the natural way to learn language is to perceive language in the context of its use in the physical and social world, and that requires inferring some notion of the meaning of the sentence from the perceptual context in which that sentence was uttered. This connects to a general area whose best name, I think, is language grounding. That term, as far as I know, was first introduced by Harnad, who published a paper in Behavioral and Brain Sciences in 1990 called symbol grounding, where he argues that the main problem with AI is that you can't just have these symbols; they have to be somehow connected to the world, to perception somehow.
And a lot of obvious words are sort of grounded in the perception of the world: colors, and objects like cup and ball, and verbs like hit and run. But, of course, there are a lot of terms in natural language that are very abstract and don't obviously relate to actual objects and events that you perceive in the world. But even then, a lot of those are metaphorical uses of language that was originally grounded, right? We use words like up and down and over and in, in all sorts of ways, but their initial meaning is a physical meaning, and I think a lot of the understanding and semantics you attach to those words does come from their original grounded interpretation. Of course, there's a very famous book back from the '70s sometime, I think, by Lakoff and Johnson, called Metaphors We Live By, which argues that a lot of language is used metaphorically, and a lot of that is metaphor to things that are actually grounded. I can say things like "I put my ideas into words": I'm treating ideas like objects and words like containers there, and I'm using the semantics of that physically grounded language, but, of course, it's a metaphorical use.
So I think most NLP work represents meaning without any connection to perception, circularly defining terms in terms of other words with no firm grounding in actual perceptual reality. One little thing I like to do is look up words in the dictionary and see how circular they are. You look up sleep in WordNet, and it says: asleep, to be asleep, a state of sleep. Thank you, WordNet, I've learned a lot about sleep. But you and I don't have to be told what sleep is; we do it every night, hopefully, unless you're a grad student.
So that's sort of the high-level motivation. To try to make some concrete progress on that, I'm going to talk today about a couple of, I totally admit, very simple, sort of toy cases that we've been looking at to try to move in this direction. So the first one I looked at was learning to be a sportscaster. Here I want to learn from realistic data of natural language used in a representative context, where I can actually infer, or at least have a reasonable chance of inferring, the meanings of what these sentences are referring to, but I don't want to get bogged down in all the details of computer perception such as speech and vision.
So what we've worked on is learning from textually annotated traces of activity in a simulated environment. In particular, we take this RoboCup simulator game, type in textual sportscasting commentary, and see if the system can learn to sportscast with no prior knowledge of the language whatsoever, just by -- not listening, but reading and processing the text that's been given to it and associating it with the actual events happening in the game. So we have four games in our dataset. We train it on three and test it on the fourth. It has to learn to sportscast the fourth game on its own after just watching three games being sportscast.
So the overall system: here is the simulator. We have a very simple sort of rule-based system that looks at the detailed output of the simulator, which basically gives you the position and velocity of all the players and the direction they're facing -- that little mark is sort of where they look; they do have a cone of vision. It's actually not a bad simulator. It simulates friction and vision and all these sorts of things, but it's a very simple 2D thing. It pulls out a set of perceived facts, in a logical form, about events that are happening, and that's not influenced by language at all. I assume a lot of us here went to the very nice Whorfian talk at ACL, which I thought was a great talk; there's no Whorfian stuff going on here, because the facts are not shaped by language at all. They're just assumed to come out of perception. Then we have the sportscaster saying something, and that goes into a grounded language learner. One of the semantic parsing systems I built is based on a probabilistic version of synchronous context-free grammars. One of the reasons I like that work, a system called WASPER -- best paper at ACL in Prague -- is that it supports both semantic parsing and language generation; language generation just has to have a language model in addition.
So we're interested in the language generation side of things. You can play a new game, run the so-called simulated perception, which pulls out a bunch of facts, and the language generator can turn those into language. Okay. So to give you an idea of what this looks like: do we have any Korean speakers in the audience? Oh, boy, a couple. So you are no longer valid subjects for this part of the talk. My Korean student, Joohyun, was nice enough to commentate this game. So imagine, for the non-Korean speakers, that I showed you three of these games and then gave you a fourth game, and you had to sportscast it in Korean.
It takes a little while before it starts talking. The text-to-speech is just off-the-shelf; it generates the --
[Korean]
Hopefully the Koreans can understand this.
[Korean]
Very nice script in Korean, [inaudible], very Korean -- he was a king, designed it in the 1600s or something.
[Korean]
So, think: could you do what my system does, which is learn to sportscast in Korean? After I showed you three games of this you'd be bored to tears, but I don't know if you could do it. So what the system really deals with -- again, I don't do speech or vision or anything -- is this set of sentences, paired with a string of these sorts of logical terms. Now, it's not using anything about the orthography of these, so really it's more like this: it's just a bunch of empty symbols as far as the program is concerned. That's why it can learn both English and Korean basically equally well.
>>: Is it time stamped or something?
>> Raymond Mooney: Yes. So the way this works, sorry, is it takes everything that came out of the perception in the last five seconds. When it sees a sentence, it attaches these dotted lines saying, I don't know, but this sentence might mean that event. And here I have the green ones, which are ones that we've annotated after the fact with what we think is probably the correct match. Some things don't match anything, right? "Pink 11 looks around for a teammate" -- perception doesn't extract anything like that. "Purple team is very sloppy today" -- that one's actually interesting; there, I think, you'd need to do some sort of Whorfian stuff where it has to learn the concept of sloppy. What's happened there is purple turned the ball over several times in a row, so he's saying they're being sloppy, and perception can't extract that. So it's a very noisy signal. The green lines we never give to the system; those are used purely for evaluation purposes.
And, again, it's really just a bunch of empty symbols over there. So if we can figure out the green lines from all the dashed lines, then we can just train a supervised semantic parser on the result, which is basically what we end up doing; then I can use all of my previous stuff, or Luke's stuff, or Dan's stuff, whatever you want, once you have it in supervised form. Another problem here, which you may not have realized it has to address, is what natural language generation people call content selection or strategic generation: not only knowing how to say something -- I give you the logical form, you generate the language -- but knowing what to say, that is, out of all the various facts my perceptual system is observing, which are actually worth talking about. So we have to effectively choose. The perception pulls out all these sorts of facts; the system has to learn a model of what's worth talking about and pull out these. Now, our current system is pretty simple: it just focuses on which events seem to be frequently mentioned. But, of course, it has to learn that from this ambiguous supervision. If I had the green lines, it would be trivial to estimate the probability of mentioning each of these event types, but it has to infer the green lines and then estimate those probabilities.
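As a rough illustration of that content-selection idea -- not the exact method in the paper -- once sentence-event alignments have been inferred, you can estimate how often each event type actually gets mentioned and only commentate types whose mention rate clears a threshold; the field and function names below are hypothetical.

```python
from collections import Counter

def mention_probabilities(aligned_events, all_events):
    """aligned_events: events the learner believes were commented on;
    all_events: every event extracted by perception. Both have a .type field."""
    mentioned = Counter(e.type for e in aligned_events)
    occurred = Counter(e.type for e in all_events)
    return {t: mentioned[t] / occurred[t] for t in occurred}

def worth_mentioning(event, probs, threshold=0.5):
    # Commentate only event types that humans mentioned often enough in training.
    return probs.get(event.type, 0.0) >= threshold
```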
Okay. So for the data that we played with, we collected four RoboCup championship games from the actual competition, from 2001 to 2004. The event extractor pulls out about 2,600 events in a given game. Dave, whom you met last summer, did the English. I should get him to do Mandarin Chinese -- I have all these Chinese students and they never gave me a Chinese corpus; I have to fix that. And Joohyun commentated the games in Korean. They both used pretty much the same number of sentences. Each sentence is matched to all events in the previous five seconds, which gives you a relatively small but real level of ambiguity: on average there are about two and a half events associated with each sentence, but there's quite a bit of variance, from one up to 12. And then we manually annotated those with the correct matching, those green lines, but that's not used by the program; it's only used in evaluation.
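Here is a minimal sketch, under assumed data structures, of how that ambiguous training set is built: each commentary sentence is paired with every event extracted in the preceding five seconds.

```python
def build_ambiguous_pairs(sentences, events, window=5.0):
    """sentences: list of (timestamp, text); events: list of (timestamp, meaning_repr)."""
    pairs = []
    for s_time, text in sentences:
        candidates = [mr for e_time, mr in events
                      if s_time - window <= e_time <= s_time]
        pairs.append((text, candidates))  # on average ~2.5 candidate meanings each
    return pairs
```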
The algorithm -- I'm not going to talk --
>>: Is the current sentence matched to just a single event, or is it sometimes multiple events?
>> Raymond Mooney: That's a good question. We assumed here just a single match; it's one-to-one. Sometimes a sentence isn't matched at all, but if it is matched, it's matched to one thing. There are other cases where that assumption won't work -- it could be many-to-one or one-to-many. We haven't specifically looked at that. It worked pretty well for this case, but I do believe it's a limitation.
So we use a sort of EM-like iterative retraining process to try to resolve those ambiguities, and it calls a supervised semantic parser learner; you can use whatever system you want to do the supervised learning. Basically it goes in a loop like this. First it just assumes all those dotted lines are actually real annotations. That's very noisy, right? Because there are 2.5 candidates per sentence, most of that -- three-quarters of it or something -- is garbage. But it learns something from that, and then it goes into an iterative loop. Again, it's very EM-like: you train the supervised parser on the current, usually quite noisy, examples; you use the current trained parser to pick the best matching event for each sentence -- this is what people usually call hard EM. We've tried a softer version where we actually keep the uncertainty as it goes through the loop, and it actually did worse. I'd be interested whether people here have heard the term hard EM, or tried it elsewhere: you pick the best answer at each point rather than keeping a probability distribution over them. You create new training examples based on those assignments, and you iterate, and hopefully each time through it gets better. I don't have any convergence proofs, because this isn't a real generative model, but in practice it converges in three or four iterations.
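The loop just described might look roughly like this; train_parser and the parser's score method are stand-ins for whatever supervised semantic-parser learner is plugged in, not real library calls.

```python
def iterative_retraining(ambiguous_pairs, train_parser, n_iters=4):
    """ambiguous_pairs: list of (sentence, candidate_meanings)."""
    # Initialization: treat every candidate meaning as a (noisy) gold annotation.
    training = [(sent, mr) for sent, cands in ambiguous_pairs for mr in cands]
    parser = train_parser(training)

    for _ in range(n_iters):
        # Hard E-step: keep only the single best-scoring candidate per sentence.
        training = []
        for sent, cands in ambiguous_pairs:
            if cands:
                best = max(cands, key=lambda mr: parser.score(sent, mr))
                training.append((sent, best))
        # M-step: retrain the supervised parser on the disambiguated pairs.
        parser = train_parser(training)
    return parser
```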
If you're interested in more details, there are a lot of them in a journal paper we wrote a little over a year ago that appeared in JAIR, the online Journal of Artificial Intelligence Research.
Okay. So that's about it; this isn't a very algorithm-heavy talk. So here's the English demo. We tried different approaches -- I'm not talking about the details; there are a lot of different variations of the basic algorithm that we tried, and you can read the paper for those. These are the outputs from the best-performing system that we built. Again, the speech is just synthesized using an off-the-shelf text-to-speech system from CMU; it outputs the text and then runs the speech synthesizer. It's so much easier to watch these things when you get the speech -- watching a soccer game while reading subtitles is not easy or fun. So this is on the test game, right? It has never seen this game before. It's trained on three games in English, and now it knows how to sportscast. It's sort of slowed down so you can follow it.
[Video] purple 5 passes to purple 11.
It has to learn the names of the players and the syntax of the language; there's no built-in knowledge of language. It has to learn all of that from scratch.
[Video] purple 11 passes to purple 6. Purple 6 passes to purple 11. Purple 11 passes to
purple 6. Purple 6 passes to purple 8. Purple 8 passes to purple 11. Purple 11 makes a
bad pass that was intercepted by pink 6. Pink 6 passes back to pink 7. Pink 7 passes to
pink 9. Pink 9 passes to pink 11. Pink 11 passes to pink 9. Team is off-side.
[laughter]
Okay. So the paper presents all sorts of evaluations of this. We looked at a lot of different factors: how well it can figure out those green lines from the dashed lines, that is, how well it can match sentences to the correct meanings; how well the semantic parser trained on the resulting supervised data works at parsing sentences into formal meanings; how well it generates sentences from those formal meanings; and how well it solves the content selection problem of picking which events are worth talking about. There are all sorts of numbers on that if you're interested, but I won't bore you with the details.
The one experiment, and my favorite evaluation, that we did is what I've called the pseudo-Turing test. Of course, you've got to use Amazon Mechanical Turk to do all the work these days, so we recruited human judges, 36 for English. It's harder to get Korean speakers, but I think Joohyun called up his friends in Korea and said, check this thing out on Amazon Turk. We had eight commentated game clips: four-minute clips randomly selected from each of the four games, and each clip was commentated once by the human and once by the machine -- of course, by the machine trained with that game left out as the test game. The judges were not told which ones were human or machine, but we did tell them that half of them were generated by a machine and half by a human. Then we asked them basically four questions. We had them rate on a five-point scale the English fluency, from flawless to gibberish, and the semantic correctness, where we tried to explain to the Turkers what we meant by this: is it really saying something that actually happened in the game, from always to never. And then there was this very vague question of how good a sportscaster is this, from excellent to terrible; the scores on that were all low, because it's very boring. And we also told them again that half were human and half were not, and said you have to guess which ones were generated by a human being.
So here's a breakdown across all the games, with the averages for the English case at the bottom. You see the ratings of the machine are actually a little higher here, about the same there, a little higher there, and mostly both the humans and the machines were judged not human -- only about 25 percent judged human. But actually the machine was judged human more often than the human was. If you look at the game-by-game results, though, this is partly because of this huge discrepancy on the 2004 game. That was because the system apparently randomly assigned some representation to an interesting sentence in the corpus, something like "this is the beginning of an exciting match in 2004," and it just happened to spit that out for this game. It got lucky; it didn't do it for any good reason. It sounded much more human to the judges, and so it won decisively on that game.
Now, another thing: I should have realized that David wasn't much of a soccer fan and Joohyun was, so Joohyun's games are actually commentated much more interestingly. So 62 percent of the Amazon Turkers believed that Joohyun was human, and only 30 percent believed that our learned sportscaster was human -- but that's still about a third of the people. Again, the scores are correspondingly a little bit worse. But basically you can listen to these things, you can follow them; it works, basically. And there are more numbers in the paper if you're interested. Okay. So we stopped working on this a couple of years ago now and went and put the paper together and published it. So a new problem that David's been working on for the last couple of years, and that we finally got some halfway decent results for, is a paper at AAAI. I also gave a poster on it -- was anybody at my poster at the symposium yesterday?
And this is learning to follow directions in a virtual world. The idea is to learn to interpret navigation instructions in a virtual environment simply by observing humans giving and following instructions. Again, no prior knowledge of the language. We've only done this in English so far, but you just watch people give directions to another person in a virtual environment, and that's all you get to see. You have to learn the language and learn to follow directions yourself. And, you know, even within the virtual world I think there are potentially interesting applications for this, where you could build virtual agents in video games and raise them like a baby, teach them language, and then interact with them. It's like you could have Sims where you can actually raise your baby and teach it to talk and then have a conversation, rather than just mute Sims. Hopefully there could be some interesting applications of that.
So what we did was use the work of another student at UT -- he worked with Ben Kuipers -- who was interested in building a wheelchair that could follow directions in natural language for severely impaired people. To do experiments for his thesis, he built this little virtual environment, where these colors here are the tiles on the floor. It's just a bunch of connected hallways, and there's different wallpaper on the walls in each of the hallways, based on what these pictures are -- I don't know why he came up with these things. And then there are objects scattered around the environment: hat rack, lamp, easel, sofa, barstool, and chair. So this is what it looks like to the actual people doing the task: this is the bird's-eye view, but this is what they see. The humans had to explore this environment from this perspective and learn it, and then they were asked to give instructions for getting from this place to that place to another human, and then the other human followed those directions to see if they could get there.
So here's an example, starting here and ending there. People generate very different descriptions, using different strategies, for these directions. These are four different sets of directions for exactly the same task, and you can see there's quite a bit of variation: take the first left, go all the way down, go towards a coat hanger and turn left. The graphics are pretty poor, for reasons I don't want to go into. Go forward, the intersection contains a hat rack, turn left, go forward three segments to an intersection with a bare concrete wall, passing a lamp. You can really hardly see the lamp, I think. So this is what the system experiences, and it now has to learn to follow directions in natural language based on just that experience, with no prior knowledge of the language.
Okay. So, again, there are many different instructions for the same task. They describe different actions, they use different landmarks, and they talk about the landmarks in different ways. And the meaning representations are completely hidden from the system. It just sees the primitive actions; it doesn't see the plan that it would actually need in order to execute those actions and figure out that that's what it needs to do. So unlike the sportscasting, where each sentence, looking at that five-second window, has a small fixed number of candidate meanings, here there's actually a combinatorial number, because there's a lot of stuff in the world and the person could be referring to basically any subset of it. So you get a combinatorial explosion in the number of possible plans describing the same sequence of primitive actions.
So we couldn't take the same sort of EM approach that we took in the sportscasting case, because there are just too many possible MRs for each NL, to use my lingo. But at the end of the day, this is what it gets: triples. It gets a natural language sentence; it gets an observed sequence of actions that were executed after the person said that sentence; and it knows the state of the world -- all the objects and the wallpaper and the tile at each of those steps. So you get these triples of language, action sequence, and world state, and that's all it gets. And then it has to learn a correct mapping from the natural language into the actions, given, of course, knowledge of the world state.
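As a concrete picture of that input, here is a minimal sketch, with assumed field names, of the training triples the navigation learner observes.

```python
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class NavigationExample:
    instruction: str           # e.g. "Go forward to the hat rack, then turn left."
    actions: List[str]         # observed primitive actions, e.g. ["forward", "forward", "turn_left"]
    world_states: List[Dict]   # objects, wallpaper, and floor tile visible at each step
```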
Okay. So the overall system we built looks a little bit like this. It first looks at the world state and the trace of actions that were executed and constructs a possible plan for that; we'll look at how in a minute. Then it goes through what we call a plan refinement step, because these constructed plans usually contain too much information -- it's as if someone told you everything you saw at every point, which of course isn't what the instructions do. So we do plan refinement, which actually looks at the language. It learns a lexicon of what it thinks the words mean -- the first person to do this sort of thing was Jeff Siskind, if you know his work, and I had a student, Cindy Thompson, who did things like this too -- learning pieces of representation that seem to correlate with words. Then it uses those word meanings to pull the garbage out of the plan and pare it down to what it thinks the language is actually talking about. It then has pairs of the instructions with the inferred plan, given the lexical knowledge it's learned, and it just trains a supervised semantic parser on those pairs. In testing, of course, you get an instruction; the semantic parser maps it into a formal representation; that then goes to an execution module. MARCO, which Matt MacMahon built as part of his thesis, is the system that can execute those formal plan representations.
And of course it knows the world state, and then it produces an action. So, a little bit about those two things. We basically construct a representation of a plan in two ways. One is you just mention the primitive actions. The other is the landmark plan: after every action, you put in everything in the world that you observe at that point. So you turn left, and then you verify that there's a wall to the left, a lamp to the back, a hat rack to the back, and a brick wall in front of you. This actually isn't even everything, because it doesn't show the tile, but that would be there too. Then you take a step and verify that you're in the wood hall. So this is very detailed -- much more than what the language is actually talking about. We need to figure out how to pare this down to what we think the language is actually referring to. So we need to remove the extraneous details in the landmark plans, and we do that by learning the lexicon and then removing the parts that don't seem to correspond to words that actually appear in the sentence.
So let's go to the example here. Here are two cases. Again, this is a simplified form to fit on a slide; the real landmark plans are somewhat more complicated than this. This one is "turn and walk to the sofa," and the other is "walk to the sofa and turn left." So it takes all the sequences of actions it saw for every sentence where the person mentioned, say, sofa. I really should change this example to something like "couch," because it's not using the fact that the string "sofa" looks like the sofa in the plan -- it could be Korean or Spanish or something else. And we take an intersection of these two graphs, because both sentences use the word "sofa." The word could be ambiguous, but in general it's not a bad heuristic: probably sofa means something that's shared by these two examples. But there are multiple things it could be. It could mean turn left until you're at the blue hallway, or it could mean travel and verify that you're at the sofa, because those are two maximal subgraphs shared by these two plans.
So the lexicon learning looks a little bit like this. It collects all the landmark plans that co-occurred with a particular word and says maybe those little pieces of graph are things the word refers to, and it keeps doing that, adding new entries. Then it ranks the final set of candidates by a simple score which says you want a piece of graph that's very likely given the word but not so likely when the word's not there. And so it learns this sort of ranked lexicon, with pieces of graph as the, quote, meaning of each of the words.
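A hedged sketch of that lexicon-learning step is below. For simplicity a landmark plan is represented as a set of atomic plan components, so "shared subgraph" becomes set intersection rather than a real graph intersection, and the ranking score (frequency given the word minus frequency without it) is just one plausible reading of the score described above.

```python
from itertools import combinations
from collections import defaultdict

def learn_lexicon(examples):
    """examples: list of (set_of_words, frozenset_of_plan_components)."""
    candidates = defaultdict(set)
    for (words_a, plan_a), (words_b, plan_b) in combinations(examples, 2):
        shared = frozenset(plan_a & plan_b)
        if shared:
            for w in words_a & words_b:
                candidates[w].add(shared)   # shared plan pieces are candidate meanings

    def freq(meaning, plans):
        return sum(meaning <= p for p in plans) / len(plans) if plans else 0.0

    lexicon = {}
    for w, meanings in candidates.items():
        with_w = [p for ws, p in examples if w in ws]
        without_w = [p for ws, p in examples if w not in ws]
        scored = [(m, freq(m, with_w) - freq(m, without_w)) for m in meanings]
        lexicon[w] = sorted(scored, key=lambda x: -x[1])   # ranked candidate meanings
    return lexicon
```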
Now, once it has that, it tries to simplify the plan by just looking at the words. So here it says, well, I think "turn left" means this, so pull that out; I think "walk to the sofa" means that, so pull that out. And "and" is the only word left that doesn't have any meaning in the learned lexicon. It then reduces that more complicated plan to this sub-piece, which it thinks is composed of the pieces that correspond to the meanings of the words that actually appeared in the sentence. There's a little bit of pseudocode here on the slide: it selects the highest-scoring lexicon entry at each point, removes that word from the sentence, and keeps doing that until it has exhausted the words in the sentence.
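That greedy refinement step might look roughly like the following, using the same simplified set-based plan representation as the lexicon sketch above; it is an illustration of the idea rather than the exact procedure on the slide.

```python
def refine_plan(sentence_words, landmark_plan, lexicon):
    """Keep only the plan components licensed by words in the sentence."""
    remaining = set(sentence_words)
    kept = set()
    while remaining:
        # Among still-unexplained words, pick the highest-scoring lexicon entry
        # whose meaning is actually present in this landmark plan.
        best = None
        for w in remaining:
            for meaning, score in lexicon.get(w, []):
                if meaning <= landmark_plan and (best is None or score > best[2]):
                    best = (w, meaning, score)
        if best is None:
            break                      # no remaining word has a usable meaning
        w, meaning, _ = best
        kept |= meaning                # keep the plan pieces this word explains
        remaining.discard(w)
    return kept                        # refined plan: a subset of the landmark plan
```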
Okay. So hopefully that gives you a little bit of an idea of how it pares the whole big plan down to the sub-plan it actually thinks the person is talking about. If you're interested, the AAAI paper is up on the website if you want to read more of the details. A bit on the data again.
We didn't collect this data; Matt MacMahon, the previous student -- Misha, I think you said you knew Matt, right? -- collected it, with six instructors and one to 15 followers, on three different maps. They all have the same types of tiles and objects, but they're arranged differently in each of the three. And we do leave-one-map-out cross-validation, which means we train on two environments; we want to see, of course, whether it can generalize to new instructions in a new environment. The same sorts of primitive objects and things are in the world, but they're all arranged differently. So here are a few statistics. You can look at the entire instruction paragraph or at the individual sentences; there are about 700 instructions, with a vocabulary of around 660 words, and single sentences are, of course, a bit shorter. Following the entire paragraph of instructions to the goal usually takes around ten actions.
We did a number of different evaluations; again, I'd point you to the paper for some of these. I'll present results for what I call end-to-end execution: you train it on two maps, where it gets to watch the instructions being followed, then you give it a new map and a new instruction, you say go, and you see if it follows the instruction correctly and gets to the right place. Again, we did leave-one-map-out cross-validation. It's a pretty strict metric: only if the final position exactly matches the goal do you get a point for that problem. We looked at it both at the sentence level -- after each sentence, do you end up in the place you should be -- and at the paragraph level -- after the entire instruction, do you end up in the right place.
We built what I think is a reasonable lower baseline to give you an idea of how hard the problem is: a simple probabilistic generative model of the actions in the plans, so actions that were more frequent in training are more likely to come out of the generative model. You execute that generative model. It doesn't look at the language at all; it just does actions according to a very simple general model of how the plans actually occurred in practice. We thought that was a pretty good lower baseline.
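A minimal sketch of that kind of language-blind baseline, assuming plans are just lists of primitive action names, might be:

```python
import random
from collections import Counter

def build_action_model(training_plans):
    """Estimate how often each primitive action occurred in the training plans."""
    counts = Counter(action for plan in training_plans for action in plan)
    total = sum(counts.values())
    return {action: c / total for action, c in counts.items()}

def baseline_execute(model, n_steps, rng=random.Random(0)):
    # Ignores the instruction entirely; just mimics the training action statistics.
    actions, weights = zip(*model.items())
    return rng.choices(actions, weights=weights, k=n_steps)
```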
For upper baselines, we asked what happens if we manually annotate sentences in this domain with their actual, correct plan representations. Matt also built a system totally by hand, no learning involved, but he hand-engineered it for all of these maps -- three or four, how many? Three, I think -- so it's really like testing on the training data. His scores will seem high, but take into account that that's basically testing on training, or development, data; Matt didn't do a good job of separating development from test. And, of course, there's how well the humans do when they follow these directions.
So, a bunch of numbers here. First is how accurately it can follow a single sentence. The simple generative model, the lower baseline, does the right thing about 11 percent of the time. If we train the system on those basic plans, which just have the primitive actions, it doesn't actually do that badly at the single-sentence level: it gets about 56 percent. If we train it on the landmark plans, it has too much information. I call these the Goldilocks cases: one is trained on too little of a representation of the meaning, the other on too much. So the basic plans actually do pretty well, and the landmark plans do quite a bit worse. If we put in our lexicon learner and refine the plans before producing the training data that goes into the supervised parser, we actually don't do quite as well as the basic plans for a single sentence, which is a little disappointing.
If you train the system on human-annotated plans -- that is, giving the supervised parser gold-standard annotated data -- it only gets about 58 percent. MARCO, because it was sort of overfit to this data, is about 77 percent, and human followers were never judged on their ability to follow a single sentence. If you go to the complete instruction, of course, everything gets quite a bit worse, because if you misparse any of the sentences in the instruction, you're going to screw up and not get to the right place.
So this shows you that the simple generative model gets just two percent: the chance that you just mimic the type of actions that were performed during training and get it right is very low. Then there's training on basic plans. And here's really the best result for our systems -- not that great, but 16 percent of the time it can actually follow the entire set of directions and get to the right place. Notice that here it's actually doing better than the basic plans, because of those verify steps: the executor can actually correct some errors. If the plan says you should see the hat rack after you go one step, and you don't see the hat rack yet, it will keep going until it actually sees the hat rack. So it can fix errors in the instructions themselves by using those verification steps.
If you train on the human-annotated plans, it gets about 26 percent; MARCO, again overfit to this data, gets 55. And human followers only get to the correct destination about 70 percent of the time, because sometimes there are errors -- these are real directions that people generate, and sometimes they're just wrong. So human followers only get about 70 percent correct destinations in this environment.
Here's just a sample parse where the system learned how to parse this pretty complicated instruction completely correctly -- this plan will actually get you to the right destination: place your back against the wall of the T intersection, turn left, go forward along the pink-flowered-carpet hall, two segments, to the intersection with the brick wall. I wonder what the parser got for that; it's in here somewhere. This intersection contains a hat rack; turn left, go forward three segments to an intersection with a bare concrete wall, passing the lamp; that's your goal. And it actually got a plan out of that. Again, it didn't understand everything. Notice what it picked up: it got the hallway, but what didn't it get here -- the pink flowered carpet, it didn't even get that. But people do this redundantly; there's more information here than you need. So the fact that it doesn't pick up all the information is okay. It's still good enough to work.
Okay. So that's the quick tour of these two applications we've looked at to try to do grounded language learning. Again, the system starts with no language; it has to learn everything about the language just by watching things -- either watching a RoboCup simulator game being sportscast, or watching two people, one giving instructions to the other and the other executing a sequence of actions. The current systems are both very passive. They just sit there and watch; hopefully your child isn't like this when he or she is learning language -- they're interacting and doing other things. So we'd like to move to a more interactive, active form of learning, where the learner acts as a follower: it actually participates in the process, tries to follow the directions, and has more interaction with the world, rather than just passively observing two people doing it.
Or it could act as an instructor: it tries to generate things, sees what people do, and learns from generating the language and seeing people's responses to it. Of course, it would want to generate useful things, things it's uncertain about -- there are a lot of active learning ideas that could go into this. And this would, I think, make it a little bit more like how people learn language, rather than just passively observing the world.
More generally, I'm interested in language grounding as a whole. I've done a little bit of work not in simulation -- both of these projects are in simulated worlds because I'm not a robotics or computer vision person -- but I am interested in this general area of integrating language and vision. There was a workshop NSF held about a month ago that I went to, where they threw together half computational linguists and half vision people, and we yelled at each other for a day. The thing is, there's actually an interlingua. I don't understand vision and they don't understand language, but we have an interlingua, which is called machine learning. All they do is machine learning; all language people do is machine learning. So you just talk in SVMs and kernels and graphical models and EM, and you can actually talk to these guys.
But, of course, they generally have to do supervised learning too. They have Turkers -- they can get Turkers to do more fun tasks, because they have them identify objects in images and outline them. They have an easier job getting Turkers to do what they want them to do than language people do. But still, Turking has made it cheaper, but it's still costly. It would be nice if we didn't have to do that.
So part of my vision is: can't we do cross-supervision of language and vision? Use naturally occurring perceptual input to supervise language learning, and use naturally occurring linguistic input to supervise visual learning. So we go back to the '60s at MIT and do the blocks world, where we have this scene and I give it a sentence: the blue cylinder is on top of a red cube. If I'm trying to learn language, I view the sentence as my input that I want to understand and the scene as the supervision I'm going to use to train on. If I'm doing vision, I treat the scene as my input and the language as my supervision. So they each sort of supervise the other, and I can learn both vision and language from the correspondence between them.
So, again, this is just a high-level vision about vision and language. I've done a little bit of work with real video and real images, and I'm learning a little bit more about vision, but I'm having a hard enough time keeping up with both machine learning and natural language, much less getting into vision or robotics. My students love me, right, because I say, okay, if you want to work with me on grounded language, you have to take the computer vision class, the robotics class, the learning class, and the language class. And if they survive that, then it's like, okay, you can work with me on grounded language.
So we've done a little bit of work with closed-captioned video, where you use the closed captions as a very weak sort of supervision for what vision people call activity recognition. We automatically trained activity recognizers and used them to improve the precision of video retrieval. Of course, we did this in soccer. The other nice thing is this allowed me to use my grant money to buy a TiVo machine, because that's the best way to get the closed-caption text along with the video; YouTube doesn't give you closed-caption text. We tried to learn activity recognizers for four different verbs: kick, save, touch, and something else -- I'm not a soccer person. I don't play video games and I hate sports, and somehow I'm doing video games and sports; I don't know how that happened. But it just uses the fact that these words appear -- and most of this is noise, because most of the time when the caption says kick, no one's actually kicking in the image. But there's some probability, so you can use this as a very weak supervisory signal and learn an activity recognizer that's really bad; but if you add it on top of the text, it actually improves the precision of your image or video clip retrieval. This is an AAAI paper from last year, if people are interested.
>>: What's doing the recognition of actions in the videos?
>> Raymond Mooney: We used off-the-shelf stuff; it's stuff by Laptev.
>>: Is it state-of-the-art --
>> Raymond Mooney: If you have -- you know, John -- someone taping their graduate student, say, walking or waving, then it works pretty well. If you run it on real video, it's pretty crappy. It's pretty crappy.
But the point is you can do the retrieval of events just by looking at the words, and then we filter that through the activity recognizers. The activity recognizer is just acting as a filter on the language-based retrieval. And we show -- the numbers are in the paper -- that you can improve the F-measure and MAP scores for your video retrieval by adding the activity recognizer you learn from this crappy signal on top of the linguistic-based retrieval.
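That caption-plus-recognizer pipeline might be sketched roughly as below; the clip fields and the action_prob callable are placeholders for whatever representation and recognizer are actually used, not the paper's exact components.

```python
def retrieve_clips(query_verb, clips, action_prob, threshold=0.5):
    """clips: objects with a .caption string and .features (visual descriptors).
    action_prob: callable giving the recognizer's probability that a clip
    depicts the queried action (a stand-in for the weakly trained classifier)."""
    # Step 1: purely linguistic retrieval from the closed captions.
    candidates = [c for c in clips if query_verb in c.caption.lower()]
    # Step 2: keep only candidates the activity recognizer also believes
    # depict the action, trading some recall for better precision.
    return [c for c in candidates if action_prob(c.features) >= threshold]
```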
Again, that's an AAAI paper from last year if you're interested. So one thing I learned from this whole thing is: don't write your good master's students recommendation letters that are too good, because then they get into Stanford and they leave. Some are now working with Christine's old advisor, Chris Manning. So I have to write crappier letters in the future.
So, current language -- all right, I'm wrapping up here. Current language learning approaches use unrealistic training data, blah, blah, blah. So we've been trying to focus on this idea of learning language from sentences paired with the actual environments the language refers to. Right now we've explored a couple of admittedly simple but, I think, interesting challenge problems: learning to sportscast simulated RoboCup games, where we're actually able to commentate games about as well as humans, and learning to follow navigation instructions, which turns out to be much harder. We're able to accurately follow individual sentences at about the 55 percent level, and complete instructions, as you saw, not so great -- 15, 16 percent.
But I think this whole area of grounded language learning is starting to take off a little bit more. Regina Barzilay has been doing things; Luke Zettlemoyer has been doing things; I've contaminated them with some of these ideas. And I hope more of you get interested in these things, because I think this is the future. Okay. That's just back-up. [applause]
>>: Do you get a digital feed of the closed caption from the TiVo, or did you have someone --
>> Raymond Mooney: It took a while to figure that out, but, yeah. Yeah, like I said, I had to write a special letter: yes, I'm actually buying the TiVo machine to do research.
>>: So are arbitrary amounts of that sort of data available --
>> Raymond Mooney: We recorded it off cable TV or whatever; that's why we got the TiVo box. I don't know -- we didn't do a deep look at this -- but I don't think you can easily find much closed-captioned video online, as far as I know. I mean, there's a bunch of it out there, right, but it's broadcast and it's not on YouTube. So getting that data is a bit tricky unless you buy the TiVo machine and record it live off the TV yourself.
But this is actually an idea from when I first learned of closed captioning, in the early '90s or something. I said, wow, I'd like to do language learning on that. It's just taken me 20 years to do something.
>>: So now, for the sportscasting application, there are video games out there -- I'm kind of a sports and video games [inaudible] -- but I've seen them, and they commentate on this --
>> Raymond Mooney: Sure, but it's all handwritten. If I had to guess, with 99 percent probability those are all hand-engineered systems, right? My system starts with no knowledge of the language, no knowledge of anything; it has to learn it all from scratch. There are actually RoboCup sportscasters that were hand-built, and they're actually pretty good, but they're hand-built: you put in the patterns, and it's a pretty simple template-based text generation problem. The point of this is not, in some sense, the sophistication of the language it generates; it's that it's learning it from this weak signal.
>>: What about a live signal? I know there are setups where you can play multi-player gaming systems and you have a mic and headset, with a close-talk microphone, and people talk and comment.
>> Raymond Mooney: You'd need to collect that data -- and I'd also need a good speech recognition person, which I don't have. But, yeah, if we could do the speech -- those mic headsets give you much better speech recognition if you have the voices.
>>: It's probably going to be cursing and saying all sorts of --
>> Raymond Mooney: That would be a good generator. [laughter] I'd have to output a percent sign, backslash, bang, bang. [laughter]
>>: In real commentary, not a lot of it is very structured with respect to the video.
>> Raymond Mooney: If you look at real sportscasts -- sometimes I call what we do sportscasting for Sesame Street, or maybe for Teletubbies, or whatever's lower than that; what we put in is very literal. Real sportscasts assume you can watch the game. So what I'd like to get hold of is what's called SAP, the secondary audio program, though there I'd have to deal with the speech problem. I knocked my TV into one of these modes back in the mid-nineties, and I'm watching some PBS nature show and there's this other voice coming out: "And now the lion approaches the cub." And I'm like, what the hell happened? I had flipped it into SAP. That is the sort of signal I want. The other thing people have done is look at scripts typed in by fanatical fans. There's this famous paper in vision using Buffy the Vampire Slayer, where every episode of Buffy has been meticulously hand-coded for every action and every piece of dialogue in the entire series, and you can get this data online. People have tried to align that with the video, but that's detailed human annotation. So I think there are other sorts of interesting signals connecting language to video. Normal closed captioning is actually pretty bad, because it rarely talks about what's actually happening on the screen.
>>: Before you --
>> Raymond Mooney: There are also these sites that post online commentary on a game, on the Internet, while it's being played -- I've talked about this with my students. That's an interesting source that we've considered. Another thing we've considered -- actually Percy's doing this, but in a virtual environment -- is cooking shows. I just asked him about this last week in Portland. He's never gotten anything working on it, but he told me about it last year, and I thought it was a cool idea. He built a virtual kitchen, and he just gives people recipes and they have to cook things in the virtual kitchen. Then he wants a system that can be a virtual cook and follow new recipes, but he hasn't actually gotten anything yet. Apparently he set all this up and then got pulled away -- you know, Percy's doing ten different things at once.
>>: I love the idea of grounded language, but it feels like we're suffering from a lack of data. Is there some way to get at it -- do you have ideas about how we can go forward into new domains and have people provide --
>> Raymond Mooney: I sort of like the virtual environment idea, if we can build Sims with language and get people to talk to them. If we built interesting virtual environments where you could talk about things in that virtual world, and got people patient enough -- it takes a long time to raise a baby to learn to talk. I just saw Misha's kid; she's almost one and a half, and she doesn't say a single word yet. It's not an easy problem, right?
>>: You should really be plotting performance on that navigation task against age: how one-year-olds do it, two-year-olds, three-year-olds.
>> Raymond Mooney: Raising a baby in the real world is tough. Raising one in the virtual world, giving it enough linguistic experience to actually learn the language -- I don't know if people would be willing to do that, because it's a lot of data.
>>: You're learning a different brain for the RoboCup games versus the brain for following directions. Is there some way we can leverage this across domains somehow -- some generalization capability that might allow us to do more with less data or something?
>> Raymond Mooney: I'm open to suggestions.
>>: If you want sports data -- I think in India, and probably other countries, they still have actual radio broadcasts, right? So that would be much more literal.
>> Raymond Mooney: I've thought about that.
>>: The commentary [inaudible] exactly what's going on, because it's not assuming that you have --
>> Raymond Mooney: My dad used to always listen to baseball games on the radio when I was a kid.
>>: Presumably you could get the video feed as well. You would have synchronization.
>> Raymond Mooney: No, that's a good idea. We haven't thought about that.
>>: Shortwave radio.
>>: I was saying, along the same lines, football coaches, right after the game, will go through the most literal analyses -- taking the game, slowing it down, and really --
>> Raymond Mooney: Yes, if you did the alignment. But that's, I think, more distilled than a sportscaster trying to keep an audience.
>> Raymond Mooney: Yeah, so I think there are a lot of possibilities, and we've only barely scratched the surface. There are other sources of data -- are any of them perfect? Certainly not.
>>: When you do the EM training, what features are you using in order to learn the best meaning representation?
>> Raymond Mooney: Again, at the end of the day it's just training a supervised semantic parser, so different systems use different sets of features. What we use with the RoboCup data is a system called lambda-WASP; it was the Prague ACL Best Paper Award. It learns synchronous context-free grammar rules to do the mapping. But that, to me, is a separate problem -- I worked on that for 16 years, and now other people are finally working on it. So there are a number of systems out there. Our code is available online; it's not documented or supported, but you can download it, and I think a few people have actually gotten it working.
Okay. Thank you.
[applause]