>> Hoifung Poon: Hi, everyone. It's my great pleasure to introduce Jayant from CMU. So
Jayant is a student of Tom Mitchell, and he's worked on a lot of interesting stuff, most recently
on grounded language learning and also semantic parsing, so he will be talking about -- probably
first on grounded language learning and then, time permitting, maybe a little bit on
semantic parsing, as well. So without further ado, here's Jayant.
>> Jayant Krishnamurthy: Okay, hi. I'm Jayant, and as Hoifung says, today, I'm going to talk
about learning to understand natural language, with a focus on physically grounded
environments. But I'm also going to talk a little bit about some of the other language
understanding stuff we've been working on, because I'm excited about it and I wanted to.
This is joint work also with my adviser, Tom Mitchell, and then Tom Kollar and Grant, who
were at CMU while this was being done. So I just want to motivate this by saying, why do we
need language understanding systems? So here's an example of a language-understanding
problem, I would say. It's a question-answering problem, where maybe you want to be able to
ask some sort of question, and you have some big database of facts, and based on this database of
facts, you want to produce some sort of answer to that question. So today, we have databases
like Freebase, we have things like NELL, which have millions and millions of these facts, so you
can imagine this would be pretty useful for something you could do in a search engine or
something like that. If a user types this in as the query, you can just give them the answer,
instead of going and retrieving a bunch of search pages.
The flipside of this is information extraction, and so, if you know me, I work on NELL, which is
this sort of ongoing information extraction project, where every day NELL goes out and looks at
webpages and finds sentences like this, and then it has this big database of facts, and I'm sort of
trying to grow that database of facts iteratively. So we're kind of interested in this information
extraction problem, and what we might want to do is, say, find a sentence like this, and find that
it expresses these sort of relation instances and add that to some sort of growing knowledge base.
Then you can imagine we can use this knowledge base for question answering sort of
downstream.
So I'm going to briefly talk about some of this stuff at the end of the talk, for like 10 minutes.
But as I said, the focus of this talk is on understanding natural language in grounded
environments. So I think the canonical example of such a problem would be robot command
understanding. So maybe you want to be able to give a robot a command like, go get a pen from
my office, and you have this robot, and the robot knows how to navigate its environment and
knows how to manipulate objects. And you want this robot to do the right thing, given that
you've given it this command. And the reason this is a grounded language understanding
problem is because the meaning of this command is sort of intimately tied to objects and
locations in the real world, right? The robot needs to be able to map my pen onto some physical
pen which is in your office. It needs to be able to map office onto some location in the real
world.
And I'm really excited about these sort of grounded language understanding problems, because I
think they sort of provide us a way to avoid some of the ambiguities of natural language
semantics that we've had for a number of years, which is that we have a number of different
ways of doing semantic representations of language, but they differ in their expressivity, and it's not clear which is the best way to go. Well, these grounded
language understanding tasks provide us with a test bed for all these different semantic theories,
right? We can plug in frame semantics, we can plug in semantic parsing, and we'd say, how well
can I actually answer these commands?
And what's great is, if I created a corpus of 1,000 commands and the robot gets 950 right, who's
to say that that robot doesn't really understand language? Certainly, it understands some subset
of language that's contained within those commands. So that's why I'm really excited about these
tasks. Now, robot command understanding is actually going to be a challenging task, and so in
this talk I'm going to talk about a slightly simpler task, which I'm going to call scene
understanding, which I hope you agree has many of the same sub-problems as robot command
understanding. And so the way that this task is going to work is we're going to be given some
sort of a physical environment -- in this case it's an image -- with segmented objects.
So I'm going to assume that the objects in the environment are known a priori. That's a
simplification. And then we're also going to get some natural language description of some of
the objects, and our goal would be to identify the objects which are described by the natural
language statement. So if you say the mugs, I need to give you the set of both mugs.
Similarly, you can have more complex language, like the mug to the left of the monitor, if you
want to be able to return just that mug which is to the left of the monitor. And to give you a
preview of where this talk is going, I'm going to introduce a sort of model for solving this
problem, which kind of jointly learns to perceive the environment, as well as parse this natural
language text, and using only training data of the form that you see here --you can train it using
only this kind of training data, and it can get about 70% accuracy on this task. That's just a
preview of what's going to happen.
Okay, now, before I talk about that, I want to briefly talk -- so this is the outline of my talk -- and
before I start talking about that sort of physically grounded stuff, I want to briefly talk about
semantic parsing, which is kind of the fundamental technology which underlies, I think, a lot of
this language understanding work. And then I'm going to talk about the scene understanding
application and the model we've developed, and then finally I'm going to spend a brief amount of
time talking about the information extraction work that we're working on, which I'm going to call
broad coverage semantic parsing.
And feel free to stop me if you have questions. I think we can be pretty free form. Okay, great.
So let's talk about semantic parsing. So semantic parsing was initially intended to solve these
kinds of question answering applications, so I'm going to kind of focus on that for the moment.
So let's say we have some question, like what cities are in Texas, and we want to be able to
answer this question, well, how should we do that? The semantic parsing view of the world is,
first, we should take that natural language statement and we should translate into a sort of logical
representation of its meaning. And the point of this logical representation is to abstract away
from syntactic differences in the natural language, different words meaning the same thing,
things like that. And then, once we have the semantic parse, we're going to assume that we have
some big database of facts over here, and we'll be able to execute the semantic parse on this
knowledge base, this big database of facts, to produce the answer to the question.
Now, if you're not familiar with formal semantics -- I imagine. Well, I'll just explain it anyway.
But the way to read these semantic parses is they're going to be functions. This is a function
from a variable X, which you should think about as ranging over the entities in the knowledge
base to a sort of logical statement, which is going to return true or false for each entity. So the
way to interpret this is actually as a set. It's a set of objects for which the function returns true.
So this is really the set of things X, where X is a city and X is located in Texas. So I hope you
agree that that's a pretty reasonable semantic parse to get the answer to that question.
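To make that set interpretation concrete, here's a minimal sketch of executing such a logical form against a toy knowledge base (the entities and facts below are made up for illustration; this isn't the actual system):

```python
# Toy knowledge base: a set of entities plus category and relation facts.
entities = {"austin", "houston", "dallas", "seattle", "texas", "washington"}
city = {"austin", "houston", "dallas", "seattle"}
located_in = {("austin", "texas"), ("houston", "texas"),
              ("dallas", "texas"), ("seattle", "washington")}

# The logical form \x. city(x) AND located_in(x, texas) denotes a set:
# the entities for which the function returns true.
logical_form = lambda x: x in city and (x, "texas") in located_in

answer = {x for x in entities if logical_form(x)}
print(answer)  # {'austin', 'dallas', 'houston'}
```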
And if you use that interpretation, it's straightforward to see how you can execute that on this
knowledge base. This is just like performing a database query. So this is the semantic parsing
view of how we should do these kinds of question answering applications. And I just want to
point out one thing, which is that for this to actually be meaningful, we really need to make sure
that this knowledge base is derived from the real world. So we typically assume we're just given
this big knowledge base of facts and that it's true, but unless these facts are actually true in the
real world, we're not going to get the right answer to our question. So this sort of
correspondence is critical to making this whole pipeline work. Of course, typically, when we're
doing semantic parsing, we're going to assume this knowledge base is known, so really the only
interesting problem is this mapping right here, going from the natural language onto the logical
representation, which is called the logical form. So how are we going to do this mapping?
There's a number of different ways that we can do these mappings from natural language onto
logical form or semantic parses.
Today, I'm going to talk about my favorite way, which is combinatory categorial grammar. So I
really like combinatory categorial grammar, or CCG, for semantic parsing, because it
has this tight coupling between syntax and semantics. So what you can do in CCG is you specify
for every word what's its syntactic function and what's its semantic function. And then if you
can syntactically parse some statement, you can automatically derive what the semantics of that
statement should be. So it's sort of just transparent from the syntactic parse.
So I think that's really nice, and another thing that's nice is that it's a formalism that a lot of
linguists have studied in the past, so if you have some sort of weird linguistic dilemma, you can
kind of go look it up in the literature and say, how should I solve this? How do I deal with
coordination? So that I think is a really nice property. You don't have to invent everything
yourself.
It obviously has its quirks, as well. So let me explain a little bit about how CCG works. CCG is basically a lexicalized grammar formalism, which means that it has a
huge number of different syntactic categories. So if you're familiar with a context-free grammar,
it'll have 40 different parts of speech. In CCG, there's going to be 500, but these parts of speech
have internal structure which tell you how you can combine them with other parts of speech. So
these parts of speech are called syntactic categories. And the simplest syntactic categories are
called atomic categories. They're things like nouns, and there's a handful of these. There's
nouns, sentences, prepositional phrases, noun phrases. That's basically it. There's only four of
those.
The more complicated categories, things like adjectives, you create by basically building
functions out of these atomic categories. So these are examples of functional categories. And
the way to read this is, the intuition here, is that words in CCG behave like functions, which you
can apply to other words to produce parses for the results. And the way we're going to denote
that a word is a function is we're going to use this kind of funny slash notation, where this side -- I guess that's your right side -- is the argument to the function and the left side is what it returns.
So big here is an adjective, because what it does is it takes the noun and it returns a noun.
There's two different kinds of slashes, actually, which denote which side of the current word that
you expect the argument to occur on. So the way I like to remember this is if you look at the top
of the slash, that's the direction that it's pointing. So this means that it's looking for a noun on its
right. Now, we can take the sort of intuition of these functional categories and kind of nest this
sort of function process to produce more complicated things like prepositions. So in here is a
preposition which expects a noun on its right and then returns something which expects a noun
on its left and then returns a noun.
So if we wanted to parse town in Texas, I could use something like this. Now, the corresponding
semantics of these categories is specified in this lambda calculus notation, and what's important
to note is just that for these atomic categories, right, we're just going to talk about them as referring to
sets of entities. So these are all sets of entities, right? These are functions from some entity to
something which returns true or false. And these other categories here, this is a function which
accepts something as its argument, and so there's actually a slot here, this lambda-F, is the
argument to big. So there's this sort of interface between the syntax and semantics, and that's
how parsing is going to derive the semantic parse for the whole phrase.
So let me show you how parsing works. I think that will make it a little bit clearer. So how do
we parse a sentence in CCG? Well, first, let's say this is our sentence. The first thing we're
going to do is we're going to look up in this sort of table I showed you before what the syntactic
and semantic representations of each of these words are. Now, there might be ambiguity about
this. We'll talk about ambiguity in a second. We're going to look at these syntactic and semantic
representations. And then what we're going to do is we're going to combine adjacent words
using a small number of different operations.
Now, intuitively, all these operations are things that you can do on functions. So you can do
function application, you can do composition. By far, the most common operation is function
application, and that's pretty much the only one that you really need for all but the kind of
complicated stuff. So here's an example of function application. What I've done here is I've
applied the word in to the argument Texas, and I've produced here a syntactic category for this whole
phrase, along with a sort of corresponding semantic representation. And if you look at this,
you'll see what's happened is I've plugged in this function for F, and that sort of produced this
resulting semantic representation. So I did function application syntactically and I do function
application semantically, which I think is a nice sort of correspondence there.
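Here's a rough sketch of that correspondence, writing the lambda terms as Python closures; the lexical entries are illustrative stand-ins, not the exact ones on the slide:

```python
# Toy facts, as before.
entities = {"austin", "houston", "seattle", "texas"}
city = {"austin", "houston", "seattle"}
located_in = {("austin", "texas"), ("houston", "texas")}

# Atomic categories (N) denote sets: functions from an entity to True/False.
texas_sem = lambda x: x == "texas"
city_sem = lambda x: x in city

# "in" has category (N\N)/N: it first takes the noun on its right,
# then the noun on its left, and returns a noun.
in_sem = lambda g: lambda f: lambda x: f(x) and any(
    g(y) and (x, y) in located_in for y in entities)

# Syntactic function application mirrors semantic function application:
in_texas = in_sem(texas_sem)           # category N\N
cities_in_texas = in_texas(city_sem)   # category N

print({x for x in entities if cities_in_texas(x)})  # {'austin', 'houston'}
```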
>>: Texas here is being parsed as a noun, not a prepositional phrase?
>> Jayant Krishnamurthy: So this is a function which expects a noun on the left and then returns
a noun, so that's kind of like a modifier to the right of a noun. I think the syntactic categories in
CCG are the biggest obstacle to just understanding how it works, because they get -- I showed
you a handful, but they get really, really messy pretty fast. So it's kind of one of those things
where if you look at it long enough, you kind of start to cache out the important subcomponents
of these different things, and you can kind of quickly look at it and say, yes, that makes sense.
But yes, this is basically like a preposition which has already taken one of the two arguments.
>>: For the interpretation, you could also easily say that you can reverse the order of where do
you expect, right? You basically take one something from the left and one from the right. So in this particular form, you expect something maybe from the left
first and then from the right? But you can also equally say it the other way around and you've
got the different parts. So would that create a lot of ambiguity?
>> Jayant Krishnamurthy: So that's a good question. The question here is basically CCG has in
this category put a certain ordering. It said, I need to take the right argument before I can take the left argument. But you could easily imagine that you want to apply the left argument first
and then apply the right argument after. And so you can do that. That's where these other funny
operations come in, to let you apply the arguments out of order, essentially. But in a sentence
like this, you'll get the same parse either way, right? So I guess maybe the high-level thing to
say is, you can apply the arguments out of order. I don't want to talk about the operations
required to do that. But if you do produce the same sort of parse, it'll give you the equivalent
semantics, so it's only a problem in terms of efficiency of your parser. It's not a problem in terms
of getting the right answer.
>>: My question is really why, when you're inducing the lexicon, because that's sort of the core
of all CCG, do you need some sort of -- because those ambiguities are extraneous. It doesn't
matter why you still [indiscernible]. So do you normally do something?
>> Jayant Krishnamurthy: So you don't need to do that in the lexicon, and there's a couple of
different ways to handle this ambiguity, and maybe it's better to talk about this after. But one
thing you can do is say, I'm only going to use the ambiguity when it's necessary to parse
something, so I'm only going to apply these things out of order if that's necessary to produce a
parse for this sentence. But also, I guess you were asking about the lexicon. You don't need to
encode it in the lexicon at all. You only need this one category for in, and then the parsing
operations take care of the out-of-order application.
>>: So the parentheses indicate ordering, because you're going to take that, for in, you're looking
for the noun to the right first. So I'm most interested in finding out -- so I understand that in the
lexicon we can say in, but that there is a sense of in that's located in, but there are other senses of
the word in, so you haven't talked about those, and we've picked located in, magically. And then
the other one is --
>> Jayant Krishnamurthy: Okay, I'll answer that question in like two slides.
>>: And then why do we get from town to city?
>> Jayant Krishnamurthy: So I've just sort of assumed that this is what's encoded in our lexicon,
so I've assumed that someone's given me a table of the possible mappings here. Now, there's
different ways you can do this, so people have actually developed semantic parsing algorithms,
where if you have the labeled logical forms, it will sort of automatically induce what should go
in the lexicon. That assumes you have the labeled logical forms. The other thing you can do is
you can use some sort of heuristics based on alignments of -- like I see Texas, I know Texas is a
city, maybe I see a city like Texas, town -- not Texas. If I have an appositive with something
that I know is a city, I can guess that. I agree this mapping is nontrivial to produce. I'm
currently assuming we're going to do that heuristically.
There's other things you can do, actually, so in the grounded work, what we're going to do is
we're just going to say I'm just going to use the word lemma essentially as this predicate. So that
lets you kind of abstract away from plurals but not worry about this mapping onto these discrete
predicates, which is a hard problem. And ambiguity is going to show up in a second. Okay,
great. Good. So, okay, I can repeat the function application procedure to produce the parse for
the whole sentence here. Now, there's ambiguity, right? So I could have multiple different
mappings for the words onto these sort of lexicon entries. I could have multiple different ways
of applying these rules, which will lead to things like prepositional phrase attachment
ambiguities, things like that. So we need some way to decide which parse we're going to get,
given some input sentence. And like good machine learning people, what we're going to do is
we're going to parameterize these parses, we're going to have some features, and we're going to
learn some weights which produce the right parse. So here's sort of my schematic view of how
we're going to do that.
So, first of all, let S be the sentence we're trying to parse, let L be this whole parse-tree structure.
We're going to define some features of the parse and the sentence, so these could be things like,
count how many times each individual one of these lexicon entries shows up. It could be things
like count how many times I apply a preposition to a noun that's one word away on the right. We
can kind of come up with anything that's local in this parse-tree structure for these features. And
so we'll have these features, and what we're going to do is we're just going to learn some weights
which say, okay, which features should I prefer in good parses? And so in this kind of view, we
can parse a sentence by solving this optimization, and you can do this sort of using a dynamic
programming, CKY kind of thing. So I have some sentence S here, and I'm trying to find the
best parse for this sentence. That's what the argmax is doing. So this is kind of like a
parameterized combinatory categorial grammar. Is that good? Okay.
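Schematically, that parameterized CCG is just a linear score over parse features plus an argmax. Here's a toy sketch; candidate_parses and featurize are stand-ins for the real grammar and feature functions, and a real parser would search the parse space with CKY-style dynamic programming rather than enumerating it:

```python
def score(weights, features):
    # Linear model: theta . phi(L, S), with features as a dict of counts.
    return sum(weights.get(f, 0.0) * count for f, count in features.items())

def best_parse(sentence, candidate_parses, featurize, weights):
    # argmax over parses L of theta . phi(L, S)
    return max(candidate_parses(sentence),
               key=lambda parse: score(weights, featurize(parse, sentence)))
```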
So how do we learn these things? I'll talk briefly about supervised learning for these things. I'll
say there's other ways to learn semantic parsers. I know Hoifung is working on some totally
unsupervised ways to learn semantic parsers, not using CCG. But let's imagine for a second that
we have training data of the form, here are some questions and here's their logical forms. But
what we can do is we can just treat this as a supervised learning problem. And so we can solve
this learning problem in a number of different ways, but for example, we could do the structured
perceptron algorithm to train the parameters. So here, what we'll do is we'll basically just -- we'll
take each sentence. We'll parse it. We'll predict some parse, and then we'll say -- we'll try and
move the parameters towards the features of the correct parse but away from the features of the
predicted parse. So essentially, every time you get something wrong, you're going to try and say,
instead of predicting the thing you predicted, predict the right thing. So this is one way you
could train the parameters. You can also do things like maximum likelihood. Your favorite
objective function can be plugged in here. So that's supervised learning of these CCGs. Yes.
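As a sketch, one structured perceptron update for this setup might look like the following (same stand-in candidate_parses and featurize as above; this is illustrative, not the actual training code):

```python
def perceptron_update(weights, sentence, gold_parse, candidate_parses, featurize):
    # Predict the best parse under the current weights.
    predicted = max(candidate_parses(sentence),
                    key=lambda p: sum(weights.get(f, 0.0) * c
                                      for f, c in featurize(p, sentence).items()))
    if predicted != gold_parse:
        # Move towards the features of the correct parse and away from
        # the features of the predicted parse.
        for f, c in featurize(gold_parse, sentence).items():
            weights[f] = weights.get(f, 0.0) + c
        for f, c in featurize(predicted, sentence).items():
            weights[f] = weights.get(f, 0.0) - c
    return weights
```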
Okay, so that's the end of my semantic parsing intro chunk. Now I'm going to talk about how we
can use semantic parsing to do these sort of physically grounded problems. Now, remember that
the problem I was talking about was this one here, where we're going to get some image with
some segmented objects. We're going to get some descriptions, and our goal is to predict the set
of objects described by that description. There's a couple of things I want to point out about this
problem, which distinguishes it from some things people tried in prior work. So, first of all, I
want to point out that the output is a set. If you say the mugs, I want you to get two mugs. So
this is kind of different from some work in robotics, which just kind of assumed that every word
refers to exactly one thing.
Another thing I want to point out is that there's no knowledge base. If I were training a semantic
parser, I'd assume that I had some knowledge base which encoded all of the information about
this image, and I'd just parse into some formalism that could use that knowledge base. But here
there's no knowledge base. There's just the image. And, finally, we're going to consider both
one-argument predicates, things like mugs, and also relations, or two-argument predicates, like
left. So here, left is encoding some relationship between the mug and the monitor, so we need to
look at two things to kind of get that right. Okay, so how are we going to solve this problem?
Well, we propose this model, which we call logical semantics with perception, which kind of just
takes the semantic parsing view of the world and says, okay, I don't know what the knowledge
base is. Why don't I try and figure that out automatically?
So here's the way that our model's going to work. We have as input -- we have the language.
We have this environment, which is this image, and we have this output, which is the set. And
what we're going to do is we're simply going to semantically parse the input. Now, here, you'll
notice that I've -- there is some set of predicates now. Just assume we've made these up. What
we actually do in practice is we just look at the words in the language and say each word maps
onto a predicate. That's its word lemma. So mugs will map onto mug, we get some sort of
sharing here. But it's not like there's some fixed predicate vocabulary. There's no discrete mapping onto
those predicates anymore. It's very easy to do this.
>>: Can you give the number of arguments for each predicate, as well?
>> Jayant Krishnamurthy: What we do is we part-of-speech tag and then we kind of just guess.
So actually what ends up happening is with words like left, you'll end up with two different ones,
where one's a one-argument predicate and one's a two-argument predicate. And then you have to
disambiguate. Okay, so next, what we're going to do is we're going to say, okay, so I have this
environment. I'm going to try to produce a knowledge base from that environment. So here, I'm
going to take for every category predicate that we've invented, I'm going to say here's the set of
objects which are elements of that category predicate. For every relation predicate that we've
invented, I'm going to say here's the set of object tuples that are elements of that predicate. So
this is like the set of things where the first element is to the left of the second element. So once I
have this knowledge base, it's straightforward to produce the output, right? Because I have the
semantic parse, I have the knowledge base. I do my database query and I get the output set.
>>: The one that you need is not part of your perception of left.
>> Jayant Krishnamurthy: Yes, that's me just not putting in slides correctly. Let's assume that it
were.
>>: Do you mean to have that be an exhaustive list of the things that are perceived?
>> Jayant Krishnamurthy: Yes. It's hard to do for left, because there's a lot of them.
>>: So that's why I was curious whether you needed it to be exhaustive or not.
>> Jayant Krishnamurthy: I do. I do. I mean this to be the complete set of all the categories.
Okay, so this is our proposed model, and you'll notice this looks a lot like the semantic parsing
view of the world. The big difference is just that instead of having one learning problem, we
now have really two learning problems. We need to learn the semantic parser. We also need to
learn to do this perception thing to produce these knowledge bases. And remember that this is
deterministic, so we don't need to worry about learning this. This is just I run that query on this
database and I get some output.
Okay, so you can actually just take this picture and kind of think about this as a graphical model,
where these things are the nodes and these are the edges. If you do that, you'll end up with a
sort of -- you can kind of render that into math, and it looks something like this. So we have a
model, the model has three components. There's the semantic parser, which gets as input sort of
this language, and our job is to kind of score these different semantic parses for that language.
We have this perception function, which is going to take the environment here, and our job is to
produce this sort of knowledge base. And then we have this sort of deterministic evaluation
thing, which just takes that knowledge base, and it takes that semantic parse, and it says here's
what the output is, given that you've produced those two things already.
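In rough notation (mine, not necessarily what's on the slide), with s the sentence, e the environment, l the semantic parse, K the produced knowledge base, and d the output set of objects, the factorization being described is something like:

    score(l, K, d | s, e) = f-parse(l, s; theta-parse) + f-perception(K, e; theta-perception) + f-eval(d, l, K)

where f-eval is the deterministic evaluation component: it contributes zero when d is exactly the result of evaluating l against K, and rules the assignment out otherwise.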
So both of these two components need to be learned, right? We have some parameters, theta-parse, for the semantic parser, and we have some parameters, theta-perception, for the perception
function. I'm going to talk briefly about how these things are parameterized before I talk about
the learning. And the parameterization here, I've sort of already told you what we're going to do.
We're going to have some features of these semantic parsers, and we're just going to train the
linear model like that up there. For the perception function, this looks like it could be potentially
tricky, right, because there's a complicated structured knowledge base thing. So let me talk
about how we're going to parameterize that.
If you look at this knowledge base the right way, I think, it turns out there's a very easy way to
think about parameterizing it. And the way to look at it is this. The knowledge base is really just
the collection of binary classification decisions. So I have a predicate like mug. I want to know
for each object in the environment -- I've omitted the table, but for each object in the environment --
is it a mug or is it not a mug? And my job is just to produce this set of binary classifications,
similarly for a predicate like blue.
So if you think about this way, what we can say is, okay, I'm just going to train a single binary
classifier per predicate to predict the right set of objects. I'm going to classify every object with
that classifier, and that's how I'm going to produce this set. So basically what we're going to
assume is that we have some features that we can compute of these image segments. That's what
phi-cat is here. And I'm going to train some classifier parameters for, say, mug, beta-mug, and
this is going to be like my classification decision equation for that classifier. And, similarly, I'm
going to assume that I have some features of pairs of bounding boxes for the relations, or pairs of
image segments, and I'll train one classifier per relation predicate to say, okay, is this thing left of
that thing?
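A rough sketch of that perception step, with one independent linear classifier per invented predicate (the feature functions, weights, and object representation here are all stand-ins):

```python
import numpy as np

def build_knowledge_base(objects, category_weights, relation_weights,
                         phi_cat, phi_rel):
    kb = {}
    # Category predicates, e.g. "mug": classify every object independently.
    for pred, beta in category_weights.items():
        kb[pred] = {o for o in objects if np.dot(beta, phi_cat(o)) > 0}
    # Relation predicates, e.g. "left": classify every ordered pair of objects.
    for pred, beta in relation_weights.items():
        kb[pred] = {(o1, o2) for o1 in objects for o2 in objects
                    if o1 != o2 and np.dot(beta, phi_rel(o1, o2)) > 0}
    return kb
```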
>>: So then you did the same thing as you did for the text, where any noun you considered is
going to be the predicate, and any something else is a two-position predicate?
>> Jayant Krishnamurthy: Right, so the pipeline is we look at the text, we invent the set of
predicates, and then there's a parameter for each of those predicates that we invented here, and
there's also the mapping and the lexicon there. Yes, we invent the predicates first. Okay.
>>: So you put any structure among all those different classifiers? So, for example, when you
have two predicates, maybe they meet in the same -- I mean, post hoc, obviously. So do you try
to impose these?
>> Jayant Krishnamurthy: There's no structure. It's all independent classifications. I can talk
about that later, but I think there's actually an interesting point there, which is it's hard to impose
the structure, because you don't know what it is a priori, and then it turns out if you have enough
data, this will work well either way, so do you really want the structure? It's not obvious to do
that in a way that kind of makes sense, I think. I don't know. Okay, so --
>>: I'm sorry. I want to drag you back to the last slide.
>> Jayant Krishnamurthy: Sure.
>>: I feel like I missed something -- or maybe one more. I'm sorry. I can see how you get
predicates for each of the nouns. However you have this lambda-X which seems to correspond
to the indefinite article. There's a mug and exists Y that corresponds to the definite article the. It
feels like there's a lot more complication in the instantiation of that first-order logic statement
than just inventing predicates for each word. I mean, there's a number of words that are omitted.
There's a bunch of structure that's induced, the use of [indiscernible] instead of disjunctions.
There's scoping. Yes.
>> Jayant Krishnamurthy: So that's a good point. There's definitely -- so all of this is done by
part-of-speech heuristics, and there's multiple possibilities for these things. So what we're really
doing is we're looking at each word, and we're saying this is a noun, so there's possibly the thing
lambda-X, mug-X, and this is a preposition, so maybe I have the thing which takes an argument
on the right and quantifies that out and gives me the thing on the left, right? And you can do that
by kind of just looking at the part-of-speech tag, saying, okay, it's tagged as IN. I'm going to
give it the preposition lexicon entry. Honestly, that part is not super-interesting. It's like a thing
of rules I wrote down, and it kind of just invents some stuff.
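As a toy illustration of the kind of part-of-speech heuristics being described -- over-generate candidate lexicon entries and let learning sort them out -- something like this (the rules and entries are made up, not the speaker's actual table):

```python
def candidate_lexicon_entries(lemma, pos_tag):
    entries = []
    if pos_tag.startswith("NN"):   # nouns -> one-argument category predicates
        entries.append(("N", f"lambda x. {lemma}(x)"))
    if pos_tag == "JJ":            # adjectives -> noun modifiers
        entries.append(("N/N", f"lambda f. lambda x. f(x) and {lemma}(x)"))
    if pos_tag == "IN":            # prepositions -> two-argument relation predicates
        entries.append(("(N\\N)/N",
                        f"lambda g. lambda f. lambda x. f(x) and "
                        f"exists y. (g(y) and {lemma}(x, y))"))
        # Over-generating several entries per word is expected; learning
        # picks the ones that lead to good parses.
        entries.append(("N/N", f"lambda f. lambda x. f(x) and {lemma}(x)"))
    return entries
```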
>>: I totally agree it's not interesting, especially when you have one, two or three predicates in a
sentence, but I think when we start scaling up to more complicated situations -- I don't know. I
would love it if this stuff is -- you can basically really infer it from syntax and it's no problem at
all, but do you believe that --
>> Jayant Krishnamurthy: Remember that this is learned, right. So what we're trying to do is
invent the space of possible semantic parses, and then we're trying to learn to pick the good
semantic parses.
>>: Yes.
>> Jayant Krishnamurthy: So there's some room for kind of over-generating these things, saying
I'm going to make a bunch of different things for these nouns. I'm going to try a bunch of
different things for these relations, and saying, which parses actually do a good job? That overgeneration is kind of what we're counting on to get the right answer here. We're not trying to
pick exactly this parse automatically, heuristically, from that text. Okay. I'm going to keep
going. Good.
Let me just say a word about how these things are parameterized. So if you look at this you
realize, actually, we can apply this model to a bunch of different other things. It doesn't
necessarily have to be this image understanding problem. Really, the only thing that we count on
for the environment is having these feature functions. So if you have any sort of application
domain, where you can kind of say, here are the objects that I think you need to reason about and
here are some good ways to calculate features of them, you can actually apply this model. But
I'm sticking with the images, because I think it's more concrete. And for the image domain, what
we're using here is we're using a histogram of oriented gradients, which kind of captures the edge
information, and we're also using a color histogram. And then here we're using spatial relation
features that capture the orientation of the two objects relative to each other.
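As a sketch of the kind of features being described (assuming each object comes as a cropped RGB patch with 0-255 values plus an (x, y, w, h) bounding box; the actual feature set in the work is richer than this):

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog

def category_features(patch_rgb):
    # Edge/shape information via a histogram of oriented gradients,
    # plus a coarse 8x8x8 color histogram.
    hog_vec = hog(rgb2gray(patch_rgb), orientations=9,
                  pixels_per_cell=(8, 8), cells_per_block=(2, 2))
    color_hist, _ = np.histogramdd(patch_rgb.reshape(-1, 3).astype(float),
                                   bins=(8, 8, 8), range=((0, 256),) * 3)
    n_pixels = patch_rgb.shape[0] * patch_rgb.shape[1]
    return np.concatenate([hog_vec, color_hist.ravel() / n_pixels])

def relation_features(box1, box2):
    # Simple spatial cues between two (x, y, w, h) bounding boxes:
    # relative offset of the centroids plus an overlap indicator.
    (x1, y1, w1, h1), (x2, y2, w2, h2) = box1, box2
    offset = np.array([(x1 + w1 / 2) - (x2 + w2 / 2),
                       (y1 + h1 / 2) - (y2 + h2 / 2)])
    overlap = float(x1 < x2 + w2 and x2 < x1 + w1 and
                    y1 < y2 + h2 and y2 < y1 + h1)
    return np.concatenate([offset, [overlap]])
```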
>>: Do you also have just an indicator feature for this bounding box?
>> Jayant Krishnamurthy: No, because if I give you a different environment, it's going to have
completely different objects. So the objects aren't preserved across environment.
>>: Do you think it might learn to answer future questions?
>> Jayant Krishnamurthy: So in our evaluation, we actually only test on unseen environments,
so those features wouldn't be very useful. But, yes, you could get object identity as a feature
somehow. You could definitely plug that in. It just I don't think makes sense for our particular
application. Okay, so let's talk about training this thing. Here is one way I could train the
system. I could sit down, I could annotate the right semantic parse for every piece of input
language. I could also annotate the right knowledge base for each environment. And then, once
I've annotated these things, I just have two independent learning problems. I can train the
semantic parser, like we talked about before, and then this is just training a bunch of SVMs,
essentially, to distinguish which things are elements of which predicate.
So this is potentially one way we could train the system, but I'm going to argue this isn't a very
good way to train the system, because you're going to have to sit down and annotate all these
parses, which is pretty painful. You're going to have to do that for all these knowledge bases,
also pretty painful. I don't think we really want to do this. I think what we really want is a way
that we can train the system using naturally occurring kinds of supervision. So what we're going
to do is we're going to propose a weakly supervised algorithm, which is going to be trained using
just the natural language, the environment, and the right answer, the set of objects that that
language actually refers to. So I'm going to argue this is pretty natural, because if you can
recognize a pointing gesture, if someone points at an object and they describe it, there you go,
now you have a training example.
>>: Does the environment come already with the bounding boxes and everything?
>> Jayant Krishnamurthy: Yes. We've assumed that the bounding boxes come in the
environment. That's an oversimplification of the truth, but we need to start somewhere. Okay,
so how are we going to train this? Well, we're just going to treat these variables as unobserved,
and we're just going to do stochastic gradient descent. So, basically, what we do is we set this up
as a maximum-margin Markov network, which is kind of a generalization of an SVM, and I'll
just show you how the training works by looking at one gradient update. So let's pretend this is
our training example, so this is an iterative procedure. So we're going to look at each example
independently, and we're going to do a gradient update based on that, and then we're going to
kind of iterate over the whole data set. So let's just look at one training example over here. The
first thing we need to do is we need to solve these two maximization problems, which is
basically here we need to predict, given these two inputs, what's the best knowledge base, semantic parse, and output under the current model parameters. That's what this part is. But then there's also this funny cost term which we have to add in, and that's what makes it a maximum-margin Markov network. But this is basically just what's the overlap between the correct thing
and the predicted thing. And then the next thing we need to do is we need to figure out what the
best knowledge base and semantic parse are which actually explain why you got this answer. So
we're conditioning on observing this. We want to find these two things. Now, unfortunately,
these maximizations are hard to solve. Typically, in this model, inference is actually easy,
because we have this deterministic evaluation component. If you just do inference in the
knowledge base and you do inference in the semantic parser, you can just combine those results
to produce what the right output should be. But the second you start adding some constraints on
what the output should be, inference can become hard, because now the knowledge base that you
predict is coupled to the semantic parse that you produced. So we have an approximation for
solving these problems, and what we do is we basically do a beam search over the different
semantic parses, and then given the semantic parse, you can solve a sort of little ILP to solve for
what the best knowledge base is.
So in practice we can train on our domains; we can do cross-validation in an hour, so it's not
too bad. But the inference problem will become problematic if we try and study domains with a
large number of entities.
>>: So it's a cost function?
>> Jayant Krishnamurthy: No, so it's a hamming cost, so it's basically for every object that you
predict which is in the true set, you get a cost of one. Here, you're trying to encourage the model
to predict something wrong. Sorry. I guess for everything that's not in the true set, you add a
cost. But if you ignore that, it's just like a structured perceptron kind of update. It's just this that
makes it the max-margin Markov network.
Okay, so once we have these values, how do we do our gradient update? Well, what we're going
to do is we're just going to update -- this is the same structured perceptron update, essentially,
that I showed you before. We're going to update our parameters towards the value of the
semantic parse, which actually got the right answer, and we're going to do it away from whatever
parse we predicted here with the cost. Similarly, for the knowledge base, for the perception
function, we're going to do a similar thing, and here it's just going to factorize into sort of an
update per predicate classifier. So let me just show you the example for one predicate. Let's
pretend this is what we predicted. This is a value from this optimization, and this is the correct
value from this optimization. Essentially, what we're going to do is we're going to
update towards entries which are in here and not in there and away from the entries which are in
here or in there but not in here. Intuitively, we want to produce this thing, so if we predicted
something in here, we should be trying to move away from that, and if we didn't produce
something in here, we should be moving towards that. This is also just the sub-gradient update
for an SVM, so it's like a -- it's actually like three SVM updates at the same time. We're just
adding them all together at the same time. So this is sort of I think pretty standard. Questions
about parameter estimation? Good?
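For one predicate, the perception-side update being described might be sketched like this (predicted_set comes from the cost-augmented prediction, conditioned_set from the maximization that has to explain the correct answer; all names are illustrative):

```python
import numpy as np

def perception_update(beta, objects, predicted_set, conditioned_set, phi,
                      step_size=0.1):
    # Subgradient-style update for one predicate's classifier weights.
    for obj in objects:
        feats = phi(obj)
        if obj in conditioned_set and obj not in predicted_set:
            beta = beta + step_size * feats   # should have been classified positive
        elif obj in predicted_set and obj not in conditioned_set:
            beta = beta - step_size * feats   # should have been classified negative
    return beta
```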
>>: Parameters for the other [indiscernible] for the other connective statements in the first-order
logic?
>> Jayant Krishnamurthy: For the other connective statements?
>>: Yes. Should they do a parameterization over whether it should be an existential quantifier
or a lambda expression or the area of --
>> Jayant Krishnamurthy: So that comes out of the semantic parse parameters, which was
the previous update. The semantic parses, I take the input language and I produce a logic with
them, so all that stuff goes in that previous update.
>>: I guess I should read your paper to see specifics.
>> Jayant Krishnamurthy: Yes, I guess you might have missed the part where I talked about
how the semantic parsing worked, but we can talk about that later. Okay.
>>: So the picture you use -- I mean, for the object, you actually can use not just the [HOG], right? Because whatever system gives you the bounding boxes, it will also give you
an object -- so right now, here, you just use the raw vision feature?
>> Jayant Krishnamurthy: Yes. Here, we're just using raw vision features. I mean, if you have
a better object recognition system -- we actually aren't using an object recognizer. Tom Kollar
went and just annotated the bounding boxes for this data set. So there's no automatic system
that's providing additional labels here, but the feature -- this parameterization is generic, right? If
you have better features, you can use them.
>>: What about the features on the left? So for those kind of relational features, what exactly
are the features that you use?
>> Jayant Krishnamurthy: So we're basically using a spatial orientation. So there's these AVS
directional features, which I don't really understand, but Tom Kollar does. And there's a bunch
of things like do the bounding boxes overlap, is the centroid above that centroid? There's just
kinds of heuristic things like that which try and capture the orientation of the objects.
>>: [Indiscernible].
>> Jayant Krishnamurthy: Well, you can. It's going to be hard to distinguish visually from on,
but if you say it's in front, you'll get like, does it overlap? Yes. Things like that in the feature
vector. I guess if you totally occluded the object, it would be a problem. Okay, so this is our
weakly supervised training procedure, and now I'm going to talk about some of the experiments
we've done with this model. So what we've done is we actually created two data sets. So
remember, I was saying this model is generic, but we created two data sets to try it out. The first
data set is the sort of data set we've been using examples from all the time, where basically we
took a bunch of images of the same set of objects -- the same mug, the monitor, the table -- but they're rearranged in different configurations. Obviously,
if I move the object around, it does look different. So it's not like there's identity of objects
preserved across the domains. And what we did is we collected some language from our
research group and also from Mechanical Turk, and then we annotated what the right answer is
for each statement, and this contains about 15 images and about 300 descriptions, so there's
multiple descriptions per image.
And then we also created this geography data set, so the traditional semantic parsing data set is
this Geoquery data set, so this is kind of our mock version of Geoquery, but it's a grounded
Geoquery. And the way it works is, as input to the system, it actually gets these states and cities
and national parks and whatever, but they're all represented as these polygons of latitude-longitude coordinates. This image that you see here is pretty close to what the system is actually
observing. It's actually observing this outline of this thing, and this data set has cities. This is a
national park. It's some forest. We basically collected questions about these maps and then we
annotated the right answer, and again, there's like 10 environments and 250 questions. And I'm
going to kind of focus on the scene data set for most of the experiments, because I think we're
most familiar with it, but I'll also present the results for this sort of just briefly at the end.
Now, let me talk about how we're going to evaluate the system. What we're going to do is we're
going to do leave one environment out cross-validation. So we have 10 environments. We're
going to hold out one of them. We're going to train the system on nine of the environments and
all the descriptions, and then we're going to take that held-out environment and we're going to
have some test examples like this. And the way we're going to evaluate, we're basically going to
have the system predict some set of objects, and we're going to annotate that set of objects as
correct if it exactly matches the annotated set. So in this case, these are the same, so this would
be correct. This would obviously be wrong, but this is also wrong. You don't get any partial
credit for including the right object in the set.
So if you think about this, you realize this is actually going to make it a kind of challenging task,
because the number of sets of these objects is going to grow exponentially in the number of
objects. So here we have four objects. That means there's 16 outputs, but if I had five, that
would be 32, etc.
>>: Five objects?
>> Jayant Krishnamurthy: So in this data set, I think there's four to five objects per image. In
the other data set, it's like somewhere -- there are sometimes like seven. There are -- the paper
actually has the statistics on that, and also I'm going to show you results from a random baseline,
so you'll have some idea of what the average is across the different domains.
>>: You said there were 10 environments and then a bunch of images per environment.
>> Jayant Krishnamurthy: Sorry, no, no, no. Each image is an environment.
>>: Oh, and then so you have one image and then a bunch of statements.
>> Jayant Krishnamurthy: And there will be like 10 to 15 different statements for that image.
>>: Okay, and all the environments are the computer mug, whatever, setting.
>> Jayant Krishnamurthy: Yes, in this data set, they're all the computer, the mug, they're
rearranged, those kinds of things.
>>: Why don't you give partial credit? Is this just for simplification?
>> Jayant Krishnamurthy: Yes, so that's a good question. Initially, when we submitted the
paper, we had two different metrics, where one wasn't giving partial credit, one was giving
partial credit, and then the reviewers got really confused, and just to simplify it, we kind of got
rid of one. And that's basically it. And we also wanted to put in some error analysis and stuff
which didn't fit, but you can imagine doing one where you say, I'll measure the overlap or
something.
>>: In your training, you have this hamming [loss].
>> Jayant Krishnamurthy: Yes, there is.
>>: Do you allow any numbers? Is it the two mugs or the one mug, numbers that would affect
the final set?
>> Jayant Krishnamurthy: So you can say it, but our parser lexicon ignores it. So we'll invent I
think a predicate for two, but the way our semantic representation is put together, we can't
actually detect sets of two objects. That's not a thing we can parameterize. So, actually, let me
talk about that real quick. Let me show you the results here. What I'm going to do for the -- the
experiments are designed to do basically two things, okay? First of all, we want to see if we're
really learning both the categories and the relations here. So previous work has looked at similar
models without considering the relations, and so including the relations is one of the things we're
doing, so we were interested in seeing if that works. And the other thing we want to do is we
want to see if weakly supervised training is competitive compared to, say, like fully supervised
training. We want to know what the performance loss is there. And what I'm going to do is I'm
just going to show you these accuracy numbers, like based on that previous metric that we saw,
and I'm going to break it up into three different categories of natural language, which is getting at
your point here. Which is I'm going to say, do you need to understand any relations to answer
the question? If not, it falls in the zero-relation category. Is there one relation? It falls into that
category.
Then there's this other category, which is things with two relations, but those are pretty rare. It's
mostly things like superlatives or numbers, where the number is actually important, which
actually can't be represented by our model as we currently put it together. This requires some
universal quantifier over the distance between two things, or two would require something that
could look at all sets of size two, which we can't represent. So that's what's going to go in this
other category. And so let's ask how does the algorithm that I've been talking about the whole
time, with the categories and the relations and the weak supervision -- how does that actually do?
Well, overall, we get about 67% of these queries correct, and if you look at the results, you'll see
that for the categories and the relations, we're actually doing better than that. But the fact that we
have this whole other category is bringing us down. So conceptually, we're not going to be able
to do well in this category, just because we don't have the representational power required to
capture that.
The next baseline that we're going to introduce is this category-only model, which is roughly
based on the work of Cynthia Matuszek et al., and we're going to train that, again, using this
weakly supervised algorithm. And what you can see here is that overall it does worse, and the
reason it does worse is because you can't model these relational things. Now, that's not
surprising, but there's a possibility that if someone uses the right noun at the first word, you can
just guess what the right answer is. So there's a possibility that you didn't even need to
understand relations to solve these questions. So this actually also demonstrates that relational
knowledge is important for our data set. This one doesn't have relations like left of. It doesn't
have relations like left.
>>: So your lexicon only has single --
>> Jayant Krishnamurthy: Yes. It only has one-argument predicates, which I agree it seems
unfair in some sense. It's just we're trying to replicate this previous thing. Now, the final thing is
we're going to do the whole model trained with full supervision. So here, we actually annotated
all the semantic parses and all the knowledge bases for every environment, for every natural
language statement, and we trained the semantic parser and the classifiers independently, like we
talked about, and the results are about the same as the weakly supervised algorithm. So we're
getting about 70% of the queries right. We're doing a little bit better on these relational queries,
but aside from that, the results are pretty comparable. So this is promising. This suggests that
this weakly supervised algorithm performs competitively to the fully supervised case.
Now, I'll just point out that the similar results kind of hold true for the geography domain, so
again, these numbers are pretty comparable, and this category-only model does a lot worse.
>>: In the previous one, with the relation, or that relation, the blue mug example, in the first
column.
>> Jayant Krishnamurthy: Yes.
>>: The reason you're getting them wrong there, is it because of the parser? Is it because you didn't detect the blue correctly? Is it something you distinguish as --
>> Jayant Krishnamurthy: It's the detectors that are mostly the problem. So we do -- actually, our
paper has an error analysis, but a big part of the problem is you only have a small number of
examples for each of these different words, so it's kind of hard to train an object detector from
200 examples or whatever, and a lot of the words occur much less frequently than that. The
actual vision side of the system is definitely the weak link, I would say.
>>: Let's talk about how performance varies between set size, how many descriptions per scene
you need.
>> Jayant Krishnamurthy: That's a good question. I do not have an answer for that. We could
probably run an experiment.
>>: Do you have any information about how many [entities] you're considering, how many
colors, how many relations, those kinds of things?
>> Jayant Krishnamurthy: I do, but not here. Again, in the paper somewhere.
>>: [Indiscernible].
>> Jayant Krishnamurthy: So we did a cross-validation, so it's about 250 to 300. This is I guess
270, something like that. I forget the exact number. These aren't just variances, right? You
have 300 examples.
Similar results for the geography domain. Let me show you some examples. So these are some
examples of what the model predicts. This is our input. We have some description, like monitor
to the left of the mugs. We are not getting this one right. You'll notice we're not doing the two
particularly well. There's also some question about whether this mug is actually blue. I don't
know. So we're not getting that one right. We have some -- sure.
>>: So here, the full supervision did worse than the weak supervision?
>> Jayant Krishnamurthy: Yes.
>>: Is that just not significant?
>> Jayant Krishnamurthy: I think it's significant, but barely. I'm not reading too much into this
3% performance difference. Yes. It's a good question. I'm not sure why they're different. You
can imagine that we just happened to find slightly worse parameters for guessing on the test set
or something, right?
>>: Can you [indiscernible] and test side? Do you have full statements in there?
>> Jayant Krishnamurthy: Full statements?
>>: In either, when you have the description for training, or do you have [indiscernible]?
>> Jayant Krishnamurthy: No. The data in these domains is actually very, very clean. They're
manually made data sets, but there are these examples which you just can't get right, so those
kind of do act as noise in a way. Like, if you say closest, the model is going to try and learn to
predict this correctly, but it just can't. These data sets, they're clean.
>>: What would happen if I were to inject noise in the training set?
>> Jayant Krishnamurthy: I don't have a great idea. I mean, it's a little -- yes, there's potential
for it to be bad.
>>: So I'm a little bit curious about the [cat-only] weak one, because in the other domain, for the unit predicate one, it actually performed better, but why here is it so
much worse?
>> Jayant Krishnamurthy: So that's a good observation. This data set has more relations, like
more queries with relations in it than the previous one. And so when you're training this model,
it's actually trying to get those relation queries right, using only its categories. So it actually
produces really bad category classifiers, because it can't actually learn the right thing. It's outside
of its representational power. So that's what's happening here. In the previous domain, there's a
lot more things where someone says, like, the blue mugs, so you can learn a good classifier.
Okay. So I'll show you some examples from the geography domain. So here are some things.
We have capitals. I guess I didn't talk about this, but our feature representation has some
linguistic context features, which allows you to do things like detect capitals. You can do weird
question phrasings, and then here, we're not actually getting this one right, because this is a
prepositional phrase attachment error, so this question is saying, what cities are both east of
Greensboro and in North Carolina, so the correct answer to that is Raleigh, but what the parser
says is, what city is east of Greensboro, where Greensboro is in North Carolina, and that actually
gives you both of those two things as the answer. So that's kind of -- I don't know. That's kind
of an error case you'd expect, right? You're right. I might have copied the question wrong. Yes,
yes. Good point.
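To make that attachment ambiguity concrete, the two readings correspond to logical forms roughly like the following; this is an illustrative sketch in lambda-calculus notation, and the predicate names are placeholders rather than the parser's actual vocabulary.

    Intended reading:  λx. city(x) ∧ east_of(x, GREENSBORO) ∧ located_in(x, NORTH_CAROLINA)
    Parser's reading:  λx. city(x) ∧ east_of(x, GREENSBORO) ∧ located_in(GREENSBORO, NORTH_CAROLINA)

In the second reading, the prepositional phrase "in North Carolina" attaches to Greensboro rather than to the cities being asked for, so it no longer filters the answer set and extra cities come back.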
>>: Just in case you use the slides.
>> Jayant Krishnamurthy: Yes, I'll remember that. And I just want to say one more thing, which
is we've actually taken this model, and we've hooked it up into an interactive system, so we used
a Microsoft Kinect and we did some automatic bounding box detection on some objects, and this
is Tom Kollar, and what you can do is, it has gesture recognition, so you can point at an object,
like this object here, and you can say something, and that will get fed through an automatic
speech-recognition system, and we'll get the text input. And then we can generate these training
examples automatically.
There's actually a paper about this in RSS, where we considered some extensions to the
model, and I think one of the coolest extensions we did was language generation, so here are
some examples of that. This is a training example that someone provided, and then we asked the
system to generate language to describe that object that's being indicated. This language isn't
perfect. It thinks the table's behind the book, but those look pretty similar. Here's another
example of that. You can say the toilet paper to the right of the mug. The relations here are
from the person's point of view, so this is correct.
It was too hard to get people to point at objects and then think about the opposite thing. So that's
something we've done too, and I think I'm actually basically out of time, so that's the end of my
physically grounded component. I did want to talk about this. All right, give me five minutes?
I'll go fast. Okay, so I want to change gears a little bit and talk about some of the other work
we're doing on semantic parsing, and this work is kind of in a different vein, because we're kind
of looking at these web-scale knowledge bases, things like NELL, these information-extraction
applications, and again, the motivation here is we have these databases. We have things like
Freebase. They have millions of facts, they have thousands of predicates, and we want to be able
to do question answering against these databases. We want to be able to do information
extraction to fill in the stuff that's missing in these databases. But when we're talking about these
databases, they're -- really annotating these semantic parses is kind of infeasible. The training
burden sort of grows as the number of predicates in your vocabulary, so we're going to pay a
price to annotate for these thousands of predicates. And we kind of expect the number of
predicates to actually grow in the future, because even though they have thousands of predicates,
that's only a small fraction of language or the world that we're really capturing. So we want
algorithms that scale to millions of predicates, I think, and so the goal here is we're going to try
and train semantic parsers for these web-scale knowledge bases without using any labeled
training sentences. And so the way we're going to do that is -- this is kind of like our input-output. I want to be able to take any sentence that I find on the web, I want to be able to produce
the semantic parse for it, using predicates from, say, NELL's ontology. That's what those
predicates are. Don't read too much into the representation. So how are we going to do that?
Well, we've been working on training these semantic parsers using distant supervision. So the
idea here is, we have a bunch of language. We can go out and get all of Wikipedia. And we
have all these sentences. We don't know what the right semantic parse for those sentences are,
but we do know some knowledge that kind of relates to the things in those sentences. So we
kind of look at these relation instances in our knowledge base, we can kind of use that to
constrain what the semantic parse should be for each sentence. And the way we're going to
formalize that is we're going to use this sort of distant supervision constraint, which is called the
extracts-at-least-once constraint, so the idea is, we're going to take every relation instance in our
knowledge base, we're going to find all the sentences which could express that relation, so which
mention that pair of entities. And then we're going to force at least one of these sentences to
semantically parse to something which expresses that relation.
So let's pretend this is the space of possible semantic parses for each sentence. The constraint
will allow us to choose, for example, these two sentences as the correct semantic parses, because
this expresses the right fact, but we won't be able to choose, for example, these two semantic
parses, because none of these sentences is expressing that fact. Okay, so this is kind of our
distant supervision constraint, and the idea here is, the hope here is, we're going to have so many
of these relation instances and so many of these sentences, that the corpus statistics will help us
decide which of these sentences should actually parse to that city located in country thing.
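Written out a bit more formally, the extracts-at-least-once constraint being described is roughly the following; the notation here is illustrative rather than the paper's, with KB the knowledge base, S(e1, e2) the set of sentences mentioning both entities, and parse(s) the logical form the semantic parser assigns to sentence s.

    for every relation instance r(e1, e2) in KB:
        there exists some s in S(e1, e2) such that parse(s) entails r(e1, e2)

The individual sentences are never labeled; the constraint only says that, collectively, the sentences mentioning an entity pair must account for the known fact about that pair.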
>>: So the pair you take is always the two arguments, or you take also the relation as the one
you're -- basically, you have three. This is a triple, and you constrain to find text which
matches two of them.
>> Jayant Krishnamurthy: Right. So part of it is that we don't know what the text expressing
this relation is, so we kind of need to figure that out during training, but we do know what the
text for the entities is, so that's why we do it that way.
>>: And the knowledge comes from NELL?
>> Jayant Krishnamurthy: So in the experiments I'm going to show you, we have a sort of
confused, mishmash knowledge base of Freebase instances in the NELL ontology, which is there
for historical reasons. It's not for any particular reason.
>>: If the knowledge came from NELL, it seems kind of funny, because NELL would have
gotten the knowledge from these sentences also, so it's -- but if it comes from Freebase, that makes
sense.
>> Jayant Krishnamurthy: That's true. That's a good point. I haven't thought about that too
much. Yes, it comes from a different kind of way of expressing it, though, so NELL has these
corpus statistics kinds of things. I'm not sure that would really matter. It seems like it might.
Yes, good point. Good point.
>>: Does NELL throw away the source of its knowledge?
>> Jayant Krishnamurthy: Well, so the way NELL reads is it doesn't actually look at individual
sentences or documents. It actually does a bunch of counting on all these documents and then
uses those counts in some big matrix of, okay, this word has occurred with those other words this
many times, to decide what things belong to what categories. So in some sense, there isn't one
sentence source for its knowledge. It's from the corpus statistics as a whole. Well, there's an
asterisk on that.
>>: Yes, it has some other extractors which do look at individual documents, but those aren't the
main source of knowledge, I would say.
>> Jayant Krishnamurthy: Okay. This is sort of our distant supervision constraint, and just in
the interest of time, you might ask, does this work? So we conducted a sort of relation extraction
experiment, where we trained a semantic parser using this constraint, and we evaluated it on how
well it extracted relations. So one thing you can do with the semantic parser is extract relations.
It's more powerful than that, but you can extract relations. So these colored lines are different
variants of our parser, and this dashed line is a sort of state-of-the-art relation extractor, and this
is basically extractions as a function of sentences, so we basically extracted facts from a big
corpus of sentences, and then we went through manually and said, okay, does this sentence
express this fact? And that's what this precision axis is. And then this is the number of things
extracted. And you can see here we're doing pretty well.
We also did a sort of secondary evaluation, where we tried to look at whether the semantic parser
actually does a good job of semantic parsing, as opposed to this simplified relation extraction task.
So here we kind of annotated some noun phrases with what their correct lambda calculus
representation should be, and here you can see that precision is the fraction of sentences -- not sentences, noun phrases -- which could be parsed that we got correct, 80%. This is the
overall fraction that we got correct, so 56%. So you can see here there's some discrepancy
between -- there's a number of things that can't be parsed. That's actually often due to sort of
weird vocabularies. So one of our queries is exsinger of something, where exsinger is one word,
so we're not doing a good job on those kinds of things. This is actually work that we did last
year, and so this year we've been working on extending this in I think a really interesting
direction, where our goal is to produce a kind of NLP tool that other people can plug into any
application and get the semantics where it's available.
And so our goal here is really -- I'm going to call this a broad-coverage semantic parser, and our
goal here is we want to produce a full syntactic parse of every sentence, and then we want the
parser to be able to say, okay, for this subset, this sub-span of the sentence, here's what the
semantic representation in terms of NELL predicates is, similarly for this other sub-span. So the
hope is that by having a full syntactic parser, you can easily plug this into whatever your favorite
application is, but then you can also get these sort of semantic features where they're available.
And so our procedure for that is basically we're going to use this distant supervision scheme, but
we're also going to use a CCG bank, which is a big annotated corpus of CCG parses, and we're
going to build one joint objective to do the right thing, parse the sentences correctly in CCG bank
while also extracting these relation instances from Wikipedia. And by doing this optimization,
we're going to produce this broad-coverage parser.
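As a rough sketch of what that one joint objective looks like (the notation is illustrative; the actual formulation in the work may differ), there is a single parameter vector θ for the parser and the two terms are simply summed, as comes up again in the questions later:

    O(θ) = O_syn(θ; CCGbank) + O_dist(θ; KB, Wikipedia)

where O_syn rewards reproducing the annotated CCG derivations and O_dist rewards satisfying the extracts-at-least-once constraint on the Wikipedia sentences. Because both terms share θ, the same lexicon and weights have to work both for ordinary syntactic parsing and for producing knowledge-base predicates.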
So we have some preliminary results on this, and actually, what I'm going to do is I'll just show
you a demo. Okay, so we have this demo website here, where you can kind of type in -- at the
top, you can type in some sentence, and it will try and parse it. These are some examples. Here
is sort of the space of the parser's representation, so these are all the types that it understands, and
these are all the relations that it understands. These are all from NELL's knowledge base. And if
you click on one of these things, it'll parse it. It's a little slow, so I've opened one already. So
here's the parse of the sentence. Madonna, who was born in Bay City, Michigan, on August
19th. What it does first is it looks up in NELL all of the possible noun phrases to kind of get the
entity types for all these things. So there's a whole number of possible entity types for all of
these noun phrases.
And then, during parsing, it kind of tries to disambiguate which entity do these things actually
refer to? And you can see here we get this logical form right here which says, okay, this whole
noun phrase refers to something C, which C can be called Madonna and is a musician, and also C
-- so C is also equal to this thing A, so A and C are the same variable. And A, Madonna, was
born in location D, which was Bay City, and Bay City is located in the state Michigan. So we
produce this whole parse. Now, you'll notice we've omitted this last clause, this on August 19th,
because we don't have that sort of date relation. But we actually also produced a full syntactic
parse, which kind of shows us where that modifier gets attached to this whole phrase. So you
can see 'on August 19th' kind of modifies this whole 'was born in' phrase here. This is sort of
the syntactic tree, and yes. That's kind of what our parser does. And we've done some sort of
preliminary evaluation on this, kind of working on the syntactic parsing accuracy and the
semantic parsing accuracy, but we're not quite done yet.
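For readers who cannot see the demo, the logical form being walked through above has roughly this shape; this is an illustrative paraphrase using made-up NELL-style predicate names, not the demo's literal output.

    λc. ∃a, d.  musician(c) ∧ name(c, "Madonna") ∧ equals(c, a)
                ∧ person_born_in_location(a, d) ∧ equals(d, BAY_CITY)
                ∧ city_located_in_state(BAY_CITY, MICHIGAN)

The date phrase "on August 19th" contributes nothing here because there is no matching date predicate in the vocabulary, but, as described above, it is still attached in the accompanying syntactic parse.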
Okay, that's my demo, and yes. I just want to briefly conclude by talking about some of the
directions that we're exploring. So the first thing we're trying to do is we're actually trying to
take this LSP kind of model and extend that to do image understanding on these much harder
kind of vision corpora. So I'm working with some people up in [indiscernible] at CMU who
are vision people, and they know how to solve these object-detection kind of problems. And
we're working with them to produce a version of LSP which has a much larger set of kinds of
objects it can recognize, also relationships it can recognize. Also, we're working on this broad-coverage semantic parsing thing, and one of the things that we're interested in doing is coming
up with this sort of broader-coverage semantic representation, maybe something that can assign
model-theoretic semantics to a larger subset of English than just the predicates in the
knowledge base. So we have some ideas on how to do this using ideas from vector space
semantics. So that's a direction we're kind of looking at moving towards, as well. Yes, so just to
conclude, I talked about logical semantics with perception, which is a model for grounded
language understanding where you have these physical environments and you have some
language you want to understand in that context. It's a generic model. I showed you an
application on image understanding, but you can imagine it can be applied to other things, as
well, and then I also briefly talked about this training semantic parsers with distant supervision,
which is something we're working on for kind of question answering or information extraction
applications. And then, if you're interested in the papers or the data for the LSP stuff, it's all on
my website. Okay, so thank you.
>>: The last one, where you say joint optimization, do you have any sort of -- so, basically,
you're getting some score for the different [parse trees], and then you've got an objective from the
semantics, and then you just add them?
>> Jayant Krishnamurthy: Yes, we just add them together.
>>: So there is no joint model, then?
>> Jayant Krishnamurthy: Well, there's one set of parameters which do -- there's one semantic
parser, right? The semantic parser has to do both tasks, so in some sense it's sort of being trained
to do both things simultaneously.
>>: So does this implement feedback from [indiscernible].
>> Jayant Krishnamurthy: It does, actually. So, for example, here you'll notice the word who
has this sort of head-passing category, where it takes the 'was born in' phrase and makes its
subject Madonna, but then makes the whole thing refer to Madonna. And that information is
actually completely derived from the syntactic category of who from CCGbank, so it's something
you can sort of automatically annotate. Things like conjunctions also work because of their
annotation in CCGbank, so there's actually synergy between the semantics and the syntax. In
fact, I have a slide. So, actually, one of the evaluations we've done is basically we just took a
corpus of sentences, we semantically parsed it using this parser, and then we said -- we looked at
each semantic parse and said, is it correct or not? So, again, that's precision, and I have two
models here, so one is trained without CCGbank and one is trained with CCGbank, and you can
see we actually do better with CCGbank. These are very preliminary results.
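To make the head-passing point concrete: in CCGbank the relative pronoun in this construction gets a category along the lines of (NP\NP)/(S[dcl]\NP), and one standard way to pair that category with semantics (an illustration, not necessarily the exact lexicon entry used here) is

    who := (NP\NP)/(S[dcl]\NP) : λf. λg. λx. g(x) ∧ f(x)

so the relative clause f ("was born in Bay City ...") is predicated of the same variable as the head noun phrase g ("Madonna"), and the whole expression still denotes Madonna. That behavior falls out of the syntactic category itself, which is why the CCGbank annotation helps the semantics.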
>>: So the CCG parser does give you that prepositional phrase that's time, so how is it that the
logical form simply drops that?
>> Jayant Krishnamurthy: So, what we've done is we've sort of taken the CCGbank syntactic
lexicon, and we've kind of annotated these syntactic categories with what their semantics are, and
we've annotated it in such a way where it says if we don't know what the semantics of this word
is and it's a modifier, just ignore it in the semantic parse. But you can get that information out by
looking at the syntactic tree, by -- there's these predicate argument dependencies that get
generated, so you can look at those, as well, to get that information back.
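A minimal sketch of that convention (a rendering of the idea, not the exact annotation used): if a word's category marks it as a modifier, say a verb-phrase-modifying preposition with category ((S\NP)\(S\NP))/NP, and there is no NELL predicate for it, its semantics is just the identity on the phrase it modifies:

    on := ((S\NP)\(S\NP))/NP : λy. λf. λx. f(x)

so "on August 19th" consumes its noun-phrase argument and returns the verb-phrase meaning unchanged, disappearing from the logical form while still being attached in the syntactic derivation and in the predicate-argument dependencies.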
>>: When you reach the understanding part, if you generate the computer images, synthesize
them with a graphics card, would that be helpful?
>> Jayant Krishnamurthy: I'm not sure. We can talk about that. I'm not a vision expert. I'm
really a language guy. But I guess you'd have the -- wouldn't you then kind of already know
what each object looked like?
>>: Is the training data of 100 -- it would be helpful if you had more training data.
>> Jayant Krishnamurthy: Well, I mean, more training data would be good. What I'm
wondering is if you could synthesize these images in such a way where the system wouldn't have
to already know what the right answer is when it's synthesizing the image and the annotation.
Because if you synthesize an image and then you synthesize the caption the mug to the left of the
monitor and that caption already works, that means you sort of already knew how to do that. So
you need to make sure that the synthesized data isn't sort of trivial, and I'm not sure exactly how
that would work.
>>: It seems like now that you have the joint thing, you could almost reverse one of your arrows
and given a sentence, find the images that would parse to satisfy that sentence. Have you looked
at that, as like an image-search kind of task?
>> Jayant Krishnamurthy: Yes, that's a great application. We've thought about it, but we haven't
actually tried it. I think we're going to try for this new data set -- it's got a lot more images. I
think we're going to try an evaluation like that, probably, but we haven't looked at it yet.
>>: I would have thought, just as you turned around and did natural language generation, you
could do scene generation.
>> Jayant Krishnamurthy: Oh, actually, yes, if we had something that synthesized images, that
would actually help, right? Because then we could synthesize the image and then detect, does
this image satisfy? Yes, that would be good.
>>: You have to make sure that the language is fairly compact, right, so that you get
enough data to [learn] the outcomes of each word and also the spatial relations. Did you
constrain the annotators in any way?
>> Jayant Krishnamurthy: So what we did -- I think the domains that we have lend themselves
to relatively constrained language, and what we did was, basically, first we asked some members
of our research group to just write down some descriptions and questions, and then that wasn't
scaling great, so we asked some people on Mechanical Turk. And so the things that the research
group contributed tend to be a little bit nicer. There are some Mechanical Turk things in there,
like I showed you one, which is just kind of a mess, where they're actually very complicated.
But I think as long as you have some examples, which are kind of easy -- if you have some
examples which really help you constrain the concepts, like this one -- this is kind of a messy
piece of language that someone on Mechanical Turk contributed. But as long as you have some
examples which are sort of simple enough to let you learn the object classifiers and relation
classifiers, I think the more complex language works from there. It's more of a curriculum
strategy, in a way.
>>: But here you have cups, mugs. You mentioned there's like 20 different ways to refer to
these objects. Then, I mean --
>> Jayant Krishnamurthy: Yes, you need training data for each new word. Absolutely,
absolutely.
>>: So when you presented the data to the Turkers, did you present the boxes?
>> Jayant Krishnamurthy: Yes.
>>: So these were identified.
>> Jayant Krishnamurthy: Yes.
>>: So they had to -- they didn't have to worry about the shelves, for example.
>> Jayant Krishnamurthy: I believe what happened is -- yes, I believe they were presented the
object bounding boxes and then asked to describe the scene, and there was an example that was
something like -- it wasn't for this image, but it was something like, there's a knife on the
counter, or something. Something like that.
>>: Have you considered different types of adjectives? Like, big might be problematic.
>> Jayant Krishnamurthy: Because it's something that -- it's relative.
>>: It's relative. It's not just something that you can easily apply in a Boolean --
>> Jayant Krishnamurthy: Yes, that's a good question. I think, conceptually, you can represent
those kinds of things in the semantic parse, but it's a little bit less clear how to detect them, I
think. I haven't really figured out that second piece of the puzzle. It's kind of tricky. It's hard.
I'm not sure what the right way to do that is. Even the colors here are sort of -- you could
imagine it being the same way, like a red car and a red something else -- wood.
Redhead. Definitely different.
>> Hoifung Poon: Okay, so Jayant will be here today and tomorrow, and there are still
[indiscernible]. So let's thank Jayant.