>> Hoifung Poon: Hi, everyone. It's my great pleasure to introduce Jayant from CMU. So Jayant is a student of Tom Mitchell, and he's worked on a lot of interesting stuff, most recently on grounded language learning and also semantic parsing, so he will be talking about -- probably first on grounded language learning, but maybe, with time permitted, maybe a little bit on semantic parsing, as well. So without further ado, here's Jayant. >> Jayant Krishnamurthy: Okay, hi. I'm Jayant, and as Hoifung says, today, I'm going to talk about learning to understand natural language and with a focus on physically grounded environments. But I'm also going to talk a little bit about some of the other language understanding stuff we've been working on, because I'm excited about it and I wanted to. This is joint work also with my adviser, Tom Mitchell, and then Tom Kollar and Grant, who were at CMU while this was being done. So I just want to motivate this by saying, why do we need language understanding systems? So here's an example of a language-understanding problem, I would say. It's a question-answering problem, where maybe you want to be able to ask some sort of question, and you have some big database of facts, and based on this database of facts, you want to produce some sort of answer to that question. So today, we have databases like Freebase, we have things like NELL, which have millions and millions of these facts, so you can imagine this would be pretty useful for something you could do in a search engine or something like that. If a user types this in as the query, you can just give them the answer, instead of going and retrieving a bunch of search pages. The flipside of this is information extraction, and so, if you know me, I work on NELL, which is this sort of ongoing information extraction project, where every day NELL goes out and looks at webpages and finds sentences like this, and then it has this big database of facts, and I'm sort of trying to grow that database of facts iteratively. So we're kind of interested in this information extraction problem, and what we might want to do is, say, find a sentence like this, and find that it expresses these sort of relation instances and add that to some sort of growing knowledge base. Then you can imagine we can use this knowledge base for question answering sort of downstream. So I'm going to briefly talk about some of this stuff at the end of the talk, for like 10 minutes. But as I said, the focus of this talk is on understanding natural language in grounded environments. So I think the canonical example of such a problem would be robot command understanding. So maybe you want to be able to give a robot a command like, go get a pen from my office, and you have this robot, and the robot knows how to navigate its environment and knows how to manipulate objects. And you want this robot to do the right thing, given that you've given it this command. And the reason this is a grounded language understanding problem is because the meaning of this command is sort of intimately tied to objects and locations in the real world, right? The robot needs to be able to map my pen onto some physical pen which is in your office. It needs to be able to map office onto some location in the real world. 
And I'm really excited about these sort of grounded language understanding problems, because I think they sort of provide us with a way to avoid some of the ambiguities of natural language semantics that we've had for a number of years, which is that we have a number of different ways of doing semantic representations of language, but there are differences between them and their expressivity is different, and it's not clear what's the best way to go. Well, these grounded language understanding tasks provide us with a test bed for all these different semantic theories, right? We can plug in frame semantics, we can plug in semantic parsing, and we'd say, how well can I actually answer these commands? And what's great is, if I created a corpus of 1,000 commands and the robot gets 950 right, who's to say that that robot doesn't really understand language? Certainly, it understands some subset of language that's contained within those commands. So that's why I'm really excited about these tasks. Now, robot command understanding is actually going to be a challenging task, and so in this talk I'm going to talk about a slightly simpler task, which I'm going to call scene understanding, which I hope you agree has many of the same sub-problems as robot command understanding. And so the way that this task is going to work is we're going to be given some sort of a physical environment -- in this case it's an image -- with segmented objects. So I'm going to assume that the objects in the environment are known a priori. That's a simplification. And then we're also going to get some natural language description of some of the objects, and our goal would be to identify the objects which are described by the natural language statement. So if you say the mugs, I need to give you the set of both mugs. Similarly, you can have more complex language, like the mug to the left of the monitor, if you want to be able to return just that mug which is to the left of the monitor. And to give you a preview of where this talk is going, I'm going to introduce a sort of model for solving this problem, which kind of jointly learns to perceive the environment as well as parse this natural language text, and you can train it using only training data of the form that you see here, and it can get about 70% accuracy on this task. That's just a preview of what's going to happen. Okay, now, before I talk about that, I want to briefly talk -- so this is the outline of my talk -- and before I start talking about that sort of physically grounded stuff, I want to briefly talk about semantic parsing, which is kind of the fundamental technology which underlies, I think, a lot of this language understanding work. And then I'm going to talk about the scene understanding application and the model we've developed, and then finally I'm going to spend a brief amount of time talking about the information extraction work that we're working on, which I'm going to call broad coverage semantic parsing. And feel free to stop me if you have questions. I think we can be pretty free form. Okay, great. So let's talk about semantic parsing. So semantic parsing was initially intended to solve these kinds of question answering applications, so I'm going to kind of focus on that for the moment. So let's say we have some question, like what cities are in Texas, and we want to be able to answer this question, well, how should we do that?
The semantic parsing view of the world is, first, we should take that natural language statement and we should translate it into a sort of logical representation of its meaning. And the point of this logical representation is to abstract away from syntactic differences in the natural language, different words meaning the same thing, things like that. And then, once we have the semantic parse, we're going to assume that we have some big database of facts over here, and we'll be able to execute the semantic parse on this knowledge base, this big database of facts, to produce the answer to the question. Now, if you're not familiar with formal semantics -- well, I'll just explain it anyway. But the way to read these semantic parses is they're going to be functions. This is a function from a variable X, which you should think about as ranging over the entities in the knowledge base, to a sort of logical statement, which is going to return true or false for each entity. So the way to interpret this is actually as a set. It's a set of objects for which the function returns true. So this is really the set of things X, where X is a city and X is located in Texas. So I hope you agree that that's a pretty reasonable semantic parse to get the answer to that question. And if you use that interpretation, it's straightforward to see how you can execute that on this knowledge base. This is just like performing a database query. So this is the semantic parsing view of how we should do these kinds of question answering applications. And I just want to point out one thing, which is that for this to actually be meaningful, we really need to make sure that this knowledge base is derived from the real world. So we typically assume we're just given this big knowledge base of facts and that it's true, but unless these facts are actually true in the real world, we're not going to get the right answer to our question. So this sort of correspondence is critical to making this whole pipeline work. Of course, typically, when we're doing semantic parsing, we're going to assume this knowledge base is known, so really the only interesting problem is this mapping right here, going from the natural language onto the logical representation, which is called the logical form. So how are we going to do this mapping? There's a number of different ways that we can do these mappings from natural language onto logical form or semantic parses. Today, I'm going to talk about my favorite way, which is combinatory categorial grammar. So I really like combinatory categorial grammar, or CCG, for semantic parsing, because it has this tight coupling between syntax and semantics. So what you can do in CCG is you specify for every word what's its syntactic function and what's its semantic function. And then if you can syntactically parse some statement, you can automatically derive what the semantics of that statement should be. So it's sort of just transparent from the syntactic parse. So I think that's really nice, and another thing that's nice is that it's a formalism that a lot of linguists have studied in the past, so if you have some sort of weird linguistic dilemma, you can kind of go look it up in the literature and say, how should I solve this? How do I deal with coordination? So that I think is a really nice property. You don't have to invent everything yourself. It obviously has its quirks, as well. So let me explain a little bit about how CCG works.
So in CCG, there's basically -- so CCG, it's a lexicalized grammar formalism, which means that it has a huge number of different syntactic categories. So if you're familiar with a context-free grammar, it'll have 40 different parts of speech. In CCG, there's going to be 500, but these parts of speech have internal structure which tells you how you can combine them with other parts of speech. So these parts of speech are called syntactic categories. And the simplest syntactic categories are called atomic categories. They're things like nouns, and there's a handful of these. There's nouns, sentences, prepositional phrases, noun phrases. That's basically it. There's only four of those. The more complicated categories, things like adjectives, you create by basically building functions out of these atomic categories. So these are examples of functional categories. And the way to read this is, the intuition here, is that words in CCG behave like functions, which you can apply to other words to produce parses for the results. And the way we're going to denote that a word is a function is we're going to use this kind of funny slash notation, where this side -- I guess that's your right side -- is the argument to the function and the left side is what it returns. So big here is an adjective, because what it does is it takes the noun and it returns a noun. There's two different kinds of slashes, actually, which denote which side of the current word that you expect the argument to occur on. So the way I like to remember this is if you look at the top of the slash, that's the direction that it's pointing. So this means that it's looking for a noun on its right. Now, we can take the sort of intuition of these functional categories and kind of nest this sort of function process to produce more complicated things like prepositions. So in here is a preposition which expects a noun on its right and then returns something which expects a noun on its left and then returns a noun. So if we wanted to parse town in Texas, we could use something like this. Now, the corresponding semantics of these categories is specified in this lambda calculus notation, and what's important to note is just that for these atomic categories, right, we're just going to talk about them as referring to sets of entities. So these are all sets of entities, right? These are functions from some entity to something which returns true or false. And these other categories here, this is a function which accepts something as its argument, and so there's actually a slot here -- this lambda-F is the argument to big. So there's this sort of interface between the syntax and semantics, and that's how parsing is going to derive the semantic parse for the whole phrase. So let me show you how parsing works. I think that will make it a little bit clearer. So how do we parse a sentence in CCG? Well, first, let's say this is our sentence. The first thing we're going to do is we're going to look up in this sort of table I showed you before what the syntactic and semantic representations of each of these words are. Now, there might be ambiguity about this. We'll talk about ambiguity in a second. We're going to look at these syntactic and semantic representations. And then what we're going to do is we're going to combine adjacent words using a small number of different operations. Now, intuitively, all these operations are things that you can do on functions. So you can do function application, you can do composition.
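To make the category notation and function application concrete, here is a minimal sketch in Python; the toy lexicon, the entity set, and all the names here are illustrative rather than the actual parser's.

    # Toy CCG fragment: each lexical entry pairs a syntactic category string with
    # a semantics, here represented as plain Python functions over a tiny domain.
    entities = {"Austin", "Dallas", "Texas"}
    located_in = {("Austin", "Texas"), ("Dallas", "Texas")}

    # "town" : N : lambda x. town(x)
    town = ("N", lambda x: x in {"Austin", "Dallas"})
    # "Texas" : N : lambda x. x = Texas
    texas = ("N", lambda x: x == "Texas")
    # "in" : (N\N)/N : lambda f. lambda g. lambda x. g(x) and exists y. f(y) and located_in(x, y)
    in_prep = ("(N\\N)/N",
               lambda f: ("N\\N",
                          lambda g: ("N",
                                     lambda x: g(x) and any(f(y) and (x, y) in located_in
                                                            for y in entities))))

    def apply_right(functor, argument):
        # Forward application: a category X/Y consumes the Y on its right to give X;
        # semantically this is ordinary function application.
        return functor[1](argument[1])

    def apply_left(argument, functor):
        # Backward application: a Y followed by X\Y gives X.
        return functor[1](argument[1])

    # Parse "town in Texas": combine "in" with "Texas" first, then with "town".
    in_texas = apply_right(in_prep, texas)        # category N\N
    town_in_texas = apply_left(town, in_texas)    # category N
    print({x for x in entities if town_in_texas[1](x)})   # {'Austin', 'Dallas'}

The point is just that syntactic combination and semantic combination happen in lockstep, which is the correspondence being described on the slides.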
By far, the most common operation is function application, and that's pretty much the only one that you really need for all but the kind of complicated stuff. So here's an example of function application. What I've done here is I've applied in to the argument Texas, and I've produced here a syntactic category for this whole phrase, along with a sort of corresponding semantic representation. And if you look at this, you'll see what's happened is I've plugged in this function for F, and that sort of produced this resulting semantic representation. So I did function application syntactically and I did function application semantically, which I think is a nice sort of correspondence there. >>: Texas here is being parsed as a noun, not a prepositional phrase? >> Jayant Krishnamurthy: So this is a function which expects a noun on the left and then returns a noun, so that's kind of like a modifier to the right of a noun. I think the syntactic categories in CCG are the biggest obstacle to just understanding how it works, because they get -- I showed you a handful, but they get really, really messy pretty fast. So it's kind of one of those things where if you look at it long enough, you kind of start to cash out the important subcomponents of these different things, and you can kind of quickly look at it and say, yes, that makes sense. But yes, this is basically like a preposition which has already taken one of the two arguments. >>: For the interpretation, you could also easily say that you can reverse the order of where you expect the arguments, right? You basically take one something from the left and one from the right. So in this particular form, you expect something maybe from the left first and then from the right? But you can also equally say it the other way around and you've got the different parts. So would that create a lot of ambiguity? >> Jayant Krishnamurthy: So that's a good question. The question here is basically that CCG has put a certain ordering in this category. It's said, I need to take the right argument before I can take the left argument. But you could easily imagine that you want to apply the left argument first and then apply the right argument after. And so you can do that. That's where these other funny operations come in, to let you apply the arguments out of order, essentially. But in a sentence like this, you'll get the same parse either way, right? So I guess maybe the high-level thing to say is, you can apply the arguments out of order. I don't want to talk about the operations required to do that. But if you do produce the same sort of parse, it'll give you the equivalent semantics, so it's only a problem in terms of efficiency of your parser. It's not a problem in terms of getting the right answer. >>: My question is really why, when you're inducing the lexicon, because that's sort of the core of all CCG, do you need some sort of -- because those ambiguities are extraneous. It doesn't matter why you still [indiscernible]. So do you normally do something? >> Jayant Krishnamurthy: So you don't need to do that in the lexicon, and there's a couple of different ways to handle this ambiguity, and maybe it's better to talk about this after. But one thing you can do is say, I'm only going to use the ambiguity when it's necessary to parse something, so I'm only going to apply these things out of order if that's necessary to produce a parse for this sentence. But also, I guess you were asking about the lexicon. You don't need to encode it in the lexicon at all.
You only need this one category for in, and then the parsing operations take care of the out-of-order application. >>: So the parentheses indicate ordering, because you're going to take that, for in, you're looking for the noun to the right first. So I'm most interested in finding out -- so I understand that in the lexicon we can say in, but that there is a sense of in that's located in, but there are other senses of the word in, so you haven't talked about those, and we've picked located in, magically. And then the other one is ->> Jayant Krishnamurthy: Okay, I'll answer that question in like two slides. >>: And then why do we get from town to city? >> Jayant Krishnamurthy: So I've just sort of assumed that this is what's encoded in our lexicon, so I've assumed that someone's given me a table of the possible mappings here. Now, there's different ways you can do this, so people have actually developed semantic parsing algorithms, where if you have the labeled logical forms, it will sort of automatically induce what should go in the lexicon. That assumes you have the labeled logical forms. The other thing you can do is you can use some sort of heuristics based on alignments of -- like I see Texas, I know Texas is a city, maybe I see a city like Texas, town -- not Texas. If I have an appositive with something that I know is a city, I can guess that. I agree this mapping is nontrivial to produce. I'm currently assuming we're going to do that heuristically. There's other things you can do, actually, so in the grounded work, what we're going to do is we're just going to say I'm just going to use the word lemma essentially as this predicate. So that lets you kind of abstract away from plurals but not worry about this mapping onto these discrete predicates, which is a hard problem. And ambiguity is going to show up in a second. Okay, great. Good. So, okay, I can repeat the function application procedure to produce the parse for the whole sentence here. Now, there's ambiguity, right? So I could have multiple different mappings for the words onto these sort of lexicon entries. I could have multiple different ways of applying these rules, which will lead to things like prepositional phrase attachment ambiguities, things like that. So we need some way to decide which parse we're going to get, given some input sentence. And like good machine learning people, what we're going to do is we're going to parameterize these parses, we're going to have some features, and we're going to learn some weights which produce the right parse. So here's sort of my schematic view of how we're going to do that. So, first of all, let S be the sentence we're trying to parse, let L be this whole parse-tree structure. We're going to define some features of the parse and the sentence, so these could be things like, count how many times each individual ones of these lexicon entries shows up. It could be things like count how many times I apply a preposition to a noun that's one word away on the right. We can kind of come up with anything that's local in this parse-tree structure for these features. And so we'll have these features, and what we're going to do is we're just going to learn some weights which say, okay, which features should I prefer in good parses? And so in this kind of view, we can parse a sentence by solving this optimization, and you can do this sort of using a dynamic programming, CKY kind of thing. So I have some sentence S here, and I'm trying to find the best parse for this sentence. 
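As a rough sketch of that optimization -- assuming an external candidate-parse generator and feature function, both of which are illustrative names here -- the scoring and argmax might look like the following; the actual parser does this with CKY-style dynamic programming over the chart rather than by enumerating whole parses.

    import numpy as np

    def score(parse, sentence, theta, phi):
        # Linear model: the score of a parse is the dot product of the learned
        # weights theta with the features phi of the (parse, sentence) pair.
        return np.dot(theta, phi(parse, sentence))

    def best_parse(sentence, candidate_parses, theta, phi):
        # argmax over the candidate parses of the sentence under the current weights.
        return max(candidate_parses(sentence),
                   key=lambda parse: score(parse, sentence, theta, phi))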
That's what the argmax is doing. So this is kind of like a parameterized combinatory categorial grammar. Is that good? Okay. So how do we learn these things? I'll talk briefly about supervised learning for these things. I'll say there's other ways to learn semantic parsers. I know Hoifung is working on some totally unsupervised ways to learn semantic parsers, not using CCG. But let's imagine for a second that we have training data of the form, here are some questions and here's their logical forms. Then what we can do is we can just treat this as a supervised learning problem. And so we can solve this learning problem in a number of different ways, but for example, we could do the structured perceptron algorithm to train the parameters. So here, what we'll do is we'll basically just -- we'll take each sentence. We'll parse it. We'll predict some parse, and then we'll say -- we'll try and move the parameters towards the features of the correct parse but away from the features of the predicted parse. So essentially, every time you get something wrong, you're going to try and say, instead of predicting the thing you predicted, predict the right thing. So this is one way you could train the parameters. You can also do things like maximum likelihood. Your favorite objective function can be plugged in here. So that's supervised learning of these CCGs. Yes. Okay, so that's the end of my semantic parsing intro chunk. Now I'm going to talk about how we can use semantic parsing to do these sort of physically grounded problems. Now, remember that the problem I was talking about was this one here, where we're going to get some image with some segmented objects. We're going to get some descriptions, and our goal is to predict the set of objects described by that description. There's a couple of things I want to point out about this problem, which distinguishes it from some things people tried in prior work. So, first of all, I want to point out that the output is a set. If you say the mugs, I want you to get two mugs. So this is kind of different from some work in robotics, which just kind of assumed that every word refers to exactly one thing. Another thing I want to point out is that there's no knowledge base. If I were training a semantic parser, I'd assume that I had some knowledge base which encoded all of the information about this image, and I'd just parse into some formalism that could use that knowledge base. But here there's no knowledge base. There's just the image. And, finally, we're going to consider both one-argument predicates, things like mugs, and also relations, or two-argument predicates, like left. So here, left is encoding some relationship between the mug and the monitor, so we need to look at two things to kind of get that right. Okay, so how are we going to solve this problem? Well, we propose this model, which we call logical semantics with perception, which kind of just takes the semantic parsing view of the world and says, okay, I don't know what the knowledge base is. Why don't I try and figure that out automatically? So here's the way that our model's going to work. We have as input -- we have the language. We have this environment, which is this image, and we have this output, which is the set. And what we're going to do is we're simply going to semantically parse the input. Now, here, you'll notice that I've -- there is some set of predicates now. Just assume we've made these up.
What we actually do in practice is we just look at the words in the language and say each word maps onto a predicate. That's its word lemma. So mugs will map onto mug, we get some sort of sharing here. But it's not like there's some fixed predicate vocabulary. There's no discrete mapping onto those predicates anymore. It's very easy to do this. >>: Can you give the number of arguments for each predicate, as well? >> Jayant Krishnamurthy: What we do is we part-of-speech tag and then we kind of just guess. So actually what ends up happening is with words like left, you'll end up with two different ones, where one's a one-argument predicate and one's a two-argument predicate. And then you have to disambiguate. Okay, so next, what we're going to do is we're going to say, okay, so I have this environment. I'm going to try to produce a knowledge base from that environment. So here, I'm going to take for every category predicate that we've invented, I'm going to say here's the set of objects which are elements of that category predicate. For every relation predicate that we've invented, I'm going to say here's the set of object tuples that are elements of that predicate. So this is like the set of things where the first element is to the left of the second element. So once I have this knowledge base, it's straightforward to produce the output, right? Because I have the semantic parse, I have the knowledge base. I do my database query and I get the output set. >>: The one that you need is not part of your perception of left. >> Jayant Krishnamurthy: Yes, that's me just not putting in slides correctly. Let's assume that it were. >>: Do you mean to have that be an exhaustive list of the things that are perceived? >> Jayant Krishnamurthy: Yes. It's hard to do for left, because there's a lot of them. >>: So that's why I was curious whether you needed it to be exhaustive or not. >> Jayant Krishnamurthy: I do. I do. I mean this to be the complete set of all the categories. Okay, so this is our proposed model, and you'll notice this looks a lot like the semantic parsing view of the world. The big difference is just that instead of having one learning problem, we now have really two learning problems. We need to learn the semantic parser. We also need to learn to do this perception thing to produce these knowledge bases. And remember that this is deterministic, so we don't need to worry about learning this. This is just I run that query on this database and I get some output. Okay, so you can actually just take this picture and kind of think about this as a graphical model, where these things are the nodes and these are the edges. If you do that, you'll end up with a sort of -- you can kind of render that into math, and it looks something like this. So we have a model, the model has three components. There's the semantic parser, which gets as input sort of this language, and our job is to kind of score these different semantic parses for that language. We have this perception function, which is going to take the environment here, and our job is to produce this sort of knowledge base. And then we have this sort of deterministic evaluation thing, which just takes that knowledge base, and it takes that semantic parse, and it says here's what the output is, given that you've produced those two things already. So both of these two components need to be learned, right? We have some parameters, theta-parse, for the semantic parser, and we have some parameters, theta-per, for the perception function.
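To make the deterministic evaluation component concrete, here is a minimal sketch; the knowledge base layout as Python sets, the object names, and the hand-written query are all illustrative.

    # A perceived knowledge base: a set of objects per category predicate and a
    # set of object pairs per relation predicate, as produced by the perception
    # function. Evaluating a logical form against it is just a set query.
    objects = {"mug1", "mug2", "monitor1", "table1"}
    kb = {
        "mug": {"mug1", "mug2"},
        "monitor": {"monitor1"},
        "left-rel": {("mug1", "monitor1")},   # mug1 is to the left of monitor1
    }

    def denotation(query, kb, objects):
        # The set of objects for which the logical form returns true.
        return {x for x in objects if query(x, kb, objects)}

    # lambda x. mug(x) and exists y. monitor(y) and left-rel(x, y)
    def mug_left_of_monitor(x, kb, objects):
        return x in kb["mug"] and any(
            y in kb["monitor"] and (x, y) in kb["left-rel"] for y in objects)

    print(denotation(mug_left_of_monitor, kb, objects))   # {'mug1'}

The two learned pieces then supply the inputs to this step: the parser proposes the logical form, and the perception classifiers fill in the sets in the knowledge base.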
I'm going to talk briefly about how these things are parameterized before I talk about the learning. And the parameterization here, I've sort of already told you what we're going to do. We're going to have some features of these semantic parsers, and we're just going to train the linear model like that up there. For the perception function, this looks like it could be potentially tricky, right, because there's a complicated structured knowledge base thing. So let me talk about how we're going to parameterize that. If you look at this knowledge base I think the right way, it turns out there's a very easy way to think about parameterizing it. And the way to look at it is this. The knowledge base is really just the collection of binary classification decisions. So I have a predicate like mug. I want to know for each object in the environment. I've omitted the table, but for each object in the environment, is it a mug or is it not a mug? And my job is just to produce this set of binary classifications, similarly for a predicate like blue. So if you think about this way, what we can say is, okay, I'm just going to train a single binary classifier per predicate to predict the right set of objects. I'm going to classify every object with that classifier, and that's how I'm going to produce this set. So basically what we're going to assume is that we have some features that we can compute of these image segments. That's what phi-cat is here. And I'm going to train some classifier parameters for, say, mug, beta-mug, and this is going to be like my classification decision equation for that classifier. And, similarly, I'm going to assume that I have some features of pairs of bounding boxes for the relations, or pairs of image segments, and I'll train one classifier per relation predicate to say, okay, is this thing left of that thing? >>: So then you did the same thing as you did for the text, where any noun you considered is going to be the predicate, and any something else is a two-position predicate? >> Jayant Krishnamurthy: Right, so the pipeline is we look at the text, we invent the set of predicates, and then there's a parameter for each of those predicates that we invented here, and there's also the mapping and the lexicon there. Yes, we invent the predicates first. Okay. >>: So you put any structure among all those different classifiers? So, for example, when you have two predicates, maybe they meet in the same -- I mean, post hoc, obviously. So do you try to impose these? >> Jayant Krishnamurthy: There's no structure. It's all independent classifications. I can talk about that later, but I think there's actually an interesting point there, which is it's hard to impose the structure, because you don't know what it is a priori, and then it turns out if you have enough data, this will work well either way, so do you really want the structure? It's not obvious to do that in a way that kind of makes sense, I think. I don't know. Okay, so ->>: I'm sorry. I want to drag you back to the last slide. >> Jayant Krishnamurthy: Sure. >>: I feel like I missed something -- or maybe one more. I'm sorry. I can see how you get predicates for each of the nouns. However you have this lambda-X which seems to correspond to the indefinite article. There's a mug and exists Y that corresponds to the definite article the. It feels like there's a lot more complication in the instantiation of that first-order logic statement than just inventing predicates for each word. 
I mean, there's a number of words that are omitted. There's a bunch of structure that's induced, the use of [indiscernible] instead of disjunctions. There's scoping. Yes. >> Jayant Krishnamurthy: So that's a good point. There's definitely -- so all of this is done by part-of-speech heuristics, and there's multiple possibilities for these things. So what we're really doing is we're looking at each word, and we're saying this is a noun, so there's possibly the thing lambda-X, mug-X, and this is a preposition, so maybe I have the thing which takes an argument on the right and quantifies that out and gives me the thing on the left, right? And you can do that by kind of just looking at the part-of-speech tag, saying, okay, it's tagged as IN. I'm going to give it the preposition lexicon entry. Honestly, that part is not super-interesting. It's like a bunch of rules I wrote down, and it kind of just invents some stuff. >>: I totally agree it's not interesting, especially when you have one, two or three predicates in a sentence, but I think when we start scaling up to more complicated situations -- I don't know. I would love it if this stuff is -- you can basically really infer it from syntax and it's no problem at all, but do you believe that ->> Jayant Krishnamurthy: Remember that this is learned, right. So what we're trying to do is invent the space of possible semantic parses, and then we're trying to learn to pick the good semantic parses. >>: Yes. >> Jayant Krishnamurthy: So there's some room for kind of over-generating these things, saying I'm going to make a bunch of different things for these nouns. I'm going to try a bunch of different things for these relations, and saying, which parses actually do a good job? That overgeneration is kind of what we're counting on to get the right answer here. We're not trying to pick exactly this parse automatically, heuristically, from that text. Okay. I'm going to keep going. Good. Let me just say a word about how these things are parameterized. So if you look at this you realize, actually, we can apply this model to a bunch of different other things. It doesn't necessarily have to be this image understanding problem. Really, the only thing that we count on for the environment is having these feature functions. So if you have any sort of application domain, where you can kind of say, here are the objects that I think you need to reason about and here are some good ways to calculate features of them, you can actually apply this model. But I'm sticking with the images, because I think it's more concrete. And for the image domain, what we're using here is we're using a histogram of oriented gradients, which kind of captures the edge information, and we're also using a color histogram. And then here we're using spatial relation features that capture the orientation of the two objects relative to each other. >>: Do you also have just an indicator feature for this bounding box? >> Jayant Krishnamurthy: No, because if I give you a different environment, it's going to have completely different objects. So the objects aren't preserved across environments. >>: Do you think it might learn to answer future questions? >> Jayant Krishnamurthy: So in our evaluation, we actually only test on unseen environments, so those features wouldn't be very useful. But, yes, you could get object identity as a feature somehow. You could definitely plug that in. I just don't think it makes sense for our particular application. Okay, so let's talk about training this thing.
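Before the training discussion, here is a minimal sketch of the perception function as just described -- one independent binary classifier per invented predicate over precomputed features; the function and argument names are illustrative.

    import numpy as np

    def perceive(objects, cat_features, rel_features, betas_cat, betas_rel):
        # Build the knowledge base with one independent linear classifier per
        # invented predicate. cat_features[o] is the feature vector of an image
        # segment (e.g. HOG plus a color histogram); rel_features[(o1, o2)] is the
        # spatial feature vector of an ordered pair of segments. betas_cat and
        # betas_rel hold one learned weight vector per predicate.
        kb = {}
        for pred, beta in betas_cat.items():
            kb[pred] = {o for o in objects if np.dot(beta, cat_features[o]) > 0}
        for pred, beta in betas_rel.items():
            kb[pred] = {(o1, o2) for o1 in objects for o2 in objects
                        if o1 != o2 and np.dot(beta, rel_features[(o1, o2)]) > 0}
        return kb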
Here is one way I could train the system. I could sit down, I could annotate the right semantic parse for every piece of input language. I could also annotate the right knowledge base for each environment. And then, once I've annotated these things, I just have two independent learning problems. I can train the semantic parser, like we talked about before, and then this is just training a bunch of SVMs, essentially, to distinguish which things are elements of which predicate. So this is potentially one way we could train the system, but I'm going to argue this isn't a very good way to train the system, because you're going to have to sit down and annotate all these parses, which is pretty painful. You're going to have to do that for all these knowledge bases, also pretty painful. I don't think we really want to do this. I think what we really want is a way that we can train the system using naturally occurring kinds of supervision. So what we're going to do is we're going to propose a weakly supervised algorithm, which is going to be trained using just the natural language, the environment, and the right answer, the set of objects that that language actually refers to. So I'm going to argue this is pretty natural, because if you can recognize a pointing gesture, if someone points at an object and they describe it, there you go, now you have a training example. >>: Does the environment come already with the bounding boxes and everything? >> Jayant Krishnamurthy: Yes. We've assumed that the bounding boxes come in the environment. That's an oversimplification of the truth, but we need to start somewhere. Okay, so how are we going to train this? Well, we're just going to treat these variables as unobserved, and we're just going to do stochastic gradient descent. So, basically, what we do is we set this up as a maximum-margin Markov network, which is kind of a generalization of an SVM, and I'll just show you how the training works by looking at one gradient update. So let's pretend this is our training example, so this is an iterative procedure. So we're going to look at each example independently, and we're going to do a gradient update based on that, and then we're going to kind of iterate over the whole data set. So let's just look at one training example over here. The first thing we need to do is we need to solve these two maximization problems, which is basically here we need to predict, given these two inputs, what's the best knowledge base, semantic parse and output under the current model parameters. That's what this part is. But then there's also this funny cost term which we have to add in, and that's what makes it a maximum-margin Markov network. But this is basically just what's the overlap between the correct thing and the predicted thing. And then the next thing we need to do is we need to figure out what the best knowledge base and semantic parse are which actually explain why you got this answer. So we're conditioning on observing this. We want to find these two things. Now, unfortunately, these maximizations are hard to solve. Typically, in this model, inference is actually easy, because we have this deterministic evaluation component. If you just do inference in the knowledge base and you do inference in the semantic parser, you can just combine those results to produce what the right output should be.
But the second you start adding some constraints on what the output should be, inference can become hard, because now the knowledge base that you predict is coupled to the semantic parse that you produced. So we have an approximation for solving these problems, and what we do is we basically do a beam search over the different semantic parses, and then given the semantic parse, you can solve a sort of little ILP for what the best knowledge base is. So in practice we can train on our domains -- we can do cross-validation in an hour -- so it's not too bad. But the inference problem will become problematic if we try and study domains with a large number of entities. >>: So it's a cost function? >> Jayant Krishnamurthy: No, so it's a Hamming cost, so it's basically for every object that you predict which is in the true set, you get a cost of one. Here, you're trying to encourage the model to predict something wrong. Sorry. I guess for everything that's not in the true set, you add a cost. But if you ignore that, it's just like a structured perceptron kind of update. It's just this that makes it the max-margin Markov network. Okay, so once we have these values, how do we do our gradient update? Well, what we're going to do is we're just going to update -- this is the same structured perceptron update, essentially, that I showed you before. We're going to update our parameters towards the value of the semantic parse which actually got the right answer, and we're going to move away from whatever parse we predicted here with the cost. Similarly, for the knowledge base, for the perception function, we're going to do a similar thing, and here it's just going to factorize into sort of an update per predicate classifier. So let me just show you the example for one predicate. Let's pretend this is what we predicted. This is a value from this optimization, and this is the correct value from this optimization. Essentially, what we're going to do is we're going to update towards entries which are in here and not in there and away from the entries which are in here or in there but not in here. Intuitively, we want to produce this thing, so if we predicted something in here, we should be trying to move away from that, and if we didn't produce something in here, we should be moving towards that. This is also just the sub-gradient update for an SVM, so it's like a -- it's actually like three SVM updates at the same time. We're just adding them all together at the same time. So this is sort of I think pretty standard. Questions about parameter estimation? Good? >>: Parameters for the other [indiscernible] for the other connective statements in the first-order logic? >> Jayant Krishnamurthy: For the other connective statements? >>: Yes. Should they do a parameterization over whether it should be an existential quantifier or a lambda expression or the area of ->> Jayant Krishnamurthy: So that comes out of the semantic parse parameters, which was the previous update. The semantic parses -- I take the input language and I produce a logical form with them -- so all that stuff goes in that previous update. >>: I guess I should read your paper to see specifics. >> Jayant Krishnamurthy: Yes, I guess you might have missed the part where I talked about how the semantic parsing worked, but we can talk about that later. Okay. >>: So the picture you use -- I mean, for the object, you actually can use not just the [HOG], right?
Because whatever the evaluation system gives you the bounding boxes, it will also give you an object -- so right now, here, you just use the raw vision features? >> Jayant Krishnamurthy: Yes. Here, we're just using raw vision features. I mean, if you have a better object recognition system -- we actually aren't using an object recognizer. Tom Kollar went and just annotated the bounding boxes for this data set. So there's no automatic system that's providing additional labels here, but the feature -- this parameterization is generic, right? If you have better features, you can use them. >>: What about the features on the left? So for those kinds of relational features, what exactly are the features that you use? >> Jayant Krishnamurthy: So we're basically using a spatial orientation. So there's these AVS directional features, which I don't really understand, but Tom Kollar does. And there's a bunch of things like do the bounding boxes overlap, is this centroid above that centroid? There's just kind of heuristic things like that which try and capture the orientation of the objects. >>: [Indiscernible]. >> Jayant Krishnamurthy: Well, you can. It's going to be hard to distinguish visually from on, but if you say it's in front, you'll get like, does it overlap? Yes. Things like that in the feature vector. I guess if you totally occluded the object, it would be a problem. Okay, so this is our weakly supervised training procedure, and now I'm going to talk about some of the experiments we've done with this model. So what we've done is we actually created two data sets. So remember, I was saying this model is generic, but we created two data sets to try it out. The first data set is the sort of data set we've been using examples from all the time, where basically we took a bunch of images of the same set of objects -- the same mug, the monitor, the table -- but they're rearranged in different configurations. Obviously, if I move the object around, it does look different. So it's not like there's identity of objects preserved across the domains. And what we did is we collected some language from our research group and also from Mechanical Turk, and then we annotated what the right answer is for each statement, and this contains about 15 images and about 300 descriptions, so there's multiple descriptions per image. And then we also created this geography data set, so the traditional semantic parsing data set is this Geoquery data set, so this is kind of our mock version of Geoquery, but it's a grounded Geoquery. And the way it works is, as input to the system, it actually gets these states and cities and national parks and whatever, but they're all represented as these polygons of latitude-longitude coordinates. This image that you see here is pretty close to what the system is actually observing. It's actually observing this outline of this thing, and this data set has cities. This is a national park. It's some forest. We basically collected questions about these maps and then we annotated the right answer, and again, there's like 10 environments and 250 questions. And I'm going to kind of focus on the scene data set for most of the experiments, because I think we're most familiar with it, but I'll also present the results for this sort of just briefly at the end. Now, let me talk about how we're going to evaluate the system. What we're going to do is we're going to do leave-one-environment-out cross-validation. So we have 10 environments.
We're going to hold out one of them. We're going to train the system on nine of the environments and all the descriptions, and then we're going to take that held-out environment and we're going to have some test examples like this. And the way we're going to evaluate, we're basically going to have the system predict some set of objects, and we're going to annotate that set of objects as correct if it exactly matches the annotated set. So in this case, these are the same, so this would be correct. This would obviously be wrong, but this is also wrong. You don't get any partial credit for including the right object in the set. So if you think about this, you realize this is actually going to make it a kind of challenging task, because the number of sets of these objects is going to grow exponentially in the number of objects. So here we have four objects. That means there's 16 outputs, but if I had five, that would be 32, etc. >>: Five objects? >> Jayant Krishnamurthy: So in this data set, I think there's four to five objects per image. In the other data set, it's like somewhere -- there are sometimes like seven. There are -- the paper actually has the statistics on that, and also I'm going to show you results from a random baseline, so you'll have some idea of what the average is across the different domains. >>: You said there were 10 environments and then a bunch of images per environment. >> Jayant Krishnamurthy: Sorry, no, no, no. Each image is an environment. >>: Oh, and then so you have one image and then a bunch of statements. >> Jayant Krishnamurthy: And there will be like 10 to 15 different statements for that image. >>: Okay, and all the environments are the computer, mug, whatever, setting. >> Jayant Krishnamurthy: Yes, in this data set, they're all the computer, the mug, they're rearranged, those kinds of things. >>: Why don't you give partial credit? Is this just for simplification? >> Jayant Krishnamurthy: Yes, so that's a good question. Initially, when we submitted the paper, we had two different metrics, where one wasn't giving partial credit, one was giving partial credit, and then the reviewers got really confused, and just to simplify it, we kind of got rid of one. And that's basically it. And we also wanted to put in some error analysis and stuff which didn't fit, but you can imagine doing one where you say, I'll measure the overlap or something. >>: In your training, you have this Hamming [loss]. >> Jayant Krishnamurthy: Yes, there is. >>: Do you allow any numbers? Is it the two mugs or the one mug, numbers that would affect the final set? >> Jayant Krishnamurthy: So you can say it, but our parser lexicon ignores it. So we'll invent I think a predicate for two, but the way our semantic representation is put together, we can't actually detect sets of two objects. That's not a thing we can parameterize. So, actually, let me talk about that real quick. Let me show you the results here. What I'm going to do for the -- the experiments are designed to do basically two things, okay? First of all, we want to see if we're really learning both the categories and the relations here. So previous work has looked at similar models without considering the relations, and so including the relations is one of the things we're doing, so we were interested in seeing if that works. And the other thing we want to do is we want to see if weakly supervised training is competitive compared to, say, like fully supervised training. We want to know what the performance loss is there.
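A minimal sketch of the exact-match metric described a moment ago, with made-up example sets; a query counts as correct only when the predicted set equals the annotated set.

    def exact_match_accuracy(predicted_sets, gold_sets):
        # predicted_sets and gold_sets are parallel lists of sets of object ids,
        # one per test description; no partial credit is given for overlap.
        correct = sum(1 for pred, gold in zip(predicted_sets, gold_sets) if pred == gold)
        return correct / len(gold_sets)

    print(exact_match_accuracy([{"mug1", "mug2"}, {"monitor1"}],
                               [{"mug1", "mug2"}, {"mug1"}]))   # 0.5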
And what I'm going to do is I'm just going to show you these accuracy numbers, like based on that previous metric that we saw, and I'm going to break it up into three different categories of natural language, which is getting at your point here. I'm going to say, do you need to understand any relations to answer the question? If no, it falls in the zero-relation category. Is there one relation? It falls into that category. Then there's this other category, which is things with two relations, but those are pretty rare. It's mostly things like superlatives or numbers, where the number is actually important, which actually can't be represented by our model as we currently put it together. This requires some universal quantifier over the distance between two things, or two would require something that could look at all sets of size two, which we can't represent. So that's what's going to go in this other category. And so let's ask, how does the algorithm that I've been talking about the whole time, with the categories and the relations and the weak supervision, actually do? Well, overall, we get about 67% of these queries correct, and if you look at the results, you'll see that for the categories and the relations, we're actually doing better than that. But the fact that we have this whole other category is bringing us down. So conceptually, we're not going to be able to do well in this category, just because we don't have the representational power required to capture that. The next baseline that we're going to introduce is this category-only model, which is roughly based on the work of Cynthia Matuszek et al., and we're going to train that, again, using this weakly supervised algorithm. And what you can see here is that overall it does worse, and the reason it does worse is because you can't model these relational things. Now, that's not surprising, but there's a possibility that if someone uses the right noun as the first word, you can just guess what the right answer is. So there's a possibility that you didn't even need to understand relations to solve these questions. So this actually also demonstrates that relational knowledge is important for our data set. This one doesn't have relations like left of. >>: So your lexicon only has single ->> Jayant Krishnamurthy: Yes. It only has one-argument predicates, which I agree it seems unfair in some sense. It's just we're trying to replicate this previous thing. Now, the final thing is we're going to do the whole model trained with full supervision. So here, we actually annotated all the semantic parses and all the knowledge bases for every environment, for every natural language statement, and we trained the semantic parser and the classifiers independently, like we talked about, and the results are about the same as the weakly supervised algorithm. So we're getting about 70% of the queries right. We're doing a little bit better on these relational queries, but aside from that, the results are pretty comparable. So this is promising. This suggests that this weakly supervised algorithm performs competitively to the fully supervised case. Now, I'll just point out that similar results kind of hold true for the geography domain, so again, these numbers are pretty comparable, and this category-only model does a lot worse. >>: In the previous one, with the relation, or that relation, the blue mug example, in the first column. >> Jayant Krishnamurthy: Yes.
>>: The reason you're getting them wrong there, is it because of the parser? Is it because you didn't detect the blue correctly? Is it something you distinguish as ->> Jayant Krishnamurthy: The detectors are mostly the problem. So we do -- actually, our paper has an error analysis, but a big part of the problem is you only have a small number of examples for each of these different words, so it's kind of hard to train an object detector from 200 examples or whatever, and a lot of the words occur much less frequently than that. The actual vision side of the system is definitely the weak link, I would say. >>: Let's talk about how performance varies with data set size, how many descriptions per scene you need. >> Jayant Krishnamurthy: That's a good question. I do not have an answer for that. We could probably run an experiment. >>: Do you have any information about how many [entities] you're considering, how many colors, how many relations, those kinds of things? >> Jayant Krishnamurthy: I do, but not here. Again, in the paper somewhere. >>: [Indiscernible]. >> Jayant Krishnamurthy: So we did a cross-validation, so it's about 250 to 300. This is I guess 270, something like that. I forget the exact number. These aren't just variances, right? You have 300 examples. Similar results for the geography domain. Let me show you some examples. So these are some examples of what the model predicts. This is our input. We have some description, like monitor to the left of the mugs. We are not getting this one right. You'll notice we're not doing the two particularly well. There's also some question about whether this mug is actually blue. I don't know. So we're not getting that one right. We have some -- sure. >>: So here, the full supervision did worse than the weak supervision? >> Jayant Krishnamurthy: Yes. >>: Is that just not significant? >> Jayant Krishnamurthy: I think it's significant, but barely. I'm not reading too much into this 3% performance difference. Yes. It's a good question. I'm not sure why they're different. You can imagine that we just happened to find slightly worse parameters for guessing on the test set or something, right? >>: Can you [indiscernible] and test set? Do you have false statements in there? >> Jayant Krishnamurthy: False statements? >>: In either, when you have the description for training, or do you have [indiscernible]? >> Jayant Krishnamurthy: No. The data in these domains is actually very, very clean. They're manually made data sets, but there are these examples which you just can't get right, so those kind of do act as noise in a way. Like, if you say closest, the model is going to try and learn to predict this correctly, but it just can't. These data sets, they're clean. >>: What would happen if I were to inject noise in the training set? >> Jayant Krishnamurthy: I don't have a great idea. I mean, it's a little -- yes, there's potential for it to be bad. >>: So I'm a little bit curious about the [cat-only] weak one, because in the other domain, for the unary predicate one, it actually performed better, but why here is it so much worse? >> Jayant Krishnamurthy: So that's a good observation. This data set has more relations, like more queries with relations in it than the previous one. And so when you're training this model, it's actually trying to get those relation queries right, using only its categories. So it actually produces really bad category classifiers, because it can't actually learn the right thing.
It's outside of its representational power. So that's what's happening here. In the previous domain, there's a lot more things where someone says, like, the blue mugs, so you can learn a good classifier. Okay. So I'll show you some examples from the geography domain. So here are some things. We have capitals. I guess I didn't talk about this, but our feature representation has some linguistic context features, which allows you to do things like detect capitals. You can do weird question phrasings, and then here, we're not actually getting this one right, because this is a prepositional phrase attachment error, so this question is saying, what cities are both east of Greensboro and in North Carolina, so the correct answer to that is Raleigh, but what the parser says is, what city is east of Greensboro, where Greensboro is in North Carolina, and that actually gives you both of those two things as the answer. So that's kind of -- I don't know. That's kind of an error case you'd expect, right? You're right. I might have copied the question wrong. Yes, yes. Good point. >>: Just in case you use the slides. >> Jayant Krishnamurthy: Yes, I'll remember that. And I just want to say one more thing, which is we've actually taken this model, and we've hooked it up into an interactive system, so we used a Microsoft Kinect and we did some automatic bounding box detection on some objects, and this is Tom Kollar, and what you can do is, it has gesture recognition, so you can point at an object, like this object here, and you can say something, and that will get fed through an automatic speech-recognition system, and we'll get the text input. And then we can generate these training examples automatically. This paper, there's actually a paper about this in RSS, but we considered some extensions to the model, and I think one of the coolest extensions we did was language generation, so here are some examples of that. This is a training example that someone provided, and then we asked the system to generate language to describe that object that's being indicated. This language isn't perfect. It thinks the table's behind the book, but those look pretty similar. Here's another example of that. You can say the toilet paper to the right of the mug. The relations here are from the person's point of view, so this is correct. It was too hard to get people to point at objects and then think about the opposite thing. So that's something we've done too, and I think I'm actually basically out of time, so that's the end of my physically grounded component. I did want to talk about this. All right, give me five minutes? I'll go fast. Okay, so I want to change gears a little bit and talk about some of the other work we're doing on semantic parsing, and this work is kind of in a different vein, because we're kind of looking at these web-scale knowledge bases, things like now these information-extraction applications, and again, the motivation here is we have these databases. We have things like Freebase. They have millions of facts, they have thousands of predicates, and we want to be able to do question answering against these databases. We want to be able to do information extraction to fill in the stuff that's missing in these databases. But when we're talking about these databases, they're -- really annotating these semantic parses is kind of infeasible. The training burden sort of grows as the number of predicates in your vocabulary, so we're going to pay a price to annotate for these thousands of predicates. 
And we kind of expect the number of predicates to actually grow in the future, because even though they have thousands of predicates, that's only a small fraction of language, or of the world, that we're really capturing. So we want algorithms that scale to millions of predicates, I think, and so the goal here is we're going to try and train semantic parsers for these web-scale knowledge bases without using any labeled training sentences. And so the way we're going to do that is -- this is kind of like our input/output. I want to be able to take any sentence that I find on the web, and I want to be able to produce the semantic parse for it, using predicates from, say, NELL's ontology. That's what those predicates are. Don't read too much into the representation. So how are we going to do that? Well, we've been working on training these semantic parsers using distant supervision. So the idea here is, we have a bunch of language. We can go out and get all of Wikipedia. And we have all these sentences. We don't know what the right semantic parses for those sentences are, but we do know some knowledge that kind of relates to the things in those sentences. So if we look at these relation instances in our knowledge base, we can kind of use them to constrain what the semantic parse should be for each sentence. And the way we're going to formalize that is we're going to use this sort of distant supervision constraint, which is called the extracts-at-least-once constraint. So the idea is, we're going to take every relation instance in our knowledge base, and we're going to find all the sentences which could express that relation, so which mention that pair of entities. And then we're going to force at least one of those sentences to semantically parse to something which expresses that relation. So let's pretend this is the space of possible semantic parses for each sentence. The constraint will allow us to choose, for example, these two parses as the correct semantic parses, because this expresses the right fact, but we won't be able to choose, for example, these two semantic parses, because then none of them is expressing that fact. Okay, so this is kind of our distant supervision constraint, and the idea here is, the hope here is, we're going to have so many of these relation instances and so many of these sentences that the corpus statistics will help us decide which of these sentences should actually parse to that cityLocatedInCountry thing. >>: So the pair you take is always the two arguments, or do you also take the relation as the one you're -- basically, you have three. This is a [triple], and you constrain to find text which matches two of them. >> Jayant Krishnamurthy: Right. So part of it is that we don't know what the text expressing this relation is, so we kind of need to figure that out during training, but we do know what the text for the entities is, so that's why we do it that way. >>: And the knowledge comes from NELL? >> Jayant Krishnamurthy: So in the experiments I'm going to show you, we have a sort of confused, mishmash knowledge base of Freebase instances in the NELL ontology, which is there for historical reasons. It's not for any particular reason. >>: If the knowledge came from NELL, it seems kind of funny, because NELL would have gotten the knowledge from these sentences also, so it's -- but if it comes from Freebase, that makes sense. >> Jayant Krishnamurthy: That's true. That's a good point. I haven't thought about that too much.
Yes, it comes from a different kind of way of expressing it, though, so NELL has these corpus-statistics kinds of things. I'm not sure that would really matter. It seems like it might. Yes, good point. Good point. >>: Does NELL throw away the source of its knowledge? >> Jayant Krishnamurthy: Well, so the way NELL reads is it doesn't actually look at individual sentences or documents. It actually does a bunch of counting over all these documents and then uses those counts in some big matrix of, okay, this word has occurred with those other words this many times, to decide what things belong to what categories. So in some sense, there isn't one sentence source for its knowledge. It's from the corpus statistics as a whole. Well, there's an asterisk on that. Yes, it has some other extractors which do look at individual documents, but those aren't the main source of knowledge, I would say. Okay. This is sort of our distant supervision constraint, and just in the interest of time, you might ask, does this work? So we conducted a sort of relation extraction experiment, where we trained a semantic parser using this constraint, and we evaluated it on how well it extracted relations. So one thing you can do with the semantic parser is extract relations. It's more powerful than that, but you can extract relations. So these colored lines are different variants of our parser, and this dashed line is a sort of state-of-the-art relation extractor, and this is basically extractions as a function of sentences. So we basically extracted facts from a big corpus of sentences, and then we went through manually and said, okay, does this sentence express this fact? And that's what this precision axis is. And then this is the number of things extracted. And you can see here we're doing pretty well. We also did a sort of secondary evaluation, where we tried to look at, does the semantic parser actually do a good job of semantic parsing, as opposed to this simplified relation extraction task? So here we annotated some noun phrases with what their correct lambda calculus representation should be, and you can see that precision here is the fraction of -- not sentences, noun phrases -- which could be parsed that we got correct, 80%. This is the overall fraction that we got correct, so 56%. So you can see there's some discrepancy -- there's a number of things that can't be parsed. That's actually often due to sort of weird vocabulary. So one of our queries is ex-singer of something, where ex-singer is one word, so we're not doing a good job on those kinds of things. This is actually work that we did last year, and so this year we've been working on extending this in, I think, a really interesting direction, where our goal is to produce a kind of NLP tool that other people can plug into any application and get the semantics where it's available. And so our goal here is really -- I'm going to call this a broad-coverage semantic parser, and our goal here is we want to produce a full syntactic parse of every sentence, and then we want the parser to be able to say, okay, for this sub-span of the sentence, here's what the semantic representation in terms of NELL predicates is, and similarly for this other sub-span. So the hope is that by having a full syntactic parser, you can easily plug this into whatever your favorite application is, but then you can also get these sort of semantic features where they're available.
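To make the extracts-at-least-once constraint concrete, here is a minimal sketch in Python. It assumes a toy representation in which each sentence's chosen parse is reduced to the set of facts its logical form asserts; the type aliases, helper names, and example facts are illustrative stand-ins, not the actual training code, and in the real system this condition constrains the latent parses during training rather than being checked after the fact.

from typing import Dict, List, Set, Tuple

Fact = Tuple[str, str, str]   # (relation, entity1, entity2)
Parse = Set[Fact]             # a chosen logical form, reduced to the facts it asserts

def extracts_at_least_once(kb_fact: Fact,
                           mention_sentences: List[str],
                           chosen_parse: Dict[str, Parse]) -> bool:
    """True if at least one sentence mentioning the fact's entity pair was
    assigned a logical form that expresses that fact."""
    return any(kb_fact in chosen_parse[s] for s in mention_sentences)

# Toy example: two sentences mention the pair (Paris, France); the constraint
# holds because the second sentence's chosen parse asserts the KB fact.
kb_fact = ("cityLocatedInCountry", "Paris", "France")
chosen = {
    "Paris is the capital of France.": {("cityCapitalOfCountry", "Paris", "France")},
    "He moved to Paris, France in 1995.": {("cityLocatedInCountry", "Paris", "France")},
}
print(extracts_at_least_once(kb_fact, list(chosen), chosen))  # True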
And so our procedure for that is basically we're going to use this distant supervision scheme, but we're also going to use CCGbank, which is a big annotated corpus of CCG parses, and we're going to build one joint objective to do the right thing: parse the sentences in CCGbank correctly while also extracting these relation instances from Wikipedia. And by doing this optimization, we're going to produce this broad-coverage parser. So we have some preliminary results on this, and actually, what I'm going to do is I'll just show you a demo. Okay, so we have this demo website here, where you can type in -- at the top, you can type in some sentence, and it will try and parse it. These are some examples. Here is sort of the space of the parser's representation, so these are all the types that it understands, and these are all the relations that it understands. These are all from NELL's knowledge base. And if you click on one of these things, it'll parse it. It's a little slow, so I've opened one already. So here's the parse of the sentence, Madonna, who was born in Bay City, Michigan, on August 19th. What it does first is it looks up all of the noun phrases in NELL to kind of get the possible entity types for all these things. So there's a whole number of possible entity types for all of these noun phrases. And then, during parsing, it kind of tries to disambiguate: which entity do these things actually refer to? And you can see here we get this logical form right here which says, okay, this whole noun phrase refers to something C, which can be called Madonna and is a musician, and also C -- so C is also equal to this thing A, so A and C are the same variable. And A, Madonna, was born in location D, which was Bay City, and Bay City is located in the state Michigan. So we produce this whole parse. Now, you'll notice we've omitted this last clause, this on August 19th, because we don't have that sort of date relation. But we actually also produce a full syntactic parse, which kind of shows us where that modifier gets attached in this whole phrase. So you can see we have on August 19th kind of modifying this whole was born in Bay City phrase here. This is sort of the syntactic tree, and yes. That's kind of what our parser does. And we've done some sort of preliminary evaluation on this, kind of looking at the syntactic parsing accuracy and the semantic parsing accuracy, but we're not quite done yet. Okay, that's my demo, and yes. I just want to briefly conclude by talking about some of the directions that we're exploring. So the first thing we're trying to do is we're actually trying to take this LSP kind of model and extend it to do image understanding on these much harder kinds of vision corpora. So I'm working with some people up in [indiscernible] at CMU who are vision people, and they know how to solve these object-detection kinds of problems. And we're working with them to produce a version of LSP which has a much larger set of kinds of objects it can recognize, and also relationships it can recognize. Also, we're working on this broad-coverage semantic parsing thing, and one of the things that we're interested in doing is coming up with this sort of broader-coverage semantic representation, maybe something that can assign model-theoretic semantics to a larger subset of English than just the predicates in the knowledge base. So we have some ideas on how to do this using ideas from vector space semantics. So that's a direction we're kind of looking at moving towards, as well.
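For reference, the logical form described verbally in the demo above could be written out roughly as follows. This is a reconstruction from the spoken description, and the predicate names are illustrative stand-ins rather than the demo's actual NELL predicates:

$$\exists c, a, d.\ \mathit{name}(c, \text{``Madonna''}) \land \mathit{musician}(c) \land c = a \land \mathit{bornInLocation}(a, d) \land \mathit{name}(d, \text{``Bay City''}) \land \mathit{locatedInState}(d, \text{Michigan})$$

The on August 19th modifier is omitted from the semantics, since there is no date relation in the vocabulary, but it remains attached in the accompanying syntactic parse.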
Yes, so just to conclude, I talked about logical semantics with perception, which is a model for grounded language understanding where you have these physical environments and you have some language you want to understand in that context. It's a generic model. I showed you an application on image understanding, but you can imagine it can be applied to other things, as well, and then I also briefly talked about this training of semantic parsers with distant supervision, which is something we're working on for kind of question answering or information extraction applications. And then, if you're interested in the papers or the data for the LSP stuff, it's all on my website. Okay, so thank you. >>: The last one, where you say joint optimization, do you have any sort of -- so, basically, you get some score for different [parse trees], and then you've got an objective from the semantics, and then you just add them? >> Jayant Krishnamurthy: Yes, we just add them together. >>: So there is no [joint model]. >> Jayant Krishnamurthy: Well, there's one set of parameters which do -- there's one semantic parser, right? The semantic parser has to do both tasks, so in some sense it's sort of being trained to do both things simultaneously. >>: So does this implement feedback from [indiscernible]? >> Jayant Krishnamurthy: It does, actually. So, for example, here you'll notice the word who has this sort of head-passing category, where it takes the was born in phrase and makes its subject Madonna, but then makes the whole thing refer to Madonna. And that information is actually completely derived from the syntactic category of who from CCGbank, so it's something you can sort of automatically annotate. Things like conjunctions also work because of their annotation in CCGbank, so there's actually synergy between the semantics and the syntax. In fact, I have a slide. So, actually, one of the evaluations we've done is basically we just took a corpus of sentences, we semantically parsed it using this parser, and then we looked at each semantic parse and said, is it correct or not? So, again, that's precision, and I have two models here, so one is trained without CCGbank and one is trained with CCGbank, and you can see we actually do better with CCGbank. These are very preliminary results. >>: So the CCG parser does give you that prepositional phrase that's a time expression, so how is it that the logical form simply drops that? >> Jayant Krishnamurthy: So, what we've done is we've sort of taken the CCGbank syntactic lexicon, and we've annotated these syntactic categories with what their semantics are, and we've annotated it in such a way that it says, if we don't know what the semantics of this word is and it's a modifier, just ignore it in the semantic parse. But you can get that information out by looking at the syntactic tree -- there's these predicate-argument dependencies that get generated, so you can look at those, as well, to get that information back. >>: For the scene understanding part, if you generated the computer images -- synthesized them with a graphics card -- would that be helpful? >> Jayant Krishnamurthy: I'm not sure. We can talk about that. I'm not a vision expert. I'm really a language guy. But I guess you'd have the -- wouldn't you then kind of already know what each object looked like? >>: Is the training data of 100 -- it would be helpful if you had more training data. >> Jayant Krishnamurthy: Well, I mean, more training data would be good.
What I'm wondering is if you could synthesize these images in such a way where the system wouldn't have to already know what the right answer is when it's synthesizing the image and the annotation. Because if you synthesize an image and then you synthesize the caption, the mug to the left of the monitor, and that caption already works, that means you sort of already knew how to do that. So you need to make sure that the synthesized data isn't sort of trivial, and I'm not sure exactly how that would work. >>: It seems like now that you have the joint thing, you could almost reverse one of your arrows and, given a sentence, find the images that would parse to satisfy that sentence. Have you looked at that, as like an image-search kind of task? >> Jayant Krishnamurthy: Yes, that's a great application. We've thought about it, but we haven't actually tried it. I think we're going to try for this new data set -- it's got a lot more images. I think we're going to try an evaluation like that, probably, but we haven't looked at it yet. >>: I would have thought, just as you turned around and did natural language generation, you could do scene generation. >> Jayant Krishnamurthy: Oh, actually, yes, if we had something that synthesized images, that would actually help, right? Because then we could synthesize the image and then detect, does this image satisfy it? Yes, that would be good. >>: It seems like the language needs to be fairly compact, right, so that you get enough data to [learn] the meanings of each word and also the spatial relations. Did you constrain the annotators in any way? >> Jayant Krishnamurthy: So what we did -- I think the domains that we have lend themselves to relatively constrained language, and what we did was, basically, first we asked some members of our research group to just write down some descriptions and questions, and then that wasn't scaling great, so we asked some people on Mechanical Turk. And so the things that the research group contributed tend to be a little bit nicer. There are some Mechanical Turk things in there, like the one I showed you, which is just kind of a mess, where they're actually very complicated. But I think as long as you have some examples which are kind of easy -- if you have some examples which really help you constrain the concepts, like this one -- this is kind of a messy piece of language that someone on Mechanical Turk contributed. But as long as you have some examples which are sort of simple enough to let you learn the object classifiers and relation classifiers, I think the more complex language works from there. It's more of a curriculum strategy, in a way. >>: But here you have cups, mugs. You mentioned there's like 20 different ways to refer to these objects. Then, I mean -- >> Jayant Krishnamurthy: Yes, you need training data for each new word. Absolutely, absolutely. >>: So when you presented the data to the Turkers, did you present the boxes? >> Jayant Krishnamurthy: Yes. >>: So these were identified. >> Jayant Krishnamurthy: Yes. >>: So they had to -- they didn't have to worry about the shelves, for example. >> Jayant Krishnamurthy: I believe what happened is -- yes, I believe they were presented the object bounding boxes and then asked to describe the scene, and there was an example that was something like -- it wasn't for this image, but it was something like, there's a knife on the counter, or something. Something like that. >>: Have you considered different types of adjectives? Like, big might be problematic.
>> Jayant Krishnamurthy: Because it's something that -- it's relative. >>: It's relative. It's not just something that you can easily apply in a Boolean -- >> Jayant Krishnamurthy: Yes, that's a good question. I think, conceptually, you can represent those kinds of things in the semantic parse, but it's a little bit less clear how to detect them, I think. I haven't really figured out that second piece of the puzzle. It's kind of tricky. It's hard. I'm not sure what the right way to do that is. Even the colors here are sort of -- you could imagine it being the same way, like a red car and a red something else -- red wood, a redhead. Definitely different. >> Hoifung Poon: Okay, so Jayant will be here today and tomorrow, and there are still [indiscernible]. So let's thank Jayant.