>> Larry Zitnick: It is my pleasure to introduce Ali Farhadi to Microsoft this morning. He is from UIUC under the advisor of David Forsyth, and his early work was mostly centered on activity recognition; more recently he's done a lot of really great work on attributes and high-level semantic scene understanding. I think he's going to talk mainly about that today, so I will hand it over. >> Ali Farhadi: Thank you. First of all, thanks a lot for having me here. It's a great pleasure. I'm going to talk about better recognition with richer representations. As it sounds, I'm basically going to be focusing on a representational approach to recognition, and what we can do with these representations. First, why am I interested in recognition? Obviously it's intellectually a very, very challenging problem. It's a core, fundamental computer vision problem, of course. And it provides a lot of deep insight into other fundamental problems as well, I would say even human vision, for example. And if you approach recognition correctly, it provides a lot of applications. I don't need to explain the applications: surveillance, robotics, image search, all of those things come as applications of object recognition, assuming that you have approached this problem correctly. The way that we currently do recognition in the computer vision community is that we come up with a list of objects that we want to recognize. We sit down and write a bunch of names: car, bicycle, I don't know, motorbikes, people. Then we gather positive and negative examples for those, and we build models to learn those category models. And when we want to actually use those models, when an image comes in, we give it to all of our detectors, run them over the image, and we're just crossing our fingers that one of those detectors is going to be excited about this picture and say, hey, there's a car in here; I have seen cars before and it's right there. Most of, or almost all of, the focus in the object recognition community is on improving the numbers on these tasks. How can I build a better car detector? How can I build a robust person detector, and so on? And so we are pushing the bar each year; every year you see new benchmarks and new numbers on those benchmarks, and we are basically making great progress as a recognition community on this task. But we basically never step back and think about what we are going to do with the recognition systems that we are building. What happens if I show you a picture like this, and I run all of my detectors over it? Do you actually get anything out of there? The best outcome that I can get would be: gee, I don't know what this is, because I haven't seen it before. But on the other hand, if I ask you as a human, what is this picture? You can provide me tons of useful information about it. You can say, I don't know what this is, but I am sure it's a vehicle. It has wheels; I can see the wheels. I've seen wheels on other vehicles, so I know this is a wheel. And since it has wheels, and a wheel is a round thing, it probably moves on the road. And looking at the size of it, it probably works with [inaudible] power. So you can infer tons of useful information. You can say it's probably a new and modern vehicle. If I have to guess, it's probably pricey.
So, one of my goals is to provide such a capability for recognition systems. Without knowing what this is, humans can recognize it and localize it, localize the wheel, localize the windshield, and provide tons of useful information about it. So part of this talk will focus on how we can achieve such a capability with our recognition systems. That part of the talk is about what the interesting things in an image are. The other half that we are interested in is what I am going to report as the output of my description for an image. Do I want to list everything together and say this is this, this is this, and this is this? Or do I want to be smarter than that? The way that we do recognition, or the way that we describe images in the current paradigm, is that I get all of my detectors running all over the image and that becomes the description of the image: a list of words. Do we really want to do this? Is this how we, people, describe images? If I show you this picture and ask you to describe it, what would you say? You would probably select some of those objects. You probably don't talk about the flower back there, the pen down here, the car back there; you select some of the objects. And then you put them into relationships, in different forms, maybe in the form of a phrase, or maybe in the form of a sentence. So the other part of this talk will be focused on how I can provide such a description for an image instead of listing a bunch of words. How can I select some of those things and put them into relationships in terms of events, actions, functions, scenes, and things like that? So basically my talk today will be concerned with three big topics: attributes as a way of generalizing across categories, being able to talk about unfamiliar objects, describing and localizing unfamiliar objects; then richer descriptions of images, how I can predict sentences for pictures instead of a bunch of words; and at the end I'm going to talk about visual phrases, which I am really excited about, which are a way of getting to a more sophisticated way of predicting sentences. On the attribute side, I am going to talk about describing and localizing objects. First I'm going to start with describing objects, and then I am going to switch to the localizing part. My goal with this attribute-based representation is mainly to shift the goal of the recognition community from focusing on predicting a single name for an image to learning to describe objects. So basically my goal is to shift this point of view from pure naming to description. I want to be able to describe things instead of naming things or assigning a single word to them. And the procedure is very simple. An image comes in. There is a little bit of technical detail that I'm going to skip for the sake of time, which basically tries to de-correlate the attribute predictors and so on, but at the end of the day I have attribute classifiers. I run them over an image, and that gives me a description of the object. And if I am interested in naming the object, I basically fit a category to this description and then find the category. And this description can be semantic attributes or discriminative attributes, both of them.
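To make that pipeline concrete, here is a minimal Python sketch of the idea, not the actual system: the attribute names, category signatures, and stand-in classifiers are all hypothetical, and real attribute classifiers would be trained (for example, linear classifiers over image features).

```python
import numpy as np

# Hypothetical attribute vocabulary and per-category attribute signatures
# (0/1 = typically absent/present). Illustrative only.
ATTRIBUTES = ["has_wheel", "has_wing", "has_leg", "is_metallic", "has_window"]
CATEGORY_SIGNATURES = {
    "car":      np.array([1, 0, 0, 1, 1], dtype=float),
    "airplane": np.array([1, 1, 0, 1, 1], dtype=float),
    "dog":      np.array([0, 0, 1, 0, 0], dtype=float),
}

def predict_attributes(image_features, attribute_classifiers):
    """Run one (pre-trained) binary classifier per attribute and return a
    vector of scores -- the 'description' of the object."""
    return np.array([clf(image_features) for clf in attribute_classifiers])

def name_from_description(attr_scores, signatures):
    """Optionally turn the description into a name by matching it to the
    closest category signature (nearest neighbour in attribute space)."""
    names = list(signatures)
    dists = [np.linalg.norm(attr_scores - signatures[n]) for n in names]
    return names[int(np.argmin(dists))]

# Toy usage with stand-in classifiers; real ones would be learned.
fake_classifiers = [lambda x, i=i: float(x[i] > 0.5) for i in range(len(ATTRIBUTES))]
image_features = np.array([0.9, 0.1, 0.0, 0.8, 0.7])      # pretend features
description = predict_attributes(image_features, fake_classifiers)
print(dict(zip(ATTRIBUTES, description)))                  # the description
print(name_from_description(description, CATEGORY_SIGNATURES))  # -> 'car'
```

The point of the sketch is only the order of operations: describe first, and treat naming as an optional second step on top of the description.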
And I'm going to show that if you adopt this approach, you can do amazing new things that you could never have done before in recognition. For example, not only can you name things as before (this is an airplane, I know it), you can describe new objects, novel objects, things that you have never observed before. We have never observed carriages before; despite that fact, you can say this thing has wheels, it is made of wood, and give a lot of different descriptions. Another example: we never observed buildings in our training, but despite that we can say that these are 3-D boxy things which are vertical, and they have rows of windows in them. We have never observed centaurs; despite that, we can say it has a head, it has legs, and we think it has a saddle for some reason. But we can do more than that. We can also report what is atypical for known categories. Assume we know birds. We know the properties of birds; we know birds should have a head and a beak. If you show a picture like this to the system, it says: I know it's a bird, but the head and beak are missing, and that is something unusual and worth reporting. I know motorbikes, but I don't expect to see cloth on a motorbike; if I see that, I can report it. And here are some examples. We expect to see, for example, jet engines on an airplane; if you don't see a jet engine, you can report it. If you don't see a sail on a boat, the system thinks it's an unusual boat, and it can report that it's a boat with no sail, or the other way around. We don't expect to see faces on buses, and once we see one, we can report it. And sometimes we think there is a horn on a bike because the handlebars look like a horn, and we flag it as something suspicious. And there's even more than that. The attribute framework provides you with new functional capabilities so that you can deal with all these things. More than that, you can also learn new categories from much fewer training examples, or in the extreme case, with zero visual training examples. Assume you have never observed goats before, and I want to explain goats to you. I say a goat is something that has four legs; it's an animal with four legs, it is covered with wool, it has horns, and I basically list everything for you. So you have some sort of idea about the goat. You are not the best goat expert in the world, but you have some sort of idea about what a goat is from this purely textual description of the goat category. So our system provides such a capability, to learn new categories from fewer or no visual examples. In this plot, the dotted black line is the attribute framework and the blue line is standard recognition; this axis is the accuracy that we get and this axis is the number of training examples. One of the points I am trying to make is that you can get to roughly the same accuracy with way fewer training examples, meaning that you can learn new categories much faster by adopting this attribute representation. And the interesting part is what happens if I have zero training examples but a textual description of the category. Here is chance, and here's where we stand with the attribute representation. We are still not up here, because we basically need forty examples to get there, but with zero visual examples we can actually get up to here.
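A minimal sketch of that zero-shot naming step, under the assumption that attribute classifiers already exist; the attribute list and the textual signatures for "goat" and "bicycle" are made-up illustrations, not the ones used in the talk.

```python
import numpy as np

# Attribute predictions for a test image come from classifiers trained on
# other categories; the unseen category is described only in text.
ATTRIBUTES = ["has_four_legs", "has_horns", "has_wool", "has_wheel"]

# "A goat is an animal with four legs, horns and wool" -> 0/1 signature.
# No goat images are used anywhere; this is the zero-shot case.
text_signatures = {
    "goat":    np.array([1, 1, 1, 0], dtype=float),
    "bicycle": np.array([0, 0, 0, 1], dtype=float),
}

def zero_shot_name(attr_scores, signatures):
    """Score each unseen category by how well the predicted attributes
    match its textual signature (here: negative L1 distance)."""
    scores = {name: -np.abs(attr_scores - sig).sum()
              for name, sig in signatures.items()}
    return max(scores, key=scores.get), scores

attr_scores = np.array([0.9, 0.8, 0.7, 0.1])   # attribute classifier outputs
print(zero_shot_name(attr_scores, text_signatures))   # -> ('goat', {...})
```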
We have some idea of what the goat is without being a perfect goat expert or having a good goat detector. With that, I'm going to close the description part. So far we have just talked about how to describe objects in terms of attributes. One other thing that people can do is localize unfamiliar objects: being able to say, I know there is an object here, it's probably a vehicle, here is its wheel, here is its windshield. We people can do that. Can we build a recognition system that can do the same? And the answer is yes. So, again, we are going to adopt an attribute-driven recognition system. We are going to build detectors not only for basic-level categories, as we usually do, but also for superordinate categories, for parts, for attributes, for everything. So we have a pool of detectors, and when a new image comes in, I run all of my detectors over the image. Each of them has some opinion about where the objects are, and then I use all of those detections to reason about the location and the properties of the objects. So the procedure looks like this: an image comes in; I run all of the detectors that I learned -- actually we built a dataset called [inaudible], which has examples with detailed annotations of all of the parts for vehicles and animals, and we use that dataset to build the detectors. We run all of our detectors over the image. Then we have machinery that looks at all of those predictions and decides where the object is. And then we use this localization information, together with all of the other detection results, to describe the object in terms of attributes, spatial properties, and functions: this is carnivorous, this creature can jump, all that information. First, how do we localize? It's a very simple thing and very fast. Each of the detections has an opinion about where the object is. So the ear detector has an opinion about where the dog is; the nose detector, the same; the dog detector itself has an opinion about where the dog is; the mammal detector has an opinion about where the dog is, and the animal detector as well. We gather all of those votes for the location of the object, we cluster that space, and we pick the center of the most populated cluster as the answer for where the object is. It's very simple and it's very fast. And once we localize the object, we can describe it.
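A rough sketch of that vote-and-cluster localization; the greedy grouping below is a simple stand-in for whatever clustering the actual system uses, and the vote values are invented.

```python
import numpy as np

def localize_by_voting(votes, radius=50.0):
    """Each detection (wheel, ear, dog, mammal, ...) casts a vote for the
    object's box. Greedily group votes whose centres fall within `radius`
    of a seed vote and return the mean box of the largest group."""
    votes = np.asarray(votes, dtype=float)          # rows: [cx, cy, w, h]
    unused = list(range(len(votes)))
    best_cluster = []
    while unused:
        seed = unused.pop(0)
        cluster = [seed]
        for j in list(unused):
            if np.linalg.norm(votes[j, :2] - votes[seed, :2]) < radius:
                cluster.append(j)
                unused.remove(j)
        if len(cluster) > len(best_cluster):
            best_cluster = cluster
    return votes[best_cluster].mean(axis=0)         # box of densest cluster

# Toy usage: three detectors roughly agree, one spurious vote is ignored.
votes = [[100, 120, 80, 60],    # from the "ear" detector
         [105, 118, 82, 58],    # from the "dog" detector
         [ 98, 125, 79, 61],    # from the "mammal" detector
         [400, 300, 90, 70]]    # a spurious detection somewhere else
print(localize_by_voting(votes))   # ~[101, 121, 80.3, 59.7]
```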
The way we describe the object is that we have two different types of attributes. There are attributes for which we have direct visual evidence, for example whether something is a dog or whether something is a mammal. Basically, for parts, for basic-level categories, and for superordinate categories we can build a detector, so for those we have a detector. We have other types of attributes for which we either don't know how to build a detector or it is very hard to build one. For example, whether an object has the potential of having a leg, no matter whether the leg is visible or not; this is not about whether a leg is visible, it is about the potential of the object having a leg. Or the functions: this creature can jump, can run fast, is carnivorous. And also the aspect information: it is lying down and facing toward the camera. The way we approach this is that we learn all of those things with a simple [inaudible]. What it does is look at the correlations with the things that we do have detectors for and infer the others. We solve this with simple EM; the inference is exact and very fast. The gist of the idea is that if I show you a box and tell you there is a head on this side and a tail on that side and ask you which way the animal is facing, you can say it's facing that way. So basically we look at all of the predictions of all the things that we have, and the correlations between them, to infer things for which we don't have any direct visual evidence. So to see how well this is working… >>: The things that we don't have visual evidence for, can they in any way help the detection of the things that we do have evidence for? Can they help weed out false… >> Ali Farhadi: No. Actually, we are going to use those as nodes and marginalize over them. >>: So they all help each other? >> Ali Farhadi: Yes, they all help each other through the attributes. So to test this, we take the dataset and divide it into two sets, familiar categories and unfamiliar categories. Unfamiliar categories are the things that we are not going to observe at all during training, and what we want to see is whether we can generalize to those unfamiliar categories or not. The first question is: can I learn a wheel model, let's say, on trucks and pickups and cars and expect it to work on motorbikes? And the answer is yes. The detectors that we have are actually really powerful; with some loss of accuracy, they do generalize. Here's an example. This is a leg detector learned on some categories and tested on the same categories, not the same instances but instances of the same categories, and this is the ROC of the leg detector when you train it on some categories and test it on other categories. And these are examples for leg, horn, wing, head, eye, and so on. So the gist is: yes, they generalize. They are not as good, but they are reasonably good, and we can work with them. So here are the sorts of things that we can do. The elephant is an example of a familiar category. We can localize it as before, as a standard recognition system would, but you can do much more than that, because we can say: hey, this is an animal; it is a four-legged mammal; I know it is an elephant because I have seen it before; here is the leg, here is the foot, here is the trunk. All of those involve localization information. And things get more interesting when we are shown unfamiliar things. We have never observed cats before; despite that fact, we can localize the cat as an animal. Green means animal; red means vehicle in these pictures. And we can say here is the head, here is the leg; we think it has a hump; and it's a mammal. We have never observed jet skis before; despite that, we can localize them as watercraft. We have never observed buses before, not a single example of a bus is in our training set, and despite that we can localize the bus. I can say, I don't know what this is, but whatever its name is, it is a wheeled vehicle; here is the wheel; here is the license plate. And if you are interested in numbers, these are the quantitative results on how well we do on the dataset; the dotted lines are basically traditional recognition.
So what happens if you just focus on basic localization versus attribute-centric recognition? The red curve is for familiar objects. This is what you get if you adopt the attribute-based recognition: even for familiar cases there is this much gain from the attribute-based representation compared to only doing the standard thing, which is naming. Even for localization of familiar things. And there is also a gain for unfamiliar objects, and a gain for both together. >>: The main reason for this boost, is it because in the standard method you are using less training data, since here you're generalizing across categories and so essentially have more training data, or is it that the attribute representation is a more natural representation that can be more… >> Ali Farhadi: It's actually both of them. Because you may miss a car, but you may not miss a wheel, and then the wheel can help boost your car detection in the model, basically in the voting system. So both of the hypotheses are sort of correct: we get gains both because we have more models and because those models talk to each other. What I've shown so far are actual results of our system. We never observed horses or carriages in our training at all, not a single instance of them. Despite that fact, we can say: I know there is a vehicle here; here is a wheel; and this vehicle, whatever it is, moves on the road and is facing to the right. I know there is an animal here which is probably a four-legged mammal; here is the head; here is a leg; this creature can run, can jump, is herbivorous, and is facing right. So you can do this type of inference for an image by adopting this attribute-based representation. But at this point I am going to shift gears and, after talking about what to predict in an image, talk about what to say about an image. Do I really want to list everything? Or do I want to provide a concise description of an image? >>: Ali, can you back up a bit? It's going by very, very fast for me. Typically we have time in these talks. I can see that you are detecting the leg, but it's a total mystery how you, and what part of the system, is saying it is herbivorous or can jump or something like that. >> Ali Farhadi: For that I have to go back to the graphical model. Basically this comes from the node that predicts the function of the objects. For example, if I want to predict whether something can jump or not, I have supervision for all of those things, and what I can do is learn that if something has a long leg, has four of them, and probably has a little belly, it can jump. So we learn those things through the correlations with the things for which I have a detector. I have a detector for legs, for all of the parts and attributes of the objects, and I can infer the rest by looking at the patterns of the responses of those detectors. >>: So the notion of a long leg is different than the notion of a short leg? >> Ali Farhadi: No, no. There is no notion of long versus short leg in this framework. And if you look at the results for things that can jump, for example, we really think that elephants can jump, which they probably can't, because we don't have the notion of weight with respect to leg length, with respect to muscle power and that sort of thing.
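As a toy illustration of this kind of correlational inference: the talk uses a latent-root model trained with EM, whereas the stand-in below just fits a plain logistic regression from detector responses to one function label ("can jump"), with invented training rows.

```python
import numpy as np

# Columns: responses of detectors we do have (leg, long_leg, wheel, wing).
# Rows are made-up training examples, labelled with a function for which
# we have no detector at all.
X = np.array([[0.9, 0.8, 0.0, 0.0],   # deer-like     -> can jump
              [0.9, 0.2, 0.0, 0.0],   # tortoise-ish  -> cannot
              [0.0, 0.0, 0.9, 0.0],   # car           -> cannot
              [0.8, 0.9, 0.0, 0.0],   # horse-like    -> can jump
              [0.1, 0.0, 0.0, 0.9]])  # plane         -> cannot
y = np.array([1, 0, 0, 1, 0], dtype=float)

def train_logistic(X, y, lr=0.5, steps=2000):
    """Plain logistic regression by gradient descent (numpy only)."""
    w = np.zeros(X.shape[1]); b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * float(np.mean(p - y))
    return w, b

w, b = train_logistic(X, y)
new_animal = np.array([0.85, 0.75, 0.0, 0.0])   # unseen, leggy creature
p_jump = 1.0 / (1.0 + np.exp(-(new_animal @ w + b)))
print(f"P(can jump) = {p_jump:.2f}")            # well above 0.5, driven by the leg cues
```

The label is predicted purely from correlations with detectable attributes, which is also why the kind of mistake described next (elephants "can jump") is possible.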
So it does make mistakes, because this model is very basic in the way it models these correlations, or the attributes are not detailed enough to infer all of those things. The reason we can predict some of these things at all, for example something being carnivorous or herbivorous, is probably just a coincidence of the correlations between the things that we have; for some of them the correlation is the only thing that lets us infer it. >>: So for example the cart and horse, were there other predictions that it made that were not correct? >> Ali Farhadi: Sure. I have examples; I showed you a wrong prediction, for example. There is a hump over here; there are two trunks over here. So there are many wrong predictions over the image. This one is a specific example that we use as the icon of the paper; it happens to be a perfect example with no mistakes, but mistakes do happen. >>: So facing right for the carriage: if you've never seen a carriage before, how would you infer that it is facing right? >> Ali Farhadi: Basically you can infer it from facing-right cars, let's say. >>: Yes, so it has no [inaudible], all wheels in front… You see the person on top, you see his face… >> Ali Farhadi: We don't do that, actually. We only look at information that relates to this vehicle; we don't infer from the person. But this is to show that if you know which way cars or buses are facing, you can transfer that knowledge to carriages. >>: So does that come from the spatial relationship between the wheels and the body? >> Ali Farhadi: Yes. >>: I was surprised that none of the attributes have child-to-child relationships, only parent-child relationships with the functions. I kind of expected to see that if one attribute was there, that would affect the conditional probabilities of the others. So they are only communicating through the parent node [inaudible]. >> Ali Farhadi: Actually, I agree that adding those connections would make the model more sophisticated and probably work better. The reason we didn't do that is that we wanted a very fast inference model rather than exact things; we wanted to be fast, and we didn't want to get into the details of how to boost those things. But I completely agree with you: if we added layers to this model, you could do better than just a single layer under the root. >>: With only the parent node for communication, I mean, that greatly limits how much those nodes can actually coordinate back and forth. >> Ali Farhadi: Of course. I agree. >>: So what is the probability that this [inaudible]? >> Ali Farhadi: For the distributions of the detector nodes we basically use a multinomial, and we consider the other nodes as multinomial as well, and then we marginalize over everything. So it's basically a very simple EM-style latent model. So now I am going to go to the second part, which is how I can provide different types of descriptions for images. We think sentences would be the right description for images. If I ask people, people tend to select some of the objects and put them in some sort of relationship. We believe that sentences are the right representation for images. Why? Because they provide us with the capability to talk about relationships, events, and functions, and they also implicitly talk about what is worth mentioning.
If I know how to predict a sentence for an image, I am implicitly selecting some of the detections and putting them into a sentence. And this is an extremely challenging task; people familiar with recognition know it is very, very hard to predict a sentence for an image. If you look at the literature, there are some approaches that try to do this by being explicit about the relationships (this is on top of this, this is below this, this is beside this), and then they try to infer those properties. But there is a limit to what you can do with those types of explicit approaches. Why? The domain of the problem is very hard; you are going to get lost in the inference of very sophisticated models. So what we do is adopt a non-parametric approach to this problem. By non-parametric I mean: if I have a dataset of images and sentences in correspondence, and I have a nice representation, can I learn to measure the similarity of a sentence to an image? If I can do that, then, assuming I have a big enough dataset of images and sentences in correspondence, when a new image comes in I find the closest sentence and report that as its description; or a new sentence comes in, I measure the similarity and find the best image. So how can I build such a similarity score? We believe that there is a space of meaning sitting in between the space of sentences and the space of images. Each image has a projection into the meaning space, and each sentence has a projection into the meaning space as well. If I can learn these two projections, then I am home: to score the similarity of an image and a sentence, I project both of them into the meaning space and look at the distance between the projections in that space, and that provides me with a similarity measure. How do I do that? Since this is an extremely hard problem, we take a simplistic approach. I am going to assume the meaning space can be represented by an object, an action, and a scene. I am ignoring the subject, the properties, the [inaudible], everything; I am just concerned with the object, the action, and the scene. And what I do is learn these projections discriminatively and jointly; basically I set up a structured learning problem, which I am not going to talk about -- if you're interested we can talk off-line. The job of this structured learning is to rank the correct sentence for an image higher than the rest of the sentences, and I learn the parameters over these three nodes, these three elements of the meaning space. To be able to do that, we need a dataset that puts sentences and images in correspondence. So we actually built such a dataset, called the UIUC sentence dataset. We take images from PASCAL (the slide shouldn't be that dark, sorry about the picture) and ask people on Mechanical Turk to write sentences for the images.
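A minimal sketch of that meaning-space matching, with hypothetical vocabularies and hand-set projection scores; the real system learns both projections with structured learning rather than assuming them, and it retrieves sentences from the dataset just described.

```python
import numpy as np

# Toy <object, action, scene> meaning space; indices in the vectors below
# follow these (invented) vocabularies.
OBJECTS = ["person", "horse", "train"]
ACTIONS = ["ride", "stand", "run"]
SCENES  = ["field", "station", "road"]

def score(image_triplet, sentence_triplet):
    """Similarity of an image and a sentence = negative distance between
    their projections into the meaning space (concatenated scores)."""
    a = np.concatenate([image_triplet[k] for k in ("object", "action", "scene")])
    b = np.concatenate([sentence_triplet[k] for k in ("object", "action", "scene")])
    return -float(np.linalg.norm(a - b))

def describe(image_triplet, sentence_bank):
    """Nonparametric 'generation': return the existing human-written
    sentence whose meaning projection is closest to the image's."""
    return max(sentence_bank, key=lambda s: score(image_triplet, s["meaning"]))["text"]

# Image projection (e.g. from object/action/scene classifiers) and a small
# bank of sentences with their (parsed) meaning triplets.
image = {"object": np.array([0.7, 0.9, 0.1]),
         "action": np.array([0.8, 0.3, 0.2]),
         "scene":  np.array([0.9, 0.1, 0.2])}
bank = [{"text": "A person rides a horse in a field.",
         "meaning": {"object": np.array([0.8, 0.9, 0.0]),
                     "action": np.array([0.9, 0.1, 0.1]),
                     "scene":  np.array([0.9, 0.0, 0.1])}},
        {"text": "A man stands next to a train at a station.",
         "meaning": {"object": np.array([0.9, 0.0, 0.9]),
                     "action": np.array([0.1, 0.9, 0.0]),
                     "scene":  np.array([0.1, 0.9, 0.0])}}]
print(describe(image, bank))   # -> "A person rides a horse in a field."
```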
This is an example from the dataset, and there are interesting properties in the sentences: people are, interestingly, in great agreement about what to say about an image. All of the people are talking about the two men here; all of them are talking about the talking; all of them are talking about the relationship to the plane; and all of them are talking about the plane. So there is great agreement between people about what to say about an image. They don't talk about the grass back there; they don't talk about the trees, the clouds, the sky, or the jacket that this guy is wearing. So people are in great agreement about what to say about an image. And we basically try to approach this problem implicitly, because we really don't know how to [inaudible] approach it; it is an extremely challenging problem, and it is actually one of the future directions I'm going to talk about. But if you approach it implicitly, in terms of how to predict a sentence for an image, then by predicting a sentence I am implicitly selecting some of the things. So the way we do it is: an image comes in; I run all of my scene detectors, all of my object detectors, and all of my action detectors over the image; I have a simple CRF that connects these together, and then I do inference; and that feeds information to the structured learner. Here is an example of what happens at the end. Of course this is not a random example; this is a good example of the method. For this picture we predicted "A man stands next to a train on a cloudy day," "A backpacker stands beside a green train," and different sentences. And remember, we are not generating the sentences. We didn't want to get into the details of language generation; we just retrieve them from the dataset by scoring similarity. And surprisingly, this simple nonparametric approach worked quite well for images and sentences. We have evaluation metrics in the paper; if you are interested we can talk about that. I am not going to bore you with evaluations at this point. >>: Just to be absolutely clear, the sentences that you are producing here are whole sentences that people wrote down for other images? So you are not mixing and matching any parts; you are just taking the entire sentences? >> Ali Farhadi: Right, we are not mixing; yes, entire sentences. And since our system is symmetric, we can go the other way around. If you give me a sentence like this (and remember this is a sentence written by a [inaudible]), given that sentence, we can find images like this: a horse being ridden within a fenced area. There are very interesting properties here; for example, we don't have detectors for everything. We don't have a detector for fence; we only have a limited number of things for which we actually have a detector. The way we recognize those is that we have a way to incorporate distributional semantics into our detectors. For example, say I have never seen a Beetle, and there is a sentence that comes with an image: there is a beautiful yellow Beetle on the street. We have never observed Beetles and we don't have a Beetle detector. But through distributional semantics I know that a Beetle is very similar to a car, a little similar to a bus, and not similar to a dinosaur. So basically I re-weight my detectors to get a rough-and-ready Beetle detector that I can use here. And incorporating those distributional semantics actually works really well.
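A small sketch of that re-weighting idea, assuming we already have detectors and some word-similarity function; the similarity numbers below are hand-set toys standing in for real distributional-semantics scores.

```python
import numpy as np

def rough_detector(word, detectors, word_similarity):
    """Build a rough-and-ready detector for a word we have no detector for,
    as a similarity-weighted combination of the detectors we do have."""
    names = list(detectors)
    weights = np.clip(np.array([word_similarity(word, n) for n in names]), 0.0, None)
    weights = weights / (weights.sum() + 1e-9)
    def score(window_features):
        return float(sum(w * detectors[n](window_features)
                         for w, n in zip(weights, names)))
    return score

# Hypothetical ingredients: detector scores we already have, and a toy
# word-similarity table standing in for distributional semantics.
detectors = {"car": lambda x: x[0], "bus": lambda x: x[1], "dinosaur": lambda x: x[2]}
toy_sim = {("beetle", "car"): 0.8, ("beetle", "bus"): 0.4, ("beetle", "dinosaur"): 0.0}
word_similarity = lambda a, b: toy_sim.get((a, b), 0.0)

beetle = rough_detector("beetle", detectors, word_similarity)
window = np.array([0.9, 0.2, 0.1])   # a window where the car detector fires
print(round(beetle(window), 3))      # high score, mostly borrowed from "car"
```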
If you are interested in how to incorporate that, we can talk about it off-line as well. But there is one big problem with this approach, which Larry actually mentioned. What I am doing at this point is scoring the whole sentence against the whole image, and if I want to do that I am implicitly assuming that I have a huge dataset with sentences that correspond to every possible image, which is a rather unreasonable assumption; I cannot build that. Looking back at the history of machine translation, the boost in machine translation happened when people started talking about phrases: this chunk of a sentence in English corresponds to this chunk of a sentence in French, let's say. So people do translation phrase by phrase. And we have a translation problem: I am translating from the space of images to English text. Instead of matching the whole sentence to the whole image, at this point I want to match a chunk of an image to a chunk of a sentence and establish a phrasal recognition system. That gets me to the third part of my talk, which is: how can I learn complex composites like a person riding a horse, or a rider and a horse jumping over a fence? Because if I can learn those, I have basically learned a phrase detector, something which is bigger than an object but smaller than a scene. We do all this talking about objects and scenes, and we never thought there might be something in the middle. These phrases are that something in the middle: they correspond to a chunk of an image, bigger than an object but smaller than the whole image, similar to phrasal translation, where phrases consist of a couple of words or more but are smaller than the whole sentence. How can I do that? Conventional wisdom in vision says that if you want to detect a person riding a horse, you detect the person, detect the horse, and then somehow put them in correspondence. An example of that would be the [inaudible] paper on how to learn that this is on top of that and this is beside this, and therefore infer that this is a person riding a horse. When we do that, we are implicitly making a huge assumption which is wrong: that the appearance of the objects stays the same when they interact. But as a matter of fact, the appearance of the objects changes a lot when they participate in relationships. A person riding a horse takes very few characteristic postures compared to a typical person. A horse being ridden by a person is quite different from a typical horse. A person drinking from a bottle has such a rigid, characteristic posture compared to a typical person that we are suggesting: if you want to detect these things, why not detect them as a whole? If you want to detect a person lying on a sofa, you're going to have a miserable time detecting the person and detecting the sofa, but these things together have such reduced visual complexity that you can learn them as one entity. So we are proposing that besides learning horse and dog, what about learning a person riding a horse, or a dog lying on a sofa? What's wrong with this?
The first thing that comes to mind is: well, there is a combinatorial number of phrases I can make out of words. How are you going to deal with that? Do you want to learn a combinatorial number of phrases? And if you want to do that, I'm sure you are not going to have enough training data. The answer is that it's very similar to what happens in language. There is a saying that with the number of words you know, you could talk your whole life without repeating a phrase twice. But we people don't do that; we have a small number of frequent phrases that we use a lot. And what we are proposing is that there are a few very characteristic phrases that we can detect highly reliably, and there is no excuse for not using them and not getting them into our recognition systems. Our recognition systems consisted of only objects before (some people talk about context and scenes and such), and we are adding these phrases to them. Why? Because they are very characteristic; we can learn them very, very reliably, and they are much better than learning the individual parts. How am I going to do that? Basically we built a dataset for phrasal recognition. It consists of basic objects from PASCAL, objects that we really know how to deal with; they are famous objects from PASCAL. Then we make all possible phrases that we can think of out of those objects: there are eight objects, 17 visual phrases, and almost 3000 images, and we draw bounding boxes for all of the phrases and all of the objects in them. Look at the people drinking from bottles, for example; look at the people riding horses. There is this great reduction in the visual complexity of the objects, so what I can do is learn detectors directly for those. And surprisingly, I don't need that many examples to do it, because it is such an easy detection problem; it's very characteristic. For example, if I show you this picture and ask you what it corresponds to, you can all say it is a person riding a horse. And what is the model for this one? >>: A person riding a bicycle. >> Ali Farhadi: Exactly. So it is so simple, the behaviors are so characteristic, that we can easily, easily learn detectors for all of those things. >>: I have a question about the 17 phrases. When you say 17 phrases, do you mean 17 pieces that can recombine into other phrases, or do you mean 17 phrases like a person drinking from a bottle? >> Ali Farhadi: The second one. A person riding a horse, a person jumping. >>: But even if you argue that the set of all possible phrases is huge and there are a lot of repeats, 17 is smaller than expected. >> Ali Farhadi: Actually, scaling this to hundreds or thousands is easy, because I only used around 50 examples for those. These are so easy and so fast to train that you can learn them very easily. Later on I am going to show you some experiments that will convince you that scale is not the issue here. Let's see them in action. These are the baseline comparisons for person riding a bicycle: the person and bicycle detectors versus the person-riding-a-bicycle detector. The bicycle detector has absolutely no clue about the bicycles here, and there is only one person detection over here, whereas the person-riding-a-bicycle detector actually fires in five correct places.
The dog detector has no idea what's going on over here; the sofa detector has no idea; but the dog-lying-on-a-sofa detector actually finds it. The bottle detector has no idea over here, but the person-drinking-from-a-bottle detector actually fires there. So the next question is: how can I compare this approach, which is to train the phrase detectors directly, versus finding the individual components and then putting them in correspondence? For that we built a baseline. What this baseline tries to do is find an optimistic upper bound on how well one can predict a person riding a horse from this prediction of a person and this prediction of a horse. And we built a very, very generous baseline. Basically, I expand the two bounding boxes to form another bounding box, and I use the max and the min of their confidences as the confidence of the final bounding box. But I am not going to stop there. I also regress the position of the phrase bounding box against these two bounding boxes on the test set, and I regress the confidence as well, and I look at all of those options and pick the best one on the test set. So this is an extremely generous baseline; I am just asking what the upper bound is on how well one can predict this phrase from those predictions, and I am training my regressor on the test set. And I compare this generous baseline to detecting the phrase as a whole. These are our precision-recall curves. The blue lines are the detections for the phrases, and the red ones are the baseline that I just talked about. Look at the huge gaps. These are not easy gaps to get in recognition; if you look at typical recognition results, you see just tiny gaps between curves. >>: So how did you train the baseline here? Did you train it the same way you trained yours, or did you train it on generic people? >> Ali Farhadi: For those we actually had two approaches. I used the state-of-the-art person detector that Pedro gets on PASCAL, and I also trained a person detector on my own dataset. I run both of them and pick the best one, so this is basically the best possible detection plus the best possible relationships. The problem is that a typical person detector won't get a person riding a bicycle, because the person is bent, and if you want to build one model that handles all the variations of people, the model is going to get lost. >>: It might be a function of the training data, because you're training the person detector on these very diverse people images which have very few people-riding-bikes images. In a lot of ways this looks similar to the component models, right, where basically you take people and break that dataset into, let's say, six or ten components (people lying down, people standing up, people bending at the waist, that sort of thing), and you are basically saying let's do another cut, which is people on bikes. So… >> Ali Farhadi: You are actually asking two questions. One is: do you train this only on data where people are on bikes? The answer is no. I train two detectors for person. One is the state-of-the-art PASCAL detector; the other is trained on my own dataset.
So basically I am trying to capture both of them. And in response to the second question, look at this curve. If the component model could handle that, this curve shouldn't be here; it should be somewhere close to that, or there should be a smaller gap here. The reason that Pedro's model cannot do that is that when the variation is too big, you cannot have a latent model with 20 different components and expect it to handle all of these things. And those gaps are amazing gaps. Look at a horse and rider jumping: you usually don't see such a high curve in vision; I haven't seen one like this before. Or a person drinking from a bottle: you don't have much idea about the bottle, because we don't get the bottle right, and so most of the time the person detections are confused. So this shows that if you learn those detectors jointly, because of the very rigid visual structure of the phrases, you're going to have a very, very easy time learning very, very reliable detectors for the phrases. >>: Maybe the other way to phrase this is: what if you took both of them and trained them from the same training dataset? Let's say person drinking from a bottle; we show the baseline and your method the exact same data. The only difference is that for your person-drinking-from-a-bottle detector you draw the bounding box around the person and the bottle together, whereas with Pedro's you draw the bounding boxes around the bottle and the person separately. >> Ali Farhadi: Uh-huh. Then we have the baseline, basically. >>: Is that what you did for the baseline, or is that… >> Ali Farhadi: Yes. So this is a way of predicting the final phrase bounding box out of a person and a horse. >>: You did this on the training. When you actually trained the models, you trained them the same way. >> Ali Farhadi: For the baseline? Yes. And I trained the regressor on the test set, not on the training set. So basically I trained it to predict the position of the phrase from the positions of the components, and the confidence of the phrase from the confidences of the components, and I trained it on the test set. And this is what you get. So now that we have these amazing results for detecting phrases, what should I do with them? Am I making the problem even harder than it was? Because we had objects before and we didn't know what to say about the image; now we have objects and phrases, and I can run all of them over the image. What can I say about the image at the end of the day? We believe that there should be decoding machinery in every multi-class object detection system, like in machine translation, that sits on top of the predictions and decides what to say about the image at the end of the day. So an image comes in, you run all of your object detectors, phrase detectors, whatever detectors you have; each of them has an opinion about where the objects and phrases are. And if I show you this and ask, okay, what should I say at the end of the day? Maybe there is a weak horse detection right here, just below the threshold, and it didn't make it into the final answer. Maybe these are a little bit over the threshold, and by just pushing them down I can get rid of them. So we actually developed machinery, called decoding, that sits on top of those predictions, and what it says is: okay, if there is a person riding a horse over here, there should be a person prediction and a horse prediction somewhere down there.
So let's push that person prediction up; let's push this horse prediction up; let's push those things down. Basically, the decoding machinery sits on top of all of those predictions and decides what to say about the image at the end of the day. Decoding is not a new thing in [inaudible] recognition; we always do decoding. Non-maximum suppression is one way of decoding: it says that if this bounding box overlaps that bounding box, keep the best one and ignore the rest. The way that we phrase our decoding problem is this: we have bounding boxes, say from three different categories, red, blue, and yellow, and I run all of those detectors over the image. I phrase my decoding problem as assigning zeros and ones to these bounding boxes, so that I basically mimic my training set. Or, if you don't like making hard decisions at the beginning, you can phrase it as increasing the confidence of some of them and decreasing the confidence of the others. If I ask you to model this problem, what you would typically say is that it is basically a unary-plus-binary approach: the unary term talks about the appearance of the bounding box, and the binary term talks about the relationships. That's the typical way of doing this. But there are problems for which we really don't need to do this hard inference in order to do the task we want. Sometimes, if you are smart about representations, you can avoid solving the hard problem. So what we do here is ignore the binary term that talks about this being beside this, because we cannot model it correctly; you end up with [inaudible] differences and then the results are not going to be as good. Instead, we model these [inaudible] relationships in the [inaudible] representation of the unaries. As a result we have a simple inference, which is basically a unary term; we ignore the binary term and put all of the binary information into the unary term. How? Let's assume I want to represent the thick box over here. What I do is build a long feature vector that has very crude spatial bins: above, to the sides, and below. For each of those bins I report the maximum response of every category and put it into the feature. Then if I show this to my learner, at the end of the day my learner has an idea about the local context: if this box corresponds to a person riding a horse, then there should be a strong response for a person in the above bin, a strong response for a horse in the middle bin, and maybe a fence in the bottom bin. So by designing this feature I avoid having to solve the hard inference and turn the problem into a simple inference. The problem formulation is that I am going to predict an H, which is the assignment of ones and zeros to these bounding boxes, and I phrase it again as a structured learning problem: I am looking for the weights W on my features so that the ground-truth hypothesis scores highest compared to the other hypotheses. And then you have a [inaudible] which actually deals with this maximization problem.
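A minimal sketch of that context feature and unary scoring; the categories, boxes, and weights below are made up, and in the real system the weights come from the structured learner rather than being hand-set.

```python
import numpy as np

CATEGORIES = ["person", "horse", "person_riding_horse"]   # toy vocabulary

def context_feature(box, other_boxes):
    """Feature for one candidate box: its own per-category scores (the unary
    appearance part) plus, for three crude spatial bins (above, beside,
    below), the max score of every category among the other boxes."""
    x, y, bw, bh, scores = box
    feat = list(scores)
    bins = (lambda oy: oy < y,                # above the candidate
            lambda oy: y <= oy <= y + bh,     # beside / middle
            lambda oy: oy > y + bh)           # below
    for in_bin in bins:
        bin_max = np.zeros(len(CATEGORIES))
        for _ox, oy, _ow, _oh, oscores in other_boxes:
            if in_bin(oy):
                bin_max = np.maximum(bin_max, oscores)
        feat.extend(bin_max)
    return np.array(feat)

def decode(detections, w, threshold=0.5):
    """Keep (1) or drop (0) each box with a linear score over its context
    feature; w is a hand-set stand-in for learned weights."""
    keep = []
    for b in detections:
        others = [d for d in detections if d is not b]
        keep.append(int(w @ context_feature(b, others) > threshold))
    return keep

# Toy detections: (x, y, width, height, per-category scores).
dets = [(100,  50, 40,  80, np.array([0.6, 0.0, 0.0])),   # person on top
        ( 90, 120, 80,  60, np.array([0.0, 0.3, 0.0])),   # weak horse below
        ( 90,  40, 90, 140, np.array([0.0, 0.0, 0.9]))]   # the whole phrase
# Trust a box's own score fully, and its context half as much.
w = np.concatenate([np.ones(len(CATEGORIES)), np.full(3 * len(CATEGORIES), 0.5)])
print(decode(dets, w))   # -> [1, 1, 1]; alone, the weak horse (0.3) would miss
                         #    the 0.5 threshold, but the phrase context keeps it
```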
Let's see some of the decoding results. There is a person riding a bicycle over there and there are three wrong person detections. The person-riding-a-bicycle detection actually pushed the bicycle up and pushed all of the wrong person detections down. There is a dog-lying-on-a-sofa detection here which is very confident, so it got rid of all those person detections and pushed the sofa prediction up. Why? Because if there is a dog lying on a sofa, there should be a sofa nearby in the beside bin, so you just look for the best sofa prediction that is below threshold and bring it up. If you like numbers: one thing to compare is how well you can predict objects using phrases. Phrasal recognition is not only good by itself; it is also good at boosting object recognition. These are the state-of-the-art detection numbers on the eight object categories that we are dealing with, and these are our final results after decoding. If you are in the recognition business, you know that going from this to this is a long way. If you look at the numbers in PASCAL, they are basically tiny improvements in AP, and some of these are huge improvements in AP. To see whether the phrases are helping or not, we run our decoding without the phrases, and as you can see the numbers go down, meaning that phrasal recognition actually helps object recognition. If you include phrases, which are reliable and very easy to detect, in your recognition machinery, in your recognition spectrum, you can boost your recognition results. And it also matters how you decode. If you do decoding the hard way, with the unary plus binary terms and then a greedy search, you are not going to do as well. These are the results with that kind of modeling, with and without phrases. What this shows is that it matters how you decode: if you are careful about your problem, some problems do not necessarily need a [inaudible] thing, but you do have to model the relationships, and being careful about those matters. To conclude: I talked about issues with recognition. I personally believe that there are serious problems with the way we think about recognition, and I think we have to rethink recognition. The attribute story and the visual phrases are different ways of rethinking recognition. I believe that we have very powerful machinery, but we have to be careful about how to use it. Take the attribute story: we didn't build new machinery; we just used the classifiers that we had before. Or the phrase story: we didn't build any new detector for phrases; we just used the powerful detectors that we have. But you have to be careful about how you use them, and sometimes, like in the decoding story, you have to build new machinery to do your thing. Basically, the theme of my research is that representation is the key: if you get the representation right, the rest will follow. That is the gist of all of the things that I have done. In terms of what I plan to do next, my short-term goals are these. The first is that I would like to answer the question: what is the right quantum of recognition? The day we start doing recognition, we always start by saying, okay, there are cats and dogs and bicycles and cars, and we want to go and detect them.
And then we are building our detectors to produce the best results on those fixed categories. Why? Why should we only deal with basic-level categories? Why not include phrases? Maybe there is something else in between. There is a spectrum: there are objects, there are phrases, there are scenes, and we believe that phrases sit in between objects and scenes. There might be other things as well. So I am trying to find a principled way of figuring out what the right quantum of recognition is, and I don't believe that the categories we have right now are the right way of doing recognition. The second thing that I want to do, which is related to this, is to break down this notion of categories. Say we want to build a dog detector. The way we do that is we have examples of dogs and then we have the models -- the best one is basically Pedro's part-based model -- and we push that model to detect dogs in different poses and different aspects: a sitting dog, a standing dog, a jumping dog. And then we end up with models that we cannot manage at the end of the day. So what I want to do is break down these strong walls between categories. Maybe I can build a detector that covers half of the dogs and half of the cats and half of the horses but does so very reliably. Why not use that, and then build a model on top of it that can use that information? So this topic tries to attack the convention that we have in recognition, that we want to recognize objects -- these are cats, these are dogs, these are bicycles, let's recognize them. I am going to try to break that. The third thing that I want to do in the near future is to build a phrase table. What we have right now is machinery to get the phrases right, and I believe we also have machinery to get regions out of images -- and those regions are not necessarily superpixels or low-level segmentations; they are regions that likely correspond to objects. So we can couple that with our phrase recognition and build a big table that has regions of images on one side and phrases on the other side: I know that a person riding a horse corresponds to this, this, and this region, and not to the rest of the regions. And very similar to machine translation, we can have a decoder that looks at this big phrase table, with regions on one side and phrases on the other, and produces an actual description for the image. Now, our decoding is quite different from machine translation decoding. Why? Because we have to take care of inclusion and exclusion -- two phrases cannot claim the same region -- so our decoding has to be slightly different from machine translation decoding. But I believe that if I build this phrase table, then we can produce descriptions of images much richer than a list of objects, and more accurate than the way we do recognition now. For future directions, these are sort of my long-term goals. I plan to be able to reason across the semantic spectrum: we have parts, we have attributes, we have [inaudible], we have objects, we have phrases, we have scenes, and there might be other things in the middle. I would really like to be able to reason across this semantic spectrum, to be able to infer all of those things, and this is not necessarily a top-down or a bottom-up approach.
So maybe a strong part detection can help pose estimation, can help object recognition, whereas a strong phrase detection can help part recognition as well. The other topic is coupling in the geometry. With the advent of all of those nice methods, like the [inaudible] that you guys are probably familiar with, where you can fit boxes to rooms, you could couple that geometry with a lot of different things: you can couple geometry with material recognition, with object recognition, with [inaudible], and all of those things, and put them in a nice unified framework. The last thing that I am going to talk about is knowledge selection. As I said, we people are really good at selecting what to talk about and what not to talk about. If you want to describe this picture, you probably don't talk about the whiteboard back there, and you don't talk about the letters over here. We are really good at knowing what to talk about, and one of my goals is to learn to select knowledge the way that we do. Why? Because that would simplify many applications, for example image search, because then our reports are aligned with what people think, and that makes search much easier. The things that I didn't talk about, if you're interested, we can discuss in meetings: I have work on knowledge transfer and split representations based on comparative representations; there is some work on scene discovery and multitask learning; there is work that joins multitask learning with manifold learning, if you're interested; sign language and human activity recognition; and a little bit of work on using machine learning approaches with wide-spectrum measurements for network security. And with that I close. Thank you. [applause] >>: I have one question. It seems that with the phrases, the main point of your message is that the data terms, the appearances of the objects, are not independent of the relationships, so we need to take the relationships into account when we [inaudible] model appearances; is that right? >> Ali Farhadi: There are a couple of things you can take as a take-home message. One is that the appearance of objects changes dramatically when they participate in relationships, and if you ignore that, it hurts a lot; I showed you the pictures. The other is that I want to break down this tradition of having objects as the only content of recognition. There might be other things which are extremely useful and extremely reliable to detect, and we should introduce them into the vocabulary of recognition. And the second one is actually more important than the first, which is that the selection of the quanta of recognition at this point is almost arbitrary, and I believe there should be a principled way of doing it. One principled way is to think about phrases and relationships; maybe there are other principled ways, maybe there are other quanta, beyond phrases and scenes, that we don't know about yet and will appear later. >> Larry Zitnick: Thank the speaker one more time. [applause]