>> Larry Zitnick: Okay. So it's my pleasure to welcome Abhinav Gupta here today for his candidate talk. He's done a lot of really interesting work in the area of object recognition and scene understanding. He's looked at 3D modeling, physical modeling, functional modeling of scenes and different places. So you can see that's what he'll be talking about today. He got his PhD from the University of Maryland, and he's currently a post-doc at CMU. And he recently won the runner-up best paper award at ECCV. >> Abhinav Gupta: I guess I got lucky there. I will go here. Thank you, Larry, for inviting me. I'm glad to be here at Microsoft talking about my work. All right. So I work in the field of image understanding. So let us first start with an image. So let us consider this image. When we humans see this image, we can tell so many things about it. For example, it is a room. It is probably a living room because I can see couches and a table here. There are a lot of people in the room, so probably there's some kind of a party going on. Then I can see a cake, some balloons, so probably it's a birthday party. And this list can go on and on. We humans are really good at understanding images, because vision happens to be one of the most important functions of the human brain. In fact, more than 50 percent of the human brain is dedicated to solving the task of vision. Why not? We use vision for all our daily tasks. We use vision to interact with the world. We use vision to navigate, perform actions in the scene, recognize people and things, and even to predict what is going to happen next. And because we are so good at vision, we often do not realize how hard it is for a computer to just look at this array of numbers and understand things which are just so obvious to us. In fact, there is a legend about how the field of computer vision actually started. Marvin Minsky assigned computer vision as a summer project to one of his undergraduate students, Gerald Sussman. And the project outline was: spend the summer linking the camera to the computer and getting the computer to describe what it saw. So, in fact, he thought that vision could be solved in three months as a summer project. And it took us almost a decade to realize how hard the computer vision problem is. But still, at the start of computer vision we as a field were very ambitious. We wanted to solve the whole problem. We wanted to solve the complete scene understanding problem. We wanted to extract everything from the image. We wanted to extract semantic properties, geometric properties, spatial layout, just everything we can think of. And we tried to build some really grand vision systems in the '70s and so on. For example, some of these systems are the VISIONS system from Hanson and Riseman, the ACRONYM system from Brooks and Binford, and the system from Ohta and Kanade. So here is an example from the VISIONS system from Hanson and Riseman. And remember, this is 1977, when you could only fit one image into the RAM of a computer. And at that point of time, given an image, they wanted to extract the semantic properties: that there is a tree, a house, a road. They wanted to extract the spatial layout: that there's a tree in front of the house, there's a tree behind the house. They wanted to extract the occlusion properties: that this tree is occluding the house and the house is occluding the tree, and so on. So we learned a lot of great lessons by building these grand vision systems. But while these systems were good on a few images, they failed to generalize.
And the reason is that the problem is just too hard. And we did not have the computational tools in those years. It's very hard to model a complete representation of the world. And that brought us to the modern era in computer vision. So instead of trying to solve this global scene understanding problem, we now divided the problem into local parts, and we wanted to solve these local pattern-matching problems. So instead of trying to understand this whole image of a room, we now wanted to understand the bits and pieces of it. So maybe the couch, maybe the table and so on. So the standard way is: you learn a classifier which separates couch from the rest of the world, you then take a window, slide it across the image, extract patterns, and then classify these patterns as couch or not couch. And during the last two decades we actually have made a lot of great progress in this direction. We now have face detection in our cameras. We identify landmarks using Google Goggles. But while we have made a lot of great progress, for example, in understanding classes such as faces, pedestrians and cars, we are still very bad at other classes, for example, chairs, tables and birds. We have around 5 to 10 percent accuracy, in that range. But wait a minute. Even if I solve this problem, even if I completely solve this detection problem, let us see what kind of an understanding a system would have. So given an image, I would just have the list of detections in the image. So this is the ideal case: the detection problem completely solved. So we'll have in this case the two couches, the table and the lamp. And this is what a system sees and understands. Just some boxes, and some words associated with those boxes. Now, this kind of understanding is great for answering a retrieval question such as: is there a couch in the image? But how often are we interested in that kind of question? We are more interested in questions such as where is the couch in the image, and how do I find the path to that couch? And this understanding cannot answer those questions. In fact, this understanding is so superficial that it cannot answer where I can sit in this image. Even though I know there is a box corresponding to a couch, and somewhere here I can sit, I just don't know where I can sit. Is this area the place where I can sit? Or is this area the place where I can sit? So just for reference, this is the couch. So this is the correct answer. In fact, we cannot even answer the most basic question necessary for survival: where in the image can I walk? Even though I know there is ground, is this area good to walk on, or is this area good to walk on? So it is clear that we need to somehow go beyond these local bounding box kinds of detections. We need to bring back the flavor of global scene understanding which our pioneers had in mind. And in this talk, I'm going to argue that we can bring back the flavor of global scene understanding using relationships. So it's not just a coincidence that this table and the ground occur together in the image. In fact, they share a very strong relationship; that is, the table is supported by the ground and the table creates a free-space obstruction on the ground. So a human cannot stand at the location where the table is. And if somehow I can capture this strong relationship and extract this information, I can now answer the question: where in the image can I walk?
So a human cannot walk at this location because of the free-space obstruction, whereas this location is perfectly fine to walk because there's enough free space for a human to walk. Let us look at another question. How do I get to the lamp in the image? And let us look at the relationship between the couch on the right and the table here. Again, the couch and the table are not just some isolated detections. They share a very strong relationship: the table is in front of the couch and the table is separated from the couch. And if I can capture this rich information, I can now answer the question: how do I get to the lamp? So I can go between the table and the couch, because they are separated, and reach the lamp. So in this talk I'm going to argue that we can bring back the flavor of global scene understanding by extracting this rich information from relationships. And when I talk about relationships, I'm going to talk about them in a very broad sense. For example, I'm going to talk about relationships between the structural elements of the scene, the walls and the floor. I am going to talk about relationships between the objects in the scene, for example spatial relationships such as the table is in front of the lamp and the couch, or the left couch is attached to the wall. I am going to talk about support relationships, for example that the couch and the table are supported by the ground. We can go further and have relationships between a human and an object in the scene; that is, how do humans use those objects in the scene? So I can have a relationship between a human and a couch: that humans use couches to sit. And finally I can have relationships in time. So let us suppose there are three humans sitting on the couch and there's a fourth human who wants to sit on the couch. Well, the only way this human can sit is if one of the three persons decides to get up and create a space for him. So these kinds of prerequisites for actions to happen, we can always have them as relationships. Now specifically in this talk I'm going to talk about three kinds of relationships. I'm first going to start with physical relationships, and I'll show how we can build a physical representation from these physical relationships. Then I'm going to briefly describe functional relationships, which are relationships between humans and the physical representation of the scene. And finally I'm going to just touch upon causal relationships, which are relationships in time. So let me start with physical relationships. So given an image, our goal is to obtain a 3D understanding of this image. The way I'm going to do this is by building relationships between the structure, for example the walls and the floor, and the elements of the scene, for example the occupied volumes, the support surfaces, and the objects themselves. Very recently there has been a renewed interest in the field of 3D understanding of images. For example, the work by Hoiem et al. looks at patches and uses local classifiers to obtain surface orientations. And if the image is simple enough you can group these patches and create 3D models. But because these approaches still use local classifiers, they often lead to interpretations which are physically not possible or highly unlikely. For example, in this image the local classifier predicts that the ground is between the vertical regions, and while such an interpretation is possible, we can all see how unlikely it is.
So as the next step, what people thought was: we have to include some relationships and build a global understanding. So they tried including spatial relationships in the image plane. But none of these approaches showed any major improvements. And the problem is much more basic in nature. The problem is with the representation itself. All these approaches assume a planar representation of the world; that is, the scene is made up of planar patches standing on top of the ground. And this representation is so permissive that it does not have enough globally meaningful constraints or relationships to restrict the locations of these planes. For example, this image divides up into multiple planar patches standing on top of the ground, and the representation is just fine with it. It has no problem with so many patches in the image. So what we need is a representation that can bring in some more globally meaningful relationships that can constrain the locations of these planes. One such possible representation is a volumetric representation of the scene. So instead of assuming that the world is made up of planar patches, we can now assume that the world is made up of volumes. And once we have volumes, we have so many relationships we can harness. We now have geometric relationships, for example the finite-volume relationship, which says every object must occupy a finite, non-zero volume. We have a spatial exclusion relationship, which says every object must occupy a mutually exclusive volume. Once I am talking about volumes I can even talk about masses, and now I have mechanical relationships: that volumes should be configured such that they are physically stable. These volumes should not topple over. Now, interestingly, these relationships are the same relationships which our pioneers wanted to include in the famous Blocks World project at MIT. And that was all during the early '60s. So here is the kind of understanding that our system would have. Given an image, we would break it into regions, where each region corresponds to a physical object in the scene. For each region we associate a volumetric class, which gives us some idea of what kind of volume the object occupies. For example, in the case of this building I can say there's a left face visible, there's a right face visible, and then there's a volume behind these two faces. In the case of this building I can say that the right face of the building is visible, and the left face is either occluded or out of the image frame. Not only can we estimate some kind of volume, but we can also estimate some kind of density associated with these regions. For example, the building is a high density region whereas the tree is a medium density region. We can even extract spatial relationships. For example, this building is in front of the tree and the tree is in front of the building on the right. And finally, we can have support relationships: that the building and the trees are supported by the ground. And this understanding of the system is called a 3D parse graph, and we are automatically generating it using our program. So let me first talk about representation. So our goal is to break an image into regions and estimate some kind of volume associated with each region. Now, one possible way is to do precise metric 3D reconstruction. So what we can do is we can try to fit a cube to this region and then do these exact quantitative measurements: that this cube is 10 meters wide, 15 meters high and so on.
But doing precise metric 3D reconstruction from a single image is an extremely difficult problem. So we are not going to try to do that. I will repeat again: we are not going to try to do any precise metric 3D reconstruction. We are just going to associate each region with one of the eight volumetric classes defined in the catalog. For example, the building region is a left-right class, which says I can see a left face of the building, I can see a right face of the building, and the volume is hidden behind these two faces. This building is a left-occluded class, which says I can see a right face of the building and the left face is either occluded or out of the image frame. And for the tree I can associate a porous class, which says that this tree occupies some porous volume in the scene. Now, not only do we want to associate or estimate the volume of a region, but we also want to estimate some notion of mass or weight associated with the region. But visually estimating mass or weight is an extremely difficult problem. So it's impossible for me to say, by looking just at this image, whether this building weighs a thousand tons, 2,000 tons or so on. But if you look closely, what we can do is map texture to some notion of qualitative density. For example, this is a brick texture, so this should be a high density region, whereas this is a tree texture, and this should be a light density region. So again, what we are going to do is we are going to associate regions with three qualitative density classes: light density, medium density and heavy density. So we will build a density classifier using appearance and location features. These are the features. And we use a decision tree classifier. So here are some of the examples of the density classifier. So as you can see, it classifies buildings as a high density material, trees as light density material and humans as a medium density material. Now, quantitatively, the performance of our approach is 69.3 percent.
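Just to make that step concrete, here is a minimal sketch of such a density classifier; the specific features and field names are my own illustrative assumptions, not the actual features used in the system.

```python
# Minimal sketch of a qualitative density classifier (assumed features, not the real system).
from sklearn.tree import DecisionTreeClassifier
import numpy as np

DENSITY_CLASSES = ["light", "medium", "heavy"]

def region_features(region):
    # Simple appearance and location features per region -- hypothetical choices.
    return np.array([
        region["mean_r"], region["mean_g"], region["mean_b"],   # appearance: mean color
        region["texture_energy"],                                # appearance: texture response
        region["centroid_y"] / region["image_height"],           # location: normalized height in image
    ])

def train_density_classifier(labeled_regions):
    X = np.stack([region_features(r) for r in labeled_regions])
    y = [r["density_label"] for r in labeled_regions]            # one of DENSITY_CLASSES
    clf = DecisionTreeClassifier(max_depth=6)
    clf.fit(X, y)
    return clf

def predict_density(clf, region):
    return clf.predict(region_features(region).reshape(1, -1))[0]
```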
Okay. So our goal is: given an input image like this and a set of blocks, which are segments with their associated volumes and densities, we want to somehow configure these blocks such that the configuration is physically stable and looks exactly like the input image. Now, as you can see, this is a hard combinatorial optimization problem. Our cost function is neither submodular, nor do we have just pairwise terms in the cost function; in fact, we have much higher-order terms. So instead of applying the standard optimization approaches, we used an iterative approach, and our experiments show empirically that it seems to work pretty well. And the way I'm going to motivate the iterative approach is the way children play with blocks during their childhood. So remember, we had a target configuration we wanted to reach, and we tried blocks one by one so that each block is stable and helps us move towards the target configuration. We are going to do exactly the same thing. We are going to try blocks one by one, see which block is physically stable, and which helps us move toward the target configuration. So given an image, we'll first estimate the ground on which we are going to stack the blocks. We are going to estimate the bag of blocks, or the set of blocks, which we are going to place. We also estimate the local surface layouts and the densities. And now we are ready to place the blocks on the ground. So at each round we are going to try to place blocks one by one. So at round one, let us try to place this block. Now, if you look closely, this block has no physical contact with the ground, and therefore it is not stable and would fall. Let us try another block. In the case of this block, it has some contact with the ground, so we are good to start. Let's see if one of the volumetric classes fits it. So we try the volumetric classes one by one, and in this case the left-right class fits well, because there's a left face and there's a right face. Now, let us see if this block is physically stable. If you notice closely, this part of the block is classified as a high density material at the top, and it has no support at the bottom. So this block is physically unstable and would topple over. Now, let us try a third block. Now, in the case of the tree we have a nice geometric class, which is porous, and it is physically stable. So we select this block in this round and we move to round two. At round two I am going to try to place all the blocks again one by one. So let me try this block now. Well, it still has no support here, so it would not be physically stable. But let us try this block now. Now, in the case of this block, we have a new hypothesis: that the tree occludes the base of the building. So now the block becomes physically stable. So now we select this block, assign the left-right class to it, and extract the occlusion edges. Now, during this whole iterative process I kept on selecting and rejecting the blocks. Well, we have a cost function to achieve that. Our cost function has five terms. The first two terms are the geometric cost. The first one measures the agreement between the input surface layouts and the assigned volumetric class. For example, the front-right class is a bad assignment because it doesn't match the input surface layouts, whereas the left-right class is a good assignment because it matches the input surface layouts. The second one measures the agreement between the ground and sky contact points. So if you see a right-facing face, then the ground and sky contact points should intersect at the horizon. The next two terms measure the physical stability of the system. The third term measures the internal stability and rejects blocks which have a light bottom and a heavy top. The fourth term assumes that we are looking at a static and physically stable world. So we are not looking at images of falling buildings and so on. And under this assumption it checks whether the blocks are externally stable or not. So the way we do it is we first fit the volumetric class to the block, which means finding the axis of rotation and the fold edges where the planar surface orientation changes. And once I know the fold edges and the axis of rotation, I can compute the torque due to the weight of the patch and the torque due to the reaction from the ground. And if the torques balance out, I can say the block is physically stable. Now, this term also rejects configurations which lead to a heavy block at the top and a light block at the bottom. So for example, in this image there are two possible configurations: the building is on top of the slab, or the building is behind the slab. Now, since the building is a high density material and the slab is a light or medium density material, the first configuration is not possible, because a high density material cannot be on top of a low density material.
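To make the stability terms a little more concrete, here is a crude two-dimensional stand-in; it is my own simplification, not the formulation from the paper: external stability is approximated as the center of mass lying over the supported footprint, which is what the torque balance amounts to in 2D, and internal stability rejects light-bottom, heavy-top blocks.

```python
# Crude 2D stand-in for the stability checks described in the talk (not the actual formulation).
DENSITY_WEIGHT = {"light": 1.0, "medium": 2.0, "heavy": 3.0}

def center_of_mass_x(cells):
    """cells: list of (x, y, density_class) pixels belonging to a block."""
    total = sum(DENSITY_WEIGHT[d] for _, _, d in cells)
    return sum(x * DENSITY_WEIGHT[d] for x, _, d in cells) / total

def externally_stable(cells, support_xs):
    """Treat a block as stable if its center of mass lies over its supported
    footprint, so the gravity torque and the ground reaction can balance."""
    if not support_xs:
        return False  # no contact with the ground or with a supporting block
    com = center_of_mass_x(cells)
    return min(support_xs) <= com <= max(support_xs)

def internally_stable(bottom_density, top_density):
    """Reject light-bottom / heavy-top configurations."""
    return DENSITY_WEIGHT[bottom_density] >= DENSITY_WEIGHT[top_density]
```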
Now, the fifth and final term measures the global agreement between pairwise depth constraints. And the way you can obtain these pairwise depth constraints is using depth rules. For example, if you have a 2D projection like this, then under the convexity assumption it means that the yellow block is in front of the grey block. Now that I have described the five terms of my cost function and my iterative approach, we are ready to see some results. So given an image like this, our approach can generate a 3D parse graph for the image. So it can break the image into regions corresponding to the physical objects. It can say that these trees are the porous class. The building is a left-right class, this building here. The chimney is a left-right class. It can extract the densities: for example, the trees are a medium density material whereas the building is a high density material. It can extract the spatial layout: for example, the tree is in front of the building and so on. Now, the interesting thing to note here is that our approach can combine these two faces of the building. None of the current segmentation approaches can actually achieve that. And the reason we can do this is because we use strong global relationships. So if you use this face of the building as one segment, it would lead to a zero volume block. If you use this face of the building as one segment, it would lead to a zero volume block. The only way you can have a finite volume is if you combine the two faces and they create a volume. >>: How do you actually learn your parameters for your weight function? >> Abhinav Gupta: That's a good question. >>: You don't have to go into detail but -- >> Abhinav Gupta: Yeah. >>: [inaudible] high level -- >> Abhinav Gupta: So in the current system what we did was I locally, manually tuned the parameters for 10 images -- not 10, I think 20 images or so -- and we ran it on the rest of the 300 images. But the reason I guess we couldn't do the cross-validation kind of thing is because we didn't have enough images. So the dataset comes with 300 images, and 300 images are probably not enough to do cross-validation. And out of these 300 we had already used a hundred for some other training, for example training the density classifier, so we cannot use them anymore. So I guess the best way would be to do the cross-validation kind of thing. >>: [inaudible]. >> Abhinav Gupta: I mean, with the parameters of this cost function, what we made sure was that we are not overfitting, in the sense that -- I think, if I remember well, the parameters which came out were like all ones or something like that. To make sure that it's a simple cost function; it's not like I'm putting in 1.25 or something like that. Now, here's another example where physical relationships can actually help to get a better understanding of the scene. So using physical relationships, our approach rejects a block like this, because it would topple over, and divides this block into two blocks, where the first block is assigned the front-right class, which says there's a front face and a right face here, and this is another block where there's only a frontal face. We can even extract the spatial layout properly here as well. For example, this tree is this tree here and is in front of the building here. And this tree is this one, and it is in front of the second part of the building.
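Putting the pieces together, here is a hypothetical sketch of how the five cost terms and the greedy, round-by-round placement described earlier might be combined; the term functions are placeholders standing in for the real geometric, stability, and depth measurements, and the roughly equal hand-tuned weights mirror the answer to the question above.

```python
# Hypothetical sketch of combining the five cost terms with the greedy placement loop.
# The term callables are placeholders; only the overall structure is illustrated.

def block_cost(block, scene, terms, weights):
    """terms: five callables (block, scene) -> non-negative penalty, in the order
    surface-layout agreement, ground/sky contact, internal stability,
    external stability, pairwise depth agreement."""
    return sum(w * t(block, scene) for w, t in zip(weights, terms))

def greedy_placement(scene, candidates, terms, weights=(1, 1, 1, 1, 1), reject_above=1e6):
    """At each round, place the lowest-cost block that is still physically plausible."""
    placed = []
    remaining = list(candidates)
    while remaining:
        best = min(remaining, key=lambda b: block_cost(b, scene, terms, weights))
        if block_cost(best, scene, terms, weights) >= reject_above:
            break                       # every remaining block is physically implausible
        placed.append(best)
        remaining.remove(best)
        scene = scene.add(best)         # hypothetical scene update (new occlusion hypotheses, etc.)
    return placed
```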
And one of the interesting things about our approach is that if we make more assumptions, we can even do 3D reconstruction. So let us say these volumetric classes actually come from cuboids, so these classes would be something like this. We can now fit cuboids to the regions and extract occlusion boundaries, and we can do a 3D reconstruction of our toy blocks world. For example, in this image the big pink block corresponds to the building. The small pink block corresponds to the chimney. This green block corresponds to the tree here. This green block corresponds to this tree here, and so on. Again, none of the current vision approaches can generate this kind of 3D volumetric reconstruction from a single image. So we applied similar concepts of volumetric relationships to the problem of indoor scene understanding as well. Now, this work is part of the thesis work of David Lee, who is a student at CMU, and I'm helping co-advise him. So the problem here is: given an input image like this, we want to extract where the walls and floor are. And again, if you use simple local classifiers, you often get interpretations which are physically impossible. For example, in this case, the wall is predicted to be in front of the stove and the table. And this is physically not possible. But if somehow we can subtend the volume at the stove and the table, we can now easily argue that the wall should be behind these volumes and not in front of the volumes, because the stove and the table are inside the room. So once we do this kind of reasoning, we get a much better wall estimate. Here's another example. Using just simple local classifiers, we predict that the wall is in front of the bed. But if you can subtend the volume at the bed, you can easily argue that this wall should be behind the bed and not in front of the bed, and you get a much better wall estimate. So that was about physical relationships. I will now talk about a different kind of relationship, which is a relationship between the human body and the physical representation of the scene. So in the first part of the presentation, I talked about how we can get a physical representation of the scene, how we can extract the occupied volumes, the free space, the support surfaces and so on. But often when we look at images we're interested in other properties such as: what can I push in this image, what can I throw, where can I sit. Now, if you look closely, these properties are subjective. What is pushable for an elephant is not necessarily pushable for me. And in the case of image understanding, the subject is us, the humans. We want to understand images exactly the way humans do. Therefore, in this part of the presentation, I'll introduce a new paradigm in scene understanding, and we call it human centric scene understanding. Instead of trying to understand the scene in terms of identities of objects or the 3D representation, we are now going to understand the scene in terms of what a human can do in this scene. For example, in this image, a human can sit, as shown in the green skeleton. The blue skeleton shows how a human can sit with straight legs. And the red skeleton shows how a human can stand and touch a six-foot-high location on the wall. So why am I so excited about this human centric scene understanding? Well, it brings in a fresh new perspective on the scene understanding problem.
Instead of trying to understand the scene in terms of identities of objects -- there's a bed, there's a table, paintings and so on -- we now want to understand the scene in terms of what a human can do here. So he can sit on the table or on the ground. If he has something in his hand, he could put that thing on top of the shelf or on top of the table. If he wants to move, he can move that painting. If he wants to store something, he can store it inside the cavity of the shelf. And so on. So we now have a task-based understanding of the scene, and we can evaluate how good our understanding is based on how many tasks a human can achieve. Furthermore, it provides a prior for the object recognition problem itself. For example, if I want to search for cups in this image, one possible way is to look for cups in all possible locations in the image. The other, smarter way would be to only look for cups in the reachable areas in the image, because it is highly unlikely that one would put a cup on top of a shelf. We can even go further and have unconventional categories. For example, we can now have reachable area detectors, hideable area detectors, storable area detectors, and you cannot have these kinds of categories using conventional pattern recognition approaches. And finally, this brings in subjective interpretation. So the way a child looks at a scene is completely different from the way an adult human looks at the same scene. A child can sit with straight legs on the couch itself, whereas an adult human needs the support of a table to sit with straight legs. Similarly, to touch a six-foot-high location, a child needs to climb on the couch and only then can he touch the six-foot-high location, whereas an adult human can just stand and touch a six-foot-high location. Now, interestingly, this idea of human centric scene understanding is not new in the field of psychology. In the early '50s, James Gibson came up with this idea of affordances, which are the opportunities for interaction afforded by the scene. So his claim was that when humans look at an image, they do not understand identities but they infer the functions of the objects. For example, if you look at a knee-high flat surface, you would infer that you can sit on it. But Gibson was very strong on affordances. He believed fruits afford eating, water affords drinking and TVs afford watching. We are not going to have as strong a notion of affordances. We are only going to consider physical affordances, that is, actions which involve physical interaction with the scene. The other problem with the affordances idea has been their association with semantic categories. The belief was that as soon as you look at an image you infer where all you can sit in the image. But if you consider sitting, there are multiple ways a human can sit. I can sit with a back support, I can sit without a back support, I can sit in some kind of crouching position, I can sit with straight legs. There are just so many ways a human can sit. So a single classifier which should find all the sittable surfaces in the room is probably going to be too hard a task. So instead of considering this semantic vocabulary of affordances, we are going to consider a data-driven vocabulary of affordances. That is, instead of posing the problem as finding the sittable surfaces in the room, we want to pose the problem as: where can this human pose fit in the room? And the way we get the poses is using motion capture data.
In motion capture, the human wears a black kind of suit with some markers, as you can see here, and then he interacts with the scene in this lab environment with Vicon cameras looking at him, and you can extract the poses using those cameras. But this brings in a whole new problem: the space of poses is huge. The space of poses is exponential in the degrees of freedom of the human body. So what we are going to do is we are again going to fall back on a qualitative representation of a pose. So we are going to represent a pose in terms of what volume it occupies in the scene and what support surfaces it requires. So, for example, sitting with back support and straight legs would mean a volume like this red block, and it would need support at the legs, at the pelvic joint, and at the back. Here's another example: just sitting would mean a red volume like this, and you would need support at the pelvic joint and at the back. So now the problem is: given this kind of physical representation of a scene, where we know the occupied volumes and so on, and given this block representation of a pose, we need to find where this human block would fit. Does it resemble anything to you? This is exactly the Tetris problem. You remember, in Tetris there's a block which falls and you need to find where this block would fit. It's exactly the same problem in 3D. You have this scene with occupied blocks, and a human block is going to fall, and you need to find where this human block would fit. So we are going to use the exact same constraints as we use in the Tetris problem. The first constraint is the support surface constraint, which says that if you need a support surface somewhere, it should be present in the scene. So, for example, for sitting you need a support surface at the pelvic joint, and therefore this part of the scene should be occupied in the image. So, for example, the couch provides a support for the human sitting. The second constraint is a free space constraint, which says that wherever the human volume goes in the scene, that location should not be occupied by any object in the scene. So, for example, if a human has to fit in this yellow location, this location should be completely empty. So let me show you some results. Given an input image like this, our approach can predict where a human can sit in this image. The blue shows the location of the pelvic joint. We can predict where a human can sit with straight legs, so the cyan shows the back support and the blue shows the pelvic joint support. We can predict where a human can lie down in the image. And the magenta shows the mask where a human can lie down, or where the joints go if the human is lying down. >>: [inaudible]. >> Abhinav Gupta: Yes. So in this case, we didn't find the chair as a physical surface in the physical representation. And we can predict where a human can reach a six-foot-high location. And if you change the height of the reach, you get multiple possible reach locations. Here's another example. Now, again in this case, magenta shows where a human can lie down. This shows where a human can sit with straight legs, so cyan shows that you can get the back support from the bed and you can get the pelvic support from the ground. And the blue shows where a human can sit without a back support. So it would be basically at the corners of the bed.
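To make the two constraints concrete, here is a minimal sketch, assuming a boolean voxel occupancy grid for the scene and a pose described by the voxels its volume occupies plus the voxels that must provide support; these data structures are my own illustrative assumptions, not the system's actual representation.

```python
import numpy as np

def in_bounds(grid, x, y, z):
    return 0 <= x < grid.shape[0] and 0 <= y < grid.shape[1] and 0 <= z < grid.shape[2]

def pose_fits(occupancy, pose_volume, support_points, origin):
    """Tetris-style check for one candidate placement of a human 'pose block'.

    occupancy      : 3D boolean array, True where the scene is occupied
    pose_volume    : (dx, dy, dz) offsets the body would occupy
    support_points : offsets that must rest on something occupied
                     (e.g. under the pelvic joint, the back, the feet)
    """
    ox, oy, oz = origin
    # Free-space constraint: every voxel the body occupies must be empty.
    for dx, dy, dz in pose_volume:
        x, y, z = ox + dx, oy + dy, oz + dz
        if not in_bounds(occupancy, x, y, z) or occupancy[x, y, z]:
            return False
    # Support-surface constraint: each required support voxel must be occupied.
    for dx, dy, dz in support_points:
        x, y, z = ox + dx, oy + dy, oz + dz
        if not in_bounds(occupancy, x, y, z) or not occupancy[x, y, z]:
            return False
    return True

def valid_placements(occupancy, pose_volume, support_points):
    """Brute-force scan over all candidate origins in the grid."""
    return [(x, y, z)
            for x in range(occupancy.shape[0])
            for y in range(occupancy.shape[1])
            for z in range(occupancy.shape[2])
            if pose_fits(occupancy, pose_volume, support_points, (x, y, z))]
```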
But this is a very interesting result. In this case our approach predicts that a human can sit on top of a stove, on top of a table. Well, this is a perfectly correct result based on my approach. We were just looking for physical interactions, and humans really can sit on top of stoves. I tried that after getting this result. But how often do we see humans sitting on top of stoves? So apart from these physical requirements there's something called cultural context. How do humans use objects culturally? So apart from this physical layer, you need a statistical layer which captures how humans generally use objects. So I looked into this problem during my thesis work -- this was in 2008 -- when I looked into the statistical correlations between human motion and the objects in the scene, for example the cups, the flashlight, the phones and the spray bottle. The idea was very simple. If you look at these two objects, it is very hard to say just based on appearance which one is a drinking bottle and which one is a spray bottle. But if you look at these objects in the context of the whole scene, you can easily say this one is a spray bottle and this one is a drinking bottle. Similarly, if you just look at trajectories, it is impossible for me to say which one is answering a phone call and which one is drinking from a cup. But again, if you look at it in the context of the whole image, this one is answering the phone and this one is drinking from the cup. So the idea was that human actions can help in recognition of objects, and objects can help in recognition of human actions. So that was about modeling functional relationships. So in the first part I talked about how we can model physical relationships, and in the second part I talked about how we can model functional relationships. And now I'm going to talk about another kind of relationship, which is a relationship in time. These are called temporal causal relationships. Again, activities that happen in a scene are not isolated and independent. For example, in this scene, the car jumps a red light. The skater stops. And this car on the left stops. Now, these three activities are not isolated and independent of each other. In fact, they share a very strong relationship: the skater stops to avoid being hit by the car that jumped the red light. And we can use these causal relationships to develop stories of videos, where the storyline would be the set of actions that occur in the video and the relationships between those actions. For example, the story of this video would be: the skater stops to avoid being hit by the car that jumped the red light. But discovering or modeling these causal relationships is a very difficult problem. For example, in this scene I need to model the traffic laws of the world, I need to model how cars run on roads, I need to model how pedestrians cross the road, I need to model the cultural context that cars stop when pedestrians are crossing. There are so many things I need to model. So it's a very difficult problem. So here what we did was we considered a very restricted domain: videos of human activities in a very structured setting, for example sports videos such as baseball videos. Now, in this case, videos have a very strong story associated with them. For example, here the pitcher pitches the ball. The batter then hits it. He then runs towards the base.
In the meantime, the fielder runs to catch the ball. The story in the second video is much shorter and much simpler. The pitcher pitches the ball and the batter just misses it. Now, we would like to model the variation across storylines in the domain, and we would like to discover the causal relationships. So we would like to learn a common storyline model, or the space of storylines, for a given domain. For example, the storyline model for the baseball domain would be: the pitcher pitches the ball, then the batter can either miss it or he can hit it. If the batter hits the ball, he then runs towards the base. In the meantime, the fielder runs to catch the ball. He catches the ball, and then either the fielder can throw it or he can himself run with the ball towards the base. So our goal here is: given videos and some associated text or captions, we want to discover the causal relationships and extract the storyline model for the domain. Now, discovering causal relationships requires labeling actions in the video. For example, I need to know which action is pitching and which action is hitting to discover the relationship between pitching and hitting. However, labeling of actions itself requires me to know the rules of the game. I need to know the storyline model. So I need to know what happens before pitching or hitting, what pitching is -- I need to know these kinds of things to label what is pitching and what is hitting. So now we have a classical chicken and egg problem. I'm very briefly going to describe an iterative approach of how we handle this. Given the videos, we first extract the human tracks, and given the captions we extract the rough timelines of each video. We then combine the timelines and obtain a rough storyline model to begin with. This is just an initial model. It can have errors in it. And if you notice closely, it still does not have any visual grounding at this point of time. So we obtain the visual grounding via co-occurrence statistics between the visual features and the words. And now we have a storyline model to begin with. Using this storyline model, we assign actions in the video. Once we have the actions labeled, we can use these labeled actions to improve our storyline model. And we keep on doing these iterations until the storyline model doesn't change, or converges. And once we have a learned storyline model, we can use it to infer storylines of new videos. For example, in this test video we found the storyline was: the pitcher pitches the ball, the batter hits it, he then runs towards the base. The fielder runs to catch the ball. He catches the ball, throws the ball toward the base, and the fielder at the base catches the ball. And the assignment of actions to the tracks is shown in color. So, for example, pitching is blue, hitting is cyan, and catching would be orange. And the throw is in the red track, and so on. And because we are learning the storyline model from text, we can even generate an automatic text caption for the video. So this is generated automatically by our approach: the pitcher pitches the ball before the batter is going to hit, and the batter will hit, and so on. And the currently active part of the text -- what is happening right now at this point of time in the video -- is shown in red.
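As a rough illustration, here is a minimal sketch of that alternating labeling and model-update loop; every helper function is a placeholder standing in for the real track extraction, caption parsing, visual grounding, and model re-estimation steps, none of which are specified in the talk.

```python
# Hypothetical sketch of the alternating storyline-learning loop described above.
def learn_storyline_model(videos, captions, extract_tracks, parse_timeline,
                          init_model, assign_actions, reestimate_model,
                          max_iters=20):
    tracks = [extract_tracks(v) for v in videos]        # human tracks per video
    timelines = [parse_timeline(c) for c in captions]   # rough action order from the text
    model = init_model(timelines)                       # initial storyline model, no visual grounding yet

    labels = None
    for _ in range(max_iters):
        # Step 1: use the current storyline model to label actions on the tracks.
        new_labels = [assign_actions(model, t, tl) for t, tl in zip(tracks, timelines)]
        if new_labels == labels:
            break                                       # converged: the labeling stopped changing
        labels = new_labels
        # Step 2: use the labeled actions to re-estimate the storyline model.
        model = reestimate_model(labels, timelines)
    return model, labels
```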
All right. To summarize, I argued that we can bring back the flavor of global scene understanding using relationships. I first talked about how we can use physical relationships to build a physical representation of the scene. I then talked about how we can use functional relationships to build human centric scene understanding. And finally I talked about how we can use causal relationships to build a storyline model for a domain. Now, in my quest to get this complete scene understanding, as part of future work I want to look into intentions. For example, to really understand this cartoon, I need to understand the intention of the green actor: that the green person is trying to steal the fish caught by the red person. But understanding intentions requires modeling all the relationships together. For example, in this case, I need to model the physical relationship that the cliff is higher than the ledge. I need to model the functional relationship that fishing rods are used to fish, but since this one is not touching the water, it's probably not being used to fish here. I need to model the causal relationship that even though this guy is catching fish, the number of fish in his box doesn't increase. And once I can model all three relationships, I can then attempt to model intentions. So as part of my future work, I want to investigate how we can combine these three relationships and build a grand system which understands all of them together. For example, I can think about building causal relationships on top of a physical representation. So I can ask a question: why did this guy pass the ball to that guy on the left? Well, because he saw a lot of free space between them, so the pass could not have been intercepted by any other player. And so he passed the ball to the guy on the left. So now you can see how building causal relationships on physical representations helps us really better understand scenes. But it comes with a challenge: errors from the physical layer should not propagate into the causal layer. In fact, the causal layer should feed back and improve the physical layer. As part of my future work I also want to look into how we can use recognition itself to build better relationships. So here I'm trying to model recognition as an association problem. Now, this is the work of Tomasz Malisiewicz, who is also a student at CMU, and I am helping co-advise him. So here what we are trying to do is -- the idea is very simple. If you see a bus in the image, you associate it with a bus you have seen in the past, and because you know the attributes of the bus from the past, for example its 3D model or its geometric structure, you can just transfer the 3D structure from there to this bus. And you can now have a better understanding of the scene. Here's another example, where this chair gets associated with this chair I've seen in the past, and because I know the 3D model of this chair, I can now transfer the 3D model here. Now, another relationship I'm interested in looking at is the relationship between language and vision. Currently, all vision approaches learn concepts by looking at lots of visual data. However, we humans don't do that. For example, we don't need to fall from a cliff, or see someone falling from a cliff, to understand that you get hurt when you fall from a cliff. You just read the sentence and you understand this concept. So I want to look into how we can learn visual concepts by looking at large amounts of text data on the Web. And, in fact, I have done some work in this direction. I have looked into how we can use the richness of language.
For example, using prepositions to get the structure of spatial relationships in images. Again, this comes with a huge challenge: we need to have correspondence between the words and the visual world. And finally, I would also like to look into some of the applications of my work in robotics and graphics. For example, we now have these meaningful physical groundings, because we have a physical representation of the scene. So we can now talk about the scene in terms of occupied voxels and support surfaces. So we can use this representation to do path planning or action planning. In fact, we can even have a robot centric scene understanding instead of this human centric scene understanding. So in that case, it would be perfectly fine for a snake robot to crawl under a table and reach its destination. Similarly, in graphics, the way we currently create animations is by hand-marking the constraints and then using motion graphs to follow those constraints. But we can automatically create constraints using this physical representation of the scene, and we can combine this physical representation with models of actions and motions to create movies from still images. All right. To finally conclude, we have reached a very exciting phase in computer vision. After decades of hard work, vision is finally coming of age. We are now finally starting to see lots of success in computer vision. We have so many startups and so many things happening right now. And it is kind of inevitable that we'll have many more successes to come. We'll have a lot more success stories, we'll have a lot more advancements, and I hope our work is part of that advancement. So I would like to acknowledge all the people I have collaborated with, without whom this would not have been possible. And thank you. [applause]. >> Abhinav Gupta: Yes? >>: So I think it's great, you know, sort of getting back to the roots, sort of saying okay, we're going [inaudible] scene understanding. A lot of great pieces of the relationships. Of course, probably the reason people stepped back from these global problems was that it turned out that they all depend on solving a bunch of lower level problems. And everybody's been running around trying to solve these low-level problems. So I'm trying to look at your results and then come back to the fact that the initial thing [inaudible] that party scene at the beginning [inaudible] said about that, and then the initial thing that, you know, we can't even find a chair. And so there's some big gap here. I'm kind of wondering whether that gap is being addressed by either the people working on the low-level stuff or your work on the high-level stuff. I can't see how you can really get good results without waiting until -- does that make sense? >> Abhinav Gupta: So the question makes sense, but the answer is not small, so it might take a little while, like two minutes or so. But what you are saying is perfectly correct: we started with this big global problem. We left it because the low level was not working. And then we spent almost three decades on these low-level problems of pattern recognition and such. So what my whole line of research is hinting at here is that we have now started to make enough progress on these low-level problems. That's what the first part of the presentation was saying: that we have now seen some success stories in these pattern-matching, pattern-recognition problems.
So probably now is a good time to go back to these ideas. They were not working initially because the low level was not working, but at the end of the day we need both the low level and the high level to finish the task. The high-level guys cannot wait for the low-level guys to finish the task and only then start working, and the low-level guys themselves cannot finish the task until they get top-down information from the high-level guys. We have to collaborate. So what my work is trying to do is take the advancements in the low level and see where we can get with these old ideas -- with this functional understanding and all these things -- using the current low-level approaches. And at some point of time, and I think now is probably a good time, whatever partial information I get, I should feed it back to the low-level guys and say: see, this is what I partially understand based on the first thing you gave me; can you use this information to improve? In fact, I didn't get time because I was already at like 55 minutes, but the segmentation thing, the two faces of the building being combined -- that's what I was claiming, that none of the low-level approaches can do that right now. And the only way we could do it is because of the partial interpretation. We pass it to our low level and we try to resegment the image again based on this partial information. We tried to merge two regions, we tried to divide a region into multiple regions. So I didn't get into this part of my approach, how you actually use the high level to improve the low level. And we finally got these two segments together. And, in fact, this was not like the one case where it happened. Okay? So here is another case. This tree is occluding the building again. Again, two segments. And it combines them into one building. The only reason this is working is because we have this high-level volumetric relationship which feeds into segmentation and says: I don't like this as one segment and this as one segment, it just didn't work for me; can you do better? So then the low level says, okay, we tried to combine these two; does it give you a better one? Are you happy with that? That kind of thing. So we have to kind of interface with each other. And that is when I think we'll have the vision problem solved. I mean, otherwise the low-level people will say the best we can do is 40 percent, and we'll say, okay, we cannot start working at 40 percent. We have to kind of feed each other. We have to work together. That's where I think the fun lies right now. >>: I think looking into this is a lot of fun. I think it does have a lot of practical uses, like improving accuracies. But one thing I do find unsatisfying is the fact that relationships typically need to be hand coded. You come up with your own vocabulary, or you say how people sit on couches, or you say how blocks can be arranged in the world. These things aren't learned automatically. Is there any way that you can learn them? >> Abhinav Gupta: I'm glad that you asked this question because I wanted someone to really ask this question, because this is kind of a future work thing. So first of all, I need to clarify that whatever I am hand coding is kind of the nativist idea. What I'm hand coding is a representation of space. I'm just saying that space means something like this. I don't want to hand encode rules.
For example, this rule of sitting, which you mentioned. The only reason I have to hand code or manually annotate these support surfaces is because right now I don't have those annotations. So as part of future work, what we are doing is, when we are taking this data from the motion capture system, you can now have force sensors at the joints. So when you sit, you immediately see a bump in the force. You get an automatic annotation, an automatic rule that you need support here. So now I don't need to hand code it anymore. So this is like one example I'm giving you. There are multiple ways. But the first part of the presentation, the physical representation, that was just trying to encode how we would need to represent space. And that is kind of hinting, from the philosophical point of view or the psychological point of view, at the nativist idea that you cannot start [inaudible] -- it's going to be a problem. Even in the case of humans, when you move your hand, babies follow your hand. The reason is they come with some representation of space. They know that if I'm moving my hand over there, they have to look to the right to follow my hand. Things like that. So I'm trying to model this nativist idea of representation. Once you have this representation of space and time, you can learn these relationships automatically, for example using force sensors, or you can have a robotic system which physically interacts with the world. It can play with this cup and see, okay, this falls; there's some kind of thing going on. So it plays with the world, it physically interacts with the world, and tries to learn these rules itself. But we still need our basic representation. That is, in some sense, what I'm trying to hint at: the basic representation is what we are focusing on right now. >>: So I was kind of [inaudible] ask -- if you had a 3D reconstruction of the scene, then many aspects of the scene understanding problem become easier, because now you don't have to infer it anymore. A lot of the blocks world concept is -- if you had some way to get 3D shape, surfaces, voxels, then a lot of the inference problem would become easier. And so why is that not an approach that -- are you considering that approach and are [inaudible]. >> Abhinav Gupta: So that is actually a very good question. And I think a similar question was raised by Larry when I was talking to him: that you can even use videos. I mean, who says you have to stop at single images and stuff like that. So my answer is: that's perfectly fine. I mean, if you can get more information which helps you, it's a perfectly good thing to use. Now, the question is why am I not using it? There is a basic thing here. As I was trying to say, I want to focus on these representation issues. And what I believe -- and I guess it's kind of true -- is that the representation problem can be studied using a single image. How do you represent the world? The representation problem remains the same. The problem is the same; I've just isolated some part of it. And by isolating it, I've made myself more focused on that representation problem. And again, with the same representation you can now move to video or stereo. You now have better 3D. You now have more constraints to put in the cost function from the 3D world: these are the stereo relationships it should satisfy.
But the volumetric representation, for example -- that's what I am focusing on. And I think that representation problem is the same in video, stereo, and single images. It's basically a representation problem for the whole of vision; it's not a representation problem of single images versus other inputs. But I completely agree, from a practical point of view, if you have stereo, you should definitely use it. It will really help the system do much better. >>: I like this work. Something that's characteristic of some of the work you did is that, like you said, instead of going to true semantics, which is what Gibson is arguing, right -- semantics or whatever, or ecological imperatives, right, is another word he used, ecological -- >> Abhinav Gupta: Yeah, he used that. >>: You know, you focus on something that is somewhat more tractable, like physics, right? So then rather than saying what is the person trying to achieve as an ecological being, you say what does the physics of mass and balance support, right? But of course, as you mentioned, the semantics and the cultural things, or whatever, the behaviors that we engage in, right -- for buildings, the physics is the sensible thing. For humans, [inaudible] behaviors, right? So you did have a little small section on how motion and recognition interact with each other. >> Abhinav Gupta: Right. Right. >>: But it seems like there's so much more opportunity. Like, for example, analyzing this sitting situation, right, or that part -- just coming back to the very first picture you showed us, right, the party. I mean, we can immediately say there's a group of people here talking to each other, and we know that in that situation the statistical priors are that they face each other, they stand back, right? Whereas for the people watching the TV, there's a natural tendency to face the television, right? So all kinds of things like that. So how would you kind of layer in behavioral stuff, and also the statistics you can gather from, let's say, observing people behave? >> Abhinav Gupta: So whenever I'm presenting this stuff, I'm presenting it from an extreme point of view, in the sense that I do not try to encode recognition at this point of time. But I still agree -- and that is what one of the future works I mentioned actually is -- that recognition has to come into the system. For example, even in this case, the simplest example I was telling you about: we still have to recognize that these are people and that these two are from the same team. Only then can he pass things to them. So I completely agree that recognition has to come into the system, and that is what we are kind of focusing on right now, in that what we are formulating here is recognition as an association problem. See, the problem with recognition right now, which I believe is with the current recognition approaches, is their strong marriage with linguistics. For example, I want to model all the chairs in the world into one class called chairs. That is going to be a really, really difficult problem, because chairs can vary a lot in this world. I mean, for example, this becomes a chair if I start to sit on it -- sit here. So this definition of a chair is just so varied. And the problem is this marriage with the linguistic world. That is why I am focusing on this kind of association, where there's no match with the linguistic world. We associate with something we have seen in the past.
And then that's how you transfer information. Now, similarly, in the kind of statistical correlation work I did, at that point of time I did model the linguistic kind of semantic categories and then modeled the statistics on top. But I think we need to somehow model it non-linguistically. But linguistics should help. And that is what the last part said: I don't want to model classes linguistically, but I still want to get the structure of the world linguistically, because there is a lot of information in text data as well. It's a little complicated argument, but the argument is that you don't want to make your classes based on chairs and this linguistic world, but you can still get structure from the linguistic world, because there's so much linguistic data out there. So I want to capture the structure through linguistic data, but I want my recognition to be separate from this linguistic concept. But I agree it has to be closely married with recognition. And, for example, Derek's work was kind of that -- it was trying to put all this kind of 3D understanding into recognition: how can it help recognition itself? And I think, yes, that should be one of the things I should be looking at pretty soon. I mean, because I think this amount of 3D data, the amount of functional information, the amount of causal information, can really help us in doing much, much better recognition right now. And that is actually what I was kind of pointing at with the human centric scene understanding as well. For example, you can do much better cup recognition by having this idea of the functional world, because you should only look for -- >>: [inaudible]. >> Abhinav Gupta: Yeah, yeah. So you should only look for things in the reachable areas. You could have a much better predictor. What I didn't focus on is that you can even have much better action recognition here. I didn't point it out because it's getting too much. But if you know where a person can lie down in the scene, you know where to look for lying down actions. So you get much better action recognition as well. So these kinds of things are there, and they have to kind of come together. So I think, as Michael pointed out and as this kind of question brings out, we will not be able to solve vision with extreme views. We all have to come together and everything has to come together. Vision is such an extremely difficult problem that everything has to come together and click at the proper position, and then I think everything would be solved. But this is one extreme view of what it would look like if you model these physical relationships and functional relationships and take an ecological point of view. And that point of view is an extreme academic point of view, I guess. >> Larry Zitnick: Let's thank our speaker again. [applause]