>> Larry Zitnick: Okay. So it's my pleasure to welcome Abhinav Gupta here
today for his candidate talk. He's done a lot of really interesting work in the area
of object recognition and scene understanding. He's looked at 3D modeling,
physical modeling, functional modeling of scenes and different places. So you
can see that's what he'll be talking about today.
He got his PhD from the University of Maryland, and he's currently a post-doc at
CMU. And he recently won the runner-up best paper award at ECCV.
>> Abhinav Gupta: I guess I got lucky there. I will go here. Thank you, Larry for
inviting me. I'm glad to be here at Microsoft talking about my work. All right. So
I work in the field of image understanding. So let us first start with an image. So
let us consider this image.
When we humans see this image, we can tell so many things about it. For
example, it is a room. It is probably a living room because I can see couches
and table here. There are a lot of people in the room, so probably there's some
kind of a party going on. Then I can see a cake, some balloons, so probably it's
a birthday party. And this list can go on and on.
We humans are really good at understanding images. Because vision happens
to be one of the most important functions of the human brain. In fact, more than
50 percent of the human brain is dedicated to solving the task of vision. Why
not? We use vision for all our daily tasks. We use vision to interact with the
world. We use vision to navigate, perform actions in the scene, recognize people
and things, and to even predict what is going to happen next.
And because we are so good at vision, we often do not realize how hard it is
for a computer to just look at this array of numbers and understand things which
are just so obvious to us. In fact, there is a legend about how the field of
computer vision actually started. Marvin Minsky assigned computer vision as a
summer project to one of his undergraduate students, Gerald Sussman. And the
project outline was to spend a summer linking the camera to the computer and
getting the computer to describe what it saw.
So, in fact, he thought that vision could be solved in three months as a summer
project. And it took us almost a decade to realize how hard the computer vision
problem is.
But still, at the start of computer vision, we as a field were very ambitious. We
wanted to solve the whole problem. We wanted to solve the complete scene
understanding problem. We wanted to extract everything from the image. We
wanted to extract semantic properties, geometric properties, spatial layout, just
everything we can think of. And we tried to build some really grand vision
systems in the '70s and so on.
For example, some of these systems are the VISIONS system from Hanson and
Riseman, the ACRONYM system from Brooks and Binford and the system from
Ohta and Kanade. So here is an example from the VISIONS system from Hanson
and Riseman. And remember, this is 1977, when you could only fit one image into
the RAM of a computer. And at that point of time, given an image, they wanted to
extract the semantic properties, that there is a tree, a house, a road. They wanted to
extract the spatial layout, that the tree is in front of the house, there's a tree
behind the house. They wanted to extract the occlusion properties, that this tree
is occluding the house and the house is occluding the tree and so on.
So we learned a lot of great lessons by building these grand vision systems. But
while these systems were good on a few select images, they failed to generalize.
And the reason is that the problem is just too hard. And we did not have the
computational tools in those years. It's very hard to model a complete
representation of the world.
And that brought us to the modern era in computer vision. So instead of trying to
solve this global scene understanding problem, we divided the problem into
local parts, and we wanted to solve these local pattern matching problems. So
instead of trying to understand this whole image of a room, we now wanted to
understand the bits and pieces of it. So maybe the couch, maybe the table
and so on.
So the standard way is you learn a classifier which separates couch from the
rest of the world, you then take a window, slide it across the image, extract
patterns and then classify these patterns as couch or not a couch. And during
the last two decades we actually have made a lot of great progress in this
direction.
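
To make the sliding-window idea concrete, here is a minimal Python sketch. It is only an illustration: the window size, step, raw-pixel features and the thresholded linear score are assumptions, not the detectors discussed in the talk.

    import numpy as np

    def sliding_window_detect(image, score_fn, window=(64, 64), step=16):
        """Slide a fixed-size window over the image and keep high-scoring patches."""
        detections = []
        h, w = image.shape[:2]
        for y in range(0, h - window[1] + 1, step):
            for x in range(0, w - window[0] + 1, step):
                patch = image[y:y + window[1], x:x + window[0]]
                features = np.asarray(patch, dtype=float).reshape(-1)  # stand-in for real features
                score = score_fn(features)     # e.g. a learned linear classifier score
                if score > 0:                  # positive side of the decision boundary
                    detections.append((x, y, window[0], window[1], score))
        return detections

In practice one would also scan over multiple scales and apply non-maximum suppression, but the loop above is the core of the "slide a window, classify the pattern" recipe.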
We now have face detection in our cameras. We identify landmarks using
Google Goggles. But while we have made a lot of great progress, for example,
in understanding classes such as faces, pedestrians and cars, we are still very
bad at other classes, for example, chairs, tables and birds, where we are around
5 to 10 percent, in that range.
But wait a minute. Even if I solve this problem, even if I completely solve this
detection problem, let us see what kind of an understanding a system would
have. So given an image, I would just have the list of detections in the image.
So this is the ideal case. The detection problem completely solved. So we'll
have in this case the two couches, the table and the lamp. And this is what a
system sees and understands. Just some boxes and with some words
associated with those boxes.
Now, this kind of understanding is great for answering a retrieval question such
as is there a couch in the image? But how often are we interested in such kinds of
questions? We are more interested in questions such as where is a couch in the
image and how do I find the path to that couch? And this understanding cannot
answer this question. In fact, this understanding is so superficial that it cannot
answer where can I sit in this image, even though I know there is a box
corresponding to a couch, so somewhere here I can sit, but I just don't know where
I can sit. Is this area the place where I can sit? Or is this area the place where I
can sit?
So just for reference, this is the couch. So this is the correct answer. In fact,
we cannot even answer the most basic question necessary for survival, where in
the image can I walk? Even though I know there's ground, is this area good to walk
or is this area good to walk on?
So it is clear that we need to somehow go beyond these local bounding box kind
of detections. We need to bring back the flavor of global scene understanding
which our pioneers had in mind. And in this talk, I'm going to argue that we can
bring back the flavor of global scene understanding using relationships.
So it's not just a coincidence that this table and ground occur together in the
image. In fact, they share a very strong relationship; that is, the table is
supported by the ground and the table creates a free-space obstruction on the
ground. So a human cannot stand at the location where the table is. And if
somehow I can capture this strong relationship and extract this information I can
now answer the question where in the image can I walk? So a human cannot
walk at this location because of the free-space obstruction whereas this location
is perfectly fine to walk because there's enough free space for a human to walk.
Let us look at another question. How do I get to the lamp in the image? And let
us look at the relationship between the couch on the right and the table here.
Again, the couch and the table are not just some isolated detections. But they
share a very strong relationship that the table is in front of the couch and the
table is separated from the couch. And if I can capture this rich information, I can
now answer the question how do I get to the lamp?
So I can go between the table and the couch, because they are separated, and
reach the lamp. So in this talk I'm going to argue that we can bring back the
flavor of global scene understanding by extracting this rich information from
relationships. And when I'm going to talk about relationships, I'm going to talk
about them in a very broad sense. For example, I'm going to talk about
relationships between the structure of the scene, the walls and the floor. I
am going to talk about relationships within the objects in the scene, for example,
spatial relationships such as the table is in front of the lamp and the couch, the left
couch is attached to the wall. I am going to talk about support relationships, for
example, that the couch and the table are supported by the ground.
We can go further and have a relationship between a human and an object in the
scene. That is, how do humans use those objects in the scene? So I can have a
relationship between a human and a couch, that humans use couches to sit.
And finally I can have relationships in time. So let us suppose there are three
humans sitting on the couch and there's a fourth human who wants to sit on the
couch. Well, the only way this human can sit is if one of the three persons
decides to get up and create a space for him. So these kinds of prerequisites
for actions to happen, we can always have them as relationships.
Now specifically in this talk I'm going to talk about three kinds of relationships. I'm
first going to start with physical relationships and I'll show how we can build a
physical representation from these physical relationships. Then I'm going to
briefly describe functional relationships, which are relationships between humans
and the physical representation of the scene. And finally I'm going to just touch
upon causal relationships which are relationships in time.
So let me start with physical relationships. So given an image, our goal is to
obtain a 3D understanding of this image. The way I'm going to do this is by
building relationships between the structure, for example the walls and the floor,
and the elements of the scene, for example the occupied volumes, the support
surfaces, and the objects themselves.
Very recently there has been a renewed interest in the field of 3D understanding
of images. For example, the recent work by Hoiem et al. looks at patches and
uses local classifiers to obtain surface orientations. And if the image is simple
enough you can group these patches and create 3D models.
But because these approaches still use local classifiers, they often lead to
interpretations which are physically not possible or highly unlikely. For example,
in this image, the local classifier predicts that the ground is between the vertical
regions, and while such an interpretation is possible, we can all see how unlikely
it is.
So as the next step, what people thought was that we have to include some
relationships and build a global understanding. So they tried including spatial
relationships in the image plane. But none of these approaches showed any
major improvements. And the problem is much more basic in nature. The
problem is with the representation itself. All these approaches assume a planar
representation of the world; that is the image is made up of planar patches
standing on top of the ground. And this representation is so permissive that it does
not have enough globally meaningful constraints or relationships to restrict the
location of these planes.
For example, in this image it divides up into multiple planar patches standing
on top of the ground and the representation is just fine with it. It has no problems with
so many patches in the image.
So what we need is a representation that can bring in some more globally
meaningful relationships that can constrain the location of these planes. One
such possible representation is a volumetric representation of the scene. So
instead of assuming that the world is made up of planar patches we can now
assume that the world is made up of volumes. And once we have volumes, we
have so many relationships we can harness. We now have geometric
relationships, for example, the finite volume relationship, which says every
object must occupy a finite non-zero volume.
We have a spatial exclusion relationship which says every object must occupy a
mutually exclusive volume. Once I am talking about volumes I can even talk
about masses. And now I have mechanical relationships, that volumes should be
configured such that they are physically stable. These volumes should not topple
over.
Now, interestingly these relationships are the same relationships which our
pioneers wanted to include in the famous Blocks World project at MIT. And
that was all during the early '60s.
So here is a kind of an understanding that our system would have. Given an
image, we would break it into regions where each region would correspond to a
physical object in the scene. For each region we [inaudible] a volumetric class
which gives us some idea of what kind of volume the object occupies.
For example, in this -- in the case of this building I can say there's a left face
visible, there's a right face visible and then there's a volume behind these two
faces. In the case of this building, I can say that the right face of the building is
visible, and the left face is either occluded or out of the image frame. Not only can we
estimate some kind of volume, but we can also estimate some kind of density
associated with these regions.
For example, the building is a high density region whereas the trees are a medium
density region. We can even extract spatial relationships. For example, this
building is in front of the tree and the tree is in front of the building on the right.
And finally, we can have support relationships that building and the trees are
supported by the ground. And this understanding of the scene is called a 3D
parse graph, and we are automatically generating it using our program.
So let me first talk about representation. So our goal is to break an image into
regions and estimate some kind of volume associated with each region. Now,
one possible way is to do precise metric 3D reconstruction so what we can do is
we can try to fit a cube into this region and then do exact quantitative
measurements, that this cube is 10 meters wide, 15 meters high and so on.
But doing precise metric 3D reconstruction from a single image is an extremely
difficult problem. So we are not going to try to do that. I will repeat again we are
not going to try to do any precise metric 3D reconstruction. We are just going to
associate each region with one of the eight volumetric classes defined in the
catalog.
For example, the building region is a left-right class, which says I can see a left face
of the building, I can see a right face of the building, and the volume is hidden
behind these two faces.
This building is a left-occluded class, which says I can see a right face of the
building and the left face is either occluded or out of the image frame. And for the
tree I can associate a porous class which says that this tree occupies some
porous volume in the scene.
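
As a data structure, this catalog of qualitative volumetric classes could be written down as a simple enumeration. This is only a sketch: the talk names a handful of classes (left-right, left-occluded, frontal, front-right, porous), and the remaining names below are filled in as assumptions to reach eight entries.

    from enum import Enum

    class VolumetricClass(Enum):
        FRONTAL = "frontal"                 # only a frontal face visible
        LEFT_RIGHT = "left-right"           # left and right faces visible, volume hidden behind
        FRONT_LEFT = "front-left"           # assumed: front and left faces visible
        FRONT_RIGHT = "front-right"         # front and right faces visible
        LEFT_OCCLUDED = "left-occluded"     # right face visible, left face occluded or cropped
        RIGHT_OCCLUDED = "right-occluded"   # assumed mirror of left-occluded
        POROUS = "porous"                   # e.g. trees: a porous volume
        SOLID = "solid"                     # assumed catch-all for the remaining class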
Now, not only do we want to associate or estimate the volume with the region,
but we also want to estimate a notion of mass or weight associated with the
region. But visually estimating mass or weight is an extremely difficult
problem. So it's impossible for me to say, just by looking at this image,
whether this building weighs a thousand tons, 2,000 tons or so on.
But if we look closely, what we can do is map texture to some notion
of qualitative density. For example, this is a brick texture, so this should be a high
density region, whereas this is a tree texture, and this should be a light density
region.
So again, what we are going to do is associate regions with three
qualitative density classes: light density, medium density and heavy density. So
we will build a density classifier using appearance and location features. These
are the features. And we use a decision tree classifier.
So here are some of the examples of the density classifier. So as you can see, it
classifies buildings as a high density material, trees as light density material and
humans as a medium density material. Now, quantitatively the performance of
our approach is 69.3 percent.
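
A hedged sketch of such a region-density classifier in Python: a decision tree over appearance and location features, as described above. The specific features (mean color, a crude texture statistic, the normalized region centroid) and the tree depth are illustrative assumptions, not the features used in the actual system.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    DENSITY_CLASSES = ["light", "medium", "heavy"]

    def region_features(pixels, centroid, image_size):
        """pixels: (N, 3) array of region colors; centroid: (row, col) of the region."""
        mean_color = pixels.mean(axis=0)                           # appearance: average color
        texture = pixels.std(axis=0)                               # appearance: crude texture proxy
        location = np.asarray(centroid) / np.asarray(image_size)   # location: normalized position
        return np.concatenate([mean_color, texture, location])

    # Train on regions with manually labeled densities, then predict per region:
    # clf = DecisionTreeClassifier(max_depth=5)
    # clf.fit(train_features, train_labels)   # labels drawn from DENSITY_CLASSES
    # predicted = clf.predict(test_features)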
Okay. So our goal is, given an input image like this and the set of blocks, which
are segments with their associated volumes and densities, we want to somehow
configure these blocks such that the configuration is physically stable and
looks exactly like the input image.
Now, as you can see, this is a hard combinatorial optimization problem. Our cost
function is neither submodular, nor do we have just pairwise [inaudible] in the cost
function. In fact, we have a much higher order [inaudible] size.
So instead of applying the standard optimization approaches, we used an iterative
approach. And our experiments show empirically that it seems to work pretty well.
And the way I'm going to motivate this iterative approach is the way children play with
blocks during their childhood. So remember, we had a target configuration we
wanted to reach, and we tried blocks one by one such that the block is stable and
it helps us move towards the target configuration.
We are going to do exactly the same thing. We are going to try blocks one by
one, see which block is physically stable, and which helps us to move toward the
target configuration.
So given an image, we'll first estimate the ground plane on which we are going to
stack the blocks. We are going to estimate the bag of blocks, or the set of blocks,
which we are going to place. We also estimate the local surface layouts and the
density. And now we are ready to place the block on the ground.
So at each round we are going to try to place blocks one by one. So at round
one, let us try to place this block. Now, if you look closely, this block has no
physical contact with the ground. And therefore this is not stable and would fall.
Let us try another block. In case of this block it has some contact with the
ground, good -- so we are good to start. Let's see if one of the volumetric classes
fits it. So we try the volumetric classes one by one. And in this case, the left-right
class fits well because there's a left face and there's a right face.
Now, let us see if this block is physically stable. So if you look closely, this part of
the block is classified as a high density [inaudible] at the top and it
has no support at the bottom. So this block is physically unstable and would
topple over.
Now, let us try a third block. Now, in the case of the tree, we have a nice geometric
class, which is porous, and it is physically stable. So we select this block at this
stage and we move to round two. At round two, I am going to try to place all the
blocks again one by one. So let me try this block now. Well, still it has no
support here, so it would not be physically stable. But let us try this block now.
Now in the case of this block, we have a new hypothesis, that the tree occludes
the base of the building. So now the block becomes physically stable. So now
we select this block, assign the left-right class to it, and extract the occlusion
edges.
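
The round-by-round procedure just described can be summarized as a small greedy loop. This is a minimal sketch under stated assumptions: cost stands in for the five-term cost function described next, returning None when a candidate is not physically stable, and blocks, volumetric_classes and scene are placeholders for the segments, the class catalog and the estimated ground and surface layouts.

    def place_blocks(blocks, volumetric_classes, cost, scene):
        """Greedy, round-by-round placement: add the best stable (block, class) each round."""
        placed = []
        remaining = list(blocks)
        while remaining:
            best = None
            for block in remaining:                     # try blocks one by one
                for vclass in volumetric_classes:       # try volumetric classes one by one
                    candidate = (block, vclass)
                    c = cost(candidate, placed, scene)  # None if unstable or unsupported
                    if c is not None and (best is None or c < best[1]):
                        best = (candidate, c)
            if best is None:                            # nothing stable can be added this round
                break
            placed.append(best[0])                      # commit the best candidate
            remaining.remove(best[0][0])
        return placed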
Now, during this whole iterative process, I kept on selecting and rejecting the
blocks. Well, we have a cost function to achieve that. Our cost
function has five terms. The first two terms are the geometric cost. The first
one measures [inaudible] between the input surface layouts and the assigned volumetric class.
For example, the front right class is a bad assignment because it doesn't match
the input surface layouts, whereas the left right class is a good assignment
because it matches the input surface layouts.
The second one measures the agreement between the ground and sky
contact points. So if you see a right-facing face, then the ground and
sky contact points should intersect at the horizon.
The next two terms measure the physical stability of the system. The third
term measures the internal stability and rejects blocks which have a light bottom
and a heavy top. The fourth term assumes that we are looking at a static and
physically stable world. So we are not looking at images of falling
buildings and so on. And under this assumption it sees whether the blocks are
externally stable or not.
So the way we do it is we first fit the volumetric class to the block, which means
finding the axis of rotation and the four edges where the planar surface
orientation changes. And once I know the four edges and the axis of rotation, I
can now compute the torque due to the weight of the patch and the torque due to the
reaction from the ground. And if the torques balance [inaudible], I can say the block
is physically stable.
Now, this term also rejects configurations which lead to heavy block at the top
and the light block at the bottom. So, for example, in this image there are two
possible configurations: the building is on top of the slab, or the building is
behind the slab. Now, since the building is a high density material and the slab
is a light or medium density material, the first configuration is not possible
because a high density material cannot be on top of a low density material.
Now, the fifth and final term measures the global agreement between
pairwise depth constraints. And the way you can obtain these pairwise depth
constraints is using depth rules. For example, if you have a 2D projection like
this, then under the convexity assumption it means that the yellow block is in
front of the grey block.
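
To illustrate the stability test in the third and fourth terms, here is a hedged 2D sketch of a torque-balance check. Everything here is an assumption for illustration: the qualitative density-to-weight mapping, the flat list of block parts, the single support point and the tolerance; the actual system fits the volumetric class and reasons about the axis of rotation and the ground reaction.

    DENSITY_WEIGHT = {"light": 1.0, "medium": 2.0, "heavy": 3.0}
    TORQUE_TOLERANCE = 0.1  # assumed tolerance for "the torques balance"

    def is_internally_stable(block_parts, support_x):
        """block_parts: (centroid_x, area, density_class) triples for the block's sub-regions;
        support_x: x-coordinate of the ground contact / candidate toppling edge."""
        net_torque = 0.0
        for centroid_x, area, density in block_parts:
            weight = DENSITY_WEIGHT[density] * area           # qualitative weight
            net_torque += weight * (centroid_x - support_x)   # signed torque about the support
        total_weight = sum(DENSITY_WEIGHT[d] * a for _, a, d in block_parts)
        # a small net torque relative to the total weight means the block would not topple over
        return abs(net_torque) < TORQUE_TOLERANCE * total_weight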
Now that I have described the five terms of my cost function and my [inaudible]
approach, we are now ready to see some results. So given an image like this, our
approach can generate a 3D parse graph for the image. So it can break the
image into regions corresponding to the physical objects. It can say that these
trees are the porous classes. The building is a left-right class, this building here.
The chimney's a left-right class. It can extract the densities. For example the
trees are medium density material whereas the building is a high density
material. It can extract the spatial layout. For example the tree is in front
of the building and so on.
Now, the interesting thing to note here is our approach can combine these two
faces of the building. None of the current segmentation approaches can actually
achieve that. And the reason we can do this is because we use strong global
relationships. So if you use this face of the building as one segment, it would
lead to a zero volume block. If you use this face of the building as one segment,
it would lead to a zero volume block. The only way you can have a finite volume
is if you combine the two faces and they create a volume.
>>: How do you actually learn your parameters for your weight function?
>> Abhinav Gupta: That's a good question.
>>: You don't have to go into detail but --
>> Abhinav Gupta: Yeah.
>>: [inaudible] high level --
>> Abhinav Gupta: So in the current system, what we did was -- I manually
tuned the parameters for 10 images and then -- not 10, I think 20
images or so -- and we ran it on the rest of the 300 images. But the reason I
guess we couldn't do this cross-validation kind of thing is because we didn't have
enough images. So the dataset comes with 300 images, and 300 images are
probably not enough to do cross-validation. And out of these 300, we had already
used a hundred for some other training, for example, training of the density
classifier, so we cannot use them anymore. So I guess the best way would be to
do the cross-validation kind of thing.
>>: [inaudible].
>> Abhinav Gupta: I mean, with the parameters of this cost function, what we
made sure was that we are not overfitting, in the sense that -- I think, if I remember
well, the parameters which come out are like all ones or something like that. To
make sure that it's a simple cost function, it's not like I'm putting in 1.25 or
something like that.
Now, here's another example where physical relationships can actually
help to get a better understanding of the scene. So using physical relationships,
our approach rejects a block like this, because it would topple over, and divides
this block into two blocks where the first block is assigned the front-right class,
which says there's a front face and a right face here, and this is another block
where there's only a frontal face.
We can even extract the spatial layout properly here as well. For
example, this tree is this tree here, and it is in front of the building here. And this
tree is this one, and it is in front of the second part of the building.
And one of the interesting things about our approach is that if you make more
assumptions, we can even do 3D reconstruction. So let us assume these volumetric
classes actually come from cuboids, so these classes would be something like this. We
can now fit cuboids to the regions and extract occlusion boundaries, and we can
do a 3D reconstruction of our toy blocks world.
For example, in this image the big pink block corresponds to the building. The
small pink block corresponds to the chimney. And this green block corresponds
to the tree here. This green block corresponds to this tree here and so on.
Again, none of the current vision approaches can generate this kind of 3D
volumetric reconstruction from a single image. So we applied similar concepts of
volumetric relationships to the problem of indoor scene understanding as well.
Now, this work is part of the thesis work of David Lee, who is a student at CMU. And
I'm helping co-advise him.
So the problem here is given an input image like this, we want to extract where
the walls and floor are. And again if you use simple local classifiers, you often
get interpretations which are physically impossible. For example in this case, the
wall is predicted to be in front of the stove and the table. And this is physically
not possible. But if somehow we can estimate the volumes of the stove and the
table, we can now easily argue that the wall should be behind these volumes and
not in front of them, because the stove and the table are inside the room.
So once we do this kind of reasoning we get a much better wall estimate. Here's
another example. Here, using just simple local classifiers predicts that the wall
is in front of the bed. But if you can estimate the volume of the bed, you can
easily argue that this wall should be behind the bed and not in front
of the bed, and you get a much better wall estimate.
So that was about physical relationships. I will now talk about a different kind of
relationships, which is the relationship between the human body and the physical
representation of the scene.
So in the first part of the presentation, I talked about how we can get a physical
representation of the scene, how we can extract the occupied volumes, the free
space, the support surfaces and so on.
But often when we look at images we're interested in other properties such as
what can I push in this image, what can I throw, where can I sit. Now, if you look
closely these properties are subjective. What is going to be pushable for an
elephant is not necessarily pushable for me. And in the case of image
understanding, the subject is us, the humans. We want to understand
images exactly the way humans do.
Therefore, in this part of the presentation, I'll introduce a new paradigm in scene
understanding and we call it human centric scene understanding. Instead of
trying to understand the scene in terms of identities of objects or the 3D
representation, we are now going to understand the scene in terms of what can a
human do in this scene?
For example, in this image, a human can sit, as shown in the green skeleton.
The blue skeleton shows how a human can sit with straight legs. And the red
skeleton shows how a human can stand and touch a six-foot-high location on the
wall.
So why am I so excited about this human centric scene understanding? Well, it
brings in a new fresh perspective on the scene understanding problem. Instead
of trying to understand the scene in terms of identities of objects, there's a bed,
there's a table, paintings and so on, we now want to understand the scene in terms
of what can a human do here? So he can sit on the table or on the ground. If he has
something in his hand, he could put that thing on the top of the shelf or on the top
of a table. If he wants to move, he can move that painting. If he wants to store
something, he can store it inside the cavity of the shelf. And so on.
So we now have a task-based understanding of the scene. And we can evaluate
how good our understanding is based on how many tasks a human can achieve.
Furthermore, it provides a prior for the object recognition problem itself. For example,
if I want to search for cups in this image, one possible way is to look for cups in
all possible locations in the image. A smarter way would be to only look for
cups in the reachable areas of the image, because it is highly unlikely that one
would put a cup on the top of a shelf.
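
A hedged sketch of using such a reachability prior for object search: keep only detection windows that rest on a reachable surface. The reachable_mask (assumed to come from the human-pose fitting described later) and the detection format are illustrative assumptions.

    def filter_by_reachability(detections, reachable_mask):
        """detections: (x, y, w, h, score) boxes; reachable_mask: 2D boolean array."""
        kept = []
        for (x, y, w, h, score) in detections:
            base_row = min(y + h, reachable_mask.shape[0] - 1)       # bottom of the box
            center_col = min(x + w // 2, reachable_mask.shape[1] - 1)
            if reachable_mask[base_row, center_col]:                 # box sits on a reachable area
                kept.append((x, y, w, h, score))
        return kept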
We can even go further and have unconventional categories. For example, we
can now have reachable area detectors, hideable area detectors, storable area
detectors, and you cannot have these kinds of categories using conventional
pattern recognition approaches.
And finally, this brings in the subjective interpretation. So the way a child looks at
a scene is completely different from the way an adult human looks at the same
scene. A child can sit with straight legs on the couch itself, whereas an adult
human needs the support of a table to sit with straight legs.
Similarly, to touch a six-foot-high location, a child needs to climb on the couch,
and only then can he touch the six-foot-high location, whereas an adult human
can just stand and touch a six-foot-high location.
Now, interestingly this idea of human centric scene understanding is not new in
the field of psychology. In the early '50s, James Gibson came up with this idea of
affordances, which are the opportunities for interaction afforded by the scene.
So his claim was that when humans look at an image, they do not
understand identities, but they infer the functions of the objects. For example, if
you look at a knee-high flat surface, you would infer that [inaudible].
But Gibson was very strong on affordances. He believed fruits afford eating,
water affords drinking and TVs afford watching. We are not
going to have as strong a notion of affordances. We are only going to consider
physical affordances, that is actions which include physical interaction with the
scene.
The other problem with the affordances idea has been its association with
semantic categories. The belief is that as soon as you look at an image, you
infer where all you can sit in an image. But if you consider sitting, there are
multiple ways a human can sit. I can sit with a back support, I can sit without a
back support, I can sit in some kind of crouching position, I can sit with straight
legs. There are just so many ways a human can sit.
So having just a single classifier which finds all the sittable surfaces in the
room is probably going to be too difficult a task. So instead of considering
this semantic vocabulary of affordances, we are now going to consider a
data-driven vocabulary of affordances. That is, instead of posing
the problem as finding the sittable surfaces in the room, we want to pose the
problem as where can a human pose fit in the room. And the way we get the
poses is using motion capture data. In motion capture, the human wears a
black kind of suit with some markers, as you can see here, and then he
interacts with the scene in this lab environment with Vicon cameras looking at
him. And we can extract the poses using those cameras.
But this brings in a whole new problem: the space of poses is huge. The
space of poses is exponential in the degrees of freedom of the human body. So
what we are going to do is again fall back on a qualitative
representation of a pose. So we are going to represent a pose in terms of what
volume it occupies in the scene and what support surfaces it requires.
So, for example, sitting with the back support and straight legs would mean a
volume like this red block, and it would need support at the legs, at the pelvic
joint, and at the back. Here's another example: sitting would mean a red
volume like this, and you would need support at the pelvic joint and at the back.
So now the problem is: given this kind of physical representation of a scene, where
we know the occupied volumes and so on, and given this block representation of a
pose, we need to find where this human block would fit. Does this resemble
anything to you? This is exactly the Tetris problem. You remember, in Tetris
there's a block which falls and you need to find where this block would fit. It's
exactly the same problem in 3D.
You have this scene with occupied blocks, a human block is going
to fall, and you need to find where this human block would fit. So we are going
to use the exact same constraints as we use in the Tetris problem. The first
constraint is the support surface constraint, which says that if you
need a support surface somewhere, it should be present in the scene.
So, for example, for sitting you need a support surface at the pelvic joint, and
therefore this part of the scene should be occupied in the image.
So, for example, the couch provides support for the human
sitting.
The second constraint is a free space constraint which says that wherever the
human [inaudible] in the image, that location should not be occupied by any
object in the scene. So, for example, if a human has to fit in this yellow location,
this location should be completely empty.
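
A minimal voxel-grid sketch of these two constraints, under stated assumptions: the scene is a 3D boolean occupancy array, and the pose is encoded as two sets of voxel offsets (its swept volume and its required support voxels) relative to a candidate position. The encoding is illustrative, not the system's actual representation.

    def pose_fits(occupied, pose_volume_offsets, support_offsets, position):
        """occupied: 3D boolean array (e.g. NumPy), indexed as occupied[x, y, z];
        position: (x, y, z) candidate voxel for placing the pose."""
        px, py, pz = position
        # free-space constraint: no scene voxel may overlap the human's volume
        for dx, dy, dz in pose_volume_offsets:
            if occupied[px + dx, py + dy, pz + dz]:
                return False
        # support-surface constraint: every required support voxel must be occupied
        for dx, dy, dz in support_offsets:
            if not occupied[px + dx, py + dy, pz + dz]:
                return False
        return True

Sliding this test over all candidate positions yields maps of where a given pose (sitting, sitting with straight legs, lying down, reaching) can fit in the scene.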
So let me show you some results. Given an input image like this, our approach
can predict where a human can sit in this image. The blue shows the location of
the pelvic joint. We can predict where a human can sit with straight legs, so the
cyan shows the back support and the blue shows the pelvic joint support. We can
predict where a human can lie down in the image. And the magenta shows the
mask where a human can lie down, or where the joints go if the human is lying
down.
>>: [inaudible].
>> Abhinav Gupta: Yes. So in this case, we didn't find the chair as a physical
surface in the physical representation.
And we can predict where a human can reach a six-foot-high location. And if
you change the height of the reach, you get multiple possible reach locations.
Here's another example. Now, again in this case, magenta shows where a
human can lie down. This shows where a human can sit with straight legs, so the
cyan shows that you can get the back support from the bed and you can get the
pelvic support from the ground. And the blue shows where a human can sit
without a back support. So it would be basically at the corners of the bed.
But this is a very interesting result. In this case our approach predicts that a
human can sit on top of a stove, on top of a table. Well, this is a perfectly correct
result based on my approach. We were just looking for physical interactions, and
humans can really sit on top of a stove. I tried that after getting this result.
But how often do we see humans sitting on top of stoves? So apart from these
physical requirements there's something called a cultural context. How do
humans use objects culturally? So apart from this physical layer, you need a
statistical layer which captures how humans generally use objects.
So I looked into this problem during my thesis work. This was in 2008,
when I looked into the statistical correlations between human motion and the
objects in the scene. For example, the cups, the flashlight, the phones and the
spray bottle.
The idea was very simple. If you look at these two objects, it is very hard to say
just based on appearances which one is a drinking bottle and which one is a
spray bottle. But if you look at these objects in the context of the whole scene, you
can easily say this one is a spraying bottle and this one is a drinking bottle.
Similarly, if you just look at trajectories, it is impossible for me to say which one is
the answering of the phone call and which one is a drinking from the cup. But
again, if you look at it in the context of the whole image, this one is the answering
of the phone and this one is the drinking from the cup.
So the idea was that human actions can help in recognition of objects and
objects can help in recognition of human actions. So that was about modeling of
functional relationships.
So in the first part I talked about how we can model physical relationships, in the
second part I talked about how we can model physical -- functional relationships,
sorry. And now I'm going to talk about another kind of relationship, which is a
relationship in time. And these are called temporal causal relationships.
Again, activities that happen in a scene are not isolated and independent. For
example, in this scene, the car jumps the red light.
The skater stops. And this car on the left stops. Now, these three activities are
not isolated and independent of each other. In fact, they share a very strong
relationship. The skater stops to avoid being hit by the car that jumped the red
light. And we can use these causal relationships to develop stories of videos
where the storyline would be the set of actions that occur in the video and
relationship between those actions.
For example, the story of this video would be skater stops to avoid being hit by
the car that jumped the red light. But discovering causal relationships or
modeling these causal relationships is a very difficult problem. For example, in
this scene, I need to model the traffic laws of the world. I need to model how
cars run on roads. I need to model how pedestrians cross the road. I need to
model the cultural context, that cars stop when pedestrians are crossing. So
there are so many things I need to model.
So it's a very difficult problem. So here what we did was consider
a very restricted domain of videos of human activities in a very
structured setting, for example, sports videos such as baseball videos. Now, in
this case, videos have a very strong story associated with them. For example
here, the pitcher pitches the ball. The batter then hits it. He then runs towards
the base. In the meantime, the fielder runs to catch the ball.
The story in the second video is much shorter and much simpler. The pitcher
pitches the ball and the batter just misses it. Now, we would like to model the
variation across the storylines in the domain, and we would like to discover the
causal relationships. So we would like to learn a common storyline model, or the
space of storylines, for a given domain.
For example, the storyline model for the baseball domain would be the pitcher
pitches the ball, then the batter can either miss it or he can hit it. If the batter hits
the ball, he then runs towards the base. In the meantime, the fielder runs to
catch the ball. He catches the ball and then either the fielder can throw it or he
can himself run with the ball towards the base.
So our goal here is, given videos and some associated text or captions, we want to
discover the causal relationships and extract the storyline model for the domain.
Now, discovering causal relationships requires labeling actions in videos. For
example, I need to know which action is pitching and which action is hitting to
discover the relationship between pitching and hitting. However, labeling of
actions itself requires me to know the rules of the game. I need to know the
storyline model. So I need to know what happens before pitching or hitting.
What is pitching? I need to know these kinds of things to label what is pitching
and what is hitting.
So now we have a classical chicken and egg problem. I'm very briefly going to
describe an iterative approach of how we handle this. Given the videos, we first
extract the human tracks, and given the captions, we extract the rough timelines of
each video. We then combine the timelines and obtain a rough storyline model
to start with. This is just the initial model. It can have errors in it.
But if you notice closely, it still does not have any visual grounding at this point of
time. So we obtain the visual grounding by computing co-occurrence statistics between
the visual features and the words. And now we have a storyline model to begin with.
Using this storyline model, we assign actions in the video. Once we have the
actions labeled, we can use these labeled actions to improve our storyline model.
And we keep on doing these iterations until the storyline model doesn't change or
converges. And once we have a learned storyline model, we can use it to infer
storylines of new videos. For example, in this test video we found the storyline
was: the pitcher pitches the ball, the batter hits it, he then runs towards the base. The
fielder runs to catch the ball. He catches the ball, throws the ball toward the base
and the fielder at the base catches the ball.
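
A hedged sketch of the iterative, chicken-and-egg loop just described: alternate between assigning action labels to tracks under the current storyline model and re-estimating the model from those assignments, stopping when the model no longer changes. The step functions are passed in as placeholders; they are assumptions standing in for the actual system's components.

    def learn_storyline(videos, captions, init_model, assign_actions, reestimate,
                        max_iters=20):
        """init_model: captions -> rough initial model (with visual grounding);
        assign_actions: (video, model) -> action assignments for that video's tracks;
        reestimate: (assignments, captions) -> updated storyline model."""
        model = init_model(captions)
        for _ in range(max_iters):
            assignments = [assign_actions(video, model) for video in videos]
            new_model = reestimate(assignments, captions)
            if new_model == model:      # converged: the storyline model stopped changing
                break
            model = new_model
        return model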
And the assignment of actions to the tracks is shown in color. So, for example,
pitching is blue, hitting is cyan, and then catching would be orange, and the
throw is the red track, and so on.
And because we are learning this storyline model from text, we can even
generate an automatic text caption for the video. So this is generated
automatically from our approach: the pitcher pitches the ball before the batter is
going to hit, and the batter will hit, and so on. And the currently active part of the
text, what is happening right now at this point of time in the video, is shown
in red.
All right. To summarize I argued that we can bring back the flavor of global
scene understanding using relationships. I first talked about how we can use
physical relationships to build a physical representation of the scene. I then
talked about how we can use functional relationships to build human centric
scene understanding. And finally I talked about how we can use causal
relationships to build a storyline model for the domain.
Now, in my quest to get this complete scene understanding, as part of our future work
I want to look into intentions. For example, to really understand this cartoon, I need to
understand the intention of the green actor, that the green person is trying to
steal the fish caught by the red person.
But understanding intentions requires modeling all the relationships together.
For example, in this case, I need to model the physical relationship that the cliff is
higher than the ledge. I need to model the functional relationship that fishing rods
are used to fish, but since it is not touching the water, probably it's not being
used to fish here. I need to model the causal relationship that even though this guy is
catching fish, the number of fish in his box doesn't increase.
And once I can model all three relationships, I can then attempt to model
intentions. So as part of my future work, I want to investigate how we can
combine these three relationships and build a grand system which understands
all these relationships together. For example, I can think about building causal
relationships on top of the physical representation. So I can ask a question: why did
this guy pass the ball to that guy on the left? Well, because he saw a lot of free
space between them, so the pass could not have been
intercepted by any other player. And so he passed the ball to the guy on the left.
So now you can see how building causal relationships on physical
representations helps us really understand scenes better. But it comes with a
challenge. The errors from the physical layer should not propagate to the
causal layer. In fact, the causal layer should feed back and improve the
physical layer.
As part of my future work, I also want to look into how we can use recognition
itself to build better relationships. So here I'm trying to model recognition as an
association problem. Now, this is the work of Tomasz Malisiewicz, who is also a
student at CMU, and I am helping co-advise him. So here what we are trying to
do is -- the idea is very simple. If you see a bus in the image, you associate it
with a bus you have seen in the past and because you know about the attributes
of the bus in the past, for example the 3D model or its geometric structure, you
can just transfer the 3D structure from here to this bus. And you can now have a
better understanding of the scene.
Here's another example where this chair gets associated with this chair I've seen
in the past and because I know the 3D model of this chair, I can now transfer the
3D model here.
Now, another relationship I'm interested in looking at is the relationship between
language and vision. Currently, all vision approaches learn concepts by looking
at a lot of visual data. However, we humans don't do that. For example, we
don't need to fall from the cliff or see someone falling from the cliff to understand
that you get hurt when you fall from the cliff. You just read the sentence and you
understand this concept.
So I want to look into how we can learn visual concepts by looking at the large
amount of text data on the Web.
And fortunately, I have done some work in this direction. I have looked into how
we can use the richness of language, for example prepositions, to get the
structure of spatial relationships in images. Again, this comes with a huge
challenge: we need to have correspondence between the words and
the visual world.
And finally, I would also like to look into some of the applications of my work in
robotics and graphics. For example, we now have these meaningful physical
groundings because we have a physical representation of the scene. So we can
now talk about the scene in terms of occupied voxels and support surfaces. So now
we can use this representation to do path planning or action
planning. In fact, we can even have a robot centric scene understanding instead
of this human centric scene understanding.
So in that case, it would be perfectly fine for a snake robot to crawl under a table and
reach its destination.
Similarly, in graphics, the way we currently create animations is by hand-marking
the constraints and then using motion graphs to follow those constraints.
But we can automatically create constraints using this physical representation of
the scene. And we can combine this physical representation with the actions and
the motions to create movies from still images.
All right. To finally conclude, we have reached a very exciting phase in computer
vision. After decades of hard work, vision is finally coming of age. We are now
finally starting to see lots of success in computer vision. We have so many
startups and so many things happening right now.
And it is kind of inevitable that we'll have many more successes to come. We'll
have a lot more success stories. We'll have a lot more advancements. And I hope
our work is part of that advancement.
So I would like to acknowledge all the people I have collaborated with, without
whom this would not have been possible. And thank you.
[applause].
>> Abhinav Gupta: Yes?
>>: So I think it's great, you know, sort of getting back to the roots, sort of saying
okay, we're going [inaudible] scene understanding. A lot of great pieces of the
relationships. Of course, probably the reason people stepped back
from these global problems was that it turned out that they all depend on
solving a bunch of lower-level problems. And everybody's been running around
trying to solve these low-level problems.
So I'm trying to look at your results and then come back to the fact that the initial
thing [inaudible] that party scene at the beginning [inaudible] said about that and
then the initial thing that you know we can't even find a chair. And so there's
some big gap here. I'm kind of wondering whether that gap is being addressed
by either the people working on the low-level stuff or your work on your high-level
stuff.
I can't see how you can really get good results without waiting until -- does that
make sense?
>> Abhinav Gupta: So the question makes sense, but the answer is not small, so
it might take a little while, like two minutes or so.
But what you are saying is perfectly correct, that we started with this big global
problem. We left it because the low level was not working. And then we spent
almost three decades on these low-level problems of pattern recognition and stuff.
So what my whole line of research is kind of hinting at is
that we have now started to make enough progress on these low-level problems.
That's what the first part of the presentation was saying,
that we have now seen success stories in these pattern matching, pattern
recognition problems.
So probably now is a good time to go back to these ideas. They were
not working initially because the low level was not working, but at the end of
the day we need both the low level and the high level to finish the task. The
high-level guys cannot just wait for the low-level guys to finish the task and
only then start working. And the low-level guys themselves cannot finish
the task until they get top-down information from the high-level guys. We have
to collaborate.
So what my work is kind of trying to do is take the advancements in
the low level and see where we can get with these old ideas, with this functional
understanding and all these things, where we can get with the current low-level
approaches. And at some point of time, and I think now is probably a good
time, whatever partial information I get, I should feed it back
to the low-level guys. Say, see, this is what I partially understand based on the
first thing you gave me; can you use this information to improve? In
fact, I didn't get time because I was already at like 55 minutes, but the
segmentation thing, the two buildings being combined, that's what I was claiming,
that none of the low-level approaches can do that right now. And the only way
we could do it is because of the partial interpretation. We pass it to our low level and
we try to resegment the image again based on this partial information.
We tried to merge two regions, we tried to divide into multiple regions. So I didn't
get into this part of my approach, how you actually use the high
level to improve the low level. And we finally got these two segments together.
And, in fact, this was not like just one case where it happened. Okay?
So here is another case. This tree is occluding the building again. Again, two
segments. And it combines them into one building. The only reason this is working
is because we have this high-level volumetric relationship which feeds into the
segmentation and says, I don't like this as one segment and this as one segment,
it just didn't work for me; can you do better? So then the low level says, okay,
we tried to combine these two, does that give you a better one? Are you happy
with that? So that kind of thing.
So we have to kind of interface with each other. And that is when I think we'll
have the vision problem solved. I mean, otherwise the
low-level people will say the best we can do is 40 percent, and we'll say, okay, we
cannot start working at 40 percent. We have to kind of feed each other. We
have to work together. That's where I think the fun lies right now.
>>: I think looking into this is a lot of fun. I think it does have a lot of practical
uses like improving accuracies. But one thing I do find unsatisfying is the fact
that relationships typically need to be hand coded. But you come up with your
own vocabulary, or you say how people sit on couches, or you say how blocks can
be arranged in the world. These things aren't learned automatically. Is there any
way that you can learn them?
>> Abhinav Gupta: I'm glad that you asked this question because I wanted
someone to really ask this question, because this is a kind of a future work thing.
So first of all, I need to clarify that whatever I am hand coding is kind of the
nativist idea. What I'm hand coding is a representation of space. I'm just saying
that space means something like this. I don't want to hand encode rules. For
example, this rule of sitting, what you said. The only reason I have to hand code
or manually annotate these support surfaces is because right now I don't have
those annotations.
So as part of future work, what we are doing is, when we are taking this data from
the motion capture system, you can now have force plates at the joints. So when
you sit, you immediately see a bump in the force. You get an automatic
annotation, an automatic rule, that you need support here. So now I don't need to
hand code it anymore. So this is like one example I'm giving you. There are
multiple ways.
But the first part of the presentation, the physical representation, that was just
trying to encode how we would need to represent space. And that is kind of
hinting at the philosophical point of view, or the psychological point of view.
That's hinting at the nativist idea that you cannot start [inaudible] -- it's going to
be a little problem. Even in the case of humans, when you move your hand,
babies follow your hand. The reason is they come with
some representation of space. They know that if I'm moving my hand towards
there, they have to look to the right to follow my hand. Things like that.
So I'm trying to model this nativist idea of representation. Once you have this
representation of space and time, you can learn these relationships automatically,
for example using force plates, or you can have a robotic system which
physically interacts with the world. Okay, this cup is not -- where it can try to
say, okay, this falls, there's some kind of thing going on.
So it plays with the world, it physically interacts with the world and tries to learn
these rules itself. But we still need our basic representation. In some sense,
what I'm trying to hint at is that the basic representation is
what we are focusing on right now.
>>: So I was kind of [inaudible] ask -- if you had a 3D reconstruction of the scene,
then many aspects of the scene understanding problem become easier, because
now you don't have to infer it anymore. A lot of the blocks world, the concept is --
if you had some way to get 3D shape, surfaces, voxels, then a lot of the inference
problem would become easier. And so why is that not an approach that -- are you
considering that approach and are [inaudible].
>> Abhinav Gupta: So that is actually a very good question. And I think a similar
question was raised by Larry when I was talking to him, that you can even use
videos. I mean, who stops you from using single images and stuff like that.
So my answer is that's perfectly fine. I mean, if you get more
information which helps you, it's a perfectly good thing to use. Now, the question
is why am I not using it? There is a basic thing here.
So again, as I was trying to say, I want to focus on these
representation issues. And what I believe, and I guess it's kind of true, is that the
representation problem can be studied using a single image.
How do you represent the world? The representation problem remains the
same. I've just isolated some part of it. And by
isolating it, I've made myself more focused on that representation
problem.
And again, with the same representation, you can now move to video or stereo.
You now have better 3D. You now have more constraints to put in the cost function
from the 3D world, that these are the stereo relationships it should satisfy.
But the volumetric representation, for example, how do you -- that's where I
am focusing. And I think that representation problem is the same in video, stereo,
and single images. It's basically the representation problem of the whole of
vision; it's not a representation problem of single images versus other things.
But I completely agree, from a practical point of view, if you
have stereo, you should definitely use it. It will really help the system do
much better.
>>: I like this work. Something that's characteristic of some of the work you did,
which is, like you said, instead of going to true semantics, which is what Gibson is
arguing, right, semantics or whatever, or ecological imperatives, right, is
another word he used, ecological --
>> Abhinav Gupta: Yeah, he used that.
>>: You know, you focus on something that is somewhat more tractable, like
physics, right? So then rather than saying what is the person trying to achieve as
an ecological being, you say what do the physics of mass and support say,
right?
But of course as you mentioned, the semantics on the cultural things or
whatever, the behaviors that we engage in, right, for buildings and physics is the
sensible thing. For humans [inaudible] behaviors, right? So you did have a little
small section on how motion and the recognition interact with each other.
>> Abhinav Gupta: Right. Right.
>>: But it seems like there's so much more opportunity. Like, for example,
analyzing this sitting situation, right, or that part -- just coming back to the very
first picture you showed us, right, the party. I mean, we can immediately say
there's a group of people here talking to each other, and we know that in that
situation the statistical priors are that they face each other, they stand back,
right? Whereas with the people watching the TV, there's a natural tendency to face
the television, right? So all kinds of things like that. So how would you kind of
layer in behavioral stuff and also the statistics you can gather from let's say
observing people behave?
>> Abhinav Gupta: So whenever I'm presenting this stuff, I'm presenting it from
an extreme point of view, in the sense that I do not try to encode recognition at
this point of time. But I still agree, and that is what one of the future works
I mentioned is, that recognition has to come into the system. We cannot stay
away from it -- for example, even in this case, the simplest example I was talking to you
about, we still have to recognize that these are people and these two are from
the same team. Only then can he pass things to them.
So I completely agree that recognition has to come into the system, and that is
what we are kind of focusing on right now, in that what we are formulating here
is recognition as an association problem. See, the problem with
recognition right now, which I believe is with the current recognition approaches,
is their strong marriage with linguistics.
For example, I want to model all the chairs in the world as one class called
chairs. That is going to be a really, really difficult problem, because chairs can
vary a lot in this world. I mean, for example, this becomes a chair if I start to sit
here. So this definition of a chair is just so varied. And the
problem is this marriage with the linguistic world. That is why I
am focusing on this kind of association, where there's no match with the linguistic
world. We associate with something we have seen in the past, and that's
how we transfer information. Similarly, in the kind of statistical
correlation work I did, at that point of time I did model the
linguistic kind of semantic categories and then modeled the statistics. But I
think we need to somehow model it non-linguistically. But linguistics should help.
And that is what the last part said: I don't want
to model classes linguistically, but I still want to get the structure of the world
linguistically, because there is a lot of information in text data as well. It's a little
complicated argument, but the argument is that you don't want to make your
classes based on chairs and this linguistic world. But you can still get
structure from the linguistic world because there's so much linguistic data out
there.
So I want to capture the structure through linguistic data, but I want my
recognition to be separate from this linguistic concept. But I agree it has
to be closely married with recognition. For example,
Derek's work was kind of that -- it was trying to put all this kind of 3D
understanding into recognition, how can it help recognition itself? And I think,
yes, that should be one of the things I should be looking at
pretty soon. I mean, because I think this amount of 3D data, the amount of
functional information, the amount of causal information can really help us in doing
much, much better recognition right now. And that is what I was actually
kind of pointing at with the human centric scene understanding as well; for example,
you can do much better cup recognition by having this idea of the functional world.
Because you should only look for --
>>: [inaudible].
>> Abhinav Gupta: Yeah, yeah. So you should only look
for things in the reachable areas. You could have a much better predictor. What
I didn't focus on is that you can even have much better action recognition here. I didn't
point it out because it's getting to be too much. But if you
know where a person can lie down in the scene, you know where to look for lying
down actions. So you have much better action
recognition itself. So these kinds of things are there, and they have to kind
of come together.
So I think, as Michael pointed out and as this kind of question brings out, we
will not be able to solve vision with extreme views. We all have to come together
and everything has to come together. Vision is such an extremely difficult
problem that everything has to come together and click at the proper position, and
then I think everything will be solved. But this is one extreme view of where
you can get if you model these physical relationships and functional
relationships from an ecological point of view, how things would look. And
that point of view is an extreme academic point of view, I guess.
>> Larry Zitnick: Let's thank our speaker again.
[applause]