
>> Larry Zitnick: It is my pleasure to introduce Ali Farhadi to Microsoft this morning.
He is from UIUC under the advisor of David Forsyth and his early work was mostly
centered on activity recognition and then more recently he's done a lot of really great
work on attributes and high level semantic scene understanding. So I think he's going to
talk about that mainly today so I will hand it over.
>> Ali Farhadi: Thank you. First of all thanks a lot for having me here. It's a great
pleasure. I'm going to talk about better recognition with richer representations. As it
sounds, I'm basically going to be focusing on a representational approach to recognition,
and what we can do with these representations. First, why am I interested in recognition?
Obviously it's intellectually a very, very challenging problem. It's a core, fundamental
computer vision problem, of course. And it provides a lot of deep insight into other
fundamental problems as well, I would say even human vision, for example. And if you
approach recognition correctly, it provides a lot of applications.
I don't need to explain the applications. Surveillance, robotics, image search, all of those
things come as applications of object recognition, assuming that you have approached this
problem correctly. The way that we currently do recognition in the computer vision
community is that we come up with a list of objects that we want to recognize. We sit
down and write a bunch of names: car, bicycle, I don't know, motorbikes, people. Then we
gather positive and negative examples for those, and then we build models to learn those
category models. And when we want to actually use those models, when an image comes
in, we give it to all of our detectors, run them over the image, and we're just crossing our
fingers that one of those detectors is going to get excited about this picture and say, hey,
there's a car in here; I have seen cars before and it's basically right there.
Most of, or almost all of, the focus in the object recognition community is on improving
these numbers on these tasks. How can I build a better car detector? How can I build a
robust person detector, and so on? So we are pushing the bar each year; each year you're
going to see new benchmarks and new numbers on those benchmarks, and we are
basically making great progress as a recognition community on this task. But we
basically never step back and think about what we are going to do with this recognition
system that we are building. What happens if I show you a picture like this, and I
actually run all of my detectors over pictures like this? Do we actually get anything out
of it? The best outcome that I can get would be, gee, I don't know what this is, because I
haven't seen it before.
But on the other hand, if I ask you as a human, so what is this picture? You can provide
me tons of useful information for this picture. You can say, I don't know what this is, but
I am sure it's a vehicle. It has wheels; I can see the wheels. I've seen wheels on other
vehicles, so I know this is a wheel. And since it has wheels, and a wheel is a round
thing, it probably moves on the road. And looking at its size, it probably
works with [inaudible] power. So you can infer tons of useful information. You can say
it's probably a new and modern vehicle. If I have to guess, it's probably pricey. So, one
of my goals actually is to provide such a capability for recognition systems. Without
knowing what this is, humans can recognize it and localize it, localize the wheel, localize
the windshield and provide tons of useful information for this.
So part of this talk will be focused on how we can achieve such a capability with our
recognition systems. This first part is basically about what the interesting things in the
image are. The other half, which we are also interested in, is about what I'm going to
report as the output of my description for an image. Do I want to list everything together
and say this is this, this is this, and this is this? Or do I want to be smarter than that?
The way that we describe images in the current recognition paradigm is, I'm going to get
all of my detectors running all over the image and that will be the description of the
image: a list of words. Do we really want to do this? Is that how we, people, describe
images? If I show you this picture and ask you to describe this
picture, what would you say? You would probably select some of those objects. You
probably don't talk about the flower back there, the pen down here, the car back there, so
you select some of the objects. And then you put them sort of in relationships, in
different forms, maybe in the form of a phrase, or maybe in the form of a sentence.
So the other part of this talk will be focused on how can I provide such a description for
an image instead of listing a bunch of words? How can I select some of those things and
put them into relationships in terms of events, actions, functions, scenes and stuff like
that? So basically my talk today will be concerned with three big topics: attributes as a
way of generalizing across categories, being able to talk about unfamiliar objects,
describing and localizing unfamiliar objects. Then I'm going to go to richer
descriptions of images: how can I actually predict sentences for pictures instead of a
bunch of words? And then at the end I'm going to talk about visual phrases, which I am
really excited about, which is a more sophisticated way of predicting sentences. On the
attribute side, I am going to talk about basically describing
and localizing objects. First I'm going to start with describing objects, and then I am
going to switch to the localizing part.
So my goal with this attribute-based representation is mainly to shift the goal of the
recognition community from predicting a single name for an image to learning to
describe objects. Basically, my goal is to shift the point of view from pure naming to
description: I want to be able to describe things instead of naming things or assigning a
single word to them. And the procedure is a very simple procedure. An image comes in.
There is a little bit of technical detail that I'm going to skip for the sake of time, which
basically tries to de-correlate the attribute predictors and so on, but at the end of the
day I am going to have attribute classifiers. I am going to run them over an image,
and that provides me a description of the object.
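To make this concrete, here is a minimal sketch, not the actual system, of describing an object from attribute classifier outputs and, as discussed next, naming it by matching the description against category attribute signatures. The attribute names, signatures, and scores are invented for illustration.

```python
import numpy as np

# Minimal sketch: describe an object by its predicted attributes, then
# optionally name it by matching the description to category signatures.
# Attribute names, categories, and scores below are illustrative only.

ATTRIBUTES = ["has_wheel", "is_metallic", "has_wing", "has_leg", "is_furry"]

# Known category "signatures": which attributes we expect each category to have.
CATEGORY_SIGNATURES = {
    "airplane": np.array([1, 1, 1, 0, 0], dtype=float),
    "car":      np.array([1, 1, 0, 0, 0], dtype=float),
    "dog":      np.array([0, 0, 0, 1, 1], dtype=float),
}

def describe(attribute_scores, threshold=0.5):
    """Return the list of attributes whose classifier score exceeds a threshold."""
    return [a for a, s in zip(ATTRIBUTES, attribute_scores) if s > threshold]

def name(attribute_scores):
    """Name the object as the category whose signature is closest to the scores."""
    return min(CATEGORY_SIGNATURES,
               key=lambda c: np.linalg.norm(CATEGORY_SIGNATURES[c] - attribute_scores))

# Pretend these came from attribute classifiers run over an image window.
scores = np.array([0.9, 0.8, 0.1, 0.05, 0.0])
print("description:", describe(scores))   # e.g. ['has_wheel', 'is_metallic']
print("best category:", name(scores))     # e.g. 'car'
```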
And if I am interested in naming the object, I basically match a category to this
description and then find the category. And this description can be semantic attributes or
discriminative attributes, so both of them. And I'm going to show that if you adopt this
approach you can do amazing new stuff that you could never have done before in
recognition. For example, not only can you name things like before (this is an airplane, I
know), you can describe new objects, novel objects, things that you have never observed before.
We've never observed carriages before. Despite that fact, we can say this thing has
wheels, it is made of wood, and a lot of different descriptions. Another example: we
never observed buildings in our training, but despite that we can say that these things are
3-D boxy things which are vertical, and they have rows of windows in them. We have never
observed centaurs. Despite that, we can say it has a head, it has legs, and we think it has a
saddle for some reason. But we can do more than that. We can also report what is
atypical for known categories. Assume we know birds. We know the properties of
birds; we know birds should have a head and a beak. If you show a picture like this to
the system, it says, I know it's a bird, but the head and beak are missing, and that is
something that is not usual and is worth reporting.
I know motorbikes. But I don't expect to see cloth on a motorbike. If I see that, I can
actually report it. And here are some examples. Basically we expect to see, for example,
jet engines on an airplane. If you don't see a jet engine, you can report it. If you don't see
a sail on a boat, the system thinks it's an unusual boat, and it can report it as a
boat with no sail, or the other way around. We don't expect to see faces on buses. And
once we see it, we can actually report it. And sometimes we think there is a horn on a
bike because the handlebars look like a horn. And we think it's actually something
suspicious.
And there's even more than that. The attribute framework provides you with such new
functional capabilities that you can deal with all these things. And more than that, you
can also learn new categories from much fewer training examples, or in the extreme case,
with zero visual training examples. Like, assume you have never observed goats before,
and I want to explain goats to you. I want to say a goat is an animal that has four legs.
It has wool. It has horns. And basically I will list everything for you. So you have some
sort of idea about goats. You are not the best goat expert in the world, but you have some
sort of idea of what a goat is, from this pure textual description of the goat category. So
our system provides such a capability for you, to be able to learn new categories from
fewer or no visual examples. In this plot, the dotted black line is the attribute framework,
the blue line is standard recognition, this is the accuracy that we get, and this is the
number of training examples.
One of the points that I am trying to make is that you can actually get to sort of the same
accuracy with way fewer training examples, meaning that you can learn new categories
much faster, with way fewer training examples, by adopting this attribute representation.
And the interesting part is what happens if I have zero training examples and just a few
textual descriptions of the category. Here is chance, and here's where we're standing with
the attribute representation. We are still not up here, because we need basically forty
examples to get there. But with zero visual examples we can actually get up to here.
We have some idea of what a goat is without knowing it perfectly,
without being a perfect goat expert or having a good goat detector.
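A minimal sketch of the zero-visual-example idea: a new category defined only by a textual attribute list, classified by matching predicted attributes to signatures. All names and numbers here are illustrative, not from the actual system.

```python
import numpy as np

# Minimal zero-shot sketch: a new category ("goat") is defined only by a textual
# list of attributes, with no training images. An image is assigned to it if its
# predicted attribute vector matches that textual signature better than the
# signatures of the known categories. Names and numbers are illustrative.

ATTRIBUTES = ["has_four_legs", "has_horns", "has_wool", "has_wheel"]

signatures = {
    "goat": np.array([1, 1, 1, 0], dtype=float),   # from text only, zero images
    "car":  np.array([0, 0, 0, 1], dtype=float),   # learned as usual
    "dog":  np.array([1, 0, 0, 0], dtype=float),
}

def classify(predicted_attributes):
    # Pick the category whose attribute signature is closest to the prediction.
    return min(signatures,
               key=lambda c: np.linalg.norm(signatures[c] - predicted_attributes))

# Attribute classifier outputs for some test image (hypothetical).
pred = np.array([0.9, 0.7, 0.8, 0.1])
print(classify(pred))   # -> 'goat', even though no goat image was ever seen
```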
With that, I'm going to close the description part. So far we have just talked about how
to describe objects in terms of attributes. One other thing that people can do is localize
unfamiliar objects: being able to say, I know there is an object here; it's probably a
vehicle; here is its wheel; here is its windshield. We people can do that. Can we build a
recognition system that can actually do the same? And the answer is yes. Again, we are
going to adopt the attribute-driven recognition system. We are going to build detectors,
not only for basic-level categories as we usually do, but also for superordinate categories,
for parts, for attributes, for everything. So we are going to have a pool of detectors, and
when a new image comes in, I am going to run all of my detectors over the image. Each
of them is going to have some opinion about where the objects are, and then I'm going to
use all of those detections to reason about the location and the properties of the objects.
So the procedure looks like this: an image comes in, and I get all of my detectors that I
learned. Actually, we built a data set called [inaudible], which has examples with
detailed annotations of all of the parts for vehicles and animals, so we use that data set to
build the detectors. We get all of our detectors and run them over the image. Then we
have machinery that looks at all of those predictions and decides where the object is.
And then we're going to use this localization information, together with all of the other
detection results, to describe the objects in terms of attributes, spatial behavior, and
functions, like this is carnivorous, or this creature can jump, all that information.
First, how do we localize? So basically it's a very simple thing and very fast. Each of the
detections has an opinion about where the object is. So the ear detector has an opinion
about where the dog is, the nose detector, the same. The dog detector itself has an
opinion about where the dog is. The mammal detector has an opinion about where the
dog is, and the animal detector as well. So we gather all of those votes for the location of
the object, then we cluster that space and we pick the center of the most populated cluster
as the response for where the object is. It's very simple and it's very fast. And once we
localize it, we can describe the object. The way we describe the object is that we have
two different types of attributes. There are attributes for which we have direct visual
evidence, for example whether something is a dog, or whether something is a mammal.
So for parts, for basic-level categories and for superordinate categories, we can actually
build a detector. So for those things we have a detector.
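Going back to the localization step, here is a minimal sketch of the voting-and-clustering idea just described, with a simple greedy radius-based clustering standing in for whatever clustering the real system uses; the votes below are invented.

```python
import numpy as np

# Sketch of the voting idea: every detection (ear, nose, dog, mammal, ...) casts a
# vote for where the object is; the votes are clustered and the most populated
# cluster wins. The clustering here is a simple greedy stand-in.

def cluster_votes(votes, radius=30.0):
    """Greedily group 2-D votes that fall within `radius` of a cluster mean."""
    clusters = []  # each cluster: {"mean": ..., "members": [...]}
    for v in votes:
        for c in clusters:
            if np.linalg.norm(v - c["mean"]) < radius:
                c["members"].append(v)
                c["mean"] = np.mean(c["members"], axis=0)
                break
        else:
            clusters.append({"mean": v.copy(), "members": [v]})
    return clusters

def localize(votes):
    clusters = cluster_votes(np.asarray(votes, dtype=float))
    best = max(clusters, key=lambda c: len(c["members"]))  # most populated cluster
    return best["mean"]

# Votes from an ear detector, nose detector, dog detector, mammal detector, plus an outlier.
votes = [(102, 200), (110, 195), (98, 204), (105, 199), (400, 50)]
print("object center estimate:", localize(votes))
```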
We have other types of attributes for which we either don't know how to build a detector
or it is very hard to build one. For example, whether an object has the potential of
having a leg, no matter if the leg is visible or not; this isn't about whether the leg is
visible, it talks about the potential of the object having a leg. Also the functions, this
creature can jump, can run fast, is carnivorous, and also the aspectual information, it is
lying down and facing toward the camera. The way we're going to approach this is we
are going to learn all of those things with a simple [inaudible]. What it does is basically
look at the correlations among the things that we have detectors for and infer the others.
And we solve this with simple EM. Inference is exact and very fast.
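As a toy stand-in for this inference step, not the talk's actual EM model, a plain logistic regression over detector responses can play the role of the learned correlations; the data below is synthetic and the attribute "can jump" is just an example.

```python
import numpy as np

# Toy stand-in for inferring attributes with no direct visual evidence
# (e.g. "can jump") from the pattern of responses of the detectors we do have
# (leg, head, wheel, ...). The real system uses a simple latent model solved with
# EM; here a logistic regression trained on supervised examples plays that role.

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Columns: [leg score, head score, wheel score]; label: can_jump (synthetic rule).
X = rng.random((200, 3))
y = ((X[:, 0] > 0.6) & (X[:, 2] < 0.4)).astype(float)  # "legs and no wheels -> can jump"

# A few steps of plain gradient descent on the logistic loss.
w, b = np.zeros(3), 0.0
for _ in range(500):
    p = sigmoid(X @ w + b)
    w -= 0.5 * (X.T @ (p - y)) / len(y)
    b -= 0.5 * np.mean(p - y)

new_responses = np.array([0.9, 0.8, 0.1])   # strong leg/head response, weak wheel
print("P(can jump) =", sigmoid(new_responses @ w + b))
```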
So the gist of the idea is that if I show you a box and tell you there is a head on this side
and a tail on that side, and ask you which way the animal is facing, you can say it's facing
that way. So basically we're going to look at all of the predictions of all the things that
we have, and the correlation between them to infer things for which we don't have any
direct visual evidence. So to see how good this is working…
>>: The things that we don't have visual evidence for, can they in any way help the
detection of the things that we do have evidence for? Can they help weed out false…
>> Ali Farhadi: No. Actually, we are going to use those as well and marginalize over
them.
>>: So they all help each other?
>> Ali Farhadi: Yes. They all have attributes. So to test this, what we do is we actually
take the data set and divide it into two sets, familiar categories and unfamiliar
categories. The unfamiliar categories are the things that we are not going to observe at all
during training, and we're going to see whether we can generalize to those unfamiliar
categories or not. The first question is can I basically learn a wheel model, let's say, on
the trucks and pickups and cars and expect this to actually work on motorbikes? And the
answer is actually yes. The detectors that we have are actually really powerful; with some
loss of accuracy, you can transfer them across categories. Here's an example. This is
learning a leg detector on one category and testing it on the same category, not the same
instance but other instances of the same category, and this is the ROC for that leg detector.
This is what happens if you train it on some categories and test it on other categories, and
these are examples for leg, horn, wing, head, eye and so on. So the gist of the idea is yes,
they generalize. They are not as good, but they are good enough that we can work with them.
So here is the sort of thing that we can do. The elephant is an example of a familiar
category. We can localize it as before, as a standard recognition system would, but we
can do much more than that. Because we can say, hey, this is an animal. It is a four-legged
mammal. I know it is an elephant because I have seen it before. Here is the leg, here is
the foot, here is the trunk. All of those come with localization information.
And things actually get more interesting when we are shown unfamiliar things. We have
never observed cats before. Despite that fact, we can actually localize the cat as an animal.
Green means animal; red means vehicle in these pictures. And we can say here is the
head, here is the leg. We think it has a hump. And it's a mammal. We have never
observed jet skis before. Despite that, we can actually localize them as a watercraft. We
have never observed buses before; not a single example of a bus is in our training set.
Despite that, we can actually localize the bus. I can say, I don't know what this is, but
whatever its name is, it is a wheeled vehicle. Here is the wheel; here is the license plate.
And if you are interested in numbers, these are the quantitative results on how well we
do on the data set. The dotted lines are basically traditional recognition: what happens if
you just focus on basic localization, versus the attribute-centric recognition. The red
curve is for familiar objects. This is what you get if you adopt attribute-based
recognition: even for familiar cases, there is this much gain from the attribute-based
representation compared to only doing the standard thing, which is naming, even for
localization of familiar things. There is also a gain for unfamiliar objects, and a gain for
both together.
>>: The main reason for this boost, is it because in the standard method you are using
less training data, whereas here you're actually generalizing across categories so the
attributes essentially have more training data, or is it that the attributes are a more
natural representation that can be more…
>> Ali Farhadi: It's actually both of them. Because you may miss a car, but you may not
miss a wheel, and then the wheel can actually help boost your car detection in the model,
basically in the voting system. So both of the hypotheses are sort of correct: we get gains
both because we have more models and because those models actually talk to each other.
So far, what I've talked about, these are actual results of our system. We never observed
horses or carriages in our training at all, not a single instance of them. Despite that fact,
we can say I know there is a vehicle in here. Here is a wheel, and this vehicle, whatever
it is, moves on the road and is facing to the right. I know there is an animal here which is
probably a four-legged mammal. Here is the head, here is a leg. This creature can run,
can jump, is herbivorous and is facing right. So you can do this type of inference for an
image by adopting this attribute-based representation.
At this point I am going to shift gears. After talking about what to predict for an image, I
will talk about what to say about an image. Do I really want to list everything? Or do I
want to provide a concise description of an image?
>>: Ali, can you back up a bit? It's going by very, very fast for me, so. Typically we
have time in these talks. So I can see that you are detecting the leg, but it's a total
mystery how, and by what part of the system, you are saying it is herbivorous or can jump or
something like that.
>> Ali Farhadi: For that I have to get back to the model. So basically this comes from
this node that predicts the function of the objects by looking at -- for example, if I want to
predict whether something can jump or not, I have supervision for all of those things. And
what I can do is learn that if something has a long leg, has four of them and probably has
a little belly, it can jump. So we are going to learn those things through the correlations
with the things for which I have a detector. I am going to have a detector for leg, for all
of the parts and attributes of the objects, and what I can do is infer things by looking at
the patterns of the responses of those detectors.
>>: So the notion of a long leg is different than the notion of a short leg?
>> Ali Farhadi: No, no. There is no notion of actually long versus short leg in this
framework. And if you actually look at the results for things that can jump, for example,
we really think that elephants can jump, which they probably can’t, because we don't
have the notion of weight with respect to leg length, with respect to muscle power and
that stuff. So it does make mistakes, because this model is very basic in the way it models
these correlations, or the attributes are not detailed enough to infer all of those things.
The reason that we can actually predict some of those things, for example something
being carnivorous or herbivorous, is probably just some coincidence of the correlations
between the things that we have. So for some of them, the correlation is basically the
only thing that helps us infer that.
>>: So for example the cart and horse, were there other predictions that it determined
that were not correct?
>> Ali Farhadi: Sure. I have examples; I showed you a wrong prediction, for example.
There is a hump over here; there are two trunks over here. So there are many wrong
predictions over the images. This is a specific one that we used as the icon of the paper.
This is actually a perfect example with no mistakes, but mistakes do happen.
>>: So, facing right for the carriage: if you've never seen a carriage before, how would
you infer that it is facing right?
>> Ali Farhadi: So basically you can infer it from facing right cars, let's say.
>>: Yes, so has no [inaudible] all wheels in front… You see the person on top, you see
his face…
>> Ali Farhadi: We don't do that actually. We only look at information that relates to
this vehicle. We don't infer from the person, right? But we can actually-- this is to show
that if you know how cars are going to be facing or how buses are going to be facing, right?
You can actually transfer that knowledge to carriages.
>>: So does that come from the spatial relationship between the wheels and the body?
>> Ali Farhadi: Yes.
>>: I was surprised that none of the attributes have, you know, parent-child relationships
with the functions. I kind of expected to see that if a part is there, then that would affect
the conditional possibilities of the things that could happen. So they are only
communicating through the parent node [inaudible].
>> Ali Farhadi: Actually, I agree that you could do that; it makes the model more
sophisticated and probably works better. The reason that we didn't do it is that we
wanted to have a very fast inference model rather than exact things. We wanted to be
fast, and we didn't want to get into the details of how we could boost those things. But I
completely agree with you: if you wanted to add layers to these root models, you could
do better than just a single layer off the root.
>>: With only the parent node for communication, I mean, that greatly limits how much
those nodes can actually coordinate back and forth.
>> Ali Farhadi: Of course. I agree.
>>: So what is the probability that this [inaudible]?
>> Ali Farhadi: For the detector nodes we basically use multinomial distributions. We
consider the other nodes as multinomials as well, and then we marginalize over
everything. So it's basically a very simple EM-style latent model.
So I am going to go to the second part, which is basically how can I provide these
different types of descriptions for images. We think sentences would be the right
description for images. If I ask people, people tend to select some of the objects and put
them in some sort of relationship. We believe that sentences are the right representation
for images. Why? Because they provide us with the capability to talk about
relationships, events, and functions, and they also implicitly tell us what is worth
mentioning. If I know how to predict a sentence for an image, I am implicitly selecting
some of the detections and then putting them in a sentence. It's an extremely challenging
task; for people familiar with recognition, it is very, very hard to predict a sentence for an
image.
If you look at the literature, there are some approaches that are trying to do this by being
explicit about the relationships: this is on top of this, this is below this, this is beside this,
and then they try to infer those properties. But there is a limit to what you can do with
those types of explicit approaches. Why? The domain of the problem is very hard; you
are going to get lost in the inference of very sophisticated models. So what we do is we
sort of adopt a non-parametric approach to this problem. By non-parametric, I mean: if I
have data sets of images and sentences in correspondence and I have a nice
representation, can I actually learn to measure the similarity of a sentence to an image?
If I can do that, then, assuming that I have a big enough data set of images and sentences
in correspondence, when a new image comes in, I find the closest sentence and report
that as a description; or a new sentence comes in, I measure the similarity and find the
best image.
So how can I build such a similarity score? We believe that there is a space of meaning
sitting in the middle, between the space of sentences and the space of images. Each point
in the space of images has a projection into the space of meaning, and each point in the
space of sentences has a projection into the space of meaning as well. So if I can learn
these two projections, then I am home. Basically, to score the similarity of an image and
a sentence, I project both of them into the space of meaning and then look at the distance
between the projections in that space, and that provides me a similarity measure.
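A minimal sketch of this meaning-space scoring idea, not the actual learned system: both sides are projected into a small shared space (here a three-element <object, action, scene> space, as described next) and compared by distance. The projection matrices are random placeholders standing in for the learned ones, and all feature vectors are made up.

```python
import numpy as np

# Sketch of the meaning-space idea: project an image representation and a sentence
# representation into a common <object, action, scene> space and score their
# similarity by the distance between the projections. The projections would
# normally be learned; here they are random placeholders.

rng = np.random.default_rng(1)

DIM_IMAGE, DIM_TEXT, DIM_MEANING = 16, 10, 3   # meaning = (object, action, scene)

W_image = rng.normal(size=(DIM_MEANING, DIM_IMAGE))   # learned in the real system
W_text = rng.normal(size=(DIM_MEANING, DIM_TEXT))     # learned in the real system

def similarity(image_feat, sentence_feat):
    """Negative distance between the two projections in meaning space."""
    return -np.linalg.norm(W_image @ image_feat - W_text @ sentence_feat)

def best_sentence(image_feat, candidate_sentences):
    """Non-parametric retrieval: pick the stored sentence closest in meaning space."""
    return max(candidate_sentences, key=lambda s: similarity(image_feat, s["feat"]))

image = rng.random(DIM_IMAGE)
sentences = [{"text": "A man stands next to a train.", "feat": rng.random(DIM_TEXT)},
             {"text": "A dog lies on a sofa.", "feat": rng.random(DIM_TEXT)}]
print(best_sentence(image, sentences)["text"])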
How do I do that? Since this is an extremely hard problem, we take a simplistic approach
to it. I am going to assume the space of meaning can be represented by an object, an
action, and a scene. I am ignoring the subject, the object properties, the [inaudible],
everything; I am just concerned with the object, the scene, and the action. And what I'm
going to do is learn this link and this projection discriminatively and jointly; basically I'm
going to set up a structured learning problem, which I am not going to talk about. If
you're interested we can talk off-line. The job of this structured learning is to rank the
correct sentence for an image higher than the rest of the sentences. And I am going to
learn the parameters over these three elements of the space of meaning. To be able to do
that we need a data set that puts sentences and images in correspondence. So we actually
built such a data set, called the UIUC sentence data set. We took examples from Pascal --
the slide shouldn't be that dark, sorry about the picture -- and asked people on Mechanical
Turk to write sentences for the images. And this is an example from the data set.
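The structured learner mentioned above should rank the correct sentence for a training image above the rest. Here is a toy sketch of that kind of ranking loss; the scoring function, margin, and feature vectors are placeholders, not the talk's actual formulation.

```python
import numpy as np

# Minimal sketch of the ranking objective: the correct sentence for an image
# should score higher than every other sentence, by a margin. `score` could be
# the meaning-space similarity from the previous sketch; here it is a stand-in.

def ranking_loss(score, image, correct_sentence, other_sentences, margin=1.0):
    """Sum of hinge penalties whenever a wrong sentence scores within the margin."""
    s_correct = score(image, correct_sentence)
    loss = 0.0
    for s in other_sentences:
        loss += max(0.0, margin - (s_correct - score(image, s)))
    return loss

# Toy scoring function over toy feature vectors, just so the sketch runs.
score = lambda img, sent: float(np.dot(img, sent))
img = np.array([1.0, 0.0, 1.0])
correct = np.array([0.9, 0.1, 0.8])
others = [np.array([0.8, 0.2, 0.7]), np.array([0.1, 0.9, 0.0])]
print("loss:", ranking_loss(score, img, correct, others))
```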
There are interesting properties in these sentences. Interestingly, people are in very great
agreement about what to say about an image. All the people are talking about the two
men here; all of them are talking about the talking; all of them are talking about the
relationship to the plane, and all of them are talking about the plane. So there is this
great agreement between people about what to say about an image. They don't talk
about the grass back there. They don't talk about the trees, the clouds, the sky, or the
jacket that this guy is wearing. So people are in great agreement about what to say about
an image. And we basically try to approach this problem implicitly, because we really
don't know how to approach it [inaudible]. It is an extremely challenging problem, and it
is actually one of the future directions I'm going to talk about.
But we approach it implicitly, in terms of how can I predict a sentence for an image;
because if I predict a sentence, I am implicitly selecting some of the things. So basically
the way that we are going to do it is, an image comes in. I am going to get all of my
detectors: all of my scene detectors, all of my object detectors and all of my action
detectors. I'm basically going to have all of them running over the image. I have a
simple CRF that connects these together, and then I am going to do an inference, and that
feeds information to the structured learner. Here is an example of what happens at the
end. Of course this is not a random example; this is a good example of the method. For
this picture we predict a man stands next to a train on a cloudy day, a backpacker stands
beside a green train, and different sentences. And remember, we are not generating the
sentences. We didn't want to get into the details of language generation. We just retrieve
them from the data set by scoring the similarity.
And surprisingly, this simple nonparametric approach works quite well for images and
sentences. We have evaluation metrics in the paper; if you are interested we can talk
about that. I am not going to bore you with evaluations at this point.
>>: Just to be absolutely clear. The sentences that you are generating from this are
whole sentences that people wrote down for other images. So you were not mixing and
matching any parts. You are just taking the entire sentences?
>> Ali Farhadi: No, no, no. Yes. And since our system is symmetric, what we can do is
go the other way around. If you give me a sentence like this -- and remember this is a
sentence written by a [inaudible] -- given that sentence, we can find images like this: a
horse being ridden within a fenced area. There are very interesting properties here. For
example, we don't have detectors for everything; we don't have a detector for fence. We
have a limited number of things for which we [inaudible] really have detectors. The way
that we are going to recognize the rest is that we have a way to incorporate distributional
semantics into our detectors. For example, I haven't seen, let's say, a beetle, and there is
a sentence that comes in: there is a beautiful yellow beetle on the street. We have never
observed a beetle and we don't have a beetle detector. But through distributional
semantics I know a beetle is very similar to a car, it is a little similar to a bus, and it is
not similar to dinosaurs.
So basically I am going to re-weight my detectors to get a rough and ready beetle
detector that I can actually use here. And incorporating those distributional semantics
actually works really well. If you are interested in how to incorporate that, we can talk
about it off-line as well.
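A toy sketch of this re-weighting idea: a "rough and ready" detector for an unseen word built as a similarity-weighted combination of existing detectors. The similarity values and detector scores are made up, not taken from the actual system.

```python
import numpy as np

# Sketch of the "rough and ready" detector idea: we have no beetle detector, but
# word-level (distributional) similarity says beetle is close to car and bus, so
# we score image windows with a similarity-weighted combination of the detectors
# we do have. All numbers are illustrative.

detectors = ["car", "bus", "dinosaur"]

# Distributional similarity of the unseen word "beetle" to each detector's word.
similarity_to_beetle = np.array([0.8, 0.5, 0.05])
weights = similarity_to_beetle / similarity_to_beetle.sum()

def beetle_score(window_scores):
    """window_scores: detector responses [car, bus, dinosaur] for one window."""
    return float(weights @ window_scores)

windows = np.array([[0.9, 0.3, 0.1],    # looks car-like
                    [0.1, 0.1, 0.7]])   # looks dinosaur-like
print([round(beetle_score(w), 3) for w in windows])  # the first window wins
```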
But there is one big problem with this approach, which Larry actually mentioned. What I
am doing at this point is scoring the whole sentence against the whole image. And if I
want to do that, I am implicitly assuming that I have a huge data set with sentences that
correspond to every possible image, which is a bit of an unreasonable assumption; I
cannot build that. And if you look at the history of machine translation, the boost in
machine translation happened when people started to work with phrases: this chunk of a
sentence in English corresponds to this chunk of a sentence in French, let's say.
So people do translation phrase by phrase. And we have a translation problem: I am
going to translate from the space of images to English text. Instead of matching the
whole sentence to the whole image, at this point I want to match a chunk of an image to
a chunk of a sentence and establish a phrasal recognition system. That gets me to the
third part of my talk, which is how can I learn complex composites like a person riding a
horse, or a rider and a horse jumping over a fence? How can I do that? Because if I can
learn those, I have basically learned a phrase detector, which is something bigger than an
object but smaller than a scene. We do all this talking about objects and scenes, and we
never thought that there might be something in the middle. These phrases are something
in the middle: they correspond to a chunk of an image, bigger than an object but smaller
than the whole image. Similar to phrasal translation, where phrases consist of a couple
of words or more but are smaller than the whole sentence. How can I do that?
Conventional wisdom in vision says that if you want to detect a person riding a horse,
you detect the person, detect the horse and then sort of put them in correspondence
somehow. An example of that would be the [inaudible] paper on how to basically learn
this is on top of that, this is beside this, and therefore infer that this is a person riding a
horse. When we do that, we implicitly make a huge assumption, which is wrong: that the
appearance of the objects is going to stay the same when they interact. But as a matter of
fact, the appearance of objects changes a lot when they participate in relationships. A
person riding a horse takes very few characteristic postures compared to a typical person.
A horse being ridden by a person is quite different from a typical horse. A person
drinking from a bottle has such a rigid, characteristic posture compared to a typical
person that we are actually suggesting that if you want to detect these things, why don't
you detect them as a whole? If you want to detect a person lying on a sofa, you're going
to have a miserable time detecting the person and detecting the sofa separately, but these
things together have such reduced visual complexity that you can actually learn them as
one entity. So we are proposing that, besides learning horse and dog, what about learning
a person riding a horse, or a dog lying on a sofa? What's wrong with this?
The first thing that comes to mind is, well, I have a combinatorial number of phrases I
can make out of words. How are you going to deal with that? Do you want to learn a
combinatorial number of phrases? If you want to do that, I'm sure you are not going to
have enough training data. And the answer is that it's very similar to what happens in
language. There is a saying that, with the number of words you know, you could talk
your whole life without repeating a phrase twice. But we people don't do that; we
actually have a small number of frequent phrases that we use a lot. And what we are
proposing is that there are a few very characteristic phrases that we can detect highly
reliably, and there is no excuse for not using them and not getting them into our
recognition systems. Our recognition systems consisted of only objects before; some
people talk about context and scenes and stuff like that, and we are actually adding these
phrases to them. Why? Because they are very characteristic: we can learn them very,
very reliably, and they are much better than learning the individual parts. How am I
going to do that? Basically we built a data set for phrasal recognition. It consists of basic
objects from Pascal, objects that we really know how to deal with; they are famous
objects from Pascal. And then we basically make all the phrases that we can think of out
of those objects. There are eight objects, 17 visual phrases, almost 3000 images, and we
draw bounding boxes for all of the phrases and all of the objects in them.
Look at the people drinking from bottles, for example. Look at the people riding horses.
There is this great reduction in the visual complexity of the objects, so what I can do is
learn to detect those phrases directly. And surprisingly, I don't actually need many
examples to do that, because it is such an easy detection problem; it's very characteristic.
For example, if I show you this picture and ask you what it corresponds to, you can all
say it is a person riding a horse. What is the model for this?
>>: A person riding a bicycle.
>> Ali Farhadi: Exactly. So it is that simple; the behaviors are so characteristic that we
can easily, easily learn detectors for all of those things.
>>: I have a question about the 17 phrases. So when you say 17 phrases, you mean 17
pieces that can recombine into other phrases or do you mean 17 phrases like a person
drinking from a bottle?
>> Ali Farhadi: The second one. So a person riding a horse, a person jumping.
>>: But even if you argue that the set of all possible phrases is huge and there are a lot
of repeats, 17 is smaller than I expected.
>> Ali Farhadi: Actually, scaling these to hundreds or thousands is easy, because I only
used 50 examples for each of those. These are so easy and so fast to train that you can
learn them very easily. Later on I am going to show you some experiments that will
totally convince you that it's not a scaling issue. Let's see them all in action, so here.
These are the baseline comparisons for a person riding a bicycle versus a person detector
and a bicycle detector. The bicycle detector has absolutely no clue about the bicycle.
There is only one person detection over here, whereas the person riding the bicycle
detector actually fires in five correct places. The dog detector has no idea about what's
going on over here; the sofa detector has no idea; but the dog lying on the sofa detector
actually finds it. The bottle detector has no idea over here, but the person drinking from
a bottle detector actually fires there. So the next question is how can I
compare this approach, which is basically train the phrases directly, versus find the
individual components and then put them in correspondence.
For that we actually built a baseline. What this baseline tries to do is find an optimistic
upper bound on how well one can predict a person riding a horse from this prediction of
a person and this prediction of a horse. And we are building a very, very generous
baseline. Basically, I am going to expand the two bounding boxes to form another
bounding box, and I'm going to use the max and the min of those confidences to score
the final bounding box. But I am not going to stop there. I'm also going to regress the
positions of these two component bounding boxes against the phrase bounding boxes on
the test set; I am going to regress the position and the confidence, look at all of those
options, and pick the best one on the test set. So this is an extremely generous baseline.
I am just asking what the upper bound is on how well one can predict this phrase from
those predictions, and I am basically training my regressor on the test set. And I'm going
to compare this generous baseline to detecting the phrase as a whole. So these curves are
our results. The blue lines are detections for the phrases, and the red one is the baseline
that I just talked about. Look at the huge gaps. These are not easy gaps to get in
recognition; if you look at typical recognition results, you see just tiny little gaps, and
here we get big gaps on all of our curves.
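A rough sketch of the kind of component-based baseline described above: combine the two component boxes into one phrase box and combine their confidences. The regression step mentioned in the talk is omitted, and the boxes and scores are invented.

```python
# Rough sketch of a "generous baseline" for predicting a phrase box (e.g.
# "person riding a horse") from two component detections: take the union of the
# two boxes as the phrase box and combine the confidences.
# Boxes are (x1, y1, x2, y2).

def union_box(box_a, box_b):
    return (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
            max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))

def phrase_from_components(person, horse):
    """person/horse: dicts with 'box' and 'score'. Returns a candidate phrase detection."""
    return {
        "box": union_box(person["box"], horse["box"]),
        # One simple choice: be optimistic with max, pessimistic with min; the real
        # baseline tries several combinations and keeps whichever does best.
        "score_max": max(person["score"], horse["score"]),
        "score_min": min(person["score"], horse["score"]),
    }

person = {"box": (100, 50, 160, 220), "score": 0.7}
horse = {"box": (80, 120, 260, 300), "score": 0.4}
print(phrase_from_components(person, horse))
```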
>>: So initially, how did you train the baseline here? Did you train it the same way you
train yours, or did you train it on generic people?
>> Ali Farhadi: For those we actually had two approaches. I used the state-of-the-art
person detector that Pedro gets out of Pascal, and I also trained a person detector on my
own data set. I run both of them and pick the best one. So basically this is the best
possible detection plus the best possible relationships. The problem is that a typical
person detector won't get a person riding a bicycle, because the person is bent, and if you
want to build a root model that handles all the variations of people, the root model is
going to be lost.
>>: It might be a function of the training data, because you're training the people on
these very diverse people images, which have very few people-riding-on-bike images. In
a lot of ways this looks similar to the component models idea, where basically you take
people and you break that data set into, let's say, six or ten possible clusters: people lying
down, people standing up, people bending at the waist, that sort of thing. And you are
basically saying, let's do another cut, which is people on bikes. So…
>> Ali Farhadi: So you are actually asking two questions. One is, do you train this on
data where the people are actually on bikes only? The answer is no. I trained two
detectors for person: one is the state-of-the-art Pascal detector, the other is trained on my
own data set. So basically I am trying to capture both of them. In response to the second
one, look at this curve. If the root model could handle that, this curve shouldn't be here;
it should be somewhere close to that, or there should be a smaller gap here. The reason
that Pedro's model cannot do that is because the variation is too big; you cannot have a
latent root model with 20 different latent roots and expect that to actually do these things.
And those gaps are amazing gaps. Look at a horse and rider jumping; you usually don't
see such a high curve in vision. I haven't seen one before. A person drinking from a
bottle: you don't have much idea about the bottle, because we don't get the bottle right,
so most of the time the person detections are confused. So this shows that if you learn
those detectors together, because of the very rigid visual structure of the phrases, you're
going to have a very, very easy time learning very, very reliable detectors for the phrases.
>>: Maybe the other way to phrase this is, what if you took both of them learning from
the same training data set? Let's say person drinking from a bottle: we show the baseline
and your method the exact same data set. The only difference is that for your
person-drinking-from-a-bottle detector you draw the bounding box around the person and
the bottle together, whereas with Pedro's you actually draw the bounding box around the
bottle and around the person.
>> Ali Farhadi: Uh-huh. Then we have the baseline, basically.
>>: Is that what you did to the baseline or is that…
>> Ali Farhadi: Yes. So this is a way of predicting the final phrase bounding box out of
a person and a horse.
>>: You did this on the training? When you actually trained the models, you trained
them the same way?
>> Ali Farhadi: For the baseline? Yes. And I trained them on the test set, not on the
training set. So basically I trained this to predict the position of the phrase from the
position of the components, and the confidence of the phrase from the confidence of the
components. And I trained them on the test set. And this is what you get.
So now that we have these amazing results for detecting phrases, what should I do with
them? Am I making the problem even harder than it was? Because we had objects
before, and we didn't know what to say about the image; now we have objects and
phrases, and I can run all of them over the image. What can I say at the end of the day?
We believe that there should be decoding machinery in every multi-class object detection
system, like in machine translation, that sits on top of the predictions and decides what to
say about the image at the end of the day. So an image comes in, you run all of your
object detectors, your phrase detectors, whatever detectors you have; each of them has an
opinion about where the objects are and where the phrases are. And if I show you this
and ask you, okay, what should I say at the end of the day? Maybe there is a weak horse
detection right here, just below the threshold, and it didn't make it into the final answer.
Maybe these are actually a little bit over the threshold, and by just pushing them down I
can get rid of them.
So we actually developed machinery, called decoding machinery, that sits on top of those
predictions. What it does is say, okay, if there is a person riding a horse over here, there
should be a horse prediction somewhere down there, so let's push that horse prediction
up, and let's push those things down. Basically the decoding machinery sits on top of all
of those predictions and decides what to say about the image at the end of the day.
Decoding is not a new thing in [inaudible] recognition; we always do decoding.
Non-maximum suppression is one way of decoding: it says that if this bounding box
overlaps this bounding box, keep the best one and ignore the rest.
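For reference, here is a compact sketch of the standard non-maximum suppression decoding just mentioned; the boxes and scores below are made up.

```python
# Compact sketch of standard non-maximum suppression: if two boxes of the same
# class overlap too much, keep the higher-scoring one and drop the rest.
# Boxes are (x1, y1, x2, y2); detections are (box, score) pairs.

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / float(area(a) + area(b) - inter) if inter else 0.0

def nms(detections, overlap_threshold=0.5):
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(box, k[0]) < overlap_threshold for k in kept):
            kept.append((box, score))
    return kept

dets = [((10, 10, 60, 60), 0.9), ((12, 12, 62, 62), 0.6), ((200, 200, 250, 250), 0.8)]
print(nms(dets))   # the second box is suppressed by the first
```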
The way that we are going to phrase our decoding problem is this: we have bounding
boxes, say from three different categories, red, blue and yellow, and I run all of those
detectors over the image. I am going to phrase my decoding problem as assigning zeros
and ones to these bounding boxes, so that I basically mimic my training set. Or, if you
don't like making a hard decision at the beginning, you can phrase it as: I am going to
increase the confidence of some of those and decrease the confidence of the others. So if
I ask you to model this problem right away, what you would typically say is that it is
basically a unary plus binary approach: the unary term talks about the appearance of the
bounding box, and the binary term talks about the relationships. That's the typical way
of doing this. But there are problems for which we really don't need to do this hard
inference to be able to do the task that we want to do. Sometimes, if you are smart about
representations, you can avoid solving the hard problem. So what we do here is we
actually ignore the binary term that talks about this is beside this, because we cannot
model it correctly;
you end up with [inaudible] inference and then the results are not going to be as good.
So we are going to model these [inaudible] relationships in the representation of the
unaries. As a result we have a simple inference, which is basically just a unary term; we
ignore the binary term and put all of the binary information into the unary term. How?
Let's assume I want to represent this thick box over here. What I can do is build a long
feature vector that has very crude spatial bins: above, beside, and below. And for each of
those bins I am going to report the maximum response for each of those categories and
put it in there. Then, if I show this to my learner, at the end of the day my learner has an
idea about the local context: if this box corresponds to the horse, or to the person riding a
horse, then there should be a strong response for a person in the above bin, there should
be a strong response for a horse in the middle bin, and maybe there is a fence in the
bottom bin.
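Here is a small sketch of how such a context feature could be assembled, assuming a deliberately simple binning rule and made-up detections; the real feature design may differ.

```python
import numpy as np

# Sketch of the context feature: for a target box, form crude spatial bins
# (above / beside / below) and record, per bin, the maximum detector response of
# each category among the other detections that fall there.

CATEGORIES = ["person", "horse", "fence"]

def bin_of(target, other):
    ty_top, ty_bot = target[1], target[3]
    cy = 0.5 * (other[1] + other[3])          # vertical center of the other box
    if cy < ty_top:
        return "above"
    if cy > ty_bot:
        return "below"
    return "beside"

def context_feature(target_box, detections):
    """detections: list of (box, category, score). Returns a 3*len(CATEGORIES) vector."""
    feat = {b: np.zeros(len(CATEGORIES)) for b in ("above", "beside", "below")}
    for box, cat, score in detections:
        b = bin_of(target_box, box)
        i = CATEGORIES.index(cat)
        feat[b][i] = max(feat[b][i], score)   # keep only the maximum response per bin
    return np.concatenate([feat["above"], feat["beside"], feat["below"]])

target = (90, 150, 250, 290)                          # candidate horse box
dets = [((110, 40, 150, 145), "person", 0.8),         # rider above the box
        ((300, 160, 380, 280), "horse", 0.3),         # another horse beside it
        ((80, 300, 300, 330), "fence", 0.4)]          # fence below it
print(context_feature(target, dets))
```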
So by designing this feature I am avoiding the problem of solving that hard inference,
and turning the problem into a simple inference. The problem formulation is that I am
going to predict an H, which is the assignment of ones and zeros to these bounding
boxes, and I am going to phrase it again as a structured learning problem. I am looking
for the W's to assign to my features so that the ground-truth hypothesis scores highest
compared to the other hypotheses. And then you have a [inaudible] which actually deals
with this maximization problem.
Let's see some of the decoding results. There is a person riding a bicycle over there, and
there are three wrong person detections. The person riding the bicycle actually pushed up
the bicycle and pushed down all of the wrong persons. There is a dog lying on the sofa
here which is very confident, so the dog lying on the sofa got rid of all those person
detections and pushed the sofa prediction up. Why? Because if there is a dog lying on a
sofa over there, there should be a sofa nearby in the beside bin, so you just look for the
highest sofa prediction that is below threshold and bring it up. If you like numbers, one
thing I would compare is how well you can predict objects using phrases. Phrasal
recognition is not only good by itself, it's also good at boosting object recognition. These
are the state-of-the-art detection results on those eight object categories that we are
dealing with, and these are our final results after decoding. If you are in the recognition
business, you know that from this to this is a long way. If you look at the numbers in
Pascal, they are basically tiny little improvements in AP, and some of these are huge
improvements in terms of AP. To see whether the phrases are helping or not, we ran our
decoding without the phrases, and as you can see the numbers go down, meaning that
phrasal recognition actually helps object recognition. If you include phrases, which are
reliable and very easy to detect, into your recognition machinery, into your recognition
spectrum, you can actually boost your recognition results.
It also matters how you decode. If you do the hard way of decoding, with the unary plus
binary terms and then a greedy search, you are not going to do as well. These are the
results with that kind of modeling, with and without phrases. And what this shows you
is that it actually matters how you decode. If you are careful about your problem, some
problems do not necessarily need a [inaudible] thing, but you actually have to model the
relationships, and being careful about those actually matters. To conclude, I talked about
issues with recognition. I personally believe that there are serious problems with the way
that we think about recognition, and I think that we have to rethink recognition.
Basically, one of the conclusions is that the attributes story and the visual phrases are
different ways of rethinking recognition. I believe that we have very powerful machinery,
but we have to be careful about how to use it. Let me put it as a story: we didn't
basically build new machinery; we just used the classifiers that we had before. Or to
rephrase it, we didn't build any new detectors for phrases; we just used the powerful
detectors that we have, but you have to be careful about how to use them. And
sometimes, like in the decoding story, you have to build new machinery to be able to do
your stuff. Basically, the theme of my research is that representation is the key: if you
get the representation right, the rest will follow. That is basically the gist of all of the
things that I have done. In terms of what I plan to do next, my short-term goals are these.
The first one is that I would actually like to answer the question: what is the right
quantum of recognition?
When we start recognition, we always start by saying, okay, there are cats and dogs and
bicycles and cars, and we want to go and detect them. And then we're building our
detectors to produce the best results on those fixed categories. Why? Why should we
only deal with basic-level categories? Why not include phrases? Maybe there is
something else in between. So there is a spectrum: there are objects, there are phrases,
there are scenes, and we believe that phrases actually sit in between the scenes and the
objects. There might be other things. So I am trying to find a principled way of figuring
out what the right quantum of recognition is, because I don't believe that the categories
we have right now are the right way of doing recognition.
The second thing that I want to do, which is kind of related to this, is that I want to break
down this notion of categories. Say we want to build a dog detector. The way that we do
that is we have examples of dogs and then we have the models; the best one is basically
Pedro's model, the root model. So we push that model to be able to detect dogs in
different poses and different aspects: sitting dog, standing dog, jumping dog. And then
we end up with models that we cannot manage at the end of the day. So what I want to
do is break these strong walls between categories. Maybe I can build a reliable detector
that gets half of the dogs and half of the cats and half of the horses, rather than a dog
detector that is very unreliable. Why not use that, and then build a model on top of it that
can use that information? So basically this topic tries to attack this convention that we
have in recognition, that we want to recognize objects: these are cats, these are dogs,
these are bicycles, let's recognize them. I want to break that down. And
the third thing that I want to do in the near future is build a phrase table. What we have
right now is machinery to get the phrases right, and I believe we also have the machinery
to get regions out of images. Those regions are not necessarily superpixels or low-level
segmentations; they are regions that are likely to correspond to objects. So we can couple
that with our phrase recognition and build a big table that has regions of the images on
one side and phrases on the other side. So I know that there is a person riding a horse
that corresponds to this and this and this region, and not to the rest of the regions. And
very similar to machine translation, we can have a decoding that looks at this big phrase
table, which has regions over here and phrases over here, and produces an actual
description for the image. Our decoding is a bit different from machine translation
decoding. Why? Because we have to take care of inclusion and exclusion; two things
cannot belong to the same thing. So we have to have a decoding that is slightly different
from machine translation decoding, but I believe that if I build this phrase table, then we
can produce descriptions of images that are much richer than a list of objects and more
accurate than the way that we do recognition now.
For future directions, these are basically my long-term goals. I plan to be able to reason
in the semantic spectrum: we have parts, we have attributes, we have [inaudible], we have
objects, we have phrases, we have scenes, and there might be some others in the middle.
I would really like to be able to reason in this semantic spectrum, to be able to infer all of
those things, and this is not necessarily a top-down or a bottom-up approach. Maybe a
strong part detection can help pose detection, can help object recognition, whereas a
strong phrase detection can help part recognition as well. The other topic that I want to
work on is basically coupling, getting the geometry in. With the advent of all of those
nice methods, like the [inaudible] that you guys are probably familiar with, where you
can fit boxes to rooms, you could actually couple that geometry with a lot of different
things. You can couple geometry with material recognition; you can couple geometry
with object recognition, with [inaudible], and all of those things, and you can actually put
them in a nice unified framework.
The last thing that I am going to talk about is knowledge selection. As I said, we people
are actually really good at selecting what to talk about and what not to talk about. If you
want to describe this picture, you probably don't talk about the whiteboard back there,
and you don't talk about the letters over here. So we are really good at knowing what to
talk about, and one of my goals is to basically learn to select knowledge the way that we
do. Why? Because that would simplify many applications, for example image search,
because then our reports are sort of aligned with what people think, and that makes the
search much easier. The things that I didn't talk about, if you're interested, we can
discuss in meetings. I have work on knowledge transfer and split representations based
on comparative representations, if you're interested we can talk about it. There is some
work on scene discovery and multitask learning. There is work that joins multitask
learning with manifold learning, if you're interested. There is sign language and human
activity recognition stuff, and a little bit of work on using machine learning approaches
with wide-spectrum measurements to ensure network security. And with that I close.
Thank you.
[applause]
>>: I have one question. It seems that with the phrases, kind of the main take-home
message is that the data terms, the appearance of the objects, are not independent of the
relationships; you need to take into account the relationships of the objects [inaudible] to
model appearances, is that…
>> Ali Farhadi: There are certain things that you can take as a take-home message. One
is that the appearance of the objects changes dramatically when they participate in
relationships, and if you ignore that, it hurts a lot; I showed you the pictures. The other
thing is that I want to break down this tradition of having objects as the content of
recognition. There might be some other things which are extremely useful and extremely
reliable to detect, and I want to introduce them into the vocabulary of recognition. And
the second one is actually more important than the first one: the selection of the quanta
of recognition at this point is almost arbitrary, and I believe that there should be a
principled way of doing it. One principled way is to think about phrases, about
relationships. Maybe there are other principled ways. Maybe there are other quanta,
other than phrases and scenes, that we don't know about yet and will appear later.
>> Larry Zitnick: Thank the speaker one more time.
[applause]