>> Ashish Kapoor: Thanks a lot. We are repeating this. I think last time I came
there were only a couple of people and both were my interns. So I think -- so
probably good to see a better audience.
All right. It's not showing up on the projector. Now it is. Okay. All right. So this
is a talk that I have combined from multiple works and so when I started thinking
about machine -- so specifically the problem we'll be looking at is some
recognition problems and machine vision, face recognition, things like those.
And I take the perspective that context can help a lot in some of those machine
vision tasks and some people have not really been looking at context a lot. And
this whole project or lines of projects have been trying to exploit whatever can we
do beyond just trying to do high level processing at the pixels or trying to find
richer features.
How can we just exploit the context? So, again, like context can mean a lot of
different things to people. So when I first announced this talk, some folks from
[inaudible] groups came to me and when I spoke to them the context meant
something that I wasn't really thinking about at that time.
So for this talk specifically, I'm really interested in recognition and classification
specifically. And the contexts are all those cues that can help me boost
recognition, right? So, for instance, as we will see, right, we're going to mainly
focus on, you know, image recognition problems, specifically face recognition.
So identities of the person. So, of course, I can process each individual face and
I can recover whatever informative feature that I think works best. Like, for
instance, the shape of the eyes. Aspect ratio of the face, even the skin color.
And that's more towards feature engineering.
But, on the other hand, there's a lot of other contexts we can exploit. Things like
who do you most often appear with in a photograph, right? Where was this
photograph taken? What time of the day? What's the background looking like?
So there's all of these things which are not specifically trying to engineer features
but trying to exploit the co-occurrences of events, places, things like those.
And so that's the setting that we're going to work on. We're going to sort of go
through three or four graphical models which try to incorporate different kinds of
information. Right? And so again this is just situating the problem in the setting
of face recognition, but you can imagine trying to do some similar recognition
tasks in other domains where there might be other variables of interest that can
inform your final classification task.
And as I said, we're interested in people, right? So one simple thing you can use is co-occurrences, right? If there's a photograph in
which the image recognition algorithm can very well detect the presence of my
wife, right, it's likely that the other person is me, in say 70 or 80 percent of the
cases, right?
>>: Hopeful.
>> Ashish Kapoor: That's hopeful. But again it's statistics. As you will see. Like
we're not going to hard code it. And similarly you have events, right, people go to
parties. They take a number of photographs. So there's a bursty nature to it. It's
like none of these photographs are actually independent, they are governed
by an event that occurred, clustered in time, and also the people are clustered,
right? If it's an MSR event going on then you know that most of the people are going
to be from MSR. And of course you have locations if you have additional
information like GPS information, et cetera, right?
So let's try to think about how people have tried to use context like most of the
time, right? So usually people will try to use context as a feature. So imagine if I
have a classifier that can detect location or a classifier that can detect event.
Then I can take the output of that classifier, use that output as a feature, right,
and then I can train basically a high level classifier. The problem there is
that, you know, it kind of depends on the quality of your underlying classifiers.
So if my location classifier is not any good, it's not going to help me, right? And
similarly the others, right? So here -- of course there are other methods that hard
code the relationships between different phenomena happening and then
try to exploit that relationship.
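As a rough illustration, the "context as a feature" cascade described above might look like the following sketch; the classifier choices and names here are assumptions for illustration, not the speaker's implementation. The context classifiers are trained independently and their probabilistic outputs simply become extra input features for the face classifier.

```python
# Hypothetical sketch of "context as a feature": context classifiers are trained
# independently, and their probabilistic outputs are appended to the face features.
# Errors in the context classifiers propagate directly into the face classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_cascade(face_feats, loc_feats, time_feats, face_ids, loc_labels, event_labels):
    loc_clf = LogisticRegression(max_iter=1000).fit(loc_feats, loc_labels)
    event_clf = LogisticRegression(max_iter=1000).fit(time_feats, event_labels)
    context = np.hstack([loc_clf.predict_proba(loc_feats),
                         event_clf.predict_proba(time_feats)])
    face_clf = LogisticRegression(max_iter=1000).fit(
        np.hstack([face_feats, context]), face_ids)
    return face_clf, loc_clf, event_clf
```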
But one of the things that sort of drove us in this work is that instead of
assuming the existence of a primary task, such as face recognition, and
secondary tasks such as location and event detection, what
we can do instead is flatten this. We can actually use the bootstrapping
capability of each of these. So in a nutshell if I know the persons in the
photograph and if I know the locations then probably I can guess the events.
Similarly if I know the events, and I know where it occurred, I can sort of guess who
the people are.
And you can actually do this kind of combination in any possible way
you like, right? And that's what we're trying to achieve. What we're trying to do
is basically help these classifiers bootstrap off of each other.
So instead of trying to train them independently and then sort of cascading that
output into a high level classifier, we'll try to basically train them simultaneously
and so that they can bootstrap off each other, right? And to do that we'll actually
use graphical models, right?
And some of the things, as I said, that we can use are clothes -- what
clothes people are wearing. You can have a clothing feature or a clothing
detector, time stamps. And people have used such things in the past. But again
these are nothing but features that go in.
So it really assumes that you have a very well performing context detector.
Right? So let's try to build one thing at a time, right? So first I'm going to use a
very simple model, right? So let's first think about constraints that are induced in
a photograph, right? So if I have a bunch of photographic collections, right, so
the first thing that I get from my prior knowledge is that if there are two faces that
are appearing in the same photograph, they need to be of different people.
So that's one kind of context. That's very simple. Similarly, if it's a video and I've
been tracking a face, right, I can kind of see that it's probably the same guy, right?
So it's a very simple constraint. And we are not adding any kind of people,
location, event thing yet. It's just this very simple constraint. And how can we
actually exploit it?
So first I'm going to show you a graphical model that just tries to model these
constraints. And on top of it we'll add the other modalities, right?
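Just to make the two kinds of constraints concrete, here is a small hypothetical sketch of how the cannot-link and must-link edges could be collected from photo and track membership; the data layout is an assumption for illustration, not taken from the talk.

```python
# Cannot-link: two faces in the same photo must be different people.
# Must-link: faces from the same video track should be the same person.
from itertools import combinations

def build_constraints(photo_of, track_of):
    """photo_of[i]: photo id of face patch i; track_of[i]: track id or None."""
    cannot_link, must_link = [], []
    for i, j in combinations(range(len(photo_of)), 2):
        if photo_of[i] == photo_of[j]:
            cannot_link.append((i, j))      # same photo -> different identities
        elif track_of[i] is not None and track_of[i] == track_of[j]:
            must_link.append((i, j))        # same track -> same identity
    return cannot_link, must_link
```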
So, well, this is the graphical model. But I'm going to explain it to you one thing
at a time, right? So all right let's start from the top. Let's basically, let's assume
that this is all covered. Just the top portion, right?
So Y-1 through YN, imagine these are the true identities of the N different face patches that
you observed. So if in a corpus you had 100 photographs and out of that you
basically extracted, say, 500 face patches. So one through 500, Y-1
through YN is nothing but the basic true IDs.
And basically what I'm showing on top is a potential function which is encoding a
Gaussian process prior. But again let's not get into the details, but what you can
think is that suppose if you had a classifier, that would take Xs, which is the
image patches, and it would output the IDs Y 1 through YN. Basically the first
half is nothing but a classifier, right?
The second half is what you are observing, right? If it is a shaded node you
basically say that's observed. So say someone has given you
the identities of T-1 and T-9, which is some face ID, right? And the rest you
need to infer, right?
And then what you have is you have these constraints that are coming from your
prior knowledge, right? So if two images came into the same photograph, they
are represented using these red lines, which is basically saying that T-7 and T-9
cannot be the same ID. T-9 and T-8 cannot be the same ID, here. Where a
green edge would say these all need to be the same ID.
So intuitively there are two parts. So the first part is a simple classifier that you
have trained using any feature-based method, computer vision, right? But on the
other hand you have these constraints that are coming out, right?
So this is basically the graphical model, right? So, again, if you're interested
these are the exact form of the potential functions. But in the end what I'm going
to do is just message passing, and all of these things are
hidden variables, right? And once I do the message passing I should have a
probability distribution over everything that's unobserved.
So essentially, intuitively, your message passing is going to do the following: It's
first going to classify, right? So that's all of these guys, right? Once it's classified
it's going to look at these constraints. Right? And it's going to resolve the
constraints. So since all of these are probability distributions, right, you can
basically use these constraints to further refine your beliefs in the final labels,
right?
So intuitively first you classify and then you resolve the constraints. That will
result in new labels, right? And I can actually use these labels again to further
retrain my classifier. And you can keep doing this thing on and on until it
converges. Basically what I've essentially told you is a variational message
passing algorithm, where the first set of hidden variables are basically inferred
using a classifier. Then you run some kind of message passing at this layer in
order to resolve the constraints, right? And then again you pass the messages
back so you can refine these guys again. That's really what's happening.
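The loop just described can be sketched roughly as below; this is only a schematic of the classify / resolve-constraints / refine iteration, not the actual Gaussian process and loopy belief propagation implementation, and the function names are placeholders.

```python
import numpy as np

def constrained_labeling(classify, resolve_constraints, X, beliefs, n_iters=20, tol=1e-4):
    """classify(X, beliefs) -> (n_patches x n_ids) label distributions;
       resolve_constraints(beliefs) -> beliefs refined by must/cannot-link edges."""
    for _ in range(n_iters):
        refined = resolve_constraints(classify(X, beliefs))   # classify, then resolve
        if np.max(np.abs(refined - beliefs)) < tol:           # stop when beliefs settle
            return refined
        beliefs = refined                                      # feed back to the classifier
    return beliefs
```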
>>: [inaudible] for each class you have training examples.
>> Ashish Kapoor: The thing is the way we've implemented it, we don't assume
that. And, again, it really falls through from the structure of the Gaussian
processes. Really it's basically you know the structure is the -- your classifier is
of the form you have a matrix times the labels. And these labels are nothing but
indicator matrices.
So basically if you have more classes coming in, all you need to do is just augment
your label matrix. So, again, I'm skipping a lot of details. But it's kind of -- that's
how we implemented it.
>>: So it scales with the number of classes, right?
>> Ashish Kapoor: Yes, again, here's the thing, for one inference step, we
assume that we know the labels, right? But you'll see that we're going to do a bit
of active learning on top, that's where you don't need to really -- you can break
those boundaries there. All right. Clear, right? It's simple message passing.
First classify, then go down and resolve the constraints. The way you resolve the
constraints is basically loopy belief propagation, you pass messages back and keep
doing it until it converges.
>>: [inaudible].
>> Ashish Kapoor: Well, the thing is if it weren't for the loopy belief propagation,
you could show the convergence, right? Otherwise, assume that you can do this
step perfectly, the constraint resolution step. The rest of it is a
simple variational message passing, which is going to go to a local optimum.
But because of the existence of loopy belief propagation, you cannot guarantee it.
But in this application the graphs are such that the loops are fairly few. So it's
not a problem.
So the details we don't need to go over it, right? All right. So --
>>: I have a question.
>> Ashish Kapoor: Yeah.
>>: There are a bunch of people who have done this kind of layering of classifiers. So you
have a classifier at stage one and then you kind of train, given the set of
classifiers you have, at stage two. So you basically are kind of stacking
classifiers on top of one another. And this is similar in some ways because this
has -- you can take the information given the first classifier and then adapt to it.
>> Ashish Kapoor: Yeah, but also the thing is you need to notice that the
information flow is both ways, right? So the classifier classifies, and based on those
potentials you have your constraints and you resolve those. And the messages
then get passed back. And so it refines the base layer classifier as well.
>>: For the top, the temporal evolution of those things. So you can think of it as
stage-wise. So you have a classifier, and then you kind of have the update via
the second layer and then you get the new ->> Ashish Kapoor: Yes, basically unfolding happening, yeah, you're right. It's
similar spirit, right.
>>: So what's different in that work is that they're doing it for a set of classifiers
here. And so I think they're combining different information. Combining different
information.
>> Ashish Kapoor: Again, probably I'll have to look at the paper that you exactly
are talking about. But I think in a sense the idea is very similar, that you have
multiple classifiers. They all can inform each other, right? So instead of doing
piece-wise linear, you can do, try to basically do this graphical model way in
which you have a message passing way of training all the classifiers
simultaneously.
>>: [inaudible] iteration.
>> Ashish Kapoor: You're doing an iteration, right. In an ideal world you should
be able to do.
>>: But you descend on all time ->> Ashish Kapoor: Well, it's a kind of gradient descent, like. You can view this
as doing a coordinate ascent where you have variables.
>>: You fix one, fix one and then ->> Ashish Kapoor: Yeah.
>>: Then optimize.
>> Ashish Kapoor: Instead of fixing one variable you're fixing a group of
variables. You can think like that also.
>>: Kind of alternate.
>> Ashish Kapoor: Alternation. Again, in an ideal world I would be able to do this
inference exactly. Imagine if I had a huge sampling capability. Then essentially
what I'm saying is what I'm limited by right now is my inference procedure. If
there was a way to do exact inference, this would be as if I were doing it
simultaneously.
So one is the model and the other part is doing the inference. The inference,
because of computational reasons I'm using a variational approximate inference,
which has the flavor of doing iteration.
But if, magically, I had the capability of doing exact inference, then in a sense I
would be training this simultaneously. Right?
>>: Another question. You have here the assumption that the prior is normal -- which is ->> Ashish Kapoor: This is basically there is a very good correspondence
between this prior and the regularizer that you'd use in an SVM. It's basically the
same thing.
>>: How strong does it depend on this? So if the prior, do you have some way to
control -- what is the case if this is the wrong prior, how bad can it be?
>> Ashish Kapoor: Let me try to answer this question in a different manner.
Imagine there are no constraints here. Imagine there is nothing here. Right?
Then this model is the same as regularized least squares. Right? Then it boils
down to the question: How good is regularized least squares? Again you need to
make sure that your features are appropriate. Right?
All right. So, again, let's move on -- so the good thing about this is that it's fully
Bayesian inference going on. So I can do things like active learning using value
of information computations, basically. So what I can do is I can ask questions
like, so, you know given T 1 and T 9 I have a posterior distribution over the rest
of the label set, right?
So I can ask, hey, if I knew T8, how is this posterior going to shrink? So now you
can imagine that if I knew T8, I would automatically know what T7 is, right?
Because of the green lines. But if I knew TN then it does not give me that much
information. So I can do this kind of sort of reasoning in order to guide my active
learning. Right?
Imagine now if I have hundreds of photographs but I have a really long video
sequence, so it makes more sense for me to get a label for a single frame on that
video sequence in order to have a lot of discriminatory information I can
use to train the classifier.
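A hedged sketch of the value-of-information style selection being described, assuming you can rerun inference with a hypothesized label; the expected-entropy criterion and all names are illustrative, not the exact computation in the talk.

```python
import numpy as np

def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps))

def select_query(beliefs, infer_with_label, unlabeled):
    """beliefs: dict node -> posterior over IDs;
       infer_with_label(node, label): posteriors for all nodes if node were labeled."""
    current = sum(entropy(p) for p in beliefs.values())
    best_node, best_gain = None, -np.inf
    for node in unlabeled:
        # Expected posterior entropy, averaged over the node's own current posterior.
        expected = sum(prob * sum(entropy(q) for q in infer_with_label(node, lab).values())
                       for lab, prob in enumerate(beliefs[node]))
        gain = current - expected
        if gain > best_gain:                 # ask for the label that shrinks uncertainty most
            best_node, best_gain = node, gain
    return best_node
```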
So this is the first set of experiments, basically, that shows how well it works with
active learning. Again, I mean, you can get these results in the paper, but
basically as you get more and more labeled data, you tend to do better.
The blue line here in all of these graphs -- so there are like four different
datasets. The blue line is if you don't have any constraint resolution you're
basically training as if you had a supervised classifier, right? And the rest are
once you have, right?
So once you have the constraints then how will you do, basically? Again, you
can -- results you can read. All right. So moving on. Now comes the more
interesting part.
So so far we were only talking about this modality, right? So from image
patches, right, figure out the identities and then you have these constraints in the
real world, right? Now imagine you had like other classifiers as well. I have a
location classifier, I have event classifiers. For people classifier I was just looking
at the image patches. But for characterizing a location maybe I can use the
background. So from the images I can compute a simple feature that looks at
the histogram of the background, right? So similarly for the event, if I have the
EXIF data from the file I can just look at the time stamps and maybe the GPS
coordinates. So those are some features that you can use.
So if there were no connections like this, then they would be like individual
classifiers, which basically looks at the background scene and then tries to guess
what location it is. Similarly it looks at the time stamps and tries to classify, right?
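For concreteness, the kind of unary context features being mentioned might be computed roughly as follows; the histogram and timestamp encodings are assumptions for illustration, not the exact features used in this work.

```python
import numpy as np

def background_histogram(image, face_boxes, bins=8):
    """image: HxWx3 uint8 array; face_boxes: list of (x0, y0, x1, y1) to mask out."""
    mask = np.ones(image.shape[:2], dtype=bool)
    for x0, y0, x1, y1 in face_boxes:
        mask[y0:y1, x0:x1] = False                      # drop face regions, keep background
    pixels = image[mask].reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=((0, 256),) * 3)
    return hist.ravel() / max(len(pixels), 1)           # normalized color histogram

def timestamp_feature(dt):
    """dt: datetime parsed from the photo's EXIF data."""
    return np.array([dt.hour / 24.0, dt.timetuple().tm_yday / 366.0])
```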
But now I'm going to put in these links -- this is where the message passing is going
to happen again. So as I said, if I knew who all in the photograph were, and I
can characterize the location, then it gives me a lot of information about the
event.
Similarly, if I know what the event was and what the location was, then I can get
information about the people and all possible combinations such as those.
Right?
So let's look at how do we do that. Right? So the crucial thing is the links between these guys,
and in the beginning I also mentioned that, you know, the fact that
co-occurrences of people can also help you, right? So that's what's shown by
this self-loop, basically. And let's try to see how do we model these links.
So I'm going to model these links basically using a relational table. It's nothing
but a table, right? And all it encompasses is the statistics of people and events
and when they are occurring.
So basically an entry 0 says that this guy cannot -- is probably not present at an
event characterized by these group of pictures. I mean, it's a group of pictures
that I have used to denote an event. But all this table is capturing is the
likelihood of all these people presented in an event.
You can even think of it as a compatibility function, basically. When it's 0 these guys are
not likely to occur in this event. But these guys are. Right?
>>: Do you define an event as an arbitrary group of pictures?
>> Ashish Kapoor: The way we define it is clustering in time. So we run
basically a clustering on the time stamps and we see that these guys are
different events. That's how we are defining it.
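As a minimal sketch of that time-stamp clustering (the gap threshold here is an assumption; any one-dimensional clustering of the time stamps would do):

```python
def cluster_events(timestamps, gap_hours=6.0):
    """timestamps: list of datetimes, one per photo. Returns an event id per photo."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    event_ids = [0] * len(timestamps)
    current = 0
    for prev, cur in zip(order, order[1:]):
        gap = (timestamps[cur] - timestamps[prev]).total_seconds() / 3600.0
        if gap > gap_hours:
            current += 1                     # large time gap starts a new event
        event_ids[cur] = current
    return event_ids
```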
And similarly here as well. Right? So basically, you know, this is all unknown to
us, but right now I have filled those entries just to explain things, right? But if we
had a table like this, this information can be very useful, and this is what we will
use in order to do the message passing, right?
So this is nothing but a compatibility matrix if you like to think of it. How
compatible are certain people in certain events, which is encoding, likelihood of
the presence. Right? And, yeah, right basically who they are and where.
So let's just work with two domains right now for simplicity. So we had the
people domain, right? Where we are looking at face features and clothes
features, we can simply classify them as we did in the previous thing, right?
Similarly in the event domain we have time stamps, you can simply classify.
Now you want to sort of do message passing, that's where the relational model
sits.
So again it's just nothing but a simple compatibility function which is nothing but a
2-D table. Again the entries are not known to us. But if you know those entries,
then I can simply do the message passing as I did in the earlier slide, right? I
classified these guys, I classified these guys. I look at the relational model, and I
basically try to resolve the constraints and pass the messages back and keep
doing it. The problem is I don't know what the relational model is.
And what we're going to do is we're going to actually learn that as well. And,
again, in terms of probability formulations if you're interested in the formulas I can
go into the detail. But it has only three terms. The first term, the first two terms
are the unary potentials which are nothing but the unary classifiers, from
basically here to here, here to here. So what is the likelihood that if I observe a
certain facial features and clothes feature, the ID of the person is something.
And the second term is nothing but the compatibility, sorry, this term is nothing
but the compatibility across two domains, right? And, again, we can go into the
details. But I'll probably keep it at very high level.
As I said, classify this, classify this, pass the message along if you know what
this guy is. Right?
And so the way we are going to do it is basically EM, right? And again use the
parameters of these guys as -- think of them as a theta, a variable. So in an EM
step, right, you have an E step where you first figure out, you basically classify
these guys, right? And given your theta, you can compute the likelihood of the
model, right?
And what you can do you can try to optimize that with respect to this guy. So in a
nutshell, right, what's really happening is you first infer a distribution of the label
given some initial value of theta, which can be uniform, right? And then once you
have that you can try to maximize the model likelihood over your thetas and you
can just keep doing it. So it's variational inference, but on top of it there's an EM step
going on.
Basically, to the algorithm all you give is a bunch of images, a bunch of face
patches and it automatically not only classifies them but finds out the latent
relationships. By using this graphical model what you have done is you have
provided a structure, and you're exploiting that structure in order to recover the
variables of interest. Right?
So in a nutshell, right, it's a very simple model. Classify these guys, classify
these guys, given this thing, find the likelihood of the model and then optimize
with respect to this guy to maximize it and keep doing it, right?
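A rough, heavily simplified sketch of that EM-style loop for just the people and event domains; the update equations are an illustrative approximation of the message passing, not the exact formulation in the paper, and all names are placeholders.

```python
import numpy as np

def em_relational(p_person_unary, p_event_unary, event_of, n_iters=20):
    """p_person_unary: (n_faces x n_people) unary posteriors per face patch;
       p_event_unary:  (n_groups x n_events) unary posteriors per photo group;
       event_of: integer array, group index of each face patch."""
    event_of = np.asarray(event_of)
    n_people, n_events = p_person_unary.shape[1], p_event_unary.shape[1]
    theta = np.ones((n_people, n_events))                  # start with a uniform table
    for _ in range(n_iters):
        # E-step: combine unary beliefs with messages passed through the table.
        p_event = p_event_unary.copy()
        p_person = p_person_unary * (theta @ p_event[event_of].T).T
        p_person /= p_person.sum(axis=1, keepdims=True)
        for g in range(p_event.shape[0]):
            for i in np.where(event_of == g)[0]:
                p_event[g] *= p_person[i] @ theta          # message from each face to its event
            p_event[g] /= p_event[g].sum()
        # M-step: expected person/event co-occurrence counts become the new table.
        theta = p_person.T @ p_event[event_of] + 1e-3
        theta /= theta.sum(axis=1, keepdims=True)
    return p_person, p_event, theta
```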
>>: So it's not clear to me where the initial model is coming from. So in this
example you said you have these red and the whites, the 0s and the non-0s in
this matrix. And you drew it so that that's the structure you have and that graph.
Where did that come from?
>> Ashish Kapoor: Well, that's the thing. This is unknown to us. So that's the
parameter that you need to find out as well. And that's what the EM step is
doing.
>>: The structural element to it, structurally is the red versus the white. And M is
the parametric part of it. So are you also identifying the structure -- what's not
clear to me is the structure as well.
>> Ashish Kapoor: The structure is fixed.
>>: How do you get the structure of the red versus the white?
>> Ashish Kapoor: Structure is very simple. So this is nothing but a giant table.
It's a lookup table, right? So in graphical model terms, right? This is a random
variable and this is a random variable. And entry here basically tells you the
compatibility of this value of the random variable with this.
So it's basically a simple 2D parameterization and that's where it's fixed.
>>: So it's not red versus white. You don't have any structural information. So
the graph in that second picture is misleading in the sense that you have edges
going from every person to every event.
>> Ashish Kapoor: We do actually have it. So well one thing to think about is the
following. So this is one random variable, this whole, the rows represent one
random variable. The columns represent one random variable corresponding to
each face patch. Correct? Right? So then basically this is nothing but a
potential function between this guy, this random variable and this random
variable, right?
And, yeah, that's pretty much it.
>>: So does this table contain any additional information beyond what you extract ->> Ashish Kapoor: That's something we'll extract ourselves. So this table is
unknown to us. It's unknown.
>>: Until someone says look at this picture, this ->> Ashish Kapoor: Exactly. So here's a hypothetical way to learn this table,
right? So I have, say, unary classifiers, classify people just based on the patch,
the location, based on those things.
>>: Classifiers.
>> Ashish Kapoor: Yes, probabilistic classifier. Assume they're any SVM or
anything. And what I can do I can look at the output of those classifiers and
come up with a table. As long as the classifiers are informative, not completely
random, I'll have some reasonable estimate here, right?
>>: What I don't understand is that this doesn't add any information.
>> Ashish Kapoor: The information is coming in terms of structure. That's what
I'm trying to say, right? What I have encoded in the model is the fact that certain
people basically tend to be present at certain events but at certain other events
they are not. Right? And that's basically what I'm saying using this. Like there is
a ->>: So I'm trying to understand what you said. So in the table itself, if it was just
extracting this data, there's no additional information?
>> Ashish Kapoor: Yeah, exactly right. So the additional information comes
from the fact that you say ->>: I'm imposing the fact on the model.
>> Ashish Kapoor: Some prior on these tables saying, for example, these two
guys, they tend to appear together.
>>: Appear together.
>> Ashish Kapoor: Something like that.
>>: Yes.
>> Ashish Kapoor: So the structure is -- the additional information is not in this
table but it's something on ->>: Here's the thing, right? Suppose if I were just training independent
classifiers, right? What's the difference between this model and that? Right?
The only difference is these links. So by explicitly specifying this
dependency, I have added information.
>>: In other words, just because two pictures appeared at roughly the same time,
you're now biasing.
>> Ashish Kapoor: Constraints that you're adding, exactly.
>>: You mentioned, for example, you cannot be the same person in the ->> Ashish Kapoor: The same ->>: For example, also seems you want to encourage the same people appearing
the same way. What are the specific examples for these things that instantiate?
The prior comes from these samples. There's some model there.
>> Ashish Kapoor: All right. So one is -- when we talk about prior, I think I mean
the way I'm thinking about prior is probably a little bit slightly different than how
you're thinking about the prior. So let's start from very scratch, right? If I have
just unary classifiers, look at the faces and classifies them, that's a simple
classifier. And it can be SVM. It can be logistical regression. It can be decision
trees, it can be whatever.
Similarly here I can have another classifier. So now if I'm -- what I'm going to say
is the outputs that I observe here are not only dependent upon this but are also going
to be dependent on the output of this guy.
This I'm explicitly encoding in my model and that's the extra information, right?
And there's no other extra information. I'm not specifically hand coding any kind
of statistics or anything.
>>: Original version of the matrix that said I found ->> Ashish Kapoor: If you knew this model, if you knew this model, it basically
just tells you how to marry these two labels. This is just a compatibility function,
right? All it's saying is how do I make sense when I observe things here and
things here. Right?
When it's absent there's no way for information exchange, right? So you
basically say all right whatever my classifier is giving is fine.
>>: The structure, so it will say -- so the same label cannot be in the same image.
If Joe appears in the event, at least one image must be Joe. So what are these
templates. What are the potentials you're [inaudible].
>> Ashish Kapoor: There are three, four different kinds of potentials, basically.
One is people-event. So you have one face patch and you have a random
variable corresponding to the event that the image was taken in, so there's a
compatibility function there. Similarly between event and location. And, again,
location and people, and people and people. So for all those four links you have
four compatibility tables.
But again this is going through the details. Assume that if I just had only two
domains let's not even worry about the specific application. If I only had just two
domains, right, if I had simple classifier, there's no way of exchanging
information, unless in my model I explicitly, you know, spell out the dependency,
right? And this dependency is through this relational model which is
parameterized.
So if you know, then you know how to sort of marry those two domains, right?
So that's ->>: Let me make sure I understand correctly. So the matrix is not per image but
per group of images that are timestamped, classifiers ->> Ashish Kapoor: Every face patch -- so every face patch is a -- so you have a
face ID, right, and you have an event ID, corresponding to every face patch,
right?
>>: But if you're going to have multiple pictures with the same event ID, this is -so this is where you ->> Ashish Kapoor: Exactly.
>>: This is where the ->> Ashish Kapoor: This is where -- so those parameters are shared across, yes.
>>: Because if you didn't have ->>: No.
>> Ashish Kapoor: These parameters are all shared across all the instantiations.
>>: To the table?
>> Ashish Kapoor: Yeah.
>>: So the one .9, the event, the first event, if you learn this, it should be --
>> Ashish Kapoor: Let's not worry about the learning. Suppose this is the table
that someone has given you. Someone has just hand coded from prior
knowledge, he basically looked at the invitation list of different events, like
someone went in, and he hand coded this table and gave it to you to the model,
right? So this is how it would look like, right? So basically this table is saying
that this woman with a very high likelihood will represent in this event, right? So
if I detect this event, right, then the chances that this woman is of this ID is very
high. That's all it's saying.
>>: Take a picture of that event.
>> Ashish Kapoor: Any picture of that event.
>>: You apply that potential for all pictures in that event, is that what you're
saying?
>> Ashish Kapoor: Yes.
>>: If she's in multiple pictures, is that more likely then? Can you ignore the face
patch, whatever face patch it is ->> Ashish Kapoor: Yeah, it's that.
>>: Face patch.
>>: It's pooling ->> Ashish Kapoor: Yeah, it's basically saying, it's not looking at features at all it's
just looking at the random variables on top. It's trying to basically -- it's a
compatibility function. It's a way of saying that I am going to be present at this
event, right?
>>: Trying to understand what the point nine means. Does it mean --
>> Ashish Kapoor: It's not a probability. It's some number. It's a function. So
the thing is, this is an [inaudible] graphical model. So if this number is high it
basically says that the likelihood of this person is high.
>>: If it's high. If it's 0.
>> Ashish Kapoor: That means completely ->>: That face should not appear.
>> Ashish Kapoor: Exactly.
>>: If it's 1 does it mean [inaudible].
>> Ashish Kapoor: No, no, it does not.
>>: Sometimes it does mean it has to ->> Ashish Kapoor: If it goes to infinity, but that's ->>: Use the scale. So we have infinity it has to.
>>: Does it have to appear in the event or does it have to appear in every image
of the event?
>> Ashish Kapoor: What it's saying is it does not tell about every image. All it's
saying is if an image is clicked at a certain event, then if it's a -- if it goes to
infinity that basically will say that that person should be there.
>>: In every.
>> Ashish Kapoor: Yes.
>>: It's not that it's present in the event.
>> Ashish Kapoor: Present in the image. These are unnormalized. 0 means
definitely what you said.
>>: Goes from 0 to infinity.
>>: It's not pooling across all images saying at least one where they're present.
It's [inaudible].
>>: You estimate its pool.
>> Ashish Kapoor: When you estimate, that's how you'll pool it. That's when you
pool it. But if -- I'm just given this function, right? It's basically telling me that this
is the compatibility between these two random variables. That's all it's saying. It
does not tell me about whether it's like certain or not because it's not giving me a
scale.
>>: The maximum value ->> Ashish Kapoor: Yeah.
>>: Across all ->> Ashish Kapoor: But this is unbounded. This function can have any real value
actually. It can also go negative. Like we didn't implement it that way. But it can
actually go negative.
And so, again, all of this thing actually gets resolved when you compute the
partition function. But let's not go there. But, again, if someone has given you
then that's how you need to interpret it. The question is we don't even know what
it is to start with, right?
So the ->>: The image, you can use the thing on the left say here are the people in it.
You can use the thing on the right to say this is what the event is for. Based on
the matrix.
>> Ashish Kapoor: Exactly. That's what's going to happen. The way we're
going to learn, right? Learn those parameters. First you classify these things
individually, then count. Come up with that table. Once you have that table you
can further refine these guys as well, and keep doing it.
Again, it's an EM kind of thing.
>>: So in that framework, where do the constraints get input? Are you solving a
constrained optimization, is that where you put them in?
>> Ashish Kapoor: Well, each step is basically, again, we are doing it very
simply. So the way actually it is implemented is first all right so first fix this guy.
So first, all right, classify, classify, and then estimate whatever it is. Then fix this
guy and this guy, and then sort of redo this thing, then fix this, fix this, then redo
this, then redo this and you keep doing this.
>>: Can you go over the joint optimization framework.
>> Ashish Kapoor: It's a coordinate descent.
>>: To guarantee ->> Ashish Kapoor: It's a local minimum. It's a variational message passing.
Exactly, it's basically a variational algorithm. Same flavor as doing clustering with a
mixture of Gaussians, exactly the same thing.
And the other details of it and, again, I think as I said it's basically an iterative
way. We can basically look at some of the results, right?
So if I was just using simple face classifier, right, just image features, I can do,
well, there are two different datasets. I can do up to, say, 39 percent. That's the
best. It's basically ->>: It's state of the art.
>> Ashish Kapoor: State of the art, MSR face recognition library. We
collaborated with the Bing folks on this, and pretty much -- this is like around 30,
40 class problem. It's a pretty big -- so the margin is less than one percent. So
even state of the art on a even hard image recognition task is not more than 30,
40 percent, actually. And this is like real -- these are [inaudible] datasets. These
are all weird kind of poses and all that kind of thing.
So I think this is 15 classes. This is like 30 classes, the earlier one. This is the
30 class problem. And this is pretty much state of the art. This is the best they
could do on this dataset using number of different algorithms. So Simon Baker
and [inaudible] they are face recognition gurus essentially.
So next what if you start to incorporate the statistics of co-occurrences of
people? So it goes up, a little bit. Not that much. Still it's a hard problem. And
co-occurrences, you can imagine that they are limiting.
As soon as you start incorporating events and combine these two, it shoots up
pretty much to 95 percent. And if you imagine, it's a 13 class problem. But if
it's -- if you go to different events, let's say group of two or three, certainly
knowing an event reduces your 15 class problem to a two or three class problem,
and that's a big -- that's a big piece of information you can actually exploit.
Right?
>>: Do you have any sense how it decays when you have a much larger -- 20
isn't -- if I look at my photo album 20 is not that big. So do you have any sense
how it decays?
>> Ashish Kapoor: So that's something that we are now actually looking at.
Especially looking at large corpuses. This is preliminary studies. But I mean our
sense is that the more classes there are, the better the gain is, just because if I can
reduce a 300 class problem to a five class problem, it's better than doing from 15
class problem to three class problem.
So ->>: Better ->>: Better the relative gain is the question.
>> Ashish Kapoor: Better the relative gain, and, well, in terms of -- it's a much
harder problem to do a 300-class problem. So I mean that's a little bit harder to
compare. But that's the initial sense. Again, I don't have concrete numbers on it
yet.
>>: Well, at this level, not even automatic, you could make a great UI to this
person, this person, this person, it would be real easy. But when you have 200
people now is it still going to be useful or not to be able to create a useful UI.
>>: That would be nice to get the complexities.
>>: Complexities.
>> Ashish Kapoor: So ->>: Large vocabulary. Large class.
>> Ashish Kapoor: The thing is, we looked at number of different Facebook
datasets. And usually the number of different identities that people have rarely goes beyond
40 or 50, actually. 300 is a really rare dataset, actually.
I mean, I don't know, like my personal albums -- if it's even like 300 identities,
probably some of those faces are occurring only once or twice.
>>: Identify the fact that they only occur once or twice that's a different problem.
>> Ashish Kapoor: Different problem. So that's a different scale of problem.
>>: What's different about the UI if you're talking about identification, if these
things, the easy part is kind of the mistakes of the machine and the hard part is to
correct them. So to look through a list of 300 people is a time consuming thing,
but it seems that here you can just point to this. You can just say this is not
what ->> Ashish Kapoor: The constraint comes in.
>>: The constraint comes in. Redo the thing that would be a very useful type
of ->>: That's essentially pretty awesome.
>> Ashish Kapoor: Yes. So basically, well, those red lines that we're talking
about basically sit in here and you can do message passing.
>>: They actually have it in Windows Live photo, I don't think they do it in an
event. Combining the red lines you're talking about with the event, event
clustering.
>>: Kind of mistakes you don't have to correct.
>>: You can optionally point to the corrections or mistakes. Say these guys are
right, these guys are wrong. Essentially pretty reasonable. Plug in nicely to this.
>> Ashish Kapoor: So they're trying to put this on to an iPhone. So that you can
do some of these UI kind of things. So the problem is not just machine learning.
I think there's a lot of usability issues. I think they're trying to work it out.
>>: Has these types of input feedback. So they can take this [inaudible] with it.
>> Ashish Kapoor: Yeah.
>>: And figure the difference between LAP and people to people. You look at the
right-hand side you actually have more. You have the relationship, you've got
more increase than on the right-hand side, any idea about that?
>> Ashish Kapoor: So you mean like so this jump is more but not this much?
>>: Between ILT and PT, the smaller [phonetic].
>> Ashish Kapoor: I think it's really hard to figure out how much the distance
like -- I mean that's something we probably should look at. But I don't have a
good answer. But I think it's an interesting point, observation, I would say.
>>: What are the ->> Ashish Kapoor: Like 500 images. So this one is 500 images. This one. So
500 instances of thirteen classes. So events are more on this one, there are far
more events.
So again these are standard Facebook datasets. Basically a couple of people
at Bing, they uploaded the album and that's how it is.
>>: So the data, eventually when you measure correctness, it's in the clustering sense,
right?
>> Ashish Kapoor: No it's in an identity sense, classification.
>>: So what is the label?
>> Ashish Kapoor: So the label is the ID of the person, the image patch. That's
what we're looking at right now. So each image patch is an ID associated with it,
right?
And our goal is to find, do the classification in that sense. Find the ->>: But then you'll have to have another dataset in which you have some images
of this person with the right ID. Otherwise ->> Ashish Kapoor: Oh, yeah, yeah, exactly right. That's how we do it, right?
Out of these, like there are some -- some instances that are seeded with label
data, and then ->>: How many.
>> Ashish Kapoor: I think it's 40/60 but I need to go back. I didn't put it up here.
But it's something like that, around -- so there are some of them which are used as
training data, so in a sense it's semi-supervised if you like to think of it that way. You have a
corpus of data and a few of them are labeled but not all of them. And you do all
this message passing in order to ->>: But then some of the reasons for these kind of things is the -- there was a big
difference if you say 40 percent, but every event has a labeled image, then it
isn't easy, right? Whereas if an event ->> Ashish Kapoor: Not necessarily.
>>: -- then it is hard, it's not trivial. It's definitely easier if for every event I have
an image which is ->> Ashish Kapoor: I don't know exactly. One thing if you look at this diagram
again, these guys are pretty independent. You know, when I'm training these
classifiers, these features, right? The only relational model thing is happening
between, is this. Right? So as long as if I have enough data to train a
reasonable unary classifier, I'll be okay.
If, for instance, I have a situation where all my events are in the day and one event
was at night and it's all dark and completely screwed up image features then I'll
have problems, probably.
>>: Events as well? Just ->> Ashish Kapoor: Basically we have labeled faces only to start with. And initially if
you have time stamps on the images we just run a clustering on times to basically
come up with IDs for events.
>>: I think this is kind of the next question after Ryan's question: what happens if
you obscure a face? What if you take someone's face and hide it in the image,
are you able to identify this person as well?
>> Ashish Kapoor: Depends on the relational model if you have a strong
relational model. It will give you a probability distribution even if you don't have
this.
>>: [inaudible].
>> Ashish Kapoor: We have not. That's a good idea. We should probably try it.
>>: Because people do these kind of things in large group much better. Another
image sets which ->> Ashish Kapoor: That's a good idea. Someone is doing this, like you can say
all right I know who you are.
>>: Go to image sets. One image set -- is there a difference?
>> Ashish Kapoor: Again, I need to go back and look at it in detail. Honestly,
these are some of the good issues. I think it might be -- I should do that.
>>: Assuming you have 100 percent accuracy in face patch identification.
>> Ashish Kapoor: Face patch, it's like I mean it's pretty good. Again, it's state
of the art. And ->>: It does terribly when you see the back of a head.
>> Ashish Kapoor: The pose and all that.
>>: You should be able to identify somebody from the back of the head in lots of
cases.
>> Ashish Kapoor: That's an interesting -- I should try to do that.
>>: It shouldn't be face patch. It should be.
>> Ashish Kapoor: Like an.
>>: Head patch. And it's now you can -- it's a different identification problem.
>> Ashish Kapoor: That's a good idea.
>>: That's the level three, if you see the face and he shouldn't be with her, he
should have been with me. [laughter] but she's the wrong one.
>> Ashish Kapoor: That's funny.
>>: Body parts, too.
>>: Yeah.
>> Ashish Kapoor: Actually, can make a surprise vector.
>>: Because have a picture of you with your wife and -- she shouldn't be with
you, she should be with -- [laughter].
>> Ashish Kapoor: All right. So the next set of experiments were like what if you
use -- what if we use these unary classifiers as features, the first thing I talked
about: instead of doing this relational inference, what if I have these context
classifiers but use them as a feature in classification, right?
So using faces again, only around 39 percent. Clothing bumps up. If I use a
timestamp, some people use a timestamp as a feature, as a context. Doesn't do
well. Cross-inferences again, what we talked about.
Really if your context classifiers are not that strong you're basically screwed.
Basically. That's probably not the best way to go about it. That's what it's trying
to show. And interestingly, like there is -- so so far I had only talked about
people, people, people event, but I had nothing about people location. Right?
So, in fact, the gain from people location, if I have people event already is very
marginal. Usually location and event are highly correlated. So if I have
incorporate event, it's kind of the same as having location altogether.
>>: But the thing is if you have multiple albums for different people, the best
segmentation is per album, per location, and you'd be able to start having edges with different
people's --
>> Ashish Kapoor: You have those cases.
>>: So if you're talking about the party, if I'm traveling now, then it's going to be ->> Ashish Kapoor: It's different. Yeah. I mean, the thing is, yeah, it's basically
how well you can characterize the event and how much correlation there is between
the event and the location. In this dataset unfortunately there was a high
correlation.
>>: You have no images from the same event it's not necessarily true whereas
it's likely to assume in the same day want to be in Europe and Australia
[phonetic].
>> Ashish Kapoor: Yeah. Again, some intricacies of the datasets here probably
the location and the events occur together. But then the thing is that ->>: It's because -- you do have to have the event. You have to ->> Ashish Kapoor: Yeah, we tried all possible combinations. And then the thing
is the good thing then is that location labeling is very hard, because I'm just
looking at the background of images, right? But now if I start incorporating event,
that accuracy goes up. So even though it didn't help us with recognizing people,
the accuracy of finding the location went up, because now -- so again it's
kind of saying that all of the three classifiers are benefiting because we
can do this bootstrapping off each other. And, yeah, basically accuracy for event
labels as well, similar results.
So, again, the cool thing is, again, it's a Bayesian model, so I can do that trick of
active learning. That is what it is showing here. The red curve is basically active
learning with cross-inference. Right? 150 images. If I'm able to label even like
around 39, 40 images, I pretty much nail the rest of the dataset. Just because of
this information flow going around. And again, there are four different kinds of
sampling strategies.
All right. So basically, pretty much, I think the message is: it helps sometimes if you
can model the relationships between different domains, specifically if
they're dependent upon each other, instead of trying to use them as a feature into
a [inaudible] classifier. Again you can imagine doing this with other classifiers
beyond faces, like mobile scenarios. Or, like, a lot of people -- actually we are thinking
of using it for human behavior modeling, like people's, say, cognitive
states based on what they're doing, and often they're also very co-dependent. So
you can exploit that. I'm sure you can do a lot of stuff on health. Specifically
many of these signals are probably dependent on each other. And instead of
using all the signals as a feature, you might want to maybe think about modeling
them individually and then trying to model the relationships across. So that
concludes it. [applause].
>> Ashish Kapoor: All right.