>> Ashish Kapoor: Thanks a lot. We are repeating this. I think last time I came there were only a couple of people and both were my interns. So I think -- so probably good to see a better audience. All right. It's not showing up on the projector. Now it is. Okay. All right.

So this is a talk that combines multiple works, and so when I started thinking about machine -- so specifically the problem we'll be looking at is some recognition problems in machine vision: face recognition, things like those. And I take the perspective that context can help a lot in some of those machine vision tasks, and people have not really been looking at context a lot. And this whole project, or line of projects, has been trying to exploit whatever we can do beyond just trying to do high level processing of the pixels or trying to find richer features. How can we just exploit the context?

So, again, context can mean a lot of different things to people. So when I first announced this talk, some folks from [inaudible] groups came to me, and when I spoke to them, context meant something that I wasn't really thinking about at that time. So for this talk, I'm really interested in recognition and classification specifically. And the contexts are all those cues that can help me boost recognition, right? So, for instance, as we will see, we're going to mainly focus on, you know, image recognition problems, specifically face recognition. So identities of the person. So, of course, I can process each individual face and I can recover whatever informative feature I think works best. Like, for instance, the shape of the eyes, the aspect ratio of the face, even the skin color. And that's more towards feature engineering. But, on the other hand, there's a lot of other context we can exploit. Things like who do you most often appear with in a photograph, right? Where was this photograph taken? What time of the day? What's the background looking like? So there are all of these things which are not specifically trying to engineer features but trying to exploit the co-occurrences of events, places, things like those. And so that's the setting that we're going to work in. We're going to sort of go through three or four graphical models which try to incorporate different kinds of information. Right? And so again, this is just situating the problem in the setting of face recognition, but you can imagine trying to do similar recognition tasks in other domains where there might be other variables of interest that can inform your final classification task.

And as I said, we're interested in people, right? So one simple thing you can use is co-occurrences, right? If, like, there's a photograph in which the image recognition algorithm can very well detect the presence of my wife, right, it's likely that the other person is me. In, say, 70 or 80 percent of the cases, right?

>>: Hopeful.

>> Ashish Kapoor: That's hopeful. But again it's statistics, as you will see. Like, we're not going to hard code it. And similarly you have events, right? People go to parties. They take a number of photographs. So there's a bursty nature to it. None of these photographs are actually independent; they are governed by an event, they occur clustered in time, and also then the people are clustered, right? If it's an MSR event going on, then you know that most of the people are going to be from MSR.
And of course you have locations, if you have additional information like GPS information, et cetera, right?

So let's try to think about how people have tried to use context most of the time, right? So usually people will try to use context as a feature. So imagine I have a classifier that can detect location, or a classifier that can detect the event. Then I can take the output of that classifier, use that output as a feature, right, and then I can basically train a high level classifier on top. The problem there is that, you know, it kind of depends on the quality of your underlying classifiers. So if my location classifier is not any good, it's not going to help me, right? And similarly the others, right? So here -- of course there are other methods that hard code the relationships of different phenomena happening and then try to exploit those relationships. But there is -- one of the things that sort of drove us in this work is that instead of assuming the existence of a primary task, such as face recognition, and secondary tasks such as location and event detection, what we can do instead is flatten this. We can actually use the bootstrapping capability of each of these. So in a nutshell, if I know the persons in the photograph and if I know the locations, then probably I can guess the events. Similarly, if I know the events, if I know where it occurred, I can sort of guess who the people are. And you can actually do this kind of combination in any possible way you like, right? And that's what we're trying to achieve. What we're trying to do is basically help these classifiers bootstrap off of each other. So instead of trying to train them independently and then sort of cascading that output into a high level classifier, we'll try to basically train them simultaneously so that they can bootstrap off each other, right? And to do that we'll actually use graphical models, right?

And some of the things, as I said, that we can use: we can use clothes, what clothes people are wearing. You can have a clothing feature or a clothing detector, time stamps. And people have used such things in the past. But again, these are nothing but features that go in. So it really assumes that you have a very well performing context detector. Right?

So let's try to build one thing at a time, right? So first I'm going to use a very simple model, right? So let's first think about constraints that are induced in a photograph, right? So if I have a bunch of photo collections, right, the first thing that I get from my prior knowledge is that if there are two faces appearing in the same photograph, they need to be of different people. So that's one kind of context. That's very simple. Similarly, if it's a video and I've been tracking a face, right, I can kind of see that it's probably the same guy, right? So it's a very simple constraint. And we are not adding any kind of people, location, event thing yet. It's just this very simple constraint. And how can we actually exploit it? So first I'm going to show you a graphical model that just tries to model these constraints. And on top of it we'll add the other modalities, right?

So, well, this is the graphical model. But I'm going to explain it to you one thing at a time, right? So, all right, let's start from the top. Let's basically -- let's assume that this is all covered, just the top portion, right? So Y1 through YN: imagine these are the true identities of N different face patches that you observed.
So if in a corpus you had 100 photographs, and out of that you basically extracted, say, 500 face patches. So one through 500, Y1 through YN is nothing but the true IDs. And basically what I'm showing on top is a potential function which is encoding a Gaussian process prior. But again, let's not get into the details. What you can think is that suppose you had a classifier that would take the Xs, which are the image patches, and it would output the IDs Y1 through YN. Basically the first half is nothing but a classifier, right?

The second half is what you are observing, right? If it is a shaded node, that basically says it's observed. So what you're saying is someone has given you the identities of T1 and T9, which are some face IDs, right? And the rest you need to infer, right? And then what you have is these constraints that are coming from your prior knowledge, right? So if two faces appear in the same photograph, that's represented using these red lines, which are basically saying that T7 and T9 cannot be the same ID, T9 and T8 cannot be the same ID, here. Whereas a green edge would say these all need to be the same ID. So intuitively there are two parts. The first part is a simple classifier that you have trained using any feature-based method, computer vision, right? But on the other hand you have these constraints that are coming in, right? So this is basically the graphical model, right?

So, again, if you're interested, these are the exact forms of the potential functions. But in the end what I'm going to do is just message passing, and a lot of -- all of these things are hidden variables, right? And once I do the message passing I should have a probability distribution over everything that's unobserved. So essentially, intuitively, your message passing is going to do the following: It's first going to classify, right? So that's all of these guys, right? Once it's classified, it's going to look at these constraints. Right? And it's going to resolve the constraints. So since all of these are probability distributions, right, you can basically use these constraints to further refine your beliefs in the final labels, right? So intuitively, first you classify and then you resolve the constraints. That will result in new labels, right? And I can actually use these labels again to further retrain my classifier. And you can keep doing this thing on and on until it converges. Basically what I've essentially told you is a variational message passing algorithm, where the first set of hidden variables is basically inferred using a classifier. Then you run some kind of message passing at this layer in order to resolve the constraints, right? And then again you pass the messages back so you can refine these guys again. That's really what's happening.

>>: [inaudible] for each class you have training examples.

>> Ashish Kapoor: The thing is, the way we've implemented it, we don't assume that. And, again, it really falls out from the structure of the Gaussian processes. Really it's basically, you know, the structure is the -- your classifier is of the form of a matrix times the labels. And these labels are nothing but indicator matrices. So basically if you have more classes coming in, all you need to do is just augment your label matrix. So, again, I'm skipping a lot of details. But it's kind of -- that's how we implemented it.

>>: Schedule for the number of classes, right?
>> Ashish Kapoor: Yes, again, here's the thing: for one inference step, we assume that we know the labels, right? But you'll see that we're going to do a bit of active learning on top; that's where you don't need to really -- you can break those boundaries there.

All right. Clear, right? The simple message passing: first classify, then go down and resolve the constraints. The way you resolve the constraints is basically loopy belief propagation; you pass messages back and keep doing it until it converges.

>>: [inaudible].

>> Ashish Kapoor: Well, the thing is, if it weren't for the loopy belief propagation, you could show the convergence, right? Assume that you can do this step perfectly, the constraint resolution step. The rest of it is a simple variational message passing, which is going to go to a local optimum. But because of the existence of loopy belief propagation, you cannot guarantee it. But in this application the graphs are such that the loops are fairly few. So it's not a problem. So the details, we don't need to go over them, right? All right. So --

>>: I have a question.

>> Ashish Kapoor: Yeah.

>>: There are a bunch of people who have done this kind of layering of classifiers. So you have a classifier at stage one and then you kind of train given the set of classifiers you have at stage two. So you basically are kind of stacking classifiers on top of one another. And this is similar in some ways, because here you can take the information given by the first classifier and then adapt to it.

>> Ashish Kapoor: Yeah, but also the thing is you need to notice that the information flow is both ways, right? So the classifier classifies, and based on those potentials you have your constraints and you resolve those. And the messages then get passed back. And so it refines the base layer classifier as well.

>>: For the top, the temporal evolution of those things. So you can think of it as stage-wise. So you have a classifier, and then you kind of have the update via the second layer and then you get the new --

>> Ashish Kapoor: Yes, there's basically an unfolding happening, yeah, you're right. It's a similar spirit, right.

>>: So what's different in that work is that they're doing it for a set of classifiers here. And so I think they're combining different information.

>> Ashish Kapoor: Again, probably I'll have to look at the paper that you're talking about exactly. But I think in a sense the idea is very similar: you have multiple classifiers, and they can all inform each other, right? So instead of doing it piece-wise, you can try to basically do it this graphical model way, in which you have a message passing way of training all the classifiers simultaneously.

>>: [inaudible] iteration.

>> Ashish Kapoor: You're doing an iteration, right. In an ideal world you should be able to do it.

>>: But you descend on all time --

>> Ashish Kapoor: Well, it's a kind of gradient descent, like. You can view this as doing a coordinate ascent where you have variables.

>>: You fix one, fix one and then --

>> Ashish Kapoor: Yeah.

>>: Then optimize.

>> Ashish Kapoor: Instead of fixing one variable you're fixing a group of variables. You can think of it like that also.

>>: Kind of alternate.

>> Ashish Kapoor: Alternation. Again, in an ideal world I would be able to do this inference exactly. Imagine if I had a huge sampling capability. Then essentially what I'm saying is that what I'm limited by right now is my inference procedure.
If there was a way to do exact inference, this would be as if I were doing it simultaneously. So one part is the model and the other part is doing the inference. For the inference, because of computational reasons, I'm using variational approximate inference, which has the flavor of doing iterations. But had I, magically, the capability of doing exact inference, then in a sense I would be training these simultaneously. Right?

>>: Another question. You have here the assumption that the prior is no longer -- which is --

>> Ashish Kapoor: There is basically a very good correspondence between this prior and the regularizer that you'd use in an SVM. It's basically the same thing.

>>: How strongly does it depend on this? So if the prior -- do you have some way to control -- what is the case if this is the wrong prior, how bad can it be?

>> Ashish Kapoor: Let me try to answer this question in a different manner. Imagine there are no constraints here. Imagine there is nothing here. Right? Then this model is the same as regularized least squares. Right? Then it boils down to the question: how good is regularized least squares? Again, you need to make sure that your features are appropriate. Right?

All right. So, again, let's move on -- so the good thing about this: this is like fully Bayesian inference going on. So I can do things like active learning using value of information computations, basically. So what I can do is I can ask questions like, you know, given T1 and T9 I have a posterior distribution over the rest of the label set, right? So I can ask, hey, if I knew T8, how is this posterior going to shrink? So now you can imagine that if I knew T8, I would automatically know what T7 is, right? Because of the green lines. But if I knew TN, then it does not give me that much information. So I can do this kind of reasoning in order to guide my active learning. Right? Imagine now I have hundreds of photographs but I also have a really long video sequence; then it makes more sense for me to get a label for a single frame of that video sequence, in order to have a lot of discriminatory information I can use to train the classifier.

So this is the first set of experiments, basically, that shows how well it works with active learning. Again, I mean, you can get these results in the paper, but basically as you get more and more labels, you tend to do better. The blue line here in all of these graphs -- so there are like four different datasets -- the blue line is if you don't have any constraint resolution; you're basically training as if you had a supervised classifier, right? And the rest are once you have them, right? So once you have the constraints, how well do you do, basically. Again, the results you can read.

All right. So moving on. Now comes the more interesting part. So far we were only talking about this modality, right? So from image patches, right, figure out the identities, and then you have these constraints from the real world, right? Now imagine you had other classifiers as well. I have a location classifier, I have event classifiers. For the people classifier I was just looking at the image patches. But for characterizing a location, maybe I can use the background. So from the images I can compute a simple feature that looks at the histogram of the background, right? Similarly for the event, if I have the EXIF data from the file I can just look at the time stamps and maybe even the GPS coordinates. So those are some features that you can use.
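Just to make those two cues concrete, here is a minimal sketch, assuming the photos are already loaded as numpy arrays with face boxes detected and EXIF timestamps parsed to seconds; the function names and the six-hour gap threshold are illustrative choices, not details from the talk.

```python
import numpy as np

def background_color_histogram(image, face_boxes, bins=8):
    """Coarse color histogram of the non-face background (a simple location cue)."""
    mask = np.ones(image.shape[:2], dtype=bool)
    for (x0, y0, x1, y1) in face_boxes:              # mask out detected face patches
        mask[y0:y1, x0:x1] = False
    pixels = image[mask]                              # remaining background pixels, (n, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins, bins, bins), range=[(0, 255)] * 3)
    hist = hist.ravel()
    return hist / max(hist.sum(), 1)                  # normalize to a distribution

def cluster_events(timestamps, gap_hours=6.0):
    """Assign an event ID to each photo by splitting the timeline at large gaps."""
    order = np.argsort(timestamps)
    event_ids = np.empty(len(timestamps), dtype=int)
    current = 0
    event_ids[order[0]] = 0
    for prev, cur in zip(order[:-1], order[1:]):
        if timestamps[cur] - timestamps[prev] > gap_hours * 3600:
            current += 1                               # a big time gap starts a new event
        event_ids[cur] = current
    return event_ids
```

The event IDs produced this way stand in for the event random variable used later; the talk only says events come from clustering the time stamps, so the gap heuristic here is one plausible way to do that clustering.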
So if there were no connections like this, then these would be individual classifiers, which basically look at the background scene and try to guess what location it is. Similarly, one looks at the time stamps and tries to classify, right? But now I'm going to put in these links -- this is where the message passing is going to happen again. So as I said, if I knew who all were in the photograph, and I can characterize the location, then it gives me a lot of information about the event. Similarly, if I know what the event was and what the location was, then I can get information about the people, and all possible combinations such as those. Right?

So let's look at how we do that. Right? So the crucial thing is the links between these guys, and in the beginning I also mentioned the fact that co-occurrences of people can help you as well, right? That's what's shown by this self-loop, basically. And let's try to see how we model these links. So I'm going to model these links basically using a relational table. It's nothing but a table, right? And all it encompasses is the statistics of people and events and when they are occurring. So basically an entry of 0 says that this guy cannot -- is probably not present at an event characterized by this group of pictures. I mean, it's a group of pictures that I have used to denote an event. But all this table is capturing is the likelihood of all these people being present at an event. You could even call it a compatibility function, basically. When it's 0, these guys are not likely to occur at this event. But these guys are. Right?

>>: Do you define an event as an arbitrary group of pictures?

>> Ashish Kapoor: The way we define it is clustering in time. So we basically run a clustering on the time stamps and we see that these guys are different events. That's how we are defining it. And similarly here as well. Right? So basically, you know, this is all unknown to us, but right now I have filled in those entries just to explain things, right? But if we had a table like this, this information can be very useful, and this is what we will use in order to do the message passing, right? So this is nothing but a compatibility matrix, if you like to think of it that way. How compatible are certain people with certain events, which is encoding the likelihood of their presence. Right? And, yeah, basically who they are and where.

So let's just work with two domains right now for simplicity. So we have the people domain, right, where we are looking at face features and clothes features; we can simply classify them as we did in the previous thing, right? Similarly in the event domain we have time stamps; you can simply classify. Now you want to sort of do the message passing, and that's where the relational model sits. So again, it's nothing but a simple compatibility function, which is nothing but a 2-D table. Again, the entries are not known to us. But if you knew those entries, then I could simply do the message passing as I did in the earlier slide, right? I classify these guys, I classify these guys, I look at the relational model, and I basically try to resolve the constraints and pass the messages back and keep doing it. The problem is I don't know what the relational model is. And what we're going to do is we're going to actually learn that as well. And, again, in terms of the probability formulation, if you're interested in the formulas I can go into the details. But it has only three terms.
The first two terms are the unary potentials, which are nothing but the unary classifiers, from basically here to here, here to here. So what is the likelihood that, if I observe certain facial features and clothes features, the ID of the person is something. And the second term -- sorry, this term -- is nothing but the compatibility across the two domains, right? And, again, we can go into the details, but I'll probably keep it at a very high level. As I said: classify this, classify this, pass the message along if you know what this guy is. Right?

And so the way we are going to do it is basically EM, right? And again, the parameters of this guy -- think of them as a theta, a variable. So in an EM step, right, you have an E step where you first figure out -- you basically classify these guys, right? And given your theta, you can compute the likelihood of the model, right? And what you can do is try to optimize that with respect to this guy. So in a nutshell, right, what's really happening is you first infer a distribution over the labels given some initial value of theta, which can be uniform, right? And then once you have that, you can try to maximize the model likelihood over your thetas, and you can just keep doing it. So it's variational inference, but on top of it there's an EM step going on. Basically, all you give to the algorithm is a bunch of images, a bunch of face patches, and it automatically not only classifies them but finds out the latent relationships. By using this graphical model, what you have done is you have provided a structure, and you're exploiting that structure in order to recover the variables of interest. Right? So in a nutshell, right, it's a very simple model: classify these guys, classify these guys, given this thing find the likelihood of the model, then optimize with respect to this guy to maximize it, and keep doing it, right?

>>: So it's not clear to me where the initial model is coming from. So in this example you said you have these reds and whites, the 0s and the non-0s, in this matrix. And you drew it so that that's the structure you have in that graph. Where did that come from?

>> Ashish Kapoor: Well, that's the thing. This is unknown to us. So that's the parameter that you need to find out as well. And that's what the EM step is doing.

>>: The structural element to it -- structurally, it's the red versus the white. And the M is the parametric part of it. So are you also identifying the structure -- what's not clear to me is the structure as well.

>> Ashish Kapoor: The structure is fixed.

>>: How do you get the structure of the red versus the white?

>> Ashish Kapoor: The structure is very simple. So this is nothing but a giant table. It's a lookup table, right? So in graphical model terms, right, this is a random variable and this is a random variable. And an entry here basically tells you the compatibility of this value of that random variable with this one. So it's basically a simple 2D parameterization, and that's the way it's fixed.

>>: So it's not red versus white. You don't have any structural information. So the graph in that second picture is misleading, in the sense that you have edges going from every person to every event.

>> Ashish Kapoor: We do actually have that. So, well, one thing to think about is the following. This is one random variable -- the rows represent one random variable. The columns represent another random variable, corresponding to each face patch. Correct? Right?
So then basically this is nothing but a potential function between this guy, this random variable, and this random variable, right? And, yeah, that's pretty much it.

>>: So does this table contain any additional information that's given at the outset, or is it something that you extract --

>> Ashish Kapoor: That's something we'll extract ourselves. So this table is unknown to us. It's unknown.

>>: Until someone says look at this picture, this --

>> Ashish Kapoor: Exactly. So here's a hypothetical way to learn this table, right? So I have, say, unary classifiers that classify people just based on the patch, the location, based on those things.

>>: Classifiers.

>> Ashish Kapoor: Yes, probabilistic classifiers. Assume they're an SVM or anything. And what I can do is look at the outputs of those classifiers and come up with a table. As long as the classifiers are informative, not completely random, I'll have some reasonable estimate here, right?

>>: What I don't understand is that this doesn't add any information.

>> Ashish Kapoor: The information is coming in terms of structure. That's what I'm trying to say, right? What I have encoded in the model is the fact that certain people tend to be present at certain events but not at others. Right? And that's basically what I'm saying using this. Like, there is a --

>>: So I'm trying to understand what you said. So in the table itself, if it was just extracted from this data, there's no additional information?

>> Ashish Kapoor: Yeah, exactly right. So the additional information comes from the fact that you say --

>>: I'm imposing the fact on the model.

>> Ashish Kapoor: Some prior on these tables saying, for example, these two guys, they tend to appear together.

>>: Appear together.

>> Ashish Kapoor: Something like that.

>>: Yes.

>> Ashish Kapoor: So the structure is -- the additional information is not in this table but it's something on --

>>: Here's the thing, right? Suppose I were just training independent classifiers, right? What's the difference between this model and that? Right? The only difference is these links. So by explicitly specifying this dependency, I have added information.

>>: In other words, just because two pictures appeared at roughly the same time, you're now biasing.

>> Ashish Kapoor: Constraints that you're adding, exactly.

>>: You mentioned, for example, you cannot be the same person in the --

>> Ashish Kapoor: The same --

>>: For example, it also seems you want to encourage the same people appearing in the same way. What are the specific examples of these things that you instantiate? The prior comes from these samples. There's some model there.

>> Ashish Kapoor: All right. So one is -- when we talk about a prior, I think the way I'm thinking about the prior is probably slightly different from how you're thinking about the prior. So let's start from scratch, right? If I have just a unary classifier that looks at the faces and classifies them, that's a simple classifier. It can be an SVM, it can be logistic regression, it can be decision trees, it can be whatever. Similarly here I can have another classifier. So now what I'm going to say is that the outputs I observe here are not only dependent upon this but are also going to be dependent on the output of this guy. This I'm explicitly encoding in my model, and that's the extra information, right? And there's no other extra information. I'm not specifically hand coding any kind of statistics or anything.
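As a rough sketch of the alternation being described, assuming generic probabilistic unary classifiers whose per-face-patch and per-photo posteriors are already computed: the compatibility table is estimated from soft co-occurrence counts and then used to re-weight the face posteriors. The names and the exact update below only convey the flavor of the EM-style loop, not the talk's actual potentials or partition-function handling.

```python
import numpy as np

def cross_inference(face_probs, event_probs, photo_of_patch, n_iters=10, eps=1e-6):
    """EM-flavored alternation between face beliefs and a people-event table.
    face_probs:     (n_patches, n_people) unary face-classifier posteriors
    event_probs:    (n_photos, n_events)  unary event-classifier posteriors
    photo_of_patch: photo index for each face patch
    """
    q_face = face_probs.copy()
    q_event = event_probs[photo_of_patch]              # (n_patches, n_events)
    for _ in range(n_iters):
        # "M-step" flavor: expected co-occurrence counts -> compatibility table theta
        theta = q_face.T @ q_event + eps                # (n_people, n_events)
        theta /= theta.sum(axis=1, keepdims=True)       # row-normalize per person
        # "E-step" flavor: message from the event side re-weights each face posterior
        msg = q_event @ theta.T                         # (n_patches, n_people)
        q_face = face_probs * msg
        q_face /= q_face.sum(axis=1, keepdims=True)
    return q_face, theta
```

In the terms used above, filling in theta from the current beliefs plays the role of the M step, and the message passed back to each face patch plays the role of the E step; iterating is the coordinate-wise refinement the speaker keeps referring to.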
>>: Original version of the matrix that said I found --

>> Ashish Kapoor: If you knew this model -- if you knew this model, it basically just tells you how to marry these two labels. This is just a compatibility function, right? All it's saying is how do I make sense of it when I observe things here and things here. Right? When it's absent, there's no way for information exchange, right? So you basically say, all right, whatever my classifier is giving me is fine.

>>: The structure, so it will say -- so the same label cannot be in the same image; Joe appears in the event, at least one face must be Joe. So what are these templates? What are the potentials you're [inaudible].

>> Ashish Kapoor: There are three, four different kinds of potentials. Basically one is people-event. So you have one face patch and you have a random variable corresponding to the event that the image was taken in, so there's a compatibility function there. Similarly between event and location. And, again, location and people, and people and people, so for all those four links you have four compatibility tables. But again, this is getting into the details. Assume I just had two domains -- let's not even worry about the specific application. If I only had two domains, right, if I had simple classifiers, there's no way of exchanging information unless in my model I explicitly, you know, spell out the dependency, right? And this dependency is through this relational model, which is parameterized. So if you know it, then you know how to sort of marry those two domains, right? So that's --

>>: Let me make sure I understand correctly. So the matrix is not per image but per group of images that are timestamped, clustered --

>> Ashish Kapoor: Every face patch -- so every face patch has a -- so you have a face ID, right, and you have an event ID corresponding to every face patch, right?

>>: But if you're going to have multiple pictures with the same event ID, this is -- so this is where you --

>> Ashish Kapoor: Exactly.

>>: This is where the --

>> Ashish Kapoor: This is where -- so those parameters are shared across, yes.

>>: Because if you didn't have --

>>: No.

>> Ashish Kapoor: These parameters are all shared across all the instantiations.

>>: Through the table?

>> Ashish Kapoor: Yeah.

>>: So the 0.9 there, for the first event, if you learn this, it should be --

>> Ashish Kapoor: Let's not worry about the learning. Suppose this is the table that someone has given you. Someone has hand coded it from prior knowledge -- he basically looked at the invitation lists of different events, like someone went in, hand coded this table, and gave it to the model, right? So this is how it would look, right? So basically this table is saying that this woman, with a very high likelihood, will be present at this event, right? So if I detect this event, right, then the chance that this face is of this ID is very high. That's all it's saying.

>>: You take a picture of that event.

>> Ashish Kapoor: Any picture of that event.

>>: You apply that potential for all pictures in that event, is that what you're saying?

>> Ashish Kapoor: Yes.

>>: If she's in multiple pictures, is that more likely then? Can you ignore the face patch, whatever face patch it is --

>> Ashish Kapoor: Yeah, it's that.

>>: Face patch.

>>: It's pooling --

>> Ashish Kapoor: Yeah, it's basically saying -- it's not looking at features at all, it's just looking at the random variables on top. It's basically a compatibility function.
It's a way of saying that I am going to be present at this event, right?

>>: Trying to understand what the point nine means. Does it mean --

>> Ashish Kapoor: It's not a probability. It's some number. It's a function. So the thing is, this is an [inaudible] graphical model. So if this number is high, it basically says that the likelihood of this person is high.

>>: If it's high. If it's 0 --

>> Ashish Kapoor: That means completely --

>>: That face should not appear.

>> Ashish Kapoor: Exactly.

>>: If it's 1, does it mean [inaudible].

>> Ashish Kapoor: No, no, it does not.

>>: Sometimes it does mean it has to --

>> Ashish Kapoor: If it goes to infinity, but that's --

>>: Use the scale. So if we have infinity it has to.

>>: Does it have to appear in the event, or does it have to appear in every image of the event?

>> Ashish Kapoor: What it's saying is -- it does not tell you about every image. All it's saying is that if an image is clicked at a certain event, then if it goes to infinity, that basically says that that person should be there.

>>: In every one.

>> Ashish Kapoor: Yes.

>>: It's not that it's present in the event.

>> Ashish Kapoor: Present in the image. These are unnormalized. 0 means definitely what you said.

>>: It goes from 0 to infinity.

>>: It's not pooling across all images saying at least one where they're present. It's [inaudible].

>>: When you estimate it, it's pooled.

>> Ashish Kapoor: When you estimate it, that's how you'll pool it. That's when you pool it. But if I'm just given this function, right, it's basically telling me that this is the compatibility between these two random variables. That's all it's saying. It does not tell me whether it's certain or not, because it's not giving me a scale.

>>: The maximum value --

>> Ashish Kapoor: Yeah.

>>: Across all --

>> Ashish Kapoor: But this is unbounded. This function can have any real value, actually. It can also go negative. Like, we didn't implement it that way, but it can actually go negative. And so, again, all of this actually gets resolved when you compute the partition function. But let's not go there. But, again, if someone has given it to you, then that's how you need to interpret it. The question is, we don't even know what it is to start with, right? So the --

>>: For an image, you can use the thing on the left to say here are the people in it. You can use the thing on the right to say this is what the event is. Based on the matrix.

>> Ashish Kapoor: Exactly. That's what's going to happen. That's the way we're going to learn, right? Learn those parameters. First you classify these things individually, then count, come up with that table. Once you have that table you can further refine these guys as well, and keep doing it. Again, it's an EM kind of thing.

>>: So in that framework, where do the constraints get input? Are you solving a constrained optimization, is that where you put them in?

>> Ashish Kapoor: Well, each step is basically -- again, we are doing it very simply. So the way it is actually implemented is: first, all right, fix this guy. So first classify, classify, and then estimate whatever it is. Then fix this guy and this guy, and then sort of redo this thing; then fix this, fix this, then redo this, then redo this, and you keep doing this.

>>: Can you go over the joint optimization framework?

>> Ashish Kapoor: It's coordinate descent.

>>: Does it guarantee --

>> Ashish Kapoor: It's a local minimum. It's variational message passing. Exactly, it's basically a variational algorithm.
Same flavor as doing clustering with a mixture of Gaussians, exactly the same thing. And the other details of it -- again, as I said, it's basically an iterative procedure. We can basically look at some of the results, right? So if I was just using a simple face classifier, right, just image features -- well, there are two different datasets -- I can do up to, say, 39 percent. That's the best. It's basically --

>>: It's state of the art.

>> Ashish Kapoor: State of the art, the MSR face recognition library. We collaborated with the Bing folks on this, and pretty much -- this is like around a 30, 40 class problem. It's a pretty big -- so the margin is less than one percent. So even state of the art on a hard image recognition task is not more than 30, 40 percent, actually. And these are real -- these are [inaudible] datasets. These are all weird kinds of poses and all that kind of thing. So I think this is 15 classes. This is like 30 classes, the earlier one. This is the 30 class problem. And this is pretty much state of the art. This is the best they could do on this dataset using a number of different algorithms. So Simon Baker and [inaudible], they are face recognition gurus, essentially.

So next, what if you start to incorporate the statistics of co-occurrences of people? So it goes up a little bit. Not that much. Still, it's a hard problem. And co-occurrences, you can imagine, are limiting. As soon as you start incorporating events and combine these two, it shoots up pretty much to 95 percent. And if you imagine, it's a 13 class problem. But if you go to different events, with, let's say, groups of two or three people, certainly knowing an event reduces your 15 class problem to a two or three class problem, and that's a big -- that's a lot of information you can actually exploit. Right?

>>: Do you have any sense how it decays when you have a much larger -- 20 isn't -- if I look at my photo album, 20 is not that big. So do you have any sense how it decays?

>> Ashish Kapoor: So that's something that we are now actually looking at, especially looking at large corpora. These are preliminary studies. But I mean, our sense is that the more classes, the better the gain is, just because if I can reduce a 300 class problem to a five class problem, it's better than going from a 15 class problem to a three class problem. So --

>>: Better --

>>: Whether the relative gain is better is the question.

>> Ashish Kapoor: Whether the relative gain is better, and, well, in terms of -- it's a much harder problem to do a 300-class problem. So I mean that's a little bit harder to compare. But that's the initial sense. Again, I don't have concrete numbers on it yet.

>>: Well, at this level, not even automatic, you could make a great UI -- this person, this person, this person -- it would be real easy. But when you have 200 people, now is it still going to be useful or not, to be able to create a useful UI?

>>: That would be nice, to get the complexities.

>>: Complexities.

>> Ashish Kapoor: So --

>>: Large vocabulary. Large class.

>> Ashish Kapoor: The thing is, we looked at a number of different Facebook datasets. And usually the number of different identities that people have rarely goes beyond 40 or 50, actually. 300 is a really rare dataset, actually. I mean, I don't know -- like my personal albums -- if it's even like 300 identities, probably some of those faces are occurring only once or twice.

>>: Identifying the fact that they only occur once or twice, that's a different problem.

>> Ashish Kapoor: Different problem. So that's a different scale of problem.
>>: What's different about the UI? If you're talking about identification, with these things the easy part is kind of the mistakes of the machine and the hard part is to correct them. So to look through a list of 300 people is a time consuming thing, but it seems that here you can just point to this. You can just say this is not what --

>> Ashish Kapoor: The constraint comes in.

>>: The constraint comes in. Redo the thing -- that would be a very useful type of --

>>: That's essentially pretty awesome.

>> Ashish Kapoor: Yes. So basically, well, those red lines that we're talking about basically kick in and you can do the message passing.

>>: They actually have it in Windows Live Photo; I don't think they do it with events. Combining the red lines you're talking about with the event clustering --

>>: Kinds of mistakes you don't have to correct.

>>: You can optionally point to the corrections or mistakes. Say these guys are right, these guys are wrong. Essentially pretty reasonable. It would plug in nicely to this.

>> Ashish Kapoor: So they're trying to put this onto an iPhone, so that you can do some of these UI kinds of things. So the problem is not just machine learning; I think there are a lot of usability issues. I think they're trying to work it out.

>>: It has these types of input feedback. So they can take this [inaudible] with it.

>> Ashish Kapoor: Yeah.

>>: And can you figure the difference between LAP and people-to-people? If you look at the right-hand side you actually have more -- you have the relationship, you've got more increase than on the right-hand side. Any idea about that?

>> Ashish Kapoor: So you mean like this jump is more, but not this much?

>>: Between ILT and PT, the smaller [phonetic].

>> Ashish Kapoor: I think it's really hard to figure out how much the distance, like -- I mean, that's something we probably should look at. But I don't have a good answer. But I think it's an interesting point, an interesting observation, I would say.

>>: What are the --

>> Ashish Kapoor: Like 500 images. So this one is 500 images. This one -- so 500 instances of thirteen classes. So events are more on this one; there are far more events. So again, these are standard Facebook datasets. Basically a couple of people at Bing, they uploaded their albums and that's how it is.

>>: So eventually, when you measure correctness, is it in the clustering sense, right?

>> Ashish Kapoor: No, it's in an identity sense, classification.

>>: So what is the label?

>> Ashish Kapoor: So the label is the ID of the person in the image patch. That's what we're looking at right now. So each image patch has an ID associated with it, right? And our goal is to do the classification in that sense. Find the --

>>: But then you'll have to have another dataset in which you have some images of this person with the right ID. Otherwise --

>> Ashish Kapoor: Oh, yeah, yeah, exactly right. That's how we do it, right? Out of these, there are some -- some instances that are seeded with labeled data, and then --

>>: How many?

>> Ashish Kapoor: I think it's 40/60, but I need to go back. I didn't put it up here. But it's something like that, around -- so there are some of them which are used as training data; in a sense it's semi-supervised, if you like to think of it that way. You have a corpus of data and a few of them are labeled, but not all of them. And you do all this message passing in order to --

>>: But then some of the reason for these kinds of things is the -- there would be a big difference: if, with say 40 percent labeled, every event has a labeled image, then it is easy, right?
Whereas if an event --

>> Ashish Kapoor: Not necessarily.

>>: If an event doesn't, then it is hard, it's not trivial. It's definitely easier if for every event I have an image which is --

>> Ashish Kapoor: I don't know exactly. One thing, if you look at this diagram again, these guys are pretty independent, you know, when I'm training these classifiers on these features, right? The only relational-model thing happening is between these, right? So as long as I have enough data to train a reasonable unary classifier, I'll be okay. If, for instance, I have a situation where all my events are during the day and one event was at night and it's all dark and the image features are completely screwed up, then I'll have problems, probably.

>>: Events as well? Just --

>> Ashish Kapoor: Basically we have labeled faces only to start with. And initially, if you have time stamps on the images, we just run a clustering on the times to basically come up with IDs for the events.

>>: I think this is kind of the next question after Ryan's question: what happens if you obscure a face? What if you take someone's face and hide it in the image -- are you able to identify this person as well?

>> Ashish Kapoor: It depends on the relational model, if you have a strong relational model. It will give you a probability distribution even if you don't have this.

>>: [inaudible].

>> Ashish Kapoor: We have not. That's a good idea. We should probably try it.

>>: Because people do these kinds of things in large groups much better. Another image set which --

>> Ashish Kapoor: That's a good idea. Someone is doing this -- like, you can say, all right, I know who you are.

>>: Go to image sets. One image set -- is there a difference?

>> Ashish Kapoor: Again, I need to go back and look at it in detail. Honestly, these are some of the good issues. I think it might be -- I should do that.

>>: Assuming you have 100 percent accuracy in face patch identification.

>> Ashish Kapoor: Face patch -- it's like, I mean, it's pretty good. Again, it's state of the art. And --

>>: It does terribly when you see the back of a head.

>> Ashish Kapoor: The pose and all that.

>>: You should be able to identify somebody from the back of the head in lots of cases.

>> Ashish Kapoor: That's an interesting -- I should try to do that.

>>: It shouldn't be a face patch. It should be --

>> Ashish Kapoor: Like a --

>>: Head patch. And now you can -- it's a different identification problem.

>> Ashish Kapoor: That's a good idea.

>>: That's level three: if you see the face and he shouldn't be with her, he should have been with me. [laughter] But she's the wrong one.

>> Ashish Kapoor: That's funny.

>>: Body parts, too.

>>: Yeah.

>> Ashish Kapoor: Actually, you can make a surprise detector.

>>: Because you have a picture of you with your wife and -- she shouldn't be with you, she should be with -- [laughter].

>> Ashish Kapoor: All right. So the next set of experiments was: what if we use these unary classifiers as features -- the first thing I talked about -- instead of doing this relational inference? What if I have these context classifiers but use them as features in classification, right? So using faces again, only around 39 percent. Clothing bumps it up. If I use a timestamp -- some people use a timestamp as a feature, as context -- it doesn't do well. Cross-inference again is what we talked about. Really, if your context classifiers are not that strong, you're basically screwed. That's probably not the best way to go about it. That's what this is trying to show.
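For contrast, a minimal sketch of that context-as-feature baseline, assuming scikit-learn and pre-computed posteriors from separately trained location and event classifiers; the helper name and the choice of logistic regression are illustrative, not what the experiments actually used.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def context_as_feature_baseline(face_feats, loc_probs, event_probs, labels, train_idx):
    """Stack the outputs of separately trained location/event classifiers onto the
    face features and train a single face-ID classifier on the combined vector."""
    X = np.hstack([face_feats, loc_probs, event_probs])   # (n_patches, d + n_loc + n_evt)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X[train_idx], labels[train_idx])               # fit only on the seeded labels
    return clf.predict_proba(X)                             # posteriors over face IDs
```

If loc_probs and event_probs come from weak classifiers, the appended columns carry little signal, which is exactly the failure mode described above; the cross-inference model instead keeps those estimates soft and lets them be revised.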
And interestingly -- so far I had only talked about people-people and people-event, but I had nothing about people-location, right? So, in fact, the gain from people-location, if I have people-event already, is very marginal. Usually location and event are highly correlated. So if I incorporate event, it's kind of the same as having location altogether.

>>: But the thing is, if you have multiple albums for different people, the best segmentation is per album, a location; you'd be able to start having edges with different people's --

>> Ashish Kapoor: You have those cases.

>>: So if you're talking about a party, or if I'm traveling now, then it's going to be --

>> Ashish Kapoor: It's different, yeah. I mean, the thing is, yeah, it's basically how well you can characterize the event and how much correlation there is between the event and the location. In this dataset, unfortunately, there was a high correlation.

>>: If you have two images from the same event it's not necessarily true, whereas it's unlikely to assume that on the same day you'd want to be in Europe and Australia [phonetic].

>> Ashish Kapoor: Yeah. Again, with some intricacies of the datasets here, probably the location and the events occur together. But then the thing is that --

>>: It's because -- you do have to have the event. You have to --

>> Ashish Kapoor: Yeah, we tried all possible combinations. And the good thing then is that location labeling is very hard, because I'm just looking at the background of images, right? But now, if I start incorporating the event, that accuracy goes up. So even though it didn't help us with recognizing people, the accuracy of finding the location went up, because now -- so again, it's kind of saying that all three of the classifiers are benefiting because they can do this bootstrapping off each other. And, yeah, basically the accuracy for event labeling as well, similar results.

So, again, the cool thing is that it's again a Bayesian model, so again I can do that trick of active learning. That is what it's showing here. The red curve is basically active learning with cross-inference, right? 150 images. If I'm able to label even around 39, 40 images, I pretty much nail the rest of the dataset, just because of this information flow going around. And again, there are like four different kinds of sampling strategies.

All right. So basically, pretty much, I think the message is: it helps sometimes if you can model the relationships between different domains, specifically if they're dependent upon each other, instead of trying to use them as features into a [inaudible] classifier. Again, you can imagine doing this with other classifiers beyond faces, like mobile scenarios, or -- a lot of people actually -- we are thinking of using it for human behavior modeling, like inferring people's cognitive states based on what they're doing, and often those are also very co-dependent. So you can exploit that. I'm sure you can do a lot of stuff in health. Specifically, many of these signals are probably dependent on each other. And instead of using all the signals as features, you might want to think about modeling them individually and then trying to model the relationships across them. So that concludes it.

[applause].

>> Ashish Kapoor: All right.