>> Larry Zitnick: Hi. It's my pleasure to introduce Dhruv Batra. He's a research assistant professor at TTI Chicago. Last spring he spent some time as a visiting researcher at MIT with Bill Freeman, where he was working on learning latent structural SVMs. Before that he also interned at MSRC with Pushmeet. And before that he was a Ph.D. student at CMU, where he was advised by Tsuhan Chen and worked on co-segmentation using interactive methods.
>> Dhruv Batra: Thanks, Larry. Thank you all for coming. So I gave a talk at MSR a couple of years ago. And I think one of the things I said was it's really nice to be here at Microsoft, and one of the things I've always wanted to do was give a talk at Microsoft with a Mac. That was two years ago. I've been doing that since; I've been giving talks with a Mac. This trip I forgot my dongle at home, and so yesterday I had to go to an Apple store and buy one. So I guess karma got its way with me for making that remark. But, okay. So let me begin.
If I had to summarize what we were doing in machine learning 20 years ago, this picture essentially summarizes what we were interested in, right? We were interested in partitioning two classes, and finding the best ways of partitioning two classes. Faces from nonfaces, digit three from digit six, chairs from tables. This is the canonical thing that we were interested in. And if I had to say what changed in the last 20 years, I would say the thing that changed is that now we're interested in much larger output spaces. We're interested in exponentially large output spaces. And I'll show what I mean by that.
So segmentation, which is a problem that I'm interested in: you're given an image, maybe some user scribbles, and your output space is, at each pixel, either a binary label -- 0/1, foreground/background -- or one of K categories that you know exist in your dataset. So the space of possible outputs is the number of labels to the power of the number of pixels; that's the output space that we're essentially searching over. Or in object detection, where to find an object we work with parts-based models: we say a bicycle is made up of maybe a wheel, maybe a bicycle rim, or a person is made up of these parts. We search for these parts -- where are these parts located -- with some spatial reasoning. So the output space in this case is the number of pixels, which is the number of possible locations of these parts, raised to the number of parts. Again, an exponentially large output space that we have to search over. This can be not just in a single frame but in video. So you're trying to do person layout in video, and the exponent goes up by the number of frames that you're dealing with.
In super resolution, one of the early, one of the fundamental models that we have is a graphical model which says: I'm trying to resolve this image into a much higher resolution; what I'll do is collect a dictionary of low-res and high-res patches. For each input patch I'll find the closest low-res patch in my dictionary and replace it with the corresponding high-res patch. So the space of outputs that you're searching over is basically your dictionary size raised to the number of input patches. Again an exponentially large output space. And this is not just vision. People in NLP are really interested in these problems of dependency parsing. So there's a parse tree that tells you this word modifies this other word in the sentence.
So it's a directed tree on the space of words in the sentence. How many directed trees can I form? It's again N to the N minus 2, where N is the number of words. So that's the space you're searching over. And it could be information retrieval. I go on my favorite search engine, Bing, and I search for some documents. The output space in this case is the number of documents factorial; that's the space of rankings that I have to search over.
So in some sense, if we're on this side of machine learning, if we have to search over exponentially large output spaces, then we need to revisit some of the same issues that we addressed for the two-label case. We have to understand how we hold distributions over exponentially large objects. My running example will be segmentation because it sort of makes sense. How do I hold a distribution over the space of all possible segmentations, which contains checkerboard segmentations, the stuff I'm interested in, all white, all black, all possible segmentations? Given this model, how do I perform inference in it -- how do I find the most likely segmentation I'm interested in? And learning, which is: how do I learn this from data? No expert is going to be able to hand this to me.
All right. So my work has been on all three of these issues, with structured output models. And today I want to talk about a couple of things. I'll talk about this one piece of work, which we're calling the M-best mode problem, and I'll go into the details of that. So the first part will be a modeling and inference question. The second part will be a pure inference question. And time permitting, maybe there will be some other teasers of things that I'm working on.
So let's get started. Here's the problem that I want to talk about. This is what we're calling the M-best mode problem. And in order to tell me -- in order for me to tell you that problem, let me give you the model. The model that we're going to be working with is the conditional random field, just so that I hand you the notation. We're given an image; there are some variables, so let's say pixels. I'm showing a grid graph structure, but I'm not making any assumptions on the graph structure; this is just a running example. At each pixel, or node in my graph, I have some labels that I have to assign. So this pixel might be a car, a road, grass, a person. So a set of K labels. Also, I'm handed an energy function, which scores all possible outputs; somehow it gives me the cost of each output. And that is represented by a collection of node energies, or local costs. So it might be a vector of 10, 10, 10, 0, 100 -- and because you're minimizing cost, this variable prefers to be the label with cost 0, which is grass. Also, at each edge I'm holding some sort of smoothness prior: nodes that are adjacent to each other in this graph tend to take the same label. So this is just showing, if you take the same label your cost is 0; if you take different labels your cost is 100. It's just encoding my prior information.
Now -- and instead of thinking of this as a cost function, you could think of this as a distribution. If you just exponentiate the negative energy and normalize by a constant, this becomes the distribution that I'm showing. That's fairly straightforward. It's a discrete space so you can sum it out; computing that summation might be really hard, but it's easy to think of this as a distribution.
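To make that setup concrete, here is a minimal sketch -- my own illustration, not code from the talk -- of the kind of energy function being described: per-pixel label costs plus Potts edge costs that charge a penalty whenever neighboring pixels disagree. The grid size, the random costs, and the penalty value are made up for illustration.

    import numpy as np

    def grid_edges(h, w):
        """4-connected grid edges over an h-by-w image (pixels indexed row-major)."""
        edges = []
        for r in range(h):
            for c in range(w):
                i = r * w + c
                if c + 1 < w: edges.append((i, i + 1))
                if r + 1 < h: edges.append((i, i + w))
        return edges

    def energy(labels, node_cost, edges, penalty=100.0):
        """E(x) = sum_i theta_i(x_i) + penalty * sum_ij [x_i != x_j]  (Potts prior)."""
        e = sum(node_cost[i, labels[i]] for i in range(len(labels)))
        e += sum(penalty for (i, j) in edges if labels[i] != labels[j])
        return e

    # Example: a 2x2 image with 3 labels and random unary costs.
    h, w, K = 2, 2, 3
    node_cost = np.random.rand(h * w, K) * 10
    edges = grid_edges(h, w)
    labeling = np.zeros(h * w, dtype=int)   # all pixels take label 0
    print(energy(labeling, node_cost, edges))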
Now, the task that we're interested in, typically, is: given this distribution, find me the best segmentation under this distribution. And that can be expressed as: given this cost function, find me the lowest-cost assignment of these variables. I have some node potentials and edge potentials; let me find the lowest-cost assignment -- that's my best segmentation. And this is too general a problem; in general it's NP-hard. I can reduce max-cut to this. I can reduce vertex coloring to this. There are inapproximable problems I can reduce to this. So in the general case this is hopeless. And faced with an NP-hard problem you bring out your standard tool set: you either write down heuristic algorithms, greedy algorithms, or convex relaxations.
Before we do that, before this turns into an optimization talk, let's think for a second. Is computer vision really an optimization problem? Or is it just an optimization problem? Right? If you did have an oracle that could solve this optimization problem, would you be happy? Would computer vision be solved? We have addressed that question before, and the answer is no; we've done large-scale studies. So in this paper by Meltzer and Yanover, they found that if you take current instances on some datasets, take your models, run exponential-time algorithms and find the global optimum under existing models, those global optima still have some of the same problems that approximate solutions do. They tend to oversmooth. They miss certain objects. They're not good enough. In fact, even worse than that -- and this was work done by Rick and others -- if you compare the energy of the ground truth, it turns out that the ground truth has much higher energy than the MAP. It turns out that your model thinks the ground truth is much less probable than this other thing that it believes in. So not only are our models not perfect; when we spend more time coming up with better approximate inference algorithms, we somewhat move away from the ground truth. More time spent on better inference algorithms takes you away from the ground truth, in some sense, and that's sort of disappointing.
The reason for this is somewhat obvious. Our models aren't perfect. They're inaccurate. They're not completely garbage, though; they have some reasonable information in them. So while one solution to this problem might be to just learn better models -- right, go ahead and learn better models, seems simple enough -- what I'm going to say is you should be asking for more than just the MAP. You've learned this rich distribution from data or from an expert or from some source. Why extract just a point estimate? Some people have looked at this problem in the context of combinatorial optimization. This is called the M-best MAP problem: instead of just the best solution, they'll find the top K best or top M best solutions. Can anyone think of a problem with this approach? If you were to find the top M solutions, what is the problem you would run into?
>>: [inaudible].
>> Dhruv Batra: They'd be nearly identical. Any reasonable objective function will have some peak; when you ask for the top M solutions they'll be nicely clustered around that peak, and these solutions are essentially useless for you. What you would like to solve is this M-best mode problem, where you can do some sort of mode hopping, where you want to know other things that your model believes in.
And this is the problem that we're calling the M-best mode problem. I want to be very careful: we're working with a discrete space, this is not a continuous distribution, so what does mode mean? I'll formalize that in a second. But before I tell you how we can solve the M-best mode problem: what would you do if you did have an algorithm that produced some diverse set of hypotheses? What would you do with it? One thing you can do is, anytime you're working with interactive systems, anytime there's a user in the loop -- so this is interactive segmentation, a person scribbles on the image and you present to the person the best segmentation -- instead of just the one best you can present some five best. But you have to ensure that those bests are sufficiently different from each other, that they're diverse hypotheses. So anytime there's a user in the loop you can present those solutions and the user can just pick one. So you minimize interaction time. The other thing is you can rerank those solutions: you can produce some diverse hypotheses and run some expensive post-processing step that ranks them. This is the current state-of-the-art segmentation algorithm on the Pascal segmentation challenge. What it does is it takes an image and produces close to 10,000 segmentation hypotheses. And these are highly correlated, highly redundant, but there are many segmentation hypotheses, and it uses an external segment-ranking mechanism to rerank these segmentation hypotheses. You might ask, if I have access to a ranker, why don't I just optimize the ranker, why don't I search for the thing that would be best under the ranker? Ranking can be expensive. You want to evaluate the expensive thing on only a small number of things.
Okay. So if we're now convinced that this is an interesting problem, let me show you how to do this. I'm going to present to you the formulation of the problem. I'm going to be working with an overcomplete representation. I said that each pixel had a variable that could take a label from some discrete set of labels. Instead of representing it as a single variable, I will represent this as a Boolean vector of length K. So there's a vector whose length is the number of classes that this node can be labeled with, and an entry of one in one of the positions means that that's the class this variable takes. So if the one is in the first entry then XI is set to 1. If it's in the second place then XI is set to 2. And we disallow configurations where there is more than one 1 in the vector, or zero 1s in the vector. You can do the same thing for an edge between variables. Now the vector just becomes much longer -- it becomes K squared. Now you're encoding all K squared pairs of labels that these two variables can take. So if you have a one in the first place, that means XI is set to 1 and XJ is set to 1. If you have a one in the second place, it means XI is set to 2 and XJ is set to 1, and so on for all K squared of these. And notice that these are not independent decisions, right? The decision you take at an edge has to agree with the decision you take at a node. If this encoding is saying that XI is set to 2 and this encoding is saying XI is set to 1, that is not allowed. Why do we do this? Why do we blow up the set of variables? The reason we did this is that now the energy function, the cost function I showed you, can be written down as a dot product. I pull out the cost of each label, multiply that with this Boolean vector, and it exactly picks out the cost of this labeling.
And the same thing at the edges. So it's a nice dot product. And here's the integer program that you're trying to solve: that energy minimization is just minimizing this sum of dot products at nodes and edges, subject to the mu's, those Boolean vectors, being Boolean. That's the integer programming problem. This will find you the best segmentation. And in order to find the second-best segmentation, here's the simple modification that I'm making. All the variables stay the same. Mu 1 is your MAP, the best segmentation that you found. And I have introduced a new inequality that says: delta is some diversity function that measures distance between two labelings, two configurations, and I force that distance to be greater than K. K is a parameter to the problem. Delta is something that you choose. I will talk about both of those in a second. But it's just something that forces you to be different.
Visually, what does that look like? Here's what it looks like. This is the space of all exponentially many segmentations. You were searching over that space; you minimized over a convex hull of this space. So this is the best segmentation that your model thinks -- that's the MAP. You disallowed some other segmentations that lie less than K distance away from it. And now when you minimize over the remaining configurations, you find the second-best solution. Right? That's what it visually looks like. So this is the problem that we're interested in solving. This is the M-best mode problem.
For this part of the talk, I'm going to assume that somehow there is a black box that solves the MAP inference problem. In the second part I'll go into how we solve that. But given a model, there is some algorithm that solves the MAP inference problem. But this -- this is almost like the MAP inference problem, but there's an extra constraint. So you can't exactly plug in your existing algorithms. What you can do is dualize this constraint, which means that instead of handling it in the primal, in constraint form, you add it to the objective with a Lagrangian multiplier. Instead of enforcing a hard constraint, you pay a penalty of lambda every time you produce a solution that's less than K distance away. So this is the dual problem. The reason why we do this is because this is now looking like something we know how to handle. And --
>>: The previous -- so you're actually penalizing -- the further away you are from K, the more you like the solution, right? It's not just an inequality constraint.
>> Dhruv Batra: Yes, if you search over the best lambda, you will converge to the things that -- if there is something that's -- yeah, I'll get to that in a second. But, yeah. The reason why we do this is because this objective function now starts to look like things we know how to handle. Right? This is an additional term that we know how to handle, and in the literature, if you've seen this, it's the loss-augmented minimization problem. We handle this every time we have to train SVMs or structural SVMs: you add the loss and minimize the original energy. So this is the problem I'm calling M modes. There are still two things I haven't told you about. Lambda: I've traded the primal constraint for a new dual variable. And delta, the diversity function, which I haven't defined yet.
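For reference, here is the formulation just described, written out in my own notation (reconstructed from the description above, not copied from the slides): theta are the node and edge costs, mu the Boolean indicator vectors, mu^1 the MAP labeling, Delta the diversity function, K the dissimilarity bound, and lambda the Lagrange multiplier.

    \begin{aligned}
    \text{MAP:}\quad & \min_{\mu}\ \sum_i \theta_i \cdot \mu_i \;+\; \sum_{ij} \theta_{ij} \cdot \mu_{ij}
        \quad \text{s.t. } \mu \text{ Boolean and node/edge consistent} \\
    \text{2nd mode:}\quad & \min_{\mu}\ E(\mu) \quad \text{s.t. } \Delta(\mu, \mu^{1}) \ge K \\
    \text{Lagrangian:}\quad & \min_{\mu}\ E(\mu) \;-\; \lambda\,\big(\Delta(\mu, \mu^{1}) - K\big), \qquad \lambda \ge 0
    \end{aligned}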
And you can think of it this way: for each setting of lambda, which is a dual variable, this relaxation provides you a lower bound on the original problem that you were solving. And, to get to Rick's question, you can try to maximize this lower bound. What does this function look like -- I tell you lambda, you minimize this -- as a function of lambda? You can easily show that it's a piecewise-linear concave function. You can maximize it over the space of lambda.
So let's see. What's delta? Let's nail that down. What's the diversity function? There are some nice special cases. If your diversity function was a 0-1 function -- if you're exactly the MAP then your distance is 0, and if you're anything else, even one pixel labeled differently, you're some different configuration -- then this is the M-best MAP problem. So we generalize that. There are some other generalizations that I won't talk about. We allow a large class of delta functions, and I'll talk about one of them specifically today, which corresponds to Hamming distance. Here's the delta function that I'm going to be working with. It says you sum over each node in your graphical model, each node in your graph, and you take the dot product of mu I with the mu I of the MAP. What does this mean? Mu I, remember, is a Boolean vector that encodes what label you took at node I. Mu is your new variable. This is just counting how many pixels took the same label as last time. If you take a dot product with a Boolean vector, only if they agree do you get a 1; otherwise you get a 0. So this is exactly Hamming distance, up to a sign and a constant.
Why is this interesting? Well, it's interesting because if I now look at that problem that I was trying to solve -- my original energy function minus this Lagrangian times the loss -- this loss is now linear in mu. It's a linear function of the variable that I'm minimizing, and it nicely decomposes across nodes. So all that happens is that now I have my original node potentials, plus lambda times the indicator of the MAP label at that node. Mu I is a Boolean vector of length K; only one of the entries is set to 1, which is the MAP entry. The cost for the MAP label just went up by lambda.
>>: Just going back to your original formulation, the mu IJs are independent of your mu I? They're additional extra variables?
>> Dhruv Batra: Yes, so there are some constraints that I've hidden away. The mu's are not actually independent; there are constraints that tie mu IJ to mu I. Those constraints exist. I've sort of hidden them away because they weren't relevant.
>>: Mu is always an integer?
>> Dhruv Batra: In the integer programming formulation it will always be an integer.
>>: [inaudible].
>> Dhruv Batra: In order to solve it -- so that's the black box that I hid away. In order to solve it, you will relax it to an LP relaxation.
>>: The constraint that forces one bit to be one is also in that family of constraints?
>> Dhruv Batra: The constraint that forces only one bit --
>>: To be one.
>> Dhruv Batra: Yes, that's also a constraint that's hidden away in this mu. So there are constraints -- forcing one bit to be one is sort of a normalization constraint; after you relax it, it will become a normalization constraint. And the constraints that they agree with each other will become marginalization constraints. But they both exist.
>>: One last thing. So you just talked about the second-best mode.
>> Dhruv Batra: Yes.
>>: But third and fourth -- do you have different lambdas for different ones, or --
>> Dhruv Batra: Yeah.
So in the primal case, what you'll have to do is add different inequalities: find me the next best, which is K away from the first, the second, and the third. So you can either have different Ks or some standard setting of K. That would say I just plopped down this Hamming hypercube that disallows some solutions. In practice, this is going to be a question of how I set K, and I'll get to that in practice.
>>: Is lambda single across -- [inaudible]?
>> Dhruv Batra: So, Lagrangians -- there's a different Lagrangian for each inequality.
>>: And you optimize --
>> Dhruv Batra: You'll have to optimize over those, yeah, right. All right. So this is nice: all I have to do now is modify some potentials -- if node I was set to label 1, then the cost of node I taking label 1 just went up by lambda -- and I just rerun the same machinery that you have for MAP. So if you had a black box, that black box still runs; I just have to perturb the potentials a little bit. Even better, since I did not modify the edge potentials, theta IJ, if there was some structure in the original edge potentials, I preserve that structure. So if your original problem was a pairwise binary submodular minimization problem, for which you have exact inference algorithms, this new modified problem is still pairwise binary submodular. So if there was an exact algorithm for the first problem then there's an exact algorithm for the second one. And I think that's the most interesting part: if you have invested some work to extract one solution out of your model, this can extract multiple solutions out of your model.
So what does this look like in practice? Here's an image, and somebody scribbled on it. One color of scribble indicates this is the foreground, the other color indicates this is background. Here's the ground truth on these images; this is what presumably you'd like to extract. We encoded this with a pairwise binary submodular Potts model: color potentials look at the color of the foreground and background to set up node potentials, plus a Potts prior on the edges. This is the MAP that you extract from these scribbles on these images. This is the best segmentation. This is the exact second-best MAP -- the literal definition of second-best MAP. Does anyone even see the difference between the first and the second best?
>>: Only in the top row.
>> Dhruv Batra: Yeah. Because there are these pixels here that get turned on. The others are different as well -- maybe there's one pixel that turns on. So I wasn't making those figures up. This really happens in practice: you run your entire algorithm again and the second-best solution is essentially useless. This is the second-best mode that we can extract. So in the first case, we're able to extract the other instance of that object -- this entire thing was absent in the first and second. In this one, we're able to fill out the arm of the person that was cut out. Right? All by forcing Hamming dissimilarity.
>>: In the third row there's not much difference.
>> Dhruv Batra: Right, in the third row there's not much difference.
>>: But there must be at least K pixels that are different.
>> Dhruv Batra: Sorry -- yeah.
So the way we're solving this is: instead of setting K, we're setting a fixed lambda, which means that you don't actually enforce diversity; you trade off diversity against the original energy with some fixed weighting term. Which means that if your model strongly believes in the original solution, you will still get the first solution back.
So here's a second experiment that we did. Sorry -- here's the second experiment that we did. This is the Pascal segmentation challenge. For those unfamiliar with it, this is a large international challenge that's been running for a few years now. The organization running the challenge releases the images, but not the test set annotations. There are 20 categories in this dataset, plus a background. For each image, what you're expected to produce is this. This is the ground truth. You're expected to produce a labeling of one of these categories for all of the pixels, or background, which is shown in black here. Right? And what we did was take an existing model for this problem, which is the hierarchical CRF model by Ladicky, Kohli and Torr. They developed this over a sequence of papers. If you haven't seen this model before, it's just a hierarchical CRF: there are some potentials on pixels, neighboring pixels are joined with Potts-like terms, and there are superpixels and some scene-level things. It's a hierarchical CRF. It took them a couple of papers to develop a good inference algorithm for this model. And all we had to do was modify some terms and run their same algorithm again. Right?
So here's what I'm showing: an image; here's the ground truth. Blue, this is boat. This is sheep. This is the ground truth. This is the best -- this is the MAP under their model. In the first case there's this large region that's labeled as boat. So the segmentation is wrong -- they've labeled it as boat but the support is wrong. It's one of the mistakes that the model does tend to make: whenever it finds evidence for an object, it tends to smear it across the image. It labels everything as sheep. In this case it only found one of the instances of sheep. What we did was extract five additional solutions in addition to the MAP. Right? I'm showing you the best of those and the worst of those -- best and worst according to ground truth. So we are checking ground truth to see which one's best and which one's worst. And in this case what happens is you're accurately able to crop out that segmentation. Anytime you change the segmentation -- this is Hamming -- you get rewarded for being different from that. In this case you're actually able to get out the original support of the object, get an actual segmentation. In this case you're able to get the other instance of the object: not just one sheep was present, but a second one was also present. Right? These are examples where we find large improvements in the additional solutions that we extract. These are examples of medium improvement, where again this is the ground truth and this is what the MAP says. In this case the horse and rider are both labeled as horse in the MAP. What we did was extract the rider out, and lose some part of the horse. In this case everything was labeled sofa. We were able to extract the person out, and lost some bit of the sofa here. Right? And in this case we were able to extract an object out here relative to the MAP. And these are examples of cases where running these additional solutions doesn't make a big difference.
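The fixed-lambda procedure just described boils down to a very small loop around whatever MAP solver you already have. Here is a minimal sketch -- my own illustration, not the speaker's code -- where `map_solver` is a hypothetical stand-in for your black-box inference routine (graph cuts, TRW, the hierarchical CRF solver, etc.) taking unary costs and an edge list and returning a label per node; note the edge potentials are deliberately left untouched, so structure like submodularity is preserved.

    import numpy as np

    def diverse_modes(node_cost, edges, map_solver, M=6, lam=1.0):
        """Greedily extract M diverse solutions: after each one, add a Hamming-style
        penalty `lam` to the cost of reusing that solution's label at each node."""
        cost = node_cost.copy()                 # (num_nodes, K) unary costs
        solutions = []
        for _ in range(M):
            labels = map_solver(cost, edges)    # black-box MAP on the perturbed costs
            solutions.append(labels)
            cost[np.arange(len(labels)), labels] += lam   # discourage repeating labels
        return solutions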
>>: At the beginning of the talk you mentioned approaches that just generate 10,000 unsupervised segmentation hypotheses -- you can do the same thing there, take the best relative to ground truth.
>> Dhruv Batra: Yes.
>>: Did you do that?
>> Dhruv Batra: Yes.
>>: Does that do better? Or even with 10,000 random segmentations, does it not do as well as your approach?
>> Dhruv Batra: Yes. So you're asking whether we did a reranking on these? Yes, we did. So what I'm showing here are the empirical comparisons. This is the intersection-over-union criterion: you produce a mask, there's a ground truth mask, you measure intersection over union, and that's how accurate the mask is. This is the average over all categories. Here's how well just the MAP under this model performs -- it's just under 25. M-best MAP in this case is a nonstarter. That algorithm is actually much more expensive than M-best mode, because it's not just one MAP computation again; it's order N, where N is the number of pixels -- that many MAP computations again. In this case we did a back-of-the-envelope calculation: it would take us ten years without parallelization to get an additional solution. So we just don't have that. And in practice it doesn't seem to make any difference anyway. We did implement a baseline which finds the most confused pixels and flips them to their next best label.
Here are the oracle numbers for M-best modes: if you extract five additional solutions in addition to the MAP and take the best by looking at ground truth -- so that's cheating; this number is not a valid entry to Pascal, because you can't look at the ground truth -- it tells you how much signal there is in those additional five solutions. You go from 25 or so to over 36. And to give you the scale of these numbers, the state of the art on this dataset is just about 36. Right? So you took a model that was nowhere close to state of the art and it's now beating state of the art -- well, beating in the sense that there is such a solution among these five or so. So the goal now is: can we rerank these solutions? Can we take the six and run the reranker? We have an initial experiment on that already. We've taken the reranker from that work and applied it to these six. It already improves on the MAP, so we're able to do better than the MAP. But we haven't yet realized this full potential, so we're still tweaking that reranker to see if we can do better. Intuitively, it feels like it should be an easier problem than reranking 10,000 segmentations, because now it's just six. Picking one out of six is much easier, and we hope this number --
>>: With the 10,000, it would be great if you had 100, 1,000, 10,000, to see how many random --
>> Dhruv Batra: Rankings.
>>: -- how many you need to get up to --
>> Dhruv Batra: Yes. So I don't know. This line is them reranking 10,000. So this is them -- the state of the art is them, but that's them reranking 10,000. I don't know if there's a curve of how the ranking does as you go fewer and fewer.
>>: Were you using their ranker or were you using your own?
>> Dhruv Batra: Right now we're using their ranker -- adding new features, retraining it on our own. This is not the latest that we've been able to achieve, but off the shelf, if we take their ranker and run it on ours, this is what it does.
>>: But there's some number between their 10,000 and your 5, right?
>> Dhruv Batra: Yes.
>>: Whether it's forcing them to generate fewer or forcing your algorithm to generate more -- two curves, your best K and their K, whatever they're doing -- do the rankings converge?
>> Dhruv Batra: I don't have access to -- I can't really generate -- sorry, I'll have to check their paper to see if that curve was available. I would expect that they really need many, many segmentations, because all they do is run s-t min-cuts: they say, I assume this pixel is foreground, I assume this other pixel is background, I run a segmentation on this, and I just do this for many placements of source and sink. It's a completely brute-force procedure, and I'm fairly certain you need a lot of placements of S and T to do this.
>>: It's a different MRF entirely, right?
>> Dhruv Batra: Yes. So the thing I'm interested in, and that we have been doing with this, is we've also taken pose estimation problems, where the goal is not segmentation, but where is the arm of this person, where is the head, where is the leg. So this is a different MRF -- it's actually a tree graph -- but your labels are locations of these parts. And there has been some preliminary work on trying to find multiple hypotheses. We have the implementation of [inaudible] for the best case, and we're modifying his code and finding improvements on that as well. But there are a lot of applications, I think, that can benefit from this. I think this can really improve parameter learning as well. The way we train our models right now is we run a loop: you use your current setting of parameters to ask the model what's the best segmentation it believes in; if it's not the ground truth, you modify the parameters a little, until the ground truth is what wins. And if you had access to not just the MAP but also some other modes that it believes in, you could converge much sooner. Right? Because you're extracting other things at each step. And there are connections to rejection sampling and so on.
But just to summarize this part of the talk, here's what I would like you to take home. You're working on some problem -- think about the problem that you're interested in, whether it's ranking documents in some retrieval setting or whether it's structure from motion. There is some discrete aspect to it, right? And the key thing is: are you happy with the single best solution that you have? If you don't have perfect models, then you're not happy with it. And if you're not happy, then you should look at extracting multiple solutions out of your model. And we're hoping that this can help additional applications as well. Are there any questions about this part before I move on?
Okay. So the next few things will go much quicker, because I'll go into fewer details. But here's the second part of this talk. I want to introduce a notion -- I'll give a high-level picture. I won't go into all three of the things; I'll just give the high-level idea and go through it. We came up with this idea of focused inference and we applied it to a few different applications. I'll just tell you what focused inference is. So I told you there's this integer program that we're solving. In the first part of the talk I assumed there's a black box that can solve it. Typically these things are solved with, for example, linear programming relaxations. So you are going to minimize this linear function over some discrete variables. Right? That was a linear function.
So I can take all your parameters and make them a really long vector. I can take all your variables and make them a really long vector. And that's just a dot product. And now the constraints that I had hidden away, that I didn't talk about before -- they're also linear constraints. So this is the exact form of optimization problem that we're studying: a linear objective function, linear constraints, Boolean variables. And what you study is an LP relaxation of this problem, where you replace the Boolean constraints by 0-to-1 constraints. This is a continuous relaxation. What it essentially involves is replacing the convex hull of the solutions by an outer bound. This outer bound looks like a more complicated structure in 2-D, but it's actually a simpler structure to optimize over, because this is a very high-dimensional space. In 2-D it looks more complicated, but it's actually much simpler. The way we solve these linear programs is by message-passing algorithms. I won't give you the details, but what it involves is: you look at each part of the graph locally, solve each part of the graph exactly, and you pass messages -- here's what I think my neighbor should be -- and that takes you to the linear programming optimum. You can interpret these messages as dual ascent algorithms.
But in a sense this is a highly inefficient procedure. This is what is done right now, and what we wanted to do was improve this procedure, because our observation was that data does not look like this. We don't have complexity at all scales, at all nodes, at all relative locations. Our data really looks like this: there are regions of complexity, but there are large regions which are essentially simple -- I can look at the local potentials and I know what the answer is going to be. If I had to give you an analogy, I would say that the first approach, of passing messages everywhere, is sort of like a carpet-bombing approach to inference: indiscriminate deployment of computation, everywhere. And what we would like to see is a more focused deployment of computation, where you find the important parts of the problem and you only focus your computation there.
So here's the key hammer, which I'll explain really simply. We're going to solve a linear program. Corresponding to the primal linear program, there's a dual one -- LP duality theory, the Lagrangian multipliers of your problem. We know from duality theory that the primal decreases, and each setting of the dual gives you a lower bound; the dual increases as you spend more computation. If the two meet at any point in time you know you've converged; you've solved the problem exactly. Moreover, because these are linear programs, the complementary slackness conditions are exactly the conditions you can check for convergence. When these conditions are satisfied, you know that you have solved your original problem. Right? In our work, what we essentially did was, instead of using them to check for convergence, we use them to guide where messages should be passed. We distribute them at various locations in the graph, and they tell you which are the important regions of the graph. Right? These conditions distribute nicely over the entire graph. And that's the most important concept here. That's the key hammer. And in a couple of papers, what we were able to show is that we can say some precise theoretical things about it. We can say that this is a generalization of complementary slackness conditions.
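For reference, the convergence check being referred to is textbook LP duality, written here in generic matrix form (c, A, b, mu, y are generic LP data and variables of my own notation, not the talk's):

    \begin{aligned}
    \text{(P)}\quad & \min_{\mu \ge 0}\ c^{\top}\mu \quad \text{s.t.}\ A\mu \ge b,
    \qquad\qquad
    \text{(D)}\quad \max_{y \ge 0}\ b^{\top}y \quad \text{s.t.}\ A^{\top}y \le c, \\[4pt]
    & \text{optimality} \iff \mu_j\,(c - A^{\top}y)_j = 0 \ \ \forall j
    \ \ \text{and}\ \ y_i\,(A\mu - b)_i = 0 \ \ \forall i.
    \end{aligned}

Roughly speaking, the per-edge scores described next measure how badly such local conditions are violated at each part of the graph.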
We can say that it's exactly a notion of a distributed primal-dual gap: at any point in time you have a best primal and a best dual, and these scores sum to that primal-dual gap. And these are really cheap to compute -- constant time per edge to compute the score. It's not like you have to spend a lot of computation computing the score in order to save computation. It's really cheap.
So we applied this idea to a few different things. One was distributed message passing: how do we speed it up? In this case there's an image and a current segmentation, and you update the model somehow. Let's say a user says all the white pixels here are background -- that's where your model has been updated. Or you might have data streaming in for the next frame, so the model has changed everywhere. And you want the next segmentation. This is the key result here. This is what would happen -- the dual and the primal -- if you were to pass messages everywhere to go from here to here. And this is what happens when you use our method to find the important parts of the problem and only pass messages there. You essentially converge much sooner, and notice the X axis is in log scale. Here we're converging 350 times sooner. What is this baseline that I'm talking about? That's the TRW-S implementation of [inaudible]. And if you've played around with that implementation, it's an extremely efficient implementation and not an easy one to beat. In this case we're able to beat it by this big margin, and in the other case we're also able to beat it by a big margin. The reason why we were able to beat it is precisely this figure right here. It's showing you where we passed messages. So the white pixels are where potentials were updated -- a small number of updates here, a large number of updates there -- but it only passes messages where things really matter, where the segmentation is changing between the two frames. And that's exactly why you converge so much sooner.
>>: Why does your purple graph start off higher than the red graph?
>> Dhruv Batra: I think what's happening is that -- so this is log scale, and I'm zooming into the region that's closer to convergence. It might be worse off initially, but it's beating it closer to convergence. You can alternate between the two: you can start off here and then at some point switch to the other algorithm. This algorithm, the baseline -- what it's doing is passing messages horizontally and then vertically. So initially it makes big improvements, but later it gets stuck in this process where it has to pass lots of messages. Our approach is finding the edges where messages need to be passed and only passing them there. So initially it doesn't make a lot of improvement, because it keeps passing messages locally -- some segmentations have to change, so somehow this node has changed, it needs to let its neighbor know, and that neighbor its neighbor -- so it takes a lot of time initially, but it converges much sooner because you're only focusing on that.
>>: The classic, or at least that style of, TRW has a horizontal and vertical sweep. Have people looked at hierarchical, pyramid -- whatever you want to call it -- based techniques where the propagation looks like it's happening more --
>> Dhruv Batra: At multiple scales?
>>: Yeah.
>> Dhruv Batra: Not that I've seen. I mean, scheduling, which is what I'm presenting, certainly isn't new. People have looked at scheduling before.
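As a concrete picture of what score-driven scheduling can look like, here is a minimal sketch -- my own illustration, not the TRW-S code or the speaker's implementation. It assumes two hypothetical callbacks: `edge_gap(e)` returns a nonnegative local score (for example, a local primal-dual disagreement) and `update_edge(e)` passes the messages on edge `e` and returns the edges whose scores may have changed; edges are plain `(i, j)` index pairs.

    import heapq

    def focused_message_passing(edges, edge_gap, update_edge, tol=1e-6, max_steps=100000):
        # Max-heap via negated scores: always work on the edge with the largest local gap.
        heap = [(-edge_gap(e), e) for e in edges]
        heapq.heapify(heap)
        steps = 0
        while heap and steps < max_steps:
            _, e = heapq.heappop(heap)
            gap = edge_gap(e)                 # heap entries can be stale, so recompute
            if gap < tol:
                continue                      # nothing useful left to do on this edge
            for nbr in update_edge(e):        # pass messages on e; rescore affected edges
                heapq.heappush(heap, (-edge_gap(nbr), nbr))
            heapq.heappush(heap, (-edge_gap(e), e))
            steps += 1
        return steps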
And some people have looked at where the messages are changing more: if the message you sent last time is essentially the same as this time, then maybe you shouldn't send this message, maybe somebody else should send theirs. I have not yet seen hierarchical -- so I know Pedro looked at this a little bit, but that's essentially constructing a hierarchical graph; you have to construct a different graph.
>>: If you're going to exploit hierarchy, you have to construct a different graph, because there are no connections to jump across unless you introduce those. If you have sort of an auxiliary graph that's supposed to mimic a lower-resolution version, whatever that means, it could be used as a hint graph -- sort of propagate stuff up at a coarser level and back down, and then the LP at the finest level can move forward faster because of that information. So raster-order propagation is extremely efficient if you have a tree, right -- that's optimal.
>> Dhruv Batra: That's optimal.
>>: But in general it's not a bad heuristic. In the linear-system-solving literature those kinds of sweeps are decades old, but proven to not work nearly as well as multigrid, right?
>> Dhruv Batra: Uh-huh.
>>: And now adaptive multigrid adapts to the intrinsic complexity. And it's something I'm very interested in. I've only worked on the linear case, which would be equivalent to quadratic energies or Gaussian MRF versions -- that's all I've worked on. I've been dying to start working on it for general inference problems.
>> Dhruv Batra: Yeah. I think that would make a lot of sense, because that way you can make large regions flip their labels by just going one layer up.
>>: I'd be worried that this method, when you have a large region, wouldn't choose that region to actually update messages, because each edge only has a little bit of -- is that a problem?
>> Dhruv Batra: So one thing I showed was that the scoring function -- the way I wrote down the LP, I said, look, I can score every edge -- our formulation extends to scores over regions. So you can compute scores over large regions as well. So even if every edge has a little bit of score, the sum might still be the most important part. So you might decide to go up, if you had written down a hierarchical model --
>>: It would be higher.
>> Dhruv Batra: But if you haven't written down a hierarchical model then you're kind of stuck.
>>: It would always be kind of below that threshold and never --
>> Dhruv Batra: Yeah. But in order for them to be below a threshold there has to be something else that's always winning -- there's this big edge that thinks here's where the most problem is -- yeah.
All right. So these things really help; you can make things a lot faster. In fact, we applied the same idea in another direction. I said that, look, you can compute scores on these edges and I can tell you which ones are the important edges. But that was all assuming your original LP was a good LP to begin with, a good relaxation. This is an NP-hard problem, so a lot of cases are going to look like this: the best lower bound you can extract is nowhere near the best primal that you can extract. And in our formulation, you know, there's not a single LP. There are tighter and tighter LPs, because you can add more constraints to the original linear program.
In our formulation, the first LP was saying that edges are consistent with nodes: the labeling that you give at edges is consistent with the labeling that you give at nodes. You can write down tighter LPs by saying that triplets are consistent with edges: the labeling that you give at three nodes is consistent with the labelings you give at edges.
>>: The original pairwise constraints are that mu IJ has the right marginals?
>> Dhruv Batra: Yes.
>>: What's the triplet version?
>> Dhruv Batra: For a triplet you would introduce a new variable, mu IJK. So it would be a K-cubed-long Boolean vector, and you would force it to be consistent with the edge variables. Your original energy is still pairwise, so you don't care about optimizing over mu IJK -- its term in the objective function would still be 0 -- but it plays a role in the constraints. And that tightens the LP, because now there are more constraints in the LP. But the question here is: while we could reasonably think about adding all edges to the original LP, we can't think about adding all triplets. N nodes, lots of triplets. You can think about tighter relaxations on four nodes. How many of these things are we going to add? So if there was a way to score --
>>: Why can't you add all triplets? It's originally a mesh graph -- still order N squared --
>> Dhruv Batra: If you restrict yourself to only the triplets that are present in the original graph, then perhaps you can think of adding them. But long-range triplets can also tighten your LP, which might also be interesting to add. You can include edges that don't exist in your graph but that can still help tighten the LP. Then you have to consider all N cubed, or N choose 3.
>>: Philosophically you're making a big jump, because the original thing was you encode the problem as a continuous optimization or integer program where, assuming the constraints are met, it's exact, right? Now you're saying let's just throw in lots of extra constraints so that the solution proceeds faster, right?
>> Dhruv Batra: No -- even with these constraints, it is still the original problem. Think about it this way. What is the worst I can do? I introduce a variable that depends on all pixels, so instead of being K squared or K cubed it is K to the N. What are the constraints I can add to this? That it sums to one over all possible labelings -- you only choose one labeling -- and that each of these labelings is consistent with the sub-labelings that you have. That type of constraint would still be a valid constraint for your LP, right? In fact, if you threw that in for all the cliques in your tree decomposition -- if you wrote down a junction tree of this graph and added constraints over its cliques -- then your LP is guaranteed to be tight. And we can show that: you just need to add constraints over cliques as large as the treewidth, in the worst case. So we're essentially moving towards that by adding more and more constraints. And the thing to think about is: we can't add all triplets or all quadruplets, so it would be helpful if we could somehow score these things -- which ones should we add into the relaxation? Can you know a priori -- so when you add a constraint, it will help, obviously, because it tightens the LP -- but can you, before adding it in, know how much it is going to help, or have an estimate of how much?
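Written out in my own notation (reconstructed from the description, not copied from the slides), the pairwise LP enforces local consistency between edges and nodes, and the tighter LP adds the analogous triplet-to-edge conditions:

    \begin{aligned}
    &\textstyle\sum_{x_i} \mu_i(x_i) = 1 && \text{(normalization)} \\
    &\textstyle\sum_{x_j} \mu_{ij}(x_i, x_j) = \mu_i(x_i) && \forall (i,j),\ x_i \quad \text{(edge--node consistency)} \\
    &\textstyle\sum_{x_k} \mu_{ijk}(x_i, x_j, x_k) = \mu_{ij}(x_i, x_j) && \forall (i,j,k),\ x_i, x_j \quad \text{(triplet--edge consistency)}
    \end{aligned}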
>>: When you say you can't add all of them -- if the triangles consist of two edges from the original graph plus one extra edge, and you start with a regular planar grid, there's only a constant number of such triangles, right? Why would you want to take three nodes that are all over the map and make up a hypothetical triangle of those three? In other words, isn't just using locality almost as good as using something based on local gaps?
>> Dhruv Batra: Right. Right. So what you're saying is it's really easy to enumerate over all the local triangles. Yes, you can do that. But what about when you go much higher -- four, five, six -- then that space becomes much larger, and even enumerating over that space becomes complicated. And what I'm trying to say here is: is there a way to construct these cliques where I can localize where my primal-dual gap is coming from? If all the edges in this neighborhood have a little bit of score, can I just add this entire region as a clique into my relaxation? And that's what we looked at here; that's essentially what we did.
For this problem, there was an original image, and we had access to a blurry, noisy version of this image. We set up an MRF problem for denoising and deconvolution. This is the best primal we could extract out of the pairwise linear program -- so, what I was showing before. This is the best primal I can extract from the triplet LP; if I add triplets into this, it becomes tight. This is the actual integer MAP. So that's fine; we can extract this from this. Here's the objective function -- not the objective function, the primal-dual gap -- decreasing as a function of time as I'm adding more and more constraints. So if you add constraints randomly -- if you throw in triplets, not random triplets, but you enumerate over them and randomly throw one in -- then here's how the primal-dual gap decreases as the relaxation gets tighter, and here's if you choose using our score. It converges to 0 much sooner. On some of these you're essentially three times faster. At the end, you know, this one has converged and this one is not even close to convergence. That's the idea.
>>: What's the intuition -- when you watch it select triplets, what is it typically selecting?
>> Dhruv Batra: It's typically selecting things that are at boundaries, at object boundaries -- so here it might select some triplets here. So in a sense it is using the locality of the problem, but still it's going one step above edges.
>>: Right. So is it locality based on the actual smoothness graph in the blurry image, or in the solution? Which is it looking at -- does it tend to look at the current solution and that's what drives it, or does it look more at the original input?
>> Dhruv Batra: No, it's looking at the best solution that it can extract and the best dual value it can extract, which is a function of this. So it's looking at both: what's the best primal and the best dual.
>>: You're reasoning about triplets but not larger subsets --
>> Dhruv Batra: In this case I was reasoning only about triplets. But the formulation extends to arbitrary-size subsets.
>>: Can you do them all together, like 3s and 4s?
>> Dhruv Batra: You can construct them from their subsets. So you can score -- at any point in time you can only score the things that exist in the relaxation.
So if only edges exist in the relaxation, I can compute scores on edges, and by summing up the scores of the edges inside a triplet I can compute a score for that triplet. So if I had to compute a score for a set of 5, then I would have to look at all five-choose-two of those edges that exist now and sum their scores up. Does that make sense?
All right, let me try to finish up quickly. I won't go into the last part, but we also took an algorithm, alpha expansion, which on the surface looks nothing like an LP relaxation -- it looks like a greedy algorithm -- but people have interpreted it as solving that same LP. And we were able to use this idea to also speed up standard alpha expansion by factors of 2 to 3. And that's all I'll say about that.
So in general I'm interested in extending this to QP relaxations; I talked only about linear programming relaxations. A lot of the methods I described are natively parallelizable, so one of the things I want to do is have parallel implementations. There's this really nice work coming out of CMU which is GraphLab, which lets you work at a really high level: you specify your algorithm and it does all the low-level parallelization, for multicore and for distributed settings. And there's a good chance I'll be working with Carlos Guestrin at CMU and taking my graphs to GraphLab. And I want to look at focused learning -- trying to do scheduling for learning problems -- and I think there's some interesting scope there.
Okay. Let me show some teasers, and I think we should be done in a minute or so. In my Ph.D. thesis -- I think a lot of people here have seen this before -- we worked on this problem of interactive co-segmentation, where you have a large collection of images with the same object appearing in those images, and you can build a system -- we built a system -- where someone can scribble on a single image or a few images saying this is the foreground, this is the object I'm interested in. And our system would go ahead and segment that object not just in that image but in the entire collection of images. And in this case you have to look at all the images, so we also looked at active learning formulations, where the system would tell you where to go next, where it needs to see scribbles next. This was mostly for building an automatic collage: you scribble once and build a collage. But what we were also able to do, with Adarsh Kowdle, who was an intern here that people are familiar with, was extend this to volumetric reconstruction by using a shape-[inaudible] algorithm: you use a standard structure-from-motion pipeline to find the camera parameters, back-project the silhouettes into 3-D, and carve out a volumetric reconstruction. And this is the cutest part -- I'll skip this video -- this is the cutest part: he just got hold of a 3-D printer, so he was able to actually produce these little tiny structures from this. These were printed on a 3-D printer using our algorithms. So from just a couple of images and your scribbles, it was able to produce three more [inaudible], printed on a 3-D printer as well. We've worked on other problems like single-image depth estimation; maybe we can talk about that if we end up talking about this.
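For the volumetric reconstruction mentioned above, the carving step is simple enough to sketch. This is my own minimal illustration under stated assumptions (one calibrated 3x4 camera matrix per view from structure from motion, and one binary foreground mask per view from the co-segmentation), not the actual system:

    import numpy as np

    def carve(voxels, cameras, masks):
        """Keep a voxel only if it projects inside the silhouette in every view.
        voxels: (N, 3) points; cameras: list of 3x4 P matrices; masks: list of 2-D bool arrays."""
        keep = np.ones(len(voxels), dtype=bool)
        Xh = np.hstack([voxels, np.ones((len(voxels), 1))])   # homogeneous coordinates
        for P, mask in zip(cameras, masks):
            uvw = Xh @ P.T                                    # project into this image
            u = (uvw[:, 0] / uvw[:, 2]).round().astype(int)   # assumes points in front of camera
            v = (uvw[:, 1] / uvw[:, 2]).round().astype(int)
            h, w = mask.shape
            inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
            in_sil = np.zeros(len(voxels), dtype=bool)
            in_sil[inside] = mask[v[inside], u[inside]]
            keep &= in_sil                                    # carve away anything outside a silhouette
        return voxels[keep]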
We have a really nice algorithm coming that's the first max-margin learning algorithm for Laplacian CRFs -- models that are effective for this problem but haven't been used much before, because the algorithms didn't exist.
>>: What's a Laplacian CRF?
>> Dhruv Batra: A Laplacian CRF is a CRF where the edge potentials have L1-norm terms. When you have L1-norm terms it's not a log-linear model, and some of the existing algorithms don't work because they make a log-linear assumption. So we came up with the first approximate max-margin algorithm for training these things. In the past I've also looked at some retrieval problems where, suppose you have an image and you're trying to find out what the content in this image is -- I give you an image, you give me a textual description. And this was an algorithm we built that would first segment the image, search for images with respect to that segment, and then do some textual analysis on that to answer your query, essentially. And there's been some work on similarity learning as well. So with that I'll stop. Here are the people involved in some of these things, and that's it. Thanks.
[applause].
>> Larry Zitnick: Any additional questions?
>>: Interesting talk.
>> Dhruv Batra: Thanks.