>> Dengyong Zhou: Actually, this afternoon Alekh will give a talk about message passing. Actually, I'm quite proud to invite Alekh to give a talk. Alekh is currently a second-year Ph.D. student at Berkeley, with [inaudible] as his advisor. Before that he got a Bachelor's degree in IT and computer science.

>> Alekh Agarwal: Thanks for the introduction, Denny. So today I'll be talking about proximal methods and rounding schemes for certain graph-structured optimization problems. This is joint work with Pradeep Ravikumar and Martin Wainwright at U.C. Berkeley. In this work we're concerned with solving some optimization problems that arise in Markov random fields, so let me quickly recap what Markov random fields are. A Markov random field is basically a representation of a probability distribution over a collection of random variables. So we have random variables, say X1 through Xp, and we assume these are discrete random variables, so each of them takes values 1 through m. And we have a probability distribution over these random variables that is characterized by the structure of a graph, which is the Markov random field. The probability distribution is parameterized by two sets of parameters. The local parameters are the node potentials, which give you local information about a node's affinity for taking on particular values. And we have the edge potentials, which describe the nature of the pairwise interaction between pairs of random variables. So, for instance, if two variables are positively correlated, you expect the edge potential to have a high positive value, and a high negative value if they're negatively correlated, and so on. By the theory of MRFs, the independence relations under these distributions are characterized by connectivity properties of the graph. So the problem we're interested in solving is the problem of finding the MAP labeling for this set of random variables, which is -- yes?

>>: So [inaudible] beyond pairwise?

>> Alekh Agarwal: Well, the thing is, you can in general represent any arbitrary potential as a pairwise potential by certain methods that have been proposed in previous literature. I did not explicitly consider forms beyond the pairwise form because there's a black-box reduction you can always do to convert them into pairwise potentials, by forming something like super-nodes. So we're interested in finding this MAP labeling, which is basically the mode of this distribution, the labeling that has the highest probability under this model. And it's quite obvious that it is the labeling that maximizes, over all possible labelings, the linear objective inside the exponential function. Clearly, because we have only these discrete values, the problem of finding this MAP labeling is an integer linear program, and hence it is known to be NP-hard; there are lots of hardness results on problems of this kind. This problem arises a lot, and there's been a lot of previous work on it, starting from the classical max-product algorithm, which is a dynamic programming algorithm that is known to be exact on trees and is basically an extension of the Viterbi algorithm for HMMs. One field in which this problem arises a lot is computer vision, where people use MAP labelings for, for instance, image segmentation and other problems.
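(For concreteness, the distribution and the MAP problem being described can be written as follows; this is the standard pairwise-MRF formulation, reconstructed from the talk rather than copied from the slides.)

```latex
% Pairwise MRF over discrete variables x_1, ..., x_p, each taking values in {1, ..., m}
p(x; \theta) \;\propto\; \exp\Big( \sum_{s \in V} \theta_s(x_s)
    \;+\; \sum_{(s,t) \in E} \theta_{st}(x_s, x_t) \Big)

% MAP labeling: the mode of the distribution, an integer program
x^{\mathrm{MAP}} \;\in\; \arg\max_{x \in \{1,\dots,m\}^p} \;
    \sum_{s \in V} \theta_s(x_s) \;+\; \sum_{(s,t) \in E} \theta_{st}(x_s, x_t)
```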
So some very nice literature on this problem came out of the vision community. In particular, they showed that for certain kinds of potentials you can use the technique of graph cuts to find the MAP labeling, and there is a very nice survey in the work of [inaudible], et al. Probably the state-of-the-art algorithm for this problem is the tree-reweighted max-product algorithm, which was proposed by Wainwright, et al, and further studied by Kolmogorov and co-authors. Recently, Kolmogorov and Globerson proposed a globally convergent version of this algorithm using the idea of [indiscernible] hypertrees. There have also been a lot of convex free-energy approximations, which are related to the work that I'll be discussing in the next few slides; these were proposed by Yirise [phonetic] and co-authors. A lot of convex relaxations of this integer linear program into things like linear programs, quadratic programs, second-order cone programs, SDPs, et cetera, have been studied in great detail. Some of the more recent work on Lagrangian relaxation and simulated annealing is also related to our work. And there's the idea of using the subgradient method after some kind of dual decomposition, which is one of the only convergent and exact methods for solving the relaxed version of the problem that I'll describe on the next slide.

So what's the basic idea? Well, we have an integer linear program, and we do what people always do, which is to form a linear programming relaxation. We do this by observing that even though we are optimizing over discrete-valued variables, which can be seen as optimizing over a bunch of zero-one indicator variables, we can always relax the indicator variables to their expected values, and this does not change the optimum of the linear program. So we can replace the indicator variables with their expectations, which are the node marginal probabilities mu_s and the edge marginal probabilities mu_st. Now, of course, because these marginal probabilities come out of a single consistent distribution, they cannot be completely arbitrary. They will be related in some ways; there will be constraints relating the mu_s's and mu_st's. Note that if we could enumerate those constraints and optimize this linear objective in the mus subject to them, then we could hope to solve this relaxed problem efficiently and obtain the optimal MAP labeling. Since the original problem was NP-hard, something has to go wrong, and the thing that goes wrong is that the number of constraints characterizing these distributions is exponentially larger than the number of optimization variables. You cannot even enumerate all the constraints in polynomial time, let alone optimize over them. So the trick people use is to take a tractable subset of these constraints that can be easily optimized over. Some of the natural constraints are: first, these are probabilities, so they should be nonnegative. Second, the probabilities of a node taking each label should add up to one when you sum over all labels; that's the normalization constraint. In addition, the edge marginal probabilities should be consistent with the node marginal probabilities, which means that if you sum out over all possible labels of one node, you should get the node marginal probability of the other node; that's the marginalization constraint mentioned here.
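(In symbols, the constraints just listed, together with the relaxation built from them on the next slide, read as follows; standard notation, reconstructed from the description in the talk.)

```latex
% Local polytope L(G): nonnegativity, normalization, marginalization
L(G) \;=\; \Big\{ \mu \ge 0 \;:\;
    \sum_{x_s} \mu_s(x_s) = 1 \;\;\forall s \in V, \qquad
    \sum_{x_t} \mu_{st}(x_s, x_t) = \mu_s(x_s) \;\;\forall (s,t) \in E \Big\}

% First-order LP relaxation of the MAP problem
\max_{\mu \in L(G)} \; \langle \theta, \mu \rangle
  \;=\; \max_{\mu \in L(G)} \;
    \sum_{s \in V} \sum_{x_s} \theta_s(x_s)\, \mu_s(x_s)
    \;+\; \sum_{(s,t) \in E} \sum_{x_s, x_t} \theta_{st}(x_s, x_t)\, \mu_{st}(x_s, x_t)
```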
And the set of normalization and marginalization constraints taken together is often referred to as the local polytope, because it only looks at first-order, local constraints and not at higher-order structure in the graph. So the first-order linear programming relaxation, the best known in the literature, is this maximization, over the local polytope with the nonnegativity constraints, of the inner product between the parameter vector theta and the marginal probability vector mu. And this LP relaxation is what we aim to solve. But before we even go about doing that: it's clear this is a linear program now, so we could just use any off-the-shelf LP-solving procedure. So what's the big deal, why do we need a new method? The problem is that the classical methods turn out to be too expensive and too slow for the typical problem sizes encountered in applications of this problem. For instance, I mentioned computer vision. Often the number of labels is as big as 256, and the number of variables is the number of pixels in your picture, so it's a pretty huge number. The total number of optimization variables that you have is the number of edges in the graph times the number of labels squared, and if your method scales badly with the number of variables, you can forget about using it on this problem. That's one issue. The second is that the classical methods are often not easy to implement in a parallelized, distributed fashion, which, again, is a big concern at large problem sizes. Both these issues arise because we are trying to use something generic that does not leverage the particular structure: the optimization problem, coming out of an underlying graphical structure, has some very nice properties, and we need to leverage these properties if we want an efficient method for solving these linear programs. Let me note that the tree-reweighted max-product algorithm I referred to earlier does address a lot of these concerns, because it solves a particular dual instance of the LP relaxation I just mentioned. However, tree-reweighted max-product is specifically tailored to the constraints of the local polytope. Since the exact number of constraints is exponentially large, people often also want to work with higher-order constraints, known as cycle inequalities and so on, and modifying tree-reweighted max-product to handle new constraints is a very challenging job. People have tried it in the past and it gets very messy very quickly. So essentially we want an LP-solving procedure that provides an exact solution to the LP relaxation, with time complexity comparable to algorithms like max-product and tree-reweighted max-product; that is easy to implement in a parallelized and distributed fashion; and that gives us the capability of seamlessly incorporating new constraints into our relaxation if we wish to. In order to do this, we need to borrow some techniques from optimization theory. The first concept that comes in handy is that of Bregman divergences, which can be defined for any strictly convex function F. The Bregman divergence defined under a function F between two points mu and nu is basically a gap involving the function F evaluated at mu. (Sorry, the mu and nu on the slide should have been swapped.) So you take the point mu and you evaluate the function at that point.
Then you take a reference point nu, take the Taylor approximation of the function at the point nu, and evaluate the gap between the function value and this Taylor approximation. By the convexity of F this gap is always going to be nonnegative, and if F is strictly convex it is strictly positive for distinct points. And the more curved the function is, the larger this gap will be. The reason it's called a divergence is that it has certain similarities to the notion of distance that we are used to, which I will remark on in a minute. But let's first see some examples of Bregman divergences. The first and most natural example is when the Bregman function F is the Euclidean norm squared. I'll be dealing with Bregman divergences defined over vectors, so the functions I define will always be functions of vectors. So the first example is the Euclidean norm squared of the vector, and in this case the Bregman divergence is just the squared distance between the two vectors. That's kind of nice; it tells us these divergences definitely incorporate the standard notion of distance that we are used to. Another extremely classical example uses the negative entropy. Now, in our case I have probability distributions on nodes and edges, so what I do is define the negative Shannon entropy on each node marginal distribution and each edge marginal distribution, and then take a weighted linear combination of these individual node and edge entropies, which is also a convex function by the convexity of the individual functions. Then, not surprisingly, the Bregman divergence turns out to be the weighted sum of the KL divergences of the node and edge marginal distributions. This is actually going to be one of the most useful Bregman divergences for us, as we will see in a minute. The third useful and extremely non-standard Bregman function that I will be using -- and in fact it's not a Bregman divergence in the strict sense -- is what I call the tree-reweighted divergence. This is an idea borrowed from the tree-reweighted message passing literature. You have a set of spanning trees in the graph, which means you have a collection of trees, each of which contains all the vertices in your graph and some subset of the edges. You define the notion of edge appearance probabilities, rho_st, which is just the fraction of the trees that an edge appears in. So if you have two spanning trees and an edge appears in both of them, you set rho_st to 1; if an edge appears in just one of them, you set rho_st to 0.5; and if it doesn't appear in either of them, rho_st is zero. That's the idea of the edge appearance probabilities. Using these edge appearance probabilities and spanning trees, we define a tree-reweighted entropy, which is basically the average of the negative entropies of a bunch of probability distributions, where each probability distribution is defined on one tree. The idea is that once you have your node and edge marginal distributions, there is a well-known consistent way of defining a probability distribution over the subset of variables involved in a tree. This scary-looking formula is actually a very standard probability distribution for trees, based on these marginal probabilities. You just take this distribution and compute its entropy -- it again has a nice and simple form, which I won't write out here -- and then you take the average of these entropies across the trees. There is a corresponding Bregman divergence, which is not so nice to write out, so I'm not going to; but the reason I mention this entropy function is that it inspires a very nice algorithm that works remarkably well in practice.
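(A minimal numerical sketch of the first two examples -- hypothetical illustration code assuming NumPy, not code from the talk:)

```python
import numpy as np

def bregman(F, grad_F, mu, nu):
    """Bregman divergence D_F(mu, nu) = F(mu) - F(nu) - <grad F(nu), mu - nu>."""
    return F(mu) - F(nu) - grad_F(nu) @ (mu - nu)

# Example 1: F = (1/2)||x||^2 recovers (half) the squared Euclidean distance.
sq = lambda x: 0.5 * x @ x
grad_sq = lambda x: x

# Example 2: F = negative Shannon entropy recovers the KL divergence
# (for probability vectors; general nonnegative vectors pick up a linear term).
negent = lambda p: p @ np.log(p)
grad_negent = lambda p: np.log(p) + 1.0

mu = np.array([0.7, 0.2, 0.1])
nu = np.array([0.3, 0.4, 0.3])
print(bregman(sq, grad_sq, mu, nu))          # 0.5 * ||mu - nu||^2
print(bregman(negent, grad_negent, mu, nu))  # equals KL(mu || nu)
print(mu @ np.log(mu / nu))                  # direct KL, for comparison
```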
Just to quickly mention the properties of Bregman divergences that make them similar to distances: as I mentioned, they are always nonnegative, and they are zero if and only if the two points are identical. However, as was clear from the entropy example, these divergences do not enjoy properties like symmetry or the triangle inequality.

>>: So, sorry, this divergence actually does the [inaudible].

>> Alekh Agarwal: Yes, it does. So it's not exactly a Bregman divergence; let me tell you why. This function is actually strictly convex only when the mus come from the local polytope that I described. If the mus do not satisfy the constraints of the local polytope, this function is not even going to be convex. So it's convex on your constraint set and nonconvex outside it, but convexity on the constraint set is sufficient for me to come up with the algorithms I want.

The next tool we borrow from the optimization literature is the idea of proximal minimization, which is sort of a regularization technique. It says: I have some objective function F that I'm interested in minimizing, but I don't like this function -- it has some bad properties and it's not very nice to optimize over directly. So I'm going to add a second component to my optimization problem. In the proximal minimization literature, this second component is usually some kind of generalized distance function; people call it the prox function. And as you might guess, for us it's going to be a Bregman divergence. So why is this nice? Well, it gives us a bunch of nice properties. First of all, adding this Bregman divergence, just by its strict convexity, makes the entire optimization problem strictly convex, so we know the dual is going to be nicely behaved -- there's going to be nice strong duality and everything, and we can go work in the dual if we want to. However, there are even better properties that we get as a result. Note that instead of solving just one problem of maximizing the inner product of theta and mu, I now set up a sequence of problems, where the iterate at step n plus 1 is defined to be the minimizer of my linear objective function plus a weighting factor times the Bregman divergence to the previous iterate, and I solve this sequence of subproblems. Now, let's assume for a moment that my sequence of mus is converging to a fixed point. Then this Bregman divergence term will shrink to zero, because my iterates are getting closer and closer. Hence, the only possible fixed point of the sequence is the optimizer of the original function that I was intending to optimize.
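(Written out, the proximal sequence just described looks as follows, under the weighting convention used in the talk -- a weight omega^n multiplying the Bregman term:)

```latex
% Proximal sequence: linear MAP objective plus a weighted Bregman divergence
\mu^{n+1} \;\in\; \arg\min_{\mu \in L(G)}
    \Big\{ -\langle \theta, \mu \rangle \;+\; \omega^n \, D_F\big(\mu \,\|\, \mu^n\big) \Big\}

% If the iterates converge, D_F(mu^{n+1} || mu^n) -> 0, so any fixed point
% maximizes <theta, mu> over L(G), i.e., solves the original relaxed problem.
```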
The idea of adding this second, strictly convex component is something that people have used in simulated annealing for a long time, and they have tried to apply it to this MAP optimization problem already, but were unable to prove theoretical guarantees. What we get here is that, while in simulated annealing it's essential to send the weight omega to infinity at a particular rate, that may not be needed in general proximal minimization, because the Bregman divergence term is already shrinking by itself. So omega can afford to be a little larger, and that gives us an interesting degree of freedom when we apply proximal minimization techniques to this problem. Since we have a Bregman divergence in our objective function, another concept that comes in very handy is the cyclic projection property that Bregman projections enjoy. Since the Bregman divergence is a strictly convex function, we can define, for any reference point nu, its Bregman projection onto a convex set, which is the point mu-hat in the convex set that is closest to nu in the sense of this Bregman divergence. (My mus and nus are all over the place today.) Now, the property of cyclic projections tells us the following. Suppose I have a complicated convex set that can be written as an intersection of a bunch of simpler convex sets. For instance, say my simple convex sets are two lines; their intersection is, of course, a single point, which is not so complicated, but it works for the purposes of the example. I start with a point P that I want to project onto this intersection. I do it by first projecting onto the first line, then taking that projection and projecting it onto the second line, and repeating this process. It can be provably shown that I will converge to the projection of the point P onto the intersection of these convex sets C_i if I just repeat this process cyclically, and the rates of convergence and everything are well studied.

>>: What do you mean by cyclic? Do you have to project --

>> Alekh Agarwal: Right. So just one round of this sequence will not suffice; you have to keep on doing it, and asymptotically you will converge to the projection onto the intersection.

So just to quickly recap: I started with the original MAP optimization problem, which I modified by adding a proximal function to it and setting up a sequence of problems. At this point I do a slight rewriting to absorb the linear term into the Bregman divergence itself. This is done by taking the reference parameter mu^n and modifying it to a mu-tilde^n. The reason I do this is that my proximal minimization problem has now been cast as a pure Bregman projection problem. Now, the rewriting involved in going from mu^n to mu-tilde^n depends, of course, on the particular Bregman function being used. To give you some examples: in the quadratic case mu-tilde^n is mu^n plus theta scaled by the weight, and in the entropic case it is mu^n times the exponential of the scaled theta, where the addition and multiplication are elementwise on vectors. The second thing we note is that we're optimizing over a polytope, which is an intersection of a bunch of linear constraints. So we have to perform a Bregman projection onto an intersection of linear constraints. If we can perform projections onto individual linear constraints efficiently, then cyclic projection tells us we can repeat this process cyclically and eventually find the optimum over the entire local polytope. That's the key idea of our method: we have an optimization problem to solve over this local polytope, and a toy sketch of the cyclic projection idea follows.
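(A toy numerical sketch of the two-lines example, using ordinary Euclidean projections; the actual method uses Bregman projections onto the normalization and marginalization constraints, but the cyclic structure is the same:)

```python
import numpy as np

def project_onto_line(p, a, b):
    """Euclidean projection of p onto the hyperplane {x : a @ x = b}, a != 0."""
    return p - (a @ p - b) / (a @ a) * a

# Two lines in the plane whose intersection is the single point (1, 2).
a1, b1 = np.array([1.0, 0.0]), 1.0   # the line x = 1
a2, b2 = np.array([1.0, 1.0]), 3.0   # the line x + y = 3

p = np.array([5.0, -3.0])
for _ in range(50):                   # cycle through the constraints
    p = project_onto_line(p, a1, b1)
    p = project_onto_line(p, a2, b2)
print(p)                              # approaches the intersection [1, 2]
```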
We start with some initial point mu-zero and perform the initialization step I just described to go to mu-tilde. We then project mu-tilde onto the constraint set using Bregman projections, repeat this process several times, and eventually we converge to mu-star, which is the optimum of our linear programming relaxation. To give you a flavor of what these projections and updates look like, let's consider the special case where the Bregman function is the weighted entropy function; in fact, here I'm using uniform weights, so you can think of it as a sum of a bunch of entropies. The initialization step is the one I mentioned on an earlier slide; it's the same as that. The interesting thing is that to project onto the normalization and marginalization constraints you need to iterate a set of messages over all the edges, and for normalization you just divide everything by the sum of the probabilities over all the labels -- you normalize by division in the usual way. This message-passing procedure over all the edges is the only real operation that you need to perform at every step. So this is actually very similar to the belief propagation kind of methods that were already being used, but it is now a convergent version of those belief propagation methods. And because it is so similar to belief propagation, it ends up inheriting all the good properties of those methods: because the updates are very local, you can implement them independently for all the edges, so you can distribute your updates, run them in a parallelized fashion, and derive similar algorithms under other divergences. A small sketch of these entropic updates follows.
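(A sketch of the entropic updates for a single node distribution -- hypothetical code; the weighting convention for omega is an assumption, and the real algorithm interleaves normalization and marginalization projections over all edges:)

```python
import numpy as np

def entropic_init(mu, theta, omega):
    """Initialization under the entropy prox function: multiply the current
    marginal elementwise by exp(theta / omega) (convention assumed here)."""
    return mu * np.exp(theta / omega)

def project_normalization(q):
    """Bregman (KL) projection of a positive vector onto the probability
    simplex: just divide by the sum, exactly as described in the talk."""
    return q / q.sum()

theta = np.array([1.0, 0.2, -0.5])   # node potential
mu = np.full(3, 1.0 / 3.0)           # current node marginal
mu_next = project_normalization(entropic_init(mu, theta, omega=1.0))
print(mu_next)                        # renormalized, tilted toward high-theta labels
```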
There's only one case where the algorithm looks significantly different, and I'm going to point that one out as well: the tree-reweighted proximal solver. Here the difference is that -- remember, in the initialization step I was going from mu to a mu-tilde, so I stayed within the probability domain. In the tree-reweighted case, my initialization step leaves the probability domain. I basically compute a set of parameters at every round -- theta^n, the theta_s's and the theta_st's -- which are set in a particular way depending on the gradients of my Bregman function and so on. One way to think of it is that you have roughly taken the log of the earlier updates; the theta here is sort of like the log of the mu there. You just compute this set of thetas according to a fixed deterministic rule. The interesting thing that happens is that you can then show that the Bregman projection problem you need to solve can be cast exactly as the standard problem that the tree-reweighted sum-product algorithm solves. (Tree-reweighted message passing has both a sum-product version and a max-product version.) You can show that the inner loop performing the Bregman projection just requires you to solve the tree-reweighted sum-product problem, for which plenty of very efficient solvers are available, developed by Wainwright and Kolmogorov and later by a lot of other people. So the idea of using tree-reweighted sum-product solvers as the inner workhorse of our method is a very attractive one, because they can potentially be a lot more efficient than doing cyclic projections.

At this point it seems we basically have a method that can give us exact solutions to LP relaxations and that has distributable and parallelizable updates -- let me remark that tree-reweighted sum-product algorithms are very similar to belief propagation, so they also have this property. Additionally, our methods make it easy to incorporate new constraints, because at least in the cases where we're doing cyclic projections, a new constraint only requires computing the projection onto that individual constraint, which usually is not that hard to do. The only thing not clear at this point is what the time complexity of our methods is going to look like. These methods have a double-loop structure, which is often considered sort of a big taboo in a lot of these optimization problems. And indeed, if we have to wait for numerical convergence, our methods can actually take quite a while to converge. However, the second trick we use is to come up with some very nice rounding schemes that allow us to converge really, really fast when our LP relaxation is tight: when the relaxation is exact and has an integral solution, we can come up with certain rounding schemes that give us very nice convergence properties. So what's the idea with rounding schemes? The idea is that we're trying to solve an integer linear program here anyway. If, at some point while solving my linear program, I can take the set of probabilities I have, extract an integral solution out of them, and give you a certificate of optimality -- tell you that this integral configuration is actually the MAP labeling -- then the problem is solved; I don't need to optimize any further and wait for numerical convergence of my LP-solving procedure. This is exactly what the rounding schemes aim to do. What might these schemes look like? The simplest thing is maybe to just take the highest-probability label for every node according to my node marginals. That doesn't work; there are plenty of examples in the literature where such a rounding scheme fails. Okay, we can't work with nodes, so let's try to work with edges. We define a certain local quantity on every edge -- its exact nature is not important, just something based on the node and edge marginals. The important thing to note is that the local quantity is based only on that edge and its two endpoint nodes. For every edge, we find the label pair that maximizes its local quantity. Now, of course, a lot of edges are going to share nodes, and both such edges will want to assign some label to the shared node. If this label turns out to be the same for both edges, and this happens across all the edges, then we say we have found a consistent labeling through the edge rounding method, as in the sketch below. However, it can always happen that two edges do not agree on the label they assign to a shared node; in that case we declare failure and continue optimizing further. It just means that our marginals are not yet good enough for the rounding scheme to work.
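(A sketch of the edge-rounding consistency check -- illustrative only; the per-edge score below is a placeholder, since the talk deliberately leaves its exact form unspecified:)

```python
import numpy as np

def edge_round(mu_st):
    """Try to extract an integral labeling from edge-local maximizers.
    mu_st maps an edge (s, t) to its (m x m) edge marginal.
    Returns {node: label}, or None if two edges disagree on a shared node."""
    labels = {}
    for (s, t), M in mu_st.items():
        # Placeholder per-edge score: the edge marginal itself; the talk
        # uses a quantity built from both node and edge marginals.
        xs, xt = np.unravel_index(np.argmax(M), M.shape)
        for node, lab in ((s, int(xs)), (t, int(xt))):
            if labels.setdefault(node, lab) != lab:
                return None   # conflict: marginals not yet good enough
    return labels

# Tiny example: a two-edge path 0 - 1 - 2 whose maximizers agree on node 1.
mu_st = {
    (0, 1): np.array([[0.6, 0.1], [0.2, 0.1]]),  # argmax -> labels (0, 0)
    (1, 2): np.array([[0.1, 0.7], [0.1, 0.1]]),  # argmax -> labels (0, 1)
}
print(edge_round(mu_st))  # {0: 0, 1: 0, 2: 1}
```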
We don't need to restrict ourselves to edges; we can go to higher-order neighborhoods like stars, or trees. Again, all these neighborhoods -- stars, trees -- are very easy to optimize over, because things like max-product are going to work: just two rounds of a dynamic programming algorithm suffice to find the highest-probability labeling over such simple structures. So we can find these labelings very efficiently, and again we check for consistency over the shared nodes. While the rounding schemes do not enjoy any guarantees in general, what we can show is that if the edge or tree rounding scheme finds a consistent configuration, then that configuration is indeed the MAP labeling for the set of random variables. This theorem relies on certain invariants that our algorithms maintain across the updates; based on those invariants we can show that if at any point we have reached agreement using this rounding method, then the result is globally optimal. And empirically, at least in the regimes where I've tested it, if TRW max-product takes you X amount of time, then these rounding schemes will give you a solution in no more than 8X or 10X of time, which is not bad, given that TRW does not give you a guarantee that it solves the LP relaxation exactly, while we do give you that guarantee. But of course you can only hope for this when the LP relaxation is tight; otherwise you cannot hope to get convergence through rounding. Just to remark about rates of convergence: for proximal schemes, certain rates of convergence are well known, but they assume that the inner problem -- the Bregman projection problem, in our case -- is solved exactly. We use cyclic projections, which only asymptotically give us the solution to the projection problem, so we have an inexact solution to the Bregman projection problem. But people in proximal minimization theory were nice enough to analyze that case for us as well, and we know that at least a linear rate of convergence is guaranteed even when the inner solutions are inexact. However, in this case what we can show is that if the LP relaxation is tight, then you get finite convergence, no asymptotics: you can prove a bound on the number of iterations after which one of the rounding schemes is going to succeed. And that's, again, a very nice property of these methods. So that's great, but all of that works when the LP relaxation is tight. Of course, when the LP relaxation is not tight, you still want to be able to extract a good solution from whatever probability vector you're stuck with. In particular, the deterministic rounding schemes don't tell you what to do when you have a conflict over a node or an edge. So we want a rounding scheme that gives you a principled prescription for what to do -- something hopefully better than arbitrary tie-breaking when there are conflicts. So we come up with some randomized rounding schemes, which work regardless of whether the LP relaxation is tight or not; they work in all cases. They can be easily derandomized for efficient implementation. It can be shown that for the case of [inaudible] the randomized version finds the correct solution with high probability. The derandomized version, again, has an iteration bound that gives you finite convergence for integral LPs. The really neat part about these randomized rounding methods is that they're not designed specifically to work with our methods.
While the deterministic rounding schemes rely on certain invariants that my algorithm maintains, the randomized rounding scheme only relies on having access to some LP solver with a linear rate of convergence to its optimum. As long as you have that linear rate of convergence, the derandomized version is going to enjoy finite convergence and the randomized version will converge with high probability. So these randomized rounding schemes also have the appeal that they can be applied to other LP-solving procedures. So -- the slide ate one of the edges; it was there on my computer yesterday, I promise. The idea of the randomized rounding scheme is that you have a graph with a vertex set and an edge set. You take a subset of the edges and chop them off from the graph. So imagine a black line here, by the way, completing this grid, and imagine I chop off all the blue edges. Then I'm left with a bunch of trees -- not an arbitrary bunch of trees, but a vertex-disjoint bunch of trees; these trees do not share any vertex. That's what we call a forest: after disconnecting those edges, we're left with a forest. Now, as before, I go back to this way of defining probability distributions over trees based on the marginal probabilities. For each subtree T_i I define this probability distribution, and I sample a labeling for the nodes involved in that tree from this distribution. Now, you might ask how efficient this sampling is and all that, but I'm just going to give you the derandomized version if you ask me that question, so it's not even important. What is important is that, because these trees are vertex-disjoint, there are no conflicts now: you can just concatenate the labels for each of these subtrees and you have a consistent labeling for the entire graph, with all the properties that I mentioned on the previous slides. A sketch of the derandomized variant follows.
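(A sketch of the derandomized idea on a grid, with two assumptions made for brevity: the deleted edges leave the rows as vertex-disjoint chains, and the per-chain potentials below stand in for the tree distributions built from the marginals in the talk:)

```python
import numpy as np

def viterbi_chain(node_pot, edge_pot):
    """MAP labeling of a chain via two passes of dynamic programming.
    node_pot: (n, m) log-potentials; edge_pot: (n-1, m, m) log-potentials."""
    n, m = node_pot.shape
    msg = np.zeros((n, m))
    back = np.zeros((n - 1, m), dtype=int)
    msg[0] = node_pot[0]
    for i in range(1, n):                       # forward (max-product) pass
        scores = msg[i - 1][:, None] + edge_pot[i - 1]
        back[i - 1] = scores.argmax(axis=0)
        msg[i] = node_pot[i] + scores.max(axis=0)
    labels = np.zeros(n, dtype=int)
    labels[-1] = msg[-1].argmax()
    for i in range(n - 2, -1, -1):              # backward (backtracking) pass
        labels[i] = back[i, labels[i + 1]]
    return labels

# Chains are vertex-disjoint, so the per-chain labelings never conflict and
# simply concatenate into one consistent labeling of the whole grid.
rng = np.random.default_rng(0)
rows, cols, m = 4, 5, 3
labeling = np.stack([
    viterbi_chain(rng.standard_normal((cols, m)),
                  rng.standard_normal((cols - 1, m, m)))
    for _ in range(rows)
])
print(labeling)   # one row of labels per chain in the forest
```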
So much for the theoretical convergence rates; let's see how this works in practice. I was not interested in trying out any huge problems; I just wanted a small proof-of-concept implementation. So I tried it on small grid graphs with 100 to 200 nodes, five node labels, and the usual standard Potts-style mixed potentials with varying signal-to-noise ratio, which is just the ratio of the strength of the node potentials to the edge potentials -- so low SNR is the hard problem here. We wanted to verify the rates of convergence we just claimed, and we wanted to test the rounding schemes. The first thing was to verify the rate of convergence. We took the log of the distance of the marginal probability vector to the fixed point mu-star -- which, by the optimality of the method, is also the LP optimum -- and plotted it against the number of iterations. On the log plot you see there was a superlinear rate of convergence; it would have been a straight line if the rate of convergence were linear. So, okay, that theorem was correct: these iterates are converging well. The interesting thing is that these curves do not differ that much based on problem size, so that bodes well for the scalability of the method. And let me remark that this plot was done using the entropic proximal solver; the tree-reweighted proximal solver gives similar and even better convergence in a lot of cases, and scales even better with problem size, because of the inherent scalability properties of tree-reweighted sum-product algorithms -- and also because the people who developed those solvers write much better code than I do. For the convergence of the rounded solution, for the edge-based rounding scheme we plotted the number of edges in the graph where conflicts were found during rounding, and how this evolves with the number of iterations. Again, we see that within a very small number of iterations these conflicts have pretty much vanished, which means we get to the LP optimum by rounding really fast. And again, let me remark that the other rounding schemes -- the tree-based rounding scheme, for instance -- work even better in a lot of cases, because they look over larger neighborhoods; the tree-based scheme often converges even faster than the edge-based one, and the derandomized versions of the randomized rounding schemes show similar behavior. With respect to problem size, we again have pretty good scaling here. Finally, we also looked at the energy of the rounded solution, because it could be the case that the rounding scheme hasn't declared convergence but has already obtained the right solution. If the energy of the rounded solution stops descending, you know it recovered the right solution well before it was able to give you a certificate of optimality. And that does happen to some extent: the energy of the rounded solution gets close to the best possible energy really, really fast. So, again, this is evidence that these rounding schemes work well in practice.

To sum up, we have proposed some new MAP optimization algorithms which use proximal minimization, where graph-structured Bregman divergences allow us to exploit the particular problem structure at hand. We solve the resulting subproblems in some cases using cyclic Bregman projections and in other cases using tree-reweighted solvers, but all the methods exploit the structure of the local polytope to a large extent. We come up with simple message-passing updates that come with a guarantee of convergence to the LP optimum. Let me remark that this is a very nice departure in this area: most of the algorithms known are just algorithms -- they don't come with certificates of optimality; they're just shown to work well in practice. The only real exception is probably the subgradient method, which, by virtue of being an exact optimization method, works, but works horribly slowly in practice in my experience. And finally, we come up with a bunch of rounding schemes that allow really, really fast solutions for integral LPs, and the randomized rounding schemes also extend to other methods. That's all I had to say; thanks a lot for listening to my talk.

[applause]

>> Dengyong Zhou: Questions?

>>: Question. So the cyclic projections -- can I view them as a random --

>> Alekh Agarwal: Yes, it is actually very closely related -- it is related to doing coordinate descent in the dual, I think, if I remember correctly. That is one interpretation that has been shown for it.

>>: So another question is on the roundings, [inaudible] so is there any -- is there any connection [inaudible] like sometimes you look at what gets violated and then [inaudible].
>> Alekh Agarwal: Right. Right. But I don't really think that it is connected to [inaudible] in any way, because it's not that, based on what happens in the rounding schemes, we're going to solve a different problem; we always solve the same problem. The rounding scheme is just a convergence test that you perform, basically. The deterministic rounding schemes, at least, you should really think of as taking a kind of reparameterization view of the algorithm, which has been popularized a lot by these tree-reweighted message passing algorithms. And here, again, we can show there's a reparameterization going on, and that's what allows you to do these rounding schemes.

>>: So I have a question, just to make sure I understand. The reason you do this cyclic Bregman projection is that you want to have some local thing?

>> Alekh Agarwal: Yes.

>>: But, of course, in principle, you could project onto some [inaudible], where several subsets C_i are involved, and in principle this kind of cyclic scheme would converge even faster, right?

>> Alekh Agarwal: You can lift it to higher-order neighborhoods if you like. So in fact, yes, you certainly can: for instance, instead of taking just an edge and a node, you can start taking something like a star neighborhood, and you can show exact projections in closed form even onto bigger neighborhoods, and that's not so bad. And actually, the way you implement these projections changes the efficiency of the implementation a lot.

>>: But [inaudible] kind of a more sophisticated projection, because you're gaining something in terms of convergence speed but you're losing something in terms of making a projection, right?

>> Alekh Agarwal: As long as you're picking neighborhoods that are small enough that, in your parallelized optimization, each one can reside on one machine, maybe -- for instance, in a lot of these image problems your graph is a grid, so if you select your neighborhoods well, you can make sure that most things still stay on the same machine. Then you can hopefully project onto bigger neighborhoods without losing too much efficiency, and it is certainly going to speed things up.

>> Dengyong Zhou: No further questions.