>> Dengyong Zhou: Actually, this afternoon Alekh will give a talk about message passing. Actually, I'm quite proud to invite Alekh to give a talk. Alekh is currently a second-year Ph.D. student at Berkeley, advised by [inaudible]. Before that he got a Bachelor's degree in computer science from IIT.
>> Alekh Agarwal: Thanks for the introduction, Denny. So today I'll be talking about proximal methods and rounding schemes for certain graph-structured optimization problems. And this is joint work with Pradeep Ravikumar and Martin Wainwright at U.C. Berkeley.
In this work we're concerned with solving some optimization problems that arise in Markov random fields.
So let me quickly introduce, recap what Markov random fields are to you. A Markov random field is
basically a representation for a probability distribution on a collection of random variables.
So we have random variables, let's say X1 through XP and we assume that these are discrete random
variables. So each of them takes values 1 through M. And we have a probability distribution over these
random variables that is characterized by the structure of a graph which is the Markov random field. And
the probability distribution is parameterized by two sets of parameters.
The local parameters are the node potentials, which give you local information about a node's affinity to take on particular values. And we have the edge potentials, which describe the nature of the pair-wise interaction between pairs of random variables. So for instance, if two variables are positively correlated, you expect the edge potential to have a high positive value, and to have a high negative value if they're negatively correlated, and so on.
By the theory of MRFs, the conditional independence relations under these distributions are characterized by connectivity properties of the graph.
So the problem we're interested in solving is the problem of finding the MAP labeling for this set of random variables, which is -- yes?
>>: So [inaudible] beyond pair-wise?
>> Alekh Agarwal: Well, the thing is, you can in general represent any arbitrary potential as a pair-wise potential by certain methods that have already been proposed in the literature. I did not explicitly consider a form beyond the pair-wise form because there's sort of a black-box reduction you can always do to convert it into a pair-wise potential, by forming something like super-nodes.
So we're interested in finding this MAP labeling, which is basically the mode of this distribution: the labeling that has the highest probability under this model. And it's quite obvious that it is the labeling that maximizes, over all possible labelings, this linear objective inside the exponential function.
So clearly, because we have these discrete values only, the problem of finding this MAP labeling is an integer linear program, and hence it is known to be NP-hard; there are lots of hardness results on problems of this kind.
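For concreteness, the problem being discussed can be written as follows (a standard formulation; the notation here is mine, not necessarily the slides'):

```latex
% MAP labeling over discrete variables x_s \in \{1,\dots,m\}, s = 1,\dots,p:
\hat{x} \;=\; \arg\max_{x \in \{1,\dots,m\}^{p}}
\;\sum_{s \in V} \theta_s(x_s) \;+\; \sum_{(s,t) \in E} \theta_{st}(x_s, x_t).
```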
So this problem arises a lot and there's been a lot of previous work on it, starting from the classical max-product algorithm, which is a dynamic programming algorithm that is known to be exact on trees and is basically a generalization of the Viterbi algorithm from HMMs.
One field in which this problem arises a lot is computer vision, where people try to use MAP labelings for, for instance, image segmentation and other problems.
So some very nice literature on this problem came out of the vision community. In particular, they showed that for certain kinds of potentials you can use the technique of graph cuts to find the MAP labeling, and there is a very nice survey in the work of Szeliski et al. Probably the state-of-the-art algorithm for this problem is the tree-reweighted max-product algorithm, which was proposed by Wainwright et al. and further studied by Kolmogorov and co-authors. And recently Kolmogorov and Globerson proposed a globally convergent version of this algorithm by using the idea of [indiscernible] hypertrees.
There have been a lot of convexified tree energy approximations, which are also related to the work that we'll be discussing in the next few slides; these were proposed by Yirise [phonetic] and co-authors.
A lot of convex relaxations of this integer linear program to things like linear programs, quadratic programs, second-order cone programs, SDPs, et cetera, have been studied in a lot of detail. Some of the more recent work related to Lagrangian relaxation and simulated annealing is also related to our work.
And there is the idea of basically using the subgradient method after doing some kind of dual decomposition, which is one of the only convergent and exact methods to solve the relaxed version of the problem that I'll be describing on the next slide.
So what's the basic idea? Well, we have an integer linear program. We do what people always do, which is to form a linear programming relaxation. And we do this by observing that even though we are optimizing over discrete-valued variables -- which can be seen as optimizing over a bunch of indicator variables -- we can always relax the zero-one indicator variables to their expected values, and this does not change the optimum of the linear program.
So we can always replace the indicator variables with their expectations, which are the node marginal probabilities, the mu_s's, and the edge marginal probabilities, the mu_st's. And now, of course, because these marginal probabilities are coming out of a single consistent distribution, they cannot be completely arbitrary. They will be related in some ways; there will be some constraints relating these mu_s's and mu_st's. Note that if we can enumerate those constraints and optimize this linear objective in the mus over them, then we can hope to solve this relaxed problem efficiently and recover the MAP labeling.
Since the original problem was NP-hard, something has to go wrong. And the thing that goes wrong is that the number of constraints that characterize these distributions is exponentially large in the number of optimization variables. You cannot even enumerate all the constraints in polynomial time, let alone optimize over them.
So the trick people use is to take a subset of these constraints that is tractable that can be easily optimized
over.
So some of the natural constraints are the following. Of course, these are probabilities, so they should be nonnegative. And the probabilities of a node taking particular labels should add up to one when you sum over all labels; that's the normalization constraint.
In addition, the edge marginal probability should be consistent with the node marginal probability, which
means that if you sum out over all possible labels of one node, then you should get node marginal
probability of the other node, and that's the marginalization constraint mentioned here. And the set of
normalization and marginalization constraints taken together is often referred to as the local polytope
because it's just looking at sort of first order local constraints and not looking at higher order as such in the
graph.
So the first-order linear programming relaxation, the best known in the literature, is this maximization problem -- over the nonnegativity constraints and the local polytope -- of the inner product between the parameter vector theta and the marginal probability vector mu. And this LP relaxation is what we aim to solve.
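Written out (a standard way to state it; the symbol L(G) for the local polytope is my shorthand, not necessarily the slides'):

```latex
\max_{\mu \,\in\, L(G)} \; \langle \theta, \mu \rangle,
\qquad
L(G) \;=\; \Big\{ \mu \ge 0 \;:\;
\sum_{x_s} \mu_s(x_s) = 1 \;\;\forall s \in V,\;\;
\sum_{x_t} \mu_{st}(x_s, x_t) = \mu_s(x_s) \;\;\forall (s,t) \in E \Big\}.
```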
But before we even go about doing that, it's clear this is a linear program now. And we can just use any
off-the-shelf LP solving procedure to solve it. So what's the big deal, why do we need a new method? The
problem is that the classical methods turn out to be too expensive and too slow for the typical problem
sizes that are often encountered in applications of this problem.
So, for instance, I mentioned computer vision. Often the number of labels is as big as 256. And the
number of variables is the number of pixels in your picture.
So it's a pretty huge number. And the total number of optimization variables that you have is the number of edges in the graph times the number of labels squared. And if your method scales badly with the number of variables, then you can forget about using it on this problem.
So that's one issue. The second is that the classical methods are often not easy to implement in a parallelized, distributed fashion, which, again, is a big concern at large problem sizes.
And both these issues arise because we are trying to use something generic that does not leverage the particular structure: an optimization problem coming out of an underlying graphical structure has some very nice properties, and we need to leverage these properties if we want to come up with an efficient method to solve these linear programs.
Let me note that the tree-reweighted max-product algorithm that I referred to earlier does address a lot of these concerns, because it solves a particular dual instance of the LP relaxation that I just mentioned.
However, tree-reweighted max-product is specifically tailored to the constraints of the local polytope. Since the number of constraints is exponentially large, people often also want to work with higher-order constraints, known as cycle inequalities and so on and so forth. And modifying tree-reweighted max-product to handle new constraints is a very challenging job. People have tried it in the past and it gets very messy very soon.
So essentially we want to come up with an LP-solving procedure that provides an exact solution to the LP relaxation, with a time complexity comparable with algorithms like max-product and tree-reweighted max-product; that is easy to implement in a parallelized and distributed fashion; and that provides the capability of seamlessly incorporating new constraints into our relaxation if we wish to.
So in order to do this we need to borrow some techniques from optimization theory. The first concept that comes in handy is that of Bregman divergences, which can be defined for any strictly convex function F. The Bregman divergence under F between two points mu and nu is basically a gap. (The labels on the slide should have been nu and mu.) You take the point mu and evaluate the function at that point. Then you take the reference point nu, form the Taylor approximation of the function at nu, and evaluate the gap between the function value and this Taylor approximation. By the convexity of F this gap is always going to be nonnegative, and if F is strictly convex it's going to be strictly positive whenever the points differ.
And the more curved the function is, the larger this gap will be. The reason it's called a divergence is that it has certain similarities to the notion of distance that we are used to, which I will remark on in a minute.
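Written out (the standard definition, in my notation):

```latex
D_F(\mu \,\|\, \nu) \;=\; F(\mu) \;-\; F(\nu) \;-\; \big\langle \nabla F(\nu),\; \mu - \nu \big\rangle .
```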
But let's first see some examples of Bregman divergences. The first and most natural example is when the Bregman function F is the Euclidean norm squared. I'll be dealing with Bregman divergences defined over vectors, so the functions I define will always be functions of vectors.
So the first example is the Euclidean norm squared of a vector. In this case the Bregman divergence is just the squared Euclidean distance between two vectors. That's kind of nice. It tells us these divergences definitely incorporate the standard notion of distance that we are used to.
Another extremely classical example uses the negative entropy. Now, in this case I have probability distributions on nodes and edges. So what I do is define the negative Shannon entropy on each node marginal distribution and each edge marginal distribution, and then take a weighted linear combination of these individual node and edge entropies, which is also a convex function by the convexity of the individual functions. And then, not surprisingly, the Bregman divergence turns out to be the weighted sum of the KL divergences of the node and edge marginal distributions.
This is actually going to be one of the most useful Bregman divergences for us, as we will see in a minute.
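As a quick sanity check, here is a small numerical sketch of these two examples (my own illustration, not from the talk; the helper names are hypothetical):

```python
import numpy as np

def bregman(F, grad_F, mu, nu):
    """Generic Bregman divergence D_F(mu || nu) = F(mu) - F(nu) - <grad F(nu), mu - nu>."""
    return F(mu) - F(nu) - np.dot(grad_F(nu), mu - nu)

# Example 1: F = squared Euclidean norm -> D_F is the squared Euclidean distance.
sq = lambda x: np.dot(x, x)
sq_grad = lambda x: 2.0 * x

# Example 2: F = negative Shannon entropy -> D_F is the KL divergence.
negent = lambda p: np.sum(p * np.log(p))
negent_grad = lambda p: np.log(p) + 1.0

mu = np.array([0.7, 0.2, 0.1])
nu = np.array([0.3, 0.4, 0.3])

print(bregman(sq, sq_grad, mu, nu), np.sum((mu - nu) ** 2))   # these two agree
print(bregman(negent, negent_grad, mu, nu),
      np.sum(mu * np.log(mu / nu)))                           # and so do these
```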
The third useful and extremely non-standard Bregman function that I'll be using -- and in fact it's not a Bregman divergence in the strict sense -- is what I call the tree-reweighted divergence. This is an idea borrowed from the tree-reweighted message-passing literature. So you have a set of spanning trees in the graph, which means you have a collection of trees, and each tree contains all the vertices in your graph and some subset of the edges.
You define the notion of edge appearance probabilities, rho_st, which is just the fraction of the trees that an edge appears in. So if you have two spanning trees, and an edge appears in both of them, then you set rho_st to be 1. If it appears in just one of them, you set rho_st to be 0.5. If it doesn't appear in either of them, rho_st would be zero. That's the idea of these edge appearance probabilities.
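Computing these appearance probabilities is mechanical; a minimal sketch (my own, with trees represented simply as sets of edges):

```python
from collections import Counter

def edge_appearance_probs(trees):
    """Given spanning trees as sets of undirected edges (s, t) with s < t,
    return rho[(s, t)] = fraction of trees containing that edge."""
    counts = Counter(e for tree in trees for e in tree)
    return {e: c / len(trees) for e, c in counts.items()}

# Two spanning trees of the 4-cycle 0-1-2-3-0:
t1 = {(0, 1), (1, 2), (2, 3)}
t2 = {(0, 1), (1, 2), (0, 3)}
print(edge_appearance_probs([t1, t2]))
# {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 0.5, (0, 3): 0.5}
```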
Using these edge appearance probabilities and these spanning trees, we basically define a tree-reweighted entropy, which is the average of the negative entropies of a bunch of probability distributions, where each probability distribution is defined on one tree.
So the idea is that once you have your node and edge marginal distributions, there is a well-known consistent way of defining a probability distribution over the variables involved in a tree.
So this sort of scary-looking formula is actually a very standard probability distribution, known in the literature, for trees based on these marginal probabilities.
And you just take this distribution and compute its entropy. It again has a nice and simple form, which I will not write out here, though. And then you just take the average of these entropies across the trees.
So there is a corresponding Bregman divergence which is not so nice to write out. So I'm not going to write
it. But the reason why I mentioned this entropy function is because it's going to inspire a very nice
algorithm which works remarkably well in practice.
So just to quickly mention the properties of Bregman divergences that make them similar to distances: as I mentioned, they are always nonnegative, and they'll be zero if and only if the two points are identical. However, as was clear from the entropy example, these divergences do not enjoy properties like symmetry or the triangle inequality.
>>: So, sorry, this divergence actually does the [inaudible].
>> Alekh Agarwal: Yes, it does. So it's not exactly a Bregman divergence. Let me tell you why. This function is actually strictly convex only when the mus come from the local polytope that I described. If the mus do not satisfy the constraints of the local polytope, this function is not even going to be convex. So it's convex on your constraint set and nonconvex outside it, but convexity on the constraint set is sufficient for me to come up with the algorithms I want.
The next tool that we borrow from the optimization literature is the idea of proximal minimization, which is sort of a regularization technique. It says: I have some objective function F that I'm interested in minimizing, but I don't like this function; it has some bad properties, and it's not very nice to optimize over directly. So I'm going to add a second component to my optimization problem.
This second component in the proximal minimization literature is usually some kind of generalized distance
function. People call it the Prox function. It's basically some kind of a distance function and as you might
guess for us it's going to be a Bregman divergence.
So why is this nice? Well, it gives us a bunch of nice properties. First of all, adding this Bregman divergence, just by its strict convexity, makes our entire optimization problem strictly convex.
So we know the dual is going to be nicely behaved; there's going to be nice strong duality and everything, and we can go and work in the dual if we want to.
However, there are even better properties that we get as a result. Note that instead of solving just one problem of maximizing theta dot mu, I now set up a sequence of problems, where mu^(n+1) is defined to be the minimizer of my linear objective plus a weighting factor times the Bregman divergence to the previous iterate, and I solve this sequence of subproblems.
Now, let's assume for a moment that my sequence of mus is converging to a fixed point. Then, of course, this Bregman divergence term will start shrinking to zero, because my iterates are getting closer and closer.
Hence, the only possible fixed point of the sequence is going to be the optimizer of the original function F that I was intending to optimize.
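In symbols, the sequence being described is the standard proximal-point iteration (my reconstruction; the exact placement of the weight omega^n is an assumption):

```latex
\mu^{n+1} \;=\; \arg\min_{\mu \in L(G)}
\Big\{ -\langle \theta, \mu \rangle \;+\; \tfrac{1}{\omega^{n}}\, D_F\big(\mu \,\|\, \mu^{n}\big) \Big\}.
```

At a fixed point the divergence term vanishes, which is exactly the fixed-point argument just made.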
The idea of adding this second, strictly convex term is something that people have been using in simulated annealing, for instance, for a long time, and have tried to apply to this MAP optimization problem already, but were unable to prove theoretical guarantees about.
But what we get here is that in simulated annealing it's kind of essential that you send omega to infinity at a particular rate, which may or may not be needed in general proximal minimization, because the Bregman divergence term is already shrinking by itself. So omega can afford to be a little larger, and that kind of gives us an interesting degree of freedom when we apply proximal minimization techniques to solve this problem.
Since we have a Bregman divergence in our objective function, another concept that comes in very handy is the cyclic projection property that Bregman projections enjoy. Since the Bregman divergence is a strictly convex function, we can define, for any reference point nu, its Bregman projection onto a convex set: the point mu-hat in the convex set that is closest to nu in the sense of this Bregman divergence. (My mus and nus are all over the place today.)
Now, the property of cyclic projections tells us the following. Suppose I have a complicated convex set that can be written as an intersection of a bunch of simpler convex sets. So, for instance, let's say my simple convex sets are two lines; their intersection is, of course, a single point, which is not so complicated, but it serves for the purposes of the example.
I start with a point P that I want to project onto this intersection. I do it by first projecting onto the first line, then taking that projection and projecting it onto the second line, and repeating this process. It can be provably shown that I will converge to the projection of the point P onto the intersection of these convex sets C_i if I just repeat this process cyclically, and the rates of convergence and everything are well studied.
>>: What do you mean by cyclic? Do you have to project --
>> Alekh Agarwal: Right. So just one round of doing this sequence will not suffice. You have to keep on doing this, and asymptotically you will converge to the projection onto the intersection.
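As a minimal illustration (Euclidean case only, i.e., F = squared norm; my own sketch), cyclically projecting onto two lines written as hyperplanes a.x = b:

```python
import numpy as np

def project_hyperplane(x, a, b):
    """Euclidean projection of x onto the hyperplane {y : <a, y> = b}."""
    return x - (np.dot(a, x) - b) / np.dot(a, a) * a

# Two lines in the plane: x + y = 1 and x - 2y = 0; they meet at (2/3, 1/3).
constraints = [(np.array([1.0, 1.0]), 1.0), (np.array([1.0, -2.0]), 0.0)]

x = np.array([3.0, -2.0])
for _ in range(100):               # cycle through the constraints repeatedly
    for a, b in constraints:
        x = project_hyperplane(x, a, b)
print(x)                           # -> approximately [0.667, 0.333]
```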
So just to quickly recap: I started with the original MAP optimization problem, relaxed it, and then added a proximal function to it, setting up a sequence of problems.
At this point I do a slight rewriting to absorb the linear term into the Bregman divergence itself. This is done by taking the reference parameter mu^n and modifying it to a mu-tilde^n. The reason why I do this is that my proximal minimization problem has now been cast as a Bregman projection problem.
Okay. Now, the rewriting involved in going from mu^n to mu-tilde^n of course depends on the particular Bregman function being used. So to give you some examples: for the quadratic function it is mu^n plus theta, and for the entropic one it is mu^n times the exponential of theta, where addition and multiplication are elementwise for vectors.
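In symbols (reconstructed from the spoken description; the exact scaling by omega^n is my assumption):

```latex
% Quadratic Bregman function F(\mu) = \tfrac12\|\mu\|^2:
\tilde{\mu}^{\,n} \;=\; \mu^{n} + \omega^{n}\,\theta,
\qquad\text{entropic } F:\qquad
\tilde{\mu}^{\,n} \;=\; \mu^{n} \odot \exp\!\big(\omega^{n}\,\theta\big),
```

so that each proximal step becomes the Bregman projection of mu-tilde^n onto the local polytope.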
The second thing we note is that we're optimizing over a polytope, which is an intersection of a bunch of linear constraints. So we have to perform a Bregman projection onto an intersection of linear constraints. If we can perform projections onto individual linear constraints efficiently, then cyclic projection tells us we can repeat this process cyclically and we will eventually find the optimum over the entire local polytope. That's the key idea of our method.
We have an optimization problem to solve over this local polytope. We start with some initial point mu-zero and perform the initialization step that I just described to go to mu-tilde. We then project mu-tilde onto the set by using Bregman projections.
And we repeat this process several times, and eventually we converge to mu-star, which is the optimum of our linear programming relaxation.
To give you a flavor of what these projections and updates look like, let's consider the special case where the Bregman function is the weighted entropy function. In fact, here I'm using uniform weights, so you can think of it as a sum of a bunch of entropies.
The initialization step is the one I just mentioned on an earlier slide; it's the same as that.
And the interesting thing is that to project onto the normalization and marginalization constraints, you need to iterate a set of messages over all the edges. For normalization, you just divide everything by the sum of the probabilities over all labels; you just normalize by division in the usual way.
But this is the only real operation that you need to perform over all edges; you need to do this sort of message-passing procedure at every step.
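To make these updates concrete, here is a small sketch for the (uniformly weighted) entropic case. I derived these closed forms from the KL projections myself, so treat the exact square-root form as an assumption about the flavor of the update, not a quote of the authors' algorithm:

```python
import numpy as np

def project_normalization(mu_s):
    """Entropic (KL) projection onto {sum_x mu_s(x) = 1}: plain renormalization."""
    return mu_s / mu_s.sum()

def project_marginalization(mu_st, mu_s):
    """Entropic projection onto {sum_{x_t} mu_st(x_s, x_t) = mu_s(x_s)}.
    With uniform entropy weights this works out to a geometric-mean update."""
    row = mu_st.sum(axis=1)                      # current row marginals
    ratio = np.sqrt(mu_s / row)
    return mu_st * ratio[:, None], np.sqrt(mu_s * row)

# One pass over a single edge (s, t) with 3 labels:
rng = np.random.default_rng(0)
mu_s = project_normalization(rng.random(3))
mu_st = rng.random((3, 3))
mu_st, mu_s = project_marginalization(mu_st, mu_s)
print(mu_st.sum(axis=1), mu_s)                   # row sums now equal mu_s
```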
So this is actually very similar to the belief propagation kind of methods that were being used already but
this is now a convergent version of those belief propagation methods.
And because it is so similar to the belief propagation methods, it ends up inheriting all the good properties of those methods. Because these updates are very local, you can implement them independently for all the edges, so you can distribute your updates, you can run them in a parallelized fashion, and you can derive similar algorithms under other divergences.
There's only one case where the algorithm looks significantly different, and I'm going to point that one out as well, which is the case of the tree-reweighted proximal solver.
Now, here the difference is the following. Remember that in the initialization step I was going from mu to a mu-tilde, so I was staying within the probability domain. In the tree-reweighted case, the initialization step kind of leaves the probability domain.
So I basically compute a set of parameters at every round -- theta^n: the theta^n_s's and the theta^n_st's -- which are set in some particular way depending on the gradients of my Bregman function and so on. And one way to think of it is that you have roughly taken the log of those updates, so the theta here is sort of like the log of the mu there. But you just compute this set of thetas according to a fixed deterministic rule.
The interesting thing that happens is that you can then show that the Bregman projection problem you need to solve can actually be cast as the standard problem that the tree-reweighted sum-product algorithm solves. So tree-reweighted has a sum-product version as well as the max-product version. And you can show that the inner loop performing the Bregman projection just requires you to be able to solve the tree-reweighted sum-product problem, for which there are plenty of very efficient solvers available, developed by Wainwright and Kolmogorov and later by a lot of other people.
So the idea of using tree-reweighted sum-product solvers as the inner workhorse for our method is a very attractive one, because they can potentially be a lot more efficient than doing cyclic projections.
Now, at this point it seems we basically have a method that can give us an exact solution to the LP relaxation and that has distributable and parallelizable updates. Let me remark that tree-reweighted sum-product algorithms are very similar to belief propagation, so they also have this property.
And additionally, our methods make it easy to incorporate new constraints, because at least in the cases where we're doing cyclic projections, handling a new constraint only requires you to compute the projection onto that individual constraint, which usually is not that hard to do.
The only thing not clear at this point is what the time complexity of our methods is going to look like. These methods have a double-loop structure, which is often considered sort of the big taboo in a lot of these optimization problems.
And indeed, if we have to wait for numerical convergence of our methods, they can actually take quite a while to converge.
However, the second trick we use is to come up with some very nice rounding schemes that allow us to converge really, really fast when our LP relaxation is tight: when the relaxation is exact and has an integral solution, we can come up with certain rounding schemes that give us very nice convergence properties.
So what's the idea with rounding schemes? The idea is that we're trying to solve an integer linear program here anyway. If, at some point while solving my linear program, I can take the set of probabilities I have, extract an integral solution out of them, and give you a certificate of optimality -- tell you that this integral configuration is actually the MAP labeling -- then the problem is solved. I don't need to optimize any further and wait for numerical convergence of my LP-solving procedure.
This is exactly what the rounding schemes aim to do. What might these schemes look like? The simplest thing is to maybe just take the highest-probability label for every node according to my node marginals. That doesn't work; there are plenty of examples in the literature where such a rounding scheme would fail.
Okay. We can't work with nodes, so let's try to work with edges. We define a certain local quantity on every edge; its exact nature is not important, but it's just something based on the node and edge marginals. The important thing to note is that the local quantity is based only on that edge and its two endpoint nodes.
And for every local quantity, we find the label pair that maximizes it. Now, of course, a lot of edges are going to share nodes, and both such edges will want to assign some label to the common node. If this label turns out to be the same for both edges, and this happens across all edges, then we say we have found a consistent labeling through the edge rounding method.
However, it can always happen that there are two edges that do not agree over the label they assign to the shared node, and in that case we declare failure and continue optimizing further.
It just means that our marginals are not yet good enough for the rounding scheme to work.
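A sketch of this consistency check (my own illustration; the per-edge score is left unspecified in the talk, so `edge_scores` here is a hypothetical stand-in):

```python
import numpy as np

def edge_rounding(edges, edge_scores):
    """edges: list of (s, t); edge_scores[(s, t)]: an m x m array of local scores.
    Returns a node labeling if all edges agree on shared nodes, else None."""
    labels = {}
    for (s, t) in edges:
        score = edge_scores[(s, t)]
        i, j = np.unravel_index(np.argmax(score), score.shape)
        for node, lab in ((s, i), (t, j)):
            if labels.setdefault(node, lab) != lab:
                return None          # conflict: declare failure, keep optimizing
    return labels                    # consistent labeling found
```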
We don't need to restrict ourselves to edges. We can go to higher-order neighborhoods like stars -- a node together with its neighbors -- or trees. Again, all these neighborhoods, stars and trees, are very easy to optimize over, because things like max-product are going to work: just two rounds of a dynamic programming algorithm suffice to find the highest-probability labeling over such simple structures.
So we can find these labelings very efficiently and again we check for consistency over the shared nodes.
So while the rounding scheme does not enjoy any guarantees in general, what we can show is that if the edge or tree rounding scheme finds a consistent configuration, then that configuration is indeed the MAP labeling for the set of random variables.
And this theorem basically relies on certain invariants that our algorithms maintain across the updates. Based on that invariant we can show that if at any point we have reached an agreement using this rounding method, then it is going to be globally optimal.
And empirically, at least in the regimes where I've tested it, if TRW max-product takes you X amount of time, then these rounding schemes will give you a solution in no more than 8X or 10X of that time, which is not bad, given that TRW does not give you a guarantee that it will solve the LP relaxation exactly, while we do guarantee that we solve the LP relaxation exactly. But of course you can only hope for this if the LP relaxation is tight; otherwise you cannot hope to get convergence through rounding.
Just to remark about rates of convergence for proximal methods: certain rates of convergence are well known, but they assume that the inner problem -- the Bregman projection problem in our case -- is solved exactly. We use cyclic projections, which only asymptotically give us the solution to the projection problem, so we have an inexact solution to the Bregman projection problem. But people in proximal minimization theory were nice enough to analyze that case for us as well, and we know that at least a linear rate of convergence is guaranteed even when the inner solutions are inexact. However, in this case what we can show is that if the LP relaxation is tight, then you have finite convergence, no asymptotics: you can prove a bound on the number of iterations within which one of the rounding schemes is going to succeed.
And that's, again, a very nice property of these methods. So that's great, but all of that works when the LP relaxation is tight. Now, of course, when the LP relaxation is not tight, you still want to be able to extract a good solution from whatever probability vector you're stuck with.
In particular, the deterministic rounding schemes don't tell you what to do when you have a conflict over a node or an edge or something.
So we want a rounding scheme that gives you a principled prescription for what to do, something hopefully better than just arbitrary tie-breaking when there are conflicts.
So we come up with some randomized rounding schemes which work regardless of whether the LP relaxation is exact or inexact; they work in all cases. They can be easily derandomized for efficient implementation. It can be shown that in the case of [inaudible] the randomized version finds the correct solution with high probability. The derandomized version, again, has an iteration bound that gives you finite convergence for integral LPs.
The really neat part about these randomized rounding methods is that they're not designed specifically to work with our methods. While the deterministic rounding schemes rely on certain invariants that my algorithm maintains, the randomized rounding scheme only relies on the fact that it has access to some LP solver that has a linear rate of convergence to its optimum. As long as you have that linear rate of convergence, the derandomized version is going to enjoy finite convergence and the randomized version will converge with high probability.
So these randomized rounding schemes also have appeal in the fact that they can be applied to other LP-solving procedures. So the slide ate up one of the edges -- it was there on my computer yesterday, I promise. The idea with the randomized rounding scheme is that you have a graph with a vertex set and an edge set, and you take a subset of the edges and chop them off from the graph.
So imagine a black line here, by the way, completing this grid. And imagine I chop off all the blue edges. Then I'm left with a bunch of trees -- not an arbitrary bunch of trees, but a vertex-disjoint bunch of trees. These trees do not share any vertex; that's what we call a forest. After disconnecting those edges, we're left with a forest.
Now, as before, I go back to this way of defining probability distributions over trees based on marginal probabilities. And for each subtree T_i, I define this probability distribution and I sample a labeling for the nodes involved in that tree based on this probability distribution.
Now you might ask me how efficient this sampling is and all that, but I'm just going to give you the derandomized version if you ask me that question, so it's not even important.
What is important is that because these trees are vertex-disjoint, there are no conflicts now. You can just concatenate the labels for each of these subsets and you have a consistent labeling for the entire graph. And you're going to have all the properties that I just mentioned on the previous slides.
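A sketch of the sampling step on one tree (my own illustration; ancestral sampling from the standard tree distribution built from node and edge marginals is one way to realize the step the talk glosses over):

```python
import numpy as np

def sample_tree_labeling(root, children, mu_node, mu_edge, rng):
    """Ancestral sampling on a tree given by `children[s] -> list of t`.
    mu_node[s]: length-m node marginal; mu_edge[(s, t)]: m x m edge marginal."""
    x = {root: rng.choice(len(mu_node[root]), p=mu_node[root])}
    stack = [root]
    while stack:
        s = stack.pop()
        for t in children.get(s, []):
            cond = mu_edge[(s, t)][x[s]] / mu_node[s][x[s]]    # p(x_t | x_s)
            x[t] = rng.choice(len(cond), p=cond / cond.sum())  # renormalize for safety
            stack.append(t)
    return x

# Labelings of vertex-disjoint trees can simply be concatenated -- no conflicts.
```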
So much for the theoretical convergence rates; let's see how this works in practice. I was not interested in trying out any huge problems; I just wanted a small proof-of-concept implementation of this.
So I tried it on small grid graphs with 100 to 200 nodes, five node labels, and the usual standard Potts and mixed potentials with varying signal-to-noise ratio, which is just the ratio of the strength of the node potentials to the edge potentials -- so low SNR is the hard problem here. We wanted to verify the rates of convergence we just proposed, and we wanted to test the rounding schemes. The first experiment was to verify the rate of convergence. We took the log of the distance of the marginal probability vector to the fixed point mu-star, which by optimality of the method is also the LP optimum, and plotted it against the number of iterations. On the log plot you see there was a superlinear rate of convergence; it would have been a straight line if the rate of convergence were linear. So, okay, that theorem was correct, and hopefully that means these are converging well.
The interesting thing is that these curves do not differ that much based on problem size, which sort of bodes well for the scalability of the method. And let me remark that this plot was done using the entropic proximal solver; the tree-reweighted proximal solver gives similar and even better convergence in a lot of cases, and scales even better with respect to problem size, because of the inherent scalability properties of tree-reweighted sum-product algorithms, and also because the people who developed those solvers write much better code than I do.
For the convergence of the rounded solution, for the edge-based rounding scheme we plotted the number of edges in the graph where conflicts were found during rounding, and how this evolved with the number of iterations.
And, again, we see that in a very small number of iterations these conflicts have pretty much vanished, which means we get to the LP optimum by rounding really fast. And, again, let me remark that the other rounding schemes -- the tree-based rounding scheme, for instance -- work even better in a lot of cases, because they're looking over larger neighborhoods. It often converges even faster than the edge-based rounding scheme, and the randomized rounding schemes and their derandomized versions have similar behavior. Again, with respect to problem size we have pretty good scaling here.
Finally, we also looked at the energy of the rounded solution, because it could be the case that the rounding scheme hasn't declared convergence but has already obtained the right solution. So if the energy of the rounded solution stops descending, you know it had recovered the right solution well before it was able to give you a certificate of optimality. And that does happen to some extent: the energy of the rounded solution gets close to the best possible energy really, really fast. So, again, this is just a proof that these rounding schemes do work well in practice.
To sum up, we have proposed some new MAP optimization algorithms here which use proximal minimization, and the use of graph-structured Bregman divergences allows us to exploit the particular problem structure at hand. We solve the inner problems in some cases by using cyclic Bregman projections and sometimes by using tree-reweighted solvers, but all the methods exploit the structure of the local polytope to a large extent.
We come up with simple message-passing updates that have guaranteed convergence to the LP optimum. Let me remark that this is a very nice departure in this area. Most of the algorithms known are just algorithms; they don't come with any certificates of optimality; they're just shown to work well in practice. The only sort of exception is probably the subgradient method, which by virtue of being an exact optimization method works, but works horribly slowly in practice, from my experience.
And, finally, we come up with a bunch of rounding schemes that allow really, really fast solutions for integral LPs and that also extend beyond our own methods, when we look at the randomized rounding schemes.
So that's all I had to say, and thanks a lot for listening to my talk.
[applause]
>> Dengyong Zhou: Question.
>>: Question, so the cyclic projections, can I view them as a kind of coordinate descent?
>> Alekh Agarwal: Yes, it is actually very closely related -- it is related to doing coordinate descent in the dual, I think, if I remember correctly. That is one interpretation that has been shown for it.
>>: So another question is on the roundings, [inaudible]. Is there any connection [inaudible], like sometimes you look at what gets violated and then [inaudible]?
>> Alekh Agarwal: Right. Right. But I don't really think that it is connected to [inaudible] in any way, because it's not that, based on what happens in the rounding schemes, we're going to solve a different problem. We always solve the same problem, and the rounding scheme is just a convergence test that you make, basically. The rounding schemes, at least the deterministic ones, really what you should think of them as is some kind of reparameterization view of the algorithm, which has been popularized a lot by these tree-reweighted message-passing algorithms. And here, again, we can show there's a reparameterization going on, and that's what allows you to do these rounding schemes.
>>: So I have a question. Just to make sure I understand: the reason you do this cyclic Bregman projection is that you want to keep the operations local?
>> Alekh Agarwal: Yes.
>>: But, of course, in principle, you could project onto some [inaudible] where several of the subsets C_i are involved, and in principle this kind of cyclic scheme would converge even faster, right?
>> Alekh Agarwal: You can lift it to higher-order neighborhoods if you like.
So in fact -- in fact, yes, you certainly can. For instance, instead of taking just an edge and a node, you can start taking something like star neighborhoods, and you can show exact projections in closed form even onto bigger neighborhoods, and that's not so bad.
And actually, the way you implement these projections changes the efficiency of the implementation a lot.
>>: But [inaudible] kind of a trade-off with the more sophisticated projection, because you're gaining something in terms of convergence speed but you're losing something in terms of the cost of making a projection, right?
>> Alekh Agarwal: As long as you're picking neighborhoods that are small enough that, in your parallelized optimization, each one is going to reside on one machine, maybe -- and, for instance, in a lot of these image problems your graph is a grid.
So if you select your neighborhood well, then you can make sure that most of the things are still staying on the same machine, and then you can hopefully project onto bigger neighborhoods and not lose too much efficiency, and it certainly is going to speed things up.
>> Dengyong Zhou: No further questions.