>> Lin Xiao: Okay. It's a great pleasure today to have Professor Stephen Boyd from Stanford.
He's a Samsung Professor of Engineering and a professor in the electrical engineering
department. His work on convex optimization and applications are very well known, need no
introduction. He has written several books, very popular, the Convex Optimization, with
Vandenberghe, I think that's very, very popular now, and also his group has produced several
very popular softwares, including CVX. Okay, today, he's going to talk about distributed
optimization, where -- ADMM. Okay, Stephen.
>> Stephen Boyd: Good. Thanks. So we have a small audience, so we should make it super-interactive. Actually, you might be disappointed -- if you know a lot about the most recent
things in optimization, you'll probably be disappointed, just a heads up. But, anyway, feel free to
stop me any time, and otherwise I'll just keep going. So I'm going to get very interactive. So
what I'm going to talk about today is actually based on a longer paper. It's kind of like a survey
paper that you can find on the web and you can find all source code for everything. It's
pedagogical codes, things like that, there. But stop me at any moment. Also to speed me up, if
you're like, yes, please, we know all this, then that's even worse. Okay. So quickly, the goals, so
the goal is you wanted to be able to do arbitrary-scale optimization, so you want to do things like
machine learning with big data sets. You might want to do dynamic optimization on a large-scale network, so you do -- suppose you're doing power scheduling or energy scheduling across a
network, you have -- we'll see examples like this later, but you have a whole bunch of generators
and consumers of electricity or something like that. They're all connected by some network with
tens of thousands of lines. These are all capacitated and have losses and stuff like that, and
before you know it -- and you're scheduling out over time, so in 15-minute increments over a
day. And the next thing you know, you actually have a very large problem, so we'll look at ones
like that later.
Computer vision is another no brainer, so you have HD video coming at you and you're solving
some convex optimization problem involving that, these are big problems. Okay, and the key, of
course, is decentralized optimization. A lot of these things, they wouldn't even fit in the memory
of a single machine, so we'll do something decentralized. And the idea is you have -- we'll call
them everything. I'll interchangeably talk about devices, processors, agents, but in fact these
could be anything. You could also have the sort of lower level. These could be threads, for
example, and in fact, we've done a bunch of implementations where we do all this and there's a
thread pool and that kind of stuff. But it could also be things at a much higher level, too. It
could actually be separate physical devices.
And so the idea is you have a bunch of little optimization problems that are solved. Little could
be anything from something that's trivially solved analytically to something where
there's some serious heavy lifting done, and then relatively small messages are passed and the
hope is that eventually you arrive at the solution of the big problem. By the way, without having
collected all the data in one place, so this is the idea. And this is just to get everybody on the
same page.
Okay, so the outline is this, and we're going to go back in history, and I'll take 10, 15 minutes to
actually literally trace history. So I'll talk about dual decomposition, so that's actually from the
early '60s. Probably you could argue it's earlier than that. These are things Dantzig wrote about
in his first book on linear programming. We'll then get to the '70s, which would be method of
multipliers. Then, from there, we'll move to alternating direction method of multipliers, which is
what I'm going to talk about today. That's actually from the '70s, and it's easily argued that the
ideas are from the '50s. So you came to a talk about a method from the '50s. I mean, there's a
little bit more to say about it, but actually not much, so we'll see. Then, what we're going to do,
and actually the theme of the talk is that this is obviously not a new method. So this is not some
new method that we hatched up last year or something like that. But, actually, I think what's
more interesting about it is that it's so simple and then you're so effective when you know these
methods, that you can piece together solutions, solving methods, algorithms to solve problems,
that are on distributed platforms, all sorts of things, and you can just kind of put these together
like Lego bricks or something.
What you'll end up with will not be the best one for that problem class. In fact, almost -- there's
no exception, where once you have a specific problem and you say, well, I really, really want to
solve that problem, there's going to be a better one. But the idea is, for rapid prototyping, I find
this is actually quite appealing. You just stick the things together and it just works. And so that's
this part. We'll talk about some common patterns. Everything else there, the algorithm I'll talk
about here, it's completely trivial. It's two lines, three lines. It's very, very short. It's not intuitive, but it is, actually. We'll look at that. And then you know a couple of tricks here. Everything else will be completely trivial, right? And then it turns out, this algorithm, you take a few
trivial things, you combine them, and all of a sudden you're quite effective, and we'll look at
some examples. Then we'll come back at the end to talk about distributed, big problems, and
we'll talk about consensus and actually its dual, which is an exchange problem. Okay, so these
are -- and then I'll wrap up.
Okay, so let's start with dual decomposition. The year is, I don't know, 1949 or something like
that, but here's the idea. You want to solve this equality-constrained optimization problem. You
want to minimize f subject to AX=B, and so you form a Lagrangian, and you could make up
a story about this and you could say this might be that a market -- this is the constraint that a
market clears, that supply equals demand or something like that. And then you'd interpret this
dual variable here as a set of prices. And this says go ahead and violate. Use more or less than
you're allowed to or something like that, but you'll be paid for it, or you'll pay for it. So it
depends, of course, on the sign of Yi and the i-th component of the residual. And so you
think of Y as dual variables, prices, something like that.
Now, the dual function says make the best choice of X for the given set of prices, and then that's
a function of the prices, and this thing is actually always a concave function, so it curves down.
And then the dual problem is you choose the prices that makes this dual function as large as
possible, so it's actually the least favorable set of prices, and it turns out the least favorable set -- and then, if all goes well, and there's a bunch of assumptions here -- these are not just little silly
borderline assumptions. These are real ones. If all goes well, then you can recover the optimal
thing by finding an optimal price and then minimizing this thing. And it's kind of interesting,
because there's a big market interpretation of this. So the key is, so you'd say here, if you like,
this is solving the first problem by price discovery. That's what it's doing. If you actually solve
this problem, you're doing price discovery, and this says once you've found the right price, you
just minimize over X and you're going to get the solution. Yes, okay.
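For reference, the objects being described are, in standard notation (the slides are not reproduced in this transcript, so this is a reconstruction):

```latex
\begin{aligned}
&\text{minimize } f(x) \quad \text{subject to } Ax = b \\
&L(x,y) = f(x) + y^{T}(Ax - b) \qquad \text{(Lagrangian; $y$ are the prices)} \\
&g(y) = \inf_x L(x,y) \qquad \text{(dual function, always concave)} \\
&\text{maximize } g(y) \qquad \text{(dual problem: price discovery)} \\
&x^{\star} = \operatorname*{argmin}_x L(x, y^{\star}) \qquad \text{(recover the primal solution from an optimal price $y^{\star}$)}
\end{aligned}
```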
So how might you do price discovery, which just means maximize the dual function? Well,
gradient. So let's just do a gradient or sub-gradient method if it's non-differentiable. So that
works like this. You find the gradient of this thing, and then we're going to step. We want to
maximize, so we step in the positive direction of the gradient. You make a step of length alpha.
It's not length alpha, but step size alpha. And the gradient of this thing turns out to be exactly the
residual, when you actually compute the minimizer of the Lagrangian. So you get dual ascent is
this method here, and actually, the reason you should look at it is because things that look like
this you're going to be staring at for one hour. They all look the same, and they always have a
step that goes like this. You lock down a set of prices and you minimize over X. That's the first
step. And the next step is, you calculate the residual. That's this thing. You want AX=B. This
is the residual, or it's the amount by which the market fails to clear, whatever your application is.
That's the residual. If the residual is zero, stop, because you're optimal. If it's not zero, then it
says, update the prices. So that's a dual update price update, so everything you're going to see is
going to look like this from now on. There's going to be just variations on this theme.
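The two steps being pointed at are, in the same notation (again a reconstruction of the slide):

```latex
\begin{aligned}
x^{k+1} &:= \operatorname*{argmin}_x \, L(x, y^{k}) && \text{(minimize over $x$ with the prices locked down)} \\
y^{k+1} &:= y^{k} + \alpha^{k}\,(A x^{k+1} - b) && \text{(price update along the residual, step size $\alpha^{k}$)}
\end{aligned}
```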
>> Lin Xiao: So this is a practical problem, here you do gradient descent anyway. You might as
well do the gradient on the original problem.
>> Stephen Boyd: Absolutely, right, right. So we haven't gotten to the question, why would you
do this, right? And in fact, it's pretty complicated. We could have done gradient descent on
minimize F(X), and I agree completely. I'm going to answer it on the next slide. I should say,
this method works, but with a lot of strong assumptions, and they're not just like silly little
mathematical things that don't come up in practice. They're quite real.
So, for example, you might want to solve a linear program. This method will fail utterly and
completely. It just won't work at all. These are real conditions. They're not like little analytical
things that don't really happen in practice. But it works. I mean, if these assumptions hold.
Now I'm going to answer your question about why would you make it complicated and run
gradient on the dual or something. Of course, there is one reason you might do it, is because it
sounds more sophisticated, so you could impress your friends and stuff and, say, oh, you're
running -- what are you doing? Well, I'm running gradient ascent on the dual, and it sounds -- so
there is that. That's a real advantage. I think a lot of papers get written for reasons like that.
Okay, but here's a real reason. Here's a real reason. This is the entryway, entry point, to
distributed optimization. This is it. This is the point. Suppose that function F is separable,
meaning X gets chunked up into X1 up to XN. These are blocks of variables. And F is the sum.
Oh, by the way, of course, if I minimize F(X), it's trivially parallelizable, because I just minimize
each one separately. It's a sum. So AX=B, that's linking them. So except for the equality
constraints, the problem would be trivially parallelizable, right? So this is the idea. Well, if you
form the Lagrangian, it's F(X) plus Y-transpose AX minus B. Y transpose AX minus B is an
affine function. It's completely separable. So what it says is the Lagrangian splits into a private
Lagrangian part for each of these Xs, and that says that when you're asked to minimize L over X,
it is actually you're minimizing a separable sum, where these things have nothing to do with each
other, and it's trivially parallelizable. So that is the reason you'd do the dual, because if I have a
big problem where all variables are coupled, if I then do the dual and if I fix the price, that
problem becomes separable, and that's very cool. That's kind of the idea of a lot of markets and
stuff like that. Once I set the price, everybody in this room decides how much they're going to
consume or contribute to the market, independently. So that's the answer. Okay, so this leads us
to dual decomposition.
The year is about 1960, so here it is. Actually, it's nothing here. We've just assumed it's
separable, and the only thing that's happened is that this thing, which is parallel -- this is all done
in parallel. This thing here is minimizing. It's the X minimization step, so the algorithm hasn't
changed at all. We've just simply noted that this step, when you do X minimization, is in
parallel. And then we collect everybody's contributions, we add them up, subtract B. That's our
residual, but zero we quit. Otherwise, we update the price, and you can see, roughly speaking,
it's something -- look, if you want to know the simple explanation, you have a scatter gather for
each step. You scatter the price, everybody does whatever they're going to do. That's this thing
here, and then you gather their contributions. That's these things. And then -- actually, it's not a
scatter gather. It's actually an all reduce, but you had a question.
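A minimal sketch of the structure just described, in Python. The choice of local objectives here (simple quadratics, so each x_i-minimization has a closed form) and all the names and sizes are illustrative assumptions, not from the talk; the point is the parallel x_i-updates followed by a single gathered residual and price update.

```python
import numpy as np

# Dual decomposition sketch: f(x) = sum_i f_i(x_i), with f_i(x_i) = 0.5*||x_i - c_i||^2,
# coupled only through the constraint sum_i A_i x_i = b.
np.random.seed(0)
N, blk, m = 4, 3, 5
A = [np.random.randn(m, blk) for _ in range(N)]   # A = [A_1 ... A_N]
c = [np.random.randn(blk) for _ in range(N)]
b = np.random.randn(m)
y = np.zeros(m)          # prices
alpha = 0.02             # small fixed step size, for illustration only

for k in range(2000):
    # x_i-minimizations, independent given the prices (could run in parallel):
    # argmin_{x_i} 0.5*||x_i - c_i||^2 + y^T A_i x_i  =  c_i - A_i^T y
    x = [c[i] - A[i].T @ y for i in range(N)]
    # gather everyone's contribution and form the residual
    r = sum(A[i] @ x[i] for i in range(N)) - b
    if np.linalg.norm(r) < 1e-8:
        break            # market cleared: stop
    y = y + alpha * r    # broadcast updated prices
```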
>>: I forgot to ask a question earlier.
>> Stephen Boyd: Sure.
>>: How careful do you have to be in choosing the alphas?
>> Stephen Boyd: We'll get to that.
>>: Okay.
>> Stephen Boyd: So the short answer is, for these algorithms, actually, these are quite slow and
finicky, and it depends on whether you want to do theory or practice. If you want to do theory,
alpha-K = 1/K just always works, period. Period.
>>: That's super slow.
>> Stephen Boyd: That's super slow, exactly. So we'll come back to your question several
times, and I'll give completely different answers, like there's an easy one like 1/K, so we'll get
back to that. Okay. Okay. So this works. And it's kind of cool. And if someone says, what's
happening here, you'd say, well, everyone here is minimizing their own function, and what's
happening is, I then collect everybody's ideas, Xi, I see if we've cleared the market or whatever
this means. If yes, we tell everybody to stop. Otherwise, we update the price vector and we
broadcast that out and everyone says, okay, guess what, the prices of all the commodities have
changed. Now what would you do? So it's very intuitive, the algorithm. Okay. All right, by the
way, that works, but with lots of assumptions, and real assumptions that would keep you from
doing and solving problems you really want to solve. Okay, method of multipliers. So now
we're going up to 1973 or something -- well, no, '70s, '80s. So what we'll do is we'll look at the
augmented Lagrangian. So you take the function, plus this is the usual Lagrangian, right? This
has the economic interpretation or something, which is violate the constraint but pay for it, or
you get a subsidy. If the sign is negative, it's a subsidy. Then you add a term here that's always
nonnegative. That's not something where you get a subsidy. It's always a penalty, and it's just a
penalty for deviating here, so that's the idea. And it turns out this does a lot of things -- and
there's a lot of theory about it, all of which I'm skipping, but this does things like the dual
problem now becomes actually smooth and has a Lipschitz constant, so if this means anything to
you, that's great. I mean, so this regularizes the dual.
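The augmented Lagrangian being referred to is, in the notation used above (reconstructed):

```latex
L_{\rho}(x, y) = f(x) + y^{T}(Ax - b) + \frac{\rho}{2}\,\|Ax - b\|_2^{2}, \qquad \rho > 0.
```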
>>: [Indiscernible] converting of [indiscernible] functions.
>> Stephen Boyd: That's exactly what it is. Yes. Actually, in the end, I'm beginning to realize,
all of these methods are the same, including the ones that are being published right now. Right?
I mean, maybe there's an epsilon variation, but it turns out, all of a sudden, everything starts
looking the same, because everything is the same in some way. Yes, but absolutely. Okay, so
the method of multipliers -- by the way, if you like, you can also think of this as the Lagrangian
to a modified problem. So the modified problem is, instead of minimizing F(X) subject to
AX=B, you minimize F(X) plus rho over two times this thing, subject to AX=B. And you look
at that, and you say, well, that's a dumb problem. Clearly, it's equivalent, because if you're
feasible, AX=B, the term you've added to the objective is zero, and you'd think, well, this doesn't
seem like a good way to go. Anyway, so you see what I'm saying. But it does have this effect
on the dual. We'll see what the difference is. What?
>>: You affect the rho [indiscernible].
>> Stephen Boyd: No, it's not. No, we'll talk about that. Well, we'll have to define what find
means, but yes, okay, we'll get to that. So you asked about the alphas. It turns out, once you do
this, the dual has a Lipschitz constant and you can work out various things, and it turns out that
there's no question about what the alpha should be. It's just rho. Now, this is actually a
beautiful -- this is method of multipliers, right here. It is relatively useless as an algorithm, for
various reasons we'll look at. It is, however, still traditionally taught, I think out of tradition and
things like that. I mean, let's think about what would be required here. It would have to be a
problem where, for some reason, it's easy to minimize F plus this quadratic, but for some reason
it was very hard to minimize F subject to an equality constraint or something. You see what I'm
saying here? And in fact, it's not surprising that they're -- but it's interesting. Okay. So,
actually, we can even explain why the alpha is rho, and it actually makes clear how these
methods work, so it looks like this. To minimize F(X) subject to AX=B -- these are the
optimality conditions. You have to be primal feasible, and you have to have -- I mean, this is
Lagrange multipliers that I just did, so that's it. This is the dual residual, primal residual. Okay.
Now, in the method of multipliers, here's what happens. X(k+1) is minimizing this thing, but
that's the gradient of this is just -- it's the gradient of F, and then the gradient of that other term,
plus the Y. And you write it this way, you stare at it for a while and you say, oh, hey, look at
this. This is Y(k+1) if rho is the step length. And then you realize -- you say, well, the gradient
of this is zero. That's how you choose X(k+1), is to minimize this thing so the gradient is zero.
By the way, I'm doing it for the case where it's differentiable. It's the same if it's non-differentiable, right? So what happens then, what it says is if you use this update, so if that's
your step size, then it says that -- every step, you're producing a primal and a dual pair. Actually,
you're producing a primal point and a price vector. It says, every time you're doing that, it turns
out this is always true. So you have two optimality conditions, primal and dual, and it says the
dual one is satisfied exactly, every step. And so you're just waiting for the primal one to come
in, to converge to zero. Okay, so it kind of makes sense. It's nice. This is the idea. This theme
we'll come back to, as well. Okay, about the method. Here's the good news. By adding that
quadratic regularization -- oh, by the way, modern versions of that -- adding a quadratic is very unsophisticated. Of course, you'd add some kind of Bregman divergence or something, and
then you'd have some entropy-like thing, and it would look more sophisticated. And then you
could actually honestly say this wasn't done in 1973. Everybody see what I'm saying here? You
could impress your friends and things like that with some obscure thing and then argue that that
was better. I just want to mention that. I'm sticking with the conventional quadratic. Actually,
if you get the quadratic, you can go back and replace this with a Bregman -- okay, fine. Here's
the good news. Adding that regularization makes this thing go from a very fragile algorithm that
doesn't work for many, many problems into one that is absolutely rock solid. I don't know of
anything more solid. It's just this.
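So the method of multipliers iteration he has just described is, with the augmented Lagrangian as above (a reconstruction):

```latex
\begin{aligned}
x^{k+1} &:= \operatorname*{argmin}_x \, L_{\rho}(x, y^{k}) \\
y^{k+1} &:= y^{k} + \rho\,(A x^{k+1} - b) \qquad \text{(the step size is exactly $\rho$, so the dual optimality condition holds at every iterate)}
\end{aligned}
```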
>>: So what's the [intuition] behind this [indiscernible]?
>> Stephen Boyd: There is a reason for it, because it actually -- it's actually from monotone
operators. We'll talk about that later, maybe, if we have time. Monotone operator theory tells you that it's got some. Okay, so basically what happens is, those functions, they can be
non-differentiable. That's number one. They can take on the value plus infinity. In fact, most of
the interesting cases, F will take on the value plus infinity. That's how we're going to encode constraints. In fact, F can be a function that looks like this. It could be plus infinity almost everywhere, except like on one hyperplane or something like that, where it's zero. So,
by the way, when you're doing optimization, usually you start to say, oh, it has to be
differentiable. The gradient should be Lipschitz, I have a Lipschitz constant L, all this kind of
stuff. This just takes you to the most general function you could ever possibly want, and it just
works. It works.
Now, the bad news is, what happens is that splitting is now destroyed, because everything's
coupled by this quadratic, and the whole point was to do distributed optimization. And now -- yes?
>>: This is a comment. So you said that I can stick all the constraints inside the objective,
because you can take the value to infinity.
>> Stephen Boyd: Absolutely.
>>: But you still are pulling out this AX=B as kind of a --
>> Stephen Boyd: That's right. That's right.
>>: But now that you wouldn't have a Lagrangian or anything, so why is that a special
constraint?
>> Stephen Boyd: Oh, no, no, it's a great one, because what you do is -- suppose I have a
problem where it's separable, F is separable, and the constraints are separable. Then the
separable constraints, I'd shove them into the F and it's still separable. AX=B is the coupling
constraint. So it's a great question. That's why you'd do it. You could put the AX=B into the
constraint, solve it in one step, and you would have a very complicated way to say here's how
you should solve your problem. Step number one, solve your problem. That's what would
happen. Actually, it would be even worse than that, because you'd have an extra term in there.
So that's why.
>>: The method of multiplier sounds like you're adding essentially the integral of your
constraint as an additional optimization term.
>> Stephen Boyd: You're adding?
>>: I mean, to get the AX -- that additional term, that's essentially the integral of AX=B.
>> Stephen Boyd: Yes. Thank you, yes. We're going to talk about that. Very good. Thank
you, thank you, thank you. We're going to talk about it later, I could say it now. You're exactly
right. Yes, it's this. This, that's an error. Your job here is to -- you're doing price discovery, so
you want to manipulate Y until that is zero, right? So if you look at this formula, this says that
Yk is the running sum of the errors, multiplied by rho, right? So if you know about control,
that's integral control. That's been used since 1880 or something like that, right? So was this
your point?
>>: I don't know integral control. I was just looking at the augmented Lagrangian. The last
term is the integral of the second term.
>> Stephen Boyd: Well, it's not, actually. Not quite, because of the Y here. Now, by the way, I
could put this linear term inside there, and we could make it just a quadratic. Actually, yes, it's
the integral. Here, it's the integral over iterations. Okay. So the question I thought you asked
was an excellent one. Yours was also quite good. Okay. Now, we're going to do -- now, finally,
we're up to alternating direction method of multipliers. So the idea is, you want something that
combines the rock-solid properties of method of multipliers with the ability to do decomposition,
because that's why we want to solve giant problems, right? So that's the idea. It's explicit in the
'70s and you can even -- it was later found to be equivalent to operator-splitting methods that
were developed in the '50s and stuff. Not exactly in the same form, so this is not from NIPS two
years ago or something. Okay, so here it is, and it's actually very straightforward. Here it is.
We're going to split X into just two things -- oh, sorry. The variable formerly known as X is now
X,Z, so we have two primal variables, X and Z, and we'll just say it splits across two. By the
way, in the Soviet literature, they split it into K things. We'll find that, by the way, splitting into
two is good enough. You don't need to split it into K. You'll see why two is good enough.
Okay, and then you have a completely general equality constraint linking them, okay? So there's
the augmented Lagrangian. Here are the steps. It is so dumb, it's this. The method of
multipliers, by the way, would do the following. You would minimize this jointly over X and Z,
and then you'd do a price update. Instead, you do one Gauss-Seidel sweep, so you'd do this.
You minimize over X, you get the new X. This is -- the price vector's locked down. On the next
step, you minimize over Z, but you use the new value of X. This is a Gauss-Seidel sweep. Then,
at the end, you calculate -- well, that's the residual, and you -- then this is the running sum of the
integral, or that's the integral control term, if you know about control. Okay? That's it. By the
way, if you were to -- this suggests various things, right? If you had time, you could do this
three times or twice or 17 times, or something like that. And, actually, if you did it enough
times, that would be Gauss-Seidel, and it would be back to the method of multipliers.
>>: We were back in 1600 when [indiscernible].
>> Stephen Boyd: Sure. The hint is, you're using names like Lagrange, so that's your other hint
that these are not from last year, these methods. Okay, so that's it. By the way, in some sense,
the talk is over now. It's over. That's the algorithm, right? So it just shows, you broadcast a
price vector, in one iteration, optimize over one group of variables X, optimize over Z, collect all
the parts, see if the residual is 0, if it is, quit, because you're done. Otherwise, update the price
vector and do it all again. That's it. Okay. All right, I already said all this.
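In other words, with the splitting minimize f(x) + g(z) subject to Ax + Bz = c, the augmented Lagrangian and the three steps just described are, as a reconstruction of the slide:

```latex
\begin{aligned}
L_{\rho}(x,z,y) &= f(x) + g(z) + y^{T}(Ax + Bz - c) + \tfrac{\rho}{2}\,\|Ax + Bz - c\|_2^{2} \\
x^{k+1} &:= \operatorname*{argmin}_x \, L_{\rho}(x, z^{k}, y^{k}) && \text{($z$ and the prices locked down)} \\
z^{k+1} &:= \operatorname*{argmin}_z \, L_{\rho}(x^{k+1}, z, y^{k}) && \text{(one Gauss-Seidel sweep: uses the new $x$)} \\
y^{k+1} &:= y^{k} + \rho\,(A x^{k+1} + B z^{k+1} - c) && \text{(price update along the residual)}
\end{aligned}
```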
>>: But I thought from that explanation, that rho would be very important. If your rho isn't chosen reasonably well --
>> Stephen Boyd: We'll talk about that.
>>: Then you don't get good price prediction in how good you --
>> Stephen Boyd: We'll talk about it. We'll get there. Don't worry.
>>: [Indiscernible] what you need to send?
>> Stephen Boyd: What do I need to what?
>>: To send from X to Z?
>> Stephen Boyd: What do I need to send from X to Z? The value of X has to be known. So let's
look at that. Let's actually look at what's involved in this step. If you minimize this over X,
that's a constant and it's totally irrelevant, so you don't need to even know what G -- so this step,
let's suppose that's done in one process, just for fun. It doesn't even have to know what G is. It's
totally irrelevant, because it minimizes F plus this thing plus this thing, and notice that in that
step, that's a constant, period. So, in fact, you can have two separate processors, one handling F
and G, and neither knows what the other is. They don't have to know it. Now, you do have to
exchange the values, right? Like, if I minimize over X, I have to send a message to you, which is
this is X, or we can get it from shared memory or something like that. When you minimize over
Z, this thing, that's totally irrelevant.
>>: Is the term Z for the second quadratic thing? It changes your bias?
>> Stephen Boyd: No, look, what happens is this. At each step, suppose I'm going to do the X
step. Then I just -- what triggers me to do that is I just got the value of Z. So when I calculate -- in fact, I don't even need to know the value of Z. I just need BZ. So I need BZ. If I get the
value of BZ, that's all I need, right? Then that's just some vector. B could be gigantic, but BZ is
a vector, and sort of one assumption -- I made a simple assumption, which is reasonable, which
is that we can pass around vectors. We can't pass around matrices. That's quite reasonable. So
you get BZ. Once you know BZ, you can do this. You don't even have to know what G is, and it
flips. When you're doing the Z minimization, you don't even have to know what F is. All you
need to know is, in fact, specifically, the vector AX. So, in fact, thank you. That's a better -- by
the way, very often A and B are like I and minus-I and stuff like that, but in fact, now what I'm
saying is actually precise and answers your question. Actually, the other one did answer your
question, but this is a better answer. So that's what you need. You need to pass AX and BZ.
You had a question.
>>: Yes, so that's great, but the updates to X and Z and Y are still synchronous, right?
>> Stephen Boyd: That's true, so I'm going to talk about that later. So there's actually been
some stuff done on asynchronous versions of this. First of all, Google did it and reported that it
works. I don't know what works means in that case, but works in practice, and there actually is
some theoretical support for it. I know nothing about that.
>>: [Indiscernible].
>> Stephen Boyd: No, certainly not. And actually, the reason it's not trivial to do -- I mean,
normally, if you have like a gradient minimization or something like that, asynchronous is fine,
because every time you do something, you're reducing something like the distance to the
optimum or something like that. This one has a state. Y is a state, a state variable here, and so
there's dynamics in this algorithm, and then asynchronous is not at all obvious. Okay, but
throughout here I'm going to assume, yes, it's synchronous.
>>: [Indiscernible] allow us to achieve the same minimum as if you did it non-separably?
>> Stephen Boyd: These questions are almost at the level where I should be paying people,
because that's I think the next slide or something like that. Or it's two slides over. Here. So yes.
Let's get that over with now. So here's the convergence. Ready for the convergence theory of
this? It just works. Here that's the convergence theory. If there's a solution, it solves the
problem, period, just period. No assumptions on A, B, anything. If there's a solution, this
converges to it, period.
Now, this is stated very carefully here. These are true statements. This thing goes to zero.
That's the primal residual, and the objective value goes to the optimum. Now, I'll tell you some
things that are false. Here would be one. X converges. Z converges. X converges to an optimal
point -- well, if the first one's false, this is stronger. Those are false.
>>: Is it rare or is it regular?
>> Stephen Boyd: Actually, I'm going to make a stronger statement than that in a minute. And
I'll tell you one that is true. Y, the dual variable, does converge. That does converge, always.
Now, I'm going to make what -- it's not radical, but it's true. By the way, if you want X to
converge, you have to have more assumptions. You have to start saying, well, A -- you have to
start talking about ranks and intersections of ranges and null spaces and crap like that. Okay.
But I'm going to make a stronger statement, a more provocative one. It's this. That doesn't
matter. Totally irrelevant in practice. People get very upset when you say that, like, excuse me,
you're going to run an optimization method and you even know there are counterexamples where
that variable does not converge? You see what I'm saying? Let alone to an optimal point -- everyone kind of got this? And the answer is absolutely, because it doesn't matter. Here is what
actually matters if you run an algorithm like this. You will stop when the residual is small. That
is guaranteed to happen. And you will stop when you are near optimal in objective value, and
that is guaranteed to happen. Whether or not XK is converging to something is completely
irrelevant. Now, what's obviously happening is if XK is going to rotate around and drift, it's
obviously drifting around in the optimal set. The optimal set is often -- if it's a singleton, then of
course this implies it.
>> Lin Xiao: If F and G have a little bit -- it's not completely collapsed.
>> Stephen Boyd: Until we -- ask that later. That's a great question, but for now, until -- at the
very beginning, I had an assumption, and the scope of that assumption is these are convex. Until
we hit that right bracket, that assumption is in force, that scope of that. I will talk about that.
That's a super-interesting question, so remind me later what happens when F and G are non-convex. That's absolutely fascinating. So if you forget, somebody else should ask.
>>: One more question.
>> Stephen Boyd: Yes.
>>: What is the guarantee we will hit the global maximum or minimum.
>> Stephen Boyd: It's absolute. Here, it will work perfectly -- no, exactly in this sense. Do you
mean hit it? Oh, then it's zero.
>>: Each time to be the global extremal for F(X) and --
>> Stephen Boyd: That's right, it is. So this says you get arbitrarily close. I could tell you the
probability that you're going to hit it, zero, but that's completely irrelevant and stupid. That's not
a real question.
>>: [Indiscernible] in those functions which is well known [indiscernible]. It stops out of the
global extremal, even if you have --
>> Stephen Boyd: Yes, this doesn't get the global extreme. I'm saying exactly what I'm saying.
I'm saying the residual gets small and you get arbitrarily close to the optimal value. I'm not
saying anything more. And in fact, I would argue you cannot, by definition, care about anything
else. If you do, you're making it up. If you care about something else, you better state it in the
optimization problem, because the semantics of an optimization problem, when I walk up to you
on the street and say, please minimize F subject to these constraints, okay, then here is the
contract when you solve it. The contract is you return to me something that satisfies the
constraints within some tolerance and is within some tolerance of the optimum. And if I say, oh,
I don't like your point. It doesn't satisfy this or that, that is nonsense. That's nonsense. That's a
different problem, but it's not an optimization problem. Yes.
>>: Do X and G just at random?
>> Stephen Boyd: Oh, that's a great question, too. We're going to get to that. This is going to
be fun. You had a question.
>>: About speed.
>> Stephen Boyd: That's a good one, too. So the question is, how does it work and, of course,
everyone now is looking at optimal order methods and is this 1/T, 1 over square root of T, these
kinds of things. This has the worst complexity that there is. So, on the other hand, it has to,
because it's super-duper general. Come on, it solves -- it handles problems that go to infinity and
stuff like that. I mean, this is crazy. There's no Lipschitz constants. If you [get] through
something and don't find a Lipschitz constant, then this is the best thing you can get. But, in fact,
the convergence rate according to the theory is the worst you can possibly get. That was your
question, right? Okay, good. This is great. Okay.
We skipped a couple slides I think maybe I'll go over quickly just to say what it was. Here it is.
It turns out there's the analog of why rho works, and it's actually kind of cool. The optimality conditions of this problem are primal feasibility and a dual condition -- there's really one, but it splits into one
with respect to X and with respect to Z. This is with respect to X, and that's with respect to Z. I
mean, you stack them, and it's really two. So you have one primal residual, two duals. Okay?
Oh, and note that the two dual residuals can be evaluated on separate processors if we imagine an
X and Z separated, right? So if you minimize this, it's the same calculation. You go through
here, and it turns out that what happens is, if you run ADMM with this step length, then this
triplet here -- these are primal, primal, and then the dual variable -- that these things, you
get one of the dual conditions for free every step. And now you're waiting for a primal error to
go to zero and a dual. Okay? I'm just saying that the choice of the step length in the maximizing
the dual function is not arbitrary. Okay.
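The three optimality conditions being referred to are one primal and two dual residual conditions; in the same notation (a reconstruction):

```latex
Ax^{\star} + Bz^{\star} - c = 0, \qquad
0 \in \partial f(x^{\star}) + A^{T} y^{\star}, \qquad
0 \in \partial g(z^{\star}) + B^{T} y^{\star}.
```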
One thing to mention here is, you get shorter formulas if you do the following. This is a linear
function here. This is actually related to your integral question, and that's a quadratic. But I can
shove this linear thing into the quadratic by completion of squares or whatever you want to call
it, and it says that I can write this this way, like that. And then, in fact, U is nothing but the
scaled dual, and then the algorithm looks like this, so it looks like this. You minimize F plus a
quadratic, then you minimize G plus a quadratic, and then you do the update. Here, there's not even a rho. Rho is like one. Okay? You can think of them different ways. Up here, you
should think of that as a price vector or dual vector, and if you're running a problem where the
price vector has an important meaning, like you're doing advertising or you're doing energy stuff
and these are the locational marginal prices or something like this, then you might want to -- I
mean, it's trivial. It's just this change, but this is a nice way to think of it. So this is sort of the
economic term. This is the market term. And that's this weird term that penalizes going away
from the market for everybody. This one is actually quite beautiful. Negative U you can think
of as a target for the residual. So what is coordinating everything in this case is you're moving
U, and the fact is, mechanically, you can think of that as a spring, and so you're moving the point
U, and you have a spring tied to it, and you move that, and that spring pulls X and also Z. It will
pull it so this thing is zero. You're going to see formulas like this. We go back and forth, don't
even say anything about it. Yes.
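The scaled form just described -- the same algorithm written with the scaled dual variable u = y/rho -- is, as a reconstruction:

```latex
\begin{aligned}
x^{k+1} &:= \operatorname*{argmin}_x \Big( f(x) + \tfrac{\rho}{2}\,\|Ax + Bz^{k} - c + u^{k}\|_2^{2} \Big) \\
z^{k+1} &:= \operatorname*{argmin}_z \Big( g(z) + \tfrac{\rho}{2}\,\|Ax^{k+1} + Bz - c + u^{k}\|_2^{2} \Big) \\
u^{k+1} &:= u^{k} + A x^{k+1} + B z^{k+1} - c \qquad \text{(so $-u$ acts as a running target for the residual)}
\end{aligned}
```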
>>: [Indiscernible].
>> Stephen Boyd: No, we're going to get to that. You're asking questions -- it's not causal, but
push that onto a stack and we will pop it later. So don't worry. The theory says here's how you
pick rho. Ready? Positive. There, that's done for the theory, so that's over. By the way, it says in theory that works, for any positive rho. Okay, so there's that answer. You can imagine there's
going to be another answer. Okay, so I'm actually not going to go over this, because it gets kind
of boring, but it's related to gazillions of things. Actually, what's very interesting is it turned out
only in the '80s, late '80s, was it recognized that ADMM is identical to something called
Douglas-Rachford splitting, and that's something that goes back to like the '50s, applied to PDEs.
It's actually kind of weird. It also, it turns out, is the same as something called progressive
hedging. It's identical to something from the '80s. By the way, you might imagine, these are
great people, why would they make a new algorithm up and then later would it be found to be the
same as this older one? And the algorithm's just complicated enough, you rewrite it a different
way and it has a different form, and it's actually much harder than you think to actually show this
algorithm is isomorphic to that one. It's not as simple as you think. Okay, I won't say much
more about that. Okay. Actually, now, we're going to enter a part of the talk that it's quite -- so
here's -- I'll tell you, here's the program. The program is this, you know what ADMM is now, so
the talk is over, but I'm just going to show you some simple things about what you have to do.
And notice what you have to do. You have to implement exactly one method. You take a
function F or G or whatever it is, and you have to actually minimize F plus a quadratic. There's
no other method. Also, actually, kind of interesting to point out, this method is a higher-level
method, I would say, than methods that -- than ordinary optimization. In ordinary optimization,
what do you have to implement if you want to minimize something with F -- with F, right? The
answer is, you have to evaluate a gradient or a sub-gradient, maybe a Hessian, this kind of stuff,
right? This is higher order. The only method we will ever use to access F is going to be -- well,
we'll give it a name in a minute. It's going to be F-dot augmented quadratic minimization. It's
going to have a name in a minute. Everybody see this? By the way, that's going to be something
that we can do analytically. It could be something where there's actually numerical computation.
It could even be heavy lifting. So minimizing that quadratic could actually fire up a primal-dual interior-point solver and use 64 gigs of RAM. We'll see. We'll look at all sorts of stuff. Okay,
so the idea is this. When you minimize over X, you have to minimize F plus -- and, by the way,
that's the only thing you have to do. This will have all sorts of interesting implications if you do
machine learning and stuff like that. It'll be very cool. And so special cases of this come up
often, and then these are worth -- everything I'm going to say is going to be completely trivial.
Here's one. If F is block separable and A separates conformably, then that splits into N parallel
updates. That's the reason we're doing all this, right, is so that we can do this -- sorry, I haven't
said that yet. Sorry. So you can do this minimization in parallel. That is the point. Okay. Now,
a special case that comes up all the time is A=I. That says, minimize F plus a quadratic deviation
from V, a given point V. That's the argument. That's the proximal operator. You could say a
ton about that. It's a very cool thing. It comes up all the time. It's actually quite beautiful. It
just says basically, somehow, minimize F, plus don't be very far from V. That's all it says. Here
are some special cases. F could be the indicator function of a set, of a convex set. That means
it's either zero on the set or plus infinity. By the way, these are not the kind of functions that
play nice with optimization. Here's a function that's zero on a set and then jumps to plus infinity.
These are the kind of things that would ruin a gradient method, anything conventional, right?
Okay, so here, if this is zero or infinity, if it's zero, if you're in C, then it says minimize this thing.
If you're out of C, it's plus infinity, and if you're minimizing, plus infinity is the worst thing there
can be. So what it says is, please minimize this thing, this thing here, the distance squared to V,
over points in the set. That's projection. So this thing, if F is an indicator function, this is
projection, so everything is going to start looking very familiar. Well, suppose this was like an
L1 norm, right?
Well, then you can just analytically work this out, and it's soft thresholding, right? It basically
says subtract a number from it, like lambda over rho or whatever, and then, if that changes sign,
make it zero, so it's a function that looks like this I guess for you. Well, here, I'll do this way. It
looks like that and is flat and that. I guess in EE you'd call it a dead zone or something like that.
Okay, and there's lots of others. There's long literature of these things, and they're not hard.
Okay.
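Two of the proximal operators just mentioned, written out as a small Python sketch. The function names and the box constraint are illustrative choices, not from the talk; the prox here means argmin_x f(x) + (rho/2)||x - v||^2.

```python
import numpy as np

def prox_box(v, lo, hi):
    """f = indicator of the box [lo, hi]^n: the prox is just projection (clipping)."""
    return np.clip(v, lo, hi)

def prox_l1(v, lam, rho):
    """f = lam * ||x||_1: the prox is soft thresholding with threshold lam/rho."""
    k = lam / rho
    return np.sign(v) * np.maximum(np.abs(v) - k, 0.0)
```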
Here's another obvious one. Suppose the function is quadratic, and then you are asked to
minimize a quadratic plus a quadratic. Well, a quadratic plus a quadratic is a quadratic. How do
you minimize a quadratic? It's called linear algebra. Okay, so that's nice, but there's some really
cool things you can do. This is an important one. This is this. Suppose I'm going to use a direct
method -- I'm going to factor a matrix like this. If I'm going to factor that matrix, then you use
the following thing, which if you know about numeric linear algebra is kind of cool, but it's this -- how much does it cost to solve one set of equations with -- the first time you do this, you have
to factor this. You cache the factorization, and every time after that, the cost went down by a
factor of the smaller dimension in A. So this is actually extremely significant, right? Because it
just says, if you want to solve 5,000 by 5,000 equations, the first time, it's going to take you, I
don't know, whatever it takes you. That would be -- I made that up. No, more. It would be a
cube. It would take you 30 seconds on some machine. So it would take you 30 seconds. After
that, you get a discount by a factor of 5,000. Everyone see this? Actually, this changes a lot of
things, because when the complexity police come around and say, that's so unsophisticated.
That's a 1/T method. Oh, my God, oh, we've moved on to accelerated methods -- we all know
these people. What's relevant to point out is that, in a situation like that, the cost of an iteration is
complete nonsense, because the first iteration costs you 5,000 times as much as subsequent ones,
and what that says is, you don't even have to worry. As long as you're under 5,000 iterations,
there's nothing to say, right? And, by the way, the first iteration is solving a least-squares
problem. We'll get to these. Yes?
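A sketch of the factor-once trick, assuming the x-update is least squares plus the quadratic term, i.e. solving (A^T A + rho I) x = A^T b + rho v each iteration. The sizes and names are made up for illustration; the point is that the factorization is paid once and every later iteration is a cheap pair of triangular solves.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

m, n, rho = 300, 100, 1.0
A = np.random.randn(m, n)
b = np.random.randn(m)

chol = cho_factor(A.T @ A + rho * np.eye(n))   # expensive: done once and cached
Atb = A.T @ b

def x_update(v):
    # argmin_x 0.5*||Ax - b||^2 + (rho/2)*||x - v||^2, reusing the cached factorization
    return cho_solve(chol, Atb + rho * v)
```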
>>: So some problems have so much filling you don't ever want to do a full optimization.
>> Stephen Boyd: Oh, absolutely.
>>: If you compute a reasonable precondition, but then your update step is an approximate
update.
>> Stephen Boyd: Okay, so that's a good question. So the first question is approximate updates.
ADMM works with approximate updates. There, that's the summary. So that's that, number one.
Number two, iterative methods. If you use an iterative method, what you do is you warm start it
from the previous value, or from V, actually. You warm start it from the last point, and then a
beautiful thing happens. What happens is, when you run CG -- and also, you can put in the fact
that you don't have to solve it exactly. You might do 50 steps, 50, 50, 50, then start doing 30, 30,
30, 10, 20, then at the end you're going two, two, two, two, two. These are CG steps, for obvious
reasons, because you're converging. And so what happens is, once again, the number of
iterations has nothing to do with the wall clock time. If you draw the iterations, you say, oh,
that's very slow convergence, but then you take into account that the final ones are costing you two
CG steps and the first ones were costing you 50.
>>: [Indiscernible] the progress you make on the F [indiscernible].
>> Stephen Boyd: That's a great question. So there are some -- to be fully honest, the results
that tell you that it works with approximate minimization basically say that the sum of the suboptimalities has to be summable. So you'd have to know this. You could do that in CG. My
guess is that you wouldn't, so I can tell you what people do. I mean, this is street-fighting stuff,
and it's probably correct, but honestly, I don't know that it is. Street-fighting stuff is this. You
ready? There's a very sophisticated algorithm for doing this. I will do no more than 10 CG
steps, period. That's it. It just basically says, in minimizing this quadratic, I'm making 10 good
steps every time. The whole thing is a slow algorithm, so it works. And things like that appear
to work, but I'm actually not aware that anyone has actually proved that or that people are
running good, honest methods for doing this, so that's a great question.
>>: Can you go back to the previous slide? I have a question.
>> Stephen Boyd: I can.
>>: So this question, are you aware of any kind of function of F that will give you the -- rather
than the dead zone, might give you [indiscernible] functions.
>> Stephen Boyd: Sure. You can back that up. I don't know. We'd have to work it out. Of
course you can.
>>: So this one can always go back. When you give any kind of nonlinear function or whatever
you want --
>> Stephen Boyd: No. It would have to be monotone increasing, and later we can talk. It has to
have some properties, right? So we'll talk about that. Okay. Okay, quadratic. These are tricks,
right? Here's another one. Suppose F is smooth. Then it says, well, you use your standard
methods. If it's small enough, use Newton. Otherwise, you'd use things like limited-memory BFGS, and now we're out there solving problems with 10 million variables with no
problem at all. Actually, if you do either CG type things, if you do iterative stuff for solving
linear equations, or over here, you're minimizing a smooth function, that extra term we added,
which is that extra quadratic, should bring warmth to your heart. Because when you do
something like LBFGS or CG, what you're worried about is things like condition numbers.
Actually, what you're really worried about is minimum curvature. So if you run BFGS on some
logistic thing without any regularization, it's not even going to work. I mean, it would work in
theory, but it's silly. Everybody see what I'm saying? You add quadratic regularization and that
tames these things right up. So it's not just that you're minimizing something smooth. You're
minimizing something smooth plus a quadratic, and people who do iterative methods of any kind
for smooth things, this makes them super happy, and for good reasons. Everybody got this? So,
okay -- all right. Now, all we're going to do is we're going to assemble. Now, we just do
assembly -- and this will show you how you represent things, and it's hilarious how simple it is.
So I claim you know everything you need to know now. Everything. Now, watch this.
Let's start with just silly little stuff, so we're going to do this. We're going to minimize a
function, subject to it being in a constraint set, and so here's how you write it for ADMM. And
these splittings typically are ridiculous. This is a good example, so here it is. What I'm going to
do is, let's suppose F is smooth or something. I don't know, let's make it smooth. Let's make it
logistic loss, so it's smooth, and this is some constraint set. I don't know, it doesn't matter. So
what happens is, you rewrite it this way, F(X) + G(Z), subject to X minus Z -- this is just X=Z.
If someone walked up to you and said, please help me solve this problem, you would say, like,
what are you talking about? That's the same as that, and why would you write it this way? It's
ridiculous.
Oh, by the way, the name from this is, you would say in going from here to here, you have
replicated the variable X, and that's called a consistency constraint. So you cloned X to a new
variable called Z, and you added a constraint that says it's a new version but it has to be the
same. We're going to get to this in a more general context. Everybody got this? But what's cool
is this. You will only handle F or G at one time. Handling G actually is a projection. Handling
F, let's imagine F as smooth. It's actually doing a smooth minimization. So, for example, if you
want to minimize a smooth function over a polyhedron or whatever it is, over some box, you've
got a beautiful method for it now. You run limited-memory BFGS here with warm start, and
down here you clip the entries, and you have a method that's going to work and work really well.
Okay? This will solve LPs, QPs, everything, just immediately. You just throw in the parts.
Okay.
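A hedged Python sketch of this splitting: minimize a smooth f over a box, written as f(x) + g(z) with x = z and g the box indicator, so the x-update is a warm-started smooth minimization and the z-update is clipping. The particular f, the box, and all names are illustrative assumptions, not from the talk.

```python
import numpy as np
from scipy.optimize import minimize

np.random.seed(0)
n = 20
c = np.random.randn(n)
f = lambda x: np.sum(np.logaddexp(0.0, x)) - c @ x   # a smooth convex f (logistic-like)
rho, lo, hi = 1.0, 0.0, 1.0

x = np.zeros(n)
z = np.zeros(n)
u = np.zeros(n)
for k in range(100):
    # x-update: minimize f(x) + (rho/2)||x - z + u||^2, warm-started at the previous x
    obj = lambda w: f(w) + 0.5 * rho * np.sum((w - z + u) ** 2)
    x = minimize(obj, x, method="L-BFGS-B").x
    # z-update: projection of x + u onto the box
    z = np.clip(x + u, lo, hi)
    # scaled dual update
    u = u + x - z
```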
>>: So a question here. Earlier, I felt that just by looking at that part of the formulation, X and Z
are symmetric. You can do the order --
>> Stephen Boyd: Oh, yes, yes, yes. Absolutely.
>>: But in this particular case, you won't be able to do that.
>> Stephen Boyd: No, no, no, you could re-label F and G in ADMM, and you'll actually get a
different algorithm. It's not quite the same.
>>: Exactly. Which one is easier? Which one is better in that case?
>> Stephen Boyd: No one knows. No one knows. They typically perform about the same, but
they're not the same algorithm and they're different. And there's even a form of ADMM that's
symmetric, and so it's more aesthetic. So I'll tell you what that is. That goes like this. You do
an X minimization and then you update the dual variable by half the amount. Then you do a Z
minimization and you update by half amount, and the aesthetics of that I think we all agree, that's
much prettier. And now, by the way, if I switch X and Z, it is isomorphic, because the odd steps
become equal to the even steps. But in this one, you switch F and G, and you could do it right
here, and you'd get a different algorithm. It would also involve exactly one projection per step
and one smooth minimization. And it would be a different algorithm. Generally, they work
about the same, but in fact, no one really knows, as far as I know. Yes?
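One plausible reading of the symmetric variant he describes, with half-sized dual updates after each of the two minimizations; this is a reconstruction, not a formula shown in the talk:

```latex
\begin{aligned}
x^{k+1} &:= \operatorname*{argmin}_x \, L_{\rho}(x, z^{k}, y^{k}) \\
y^{k+1/2} &:= y^{k} + \tfrac{\rho}{2}\,(A x^{k+1} + B z^{k} - c) \\
z^{k+1} &:= \operatorname*{argmin}_z \, L_{\rho}(x^{k+1}, z, y^{k+1/2}) \\
y^{k+1} &:= y^{k+1/2} + \tfrac{\rho}{2}\,(A x^{k+1} + B z^{k+1} - c)
\end{aligned}
```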
>>: So the first formulation is like a general convex optimization form --
>> Stephen Boyd: This one? Yes.
>>: So are you suggesting that every convex optimization problem, I should solve it this way?
>> Stephen Boyd: No, no. I already said this. I already said that for ADMM, I would say
something like this. For any particular problem, there will be one or many methods that are
superior to this. What this is, these are the Lego bricks, that if I said here -- it's one of these
imaginary things. You're on a desert island, but fortunately you have your laptop and a couple of
key libraries, and then you have to solve one of these problems and have to assemble it, this
would be an awfully good way to do it. That's all I'm saying. So I'm not suggesting this.
Something like this will work really well for certain things, right? This will be competitive or at
the state of the art, and if you put code time in there, too, it will be. Code development. Yes?
>>: You are always setting this at a [indiscernible].
>> Stephen Boyd: No. Rho is constant. It's constant. You can change it, and in many implementations, you do. We'll get to that when we answer the dreaded how-do-you-choose-rho
question. We'll get to that, but in fact here, we're just assuming it's constant. Everybody cool on
this? And the key part is this says I can solve a constrained minimization problem, provided I
have a method for minimizing F plus a quadratic and then projection. Okay. Lasso, it's a
modern regression method, so you want to minimize the least squares -- notice that's quadratic -- and then plus an L1 thing. Okay, so fine. You split it. Again, look how ridiculous the splitting
is. You just replicate X, but basically, you have to think about what you're doing here when you
split is you're saying, I want one processor to handle this and I want another one to handle that.
But that's one of our special cases. That's minimizing a quadratic. That's also a special case,
because the prox operator on that is soft thresholding. And now, I assemble the two, and notice
that if you wanted, you could even have completely separate processors handling the soft
thresholding. I mean, it's silly, but handling the soft thresholding and the linear part. And so
here's what you get. You get something that looks like this. Soft thresholding, this is trivial, and
then this update. Here's an interesting thing. This thing, that's ridge regression, because this is
actually solving least squares plus, and then a row over two -- this is taken off regularization,
right? And not only that, suppose you cache? This says, here's how you do lasso. First, do ridge
regression. Cache the factorization. Now, you get a big discount, steep discount, unless it's
sparse enough or whatever, but if it's dense enough, you get a steep discount on subsequent
iterations, and it says go with the cheap -- now you iterate with the back solves. Okay?
>>: How do you turn that back into ridge regression? I couldn't see that directly.
>> Stephen Boyd: This? Because if I minimize norm AX minus B squared, plus rho --
>>: [Indiscernible].
>> Stephen Boyd: Right, so here's what's bizarre. This says by iteratively solving a sequence of
ridge regression problems, I'm actually solving a lasso problem. That's the naive way to say it.
In some sense, how many iterations does it take, I don't know, 200. And you go, well, that's kind
of a lot. Then you say, ah, but if you solve a sequence of ridge regression problems and you
cache the factorization, you're going to get a huge discount after the first step. So that's what this
is saying.
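A hedged Python sketch of the lasso iteration just described (this is not the MATLAB script from the talk; the sizes, lambda, and rho are made-up illustration values): the x-update is ridge regression with a cached factorization, the z-update is soft thresholding, and the u-update is the scaled dual update.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

# minimize 0.5*||Ax - b||^2 + lam*||z||_1  subject to  x = z
np.random.seed(0)
m, n, lam, rho = 150, 500, 0.1, 1.0
A = np.random.randn(m, n)
b = np.random.randn(m)

chol = cho_factor(A.T @ A + rho * np.eye(n))   # ridge factorization: paid once, cached
Atb = A.T @ b

x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
for k in range(200):
    # x-update: ridge regression, a cheap back-solve after the first iteration
    x = cho_solve(chol, Atb + rho * (z - u))
    # z-update: soft thresholding with threshold lam/rho
    v = x + u
    z = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)
    # scaled dual update
    u = u + x - z
```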
>>: In the end you'll be comparable to --
>> Stephen Boyd: It won't be comparable. It'll be exactly the same.
>>: The last one?
>> Stephen Boyd: No, we're solving. This is not a heuristic. We are solving that problem
exactly. Okay, so all right. Let's do a lasso example, and these are just like trivial little things.
These are implemented -- this is MATLAB with some ridiculous four-line script. It's not
optimized in the slightest. It's just a little problem with 1,500 examples and 5,000 features or
something like that, and you just type in a three-line script. Look, the world does not need a new
lasso solver. My group's written three. Actually, if you draw a circle 200 meters around my office, it
would include the stat department and the CS department, and now we've got like 10 different
implementations of lasso, so we don't need another one of these. That's for sure. But look, you
just turn the crank on these methods and here's what happens.
The factorization is 1.3 seconds. Subsequent iterations, by the way, there should have been a
discount factor of about 1,000, and this is, you can see, way off. It's 30 milliseconds. But, come
on, this is MATLAB, and so it's a joke, anyway. If you want to do that, either do it real or don't
complain about the times, basically. So this is just to give you the flavor of it, right? So the
point is, it's a lot faster. Notice also that somebody might come around saying, oh, that's a lot of
iterations you're doing. But the point is, it's kind of irrelevant when the first iteration is one
second and the subsequent ones are 30 milliseconds. It's something to bear in mind. So if you
do a lasso solve, that's about three seconds. That's all the same. For a very short, three-line
script, that's pretty good. And, actually, it goes back to this really cool idea. It says, if you look
at lasso and you say, why would you not do that, if you were to talk to a classical person in
statistics, they'd say, well, fine. We might like to do that, but there's no solution -- there's an
analytical solution to ridge regression or Tikhonov regularization. There's an analytical
solution, and there's not one for lasso. It doesn't make any difference. It's the same cost. It's the
same.
Here's another one. Suppose you did the full regularization path, so you take 30 lambdas. You
cache the factorization of A once, because the only thing that changes is in the soft
threshold. And then it takes you -- so that's really 4.4 divided by 30, and now you're actually
substantially beating something like -- you're beating ridge regression, or Tikhonov
regularization. Everybody? It's just a baby example. It's not supposed to be -- okay. So we'll
do one a little bit more interesting, just for fun. Again, just to show you that you just turn the
crank on these things, and I won't go into the details too much, because I'm behind. Okay, so
here it is. You have an empirical covariance matrix, and what you want is you want to estimate
sigma inverse, and you want this to be sparse, so it's conditional independence and stuff like that,
so this is what you want. And so what you do is you form the negative log-likelihood function.
That's this, here. X is what you're looking for. You want X to be -- X is going to be sigma
inverse. That's the given data. And then to encourage sparsity, you put an L1 norm here, and
this is the sum of the absolute values over the whole matrix, so fine. This is relatively recent.
This is like five years ago or something like that people were looking at this and writing papers
about it. So we're just going to do -- just apply ADMM, vanilla ADMM. What we do is we
want to minimize this, and notice what we've done. We've taken this part and this part. We said,
well, those are smooth in X, so we'll lump those together, and that we know how to do. So
then you get something like this. You have to minimize this thing. Now, that's smooth, and you
might imagine you could use something like LBFGS. It turns out you can solve this analytically.
And the reason is, they're spectral functions. Everything here is invariant under a change of
coordinates, so you just orthogonalize it and it splits. You get an analytical solution, and the rest
are very simple things. It turns out what you do is you do an Eigen decomposition, and then just
-- actually, you're computing this function of the matrix, actually, and then you get that. So the
cost of the X update is an Eigen decomposition, and then you run it, and what's embarrassing is
it's kind of in the ballpark. This is something where you just take this silly algorithm, type in
three lines of something and you're kind of in the ballpark of a lot of these methods, and this
actually scales way better than some of the other ones.
It's kind of weird. These are the ideas. You shouldn't take the numbers seriously; just the fact
that the differences are minutes versus seconds versus hours would be significant here.
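For reference, here is a rough numpy sketch of the splitting just described, using one standard closed form for the X-update (eigendecompose rho*(Z - U) - S and rescale its eigenvalues); the names and the fixed rho are illustrative, and a practical code would add stopping criteria.

    import numpy as np

    def soft_threshold(V, kappa):
        return np.sign(V) * np.maximum(np.abs(V) - kappa, 0.0)

    def sparse_inv_cov_admm(S, lam, rho=1.0, iters=100):
        n = S.shape[0]
        X = np.zeros((n, n)); Z = np.zeros((n, n)); U = np.zeros((n, n))
        for _ in range(iters):
            # X-update: argmin tr(S X) - log det X + (rho/2)||X - Z + U||_F^2.
            # Setting the gradient to zero gives rho X - X^{-1} = rho (Z - U) - S,
            # which one eigendecomposition of the right-hand side solves exactly.
            w, Q = np.linalg.eigh(rho * (Z - U) - S)
            xeig = (w + np.sqrt(w ** 2 + 4.0 * rho)) / (2.0 * rho)
            X = (Q * xeig) @ Q.T
            Z = soft_threshold(X + U, lam / rho)  # elementwise l1 prox
            U = U + X - Z
        return Z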
>>: [Indiscernible] caching the Eigen decomposition and starting that?
>> Stephen Boyd: You could. Yes, so you could do all sorts of crazy stuff here. No, no, this is
just cold start. We're doing the normal thing. We reduce it to whatever it is, tri-diagonal and
then blah, blah, blah, so just calling Eig. Yes. All of these could be hugely improved. Now we
go back to what we started the whole thing about, which is distributed optimization, and we'll
start with what I think is the simplest distributed optimization problem there is, and it's actually a
very good thing to start with, because if you don't know how to do this, then you don't know
how to do distributed optimization in more general cases, for sure. And, actually, this covers a lot
of things. It's really dumb. It's this. Choose X to minimize a sum of functions. There. The idea
is you want each agent to process a thread to handle one of these functions. By the way, this Fi,
because we're not afraid of things going to plus infinity, these could be constraints, too, right?
So, for example -- this is the kind of thing we've been doing. Fi could encode constraints. In
fact, one of the Fs could be all of the constraints, non-negativity could be one of the constraints,
whatever you like. So it looks like that, and if you want to put this in a machine-learning
problem, this is beautiful. This is very simple. X is a vector of parameters in a statistical model,
so it's weights in a logistic regression thing. It doesn't matter. Weights in a regression,
something like that, right?
Then, this is the negative log-likelihood function or the loss function for the i-th block of data.
So you might have N data stores, and this is the loss function for parameter X on that data store.
Everybody got this? This says, it's as if -- this says please collect all the data together and find
the joint maximum likelihood estimate. Now, you could throw in a regularizer or prior, whatever
you want. So in ADMM form, again, the splitting is ridiculous. It looks like this. The first thing
you do is you let each of these -- you let each block have its own opinion of what X is, right?
You replicate. So you said everyone can have their own opinion, and then you add the constraint
that all opinions must agree, right? That's called the local variable. That's called the global one.
And these are called consensus constraints, and now you could literally say -- if someone
says, what is your algorithm doing, what is this algorithm doing, the answer would be something
like this. Well, I have all these entities. They all have their local opinion about what X is, and
what they're doing is occasionally they talk to each other and they're being pulled into consensus.
When they're pulled into consensus, they've actually solved this gigantic problem, which would
have been the result had they collected all of their data in one place and solved it. Everybody got
it? I think it's very -- yes. The math is actually not that hard, but it's more interesting thinking
about how these things work and what they measure. Okay, consensus optimization, just turn the
crank, it just pops out beautifully. There's a Lagrangian. So all of these are just -- each block
does it separately. You minimize your local function, plus a quadratic. Then, the next step is
you minimize this over Z. But over Z, it's minimizing a sum of squares. It's the average, and
then you update, but it's even easier, because it turns out this sum is zero, which is also very easy
to show. And so the algorithm turns into two lines, and it's just absolutely beautiful. It's this. So
here's how you coordinate a bunch of entities, minimizing a sum. Nothing could be easier.
What you do is each one should minimize its local private function plus the current dual variable
-- that's the usual economic interpretation -- plus a penalty for deviating from what the
average was at the last step. And if you think about it, if we have a bunch of people all solving
some problem and I send a message to everybody, I broadcast a message that says, stop, that's it,
what's X? My best guess of X would be to average all of those, and the reason, that's because for
convex optimization, you average things to get better. In other words, if everybody here had
their own private data and you're fitting a logistic model, and I said stop, and I collected all your
data, I should average those weights, right? That's guaranteed to be better. Yes.
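Written out, the two-line consensus iteration being described looks roughly like the sketch below. Each agent's private solver is abstracted as a callback local_solve[i](v, rho) returning argmin_x f_i(x) + (rho/2)||x - v||^2; that interface, the fixed rho, and the iteration count are placeholders rather than anything from the talk.

    import numpy as np

    def consensus_admm(local_solve, n, rho=1.0, iters=50):
        N = len(local_solve)
        x = np.zeros((N, n))          # each agent's local opinion of the parameters
        u = np.zeros((N, n))          # each agent's running disagreement (scaled dual)
        z = np.zeros(n)               # the shared, averaged opinion
        for _ in range(iters):
            for i in range(N):        # embarrassingly parallel across agents
                x[i] = local_solve[i](z - u[i], rho)
            z = (x + u).mean(axis=0)  # gather / all-reduce; after one pass, mean(u) is zero
            u = u + x - z             # pulls each agent toward the consensus value
        return z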
>>: It's surprising that the averaging weights wouldn't be sensitive to the level of contribution of
each term.
>> Stephen Boyd: No, that's in there. That's built into here.
>>: Oh, I see.
>> Stephen Boyd: So if you have 10 data samples, your loss function is really tiny. If you have
100 million data samples, your Fi is really big, or rather, it goes up steeply. So what will happen
is, if your Fi is small and weak, the local opinion won't pull you far. This term will dominate, and
you won't -- do you see what I'm saying?
>>: That's if you have a few rounds of exchange.
>> Stephen Boyd: No, no, this is the way it just works. Each time -- do you see what I'm
saying? Yes, yes, because your loss, your Fi -- if you've got a small amount of data, your Fi is
kind of very shallow and small, and that means this term is going to really influence your choice.
If you're some giant server and you've got just a billion examples of data, and you don't even
really care about what the other people think, that says your F is centered at the maximum
likelihood point, and then well, that's the Fisher information. If it's smooth, it goes up very
steeply, and that says that when you solve this problem, you barely budge from your private
opinion, so actually that is built in there. That's a good question. Okay, all right. And it's cool.
It's a gather and a scatter. Actually, it's not, of course. It's an all reduce is what it really is, so it's
not a scatter gather. It's an all reduce. Okay, so that's it. It's kind of cool. And then that's the
error. Actually, this is the error at the i-th box. Basically, it's how much your opinion differs
from the common opinion, and then it says you should update your dual variable, which is going
to help drive your opinion towards the consensus value. So it's very intuitive. What's not
obvious is that this always works, period. Always works, no matter what F is. F can be plus
infinity. It has to be convex. Yes?
>>: [Indiscernible] if you had a server with like a billion rows of data, and then you had a
million smaller machines, each with a few rows, what would that look like?
>> Stephen Boyd: It'd work just perfectly.
>>: Because the sum of the smaller ones maybe should pull the big server.
>> Stephen Boyd: It would all just work. You can check. What's very important is that that Fi
is not the average loss on your set of examples. The scaling has to be done exactly correctly. It's
the total loss, and then it works. I mean, if you're just used to solving something, we would
always talk about average, the empirical loss plus a regularizer. If you do that here, total
disaster, because then someone with 10 samples has the same status, and it's just wrong. If I
want to do -- what do they call this, collaborative filtering? You know, estimating a parameter
from different data stores, you don't add up the average losses. You add up the total losses. I
promise you, it's going to work perfectly. It'll do just what it's supposed to do.
>>: So by the design, you slow your convergence?
>> Stephen Boyd: Yes.
>>: If you would put them into two giant machines, then it will converge super-fast?
>> Stephen Boyd: Yes, yes. So, actually, we're going to talk about exactly this problem and
how you should do this.
>>: So this method is even more sophisticated than the [indiscernible], so at the end, you
average all the parameters.
>> Stephen Boyd: Yes, yes. I know what some people do. I can't say who it is, but I know what
they do. They do the first step without this. Each data store calculates its own thing.
>>: Yes, that's the most [indiscernible] way you can do it.
>> Stephen Boyd: Then they do the following. They collect the weight vectors, which are sparse.
They don't quite agree. They average them, and then do soft thresholding.
Period, okay. That's what they do. And actually, it's kind of cool, because I looked at it and I
said, yes, congratulations. You're doing actually 1.5 steps of ADMM. All this would say is,
listen, if you wanted to, you could do it. They don't need to, because they said it's actually
working perfectly well, thank you.
>>: [Indiscernible] if it's the case that you have roughly equally distributed nodes, you'd
probably be okay by throwing away that extra term.
>> Stephen Boyd: Yes.
>>: Because it just kind of works out. But if they're imbalanced, you really need it to keep it
kosher.
>> Stephen Boyd: Yes, yes. That's exactly right, yes. You mean this term.
>>: You might be getting something that works by mistake.
>> Stephen Boyd: Yes, that's it. Yes. Actually, what's very cool about this is, if you think about
let's put this -- this is, of course, general, but let's think about it in a statistical learning context.
That's the negative -- then it says, if I have to coordinate a bunch of boxes that have access to
data, heterogeneous data, even. It doesn't matter. If I want to coordinate them to get the global
maximum likelihood, each one has to implement exactly one method. They have to minimize
their negative log likelihood plus a quadratic function. The quadratic function is the negative log
density of a Gaussian. This is a Gaussian MAP, so that says that the only thing I would have to
do, if we want to estimate a statistical parameter, is everyone has to support one method,
Gaussian MAP. That's it. And the only thing that will happen is, I don't care how much data you
have, I don't care what your data is. I couldn't care less. You could be the prior. It doesn't matter.
One box could be the prior. All that happens is, I send you a Gaussian, a mean, you calculate the
Gaussian MAP, I don't even care how you do it. Use your custom method that people put a lot of
time in, except now they have to go back in and put in the hooks to add the quadratic term. But
that's trivial in all real methods, and then that comes back. And then using that, just by
averaging, I can actually coordinate a whole bunch of things. The only method you've
implemented is Gaussian MAP. You can actually force them to compute the maximum likelihood
estimate, had they collected all their data in one place. Yes.
>>: [Indiscernible]. So you have solve my problem and don't diverge from the --
>> Stephen Boyd: That's this one. That's solve your problem. That's don't diverge from the last
round.
>>: And there, what's the history?
>> Stephen Boyd: Yes, yes. That's history.
>>: And how biased you are, how often you --
>> Stephen Boyd: That's exactly what it is. That's right.
>>: Lopsided from the mean, right?
>> Stephen Boyd: This is state, and it's history. It encapsulates the state. It's the integral of the
errors. It's the thing that should skew your opinion to make you fall into consensus. That's
exactly what it is.
>>: But what do I estimate it --
>> Stephen Boyd: That's what I mean.
>>: So it compensates for bias as well as what you're splitting.
>> Stephen Boyd: Absolutely.
>>: Your split just gives it a certain bias.
>> Stephen Boyd: That's exactly right, and what's nice, the bias goes away. In the end, you get
the global solution. So let's do a quick example. We'll do a consensus classification. I'll go
quickly, because I think people here know about this, but we're just going to do a little SVM or
something like that problem for fun, and I'll just show you how this works. We're just going to
split the data. This is a tiny, baby problem. We're going to do SVM, so you have hinge loss, and
I have two features and 400 samples, so I could solve that problem in microseconds, literally.
But, instead, I'm going to split it up in 20 groups, and I'll do it in an evil way. Each group gets
either all spam, or all positive or all negative examples, okay, period? Each group alone has a
hideously skewed view of the world, and they have to coordinate to do this. Okay, so here's
what happens. Here are some data. These are all the first classifiers. Well, of course, they're
terrible, because every single one of these has seen a ridiculously skewed view of the world.
Okay. Then you average. That's that, which is actually -- this is kind of like bagging or
something. You take a bunch of terrible, weak classifiers. Anyway, there it is. Now you run
ADMM. That's five steps, and then that's 40, and that is in fact the solution. Now, what you
want to think about this, what is so cool is what's actually happening. You're making an SVM
classifier, and I show you 20 examples. They're all positive. That's all that happens. So the only
thing that happens is you go back into your SVM and you put this Gaussian prior. And the only
thing you implement, the only thing you expose to the outside world is the ability for someone to
send a message to you that said this is the prior on the weight vector, and then I don't care how
much data you have, what data you have, I couldn't care less, you send me back the map
estimate. What's that?
>>: [Indiscernible].
>> Stephen Boyd: No, that's stored locally. That state is stored locally. That's internal. It's no
one's business. The Yi is actually only his business.
>>: So it will throw away my data. That won't tell me anything.
>> Stephen Boyd: Oh, no, oh, no, not at all. What's going to happen is, after 10 steps of this --
by the way, you have absolutely no idea who else you're cooperating with. You could be sitting
next to someone who has a billion examples all over the place. You could be sitting next to
someone who has one example. It doesn't matter. You just support that one method, which is
this Gaussian MAP thing, and it works, so it's kind of cool. Yes.
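As an illustration of the toy experiment described above, the sketch below splits a synthetic 2-feature, 400-sample problem into 20 single-label groups and runs the same consensus iteration, with each group's hinge-loss-plus-quadratic subproblem solved by CVXPY for simplicity; the data generation, the small ridge term, the omitted intercept, and rho are all made-up illustrative choices, not the talk's actual setup.

    import numpy as np
    import cvxpy as cp

    def local_svm_step(Xi, yi, v, rho, lam=1e-2):
        # argmin_w  sum hinge(1 - y_i x_i^T w) + (lam/2)||w||^2 + (rho/2)||w - v||^2
        w = cp.Variable(Xi.shape[1])
        hinge = cp.sum(cp.pos(1 - cp.multiply(yi, Xi @ w)))   # total, not averaged, loss
        cp.Problem(cp.Minimize(hinge + (lam / 2) * cp.sum_squares(w)
                               + (rho / 2) * cp.sum_squares(w - v))).solve()
        return w.value

    rng = np.random.default_rng(0)
    Xpos = rng.normal(+1.0, 1.0, size=(200, 2))                      # 200 positive examples
    Xneg = rng.normal(-1.0, 1.0, size=(200, 2))                      # 200 negative examples
    groups = ([(Xpos[i::10], np.ones(20)) for i in range(10)] +      # 10 all-positive groups
              [(Xneg[i::10], -np.ones(20)) for i in range(10)])      # 10 all-negative groups

    rho, n = 1.0, 2
    x = np.zeros((len(groups), n)); u = np.zeros_like(x); z = np.zeros(n)
    for _ in range(40):
        for i, (Xi, yi) in enumerate(groups):
            x[i] = local_svm_step(Xi, yi, z - u[i], rho)   # each skewed group's local step
        z = (x + u).mean(axis=0)
        u = u + x - z
    print("consensus weights:", z)

Each group alone would fit a useless classifier from its one-sided data; the consensus quadratic is what drags the 20 local solutions to the global one.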
>>: It looks great. I feel like I'm being sold this pre-owned vehicle.
>> Stephen Boyd: No, no. I'm going to say -- I already told you, for any given thing, there's
definitely a better method. I've already told you that. And there's all sorts of things we haven't
popped off the stack, like how do you choose row, and we're getting there. I told you it had the
absolute worst complexity in theory.
>>: The common pitfalls or limitations?
>> Stephen Boyd: Yes, we'll talk about that. It's not over. It's not over. This is not over. It's
not over.
>>: Can you talk about that in terms of what the --
>> Stephen Boyd: What?
>>: Can you talk about those limitations in terms of what process you're --
>> Stephen Boyd: Oh, I can tell you limitations on many, too. Oh, let's talk about that.
>>: You haven't talked anything about whether you need to randomize the split of the data.
>> Stephen Boyd: Oh, that's a great question. We're going to talk about that, too. That's a great
question. We're going to talk about that. We're going to talk about that. Okay. Actually, maybe
what I'll do is just to say that this would not be for solving problems that you can solve in 60
microseconds. You can solve real problems with it, and it does. The cool part is it does kind of
the right thing. Just for time, because these are all very interesting things. You just write -- all
the source for this is online. It's all super-simple and hideously non-optimized. But you get
something that it comes out with what you expect to happen, right? Basically, you take a big
problem -- and this is not big. This is a small one. But this was only limited by EC2 charges on
my credit card or something like that. But it's the same thing. That's the factorization time.
That's if you knew how to solve a least-squares problem of that size in parallel and you know
what you were doing on 80 cores. If you do that, it takes five minutes. And then, lasso, it's like
five or six. Right. Okay. So maybe what I'll do is I'll skip -- I won't. Maybe this would be a
great time to say a little bit about how -- we have a bunch of questions that are pushed on a stack,
and I'm going to pop one now. So one was how do you split data, and here's what's weird, that
you can make a reasonable argument, and I know people who've actually done both things. So
here's one that's very cool. Suppose we're doing machine learning and data fitting, so it's literally
that problem we were just looking at back here. It's this thing. So here's one thing you could do.
You hash the examples. So let's suppose I have a billion examples, I split it into 100 groups of
10 million. Do those multiply out right? Okay, fine. If they don't, then fix one of the numbers.
Anyway, so that's fine. Got it? So now the idea is I have 100 opinions -- let's do logistic
regression. It doesn't matter. I'm making a click-through thing, predictor, it doesn't matter. So
here's what happens. I ask each one to go ahead and make a prediction, okay? By the way, those
are probably pretty good, right, because if 10 million examples is not enough to give you a
reasonable classifier, you're in deep trouble and you should probably just go home, right? So
what you get now is 100 pretty good opinions. Now what happens is, the only thing that happens
when you run ADMM is you're gaining statistical power. That's the only thing that happens. So
what happens is, you collect all of these together, you get an average. I mean, by the way, if all
100 are totally different, it's time to quit, because you're not going to get a good classifier,
anyway. But if they're all about the same -- I mean, they're going to be a little bit different. This
was from a server in the Northeast, and so people had slightly different click-throughs. That's
fine. You collect all these things together, you do this averaging, or you do an all reduce, they
get it, and what will happen is those things typically converge in five to 10 steps, and the data
didn't move, if you see what I mean, right? So that's an example where probably the right way to
distribute it is to hash it, unless it's already sitting in different places, right, so it's naturally
already segregated. So literally, I just randomly send down examples, right?
>>: But in theory, [indiscernible].
>> Stephen Boyd: The theory says it always works, but come on, that doesn't mean anything.
>>: But the theory says it converges. It doesn't say it will converge faster or slower depending
on your additional partition.
>> Stephen Boyd: Exactly. So the other extreme -- so that's one extreme, where you partition
just randomly. The other extreme is you actually do real partitioning so that you get a whole
bunch of stuff that has mostly these features, and there's a whole ton of features you've never
seen. You get some other features, and there's a whole bunch you haven't seen. Now, what
happens there is the local problems that you solve, you can solve much faster, because they're
much smaller. But the problem is, to gain any reasonable estimate, there's a diffusion
mechanism that has to happen, and you have to come to -- so then you have to balance these two
things, simpler sub-problems but many more iterations to get a useful solution. And so I would
say right now it's completely unclear which is the best way to do it, and it's going to depend on
all sorts of systems issues and things like that and all that. So I really don't know.
>>: But in a case that it was acceptable like that, shouldn't it be the case that most of the
variables that were being regularized on your behalf were the ones you weren't really dealing
with in your partitioning anyway?
>> Stephen Boyd: Yes, okay. So this is one step up. This is the most basic thing. This is called
consensus optimization. That's where everybody has all variables. But, in fact, if we split it up,
and there's a billion features and you only have a million of them, and you have a million, then
what you really do is you only -- that's what we call the general form. You make it
a bipartite graph, so you don't handle a billion. You only handle the weights for the features you
have, and then in fact it's actually -- it sounds like it fits perfectly with MapReduce, because it
does. What happens then is you just take all these things. Your weights you store as key-value
pairs. I reduce them all and, if you have a key that matches your key, it means you guys had a
common feature, and I average your vals. Okay, yes, and this gets us to how these things don't
work. I know of no implementation of MapReduce where this is not a total disaster, but that's
apparently due to how the scheduling works and things like that. But it sure sounds good. That's
obviously not a fundamental thing. Okay, now, we could -- there's one more example I was
going to do, or one more topic. It's short. Is that consent? That's consent. Okay, got it. We're
going to do it. I'll go fast, right, because we're almost done with a lot of this stuff.
>>: [Indiscernible].
>> Stephen Boyd: Yes, yes, you want to do one more problem? We can fit in both, don't worry.
I'm not going to go anywhere. Well, at 3:00, I am. I'm going to leave, so until 3:00, we can talk
about -- okay, let's do it. Let's look at this. But, yes, don't worry. We're coming back to your
rho question. Okay, so exchange ADMM is this. Whoa, that's not the problem. That's the
problem, there. So here it is. It's actually the dual of consensus. It says this, minimize a sum of
Fi of Xi. These are all different things -- subject to instead of all the Xis equal to zero, they sum
to zero. And that's exactly the dual. I can tell you why. Because if you stack the Xs, to say
that you're in consensus says you're in the range of the matrix [I; I; ...; I] -- identity blocks stacked vertically -- right? To say that you satisfy
exchange says you're in the null space of its transpose, [I I ... I], looking like this, right? So
these are in fact duals, right? This has a beautiful economic -- in fact, this is what a market is,
period. So here it is. F is like a negative utility or a cost. Xi is the vector -- it's a multi-commodity market. Xi is the vector of things that agent i is either going to put into the market or
pull out, right? So that's what it is. This says the market clears. This says that, for every
commodity, the total amount contributed to the market will balance the total amount taken out.
Okay. This is something like -- if this was a negative utility, this would be like the negative total
social good. This is social cost. And then this thing is the equilibrium or market-clearing
constraint, and by the way, the dual variable is the actual set of prices for the commodities. It's
the commodity prices. Okay, so that's it. Now, if you work out what exchange ADMM is, you
get a beautiful thing. It looks very simple. It looks like this, or in unscaled form, it looks like
that. And here, you'd probably put it in unscaled form, because that's the prices. What's kind of
cool about this is, if you remove this term here and put that back to alpha-k, this is actually a
very famous old algorithm from I guess it's the Debreu something or other. It's a
tatonnement procedure. It's a price algorithm. It's Arrow. That's it. That's what it is. It's
Arrow-Debreu. There's a bunch of them, but this is early. And, by the way, it kind of
worked in theory and sort of and doesn't -- you'd have lots of things about the Fs that you'd have
to assume. Well, they did it with maximizing, so you'd have to say that they were smooth and
strictly concave and all sorts of other crazy stuff. This just always works, period. It will clear
any market. Sorry, if there's a solution. If there's not a solution of it, then actually the prices
diverge. That's another. So it's a tatonnement process, and I just want to look at one quick
example of this. If you look in distributed dynamic energy management, so here you take a
bunch of devices that are going to exchange power at different times, and those are actually
different commodities. So power between 2:00 and 2:15 p.m. is different from power between
2:45 -- I mean, they're measured in the same units, which is joules, but they're different
commodities.
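The exchange update being described looks roughly like the following sketch, viewed as a price-adjustment loop; device_solve[i](v, rho), standing for argmin_x f_i(x) + (rho/2)||x - v||^2 over device i's feasible profiles, is a placeholder interface, and rho and the iteration count are illustrative.

    import numpy as np

    def exchange_admm(device_solve, T, rho=1.0, iters=100):
        N = len(device_solve)
        x = np.zeros((N, T))          # each device's power profile over T time periods
        u = np.zeros(T)               # scaled prices, one per commodity / time period
        for _ in range(iters):
            xbar = x.mean(axis=0)     # average imbalance; the market clears when this hits zero
            for i in range(N):        # devices respond to the broadcast (xbar, u), in parallel
                x[i] = device_solve[i](x[i] - xbar - u, rho)
            u = u + x.mean(axis=0)    # price update driven by the remaining imbalance
        return x, rho * u             # profiles and the (unscaled) clearing prices

With the quadratic term dropped, the same loop degenerates into the classical tatonnement price-adjustment iteration mentioned above, which needs much stronger assumptions on the f_i to behave.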
And so what you do is each device has a cost for its profile. That also gives you the
constraints, and that says that energy has to balance in each time period, and so you do exchange
ADMM, and what you're doing now, it's actually kind of cool. It is a smart grid. You're a
refrigerator, you're a generator, you're a load or something like that, you're the HVAC system.
And what happens is, each of them privately optimizes their energy consumption and generation
-- you're a storage device, so it's actually subtle for you. You should be charging when the price
is high, and you should be discharging when it's low. Did I get that right? No. I reversed it.
Okay, fine, sorry. Anyway, so the idea then is all that happens is, you don't even know if he's a
battery or not. We don't know that you're an HVAC system. All that happens is the node that
clears things tells you what the current primal residual is, which is how much we're violating the
energy balance, and we know the current prices. And then we would all update, and actually,
we're kind of doing this in the subjunctive, right, because we're saying, if those are the prices and
everything, here would be my profile, and then these would be averaged together. By the way, if
they average out to zero, done, that's it. We've solved the problem, and none of us knows how
many others were connected to the network, right? So it's kind of cool. So just to give a rough
idea, this is something where these things actually do work, so we implemented these things for
various things. We did some fun stuff. This is an example with about a million variables in the
optimization problem, and it's something you can solve centrally in hours, so you can find out,
and I forget what this was, but it was kind of seconds.
But the scaling is sort of fun. So this was just really simple things. I think these are problems
with, I think, 30 or 40 million variables, because these are something like -- I forget what this is.
This is like 300,000 devices. You're scheduling over 15-minute intervals for a day. A generator
or anything else would have a couple hundred variables that involves transmission lines and
constraints. That's actually a serious -- that's a big problem. These are not small problems, and
those are being solved in a couple of minutes, and this is just on a -- this is just 64 threads.
Right? So this is where it actually worked, because you just make a thread pool and use
OpenMP, and it just worked. On more complicated platforms, it doesn't work at all. Yes?
>>: Another angle that I thought you were getting at with the energy one that seems important is
that this is a really interesting way to do privacy preserving learning, because you can store all
the information.
>> Stephen Boyd: Right, so when you do this sort of stuff, you can imagine doing stuff like let's
do statistics with medical data. You have a whole bunch of patients, you have some. You better
not have just one, right? That's not good. But you have 50, you have 150, 200, and then when
you're doing this, basically, you just support one protocol. I end up with a classifier regression
model, but you don't actually even know how many patients he has. So this is privacy
preserving, and I do say that, but I stopped saying it when it turned out there are people in the
audience who know about privacy, because it turns out it actually doesn't satisfy the most basic
rules of privacy preserving. In theory, I could figure -- because if I keep querying you, like I'd
say, oh, what if this was the prior and you'd tell me what it is, eventually I can learn your loss
function or something. But it would seem, as a practical matter, that it would in fact be privacy
preserving, but notice this is severely qualified, because this is being taped.
>>: Well, you can reduce the number of iterations, there's only so much information.
>> Stephen Boyd: No, that's it. By the way, I brought this up with my friends who do privacy,
and they looked at me, and they were like, are you done? And I said, yeah. And they said, no,
that's not called privacy preserving. So yeah. But I agree with you totally. That's a completely
reasonable thing. If you have 1,000 patients and we converge in 50 steps -- I mean, 50 is more
than enough for one of these things, because you're in the Southeast and you have more people
with diabetes or something, fine. It's going to be slightly different than you've got northern
California and people are healthy there. I don't know. Whatever it is. But the point is, 50 steps
is more than enough to get the common model. So in 50 steps, there's no way I can say, oh, oh,
guess what I just found out? I agree. Okay. So, okay, I'll just do the conclusions, but we've kind
of hit them all, but then yes, don't worry, don't worry, it's coming. Basically, it's the same as or
close to many methods with other names. It's been around for a long time, and actually it's kind
of fun, because you just kind of -- Lego like, you just put bricks together and you can implement
things that can be shockingly close to competitive with almost nothing. But, actually, what's
kind of cool is how gracefully it extends to distributed and other computation platforms. And I
should say one thing, just in case you think you're being sold something, I have to say actually
that things like this have been implemented in the last couple of years on a bunch of things.
Some have worked -- the dynamic energy stuff on a tiny little MPI box with a bunch of cores.
People are doing the big things with gigantic things split 100 ways, the example I just gave you.
That's working and running. There's a Pregel implementation that works, but basically most of
them have not, but just for reasons involving modern computation platforms and things like that.
They're just not optimized for things like this. You'd look at something and it would spend all its
time serializing and deserializing data, and then you'd tear your hair out and you'd say --
>>: [Indiscernible] or they don't converge?
>> Stephen Boyd: Oh, no, they converge fine, but you do the arithmetic and it should be 100
milliseconds per iteration, and it turns out it's 10 minutes per iteration, and then you figure out
what's happening, and most of the time it's like serializing, deserializing data, and you're pulling
your hair out, because you're saying the whole point of this is that that data should stay there.
And they go, oh, yeah, we know how to give the scheduler a hint on that, and I'd say, well, give
it a hint. You get the idea. These things will work out, but just full honesty, some of these look
like they should just work.
>>: [Indiscernible] exchange problem, does the scaling that should --
>> Stephen Boyd: Oh, no, that one works. That I can assure you.
>>: [Indiscernible]. It's mainly CPU bound? So it doesn't have to spend time?
>> Stephen Boyd: Yes. Our little thing is on a shared memory, 64 threads, running up. Sure.
Yes, yes. Well, the real trick is we use code generation to be able to solve the device problems
in microseconds. That's the real trick. That's also been implemented by Google, that one, the
dynamic energy thing, on real, separate processors, and it works. It's even asynchronous, so it
works, roughly.
>>: That's a convex problem to begin with?
>> Stephen Boyd: Yes, it is. No, but first, I'll be in big trouble -- I won't even make it out the
door. There's an escape path right there. So we're going to talk about rho. So the question is,
does rho affect the convergence rate? The answer is yes, of course it does. It turns out, if you
scale the data appropriately, there are values of rho that are reasonable. There's also some
interesting methods where the key, as in -- curiously, in a lot of other algorithms, the key is to
balance the rate of decrease of the primal residual with the dual, right? So you want to balance
these. And so there's heuristics that say, if the dual is getting too small, too fast, then you either
increase rho or decrease rho. I can't remember. It's one or the other, but you do one of those
two things, right? And these things work okay, and then someone might ask, oh, does your
theory work when rho is changing? The answer is no, and then you just say something like,
fine, we'll stop changing rho after 50 steps, and then the theory works. So there's that. So
there's various ways to actually -- a bunch of my students came up with a method that tunes rho
based on 1910 rules for PID tuning from control theory, because it's exactly that. If rho is too
high, you get an overshoot like this. If it's too low or something, it's the other way around,
maybe, then you get this thing like that. And you want something that's critically damped.
Notice that I'm not really answering the question, just to make that completely explicit.
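One commonly quoted version of the residual-balancing heuristic alluded to above looks like the sketch below: grow rho when the primal residual lags the dual residual, shrink it in the opposite case, and rescale the scaled dual variable to match. The constants mu = 10 and tau = 2 are the usual illustrative defaults, not anything fixed by the method.

    import numpy as np

    def update_rho(rho, u, r_norm, s_norm, mu=10.0, tau=2.0):
        # r_norm: primal residual norm; s_norm: dual residual norm at this iteration.
        # u is the scaled dual variable y/rho, so it must be rescaled when rho changes.
        if r_norm > mu * s_norm:
            return rho * tau, u / tau   # penalize constraint violation harder
        if s_norm > mu * r_norm:
            return rho / tau, u * tau   # relax the penalty
        return rho, u                   # residuals balanced; leave rho alone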
>>: [Indiscernible].
>> Stephen Boyd: So no one knows -- actually, it's not really. No.
>>: [Indiscernible] and you say you should [indiscernible] disagree with the mats, and they say,
oh, you can slightly disagree and we'll make progress, together, right?
>> Stephen Boyd: Yes.
>>: And then this is this tradeoff.
>> Stephen Boyd: It turns out for a lot of these things, there's actually a big region where it
works just fine, and you do have to have competent scaling. It is a first-order method, and in a
first-order method, scaling is critical. That's the whole point of a second-order method, is scaling
is second order, right? So, yes, you have to scale things correctly. I guarantee you, if you make
one equation, the typical number is like 1e-6, and the next one, 1e+3, it's game
over. Of course it's not going to work. So if you equilibrate the matrices involved, it's actually
usually not that big a deal, or put it this way, there's a lot of people looking at this and, I don't
know, there's already been five papers on optimal rho selection, and the truth is you don't even
really want a scalar rho. You want it to be a diagonal matrix, not a single scalar, and that's the correct
thing -- or even a full-on preconditioner. So you really want to choose the quadratic form.
So there's a lot of things going on like that, but it's by no means settled, but these things do work
if things are done responsibly.
>>: Does it require the data that X or whatever -- the data to be reasonably normalized?
>> Stephen Boyd: Yes. No, it doesn't require it. The theory requires nothing. It just works.
Yes, to get -- although we often do it without that and it's okay. It just works. If things are
reasonably scaled, then it just works. So for the dynamic stuff, rho = 1 works just fine. That's
assuming your stuff is scaled so that things are in kilowatts. The typical numbers you're going to
see are going to be between minus 30 and plus 30. You're just fine. This is just going to work.
Rho = 1 is fine. That's not to say it wouldn't be great to figure out good ways to do this, but it's
certainly not a show-stopper. That's for sure. There's one more thing.
>>: Non-convex, could you comment on that?
>> Stephen Boyd: Non-convex, and this is the last one, and this is super-interesting. Let's take
ADMM. You remember what it is, because you know what it is now. You minimize over F plus
a quadratic and G(Z). So you could do the following. People are doing this now, Jonathan
Yedidia and many other people are doing this. What you do is you apply ADMM, exactly that
algorithm, to a non-convex problem. For example, F could be convex and G could be the
indicator function of a variable being Boolean. So G is just the indicator function of every entry
being either zero or one. F is the convex relaxation of the problem. And, in fact, solving a
mixed-integer convex problem is minimizing F(X) plus G(Z) in that case, right? Now, you just
run ADMM, okay? How do you project onto that? It's called rounding. But you could do much
more sophisticated combinatorial structures. I mean, anything where you can compute the
projection easily. So now you know the algorithm, and now you might ask, what can you say
about the algorithm? The answer is absolutely nothing. Examples show it doesn't even have to
converge at all, let alone to the solution. That would be ridiculous, right? Now, having said that,
so what it means is you keep track of the best point you've found. Having said that, these appear
to be extremely effective heuristics, extremely. So for things like Sudoku and things like that,
those are too trivial problems anyway, but for things like 3SAT, you go and you look at the
phase transition boundary, they work as well as any other heuristic. So, where they're
generically solvable, these things work. Where they're generically unsolvable, they don't. And
then in that little band in the middle, where 3SAT problems are really hard, in the transition
band, of course, they don't work, but then nothing does. So people are doing this. It's very
interesting.
>>: So you said who did that?
>> Stephen Boyd: Jonathan Yedidia is one person doing this, and they have a great name for it.
You know what they call it? I should have said that. So it's the best algorithm name -- it's like
leave it to the physicists to come up with a good name. This is unbelievable. It's fantastic. It's
Divide and Concur. Come on, that's good. That alone is a huge contribution right there, because
that's what it is, right? We are calling them like replication consensus algorithms. Who would
you rather hang out with? Obviously, right? So that's good. You had a question. Notice, we're
not converging, but the residuals are getting smaller, so yes.
>>: Is this being applied to some other integer problems?
>> Stephen Boyd: Other which problems?
>>: Some integer programming problems?
>> Stephen Boyd: Well, yes, my students and I have been playing with it, and other people have
written some things about it, and it appears to be quite effective. You can't say anything about it,
as far as I know, yet, maybe never. Yes.
>>: So the people doing the [indiscernible] thing where they're putting in this purely non-convex
thing, the G, what about the -- has anyone tried to compare that to an alternative of a softened
convexified G and then seeing how those two may pop against each other?
>> Stephen Boyd: Yes, and a lot is unknown. Like nothing is known about these kinds of
things.
>>: [Indiscernible].
>> Stephen Boyd: But this is actually kind of doing that, because what you're doing is you're
solving a relaxation. What would you normally do? You'd solve the relaxation. That's just F,
and then you'd round. Right? That would be the -- what's a simple way to solve an integer
convex problem that actually is probably for most applications just fine? The answer is, solve
the LP relaxation and round. Maybe do a little local thing to adjust things or whatever if you
like, right? Fine. Right? We all agree on that?
So that's exactly one step. So what this does, that's the first step is exactly that. Then you know
what it does? It looks at the difference between the continuous variable and the rounded
variable. By the way, if that variable in the relaxation had been zero or one, the
difference is zero. And then what it does is it increments U or Y for that, and then what it does
is, the next time you solve a relaxation, variable number three, which had come out as 0.3 and
was rounded to zero, now has a slight incentive to move one way. But you're still solving this
big problem with everything coupled. The intuition behind why this would work is pretty good.
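Here is a rough sketch of that loop for a Boolean problem: the x-update solves a tilted convex relaxation, the z-update is literally rounding, and you keep the best feasible point seen, since, as noted, nothing is guaranteed to converge. relax_solve(v, rho), standing for argmin_x F(x) + (rho/2)||x - v||^2, and objective(z) are placeholders for whatever problem is being attacked.

    import numpy as np

    def boolean_admm_heuristic(relax_solve, objective, n, rho=1.0, iters=100):
        x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
        best_z, best_val = None, np.inf
        for _ in range(iters):
            x = relax_solve(z - u, rho)               # solve the relaxation, tilted by the dual
            z = np.round(np.clip(x + u, 0.0, 1.0))    # "projection" onto {0,1}^n is rounding
            u = u + x - z                             # accumulates how far each entry was rounded
            val = objective(z)                        # score the rounded (feasible) point
            if val < best_val:
                best_z, best_val = z.copy(), val      # remember the best point found so far
        return best_z, best_val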
>>: So [indiscernible] with this story and kind of second-order information, you can collect like
a sliding range of gradients? It looks a bit like that, right, the Y.
>> Stephen Boyd: Yes, well, if you look closely at optimization methods, if you look long
enough at them, they all start looking the same anyway. We were just talking about that. So yes,
right?
>>: At the risk of prolonging -- it's a wonderful talk. These methods, I understand that the
iterations are fast, so let's not focus on the one-over-T rate.
>> Stephen Boyd: Though they need not be, right?
>>: But then there's that whole other range of problems, like you're basically solving the linear
solver with the iterative method, and it's all about condition numbers, and then the convergence
is just fantastically better, right?
>> Stephen Boyd: Yes.
>>: What's the hybrid? How do you get these kinds of techniques to look more like second-order methods?
>> Stephen Boyd: That's a great question. A lot of it would be preconditioning. That's exactly
it, or chunking it. If you chunk a problem up that way, you're almost doing a preconditioning.
So you might do crazy stuff like this, take the linear part, do an incomplete Cholesky on it, use
that as a -- which could even be block, so you're going to do different blocks on different things.
You can do just block incomplete Cholesky, and then that's a preconditioner, and the hope is
things like this will just work really well. We want them to be kind of automatic, though, so we
don't want you to have to figure out a preconditioner every time. We want it to be automatic.
>>: There's the grounded part, but then there's the lasso condition that does soft thresholding. I
wonder if the two will play together or if they just fight.
>> Stephen Boyd: We don't know. I mean, we do know that when you use lasso the way you
want to use it, so that it's relatively sparse, what comes out, these things are very, very -- they
converge very quickly, because one thing just keeps truncating all but 50 of the variables to zero
and it doesn't take long for the other one to converge. Good? Yes, thanks.