>> Lin Xiao: It's a great pleasure to welcome Professor Stephen Boyd from Stanford University.
Stephen is the Samsung Professor of Engineering and Professor of Electrical Engineering in the
Information Systems Laboratory at Stanford. He has done very influential work on convex optimization
and pioneered convex optimization in many areas such as control systems, signal
processing and circuit design.
Okay. Today he's going to talk about some recent advances in convex optimization. Stephen.
>> Stephen Boyd: Thanks very much. I gather I'm finishing off Stanford week here or something
like that. We didn't plan it. So but anyway I'm very happy to finish up Stanford week at MSR.
So I'm going to talk about recent advances in convex optimization. Stop me any time. There are
parts where I'll definitely go too fast. If I go too slow, that's really criminal, at which point just
speed me up somehow or make the universal sign of boredom or whatever, and I will speed up.
But I have a lot of material, though. So we'll just see.
So the first thing I'm going to do is start by just quickly saying what convex optimization is and
why anyone would care about it. The things I want to talk about -- I'll really talk about three different
things, at different levels of detail, actually none of them in horrible detail. And
these are, I guess, the recent advances.
One is modeling tools for convex optimization. So I'll say what these are, how they work, what do
you do with them and things like that. I'll talk about that. This has had a huge impact on sort of
how teaching is done and research is done with these things.
I'll talk briefly about large-scale convex optimization: beyond problem sizes where a
direct solver can work, you move to iterative solvers; this is when you move into the million to ten
million constraint range.
The final thing I'm going to talk about is something I'm quite interested in right now. I'll say a
little bit about some preliminary work -- actually, I have very little on
this, but I'll talk about it; this is ongoing work, and maybe if I still have time I will say a
little bit more about it. Real-time convex optimization is embedding convex
optimization methods in extremely fast real-time systems.
So we're talking about solving convex optimization problems of modest size in numbers like
microseconds or milliseconds.
We'll talk about this. But I can tell you the summary right now. You can do it. So that's the
summary. But we'll get to that. It's actually quite interesting, because this one opens up a
whole world. Because people who do optimization -- people trained in optimization, even the
people who make all the best solvers and stuff like that --
They're always talking about, oh, I solved this problem with a million variables blah, blah, blah.
But these are human time scales. I have this thing now where I ask people who write solvers:
How fast can you solve like a network flow, some small or some tiny problem with a couple
hundred variables or something like that.
Basically they don't know. They're like, a second. If it's fast enough that it's like an Excel
spreadsheet update, it's not even perceived. They call that zero.
But the answer is a lot of those things, if you know what you're doing, can be solved in
microseconds or milliseconds. I'll give some examples of that. I have no idea what to do with this.
Well, I have some ideas; we'll get there.
I should say: interrupt me at any time. All right. So we'll start with optimization. I think everybody
would know all about this, but just to set the notation: you minimize an objective function,
subject to inequality constraints and some equality constraints.
And everything is an optimization problem. Well, fine, that's a tautology that says nothing.
What really matters is whether or not you can solve that problem. And whether or not you can
solve it depends very much on the properties of these functions.
So, for example, if they're all linear plus a constant, or affine, you get an LP. And these you can
solve very efficiently -- I mean, in generic cases. You can always push the boundary and things
get difficult there, and you need custom methods and all that. Roughly speaking, as a zeroth-order
statement, you can say you can solve LPs very efficiently.
In contrast, very small problems of that form with nonlinear f_i and h_i -- so with nonlinear
functions involved -- can be intractable. It's very easy -- if you go to
quadratic, that's all you need. You can embed your favorite NP-complete problem this way, and
that's the end of that.
These will be difficult in practice as well.
Okay. Convex optimization is this: your equality constraints are all affine -- or linear, people would
say -- and the objective and inequality constraint functions are convex, so their graphs are bowl
shaped and they have nonnegative curvature.
So that's convex optimization. And it includes LP as a special case. LP is actually
the special case with zero curvature: LP is the case when all the functions are affine, and that
means that you have equality in the defining inequality, and not just for theta between zero and one
but for all theta. That's affine.
So the boundary of convex optimization problems, in some sense, is the LPs. So that's what
they are. Now, these things can look really difficult. They can be nonlinear, nondifferentiable
and so on, but they can be solved very efficiently. I should point out that that's something that
goes against much of the teaching and thinking in OR, in traditional optimization, because in
traditional optimization you talk about linear and then you go to nonlinear.
They even organize themselves this way: you do LP or you do NLP, and NLP is nonlinear
programming, something like this. And a further step would be nondifferentiable nonlinear.
That's a big step up. And you could go to discrete or something like this.
Turns out when you talk about convexity, differentiability is irrelevant, utterly irrelevant. Which is
not a small thing: if you go look at your traditional first class on optimization, the first half of the
book is filled with gradients and the second half is filled with gradients and
Hessians. This is not a small issue in traditional OR. It's a big deal.
Second big point is these problems come up a whole lot more often than was once thought. So I
guess LP is widely used, widely suffused in lots of areas. But there are problems that are convex
that hadn't been noticed to be convex.
Now, there's a very large number of these. And it's one of these positive feedback systems. It's
a social system, where the more people know about these things -- the more people, for example,
who know about semidefinite programming and are out there looking at real problems -- the more
people step back one day and say, oh my God, that's an SDP, or something like that.
So the more applications there are of SDP, for example, the more motivation there is for
people to write good solvers for it and work out complexity theory and all that stuff.
So I think that's what this is. There are recently discovered applications in control, combinatorial
optimization, signal processing, machine learning, statistics and finance and so on.
Now the impact on a field varies, because it depends on lots of things. For example, in machine
learning I would say it's a conceptual impact; it hasn't really changed anything, because a lot of
these problems are dominated by huge scale. So it's not like you can just say, oh, okay, that's
convex, and then walk away.
In other cases you can actually just solve the problem. In control it just works. In smaller
problems in statistics and finance certainly, and signal processing and circuit design, you can just
do these things.
Okay. So there's some challenges. I'll go from the bottom up. So in the theory category, convex
analysis is more than 100 years old. It's a branch of math. People pretty much knew everything by
1970 -- I mean, more or less; there's still stuff being done, and people who do that would be
very irritated if they heard me say that.
But okay, basically this was done by 1970, roughly. Which is not to say it shouldn't continue --
you need people doing this; that's great.
And there's lots of people working out complexity theory for algorithms like that. That's good, too.
So I'll say a little bit about how these methods actually work in practice as well.
Okay. So a bunch of people working on algorithm development. So here you're working on
developing things like reliable efficient algorithms for special applications, special classes. Or in
some cases just general convex optimization problems. And, of course, there's people who work
on sort of at the high algorithmic level and those who work on the details of the software
implementation.
In fact, how these things work -- to be blunt about it, this often matters much more
than, for example, the theory. You go to a conference and see people getting up screaming at each
other, saying my primal-dual search direction makes this thing converge in 19 steps whereas
yours took 23.
But, of course, how you do your linear algebra can matter by a factor of 100. So
here are the theorists arguing with each other over 17 versus 23 iterations, or n to the 3.5
versus n to the 4, this kind of thing, when in fact it turns out they really should be concentrating
on how they do their factorizations and things.
At the highest level, or at least at the top of the slide, there's a question of modeling. And so this
is the question of posing practical problems with -- you want to write a problem as a convex
problem, and this can be approximate. And the level of violence involved in mapping a problem
to a convex one can range from zero or minimal, in the sense someone gives you the problem,
you change some variables or whatever and it's exactly equivalent. Now it's an SOCP or
something.
If the problem is not dominated by huge scale or tight real time requirements, you're done. That's
it, you quit. And you have no apologies to make, no footnotes, nothing like that. It's done.
Then there are ones where there are terms that you can't handle exactly. You put in an
approximation. These are mild, and so on. Then you get up to ones where "approximately"
is being very, very generous. These would be relaxations of combinatorial optimization
problems, where you can hardly say you've solved the problem, but you
end up with two things: a lower bound on the hard problem, and an
outstanding heuristic for a suboptimal local method.
Now, once you've done that, if you have anything to apologize for -- for example, approximations,
missed terms and ignored constraints and things like that -- you have to go back and verify the
practical performance of it.
Okay. So now I'll say a little bit about convex optimization modeling tools.
>>: Everything you said up until now, except convex optimization, you could have just said looking
for a local optimum -- is there anything special about --
>> Stephen Boyd: No, you're right. Well, that's not entirely true. But as a zeroth-order statement
that's correct: if instead of solving the problem you mean finding a local solution, then much of
what I said -- almost all of it -- will just translate immediately.
There is actually a difference. It turns out there are issues of scaling, and the algorithms are
trickier and things like that, because you have to deal with things that are super flat or have the
wrong curvature. Whereas when you get a convex problem these issues go
away. They don't happen.
Not only that, they're reliably things like 20 steps, 25 steps. Absolutely reliable. Whereas a
good local optimization method will be quite reliable and quite fast, but there are some parameters
there that have to be tuned.
There's a little bit more baby-sitting when you take a look at one of these things -- I won't name
any. You run it, and if it's your lucky day, 25 steps, done. Actually, you can do a cool
experiment. Take a convex problem in a form that can be handled by a
nonlinear solver and call a nonlinear solver.
Right? You'll get the global solution. So the question is, which is better? The answer is the
convex optimization solvers are far, far superior. The other ones either fail, or -- if you go in and
set 14 parameters just right, it will work. It's got to. Theory says it will.
So all of that is the difference between what I've just said and how it would transpose to
finding a local solution of a nonconvex problem.
Here are some general approaches. So the first is to do the following -- and this is maybe the
most common; it's a method. You use a general solver for convex optimization, and you
merely assume that the f_i are convex and you proceed.
Now, if you evaluate a Hessian and it's not positive definite -- for example, a Cholesky
factorization fails -- everything grinds to a halt, and you can go back and say: your problem's
not convex; indeed, there's an x where the Hessian of f_3 doesn't have a
Cholesky factorization. You've proved it's nonconvex.
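A minimal sketch of that spot check, assuming a NumPy-style interface for the user-supplied
Hessian (the actual solver interface isn't shown in the talk, and the example function here is
hypothetical):

    import numpy as np

    def hessian_psd_at(hess, x):
        """Attempt a Cholesky factorization of the Hessian at x.

        For a convex function the Hessian is positive semidefinite
        everywhere, so a failed factorization at any x is a certificate
        of nonconvexity. A tiny diagonal shift covers the semidefinite
        boundary case, where strict Cholesky would also fail.
        """
        H = hess(x)
        try:
            np.linalg.cholesky(H + 1e-12 * np.eye(H.shape[0]))
            return True
        except np.linalg.LinAlgError:
            return False

    # Hypothetical example: f(x) = x1^2 - x2^2 is nonconvex; its (constant)
    # Hessian diag(2, -2) fails the factorization at any point.
    hess = lambda x: np.diag([2.0, -2.0])
    print(hessian_psd_at(hess, np.zeros(2)))  # False: proved nonconvex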
This is easy on the user because it's the traditional interface for optimization. You provide
methods for evaluating the functions, gradients and maybe Hessians. So it's the traditional
interface. So it's easy on the user.
But you really lose a lot of the benefits of convex optimization. One, of course, is that
you can no longer certify that the solution is global.
But the other is the reliability; you give up on that here. So, okay, now the second gross category
is this, and it's one that makes sort of sense. You would take the problem, described in some high
level language, some algebraic form, let's say. And you would write a tool that would scan that
high level problem description and attempt to prove that the
problem you have is convex.
Maybe transform it to some convex problem, something like that. So that sort of makes sense.
And you can make things like this. And actually we have, meaning with students we've
done this, maybe even a while ago.
Now, you can quickly find out that verifying a problem is convex is, in general, NP-hard. I can
easily write down something. So I'll mention how these things work typically:
basically you do interval computation with second derivatives.
That's all you're doing, interval arithmetic. To establish convexity, all you want to know
is that the second derivative is nonnegative, and you can do all sorts of
tricks with interval arithmetic.
And if you're able ultimately to establish that the second derivatives of the f_i are all in
[0, infinity), you win. By the way, there's at least one commercial version of this fielded right now.
This thing called solver.com did this.
Okay. I had a student who worked on this. This was his dissertation. He put it all together. And
so he comes into my office -- he had this very sophisticated thing where there were various bounds
on each of the functions, and once a tightened bound was
known for one variable, it would send a message to all the things that used that variable, and
they would tighten their bounds.
This would go on for a long time. So he came into my office. He wrote down this horrible thing with
hyperbolic cosines and logs and x and stuff like that, and he said: do you think that's convex? I'm
like: who the hell would know, right? He said it is. And it took 987
iterations of this method to prove that it's convex.
And we looked at this horrible expression. And we both said: that is really cool. And we sat
there in silence for like 30 seconds. Then we looked at
each other and we said: it's useless.
And we came to the conclusion that that's not what you want. Because even if you had a thing --
it's kind of cool; you can impress your friends by writing down horrible functions that are
convex.
But it's actually not the right way to do it, in my opinion. Well, to my thinking it's not the right
way to do it, because even if you know some horrible expression is convex, this is not
maintainable. If someone writes some code that solves a problem and it actually is
convex, they leave or whatever, somebody else comes in -- you have no idea why it's convex.
You change one constant from .1 to .3 and all of a sudden it's not convex anymore. You have no
idea. Some things, of course, you can change the sign of a coefficient and nothing happens, and
in other things it changes everything. The point being that that's not maintainable, if someone
writes stuff like that.
By the way, he was perfectly okay with this, the student. It's a lower bound result. You think of it
that way. Very useful. This is how it shouldn't be done or something like that. It's one of those.
Very useful, right?
Okay. Now, after much thinking about this, it seems to me that the way you
really want to do this is this: you really want to construct the problem as convex from the outset.
That's actually really what you want to do.
Because if you do that, and the way to do that is basically to follow a restricted set of rules and
methods. And I'll talk about these in a minute.
If you do this, then convexity verification is automatic, as we'll see transforming the problem to
one that is solvable by an interior point method is totally automatic. It's trivial.
So we'll see that. And the advantage of this is this is maintainable, because the intention of the
modeler is made explicit.
So this actually sort of makes sense to us. Okay. All right. So how can you tell if a problem is
convex? You need to check convexity of problems. We talked about this. You can use the
various basic definitions. But in fact the best way to do this is via convex calculus. So you
start with a library of basic atoms, examples that are convex, and then you construct more
complicated functions via various calculus rules. And there are lots of these.
In fact, this is, roughly, the two-bullet synopsis of what convex
analysis is.
Then there's lots and lots of rules. But it turns out a very small number will get you very, very far.
Let's start with some basic examples.
So there's some obvious ones. Powers are either convex or concave, depending on the
exponent. Exponential, negative log, negative entropy -- these are convex functions.
These are very easy to show. Any affine function is convex. Sum of squares is convex. Here's
maybe the first one that not a lot of people would know is convex. And it's not obvious until
somebody tells you. Well, okay, it's not hard, but someone has to tell you.
The sum of the squares divided by an independent positive variable: that's
convex, jointly in x and y. It's obviously convex in x, because for any fixed positive y it's a
multiple of the sum of the squares.
It's obviously convex in y, because that's a positive number divided by y, and one over y is
convex -- that's obvious. What's not obvious is that it's jointly convex in x and y.
So that's maybe the first example of a basic function that most people wouldn't recognize instantly
or immediately as convex: sum of squares divided by a positive variable.
A norm -- that's obvious from the triangle inequality. The max of a bunch of variables -- the max
function is convex. And then log-sum-exp is convex. That's sort of like a smooth version
of the max. But log-sum-exp is also convex.
And various people would know these things, or would know them but maybe not remember that
they know them, or something like that.
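Collected in symbols, the atoms just listed (a summary, not the slides verbatim):

    \frac{x^T x}{y} \ (y > 0, \text{ jointly convex in } x, y), \qquad
    \|x\| \ (\text{any norm}), \qquad
    \max_i x_i, \qquad
    \log\left(e^{x_1} + \cdots + e^{x_n}\right),

the last, log-sum-exp, being the smooth approximation of the max.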
Okay. Here would be some others. The log of the cumulative distribution function of a Gaussian is
concave. And actually it's interesting, because all sorts of classical inequalities come from these
facts. So, for example, if you write down the basic convexity inequality for this, you get one of
the -- I forget which named inequality -- you get an approximation and a bound on the Gaussian CDF.
Here's one -- maybe some people don't know this -- the log of the determinant of the inverse of a
positive definite matrix. That's convex in the matrix. So, again, it's not obvious, but these are
not hard to show or something. These
would be sort of some of the examples.
Calculus rules. There's zillions of them. But it turns out you can fit on one page the ones that
will get you 90 percent of the way. And actually this set is not minimal, because in fact all of
these follow from the last one. So in fact you could just give one rule. Nevertheless, these are
for explaining to people, not machines. What that means is that the code is actually much shorter
than the slides when you actually do things like this.
So some things are obvious. You have nonnegative scaling: if you scale something by a
nonnegative constant, it remains convex. The sum of convex functions is convex.
Affine composition: if F is convex and you precompose F with an affine mapping,
the result is convex. These are things that are very easy to verify. Point-wise maximum -- that's
maybe sort of the first obvious one.
Here's one, partial minimization, not totally obvious: if you have a function that is jointly
convex in two groups of variables and you minimize over one of them over a convex set, the
result -- partial minimization, minimizing F over some of the variables --
is a convex function of the remaining variables.
Now, most of these are very easy to show and things like that. Here's actually the real one: the
composition rule. If H is convex and increasing and F is convex, then H of F of X is convex.
And there are generalizations of that to multiple arguments and a few variations on it. But it
turns out that from this I can construct all of those easily.
You can get it all sorts of ways. Some are just completely trivial. For example, if F
down here is affine, then this one implies that one. It certainly gives you the
sum rule, because the function that takes the sum of two numbers is convex and nondecreasing. So
that works.
It gives you the point-wise maximum, because the maximum function is convex and nondecreasing in
all of its arguments. Therefore, the max of convex functions is convex.
This one you can also get this way, but I won't go through that. So there's actually -- there's
actually two rules, it turns out, that you really need to know. But it's useful to think of them
this way.
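Written out, the composition rule, with the one-line reason it works: if h is convex and
nondecreasing and f is convex, then for theta in [0, 1],

    h(f(\theta x + (1-\theta) y))
    \le h(\theta f(x) + (1-\theta) f(y))
    \le \theta\, h(f(x)) + (1-\theta)\, h(f(y)),

the first step using f's convexity plus h nondecreasing, the second using h's convexity. So
h(f(x)) is convex.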
Okay. So you can now construct examples. You can do things like this. You have affine
functions. They're obviously convex. You can have the max of a bunch of them and therefore
you get a piece-wise linear convex function like that, that's going to be convex.
We can look at the L1 regularized least squares cost. So this function is convex: that's affine;
the sum of squares of it is convex; that's convex -- it's a norm -- and as long as lambda is
positive, the sum makes the whole thing convex. Here's one: the sum of, for example, the seven
largest elements of a vector.
So it's a complicated function, but it's convex. No one is saying anything, so am I going too
fast? Too slow? Too slow. All right. Fine. So, by the way, how do you show this? What's a
quick way to show this?
>>: Max of --
>>: Max of all the functions.
>> Stephen Boyd: Max of a lot of functions. It's a maximum. So you consider the inner product. You
look at c transpose x, where c is a vector with -- let's go for 7 -- seven 1s and n minus seven 0s,
of which there are n choose seven, and the maximum over all of those n choose seven
is that function, the sum of the seven largest.
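A sketch of that atom in CVXPY, the Python descendant of the CVX system discussed later in the
talk (the data here is made up; sum_largest is CVXPY's built-in for exactly this function):

    import cvxpy as cp
    import numpy as np

    # Minimize the sum of the 7 largest entries of A @ x - b: convex,
    # since it is the max of c^T (A @ x - b) over all 0/1 vectors c
    # with exactly seven ones.
    np.random.seed(0)
    A, b = np.random.randn(30, 10), np.random.randn(30)
    x = cp.Variable(10)
    prob = cp.Problem(cp.Minimize(cp.sum_largest(A @ x - b, 7)))
    prob.solve()
    print(prob.value)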
Okay. So I won't go into any more -- oh, one interesting thing is that most of the interesting ones
are actually nondifferentiable. That's not a big deal. You can throw away your book that's filled
with gradients. Actually, don't throw it away; set it on a shelf, because we're going to go get it
back later in the talk.
Okay. So just a couple more examples. Maximum eigenvalue of a symmetric matrix; norm of a
general matrix. Here's one that's actually getting a lot of interest right now: the dual of the
spectral norm of a matrix. This is the sum of the singular values. For people
who don't know, this is to the rank of a matrix as the L1 norm is to the sparsity of a vector. So
what that means is that you can transpose all of the ideas from compressed sensing and sparse
stuff, where you regularize with L1 when you want something to be sparse.
If you have a problem where you're looking for low rank matrices,
this serves the purpose of the L1 norm.
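A sketch of that use in CVXPY (hypothetical matrix completion data; normNuc is the nuclear norm,
the sum of the singular values):

    import cvxpy as cp
    import numpy as np

    # Recover a low-rank matrix from a subset of entries by minimizing
    # the nuclear norm -- the matrix analog of l1 regularization.
    np.random.seed(0)
    M = np.random.randn(20, 3) @ np.random.randn(3, 20)   # rank 3
    mask = (np.random.rand(20, 20) < 0.4).astype(float)   # observed entries

    X = cp.Variable((20, 20))
    prob = cp.Problem(cp.Minimize(cp.normNuc(X)),
                      [cp.multiply(mask, X - M) == 0])
    prob.solve()
    print(np.linalg.matrix_rank(X.value, tol=1e-6))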
Here's one, the negative log probability that a point lands in a convex set. This is the yield. So
this basically says: I have a convex set; Z is N(0, Sigma). In fact, any log concave distribution
would work -- any distribution whose log density is concave, which is most of the distributions you
know about. In fact, it's very hard to think of one that's not. So, I don't know, you can go ahead
and try to think of one.
>>: I will.
>> Stephen Boyd: Sure, fine. No, really, no, no, sorry. It has to have a name.
>>: [inaudible] [laughter].
>> Stephen Boyd: A short name. A single name. [laughter].
>> Stephen Boyd: Come on, anybody can do that. Come on. Go ahead and try, you know. It's
a serious exercise.
>>: Y over 3.
>> Stephen Boyd: No, I think that might be okay. Wishart, all these fancy ones -- I actually don't
know the answer. There's probably one or two named
distributions whose density is not log concave.
So I don't know, somebody could look it up on Wikipedia or something. Anyway, if you have a
log concave density, then this is the yield function. It's the yield function because
you pick a point, Z is the manufacturing variation, and you want to know the
probability that when you set the target, the manufactured point lands in the set. So that's
this thing, and it is log concave, which means that minus the log of the probability that you're in
a convex set is a convex function.
There's actually a lot of interesting implications.
I'll tell you a couple right now that I won't talk about today. One would be this. It
says that pretty much anything that you can do with linear measurements -- estimation with linear
measurements, where you have a whole bunch of measurements that look like Ax plus v, where v is
Gaussian or your favorite log concave density -- you can do, for
example, with horribly quantized measurements.
It turns out all those problems are convex. So here's a good sensor: I'll give you the
sign of a transpose x plus b, that's all. That's a one-bit sensor. And it turns out I can estimate x
superbly from one-bit measurements.
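In symbols (a sketch; the exact measurement model isn't spelled out in the talk, so additive
Gaussian noise is assumed here): with one-bit measurements

    y_i = \operatorname{sign}(a_i^T x + v_i), \qquad v_i \sim \mathcal{N}(0, \sigma^2),

the log-likelihood is

    \ell(x) = \sum_i \log \Phi\!\left( y_i\, a_i^T x / \sigma \right),

which is concave in x precisely because log Phi, the log of the Gaussian CDF, is concave -- the
atom from a couple of slides back.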
And these are all convex problems. And that all follows from this, because the negative log
likelihood is then convex. So a lot of these things have serious implications. Here's one: it says
that if you form x transpose P inverse x, this is actually convex jointly in x and the matrix P,
provided the matrix is positive definite.
Is there a question? We can go off on tangents on any of these things
if you want. All right. So how do you solve a convex problem? Really the best
method is to use someone else's solver. By far.
That's fine. That's by far the best one. Now, the problem with that is your problem has to be in a
standard form. The good news of an arrangement like this is that the cost of
software development is amortized across many users, so people can spend all day long just
making solvers with the knowledge that thousands of people are going to use them. It all kind of
makes sense.
You can write your own custom solver, for your particular problem. And that's lots of
work, but you can get very large advantages, because
if you know the problems you're solving, you know the scalings, and there's also sparsity and
other structure you can exploit. You can often do really, really well.
But it's more trouble than the first option. That one requires you to basically type make. This one
is serious development. Now, there's something else -- the standard method, well, a method, is
to transform your problem into a standard form and then use a standard
solver. And you can think of this several ways. You can think of it as extending the reach of
standard solvers.
But this is actually a pain. We'll see that it is a pain in general to do this by hand. It's not
just cumbersome. And then we're going to look at methods that formalize this last approach.
Okay. Before I start, I should mention this. It's actually a good thing to know that there are
general convex optimization solvers that some of them are three lines long, literally. They work
always. They solve all convex problems.
The proof of that is about one paragraph. And there are even some that are actually efficient in
theory. So ellipsoid method would be your first one. Now you're up to five lines of code including
comments.
>>: Is it the same ellipsoid method? As I remember, you need very,
very good precision for fairly trivial problems, no?
>> Stephen Boyd: The problem is not with precision. The problem is it really takes as long as
the upper bound tells you. And so in practice these things don't work that well, even if they
really are polynomial time.
So in fact all of these are typically slow in practice. By the way, they do have the virtue that
the coding is not much; as I said, it's five lines. The interface required: all you need is
something that gets a subgradient for you. And a subgradient, unlike a gradient -- well, if you
can write down the problem, you can get a subgradient. That much I can assure
you, from subgradient calculus. These are nice to know.
Actually, they do have their uses. If you need to solve a small problem and you really don't
care -- if you don't mind going to lunch or something like that -- write the code, go to lunch,
and come back, and your answer might be there if your problem is not that big. Which beats doing
some huge code development thing, or finding something that works and then reading the terrible
documentation that someone wrote, something like that.
Okay. So it's actually just cool to know that these things exist. And they actually
do have their uses. For example, the subgradient methods: it turns out you can build a whole
theory and practice of distributed convex optimization based on these.
So they absolutely have their place. They're interesting to know
about. I'm not going to tell you about distributed methods here, but they
do have their places.
Okay. Then you have interior point convex optimization solvers. Now, people that work on these in
the modern era like to think everything started in the '90s. In fact, it goes back to the '60s.
For example, for LPs they discovered some Russian guy who was completely unknown in the Soviet
Union and unknown outside. And it was all there -- all the stuff from like 1995 was there.
There are even books written on this stuff in the 1960s. They didn't know everything. Oh, yeah,
Fiacco and McCormick -- do you know this book? Fiacco and McCormick. And I forget the name of
it. Do you want to name this thing? Sequential Unconstrained Minimization Techniques. That's
it.
So if you look at that, you'll find that they knew a shocking amount. I mean, some of it, you'll
see clearly what they don't know. They don't know, for example, how the choice of barrier works;
it's like some kind of voodoo type thing to them.
But the bottom line is they knew what interior point methods were, period, in 1969, and could
articulate it pretty well. They didn't have a complexity theory, but back then no one
thought to ask questions like: pardon me, but what's the largest number of steps it would take
your method to compute an epsilon-suboptimal solution? No one asked those questions, so it's
hardly fair to blame them for not answering them.
So seems reasonable, right?
>>: I'm more shocked that nobody asked the question than that they forgot the answer.
>> Stephen Boyd: It is weird, isn't it.
>>: I thought it was an older subject.
>> Stephen Boyd: No, the style --
>>: The order of --
>> Stephen Boyd: No, no, the order -- so the style of algorithm analysis in the '60s
would be to do things like talk about quadratic convergence, superlinear convergence and things
like that.
It was not to say I can solve this in n to the 3.5 times log 1 over epsilon steps or something like
that. So that was a shift that came later.
Seems weird, doesn't it? The most obvious question to ask is what's an upper bound on how
long it would take you to solve a problem.
>>: I don't understand the quadratic, because you get frustrated, you're sitting there watching an
iteration on the computer and it never seems to get the number --
>> Stephen Boyd: So quadratic convergence was a big deal. Definitely.
>>: [inaudible] come into this side?
>> Stephen Boyd: Absolutely. So the popularization in the West, outside of Moscow --
that was Karmarkar and Khachiyan, who used these things. And around that time a lot of
these things sort of started becoming known.
But a lot of them -- actually, not really the interior point methods so much, but the others -- all
trace back to Moscow State University in the '60s. That's where they all come from. And a couple
from Kiev, but that's pretty much it.
Okay. Now, these methods handle smooth functions, and also functions in conic form.
So these would be things like second-order cone programs and semidefinite programs. But they'll
also handle things like geometric programs and things like that.
And these are extremely efficient. Typically require a few tens of iterations almost independent of
problem type and size.
So that's accepted. It's an empirical fact, nothing else. It would be like the empirical fact that
simplex works really well. Right?
And it's now been verified across zillions and zillions of problems, scaling all the
way to immense sizes and things like that. Now, each iteration involves solving a set of linear
equations. In fact, these are just least squares problems with the same size and structure as the
original problem. So now you have a very interesting interpretation.
Computationally, if you were to profile a convex optimization solver,
here's exactly what you would find it doing: it will solve 25 least squares problems, period. Maybe
it's 50 in one case and maybe 10 in another, but roughly, a few tens of least squares
problems. Period. That's it. There would be other calculations, but they don't matter.
That actually has very important implications. It says that if you have a field where, at large
scale, you have a method to solve the least squares problem -- for example, in medical imaging,
where you have some fast transform and you use some preconditioned conjugate gradient, something
like this -- all of that transposes. It means you can solve convex problems in 20 of those steps.
That's what it means.
Actually, some of it is faster. I'll talk about it later. So there would be absolutely no point in
not using an interior point solver if you can.
So I mean there are other methods that are neither of these and so on. But they each have sort
of one -- you can find a place where it would be better than something else. But these just seem
to work.
Okay. So here are some solvers. This is what Google is for, so there's no point really listing
these. But there are lots for LPs and QPs, and then different groups use different things. For
sort of general purpose ones in MATLAB there would be SeDuMi and SDPT3 -- cone solvers.
There are ones written in C that are open source. And each of these might have some specialty,
like solving low rank SDPs or this or that, I don't know. There are commercial ones;
a very high quality one is MOSEK. Solver.com actually has an LP and SOCP solver.
>>: [inaudible].
>> Stephen Boyd: That's Frontline Systems, yes, exactly.
Other ones would be -- there's CVXOPT. This is by Lieven Vandenberghe at UCLA and some
others. That's Python and C. Open source. Untainted all the way through,
for those of us who are in the GNU camp -- that includes me. So it's untainted.
Some of the others are not. There's
lots of these.
Let me talk a little bit about how these transformations work. Here you have a
problem with an L1 norm -- obviously not differentiable. How do you solve this? This, of course,
would come up in lasso or basis pursuit or whichever one you want to call
these things.
It's a convex problem, but it's not differentiable, so you can't directly use an interior point
method. The basic idea is to transform it so you can use an interior point method. The way you do
that is something like this.
You start with these n variables and you introduce a new set of n variables,
and these serve as upper bounds on the absolute values of the x_i.
You simply write out a new problem that looks like this. The objective is now smooth, quadratic,
and there are now 2n inequality constraints. And you have to actually sit down and
show that these are equivalent in the following sense:
if you solve that, you've actually solved this by a simple transformation -- in this case, ignoring
t. And if you solve this one, you can solve that one, simply by setting t_i equal to the absolute
value of x_i.
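Written out, the transformation, for the l1-regularized least squares case he describes:

    \min_x \ \|Ax - b\|_2^2 + \lambda \|x\|_1
    \quad\Longleftrightarrow\quad
    \min_{x, t} \ \|Ax - b\|_2^2 + \lambda \mathbf{1}^T t
    \quad \text{subject to} \quad -t_i \le x_i \le t_i,

with t_i = |x_i| at any solution, so dropping t recovers the original solution.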
They're the same. You might say, oh my God, you started with a problem with n variables and zero
constraints and now you have a problem with 2n variables and 2n constraints. And, in fact,
you might imagine that you've done a terrible thing. But it turns out that if the
linear algebra is smart enough to handle this, you'll do unbelievably well. There will be
absolutely no loss whatsoever, even though the number of variables doubled and the number of
constraints went from 0 to 2n.
The reason is kind of obvious. The sparsity pattern of the constraints is extremely sparse. For
example, t_3 only involves x_3, and x_3 only involves t_3. Period.
So when you visualize the sparsity pattern, if you interleave the variables,
you get tiny blocks. This, of course, will run in linear time in the linear algebra.
So you can extend this idea. People know these tricks. These transformation
tricks have been known since the 1950s; they're in Dantzig's book on linear programming.
So when you take a traditional LP course, you learn how to solve not L1 but maybe an L-infinity
problem or something like that. So it's just these tricks.
You learn them and you use them and stuff like that. Some of the tricks are not obvious. Okay.
So here's the idea. You start with a convex optimization problem, and what you do is carry out
a sequence of equivalence transformations. This is, by the way, kind of the same thing
a compiler would do.
The idea is you start with a description and then you carry out transformations where at each step
you can prove or you know that they're equivalent.
So you carry out this, these are like you're doing transformations until you arrive at a target. And
the target is one that can be handled by your solver. So that's kind of the idea.
Then, of course, once you solve this problem, when you do this reduction here, it also comes with
a method that will transform the solution backwards.
So that's kind of obvious. So you do this. Now, when you do things like this -- and we've done a
bunch now -- the first idea is that you use methods like this to give you rapid
prototyping of problems.
Because you type in a baby problem with 100 variables and the next thing you know there are a
thousand. You might think it's fine for rapid prototyping but too
slow. It's not slow at all if you know what you're doing. It's amazing: no slowdown at all. A
problem with a thousand variables is huge, but it's sparse.
It's the kind of thing that a sparse matrix method dreams of getting its hands on,
because it's got little tiny blocks and long tendrils and things like that. It's just perfect. So,
actually, amazingly, these methods work; they're actually efficient. Okay. So there's a connection
between those calculus rules and these transformations -- in fact, more than a connection. They're
the same thing.
So here's the way it works. You know that the max of two convex functions is convex. Well,
there's actually a rule that goes like this. We have expression
trees for all the constraints and the objective -- we've parsed it -- and we do
the following.
When you see a node that looks like that, you simply replace it with a new variable t and you put
these two inequalities in. If you see something which is a convex increasing function of a convex
function, which we know is convex, you simply take this -- that's a node on the tree, and all
you're doing is composition -- and what you do is this: you replace it with h of t and you add a
new constraint that looks like this.
Now, these look completely trivial. They're not. Because what happens is this: you could do this
whether or not the rules were correct.
However, these transformations give you an equivalent problem if and only if the
assumptions of convex analysis hold.
For example, here you could do this whether or not h is increasing. If h is increasing,
then you can prove that the new problem you've generated is equivalent. If h is not
increasing, it's not.
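In symbols, the two node-rewriting rules just described:

    \max(f_1, f_2) \ \longrightarrow\ t, \ \text{adding } f_1 \le t, \ f_2 \le t;
    \qquad
    h(f(x)) \ \longrightarrow\ h(t), \ \text{adding } f(x) \le t,

where the second rewrite is equivalent to the original exactly when h is nondecreasing -- the
composition rule's assumption.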
Including all the fine details that you have to handle. So I don't know if that made sense, but
that's the idea.
So there's a connection between these. All right. It's kind of obvious, right? You
have expression trees that describe the f's here -- a high level algebraic expression that
describes the problem. If somebody comes to you and says, how do you know it's convex, you
go: that's affine and that's convex. Really? I've got to speed up. I have to speed up a lot. Fine.
We'll go faster.
It turns out the same argument, if you take the parse tree and annotate it and give the argument,
give the proof that it's convex, you could also automatically generate the transformation.
Okay. This brings us to disciplined convex programming. Here you specify a convex
problem in natural form, and you follow a limited set of rules. I'll show you what those are.
You can implement this several ways. This is one:
CVX, written on top of MATLAB. And there are others that run under Python, and a couple of
others, C++ and stuff like that. Here's an example. Here's a problem you want to solve -- who
knows why. A machine learning problem with sparsity-inducing regularization and inequality
constraints, and this would be the executable source code in CVX. CVX would just parse that.
Let's see. What's that? What if lambda were negative here when CVX parses it? What would happen?
that? What if lambda were negative here in CVX parse? What would happen?
>>: I think it should fail.
>> Stephen Boyd: It will fail. Actually, it will fail even if this expression --
it might be that for lambda equals minus 0.1, this is convex. Quite possibly. Doesn't matter.
This will fail because it violates a rule. This would be concave plus convex. And it would give
you back something that says: disciplined convex programming violation; you can't add a convex
and a concave function. That's what would happen.
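The same behavior can be sketched in CVXPY (the CVX code on the slide isn't reproduced in the
transcript; sizes and data here are made up):

    import cvxpy as cp
    import numpy as np

    np.random.seed(0)
    A, b = np.random.randn(50, 20), np.random.randn(50)
    x = cp.Variable(20)
    lam = -0.1  # negative weight: convex term plus concave term
    obj = cp.Minimize(cp.sum_squares(A @ x - b) + lam * cp.norm1(x))
    prob = cp.Problem(obj, [x >= 0])

    print(prob.is_dcp())  # False: a rule violation, not a proof of nonconvexity
    prob.solve()          # raises cp.error.DCPError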
Not the same as saying it's not convex. It says you failed to follow the rules. So that's simple
enough. Here's, by the way, what this problem looks like if you solve it
traditionally.
Traditionally, you would get a cone solver like SeDuMi or something like that. You
would fill in a whole bunch of matrices, things like that, call the cone solver and pull out your
variables. We did this for years and years. This is how we did it.
We didn't do it, actually. We made grad students do it. That's how this was done.
Because you take one look at that and you realize that's not appropriate for
a professor to do, really.
But, actually the boundaries have changed, right? Because you see this? That's fine.
Professors can do that. So it's really changing the boundaries between -- okay. All right.
So I want to speed up, so I won't go into the history of modeling languages for optimization.
It has a long history; it goes back to the '70s. And there are lots of them.
They vary in the amount of problem structure that they assume and in the whole idea of
the method. Is it supposed to be a best-effort method, where anything you type in does
something? Or is it supposed to be like a strongly typed method, where it's very rigid, but if you
follow the rules you're guaranteed it will work.
So there's variations on it. I should also say that it's actually changed things like the course
that I teach on this subject, dramatically. It was already a big class, but I taught it at one
point when homework involved something like that. You know, that's a pain in the ass, basically.
You would very carefully think about numerical problems and things like that, when in fact you
can write three lines and do a lot of stuff. You can do machine learning, you can do finance,
signal processing, control and network optimization. When it comes down to writing like five lines
of code, it just frees you completely, at least when you're teaching.
And it's a lot of fun now. Because in the class we do the theory, but you don't leave that class
not having talked about support vector machines, machine learning,
portfolio optimization. You've done all of it.
And not only that, it was shockingly small things. So it's actually just been a lot of fun. It changed
the research, too.
So we'll do some examples, but maybe in the interests of time I will skip over one of them and
we'll go just to this one, because it's fun.
Because I think a lot of you are in machine learning and stuff like that. Good. So this one has
a quiz. You'd better listen. In fact, it is a quiz. So here it is.
I have a random variable in R2. It's got normal (0,1) marginals in x and y.
And its 2nd, 3rd and 4th moments are going to match a Gaussian. The expected value of xy is 0,
meaning x and y are uncorrelated -- not independent, just uncorrelated -- and whatever the third
and fourth moments are, those match the Gaussian.
The question is: how small can the probability that both x and y are negative be? If x and
y were jointly normal, the answer would be a quarter, because it's the third quadrant.
So I'm waiting.
>>: Must be a really tight number or you wouldn't have asked the question.
>> Stephen Boyd: Okay. Tiny. You want to put a number on it?
>>: 10 to the minus 6. Otherwise you wouldn't ask it.
>> Stephen Boyd: No, no, no. That's old style. No, no. Because then if it were 10 to the minus 6
it would be numerical error and you would have to figure out --
>>: Something small.
>> Stephen Boyd: All right. Any other guesses?
>>: 0.
>> Stephen Boyd: That's indeed small.
>>: The mean has to be -- the mean has to be small.
>> Stephen Boyd: Oh, yeah. Yeah. I guess the point is it's not obvious. I've asked a lot of
probabilists all over; some get shockingly close. But no, it's not obvious. You can make an
argument either way.
So you can write this as an LP -- you can discretize it. And if you want to be careful and
get provable upper and lower bounds, you can discretize it carefully. I didn't do that here, but
you could. You would write it out as a giant LP after discretizing. These are the marginals.
These are the second order moments, the cross moments. The x_i
squared terms come for free because the marginals are Gaussian. So you don't even have to do
those.
Then these are the third moments. They work out. Okay. So here's the source code. It would
be a huge pain in the ass, by the way, to write this out as an LP by hand.
Most graduate students would just say no when they looked at the problem and
realized what was involved and all that kind of stuff. Anybody that's sensible would say no.
Written out this way, it's absolutely nothing.
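A sketch of the discretized LP in CVXPY (a crude grid with no provable bounds, unlike the careful
version he mentions; the grid size and range here are arbitrary choices):

    import cvxpy as cp
    import numpy as np
    from scipy.stats import norm

    n = 60
    g = np.linspace(-4, 4, n)
    w = norm.pdf(g); w /= w.sum()            # discretized N(0,1) marginal
    X, Y = np.meshgrid(g, g, indexing="ij")  # grid of (x, y) points

    p = cp.Variable((n, n), nonneg=True)     # pmf on the grid
    moment = lambda f: cp.sum(cp.multiply(f, p))
    cons = [cp.sum(p, axis=1) == w,          # marginal of x
            cp.sum(p, axis=0) == w,          # marginal of y
            moment(X * Y) == 0,              # match Gaussian cross moments:
            moment(X**2 * Y) == 0,           # 2nd, 3rd and 4th order
            moment(X * Y**2) == 0,
            moment(X**3 * Y) == 0,
            moment(X * Y**3) == 0,
            moment(X**2 * Y**2) == 1]
    third_quadrant = ((X < 0) & (Y < 0)).astype(float)
    prob = cp.Problem(cp.Minimize(moment(third_quadrant)), cons)
    prob.solve()
    print(prob.value)   # roughly 0.06, per the talk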
Okay. So here's the Gaussian. And here's the answer. It's 6%. That's the answer. So, you
know, who knew. I've had people say anything from 0 to like .24. And you can't go above a
quarter, because a Gaussian gives you a quarter.
Okay. What's that?
>>: [inaudible].
>> Stephen Boyd: And here's the distribution that does it. Now, it's discretized. But it's weird.
You could not say: I was about to try that distribution. I mean, there's no way.
And, yeah, you look at it, and it does just the right thing. It puts some sick little mass in
here. The other thing, if you look at it, you can see that the solution is probably a measure
supported on some weird -- if we're lucky -- one-dimensional set. For all I know it could be some
sick Cantor-like set. I don't even know. The point is, with the number .06 we're comfortable. If
we bound it, we can get the right thing and so on.
So that was just how much fun you can have with five lines of code. Okay. So I think I'm
going to move on quickly -- and actually I might be okay if I go quickly. I'll talk about interior
point methods.
The worst case complexity theory says that the number of steps grows no faster than the
square root of the problem size. Those are the best bounds. The number of steps, I already
mentioned, is between 10 and 50. If you want to be super safe you can go up to 80,
something like that. But that's what it is.
And this appears, by the way, to persist. This is certainly true for problems up to a million
variables. It's independent of what kind of problem it is. It could be from machine learning,
finance, circuit design, network control. Doesn't make any difference. Always the same.
There's actually someone, Jacek Gondzio in Edinburgh, who recently came and gave a talk at
Stanford. And he solved a dense LP with a billion variables. We're like, wow, what did you
use? He goes: same thing everyone uses, primal-dual. It's the homogeneous self-dual
embedding, the same one. It took two days to solve the LP, something
in finance where they expanded a full tree and all that stuff.
And he said it took two days. And I said, how many iterations was it? He said 21. The point is
that the property of these taking 20 steps extends to problems with a billion variables. But each
iteration, by the way, was quite expensive.
But, okay, so each step requires the solution of a set of positive definite linear equations,
which is like solving a least squares problem; in fact it's a Newton system.
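In symbols: each iteration solves

    H \, \Delta x = -g, \qquad H \succ 0,

and a positive definite system like this is exactly the normal-equations form of a least squares
problem, which is why any fast least squares machinery carries over.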
And you have three gross categories: dense direct, sparse direct, and iterative methods. The truth
is in practice they're not distinct. If you do a direct sparse factorization and two steps of
iterative refinement, someone could say you're using an iterative method. And iterative methods
rely on preconditioners that often use direct sparse factorizations. But grossly these are the
three main categories of linear equation solving.
So the dense direct ones -- this is your LAPACK type thing. And these are just rock solid.
There's almost zero variance in how these things work.
Sparse direct: here the runtime depends on the sparsity pattern, but not on the data,
provided these are positive definite systems -- that's why. Because
you can do symbolic preordering, you don't have to do runtime pivoting. So that's what this is
based on.
But this actually requires a good heuristic for the ordering.
By the way, this undercuts the whole thing about convex optimization, if you think about it. Next
time you see someone doing convex optimization, you say, what are you doing? I'm solving a
convex problem. They're all high and mighty. They say: I get the global solution. It's
nonheuristic and everything. And then you say: how do you solve your linear equations? They'll be
using a sparse solver.
And technically that's a heuristic. So sparse solvers work only by the grace of whatever
god or gods or goddesses handle heuristics for sparse matrix orderings. The gods of heuristic
ordering.
And that can really get people irritated, by the way, if you say that. Because you say: no, you're
using a heuristic method. They'll say: I am not. I made approximations so I could make this
convex. This is the global solution. I don't doubt that, but the fact that it runs in 10
minutes and not a year is basically because of these sparse matrix factorizations.
Then you move into the scientific computing community: iterative methods. All bets are off.
Runtime depends on the data, the size, the sparsity. They require tuning and preconditioning. So
these are certainly not general methods by any means. They're just not. On the other hand,
they're the only way you're going to solve problems with 10 million or 100 million variables.
I'll skip over this. I'll just say I think everyone knows about conjugate gradient methods anyway.
These are iterative methods for solving large positive definite equations, large least squares
problems.
And the interesting thing for us is not the theory of them or something like that. It's that
you sometimes get an awfully good solution in a shockingly small number of steps. This depends
on the spectrum of the operator involved.
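A minimal sketch of plain conjugate gradients (textbook CG, not any particular solver's
implementation), written matrix-free, since the point above is that a fast multiply -- a transform
or a sparse matrix -- is all you need:

    import numpy as np

    def cg(A_mul, b, tol=1e-8, max_iter=None):
        """Conjugate gradients for A x = b, with A symmetric positive
        definite and available only through the product A_mul(v) = A @ v."""
        x = np.zeros_like(b)
        r = b - A_mul(x)               # residual
        p = r.copy()                   # search direction
        rs = r @ r
        for _ in range(max_iter or len(b)):
            Ap = A_mul(p)
            alpha = rs / (p @ Ap)      # exact line search step
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs) * p  # conjugate new direction
            rs = rs_new
        return x

    # Hypothetical usage: solve (M^T M + I) x = b without forming the matrix.
    M = np.random.randn(500, 200)
    A_mul = lambda v: M.T @ (M @ v) + v
    x = cg(A_mul, np.random.randn(200))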
So I'll skip over this and just say that if you take an interior point method and, instead
of solving for the search direction using a direct method, you use an iterative method, there are
lots of names for it: limited memory Newton -- it turns out it's very close to BFGS -- or truncated
Newton. The total effort here is measured by the equivalent of CG steps.
These are not general purpose. The grad students are back in business -- not writing code to
stuff entries into giant matrices, but back in business working out things like good
preconditioners and getting things up and running for problems.
The nice part is that in an interior point method, you really couldn't care less about solving for
the search directions exactly anyway, because all you want to do is get to the solution. So it's
totally irrelevant. Nobody cares about solving it exactly or anything like that.
You want to preserve this idea that it takes 20 steps or something like that to get there.
I'll give a quick example and we'll move on to the last topic. This is just L1-regularized logistic
regression.
So you have a whole bunch of data, x_i, with binary labels b_i. You want to fit a logistic
model to them, something like that. And you add an L1 term here. And what this
will do, of course, is that if you crank lambda up and down, you will get a sparser and sparser
solution out here.
The nice thing about this one is you can actually do things like this. This is sensible. Plain
logistic regression is utterly nonsensical if, for example, the number of examples is smaller than
the number of features.
Here, you can do this for 50 examples and a million features. No problem. But what it will do
is it will select obviously fewer than 50 features -- it will select, I don't know, 10 -- and
the idea is that that's a heuristic that gets you out of trying all one-million-choose-10 sets of
10 features. It does feature selection for you.
So that's the problem. You can write a sparse direct solver for it, and it works exactly as
advertised. They all look like this. Takes 30 steps or whatever to get you a perfectly good
solution.
This shows you how much it depends on the value of lambda -- and direct methods, as I said, are
essentially independent of the data, and that's true:
instead of taking 32 steps it takes 33, or something like that. I guess these
are just two examples; each of these has a few thousand variables and constraints, and these
are solved relatively fast.
We'll go to a much bigger problem, with maybe, I don't know, three-quarters of a million features
and 11,000 examples -- here's an example where you have more features than you have examples.
And you have about five million nonzeros in the data. The final interior point method
problem has about a million and a half variables and so on. These are beyond the capability
of direct methods. With a relatively simple preconditioner for the Newton system, you can solve
this in a couple of minutes, and this is what it would look like.
Now, here you see everything you'd expect from one of these methods. The first thing you see
is that the actual time, measured by cumulative PCG iterations, is data dependent. As a
practical matter that's the one you're interested in, because the sparser the final selected features
are, the faster it is. So it turns out you're really only interested in that.
>>: You don't care what --
>> Stephen Boyd: Of course you don't. No, you could go out here -- in fact you could have
stopped out here, I agree completely. Because the thing is a heuristic for actually making a
classifier. If you judged it on a validation set, you would find that if you stopped probably right here,
those weights would be just fine. If you truncated the ones that were clearly going to zero to zero, I'm
sure it would work fine.
All right. These are kind of obvious. But these methods will take you up to the ten-million-variable
problems here. Pretty straightforward. By the way, there are first order methods that
people have now developed for these, and they're also quite good --
almost linear, same as these. Actually, that's sort of my
theory: if you have a specific problem, a custom
interior point method will be the first thing to get really fast. Then the crowds come in and
they end up with some simple first order, variable-at-a-time method, and with a lot of tuning those
things work. That's happened for a lot of L1-regularized problems, and they're perfectly good
methods. But you just have to make a variation on the problem -- add some linear inequalities -- and
those other methods don't work.
So the summary here is just that there are really these three regimes of problem solving.
I'm not talking about distributed solvers -- we didn't discuss those -- but just a traditional
solver on one machine; multi-core doesn't matter, but one
machine.
There are these three regimes: the small, the medium, and the super big.
The last thing I want to talk about is going to be pretty brief. I find it really
interesting, because I have no idea what it can be used for, but I know it's just
really cool. And it's this.
So let's imagine you're going to solve a specific problem. To fix the idea,
let's take a modest problem, a couple hundred variables: a portfolio optimization problem, an optimal
execution problem in finance, a problem in control, or network flow. Let's just make it network
flow. You have a network, and you want to decide the flow rates of 100 flows passing over 300 links.
I mean, that's way beyond hand design. No one just messing around, not trained in optimization, could
come up with something even remotely close to optimal.
If you're making a custom solver you can exploit the structure efficiently. And the point is you can
actually do this at code time -- code generation time -- not at runtime.
Solvers now generally work like this: the solver reads in the data, takes some time to analyze the
sparsity, and uses whatever method it uses to generate some permutations.
At that point it knows how bad the fill-in is going to be. It allocates some memory for it -- maybe
some extra memory for dynamic pivoting and things like that -- and then it starts solving the problem.
That's how a solver works now, and that's great. That's how you want a general purpose solver to
work. But here I'm talking about something where you're going to solve one specific problem over
and over: you want to optimize the flows on this network, whose topology is known ahead of time.
You can spend hours figuring out good permutations. You can also determine all
the orderings and do the memory allocation ahead of time.
You can move things around in memory for nice locality of reference and all that. You can do
crazy stuff. You can also cut corners in the algorithm.
So, as you just pointed out, if you write a general purpose solver, you'd better make it get six or
eight digits of accuracy, because you don't know what people are going to use it for, right? Maybe
it's finance or who knows what -- something where six or eight
digits actually have meaning. That probably means the problem was scaled wrong, but that's another
story.
You just don't know. But once you have a particular problem, it's very rare that the second
digit matters, and in a properly scaled problem the third digit is utterly inconsequential in all practical
problems, as far as I know. So once you have a specific problem
you can terminate way early.
You can use warm start. If you have something in a real time embedded setting
where you solve one problem and then another and another, it's often the case that these
problems are related -- they're very close -- and you can use warm start.
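As a small illustration of the warm-start idea, here is a sketch in Python with CVXPY, solving a sequence of closely related problems where only a parameter drifts. The problem itself is made up; CVXPY's warm_start flag is real, though how much it helps depends on the underlying solver, and the generated solvers discussed here do this at a much lower level:

```python
import numpy as np
import cvxpy as cp

# a drifting-parameter problem standing in for a real time loop
# where successive problems are nearly identical
n = 100
x = cp.Variable(n)
target = cp.Parameter(n)
problem = cp.Problem(
    cp.Minimize(cp.sum_squares(x - target) + 0.1 * cp.norm1(x)),
    [cp.sum(x) == 1, x >= -1],
)

rng = np.random.default_rng(0)
t = np.zeros(n)
for step in range(50):
    t += 0.01 * rng.standard_normal(n)   # the data drifts slightly
    target.value = t
    # warm_start reuses the previous solution as the starting point;
    # how much it helps depends on the underlying solver
    problem.solve(warm_start=True)
```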
Now, if you put all these tricks together, you can end up with a very fast solver, and this basically
opens up the possibility of real time embedded convex optimization. So this is this, and I'll look
at some quick examples here. Here's grasp force optimization. This is for a robot: you have a
rigid body, there are fingers grasping it, and you have to decide the forces on these fingers.
They have to resist a given wrench -- a force and torque on the body. That's
these six equations. You also have to satisfy the friction constraints -- except I forgot to put
in the coefficient of friction; it goes in one of these two places.
Doesn't matter. You can say it's horrible, it's nondifferentiable -- that's fine, which it is -- but we're
past that stage. We say it's an SOCP.
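A hedged sketch of the kind of SOCP this is, in Python with CVXPY. The geometry here -- contact points p, contact-frame rotations R, the external wrench, and the min-max-force objective -- is invented stand-in data, so a random instance may not even be feasible; it is only meant to show the friction-cone and force/torque-balance structure:

```python
import numpy as np
import cvxpy as cp

def skew(v):
    """Matrix S with S @ f == np.cross(v, f)."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

# invented geometry: K contacts, points p[k] on the body, rotations
# R[k] from local contact frame (z = inward normal) to the body
# frame, and an external wrench (force, torque) to resist
K, mu = 4, 0.5
rng = np.random.default_rng(0)
p = rng.standard_normal((K, 3))
R = [np.linalg.qr(rng.standard_normal((3, 3)))[0] for _ in range(K)]
w_ext = rng.standard_normal(6)

f = cp.Variable((K, 3))                 # forces in local contact frames

force = sum(R[k] @ f[k] for k in range(K))
torque = sum(skew(p[k]) @ (R[k] @ f[k]) for k in range(K))
constraints = [force + w_ext[:3] == 0,  # the six balance equations
               torque + w_ext[3:] == 0]
# friction cones, with the coefficient of friction in place:
# tangential force bounded by mu times the normal force
constraints += [cp.norm(f[k][:2]) <= mu * f[k][2] for k in range(K)]

# one common objective choice: minimize the largest contact force
objective = cp.Minimize(cp.max(cp.hstack([cp.norm(f[k]) for k in range(K)])))
cp.Problem(objective, constraints).solve()  # random data may be infeasible
```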
And in this case, for example, if you exploit the sparsity pattern and various other things, and you
have a custom method for calculating a dual point and all of that, you can solve these
problems in around 80 microseconds. By the way, 80 microseconds is a joke compared to
one of these general solvers: in that time they've just
pulled in the problem. They're just waking up and thinking about finding an ordering and
all that kind of stuff, and this one is done.
Okay. So 300 microseconds is cold start. That's just from nothing. That's just new data. Never
saw it before. You get the answer.
So we've done this now on a bunch of problems. Another one is model predictive control. I don't
know if people know about this -- do people know about this?
It's very, very cool, and it's widely used. So you have a stochastic control problem; let's make it
linear for simplicity. It could be a supply chain problem -- let's do supply chain.
So you have a bunch of product and a bunch of nodes on a graph. At each step you can ship
things between nodes at certain costs, and you can even pull some in from a factory or something like
that.
You have demands at various nodes -- things that get pulled out. There are different models of it. One would
be, for example, that you actually allow the stock at a certain point to go negative; that would be a
back order. You can play all sorts of games.
And what you know is the distribution of the demands -- it's statistical. So the question
now is: how do you operate a supply chain in real time this way, knowing your stock everywhere?
These are very complicated stochastic control problems. There are some
amazingly good heuristics; one is MPC. It works like this. You plan over a horizon -- you just take,
say, 20 steps into the future.
Now, in the future, of course, you don't know the demands. So you go to an expert and ask
them to guess the demands, or you can use the conditional mean based on what you know so
far -- it doesn't even matter much.
You take that, pretend it's exactly what's going to happen, and work out a complete
planning trajectory of how you'd ship things across the nodes to minimize shipping costs,
warehousing costs, and the costs associated with back ordering and all that.
So you work out a whole plan, and then you execute only the first step. It's got lots of names: it's
also called rolling horizon planning, or dynamic linear programming.
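Here is a toy closed-loop sketch of that recipe in Python with CVXPY, for a single-node inventory problem. All names, costs, and the Poisson demand model are invented for illustration; a real supply chain MPC would have a network of nodes, and the real time versions discussed here use custom generated solvers rather than a modeling tool:

```python
import numpy as np
import cvxpy as cp

# toy single-node inventory MPC; all names and costs are invented
T = 20                                   # planning horizon
order_cost, hold_cost, back_cost = 1.0, 0.2, 2.0

def mpc_order(s0, demand_forecast):
    """Plan T steps ahead against a demand forecast; return only
    the first order -- the essence of the rolling horizon idea."""
    s = cp.Variable(T + 1)               # stock (negative = back order)
    u = cp.Variable(T, nonneg=True)      # orders placed each step
    cons = [s[0] == s0]
    cons += [s[t + 1] == s[t] + u[t] - demand_forecast[t] for t in range(T)]
    cost = cp.sum(order_cost * u
                  + hold_cost * cp.pos(s[1:])      # warehousing
                  + back_cost * cp.pos(-s[1:]))    # back ordering
    cp.Problem(cp.Minimize(cost), cons).solve()
    return float(u.value[0])

# closed loop: the realized demand differs from the forecast each step
rng = np.random.default_rng(0)
stock = 5.0
for step in range(50):
    forecast = np.full(T, 3.0)           # e.g. the conditional mean
    stock += mpc_order(stock, forecast) - rng.poisson(3.0)
```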
It works unbelievably well, and it is now universally used in chemical process engineering -- that
happened in the '80s and '90s. So that's how that works.
The reason it started there is that in chemical processes the dynamics
are very slow, and you have 15 minutes or an hour to make your new decisions.
Things are just going very, very slowly, so no problem: you can solve a 50,000-variable LP
quite reliably in that amount of time. So most people in control assumed, oh, that's a numerical
method -- I work on jet fighters, or
I make servos for disk drives; I make my decision not every hour but
every 50 microseconds, so this will never apply to me. Anyway, that's not right. And these are
actually just very old numbers here.
So this is just model predictive control. To point to one here, this would be the type of thing
people would actually use: a QP with 50 variables and 160 constraints. If you
exploit all the structure, you can solve it in something like 300 microseconds -- actually these
numbers are off; they're now down to something like 150 -- on something like a 2-gigahertz processor.
The typical speedups are on the order of 1,000 to 1, and that's against already extremely good
interior point methods. But it's not fair, right? Because the custom solvers
have solved a thousand of these problems by the time the general ones are
just waking up: allocating memory, loading, starting to have a discussion about what the ordering of
the variables should be. It's of
course not a fair comparison.
So, actually -- this I think is really interesting. This is what I'm most interested in right
now, because of this whole regime in optimization. Everyone thinks of optimization as either
solving giant problems, or solving things with a human in the loop, on a
human time scale, where somebody with a spreadsheet types in some what-if and hits optimize,
and all of a sudden something happens. It has to happen in a second
or two. Working out the scheduling for United Airlines for tomorrow -- that's what people expect.
What I haven't seen is a lot of interest or any focus on solving convex optimization problems in
microseconds and milliseconds, which now we know is absolutely possible.
I don't know what to do with it. So if anybody here has ideas -- like, for example,
suppose you could knock off an
L1-regularized logistic regression, updating the weights every two milliseconds,
for some modest-size problem. I'm trying to think -- I don't know what to do with that.
But I'm sure there's going to be something cool you can do with that. That much I'm confident of.
So, okay, maybe I'll quit here. The references -- that's what Google is for, so I won't say anything
about that. And I will quit here.
[applause].
>>: [inaudible].
>> Stephen Boyd: Yeah.
>>: You mentioned a dense linear optimization problem with a billion variables there. But how many exabytes did you
need for memory?
>> Stephen Boyd: It was huge. He used some giant machine, like a Blue Gene, one of these huge things.
Each iteration took -- I think he told me -- something like two hours.
>>: Memory --
>> Stephen Boyd: It was what?
>>: Two days.
>> Stephen Boyd: So we can do the arithmetic. It was a couple of hours. Two hours for each
iteration.
>>: [inaudible].
>> Stephen Boyd: But it was a room. He showed a picture of it in his slide; you could see way off
in the distance there were machines there. So... yeah, it wasn't on --
>>: The world's largest dense matrix --
>> Stephen Boyd: He says it's not. But it was pretty big. You can find that out. It's Gondzio -- just
type Jacek Gondzio in and I'm sure you'll find it.
>>: At the end of October, you're still ahead of the curve? You don't have a financial crisis?
[laughter].
>> Stephen Boyd: Sure. Sure. I can tell you this, actually. A lot of the stuff in convex
optimization is used for robust portfolio optimization. So I actually think that the status of the
people doing what we're talking about -- robust portfolio optimization -- has risen substantially. Because before
that you'd hear people -- I talked to people who were actually in the trenches fighting other
people: my method, your method.
>>: No matter what your algorithm was, if you don't have the right data -- they were calculating
things on portfolios using three-year-old data for the characteristics.
>> Stephen Boyd: Good point.
[applause]