>> Lin Xiao: It's a great pleasure to welcome Professor Stephen Boyd from Stanford University. Stephen is the Samsung Professor of Engineering and Professor of Electrical Engineering in the Information Systems Laboratory at Stanford. He has done very influential work on convex optimization and pioneered convex optimization in many areas such as control systems, signal processing and circuit design. Okay. Today he's going to talk about some recent advances in convex optimization. Stephen. >> Stephen Boyd: Thanks very much. I gather I'm finishing off Stanford week here or something like that. We didn't plan it. But anyway, I'm very happy to finish up Stanford week at MSR. So I'm going to talk about recent advances in convex optimization. Stop me any time. There are parts where I'll definitely go too fast. If I go too slow, that's really criminal, at which point just speed me up somehow or make the universal sign of boredom or whatever, and I will speed up. But I have a lot of material, so we'll just see. The first thing I'm going to do is start by quickly saying what convex optimization is and why anyone would care about it. Then I'll talk about three different things, at different levels of detail, actually none of them in horrible detail. And these are, I guess, the recent advances. One is modeling tools for convex optimization. I'll say what these are, how they work, what you do with them and things like that. This has had a huge impact on how teaching is done and how research is done with these things. I'll talk briefly about large-scale convex optimization: beyond problem scales and sizes where a direct solver can work, you move to iterative solvers; this is when you move into the million to ten million constraint range. The final thing I'm going to talk about is something I'm quite interested in right now, and it's kind of preliminary work. Actually, I have very little on this, but I'll talk about it; this is ongoing work, and maybe if I still have time I will say a little bit about some of it. Real-time convex optimization is embedding convex optimization methods in extremely fast real-time systems. So we're talking about solving convex optimization problems of modest size in numbers like microseconds or milliseconds. We'll talk about this, but I can tell you the summary right now: you can do it. So that's the summary. But we'll get to that. It's actually quite interesting, because this one opens up a whole world. People who do optimization, if you're trained in optimization, even the people who make all the best solvers and stuff like that, they're always talking about, oh, I solved this problem with a million variables, blah, blah, blah. But these are human time scales. I have this thing now where I ask people who write solvers: how fast can you solve, like, a network flow problem, some small or tiny problem with a couple hundred variables or something like that? Basically they don't know. They're like, a second. If it's fast enough that, like an Excel spreadsheet update, it's not seen, they call that zero. But the answer is that a lot of those things, if you know what you're doing, can be solved in microseconds or milliseconds. I'll give some examples of that. I have no idea what to do with this. Well, I have some ideas; we'll get there. I should say, interrupt me at any time. All right.
So we'll start with optimization. I think everybody here would know all about this, but just to set the notation: you minimize an objective function, subject to inequality constraints and some equality constraints. And everything is an optimization problem. Well, fine, that's a tautology that says nothing. What really matters is whether or not you can solve that problem. And whether or not you can solve it depends very much on the properties of these functions. So, for example, if they're all linear plus a constant, or affine, you get an LP, and these you can solve very efficiently. I mean, in generic cases. You can always push the boundary, and things get difficult there, and you need custom methods and all that. But roughly speaking, as a zeroth-order statement, you can say you can solve LPs very efficiently. In contrast, very small problems of that form with nonlinear f_i and h_i, so with nonlinear functions involved, can be intractable. It's very easy to make a small one; if you go to quadratic, that's all you need. You can embed your favorite NP-complete problem this way, and that's the end of that. These will be difficult in practice as well. Okay. Convex optimization is this: your equality constraints are all affine, or linear, people would say, and the objective and inequality constraint functions are convex. So their graphs are bowl shaped; they have nonnegative curvature. That's convex optimization. And it includes LP as a special case: LP is actually the special case with zero curvature. LP is the case when all the functions are affine, and that means the convexity inequality holds with equality, and not just for theta between zero and one but for all theta. That's affine. So the boundary of convex optimization problems, in some sense, is the LPs. So that's what they are. Now, these things can look really difficult. They can be nonlinear, nondifferentiable and so on, but they can be solved very efficiently. I should point out that that goes against much of the teaching and thinking in OR, in traditional optimization. Because in traditional optimization you talk about linear and then you go to nonlinear. They even organize themselves this way: you do LP or you do NLP, and NLP is nonlinear programming, something like this. And a further step would be nondifferentiable nonlinear; that's a big step up. And you could go to discrete, or something like this. It turns out that when you talk about convexity, differentiability is irrelevant, utterly irrelevant. Which is not a small issue in traditional OR: if you go look at your traditional first class on optimization, the first half of the book is filled with gradients and the second half is filled with gradients and Hessians. It's a big deal. The second big point is that these problems come up a whole lot more often than was once thought. So LP is widely used, widely suffused in lots of areas, but there are problems that are convex that hadn't been noticed to be convex.
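Before going on: the standard form and the convexity condition just described, written out. This is textbook notation, not anything from the talk's slides.

```latex
\begin{array}{ll}
\mbox{minimize}   & f_0(x) \\
\mbox{subject to} & f_i(x) \le 0, \quad i = 1, \ldots, m \\
                  & h_i(x) = 0,  \quad i = 1, \ldots, p
\end{array}
\qquad
f_i(\theta x + (1-\theta) y) \;\le\; \theta f_i(x) + (1-\theta) f_i(y),
\quad \theta \in [0, 1]
```

Here the h_i are affine and f_0, ..., f_m satisfy the inequality on the right. LP is the boundary case where the inequality holds with equality for all theta, that is, where all the functions are affine.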
Now, there's a very large number of these. And it's one of these positive feedback systems. It's a social system: the more people who know about these things -- for example, the more people who know about semidefinite programming and are out there looking at real problems -- the more people step back one day and say, oh my God, that's an SDP, or something like that. And the more applications there are of SDP, for example, the more motivation there is for people to write good solvers for it and work out complexity theory and all that stuff. So I think that's what this is. There are recently discovered applications in control, combinatorial optimization, signal processing, machine learning, statistics, finance and so on. Now, the impact on a field varies, because it depends on lots of things. For example, in machine learning I would say it's a conceptual impact; it hasn't really changed anything, because a lot of these problems are dominated by huge scale. So it's not like you can just say, oh, okay, that's convex, and then walk away. In other cases you can actually solve the problem; in control it actually just works. In smaller problems in statistics and finance, certainly, and in signal processing and circuit design, you can just do these things. Okay. So there are some challenges. I'll go from the bottom up. In the theory category, convex analysis: it's more than 100 years old. It's a branch of math. People pretty much knew everything by 1970, roughly. I mean, there's still stuff being done; people who do that would be very irritated if they heard me say that. But okay, basically this was done by 1970. You need people doing this, that's great. And there are lots of people working out complexity theory for algorithms and things like that. That's good, too. I'll say a little bit about how these methods actually work in practice as well. Okay. Then there are a bunch of people working on algorithm development. Here you're developing things like reliable, efficient algorithms for special applications, special classes, or in some cases just general convex optimization problems. And, of course, there are people who work at the high algorithmic level and those who work on the details of the software implementation, on how these things actually work. To be blunt about it, this often matters much more than, for example, the algorithmic level. You go to a conference and see people getting up screaming at each other, saying my primal-dual search direction makes this thing converge in 19 steps whereas yours took 23. But, of course, how you do your linear algebra can matter by a factor of 100. So here are the theorists arguing with each other about 17 versus 23 iterations, or n to the 3.5 versus n to the 4, or this kind of thing, which is less relevant, when in fact it turns out they really should be concentrating on how they do their factorizations and things. At the highest level, or at least at the top of the slide, there's the question of modeling. And this is the question of posing practical problems as convex problems, and this can be approximate. The level of violence involved in mapping a problem to a convex one can range from zero or minimal, in the sense that someone gives you the problem, you change some variables or whatever and it's exactly equivalent; now it's an SOCP or something. If the problem is not dominated by huge scale or tight real-time requirements, you're done. That's it, you quit. And you have no apologies to make, no footnotes, nothing like that. It's done. Then there are ones where there are terms you can't handle exactly, and you put in approximations; these are mild, and so on. Then you get up to ones where calling it "approximately" is being very, very generous.
These would be relaxations of combinatorial optimization problems, where you can hardly say you've solved the problem, but you end up with two things: a lower bound on a hard problem, and an outstanding heuristic for a local, suboptimal method. Now, once you've done that, if you have anything to apologize for -- for example, approximations, missing terms and ignored constraints and things like that -- you have to go back and verify the practical performance. Okay. So now I'll say a little bit about convex optimization modeling tools. >>: Everything you said up until now, except convex optimization, you could have just said about looking for a local optimum -- is there anything special about -- >> Stephen Boyd: No, you're right. That's not entirely true, but as a zeroth-order statement that's correct: if instead of solving the problem you mean finding a local solution, then much of what I said, almost all, will just translate immediately. There is actually a difference. It turns out there are issues of scaling, and the algorithms are trickier and things like that, because you have to deal with things that are super flat or have the wrong curvature. Whereas when you get a convex problem these issues go away. They don't happen. Not only that, they're reliably things like 20 steps, 25 steps. Absolutely reliable. Whereas a good local optimization method will be quite reliable and quite fast, but actually there are some parameters there that have to be tuned. There's a little bit more baby-sitting. When you take a look at one of these things -- I won't name any -- you run it, and if it's your lucky day, 25 steps, done. Actually, you can do a cool experiment. Take a convex problem in a form that can be handled by a nonlinear solver, and call a nonlinear solver. Right? You'll get the global solution. So the question is, which is better? The answer is that the convex optimization solvers are far, far superior. The other ones either fail, or if you go in and set 14 parameters just right, it will work. It's got to; theory says it will. So all of that is the difference between what I've just said and how it would transpose to finding a local solution of a nonconvex problem. Here are some general approaches. The first is to do the following, and it's a method: you use a general solver, one for convex optimization, and you merely assume that the f_i are convex and you proceed. Now, if you evaluate a Hessian and it's not positive definite -- for example, a Cholesky factorization fails -- everything grinds to a halt, and you can now go back and say: your problem's not convex; indeed, there's an x where the Hessian of f_3 doesn't have a Cholesky factorization. You've proved it's nonconvex. This is easy on the user, because it's the traditional interface for optimization: you provide methods for evaluating the functions, gradients and maybe Hessians. So it's easy on the user. But you really lose a lot of the benefits of convex optimization. One is, of course, that you can certify the solution is global; the other is the reliability. You give those up here.
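A minimal sketch of that Cholesky-based nonconvexity certificate, assuming you can evaluate the Hessian at a point; the function names here are illustrative, not anything from the talk.

```python
import numpy as np

def nonconvexity_certificate(hessian, x):
    # If the Cholesky factorization of the Hessian at x fails, the Hessian
    # is not positive semidefinite there, which proves f is not convex.
    try:
        np.linalg.cholesky(hessian(x) + 1e-10 * np.eye(len(x)))
        return None          # PSD at this x; proves nothing by itself
    except np.linalg.LinAlgError:
        return x             # certificate point: f is nonconvex

# Example: f(x) = x0^2 - x1^2 has Hessian diag(2, -2), indefinite everywhere
hess = lambda x: np.array([[2.0, 0.0], [0.0, -2.0]])
print(nonconvexity_certificate(hess, np.zeros(2)))   # prints the point
```

Note the asymmetry: a failed factorization is a proof of nonconvexity, but a successful one at finitely many points proves nothing, which is exactly why this interface gives up the certification benefits.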
Okay, now the second gross category is one that makes a sort of sense. You take the problem described in some high-level language, some algebraic form, let's say, and you write a tool that scans that high-level problem description and attempts to prove that the problem you have is convex, and maybe transforms it to some convex problem, something like that. So that sort of makes sense. And you can make things like this. And actually we have, meaning with students we've done this, maybe even a while ago. Now, you quickly find out that verifying a problem is convex is, in general, sort of NP-hard. I can easily write down something hard. I'll mention how these things work, typically: basically you do interval computation with second derivatives. That's all you're doing, interval arithmetic. To establish convexity, all you want to know is that the second derivative is nonnegative. You can do all sorts of tricks with interval arithmetic, and if you're able ultimately to establish that the second derivatives of the f_i all lie in [0, infinity), you win. By the way, there's at least one commercial version of this fielded right now; solver.com did this. Okay. I had a student who worked on this. This was his dissertation. He put it all together. It was this very sophisticated thing where there were various bounds on each of the functions, and once a tightened bound was known for one variable, a message would be sent to all the things that used that variable, and they would tighten their bounds. This would go on for a long time. So he came into my office. He wrote down this horrible expression with hyperbolic cosines and logs and x and stuff like that, and he said, do you think that's convex? I'm like: who the hell would know, right? Who would know? He said it is, and it took 987 iterations of this method to prove that it's convex. And we looked at this horrible expression, and we both said: that is really cool. And we sat there in silence for like 30 seconds. We looked at it, we both looked at each other, and we said: it's useless. And we came to the conclusion that that's not what you want. It's kind of cool -- you can impress your friends by writing down horrible functions that are convex -- but it's actually not the right way to do it, in my opinion. To my thinking it's not the right way to do it, because even if you know some horrible expression is convex, it's not maintainable. If someone writes some code that solves a problem, and it actually is convex, and they leave or whatever, somebody else comes in; you have no idea why it's convex. You change one constant from .1 to .3 and all of a sudden it's not convex anymore, and you have no idea why. In some cases, of course, you can change the sign of a coefficient and nothing happens, and in other cases it changes everything. The point being that this is not maintainable if someone writes stuff like that. By the way, the student was perfectly okay with this. It's a lower-bound result; you think of it that way. Very useful: this is how it shouldn't be done, or something like that. It's one of those. Very useful, right? Okay. Now, after much thinking about this, it seems to me that the way you really want to do this is this: you really want to construct the problem as convex from the outset. That's actually really what you want to do.
Because if you do that -- and the way to do that is basically to follow a restricted set of rules and methods, which I'll talk about in a minute -- then convexity verification is automatic, and, as we'll see, transforming the problem to one that is solvable by an interior point method is totally automatic. It's trivial. So we'll see that. And the advantage is that this is maintainable, because the intention of the modeler is made explicit. So this actually makes sense to us. Okay. All right. So how can you tell if a problem is convex? You need to check convexity of problems. We talked about this. You can use the first-order condition, or various basic definitions. But in fact the best way to do this is via convex calculus. You start with a library of basic atoms, examples that are convex, and then you construct more complicated functions from various calculus rules. And there are lots of these. In fact, that's basically, roughly, the two-bullet synopsis of what convex analysis is. There are lots and lots of rules, but it turns out a very small number will get you very, very far. Let's start with some basic examples. There are some obvious ones. Powers are either convex or concave, depending on the exponent. Exponential, negative log, negative entropy: these are convex functions. These are very easy to show. Any affine function is convex. Sum of squares is convex. Here's maybe the first one that not a lot of people would know is convex, and it's not obvious until somebody tells you. Well, okay, it's not hard, but someone has to tell you: the sum of the squares divided by an independent positive variable. That's convex, jointly in x and y. It's obviously convex in x, because for any fixed positive y it's a multiple of the sum of squares. It's obviously convex in y, because that's a positive number divided by y, and one over y is convex; that's obvious. What's not obvious is that it's jointly convex in x and y. So that's maybe the first example of a basic function that most people wouldn't recognize instantly as convex: sum of squares divided by a positive variable. A norm -- that's obvious from the triangle inequality. The max of a bunch of variables -- the max function is convex. And then log-sum-exp is convex; of course, that's sort of a smooth version of the max. And various people would know these things, or would know them but maybe not remember that they know them, or something like that. Okay. Here are some others. The log of the cumulative distribution function of a Gaussian is concave. And actually it's interesting, because all sorts of classical inequalities come from these facts. For example, if you write down the basic convexity inequality here, you get one of the -- I forget which named inequality -- you get an approximation and a bound on the Gaussian CDF. Here's one maybe some people don't know: the log of the determinant of the inverse of a positive definite matrix. That's convex, in the matrix. Again, it's not obvious, but these are not hard to show. Those would be some of the examples. Calculus rules: there are zillions of them, but it turns out you can fit on one page the ones that will get you 90 percent of the way. And actually this list is not minimal, because in fact all of them follow from the last one, so you can get by with just one rule.
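Circling back for a moment to the quadratic-over-linear atom above: one standard way to see the joint convexity (a textbook argument, not something from the slides) is via a Schur complement. For y > 0,

```latex
t \;\ge\; \frac{x^T x}{y}
\quad\Longleftrightarrow\quad
\begin{bmatrix} y\, I & x \\ x^T & t \end{bmatrix} \succeq 0 ,
```

so the epigraph of f(x, y) = x^T x / y is the inverse image of the positive semidefinite cone under an affine map of (x, y, t), which is convex; hence f is jointly convex in x and y.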
Nevertheless, these rules are for explaining things to people, not to machines. What it means is that the code is actually much shorter than the slides when you actually do things like this. So some things are obvious. You have nonnegative scaling: if you scale something by a nonnegative constant, it remains convex. The sum of convex functions is convex. Affine composition: if f is convex and you precompose f with an affine mapping, the result is convex. These are very easy to verify. Pointwise maximum -- that's maybe the first non-obvious one. Here's one, partial minimization, not totally obvious: if you have a function that is jointly convex in two groups of variables and you minimize over one of them over a convex set, the result -- that's partial minimization, minimizing f over some of the variables -- is a convex function of the remaining variables. Now, most of these are very easy to show. Here's actually the real one: composition. If a function h is convex and increasing and f is convex, then h of f of x is convex. And there are generalizations of that to multiple arguments and a few variations on it. But it turns out that from this one I can construct all of the others easily. Some are just completely trivial. For example, if the inner f is affine, this rule implies the affine composition rule. It certainly gives the sum rule, because the function that takes the sum of two numbers is convex and nondecreasing. It gives the pointwise maximum, because the maximum function is convex and nondecreasing in all of its arguments; therefore the max of convex functions is convex. The partial minimization one you can also get this way, but I won't go through that. So there are actually two rules, it turns out, that you really need to know, but it's useful to think of them this way. Okay. So you can now construct examples. You can do things like this. You have affine functions; they're obviously convex. You can take the max of a bunch of them, and you get a piecewise-linear convex function. We can look at the L1-regularized least squares cost. This function is convex: that's affine, the sum of squares is convex, the L1 term is convex because it's a norm, and as long as lambda is positive the sum makes the whole thing convex. Here's one: the sum of, for example, the seven largest elements of a vector. It's a complicated function, but it's convex. So no one is saying anything; am I going too fast? Too slow? Too slow. All right. Fine. So, oh, by the way, how do you show this? What's a quick way to show this? >>: The max of a lot of functions. >> Stephen Boyd: The max of a lot of functions, its maximum, exactly. You consider the inner products: you look at c transpose x, where c is a vector with seven 1s and n minus seven 0s. There are n choose 7 of these, and the maximum over all n choose 7 of them is exactly that function, the sum of the seven largest. Okay. I won't go into any more -- oh, one interesting thing is that most of the interesting ones are actually nondifferentiable. That's not a big deal. You can throw away your book that's filled with gradients. Actually, don't throw it away; set it on a shelf, because we're going to go get it back later in the talk. Okay. So just a couple more examples. Maximum eigenvalue of a symmetric matrix; norm of a general matrix. Here's one that's actually getting a lot of interest right now: the dual of the spectral norm of a matrix, which is the sum of the singular values, the nuclear norm. A small code sketch of some of these atoms follows below.
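Here is that sketch, written in CVXPY, the later Python sibling of the CVX system discussed further on (my reconstruction with made-up data; nothing here is from the slides):

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 10)), rng.standard_normal(20)
x = cp.Variable(10)

pwl   = cp.max(A @ x + b)              # max of affine: piecewise-linear, convex
lasso = cp.sum_squares(A @ x - b) + 0.1 * cp.norm1(x)  # L1-regularized LS cost
top7  = cp.sum_largest(x, 7)           # sum of the 7 largest entries

for expr in (pwl, lasso, top7):
    print(expr.is_convex())            # True: certified by the calculus rules
```

The point is that each of these is recognized as convex mechanically, by exactly the composition rules just listed, not by any cleverness at solve time.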
For people who don't know: the nuclear norm is to the rank of a matrix as the L1 norm is to the sparsity of a vector. So what that means is that you can transpose all of the ideas from compressed sensing and the sparse stuff, where you regularize with L1 when you want something to be sparse. If you have a problem where you're looking for low-rank matrices, this serves the purpose that the L1 norm did there. Here's one: the negative log probability that a Gaussian lands in a convex set. This is the yield. So this basically says I have a convex set, and z is N(0, Sigma). In fact, any log-concave distribution would work: any distribution where the log of the density is concave, which is most of the distributions you know about. In fact, it's very hard to think of one that's not. So, I don't know, you can go ahead and try to think of one. >>: I will. >> Stephen Boyd: Sure, fine. No, really, sorry, it has to have a name. >>: [inaudible] [laughter]. >> Stephen Boyd: A short name. A single name. [laughter]. >> Stephen Boyd: Come on, anybody can do that. Go ahead and try. It's a serious exercise. >>: Y over 3. >> Stephen Boyd: No, I think that might be okay. Wishart, all these fancy ones -- I actually don't know the answer. There probably are one or two single-named distributions whose density is not log-concave. Somebody could look it up on Wikipedia or something. Anyway, if you have a log-concave density, then this is the yield function, because basically you pick a target point, z is manufacturing variation, and you want to know the probability that the manufactured one, target plus variation, lands in the set. That's this thing, and it is log-concave, which means the minus log probability of landing in a convex set is convex. There are actually a lot of interesting implications. I'll tell you a couple right now that I won't talk about today. One would be this: pretty much anything you can do with estimation from linear measurements extends. If you have a whole bunch of measurements that look like Ax plus v, where v is Gaussian or your favorite log-concave density, you can do that even with horribly quantized measurements. It turns out all those problems are convex. So here's a good sensor: I'll give you the sign of a transpose x plus b, that's all. That's a one-bit sensor. And it turns out I can estimate x superbly from one-bit measurements. And these are all convex problems. That all follows from this, because the negative log likelihood is then convex. So a lot of these things have serious implications. Here's one more: if you form x transpose matrix-inverse x, this is actually convex jointly in x and the matrix, provided the matrix is positive definite. Is there a question? Any of these things we can go off on tangents about if you want. All right. So how do you solve a convex problem? Really, the best method is to use someone else's solver. By far. That's fine. That's by far the best one. Now, the problem with that is your problem has to be in a standard form.
Now, the good news of an arrangement like this is that the cost of software development is amortized across many users, so people can spend all day long just making solvers, with the knowledge that thousands of people are going to use them. It all kind of makes sense. Or you can write your own custom solver, for your particular problem. That's lots of work, but you can get very large advantages, because if you know the problems you're solving, you know the scalings, and there's also sparsity and other structure you can exploit. You can often do really, really well. But it's more trouble: the first option requires you to basically type make, roughly, while this one is serious development. Now, there's something else. The standard method -- well, a method -- is to transform your problem into a standard form and then use a standard solver, and you can think of this several ways. You can think of it as extending the reach of standard solvers. But this is actually a pain. We'll see that it is a pain, in general, to do this; it's not just cumbersome. And then we're going to look at methods that formalize this last approach. Okay. Before I start, I should mention this. It's actually a good thing to know that there are general convex optimization solvers, some of which are three lines long, literally. They always work. They solve all convex problems. The proof of that is about one paragraph. And there are even some that are actually efficient in theory. The ellipsoid method would be your first one; now you're up to five lines of code including comments. >>: Is it the same ellipsoid method? As far as I remember, you need very, very good precision for fairly trivial problems, no? >> Stephen Boyd: The problem is not with precision. The problem is that it really takes as long as the upper bound tells you. And so in practice these things don't work that well, even if they really are polynomial time. So in fact all of these are typically slow in practice. By the way, they do work, because the coding is not much; as I said, it's five lines. As for the interface they require, all you need is something that gets a subgradient for you. And a subgradient, unlike a gradient -- well, if you can write down the problem, you can get a subgradient. That much I can assure you, from subgradient calculus. These are nice to know. Actually, they do have their uses. If you need to solve a small problem and you don't mind going to lunch or something like that: write the code, go to lunch, come back, and your answer might be there if your problem is not that big. Which beats doing some huge code development thing, or finding something that works and then reading the terrible documentation that someone wrote. Okay. So it's actually just cool to know that these things exist, and they do have their uses. For example, with the subgradient methods, it turns out you can build a whole theory, and practice, of distributed convex optimization. So they absolutely have their place. These are not just interesting to know about. I'm not going to tell you about distributed methods here, but they do have their places.
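To make the "few lines of code" claim concrete, here is a plain subgradient method, sketched for minimizing a nondifferentiable convex function (this is the generic textbook method with a diminishing step size, not any particular solver from the talk):

```python
import numpy as np

def subgradient_method(f, subgrad, x0, iters=5000):
    # Minimize a convex f given only a subgradient oracle: a few lines,
    # always works for convex problems, but slow, exactly as noted above.
    x, xbest, fbest = x0.copy(), x0.copy(), f(x0)
    for k in range(1, iters + 1):
        x = x - (1.0 / k) * subgrad(x)      # diminishing step size 1/k
        if f(x) < fbest:
            xbest, fbest = x.copy(), f(x)   # track the best iterate seen
    return xbest

# Example: minimize ||Ax - b||_1; one subgradient is A^T sign(Ax - b)
rng = np.random.default_rng(0)
A, b = rng.standard_normal((30, 10)), rng.standard_normal(30)
f = lambda x: np.abs(A @ x - b).sum()
g = lambda x: A.T @ np.sign(A @ x - b)
print(f(subgradient_method(f, g, np.zeros(10))))
```

It converges, provably, for any convex problem you can write a subgradient oracle for, which is the "write the code, go to lunch" workflow he describes.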
Okay, then you have interior point convex optimization solvers. Now, people who work on these in the modern era like to think everything started in the '90s. In fact, it goes back to the '60s. For example, for LPs they discovered some Russian guy whose work was completely unknown outside the Soviet Union. And it was all there; all the stuff from like 1995 was there. There are even books written on this stuff in the 1960s. They didn't know everything. Oh, yeah, Fiacco and McCormick, do you know this book? Fiacco and McCormick. And I forget the name of it. Do you want to name this thing? Sequential Unconstrained Minimization Techniques. That's it. So if you look at that, you'll find that they knew a shocking amount. Some of it, you'll see clearly what they don't know. They don't know, for example, how the choice of barrier works; it's like some kind of voodoo thing to them. But the bottom line is they knew what interior point methods were, period, in 1969, and could articulate it pretty well. They didn't have a complexity theory, but then no one thought to ask questions like: pardon me, but what's the largest number of steps it would take your method to compute an epsilon-suboptimal solution? No one asked those questions, so it's hardly fair to blame them for not answering them. Seems reasonable, right? >>: I'm more shocked that nobody asked the question than that they forgot the answer. >> Stephen Boyd: It is weird, isn't it? >>: I thought it was an older subject. >> Stephen Boyd: No, the style of algorithm analysis in the '60s was to do things like talk about quadratic convergence, superlinear convergence and things like that. It was not to say: I can solve this in n to the 3.5 times log of one over epsilon steps, or something like that. That was a shift that came later. Seems weird, doesn't it? The most obvious question to ask is: what's an upper bound on how long it would take you to solve a problem? >>: I don't understand, because you get frustrated; you're sitting there watching an iteration on the computer and it never seems to get the number -- >> Stephen Boyd: So quadratic convergence was a big deal. Definitely. >>: [inaudible] come into this side? >> Stephen Boyd: Absolutely. So the popularization in the West, outside of Moscow, came from Karmarkar and Khachiyan, who used these things. And around that time a lot of these things started becoming known. But a lot of them -- actually, not really the interior point methods so much, but the others -- all trace to Moscow State University in the '60s. That's where they all come from. And a couple from Kiev, but that's pretty much it. Okay. Now, these methods handle smooth functions, and also problems in conic form. So these would be things like second-order cone programs and semidefinite programs, but they'll also handle things like geometric programs. And they are extremely efficient: they typically require a few tens of iterations, almost independent of problem type and size. That's accepted. It's an empirical fact, nothing else. It would be like the empirical fact that simplex works really well. Right? And it's now been verified across zillions and zillions of problems, scaling all the way to immense sizes. Now, each iteration involves solving a set of linear equations; in fact, these are just least squares problems with the same size and structure as the original problem.
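To make that last point concrete (this is the standard barrier-method picture, condensed): each iteration takes a Newton step, which means solving a positive definite linear system, and that system is exactly the normal equations of a least squares problem:

```latex
\nabla^2 \phi(x)\, \Delta x = -\nabla \phi(x)
\quad\Longleftrightarrow\quad
\Delta x = \mathop{\mathrm{argmin}}_{v}\;
\left\| \nabla^2\phi(x)^{1/2}\, v + \nabla^2\phi(x)^{-1/2}\, \nabla\phi(x) \right\|_2^2 ,
```

where phi is the barrier-augmented objective. So profiling a solver really does show a few tens of least squares solves and essentially nothing else.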
That gives you a very interesting interpretation. Computationally, if you were to profile a convex optimization solver, here's exactly what you would find it doing: it will solve 25 least squares problems, period. Maybe it's 50 in one case and maybe 10 in another, but roughly a few tens of least squares problems. Period. That's it. There would be other calculations, but they don't matter. That actually has very important implications. It says that if you have a field where, at large scale, you have a method to solve the least squares problem -- for example, in medical imaging, where you have some fast transform and you use a preconditioned conjugate gradient method, something like this -- all of that transposes. That means you can solve convex problems in 20 of those steps. That's what it means. Actually, some of it is faster; I'll talk about that later. So there would be absolutely no point in not using an interior point solver if you can. I mean, there are other methods that are neither of these, and each has some place where it would be better than something else. But these just seem to work. Okay. So here are some solvers. This is what Google is for, so there's no point really going through this. There are lots of LP and QP solvers, and different groups use different things. General purpose ones in MATLAB would be SeDuMi and SDPT3, cone solvers. There are ones written in C that are open source. And each of these might have some specialty, like solving low-rank SDPs, or this or that, I don't know. There are commercial ones; a very high quality one is MOSEK. Solver.com actually has LP and SOCP solvers. >>: [inaudible]. >> Stephen Boyd: That's Frontline Systems, yes, exactly. Another one is CVXOPT. This is by Lieven Vandenberghe at UCLA and some others. That's Python and C. Open source. Untainted all the way through, for those of us who are in the GNU camp, and that includes me. Some of the others are not. There are lots of these. Let me talk a little bit about how you do this transformation to standard form. Here you have a problem with an L1 norm. Obviously not differentiable. How do you solve this? This, of course, would come up in the lasso, or basis pursuit, or whichever one you want to call these things. It's a convex problem. It's not differentiable, so you can't directly use an interior point method. The basic idea is to transform it so you can use one. The way you do that is something like this. You start with the original n variables, and you introduce a new set of n variables, t, which serve as upper bounds on the absolute values of the x_i. You simply write out a new problem: the objective is now smooth, quadratic, and there are now 2n inequality constraints. And you have to actually sit down and show that these are equivalent, in the following sense: if you solve the new one, you've solved the original, by a simple transformation -- in this case, just ignoring t. And if you solve the original, you can solve the new one, simply by setting t_i equal to the absolute value of x_i. They're the same. You might say, oh my God, you started with a problem with n variables and zero constraints, and now you have a problem with 2n variables and 2n constraints.
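Written out, the transformation just described is:

```latex
\begin{array}{ll}
\mbox{minimize} & \|Ax - b\|_2^2 + \lambda \|x\|_1
\end{array}
\quad\rightsquigarrow\quad
\begin{array}{ll}
\mbox{minimize}   & \|Ax - b\|_2^2 + \lambda\, \mathbf{1}^T t \\
\mbox{subject to} & -t_i \le x_i \le t_i, \quad i = 1, \ldots, n ,
\end{array}
```

with variables x and t. At the optimum, t_i = |x_i|, which is the equivalence argument he sketches.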
So you might imagine that you've done a terrible thing. But it turns out there's no loss: if the linear algebra is smart enough to handle this, you'll do unbelievably well. There will be absolutely no loss whatsoever, even though the number of variables is doubled and the number of constraints went from 0 to 2n. The reason is kind of obvious: the sparsity pattern of the constraints is extremely sparse. For example, t_3 only involves x_3, and x_3 only involves t_3. Period. So you can imagine, when you visualize the sparsity pattern, if you interleave the variables, you get tiny blocks. The linear algebra, of course, will run in linear time. So you can extend this idea. People know these tricks. These transformation tricks have been known since 1950; they're in Dantzig's book on linear programming. So when you take a traditional LP course, you learn how to solve not an L1 but maybe an L-infinity problem or something like that. So it's just these tricks: you learn them and you use them. Some of the tricks are not obvious. Okay. So here's the idea. You start with a convex optimization problem and you carry out a sequence of equivalence transformations. By the way, this is kind of the same thing a compiler would do. The idea is that you start with a description and then carry out transformations where at each step you can prove, or you know, that they're equivalent. You carry out these transformations until you arrive at a target, and the target is a form that can be handled by your solver. So that's the idea. Then, of course, once you solve the transformed problem, the reduction also comes with a method that transforms the solution backwards. That's kind of obvious. So you do this. Now, we've done a bunch of these, and the first idea is to use methods like this for rapid prototyping of problems, because you type in a baby problem with 100 variables and the next thing you know there are a thousand. You might think it's fine for rapid prototyping but too slow. It's not slow at all if you know what you're doing. Amazingly, there's no slowdown at all. A problem with a thousand variables: it's huge, but it's sparse. It's the kind of thing a sparse matrix method dreams of getting its hands on, because it's got little tiny blocks and long tendrils and things like that. It's just perfect. So, amazingly, these methods work; they're actually efficient. Okay. So there's a connection between those calculus rules and these transformations -- in fact, more than a connection. They're the same thing. Here's the way it works. You know that the max of two convex functions is convex. Well, there's actually a rule that goes with that. We have expression trees for all the constraints and the objective. We've parsed everything, and we do the following: when you see a node that is the max of two expressions, you simply replace it with a new variable t and you put two inequalities in.
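For instance, that max-node rule, written out: wherever the max appears in a position where larger is worse, replace it with a new variable t and add the two epigraph inequalities,

```latex
\max\{f_1(x),\, f_2(x)\} \;\rightsquigarrow\; t,
\qquad f_1(x) \le t, \quad f_2(x) \le t ,
```

and the rewritten problem is equivalent precisely because max is convex and nondecreasing in its arguments, which is the assumption the calculus rule requires.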
If you see something which is a convex increasing function of a convex function, which we know is convex, you take that node on the tree -- all you're doing is composition -- and you replace it with h of t, and you add a new constraint, f of x less than or equal to t. Now, this looks completely trivial. It's not. Because what happens is this: you could do this whether or not the rules were correct. However, these transformations give you an equivalent problem if and only if the assumptions of convex analysis hold. For example, here, you could do this whether or not h is increasing. If h is increasing, then you can prove that the new problem you've generated is equivalent. If h is not increasing, it's not, including all the fine details that you have to handle. So I don't know if that made sense, but that's the idea. So there's a connection between these. All right. It's kind of obvious, right? You have expression trees that describe the f's here, a high-level algebraic expression that describes the problem. If somebody comes to you and says, how do you know it's convex, you go: that's affine and that's convex. Really, I've got to speed up. I have to speed up a lot. Fine. We'll go faster. It turns out that by the same argument -- if you take the parse tree and annotate it and give the proof that it's convex -- you can also automatically generate the transformation. Okay. This brings us to disciplined convex programming. Here you specify a convex problem in natural form, following a limited set of rules; I'll show you what those are. You can implement this several ways. One is CVX, written on top of MATLAB, and there are others that run under Python, and a couple of others, C++ and stuff like that. Here's an example. Here's a problem you want to solve -- who knows why -- a machine learning problem with sparsity-inducing regularization and inequality constraints, and this would be the executable source code in CVX. CVX would just parse that. Let's see. What if lambda were negative here when CVX parses it? What would happen? >>: I think it should fail. >> Stephen Boyd: It will fail. Actually, it will fail even if, for lambda equals minus 0.1, this expression happens to be convex. Very possibly it is. Doesn't matter. This will fail because it violates a rule. This would be concave plus convex, and it would give you back something that says: disciplined convex programming violation; you cannot add a convex and a concave function. That's what would happen. It's not the same as saying it's not convex. It says you failed to follow the rules. Simple enough. Here's, by the way, what this problem looks like when you solve it traditionally. Traditionally, you would get a cone solver like SeDuMi or something like that, fill in a whole bunch of matrices, call the cone solver, and pull out your variables. We did this for years and years. This is how we did it. Actually, we didn't do it; we made grad students do it. That's how this was done. Because you take one look at that and you realize it's not something that's appropriate for a professor to do, really. But the boundaries have changed, right? Because you see the high-level version? That's fine; professors can do that. So it's really changing the boundaries. Okay. All right.
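The talk's example was CVX under MATLAB, and the slide code isn't reproduced in this transcript; here is a sketch of the same kind of thing in CVXPY, the later Python implementation of these ideas (made-up data), including the rule violation just described:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 20)), rng.standard_normal(50)
x, lam = cp.Variable(20), 0.1

# Sparsity-inducing regularized fitting with inequality constraints
obj = cp.Minimize(cp.sum_squares(A @ x - b) + lam * cp.norm1(x))
prob = cp.Problem(obj, [x <= 1, x >= -1])
prob.solve()                 # parsed, transformed, handed to a cone solver

# With a negative lambda this is concave plus convex: a rule violation,
# whether or not the particular expression happens to be convex.
bad = cp.Minimize(cp.sum_squares(A @ x - b) - 0.1 * cp.norm1(x))
print(bad.is_dcp())          # False
```

The failure is a verdict on rule-following, not on convexity, exactly as he says.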
I want to speed up, so I won't go into the history of modeling languages for optimization. It has a long history; it goes back to the '70s, and there are lots of them. They vary in the amount of problem structure they assume and in what the whole idea of the method is. Is it supposed to be a best-effort method, where anything you type in, it does something? Or is it supposed to be like a strongly typed language, very rigid, but if you follow the rules you're guaranteed it will work? So there are variations. I should also say that this has dramatically changed things like the course I teach on this subject. It was already a big class, but I taught it at one point when homework involved that kind of matrix stuffing. You know, that's a pain in the ass, basically. You would very carefully think about numerical problems and things like that, when in fact you can write three lines and do a lot of stuff. You can do machine learning, you can do finance, signal processing, control and network optimization. When it comes down to writing like five lines of code, it just frees you completely, at least when you're teaching. And it's a lot of fun now. Because in the class we do the theory, but when you leave that class, you haven't just talked about support vector machines, machine learning or portfolio optimization; you've done all of it. And not only that, it took shockingly small amounts of code. So it's actually just been a lot of fun. It changed the research, too. So we'll do some examples, but in the interest of time I will skip over one of them and go just to this one, because it's fun, and because I think a lot of you are in machine learning and stuff like that. Good. So this one has a quiz; you'd better listen. In fact, it is a quiz. Here it is. I have a random variable in R^2. It's got normal N(0,1) marginals in x and y, and its second, third and fourth moments are going to match a Gaussian. The expected value of xy is 0, which means x and y are uncorrelated -- not independent, just uncorrelated -- and whatever the third and fourth moments are, those match the Gaussian. The question is: how small can the probability that both x and y are negative be? If x and y were jointly normal, the answer would be a quarter, because it's the third quadrant. So I'm waiting. >>: Must be a really tight number or you wouldn't have asked the question. >> Stephen Boyd: Okay. Tiny. You want to put a number on it? >>: 10 to the minus 6. Otherwise you wouldn't ask it. >> Stephen Boyd: No, no, no. That's old style. Because if it were 10 to the minus 6, it would be numerical error and you would have to figure out -- >>: Something small. >> Stephen Boyd: All right. Any other guesses? >>: 0. >> Stephen Boyd: That's indeed small. >>: The mean has to be small. >> Stephen Boyd: Oh, yeah. I guess the point is it's not obvious. I've asked a lot of probabilists all over; some get shockingly close. But no, it's not obvious. You can make an argument either way. So you can write this as an LP: you discretize it. And if you want to be careful and get provable upper and lower bounds, you can discretize it carefully. I didn't do that here, but you could. You write it out as a giant LP after discretizing. These are the marginals; these are the second-order moments, the cross moments. The constraints on the second moments of x and y themselves come for free because the marginals are Gaussian, so you don't even have to put those in.
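Here is a rough CVXPY reconstruction of that LP, including the third and fourth cross moments he's about to mention (my sketch, not the talk's code; the grid, its range, and the careful discretization error bounds are all simplified away):

```python
import numpy as np
import cvxpy as cp
from scipy.stats import norm

n = 40
ts = np.linspace(-4.0, 4.0, n)
X, Y = np.meshgrid(ts, ts)                 # X varies along columns, Y along rows
P = cp.Variable((n, n), nonneg=True)       # joint pmf on the grid

phi = norm.pdf(ts); phi = phi / phi.sum()  # discretized N(0, 1) marginal
cons = [cp.sum(P, axis=0) == phi,          # marginal in x (column sums)
        cp.sum(P, axis=1) == phi]          # marginal in y (row sums)
# Match the Gaussian cross moments: E[xy] = E[x^2 y] = E[x y^2]
#   = E[x^3 y] = E[x y^3] = 0 and E[x^2 y^2] = 1
for a, b, v in [(1,1,0), (2,1,0), (1,2,0), (3,1,0), (1,3,0), (2,2,1)]:
    cons.append(cp.sum(cp.multiply(P, X**a * Y**b)) == v)

mask = ((X < 0) & (Y < 0)).astype(float)   # the third quadrant
prob = cp.Problem(cp.Minimize(cp.sum(cp.multiply(P, mask))), cons)
prob.solve()
print(prob.value)                          # about 0.06 in the talk
```

A careful version would replace the plain grid with one that yields provable upper and lower bounds, as he notes.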
Then these are the third moments; they work out. Okay. So here's the source code. It would be a huge pain in the ass, by the way, to write this out as an LP by hand. Most graduate students would just say no when they looked at the problem and realized what was involved. Anybody sensible would say no. Written this way, it's absolutely nothing. Okay. So here's the Gaussian, and here's the answer. It's 6%. That's the answer. So, you know, who knew. I've had people say anything from 0 to like .24. You can't go above a quarter, because a Gaussian gives you a quarter. Okay. What's that? >>: [inaudible]. >> Stephen Boyd: And here's the distribution that does it. Now, it's discretized, but it's weird. You could not say: I was about to try that distribution. I mean, there's no way. And you look at it, and it does just the right thing. It puts some sick little mass in here. The other thing, if you look at it, is that you can see the solution is probably a measure supported on some weird, if we're lucky, one-dimensional set. For all I know it could be some sick Cantor-like set. I don't even know. The point is, with the number .06 we're comfortable; if we bound it, we can get the right thing and so on. So that was just how much fun you can have with like five lines of code. Okay. So I think I'm going to move on quickly, and actually I might be okay if I go quickly. I'll talk about interior point methods. The worst-case complexity theory says that the number of steps grows no faster than the square root of the problem size. Those are the best bounds. The number of steps, I already mentioned this, is between 10 and 50. If you want to be super safe, you can go up to 80, something like that. But that's what it is. And this appears, by the way, to persist. This is certainly true for problems up to a million variables. It's independent of what kind of problem it is. It could be from machine learning, finance, circuit design, network control; doesn't make any difference. Always the same. There's actually someone, Jacek Gondzio in Edinburgh, who recently came and gave a talk at Stanford, and he had solved a dense LP with a billion variables. We're like, wow, what did you use? He goes: same thing everyone uses, primal-dual, the homogeneous self-dual embedding, same one. How many iterations did it take? It took two days to solve the LP, something in finance where they expanded a full tree and all that stuff. And I said, how many iterations was it? He said 21. The thing is, the property of these taking about 20 steps extends to problems with a billion variables. But each iteration, by the way, was quite expensive. Okay, so each step requires the solution of a set of positive definite linear equations -- which is to say, solving a least squares problem; in fact, it's a Newton system. And you have three gross categories: dense direct, sparse direct, and iterative methods. The truth is, in practice they're not distinct. If you do a direct sparse factorization with two steps of iterative refinement, someone could say you're using an iterative method. And iterative methods rely on preconditioners that often use direct sparse factorizations. But grossly these are the three main categories of linear equation solving. So, the dense direct ones: this is your LAPACK type thing, and these are just rock solid. There's almost zero variance in how these things work. Sparse direct:
Here the runtime depends on the sparsity pattern but not on the data, provided these are positive definite systems -- and they are, which is the point: you do symbolic preordering and you don't have to do runtime pivoting. That's what this is based on. But it actually requires a good heuristic for the ordering. By the way, this undercuts the whole thing about convex optimization, if you think about it. Next time you see someone doing convex optimization, you say, what are you doing? I'm solving a convex problem. They're all high and mighty. They say, I get the global solution. It's non-heuristic and everything. And then you say, how do you solve your linear equations? It will be using a sparse solver. And technically that's a heuristic. So sparse solvers work only by the grace of whatever god or gods or goddesses handle heuristics for sparse matrix orderings. The gods of heuristic ordering. And that can really get people irritated, by the way, if you say it. You say, no, you're using a heuristic method. They'll say: I am not. I made approximations so I could make this convex; this is the global solution. I don't doubt that, but the fact that the method runs in 10 minutes and not a year is basically because of these sparse matrix factorizations. Then, when you move into the scientific computing community: iterative methods. All bets are off. Runtime depends on the data, the size, the sparsity. They require tuning and preconditioning. So these are certainly not general methods by any means. They're just not. On the other hand, they're the only way you're going to solve problems with 10 million or 100 million variables. I'll skip over this; I think everyone knows about conjugate gradient methods anyway. These are iterative methods for solving large positive definite equations, large least squares problems. And the interesting thing for us is not the theory of them; it's that you sometimes get an awfully good solution in a shockingly small number of steps. This depends on the spectrum of the operator involved. So I'll skip over this and just say that if you take an interior point method and, instead of solving for the search direction using a direct method, you use an iterative method, there are lots of names for it: limited-memory Newton -- it turns out it's very close to BFGS -- it's also called a truncated Newton method. The total effort here is measured by the equivalent of CG steps. These are not general purpose. The grad students are back in business -- not writing code to stuff entries into giant matrices, but back in business working out things like good preconditioners and getting things up and running for particular problems. The nice part is that in an interior point method you really couldn't care less about solving for the search direction exactly anyway, because all you want to do is get to the solution. So it's totally irrelevant; nobody cares about solving the Newton system exactly. You want to preserve this idea that it takes 20 steps or so to get there. I'll give a quick example and then we'll move on to the last topic. This is L1-regularized logistic regression. You have a whole bunch of data, x_i with binary labels b_i, and you want to fit a logistic model to them, and you add an L1 term here. What this will do, of course, is that as you crank lambda up and down, you will get a sparser and sparser solution out here.
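Written out, the problem he describes is the standard L1-regularized logistic regression (my notation; the slides aren't reproduced here): given examples a_i and labels b_i in {-1, +1},

```latex
\mbox{minimize}\quad
\frac{1}{m} \sum_{i=1}^{m}
\log\!\left(1 + \exp\!\big(-b_i\,(a_i^T w + v)\big)\right)
\;+\; \lambda\, \|w\|_1 ,
```

with variables w (the weights) and v (the intercept). Each term in the sum is convex, being log-sum-exp composed with an affine function, and cranking lambda up zeroes out more entries of w.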
The nice thing about this one is that it's sensible even when the number of examples is smaller than the number of features, a regime where plain logistic regression is utterly nonsensical. You can do this for 50 examples and a million features. No problem. What it will do is select obviously fewer than 50 features -- it will select, I don't know, 10 -- and the idea is that that's a heuristic that saves you from trying all one-million-choose-10 sets of 10 features. It does feature selection for you. So that's the problem. You can write a sparse direct solver for it, and it works exactly as advertised. They all look like this: it takes 30 steps or whatever to get you a perfectly good solution. This shows how little it depends on the value of lambda, and direct methods, as I said, are essentially independent of the data; that's true -- instead of taking 32 steps it takes 33 or something like that. These are just two examples, each with a few thousand variables and constraints, and these are solved relatively fast. Then we go to a much bigger problem with maybe, I don't know, three-quarters of a million features and 11,000 examples -- here's a case where you have more features than examples -- and about five million nonzeros in the data. The final interior point problem has about a million and a half variables. These are beyond the capability of direct methods. With a relatively simple preconditioner for the Newton system, you can solve this in a couple of minutes, and this is what it looks like. Here you see everything you'd expect for one of these methods. The first thing you see is that the actual time, which is measured by cumulative PCG iterations, is data dependent. As a practical matter, you're only interested in this region, because the sparser the final selected feature set, the faster it is. So it turns out you're really only interested in that. >>: You don't care what -- >> Stephen Boyd: Of course you don't. You could stop out here. I agree completely. Because the thing is a heuristic for actually making a classifier. If you judged it on a validation set, you would find that if you stopped probably right here, those weights would be just fine. Truncate the ones that were clearly going to zero, and I'm sure it would work fine. All right. These are kind of obvious. But these methods will take you up to ten-million-variable problems. Pretty straightforward. By the way, there are first-order methods that people have now developed for these, and they're also quite good -- almost linear, same as these. Actually, that's sort of my theory: if you have a specific problem class, a custom interior point method will be the first thing to get really fast. Then the crowds come in, and they end up with some simple first-order, one-variable-at-a-time method, and with a lot of tuning those things work. That's happened for a lot of L1-regularized problems, and they're perfectly good methods. But then you make a variation on the problem, add some linear inequalities, and those other methods don't work. So the summary here is just that there are really these three regimes of problem solving.
So the summary here is just that there are really these three regimes of problem solving -- the small, the medium and the super big. And I'm not talking about distributed solvers; we didn't talk about those, just a traditional solver on one machine -- multi-core doesn't matter, but one machine. The last thing I want to talk about is going to be pretty brief. Actually, I find it really interesting, because I have no idea what it can be used for, but I know something: it's just really cool. And it's this. Let's imagine you're going to solve a specific problem. To fix the idea here, let's take a modest problem, a couple hundred variables -- a portfolio optimization problem or an optimal execution problem in finance, or a problem in control, or network flow. Let's just make it network flow. You have a network; you want to decide the flow rates of 100 flows passing over 300 links, and you want to set these flow rates. I mean, that's way beyond what anyone just messing around, not trained in optimization, could come up with -- even remotely close to optimal. If you're making a custom solver you can exploit the structure efficiently. And the point is you can actually do this at code time, code generation time. You don't do it at runtime. Solvers now generally work like this: the solver reads in the data, takes some time to analyze the sparsity, and uses whatever method it uses to generate some permutations. At that point it knows how bad the fill-in is going to be. It allocates some memory for that, maybe some extra memory for dynamic pivoting and stuff like that, and then it starts solving the problem. That's how a solver works now. That's great; that's how you want a general purpose solver to work. But here I'm talking about something where you want to optimize the flows on this network and the topology is known ahead of time. You can spend hours figuring out good permutations. You can also do things like determine all the orderings and do the memory allocation ahead of time. You can move things around in memory for nice locality of reference and all that. You can do crazy stuff. You can also cut corners in the algorithm. As you just pointed out, if you write a general purpose solver, you'd better make it get six or eight digits of accuracy, because you don't know what people are going to use it for, right? If it's finance or who knows what, those six or eight digits might actually have meaning -- it probably means the problem was scaled wrong, but that's another story. You just don't know. But once you have a particular problem, it's very rare that even the second digit matters; in a properly scaled problem, the third digit is utterly inconsequential in all practical problems, as far as I know. So once you have a specific problem you can terminate way early. You can also use warm start, because if you have a real time embedded thing where you solve one problem and then another and another, it's often the case that these problems are related -- they're very close -- and you can use warm start. Now, if you put all these tricks together, you can end up with a very fast solver, and this basically opens up the possibility of real time embedded convex optimization.
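As a toy illustration of that repeated-solve pattern -- my sketch, not a real code generator, which would compile a custom solver for the fixed structure instead -- here's a small parameterized problem in cvxpy solved over and over with warm start as only the data changes; the network data is hypothetical:

```python
import cvxpy as cp
import numpy as np

# Hypothetical fixed "topology": only the demand vector changes per solve.
rng = np.random.default_rng(0)
m, n = 30, 100                        # 30 balance constraints, 100 flows
A = rng.standard_normal((m, n))       # known at code generation time
cost = rng.random(n) + 0.1            # per-flow congestion cost weights

f = cp.Variable(n)
demand = cp.Parameter(m)              # the only quantity that changes
prob = cp.Problem(
    cp.Minimize(cp.sum(cp.multiply(cost, cp.square(f)))),
    [A @ f == demand, f >= 0, f <= 1],
)

for t in range(100):                  # the real time loop
    demand.value = A @ rng.uniform(0.2, 0.8, n)   # a feasible demand
    prob.solve(warm_start=True)       # start from the last solution
```

A code generator goes much further -- fixed orderings, preallocated memory, early termination -- but the warm-start loop over a Parameter already captures the idea of solving a stream of closely related problems.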
So this is this, and I'll look at some quick examples here. Here's grasp force optimization. This is for a robot: you have a rigid body, there are fingers grasping it, and you have to decide the forces on these fingers. They have to resist a given wrench -- a force and torque on the body. That's this: six equations. You have to satisfy the friction constraints -- except I forgot to put in the coefficient of friction; it goes in one of these two places, so put the coefficient of friction in there. You can say it's horrible, it's nondifferentiable, which it is, but that's fine; we're past that stage. We say it's an SOCP. And in this case, if you exploit the sparsity pattern and various other things and you have a custom method for calculating a dual point and all that, you can solve these problems in around 80 microseconds. The speedup, by the way: 80 microseconds is a joke compared to one of these other solvers -- in that time they've just pulled in the problem, they're just sort of waking up and thinking about finding an ordering and all that kind of stuff, and this is done. Okay. So 300 microseconds is cold start. That's from nothing -- just new data, never saw it before, and you get the answer. We've done this now on a bunch of problems. Another one is model predictive control. I don't know if people know about this. Do people know about this? This is very, very cool. It's widely used. So you have a stochastic control problem. Let's make it linear for simplicity. It could be a supply chain problem -- let's do supply chain. You have a bunch of product, a bunch of nodes on a graph, and at each step you can ship things between nodes at certain costs. You can even pull some in from a factory or something like that. You have demands at various nodes, things that get pulled out. There are different models of it. One would be, for example, that you actually allow the stock at a certain point to go negative; that would be back order. You can have all sorts of games. And what you know is the distribution of demands; it's statistical. So the question is: how do you operate a real time supply chain this way? You know your stock everywhere and have to decide what to do -- these are very complicated stochastic control problems. There are some amazingly good heuristics; one is MPC. It works like this. You plan over a horizon. You just take 20 steps into the future. Now, in the future, of course, you don't know the demands. So you go to an expert and ask them to guess the demands. You can use the conditional mean based on what you know so far; it just doesn't even matter. You get that, you pretend that's exactly what's going to happen, and you work out a complete planning trajectory of how you'd ship things across the nodes to minimize shipping costs, warehousing costs and some costs associated with back ordering and all that kind of stuff. So you work out a whole plan, and you execute only the first step. It's got lots of names; it's also called rolling horizon planning, or dynamic linear programming. Works unbelievably well. It is now universally used in chemical process engineering. That happened in the '80s and '90s. The reason it appeared there is that in chemical processes the dynamics are very slow and you have 15 minutes or an hour to make your new decisions. Things are just going very, very slowly. So, no problem: you can solve a 50,000 variable LP quite reliably in that amount of time.
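Here's a bare-bones sketch of that receding horizon loop -- a made-up linear supply chain toy of mine, not a real model: plan T steps ahead against guessed demands, execute only the first shipment decision, then re-solve from the new stock level:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 8, 12, 20                   # stock nodes, shipment routes, horizon

B = 0.3 * rng.standard_normal((n, m)) # how shipments move stock between nodes
x = cp.Variable((n, T + 1))           # planned stock (negative = back order)
u = cp.Variable((m, T))               # planned shipments
x0 = cp.Parameter(n)                  # current stock, measured each step
dhat = cp.Parameter((n, T))           # guessed future demands (cond. mean)

cost = cp.sum_squares(u) + cp.sum_squares(x)  # shipping + warehousing proxy
constr = [x[:, 0] == x0, u >= 0]
for t in range(T):
    constr.append(x[:, t + 1] == x[:, t] + B @ u[:, t] - dhat[:, t])
plan = cp.Problem(cp.Minimize(cost), constr)

stock = np.ones(n)
for step in range(50):                # the rolling horizon loop
    x0.value = stock
    dhat.value = np.full((n, T), 0.1) # the expert's guess of demands
    plan.solve(warm_start=True)
    ship = u.value[:, 0]              # execute only the first step's plan
    d_real = 0.1 + 0.05 * rng.standard_normal(n)  # demand actually realized
    stock = stock + B @ ship - d_real # then re-plan from the true stock
```

Each pass through the loop is one small QP with fixed structure, which is exactly the kind of problem a custom solver can knock off in microseconds.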
So most people in control assumed, oh, that's a numerical method; I work on jet fighters, I make servos for disk drives, I need to make my decision not every hour but every 50 microseconds, so this will never apply to me. Anyway, that's not right. And these are actually just very old numbers here. So this is just model predictive control. Just to point to one here: this would be the type of thing people would actually use. This would be a QP with 50 variables, 160 constraints. And if you exploit all the structure you can solve it in times of 300 microseconds. Actually, these are off; these are now down to like 150, things like that. This is on something like a 2-gigahertz processor. Typical speedups are on the order of 1,000 to 1. And that's against already extremely good interior point methods. But it's not fair, right? Because these things have solved a thousand of these problems while those others are just waking up, allocating memory, loading, starting to have a discussion about what the ordering of the variables should be. It's of course not a fair comparison. So, actually, I think this is really interesting. This is what I'm most interested in right now, because of this whole regime in optimization. Everyone thinks of optimization as either solving giant problems or solving things with a human in the loop, on a human time scale, where somebody with a spreadsheet types in some what-if and hits optimize or some stupid thing like that, and all of a sudden something happens. It has to happen in a second or two. Working out the scheduling for United Airlines for tomorrow -- that's what people expect. What I haven't seen is a lot of interest or any focus on solving convex optimization problems in microseconds and milliseconds, which now we know is absolutely possible. I don't know what to do with it. So if anybody here has any ideas -- like, for example, suppose you could knock off an L1 regularized logistic regression and update the weights every two milliseconds, for some modest size problem. I'm trying to think -- I don't know what to do with that. But I'm sure there's going to be something cool you can do with it. That much I'm confident of. So, okay, maybe I'll quit here. The references -- that's what Google is for, so I won't say anything about that. And I will quit here. [applause]. >>: [inaudible]. >> Stephen Boyd: Yeah. >>: You mentioned that dense linear program with a billion variables there. But how many exabytes did you need for memory? >> Stephen Boyd: It was huge. He used some giant machine -- a Blue Gene, one of these huge things. Each iteration took, I think he told me, something like two hours. >>: Memory -- >> Stephen Boyd: It was what? >>: Two days? >> Stephen Boyd: So we can do the arithmetic. It was a couple of hours -- two hours for each iteration. >>: [inaudible]. >> Stephen Boyd: But it was a room. He showed a picture of it in his slide; you could see way off in the distance there were machines there. So... yeah, it wasn't on -- >>: The world's largest dense matrix -- >> Stephen Boyd: He says it's not. But it was pretty big. You can find that out. It's Gondzio -- just type Jacek Gondzio in and I'm sure you'll find that. >>: At the end of October, you're still at the curve -- you don't have a financial organization? [laughter]. >> Stephen Boyd: Sure. Sure. I can tell you this, actually.
A lot of the stuff in convex optimization is used for robust portfolio optimization. So I actually think the status of the people we're talking about -- robust portfolio optimization -- has risen substantially. Because before that you'd hear people -- I talked to people who were actually in the trenches fighting other people: my method, your method. >>: No matter what your algorithm was, if you didn't have the right data -- they were calculating things on portfolios using three-year-old data for the characteristics. >> Stephen Boyd: Good point. [applause]