>> Kostya Makarychev: Okay, I will start. It's great to have Nikhil Devanur today speaking about fast algorithms for online stochastic convex programming. Nikhil is an expert on approximation algorithms, online algorithms, and algorithmic game theory, and today he's talking about, I guess, online algorithms, right?

>> Nikhil Devanur: Thanks, Kostya, and it's nice to speak here. I think I'm speaking here after a long time, so it's nice to be on the podium. I will talk about this work, which is joint with Shipra Agrawal from MSR India. As many of you know, I spent about a year in India, and this is one of the two papers I wrote while I was there; I really like both of them. And it's a very simple talk: I'll define a problem, I'll give you the solution, and I'll point to some future work.

So what is the motivation? Again, as many of you know, I have been working on problems coming up in display advertising. This is the thing where we advertise on people's T-shirts. Just kidding. And prior to this work, the algorithms we designed have actually been used by Microsoft in the past few years to run this display ad system.

Here's the basic problem formulation. The basic decision you have to make is this: you get an ad slot, or an ad opportunity, and this is called an impression; it's a technical term. You have to assign this impression to one of many competing ads. So which ad do you assign the impression to? If you assign ad j to impression t, you get some value v_tj that depends on this pair. Now, you could say, okay, why don't you just assign each impression to the advertiser with the highest value? You don't do that because there are constraints: each ad has a target number of impressions, and you don't want to assign more than this many impressions to that advertiser.

Okay. It's still an easy problem if you know everything. Here's the LP; you can solve the LP, and you can also round it very easily, right? You want to maximize the total value generated by the allocation, subject to the capacity constraints: for every ad j, you cannot assign more than G_j impressions, and every impression can be assigned to at most one ad.

The difficulty comes from the online part. The thing is, you don't know the entire input ahead of time. The impressions come one at a time, and you have to assign each impression without knowing what impressions you're going to get in the future. So there's uncertainty about the future; that's what makes the problem difficult.

>>: [inaudible] meaning just one --

>> Nikhil Devanur: One impression. You just assign it to one of the ads.

>>: [inaudible]

>> Nikhil Devanur: I'm sorry?

>>: [inaudible]

>> Nikhil Devanur: Yeah, it's just impression by impression.

>>: There is a sum over G --

>>: It should be J.

>> Nikhil Devanur: Oh, this should be j. Sorry, yeah. This should be a sum over j; I changed this thing here and didn't change this thing.

Okay. So in an earlier paper, joint with Balu and others, we designed a near-optimal algorithm in a particular stochastic model; I'll say what the stochastic model is later. And this algorithm, with some modifications, is what is being used by Microsoft. And then we met this guy called reality. I'm exaggerating a little bit.
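[Editor's note: the offline allocation problem the speaker sketches verbally is, reconstructed from the talk, the following assignment LP, where x_{tj} indicates assigning impression t to ad j:]

    \max\ \sum_{t}\sum_{j} v_{tj}\, x_{tj}
    \text{s.t.}\quad \sum_{t} x_{tj} \le G_j \quad \text{for every ad } j \ (\text{capacity}),
    \qquad \sum_{j} x_{tj} \le 1 \quad \text{for every impression } t,
    \qquad x_{tj} \ge 0.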
The thing is, this was an essentially linear formulation, and in reality there are all these nonlinear things that creep up. One is called under-delivery. The way this works is that the target is not just an upper bound; it's essentially an upper and lower bound. This is the number of impressions you promised the advertiser, and if you don't deliver this many impressions, you pay a penalty. So if you promised a million impressions and only deliver 90 percent, you have to pay a penalty for the missing ten percent. If you assume that this penalty is linear, then it is still a linear program; it's a linear formulation. But in reality, this penalty is nonlinear. It's some kind of convex function: if you only deliver 50 percent, it's really bad. So it's a nonlinear penalty.

The other thing is that advertisers require a mixture of different impression types, and if you deviate from this mixture, it's not so good. So what does that look like? Again, it's not a linear relationship; it's some kind of convex penalty. There is an ideal mixture, let's say you want 50/50 men and women, and the farther you deviate from this ideal, the worse it is, and again it's not linear.

>>: [inaudible]

>> Nikhil Devanur: Yes, we know a lot of the time. Most of the time we know because they're logged in; they've told us what their demographics are. That's just an example; we might know other things too.

>>: You're saying it's not linear, but it's well defined? I mean, someone knows exactly how bad it is to have 48 percent men and 52 percent women. There is a score defined by some human?

>> Nikhil Devanur: Yeah, so you can model it. It's a little fuzzy: advertisers don't really give you this function, but you can kind of capture it. Advertisers won't tell you the exact penalty, but someone who knows the business can model this and say, okay, look, if we deviate too much, this is how bad it should be. So some human will come up with this function.

>>: At least for under-delivery it's precisely known.

>> Nikhil Devanur: Under-delivery, yes; that is actually part of the contract, so we know it exactly --

>>: The algorithm designer --

>> Nikhil Devanur: For the algorithm designer, this function is given. Okay?

Another place where this comes up is the value I mentioned. There are many things we care about: revenue, of course, but also relevance, and maybe clicks, conversions, all kinds of things. One thing you can do is just take a linear combination, and then it becomes a linear objective again, but that may not capture what you really want. What you really want to optimize might depend in some nonlinear way on all these different objectives.

So all these things were not captured, but the nice thing is that each of them occurs either as a concave objective or as a convex constraint. And this is a convex programming version, right? From linear programming we go to convex programming. In the offline world we know we can still solve this, so there is hope. In practice, what we did is we took the algorithm for the linear version and did some hacks, some changes that kind of make sense, and it still works pretty well. But as a theoretician, that was not a principled approach, so I really wanted to figure out, okay, how can we handle these convex things that come up.
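[Editor's note: schematically, with a convex under-delivery penalty g_j for each ad j, the objective described here becomes the following concave maximization; the exact form of g_j is not specified in the talk, it is whatever the business modeling produces:]

    \max\ \sum_{t}\sum_{j} v_{tj}\, x_{tj} \;-\; \sum_{j} g_j\Big(\sum_{t} x_{tj}\Big),
    \qquad g_j \ \text{convex, large when delivery falls short of } G_j.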
So that's the motivation for this problem. Okay. Now I'm going to give you a very general problem formulation that is going to capture all these things, and we call it online convex programming. We would have called it online convex optimization, but that name was already taken, so we had to settle for online convex programming; I think it's a very apt name.

Here you're given a concave function f, a constraint set S, and a time horizon T. These are given ahead of time and they are not changing. Now, this definition is going to look very abstract, so bear with me for a moment and I'll show you how it corresponds to the kind of problems we've been talking about. Abstractly, at every time t a request arrives. What is a request? It's just a set of options. What the algorithm is required to do is pick one option from the set. And what is an option? Each option is associated with a certain vector, and this vector can encode costs or rewards; some dimensions can correspond to costs, some to rewards, and so on. You encode everything in this vector. So the choice set is just a set of vectors in R^d, and the algorithm has to pick one vector from the set, at every time step. And again, this is an online problem: at every time you don't know what the future is going to be; you have to decide in an online fashion.

And what is your goal? The goal is to maximize this given function f of the average of all the vectors that you pick. You can think of it as a function of the sum, but it's just more convenient for us to think of it as an average; the two are equivalent, and if you know the time horizon, you can just scale everything. And then the constraint is that the average has to lie in S. So this is a global constraint and a global objective, because they bind across the decisions you take over time. It's not per-time-step stuff; it's across what you do through the entire time horizon. The average of everything has to lie in S.

>>: [indiscernible] corresponds to some capacity constraint? Then does it make sense to average it over time? I mean, I don't want to violate any capacity throughout time.

>> Nikhil Devanur: Yes, that's a very good question. Our linear programming formulation had exactly this capacity constraint, and when you specialize to that, we can actually make sure that you always satisfy it. But if you have arbitrary constraints, then it's not possible: at intermediate steps it may not be possible to satisfy everything; it may only be possible to satisfy everything at the end. So for this generality, everything is only required to hold at the end.

>>: [inaudible]

>> Nikhil Devanur: You know the time horizon.

>>: [inaudible] because otherwise it could end at any time and expect you to be feasible at all times.

>> Nikhil Devanur: Yeah, you know the time horizon. Think of the opposite of packing, a covering problem. In that case you can only hope to satisfy the constraints at the end, for instance. So let's go back to -- yeah.

>>: Just to make sure I understand: so the thing you don't know is A_t?

>> Nikhil Devanur: Yeah, you don't know what the A_t's are.

>>: I see. And if you knew all the A_t's ahead of time, then you would just need to solve a [inaudible].
>> Nikhil Devanur: Yeah, it's like a convex program.

>>: Oh, I see.

>> Nikhil Devanur: So, okay, if each A_t is convex, if you want, then it becomes a convex program. If your A_t's are discrete, there's still this issue of discrete versus continuous, but because of the averaging, that is only like a one-over-T thing. I should say, we assume everything is bounded, so it's not really R^d; it's [-1, 1]^d. Then the discreteness is not a big issue, so it's essentially convex programming. Okay? So think of A_t as just a polytope.

>>: [indiscernible]

>> Nikhil Devanur: A convex set, yeah. There are going to be some more rounding issues, but it's going to be easy rounding.

All right. So what about the display ads problem? This is a generalization of the display ads problem; how is it a generalization? Here's how we can encode it. If there are N advertisers, the vector has N+1 dimensions. The first N dimensions encode which advertiser the impression is assigned to: there's a one in the dimension corresponding to that advertiser. The last dimension is the value. Your objective is just the last dimension; you only care about maximizing the total value. And what are the constraints? The capacity constraints are box constraints; I just divide by T because the constraint is on the average. So in the j-th dimension, you cannot exceed G_j over T. If you pick this vector, it corresponds to assigning this impression to this advertiser: that uses up one unit of capacity and gives you value v_tj. It's clear how this encodes the display ads problem, right?

Now suppose you had a nonlinear under-delivery penalty. Here's how we would change it: instead of taking f(x) to be just the last coordinate, you would have this penalty term, where g_j encodes what the penalty looks like. g_j depends only on x_j, which is the count of the number of impressions you assigned to advertiser j. So you would have something like this, and if each g_j is convex, the objective is concave, and we can maximize it.

This also generalizes a more general linear version called online packing: you have packing constraints, and you can do a similar encoding. One example of online packing is a network routing problem: the i-th request asks to route one unit of flow from a source to a sink in a network, and the edge capacities are given ahead of time. That's a packing problem. Another example is combinatorial auctions. Buyers arrive; they have combinatorial valuations for sets of items, and let's say you have many copies of each item. A buyer arrives with some valuation; you have to maybe offer prices, or do something else -- you ask him to report his valuation in some way -- then you assign him a bundle of items, charge him something, and the next buyer arrives. That's also an online packing problem. These kinds of things can all be encoded in this format.

And even though online packing, the linear version, was already solved, we get an improvement even there. The improvement is technical: previously, the guarantee of the kind I'm going to show later depended on the number of constraints.
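[Editor's note: a minimal sketch, in Python, of the encoding just described. The quadratic penalty is purely illustrative, the talk only says g_j is convex, and all names here are the editor's, not the paper's.]

    import numpy as np

    # One option: assign impression t to advertiser j for value v_tj,
    # encoded as an (N+1)-vector with an indicator in coordinate j
    # and the value in the last coordinate.
    def option_vector(j, v_tj, N):
        vec = np.zeros(N + 1)
        vec[j] = 1.0        # uses one unit of ad j's capacity
        vec[-1] = v_tj      # value goes in the last coordinate
        return vec

    # Concave objective on the average vector: total value minus convex
    # under-delivery penalties (illustratively quadratic in the average
    # shortfall below the per-step target G_j / T).
    def f(avg, G, T):
        shortfall = np.maximum(G / T - avg[:-1], 0.0)
        return avg[-1] - np.sum(shortfall ** 2)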
And the number of constraints in a formulation like network routing could be exponential in the number of variables, right? Because in a graph, the number of paths can be exponential. So the number of constraints would be exponential, and we get guarantees depending only on the number of variables. So even in the linear case, we get an improvement in some settings.

And there are potentially other applications. We are looking into this, and people are studying similar problems; some of them fit exactly into our model, some need some work, and for some it's not clear that they fit. Hopefully we can extend these techniques to those. There's a lot of work, especially in the operations research community, that looks at similar problems.

Okay. One thing that I haven't mentioned yet is that all of this is in a stochastic model. So what is the stochastic model? It's not the usual worst-case model of competitive analysis. The model we consider is called the random permutation model; this should be familiar to people who have seen the secretary problem. In this model, the collection of sets A_1, ..., A_T is adversarial: an adversary picks the collection, but the order in which they are presented to the algorithm is a uniformly random order. That's the random permutation model, and the stochasticity is only in the order of arrival. There's a related model called the IID model: there is a distribution over sets of vectors -- not over vectors, over sets of vectors -- and each A_t is an IID sample from this distribution. The difference between the two is like the difference between sampling without and with replacement. And it's known that random permutation is the more difficult model: any algorithm that works for random permutation also works for IID, but not necessarily vice versa. There is also a model with time-varying distributions; this is a generalization of IID that is not covered by random permutation, since random permutation is inherently stationary. We can allow the distributions to change over time, but I won't go into the details. For this talk, just think of the random permutation model.

>>: So they're chosen adversarially, but not known to me?

>> Nikhil Devanur: Unknown to you. All you know is that they're coming in a random order.

Okay. And how do we measure the performance of an algorithm? We look at an additive notion called the competitive difference. This is simply the difference between OPT -- if I knew everything, I would pick vectors so that the average is in the set S and maximizes f of the average; that's OPT, the offline optimal solution -- and what we get. Now, we cannot make sure that our average is actually in the set S, but we will make sure that its distance from the set is small. So it's like a bicriteria guarantee: we have to relax the constraints a little bit. Both of these quantities will go to zero, as you will see. And in the special case -- the traditional measure is the competitive ratio, which is a multiplicative thing, and that's what is used for the linear version, for instance.
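[Editor's note: in symbols, writing \bar{x} = \frac{1}{T}\sum_t x_t for the algorithm's average and x^* for the offline optimum, the two quantities being bounded are]

    f(x^*) - f(\bar{x}) \quad (\text{competitive difference})
    \qquad \text{and} \qquad
    d(\bar{x}, S) \quad (\text{distance from the constraint set}),

[and both go to zero as T grows.]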
And the difference there is, again, as I said: for the packing problem, the constraints have to be satisfied at all times. That is too strong for the general framework, but when you specialize to packing, we can make sure the constraints are satisfied at all times, and we can also get a competitive ratio. So it's not like we're playing games here; it's just the nature of the problem. Because of the generality, we have to go additive, and if you specialize to packing, our technique also gives you multiplicative guarantees.

>>: Is there some constraint on f except that it's convex, or --

>> Nikhil Devanur: It's Lipschitz.

Okay. So what has been done previously? As I said, mostly people have worked on the linear version, especially online packing, and there are a bunch of results that are dual based. Essentially, the idea is to learn an optimal dual variable for the packing problem and use these dual variables to do the assignment. This is efficient in the sense that it has to solve a batch LP only logarithmically many times in the number of time steps, so log T times, and in every step, picking the vector from the choice set is fast. So in that sense it's efficient, but it is suboptimal: the guarantees you get are not optimal. Then there is the paper I mentioned earlier, the one that is used in practice, which uses something called a hybrid argument. It is also efficient -- it also solves a batch LP log T times -- and it gets the optimal bound for the IID model, but for the random permutation model it's suboptimal. Actually, we didn't know any bound for random permutation; it only works for the IID model. And then last year there was this paper by Kesselheim et al. that used a primal approach -- there are no duals -- and it was optimal for both the random permutation and IID models, but it was really inefficient: it has to solve an LP at every time step to pick the vector. It's polynomial time, but it's not something we would ever use in the advertising application, or most applications, for that matter.

Our result gives the best of both worlds, and also generalizes to convex programming. It is efficient; it's a primal-dual approach, and we have to solve a batch LP or a batch convex program. In the convex programming case, we again have to solve it log T times, but for LPs we have to solve it only once, which improves even on the earlier algorithms that solve it log T times. And it is optimal for both the random permutation and IID models. Another thing I like a lot about this: in all of the earlier works, the proof is quite complicated; it's not completely clear what's happening. Here we give a very simple and modular proof: we have different components, and it's easy to see how the components interact with each other. Simultaneously with our work, people did similar things, but again only for the linear version; they give similar results for the linear version that match ours. And there is also work on a linear version with some local convex/concave objectives, not the global concave objective that we have.
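[Editor's note: the landscape just described, in table form:]

    approach                      batch solves                    guarantee
    dual based                    batch LP, log T times           suboptimal
    hybrid argument (deployed)    batch LP, log T times           optimal for IID only
    primal (Kesselheim et al.)    an LP at every time step        optimal for RP and IID
    this work (primal-dual)       batch LP once; batch convex     optimal for RP and IID
                                  program log T times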
Okay, so what do we get? Essentially, we get close-to-optimal results: a competitive difference of one over square root of T, which goes to zero as T goes to infinity. This should look familiar to people in online learning: the square-root-T regret. In particular, for the objective we get one over square root of T times (Z + L). L is the Lipschitz constant of the function, which you asked about, and Z is another, problem-dependent parameter that I will talk about later. For the constraint there's no such thing; it's just one over root T. Okay?

>>: Assuming all the entries of the vectors are [indiscernible] --

>> Nikhil Devanur: In minus one to one, yeah. And C is just some [indiscernible] constant.

>>: Oh, okay.

>> Nikhil Devanur: Yeah, I don't know why I have a C there. And the thing is, we get even better bounds for some special cases: if the objective function is smooth, you can actually get log regret. This was not known earlier. The one-over-root-T bound is tight in general, even for linear problems, but if the function is smooth, you get log regret. This comes from the corresponding log regret for strongly convex functions in online learning. We get the corresponding bounds here, and you'll see why, because of this modular proof: we use online learning as a blackbox, and any improvement in the regret there translates into an improvement here.

>>: [inaudible]

>> Nikhil Devanur: Yeah.

>>: So smooth --

>> Nikhil Devanur: So smooth here translates to strongly convex in the online learning, because the online learning works with the Fenchel conjugate of f, and the conjugate of a smooth function is strongly convex. So smooth becomes strongly convex there, and then --

>>: [inaudible] be able to see that immediately?

>> Nikhil Devanur: That the Fenchel conjugate of a smooth function is strongly convex?

>>: That is fine, but --

>> Nikhil Devanur: Yes.

>>: But why you get to do it?

>> Nikhil Devanur: No, I haven't told you yet, so you shouldn't -- yeah. Maybe you can, but I couldn't.

Okay. And as I said, in the special case of online packing, we get the optimal competitive ratio that was known before.

Okay, so now to the good part: the algorithm, right? What I'm going to do is consider a very simple special case. It's a one-dimensional problem, and there are no constraints; there's only an objective. For the sake of nice diagrams, I'm going to switch to minimizing a convex function; it's equivalent, right? So you're given a convex function ahead of time -- that's all you're given -- and the number of time steps. In every time step you see a set of points; again, these should be in [-1, 1], so you see a set of real numbers. You have to pick one number, and then you repeat. The goal is to minimize the value of this convex function at the average of all the numbers you picked. Okay? So this is the game. And as I mentioned, the competitive difference compares to the optimal choice in hindsight.

So here's an example. You're given this function ahead of time. In the first time step you get two points; you have to pick one. In the next step you get, let's say, two more points; you pick one, and you repeat. Every time step you get two choices and pick one. These are the points you picked; at the end of the algorithm we compute the average of these points and look at what the function is at this average. This is what we got.
And in hindsight, after everything is done, you can look at all the points, all the choices you had, and say, okay, if I had known everything, I should have done something else; that's your optimal solution. If you had picked these points, you would have gotten this average and this value. And this difference is the competitive difference. Earlier we used to call it regret, and somebody pointed out that it's called competitive difference, not regret.

>>: Computing the best thing in hindsight, how do you do that?

>> Nikhil Devanur: Doesn't matter. You can use exponential time and do it.

>>: I thought that was something that couldn't be done.

>> Nikhil Devanur: You can approximate it very well: you can solve a convex program and round it, and it's going to be very close. But for this purpose you might as well take exponential time and compute the optimum. It doesn't matter; you're going to see how we get away from having to characterize it exactly.

>>: [indiscernible]

>> Nikhil Devanur: Here there's no constraint; there's only an objective. So this is a very special case, but I picked it to illustrate the main ideas of the algorithm, and everything I say generalizes very nicely. Okay?

>>: So this optimal choice, the regret, will depend on your random samples? Like if you did the same random sample --

>> Nikhil Devanur: In the random permutation model, the optimal solution doesn't depend on the stochasticity, because the order doesn't matter. In the IID model it would matter; then you take the expectation of the optimum.

Okay. So here's the algorithm. At every time step, the algorithm is going to pick a tangent to this convex function. A tangent is parameterized by its slope: the slope of this tangent is theta_1, and if you fix theta, you can think of the tangent as a linear function of x. I'm going to pretend that this is the function I'm really optimizing; I'm going to forget about h and just try to optimize l. Okay? In one dimension this just means that if the slope is positive, you pick the leftmost point, and if it's negative, vice versa, because I'm minimizing. But you can imagine that in many dimensions this is going to be nontrivial: there I'm going to have a hyperplane, a linear function, and given the choice set A_t, I will optimize this linear function over A_t.

>>: So here you will always choose either the leftmost or the rightmost point?

>> Nikhil Devanur: Yes.

>>: Should it be clear to me that --

>> Nikhil Devanur: It's a linear function, right?

>>: I understand. The offline -- take the offline problem.

>> Nikhil Devanur: Not offline. This is the algorithm.

>>: No, I understand. But now go to hindsight. You have the offline problem. One problem is the original problem: just choose the optimal locations. The second one is: choose the optimal locations, but you're restricted --

>> Nikhil Devanur: Yeah, the offline --

>>: -- to the left or the right one.

>> Nikhil Devanur: Yeah.

>>: Isn't there a gap even there?

>> Nikhil Devanur: There shouldn't be, because of [inaudible].

>>: Should it be clear to us why that's the --

>> Nikhil Devanur: It shouldn't be clear. It's not clear a priori.
That's a very good point. It's only because the analysis works that there shouldn't be a gap, because here the algorithm only ever picks one of the extreme points. It's a convex function, so -- yeah, that's a very good point. It's not clear up front that you can restrict the algorithm to one of the extreme points and still get something close.

>>: [indiscernible] offline --

>> Nikhil Devanur: Yeah.

>>: [inaudible]. A_t is always, you know, zero and x. 0 and x, 0 and x. I mean, maybe your algorithm will do something smart.

>>: You just choose zero.

>> Nikhil Devanur: No, no. At the beginning it could --

>>: But think of the offline version -- just the offline version. You know everything.

>> Nikhil Devanur: You have to have three or more points for this to be nontrivial; I mean, if there are only two points, of course you're picking either the leftmost or the rightmost. So imagine you have many points, and maybe there are points in the middle that you always want to keep picking, whereas the algorithm will keep picking the extreme points, and it's not at all clear why that could do as well as a point in the middle. That's a very good point; it's not really clear. It's only through the analysis that, you know, it has to be.

>>: The gradient you're looking at, is that at the current average?

>> Nikhil Devanur: No. I have somehow picked theta_1, and I'm looking at the tangent to h with slope theta_1. Think of it as a linear function of x.

>>: Sure.

>> Nikhil Devanur: And I'm going to pick the point that minimizes this linear function.

>>: And that is your current --

>> Nikhil Devanur: No. Think of this as a linear function of x; I have two choices. Which one minimizes --

>>: [indiscernible]. Is the slope the current --

>> Nikhil Devanur: It's the slope of the current linear function that I'm going to pick.

>>: Yeah, but I guess the question is [indiscernible]. How do you pick --

>> Nikhil Devanur: Yeah, okay. The first time, I pick the slope somehow, and I just pick the point that minimizes. Now, how do I pick the slope in the next time step? I'm going to use online learning as a blackbox. I'm going to feed what happened, essentially this x_t, to the blackbox, and it's going to spit out the next slope, and I'm going to use that slope. So it spits out a slope, and now when I get these two points, I'm going to pick this one, okay?

>>: [indiscernible]

>> Nikhil Devanur: And I repeat.

>>: [indiscernible] what happened.

>> Nikhil Devanur: Okay. I picked theta_1 somehow, arbitrarily; doesn't matter. That told me to pick this point. Now I'm going to feed this to the online learning blackbox. Exactly how that works I'll tell you later, but it's some blackbox: I feed it, and it tells me what slope to pick next. It says, okay, pick this slope; I pick the tangent with that slope, and I repeat. Now I get these two points, this thing says I should pick this one, I pick it, and then I go back to the blackbox, which tells me the next slope. So now you see where the Fenchel conjugate comes in.

>>: There's something more, right? It's a tangent.

>> Nikhil Devanur: Yeah.

>>: That's a very specific linear function.
>> Nikhil Devanur: Okay. So actually, the y-intercept doesn't matter.

>>: You keep saying it doesn't matter.

>> Nikhil Devanur: It doesn't matter for the choice. But it matters for the analysis.

>>: Okay. [indiscernible].

>> Nikhil Devanur: For the choice it doesn't matter, right? You can move the tangent up and down. The y-intercept is where the Fenchel conjugate comes in; it's the Fenchel conjugate at theta.

>>: You basically look at where the line with that slope touches --

>> Nikhil Devanur: Yeah. Usually there's an explicit formula. The intercept is always given by the Fenchel conjugate, h* of theta: the tangent is just x times theta minus h* of theta.

>>: So all that the online learning blackbox takes is just the slope?

>> Nikhil Devanur: Yeah. So, yeah, x_t is all that matters, really.

>>: Oh.

>> Nikhil Devanur: Assuming it knows h. It knows h. I'll tell you exactly what the blackbox does later, but think of it as a blackbox: there's a blackbox here, and another blackbox for the online learning. Think of it like that.

>>: [indiscernible] some kind of framework.

>> Nikhil Devanur: Yeah. I use this to define some kind of reward for that time step, and that's going to tell me the next slope.

>>: [indiscernible] blackbox includes not only your slopes, but also the previous x_t, where it touches.

>> Nikhil Devanur: Yeah, I'm going to tell you exactly what it is. For now, have some patience; it's coming soon.

Okay. So the algorithm is clear: there is some blackbox that keeps updating the theta, and it uses online learning. What I have not told you is how to define the reward. That's the algorithm: every time step I just do this and repeat; that's how I pick the points. So we're done. This is the average, this was the optimum in hindsight, and this is the competitive difference.

So now, what's happening? Here I want to look at the average of the linear functions that I used, the l(x_t, theta_t). This is what I pretended I was optimizing, and I want to look at what happened with it. What I'm going to argue is that this average is a lower bound on h(x*). And then I only have to bound this gap, because this is the competitive difference; I really don't care what x* is, assuming this is a lower bound. It's not clear yet why this should be a lower bound. Okay? So what is this gap? This gap is the actual function h at my average -- the function I should have been minimizing -- versus my pretend function, the one I was pretending to minimize. So what's the difference between the actual function and my pretense? That is the gap I have to bound, and bounding it gives me a bound on the competitive difference.

>>: It is a lower bound because you used the --

>> Nikhil Devanur: No. The lower bound uses stochasticity; it's not always true. You'll see in a little bit; I'll show you why this is a lower bound. It is a tangent, yes, but there is more to it. So now -- I said think of random permutation, but for the moment this holds only for the IID case. Okay? Think of drawing from the empirical distribution: you've seen everything, and now you draw one of the sets at random. Think of it as sampling from the empirical distribution.
Now let's consider the expectation of this draw, the expectation of l(x_t, theta_t). For each set, the point we picked minimizes l, so it is smaller than the point that would have been picked in OPT. And x* is the average of the optimum choices. So because we're optimizing for l, we pick the best one; it is at least as good as the one in OPT, and that is why the expectation of this is at most l(x*, theta_t).

>>: So x_t is your choice then?

>>: What is this expectation over?

>>: This x_t is your choice at [indiscernible].

>> Nikhil Devanur: So actually, here what it should be is --

>>: x* is not necessarily --

>> Nikhil Devanur: x* is an average.

>>: Oh, it's an average. Gotcha. I see.

>> Nikhil Devanur: Yeah. So fix theta_t; now you're picking the set at random, and for every set, the point you pick is at least as good as the one OPT picks.

>>: Yes.

>> Nikhil Devanur: So when you take the expectation, you just get x*.

>>: So this expectation is over which randomness?

>> Nikhil Devanur: You fix theta_t, and the expectation is over the A_t, picked at random.

>>: [indiscernible]

>> Nikhil Devanur: Which set you're optimizing over. So this holds for the IID model, because there, every time step is a fresh random sample. And this is exactly the difference between random permutation and IID: for random permutation, the expectation is going to be slightly off, and what we have to do is look at the difference between the random permutation expectation and the IID expectation, accumulated over all the time steps. We get an extra error term because of that, and it is also like one over root T, so everything works out. So this is exactly the point where random permutation and IID differ. This was one of the things we didn't understand: okay, why do they differ? The earlier analyses didn't throw any light on it, whereas here we capture exactly where the two differ.

>>: The difference term is one over root T also [indiscernible]?

>> Nikhil Devanur: No, no. Actually, the smooth version only works for --

>>: It's going to be there, right? Because this root T is --

>> Nikhil Devanur: Yeah, so the log-regret improvement only works for IID. Yeah.

>>: And is this the only place where you use the stochastic model?

>> Nikhil Devanur: Yeah.

>>: You could also say that the A_t could be completely adversarial, but you just want x* to be in the convex hull of the A_t at each time step. If that's true --

>> Nikhil Devanur: Yeah, that's all that we use [inaudible]. And yeah, this part is because we're using a tangent, but this step is the nontrivial step where we use the stochasticity. So, the next slide: in expectation, this is the lower bound. It's not true point-wise -- at some points it will not be -- but in expectation it is true, and we can also convert it into a high-probability statement. So that's fine.

Okay. So the summary of all this is: we can forget about x*; we just have to bound this gap, the gap between h at the average and the average of the linear functions. And this is where online learning comes into the picture.
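[Editor's note: in symbols, the lower-bound step argued above is, for the IID case: since x_t minimizes l(., theta_t) over A_t,]

    \mathbb{E}_{A_t}\big[\, l(x_t, \theta_t) \,\big]
    \;\le\; \mathbb{E}_{A_t}\big[\, l(x_t^{*}, \theta_t) \,\big]
    \;=\; l(x^{*}, \theta_t)
    \;\le\; h(x^{*}),

[where x_t^* is OPT's choice on A_t, the equality uses the linearity of l(., theta) in x together with x^* being the average of OPT's choices, and the final inequality holds because a tangent lies below a convex function.]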
And what is online learning? I think most people here know, but let me do a quick summary. You want to make accurate predictions using what happened in the past. At every time t, you predict some vector theta_t in some domain, and after you predict, you get a reward r_t(theta_t), where r_t is a concave function. These are reward functions, and you want to maximize them. The regret is: the reward of the best single theta in hindsight, if you knew all the reward functions, versus the reward of the thetas you picked online. Okay, so that's the online learning problem. Notice that these rewards are local, unlike our problem, where the objective is global; the benchmark here is just a sum of the r_t's, right? And there are many algorithms that essentially get this one-over-square-root-T regret; I define everything in terms of averages, so the regret becomes one over square root of T.

Okay, so how do we use this? It turns out that, the way I've set things up, the gap we have to bound is exactly the online learning regret, if you define the rewards accordingly. The reward we use is this tangent. Now think of it as a function of theta: earlier we fixed theta and thought of l as a function of x, but now I fix x_t and think of l(x_t, theta) as a function of theta. This is a concave function, and it is the reward that I feed back into the online learning.

>>: Sorry, I'm a bit confused. Like, I've got a set A_t, right?

>> Nikhil Devanur: Yeah. x_t is what I picked; I picked some point x_t. Now I have to go to the online learning and define the reward function for this time step, for it to spit out the next slope, right? And what is the reward function? It's a function of theta: fix x_t and look at l(x_t, theta). That is a function of theta.

>>: [indiscernible] because x_t could change. You have a slope at some point; if I tilt it that way, then its [indiscernible] are x_t, right?

>> Nikhil Devanur: No, no; x_t is whatever I picked, the vector I picked in the last time step.

>>: So this is for theta t plus one.

>>: Oh, okay.

>> Nikhil Devanur: So my algorithm picked this particular x_t in time step t.

>>: Okay. So here you're picking theta t plus one.

>> Nikhil Devanur: Here I'm picking theta t plus one.

>>: [indiscernible]

>> Nikhil Devanur: Okay? The algorithm picked the particular x_t. Now I'm going to use this x_t to define the reward for the online learning. People had questions about what this reward is; this is it. Fixing x_t defines a function of theta: you look at the tangent with slope theta and evaluate it at x_t. That's the reward, and it's a concave function of theta. Okay?

So now, what is the online learning regret? It's the reward of the optimum choice in hindsight, minus whatever the algorithm got. The algorithm's part is exactly the second term of our gap, which we already have. And I'm going to show you in a moment that the other term, the optimum in hindsight, is going to be exactly h of the average. Okay?
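[Editor's note: a minimal sketch in Python of the one-dimensional algorithm as described so far. The talk leaves the online learning blackbox abstract; here, online gradient ascent is used as one possible instantiation, with an illustrative h(x) = x^2 whose Fenchel conjugate is h*(theta) = theta^2/4. The step size and the clipping are the editor's assumptions, not the paper's.]

    import numpy as np

    h = lambda x: x ** 2                     # illustrative convex function on [-1, 1]
    h_conj = lambda theta: theta ** 2 / 4.0  # its Fenchel conjugate (on [-2, 2])

    def run(requests, T):
        """requests: iterable of T arrays of points in [-1, 1]."""
        theta, picks = 0.0, []               # arbitrary initial slope
        for A_t in requests:
            # Pretend to minimize the tangent l(x, theta) = theta*x - h*(theta);
            # over x this is just the linear function theta*x.
            x_t = min(A_t, key=lambda x: theta * x)
            picks.append(x_t)
            # Online learning step on the concave reward r_t(theta) = l(x_t, theta):
            # gradient ascent with gradient x_t - dh*/dtheta, kept in the domain.
            theta = np.clip(theta + (x_t - theta / 2.0) / np.sqrt(T), -2.0, 2.0)
        x_bar = float(np.mean(picks))
        return x_bar, h(x_bar)               # compare h(x_bar) to the hindsight optimum

[One design note: for this quadratic h, the update's fixed point is theta = 2 * x_bar, i.e., the slope of the tangent at the running average, which matches the picture described in the talk.]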
So this regret is exactly the gap that we wanted to bound. Why is that? Note that l is linear in x. So for any theta -- theta* will give the optimum, but take any theta -- I can move the average inside, because l is a linear function of x if you fix theta. In symbols: since l is linear in x,

    \max_\theta \frac{1}{T}\sum_t l(x_t, \theta) \;=\; \max_\theta\, l(\bar{x}, \theta) \;=\; h(\bar{x}),

where the last equality is the tangent identity: the pointwise maximum over all tangents of a convex function recovers the function. So now, what is this function? You take the average, and for any theta you look at the tangent with slope theta and evaluate it at the average. Now you want to pick the theta that maximizes it. So what would you pick? You would, of course, pick the tangent at the average itself, right? Because if you pick anything else, you only get a smaller value; the best thing to do is to pick the tangent at the average, and then you just get the value of the function.

>>: [indiscernible]

>> Nikhil Devanur: No, no. What is this function? You pick any tangent and look at the value of that tangent at the average. That's the function.

>>: [indiscernible]

>> Nikhil Devanur: I mean, this is my definition of l, right? l(x_t, theta), as a function of theta: you take the tangent with slope theta and ask what its value is at x_t.

>>: [indiscernible]

>> Nikhil Devanur: Yeah, yeah, for a convex h, of course. Right. So the best value in hindsight is exactly h of the average, and this regret is exactly the gap we wanted to bound. Okay? So that's it. Note that there is a switch here, in some sense: the best-in-hindsight term of the online learning is somehow the performance of the algorithm -- how much the algorithm got -- and the online learner's own reward is somehow the lower bound on OPT. So there is a switch, which is interesting. Maybe it's natural; I don't know.

Okay, so here's the overall algorithm. Of course, the same thing extends to many dimensions. Now, consider the special case where you have either only an objective or only a constraint. If you have only a constraint, it's like having only an objective, namely the distance from S. Okay. If you have only the objective, in many dimensions you do exactly this: the thetas are like Fenchel conjugate variables. These thetas give you some kind of linear function; you can interpret them as shadow prices and so on, depending on the specific scenario. All you're doing is optimizing the linear function given by theta over the choice set, then feeding the result to the online learning, with the reward defined appropriately. It's the same thing: take the tangent -- it's still a concave function of the thetas -- and that's going to tell you what the next theta, the next slope, should be. Then you repeat. It's the exact same thing I described; it works in any dimension.

>>: [indiscernible] for finding theta_1.

>> Nikhil Devanur: So, okay. In this case, when you have either only an objective or only a constraint, I don't have to solve any LPs. I have to solve LPs when I have both. Even in the earlier work, you don't have to solve any LPs if there is only an objective or only a constraint, right?

Okay. So now, what happens when you have both an objective and a constraint? We're going to run two separate instances of what I showed you on the previous slide. For the objective, I'm going to have thetas; for the constraint, I'm going to have lambdas. But now, how do I pick the vector?
I'm going to combine these two using a value Z. Okay? So what is Z? It's the tradeoff between the objective and the constraint: if I violate the constraint by epsilon, how much does the objective change? So OPT(epsilon) is: I take an epsilon ball around S -- all points within distance epsilon of S -- and look at the OPT of this relaxed problem. OPT(0) is just OPT, and Z is like the derivative of OPT(epsilon) with respect to epsilon. That's what I said: if I violate the constraint a little bit, how much more should I expect to get in the objective? We assume you're given Z ahead of time, or you can estimate it from a sample; this is where we need to solve a batch LP or a batch convex program. And the thing is, we only need a constant-factor approximation to Z, so we don't need very accurate estimates; it's a much easier estimation problem. Okay? And then, at every time step, we combine the two: we have the theta and the lambda, we combine them using Z -- this is again a linear function -- and we optimize this linear function over the choice set.

>>: This is also the reason why you do not solve LPs often, right?

>> Nikhil Devanur: Exactly.

>>: Because with a small sample, you get a constant factor anyway.

>> Nikhil Devanur: Very good point. For the LP version, you only have to solve it once to get a constant factor.

>>: Small samples of --

>> Nikhil Devanur: Yeah, a small sample, and only once, is enough to get a constant. I just need, like, an epsilon-squared fraction of all the requests: if there are T requests, I take epsilon-squared T samples, and that already gives me a constant-factor approximation. And I only have to do it once.

So why does this work? The analysis is a little more complicated and I don't have the time to go through it, but it works.

Okay, let me summarize. Here are, I think, the salient features of this algorithm. It's very general: it works for essentially arbitrary convex constraints and concave rewards. It's based on Fenchel duality; I didn't phrase things in terms of Fenchel duality, but that is what underlies it, and it seems to be a very powerful, or at least the right, tool to attack this problem. For a long time we were struggling with random permutation versus IID: they should be the same or very similar, but we didn't really have a good explanation. This analysis captures exactly where the difference comes from. In IID, the expectation is the same throughout; in random permutation, the expectation is changing, and it's just that difference. If you take the random permutation and, at every time step, compare the expectation over the remaining items with the expectation over the whole, and sum it up, that's exactly the extra term you get for random permutation. So it really captures the difference between the two. And the other nice thing is the modular proof, where we use online learning as a blackbox and don't really care what algorithm you use for the online learning. Earlier proofs were very specific; even the simultaneous work I mentioned was very specific to the particular algorithm used to solve the online learning. Here we show that it doesn't matter: all I care about is that it has low regret.
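[Editor's note: schematically, the combined selection rule described above is, up to signs and scaling that the talk does not pin down,]

    x_t \;\in\; \arg\max_{x \in A_t}\ \big(\theta_t + Z\,\lambda_t\big)\cdot x,

[where theta_t comes from the online learner for the objective, lambda_t from the one for the constraint, and Z weighs the two. This is a reconstruction of the spoken description, not the paper's exact pseudocode.]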
And this was actually conjectured back in 2007, when Mehta et al. came up with the AdWords problem. They said, oh, there should be some relation to the experts algorithm, and all the subsequent works hinted at this connection, but there was no formal connection; we make the first connection. And any progress beyond this -- if you can get better regret for the learning problem, that translates into a better competitive difference, as we saw with the smooth functions. And of course, that improvement is limited to IID; for random permutation, the extra term is already one over root T, so [indiscernible].

Okay, I think there's more to do here. One direction that could be useful is to go beyond the stochastic models that we have. We have IID, random permutation, some time-varying distributions, but these are mostly stateless; maybe some kind of Markovian process or something might be a way to extend it. The other problem: there is a scheduling version of this, which doesn't quite fall into the general framework I described, and I think figuring out similar algorithms and similar guarantees for scheduling would be very interesting; we're working on this with Balu and others. And for me personally, another thing I would like to see is more applications of this. It seems like a very general framework, so I'm interested in what other problems people have worked on that we can formulate in it. Maybe that will point to some ways of fixing up the model: maybe they won't fit in exactly, but they may be close enough that we can extend it. So, thanks.

[applause]

>>: Questions?

>>: Can you go the other way? Is there a connection [indiscernible]?

>> Nikhil Devanur: That's a good question. Maybe.

>>: Just like [inaudible].

>> Nikhil Devanur: I actually haven't thought about that, but it is definitely possible.

>>: There is some recent work on worst-case analysis of online algorithms with convex costs. How are these algorithms related? I mean, I noticed --

>> Nikhil Devanur: The problems are a little different.

>>: The resulting algorithms are similar, because --

>> Nikhil Devanur: Not sure. Are the algorithms similar? No, probably not. The results are different. There you have to keep the covering constraints satisfied at every time step; you have to maintain them, so that makes a difference.

>>: So you have more freedom, intuitively, to --

>> Nikhil Devanur: Yeah, we only have to satisfy the constraints at the end, so I can say we have more freedom. And also we're crucially using the stochasticity. Maybe not so crucially, as I mentioned. Yeah. Actually, that's a good point.

>>: You could just take that as the condition.

>> Nikhil Devanur: Yeah, but it holds only in expectation or something, so that condition may be too strong, right?

>>: I was saying --

>> Nikhil Devanur: No, no. It's like two statements: either this or that.

>>: You can say -- I mean, that you have a distribution that satisfies this. The distribution could be, you know -- [indiscernible] -- I mean, something.

>> Nikhil Devanur: Yeah. Okay, okay, yeah. That's a good point.

>>: Some factor [indiscernible].

>> Nikhil Devanur: What I'm saying is actually very similar to the time-varying distribution model that I mentioned but didn't describe.

>>: Okay.

>> Nikhil Devanur: It's very similar to that. In some sense, that's what the time-varying distribution model does.
Still, each step has to be independent of the other steps, but that distribution, in some sense, has to include this x*.

>>: [indiscernible]

>> Nikhil Devanur: Yeah, so for the linear version this one over root T is tight.

>>: For the constraint as well?

>> Nikhil Devanur: Yeah. Yeah. Unconstrained -- so I guess with a linear [indiscernible] you can just think of it as the distance from the set kind of thing, I think. I don't know, Balu. Even for just minimizing a function, yeah, I don't know. But at least for the packing problem, where you have a linear function and packing constraints, it's tight. For smooth functions, for instance, it's not; you can get this log T, for instance.

>>: [indiscernible]

>> Nikhil Devanur: That's a good question. The thing is, even though it's only the linear version that we solved and the rest is hacks, it's not that it works badly. It works so well, and they're so happy with it, that they're not developing the algorithm anymore; they're focusing on other things. Currently they're happy with the algorithm as it is. Maybe at some point they'll come back to it and want to revamp it, and maybe then we'll get them to use this. But as of now, they are quite happy with it, so --

>>: So right now it's a linear model that --

>> Nikhil Devanur: Yeah. Yeah.

>>: No need to solve an LP; you're not solving an LP --

>> Nikhil Devanur: So, yeah, they are solving an LP. They solve this LP offline, and that kind of feeds the online part; online, they don't solve an LP. It's just --

>>: [indiscernible]

>> Nikhil Devanur: The offline LP is big.

>>: The offline LP [indiscernible].

>>: But what if they use --

>> Nikhil Devanur: So actually, they're using the same algorithm.

>>: What tools are they using for that?

>> Nikhil Devanur: They're using the same algorithm. When you have to solve it offline, you can shuffle the order, so you know you're getting a random order. That way you can solve it. [indiscernible].

>>: I think there's another interesting model. So in your [indiscernible] in terms of T, [indiscernible]. But that means your A_t is actually drawn, somehow, from a finite set. [indiscernible].

>> Nikhil Devanur: Yeah.

>>: The same request arrives, like, a third time.

>> Nikhil Devanur: But that's something -- I don't think the support of this distribution is small, right? It's a distribution on sets of vectors, and I don't think its support is small. I think the support is much bigger than T, or at least as big as T. That's how I would model it, right?

>>: What happens as far as [indiscernible]?

>> Nikhil Devanur: If the support is small, then yes, you could actually learn the distribution and do something simple. But every impression is essentially different, so --

>>: [indiscernible]

>> Nikhil Devanur: So the support is huge, and that's why we need to do something other than learn the distribution.

[applause]