>> Nikhil Devanur Rangarajan: Hello everyone. Welcome to the MSR Talk Series. Today it's my pleasure to introduce R. Ravi, who is a professor at the Tepper School of Business at CMU, and formerly Associate Dean of Technical Strategy. Ravi has been a regular visitor here at MSR Redmond, and it's always a great pleasure to have him here. Today he's going to talk about correlated stochastic knapsacks and non-Martingale bandits.

>> R. Ravi: Thank you, Nikhil. Thank you all for coming. This is one of the papers that grew out of teaching in the MBA classroom. I was teaching a class on applying optimization to marketing, and after you teach a class two or three times you actually start focusing on what the model is that you're trying to solve. You strip away everything, and what we came away with was some kind of a multi-armed bandit problem with rewards that don't obey a commonly assumed property, the Martingale property. So that's where we started. But then we worked our way down -- we simplified and simplified, and we came down to a knapsack problem. So I'm going to do the talk in the reverse order. I'm going to start with the simplest problem, the knapsack problem, and add one or two complications to it, so it's monotonically increasing in difficulty. And then once I have a solution for stochastic knapsacks with all the extra constraints, I'll try to make a leap to this multi-armed bandit problem.

But before I get started, this is joint work with my colleague in the computer science department, Professor Anupam Gupta, and two graduate students, Ravishankar Krishnaswamy and Marco Molinaro, who are in the computer science department and the school of business, respectively. The full paper is out there -- it's about 30, 35 pages -- and there will be a short version in the proceedings of the FOCS conference this year.

So most of you probably know the knapsack problem. I'll start with a very simple extension of it. We have a knapsack with some capacity B to fill, and in the traditional knapsack problem the items that you're trying to fill the knapsack with have sizes and profits -- or weights, as I sometimes call them, or rewards. In the traditional, deterministic knapsack problem you have to pack a subset of items that don't overfill the knapsack and maximize the reward that they fetch. The stochastic version now has uncertainty associated with the size of the item and/or the reward that you get from the item. So you pick an item, but you actually don't know how much of the knapsack it's going to occupy, and maybe you also don't know how much reward you're going to get. You have some distributions over these.

The goal of the stochastic version is the same, but there's a slight subtlety here. You first pick an item, and only when you pick it do you find out how much it's going to fill up the knapsack. Once you pick an item you put it in the knapsack. Then you look at what happened -- which items you have in your knapsack and what their realized sizes are -- and you can use all that information to pick the next item to fill your knapsack with. So you're filling in the items one by one; there's some kind of inherent online nature to this problem. And again, I'm maximizing expected reward. For now, let's think of the rewards as being deterministic; only the item sizes are uncertain.
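As a rough formalization of the problem just described (the symbols B, w_j, S_j, D_j are ours, not from the slides): each item j has a deterministic reward w_j and a random size S_j drawn from a known distribution D_j that is revealed only after j is placed; the process ends the first time the total placed size exceeds the capacity B, and the goal is

\[
\max_{\text{adaptive policies}} \; \mathbb{E}\Big[ \textstyle\sum_{j \text{ placed and fitting within } B} w_j \Big].
\]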
So the first difficulty here is that the solution I'm looking for is in general a strategy. I'll have to tell you: if I saw these four items before, with these respective sizes realized from the distribution, then here is what to pick next. So if I have to explicitly write out the whole optimal solution, it would be of exponential size, because I'll first pick an item and there will be different possible size realizations, and for each one of those I'll have to tell you which is the next item to pick, and so on. And in general this problem is PSPACE-hard, because of this complexity of describing the optimal strategy.

>>: So do the item sizes and rewards come from a succinct --

>> R. Ravi: Yes, you're given the distributions. Usually in this case -- I'm just thinking of simple finite support here. A very simple discrete case, nothing fancy about the distributions.

So let's do an example. This example is from the paper that introduced the problem, by Brian Dean, Michel Goemans and Jan Vondrak, in their journal paper. We have three items, and the knapsack has size 1. All of them have the same deterministic reward of one. The first item has size .2 -- 20 percent of the knapsack -- with probability half, and .6 with probability half. And so on. Here is the optimal strategy. First you insert item one. If it comes up with the smaller size, then you know for sure you can insert item two, because you have enough space, .8, left in the knapsack. But if it doesn't, you know item two is useless to you, because after the .6 you can't possibly fit the .8, which comes up with deterministic size. So you try your luck with the third item, the .3, and indeed the third item will fit with probability half. So one-fourth of the overall time you'll be able to fill the knapsack with items one and three, and one-half of the time you'll be able to fill it with one and two. And the remaining one-fourth of the time, the second item that you tried to put in, which is item number three, will overflow, and you'll only get value one. So the expected profit here is one plus .5 plus .25, which is 1.75. 1.75 is the expected reward from this strategy, and it is optimal for these three items. Good.

So one of the big ideas of the Dean-Goemans-Vondrak paper, which I'll just call DGV from now on, is to ask: what if I don't want to bother with telling you about this tree, this whole strategy, and I just want to give you a very succinct solution? These succinct solutions are nonadaptive policies; they're just permutations of the items. I just give you an order, you go down that order, and when you pick an item which overfills the knapsack you stop, and that's the end of it. So I'm not allowed to change the order depending on the real sizes that the previous items came up with. A nonadaptive policy is just an ordering of the items in which I will process them, and no matter what size is realized for each item, I'll just keep going down the row until the thing overfills.

>>: What do you mean ordering? The ordering is fixed, right?

>> R. Ravi: So I want to compare, again, the best such order. That would be the best nonadaptive policy.

>>: Wait. I thought the order was adversarial --

>>: It's offline.

>> R. Ravi: It's offline. As I set it up, it's an offline problem. You're allowed to determine the -- yeah.
So in fact the strategy tree will give me different permutations, yeah.

>>: I don't need to know that. I only probably need to know the sum of the sizes of the things that I -- or do I really -- could it also depend on the individual --

>> R. Ravi: Probably the sum is enough. If you write it as a decision process and try to solve it, yeah, that's probably enough in this case.

So in this case, for example, if the order I use is actually one, two, three, what would be the reward of that policy? Well, I lose the chance that when I schedule one and the item comes up with size .2 -- right, I lose that chance, that third branch in the tree where I schedule item three.

>>: [inaudible].

>> R. Ravi: Sorry, when it comes up with .6, I lose the chance of going to three. Thank you. So what I'll have to do then is just schedule two, and I'll lose -- so in fact I don't get to play -- yeah, when it comes up .6 I can't insert three. If I think of it as this tree, all the nodes at this level of the tree are the second item in that ordering. So I can't insert two here and three here; this also has to be a two, in which case -- yeah, this will overfill.

>>: So if you -- I didn't understand. So if you pick item two, if you get the .6 and then you get the .8, what happens? You have to stop completely?

>> R. Ravi: Yeah, at that point the knapsack is full, and I end. That's it. That's the end of the procedure.

>>: So as soon as you start to overflow.

>> R. Ravi: As soon as you overflow it's over. Even if there is a tiny item which for sure would fit in later, yeah, the game's over.

>>: And that's overflowing on --

>> R. Ravi: That's right. That's right.

>>: I see.

>> R. Ravi: That's right. Good question.

>>: So the size is realized after you made the decision to put it in your knapsack.

>> R. Ravi: That's right. In fact, very soon I'm going to switch to the perspective where these sizes are stopping times of jobs. So each of these items is a job. I'm going to put it on my machine, and I actually won't know until .2 whether it's going to stop at .2 or it's going to continue on until .3. That's the version I'm going to work with from now on. So in that case it's kind of clear that I've blown the time. That's sort of a better motivation.

>>: In that case you can do preemption.

>> R. Ravi: That's precisely what I'm going to allow. That's what -- then I'm going to ask, if I allow preemption, how much better can I do, and things like that. That's where I'm going. Any other questions?

So what is the expected reward from the order one, two, three? I'll definitely get one, right? And only .5 of the time I'll get two, and that's it. So it's 1.5. Now, 1.5 is not too far from 1.75, and what DGV show is that there always exists such a permutation, a nonadaptive ordering, whose expected reward is at least one-fourth of the best possible adaptive strategy. But they need two key assumptions. One is that the distributions of the item sizes and rewards are independent of each other; there's no correlation between the amount of money you'll make and the size of the item. That's one important assumption. And the second is, like Yosi was saying, you can't have any cancellation. That is, you can't take an item and after a .1 say, I don't like this item, I'm just canceling it. In the context of knapsack we'll call it cancellation; in the context of scheduling, it will be a preemption. So say it's a .2-and-.6 job.
But at .1 I say, I'm just going to preempt this guy. So in their model you can't cancel, and you have to have independence between sizes and rewards, and then you get a permutation. And actually I'll prove to you a weaker version of this. In fact, what I'm going to do is give you the solution for the basic stochastic knapsack -- a weak solution, which I'll need. Then I'll add the two things: correlation between item sizes and rewards, and cancellation, or preemption. And I'll still come up with a constant-factor approximation. The way I'll do that is using a linear program with time indices and a very simple rounding. All I'm going to use today is just the Markov inequality -- it's amazing how far the Markov inequality will take me. Then I'll extend it to these multi-armed bandit problems, and at some point I'll have to stop. When I extend to the multi-armed bandit problem I have to do an LP decomposition and rounding. I think the first part should be fairly clear.

>>: So no cancellation -- that's an assumption in DGV; is that correct?

>> R. Ravi: It's just an assumption. If you allow cancellations --

>>: And you're getting the one-fourth without cancellation?

>> R. Ravi: That's right.

>>: Could it also --

>> R. Ravi: It could be much larger. In fact I'll show you some examples. You guys are asking good questions; they're all in my talk.

So from now on I'll interchangeably use item with job, and size with processing time -- the natural way -- and the capacity is the deadline for the scheduling problem. Filling the knapsack, inserting an item into the knapsack, is like picking that job and starting to schedule it, and I'll only know what the size of that item is when it stops.

>>: All of these are inherently pure strategies that you're talking about?

>> R. Ravi: That's right.

>>: Do they say anything about if you mix the problem, nonadaptive strategies [inaudible]?

>> R. Ravi: That should not be bad. I think there's an averaging argument that shows that one of them will get just as much, because I'm doing expected reward.

>>: As much as --

>> R. Ravi: I want at least, right? I want to maximize expected reward. So if I have a combination of strategies, then my expected reward is just a linear combination of their rewards, and so one of them will be just as good. I think I'll be okay there. Yeah.

>>: So with the job scenario, we're assuming that when it comes to [inaudible] time, no two jobs can be running simultaneously?

>> R. Ravi: Absolutely. That's my capacity constraint of the knapsack. That's precisely what I mean by the capacity constraint. If I schedule a job, no other job can be running at the same time. So if it runs for .2, nobody else is running for that .2 of the knapsack. Good. I think you guys are all with me. Perfect.

So now I'll obfuscate things a bit and write a linear program. The linear program is a simple upper bound on the amount of reward I can get from any strategy. That's the way I want to think of it. There's a slight caveat: I'm going to change the problem on you a little bit. Again, the profit is not random at all; it's a deterministic number w for each item. The processing time is random; it comes from some discrete distribution, let's say for now. And in fact I'm going to allow myself to get the reward if I get to start the job. I don't even have to complete it; I just have to put it in my knapsack. There's just got to be enough space in the knapsack.
So this is what's called the start-deadline version in scheduling, where you get the reward not for ending the job but for actually starting it, just putting it in. But it's just a scale factor, because if I have to wait for that guy to finish, how long will I wait? At most the size of the knapsack, right? So these two versions are very close to each other.

>>: [inaudible].

>> R. Ravi: It's an MBA class.

>>: [inaudible] strategy for businesses. [laughter].

>> R. Ravi: Yeah, I mean, I'm doing approximation algorithms, right? So these things come up.

So what is my LP? I have a variable for each item, and you should think of it as the probability, over all possible realizations of the item sizes, that an optimal strategy will schedule this item. An optimal strategy may first pick an item, and then its size comes up as something, and depending on that size it may change which item it picks next. So an item may be chosen at different points in this decision tree. I'm just asking: over all the runs of these item sizes, what is the probability that a particular item is chosen by this optimal strategy?

All right. So now think of how much money in expectation this thing is making. Well, for every run where this indicator is a 1, it makes w_j. So in expectation it is making x_j times w_j. That's the right expression for the expected profit of opt. Now look at any particular run. In a run, if I schedule j -- that is, if the indicator for j is one in that run -- then the sum of the processing times of the items that are scheduled, the ones I took reward from -- and this is a random variable, remember, it's a distribution over sizes -- should be less than or equal to the start deadline D. So this is the deadline I'm addressing. And the reason I'm able to truncate this processing-time random variable at 1, which is the deadline, is that I'm only averaging over the runs where opt went through. If opt actually went through and included some guy, all the guys before it definitely terminated before the deadline; otherwise opt is out of the game -- opt is out of the game at deadline one. So since I'm only looking at trajectories where opt is picking some items, all the item sizes that it looked at are at most one. So trajectory by trajectory, the sum of the truncated item sizes for opt is at most one -- at most the deadline. And if I take expectations -- these two are independent variables, because how big the item is, is independent of how opt is making its decisions -- the expectation of those 0-1 variables is this probability, and the expectation of these sizes is exactly that truncated mean.

>>: So what is D again?

>> R. Ravi: D is some deadline parameter. So I could set it to one.

>>: Or one --

>> R. Ravi: That's a parameter. It's either one or -- in fact, I'm going to argue on the next slide that setting D equal to 2 gives me an upper bound. Why? Because the last guy that I schedule -- his size may be as big as almost one. So if I sum up the times of everybody who has been given reward, all the guys up until the last one, they fill up to one. The last guy adds his p_j, and luckily I've truncated his p_j at one, so he overfills by at most one. So definitely, in every trajectory, if I include the last guy too, the overhang I'm going to have is at most 1.
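Written out, the relaxation being described is roughly the following, with the capacity normalized to 1 and the notation ours rather than the slides':

\[
\phi(D) \;=\; \max\Big\{ \sum_j w_j\, x_j \;:\; \sum_j \mathbb{E}\big[\min(S_j,1)\big]\, x_j \;\le\; D,\;\; 0 \le x_j \le 1 \Big\}.
\]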
Therefore, if I set D equal to 2, then every trajectory of opt is feasible, and so this is an upper bound on the expected profit of opt. That's why I use 2; that's the explanation. And also, because I'm looking at the objective function of a maximization LP as a function of this right-hand side, this is a concave function. Another way to think of it: these things scale nicely. If I take the solution of this LP with right-hand side 2 and divide it by two, that's a feasible solution for right-hand side one. Therefore the optimum for right-hand side one has to be at least half the optimum for right-hand side two -- it could only be better. So in fact I can use either phi of one or phi of two as my upper bound, and they're off by only a constant factor. I'm going to use that next.

Okay. I think I've set up enough, so now I'll tell you what the algorithm is. The DGV algorithm is just a greedy algorithm for knapsack: sort in decreasing order -- you want high profit, low time -- decreasing order of profit over time, except that my time is this expected truncated size. This is how you fill a knapsack typically. And then you just go down that order. That is the permutation I was telling you about; it's a nonadaptive policy, just one order you go down. The claim is that if you go down that permutation, you'll get at least one-fourth of the reward of opt.

Let me actually show you a slightly weaker algorithm which gets one-eighth of the reward. I'll need the proof of this; it's what's going to drive the rest of my proofs. So let me solve that LP upper bound. The right-hand side I'm going to make one rather than 2 -- it doesn't matter, it's just a scale factor. And I'm going to sort the items not in that greedy order, but in any order you like. Here is where my power really is -- this is where I'm going to extract the power of this weak algorithm: I don't have to go down in the specific order of profit over time. Sort in any order, but just damp the probability with which you actually pick each item. You're supposed to pick it with probability x_i according to the LP; just pick it with probability x_i over 2. Then take the items in any order and keep scheduling the picked ones until you overflow. The claim is this is a one-eighth approximation. I'll show you the proof very quickly.

The way I'm going to show that you're getting one-eighth of the profit is by showing that for any item you consider, when you get to consider it, with probability at least a half the knapsack is not full. That means you can get it started and you can make your money -- it's enough to get started to make money. So what is the expected profit that I make in this weak algorithm? Well, first I have to pick the item, and then I have to be able to schedule it, and if I schedule it I make this money. By this lemma, the probability of being able to schedule it is at least a half. The probability of picking is damped by 2 -- the original probability ought to be x_i according to the LP, but I damp it by a factor of 2. So this is the expected value. But if I look at this expression, it is just phi of one divided by 4, because that sum is the objective function of my LP. And because of the relation between phi of one and phi of two, this is, up to a constant factor, an upper bound on opt. This is where I started.
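A minimal sketch of the damped rounding just described, in Python; it assumes the LP values x[i] have already been computed by some solver (not shown), and the function name weak_rounding and the sample_size callables are ours, just for illustration:

```python
import random

def weak_rounding(items, x, capacity=1.0):
    """One run of the damped LP rounding for the start-deadline stochastic knapsack.

    items: list of (reward, sample_size) pairs, where sample_size() draws a realized
           size from that item's distribution.
    x:     LP values x[i] in [0, 1] from the relaxation with right-hand side 1.
    """
    used, total_reward = 0.0, 0.0
    for i, (reward, sample_size) in enumerate(items):  # any fixed order works
        if random.random() > x[i] / 2:                 # damp: keep item only w.p. x[i]/2
            continue
        if used >= capacity:                           # knapsack already full: stop
            break
        total_reward += reward                         # start-deadline: paid on starting
        used += min(sample_size(), capacity)           # truncated realized size
    return total_reward
```

Averaging the returned reward over many runs estimates the policy's expected value, which the lemma below bounds by at least phi(1)/4.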
So all I have to show is that when you consider any item, the knapsack is not full with probability at least a half. Now just look at how much space is occupied by everybody else when I consider a particular item. When I'm considering an item, look at all the items other than itself, because it hasn't been scheduled yet -- it's only being considered. What's the chance I picked another item, and how much does it fill the knapsack? Well, how much it fills the knapsack I can again truncate at one, because this only matters when I'm still in the game, right? The knapsack is not yet full, and I'm asking what the chance is that the knapsack is not full. So the probability of picking any other item -- this is where I'm using my damping -- is at most its LP value divided by 2, and it contributes at most its truncated size. Now, that expression is exactly half the left-hand side of my knapsack constraint, and since I used phi of 1, the left-hand side is at most 1 -- so this is at most half. So the expected space that the other, damped guys fill is at most half, and by the Markov inequality the chance that they exceed two times that is at most half. That's it. Therefore, when I consider any item, the space filled by everybody else is in expectation at most half the knapsack size. Just damping, that's all. Very simple.

Good. Now let's start canceling items. So let's think of -- I should note -- okay. Suppose I gave you an instance with identical items which all have reward one each. The size is either tiny or half the knapsack, and each of these happens with probability half. Now, if I don't allow you to cancel, very soon you'll get an item of size half, and that's basically the end of the game, so your expected reward will be a constant. But if I allow you to cancel, then I'll schedule a job, I'll run it for epsilon, and if it goes for a little bit more than epsilon, I'll preempt it -- I'll throw it away. And I'll keep doing this for every job. So I'll get all the epsilon jobs: I'll preempt away all the long jobs, I'll keep all the epsilon jobs, and my reward will be N over 2. So with preemption -- cancellation, as I call it here for items -- I get a much higher reward, whereas without it the reward is actually quite low. So there's this huge gap between what you can do.

Unfortunately, if you use the other LP, the DGV LP, as it is, it won't work, of course. Because what is the expected truncated size of a job? Well, a job is half the knapsack with probability half, so its expected size is one-fourth of the knapsack. So how many such truncated jobs can you fit in the knapsack? That LP will only be able to give you about four items' worth of reward; its value is bounded by a constant. Clearly it's not capturing the ability to cancel.

Okay. Now you look around, and you find that someone has already solved this problem. There's a series of papers, which I'll talk about briefly, by Sudipto Guha and Kamesh Munagala. They have LPs like this for solving Markovian bandit problems. In those bandits you can either keep playing the bandit, or stop and say, I want to use this bandit and collect my reward -- either explore or exploit. So stopping and saying I'm going to exploit this is like preemption, because you're going along, and at some point you're saying, I'm going to stop and take the reward. So that's kind of like preemption.
If you see how they do it -- how they decide between stopping or going -- you can actually adapt the LP and work out this case. But unfortunately, if you throw in the second complication, which is correlation between the rewards and the sizes, their approach doesn't work anymore.

So what is correlation? Here's an example. N items; each item can be very tiny and give you no reward, and that happens most of the time. And with the remaining tiny chance, it fills up the whole knapsack and gives you a reward of one. So how much can you get --

>>: When do you realize you get the reward?

>> R. Ravi: So think of it in the job setting, that's easiest. You start running it, and if it stops at size epsilon, you get a reward of 0 -- at the point where it stops. So, unfortunately, because of that, unless the first thing I start turns out to be a full-size item, I'm hosed -- I won't get any reward, because all the other items I place only fill up the knapsack, and the only realization that gives me a reward needs the whole knapsack. So the expected amount of reward I can collect is only one over N, because only with probability about one over N will I get that one reward. And now you see, the fix that I talked about doesn't work here. So these bandit LPs that take care of the cancellation part don't work for the correlation part. So now you look around further and further, you don't find anything, and now you have to solve it yourself.

So now we really have our first problem: stochastic knapsack with cancellations, or preemptions, and correlation. A convenient way to think of it: we have a deadline, and for every job we have a distribution over sizes. For each size t, I tell you the probability that this item runs for exactly time t, and what reward it will get you if it stops at that size. So for discrete item sizes 1 to B, I have the probability that the size of the item is t, and the associated correlated reward. It's complete correlation. And you can cancel any job while it's running. What does that mean? You're running the job and it has not stopped at time t -- so it's still running at that point -- and you can just say, I'm not going to worry about this item anymore. But that also means you can't come back to this item. So I'm also assuming that once you toss something away, preempt it, you never come back.

>>: You get 0 reward.

>> R. Ravi: You get 0 reward, that's right. So only the jobs ending by the deadline pay you, in the knapsack way. That's the problem we're going to work on.

It's a very simple idea that solves this, a standard idea. You basically split where you're getting your reward from. Are you getting your reward from items which are ending before half the deadline, B over 2? Or are you getting most of your reward from items which are ending at time B over 2 or higher? You're definitely getting at least half the reward from one of these two types of items, overall, in the optimal strategy. So let's just break up our problem into two problems. In one, you zero out the rewards for all the sizes from B over 2 plus 1 to B, so you get reward only for the small item sizes.
And then you keep the remaining part of the reward support in the second instance. You flip a coin, you pick one of the two instances, and you do the right thing for each of them. The nice thing is, if you look at the second type of instance, where you get your reward only if you run for at least B over 2, you never need to preempt: once you start a guy, he's only going to give you a reward after B over 2, so you never preempt him to go get everybody else, because everybody else is in the same boat -- they would also have to run for B over 2 plus 1, in which case together you'd overshoot the knapsack. So the second type of job is like the DGV type of job, a one-shot thing; no need for preemption. I can essentially use my solution from before to take care of the second type of jobs, and I only really have to worry about the first type, where I'm getting reward from the first half of the support. So I'm going to quickly run through the second type first, because there's not much going on there, and then I'm going to focus a little bit on the first type, and that will shoot me into the bandit part.

>>: From the sizes of the jobs there, the strategy is dealing with just one specific --

>> R. Ravi: That's it. So that final thing is just one, yeah. Yeah -- because that's all I can do in that case.

Okay. So here is how you can actually do it -- I'm going to rush through this; don't worry if you don't follow the next three slides, I'll blitz through them. The LP here is basically just the DGV LP, except I've time-indexed it from 1 to B; now it's a knapsack of size B. So what does this variable tell me? It tells me whether I'm going to start job i at time t, for some global time t between 1 and B. If I did start this job at time t, how much reward will I make? Well, I can only go until the deadline -- there are only B minus t steps left. So if I get to stop by the deadline, which happens with these probabilities, I get the corresponding conditional expected reward for starting at t. That's what I should be collecting in the objective function. I can't start more than one job at a particular time; that's the capacity constraint. And this one is just the same packing constraint as before. It says: if I sum up, over all jobs started at times up to t, their expected truncated sizes, the total can't be more than 2t. The 2 is because of the overhang of the last job that I scheduled. It's exactly the same thing; the previous arguments tell us that this is a legitimate upper bound. And, like I said before, cancellation does not help here. So just solve that LP.

Now, for each item I have values for the different times to start it at; the LP is a little more refined, and for each item i and each time t it tells me with what probability I should start it at time t. So I first go to an item i, and then I pick a t from its x distribution, and I damp it. Only with probability one-fourth do I even pick the item; with probability three-fourths I don't bother with it at all. But when I do pick it, I pick it to start at a particular time t with probability x_{it} over 4. And then I just schedule this. So now every item which has been picked successfully sits at some time slot, and I just walk down the time slots, start the first guy, and so on -- and that's all there is to it.
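Roughly, the time-indexed LP being described is the following (a reconstruction in our notation: x_{i,t} is the probability of starting job i at global time t, \pi_{i,s} the probability that its size is s, R_{i,s} the reward if it stops at size s, and S_i its random size):

\[
\max \; \sum_i \sum_{t=1}^{B} x_{i,t} \sum_{s=1}^{B-t} \pi_{i,s} R_{i,s}
\quad \text{s.t.} \quad
\sum_i x_{i,t} \le 1 \;\; \forall t, \qquad
\sum_i \sum_{t' \le t} \mathbb{E}\big[\min(S_i, t)\big]\, x_{i,t'} \le 2t \;\; \forall t, \qquad
x \ge 0.
\]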
And the proof is pretty much identical. Conditioned on when you start, the probability that you get to start at all is the probability that the previous guys have left some space for you, and the probability that they don't fill up more than the current time t is -- it's the same Markov inequality. Nothing new is going on. So I pick an item with probability at least one-fourth of its LP value, I get to go with probability at least a half, and therefore overall with probability one-eighth I get what my LP value tells me I should get. Therefore I'm within one-eighth of the expected LP value. If this is confusing, it's okay; you can go back to the recording.

Now let's do the more interesting case. This one has a new idea, and the LP starts to get a little more interesting. Remember, all I have now is jobs whose reward comes from finishing between 1 and B over 2; the upper part of the support gives me no reward. This is the small-jobs case. So now I have an LP which starts to look like this Markovian bandit LP. I have two variables for each item: one is the visit variable and the other is the stop variable. Again, the variables are indexed by the job and a particular running time t. If v_{it} is 1, that means my policy allowed this particular item to run for t steps -- I visit the t-th time step. And if s_{it} is 1, that means my policy said I'm going to stop this item at time t.

Now, the interesting thing is, if you run an item for t steps, inherently that item might stop there with some probability pi -- remember, there's some underlying distribution according to which the job stops. So the rate at which my policy stops an item at time t should be at least that. Inherently, if I say run for t steps, the item might stop by itself at time t, but I can also forcibly terminate it at t -- that would be a preemption. This stop variable s_{it} is going to take care of both.

So let's look at what this constraint says. It says: if I visit time t, that is, if I let item i run for t steps, then either I have to stop it at that time or I let it run for one more step. Those are the only two options. Now, the stopping rate should be at least what the stopping probability distribution tells me it is. What is that? Conditional on my getting to t, the remaining stopping probability mass is this much, and the probability of stopping right now at t is that much -- so this ratio is the conditional stopping rate given that I got to time t. So your stopping probability had better be at least the probability of visiting times this stopping rate, because the item is going to stop by itself at that rate. But it could be strictly bigger, and if it's strictly bigger, that means you're preempting -- at least, this fractional policy is preempting.

>>: Is that the conditional probability of stopping or the original probability of stopping?

>>: [inaudible].

>>: [inaudible].

>>: It seems like it's the conditional probability, no?

>> R. Ravi: On visiting t, that's right. Yeah -- I mean, s_{it} means, yeah... So let's look at this last constraint and see what it says. It says that if I stop at time t -- no, actually, it's the original probability. It's just that every item has some probability of stopping at any value between 0 and B, where 0 means you never even get to start the item.
So if I stop at time t, then this item has actually occupied t units of my knapsack, of my time. So if I sum over all possible real stopping times, weighted by the time the item has run for, it has to fit in my knapsack.

>>: I see, this is why you multiply by --

>> R. Ravi: That's exactly right.

>>: So it's the real probability.

>> R. Ravi: To make it the real probability, right. So by the same arguments I've been making from the beginning, this is an upper bound on how much an optimal policy can make -- again, just by tracing trajectories.

Okay. Now, how can we use this LP to design an algorithm? Well, we should try to visit with probability v in our algorithm, and we should try to stop with probability s in our algorithm. That's what we should really try to do, and that's sort of what I'm going to do. I'm going to use my previous trick of damping: I'm not going to start the whole process off with probability v_{i0}; I'm only going to pick an item with probability v_{i0} over 4, and otherwise I won't even get going with that item. But once I get going, I'm just going to simulate what the LP is telling me. That is, if the LP has an s_{it} value strictly bigger than the inherent stopping rate, for the remaining fraction I'm going to preempt that job a bit -- I'm going to toss a coin and with that probability preempt it. This is not great text form, but that's just laziness on my part.

So what do you do? Again, you solve the LP, you get an optimal solution, the s's and the v's. Right off the bat, with probability three-fourths you just forget about an item. Then you take any item you like -- you've ignored it with probability three-fourths -- and you go through the time steps from 0 to B over 2, because that's all that matters; after that the rewards are 0. And you cancel at the rate that the LP is telling you. Remember, the rate at which the item inherently stops is this conditional probability times v; anything that's left over is what the LP tells you to cancel with. So with that probability you cancel -- I've just extracted it out of the LP solution. When I cancel, I throw this item away and go on to the next item. If not, I go on to the next time step, look at what the LP value is there, cancel with the corresponding preemption probability, and keep going around and around. And of course, notice that when you go to the next time step, if I process the item, it may also terminate by itself, according to its inherent distribution -- I'm just running this job. If it stops by itself, that's great, I'll get the reward. I'm just simulating the LP.

The reason I'm damping is so that I can use the Markov inequality to leave enough space, because when I'm running a job I don't want it to be interfered with by the space used by the other jobs. That's what the damping is going to buy me. And this cancellation is going to buy me, by induction, this nice property: if I schedule a job, then conditional on not throwing it away, I'm doing exactly what the LP is asking me to do. That's just induction: the probability of stopping is s star, the probability of processing is v star, and the probability of completing is exactly what the job's distribution says. The algorithm is set up exactly to make that induction work. Now let's finish the proof. Say I'm trying to schedule a particular item -- that means in this loop I'm just picking this item i.
Now, how much of my knapsack, my time, are the other items already using up? It's exactly the same calculation. For any other item i-prime to use up space, it must have been in the one-fourth-probability event that I actually did something with that item. And then there's some probability with which that item stopped at some time t; if i-prime stops at time t -- which happens with probability s star, by that inductive claim -- then it has used up t time steps. Now I use the damping: the probability that I even let this guy in is one-fourth, so the expected amount used by this other item is one-fourth of this sum. But my LP inequality says this sum is at most B, so one-fourth of it is at most B over 4. That means, in expectation, the amount of space used by everybody else is at most one-fourth of the knapsack. Great, I'm done, because by the Markov inequality the probability that they occupy more than half of the knapsack is at most half. That means everybody is starting at time B over 2 or earlier with probability at least a half, and they have enough time to finish up the part that matters, because the rewards only come from the first B over 2 steps of a job. It's that same weak DGV argument that I'm milking again and again, but now I have this nice LP that tells me when to cancel. It's still the same calculation.

All right. So now I'm going to make a real leap -- I think I'm ready to make my last grand leap. Shoot, is it really 4:22? Wow. Okay. So in fact the stochastic knapsack with correlated rewards that I was talking about is just an example of a Markov chain, a Markovian bandit, that looks like this. If I have one arm -- one item -- and its rewards are R1, R3 and R4 for stopping at times 1, 3 and 4, then it's really like going through this Markovian process. That is, I start; with probability half I terminate -- this is the terminal state for stopping at 1 -- and I get some reward. Then with the remaining probability I continue. I don't terminate at all at time 2, so there I continue with probability 1. And then with some other probability, conditional on having arrived here, I continue or terminate at 3. So every sequence of pi's and R's that I got as input can be converted into one of these arms.

So what I was actually doing was solving the problem of figuring out, given all these different arms, which one I should move next. This is sort of the most general preemptive case: I move one guy, I park the other guys in whatever states they're in, and if I move you, I may get a certain reward from you. And I just have a limited number of pulls -- B pulls -- on these bandits, and I'm trying to figure out which one of them I should move at each step. So that is the Markovian multi-armed bandit problem. You have different arms; each arm has a set of states S_i, a root state where you start, and a transition graph where the probabilities of going out of a state to the other states sum to one. When I go to a particular node in this Markov chain -- if I say play that node -- I might get a reward from that state for playing it, and then the chain moves to one of its successor states according to the transition probabilities. And now I'm given a budget on the number of pulls, and I ask you to maximize the total expected reward that you collect. So this is the corresponding generalization.
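A small sketch of that conversion, assuming the discrete-size input format from earlier (pi[t] is the probability the job stops after running for exactly t+1 steps, reward[t] the reward if it does); the function name item_to_arm and the dictionary representation are ours, just for illustration:

```python
def item_to_arm(pi, reward):
    """Convert one correlated knapsack item into a Markov-chain arm.

    pi[t]:     probability the job stops after running for exactly t+1 steps.
    reward[t]: reward collected if it stops there.
    Returns one record per 'running' state: the conditional probability that the
    next pull terminates (moving to a terminal state and paying that reward),
    versus continuing to the next running state.
    """
    arm, remaining = [], sum(pi)          # probability mass not yet resolved
    for t in range(len(pi)):
        p_stop = pi[t] / remaining if remaining > 1e-12 else 1.0
        arm.append({
            "terminate_prob": p_stop,     # edge to the terminal state at depth t+1
            "terminal_reward": reward[t], # reward attached to that terminal state
            "continue_prob": 1 - p_stop,  # edge to the next running state
        })
        remaining -= pi[t]
    return arm
```

Each entry corresponds to one running state of the arm: playing it either moves to the terminal state for that stopping time, collecting its reward, or to the next running state.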
Here are some examples. One particular assumption has been used very strongly in all the past work, until our work, and that assumption is that these reward values follow a Martingale process. What do I mean? Look at the reward values on these two descendants; the expected value of the reward -- for example, if this guy was --

>>: [inaudible].

>> R. Ravi: Is half and half. So, for example, say this child is 6 and this child is 2. Then the reward at this state is 4, and if I pull and go to the two different states, the expected reward from playing them is also 4: with probability half I'll play this state and get 6, and with probability half I'll play that state and get 2. So the expected reward at the next state is equal to my actual current reward. This kind of reward structure comes up when you have a bias you're trying to learn -- the arm is really the effectiveness of a drug against a disease, or the bias of a casino machine, how much money you can get -- and you're learning it from no prior, or from the uniform prior. So you just have a distribution based on how many positive and negative examples you've got, and the expected reward you get follows this Martingale process.

So a lot of the previous papers, especially by these authors, looked at the case where this expected reward process has the Martingale property. But before that, there's this whole area of multi-armed bandit exploration, which is very vast. There are some beautiful papers for infinite-horizon policies, where I'm going to play these arms forever and the rewards I'm getting are time-discounted -- that makes sense if you look at an infinite-horizon policy. In that case there are some very simple policies called index policies: you don't have to look at any other arm, you just look at one arm at a time and compute a number for every state, and then your algorithm is to always play the arm which has the highest index. And that's optimal for these infinite-horizon, time-discounted problems. This recent stream of work is for finite-horizon problems; they're called budgeted learning problems. And all of them arise either from model-driven optimization in database systems -- so you see some PODS papers -- or from manufacturing; this one is an OR paper with a manufacturing motivation. But all of them use this kind of Martingale assumption. So where we started with this whole project is: how do we get rid of this Martingale assumption?

And the way we did that is by rewriting that last LP I had for small jobs, for the general multi-armed bandit. Remember I said visit and stop? Here the LP looks kind of simpler. Again, I have two variables for each state -- and remember, a state can belong to any one of the different bandit arms. So z_{ut} says that at global time t, I'm actually going to pull, or play, the arm at state u -- u is a state of some arm. And w is the arrival variable: if w_{ut} is 1, then t is the time when I first enter that state u. So this one is entering and this one is playing. Now, look at this constraint: you enter state u at time t precisely if you were at the parent at time t minus 1, you pulled your parent, and with some probability you transitioned to where you are. And this is valid if you have just one parent -- for example, if I have a tree transition graph, you have exactly one parent.
So I enter your state at time t if I was at your parent, I pulled the parent at time t minus 1, and the transition happened -- this factor is the transition probability of that edge. You can't pull more than you have entered: you can't pull at a state until you have entered it, so you have to visit the state before you get to pull it. That's what this one is saying. This one says that at any point in time you can pull only one arm, on average. And this says you start at the root of every arm -- that's the initial state. And if you pull, you make the reward. So now my objective has been simplified, because earlier I had to actually put in the conditional rewards; all of those messy conditional rewards came because I had this --

>>: The budget -- so you can see the time horizon is one to --

>> R. Ravi: That's right. I only get B pulls, because at each time I can pull once.

Okay. So earlier, with the knapsack, it was easier to figure out how to use the LP to simulate what it was doing -- the canceling. But this one is much more complicated, because the solution to this will be some kind of fractional strategy spread over this forest, over these trees. I have many arms, each arm is a tree, and for each state in each arm I have a probability of visiting it at some time and also a probability of pulling it at a certain time. So it doesn't decompose that nicely.

So, in a nutshell, this is what we do. We do something like a flow decomposition. We take the LP solution, and whenever we find that in any arm there's a state with positive pulling probability, we note: if I pull a particular state at any point in time, that must be because I was able to enter it; but if I was able to enter it, I must have pulled its parent, so its parent must have a positive pulling probability. This way we work our way back until we infer that the root has a positive pulling probability. Then we find the smallest amount of that pulling probability that we can strip off from the whole tree that's left. So we strip off a fractional forest of plays -- that's the convex decomposition. And then we look at that fractional forest of plays on a timeline, and it sometimes has these large gaps, and what we have to do is fill these gaps to make our Markov-inequality proof work. So we have a gap-filling phase.

>>: [inaudible].

>> R. Ravi: So I actually have -- if you're still with me -- this is what I just said: as long as there's a positive play probability, I strip off a tree. So something that I strip off may look like this. That is, for a particular arm, this state has some positive play probability at time three, the following state has positive play probability at time four, but the state after that has positive play probability only at time seven, not at time five. Why? Because my LP says you can enter a state, wait, and pull later. So when I look at this forest on a timeline, this guy is at seven, and after this happens there's actually a gap in my strategy. That's what I mean by gaps in my schedule. If I take one of these fractional strategies that comes out and think about what it is asking me to do, it's asking me to play this at three, play this at four, and then wait and play this at seven. So it's actually asking me to do this.

>>: With the gaps it's a perfect decomposition, right?

>> R. Ravi: With the gaps?

>>: Yeah, as long as you leave the gaps in --

>> R. Ravi: Right. Now we have one problem with that.
So why can't I just run this strategy? What would running the strategy mean? I start this guy, and in case it actually transitions to that state, I wait for three steps, and then I play this guy. And in case it transitions here, then I wait -- and actually that's the end of it; this strategy just ends there, it preempts the arm. The problem is this waiting, because these waits add to the capacity, the load, of everybody else. Remember, I'll pick a random arm and a random strategy forest from it and try to do this; these things will add up.

So in fact the particular problem is the following -- I'll tell you exactly why the gap is a problem. Say I schedule a guy who is at depth six from the root at time ten -- this is global time, by the way, between one and B. If I actually do this, then I know there will be hardly any time left in those first ten steps for any of the other forests. And this is bad, because I don't want to fill up the time with just one guy. So the real problem is if there is a node in this forest such that twice its depth is greater than the time at which it wants to get scheduled. If I'm scheduled at time ten and my depth is six, then more than half of those ten units are used just on me alone. That's not good, because remember, the way I did the previous proof, I only wanted to run this for half of the budget; I don't want myself to fill more than half of the budget.

So the quick fix is: you take the strategy forest, sweep your way from the back, and whenever anybody is scheduled at a time which is smaller than twice its depth -- another way of saying it is that its own ancestors take up more than half of the time slots before it -- then I'm just going to push this guy back, and push all these guys back with it. So I'm going to compress this. But there's a problem with compressing, because when I compress I'm adding more play to earlier time steps where there was no play. If you're still with me, adding that is not really that much of a problem, because how many different guys might come and crowd a particular time step? Well, if somebody ever got pushed back to time t, it was because more than t over 2 of its own ancestors were in the slots before -- but the total play capacity up until time t is only t. So how many such guys can come and crowd me? On average, two. It's a very simple argument, just an averaging argument. So when I push things back, I'm claiming that no time step gets overcrowded: its play goes from its current value of at most one to a value of at most three; it just gets an extra two units of play. Fine -- damp by another factor of three. So instead of getting one-eighth, you get one twenty-fourth.

So the algorithm is that: you compress the forests, you pick a particular strategy forest with probability equal to its root's playing probability damped by 24, and then you just run that strategy. And the proof is exactly what I did before; there's nothing more you have to do. Okay, I'm way past my time. So thanks for hanging in there.
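For reference, the bandit LP from a few slides back can be written out roughly as follows (again a reconstruction in our notation: r_u is the reward of state u, p_{v,u} the transition probability from v to u, \mathrm{par}(u) the parent of u in its arm's tree, and \rho_i the root of arm i):

\[
\max \; \sum_u \sum_{t=1}^{B} r_u\, z_{u,t}
\quad \text{s.t.} \quad
w_{u,t} = p_{\mathrm{par}(u),u}\, z_{\mathrm{par}(u),\,t-1} \;\;\forall u \neq \rho_i,\ \forall t; \qquad
\sum_{t' \le t} z_{u,t'} \le \sum_{t' \le t} w_{u,t'} \;\;\forall u, t; \qquad
\sum_u z_{u,t} \le 1 \;\;\forall t; \qquad
w_{\rho_i,1} = 1 \;\;\forall i; \qquad z, w \ge 0.
\]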
And there's really only one comment I want to make: even though I did this only for a tree -- I said this analysis works only for a tree -- it actually works for very general Markovian bandits. Because general bandits can be converted to layered DAGs: we have a finite time horizon, we're only running for B steps, so whatever transitions are happening in that general Markov chain, just let them happen in a time-indexed chain like this. And DAG bandits can in turn be blown up into an exponential-sized tree, by keeping track of where I came to a state from -- did I come to this state from A or from B? If I blow that up I get an exponential-sized tree, but I can keep a representation of that exponential-sized tree implicitly and run this algorithm. So in fact everything I said will work even for the most general multi-armed bandit with non-Martingale rewards.

So really the interesting question is: can we use these kinds of LP decompositions to come up with adaptive approximations? Because our algorithm, once it strips out this forest, does different things depending on the sizes -- and we know from all the previous examples that you need to do that; you need to be adaptive to get any kind of constant-factor approximation for this problem. So this seems to be a very appealing hammer to use for other sorts of adaptive approximations. And there are some generalizations we have results for. Okay. Thank you again. Sorry for going over time.

[applause]