>> Yuval Peres: It's a pleasure to have Ravishankar Krishnaswamy. He's going to talk about approximation techniques for stochastic optimization. >> Ravishankar Krishnaswamy: Thanks, Yuval. Thanks for the invitation. It's a pleasure to be here. Okay. So like Yuval was saying, I'll talk about stochastic optimization from an approximation algorithms point of view, and feel free to stop me at any point in the talk if you have questions. This is joint work with Anupam Gupta and R. Ravi, both professors at CMU, Viswanath Nagarajan, who is over at IBM T.J. Watson, and Marco Molinaro, who is a student at CMU. So what's the outline of the talk? At a very high level, I'll first talk about the models that we study, then concretely present the problems that we look at before moving on to our results. The main technical part of the talk will focus on a problem called stochastic orienteering -- I'll define it in that section -- and then I'll describe a general solution framework for this problem and apply it to a couple of other problems, before closing with future work. So let's move on to the model that we study. One thing we all realize is that we have to solve optimization problems all over the place, and there are different models for looking at such problems. At one extreme is deterministic optimization, where before solving the problem we assume we know the entire input in advance and we solve a deterministic combinatorial problem. The downside is that we assume we know everything about the input before coming up with the solution. So I call it the know-all model, and in some sense it's a little optimistic: in real life we don't always know the entire input ahead of time. The other extreme is the model of online optimization, where we assume we know as little as possible about the input before coming up with the solution, and the solution is built incrementally as we learn more about the input. Maybe it's a little harsh to call it the know-nothing model -- we do know something about the input -- but that's just to illustrate the point. The flip side is that the algorithm assumes it knows very little about the input and can only figure out solutions as the input is revealed, so it's a little pessimistic in its modeling assumptions. The gray area in between is where most problems in real life sit: we know more than what is assumed by online optimization, but not as much as what is assumed by deterministic optimization. That is the model of stochastic optimization. It has been studied in different flavors all the way from 1955, when it was initiated by Dantzig: right after he came up with the simplex method for solving LPs, one of the things he started thinking about is how to handle uncertainty in the matrix and the input. We're not going to be looking at that model of stochastic optimization; we're going to be looking at some different models, which I'll talk about when I get there. Okay. So here we try to use the nice results that we know about deterministic optimization and also handle some sort of robustness to uncertainty in the input. So that's the setting.
Since it was initiated, in the last 50-plus years there's been a lot of work on this problem, and I'm just going to paraphrase the abstract of this book by Birge and Louveaux on stochastic optimization to motivate why we're interested in it. The first thing it says is about the framework of stochastic optimization: the goal is to come up with optimal decisions to handle uncertain data. It has received contributions from many different areas of science, and it has a lot of practical applicability. Our goal, in some sense, is to take more steps toward adding computing to that list, which is currently absent from the abstract. What do I mean by adding computing, or computer science, to this list? The first thing that comes to mind is that we want to compute the decisions efficiently, in some notion of polynomial time in the input. But we can't always find efficient optimal solutions; in particular, if the problem is NP-complete, you can't expect to find optimal decisions efficiently, so you add the notion of near-optimality along with efficiency. That's the high-level goal of this talk. What does that do to our model? What do we need to do to address these questions at a high level? The first thing is that we need to come up with nice ways to model uncertainty in such combinatorial optimization problems. The second thing is: once we introduce uncertainty, how does the solution space change? And the third thing is: what techniques do we need to handle uncertainty? These are the three things that we look at in this talk from the point of view of approximation algorithms. All the problems we look at are NP-complete. Yes? Okay. So the canonical NP-complete problem, the maximum knapsack problem, is going to be the focus of the initial part of this talk, at least to answer these questions at a high level. What is this problem? We're given n jobs; each job has a reward and a size, and we're also given a knapsack of capacity B. In this example the rewards are the numbers written inside the jobs, there are five jobs, and the sizes are their lengths. The goal is to identify a subset of these jobs which all pack into the knapsack such that the total reward is maximized. In this example you can notice that all five jobs don't fit in, but four out of five do, and you get a large reward. You can also think of it as a scheduler which has a time reserve of B units of time: each job takes a deterministic amount of time, and you want to schedule as many jobs as possible before you run out of time. Again, it's an NP-complete problem, but we know extremely good approximation algorithms for it. In fact, you can get a (1 + epsilon)-approximation in time polynomial in n and 1/epsilon, for every epsilon. So, to address the first question: how do we introduce uncertainty into this framework? The way we do that is through a problem called stochastic knapsack. Again, it's been very widely studied in operations research and also in approximation algorithms. Let me just describe the problem. Here, instead of jobs having a deterministic size and reward, they have a random size and reward. In the variant that we study, the distribution is given to us as input. One way to think about this is to assume that each job is a randomized algorithm.
It tosses internal coins, and there's a distribution over what its utility will be in the end, and a distribution over its running time; these are given as part of the input. The catch is that only after we run the algorithm do we get to see its true realization: how useful it was to us on that run and how long it took. So we know the distribution, and the true realization is revealed only when we run it. For example, if this was the original deterministic reward and size, now it's given as a distribution: either a small size with some reward, with probability half, or a large size and reward zero, with probability half. So for each job we now have a distribution of this form, and we need to somehow come up with a scheduling policy. So that addresses the first part: this is how we model uncertainty for this combinatorial optimization problem. What about the second question: how does this change the solution space? Recall that in the earlier model, a solution is just a set of items that you pick that fit into the knapsack. Now it's no longer that simple. In the most general case, solutions can actually be adaptive policies. What do we mean by that? All I mean is that subsequent actions can depend on how the randomness turned out. In this example, maybe the algorithm decides to schedule the green job first, just by looking at all the distributions, and depending on how the randomness panned out it can take different actions. Perhaps the green job came up with a small size but reward only five; maybe it then chooses to run the brown job -- I don't have any particular reason for these scheduling decisions, they're all ad hoc. On the other hand, if the green job had come up with a different instantiation, it could have done a different thing; that's where adaptivity is key. Maybe it decides to do the last job, the last job instantiates to some size, then it decides to do the middle job, which is deterministic -- it comes up in only one possible way -- and then it decides to do the fourth job. But when it does the fourth job, the knapsack has already expired; it does not get any reward from that job because it's not completed in time, and the game ends there. >>: So when you say adaptive policy, you mean the next action depends on -- is there an order on the items, and the randomization appears in that order? I don't understand -- >> Ravishankar Krishnaswamy: Yes. So it's a completely offline problem. Every item is given as a distribution. The algorithm has to first pick an item to schedule, just by looking at all the distributions. Then it is irrevocably committed to scheduling that item. >>: Then it sees the random coins. >> Ravishankar Krishnaswamy: Then it sees that this item took five units of time to complete and gave me such-and-such reward, and it can choose a different job next. That next job can change depending on how this randomization occurred. >>: I see. >> Ravishankar Krishnaswamy: And there are different models here. In one model, if a job is taking too long, the algorithm can abruptly cancel the job, maybe run a different job, and come back to this job later, and so on. But we will look at the basic model: it has to commit to scheduling a job completely before running it, and only after it completes can it decide to run the next one. Yes. So earlier the solution was just a subset of items to schedule; now it's a complete decision tree.
To look at it a bit more formally: maybe the algorithm decides to run job two just by looking at all the distributions, and job two has two ways in which it randomizes. For each of these branches the algorithm can choose a different job to schedule next. So maybe if it has reward five and size four it decides to insert job five; on the other branch its decision is job three, and so on. On each of these sample paths, the knapsack expires at some point, and you count the total reward collected before it expires. The goal is to come up with this decision tree that maximizes the total expected reward. So that's the problem: earlier the solution was just a single path and you wanted to maximize the reward; now it's a tree. Okay. So that addresses question two about the model. And what about question three? The answer, I'd love to say, is that you'll have to stay for the rest of the talk. Okay. So what's another motivation, aside from being interesting in its own right, for studying such problems? These are a branch of problems called stochastic scheduling problems, where, like I said, the jobs are random and we want to come up with policies to schedule them. They have been widely looked at from an OR point of view; there's a lot of research, I'm just going to highlight some of it, and there's a good survey I can point you to. And there's also another interesting point: a lot of these algorithms use nice heuristics that ignore almost everything about the distribution, look only at the expected values or the variance of the distribution, and run deterministic algorithms on the reduced problem. So one meta-question motivating this work is: can we explain the effectiveness of these heuristics from a worst-case point of view, for arbitrary distributions, or do these algorithms require some nice properties of the distributions? That's the motivation. The problems we look at in this talk are, like I said, the stochastic knapsack problem; then an extension called stochastic routing problems -- if you want a quick insight into these, they are problems where the random jobs are located at different vertices of a metric and you actually have to travel to a vertex before you can process that job; and another generalization, the multi-armed bandit problem, which I'll get to at the end of the talk -- if you already know it, that's fine, otherwise I'll define it then. We look at all these problems and try to provide a unified framework for solving them, viewing them from the approximation algorithms point of view. And, yes, more concretely, our results: we give efficient approximation algorithms for the knapsack problem, for the bandits problem, and for the orienteering problem. Like I said, previously we only knew theoretical guarantees for special cases of these problems. Again, to reiterate the meta-problem: we want to understand the techniques we need, (a) for algorithm design, because the solution space has changed, and (b) for analyzing these algorithms, for combinatorial optimization problems from a stochastic point of view. And the takeaway: optimal solutions can be very adaptive, and algorithms may need to be also. So that's the first part of the talk. Now we move to the main technical part of the talk -- in fact the only technical part of the talk -- which will focus on the problem of stochastic orienteering, and I'll define that problem now.
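(Before moving on, here is a minimal sketch, in Python, of the change in solution space just discussed for stochastic knapsack: a policy is a function of what has been observed so far, and a nonadaptive policy is just the special case of a fixed order. The instance, the names, and the policy below are purely illustrative, not from the talk.)

```python
import random

# Illustrative stochastic-knapsack instance: each job is a list of
# (probability, size, reward) outcomes.  The distribution is known up front;
# the realization is seen only after the job is irrevocably scheduled.
jobs = {
    "green": [(0.5, 2, 5), (0.5, 6, 0)],
    "brown": [(1.0, 3, 4)],
    "blue":  [(0.5, 1, 2), (0.5, 4, 2)],
}
B = 8  # knapsack budget

def sample(job):
    """Draw one realization (size, reward) from a job's distribution."""
    r, acc = random.random(), 0.0
    for p, size, reward in jobs[job]:
        acc += p
        if r <= acc:
            return size, reward
    return jobs[job][-1][1:]

def expected_reward(policy, trials=100_000):
    """Estimate the expected reward of a policy by simulation.

    `policy` maps (remaining budget, set of unscheduled jobs) to the next job,
    so an adaptive policy may base its choice on how earlier jobs turned out;
    a nonadaptive policy just follows a fixed order.
    """
    total = 0.0
    for _ in range(trials):
        remaining, pending = B, set(jobs)
        while pending and remaining > 0:
            j = policy(remaining, pending)
            pending.discard(j)
            size, reward = sample(j)
            remaining -= size
            if remaining >= 0:   # reward is collected only if the job finishes in time
                total += reward
    return total / trials

# A nonadaptive policy is the special case that ignores the observed history.
order = ["green", "brown", "blue"]
print(expected_reward(lambda rem, pending: next(j for j in order if j in pending)))
```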
So once I define the problem, we'll look at our results, and then I'll motivate the reduction before moving on to the proofs. >>: How is this problem related to the decision tree? >> Ravishankar Krishnaswamy: This is one of the problems I mentioned in the routing section; it's an example of a routing problem. >>: Okay. >> Ravishankar Krishnaswamy: So what's this problem? If you want the one-sentence takeaway for this slide: it's essentially stochastic knapsack, but the jobs are located at different vertices of a metric. In more detail, we're given a graph with metric distances, and we're also given, as part of the input, a distribution of the processing time, or waiting time, at each vertex. Think of these as the random jobs we have to do; each has a distribution over how long it takes. And we're given fixed rewards at these vertices: if I complete a job, I get a fixed reward. There's a total budget of B on how long I have. That budget is shared: we need to travel according to these distances, and we also need to wait. So there's a combined budget. >>: The objective is time? >> Ravishankar Krishnaswamy: Yes, these two are in the same unit. >>: When you said graph with metric distances, did you just mean the graph metric? >> Ravishankar Krishnaswamy: Yes, exactly. The metric is a notion of time: how long it takes to travel from A to B. And additionally you have to wait at a vertex because there's a random job you have to process there. >>: The uncertainty is only in the waiting -- >> Ravishankar Krishnaswamy: The uncertainty is only at the vertices and not in the travel times. >>: Not in the rewards. >> Ravishankar Krishnaswamy: Not in the rewards, correct. We can handle uncertainty in the rewards, but currently we have to have a deterministic metric in this problem. >>: Looking at the scheduling problems, is the metric how much time it takes to switch from one job to the other? >> Ravishankar Krishnaswamy: You could think of it like that, although that is not the motivation -- at least, I talked to an architecture person and he said that's not really realistic in that framework. But I'll give an example, maybe a toy example, with a motivated objective. >>: Total distance traveled. >> Ravishankar Krishnaswamy: Plus the waiting time. >>: I see. >> Ravishankar Krishnaswamy: Okay. >>: So the knapsack problem is when the metric is -- >> Ravishankar Krishnaswamy: So, yeah, this reduces to the previous problem when there's no traveling at all -- the metric is zero. Okay, so what is the objective? The objective, like I said, is to come up with an adaptive strategy. A strategy says which vertex to visit next: maybe it's vertex two first, and depending on how long the job at vertex two took, it can visit different vertices next, and so on. After some time it runs out of time, and you want to come up with the strategy that maximizes the total expected reward. >>: The starting vertex is fixed? >> Ravishankar Krishnaswamy: The starting vertex is fixed; there's a given root. So, yes, it's fixed. Here's an example to address all the questions you asked. Let's say we start at home at 9:00 a.m. and we have to try to do a bunch of chores: we have to send mail at the Post Office, maybe deposit some checks at the bank, get some food, and also buy some tickets. All these places close at 5:00 p.m., so that's the budget for this problem: eight hours.
And there's some distribution over how long we have to wait at each of these places -- you can think of queues at these vertices, and this is the distribution of how long we have to wait. We start at home at 9:00 a.m., and maybe one possible solution is to visit the Post Office first. It takes two hours for us to get to the Post Office -- the traveling times are indicated there -- so it's 11:00 now. Maybe the distribution at the Post Office is such that we're extremely unlucky and it takes six hours: we are still waiting at the Post Office and haven't gotten to the head of the queue, which means in that case we haven't done anything and all the shops have closed. Alternately, we could be lucky and it only takes two hours, so on the second sample path, by the time it's 1:00 we've completed the Post Office and we can take more actions. Maybe we decide to get some food, but it takes three hours to travel to the restaurant, and by the time it's 4:00 we've reached the restaurant. >>: Does the distribution depend on the time? >> Ravishankar Krishnaswamy: Yes, it's an assumption we make: the distribution is completely fixed, independent of the time. And maybe the restaurant, as I said, takes half an hour deterministically to get the food. After half an hour we've got the food, and maybe we decide to go to the bank, but we run out of time walking to the bank. Notice that if we were lucky at the Post Office we got two tasks done; otherwise we got zero tasks done. The goal is to come up with the best such policy to maximize the number of tasks -- each task has unit reward here. So that's the interest on the practical side. These orienteering problems, and more generally routing problems, have been looked at from both the approximation algorithms point of view and the OR point of view, in both the deterministic and the stochastic settings. For the deterministic box there's very extensive literature, and you can refer to the citation I made earlier. All these stochastic routing problems have been looked at too, but the results are only for special cases, for special types of distributions, and the guarantees hold only for those distributions; we're interested in a worst-case guarantee over all distributions. What about the deterministic problem from an approximation point of view? This problem, where the waiting time at every vertex is deterministic, has received attention in the past, and there's a very nice, elegant constant-factor approximation for it. >>: What is the name? >> Ravishankar Krishnaswamy: It's called the orienteering problem. And the fourth box is what we're interested in; that's the focus of this talk. Okay, just reiterating one more fact: again, if you look at the heuristics, we know a lot of nice heuristics that work for this problem in practice and in simulations, and they all throw away everything about the distribution and keep only a few parameters, like the variance and the mean. So can we explain their effectiveness from an approximation point of view, and what is the set of tools required? So with that, our results: we find nonadaptive solutions -- after all this talk about adaptivity, we end up finding a solution which is nonadaptive, a fixed sequence of vertices to visit -- which is a factor O(log log B) approximation to the best adaptive policy. Recall that B is the budget of this problem.
And the nice thing about that nonadaptive solution is that it's simultaneously a constant-factor approximation if you're only comparing against nonadaptive policies. So those are the results we get. Yes, we assume the problem is discrete -- everything is integral, without loss of generality. The focus of the next section of the talk is this log log B approximation and how we get it. The main motivation for our reduction comes from the heuristics that work well in practice and in simulations: throw away everything about the distribution, just look at expectations, and try to work with them. How well does that perform? Here's the high-level approach. We are very good at solving deterministic optimization problems -- in particular, for orienteering there's a very good approximation algorithm, a factor of three. So why not try to reduce the stochastic instance to a deterministic instance? Let's use these expectations to reduce it to a deterministic version of the problem and solve that; the deterministic version naturally gives a nonadaptive solution, since there's no notion of adaptivity there. This is the high-level approach that we're indeed going to follow, and the natural candidate for the deterministic problem is orienteering itself, which we know how to approximate well. But, as the slide title gives away, this reduction is not going to work as is. So let's first look at what the reduction is, and then present a bad example for it. The reduction: for every job, we just replace the random waiting time by a deterministic value equal to its expectation. So I'm going to think of the waiting time as the expected waiting time. Now we want to find a path whose length plus waiting time is at most B and whose reward is maximized. This is a problem on a completely deterministic instance -- in fact it is exactly the deterministic version of the orienteering problem, for which we know constant-factor approximations. The real question is how well this performs on the original instance, and as you might have guessed, and as I already indicated, it does not work for the original instance. Here's the bad example. Imagine that home is here, and all the random tasks are located at a distance of B minus 1, all at the same vertex -- essentially think of them as at the same vertex. And there's an alternate, neighboring community, also at a distance of B minus 1. What do these jobs look like? In the first community every job has extremely high variance: each job has a waiting time of zero with overwhelming probability, or a waiting time of B with some tiny probability. In the other community they're all completely deterministic jobs with waiting time one-half. So how does this reduction perform? When we throw out everything about the distribution -- in particular, the variance -- we get that the expected waiting time here is 1 and the waiting time here is a half. So when we solve the deterministic version of the problem, it will prefer traveling here and picking two jobs out of this community, so it gets a reward of 2. Either way, you can see the deterministic solution gets at most two jobs, whichever cloud it prefers. On the other hand, for the stochastic instance, you can notice that because these jobs have very high variance, each one turns out to be 0 with overwhelming probability. In fact, an Omega(B) number of these jobs will all instantiate to size 0, and you can collect the reward for all of them.
With constant probability you can get Omega(B) jobs, so the stochastic optimum gets Omega(B) profit in total. You can see that by throwing out the variance, the deterministic instance got fooled into going to the second community instead of preferring the first one. So, yes, the takeaway is that this deterministic reduction captures the mean perfectly, but it completely ignores the variance. >>: Are there equally bad counterexamples for natural distributions like normal, Poisson, anything -- >> Ravishankar Krishnaswamy: I would say not -- yeah, I don't know for sure. But my guess would be: one thing is that if the reward can also be correlated with the size, then I think there are some bad examples, not such harsh bad examples, but again they may go away if the distributions have nice marginals. Okay. So the main thing we observed was that we need to somehow control the variance in the reduction. And what we do is: if we know a little more about the optimal solution, as in the previous example, we can do a lot more to control the variance. So in particular, suppose we knew that the optimal solution always travels for some fixed T-star time units. It's not true in general, but imagine it were. In the previous example it's always B minus 1: regardless of which community it goes to, it travels B minus 1 time units. And again, the traveling may not all occur up front, but let's say we rearrange and put all the traveling up front. So then what it's doing is solving a residual knapsack problem on the jobs located there, with a knapsack budget of B minus T-star. This may be adaptive, it can depend on the sample path; but on every sample path it's solving a knapsack problem with budget B minus T-star. In the previous example, T-star was B minus 1 and this budget was just 1. The insight we get is that we should really not look at the distributions beyond B minus T-star. In particular, there's no point in the distribution supporting sizes larger than that, because we're not going to get any profit from those large instantiations anyway: since we're traveling T-star, if any size crosses B minus T-star the game ends and we're not going to get a reward from that job. So why not truncate the distribution at B minus T-star and treat every larger size as just being B minus T-star plus epsilon? What do I mean by that? One thing in particular is that because we no longer look at very large sizes, we get much better control over the variance. So what does this do in the previous example? Let's ignore the second community and argue that the first community, by itself, now does very well. Why is that? Earlier the mean waiting time was 1. But because we always have to travel B minus 1 units, the knapsack problem we're solving always has budget 1. Therefore I'm not going to treat the large size as B anymore; I'm just going to treat it as 1 plus epsilon, and I won't collect any reward in that case. If I do that, then the expected truncated size comes out to be about 1 over B, and the deterministic version after truncation can collect all B of the jobs here. So now the deterministic instance has the right behavior. So the question that comes out of this: the main assumption was that the optimal solution always travels for some fixed amount of time on all its sample paths, which is probably a dubious assumption, as you might have guessed.
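(A small numeric sketch of this truncation step on the bad example's distributions; the value of B here is just illustrative.)

```python
def truncated_mean(dist, cap):
    """Expected size after capping every outcome at `cap`.

    `dist` is a list of (probability, size) pairs.  Outcomes larger than the
    cap are counted as the cap itself, since the knapsack would have expired
    anyway and no reward is collected from them.
    """
    return sum(p * min(size, cap) for p, size in dist)

B = 1000
# High-variance job from the first community: size 0 w.p. 1 - 1/B, size B w.p. 1/B.
risky = [(1 - 1.0 / B, 0), (1.0 / B, B)]
# Deterministic job from the neighboring community: size 1/2 always.
safe = [(1.0, 0.5)]

print(truncated_mean(risky, B))   # ~1.0  : plain expectation, which hides the variance
print(truncated_mean(risky, 1))   # ~0.001: after truncating at B - T* = 1
print(truncated_mean(safe, 1))    # 0.5   : the deterministic job is unaffected
```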
So the questions are: is there such a single good threshold at which we can truncate, and is the resulting solution good in general? We answer both affirmatively. To address whether there is a good truncation in general, the main issue is that we don't know the structure of the optimal solution as well as in the example we just considered; in particular, it's a very adaptive solution, and it can have different traveling times on different sample paths. The answer is yes, as you might have guessed, and in fact there's a very simple fix: try all possible truncation thresholds, and one of them will work out well. So what's the final algorithm? It's more or less just the reduction to the deterministic instance, but after a preprocessing step of truncation. We try logarithmically many values of the truncation threshold T -- powers of two, 1, 2, 4, up to B -- and choose the one which gives the most deterministic reward. What's the subproblem we solve for a given T? The first step is to truncate every job's size distribution at the value B minus T. What I mean is: suppose this was the size distribution of a job. I draw a line at B minus T and think of every larger size as being B minus T itself, and we zero out the reward in that part, because there's no way you can collect the reward there. After this truncation, define the size of the job -- now a deterministic quantity -- to be its expectation after truncation. So now what's the problem we need to solve? Notice that we assumed we are traveling for T units of time; that's what led us to truncate at B minus T. We want to find a path that travels at most T units of time such that -- and this is a separate constraint -- the total size of the jobs it picks up along the path, with sizes as computed in step two, is at most B minus T, and the reward is maximized. That's the final subroutine we solve. Barring the truncation, this is almost the simple heuristic suggested up front. The first observation is that since we're solving a deterministic problem, it gives a nonadaptive solution, as I claimed up front. Two things linger: how good is this solution for the stochastic instance, and is it even polynomial time? Why am I mentioning that? It's unclear whether step three can be solved in polynomial time. We answer both of these questions affirmatively and show the theorem that this algorithm is an O(log log B) approximation to the best adaptive policy. Okay, so let's move on to the proofs. What do we need to show? The first thing: since our algorithm chooses the threshold with the most deterministic reward, we need to show there is indeed a good candidate solution with high deterministic reward, comparable to the stochastic optimum. The second question is whether we can implement all three steps in polynomial time. For this we need a subroutine that finds a tour of bounded length which, in a separate dimension, has bounded total size of the jobs it picks up, and maximizes the reward. This is a problem we call the orienteering problem with a knapsack constraint, which is a subroutine for our overall goal, and we show we can approximate it within constant factors. I won't talk about that in this talk, but trust me on that. And the third step: the solution we find this way is nonadaptive.
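(A sketch of this three-step reduction in Python. It assumes a constant-factor knapsack-orienteering solver is available as a black box; that subroutine, solve_knapsack_orienteering, and all other names here are illustrative.)

```python
def expected_truncated_size(dist, cap):
    """Surrogate deterministic size: the mean of the job's size capped at `cap`."""
    return sum(p * min(size, cap) for p, size in dist)

def stochastic_orienteering_reduction(metric, root, size_dists, rewards, B,
                                      solve_knapsack_orienteering):
    """Sketch of the reduction: try truncation thresholds T = 1, 2, 4, ..., B.

    For each guess T of the travel budget: truncate every job's size
    distribution at B - T, replace it by its truncated expectation, and ask
    the deterministic solver for a path of length at most T whose total
    surrogate size is at most B - T, maximizing reward.  Keep the threshold
    whose deterministic reward is largest; following that path, vertex by
    vertex, is the nonadaptive policy returned.
    """
    best_path, best_reward = None, float("-inf")
    T = 1
    while T <= B:
        cap = B - T
        sizes = {v: expected_truncated_size(d, cap) for v, d in size_dists.items()}
        path, det_reward = solve_knapsack_orienteering(
            metric, root, sizes, rewards, travel_budget=T, size_budget=cap)
        if det_reward > best_reward:
            best_path, best_reward = path, det_reward
        T *= 2
    return best_path
```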
The question is how it relates to the original instance: having ignored everything and just looked at the expected values, does the solution still do well on the original instance? These are the three questions we address now. In particular, going back to the earlier outline, we're using the orienteering problem with knapsack constraints as the surrogate problem. For the first question, we show that the optimal adaptive solution can be embedded into the deterministic instance for some guess T-star, losing a factor of log log B. The second question is whether we can solve knapsack orienteering; we answer it by giving a constant-factor approximation for it. And the third question is: once we've found a solution, how well does it do in the original stochastic instance? We show it again does very well, up to constant factors. So, three parts to address. Let's look at the easiest part first, which is moving from the knapsack orienteering solution back to the original stochastic instance. What do we have? Assume we have a good solution to the knapsack orienteering instance: a path which travels at most T, has reward at least R, and the total size of the items it visits is at most B minus T, where each of these sizes is the truncated expected value we computed. Now what's the candidate nonadaptive solution? We have a candidate path; let's follow the same prescription: visit all the vertices in that order and try to fetch the rewards. But one thing this did not capture was the variance, so to allow for some slack we also do a sub-sampling step: when we visit a vertex, we toss a coin; with probability half we don't process the job at that vertex, and with probability half we do. That's our final nonadaptive policy. How well does it do? The analysis is extremely simple, and it fits into this little box here: it's just an application of Markov's inequality and linearity of expectation. First, since we follow the prescription, the total traveling we do is at most T. What about the total size? The total size, if we were to process everything, has expectation at most B minus T, because each deterministic size is itself an expected size. But we also sub-sample with probability one-half, so the total size in expectation is now at most (B minus T) over 2. We don't have any control over the variance here, so the best we can do is apply Markov's inequality: with probability at least half, every job finishes within the total of B minus T. In that case we collect a reward of Omega(R) in the original instance. So this gives a nonadaptive policy which fetches an expected reward of Omega(R). >>: So [inaudible] distribution. >> Ravishankar Krishnaswamy: Again -- so I was cheating a little bit here. We have to handle the truncation separately, but we can handle it, and the proof follows mostly along the same lines. The truncation actually reduces the sizes a little bit, and we don't have control over that here, so we need to handle it, but it can be done. Okay. So we've seen that if we solve knapsack orienteering well, we can get a good solution for the original stochastic instance. Let's move on to the harder part -- I'm going to skip this part -- which is to show there is indeed a good feasible solution for the deterministic surrogate we created. Notice what this entails; we need to show the following.
We have the optimal stochastic solution. You can think of it as a stochastic process: it visits some vertex up front, and depending on how the randomness turns out, it can do different things on different sample paths. Think of it as a stochastic process. What do we need to show? We need to show the following: that there is some sample path in this process, as indicated by the curved line there, with the following nice properties. If the sample path travels for T-star units of time in total, and you look at all the vertices it visits along the way and truncate their means at B minus T-star, then the sum of these truncated means is at most the B minus T-star that we need; and furthermore the path has good reward. Why are we done if we show this? Because then we obtain a feasible solution for the deterministic surrogate we created: in particular, for the choice T equal to T-star, step three will find this sample path, or at least something as good as it. So we are done if we show that there's some nice sample path which has a good bound on the sum of the means, and good reward. Okay, so let's first show this under some assumptions. For that I need a little bit of notation. I'm going to call a sample path well-behaved if it satisfies the following property: if it travels at most T units of time in total, then the sum of these truncated means is at most some alpha times (B minus T). Why would you expect this to be true? One intuition: we know that if you look at any sample path that travels for T, then the sum of the actual processing times along it is at most B minus T -- it's a feasible solution, it has to spend at most B time in total, and it spends T of it traveling, so it waits at most B minus T. We know that's true on every sample path. And so, for some constant alpha, it's reasonable to expect that this sort of concentration holds even when you look at the means -- that most of these sample paths are well-behaved with a constant alpha. The easiest case is when every sample path is well-behaved; in that case it's just a convexity argument. If each sample path is well-behaved, it's very easy: since the average over all these sample paths fetches a reward of OPT, one of them in particular must have reward at least OPT, and for all of them, if you sum up the truncated means, it's at most alpha times (B minus T), so I can pick the best constant fraction of it and get a feasible solution for the deterministic problem we created. So the real case is when not every sample path is well-behaved. The question is how we handle the misbehaving sample paths. The fix we have is also very simple: we look at the stochastic process and terminate it when it misbehaves. What does that mean formally? We look at the stochastic process and travel down every sample path -- this was the first vertex it visited, on the left branch it visited some second and third vertex, and so on -- and look at the first node along any sample path where this condition is violated, that is, where the following is satisfied: if you sum up the truncated means of all the previous vertices, the sum exceeds alpha times (B minus T). The first time T when this happens, I'll call it the stopping time, and let's mark that node with a star.
One technical aspect here is that these means are truncated at the stopping time: I look at all the previous random variables and truncate their expectations at B minus T, where T is the depth of this node in the process. That's the star node. So on each sample path we compute this quantity and mark the star node; that's the first time OPT misbehaves. What we do is truncate the process at the star nodes, so we don't look below them. Now it's clear that the new solution we've constructed is well-behaved, by definition. The only thing we need to show is that by doing this chopping off we don't lose too much reward in expectation. In particular, the loss can be upper bounded by the probability that the optimal solution reaches such a star node, times the expected reward it collects below the star node -- and below each star node it cannot collect more than OPT, otherwise that would contradict optimality. So the main crux of the argument is showing that the probability that you reach a star node is at most a half. To help along the way, let me explain how you should view the optimal solution as a stochastic process. Imagine this was the optimal solution: you start at home, then you go to the Post Office; you have to travel three units of time to go to the Post Office, or maybe four. The random variables we're going to create, the stochastic process we're going to look at, are the following. Look at the optimal solution, and look at where it is after it has traveled t units of time. If at that point it's at a vertex and processing a job, then X_t is the random variable for the processing time of that job. On the other hand, if it's along an edge, traveling between two vertices, then X_t is just zero. So in this example, suppose it took four units of time to reach the Post Office from home. Then X_1, X_2, X_3 are all zero, and X_4 is the random variable corresponding to the Post Office waiting-time distribution. Depending on how long it took at the Post Office, X_5 can be a different random variable; in this case it's traveling on both branches, so X_5 is 0 on both. But on the left branch it's still traveling at the sixth unit of time, whereas on the right it has already reached the restaurant, so X_6 is different on the two branches. What's the motivation behind all this? The nice thing is that if you sum up X_t for t from 1 to capital T, you get the total time that OPT has waited within its first T units of traveling. That's one thing. The second thing is that if you look at the differences, X_t minus its expected value, that forms a process; these are not i.i.d., because X_6, for instance, can depend on what happened at X_4, but it forms a process nonetheless. Now, using this notation, what does proving the lemma translate to? Recall that we were truncating OPT at the first point where it misbehaves. Looking at it with this definition: we take the sum of all these expectations, truncated at B minus T, and the first point where this sum exceeds alpha times (B minus T) is the stopping time of the process -- I look at the minimum T such that this condition is satisfied. We want to show that the optimal solution does not cross such star nodes with too high a probability. What does that translate to? It says, firstly, that there must exist a time T at which the optimal solution actually crosses it.
That means the total time for which it has waited is at most B minus T, because T is the time it travels for. And it must satisfy the stopping-time criterion: the sum of these expected values must exceed alpha times (B minus T). So we want to show that this event happens with probability at most a half; if we show that, we are done. The natural thing might be to use some sort of martingale concentration bound. The only thing I'll mention here, without going into too much detail, is that typically martingale bounds control the deviation of a sum of random variables from the sum of their expected values. What we have here is different: we want to bound the deviation of a sum of random variables from a sum of expected values where each expected value is itself a function of the stopping time -- if you look at the stopping time T, you go back in time, look at all the previous random variables, alter them based on the stopping time, and compute their expected values, and you want to bound this deviation. That's why we need nonstandard concentration bounds: it's comparing apples and oranges -- metaphorically, it's like picking a bad apple out of a whole bunch of oranges. That's the takeaway from this slide. We do it by running several martingale processes, one for each truncation threshold; and for each truncation threshold, the stopping time of the optimal solution is a random variable in itself. Azuma's inequality alone does not suffice, and we use a variance-based concentration inequality for martingales to get the concentration we need. The reason for the log log B: we can get exponential concentration for each of these martingales, but we have log B different martingales, so we need a deviation of about log log B in each one for the union bound over them to work out, and that gives the log log B overall. So we've seen this. In the last couple of minutes I'm just going to talk about the bottom arrow, and all I'm going to say is that we can approximate this problem well. It uses the Lagrangian relaxation method: lift the knapsack constraint into the objective function, and for a suitable choice of the Lagrange parameter you can solve the problem and get a constant approximation. Okay. So what did we learn from this whole sojourn? The first thing is that we follow this framework: we have a high-entropy problem here, with all the distributions; we say let's ignore the entropy entirely and use expectations, so we get a deterministic problem; we solve it and then recover a solution for the original problem. But the natural candidate did not work, because it only encodes the mean and does not capture the variance -- again, this is the problem we saw: if you don't encode variance, there's a problem. We got around that by enforcing structure on the optimal solution, by saying, look, it has to travel for some T units of time; now we can look at every random variable, truncate it, and get much better control over the variance, and this turned out to be good enough, up to log log B factors. The bulk of the analysis then goes into showing that enforcing this extra structure on the optimal solution does not change things too much -- that's the martingale proof I skipped. So what did we learn? We identified a general framework using a deterministic surrogate problem.
But there are things we need to think about when we come up with the surrogate. It should be able to handle the variance in the problem -- that's what the previous example showed. If there are correlations, it should be able to handle correlations between random variables; in the previous case I used independence all over the place. And thirdly, sometimes the surrogate must be general enough to yield adaptive solutions: there are problems you cannot solve with nonadaptive solutions -- there's a large adaptivity gap -- and then your surrogate should be general enough to yield adaptive solutions. These are the parameters you need to think about when you construct a surrogate. For a lot of problems our framework says you can come up with nice surrogates, but the difficulty is often pushed into the analysis. Why is that? At a very high level, the surrogate is a strengthening instead of being a relaxation. Usually when we write an LP relaxation or a mathematical program, the relaxation is truly a relaxation: the optimal solution embeds into it. Here we're saying let's ignore everything about the entropy -- we start with a high-entropy problem and create a low-entropy one: I'm only going to look at the means and variances and solve the problem using those parameters. So the surrogate assumes a lot more about the input than the original problem, and that pushes the difficulty into the analysis. Now, having seen this framework, how does it apply to the other problems I mentioned up front? That's the focus of this part. Going back to the knapsack problem: recall that the jobs now have distributions over sizes and rewards, and these can be correlated. This part of the talk will address how to handle correlations. The size can be correlated with the reward; a very easy example of that is a randomized algorithm -- depending on the internal coin tosses, both the running time and the utility we get are determined, and they could be correlated. The goal is again to adaptively schedule jobs to maximize the total expected reward, and there's no metric here: all the jobs are at the same vertex. What can we show? We get a constant-factor approximation for this problem. Let me mention up front that we can even handle more general models where you can preemptively stop a job if it's taking too long, try a different job, and so on. Maybe the distributions are such that you can infer conditional information about a job after it has run for a little while: after it has run for five units of time, maybe the distribution tells you it's always going to take a million units, and then you can set this job aside and move to different jobs. >>: So it's more adaptive. >> Ravishankar Krishnaswamy: It's still not adaptive in that sense -- it figures out when to stop and when to move, and all of that, up front. >>: Thank you. Any correlation between -- >> Ravishankar Krishnaswamy: Yeah, let's talk about that. One thing I'd like to say is that prior work assumed independence between these parameters. There have been very nice algorithms that sort of motivated the framework we studied, and they use this reduction in an implicit manner, but they assume independence. One example where the parameters are not independent: let's say that when you start a task, it can crash immediately with some probability. Clearly crashing does not give you any utility, right?
If it crashes, you get a tiny size and zero reward; otherwise you get the real size and the real reward. That's a perfect example on which prior algorithms fail. Just a few words about how we handle correlations here. We go back to the same framework: we need to come up with a deterministic surrogate, solve the surrogate, and recover a good solution. What's a good choice of surrogate for these problems? Based on prior work, we want to use linear programming formulations as the surrogates. These LPs use some kind of expected value: we throw away everything about the distribution, use the expected values in the linear programming formulation, and think of that as the surrogate. So the surrogates are LPs. The only thing I'd like to state is that the natural LP does not work as a surrogate; we need to come up with a stronger LP to handle correlations. Maybe, to address Nico's question, we can bring up the LP at the end and argue how it helps after the talk. But if you use LPs as surrogates, the embedding argument becomes easier, because it's just an averaging argument -- mainly because LPs can tolerate fractional solutions, whereas with the deterministic surrogate in the previous case we could not tolerate fractional solutions. And moving to a nonadaptive solution is simply an argument of randomized rounding. These are both very nice tools that LPs provide for us. One reason this would not work for the previous problem is that we don't know a good LP relaxation for the orienteering problem. So, like I said, the natural LP doesn't work; we have to use a strengthened LP. Now, the second problem I mentioned at the beginning of the talk is what's known as the multi-armed bandits problem. Here's the problem. Imagine we're given N Markov chains up front, and each state of each Markov chain has an associated reward and transition probabilities. The input consists of all these parameters: the Markov chains, the transition probabilities, and the reward of each state. There's also a starting state for each chain, and there's a budget of B -- I'll tell you what that budget means. What should the algorithm do? At any time step the algorithm should pick one Markov chain; that Markov chain then makes a random transition according to its distribution, moves to a new state, and gives us the reward of the new state. In this example, let's say the budget is three, so I'm allowed to make three pulls. On the first move, maybe the algorithm chooses Markov chain one; it makes a random transition and gives me the reward of the blue state, which is $1. Now, depending on what happened here, the algorithm can choose which Markov chain to play next. At the second time step maybe it chooses this other Markov chain, because there's a very high chance that the first one gives a zero reward here, so maybe it chooses to play the second one to be safe. The second one makes a transition and gives me a third of a dollar. At the third time step maybe the algorithm chooses to risk it all, because there's a tiny chance of a large reward here, so maybe it chooses the first chain again, and so on. The goal is to come up with such a policy to maximize the total expected reward of all the states you traverse within the budget of B pulls.
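(A minimal sketch, in Python, of this model and of evaluating an adaptive policy on it; the two-arm instance and the policy below are illustrative, loosely inspired by the talk's example rather than taken from it.)

```python
import random

# Illustrative two-armed instance: each arm is a Markov chain given by a start
# state, per-state rewards, and per-state transition distributions.  Arm 1 pays
# $1 on the first pull and is then a long shot; arm 2 safely pays 1/3 per pull.
arms = {
    1: {"start": "a",
        "reward": {"a": 0, "b": 1, "jackpot": 100, "bust": 0},
        "next":   {"a": [(1.0, "b")],
                   "b": [(0.1, "jackpot"), (0.9, "bust")],
                   "jackpot": [(1.0, "jackpot")],
                   "bust": [(1.0, "bust")]}},
    2: {"start": "x",
        "reward": {"x": 0, "y": 1.0 / 3},
        "next":   {"x": [(1.0, "y")], "y": [(1.0, "y")]}},
}
B = 3  # total number of pulls allowed

def step(arm, state):
    """Make one random transition of a chain; return (new state, reward of new state)."""
    r, acc = random.random(), 0.0
    for p, nxt in arm["next"][state]:
        acc += p
        if r <= acc:
            return nxt, arm["reward"][nxt]
    nxt = arm["next"][state][-1][1]
    return nxt, arm["reward"][nxt]

def play(policy, trials=100_000):
    """Estimate the expected total reward of an adaptive policy.

    `policy` sees the current states of all the chains (everything observed so
    far) and the number of pulls left, and names the chain to pull next.
    """
    total = 0.0
    for _ in range(trials):
        states = {i: a["start"] for i, a in arms.items()}
        for pulls_left in range(B, 0, -1):
            i = policy(states, pulls_left)
            states[i], reward = step(arms[i], states[i])
            total += reward
    return total / trials

# The schedule from the narrative above: chain 1, then chain 2, then chain 1 again.
print(play(lambda states, left: 1 if left in (3, 1) else 2))
```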
Alternately, you could consider different objectives: after playing for B time steps, the objective could be some function of the final configurations of all the Markov chains -- the maximum, for example -- and so on. The max version has been studied from a machine learning point of view; we're looking at the sum objective. So that's the problem. For those of you who know the exploration/exploitation framework, this phase is entirely the exploration phase: the objective tells you what you can get in the exploitation phase, and the question is what you can do in B time steps to discover the most about the chains. If time permits, I'll talk more about this at the end of the talk. As Nicol was asking, this is a problem where adaptivity is key: we have examples showing that nonadaptive solutions can perform arbitrarily worse than adaptive solutions. So what do we show? We show constant-factor algorithms which are adaptive, and we also show adaptivity is necessary. Prior work assumed that each Markov chain should in itself be a martingale -- it should satisfy the martingale property -- and we don't need that. A simple example where this is not true is the case of knapsack with correlated sizes and rewards; I won't go into the reduction, but that's a very simple example that's not a martingale. Again, what's the framework? We apply the same framework, and we want to use LPs again -- strengthened LPs, because the problem has to handle the correlated knapsack as a special case. But the addition here is that the rounding should recover an adaptive strategy, not a nonadaptive one. That's the interesting part of this step. So we've looked at these three: the orienteering problem was interesting because the surrogate could not tolerate fractional solutions; in the second part we had to strengthen the LP to capture correlations; and in the third part we had to come up with rounding algorithms that are adaptive. So what about future directions in this framework? The first thing is, we have looked at approximation algorithms for problems from a stochastic point of view, and there are lots of problems out there; we've just scratched the surface, metaphorically speaking, and we only understand very basic problems right now. One example of problems we don't understand is settings with passive transitions: in the bandit problem, the Markov chains could be making transitions even when you're not playing them actively. That's one sort of correlation we don't know how to handle -- there are much more global correlations that we don't know how to handle, in some sense. Another direction, from a technical point of view: all our algorithms only work if the objective is linear -- it's all expected total reward and so on. What about nonlinear objectives? We'd need different surrogates and different analysis techniques for handling those. For example, one problem might be that the objective itself is an optimization problem: you do adaptive probing today to solve an optimization problem tomorrow. One practical application of this: if you think of a learning problem, an immediate connection between our framework and the learning setting is the domain of active learning.
So there, the question is: you have a prior distribution over possible classifiers, and the goal is to probe some data points and figure out their labels, so that after probing, say, K data points, you have gotten the most information about the classifier -- you minimize the resulting error of the classifier. It fits exactly into our framework; the only thing is that the objective, the error of a classifier, is nonlinear rather than linear. So the question is whether we can generalize our framework to nonlinear objectives so that we can solve such problems. Some of the other research areas I've been looking at: one of them is network design. We look at problems like survivable network design and so on, and we give good online algorithms for these problems; the nice contribution is that we can use embeddings into random subtrees to come up with a very nice set-cover-like encoding of the survivable network design problem. Some of the other things we've looked at are scheduling problems -- we have some nice algorithms for broadcast scheduling -- and one thing I'm interested in is how we can solve online problems with convex objectives using some primal-dual framework; that has applications in scheduling with budget constraints and so on. To summarize: we looked at stochastic optimization from the approximation algorithms point of view and presented a general framework for solving such problems using a deterministic surrogate. The key is to come up with a nice surrogate which can handle whatever you need it to handle, and it works for three problems. As a future direction, can we use this framework for nonlinear objectives, which could help in machine learning and other problems? And, yes, this should be MSR -- this should still be [inaudible] because I don't have this. Thanks a lot. And questions? [applause]. >> Ravishankar Krishnaswamy: Any questions, comments, suggestions? Okay. Thank you. [applause]