>> Yuval Peres: Hello, we're delighted to have Sebastien Bubeck from
Princeton tell us about two basic problems in stochastic optimization.
>> Sebastien Bubeck: Thank you. I think I prepared way too much, so maybe it will only be one problem in stochastic optimization. Let me see. So feel free to interrupt me. I'm going to start with, and maybe only talk about, roughly my favorite problem in research. So this is the problem. You have K unknown probability distributions which are sub-Gaussian, with variance proxy bounded by 1, something like that. And what you want is to find the distribution which has the maximum mean. So the goal is to find the argmax of mu_i, where mu_i is defined as the expectation of X when X is drawn from the i-th distribution. Let's call it i*, and I'm going to assume it's unique.
So my goal is to find i*. And I need to tell you how I interact with these probability distributions. What I can do is this: I have a budget of N samples, so I can sequentially query the probability distributions and I get realizations from them. So I want to find i* using N observations, which are sequentially chosen.
More precisely, a little bit more formally, we have a sort of sequential game. Time goes from 1 to N, and at each time step, what do I do? I choose I_t, which is in {1, ..., K}: the index of one probability distribution. And what I receive is a realization: I receive Y_t, which is drawn from mu_{I_t}, and this draw is independent of everything else conditionally on I_t.
Once I've used my N samples, at the end what I want is to output one of the probability distributions. So I output J_N, which is in {1, ..., K}, and my hope is that J_N will be i*, that one. And how am I evaluated? I'm evaluated by how the mean of this guy compares to the true best mean. So my regret, the regret at time N, is the difference between mu_{i*}, the best mean that I could have obtained, and mu_{J_N}, the mean of the selected probability distribution. I'm going to call these K probability distributions arms; that's the terminology I use. And I put an expectation here. This is what I call the simple regret: r_N = E[mu_{i*} - mu_{J_N}].
It's just my optimization error. I want to optimize over the mu_i, and instead of the max I got this guy, and I look at the distance, and I want this thing to be as small as possible. If you know about -- yes?
>>: [indiscernible] when you go out, you inspect this guy one by one?
>> Sebastien Bubeck: Yeah.
>>: Then you select that guy, so you have the option to select this guy, or you can choose not to select it and go --
>> Sebastien Bubeck: You have total freedom when you choose I_t in {1, ..., K}; it can depend on all the previous observations you made. I don't know, imagine I made plenty of trials and now I choose to observe probability distribution No. 3, and at the next time step I choose probability distribution No. 10.
>>: Can you return?
>> Sebastien Bubeck: I can return. I can do anything I want.
>>: Would you have to return to the same things --
>> Sebastien Bubeck: Exactly. So to estimate the mean of arm 1, for instance, I'll need to try it many, many times to have an accurate estimate. Because imagine this is a Bernoulli distribution: the first time I try it I get 1, the second time I get 0. I want to learn -- essentially I want to learn the mu_i's.
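To fix ideas, here is a minimal Python sketch of this sequential game; the `strategy` object with `choose` and `recommend` methods is a hypothetical interface, not something from the talk.

```python
import numpy as np

def simple_regret(mu, strategy, N, rng=None):
    # Play the sequential game: at each step the strategy picks an arm,
    # observes one sample, and after N rounds recommends a single arm J_N.
    rng = np.random.default_rng(rng)
    history = []                               # list of (arm, observation)
    for t in range(N):
        i = strategy.choose(history)           # I_t may depend on the past
        y = mu[i] + rng.standard_normal()      # Y_t ~ N(mu_i, 1)
        history.append((i, y))
    j = strategy.recommend(history)            # J_N
    return max(mu) - mu[j]                     # realized simple regret
```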
Okay. So that's what I want to do. Now, I'm not really the first one to look at this problem, as you can expect. This is as basic as it gets: I just want to find the max of K finite things. But the issue is that roughly there have been two approaches, which are minimax and Bayesian. Because what I want is to find the optimal, in some sense, allocation strategy. I want to find the best allocation strategy so as to minimize this simple regret. But this problem is not well defined a priori: there is an optimal allocation strategy if I know the mu_i -- I do whatever I want, and at the end I output the best guy. This is optimal.
Now, in minimax what you say is: you design your allocation strategy, and then I can choose which set of probability distributions I'm going to throw at you, and I look at what your regret is in that case. That's minimax. And then the best is well defined: it's a min of a max.
>>: The normalization, there's no normalization.
>> Sebastien Bubeck: There's no normalization. There's no normalization here. So -- right. So in the minimax sense, the answer is that the optimal r_N is of order square root of K over N. That's the best you can do. And that's not very difficult to prove, and I think that's totally uninteresting. A trivial strategy will get you something like square root of K log K over N; you have to work a little bit to remove the log, but that's not really the point. We'll see that you can gain orders of magnitude with a new point of view.
>>: [indiscernible].
>> Sebastien Bubeck: What does what mean?
>>: Sub-Gaussian.
>> Sebastien Bubeck: It just means the usual thing.
>>: That's the normalization. So does that have a constant? So which --
>> Sebastien Bubeck: Right. Right. So sub-Gaussian with constant one, with a known constant.
>>: [indiscernible].
>>: In terms of Gaussians can be used.
>> Sebastien Bubeck: No, that's okay. I said it quickly: I said the variance proxy is bounded by one. It's not the main point. Imagine everything is Gaussian with variance one; the talk makes sense and is nontrivial.
So that's minimax, and this has been studied since the '50s, et cetera. Bayesian: you put a prior on the possible parameters of the probability distributions, and then you want to find the allocation strategy that minimizes the expected simple regret, where the expectation is with respect to the draw of the parameters from the prior. This is also well defined, but then you have to come up with a prior, et cetera, and it's not clear how to do it. So I want to go beyond these two things, and I will propose a sort of new perspective which allows you to talk about optimal strategies without committing to either the minimax or the Bayesian viewpoint. We'll come back to it.
So for those of you who know multi-armed bandits, this feels like a multi-armed bandit problem, except the performance measure is different. In bandits what we look at is this capital R_N, the cumulative regret: the sum of the instantaneous regrets at every time step. So we look at the sum for t = 1 to N: at every time step we could have played the optimal arm and gotten mu_{i*}, but instead we played I_t, so we got mu_{I_t}. And I'm going to look at this in expectation: R_N = E[sum_{t=1}^N (mu_{i*} - mu_{I_t})]. That's called the cumulative regret. And the trivial thing is that the simple regret is upper bounded by the cumulative regret divided by N. What I can do is, at the end, when you ask me to output something, I just select a time step at random and output the action that I played at that time, and that gives me this bound: r_N <= R_N / N. So this is always true.
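That reduction is easy to sketch; the `history` format below follows the hypothetical game-loop interface above.

```python
import numpy as np

def recommend_random_round(history, rng=None):
    # Turn any cumulative-regret strategy into a simple-regret one:
    # recommend the arm played at a uniformly random time step, so that
    # E[simple regret] = E[cumulative regret] / N.
    rng = np.random.default_rng(rng)
    t = rng.integers(len(history))
    return history[t][0]                  # the arm index I_t
```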
And this already gets you the minimax rate. But now I'm going to show you that you can do much better. So okay -- yes?
>>: [indiscernible] you think it's because -- the I_t?
>> Sebastien Bubeck: Yes. So the expectation is with respect to everything, in particular with respect to I_t. I_t could be randomized. But even if it's deterministic, it depends on the previous observations, which are themselves random. So I take the expectation. It's a complicated expectation -- I mean, it's simple to write, but it's not obvious how to analyze it.
>>: Expecting this thing, how many do you want to select from them -- just the maximum of them you want to select?
>> Sebastien Bubeck: Right. That's exactly what I'm going to talk about: how do you do this? What is the allocation strategy? How do you choose this guy?
>>: You choose several times.
>> Sebastien Bubeck: You choose one at a time, but you will make N selections.
>>: Make N selections. N is given to you.
>> Sebastien Bubeck: N is given to you. N is given to you.
>>: If N is really large, can't you find it --
>> Sebastien Bubeck: Yes, but the question is that you want the optimal rate. You want to make the most out of it. Like, with N, you will see.
So let me show you one trivial thing so that we're all on the same page. What is trivial is: you have a budget of N samples and K options, so you just allocate N/K samples to each arm. You try each option N/K times. So what does this give you? One thing, to simplify for the talk: I'm going to look at the error rate. The regret -- let's say all the means live in [0, 1], for instance -- is certainly smaller than e_N, the error rate, which is the probability that J_N is not equal to i*. And let me introduce a notation that is going to be important for the rest of the talk: Delta_i = mu_{i*} - mu_i. So Delta_i is the suboptimality gap, the distance between the quality of i and the best quality that you can get.
The regret is also certainly larger than Delta times e_N, where Delta is the smallest gap: Delta = min of the Delta_i over i not equal to i*. All right. So to a first-order approximation it's fine to focus on e_N, the error rate. So for most of the talk I'm going to talk about the error rate rather than the simple regret. So let's see what the error rate of this simple thing is.
about the error rate rather than the simple regret. So let's see what
is the error rate of this simple thing just apply general, so deviating
from epsilon is going to be bounded by exponential minus T times
epsilon squared. If I sample T times someone what I mean is everybody
stays in an interval roughly size data I if this is the case there's no
way that at the end I output somebody else and the best guy. If
everybody's in the interval of delta size I then I get the best thing.
So I have mu I. If it stays within let's say one-half of delta I, if
the empirical -- so going to denote by mu hat I. So that's my
empirical estimate using my samples, if the empirical estimate stays
within an interval of size one-half delta I for every I then at the end
I just look at the best one and I will get the true best. So what does
this give me? So with a union bound we just get the sum for I calls 1
to K exponential minus N over K. That's the number of samples that I
get times delta I squared. That's the deviation that I'm looking at
and the constant. Okay. So let me -- so to commence on this, the
The first comment is that this goes exponentially fast to zero: the probability of making a mistake goes exponentially fast to zero. And when is it small? The sum is dominated by its largest term, and the largest term comes from the i which minimizes the gap. So this is small when (N/K) Delta^2 is large: it is smaller than delta as soon as (N/K) Delta^2 is larger than log(1/delta). So what we mean is: it's small when N is at least of order K over Delta^2. If the number of samples, the number of experiments that I can make, is at least K over Delta^2, just uniform allocation will find the best action.
>>: Another log K?
>> Sebastien Bubeck: Right, I want another log K. I absolutely want another log K. If this is log(K/delta), then each term is delta/K, the sum is delta, and I find it. So that's what I get.
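Here is a minimal sketch of that uniform-allocation baseline, under the Gaussian-with-variance-one simplification used in the talk.

```python
import numpy as np

def uniform_allocation(mu, N, rng=None):
    # Pull each of the K arms N // K times, then recommend the arm with
    # the best empirical mean.
    rng = np.random.default_rng(rng)
    pulls = max(1, N // len(mu))
    emp = [np.mean(m + rng.standard_normal(pulls)) for m in mu]
    return int(np.argmax(emp))
```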
Okay. So this is fine. But clearly it doesn't feel very good, because you try everybody, while it could be that very early on you identify that some of these guys are not competitive: there is almost no chance that this guy will be the best one, but still you keep sampling him. What you want, maybe, is to focus your attention on the guys which look difficult to distinguish.
Okay. So what I'm going to do now is show you first what are the limits of this problem. Here we said that if N is that large, we can find it. Now, how large -- I mean, if N is smaller than something, you also can't find it. This was the main theorem that we proved back in 2010, Jean-Yves Audibert, myself and Remi Munos. And this is what gives us the new point of view, which is minimax over a class of permutations. It goes as follows. Take any mu, where mu is the product of the mu_i, and I'm going to take mu to be a Gaussian distribution -- it's a lower bound, so if I stick to Gaussian distributions, that's fine. It's a Gaussian distribution with mean mu and covariance the identity, in dimension K: just a product of N(mu_i, 1). For any algorithm -- in particular, the algorithm can depend on mu; maybe we know mu -- there exists sigma, a permutation of the indices, such that the probability of error of this algorithm on mu_sigma, the permuted distribution (so instead of having mean mu_i on arm i, I have mu_{sigma(i)} on arm i), is larger than some constant times exp(-c N log(K) / H), where H is the sum of the inverse gaps squared: H = sum over i not equal to i* of 1 / Delta_i^2. So what this means is that e_N smaller than delta must imply that N is larger than a constant times H log(1/delta) / log K. If you have a small probability of error, then the number of samples that you have must be larger than this complexity -- we call H the hardness of the problem -- divided by log K. And H is the sum of the 1 / Delta_i^2. Now if you compare this to the bound downstairs, you see that here, instead of H, I have K over Delta^2. But K over Delta^2 can be much bigger than H, right? So potentially, for some mu's, H could be much smaller than K over Delta^2.
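To see how large that gap can be, here is a quick numeric check on a made-up gap profile (one nearly-tied arm, the rest far away; the numbers are purely illustrative, not from the talk).

```python
import numpy as np

K = 1000
gaps = np.array([0.01] + [0.5] * (K - 2))   # Delta_i for the suboptimal arms
H = np.sum(1.0 / gaps ** 2)                 # ~ 1e4 + 4 * (K - 2) ~ 1.4e4
worst_case = K / gaps.min() ** 2            # K / Delta^2 = 1e7

print(f"H = {H:.3g}, K/Delta^2 = {worst_case:.3g}, "
      f"ratio = {worst_case / H:.0f}")
# Only one arm is genuinely hard to separate, but the uniform-allocation
# bound charges every arm at the hardest gap: here it is off by a factor ~700.
```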
>>: [indiscernible].
>> Sebastien Bubeck: In the worst case they are the same. And that's why the minimax analysis is not interesting: in the minimax analysis, the worst case is basically when H is equal to K over Delta^2, and in that case uniform allocation is almost optimal. So this is a thing that goes beyond that. It's distribution dependent -- it depends on the distribution -- and it tells you the right measure of hardness. Not yet, but yes. So far.
>>: But at the end of the day, e_N is just a proxy for the simple regret.
>> Sebastien Bubeck: Absolutely.
>>: Because if they're not the same, e_N could be terrible but the simple regret could be easy.
>> Sebastien Bubeck: Absolutely. Yes, that's definitely correct. So this is only a first step; it's step by step. This tells us something very precise about e_N, but it's not very precise for little r_N, the simple regret; it's an approximation only for the simple regret. So this is not the end of the story -- it is the end of the story, I think, for the error rate, but not for the simple regret, I agree. I don't know what the full answer is for the simple regret.
>>: [indiscernible] the results so.
>> Sebastien Bubeck: Exactly, it's along the lines of the Lai and Robbins results.
>>: [indiscernible].
>> Sebastien Bubeck: Exactly. In some sense, for those who know the Lai and Robbins results, the way to understand it is that it's the version of Lai and Robbins for the simple regret rather than for the cumulative regret.
Now, the proof techniques are completely different. Let me spend two minutes on this. I know how to prove the analogous theorem for the cumulative regret. But Lai and Robbins is not that: Lai and Robbins have to assume something about the algorithm. They prove a lower bound for the cumulative regret, but the algorithm has to be consistent for any possible distribution -- no matter what the distribution is, you have to be consistent and converge at a certain rate. Here I assume nothing about the algorithm; it could be terrible for some distributions. And actually Lai and Robbins in that sense is not true: you can get constant regret if you know the distribution up to a permutation. So it's similar, but it's not the same.
>>: [indiscernible].
>> Sebastien Bubeck: Exactly. So now, it's not the end of the story: I need to tell you, can you achieve this H? And you can -- almost. The strategy is basically just writing down mathematically the intuition from before: you try everybody a little bit, and if there is one guy that clearly does not look like he is going to be the best, you just stop sampling him, then focus on the rest, and so on and so forth. This is successive rejects. Successive rejects works like this. It goes by phases: you have phases k = 1 to K - 1, and there is a set A of active arms. At the beginning A is everything, and during phase k you sample n_k - n_{k-1} times everyone in A -- I will give you the formula for these n_k. So all the arms that are still active get sampled a certain number of times, and then you remove from A the empirically worst guy. You do that, and at the end of phase K - 1 you're left with only one arm, the winner: you say, okay, this is the guy I believe to be the best. And the theorem is that with n_k given by some formula -- n_k is proportional to N / (K + 1 - k) -- the probability of error is bounded by a constant times K times exp(-N / (H log^2 K)). So this means that N needs to be of order H log^3 K. Here capital K is the number of arms, the number of actions, and little k is a phase index in the algorithm: at phase k, the formula tells you how many times you sample. So compare to the lower bound over there: the lower bound was that you need at least H over log K, and here, if you have more than this, you are smaller than delta. So it's tight up to the logs. This was in 2010.
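A minimal sketch of successive rejects follows. The log-bar(K) normalization of n_k is taken from the published 2010 paper; the talk itself only says n_k is proportional to N / (K + 1 - k), so treat the exact constants as an assumption.

```python
import numpy as np

def successive_rejects(mu, N, rng=None):
    # K - 1 phases; each phase tops every active arm up to n_k total pulls,
    # then eliminates the arm with the worst empirical mean.
    rng = np.random.default_rng(rng)
    K = len(mu)
    logbar = 0.5 + sum(1.0 / i for i in range(2, K + 1))   # \bar{log}(K)
    active = list(range(K))
    sums = np.zeros(K)
    counts = np.zeros(K)
    n_prev = 0
    for k in range(1, K):
        n_k = max(n_prev + 1, int(np.ceil((N - K) / (logbar * (K + 1 - k)))))
        for i in active:
            draws = mu[i] + rng.standard_normal(n_k - n_prev)
            sums[i] += draws.sum()
            counts[i] += n_k - n_prev
        n_prev = n_k
        worst = min(active, key=lambda i: sums[i] / counts[i])
        active.remove(worst)
    return active[0]                                       # the surviving arm
```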
Just recently there's been very, very nice work from a group of people at Yahoo! -- Karnin, Koren and Somekh in ICML 2013 -- and they get down to H times log K times log log K. They modify successive rejects; it's still the overall same idea, but the analysis is a little bit different. I'm not going to explain it in detail. So we still have a gap. But I can also improve
the lower bound: I can remove this log K. Except that it's not so easy to remove. This proof is difficult -- it's the only thing I've written on the board so far which is difficult -- and you cannot really go into the proof and tweak some things; everything is tightly coupled. But I can modify the assumption a little bit and get a much simpler proof. So maybe this is of some value, because the statement is in a sense less powerful, but the proof technique is much simpler and could be applied to other settings. That's what I want to talk about now.
This is a theorem to be written up with Emilie Kaufmann -- so let's put 2014. It goes as follows. For any mu, again a product of N(mu_i, 1), and for any algorithm -- so the beginning is the same. But now the invariance isn't going to be over permutations, because that's what makes the earlier proof difficult: when you permute this guy with that guy, there's not much you can work with, and you can be in trouble. Instead of looking at permutations, I'm going to take one of the suboptimal arms and push it up. So I define F_i(mu), a map from R^K to R^K, such that the j-th component of F_i(mu) is equal to mu_j if j is not equal to i, and equal to mu_i plus 2 Delta_i if j equals i. So I have my vector mu, and this operator F_i takes the i-th coordinate and pushes it up to be the best one. The first thing to observe is that the hardness measure can only decrease: H(mu) is always larger than H(F_i(mu)). I can only make the problem simpler by doing this, because when I pull this guy up, everybody is further away from the best arm than it was previously: in mu you have a certain distance to the best guy, but in F_i(mu) the best guy becomes the i-th one, which is above the previous best, so the gaps increase. So the hardness measure decreases under this operation. And what I can show is that for any mu and any algorithm there exists i such that the simple regret on F_i(mu) is lower bounded by exp(-c N / H). So this gets rid of the log K over there. And this proof is five lines -- let's say maybe even four lines -- compared to four pages for the other one. Now, it's hard to say if it's weaker or stronger: there the class over which we take the maximum is a class of size K factorial; here it's only a class of size K. But anyway, now we know the result up to logs: the lower bound is H, exactly H, and the upper bound is H log K log log K. I think the truth is H, perhaps up to a log-log factor. And this is really a fundamental problem. And we only know basically one algorithm, which is this; everything else is a variant.
>>: [indiscernible].
>> Sebastien Bubeck: This one? It's a variant. It's not this exact algorithm: instead of having K phases they have log K phases, for instance, but it's still the same structure. And I think that to get rid of the logs you need something much smoother -- you shouldn't be tied to a schedule that you fixed beforehand; the schedule should be adaptive. But we don't know how to analyze this.
>>: [indiscernible] a little more than --
>> Sebastien Bubeck: Absolutely. With the log K phases, you remove half the arms every time. Yes, so that's the algorithm. Now I've said it.
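That log K-phase halving variant can be sketched as follows; the equal per-phase budget split is an assumption, since the talk only describes the halving structure.

```python
import numpy as np

def halving(mu, N, rng=None):
    # ~log2(K) phases; each phase spends an equal share of the budget on
    # the surviving arms, then keeps the better half.
    rng = np.random.default_rng(rng)
    active = list(range(len(mu)))
    rounds = max(1, int(np.ceil(np.log2(len(mu)))))
    for _ in range(rounds):
        if len(active) == 1:
            break
        pulls = max(1, N // (rounds * len(active)))
        emp = {i: np.mean(mu[i] + rng.standard_normal(pulls)) for i in active}
        active.sort(key=lambda i: emp[i], reverse=True)
        active = active[: (len(active) + 1) // 2]     # keep the top half
    return active[0]
```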
>>: [indiscernible] detection was to wait --
>> Sebastien Bubeck: No, because -- no. I think it's very far from it. But it's not clear. Multiplicative weights are typically designed for the minimax rate; they're not adaptive in this sense. That's their great weakness: they work with almost no assumptions, but if it turns out that the world is much nicer, then they don't adapt. And here it's really about trying to adapt as much as possible. We even want to adapt as much as if we knew the mu exactly. So --
>>: [indiscernible].
>> Sebastien Bubeck: So for UCB there is a 20-minute story; let me try to do it in two minutes. For UCB, you can provably show that it cannot converge at an exponential rate: the probability of error will be polynomial. Now, you can do a modification of UCB which actually does go at this rate. UCB plays at time step t the action that maximizes the empirical mean plus square root of 2 log t over T_i(t), where T_i(t) is the number of times arm i has been played so far. You can show that this is not going to work. What you can do is a modification called UCB-E, which we also did in 2010, and which goes like this: replace the 2 log t by a constant times N over H. You need to know H; if you know H and run this algorithm, then you get exactly this rate. Not knowing H, you could try to adapt to it online, but we don't know how to prove anything about it.
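A minimal sketch of UCB-E under the same Gaussian model; the exploration constant c in a = c N / H is a tuning assumption the talk does not pin down.

```python
import numpy as np

def ucb_e(mu, N, H, c=1.0, rng=None):
    # UCB with the 2 * log(t) exploration term replaced by a ~ N / H.
    rng = np.random.default_rng(rng)
    K = len(mu)
    a = c * N / H
    sums = np.zeros(K)
    counts = np.zeros(K)
    for i in range(K):                        # initialize: one pull per arm
        sums[i] = mu[i] + rng.standard_normal()
        counts[i] = 1
    for t in range(K, N):
        index = sums / counts + np.sqrt(a / counts)
        i = int(np.argmax(index))
        sums[i] += mu[i] + rng.standard_normal()
        counts[i] += 1
    return int(np.argmax(sums / counts))      # recommend best empirical arm
```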
But let me say something from a practical point of view, going back to one of your earlier questions. In practice, this problem is only interesting in the range where N is of the order of H. If N is much, much larger than H, anything will find the best action -- just do uniform allocation and you will find it. What is interesting is the regime where the problem is hard, where N over H is close to a constant. And if N over H is close to a constant, then the 2 log t term behaves like a constant too. So you can view the analysis of UCB-E as an analysis of the true UCB in the regime where N is of order H; this suggests that the basic UCB should work in that regime. So now let me just quickly show you some pictures. Can I get the thing down? Thanks. Yes.
>>: My question: what if you have two or three at the very top, so --
>> Sebastien Bubeck: Right. So I'm going to show you an experiment like this.
>>: [indiscernible] [laughter].
>> Sebastien Bubeck: So I'm just going to show you two experiments. In the first one we have 15 arms, Bernoulli distributions everywhere, and the best arm is arm one with mean .5. In this first experiment the means go down in an arithmetic progression: mu_i = .5 - .025 i, for i = 2 to 15. And what I plot here -- the bars are the probability of error for different algorithms. The first bar is just uniform sampling. You see uniform sampling has the worst probability of error; as expected, it's the most naive algorithm. In this problem, for instance, with N = 4,000, uniform sampling gets the wrong arm with probability .45 or something like that. Bars 2 to 4 are algorithms, such as Hoeffding races, that had appeared in the literature previously; they perform a little better than uniform, but not too much. Bar 5 is successive rejects, which I just told you about. Bars 6 to 9 are the finely tuned UCB, and the rest are UCB-E, where I try to estimate H online using a procedure which is nontrivial.
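For a rough, illustrative reconstruction of this first experiment, reusing the Gaussian sketches above (the talk used Bernoulli arms, so the numbers will not match the plot):

```python
import numpy as np

mu = [0.5] + [0.5 - 0.025 * i for i in range(2, 16)]   # arm 0 is the best
N, trials = 4000, 500

errors = {"uniform": 0, "successive rejects": 0}
for seed in range(trials):
    errors["uniform"] += uniform_allocation(mu, N, rng=seed) != 0
    errors["successive rejects"] += successive_rejects(mu, N, rng=seed) != 0

for name, e in errors.items():
    print(f"{name}: error rate {e / trials:.3f}")
```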
So you see successive rejects does better than all the previous versions. It almost divides the error by three: in the second experiment the probability of error of uniform is close to .6, and the probability of error of successive rejects is close to .2. So you can get a real improvement. This second experiment, by the way, is what you just asked about: we have one good guy with mean .5, then five guys with mean .45 -- so close by -- then another group at .43, and a third group below. So there are three groups at the bottom. And with three groups at the bottom like this, it's really worth trying to adapt: these algorithms will quickly remove all the very bad arms and focus on the good ones. Okay, so these are the numerical experiments. Now let me move on to something else: what if, instead of finding the best option, you want to find the M best? Maybe you want to find the five best arms instead of just the best one. Here I have the same two experiments. On the Y-axis I have the probability of error, and on the X-axis I have how many arms I want the algorithm to output. The first point here corresponds to the previous slide: I just want to find the best one. Blue corresponds to uniform allocation, where at the end I return the M empirically best. Yes?
>>: [indiscernible].
>> Sebastien Bubeck: Yes, this one is more demanding. You are good only if you get all M correct. So it's a difficult task.
>>: [indiscernible].
>> Sebastien Bubeck: Sorry?
>>: Ordering as well.
>> Sebastien Bubeck: No, no, no, no. That's the key point: you don't care about the ordering. You just want to find the M best; within the M best you don't care about the ordering. Right. So uniform is in blue and successive rejects is in red. In the previous slide, red was much below blue. Now what happens when you increase M? Successive rejects, which was optimal for M equals 1, becomes bad even compared to uniform.
What this says -- and it is expected at some point -- is that because successive rejects was designed to find the best one, it has only a very rough idea of the ranking below the best one, below the first two best. For the other guys it doesn't really know what's going on. So if you ask it to output the five best arms, it does a very bad job. This says you fundamentally need to modify the algorithm when you want to find the M best; you just cannot do successive rejects. (This red line is a variant of successive rejects where we rightly tuned the number of phases and the samples per phase.) So the modification that we did with two students is this green line, which is called successive accepts and rejects. The key, and only new, idea of this paper is that when you want to find the M best, you not only reject bad guys, but you also need to stop sampling guys that look good: because they look good, you just stop sampling them and say, okay, this one is going to be in the batch that I accept, and that's it. So that's successive accepts and rejects.
Now, the difficulty is: at the end of a phase, how do you decide whether in this phase you should accept someone or reject someone? What you do is compare how confident, in some sense, you are about the acceptance and about the rejection, and you do whichever is best for you. The analysis is harder than for M equals 1, but you see that in practice it really works -- that's the green line, and for the other experiment too it really works; it's really better. And Gap-E is a variant. All right. Is there any question on this? Can you get the screen up? Thanks. Okay. So is there any question on this? Yes?
>>: Why is the successive -- [indiscernible].
>> Sebastien Bubeck: So different from what?
>>: The other one ESR.
>> Sebastien Bubeck: Yes.
>>: When you try to accept the best guys, can you just treat it as the inverse problem, the [indiscernible] problem of rejecting? Now you have two versions. Just invert --
>> Sebastien Bubeck: Absolutely. The question is, where do you put the baseline? Where do you invert? You see what I mean? You say you want to take the negative, but when do you stop taking the negative? That's exactly the key point. Basically, when you want to find the M best, let's assume that mu_1 is larger than mu_2, et cetera, up to mu_K. Then the new gaps that I define are: if you are one of the M best, you look at the distance between mu_i and mu_{M+1}, the first guy not among the M best; and conversely, if you are not among the M best, you look at your distance to mu_M, the last guy who is one of the M best. So you have these gaps, and what the algorithm does is estimate them and then decide to accept or reject based on those estimated gaps. It's along the lines of what you just said. And the complexity -- what you can show in this case is that the complexity is the sum over the arms of 1 over these gaps squared. So maybe just one open problem, and maybe I'll spend ten minutes on the other topic.
What's important is that here we want to find the M best arms, but we put no structural assumption on the M best arms. What would be very interesting is the following more combinatorial problem. Assume that you have a graph G on K edges, and on each edge you have a probability distribution. Now what you want to find is, say, the best spanning tree. Let's say the spanning trees have a certain size M in this example. So we want to find a subset of size M of the K edges, but this subset also has to satisfy some structural property. More generally, we are given a collection C of subsets of {1, ..., K}, and what you want is to find the argmax over S in C of the sum over i in S of mu_i. So you want to find a subset within the combinatorial class C which maximizes the sum of the values inside it.
I think this is completely open, and I don't think there's a general theory -- I don't think there's a general algorithm that takes C as an input and then finds the best structure at the optimal rate. I think you need to think hard about the algorithm for every single problem. The first one is the best spanning tree; I don't know how to do it. You could try to think about finding the best matching: say you have a complete bipartite graph and you want to find the best perfect matching. I have no idea how to do that. You can go through the list of all combinatorial optimization problems and try to redo things in this stochastic framework, with this point of view: the point of view of optimal distribution-dependent rates.
>>: Is this known for the minimax rate, say for the spanning trees?
>> Sebastien Bubeck: No. The issue with minimax is that more or less you gain only a log factor; here you gain orders of magnitude -- you move from K over Delta^2 to H. But no, we don't know it for minimax either.
>>: But is it something like before, something on the spanning trees?
>> Sebastien Bubeck: It could be, but it's going to be much smaller than this. I mean, something trivial would be to view each spanning tree as an arm and then find the best one -- but that's exactly what you don't want to do. What you're saying is a trivial upper bound, and the key is that the answer is not going to be like this. One thing it could be is: how influential is this edge? Say you look at the best spanning tree that contains this edge and the best spanning tree that does not contain this edge; there's a gap between these two values, and maybe the complexity is a sum of one over these gaps squared. I don't think so -- I think there are nontrivial correlations between the edges. So I don't know what the answer is.
Okay. Now I want to quickly talk about something else. One nice thing about this theory is that I've talked with quite a few people about practical problems of discrete optimization, and often they can be cast within this framework. But sometimes they can't, and I will give you one example. Well, you have to think to get something. So I have a set X, a countable set that is known to me, and A is a subset of X. Think of X as the set of integers and A as the set of prime numbers. So A is a set of interesting elements, in some sense. But I don't know A. And what I want is to discover A: I want to discover as many elements of A as possible. Now I need to tell you how I access these sets. I access them through experts: I have experts, which are probability distributions mu_1 up to mu_K supported on X. And now I can play the same game as before: sequentially I make requests to these experts, and when I make a request to one of the probability distributions, I get a random variable drawn from it, and I observe whether or not I got an interesting element. So the game goes like this: choose I_t in {1, ..., K}, receive Y_t drawn from mu_{I_t}, and also observe whether or not Y_t is in A -- so I observe the indicator that Y_t is in A. And what I want, after N samples, is to look at F(N), the number of interesting items that I found. How many interesting items did I find? Well, I found the items Y_1 up to Y_N; that's the set of items I found. Which ones were interesting? The intersection with A. How many did I find? The cardinality of this: F(N) = |{Y_1, ..., Y_N} intersect A|. So now I want to maximize F(N).
>>: Don't you accept when --
>>: Except when I receive --
>>: You're told --
>>: Whether it's interesting or not.
>> Sebastien Bubeck: Yes, so that makes sense. Imagine I have distributions on the integers, and I don't know what the prime numbers are. I sample from one of the distributions, I get an integer, and then somebody tells me whether it's a prime or not.
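For the running example, a tiny sketch of the objective; `is_prime` stands in for whoever tells you whether an item is interesting.

```python
def is_prime(n):
    # Toy membership oracle for the running example: A = prime numbers.
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def f_of_n(observed, is_interesting=is_prime):
    # F(N): number of distinct interesting items among Y_1, ..., Y_N.
    return len({y for y in observed if is_interesting(y)})
```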
>>: [indiscernible] the scenario a little more -- why?
>> Sebastien Bubeck: Yes, now you want a concrete one. No, absolutely. I actually think there are many, but I don't know many yet; I know one. Imagine you have a big graph -- it's really enormous, say the electrical grid in the United States. You have nodes, and the nodes are going to be your elements of X. Now there are some nodes in this network which are faulty: there is something going on with those nodes, and you actually need to physically go there and fix them. Given a node, you can run a security analysis and test whether or not the node is faulty, and if it's faulty you can go there and fix it. But of course you cannot run the security analysis for every node in the network. So what you do is hire some engineers; they think hard about the problem, and what they come up with, maybe, are some kind of random walks on this network which are biased towards faulty nodes. Maybe they're very smart and they were able to do that. So you have K of these engineers, and each came up with their own heuristic: K random walks, or rather probability distributions on the network. And you have only one computer that can run the security analysis, so every day you need to choose one of the K engineers, run his or her heuristic, then run the security analysis, and move on to the next day. Okay, is this a good example? You could also have, I don't know -- you have computer code and you want to find all the bugs in your code, and you have different heuristics to find elements of the code that could be wrong, and you want to combine these heuristics in the best way.
>>: [indiscernible] problems, how do you apply different heuristics to show --
>> Sebastien Bubeck: Exactly. [indiscernible].
>>: [indiscernible] problem with gathering --
>> Sebastien Bubeck: Is it equal to the previous problem? I don't think so, because this one is much more dynamic. Look at what happens if mu_1 is a Dirac on an interesting item. This guy gives me an interesting item, but it gives it to me only once: when I come back to him, he always gives me the same interesting item, so it's not interesting to me anymore. This guy could be super good for one time step and then bad forever. So there's a dynamic aspect that was not in the previous framework.
>>: [indiscernible] the set -- multiset?
>> Sebastien Bubeck: No, it's a set, just a regular set. Meaning exactly that if you see the same interesting item twice, it doesn't count twice; it counts only once. That's where the -- if it weren't like that, it would be exactly like the previous problem. Because it counts only once.
>>: [indiscernible].
>> Sebastien Bubeck: Yes, exactly: discovery. We want to discover A. So we called it optimal discovery with expert advice. Now, what can you do? Well, imagine you knew the distributions mu_i; what would be a simple thing to do? For each distribution you could estimate the probability that it gives you an interesting item that you have not seen yet. So you can define these quantities M_i(t), the missing mass: the probability mass that mu_i puts on the set A minus everything observed so far. If I knew A and the mu_i, I could compute those things, and what I would do is just follow the argmax: go to the guy that maximizes this quantity. But I don't know the M_i(t). What I can do is estimate them. This is a famous problem: this is the Good-Turing estimator. Estimating the missing mass is something very famous; you can have an unbiased estimator of this guy and even concentration inequalities. And so the algorithm we did is: you sample the argmax of this estimate plus a confidence term, which is given by the deviations. I'm going to show you now some experiments that we did with this algorithm. Can I get the screen again? So the algorithm is sort of clear: for each expert we estimate the missing mass, and we add the confidence term, which is given by deviations that you can derive.
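Here is a hedged sketch of that strategy. The Good-Turing estimate for an expert is the fraction of its draws that are interesting items it has seen exactly once; the exact confidence width is an assumption (the talk only says it comes from deviation inequalities), as is the whole interface.

```python
import numpy as np
from collections import Counter

def discover(experts, is_interesting, N, c=1.0, rng=None):
    # experts: list of callables, each returning one random item from X.
    # is_interesting: membership oracle for the unknown set A.
    rng = np.random.default_rng(rng)
    K = len(experts)
    counts = [Counter() for _ in range(K)]       # per-expert item counts
    pulls = np.zeros(K)
    found = set()

    def query(i):
        x = experts[i]()
        counts[i][x] += 1
        pulls[i] += 1
        if is_interesting(x):
            found.add(x)

    for i in range(K):                           # initialize: one query each
        query(i)
    for t in range(K, N):
        # Good-Turing estimate of each expert's missing mass on A:
        # fraction of its draws that are interesting items seen exactly once.
        gt = np.array([
            sum(1 for x, n in counts[i].items() if n == 1 and x in found)
            / pulls[i]
            for i in range(K)
        ])
        ucb = gt + c * np.sqrt(np.log(t + 1) / pulls)   # assumed width
        query(int(np.argmax(ucb)))
    return found
```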
And so now this is an example.
So --
>>: The mu_i here, in order to --
>> Sebastien Bubeck: No. The M_i(t) was if I knew the distributions; but then I can do the Good-Turing estimator, which does not need to know the mu_i. What you do is, for each expert, estimate how many interesting items you have seen exactly once in its sequence, suitably normalized; that's the Good-Turing estimator. So I look at the problem where I have seven experts. The q_i are the proportions of interesting items for each distribution: q_1 equals 50 percent, q_2 equals 25 percent, et cetera. So expert one has probability mass 50 percent on the interesting items. The distributions are disjoint, and N is the overall size of the problem: each distribution is uniform on a set of size 128 in this plot, 500 in this plot, 1,000 in this one, and 10,000 in this one. So I have distributions which are uniform on sets of different sizes, and different proportions of the sets are interesting. And you see something is going on, right? You can see this convergence. What I plot is the number of interesting items that you found: this is time, and this is the number of interesting items found. The top curve is the oracle strategy that plays the arm maximizing the true missing mass; the second one is our algorithm; and this one is just uniform -- allocate requests uniformly. And you can see that the number of interesting items found by our strategy converges uniformly to the oracle's as the size of the problem grows. That's why we gave it the name of macroscopic limit: as the size of the problem grows, our strategy is uniformly optimal in time. So this is a completely different kind of limit from the multi-armed bandit problem: there, and in the problem before, I was looking at N going to infinity; here, for any fixed N, as the problem size goes to infinity, I'm optimal. So it's a very different kind of limit. This was for disjoint distributions; this other one is the example with integers and prime numbers, and it gives the same result -- we obtain the same theorem, et cetera, and I don't have time to explain.
And so these are just some references: the 2010 paper with the lower bound, the optimal discovery paper, and the book with Nicolo Cesa-Bianchi on bandit problems, if you want to know more about this. And that's it.
[applause]
>> Yuval Peres: Thank you. Questions?
>>: So just on the first part you described: you had explained to us the H over log K lower bound, and then you said you would remove the log K, but that was changing the problem. What about the problem with the permutations --
>> Sebastien Bubeck: No idea how to do it.
>>: So the rest [indiscernible].
>> Sebastien Bubeck: Yes.
>>: And the other bound, the upper bound from the --
>> Sebastien Bubeck: Yes.
>>: So that bound was for which problem -- or for both of them?
>> Sebastien Bubeck: Right. So the upper bounds assume nothing: they hold when you know nothing about the problem, and still you can adapt at the right rate. So the upper bound holds for both settings, if you want.
>>: And if you know it, does assuming more not lead to an improvement there?
>> Sebastien Bubeck: Well, what the theorem says is that at most it could give you an improvement by a log factor. But it could be; we don't know. It's an excellent question. Maybe with the log it's -- I don't think so. But it would be nice to have a better proof for the case where you just have permutations. The key trick was to change the problem so as to simplify the proof. That was the point. Yes?
>>: What about lower bounds -- here, are there distribution-dependent lower bounds? Can you say anything about the problem?
>> Sebastien Bubeck: The lower bounds are basically trivial: it's immediate to get the optimal one-over-square-root rate. There's nothing really interesting to do there. And that's why, despite the simplicity of the problem, it has not really been looked at in the past: from the simple minimax point of view it's not interesting. You have to do something other than minimax to make it interesting, basically. But people in practice have been looking at it, because in practice it's an interesting problem that people face.
>>: Does it not say much about little r_N, the story over there?
>> Sebastien Bubeck: No. Originally -- I can just tell you one thing, which is this lower bound, from before actually, from 2009: if you look at the simple regret -- this was the thing that could go to zero exponentially -- I can lower bound it, for any strategy, by exp(-c R_N), the exponential of minus a constant times the cumulative regret. The optimal cumulative regret is of order log N, so if you're optimal for the cumulative regret, then you have only a polynomial decay for the simple regret. This lower bound was the start of this work, because it says that minimizing the cumulative regret and minimizing the simple regret are completely different problems: if you optimize for the cumulative regret, you don't get optimal strategies for the simple regret. Yep.
>>: [indiscernible].
>> Sebastien Bubeck: Oh, yeah, yeah, sorry, sorry, yes, yes. So for the capital R_N, we know strategies for which capital R_N is smaller than some constant times log N.
>>: For small r_N.
>> Sebastien Bubeck: For small r_N and not e_N, is that your point? No, we don't know anything beyond applying those strategies. But I believe that those strategies are good for little r_N -- I think those strategies are good for the simple regret. I just don't know how to do a better analysis than the trivial one.
>>: Won't give you anything initially.
>> Sebastien Bubeck: No.
>>: [indiscernible].
>> Sebastien Bubeck: I think -- yes. But I don't know how to prove anything beyond trivialities. You can say trivialities, but anything beyond that I don't know. Thank you.