
>> Yuval Peres: All right. Welcome. So we should pay attention. This is the only talk of the day
from someone who does not report to Jennifer. Ofer Dekel will tell us about bandit convex
optimization.
>> Ofer Dekel: Thanks. Yeah. So this is joint work with Sebastien Bubeck and with Yuval and
with Tomer Koren, who is a student at the Technion. It's going to be about learning
theory. I'm going to talk about one of the most important still open problems in learning
theory, really one of the kind of embarrassing gaping holes in our understanding of online
learning.
So here's the problem. Let's just jump right into it. The problem is adversarial bandit convex
optimization. We'll parse what this means. So this is going to be a T round repeated game
between a randomized player and an oblivious adversary. So the player has access to random
bits and the adversary does not adapt from round to round so he's oblivious to the player's
actions. The player has an action set and this is going to be some convex set C which is known
in advance. The number of rounds in the game is also going to be known in advance and here's
how the game proceeds: so the adversary privately chooses some sequence of functions, F1
through FT, and each one of them is convex; each one maps each of the actions that the
player can play to the interval [0,1], so they're also bounded and convex, but otherwise there's no
relation between F1, F2, F3. They can change arbitrarily from round to round and he chooses
them privately so the player doesn’t know these functions at all and then the player starts to
iterate. So the adversary is oblivious so he can make all his decisions before the game begins.
Then the player plays the game, so T rounds of the game. On each round he chooses a point in
the convex set and plays that point. He can do this using some random bits. He incurs the loss
which is the value of the function. So if he's on round T and he chooses the point XT then he
pays a loss which is FT evaluated at XT, and this is the number that he sees, and he sees only this
number, and then has to go on to the next round.
So he’s collecting this loss as he’s going on for T-rounds. Let's see a little picture that shows
this. So again this game starts out by the adversary choosing the entire sequence of functions
so here the action set, the domain of the functions is just the two-dimensional square and he
chooses arbitrary functions, they have to be convex but that’s it and they have to of course
have a bounded range as well, but you can see they have no specific form other than being
convex and then the player starts to iterate. So on round one he chooses a point, again he can
do this using some randomization, and he plays this point and he incurs the loss. The loss is just
the value of the first function at that point but he doesn't get to see the function, he just gets to
see this one number 0.3. And then he goes on to round number two and he chooses another
point and incurs another loss and so on and so forth. He never gets to see any of these
functions but he collects this loss. Is that clear?
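To make the protocol concrete, here is a minimal sketch of the game loop just described; the adversary_losses and player objects are hypothetical stand-ins, not an implementation from the talk.

    # A minimal sketch of the T-round bandit protocol described above.
    # `adversary_losses` and `player` are hypothetical stand-ins: the adversary
    # fixes all T convex loss functions before play, and the player only ever
    # observes the scalar loss of the point it actually played.
    def play_game(adversary_losses, player, T):
        total_loss = 0.0
        for t in range(T):
            x_t = player.choose_point()        # may use private random bits
            loss = adversary_losses[t](x_t)    # a number in [0, 1]
            total_loss += loss
            player.observe(loss)               # bandit feedback: only this number
        return total_loss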
>>: So just to make sure, so this is a convex function, so the minima are actually not going to
be on the boundary of the body, but somewhere inside the body?
>> Ofer Dekel: They could be. They could all be linear.
>>: But generally speaking-
>> Ofer Dekel: The adversary can do whatever he wants, they can be inside, they can be-
>>: Not necessarily on the boundary, because it's a loss [inaudible].
>> Ofer Dekel: Yes. Absolutely. So let's make a few definitions. The expected cumulative loss
of the player is just the expectation of the sum of the losses that he incurs as he plays the
game. This is his loss; this is the thing that he wants to be small, but since the loss
functions are arbitrary, just wanting this to be small is not enough; we have to compare to
some benchmark. It's meaningless to just look at this loss because the adversary could just
choose all the loss functions to be the constant one, so you'd always incur a big loss. That's
not the point here. We have to compare this expected cumulative loss to some benchmark.
The benchmark, what is it? It's just the value of these same functions evaluated at the best fixed
point in hindsight. So if you took the offline problem, if you knew all these F's, you'd just solve
this convex minimization problem, just find the point that minimizes the average function; that's
what I want to compare to. So the difference between what the algorithm accumulates in
expectation and the loss of the best point in hindsight: this is called the regret. So this is how
the player penalizes himself.
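In symbols, writing x_t for the (random) point played on round t, the regret just described can be written as follows (a standard formulation consistent with the definitions above):

    \[
    \mathrm{Regret}_T \;=\; \mathbb{E}\!\left[\sum_{t=1}^{T} f_t(x_t)\right] \;-\; \min_{x \in \mathcal{C}} \sum_{t=1}^{T} f_t(x).
    \]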
Just to define some notation: rho will denote the policy that the player plays, so the player plays
some policy rho, this is the loss functions of the adversary, and this is the regret with respect to
those two things. So this is how a player evaluates himself, and now we can measure the
difficulty of the game with this Minimax Regret. It's simply the regret when both the player and
the adversary play optimally. So it's the minimum over all possible policies of the player, maximum
over all possible loss sequences, of the regret paid by the player, and now just note that
the adversary knows the policy when he's choosing these loss functions. This is important
because later on we'll see that the roles will be reversed. So he knows what the policy is; he
doesn't know the player's random bits.
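Written out, with rho denoting a player policy, the minimax regret just described is (again a standard formulation):

    \[
    \mathcal{R}^*(T) \;=\; \min_{\rho}\;\max_{f_1,\dots,f_T}\; \mathrm{Regret}(\rho, f_1,\dots,f_T).
    \]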
I would say this game is learnable if this Minimax Regret is sublinear in T. So if, in the worst
case, when the adversary is doing his worst, the rate at which you
accumulate regret is still sublinear, that means that on each round you're accumulating something that
has to tend to zero, so you're getting better and better as the game gets longer. So you're
learning. So learnable just means this thing is sublinear.
So how do we know if this game is learnable or not? So let's take a step back. Let's make some
assumptions. So first let's assume that the functions are also Lipschitz. So this is a technical
assumption; we’ll get rid of this in a minute. And now let's pretend for a minute that instead of
getting the value of the function I actually get something much more informative; I get the
value of the gradient of the function at that point. So just pretend that for a minute. And now
in 2003 Zinkevich showed that just simple gradient descent will already guarantee a sublinear
regret of square root T.
So this is already kind of an amazing thing. I'm taking the gradient step based on
everything I've seen in the past, or the current function, and tomorrow's function has absolutely
nothing to do with it, yet still I can guarantee this learning if I just follow the gradient path
the idea here, the idea of the proof is just to show that I don't know what the adversary is going
to do, he can do whatever he wants, but if indeed the adversary chooses functions whose
minimum is very, very different, so he somehow mixes it up, I take a gradient step in this
direction and he puts the minimum down there and then I take a gradient step in that direction
and he puts the minimum over there then in fact he'll be hurting all the points uniformly so this
best point in hindsight will also get worse. What the adversary would want to do in fact is not
that but he would want to hide a really good point, hope that I will never find it, but make it
consistently good across many of these rounds, and if I do gradient descent I will actually find
that point. So that's the idea of why I can learn even though the past and the future have
nothing to do with each other. So this is just something we have to get used to. There's a
matching lower bound. So square root T is as good as it gets.
Now let's get back to the problem that we do care about, not the one where we get gradients
but the one where we can only get the evaluations of the function.
>>: When you say matching lower bounds for this method or for-
>> Ofer Dekel: For any algorithm the worst-case adversary for that algorithm will in fact inflict
this much damage.
>>: Even for an algorithm that sees the gradient?
>> Ofer Dekel: Yes. It’s for any algorithm that can see the entire function so you can even
assume that he received the entire function. So here's just a depiction of this. So again I get
the gradient, I take the step, for the next loss I play this point and I take a step, for next loss I
play that point and I take a step and so on and so forth. This is gradient descent.
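As a concrete illustration of Zinkevich's scheme, here is a minimal sketch of projected online gradient descent; the step size 1/sqrt(t) is one standard choice, not necessarily the exact one in his paper, and the box projection stands in for projection onto the convex set C.

    import numpy as np

    def project(x, lo=0.0, hi=1.0):
        # Euclidean projection onto a box; stands in for projection onto C.
        return np.clip(x, lo, hi)

    def online_gradient_descent(gradients, x0, T):
        # Zinkevich-style online gradient descent with full gradient feedback.
        # gradients[t](x) is assumed to return the gradient of f_t at x.
        x = np.array(x0, dtype=float)
        played = []
        for t in range(1, T + 1):
            played.append(x.copy())
            eta = 1.0 / np.sqrt(t)                   # a standard decaying step size
            x = project(x - eta * gradients[t - 1](x))
        return played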
The first real progress on the bandit version of the problem, so bandit means that I only get the
value of the function, was made in 2005 by Flaxman, Kalai, and Brendan McMahan,
and here's the idea: I can estimate this gradient using only the evaluation of the actual function
at only one point. So I can have a one-point estimate of this gradient. I can run gradient
descent with this estimate. So here's the estimate: I estimate the gradient at the point that I
want to play by the value of the function at a point that's nearby. So let me show you what
these things are. So I start with this point. This is my state, the point that I have in my head,
but rather than playing this point what I do is I choose a uniform point on a sphere around the
point that I want to play. So this is XT, U is going to be some uniformly chosen unit vector, delta
is going to be some scale, the radius of this circle, and I choose a uniform point in the circle and
then I'm going to take a step in the direction opposite to this point, the size of the step is going
to be proportional to the value of the function at the point that I played. So this kind of
magically turns out to be an estimator of the gradient.
So this is how it goes: you have some point in your head, but you actually play a different point,
and then you take an estimated gradient step. Now the next function comes along, again I
choose another random point, this time the value is very low so I take a small step, now the
value is maybe bigger so I take a bigger step, and you can show that this is estimating a
gradient step except that a few things add some noise. So this estimator has some bias, it has
some variance, also I'm not playing the point that gradient descent is telling me to play, I'm
playing a point that's nearby; all these things accumulate to some additional noise and my
regret bound jumps up from square root T to T to the three quarters. So I have to pay for all
these estimations. Yeah.
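Here is a minimal sketch of that one-point estimator and the resulting update, following the description above; the d/delta scaling is the usual one for a d-dimensional ball, and the names and constants are illustrative assumptions rather than the exact choices in the Flaxman-Kalai-McMahan paper.

    import numpy as np

    def fkm_round(x, f, delta, eta, d):
        # One round of a one-point-estimate gradient step (a sketch).
        # x: the point "in my head"; f: bandit oracle we may query once;
        # delta: radius of the sampling sphere; eta: step size; d: dimension.
        u = np.random.randn(d)
        u /= np.linalg.norm(u)                  # uniformly random unit vector
        y = x + delta * u                       # the point actually played
        loss = f(y)                             # the only feedback received
        grad_estimate = (d / delta) * loss * u  # one-point gradient estimate
        x_next = x - eta * grad_estimate        # take the estimated gradient step
        return y, x_next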
>>: Is this game hugely different if you get five values here instead of one value?
>> Ofer Dekel: Yes. If you query the function at two points then you can get square root T. That's
exactly the question. So I mean, look at the radius of this ball. So this delta, it turns out that if
delta is very, very big then the variance goes down, if delta is very, very small the variance goes
up and the bias behaves the other way around. When you can have a two-point estimator then
somehow variance is taken care of and you're good.
So our bound has deteriorated to T to the three quarters, the lower bound is still square root T.
So now we already have a gap. Another interesting observation is I can now remove Lipschitz.
So I don't need the function to be Lipschitz, and this is simply due to the observation that when
you have a convex function with a bounded range it's effectively Lipschitz. Think about it: you
have a bounded range, you have a bounded domain, you're trying to build a function that's as
non-Lipschitz as possible; it can only shoot up very, very close to the boundary of the set; it
can't shoot up in the middle because if the slope starts being very, very big it has to keep being
very big, because it's convex, but I just told you that the range is bounded so you just can't
construct a function like that. So basically if you take your domain and you shrink it a little bit
you're already Lipschitz in that area; the non-Lipschitz part anyway is going to have very, very
high losses so these are not points you're going to play in any case.
>>: [inaudible] upper bounded based-
>> Ofer Dekel: So the range is [0,1], assuming that the losses are always between zero and one.
So using that trick you lose a little bit more; you get T to the five sixths. So it's learnable.
>>: [inaudible] that you can take on the [inaudible] boundedness of the function and [inaudible]
domain your uses?
>> Ofer Dekel: Yeah. So I'll show you the theorem in a minute. I'll show the whole thing in a
minute. So you know it's learnable, it's a sublinear regret, it's great, but it's very far from
our lower bound of square root T. So they did this, this was beautiful, we knew that it's
learnable, and we wanted to close the gap. So that's what this talk is about. After that a
sequence of papers were published that tried to do better and no one was able to do better in
the general case but in special cases people were able to make improvements sometimes with
the same algorithm, this estimated gradient algorithm, and sometimes with small tweaks to
that idea, but still the idea of estimating a gradient to do gradient descent. So in a paper with
Alekh Agarwal and Lin Xiao we showed that if the function is strongly convex you can get the T to
the three quarters down to T to the two thirds. Another paper showed that if it's a smooth
function again you can go down to two thirds. Recently we improved this to T to the seven
elevenths, so you can see we're kind of inching downwards. If the functions are linear and in fact
Lipschitz then you can get the square root T, the tight thing, and we recently generalized this to
general quadratic forms. So again, in the special case, we actually have the tight
characterization. Also very recently we showed that if a function is both strongly convex and
smooth you can get square root T, the idea being that people looked at the same n-
dimensional problem and just said if I restrict the adversary to choose functions from a more
specific family I can get slightly better bounds and sometimes even tight bounds.
So this was the game for a long time but progress on the main thing was very, very elusive. This
is what we're going to talk about today. So we're going to talk about the general case where
we are just assuming convexity and boundedness, no Lipschitz, no smoothness, no strong
convexity, none of these other assumptions, but we're going to talk about the problem in the
one dimensional case. So surprisingly even in the one-dimensional case we really knew nothing
better than this T to the five sixths regret upper bound, with a lower bound still being square
root T.
So our kind of first step into solving this very, very basic problem in all of online learning is just
to close this gap in the one-dimensional case. So the theorem that we're going to talk about is:
if each one of these loss functions is just a mapping from the interval [0,1] to the range [0,1] and is
convex, that's all I need, then the regret is going to be on the order of square root T up to some
log terms. Yeah.
>>: The question [inaudible] strategy but it’s not the issue of computation [inaudible]?
>> Ofer Dekel: Correct. So all of these are algorithms that you can compute and it’s easy to
compute them and we run them. Our proof is going to be nonconstructive. I'll get to that in a
minute.
>>: Okay. But it's not often [inaudible] particular issue [inaudible] two dimensions is zero.
>> Ofer Dekel: There will be a surprise at the end. Maybe. Let's see. So again we want to
lower this guy all the way down to square root T. We want to show a tight bound in one
dimension and this is how we do it. So first observation is the following: we can discretize. So
when we are in one dimension, also in arbitrary dimension, we can restrict ourselves instead of
playing the entire interval [0,1] to a grid. So we can find an epsilon-squared-spaced grid X1
through XK and restrict the player only to points in this set. This goes back to [indiscernible]'s
question: how much do I lose? So a simple lemma shows that the best fixed point in hindsight
within my set is not much worse than the best fixed point overall. This is the penalty I pay for
this discretization. So if I discretize finely enough then I'm fine playing in this discrete set, and now
my analysis becomes easier because I'm just playing some finite set of actions. This exactly has
to do with the fact that a convex bounded function is in fact kind of Lipschitz already.
>>: This is in one dimension, or does this also generalize to multiple dimensions?
>> Ofer Dekel: This is not restricted to one dimension.
>>: But then what's K here? But K might grow exponentially in the dimension-
>>: And here you need Lipschitz.
>> Ofer Dekel: Yeah. But that's not going to be, there's no Lipschitz here. So there's no
Lipschitz in this thing. This is exactly because the bounded convex function is effectively kind of
Lipschitz.
>>: And then epsilon has to shrink with T.
>> Ofer Dekel: We will see this in a minute. It has to depend on T. So you're right. K can be
exponential. That's not going to be the reason why ours is a one-dimensional proof. We have
some other technical reasons why our proof is only restricted to one dimension, so the fact that
your grid will grow exponentially is not going to be a problem for us. We'll see this in a
minute.
So with this observation we can already solve the problem using machinery that we already
have. So a K-armed bandit problem is the same problem where we just have K discrete actions,
it's not some convex function or some structured space, they’re just K actions, each one has
some loss. You could think of the functions that we talk about before just being arbitrary
bounded functions, not convex functions. So I have a grid, at each point the function has some
loss value, and we know how to solve these problems with regret that scales like square root TK
with a finite number of actions K. So if I discretize and then forget about convexity altogether,
just treat each point as an action, I can choose one of these actions and see its loss, now it’s just
a K-armed bandit problem: I pay this regret, I pay this for the discretization, and epsilon and K are
related through this; so if it's epsilon-squared spaced then K is one over epsilon squared, I optimize
over epsilon and it comes out to be T to the minus one quarter, which gives me regret that's T to
the three quarters. So that's already better than the 5 over 6 that we had before just by
discretization. Another little comment: we can also do a non-uniform discretization and get a
little bit better still. So again, as I said, if a function is going to be bounded and convex it may
be non-Lipschitz, but only very near the boundary; if we make our grid a little bit more dense
towards the boundary and more sparse towards the interior of the set then we can do even a
little bit better. So just forgetting the structure of the functions and just treating it as a discrete
problem this is how far we can get. So it’s better than 5 over 6 but it’s still not square root T.
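Putting the two pieces together as just described, and reading the discretization penalty as scaling like epsilon times T (which is consistent with the numbers quoted above), the arithmetic is:

    \[
    \underbrace{O\!\big(\sqrt{TK}\big)}_{K\text{-armed bandit}} \;+\; \underbrace{O(\epsilon T)}_{\text{discretization}}
    \;=\; O\!\Big(\tfrac{\sqrt{T}}{\epsilon} + \epsilon T\Big)
    \quad\text{for } K = 1/\epsilon^2,
    \qquad\text{optimized at } \epsilon = T^{-1/4}\text{, giving } O(T^{3/4}).
    \]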
Okay. So how do we get square root T? We have to work a little bit better, a little bit harder.
So this is where the non-constructiveness comes into play. So we're going to use the Minimax
Principle and we're going to say the Minimax Regret, the thing that we are after, is equal, due to
von Neumann's Minimax Principle, to this max-min regret, which is going to be
called the Maximum Bayesian Regret. So what's this? Here the adversary chooses a
distribution over the entire sequence of loss functions, so the loss functions are going to be
drawn from some prior distribution that he chooses adversarially; the player, knowing
this distribution, is going to play his policy, and now we look at the regret, which is just going to
be the mean regret, as defined before, over this distribution over losses.
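In symbols, the swap just described is (a standard statement of the minimax principle, with pi denoting the adversary's prior over loss sequences):

    \[
    \min_{\rho}\;\max_{f_{1:T}}\; \mathrm{Regret}(\rho, f_{1:T})
    \;=\;
    \max_{\pi}\;\min_{\rho}\; \mathbb{E}_{f_{1:T}\sim \pi}\!\big[\mathrm{Regret}(\rho, f_{1:T})\big].
    \]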
So this is going to be called the Maximum Bayesian Regret and now this is a different setting.
So before we talked about the adversarial bandit convex optimization setting; this is
the Bayesian bandit convex optimization setting. So again, the adversary chooses some prior
distribution, this is a distribution over the entire sequence of loss functions, not over one of the
loss functions; he reveals this distribution to the player but then he privately draws a concrete
instantiation of the losses and then the game is played as before. So it's important to note that
this sequence of loss functions, they are not independent, they're not identically distributed,
and this is important to note because if you know the literature then a very popular kind of cousin
of the adversarial problem is what's called the stochastic bandit problem. So stochastic in this
world is synonymous with independent. So when people talk about stochastic bandits they talk
about loss functions that are all drawn i.i.d.
>>: [inaudible]?
>> Ofer Dekel: This is the stuff that actually makes money for Microsoft. It pays all of our
salaries. So we've shown that the Minimax regret that we care about, we just used a
straightforward Minimax Principle to conclude that the thing that we are interested in is the
same as this Maximum Bayesian Regret. So now we can think about the Bayesian setting and
our strategy is going to be: okay, let's upper bound this Maximum Bayesian Regret. Let's think
now only about the Bayesian setting, even though what we started out caring about was the adversarial
setting and we'll use this to get a non-constructive bound on the Minimax Regret in the
adversarial setting.
So a little bit of notation. So now we’re in the Bayesian setting. So we are the player, we get
this prior, we know the distribution from which losses are going to be drawn. At each point in
time we have this HT, this is going to be the information that we have at the end of round T. So
formally it's just the sigma-field generated by all our actions and all the losses that we've seen,
so this is the history; given this history that we've seen, so we are at round T we can actually
compute the posterior distribution. So we can apply Bayes’ rule, we can rule out all the
functions that are not consistent with what we've seen, and we can have some posterior
distribution. So this is going to be an interesting way that now the past and the future are in
fact related. So the things I've seen in the past do tell me to narrow the possibilities for the
future. So this is a much more structured setting than what we had before. It's going to be
much easier to work with. We’re going to have this little shorthand. So we're going to take
conditional expectations, E sub T is just going to be expectations conditioned on this history, so
this is what we're going to work with, and then we are not going to worry about computation
because this is nonconstructive to begin with.
So let's try something. This is going to fail. But let's just think of something. I mean if I have a
posterior, so at this point I know the distribution, the posterior distribution from which the loss
functions are going to be sampled, I can take the average loss function for today's round, so I
know on average what the adversary is going to do today, maybe just play the minimum of that
function. So that's a bad strategy. That’s not going to work. So let's see why that doesn't work.
So this is just to show that even when I know these posteriors it’s still a hard problem.
So here's the example that shows why that fails. So imagine, [indiscernible], assume that the prior
is such that the adversary chooses between two functions, one is a parabola with a minimum at
0.3, the other is a parabola with a minimum at 0.7. So he chooses one of these two functions
and just chooses that function consistently for the entire game. So he draws once between the
two with equal probability and just sticks with that one function forever. So if I just do one
round of exploration just to see which of the two he's played then I'll know what the function is
going to be for all the rounds. I'll just play the minimum of that and my regret will be zero.
Playing the minimum of the mean function just means you'll play the minimum of the
average of the two, which is just the point in the middle where exactly the two values of the
functions are equal. So if I minimize the average of the two I get zero information. I'm
going to get this number and I get no information about whether we are in the red world or in
the blue world, and therefore in the next round again all I have is the mean being this, and this
goes on forever: I'll keep playing this point, I'll keep getting zero information about whether we
are in the red world or the blue world; if we are in the red world this point is definitely lower
than that and we're going to suffer this constant regret times T and that's going to be a linear
regret.
>>: [inaudible] after you've seen ones [inaudible]?
>> Ofer Dekel: So I played this. I only get the value of the function.
>>: [inaudible] some kind of gradient descent on this thing, like if you perturb this value by a
little bit you would learn, if not then you're-
>> Ofer Dekel: That's exactly the exploration that I must do. What I'm doing here is just pure
exploitation. I'm not exploring at all. I know that this is the world, so if I had just played this
point one time, and if I get this value I know we are in the blue world, if I get that value I
know we are in the red world. So if I sacrifice one round for exploration I can exploit from then
on. So doing this is what we're going to be calling pure exploitation and that's always bad. So
there's still this exploration, exploitation trade off even though we have a model of what the
guy is going to do to me today and tomorrow and so on.
>>: This is a zero probability that people have the information that one or [inaudible]. I mean
it’s sort of like zero probability and it seems like you are so unlucky that if they exactly cross at
that point; if you just add a little bit of random noise-
>> Ofer Dekel: In this example you're right.
>>: But in general there are other examples that can make it so that you really can't get the
information. You can't just make this noisier and then we get rid of that.
>>: Because usually it’s the summation of an infinite number of functions.
>> Ofer Dekel: But you'll see something similar that will work in a minute.
>>: But actually there is hope to analyze the strategy where you play the minimum and you
have a little bit of randomness.
>>: Oh, I see. So you don't know whether it will work.
>> Ofer Dekel: So we have to work a little bit harder. So let me define a little bit more
notation. So X star is going to be the best point in hindsight. This is the minimizer of this
random sequence of functions. So X star is the thing that I'm trying to discover. So remember
we discretized. It's one of these discrete points. So it's the minimum within our discrete set.
Now I'm going to define two very important quantities that we are going to work with: one is
the instantaneous regret. So this is the expected regret due to playing X on this round
conditioned on the past. So R sub T of X is: at time T, if I play X, how much regret do I expect to
pay? It's just the difference between the value of the function at the point that I play and the
value of the function at this best point in hindsight. So this is the thing that's added to the regret at
this point in time. The total regret, the total Maximum Bayesian Regret is just the expected
sum of these values. So in words R sub T of X is given what I know, again all expectations are
conditioned on the past, given what I know how much do I expect to pay for playing the point
X? So that's the first thing I want you to remember.
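In the notation just introduced, writing E_t for expectation conditioned on the history up to round t, the instantaneous regret is roughly:

    \[
    r_t(x) \;=\; \mathbb{E}_t\!\big[f_t(x) - f_t(x^\star)\big],
    \qquad\text{so the total Bayesian regret is } \mathbb{E}\!\left[\sum_{t=1}^{T} r_t(x_t)\right].
    \]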
The second thing is the instantaneous information. So we'll see what it is in a minute, but in
words it's going to be given what I know how much information do I expect to get about X star
by playing X? So for each point in my domain I want to be able to say how much do I expect to
pay for it and how much information do I expect to get by playing it. And it's all going to be a
question of balancing exploration, exploitation. Let's see what this definition is. So the
information at time T by playing X is just going to be the conditional variance of this random
variable, so this is a random variable that removes all the noise that's independent of X star and
just keeps the randomness that's due to X star. So perhaps maybe all the functions are also
polluted with some independent noise on each round, just average that out, just keep the
randomization in these functions which is due to the identity of X star, the best point in
hindsight. So now at some point X I can say look at this point, X is my guess for what X star is.
As X star varies, the value of this function is going to change. If this thing has a big variance
then knowing that value will tell me a lot about the identity of X star. If you think about the
previous example where those two parabolas met that was the point where X star could have
been here or here but the variance was zero. I played; I knew exactly the number that I would
get.
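A formulation of the instantaneous information consistent with this description (hedged; the paper's exact definition may differ in details) is the conditional variance, over the posterior on X star, of the mean loss at x given X star:

    \[
    I_t(x) \;=\; \mathrm{Var}_t\!\Big(\,\mathbb{E}_t\big[f_t(x)\,\big|\,x^\star\big]\Big),
    \]

where the inner expectation averages out all randomness independent of X star and the outer variance is over the remaining randomness in X star.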
>>: You need to condition on X star, you need to condition on the value given what X star is?
>> Ofer Dekel: This is conditioned on knowing who X star is out of this grid. So X star is one of the guys on
the grid.
>>: So the variance here is over the [inaudible] choice of X star and the expectation is over everything
else in some sense?
>> Ofer Dekel: This is a random variable that averages out everything except for the
randomness in X star and now the variance is over the different values of X star. So this is in
words it's a little bit complicated, but intuitively you'll see it's very, very simple. You're just
saying that if I have a high variance it means that knowing that value is going to give me a lot of
information. And now we have a lemma. This is a lemma adapted from a paper by Russo
and Van Roy, and it says that the sum of all the information you can collect throughout the
game is upper bounded. There's some finite amount of entropy about who X star is and I can't
experience more variance than that. I mean, if I play points with sufficiently high variance then I
know what X star is and there's no more variance. So the sum of the square roots of this
information term is upper bounded by some total amount; this is using some information-
theoretic arguments.
>>: You used them in single dimension so far?
>> Ofer Dekel: No. But look at the square root T that’s suspiciously hiding here. So we are
going to use the fact that this is going to be a square root T. So the total amount of information
I can collect throughout the game is bounded so this immediately gives me a very simple recipe
for getting the type of bound that I want.
>>: What's the definition of XT you're using now?
>> Ofer Dekel: Which one?
>>: What’s XT?
>> Ofer Dekel: For any policy that I play, for any adversarial distribution of F’s the total amount
of information, so how much entropy could there be about the identity of X star it’s this much.
>>: What is K here?
>> Ofer Dekel: K is the size of the grid, the number of points.
>>: So that's actually used this way.
>> Ofer Dekel: Yes.
>>: So X star depends only on the script F, not on my policy?
>> Ofer Dekel: X star is a random variable which is the minimizer. It doesn't depend on the
policy. The minimizer of the actual instantiation. So here's the very simple recipe that I get for
proving regret bounds. So if I can find an algorithm that guarantees an upper bound on this
quantity, this is going to be called the information ratio. So this is the instantaneous regret divided
by the square root of the instantaneous information. If this is bounded by a constant then my regret
is just going to be upper bounded by that constant times the total information that I get, which we
already said is upper bounded by square root T.
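Loosely, the recipe reads as follows (constants and lower-order terms omitted): if the algorithm keeps the information ratio bounded by some C on every round, then

    \[
    \frac{r_t(x_t)}{\sqrt{I_t(x_t)}} \;\le\; C \;\;\text{for all } t
    \quad\Longrightarrow\quad
    \mathbb{E}\!\left[\sum_{t=1}^{T} r_t(x_t)\right]
    \;\le\; C\,\mathbb{E}\!\left[\sum_{t=1}^{T}\sqrt{I_t(x_t)}\right]
    \;\lesssim\; C\,\sqrt{T\log K},
    \]

where the last step is the Russo-Van Roy style bound on the total information mentioned above.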
>>: So again, algorithm is determining X sub T.
>> Ofer Dekel: So the algorithm is controlling X sub T and what we've basically done here, what
initially Russo and Van Roy did and we extended to our case, is to break it down into a sufficient
condition which looks at only one round at a time. So if I can prove that on each iteration the
amount of regret that I pay is controlled by the amount of information that I get then I'm good
and I can get my square root T. So I don't know if I'm going to pay a big regret or a small regret,
but if I pay a big regret I'm guaranteed to also have collected a lot of information. If I've only
got a small amount of information I'm guaranteed to have suffered only a small regret. So if
somehow the two things are proportional to each other then I can immediately get my regret
bound. So is that clear? Does everyone see that?
Okay. Good. So here's strategy number two. This is something that does work. We saw that
attempt number one, playing the minimum of the expected loss function, doesn't work.
Here's something that does work. So this is something called Thompson Sampling. This is
something that we use in Microsoft products and it's a very simple and intuitive concept. Here's
what it is: so we have this posterior, we can compute the posterior, draw a concrete loss
function from that posterior, so we know that the real function was drawn from this posterior
but we draw independently our own version of the loss function from this posterior and play
the minimum of that thing. So pretend that that's a real loss function and play the minimum of
that. So again we draw some F prime T from our posterior and we play the minimum of that.
So that's Thompson Sampling.
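A minimal sketch of one round of Thompson Sampling in the discretized setting; the posterior object and its sample() method are hypothetical stand-ins for whatever posterior representation is actually maintained.

    import numpy as np

    def thompson_sampling_round(posterior, grid):
        # Draw one concrete loss function from the current posterior over loss
        # functions, then play the grid point minimizing the sampled function.
        f_sampled = posterior.sample()
        values = np.array([f_sampled(x) for x in grid])
        return grid[int(np.argmin(values))]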
This is what Russo and van Roy were mainly interested in. They cared about this in the IID
case, but this also holds in our case more generally; and what they said is that if these functions
are bounded, and not necessarily convex, so again in this K-armed setting where there's no
structure like convexity, and we choose this X sub T according to this very simple Thompson
Sampling rule, then this constant, which we saw here, is just going to be square root K.
So now if we go look at the previous slide, for that case, if our grid has K points and
we don't even use convexity, we already have a bound which looks like square root K times square
root of T log K, which is exactly the bound that we have, so it's an alternative proof of
the bound for the K-armed bandit problem. And this is what they were interested in. But in our
case that's not going to work precisely because of the comment you made before that in our
case because we are doing this discretization of this continuous problem and we pay for that
the discretization has to be fine enough, specifically the number of points in our grid is going to
have some dependence on T. So if we use that lemma that I showed you before then K is
actually going to be equal to T; if we do a non-uniform grid maybe it will shrink down to
something like square root T, but in any case it's going to be something which has a polynomial
dependence on T, so our bound will be the square root of this thing times square root T, and it
will be bigger than square root T. So this theorem is not powerful enough, but it didn't use
convexity of course so it's not going to be powerful enough.
So we're going to use convexity to get a stronger theorem. And that's our strategy. So in order
to prove our theta of square root T Bayesian regret bound we're going to define a slight variant,
we’re going to have to tweak Thompson Sampling a little bit and we're going to get this kind of
result. So we're going to show that the instantaneous regret is controlled by the instantaneous
information times some polylog of K, not square root of K. So convexity is going to turn this
from square root K to polylog K. That's exactly what we need, and that's why we are not going
to get hurt even if we have exponential grids and so on. There could also be a small lower-order
term there.
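Schematically, the per-round result for the modified Thompson Sampling strategy has the form (lower-order terms suppressed):

    \[
    r_t(x_t) \;\lesssim\; \mathrm{polylog}(K)\,\sqrt{I_t(x_t)},
    \]

which, plugged into the recipe above, gives the square root T Bayesian regret bound up to polylog factors.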
>>: That would be true only in dimension one.
>> Ofer Dekel: Yeah. So we're getting to the point where we’re using dimension one.
>>: I imagine you wouldn't get a better-
>> Ofer Dekel: So I think we know how to generalize this part to a higher dimension. The part
that we don't know how to generalize to higher dimension I will point out in a minute. But so
far everything either already works in high dimension or we think we know how to do it. So
here's a proof sketch. This is where it gets a little bit technical. So we're going to look at two
functions. So we are at time T; we are going to look at the mean function, so this is the average
of the posterior, so we know what the posterior is, we're just going to ask what is the mean
function that the adversary is going to play, and then we're going to look at the mean function
conditioned on knowing what the optimal point in hindsight is. So this is going to be F bar and
F bar of X. So assuming the optimum is at X, what's the mean function?
We're going to look at these two guys. And it turns out that their instantaneous regret is
almost equal to the expected value of the difference between the mean function and the mean
function where you are playing the point that you know it's going to be optimal in the end so
this is going to be equal to our instantaneous regret. The instantaneous information is going to
be almost equal to the expected value of the L2 norm between these two functions. So
somehow this is an expectation over a local pointwise difference between these two functions
and this is an expectation over a global distance, an L2 distance between them. So the variance
is going to be proportional to the L2 distance and the regret is proportional to the
pointwise distance; this is just some technical stuff that we can prove. But now let's compare
this random variable with this random variable for each X. So now the question is, again, I want
to show that this is upper bounded by that-
>>: What is the distribution of the point?
>> Ofer Dekel: Almost the posterior. So I'm hiding a little bit of yuckiness, but assume it’s just
the posterior. So there's just one distribution. The distribution of X star and the distribution of
the point that I play is the same because it’s Thompson Sampling. So that's exactly what
Thompson Sampling does. So again, I want to show that this is upper bounded by that times a
small thing, and I'm going to compare now each one of the terms inside. So ideally it would be
great if I could show you that this term, the thing whose expectation is
the regret, is upper bounded by just the L2 distance times our polylog K, but that's not exactly correct;
here's the intuition that we used. So we haven't used convexity anywhere before and this is where
convexity comes into play.
So again, the adversary has two goals: he has this F bar that he controls; he has this F bar
assuming that I know what the optimum is. So we are comparing these two functions.
Making the difference between them at the optimum big means the regret will be large
because our regret is kind of proportional to the expectation of this function, so he wants to
find a point X star such that if you knew that this is the best point it will actually be a very, very
low value. But not knowing that, just taking the posterior, it's somehow hidden from you. So
the difference between these two numbers is going to be proportional to our regret. The L2
distance is going to be proportional to the variance, the information that we get. So the
adversary has two goals: he wants the regret to be large, he wants the information to be small.
So he's going to take the reference function which is just expected loss at this round and he's
going to want to try to hide a really, really good value from us by pulling the function down to
make a point that's much, much better than I can see but keep the distance between the
functions very, very small. You can see that it's very easy to do this if there are no
restrictions on the functions: if I want to take two functions and make their pointwise
distance big but their L2 distance small, I can do it very, very easily. Not when they're
required to be convex. So this is where convexity kicks in.
So this is our local to global lemma which says that a local change in the function, if you take a
convex function and pull it down at some point to a point that's below its optimal, so you hide
something really, really good you're going to have to change it globally in the sense that the L2
distance between the original function and the new function is going to be large. So any
local change to the function which is significant in our setting is going to give us information
because it’s going to make the energy between two functions very, very big. This is where most
of the work in the paper is spent. I'm not going to talk about this anymore because I'm out of
time, but this is the intuition. Really, this is the property of convexity that makes that square root K
into a polylog.
>>: Does this mean you get some random inequality between the L [inaudible] norm and L2
norm or something like that?
>> Ofer Dekel: No. What we get is we show that the ratio between this guy and that guy is
always going to be bounded by some term which has to do with the energy of the function F
independent of the red number and this term when we take, it could actually be big for some
cases, so actually there could be points, for example in this case if you pull down very close to
the optimum you can actually get a very unfavorable ratio. You can move the function by a
little bit and change the energy between the functions also by a little bit, but when you take the
expectation, the term by which we lower bound this guy is going to look like a harmonic sum
and it's going to turn into this log. So there's some math magic that happens there. On average
if you change the function locally you're going to have to change it globally.
>>: In just one sentence [inaudible] distance is not with respect to [inaudible] but it’s with
respect to the posterior distribution.
>> Ofer Dekel: I'm cheating with the expectations. There are more cheats. It's the distribution only
supported on the interval between this and that. I mean there are some details. This is kind of a
hairy, messy little thing, so it's beautiful math and also kind of disgusting at the same time. So this is
the idea. This is the property of convexity that makes this work.
>>: This is the property that you only know how to prove in one dimension?
>> Ofer Dekel: This is the real showstopper for a high dimensional proof to show that the local
change induces, must induce a global change. Our proof is very, very manual. It’s very much a
picture. You say this line has to be lower bounded by a different line.
>>: Do you know if it's true or not in say two dimensions?
>> Ofer Dekel: Yes. In a minute, in the next slide I'll tell you. So anyway, that's the end. That's
all I have time for. Let me conclude.
So what we have is a non-constructive upper bound on the Minimax Regret of the
adversarial bandit convex optimization problem in one dimension. We used the Minimax
Principle to reduce the adversarial setting to the Bayesian setting. This has been done before
but as far as we know not for bandit problems, so when you have what's called full information
problems people have used this trick but not for bandit learning. And then we have this local
change induces global change property of convex functions. This is somehow independent of
our work. This is something that I would have expected to find in books on convex
functions, that we exploit, and the combination of all these things gives us our nonconstructive
upper bound.
This is in answer to your question. So this is breaking news from just a few weeks ago. So
[indiscernible] were able to generalize this to arbitrary dimension albeit with an exponential
dependence on the dimension. So before I didn't explicitly tell you how all these regrets that we
had before depended on dimension but they were all a small polynomial, so something like N
squared, N cubed. Here, if you're willing to pay exponentially in the dimension, then the new
result which is built on the same principles but uses a different type of algorithm for the
Bayesian case is still able to get square root T regret but with an exponential dependence and
that's a topic for perhaps a future talk by-
>>: [inaudible] Lipschitz?
>> Ofer Dekel: Lipschitz is not important. We get away from Lipschitz just using that
discretization trick, and now we are in a finite problem. It's no longer even-
>>: [inaudible]. Instead of [inaudible] or something we use probability [inaudible]. I'm sure
that there exist two points that will give you enough to-
>> Ofer Dekel: So our algorithm, if you care about the Bayesian setting as a first-order thing,
then it's constructive, though perhaps very hard to compute. For some families of functions you could maybe
actually compute this strategy. So it’s a Thompson Sampling with some small modifications. So
I told you what it is. You can solve it. We just don't know how to map that back into an answer
for the adversarial case.
>>: [inaudible] optimization. It's crazy difficult. Is there any kind of dimension reduction in this
sample?
>> Ofer Dekel: Dimension reduction. What is that?
>>: Is there a way to somehow [inaudible]?
>>: I don't think so.
>> Ofer Dekel: I don't know. We don't know.
>>: [inaudible] so far.
>>: It would also probably work for convex optimization, right?
>> Ofer Dekel: So if anyone knows something like this in high dimension, I think we’re done. I
mean we have almost all the other parts, but just this idea of how much the function has to change
globally when you change it locally, for any norm. So the norm is arbitrary. It's governed, as a
subset, by the posterior, so if you could show that for any norm this thing holds, that a local change is
going to make a global change, that you cannot hide a little good point here without making the
functions very, very far apart, we'll be done with this big problem; we'd still have the problem of
finding an actual algorithm that achieves this, but we will have proven that one exists.
>>: So are you going to prove something like that?
>>: Essentially we do something like that but we are able to do it with respect to [inaudible]
instead of with respect to the posterior and then we show that there is a small discrepancy
between [inaudible] and the posterior solution. So it's in two stages.
>>: If you had, instead of a convex function, if you had a multivariate polynomial of bounded
height and degree, it's known, for instance, there's something called Markov's
Inequality which says more or less things like local change: if there is a drastic
variation in a small neighborhood, then there's a drastic variation overall.
>> Ofer Dekel: What does it have to be? A polynomial?
>>: I think it's called Markov's Inequality. It's not the [inaudible]. It's the sum
[inaudible]. But for polynomials, depending on height and degree; but for multivariable
polynomials it's not.
>> Ofer Dekel: But they have to be Lipschitz as well.
>>: Well, they have to have bounded coefficients, heights, and degrees in all.
>>: I suspect something like this isn’t true. I mean you can get [inaudible] to this problem.
[inaudible] computation and really constructing very complicated regions for what types of
aspects of the kind of [inaudible]. It sounds like a few points [inaudible] you really construct
very complicated regions to chop off-
>> Ofer Dekel: And you're saying this implies what?
>>: I suspect that the analog you want in [inaudible] is going to be [inaudible]. It might be
that your algorithm really has to be exponential. That's my question. Do you think the algorithm is
actually exponential [inaudible]?
>>: It is exponential, but I believe that the one where you play the minimum and you just
randomize a little bit around the minimum, that should give you polynomial in that.
>>: But even though this is true, in the stochastic setting this would be wonderful.
>> Ofer Dekel: We are saying a lower bound.
>>: Even [inaudible] stochastic setting.
>>: Okay. I think we're out of time. So we'll have another break where further questions can
be discussed in private, but we're going to continue to the next talk in four minutes.