
>> Yuval Peres: All right. Welcome. So we should pay attention. This is the only talk of the day
from someone who does not report to Jennifer. Ofer Dekel will tell us about bandit convex
optimization.
>> Ofer Dekel: Thanks. Yeah. So this is joint work with Sebastien Bubeck and with Yuval and
with Tomer Koren, who is a student at the Technion. It's going to be about learning
theory. I'm going to talk about one of the most important still open problems in learning
theory, really one of the kind of embarrassing gaping holes in our understanding of online
learning.
So here's the problem. Let's just jump right into it. The problem is adversarial bandit convex
optimization. We'll parse what this means. So this is going to be a T round repeated game
between a randomized player and an oblivious adversary. So the player has access to random
bits and the adversary does not adapt from round to round so he's oblivious to the player's
actions. The player has an action set and this is going to be some convex set C which is known
in advance. The number of rounds in the game is also going to be known in advance and here's
how the game proceeds: so the adversary privately chooses some sequence of functions, F1
through FT, and each one of them is convex; each one maps each of the actions that the
player can play to the interval [0,1], so they're also bounded and convex, but otherwise there's no
relation between F1, F2, F3. They can change arbitrarily from round to round and he chooses
them privately so the player doesn’t know these functions at all and then the player starts to
iterate. So the adversary is oblivious so he can make all his decisions before the game begins.
Then the player plays the game, so T rounds of the game. On each round he chooses a point in
the convex set and plays that point. He can do this using some random bits. He incurs the loss
which is the value of the function. So if he's on round T and he chooses the point XT then he
pays a loss which is FT evaluated at XT, and this is the number that he sees, and he sees only this
number, and then has to go on to the next round.
So he’s collecting this loss as he’s going on for T-rounds. Let's see a little picture that shows
this. So again this game starts out by the adversary choosing the entire sequence of functions
so here the action set, the domain of the functions is just the two-dimensional square and he
chooses arbitrary functions, they have to be convex but that’s it and they have to of course
have a bounded range as well, but you can see they have no specific form other than being
convex and then the player starts to iterate. So on round one he chooses a point, again he can
do this using some randomization, and he plays this point and he incurs the loss. The loss is just
the value of the first function at that point but he doesn't get to see the function, he just gets to
see this one number 0.3. And then he goes on to round number two and he chooses another
point and incurs another loss and so on and so forth. He never gets to see any of these
functions but he collects this loss. Is that clear?
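To make the protocol concrete, here is a minimal sketch of the game loop just described; the adversary_losses and player objects are hypothetical stand-ins, not an implementation from the talk.

    # A minimal sketch of the T-round bandit protocol described above.
    # `adversary_losses` and `player` are hypothetical stand-ins: the adversary
    # fixes all T convex loss functions before play, and the player only ever
    # observes the scalar loss of the point it actually played.
    def play_game(adversary_losses, player, T):
        total_loss = 0.0
        for t in range(T):
            x_t = player.choose_point()        # may use private random bits
            loss = adversary_losses[t](x_t)    # a number in [0, 1]
            total_loss += loss
            player.observe(loss)               # bandit feedback: only this number
        return total_loss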
>>: So just to make sure, so this is a convex function, so the minima are actually not going to
be on the boundary of the body, but somewhere inside the body?
>> Ofer Dekel: They could be. They could all be linear.
>>: But generally speaking-
>> Ofer Dekel: The adversary can do whatever he wants, they can be inside, they can be-
>>: Not necessarily on the boundary, because it's a loss [inaudible].
>> Ofer Dekel: Yes. Absolutely. So let's make a few definitions. The expected cumulative loss
of the player is just the expectation of the sum of the losses that he incurs as he plays the
game. This is his loss; this is the thing that he wants to be small, but since the loss
functions are arbitrary, just wanting this to be small is not enough; we have to compare to
some benchmark. It's meaningless to just look at this loss because the adversary could just
choose all the loss functions to be the constant one, so you'd always incur a big loss. That's
not the point here. We have to compare this expected cumulative loss to some benchmark.
The benchmark, what is it? It's just the value of these same functions evaluated at the best fixed
point in hindsight. So if you took the offline problem, if you knew all these F's, you'd just solve
this convex minimization problem, just find the point that minimizes the average function; that's
what I want to compare to. So the difference between what the algorithm accumulates in
expectation and the loss of the best point in hindsight: this is called the regret. So this is how
the player penalizes himself.
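In symbols, writing x_t for the (random) point played on round t, the regret just described can be written as follows (a standard formulation consistent with the definitions above):

    \[
    \mathrm{Regret}_T \;=\; \mathbb{E}\!\left[\sum_{t=1}^{T} f_t(x_t)\right] \;-\; \min_{x \in \mathcal{C}} \sum_{t=1}^{T} f_t(x).
    \]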
Just to define some notation: rho will denote the policy that the player plays, so the player plays
some policy rho, this is the loss functions of the adversary, and this is the regret with respect to
those two things. So this is how a player evaluates himself, and now we can measure the
difficulty of the game with this Minimax Regret. It's simply the regret when both the player and
the adversary play optimally. So it's the minimum over all possible policies of the player, maximum
over all possible loss sequences, of the regret paid by the player, and now just note that
the adversary knows the policy when he's choosing these loss functions. This is important
because later on we'll see that the roles will be reversed. So he knows what the policy is; he
doesn't know the player's random bits.
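Written out, with rho denoting a player policy, the minimax regret just described is (again a standard formulation):

    \[
    \mathcal{R}^*(T) \;=\; \min_{\rho}\;\max_{f_1,\dots,f_T}\; \mathrm{Regret}(\rho, f_1,\dots,f_T).
    \]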
I would say this game is learnable if this Minimax Regret is sublinear in T. So if, in the worst
case, when the adversary is doing his worst, the rate at which you
accumulate regret is still sublinear, that means that on each round you're accumulating something that
has to tend to zero, so you're getting better and better as the game gets longer. So you're
learning. So learnable just means this thing is sublinear.
So how do we know if this game is learnable or not? So let's take a step back. Let's make some
assumptions. So first let's assume that the functions are also Lipschitz. So this is a technical
assumption; we’ll get rid of this in a minute. And now let's pretend for a minute that instead of
getting the value of the function I actually get something much more informative; I get the
value of the gradient of the function at that point. So just pretend that for a minute. And now
in 2003 Zinkevich showed that just simple gradient descent will already guarantee a sublinear
regret of square root T.
So this is already kind of an amazing thing. I'm taking the gradient step based on
everything I've seen in the past, or the current function, and tomorrow's function has absolutely
nothing to do with it, yet still I can guarantee this learning if I just follow the gradient path
the idea here, the idea of the proof is just to show that I don't know what the adversary is going
to do, he can do whatever he wants, but if indeed the adversary chooses functions whose
minimum is very, very different, so he somehow mixes it up, I take a gradient step in this
direction and he puts the minimum down there and then I take a gradient step in that direction
and he puts the minimum over there then in fact he'll be hurting all the points uniformly so this
best point in hindsight will also get worse. What the adversary would want to do in fact is not
that but he would want to hide a really good point, hope that I will never find it, but make it
consistently good across many of these rounds, and if I do gradient descent I will actually find
that point. So that's the idea of why I can learn even though the past and the future have
nothing to do with each other. So this is just something we have to get used to. There's a
matching lower bound. So square root T is as good as it gets.
Now let's get back to the problem that we do care about, not the one where we get gradients
but the one where we can only get the evaluations of the function.
>>: When you say matching lower bounds for this method or for-
>> Ofer Dekel: For any algorithm the worst-case adversary for that algorithm will in fact inflict
this much damage.
>>: Even for an algorithm that sees the gradient?
>> Ofer Dekel: Yes. It’s for any algorithm that can see the entire function so you can even
assume that he received the entire function. So here's just a depiction of this. So again I get
the gradient, I take the step, for the next loss I play this point and I take a step, for next loss I
play that point and I take a step and so on and so forth. This is gradient descent.
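As a concrete illustration of Zinkevich's scheme, here is a minimal sketch of projected online gradient descent; the step size 1/sqrt(t) is one standard choice, not necessarily the exact one in his paper, and the box projection stands in for projection onto the convex set C.

    import numpy as np

    def project(x, lo=0.0, hi=1.0):
        # Euclidean projection onto a box; stands in for projection onto C.
        return np.clip(x, lo, hi)

    def online_gradient_descent(gradients, x0, T):
        # Zinkevich-style online gradient descent with full gradient feedback.
        # gradients[t](x) is assumed to return the gradient of f_t at x.
        x = np.array(x0, dtype=float)
        played = []
        for t in range(1, T + 1):
            played.append(x.copy())
            eta = 1.0 / np.sqrt(t)                   # a standard decaying step size
            x = project(x - eta * gradients[t - 1](x))
        return played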
The first real progress on the bandit version of the problem, so bandit means that I only get the
value of the function, was made in 2005 by Flaxman, Kalai, and Brendan McMahan,
and here's the idea: I can estimate this gradient using only the evaluation of the actual function
at only one point. So I can have a one-point estimate of this gradient. I can run gradient
descent with this estimate. So here's the estimate: I estimate the gradient at the point that I
want to play by the value of the function at a point that's nearby. So let me show you what
these things are. So I start with this point. This is my state, the point that I have in my head,
but rather than playing this point what I do is I choose a uniform point on a sphere around the
point that I want to play. So this is XT, U is going to be some uniformly chosen unit vector, delta
is going to be some scale, the radius of this circle, and I choose a uniform point in the circle and
then I'm going to take a step in the direction opposite to this point, the size of the step is going
to be proportional to the value of the function at the point that I played. So this kind of
magically turns out to be an estimator of the gradient.
So this is how it goes: you have some point in your head, but you actually play a different point,
and then you take an estimated gradient step. Now the next function comes along, again I
choose another random point, this time the value is very low so I take a small step, now the
value is maybe bigger so I take a bigger step, and you can show that this is estimating a
gradient step except that a few things add some noise. So this estimator has some bias, it has
some variance, also I'm not playing the point that gradient descent is telling me to play, I'm
playing a point that's nearby; all these things accumulate to some additional noise and my
regret bound jumps up from square root T to T to the three quarters. So I have to pay for all
these estimations. Yeah.
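Here is a minimal sketch of that one-point estimator and the resulting update, following the description above; the d/delta scaling is the usual one for a d-dimensional ball, and the names and constants are illustrative assumptions rather than the exact choices in the Flaxman-Kalai-McMahan paper.

    import numpy as np

    def fkm_round(x, f, delta, eta, d):
        # One round of a one-point-estimate gradient step (a sketch).
        # x: the point "in my head"; f: bandit oracle we may query once;
        # delta: radius of the sampling sphere; eta: step size; d: dimension.
        u = np.random.randn(d)
        u /= np.linalg.norm(u)                  # uniformly random unit vector
        y = x + delta * u                       # the point actually played
        loss = f(y)                             # the only feedback received
        grad_estimate = (d / delta) * loss * u  # one-point gradient estimate
        x_next = x - eta * grad_estimate        # take the estimated gradient step
        return y, x_next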
>>: Is this game hugely different if you get five values here instead of one value?
>> Ofer Dekel: Yes. If you query the function at two points then you can get square root T. That's
exactly the question. So I mean, look at the radius of this ball. So this delta, it turns out that if
delta is very, very big then the variance goes down, if delta is very, very small the variance goes
up and the bias behaves the other way around. When you can have a two-point estimator then
somehow variance is taken care of and you're good.
So our bound has deteriorated to T to the three quarters, the lower bound is still square root T.
So now we already have a gap. Another interesting observation is I can now remove Lipschitz.
So I don't need the function to be Lipschitz, and this is simply due to the observation that when
you have a convex function with a bounded range it's effectively Lipschitz. Think about it: you
have a bounded range, you have a bounded domain, you're trying to build a function that's as
non-Lipschitz as possible; it can only shoot up very, very close to the boundary of the set; it
can't shoot up in the middle because if the slope starts being very, very big it has to keep being
very big, because it's convex, but I just told you that the range is bounded so you just can't
construct a function like that. So basically if you take your domain and you shrink it a little bit
you're already Lipschitz in that area; the non-Lipschitz part anyway is going to have very, very
high losses so these are not points you're going to play in any case.
>>: [inaudible] upper bounded based-
>> Ofer Dekel: So the range is [0,1], assuming that the losses are always between zero and one.
So using that trick you lose a little bit more; you get T to the five sixths. So it's learnable.
>>: [inaudible] that you can take on the [inaudible] boundedness of the function and [inaudible]
domain your uses?
>> Ofer Dekel: Yeah. So I'll show you the theorem in a minute. I'll show the whole thing in a
minute. So you know it's learnable, it's a sublinear regret, it's great, but it's very far from
our lower bound of square root T. So they did this, this was beautiful, we knew that it's
learnable, and we wanted to close the gap. So that's what this talk is about. After that a
sequence of papers were published that tried to do better and no one was able to do better in
the general case but in special cases people were able to make improvements sometimes with
the same algorithm, this estimated gradient algorithm, and sometimes with small tweaks to
that idea, but still the idea of estimating a gradient to do gradient descent. So in a paper with
Alekh Agarwal and Lin Xiao we showed that if the function is strongly convex you can get the T to
the three quarters down to T to the two thirds. Another paper showed that if it's a smooth
function again you can go down to two thirds. Recently we improved this to T to the seven
elevenths, so you can see we're kind of inching downwards. If the functions are linear and in fact
Lipschitz then you can get the square root T, the tight thing, and we recently generalized this to
general quadratic forms. So again, in the special case, we actually have the tight
characterization. Also very recently we showed that if a function is both strongly convex and
smooth you can get square root T, the idea being that people looked at the same n-
dimensional problem and just said if I restrict the adversary to choose functions from a more
specific family I can get slightly better bounds and sometimes even tight bounds.
So this was the game for a long time but progress on the main thing was very, very elusive. This
is what we're going to talk about today. So we're going to talk about the general case where
we are just assuming convexity and boundedness, no Lipschitz, no smoothness, no strong
convexity, none of these other assumptions, but we're going to talk about the problem in the
one dimensional case. So surprisingly even in the one-dimensional case we really knew nothing
better than this T to the five sixths regret upper bound, with a lower bound still being square
root T.
So our kind of first step into solving this very, very basic problem in all of online learning is just
to close this gap in the one-dimensional case. So the theorem that we're going to talk about is:
if each one of these loss functions is just a mapping from the interval [0,1] to the range [0,1] and is
convex, that's all I need, then the regret is going to be on the order of square root T up to some
log terms. Yeah.
>>: The question [inaudible] strategy but it’s not the issue of computation [inaudible]?
>> Ofer Dekel: Correct. So all of these are algorithms that you can compute and it’s easy to
compute them and we run them. Our proof is going to be nonconstructive. I'll get to that in a
minute.
>>: Okay. But it's not often [inaudible] particular issue [inaudible] two dimensions is zero.
>> Ofer Dekel: There will be a surprise at the end. Maybe. Let's see. So again we want to
lower this guy all the way down to square root T. We want to show a tight bound in one
dimension and this is how we do it. So first observation is the following: we can discretize. So
when we are in one dimension, also in arbitrary dimension, we can restrict ourselves instead of
playing the entire interval [0,1] to a grid. So we can find an epsilon-squared-spaced grid X1
through XK and restrict the player only to points in this set. This goes back to [indiscernible]'s
question: how much do I lose? So a simple lemma shows that the best fixed point in hindsight
within my set is not much worse than the best fixed point overall. This is the penalty I pay for
this discretization. So if I discretize finely enough then I'm fine playing in this discrete set, and now
my analysis becomes easier because I'm just playing some finite set of actions. This exactly has
to do with the fact that a convex bounded function is in fact kind of Lipschitz already.
>>: This is in one dimension, or does this also generalize to multiple dimensions?
>> Ofer Dekel: This is not restricted to one dimension.
>>: But then what's K here? But K might grow exponentially in the dimension-
>>: And here you need Lipschitz.
>> Ofer Dekel: Yeah. But that's not going to be, there's no Lipschitz here. So there's no
Lipschitz in this thing. This is exactly because the bounded convex function is effectively kind of
Lipschitz.
>>: And then epsilon has to shrink with T.
>> Ofer Dekel: We will see this in a minute. It has to depend on T. So you're right. K can be
exponential. That's not going to be the reason why ours is a one-dimensional proof. We have
some other technical reasons why our proof is only restricted to one dimension, so the fact that
your grid will grow exponentially is not going to be a problem for us. We'll see this in a
minute.
So with this observation we can already solve the problem using machinery that we already
have. So a K-armed bandit problem is the same problem where we just have K discrete actions,
it's not some convex function or some structured space, they’re just K actions, each one has
some loss. You could think of the functions that we talk about before just being arbitrary
bounded functions, not convex functions. So I have a grid, at each point the function has some
loss value, and we know how to solve these problems with regret that scales like square root TK
with a finite number of actions K. So if I discretize and then forget about convexity altogether,
just treat each point as an action, I can choose one of these actions and see its loss, now it’s just
a K-armed bandit problem: I pay this regret, I pay this for the discretization, and epsilon and K are
related through this; so if it's epsilon-squared spaced then K is one over epsilon squared, I optimize
over epsilon and it comes out to be T to the minus one quarter, which gives me regret that's T to
the three quarters. So that's already better than the 5 over 6 that we had before just by
discretization. Another little comment: we can also do a non-uniform discretization and get a
little bit better still. So again, as I said, if a function is going to be bounded and convex it may
be non-Lipschitz, but only very near the boundary; if we make our grid a little bit more dense
towards the boundary and more sparse towards the interior of the set then we can do even a
little bit better. So just forgetting the structure of the functions and just treating it as a discrete
problem this is how far we can get. So it’s better than 5 over 6 but it’s still not square root T.
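Putting the two pieces together as just described, and reading the discretization penalty as scaling like epsilon times T (which is consistent with the numbers quoted above), the arithmetic is:

    \[
    \underbrace{O\!\big(\sqrt{TK}\big)}_{K\text{-armed bandit}} \;+\; \underbrace{O(\epsilon T)}_{\text{discretization}}
    \;=\; O\!\Big(\tfrac{\sqrt{T}}{\epsilon} + \epsilon T\Big)
    \quad\text{for } K = 1/\epsilon^2,
    \qquad\text{optimized at } \epsilon = T^{-1/4}\text{, giving } O(T^{3/4}).
    \]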
Okay. So how do we get square root T? We have to work a little bit better, a little bit harder.
So this is where the non-constructiveness comes into play. So we're going to use the Minimax
Principle and we're going to say the Minimax Regret, the thing that we are after, is equal, due to
von Neumann's Minimax Principle, to this max-min regret, which is going to be
called the Maximum Bayesian Regret. So what's this? Here the adversary chooses a
distribution over the entire sequence of loss functions, so the loss functions are going to be
drawn from some prior distribution that he chooses adversarially; the player, knowing
this distribution, is going to play his policy, and now we look at the regret, which is just going to
be the mean regret, as defined before, over this distribution over losses.
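In symbols, the swap just described is (a standard statement of the minimax principle, with pi denoting the adversary's prior over loss sequences):

    \[
    \min_{\rho}\;\max_{f_{1:T}}\; \mathrm{Regret}(\rho, f_{1:T})
    \;=\;
    \max_{\pi}\;\min_{\rho}\; \mathbb{E}_{f_{1:T}\sim \pi}\!\big[\mathrm{Regret}(\rho, f_{1:T})\big].
    \]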
So this is going to be called the Maximum Bayesian Regret and now this is a different setting.
So before we talked about the adversarial bandit convex optimization setting; this is
the Bayesian bandit convex optimization setting. So again, the adversary chooses some prior
distribution, this is a distribution over the entire sequence of loss functions, not over one of the
loss functions; he reveals this distribution to the player but then he privately draws a concrete
instantiation of the losses and then the game is played as before. So it's important to note that
this sequence of loss functions, they are not independent, they're not identically distributed,
and this is important to note because if you know the literature then a very popular kind of cousin
of the adversarial problem is what's called the stochastic bandit problem. So stochastic in this
world is synonymous with independent. So when people talk about stochastic bandits they talk
about loss functions that are all drawn i.i.d.
>>: [inaudible]?
>> Ofer Dekel: This is the stuff that actually makes money for Microsoft. It pays all of our
salaries. So we've shown that the Minimax regret that we care about, we just used a
straightforward Minimax Principle to conclude that the thing that we are interested in is the
same as this Maximum Bayesian Regret. So now we can think about the Bayesian setting and
our strategy is going to be: okay, let's upper bound this Maximum Bayesian Regret. Let's think
now only about the Bayesian setting, even though what we started out caring about was the adversarial
setting and we'll use this to get a non-constructive bound on the Minimax Regret in the
adversarial setting.
So a little bit of notation. So now we’re in the Bayesian setting. So we are the player, we get
this prior, we know the distribution from which losses are going to be drawn. At each point in
time we have this HT, this is going to be the information that we have at the end of round T. So
formally it's just the sigma-field generated by all our actions and all the losses that we've seen,
so this is the history; given this history that we've seen, so we are at round T we can actually
compute the posterior distribution. So we can apply Bayes’ rule, we can rule out all the
functions that are not consistent with what we've seen, and we can have some posterior
distribution. So this is going to be an interesting way that now the past and the future are in
fact related. So the things I've seen in the past do tell me to narrow the possibilities for the
future. So this is a much more structured setting than what we had before. It's going to be
much easier to work with. We’re going to have this little shorthand. So we're going to take
conditional expectations, E sub T is just going to be expectations conditioned on this history, so
this is what we're going to work with, and then we are not going to worry about computation
because this is nonconstructive to begin with.
So let's try something. This is going to fail. But let's just think of something. I mean if I have a
posterior, so at this point I know the distribution, the posterior distribution from which the loss
functions are going to be sampled, I can take the average loss function for today's round, so I
know on average what the adversary is going to do today, maybe just play the minimum of that
function. So that's a bad strategy. That’s not going to work. So let's see why that doesn't work.
So this is just to show that even when I know these posteriors it’s still a hard problem.
So here's the example that shows why that fails. So imagine, [indiscernible], assume that the prior
is such that the adversary chooses between two functions, one is a parabola with a minimum at
0.3, the other is a parabola with a minimum at 0.7. So he chooses one of these two functions
and just chooses that function consistently for the entire game. So he draws once between the
two with equal probability and just sticks with that one function forever. So if I just do one
round of exploration just to see which of the two he's played then I'll know what the function is
going to be for all the rounds. I'll just play the minimum of that and my regret will be zero.
Playing the minimum of the mean function just means you'll play the minimum of the
average of the two, which is just the point in the middle where exactly the two values of the
functions are equal. So if I minimize the average of the two I get zero information. I'm
going to get this number and I get no information about whether we are in the red world or in
the blue world, and therefore in the next round again all I have is the mean being this, and this
goes on forever: I'll keep playing this point, I'll keep getting zero information about whether we
are in the red world or the blue world; if we are in the red world this point is definitely lower
than that and we're going to suffer this constant regret times T and that's going to be a linear
regret.
>>: [inaudible] after you've seen ones [inaudible]?
>> Ofer Dekel: So I played this. I only get the value of the function.
>>: [inaudible] some kind of gradient descent on this thing, like if you perturb this value by a
little bit you would learn, if not then you're-
>> Ofer Dekel: That's exactly the exploration that I must do. What I'm doing here is just pure
exploitation. I'm not exploring at all. I know that this is the world, so if I had just played this
point one time, and if I get this value I know we are in the blue world, if I get that value I
know we are in the red world. So if I sacrifice one round for exploration I can exploit from then
on. So doing this is what we're going to be calling pure exploitation and that's always bad. So
there's still this exploration, exploitation trade off even though we have a model of what the
guy is going to do to me today and tomorrow and so on.
>>: This is a zero probability that people have the information that one or [inaudible]. I mean
it’s sort of like zero probability and it seems like you are so unlucky that if they exactly cross at
that point; if you just add a little bit of random noise-
>> Ofer Dekel: In this example you're right.
>>: But in general there are other examples that can make it so that you really can't get the
information. You can't just make this noisier and then we get rid of that.
>>: Because usually it’s the summation of an infinite number of functions.
>> Ofer Dekel: But you'll see something similar that will work in a minute.
>>: But actually there is hope to analyze the strategy where you play the minimum and you
have a little bit of randomness.
>>: Oh, I see. So you don't know whether it will work.
>> Ofer Dekel: So we have to work a little bit harder. So let me define a little bit more
notation. So X star is going to be the best point in hindsight. This is the minimizer of this
random sequence of functions. So X star is the thing that I'm trying to discover. So remember
we discretized. It's one of these discrete points. So it's the minimum within our discrete set.
Now I'm going to define two very important quantities that we are going to work with: one is
the instantaneous regret. So this is the expected regret due to playing X on this round
conditioned on the past. So R sub T of X is: at time T, if I play X, how much regret do I expect to
pay? It's just the difference between the value of the function at the point that I play and the
value of the function at this best point in hindsight. So this is the thing that's added to the regret at
this point in time. The total regret, the total Maximum Bayesian Regret is just the expected
sum of these values. So in words R sub T of X is given what I know, again all expectations are
conditioned on the past, given what I know how much do I expect to pay for playing the point
X? So that's the first thing I want you to remember.
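In the notation just introduced, writing E_t for expectation conditioned on the history up to round t, the instantaneous regret is roughly:

    \[
    r_t(x) \;=\; \mathbb{E}_t\!\big[f_t(x) - f_t(x^\star)\big],
    \qquad\text{so the total Bayesian regret is } \mathbb{E}\!\left[\sum_{t=1}^{T} r_t(x_t)\right].
    \]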
The second thing is the instantaneous information. So we'll see what it is in a minute, but in
words it's going to be given what I know how much information do I expect to get about X star
by playing X? So for each point in my domain I want to be able to say how much do I expect to
pay for it and how much information do I expect to get by playing it. And it's all going to be a
question of balancing exploration, exploitation. Let's see what this definition is. So the
information at time T by playing X is just going to be the conditional variance of this random
variable, so this is a random variable that removes all the noise that's independent of X star and
just keeps the randomness that's due to X star. So perhaps maybe all the functions are also
polluted with some independent noise on each round, just average that out, just keep the
randomization in these functions which is due to the identity of X star, the best point in
hindsight. So now at some point X I can say look at this point, X is my guess for what X star is.
As X star varies, the value of this function is going to change. If this thing has a big variance
then knowing that value will tell me a lot about the identity of X star. If you think about the
previous example where those two parabolas met that was the point where X star could have
been here or here but the variance was zero. I played; I knew exactly the number that I would
get.
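A formulation of the instantaneous information consistent with this description (hedged; the paper's exact definition may differ in details) is the conditional variance, over the posterior on X star, of the mean loss at x given X star:

    \[
    I_t(x) \;=\; \mathrm{Var}_t\!\Big(\,\mathbb{E}_t\big[f_t(x)\,\big|\,x^\star\big]\Big),
    \]

where the inner expectation averages out all randomness independent of X star and the outer variance is over the remaining randomness in X star.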
>>: You need to condition on X star, you need to condition on the value given what X star is?
>> Ofer Dekel: This is conditioned on knowing who X star is out of this grid. So X star is one of the guys on
the grid.
>>: So the variance here is over the [inaudible] choice of X star and the expectation is over everything
else in some sense?
>> Ofer Dekel: This is a random variable that averages out everything except for the
randomness in X star and now the variance is over the different values of X star. So this is in
words it's a little bit complicated, but intuitively you'll see it's very, very simple. You're just
saying that if I have a high variance it means that knowing that value is going to give me a lot of
information. And now we have a lemma. This is a lemma adapted from a paper by Russo
and Van Roy, and it says that the sum of all the information you can collect throughout the
game is upper bounded. There's some finite amount of entropy about who X star is and I can't
experience more variance than that. I mean, if I play points with sufficiently high variance then I
know what X star is and there's no more variance. So the sum of the square roots of this
information term is upper bounded by some total amount; this is using some information-
theoretic arguments.
>>: You used them in single dimension so far?
>> Ofer Dekel: No. But look at the square root T that’s suspiciously hiding here. So we are
going to use the fact that this is going to be a square root T. So the total amount of information
I can collect throughout the game is bounded so this immediately gives me a very simple recipe
for getting the type of bound that I want.
>>: What's the definition of XT you're using now?
>> Ofer Dekel: Which one?
>>: What’s XT?
>> Ofer Dekel: For any policy that I play, for any adversarial distribution of F’s the total amount
of information, so how much entropy could there be about the identity of X star it’s this much.
>>: What is K here?
>> Ofer Dekel: K is the size of the grid, the number of points.
>>: So that's actually used this way.
>> Ofer Dekel: Yes.
>>: So X star depends only on the script F, not on my policy?
>> Ofer Dekel: X star is a random variable which is the minimizer. It doesn't depend on the
policy. The minimizer of the actual instantiation. So here's the very simple recipe that I get for
proving regret bounds. So if I can find an algorithm that guarantees an upper bound on this
quantity, this is going to be called the information ratio. So this is the instantaneous regret divided
by the square root of the instantaneous information. If this is bounded by a constant then my regret
is just going to be upper bounded by that constant times the total information that I get, which we
already said is upper bounded by square root T.
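Loosely, the recipe reads as follows (constants and lower-order terms omitted): if the algorithm keeps the information ratio bounded by some C on every round, then

    \[
    \frac{r_t(x_t)}{\sqrt{I_t(x_t)}} \;\le\; C \;\;\text{for all } t
    \quad\Longrightarrow\quad
    \mathbb{E}\!\left[\sum_{t=1}^{T} r_t(x_t)\right]
    \;\le\; C\,\mathbb{E}\!\left[\sum_{t=1}^{T}\sqrt{I_t(x_t)}\right]
    \;\lesssim\; C\,\sqrt{T\log K},
    \]

where the last step is the Russo-Van Roy style bound on the total information mentioned above.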
>>: So again, algorithm is determining X sub T.
>> Ofer Dekel: So the algorithm is controlling X sub T and what we've basically done here, what
initially Russo and Van Roy did and we extended to our case, is to break it down into a sufficient
condition which looks at only one round at a time. So if I can prove that on each iteration the
amount of regret that I pay is controlled by the amount of information that I get then I'm good
and I can get my square root T. So I don't know if I'm going to pay a big regret or a small regret,
but if I pay a big regret I'm guaranteed to also have collected a lot of information. If I've only
got a small amount of information I'm guaranteed to have suffered only a small regret. So if
somehow the two things are proportional to each other then I can immediately get my regret
bound. So is that clear? Does everyone see that?
Okay. Good. So here's strategy number two. This is something that does work. We saw that
attempt number one, playing the minimum of the expected loss function, doesn't work.
Here's something that does work. So this is something called Thompson Sampling. This is
something that we use in Microsoft products and it's a very simple and intuitive concept. Here's
what it is: so we have this posterior, we can compute the posterior, draw a concrete loss
function from that posterior, so we know that the real function was drawn from this posterior
but we draw independently our own version of the loss function from this posterior and play
the minimum of that thing. So pretend that that's a real loss function and play the minimum of
that. So again we draw some F prime T from our posterior and we play the minimum of that.
So that's Thompson Sampling.
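A minimal sketch of one round of Thompson Sampling in the discretized setting; the posterior object and its sample() method are hypothetical stand-ins for whatever posterior representation is actually maintained.

    import numpy as np

    def thompson_sampling_round(posterior, grid):
        # Draw one concrete loss function from the current posterior over loss
        # functions, then play the grid point minimizing the sampled function.
        f_sampled = posterior.sample()
        values = np.array([f_sampled(x) for x in grid])
        return grid[int(np.argmin(values))]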
This is what Russo and van Roy were mainly interested in. They cared about this in the IID
case, but this also holds in our case more generally; and what they said is that if these functions
are bounded, and not necessarily convex, so again in this K-armed setting where there's no
structure like convexity, and we choose this X sub T according to this very simple Thompson
Sampling rule, then this constant, which we saw here, is just going to be square root K.
So now if we go look at the previous slide, for that case, if our grid has K points and
we don't even use convexity, we already have a bound which looks like square root K times square
root of T log K, which is exactly the bound that we have, so it's an alternative proof of
the bound for the K-armed bandit problem. And this is what they were interested in. But in our
case that's not going to work precisely because of the comment you made before that in our
case because we are doing this discretization of this continuous problem and we pay for that
the discretization has to be fine enough, specifically the number of points in our grid is going to
have some dependence on T. So if we use that lemma that I showed you before then K is
actually going to be equal to T; if we do a non-uniform grid maybe it will shrink down to
something like square root T, but in any case it's going to be something which has a polynomial
dependence on T, so our bound will be the square root of this thing times square root T, and it
will be bigger than square root T. So this theorem is not powerful enough, but it didn't use
convexity of course so it's not going to be powerful enough.
So we're going to use convexity to get a stronger theorem. And that's our strategy. So in order
to prove our theta of square root T Bayesian regret bound we're going to define a slight variant,
we’re going to have to tweak Thompson Sampling a little bit and we're going to get this kind of
result. So we're going to show that the instantaneous regret is controlled by the instantaneous
information times some polylog of K, not square root of K. So convexity is going to turn this
from square root K to polylog K. That's exactly what we need, and that's why we are not going
to get hurt even if we have exponential grids and so on. There could also be a small lower-order
term there.
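Schematically, the per-round result for the modified Thompson Sampling strategy has the form (lower-order terms suppressed):

    \[
    r_t(x_t) \;\lesssim\; \mathrm{polylog}(K)\,\sqrt{I_t(x_t)},
    \]

which, plugged into the recipe above, gives the square root T Bayesian regret bound up to polylog factors.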
>>: That would be true only in dimension one.
>> Ofer Dekel: Yeah. So we're getting to the point where we’re using dimension one.
>>: I imagine you wouldn't get a better-
>> Ofer Dekel: So I think we know how to generalize this part to a higher dimension. The part
that we don't know how to generalize to higher dimension I will point out in a minute. But so
far everything either already works in high dimension or we think we know how to do it. So
here's a proof sketch. This is where it gets a little bit technical. So we're going to look at two
functions. So we are at time T; we are going to look at the mean function, so this is the average
of the posterior, so we know what the posterior is, we're just going to ask what is the mean
function that the adversary is going to play, and then we're going to look at the mean function
conditioned on knowing what the optimal point in hindsight is. So this is going to be F bar and
F bar of X. So assuming the optimum is at X, what's the mean function?
We're going to look at these two guys. And it turns out that their instantaneous regret is
almost equal to the expected value of the difference between the mean function and the mean
function where you are playing the point that you know it's going to be optimal in the end so
this is going to be equal to our instantaneous regret. The instantaneous information is going to
be almost equal to the expected value of the L2 norm between these two functions. So
somehow this is an expectation over a local pointwise difference between these two functions
and this is an expectation over a global distance, an L2 distance between them. So the variance
is going to be proportional to the L2 distance and the regret is proportional to the
pointwise distance; this is just some technical stuff that we can prove. But now let's compare
this random variable with this random variable for each X. So now the question is, again, I want
to show that this is upper bounded by that-
>>: What is the distribution of the point?
>> Ofer Dekel: Almost the posterior. So I'm hiding a little bit of yuckiness, but assume it’s just
the posterior. So there's just one distribution. The distribution of X star and the distribution of
the point that I play is the same because it’s Thompson Sampling. So that's exactly what
Thompson Sampling does. So again, I want to show that this is upper bounded by that times a
small thing, and I'm going to compare now each one of the terms inside. So ideally it would be
great if I could show you that this term, the thing whose expectation is
the regret, is upper bounded by just the L2 distance times our polylog K, but that's not exactly correct;
here's the intuition that we used. So we haven't used convexity anywhere before and this is where
convexity comes into play.
So again, the adversary has two goals: he has this F bar that he controls; he has this F bar
assuming that I know what the optimum is. So we are comparing these two functions.
Making the difference between them at the optimum big means the regret will be large
because our regret is kind of proportional to the expectation of this function, so he wants to
find a point X star such that if you knew that this is the best point it will actually be a very, very
low value. But not knowing that, just taking the posterior, it's somehow hidden from you. So
the difference between these two numbers is going to be proportional to our regret. The L2
distance is going to be proportional to the variance, the information that we get. So the
adversary has two goals: he wants the regret to be large, he wants the information to be small.
So he's going to take the reference function which is just expected loss at this round and he's
going to want to try to hide a really, really good value from us by pulling the function down to
make a point that's much, much better than I can see but keep the distance between the
functions very, very small. You can see that it's very easy to do this if there are no
restrictions on the functions: if I want to take two functions and make their pointwise
distance big but their L2 distance small, I can do it very, very easily. Not when they're
required to be convex. So this is where convexity kicks in.
So this is our local to global lemma which says that a local change in the function, if you take a
convex function and pull it down at some point to a point that's below its optimal, so you hide
something really, really good you're going to have to change it globally in the sense that the L2
distance between the original function and the new function is going to be large. So any
local change to the function which is significant in our setting is going to give us information
because it’s going to make the energy between two functions very, very big. This is where most
of the work in the paper is spent. I'm not going to talk about this anymore because I'm out of
time, but this is the intuition. Really, this is the property of convexity that makes that square root K
into a polylog.
>>: Does this mean you get some random inequality between the L [inaudible] norm and L2
norm or something like that?
>> Ofer Dekel: No. What we get is we show that the ratio between this guy and that guy is
always going to be bounded by some term which has to do with the energy of the function F
independent of the red number and this term when we take, it could actually be big for some
cases, so actually there could be points, for example in this case if you pull down very close to
the optimum you can actually get a very unfavorable ratio. You can move the function by a
little bit and change the energy between the functions also by a little bit, but when you take the
expectation, the term by which we lower bound this guy is going to look like a harmonic sum
and it's going to turn into this log. So there's some math magic that happens there. On average
if you change the function locally you're going to have to change it globally.
>>: In just one sentence [inaudible] distance is not with respect to [inaudible] but it’s with
respect to the posterior distribution.
>> Ofer Dekel: I'm cheating with the expectations. There are more cheats. It's the distribution only
supported on the interval between this and that. I mean there are some details. This is kind of a
hairy, messy little thing, so it's beautiful math and also kind of disgusting at the same time. So this is
the idea. This is the property of convexity that makes this work.
>>: This is the property that you only know how to prove in one dimension?
>> Ofer Dekel: This is the real showstopper for a high dimensional proof to show that the local
change induces, must induce a global change. Our proof is very, very manual. It’s very much a
picture. You say this line has to be lower bounded by a different line.
>>: Do you know if it's true or not in say two dimensions?
>> Ofer Dekel: Yes. In a minute, in the next slide I'll tell you. So anyway, that's the end. That's
all I have time for. Let me conclude.
So what we have is a non-constructive upper bound on the Minimax Regret of the
adversarial bandit convex optimization problem in one dimension. We used the Minimax
Principle to reduce the adversarial setting to the Bayesian setting. This has been done before
but as far as we know not for bandit problems, so when you have what's called full information
problems people have used this trick but not for bandit learning. And then we have this local
change induces global change property of convex functions. This is somehow independent of
our work. This is something that I would have expected to find in books on convex
functions, that we exploit, and the combination of all these things gives us our nonconstructive
upper bound.
This is in answer to your question. So this is breaking news from just a few weeks ago. So
[indiscernible] were able to generalize this to arbitrary dimension albeit with an exponential
dependence on the dimension. So before I didn't explicitly tell you how all these regrets that we
had before depended on dimension but they were all a small polynomial, so something like N
squared, N cubed. Here, if you're willing to pay exponentially in the dimension, then the new
result which is built on the same principles but uses a different type of algorithm for the
Bayesian case is still able to get square root T regret but with an exponential dependence and
that's a topic for perhaps a future talk by-
>>: [inaudible] Lipschitz?
>> Ofer Dekel: Lipschitz is not important. We get away from Lipschitz just using that
discretization trick, and now we are in a finite problem. It's no longer even-
>>: [inaudible]. Instead of [inaudible] or something we use probability [inaudible]. I'm sure
that there exist two points that will give you enough to-
>> Ofer Dekel: So our algorithm, if you care about the Bayesian setting as a first-order thing,
then it's constructive, though perhaps very hard to compute. For some families of functions you could maybe
actually compute this strategy. So it’s a Thompson Sampling with some small modifications. So
I told you what it is. You can solve it. We just don't know how to map that back into an answer
for the adversarial case.
>>: [inaudible] optimization. It's crazy difficult. Is there any kind of dimension reduction in this
sample?
>> Ofer Dekel: Dimension reduction. What is that?
>>: Is there a way to somehow [inaudible]?
>>: I don't think so.
>> Ofer Dekel: I don't know. We don't know.
>>: [inaudible] so far.
>>: It would also probably work for convex optimization, right?
>> Ofer Dekel: So if anyone knows something like this in high dimension, I think we’re done. I
mean we have almost all the other parts, but just this idea of how much the function has to change
globally when you change it locally, for any norm. So the norm is arbitrary. It's governed, as a
subset, by the posterior, so if you could show that for any norm this thing holds, that a local change is
going to make a global change, that you cannot hide a little good point here without making the
functions very, very far apart, we'll be done with this big problem; we'd still have the problem of
finding an actual algorithm that achieves this, but we will have proven that one exists.
>>: So are you going to prove something like that?
>>: Essentially we do something like that but we are able to do it with respect to [inaudible]
instead of with respect to the posterior and then we show that there is a small discrepancy
between [inaudible] and the posterior solution. So it's in two stages.
>>: If you had, instead of a convex function, if you had a multivariate polynomial of bounded
height and degree, it's known, for instance, there's something called Markov's
Inequality which says more or less things like local change: if there is a drastic
variation in a small neighborhood, then there's a drastic variation overall.
>> Ofer Dekel: What does it have to be? A polynomial?
>>: I think it's called Markov's Inequality. It's not the [inaudible]. It's the sum
[inaudible]. But for polynomials, depending on height and degree; but for multivariable
polynomials it's not.
>> Ofer Dekel: But they have to be Lipschitz as well.
>>: Well, they have to have bounded coefficients, heights, and degrees in all.
>>: I suspect something like this isn’t true. I mean you can get [inaudible] to this problem.
[inaudible] computation and really constructing very complicated regions for what types of
aspects of the kind of [inaudible]. It sounds like a few points [inaudible] you really construct
very complicated regions to chop off-
>> Ofer Dekel: And you're saying this implies what?
>>: I suspect that the analog you want in [inaudible] is going to be [inaudible]. It might be
that your algorithm really has to be exponential. That's my question. Do you think the algorithm is
actually exponential [inaudible]?
>>: It is exponential, but I believe that the one where you play the minimum and you just
randomize a little bit around the minimum, that should give you polynomial in that.
>>: But even though this is true, in the stochastic setting this would be wonderful.
>> Ofer Dekel: We are saying a lower bound.
>>: Even [inaudible] stochastic setting.
>>: Okay. I think we're out of time. So we'll have another break where further questions can
be discussed in private, but we're going to continue to the next talk in four minutes.