>> Ofer Dekel: Let's get started this morning. So it's our pleasure to host Sebastien Bubeck.
Sebastien finished his PhD in France and then did a postdoc in Barcelona with Gabor Lugosi and
now he is an assistant professor at Princeton and today he's going to tell us about linear
bandits.
>> Sebastien Bubeck: Thank you. Thank you Ofer. Thank you very much for the invitation. I'm
really happy to be here and to present this work. I have been working for five years on linear
bandits. This has been one of the topics where I spent most of my time and I am going to show
to you why I decided to spend five years on this and some of the results that I got. Let's start
right away by defining mathematically what the problem is, and then I will show you some
applications of it. First I want to set up the notation and everything. The linear bandit
problem is a sequential decision-making problem where a player is going to make sequential
decisions; he is going to play for n rounds, so n is given. He has an action set A which is a
compact subset of Rd, and there is an adversary who is going to play against him. The
adversary also has an action set, which is designated as Z, and Z is also a subset of Rd. The game
goes on sequentially, so at each round, for t equals 1 up to n, we have three steps. The first
one is that the player chooses an action at in his action set A. Simultaneously, the adversary
chooses an action in his own action set, which is Z; the chosen action is denoted by zt.
The loss of the player is going to be the inner product between at and zt. That's the
loss of the player, and I am going to assume that the only information that I have as a player
about the action that the adversary played is this loss, this inner product between at and zt; it's
the only thing that I know. What is my goal? My goal is going to be to minimize my loss over all
time steps, so the sum for t equals 1 to n of the inner product of at and zt.
More formally, I will define what is called the regret, or the pseudo-regret in this case, but the distinction
doesn't really matter for us here. This quantity: my own loss is the sum for t equals 1 to n of the inner product
between at and zt. Let's not take care of the expectation for the moment. I'm comparing myself
to how well I could have done in hindsight if I knew the sequence of actions z1 up to zn
that the adversary was going to play. If I had to select one action to play against this sequence
of actions, I would have selected the action a which minimizes the sum for t equals 1 to n of
the inner product of a and zt. That's how well I could have done in hindsight; I'm comparing myself to this, and
this is what I call the regret. What is this expectation? We'll see that we have to play randomly.
We have to randomize our decisions, because otherwise, if our decision is a deterministic function
of the past, then the adversary can just select the worst possible action against my action. So we
don't want the adversary to know exactly what we're going to play. That is the problem I am
going to look at, and I am going to try to characterize how small the regret can be and what
strategies can work in this setting. Yes?
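For reference, and glossing over exactly where the expectation sits (that is the regret versus pseudo-regret distinction just mentioned), the quantity is

\bar{R}_n \;=\; \mathbb{E}\sum_{t=1}^{n}\langle a_t, z_t\rangle \;-\; \min_{a\in\mathcal{A}}\,\mathbb{E}\sum_{t=1}^{n}\langle a, z_t\rangle .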
>>: So the a is constant in the second sum. It seems that the quantity is going to be big no matter
what, right? You're comparing with something that's not going to be very good.
>> Sebastien Bubeck: Right. Here is how you should view it. Either my loss is small, in which case I'm
happy. Or, if it's not small but my regret is small, then I know that it's not small because nobody was
good, no action was good. Either no action was good, or, if there is an action which is good, then I will
identify it. That's the intuition.
>>: You are comparing to a constant action?
>> Sebastien Bubeck: I am comparing to a constant action. You will see right now I'm going to
give you an example and you will see why it's meaningful. Maybe right now it's a little bit hard
to see why this is meaningful. Please interrupt me whenever you want if you have a question.
It's much easier for me if you have questions. I'm going to give you now one example, which
will be kind of a canonical example. This is a program of online routing. This is a map of
Barcelona where I did my postdoc. It's not because I didn't want to change the slides since my
postdoc. It's just that if I give you a map of Princeton there is no real shortest path problem.
There is only one road. [laughter]. Anyway, so this was more interesting. Point A is where I
lived when I was in Barcelona and point B was the Centre de Recerca where I was working.
Every morning I had to choose a path to bike from my place to the university and I had many
possible choices. My loss in this case was how long it took me to reach my destination. This
time is the sum of the time on each street, so on each street there is a delay which depends on
many things. My loss, my total time to travel is going to be the sum of the delays on each
street, and this is going to be modeled as a linear bandit problem. Let's see a little bit more
formally how this goes. We have a graph of the streets of Barcelona and I want to go from this
point to the other one over there. As a player what I choose at each day and each time step is I
choose a path, so I choose a path like this. Simultaneously, you should think that there is an
adversary. I mean, nature is putting delays on the streets of Barcelona, so I have delays. These
delays are real numbers Z1 to Zd. I assume that I have d edges in my graph. So I have Z1, Z2 et
cetera. What is the delay of my path? Here it is Z2 plus Z7 et cetera up to plus Zd is the total
delay on this path. Every day I'm going to choose a path. I will get a loss like this. Now I need
to tell you what my feedback is. I told you that the main thing I'm going to be interested
in is the case where my feedback is this loss, the sum of the delays. But there are other feedbacks
that we could think of, and I'm going to be somewhat interested in them too. The strongest, the full
information feedback, is that at the end of the day, I don't know, I listen to the
radio and I record the delays on all of the streets of Barcelona. That is very, very strong; it
requires a strong investment of my time to do that, but in that case, in terms of information, I
would have observed exactly what the adversary played. This is what I'm going to call the full
information feedback. A weaker feedback which makes more sense, and as you will see is very
interesting in many applications, is what I call the semi-bandit feedback. The semi-bandit
feedback is that I observe the delays only on the streets of the path that I chose. So in this case I would
have observed Z2, Z7, up to Zd, but I don't know what the delays are on the streets that I didn't
take. The most difficult type of feedback is that I just record how long it took me. That's also
the most natural one in some sense. I just record the total time.
>>: Not in the biking case. In the biking case the semi-bandit seems reasonable because you
would know each street; whereas if you are sending a packet through some network you might only
know the total.
>> Sebastien Bubeck: Yes. I agree with that, but even in the biking case it could make sense
that I only record the total time. It's not, it's the weakest thing that I could do. I mean, all three
settings make sense in my opinion. Yes?
>>: The adversary, for example, doesn't seem adversarial. So this is more like a reinforcement learning or a
planning problem?
>> Sebastien Bubeck: It's sort of a planning problem except that what you have to understand
is that the key point is that for the sequence of actions, you don't have to think of them as
adversarial. You don't have to think of them as chosen by an adversary; they are just fixed. There is
no i.i.d. assumption on them, for instance. There is no stochasticity, so it's a worst-case
model. It doesn't have to be an adversary choosing the actions, the Zt; it's that what I'm going to say
is going to be true for any sequence you want, Z1, Z2 up to Zn.
>>: Yeah, but it seems like the results would be different if on the one hand you have a real
adversary who is really trying to mess you up. On the other hand when there is simply just
nature out there that is a natural process. It is not trying to intentionally make your reward…
>> Sebastien Bubeck: It turns out that the result won't be different.
>>: Oh.
>> Sebastien Bubeck: That's one of the nice things.
>>: Especially if you live in New Jersey. [laughter].
>> Sebastien Bubeck: I didn't think of that.
>>: But I mean, just to be completely clear on what you are focusing on: it is, outside your
minimum over A, a maximum over Z.
>> Sebastien Bubeck: I'm going to be very clear on the next slide, so everything will, yes. That's
exactly what I'm getting at. Let me make this a little bit more formal. The semi-bandit framework,
for instance, how does it relate to what I was saying before? What I just
described was the combinatorial setting. In the combinatorial setting my action set is a subset
of the hypercube in d dimensions. What does it represent? It can represent the paths in a graph, the
s-t paths in a graph; it can represent the spanning trees in a graph; it can represent many
combinatorial structures. So I have this action set which is given to me. In my previous example
A is the set of incidence vectors of paths in my graph. Z is going to be the full hypercube [0,1]^d,
so the adversary can put normalized delays on the edges: on each edge he can put a delay
between 0 and 1, and when I choose a path, because of this notation, my loss suffered is indeed the
inner product. You see, my action at for this path is a vector which has a 1 on this edge, a 1
here and a 1 there and 0 everywhere else, and zt is a vector of weights on the graph,
so the inner product is indeed the sum of the delays. So that makes sense. I'm going to look at
the minimax combinatorial regret, which I call R_n^C. I look at the regret, and I'm going to try to find the best
strategy such that, for a given set of [indiscernible], we are going to look at the worst
adversary: so the infimum over all strategies for my player and the supremum over all adversaries.
The adversary may even know what my strategy is. I will do two
things, but for now let's say I want bounds that do not depend on the combinatorial set. My
strategy can depend on the combinatorial set, but I want a universal bound, which keeps things simple.
What I want is to characterize this quantity R_n^C for the three types of
feedback: full information, semi-bandit, and bandit information. Is that clear? That's what
I want to do.
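As a small, self-contained illustration of this encoding; the number of edges, the chosen path, and the delays below are invented for the example:

```python
import numpy as np

d = 6                                   # number of edges in a toy graph (made up for the example)
path_edges = [1, 3, 4]                  # indices of the edges on the chosen path (also made up)

a_t = np.zeros(d)
a_t[path_edges] = 1.0                   # incidence vector of the path: a point of A, a subset of {0,1}^d

z_t = np.random.uniform(0.0, 1.0, d)    # the adversary's normalized delays on the edges (Z = [0,1]^d)

loss = a_t @ z_t                        # the loss: the sum of the delays on the chosen path

# The three feedback models from the talk:
full_information = z_t                  # all delays revealed
semi_bandit = z_t[path_edges]           # delays revealed only on the edges of the chosen path
bandit = loss                           # only the total delay revealed
```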
>>: [indiscernible] this guy [indiscernible]
>> Sebastien Bubeck: Yes. But it's played sequentially, but it's played once.
>>: [indiscernible]
>> Sebastien Bubeck: But I can learn about the sequence in one of two [indiscernible]. To keep
coming back to your question: if there is nothing to learn, there is nothing to learn, and
then I will not do well. But maybe there is something to learn in the sequence. Maybe there is
one direction which is really good, and then I will learn it. If I have a small regret I must learn it.
This means that I will identify the good direction.
>>: Question. [indiscernible] characterized [indiscernible] function class that would respond to
a malicious intentional adversary versus a stochastic sort of generative model-based adversary?
Is that a reasonable prediction?
>> Sebastien Bubeck: This would be, I could do a talk on this but this would be completely
different. Yes. There are many categories of adversaries. In this talk I'm going to assume the
worst possible adversary and it will work fine. Even without making any restriction on the
adversary I will be able to make strong statements.
>>: [indiscernible] adversary there would be a natural process then the [indiscernible] could be
in a sense better.
>> Sebastien Bubeck: Absolutely. Absolutely. And this is a very interesting direction. For
instance, if the Zt are i.i.d. then you can prove better bounds. One interesting direction is to have
the better bounds but also the worst-case guarantee, like you want the best of both worlds, but that's a
different topic. That's the combinatorial setting. That's one setting. The other one is the
geometric setting. The geometric setting goes like this. Let me just recall the definition of
the polar: if I have a set, a subset X of Rd, the polar of X is the set of points in Rd that
have an inner product smaller than 1 in absolute value with any x in X. So if X is the unit ball
of some norm, then the polar is just the unit ball of the dual norm; that's
exactly the definition. If X is the Euclidean ball then the polar of X is the Euclidean ball also. Why is this
interesting? Because it allows for a certain normalization. In the geometric setting I have a
certain action set, no assumption on it, and the adversary is playing on the polar of the action
set, and that allows for a normalization, because then it tells me that the
instantaneous loss is always between -1 and 1. Before, the normalization was that zt is in
the hypercube [0,1]^d, and here the normalization is that the adversary plays on the
polar. This is not of particular interest for applications, I think, but it allows
a natural and simple normalization. In this case I'm interested in the minimax geometric
regret R_n^G, which is the supremum over all action sets A, subsets of Rd, with Z the polar, or a
subset of the polar, and again the same thing, the inf sup of the regret. Now some applications. I
think the linear bandit problem is really a fundamental problem, as fundamental as linear
regression or linear optimization. Right now we only have a few applications but I think in the
next year there will be many more. Let me tell you about the application that we have right
now. The first one is contextual bandit and I'm using the model of Li, Chu, Langford and
Schapire. Let's assume that you have N ads and you are running a website and every time there
is a user that comes you have to show him one of the ads. When the user comes he arrives
with certain contextual information. This contextual information gives rise to a subset of Rd
which are features for each ad. So ait is a point in Rd and it's a feature vector associated to the
ith ad when the [indiscernible] arrived. So I have this feature vector and now I'm going to
assume that my payoff for showing the ith ad to this user is going to
be a function of the inner product between the feature vector and some zt, where zt is
unknown. And zt also depends on the user. It depends on the context. It depends on
maybe some randomness. Okay. It depends on many things. What I observe is just this inner
product, or maybe a transformation of this inner product. I don't know what would happen if I
showed him another ad. If I show him ad i, what I observe is the inner product of ait and zt, but I
don't know what would happen if I showed him ad j. But now, because of the Euclidean structure, I
can infer something; but how much can I infer? So it's going to be a bandit problem. Here let
me just say something. This is a linear bandit problem, but one where the action set is changing
over time. This is my action set; it's changing over time, and some of the strategies that I will
describe do not work in this case. Some of them do work. That's one application.
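In the notation above, and assuming the payoff is just the inner product (without the extra transformation mentioned), this is a linear bandit with a time-varying action set:

A_t=\{a_{1,t},\dots,a_{N,t}\}\subset\mathbb{R}^d,\qquad \text{play } a_t=a_{i_t,t}\in A_t,\qquad \text{observe } \langle a_{i_t,t}, z_t\rangle .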
Another one, which is maybe slightly more theoretical, but I think could have plenty of applications. I don't
want to spend too much time on this in describing exactly what I mean, but I think many people
in the room know what an MDP is. An episodic loop-free MDP is just an MDP where there is no loop
in the state space, so you always move forward, basically, and you stop at some point
and then you start over. That's an episodic loop free MDP. With bandit information, meaning
that you only observe what happens when you are in the state. You take an action. You
observe what is the reward but you don't know what will happen if you had taken another
action. It's very natural. Zillions of problems can be modeled like that and I mean really many
problems can be modeled as MDP's. In this paper, NIPS 2013 they show that basically episodic
loop free MDP's are just combinatorial semi-bandits. It's a one-to-one mapping and it's
computationally efficient. So solving combinatorial semi-bandits allows you to solve this very
wide class of problems, episodic loop-free MDPs. Actually, there is also something similar in
another paper from 2012 with a reduction to linear bandits. The last one should also show you, I
think, the generality of the problem, and this one, to my knowledge, has not really been used in
practice. Consider your favorite linear program, whatever it is, a real application or an
inventory problem or something more theoretical. You want to minimize c times x subject to some
constraints. In linear programming you know c and you know A. Now, what is going to be the
linear bandit problem on the action set that corresponds to these constraints, so the set of x such
that Ax is smaller than b? It's just that you are solving a sequence of linear programs like this, but with
a cost vector changing over time, and you don't really know which cost vector you are
optimizing at a certain time. That, in my opinion, is very meaningful. Think of an inventory
problem. You don't know exactly what is going to be the cost vector of tomorrow. You may
have some idea because you have past information and you have learned about the processes,
but you don't know exactly what's going to happen tomorrow and so this, in my opinion, is a
better model for what happens in practice rather than solving one big LP once and for all. You
are probably better off solving a sequence of LPs with some robustness in how the cost vector
is going to change. Yes?
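Written out, the repeated problem being described is (with ct the unknown cost vector of round t):

\mathcal{A}=\{x\in\mathbb{R}^d : Ax\le b\},\qquad \ell_t(x_t)=\langle c_t, x_t\rangle,\qquad R_n=\sum_{t=1}^{n}\langle c_t,x_t\rangle-\min_{x:\,Ax\le b}\ \sum_{t=1}^{n}\langle c_t,x\rangle .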
>>: [indiscernible] do you assume that A and b are known?
>> Sebastien Bubeck: Yeah, absolutely. The constraints are known; that makes sense. You know
what the constraints are, right?
>>: On your [indiscernible] change you can have some cost to the violation of constraints, not
so [indiscernible]
>> Sebastien Bubeck: Yes. This is related to the first model; a changing action set is definitely
something meaningful. I agree. Any questions on the applications? So the back edge
[indiscernible] is also a [indiscernible] one. I'm not going to come back to applications. Let's go
into the geometric setting, where we want to understand the order of magnitude of this
quantity under bandit feedback, and what kind of strategies we can use. This is a little bit of
the history of the geometric setting. The first thing that was considered is in the seminal paper
by Auer, Cesa-Bianchi, Freund and Schapire in 1998, where they consider the case where the action
set is a simplex, a very simple action set. And they proved a regret bound of order square
root of dn. Remember, d is the dimension and n is the time horizon. I put my paper with Jean-Yves
from 2009 there also, because they could only prove square root of dn log d and we could remove the
log. Okay. That's not very important. We will not care about logs in this talk. Okay. So that
was 1998. Then four years later, the first real paper introducing the linear bandit problem was Auer's
paper in 2002, and he considered only the case of Z1 up to Zn being an i.i.d. sequence. If you have an
i.i.d. sequence you are in known territory. You can do many, many things. It's not such a difficult
problem in some sense. He proved a d to the 5/2 square root n regret bound. Two years later
Awerbuch and Kleinberg showed an n to the 2/3 regret bound, but only in the oblivious case; going
back to your question here, oblivious means there's no adversary. It's just a fixed
sequence Z1 up to Zn. In the same year McMahan and Blum showed an n to the 3/4 regret
bound if you assume nothing about the adversary. This was the first general regret bound with
only n to the 3/4. When we don't get square root of n I don't put the dependency on d because, as
you will see, you can always get square root of n. So n is the most important parameter; n is the
number of times you are going to play. But d is also very important, because think of the contextual
bandit application: d is the size of the feature space, so the difference between d to the
5/2 square root n and d square root n, which is what we get in the end, can be a huge difference in your regret if d is a
thousand or something like that. You could move from performing on average as well as the
best ad with an average regret of order, I don't know, 0.1 when you have d to the 5/2,
to 0.001 when you just have d, if d is of order 1000; so you can do much,
much better when you improve the dependency on d. Anyway, then two years later they finally
proved the n to the 2/3 without the oblivious assumption. Again, two years later, in 2008, Dani, Hayes and
Kakade, so this was a seminal paper, they got the first square root n regret bound without any
assumption on the adversary, so they get d to the 3/2 square root n, and they show that the best
you can hope for is d square root n. In the same year there was an important paper by
Abernethy, Hazan and Rakhlin where they showed d to the 3/2 square root n and it's
computationally efficient. I will come back to that. Now four years later, finally, so two years
ago we proved with Nicolo and Sham the optimal regret bound, d square root n and this is
optimal because of the lower bound of Dani, Hayes and Kakade and that's what I'm going to
describe. It took 10 years to get to it, but now we have the optimal bound. Unfortunately, it's not going
to be computationally efficient, and at the end of the talk I will show you a new
idea for a computationally efficient strategy. Let's get into it. Yes?
>>: What is special about doing it just in the geometric setting? You are saying that all of this
has been considering the geometric setting as opposed to the combinatorial one. Why is it significant?
>> Sebastien Bubeck: If you have a mechanism for the geometric setting, you immediately have
an algorithm for the combinatorial setting.
>>: But strictly more general.
>> Sebastien Bubeck: It's more general. The catch is when you move from one to the other. In the
geometric setting you have this normalization where I play in my action set and you play in the
polar. In the combinatorial setting, I play in the action set and the other one is not the polar.
It's the hypercube. So you have a certain fixed constraint on the adversary which has nothing
to do with my constraint. Potentially you could exploit that, and that's what I'm going to
do. I'm going to show that in the combinatorial setting you can do better than viewing it as a
geometric problem. That's going to be the key. That will be the second part of the talk. This is
the most important slide. We want to estimate the unseen. We only observe the inner product of at
and Zt, but we would like to know what would have happened if we had selected another
action. We only observe one real number and we want, basically, d real numbers. How can we do
that? There is a well-known technique; it's importance sampling. How do we do this? You play
at at random: you have a probability distribution Pt supported on A, you draw your action at
at random from this probability distribution Pt, and using this we can build an unbiased estimate Z tilde t of
Zt with this formula. Sigma t is the second moment of my distribution, so the expectation of
the outer product a a transpose when a is drawn from my distribution Pt. I'm going to
assume that this is invertible. If it's not, you can just take the pseudo-inverse instead and that
works. But let's say it's invertible. So Z tilde t is equal to sigma t inverse times at times (at transpose Zt). The first
thing we need to verify is that I can indeed do this computation. So at transpose Zt, that's the real
number that I observed; that's the value that I get. Now sigma t inverse, I can compute it; it
depends on Pt. And at, I know what it is; it's what I played. So I can compute this quantity. I
cannot compute Zt; Zt I never observe, I just observe this inner product. So I have Z tilde t.
Now let's verify that this is unbiased. I want to show you that the expectation of Z tilde t, when
at is drawn from Pt, is equal to Zt. With just one real number you can get an unbiased estimate of
the entire vector. So let's see. The expectation is this. Z tilde t is just this formula: sigma t inverse
times at at transpose Zt. What is random in this quantity when I draw at from Pt? Only at at
transpose, and what is the expectation of at at transpose? It's sigma t, by definition. So we have
sigma t inverse times sigma t times Zt, so I have Zt. That was it. That's the most important slide.
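A quick numerical check of this estimator; the action set, the distribution Pt, and the vector z below are invented, and the function name is just for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy finite action set in R^d and a full-support distribution P_t over it (all made up).
A = rng.normal(size=(5, 3))             # 5 actions in R^3, one per row
p = np.full(5, 0.2)                     # P_t: uniform over the 5 actions
z = rng.normal(size=3)                  # the adversary's hidden vector z_t

Sigma = (A.T * p) @ A                   # Sigma_t = E_{a ~ P_t}[a a^T]
Sigma_inv = np.linalg.inv(Sigma)        # assumed invertible here (otherwise, pseudo-inverse)

def z_tilde(a):
    """The estimator: only the scalar <a, z> is observed, and the estimate is Sigma_t^{-1} a (a^T z)."""
    return Sigma_inv @ a * (a @ z)

# Unbiasedness check: averaging the estimator over P_t recovers z exactly.
expectation = sum(p_i * z_tilde(a_i) for p_i, a_i in zip(p, A))
print(np.allclose(expectation, z))      # True
```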
Yes?
>>: But if Pt is independent of the cost Zt, it's not going to be very efficient. It's going to cost
you a lot to make this [indiscernible]
>> Sebastien Bubeck: In what sense? Are you thinking of the variance control?
>>: [indiscernible] exploring without feedback because [indiscernible] is independent of…
>> Sebastien Bubeck: Yes, but Pt, yes, I see what you mean, but Pt depends on Z1 up to Zt-1,
so again it's coming back to this idea of if there was something to learn then I am learning it.
It's not, yeah.
>>: It is not written that Pt depends on Z1 to Z [indiscernible]
>> Sebastien Bubeck: No, I didn't, and that's a key point. The way I'm going to
choose my action is going to depend on everything I observed in the past, so all the inner products.
In the next slide I will tell you how to choose this Pt. Right now I am just expressing that there
is a Pt. One other question: anyone who has looked at importance sampling knows that the issue
now is, okay, this is unbiased, very good, but what about the variance? If
this is just unbiased you have said nothing. So the variance, we can control it, and these two
inequalities give you everything. If Auer had known about them in 2002 he would have proved almost
everything that I'm going to tell you. Here is the variance control. I'm going to control the
variance in the norm induced by my own sampling. It's not clear why I would do that, but I'm
telling you this is a good quantity to look at. I look at Z tilde t when I apply the quadratic form
sigma t, so this is a variance term, and I want to control it in expectation. So what is
the expectation of Z tilde t transpose sigma t Z tilde t? Let's just look at what Z tilde t is: it's a real
number times a vector. This real number, because I'm in the geometric setting, I can upper
bound by one. I can remove it because I will have this guy squared and it is upper
bounded by one; so this upper bound is just removing this term, upper bounding it by one. Now
what do I have? Well, sigma t is symmetric, so the inverse is symmetric, and when I take this guy's
transpose I have at transpose times sigma t inverse. So I have at transpose sigma t inverse sigma t sigma t
inverse at. One of the sigma t inverses cancels and I get at transpose sigma t inverse at. I did almost nothing.
Now I just realize that this is the same thing as the trace of sigma t inverse times at at transpose: this
is a real number, so it is equal to its own trace, and now I just use that the trace
of AB is equal to the trace of BA, so I get this. Now the trace is linear, so the
expectation can come inside, and I have the expectation of at at transpose, which by definition is sigma t:
trace of sigma t inverse sigma t, that's the trace of the identity, I get d. And this is as tight as it gets,
right? The variance is controlled by d, just the dimension, and this is what allows us to get a
square root n regret bound instead of the n to the 2/3 of Awerbuch and Kleinberg. This estimate
I'm going to use in the entire talk, this Z tilde t.
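Written out, the chain of equalities and the single inequality just walked through is

\mathbb{E}\,\tilde{z}_t^{\top}\Sigma_t\tilde{z}_t
=\mathbb{E}\,(a_t^{\top}z_t)^{2}\, a_t^{\top}\Sigma_t^{-1}\Sigma_t\Sigma_t^{-1}a_t
\;\le\; \mathbb{E}\, a_t^{\top}\Sigma_t^{-1}a_t
=\mathbb{E}\,\mathrm{tr}\!\left(\Sigma_t^{-1}a_t a_t^{\top}\right)
=\mathrm{tr}\!\left(\Sigma_t^{-1}\Sigma_t\right)=d .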
Now let me keep going: I'm going to tell you how to choose Pt, because right now I still don't have a strategy. Here
is what we propose. Let's assume for now that A is finite. If A is finite I am going to define Exp2
which gives a certain probability distribution Pt that depends on the past. It goes like this. Let's
not care about gamma and mu for the moment. The probability of selecting action a is going to
be proportional to the exponential of minus some learning rate eta times the sum for s equals 1 to t-1 of
the inner product of Z tilde s and a. So Z tilde s inner product a, that's my estimate of the loss if I had played
action a at time step s. I can make this type of reasoning now: what would have been my loss if I had played
action a at time step s? This is my estimate of what my loss would have been if I had played action a from time 1
up to time t-1. My probability of playing action a is going to be smaller if this
is bigger. If I estimate that I would have had a big loss, then I don't want to play this action.
Makes total sense. So I play something proportional to this, and I will just tilt it a little bit.
People in the room who are used to bandits know that it's important
to do a certain amount of exploration. So what I'm going to do is, with probability 1 minus gamma
I play according to these exponential weights, and with probability gamma I play according to an
exploration distribution mu supported on the action set. Yes?
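A minimal sketch of the Exp2 scheme just described, in code; the exploration distribution mu is taken uniform here only for simplicity (not the John's exploration discussed next), the function name is invented, and eta and gamma are left untuned:

```python
import numpy as np

rng = np.random.default_rng(0)

def exp2(A, Z, eta, gamma, mu):
    """Sketch of Exp2. A: (k, d) array, one action per row (finite action set).
    Z: (n, d) array of the adversary's vectors z_t (hidden: only <a_t, z_t> is used as feedback).
    mu: exploration distribution over the rows of A."""
    k, d = A.shape
    cum_est = np.zeros(k)              # estimated cumulative loss of each action
    total_loss = 0.0
    for z in Z:
        w = np.exp(-eta * (cum_est - cum_est.min()))   # exponential weights (shifted for stability)
        p = (1 - gamma) * w / w.sum() + gamma * mu     # tilt with the exploration distribution
        i = rng.choice(k, p=p)
        a = A[i]
        obs = a @ z                                    # the only feedback received
        Sigma = (A.T * p) @ A                          # Sigma_t = E_{a ~ p_t}[a a^T]
        z_tilde = np.linalg.pinv(Sigma) @ a * obs      # unbiased estimate of z_t
        cum_est += A @ z_tilde                         # estimated loss of every action
        total_loss += obs
    return total_loss

# toy usage: 10 random unit actions in R^3 against a fixed, small adversary vector
A = rng.normal(size=(10, 3)); A /= np.linalg.norm(A, axis=1, keepdims=True)
Z = np.tile(rng.normal(size=3) * 0.1, (200, 1))
print(exp2(A, Z, eta=0.1, gamma=0.1, mu=np.full(10, 0.1)))
```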
>>: [indiscernible] related to Exp3?
>> Sebastien Bubeck: Yeah, of course. Yes. Exp3 is this when the action set is the simplex. Sorry,
so here, when A is the canonical basis. So if A is the canonical basis, this is Exp3. I don't want to get
into why I call it Exp2; there is a reason. [laughter] let's look at it that way. Okay. Yes?
>>: [indiscernible] amount of exploration because you assume [indiscernible] horizon?
>> Sebastien Bubeck: Exactly, yes absolutely. I assume that I know the horizon. Everything
could be done if I didn't know the horizon. I would put a gamma t and an eta t and the proof
would be more complicated but everything would work. This is not critical. So now I need to
tell you what this mu is. This mu is going to be critical. I want to go quickly over this. In Dani,
Hayes and Kakade they use for mu a barycentric spanner. What that means is you get a regret
bound which is d√(n log|A|). In 2009 Cesa-Bianchi and Lugosi used a mu which is
uniform over the action set. You can imagine that uniform on the action set is going to be very bad
in certain cases. Let me just do this picture. If my action set is a grid like this, so I have a grid,
that's my grid graph, and I want to go from this point to this one, and what I can select is
increasing paths like this. If I select uniformly at random a path to explore, then the probability
that I visit this edge is going to be exponentially small. It's going to be 1 over 2 to the m if it's
an n by m grid. Uniform exploration in a combinatorial setting, or in general, in any
action set, is a bad idea. You don't want to do uniform because of these things. But
still, if there is enough symmetry, in some sense you can hope that it will work, and they show
that for certain action sets it actually gets you √(dn log|A|). What we
propose is a distribution based on a result from convex geometry which is called John's
theorem, and that gets you √(dn log|A|) in general, and this is optimal: if you just
do a discretization you get d√n for the geometric setting. Let me just tell you,
remind you what John's theorem is. This is really very well known in convex geometry; I mean,
you open any book on it and it's on page 5 or something like that. You have a set of points in Rd
and I'm going to look at the ellipsoid of minimum volume that contains my set of points. Let's
call it E, and let's say that E is the unit ball of some scalar product; that's always the case.
Then John proved that there exist contact points, so there will be contact points
between my convex set, my set of points, and the ellipsoid, and there are M contact points such
that you can find a distribution on these points which gives you an approximate orthogonal
decomposition of the identity. I don't want to spend more time on this.
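For reference, the form of the statement being used is roughly the following (stated here from memory, so the exact normalization should be checked): if the minimum-volume ellipsoid containing the set of points is the unit ball, then there exist contact points u1, ..., uM of the set with the sphere and a probability distribution (c1, ..., cM) on them such that

\sum_{i=1}^{M} c_i\, u_i u_i^{\top} \;=\; \frac{1}{d}\, I_d ,

and it is this distribution over the contact points that is used as the exploration distribution mu.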
The key point is that now what we do is: we have these contact points between the ellipsoid
and the action set, and we have a distribution on them, and we will use this as our exploration
distribution. The issue is that this is
not computationally efficient. It's NP-hard to compute the John ellipsoid. You can do a square
root d approximation, but then you lose the optimal regret bound. Even sampling from
exponential weights on the combinatorial structure, that's a very difficult problem. You have
exponentially many actions and you want to sample from the exponential distribution. We
don't know how to do that. In certain action sets we can do it. For instance, if we look at
the Birkhoff polytope, matchings in a bipartite graph, then with the permanent approximation
algorithm of Jerrum, Sinclair and Vigoda we can actually simulate the exponential weights. But
this is a difficult result for a very specific action set. Certainly we have nothing general. There
are plenty of computational difficulties. My personal impression is that it's impossible to
overcome them. This is not the way to get something computationally efficient. What I and
other people propose is using ideas from convex optimization. Now I'm going to describe to
you, I will take 7 minutes to describe to you, an idea in convex optimization that should be
better known. Let's do this quickly. We have a function which is defined on Rd. So even if you
don't care about bandits, what I'm going to say is relevant. Let's say we want to optimize a
function on Rd which is a convex function. Let's say it's a closed function, because that's just a
technical assumption, and such that the norm of the gradients is bounded in L2. We want
to find the minimum of this function. Then Cauchy, in the 19th century, said okay, let's do
that: you are at a given point, you look at the gradient of the function, and you move in the
opposite direction. This makes total sense. We fix a learning rate eta, so the
iteration is Xt+1 equals Xt minus eta times the gradient of f at Xt. We want to optimize over a
ball, so if we get out of the ball, what we do is we project back onto the ball. That's projected
gradient descent; everybody should know it. Now, if I ask you, okay,
you have done all of this, you are at some point Xt, what do you believe to be the
minimizer? You shouldn't say Xt if f is just a convex function; you should say the average,
one over t times the sum for s equals 1 to t of the Xs. And what you can prove is that the
gap between the value of f there and the true minimum over the ball is going to be bounded by R, which is
the size of the ball, times L, which is the size of the gradients, divided by square root of t. So what I
want you to take out of this is that as t goes to infinity I will be able to optimize this function
with gradient descent. The key point, why we care about this, is that d does not appear. I
told you we were optimizing over Rd and I mentioned d, but there is no d here. For all we
know we could be in an arbitrary Hilbert space H. This is going to be relevant for machine
learning applications, so there will be a Banach space on the next slide. I prefer to warn you,
but it will be a real thing. In an arbitrary Hilbert space we can optimize a convex function over a
ball. Okay. That's what this is saying. Now that's good. What about a Banach space? What if
we want to optimize over a ball in a Banach space? Why would we do that? Imagine you want to
optimize over an L1 ball. In many machine learning applications you don't want to optimize over the
L2 ball; you want to optimize over the L1 ball. Then gradient descent does nothing for
you. If you are really in L1, L-infinity, you cannot do gradient descent: you would not
optimize the function, the rate would depend on the dimension. Gradient descent gives no
guarantees, and if you think about it, it does not even make sense. Nabla f of x, the gradient of f,
is an element of the dual space B star. It's not an element of B, so when you write
Xt minus eta gradient of f at Xt, it doesn't make sense: this is in B and this is in B star. This was realized
30 years ago, but some people forgot. So Nemirovski and Yudin said: we have this issue, so what
we are going to do is we will map Xt to the dual space. How do we map to the dual space? We
use the gradient of some function, so you use the gradient of some function phi. You map to
the dual and then you take a gradient step in the dual. You come back to the primal. That's
mirror descent, which I'm going to describe now a little bit more formally. Let's come back
to Rd, because this was just intuition. Now we want to optimize the function f in Rd. What
we are going to do is we will have a function phi which is defined on a superset D of X; so I
want to optimize over X my convex function, and I define phi on D, a superset of X, a real-
valued function. And what I'm going to do is the following. I'm at Xt; so what does gradient
descent do? I would take a gradient step. If I go out of X I would project in the Euclidean
distance. We saw that this will have a dependency on the dimension if, for instance we're
optimizing over the L1 ball. What we would like is something dimension independent as if we
were optimizing over an L2 ball. So what we are going to do is imagine that we are in
dimension infinity. We take Xt. We go to the dual space using the gradient of phi. Now we are
at gradient of phi of Xt, and here we take a gradient step, so we have gradient of phi of Xt minus
eta times the gradient of f at Xt. This is a new point in the dual, and we call Wt+1 the point whose gradient of
phi equals it. We come back to the primal. How do you come back to the primal? You just
apply the gradient of phi star, the Fenchel dual of phi; this is well known in convex optimization.
Now you are at Wt+1. Maybe you are outside of your constraint set, so you need to project, but
of course you don't want to project in the L2 metric; this would be meaningless. So what you
project with is the metric induced by your potential phi: you project with the Bregman
divergence associated to phi, and that gets you a point Xt+1. You do this iteratively, and you
can show this is going to optimize your function f, and it can be dimension independent on L1, for
instance. It can also give you fast rates of convergence for SDPs, first-order methods for SDPs.
This is something that people have been looking for for years, but it actually was invented
30 years ago. What you would take for SDPs is phi to be the entropy of P, but I won't go into
this. The key for us and for machine learning is that this algorithm is robust to noise in the
gradients. That's also why first-order methods are getting popular: you can do interior point
methods, but interior point methods are not robust to noise. This is robust to noise. So if
instead of the gradient here I have something noisy which is unbiased, then I can do the same thing.
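To make the template concrete, here is a small sketch of one classical instance of it, mirror descent with the negative-entropy mirror map on the probability simplex; it is only an illustration of the scheme, not the specific potentials used later in the talk:

```python
import numpy as np

def mirror_descent_simplex(grad, x0, eta, n_steps):
    """Mirror descent with the negative-entropy mirror map on the probability simplex:
    the dual step plus the Bregman (KL) projection reduce to a multiplicative update
    followed by an l1 normalization."""
    x = np.array(x0, dtype=float)
    avg = np.zeros_like(x)
    for t in range(1, n_steps + 1):
        g = grad(x)                   # (sub)gradient of f at x; a noisy unbiased version also works
        x = x * np.exp(-eta * g)      # step in the dual: grad(phi)(x) = 1 + log(x), shifted by -eta*g
        x /= x.sum()                  # Bregman projection back onto the simplex
        avg += (x - avg) / t          # running average, the point you would report
    return avg

# tiny usage example: minimize f(x) = <c, x> over the simplex (minimum at the smallest c_i)
c = np.array([0.3, 0.1, 0.6])
print(mirror_descent_simplex(lambda x: c, np.full(3, 1/3), eta=0.5, n_steps=500))
```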
Now I'm going to use this idea to do a bandit algorithm. Here it goes. So here
is the algorithm, which we call OSMD, online stochastic mirror descent. We have our action set
here, and the convex hull of the action set. And what we want is to play a distribution over A. We
will take an action at random from the distribution. A parameter of my
algorithm is going to be a mirror map, a function phi which is defined on a superset D of the
convex hull of A, and I will also have what I call a sampling scheme, which is a mapping from the
convex hull of A to the space of distributions, so that any point in the convex hull I can map to a distribution
such that this distribution in expectation gives me that point. Such a sampling scheme always exists.
And what I'm going to do is the following. Assume that at a certain time step I have this probability
distribution Pt. This Pt is associated to a point Xt in the convex hull of A through the sampling
scheme. Now I'm at Xt and I'm going to do mirror descent. Mirror descent, you can write it
compactly like this. You take a gradient step in the dual, so you take gradient of phi of Xt;
for the gradient step, you are optimizing a linear function, so your gradient is just Z tilde t, your
estimate of the loss vector, with a certain learning rate eta, and you come back to the
primal using the Fenchel dual; that gives you a point Wt+1. As I told you, if
you get out of the convex hull you project back; you project back using the Bregman divergence
associated to phi. Now you have a point Xt+1 in the convex hull, and you use your sampling scheme
to define the probability distribution Pt+1. That's the general template for our algorithm, and now
you can instantiate it with different mirror maps phi and different sampling schemes. That's the idea.
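In formulas, one round of OSMD as just described is (with D_Phi the Bregman divergence of the mirror map Phi):

\nabla\Phi(w_{t+1})=\nabla\Phi(x_t)-\eta\,\tilde{z}_t,\qquad
x_{t+1}=\operatorname*{argmin}_{x\in\mathrm{conv}(\mathcal{A})} D_{\Phi}(x,\,w_{t+1}),\qquad
\mathbb{E}_{a\sim P_{t+1}}[a]=x_{t+1}.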
Are there any questions on this? You just do mirror descent, and here you are trying to optimize linear
functions, and you don't have the real linear function, which would be Zt, but you have the
unbiased estimate Z tilde t, and you have a sampling scheme which allows you to move from a point in
the convex hull to a distribution over the action set. What we proved in 2011 is this regret bound:
the regret of this algorithm depends on the size of the set of actions measured
through the Bregman divergence, so that's the first term, the diameter of the set measured with the
Bregman divergence, plus a variance term, where the variance, instead of being in the norm induced by
sigma t, is in the norm induced by phi. Let me just take a moment to look at this variance term.
What we did before, with exponential weights, is that we said that we had control on the
variance when the variance was induced by the norm of sigma t. But here, instead of sigma t,
we have the Hessian of phi at Xt, and the inverse of that. What we want, if you think about it, is
an inequality like this: if the Hessian of phi at x is lower bounded by the inverse of the covariance
of the sampling scheme, then you can upper bound this thing by
sigma t, and you can upper bound the entire thing by d. Let me say that again. What we want
is to run online stochastic mirror descent with a mirror map phi and a sampling scheme which
are dual to each other in some sense, in the sense that the Hessian of phi should be related to
the covariance matrix of the sampling scheme at x. Of course, you can always do that,
because you can just blow up phi. You can make it, like, push it up; I mean, you can do it. The issue
is that you also have the diameter of the action set in the Bregman divergence, which should remain
small. We can do it in some cases. Let me show you there are cases where we can do it. In this
2012 paper, when you are doing linear optimization over the Euclidean ball with
bandit feedback, we recommend you take this phi. And the sampling scheme is very simple.
You have the ball, and let's say I'm at Xt, which is an element of my ball B2d. Let's do
it like that. What I'm going to do is the following. With probability the norm of Xt, what I will play
is actually Xt over its norm. You see, I go all the way up to the boundary; so if I'm away
from the boundary, it makes sense. Remember we are in a linear setting, so the further away I
am from 0 the more I reduce the variance, if you want, if you think in terms of i.i.d.
processes. So what you want is to move away from 0. With probability the norm of Xt, which is
less than 1, I go all the way up to the boundary. But then I have some probability mass left to do
exploration. In a sense, my algorithm tells me: if it plays very close to the boundary, it means that
the algorithm is certain that it's going to play well. If it's a little bit away from the boundary, it
means that there is some uncertainty, and I'm going to use exactly this amount of uncertainty to
do exploration. So with the rest of the mass, which is one minus the norm of Xt, I'm going to play
uniformly at random a canonical basis vector with a random sign. It's sort of doing exploration,
but it's adaptively choosing the rate at which it should do exploration, which I
think is a good idea.
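A small sketch of this sampling scheme, under the assumption that the exploration mass is spread over signed canonical basis vectors (so that the expectation of the played point is exactly Xt); the function name and the test point are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_from_ball_scheme(x):
    """Sampling scheme for the Euclidean ball: with probability ||x|| play x/||x||
    (go all the way to the boundary); with the remaining mass 1 - ||x|| play a
    signed canonical basis vector chosen uniformly at random."""
    d = len(x)
    r = np.linalg.norm(x)
    if rng.random() < r:
        return x / r
    e = np.zeros(d)
    e[rng.integers(d)] = rng.choice([-1.0, 1.0])
    return e

# sanity check: the empirical mean of many samples should be close to x
x = np.array([0.3, -0.2, 0.1])
samples = np.array([sample_from_ball_scheme(x) for _ in range(20000)])
print(samples.mean(axis=0))   # approximately x
```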
>>: [indiscernible] conditional on Xt [indiscernible]? Could there be a benefit in playing, instead
of uniform, something conditional on being away from Xt?
>> Sebastien Bubeck: No. I don't think so. Because the reason is, oh, maybe you could. This is
only one direction.
>>: [indiscernible] your distributions seem like this should be proportional to distance from the
point you are playing. You want to be away from [indiscernible]
>> Sebastien Bubeck: Yes, but the issue is remember there is adversary who could try to trick
you into doing that. You have to be careful with that. I mean, when basically the idea is when
you decide to do the exploration, you should really mean it, like do it, do exploration. Don't try
to still use some exploitation.
>>: You can still take the [indiscernible] of Xt [indiscernible]
>> Sebastien Bubeck: Exactly. That's what I wanted to say. Exactly. You could do this,
yes. Instead of doing it on the canonical basis you could take the hyperplane here and do
uniform exploration on this hyperplane. So the nice thing is that this has computational complexity linear in d,
and it gets you square root of dn, so even better than before. And this is optimal; it was
an open problem to get this right. Now, on the hypercube we propose this function phi. It's
weird looking, but I will come back to it in five minutes. It is a sum of integrals of the
inverse hyperbolic tangent; people usually prefer this form, which is equal to it, which is the
Boolean entropy: if I am at a point x, I have the sum over i of (1 + xi) log(1 + xi) + (1 - xi) log(1 - xi). But
actually the proof goes by looking at it the first way; that is the more natural way to look at it for
me. The exploration distribution, this picture makes it simple: I have my cube and I am at a certain point
here, and what I'm going to do is I'm just going to play like this. I'm going to decompose, to
explode my point, and instead of playing this, I will play this one, this one, this one or this one,
but I will explode it like this. The idea here is to do a
sampling which uses the geometry of the set, which really uses all of the space that there is
in my set. Let me just say, about the paper by Abernethy, Hazan and Rakhlin: what they did was
a computationally efficient strategy where you have an action set like this, and what they do is
they look at an ellipsoid which is contained in the set, and they do exploration uniformly on the
axes given by the ellipsoid. You see, the issue is that when you do that you are
restricting yourself, you are confining yourself to exploration in a very, very small set. You
are adapting to the geometry, but this is not good. What you want is really to explode it and
explore as far as possible, to reduce the variance, and to explore in as many directions as
possible. So this algorithm has computational complexity [indiscernible] in d and [indiscernible]. I
showed you two examples in the geometric setting, Euclidean ball and the hypercube and they
are the only two that I know where we can get optimal regret bound with poly-time
computational complexity. Yes?
>>: I have a question about the situation where [indiscernible]. So you are saying that you
want to go as far away as possible to minimize variance? I buy it completely, but if I want to
look at the next problem of moving from linear functions to other functions, like convex
[indiscernible], that would be the worst thing I could do, correct?
>> Sebastien Bubeck: Yes. You can do that. But linear is much easier than convex.
>>: So local ones have a chance of really extending to convex, but to explode has no chance.
>> Sebastien Bubeck: Exactly. I guess if you believe that your linear model makes sense
you should exploit it as much as possible. Your linear model allows you, instead of doing this,
to go as far as possible and still, in expectation, be the same, so you should use that opportunity
which is given to you by the model. But I agree, if there was a convex function you couldn't do
that. All right. Let me very quickly go over the combinatorial setting. So this is history. Many,
many people looked at it, with various results, most of them suboptimal, and the first optimal one
was in 2010 by Koolen, Warmuth and Kivinen. They obtained optimal bounds, and then, one year
later, we did the semi-bandit feedback, and we also added some new results for the full
information. So what Koolen, Warmuth and Kivinen did is that they looked at mirror descent
with the negative entropy for the mirror map. They proved a 2d square root n regret bound. The
open problem was whether or not you can achieve this with exponential weights. You can't.
What we showed is that there exists an action set, a complicated action set, subset of the
hypercube, such that exponential weights, no matter which learning rate eta you give to it, will
have a regret which is lower bounded by d to the 3/2 square root n, whereas mirror descent
gets you d square root n. This is with full information. Note that this is completely different
from the geometric setting. I don't know who asked the question about combinatorial versus
geometric, but in the geometric setting I just showed you that exponential weights with the right tricks
get you the optimal regret bound. And here we can provably show that in the combinatorial
setting you cannot get the optimal bound. The two settings are really different. Even experts,
some experts, didn't see the difference between the combinatorial and the geometric setting.
It's not easy to see the difference at first, but this shows you the difference: you
cannot get it, exponential weights will not work in the combinatorial setting. The reason is really
because in the geometric setting you have a natural normalization which allows you to
basically bypass the geometry of the problem. Whereas in the combinatorial setting, you really
have to think about the geometry of the action set. You cannot forget it. In semi-bandit
feedback, life is really much easier. Everything works. It's very easy, because what you observe
is the first coordinate of at times the first coordinate of Zt, and so on: if coordinate i of at was active,
equal to 1, then you observe Zt on this coordinate. So you can do this type of coordinate-wise
estimate. It is unbiased, and you have control on the variance in terms of this.
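The coordinate-wise importance-weighted estimate alluded to here is, in its usual form,

\tilde{z}_t(i)\;=\;\frac{a_t(i)}{\mathbb{E}_{a\sim P_t}[a(i)]}\; z_t(i),\qquad \mathbb{E}_{a_t\sim P_t}\,\tilde{z}_t(i)=z_t(i),

though the slides may normalize it slightly differently.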
I just want to show you this slide. Now, this slide, which is important, is from a 2013 paper. What we suggest is a general type of
potential function phi, which derives from the INF-type mirror maps: given a function psi from
the real line to the real line, phi is this, a sum of integrals of the inverse of psi. Why is this at all
interesting? Because we can show this type of regret bound, where the variance is
controlled by this lambda t, which is one over the derivative of the inverse of
psi at Xt, and you get this type of control. Maybe it's not clear why this is interesting, but
maybe here it comes now. What we can show is that with psi equal to the exponential function, this
gives you that phi is the negative entropy; you can just look at this formula, it will give you
the negative entropy. This gives you another point of view on what the entropy is: the entropy is this
phi with psi equal to the exponential function. And then we can prove this type of regret bound, where
you have the maximal L1 norm within your action set appearing, and you prove this. This is
suboptimal because of the log factor, and if you take psi to be something polynomial, like 1 over x
squared, you can get this nicer bound. So this allows you, I mean, okay, the nice thing is, with
this point of view you get a proof of the optimal regret for adversarial semi-bandits which is very
short, whereas the first proof that we had was 10 pages, because this is really the
right point of view. Now, the open problem in the bandit feedback case: with John's
exploration we get d square root n in the geometric setting, and this gives you d squared
square root n in the combinatorial setting, because there is a factor d difference in the scaling.
We proved this lower bound, d to the 3/2 square root n, so there is a gap of square root d, and I
conjecture that this gap can be removed, and actually can be attained efficiently for reasonable
action sets. The best procedure right now is by Abernethy, Hazan and Rakhlin, and it gets you d to
the 5/2 square root n. I will spend 2 minutes on the new idea. This has not been explored
yet; it's just pushing to the end this idea of using all the space that you have in your
action set. For theta in Rd, I define p theta to be the following distribution: it's a distribution with
density with respect to the Lebesgue measure in Rd, and
p theta of x is proportional to the exponential of the inner product between theta and x,
times the indicator that x is in my action set. This is a sort of canonical exponential family. Now
you can show that for any point a in your action set there always exists a parameter, a natural
parameter theta of a, such that a is going to be the mean of the distribution p sub theta of a. So there is a
duality going on. It's an exponential family. So you have your action set and you have Rd, and
Rd is basically the dual of your action set. And I call this mapping, from an action a to the
distribution with natural parameter theta of a, the canonical sampling scheme. For any action it
gives you a sampling scheme of this exponential form such that in expectation you get a.
And the idea, and it really makes sense, is to use online stochastic mirror descent with the
mirror map phi given by the entropy of the canonical sampling scheme. This algorithm can be
implemented efficiently as soon as you can sample from log-concave
distributions, and you can do that with the technique of Lovasz and Vempala. And the
key idea, coming back to this idea of having the sampling scheme related to the Hessian of phi, is
that if phi is the entropy of the canonical sampling scheme, then because of the duality with
exponential families, the Fenchel dual of phi is the log-partition function of your exponential
family. You can show that the inverse of the Hessian of phi, that is, the Hessian of the Fenchel dual, is
approximately the same thing as the covariance matrix of your canonical sampling scheme, so
you get, with an inequality, the relation that I wanted.
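Spelled out, and stated from memory so signs and constants should be checked against the paper, the objects involved are

p_{\theta}(x)\;\propto\; e^{\langle\theta,x\rangle}\,\mathbf{1}\{x\in\mathcal{A}\},\qquad
\forall a\ \exists\,\theta(a):\ \mathbb{E}_{X\sim p_{\theta(a)}}[X]=a,

\Phi(a)=-H\!\left(p_{\theta(a)}\right),\qquad
\Phi^{*}(\theta)=\log\!\int e^{\langle\theta,x\rangle}\,\mathbf{1}\{x\in\mathcal{A}\}\,dx,\qquad
\nabla^{2}\Phi^{*}(\theta)=\mathrm{Cov}_{X\sim p_{\theta}}(X).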
I'm hopeful that this will work, but it is not known yet. And that's it, and some references. I will stop here. Thank you.
[applause]
>> Ofer Dekel: Any more questions?
>>: So have you tested this on [indiscernible] problem [indiscernible]
>> Sebastien Bubeck: No I haven't, but I'm hopeful somebody will. Let me tell you what works
in this, what I think will work. What will work is the semi-bandit setting. I didn't spend enough
time on it, but this is really what would work. And I think one big opportunity is to identify
which problems can be reduced to combinatorial semi-bandits. I described one paper from
NIPS 2013 where they reduce episodic loop-free MDPs to combinatorial semi-bandits. I
don't remember if they ran experiments in this paper, but I think the algorithm described
here will work, will do something. Exponential weights with [indiscernible], that would not do
anything. It's just theoretically interesting.
>>: As far as I remember the [indiscernible] algorithm of something from [indiscernible]
distribution. Also this is a thing that [indiscernible] mentioned, where it is prohibitive. Do you
have any idea of something that, even if it doesn't have the full bulletproof guarantee, would
nevertheless work in practice?
>> Sebastien Bubeck: Do something? Yes. But then, so I think what would work best if you
really have linear bandit feedback is to do something like linear UCB. It's a UCB where basically
you are in a linear setting, so you have a confidence ellipsoid around your true
parameter, and then you try to maximize the inner product within this ellipsoid, and so you
have robust optimization techniques, related to mirror descent actually, where you can use
robust, I mean, there are books about this. You can use robust optimization techniques to
efficiently approximate linear UCB, and that will do something in practice, but it won't have the
nice theoretical guarantees. I think one interesting direction which would be in between is
coming back to an earlier question is also to be in between something that assumes stochastic
and something that is adversarial. What you want is if the world is stochastic and you do as
well as if it was ID but you still want some security about what's going on if there is an
adversary trying to screw with you. I think one potential direction is to do linear UCB with a
robust optimization technique to actually do it in practice on the computer and at the same
time have a few checks, a few security checks to verify that you are indeed under the IED
setting and then if you observe that something does not behave as you wish it would behave
then you can either stop or you can do some of the algorithms that I described here. That's
something that would do something in practice.
>> Ofer Dekel: Any more questions? Let's thank the speaker. [applause].