
>> Yuval Peres: Good afternoon. A lot of us study various objects on graphs [indiscernible], with
particles, random walkers, spins. But this time we’ll hear about bandits on graphs from the
expert, Nicolo Cesa-Bianchi. Please.
>> Nicolo Cesa-Bianchi: Thank you, I’m mostly going to talk about experts as well.
[laughter]
This is some work I’ve done recently. I’ll start from scratch, so you will not need a lot of
background knowledge to understand this. We will be talking about sequences of decisions, and we
will only deal with the nonstochastic, or adversarial, model. This is the traditional prediction with
expert advice or nonstochastic bandit model.
The model is very easy to specify; it’s very bare bones. You have just K actions, and each one of
these actions will give you a loss when you select it. You can’t avoid it: at every time step in your
sequence of decisions you have to decide which action to pick, and you will incur the loss
associated with your decision. We’ll assume that everything is nonstochastic, so there is some
assignment of losses to actions over time, some numbers ℓ_{i,t}, and we’ll assume for simplicity
that these numbers are in the unit interval. This is the loss for playing action i at time t.
Okay, the idea is that some unknown but deterministic process has laid down this sequence of losses,
for each time step and for each action. The action index i ranges from one to K, and t is the time,
which is discrete. You have no previous knowledge about this, so you don’t have any prior about this
matrix of losses assigned to the actions.
Basically the goal of the player playing this game is, at every time step t, to pick some action I_t
in the set of actions, and the player will incur the loss associated with that action, okay. The
player will typically use randomization to perform this choice. At time t the choice of the index I_t
of the action will be based on some previous observations, which refer to losses of actions that have
been observed.
Okay, so to make this precise we get to the so-called observation models. There are two very well
known, well studied observation models, which are the experts and the bandit model. At every time
step I pick I_t and observe something, and I have two options here. In the experts model I observe
the entire vector ℓ_{1,t}, …, ℓ_{K,t} of losses associated to all the actions at that time step.
Okay, so I pick action I_t and incur the associated loss, but I get to see the losses of all actions.
This is called the experts model. Think of investing in a number of assets: you can see the
performance of all the assets, not only the assets that you have invested money in.
Okay, in the bandit model I only get to observe the actual loss that I incur. So, depending on the
observation model: if it’s an experts game, whenever I play I_t I know everything about the past, all
past losses of all actions. If I play a bandit game, whenever I play I_t I will only know the past
losses that I incurred, and I don’t know anything about the losses of actions that I didn’t pick in
the past.
Okay, in this kind of nonstochastic setting, where everything is deterministic apart from the
possible internal randomization of the player, the measure of interest is regret. This is defined as
the sum over a certain number of time steps, let’s say capital T, of the losses incurred by the
player, minus the loss of the best action over the same time steps. Since we are assuming that the
player might be randomized we will put an expectation over here.
Okay, so, including the possible randomization of the player, we are interested in minimizing the
difference between the cumulative loss incurred by the player, which selects a certain sequence of
actions, and the smallest possible loss that a player could have incurred by playing consistently
the same action. Okay, is that clear?
Actions have a bounded loss, so clearly this quantity can grow at most linearly with time, because
actions may have a constant loss at every step. Anything interesting can be said only when I can
control this difference so that it grows sublinearly with time. Let’s call this quantity R_T, for
regret. We know precisely, up to constant factors, what happens in the experts and the bandit model.
What are the best possible regrets against any possible assignment of losses to actions over time?
We know that in the experts case the regret will grow like the square root of T times the log of the
number of experts, call it K. The best possible regret in the bandit case will also grow at the same
rate with respect to time, but it will have a worse dependence on the number of actions. This is just
because I’m observing one K-th of the information I observe in the experts model.
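[Editor’s note: restating the spoken definitions in symbols, with constants omitted:

R_T \;=\; \mathbb{E}\Big[\sum_{t=1}^{T} \ell_{I_t,t}\Big] \;-\; \min_{i=1,\dots,K} \sum_{t=1}^{T} \ell_{i,t},
\qquad
R_T = \Theta\big(\sqrt{T \ln K}\big)\ \text{(experts)},
\qquad
R_T = \Theta\big(\sqrt{T K}\big)\ \text{(bandits)}. ]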
Okay, so this is a very nice and clear picture; we know that these rates are tight up to constants.
Now I want to show you a model which somehow interpolates between these two extremes. I would like
to do so by introducing a generic algorithm that is able to achieve both of these regret rates up to
logarithmic factors.
I’m willing to pay a little bit more here: let me weaken the experts bound by including an additional
logarithmic factor. This is still good up to logarithmic factors, just a little bit worse. But now we
have a single algorithm that, with a slight modification, is able to achieve these two regrets
according to whether it is run in the experts or in the bandit observation model. This algorithm is a
variant of Exp3. It’s pretty easy to explain: we just have to specify the probability. It is going to
be a randomized player, so we have to specify the probability of picking an action at time t given
the past observations. Okay, so we denote by this the sigma-algebra generated by the past
observations.
It’s something trivial in the experts model, because the past observations are just all the past
vectors of losses. In the bandit case what I observe really depends on the outcome, on the
realization of my random selections here. Okay, so we denote this probability by p_{i,t}. It is going
to be proportional to e to the minus eta times L hat_{i,t-1}. This L hat here is an estimate of the
past cumulative loss of action i. So I’m going to pick action i with a probability which is
exponentially small in an estimate of the loss that this action suffered in all past steps. We get an
overwhelming probability of picking the best action according to our loss estimates, but we also give
some non-vanishing probability to picking an action that didn’t perform the best in the past.
Okay, so now what is this? L hat_{i,t-1} is simply the sum over s from one to t minus one of
ℓ hat_{i,s}. These are instantaneous estimates of losses, defined as ℓ hat_{i,s} equals ℓ_{i,s}
divided by q_{i,s}, times the indicator function of the event that ℓ_{i,s} is observed. If, according
to the observation model, at time s I do observe the loss of action i, then I estimate the loss of
the action using this ratio. And what is q_{i,s}? q_{i,s} is simply the probability of observing the
loss of that action: the probability that ℓ_{i,s} is observed given the past.
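[Editor’s note: a minimal runnable sketch of the exponentially weighted player just described, with
importance-weighted loss estimates and a generic observation set per action; the function and
variable names are the editor’s, not from the talk.]

```python
import numpy as np

def exp_weights_with_feedback(losses, observe, eta, rng=None):
    """Sketch of the exponentially weighted player described in the talk.

    losses  : (T, K) array of losses in [0, 1], fixed in advance (oblivious adversary)
    observe : dict mapping action j to the set of actions whose losses are revealed
              when j is played (j itself included)
    eta     : learning rate
    """
    rng = np.random.default_rng() if rng is None else rng
    T, K = losses.shape
    L_hat = np.zeros(K)                          # cumulative loss estimates, L hat_{i,t-1}
    total_loss = 0.0
    for t in range(T):
        # p_{i,t} proportional to exp(-eta * L hat_{i,t-1})
        w = np.exp(-eta * (L_hat - L_hat.min()))  # shift for numerical stability
        p = w / w.sum()
        i_t = rng.choice(K, p=p)
        total_loss += losses[t, i_t]
        for i in observe[i_t]:                   # losses revealed this round
            # q_{i,t} = P(loss of i is observed | past) = sum of p_j over j that reveal i
            q_i = sum(p[j] for j in range(K) if i in observe[j])
            L_hat[i] += losses[t, i] / q_i       # importance-weighted, unbiased estimate
    return total_loss, L_hat
```

With observe[j] = set(range(K)) for every j this is the experts setting; with observe[j] = {j} it is
the plain bandit; graph feedback corresponds to observe[j] being j plus its neighborhood.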
>>: I have a question.
>> Nicolo Cesa-Bianchi: Yes.
>>: Are the ℓ_{i,t}, so for fixed i, are the ℓ_{i,t} completely unrelated across different values of t?
>> Nicolo Cesa-Bianchi: Yes, they can be completely unrelated. The idea here is that if you give me a
completely random assignment of losses, then no matter what action I play it won’t make any
difference, because everything is random, and the regret is going to be really small.
>>: But it seemed to me like in the bandit model, if you only know what you played, it seems like you
can’t. Maybe I’m missing something. It seems like you can’t get any information…
>> Nicolo Cesa-Bianchi: Yeah, right, right, it’s quite surprising that you actually can. I will give
you a little bit of an explanation here.
>>: Like what would your strategy be if, for example, all of the losses are one except for one hidden
one that’s zero?
>> Nicolo Cesa-Bianchi: Okay.
>>: [indiscernible] every round all of them are one except for one which is zero and…
>> Nicolo Cesa-Bianchi: Yes, yes.
>>: I just, I guess I couldn’t imagine, so the…
>> Nicolo Cesa-Bianchi: Yes, okay, if…
>>: No, no, no the optimum you only compare, you compare to your fixed action on the…
>> Nicolo Cesa-Bianchi: [indiscernible]
>>: Okay, okay, yeah…
>> Nicolo Cesa-Bianchi: You know, if there’s one non-random, one consistent trend then it’s easy
somehow, because I will…
>>: You’re only comparing somehow to the term you’re trying to…
>> Nicolo Cesa-Bianchi: I’m comparing, I would compare with a fixed column of this matrix.
>>: Sorry, yeah…
>> Nicolo Cesa-Bianchi: Right, so if there’s just one action which consistently has zero loss then
sooner or later it will be identified. If it’s random it doesn’t matter. If there is some structure I
should be able to pick it up, even though I don’t observe everything, even in the bandit model. Okay,
yes, thanks for asking this question. This is the probability, right.
>>: Another question.
>> Nicolo Cesa-Bianchi: Yes.
>>: Nicolo, what is the difference between the q_{i,s} and the p_{i,t}?
>> Nicolo Cesa-Bianchi: Okay, I’ll tell you in a moment; this is the next thing. Okay, so first of
all, let’s see.
>>: Is there a…
>> Nicolo Cesa-Bianchi: Okay, so q_{i,s}, for instance we can say q_{i,s} is going to be…
>>: It’s also current it’s also observed…
>> Nicolo Cesa-Bianchi: One…
>>: Oh, I see, so in the bandit problem it could be the same one, and otherwise it would be all one…
>> Nicolo Cesa-Bianchi: Sorry, okay, so what is the probability of observing the loss of a certain
action? It’s going to be one in the experts case, because I observe everything by definition. It’s
going to be the same as the probability of picking that action in the bandit model, because there I
only see what I pick, okay. And now you can see that this definition clearly gives you, in
expectation for any fixed i and s, conditioned on the past up to s minus one, exactly the correct
loss. Because I’m putting here this indicator function, and when I take the expectation for a fixed i
this becomes the probability of observing the loss, which is exactly q_{i,s}, and it cancels. I get
an unbiased estimate of the loss.
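[Editor’s note: the unbiasedness being described, written out:

\mathbb{E}\big[\hat\ell_{i,s} \,\big|\, \text{past}\big]
= \frac{\ell_{i,s}}{q_{i,s}} \cdot \Pr\big(\ell_{i,s}\ \text{observed} \,\big|\, \text{past}\big)
= \ell_{i,s}. ]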
Alright, so now you see, these are two specific observation models, right: I observe everything, or I
observe just what I pick. In general you might be willing to run this algorithm with different
observation models. For instance you might get observation models from graph information associated
with the actions; I’ll come to that in a minute. But let me just say a few words about the analysis
of this algorithm.
Okay, so how do I go about proving something like that for this algorithm? The proof is actually
quite short, but I don’t want to spend most of my time on it, so I’ll just sketch the analysis. The
key to the analysis is to look first of all at these weights here, which are the weights assigned to
the actions. These weights, when I normalize them, give me the probabilities with which I pick
actions, okay.
Then I look at the sum of the weights at a certain time step, and at the ratio between consecutive
normalization factors of these weights. I look at the evolution over time of this quantity. This is
again a potential function that allows me to analyze the evolution and effectively gives me a way of
controlling this notion of regret.
I will give you just a little hint about the analysis. First one proves something deterministically,
and by deterministically I mean for any realization of the random choices of the algorithm as given
by these probabilities: the algorithm is playing according to these probabilities and it will have a
certain realization of actions selected up to time T, okay.
Now I want to tell you something about this sequence of actions. It’s not really hard to prove
something like this: the sum over time and over actions of p_{i,t} times ℓ hat_{i,t} is smaller than
or equal to the cumulative estimated loss of any fixed action, plus some remaining terms. I will be
interested in the index of the best action for the horizon I’m looking at, so let’s call it j: j will
be the index of the action achieving this minimum over here, okay.
This is basically a very simple algebraic manipulation of this quantity here, summed over time. I’m
just using a very easy second-order Taylor expansion in order to linearize the exponential function
over there, and very little else; it’s basically an algebraic manipulation. I get, okay, to a basic
inequality very easily, starting just from the analysis of this quantity here. This holds
deterministically for any sequence of actions by the player, and it’s at the basis of all analyses of
these exponentially weighted algorithms.
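[Editor’s note: the deterministic inequality being referred to is, in the editor’s reconstruction
from the spoken description, the standard exponential-weights bound: for any fixed action j,

\sum_{t=1}^{T}\sum_{i=1}^{K} p_{i,t}\,\hat\ell_{i,t}
\;\le\; \sum_{t=1}^{T} \hat\ell_{j,t}
\;+\; \frac{\ln K}{\eta}
\;+\; \frac{\eta}{2}\sum_{t=1}^{T}\sum_{i=1}^{K} p_{i,t}\,\hat\ell_{i,t}^{\,2}. ]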
Now what I can do is take the expectation with respect to the distribution of these random variables.
I know the distribution, and I already know that these are unbiased estimates of the losses. I can
also very easily see that the second moments of these estimates are easily controlled by one over
q_{i,s}. Can you still read over there, okay?
This follows basically just from the boundedness of the losses and from the definition of the loss
estimates; sorry, this is going to be an inequality, and I can prove it. Okay, now if I take the
expectation here, using the unbiasedness and that inequality, what I get is the following.
Okay, so I still have an expectation here, because these are random variables. The hats on the
p_{i,t} ℓ_{i,t} terms go away because the estimates are unbiased, and the hats on the cumulative loss
of action j also go away, just by linearity; this index is j and this is, sorry, capital T. This is a
constant, log K over eta, and then I have something here. I made a mistake here, I had forgotten a
p_{i,t}: what I get here is p_{i,t} divided by q_{i,t}. And this is also a random quantity, because
in general these are random functions of the past observations.
Okay, now good. This here is just the cumulative performance of the player: the expected loss of the
player playing according to this probability distribution, summed over time, so the cumulative loss
of the player. And this here, we can adjust j to be the best action in the time horizon: I take the
minimum over j of the cumulative loss, I just pick the best one. Then I have log K over eta, and then
it’s very easy to see what happens here.
In the experts model q is one; here I have a sum of probabilities, which is one, summed over time, so
this quantity here becomes of order eta times T in the experts case, where T is the horizon I’m
summing up to. In the bandit case q is p, p over p is one, so I get of order eta times T times K,
because I am summing a one K times, which is the number of actions. Okay, now by picking eta in order
to trade off these two terms, in the experts case and in the bandit case, I exactly get these two
bounds over here, okay, good.
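[Editor’s note: in the editor’s reconstruction, after taking expectations the bound reads

R_T \;\le\; \frac{\ln K}{\eta} \;+\; \frac{\eta}{2}\,\mathbb{E}\Big[\sum_{t=1}^{T}\sum_{i=1}^{K} \frac{p_{i,t}}{q_{i,t}}\Big],

so with q \equiv 1 (experts) the inner sum is T and \eta = \sqrt{2\ln K / T} gives
\sqrt{2 T \ln K}, while with q_{i,t} = p_{i,t} (bandits) the inner sum is KT and
\eta = \sqrt{2\ln K / (KT)} gives \sqrt{2 K T \ln K}. ]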
Now this is the first part of the story. Again, I would like to play a little bit with this
observation model. Suppose now that the actions have similarities: there is some graph over my K
actions, and the edges of the graph indicate similarities between actions. Maybe the actions are ads
that I display on some web page. Whenever I display an ad I get some information about the revenue
that that ad got, or its click-through probability. But I will also know that similar ads would have
gotten a similar loss, or a similar gain, okay.
Now I can assume that whenever I play a certain action, suppose now that I play this action over
here, okay, this is I_t, I don’t see everything, but I get to see a little bit more than what I
actually played: I also see the losses of the actions that are in the neighborhood of the action I
picked, okay.
>>: In your example, in your other [indiscernible], I get maybe my loss exactly, but Jason’s loss
only in some, with some…
>> Nicolo Cesa-Bianchi: I could have a noisy signal there, yes, that’s correct. But this is a
statistics-free talk, so I won’t have any randomization in the model, okay. But it’s definitely true,
indeed we have examples of that. That’s just, you know, the philosophy of this talk.
Okay, so of course this generalizes both models. In the experts case we have a clique: whatever I
pick lets me see everything, so the neighborhood of every node is the entire graph. In the bandit
case I have an empty graph: I don’t have any edges, so whenever I pick something I only get to see
that. In general I can have anything in between. Now the question is how the regret should scale; I
expect to see some scaling here that interpolates between the two. Okay, so what kind of scaling
should we expect here, yes?
>>: [indiscernible] is the graph fixed throughout the process…
>> Nicolo Cesa-Bianchi: Not necessarily, not necessarily, I will comment on that later. For the sake
of concreteness we can assume for now that the graph is fixed, and maybe unknown; you don’t know it
at the beginning, okay.
How would you go about playing this game? You just don’t change the algorithm; you use the same
algorithm. What is q_{i,t} now? q_{i,t} is the probability of observing ℓ_{i,t} given the past. This
is going to be equal to the probability of picking that particular action, plus the sum over all j in
the neighborhood of that action of the probability of picking j. I will observe the loss of this
action if either I pick it or I pick any action in its neighborhood.
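[Editor’s note: in symbols, with N(i) denoting the neighborhood of action i in the feedback graph,

q_{i,t} \;=\; p_{i,t} \;+\; \sum_{j \in N(i)} p_{j,t}. ]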
Okay, so excellent, we are basically done, in the sense that all that’s left to do is to study this
quantity here, because this quantity will determine the final regret. Okay, so how does this quantity
behave? We had two easy cases: in the experts and bandit cases there was no work to do, you just put
one and sum, or you put p and simplify; nothing to do.
But in general, if you have a general observation model, then you’ll have a little work to do. Let’s
see how it looks. Let’s look at one of these terms, for a specific t, so I can drop the time index. I
have the sum over i of p_i divided by p_i plus the sum over j in the neighborhood of i of p_j, where
p_1, …, p_K is a probability assignment over the vertices of the graph.
Okay, so now how big can this be? One way to look at this is to take the counting measure, just for
the sake of clarity. If I put the uniform measure there, this is one over K in each term and I just
cancel the K throughout. I get just the sum of one over one plus the size of the neighborhood of each
vertex.
Okay, so give me any graph, an undirected graph: what is the sum of the reciprocals of the degrees
plus one? This is a well known result, and it tells you that this sum is upper bounded by the
independence number of the graph: the size of the largest subset of the vertices of the graph such
that no two vertices in this subset are joined by an edge, okay.
>>: This corresponds to a particular algorithm where you label…
>> Nicolo Cesa-Bianchi: Yeah, so how do you prove these things? This is actually easy and fun to
prove. Okay, give me any graph and let’s prove this upper bound over here. Let’s call Q_0 this
quantity here, the sum over i of one over one plus the size of N(i), okay. Now let i_0 be the vertex
which has the smallest neighborhood.
Okay, now I’m splitting the sum Q_0. Let me do it like this: I want to consider the vertex i_0 and I
want to cut a hole in the graph. I want to take out i_0 and the neighborhood of i_0. I cut this out
from the graph: i_0, its neighborhood, and all the dangling edges, okay.
Now the sum is split in two parts. What is left I call Q_1: Q_1 is what is left of Q_0 in the sum
when I take away i_0 and all the vertices in its neighborhood. Plus what I took away, which is, let
me get this right, the sum over j in i_0 union the neighborhood of i_0, of one over one plus the size
of N(j). Okay, so, can you see if I write down here?
>>: No.
>> Nicolo Cesa-Bianchi: No, okay, so this is a forbidden area, it is a no-fly zone; maybe I’ll write
down here, okay. What happens now? Let’s look at this second quantity here, sorry, I wasn’t planning
this. This quantity here: i_0 is the vertex with the smallest neighborhood, and it appears in the
sum. I can replace every term of the sum with the corresponding term for i_0, because it has the
smallest denominator, so it’s the largest term. I have the sum over j in i_0 union the neighborhood
of i_0, and then I have one divided by one plus the size of the neighborhood of i_0.
Okay, so now you see that I have a constant summand here. The sum is just equal to one, because I am
summing exactly size-of-the-neighborhood-of-i_0 plus one terms, each equal to one over one plus the
size of the neighborhood of i_0. Okay, so now I know that Q_0 is at most Q_1 plus one.
And now I recurse on the remaining graph. I take again the vertex with the smallest degree in the
remaining graph, the graph with the hole, I take it out, and again I can write that this is at most
what is left plus two. Okay, how many times can I repeat this process of taking out a vertex and all
its neighbors? At most independence-number-of-the-graph many times, because each vertex I take out is
not adjacent to any of the previously taken-out vertices, so the taken-out vertices form an
independent set, okay. Can you see this?
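[Editor’s note: a small sketch of the greedy argument just given, for the counting measure; the
brute-force independence_number helper is the editor’s addition for the comparison, not part of the
talk.]

```python
import itertools

def independence_number(K, adj):
    """Brute-force independence number (fine for small K)."""
    return max(r for r in range(1, K + 1)
               for S in itertools.combinations(range(K), r)
               if all(v not in adj[u] for u, v in itertools.combinations(S, 2)))

def greedy_bound(K, edges):
    """Check Q0 = sum_i 1/(1+|N(i)|) <= (number of greedy removal steps) <= alpha."""
    adj = {i: set() for i in range(K)}
    for u, v in edges:
        adj[u].add(v); adj[v].add(u)
    Q0 = sum(1.0 / (1 + len(adj[i])) for i in range(K))
    remaining, steps = set(range(K)), 0
    while remaining:
        # vertex with the smallest neighbourhood inside the remaining graph
        i0 = min(remaining, key=lambda i: len(adj[i] & remaining))
        remaining -= {i0} | adj[i0]    # cut out i0 and its whole neighbourhood
        steps += 1                     # each removed group contributes at most 1 to Q0
    alpha = independence_number(K, adj)
    assert Q0 <= steps <= alpha        # the removed centers form an independent set
    return Q0, steps, alpha

print(greedy_bound(6, [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]))  # 6-cycle: Q0 = 2.0, alpha = 3
```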
>>: [inaudible]
>> Nicolo Cesa-Bianchi: Yes.
>>: I’m sorry, are you finished yet?
>> Nicolo Cesa-Bianchi: Yes.
>>: One alternative, once you want to put in the independence number: just label the graph with
independent uniform variables. Take those vertices that are a local max, that are [indiscernible] all
their neighbors. When you do that the expected size is exactly…
>> Nicolo Cesa-Bianchi: Yeah, there are many ways…
>>: That gives you…
>> Nicolo Cesa-Bianchi: There are many ways of doing it, yeah, many…
>>: But the randomization gives it to you immediately, as an equality, because the expected size of
that labeled set…
>> Nicolo Cesa-Bianchi: Okay, I was planning to use this proof a couple of other times; that’s why
I’m using this specific one. I’m sure there are, yes, yes, definitely different proofs.
>>: Yes [indiscernible].
>> Nicolo Cesa-Bianchi: Okay, so this proof is for the counting measure, but you can generalize it to
any probability measure on the graph, okay. Now this quantity will be bounded by the independence
number, and in the regret the proof will immediately give you that the regret is of order the square
root of T times alpha times log K. You can immediately see that in the case of the clique the
independence number is one, this is one, and I recover the experts bound.
In the case of the empty graph the independence number is the number of vertices, because everybody
is independent of everybody else. I get a K here, which is the upper bound for the bandit. Okay, so
this nicely interpolates between the two.
Okay, now in the remaining time I want to take a look at a more general situation, which is what
happens when I have directions on the graph. This can very well happen. The example I usually make
is: say you’re going to buy a game console, and you get the recommendation for buying a high
definition cable, okay. If you’re interested in the game console it’s likely that you’ll need a high
definition cable in order to view the games nicely.
The other way around is less likely: if you buy a high definition cable maybe you don’t have a game
console, you have something else underneath, okay. So there are directions now, and the directions
[indiscernible] of course are reducing further the information that you get, okay.
Now, in what sense, what is the observation model here? The observation model is that whenever I pick
some action I only observe the losses of the actions in its out-neighborhood. I am observing the loss
of the action I pick and of all the actions that are pointed to by edges from the action I pick.
Okay, so I won’t see this one anymore, because the edge points in the wrong direction.
Okay, so now I can just revise this definition here. The probability of observing the loss of an
action is the probability of picking that action plus the sum of the probabilities of picking the
actions in its in-neighborhood. What is the probability of observing this loss? Either I pick this
action or I pick any action that has an edge pointing to it, okay.
Now I have reduced the information, and I would like to know what the correct regret rate is. By the
way, in the previous case, where we saw that the regret scales with the independence number of the
graph, that’s tight for any graph: for any graph you can prove a matching lower bound for the game
played on that graph that corresponds to the…
>>: Independence number.
>> Nicolo Cesa-Bianchi: To the independence number, okay, and that’s a variant of the standard bandit
lower bound proof. Okay, so let’s see, how do I do this? First of all I am hoping, okay, that maybe I
can still prove something like that, where here I just put something smaller, which is the
in-neighborhood.
However, there’s a counterexample that rules out the possibility of getting exactly the same kind of
behavior. Okay, so let’s see: I have a directed graph like this, a total order over K actions. Now I
have a probability assignment which is exponentially small in the index: let’s say we number the
actions from one up to K, and the probability of action i is two to the minus i. So I have a very
small probability of picking the action that gives me total visibility, okay.
If I pick that action I observe the loss of everybody else, like in the experts case; if I pick this
other action I observe only the loss of that specific action, like in the bandit case. Okay, so with
this bad probability assignment it’s easy to see that this sum over here, the sum over i of p_i over
q_i, is about K plus one divided by two, I believe.
But if I drop the orientation of the edges I get a clique, so the independence number of the graph
without orientations is one, while this quantity here grows linearly with the number of actions. So I
cannot hope in general, unless I make some assumption, to bound the same quantity I had here, now
restricted to the in-neighborhoods, by the independence number of the graph obtained by dropping the
orientation of the edges.
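[Editor’s note: a quick numerical check of this counterexample; the indexing convention is the
editor’s. On a total order where playing action i reveals the losses of all actions before it, with
p_i proportional to 2^{-i}, the quantity sum_i p_i/q_i grows linearly in K even though the
independence number of the graph with orientations dropped is 1.]

```python
K = 20
p = [2.0 ** -(i + 1) for i in range(K)]           # p_i proportional to 2^{-i}
s = sum(p); p = [x / s for x in p]

# Playing action i reveals the losses of actions 0..i, so the loss of action i
# is observed exactly when some action j >= i is played.
q = [sum(p[j] for j in range(i, K)) for i in range(K)]

print(sum(p[i] / q[i] for i in range(K)))          # roughly (K + 1) / 2 = 10.5
```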
Okay, so the problem here is that one could blame several components, several ingredients of the
system. One might decide to blame the fact that I have too high a variance in my loss estimates. One
way to reduce the variance of the loss estimates is to introduce bias. An easy fix to this problem is
to alter my loss estimates a little bit. My loss estimates will now be tailored to the fact that my
graph has orientations, and that I can expect situations like this one where my standard estimates
won’t work.
Now I use biased loss estimates. Let me just use the same notation here: the estimate will be just
the same as before, ℓ_{i,t} divided by q_{i,t}, times the indicator function that ℓ_{i,t} is
observed, but now I will add a little bit of bias, gamma, down there in the denominator, okay, in
order to keep those terms down. Now I’m underestimating the true losses; this is a negative bias. But
I can control the variance in a good way.
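[Editor’s note: the biased estimate being described, in the editor’s reconstruction:

\hat\ell_{i,t} \;=\; \frac{\ell_{i,t}}{q_{i,t} + \gamma}\,\mathbf{1}\{\ell_{i,t}\ \text{observed}\},

so the quantity to control in the analysis becomes \sum_{i} p_{i,t} / (q_{i,t} + \gamma). ]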
Essentially, if I redo the proof I did before for the original estimates, I get something very
similar. Once I take expectations, the regret, that quantity over here, if I use these estimates, is
bounded by something like this, plus a term coming from gamma. Then I have the same expectation, I’m
almost done, and then I have the sum of p_{i,t} times ℓ hat_{i,t} squared.
Okay, so the only difference here, let me write this as it should be, is that I get p_{i,t} divided
by q_{i,t} plus gamma. Okay, so I get a very similar relationship, with a very similar quantity
controlling the regret as in the previous case, and I have to deal with this gamma here. You see that
gamma is playing essentially the same role as eta, the parameter I had in the exponent for my
probabilities, although here I should find a way of dealing with this quantity here.
Okay, so now I can prove, for any choice of gamma, an upper bound on this which is again twice the
independence number of the graph times a logarithmic factor, which depends on several things,
including gamma of course and alpha. Okay, so you see the price I pay: I can still control this
quantity here in terms of the independence number as I did before, but with an additional logarithmic
factor, which will depend on this gamma term, this bias term I have introduced.
The way this can be proven is also interesting; I would like to show it to you, and I think I have
the time, it’s not going to be long. Again, I will do the proof for the counting measure. The proof
for the counting measure is kind of silly, because if I know I have a uniform measure then
assignments like the one above are ruled out; but the essence of the proof is already there. Then, by
introducing this bias term, I am able to generalize the proof to arbitrary measures over the graph,
okay, because the bias gives me enough control on the denominator.
Okay, so let’s see this proof. It is again a completely combinatorial question. I have an oriented
graph, a directed graph, and I want to control the sum over all the vertices of one over one plus the
size of the in-neighborhood of each vertex, okay. This is the gamma-equals-zero, uniform-measure
case; then there are some technicalities to generalize it in order to get this upper bound here with
gamma. How do I prove this?
>>: What’s the inequality you are aiming at…
>> Nicolo Cesa-Bianchi: Yes, you’re right, I should write the inequality. The inequality will look
like this: if I drop the orientation and use the full neighborhoods I just know that the sum is at
most alpha. If I keep the orientation and consider only the in-neighborhoods, then I have twice alpha
times the log factor here.
Okay, so now let’s see how the proof goes. Again I’m going to pick a sequence of vertices as before.
This time I’m going to take out the vertex i_0 with the largest in-neighborhood. That’s correct, yes,
I take i_0 out and I recurse on what’s left.
Okay, so I’m just taking one vertex out, with all its edges; not the neighborhood, just the vertex.
It’s again as before, but now I just take out this one vertex and I leave the neighborhood intact.
Alright, so let’s reason a little bit about this.
Now I want to relate this to the independence number of the graph without orientations on the edges,
without directions on the edges. This is the maximum in-degree, so it is definitely bigger than the
average, just because I picked the maximum. And the average equals the number of edges divided by K:
the sum of the in-degrees equals the number of edges, because with directions I am only counting the
incoming side of each edge, so I am not counting anything twice. If I sum over all vertices I just
get the number of edges, divided by K.
Now drop the orientations. If you drop orientations then, instead of counting some pair twice, you
might count it just once, but this can only reduce the number of edges, so I am still going in the
right direction in this chain of inequalities. So pretend I am dropping all orientations at this
stage. Then you use Turán’s theorem. Turán’s theorem relates the density of an undirected graph to
its independence number. This gets you exactly K divided by twice the independence number of the
graph, minus a half. It’s usually not written this way, but I wrote it like this for convenience.
Okay, so now I can write a recurrence as I wrote before. I am looking at my quantity of interest
here, and I can split it in two parts: it is at most one over one plus the size of the
in-neighborhood of i_0, plus the sum over all i different from i_0 of one over one plus the size of
the in-neighborhood of i. Now I just plug in this lower bound on the size of the in-neighborhood of
i_0; that gives me an upper bound, because it sits in the denominator, and the first term becomes two
alpha over alpha plus K. Then I have the same kind of thing over here for the rest.
Okay, now I recurse. I just took out one vertex from the graph, so I have a new graph with a smaller
number of edges and a smaller number of vertices. I keep going: I pick again the vertex with the
largest in-neighborhood, I take it out, and I keep on going like this. At the end what I have is that
the quantity over here, the sum over i of one over one plus the size of the in-neighborhood of i, is
less than or equal to the sum over k from one to K of two alpha over alpha plus k: the first time I
have K vertices, the second time I have K minus one, and I go down to one. Then I sum this
harmonic-like sum over here and I get at most two alpha times the log of one plus K over alpha.
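[Editor’s note: a compact reconstruction of the chain just described; Turán’s theorem is applied to
the graph with orientations dropped, whose independence number is \alpha:

\big|N^{\mathrm{in}}(i_0)\big| \;\ge\; \frac{\text{number of edges}}{K} \;\ge\; \frac{K}{2\alpha} - \frac{1}{2},
\qquad\text{hence, recursing,}\qquad
\sum_{i=1}^{K} \frac{1}{1+|N^{\mathrm{in}}(i)|}
\;\le\; \sum_{k=1}^{K} \frac{2\alpha}{\alpha + k}
\;\le\; 2\alpha \,\ln\!\Big(1 + \frac{K}{\alpha}\Big). ]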
Okay, so essentially I can take this proof and generalize it a bit in order to get control of this
term in general; it will be a little more loose, I have an extra term and everything, but I can
handle arbitrary probability assignments. Now, if I tune eta and gamma properly, and the tunings of
eta and gamma will be of the same order, I can get a bound which has exactly the same form as the
regret bound I had before.
Again it will depend on the independence number, but it will have additional log terms, log factors,
both log K and also log T, because gamma will be tuned as something like one over the square root of
T. So essentially, even in the directed case, I can still get control of the regret for arbitrary
directed graphs by using essentially just one simple trick, which is adding a little bias to the
estimates, and then with the proper control on this quantity, which is really the key quantity that
rules the regret here, I essentially get the same result with just different log factors.
This is basically the message. I found this interesting because it gives a nice way of blending
between the experts and the bandit model. Yes?
>>: Can I ask a question?
>>: Yeah.
>> Nicolo Cesa-Bianchi: Sure.
>>: In the directed graph model we could draw the self-loops, the little edges pointing to myself,
explicitly, right, just in the picture.
>> Nicolo Cesa-Bianchi: Okay.
>>: I could also omit some of them. Would that break the proof if I don’t point to myself?
>>: Yes.
>>: So where can you point…
>>: No, I assume it would be…
>> Nicolo Cesa-Bianchi: If it’s disconnected you mean?
>>: I just don’t have a, I don’t observe my own loss…
>>: [indiscernible]…
>>: Everybody has at least one incoming edge, but it’s not from myself.
>> Nicolo Cesa-Bianchi: No, if you don’t see your own loss it’s a problem.
>>: Yeah, show me where it breaks then.
[laughter]
>> Nicolo Cesa-Bianchi: Well, there’s a lower bound in which you have a worse dependence on time, T
to the two thirds. It’s sort of a revealing game: you play a good action but you don’t see it, and in
order to see something you have to play a bad action.
>>: No, I knew this, but I thought your result proved that what I knew was wrong. Can you show me
where the proof breaks down?
>> Nicolo Cesa-Bianchi: I, where the proof breaks down…
>>: These Q’s are always bigger than P’s.
>>: Right.
>> Nicolo Cesa-Bianchi: The Q will always include the P. This is a…
>>: [inaudible]…
>> Nicolo Cesa-Bianchi: This is…
>>: If I promise that Q is bigger…
>> Nicolo Cesa-Bianchi: You don’t have a one here, this is what you mean; you don’t always have a one
down there, which means that you don’t always observe your own loss. Yes, that’s pretty crucial to
it. I mean I…
>>: If I promise that Q is always bigger than P, is it okay?
>>: There’s a minus one half there that seems like it might be where…
>> Nicolo Cesa-Bianchi: Oh, if Q is always bigger than P, yes, because then I fail safe to the bandit
situation, that’s correct.
>>: Okay, so I don’t need an edge to myself…
>> Nicolo Cesa-Bianchi: If you can always ensure that the probability of observing a loss is at least
as big as the probability of picking the action that corresponds to that loss, then it should be
okay.
>>: I see.
>> Nicolo Cesa-Bianchi: Yeah.
>>: But then it would be hard to guarantee; specifically, there could be one really great guy whose
probability shoots up, and that’s it, you’re done, and he is not…
>> Nicolo Cesa-Bianchi: Yeah, I mean all these arguments don’t assume anything about the
probabilities. All the combinatorial stuff holds for any probability assignment; it doesn’t matter
what algorithm I run.
>>: Okay.
>> Nicolo Cesa-Bianchi: Okay, so if you start assuming something about the behavior of the
probabilities, so about the algorithm, then it gets really, really tricky. You may also imagine
alternative observation models in which what you observe really depends on the realized losses: if
you pick an action and that action has zero loss you don’t observe anything, but if your action has a
big loss then you get to see something else.
There is some of that, and these graphs don’t have to be fixed, as I said in the beginning. You
actually don’t need to know the graph in advance. Suppose the graphs are varying over time: then,
instead of having a dependence on T times alpha under the square root, we will have a dependence on
the sum over t of alpha_t, the sum of the independence numbers of the sequence of graphs.
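[Editor’s note: in symbols, with \alpha_t the independence number of the feedback graph at time t,
the bound becomes, up to log factors,

R_T \;\lesssim\; \sqrt{\ln K \,\sum_{t=1}^{T} \alpha_t}. ]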
I can have the observation model changing over time; I don’t need to see it beforehand. Beforehand I
just need my probabilities not to depend on the observation model: I pick my action blindly and then
someone tells me, okay, this is what you observe. I don’t even need to observe the entire graph in
order to update my probabilities. In order to run the algorithm I just need to observe the graph up
to the second neighborhood of the action I picked, because I need to update the probabilities for all
the actions whose loss I know.
If I pick this guy I will observe the loss of this other guy, and I need to compute the probability
of observing that loss, which is a function of the neighborhood of that guy. So I need to know some
vicinity of the graph around the action I picked, but not the entire graph.
Okay, I’m done. Thanks for your attention and patience.
>>: [indiscernible]
[applause]