>> Ofer Dekel: Thanks everyone for coming. It's our pleasure here to have Haipeng Luo from
Princeton and he's going to tell us about online learning with unknown time horizon. Thank
you.
>> Haipeng Luo: Thanks for giving me the opportunity to visit MSR and to give this talk. I'm
going to talk about a recent ICML paper on minimax online learning with unknown time
horizon. So let's just start with a simple example of online learning. Suppose for each day a
gambler arrives at the casino with one dollar and then he decides what portion to bet on each of N
events in the casino. After seeing these portions, the casino decides the loss zi of each event.
At the end of the day the gambler leaves the casino with
the loss being the weighted sum of all the losses. The goal of the gambler is to minimize his
regret after T days and the regret is defined by the difference between the loss of the gambler
and the loss of the best event. This is an old problem and there are tons of algorithms that
achieve low regret, by low I mean sublinear regret, such as the exponential weights
algorithm. On the other hand, it's also possible to derive the game-theoretic optimal solution,
which is the best gambler against the worst-case casino. Most of these algorithms assume that
the time horizon T, or the number of days, is known, especially the minimax algorithms. If T is
unknown, which is usually the case in practice, people have been using different techniques to
deal with this case. The most important one is the
doubling trick, and the idea is that you don't know the horizon, so you just guess one. Once the
actual horizon exceeds your guess, you double your guess and restart the algorithm from
scratch. The performance seems to be not bad at all: you can prove that it's just 2 plus the
square root of 2 times worse than the case when you know T. In general, though, this approach
seems to be inelegant and wasteful because it keeps restarting itself and forgetting all of the
previous information. That's the problem of the doubling trick. Another thing people have been
using is what I call the current-round-is-the-last-round idea. For those kinds of algorithms you
usually have a parameter to tune which depends on T. In that case you can usually just replace
the T with the small t, which is the current time. It usually works, but later in this talk I will
show you an example where this idea fails completely. Besides these, there is also an algorithm
called NormalHedge by Chaudhuri et al., and this algorithm is totally parameter free, but the
thing is it is difficult to analyze and to understand the intuition behind it, and it's
also not clear how to generalize it. So what we are interested in, in this work, is answering the
question of what the game-theoretic optimal solution is when T is unknown.
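To make the two ideas above concrete, here is a minimal Python sketch of the exponential weights
algorithm wrapped in the doubling trick. The square root of log N over the current guess tuning of
the learning rate is the standard textbook choice, not something quoted from the talk, and all the
names are illustrative.

import numpy as np

def hedge_with_doubling(loss_vectors, N):
    """Exponential weights restarted by the doubling trick (sketch).

    loss_vectors: iterable of length-N arrays with entries in [0, 1].
    Returns the regret against the best single action in hindsight.
    """
    guess = 1                                 # current guess of the horizon (starting value is a minor detail)
    eta = np.sqrt(np.log(N) / guess)          # standard tuning for a known horizon (assumed)
    since_restart = np.zeros(N)               # cumulative losses since the last restart
    player_loss, total = 0.0, np.zeros(N)
    for t, z in enumerate(loss_vectors, start=1):
        if t > guess:                         # actual horizon exceeds the guess:
            guess *= 2                        # double the guess and restart from scratch,
            eta = np.sqrt(np.log(N) / guess)  # forgetting all previous information
            since_restart = np.zeros(N)
        w = np.exp(-eta * since_restart)
        p = w / w.sum()                       # distribution over the N actions
        player_loss += p @ z
        since_restart += z
        total += z
    return player_loss - total.min()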
>>: Can I ask you a question?
>> Haipeng Luo: Yes.
>>: Can we ask questions during the talk?
>> Haipeng Luo: Yes.
>>: So you are saying most of the analyses in the previous slide hold when T is known, and then
the guarantees hold for all t’s smaller than T or only the very end?
>> Haipeng Luo: Only the very end.
>>: Okay. So you don't know anything about the regret as you are going along, only that the
regret is bounded by something? So now when you do the doubling trick, again, you are saying that
I guess what T is. I run the algorithm and then at that point I restart the algorithm and I run for
twice as long and I finish that, so now I will have a guarantee at this point and at the end of
each run, and you get a factor of 2 plus something for its regret, but only at these points?
>> Haipeng Luo: No.
>>: If I just restart it, so if my adversary knows my strategy. So he knows that after 1000
rounds [indiscernible] restart and we just decide that the game ends after 1001 steps. So
basically the game ends after I just restarted and [indiscernible] without histories, so am I not
paying a big penalty for that?
>> Haipeng Luo: No. You are still bounded by the regret at 2000 rounds.
>>: One more question. Are you looking at the bandit setting in which you just know the
gradient or do you know all the process?
>> Haipeng Luo: No. The full information setting, not the bandit one.
>>: [indiscernible] for example there can be a time two. If the first bound I'm going to get is
1000, right one operation goes by. At that point it's not a factor of two.
>>: It just goes into the constant. Because it's a sum of 1001 variables, so for the 1000 you get
a good bound and for the…
>>: Where are there constants? It says two times worse regret. Two is the constant.
>>: You don't start the [indiscernible] two, four, eight.
>> Haipeng Luo: Yeah right. That's it. You start with two.
>>: [indiscernible]
>> Haipeng Luo: Okay. We are interested in finding the best solution for when you don't know
T. Here are the main results. We first give some exact minimax solutions for special cases, and
then we show that there is a gap between learning with and without knowing the horizon.
Moreover, we found a new simple technique, just like the doubling trick, which can handle an
unknown horizon but outperforms the doubling trick both in theory and in practice, and it is
quite general, as I will cover later. So before we talk about anything in the unknown horizon
setting, let's first look at the story of the minimax solution in the fixed horizon setting as a
preliminary.
Here is the formal description of the hedge setting, the example I just gave you. Let's
imagine a game between a player and an adversary. For each time t, the player chooses a
distribution Pt over N actions and then the adversary reveals the loss vector Zt with each
coordinate corresponding to the loss of each action. Let's assume Zt is from the loss space,
which is a subset of the hypercube. At the end the player suffers loss being the inner product
of Pt and Zt. Let's further define Lt to be the cumulative loss of the player up to time t and Mt
to be the cumulative loss vector of the actions up to time t. Then the regret will be a function of
LT and MT, which is LT minus the minimum coordinate of MT. To find the minimax solution is
the same as doing this, min over the algorithm of max over the loss sequence of the regret,
which is equivalent to this sequential minimax expression. To solve this, we can use some kind
of backward analysis that defines the optimal value at a state of the game. So here a state
would be M, the cumulative loss vector of the actions and r, the number of remaining rounds.
And this is defined by min over P, max over Z, of the loss for the current round plus the optimal
value of the next state, so M plus Z and r minus 1 is just the next state. In other words,
V of M, r computes the regret if we assume that both the player and the adversary play in the best
way from now on. The base case is just when r is 0, where no rounds remain and the value is simply
minus the minimum coordinate of M. Finding the minimax solution is then just about solving this V
function, and back to the expression, it's just V of 0, T. For some special cases, let's assume
that the loss space is this simple one, where ei is the i-th unit vector, so this means that on
each round exactly one action has loss one and all the others have no loss at all. Similar to
previous work, we found that V of M, r has a nice form: it's actually the expected regret against
a uniformly random adversary. A uniformly random adversary means that it picks ei uniformly at
random on each round. This is exactly this quantity, because no matter how the player reacts, his
loss on each round will be one over N, so it is r over N at the end of the game, and this other
term is just the expected loss of the best action. Moreover, the optimal strategy on round t is
the following: the weight for action i will be the value of the current state minus the value of
the next state assuming that ei is picked. The optimal regret is of the order square root of T.
This is all under the setting that you know the big T, and it is just for this simple case.
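As a sanity check on this simple case, here is a small exact Python sketch of the recursion just
described. It uses the base case V of M, 0 equal to minus the minimum coordinate of M, which is how
the remaining regret closes off under the definitions above (my reading, not a quote from the
slides), and the stated optimal weights, value of the current state minus value of the next state.
It is only meant for tiny N and T.

from functools import lru_cache

def bump(M, i):
    """Return the tuple M with coordinate i increased by one (i.e., M + e_i)."""
    return M[:i] + (M[i] + 1,) + M[i + 1:]

@lru_cache(maxsize=None)
def V(M, r):
    """Value of the state (M, r) for the one-hot loss space: the expected
    remaining regret against the uniformly random adversary, as stated above."""
    if r == 0:
        return -min(M)                        # no rounds left: remaining regret is -min_i M_i
    N = len(M)
    # the player loses 1/N per round no matter what, and the adversary's e_i is uniform
    return 1.0 / N + sum(V(bump(M, i), r - 1) for i in range(N)) / N

def minimax_weights(M, r):
    """Weight of action i = value of the current state minus the value of the
    next state assuming e_i is picked (the optimal strategy described above)."""
    return [V(M, r) - V(bump(M, i), r - 1) for i in range(len(M))]

# Example: 2 actions, 6 rounds remaining, no losses suffered yet.
print(V((0, 0), 6))                  # minimax regret of this small game
print(minimax_weights((0, 0), 6))    # first-round weights (uniform by symmetry)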
For the general case where the loss space is the whole hypercube [0, 1] to the N, we don't know
the exact form of the minimax solution, but we know that it is of order square root of T log N and
it's achieved by the exponential weights algorithm, which says that on round t the weight for
action i should be proportional to e to the minus eta times the cumulative loss of that action,
where eta is just a parameter that depends on T. So this is the story for the fixed horizon. Now
let's move to the unknown horizon case. We studied two specific settings.
The first one is called the random horizon setting, where we assume that T is drawn from some
distribution Q, and we assume Q is known to both the player and the adversary, but the actual
draw T is unknown to both. Here, finding the minimax solution is the same as this: min over the
algorithm of max over infinitely long loss sequences of the expected regret. Again, we show that
under the simple loss space we can solve this exactly. This quantity is, the left-hand side is
just what we want to compute here, and the right-hand side is the expectation of the minimax
regret in the fixed horizon setting that I just showed you. Roughly speaking, this is saying that
T unknown to both is the same as T known to both. Moreover, the optimal strategy on round t
will be the following. Here, Pt T is the optimal strategy on round t if you know the horizon is T.
But right now we don't know the horizon, so we just play the conditional expectation of this, of
what you would have played had you known the horizon, given that the horizon is at least the
current round. Is that clear? This is the optimal strategy.
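Here is a small Python sketch of that conditional-expectation strategy for the simple loss space,
reusing minimax_weights from the earlier sketch. Q is any known distribution over horizons;
truncating the sum at T_max and the helper names are my own conveniences, not details from the
paper.

def conditional_expectation_weights(M, t, Q, T_max):
    """Play p_t = E[ p_t^T | T >= t ], where p_t^T is the fixed-horizon minimax
    strategy for horizon T and Q is the known distribution of the horizon."""
    N = len(M)
    weighted = [0.0] * N
    mass = 0.0
    for T in range(t, T_max + 1):            # condition on the event T >= t
        qT = Q(T)
        if qT == 0.0:
            continue
        # with horizon T at round t, T - t + 1 rounds remain, including this one
        pT = minimax_weights(M, T - t + 1)
        weighted = [a + qT * b for a, b in zip(weighted, pT)]
        mass += qT
    return [a / mass for a in weighted]

# Example: uniform prior over horizons 1..10, two actions, round t = 3,
# action 1 has lost twice so far and action 2 not at all.
Q = lambda T: 0.1 if 1 <= T <= 10 else 0.0
print(conditional_expectation_weights((2, 0), t=3, Q=Q, T_max=10))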
We want to emphasize that similar results do not hold in general. If you move from this space to
the general loss space, this equality just doesn't hold. But actually later we will see that this
is a very powerful idea. Before we talk about that, let's move on to the second model we consider,
which we call the adversarial horizon setting. In this setting, we let T be completely controlled
by the adversary. In other words, the adversary can choose to stop or continue the game on the
fly. If you want to write down the minimax formulation, you might do this: min over the algorithm
and sup over both the horizon and the loss sequence of the regret. But then you will quickly
realize that it doesn't make sense at all, because the adversary will obviously choose T to be
infinity, which leads to an infinitely large regret. In other words, we need to scale the regret
somehow. One natural choice is to scale it by the minimax regret under the fixed horizon setting.
Let's define this ratio and call it V tilde; in other words, V tilde measures how many times worse
you are when you don't know the horizon compared to the case when you know
the horizon. It is trivial that V tilde is at least 1 because it's impossible to do even better when
you don't know the horizon.
>>: If I have an algorithm which is just a [indiscernible] would be the same r all the time, then
it's, the [indiscernible] is going to be 1.
>> Haipeng Luo: It's going to be 1?
>>: Yes, because if my strategy does not depend at all on the losses, then there is not going to
be any difference whether I know the horizon or not. If I'm really, really bad and all of my
predictions are just the same throughout the game, I just ignore it.
>> Haipeng Luo: Uh-huh. But the only thing would be one is T lower bound; you are still talking
about lower bound, right?
>>: I'm saying that this ratio is going to be trivial if I'm really, really bad.
>> Haipeng Luo: Yes. But we are considering the best algorithm here. You want to find the
algorithm to achieve the best ratio.
>>: [indiscernible] V there is another [indiscernible], right?
>>: Maybe you have to [indiscernible]
>>: V is defined with regret of the algorithm or the best regret possible?
>> Haipeng Luo: It's the best regret possible.
>>: [indiscernible]
>> Haipeng Luo: Sorry?
>>: How is V of 0, T defined?
>>: Can you go back to the…
>> Haipeng Luo: Oh, V 0… Okay. It's just this, minimax regret if you know the horizon. Okay?
>>: [indiscernible] horizon?
>> Haipeng Luo: Yes. We define this as the new regret measure, but unfortunately, finding the
minimax solution under this setting proves difficult even for the seemingly trivial case when N
is 2; even with only two actions we still don't know how to find the minimax solution. For this
case we do
find a nontrivial lower bound. Let's say there are only two actions and this is the loss space, the
general one, and we prove that the V tilde there is at least square root of 2, so there is
actually a gap between the known and unknown horizon. In other words, if you don't know the
horizon, the adversary can always force you to suffer at least square root of 2 times worse
regret. The idea behind the proof is simple: consider this restricted adversary that always keeps
the losses close. At Mt the loss for action 1 is L and it is L plus 1 for action 2. Then on the
next round it will either increase the loss for action 2, so it becomes L and L plus 2 and start
again, or it ties up the losses and continues, so you come back to this state on the next round.
We only consider this adversary and then we compute this quantity. It requires some work, but in
the end it is square root of 2. So even against this simple adversary you are already square root
of 2 times worse.
>>: [indiscernible] T?
>> Haipeng Luo: MT is the loss vector so L is the cumulative loss for action 1, and L plus 1 for
action 2.
>>: So how can the loss for action 2 jump by 2 between those two steps?
>> Haipeng Luo: You mean for…
>>: How can it increase by 2 if the losses bounded in 0, 1?
>>: [indiscernible].
>> Haipeng Luo: It is L +1 to L +2.
>>: Oh, L +1 [indiscernible]?
>> Haipeng Luo: Right.
>>: So [indiscernible]
>>: You do these with equal probability? The adversary is choosing the loss?
>> Haipeng Luo: The adversary? Sorry.
>>: You do these two things with equal probability or…
>> Haipeng Luo: No. It chooses whichever is better. The adversary chooses whichever is
better.
>>: In what sense better?
>> Haipeng Luo: Sorry?
>>: In what sense better, better in what sense?
>> Haipeng Luo: In maximizing this ratio. Okay? Even though we don't know the minimax
solution, we do find a new adaptive algorithm, by adaptive I mean it doesn't need to know the
parameter T. And the main idea is the following: let's revisit the minimax solution for the random horizon
setting. It's actually an adaptive algorithm in the sense that it has no parameter of T, so the
idea is to pretend that T is drawn from some distribution when you don't know the horizon, and
then you use this. To use this you need to answer several questions. First, what distribution Q
should be used. Second, we know that this is minimax optimal for the random horizon setting,
but how is the performance of this under the adversarially chosen horizon? And finally, how
general is this approach? The short answers to these questions are: this is the right distribution,
so the probability of the horizon being small t is directly proportional to 1 over t to the d,
where d is a constant greater than 3/2. The performance of this algorithm is of the optimal rate
and finally the approach is very general. In fact, below I'm going to explain these again under
the general setting, the online convex optimization setting. Let's consider this again. On each
round the player predicts a point xt in the convex set S, and then the adversary chooses a
convex loss function ft on the set S and at the end the player suffers loss ft of xt. The regret is,
again, the total loss of the player minus the loss of the best fixed point in the set S. This is the
well-known online convex optimization setting and it's clear that it includes the hedge setting
before: you just need the convex set to be the simplex and ft to be a linear function. Again, we
can derive the minimax solution in the fixed horizon setting. Here we extend M to be the multiset
of all of the loss functions you have encountered, and r is still the number of remaining rounds.
V of M, r is, again, the loss for the current round plus the optimal value of the next state,
assuming that both the player and the adversary play optimally. This is just a multiset union,
this funny symbol. The optimal strategy when you know the horizon is just the point x achieving
the minimum in this expression. Now our
adaptive algorithm is just to play the expectation of this point, this optimal point given that the
horizon is at least the current round. It's the same idea presented in this new framework.
Is the algorithm clear? Here is the analysis of the algorithm. We do need a very mild
assumption saying that V of M, r is at least V of M, 0. In other words, the game is always in
favor of the adversary, because playing some more rounds always leads to a greater regret than not
playing at all. This is what this is saying, and it's true in all of the examples we consider.
The theorem is the following. This is the right distribution again: the probability of the
horizon being small t is proportional to one over t to the d; we want to emphasize that this is
not an assumption on the horizon, it's just a parameter of the algorithm. If the minimax regret
of the fixed horizon setting is of order square root of T with constant c, then the regret for our
algorithm will be at most this, no matter how the horizon is chosen, even if it is chosen by the
adversary. If you look at this, it is of order square root of T again, and the main term is only
A sub d times worse than the fixed horizon case, where A sub d is a constant in terms of d and
this is a gamma function. Again, you can minimize this over d, and at the end you would get about
3, so it's about three times worse than the fixed horizon case. In theory, it's a little bit
better than the doubling trick, whose factor is 2 plus square root of 2. Let me just mention how
the proof works. The key step is to prove that the regret is bounded by this, so don't worry
about the notation. Roughly speaking, the first term is the regret under the random horizon
setting, which is exactly the same, and the second term you can interpret as the penalty for being
in the adversarial setting. The reason why d has to be greater than 3/2 is that if you write out
this series, it is the sum of square root of t over t to the d, and d has to be greater than 3/2
to make it converge. That's the reason why d is this funny constant.
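A quick numerical illustration of that condition (not from the talk): the series is a p-series
with terms t to the power of one half minus d, so it converges exactly when d is greater than 3/2.

def partial_sum(d, n):
    """Partial sum of sum_t sqrt(t) / t^d = sum_t t^(1/2 - d)."""
    return sum(t ** (0.5 - d) for t in range(1, n + 1))

for d in (1.4, 1.6):
    print(d, partial_sum(d, 10**5), partial_sum(d, 10**6))
# For d = 1.4 the partial sums keep growing (divergence); for d = 1.6 they level off.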
>>: You apply this trick of taking the expected x given all possible [indiscernible], but you -- can
you go back one slide? You apply this to this specific optimal strategy? Could I have used it on
some other strategy and gotten the same guarantee?
>> Haipeng Luo: Actually, yes. I will cover that later. So far this is only for this minimax
solution. Okay? So now let's look at some implementation issues and some applications. Let
me first show you my favorite example where the algorithm has a closed form. Let's consider
this specific setting of the online convex optimization. Let's put S to be the L2 unit ball and have
the loss function to be a linear function defined by a point also in S. In other words, on each
round both the player and the adversary picks a point inside the L2 unit ball and the loss is
[indiscernible]. Abernethy et al. show that this is the minimax solution when you know the
horizon. It's pretty simple. Capital Wt is just the sum of all the previous choices of the
adversary and you just compute this and that's minimax optimal. Now we apply our algorithm
on it with d equals 2 and then you get this closed form. The expression looks terrifying, but the
computation is absolutely efficient and doable; it's just computing this in constant time. Apply this
and you get this. Then we compare several algorithms. Here in this graph the x-axis is the
horizon and the y-axis is the regret, and each data point is the maximal regret over 1,000 runs,
so as to simulate the worst case regret. We compare four algorithms: the doubling trick, online
gradient descent by Zinkevich, our algorithm, and the optimal algorithm when you know the horizon
by Abernethy et al. Notice that this optimal algorithm kind of cheats by knowing that the horizon
at the end is 1000, so it uses 1000 as a parameter, and so at the end it's the best algorithm. But
among the other three algorithms, we can see that ours performs consistently around 30 percent
better than the other two. Another thing worth noting is the following: look at roughly the first
400 rounds. Our algorithm actually performs even better than the optimal one. In other words, in
practice if you don't know the
horizon and you want to use this algorithm with a big guess, say 1000, but the actual horizon is
only like 200, then using the optimal algorithm is actually not better than using our algorithm.
So that's the implication. The next question is what if a closed form does not exist; what would
you actually do in that case? It turns out that if on each round you have a randomized player, it
is still okay. Let's go back to the hedge setting, but modify it so that the player has to pick a
single action on each round instead of a distribution. He picks a single action randomly according
to some distribution. In this setting our algorithm simply becomes this. First we draw a T
according to our special Q, and then we draw It according to whatever you would play when you know
the horizon. The algorithm becomes totally efficient, and it is easy to show that the same regret
bound holds with high probability, so the computation isn't an issue here.
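Here is a Python sketch of that randomized variant, reusing minimax_weights from the earlier
sketch and the 1 over t to the d distribution with d equal to 2. Drawing a fresh T of at least t
on every round (so that the expected play matches the conditional-expectation strategy),
truncating the sampler, and all names are my assumptions for illustration; the exact sampling
scheme is in the paper, not the transcript.

import numpy as np

rng = np.random.default_rng(0)

def sample_horizon_at_least(t, d=2.0, T_max=50):
    """Draw T ~ Q(. | T >= t) with Q(t') proportional to 1 / t'^d (truncated)."""
    ts = np.arange(t, T_max + 1)
    q = 1.0 / ts**d
    return int(rng.choice(ts, p=q / q.sum()))

def randomized_hedge_action(M, t):
    """One round of the randomized algorithm: pretend the horizon is T,
    then sample a single action from the fixed-horizon minimax strategy."""
    T = sample_horizon_at_least(t)
    p = np.array(minimax_weights(tuple(M), T - t + 1))   # T - t + 1 rounds remain, incl. this one
    p = np.clip(p, 0.0, None)                            # guard against tiny negative rounding
    return int(rng.choice(len(M), p=p / p.sum()))

# Example: round t = 3, action 1 has cumulative loss 2, action 2 has loss 0.
print(randomized_hedge_action((2, 0), t=3))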
Second, and back to your question, this is an example of combining our algorithm with a
non-minimax algorithm; the basic algorithm doesn't have to be minimax. Here's an example: let's
consider follow the perturbed leader, FPL. This algorithm says that we can perturb the losses of
the actions a little bit by a random variable and then pick the best one. Here the distribution of
this random variable is in terms of the horizon T, and now our algorithm truly becomes a Bayesian
approach, because we are putting a prior on an unknown parameter of another random variable. We
draw T according to Q, then we draw this random variable, and then we pick the leader. We prove
that it [indiscernible] the regret.
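A minimal Python sketch of that two-stage sampling for FPL. The exponential perturbation with
scale on the order of square root of T over log N is the usual FPL tuning and is my assumption
here, as are the truncated sampler for Q and all names; the exact choices in the paper are on the
slides, not in the transcript.

import numpy as np

rng = np.random.default_rng(1)

def sample_horizon(d=2.0, T_max=10**5):
    """Draw T with P(T = t) proportional to 1 / t^d (truncated for sampling)."""
    ts = np.arange(1, T_max + 1)
    q = 1.0 / ts**d
    return int(rng.choice(ts, p=q / q.sum()))

def fpl_with_horizon_prior(cumulative_losses):
    """One round of FPL with a prior on the horizon: (1) pretend the horizon is
    T drawn from Q, (2) draw the perturbation whose scale depends on that T,
    (3) follow the perturbed leader."""
    N = len(cumulative_losses)
    T = sample_horizon()
    scale = np.sqrt(T / np.log(N))                 # standard FPL-style scale (assumed)
    perturbation = rng.exponential(scale, size=N)
    return int(np.argmin(cumulative_losses - perturbation))

print(fpl_with_horizon_prior(np.array([3.0, 5.0, 4.0])))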
Another non-minimax example is the exponential weights algorithm I showed you at the very
beginning of the talk. Here eta sub T is the learning rate to tune, which is in terms of the
horizon big T. When T is unknown, as I mentioned, people have been using the idea of just
replacing the T with the small t, in other words playing P sub t [indiscernible] t again. We show
that our algorithm also works for any constant d, but it degenerates to the above idea when d goes
to infinity, because this is our algorithm, and if you look at this [indiscernible], the
probability converges to one if tau equals t and to zero otherwise as d goes to infinity. This is
exactly the same as this, so our algorithm is a generalization of the previous idea, but
we want to emphasize that our algorithm is more widely applicable, by giving you this example in
which that idea completely fails. Let's go back to the L2 unit ball game: pick a point in the L2
unit ball and the loss is [indiscernible]. Suppose the adversary picks e1 and minus e1 alternately,
and as I showed you, this is the minimax algorithm when you know the horizon. If you apply the
current-round-is-the-last-round idea, that is, you let big T equal small t, then the output of the
algorithm will be 0 and minus e1 over square root of 2, alternately. Then by a simple computation,
the regret will be this, which is linear in T, so this is suboptimal. So this idea does not always
work.
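Here is a quick numerical check of that computation, simply simulating the plays described above
(0 and minus e1 over square root of 2, alternately, against e1 and minus e1). It assumes the loss
is the inner product of the two chosen points, which is the natural reading of the linear loss
here.

import numpy as np

def regret_of_current_round_trick(T):
    """Regret of the plays described above over T rounds (take T even)."""
    e1 = np.array([1.0, 0.0])
    W = np.zeros(2)                 # cumulative loss vector chosen by the adversary
    player_loss = 0.0
    for t in range(1, T + 1):
        x = np.zeros(2) if t % 2 == 1 else -e1 / np.sqrt(2)   # plays described in the talk
        z = e1 if t % 2 == 1 else -e1                         # alternating adversary
        player_loss += x @ z
        W += z
    best = -np.linalg.norm(W)       # loss of the best fixed point in the unit ball
    return player_loss - best

print([round(regret_of_current_round_trick(T), 2) for T in (100, 200, 400)])
# The values grow linearly, roughly T / (2 * sqrt(2)), confirming the linear regret.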
To conclude this example: that idea doesn't work, the doubling trick achieves this regret, and our
algorithm achieves this with a smaller constant. Finally, so far we have only discussed zeroth
order bounds that depend on the horizon, but it's also possible to derive first order bounds such
as this one. It depends on L star, where L star is the loss of the best action, so it doesn't
depend on the horizon but on the loss of the best action. Again,
if you don't know this we can just put a similar distribution on L star and do the same thing
again, and we prove it still achieves the same rate. Okay? To conclude, we have examined some
exact minimax solutions for some special cases, and then we showed a square root of 2 lower bound,
showing a gap between learning with and without knowing the horizon. We proposed our idea of
pretending the horizon is drawn from a distribution to handle the unknown horizon; it outperforms
the doubling trick, it generalizes the previous idea while being more widely applicable, and it is
general: it can be combined with different algorithms and can be used in the OCO setting. And
finally I want to mention some open problems. First, we didn't really give the exact minimax
solution for the adversarial horizon setting, the one that minimizes the ratio; we really want to
know whether there is an example, a special case, where we can do this. Second, the lower bound
holds only for N equals 2, which is quite specific, and we wonder whether there is a more general
bound. And at the end, unlike the doubling trick, we actually proved that our algorithm works with
the minimax algorithm and with some non-minimax algorithms separately. We don't have a unified
proof to show that given any low-regret algorithm we can apply this idea on top of it, and we
wonder
whether that is true or not.
>>: Are you saying that square root of two only holds when N equals 2? The upper bound of 3
is for general N, right? Or was the upper bound 3?
>> Haipeng Luo: Yeah, yeah, sure sure. You mean the performance of the algorithm?
>>: Yeah. You get an algorithm with a constant of 3, there's the constant of N, for any N?
>> Haipeng Luo: For any N, yeah. That's for sure.
>>: But the square root of 2 is only for…
>> Haipeng Luo: It's the lower bound.
>>: So for general N it's somewhere in between?
>> Haipeng Luo: Right. And that's it. [applause]
>> Ofer Dekel: Any questions?
>>: What do you know about the bandit [indiscernible]?
>> Haipeng Luo: I know this. I think the algorithm still works in the bandit setting, but in the
bandit setting it is even more difficult to talk about the minimax algorithm, because even for the
fixed horizon setting I don't know about it.
>>: [indiscernible] information your bound is applied [indiscernible] weight?
>> Haipeng Luo: Sorry?
>>: For the full information case your bound of 3 applies to multiplicative weights? What
about when multiplicative weights is applied to the bandit setting? The doubling trick still works there.
>> Haipeng Luo: Uh-huh.
>>: Does something like your method work?
>> Haipeng Luo: That's a good question. We actually haven't thought about the bandit setting in
this case. I think the minimax solution is just not known in the bandit setting.
>>: [indiscernible] talking about the [indiscernible]
>> Haipeng Luo: Yeah, right.
>> Ofer Dekel: Anyone else? Okay. Let's thank the speaker again. [applause]