>> Ofer Dekel: Thanks everyone for coming. It's our pleasure to have Haipeng Luo here from Princeton, and he's going to tell us about online learning with an unknown time horizon. Thank you. >> Haipeng Luo: Thanks for giving me the opportunity to visit MSR and to give this talk. I'm going to talk about a recent ICML paper on minimax online learning with an unknown time horizon. Let's start with a simple example of online learning. Suppose each day a gambler arrives at the casino with one dollar and decides what portion to bet on each of N events in the casino. After seeing these portions, the casino decides the loss z_i of each event. At the end of the day the gambler leaves the casino with a loss that is the weighted sum of all the losses. The goal of the gambler is to minimize his regret after T days, where the regret is defined as the difference between the loss of the gambler and the loss of the best event. This is an old problem and there are tons of algorithms that achieve low regret, by low I mean sublinear regret, such as the exponential weights algorithm. On the other hand, it's also possible to derive the game-theoretic optimal solution, that is, the best gambler against the worst-case casino. Most of these algorithms assume that the time horizon T, the number of days, is known, especially the minimax algorithms. If T is unknown, which is usually the case in practice, people have been using different tricks to deal with it. The most important one is the doubling trick: you don't know the horizon, so you just guess one, and once the actual horizon exceeds your guess, you double the guess and restart the algorithm from scratch. The performance is not bad at all; you can prove that it's only 2 plus the square root of 2 times worse than the case when you know T. Still, this algorithm is generally considered inelegant and wasteful because it keeps restarting itself and forgetting all of the previous information. That's the problem with the doubling trick. Another thing people have been using is what I call the "current round is the last round" idea. These kinds of algorithms usually have a parameter to tune which depends on T, and in that case you can often just replace T with the small t, the current time. It usually works, but later in this talk I will show you an example where this idea fails completely. Besides these, there is also an algorithm called NormalHedge by Chaudhuri et al. which is totally parameter-free, but it is quite difficult to analyze and to understand the intuition behind it, and it's also not clear how to generalize it. So what we are interested in in this work is to answer the question of what the game-theoretic optimal solution is when T is unknown.
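For reference, here is a minimal sketch (not from the talk) of the two standard baselines just mentioned: exponential weights with a horizon-dependent learning rate, and the doubling trick as a restart wrapper. The class names, the starting guess of 2, and the particular tuning eta = sqrt(8 ln N / T) are assumptions of this sketch, not something specified by the speaker.

import math

class ExpWeights:
    """Exponential weights over n actions, tuned for a fixed guess T of the horizon."""
    def __init__(self, n, T):
        self.eta = math.sqrt(8 * math.log(n) / T)   # one common tuning; the constant varies
        self.L = [0.0] * n                          # cumulative losses of the actions

    def predict(self):
        m = min(self.L)
        w = [math.exp(-self.eta * (L - m)) for L in self.L]
        s = sum(w)
        return [wi / s for wi in w]

    def update(self, z):                            # z = loss vector for this round
        self.L = [Li + zi for Li, zi in zip(self.L, z)]

class DoublingTrick:
    """Restart a horizon-dependent algorithm with guesses 2, 4, 8, ... as the game outlives each guess."""
    def __init__(self, n, make_alg, first_guess=2):
        self.n, self.make_alg = n, make_alg
        self.guess, self.rounds_in_phase = first_guess, 0
        self.alg = make_alg(n, self.guess)

    def predict(self):
        return self.alg.predict()

    def update(self, z):
        self.alg.update(z)
        self.rounds_in_phase += 1
        if self.rounds_in_phase == self.guess:      # phase over: double the guess and restart from scratch
            self.guess *= 2
            self.rounds_in_phase = 0
            self.alg = self.make_alg(self.n, self.guess)

# usage: player = DoublingTrick(5, ExpWeights); p = player.predict(); player.update([0, 1, 0, 1, 1])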
>>: Can I ask you a question? >> Haipeng Luo: Yes. >>: Can we ask questions during the talk? >> Haipeng Luo: Yes. >>: So you are saying most of the analyses in the previous slide hold when T is known, and then the guarantees hold for all t's smaller than T or only at the very end? >> Haipeng Luo: Only at the very end. >>: Okay. So you don't know anything about the regret as you are going along, only that the regret at the very end is bounded by something? So now when you do the doubling trick, again, you are saying that I guess what T is, I run the algorithm, and at that point I restart the algorithm and run it for twice as long, and I finish that, so now I will have a guarantee at the end of each run, with a factor of 2 plus something on the regret, but only at these points? >> Haipeng Luo: No. >>: If I just restart it, so if my adversary knows my strategy, he knows that after 1000 rounds I [indiscernible] restart, and he just decides that the game ends after 1001 steps. So basically the game ends right after I restarted [indiscernible] without the history, so am I not paying a big penalty for that? >> Haipeng Luo: No. You are still bounded by the regret at 2000 rounds. >>: One more question. Are you looking at the bandit setting, in which you only know the loss you incur, or do you observe the whole loss vector? >> Haipeng Luo: No, the full information setting, not the bandit one. >>: [indiscernible] for example, there can be a factor of two. If the first guess I'm going to use is 1000 and the game goes just one round past it, at that point it's not just a factor of two. >>: It just goes into the constant, because it's a sum over 1001 rounds, so for the first 1000 you get a good bound and for the… >>: Where are the constants? It says two times worse regret; two is the constant. >>: You don't start the [indiscernible] two, four, eight. >> Haipeng Luo: Yeah, right. That's it. You start with two. >>: [indiscernible] >> Haipeng Luo: Okay. We are interested in finding the best solution for when you don't know T. Here are the main results. We first give some exact minimax solutions for special cases, and then we show that there is a gap between learning with and without knowing the horizon. Moreover, we found a new simple technique, similar in spirit to the doubling trick, which can handle an unknown horizon but outperforms the doubling trick both in theory and in practice, and which, as we will see later, recovers and generalizes the previous idea. Before we talk about anything in the unknown horizon setting, let's first look at the minimax solution of the fixed horizon setting as a preliminary. Here is the formal description of the hedge setting, the example I just gave you. Imagine a game between a player and an adversary. At each time t, the player chooses a distribution P_t over N actions, and then the adversary reveals the loss vector Z_t, with each coordinate corresponding to the loss of one action. Let's assume Z_t is from the loss space, a subset of the hypercube [0, 1]^N. At the end, the player suffers a loss equal to the inner product of P_t and Z_t. Let's further define L_t to be the cumulative loss of the player up to time t and M_t to be the cumulative loss vector of the actions up to time t. Then the regret is a function of L_T and M_T, namely L_T minus the minimum coordinate of M_T. Finding the minimax solution is the same as doing this: min over the algorithm of max over the loss sequence of the regret, which is equivalent to this sequential minimax expression. To solve it, we can use a kind of backward analysis that defines the optimal value at a state of the game. Here a state would be M, the cumulative loss vector of the actions, and r, the number of remaining rounds. The value is defined by the min over P of the max over Z of the loss for the current round plus the optimal value of the next state, where M plus Z and r minus 1 is just the next state. In other words, V(M, r) computes the regret-to-go if we assume that both the player and the adversary play their best from now on. The base case is r equal to 0, where it is just minus the minimum coordinate of M.
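To make the backward analysis concrete, here is a tiny brute-force sketch of the recursion V(M, r) = min over p of max over z of [p·z + V(M + z, r - 1)], with V(M, 0) = -min_i M_i, for the loss space {0, 1}^N. The grid approximation of the min over distributions, the choice N = 2, and the function names are illustrative assumptions, not the paper's implementation.

import itertools
from functools import lru_cache

N = 2
LOSS_SPACE = list(itertools.product((0.0, 1.0), repeat=N))          # the hypercube corners {0,1}^N
GRID = [(i / 100, 1 - i / 100) for i in range(101)]                 # coarse grid over the simplex (N = 2)

@lru_cache(maxsize=None)
def V(M, r):
    """Approximate optimal value at state (M, r): M = cumulative action losses, r = rounds left."""
    if r == 0:
        return -min(M)                                              # only the comparator term remains
    best = float("inf")
    for p in GRID:                                                  # player: min over (gridded) distributions
        worst = max(
            sum(pi * zi for pi, zi in zip(p, z))
            + V(tuple(mi + zi for mi, zi in zip(M, z)), r - 1)
            for z in LOSS_SPACE                                     # adversary: max over loss vectors
        )
        best = min(best, worst)
    return best

print(V((0.0, 0.0), 3))   # approximate minimax regret of a 3-round, 2-action game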
Finding the minimax solution is then just a matter of solving this V function, and going back to the expression, the minimax regret is just V(0, T). For some special cases we can do this exactly. Let's assume the loss space is this simple one, {e_1, ..., e_N}, where e_i is the i-th unit vector; this means that on each round exactly one action has loss one and all the others have no loss at all. Similar to previous work, we found that V(M, r) has a nice form: it's actually the expected regret against a uniformly random adversary, where a uniformly random adversary picks e_i uniformly at random on each round. This holds because no matter how the player reacts, his expected loss on each round is 1 over N, so it is r over N at the end of the game, and the other term is just the expected loss of the best action. Moreover, the optimal strategy on round t is the following: the weight for action i is the value of the current state minus the value of the next state assuming that e_i is picked. The optimal regret is of order square root of T. This is all under the setting where you know T, and it is just for this simple case. For the general loss space [0, 1]^N we don't know the exact form of the minimax solution, but we know that it is of order square root of T log N, and that rate is achieved by the exponential weights algorithm, which says that on round t the weight of action i should be proportional to e to the minus eta times the cumulative loss of this action, where eta is a parameter that depends on T. So this is the story for the fixed horizon. Now let's move to the unknown horizon case. We studied two specific settings. The first one is the random horizon setting, where we assume that T is drawn from some distribution Q; Q is known to both the player and the adversary, but the actual draw T is unknown to both. Here, finding the minimax solution is the same as taking the min over the algorithm and the max over infinitely long loss sequences of the expected regret. Again, we show that under this simple loss space we can solve it exactly. The left-hand side of this equation is just what we want to compute, and the right-hand side is the expectation of the minimax regret in the fixed horizon setting that I just showed you. Roughly speaking, this says that T unknown to both is the same as T known to both. Moreover, the optimal strategy on round t is the following. Here, p_t^T is the optimal strategy on round t if you knew the horizon were T. Since right now we don't know the horizon, we just play the conditional expectation of this, of what you would have played had you known the horizon, given that the horizon is at least the current round. Is that clear? This is the optimal strategy. We want to emphasize that similar results do not hold in general: if you move from this simple loss space to the general space, this equality just doesn't hold. But actually, later we will see that this is a very powerful idea.
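To illustrate these last two results, here is a small sketch under my reading of them: V(M, r) computed as the expected regret-to-go against the uniformly random adversary in the unit-vector loss space, the equalizing weights p_i = V(M, r) - V(M + e_i, r - 1), and the random-horizon strategy that averages the fixed-horizon strategies conditioned on T >= t. The finite-support prior Q in the example and the function names are assumptions of the sketch.

from functools import lru_cache
from itertools import product

N = 2   # number of actions; keep everything tiny, since V enumerates N**r futures

@lru_cache(maxsize=None)
def V(M, r):
    """Claimed closed form for the unit-vector loss space: expected regret-to-go
    against an adversary that picks a unit vector uniformly at random each round."""
    if r == 0:
        return -min(M)
    futures = list(product(range(N), repeat=r))                 # all equally likely loss sequences
    tail = sum(-min(M[i] + seq.count(i) for i in range(N)) for seq in futures)
    return r / N + tail / len(futures)                          # player pays 1/N per round in expectation

def minimax_p(M, r):
    """Fixed-horizon equalizing strategy: p_i = V(M, r) - V(M + e_i, r - 1)."""
    bump = lambda M, i: tuple(m + (1 if j == i else 0) for j, m in enumerate(M))
    return [V(M, r) - V(bump(M, i), r - 1) for i in range(N)]

def random_horizon_p(M, t, Q):
    """Random-horizon strategy: play E[p_t^T | T >= t] for a prior Q = {horizon: probability}
    (finite support here, purely so the sum is computable)."""
    mass = sum(q for T, q in Q.items() if T >= t)
    p = [0.0] * N
    for T, q in Q.items():
        if T >= t:
            pT = minimax_p(M, T - t + 1)                        # T - t + 1 rounds remain if the horizon is T
            p = [pi + (q / mass) * pTi for pi, pTi in zip(p, pT)]
    return p

# round 1, no losses yet, horizon uniform on {2, 3, 4}: the strategy is uniform by symmetry
print(random_horizon_p((0, 0), 1, {2: 1/3, 3: 1/3, 4: 1/3}))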
Before we talk about that, let's move on to the second model we consider, which is the adversarial horizon setting. In this setting, we let T be completely controlled by the adversary; in other words, the adversary can choose to stop or continue the game on the fly. If you try to write down the minimax formulation, you might do this: min over the algorithm and sup over both the horizon and the loss sequence of the regret. But then you quickly realize that it doesn't make sense at all, because the adversary will obviously choose T to be infinity, which leads to infinitely large regret. In other words, we need to scale the regret somehow. One natural choice is to scale it by the minimax regret under the fixed horizon setting. Let's define this ratio and call it V tilde; in other words, V tilde measures how many times worse off you are when you don't know the horizon compared to the case when you know it. It is trivial that V tilde is at least 1, because it's impossible to do better when you don't know the horizon. >>: If I have an algorithm which just [indiscernible] plays the same thing all the time, then the [indiscernible] is going to be 1. >> Haipeng Luo: It's going to be 1? >>: Yes, because if my strategy does not depend at all on the losses, then there is not going to be any difference whether I know the horizon or not. If I'm really, really bad, all of my predictions are just the same throughout the game; I just ignore everything. >> Haipeng Luo: Uh-huh. But the only thing you would get is that 1 is a lower bound; you are still talking about a lower bound, right? >>: I'm saying that this ratio is going to be trivial if I'm really, really bad. >> Haipeng Luo: Yes, but we are considering the best algorithm here. You want to find the algorithm that achieves the best ratio. >>: [indiscernible] V there is another [indiscernible], right? >>: Maybe you have to [indiscernible] >>: Is V defined with the regret of the algorithm or the best regret possible? >> Haipeng Luo: It's the best regret possible. >>: [indiscernible] >> Haipeng Luo: Sorry? >>: How is V(0, T) defined? >>: Can you go back to the… >> Haipeng Luo: Oh, V(0, T)… Okay. It's just this, the minimax regret if you know the horizon. Okay? >>: [indiscernible] horizon? >> Haipeng Luo: Yes. We defined this as a new notion of regret, but unfortunately, finding the minimax solution under this setting proved difficult even for the seemingly trivial case where N is 2; with only two actions we still don't know how to find the minimax solution. In this case we do find a nontrivial lower bound. Say there are only two actions and the loss space is the general one, [0, 1]^2; we prove that V tilde is at least the square root of 2, so there is actually a gap between the known and unknown horizon. In other words, if you don't know the horizon, the adversary can always force you to suffer at least square root of 2 times worse regret. The idea behind the proof is simple: consider a restricted adversary that always keeps the losses close. In M_t the cumulative loss of action 1 is L and that of action 2 is L plus 1. Then on the next round the adversary will either increase the loss of action 2, so it becomes L and L plus 2, and start over, or it will tie up the losses and continue, so you come back to a state of this form on the next round. We only consider this adversary and then we compute the ratio. It requires some work, but in the end it is the square root of 2, so even against this simple adversary you are already square root of 2 times worse. >>: [indiscernible] T? >> Haipeng Luo: M_T is the loss vector, so L is the cumulative loss of action 1, and L plus 1 for action 2. >>: So how can the loss for action 2 jump by 2 between those two steps? >> Haipeng Luo: You mean for… >>: How can it increase by 2 if the losses are bounded in [0, 1]? >>: [indiscernible]. >> Haipeng Luo: It goes from L plus 1 to L plus 2. >>: Oh, L plus 1 [indiscernible]? >> Haipeng Luo: Right. >>: So [indiscernible] >>: Do you do these with equal probability? The adversary is choosing the loss? >> Haipeng Luo: The adversary? Sorry. >>: Do you do these two things with equal probability or… >> Haipeng Luo: No, it chooses whichever is better. The adversary chooses whichever is better. >>: In what sense better? >> Haipeng Luo: Sorry? >>: In what sense better, better in what sense? >> Haipeng Luo: In the sense of maximizing this ratio. Okay?
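In symbols, the quantity under discussion is, as far as it can be reconstructed from the talk (the exact normalization and the inf/sup ordering should be checked against the paper):

\[
\tilde{V} \;=\; \inf_{\text{player}} \; \sup_{T,\; z_1,\dots,z_T} \; \frac{\sum_{t=1}^{T} p_t \cdot z_t \;-\; \min_i \sum_{t=1}^{T} z_{t,i}}{V(0,T)} \;\;\ge\;\; \sqrt{2} \quad \text{for } N = 2,
\]

where V(0, T) is the minimax regret when the horizon T is known in advance.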
Even though we don't know the minimax solution, we do find a new adaptive algorithm; by adaptive I mean it doesn't need to know the parameter T. For the main idea, let's revisit the minimax solution for the random horizon setting. It's actually an adaptive algorithm in the sense that it has no parameter T, so the idea is to pretend that T is drawn from some distribution when you don't know the horizon, and then to use this strategy. To use it you need to answer several questions. First, what distribution Q should be used? Second, we know that this is minimax optimal for the random horizon setting, but how does it perform under an adversarially chosen horizon? And finally, how general is this approach? The short answers are: this is the right distribution, where the probability of the horizon being small t is proportional to 1 over t to the d and d is a constant greater than 3/2; the performance of this algorithm is of the optimal rate; and the approach is very general. In fact, below I'm going to explain all of this again under the general setting, the online convex optimization setting. Let's consider it again. On each round the player predicts a point x_t in a convex set S, then the adversary chooses a convex loss function f_t on the set S, and at the end the player suffers loss f_t of x_t. The regret is, again, the total loss of the player minus the loss of the best fixed point in the set S. This is the well-known online convex optimization setting, and it's clear that it includes the hedge setting from before: you just take S to be the simplex and f_t to be a linear function. Again, we can write down the minimax solution in the fixed horizon setting. Here we extend M to be the multiset of all the loss functions you have encountered, and r is still the number of remaining rounds. V(M, r) is, again, the loss of the current round plus the optimal value of the next state, assuming that both the player and the adversary play optimally; this funny symbol is just multiset union. The optimal strategy when you know the horizon is just the point x that achieves the minimum. Now our adaptive algorithm is just to play the expectation of this optimal point, given that the horizon is at least the current round. It's the same idea, presented in this new framework. Is the algorithm clear?
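Here is a minimal sketch of the "pretend T is drawn from Q" strategy with Q(T) proportional to 1 / T^d. Two assumptions are mine rather than the talk's: the infinite sum over T is truncated at t_max purely for computation, and the fixed-horizon algorithm being averaged is exponential weights (a combination the speaker only discusses later) instead of the exact minimax strategy.

import math

def exp_weights_play(cum_losses, T):
    """What exponential weights would play on this round if it knew the horizon were T (assumed tuning)."""
    n = len(cum_losses)
    eta = math.sqrt(8 * math.log(n) / T)
    m = min(cum_losses)
    w = [math.exp(-eta * (L - m)) for L in cum_losses]
    s = sum(w)
    return [wi / s for wi in w]

def adaptive_play(cum_losses, t, d=2.0, t_max=2000):
    """Play E[p_t^T | T >= t], where p_t^T = exp_weights_play(cum_losses, T)
    and the horizon prior is Q(T) proportional to 1 / T**d with d > 3/2."""
    n = len(cum_losses)
    horizons = range(t, t_max + 1)                       # truncated support of Q conditioned on T >= t
    weights = [1.0 / T**d for T in horizons]
    total = sum(weights)
    p = [0.0] * n
    for T, w in zip(horizons, weights):
        pT = exp_weights_play(cum_losses, T)
        p = [pi + (w / total) * pTi for pi, pTi in zip(p, pT)]
    return p

print(adaptive_play([3.0, 1.0, 2.0], t=5))               # a distribution favoring the current best action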
Here is the analysis of the algorithm. We do need a very mild assumption, saying that V(M, r) is at least V(M, 0); in other words, the game is always in favor of the adversary, because playing some more rounds always leads to greater regret than not playing at all. That is what this assumption says, and it's true in all of the examples we consider. The theorem is the following. Take this distribution again, with the probability of the horizon being small t proportional to 1 over t to the d; we want to emphasize that this is not an assumption on the horizon, it's just a parameter of the algorithm. If the minimax regret of the fixed horizon setting is of order square root of T with constant c, then the regret of our algorithm will be at most this, no matter how the horizon is chosen, even if it is chosen by the adversary. If you look at it, it is of order square root of T again, and the main term is only A_d times worse than the fixed horizon case, where A_d is a constant in terms of d involving a gamma function. Again, you can minimize this over d, and at the end you get a constant of about 3, meaning about three times worse than the fixed horizon case. In theory this is a bit better than the doubling trick, which is 2 plus the square root of 2. A word about the proof. The key step is to prove that the regret is bounded by this, so don't worry about the notation. Roughly speaking, the first term is the regret under the random horizon setting, which is exactly the same, and the second term can be interpreted as the penalty for being in the adversarial setting. The reason why d has to be greater than 3/2 is that if you write out this series, it is the sum of square root of t over t to the d, and d has to be greater than 3/2 for it to converge. That's the reason why d is this funny constant. >>: You apply this trick of taking the expected x given all possible [indiscernible], but you -- can you go back one slide? You apply this to this specific optimal strategy. Could I have used it on some other strategy and gotten the same guarantee? >> Haipeng Luo: Actually, yes. I will cover that later. So far this is only for the minimax solution. Okay? So now let's look at some implementation issues and some applications. Let me first show you my favorite example, where the algorithm has a closed form. Consider this specific instance of online convex optimization: let S be the L2 unit ball and let the loss function be a linear function defined by a point also in the ball. In other words, on each round both the player and the adversary pick a point inside the L2 unit ball, and the loss is the inner product of the two points. Abernethy et al. showed that this is the minimax solution when you know the horizon. It's pretty simple: capital W_t is just the sum of all the previous choices of the adversary, and you just compute this, and that's minimax optimal. Now we apply our algorithm to it with d equal to 2 and we get this closed form. The expression looks terrifying, but the computation is absolutely efficient and doable; it's just computing this in constant time per round. Apply this and you get this. Then we compare several algorithms. In this graph the x-axis is the horizon and the y-axis is the regret, and each data point is the maximal regret over the runs, to simulate the worst-case regret. We compare four algorithms: the doubling trick, online gradient descent by Zinkevich, our algorithm, and the optimal algorithm that knows the horizon, by Abernethy et al. Notice that this optimal algorithm kind of cheats by knowing that the horizon at the end is 1000, so it uses 1000 as its parameter, and at the end it's the best algorithm. Among the other three algorithms, we can see that ours performs consistently about 30 percent better than the other two. Another thing worth noting: look at roughly the first 400 rounds. Our algorithm actually performs even better than the optimal one there. In other words, in practice, if you don't know the horizon and you want to use the optimal algorithm with a big guess, say 1000, but the actual horizon is only around 200, then using the optimal algorithm is actually no better than using our algorithm. So that's the implication.
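For reference, here is a sketch of that fixed-horizon ball game, with the minimax strategy written in the form I believe Abernethy et al. use, x_t = -W_{t-1} / sqrt(||W_{t-1}||^2 + T - t + 1); the exact formula should be checked against their paper, and the random adversary in the demo is just an arbitrary choice, not the worst case from the experiments in the talk.

import math
import random

def minimax_ball_play(W, rounds_left):
    """Fixed-horizon strategy for the linear game on the L2 unit ball (form recalled, not verified):
    play -W / sqrt(||W||^2 + rounds_left), where W is the sum of the adversary's past points."""
    denom = math.sqrt(sum(w * w for w in W) + rounds_left)
    return [-w / denom for w in W]

def run_ball_game(T, dim=2, seed=0):
    """Play T rounds against a random unit-vector adversary and return the regret."""
    rng = random.Random(seed)
    W = [0.0] * dim                                       # cumulative adversary vector
    player_loss = 0.0
    for t in range(1, T + 1):
        x = minimax_ball_play(W, T - t + 1)
        z = [rng.gauss(0, 1) for _ in range(dim)]         # adversary: a random direction, for illustration
        norm = math.sqrt(sum(zi * zi for zi in z))
        z = [zi / norm for zi in z]
        player_loss += sum(xi * zi for xi, zi in zip(x, z))
        W = [wi + zi for wi, zi in zip(W, z)]
    best_fixed_loss = -math.sqrt(sum(w * w for w in W))   # the best fixed point in the ball is -W/||W||
    return player_loss - best_fixed_loss

print(run_ball_game(T=1000))                              # regret on the order of sqrt(T)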
The next question is: what if a closed form does not exist, and what would you actually do in that case? It turns out that if you have a randomized player on each round, it is still okay. Let's go back to the hedge setting, but modify it so that the player has to pick a single action on each round instead of a distribution; he picks a single action randomly according to some distribution. In this setting our algorithm simply becomes this: first we draw a T according to our special Q, and then we draw I_t according to whatever you would play if you knew the horizon. The algorithm becomes totally efficient, and it is easy to show that the same regret bound holds with high probability, so computation isn't an issue here. Second, and back to your question, here is an example of combining our algorithm with a non-minimax algorithm; the base algorithm doesn't have to be minimax. Consider follow the perturbed leader, FPL. This algorithm perturbs the losses of the actions a little bit by a random variable and then picks the best one. The distribution of this random variable is in terms of the horizon T, and now our algorithm really becomes a Bayesian approach, because we are putting a prior on an unknown parameter of another random variable: we draw T according to Q, then we draw this random variable, and then we pick the leader. We prove that it [indiscernible] the regret. Another non-minimax example is the exponential weights algorithm I showed you at the very beginning of the talk. Here eta_T is the learning rate to tune, which is in terms of the horizon T. When T is unknown, as I mentioned, people have been using the idea of just replacing T with the small t, in other words playing p_t with the learning rate eta_t on round t. We show that our algorithm also works here for any constant d, and that it degenerates to the above idea when d goes to infinity, because if you look at this conditional probability, it converges to 1 if tau equals t and to 0 otherwise as d goes to infinity, which is exactly the same as the above. So our algorithm is a generalization of the previous idea, but we want to emphasize that our algorithm is more widely applicable, and to show that let me give you an example in which that idea fails completely. Let's go back to the L2 unit ball game, where both sides pick a point in the L2 unit ball and the loss is the inner product. Suppose the adversary picks e1 and minus e1 alternately, and, as I showed you, this is the minimax algorithm when you know the horizon. If you apply the "current round is the last round" idea, letting big T equal small t, then the output of the algorithm alternates between 0 and minus e1 over the square root of 2. Then, by a simple computation, the regret will be this, which is linear regret, so this is suboptimal; this idea does not always work. To conclude this example: that idea doesn't work, the doubling trick achieves this regret, and our algorithm achieves it with a smaller constant. Finally, so far we have only discussed zeroth-order bounds that depend on the horizon, but it's also possible to derive first-order bounds such as this one, which depends on L star, the loss of the best action, rather than on the horizon. Again, if you don't know L star, we can just put a similar distribution on L star and do the same thing again, and we prove that it still achieves the same rate. Okay?
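As a sanity check on that failure example, here is a small sketch reusing the same assumed closed form as in the earlier ball-game sketch: replacing the horizon T with the current round t means pretending exactly one round remains, and against the alternating e1 / minus e1 adversary the regret then grows linearly, matching the computation described in the talk.

import math

def current_round_is_last_round_play(W):
    """The 'current round is the last round' heuristic on the ball game:
    pretend one round remains, i.e. play -W / sqrt(||W||^2 + 1)."""
    denom = math.sqrt(sum(w * w for w in W) + 1.0)
    return [-w / denom for w in W]

def regret_vs_alternating_adversary(T, dim=2):
    """Adversary alternates e1 and -e1; return the regret of the heuristic after T rounds."""
    e1 = [1.0] + [0.0] * (dim - 1)
    W = [0.0] * dim
    player_loss = 0.0
    for t in range(1, T + 1):
        x = current_round_is_last_round_play(W)
        z = e1 if t % 2 == 1 else [-v for v in e1]
        player_loss += sum(xi * zi for xi, zi in zip(x, z))
        W = [wi + zi for wi, zi in zip(W, z)]
    best_fixed_loss = -math.sqrt(sum(w * w for w in W))       # best fixed point in the ball is -W/||W||
    return player_loss - best_fixed_loss

for T in (100, 1000, 10000):
    print(T, round(regret_vs_alternating_adversary(T), 2))    # grows linearly, roughly T / (2 * sqrt(2))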
To conclude, we have derived exact minimax solutions for some special cases, and we showed a square root of 2 lower bound, which means there is a gap between learning with and without knowing the horizon. We proposed the idea of pretending the horizon is drawn from a distribution in order to handle an unknown horizon; it outperforms the doubling trick, it generalizes the previous idea while being more widely applicable, and it is general in that it can be combined with different algorithms and used in the OCO setting. Finally, I want to mention some open problems. First, we didn't give the exact minimax solution for the adversarial horizon setting, the one that minimizes the ratio; we really want to know whether there is a special case where we can do that. Second, the lower bound holds only for N equal to 2, which is quite specific, and we wonder whether there is a bound for general N. And finally, unlike the doubling trick, we proved that our algorithm works with the minimax algorithm and with some non-minimax algorithms separately; we don't have a unified proof showing that, given any low-regret algorithm, we can apply this idea on top of it, and we wonder whether that is true. >>: Are you saying that the square root of 2 only holds when N equals 2? The upper bound of 3 is for general N, right? Or was the upper bound 3? >> Haipeng Luo: Yeah, yeah, sure. You mean the performance of the algorithm? >>: Yeah. You get an algorithm with a constant of 3, and that constant holds for any N? >> Haipeng Luo: For any N, yeah. That's for sure. >>: But the square root of 2 is only for… >> Haipeng Luo: It's the lower bound. >>: So for general N it's somewhere in between? >> Haipeng Luo: Right. And that's it. [applause] >> Ofer Dekel: Any questions? >>: What do you know about the bandit [indiscernible]? >> Haipeng Luo: I know this one. I think the algorithm would still work in the bandit setting, but in the bandit setting it is even more difficult to talk about the minimax algorithm, because even for the fixed horizon setting I don't know it. >>: [indiscernible] information your bound is applied [indiscernible] weights? >> Haipeng Luo: Sorry? >>: For the full information case, your bound of 3 applies to multiplicative weights. What about when multiplicative weights is applied to the bandit setting? The doubling trick still works there. >> Haipeng Luo: Uh-huh. >>: Does something like your method work? >> Haipeng Luo: That's a good question. We actually haven't thought about the bandit setting in this case; the minimax solution just isn't really understood in the bandit setting. >>: [indiscernible] talking about the [indiscernible] >> Haipeng Luo: Yeah, right. >> Ofer Dekel: Anyone else? Okay. Let's thank the speaker again. [applause]