Advanced Course in Machine Learning
Spring 2011
Online Learning
Lecturer: Shai Shalev-Shwartz
Scribe: Shai Shalev-Shwartz
In this lecture we describe a different model of learning which is called online learning. Online learning
takes place in a sequence of consecutive rounds. To demonstrate the online learning model, consider again
the papaya tasting problem. On each online round, the learner first receives an instance (the learner buys a
papaya and knows its shape and color, which form the instance). Then, the learner is required to predict a
label (is the papaya tasty?). At the end of the round, the learner gets the correct label (he tastes the papaya and
then knows if it’s tasty or not). Finally, the learner uses this information to improve his future predictions.
Previously, we used the batch learning model in which we first use a batch of training examples to learn
a hypothesis, and only after learning is completed is the learned hypothesis tested. In our papayas learning
problem, we should first buy a bunch of papayas and taste them all. Then, we use all of this information to
learn a prediction rule that determines the taste of new papayas. In contrast, in online learning there is no
separation between a training phase and a test phase. The same sequence of examples is used both for training
and testing, and the distinction between training and testing is made over time. In our papaya problem, each time we
buy a papaya, it is first considered a test example since we should predict if it's going to taste good. But, after
we take a bite of the papaya, we know the true label, and the same papaya becomes a training example
that can help us improve our prediction mechanism for future papayas.
The goal of the online learner is simply to make few prediction mistakes. By now, the reader should
know that there are no free lunches – we must have some prior knowledge on the problem in order to be
able to make accurate predictions. As in previous lectures, we encode our prior knowledge on the problem
using some representation of the instances (e.g. shape and color) and by assuming that there is a class of
hypotheses, H = {h : X → Y}, and on each online round the learner uses a hypothesis from H to make his
prediction.
To simplify our presentation, we start the lecture by describing online learning algorithms for the case of
a finite hypothesis class.
1 Online Learning and Mistake Bounds for Finite Hypothesis Classes
Throughout this section we assume that H is a finite hypothesis class. On round t, the learner first receives
an instance, denoted xt , and is required to predict a label. After making his prediction, the learner receives
the true label, yt = h*(xt), where h* ∈ H is a fixed (but unknown to the learner) hypothesis.
The most natural learning rule is to use a consistent hypothesis at each online round. If there are several
consistent hypotheses, the Consistent algorithm chooses one of them arbitrarily. This corresponds to the
ERM learning rule we discussed in batch learning.
It is trivial to see that the number of mistakes the Consistent algorithm will make on any sequence is
at most |H| − 1. This follows from the fact that whenever Consistent errs, at least one hypothesis is no
longer consistent with all previous examples. Therefore, after making |H| − 1 mistakes, the only hypothesis
which will remain consistent is h*, and it will never err again. On the other hand, it is easy to construct a
sequence of examples on which Consistent indeed makes |H| − 1 mistakes (this is left as an exercise).
Next, we define a variant of Consistent which has a much better mistake bound. On each round, the
RandConsistent algorithm chooses a consistent hypothesis uniformly at random, as there is no reason to
prefer one consistent hypothesis over another. This leads to the following algorithm.
Algorithm 1 RandConsistent
Input: A finite hypothesis class H
Initialize: V1 = H
For t = 1, 2, . . .
Receive xt
Choose some h from Vt uniformly at random
Predict ŷt = h(xt )
Receive true answer yt
Update Vt+1 = {h ∈ Vt : h(xt ) = yt }
The RandConsistent algorithm maintains a set, Vt , of all the hypotheses which are consistent with
(x1 , y1 ), . . . , (xt−1 , yt−1 ). It then chooses a hypothesis uniformly at random from Vt and predicts according
to this hypothesis.
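To make the procedure concrete, the following is a minimal Python sketch of RandConsistent, under the illustrative assumption that H is given as a list of Python callables h(x) and that the stream of (x, y) pairs is realizable:

import random

def rand_consistent(hypotheses, stream, rng=random.Random(0)):
    """Run RandConsistent on an iterable of (x, y) pairs; returns the number of mistakes.
    Assumes realizability, so the version space V never becomes empty."""
    V = list(hypotheses)                     # version space V_t, initially all of H
    mistakes = 0
    for x, y in stream:
        h = rng.choice(V)                    # choose a consistent hypothesis uniformly at random
        if h(x) != y:
            mistakes += 1
        V = [g for g in V if g(x) == y]      # keep only hypotheses consistent with (x_t, y_t)
    return mistakes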
Recall that the goal of the learner is to make few mistakes. The following theorem upper bounds the
expected number of mistakes RandConsistent makes on a sequence of examples. To motivate the bound,
consider a round t and let αt be the fraction of hypotheses in Vt which are going to be correct on the example
(xt , yt ). Now, if αt is close to 1, it means we are likely to make a correct prediction. On the other hand, if
αt is close to 0, we are likely to make a prediction error. But, on the next round, after updating the set of
consistent hypotheses, we will have |Vt+1 | = αt |Vt |, and since we now assume that αt is small, we will have
a much smaller set of consistent hypotheses in the next round. To summarize, if we are likely to err on the
current example then we are going to learn a lot from this example as well, and therefore be more accurate in
later rounds.
Theorem 1 Let H be a finite hypothesis class, let h* be some hypothesis in H, and let
(x1, h*(x1)), . . . , (xT, h*(xT)) be an arbitrary sequence of examples. Then, the expected number of
mistakes the RandConsistent algorithm makes on this sequence is at most ln(|H|), where the expectation is
with respect to the algorithm's own randomization.
Proof For each round t, let αt = |Vt+1|/|Vt|. Therefore, after T rounds we have

1 ≤ |VT+1| = |H| ∏_{t=1}^{T} αt .

Using the inequality b ≤ e^{−(1−b)}, which holds for all b, we get that

1 ≤ |H| ∏_{t=1}^{T} e^{−(1−αt)} = |H| e^{−Σ_{t=1}^{T} (1−αt)}   ⇒   Σ_{t=1}^{T} (1 − αt) ≤ ln(|H|) .        (1)

Finally, since we predict ŷt by choosing h ∈ Vt uniformly at random, we have that the probability to make a
mistake on round t is

P[ŷt ≠ yt] = |{h ∈ Vt : h(xt) ≠ yt}| / |Vt| = (|Vt| − |Vt+1|) / |Vt| = 1 − αt .

Therefore, the expected number of mistakes is

Σ_{t=1}^{T} E[1[ŷt ≠ yt]] = Σ_{t=1}^{T} P[ŷt ≠ yt] = Σ_{t=1}^{T} (1 − αt) .
Combining the above with Eq. (1) we conclude our proof.
It is interesting to compare the guarantee in Theorem 1 to guarantees on the generalization error in the
PAC model (see Corollary ??). In the PAC model, we refer to the T examples in the sequence as a training
set. Then, Corollary ?? implies that with probability of at least 1 − δ, our average error on new examples
is guaranteed to be at most ln(|H|/δ)/T . In contrast, Theorem 1 tells us a much stronger guarantee. We do
not need to first train the model on T examples, in order to have error rate of ln(|H|)/T . We can have this
same error rate immediately on the first T examples we observe. In our papayas example, we don’t need to
first buy T papayas, taste them all, and only then to be able to classify new papayas. We can start making
predictions from the first day, each day trying to buy a tasty papaya, and we know that our performance after
T days will be the same as our performance had we first trained our model using T papayas!
Another important difference between the online model and the batch model is that in the latter we
assume that instances are sampled i.i.d. from some underlying distribution, whereas in the former there is no such
assumption. In particular, Theorem 1 holds for any sequence of instances. Removing the i.i.d. assumption
is a big advantage. Again, in the papayas problem, we are allowed to choose a new papaya every day, which
clearly violates the i.i.d. assumption. On the flip side, we only have a guarantee on the total number of
mistakes but we have no guarantee that after observing T examples we will identify the ’true’ hypothesis.
Indeed, if we observe the same example on all the online rounds, we will make few mistakes but we will
remain with a large set Vt of hypotheses, all of which are potentially the true hypothesis.
Note that the RandConsistent algorithm is a specific variant of the general Consistent learning
paradigm (i.e., ERM) and that the bound in Theorem 1 relies on the fact that we use this specific variant. This
stands in contrast to the results we had before for the PAC model in which it doesn’t matter how we break
ties, and any consistent hypothesis is guaranteed to perform well. In some situations, it is computationally
harder to sample a consistent hypothesis from Vt while it is less demanding to merely find one consistent
hypothesis. Moreover, if H is infinite, it is not well defined how to choose a consistent hypothesis uniformly
at random. On the other hand, as mentioned before, the results we obtained for the PAC model assume that
the data is i.i.d. while the bound for RandConsistent holds for any sequence of instances. If the data is
indeed generated i.i.d. then it is possible to obtain a bound for the general Consistent paradigm.
Theorem 1 bounds the expected number of mistakes. Using martingale measure concentration techniques,
one can obtain a bound which holds with extremely high probability. A simpler way is to explicitly derandomize the algorithm. Note that RandConsistent predicts 1 with probability greater than 1/2 if the majority
of hypotheses in Vt predicts 1. A simple derandomization is therefore to make a deterministic prediction
according to a majority vote of the hypotheses in Vt . The resulting algorithm is called Halving.
Algorithm 2 Halving
Input: A finite hypothesis class H
Initialize: V1 = H
For t = 1, 2, . . .
Receive xt
Predict ŷt = argmax_{r∈{±1}} |{h ∈ Vt : h(xt) = r}|
(In case of a tie predict ŷt = 1)
Receive true answer yt
Update Vt+1 = {h ∈ Vt : h(xt ) = yt }
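A minimal Python sketch of Halving, under the same illustrative representation of H as a list of callables:

def halving(hypotheses, stream):
    """Run the Halving algorithm; returns the number of mistakes (at most log2 |H|)."""
    V = list(hypotheses)
    mistakes = 0
    for x, y in stream:
        votes_plus = sum(1 for h in V if h(x) == 1)
        y_hat = 1 if votes_plus >= len(V) - votes_plus else -1   # majority vote, ties go to +1
        if y_hat != y:
            mistakes += 1
        V = [h for h in V if h(x) == y]
    return mistakes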
Theorem 2 Let H be a finite hypothesis class, let h* be some hypothesis in H, and let
(x1, h*(x1)), . . . , (xT, h*(xT)) be an arbitrary sequence of examples. Then, the number of mistakes the
Halving algorithm makes on this sequence is at most log2(|H|).
Proof We simply note that whenever the algorithm errs we have |Vt+1 | ≤ |Vt |/2. (Hence the name Halving.)
Therefore, if M is the total number of mistakes, we have
1 ≤ |VT+1| ≤ |H| 2^{−M} .
Rearranging the above inequality we conclude our proof.
A guarantee of the type given in Theorem 2 is called a Mistake Bound. Theorem 2 states that Halving
enjoys a mistake bound of log2 (|H|). In the next section, we relax the assumption that all the labels are
generated by a hypothesis h* ∈ H.
2 Weighted Majority and Regret Bounds
In the previous section we presented the Halving algorithm and analyzed its performance by providing a
mistake bound. A crucial assumption we relied on is that the data is realizable, namely, the labels in the
sequence are generated by some hypothesis h* ∈ H. This is a rather strong assumption on our data and prior
knowledge. In this section we relax this assumption.
Recall that the mistake bounds we derived in the previous section do not require the data to be sampled
i.i.d. from some distribution. We allow the sequence to be deterministic, stochastic, or even adversarially
adaptive to our own behavior (for example, this is the case in spam email filtering). Clearly, learning is
impossible if there is no correlation between past and present examples.
In the realizable case, future and past examples are tied together by the common hypothesis, h* ∈ H,
that generates all labels. In the non-realizable case, we analyze the performance of an online learner using
the notion of regret. The learner's regret is the difference between his number of prediction mistakes and
the number of mistakes the optimal fixed hypothesis in H makes on the same sequence of examples. This is
termed 'regret' since it measures how 'sorry' the learner is, in retrospect, not to have followed the predictions
of the optimal hypothesis. We again use the notation h* to denote the hypothesis in H that makes the least
number of mistakes on the sequence of examples. But, now, the labels are not generated by h*, meaning that
for some examples we might have h*(xt) ≠ yt.
As mentioned before, learning is impossible if there is no correlation between past and future examples.
However, even in this case, the algorithm can have low regret. In other words, having low regret does not
necessarily mean that we will make few mistakes. It only means that the algorithm will be competitive with
an oracle that knows in advance what the data is going to be and chooses the optimal h*. If our prior
knowledge is adequate, then H contains a hypothesis that (more or less) explains the data, and then a low
regret algorithm is guaranteed to have few mistakes.
We now present an online learning algorithm for the non-realizable case, also called the agnostic case. As
in the previous section, we assume that H is a finite hypothesis class and denote H = {h1 , . . . , hd }. Recall
that the RandConsistent algorithm maintains the set Vt of all hypotheses which are consistent with the
examples observed so far. Then, it samples a hypothesis from Vt uniformly at random. We can represent the
set Vt as a vector wt ∈ R^d, where wt,i = 1 if hi ∈ Vt and wt,i = 0 otherwise. Then, the RandConsistent
algorithm chooses hi with probability wt,i / (Σ_j wt,j), and the vector wt is updated at the end of the round by
zeroing all elements corresponding to hypotheses that err on the current example.
If the data is non-realizable, the weight of h* will become zero once we encounter an example on which
h* errs. From this point on, h* will not affect the prediction of the RandConsistent algorithm. To
overcome this problem, one can be less aggressive: instead of zeroing the weights of erroneous hypotheses,
one can just diminish their weight by scaling it down by some β ∈ (0, 1). The resulting algorithm
is called Weighted-Majority.
is called Weighted-Majority.
Algorithm 3 Weighted-Majority
Input: Finite hypothesis class H = {h1, . . . , hd} ; Number of rounds T
Initialize: β = e^{−√(8 ln(d)/T)} ; w1 = (1, . . . , 1) ∈ R^d
For t = 1, 2, . . . , T
Let Zt = Σ_{j=1}^{d} wt,j
Sample a hypothesis h ∈ H at random according to (wt,1/Zt, . . . , wt,d/Zt)
Predict ŷt = h(xt)
Receive true answer yt
Update: ∀j, wt+1,j = wt,j β if hj(xt) ≠ yt, and wt+1,j = wt,j otherwise
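A minimal Python sketch of Weighted-Majority, again representing H as a list of callables; the default β follows the initialization above:

import math
import random

def weighted_majority(hypotheses, stream, T, beta=None, rng=random.Random(0)):
    """Run Weighted-Majority for T rounds; returns the number of mistakes."""
    d = len(hypotheses)
    if beta is None:
        beta = math.exp(-math.sqrt(8.0 * math.log(d) / T))   # beta from the initialization above
    w = [1.0] * d
    mistakes = 0
    for t, (x, y) in enumerate(stream):
        if t >= T:
            break
        i = rng.choices(range(d), weights=w)[0]   # sample hypothesis i with prob. w[i] / Z_t
        if hypotheses[i](x) != y:
            mistakes += 1
        # scale down the weight of every hypothesis that errs on (x, y)
        w = [w[j] * beta if hypotheses[j](x) != y else w[j] for j in range(d)]
    return mistakes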
The following theorem provides an expected regret bound for the algorithm.
Theorem 3 Let H be a finite hypothesis class, and let (x1, y1), . . . , (xT, yT) be an arbitrary sequence of
examples. If we run Weighted-Majority on this sequence we have the following expected regret bound

E[ Σ_{t=1}^{T} 1[ŷt ≠ yt] ] − min_{h∈H} Σ_{t=1}^{T} 1[h(xt) ≠ yt] ≤ √(0.5 ln(|H|) T) ,

where the expectation is with respect to the algorithm's own randomization.
Proof Let η = √(8 ln(d)/T) and note that wt+1,i = wt,i e^{−η 1[hi(xt) ≠ yt]}. Therefore,

ln(Zt+1/Zt) = ln Σ_i (wt,i/Zt) e^{−η 1[hi(xt) ≠ yt]} .

Hoeffding's inequality tells us that if X is a random variable over [0, 1] then

ln E[e^{−ηX}] ≤ −η E[X] + η²/8 .

Since wt/Zt is a probability vector and 1[hi(xt) ≠ yt] ∈ [0, 1], we can apply Hoeffding's inequality to obtain:

ln(Zt+1/Zt) ≤ −η Σ_i (wt,i/Zt) 1[hi(xt) ≠ yt] + η²/8 = −η E[1[ŷt ≠ yt]] + η²/8 .

Summing the above inequality over t we get

ln(ZT+1) − ln(Z1) = Σ_{t=1}^{T} ln(Zt+1/Zt) ≤ −η Σ_{t=1}^{T} E[1[ŷt ≠ yt]] + T η²/8 .

Next, we lower bound ZT+1. For each i, let Mi be the number of mistakes hi makes on the entire sequence
of T examples. Therefore, wT+1,i = e^{−η Mi} and we get that

ln ZT+1 = ln( Σ_i e^{−η Mi} ) ≥ ln( max_i e^{−η Mi} ) = −η min_i Mi .

Combining the above and using the fact that ln(Z1) = ln(d) we get that

−η min_i Mi − ln(d) ≤ −η Σ_{t=1}^{T} E[1[ŷt ≠ yt]] + T η²/8 ,

which can be rearranged as follows:

Σ_{t=1}^{T} E[1[ŷt ≠ yt]] − min_i Mi ≤ ln(d)/η + η T/8 .

Plugging the value of η into the above concludes our proof.
Comparing the regret bound of Weighted-Majority in Theorem 3 to the mistake bound of
RandConsistent given in Theorem 1 we note several differences. First, the result for
Weighted-Majority holds for any sequence of examples, while the mistake bound of
RandConsistent assumes that the labels are generated by some h* ∈ H. As a result, the bound of
Weighted-Majority is relative to the minimal number of mistakes a hypothesis h ∈ H makes. Second,
dividing both bounds by the number of rounds T, we get that the error rate in Theorem 1 decreases as
ln(|H|)/T while the error rate in Theorem 3 decreases as √(ln(|H|)/T). This is similar to the results we had
for PAC learning in the realizable and non-realizable cases.
3 Online Learnability and Littlestone's dimension
So far, we focused on specific hypothesis classes – finite classes or the class of linear separators. We derived
specific algorithms for each type of class and analyzed their performance. In this section we take a more
general approach, and aim at characterizing online learnability. In particular, we target the following question:
what is the optimal online learning algorithm for a given class H?
To simplify our derivation, we again consider the realizable case in which all labels are generated by some
hypothesis h* ∈ H. We already encountered worst-case mistake bounds. For completeness, we give here a
formal definition.
Definition 1 (Mistake bound) Let H be a hypothesis class of functions from X to Y = {±1} and let A be
an online algorithm. We say that M is a mistake bound for A if, for any hypothesis h* ∈ H and any sequence
of examples (x1, h*(x1)), . . . , (xT, h*(xT)), the online algorithm A does not make more than M prediction
mistakes when running on the sequence of examples.
Remarks:
• If an algorithm has a mistake bound M it does not necessarily mean that the algorithm will eventually
know the identity of h*, even if we present it with an infinite number of examples (e.g. we can feed the
online algorithm the same example again and again). The only guarantee that we have is that the total
number of mistakes that the algorithm will make is at most M .
• When the learning algorithm makes randomized predictions, it is also possible to define a mistake
bound M as an upper bound on the expected number of mistakes, where expectation is with respect to
learner’s own randomization. For simplicity, throughout this section we do not analyze expected regret
and allow the environment to observe the learner’s prediction before providing the label.
In the following we present a dimension of hypothesis classes that characterizes the best achievable mistake bound. This measure was proposed by Littlestone and we therefore refer to it as Ldim(H).
To motivate the definition of Ldim it is convenient to view the online learning process as a game between
two players: the learner vs. the environment. On round t of the game, the environment picks an instance xt,
the learner predicts a label ŷt ∈ {+1, −1}, and finally the environment outputs the true label, yt ∈ {+1, −1}.

Figure 1: An illustration of a shattered tree of depth 2 (root v1, left child v2, right child v3). The blue path
corresponds to the sequence of examples ((v1, 1), (v3, −1)). The tree is shattered by H = {h1, h2, h3, h4},
where the predictions of each hypothesis in H on the instances v1, v2, v3 are given in the table below (a
question mark means that hj(vi) can be either 1 or −1).

      v1   v2   v3
h1    −1   −1    ?
h2    −1    1    ?
h3     1    ?   −1
h4     1    ?    1

Suppose that the environment wants to make the learner err on the first M rounds of the game. Then, it must
output yt = −ŷt, and the only question is how to choose the instances xt in such a way that ensures that for
some h* ∈ H we have yt = h*(xt) for all t ∈ [M].
It makes sense to assume that the environment should pick xt based on the previous predictions of
the learner, ŷ1, . . . , ŷt−1. Since in our case we have yt = −ŷt we can also say that xt is a function of
y1, . . . , yt−1. We can represent this dependence using a complete binary tree of depth M (we define the
depth of the tree as the number of nodes in a path from the root to a leaf). We have 2^M − 1 nodes in such a
tree, and we attach an instance to each node. Let v1, . . . , v_{2^M −1} be these instances. We start from the root
of the tree, and set x1 = v1. At round t, we set xt = v_{it} where it is the current node. At the end of round t,
we go to the left child of it if yt = −1 or to the right child if yt = 1. That is, i_{t+1} = 2 it + (yt + 1)/2.
Unraveling the recursion we obtain it = 2^{t−1} + Σ_{j=1}^{t−1} ((yj + 1)/2) 2^{t−1−j}.
The above strategy for the environment succeeds only if for any (y1 , . . . , yM ) there exists h ∈ H such
that yt = h(xt ) for all t ∈ [M ]. This leads to the following definition.
Definition 2 (H-shattered tree) A shattered tree of depth d is a sequence of instances v1, . . . , v_{2^d −1} in X
such that for all labelings (y1, . . . , yd) ∈ {±1}^d there exists h ∈ H such that for all t ∈ [d] we have
h(v_{it}) = yt, where it = 2^{t−1} + Σ_{j=1}^{t−1} ((yj + 1)/2) 2^{t−1−j}.
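As a quick sanity check of the indexing, here is a small, purely illustrative Python helper that computes the node index it from the labels y1, . . . , yt−1, both by the recursion and by the closed-form expression:

def node_index_recursive(ys):
    """i_t after observing labels ys = (y_1, ..., y_{t-1}), each in {-1, +1}."""
    i = 1                              # i_1 = 1 (the root)
    for y in ys:
        i = 2 * i + (y + 1) // 2       # go to the left child on -1, right child on +1
    return i

def node_index_closed_form(ys):
    t = len(ys) + 1
    return 2 ** (t - 1) + sum(((y + 1) // 2) * 2 ** (t - 1 - j) for j, y in enumerate(ys, start=1))

assert node_index_recursive([1, -1]) == node_index_closed_form([1, -1]) == 6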
An illustration of a shattered tree of depth 2 is given in Figure 1.
Definition 3 (Littlestone's dimension (Ldim)) Ldim(H) is the maximal integer M such that there exists a
shattered tree of depth M .
The definition of Ldim and the discussion above immediately imply the following:
Lemma 1 If Ldim(H) = M then no algorithm can have a mistake bound strictly smaller than M .
Proof Let v1, . . . , v_{2^M −1} be a sequence of instances that satisfies the requirements in the definition of Ldim. If the
environment sets xt = v_{it} and yt = −ŷt for all t ∈ [M], then the learner makes M mistakes while the
definition of Ldim implies that there exists a hypothesis h ∈ H such that yt = h(xt) for all t.
Let us now give several examples.
Example 1 Let H be a finite hypothesis class. Clearly, any tree that is shattered by H has depth of at most
log2 (|H|). Therefore, Ldim(H) ≤ log2 (|H|).
Example 2 Let X = R and H = {x 7→ sign(x − a) : a ∈ R}. Then, Ldim(H) = ∞. (proof is left as an
exercise).
Example 3 Let X = {x ∈ {0, 1}* : ||x||_0 ≤ r} and H = {x ↦ sign(⟨w, x⟩) : ||w||_0 ≤ k}. The size of H
is infinite. Nevertheless, Ldim(H) ≤ r k. (The proof uses the Perceptron mistake bound and is left as an
exercise.)
Lemma 1 states that Ldim(H) lower bounds the mistake bound of any algorithm. Interestingly, there is a
standard algorithm whose mistake bound matches this lower bound. The algorithm is similar to the Halving
algorithm. Recall that the prediction of Halving is according to a majority vote of the hypotheses which
are consistent with previous examples. We denoted this set by Vt. Put another way, Halving partitions Vt
into two sets: Vt+ = {h ∈ Vt : h(xt) = 1} and Vt− = {h ∈ Vt : h(xt) = −1}. It then predicts according to
the larger of the two sets. The rationale behind this prediction is that whenever Halving makes a mistake
it ends up with |Vt+1| ≤ 0.5 |Vt|.
The optimal algorithm we present below uses the same idea, but instead of predicting according to the
larger class, it predicts according to the class with larger Ldim.
Algorithm 4 Standard Optimal Algorithm (SOA)
Input: A hypothesis class H
Initialize: V1 = H
For t = 1, 2, . . .
Receive xt
For r ∈ {±1} let Vt^(r) = {h ∈ Vt : h(xt) = r}
Predict ŷt = argmax_r Ldim(Vt^(r))
Receive true answer yt
Update Vt+1 = Vt^(yt)
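To make the definitions concrete, here is an illustrative (and exponential-time) Python sketch that computes Ldim of a tiny finite class by brute force, directly from the recursive structure of shattered trees, and uses it for the SOA prediction step. The representation of hypotheses as callables over a finite instance set X is an assumption made for the sketch.

def ldim(H, X):
    """Brute-force Littlestone dimension of a finite class H (callables) over a finite set X.
    A tree of depth M rooted at x needs both label-induced subclasses to shatter depth M-1 trees."""
    if not H or not X:
        return 0
    best = 0
    for x in X:
        H_plus = [h for h in H if h(x) == 1]
        H_minus = [h for h in H if h(x) == -1]
        if H_plus and H_minus:
            best = max(best, 1 + min(ldim(H_plus, X), ldim(H_minus, X)))
    return best

def soa_predict(V, X, x):
    """One prediction step of the Standard Optimal Algorithm."""
    V_plus = [h for h in V if h(x) == 1]
    V_minus = [h for h in V if h(x) == -1]
    return 1 if ldim(V_plus, X) >= ldim(V_minus, X) else -1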
The following lemma formally establishes the optimality of the above algorithm.
Lemma 2 SOA enjoys the mistake bound M = Ldim(H).
Proof It suffices to prove that whenever the algorithm makes a prediction mistake we have Ldim(Vt+1) ≤
Ldim(Vt) − 1. We prove this claim by assuming the contrary, that is, Ldim(Vt+1) = Ldim(Vt). If this holds
true, then the definition of ŷt implies that Ldim(Vt^(r)) = Ldim(Vt) for both r = 1 and r = −1. But, in this
case we can construct a tree of depth Ldim(Vt) + 1 that satisfies the requirements given in the definition of
Ldim for the class Vt, which leads to the desired contradiction.
Combining Lemma 2 and Lemma 1 we obtain:
Corollary 1 Let H be a hypothesis class with Ldim(H) = M . Then, the standard optimal algorithm enjoys
the mistake bound M and no other algorithm can have a mistake bound strictly smaller than M .
3.1 Comparison to VC dimension
In this section we compare the online learning dimension Ldim with the VC dimension we used for
determining the learnability of a class in the batch learning model.
To remind the reader, the VC dimension of a class H, denoted VCdim(H), is the maximal number d such
that there are instances x1 , . . . , xd that are shattered by H. That is, for any sequence of labels (y1 , . . . , yd ) ∈
{+1, −1}d there exists a hypothesis h ∈ H that gives exactly this sequence of labels.
Theorem 4 For any class H we have VCdim(H) ≤ Ldim(H).
Figure 2: How to construct a shattered tree from a shattered sequence x1, . . . , xd: all nodes at depth i are labeled by the instance xi.
Proof Suppose VCdim(H) = d and let x1, . . . , xd be a shattered set. We now construct a complete binary
tree of instances v1, . . . , v_{2^d −1}, where all nodes at depth i are set to be xi (see the illustration in Figure 2).
Now, the definition of a shattered sample clearly implies that we obtain a valid shattered tree of depth d, and
our proof is completed.
Corollary 2 For a finite hypothesis class H, we have
VCdim(H) ≤ Ldim(H) ≤ log2(|H|) .
Both inequalities in Corollary 2 can be strict as the following examples show.
Example 4 Consider again the class of initial segments on the real numbers. That is, X = R and H =
{x 7→ sign(x − a) : a ∈ R}. We have shown that the VC dimension of H is 1 while the Littlestone dimension
is ∞.
Example 5 Let X = {1, . . . , d} and H = {h1, . . . , hd} where hi(x) = 1 iff x = i. Then, it is easy to show
that Ldim(H) = 1 while |H| = d can be arbitrarily large.
Advanced Course in Machine Learning
Spring 2011
Online Convex Optimization
Lecturer: Shai Shalev-Shwartz
Scribe: Shai Shalev-Shwartz
A convex repeated game is a two-player game that is played in a sequence of consecutive rounds.
On round t of the repeated game, the first player chooses a vector wt from a convex set A. Next, the second
player responds with a convex function gt : A → R. Finally, the first player suffers an instantaneous loss
gt (wt ). We study the game from the viewpoint of the first player.
In offline convex optimization, the goal is to find a vector w within a convex set A that minimizes a
convex objective function, g : A → R. In online convex optimization, the set A is known in advance, but the
objective function may change along the online process.
The goal of the online optimizer, which we call the
learner, is to minimize the averaged objective value (1/T) Σ_{t=1}^{T} gt(wt), where T is the total number of rounds.
Low regret: Naturally, an adversary can make the cumulative loss of our online learning algorithm arbitrarily large. For example, the second player can always set gt (w) = 1 and then no matter what the learner
will predict, the cumulative loss will be T . To overcome this deficiency, we restate the learner’s goal based
on the notion of regret. The learner’s regret is the difference between his cumulative loss and the cumulative
loss of the optimal offline minimizer. This is termed ’regret’ since it measures how ’sorry’ the learner is, in
retrospect, not to use the optimal offline minimizer. That is, the regret is
R(T) = (1/T) Σ_{t=1}^{T} gt(wt) − min_{w∈A} (1/T) Σ_{t=1}^{T} gt(w) .
We call an online algorithm a low regret algorithm if R(T ) = o(1). In this lecture we will study low regret
algorithms for online convex optimization. We will also show how several familiar algorithms, like the
Perceptron and Weighted Majority, can be derived from an online convex optimizer.
We start with a brief overview of basic notions from convex analysis.
4 Convexity
A set A is convex if for any two vectors w1, w2 in A, the entire line segment between w1 and w2 is also within A. That
is, for any α ∈ [0, 1] we have that αw1 + (1 − α)w2 ∈ A. A function f : A → R is convex if for all
u, v ∈ A and α ∈ [0, 1] we have
f (αu + (1 − α)v) ≤ αf (u) + (1 − α)f (v) .
It is easy to verify that f is convex iff its epigraph is a convex set, where epigraph(f) = {(x, α) : f(x) ≤ α}.
We allow f to output ∞ for some inputs x. This is a convenient way to restrict the domain of f to a
proper subset of X. So, in this section we use R to denote the real numbers together with the special symbol ∞. The
domain of f : X → R is defined as dom(f) = {x : f(x) < ∞}.
A set A is open if every point in A has a neighborhood lying in A. A set A is closed if its complement
is an open set. A function f is closed if for any finite scalar α, the level set {w : f (w) ≤ α} is closed.
Throughout, we focus on closed and convex functions.
4.1 Sub-gradients
A vector λ is a sub-gradient of a function f at w if for all u ∈ A we have that

f(u) − f(w) ≥ ⟨u − w, λ⟩ .
The differential set of f at w, denoted ∂f (w), is the set of all sub-gradients of f at w. For scalar functions, a
sub-gradient of a convex function f at x is a slope of a line that touches f at x and is not above f everywhere.
Two useful properties of subgradients are given below:
1. If f is differentiable at w then ∂f (w) consists of a single vector which amounts to the gradient of f
at w and is denoted by ∇f (w). In finite dimensional spaces, the gradient of f is the vector of partial
derivatives of f .
2. If g(w) = max_{i∈[r]} gi(w) for r differentiable functions g1, . . . , gr, and j = argmax_i gi(u), then the
gradient of gj at u is a sub-gradient of g at u.
Example 6 (Sub-gradients of the logistic-loss) Recall that the logistic-loss is defined as ℓ(w; x, y) =
log(1 + exp(−y⟨w, x⟩)). Since this function is differentiable, a sub-gradient at w is the gradient at w,
which using the chain rule equals

∇ℓ(w; x, y) = (−exp(−y⟨w, x⟩) / (1 + exp(−y⟨w, x⟩))) y x = (−1 / (1 + exp(y⟨w, x⟩))) y x .
Example 7 (Sub-gradients of the hinge-loss) Recall that the hinge-loss is defined as ℓ(w; x, y) =
max{0, 1 − y⟨w, x⟩}. This is the maximum of two linear functions. Therefore, using the two properties
above we have that if 1 − y⟨w, x⟩ > 0 then −y x ∈ ∂ℓ(w; x, y) and if 1 − y⟨w, x⟩ < 0 then 0 ∈ ∂ℓ(w; x, y).
Furthermore, it is easy to verify that

∂ℓ(w; x, y) = {−y x}                       if 1 − y⟨w, x⟩ > 0
              {0}                          if 1 − y⟨w, x⟩ < 0
              {−α y x : α ∈ [0, 1]}        if 1 − y⟨w, x⟩ = 0
Figure 3: An illustration of the hinge-loss function f(x) = max{0, 1 − x} and one of its sub-gradients at x = 1.
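As a small illustration, here is a Python sketch of a sub-gradient computation for the hinge-loss, following the case analysis of Example 7 (at the kink 1 − y⟨w, x⟩ = 0 it returns 0, which corresponds to the valid choice α = 0):

def hinge_subgradient(w, x, y):
    """Return one sub-gradient of l(w; x, y) = max(0, 1 - y*<w, x>) with respect to w."""
    margin = 1.0 - y * sum(wi * xi for wi, xi in zip(w, x))
    if margin > 0:
        return [-y * xi for xi in x]    # loss is active: gradient of 1 - y*<w, x>
    return [0.0] * len(w)               # loss is zero or at the kink: 0 is a valid sub-gradient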
4.2 Lipschitz functions
We say that f : A → R is ρ-Lipschitz if for all u, v ∈ A

|f(u) − f(v)| ≤ ρ ||u − v|| .

An equivalent definition is that the ℓ2 norm of all sub-gradients of f at points in A is bounded by ρ.
More generally, we say that a convex function is V-Lipschitz w.r.t. a norm || · || if for all x ∈ X there exists
v ∈ ∂f(x) with ||v||_* ≤ V. Of particular interest are p-norms, ||x||_p = (Σ_i |xi|^p)^{1/p}.
4.3 Dual norms
Given a norm || · ||, its dual norm is defined by

||y||_* = sup_{x : ||x|| ≤ 1} ⟨x, y⟩ .

For example, the Euclidean norm is dual to itself. More generally, for any p, q ≥ 1 such that 1/p + 1/q = 1,
the norms

||x||_p = (Σ_i |xi|^p)^{1/p} ;   ||y||_q = (Σ_i |yi|^q)^{1/q}

are dual norms. The above also holds for ||x||_1 and ||y||_∞ = max_i |yi|.
4.4 Fenchel Conjugate
There are two equivalent representations of a convex function: either as pairs (x, f(x)), or as the set of
tangents of f, namely pairs (slope, intersection-with-y-axis). The function that relates slopes of tangents to
their intersection with the y-axis is called the Fenchel conjugate of f.

[Figure omitted: the set of points (w, f(w)) versus the set of tangents; a tangent of slope θ intersects the y-axis at −f*(θ).]

Formally, the Fenchel conjugate of f is defined as

f*(θ) = max_x ⟨x, θ⟩ − f(x) .
Few remarks:
• The definition immediately implies the Fenchel-Young inequality: for all u,

  f*(θ) = max_x ⟨x, θ⟩ − f(x) ≥ ⟨u, θ⟩ − f(u) .

• If f is closed and convex then f** = f.
• By the way, this implies Jensen's inequality:

  f(E[x]) = max_θ ⟨θ, E[x]⟩ − f*(θ) = max_θ E[⟨θ, x⟩ − f*(θ)] ≤ E[ max_θ ⟨θ, x⟩ − f*(θ) ] = E[f(x)] .
Several examples of Fenchel conjugate functions are given below.

f(x)                                      f*(θ)
(1/2)||x||²                               (1/2)||θ||_*²
||x||                                     Indicator of the unit || · ||_* ball
Σ_i wi log(wi)                            log Σ_i e^{θi}
Indicator of the probability simplex      max_i θi
c g(x) for c > 0                          c g*(θ/c)
inf_{x'} ( g1(x') + g2(x − x') )          g1*(θ) + g2*(θ)

4.5 Strong Convexity–Strong Smoothness Duality
Recall that the domain of f : X → R is {x : f (x) < ∞}. We first define strong convexity.
Definition 4 A function f : X → R is β-strongly convex w.r.t. a norm || · || if for all x, y in the relative
interior of the domain of f and α ∈ (0, 1) we have

f(αx + (1 − α)y) ≤ α f(x) + (1 − α) f(y) − (β/2) α(1 − α) ||x − y||² .
We now define strong smoothness. Note that a strongly smooth function f is always finite.
Definition 5 A function f : X → R is β-strongly smooth w.r.t. a norm || · || if f is everywhere differentiable
and if for all x, y we have

f(x + y) ≤ f(x) + ⟨∇f(x), y⟩ + (β/2) ||y||² .
The following theorem states that strong convexity and strong smoothness are dual properties. Recall that
the biconjugate f ?? equals f if and only if f is closed and convex.
Theorem 5 (Strong/Smooth Duality) Assume that f is a closed and convex function. Then f is β-strongly
convex w.r.t. a norm || · || if and only if f* is (1/β)-strongly smooth w.r.t. the dual norm || · ||_*.
Figure 4: Illustrations of strong convexity (left: the gap between f and its tangent at w is at least (β/2)||u − w||²) and strong smoothness (right: the gap between f* and its tangent at w is at most (1/(2β))||u − w||_*²).
Subtly, note that while the domain of a strongly convex function f may be a proper subset of X (important
for a number of settings), its conjugate f* always has a domain which is all of X (since if f* is strongly smooth
then it is finite and everywhere differentiable). The above theorem can be found, for instance, in Zalinescu
2002 (see Corollary 3.5.11 on p. 217 and Remark 3.5.3 on p. 218).
The following direct corollary of Theorem 5 is central in proving regret bounds. As we will show later, it
is also central in proving generalization bounds.
Corollary 3 If f is β-strongly convex w.r.t. || · || and f*(0) = 0, then, denoting the partial sum Σ_{j≤i} vj by
v1:i, we have, for any sequence v1, . . . , vn and for any u,

Σ_{i=1}^{n} ⟨vi, u⟩ − f(u) ≤ f*(v1:n) ≤ Σ_{i=1}^{n} ⟨∇f*(v1:i−1), vi⟩ + (1/(2β)) Σ_{i=1}^{n} ||vi||_*² .

Proof The first inequality is Fenchel-Young and the second follows from the definition of smoothness by induction.
Examples of strongly convex functions

Lemma 3 Let q ∈ [1, 2]. The function f : R^d → R defined as f(w) = (1/2)||w||_q² is (q − 1)-strongly convex
with respect to || · ||_q over R^d.

A proof can be found in Shalev-Shwartz 2007.
We mainly use the above lemma to obtain results with respect to the norms || · ||_2 and || · ||_1. The case
q = 2 is straightforward. Obtaining results with respect to || · ||_1 is slightly more tricky since for q = 1
the strong convexity parameter is 0 (meaning that the function is not strongly convex). To overcome this
problem, we shall set q to be slightly more than 1, e.g. q = ln(d)/(ln(d) − 1). For this choice of q, the strong
convexity parameter becomes q − 1 = 1/(ln(d) − 1) ≥ 1/ln(d), and the value of p corresponding to the dual
norm is p = (1 − 1/q)^{−1} = ln(d). Note that for any x ∈ R^d we have

||x||_∞ ≤ ||x||_p ≤ (d ||x||_∞^p)^{1/p} = d^{1/p} ||x||_∞ = e ||x||_∞ ≤ 3 ||x||_∞ .

Hence the dual norms are also equivalent up to a factor of 3: ||w||_1 ≥ ||w||_q ≥ ||w||_1/3. The above lemma
therefore implies the following corollary.
Corollary 4 The function f : R^d → R defined as f(w) = (1/2)||w||_q² for q = ln(d)/(ln(d) − 1) is
1/(9 ln(d))-strongly convex with respect to || · ||_1 over R^d.
Another important example is given in the following lemma.

Lemma 4 The function f : R^d → R defined as f(x) = Σ_i xi log(xi) is 1-strongly convex with respect to
|| · ||_1 over the domain {x : ||x||_1 = 1, x ≥ 0}.
The proof can also be found in Shalev-Shwartz 2007.
5 An algorithmic framework for Online Convex Optimization
Algorithm 5 provides one common algorithm which achieves the following regret bound. It is one of a family
of algorithms that enjoy the same regret bound (see Shalev-Shwartz 2007).

Algorithm 5 Online Mirror Descent
Initialize: w1 ← ∇f*(0)
for t = 1 to T
Play wt ∈ A
Receive gt and pick vt ∈ ∂gt(wt)
Update wt+1 ← ∇f*(−η Σ_{s=1}^{t} vs)
end for
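To make the framework concrete, here is a small Python sketch of Algorithm 5 for one particular illustrative choice: A = {w : ||w||_2 ≤ B} and f(w) = (1/2)||w||_2² restricted to A, for which ∇f*(θ) is simply the Euclidean projection of θ onto A. The interface (losses supplied through a callable returning a sub-gradient) is an assumption made for the sketch.

import math

def project_onto_ball(theta, B):
    """Euclidean projection onto {w : ||w||_2 <= B}; equals grad f*(theta) for f = 0.5*||.||^2 on the ball."""
    norm = math.sqrt(sum(v * v for v in theta))
    if norm <= B:
        return list(theta)
    return [B * v / norm for v in theta]

def online_mirror_descent(subgradient, T, dim, eta, B=1.0):
    """subgradient: function (t, w) -> a sub-gradient v_t of g_t at w. Returns the played points."""
    theta = [0.0] * dim                      # running sum of -eta * v_s
    w = project_onto_ball(theta, B)          # w_1 = grad f*(0)
    played = []
    for t in range(1, T + 1):
        played.append(w)
        v = subgradient(t, w)
        theta = [th - eta * vi for th, vi in zip(theta, v)]
        w = project_onto_ball(theta, B)      # w_{t+1} = grad f*(-eta * sum_{s<=t} v_s)
    return played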
Theorem 6 Suppose Algorithm 5 is used with a function f that is β-strongly convex w.r.t. a norm || · || on A
and has f*(0) = 0. Suppose the loss functions gt are convex and V-Lipschitz w.r.t. the norm || · ||. Then, the
algorithm run with any positive η enjoys the regret bound

Σ_{t=1}^{T} gt(wt) − min_{u∈A} Σ_{t=1}^{T} gt(u) ≤ max_{u∈A} f(u)/η + η V² T/(2β) .

In particular, choosing η = √(2β max_u f(u) / (V² T)) we obtain the regret bound

Σ_{t=1}^{T} gt(wt) − min_{u∈A} Σ_{t=1}^{T} gt(u) ≤ V √(2 max_{u∈A} f(u) T / β) .
Proof Apply Corollary 3 to the sequence −ηv1, . . . , −ηvT to get, for all u,

−η Σ_{t=1}^{T} ⟨vt, u⟩ − f(u) ≤ −η Σ_{t=1}^{T} ⟨vt, wt⟩ + (1/(2β)) Σ_{t=1}^{T} ||η vt||_*² .

Using the fact that gt is V-Lipschitz, we get ||vt||_* ≤ V. Plugging this into the inequality above and
rearranging gives

Σ_{t=1}^{T} ⟨vt, wt − u⟩ ≤ f(u)/η + η V² T/(2β) .

By convexity of gt, gt(wt) − gt(u) ≤ ⟨vt, wt − u⟩. Therefore,

Σ_{t=1}^{T} gt(wt) − Σ_{t=1}^{T} gt(u) ≤ f(u)/η + η V² T/(2β) .

Since the above holds for all u ∈ A the result follows.
6 Tightness of regret bounds
In the previous sections we presented algorithmic frameworks for online convex programming with regret
bounds that depend on √T and on the complexity of the competing vector as measured by f(u). In this
section we show that without imposing additional conditions our bounds are tight, in a sense that is articulated
below.
First, we study the dependence of the regret bounds on √T.
Theorem 7 For any online convex programming algorithm, there exists a sequence of 1-Lipschitz convex
functions of length T such that the regret of the algorithm on this sequence with respect to a vector u with
||u||_2 ≤ 1 is Ω(√T).
Proof The proof is based on the probabilistic method. Let S = [−1, 1] and f(w) = (1/2)w². We clearly
have that f(w) ≤ 1/2 for all w ∈ S. Consider a sequence of linear functions gt(w) = σt w where σt ∈
{+1, −1}. Note that gt is 1-Lipschitz for all t. Suppose that the sequence σ1, . . . , σT is chosen in advance,
independently and with equal probability. Since wt only depends on σ1, . . . , σt−1 it is independent of σt.
Thus, E_{σt}[gt(wt) | σ1, . . . , σt−1] = 0. On the other hand, for any σ1, . . . , σT we have that

min_{u∈S} Σ_{t=1}^{T} gt(u) ≤ −| Σ_{t=1}^{T} σt | .

Therefore,

E_{σ1,...,σT}[Regret] = E_{σ1,...,σT}[ Σ_t gt(wt) ] − E_{σ1,...,σT}[ min_u Σ_t gt(u) ] ≥ 0 + E_{σ1,...,σT}[ | Σ_t σt | ] .

The right-hand side above is the expected distance after T steps of a random walk and is thus approximately
equal to √(2T/π) = Ω(√T). Our proof is concluded by noting that if the expected value of the regret is at
least Ω(√T), then there must exist at least one specific sequence for which the regret is at least Ω(√T).
Next, we study the dependence on the complexity function f (w).
Theorem 8 For any online convex programming algorithm, there exists a strongly convex function f with
respect to a norm || · ||, a sequence of 1-Lipschitz convex functions with respect to || · ||_*, and a vector u, such
that the regret of the algorithm on this sequence is Ω(f(u)).

Proof Let S = R^T and f(w) = (1/2)||w||_2². Consider the sequence of functions in which
gt(w) = [1 − yt⟨w, et⟩]_+, where yt ∈ {+1, −1} and et is the t-th vector of the standard basis, namely,
et,i = 0 for all i ≠ t and et,t = 1. Note that gt(w) is 1-Lipschitz with respect to the Euclidean norm.
Consider any online learning algorithm. If on round t we have wt,t ≥ 0 then we set yt = −1 and otherwise
we set yt = 1. Thus, gt(wt) ≥ 1. On the other hand, the vector u = (y1, . . . , yT) satisfies f(u) ≤ T/2 and
gt(u) = 0 for all t. Thus, Regret ≥ T = Ω(f(u)).
7 Convexification
The algorithms we derived previously are based on the assumption that for each round t, the loss function
gt (w) is convex with respect to w. A well known example in which this assumption does not hold is online
classification with the 0-1 loss function. In this section we discuss to what extent online convex optimization
can be used for online learning with the 0-1 loss. In particular, we will show how the Perceptron and Weighted
Majority algorithms can be derived from the general online mirror descent framework.
In online binary classification problems, at each round we receive an instance x ∈ X ⊂ R^n and we need
to predict a label ŷt ∈ Y = {+1, −1}. Then, we receive the correct label yt ∈ Y and suffer the 0-1 loss:
1[yt ≠ ŷt].
We first show that no algorithm can obtain a sub-linear regret bound for the 0-1 loss function. To do so,
let X = {1}, so our problem boils down to finding the bias of a coin in an online manner. An adversary can
make the number of mistakes of any online algorithm equal to T, by simply waiting for the learner's
prediction and then providing the opposite answer as the true answer. In contrast, the number of mistakes of
the constant prediction u = sign(Σ_t yt) is at most T/2. Therefore, the regret of any online algorithm with
respect to the 0-1 loss function will be at least T/2. This impossibility result is attributed to Cover.
To overcome the above impossibility result, two solutions have been proposed.
7.1 Surrogate convex loss and the Perceptron
The first solution is to find a convex loss function that upper bounds the original non-convex loss function. We
describe this solution for the problem of online learning of halfspaces. Recall that the set of linear hypotheses
is:

H = {x ↦ ⟨w, x⟩ : ||w||_2 ≤ U} .

In the context of binary classification, the actual prediction is sign(⟨w, x⟩) and the 0-1 loss function of w
on an example (x, y) is ℓ_0-1(w, (x, y)) = 1[sign(⟨w,x⟩) ≠ y].
A popular surrogate convex loss function is the hinge-loss, defined as

ℓ_hinge(w, (x, y)) = [1 − y⟨w, x⟩]_+ ,
where [a]_+ = max{0, a}. It is straightforward to verify that ℓ_hinge(w, (x, y)) is a convex function (w.r.t. w)
and that ℓ_hinge(w, (x, y)) ≥ ℓ_0-1(w, (x, y)). Therefore, for any u ∈ A we have

Regret(T) = Σ_{t=1}^{T} ℓ_hinge(wt, (xt, yt)) − Σ_{t=1}^{T} ℓ_hinge(u, (xt, yt))
          ≥ Σ_{t=1}^{T} ℓ_0-1(wt, (xt, yt)) − Σ_{t=1}^{T} ℓ_hinge(u, (xt, yt)) .
As a direct corollary from the above inequality we get that a low regret algorithm for the (convex) hinge-loss
function can be utilized for deriving an online learning algorithm for the 0-1 loss with the bound

Σ_{t=1}^{T} ℓ_0-1(wt, (xt, yt)) ≤ Σ_{t=1}^{T} ℓ_hinge(u, (xt, yt)) + Regret(T) .
Furthermore, denote by M the set of rounds in which sign(⟨wt, xt⟩) ≠ yt and note that the left-hand side
of the above is equal to |M|. We can remove the examples not in M from the sequence of examples, and
run an online convex optimization algorithm only on the sequence of examples in M. In particular, applying
Algorithm 5 with f(w) = (1/2)||w||_2² we obtain the well known Perceptron algorithm:
Algorithm 6 Perceptron
Initialize: w1 ← 0
for t = 1 to T
Receive xt
Predict ŷt = sign(hwt , xt i)
Receive yt
If ŷt 6= yt
Update wt+1 ← wt + yt xt
Else
Update wt+1 ← wt
End if
end for
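A minimal Python sketch of Algorithm 6, with instances represented as lists of floats (sign(0) is taken to be +1, an arbitrary illustrative choice):

def perceptron(stream, dim):
    """Run the Perceptron on a stream of (x, y) pairs with y in {-1, +1}; returns (w, mistakes)."""
    w = [0.0] * dim
    mistakes = 0
    for x, y in stream:
        y_hat = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
        if y_hat != y:
            mistakes += 1
            w = [wi + y * xi for wi, xi in zip(w, x)]   # w_{t+1} = w_t + y_t x_t on a mistake
    return w, mistakes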
To analyze the Perceptron, we note that an update of the form wt+1 = wt + η yt xt will yield the same
algorithm, no matter what the value of η is. Therefore, we can use the regret bound given in Theorem 6 on
the sequence of rounds in M and the set A = {w : ||w||_2 ≤ U} and get that for any u with ||u||_2 ≤ U we
have

|M| ≤ Σ_{t∈M} ℓ_hinge(u, (xt, yt)) + X U √|M| ,        (2)

where X = max_{t∈M} ||xt||_2. It is easy to verify that this implies

|M| ≤ Σ_{t∈M} ℓ_hinge(u, (xt, yt)) + X U √( Σ_{t∈M} ℓ_hinge(u, (xt, yt)) ) + X² U² .
Such a bound is called a relative mistake bound.
7.2 Randomization and Weighted Majority
Another way to circumvent Cover's impossibility result is by relying on randomization. We demonstrate
this idea using the setting of prediction with expert advice. In this setting, at each round the online learning
algorithm receives the advice of d experts, denoted (f_1^t, . . . , f_d^t) ∈ {0, 1}^d. Based on the predictions of the
experts, the algorithm should predict ŷt. To simplify notation, we assume in this subsection that Y = {0, 1}
and not {−1, +1}.
The following algorithm for prediction with expert advice is due to Littlestone and Warmuth.
Algorithm 7 Learning with Expert Advice (Weighted Majority)
input: Number of experts d ; Learning rate η > 0
initialize: θ^0 = (0, . . . , 0) ∈ R^d ; Z0 = d
for t = 1, 2, . . . , T
receive expert advice (f_1^t, f_2^t, . . . , f_d^t) ∈ {0, 1}^d
environment determines yt without revealing it to the learner
choose it at random according to the distribution defined by w_i^{t−1} = exp(θ_i^{t−1}) / Z_{t−1}
predict ŷt = f_{it}^t
receive label yt
update: ∀i, θ_i^t = θ_i^{t−1} − η |f_i^t − yt| ; Zt = Σ_{i=1}^{d} exp(θ_i^t)
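A minimal Python sketch of Algorithm 7; the expert advice for round t is a list of d values in {0, 1}, and the learning rate η is left as a parameter (a natural illustrative choice, suggested by the analysis below, is η = √(2 log(d)/T)):

import math
import random

def expert_advice_wm(advice_rounds, labels, eta, rng=random.Random(0)):
    """Algorithm 7: advice_rounds[t] is a list of d expert predictions in {0, 1}, labels[t] is y_t.
    Returns the total loss sum_t |yhat_t - y_t| of the randomized predictions."""
    d = len(advice_rounds[0])
    theta = [0.0] * d
    total_loss = 0
    for f, y in zip(advice_rounds, labels):
        weights = [math.exp(th) for th in theta]        # w^{t-1}_i proportional to exp(theta_i)
        i = rng.choices(range(d), weights=weights)[0]   # sample expert i_t
        total_loss += abs(f[i] - y)                     # predict yhat_t = f_{i_t}^t
        theta = [theta[j] - eta * abs(f[j] - y) for j in range(d)]   # penalize experts that erred
    return total_loss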
To analyze the Weighted Majority algorithm we first note that the definition of ŷt clearly implies:

E[|ŷt − yt|] = Σ_{i=1}^{d} w_i^{t−1} |f_i^t − yt| = ⟨w^{t−1}, v^t⟩ ,        (3)
where v^t is the vector whose i-th element is |f_i^t − yt|. Based on this presentation, the update rule is equivalent
to the update of Algorithm 5 on the sequence of functions gt(w) = ⟨w, v^t⟩ with the strongly convex function
f(w) = Σ_i wi log(wi) + log(d). Letting A be the probability simplex and applying Theorem 6 we obtain
that

Σ_{t=1}^{T} gt(wt) − min_{u∈A} Σ_{t=1}^{T} gt(u) ≤ √(2 log(d) T) .
In particular, the above holds for any u = ei. Combining the above with Eq. (3) we obtain

E[ Σ_{t=1}^{T} |ŷt − yt| ] ≤ min_i Σ_{t=1}^{T} |f_i^t − yt| + √(2 log(d) T) .
Remark: Seemingly, we obtained a result w.r.t. the 0-1 loss function, which contradicts Cover’s impossibility result. There is no contradiction here because we force the environment to determine yt before observing
the random coins flipped by the learner. In contrast, in Cover’s impossibility result, the environment can
choose yt after observing ŷt .
8 Follow the Regularized Leader
In this section we derive a different algorithmic approach in which the weight vector we use in round t + 1 is
defined as

wt+1 = argmin_{w∈A} Σ_{s≤t} gs(w) + (1/η) f(w) .

This rule is called Follow the Regularized Leader (FTRL). It is also convenient to define an unconstrained
version

w̃t+1 = argmin_w Σ_{s≤t} gs(w) + (1/η) f(w) .
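As an illustration, the following Python sketch implements FTRL for the special case of linear losses gs(w) = ⟨vs, w⟩ with the quadratic regularizer f(w) = (1/2)||w||_2² and A a Euclidean ball of radius B; for these choices the FTRL minimizer has a closed form (project −η Σ_{s≤t} vs onto A). These concrete choices are assumptions made for the sketch, not the general rule.

import math

def ftrl_linear(loss_vectors, eta, dim, B=1.0):
    """FTRL with f(w) = 0.5*||w||^2 and linear losses g_s(w) = <v_s, w> over the ball ||w|| <= B.
    loss_vectors: the vectors v_1, ..., v_T. Returns the points w_1, ..., w_T played."""
    cum = [0.0] * dim
    played = []
    for v in loss_vectors:
        # w_{t+1} = argmin_{||w||<=B} <sum_{s<=t} v_s, w> + (1/eta)*0.5*||w||^2
        #         = projection of -eta * sum_{s<=t} v_s onto the ball
        w = [-eta * c for c in cum]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > B:
            w = [B * wi / norm for wi in w]
        played.append(w)
        cum = [c + vi for c, vi in zip(cum, v)]
    return played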
The following lemma bounds the regret of FTRL in terms of a stability term.

Lemma 5 For all u ∈ A, the regret of FTRL satisfies

Σ_{t=1}^{T} (gt(wt) − gt(u)) ≤ Σ_{t=1}^{T} (gt(wt) − gt(wt+1)) + η^{−1} (f(u) − f(w1)) .
Proof We have

gt(wt) − gt(u) = gt(wt) − gt(wt+1) + gt(wt+1) − gt(u) .

Therefore, we need to show that

Σ_{t=1}^{T} (gt(wt+1) − gt(u)) ≤ η^{−1} (f(u) − f(w1)) .

We prove the last inequality by induction. The base case of T = 0 holds by the definition of w1. Now,
suppose the statement holds for T − 1, i.e.

Σ_{t=1}^{T−1} gt(wt+1) + η^{−1} f(w1) ≤ Σ_{t=1}^{T−1} gt(u) + η^{−1} f(u) .

Since it holds for any u ∈ A, it holds in particular for u = wT+1:

Σ_{t=1}^{T−1} gt(wt+1) + η^{−1} f(w1) ≤ Σ_{t=1}^{T−1} gt(wT+1) + η^{−1} f(wT+1) .

Adding gT(wT+1) to both sides,

Σ_{t=1}^{T} gt(wt+1) + η^{−1} f(w1) ≤ Σ_{t=1}^{T} gt(wT+1) + η^{−1} f(wT+1) .

The induction step follows from the above since wT+1 is the minimizer of the right-hand side over all
w ∈ A.
To derive a concrete bound from the above, one needs to show that the term Σ_t (gt(wt) − gt(wt+1)), which
measures the stability of the algorithm, is small. This can be done assuming that f(w) is strongly convex and
gt is Lipschitz, and one can derive bounds which are similar to the bounds for online mirror descent. In the
next section, we derive another family of bounds which involves local norms.
9 Another Mirror Descent Algorithm
We now present a slightly different mirror descent algorithm. We will show bounds that depend on local
norms, which will later on help us in deriving bounds for online learning in the limited feedback setting.
Throughout, we consider the problem of online linear optimization — this is the special case of online convex
optimization in which the functions are linear¹ and take the form gt(w) = ⟨w, xt⟩.
¹ The linearity assumption is not critical since we can reduce a general convex function to a linear one (this is left as an exercise).
The algorithm depends on a regularization function f : A → R and the definition of the update is based
on the Bregman divergence associated with f .
Definition 6 (Bregman divergence) Given a convex differentiable function f, the Bregman divergence
associated with f is defined as

Df(u, v) = f(u) − f(v) − ⟨∇f(v), u − v⟩ .
Algorithm 8 Online Mirror Descent II
Initialize: w1 = w̃1 = argmin_{w∈A} f(w)
for t = 1 to T
Play wt
Get xt and pay ⟨wt, xt⟩
Let w̃t+1 = argmin_w η⟨xt, w⟩ + Df(w, wt)
Update wt+1 = argmin_{w∈A} Df(w, w̃t+1)
end for
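As an illustration, here is a Python sketch of Algorithm 8 for the particular choice f(w) = Σ_i wi log(wi) and A the probability simplex; as Example 9 below verifies, the update then becomes a multiplicative update followed by normalization (the Bregman projection onto the simplex).

import math

def omd_entropy(loss_vectors, eta):
    """Algorithm 8 with the entropic regularizer over the probability simplex.
    loss_vectors: the vectors x_1, ..., x_T. Returns the cumulative loss sum_t <w_t, x_t>."""
    d = len(loss_vectors[0])
    w = [1.0 / d] * d                    # w_1 minimizes the negative entropy over the simplex
    total = 0.0
    for x in loss_vectors:
        total += sum(wi * xi for wi, xi in zip(w, x))
        w_tilde = [wi * math.exp(-eta * xi) for wi, xi in zip(w, x)]   # unconstrained step
        Z = sum(w_tilde)
        w = [v / Z for v in w_tilde]     # Bregman (KL) projection onto the simplex = normalization
    return total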
To analyze the algorithm we start with the following lemma.

Lemma 6 For all u ∈ A:

Σ_{t=1}^{T} ⟨xt, wt − u⟩ ≤ Σ_{t=1}^{T} ⟨xt, wt − w̃t+1⟩ + (f(u) − f(w̃1)) / η .
Proof By the optimality condition and the definition of the Bregman divergence we have

η xt + ∇f(w̃t+1) − ∇f(wt) = 0   ⇒   ∇f(wt) − ∇f(w̃t+1) = η xt .

Therefore

η⟨xt, wt − u⟩ = ⟨∇f(wt) − ∇f(w̃t+1), wt − u⟩
             = Df(u, wt) − Df(u, w̃t+1) + Df(wt, w̃t+1) ,

where the last equality follows from the definition of the Bregman divergence. Since wt is the projection
of w̃t on A and since u ∈ A, we know from the Bregman projection lemma (see the Censor and Lent book or
Cesa-Bianchi and Lugosi) that Df(u, wt) ≤ Df(u, w̃t). Thus,

η⟨xt, wt − u⟩ ≤ Df(u, w̃t) − Df(u, w̃t+1) + Df(wt, w̃t+1) .

Furthermore, from the non-negativity of the Bregman divergence we have

Df(wt, w̃t+1) ≤ Df(wt, w̃t+1) + Df(w̃t+1, wt)
             = ⟨∇f(wt) − ∇f(w̃t+1), wt − w̃t+1⟩
             = ⟨η xt, wt − w̃t+1⟩ .
Combining the above two inequalities and summing over t we obtain

Σ_{t=1}^{T} η⟨xt, wt − u⟩ ≤ Σ_{t=1}^{T} (Df(u, w̃t) − Df(u, w̃t+1)) + Σ_{t=1}^{T} η⟨xt, wt − w̃t+1⟩
                         = Df(u, w̃1) − Df(u, w̃T+1) + Σ_{t=1}^{T} η⟨xt, wt − w̃t+1⟩
                         ≤ f(u) − f(w̃1) + Σ_{t=1}^{T} η⟨xt, wt − w̃t+1⟩ .

Dividing the above by η we conclude our proof.
To derive concrete bounds from the above we need to further upper bound the sum over ⟨xt, wt − w̃t+1⟩.
We start with a simple example.
Example 8 Let f(w) = (1/2)||w||_2². Then ∇f(w) = w and Df(w, u) = (1/2)||w − u||_2². Therefore,
w̃t+1 = wt − η xt and we obtain

⟨xt, wt − w̃t+1⟩ = ⟨xt, η xt⟩ = η ||xt||_2² .

Thus, if ||xt||_2 ≤ Bx for all t and if A is contained in the ℓ2 ball of radius Bw we obtain the regret bound
η T Bx² + Bw²/(2η). Optimizing over η gives

Σ_{t=1}^{T} ⟨xt, wt − u⟩ ≤ Bw Bx √(2T) .
Next, we derive an improved bound for the entropy regularization.

Example 9 Let f(w) = Σ_i wi log(wi) and A = {w ≥ 0 : ||w||_1 = 1}. Then w̃1 = (1/d, . . . , 1/d) and
f(w̃1) = −log(d). In addition, for all u ∈ A we have f(u) ≤ 0. By the optimality condition we have

∇f(w̃t+1) = ∇f(wt) − η xt .

Since ∇_i f(w) = log(wi) + 1, the above implies:

log(w̃t+1,i) = log(wt,i) − η xt,i   ⇒   w̃t+1,i = wt,i exp(−η xt,i) .

In addition, it is easy to verify that wt+1,i = w̃t+1,i / ||w̃t+1||_1. Therefore, we obtained the Weighted Majority
algorithm. To analyze the algorithm let us assume that xt,i ≥ 0 for all i. Using the inequality 1 − exp(−a) ≤ a
we obtain

⟨xt, wt − w̃t+1⟩ = Σ_i xt,i wt,i (1 − exp(−η xt,i)) ≤ η Σ_i wt,i xt,i² .

We can further bound the above by max_i xt,i², but if xt,i also depends on wt,i we will obtain an improved
bound. This will be the topic of the next lecture.