Advanced Course in Machine Learning
Spring 2011
Online Learning
Lecturer: Shai Shalev-Shwartz
Scribe: Shai Shalev-Shwartz
In this lecture we describe a different model of learning which is called online learning. Online learning
takes place in a sequence of consecutive rounds. To demonstrate the online learning model, consider again
the papaya tasting problem. On each online round, the learner first receives an instance (the learner buys a
papaya and knows its shape and color, which form the instance). Then, the learner is required to predict a
label (is the papaya tasty?). At the end of the round, the learner gets the correct label (he tastes the papaya and
then knows if it’s tasty or not). Finally, the learner uses this information to improve his future predictions.
Previously, we used the batch learning model in which we first use a batch of training examples to learn
a hypothesis, and only after learning is completed is the learned hypothesis tested. In our papayas learning
problem, we should first buy a bunch of papayas and taste them all. Then, we use all of this information to
learn a prediction rule that determines the taste of new papayas. In contrast, in online learning there is no
separation between a training phase and a test phase. The same sequence of examples is used both for training
and testing, and the distinction between training and testing is made over time. In our papaya problem, each time we
buy a papaya, it is first considered a test example since we should predict if it's going to taste good. But, after
we take a bite of the papaya, we know the true label, and the same papaya becomes a training example
that can help us improve our prediction mechanism for future papayas.
The goal of the online learner is simply to make few prediction mistakes. By now, the reader should
know that there are no free lunches – we must have some prior knowledge on the problem in order to be
able to make accurate predictions. As in previous lectures, we encode our prior knowledge on the problem
using some representation of the instances (e.g. shape and color) and by assuming that there is a class of
hypotheses, H = {h : X → Y}, and on each online round the learner uses a hypothesis from H to make his
prediction.
To simplify our presentation, we start the lecture by describing online learning algorithms for the case of
a finite hypothesis class.
1 Online Learning and Mistake Bounds for Finite Hypothesis Classes
Throughout this section we assume that H is a finite hypothesis class. On round t, the learner first receives
an instance, denoted xt , and is required to predict a label. After making his prediction, the learner receives
the true label, yt = h*(xt), where h* ∈ H is a fixed (but unknown to the learner) hypothesis.
The most natural learning rule is to use a consistent hypothesis at each online round. If there are several
consistent hypotheses, the Consistent algorithm chooses one of them arbitrarily. This corresponds to the
ERM learning rule we discussed in batch learning.
It is trivial to see that the number of mistakes the Consistent algorithm will make on any sequence is
at most |H| − 1. This follows from the fact that whenever Consistent errs, at least one hypothesis is no
longer consistent with all previous examples. Therefore, after making |H| − 1 mistakes, the only hypothesis
which will remain consistent is h*, and it will never err again. On the other hand, it is easy to construct a
sequence of examples on which Consistent indeed makes |H| − 1 mistakes (this is left as an exercise).
Next, we define a variant of Consistent which has a much better mistake bound. On each round, the
RandConsistent algorithm chooses a consistent hypothesis uniformly at random, as there is no reason to
prefer one consistent hypothesis over another. This leads to the following algorithm.
Algorithm 1 RandConsistent
Input: A finite hypothesis class H
Initialize: V1 = H
For t = 1, 2, . . .
Receive xt
Choose some h from Vt uniformly at random
Predict ŷt = h(xt )
Receive true answer yt
Update Vt+1 = {h ∈ Vt : h(xt ) = yt }
The RandConsistent algorithm maintains a set, Vt , of all the hypotheses which are consistent with
(x1 , y1 ), . . . , (xt−1 , yt−1 ). It then chooses a hypothesis uniformly at random from Vt and predicts according
to this hypothesis.
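To make the procedure concrete, the following is a minimal Python sketch of RandConsistent, under the illustrative assumption that H is given as a list of Python callables h(x) and that the stream of (x, y) pairs is realizable:

import random

def rand_consistent(hypotheses, stream, rng=random.Random(0)):
    """Run RandConsistent on an iterable of (x, y) pairs; returns the number of mistakes.
    Assumes realizability, so the version space V never becomes empty."""
    V = list(hypotheses)                     # version space V_t, initially all of H
    mistakes = 0
    for x, y in stream:
        h = rng.choice(V)                    # choose a consistent hypothesis uniformly at random
        if h(x) != y:
            mistakes += 1
        V = [g for g in V if g(x) == y]      # keep only hypotheses consistent with (x_t, y_t)
    return mistakes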
Recall that the goal of the learner is to make few mistakes. The following theorem upper bounds the
expected number of mistakes RandConsistent makes on a sequence of examples. To motivate the bound,
consider a round t and let αt be the fraction of hypotheses in Vt which are going to be correct on the example
(xt , yt ). Now, if αt is close to 1, it means we are likely to make a correct prediction. On the other hand, if
αt is close to 0, we are likely to make a prediction error. But, on the next round, after updating the set of
consistent hypotheses, we will have |Vt+1 | = αt |Vt |, and since we now assume that αt is small, we will have
a much smaller set of consistent hypotheses in the next round. To summarize, if we are likely to err on the
current example then we are going to learn a lot from this example as well, and therefore be more accurate in
later rounds.
Theorem 1 Let H be a finite hypothesis class, let h* be some hypothesis in H, and let
(x1, h*(x1)), . . . , (xT, h*(xT)) be an arbitrary sequence of examples. Then, the expected number of
mistakes the RandConsistent algorithm makes on this sequence is at most ln(|H|), where the expectation is
with respect to the algorithm's own randomization.
Proof For each round t, let αt = |Vt+1|/|Vt|. Therefore, after T rounds we have

1 ≤ |VT+1| = |H| ∏_{t=1}^{T} αt .

Using the inequality b ≤ e^{−(1−b)}, which holds for all b, we get that

1 ≤ |H| ∏_{t=1}^{T} e^{−(1−αt)} = |H| e^{−Σ_{t=1}^{T} (1−αt)}   ⇒   Σ_{t=1}^{T} (1 − αt) ≤ ln(|H|) .        (1)

Finally, since we predict ŷt by choosing h ∈ Vt uniformly at random, we have that the probability to make a
mistake on round t is

P[ŷt ≠ yt] = |{h ∈ Vt : h(xt) ≠ yt}| / |Vt| = (|Vt| − |Vt+1|) / |Vt| = 1 − αt .

Therefore, the expected number of mistakes is

Σ_{t=1}^{T} E[1[ŷt ≠ yt]] = Σ_{t=1}^{T} P[ŷt ≠ yt] = Σ_{t=1}^{T} (1 − αt) .
Combining the above with Eq. (1) we conclude our proof.
It is interesting to compare the guarantee in Theorem 1 to guarantees on the generalization error in the
PAC model (see Corollary ??). In the PAC model, we refer to the T examples in the sequence as a training
set. Then, Corollary ?? implies that with probability of at least 1 − δ, our average error on new examples
is guaranteed to be at most ln(|H|/δ)/T . In contrast, Theorem 1 tells us a much stronger guarantee. We do
not need to first train the model on T examples, in order to have error rate of ln(|H|)/T . We can have this
same error rate immediately on the first T examples we observe. In our papayas example, we don’t need to
first buy T papayas, taste them all, and only then to be able to classify new papayas. We can start making
predictions from the first day, each day trying to buy a tasty papaya, and we know that our performance after
T days will be the same as our performance had we first trained our model using T papayas!
Another important difference between the online model and the batch model is that in the latter we
assume that instances are sampled i.i.d. from some underlying distribution, whereas in the former there is no such
assumption. In particular, Theorem 1 holds for any sequence of instances. Removing the i.i.d. assumption
is a big advantage. Again, in the papayas problem, we are allowed to choose a new papaya every day, which
clearly violates the i.i.d. assumption. On the flip side, we only have a guarantee on the total number of
mistakes but we have no guarantee that after observing T examples we will identify the ’true’ hypothesis.
Indeed, if we observe the same example on all the online rounds, we will make few mistakes but we will
remain with a large set Vt of hypotheses, all of which are potentially the true hypothesis.
Note that the RandConsistent algorithm is a specific variant of the general Consistent learning
paradigm (i.e., ERM) and that the bound in Theorem 1 relies on the fact that we use this specific variant. This
stands in contrast to the results we had before for the PAC model in which it doesn’t matter how we break
ties, and any consistent hypothesis is guaranteed to perform well. In some situations, it is computationally
harder to sample a consistent hypothesis from Vt while it is less demanding to merely find one consistent
hypothesis. Moreover, if H is infinite, it is not well defined how to choose a consistent hypothesis uniformly
at random. On the other hand, as mentioned before, the results we obtained for the PAC model assume that
the data is i.i.d. while the bound for RandConsistent holds for any sequence of instances. If the data is
indeed generated i.i.d. then it is possible to obtain a bound for the general Consistent paradigm.
Theorem 1 bounds the expected number of mistakes. Using martingale measure concentration techniques,
one can obtain a bound which holds with extremely high probability. A simpler way is to explicitly derandomize the algorithm. Note that RandConsistent predicts 1 with probability greater than 1/2 if the majority
of hypotheses in Vt predicts 1. A simple derandomization is therefore to make a deterministic prediction
according to a majority vote of the hypotheses in Vt . The resulting algorithm is called Halving.
Algorithm 2 Halving
Input: A finite hypothesis class H
Initialize: V1 = H
For t = 1, 2, . . .
Receive xt
Predict ŷt = argmax_{r∈{±1}} |{h ∈ Vt : h(xt) = r}|
(In case of a tie predict ŷt = 1)
Receive true answer yt
Update Vt+1 = {h ∈ Vt : h(xt ) = yt }
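A minimal Python sketch of Halving, under the same illustrative representation of H as a list of callables:

def halving(hypotheses, stream):
    """Run the Halving algorithm; returns the number of mistakes (at most log2 |H|)."""
    V = list(hypotheses)
    mistakes = 0
    for x, y in stream:
        votes_plus = sum(1 for h in V if h(x) == 1)
        y_hat = 1 if votes_plus >= len(V) - votes_plus else -1   # majority vote, ties go to +1
        if y_hat != y:
            mistakes += 1
        V = [h for h in V if h(x) == y]
    return mistakes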
Theorem 2 Let H be a finite hypothesis class, let h* be some hypothesis in H, and let
(x1, h*(x1)), . . . , (xT, h*(xT)) be an arbitrary sequence of examples. Then, the number of mistakes the
Halving algorithm makes on this sequence is at most log2(|H|).
Proof We simply note that whenever the algorithm errs we have |Vt+1 | ≤ |Vt |/2. (Hence the name Halving.)
Therefore, if M is the total number of mistakes, we have
1 ≤ |VT+1| ≤ |H| 2^{−M} .
Rearranging the above inequality we conclude our proof.
A guarantee of the type given in Theorem 2 is called a Mistake Bound. Theorem 2 states that Halving
enjoys a mistake bound of log2 (|H|). In the next section, we relax the assumption that all the labels are
generated by a hypothesis h* ∈ H.
2 Weighted Majority and Regret Bounds
In the previous section we presented the Halving algorithm and analyzed its performance by providing a
mistake bound. A crucial assumption we relied on is that the data is realizable, namely, the labels in the
sequence are generated by some hypothesis h* ∈ H. This is a rather strong assumption on our data and prior
knowledge. In this section we relax this assumption.
Recall that the mistake bounds we derived in the previous section do not require the data to be sampled
i.i.d. from some distribution. We allow the sequence to be deterministic, stochastic, or even adversarially
adaptive to our own behavior (for example, this is the case in spam email filtering). Clearly, learning is
impossible if there is no correlation between past and present examples.
In the realizable case, future and past examples are tied together by the common hypothesis, h* ∈ H,
that generates all labels. In the non-realizable case, we analyze the performance of an online learner using
the notion of regret. The learner's regret is the difference between his number of prediction mistakes and
the number of mistakes the optimal fixed hypothesis in H makes on the same sequence of examples. This is
termed 'regret' since it measures how 'sorry' the learner is, in retrospect, not to have followed the predictions
of the optimal hypothesis. We again use the notation h* to denote the hypothesis in H that makes the least
number of mistakes on the sequence of examples. But, now, the labels are not generated by h*, meaning that
for some examples we might have h*(xt) ≠ yt.
As mentioned before, learning is impossible if there is no correlation between past and future examples.
However, even in this case, the algorithm can have low regret. In other words, having low regret does not
necessarily mean that we will make few mistakes. It only means that the algorithm will be competitive with
an oracle that knows in advance what the data is going to be and chooses the optimal h*. If our prior
knowledge is adequate, then H contains a hypothesis that (more or less) explains the data, and then a low
regret algorithm is guaranteed to have few mistakes.
We now present an online learning algorithm for the non-realizable case, also called the agnostic case. As
in the previous section, we assume that H is a finite hypothesis class and denote H = {h1 , . . . , hd }. Recall
that the RandConsistent algorithm maintains the set Vt of all hypotheses which are consistent with the
examples observed so far. Then, it samples a hypothesis from Vt uniformly at random. We can represent the
set Vt as a vector wt ∈ R^d, where wt,i = 1 if hi ∈ Vt and wt,i = 0 otherwise. Then, the RandConsistent
algorithm chooses hi with probability wt,i / (Σ_j wt,j), and the vector wt is updated at the end of the round by
zeroing all elements corresponding to hypotheses that err on the current example.
If the data is non-realizable, the weight of h* will become zero once we encounter an example on which
h* errs. From this point on, h* will not affect the prediction of the RandConsistent algorithm. To
overcome this problem, one can be less aggressive: instead of zeroing the weights of erroneous hypotheses,
one can just diminish their weight by scaling it down by some β ∈ (0, 1). The resulting algorithm
is called Weighted-Majority.
is called Weighted-Majority.
Algorithm 3 Weighted-Majority
Input: Finite hypothesis class H = {h1, . . . , hd} ; Number of rounds T
Initialize: β = e^{−√(8 ln(d)/T)} ; w1 = (1, . . . , 1) ∈ R^d
For t = 1, 2, . . . , T
Let Zt = Σ_{j=1}^{d} wt,j
Sample a hypothesis h ∈ H at random according to (wt,1/Zt, . . . , wt,d/Zt)
Predict ŷt = h(xt)
Receive true answer yt
Update: ∀j, wt+1,j = wt,j β if hj(xt) ≠ yt, and wt+1,j = wt,j otherwise
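A minimal Python sketch of Weighted-Majority, again representing H as a list of callables; the default β follows the initialization above:

import math
import random

def weighted_majority(hypotheses, stream, T, beta=None, rng=random.Random(0)):
    """Run Weighted-Majority for T rounds; returns the number of mistakes."""
    d = len(hypotheses)
    if beta is None:
        beta = math.exp(-math.sqrt(8.0 * math.log(d) / T))   # beta from the initialization above
    w = [1.0] * d
    mistakes = 0
    for t, (x, y) in enumerate(stream):
        if t >= T:
            break
        i = rng.choices(range(d), weights=w)[0]   # sample hypothesis i with prob. w[i] / Z_t
        if hypotheses[i](x) != y:
            mistakes += 1
        # scale down the weight of every hypothesis that errs on (x, y)
        w = [w[j] * beta if hypotheses[j](x) != y else w[j] for j in range(d)]
    return mistakes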
The following theorem provides an expected regret bound for the algorithm.
Theorem 3 Let H be a finite hypothesis class, and let (x1, y1), . . . , (xT, yT) be an arbitrary sequence of
examples. If we run Weighted-Majority on this sequence we have the following expected regret bound

E[ Σ_{t=1}^{T} 1[ŷt ≠ yt] ] − min_{h∈H} Σ_{t=1}^{T} 1[h(xt) ≠ yt] ≤ √(0.5 ln(|H|) T) ,

where the expectation is with respect to the algorithm's own randomization.
Proof Let η = √(8 ln(d)/T) and note that wt+1,i = wt,i e^{−η 1[hi(xt) ≠ yt]}. Therefore,

ln(Zt+1/Zt) = ln Σ_i (wt,i/Zt) e^{−η 1[hi(xt) ≠ yt]} .

Hoeffding's inequality tells us that if X is a random variable over [0, 1] then

ln E[e^{−ηX}] ≤ −η E[X] + η²/8 .

Since wt/Zt is a probability vector and 1[hi(xt) ≠ yt] ∈ [0, 1], we can apply Hoeffding's inequality to obtain:

ln(Zt+1/Zt) ≤ −η Σ_i (wt,i/Zt) 1[hi(xt) ≠ yt] + η²/8 = −η E[1[ŷt ≠ yt]] + η²/8 .

Summing the above inequality over t we get

ln(ZT+1) − ln(Z1) = Σ_{t=1}^{T} ln(Zt+1/Zt) ≤ −η Σ_{t=1}^{T} E[1[ŷt ≠ yt]] + T η²/8 .

Next, we lower bound ZT+1. For each i, let Mi be the number of mistakes hi makes on the entire sequence
of T examples. Therefore, wT+1,i = e^{−η Mi} and we get that

ln ZT+1 = ln( Σ_i e^{−η Mi} ) ≥ ln( max_i e^{−η Mi} ) = −η min_i Mi .

Combining the above and using the fact that ln(Z1) = ln(d) we get that

−η min_i Mi − ln(d) ≤ −η Σ_{t=1}^{T} E[1[ŷt ≠ yt]] + T η²/8 ,

which can be rearranged as follows:

Σ_{t=1}^{T} E[1[ŷt ≠ yt]] − min_i Mi ≤ ln(d)/η + η T/8 .

Plugging the value of η into the above concludes our proof.
Comparing the regret bound of Weighted-Majority in Theorem 3 to the mistake bound of
RandConsistent given in Theorem 1 we note several differences. First, the result for
Weighted-Majority holds for any sequence of examples, while the mistake bound of
RandConsistent assumes that the labels are generated by some h* ∈ H. As a result, the bound of
Weighted-Majority is relative to the minimal number of mistakes a hypothesis h ∈ H makes. Second,
dividing both bounds by the number of rounds T, we get that the error rate in Theorem 1 decreases as
ln(|H|)/T while the error rate in Theorem 3 decreases as √(ln(|H|)/T). This is similar to the results we had
for PAC learning in the realizable and non-realizable cases.
3 Online Learnability and Littlestone's dimension
So far, we focused on specific hypothesis classes – finite classes or the class of linear separators. We derived
specific algorithms for each type of class and analyzed their performance. In this section we take a more
general approach, and aim at characterizing online learnability. In particular, we target the following question:
what is the optimal online learning algorithm for a given class H?
To simplify our derivation, we again consider the realizable case in which all labels are generated by some
hypothesis h* ∈ H. We already encountered worst-case mistake bounds. For completeness, we give here a
formal definition.
Definition 1 (Mistake bound) Let H be a hypothesis class of functions from X to Y = {±1} and let A be
an online algorithm. We say that M is a mistake bound for A if, for any hypothesis h* ∈ H and any sequence
of examples (x1, h*(x1)), . . . , (xT, h*(xT)), the online algorithm A does not make more than M prediction
mistakes when running on the sequence of examples.
Remarks:
• If an algorithm has a mistake bound M it does not necessarily mean that the algorithm will eventually
know the identity of h*, even if we present it with an infinite number of examples (e.g. we can feed the
online algorithm the same example again and again). The only guarantee that we have is that the total
number of mistakes that the algorithm will make is at most M .
• When the learning algorithm makes randomized predictions, it is also possible to define a mistake
bound M as an upper bound on the expected number of mistakes, where expectation is with respect to
learner’s own randomization. For simplicity, throughout this section we do not analyze expected regret
and allow the environment to observe the learner’s prediction before providing the label.
In the following we present a dimension of hypothesis classes that characterizes the best achievable mistake bound. This measure was proposed by Littlestone and we therefore refer to it as Ldim(H).
To motivate the definition of Ldim it is convenient to view the online learning process as a game between
two players: the learner vs. the environment. On round t of the game, the environment picks an instance xt,
the learner predicts a label ŷt ∈ {+1, −1}, and finally the environment outputs the true label, yt ∈ {+1, −1}.

Figure 1: An illustration of a shattered tree of depth 2 (root v1, left child v2, right child v3). The blue path
corresponds to the sequence of examples ((v1, 1), (v3, −1)). The tree is shattered by H = {h1, h2, h3, h4},
where the predictions of each hypothesis in H on the instances v1, v2, v3 are given in the table below (a
question mark means that hj(vi) can be either 1 or −1).

      v1   v2   v3
h1    −1   −1    ?
h2    −1    1    ?
h3     1    ?   −1
h4     1    ?    1

Suppose that the environment wants to make the learner err on the first M rounds of the game. Then, it must
output yt = −ŷt, and the only question is how to choose the instances xt in such a way that ensures that for
some h* ∈ H we have yt = h*(xt) for all t ∈ [M].
It makes sense to assume that the environment should pick xt based on the previous predictions of
the learner, ŷ1, . . . , ŷt−1. Since in our case we have yt = −ŷt we can also say that xt is a function of
y1, . . . , yt−1. We can represent this dependence using a complete binary tree of depth M (we define the
depth of the tree as the number of nodes in a path from the root to a leaf). We have 2^M − 1 nodes in such a
tree, and we attach an instance to each node. Let v1, . . . , v_{2^M −1} be these instances. We start from the root
of the tree, and set x1 = v1. At round t, we set xt = v_{it} where it is the current node. At the end of round t,
we go to the left child of it if yt = −1 or to the right child if yt = 1. That is, i_{t+1} = 2 it + (yt + 1)/2.
Unraveling the recursion we obtain it = 2^{t−1} + Σ_{j=1}^{t−1} ((yj + 1)/2) 2^{t−1−j}.
The above strategy for the environment succeeds only if for any (y1 , . . . , yM ) there exists h ∈ H such
that yt = h(xt ) for all t ∈ [M ]. This leads to the following definition.
Definition 2 (H-shattered tree) A shattered tree of depth d is a sequence of instances v1, . . . , v_{2^d −1} in X
such that for all labelings (y1, . . . , yd) ∈ {±1}^d there exists h ∈ H such that for all t ∈ [d] we have
h(v_{it}) = yt, where it = 2^{t−1} + Σ_{j=1}^{t−1} ((yj + 1)/2) 2^{t−1−j}.
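As a quick sanity check of the indexing, here is a small, purely illustrative Python helper that computes the node index it from the labels y1, . . . , yt−1, both by the recursion and by the closed-form expression:

def node_index_recursive(ys):
    """i_t after observing labels ys = (y_1, ..., y_{t-1}), each in {-1, +1}."""
    i = 1                              # i_1 = 1 (the root)
    for y in ys:
        i = 2 * i + (y + 1) // 2       # go to the left child on -1, right child on +1
    return i

def node_index_closed_form(ys):
    t = len(ys) + 1
    return 2 ** (t - 1) + sum(((y + 1) // 2) * 2 ** (t - 1 - j) for j, y in enumerate(ys, start=1))

assert node_index_recursive([1, -1]) == node_index_closed_form([1, -1]) == 6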
An illustration of a shattered tree of depth 2 is given in Figure 1.
Definition 3 (Littlestone's dimension (Ldim)) Ldim(H) is the maximal integer M such that there exists a
shattered tree of depth M .
The definition of Ldim and the discussion above immediately imply the following:
Lemma 1 If Ldim(H) = M then no algorithm can have a mistake bound strictly smaller than M .
Proof Let v1, . . . , v_{2^M −1} be a sequence of instances that satisfies the requirements in the definition of Ldim. If the
environment sets xt = v_{it} and yt = −ŷt for all t ∈ [M], then the learner makes M mistakes while the
definition of Ldim implies that there exists a hypothesis h ∈ H such that yt = h(xt) for all t.
Let us now give several examples.
Example 1 Let H be a finite hypothesis class. Clearly, any tree that is shattered by H has depth of at most
log2 (|H|). Therefore, Ldim(H) ≤ log2 (|H|).
Example 2 Let X = R and H = {x 7→ sign(x − a) : a ∈ R}. Then, Ldim(H) = ∞. (proof is left as an
exercise).
Example 3 Let X = {x ∈ {0, 1}* : ||x||_0 ≤ r} and H = {x ↦ sign(⟨w, x⟩) : ||w||_0 ≤ k}. The size of H
is infinite. Nevertheless, Ldim(H) ≤ r k. (The proof uses the Perceptron mistake bound and is left as an
exercise.)
Lemma 1 states that Ldim(H) lower bounds the mistake bound of any algorithm. Interestingly, there is a
standard algorithm whose mistake bound matches this lower bound. The algorithm is similar to the Halving
algorithm. Recall that the prediction of Halving is according to a majority vote of the hypotheses which
are consistent with previous examples. We denoted this set by Vt. Put another way, Halving partitions Vt
into two sets: Vt+ = {h ∈ Vt : h(xt) = 1} and Vt− = {h ∈ Vt : h(xt) = −1}. It then predicts according to
the larger of the two sets. The rationale behind this prediction is that whenever Halving makes a mistake
it ends up with |Vt+1| ≤ 0.5 |Vt|.
The optimal algorithm we present below uses the same idea, but instead of predicting according to the
larger class, it predicts according to the class with larger Ldim.
Algorithm 4 Standard Optimal Algorithm (SOA)
Input: A hypothesis class H
Initialize: V1 = H
For t = 1, 2, . . .
Receive xt
For r ∈ {±1} let Vt^(r) = {h ∈ Vt : h(xt) = r}
Predict ŷt = argmax_r Ldim(Vt^(r))
Receive true answer yt
Update Vt+1 = Vt^(yt)
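To make the definitions concrete, here is an illustrative (and exponential-time) Python sketch that computes Ldim of a tiny finite class by brute force, directly from the recursive structure of shattered trees, and uses it for the SOA prediction step. The representation of hypotheses as callables over a finite instance set X is an assumption made for the sketch.

def ldim(H, X):
    """Brute-force Littlestone dimension of a finite class H (callables) over a finite set X.
    A tree of depth M rooted at x needs both label-induced subclasses to shatter depth M-1 trees."""
    if not H or not X:
        return 0
    best = 0
    for x in X:
        H_plus = [h for h in H if h(x) == 1]
        H_minus = [h for h in H if h(x) == -1]
        if H_plus and H_minus:
            best = max(best, 1 + min(ldim(H_plus, X), ldim(H_minus, X)))
    return best

def soa_predict(V, X, x):
    """One prediction step of the Standard Optimal Algorithm."""
    V_plus = [h for h in V if h(x) == 1]
    V_minus = [h for h in V if h(x) == -1]
    return 1 if ldim(V_plus, X) >= ldim(V_minus, X) else -1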
The following lemma formally establishes the optimality of the above algorithm.
Lemma 2 SOA enjoys the mistake bound M = Ldim(H).
Proof It suffices to prove that whenever the algorithm makes a prediction mistake we have Ldim(Vt+1) ≤
Ldim(Vt) − 1. We prove this claim by assuming the contrary, that is, Ldim(Vt+1) = Ldim(Vt). If this holds
true, then the definition of ŷt implies that Ldim(Vt^(r)) = Ldim(Vt) for both r = 1 and r = −1. But, in this
case we can construct a tree of depth Ldim(Vt) + 1 that satisfies the requirements given in the definition of
Ldim for the class Vt, which leads to the desired contradiction.
Combining Lemma 2 and Lemma 1 we obtain:
Corollary 1 Let H be a hypothesis class with Ldim(H) = M . Then, the standard optimal algorithm enjoys
the mistake bound M and no other algorithm can have a mistake bound strictly smaller than M .
3.1 Comparison to VC dimension
In this section we compare the online learning dimension Ldim with the VC dimension we used for
determining the learnability of a class in the batch learning model.
To remind the reader, the VC dimension of a class H, denoted VCdim(H), is the maximal number d such
that there are instances x1 , . . . , xd that are shattered by H. That is, for any sequence of labels (y1 , . . . , yd ) ∈
{+1, −1}d there exists a hypothesis h ∈ H that gives exactly this sequence of labels.
Theorem 4 For any class H we have VCdim(H) ≤ Ldim(H).
Figure 2: How to construct a shattered tree from a shattered sequence x1, . . . , xd: all nodes at depth i are labeled by the instance xi.
Proof Suppose VCdim(H) = d and let x1, . . . , xd be a shattered set. We now construct a complete binary
tree of instances v1, . . . , v_{2^d −1}, where all nodes at depth i are set to be xi (see the illustration in Figure 2).
Now, the definition of a shattered sample clearly implies that we obtain a valid shattered tree of depth d, and
our proof is completed.
Corollary 2 For a finite hypothesis class H, we have
VCdim(H) ≤ Ldim(H) ≤ log2(|H|) .
Both inequalities in Corollary 2 can be strict as the following examples show.
Example 4 Consider again the class of initial segments on the real numbers. That is, X = R and H =
{x 7→ sign(x − a) : a ∈ R}. We have shown that the VC dimension of H is 1 while the Littlestone dimension
is ∞.
Example 5 Let X = {1, . . . , d} and H = {h1, . . . , hd} where hi(x) = 1 iff x = i. Then, it is easy to show
that Ldim(H) = 1 while |H| = d can be arbitrarily large.
Advanced Course in Machine Learning
Spring 2011
Online Convex Optimization
Lecturer: Shai Shalev-Shwartz
Scribe: Shai Shalev-Shwartz
A convex repeated game is a two-player game that is played in a sequence of consecutive rounds.
On round t of the repeated game, the first player chooses a vector wt from a convex set A. Next, the second
player responds with a convex function gt : A → R. Finally, the first player suffers an instantaneous loss
gt (wt ). We study the game from the viewpoint of the first player.
In offline convex optimization, the goal is to find a vector w within a convex set A that minimizes a
convex objective function, g : A → R. In online convex optimization, the set A is known in advance, but the
objective function may change along the online process.
The goal of the online optimizer, which we call the
learner, is to minimize the averaged objective value (1/T) Σ_{t=1}^{T} gt(wt), where T is the total number of rounds.
Low regret: Naturally, an adversary can make the cumulative loss of our online learning algorithm arbitrarily large. For example, the second player can always set gt (w) = 1 and then no matter what the learner
will predict, the cumulative loss will be T . To overcome this deficiency, we restate the learner’s goal based
on the notion of regret. The learner’s regret is the difference between his cumulative loss and the cumulative
loss of the optimal offline minimizer. This is termed ’regret’ since it measures how ’sorry’ the learner is, in
retrospect, not to use the optimal offline minimizer. That is, the regret is
R(T) = (1/T) Σ_{t=1}^{T} gt(wt) − min_{w∈A} (1/T) Σ_{t=1}^{T} gt(w) .
We call an online algorithm a low regret algorithm if R(T ) = o(1). In this lecture we will study low regret
algorithms for online convex optimization. We will also show how several familiar algorithms, like the
Perceptron and Weighted Majority, can be derived from an online convex optimizer.
We start with a brief overview of basic notions from convex analysis.
4 Convexity
A set A is convex if for any two vectors w1, w2 in A, the entire line segment between w1 and w2 is also within A. That
is, for any α ∈ [0, 1] we have that αw1 + (1 − α)w2 ∈ A. A function f : A → R is convex if for all
u, v ∈ A and α ∈ [0, 1] we have
f (αu + (1 − α)v) ≤ αf (u) + (1 − α)f (v) .
It is easy to verify that f is convex iff its epigraph is a convex set, where epigraph(f) = {(x, α) : f(x) ≤ α}.
We allow f to output ∞ for some inputs x. This is a convenient way to restrict the domain of f to a
proper subset of X. So, in this section we use R to denote the real numbers together with the special symbol ∞. The
domain of f : X → R is defined as dom(f) = {x : f(x) < ∞}.
A set A is open if every point in A has a neighborhood lying in A. A set A is closed if its complement
is an open set. A function f is closed if for any finite scalar α, the level set {w : f (w) ≤ α} is closed.
Throughout, we focus on closed and convex functions.
4.1 Sub-gradients
A vector λ is a sub-gradient of a function f at w if for all u ∈ A we have that

f(u) − f(w) ≥ ⟨u − w, λ⟩ .
The differential set of f at w, denoted ∂f (w), is the set of all sub-gradients of f at w. For scalar functions, a
sub-gradient of a convex function f at x is a slope of a line that touches f at x and is not above f everywhere.
Two useful properties of subgradients are given below:
1. If f is differentiable at w then ∂f (w) consists of a single vector which amounts to the gradient of f
at w and is denoted by ∇f (w). In finite dimensional spaces, the gradient of f is the vector of partial
derivatives of f .
2. If g(w) = max_{i∈[r]} gi(w) for r differentiable functions g1, . . . , gr, and j = argmax_i gi(u), then the
gradient of gj at u is a sub-gradient of g at u.
Example 6 (Sub-gradients of the logistic-loss) Recall that the logistic-loss is defined as ℓ(w; x, y) =
log(1 + exp(−y⟨w, x⟩)). Since this function is differentiable, a sub-gradient at w is the gradient at w,
which using the chain rule equals

∇ℓ(w; x, y) = (−exp(−y⟨w, x⟩) / (1 + exp(−y⟨w, x⟩))) y x = (−1 / (1 + exp(y⟨w, x⟩))) y x .
Example 7 (Sub-gradients of the hinge-loss) Recall that the hinge-loss is defined as ℓ(w; x, y) =
max{0, 1 − y⟨w, x⟩}. This is the maximum of two linear functions. Therefore, using the two properties
above we have that if 1 − y⟨w, x⟩ > 0 then −y x ∈ ∂ℓ(w; x, y) and if 1 − y⟨w, x⟩ < 0 then 0 ∈ ∂ℓ(w; x, y).
Furthermore, it is easy to verify that

∂ℓ(w; x, y) = {−y x}                       if 1 − y⟨w, x⟩ > 0
              {0}                          if 1 − y⟨w, x⟩ < 0
              {−α y x : α ∈ [0, 1]}        if 1 − y⟨w, x⟩ = 0
Figure 3: An illustration of the hinge-loss function f(x) = max{0, 1 − x} and one of its sub-gradients at x = 1.
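As a small illustration, here is a Python sketch of a sub-gradient computation for the hinge-loss, following the case analysis of Example 7 (at the kink 1 − y⟨w, x⟩ = 0 it returns 0, which corresponds to the valid choice α = 0):

def hinge_subgradient(w, x, y):
    """Return one sub-gradient of l(w; x, y) = max(0, 1 - y*<w, x>) with respect to w."""
    margin = 1.0 - y * sum(wi * xi for wi, xi in zip(w, x))
    if margin > 0:
        return [-y * xi for xi in x]    # loss is active: gradient of 1 - y*<w, x>
    return [0.0] * len(w)               # loss is zero or at the kink: 0 is a valid sub-gradient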
4.2 Lipschitz functions
We say that f : A → R is ρ-Lipschitz if for all u, v ∈ A

|f(u) − f(v)| ≤ ρ ||u − v|| .

An equivalent definition is that the ℓ2 norm of all sub-gradients of f at points in A is bounded by ρ.
More generally, we say that a convex function is V-Lipschitz w.r.t. a norm || · || if for all x ∈ X there exists
v ∈ ∂f(x) with ||v||_* ≤ V. Of particular interest are p-norms, ||x||_p = (Σ_i |xi|^p)^{1/p}.
4.3 Dual norms
Given a norm || · ||, its dual norm is defined by

||y||_* = sup_{x : ||x|| ≤ 1} ⟨x, y⟩ .

For example, the Euclidean norm is dual to itself. More generally, for any p, q ≥ 1 such that 1/p + 1/q = 1,
the norms

||x||_p = (Σ_i |xi|^p)^{1/p} ;   ||y||_q = (Σ_i |yi|^q)^{1/q}

are dual norms. The above also holds for ||x||_1 and ||y||_∞ = max_i |yi|.
4.4 Fenchel Conjugate
There are two equivalent representations of a convex function: either as pairs (x, f(x)), or as the set of
tangents of f, namely pairs (slope, intersection-with-y-axis). The function that relates slopes of tangents to
their intersection with the y-axis is called the Fenchel conjugate of f.

[Figure omitted: the set of points (w, f(w)) versus the set of tangents; a tangent of slope θ intersects the y-axis at −f*(θ).]

Formally, the Fenchel conjugate of f is defined as

f*(θ) = max_x ⟨x, θ⟩ − f(x) .
Few remarks:
• The definition immediately implies the Fenchel-Young inequality: for all u,

  f*(θ) = max_x ⟨x, θ⟩ − f(x) ≥ ⟨u, θ⟩ − f(u) .

• If f is closed and convex then f** = f.
• By the way, this implies Jensen's inequality:

  f(E[x]) = max_θ ⟨θ, E[x]⟩ − f*(θ) = max_θ E[⟨θ, x⟩ − f*(θ)] ≤ E[ max_θ ⟨θ, x⟩ − f*(θ) ] = E[f(x)] .
Several examples of Fenchel conjugate functions are given below.

f(x)                                      f*(θ)
(1/2)||x||²                               (1/2)||θ||_*²
||x||                                     Indicator of the unit || · ||_* ball
Σ_i wi log(wi)                            log Σ_i e^{θi}
Indicator of the probability simplex      max_i θi
c g(x) for c > 0                          c g*(θ/c)
inf_{x'} ( g1(x') + g2(x − x') )          g1*(θ) + g2*(θ)

4.5 Strong Convexity–Strong Smoothness Duality
Recall that the domain of f : X → R is {x : f (x) < ∞}. We first define strong convexity.
Definition 4 A function f : X → R is β-strongly convex w.r.t. a norm || · || if for all x, y in the relative
interior of the domain of f and α ∈ (0, 1) we have

f(αx + (1 − α)y) ≤ α f(x) + (1 − α) f(y) − (β/2) α(1 − α) ||x − y||² .
We now define strong smoothness. Note that a strongly smooth function f is always finite.
Definition 5 A function f : X → R is β-strongly smooth w.r.t. a norm || · || if f is everywhere differentiable
and if for all x, y we have

f(x + y) ≤ f(x) + ⟨∇f(x), y⟩ + (β/2) ||y||² .
The following theorem states that strong convexity and strong smoothness are dual properties. Recall that
the biconjugate f ?? equals f if and only if f is closed and convex.
Theorem 5 (Strong/Smooth Duality) Assume that f is a closed and convex function. Then f is β-strongly
convex w.r.t. a norm || · || if and only if f* is (1/β)-strongly smooth w.r.t. the dual norm || · ||_*.
Figure 4: Illustrations of strong convexity (left: the gap between f and its tangent at w is at least (β/2)||u − w||²) and strong smoothness (right: the gap between f* and its tangent at w is at most (1/(2β))||u − w||_*²).
Subtly, note that while the domain of a strongly convex function f may be a proper subset of X (important
for a number of settings), its conjugate f* always has a domain which is all of X (since if f* is strongly smooth
then it is finite and everywhere differentiable). The above theorem can be found, for instance, in Zalinescu
2002 (see Corollary 3.5.11 on p. 217 and Remark 3.5.3 on p. 218).
The following direct corollary of Theorem 5 is central in proving regret bounds. As we will show later, it
is also central in proving generalization bounds.
Corollary 3 If f is β-strongly convex w.r.t. || · || and f*(0) = 0, then, denoting the partial sum Σ_{j≤i} vj by
v1:i, we have, for any sequence v1, . . . , vn and for any u,

Σ_{i=1}^{n} ⟨vi, u⟩ − f(u) ≤ f*(v1:n) ≤ Σ_{i=1}^{n} ⟨∇f*(v1:i−1), vi⟩ + (1/(2β)) Σ_{i=1}^{n} ||vi||_*² .

Proof The first inequality is Fenchel-Young and the second follows from the definition of smoothness by induction.
Examples of strongly convex functions

Lemma 3 Let q ∈ [1, 2]. The function f : R^d → R defined as f(w) = (1/2)||w||_q² is (q − 1)-strongly convex
with respect to || · ||_q over R^d.

A proof can be found in Shalev-Shwartz 2007.
We mainly use the above lemma to obtain results with respect to the norms || · ||_2 and || · ||_1. The case
q = 2 is straightforward. Obtaining results with respect to || · ||_1 is slightly more tricky since for q = 1
the strong convexity parameter is 0 (meaning that the function is not strongly convex). To overcome this
problem, we shall set q to be slightly more than 1, e.g. q = ln(d)/(ln(d) − 1). For this choice of q, the strong
convexity parameter becomes q − 1 = 1/(ln(d) − 1) ≥ 1/ln(d), and the value of p corresponding to the dual
norm is p = (1 − 1/q)^{−1} = ln(d). Note that for any x ∈ R^d we have

||x||_∞ ≤ ||x||_p ≤ (d ||x||_∞^p)^{1/p} = d^{1/p} ||x||_∞ = e ||x||_∞ ≤ 3 ||x||_∞ .

Hence the dual norms are also equivalent up to a factor of 3: ||w||_1 ≥ ||w||_q ≥ ||w||_1/3. The above lemma
therefore implies the following corollary.
Corollary 4 The function f : R^d → R defined as f(w) = (1/2)||w||_q² for q = ln(d)/(ln(d) − 1) is
1/(9 ln(d))-strongly convex with respect to || · ||_1 over R^d.
Another important example is given in the following lemma.

Lemma 4 The function f : R^d → R defined as f(x) = Σ_i xi log(xi) is 1-strongly convex with respect to
|| · ||_1 over the domain {x : ||x||_1 = 1, x ≥ 0}.
The proof can also be found in Shalev-Shwartz 2007.
5 An algorithmic framework for Online Convex Optimization
Algorithm 5 provides one common algorithm which achieves the following regret bound. It is one of a family
of algorithms that enjoy the same regret bound (see Shalev-Shwartz 2007).

Algorithm 5 Online Mirror Descent
Initialize: w1 ← ∇f*(0)
for t = 1 to T
Play wt ∈ A
Receive gt and pick vt ∈ ∂gt(wt)
Update wt+1 ← ∇f*(−η Σ_{s=1}^{t} vs)
end for
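To make the framework concrete, here is a small Python sketch of Algorithm 5 for one particular illustrative choice: A = {w : ||w||_2 ≤ B} and f(w) = (1/2)||w||_2² restricted to A, for which ∇f*(θ) is simply the Euclidean projection of θ onto A. The interface (losses supplied through a callable returning a sub-gradient) is an assumption made for the sketch.

import math

def project_onto_ball(theta, B):
    """Euclidean projection onto {w : ||w||_2 <= B}; equals grad f*(theta) for f = 0.5*||.||^2 on the ball."""
    norm = math.sqrt(sum(v * v for v in theta))
    if norm <= B:
        return list(theta)
    return [B * v / norm for v in theta]

def online_mirror_descent(subgradient, T, dim, eta, B=1.0):
    """subgradient: function (t, w) -> a sub-gradient v_t of g_t at w. Returns the played points."""
    theta = [0.0] * dim                      # running sum of -eta * v_s
    w = project_onto_ball(theta, B)          # w_1 = grad f*(0)
    played = []
    for t in range(1, T + 1):
        played.append(w)
        v = subgradient(t, w)
        theta = [th - eta * vi for th, vi in zip(theta, v)]
        w = project_onto_ball(theta, B)      # w_{t+1} = grad f*(-eta * sum_{s<=t} v_s)
    return played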
Theorem 6 Suppose Algorithm 5 is used with a function f that is β-strongly convex w.r.t. a norm || · || on A
and has f*(0) = 0. Suppose the loss functions gt are convex and V-Lipschitz w.r.t. the norm || · ||. Then, the
algorithm run with any positive η enjoys the regret bound

Σ_{t=1}^{T} gt(wt) − min_{u∈A} Σ_{t=1}^{T} gt(u) ≤ max_{u∈A} f(u)/η + η V² T/(2β) .

In particular, choosing η = √(2β max_u f(u) / (V² T)) we obtain the regret bound

Σ_{t=1}^{T} gt(wt) − min_{u∈A} Σ_{t=1}^{T} gt(u) ≤ V √(2 max_{u∈A} f(u) T / β) .
Proof Apply Corollary 3 to the sequence −ηv1, . . . , −ηvT to get, for all u,

−η Σ_{t=1}^{T} ⟨vt, u⟩ − f(u) ≤ −η Σ_{t=1}^{T} ⟨vt, wt⟩ + (1/(2β)) Σ_{t=1}^{T} ||η vt||_*² .

Using the fact that gt is V-Lipschitz, we get ||vt||_* ≤ V. Plugging this into the inequality above and
rearranging gives

Σ_{t=1}^{T} ⟨vt, wt − u⟩ ≤ f(u)/η + η V² T/(2β) .

By convexity of gt, gt(wt) − gt(u) ≤ ⟨vt, wt − u⟩. Therefore,

Σ_{t=1}^{T} gt(wt) − Σ_{t=1}^{T} gt(u) ≤ f(u)/η + η V² T/(2β) .

Since the above holds for all u ∈ A the result follows.
6 Tightness of regret bounds
In the previous sections we presented algorithmic frameworks for online convex programming with regret
bounds that depend on √T and on the complexity of the competing vector as measured by f(u). In this
section we show that without imposing additional conditions our bounds are tight, in a sense that is articulated
below.
First, we study the dependence of the regret bounds on √T.
Theorem 7 For any online convex programming algorithm, there exists a sequence of 1-Lipschitz convex
functions of length T such that the regret of the algorithm on this sequence with respect to a vector u with
||u||_2 ≤ 1 is Ω(√T).
Proof The proof is based on the probabilistic method. Let S = [−1, 1] and f(w) = (1/2)w². We clearly
have that f(w) ≤ 1/2 for all w ∈ S. Consider a sequence of linear functions gt(w) = σt w where σt ∈
{+1, −1}. Note that gt is 1-Lipschitz for all t. Suppose that the sequence σ1, . . . , σT is chosen in advance,
independently and with equal probability. Since wt only depends on σ1, . . . , σt−1 it is independent of σt.
Thus, E_{σt}[gt(wt) | σ1, . . . , σt−1] = 0. On the other hand, for any σ1, . . . , σT we have that

min_{u∈S} Σ_{t=1}^{T} gt(u) ≤ −| Σ_{t=1}^{T} σt | .

Therefore,

E_{σ1,...,σT}[Regret] = E_{σ1,...,σT}[ Σ_t gt(wt) ] − E_{σ1,...,σT}[ min_u Σ_t gt(u) ] ≥ 0 + E_{σ1,...,σT}[ | Σ_t σt | ] .

The right-hand side above is the expected distance after T steps of a random walk and is thus approximately
equal to √(2T/π) = Ω(√T). Our proof is concluded by noting that if the expected value of the regret is at
least Ω(√T), then there must exist at least one specific sequence for which the regret is at least Ω(√T).
Next, we study the dependence on the complexity function f (w).
Theorem 8 For any online convex programming algorithm, there exists a strongly convex function f with
respect to a norm || · ||, a sequence of 1-Lipschitz convex functions with respect to || · ||_*, and a vector u, such
that the regret of the algorithm on this sequence is Ω(f(u)).

Proof Let S = R^T and f(w) = (1/2)||w||_2². Consider the sequence of functions in which
gt(w) = [1 − yt⟨w, et⟩]_+, where yt ∈ {+1, −1} and et is the t-th vector of the standard basis, namely,
et,i = 0 for all i ≠ t and et,t = 1. Note that gt(w) is 1-Lipschitz with respect to the Euclidean norm.
Consider any online learning algorithm. If on round t we have wt,t ≥ 0 then we set yt = −1 and otherwise
we set yt = 1. Thus, gt(wt) ≥ 1. On the other hand, the vector u = (y1, . . . , yT) satisfies f(u) ≤ T/2 and
gt(u) = 0 for all t. Thus, Regret ≥ T = Ω(f(u)).
7 Convexification
The algorithms we derived previously are based on the assumption that for each round t, the loss function
gt (w) is convex with respect to w. A well known example in which this assumption does not hold is online
classification with the 0-1 loss function. In this section we discuss to what extent online convex optimization
can be used for online learning with the 0-1 loss. In particular, we will show how the Perceptron and Weighted
Majority algorithms can be derived from the general online mirror descent framework.
In online binary classification problems, at each round we receive an instance x ∈ X ⊂ R^n and we need
to predict a label ŷt ∈ Y = {+1, −1}. Then, we receive the correct label yt ∈ Y and suffer the 0-1 loss:
1[yt ≠ ŷt].
We first show that no algorithm can obtain a sub-linear regret bound for the 0-1 loss function. To do so,
let X = {1}, so our problem boils down to finding the bias of a coin in an online manner. An adversary can
make the number of mistakes of any online algorithm equal to T, by simply waiting for the learner's
prediction and then providing the opposite answer as the true answer. In contrast, the number of mistakes of
the constant prediction u = sign(Σ_t yt) is at most T/2. Therefore, the regret of any online algorithm with
respect to the 0-1 loss function will be at least T/2. This impossibility result is attributed to Cover.
To overcome the above impossibility result, two solutions have been proposed.
7.1 Surrogate convex loss and the Perceptron
The first solution is to find a convex loss function that upper bounds the original non-convex loss function. We
describe this solution for the problem of online learning of halfspaces. Recall that the set of linear hypotheses
is:

H = {x ↦ ⟨w, x⟩ : ||w||_2 ≤ U} .

In the context of binary classification, the actual prediction is sign(⟨w, x⟩) and the 0-1 loss function of w
on an example (x, y) is ℓ_0-1(w, (x, y)) = 1[sign(⟨w,x⟩) ≠ y].
A popular surrogate convex loss function is the hinge-loss, defined as

ℓ_hinge(w, (x, y)) = [1 − y⟨w, x⟩]_+ ,
where [a]_+ = max{0, a}. It is straightforward to verify that ℓ_hinge(w, (x, y)) is a convex function (w.r.t. w)
and that ℓ_hinge(w, (x, y)) ≥ ℓ_0-1(w, (x, y)). Therefore, for any u ∈ A we have

Regret(T) = Σ_{t=1}^{T} ℓ_hinge(wt, (xt, yt)) − Σ_{t=1}^{T} ℓ_hinge(u, (xt, yt))
          ≥ Σ_{t=1}^{T} ℓ_0-1(wt, (xt, yt)) − Σ_{t=1}^{T} ℓ_hinge(u, (xt, yt)) .
As a direct corollary from the above inequality we get that a low regret algorithm for the (convex) hinge-loss
function can be utilized for deriving an online learning algorithm for the 0-1 loss with the bound

Σ_{t=1}^{T} ℓ_0-1(wt, (xt, yt)) ≤ Σ_{t=1}^{T} ℓ_hinge(u, (xt, yt)) + Regret(T) .
Furthermore, denote by M the set of rounds in which sign(⟨wt, xt⟩) ≠ yt and note that the left-hand side
of the above is equal to |M|. We can remove the examples not in M from the sequence of examples, and
run an online convex optimization algorithm only on the sequence of examples in M. In particular, applying
Algorithm 5 with f(w) = (1/2)||w||_2² we obtain the well known Perceptron algorithm:
Algorithm 6 Perceptron
Initialize: w1 ← 0
for t = 1 to T
Receive xt
Predict ŷt = sign(hwt , xt i)
Receive yt
If ŷt 6= yt
Update wt+1 ← wt + yt xt
Else
Update wt+1 ← wt
End if
end for
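A minimal Python sketch of Algorithm 6, with instances represented as lists of floats (sign(0) is taken to be +1, an arbitrary illustrative choice):

def perceptron(stream, dim):
    """Run the Perceptron on a stream of (x, y) pairs with y in {-1, +1}; returns (w, mistakes)."""
    w = [0.0] * dim
    mistakes = 0
    for x, y in stream:
        y_hat = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
        if y_hat != y:
            mistakes += 1
            w = [wi + y * xi for wi, xi in zip(w, x)]   # w_{t+1} = w_t + y_t x_t on a mistake
    return w, mistakes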
To analyze the Perceptron, we note that an update of the form wt+1 = wt + η yt xt will yield the same
algorithm, no matter what the value of η is. Therefore, we can use the regret bound given in Theorem 6 on
the sequence of rounds in M and the set A = {w : ||w||_2 ≤ U} and get that for any u with ||u||_2 ≤ U we
have

|M| ≤ Σ_{t∈M} ℓ_hinge(u, (xt, yt)) + X U √|M| ,        (2)

where X = max_{t∈M} ||xt||_2. It is easy to verify that this implies

|M| ≤ Σ_{t∈M} ℓ_hinge(u, (xt, yt)) + X U √( Σ_{t∈M} ℓ_hinge(u, (xt, yt)) ) + X² U² .
Such a bound is called a relative mistake bound.
7.2 Randomization and Weighted Majority
Another way to circumvent Cover's impossibility result is by relying on randomization. We demonstrate
this idea using the setting of prediction with expert advice. In this setting, at each round the online learning
algorithm receives the advice of d experts, denoted (f_1^t, . . . , f_d^t) ∈ {0, 1}^d. Based on the predictions of the
experts, the algorithm should predict ŷt. To simplify notation, we assume in this subsection that Y = {0, 1}
and not {−1, +1}.
The following algorithm for prediction with expert advice is due to Littlestone and Warmuth.
Algorithm 7 Learning with Expert Advice (Weighted Majority)
input: Number of experts d ; Learning rate η > 0
initialize: θ^0 = (0, . . . , 0) ∈ R^d ; Z0 = d
for t = 1, 2, . . . , T
receive expert advice (f_1^t, f_2^t, . . . , f_d^t) ∈ {0, 1}^d
environment determines yt without revealing it to the learner
choose it at random according to the distribution defined by w_i^{t−1} = exp(θ_i^{t−1}) / Z_{t−1}
predict ŷt = f_{it}^t
receive label yt
update: ∀i, θ_i^t = θ_i^{t−1} − η |f_i^t − yt| ; Zt = Σ_{i=1}^{d} exp(θ_i^t)
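A minimal Python sketch of Algorithm 7; the expert advice for round t is a list of d values in {0, 1}, and the learning rate η is left as a parameter (a natural illustrative choice, suggested by the analysis below, is η = √(2 log(d)/T)):

import math
import random

def expert_advice_wm(advice_rounds, labels, eta, rng=random.Random(0)):
    """Algorithm 7: advice_rounds[t] is a list of d expert predictions in {0, 1}, labels[t] is y_t.
    Returns the total loss sum_t |yhat_t - y_t| of the randomized predictions."""
    d = len(advice_rounds[0])
    theta = [0.0] * d
    total_loss = 0
    for f, y in zip(advice_rounds, labels):
        weights = [math.exp(th) for th in theta]        # w^{t-1}_i proportional to exp(theta_i)
        i = rng.choices(range(d), weights=weights)[0]   # sample expert i_t
        total_loss += abs(f[i] - y)                     # predict yhat_t = f_{i_t}^t
        theta = [theta[j] - eta * abs(f[j] - y) for j in range(d)]   # penalize experts that erred
    return total_loss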
To analyze the Weighted Majority algorithm we first note that the definition of ŷt clearly implies:

E[|ŷt − yt|] = Σ_{i=1}^{d} w_i^{t−1} |f_i^t − yt| = ⟨w^{t−1}, v^t⟩ ,        (3)
where v^t is the vector whose i-th element is |f_i^t − yt|. Based on this presentation, the update rule is equivalent
to the update of Algorithm 5 on the sequence of functions gt(w) = ⟨w, v^t⟩ with the strongly convex function
f(w) = Σ_i wi log(wi) + log(d). Letting A be the probability simplex and applying Theorem 6 we obtain
that

Σ_{t=1}^{T} gt(wt) − min_{u∈A} Σ_{t=1}^{T} gt(u) ≤ √(2 log(d) T) .
In particular, the above holds for any u = ei. Combining the above with Eq. (3) we obtain

E[ Σ_{t=1}^{T} |ŷt − yt| ] ≤ min_i Σ_{t=1}^{T} |f_i^t − yt| + √(2 log(d) T) .
Remark: Seemingly, we obtained a result w.r.t. the 0-1 loss function, which contradicts Cover’s impossibility result. There is no contradiction here because we force the environment to determine yt before observing
the random coins flipped by the learner. In contrast, in Cover’s impossibility result, the environment can
choose yt after observing ŷt .
8 Follow the Regularized Leader
In this section we derive a different algorithmic approach in which the weight vector we use in round t + 1 is
defined as

wt+1 = argmin_{w∈A} Σ_{s≤t} gs(w) + (1/η) f(w) .

This rule is called Follow the Regularized Leader (FTRL). It is also convenient to define an unconstrained
version

w̃t+1 = argmin_w Σ_{s≤t} gs(w) + (1/η) f(w) .
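As an illustration, the following Python sketch implements FTRL for the special case of linear losses gs(w) = ⟨vs, w⟩ with the quadratic regularizer f(w) = (1/2)||w||_2² and A a Euclidean ball of radius B; for these choices the FTRL minimizer has a closed form (project −η Σ_{s≤t} vs onto A). These concrete choices are assumptions made for the sketch, not the general rule.

import math

def ftrl_linear(loss_vectors, eta, dim, B=1.0):
    """FTRL with f(w) = 0.5*||w||^2 and linear losses g_s(w) = <v_s, w> over the ball ||w|| <= B.
    loss_vectors: the vectors v_1, ..., v_T. Returns the points w_1, ..., w_T played."""
    cum = [0.0] * dim
    played = []
    for v in loss_vectors:
        # w_{t+1} = argmin_{||w||<=B} <sum_{s<=t} v_s, w> + (1/eta)*0.5*||w||^2
        #         = projection of -eta * sum_{s<=t} v_s onto the ball
        w = [-eta * c for c in cum]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > B:
            w = [B * wi / norm for wi in w]
        played.append(w)
        cum = [c + vi for c, vi in zip(cum, v)]
    return played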
The following lemma bounds the regret of FTRL in terms of a stability term.

Lemma 5 For all u ∈ A, the regret of FTRL satisfies

Σ_{t=1}^{T} (gt(wt) − gt(u)) ≤ Σ_{t=1}^{T} (gt(wt) − gt(wt+1)) + η^{−1} (f(u) − f(w1)) .
Proof We have

gt(wt) − gt(u) = gt(wt) − gt(wt+1) + gt(wt+1) − gt(u) .

Therefore, we need to show that

Σ_{t=1}^{T} (gt(wt+1) − gt(u)) ≤ η^{−1} (f(u) − f(w1)) .

We prove the last inequality by induction. The base case of T = 0 holds by the definition of w1. Now,
suppose the statement holds for T − 1, i.e.

Σ_{t=1}^{T−1} gt(wt+1) + η^{−1} f(w1) ≤ Σ_{t=1}^{T−1} gt(u) + η^{−1} f(u) .

Since it holds for any u ∈ A, it holds in particular for u = wT+1:

Σ_{t=1}^{T−1} gt(wt+1) + η^{−1} f(w1) ≤ Σ_{t=1}^{T−1} gt(wT+1) + η^{−1} f(wT+1) .

Adding gT(wT+1) to both sides,

Σ_{t=1}^{T} gt(wt+1) + η^{−1} f(w1) ≤ Σ_{t=1}^{T} gt(wT+1) + η^{−1} f(wT+1) .

The induction step follows from the above since wT+1 is the minimizer of the right-hand side over all
w ∈ A.
To derive a concrete bound from the above, one needs to show that the term Σ_t (gt(wt) − gt(wt+1)), which
measures the stability of the algorithm, is small. This can be done assuming that f(w) is strongly convex and
gt is Lipschitz, and one can derive bounds which are similar to the bounds for online mirror descent. In the
next section, we derive another family of bounds which involves local norms.
9 Another Mirror Descent Algorithm
We now present a slightly different mirror descent algorithm. We will show bounds that depend on local
norms, which will later on help us in deriving bounds for online learning in the limited feedback setting.
Throughout, we consider the problem of online linear optimization — this is the special case of online convex
optimization in which the functions are linear¹ and take the form gt(w) = ⟨w, xt⟩.
¹ The linearity assumption is not critical since we can reduce a general convex function to a linear one (this is left as an exercise).
The algorithm depends on a regularization function f : A → R and the definition of the update is based
on the Bregman divergence associated with f .
Definition 6 (Bregman divergence) Given a convex differentiable function f, the Bregman divergence
associated with f is defined as

Df(u, v) = f(u) − f(v) − ⟨∇f(v), u − v⟩ .
Algorithm 8 Online Mirror Descent II
Initialize: w1 = w̃1 = argmin_{w∈A} f(w)
for t = 1 to T
Play wt
Get xt and pay ⟨wt, xt⟩
Let w̃t+1 = argmin_w η⟨xt, w⟩ + Df(w, wt)
Update wt+1 = argmin_{w∈A} Df(w, w̃t+1)
end for
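As an illustration, here is a Python sketch of Algorithm 8 for the particular choice f(w) = Σ_i wi log(wi) and A the probability simplex; as Example 9 below verifies, the update then becomes a multiplicative update followed by normalization (the Bregman projection onto the simplex).

import math

def omd_entropy(loss_vectors, eta):
    """Algorithm 8 with the entropic regularizer over the probability simplex.
    loss_vectors: the vectors x_1, ..., x_T. Returns the cumulative loss sum_t <w_t, x_t>."""
    d = len(loss_vectors[0])
    w = [1.0 / d] * d                    # w_1 minimizes the negative entropy over the simplex
    total = 0.0
    for x in loss_vectors:
        total += sum(wi * xi for wi, xi in zip(w, x))
        w_tilde = [wi * math.exp(-eta * xi) for wi, xi in zip(w, x)]   # unconstrained step
        Z = sum(w_tilde)
        w = [v / Z for v in w_tilde]     # Bregman (KL) projection onto the simplex = normalization
    return total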
To analyze the algorithm we start with the following lemma.

Lemma 6 For all u ∈ A:

Σ_{t=1}^{T} ⟨xt, wt − u⟩ ≤ Σ_{t=1}^{T} ⟨xt, wt − w̃t+1⟩ + (f(u) − f(w̃1)) / η .
Proof By the optimality condition and the definition of the Bregman divergence we have

η xt + ∇f(w̃t+1) − ∇f(wt) = 0   ⇒   ∇f(wt) − ∇f(w̃t+1) = η xt .

Therefore

η⟨xt, wt − u⟩ = ⟨∇f(wt) − ∇f(w̃t+1), wt − u⟩
             = Df(u, wt) − Df(u, w̃t+1) + Df(wt, w̃t+1) ,

where the last equality follows from the definition of the Bregman divergence. Since wt is the projection
of w̃t on A and since u ∈ A, we know from the Bregman projection lemma (see the Censor and Lent book or
Cesa-Bianchi and Lugosi) that Df(u, wt) ≤ Df(u, w̃t). Thus,

η⟨xt, wt − u⟩ ≤ Df(u, w̃t) − Df(u, w̃t+1) + Df(wt, w̃t+1) .

Furthermore, from the non-negativity of the Bregman divergence we have

Df(wt, w̃t+1) ≤ Df(wt, w̃t+1) + Df(w̃t+1, wt)
             = ⟨∇f(wt) − ∇f(w̃t+1), wt − w̃t+1⟩
             = ⟨η xt, wt − w̃t+1⟩ .
Combining the above two inequalities and summing over t we obtain

Σ_{t=1}^{T} η⟨xt, wt − u⟩ ≤ Σ_{t=1}^{T} (Df(u, w̃t) − Df(u, w̃t+1)) + Σ_{t=1}^{T} η⟨xt, wt − w̃t+1⟩
                         = Df(u, w̃1) − Df(u, w̃T+1) + Σ_{t=1}^{T} η⟨xt, wt − w̃t+1⟩
                         ≤ f(u) − f(w̃1) + Σ_{t=1}^{T} η⟨xt, wt − w̃t+1⟩ .

Dividing the above by η we conclude our proof.
To derive concrete bounds from the above we need to further upper bound the sum over ⟨xt, wt − w̃t+1⟩.
We start with a simple example.
Example 8 Let f(w) = (1/2)||w||_2². Then ∇f(w) = w and Df(w, u) = (1/2)||w − u||_2². Therefore,
w̃t+1 = wt − η xt and we obtain

⟨xt, wt − w̃t+1⟩ = ⟨xt, η xt⟩ = η ||xt||_2² .

Thus, if ||xt||_2 ≤ Bx for all t and if A is contained in the ℓ2 ball of radius Bw we obtain the regret bound
η T Bx² + Bw²/(2η). Optimizing over η gives

Σ_{t=1}^{T} ⟨xt, wt − u⟩ ≤ Bw Bx √(2T) .
Next, we derive an improved bound for the entropy regularization.

Example 9 Let f(w) = Σ_i wi log(wi) and A = {w ≥ 0 : ||w||_1 = 1}. Then w̃1 = (1/d, . . . , 1/d) and
f(w̃1) = −log(d). In addition, for all u ∈ A we have f(u) ≤ 0. By the optimality condition we have

∇f(w̃t+1) = ∇f(wt) − η xt .

Since ∇_i f(w) = log(wi) + 1, the above implies:

log(w̃t+1,i) = log(wt,i) − η xt,i   ⇒   w̃t+1,i = wt,i exp(−η xt,i) .

In addition, it is easy to verify that wt+1,i = w̃t+1,i / ||w̃t+1||_1. Therefore, we obtained the Weighted Majority
algorithm. To analyze the algorithm let us assume that xt,i ≥ 0 for all i. Using the inequality 1 − exp(−a) ≤ a
we obtain

⟨xt, wt − w̃t+1⟩ = Σ_i xt,i wt,i (1 − exp(−η xt,i)) ≤ η Σ_i wt,i xt,i² .

We can further bound the above by max_i xt,i², but if xt,i also depends on wt,i we will obtain an improved
bound. This will be the topic of the next lecture.