18.657: Mathematics of Machine Learning
Lecture 20: Nov. 23, 2015
Lecturer: Philippe Rigollet          Scribe: Vira Semenova

In this lecture, we study adversarial bandits under limited feedback. An adversarial bandit is a setup in which the loss function $l(a, z) : A \times Z \to [0,1]$ is deterministic. Limited feedback means that the information available to the decision maker (DM) after step $t$ is $I_t = \{l(a_1, z_1), \dots, l(a_t, z_t)\}$, i.e., it consists of the realized losses of the past steps only.

5. ADVERSARIAL BANDITS

Consider the problem of prediction with expert advice. Let the set of adversary moves be $Z$ and the set of actions of the decision maker be $A = \{e_1, \dots, e_K\}$. At time $t$, $a_t \in A$ and $z_t \in Z$ are revealed simultaneously, and the loss associated with the decision $a_t \in A$ against the adversary's move $z_t$ is $l(a_t, z_t)$. We compare the total loss after $n$ steps to the minimum expert loss, namely
\[
\min_{1 \le j \le K} \sum_{t=1}^{n} l(e_j, z_t),
\]
where $e_j$ is the choice of expert $j \in \{1, 2, \dots, K\}$. The cumulative regret is then defined as
\[
R_n = \sum_{t=1}^{n} l(a_t, z_t) - \min_{1 \le j \le K} \sum_{t=1}^{n} l(e_j, z_t).
\]

The feedback at step $t$ can be either full or limited. Full feedback means that the vector $(l(e_1, z_t), \dots, l(e_K, z_t))^\top$ of losses incurred at the adversary's choice $z_t$ by each expert $e_j \in \{e_1, \dots, e_K\}$ is observed after each step $t$. Hence, the information available to the DM after step $t$ is $I_t = \cup_{t'=1}^{t} \{l(e_1, z_{t'}), \dots, l(e_K, z_{t'})\}$. Limited feedback means that the time-$t$ feedback consists of the realized loss $l(a_t, z_t)$ only, so the information available to the DM is $I_t = \{l(a_1, z_1), \dots, l(a_t, z_t)\}$. An example of the first setup is portfolio optimization, where the losses of all possible portfolios are observed at time $t$. Examples of the second setup are path planning and dynamic pricing, where only the loss of the chosen decision is observed. This lecture treats the limited feedback setup.

Two strategies defined in the past lectures were exponential weights, which yields a regret of order $R_n \le c\sqrt{n \log K}$, and Follow the Perturbed Leader. We would like to play exponential weights, defined as
\[
p_{t,j} = \frac{\exp\big(-\eta \sum_{s=1}^{t-1} l(e_j, z_s)\big)}{\sum_{k=1}^{K} \exp\big(-\eta \sum_{s=1}^{t-1} l(e_k, z_s)\big)}.
\]
This decision rule is not feasible, since the loss $l(e_j, z_t)$ is not part of the feedback if $e_j \neq a_t$. We estimate it by
\[
\hat{l}(e_j, z_t) = \frac{l(e_j, z_t)\,\mathbb{1}(a_t = e_j)}{\mathbb{P}(a_t = e_j)}.
\]

Lemma: $\hat{l}(e_j, z_t)$ is an unbiased estimator of $l(e_j, z_t)$.

Proof.
\[
\mathbb{E}_{a_t} \hat{l}(e_j, z_t) = \sum_{k=1}^{K} \frac{l(e_j, z_t)\,\mathbb{1}(e_k = e_j)}{\mathbb{P}(a_t = e_j)}\,\mathbb{P}(a_t = e_k) = l(e_j, z_t).
\]

Definition (Exp3 Algorithm): Let $\eta > 0$ be fixed. Define the exponential weights
\[
p_{t,j} = \frac{\exp\big(-\eta \sum_{s=1}^{t-1} \hat{l}(e_j, z_s)\big)}{\sum_{k=1}^{K} \exp\big(-\eta \sum_{s=1}^{t-1} \hat{l}(e_k, z_s)\big)},
\]
and at time $t$ draw $a_t \in A$ according to the distribution $p_t = (p_{t,1}, \dots, p_{t,K})$, so that $\mathbb{P}(a_t = e_j) = p_{t,j}$. (Exp3 stands for Exponential weights for Exploration and Exploitation.)

We will show that the regret of Exp3 is bounded by $\sqrt{2nK\log K}$. This bound is $\sqrt{K}$ times bigger than the bound on the regret under full feedback; the $\sqrt{K}$ multiplier is the price of having a smaller information set at time $t$. There are methods that allow one to get rid of the $\sqrt{\log K}$ term in this expression. On the other hand, it can be shown that $\sqrt{2nK}$ is the optimal regret.
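Before turning to the regret analysis, here is a minimal Python sketch of Exp3 as defined above, assuming losses in $[0,1]$. The function name exp3, the loss(j, t) callback interface, and the NumPy-based sampling are illustrative choices, not part of the lecture notes.

```python
import numpy as np

def exp3(n, K, loss, eta, rng=None):
    """Minimal Exp3 sketch: exponential weights on importance-weighted loss estimates.

    loss(j, t) should return the adversarial loss in [0, 1] of arm j at round t;
    only the loss of the arm actually played is queried, mirroring limited feedback.
    """
    rng = np.random.default_rng() if rng is None else rng
    cum_hat = np.zeros(K)                 # cumulative estimates  sum_s lhat(e_j, z_s)
    played, incurred = [], []
    for t in range(n):
        # p_{t,j} proportional to exp(-eta * cumulative estimated loss); shift for stability
        w = np.exp(-eta * (cum_hat - cum_hat.min()))
        p = w / w.sum()
        a = rng.choice(K, p=p)            # draw a_t ~ p_t
        ell = loss(a, t)                  # only l(a_t, z_t) is revealed
        cum_hat[a] += ell / p[a]          # lhat(e_j, z_t) = l(e_j, z_t) 1{a_t = e_j} / p_{t,j}
        played.append(a)
        incurred.append(ell)
    return np.array(played), np.array(incurred)
```

Subtracting the minimum cumulative estimate inside the exponential only rescales all weights by a common factor, so it leaves $p_t$ unchanged while keeping the computation numerically stable.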
Proof. Let $W_{t,j} = \exp\big(-\eta \sum_{s=1}^{t-1} \hat{l}(e_j, z_s)\big)$, $W_t = \sum_{j=1}^{K} W_{t,j}$, and $p_t = \sum_{j=1}^{K} \frac{W_{t,j}}{W_t}\, e_j$. Then
\begin{align}
\log\Big(\frac{W_{t+1}}{W_t}\Big)
&= \log\Bigg(\frac{\sum_{j=1}^{K} \exp\big(-\eta \sum_{s=1}^{t-1} \hat{l}(e_j, z_s)\big)\exp\big(-\eta \hat{l}(e_j, z_t)\big)}{\sum_{j=1}^{K} \exp\big(-\eta \sum_{s=1}^{t-1} \hat{l}(e_j, z_s)\big)}\Bigg) \tag{5.1}\\
&= \log\Big(\mathbb{E}_{J \sim p_t} \exp\big(-\eta \hat{l}(e_J, z_t)\big)\Big) \tag{5.2}\\
&\overset{*}{\le} \log\Big(1 - \eta\, \mathbb{E}_{J \sim p_t} \hat{l}(e_J, z_t) + \frac{\eta^2}{2}\, \mathbb{E}_{J \sim p_t} \hat{l}^2(e_J, z_t)\Big), \tag{5.3}
\end{align}
where the $*$ inequality is obtained by applying $e^{-x} \le 1 - x + \frac{x^2}{2}$, valid for $x \ge 0$, to $x = \eta \hat{l}(e_J, z_t)$ and taking the expectation over $J \sim p_t$. Moreover, since $\mathbb{P}(a_t = e_j) = p_{t,j}$,
\begin{align}
\mathbb{E}_{J \sim p_t} \hat{l}(e_J, z_t) &= \sum_{j=1}^{K} p_{t,j}\, \hat{l}(e_j, z_t) = \sum_{j=1}^{K} p_{t,j}\, \frac{l(e_j, z_t)\,\mathbb{1}(a_t = e_j)}{\mathbb{P}(a_t = e_j)} = l(a_t, z_t), \tag{5.4}\\
\mathbb{E}_{J \sim p_t} \hat{l}^2(e_J, z_t) &= \sum_{j=1}^{K} p_{t,j}\, \hat{l}^2(e_j, z_t) = \sum_{j=1}^{K} p_{t,j}\, \frac{l^2(e_j, z_t)\,\mathbb{1}(a_t = e_j)}{\mathbb{P}^2(a_t = e_j)} = \frac{l^2(a_t, z_t)}{p_{t,a_t}} \tag{5.5}\\
&\le \frac{1}{p_{t,a_t}}, \tag{5.6}
\end{align}
where the last inequality uses $l(a_t, z_t) \le 1$. Plugging (5.4) and (5.6) into (5.3), using $\log(1+x) \le x$, and summing from $t = 1$ through $n$, we get
\[
\log(W_{n+1}) \le \log(W_1) - \eta \sum_{t=1}^{n} l(a_t, z_t) + \frac{\eta^2}{2} \sum_{t=1}^{n} \frac{1}{p_{t,a_t}}.
\]
For $t = 1$, we initialize $w_{1,j} = 1$, so $W_1 = K$. Since $\mathbb{E}\big[\tfrac{1}{p_{t,a_t}}\big] = \sum_{j=1}^{K} \frac{p_{t,j}}{p_{t,j}} = K$, taking expectations in the expression above gives
\[
\mathbb{E}\log(W_{n+1}) - \log K \le -\eta \sum_{t=1}^{n} \mathbb{E}\, l(a_t, z_t) + \frac{\eta^2 K n}{2}.
\]
Noting that $\log(W_{n+1}) = \log\big(\sum_{j=1}^{K} \exp\big(-\eta \sum_{s=1}^{n} \hat{l}(e_j, z_s)\big)\big)$ and defining $j^* = \mathop{\mathrm{argmin}}_{1 \le j \le K} \sum_{t=1}^{n} l(e_j, z_t)$, we obtain
\[
\log(W_{n+1}) \ge \log\Big(\exp\big(-\eta \sum_{s=1}^{n} \hat{l}(e_{j^*}, z_s)\big)\Big) = -\eta \sum_{s=1}^{n} \hat{l}(e_{j^*}, z_s),
\]
and hence, since $\hat{l}$ is unbiased, $\mathbb{E}\log(W_{n+1}) \ge -\eta \sum_{s=1}^{n} l(e_{j^*}, z_s)$. Together:
\[
\sum_{t=1}^{n} \mathbb{E}\, l(a_t, z_t) - \min_{1 \le j \le K} \sum_{t=1}^{n} l(e_j, z_t) \le \frac{\log K}{\eta} + \frac{\eta K n}{2}.
\]
The choice $\eta := \sqrt{\frac{2 \log K}{nK}}$ yields the bound $R_n \le \sqrt{2 K n \log K}$.
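As an informal sanity check of the tuning $\eta = \sqrt{2\log K/(nK)}$, one can run the exp3 sketch given earlier against a simple oblivious adversary and compare the realized loss to that of the best arm in hindsight and to the bound $\sqrt{2nK\log K}$. The adversary below (uniform losses with one slightly better arm), the horizon, and the number of arms are illustrative assumptions, not from the lecture.

```python
import numpy as np

n, K = 10_000, 10
eta = np.sqrt(2 * np.log(K) / (n * K))   # the tuning minimizing log K / eta + eta * K * n / 2

# Illustrative oblivious adversary: losses in [0, 1] fixed in advance, arm 3 slightly better.
rng = np.random.default_rng(0)
losses = np.clip(rng.uniform(0, 1, size=(K, n)) - 0.1 * (np.arange(K) == 3)[:, None], 0.0, 1.0)

# Uses the exp3 sketch from earlier in these notes.
_, incurred = exp3(n, K, loss=lambda j, t: losses[j, t], eta=eta, rng=rng)
regret = incurred.sum() - losses.sum(axis=1).min()
print(f"realized regret: {regret:.1f}   bound sqrt(2nK log K): {np.sqrt(2 * n * K * np.log(K)):.1f}")
```

On a single run the realized regret is random, but it should typically fall well below the worst-case bound, since this particular adversary is far from worst-case.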