18.657: Mathematics of Machine Learning
Lecture 18 (Nov. 2, 2015)
Lecturer: Philippe Rigollet        Scribe: Haihao (Sean) Lu

3. STOCHASTIC BANDITS

3.1 Setup

The stochastic multi-armed bandit is a classical model for decision making and is defined as follows. There are $K$ arms (different actions). Iteratively, a decision maker chooses an arm $k \in \{1, \dots, K\}$, yielding a sequence $X_{k,1}, \dots, X_{k,t}, \dots$ of i.i.d. random variables with mean $\mu_k$. Define $\mu^* = \max_j \mu_j$ and let $* \in \operatorname*{argmax}_j \mu_j$ denote an optimal arm. A policy $\pi$ is a sequence $\{\pi_t\}_{t \ge 1}$ indicating which arm is pulled at time $t$: $\pi_t \in \{1, \dots, K\}$, and it depends only on the observations strictly prior to $t$. The regret is then defined as
$$
\begin{aligned}
R_n &= \max_k \mathbb{E}\Big[\sum_{t=1}^n X_{k,t}\Big] - \mathbb{E}\Big[\sum_{t=1}^n X_{\pi_t,t}\Big] \\
    &= n\mu^* - \mathbb{E}\Big[\sum_{t=1}^n X_{\pi_t,t}\Big] \\
    &= n\mu^* - \mathbb{E}\Big[\mathbb{E}\Big[\sum_{t=1}^n X_{\pi_t,t}\,\Big|\,\pi_t\Big]\Big] \\
    &= \sum_{k=1}^K \Delta_k\,\mathbb{E}[T_k(n)]\,,
\end{aligned}
$$
where $\Delta_k = \mu^* - \mu_k$ and $T_k(n) = \sum_{t=1}^n \mathbb{1}(\pi_t = k)$ is the number of times arm $k$ was pulled.

3.2 Warm Up: Full Info Case

Assume in this subsection that $K = 2$ and that we observe the full information $(X_{1,t}, \dots, X_{K,t})^\top$ at time $t$ after choosing $\pi_t$. In each iteration, a natural idea is then to choose the arm with the highest average reward so far, that is,
$$
\pi_t = \operatorname*{argmax}_{k = 1, 2} \bar X_{k,t}\,, \qquad \text{where } \ \bar X_{k,t} = \frac{1}{t}\sum_{s=1}^t X_{k,s}\,.
$$
Assume from now on that all the random variables $X_{k,t}$ are subGaussian with variance proxy $\sigma^2$, which means
$$
\mathbb{E}\big[e^{u(X_{k,t} - \mu_k)}\big] \le e^{\frac{u^2\sigma^2}{2}} \qquad \text{for all } u \in \mathbb{R}\,.
$$
For example, $\mathcal{N}(0, \sigma^2)$ is subGaussian with variance proxy $\sigma^2$, and any bounded random variable $X \in [a, b]$ is subGaussian with variance proxy $(b-a)^2/4$ by Hoeffding's lemma. Therefore,
$$
R_n = \Delta\,\mathbb{E}[T_2(n)]\,, \tag{3.1}
$$
where $\Delta = \mu_1 - \mu_2$ (assuming, without loss of generality, that arm 1 is optimal). Moreover,
$$
T_2(n) = 1 + \sum_{t=2}^n \mathbb{1}(\bar X_{2,t} > \bar X_{1,t})
       = 1 + \sum_{t=2}^n \mathbb{1}\big(\bar X_{2,t} - \bar X_{1,t} - (\mu_2 - \mu_1) \ge \Delta\big)\,.
$$
It is easy to check that $(\bar X_{2,t} - \bar X_{1,t}) - (\mu_2 - \mu_1)$ is centered subGaussian with variance proxy $2\sigma^2/t$, whereby
$$
\mathbb{E}\big[\mathbb{1}(\bar X_{2,t} > \bar X_{1,t})\big] \le e^{-\frac{t\Delta^2}{4\sigma^2}}
$$
by a simple Chernoff bound. Therefore,
$$
R_n \le \Delta\Big(1 + \sum_{t=1}^\infty e^{-\frac{t\Delta^2}{4\sigma^2}}\Big) \le \Delta + \frac{4\sigma^2}{\Delta}\,, \tag{3.2}
$$
whereby the benchmark is $R_n \le \Delta + \frac{4\sigma^2}{\Delta}$.

3.3 Upper Confidence Bound (UCB)

Without loss of generality, from now on we assume $\sigma = 1$. A naive idea is the following: after $s$ pulls of arm $k$, use
$$
\hat\mu_{k,s} = \frac{1}{s}\sum_{j \in \{\text{pulls of } k\}} X_{k,j}
$$
and choose the arm with the largest $\hat\mu_{k,s}$. The problem with this naive policy is that some arm may be tried only a few times, yield a poor average by chance, and then never be tried again. To overcome this limitation, a better idea is to choose the arm with the highest upper bound estimate on its mean at some confidence level; note that an arm with fewer pulls can deviate more from its mean. This is called the Upper Confidence Bound (UCB) policy.

Algorithm 1 Upper Confidence Bound (UCB)
  for $t = 1$ to $K$ do
    $\pi_t = t$
  end for
  for $t = K+1$ to $n$ do
    $T_k(t) = \sum_{s=1}^{t-1} \mathbb{1}(\pi_s = k)$   (number of times arm $k$ has been pulled before time $t$)
    $\hat\mu_{k,t} = \frac{1}{T_k(t)} \sum_{s=1}^{t-1} X_{k,s}\,\mathbb{1}(\pi_s = k)$   (average reward observed from arm $k$ before time $t$)
    $\pi_t \in \operatorname*{argmax}_{k \in [K]} \Big\{ \hat\mu_{k,t} + 2\sqrt{\frac{2\log t}{T_k(t)}} \Big\}$
  end for
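To make the index computation concrete, here is a minimal Python simulation sketch of Algorithm 1; the Gaussian reward model, the horizon, and all names are illustrative assumptions (the notes only assume subGaussian rewards with $\sigma = 1$).

```python
import math
import random


def ucb(means, n, seed=0):
    """Run the UCB policy of Algorithm 1 for n rounds on arms with the given means.

    Rewards are drawn as unit-variance Gaussians, an illustrative choice of
    sigma = 1 subGaussian rewards. Returns the pull counts T_k(n).
    """
    rng = random.Random(seed)
    K = len(means)
    counts = [0] * K      # T_k(t): number of pulls of arm k so far
    sums = [0.0] * K      # running sum of rewards observed from arm k

    def pull(k):
        counts[k] += 1
        sums[k] += rng.gauss(means[k], 1.0)

    # Initialization: pull each arm once (t = 1, ..., K).
    for k in range(K):
        pull(k)

    # Main loop: pick the arm maximizing mu_hat_k + 2 * sqrt(2 log t / T_k(t)).
    for t in range(K + 1, n + 1):
        index = [sums[k] / counts[k] + 2.0 * math.sqrt(2.0 * math.log(t) / counts[k])
                 for k in range(K)]
        pull(max(range(K), key=lambda k: index[k]))

    return counts


if __name__ == "__main__":
    # Two arms with gap Delta = 0.5; the suboptimal arm should be pulled O(log n) times.
    print(ucb([0.5, 0.0], n=10000))
```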
Theorem: The UCB policy has regret
$$
R_n \le 8 \sum_{k:\Delta_k > 0} \frac{\log n}{\Delta_k} + \Big(1 + \frac{\pi^2}{3}\Big)\sum_{k=1}^K \Delta_k\,.
$$

Proof. From now on we fix $k$ such that $\Delta_k > 0$. Then
$$
\mathbb{E}[T_k(n)] = 1 + \sum_{t=K+1}^n \mathbb{P}(\pi_t = k)\,.
$$
Note that for $t > K$,
$$
\begin{aligned}
\{\pi_t = k\} &\subset \Big\{ \hat\mu_{*,t} + 2\sqrt{\tfrac{2\log t}{T_*(t)}} \le \hat\mu_{k,t} + 2\sqrt{\tfrac{2\log t}{T_k(t)}} \Big\} \\
&\subset \Big\{ \mu_k \ge \hat\mu_{k,t} + 2\sqrt{\tfrac{2\log t}{T_k(t)}} \Big\}
\cup \Big\{ \mu_* \ge \hat\mu_{*,t} + 2\sqrt{\tfrac{2\log t}{T_*(t)}} \Big\}
\cup \Big\{ \mu_* \le \mu_k + 2\sqrt{\tfrac{2\log t}{T_k(t)}},\ \pi_t = k \Big\}\,.
\end{aligned}
$$
From a union bound over the possible values $s \in \{1, \dots, t\}$ of $T_k(t)$, we have
$$
\mathbb{P}\Big(\hat\mu_{k,t} - \mu_k \le -2\sqrt{\tfrac{2\log t}{T_k(t)}}\Big)
\le \sum_{s=1}^t \exp\Big(-\frac{s}{2}\cdot\frac{8\log t}{s}\Big)
= \sum_{s=1}^t \frac{1}{t^4} = \frac{1}{t^3}\,.
$$
Thus $\mathbb{P}\big(\mu_k \ge \hat\mu_{k,t} + 2\sqrt{\frac{2\log t}{T_k(t)}}\big) \le \frac{1}{t^3}$, and similarly $\mathbb{P}\big(\mu_* \ge \hat\mu_{*,t} + 2\sqrt{\frac{2\log t}{T_*(t)}}\big) \le \frac{1}{t^3}$, whereby
$$
\begin{aligned}
\sum_{t=K+1}^n \mathbb{P}(\pi_t = k)
&\le 2\sum_{t=1}^n \frac{1}{t^3} + \sum_{t=1}^n \mathbb{P}\Big(\mu_* \le \mu_k + 2\sqrt{\tfrac{2\log t}{T_k(t)}},\ \pi_t = k\Big) \\
&\le 2\sum_{t=1}^\infty \frac{1}{t^3} + \sum_{t=1}^n \mathbb{P}\Big(T_k(t) \le \frac{8\log t}{\Delta_k^2},\ \pi_t = k\Big) \\
&\le 2\sum_{t=1}^\infty \frac{1}{t^3} + \sum_{t=1}^n \mathbb{P}\Big(T_k(t) \le \frac{8\log n}{\Delta_k^2},\ \pi_t = k\Big) \\
&\le 2\sum_{t=1}^\infty \frac{1}{t^3} + \sum_{s=1}^\infty \mathbb{P}\Big(s \le \frac{8\log n}{\Delta_k^2}\Big) \\
&\le 2\sum_{t=1}^\infty \frac{1}{t^2} + \frac{8\log n}{\Delta_k^2}
= \frac{\pi^2}{3} + \frac{8\log n}{\Delta_k^2}\,,
\end{aligned}
$$
where $s$ is the counter of pulls of arm $k$. Therefore we have
$$
R_n = \sum_{k=1}^K \Delta_k\,\mathbb{E}[T_k(n)] \le \sum_{k:\Delta_k > 0} \Delta_k\Big(1 + \frac{\pi^2}{3} + \frac{8\log n}{\Delta_k^2}\Big)\,,
$$
which completes the proof.

Consider first the case $K = 2$. The theorem above gives $R_n \sim \frac{\log n}{\Delta}$, which is consistent with the intuition that when the gap between the two arms is small, it is hard to distinguish which arm to choose. On the other hand, it always holds that $R_n \le n\Delta$. Combining these two results gives $R_n \le \frac{\log n}{\Delta} \wedge n\Delta$, whereby $R_n \le \frac{\log(n\Delta^2)}{\Delta}$ up to a constant, and this actually turns out to be the optimal bound. When $K \ge 3$, we can similarly get $R_n \le \sum_k \frac{\log(n\Delta_k^2)}{\Delta_k}$ up to a constant. This, however, is not the optimal bound: the optimal bound is $\sum_k \frac{\log(n/H)}{\Delta_k}$, where $H = \sum_k \frac{1}{\Delta_k^2}$ is a harmonic-type sum. See [Lat15].

3.4 Bounded Regret

From the above we know that the UCB policy gives regret growing at most logarithmically in $n$. In this section we consider whether bounded regret is possible. It turns out that if there is a known separator between the expected reward of the optimal arm and those of the other arms, then there is a policy with bounded regret. We only consider the case $K = 2$ here. Without loss of generality, assume $\mu_1 = \frac{\Delta}{2}$ and $\mu_2 = -\frac{\Delta}{2}$, so that $0$ is a natural separator.

Algorithm 2 Bounded Regret Policy (BRP)
  $\pi_1 = 1$ and $\pi_2 = 2$
  for $t = 3$ to $n$ do
    if $\max_k \hat\mu_{k,t} > 0$ then
      $\pi_t = \operatorname*{argmax}_k \hat\mu_{k,t}$
    else
      $\pi_t = 1$, $\pi_{t+1} = 2$
    end if
  end for

Theorem: BRP has regret
$$
R_n \le \Delta + \frac{16}{\Delta}\,.
$$

Proof. We have
$$
\mathbb{P}(\pi_t = 2) = \mathbb{P}(\hat\mu_{2,t} > 0,\ \pi_t = 2) + \mathbb{P}(\hat\mu_{2,t} \le 0,\ \pi_t = 2)\,.
$$
Note that
$$
\begin{aligned}
\sum_{t=3}^n \mathbb{P}(\hat\mu_{2,t} > 0,\ \pi_t = 2)
&\le \mathbb{E}\Big[\sum_{t=3}^n \mathbb{1}(\hat\mu_{2,t} > 0,\ \pi_t = 2)\Big] \\
&\le \mathbb{E}\Big[\sum_{t=3}^n \mathbb{1}\big(\hat\mu_{2,t} - \mu_2 > \tfrac{\Delta}{2},\ \pi_t = 2\big)\Big] \\
&\le \sum_{s=1}^\infty e^{-\frac{s\Delta^2}{8}} \le \frac{8}{\Delta^2}\,,
\end{aligned}
$$
where $s$ is the counter of pulls of arm 2 and the third inequality is a Chernoff bound. Similarly,
$$
\sum_{t=3}^n \mathbb{P}(\hat\mu_{2,t} \le 0,\ \pi_t = 2) = \sum_{t=3}^n \mathbb{P}(\hat\mu_{1,t} \le 0,\ \pi_{t-1} = 1) \le \frac{8}{\Delta^2}\,,
$$
since arm 2 is pulled while its estimate is below the separator only through the "else" branch, which requires the estimate of arm 1 to be below the separator as well. Combining these two inequalities, we have
$$
R_n \le \Delta\Big(1 + \frac{16}{\Delta^2}\Big) = \Delta + \frac{16}{\Delta}\,.
$$

References

[Lat15] Tor Lattimore. Optimally confident UCB: Improved regret for finite-armed bandits. arXiv:1507.07880, 2015.
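As a companion to Algorithm 2, here is a minimal Python simulation sketch of the BRP policy under the two-armed model of Section 3.4 ($\mu_1 = \Delta/2$, $\mu_2 = -\Delta/2$, known separator $0$); the Gaussian reward model and all names are illustrative assumptions.

```python
import random


def brp(delta, n, seed=0):
    """Bounded Regret Policy (Algorithm 2) for K = 2 arms with means
    +delta/2 and -delta/2 and unit-variance Gaussian rewards (illustrative model).

    Returns the number of pulls of the suboptimal arm 2.
    """
    rng = random.Random(seed)
    means = [delta / 2.0, -delta / 2.0]   # arm 1 is optimal; 0 separates the means
    counts = [0, 0]
    sums = [0.0, 0.0]

    def pull(k):
        counts[k] += 1
        sums[k] += rng.gauss(means[k], 1.0)

    # t = 1, 2: pull each arm once.
    pull(0)
    pull(1)

    t = 3
    while t <= n:
        mu_hat = [sums[k] / counts[k] for k in range(2)]
        if max(mu_hat) > 0:
            # Exploit: play the arm whose estimate exceeds the separator 0.
            pull(0 if mu_hat[0] >= mu_hat[1] else 1)
            t += 1
        else:
            # Both estimates below the separator: force one pull of each arm.
            pull(0)
            if t + 1 <= n:
                pull(1)
            t += 2

    return counts[1]


if __name__ == "__main__":
    # With delta = 0.5 the number of pulls of arm 2 stays bounded as n grows.
    print(brp(delta=0.5, n=10000))
```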