EECS 495: Randomized Algorithms
Lecture 13: Multi-armed Bandits
Reading: none

Multi-armed bandits (Robbins '52):

• Slot machine with multiple levers
• levers give rewards according to a distribution
• want to maximize the sum of rewards
• no initial knowledge about payoffs

Problem: Tradeoff between

• exploiting the lever with high expected payoff
• exploring to get more info about the expected payoffs of the other levers

Example: Keyword Allocations

• n advertisers with CPC v_1, ..., v_n and CTR p_1, ..., p_n
• one slot per keyword
• CTRs unknown, fixed over time
• which ad to show?

Problem:

• n arms
• reward X_i ∈ [0, 1] of arm i
• X_i random variable with mean µ_i
• Assume n << T.

Goal: Given finite horizon T, seek a policy to minimize regret:

    max_i (T × µ_i) − E[ Σ_{t=1}^T R_t ] = T × µ_{i*} − E[ Σ_{t=1}^T R_t ],

where i* is the arm with highest µ_i and R_t is the reward collected at step t.

Claim: There is a policy that obtains regret O(√(nT log T)) (and hence per-turn regret vanishes for large T).

Question: Easy policies?

Algorithm: Play all arms for a while, then play the best one (a code sketch is given at the end of these notes):

• Play each arm for T^(2/3) steps
• Choose the arm with the max sample average and play it for the remaining T − nT^(2/3) steps

Claim: O(nT^(2/3) √(log T)) regret.

Claim (Chernoff-Hoeffding inequality): Let X_1, ..., X_k be k independent draws from a distribution on [0, 1] with mean µ. Let µ̂ = (1/k) Σ_{i=1}^k X_i be the sample average. Then for any ε > 0:

    Pr[µ̂ − µ > ε] ≤ 2e^(−2kε²)   and   Pr[µ̂ + ε < µ] ≤ 2e^(−2kε²).

Proof (of regret bound):

• µ̂_i = sample average of arm i. By Hoeffding with k = T^(2/3):

      Pr[|µ_i − µ̂_i| > √(log T)/T^(1/3)] ≤ 2/T².

• By union bound:

      Pr[∃i : |µ_i − µ̂_i| > √(log T)/T^(1/3)] ≤ 2n/T² ≤ 2/T.

• With probability 1 − 2/T, the chosen arm i has µ_i ≥ µ_{i*} − 2√(log T)/T^(1/3), so regret is at most

      nT^(2/3) + T × 2√(log T)/T^(1/3) + (2/T) × T,

  where

  – the first term is regret due to the initial explore phase,
  – the second term is regret due to a slightly sub-optimal arm played at most T times,
  – the third term is regret due to an arm sub-optimal by up to 1, played (with probability at most 2/T) for at most T steps.

  The total is O(nT^(2/3) √(log T)).

Idea: Treating all arms equally wastes time. Play the arm with the highest upper confidence bound. Either that arm

• also has a higher mean, or
• playing it narrows its confidence interval;

either way, we're happy.

Def: If µ̂_i is the sample average of arm i and t_i is the number of times arm i has been played, then the index Φ_i of arm i is

    Φ_i = µ̂_i + √(log T / t_i).

Algorithm: Index policy (a code sketch is given at the end of these notes):

• Play the arm with the highest index
• Update the index

Claim: O(√(nT log T)) regret.

Proof: Let

• i* be the arm with the highest mean,
• ∆_i = µ_{i*} − µ_i be the per-turn regret for playing i,
• Q_i be the expected number of times i is played in T steps.

For each arm i ≠ i*, E(Q_i) ≤ (4 log T)/∆_i² + 2:

• Pr[Φ_{i*} < µ_{i*}] ≤ 1/T, no matter how long we play it. If i* is played continuously, then at each step t

      Pr[Φ_{i*}(t) < µ_{i*}] = Pr[µ_{i*} − µ̂_{i*}(t) > √(log T / t)] ≤ 1/T²

  by Hoeffding, so Φ_{i*} dips below µ_{i*} with probability at most 1/T by a union bound over the T steps.

• Pr[Φ_i > µ_{i*}] ≤ 1/T once i has been played enough. If i has been played t_i = (4 log T)/∆_i² times, its confidence radius is √(log T / t_i) = ∆_i/2, so its index is µ̂_i + ∆_i/2 and

      Pr[Φ_i > µ_{i*}] = Pr[µ̂_i − µ_i > ∆_i/2] ≤ 1/T

  by Hoeffding.

• If neither event happens, arm i is played at most t_i times; otherwise, with probability at most 2/T, it is played at most T times. Hence E(Q_i) ≤ (4 log T)/∆_i² + (2/T) × T = (4 log T)/∆_i² + 2.

The regret is

    Σ_{i ≠ i*} ∆_i E[Q_i] ≈ Σ_{i ≠ i*} (4 log T)/∆_i.

Split the arms by the threshold √(4n log T / T):

• Arms with large regret, ∆_i > √(4n log T / T): each contributes roughly (4 log T)/∆_i ≤ √(4T log T / n), so all n of them together incur total regret at most 2√(nT log T).
• Arms with small regret, ∆_i ≤ √(4n log T / T): together they incur total regret at most T × max_i ∆_i ≤ 2√(nT log T).

So the total regret is O(√(nT log T)).

Lower bound:

Claim: Any bandit policy incurs regret Ω(√(nT)).

Proof: Take n − 1 arms with mean 1/2 and one arm with mean 1/2 + ε for ε = Θ(√(n/T)). With t samples of an arm, the variance of its sample average is about 1/t, so about 1/ε² = T/n samples are needed to decide whether that arm is the good one with constant probability. The horizon T is not enough to resolve all n arms, so with constant probability the policy fails to find the good arm and incurs regret εT = Ω(√(nT)).
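
Appendix: code sketches.

The explore-then-exploit policy above is simple enough to state in a few lines. The following Python sketch is illustrative and not from the lecture: it assumes Bernoulli arms (arm i pays 1 with probability means[i] and 0 otherwise), and the names explore_then_exploit and pull are hypothetical choices made here.

    import random

    def explore_then_exploit(means, T, seed=0):
        """Play each arm T^(2/3) times, then commit to the arm with the
        best sample average. Assumes Bernoulli arms and n << T."""
        rng = random.Random(seed)
        n = len(means)

        def pull(i):
            # one Bernoulli reward from arm i (reward in {0, 1} ⊆ [0, 1])
            return 1.0 if rng.random() < means[i] else 0.0

        m = int(T ** (2.0 / 3.0))        # exploration budget per arm
        assert n * m <= T, "needs n << T so exploration fits in the horizon"

        sums = [0.0] * n                 # total reward seen from each arm
        total = 0.0                      # total reward collected overall

        # Explore: m pulls of every arm.
        for i in range(n):
            for _ in range(m):
                r = pull(i)
                sums[i] += r
                total += r

        # Exploit: commit to the best sample average for the rest of the horizon.
        best = max(range(n), key=lambda i: sums[i] / m)
        for _ in range(T - n * m):
            total += pull(best)

        # Realized regret against always playing the best arm.
        return T * max(means) - total

The returned value is a single random realization of the regret; averaging over several seeds gives an estimate of the expected regret that the O(nT^(2/3) √(log T)) bound refers to.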
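
The index policy admits an equally short sketch under the same assumptions (Bernoulli arms; index_policy is a hypothetical name). The index computed is exactly Φ_i = µ̂_i + √(log T / t_i) from the definition above, with each arm played once first so that every index is defined.

    import math
    import random

    def index_policy(means, T, seed=0):
        """At each step play the arm with the highest index
        Phi_i = mu_hat_i + sqrt(log T / t_i)."""
        rng = random.Random(seed)
        n = len(means)

        def pull(i):
            return 1.0 if rng.random() < means[i] else 0.0

        counts = [0] * n        # t_i: number of times arm i has been played
        sums = [0.0] * n        # total reward seen from arm i
        total = 0.0

        def index(i):
            # Phi_i = sample average + confidence radius sqrt(log T / t_i)
            return sums[i] / counts[i] + math.sqrt(math.log(T) / counts[i])

        for t in range(T):
            if t < n:
                i = t           # play each arm once so every index is defined
            else:
                i = max(range(n), key=index)
            r = pull(i)
            counts[i] += 1
            sums[i] += r
            total += r

        # Realized regret against always playing the best arm.
        return T * max(means) - total

Running both sketches on the same instance (for example, several arms with mean 0.5 and one with a slightly higher mean) and averaging over seeds is a quick way to compare the two policies empirically; the exact numbers depend on the instance, but the gap suggested by the O(nT^(2/3) √(log T)) and O(√(nT log T)) bounds typically shows up for large T.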