Bandit optimization with large strategy sets
Alexandre Proutiere
Joint work with Richard Combes
Docent lecture, October 11, 2013

Outline
1. Motivation
2. Bandit optimization: background
3. Graphically unimodal bandits
4. Applications

1. Motivation

Rate adaptation in 802.11
Adapting the modulation/coding scheme to the radio environment
- 802.11 a/b/g
[Figure: 802.11 a/b/g link; label "Yes/No"]
  Rates (Mbit/s): $r_1, r_2, \dots, r_N$ = 6, 9, 12, 18, 24, 36, 48, 54
  Success probabilities: $\theta_1, \theta_2, \dots, \theta_N$
  Throughputs: $\mu_1, \mu_2, \dots, \mu_N$, with $\mu_i = r_i \theta_i$
- Optimal sequential rate selection?

Rate adaptation in 802.11
- 802.11 n/ac: MIMO. Rate + MIMO mode
- Example: two modes, single-stream (SS) or double-stream (DS)
  SS: 13.5, 27, 40.5, 54, 81, 108, 121.5, 135
  DS: 27, 54, 81, 108, 162, 216, 243, 270

CTR estimation in Ad Auctions
- Current practice:
  - Bayesian estimator (graphical models + EP)
  - Minimize the mean square error on CTRs
  - Underlying assumption: a static system
- A dynamic system (changing ads, changing CTRs, ...)
- ... but most importantly our goal is to maximize profit, not minimize CTR estimation error!

CTR bandits
- CTR matrix: queries ($n > 10^9$) by ads ($m > 10^6$), with entries $\mu_{ij}$ and $X_{ij} \sim \mathrm{Ber}(\mu_{ij})$
- Profit after $T$ queries: $\mathrm{profit}(T) = \sum_{t=1}^{T} b_{i(t)j(t)} X_{i(t)j(t)}$

Online decision problem
[Figure: matrix of queries ($> 10^9$) by ads ($> 10^6$) with observed outcomes (1 = click, 0 = no click), filled in over three successive builds]

2. Bandit optimization

Bandit optimization
A sequential decision problem (Thompson 1933)
  Patients:  1  2  3  4  5  ...
             D  D  D  L  D  ...
             D  L  L  L  D  ...
- A set of possible actions at each step
- Unknown sequence of rewards for each action
- Bandit feedback: only rewards of chosen actions are observed
- Goal: maximize the cumulative reward (up to step $T$), i.e., strike the optimal exploration-exploitation trade-off

Regret
[Figure: instantaneous reward versus time for the unknown best action and for your algorithm; the regret is marked between the two curves]
Objective: to identify the best action with minimum exploration, i.e., to minimize regret (to maximize the "convergence rate")
Particularly relevant when the best action evolves: for tracking problems

Stochastic Bandits
Robbins 1952
- $K$ arms / decisions / actions
- Unknown i.i.d. rewards: $X_{i,t} \sim \mathrm{Ber}(\mu_i)$, $\mu^\star = \max_i \mu_i = \mu_{i^\star}$
- Lack of structure: $\mu_i \in [0,1]$, $\forall i \in \{1,\dots,K\}$
- Under online algorithm $\pi$, the arm selected at time $t$, $I_t^\pi$, is a function of the history $(I_1^\pi, X_{I_1^\pi,1}, \dots, I_{t-1}^\pi, X_{I_{t-1}^\pi,t-1})$
- Regret up to time $T$: $R^\pi(T) = \max_{i=1,\dots,K} \mathbb{E}\sum_{t=1}^{T} X_{i,t} - \mathbb{E}\sum_{t=1}^{T} X_{I_t^\pi,t}$

Stochastic Bandits
- Asymptotic regret lower bound (no algorithm can beat this performance)
- Uniformly good algorithm: $\mathbb{E}[t_i(T)] = o(T^\alpha)$, $\forall \alpha > 0$, $\forall \mu$, $\forall i \neq i^\star$
Theorem (Lai-Robbins 1985). For any uniformly good policy $\pi$,
$$\liminf_{T\to\infty} \frac{R^\pi(T)}{\log(T)} \ge \sum_{i \neq i^\star} \frac{\mu^\star - \mu_i}{\mathrm{KL}(\mu_i, \mu^\star)}$$
KL divergence number: $\mathrm{KL}(p,q) = p\log\left(\frac{p}{q}\right) + (1-p)\log\left(\frac{1-p}{1-q}\right)$
Regret linear in the number of arms, and proportional to $(\mu^\star - \mu_i)^{-1}$

The change-of-measure argument
[Figure: average rewards $\mu_i$ versus arms $i$ (1 to 5), marking a sub-optimal arm $k$, the best arm $i^\star$, and the most confusing parameter $\mu'$]
To identify the minimum number of times sub-optimal arm $k$ must be played, find the most confusing parameters
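As a rough numerical illustration of the Lai-Robbins bound above, the sketch below evaluates the Bernoulli KL divergence and the constant $\sum_{i \neq i^\star} (\mu^\star - \mu_i)/\mathrm{KL}(\mu_i, \mu^\star)$. The function names and the reward vector `mu` are hypothetical and chosen only for the example; the clamping constant `eps` is a numerical convenience, not part of the definition.

```python
import math


def kl_bernoulli(p: float, q: float) -> float:
    """KL(p, q) = p log(p/q) + (1-p) log((1-p)/(1-q)) for Bernoulli laws."""
    eps = 1e-12  # assumption: clamp away from 0 and 1 to avoid log(0)
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def lai_robbins_constant(mu):
    """Sum over sub-optimal arms of (mu* - mu_i) / KL(mu_i, mu*)."""
    mu_star = max(mu)
    return sum((mu_star - m) / kl_bernoulli(m, mu_star) for m in mu if m < mu_star)


# Hypothetical average rewards, for illustration only.
mu = [0.9, 0.8, 0.5]
c = lai_robbins_constant(mu)

# Each extra sub-optimal arm adds one positive term to the sum.
c_more_arms = lai_robbins_constant(mu + [0.7])
print(f"constant with 3 arms: {c:.3f}, with 4 arms: {c_more_arms:.3f}")
```

Any uniformly good policy then suffers regret of at least about `c * log(T)` for large `T`, and the constant grows with every sub-optimal arm added, which is the "linear in the number of arms" remark.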
Sub-optimal arm $k$ must be played about $\frac{\log(T)}{\mathrm{KL}(\mu_k, \mu^\star)}$ times, yielding a regret of $\frac{(\mu^\star - \mu_k)\log(T)}{\mathrm{KL}(\mu_k, \mu^\star)}$

Algorithms
- Optimal but complicated policy (Lai-Robbins 1985)
- Simpler and optimal algorithms (Agrawal 1995)
- ε-greedy algorithm with ε = 1/t: logarithmic regret
- UCB algorithm (Auer et al. 2002): a simple suboptimal index policy, with index
  $$\frac{1}{n_i(t)}\sum_{s=1}^{n_i(t)} X_{i,s} + \sqrt{\frac{2\log(t)}{n_i(t)}}$$
- KL-UCB (Garivier-Cappe 2011): claiming back optimality. Index-based policy, index of arm $i$:
  $$\max\{q \le 1 : n_i(t)\,\mathrm{KL}(\hat\mu_i(t), q) \le \log(t) + 3\log\log(t)\}$$

Bandit classification
- Unstructured bandits: average rewards are not related
  $\mu = (\mu_1, \dots, \mu_K) \in \Theta$, $\Theta = \prod_{i=1}^{K} [a_i, b_i]$
  [Figure: rectangular set $\Theta$ in the $(\mu_1, \mu_2)$ plane]
- Structured bandits: the decision maker knows that average rewards are related
  $\Theta \neq \prod_{i=1}^{K} [a_i, b_i]$
  [Figure: non-rectangular set $\Theta$ in the $(\mu_1, \mu_2)$ plane]
- The rewards observed for a given action provide side-information about the average rewards of other actions
- How can we exploit this side-information optimally?

Bandit classification
[Figure: axis $K/T$ (decision set / time horizon), from 0 to ∞]
- Towards 0: Lai-Robbins 1985
- Bandits with structure: combinatorial bandits; continuous bandits (linear bandits, convex bandits, unimodal bandits): Agrawal 1995, Kleinberg-Slivkins 2005, Bubeck et al. 2011, ...
- Optimal control of Markov chains: Lai, Agrawal-Teneketzis-Anantharam 1989, Kumar 1985, Borkar-Varaiya 1979
- Towards ∞: infinite bandits: Berry et al. 1997, Wang et al. 2008, Bonald-P. 2013
A few papers since 2005. See ICML tutorial 2011: Audibert-Munos

3. Graphically unimodal bandits

Actions and rewards
[Figure: graph $G = (V, E)$ whose vertices are the actions; best vertex $k^\star$; an edge from $i$ to $j$ with $\mu_j > \mu_i$]
$\mu = (\mu_i)_{i \in V} \in U_G$
Graphical unimodality: from any vertex, there is a path with increasing rewards to the best vertex.
Related work: Yu-Mannor 2011 (sub-optimal algorithms)

Average rewards
- Linear structure: at time $t$, action $i$ yields a reward $r_i X_{i,t}$, with $X_{i,t} \sim \mathrm{Ber}(\theta_i)$
  Known: $r_1, r_2, \dots, r_N$
  Unknown: $\theta_1, \theta_2, \dots, \theta_N \in [0,1]$
  Average rewards: $\mu_1, \mu_2, \dots, \mu_N$, with $\mu_i = r_i \theta_i$
- Optimal action: $k^\star$, $\mu^\star = \mu_{k^\star}$
- Graphical unimodality w.r.t. some known graph $G$

Graphically unimodal bandits
[Figure: graph $G = (V, E)$ with best vertex $k^\star$ and an edge from $i$ to $j$ with $\mu_j > \mu_i$]
How can the reward structure be optimally exploited?
How does regret scale with the graph size and topology?

Controlled Markov Chains
Kumar, Lai, Borkar, Varaiya, ...
[Figure: transition from state $x$ to state $y$ with probability $p(x, y; u, \theta)$; reward $r(x, u)$]
- Finite state and action spaces
- Unknown parameter $\theta \in \Theta$; $\Theta$: compact metric space
- Control: finite set of irreducible control laws $g : X \to U$, with $\mu_g(\theta) = \sum_{x \in X} \pi_\theta^g(x)\, r(x, g(x))$
- Optimal control law: $g^\star$
- Regret: $R^\pi(T) = T\mu_{g^\star}(\theta) - \mathbb{E}\sum_{t=1}^{T} r(X_t, g^\pi(X_t))$

Regret lower bound
- KL number under policy $g$:
  $$I^g(\theta, \lambda) = \sum_{x,y} \pi_\theta^g(x)\, p(x, y; g(x), \theta) \log\frac{p(x, y; g(x), \theta)}{p(x, y; g(x), \lambda)}$$
- Bad parameter set: $B(\theta) = \{\lambda \in \Theta : g^\star \text{ not opt.},\ I^{g^\star}(\theta, \lambda) = 0\}$
- Lower bound (Graves-Lai '97): $\liminf_{T\to\infty} \frac{R^\pi(T)}{\log(T)} \ge c(\theta)$, where
  $$c(\theta) = \inf \sum_{g \neq g^\star} c_g\,(\mu_{g^\star}(\theta) - \mu_g(\theta)) \quad \text{s.t.} \quad \inf_{\lambda \in B(\theta)} \sum_{g \neq g^\star} c_g\, I^g(\theta, \lambda) \ge 1$$

Application to graphically unimodal bandits
- State space: set of possible rewards
- Control laws: constant mappings to the set of actions, e.g. $g = k$
- Transitions (i.i.d. process): $p(x, y; k, \theta) = \theta_k$ if $y = r_k$, and $1 - \theta_k$ if $y = 0$
- Average rewards: for $g = k$, $\mu_g(\theta) = r_k\theta_k = \mu_k$

Fundamental performance limit
[Figure: vertex $k$ and its neighbourhood $N(k)$]
$\underline{N}(k) = \{l \in N(k) : r_k\theta_k \le r_l\}$
Theorem: For any uniformly good algorithm $\pi$,
$$\liminf_{T\to\infty} \frac{R^\pi(T)}{\log(T)} \ge c_G(\theta), \qquad c_G(\theta) = \sum_{k \in \underline{N}(k^\star)} \frac{r_{k^\star}\theta_{k^\star} - r_k\theta_k}{\mathrm{KL}\left(\theta_k, \frac{r_{k^\star}\theta_{k^\star}}{r_k}\right)}$$
The performance limit does not depend on the size of the decision space! Structure could really help.

Proof
$$\inf \sum_{g \neq g^\star} c_g\,(\mu_{g^\star}(\theta) - \mu_g(\theta)) \quad \text{s.t.} \quad \inf_{\lambda \in B(\theta)} \sum_{g \neq g^\star} c_g\, I^g(\theta, \lambda) \ge 1$$
Example: classical unimodality
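For classical unimodality (the graph is a line, so each action has at most two neighbours), the constant $c_G(\theta)$ from the theorem can be evaluated numerically. The sketch below is only an illustration under stated assumptions: the success probabilities `theta` are hypothetical, the helper names are invented for the example, and the rates are the 802.11 a/b/g values from the motivation slide.

```python
import math


def kl_bernoulli(p, q):
    """Bernoulli KL divergence, clamped away from 0 and 1 (numerical convenience)."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def line_graph(n):
    """Neighbour lists of a line graph on vertices 0..n-1 (classical unimodality)."""
    return {k: [j for j in (k - 1, k + 1) if 0 <= j < n] for k in range(n)}


def unimodal_constant(rates, theta, neighbours):
    """c_G(theta): sum over neighbours k of k* of (mu* - mu_k) / KL(theta_k, mu*/r_k)."""
    mu = [r * t for r, t in zip(rates, theta)]
    k_star = max(range(len(mu)), key=mu.__getitem__)
    mu_star = mu[k_star]
    total = 0.0
    for k in neighbours[k_star]:
        # A neighbour with r_k <= mu* can never beat k*; at equality the KL
        # term is infinite, so its contribution is zero either way.
        if rates[k] <= mu_star:
            continue
        total += (mu_star - mu[k]) / kl_bernoulli(theta[k], mu_star / rates[k])
    return total


rates = [6, 9, 12, 18, 24, 36, 48, 54]                   # Mbit/s, from the slides
theta = [0.99, 0.98, 0.96, 0.93, 0.90, 0.70, 0.30, 0.05]  # hypothetical values
graph = line_graph(len(rates))
c = unimodal_constant(rates, theta, graph)

# Changing an action far from k* leaves the constant untouched.
theta_far = [0.50] + theta[1:]
c_far = unimodal_constant(rates, theta_far, graph)
print(f"c_G = {c:.1f}, after changing a distant action: {c_far:.1f}")
```

With these numbers the best rate is 36 Mbit/s, and only its neighbour at 48 Mbit/s contributes: the 24 Mbit/s rate is excluded because even a success probability of 1 could not beat the optimum. This is one way to see why the limit does not grow with the number of actions.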
[Figure: the most confusing $\mu$, plotted over actions 1 to 8]

Optimal Action Sampling
- Empirical average reward: $\hat\mu_k(n) = \frac{1}{t_k(n)}\sum_{s=1}^{t_k(n)} r_k X_k(s)$
- Leader at time $n$: $L(n) \in \arg\max_k \hat\mu_k(n)$
- Number of times $k$ has been the leader: $l_k(n) = \sum_{s=1}^{n} \mathbf{1}_{L(s)=k}$
- Index of $k$:
  $$b_k(n) = \max\left\{q \in [0, r_k] : t_k(n)\,\mathrm{KL}\left(\frac{\hat\mu_k(n)}{r_k}, \frac{q}{r_k}\right) \le \log(l_{L(n)}(n)) + c\log\log(l_{L(n)}(n))\right\}$$

Optimal Action Sampling
Algorithm: Optimal Action Sampling (OAS)
For $n = 1, \dots, K$, select action $k(n) = n$
For $n \ge K + 1$, select action $k(n)$:
$$k(n) = \begin{cases} L(n) & \text{if } (l_{L(n)}(n) - 1)/(\gamma + 1) \in \mathbb{N}, \\ \arg\max_{k \in N(L(n))} b_k(n) & \text{otherwise.} \end{cases}$$
Theorem: For any $\mu \in U_G$,
$$\limsup_{T\to\infty} \frac{R^{OAS}(T)}{\log(T)} \le c_G(\theta).$$

Proof
$$R^{OAS}(T) \le r_{k^\star} \sum_{k \neq k^\star} \mathbb{E}[l_k(T)] + \sum_{k \in N(k^\star)} (r_{k^\star}\theta_{k^\star} - r_k\theta_k)\, \mathbb{E}\left[\sum_{t=1}^{T} \mathbf{1}_{L(t)=k^\star,\, k(t)=k}\right]$$
First term: $O(\log\log(T))$
Second term: $\le (1 + \epsilon)\, c(\theta) \log(T) + O(\log\log(T))$
Deviation bounds (refined concentration inequalities) + further decomposition of the set of events
Finite time analysis

Non-stationary environments
- Average rewards may evolve over time: $\theta(t)$
- Best decision at time $t$: $k^\star(t)$
- Goal: track the best decision
- Regret: $R^\pi(T) = \sum_{t=1}^{T} \left(r_{k^\star(t)}\theta_{k^\star(t)}(t) - r_{k^\pi(t)}\theta_{k^\pi(t)}(t)\right)$
- Sub-linear regret cannot be achieved (Garivier-Moulines 2011)
- Assumptions: $\theta(t)$ σ-Lipschitz, and a condition on $\limsup_{T\to\infty} \frac{1}{T}\sum_{n=1}^{T}\sum_{k,k' \in N(k)} \mathbf{1}_{|r_k\theta_k(n) - r_{k'} \dots}$ […]

4. Applications

Rate adaptation in 802.11
Adapting the modulation/coding scheme to the radio environment
- 802.11 a/b/g
[Figure: 802.11 a/b/g link; label "Yes/No"]
  Rates (Mbit/s): $r_1, r_2, \dots, r_N$ = 6, 9, 12, 18, 24, 36, 48, 54
  Success probabilities: $\theta_1, \theta_2, \dots, \theta_N$
  Throughputs: $\mu_1, \mu_2, \dots, \mu_N$, with $\mu_i = r_i \theta_i$
- Structure: unimodality + $\theta_1 > \theta_2 > \dots > \theta_N$
[Figure: graph $G$ over the rates 6, 9, 12, 18, 24, 36, 48, 54 (Mbit/s)]

Rate adaptation in 802.11
- 802.11 n/ac: MIMO. Rate + MIMO mode (32 combinations in n)
- Example: two modes, single-stream (SS) or double-stream (DS)
[Figure: graph $G$ over the SS and DS rates]
  DS: 27, 54, 81, 108, 162, 216, 243, 270
  SS: 13.5, 27, 40.5, 54, 81, 108, 121.5, 135

State-of-the-art
- ARF (Auto Rate Fallback): after n successive successes, probe a higher rate; after two consecutive failures reduce the rate
- AARF: vary n dynamically depending on the speed at which the radio environment evolves
- SampleRate: based on achieved throughputs over a sliding window, explore a new rate every 10 packets
- Measurement based approaches: map SNR to packet error rate (does not work: OFDM): RBAR, OAR, CHARM, ...
- 802.11n MIMO: MiRA, RAMAS, ...
All existing algorithms are heuristics.
Rate adaptation design: a graphically unimodal bandit with large strategy set

Optimal Rate Sampling
Algorithm: Optimal Rate Sampling (ORS)
For $n = 1, \dots, K$, select action $k(n) = n$
For $n \ge K + 1$, select action $k(n)$:
$$k(n) = \begin{cases} L(n) & \text{if } (l_{L(n)}(n) - 1)/(\gamma + 1) \in \mathbb{N}, \\ \arg\max_{k \in N(L(n))} b_k(n) & \text{otherwise.} \end{cases}$$
ORS is asymptotically optimal (minimizes regret)
Its performance does not depend on the number of possible rates!
For non-stationary environments: SW-ORS (ORS with sliding window)

802.11g, stationary environment
[Figure: GRADUAL (success prob. smoothly decreases with rate). Regret versus time (s), 0 to 100 s, for SampleRate, SW-G-ORS and G-ORS]

802.11g, stationary environment
[Figure: STEEP (success prob. is either close to 1 or to 0). Regret versus time (s), 0 to 100 s, for SampleRate, SW-G-ORS and G-ORS]

802.11g, non-stationary environment
[Figure: TRACES. Instantaneous throughput (Mbps) versus time (s), 0 to 300 s, for the rates 54, 48, 36, 24, 18, 12, 9 and 6 Mbps]

802.11g, non-stationary environment
[Figure: RESULTS. Instantaneous throughput (Mbps) versus time (s), 0 to 300 s, for Oracle, SW-G-ORS and SampleRate]

Summary
- Regret minimization: the right objective for tracking optimal operating points in changing environments
- A solution to graphically unimodal bandit problems, an important class of structured bandits
  - Lower bound on regret
  - Asymptotically optimal and simple algorithms
  - Non-stationary analysis
- Application to rate adaptation in 802.11 systems
  - An optimal algorithm: ORS
  - The regret under ORS does not depend on the number of available decisions (crucial!)
  - Numerical validation using simulations and test-beds

Current work
- Other applications of structured bandits
  - Wireless systems: large MIMO, scheduling, spectrum sharing
  - Online clustering
  - Recommendation systems
  - Pricing
  - ...

Future work
[Figure: axis decision set / time horizon, from 0 to ∞. Towards 0: structured bandits solved (stoch. control). Towards ∞: infinite bandits solved (Bonald-P.). In between: ?]
- What can we do when the size of the decision space and the time horizon are comparable?

http:// richardcombesresearch/
http://
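The OAS/ORS selection rule from the algorithm slides can be sketched in a short simulation. This is a minimal illustration under several assumptions that are not stated on the slides: $\gamma$ is taken to be the maximum vertex degree of the graph, the leader itself is included among the candidates of the arg max, the leader counts start after the initial round-robin phase, the exploration budget is floored at zero, and the index is found by bisection. The success probabilities `theta`, the constant `c=3.0`, the horizon and the seed are hypothetical.

```python
import math
import random


def kl_bernoulli(p, q):
    """Bernoulli KL divergence, clamped away from 0 and 1 (numerical convenience)."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


def kl_index(mu_hat, rate, pulls, budget):
    """Largest q in [0, rate] with pulls * KL(mu_hat/rate, q/rate) <= budget (bisection)."""
    p = mu_hat / rate
    lo, hi = p, 1.0
    for _ in range(30):
        mid = (lo + hi) / 2.0
        if pulls * kl_bernoulli(p, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return rate * lo


def run_oas(rates, theta, neighbours, horizon, c=3.0, seed=0):
    """Simulate the selection rule; returns (pull counts, cumulative expected regret)."""
    rng = random.Random(seed)
    K = len(rates)
    gamma = max(len(v) for v in neighbours.values())  # assumption: max degree of G
    mu = [r * t for r, t in zip(rates, theta)]
    mu_star = max(mu)
    pulls, sums, leader_count = [0] * K, [0.0] * K, [0] * K
    regret = 0.0
    for n in range(horizon):
        if n < K:
            k = n  # initial phase: play every action once
        else:
            mu_hat = [sums[j] / pulls[j] for j in range(K)]
            leader = max(range(K), key=mu_hat.__getitem__)
            leader_count[leader] += 1
            l = leader_count[leader]
            if (l - 1) % (gamma + 1) == 0:
                k = leader
            else:
                # l >= 2 here, so log(log(l)) is defined; floor the budget at 0
                budget = max(0.0, math.log(l) + c * math.log(math.log(l)))
                candidates = [leader] + neighbours[leader]  # assumption: leader included
                k = max(candidates,
                        key=lambda j: kl_index(mu_hat[j], rates[j], pulls[j], budget))
        success = 1.0 if rng.random() < theta[k] else 0.0
        pulls[k] += 1
        sums[k] += rates[k] * success
        regret += mu_star - mu[k]
    return pulls, regret


rates = [6, 9, 12, 18, 24, 36, 48, 54]                   # Mbit/s, from the slides
theta = [0.99, 0.98, 0.96, 0.93, 0.90, 0.70, 0.30, 0.05]  # hypothetical values
line = {k: [j for j in (k - 1, k + 1) if 0 <= j < len(rates)] for k in range(len(rates))}
pulls, regret = run_oas(rates, theta, line, horizon=10000)
print("pulls per rate:", pulls, "regret:", round(regret, 1))
```

Because each decision only compares the leader with its neighbours, the per-step work depends on the degree of the graph rather than on the total number of rates, which mirrors the claim that performance does not depend on the size of the decision set.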