Nonstochastic Multi-Armed Bandits With Graph-Structured Feedback

Noga Alon, TAU
Nicolo Cesa-Bianchi, Milan
Claudio Gentile, Insubria
Shie Mannor, Technion
Yishay Mansour, TAU and MSR
Ohad Shamir, Weizmann

Nonstochastic sequential decision-making
• K actions and T time steps
• $\ell_t(a)$: loss of action a at time t
• At time t
  – player picks action $X_t$
  – incurs loss $\ell_t(X_t)$
  – observes feedback on losses
    • Multi-armed bandit (MAB): only $\ell_t(X_t)$
    • Experts (full information): $\ell_t(j)$ for any j

Nonstochastic sequential decision-making
• Goal:
  – minimize losses
  – benchmark: the best single action
    • the action j that minimizes the loss
  – no stochastic assumptions on losses
• Regret (player loss minus loss of the best action):
  $R_T = E\left[\sum_{t=1}^{T} \ell_t(X_t)\right] - \min_j \sum_{t=1}^{T} \ell_t(j)$
• Known regret bounds:
  – MAB: $\sqrt{TK}$
  – Experts: $\sqrt{T \ln K}$

Motivation – observability
[Figure: two observation graphs over the actions, one undirected and one directed]

Undirected observation graph
[Figure: actions as nodes of an undirected graph; playing an action reveals its loss and the losses of its neighbors, all other losses stay hidden ("?")]
• MAB: no edges
• Experts: clique

Modeling
Directed vs. undirected
• Different types of dependencies
• Different measures
  – Independent set
  – Dominating set
  – Max acyclic subgraph
Informed vs. uninformed
• When does the learner observe the graph?
  – Before (informed)
  – After (uninformed)
    • only the neighbors

Our Results
Uninformed setting
• Only the neighbors of the node
• Undirected graph
  – Independent sets: $\tilde{O}\left(\sqrt{T\,\alpha(G)\ln K}\right)$
• Directed graph
  – Max acyclic subgraph (not tight)
  – Random Erdős-Rényi graphs
Informed setting
• Directed graphs
• Regret characterization
  – dominating sets and ind. set
• Both expectation and high prob.

EXP3-SET
• Online algorithm
  $\Pr[X_t = a] \propto \exp\left(-\eta \sum_{s=1}^{t-1} \hat{\ell}_s(a)\right)$
  where
  $\hat{\ell}_t(a) = \frac{\ell_t(a)}{\Pr[\ell_t(a) \text{ is observed}]}$ if $\ell_t(a)$ is observed, and $\hat{\ell}_t(a) = 0$ otherwise
  so that $E[\hat{\ell}_t(a)] = \ell_t(a)$
• Theorem
  $R_T \le \frac{\ln K}{\eta} + \frac{\eta}{2} \sum_{t=1}^{T} \sum_a \Pr[X_t = a \mid \ell_t(a) \text{ is observed}]$
  – the inner sum is at most $\alpha(G_t)$
  – fixed graph: regret of order $\sqrt{T\,\alpha(G)\ln K}$
  – changing graphs: regret of order $\sqrt{\ln K \sum_{t} \alpha(G_t)}$

EXP3-SET regret – key lemma
• Lemma
  $Q = \sum_a \frac{\Pr[X_t = a]}{\Pr[a \text{ observed}]} \le \alpha(G)$
• Note: MAB: Q = K; full info.: Q = 1
• Proof: build an i.s. S
  – consider action a with minimal Pr[a observed]
  – add a to S
  – delete a and its neighbors
• Note
  $\sum_{j \in N(a)} \frac{\Pr[X_t = j]}{\Pr[j \text{ observed}]} \le \sum_{j \in N(a)} \frac{\Pr[X_t = j]}{\Pr[a \text{ observed}]} \le 1$

Dominating set – directed graph
[Figure: directed observation graph with a dominating set; unobserved losses shown as "?"]

EXP3-DOM
• Simplified version
  – fixed graph G
  – D is a dominating set
    • log approx.
• Main modification
  – add probabilities to D
    • induce observability
• probabilities:
  $p_{a,t} = (1-\gamma)\frac{w_{a,t}}{W_t} + \frac{\gamma}{|D|}\, I[a \in D]$
• Select $X_t$ using $p_t$
• Observe $\ell_t(a)$ for a in $S_{X_t,t}$
• weights:
  $w_{a,t+1} = w_{a,t} \exp\left(-\gamma\, \hat{\ell}_t(a) / |D|\right)$
  $\hat{\ell}_t(a) = \frac{\ell_t(a)}{\Pr[\text{observe } a]}\, I[a \in S_{X_t,t}]$

EXP3-DOM
• Simple example
  – transitive observability (tournament)
    • action 1 observes all actions
  – D = {1}
• EXP3-DOM
  – sample action 1 with prob. γ
    • action 1 is the exploration
  – otherwise run a MAB
    • specifically EXP3-SET
• Intuition
  – action 1 replaces the mixture with uniform

Conclusion
• Observability model
  – between MAB and Experts
    • more work to be done
• Uninformed setting
  – undirected graph
• Informed setting
  – directed graph
• [Kocak, Neu, Valko and Munos] improved the uninformed setting

Outline
• Model and motivation
• Symmetric observability
• Non-symmetric observability

EXP3-DOM: key lemma
• Lemma
  – G directed graph,
  – $d_i^-$ indegree of i,
  – α = α(G)
  $\sum_{i=1}^{K} \frac{1}{1 + d_i^-} \le 2\alpha \ln\left(1 + \frac{K}{\alpha}\right)$
• Proof: high level
  – shrink graph
    • $G_K, G_{K-1}, \ldots$
  – delete nodes
    • step s: delete the max indegree node
• Turán's theorem
  – undirected graph G(V,E):
  $\alpha(G) \ge \frac{|V|}{\frac{2|E|}{|V|} + 1}$
• From Turán's theorem, in $G_s$ (nodes $V_s$, arcs $D_s$, $\alpha_s = \alpha(G_s) \le \alpha$):
  $\max_i d_i^- \ge \frac{|D_s|}{|V_s|} \ge \frac{|V_s|}{2\alpha_s} - \frac{1}{2}$

EXP3-DOM: key lemma (proof)
• Completing the proof ($d_{i,s}^-$: indegree of i in $G_s$; node K is the max indegree node of $G_K$)
  $\sum_{i=1}^{K} \frac{1}{1 + d_{i,K}^-} = \frac{1}{1 + d_{K,K}^-} + \sum_{i=1}^{K-1} \frac{1}{1 + d_{i,K}^-} \le \frac{2\alpha}{\alpha + K} + \sum_{i=1}^{K-1} \frac{1}{1 + d_{i,K-1}^-} \le \cdots \le 2\alpha \ln\left(1 + \frac{K}{\alpha}\right)$
• Note, due to edge elimination:
  $d_{i,K-1}^- \le d_{i,K}^-$

EXP3-DOM – key lemma (modified)
• Lemma (what we really need!)
• G(V,E) directed graph
  – $IN_i$: in-neighbors of i
  – r: size of a dominating set; α: size of the max ind. set
  – p distribution over V
    • $p_i \ge \beta$
  $Q = \sum_{i=1}^{K} \frac{p_i}{p_i + \sum_{j \in IN_i} p_j} \le 2\alpha \ln\left(1 + \frac{\lceil K^2/(r\beta) \rceil + K}{\alpha}\right) + 2r$

EXP3-DOM: changing graphs
• Simple
  – all dom. sets same size
  – approx. same size
• Problem
  – different size dom. sets
    • can be 1 or K
• Solution
  – keep log levels
    • depend on $\log_2 |D_t|$
  – algorithm per level
• Complications
  – parameters depend on level
  – setting the learning rate
    • needs a delicate doubling
• Main tech. challenge
  – handle dynamic adversary

EXP3-DOM
• Receive obs. graph
  – find dominating set $D_t$
    • logarithmic approximation
• Run the right copy
  – let $b_t = \lfloor \log_2 |D_t| \rfloor$
  – run copy $b_t$
    • log copies
• For copy $b_t$
  – param. depend on $b_t$
• probabilities:
  $p_{a,t} = (1-\gamma)\frac{w_{a,t}}{W_t} + \frac{\gamma}{|D_t|}\, I[a \in D_t]$
• Select $X_t$ using p
• Observe $\ell_t(a)$ for a in $S_{X_t,t}$
• weights:
  $w_{a,t+1} = w_{a,t} \exp\left(-\gamma\, \hat{\ell}_t(a) / 2^{b_t}\right)$
  $\hat{\ell}_t(a) = \frac{\ell_t(a)}{\Pr[\text{observe } a]}\, I[a \in S_{X_t,t}]$

EXP3-DOM – main theorem
• Theorem ($T^{(b)}$: rounds in which copy b is run):
  $R_T \le \sum_{b=0}^{\lfloor \log_2 K \rfloor} \left( \frac{2^b \ln K}{\gamma^{(b)}} + \gamma^{(b)}\, E\left[ \sum_{t \in T^{(b)}} \left( 1 + \frac{Q_t^{(b)}}{2^{b+1}} \right) \right] \right)$
• Tuning $\gamma^{(b)}$:
  $R_T = O\left( (\ln K)\, E\left[ \sqrt{\sum_{t=1}^{T} \left( 4|D_t| + Q_t^{(b_t)} \right)} \right] + (\ln K) \ln(KT) \right)$

Independent set
[Figure: observation graph with an independent set; unobserved losses shown as "?"]
• Independent set α(G)
• [Mannor & Shamir 2012]
• Tight regret: $\sqrt{T\,\alpha(G)\ln K}$
  – α(G) "replaces" K
• Cons:
  – requires observing G
  – solves an LP at each step
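
Sketch: EXP3-SET on a fixed undirected graph
The EXP3-SET slide and its key lemma can be illustrated with a small simulation. The block below is a minimal sketch and not the authors' implementation. It assumes a fixed undirected observation graph, losses in [0, 1], and a loss table given in advance to stand in for the adversary. The names `exp3_set`, `adj`, `eta` and `seed` are hypothetical. Besides the player's total loss it records the quantity Q from the key lemma in every round, so the claims "MAB: Q = K", "full info.: Q = 1" and "Q ≤ α(G)" can be checked numerically.

```python
import math
import random


def exp3_set(losses, adj, eta, seed=0):
    """Simulate an EXP3-SET-style learner (illustrative sketch).

    losses: list of rounds, losses[t][a] in [0, 1] (assumed known to the simulator only)
    adj:    adj[a] is the set of neighbors of action a in an undirected graph
    eta:    learning rate (assumed fixed; the slides do not give a tuning)
    Returns (total loss of the player, list of Q values, one per round).
    """
    rng = random.Random(seed)
    K = len(adj)
    # Playing action x reveals the losses of x and of all its neighbors.
    closed = [set(adj[a]) | {a} for a in range(K)]
    cum_est = [0.0] * K  # cumulative importance-weighted loss estimates
    total_loss = 0.0
    q_values = []

    for loss_t in losses:
        # Exponential weights; subtracting the minimum keeps exp() stable.
        m = min(cum_est)
        w = [math.exp(-eta * (c - m)) for c in cum_est]
        W = sum(w)
        p = [x / W for x in w]

        # Probability that the loss of a is observed: some closed neighbor is played.
        obs_prob = [sum(p[j] for j in closed[a]) for a in range(K)]

        # Q from the key lemma: sum_a Pr[X_t = a] / Pr[a observed].
        q_values.append(sum(p[a] / obs_prob[a] for a in range(K) if p[a] > 0))

        x_t = rng.choices(range(K), weights=p)[0]
        total_loss += loss_t[x_t]

        # Unbiased estimates: loss / Pr[observed] for observed actions, 0 otherwise.
        for a in closed[x_t]:
            cum_est[a] += loss_t[a] / obs_prob[a]

    return total_loss, q_values
```

Calling `exp3_set` with `adj = [set() for _ in range(K)]` corresponds to the MAB case (no edges), and with every action adjacent to every other action it corresponds to the experts case (clique).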