Convergent Learning in Unknown Graphical Games

Dr Archie Chapman, Dr David Leslie, Dr Alex Rogers and Prof Nick Jennings
School of Mathematics, University of Bristol, and School of Electronics and Computer Science, University of Southampton
david.leslie@bristol.ac.uk

Playing games?

The zero-sum rock-scissors-paper game:

              Rock      Scissors   Paper
  Rock        (0,0)     (1,-1)     (-1,1)
  Scissors    (-1,1)    (0,0)      (1,-1)
  Paper       (1,-1)    (-1,1)     (0,0)

Playing games?

[Image: dense deployment of sensors to detect pedestrian and vehicle activity within an urban environment. Credit: Berkeley Engineering]

Learning in games
• Adapt to observations of past play
• Hope to converge to something "good"
• Why?!
  – Bounded rationality justification of equilibrium
  – Robustness to the behaviour of "opponents"
  – A language to describe distributed optimisation

Notation
• Players $i \in \{1, \dots, N\}$
• Discrete action sets $A_i$
• Joint action set $A = A_1 \times \cdots \times A_N$
• Reward functions $r_i : A \to \mathbb{R}$
• Mixed strategies $\pi_i \in \Delta_i = \Delta(A_i)$
• Joint mixed strategy space $\Delta = \Delta_1 \times \cdots \times \Delta_N$
• Reward functions extend to $r_i : \Delta \to \mathbb{R}$

Best response / Equilibrium
• The mixed strategies of all players other than $i$ are written $\pi_{-i}$
• The best response of player $i$ is $b_i(\pi_{-i}) = \operatorname{argmax}_{\pi_i \in \Delta_i} r_i(\pi_i, \pi_{-i})$
• An equilibrium is a $\pi$ satisfying, for all $i$, $\pi_i \in b_i(\pi_{-i})$

Fictitious play

Using the game matrix: estimate the strategies of the other players → select the best action given those estimates → update the estimates, and repeat.

Belief updates
• The belief about the strategy of player $i$ is the maximum-likelihood estimate
  $\sigma_i^t(a_i) = C_i^t(a_i)/t$,
  where $C_i^t(a_i)$ counts the plays of action $a_i$ up to time $t$
• Online updating (each played action is a best response to the current beliefs):
  $\sigma^t \in \left(1 - \tfrac{1}{t}\right)\sigma^{t-1} + \tfrac{1}{t}\, b(\sigma^{t-1})$

Stochastic approximation
• Processes of the form
  $X_{t+1} \in X_t + \alpha_{t+1}\left[F(X_t) + M_{t+1} + e_{t+1}\right]$,
  where $E(M_{t+1} \mid X_t) = 0$ and $e_t \to 0$
• $F$ is set-valued (convex-valued and upper semi-continuous)
• Limit points are chain-recurrent sets of the differential inclusion $\dot{X} \in F(X)$

Best-response dynamics
• Fictitious play has $M$ and $e$ identically 0, and $\alpha_t = 1/t$
• Its limit points are the limit points of the best-response differential inclusion $\dot{\sigma} \in b(\sigma) - \sigma$
• In potential games (and zero-sum games and some others) the limit points must be Nash equilibria

Generalised weakened fictitious play
• Consider any process such that
  $\sigma^t \in (1 - \alpha_t)\sigma^{t-1} + \alpha_t\left[b^{\epsilon_t}(\sigma^{t-1}) + M_t\right]$,
  where $\alpha_t \to 0$, $\epsilon_t \to 0$, $\sum_t \alpha_t = \infty$, and a condition on the interplay between $\alpha_t$ and $M_t$ also holds
• The convergence properties do not change

Fictitious play (recap)

Game matrix → estimate strategies of other players → select best action given estimates → update estimates.

Learning the game

But what if the game matrix is unknown?

              Rock      Scissors   Paper
  Rock        (?,?)     (?,?)      (?,?)
  Scissors    (?,?)     (?,?)      (?,?)
  Paper       (?,?)     (?,?)      (?,?)

Observed rewards are noisy: $R_i^t = r_i(a^t) + e_i^t$

Reinforcement learning
• Track the average reward for each joint action
• Play each joint action frequently enough
• The estimates will then be close to the expected values
• The estimated game converges to the true game

Q-learned fictitious play

As fictitious play, but the known game matrix is replaced by an estimated game matrix: estimate the strategies of the other players → select the best action given both sets of estimates → update the strategy and reward estimates, and repeat.

Theoretical result

Theorem – If all joint actions are played infinitely often then beliefs follow a GWFP.

Proof sketch: the estimated game converges to the true game, so the selected strategies are $\epsilon_t$-best responses with $\epsilon_t \to 0$.
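To make the loop concrete, here is a minimal Python sketch of Q-learned fictitious play on rock-scissors-paper with noisy reward observations. It is an illustration under stated assumptions, not the authors' implementation: the Gaussian noise scale and the small constant exploration rate (standing in for the requirement that every joint action is played infinitely often, with $\epsilon_t \to 0$) are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# True zero-sum rock-scissors-paper payoffs for player 0 (player 1 receives -R).
R = np.array([[0.0, 1.0, -1.0],
              [-1.0, 0.0, 1.0],
              [1.0, -1.0, 0.0]])
n_actions = 3

beliefs = [np.full(n_actions, 1.0 / n_actions) for _ in range(2)]  # belief about the other player
Q = [np.zeros((n_actions, n_actions)) for _ in range(2)]           # estimated game matrices
counts = [np.zeros((n_actions, n_actions)) for _ in range(2)]

def true_reward(player, a0, a1):
    return R[a0, a1] if player == 0 else -R[a0, a1]

for t in range(1, 20001):
    eps = 0.1  # exploration keeps every joint action played (constant here for simplicity)
    actions = []
    for i in range(2):
        if rng.random() < eps:
            actions.append(int(rng.integers(n_actions)))
        else:
            # Best response to current beliefs under the *estimated* game matrix.
            expected = Q[0] @ beliefs[0] if i == 0 else Q[1].T @ beliefs[1]
            actions.append(int(np.argmax(expected)))
    a0, a1 = actions
    for i in range(2):
        # Noisy reward observation: R_i^t = r_i(a^t) + e_i^t.
        obs = true_reward(i, a0, a1) + rng.normal(scale=0.5)
        counts[i][a0, a1] += 1
        # Running average of observed rewards for this joint action.
        Q[i][a0, a1] += (obs - Q[i][a0, a1]) / counts[i][a0, a1]
        # Fictitious-play belief update: empirical frequency of the opponent's play.
        opp = a1 if i == 0 else a0
        onehot = np.zeros(n_actions)
        onehot[opp] = 1.0
        beliefs[i] += (onehot - beliefs[i]) / t

print("Player 0 beliefs about player 1:", np.round(beliefs[0], 3))
```

Run long enough, each player's estimated game matrix should approach the true rock-scissors-paper payoffs, and the beliefs should drift towards the uniform equilibrium (1/3, 1/3, 1/3), as the zero-sum convergence result above predicts.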
Playing games?

[Image: dense deployment of sensors to detect pedestrian and vehicle activity within an urban environment. Credit: Berkeley Engineering]

It's impossible!
• $N$ players, each with $A$ actions
• The game matrix has $A^N$ entries to learn
• Each individual must estimate the strategy of every other individual
• It's just not possible for realistic game scenarios

Marginal contributions
• The marginal contribution of player $i$ is the total system reward minus the system reward if $i$ were absent
• Maximised marginal contributions imply that the system is at a (local) optimum
• The marginal contribution might depend only on the actions of a small number of neighbours

Sensors – rewards
• Write $n_j(a)$ for the number of sensors observing event $j$ under joint action $a$, and $P(n)$ for the probability that an event covered by $n$ sensors is observed. The global reward for action $a$ is
  $U_g(a) = E\left[\sum_{\text{events } j} \mathbb{1}\{j \text{ is observed}\}\right] = \sum_{\text{events } j} P\big(n_j(a)\big)$
• The marginal reward for $i$ is
  $r_i(a) = U_g(a) - U_g(a_{-i}) = \sum_{\text{events } j \text{ observable by } i} \left[P\big(n_j(a)\big) - P\big(n_j(a) - 1\big)\right]$,
  since the terms for events $i$ cannot observe cancel
• Actually use the realised sum over the events $i$ observed at time $t$:
  $R_i^t = \sum_{\text{events } j \text{ observed by } i} \left[P\big(n_j(a^t)\big) - P\big(n_j(a^t) - 1\big)\right]$,
  a noisy estimate of $r_i(a^t)$

Local learning

The same loop as Q-learned fictitious play: estimated game matrix → estimate strategies of other players → select best action given estimates → update estimates.

Local learning

Now exploiting the graphical structure: estimated game matrix for local interactions → estimate strategies of neighbours only → select best action given estimates → update estimates.

Theoretical result

Theorem – If all joint actions of the local games are played infinitely often then beliefs follow a GWFP.

Proof sketch: the estimated games converge to the true games, so the selected strategies are $\epsilon_t$-best responses with $\epsilon_t \to 0$.

Sensing results

So what?!
• Play converges to a (local) optimum using only noisy information and local communication
• Each individual always chooses an action to maximise its expected reward given its information
• If an individual doesn't "play cricket", the other individuals will still reach a point that is optimal conditional on the behaviour of the itinerant

Summary
• Learning the game while playing is essential
• This can be accommodated within the GWFP framework
• Exploiting the neighbourhood structure of marginal contributions is essential for feasibility
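As a closing illustration, here is a minimal Python sketch of the marginal-contribution reward from the sensor slides. The detection model $P(n) = 1 - 2^{-n}$ (each covering sensor independently misses an event with probability 1/2) is an assumed stand-in, since the talk's exact observation probabilities are not recoverable from this text; the point being illustrated is that $r_i(a) = U_g(a) - U_g(a_{-i})$ can be computed from sensor $i$'s neighbourhood alone.

```python
import numpy as np

rng = np.random.default_rng(1)

N_SENSORS, N_EVENTS = 20, 50
# coverage[i, j] = True if sensor i's range includes event j (sparse: few neighbours each).
coverage = rng.random((N_SENSORS, N_EVENTS)) < 0.1

def detect_prob(n):
    # Assumed detection model P(n) = 1 - 2^{-n}: each of the n active covering
    # sensors misses independently with probability 1/2 (a stand-in, not the
    # talk's exact model).
    return 1.0 - 0.5 ** n

def global_utility(active):
    # U_g(a): sum over events of the probability that the event is observed,
    # where n_j(a) counts the active sensors covering event j.
    n = (coverage & active[:, None]).sum(axis=0)
    return detect_prob(n).sum()

def marginal_contribution(i, active):
    # r_i(a) = U_g(a) - U_g(a_{-i}); all terms for events outside sensor i's
    # range cancel, so only local information is needed.
    n = (coverage & active[:, None]).sum(axis=0)
    local = coverage[i] & active[i]
    return (detect_prob(n[local]) - detect_prob(n[local] - 1)).sum()

active = rng.random(N_SENSORS) < 0.5  # a joint action: which sensors are sensing
without_0 = active.copy()
without_0[0] = False
print(marginal_contribution(0, active),
      global_utility(active) - global_utility(without_0))
```

The final line checks the key identity: the locally computed marginal contribution equals the difference of global utilities, even though it touches only the events within sensor 0's range.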