Convergent Learning in Unknown Graphical Games
Dr Archie Chapman, Dr David Leslie,
Dr Alex Rogers and Prof Nick Jennings
School of Mathematics, University of Bristol and
School of Electronics and Computer Science
University of Southampton
david.leslie@bristol.ac.uk
Playing games?
            Rock      Scissors   Paper
Rock       (0, 0)    (1, -1)    (-1, 1)
Scissors   (-1, 1)   (0, 0)     (1, -1)
Paper      (1, -1)   (-1, 1)    (0, 0)
Playing games?
Dense deployment of sensors to detect pedestrian and vehicle activity within an urban environment. (Image: Berkeley Engineering)
Learning in games
• Adapt to observations of past play
• Hope to converge to something “good”
• Why?!
– Bounded rationality justification of equilibrium
– Robust to behaviour of “opponents”
– Language to describe distributed optimisation
Notation
• Players $i \in \{1, \dots, N\}$
• Discrete action sets $A_i$
• Joint action set $A = A_1 \times \dots \times A_N$
• Reward functions $r_i : A \to \mathbb{R}$
• Mixed strategies $\sigma_i \in \Delta_i = \Delta(A_i)$
• Joint mixed strategy space $\Delta = \Delta_1 \times \dots \times \Delta_N$
• Reward functions extend to $r_i : \Delta \to \mathbb{R}$
Best response / Equilibrium
• Mixed strategies of all players other than $i$: $\sigma_{-i}$
• Best response of player $i$ is
  $$b_i(\sigma_{-i}) = \operatorname*{argmax}_{\sigma_i \in \Delta_i} r_i(\sigma_i, \sigma_{-i})$$
• An equilibrium is a $\sigma$ satisfying, for all $i$, $\sigma_i \in b_i(\sigma_{-i})$ (a numerical sketch follows below)
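To make the best-response computation concrete, here is a minimal sketch (mine, not from the slides) for a two-player game, assuming the standard zero-sum Rock-Scissors-Paper payoffs shown earlier:

```python
import numpy as np

def best_response(reward_matrix, opponent_strategy):
    """Pure best response(s) of player i in a two-player game.

    reward_matrix[a_i, a_j] holds r_i for the joint action (a_i, a_j);
    opponent_strategy is the belief sigma_{-i} over the opponent's actions.
    """
    expected = reward_matrix @ opponent_strategy                  # expected reward of each own action
    return np.flatnonzero(np.isclose(expected, expected.max()))  # argmax can be set-valued

# Rock-Scissors-Paper rewards for the row player (0 = Rock, 1 = Scissors, 2 = Paper)
rsp = np.array([[ 0,  1, -1],
                [-1,  0,  1],
                [ 1, -1,  0]])
print(best_response(rsp, np.array([0.5, 0.25, 0.25])))  # -> [2]: play Paper against a Rock-heavy opponent
```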
Fictitious play
[Diagram: using the known game matrix, each player estimates the strategies of the other players, selects a best action given those estimates, and then updates the estimates.]
Belief updates
• Belief about the strategy of player $i$ is the MLE
  $$\sigma_i^t(a_i) = \frac{\#\{s \le t : a_i^s = a_i\}}{t}$$
• Online updating (sketched in code below)
  $$\sigma^{t+1} \in \sigma^t + \frac{1}{t+1}\left(b(\sigma^t) - \sigma^t\right)$$
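A minimal sketch of the online belief update (illustrative only; the function name is mine): after $t$ observations the belief equals the empirical frequency of the opponent's past actions.

```python
import numpy as np

def update_belief(belief, observed_action, t):
    """sigma^{t+1} = sigma^t + (1/(t+1)) * (indicator(observed_action) - sigma^t)."""
    indicator = np.zeros_like(belief)
    indicator[observed_action] = 1.0
    return belief + (indicator - belief) / (t + 1)

belief = np.full(3, 1.0 / 3)               # arbitrary prior over 3 actions
for t, a in enumerate([0, 0, 2, 1, 0]):    # t = 0, 1, 2, ...; actions observed from the opponent
    belief = update_belief(belief, a, t)
print(belief)                               # [0.6, 0.2, 0.2] -- the empirical frequencies
```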
t 1

Stochastic approximation
• Processes of the form
  $$X_{t+1} - X_t \in \gamma_{t+1}\left[F(X_t) + M_{t+1} + e_{t+1}\right]$$
  where $\mathbb{E}(M_{t+1} \mid X_t) = 0$ and $e_t \to 0$
• $F$ is set-valued (convex and u.s.c.)
• Limit points are chain-recurrent sets of the differential inclusion
  $$\dot{X} \in F(X)$$
Best-response dynamics
• Fictitious play has $M$ and $e$ identically 0, and $\gamma_t = 1/t$
• Limit points are limit points of the best-response differential inclusion
  $$\dot{\sigma} \in b(\sigma) - \sigma$$
• In potential games (and zero-sum games and some others) the limit points must be Nash equilibria
Generalised weakened fictitious play
• Consider any process such that
  $$\sigma^{t+1} \in \sigma^t + \alpha_{t+1}\left(b^{\epsilon_t}(\sigma^t) - \sigma^t + M^{t+1}\right)$$
  where $\alpha_t \to 0$ and $\epsilon_t \to 0$ with $\sum_t \alpha_t = \infty$, and where there is also an interplay condition between the $\alpha_t$ and $M$
• The convergence properties do not change
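To place the two algorithms in this talk within the GWFP template (my own gloss of the slides, not an additional result):
$$\text{Fictitious play:}\quad \alpha_t = \tfrac{1}{t}, \qquad \epsilon_t = 0, \qquad M^t \equiv 0$$
$$\text{Q-learned fictitious play (later):}\quad \alpha_t = \tfrac{1}{t}, \qquad \epsilon_t \to 0 \text{ as the estimated game converges to the true game}$$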
Fictitious play
[Diagram: the fictitious play loop again: estimate the strategies of the other players, select a best action given those estimates and the game matrix, update the estimates.]
Learning the game
            Rock      Scissors   Paper
Rock       (?, ?)    (?, ?)     (?, ?)
Scissors   (?, ?)    (?, ?)     (?, ?)
Paper      (?, ?)    (?, ?)     (?, ?)

The payoffs are unknown; only a noisy reward for the joint action actually played is observed:
$$R_i^t = r_i(a^t) + e_i^t$$
Reinforcement learning
• Track the average reward for each joint action (see the sketch below)
• Play each joint action frequently enough
• Estimates will be close to the expected value
• Estimated game converges to the true game
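A minimal sketch of the reward-averaging step (illustrative; the class and method names are mine, not from the talk):

```python
from collections import defaultdict

class GameEstimator:
    """Running average of observed rewards for each joint action."""

    def __init__(self):
        self.mean = defaultdict(float)    # joint action -> average observed reward
        self.count = defaultdict(int)     # joint action -> number of observations

    def update(self, joint_action, reward):
        a = tuple(joint_action)
        self.count[a] += 1
        self.mean[a] += (reward - self.mean[a]) / self.count[a]   # incremental running average

    def estimate(self, joint_action):
        return self.mean[tuple(joint_action)]
```

If every joint action keeps being played, each running average converges to the expected reward of that joint action, which is exactly how the estimated game converges to the true game.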
Q-learned fictitious play
[Diagram: the known game matrix is replaced by an estimated game matrix; each player estimates the strategies of the other players, selects a best action given those estimates and the estimated game, and updates both the strategy estimates and the estimated game. A toy implementation sketch follows.]
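Putting the pieces together, a self-contained toy loop for one player in a two-player game (purely illustrative: the random stand-in opponent, the exploration rate and the example payoffs are my assumptions, not part of the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions = 3
true_reward = rng.normal(size=(n_actions, n_actions))  # unknown to the player (toy example game)

belief = np.full(n_actions, 1.0 / n_actions)   # belief about the opponent's strategy
q_mean = np.zeros((n_actions, n_actions))      # estimated game matrix
q_count = np.zeros((n_actions, n_actions))

for t in range(1, 5001):
    if rng.random() < 0.1:                     # exploration keeps every joint action being played
        my_action = int(rng.integers(n_actions))
    else:
        expected = q_mean @ belief             # expected reward of each action under the estimated game
        my_action = int(np.argmax(expected))   # best response to current beliefs
    opp_action = int(rng.integers(n_actions))  # stand-in opponent; in the real setting it learns too
    reward = true_reward[my_action, opp_action] + rng.normal(scale=0.1)  # noisy observed reward

    # update the estimated game (running average) and the belief (empirical frequency)
    q_count[my_action, opp_action] += 1
    q_mean[my_action, opp_action] += (reward - q_mean[my_action, opp_action]) / q_count[my_action, opp_action]
    belief += (np.eye(n_actions)[opp_action] - belief) / t
```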
Theoretical result
Theorem – If all joint actions are played
infinitely often then beliefs follow a GWFP
Proof: The estimated game converges to the true game, so selected strategies are $\epsilon$-best responses.
Playing games?
Dense deployment of sensors to detect pedestrian and vehicle activity within an urban environment. (Image: Berkeley Engineering)
It’s impossible!
• $N$ players, each with $A$ actions
• Game matrix has $A^N$ entries to learn (e.g. 10 players with 4 actions each already gives over a million entries)
• Each individual must estimate the strategy of every other individual
• It's just not possible for realistic game scenarios
Marginal contributions
• Marginal contribution of player i is
total system reward – system reward if i absent
• If all marginal contributions are maximised, the system is at a (local) optimum
• Marginal contribution might depend only on
the actions of a small number of neighbours
Sensors – rewards
• Global reward for joint action $a$ is
  $$U_g(a) = \mathbb{E}\left[\sum_{\text{events } j} I\{\text{$j$ is observed}\}\right]$$
  where the expectation is over both events and observation, and $n_j(a)$ is the number of sensors observing event $j$ under $a$
• Marginal reward for $i$ is
  $$r_i(a) = U_g(a) - U_g(a_{-i}) = \mathbb{E}\left[\sum_{\substack{\text{events } j\\ \text{observed by } i}} \Big(P\big(j \text{ observed} \mid n_j(a)\big) - P\big(j \text{ observed} \mid n_j(a)-1\big)\Big)\right]$$
• Actually use the realised value (see the sketch below)
  $$R_i^t = \sum_{\substack{\text{events } j\\ \text{observed by } i \text{ at time } t}} \Big(P\big(j \text{ observed} \mid n_j(a^t)\big) - P\big(j \text{ observed} \mid n_j(a^t)-1\big)\Big)$$
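A sketch of how a sensor might compute its realised marginal reward locally (the per-sensor detection probability and the data structures are illustrative assumptions, not from the slides):

```python
def marginal_reward(events_observed_by_i, n_covering, p_detect=0.5):
    """Realised marginal contribution of sensor i at one time step.

    events_observed_by_i : events that sensor i actually observed this step
    n_covering[j]        : n_j(a), number of sensors covering event j under the joint action
    p_detect             : assumed per-sensor detection probability (illustrative value)
    """
    miss = 1.0 - p_detect
    total = 0.0
    for j in events_observed_by_i:
        n = n_covering[j]
        # increase in detection probability from having n rather than n-1 sensors covering j
        total += (1.0 - miss ** n) - (1.0 - miss ** (n - 1))
    return total

# Example: i observed event "e1" (covered by 2 sensors) and event "e2" (covered by i alone)
print(marginal_reward(["e1", "e2"], {"e1": 2, "e2": 1}))   # 0.25 + 0.5 = 0.75
```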
Local learning
[Diagram: the Q-learned fictitious play loop again: estimated game matrix; estimate strategies of the other players; select best action given the estimates; update the estimates.]
Local learning
[Diagram: each player now estimates the strategies of its neighbours only and maintains an estimated game matrix for local interactions; it selects a best action given those estimates and then updates them.]
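To see why this restructuring makes learning feasible (illustrative numbers, not from the slides): if each sensor has 4 actions and at most 3 neighbours, its local estimated game has at most $4^4 = 256$ entries however many sensors the network contains, whereas the full joint game has $4^N$ entries.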
Theoretical result
Theorem – If all joint actions of local games
are played infinitely often then beliefs follow a
GWFP
Proof: The estimated game converges to the true game, so selected strategies are $\epsilon$-best responses.
Sensing results
So what?!
• Play converges to (local) optimum with only
noisy information and local communication
• An individual always chooses an action to maximise its expected reward given its information
• If an individual doesn't "play cricket", the other individuals will still reach a point that is optimal conditional on that individual's behaviour
Summary
• Learning the game while playing is essential
• This can be accommodated within the GWFP
framework
• Exploiting the neighbourhood structure of
marginal contributions is essential for
feasibility