Multi-Agent Learning Mini-Tutorial

Gerry Tesauro
IBM T.J. Watson Research Center
http://www.research.ibm.com/infoecon
http://www.research.ibm.com/massdist
Outline

- Statement of the problem
- Tools and concepts from RL & game theory
- “Naïve” approaches to multi-agent learning
  - ordinary single-agent RL; no-regret learning
  - fictitious play
  - evolutionary game theory
- “Sophisticated” approaches
  - minimax-Q (Littman), Nash-Q (Hu & Wellman)
  - tinkering with learning rates: WoLF (Bowling), multiple-timescale Q-learning (Leslie & Collins)
  - “strategic teaching” (Camerer talk)
- Challenges and Opportunities
Normal single-agent learning

- Assume the environment has observable states, characterizable expected rewards and state transitions, and that all of the above is stationary (MDP-ish)
- Non-learning, theoretical solution to the fully specified problem: the DP formalism
- Learning: solve by trial and error without a full specification: RL + exploration, Monte Carlo, ... (a minimal Q-learning sketch follows below)
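
As a point of reference, here is a minimal sketch of the tabular Q-learning update used under these single-agent MDP assumptions; the function and parameter names are illustrative, not from the tutorial.

    import numpy as np

    def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
        # One trial-and-error update toward the Bellman target
        # r + gamma * max_a' Q(s', a'), which is well-founded when the
        # environment really is a stationary MDP as assumed above.
        target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])
        return Q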
Multi-Agent Learning Problem

- The agent tries to solve its learning problem while the other agents in the environment are also trying to solve their own learning problems.
- Non-learning, theoretical solution to the fully specified problem: game theory
Basics of game theory

- A game is specified by: players (1…N), actions, and payoff matrices (functions of joint actions)

A’s payoff (rows: A’s action, columns: B’s action):

        R    P    S
    R   0   -1   +1
    P  +1    0   -1
    S  -1   +1    0

B’s payoff (rows: A’s action, columns: B’s action):

        R    P    S
    R   0   +1   -1
    P  -1    0   +1
    S  +1   -1    0

- If the payoff matrices are identical, the game is cooperative, otherwise non-cooperative (zero-sum = purely competitive)
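
A minimal numpy sketch of this specification (the variable and function names are mine, not the tutorial’s): the rock-paper-scissors game above as a pair of payoff matrices, with a helper for expected payoffs under mixed strategies.

    import numpy as np

    # Rows index A's action, columns index B's action, in the order (R, P, S).
    A_payoff = np.array([[ 0, -1,  1],
                         [ 1,  0, -1],
                         [-1,  1,  0]])
    B_payoff = -A_payoff        # zero-sum: B's payoff matrix is the negative of A's

    def expected_payoffs(x_a, x_b):
        # Expected payoffs to A and B when A plays mixed strategy x_a and B plays x_b.
        return x_a @ A_payoff @ x_b, x_a @ B_payoff @ x_b

    uniform = np.ones(3) / 3
    print(expected_payoffs(uniform, uniform))    # both 0.0 under uniform mixing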
Basic lingo…(2)

- Games with no states: (bi-)matrix games
- Games with states: stochastic games, Markov games (state transitions are functions of joint actions)
- Games with simultaneous moves: normal form
- Games with alternating turns: extensive form
- Number of rounds = 1: one-shot game
- Number of rounds > 1: repeated game
- Deterministic action choice: pure strategy
- Non-deterministic action choice: mixed strategy
Basic Analysis

- A joint strategy x is Pareto-optimal if there is no x’ that improves everybody’s payoffs
- An agent’s x_i is a dominant strategy if it is always best regardless of the others’ actions
- x_i is a best response to the others’ x_-i if it maximizes the agent’s payoff given x_-i
- A joint strategy x is an equilibrium if each agent’s strategy is simultaneously a best response to everyone else’s strategy, i.e. no agent has an incentive to deviate (Nash, correlated); a numerical check is sketched below
- A Nash equilibrium always exists, but there may be exponentially many of them, and they are not easy to compute
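
A small numerical check of the best-response and Nash conditions for a bimatrix game, reusing the numpy payoff matrices sketched earlier; again an illustration under my naming assumptions, not part of the tutorial.

    import numpy as np

    def best_response_value(payoff, opponent_mix):
        # Highest expected payoff attainable against a fixed opponent mixed strategy.
        return (payoff @ opponent_mix).max()

    def is_nash(x_a, x_b, A_payoff, B_payoff, tol=1e-9):
        # (x_a, x_b) is a Nash equilibrium if neither player can gain by a
        # unilateral deviation, i.e. each strategy is already a best response.
        a_val = x_a @ A_payoff @ x_b
        b_val = x_a @ B_payoff @ x_b
        return (best_response_value(A_payoff, x_b) <= a_val + tol and
                best_response_value(B_payoff.T, x_a) <= b_val + tol)

    # For rock-paper-scissors, uniform mixing by both players passes this test.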
What about imperfect information games?

- Nash equilibrium requires knowledge of all payoffs. For imperfect-information games, the corresponding concept is Bayes-Nash equilibrium (Nash plus Bayesian inference over hidden information). Even more intractable than regular Nash.
Can we make game theory more tractable?

- Active area of research
- Symmetric games: payoffs are invariant under swapping of player labels. Can look for symmetric equilibria, where all agents play the same mixed strategy.
- Network games: agent payoffs depend only on interactions with a small number of neighbors
- Summarization games: payoffs are simple summarization functions of the population’s joint actions (e.g. voting)
Summary: pros and cons of game theory

- Game theory provides a nice conceptual/theoretical framework for thinking about multi-agent learning.
- Game theory is appropriate provided that:
  - the game is stationary and fully specified;
  - there is enough computing power to compute an equilibrium;
  - we can assume the other agents are also game theorists;
  - we can solve the equilibrium coordination problem.
- The above conditions rarely hold in real applications
- Multi-agent learning is not only a fascinating problem, it may be the only viable option.
Naïve Approaches to Multi-Agent Learning

- Basic idea: the agent adapts while ignoring the non-stationarity of the other agents’ strategies
- 1. Fictitious play: the agent observes the time-average frequency of the other players’ action choices and models

      prob(action k) = (# times k observed) / (total # observations)

  The agent then plays a best response to this model (a minimal sketch follows below).
- Variants of fictitious play: exponential recency weighting, “smoothed” best response (~softmax), small adjustment toward best response, ...
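
A minimal sketch of the fictitious-play rule above for a two-player matrix game; the function names and the use of numpy are my choices.

    import numpy as np

    def fictitious_play_action(my_payoff, opponent_counts):
        # Model the opponent by the empirical frequency of its past actions,
        # then play a best response to that model.
        freqs = opponent_counts / opponent_counts.sum()   # prob(action k)
        expected = my_payoff @ freqs                      # my expected payoff per action
        return int(np.argmax(expected))

    # opponent_counts is typically seeded with ones so the first move is well defined.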

What if all agents use fictitious play?

- Strict Nash equilibria are absorbing points for fictitious play
- The typical result is limit-cycle behavior of the strategies, with the period increasing as N → ∞
- In certain cases the product of the empirical distributions converges to Nash even though actual play cycles (matching-pennies example; simulated in the sketch below)
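
A small simulation of the matching-pennies case (the payoff matrices and loop are my illustration): both players run fictitious play against each other; actual play keeps cycling, but the empirical frequencies approach the (1/2, 1/2) Nash mixture.

    import numpy as np

    A = np.array([[ 1, -1],
                  [-1,  1]])      # row player's payoffs in matching pennies
    B = -A                        # zero-sum
    counts_a, counts_b = np.ones(2), np.ones(2)   # observed action counts of each player
    for t in range(10000):
        a = int(np.argmax(A @ (counts_b / counts_b.sum())))     # row best-responds to B's history
        b = int(np.argmax(B.T @ (counts_a / counts_a.sum())))   # column best-responds to A's history
        counts_a[a] += 1
        counts_b[b] += 1
    print(counts_a / counts_a.sum(), counts_b / counts_b.sum()) # both near [0.5, 0.5]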
More Naïve Approaches…

- 2. Evolutionary game theory: “replicator dynamics” models a large population of agents using different strategies, where the fittest agents breed more copies.
  Let x = the population strategy vector, with x_k = the fraction of the population playing strategy k. The growth rate is then

      dx_k/dt = x_k [ u(e^k, x) − u(x, x) ]

  where e^k denotes pure strategy k (a numerical sketch follows after this slide).
- The above equation can also be derived from an “imitation” model
- Nash equilibria are fixed points of the above equation, but not necessarily attractors (they may be unstable or neutrally stable)
- Many possible dynamic behaviors: attractors, limit cycles, saddle points, chaotic orbits, unstable fixed points, ...

[Figure: replicator dynamics applied to auction bidding strategies]
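
A minimal sketch of the replicator equation above, integrated with a simple Euler step; the step size and the rock-paper-scissors example are my choices, not from the tutorial.

    import numpy as np

    def replicator_step(x, U, dt=0.01):
        # dx_k/dt = x_k [ u(e^k, x) - u(x, x) ] for a symmetric game with payoff matrix U.
        fitness = U @ x          # u(e^k, x): payoff of each pure strategy against the population
        avg = x @ fitness        # u(x, x): population-average payoff
        return x + dt * x * (fitness - avg)

    # Rock-paper-scissors: the interior Nash point is a fixed point, but nearby
    # trajectories orbit around it rather than converging to it.
    U = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
    x = np.array([0.5, 0.3, 0.2])
    for _ in range(5000):
        x = replicator_step(x, U)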
More Naïve Approaches…

- 3. Iterated Gradient Ascent (Singh, Kearns and Mansour): again a myopic adaptation to the other players’ current strategies. Each player adjusts its mixed strategy along the gradient of its own expected payoff:

      dx_i/dt = ∂u(x_i, x_-i) / ∂x_i

- This gives a coupled linear system of equations, since u is linear in x_i and in x_-i
- Analysis for two-player, two-action games: play either converges to a Nash fixed point on the boundary (at least one pure strategy) or enters a limit cycle (a minimal step sketch follows below)
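
A minimal sketch of one iterated-gradient-ascent step for a two-player, two-action game, parameterizing each mixed strategy by a single probability; the names and step size are illustrative assumptions.

    import numpy as np

    def iga_step(p, q, A, B, eta=0.01):
        # p = row player's probability of action 0, q = column player's.
        # Each player moves along the gradient of its own expected payoff
        # while treating the other player's current strategy as fixed.
        dA_dp = (A[0] - A[1]) @ np.array([q, 1 - q])          # d u_row / dp
        dB_dq = np.array([p, 1 - p]) @ (B[:, 0] - B[:, 1])    # d u_col / dq
        p = float(np.clip(p + eta * dA_dp, 0.0, 1.0))         # keep probabilities valid
        q = float(np.clip(q + eta * dB_dq, 0.0, 1.0))
        return p, q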
Further Naïve Approaches…

- 4. Dumb single-agent learning: use a single-agent algorithm in a multi-agent problem and hope that it works
  - No-regret learning by pricebots (Greenwald & Kephart)
  - Simultaneous Q-learning by pricebots (Tesauro & Kephart)
- In many cases this actually works: the learners converge either exactly or approximately to self-consistent optimal strategies (see the sketch below)
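
A minimal sketch of the “dumb” approach for a repeated matrix game: two independent epsilon-greedy Q-learners, each treating the other as part of a stationary environment. This illustrates the idea only; it is not the pricebot model of the cited papers.

    import numpy as np

    def independent_q_learning(A, B, episodes=20000, alpha=0.1, eps=0.1, seed=0):
        # A[a, b], B[a, b]: payoffs to players 1 and 2 for joint action (a, b).
        rng = np.random.default_rng(seed)
        n_a, n_b = A.shape
        Qa, Qb = np.zeros(n_a), np.zeros(n_b)    # stateless Q-values over own actions
        for _ in range(episodes):
            a = rng.integers(n_a) if rng.random() < eps else int(np.argmax(Qa))
            b = rng.integers(n_b) if rng.random() < eps else int(np.argmax(Qb))
            Qa[a] += alpha * (A[a, b] - Qa[a])   # each learner ignores the other's adaptation
            Qb[b] += alpha * (B[a, b] - Qb[b])
        return Qa, Qb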
“Sophisticated” approaches

- Take into account the possibility that the other agents’ strategies might change.
- 5. Multi-agent Q-learning:
  - Minimax-Q (Littman): convergent algorithm for two-player zero-sum stochastic games (see the LP sketch below)
  - Nash-Q (Hu & Wellman): convergent algorithm for two-player general-sum stochastic games; requires use of a Nash equilibrium solver
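
In minimax-Q the usual max in the Q-learning target is replaced by the minimax value of the stage game given by Q(s, ·, ·). Below is a minimal sketch of that value computation as a small linear program; it assumes scipy is available and the function name is mine.

    import numpy as np
    from scipy.optimize import linprog

    def minimax_value(Q_s):
        # Value and maximin mixed strategy of the zero-sum matrix game Q_s[a, o],
        # where a is our action and o is the opponent's.
        n_a, n_o = Q_s.shape
        c = np.zeros(n_a + 1)
        c[-1] = -1.0                                   # variables [x_1..x_na, v]; maximize v
        A_ub = np.hstack([-Q_s.T, np.ones((n_o, 1))])  # v <= sum_a x_a Q_s[a, o] for every o
        b_ub = np.zeros(n_o)
        A_eq = np.append(np.ones(n_a), 0.0).reshape(1, -1)   # probabilities sum to 1
        b_eq = np.array([1.0])
        bounds = [(0, 1)] * n_a + [(None, None)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
        return res.x[-1], res.x[:-1]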
More sophisticated approaches...

- 6. Varying learning rates
  - WoLF, “Win or Learn Fast” (Bowling): the agent reduces its learning rate when it is performing well and increases it when it is doing badly. Improves convergence of IGA and policy hill-climbing (sketched below).
  - Multi-timescale Q-learning (Leslie): different agents use different power laws t^(-n) for learning-rate decay; this achieves simultaneous convergence where ordinary Q-learning doesn’t
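
A minimal sketch of the WoLF learning-rate switch: compare the current policy’s expected value with that of the long-run average policy, and learn cautiously when winning, fast when losing. The two delta values here are illustrative defaults, not Bowling’s exact settings.

    import numpy as np

    def wolf_policy_step_size(pi, pi_avg, Q, delta_win=0.01, delta_lose=0.04):
        # pi, pi_avg: current and time-averaged mixed strategies; Q: action values.
        winning = pi @ Q >= pi_avg @ Q     # "win" = current policy at least as good as average
        return delta_win if winning else delta_lose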
More sophisticated approaches...

- 7. “Strategic teaching”: recognizes that the other players’ strategies are adaptive
  - “A strategic teacher may play a strategy which is not myopically optimal (such as cooperating in Prisoner’s Dilemma) in the hope that it induces adaptive players to expect that strategy in the future, which triggers a best-response that benefits the teacher.” (Camerer, Ho and Chong)
Theoretical Research Challenges

- What is the proper theoretical formulation?
  - “No short-cut” hypothesis: massive on-line search a la Deep Blue to maximize expected long-term reward
  - (Bayesian) model and predict the behavior of the other players, including how they learn based on my actions (beware of infinite model recursion)
  - trial-and-error exploration
  - continual Bayesian inference using all evidence over all uncertainties (Boutilier: Bayesian exploration)
- When can you get away with simpler methods?
Real-World Opportunities

- Multi-agent systems where you can’t do game theory (covers everything :-))
  - Electronic marketplaces (Kephart)
  - Mobile networks (Chang)
  - Self-managing computer systems (Kephart)
  - Teams of robots (Bowling, Stone)
  - Video games
  - Military/counter-terrorism applications