Multi-Agent Learning Mini-Tutorial
Gerry Tesauro
IBM T.J. Watson Research Center
http://www.research.ibm.com/infoecon
http://www.research.ibm.com/massdist

Outline
- Statement of the problem
- Tools and concepts from RL & game theory
- "Naïve" approaches to multi-agent learning:
  - ordinary single-agent RL; no-regret learning
  - fictitious play
  - evolutionary game theory
- "Sophisticated" approaches:
  - minimax-Q (Littman), Nash-Q (Hu & Wellman)
  - tinkering with learning rates: WoLF (Bowling), multiple-timescale Q-learning (Leslie & Collins)
  - "strategic teaching" (Camerer talk)
- Challenges and Opportunities

Normal single-agent learning
- Assume the environment has observable states, characterizable expected rewards and state transitions, and that all of the above is stationary (MDP-ish).
- Non-learning, theoretical solution to the fully specified problem: the DP formalism.
- Learning: solve by trial and error without a full specification: RL + exploration, Monte Carlo, ...

Multi-Agent Learning
- Problem: an agent tries to solve its learning problem while the other agents in the environment are also trying to solve their own learning problems.
- Non-learning, theoretical solution to the fully specified problem: game theory.

Basics of game theory
- A game is specified by: players (1…N), actions, and payoff matrices (functions of joint actions).
- Example: rock-paper-scissors. A's payoff matrix (rows = A's action, columns = B's action); B's payoff matrix is the negative of A's (zero-sum):

        R    P    S
   R    0   -1   +1
   P   +1    0   -1
   S   -1   +1    0

- If the payoff matrices are identical, the game is cooperative; otherwise it is non-cooperative (zero-sum = purely competitive).

Basic lingo… (2)
- Games with no states: (bi)matrix games.
- Games with states: stochastic games, Markov games (state transitions are functions of joint actions).
- Games with simultaneous moves: normal form.
- Games with alternating turns: extensive form.
- Number of rounds = 1: one-shot game.
- Number of rounds > 1: repeated game.
- Deterministic action choice: pure strategy.
- Non-deterministic action choice: mixed strategy.

Basic Analysis
- A joint strategy x is Pareto-optimal if there is no x′ that improves everybody's payoffs.
- An agent's x_i is a dominant strategy if it is always best regardless of the others' actions.
- x_i is a best response to the others' x_{-i} if it maximizes the agent's payoff given x_{-i}.
- A joint strategy x is an equilibrium if each agent's strategy is simultaneously a best response to everyone else's strategy, i.e. no one has an incentive to deviate (Nash, correlated).
- A Nash equilibrium always exists, but there may be exponentially many of them, and they are not easy to compute.

What about imperfect-information games?
- A Nash equilibrium requires knowledge of all payoffs. For imperfect-information games, the corresponding concept is the Bayes-Nash equilibrium (Nash plus Bayesian inference over hidden information). Even more intractable than regular Nash.

Can we make game theory more tractable? (An active area of research.)
- Symmetric games: payoffs are invariant under swapping of player labels. Can look for symmetric equilibria, where all agents play the same mixed strategy.
- Network games: agent payoffs depend only on interactions with a small number of neighbors.
- Summarization games: payoffs are simple summarization functions of the population's joint actions (e.g. voting).

Summary: pros and cons of game theory
- Game theory provides a nice conceptual/theoretical framework for thinking about multi-agent learning.
- Game theory is appropriate provided that:
  - the game is stationary and fully specified;
  - there is enough computer power to compute an equilibrium;
  - we can assume the other agents are also game theorists;
  - we can solve the equilibrium coordination problem.
- These conditions rarely hold in real applications, so multi-agent learning is not only a fascinating problem: it may be the only viable option.

Naïve Approaches to Multi-Agent Learning
- Basic idea: the agent adapts, ignoring the non-stationarity of the other agents' strategies.
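The best-response and equilibrium definitions above can be checked mechanically for pure strategies. A minimal sketch (the function name and the coordination-game example are my own; the rock-paper-scissors payoffs are the standard ones):

```python
# Sketch: enumerate pure-strategy Nash equilibria of a bimatrix game by
# checking the best-response condition at every joint action.

def pure_nash(payoff_a, payoff_b):
    """Return joint actions (i, j) where each player's action is a best
    response to the other's: i maximizes column j of payoff_a, and
    j maximizes row i of payoff_b."""
    n, m = len(payoff_a), len(payoff_a[0])
    equilibria = []
    for i in range(n):
        for j in range(m):
            a_best = all(payoff_a[i][j] >= payoff_a[k][j] for k in range(n))
            b_best = all(payoff_b[i][j] >= payoff_b[i][l] for l in range(m))
            if a_best and b_best:
                equilibria.append((i, j))
    return equilibria

# Rock-paper-scissors (zero-sum): no pure-strategy equilibrium exists, so
# the unique Nash equilibrium must be mixed (each action w.p. 1/3).
rps = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
rps_b = [[-x for x in row] for row in rps]
print(pure_nash(rps, rps_b))        # -> []

# A coordination game has two pure equilibria on the diagonal.
coord = [[1, 0], [0, 1]]
print(pure_nash(coord, coord))      # -> [(0, 0), (1, 1)]
```

This brute-force check is exponential only in the number of players, not the number of actions; it is the mixed-strategy case that makes Nash computation hard.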
1. Fictitious play
- The agent observes the time-average frequency of the other players' action choices and models:

    prob(action k) = (# times k observed) / (total # observations)

- The agent then plays a best response to this model.
- Variants of fictitious play: exponential recency weighting, "smoothed" best response (~softmax), small adjustment toward the best response, ...
- What if all agents use fictitious play?
  - Strict Nash equilibria are absorbing points for fictitious play.
  - The typical result is limit-cycle behavior of the strategies, with the period increasing over time.
  - In certain cases the product of the empirical distributions converges to Nash even though actual play cycles (penny-matching example).

More Naïve Approaches…

2. Evolutionary game theory
- "Replicator dynamics" models: a large population of agents using different strategies; the fittest agents breed more copies. Let x = the population strategy vector and x_k = the fraction of the population playing strategy k. The growth rate is then:

    dx_k/dt = x_k [ u(e_k, x) - u(x, x) ]

- The same equation can also be derived from an "imitation" model.
- Nash equilibria are fixed points of this equation, but not necessarily attractors (they may be unstable or only neutrally stable).
- Many possible dynamic behaviors: attractors, limit cycles, saddle points, chaotic orbits, unstable fixed points, ...
- [Figure: replicator dynamics applied to auction bidding strategies.]

More Naïve Approaches…

3. Iterated Gradient Ascent (Singh, Kearns and Mansour)
- Again performs a myopic adaptation to the other players' current strategies:

    dx_i/dt = ∂u(x_i, x_{-i}) / ∂x_i

- This is a coupled system of linear differential equations: u is linear in x_i and in x_{-i}.
- Analysis for two-player, two-action games: either play converges to a Nash fixed point on the boundary (at least one pure strategy), or we get limit cycles.

Further Naïve Approaches…
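The fictitious-play update above is easy to simulate. A minimal sketch on rock-paper-scissors (a zero-sum game, for which the empirical frequencies are known to converge to Nash); the smoothed initial counts and tie-breaking are my own illustrative choices:

```python
# Sketch: two-player fictitious play on rock-paper-scissors.
# Because RPS is symmetric zero-sum (PAYOFF[b][a] == -PAYOFF[a][b]),
# both players can best-respond using the same matrix, row-indexed by
# their own action.

PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]  # payoff to the row index

def best_response(opp_counts):
    """Best response to the empirical distribution of the opponent."""
    total = sum(opp_counts)
    probs = [c / total for c in opp_counts]
    values = [sum(p * PAYOFF[a][b] for b, p in enumerate(probs))
              for a in range(3)]
    return max(range(3), key=lambda a: values[a])

def fictitious_play(rounds=30000):
    counts = [[1, 1, 1], [1, 1, 1]]   # smoothed initial action counts
    for _ in range(rounds):
        a = best_response(counts[1])  # player 0 models player 1
        b = best_response(counts[0])  # player 1 models player 0
        counts[0][a] += 1
        counts[1][b] += 1
    total = sum(counts[0])
    return [c / total for c in counts[0]]

freqs = fictitious_play()
# freqs should approach the mixed Nash equilibrium (1/3, 1/3, 1/3),
# even though actual play cycles R -> P -> S in ever-longer runs.
```

Watching the sequence of actions (rather than the averages) makes the limit-cycle behavior described above directly visible.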
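The replicator equation above can likewise be integrated numerically. A minimal sketch on a Hawk-Dove game, where the interior Nash equilibrium x_Hawk = V/C happens to be an attractor; the parameters V, C and the Euler step size are illustrative choices of mine:

```python
# Sketch: Euler integration of the replicator equation
#   dx_k/dt = x_k * (u(e_k, x) - u(x, x))
# on a 2-strategy Hawk-Dove game.

V, C = 2.0, 4.0                   # value of resource, cost of fighting
PAYOFF = [[(V - C) / 2, V],       # Hawk vs (Hawk, Dove)
          [0.0,         V / 2]]   # Dove vs (Hawk, Dove)

def replicator_step(x, dt=0.01):
    # fitness[k] = u(e_k, x); avg = u(x, x)
    fitness = [sum(PAYOFF[k][j] * x[j] for j in range(2)) for k in range(2)]
    avg = sum(x[k] * fitness[k] for k in range(2))
    return [x[k] + dt * x[k] * (fitness[k] - avg) for k in range(2)]

x = [0.9, 0.1]                    # start far from equilibrium
for _ in range(20000):
    x = replicator_step(x)
# x[0] should now be very close to the interior equilibrium V/C = 0.5.
```

Note that the update conserves sum(x) exactly, since the growth terms weighted by x_k sum to zero; in games like RPS the same integrator instead orbits the fixed point, illustrating that Nash fixed points need not be attractors.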
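The limit-cycle outcome of the iterated-gradient-ascent analysis can be seen on matching pennies, where u1(p, q) = (2p-1)(2q-1) and u2 = -u1. A minimal sketch (step size, start point, and the clipping-to-[0,1] projection are my own illustrative choices):

```python
# Sketch: iterated gradient ascent, in the spirit of the Singh-Kearns-
# Mansour analysis, on matching pennies. Each player takes a gradient
# step on its own payoff w.r.t. its own mixed strategy.

def clip(v):
    return max(0.0, min(1.0, v))

def iga(p, q, eta=0.01, steps=5000):
    traj = [(p, q)]
    for _ in range(steps):
        dp = 2 * (2 * q - 1)      # d u1 / d p
        dq = -2 * (2 * p - 1)     # d u2 / d q
        p, q = clip(p + eta * dp), clip(q + eta * dq)  # simultaneous update
        traj.append((p, q))
    return traj

traj = iga(0.7, 0.5)
# The strategies orbit the mixed Nash equilibrium (0.5, 0.5) rather than
# converging to it -- the limit-cycle case of the analysis above.
```

Plotting `traj` in the unit square makes the orbit around (0.5, 0.5) immediately visible.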
4. Dumb Single-Agent Learning
- Use a single-agent algorithm in a multi-agent problem and hope that it works:
  - no-regret learning by pricebots (Greenwald & Kephart);
  - simultaneous Q-learning by pricebots (Tesauro & Kephart).
- In many cases this actually works: the learners converge either exactly or approximately to self-consistent optimal strategies.

"Sophisticated" Approaches
- These take into account the possibility that the other agents' strategies might change.

5. Multi-Agent Q-learning
- Minimax-Q (Littman): a convergent algorithm for two-player zero-sum stochastic games.
- Nash-Q (Hu & Wellman): a convergent algorithm for two-player general-sum stochastic games; requires the use of a Nash equilibrium solver.

More sophisticated approaches...

6. Varying learning rates
- WoLF, "Win or Learn Fast" (Bowling): the agent reduces its learning rate when performing well and increases it when doing badly. Improves the convergence of IGA and of policy hill-climbing.
- Multi-timescale Q-learning (Leslie): different agents use different power laws t^(-n) for learning-rate decay; this achieves simultaneous convergence where ordinary Q-learning doesn't.

More sophisticated approaches...

7. "Strategic Teaching"
- Recognizes that the other players' strategies are adaptive: "A strategic teacher may play a strategy which is not myopically optimal (such as cooperating in Prisoner's Dilemma) in the hope that it induces adaptive players to expect that strategy in the future, which triggers a best-response that benefits the teacher." (Camerer, Ho and Chong)

Theoretical Research Challenges
- What is the proper theoretical formulation?
- "No short-cut" hypothesis: massive on-line search à la Deep Blue to maximize expected long-term reward (Bayesian):
  - model and predict the behavior of the other players, including how they learn based on my actions (beware of infinite model recursion);
  - trial-and-error exploration;
  - continual Bayesian inference using all evidence over all uncertainties (Boutilier: Bayesian exploration).
- When can you get away with simpler methods?
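The "dumb single-agent learning" idea can be sketched with two independent epsilon-greedy Q-learners repeatedly playing the Prisoner's Dilemma, each treating the other as part of a stationary environment. This is an illustrative toy, not the pricebot setup from the cited papers; the payoffs, rates, and episode count are my own choices:

```python
import random

# Sketch: two independent stateless Q-learners on the repeated
# Prisoner's Dilemma. Each ignores the other's adaptation entirely.

PD = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
      ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}
ACTIONS = ['C', 'D']

def choose(q, eps, rng):
    if rng.random() < eps:
        return rng.choice(ACTIONS)          # explore
    return max(ACTIONS, key=lambda a: q[a]) # exploit

def train(episodes=50000, alpha=0.1, eps=0.1, seed=0):
    rng = random.Random(seed)
    q1 = {'C': 0.0, 'D': 0.0}
    q2 = {'C': 0.0, 'D': 0.0}
    for _ in range(episodes):
        a1, a2 = choose(q1, eps, rng), choose(q2, eps, rng)
        r1, r2 = PD[(a1, a2)]
        q1[a1] += alpha * (r1 - q1[a1])     # stateless Q-update
        q2[a2] += alpha * (r2 - q2[a2])
    return q1, q2

q1, q2 = train()
# Both learners typically settle on the dominant strategy D (defect),
# a self-consistent (Nash) outcome -- naive learning "works" here.
```

Here single-agent learning happens to land on the game's equilibrium because defection is dominant; in games without dominant strategies the same approach can cycle or mis-converge.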
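The core step of minimax-Q is solving the zero-sum matrix game formed by the current Q-values at each state. In general this needs a linear program, but for 2x2 games a closed form exists; a minimal sketch (function and variable names are my own):

```python
# Sketch: closed-form maximin solution of a 2x2 zero-sum matrix game,
# the building block minimax-Q applies to the Q-value matrix of a state.

def solve_2x2_zero_sum(g):
    """Return (row player's maximin mixed strategy, game value) for a
    2x2 zero-sum game g, where g[i][j] is the row player's payoff."""
    (a, b), (c, d) = g
    # Pure saddle point: an entry minimal in its row, maximal in its column.
    for i, row in enumerate(g):
        for j in range(2):
            if g[i][j] == min(row) and g[i][j] == max(g[0][j], g[1][j]):
                return [1.0 - i, float(i)], g[i][j]
    # Otherwise the maximin strategy is mixed: equalize the two columns.
    denom = a - b - c + d
    p0 = (d - c) / denom
    value = (a * d - b * c) / denom
    return [p0, 1 - p0], value

# Matching pennies: value 0, play each action with probability 1/2.
strategy, value = solve_2x2_zero_sum([[1, -1], [-1, 1]])
```

The mixed case follows from making the row player's expected payoff equal against both opponent columns: p*a + (1-p)*c = p*b + (1-p)*d gives p = (d-c)/(a-b-c+d).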
Real-World Opportunities
- Multi-agent systems where you can't do game theory (covers everything :-)):
  - electronic marketplaces (Kephart)
  - mobile networks (Chang)
  - self-managing computer systems (Kephart)
  - teams of robots (Bowling, Stone)
  - video games
  - military/counter-terrorism applications