Multi-Agent Learning Mini-Tutorial
Gerry Tesauro
IBM T.J. Watson Research Center
http://www.research.ibm.com/infoecon
http://www.research.ibm.com/massdist

Outline
- Statement of the problem
- Tools and concepts from RL & game theory
- "Naïve" approaches to multi-agent learning:
  - ordinary single-agent RL; no-regret learning
  - fictitious play
  - evolutionary game theory
- "Sophisticated" approaches:
  - minimax-Q (Littman), Nash-Q (Hu & Wellman)
  - tinkering with learning rates: WoLF (Bowling), multiple-timescale Q-learning (Leslie & Collins)
  - "strategic teaching" (Camerer talk)
- Challenges and Opportunities

Normal single-agent learning
- Assume the environment has observable states, characterizable expected rewards and state transitions, and that all of the above is stationary (MDP-ish).
- Non-learning, theoretical solution to the fully specified problem: the DP formalism.
- Learning: solve by trial and error without a full specification: RL + exploration, Monte Carlo, ...

Multi-Agent Learning
- Problem: an agent tries to solve its learning problem while the other agents in the environment are also trying to solve their own learning problems.
- Non-learning, theoretical solution to the fully specified problem: game theory.

Basics of game theory
- A game is specified by: players (1…N), actions, and payoff matrices (functions of joint actions).
- Example: rock-paper-scissors. A's payoff matrix (rows = A's action, columns = B's action); B's payoff matrix is the negative of A's (zero-sum):

        R    P    S
   R    0   -1   +1
   P   +1    0   -1
   S   -1   +1    0

- If the payoff matrices are identical, the game is cooperative; otherwise it is non-cooperative (zero-sum = purely competitive).

Basic lingo… (2)
- Games with no states: (bi)matrix games.
- Games with states: stochastic games, Markov games (state transitions are functions of joint actions).
- Games with simultaneous moves: normal form.
- Games with alternating turns: extensive form.
- Number of rounds = 1: one-shot game.
- Number of rounds > 1: repeated game.
- Deterministic action choice: pure strategy.
- Non-deterministic action choice: mixed strategy.

Basic Analysis
- A joint strategy x is Pareto-optimal if there is no x′ that improves everybody's payoffs.
- An agent's x_i is a dominant strategy if it is always best regardless of the others' actions.
- x_i is a best response to the others' x_{-i} if it maximizes the agent's payoff given x_{-i}.
- A joint strategy x is an equilibrium if each agent's strategy is simultaneously a best response to everyone else's strategy, i.e. no one has an incentive to deviate (Nash, correlated).
- A Nash equilibrium always exists, but there may be exponentially many of them, and they are not easy to compute.

What about imperfect-information games?
- A Nash equilibrium requires knowledge of all payoffs. For imperfect-information games, the corresponding concept is the Bayes-Nash equilibrium (Nash plus Bayesian inference over hidden information). Even more intractable than regular Nash.

Can we make game theory more tractable? (An active area of research.)
- Symmetric games: payoffs are invariant under swapping of player labels. Can look for symmetric equilibria, where all agents play the same mixed strategy.
- Network games: agent payoffs depend only on interactions with a small number of neighbors.
- Summarization games: payoffs are simple summarization functions of the population's joint actions (e.g. voting).

Summary: pros and cons of game theory
- Game theory provides a nice conceptual/theoretical framework for thinking about multi-agent learning.
- Game theory is appropriate provided that:
  - the game is stationary and fully specified;
  - there is enough computer power to compute an equilibrium;
  - we can assume the other agents are also game theorists;
  - we can solve the equilibrium coordination problem.
- These conditions rarely hold in real applications, so multi-agent learning is not only a fascinating problem: it may be the only viable option.

Naïve Approaches to Multi-Agent Learning
- Basic idea: the agent adapts, ignoring the non-stationarity of the other agents' strategies.
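The best-response and equilibrium definitions above can be checked mechanically for pure strategies. A minimal sketch (the function name and the coordination-game example are my own; the rock-paper-scissors payoffs are the standard ones):

```python
# Sketch: enumerate pure-strategy Nash equilibria of a bimatrix game by
# checking the best-response condition at every joint action.

def pure_nash(payoff_a, payoff_b):
    """Return joint actions (i, j) where each player's action is a best
    response to the other's: i maximizes column j of payoff_a, and
    j maximizes row i of payoff_b."""
    n, m = len(payoff_a), len(payoff_a[0])
    equilibria = []
    for i in range(n):
        for j in range(m):
            a_best = all(payoff_a[i][j] >= payoff_a[k][j] for k in range(n))
            b_best = all(payoff_b[i][j] >= payoff_b[i][l] for l in range(m))
            if a_best and b_best:
                equilibria.append((i, j))
    return equilibria

# Rock-paper-scissors (zero-sum): no pure-strategy equilibrium exists, so
# the unique Nash equilibrium must be mixed (each action w.p. 1/3).
rps = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
rps_b = [[-x for x in row] for row in rps]
print(pure_nash(rps, rps_b))        # -> []

# A coordination game has two pure equilibria on the diagonal.
coord = [[1, 0], [0, 1]]
print(pure_nash(coord, coord))      # -> [(0, 0), (1, 1)]
```

This brute-force check is exponential only in the number of players, not the number of actions; it is the mixed-strategy case that makes Nash computation hard.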
1. Fictitious play
- The agent observes the time-average frequency of the other players' action choices and models:

    prob(action k) = (# times k observed) / (total # observations)

- The agent then plays a best response to this model.
- Variants of fictitious play: exponential recency weighting, "smoothed" best response (~softmax), small adjustment toward the best response, ...
- What if all agents use fictitious play?
  - Strict Nash equilibria are absorbing points for fictitious play.
  - The typical result is limit-cycle behavior of the strategies, with the period increasing over time.
  - In certain cases the product of the empirical distributions converges to Nash even though actual play cycles (penny-matching example).

More Naïve Approaches…

2. Evolutionary game theory
- "Replicator dynamics" models: a large population of agents using different strategies; the fittest agents breed more copies. Let x = the population strategy vector and x_k = the fraction of the population playing strategy k. The growth rate is then:

    dx_k/dt = x_k [ u(e_k, x) - u(x, x) ]

- The same equation can also be derived from an "imitation" model.
- Nash equilibria are fixed points of this equation, but not necessarily attractors (they may be unstable or only neutrally stable).
- Many possible dynamic behaviors: attractors, limit cycles, saddle points, chaotic orbits, unstable fixed points, ...
- [Figure: replicator dynamics applied to auction bidding strategies.]

More Naïve Approaches…

3. Iterated Gradient Ascent (Singh, Kearns and Mansour)
- Again performs a myopic adaptation to the other players' current strategies:

    dx_i/dt = ∂u(x_i, x_{-i}) / ∂x_i

- This is a coupled system of linear differential equations: u is linear in x_i and in x_{-i}.
- Analysis for two-player, two-action games: either play converges to a Nash fixed point on the boundary (at least one pure strategy), or we get limit cycles.

Further Naïve Approaches…
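The fictitious-play update above is easy to simulate. A minimal sketch on rock-paper-scissors (a zero-sum game, for which the empirical frequencies are known to converge to Nash); the smoothed initial counts and tie-breaking are my own illustrative choices:

```python
# Sketch: two-player fictitious play on rock-paper-scissors.
# Because RPS is symmetric zero-sum (PAYOFF[b][a] == -PAYOFF[a][b]),
# both players can best-respond using the same matrix, row-indexed by
# their own action.

PAYOFF = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]  # payoff to the row index

def best_response(opp_counts):
    """Best response to the empirical distribution of the opponent."""
    total = sum(opp_counts)
    probs = [c / total for c in opp_counts]
    values = [sum(p * PAYOFF[a][b] for b, p in enumerate(probs))
              for a in range(3)]
    return max(range(3), key=lambda a: values[a])

def fictitious_play(rounds=30000):
    counts = [[1, 1, 1], [1, 1, 1]]   # smoothed initial action counts
    for _ in range(rounds):
        a = best_response(counts[1])  # player 0 models player 1
        b = best_response(counts[0])  # player 1 models player 0
        counts[0][a] += 1
        counts[1][b] += 1
    total = sum(counts[0])
    return [c / total for c in counts[0]]

freqs = fictitious_play()
# freqs should approach the mixed Nash equilibrium (1/3, 1/3, 1/3),
# even though actual play cycles R -> P -> S in ever-longer runs.
```

Watching the sequence of actions (rather than the averages) makes the limit-cycle behavior described above directly visible.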
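The replicator equation above can likewise be integrated numerically. A minimal sketch on a Hawk-Dove game, where the interior Nash equilibrium x_Hawk = V/C happens to be an attractor; the parameters V, C and the Euler step size are illustrative choices of mine:

```python
# Sketch: Euler integration of the replicator equation
#   dx_k/dt = x_k * (u(e_k, x) - u(x, x))
# on a 2-strategy Hawk-Dove game.

V, C = 2.0, 4.0                   # value of resource, cost of fighting
PAYOFF = [[(V - C) / 2, V],       # Hawk vs (Hawk, Dove)
          [0.0,         V / 2]]   # Dove vs (Hawk, Dove)

def replicator_step(x, dt=0.01):
    # fitness[k] = u(e_k, x); avg = u(x, x)
    fitness = [sum(PAYOFF[k][j] * x[j] for j in range(2)) for k in range(2)]
    avg = sum(x[k] * fitness[k] for k in range(2))
    return [x[k] + dt * x[k] * (fitness[k] - avg) for k in range(2)]

x = [0.9, 0.1]                    # start far from equilibrium
for _ in range(20000):
    x = replicator_step(x)
# x[0] should now be very close to the interior equilibrium V/C = 0.5.
```

Note that the update conserves sum(x) exactly, since the growth terms weighted by x_k sum to zero; in games like RPS the same integrator instead orbits the fixed point, illustrating that Nash fixed points need not be attractors.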
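The limit-cycle outcome of the iterated-gradient-ascent analysis can be seen on matching pennies, where u1(p, q) = (2p-1)(2q-1) and u2 = -u1. A minimal sketch (step size, start point, and the clipping-to-[0,1] projection are my own illustrative choices):

```python
# Sketch: iterated gradient ascent, in the spirit of the Singh-Kearns-
# Mansour analysis, on matching pennies. Each player takes a gradient
# step on its own payoff w.r.t. its own mixed strategy.

def clip(v):
    return max(0.0, min(1.0, v))

def iga(p, q, eta=0.01, steps=5000):
    traj = [(p, q)]
    for _ in range(steps):
        dp = 2 * (2 * q - 1)      # d u1 / d p
        dq = -2 * (2 * p - 1)     # d u2 / d q
        p, q = clip(p + eta * dp), clip(q + eta * dq)  # simultaneous update
        traj.append((p, q))
    return traj

traj = iga(0.7, 0.5)
# The strategies orbit the mixed Nash equilibrium (0.5, 0.5) rather than
# converging to it -- the limit-cycle case of the analysis above.
```

Plotting `traj` in the unit square makes the orbit around (0.5, 0.5) immediately visible.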
4. Dumb Single-Agent Learning
- Use a single-agent algorithm in a multi-agent problem and hope that it works:
  - no-regret learning by pricebots (Greenwald & Kephart);
  - simultaneous Q-learning by pricebots (Tesauro & Kephart).
- In many cases this actually works: the learners converge either exactly or approximately to self-consistent optimal strategies.

"Sophisticated" Approaches
- These take into account the possibility that the other agents' strategies might change.

5. Multi-Agent Q-learning
- Minimax-Q (Littman): a convergent algorithm for two-player zero-sum stochastic games.
- Nash-Q (Hu & Wellman): a convergent algorithm for two-player general-sum stochastic games; requires the use of a Nash equilibrium solver.

More sophisticated approaches...

6. Varying learning rates
- WoLF, "Win or Learn Fast" (Bowling): the agent reduces its learning rate when performing well and increases it when doing badly. Improves the convergence of IGA and of policy hill-climbing.
- Multi-timescale Q-learning (Leslie): different agents use different power laws t^(-n) for learning-rate decay; this achieves simultaneous convergence where ordinary Q-learning doesn't.

More sophisticated approaches...

7. "Strategic Teaching"
- Recognizes that the other players' strategies are adaptive: "A strategic teacher may play a strategy which is not myopically optimal (such as cooperating in Prisoner's Dilemma) in the hope that it induces adaptive players to expect that strategy in the future, which triggers a best-response that benefits the teacher." (Camerer, Ho and Chong)

Theoretical Research Challenges
- What is the proper theoretical formulation?
- "No short-cut" hypothesis: massive on-line search à la Deep Blue to maximize expected long-term reward (Bayesian):
  - model and predict the behavior of the other players, including how they learn based on my actions (beware of infinite model recursion);
  - trial-and-error exploration;
  - continual Bayesian inference using all evidence over all uncertainties (Boutilier: Bayesian exploration).
- When can you get away with simpler methods?
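The "dumb single-agent learning" idea can be sketched with two independent epsilon-greedy Q-learners repeatedly playing the Prisoner's Dilemma, each treating the other as part of a stationary environment. This is an illustrative toy, not the pricebot setup from the cited papers; the payoffs, rates, and episode count are my own choices:

```python
import random

# Sketch: two independent stateless Q-learners on the repeated
# Prisoner's Dilemma. Each ignores the other's adaptation entirely.

PD = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
      ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}
ACTIONS = ['C', 'D']

def choose(q, eps, rng):
    if rng.random() < eps:
        return rng.choice(ACTIONS)          # explore
    return max(ACTIONS, key=lambda a: q[a]) # exploit

def train(episodes=50000, alpha=0.1, eps=0.1, seed=0):
    rng = random.Random(seed)
    q1 = {'C': 0.0, 'D': 0.0}
    q2 = {'C': 0.0, 'D': 0.0}
    for _ in range(episodes):
        a1, a2 = choose(q1, eps, rng), choose(q2, eps, rng)
        r1, r2 = PD[(a1, a2)]
        q1[a1] += alpha * (r1 - q1[a1])     # stateless Q-update
        q2[a2] += alpha * (r2 - q2[a2])
    return q1, q2

q1, q2 = train()
# Both learners typically settle on the dominant strategy D (defect),
# a self-consistent (Nash) outcome -- naive learning "works" here.
```

Here single-agent learning happens to land on the game's equilibrium because defection is dominant; in games without dominant strategies the same approach can cycle or mis-converge.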
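The core step of minimax-Q is solving the zero-sum matrix game formed by the current Q-values at each state. In general this needs a linear program, but for 2x2 games a closed form exists; a minimal sketch (function and variable names are my own):

```python
# Sketch: closed-form maximin solution of a 2x2 zero-sum matrix game,
# the building block minimax-Q applies to the Q-value matrix of a state.

def solve_2x2_zero_sum(g):
    """Return (row player's maximin mixed strategy, game value) for a
    2x2 zero-sum game g, where g[i][j] is the row player's payoff."""
    (a, b), (c, d) = g
    # Pure saddle point: an entry minimal in its row, maximal in its column.
    for i, row in enumerate(g):
        for j in range(2):
            if g[i][j] == min(row) and g[i][j] == max(g[0][j], g[1][j]):
                return [1.0 - i, float(i)], g[i][j]
    # Otherwise the maximin strategy is mixed: equalize the two columns.
    denom = a - b - c + d
    p0 = (d - c) / denom
    value = (a * d - b * c) / denom
    return [p0, 1 - p0], value

# Matching pennies: value 0, play each action with probability 1/2.
strategy, value = solve_2x2_zero_sum([[1, -1], [-1, 1]])
```

The mixed case follows from making the row player's expected payoff equal against both opponent columns: p*a + (1-p)*c = p*b + (1-p)*d gives p = (d-c)/(a-b-c+d).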
Real-World Opportunities
- Multi-agent systems where you can't do game theory (covers everything :-)):
  - electronic marketplaces (Kephart)
  - mobile networks (Chang)
  - self-managing computer systems (Kephart)
  - teams of robots (Bowling, Stone)
  - video games
  - military/counter-terrorism applications