Multi-Agent Planning in Complex Uncertain Environments
Daphne Koller, Stanford University
Joint work with: Carlos Guestrin (CMU), Ronald Parr (Duke)

Collaborative Multiagent Planning
- Long-term goals, multiple agents, coordinated decisions
- Applications: search and rescue, firefighting, factory management, multi-robot tasks (RoboSoccer), network routing, air traffic control, computer game playing

Joint Planning Space
- Joint action space: each agent i takes an action ai at each step; joint action a = {a1, ..., an} over all agents
- Joint state space: an assignment x1, ..., xn to a set of variables X1, ..., Xn; joint state x = {x1, ..., xn} of the entire system
- Joint system: payoffs and state dynamics depend on the joint state and the joint action
- Cooperative agents: want to maximize the total payoff

Exploiting Structure
- Real-world problems have hundreds of objects and googols of states
- Real-world problems have structure!
- Approach: exploit the structured representation to obtain an efficient approximate solution

Outline
- Action coordination: factored value functions, coordination graphs, context-specific coordination
- Joint planning: multi-agent Markov decision processes, efficient linear programming solution, decentralized market-based solution
- Generalizing to new environments: relational MDPs, generalizing value functions

One-Shot Optimization Task
- The Q-function Q(x,a) encodes the agents' payoff for joint action a in joint state x
- Agents' task: compute argmax_a Q(x,a)
- Obstacles: the number of joint actions is exponential, and the task requires complete state observability and full agent communication

Factored Payoff Function [K. & Parr ’99,’00] [Guestrin, K., Parr ’01]
- Approximate the Q-function as a sum of Q sub-functions
- Each sub-function depends on a local part of the system: two interacting agents, an agent and an important resource, two inter-dependent pieces of machinery
- Q(A1,...,A4, X1,...,X4) ≈ Q1(A1,A4, X1,X4) + Q2(A1,A2, X1,X2) + Q3(A2,A3, X2,X3) + Q4(A3,A4, X3,X4)

Distributed Q Function [Guestrin, K., Parr ’01]
- Q sub-functions are assigned to the relevant agents (figure: agents 1-4 in a ring, each holding the sub-functions it participates in)

Multiagent Action Selection
- Instantiate the current state x in the distributed Q-function
- Compute the maximal joint action argmax_a of the sum

Instantiating State x
- Limited observability: agent i only observes the state variables appearing in its Qi (e.g., the agent holding Q2(A1,A2, X1,X2) observes only X1 and X2)

Choosing Action at State x
- After instantiating x, each sub-function depends on actions only: Q1(A1,A4), Q2(A1,A2), Q3(A2,A3), Q4(A3,A4)

Variable Elimination
- Use variable elimination for the maximization:
  max_{A1,A2,A3,A4} Q1(A1,A4) + Q2(A1,A2) + Q3(A2,A3) + Q4(A3,A4)
  = max_{A1,A2,A4} Q1(A1,A4) + Q2(A1,A2) + max_{A3} [ Q3(A2,A3) + Q4(A3,A4) ]
  = max_{A1,A2,A4} Q1(A1,A4) + Q2(A1,A2) + g1(A2,A4)
- g1(A2,A4) records the value of the optimal A3 action for each combination, e.g. a table with entries Attack/Attack: 5, Attack/Defend: 6, Defend/Attack: 8, Defend/Defend: 12
- Limited communication suffices for the optimal action choice
- Communication bandwidth = tree-width of the coordination graph
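The maximization above can be sketched in code. Below is a minimal, illustrative implementation of max-sum variable elimination on the four-agent coordination graph; the payoff numbers, the elimination order, and the function and variable names are invented for this sketch rather than taken from the talk.

```python
# Minimal sketch: action selection on a coordination graph via variable
# elimination (max-sum).  All payoff values below are illustrative.
from itertools import product

ACTIONS = ["attack", "defend"]            # domain of every agent action A_i

def make_factor(scope, fn):
    """A factor is (scope, table): table maps an action tuple to a payoff."""
    return (scope, {assign: fn(*assign) for assign in product(ACTIONS, repeat=len(scope))})

Q1 = make_factor(("A1", "A4"), lambda a1, a4: 3.0 if a1 == a4 else 1.0)
Q2 = make_factor(("A1", "A2"), lambda a1, a2: 2.0 if a1 == "attack" else 0.5)
Q3 = make_factor(("A2", "A3"), lambda a2, a3: 4.0 if a2 != a3 else 0.0)
Q4 = make_factor(("A3", "A4"), lambda a3, a4: 5.0 if (a3, a4) == ("defend", "defend") else 1.0)

def max_out(factors, var):
    """Combine all factors mentioning `var`, maximize it out, and record the
    maximizing choice for later backtracking."""
    touching = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    new_scope = tuple(sorted({v for s, _ in touching for v in s if v != var}))
    new_table, best_choice = {}, {}
    for assign in product(ACTIONS, repeat=len(new_scope)):
        ctx = dict(zip(new_scope, assign))
        best_val, best_a = None, None
        for a in ACTIONS:
            ctx[var] = a
            val = sum(t[tuple(ctx[v] for v in s)] for s, t in touching)
            if best_val is None or val > best_val:
                best_val, best_a = val, a
        new_table[assign] = best_val
        best_choice[assign] = best_a
    return rest + [(new_scope, new_table)], (new_scope, best_choice)

def select_joint_action(factors, order):
    """Return (max value, argmax joint action) by eliminating agents in `order`."""
    records = []
    for var in order:
        factors, rec = max_out(factors, var)
        records.append((var, rec))
    value = sum(t[()] for _, t in factors)        # only empty-scope factors remain
    action = {}
    for var, (scope, best_choice) in reversed(records):   # backtrack for the argmax
        action[var] = best_choice[tuple(action[v] for v in scope)]
    return value, action

value, action = select_joint_action([Q1, Q2, Q3, Q4], order=["A3", "A1", "A2", "A4"])
print(value, action)
```

The backtracking pass recovers the argmax joint action, and the largest intermediate table (here over two agents) reflects the tree-width bound on the required communication.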
Coordination Graphs
- Communication follows the triangulated graph
- Computation grows exponentially in the tree width: a graph-theoretic measure of "connectedness" that arises in Bayesian networks, CSPs, ...
- Cost is exponential in the worst case, but fairly low for many real graphs
- (Figure: coordination graph over agents A1-A11)

Context-Specific Interactions
- The payoff structure can vary by context, e.g., agents A1 and A2 both trying to pass through the same narrow corridor X
- Can use context-specific "value rules": <At(X,A1) ∧ At(X,A2) ∧ A1 = fwd ∧ A2 = fwd : -100>
- Hope: context-specific payoffs will induce context-specific coordination

Context-Specific Coordination (rule-based variable elimination [Zhang & Poole ’99])
- (Figure: coordination graph over agents A1-A6 annotated with value rules such as <a1 ∧ a5 ∧ x : 4>, <a5 ∧ a6 ∧ x : 2>, <a1 ∧ a2 ∧ x : 5>, <a3 ∧ a4 ∧ x : 3>)
- Instantiate the current state (x = true): rules inconsistent with the observed context drop out, so the coordination structure varies based on context
- Maximizing out an agent (e.g., eliminating A1 from the graph) uses rule-based variable elimination, so the coordination structure also varies with the communication and with the agents' decisions
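To illustrate how context-specific value rules can induce context-specific coordination, here is a small, hypothetical sketch: a rule is a (context, value) pair, instantiating the observed state prunes inconsistent rules, and coordination edges connect agents that still appear together in some active rule. The rule encoding and helper names are assumptions made for this sketch, not the representation used in the papers or the UvA code.

```python
# Minimal sketch of context-specific value rules (illustrative encoding).

def consistent(context, assignment):
    """True if `assignment` does not contradict `context` on shared variables."""
    return all(assignment.get(var, val) == val for var, val in context.items())

def instantiate_state(rules, state):
    """Keep rules consistent with the observed state; drop their (now satisfied)
    state conditions so only action conditions remain."""
    active = []
    for context, value in rules:
        if consistent(context, state):
            action_part = {v: val for v, val in context.items() if v not in state}
            active.append((action_part, value))
    return active

def coordination_edges(rules):
    """Agents appearing together in some active rule must coordinate."""
    edges = set()
    for context, _ in rules:
        agents = sorted(context)
        edges |= {(a, b) for i, a in enumerate(agents) for b in agents[i + 1:]}
    return edges

# Two robots trying to pass through the same narrow corridor X:
rules = [
    ({"At(X,A1)": True, "At(X,A2)": True, "A1": "fwd", "A2": "fwd"}, -100.0),
    ({"At(X,A1)": True, "A1": "fwd"}, 10.0),
    ({"At(X,A2)": True, "A2": "fwd"}, 10.0),
]

# Context 1: both robots at the corridor -> they must coordinate.
state = {"At(X,A1)": True, "At(X,A2)": True}
print(coordination_edges(instantiate_state(rules, state)))   # {('A1', 'A2')}

# Context 2: only A1 is at the corridor -> no coordination edge needed.
state = {"At(X,A1)": True, "At(X,A2)": False}
print(coordination_edges(instantiate_state(rules, state)))   # set()
```

Maximization over the surviving action rules would then proceed by rule-based variable elimination [Zhang & Poole ’99], analogously to the max-sum sketch shown earlier.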
Robot Soccer (Kok, Vlassis & Groen, University of Amsterdam)
- UvA Trilearn 2002 won the German Open 2002, but placed fourth in RoboCup-2002
- "... the improvements introduced in UvA Trilearn 2003 ... include an extension of the intercept skill, improved passing behavior and especially the usage of coordination graphs to specify the coordination requirements between the different agents."

RoboSoccer Value Rules
- Coordination-graph rules include conditions on the player's role and on aspects of the global system state
- Example rules for player i in the role of passer value a pass to player j depending on the distance of j to the goal after the move

UvA Trilearn 2003 Results (total score 177-7)
- Rounds 1-3: Mainz Rolling Brains (Germany) 4-0, Iranians (Iran) 31-0, Sahand (Iran) 39-0, a4ty (Latvia) 25-0, Helios (Iran) 2-1, AT-Humboldt (Germany) 5-0, ZJUBase (China) 6-0, Aria (Iran) 6-0, Hana (Japan) 26-0, Zenit-NewERA (Russia) 4-0, RoboSina (Iran) 6-0, Wright Eagle (China) 3-1, Everest (China) 7-1, Aria (Iran) 5-0
- Semi-final: Brainstormers (Germany) 4-1
- Final: TsinghuAeolus (China) 4-3
- UvA Trilearn went on to win the German Open 2003, US Open 2003, RoboCup 2003, and German Open 2004

Outline (recap)
- Action coordination (covered)
- Next: joint planning: multi-agent MDPs, efficient linear programming solution, decentralized market-based solution
- Then: generalizing to new environments

Real-time Strategy Game
- Peasants collect resources and build
- Footmen attack enemies
- Buildings train peasants and footmen
- (Figure: game screenshot with a peasant, a footman, and a building labeled)

Planning Over Time
- Markov decision process (MDP) representation:
- Action space: joint agent actions a = {a1, ..., an}
- State space: joint state descriptions x = {x1, ..., xn}
- Momentary reward function R(x,a)
- Probabilistic system dynamics P(x'|x,a)

Policy
- A policy assigns a joint action to every state: π(x) = a
- Example: π(x0) = both peasants get wood; π(x1) = one peasant gets gold, the other builds the barracks; π(x2) = peasants get gold, footmen attack

Value of Policy
- Value Vπ(x): expected long-term reward starting from x
- Vπ(x0) = E[ R(x0) + γ R(x1) + γ² R(x2) + γ³ R(x3) + γ⁴ R(x4) + ... ]
- Future rewards are discounted by γ ∈ [0,1)

Optimal Long-term Plan
- Optimal policy π*(x) and optimal Q-function Q*(x,a)
- Bellman equations:
  Q*(x,a) = R(x,a) + γ Σ_{x'} P(x'|x,a) V*(x')
  V*(x) = max_a Q*(x,a)
- Optimal policy: π*(x) = argmax_a Q*(x,a)

Solving an MDP
- Solving the Bellman equations yields the optimal value V*(x) and the optimal policy π*(x)
- Many algorithms solve them: policy iteration [Howard ’60, Bellman ’57], value iteration [Bellman ’57], linear programming [Manne ’60], ...

LP Solution to MDP
- minimize Σ_x V(x)
  subject to V(x) ≥ Q(x,a) for all x, a
- One variable V(x) for each state, one constraint for each state x and action a
- Polynomial-time solution

Are We Done?
- Planning is polynomial in the number of states and actions
- But the number of states is exponential in the number of variables, and the number of actions is exponential in the number of agents
- Efficient approximation by exploiting structure!
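Before turning to that structured approximation, it helps to see the flat LP from the "LP Solution to MDP" slide written out concretely. Below is a minimal sketch for a made-up 2-state, 2-action MDP using scipy; all numbers are illustrative.

```python
# Minimal sketch: exact LP formulation of a tiny flat MDP (Manne '60).
import numpy as np
from scipy.optimize import linprog

gamma = 0.9
n_states, n_actions = 2, 2
# P[a, x, x'] and R[x, a] for an illustrative MDP.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],     # action 0
              [[0.5, 0.5], [0.3, 0.7]]])    # action 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])

# minimize sum_x V(x)  s.t.  V(x) >= R(x,a) + gamma * sum_x' P(x'|x,a) V(x')
# rewritten for linprog's "A_ub @ V <= b_ub" form.
A_ub, b_ub = [], []
for x in range(n_states):
    for a in range(n_actions):
        A_ub.append(-(np.eye(n_states)[x] - gamma * P[a, x]))
        b_ub.append(-R[x, a])

res = linprog(c=np.ones(n_states), A_ub=np.array(A_ub), b_ub=b_ub,
              bounds=[(None, None)] * n_states, method="highs")
V = res.x
# Greedy (optimal) policy: argmax_a of Q(x,a) = R(x,a) + gamma * P(.|x,a) . V
policy = [int(np.argmax([R[x, a] + gamma * P[a, x] @ V for a in range(n_actions)]))
          for x in range(n_states)]
print(V, policy)
```

With one variable per state and one constraint per state-action pair, this LP is polynomial in |X| and |A|, which is exactly what breaks down when the state is described by many variables and the action by many agents.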
Structured Representation: Factored MDP [Boutilier et al. ’95]
- (Figure: dynamic Bayesian network from time t to t+1 over variables Peasant, Gold, Footman, Enemy and actions APeasant, ABuild, AFootman, e.g. P(F'|F, G, AB, AF))
- State dynamics, decisions, and rewards are all represented in factored form
- Complexity of representation: exponential in the number of parents (worst case)

Structured Value Function?
- (Figure: factored MDP structure unrolled over time t, t+1, t+2, t+3 with variables X, Y, Z and rewards R at each step)
- Does a factored MDP imply structure in V*? Almost!
- A factored V often provides a good approximate value function

Structured Value Functions [Bellman et al. ’63], [Tsitsiklis & Van Roy ’96], [K. & Parr ’99,’00]
- Approximate V* as a factored (linear) value function: V(x) = Σ_i wi hi(x)
- In the rule-based case: hi is a rule concerning a small part of the system, and wi is the value associated with the rule
- Goal: find w giving a good approximation V to V*
- A factored value function V = Σ wi hi yields a factored Q-function Q = Σ Qi, so the coordination graph can be used for action selection

Approximate LP Solution
- minimize Σ_x Σ_i wi hi(x)
  subject to Σ_i wi hi(x) ≥ Σ_i Qi(x,a) for all x, a
- One variable wi per basis function: a polynomial number of LP variables
- One constraint for every state and action: exponentially many LP constraints

So What Now? [Guestrin, K., Parr ’01]
- subject to Σ_i wi hi(x) ≥ Σ_i Qi(x,a) for all x, a
  ⇔ 0 ≥ Σ_i Qi(x,a) - Σ_i wi hi(x) for all x, a
  ⇔ 0 ≥ max_{x,a} [ Σ_i Qi(x,a) - Σ_i wi hi(x) ]
- Exponentially many linear constraints = one nonlinear constraint

Variable Elimination Revisited [Guestrin, K., Parr ’01]
- Use variable elimination to represent the constraints:
  0 ≥ max_{A,B,C} f1(A,B) + f2(A,C) + max_D [ f3(C,D) + f4(B,D) ]
  0 ≥ max_{A,B,C} f1(A,B) + f2(A,C) + g1(B,C), where for all B, C, D: g1(B,C) ≥ f3(C,D) + f4(B,D)
- Exponentially fewer constraints
- A polynomial-size LP for finding a good factored approximation to V*

Network Management Problem
- Each computer runs processes; computer status ∈ {good, dead, faulty}
- Dead neighbors increase the probability of dying
- Reward for successful processes
- Each SysAdmin takes a local action ∈ {reboot, not reboot}
- Topologies: ring, ring of rings, star, k-grid

Scaling of Factored LP
- Number of constraints: explicit LP 2^n vs. factored LP (n+1-k)·2^k, where k is the tree-width
- (Figure: number of constraints vs. number of variables, 2 to 16; the explicit LP curve explodes while the factored LP curves for k = 3, 5, 8, 10, 12 stay low)

Multiagent Running Time
- (Figure: total running time in seconds vs. number of agents, 2 to 16, for ring of rings, star with pair basis, and star with single basis)

Strategic 2x2
- Offline: factored MDP model with 2 peasants, 2 footmen, enemy, gold, wood, barracks (~1 million state/action pairs); the factored LP computes the value function
- Online: the coordination graph computes argmax_a Q(x,a), observing the state x from the world and returning the joint action a

Demo: Strategic 2x2 (Guestrin, Koller, Gearhart & Kanodia)
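The approximate LP over a linear value function V_w(x) = Σ_i wi hi(x) can likewise be sketched for a toy problem. The sketch below simply enumerates all state-action constraints, which is only feasible because the example is tiny; the factored LP construction above exists precisely to avoid that enumeration. The dynamics, reward, and basis functions are illustrative assumptions.

```python
# Minimal sketch: approximate LP with a linear value function, with the
# (x, a) constraints enumerated explicitly for a tiny illustrative problem.
import numpy as np
from itertools import product
from scipy.optimize import linprog

gamma = 0.95
states = list(product([0, 1], repeat=2))     # two binary state variables X1, X2
actions = list(product([0, 1], repeat=2))    # two binary agent actions A1, A2

def transition(x, a):
    """Illustrative factored dynamics: X_i tends to become 1 if A_i = 1."""
    probs = {}
    for x_next in states:
        p = 1.0
        for i in range(2):
            p_on = 0.9 if a[i] == 1 else 0.2
            p *= p_on if x_next[i] == 1 else 1.0 - p_on
        probs[x_next] = p
    return probs

def reward(x, a):
    return float(x[0] + x[1])                # reward for variables that are "up"

# Basis: a constant feature plus one indicator per state variable.
basis = [lambda x: 1.0, lambda x: float(x[0]), lambda x: float(x[1])]
H = np.array([[h(x) for h in basis] for x in states])          # |X| x k

A_ub, b_ub = [], []
for xi, x in enumerate(states):
    for a in actions:
        probs = transition(x, a)
        Eh = sum(p * H[states.index(xn)] for xn, p in probs.items())  # E[h_i(x')|x,a]
        # Constraint: sum_i w_i h_i(x) >= R(x,a) + gamma * sum_i w_i E[h_i(x')]
        A_ub.append(-(H[xi] - gamma * Eh))
        b_ub.append(-reward(x, a))

res = linprog(c=H.sum(axis=0), A_ub=np.array(A_ub), b_ub=b_ub,
              bounds=[(None, None)] * len(basis), method="highs")
w = res.x
print("weights:", w)
print("V_w:", {x: float(H[i] @ w) for i, x in enumerate(states)})
```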
Limited Interaction MDPs [Guestrin & Gordon, ’02]
- Some MDPs have additional structure: the agents are largely autonomous and interact only in limited ways, e.g., competing for resources
- Such an MDP can be decomposed into a set of agent-based MDPs with a limited interface (figure: MDPs M1 and M2 sharing a few variables)
- In such MDPs, the LP matrix is highly structured
- Dantzig-Wolfe LP decomposition can solve the LP optimally, in a decentralized way
- This gives rise to a market-like algorithm with multiple agents and a centralized "auctioneer"

Auction-Style Planning [Guestrin & Gordon, ’02]
- Each agent solves a local (stand-alone) MDP
- Agents send "constraint messages" to the auctioneer: they must agree on a "policy" for the shared variables
- The auctioneer sends "pricing messages" back to the agents: prices are set based on conflicts, reflect penalties for constraint violations, and influence the agents' rewards in their local MDPs
- (Figure: auctioneer exchanging prices ($) with agents, who plan locally)

Fuel Allocation Problem [Bererton, Gordon, Thrun & Khosla ’03]
- UAVs share a pot of fuel
- Targets have varying priority
- Target interference is ignored
- (Figure: UAV start location and targets on a map)

High-Speed Robot Paintball (Bererton, Gordon & Thrun)
- (Figure: maps for game variant 1 and game variant 2 showing coordination points and sensor placement; x = start location, + = goal location)

Outline (recap)
- Action coordination and joint planning (covered)
- Next: generalizing to new environments: relational MDPs, generalizing value functions

Generalizing to New Problems
- Many problems are "similar": solving Problems 1 through n should yield a good solution to Problem n+1
- But MDPs are different: each has its own set of states, actions, rewards, transitions, ...

Generalizing with Relational MDPs
- "Similar" domains have similar "types" of objects
- Relational MDPs exploit these similarities by computing generalizable value functions
- Generalization avoids the need to replan and lets us tackle larger problems

Relational Models and MDPs [Guestrin, K., Gearhart & Kanodia ’03]
- Classes: Peasant, Footman, Gold, Barracks, Enemy, ...
- Relations: Collects, Builds, Trains, Attacks, ...
- Instances: Peasant1, Peasant2, Footman1, Enemy1, ...
- Builds on probabilistic relational models [K. & Pfeffer ’98]

Relational MDPs [Guestrin, K., Gearhart & Kanodia ’03]
- Class-level transition probabilities depend on the object's attributes, its actions, and the attributes of related objects (e.g., a Footman's next Health depends on its own Health, its action AFootman, and the Health of my_enemy)
- Class-level reward function
- Very compact representation: it does not depend on the number of objects
- (Figure: Footman and Enemy class schemas with the my_enemy link, the Count aggregate, and the reward R)

World is a Large Factored MDP
- Relational MDP + an instantiation (a world): the number of objects (instances of each class) and the links between the instances
- Together these define a well-defined factored MDP

MDP with 2 Footmen and 2 Enemies
- (Figure: ground DBN with F1.Health, F1.A, E1.Health, their next-step copies F1.H' and E1.H', and reward R1, plus the analogous F2/E2/R2 slice)

World is a Large Factored MDP (continued)
- Instantiate the world and use the factored LP for planning
- But the ground MDP must be solved anew for every world: we have gained nothing!
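The grounding step just described (relational MDP + world = factored MDP) can be sketched as a small data-structure exercise. The Footman/Enemy classes and the dependence of a footman's health on its enemy's health follow the slides; the field names, link slots, and the `ground` function are hypothetical.

```python
# Minimal sketch: grounding a relational MDP into a factored MDP.
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class RMDPClass:
    name: str
    attributes: List[str]                       # state variables per object
    actions: List[str]                          # action variables per object
    # parents of each next-step attribute: (attribute, which_object) pairs,
    # where which_object is "self" or a link slot name (e.g. "my_enemy")
    parents: Dict[str, List[Tuple[str, str]]]

Footman = RMDPClass(
    name="Footman", attributes=["Health"], actions=["A"],
    parents={"Health": [("Health", "self"), ("A", "self"), ("Health", "my_enemy")]})
Enemy = RMDPClass(
    name="Enemy", attributes=["Health"], actions=[],
    parents={"Health": [("Health", "self"), ("Health", "my_footman")]})

def ground(world: Dict[str, RMDPClass], links: Dict[str, Dict[str, str]]):
    """Produce the ground factored-MDP scopes: for every object attribute,
    the list of ground variables its next-step distribution depends on."""
    scopes = {}
    for obj, cls in world.items():
        for attr in cls.attributes:
            parent_vars = []
            for (p_attr, slot) in cls.parents[attr]:
                target = obj if slot == "self" else links[obj][slot]
                parent_vars.append(f"{target}.{p_attr}")
            scopes[f"{obj}.{attr}'"] = parent_vars
    return scopes

# A 2-footmen / 2-enemies world; the same templates would ground a 9x3 world.
world = {"F1": Footman, "F2": Footman, "E1": Enemy, "E2": Enemy}
links = {"F1": {"my_enemy": "E1"}, "F2": {"my_enemy": "E2"},
         "E1": {"my_footman": "F1"}, "E2": {"my_footman": "F2"}}
for var, parents in ground(world, links).items():
    print(var, "<-", parents)
```

Only the object list and the links change from world to world; the class templates stay fixed.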
Class-level Value Functions
- V(F1.H, E1.H, F2.H, E2.H) = VF1(F1.H, E1.H) + VE1(E1.H) + VF2(F2.H, E2.H) + VE2(E2.H)
- Units are interchangeable: tie the per-object components at the class level, VF1 = VF2 = VF and VE1 = VE2 = VE
- At state x, each footman still makes a different contribution to V
- Given the class-level weights wC, we can instantiate the value function for any world
- (Figure: bar charts of the per-object values over the alive/dead combinations, e.g. F1 alive/E1 alive, F1 alive/E1 dead, F1 dead/E1 alive, F1 dead/E1 dead)

Factored LP-based Generalization
- Sample a set of small worlds (Sample Set I, e.g. footman/enemy pairs F1/E1 and F2/E2)
- The class-level factored LP computes the class value functions VF and VE from the sampled worlds
- Generalize: the same VF and VE instantiate a value function for new objects (e.g. F3, E3) in unseen worlds
- How many samples are needed?
- (Figure: bar charts of the sampled and generalized per-object value functions over alive/dead combinations)

Sampling Complexity
- There are exponentially many worlds, and the number of objects in a world is unbounded
- Do we need exponentially many samples? Must we sample very large worlds? NO!

Theorem
- Sample m small worlds of up to O(ln 1/ε) objects each
- With m samples, the resulting value function is within O(ε) of the class-level value function optimized over all worlds, with probability at least 1 - δ
- (RCmax is the maximum class reward)

Strategic 2x2
- Offline: relational MDP model with 2 peasants, 2 footmen, enemy, gold, wood, barracks (~1 million state/action pairs); the factored LP computes the value function
- Online: the coordination graph computes argmax_a Q(x,a)

Strategic 9x3
- Offline: relational MDP model with 9 peasants, 3 footmen, enemy, gold, wood, barracks (~3 trillion state/action pairs, growing exponentially in the number of agents); the factored LP computes the value function
- Online: the coordination graph computes argmax_a Q(x,a)

Strategic Generalization
- Offline: relational MDP model with 2 peasants, 2 footmen, enemy, gold, wood, barracks (~1 million state/action pairs); the factored LP computes the class-level value function wC
- Online: the instantiated Q-functions for 9 peasants, 3 footmen, enemy, gold, wood, barracks (~3 trillion state/action pairs) grow only polynomially in the number of agents; the coordination graph computes argmax_a Q(x,a)

Demo: Generalized 9x3 (Guestrin, Koller, Gearhart & Kanodia)

Tactical Generalization
- Planned in the 3-footmen-versus-3-enemies game, generalized to 4 footmen versus 4 enemies (3 v. 3 to 4 v. 4)

Demo: Planned Tactical 3x3 (Guestrin, Koller, Gearhart & Kanodia)

Demo: Generalized Tactical 4x4 (Guestrin, Koller, Gearhart & Kanodia) [Guestrin, K., Gearhart & Kanodia ’03]
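Finally, a small sketch of what "given wC, instantiate the value function for any world" can mean in code: one weight vector per class, summed over all objects of that class. The basis features and weight values below are invented; the real class-level weights come from the class-level factored LP.

```python
# Minimal sketch: instantiating a class-level value function in new worlds.

def h_footman(h_f, h_e):
    """Per-footman basis features over its own and its enemy's health (illustrative)."""
    return [1.0, float(h_f > 0), float(h_f > 0 and h_e == 0)]

def h_enemy(h_e):
    return [1.0, float(h_e == 0)]

w_footman = [0.5, 3.0, 5.0]     # class-level weights w_C (illustrative numbers)
w_enemy = [0.2, 4.0]

def value(state, pairing):
    """V(x) = sum over footmen of w_F . h_F(x_f, x_enemy(f))
            + sum over enemies of w_E . h_E(x_e); works for any number of objects."""
    total = 0.0
    for f, e in pairing.items():
        total += sum(w * h for w, h in zip(w_footman, h_footman(state[f], state[e])))
    for e in set(pairing.values()):
        total += sum(w * h for w, h in zip(w_enemy, h_enemy(state[e])))
    return total

# The same class-level weights score a 2v2 world ...
print(value({"F1": 3, "F2": 0, "E1": 0, "E2": 2},
            pairing={"F1": "E1", "F2": "E2"}))
# ... and, without replanning, a 4v4 world.
state_4v4 = {**{f"F{i}": 3 for i in range(1, 5)}, **{f"E{i}": 1 for i in range(1, 5)}}
print(value(state_4v4, pairing={f"F{i}": f"E{i}" for i in range(1, 5)}))
```

Because the weights are per class rather than per object, the same w_F and w_E score worlds with any number of footmen and enemies.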
Summary
- Structured multi-agent MDPs enable: distributed coordinated action selection, effective planning under uncertainty, and generalization to new problems

Important Questions
- Continuous spaces
- Complex actions
- Partial observability
- Learning to act
- How far can we go??

http://robotics.stanford.edu/~koller
Collaborators: Carlos Guestrin, Chris Gearhart, Neal Kanodia, Shobha Venkataraman, Ronald Parr, Curt Bererton, Geoff Gordon, Sebastian Thrun, Jelle Kok, Matthijs Spaan, Nikos Vlassis
©2004 – Carlos Guestrin, Daphne Koller