Multi-Agent Planning in Complex Uncertain Environments
Daphne Koller
Stanford University
Joint work with: Carlos Guestrin (CMU), Ronald Parr (Duke)
Collaborative Multiagent Planning

Multiple agents making coordinated decisions to achieve long-term goals:
- Search and rescue, firefighting
- Factory management
- Multi-robot tasks (RoboSoccer)
- Network routing
- Air traffic control
- Computer game playing
Joint Planning Space

- Joint action space: each agent i takes an action ai at each step; the joint action is a = {a1,…,an} over all agents
- Joint state space: an assignment x1,…,xn to a set of variables X1,…,Xn; the joint state of the entire system is x = {x1,…,xn}
- Joint system: payoffs and state dynamics depend on the joint state and joint action
- Cooperative agents: want to maximize the total payoff
Exploiting Structure

Real-world problems have:
- Hundreds of objects
- Googols of states

But real-world problems have structure!

Approach: exploit the structured representation to obtain an efficient approximate solution.
Outline

- Action Coordination
  - Factored Value Functions
  - Coordination Graphs
  - Context-Specific Coordination
- Joint Planning
  - Multi-Agent Markov Decision Processes
  - Efficient Linear Programming Solution
  - Decentralized Market-Based Solution
- Generalizing to New Environments
  - Relational MDPs
  - Generalizing Value Functions
One-Shot Optimization Task

- The Q-function Q(x,a) encodes the agents' payoff for joint action a in joint state x
- The agents' task: compute argmax_a Q(x,a)

Obstacles to doing this naively:
- The number of joint actions is exponential in the number of agents
- Requires complete state observability
- Requires full agent communication
Factored Payoff Function    [K. & Parr '99,'00] [Guestrin, K., Parr '01]

- Approximate the Q-function as a sum of Q sub-functions
- Each sub-function depends on a local part of the system, e.g.:
  - Two interacting agents
  - An agent and an important resource
  - Two inter-dependent pieces of machinery

Q(A1,…,A4, X1,…,X4) ≈ Q1(A1,A4, X1,X4) + Q2(A1,A2, X1,X2) + Q3(A2,A3, X2,X3) + Q4(A3,A4, X3,X4)
Distributed Q Function    [Guestrin, K., Parr '01]

The Q sub-functions are assigned to the relevant agents (agents 1–4 each own one sub-function):

Q(A1,…,A4, X1,…,X4) ≈ Q1(A1,A4, X1,X4) + Q2(A1,A2, X1,X2) + Q3(A2,A3, X2,X3) + Q4(A3,A4, X3,X4)
Multiagent Action Selection

Instantiate the current state x → distributed Q-function → maximal action argmax_a.

[Figure: four agents arranged in a ring, with Q1(A1,A4,X1,X4), Q2(A1,A2,X1,X2), Q3(A2,A3,X2,X3), and Q4(A3,A4,X3,X4) attached to the corresponding agent pairs.]
Instantiating State x

Limited observability: agent i only observes the variables appearing in Qi.
For example, agent 2 observes only X1 and X2, the state variables in Q2(A1,A2,X1,X2).
Choosing Action at State x

After instantiating the current state x, each Qi depends only on the agents' actions. The agents must then compute the maximal joint action:

max_a Q1(A1,A4) + Q2(A1,A2) + Q3(A2,A3) + Q4(A3,A4)
Variable Elimination

Use variable elimination (as in Bayesian network inference) for the maximization:

max_{A1,A2,A3,A4} Q1(A1,A4) + Q2(A1,A2) + Q3(A2,A3) + Q4(A3,A4)
  = max_{A1,A2,A4} Q1(A1,A4) + Q2(A1,A2) + max_{A3} [ Q3(A2,A3) + Q4(A3,A4) ]
  = max_{A1,A2,A4} Q1(A1,A4) + Q2(A1,A2) + g1(A2,A4)

where g1(A2,A4) records the value of the optimal A3 action for each choice of A2 and A4, e.g.:

  A2      A4      g1(A2,A4)
  Attack  Attack      5
  Attack  Defend      6
  Defend  Attack      8
  Defend  Defend     12

- Limited communication suffices for the optimal action choice
- Communication bandwidth = tree-width of the coordination graph
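To make the elimination concrete, here is a small, self-contained Python sketch (not the talk's code) that runs max-out variable elimination over the four-agent coordination graph above. The payoff tables Q1–Q4 and the action names are invented purely for illustration.

```python
from itertools import product

# Each agent's action domain (illustrative).
domains = {"A1": ["attack", "defend"], "A2": ["attack", "defend"],
           "A3": ["attack", "defend"], "A4": ["attack", "defend"]}

# Factors after instantiating the state x: (scope, payoff table).
# All numbers are made up for illustration.
factors = [
    (("A1", "A4"), {("attack", "attack"): 2, ("attack", "defend"): 0,
                    ("defend", "attack"): 1, ("defend", "defend"): 4}),
    (("A1", "A2"), {("attack", "attack"): 3, ("attack", "defend"): 1,
                    ("defend", "attack"): 0, ("defend", "defend"): 2}),
    (("A2", "A3"), {("attack", "attack"): 1, ("attack", "defend"): 5,
                    ("defend", "attack"): 2, ("defend", "defend"): 0}),
    (("A3", "A4"), {("attack", "attack"): 4, ("attack", "defend"): 1,
                    ("defend", "attack"): 0, ("defend", "defend"): 3}),
]

def eliminate(factors, var):
    """Max out `var`: combine all factors mentioning it into one new factor."""
    touching = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    # Scope of the new factor: the other variables of the touched factors.
    new_scope = tuple(sorted({v for scope, _ in touching for v in scope} - {var}))
    new_table, best_choice = {}, {}
    for assignment in product(*(domains[v] for v in new_scope)):
        ctx = dict(zip(new_scope, assignment))
        best_val, best_act = float("-inf"), None
        for a in domains[var]:
            ctx[var] = a
            val = sum(tbl[tuple(ctx[v] for v in scope)] for scope, tbl in touching)
            if val > best_val:
                best_val, best_act = val, a
        new_table[assignment] = best_val
        best_choice[assignment] = best_act
    return rest + [(new_scope, new_table)], best_choice

# Eliminate agents one at a time; in a real system the elimination order
# follows the coordination graph's triangulation.
remaining, strategies = factors, {}
for var in ["A3", "A4", "A2", "A1"]:
    remaining, strategies[var] = eliminate(remaining, var)

# The last factor has empty scope: its single entry is max_a Q(x, a).
# The maximizing joint action can be recovered by replaying the stored
# best_choice tables in reverse elimination order.
print("max_a Q(x,a) =", remaining[0][1][()])
```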
Coordination Graphs

- Communication follows a triangulated graph
- Computation grows exponentially in the tree-width
  - A graph-theoretic measure of "connectedness"
  - Arises in Bayesian networks, CSPs, …
  - Cost is exponential in the worst case, but fairly low for many real graphs

[Figure: an example coordination graph over agents A1–A11.]
Context-Specific Interactions

- The payoff structure can vary by context
  - Example: agents A1 and A2 are both trying to pass through the same narrow corridor X
  - Can use context-specific "value rules", e.g.:
    ⟨At(X,A1) ∧ At(X,A2) ∧ A1 = fwd ∧ A2 = fwd : −100⟩
- Hope: context-specific payoffs will induce context-specific coordination
Context-Specific Coordination

Value rules attached to the coordination graph over agents A1–A6, each conditioned on the state variable x:
⟨a1 ∧ a5 ∧ x : 4⟩, ⟨a5 ∧ a6 ∧ x : 2⟩, ⟨a6 ∧ x : 7⟩, ⟨a1 ∧ a6 ∧ x : 3⟩,
⟨a1 ∧ a2 ∧ x : 5⟩, ⟨a1 ∧ a3 ∧ x : 1⟩, ⟨a2 ∧ a3 ∧ x : 0.1⟩,
⟨a3 ∧ a4 ∧ x : 3⟩, ⟨a4 ∧ x : 1⟩, ⟨a1 ∧ a2 ∧ a4 ∧ x : 3⟩

Instantiate the current state: x = true.
Context-Specific Coordination

After instantiating x = true, rules whose context is inconsistent with the observed state are dropped, and the surviving rules no longer mention x:
⟨a1 ∧ a5 : 4⟩, ⟨a5 ∧ a6 : 2⟩, ⟨a6 : 7⟩, ⟨a1 ∧ a2 : 5⟩, ⟨a2 ∧ a3 : 0.1⟩,
⟨a3 ∧ a4 : 3⟩, ⟨a4 : 1⟩, ⟨a1 ∧ a2 ∧ a4 : 3⟩

The coordination structure varies based on context.
Context-Specific Coordination

Maximizing out A1 with rule-based variable elimination [Zhang & Poole '99]: for the choice A1 = a1, the rules mentioning A1 are replaced by rules over the remaining agents (e.g. ⟨a1 ∧ a2 : 5⟩ becomes ⟨a2 : 5⟩ and ⟨a1 ∧ a5 : 4⟩ becomes ⟨a5 : 4⟩).

The coordination structure varies based on communication.
Context-Specific Coordination

After A1 is eliminated from the graph, the remaining rules are:
⟨a5 ∧ a6 : 2⟩, ⟨a6 : 7⟩, ⟨a2 : 5⟩, ⟨a5 : 4⟩, ⟨a2 ∧ a3 : 0.1⟩, ⟨a3 ∧ a4 : 3⟩, ⟨a4 : 1⟩

The coordination structure varies based on the agents' decisions.

Rule-based variable elimination [Zhang & Poole '99]
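A minimal sketch (illustrative only, not the talk's implementation) of how context-specific value rules can be represented and restricted to the current state. The helper names, the second rule, and its ¬x context are assumptions for the example; the first rule echoes a value from the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ValueRule:
    """<context : value> — the rule contributes `value` when every listed
    variable (state or action) takes the listed value."""
    context: tuple          # e.g. (("x", True), ("A1", "a1"), ("A2", "a2"))
    value: float

def restrict_to_state(rules, state):
    """Drop rules inconsistent with the observed state and remove the
    state variables from the surviving rules' contexts."""
    restricted = []
    for rule in rules:
        keep, remaining = True, []
        for var, val in rule.context:
            if var in state:
                if state[var] != val:     # rule's context contradicts the state
                    keep = False
                    break
            else:
                remaining.append((var, val))   # an action variable: keep it
        if keep:
            restricted.append(ValueRule(tuple(remaining), rule.value))
    return restricted

# Example: one rule from the slide plus a hypothetical rule conditioned on ¬x.
rules = [
    ValueRule((("x", True), ("A1", "a1"), ("A2", "a2")), 5.0),
    ValueRule((("x", False), ("A4", "a4")), 1.0),   # hypothetical ¬x rule
]
print(restrict_to_state(rules, {"x": True}))
# -> only the first rule survives, now mentioning actions A1 and A2 only
```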
Robot Soccer
Kok, Vlassis & Groen, University of Amsterdam

UvA Trilearn 2002 won the German Open 2002, but placed fourth in RoboCup 2002.

"… the improvements introduced in UvA Trilearn 2003 … include an extension of the intercept skill, improved passing behavior and especially the usage of coordination graphs to specify the coordination requirements between the different agents."
RoboSoccer Value Rules

- Coordination-graph rules include conditions on the player's role and on aspects of the global system state
- Example rules for player i, in the role of passer, depend on the distance of teammate j to the goal after the move

[Figure: example value rules for the passer role.]
UvA Trilearn 2003 Results (RoboCup 2003: rounds 1–3, semi-final, final)

  Opponent                              Score
  Mainz Rolling Brains (Germany)         4-0
  Iranians (Iran)                       31-0
  Sahand (Iran)                         39-0
  a4ty (Latvia)                         25-0
  Helios (Iran)                          2-1
  AT-Humboldt (Germany)                  5-0
  ZJUBase (China)                        6-0
  Aria (Iran)                            6-0
  Hana (Japan)                          26-0
  Zenit-NewERA (Russia)                  4-0
  RoboSina (Iran)                        6-0
  Wright Eagle (China)                   3-1
  Everest (China)                        7-1
  Aria (Iran)                            5-0
  Semi-final: Brainstormers (Germany)    4-1
  Final: TsinghuAeolus (China)           4-3

  Total score: 177-7

UvA Trilearn won:
- German Open 2003
- US Open 2003
- RoboCup 2003
- German Open 2004
Outline

- Action Coordination
  - Factored Value Functions
  - Coordination Graphs
  - Context-Specific Coordination
- Joint Planning
  - Multi-Agent Markov Decision Processes
  - Efficient Linear Programming Solution
  - Decentralized Market-Based Solution
- Generalizing to New Environments
  - Relational MDPs
  - Generalizing Value Functions
Real-time Strategy Game

- Peasants collect resources and build
- Footmen attack enemies
- Buildings train peasants and footmen

[Screenshot: game map with peasants, footmen, and buildings labeled.]
Planning Over Time

Markov Decision Process (MDP) representation:
- Action space: joint agent actions a = {a1,…,an}
- State space: joint state descriptions x = {x1,…,xn}
- Momentary reward function R(x,a)
- Probabilistic system dynamics P(x'|x,a)
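As a concrete (if toy) rendering of these four ingredients, the sketch below bundles them into a single Python container; the class and field names are illustrative, not from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

State = Tuple[str, ...]          # joint state x = (x1, ..., xn)
JointAction = Tuple[str, ...]    # joint action a = (a1, ..., an)

@dataclass
class MultiAgentMDP:
    states: Sequence[State]                                # joint state space
    actions: Sequence[JointAction]                         # joint action space
    reward: Callable[[State, JointAction], float]          # R(x, a)
    transition: Callable[[State, JointAction], Dict[State, float]]  # P(x' | x, a)
    gamma: float = 0.95                                     # discount factor
```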
Policy

A policy assigns a joint action to every state: π(x) = a, i.e., at state x, action a for all agents.

Example trajectory:
- π(x0) = both peasants get wood
- π(x1) = one peasant gets gold, the other builds the barracks
- π(x2) = peasants get gold, footmen attack
Value of Policy

The value Vπ(x) is the expected long-term reward obtained by starting from x and following π:

Vπ(x0) = E[ R(x0) + γ R(x1) + γ² R(x2) + γ³ R(x3) + γ⁴ R(x4) + ⋯ ]

Future rewards are discounted by γ ∈ [0,1).

[Figure: trajectory tree from x0, branching over stochastic successor states x1, x1', x1'', each followed by π's action and its reward.]
Optimal Long-term Plan

The optimal policy π*(x) and the optimal Q-function Q*(x,a) satisfy the Bellman equations:

Q*(x,a) = R(x,a) + γ Σ_x' P(x'|x,a) V*(x')
V*(x) = max_a Q*(x,a)

Optimal policy: π*(x) = argmax_a Q*(x,a)
Solving an MDP

Solve the Bellman equations → optimal value V*(x) → optimal policy π*(x).

Many algorithms solve the Bellman equations:
- Policy iteration [Howard '60, Bellman '57]
- Value iteration [Bellman '57]
- Linear programming [Manne '60]
- …
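For concreteness, here is a minimal value-iteration sketch (an illustrative implementation, not the talk's code); it repeatedly applies the Bellman backup and then reads off the greedy policy. It expects an object with the fields of the MultiAgentMDP container sketched earlier (states, actions, reward, transition, gamma).

```python
def value_iteration(mdp, tol=1e-6):
    """Iterate V(x) <- max_a [ R(x,a) + gamma * sum_x' P(x'|x,a) V(x') ]."""
    V = {x: 0.0 for x in mdp.states}
    while True:
        delta = 0.0
        for x in mdp.states:
            q_values = [
                mdp.reward(x, a)
                + mdp.gamma * sum(p * V[x2] for x2, p in mdp.transition(x, a).items())
                for a in mdp.actions
            ]
            new_v = max(q_values)
            delta = max(delta, abs(new_v - V[x]))
            V[x] = new_v
        if delta < tol:
            break
    # Greedy policy with respect to the converged value function.
    policy = {
        x: max(mdp.actions,
               key=lambda a: mdp.reward(x, a)
               + mdp.gamma * sum(p * V[x2] for x2, p in mdp.transition(x, a).items()))
        for x in mdp.states
    }
    return V, policy
```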
LP Solution to MDP

minimize:   Σ_x V(x)
subject to: V(x) ≥ Q(x,a)   for all x, a

- One variable V(x) for each state x
- One constraint for each state x and action a
- Polynomial-time solution (in the number of states and actions)
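The same LP can be handed to an off-the-shelf solver. Below is a hedged sketch using scipy.optimize.linprog on a tiny two-state, two-action MDP with made-up rewards and transitions; the bound V(x) ≥ Q(x,a) from the slide is written out as V(x) ≥ R(x,a) + γ Σ_x' P(x'|x,a) V(x').

```python
import numpy as np
from scipy.optimize import linprog

# Toy MDP: 2 states, 2 actions (all numbers invented for illustration).
gamma = 0.9
R = np.array([[1.0, 0.0],        # R[x, a]
              [0.0, 2.0]])
P = np.array([[[0.8, 0.2],       # P[x, a, x']
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.9, 0.1]]])
n_states, n_actions = R.shape

# Variables: V(x) for each state.  Objective: minimize sum_x V(x).
c = np.ones(n_states)

# Constraints: V(x) >= R(x,a) + gamma * sum_x' P(x'|x,a) V(x'),
# rearranged into linprog's A_ub @ V <= b_ub form:
#   gamma * P[x,a,:] @ V - V(x) <= -R(x,a)
A_ub, b_ub = [], []
for x in range(n_states):
    for a in range(n_actions):
        row = gamma * P[x, a, :].copy()
        row[x] -= 1.0
        A_ub.append(row)
        b_ub.append(-R[x, a])

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
              bounds=[(None, None)] * n_states)
print("V* =", res.x)
```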
Are We Done?

- Planning is polynomial in #states and #actions
- But #states is exponential in the number of state variables
- And #actions is exponential in the number of agents

→ Efficient approximation by exploiting structure!
Structured Representation: Factored MDP    [Boutilier et al. '95]

[Figure: dynamic Bayesian network over time slices t and t+1, with state variables Peasant, Gold, Footman, Enemy, their next-step copies P', G', F', E', and action nodes A_Peasant, A_Build, A_Footman; e.g. the footman's next state is governed by P(F' | F, G, A_B, A_F).]

- State dynamics, decisions, and rewards are all represented in factored form
- Complexity of the representation: exponential in the number of parents (worst case)
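To illustrate what "exponential in the number of parents" means in practice, here is a hedged sketch of one conditional probability table, P(F' | F, G, A_B, A_F), stored as a dictionary; the variable domains and probabilities are invented for illustration.

```python
from itertools import product

# Domains (illustrative): footman health, gold available, build action, footman action.
F_vals  = ["healthy", "wounded", "dead"]
G_vals  = [True, False]
AB_vals = ["train", "idle"]
AF_vals = ["attack", "retreat"]

def next_footman_dist(f, g, ab, af):
    """Toy P(F' | F, G, A_B, A_F): attacking risks damage, retreating is safe."""
    if f == "dead":
        return {"dead": 1.0}
    if af == "attack":
        return {"healthy": 0.5, "wounded": 0.4, "dead": 0.1}
    return {f: 1.0}

# One row per parent assignment: |F|*|G|*|A_B|*|A_F| = 3*2*2*2 = 24 rows,
# i.e. exponential in the number of parents -- yet tiny compared to the
# full joint state/action space of the whole system.
cpt = {(f, g, ab, af): next_footman_dist(f, g, ab, af)
       for f, g, ab, af in product(F_vals, G_vals, AB_vals, AF_vals)}
print(len(cpt), "parent assignments")
```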
Structured Value Function?

Does a factored MDP imply structure in V*?

[Figure: the factored MDP unrolled over time slices t, t+1, t+2, t+3 with variables X, Y, Z and a reward R at each step; correlations spread over time, so V* is not exactly factored.]

Almost! A factored V often provides a good approximate value function.
Structured Value Functions
[Bellman et al. '63], [Tsitsiklis & Van Roy '96], [K. & Parr '99,'00]

- Approximate V* as a factored value function: V(x) = Σ_i wi hi(x)
- In the rule-based case:
  - hi is a rule concerning a small part of the system
  - wi is the value associated with the rule
- Goal: find w giving a good approximation V to V*

Factored value function V = Σ wi hi ⇒ factored Q-function Q = Σ Qi ⇒ can use the coordination graph.
Approximate LP Solution

minimize:   Σ_x Σ_i wi hi(x)
subject to: Σ_i wi hi(x) ≥ Σ_i Qi(a,x)   for all x, a

- One variable wi for each basis function → polynomial number of LP variables
- One constraint for every state and action → exponentially many LP constraints
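For intuition, here is a brute-force version of this approximate LP on a tiny problem, again with scipy.optimize.linprog; it enumerates every (x,a) constraint explicitly, which is exactly what the factored LP of the next slides avoids. The basis functions, the Q values, and the problem sizes are all invented for the example.

```python
import numpy as np
from scipy.optimize import linprog

# Tiny example: 4 joint states, 2 joint actions, 3 basis functions (values invented).
states, actions = range(4), range(2)
h = np.array([[1.0, 1.0, 0.0],   # h[x, i] = h_i(x); first column is a constant basis
              [1.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
Q = np.array([[1.0, 0.5],        # Q[x, a] = sum_i Q_i(a, x), collapsed for brevity
              [2.0, 1.0],
              [0.5, 2.5],
              [0.0, 0.3]])

# Variables: weights w_i.  Objective: minimize sum_x sum_i w_i h_i(x).
c = h.sum(axis=0)

# Constraints: sum_i w_i h_i(x) >= Q(x, a) for every (x, a),
# i.e. -h(x) @ w <= -Q(x, a) in linprog's A_ub @ w <= b_ub form.
A_ub = np.array([-h[x] for x in states for a in actions])
b_ub = np.array([-Q[x, a] for x in states for a in actions])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * h.shape[1])
print("weights w =", res.x)
```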
So What Now?    [Guestrin, K., Parr '01]

subject to: Σ_i wi hi(x) ≥ Σ_i Qi(a,x)   for all x, a

is equivalent to

subject to: 0 ≥ Σ_i Qi(a,x) − Σ_i wi hi(x)   for all x, a

which collapses into a single nonlinear constraint:

subject to: 0 ≥ max_{a,x} [ Σ_i Qi(a,x) − Σ_i wi hi(x) ]

Exponentially many linear constraints = one nonlinear constraint.
Variable Elimination Revisited    [Guestrin, K., Parr '01]

Use variable elimination to represent the constraints compactly:

0 ≥ max_{A,B,C} f1(A,B) + f2(A,C) + max_D [ f3(C,D) + f4(B,D) ]

Introduce new LP variables g1(B,C) with constraints

g1(B,C) ≥ f3(C,D) + f4(B,D)   for all B, C, D,

so the outer constraint becomes

0 ≥ max_{A,B,C} f1(A,B) + f2(A,C) + g1(B,C)

Exponentially fewer constraints → a polynomial-size LP for finding a good factored approximation to V*.
Network Management Problem

- Each computer runs processes; computer status ∈ {good, dead, faulty}
- Dead neighbors increase the probability of dying
- Reward for successful processes
- Each SysAdmin takes a local action ∈ {reboot, not reboot}

Network topologies: ring, ring of rings, star, k-grid.
Scaling of Factored LP

Number of constraints (k = tree-width):
- Explicit LP: 2^n
- Factored LP: (n+1−k)·2^k

[Plot: number of constraints vs. number of variables (2–16); the explicit LP grows exponentially, while the factored LP curves for k = 3, 5, 8, 10, and 12 grow far more slowly.]
Multiagent Running Time

[Plot: total running time (seconds) vs. number of agents (2–16) on the network management problem; curves for the ring-of-rings topology and for the star topology with pair and single bases.]
Strategic 2x2

Offline:
- Factored MDP model: 2 Peasants, 2 Footmen, Enemy, Gold, Wood, Barracks (~1 million state/action pairs)
- Factored LP computes the value function

Online:
- The coordination graph computes argmax_a Q(x,a): the world supplies the current state x, the agents return the joint action a
Demo: Strategic 2x2
Guestrin, Koller, Gearhart & Kanodia
Limited Interaction MDPs    [Guestrin & Gordon, '02]

- Some MDPs have additional structure:
  - Agents are largely autonomous
  - They interact in limited ways, e.g., competing for resources
- Can decompose the MDP into a set of agent-based MDPs with a limited interface

[Figure: a two-agent MDP split into local MDPs M1 and M2 that share only the interface variables (e.g., X2 and A1).]
Limited Interaction MDPs    [Guestrin & Gordon, '02]

- In such MDPs, our LP matrix is highly structured
- Can use Dantzig-Wolfe LP decomposition to solve the LP optimally, in a decentralized way
- Gives rise to a market-like algorithm with multiple agents and a centralized "auctioneer"
Auction-Style Planning    [Guestrin & Gordon, '02]

- Each agent solves its local (stand-alone) MDP ("plan, plan, plan")
- Agents send 'constraint messages' to the auctioneer: they must agree on a "policy" for the shared variables
- The auctioneer sets prices based on the conflicts and sends 'pricing messages' back to the agents:
  - Pricing reflects penalties for constraint violations
  - Prices influence the agents' rewards in their local MDPs
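The talk's algorithm is a Dantzig-Wolfe decomposition; as loose intuition for the price-message loop only (not the actual method), here is a toy dual-decomposition sketch. Two hypothetical agents repeatedly best-respond to a price on a shared resource, and an "auctioneer" adjusts the price until the shared-resource constraint is satisfied. All utilities and parameters are invented.

```python
# Toy price coordination: two agents share a resource of capacity 1.0.
# Each agent picks a usage level; the auctioneer prices over-use.
utilities = {
    "agent1": {0.0: 0.0, 0.5: 3.0, 1.0: 5.0},   # invented utility of each usage level
    "agent2": {0.0: 0.0, 0.5: 4.0, 1.0: 6.0},
}
capacity, price, step = 1.0, 0.0, 0.5

for t in range(100):
    # Each agent plans independently against the current price (its "local MDP").
    plans = {name: max(u, key=lambda lvl: u[lvl] - price * lvl)
             for name, u in utilities.items()}
    total = sum(plans.values())
    # Auctioneer: raise the price if the shared resource is over-subscribed,
    # lower it (but not below zero) if there is slack.
    price = max(0.0, price + step * (total - capacity))
    if abs(total - capacity) < 1e-9:
        break

print("final price:", price, "plans:", plans)
```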
Fuel Allocation Problem    [Bererton, Gordon, Thrun & Khosla '03]

- UAVs share a pot of fuel
- Targets have varying priority
- Target interference is ignored

[Figure: map with UAV start locations and targets.]
High-Speed Robot Paintball    Bererton, Gordon & Thrun

[Figure: two game variants; maps showing the coordination point, sensor placement, start locations (x), and goal locations (+).]
Outline

- Action Coordination
  - Factored Value Functions
  - Coordination Graphs
  - Context-Specific Coordination
- Joint Planning
  - Multi-Agent Markov Decision Processes
  - Efficient Linear Programming Solution
  - Decentralized Market-Based Solution
- Generalizing to New Environments
  - Relational MDPs
  - Generalizing Value Functions
Generalizing to New Problems

- Many problems are "similar": solving Problem 1, Problem 2, …, Problem n should yield a good solution to Problem n+1
- But the MDPs are different! Different sets of states, actions, rewards, transitions, …
Generalizing with Relational MDPs

- "Similar" domains have similar types of objects → model them with a relational MDP
- Exploit the similarities by computing generalizable value functions
- Generalization avoids the need to replan and lets us tackle larger problems
Relational Models and MDPs    [Guestrin, K., Gearhart & Kanodia '03]

- Classes: Peasant, Footman, Gold, Barracks, Enemy, …
- Relations: Collects, Builds, Trains, Attacks, …
- Instances: Peasant1, Peasant2, Footman1, Enemy1, …

Builds on Probabilistic Relational Models [K. & Pfeffer '98].
Relational MDPs    [Guestrin, K., Gearhart & Kanodia '03]

[Figure: class-level schema for the Footman class, with attribute Health, action A_Footman, a my_enemy link to the Enemy class, and a reward R depending on the enemy Count.]

- Class-level transition probabilities depend on: attributes, actions, and attributes of related objects
- Class-level reward function
- Very compact representation! Does not depend on the number of objects.
World is a Large Factored MDP

Relational MDP + # of objects + links between objects → factored MDP

An instantiation (a world) specifies:
- The number of instances of each class
- The links between instances
- This yields a well-defined factored MDP
MDP with 2 Footmen and 2 Enemies

[Figure: the instantiated DBN, with variables F1.Health, F1.A, E1.Health and reward R1 for the Footman1/Enemy1 pair, and F2.Health, F2.A, E2.Health, R2 for the Footman2/Enemy2 pair.]
World is a Large Factored MDP

- Instantiate the world → a well-defined factored MDP
- Use the factored LP for planning
- But if we must replan from scratch for every new world, we have gained nothing!
Class-Level Value Functions

[Figure: per-object value functions, shown as bar charts over the alive/dead combinations of each footman and its enemy.]

V(F1.H, E1.H, F2.H, E2.H) = VF1(F1.H, E1.H) + VE1(E1.H) + VF2(F2.H, E2.H) + VE2(E2.H)

Units of the same class are interchangeable:
VF1 ≈ VF2 ≈ VF,   VE1 ≈ VE2 ≈ VE

- At state x, each footman still makes a different contribution to V
- Given the class-level weights wC, we can instantiate the value function for any world
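A hedged sketch of that last point: given class-level value functions (here simple lookup tables with invented values), the value of any world is obtained by summing one copy per object. The function and variable names are illustrative, not from the talk.

```python
# Class-level value functions, e.g. learned once from small sampled worlds
# (numbers invented for illustration).
V_F = {("alive", "alive"): 10.0, ("alive", "dead"): 20.0,
       ("dead", "alive"): 0.0,  ("dead", "dead"): 5.0}   # V_F(F.H, my_enemy.H)
V_E = {"alive": 0.0, "dead": 15.0}                        # V_E(E.H)

def world_value(footmen, enemies, links):
    """Sum one class-level term per object: V(x) = sum_F V_F + sum_E V_E.
    `links[f]` names the enemy that footman f is fighting."""
    value = sum(V_F[(footmen[f], enemies[links[f]])] for f in footmen)
    value += sum(V_E[h] for h in enemies.values())
    return value

# The same class-level tables score a 2-footman world ...
print(world_value({"F1": "alive", "F2": "dead"},
                  {"E1": "dead", "E2": "alive"},
                  {"F1": "E1", "F2": "E2"}))
# ... and, without replanning, a larger 3-footman world.
print(world_value({"F1": "alive", "F2": "alive", "F3": "alive"},
                  {"E1": "alive", "E2": "dead", "E3": "alive"},
                  {"F1": "E1", "F2": "E2", "F3": "E3"}))
```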
Factored LP-Based Generalization

- Sample a set of small worlds (e.g., Footman1/Enemy1, Footman2/Enemy2, …)
- A class-level factored LP computes the class value functions VF and VE from the sampled worlds
- Generalize: instantiate VF and VE for new objects (e.g., Footman3/Enemy3) in unseen worlds

How many samples are needed?

[Figure: bar charts of the sampled per-object value functions and of the resulting class-level VF and VE, instantiated for a new footman/enemy pair.]
Sampling Complexity

- Exponentially many worlds → do we need exponentially many samples?
- The number of objects in a world is unbounded → must we sample very large worlds?
- NO!

Theorem: Sample m small worlds of up to O(ln 1/ε) objects each, with m = … samples. Then the resulting value function is within O(ε) of the class-level value function optimized for all worlds, with probability at least 1−δ. (RCmax is the maximum class reward appearing in the bound.)
Strategic 2x2

Offline:
- Relational MDP model: 2 Peasants, 2 Footmen, Enemy, Gold, Wood, Barracks (~1 million state/action pairs)
- Factored LP computes the value function

Online:
- The coordination graph computes argmax_a Q(x,a) as the world evolves
Strategic 9x3

Offline:
- Relational MDP model: 9 Peasants, 3 Footmen, Enemy, Gold, Wood, Barracks (~3 trillion state/action pairs; grows exponentially in the number of agents)
- Factored LP computes the value function

Online:
- The coordination graph computes argmax_a Q(x,a)
Strategic Generalization

Offline:
- Relational MDP model; factored LP computes the class-level value function wC from the small world (2 Peasants, 2 Footmen, Enemy, Gold, Wood, Barracks; ~1 million state/action pairs)
- The class-level value function is then instantiated for the large world (9 Peasants, 3 Footmen, Enemy, Gold, Wood, Barracks; ~3 trillion state/action pairs); the instantiated Q-functions grow only polynomially in the number of agents

Online:
- The coordination graph computes argmax_a Q(x,a)
Demo: Generalized 9x3
Guestrin, Koller, Gearhart & Kanodia
Tactical Generalization

- Planned in the 3 Footmen versus 3 Enemies scenario
- Generalized to 4 Footmen versus 4 Enemies
Demo: Planned Tactical 3x3
Guestrin, Koller, Gearhart & Kanodia
Demo: Generalized Tactical 4x4
Guestrin, Koller, Gearhart & Kanodia
[Guestrin, K., Gearhart & Kanodia ‘03]
Summary

Structured multi-agent MDPs enable:
- Distributed coordinated action selection
- Effective planning under uncertainty
- Generalization to new problems
Important Questions

- Continuous spaces
- Complex actions
- Partial observability
- Learning to act

How far can we go?
http://robotics.stanford.edu/~koller

Collaborators: Carlos Guestrin, Chris Gearhart, Neal Kanodia, Shobha Venkataraman, Ronald Parr, Curt Bererton, Geoff Gordon, Sebastian Thrun, Jelle Kok, Matthijs Spaan, Nikos Vlassis