POMDPs
Slides based on Hansen et al.'s
tutorial + R&N 3rd Ed Sec 17.4
Planning using Partially Observable
Markov Decision Processes:
A Tutorial
Presenters:
Eric Hansen, Mississippi State University
Daniel Bernstein, University of Massachusetts/Amherst
Zhengzhu Feng, University of Massachusetts/Amherst
Rong Zhou, Mississippi State University
Introduction and foundations
Definition of POMDP
Goals, rewards and optimality criteria
Examples and applications
Computational complexity
Belief states and Bayesian conditioning
Planning under partial observability

[Figure: an agent pursues a goal by acting on an environment that it perceives only through imperfect observations.]
Two Approaches to Planning under Partial Observability

- Nondeterministic planning: uncertainty is represented by a set of possible states; no possibility is considered more likely than any other
- Probabilistic (decision-theoretic) planning: uncertainty is represented by a probability distribution over possible states

In this tutorial we consider the second, more general approach
Markov models

                       Prediction            Planning
Fully observable       Markov chain          MDP (Markov decision process)
Partially observable   Hidden Markov model   POMDP (Partially observable Markov decision process)
Definition of POMDP

[Figure: the POMDP process unrolled over time, with hidden states s0, s1, s2, ..., observations z0, z1, z2, ..., actions a0, a1, a2, ..., and rewards r0, r1, r2, ...]
Goals, rewards and optimality criteria

- Rewards are additive and time-separable, and the objective is to maximize expected total reward
- Traditional planning goals can be encoded in the reward function
  - Example: achieving a state satisfying property P at minimal cost is encoded by making any state satisfying P a zero-reward absorbing state, and assigning all other states negative reward (see the sketch below)
- POMDP allows partial satisfaction of goals and tradeoffs among competing goals
- Planning horizon can be finite, infinite or indefinite
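A minimal sketch of the goal-encoding trick above; the state labels and the `satisfies_p` predicate are illustrative assumptions, not anything defined in the tutorial.

```python
# Hypothetical sketch: encode "achieve a state satisfying P at minimal cost"
# by making P-states zero-reward and absorbing, and charging a uniform
# negative reward everywhere else.

def make_goal_reward(states, satisfies_p, step_cost=1.0):
    """Return a reward function and an is-absorbing test for a goal task."""
    goal_states = {s for s in states if satisfies_p(s)}

    def reward(s, a):
        # Zero reward once the goal is reached, negative reward elsewhere.
        return 0.0 if s in goal_states else -step_cost

    def is_absorbing(s):
        # Goal states are absorbing: every action leaves them unchanged.
        return s in goal_states

    return reward, is_absorbing

# Example: reach the state labeled "goal".
reward, absorbing = make_goal_reward(["start", "mid", "goal"], lambda s: s == "goal")
assert reward("mid", "any-action") == -1.0 and reward("goal", "any-action") == 0.0
```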
Machine Maintenance

- Canonical application of POMDPs in Operations Research
Robot Navigation

- Canonical application of POMDPs in AI
- Toy example from Russell & Norvig's AI textbook

[Figure: grid world with a Start cell and terminal cells labeled +1 and -1. Actions: N, S, E, W, Stop; the intended move succeeds with probability 0.8 and slips to either side with probability 0.1 each. Observations: sense surrounding walls.]
Many other applications

- Helicopter control [Bagnell & Schneider 2001]
- Dialogue management [Roy, Pineau & Thrun 2000]
- Preference elicitation [Boutilier 2002]
- Optimal search and sensor scheduling [Krishnamurthy & Singh 2000]
- Medical diagnosis and treatment [Hauskrecht & Fraser 2000]
- Packet scheduling in computer networks [Chang et al. 2000; Bent & Van Hentenryck 2004]
Computational complexity

- Finite-horizon
  - PSPACE-hard [Papadimitriou & Tsitsiklis 1987]
  - NP-complete if unobservable
- Infinite-horizon
  - Undecidable [Madani, Hanks & Condon 1999]
  - NP-hard for ε-approximation [Lusena, Goldsmith & Mundhenk 2001]
  - NP-hard for memoryless or bounded-memory control problem [Littman 1994; Meuleau et al. 1999]
POMDP

- A ⟨S, A, T, R, Ω, O⟩ tuple
  - S, A, T, R as in an MDP
  - Ω – finite set of observations
  - O: S × A → Π(Ω), the observation function
- Belief state
  - an information state
  - b, a probability distribution over S
  - b(s1) is the probability that the state is s1
(A minimal representation of this tuple is sketched below.)
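A minimal sketch of how the ⟨S, A, T, R, Ω, O⟩ tuple and a belief state might be held in code; the array layout and the tiny two-state numbers are illustrative assumptions, not part of the tutorial.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class POMDP:
    states: list             # S
    actions: list            # A
    T: np.ndarray            # T[a, s, s'] = P(s' | s, a)
    R: np.ndarray            # R[a, s]     = immediate reward
    observations: list       # Omega
    O: np.ndarray            # O[a, s', z] = P(z | s', a)

# Two states, two actions, two observations (placeholder numbers).
model = POMDP(
    states=["s1", "s2"],
    actions=["a1", "a2"],
    T=np.array([[[0.9, 0.1], [0.1, 0.9]],
                [[0.5, 0.5], [0.5, 0.5]]]),
    R=np.array([[1.0, 0.0],
                [0.0, 1.5]]),
    observations=["z1", "z2"],
    O=np.array([[[0.6, 0.4], [0.4, 0.6]],
                [[0.6, 0.4], [0.4, 0.6]]]),
)

b = np.array([0.5, 0.5])   # belief state: b(s1), b(s2); sums to 1
```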
POMDP

[Figure: the agent sends an action a to the world and receives an observation o in return.]

- Goal is to maximize expected long-term reward from the initial state distribution
- State is not directly observed
Two sources of POMDP complexity

- Curse of dimensionality
  - size of state space
  - shared by other planning problems
- Curse of memory
  - size of value function (number of vectors), or equivalently, size of controller (memory)
  - unique to POMDPs

Complexity of each iteration of DP: |S|^2 |A| |V_{n-1}|^|Z|, where V_{n-1} is the set of vectors at the previous iteration
Two representations of policy

- Policy maps history to action
- Since history grows exponentially with horizon, it needs to be summarized, especially in the infinite-horizon case
- Two ways to summarize history
  - belief state
  - finite-state automaton – partitions history into a finite number of "states"
Belief simplex

[Figure: for 2 states the belief simplex is the line segment between (1, 0) and (0, 1); for 3 states it is the triangle with corners (1, 0, 0), (0, 1, 0), and (0, 0, 1).]
Belief state has Markov property

- The process of maintaining the belief state is Markovian
- For any belief state, the successor belief state depends only on the action and observation

[Figure: a two-state example in which each action/observation pair (a1 or a2, z1 or z2) maps the current belief, ranging from P(s0) = 0 to P(s0) = 1, to a unique successor belief.]
Belief-state MDP

- State space: the belief simplex
- Actions: same as before
- State transition function: P(b'|b,a) = Σ_{e∈E} P(b'|b,a,e) P(e|b,a)
- Reward function: r(b,a) = Σ_{s∈S} b(s) r(s,a)
- Bellman optimality equation:
  V(b) = max_{a∈A} [ r(b,a) + γ Σ_{b'} P(b'|b,a) V(b') ]
  (Strictly, the sum over b' should be an integral over the continuous belief space, but only finitely many successor beliefs are reachable from b under a given action.)
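A small sketch of the quantities above: Bayesian conditioning of the belief state and the belief-MDP reward r(b,a). The array conventions (T[a,s,s'], O[a,s',z], R[a,s]) and the placeholder numbers are assumptions carried over from the earlier sketch.

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """Bayesian conditioning: b'(s') is proportional to O(z|s',a) * sum_s T(s'|s,a) b(s).
    Also returns the normalizer P(z|b,a), which appears in P(b'|b,a)."""
    predicted = T[a].T @ b               # sum_s P(s'|s,a) b(s)
    unnormalized = O[a][:, z] * predicted
    p_z = unnormalized.sum()             # P(z | b, a)
    return unnormalized / p_z, p_z

def belief_reward(b, a, R):
    """r(b,a) = sum_s b(s) r(s,a)."""
    return float(R[a] @ b)

# Tiny one-action, two-state, two-observation illustration.
T = np.array([[[0.9, 0.1], [0.1, 0.9]]])
O = np.array([[[0.6, 0.4], [0.4, 0.6]]])
R = np.array([[1.0, 0.0]])
b = np.array([0.5, 0.5])
b_next, p_z = belief_update(b, a=0, z=0, T=T, O=O)
print(b_next, p_z, belief_reward(b, 0, R))
```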
Belief-state controller

[Figure: the observation e and the current belief state b (held in a register) feed the state estimator, which computes the updated belief P(b|b,a,e); the policy then maps the belief state to the next action a.]

- Update belief state after action and observation
- Policy maps belief state to action
- Policy is found by solving the belief-state MDP
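A hypothetical execution loop for this controller; `policy`, `act_in_world`, and `update` are assumed helpers (the policy, the real or simulated world, and a belief-update routine such as the one sketched earlier), not functions defined in the tutorial.

```python
# State estimation feeds the policy, which maps the belief state to an action.
def run_controller(b0, policy, act_in_world, update, steps=100):
    b = b0
    for _ in range(steps):
        a = policy(b)            # policy: belief state -> action
        z = act_in_world(a)      # execute the action, receive an observation
        b = update(b, a, z)      # state estimation: compute the new belief
    return b
```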
POMDP as MDP in Belief Space
Dynamic Programming for POMDPs

We'll start with some important concepts:

[Figure: three concepts illustrated side by side: a belief state (e.g., probabilities 0.25, 0.40, 0.35 over states s1, s2, s3), a policy tree (a root action followed by observation branches o1, o2 leading to further actions), and a linear value function over the belief space between s1 and s2.]
Dynamic Programming for POMDPs
[Figure: the one-step value functions for actions a1 and a2, each a line over the belief space between s1 and s2.]

Dynamic Programming for POMDPs
[Figure: all two-step policy trees built from actions a1, a2 and observations o1, o2, each contributing a line over the belief space between s1 and s2.]

Dynamic Programming for POMDPs
[Figure: the same value function after pruning: only the policy trees that are best at some belief are kept.]

Dynamic Programming for POMDPs
[Figure: the resulting piecewise linear value function over the belief space between s1 and s2.]
[Finite Horizon Case]
POMDP Value Iteration: Basic Idea

- First problem solved
- Key insight: the value function is piecewise linear and convex (PWLC)
- Convexity makes intuitive sense
  - Middle of belief space: high entropy, can't select actions appropriately, less long-term reward
  - Near corners of simplex: low entropy, take actions more likely to be appropriate for the current world state, gain more reward
- Each line (hyperplane) is represented by a vector of coefficients
  - e.g. V(b) = c1 * b(s1) + c2 * (1 - b(s1))
- To find the value function at b, find the vector with the largest dot product with b (see the sketch below)
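A small sketch of this "largest dot product" rule over a set of alpha-vectors; the two example vectors and the belief are placeholders.

```python
import numpy as np

def pwlc_value(b, vectors):
    """V(b) = max over alpha-vectors of the dot product alpha . b."""
    return max(float(np.dot(alpha, b)) for alpha in vectors)

def best_vector(b, vectors):
    """Return the alpha-vector (and hence the plan it encodes) that is maximal at b."""
    return max(vectors, key=lambda alpha: float(np.dot(alpha, b)))

# Two-state example: each line is V(b) = c1*b(s1) + c2*(1 - b(s1)).
vectors = [np.array([1.0, 0.0]), np.array([0.0, 1.5])]
b = np.array([0.25, 0.75])
print(pwlc_value(b, vectors))   # 1.125
```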
POMDP Value Iteration: Phase 1: One-action plans

- Two states: 0 and 1
- R(0) = 0; R(1) = 1
- [stay] 0.9 stay; 0.1 go
- [go] 0.9 go; 0.1 stay
- Sensor reports the correct state with 0.6 probability
- Discount factor = 1
POMDP Value Iteration: Phase 2: Two-action (conditional) plans

[Figure: value functions of the two-step conditional plans (e.g., "stay, then stay") over the belief space between states 0 and 1.]
Point-based Value Iteration:
Approximating with Exemplar Belief States
Solving infinite-horizon POMDPs

- Value iteration: iteration of the dynamic programming operator computes a value function that is arbitrarily close to optimal
- The optimal value function is not necessarily piecewise linear, since optimal control may require infinite memory
- But in many cases, as Sondik (1978) and Kaelbling et al. (1998) noticed, value iteration converges to a finite set of vectors. In these cases, an optimal policy is equivalent to a finite-state controller.
Policy evaluation

[Figure: a two-node controller (nodes q1, q2) crossed with system states s1, s2, giving a value variable for each pair (s1,q1), (s2,q1), (s1,q2), (s2,q2), connected by observation transitions o1, o2.]

- As in the fully observable case, policy evaluation involves solving a system of linear equations.
- There is one unknown (and one equation) for each pair of system state and controller node (a sketch appears below).
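A sketch of that linear system for a deterministic finite-state controller, using numpy and the array conventions assumed in the earlier sketches (T[a,s,s'], O[a,s',z], R[a,s]); `node_action` and `node_succ` describe the controller and are assumed inputs.

```python
import numpy as np

def evaluate_controller(node_action, node_succ, T, R, O, gamma):
    """Solve the linear system with one unknown V(s, q) per (state, node) pair:
       V(s,q) = R(s,a_q) + gamma * sum_{s',z} T(s'|s,a_q) O(z|s',a_q) V(s', succ(q,z))
    node_action[q] -> action index; node_succ[q][z] -> successor node index."""
    nQ = len(node_action)
    nA, nS, _ = T.shape
    nZ = O.shape[2]
    n = nS * nQ
    A = np.eye(n)            # builds (I - gamma * M)
    c = np.zeros(n)          # immediate rewards
    idx = lambda s, q: s * nQ + q
    for q in range(nQ):
        a = node_action[q]
        for s in range(nS):
            c[idx(s, q)] = R[a, s]
            for sp in range(nS):
                for z in range(nZ):
                    qn = node_succ[q][z]
                    A[idx(s, q), idx(sp, qn)] -= gamma * T[a, s, sp] * O[a, sp, z]
    V = np.linalg.solve(A, c)
    return V.reshape(nS, nQ)
```

The result is an |S| x |Q| array of values, one per (state, node) pair, matching the count of unknowns noted above.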
Policy improvement

[Figure: a finite-state controller (nodes labeled with actions a0, a1 and observation transitions z0, z1) and its value function V(b), before and after an improvement step: the dynamic programming backup adds new nodes, dominated nodes are removed or merged, and V(b) improves everywhere.]
Per-Iteration Complexity of POMDP value iteration

- Number of α-vectors needed at the t-th iteration: up to |A| |V_{t-1}|^|Z|
- Time for computing each α-vector: polynomial in |S| and |Z|
Approximating POMDP value function with bounds

- It is possible to get approximate value functions for a POMDP in two ways
  - Over-constrain it to be a NOMDP (non-observable MDP): you get the blind value function, which ignores the observations
    - Under-estimates value (over-estimates cost)
    - A "conformant" policy
    - For the infinite horizon, it is the same action always (only |A| policies)
  - Relax it to be a FOMDP (fully observable MDP): you assume the state is fully observable
    - Over-estimates value (under-estimates cost)
    - A "state-based" policy
- Per iteration, upper bounds for leaf nodes can come from FOMDP VI and lower bounds from NOMDP VI (see the sketch below)
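A sketch of both bounds under the array conventions assumed earlier (T[a,s,s'], R[a,s]) and a discount factor below 1; the function names are illustrative.

```python
import numpy as np

def mdp_value_iteration(T, R, gamma, iters=500):
    """FOMDP relaxation: assume full observability. sum_s b(s) V_MDP(s) is then
    an upper bound on the POMDP value at belief b (requires gamma < 1)."""
    nA, nS, _ = T.shape
    V = np.zeros(nS)
    for _ in range(iters):
        V = np.max(R + gamma * (T @ V), axis=0)
    return V

def blind_alpha_vectors(T, R, gamma, iters=500):
    """NOMDP over-constraint: ignore observations and repeat one action forever.
    Each action's fixed-point vector under-estimates the POMDP value."""
    nA, nS, _ = T.shape
    alphas = np.zeros((nA, nS))
    for _ in range(iters):
        alphas = R + gamma * np.einsum('ast,at->as', T, alphas)
    return alphas

def bounds_at_belief(b, V_mdp, blind):
    upper = float(b @ V_mdp)          # FOMDP upper bound
    lower = float(np.max(blind @ b))  # best blind (fixed-action) lower bound
    return lower, upper
```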
Observations are written as o or z
Comparing POMDPs with Nondeterministic conditional planning

[Figure: side-by-side illustration of the POMDP case and the non-deterministic case.]

RTDP-Bel doesn't do look-ahead, and also stores the current estimate of the value function (see update)

--- SLIDES BEYOND THIS NOT COVERED ---
Two Problems

- How to represent a value function over a continuous belief space?
- How to update value function Vt from Vt-1?

POMDP -> MDP

- S => B, the set of belief states
- A => same
- T => τ(b, a, b')
- R => ρ(b, a)
Running Example

- POMDP with
  - Two states (s1 and s2)
  - Two actions (a1 and a2)
  - Three observations (z1, z2, z3)
- The belief space is 1-dimensional for a 2-state POMDP: a single number, the probability that the state is s1
Second Problem

- Can't iterate over all belief states (infinite) for value iteration, but...
- Given vectors representing Vt-1, generate vectors representing Vt
Horizon 1

- No future
- Value function consists only of immediate reward
- e.g.
  - R(s1, a1) = 1, R(s2, a1) = 0,
    R(s1, a2) = 0, R(s2, a2) = 1.5
  - b = <0.25, 0.75>
  - Value of doing a1
    = 1 x b(s1) + 0 x b(s2)
    = 1 x 0.25 + 0 x 0.75 = 0.25
  - Value of doing a2
    = 0 x b(s1) + 1.5 x b(s2)
    = 0 x 0.25 + 1.5 x 0.75 = 1.125
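A quick check of this horizon-1 arithmetic in code, using the reward numbers above.

```python
import numpy as np

# Horizon-1 value: each action contributes one alpha-vector equal to its
# immediate rewards, and V1(b) is the maximum dot product with the belief.
R = {"a1": np.array([1.0, 0.0]),    # R(s1,a1), R(s2,a1)
     "a2": np.array([0.0, 1.5])}    # R(s1,a2), R(s2,a2)
b = np.array([0.25, 0.75])

values = {a: float(r @ b) for a, r in R.items()}
print(values)                       # {'a1': 0.25, 'a2': 1.125}
print(max(values, key=values.get))  # 'a2' is the best horizon-1 action at b
```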
Second Problem

- Break the problem down into 3 steps
  - Compute value of belief state given action and observation
  - Compute value of belief state given action
  - Compute value of belief state
Horizon 2 – Given action & obs

- If in belief state b, what is the best value of doing action a1 and seeing z1?
- Best value = best value of immediate action + best value of next action
- Best value of immediate action = horizon 1 value function
Horizon 2 – Given action & obs

- Assume the best immediate action is a1 and the observation is z1
- What's the best action for the b' that results from the initial b when we perform a1 and observe z1?
- Not feasible to do this for all belief states (infinite)
Horizon 2 – Given action & obs

- Construct a function over the entire (initial) belief space
  - from the horizon 1 value function
  - with the belief transformation built in
Horizon 2 – Given action & obs

- S(a1, z1) corresponds to the transformed value function in the paper
- S() has built in:
  - the horizon 1 value function
  - the belief transformation
  - the "weight" of seeing z after performing a
  - the discount factor
  - the immediate reward
- S() is PWLC (a sketch of the construction follows)
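A hedged sketch of building the S(a, z) vector set from the horizon-1 vectors. It follows one common decomposition (the one used in the incremental-pruning literature), in which the immediate reward is split evenly across the |Z| observations; whether the slide's source paper splits the reward the same way is an assumption.

```python
import numpy as np

def S_az(prev_vectors, a, z, T, R, O, gamma, nZ):
    """Map each horizon-(t-1) vector alpha to
        tau(alpha)(s) = R(s,a)/|Z| + gamma * sum_s' O(z|s',a) T(s'|s,a) alpha(s'),
    which bakes in the belief transformation, the observation 'weight',
    the discount factor, and a share of the immediate reward."""
    out = []
    for alpha in prev_vectors:
        backed_up = T[a] @ (O[a][:, z] * alpha)   # sum_s' T(s'|s,a) O(z|s',a) alpha(s')
        out.append(R[a] / nZ + gamma * backed_up)
    return out
```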
Second Problem

- Break the problem down into 3 steps
  - Compute value of belief state given action and observation
  - Compute value of belief state given action
  - Compute value of belief state
Horizon 2 – Given action

- What is the horizon 2 value of a belief state given that the immediate action is a1?
- Horizon 2: do action a1
- Horizon 1: do action ...?
Horizon 2 – Given action

- What's the best strategy at b?
- How do we compute the line (vector) representing the best strategy at b? (easy)
- How many strategies are there in the figure?
- What's the maximum number of strategies (after taking immediate action a1)?
Horizon 2 – Given action

- How can we represent the 4 regions (strategies) as a value function?
- Note: each region is a strategy
Horizon 2 – Given action

- Sum up the vectors representing each region
- A sum of vectors is a vector (add lines, get lines); see the cross-sum sketch below
- Corresponds to the paper's transformation
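A minimal sketch of this "sum up vectors" step as a cross-sum of the per-observation vector sets.

```python
def cross_sum(set1, set2):
    """Sum every vector in set1 with every vector in set2 (adding lines gives
    lines, so the result is again a set of alpha-vectors)."""
    return [u + v for u in set1 for v in set2]

# Combining S(a, z1) and S(a, z2) this way yields one vector per joint choice
# of "what to do after z1" and "what to do after z2" under immediate action a;
# with |V| vectors per observation set there are up to |V|**2 combinations,
# which is why pruning is needed.
```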
Horizon 2 – Given action

- What does each region represent?
- Why is this step hard (alluded to in the paper)?
Second Problem

- Break the problem down into 3 steps
  - Compute value of belief state given action and observation
  - Compute value of belief state given action
  - Compute value of belief state
Horizon 2

[Figure: the union (U) of the value function for immediate action a1 with the value function for immediate action a2.]

Horizon 2

- This tells you how to act! => Purge
Second Problem

- Break the problem down into 3 steps
  - Compute value of belief state given action and observation
  - Compute value of belief state given action
  - Compute value of belief state
- Use the horizon 2 value function to update horizon 3's ...
The Hard Step

- Easy to visually inspect the value function to obtain the different regions
- But in a higher-dimensional space, with many actions and observations, it is a hard problem
- Naive way: enumerate
- How does Incremental Pruning do it?
Incremental Pruning

- Combinations are formed incrementally, with a purge/filter step after each one
- How does IP improve on the naive method?
- Will IP ever do worse than the naive method? (a sketch of the pruning test appears below)
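A sketch of the basic pruning test behind these questions: a linear program that searches for a belief at which a vector beats all the others. scipy is assumed to be available; this is a simple version that assumes the vectors are distinct, not an optimized implementation.

```python
import numpy as np
from scipy.optimize import linprog

def dominate(w, U):
    """Return a belief b at which vector w beats every vector in U (a 'witness'),
    or None if no such belief exists. Solves:
        max delta  s.t.  b.w >= b.u + delta for all u in U,  b in the simplex."""
    nS = len(w)
    if not U:
        return np.full(nS, 1.0 / nS)
    A_ub = [np.append(u - w, 1.0) for u in U]   # (u - w).b + delta <= 0
    b_ub = np.zeros(len(U))
    A_eq = [np.append(np.ones(nS), 0.0)]        # beliefs sum to 1
    b_eq = [1.0]
    bounds = [(0, 1)] * nS + [(None, None)]
    c = np.zeros(nS + 1)
    c[-1] = -1.0                                # maximize delta
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    if res.success and -res.fun > 1e-9:         # delta > 0: witness found
        return res.x[:nS]
    return None

def prune(vectors):
    """Keep only vectors that are strictly best somewhere in the belief simplex."""
    kept = []
    for i, w in enumerate(vectors):
        if dominate(w, kept + vectors[i + 1:]) is not None:
            kept.append(w)
    return kept
```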
Incremental Pruning

- What are the other novel idea(s) in IP?
  - RR: come up with a smaller set D as the argument to Dominate()
  - RR has more linear programs but fewer constraints in the worst case
  - Empirically, the reduction in constraints saves more time than the extra linear programs require
Incremental Pruning

- What are the other novel idea(s) in IP?
  - RR: come up with a smaller set D as the argument to Dominate()
- Why are the terms after U needed?
Identifying Witness

- Witness Theorem:
  - Let Ua be a set of vectors representing the value function
  - Let u be in Ua (e.g. u = α_{z1,a2} + α_{z2,a1} + α_{z3,a1})
  - If there is a vector v which differs from u in one observation (e.g. v = α_{z1,a1} + α_{z2,a1} + α_{z3,a1}) and there is a b such that b·v > b·u,
  - then Ua is not equal to the true value function
Witness Algorithm

- Randomly choose a belief state b
- Compute the vector representing the best value at b (easy)
- Add the vector to the agenda
- While the agenda is not empty
  - Get vector Vtop from the top of the agenda
  - b' = Dominate(Vtop, Ua)
  - If b' is not null (there is a witness),
    - compute the vector u for the best value at b' and add it to Ua
    - compute all vectors v that differ from u at one observation and add them to the agenda
Linear Support

- If the value function is incorrect, the biggest difference is at the edges of the belief simplex (by convexity)
Number of policy trees

- |A|^((|Z|^T - 1)/(|Z| - 1)) policy trees at horizon T
- Example for |A| = 4 and |Z| = 2:

  Horizon            0    1    2     3        4
  # of policy trees  1    4    64    16,384   1,073,741,824
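A quick check of this count in code.

```python
# Number of distinct policy trees: |A| ** ((|Z|**T - 1) // (|Z| - 1)).
def num_policy_trees(nA, nZ, T):
    return nA ** ((nZ ** T - 1) // (nZ - 1))

for T in range(5):
    print(T, num_policy_trees(4, 2, T))
# Prints 1, 4, 64, 16384, 1073741824 for horizons 0..4, matching the table above.
```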
Policy graph

[Figure: a policy graph for a 2-state problem: the piecewise linear value function V(b), with each linear segment labeled by an action (a0, a1, a2) and the segments connected by observation transitions z0, z1.]
Policy iteration for POMDPs

- Sondik's (1978) algorithm represents the policy as a mapping from belief states to actions
  - only works under special assumptions
  - very difficult to implement
  - never used
- Hansen's (1998) algorithm represents the policy as a finite-state controller
  - fully general
  - easy to implement
  - faster than value iteration
Properties of policy iteration

- Theoretical
  - Monotonically improves the finite-state controller
  - Converges to an ε-optimal finite-state controller after a finite number of iterations
- Empirical
  - Runs from 10 to over 100 times faster than value iteration
Scaling up

- State abstraction and factored representation
- Belief compression
- Forward search and sampling approaches
- Hierarchical task decomposition
State abstraction and factored representation of POMDP

- DP algorithms are typically state-based
- Most AI representations are "feature-based"
- |S| is typically exponential in the number of features (or variables): the "curse of dimensionality"
- State-based representations for problems with more than a few variables are impractical
- Factored representations exploit regularities in transition and observation probabilities, and reward
Example: Part-painting problem [Draper, Hanks, Weld 1994]

- Boolean state variables: flawed (FL), blemished (BL), painted (PA), processed (PR), notified (NO)
- Actions: Inspect, Paint, Ship, Reject, Notify
- Cost function
  - Cost of 1 for each action
  - Cost of 1 for shipping an unflawed part that is not painted
  - Cost of 10 for shipping a flawed part or rejecting an unflawed part
- Initial belief state
  Pr(FL) = 0.3, Pr(BL|FL) = 1.0, Pr(BL|¬FL) = 0.0,
  Pr(PA) = 0.0, Pr(PR) = 0.0, Pr(NO) = 0.0
Factored representation of MDP [Boutilier et al. 1995; Hoey, St. Aubin, Hu, & Boutilier 1999]

[Figure: a dynamic Bayesian network relating the state variables (FL, SL-FL, PA, SH, RE, NO) at time t to their primed counterparts at time t+1, together with conditional probability tables (with entries such as 1.0, 0.95, and 0.0) that can also be drawn as decision diagrams.]

- Dynamic Bayesian network captures variable independence
- Algebraic decision diagram captures value independence
Decision diagrams

[Figure: a binary decision diagram (BDD) over variables X, Y, Z with terminal nodes TRUE and FALSE, next to an algebraic decision diagram (ADD) over the same variables with numeric terminal nodes such as 5.8, 3.6, 18.6, and 9.5.]
Operations on decision diagrams

- Addition (subtraction), multiplication (division), minimum (maximum), marginalization, expected value

[Figure: adding an ADD with leaves 1.0, 2.0, 3.0 to an ADD with leaves 10.0, 20.0, 30.0 yields an ADD with leaves 11.0, 12.0, 22.0, 23.0, 33.0.]

Complexity of the operators depends on the size of the decision diagrams, not the number of states! (A toy sketch of the apply operation follows.)
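A toy sketch of the apply operation on such diagrams (not a real ADD package): diagrams are floats at the leaves or (variable, low, high) tuples with a fixed variable order. A real implementation also caches and merges isomorphic subgraphs, which is what makes the cost depend on diagram size rather than the number of states.

```python
import operator

ORDER = {"X": 0, "Y": 1, "Z": 2}   # fixed variable order (illustrative)

def apply_op(f, g, op):
    """Combine two diagrams leaf-by-leaf with a binary operation such as +."""
    if isinstance(f, float) and isinstance(g, float):
        return op(f, g)
    # Split on the earliest variable appearing in either diagram.
    vf = ORDER[f[0]] if isinstance(f, tuple) else float("inf")
    vg = ORDER[g[0]] if isinstance(g, tuple) else float("inf")
    var = f[0] if vf <= vg else g[0]
    f_lo, f_hi = (f[1], f[2]) if isinstance(f, tuple) and f[0] == var else (f, f)
    g_lo, g_hi = (g[1], g[2]) if isinstance(g, tuple) and g[0] == var else (g, g)
    lo = apply_op(f_lo, g_lo, op)
    hi = apply_op(f_hi, g_hi, op)
    return lo if lo == hi else (var, lo, hi)   # merge identical branches

# ADD for "1.0 if X else 2.0" plus ADD for "10.0 if Y else 30.0".
f = ("X", 2.0, 1.0)          # (variable, value when False, value when True)
g = ("Y", 30.0, 10.0)
print(apply_op(f, g, operator.add))
```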
Symbolic dynamic programming for factored POMDPs [Hansen & Feng 2000]

- Factored representation of the value function: replace |S|-vectors with ADDs that only make relevant state distinctions
- Two steps of the DP algorithm
  - Generate new ADDs for the value function
  - Prune dominated ADDs
- State abstraction is based on aggregating states with the same value
Generation step: Symbolic implementation

[Figure: the DP backup performed symbolically: the action reward, transition probabilities, and observation probabilities (one branch per observation obs1, obs2, obs3) are combined as ADDs to produce the new value-function ADDs.]
Pruning step: Symbolic implementation

- Pruning is the most computationally expensive part of the algorithm
- A linear program must be solved for each (potential) ADD in the value function
- Because state abstraction reduces the dimensionality of the linear programs, it significantly improves efficiency
Improved performance

Degree of abstraction = number of abstract states / number of primitive states

Test problem           1     2     3     4     5     6     7
Degree of abstraction  0.01  0.03  0.10  0.12  0.44  0.65  1.00
Speedup factors (Generate and Prune steps): 42, 26, 17, 11, 0.4, 3, 0.8, 0.6, -3.4, 0.4, -0.7, 0.1, -6.5, -0.1
Optimal plan (controller) for part-painting problem

[Figure: a finite-state controller whose nodes are labeled with belief conditions over FL, BL, PA, PR, NO and with the actions Inspect, Paint, Ship, Notify, and Reject; the OK / not-OK observations from Inspect route the plan toward painting and shipping a good part or rejecting a flawed one.]
Approximate state aggregation

- Simplify each ADD in the value function by merging leaves that differ in value by less than ε (e.g., ε = 0.4)

Approximate pruning

- Prune vectors from the value function that add less than ε to the value of any belief state

[Figure: vectors a1 through a4 over the belief space from (0,1) to (1,0); a vector that improves the value function by less than ε everywhere is pruned.]
Error bound

- These two methods of approximation share the same error bound
- "Weak convergence," i.e., convergence to within 2ε/(1-γ) of optimal (where γ is the discount factor)
- After "weak convergence," decreasing ε allows further improvement
- Starting with a relatively high ε and gradually decreasing it accelerates convergence
Approximate dynamic programming

- Strategy: ignore differences of value less than some threshold ε
- Complementary methods
  - Approximate state aggregation
  - Approximate pruning
- ...which address the two sources of complexity
  - size of state space
  - size of value function (memory)
Belief compression

- Reduce the dimensionality of the belief space by approximating the belief state
- Examples of approximate belief states
  - tuple of the most-likely state plus the entropy of the belief state [Roy & Thrun 1999]
  - belief features learned by exponential-family Principal Components Analysis [Roy & Gordon 2003]
- Standard POMDP algorithms can be applied in the lower-dimensional belief space, e.g., grid-based approximation
Forward search

[Figure: a forward-search tree rooted at the current belief, alternating action branches (a0, a1) and observation branches (z0, z1).]
Sparse sampling

- Forward search can be combined with Monte Carlo sampling of possible observations and action outcomes [Kearns et al. 2000; Ng & Jordan 2000]
- Remarkably, the complexity is independent of the size of the state space!
- The on-line planner selects an ε-optimal action for the current belief state (see the sketch below)
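A hedged sketch of such an on-line planner over belief states, in the spirit of sparse sampling; `sample_obs`, `reward`, and `update` are assumed problem-specific helpers (simulate an observation given a belief and action, compute r(b,a), and update the belief), not functions defined in the tutorial.

```python
def sampled_q(b, a, depth, width, gamma, actions, sample_obs, reward, update):
    """Monte Carlo estimate of Q(b, a): sample `width` observations per
    (belief, action), recurse to a fixed depth, and back up the averages.
    The cost depends on depth and width, not on the size of the state space."""
    q = reward(b, a)
    if depth == 0:
        return q
    total = 0.0
    for _ in range(width):
        z = sample_obs(b, a)                     # sample an observation for (b, a)
        b_next = update(b, a, z)
        total += max(sampled_q(b_next, a2, depth - 1, width, gamma,
                               actions, sample_obs, reward, update)
                     for a2 in actions)
    return q + gamma * total / width

def choose_action(b, depth, width, gamma, actions, sample_obs, reward, update):
    """On-line planning: pick the action with the highest sampled Q at the current belief."""
    return max(actions, key=lambda a: sampled_q(
        b, a, depth, width, gamma, actions, sample_obs, reward, update))
```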
State-space decomposition

- For some POMDPs, each action/observation pair identifies a specific region of the state space
- A "deterministic observation" reveals that the world is in one of a small number of possible states
- The same holds for "hybrid POMDPs", which are POMDPs with some fully observable and some partially observable state variables

Region-based dynamic programming

[Figure: the belief simplex (a tetrahedron) partitioned into the surfaces corresponding to the regions.]
Hierarchical task decomposition

- We have considered abstraction in state space
- Now we consider abstraction in action space
- For fully observable MDPs:
  - Options [Sutton, Precup & Singh 1999]
  - HAMs [Parr & Russell 1997]
  - Region-based decomposition [Hauskrecht et al. 1998]
  - MAXQ [Dietterich 2000]
- A hierarchical approach may cause sub-optimality, but limited forms of optimality can be guaranteed
  - Hierarchical optimality (Parr and Russell)
  - Recursive optimality (Dietterich)
Hierarchical approach to POMDPs

- Theocharous & Mahadevan (2002)
  - based on hierarchical hidden Markov models
  - approximation
  - ~1000-state robot hallway-navigation problem
- Pineau et al. (2003)
  - based on Dietterich's MAXQ decomposition
  - approximation
  - ~1000-state robot navigation and dialogue
- Hansen & Zhou (2003)
  - also based on Dietterich's MAXQ decomposition
  - convergence guarantees and epsilon-optimality
Macro action as finite-state controller

[Figure: a navigation macro represented as a finite-state controller whose nodes execute North, South, East, West, or Stop and whose transitions are labeled by the observations clear, wall, and goal.]

- Allows exact modeling of the macro's effects
  - macro state transition probabilities
  - macro rewards
Taxi example [Dietterich 2000]

[Figure: the taxi grid world with pickup and drop-off locations.]

Task hierarchy [Dietterich 2000]

- Taxi
  - Get: Pickup, Navigate
  - Put: Putdown, Navigate
- Navigate: North, South, East, West

Hierarchical finite-state controller

[Figure: the controller for the Taxi task built from sub-controllers: Get invokes Navigate and Pickup, Put invokes Navigate and Putdown, and each Navigate sub-controller issues North, South, East, West, and Stop actions.]
MAXQ-hierarchical policy iteration

- Create an initial sub-controller for each sub-POMDP in the hierarchy
- Repeat until the error bound is less than ε
  - Identify the subtask that contributes most to the overall error
  - Use policy iteration to improve the corresponding controller
  - For each node of the controller, create an abstract action (for the parent task) and compute its model
  - Propagate the error up through the hierarchy

Modular structure of controller
Complexity reduction

- Per-iteration complexity of policy iteration: |A| |Q|^|Z|, where A is the set of actions, Z is the set of observations, and Q is the set of controller nodes
- Per-iteration complexity of hierarchical PI: |A| Σ_i |Q_i|^|Z|, where |Q| = Σ_i |Q_i|
- With hierarchical decomposition, the complexity is the sum of the complexities of the subproblems, instead of the product
Scalability

- MAXQ-hierarchical policy iteration can solve any POMDP, if it can decompose it into sub-POMDPs that can be solved by policy iteration
- Although each sub-controller is limited in size, the hierarchical controller is not limited in size
- Although the (abstract) state space of each subtask is limited in size, the total state space is not limited in size
Multi-Agent Planning with POMDPs

- Partially observable stochastic games
- Generalized dynamic programming

Multi-Agent Planning with POMDPs

[Figure: two agents interact with the same world: agent 1 takes action a1 and receives observation z1 and reward r1, while agent 2 takes action a2 and receives observation z2 and reward r2.]

- Many planning problems involve multiple agents acting in a partially observable environment
- The POMDP framework can be extended to address this
Partially observable stochastic game (POSG)

- A POSG is ⟨S, A1, A2, Z1, Z2, P, r1, r2⟩, where
  - S is a finite state set, with initial state s0
  - A1, A2 are finite action sets
  - Z1, Z2 are finite observation sets
  - P(s'|s, a1, a2) is the state transition function
  - P(z1, z2 | s, a1, a2) is the observation function
  - r1(s, a1, a2) and r2(s, a1, a2) are reward functions
- Special cases:
  - All agents share the same reward function
  - Zero-sum games
Plans and policies

- A local policy is a mapping πi : Zi* → Ai
- A joint policy is a pair ⟨π1, π2⟩
- Each agent wants to maximize its own long-term expected reward
- Although execution is distributed, planning can be centralized
Beliefs in POSGs

- With a single agent, a belief is a distribution over states
- How does this generalize to multiple agents?
- Could have beliefs over beliefs over beliefs, but there is no algorithm for working with these
Example

- States: grid cell pairs
- Actions: move up, down, left, right
- Transitions: noisy
- Goal: pick up balls
- Observations: red lines
Another Example

[Figure: two agents, each with a message (msg) to send over a shared channel.]

- States: who has a message to send?
- Actions: send or don't send
- Reward: +1 for a successful broadcast, 0 if there is a collision or the channel is not used
- Observations: was there a collision? (noisy)
Strategy Elimination in POSGs

- Could simply convert to normal form
- But the number of strategies is doubly exponential in the horizon length

[Figure: the normal-form payoff matrix, with an entry (Rij1, Rij2) for every pair of strategies i and j of the two agents.]
Generalized dynamic programming

- Initialize 1-step policy trees to be actions
- Repeat:
  - Evaluate all pairs of t-step trees from the current sets
  - Iteratively prune dominated policy trees
  - Form exhaustive sets of (t+1)-step trees from the remaining t-step trees
What Generalized DP Does

- The algorithm performs iterated elimination of dominated strategies in the normal-form game without first writing it down
- For cooperative POSGs, the final sets contain the optimal joint policy

Some Implementation Issues

- As before, pruning can be done using linear programming
- The algorithm keeps the value function and policy trees in memory (unlike the POMDP case)
- Currently there is no way to prune in an incremental fashion
A Better Way to Do Elimination

- We use dynamic programming to eliminate dominated strategies without first converting to normal form
- Pruning a subtree eliminates the set of trees containing it

[Figure: pruning one policy subtree eliminates every larger policy tree that contains it.]
Dynamic Programming

- Build the policy tree sets for both agents simultaneously
- Prune using a generalized belief space

[Figure: agent 1's policy trees (p1, p2) are compared over a generalized state space that includes agent 2's candidate trees (q1, q2) as well as the system states s1, s2, and vice versa.]
Dynamic Programming

[Figure sequence: successive iterations of the multi-agent dynamic programming backup. The sets of policy trees for the two agents, built from actions a1, a2 and observations o1, o2, shrink step by step as dominated trees are pruned.]
Complexity of POSGs

- The cooperative finite-horizon case is NEXP-hard, even with two agents whose observations completely determine the state [Bernstein et al. 2002]
- Implications:
  - The problem is provably intractable (because P ≠ NEXP)
  - It probably requires doubly exponential time to solve in the worst case