
AAMAS 2012
Generalized and Bounded Policy Iteration for Finitely
Nested Interactive POMDPs: Scaling Up
Ekhlas Sonu, Prashant Doshi
Dept. of Computer Science
University of Georgia
Overview
We generalize Bounded Policy Iteration for POMDPs to the multiagent decision-making framework of Interactive POMDPs
We discuss the challenges associated with this generalization
The generalized approach achieves substantial scalability
Introduction: Interactive POMDP
Interactive POMDP (Gmytrasiewicz&Doshi,05):
Generalization of POMDP to multiagent settings
Applications
Money Laundering (Ng et al.,10)
Lemonade stand game (Wunder et al.,11)
Modeling human behavior (Doshi et al.,10), and more…
Differs from Dec-POMDP
Dec-POMDP: Team of agents
I-POMDP: Individual agent in presence of other agents
– cooperative, competitive or neutral settings
Introduction: I-POMDP
(Finitely-nested and 2 agents)
I-POMDP_{i,l} = ⟨IS_{i,l}, A, Ω_i, T_i, O_i, R_i, γ⟩
[Figure: two agents, i and j, interacting through the physical states S. Agent i's dynamics are given by T_i(s, a_i, a_j, s'), O_i(s', a_i, a_j, o_i) and R_i(s, a_i, a_j), and agent j's by T_j(s, a_i, a_j, s'), O_j(s', a_i, a_j, o_j) and R_j(s, a_i, a_j); agent i maintains a belief over the interactive states.]
IS_{i,l} = S × Θ_{j,l-1}
S: set of physical states
Θ_{j,l-1}: set of intentional models of j at level l-1
A = A_i × A_j
Ω_i: set of observations of i
T_i: S × A_i × A_j → Δ(S)
O_i: S × A_i × A_j → Δ(Ω_i)
R_i: S × A_i × A_j → ℝ
I-POMDP Belief Update and Value Function
Belief Update:
An agent must predict the other agent's actions by anticipating its updated beliefs over time. Therefore, the belief update consists of:
Updating the distribution over physical states: uses agent i's transition and observation functions
Updating the distribution over dynamic models: uses the other agent's belief update and its observation function
Value Function:
Must incorporate the I-POMDP belief update when computing long-term rewards (a sketch of the update follows)
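A sketch of the finitely nested belief update for the two-agent case (following Gmytrasiewicz & Doshi, 05), where an interactive state is $is = (s, \theta_{j,l-1})$, $\hat{\theta}_j$ denotes j's frame, and $\tau_{\theta_j}$ denotes agent j's own belief update:

$$
b_{i,l}^{t}(is^{t}) \;\propto\; \sum_{is^{t-1}:\, \hat{\theta}_j^{t-1} = \hat{\theta}_j^{t}} b_{i,l}^{t-1}(is^{t-1}) \sum_{a_j} \Pr(a_j \mid \theta_{j,l-1}^{t-1})\, T_i(s^{t-1}, a_i, a_j, s^{t})\, O_i(s^{t}, a_i, a_j, o_i^{t}) \sum_{o_j} O_j(s^{t}, a_i, a_j, o_j)\, \tau_{\theta_j^{t}}\!\left(b_{j,l-1}^{t-1}, a_j, o_j, b_{j,l-1}^{t}\right)
$$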
Solving I-POMDP (Related Work)
Previous work: Value iteration algorithms
Interactive particle filtering (I-PF) (Doshi&Gmytrasiewicz,09)
nested particle filter: sampled, recursive representation of the agent's nested belief
Interactive point-based value iteration (I-PBVI)
(Doshi&Perez,08)
point based domination check
Iteratively apply Backup Operator:
Expensive operator
Scale only to toy problems
Over multiple time steps:
Curse of history
Curse of dimensionality
[Figure: agent i maintains a belief b ∈ Δ(IS_{i,l}) over the physical states and the models of j.]
Background
Policy Iteration
Class of solution algorithms that search the policy space
Exponential growth in solution size
Bounded Policy Iteration (Poupart&Boutilier,03)
Fixed solution size (controlled growth)
Applied in POMDP & Dec-POMDP
Dec-BPI (Bernstein, Hansen & Zilberstein, 05) -- its optional correlation device may not be feasible in non-cooperative settings
Contribution:
We present the first (approximate) policy iteration algorithm for I-POMDPs: a generalization of BPI
We show scalability to larger problems
Policy Representation
Possible representations of a policy
Node → action
Edge → observation
Tree representation
Finite state controllers (Hansen, 1998)
Node has an infinite horizon policy rooted at it
Node has a value vector associated with it
which is a linear vector over the entire belief
space
Beliefs are mapped to the node n that maximizes the expected reward from that belief,
i.e., argmax_n b · V_n
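A minimal sketch of this mapping in code (array names are illustrative, not from the paper); each node's value vector is a vector over the physical states:

```python
import numpy as np

def best_node(belief, node_values):
    """Return the controller node whose value vector maximizes b . V_n."""
    values = [np.dot(belief, v) for v in node_values]
    return int(np.argmax(values))

# Example: two nodes over a 2-state belief space
node_values = [np.array([1.0, 0.0]), np.array([0.2, 0.9])]
print(best_node(np.array([0.6, 0.4]), node_values))  # -> 0
```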
Finite State Controller
A finite state controller for agent i may be defined by:
N_i: the set of nodes in the FSC of agent i
E_i: the set of edge labels (the observations Ω_i)
Let V_n denote the value vector of node n; the vectors {V_n : n ∈ N_i} induce a partition of the entire belief space (each belief is assigned to its maximizing node)
Policy Iteration
Starting with an initial controller, iterate over
two steps until convergence:
Policy Evaluation:
Evaluate V_n for each node by solving a system of linear equations (a sketch follows this slide)
Policy Improvement:
Construct a better controller
Possibly by adding new nodes
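A minimal sketch of the evaluation step for a single-agent POMDP controller with deterministic nodes, assuming the array shapes noted in the docstring (illustrative only, not the authors' implementation). Each node's value satisfies V(n, s) = R(s, a_n) + γ Σ_{s'} T(s, a_n, s') Σ_o O(o | a_n, s') V(η(n, o), s'), which is one linear system:

```python
import numpy as np

def evaluate_controller(T, O, R, gamma, actions, successors):
    """Solve the linear system giving the value of every (node, state) pair.

    Assumed shapes: T[s, a, s'] transition probs, O[a, s', o] observation
    probs, R[s, a] rewards; actions[n] is node n's action and
    successors[n][o] is the next node on observation o.
    """
    N, S, nO = len(actions), T.shape[0], O.shape[2]
    A_mat = np.eye(N * S)
    b = np.zeros(N * S)
    for n in range(N):
        a = actions[n]
        for s in range(S):
            row = n * S + s
            b[row] = R[s, a]
            for s2 in range(S):
                for o in range(nO):
                    n2 = successors[n][o]
                    # subtract gamma * P(s', o | s, a) from the coefficient
                    A_mat[row, n2 * S + s2] -= gamma * T[s, a, s2] * O[a, s2, o]
    return np.linalg.solve(A_mat, b).reshape(N, S)
```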
Policy Improvement (Hansen,98)
Apply the backup operator, i.e., construct new nodes for every combination of an action and a transition on each observation
|A||N|^{|Ω|} new nodes (see the enumeration sketch below)
Add them to the controller
Prune all dominated nodes
Drawback: leads to exponential growth in controller size
[Figure: example of policy iteration for a POMDP — value vectors over the belief P(s) ∈ [0, 1]]
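A sketch of where the exponential growth comes from: each backed-up node pairs one action with one successor node per observation, so a full backup enumerates |A||N|^{|Ω|} candidates (names below are illustrative):

```python
from itertools import product

def exhaustive_backup_candidates(num_actions, num_nodes, num_obs):
    """Enumerate the (action, successor-per-observation) pairs of a full DP backup."""
    return [
        (a, succ)
        for a in range(num_actions)
        for succ in product(range(num_nodes), repeat=num_obs)
    ]

# e.g. 5 actions, 10 nodes, 4 observations -> 5 * 10**4 = 50,000 candidates
print(len(exhaustive_backup_candidates(5, 10, 4)))
```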
Bounded Policy Iteration (BPI)
(Poupart&Boutilier,03)
Instead of performing a complete backup, replace a node with a better node
Linear program for partial backup
New node is a convex
combination of two backed up
nodes
Changes in controller:
P(a_i | n): stochastic action policy
P(n' | n, a_i, o_i): stochastic observation (node-transition) policy
Local Optima
This form of policy improvement is prone to converging to local optima
When all nodes are tangent to the backed-up value function: ε = 0, no improvement
Escape technique suggested by Poupart & Boutilier (2003) in BPI
[Figure: value vectors tangent to the backed-up value function over the belief P(s) ∈ [0, 1]]
I-POMDP Generalization: Nested Controllers
Nested Controllers: Analogous to nested beliefs
Embed recursive reasoning
Starting from level 0 upwards, for each level l, construct a finite state controller for each frame of each agent
For convenience of representation, assume two agents and one frame per agent at each level
[Figure: agent i's level 2 controller, agent j's level 1 controller, and agent i's level 0 controller, linked by the nesting]
Interactive BPI: Policy Evaluation
Compute the value vector of each node using the estimate of
other agent’s model by solving a system of linear equations:
For each ni,l, and interactive state, is=(s, nj,l-1), solve:
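The system itself did not survive extraction; one consistent formulation, writing Pr(a | n) and Pr(n' | n, a, o) for the stochastic action and node-transition distributions of the two controllers (notation assumed here), is:

$$
V(n_{i,l}, s, n_{j,l-1}) = \sum_{a_i} \Pr(a_i \mid n_{i,l}) \sum_{a_j} \Pr(a_j \mid n_{j,l-1}) \Big[ R_i(s, a_i, a_j) + \gamma \sum_{s'} T_i(s, a_i, a_j, s') \sum_{o_i} O_i(s', a_i, a_j, o_i) \sum_{o_j} O_j(s', a_i, a_j, o_j) \sum_{n'_{i,l}} \Pr(n'_{i,l} \mid n_{i,l}, a_i, o_i) \sum_{n'_{j,l-1}} \Pr(n'_{j,l-1} \mid n_{j,l-1}, a_j, o_j)\, V(n'_{i,l}, s', n'_{j,l-1}) \Big]
$$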
I-BPI: Policy Improvement
Pick a node n_{i,l} and perform a partial backup using a linear program to construct another node n'_{i,l} that pointwise dominates n_{i,l} by some ε > 0
[Figure: the backed-up vector dominates the old vector by ε over the belief P(s) ∈ [0, 1]]
The new vector dominates the old vector by ε and hence replaces it
I-BPI: Policy Improvement
Pick a node n_{i,l} and perform a partial backup using a linear program to construct another node that pointwise dominates n_{i,l} by some ε > 0
Objective function: maximize ε
Variables: ε and the new node's stochastic action and node-transition probabilities
Constraints: ε-dominance at every interactive state, plus probability (simplex) constraints (a sketch of the LP follows)
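The LP itself is not reproduced on the slide; a sketch in the style of BPI's partial backup, with variables ε, c_{a_i} (action probabilities of the new node) and c_{a_i, o_i, n'_{i,l}} (joint action and successor probabilities), and V as written in the evaluation equation above:

$$
\begin{aligned}
\max_{\epsilon,\, c}\quad & \epsilon \\
\text{s.t.}\quad & V(n_{i,l}, s, n_{j,l-1}) + \epsilon \;\le\; \sum_{a_i} \sum_{a_j} \Pr(a_j \mid n_{j,l-1}) \Big[ c_{a_i}\, R_i(s, a_i, a_j) \\
& \quad + \gamma \sum_{s'} T_i(s, a_i, a_j, s') \sum_{o_i} O_i(s', a_i, a_j, o_i) \sum_{o_j} O_j(s', a_i, a_j, o_j) \\
& \qquad \times \sum_{n'_{j,l-1}} \Pr(n'_{j,l-1} \mid n_{j,l-1}, a_j, o_j) \sum_{n'_{i,l}} c_{a_i, o_i, n'_{i,l}}\, V(n'_{i,l}, s', n'_{j,l-1}) \Big] \quad \forall\, (s, n_{j,l-1}) \\
& \sum_{a_i} c_{a_i} = 1, \qquad \sum_{n'_{i,l}} c_{a_i, o_i, n'_{i,l}} = c_{a_i} \;\; \forall\, a_i, o_i, \qquad c \ge 0
\end{aligned}
$$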
Escaping Local Optima
[Figure: escaping a local optimum — from a tangent belief b^T, consider the one-step reachable beliefs b^{R1} and b^{R2}, where a node that improves the value can be added]
Analogous to escaping for POMDPs
Algorithm: I-BPI
1. Starting from level 0 up to level l, construct a one-node controller at each level with a random action and a self-transition.
2. Reformulate the interactive state space and evaluate the controllers
[Figure: nested controllers at levels L0 through Ll, grown over time]
Algorithm: I-BPI
3. Starting from level 0 up to level l, perform one step of the backup operator; at most |A_{i(j)}| nodes are added
Algorithm: I-BPI
4. Starting from level 0 up to level l, reformulate the interactive state space, then perform policy evaluation followed by policy improvement at each level
Algorithm: I-BPI
5. Repeat step 4 until convergence
6. If converged, push the nested controller out of the local optimum by adding new nodes (a control-flow sketch follows)
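A high-level control-flow sketch of steps 1-6; every helper below is an injected placeholder, so this illustrates the loop structure only, not the authors' implementation:

```python
def interactive_bpi(levels, init_controller, reformulate, evaluate,
                    full_backup, bounded_improve, escape, max_iters=100):
    """Skeleton of I-BPI over nested controllers at levels 0..levels."""
    # Step 1: one-node controller per level, random action, self-transition
    controllers = [init_controller(k) for k in range(levels + 1)]

    # Step 2: reformulate the interactive state space and evaluate, bottom up
    for k in range(levels + 1):
        reformulate(controllers, k)
        evaluate(controllers[k])

    # Step 3: one step of the full backup operator at each level
    for k in range(levels + 1):
        full_backup(controllers[k])

    # Steps 4-6: evaluate and improve each level until no node can be
    # improved; on convergence, try to escape the local optimum
    for _ in range(max_iters):
        improved = False
        for k in range(levels + 1):
            reformulate(controllers, k)
            evaluate(controllers[k])
            improved |= bounded_improve(controllers[k])
        if not improved and not any(escape(c) for c in controllers):
            break
    return controllers
```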
Evaluation
AUAV: 81 states, 5 actions, 4 observations
Money laundering: 99 states, 11 actions, 9 observations
Scales to larger problems...
[Table: runtime of the algorithm and the average rewards from simulations; * denotes expected rewards obtained from the value vectors]
Evaluation
[Figure: simulation results for the multiagent tiger problem, obtained by simulating agent controllers of various sizes for levels 1-4]
Discussion
Advantages of I-BPI
Significantly quicker than previous approaches and scales to larger problems (hundreds of states, tens of actions and observations)
Mitigates curse of history and curse of dimensionality
Improved solution quality
Limitations
Prone to local optima
Escape technique may not work for certain local optima
Not entirely free from curses of history and dimensionality
Future Work
Scale to even larger problems and more agents
Mealy machine implementation for controllers (Amato et al. 2011)
Thank you…
Poster #731 today at 16:00-17:00 (Panel 98)
Acknowledgement:
This research is partially supported by an
NSF CAREER grant, #IIS-0845036
Introduction: POMDP
POMDP: Framework for optimal sequential decision making under uncertainty in single-agent settings
⟨S, A, Ω, T, O, R, γ⟩
[Figure: a single agent acting on the physical states S: it takes action a, the state transitions according to T(s, a, s'), it receives observation z according to O(s', a, z) and reward R(s, a), and maintains a belief b ∈ Δ(S)]
S: set of states
A: set of actions
Ω: set of observations
γ: discount factor
h: horizon
T: S × A → Δ(S)
O: S × A → Δ(Ω)
R: S × A → ℝ
The agent maintains a belief b over the physical states
Policy π: Δ(S) → A
The objective is to find a policy π that maximizes the long-term expected reward:
ER = immediate reward + discounted future reward
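For reference, the standard single-agent belief update underlying this (a known result, consistent with the definitions above):

$$
b'(s') \;=\; \frac{O(s', a, z) \sum_{s \in S} T(s, a, s')\, b(s)}{\Pr(z \mid b, a)}
$$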