Planning in Factored Hybrid-MDPs

advertisement
Planning in Factored Hybrid-MDPs
Milos Hauskrecht
Department of Computer Science, University of Pittsburgh
Branislav Kveton
Intelligent Systems Program, University of Pittsburgh
Carlos Guestrin
Department of Computer Science, CMU
Factored MDPs
• Limitations
 Standard MDP formulation permits only flat discrete state and
action spaces
 Real-world problems are complex, factored, and involve
continuous quantities
• Factored MDPs with discrete components
 State and action spaces grow exponentially in the number of
variables
 Traditional methods for solving MDPs are polynomial in the size
of state and action spaces
• Factored MDPs with continuous components
 State and action spaces are infinitely large
 optimal value function or policy may not have finite support
 Naive discretization of continuous variables often leads to an
exponential complexity of solution methods
Hybrid Factored MDPs
• A hybrid factored MDP
(HMDP) is a 4-tuple M =
(X, A, P, R):
 X = (XD, XC) is a vector of
state variables (discrete or
continuous)
 A = (AD, AC) is a vector of
action variables (discrete
or continuous)
 P and R are factored
transition and reward
models represented by a
dynamic Bayesian network
(DBN)
A1
X1
X’1
R1
A2
X2
R2
t
X’2
A3
X3
t+1
X’3
Optimal Value Function and Policy
• Optimal policy π∗ maximizes the infinite horizon
discounted reward:
∞ t

E π ∑ γ R(x t , π(x t ))
 t =0

• Optimal value function V∗ is a fixed point of the Bellman
equation:


∗
∗
V (x ) = sup R(x, a )+ γ ∑ ∫ P(x′ | x, a )V (x′)dx′C 
a 
x ′D x ′C


• Linear value function approximation
 A compact representation of an MDP does not guarantee the
same structure in the optimal value function or policy
 Approximation by a linear combination of |w| basis functions:
V w ( x ) = ∑ w i fi (x i )
i
Hybrid Approximate Linear Programming
• Hybrid approximate linear programming (HALP)
formulation:
Basis function relevance weight:
α i = ∑ ∫ ψ (x )fi (x )dx C
minimize w
subject to :
xD x C
∑w α
∑ w F (x, a )− R(x, a )≥ 0
i
i
i
i i
i
∀x ∈ X, a ∈ A
Difference between the basis function fi(x) and
its discounted backprojection:
Fi (x, a ) = fi (x )− γ ∑ ∫ P(x′ | x, a )fi (x′)dx′C
x ′D x ′C
HALP solutions
HALP formulation contains infinite number of constraints,
∑ w iFi (x , a)− R (x , a) ≥ 0, ∀x ∈ X , a ∈ A
i
Methods:
• Monte Carlo (Hauskrecht, Kveton, 2004):

•
ε−HALP (Guestrin, Hauskrecht, Kveton, 2004):




•
Sample constraint space
Discretize constraint space on the regular grid
Take advantage of the structure
Cutting plane methods on discrete state space
Exponential in the treewidth of the problem
Direct cutting plane methods (Kveton, Hauskrecht, 2005)

Stochastic optimizations for finding a maximally violated
constraint
Scale-up potential
• Problems with over 25 state and 20 action variables
A8
X7
X10
Regulation device
represented by a
discrete action node
A5
X11
Irrigation channel
represented by a
continuous variable
A10
X17
A15
X2
A2
Inflow
regulation
device
X5
X8
X16
A14
Outflow
regulation
device
X14
A1
X4
A9
X13
A13
A4
X1
A6
X3
A3
X6
X15
A11
X9
A7
X12
A12
Large irrigation network
Traffic Management
• Recent surveys show





Americans spend billions of hours in city traffic jams
Congestions have grown in urban areas of every size
Congestions have increasing economic impact
The problem is too complex and growing rapidly
No technology has provided a definite solution yet
• Our approach
 Automatic traffic management to ease congestions
 Traffic management exhibits local structure with long-term global
consequences
Automatic Traffic Management
XF-3 – # cars on Forbes
Ave between Schenley Dr
and S Bellefield Ave
XF-2 – # cars on Forbes Ave between
Bigelow Blvd and Schenley Dr
XB-1 – # cars on Bigelow Blvd
between 5th Ave and Forbes Ave
XF-1 – # cars on Forbes Ave
between S Bouquet St and
Bigelow Blvd
AF-2 – traffic
lights at Forbes
Ave and
Schenley Dr
AF-1 – traffic lights
at Forbes Ave and
Bigelow Blvd
Download