An Incremental Sampling-based Algorithm for Stochastic Optimal Control
Vu Anh Huynh, Sertac Karaman, Emilio Frazzoli. ICRA 2012.
Presented by Martha Witick, Department of Computer Science, Rice University.

Motion Planning!
• Continuous time
• Continuous state space
• Continuous controls
• Noisy
MDPs?

Motivation
• Continuous-time, continuous-space stochastic optimal control problems have no closed-form or exact algorithmic solutions
• Idea: approximate the continuous problem with a discrete MDP and solve that
• ... but the discretization grows exponentially with the dimensions of the state and control spaces
• Sampling-based methods are fast and effective, but...
• RRT: not optimal
• RRT*: cannot handle systems with uncertain dynamics

Overview
• Have a continuous-time, continuous-space stochastic optimal control problem
• Want the optimal cost function J* and, ultimately, an optimal policy μ*
• Create a discrete-state Markov Decision Process and refine it, iterating until the current cost-to-go Jn* is close enough to J*
• This is what the incremental Markov Decision Process (iMDP) algorithm does

Outline
• Continuous Stochastic Optimal Control Problem Definition
• Discrete Markov Chain Approximation
• iMDP Algorithm
• iMDP Results
• Conclusion

Continuous Stochastic Dynamics
• State space S ⊂ ℝ^dx with interior S0 and smooth boundary δS
• State x(t) ∈ S, control u(t), time t ≥ 0
• Consider a stochastic dynamical system whose motion combines the robot's dynamics with noise
• Solutions look like diffusing trajectories starting at x(0) and evolving until x(t) hits the boundary δS
• A Markov control (or policy) μ: S → U needs only the current state x(t)
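The dynamics equation on these slides was rendered as an image and did not survive extraction. As a hedged reconstruction, the standard controlled-diffusion form, consistent with the drift f and noise gain F named later in the deck, is:

```latex
% Reconstructed (not copied from the slides): controlled stochastic dynamics.
% f is the drift (the robot's nominal dynamics), F scales the noise, and
% w(t) is a standard Wiener process; the equation governs x(t) until it
% first hits the boundary of S.
\[
  dx(t) \;=\; f\bigl(x(t), u(t)\bigr)\,dt \;+\; F\bigl(x(t), u(t)\bigr)\,dw(t),
  \qquad x(0) = x_0 \in S .
\]
```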
Continuous Stochastic Dynamics
• The expected cost-to-go function under policy μ combines:
  • the first exit time (when x(t) first leaves the interior S0)
  • a discount rate α ∈ [0,1)
  • a cost rate function g
  • a terminal cost function h
• (A reconstruction of this formula appears at the end of the Markov chain section below.)

Continuous Stochastic Dynamics
• We want the optimal cost-to-go function J*
• We want to compute J* so we can extract its optimal policy μ*
• But solving this continuous problem is hard

Solving this is hard
• Let's make a discrete model!

Outline
• Continuous Stochastic Optimal Control Problem Definition
• Discrete Markov Chain Approximation
• iMDP Algorithm
• iMDP Results
• Conclusion

Markov Chain Approximation
• Approximate the stochastic dynamics with a sequence of MDPs {Mn}, n = 0, 1, 2, ...
• Discretize states: take a finite set of states Sn from S and assign transition probabilities Pn
• Discretize time: assign a non-negative holding time Δtn(z) to each state z
• No need to discretize the controls

Markov Chain Approximation: Local Consistency Property
• For all states z ∈ S, the holding times Δtn(z) shrink to zero as n grows
• For all states z ∈ S and all controls v ∈ U, the chain's local mean and covariance per unit holding time approach the drift and diffusion of the continuous dynamics (see the reconstructed conditions at the end of this section)

Markov Chain Approximation
• The control problem on the chain is analogous, so define a discrete discounted cost, with the continuous cost alongside for comparison

Discontinuity and Remarks
• f, F, g, and h can be discontinuous and the approximation still works
• While the controlled Markov chain has a discrete state structure and the stochastic dynamical system has a continuous model, they BOTH have a continuous control space
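The cost formulas and local-consistency conditions on the slides above were images and did not survive extraction. The following is a hedged reconstruction of their standard forms, built only from the quantities the slides name (first exit time, discount rate α, cost rate g, terminal cost h, holding times Δtn); the paper's exact notation may differ.

```latex
% Expected cost-to-go under policy mu; T is the first exit time from the interior S0.
\[
  J_\mu(x) = \mathbb{E}\!\left[\int_0^{T} \alpha^{t}\, g\bigl(x(t),\mu(x(t))\bigr)\,dt
             + \alpha^{T} h\bigl(x(T)\bigr) \,\middle|\, x(0)=x\right],
  \qquad J^*(x) = \min_{\mu} J_\mu(x).
\]
% Discrete analogue on the n-th chain \{\xi_i\}: replace the integral with a sum
% weighted by the holding times, stopping at the first exit index I_n.
\[
  J_{n,\mu}(z) = \mathbb{E}\!\left[\sum_{i=0}^{I_n-1}
      \alpha^{t_i}\, g\bigl(\xi_i,\mu(\xi_i)\bigr)\,\Delta t_n(\xi_i)
      + \alpha^{t_{I_n}} h(\xi_{I_n}) \,\middle|\, \xi_0 = z\right],
  \qquad t_i = \sum_{j<i}\Delta t_n(\xi_j).
\]
% Local consistency: vanishing holding times, and local mean/covariance that
% match the drift f and diffusion F F^T of the continuous dynamics.
\[
  \lim_{n\to\infty}\Delta t_n(z) = 0, \qquad
  \frac{\mathbb{E}\!\left[\xi_{i+1}-\xi_i \mid \xi_i = z, v\right]}{\Delta t_n(z)} \to f(z,v), \qquad
  \frac{\operatorname{Cov}\!\left[\xi_{i+1}-\xi_i \mid \xi_i = z, v\right]}{\Delta t_n(z)} \to F(z,v)F(z,v)^{\mathsf{T}}.
\]
```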
Outline
• Continuous Stochastic Optimal Control Problem Definition
• Discrete Markov Chain Approximation
• iMDP Algorithm
• iMDP Results
• Conclusion

iMDP Algorithm
Set the 0th MDP M0 to empty
while n < N do
    nth MDP Mn <- (n-1)th MDP
    Sample a state from the boundary δS and add it to Mn
    Sample zs from the interior S0
    Set znearest to the nearest state in Sn
    Compute a trajectory x:[0,t] with control u from znearest to zs
    Set z to x(0) and add it to Sn
    Compute its cost and save it
    Update() the new z and Kn states in Sn
    n = n + 1

iMDP Algorithm: Update()
• Uniformly sample Cn controls, or create controls steering from z to the nearest Cn states; call this set Un
• For each control v in Un:
  • Compute the new transition probabilities to the nearest log(|Sn|) states Znear
  • From the holding time τ, the cost rate function g, the discount rate α, and the expected cost-to-go over Znear under the n-th policy, compute a candidate cost J
  • If the new cost J is less than the old cost, update z's cost, policy, holding time, and the recorded number of states in Sn
• (A Python sketch of this loop and of Update() appears after the complexity slide below.)

iMDP: Generating a Policy

iMDP Time Complexity
• |Sn|^θ states updated and log(|Sn|) controls processed per iteration
• Time complexity when Pn is constructed from linear equations: O(n |Sn|^θ log |Sn|)
• Time complexity when Pn is constructed from a Gaussian distribution: O(n |Sn|^θ (log |Sn|)^2)
• Space complexity: O(|Sn|)
• State space size: |Sn| = Θ(n)
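Below is a minimal Python sketch of the per-iteration structure described on the algorithm and Update() slides, not the authors' implementation. Every helper on the `model` object (sample_boundary, sample_interior, nearest, steer, sample_controls, transition_probs, holding_time, g, h, theta) is a hypothetical stand-in for a routine the slides only name, and the Bellman update uses the standard discounted form built from the holding time τ, cost rate g, and discount α.

```python
import math
import random

# Hedged sketch of one iMDP refinement pass (states are assumed hashable,
# e.g. tuples).  `model` bundles the problem-specific primitives named on
# the slides; every attribute used here is a hypothetical placeholder.

def imdp_iteration(model, states, cost, policy, hold, alpha):
    """One pass of the main loop: grow M_{n-1} into M_n and run updates."""
    # Sample a boundary state; its cost-to-go is just the terminal cost h.
    zb = model.sample_boundary()
    states.append(zb)
    cost[zb], policy[zb], hold[zb] = model.h(zb), None, 0.0

    # Sample an interior state and connect it to the nearest existing state.
    zs = model.sample_interior()
    z_nearest = model.nearest(zs, states)
    traj, u, tau = model.steer(z_nearest, zs)   # local trajectory x:[0, t]
    z = traj[0]                                 # slides: new state is x(0)
    states.append(z)
    cost[z], policy[z], hold[z] = model.h(z), u, tau   # provisional values

    # Update the new state plus K_n ~ |S_n|^theta existing states.
    k_n = max(1, round(len(states) ** model.theta))
    for zu in [z] + random.sample(states, min(k_n, len(states))):
        bellman_update(model, zu, states, cost, policy, hold, alpha)


def bellman_update(model, z, states, cost, policy, hold, alpha):
    """Update() from the slides: try C_n candidate controls, keep the best."""
    c_n = max(1, int(math.log(len(states))) + 1)
    for v in model.sample_controls(z, states, c_n):
        tau = model.holding_time(z, v)
        # Transition probabilities to the nearest ~log|S_n| states Znear.
        znear, probs = model.transition_probs(z, v, states)
        expected = sum(p * cost[w] for w, p in zip(znear, probs))
        # Discounted one-step cost: stage cost over tau plus expected cost-to-go.
        j_new = model.g(z, v) * tau + (alpha ** tau) * expected
        if j_new < cost[z]:                     # keep only improving controls
            cost[z], policy[z], hold[z] = j_new, v, tau
```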
iMDP Analysis

Outline
• Continuous Stochastic Optimal Control Problem Definition
• Discrete Markov Chain Approximation
• iMDP Algorithm
• iMDP Results
• Conclusion

Results
• Experiment (a): convergence of iMDP applied to a stochastic LQR problem with dynamics dx = ...
• Experiment (b): a system with stochastic single-integrator dynamics driven to a goal region (upper right) with free ending time in a cluttered environment
• Experiment (c): noise-free versus noisy planning with σ = 0.37; note that (b) failed to find a valid trajectory

Conclusions
• The iMDP algorithm:
  • incrementally builds discrete MDPs to approximate continuous stochastic optimal control problems
  • improves the cost estimate every iteration (value iteration) with a novel approach to computing Bellman updates
  • has a fast total processing time of O(k^(1+θ) log k)
  • removes the need for the point-to-point steering required by RRT-style planners (e.g., RRT*)