Planning Policies Using Dynamic Optimization  Chris Atkeson 2012

Example: One Link Swing Up
• State: ( , )
• Action: ( )
• Cost function:
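To make the swing-up example concrete, here is a minimal sketch of a one-link model. Everything numeric in it is an assumption, not taken from the slides: the point-mass parameters, the convention that theta = 0 is upright, the Euler integration step, and in particular the quadratic cost, which only stands in for the cost function that did not survive on the slide.

```python
import math

# Assumed parameters (not given in the slides): point mass on a massless link.
MASS, LENGTH, GRAVITY, DT = 1.0, 1.0, 9.81, 0.01

def step(theta, theta_dot, torque, dt=DT):
    """One semi-implicit Euler step. theta = 0 is upright, theta = pi hangs down."""
    # With theta measured from upright, gravity pushes the link away from theta = 0.
    theta_ddot = (GRAVITY / LENGTH) * math.sin(theta) + torque / (MASS * LENGTH ** 2)
    theta_dot = theta_dot + dt * theta_ddot
    theta = theta + dt * theta_dot
    return theta, theta_dot

def cost(theta, theta_dot, torque, dt=DT):
    """Assumed quadratic one-step cost (hypothetical weights)."""
    return dt * (theta ** 2 + 0.1 * theta_dot ** 2 + 0.01 * torque ** 2)
```

With these conventions the upright state with zero torque is an (unstable) equilibrium with zero cost, which is what a swing-up task would aim for.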
What is a policy?
• Function mapping state to command: u(x)
How can we compute a policy?
• Optimize a trajectory from every starting
point. The value function at each point is
the cost of the optimized trajectory from it.
• Parameterize the policy u(x,p) and
optimize the parameters for some
distribution of initial conditions.
• Dynamic programming.
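As one possible illustration of the dynamic programming option, the sketch below runs value iteration on a coarse grid over (angle, velocity) for a one-link pendulum. The grid resolution, torque set, time step, discount factor, unit-pendulum dynamics, and quadratic cost are all assumptions made for the example, and the nearest-cell lookup is a deliberately crude stand-in for proper interpolation.

```python
import math

N = 21                       # cells per dimension (assumed resolution)
D_TH = math.pi / (N // 2)    # angle spacing; cell N//2 is theta = 0 (upright)
D_W = 8.0 / (N // 2)         # velocity spacing; grid covers [-8, 8] rad/s
ACTIONS = (-5.0, 0.0, 5.0)   # assumed torque choices
DT, GAMMA = 0.1, 0.98        # assumed time step and discount factor

def cell_state(i, j):
    """State at the center of grid cell (i, j)."""
    return (i - N // 2) * D_TH, (j - N // 2) * D_W

def nearest_cell(theta, omega):
    """Grid cell closest to a continuous state (angle wrapped, velocity clipped)."""
    theta = (theta + math.pi) % (2 * math.pi) - math.pi
    i = min(N - 1, max(0, round(theta / D_TH) + N // 2))
    j = min(N - 1, max(0, round(omega / D_W) + N // 2))
    return i, j

def simulate(theta, omega, u):
    """Assumed unit pendulum, theta measured from upright, one Euler step."""
    omega_next = omega + DT * (9.81 * math.sin(theta) + u)
    return theta + DT * omega_next, omega_next

def one_step_cost(theta, omega, u):
    """Assumed quadratic cost (hypothetical weights)."""
    return DT * (theta ** 2 + 0.1 * omega ** 2 + 0.01 * u ** 2)

def value_iteration(sweeps=100):
    """Repeated in-place Bellman backups: V(x) = min_u [cost(x, u) + GAMMA * V(x')]."""
    V = [[0.0] * N for _ in range(N)]
    policy = [[0.0] * N for _ in range(N)]
    for _ in range(sweeps):
        for i in range(N):
            for j in range(N):
                th, w = cell_state(i, j)
                best = None
                for u in ACTIONS:
                    ni, nj = nearest_cell(*simulate(th, w, u))
                    q = one_step_cost(th, w, u) + GAMMA * V[ni][nj]
                    if best is None or q < best:
                        best, policy[i][j] = q, u
                V[i][j] = best
    return V, policy

V, policy = value_iteration()
```

The table `policy` is then a policy in the sense above: look up the cell nearest the current state and apply the stored torque. A finer grid and interpolation between cells would be needed before trusting the result.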
Optimize Trajectory From Every Cell
Value Function
Types of tasks
• Regulator tasks: want to stay at $x_d$
• Trajectory tasks: go from A to B in time T,
or attain goal set G
• Periodic tasks: cyclic behavior such as
Ways to Parameterize Policies
• Linear function: $u(x,p) = p^T x = Kx$
• Polynomial (nonlinear controller)
• Associated with a trajectory
– $u(t) = u_{ff}(t) + K(t)(x - x_d(t))$
• Associated with trajectory(ies)
– $u(x) = u_{nn}(x) + K_{nn}(x)(x - x_{d,nn}(x))$
nn: nearest neighbor
• …
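The parameterizations listed above can be sketched as plain functions. This is only an illustration for a single-output controller: the storage layout (lists of per-step feedforward commands `u_ff`, gain rows `K`, and desired states `x_d`) and the squared Euclidean distance used to pick the nearest neighbor are assumptions.

```python
def linear_policy(x, K):
    """u = K x for one command; K and x are equal-length sequences."""
    return sum(k * xi for k, xi in zip(K, x))

def trajectory_policy(t_index, x, u_ff, K, x_d):
    """u(t) = u_ff(t) + K(t) (x - x_d(t)), with the trajectory stored per time step."""
    error = [xi - xdi for xi, xdi in zip(x, x_d[t_index])]
    return u_ff[t_index] + linear_policy(error, K[t_index])

def nearest_neighbor_policy(x, x_d, u_ff, K):
    """Same form, but indexed by the nearest stored state instead of by time."""
    def sq_dist(i):
        return sum((xi - xdi) ** 2 for xi, xdi in zip(x, x_d[i]))
    nn = min(range(len(x_d)), key=sq_dist)
    return trajectory_policy(nn, x, u_ff, K, x_d)

# Hypothetical stored trajectory with three points (angle, velocity).
x_d = [(3.14, 0.0), (1.5, 2.0), (0.0, 0.0)]
u_ff = [1.0, -1.0, 0.0]
K = [(-2.0, -0.5)] * 3
u = nearest_neighbor_policy((0.1, 0.0), x_d, u_ff, K)
```

The nearest-neighbor form turns a time-indexed trajectory controller into a state-indexed policy u(x), so it still gives a command when the system is ahead of or behind schedule.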
Optimizing Policies Using Function Optimization
Policy Search
• Parameterized policy u = (x,p), p is vector of
adjustable parameters.
• Simplest approach: Run it for a while, and
measure total cost.
• Use your favorite function optimization method
to search for the best p.
• There are tricks to improve policy comparison,
such as using the same perturbations in
different trials, and terminating a trial early if it
is clearly bad (racing algorithms).
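A minimal sketch of this kind of policy search follows, assuming a linear policy on a noisy one-link pendulum near upright. The dynamics, cost weights, noise level, and the random hill-climbing optimizer are all hypothetical choices. Seeding each trial gives every candidate p the same starts and disturbances, and a fixed cost threshold stands in for early termination; it is not a full racing algorithm.

```python
import math
import random

def rollout_cost(p, seed, steps=200, dt=0.02, give_up=1e3):
    """Total cost of running u = p[0]*theta + p[1]*omega from a seeded random start."""
    rng = random.Random(seed)            # same seed -> same start and disturbances
    theta, omega = rng.uniform(-0.5, 0.5), 0.0
    total = 0.0
    for _ in range(steps):
        u = p[0] * theta + p[1] * omega
        disturbance = rng.gauss(0.0, 0.5)
        omega += dt * (9.81 * math.sin(theta) + u + disturbance)
        theta += dt * omega
        total += dt * (theta ** 2 + 0.1 * omega ** 2 + 0.01 * u ** 2)
        if total > give_up:              # terminate a hopeless trial early
            break
    return total

def evaluate(p, seeds):
    """Average cost over a fixed set of seeded trials."""
    return sum(rollout_cost(p, s) for s in seeds) / len(seeds)

def policy_search(p0, iterations=200, step_size=1.0, seeds=range(5)):
    """Random hill climbing: keep a perturbed p only if its average cost is lower."""
    search_rng = random.Random(0)
    best_p, best_cost = list(p0), evaluate(p0, seeds)
    for _ in range(iterations):
        candidate = [pi + search_rng.gauss(0.0, step_size) for pi in best_p]
        c = evaluate(candidate, seeds)
        if c < best_cost:
            best_p, best_cost = candidate, c
    return best_p, best_cost

p_init = [0.0, 0.0]
p_best, cost_best = policy_search(p_init)
```

Because every candidate sees the same seeded trials, a cost difference between two parameter vectors reflects the parameters rather than luck in the disturbances. Any other optimizer could replace the hill climber without changing `evaluate`.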