Planning Policies Using Dynamic Optimization
Chris Atkeson, 2012

Example: One Link Swing Up
• State: (θ, θ̇)
• Action: (τ)
• Cost function: [equation on slide]

What is a policy?
• A function mapping state to command: u(x)

Policy
[figure]

How can we compute a policy?
• Optimize a trajectory from every starting point. The value function is the cost of each of those trajectories.
• Parameterize the policy u(x, p) and optimize the parameters for some distribution of initial conditions.
• Dynamic programming (see the value-iteration sketch at the end of this section).

Optimize Trajectory From Every Cell
[figure]

Value Function
[figures]

Types of tasks
• Regulator tasks: want to stay at xd
• Trajectory tasks: go from A to B in time T, or attain a goal set G
• Periodic tasks: cyclic behavior such as walking

Ways to Parameterize Policies
• Linear function: u(x, p) = pᵀx = Kx
• Table
• Polynomial (nonlinear controller)
• Associated with a trajectory:
  – u(t) = uff(t) + K(t)(x – xd(t))
• Associated with a set of trajectories (nn: nearest neighbor; see the sketch below):
  – u(x) = unn(x) + Knn(x)(x – xdnn(x))
• …

Optimizing Policies Using Function Optimization

Policy Search
• Parameterized policy u = π(x, p), where p is a vector of adjustable parameters.
• Simplest approach: run the policy for a while and measure the total cost.
• Use your favorite function optimization method to search for the best p (see the sketch below).
• There are tricks to improve policy comparison, such as using the same perturbations in different trials, and terminating a trial early if it is clearly bad (racing algorithms).
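The "Dynamic programming" option above can be made concrete for the one-link swing-up example. The sketch below runs value iteration on a discretized (θ, θ̇) grid; the pendulum parameters, cost weights, discount factor, torque set, and grid resolution are illustrative assumptions, not values from the slides.

```python
# Value function (and greedy policy) for one-link swing-up via dynamic
# programming on a discretized (theta, thetadot) grid. All numeric
# constants here are illustrative assumptions.
import numpy as np

m, g, l, b, dt = 1.0, 9.81, 1.0, 0.1, 0.05    # pendulum model (assumed)
thetas = np.linspace(-np.pi, np.pi, 51)       # theta = 0 is upright (assumed)
thetadots = np.linspace(-8.0, 8.0, 51)
taus = np.linspace(-5.0, 5.0, 11)             # discretized torques
gamma = 0.98                                  # discount factor

V = np.zeros((thetas.size, thetadots.size))   # value function table
policy = np.zeros_like(V)                     # greedy policy table

def step(th, thd, tau):
    """One Euler step of the pendulum dynamics."""
    thdd = (tau - m * g * l * np.sin(th) - b * thd) / (m * l * l)
    thd2 = np.clip(thd + dt * thdd, thetadots[0], thetadots[-1])
    th2 = (th + dt * thd2 + np.pi) % (2 * np.pi) - np.pi   # wrap angle
    return th2, thd2

def cost(th, thd, tau):
    """Quadratic penalty around the upright goal (assumed form)."""
    return dt * (th ** 2 + 0.1 * thd ** 2 + 0.01 * tau ** 2)

def nearest(th, thd):
    """Grid cell closest to a continuous state."""
    return np.argmin(np.abs(thetas - th)), np.argmin(np.abs(thetadots - thd))

# Value iteration: back each cell up through every action until V stops moving.
for sweep in range(300):
    V_new = np.empty_like(V)
    for i, th in enumerate(thetas):
        for j, thd in enumerate(thetadots):
            best = np.inf
            for tau in taus:
                q = cost(th, thd, tau) + gamma * V[nearest(*step(th, thd, tau))]
                if q < best:
                    best, policy[i, j] = q, tau
            V_new[i, j] = best
    if np.max(np.abs(V_new - V)) < 1e-4:
        break
    V = V_new
```

Looking up `policy` at the nearest grid cell then gives a tabular policy u(x), i.e. the "Table" entry in the parameterization list above.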
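The trajectory-associated parameterization u(x) = unn(x) + Knn(x)(x – xdnn(x)) can also be sketched directly. In the snippet below, the class and its contents (knot states, feedforward commands, feedback gains) are hypothetical placeholders; in practice the library would be filled from optimized trajectories.

```python
# Nearest-neighbor trajectory-library policy:
#   u(x) = u_nn(x) + K_nn(x) (x - xd_nn(x))
# The stored numbers below are made-up placeholders for illustration.
import numpy as np

class TrajectoryLibraryPolicy:
    def __init__(self, xds, uffs, Ks):
        self.xds = np.asarray(xds)    # stored trajectory states, shape (N, n)
        self.uffs = np.asarray(uffs)  # feedforward commands,     shape (N, m)
        self.Ks = np.asarray(Ks)      # feedback gain matrices,   shape (N, m, n)

    def __call__(self, x):
        # Nearest stored state (Euclidean distance; a real system would
        # likely scale the state dimensions first).
        i = np.argmin(np.linalg.norm(self.xds - x, axis=1))
        return self.uffs[i] + self.Ks[i] @ (x - self.xds[i])

# Two knot points of a (theta, thetadot) trajectory, torque command.
pi_nn = TrajectoryLibraryPolicy(
    xds=[[0.0, 0.0], [0.5, 1.0]],
    uffs=[[0.0], [0.2]],
    Ks=[[[-2.0, -0.5]], [[-2.0, -0.5]]],
)
print(pi_nn(np.array([0.4, 0.9])))   # feedback-corrected command near knot 2
```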
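Finally, the policy-search slide can be turned into a short script. The sketch below tunes a linear policy u = pᵀx by simple random search (standing in for "your favorite function optimization method") and includes both tricks from the slide: the same random perturbations are replayed for every candidate p, and a rollout is cut off early once it is clearly worse than the best so far. The double-integrator plant and every hyperparameter are assumptions for illustration.

```python
# Policy search by direct function optimization, with common random numbers
# and early termination (racing). All dynamics and constants are assumed.
import numpy as np

def rollout(p, seed, horizon=200, cutoff=np.inf):
    """Total cost of running u = p @ x on a noisy double integrator.
    The same seed replays the same disturbances, so different candidate
    policies are compared under identical perturbations."""
    rng = np.random.default_rng(seed)
    x = np.array([1.0, 0.0])                 # fixed initial condition
    total = 0.0
    for _ in range(horizon):
        u = float(p @ x)
        x = x + 0.05 * np.array([x[1], u]) + 0.01 * rng.standard_normal(2)
        total += 0.05 * (x @ x + 0.1 * u * u)
        if total > cutoff:                   # racing: stop a clearly bad trial
            return np.inf
    return total

rng = np.random.default_rng(0)
seeds = range(5)                             # same trials reused for every candidate
p_best = np.zeros(2)
c_best = sum(rollout(p_best, s) for s in seeds)
for _ in range(200):                         # simple random search over p
    p_try = p_best + 0.3 * rng.standard_normal(2)
    # Coarse cutoff: one rollout exceeding the best *total* is surely worse.
    c_try = sum(rollout(p_try, s, cutoff=c_best) for s in seeds)
    if c_try < c_best:
        p_best, c_best = p_try, c_try
print("best gains:", p_best, "total cost:", c_best)
```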