Ch 17. Optimal control theory and the linear Bellman equation
HJ Kappen
BTSM Seminar 12.07.19 (Thu)
Summarized by Joon Shik Kim

Introduction
• Optimising a sequence of actions to attain some future goal is the general topic of control theory.
• In the example of a human throwing a spear to kill an animal, a sequence of actions can be assigned a cost that consists of two terms.
• The first is a path cost that specifies the energy consumed in contracting the muscles.
• The second is an end cost that specifies whether the spear will kill the animal, just hurt it, or miss it.
• The optimal control solution is a sequence of motor commands that results in killing the animal by throwing the spear with minimal physical effort.

Discrete Time Control (1/3)
• The dynamics are
    $x_{t+1} = x_t + f(t, x_t, u_t), \quad t = 0, 1, \ldots, T-1,$
  where $x_t$ is an n-dimensional vector describing the state of the system and $u_t$ is an m-dimensional vector that specifies the control or action at time t.
• A cost function assigns a cost to each sequence of controls:
    $C(x_0, u_{0:T-1}) = \phi(x_T) + \sum_{t=0}^{T-1} R(t, x_t, u_t),$
  where $R(t, x, u)$ is the cost associated with taking action u at time t in state x, and $\phi(x_T)$ is the cost associated with ending up in state $x_T$ at time T.

Discrete Time Control (2/3)
• The problem of optimal control is to find the sequence $u_{0:T-1}$ that minimises $C(x_0, u_{0:T-1})$.
• The optimal cost-to-go is
    $J(t, x_t) = \min_{u_{t:T-1}} \left( \phi(x_T) + \sum_{s=t}^{T-1} R(s, x_s, u_s) \right) = \min_{u_t} \left( R(t, x_t, u_t) + J(t+1, x_t + f(t, x_t, u_t)) \right).$

Discrete Time Control (3/3)
• The algorithm to compute the optimal control, trajectory, and cost is:
• 1. Initialization: $J(T, x) = \phi(x)$.
• 2. Backwards: for $t = T-1, \ldots, 0$ and for all x compute
    $u_t^*(x) = \arg\min_u \{ R(t, x, u) + J(t+1, x + f(t, x, u)) \},$
    $J(t, x) = R(t, x, u_t^*) + J(t+1, x + f(t, x, u_t^*)).$
• 3. Forwards: for $t = 0, \ldots, T-1$ compute
    $x_{t+1}^* = x_t^* + f(t, x_t^*, u_t^*(x_t^*)).$
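The three steps of the discrete-time algorithm (initialization, backward recursion, forward rollout) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the method used in the seminar: the state is restricted to a small integer grid, next states are clipped to that grid, and the names `solve_dp`, `rollout`, and `step`, as well as the toy choices of `f`, `R`, and `phi`, are hypothetical.

```python
def solve_dp(T, states, controls, f, R, phi):
    """Backward recursion for J(t, x) and the optimal policy u*_t(x)."""
    lo, hi = min(states), max(states)

    def step(t, x, u):
        # Assumption of this sketch: clip the next state so it stays on the grid.
        return max(lo, min(hi, x + f(t, x, u)))

    # 1. Initialization: J(T, x) = phi(x)
    J = [{} for _ in range(T + 1)]
    J[T] = {x: phi(x) for x in states}
    policy = [{} for _ in range(T)]

    # 2. Backwards: for t = T-1, ..., 0 and every grid state x
    for t in range(T - 1, -1, -1):
        for x in states:
            def cost(u):
                return R(t, x, u) + J[t + 1][step(t, x, u)]
            u_star = min(controls, key=cost)
            policy[t][x] = u_star
            J[t][x] = cost(u_star)
    return J, policy, step


def rollout(x0, T, policy, step):
    """3. Forwards: follow the optimal policy starting from x0."""
    xs = [x0]
    for t in range(T):
        xs.append(step(t, xs[-1], policy[t][xs[-1]]))
    return xs


# Toy problem (hypothetical): move on an integer line towards x = 3.
states = list(range(-5, 6))
controls = [-1, 0, 1]
T = 5
J, policy, step = solve_dp(
    T, states, controls,
    f=lambda t, x, u: u,             # x_{t+1} = x_t + u_t
    R=lambda t, x, u: u * u,         # path cost: effort
    phi=lambda x: 2 * (x - 3) ** 2,  # end cost: distance from the target
)
trajectory = rollout(0, T, policy, step)
print("optimal cost J(0, 0):", J[0][0])
print("optimal trajectory:", trajectory)
```

The backward pass visits every grid state at every time step, so its cost grows with the size of the state grid; this is why the sketch uses a one-dimensional toy problem.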

The HJB Equation (1/2)
• In continuous time, with time step dt and J expanded to first order in dt:
    $J(t, x) = \min_u \left( R(t, x, u)\,dt + J(t + dt, x + f(x, u, t)\,dt) \right)$
    $\approx \min_u \left( R(t, x, u)\,dt + J(t, x) + \partial_t J(t, x)\,dt + \partial_x J(t, x)\, f(x, u, t)\,dt \right),$
• which gives the Hamilton-Jacobi-Bellman (HJB) equation
    $-\partial_t J(t, x) = \min_u \left( R(t, x, u) + f(x, u, t)\, \partial_x J(t, x) \right).$
• The optimal control at the current x, t is given by
    $u(x, t) = \arg\min_u \left( R(t, x, u) + f(x, u, t)\, \partial_x J(t, x) \right).$
• The boundary condition is $J(T, x) = \phi(x)$.

The HJB Equation (2/2)
• Example: optimal control of a mass on a spring.

Stochastic Differential Equations (1/2)
• Consider the random walk on the line
    $x_{t+1} = x_t + \xi_t, \quad \xi_t = \pm\sqrt{\nu},$
  with $x_0 = 0$.
• In closed form, $x_t = \sum_{i=1}^{t} \xi_i$.
• $\langle x_t \rangle = 0, \quad \langle x_t^2 \rangle = \nu t.$
• In the continuous time limit we define
    $dx_t = x_{t+dt} - x_t = d\xi, \quad \langle d\xi^2 \rangle = \nu\,dt$ (Wiener process).
• The conditional probability distribution is
    $\rho(x, t | x_0, 0) = \frac{1}{\sqrt{2\pi\nu t}} \exp\left( -\frac{(x - x_0)^2}{2\nu t} \right).$

Stochastic Optimal Control Theory (2/2)
• $dx = f(x(t), u(t), t)\,dt + d\xi$
• $d\xi$ is a Wiener process with $\langle d\xi_i\, d\xi_j \rangle = \nu_{ij}(t, x, u)\,dt$.
• Since $\langle dx^2 \rangle$ is of order dt, we must make a Taylor expansion up to order $dx^2$. This gives the stochastic Hamilton-Jacobi-Bellman equation
    $-\partial_t J(t, x) = \min_u \left( R(t, x, u) + f(x, u, t)\, \partial_x J(t, x) + \frac{1}{2} \nu(t, x, u)\, \partial_x^2 J(t, x) \right).$
• $\langle dx \rangle = f(x, u, t)\,dt$: drift
• $\langle dx^2 \rangle = \nu(t, x, u)\,dt$: diffusion

Path Integral Control (1/2)
• When the dynamics are linear in the control and the cost is quadratic in the control, the nonlinear HJB equation can be transformed into a linear equation by a log transformation of the cost-to-go:
    $J(x, t) = -\lambda \log \psi(x, t).$
• The HJB equation becomes
    $-\partial_t \psi(x, t) = \left( -\frac{V}{\lambda} + f^T \partial_x + \frac{1}{2} \mathrm{Tr}\left( g \nu g^T \partial_x^2 \right) \right) \psi.$

Path Integral Control (2/2)
• Let $\rho(y, \tau | x, t)$ describe a diffusion process for $\tau > t$ defined by the Fokker-Planck equation
    $\partial_\tau \rho = -\frac{V}{\lambda} \rho - \partial_y^T (f \rho) + \frac{1}{2} \mathrm{Tr}\left( \partial_y^2\, g \nu g^T \rho \right). \quad (1)$
• Then
    $\psi(x, t) = \int dy\, \rho(y, T | x, t) \exp(-\phi(y)/\lambda).$

The Diffusion Process as a Path Integral (1/2)
• Consider the first term on the right-hand side of equation (1). It describes a process that kills a sample trajectory at rate $V(x, t)/\lambda$, that is, with probability $V(x, t)\,dt/\lambda$ in each time interval dt.
• Sampling process and Monte Carlo: simulate N trajectories of
    $dx = f(x, t)\,dt + g(x, t)\,d\xi,$
  where in each time step
    $x \to x + dx$ with probability $1 - V(x, t)\,dt/\lambda$,
    $x_i \to \dagger$ with probability $V(x, t)\,dt/\lambda$, in which case the path is killed.
  The surviving paths give the estimate
    $\psi(x, t) = \int dy\, \rho(y, T | x, t) \exp(-\phi(y)/\lambda) \approx \frac{1}{N} \sum_{i \in \text{alive}} \exp(-\phi(x_i(T))/\lambda).$

The Diffusion Process as a Path Integral (2/2)
• $p(x(t \to T) | x, t) = \frac{1}{\psi(x, t)} \exp\left( -\frac{1}{\lambda} S(x(t \to T)) \right),$
  where ψ is a partition function, J is a free energy, S is the energy of a path, and λ is the temperature.

Discussion
• One can extend the path integral control formalism to multiple agents that jointly solve a task. In this case the agents need to coordinate their actions not only through time, but also among each other, to maximise a common reward function.
• The path integral method has great potential for application in robotics.
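
The Monte Carlo estimate of ψ described under "Sampling process and Monte Carlo" can be sketched as follows. This is a rough one-dimensional illustration under stated assumptions, not the seminar's implementation: the diffusion is discretised with a simple Euler step, the noise has scalar variance `nu`, and the function name `estimate_psi`, its parameters, and the toy choices of `f`, `g`, `V`, and `phi` are all hypothetical.

```python
import math
import random


def estimate_psi(x0, t0, T, f, g, V, phi, lam, nu,
                 n_paths=5000, n_steps=20, seed=0):
    """Monte Carlo estimate of psi(x0, t0) from killed diffusion paths (1-D)."""
    rng = random.Random(seed)
    dt = (T - t0) / n_steps
    total = 0.0
    for _ in range(n_paths):
        x, t, alive = x0, t0, True
        for _ in range(n_steps):
            # Kill the path with probability V(x, t) dt / lambda.
            if rng.random() < V(x, t) * dt / lam:
                alive = False
                break
            # Otherwise take one Euler step of dx = f dt + g dxi,
            # with <dxi^2> = nu dt.
            x += f(x, t) * dt + g(x, t) * rng.gauss(0.0, math.sqrt(nu * dt))
            t += dt
        if alive:
            total += math.exp(-phi(x) / lam)
    # Killed paths contribute zero to the sum over surviving paths.
    return total / n_paths


# Toy problem (hypothetical): free diffusion with a constant killing rate.
lam = 1.0
psi_hat = estimate_psi(
    x0=0.0, t0=0.0, T=1.0,
    f=lambda x, t: 0.0,
    g=lambda x, t: 1.0,
    V=lambda x, t: 0.5,
    phi=lambda y: y * y,
    lam=lam, nu=1.0,
)
J_hat = -lam * math.log(psi_hat)  # cost-to-go via J = -lambda log psi
print("psi estimate:", psi_hat)
print("J estimate:", J_hat)
```

For this toy problem the continuous-time value is known in closed form, exp(-0.5)/sqrt(3): the survival probability times a Gaussian integral. That makes it a convenient sanity check on the sampler; the estimate carries both Monte Carlo noise and a small time-discretisation bias.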