Ch 17. Optimal control theory and the linear Bellman equation

HJ Kappen
BTSM Seminar
12.07.19.(Thu)
Summarized by Joon Shik Kim
Introduction
• Optimising a sequence of actions to attain some future goal is the general topic of control theory.
• In the example of a human throwing a spear to kill an animal, a sequence of actions can be assigned a cost consisting of two terms.
• The first is a path cost that specifies the energy
consumption to contract the muscles.
• The second is an end cost that specifies whether the spear will kill the animal, just hurt it, or miss it.
• The optimal control solution is a sequence of motor
commands that results in killing the animal by
throwing the spear with minimal physical effort.
Discrete Time Control (1/3)
• xt+1 = xt + f(t, xt, ut),  t = 0, 1, ..., T-1,
where xt is an n-dimensional vector describing the
state of the system and ut is an m-dimensional vector
that specifies the control or action at time t.
• A cost function that assigns a cost to each sequence
of controls
T 1
C ( x 0 , u 0:T 1 )   ( x T ) 
 R (t , x , u
t
t0
t
)
where R(t,x,u) is the cost associated with taking
action u at time t in state x, and Φ(xT) is the cost
associated with ending up in state xT at time T.
Discrete Time Control (2/3)
• The problem of optimal control is to find
the sequence u0:T-1 that minimises
C(x0, u0:T-1).
• The optimal cost-to-go
J(t, xt) = min_{ut:T-1} [ Φ(xT) + Σ_{s=t}^{T-1} R(s, xs, us) ]
         = min_{ut} ( R(t, xt, ut) + J(t+1, xt + f(t, xt, ut)) ).
Discrete Time Control (3/3)
• The algorithm to compute the optimal
control, trajectory, and the cost is given
by
• 1. Initialization: J(T, x) = Φ(x).
• 2. Backwards: For t = T-1, ..., 0 and for all x compute
u*t(x) = argmin_u { R(t, x, u) + J(t+1, x + f(t, x, u)) },
J(t, x) = R(t, x, u*t) + J(t+1, x + f(t, x, u*t)).
• 3. Forwards: For t = 0, ..., T-1 compute
x*t+1 = x*t + f(t, x*t, u*t(x*t)).
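The backward/forward recursion above can be sketched on a toy problem. Everything in this instance — the integer grid, the dynamics xt+1 = xt + ut, the quadratic costs — is an illustrative assumption, not taken from the slides:

```python
import numpy as np

# Hypothetical toy instance of the backward/forward algorithm:
# integer states x in [-5, 5], controls u in {-1, 0, 1},
# dynamics x_{t+1} = x_t + u_t (clipped to the grid),
# running cost R(t, x, u) = x^2 + u^2, end cost Phi(x_T) = x_T^2.
T = 10
states = np.arange(-5, 6)
controls = np.array([-1, 0, 1])

def step(x, u):
    return int(np.clip(x + u, -5, 5))

# 1. Initialization: J(T, x) = Phi(x).
J = {int(x): float(x**2) for x in states}
policy = []                        # policy[t][x] = u*_t(x)

# 2. Backwards: for t = T-1, ..., 0 compute u*_t(x) and J(t, x).
for t in reversed(range(T)):
    J_new, u_star = {}, {}
    for x in states:
        costs = [x**2 + u**2 + J[step(x, u)] for u in controls]
        k = int(np.argmin(costs))
        u_star[int(x)], J_new[int(x)] = int(controls[k]), float(costs[k])
    policy.insert(0, u_star)
    J = J_new

# 3. Forwards: roll out the optimal trajectory from x_0 = 5.
x, traj = 5, [5]
for t in range(T):
    x = step(x, policy[t][x])
    traj.append(x)

print(traj)  # the optimal controller drives the state toward 0
```

With these costs the optimal policy simply steers the state to the origin and holds it there, which makes the output easy to check by eye.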
The HJB Equation (1/2)
• J(t, x) = min_u ( R(t, x, u)dt + J(t+dt, x + f(x, u, t)dt) )
         = min_u ( R(t, x, u)dt + J(t, x) + ∂tJ(t, x)dt + ∂xJ(t, x) f(x, u, t)dt ),
• -∂tJ(t, x) = min_u ( R(t, x, u) + f(x, u, t) ∂xJ(x, t) ).  (Hamilton-Jacobi-Bellman equation)
• The optimal control at the current x, t is given by
u(x, t) = argmin_u ( R(x, u, t) + f(x, u, t) ∂xJ(t, x) ).
• The boundary condition is
J(x, T) = Φ(x).
The HJB Equation (2/2)
• Example: optimal control of a mass on a spring (figure).
Stochastic Differential Equations (1/2)
• Consider the random walk on the line
xt+1 = xt + ξt,  ξt = ±√ν with equal probability,
with x0 = 0.
• In closed form, xt = Σ_{i=1}^{t} ξi.
• ⟨xt⟩ = 0, ⟨xt²⟩ = νt.
• In the continuous time limit we define
dxt = xt+dt - xt = dξ,  ⟨dξ⟩ = 0, ⟨dξ²⟩ = ν dt.  (Wiener process)
• The conditional probability distribution is
ρ(x, t | x0, 0) = (1/√(2πνt)) exp( -(x - x0)²/(2νt) ).
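The moments ⟨xt⟩ = 0 and ⟨xt²⟩ = νt can be sanity-checked by Monte Carlo. A minimal sketch, assuming the step distribution ξt = ±√ν (one choice with the stated mean and variance):

```python
import numpy as np

# Monte Carlo check of the random-walk moments: simulate many
# independent walks x_t = sum_{i=1}^{t} xi_i with xi_i = +/- sqrt(nu).
rng = np.random.default_rng(0)
nu, t, n_walkers = 0.5, 200, 50_000

steps = rng.choice([-1.0, 1.0], size=(n_walkers, t)) * np.sqrt(nu)
x_t = steps.sum(axis=1)            # x_t for each walker, x_0 = 0

print(x_t.mean())                  # ~ 0
print(x_t.var())                   # ~ nu * t = 100
```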
Stochastic Optimal Control Theory (2/2)
• dx = f(x(t), u(t), t)dt + dξ,
• dξ is a Wiener process with ⟨dξi dξj⟩ = νij(t, x, u)dt.
• Since ⟨dx²⟩ is of order dt, we must make a Taylor expansion of J up to order dx²:
-∂tJ(t, x) = min_u [ R(t, x, u) + f(x, u, t) ∂xJ(x, t) + (1/2) ν(t, x, u) ∂x²J(x, t) ].
(Stochastic Hamilton-Jacobi-Bellman equation)
• ⟨dx⟩ = f(x, u, t)dt : drift
• ⟨dx²⟩ = ν(t, x, u)dt : diffusion
Path Integral Control (1/2)
• In the problem of linear control and quadratic cost, the nonlinear HJB equation can be transformed into a linear equation by a log transformation of the cost-to-go,
J(x, t) = -λ log ψ(x, t).
HJB becomes
∂tψ(x, t) = ( V/λ - f^T ∂x - (1/2) Tr(g g^T ∂x²) ) ψ.
Path Integral Control (2/2)
• Let ρ(y, τ | x, t) describe a diffusion process for τ > t, defined by the Fokker-Planck equation
∂τρ = -(V/λ)ρ - ∂y^T(fρ) + (1/2) Tr( ∂y²(g g^T ρ) ).
• Then
ψ(x, t) = ∫ dy ρ(y, T | x, t) exp(-Φ(y)/λ).   (1)
The Diffusion Process as a Path Integral (1/2)
• Let’s look at the first term in equation (1) on the previous slide. The first term describes a process that kills a sample trajectory at a rate V(x, t)dt/λ.
• Sampling process and Monte Carlo estimate:
dx = f(x, t)dt + g(x, t)dξ,
x ← x + dx  with probability 1 - V(x, t)dt/λ,
xi → †  with probability V(x, t)dt/λ (in this case the path is killed).
• Then
ψ(x, t) = ∫ dy ρ(y, T | x, t) exp(-Φ(y)/λ) ≈ (1/N) Σ_{i alive} exp(-Φ(xi(T))/λ).
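The killed-diffusion sampler can be sketched in a minimal setting. All specifics here are illustrative assumptions: f = 0, g = 1 (pure Brownian motion), constant potential V, and end cost Φ = 0, so the estimator reduces to the survival fraction, which has the closed form exp(-V(T - t)/λ) to compare against:

```python
import numpy as np

# Toy run of the killed-diffusion sampler (assumed setting: f = 0,
# g = 1, constant V, Phi = 0). With Phi = 0 the path-integral
# estimator (1/N) * sum_{i alive} exp(-Phi/lam) is just the fraction
# of surviving paths, and psi = exp(-V * (T - t) / lam) exactly.
rng = np.random.default_rng(1)
lam, V, T, dt, N = 1.0, 0.5, 1.0, 0.001, 100_000

x = np.zeros(N)
alive = np.ones(N, dtype=bool)
for _ in range(int(T / dt)):
    # kill each live path with probability V*dt/lam
    alive &= rng.random(N) >= V * dt / lam
    # diffuse the survivors: dx = g * dxi with <dxi^2> = dt
    x[alive] += rng.normal(0.0, np.sqrt(dt), alive.sum())

psi_mc = alive.mean()              # Monte Carlo estimate of psi(0, 0)
psi_exact = np.exp(-V * T / lam)   # closed form for this toy setting
print(psi_mc, psi_exact)
```

The agreement between the survival fraction and exp(-VT/λ) illustrates why the killing rate V dt/λ is exactly what the first term of the Fokker-Planck equation prescribes.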
The Diffusion Process as a Path Integral (2/2)
• p(x(t → T) | x, t) = (1/ψ(x, t)) exp( -S(x(t → T))/λ ),
where ψ is a partition function, J is a free energy, S is the energy of a path, and λ is the temperature.
Discussion
• One can extend the path integral control formalism to multiple agents that jointly solve a task. In this case the agents need to coordinate their actions not only through time, but also among each other, to maximise a common reward function.
• The path integral method has great
potential for application in robotics.