RL 12: Natural Actor-Critic
Michael Herrmann
University of Edinburgh, School of Informatics
27/02/2015
Today’s topics
Natural gradient
Compatible function approximation
Natural actor-critic (NAC)
Biases, stochastic approximation, test experiments
Last time: Policy gradient
Average reward given a (parametric) policy:

ρ_{Q,π_ω} = Σ_{x,a} μ^{π_ω}(x) π_ω(a|x) Q^{π_ω}(x, a)
In order to realise the policy gradient

ω_{t+1} = ω_t + β_t ∇_ω ρ(ω)

we assume the dependency of μ and Q on ω to be “weak”, i.e. we use a
simplifying assumption for the dependency of μ and Q on ω, namely

∇_ω ρ(ω) = Σ_{x,a} μ^π(x) {∇_ω π_ω(a|x)} Q^π(x, a)
Many versions of the algorithm are possible (e.g. REINFORCE)
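As an aside (not from the lecture), a minimal Monte-Carlo policy-gradient sketch in the spirit of REINFORCE, written in Python for a hypothetical two-state, two-action MDP with a tabular softmax policy; the transition probabilities P, the rewards R and all constants are made-up assumptions.

# REINFORCE-style policy-gradient sketch on a hypothetical 2-state/2-action MDP.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, beta = 2, 2, 0.9, 0.01

P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # assumed transition probabilities P[x, a, x']
              [[0.3, 0.7], [0.6, 0.4]]])
R = np.array([[1.0, 0.0],                    # assumed rewards R[x, a]
              [0.0, 2.0]])

omega = np.zeros((n_states, n_actions))      # parameters of the softmax policy

def policy(x):
    p = np.exp(omega[x] - omega[x].max())
    return p / p.sum()

def episode(T=50):
    """Roll out one trajectory under the current policy."""
    x, traj = 0, []
    for _ in range(T):
        a = rng.choice(n_actions, p=policy(x))
        traj.append((x, a, R[x, a]))
        x = rng.choice(n_states, p=P[x, a])
    return traj

def policy_gradient_estimate(traj):
    """Sum over t of (grad_omega log pi(a_t|x_t)) * discounted return from t."""
    g = np.zeros_like(omega)
    rewards = [r for (_, _, r) in traj]
    for t, (x, a, _) in enumerate(traj):
        G = sum(gamma**k * r for k, r in enumerate(rewards[t:]))
        score = -policy(x)                   # d log pi(a|x) / d omega[x, b] = 1{b=a} - pi(b|x)
        score[a] += 1.0
        g[x] += score * G
    return g

for _ in range(200):                         # noisy gradient ascent on the expected return
    omega += beta * policy_gradient_estimate(episode())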
Actor-Critic Algorithm
Algorithm (SARSA/Q):
Initialise x and ω, sample a ∼ π_ω(·|x)
Iterate:
  obtain reward r, transition to new state x′
  sample new action a′ ∼ π_ω(·|x′)
  δ = r + γ Q_θ(x′, a′) − Q_θ(x, a)
  ω ← ω + β ∇_ω log π_ω(a|x) Q_θ(x, a)
  θ ← θ + α δ ∂Q_θ(x, a)/∂θ
  a ← a′, x ← x′
until termination criterion
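A minimal runnable sketch of the algorithm above (my own illustration, not the lecture's code): a SARSA-style actor-critic with a tabular critic Q_θ and a tabular softmax actor π_ω on a hypothetical two-state, two-action MDP; the dynamics P, the rewards R and the step sizes are assumed.

# SARSA-style actor-critic sketch following the algorithm on this slide.
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 2, 2
gamma, alpha, beta = 0.9, 0.1, 0.01          # discount, critic and actor step sizes

P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # assumed transition probabilities P[x, a, x']
              [[0.3, 0.7], [0.6, 0.4]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])       # assumed rewards R[x, a]

omega = np.zeros((n_states, n_actions))      # actor parameters (softmax policy)
theta = np.zeros((n_states, n_actions))      # critic parameters (tabular Q_theta)

def pi(x):
    p = np.exp(omega[x] - omega[x].max())
    return p / p.sum()

x = 0
a = rng.choice(n_actions, p=pi(x))
for _ in range(5000):
    r = R[x, a]
    x_next = rng.choice(n_states, p=P[x, a])
    a_next = rng.choice(n_actions, p=pi(x_next))

    # critic: TD error and update (dQ/dtheta is an indicator for a tabular Q)
    delta = r + gamma * theta[x_next, a_next] - theta[x, a]
    theta[x, a] += alpha * delta

    # actor: omega <- omega + beta * grad_omega log pi(a|x) * Q_theta(x, a)
    score = -pi(x)
    score[a] += 1.0
    omega[x] += beta * score * theta[x, a]

    x, a = x_next, a_next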
Policy gradient
Actor-critic algorithms maintain two sets of parameters (θ, ω),
one (θ) for the representation of the value function and one
(ω) for the representation of the policy.
Policy gradient methods are realised by stochastic gradient ascent on
the average reward, using the current estimate of the value function.
Simultaneously, the estimate of the value function is gradually
improved.
It is natural to try to harmonise these two aspects of the
optimisation process.
Bias and Variance in the Actor-Critic Algorithm
The approximation of the policy gradient introduces bias and
variance. We need to be careful with the choice of the function
approximation for Q.
For compatibility of the representations of value function and
policy, aim at
∇_θ Q_θ ∼ ∇_ω log π_ω
In order to check whether the approximation Q̂^π(x, a; θ) of the true
value function Q^π(x, a) is good enough for calculating the policy
gradient, consider the following error measure:

ε^π(θ) = Σ_{x,a} μ^π(x) (Q̂^π(x, a; θ) − Q^π(x, a))² π_ω(a|x)

We now want to show that using the best (w.r.t. θ) approximation
Q̂^π(x, a; θ) leaves the gradient of ρ (w.r.t. ω) unchanged.
S. Kakade (2001) A natural policy gradient. NIPS 14, 1531-1538.
Compatible function approximation
Idea: Use the score function

Ψ_i^π(x, a) = ∂ log π_ω(a|x) / ∂ω_i

as basis functions for the approximation of the state-action value
function in terms of Ψ:

Q̂^π(x, a; θ) = Σ_i θ_i Ψ_i^π(x, a)

This implies ∇_θ Q̂_θ = ∇_ω log π_ω.
Using the score function as basis functions is often possible, but may
not always be a good choice (e.g. Gaussian policies π_ω give linear Ψ).
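To make the construction concrete, a small illustrative sketch (my own, assuming a tabular softmax policy): the score function Ψ is computed explicitly and used as the feature vector of the linear approximation Q̂ = θ^⊤ Ψ.

# Score-function ("compatible") features for a softmax policy pi_omega(a|x) ∝ exp(omega[x, a]).
import numpy as np

n_states, n_actions = 3, 2
omega = np.random.default_rng(2).normal(size=(n_states, n_actions))   # assumed parameters

def pi(x):
    p = np.exp(omega[x] - omega[x].max())
    return p / p.sum()

def psi(x, a):
    """Psi_i(x, a) = d log pi_omega(a|x) / d omega_i, flattened to one vector."""
    grad = np.zeros((n_states, n_actions))
    grad[x] = -pi(x)          # -pi(b|x) for every action b ...
    grad[x, a] += 1.0         # ... plus 1 for the action a that was taken
    return grad.ravel()

theta = np.zeros(n_states * n_actions)

def q_hat(x, a):
    """Compatible approximation Q_hat(x, a; theta) = theta^T Psi(x, a)."""
    return theta @ psi(x, a)

# The features are centred under the policy: sum_a pi(a|x) Psi(x, a) = 0,
# so theta^T Psi can only represent the advantage-like part of Q.
for x in range(n_states):
    assert np.allclose(sum(pi(x)[a] * psi(x, a) for a in range(n_actions)), 0.0)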
Consequences of the compatible function approximation
For Q̂^π(x, a; θ) = θ^⊤ Ψ^π(x, a), minimisation of ε^π implies ∂ε^π/∂θ = 0:

∂ε^π(θ)/∂θ_i = 0  ⟺  Σ_{x,a} μ^π(x) Ψ_i^π(x, a) (Q̂^π(x, a; θ) − Q^π(x, a)) π_ω(a|x) = 0

or equivalently (this is what we wanted to show!)

Σ_{x,a} μ^π(x) Ψ_i^π(x, a) Q̂^π(x, a; θ) π_ω(a|x) = Σ_{x,a} μ^π(x) Ψ_i^π(x, a) Q^π(x, a) π_ω(a|x)

and in vector form, using the basis functions for Q̂^π = θ^⊤ Ψ^π(x, a),

Σ_{x,a} μ^π(x) Ψ^π(x, a) θ^⊤ Ψ^π(x, a) π_ω(a|x) = Σ_{x,a} μ^π(x) Ψ^π(x, a) Q^π(x, a) π_ω(a|x)
Consequences of the compatible function approximation
Σ_{x,a} μ^π(x) Ψ^π(x, a) Ψ^π(x, a)^⊤ θ π_ω(a|x) = Σ_{x,a} μ^π(x) Ψ^π(x, a) Q^π(x, a) π_ω(a|x)

By definition ∇_ω π_ω(a|x) = π_ω(a|x) Ψ^π(x, a), because Ψ_i^π(x, a) = ∂ log π_ω(a|x)/∂ω_i, so

(Σ_{x,a} μ^π(x) Ψ^π(x, a) Ψ^π(x, a)^⊤ π_ω(a|x)) θ = Σ_{x,a} μ^π(x) Q^π(x, a) ∇_ω π_ω(a|x) = ∇_ω ρ(ω)

Compare the left-hand side with the Fisher information matrix:

F_ij(ω) = E_{μ^π(x)} E_{π_ω(a|x)} [ ∂ log π_ω(a|x)/∂ω_i · ∂ log π_ω(a|x)/∂ω_j ]
        = Σ_{x,a} μ^π(x) π_ω(a|x) ∂ log π_ω(a|x)/∂ω_i · ∂ log π_ω(a|x)/∂ω_j

F(ω) = Σ_{x,a} μ^π(x) Ψ^π(x, a) Ψ^π(x, a)^⊤ π_ω(a|x)  ⟹  θ = F^{-1}(ω) ∇_ω ρ(ω)
Recipe for Natural Actor-Critic
1. Given the current policy π we determine the score function Ψ.
2. Using π we also obtain a sample of rewards which we can use to
   estimate the value function Q̂.
3. At the same time we estimate the state-visitation distribution μ̂ of
   the agent in the state space.
4. From μ, π, Ψ and Q we can now find the optimal parameters θ
   by solving a (linear) equation.
5. θ is used to update the parameters of the policy (β: learning rate):
   ω_{t+1} = ω_t + β_t θ_t
This makes sense because we have seen that θ = F^{-1}(ω) ∇_ω ρ(ω),
which is a natural gradient on ρ.
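A compact sketch of this recipe (my own illustration on an assumed two-state, two-action MDP with a tabular softmax policy): Monte-Carlo returns serve as the estimate Q̂, the visited states implicitly sample μ̂, step 4 becomes a linear least-squares problem on the compatible features, and step 5 updates ω with the resulting θ.

# Natural actor-critic recipe, sketched with Monte-Carlo returns as Q-estimates.
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions, gamma, beta = 2, 2, 0.9, 0.1
P = np.array([[[0.9, 0.1], [0.2, 0.8]],      # assumed transition probabilities
              [[0.3, 0.7], [0.6, 0.4]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])       # assumed rewards
omega = np.zeros((n_states, n_actions))

def pi(x):
    p = np.exp(omega[x] - omega[x].max())
    return p / p.sum()

def psi(x, a):                               # step 1: score function (flattened)
    g = np.zeros((n_states, n_actions))
    g[x] = -pi(x)
    g[x, a] += 1.0
    return g.ravel()

for _ in range(100):
    # steps 2-3: sample trajectories under pi; the visited states sample mu implicitly
    X, A, G = [], [], []
    for _ in range(20):
        x, traj = 0, []
        for _ in range(30):
            a = rng.choice(n_actions, p=pi(x))
            traj.append((x, a, R[x, a]))
            x = rng.choice(n_states, p=P[x, a])
        ret = 0.0
        for (x_t, a_t, r_t) in reversed(traj):          # discounted returns as Q-estimates
            ret = r_t + gamma * ret
            X.append(x_t); A.append(a_t); G.append(ret)
    # step 4: solve the linear equation (sum Psi Psi^T) theta = sum Psi Q in least-squares form
    Psi = np.array([psi(x, a) for x, a in zip(X, A)])
    theta = np.linalg.lstsq(Psi, np.array(G), rcond=None)[0]
    # step 5: natural-gradient step on the policy parameters
    omega += beta * theta.reshape(n_states, n_actions)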
What is a natural gradient?
Natural gradient
The gradient is orthogonal to the level lines of the cost function.
For a circular problem it points towards the optimum, while for a
non-circular problem we might be able to do better.
The natural gradient can be interpreted as a removal of the adverse
effects of the particular model: for an elliptic problem, for example,
we could simply “divide by the eigenvalues”, i.e. apply a linear
transformation with the inverse eigenvalues and the corresponding
eigenvectors.
Gradient descent
Gradient descent:

θ_{t+1} = θ_t − η ∇_θ f(θ_t)

Assume an affine (linear) transformation ϕ = W^{-1} θ, so we have

ϕ_{t+1} = ϕ_t − η (∂θ/∂ϕ)^⊤ ∇_θ f(θ_t) = ϕ_t − η W^⊤ ∇_θ f(θ_t)

Multiplying by W:

W ϕ_{t+1} = W ϕ_t − η W W^⊤ ∇_θ f(θ_t)
θ′_{t+1} = θ_t − η W W^⊤ ∇_θ f(θ_t)

In general θ′_{t+1} ≠ θ_{t+1}, so the gradient is not affine invariant.
This is nothing to worry about: the gradient works reasonably well
with any positive definite matrix in front, but we can do better.
Beyond gradient descent
Gradient descent is based on a first-order Taylor expansion

f(θ) ≈ f(θ_0) + ∇_θ f(θ_0)^⊤ (θ − θ_0)

Consider the second-order Taylor expansion

f(θ) ≈ f(θ_0) + ∇_θ f(θ_0)^⊤ (θ − θ_0) + (1/2) (θ − θ_0)^⊤ H(θ_0) (θ − θ_0)

where the Hessian is given by H_ij(θ_0) = ∂²f/∂θ_i∂θ_j (θ_0). In this
approximation we can optimise f w.r.t. θ by

θ = θ_0 − H^{-1}(θ_0) ∇_θ f(θ_0)

This is left unchanged by a linear transform ϕ = W^{-1} θ, for which
H(ϕ_0) = W^⊤ H(θ_0) W and ∇_ϕ = W^⊤ ∇_θ:

ϕ = ϕ_0 − (W^⊤ H W)^{-1} W^⊤ ∇_θ f(W ϕ_0) = ϕ_0 − W^{-1} H^{-1}(θ_0) ∇_θ f(W ϕ_0)
θ = θ_0 − H^{-1}(θ_0) ∇_θ f(θ_0)   (after multiplication by W)

The second-order method (Newton) is affine invariant.
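A small numerical check of the last two slides (an illustration with assumed values, not from the lecture): for a quadratic cost, one gradient-descent step depends on the parametrisation, whereas one Newton step does not.

# Gradient step vs. Newton step under an affine reparametrisation phi = W^{-1} theta.
import numpy as np

rng = np.random.default_rng(4)
A = np.array([[3.0, 1.0], [1.0, 2.0]])       # Hessian of the assumed quadratic cost
b = np.array([1.0, -1.0])                    # f(theta) = 0.5 theta^T A theta - b^T theta
grad_f = lambda th: A @ th - b               # gradient in theta-coordinates

W = rng.normal(size=(2, 2))                  # an (almost surely invertible) transform
theta0 = np.array([2.0, 1.0])
phi0 = np.linalg.solve(W, theta0)            # phi = W^{-1} theta
eta = 0.1

# one gradient step in each coordinate system, mapped back to theta
theta_gd = theta0 - eta * grad_f(theta0)
phi_gd = phi0 - eta * (W.T @ grad_f(W @ phi0))
print(np.allclose(theta_gd, W @ phi_gd))     # False: gradient descent is not affine invariant

# one Newton step in each coordinate system (Hessians: A and W^T A W)
theta_nt = theta0 - np.linalg.solve(A, grad_f(theta0))
phi_nt = phi0 - np.linalg.solve(W.T @ A @ W, W.T @ grad_f(W @ phi0))
print(np.allclose(theta_nt, W @ phi_nt))     # True: the Newton step is affine invariant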
Reformulation of gradient descent
Gradient descent improves the current estimate, which is perfect for a
linear cost function in a specific coordinate system. Is it the best we
can do (ignoring the 2nd-order correction by the Hessian matrix)?
Given step size η, we find

θ* = arg max_{θ: ‖θ−θ_0‖ ≤ η} f(θ)
   ≈ arg max_{θ: ‖θ−θ_0‖ ≤ η} f(θ_0) + ∇_θ f(θ_0)^⊤ (θ − θ_0)
   = arg max_{θ: ‖θ−θ_0‖ ≤ η} ∇_θ f(θ_0)^⊤ (θ − θ_0)
   = θ_0 + η ∇_θ f(θ_0) / ‖∇_θ f(θ_0)‖

i.e. optimally (θ − θ_0) has length η and is parallel to the unit
vector ∇_θ f / ‖∇_θ f‖, where ‖·‖ is the Euclidean norm.
Can we also use other norms (or distance functions)?
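Yes: a standard constrained-optimisation argument (added here for completeness, not part of the original slide) shows what happens when the step length is measured by a general positive-definite metric F instead of the Euclidean norm:

δ* = arg max_{δ: δ^⊤ F δ ≤ η²} ∇_θ f(θ_0)^⊤ δ

Setting the derivative of the Lagrangian ∇_θ f(θ_0)^⊤ δ − (λ/2)(δ^⊤ F δ − η²) with respect to δ to zero gives

∇_θ f(θ_0) = λ F δ*  ⟹  δ* ∝ F^{-1} ∇_θ f(θ_0)

For F = I (the Euclidean norm) this recovers the normalised gradient above, while choosing F to be the Fisher information matrix of the following slides yields the natural gradient.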
Kullback-Leibler divergence
The Kullback-Leibler divergence between two policies is

KL(π_{ω_1}(a|x), π_{ω_2}(a|x)) = Σ_{a,x} π_{ω_1}(a|x) log [ π_{ω_1}(a|x) / π_{ω_2}(a|x) ]

Consider two similar policies π_ω(a|x) and π_{ω+δω}(a|x) and perform a
Taylor expansion of KL(π_ω(a|x), π_{ω+δω}(a|x)) in δω:
Constant term: KL(π_ω(a|x), π_ω(a|x)) = 0
Linear term: ∂/∂(δω) KL(π_ω(a|x), π_{ω+δω}(a|x)) |_{δω=0} = 0
Quadratic term: given by the Fisher information matrix.
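A small numerical check (my own illustration with assumed parameters; a single state and a softmax action distribution for simplicity): the finite-difference Hessian of the KL divergence at δω = 0 matches the Fisher information matrix.

# Hessian of d -> KL(pi_omega || pi_{omega+d}) at d = 0 equals the Fisher information.
import numpy as np

omega = np.array([0.3, -0.2, 0.5])           # assumed parameters of a softmax distribution

def pi(w):
    p = np.exp(w - w.max())
    return p / p.sum()

def kl(w1, w2):
    p, q = pi(w1), pi(w2)
    return np.sum(p * np.log(p / q))

# Fisher information of the softmax distribution: F = diag(p) - p p^T
p = pi(omega)
F = np.diag(p) - np.outer(p, p)

# finite-difference Hessian of the KL divergence in its second argument
eps, n = 1e-4, len(omega)
H = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        ei, ej = eps * np.eye(n)[i], eps * np.eye(n)[j]
        H[i, j] = (kl(omega, omega + ei + ej) - kl(omega, omega + ei)
                   - kl(omega, omega + ej) + kl(omega, omega)) / eps**2

print(np.allclose(H, F, atol=1e-3))          # True, up to finite-difference error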
Fisher information
In other words, the Hessian of the Kullback-Leibler divergence is
the Fisher information matrix:

F_ij(x; ω) = E_{π_ω(a|x)} [ ∂ log π_ω(a|x)/∂ω_i · ∂ log π_ω(a|x)/∂ω_j ]

Benefits
As a Hessian, the Fisher matrix gives an affine-invariant descent
direction.
The approximation of the downhill/uphill direction becomes
better for non-linear cost functions.
The Fisher information matrix is covariant (i.e. the resulting
learning rule is invariant under transformations of the parameters).
Literature:
Natural gradient: S. Amari (1998) Natural gradient works efficiently in learning.
Neural Computation 10, 251-276.
Examples by Bagnell and Schneider (2003) and Jan Peters (2003, 2008).
Natural gradient
θ = F^{-1}(ω) ∇_ω ρ(ω) is a natural gradient on

ρ_{Q,π,μ} = Σ_{x,a} μ^{π_ω}(x) Q^{π_ω}(x, a) π_ω(a|x)

if we can assume the dependency of μ and Q on ω to be “weak”, i.e.

∇_ω ρ(ω) = Σ_{x,a} μ^π(x) Q^π(x, a) ∇_ω π_ω(a|x)

We have seen that

F(ω) θ = ∇_ω ρ(ω)  ⟺  θ = F(ω)^{-1} ∇_ω ρ(ω) =: ∇̃_ω ρ(ω)

which defines the natural gradient. This implies the following
natural gradient learning rule (Kakade, 2001/2)

ω_{t+1} = ω_t + β_t θ_t

which is better and simpler than the standard policy gradient.
Remember that this result is obtained at the cost of calculating θ!
Pros and Cons of the Fisher information
+ “Natural” (covariant): uses the geometry of the goal function
rather than the geometry of the parameter space (the choice of
parameters used to be critical, but is no longer).
+ Related to the Kullback-Leibler divergence and to the Hessian
+ Describes efficiency in statistical estimation (Cramér-Rao bound)
+ Many applications in machine learning, statistics and physics
− Usually depends on the parameters and is computationally
complex (but not here, where we were lucky!)
− Requires sampling of high-dimensional probability distributions
+ May still work if some approximation is used, e.g. a Gaussian
Kakade’s Example
[Figure: three curves on the right: standard gradient; three curves on
the left: natural gradient.]
Policy: π(a|x; ω) ∼ exp(ω_1 s_1 x² + ω_2 s_2 x)
Starting conditions: ω_1 s_1 = ω_2 s_2 = −0.8
Kakade’s Example
[Figure: Left: average reward for the policy
π(a = 1|s; ω) ∼ exp(ω) / (1 + exp(ω));
the lower plot shows the beginning of the upper plot (different scales!);
dashed: natural gradient, solid: standard gradient.
Right: movement in the parameter space (the axes are actually the ω_i!).]
Examples of natural gradients
[Figure: standard gradient (dashed) vs. natural gradient (solid).]
Grondman et al. (2012) A survey of actor-critic reinforcement learning: Standard and
natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics,
Part C, 42(6), 1291-1307.
More examples
J. Kober & J. R. Peters: Policy search for motor primitives in
robotics. NIPS 2009, pp. 849-856.
Summary
A promising approach for continuous action and state spaces
(in discrete time)
Policy gradient as direct maximisation of the averaged
state-action value
Natural policy gradient arises from the optimisation of the
value function
Model-free reinforcement learning
Acknowledgements
Some material was adapted from web resources associated with
Sutton and Barto’s Reinforcement Learning book.
Today mainly based on C. Szepesvári: Algorithms for RL, Ch. 3.4.
See also: David Silver’s Lecture 7: Policy Gradient and Pieter
Abbeel’s Lecture 20 on Advanced Robotics.
Check papers by Jan Peters and others (see above in the slides).