RL 12: Natural Actor-Critic
Michael Herrmann
University of Edinburgh, School of Informatics
27/02/2015

Today's topics

Natural gradient
Compatible function approximation
Natural actor-critic (NAC)
Biases, stochastic approximation, test experiments

Last time: Policy gradient

Average reward given a (parametric) policy:

  ρ_{Q,π_ω} = Σ_{x,a} μ^{π_ω}(x) π_ω(a|x) Q^{π_ω}(x,a)

In order to realise the policy gradient ω_{t+1} = ω_t + β_t ∇_ω ρ(ω) we assume the dependency of μ and Q on ω to be "weak", i.e. we use a simplifying assumption for this dependency, namely

  ∇_ω ρ(ω) = Σ_{x,a} μ^π(x) {∇_ω π_ω(a|x)} Q^π(x,a)

Many versions of the algorithm are possible (REINFORCE).

Actor-Critic Algorithm

Algorithm (SARSA/Q):
  Initialise x and ω, sample a ∼ π_ω(·|x)
  Iterate:
    obtain reward r, transition to new state x′
    sample new action a′ ∼ π_ω(·|x′)
    δ = r + γ Q_θ(x′, a′) − Q_θ(x, a)
    ω ← ω + β ∇_ω log π_ω(a|x) Q_θ(x, a)
    θ ← θ + α δ ∂Q_θ(x, a)/∂θ
    a ← a′, x ← x′
  Until termination criterion

Policy gradient

Actor-critic algorithms maintain two sets of parameters (θ, ω): one (θ) for the representation of the value function and one (ω) for the representation of the policy. Policy gradient methods are realised via stochastic gradient ascent on the average reward, using the current estimate of the value function. Simultaneously, the estimate of the value function is gradually improved. It is a suggestive idea to harmonise these two aspects of the optimisation process.

Bias and Variance in the Actor-Critic Algorithm

The approximation of the policy gradient introduces bias and variance, so we need to be careful with the choice of the function approximation for Q. For compatibility of the representations of value function and policy, aim at

  ∇_θ Q_θ ∼ ∇_ω log π_ω

In order to check whether an approximation Q̂^π(x,a;θ) of the true value function Q^π(x,a) is good enough for calculating ρ, consider the following error measure:

  ε^π(θ) = Σ_{x,a} μ^π(x) π_ω(a|x) ( Q̂^π(x,a;θ) − Q^π(x,a) )²

We want to show that replacing Q^π by the best (w.r.t. θ) approximation Q̂^π(x,a;θ) leaves the gradient of ρ (w.r.t. ω) unchanged.

S. Kakade (2001) A natural policy gradient. NIPS 14, 1531-1538.

Compatible function approximation

Idea: Use the score function

  Ψ_i^π(x,a) = ∂ log π_ω(a|x) / ∂ω_i

as basis functions for the approximation of the state-action value function in terms of Ψ:

  Q̂^π(x,a;θ) = Σ_i θ_i Ψ_i^π(x,a)

This implies ∇_θ Q̂_θ = Ψ^π = ∇_ω log π_ω. Using the score function as basis functions is often possible, but may not always be a good choice (e.g. a Gaussian π_ω gives linear Ψ). A concrete sketch for a softmax policy is given below.
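To make the compatible construction concrete, here is a minimal sketch, not part of the lecture material, for a softmax policy π_ω(a|x) ∝ exp(ωᵀφ(x,a)) over a finite action set; the feature map `phi` and all function names are illustrative assumptions. For this family the score is Ψ(x,a) = φ(x,a) − E_{π_ω}[φ(x,·)], and using it as the feature vector of Q̂ gives ∇_θ Q̂ = ∇_ω log π_ω by construction.

```python
# Minimal sketch (not from the lecture): compatible features for a softmax
# policy pi_omega(a|x) ~ exp(omega^T phi(x, a)) over a finite action set.
# The feature map `phi(x, a) -> np.ndarray` is a hypothetical placeholder.
import numpy as np

def policy(omega, phi, x, n_actions):
    """pi_omega(.|x), proportional to exp(omega^T phi(x, a))."""
    logits = np.array([omega @ phi(x, a) for a in range(n_actions)])
    logits -= logits.max()                     # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def score(omega, phi, x, a, n_actions):
    """Psi(x, a) = grad_omega log pi_omega(a|x) = phi(x, a) - E_pi[phi(x, .)]."""
    p = policy(omega, phi, x, n_actions)
    mean_feat = sum(p[b] * phi(x, b) for b in range(n_actions))
    return phi(x, a) - mean_feat

def q_hat(theta, omega, phi, x, a, n_actions):
    """Compatible approximation Q_hat(x, a; theta) = theta^T Psi(x, a);
    its gradient w.r.t. theta is exactly grad_omega log pi_omega(a|x)."""
    return theta @ score(omega, phi, x, a, n_actions)
```

Note that ω and θ live in the same d-dimensional space here; this is exactly what the natural actor-critic update derived next exploits.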
Consequences of the compatible function approximation

For Q̂^π(x,a;θ) = θᵀΨ^π(x,a), minimisation of ε^π(θ) implies ∂ε^π/∂θ = 0:

  ∂ε^π(θ)/∂θ_i ∝ Σ_{x,a} μ^π(x) π_ω(a|x) Ψ_i^π(x,a) ( Q̂^π(x,a;θ) − Q^π(x,a) ) = 0

or equivalently (this is what we wanted to show!)

  Σ_{x,a} μ^π(x) π_ω(a|x) Ψ_i^π(x,a) Q̂^π(x,a;θ) = Σ_{x,a} μ^π(x) π_ω(a|x) Ψ_i^π(x,a) Q^π(x,a)

and in vector form, using Q̂^π(x,a;θ) = θᵀΨ^π(x,a),

  Σ_{x,a} μ^π(x) π_ω(a|x) Ψ^π(x,a) Ψ^π(x,a)ᵀ θ = Σ_{x,a} μ^π(x) π_ω(a|x) Ψ^π(x,a) Q^π(x,a)

By definition ∇_ω π_ω(a|x) = π_ω(a|x) Ψ^π(x,a), because Ψ_i^π(x,a) = ∂ log π_ω(a|x)/∂ω_i. Hence

  ( Σ_{x,a} μ^π(x) π_ω(a|x) Ψ^π(x,a) Ψ^π(x,a)ᵀ ) θ = Σ_{x,a} μ^π(x) Q^π(x,a) ∇_ω π_ω(a|x) = ∇_ω ρ(ω)

Compare the left-hand side with the Fisher information matrix:

  F_ij(ω) = E_{μ^π(x)} E_{π_ω(a|x)} [ ∂ log π_ω(a|x)/∂ω_i · ∂ log π_ω(a|x)/∂ω_j ]
          = Σ_{x,a} μ^π(x) π_ω(a|x) ∂ log π_ω(a|x)/∂ω_i · ∂ log π_ω(a|x)/∂ω_j

so that

  F(ω) = Σ_{x,a} μ^π(x) π_ω(a|x) Ψ^π(x,a) Ψ^π(x,a)ᵀ   ⇒   θ = F^{-1}(ω) ∇_ω ρ(ω)

Recipe for Natural Actor-Critic

1. Given the current policy π, determine the score function Ψ.
2. Using π, obtain a sample of rewards which we can use to estimate the value function Q̂.
3. At the same time, estimate the probability μ̂ of the agent's states in the state space.
4. From μ, π, Ψ and Q we can now find the optimal parameters θ by solving a (linear) equation.
5. θ is used to update the parameters of the policy (β: learning rate):

  ω_{t+1} = ω_t + β_t θ_t

This makes sense because we have seen that θ = F^{-1}(ω) ∇_ω ρ(ω), which is a natural gradient on ρ; a minimal sketch of the recipe follows below.
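The following is a minimal sketch of this recipe under simplifying assumptions, not the lecture's reference implementation: a softmax policy π_ω(a|x) ∝ exp(ωᵀφ(x,a)), a hypothetical feature map `phi`, and precomputed Monte-Carlo estimates of Q^π passed in as `samples`; the small ridge term `reg` is an ad-hoc safeguard, not part of the derivation. Steps 1-4 appear as the construction of the sampled Fisher matrix F and gradient estimate g and the linear solve for θ; step 5 is the final update.

```python
# Minimal sketch of the natural actor-critic recipe above for a softmax
# policy pi_omega(a|x) ~ exp(omega^T phi(x, a)). `phi` and the sampling of
# (x, a, q_estimate) triples are hypothetical placeholders.
import numpy as np

def score(omega, phi, x, a, n_actions):
    """Psi(x, a) = grad_omega log pi_omega(a|x) for the softmax policy."""
    feats = np.array([phi(x, b) for b in range(n_actions)])   # (A, d)
    logits = feats @ omega
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return feats[a] - p @ feats          # phi(x, a) - E_pi[phi(x, .)]

def nac_update(omega, samples, phi, n_actions, beta=0.1, reg=1e-6):
    """One natural actor-critic step: solve F(omega) theta = grad_omega rho,
    then omega <- omega + beta * theta.

    `samples` is a list of (x, a, q_estimate) triples collected while acting
    with pi_omega, so their empirical distribution stands in for
    mu^pi(x) pi_omega(a|x); q_estimate approximates Q^pi(x, a).
    """
    d = omega.size
    F = np.zeros((d, d))                 # sample estimate of the Fisher matrix
    g = np.zeros(d)                      # sample estimate of grad_omega rho
    for x, a, q in samples:
        psi = score(omega, phi, x, a, n_actions)
        F += np.outer(psi, psi)
        g += psi * q
    F /= len(samples)
    g /= len(samples)
    theta = np.linalg.solve(F + reg * np.eye(d), g)   # theta = F^{-1} grad rho
    return omega + beta * theta                       # natural-gradient step
```

Because the samples are drawn while following π_ω, averaging over them approximates the weighting by μ^π(x) π_ω(a|x) in the sums above.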
What is a natural gradient?

Natural gradient

The gradient is orthogonal to the level lines of the cost function. For a circular problem it points towards the optimum, while for a non-circular problem we might be able to do better. The natural gradient can be interpreted as a removal of the adverse effects of the particular model: for an elliptic problem we could simply "divide by the eigenvalues", i.e. apply a linear transformation with the inverse eigenvalues and the appropriate eigenvectors.

Gradient descent

Gradient descent:

  θ_{t+1} = θ_t − η ∇_θ f(θ_t)

Assume an affine (linear) transformation of the parameters, ϕ = W^{-1} θ. Then

  ϕ_{t+1} = ϕ_t − η (∂θ/∂ϕ)ᵀ ∇_θ f(θ_t) = ϕ_t − η Wᵀ ∇_θ f(θ_t)

Multiplying by W:

  W ϕ_{t+1} = W ϕ_t − η W Wᵀ ∇_θ f(θ_t)
  θ′_{t+1} = θ_t − η W Wᵀ ∇_θ f(θ_t)

In general θ′_{t+1} ≠ θ_{t+1}, i.e. the gradient is not affine invariant. This is nothing to worry about: the gradient works reasonably well with any positive definite matrix in front, but we can do better.

Beyond gradient descent

Gradient descent is based on a first-order Taylor expansion

  f(θ) ≈ f(θ_0) + ∇_θ f(θ_0)ᵀ (θ − θ_0)

Consider instead the second-order Taylor expansion

  f(θ) ≈ f(θ_0) + ∇_θ f(θ_0)ᵀ (θ − θ_0) + ½ (θ − θ_0)ᵀ H(θ_0) (θ − θ_0)

where the Hessian is given by H_ij(θ_0) = ∂²f(θ_0)/∂θ_i∂θ_j. In this approximation we can optimise f w.r.t. θ by

  θ = θ_0 − H^{-1}(θ_0) ∇_θ f(θ_0)

This is left unchanged by a linear transform ϕ = W^{-1}θ: with H(ϕ_0) = Wᵀ H(θ_0) W and ∇_ϕ = Wᵀ ∇_θ,

  ϕ = ϕ_0 − (Wᵀ H W)^{-1} Wᵀ ∇_θ f(Wϕ_0) = ϕ_0 − W^{-1} H^{-1}(θ_0) ∇_θ f(Wϕ_0)
  θ = θ_0 − H^{-1}(θ_0) ∇_θ f(θ_0)   (after multiplication by W)

The second-order (Newton) method is affine invariant.

Reformulation of gradient descent

Gradient descent improves the current estimate; it is perfect for a linear cost function in a specific coordinate system. Is it the best we can do (ignoring the second-order correction by the Hessian)? Given step size η, we find

  θ* = argmax_{θ: ‖θ−θ_0‖ ≤ η} f(θ)
     ≈ argmax_{θ: ‖θ−θ_0‖ ≤ η} f(θ_0) + ∇_θ f(θ_0)ᵀ (θ − θ_0)
     = argmax_{θ: ‖θ−θ_0‖ ≤ η} ∇_θ f(θ_0)ᵀ (θ − θ_0)
     = θ_0 + η ∇_θ f(θ_0) / ‖∇_θ f(θ_0)‖

i.e. the optimal step θ − θ_0 has length η and is parallel to the unit vector ∇_θ f / ‖∇_θ f‖, where ‖·‖ is the Euclidean norm. Can we also use other norms (or distance functions)?

Kullback-Leibler divergence

Kullback-Leibler divergence:

  KL(π_{ω_1}(a|x), π_{ω_2}(a|x)) = Σ_{x,a} π_{ω_1}(a|x) log( π_{ω_1}(a|x) / π_{ω_2}(a|x) )

Consider two similar policies π_ω(a|x) and π_{ω+δω}(a|x), and perform a Taylor expansion of KL(π_ω(a|x), π_{ω+δω}(a|x)) in δω:

  Constant term: KL(π_ω(a|x), π_ω(a|x)) = 0
  Linear term: ∂/∂δω KL(π_ω(a|x), π_{ω+δω}(a|x)) evaluated at δω = 0 vanishes
  The quadratic term is given by the Fisher information matrix.

Fisher information

In other words, the Hessian of the Kullback-Leibler divergence is the Fisher information matrix:

  F_ij(x; ω) = E_{π_ω(a|x)} [ ∂ log π_ω(a|x)/∂ω_i · ∂ log π_ω(a|x)/∂ω_j ]

Benefits:
  As a Hessian, the Fisher matrix gives an affine-invariant descent direction.
  The approximation of the downhill/uphill direction becomes better for non-linear cost functions.
  The Fisher information matrix is covariant (i.e. invariant against transformations of the parameters).

Literature: natural gradient (S. Amari: Natural gradient works efficiently in learning. Neural Computation 10, 251-276, 1998); examples by Bagnell and Schneider (2003) and Jan Peters (2003, 2008).
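As a quick numerical sanity check of this connection (a sketch, not lecture code): for a single categorical distribution p_ω(a) ∝ exp(ω_a), i.e. dropping the state for simplicity, the finite-difference Hessian of δω ↦ KL(p_ω, p_{ω+δω}) at δω = 0 should coincide with E[ΨΨᵀ], where Ψ(a) = ∇_ω log p_ω(a) = e_a − p.

```python
# Sketch: verify numerically that the Hessian of the KL divergence at
# delta = 0 equals the Fisher matrix for a softmax-parametrised categorical.
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    return np.sum(p * (np.log(p) - np.log(q)))

rng = np.random.default_rng(0)
omega = rng.normal(size=4)
p = softmax(omega)

# Fisher matrix E_p[psi psi^T] with psi(a) = e_a - p (score of the softmax)
psis = np.eye(len(p)) - p                  # row a holds grad_omega log p(a)
fisher = sum(p[a] * np.outer(psis[a], psis[a]) for a in range(len(p)))

# Central-difference Hessian of g(delta) = KL(p_omega, p_{omega+delta}) at 0
eps = 1e-4
d = len(omega)
hess = np.zeros((d, d))
for i in range(d):
    for j in range(d):
        def g(di, dj):
            delta = np.zeros(d)
            delta[i] += di
            delta[j] += dj
            return kl(p, softmax(omega + delta))
        hess[i, j] = (g(eps, eps) - g(eps, -eps)
                      - g(-eps, eps) + g(-eps, -eps)) / (4 * eps**2)

print(np.max(np.abs(hess - fisher)))   # small (finite-difference error only)
```

For this family both quantities equal diag(p) − ppᵀ, which is what the printed difference confirms up to numerical error.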
Natural gradient

  θ = F^{-1}(ω) ∇_ω ρ(ω)

is a natural gradient on

  ρ_{Q,π,μ} = Σ_{x,a} μ^{π_ω}(x) π_ω(a|x) Q^{π_ω}(x,a)

if we can assume the dependency of μ and Q on ω to be "weak", i.e.

  ∇_ω ρ(ω) = Σ_{x,a} μ^π(x) Q^π(x,a) ∇_ω π_ω(a|x)

We have seen that

  F(ω) θ = ∇_ω ρ(ω) ⇔ θ = F(ω)^{-1} ∇_ω ρ(ω) =: ∇̃_ω ρ(ω)

which defines the natural gradient. This implies the following natural-gradient learning rule (Kakade, 2001/2):

  ω_{t+1} = ω_t + β_t θ_t

which is better behaved and simpler than the standard policy gradient. Remember that this result is obtained at the cost of calculating θ!

Pros and Cons of the Fisher information

+ "Natural" (covariant): uses the geometry of the goal function rather than the geometry of the parameter space (the choice of parameters used to be critical, but is not any more).
+ Related to the Kullback-Leibler divergence and to the Hessian
+ Describes efficiency in statistical estimation (Cramér-Rao)
+ Many applications in machine learning, statistics and physics
− Usually depends on the parameters and is computationally complex (but not here, where we were lucky!)
− Requires sampling of high-dimensional probability distributions
+ May still work if some approximation is used, e.g. a Gaussian

Kakade's Example

Figure: three right curves: standard gradient; three left curves: natural gradient.
Policy: π(a|x; ω) ∝ exp(ω_1 s_1 x² + ω_2 s_2 x); starting conditions: ω_1 s_1 = ω_2 s_2 = −0.8.

Kakade's Example

Figure, left: average reward for the policy π(a = 1|s; ω) = exp(ω)/(1 + exp(ω)); the lower plot shows the beginning of the upper plot (different scales!); dashed: natural gradient, solid: standard gradient. Right: movement in the parameter space (the axes are actually the ω_i!).

Examples of natural gradients

Figure: standard gradient (dashed) vs. natural gradient (solid).
Grondman et al. (2012) A survey of actor-critic reinforcement learning: Standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C 42(6), 1291-1307.

More examples

J. Kober & J. R. Peters: Policy search for motor primitives in robotics. NIPS 2009, pp. 849-856.

Summary

A promising approach for continuous action and state spaces (in discrete time)
Policy gradient as direct maximisation of the average state-action value
The natural policy gradient arises from the optimisation of the value function approximation
Model-free reinforcement learning

Acknowledgements

Some material was adapted from web resources associated with Sutton and Barto's Reinforcement Learning book. Today's lecture was mainly based on C. Szepesvári: Algorithms for RL, Ch. 3.4. See also David Silver's Lecture 7: Policy Gradient and Pieter Abbeel's Lecture 20 on Advanced Robotics. Check the papers by Jan Peters and others (see above in the slides).