On the Understanding of PPO-Clip: A Deep Reinforcement Learning Algorithm

Francisco Abusleme, Cristóbal Ponce, Miguel Rojas
IPD431 - Probabilidades y Procesos Estocásticos
August 11, 2021

Abstract

This article presents a brief but detailed introduction to the problem of Deep Reinforcement Learning (DRL), covering the main methods based on policy gradients, such as Vanilla Policy Gradient (VPG) and Trust Region Policy Optimization (TRPO), which serve as the foundation for understanding Proximal Policy Optimization (PPO) methods, in particular the PPO-Clip formulation, which corresponds to one of the latest advances in this field.

Keywords: DRL, VPG, TRPO, PPO.

1. Introduction

Reinforcement Learning (RL) is the study of agents and how they learn by trial and error [6]. It formalizes the idea that rewarding or punishing an agent for its behaviour makes it more likely to repeat or forego that behaviour in the future.

The environment is the world that the agent lives in and interacts with. At every step, the agent sees an observation of the state s_t and then decides what action a_t to take (the environment changes when the agent acts on it, but may also change on its own). The agent also perceives a reward signal r_t from the environment, a number that tells it how good or bad the current world state is. The goal of the agent is to maximize its cumulative reward, called the return.

Figure 1: Agent-Environment loop.

1.1. The RL problem

The problem in RL is to maximize the expected total reward of a parametrized policy π_θ(·|s_t), which is a rule used by an agent to decide what actions to take. RL methods give rules to update the set of parameters θ (through some optimization algorithm) so that the agent can learn behaviours that maximize the cumulative reward.

In order to formulate the RL problem mathematically, we first define a trajectory τ as the sequence of states and actions in the environment, that is, τ = (s_1, a_1, ..., a_T, s_{T+1}), where the first state s_1 is randomly sampled from an initial state distribution p(s_1). We also need to define the reward function, which in general may depend on the current state s_t, the action that has just been taken a_t, and the next state s_{t+1}, that is, r_t = r(s_t, a_t, s_{t+1}). Finally, the return, a notion of the cumulative reward along a trajectory, is denoted by R(τ).

Assuming a Markov model for the states, the joint distribution over trajectories τ = (s_1, a_1, ..., a_T, s_{T+1}) generated by the policy π_θ(·) can be expressed as

\[ P_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t) \tag{1} \]

The objective is to converge to an optimal policy that maximizes the expected total reward over all trajectories:

\[ J(\theta) = \mathbb{E}_{\tau \sim P_\theta(\tau)}\left[R(\tau)\right] \tag{2} \]
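To make the trajectory and return notation concrete, the following minimal Python sketch rolls out one trajectory in a Gym environment and accumulates R(τ). It assumes the pre-0.26 OpenAI Gym reset/step API and the Pendulum-v0 environment id that was current when this work was written; collect_trajectory and random_policy are illustrative names, not part of any library.

```python
import gym

def collect_trajectory(env, policy, max_steps=200):
    """Roll out one trajectory tau = (s_1, a_1, ..., a_T, s_{T+1}) and
    accumulate the (undiscounted) return R(tau)."""
    states, actions, rewards = [], [], []
    s = env.reset()
    for _ in range(max_steps):
        a = policy(s)                       # sample a_t ~ pi_theta(.|s_t)
        s_next, r, done, _ = env.step(a)    # environment returns r_t and s_{t+1}
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s_next
        if done:
            break
    return states, actions, rewards, sum(rewards)

# Example: a random policy on the pendulum environment used later in this work.
env = gym.make("Pendulum-v0")
random_policy = lambda s: env.action_space.sample()
_, _, _, R = collect_trajectory(env, random_policy)
print(f"Return of one random trajectory: {R:.1f}")
```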
1.2. Value functions

Almost all reinforcement learning algorithms involve estimating value functions: functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). The notion of "how good" here is defined in terms of the future rewards that can be expected or, to be precise, in terms of expected return. Of course, the rewards the agent can expect to receive in the future depend on what actions it will take. Accordingly, value functions are defined with respect to a particular policy.

On-policy value function: a function of a state s under a policy π; it is the expected return starting in s and following π thereafter. Note that the value function of a terminal state, if any, is always zero.

\[ V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\left[R(\tau) \mid s_1 = s\right] \tag{3} \]

On-policy action-value function: similarly, the value of taking action a in state s under a policy π is the expected return starting from s, taking action a, and thereafter following policy π.

\[ Q^{\pi}(s,a) = \mathbb{E}_{\tau \sim \pi}\left[R(\tau) \mid s_1 = s,\; a_1 = a\right] \tag{4} \]

1.3. Advantage functions

Sometimes in RL we don't need to describe how good an action is in an absolute sense, but only how much better it is than others on average. That is to say, we want to know the relative advantage of that action. We make this concept precise with the advantage function.

The advantage function A^π(s, a) corresponding to a policy π describes how much better it is to take a specific action a in state s over randomly selecting an action according to π(·|s), assuming you act according to π forever after. Mathematically, the advantage function is defined by

\[ A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s) \tag{5} \]

2. Policy gradient methods

The gradient of the policy performance, ∇_θ J(θ), is called the policy gradient, and algorithms that update the policy parameters θ by gradient ascent are called policy gradient methods. Considering the parametrized policy π_θ(·|s_t), it is possible to obtain an unbiased estimate of the gradient of the total expected return so that it can be used in a gradient ascent algorithm; unfortunately, the variance of the gradient estimator scales unfavorably with the time horizon [4]. The policy gradient methods presented below are considered essential to understanding the PPO-Clip algorithm. Each of them uses a different way to estimate the gradient, so their performance depends on the bias and variance of the estimator.

2.1. Vanilla Policy Gradient

The VPG algorithm consists of calculating an estimate of the gradient of an objective function with the purpose of optimizing it using gradient ascent. Consider the identity

\[ \nabla_\theta \log P_\theta(\tau) = \frac{\nabla_\theta P_\theta(\tau)}{P_\theta(\tau)} \tag{6} \]

Then, the gradient of (2) can be computed using (6):

\[
\begin{aligned}
\nabla_\theta J(\theta) &= \nabla_\theta \int P_\theta(\tau)\, R(\tau)\, d\tau \\
&= \int \nabla_\theta P_\theta(\tau)\, R(\tau)\, d\tau \\
&= \int P_\theta(\tau)\, \nabla_\theta \log P_\theta(\tau)\, R(\tau)\, d\tau \\
&= \mathbb{E}_{\tau \sim P_\theta(\tau)}\left[\nabla_\theta \log P_\theta(\tau)\, R(\tau)\right]
\end{aligned}
\]

Applying the logarithm to (1) and then taking the gradient, the terms that do not depend on θ vanish and we obtain

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim P_\theta(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau)\right] \tag{7} \]

\[ \nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\right) R(\tau_i) \tag{8} \]

Finally, gradient ascent is applied using a learning rate α:

\[ \theta \leftarrow \theta + \alpha\, \nabla_\theta J(\theta) \]

A useful interpretation of the latter expression is to reformulate it as an optimization problem, constraining how much the parameters can change between iterations:

\[ \theta^{*} = \arg\max_{\theta'}\; (\theta' - \theta)^{T} \nabla_\theta J(\theta) \quad \text{s.t.} \quad \lVert \theta' - \theta \rVert \leq \epsilon \tag{9} \]
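As an illustration of how the sample estimate (8) is typically turned into a gradient-ascent step with automatic differentiation, the following hedged PyTorch sketch builds a surrogate loss whose gradient matches (8) for a discrete-action policy. The names policy_net and vpg_loss are hypothetical, and the network is assumed to output action logits.

```python
import torch
from torch.distributions import Categorical

def vpg_loss(policy_net, batch_states, batch_actions, batch_returns):
    """Surrogate loss whose gradient is the estimator in (8):
    maximizing E[log pi_theta(a|s) * R(tau)] <=> minimizing its negative."""
    logits = policy_net(batch_states)                       # shape (N*T, num_actions)
    log_probs = Categorical(logits=logits).log_prob(batch_actions)
    # batch_returns holds R(tau_i), repeated along each trajectory i
    return -(log_probs * batch_returns).mean()

# One gradient-ascent step, written as a descent step on the negative objective:
# optimizer = torch.optim.Adam(policy_net.parameters(), lr=1e-3)
# loss = vpg_loss(policy_net, states, actions, returns)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```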
2.2. Importance sampling

There are two kinds of RL algorithms: on-policy and off-policy. The vanilla policy gradient (7) is an on-policy algorithm because it needs samples from the latest policy in order to calculate the gradient. Since gradient ascent is performed after each iteration, the parameters change (and so does the policy), so it is mandatory to obtain new samples at the beginning of every iteration. This can be very costly in many scenarios.

Importance sampling is a fundamental technique in RL used to achieve something similar to off-policy learning, and it can be used in many different ways [1]. We now explain the simplest and most frequent version. (TRPO and PPO are still considered on-policy algorithms due to technical definitions, although they use an old policy to update the gradient.)

Suppose we want to calculate the expected value E_{x∼p(x)}[f(x)]. We are sampling from the distribution p(x), but we can sample from another distribution q(x) without changing the expected value:

\[ \mathbb{E}_{x \sim p(x)}\left[f(x)\right] = \int p(x)\, f(x)\, dx = \int q(x)\, \frac{p(x)}{q(x)}\, f(x)\, dx = \mathbb{E}_{x \sim q(x)}\left[\frac{p(x)}{q(x)}\, f(x)\right] \tag{10} \]
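A minimal numerical sketch of (10): estimate E_{x∼p}[f(x)] using samples drawn from a different distribution q and reweighting by p(x)/q(x). The particular Gaussians and the choice f(x) = x² are arbitrary illustrative assumptions; the closed-form value E_p[x²] = 2 is used only as a sanity check.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Target p(x) = N(1, 1), proposal q(x) = N(0, 2); f(x) = x**2 (illustrative choices).
p = stats.norm(loc=1.0, scale=1.0)
q = stats.norm(loc=0.0, scale=2.0)
f = lambda x: x**2

x = rng.normal(loc=0.0, scale=2.0, size=100_000)   # samples from q, not from p
weights = p.pdf(x) / q.pdf(x)                      # importance ratios p(x)/q(x)
is_estimate = np.mean(weights * f(x))              # Monte Carlo version of (10)

print(f"importance-sampling estimate: {is_estimate:.3f}")  # true value E_p[x^2] = 2
```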
2.3. Kullback-Leibler divergence

Also known as the KL divergence, it is a measure of how different two probability distributions are. Intuitively, if we wanted to use importance sampling (10), we would like the distributions p(x) and q(x) to be similar. The KL divergence formalizes this notion of similarity, although it is not a distance function (it is not even symmetric).

The KL divergence D_KL(p||q) uses the cross entropy H(p, q) and the entropy H(p) to quantify the amount of information lost when sampling from q instead of the true distribution p. Expanding:

\[ D_{KL}(p \,\|\, q) = H(p, q) - H(p) = \int p(x) \log \frac{p(x)}{q(x)}\, dx \tag{11} \]

2.4. Trust Region Policy Optimization

We now modify the vanilla policy gradient algorithm using importance sampling and the KL divergence. Applying importance sampling (10) to the gradient expression (7), we can sample from an old policy π_θold(·), which generates a distribution P_θold(τ) over trajectories. In addition, the total expected reward R(τ) is replaced by the advantage function A^π(s, a), which can be learned using a function approximator (a neural network). Note that the use of a neural network adds bias to a previously unbiased estimate of the policy gradient, but the variance is reduced drastically. These considerations yield

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim P_{\theta_{\text{old}}}(\tau)}\left[\frac{P_\theta(\tau)}{P_{\theta_{\text{old}}}(\tau)} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_{\theta_{\text{old}}}}(s_t, a_t)\right] \]

Using (1) to expand the distributions:

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim P_{\theta_{\text{old}}}(\tau)}\left[\sum_{t=1}^{T} \left(\prod_{t'=1}^{t} \frac{\pi_\theta(a_{t'} \mid s_{t'})}{\pi_{\theta_{\text{old}}}(a_{t'} \mid s_{t'})}\right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_{\theta_{\text{old}}}}(s_t, a_t)\right] \tag{12} \]

Because we are sampling from the old policy, the ratio between policies will likely be less than one, making the product in (12) decay exponentially with T and adding a considerable amount of variance. In order to use importance sampling more efficiently, we change the objective function: states are sampled from the current policy's state marginal, while actions are sampled from the old policy. Sampling in this way, the gradient is reformulated as

\[ \nabla_\theta J(\theta) = \mathbb{E}_{s \sim P_\theta(s)}\, \mathbb{E}_{a \sim \pi_{\theta_{\text{old}}}(a \mid s)}\left[\sum_{t=1}^{T} \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_{\theta_{\text{old}}}}(s_t, a_t)\right] \]

The only problem now is that we are sampling the states from the current state marginal distribution. This can be changed to the old policy with a bounded effect (see [3] for a detailed proof); intuitively, if the policies are close enough (in a total variation sense), the expected states are not very different for small t. Let δ(·||·) denote the total variation distance. If δ(π_θ || π_θold) ≤ ε, then it holds that δ(P_θ || P_θold) ≤ 2εt. Therefore, the expression for the gradient becomes

\[ \nabla_\theta J(\theta) = \mathbb{E}_{s \sim P_{\theta_{\text{old}}}(s)}\, \mathbb{E}_{a \sim \pi_{\theta_{\text{old}}}(a \mid s)}\left[\sum_{t=1}^{T} \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_{\theta_{\text{old}}}}(s_t, a_t)\right] \tag{13} \]

Of course, (13) assumes that the old and new policies are close, so the optimization problem is reformulated as

\[ \theta^{*} = \arg\max_{\theta'}\; (\theta' - \theta)^{T} \nabla_\theta J(\theta) \quad \text{s.t.} \quad \mathbb{E}_{s \sim P_{\theta'}(s)}\left[D_{KL}\big(\pi_{\theta'}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big)\right] \leq \epsilon \tag{14} \]

A second-order approximation of the constraint can be calculated using the Fisher information matrix:

\[ D_{KL}(\pi_{\theta'} \,\|\, \pi_\theta) \approx \frac{1}{2} (\theta' - \theta)^{T} F (\theta' - \theta), \qquad F = \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, \nabla_\theta \log \pi_\theta(a \mid s)^{T}\right] \]

This allows the gradient ascent step to be computed as

\[ \theta \leftarrow \theta + \alpha\, F^{-1} \nabla_\theta J(\theta) \]

3. Proximal Policy Optimization

PPO is a variation of TRPO that avoids the use of complex second-order methods: the policy update is done by means of first-order methods along with some tricks that simplify the calculation. Because of this, PPO performs policy updates in small steps, preventing performance crashes from occurring. There are two primary variants of PPO: PPO-Penalty and PPO-Clip [5].

PPO-Penalty approximately solves a KL-constrained update like TRPO, but penalizes the KL divergence in the objective function instead of making it a hard constraint, and automatically adjusts the penalty coefficient over the course of training so that it is scaled appropriately.

PPO-Clip has no KL-divergence term in the objective and no constraint at all. Instead, it relies on specialized clipping in the objective function to remove incentives for the new policy to move far from the old policy. PPO-Clip updates policies via

\[ \theta_{k+1} = \arg\max_{\theta}\; \mathbb{E}_{s, a \sim \pi_{\theta_k}}\left[L(s, a, \theta_k, \theta)\right] \]

with

\[ L(s, a, \theta_k, \theta) = \min\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)}\, A^{\pi_{\theta_k}}(s, a),\; g\big(\epsilon, A^{\pi_{\theta_k}}(s, a)\big)\right) \]

where

\[ g(\epsilon, A) = \begin{cases} (1 + \epsilon)\, A & A \geq 0 \\ (1 - \epsilon)\, A & A < 0 \end{cases} \]

To explain why this formulation makes sense, we consider two cases depending on the sign of the advantage.

Advantage is positive: suppose the advantage for that state-action pair is positive, in which case its contribution to the objective reduces to

\[ L(s, a, \theta_k, \theta) = \min\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)},\; 1 + \epsilon\right) A^{\pi_{\theta_k}}(s, a) \]

Because the advantage is positive, the objective will increase if the action becomes more likely, that is, if π_θ(a|s) increases. But the min in this term puts a limit on how much the objective can increase. Once π_θ(a|s) > (1 + ε) π_θk(a|s), the min creates a ceiling of (1 + ε) A^{π_θk}(s, a). Thus, the new policy does not benefit by going far away from the old policy.

Advantage is negative: suppose the advantage for that state-action pair is negative, in which case its contribution to the objective reduces to

\[ L(s, a, \theta_k, \theta) = \max\left(\frac{\pi_\theta(a \mid s)}{\pi_{\theta_k}(a \mid s)},\; 1 - \epsilon\right) A^{\pi_{\theta_k}}(s, a) \]

Because the advantage is negative, the objective will increase if the action becomes less likely, that is, if π_θ(a|s) decreases. But the max in this term puts a limit on how much the objective can increase. Once π_θ(a|s) < (1 − ε) π_θk(a|s), the max creates a ceiling of (1 − ε) A^{π_θk}(s, a). Thus again, the new policy does not benefit by going far away from the old policy.

What we have seen so far is that clipping serves as a regularizer by removing incentives for the policy to change dramatically, and the hyperparameter ε controls how far the new policy can move away from the old one while still benefiting the objective.
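In practice the per-sample objective L(s, a, θ_k, θ) is usually implemented from log-probabilities, with a clamp operation playing the role of g(ε, A). The sketch below is one common way to write the negative clipped surrogate in PyTorch (so that a standard optimizer can minimize it); ppo_clip_loss is an illustrative name, and the input tensors are assumed to be precomputed for a batch sampled under π_θk.

```python
import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, eps=0.2):
    """Negative of the clipped surrogate objective L(s, a, theta_k, theta).

    log_probs:      log pi_theta(a|s) under the current policy (requires grad)
    old_log_probs:  log pi_theta_k(a|s) under the old policy (detached)
    advantages:     estimates of A^{pi_theta_k}(s, a)
    eps:            clipping parameter epsilon
    """
    ratio = torch.exp(log_probs - old_log_probs)              # pi_theta / pi_theta_k
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # min(unclipped, clipped) equals min(ratio * A, g(eps, A)) for both signs of A
    return -torch.min(unclipped, clipped).mean()
```

Note that min(ratio · A, clamp(ratio, 1 − ε, 1 + ε) · A) is algebraically the same as min(ratio · A, g(ε, A)) for both signs of the advantage, which is why implementations rarely branch on the sign explicitly.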
Several studies were performed on different benchmarks [5]; the end results are shown in Figure 2. It can be seen that PPO (clipped version) outperforms the previous methods on almost all the continuous control environments and converges more rapidly in some cases.

Figure 2: Policy gradient algorithms comparisons.

3.1. Application to an Inverted Pendulum

An inverted pendulum is a classical control problem with two degrees of freedom of motion and only one actuator to control its position. The goal is to remain at zero angle (vertical upward) with the least rotational velocity and the least effort. Some of the states of the pendulum are shown in Figure 3.

Figure 3: States of an inverted pendulum [2].

This problem has a 3-dimensional state space s = [cos α, sin α, α̇], i.e. the cosine and sine of the angle as well as the derivative of the angle. The action space is 1-dimensional: the (bounded) torque applied to the joint. Then, according to [2], the reward function is defined as

\[ r_t = r(\alpha, \dot{\alpha}, a_t) = -\alpha^2 - 0.1\,\dot{\alpha}^2 - 0.001\, a_t^2 \]

where α ∈ [−π, π] and a_t is the control torque.

The pendulum problem is solved using an Actor-Critic model, where an actor network is used to approximate the policy π_θ(·|s), whereas a critic network is used to approximate the value function V_φ(s). Since V^π(s) cannot be computed exactly, it has to be approximated; this is done with a neural network V_φ(s), which is updated concurrently with the policy (so that the value network always approximates the value function of the most recent policy). The simplest method for learning V_φ(s) is to minimize a mean-squared-error objective

\[ \phi_k = \arg\min_{\phi}\; \mathbb{E}_{s, \hat{R}_t \sim \pi_k}\left[\left(V_\phi(s) - \hat{R}_t\right)^2\right] \]

where π_k is the current policy. On the other hand, the advantage function can be estimated from the value function estimates by using the generalized advantage estimator [4], given by

\[ \hat{A}_t = \sum_{\ell = 0}^{\infty} (\gamma \lambda)^{\ell}\, \delta_{t+\ell} \]

where the parameter 0 ≤ λ ≤ 1 controls the trade-off between bias and variance, 0 ≤ γ ≤ 1 is a discount factor, and δ_t is the temporal-difference error given by

\[ \delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t) \]

The complete algorithm is presented in Figure 5.

Figure 5: Pseudocode of PPO-Clip.
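The generalized advantage estimator above is usually computed backwards over a finite trajectory using the recursion Â_t = δ_t + γλ Â_{t+1}. A small NumPy sketch, assuming arrays of rewards and value predictions (with one bootstrap value appended), could look as follows; gae_advantages is an illustrative name.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation over one finite trajectory.

    rewards: r_1, ..., r_T
    values:  V_phi(s_1), ..., V_phi(s_{T+1})  (one extra bootstrap value)
    Returns A_hat_t = sum_l (gamma * lam)^l * delta_{t+l}, computed backwards.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error delta_t
        gae = delta + gamma * lam * gae                         # recursive form of the sum
        advantages[t] = gae
    return advantages

# Regression targets for the value network can then be taken as
# R_hat_t = advantages + values[:-1].
```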
3.2. Experiments and Results

The algorithm shown in Figure 5 was implemented in Python. The neural networks were designed using the PyTorch library, and the OpenAI Gym library was used to simulate the environment.

Training results are presented in Figure 4a, which shows the average reward for each episode of training. The task is considered solved when the average reward is above −200; thus, the PPO algorithm solves the task around episode 60. The actor and critic losses are shown in Figures 4b and 4c. The critic loss decreases monotonically, indicating that the network is able to minimize the temporal-difference error of the V-network, and the actor network parameters converge in a few episodes.

Figure 4: Control results using the PPO-Clip algorithm: (a) average episode reward, (b) actor loss, (c) critic loss.

4. Conclusions

A brief introduction to Deep Reinforcement Learning has been presented, in order to convey an understanding of a fairly new algorithm used in robotics applications. Proximal Policy Optimization is a relatively recent method that uses multiple epochs of stochastic gradient ascent to update its policy. It is a simplified version of TRPO but with the benefits of stability and reliability, and it is applicable in more general settings with better overall performance.

References

[1] Mahammad Humayoo and Xueqi Cheng. Relative importance sampling for off-policy actor-critic in deep reinforcement learning. arXiv preprint arXiv:2105.07998, 2019.

[2] Swagat Kumar. Controlling an inverted pendulum with policy gradient methods: a tutorial. arXiv preprint arXiv:2105.07998, 2021.

[3] John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. Trust region policy optimization. University of California, Berkeley, Department of Electrical Engineering and Computer Sciences, 2015.

[4] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

[5] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[6] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.