On the understanding of PPO-Clip: A Deep Reinforcement
Learning algorithm
Francisco Abusleme, Cristóbal Ponce, Miguel Rojas
IPD431 - Probabilidades y Procesos Estocásticos
August 11, 2021
Abstract
This article presents a brief but detailed introduction to the problem of Deep Reinforcement Learning (DRL), covering the main policy gradient methods, such as Vanilla Policy Gradient (VPG) and Trust Region Policy Optimization (TRPO), which serve as the foundation for understanding Proximal Policy Optimization (PPO) methods. This work focuses in particular on the PPO-Clip formulation, one of the most recent advances in this field.
Keywords: DRL, VPG, TRPO, PPO.
1. Introduction

Reinforcement Learning (RL) is the study of agents and how they learn by trial and error [6]. It formalizes the idea that rewarding or punishing an agent for its behavior makes it more likely to repeat or forego that behaviour in the future.

Figure 1: Agent-Environment loop.

The environment is the world that the agent lives in and interacts with. At every step, the agent sees an observation of the state s_t and then decides what action a_t to take (the environment changes when the agent acts on it, but may also change on its own). The agent also perceives a reward signal r_t from the environment, a number that tells how good or bad the current world state is. The goal of the agent is to maximize its cumulative reward, called the return.
1.1. The RL problem

The problem in RL is to maximize the expected total reward of a parametrized policy π_θ(·|s_t), which is a rule used by an agent to decide what actions to take. Methods in RL give rules to update this set of parameters θ (through some optimization algorithm) so that the agent can learn behaviours that maximize the cumulative reward.

In order to formulate the RL problem mathematically, we first define a trajectory τ as the sequence of states and actions in the environment, that is, τ = (s_1, a_1, ..., a_T, s_{T+1}), where the first state s_1 is randomly sampled from an initial state distribution p(s_1). We also need to define the reward function, which in general can depend on the current state s_t, the action that has just been taken a_t, and the next state s_{t+1}, that is, r_t = r(s_t, a_t, s_{t+1}). Finally, the return, as a notion of the cumulative reward along a trajectory, is denoted by R(τ).

Assuming a Markov model for the states, the joint distribution of the possible trajectories τ = (s_1, a_1, ..., a_T, s_{T+1}) under the policy π_θ(·) can be expressed as:

$$P_\theta(\tau) = p(s_1) \prod_{t=1}^{T} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t) \qquad (1)$$

The objective is to converge to an optimal policy that maximizes the expected total reward over all trajectories:

$$J(\theta) = \mathbb{E}_{\tau \sim P_\theta(\tau)}\left[R(\tau)\right] \qquad (2)$$
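To make these definitions concrete, here is a minimal sketch (not from the paper) that samples a single trajectory with a placeholder random policy in an OpenAI Gym environment and accumulates its return R(τ). The environment name, the horizon, and the random policy are arbitrary choices for illustration.

```python
import gym

# Sample one trajectory tau = (s_1, a_1, ..., a_T, s_{T+1}) with a random
# placeholder policy and accumulate the return R(tau).
env = gym.make("Pendulum-v1")
state = env.reset()          # s_1 ~ p(s_1)  (classic Gym API; newer versions return (obs, info))

ret = 0.0                    # R(tau), here the plain sum of rewards
for t in range(200):         # horizon T (arbitrary for this example)
    action = env.action_space.sample()              # stand-in for a_t ~ pi_theta(.|s_t)
    next_state, reward, done, info = env.step(action)
    ret += reward
    state = next_state
    if done:
        break

print(f"Sampled return R(tau) = {ret:.2f}")
```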
1.2. Value functions

Almost all reinforcement learning algorithms involve estimating value functions: functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state). The notion of "how good" here is defined in terms of the future rewards that can be expected or, to be precise, in terms of expected return. Of course, the rewards the agent can expect to receive in the future depend on what actions it will take. Accordingly, value functions are defined with respect to a particular policy.

On-Policy Value function: a function of a state s under a policy π. It is the expected return starting in s and following π thereafter. Note that the value function of a terminal state, if any, is always zero.

$$V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi}\left[R(\tau) \mid s_1 = s\right] \qquad (3)$$

On-Policy Action-Value function: similarly, the value of taking action a in state s under a policy π is the expected return from s, taking action a and thereafter following policy π.

$$Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi}\left[R(\tau) \mid s_1 = s, a_1 = a\right] \qquad (4)$$
1.3. Advantage functions

Sometimes in RL we don't need to describe how good an action is in an absolute sense, but only how much better it is than others on average. That is to say, we want to know the relative advantage of that action. We make this concept precise with the advantage function. The advantage function A^π(s, a) corresponding to a policy π describes how much better it is to take a specific action a in state s over randomly selecting an action according to π(·|s), assuming you act according to π forever after. Mathematically, the advantage function is defined by

$$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s) \qquad (5)$$
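As a toy numerical illustration of equations (3)-(5) (not from the paper), suppose we have Monte Carlo returns collected from a single state s under a policy with two discrete actions. The returns and the uniform policy below are made-up numbers.

```python
import numpy as np

# Hypothetical Monte Carlo returns R(tau) observed from the same state s,
# grouped by which action a_1 was taken first (two discrete actions).
returns = {0: np.array([1.0, 0.5, 1.5]),   # rollouts that started with a = 0
           1: np.array([3.0, 2.0, 2.5])}   # rollouts that started with a = 1
pi = {0: 0.5, 1: 0.5}                      # assumed uniform policy pi(a|s)

# Q^pi(s, a): expected return starting in s, taking a, then following pi  (eq. 4)
Q = {a: r.mean() for a, r in returns.items()}
# V^pi(s): expected return starting in s and following pi  (eq. 3)
V = sum(pi[a] * Q[a] for a in Q)
# A^pi(s, a) = Q^pi(s, a) - V^pi(s)  (eq. 5)
A = {a: Q[a] - V for a in Q}

print(Q, V, A)   # Q = {0: 1.0, 1: 2.5}, V = 1.75, A = {0: -0.75, 1: 0.75}
```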
2. Policy gradient methods

The gradient of policy performance ∇_θ J(π_θ) is called the policy gradient, and algorithms that update the policy parameters θ by gradient ascent are called policy gradient methods. Considering the parameterized policy π_θ(·|s_t), it is possible to obtain an unbiased estimate of the gradient of the total expected return so that it can be used in a gradient ascent algorithm; unfortunately, the variance of the gradient estimator scales unfavorably with the time horizon [4].

The policy gradient methods presented below are considered essential to understanding the PPO-Clip algorithm. Each of them uses a different way to estimate the gradient, so their performance depends on their estimator's bias and variance.

2.1. Vanilla Policy Gradient

The VPG algorithm consists of calculating an estimate of the gradient of an objective function with the purpose of optimizing it using gradient ascent. Consider the identity:

$$\nabla_\theta \log P_\theta(\tau) = \frac{\nabla_\theta P_\theta(\tau)}{P_\theta(\tau)} \qquad (6)$$

Then, the gradient of (2) can be computed using (6):

$$\nabla_\theta J(\theta) = \nabla_\theta \int P_\theta(\tau) R(\tau)\, d\tau = \int \nabla_\theta P_\theta(\tau)\, R(\tau)\, d\tau = \int P_\theta(\tau)\, \nabla_\theta \log P_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim P_\theta(\tau)}\left[\nabla_\theta \log P_\theta(\tau)\, R(\tau)\right]$$

Applying the logarithm to (1) and then taking the gradient, the terms that do not depend on θ vanish, which yields:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim P_\theta(\tau)}\left[\left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right) R(\tau)\right] \qquad (7)$$

$$\approx \frac{1}{N} \sum_{i=1}^{N} \left(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\right) R(\tau_i) \qquad (8)$$

Finally, gradient ascent is applied using a learning rate α:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$$

A useful interpretation of the latter expression is to reformulate it as an optimization problem, constraining how much the parameters can change between iterations:

$$\theta^{*} = \arg\max_{\theta'} \; (\theta' - \theta)^{T} \nabla_\theta J(\theta) \quad \text{s.t.} \;\; \|\theta' - \theta\| \leq \epsilon \qquad (9)$$
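As a rough illustration of how the estimator (8) and the ascent step are typically implemented with automatic differentiation (a sketch assuming the per-step log-probabilities and returns have already been collected; this is not the authors' code), one can form a surrogate loss whose gradient matches (8) and let a standard optimizer perform the update:

```python
import torch

# Minimal sketch of the VPG estimator (8) and the ascent step. `log_probs[i]` is the
# tensor of log pi_theta(a_t|s_t) along trajectory i (with gradients attached) and
# `returns[i]` its total return R(tau_i). The optimizer is assumed to wrap the policy
# parameters, e.g. optimizer = torch.optim.Adam(policy.parameters(), lr=alpha).
def vpg_update(optimizer, log_probs, returns):
    # Surrogate loss whose gradient equals -grad J(theta) in (8)
    loss = -torch.stack([lp.sum() * R for lp, R in zip(log_probs, returns)]).mean()
    optimizer.zero_grad()
    loss.backward()      # autograd provides the policy gradient estimate
    optimizer.step()     # realizes theta <- theta + alpha * grad J(theta)
```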
2.2. Importance Sampling

There are two kinds of RL algorithms: on-policy learning and off-policy learning. The vanilla policy gradient (7) is an on-policy algorithm because it needs samples from the latest policy in order to calculate the gradient. Because gradient ascent is performed after each iteration, the parameters are changed (and so is the policy), so it is mandatory to obtain new samples at the beginning of every iteration. This can be very costly in many scenarios.

Importance sampling is a fundamental technique in RL used to achieve something similar to off-policy learning, and it can be used in many different ways [1]. We now explain the simplest and most frequent version. (TRPO and PPO are still considered on-policy learning algorithms due to technical definitions, although they use an old policy to update the gradient.)

Suppose we want to calculate the expected value E_{x∼p(x)}[f(x)]. We are sampling from the distribution p(x), but we can sample from another distribution q(x) without changing the expected value:

$$\mathbb{E}_{x \sim p(x)}[f(x)] = \int p(x) f(x)\, dx = \int q(x) \frac{p(x)}{q(x)} f(x)\, dx = \mathbb{E}_{x \sim q(x)}\left[\frac{p(x)}{q(x)} f(x)\right] \qquad (10)$$
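A quick numerical check of (10) (an illustrative example, not from the paper): the target expectation is taken under a standard normal p, but samples are drawn only from a shifted normal q and reweighted by the ratio p(x)/q(x).

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: E_{x~p}[f(x)] with p = N(0,1) and f(x) = x^2 (true value 1.0),
# estimated with samples drawn only from q = N(1,1), as in (10).
def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = rng.normal(loc=1.0, scale=1.0, size=100_000)              # x ~ q
weights = normal_pdf(x, 0.0, 1.0) / normal_pdf(x, 1.0, 1.0)   # importance ratios p(x)/q(x)
estimate = np.mean(weights * x ** 2)                          # E_q[(p/q) f] ≈ E_p[f]

print(estimate)   # close to 1.0
```

Note that if q is very different from p, the importance weights become extreme and the estimator's variance blows up, which is exactly why the next section introduces a way to measure how far apart two distributions are.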
2.3. Kullback–Leibler divergence

Also known as the KL divergence, it is a measure of how different two probability distributions are. Intuitively, if we wanted to use importance sampling (10), we would like the distributions p(x) and q(x) to be similar. The KL divergence formalizes this notion of similarity, although it is not a distance function (it is not even symmetric).

The KL divergence D_KL(p||q) uses the cross entropy H(p, q) and the entropy H(p) to quantify the amount of information lost when sampling from q instead of the true distribution p. Expanding:

$$D_{KL}(p \,\|\, q) = H(p, q) - H(p) = \int p(x) \log \frac{p(x)}{q(x)}\, dx \qquad (11)$$
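For two small categorical distributions (made-up numbers), the discrete version of (11) can be computed directly; evaluating it in both directions also shows the asymmetry mentioned above.

```python
import numpy as np

# Discrete version of (11): D_KL(p || q) = sum_x p(x) log(p(x)/q(x)).
# The two categorical distributions below are arbitrary examples.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

kl_pq = np.sum(p * np.log(p / q))
kl_qp = np.sum(q * np.log(q / p))

print(kl_pq, kl_qp)   # the two values differ: D_KL is not symmetric
```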
2.4. Trust Region Policy Optimization

We now modify the vanilla policy gradient algorithm using importance sampling and the KL divergence. Applying importance sampling (10) to the gradient expression (7), we can sample from an old policy π_{θ_old}(·), which generates a distribution P_{θ_old}(τ) over trajectories. In addition, the total reward R(τ) is replaced by the advantage function A^π(s, a), which can be learned using a function approximator (a neural network). Note that the use of a neural network adds bias to a previously unbiased estimate of the policy gradient, but the variance is reduced drastically. These considerations yield:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim P_{\theta_{old}}(\tau)}\left[\sum_{t=1}^{T} \frac{P_\theta(\tau)}{P_{\theta_{old}}(\tau)}\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_{\theta_{old}}}(s_t, a_t)\right]$$

Using (1) to expand the distributions:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim P_{\theta_{old}}(\tau)}\left[\sum_{t=1}^{T} \left(\prod_{t'=1}^{t} \frac{\pi_\theta(a_{t'} \mid s_{t'})}{\pi_{\theta_{old}}(a_{t'} \mid s_{t'})}\right) \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_{\theta_{old}}}(s_t, a_t)\right] \qquad (12)$$

Because we are sampling from the old policy, the ratio between the policies will likely be less than one, making the product in (12) decay exponentially with T and adding a considerable amount of variance. In order to use importance sampling in a more efficient manner, we change the objective function by sampling actions from the current state marginal using the old policy. Sampling in this way, the gradient is reformulated as:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim P_\theta(s)}\, \mathbb{E}_{a \sim \pi_{\theta_{old}}(a|s)}\left[\sum_{t=1}^{T} \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_{\theta_{old}}}(s_t, a_t)\right]$$

The only problem now is that we are sampling the states from the current state marginal distribution. This can be changed to the old policy with a bounded effect (see [3] for a detailed proof); intuitively, this is because if the policies are close enough (in a total variation sense), the expected states are not very different for small t. Let δ(·||·) denote the total variation distance. If δ(π_θ || π_{θ_old}) ≤ ε, then it holds that δ(P_θ || P_{θ_old}) ≤ 2εt. Therefore, the expression for the gradient is changed to:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim P_{\theta_{old}}(s)}\, \mathbb{E}_{a \sim \pi_{\theta_{old}}(a|s)}\left[\sum_{t=1}^{T} \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, A^{\pi_{\theta_{old}}}(s_t, a_t)\right] \qquad (13)$$

Of course, (13) assumes the old and new policies are close, so the optimization problem is reformulated as

$$\theta^{*} = \arg\max_{\theta'} \; (\theta' - \theta)^{T} \nabla_\theta J(\theta) \quad \text{s.t.} \;\; \mathbb{E}_{s \sim P_{\theta'}(s)}\left[D_{KL}\big(\pi_{\theta'}(\cdot|s) \,\|\, \pi_\theta(\cdot|s)\big)\right] \leq \epsilon \qquad (14)$$

A second-order approximation of the KL term using the Fisher information matrix can be calculated:

$$D_{KL}(\pi_{\theta'} \,\|\, \pi_\theta) \approx \frac{1}{2} (\theta' - \theta)^{T} F\, (\theta' - \theta)$$

where F = E[∇_θ log π_θ(a|s) ∇_θ log π_θ(a|s)^T]. This allows the gradient ascent step to be computed as:

$$\theta \leftarrow \theta + F^{-1} \nabla_\theta J(\theta)$$
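As an illustrative sketch of this natural-gradient step (not the full TRPO machinery, which uses conjugate gradient and a line search rather than an explicit inverse), F can be estimated for a small parameter vector as the average outer product of score vectors:

```python
import numpy as np

# Illustrative natural-gradient step theta <- theta + F^{-1} grad J(theta).
# `score_vectors` is an (N, dim(theta)) array of grad_theta log pi_theta(a|s)
# samples and `grad_J` an estimate of grad J(theta). Real TRPO never forms
# F explicitly; this dense version is only a sketch for low-dimensional theta.
def natural_gradient_step(theta, score_vectors, grad_J, damping=1e-3):
    scores = np.asarray(score_vectors)
    F = scores.T @ scores / len(scores)        # empirical Fisher matrix
    F += damping * np.eye(len(theta))          # damping keeps F invertible
    return theta + np.linalg.solve(F, grad_J)  # theta + F^{-1} grad J
```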
3. Proximal Policy Optimization

PPO is a variation of TRPO that avoids the use of complex second-order methods: the policy update is done by means of first-order methods along with some tricks that simplify the calculation. Because of this, PPO achieves policy updates in small steps, preventing performance crashes from occurring. There are two primary variants of PPO: PPO-Penalty and PPO-Clip [5].

PPO-Penalty approximately solves a KL-constrained update like TRPO, but penalizes the KL divergence in the objective function instead of making it a hard constraint, and automatically adjusts the penalty coefficient over the course of training so that it is scaled appropriately.

PPO-Clip doesn't have a KL-divergence term in the objective and doesn't have a constraint at all. Instead, it relies on specialized clipping in the objective function to remove incentives for the new policy to get far from the old policy.

PPO-Clip updates policies via

$$\theta_{k+1} = \arg\max_{\theta} \; \mathbb{E}_{s, a \sim \pi_{\theta_k}}\left[L(s, a, \theta_k, \theta)\right]$$

with

$$L(s, a, \theta_k, \theta) = \min\left(\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)}\, A^{\pi_{\theta_k}}(s, a),\; g(\epsilon, A)\right)$$

where

$$g(\epsilon, A) = \begin{cases} (1 + \epsilon) A & A \geq 0 \\ (1 - \epsilon) A & A < 0 \end{cases}$$

To explain why this formulation makes sense, we consider two cases depending on the sign of the advantage.

Advantage is positive: suppose the advantage for that state-action pair is positive, in which case its contribution to the objective reduces to

$$L(s, a, \theta_k, \theta) = \min\left(\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)},\; 1 + \epsilon\right) A^{\pi_{\theta_k}}(s, a)$$

Because the advantage is positive, the objective will increase if the action becomes more likely, that is, if π_θ(a|s) increases. But the min in this term puts a limit on how much the objective can increase. Once π_θ(a|s) > (1 + ε) π_{θ_k}(a|s), the min creates a ceiling of (1 + ε) A^{π_{θ_k}}(s, a). Thus, the new policy does not benefit by going far away from the old policy.

Advantage is negative: suppose the advantage for that state-action pair is negative, in which case its contribution to the objective reduces to

$$L(s, a, \theta_k, \theta) = \max\left(\frac{\pi_\theta(a|s)}{\pi_{\theta_k}(a|s)},\; 1 - \epsilon\right) A^{\pi_{\theta_k}}(s, a)$$

Because the advantage is negative, the objective will increase if the action becomes less likely, that is, if π_θ(a|s) decreases. But the max in this term puts a limit on how much the objective can increase. Once π_θ(a|s) < (1 − ε) π_{θ_k}(a|s), the max creates a ceiling of (1 − ε) A^{π_{θ_k}}(s, a). Thus again, the new policy does not benefit by going far away from the old policy.

What we have seen so far is that clipping serves as a regularizer by removing incentives for the policy to change dramatically, and the hyperparameter ε corresponds to how far the new policy can go from the old one while still profiting the objective.
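The clipped objective above is commonly implemented with a clamp on the probability ratio, which is equivalent to the min(·, g(ε, A)) form. The sketch below (illustrative, not the authors' code) computes the per-batch loss to be minimized, assuming the log-probabilities under the new and old policies and the advantage estimates are given.

```python
import torch

# Sketch of the PPO-Clip objective L(s, a, theta_k, theta) for a batch of samples.
# `log_probs_new` are log pi_theta(a|s) under the current parameters, `log_probs_old`
# the (detached) log-probabilities under pi_{theta_k}, and `advantages` the estimates
# of A^{pi_{theta_k}}(s, a). `epsilon` is the clipping hyperparameter.
def ppo_clip_loss(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    ratio = torch.exp(log_probs_new - log_probs_old)           # pi_theta / pi_{theta_k}
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
    objective = torch.min(ratio * advantages, clipped * advantages)
    return -objective.mean()   # negated: maximizing L equals minimizing this loss
```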
Figure 2: Policy gradient algorithms comparisons.

Several studies were performed on different benchmarks [5]; the end results are shown in Figure 2. It can be seen that PPO (the clipped version) outperforms the previous methods on almost all the continuous control environments and can converge more rapidly in some cases.
3.1. Application to Inverted Pendulum

An inverted pendulum is a classical control problem having two degrees of freedom of motion and only one actuator to control its position. The goal is to remain at zero angle (vertically upward) with the least rotational velocity and the least effort. Some of the states of the pendulum are shown in Figure 3.

Figure 3: States of an Inverted Pendulum [2].
The pendulum problem is then solved using an Actor-Critic model, where an actor network is used to approximate the policy function π_θ(·|s), whereas a critic network is used to approximate the value function V_φ(s). The algorithm is presented below in Figure 5.

This problem has a 3-dimensional state space s = [cos α, sin α, α̇], i.e., the cosine and sine of the angle as well as the derivative of the angle. The action space is one-dimensional: the (bounded) torque applied to the joint. Then, according to [2], the reward function is defined as:

$$r_t = r(\alpha, \dot{\alpha}, a_t) = -\alpha^2 - 0.1\,\dot{\alpha}^2 - 0.001\, a_t^2$$

where α ∈ [−π, π] and a_t is the control torque.
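A possible PyTorch definition of such an actor and critic for this three-dimensional state and one-dimensional torque is sketched below; the layer sizes are arbitrary choices, since the paper does not specify its exact architecture in the text.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Gaussian policy pi_theta(a|s) over the 1-D (bounded) torque."""
    def __init__(self, state_dim=3, action_dim=1, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                  nn.Linear(hidden, hidden), nn.Tanh())
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        mu = self.mean(self.body(state))
        return torch.distributions.Normal(mu, self.log_std.exp())

class Critic(nn.Module):
    """Value network V_phi(s)."""
    def __init__(self, state_dim=3, hidden=64):
        super().__init__()
        self.v = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                               nn.Linear(hidden, hidden), nn.Tanh(),
                               nn.Linear(hidden, 1))

    def forward(self, state):
        return self.v(state).squeeze(-1)
```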
However, V^π(s) cannot be computed exactly, so it has to be approximated. This is done with a neural network V_φ(s), which is updated concurrently with the policy (so that the value network always approximates the value function of the most recent policy). The simplest method for learning V_φ(s) is to minimize a mean-squared-error objective

$$\phi_k = \arg\min_{\phi} \; \mathbb{E}_{s, \hat{R}_t \sim \pi_k}\left[\left(V_\phi(s) - \hat{R}_t\right)^2\right]$$

where π_k is the current policy.
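A minimal sketch of this regression step (assuming a batch of states and return targets R̂_t has already been assembled; this is not the authors' code):

```python
import torch

# One step of the value-network regression phi_k = argmin E[(V_phi(s) - R_hat_t)^2],
# given a batch of states and their return targets R_hat_t.
def critic_update(critic, optimizer, states, return_targets):
    values = critic(states)                          # V_phi(s)
    loss = ((values - return_targets) ** 2).mean()   # mean-squared error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```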
On the other hand, the advantage function can be estimated from the value function estimates by using the generalized advantage estimator [4], given by

$$\hat{A}_t = \sum_{\ell=0}^{\infty} (\gamma \lambda)^{\ell}\, \delta_{t+\ell}$$

where the parameter 0 ≤ λ ≤ 1 controls the trade-off between bias and variance, 0 ≤ γ ≤ 1 is a discount factor, and δ_t is the temporal-difference (TD) error given by

$$\delta_t = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$
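For a finite trajectory the infinite sum truncates naturally, and the estimator can be computed with a backward recursion. A small sketch (illustrative, with arbitrary default values for γ and λ):

```python
import numpy as np

# Generalized advantage estimation for one trajectory:
# delta_t = r_t + gamma * V(s_{t+1}) - V(s_t),  A_hat_t = sum_l (gamma*lambda)^l delta_{t+l}.
# `values` must contain V_phi(s_1), ..., V_phi(s_{T+1}) (one more entry than `rewards`).
def gae(rewards, values, gamma=0.99, lam=0.95):
    T = len(rewards)
    deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):   # backward recursion: A_t = delta_t + gamma*lambda*A_{t+1}
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages
```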
Figure 5: Pseudocode of the PPO-Clip algorithm.

3.2. Experiments and Results

The algorithm shown in Figure 5 was implemented in Python. The design of the neural networks was done using the PyTorch library, and the OpenAI Gym library is used for the simulation of the environment.
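A schematic outline of how such a training loop can be put together is sketched below. It reuses the illustrative Actor, Critic, gae, ppo_clip_loss, and critic_update helpers defined earlier in this text; these are stand-ins for this article, not the authors' actual implementation, and the hyperparameters are arbitrary.

```python
import gym
import numpy as np
import torch

# Collect one rollout with the current actor and return tensors plus old log-probs.
def collect_rollout(env, actor, horizon=200):
    states, actions, rewards, log_probs = [env.reset()], [], [], []
    for _ in range(horizon):
        s = torch.as_tensor(states[-1], dtype=torch.float32)
        dist = actor(s)
        a = dist.sample()
        log_probs.append(dist.log_prob(a).sum())
        next_s, r, done, _ = env.step(a.numpy())
        states.append(next_s); actions.append(a); rewards.append(r)
        if done:
            break
    return (torch.as_tensor(np.array(states), dtype=torch.float32),
            torch.stack(actions), rewards, torch.stack(log_probs).detach())

env = gym.make("Pendulum-v1")
actor, critic = Actor(), Critic()
actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

for episode in range(200):
    states, actions, rewards, old_log_probs = collect_rollout(env, actor)
    with torch.no_grad():
        values = critic(states)                     # V_phi(s_1), ..., V_phi(s_{T+1})
    adv = torch.as_tensor(gae(rewards, values.numpy()), dtype=torch.float32)
    targets = adv + values[:-1]                     # regression targets R_hat_t

    for _ in range(10):                             # several epochs on the same rollout
        new_log_probs = actor(states[:-1]).log_prob(actions).sum(-1)
        actor_opt.zero_grad()
        ppo_clip_loss(new_log_probs, old_log_probs, adv).backward()
        actor_opt.step()
        critic_update(critic, critic_opt, states[:-1], targets)
```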
Training results are presented in Figure 4a, which shows the average reward for each episode of training. The task is considered solved when the average reward is above −200; thus, the PPO algorithm solves the task around episode 60.

The actor and critic losses are shown in Figures 4b and 4c. The critic loss decreases monotonically, indicating that the network is able to minimize the TD error of the value network, and the actor network parameters converge in a few episodes.

Figure 4: Control results using the PPO-Clip algorithm.
4. Conclusions

A brief introduction to Deep Reinforcement Learning has been presented in order to convey an understanding of a fairly new algorithm used in robotics applications. Proximal Policy Optimization is a relatively recent method that uses multiple epochs of stochastic gradient ascent to update its policy. It is a simplified version of TRPO with the benefits of stability and reliability, is applicable in more general settings, and has better overall performance.

References

[1] Mahammad Humayoo and Xueqi Cheng. Relative importance sampling for off-policy actor-critic in deep reinforcement learning. arXiv preprint arXiv:2105.07998, 2019.

[2] Swagat Kumar. Controlling an inverted pendulum with policy gradient methods - a tutorial. arXiv preprint arXiv:2105.07998, 2021.

[3] John Schulman, Sergey Levine, Philipp Moritz, Michael Jordan, and Pieter Abbeel. Trust region policy optimization. University of California, Berkeley, Department of Electrical Engineering and Computer Sciences, 2015.

[4] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

[5] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[6] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.