Reinforcement Learning Cheat Sheet

Agent-Environment Interface

At each step t, the agent receives a representation of the environment's state, S_t ∈ S, and selects an action A_t ∈ A(s). Then, as a consequence of its action, the agent receives a reward R_{t+1} ∈ R ⊂ ℝ.

Policy

A policy is a mapping from a state to an action:

    π_t(a|s)    (1)

That is, the probability of selecting an action A_t = a given S_t = s.

Reward

The total reward (return) is expressed as:

    G_t = Σ_{k=0}^{H} γ^k R_{t+k+1}    (2)

Where γ is the discount factor and H is the horizon, which can be infinite.
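
As a quick illustration (not part of the original cheat sheet), equation (2) can be computed for a finite list of rewards with a backward pass; the reward values and γ below are made up:

    def discounted_return(rewards, gamma=0.99):
        """G_t = sum_k gamma^k * R_{t+k+1} for a finite horizon."""
        g = 0.0
        for r in reversed(rewards):   # fold in the discounted future return at each step
            g = r + gamma * g
        return g

    print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62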

Markov Decision Process

A Markov Decision Process, MDP, is a 5-tuple (S, A, P, R, γ) where:

    finite set of states: s ∈ S
    finite set of actions: a ∈ A
    state transition probabilities: p(s'|s, a) = Pr{S_{t+1} = s' | S_t = s, A_t = a}
    expected reward for state-action-next-state: r(s', s, a) = E[R_{t+1} | S_{t+1} = s', S_t = s, A_t = a]    (3)
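
Purely for illustration (not from the cheat sheet), a small tabular MDP can be stored as a dictionary mapping (state, action) to outcome triples; the two-state example and its rewards are invented, and later sketches below reuse these names:

    # P[(s, a)] = list of (probability, next_state, reward) triples, i.e. p(s', r | s, a)
    P = {
        (0, "stay"): [(1.0, 0, 0.0)],
        (0, "go"):   [(0.8, 1, 1.0), (0.2, 0, 0.0)],
        (1, "stay"): [(1.0, 1, 0.0)],
        (1, "go"):   [(1.0, 0, 0.5)],
    }
    states, actions, gamma = [0, 1], ["stay", "go"], 0.9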

Value Function

The value function describes how good it is to be in a specific state s under a certain policy π. For an MDP:

    V_π(s) = E[G_t | S_t = s]    (4)

Informally, it is the expected return (expected cumulative discounted reward) when starting from s and following π.

Optimal

    V_*(s) = max_π V_π(s)    (5)

Action-Value (Q) Function

We can also denote the expected return for state-action pairs:

    q_π(s, a) = E_π[G_t | S_t = s, A_t = a]    (6)

Optimal

Similarly, we can do the same for the Q function. The optimal action-value function:

    q_*(s, a) = max_π q_π(s, a)    (7)

Clearly, using this new notation we can redefine V_*, equation (5), in terms of q_*(s, a), equation (7):

    V_*(s) = max_{a ∈ A(s)} q_{π_*}(s, a)    (8)

Intuitively, this equation expresses the fact that the value of a state under the optimal policy must equal the expected return for the best action from that state.
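
As a tiny aside (mine, not the authors'), equation (8) says that acting greedily with respect to q_* is optimal; with a tabular Q stored as a dict this is just an argmax:

    def greedy_action(Q, s, actions):
        """argmax_a Q[(s, a)], realising V_*(s) = max_a q_*(s, a)."""
        return max(actions, key=lambda a: Q[(s, a)])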

Bellman Equation

An important recursive property emerges for both the Value (4) and Q (6) functions if we expand them:

    V_π(s) = E_π[G_t | S_t = s]
           = E_π[Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s]
           = E_π[R_{t+1} + γ Σ_{k=0}^{∞} γ^k R_{t+k+2} | S_t = s]
           = Σ_a π(a|s) Σ_{s'} Σ_r p(s', r | s, a) [r + γ E_π[Σ_{k=0}^{∞} γ^k R_{t+k+2} | S_{t+1} = s']]
           = Σ_a π(a|s) Σ_{s', r} p(s', r | s, a) [r + γ V_π(s')]    (9)

Here Σ_{s'} Σ_r p(s', r | s, a) sums over all possible next states and rewards, and the bracketed term is the expected return from s_{t+1}.

Expanding the action-value function (6) in the same way:

    q_π(s, a) = E_π[G_t | S_t = s, A_t = a]
              = E_π[Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s, A_t = a]
              = E_π[R_{t+1} + γ Σ_{k=0}^{∞} γ^k R_{t+k+2} | S_t = s, A_t = a]
              = Σ_{s', r} p(s', r | s, a) [r + γ E_π[Σ_{k=0}^{∞} γ^k R_{t+k+2} | S_{t+1} = s']]
              = Σ_{s', r} p(s', r | s, a) [r + γ V_π(s')]    (10)
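
To make equation (9) concrete, here is a minimal sketch (not from the cheat sheet) of a single Bellman expectation backup for one state, reusing the illustrative P dictionary above and a tabular policy pi[s][a] holding π(a|s):

    def bellman_backup(s, V, pi, P, actions, gamma=0.9):
        """V(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [r + γ V(s')]  (Eq. 9)."""
        return sum(
            pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
            for a in actions
        )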

Dynamic Programming

Taking advantage of the subproblem structure of the V and Q functions, we can find the optimal policy by just planning.

Policy Iteration

We can now find the optimal policy.

    1. Initialisation
        V(s) ∈ ℝ (e.g. V(s) = 0) and π(s) ∈ A for all s ∈ S, ∆ ← 0
    2. Policy Evaluation
        while ∆ ≥ θ (a small positive number) do
            foreach s ∈ S do
                v ← V(s)
                V(s) ← Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [r + γ V(s')]
                ∆ ← max(∆, |v − V(s)|)
            end
        end
    3. Policy Improvement
        policy-stable ← true
        foreach s ∈ S do
            old-action ← π(s)
            π(s) ← argmax_a Σ_{s',r} p(s', r | s, a) [r + γ V(s')]
            policy-stable ← (old-action = π(s))
        end
        if policy-stable return V ≈ V_* and π ≈ π_*, else go to 2

Algorithm 1: Policy Iteration
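
A compact Python sketch of Algorithm 1, under the illustrative assumptions above (the P dictionary and a deterministic policy stored as a dict); it is a sketch, not the cheat sheet's reference code:

    def policy_iteration(states, actions, P, gamma=0.9, theta=1e-6):
        V = {s: 0.0 for s in states}
        pi = {s: actions[0] for s in states}          # arbitrary deterministic policy

        def q(s, a):                                  # Σ_{s',r} p(s',r|s,a) [r + γ V(s')]
            return sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])

        while True:
            # 2. Policy Evaluation
            while True:
                delta = 0.0
                for s in states:
                    v = V[s]
                    V[s] = q(s, pi[s])                # deterministic π: one action per state
                    delta = max(delta, abs(v - V[s]))
                if delta < theta:
                    break
            # 3. Policy Improvement
            policy_stable = True
            for s in states:
                old = pi[s]
                pi[s] = max(actions, key=lambda a: q(s, a))
                policy_stable = policy_stable and (old == pi[s])
            if policy_stable:
                return V, pi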

Value Iteration

We can avoid waiting until V(s) has converged and instead do the policy improvement and a truncated policy evaluation step in one operation:

    Initialise V(s) ∈ ℝ (e.g. V(s) = 0)
    ∆ ← 0
    while ∆ ≥ θ (a small positive number) do
        foreach s ∈ S do
            v ← V(s)
            V(s) ← max_a Σ_{s',r} p(s', r | s, a) [r + γ V(s')]
            ∆ ← max(∆, |v − V(s)|)
        end
    end
    output: deterministic policy π ≈ π_* such that
        π(s) = argmax_a Σ_{s',r} p(s', r | s, a) [r + γ V(s')]

Algorithm 2: Value Iteration
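
Again as a minimal sketch with the same invented P/states/actions, not the authors' implementation:

    def value_iteration(states, actions, P, gamma=0.9, theta=1e-6):
        V = {s: 0.0 for s in states}

        def q(s, a):
            return sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])

        while True:
            delta = 0.0
            for s in states:
                v = V[s]
                V[s] = max(q(s, a) for a in actions)  # improvement + truncated evaluation in one step
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break
        pi = {s: max(actions, key=lambda a: q(s, a)) for s in states}
        return V, pi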

Monte Carlo Methods

Monte Carlo (MC) is a model-free method: it does not require complete knowledge of the environment. It is based on averaging sample returns for each state-action pair. The following algorithm gives the basic implementation:

    Initialise, for all s ∈ S, a ∈ A(s):
        Q(s, a) ← arbitrary
        π(s) ← arbitrary
        Returns(s, a) ← empty list
    while forever do
        Choose S_0 ∈ S and A_0 ∈ A(S_0) such that all pairs have probability > 0
        Generate an episode starting at S_0, A_0 following π
        foreach pair s, a appearing in the episode do
            G ← return following the first occurrence of s, a
            Append G to Returns(s, a)
            Q(s, a) ← average(Returns(s, a))
        end
        foreach s in the episode do
            π(s) ← argmax_a Q(s, a)
        end
    end

Algorithm 3: Monte Carlo first-visit

For non-stationary problems, the Monte Carlo estimate for, e.g., V is:

    V(S_t) ← V(S_t) + α [G_t − V(S_t)]    (11)

Where α is the learning rate: how much we want to forget about past experiences.
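
A sketch of first-visit MC control with exploring starts (Algorithm 3); sample_episode is an assumed helper that rolls out one episode as (state, action, reward) triples, and the other names follow the earlier examples:

    import random
    from collections import defaultdict

    def mc_first_visit_control(sample_episode, states, actions, n_iters=10000, gamma=0.9):
        Q = defaultdict(float)
        returns = defaultdict(list)
        pi = {s: random.choice(actions) for s in states}
        for _ in range(n_iters):
            s0, a0 = random.choice(states), random.choice(actions)   # exploring starts
            episode = sample_episode(s0, a0, pi)                     # [(s, a, r), ...]
            G, first_returns = 0.0, {}
            for s, a, r in reversed(episode):
                G = r + gamma * G
                first_returns[(s, a)] = G        # earlier visits overwrite later ones
            for (s, a), g in first_returns.items():
                returns[(s, a)].append(g)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
            for s, _, _ in episode:
                pi[s] = max(actions, key=lambda a: Q[(s, a)])
        return Q, pi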

Temporal Difference - Q Learning

Temporal Difference (TD) methods learn directly from raw experience without a model of the environment's dynamics. TD substitutes the expected discounted reward G_t from the episode with an estimate (TD prediction):

    V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]    (12)

The following algorithm gives a generic implementation of Q-Learning, an off-policy TD control method:

    Initialise Q(s, a) arbitrarily and Q(terminal-state, ·) = 0
    foreach episode ∈ episodes do
        while s is not terminal do
            Choose a from s using a policy derived from Q (e.g., ε-greedy)
            Take action a, observe r, s'
            Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]
            s ← s'
        end
    end

Algorithm 5: Q Learning
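
A tabular Q-Learning sketch. The env object with reset() → s and step(a) → (s', r, done) is an assumption (in the spirit of the Gym API), not something defined in the cheat sheet:

    import random
    from collections import defaultdict

    def q_learning(env, actions, n_episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
        Q = defaultdict(float)
        for _ in range(n_episodes):
            s, done = env.reset(), False
            while not done:
                # ε-greedy action selection
                if random.random() < eps:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda act: Q[(s, act)])
                s_next, r, done = env.step(a)
                best_next = max(Q[(s_next, act)] for act in actions)
                target = r + gamma * best_next * (not done)   # no bootstrap at terminal states
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s = s_next
        return Q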

Sarsa

Sarsa (state-action-reward-state-action) is an on-policy TD control method. The update rule:

    Q(s_t, a_t) ← Q(s_t, a_t) + α [r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]

    Initialise Q(s, a) arbitrarily and Q(terminal-state, ·) = 0
    foreach episode ∈ episodes do
        Choose a from s using a policy derived from Q (e.g., ε-greedy)
        while s is not terminal do
            Take action a, observe r, s'
            Choose a' from s' using a policy derived from Q (e.g., ε-greedy)
            Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') − Q(s, a)]
            s ← s'
            a ← a'
        end
    end

Algorithm 4: Sarsa

n-step Sarsa

Define the n-step Q-return:

    q_t^(n) = R_{t+1} + γ R_{t+2} + … + γ^{n−1} R_{t+n} + γ^n Q(S_{t+n})

n-step Sarsa updates Q(s, a) towards the n-step Q-return:

    Q(s_t, a_t) ← Q(s_t, a_t) + α [q_t^(n) − Q(s_t, a_t)]

Forward View Sarsa(λ)

The λ-return combines all n-step Q-returns:

    q_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} q_t^(n)

Forward-view Sarsa(λ):

    Q(s_t, a_t) ← Q(s_t, a_t) + α [q_t^λ − Q(s_t, a_t)]
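
A matching tabular Sarsa sketch, with the same assumed env and ε-greedy conventions as the Q-Learning example; note the target uses the action a' actually taken next (on-policy):

    import random
    from collections import defaultdict

    def sarsa(env, actions, n_episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
        Q = defaultdict(float)

        def eps_greedy(s):
            if random.random() < eps:
                return random.choice(actions)
            return max(actions, key=lambda act: Q[(s, act)])

        for _ in range(n_episodes):
            s, done = env.reset(), False
            a = eps_greedy(s)
            while not done:
                s_next, r, done = env.step(a)
                a_next = eps_greedy(s_next)
                target = r + gamma * Q[(s_next, a_next)] * (not done)
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s, a = s_next, a_next
        return Q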

Deep Q Learning

Created by DeepMind, Deep Q Learning (DQL) substitutes the Q function with a deep neural network called the Q-network. It also keeps track of past observations in a replay memory, in order to use them to train the network.

    L_i(θ_i) = E_{(s,a,r,s') ~ U(D)} [ ( r + γ max_{a'} Q(s', a'; θ_{i−1}) − Q(s, a; θ_i) )^2 ]    (13)

where r + γ max_{a'} Q(s', a'; θ_{i−1}) is the target, θ are the weights of the network, and U(D) is the experience replay history.

    Initialise replay memory D with capacity N
    Initialise Q(s, a) arbitrarily
    foreach episode ∈ episodes do
        while s is not terminal do
            With probability ε select a random action a ∈ A(s),
            otherwise select a = argmax_a Q(s, a; θ)
            Take action a, observe r, s'
            Store transition (s, a, r, s') in D
            Sample a random minibatch of transitions (s_j, a_j, r_j, s'_j) from D
            Set y_j ← r_j                                  for terminal s'_j
                y_j ← r_j + γ max_{a'} Q(s'_j, a'; θ)      for non-terminal s'_j
            Perform a gradient descent step on (y_j − Q(s_j, a_j; θ))^2
            s ← s'
        end
    end

Algorithm 6: Deep Q Learning
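
A sketch of the loss in equation (13) for one replay minibatch, written with PyTorch as an assumption (the cheat sheet does not prescribe a framework); q_net and target_net are any modules mapping states to per-action values:

    import torch
    import torch.nn.functional as F

    def dqn_loss(q_net, target_net, batch, gamma=0.99):
        # batch: states (B, obs_dim), actions (B,), rewards (B,),
        #        next_states (B, obs_dim), dones (B,) with 1.0 at terminal states
        s, a, r, s_next, done = batch
        q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)   # Q(s, a; θ_i)
        with torch.no_grad():
            q_next = target_net(s_next).max(dim=1).values             # max_a' Q(s', a'; θ_{i-1})
            target = r + gamma * (1.0 - done) * q_next
        return F.mse_loss(q_sa, target)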

Double Deep Q Learning

Copyright © 2018 Francesco Saverio Zuppichini
https://github.com/FrancescoSaverioZuppichini/Reinforcement-Learning-Cheat-Sheet