Reinforcement Learning Cheat Sheet

Agent-Environment Interface

At each step t, the agent receives a representation of the environment's state, S_t ∈ S, and selects an action A_t ∈ A(s). Then, as a consequence of its action, the agent receives a reward R_{t+1} ∈ R ⊂ ℝ.

Policy

A policy is a mapping from a state to an action:

    π_t(a|s)    (1)

That is, the probability of selecting an action A_t = a given S_t = s.

Reward

The total reward (return) is expressed as:

    G_t = Σ_{k=0}^{H} γ^k R_{t+k+1}    (2)

Where γ is the discount factor and H is the horizon, which can be infinite.
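
As a quick illustration (not part of the original cheat sheet), equation (2) can be computed for a finite list of rewards with a backward pass; the reward values and γ below are made up:

    def discounted_return(rewards, gamma=0.99):
        """G_t = sum_k gamma^k * R_{t+k+1} for a finite horizon."""
        g = 0.0
        for r in reversed(rewards):   # fold in the discounted future return at each step
            g = r + gamma * g
        return g

    print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62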

Markov Decision Process

A Markov Decision Process, MDP, is a 5-tuple (S, A, P, R, γ) where:

    finite set of states: s ∈ S
    finite set of actions: a ∈ A
    state transition probabilities: p(s'|s, a) = Pr{S_{t+1} = s' | S_t = s, A_t = a}
    expected reward for state-action-next-state: r(s', s, a) = E[R_{t+1} | S_{t+1} = s', S_t = s, A_t = a]    (3)
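
Purely for illustration (not from the cheat sheet), a small tabular MDP can be stored as a dictionary mapping (state, action) to outcome triples; the two-state example and its rewards are invented, and later sketches below reuse these names:

    # P[(s, a)] = list of (probability, next_state, reward) triples, i.e. p(s', r | s, a)
    P = {
        (0, "stay"): [(1.0, 0, 0.0)],
        (0, "go"):   [(0.8, 1, 1.0), (0.2, 0, 0.0)],
        (1, "stay"): [(1.0, 1, 0.0)],
        (1, "go"):   [(1.0, 0, 0.5)],
    }
    states, actions, gamma = [0, 1], ["stay", "go"], 0.9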

Value Function

The value function describes how good it is to be in a specific state s under a certain policy π. For an MDP:

    V_π(s) = E[G_t | S_t = s]    (4)

Informally, it is the expected return (expected cumulative discounted reward) when starting from s and following π.

Optimal

    V_*(s) = max_π V_π(s)    (5)

Action-Value (Q) Function

We can also denote the expected return for state-action pairs:

    q_π(s, a) = E_π[G_t | S_t = s, A_t = a]    (6)

Optimal

Similarly, we can do the same for the Q function. The optimal action-value function:

    q_*(s, a) = max_π q_π(s, a)    (7)

Clearly, using this new notation we can redefine V_*, equation (5), in terms of q_*(s, a), equation (7):

    V_*(s) = max_{a ∈ A(s)} q_{π_*}(s, a)    (8)

Intuitively, this equation expresses the fact that the value of a state under the optimal policy must equal the expected return for the best action from that state.
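
As a tiny aside (mine, not the authors'), equation (8) says that acting greedily with respect to q_* is optimal; with a tabular Q stored as a dict this is just an argmax:

    def greedy_action(Q, s, actions):
        """argmax_a Q[(s, a)], realising V_*(s) = max_a q_*(s, a)."""
        return max(actions, key=lambda a: Q[(s, a)])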

Bellman Equation

An important recursive property emerges for both the Value (4) and Q (6) functions if we expand them:

    V_π(s) = E_π[G_t | S_t = s]
           = E_π[Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s]
           = E_π[R_{t+1} + γ Σ_{k=0}^{∞} γ^k R_{t+k+2} | S_t = s]
           = Σ_a π(a|s) Σ_{s'} Σ_r p(s', r | s, a) [r + γ E_π[Σ_{k=0}^{∞} γ^k R_{t+k+2} | S_{t+1} = s']]
           = Σ_a π(a|s) Σ_{s', r} p(s', r | s, a) [r + γ V_π(s')]    (9)

Here Σ_{s'} Σ_r p(s', r | s, a) sums over all possible next states and rewards, and the bracketed term is the expected return from s_{t+1}.

Expanding the action-value function (6) in the same way:

    q_π(s, a) = E_π[G_t | S_t = s, A_t = a]
              = E_π[Σ_{k=0}^{∞} γ^k R_{t+k+1} | S_t = s, A_t = a]
              = E_π[R_{t+1} + γ Σ_{k=0}^{∞} γ^k R_{t+k+2} | S_t = s, A_t = a]
              = Σ_{s', r} p(s', r | s, a) [r + γ E_π[Σ_{k=0}^{∞} γ^k R_{t+k+2} | S_{t+1} = s']]
              = Σ_{s', r} p(s', r | s, a) [r + γ V_π(s')]    (10)
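
To make equation (9) concrete, here is a minimal sketch (not from the cheat sheet) of a single Bellman expectation backup for one state, reusing the illustrative P dictionary above and a tabular policy pi[s][a] holding π(a|s):

    def bellman_backup(s, V, pi, P, actions, gamma=0.9):
        """V(s) = Σ_a π(a|s) Σ_{s',r} p(s',r|s,a) [r + γ V(s')]  (Eq. 9)."""
        return sum(
            pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])
            for a in actions
        )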

Dynamic Programming

Taking advantage of the subproblem structure of the V and Q functions, we can find the optimal policy by just planning.

Policy Iteration

We can now find the optimal policy.

    1. Initialisation
        V(s) ∈ ℝ (e.g. V(s) = 0) and π(s) ∈ A for all s ∈ S, ∆ ← 0
    2. Policy Evaluation
        while ∆ ≥ θ (a small positive number) do
            foreach s ∈ S do
                v ← V(s)
                V(s) ← Σ_a π(a|s) Σ_{s',r} p(s', r | s, a) [r + γ V(s')]
                ∆ ← max(∆, |v − V(s)|)
            end
        end
    3. Policy Improvement
        policy-stable ← true
        foreach s ∈ S do
            old-action ← π(s)
            π(s) ← argmax_a Σ_{s',r} p(s', r | s, a) [r + γ V(s')]
            policy-stable ← (old-action = π(s))
        end
        if policy-stable return V ≈ V_* and π ≈ π_*, else go to 2

Algorithm 1: Policy Iteration
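
A compact Python sketch of Algorithm 1, under the illustrative assumptions above (the P dictionary and a deterministic policy stored as a dict); it is a sketch, not the cheat sheet's reference code:

    def policy_iteration(states, actions, P, gamma=0.9, theta=1e-6):
        V = {s: 0.0 for s in states}
        pi = {s: actions[0] for s in states}          # arbitrary deterministic policy

        def q(s, a):                                  # Σ_{s',r} p(s',r|s,a) [r + γ V(s')]
            return sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])

        while True:
            # 2. Policy Evaluation
            while True:
                delta = 0.0
                for s in states:
                    v = V[s]
                    V[s] = q(s, pi[s])                # deterministic π: one action per state
                    delta = max(delta, abs(v - V[s]))
                if delta < theta:
                    break
            # 3. Policy Improvement
            policy_stable = True
            for s in states:
                old = pi[s]
                pi[s] = max(actions, key=lambda a: q(s, a))
                policy_stable = policy_stable and (old == pi[s])
            if policy_stable:
                return V, pi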

Value Iteration

We can avoid waiting until V(s) has converged and instead do the policy improvement and a truncated policy evaluation step in one operation:

    Initialise V(s) ∈ ℝ (e.g. V(s) = 0)
    ∆ ← 0
    while ∆ ≥ θ (a small positive number) do
        foreach s ∈ S do
            v ← V(s)
            V(s) ← max_a Σ_{s',r} p(s', r | s, a) [r + γ V(s')]
            ∆ ← max(∆, |v − V(s)|)
        end
    end
    output: deterministic policy π ≈ π_* such that
        π(s) = argmax_a Σ_{s',r} p(s', r | s, a) [r + γ V(s')]

Algorithm 2: Value Iteration
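
Again as a minimal sketch with the same invented P/states/actions, not the authors' implementation:

    def value_iteration(states, actions, P, gamma=0.9, theta=1e-6):
        V = {s: 0.0 for s in states}

        def q(s, a):
            return sum(p * (r + gamma * V[s2]) for p, s2, r in P[(s, a)])

        while True:
            delta = 0.0
            for s in states:
                v = V[s]
                V[s] = max(q(s, a) for a in actions)  # improvement + truncated evaluation in one step
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break
        pi = {s: max(actions, key=lambda a: q(s, a)) for s in states}
        return V, pi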

Monte Carlo Methods

Monte Carlo (MC) is a model-free method: it does not require complete knowledge of the environment. It is based on averaging sample returns for each state-action pair. The following algorithm gives the basic implementation:

    Initialise, for all s ∈ S, a ∈ A(s):
        Q(s, a) ← arbitrary
        π(s) ← arbitrary
        Returns(s, a) ← empty list
    while forever do
        Choose S_0 ∈ S and A_0 ∈ A(S_0) such that all pairs have probability > 0
        Generate an episode starting at S_0, A_0 following π
        foreach pair s, a appearing in the episode do
            G ← return following the first occurrence of s, a
            Append G to Returns(s, a)
            Q(s, a) ← average(Returns(s, a))
        end
        foreach s in the episode do
            π(s) ← argmax_a Q(s, a)
        end
    end

Algorithm 3: Monte Carlo first-visit

For non-stationary problems, the Monte Carlo estimate for, e.g., V is:

    V(S_t) ← V(S_t) + α [G_t − V(S_t)]    (11)

Where α is the learning rate: how much we want to forget about past experiences.
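
A sketch of first-visit MC control with exploring starts (Algorithm 3); sample_episode is an assumed helper that rolls out one episode as (state, action, reward) triples, and the other names follow the earlier examples:

    import random
    from collections import defaultdict

    def mc_first_visit_control(sample_episode, states, actions, n_iters=10000, gamma=0.9):
        Q = defaultdict(float)
        returns = defaultdict(list)
        pi = {s: random.choice(actions) for s in states}
        for _ in range(n_iters):
            s0, a0 = random.choice(states), random.choice(actions)   # exploring starts
            episode = sample_episode(s0, a0, pi)                     # [(s, a, r), ...]
            G, first_returns = 0.0, {}
            for s, a, r in reversed(episode):
                G = r + gamma * G
                first_returns[(s, a)] = G        # earlier visits overwrite later ones
            for (s, a), g in first_returns.items():
                returns[(s, a)].append(g)
                Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])
            for s, _, _ in episode:
                pi[s] = max(actions, key=lambda a: Q[(s, a)])
        return Q, pi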

Temporal Difference - Q Learning

Temporal Difference (TD) methods learn directly from raw experience without a model of the environment's dynamics. TD substitutes the expected discounted reward G_t from the episode with an estimate (TD prediction):

    V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]    (12)

The following algorithm gives a generic implementation of Q-Learning, an off-policy TD control method:

    Initialise Q(s, a) arbitrarily and Q(terminal-state, ·) = 0
    foreach episode ∈ episodes do
        while s is not terminal do
            Choose a from s using a policy derived from Q (e.g., ε-greedy)
            Take action a, observe r, s'
            Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') − Q(s, a)]
            s ← s'
        end
    end

Algorithm 5: Q Learning
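
A tabular Q-Learning sketch. The env object with reset() → s and step(a) → (s', r, done) is an assumption (in the spirit of the Gym API), not something defined in the cheat sheet:

    import random
    from collections import defaultdict

    def q_learning(env, actions, n_episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
        Q = defaultdict(float)
        for _ in range(n_episodes):
            s, done = env.reset(), False
            while not done:
                # ε-greedy action selection
                if random.random() < eps:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda act: Q[(s, act)])
                s_next, r, done = env.step(a)
                best_next = max(Q[(s_next, act)] for act in actions)
                target = r + gamma * best_next * (not done)   # no bootstrap at terminal states
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s = s_next
        return Q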

Sarsa

Sarsa (state-action-reward-state-action) is an on-policy TD control method. The update rule:

    Q(s_t, a_t) ← Q(s_t, a_t) + α [r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)]

    Initialise Q(s, a) arbitrarily and Q(terminal-state, ·) = 0
    foreach episode ∈ episodes do
        Choose a from s using a policy derived from Q (e.g., ε-greedy)
        while s is not terminal do
            Take action a, observe r, s'
            Choose a' from s' using a policy derived from Q (e.g., ε-greedy)
            Q(s, a) ← Q(s, a) + α [r + γ Q(s', a') − Q(s, a)]
            s ← s'
            a ← a'
        end
    end

Algorithm 4: Sarsa

n-step Sarsa

Define the n-step Q-return:

    q_t^(n) = R_{t+1} + γ R_{t+2} + … + γ^{n−1} R_{t+n} + γ^n Q(S_{t+n})

n-step Sarsa updates Q(s, a) towards the n-step Q-return:

    Q(s_t, a_t) ← Q(s_t, a_t) + α [q_t^(n) − Q(s_t, a_t)]

Forward View Sarsa(λ)

The λ-return combines all n-step Q-returns:

    q_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} q_t^(n)

Forward-view Sarsa(λ):

    Q(s_t, a_t) ← Q(s_t, a_t) + α [q_t^λ − Q(s_t, a_t)]
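
A matching tabular Sarsa sketch, with the same assumed env and ε-greedy conventions as the Q-Learning example; note the target uses the action a' actually taken next (on-policy):

    import random
    from collections import defaultdict

    def sarsa(env, actions, n_episodes=500, alpha=0.1, gamma=0.99, eps=0.1):
        Q = defaultdict(float)

        def eps_greedy(s):
            if random.random() < eps:
                return random.choice(actions)
            return max(actions, key=lambda act: Q[(s, act)])

        for _ in range(n_episodes):
            s, done = env.reset(), False
            a = eps_greedy(s)
            while not done:
                s_next, r, done = env.step(a)
                a_next = eps_greedy(s_next)
                target = r + gamma * Q[(s_next, a_next)] * (not done)
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s, a = s_next, a_next
        return Q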

Deep Q Learning

Created by DeepMind, Deep Q Learning (DQL) substitutes the Q function with a deep neural network called the Q-network. It also keeps track of past observations in a replay memory, in order to use them to train the network.

    L_i(θ_i) = E_{(s,a,r,s') ~ U(D)} [ ( r + γ max_{a'} Q(s', a'; θ_{i−1}) − Q(s, a; θ_i) )^2 ]    (13)

where r + γ max_{a'} Q(s', a'; θ_{i−1}) is the target, θ are the weights of the network, and U(D) is the experience replay history.

    Initialise replay memory D with capacity N
    Initialise Q(s, a) arbitrarily
    foreach episode ∈ episodes do
        while s is not terminal do
            With probability ε select a random action a ∈ A(s),
            otherwise select a = argmax_a Q(s, a; θ)
            Take action a, observe r, s'
            Store transition (s, a, r, s') in D
            Sample a random minibatch of transitions (s_j, a_j, r_j, s'_j) from D
            Set y_j ← r_j                                  for terminal s'_j
                y_j ← r_j + γ max_{a'} Q(s'_j, a'; θ)      for non-terminal s'_j
            Perform a gradient descent step on (y_j − Q(s_j, a_j; θ))^2
            s ← s'
        end
    end

Algorithm 6: Deep Q Learning
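
A sketch of the loss in equation (13) for one replay minibatch, written with PyTorch as an assumption (the cheat sheet does not prescribe a framework); q_net and target_net are any modules mapping states to per-action values:

    import torch
    import torch.nn.functional as F

    def dqn_loss(q_net, target_net, batch, gamma=0.99):
        # batch: states (B, obs_dim), actions (B,), rewards (B,),
        #        next_states (B, obs_dim), dones (B,) with 1.0 at terminal states
        s, a, r, s_next, done = batch
        q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)   # Q(s, a; θ_i)
        with torch.no_grad():
            q_next = target_net(s_next).max(dim=1).values             # max_a' Q(s', a'; θ_{i-1})
            target = r + gamma * (1.0 - done) * q_next
        return F.mse_loss(q_sa, target)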

Double Deep Q Learning

Copyright © 2018 Francesco Saverio Zuppichini
https://github.com/FrancescoSaverioZuppichini/Reinforcement-Learning-Cheat-Sheet