Deep Reinforcement Learning
Badr Hirchoua (badr.hirchoua@gmail.com)

Motivation
First, automation of repeated physical solutions: the Industrial Revolution (1750-1850) and the Machine Age (1870-1940).
Second, automation of repeated mental solutions: the Digital Revolution (1960-now) and the Information Age (now-???).
Next step: allow machines to find solutions themselves: the AI revolution.
This requires learning autonomously how to make decisions.

Lecture 1: Learning To Do

Outline
▪ About Reinforcement Learning
▪ The Reinforcement Learning Problem
▪ Inside an RL Agent
▪ Deep Q Learning

AI for Chess
Board Configurations
https://www.microsoft.com/en-us/research/blog/the-human-side-of-ai-for-chess/

About Reinforcement Learning

What is Reinforcement Learning?
We, and other intelligent beings, learn by interacting with our environment.
This differs from certain other types of learning:
▪ It is active rather than passive.
▪ Interactions are often sequential: future interactions can depend on earlier ones.
▪ We are goal-directed.
▪ We can learn without examples of optimal behavior.

Classes of Learning Problems
▪ Supervised learning. Data: (x, y), where x is data and y is a label. Goal: learn a function to map x → y. Example: "This is an apple."
▪ Unsupervised learning. Data: x only, no labels! Goal: learn the underlying structure. Example: "This thing is like the other one."
▪ Reinforcement learning (our focus today). Data: state-action pairs. Goal: maximize future rewards over time. Example: "Eat this because it will keep you alive."

Examples of RL Applications
Mnih et al. (2015)
https://youtu.be/60pwnLB0DqY
https://youtu.be/LJ4oCb6u7kk
https://youtu.be/5WXVJ1A0k6Q

Examples of Reinforcement Learning
▪ Fly stunt manoeuvres in a helicopter
▪ Defeat the world champion at Backgammon and Go
▪ Manage an investment portfolio
▪ Control a power station
▪ Make a humanoid robot walk
▪ Play many different Atari games better than humans

Reinforcement Learning: Key Concepts
Agent: takes actions.
Environment: the world in which the agent exists and operates.
Action at: a move that the agent can make in the environment.
Action space A: the set of possible actions an agent can make in the environment.
Observations: what the agent perceives of the environment after taking actions.
State st: the situation which the agent perceives; taking an action changes the state to St+1.
Reward rt: feedback that measures the success or failure of the agent's action. The total future reward is Rt = Σ_{i=t}^{∞} ri.

Agent and Environment Interaction
At each step t the agent:
▪ Executes action At
▪ Receives observation Ot
▪ Receives scalar reward Rt
The environment:
▪ Receives action At
▪ Emits observation Ot+1
▪ Emits scalar reward Rt+1
t increments at each environment step.

Rewards
A reward Rt is a scalar feedback signal.
It indicates how well the agent is doing at step t.
The agent's job is to maximize cumulative reward.
Reinforcement learning is based on the reward hypothesis.
Definition (Reward Hypothesis): all goals can be described by the maximization of expected cumulative reward.
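To make the interaction loop above concrete, here is a minimal Python sketch of an agent acting in an environment and accumulating reward. The `Environment` and `RandomAgent` classes are hypothetical placeholders invented for illustration, not part of any particular library.

```python
import random


class Environment:
    """Toy environment: the episode ends after 10 steps; rewards are random."""

    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0  # initial observation O_1

    def step(self, action):
        # Receives action A_t, emits observation O_{t+1} and scalar reward R_{t+1}.
        self.t += 1
        observation = float(self.t)
        reward = random.random()
        done = self.t >= 10
        return observation, reward, done


class RandomAgent:
    """Agent that ignores observations and picks actions uniformly at random."""

    def act(self, observation):
        return random.choice([0, 1])  # action space A = {0, 1}


env, agent = Environment(), RandomAgent()
obs, total_reward, done = env.reset(), 0.0, False
while not done:
    action = agent.act(obs)                # agent executes action A_t
    obs, reward, done = env.step(action)   # environment emits O_{t+1} and R_{t+1}
    total_reward += reward                 # cumulative reward the agent tries to maximize
print(f"Return of the episode: {total_reward:.2f}")
```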
Rewards
Total reward (return): Rt = Σ_{i=t}^{∞} ri = rt + rt+1 + … + rt+n + …
Discounted total reward (return): Rt = Σ_{i=t}^{∞} γ^i ri = γ^t rt + γ^(t+1) rt+1 + … + γ^(t+n) rt+n + …
γ is a discount factor, where 0 < γ < 1.

Examples of Rewards
▪ Fly stunt manoeuvres in a helicopter: +ve reward for following the desired trajectory, -ve reward for crashing.
▪ Defeat the world champion at Backgammon and Go: +/-ve reward for winning/losing a game.
▪ Manage an investment portfolio: +ve reward for each $ in the bank.
▪ Control a power station: +ve reward for producing power, -ve reward for exceeding safety thresholds.
▪ Make a humanoid robot walk: +ve reward for forward motion, -ve reward for falling over.

Sequential Decision Making Under Uncertainty
Goal: select actions to maximize total future reward.
Actions may have long-term consequences, and reward may be delayed.
It may be better to sacrifice immediate reward to gain more long-term reward.
Examples:
▪ A financial investment (may take months to mature)
▪ Refuelling a helicopter (might prevent a crash in several hours)
▪ Blocking opponent moves (might help winning chances many moves from now)

History and State
The history is the sequence of observations, actions, and rewards:
Ht = O1, R1, A1, …, At-1, Ot, Rt
i.e. all observable variables up to time t; the sensorimotor stream of a robot or embodied agent.
What happens next depends on the history:
▪ The agent selects actions
▪ The environment selects observations/rewards
State is the information used to determine what happens next.
Formally, state is a function of the history: St = f(Ht)

Environment State
The environment state Ste is the environment's private representation, i.e. whatever data the environment uses to pick the next observation/reward.
The environment state is not usually visible to the agent.
Even if Ste is visible, it may contain irrelevant information.

Agent State
The agent state Sta is the agent's internal representation, i.e. whatever information the agent uses to pick the next action; it is the information used by reinforcement learning algorithms.
It can be any function of the history: Sta = f(Ht)

Information State
An information state (Markov state) contains all useful information from the history.
Definition: a state St is Markov if and only if ℙ[St+1 | St] = ℙ[St+1 | S1, …, St]
Once the state is known, the history may be thrown away.

Rat Example
▪ What if agent state = last 3 items in the sequence?
▪ What if agent state = counts for lights, bells and levers?
▪ What if agent state = the complete sequence?

Fully Observable Environments
Full observability: the agent directly observes the environment state: Ot = Sta = Ste
Agent state = environment state = information state.
Formally, this is a Markov decision process (MDP).

Markov Decision Processes
Definition (Markov Decision Process): a (discounted, infinite-horizon) Markov decision process is a tuple (S, A, T, γ, D, R), where:
1. S is the set of possible states for the system;
2. A is the set of possible actions;
3. T represents the (typically stochastic) system dynamics;
4. γ is a discount factor, where γ ∈ [0, 1];
5. D is the initial-state distribution;
6. R : S → ℝ is the reward function.
In short, an MDP is specified by a state space, an action space, a transition function, and a reward function.

Fully Observable: MDP (diagram of the fully observed agent-environment loop)
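As an illustration of the MDP tuple just defined, here is a small Python sketch: a made-up two-state MDP and a rollout that accumulates the discounted return Σ γ^i ri. The states, actions, transition probabilities and rewards are all invented for the example; the reward is applied to the state reached after each transition.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class MDP:
    states: List[str]
    actions: List[str]
    transitions: Dict[Tuple[str, str], List[Tuple[str, float]]]  # T(s' | s, a)
    gamma: float                                                 # discount factor
    initial_state: str                                           # D (deterministic here)
    reward: Callable[[str], float]                               # R : S -> R


mdp = MDP(
    states=["cold", "hot"],
    actions=["heat", "wait"],
    transitions={
        ("cold", "heat"): [("hot", 0.9), ("cold", 0.1)],
        ("cold", "wait"): [("cold", 1.0)],
        ("hot", "heat"): [("hot", 1.0)],
        ("hot", "wait"): [("cold", 0.5), ("hot", 0.5)],
    },
    gamma=0.9,
    initial_state="cold",
    reward=lambda s: 1.0 if s == "hot" else 0.0,
)


def rollout(mdp, policy, horizon=20):
    """Simulate one episode; return the discounted return sum_i gamma^i * r_i."""
    s, ret = mdp.initial_state, 0.0
    for i in range(horizon):
        a = policy(s)
        next_states, probs = zip(*mdp.transitions[(s, a)])
        s = random.choices(next_states, weights=probs)[0]  # sample s' ~ T(. | s, a)
        ret += mdp.gamma ** i * mdp.reward(s)               # discounted reward of the state reached
    return ret


print(rollout(mdp, policy=lambda s: "heat"))  # "always heat" earns a near-maximal return
```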
Partially Observable Environments
Partial observability: the agent indirectly observes the environment:
▪ A trading agent only observes current prices
▪ A poker-playing agent only observes public cards
Now: agent state ≠ environment state.
Formally, this is a partially observable Markov decision process (POMDP).
The agent must construct its own state representation Sta, e.g.
▪ Complete history: Sta = Ht
▪ Beliefs of the environment state: Sta = (ℙ[Ste = S1], …, ℙ[Ste = Sn])
▪ Recurrent neural network: Sta = σ(St-1a Ws + Ot Wo)

Partially Observable MDP (POMDP)
The partially observable Markov decision process framework can model a diversity of dynamic real-world decision processes and produce a policy that fulfills a given task.
A POMDP models the decision process of an agent that cannot directly observe the underlying state.
The aim in the POMDP model is to maximize the obtained reward while reasoning about the state uncertainty.

POMDP
Definition (Partially Observable Markov Decision Process): a POMDP is a 7-tuple (S, A, T, R, Ω, O, γ), where:
1. S is a set of states;
2. A is a set of actions;
3. T is a set of conditional transition probabilities T(s' | s, a) for the state transition s → s';
4. R : S × A → ℝ is the reward function;
5. Ω is a set of observations;
6. O is a set of conditional observation probabilities O(o | s', a);
7. γ ∈ [0, 1] is the discount factor.

Partially Observable: POMDP
An MDP augmented with observations.

RL Components

Defining the Q-function
Q(st, at) = 𝔼[Rt | st, at]
The Q-function captures the expected total future reward that an agent in state s can receive by executing a certain action a.
Ultimately, the agent needs a policy π(s) to infer the best action to take in its state s.
Strategy: the policy should choose an action that maximizes future reward:
π*(s) = argmax_a Q(s, a)

Major Components of an RL Agent
An RL agent may include one or more of these components:
▪ Policy: the agent's behavior function.
▪ Value function: how good each state and/or action is.
▪ Model: the agent's representation of the environment.

Deep Reinforcement Learning Algorithms
▪ Value learning: find Q(s, a), then act with a = argmax_a Q(s, a).
▪ Policy learning: find π(s) directly, then sample a ~ π(s).

Policy
A policy is the agent's behavior. It is a map from state to action, e.g.
▪ Deterministic policy: a = π(s)
▪ Stochastic policy: π(a | s) = ℙ[At = a | St = s]
The objective is to find the policy that maximizes the expected sum of rewards accumulated over time.

Value Function
The value function is a prediction of future reward.
It is used to evaluate the goodness/badness of states, and therefore to select between actions, e.g.
vπ(s) = 𝔼π[Rt+1 + γRt+2 + γ²Rt+3 + … | St = s]
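As a tiny illustration of value learning, the sketch below stores Q(s, a) in a table with made-up numbers and extracts the deterministic greedy policy π*(s) = argmax_a Q(s, a). The ε-greedy variant is not from the slides; it is included only as one simple example of a stochastic policy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up tabular Q-values Q(s, a) for 3 states and 2 actions.
Q = np.array([[1.0, 4.0],
              [0.5, 0.2],
              [3.0, 3.1]])


def greedy_policy(s):
    """Deterministic policy pi*(s) = argmax_a Q(s, a)."""
    return int(np.argmax(Q[s]))


def epsilon_greedy_policy(s, eps=0.1):
    """A simple stochastic policy: random action with probability eps, greedy otherwise."""
    if rng.random() < eps:
        return int(rng.integers(Q.shape[1]))
    return greedy_policy(s)


for s in range(Q.shape[0]):
    print(f"state {s}: greedy action = {greedy_policy(s)}, "
          f"sampled action = {epsilon_greedy_policy(s)}")
```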
Model
A model predicts what the environment will do next: 𝒫 predicts the next state and ℛ predicts the next (immediate) reward, e.g.
𝒫(s' | s, a) = ℙ[St+1 = s' | St = s, At = a]
ℛ(s, a) = 𝔼[Rt+1 | St = s, At = a]

Q-Learning: The Procedure
Diagram: starting from Q(s1, a) = 0, the agent takes action a1 = π(s1) in state s1; the environment returns s2 = δ(s1, a1) and reward r2 = r(s1, a1); the agent updates Q(s1, a1) ← Q(s1, a1) + Δ, then takes a2 = π(s2), observes s3 = δ(s2, a2) and r3 = r(s2, a2), and so on.

Maze Example
▪ Rewards: -1 per time-step
▪ Actions: N, E, S, W
▪ States: the agent's location
(Maze figure with Start and Goal cells.)

Maze Example: Policy
▪ Arrows represent the policy π(s) for each state s.

Maze Example: Value Function
▪ Numbers represent the value vπ(s) of each state s (grid figure, values ranging from -24 far from the goal down to -1 next to it).

Maze Example: Model
▪ The agent may have an internal model of the environment.
▪ Dynamics: how actions change the state.
▪ Rewards: how much reward comes from each state.
▪ The model may be imperfect.
(Grid figure: the layout represents the transition model 𝒫(s' | s, a); the numbers represent the immediate reward ℛ(s, a) from each state s, here -1 everywhere and the same for all actions a.)

Digging Deeper into the Q-function
Example: Atari Breakout.
It can be very difficult for humans to accurately estimate Q-values: which (s, a) pair has a higher Q-value?

Lecture 2: Deep Q Learning

Deep Q-Network
How can we use deep neural networks to model the Q-function?
▪ Option 1: input the state s together with an action a (e.g. "move right"); the network outputs the expected return Q(s, a).
▪ Option 2: input the state s only; the network outputs Q(s, a1), Q(s, a2), …, Q(s, an), the expected return for each possible action.

Deep Q-Network: Training
What happens if we take all the best actions? Maximizing the target return is what trains the agent.
Taking all the best actions gives the target return: r + γ max_a' Q(s', a').
The network prediction is Q(s, a).
Q-loss: L = 𝔼[ || (r + γ max_a' Q(s', a')) - Q(s, a) ||² ]

Deep Q-Network: Summary
Example output for a state s: Q(s, a1) = 20, Q(s, a2) = 3, Q(s, a3) = 0.
π(s) = argmax_a Q(s, a) = a1
Send the action back to the environment and receive the next state.

Downside of Q-Learning
Complexity: it can model scenarios where the action space is discrete and small, but it cannot handle continuous action spaces.
Flexibility: the policy is deterministically computed from the Q-function by maximizing the reward, so it cannot learn stochastic policies.
To address these, consider a new class of RL training algorithms: policy gradient methods.

Deep Reinforcement Learning Algorithms
▪ Value learning: find Q(s, a), then act with a = argmax_a Q(s, a).
▪ Policy learning: find π(s) directly, then sample a ~ π(s).

Policy Gradient (PG): Key Concepts
Policy gradient: directly optimize the policy π(s).
Given a state s, the deep NN outputs a probability for each action, e.g. P(a1|s) = 0.9, P(a2|s) = 0.1, P(a3|s) = 0, with Σ_{ai ∈ A} P(ai|s) = 1.
The action is then sampled: π(s) ~ P(a|s), here a1.
What are some advantages of this formulation?

Discrete vs Continuous Action Spaces
Discrete action space: which direction should I move? P(a|s) is a distribution over {left, stay, right}.
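A minimal sketch of this discrete case: a stand-in "network" (one random linear layer plus a softmax, purely hypothetical) maps made-up state features to a distribution P(a|s) over {left, stay, right} that sums to 1, and the action is sampled from it rather than taken by argmax.

```python
import numpy as np

rng = np.random.default_rng(0)

actions = ["left", "stay", "right"]
state = np.array([0.2, -1.0, 0.5, 0.1])        # made-up state features

# Hypothetical "policy network": a single linear layer followed by a softmax,
# standing in for the deep NN on the slide.
W = rng.normal(size=(len(state), len(actions)))
logits = state @ W
probs = np.exp(logits - logits.max())
probs /= probs.sum()                            # P(a|s): non-negative, sums to 1

action = rng.choice(len(actions), p=probs)      # a ~ pi(s): sample instead of argmax
print(dict(zip(actions, probs.round(3))), "->", actions[action])
```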
Discrete vs Continuous Action Spaces
Continuous action space: how fast should I move (e.g. 0.7 m/s)? Here P(a|s) is a density over speeds, from faster left to faster right.

Policy Gradient (PG): Key Concepts
Policy gradients enable modeling of continuous action spaces.
Given a state s, the deep NN outputs the mean and variance of a Gaussian, e.g. µ = -1 and σ² = 0.5, so that P(a|s) = N(µ, σ²) and ∫_{-∞}^{+∞} P(a|s) da = 1.
The action is sampled from this distribution: π(s) ~ P(a|s), e.g. a = -0.8 m/s.

Solving an MDP
Objective: maximize the expected future reward starting at state si, choosing action ai, and then following an optimal policy π*.

Q Learning
The max future reward for taking action at is the current reward plus the next step's max future reward from taking the best next action a':
Q*(s, a) = r + γ max_a' Q*(s', a')

Function Approximation
Model: the Q-function represented by a function approximator (e.g. a deep network).
Training data: observed transitions (s, a, r, s').
Loss function: the squared error between the prediction Q(s, a) and the target r + γ max_a' Q(s', a'), as above.

Implementation
Action-in vs action-out architectures.
Off-policy learning → the target depends in part on our model → old observations are still useful → use a replay buffer of the most recent transitions as the dataset.

DQN Issues
→ Convergence is not guaranteed: hope for deep magic!
→ Stabilization tricks: replay buffer, error clipping, reward scaling, using replicas.
→ Double Q-learning: decouple action selection and value estimation.

Policy Gradients
→ Parameterize the policy and update those parameters directly.
→ Enables new kinds of policies: stochastic, and continuous action spaces.
→ On-policy learning: learn directly from your own actions.
→ Approximate the expectation value from samples.

Advanced Policy Gradient Methods
→ For stochastic functions, the gradient is not the best direction; consider the KL divergence.
→ NPG: approximating the Fisher information matrix.
→ TRPO: computing gradients with a KL constraint.
→ PPO: gradients with a KL penalty.
Rajeswaran et al. (2017), Heess et al. (2017)
https://youtu.be/frojcskMkkY
https://youtu.be/hx_bgoTF7bs

Actor Critic
Critic: estimates the advantage, using a Q-learning-style update.
Actor: proposes actions, updated using the policy gradient.
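To tie the last slides together, here is a minimal actor-critic sketch on a made-up two-armed bandit: a softmax actor updated with the policy gradient, and a scalar critic whose value estimate supplies the advantage. The bandit rewards, learning rates, and step count are invented for illustration, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 2
theta = np.zeros(n_actions)       # actor parameters (action preferences)
value = 0.0                       # critic: value estimate of the single state
alpha_actor, alpha_critic = 0.1, 0.1


def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()


for step in range(2000):
    pi = softmax(theta)                                # stochastic policy pi(a|s)
    a = rng.choice(n_actions, p=pi)                    # actor proposes an action
    reward = rng.normal(1.0 if a == 0 else 0.0, 0.1)   # made-up bandit reward

    advantage = reward - value                         # critic estimates the advantage
    value += alpha_critic * advantage                  # critic update (move toward observed reward)

    grad_log_pi = np.eye(n_actions)[a] - pi            # gradient of log softmax policy
    theta += alpha_actor * advantage * grad_log_pi     # policy gradient update of the actor

print("Learned policy:", softmax(theta))               # should strongly prefer action 0
```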