Deep Reinforcement Learning
Badr Hirchoua
badr.hirchoua@gmail.com
Motivation
▪ First, automation of repeated physical solutions: the Industrial revolution (1750–1850) and the Machine Age (1870–1940).
▪ Second, automation of repeated mental solutions: the Digital revolution (1960–now) and the Information Age.
▪ Next step: allow machines to find solutions themselves: the AI revolution (now–???).
▪ This requires learning autonomously how to make decisions.
Lecture 1: Learning To Do
Outline
1 About Reinforcement Learning
3 The Reinforcement Learning Problem
4 Inside An RL Agent
5 Deep Q Learning
AI for chess
Board Configurations
https://www.microsoft.com/en-us/research/blog/thehuman-side-of-ai-for-chess/
About Reinforcement Learning
What is Reinforcement Learning?
▪ We, and other intelligent beings, learn by interacting with our environment.
▪ This differs from certain other types of learning:
  ▪ It is active rather than passive.
  ▪ Interactions are often sequential – future interactions can depend on earlier ones.
▪ We are goal-directed.
▪ We can learn without examples of optimal behavior.
Classes of Learning Problems
Supervised Learning
▪ Data: (x, y) – x is data, y is the label
▪ Goal: learn a function to map x → y
▪ Example: "This is an apple."

Unsupervised Learning
▪ Data: x – x is data, no labels!
▪ Goal: learn the underlying structure
▪ Example: "This thing is like the other one."

Reinforcement Learning (our focus today)
▪ Data: state–action pairs
▪ Goal: maximize future rewards over time
▪ Example: "Eat this because it will keep you alive."
Examples of RL Applications
Mnih et al. (2015)
https://youtu.be/60pwnLB0DqY
https://youtu.be/LJ4oCb6u7kk
https://youtu.be/5WXVJ1A0k6Q
Examples of Reinforcement Learning
▪ Fly stunt manoeuvres in a helicopter
▪ Defeat the world champion at Backgammon and Go
▪ Manage an investment portfolio
▪ Control a power station
▪ Make a humanoid robot walk
▪ Play many different Atari games better than humans
Reinforcement Learning: Key Concepts
▪ Agent: takes actions.
▪ Environment: the world in which the agent exists and operates.
Reinforcement Learning: Key Concepts
▪ Action a_t: a move that the agent can make in the environment.
▪ Action space A: the set of possible actions an agent can make in the environment.
Reinforcement Learning: Key Concepts
▪ Observation O_t: what the agent observes of the environment after taking an action.
Reinforcement Learning: Key Concepts
▪ State s_t: a situation which the agent perceives.
▪ State change: taking action a_t moves the environment to the next state s_{t+1}.
Reinforcement Learning: Key Concepts
▪ Reward r_t: feedback that measures the success or failure of the agent's action.
▪ Total reward: R_t = Σ_{i=t}^∞ r_i
Agent and Environment Interaction
▪ At each step t the agent:
  ▪ Executes action A_t
  ▪ Receives observation O_t
  ▪ Receives scalar reward R_t
▪ The environment:
  ▪ Receives action A_t
  ▪ Emits observation O_{t+1}
  ▪ Emits scalar reward R_{t+1}
▪ t increments at each environment step.
Rewards
▪ A reward R_t is a scalar feedback signal.
▪ It indicates how well the agent is doing at step t.
▪ The agent's job is to maximize cumulative reward.
▪ Reinforcement learning is based on the reward hypothesis.

Definition (Reward Hypothesis)
All goals can be described by the maximisation of expected cumulative reward.
Rewards
▪ Total reward (return):
  R_t = Σ_{i=t}^∞ r_i = r_t + r_{t+1} + … + r_{t+n} + …
▪ Discounted total reward (return):
  R_t = Σ_{i=t}^∞ γ^i r_i = γ^t r_t + γ^{t+1} r_{t+1} + … + γ^{t+n} r_{t+n} + …
▪ γ is a discount factor, where γ ∈ ]0, 1[.
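As a small illustration, here is a Python sketch of the discounted return computed from a list of collected rewards; the discount is applied relative to the current step t (dropping the common γ^t factor), and the reward values are made up for the example:

```python
# Minimal sketch: computing the discounted return from a list of rewards
# collected from step t onward. gamma is the discount factor.

def discounted_return(rewards, gamma=0.99):
    """R_t = sum_i gamma^i * r_{t+i}, discounting relative to the current step."""
    total = 0.0
    for i, r in enumerate(rewards):
        total += (gamma ** i) * r
    return total

# Example with made-up rewards: 1.0 + 0.9*0.0 + 0.9**2 * 2.0 = 2.62
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))
```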
Examples of Rewards
▪ Fly stunt manoeuvres in a helicopter:
  +ve reward for following the desired trajectory, -ve reward for crashing.
▪ Defeat the world champion at Backgammon and Go:
  +/-ve reward for winning/losing a game.
▪ Manage an investment portfolio:
  +ve reward for each $ in the bank.
▪ Control a power station:
  +ve reward for producing power, -ve reward for exceeding safety thresholds.
▪ Make a humanoid robot walk:
  +ve reward for forward motion, -ve reward for falling over.
Sequential Decision Making – under uncertainty
▪ Goal: select actions to maximize total future reward.
▪ Actions may have long-term consequences.
▪ Reward may be delayed.
▪ It may be better to sacrifice immediate reward to gain more long-term reward.
▪ Examples:
  ▪ A financial investment (may take months to mature).
  ▪ Refuelling a helicopter (might prevent a crash in several hours).
  ▪ Blocking opponent moves (might help winning chances many moves from now).
History and State
▪ The history is the sequence of observations, actions, and rewards:
  H_t = O_1, R_1, A_1, …, A_{t-1}, O_t, R_t
  i.e. all observable variables up to time t – the sensorimotor stream of a robot or embodied agent.
▪ What happens next depends on the history:
  ▪ The agent selects actions.
  ▪ The environment selects observations/rewards.
▪ State is the information used to determine what happens next.
▪ Formally, state is a function of the history:
  S_t = f(H_t)
Environment State
▪ The environment state S_t^e is the environment's private representation,
  i.e. whatever data the environment uses to pick the next observation/reward.
▪ The environment state is not usually visible to the agent.
▪ Even if S_t^e is visible, it may contain irrelevant information.
Agent State
▪ The agent state S_t^a is the agent's internal representation,
  i.e. whatever information the agent uses to pick the next action –
  the information used by reinforcement learning algorithms.
▪ It can be any function of the history:
  S_t^a = f(H_t)
Information State
▪ An information state (Markov state) contains all useful information from the history.

Definition
A state S_t is Markov if and only if
  ℙ[S_{t+1} | S_t] = ℙ[S_{t+1} | S_1, …, S_t]

▪ Once the state is known, the history may be thrown away.
Rat Example
▪ What if the agent state = the last 3 items in the sequence?
▪ What if the agent state = counts for lights, bells and levers?
▪ What if the agent state = the complete sequence?
Fully Observable Environments
▪ Full observability: the agent directly observes the environment state:
  O_t = S_t^a = S_t^e
▪ Agent state = environment state = information state.
▪ Formally, this is a Markov decision process (MDP).
Markov Decision Processes
Definition (Markov Decision Process)
A (discounted, infinite-horizon) Markov decision process is a tuple (S, A, T, γ, D, R), where:
1. S is the set of possible states of the system;
2. A is the set of possible actions;
3. T represents the (typically stochastic) system dynamics;
4. γ is a discount factor, where γ ∈ [0, 1];
5. D is the initial-state distribution;
6. R : S → ℝ is the reward function.
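As an illustration, one way to hold the tuple (S, A, T, γ, D, R) in code; the dictionary layout and the toy two-state dynamics below are assumptions made for the sketch, not part of the lecture:

```python
# Minimal sketch: an MDP (S, A, T, gamma, D, R) as a plain Python structure.
mdp = {
    "states": ["s0", "s1"],                       # S
    "actions": ["left", "right"],                 # A
    # T[s][a] is a dict of next-state probabilities (stochastic dynamics)
    "T": {
        "s0": {"left": {"s0": 1.0}, "right": {"s1": 1.0}},
        "s1": {"left": {"s0": 0.9, "s1": 0.1}, "right": {"s1": 1.0}},
    },
    "gamma": 0.95,                                # discount factor
    "D": {"s0": 1.0},                             # initial-state distribution
    "R": {"s0": 0.0, "s1": 1.0},                  # reward function R : S -> real
}
```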
Markov Decision Processes
[Figure: an MDP's components – state space, action space, transition function, reward function]
Fully Observable - MDP
Partially Observable Environments
▪ Partial observability: the agent indirectly observes the environment:
  ▪ A trading agent only observes current prices.
  ▪ A poker-playing agent only observes public cards.
▪ Now: agent state ≠ environment state.
▪ Formally, this is a partially observable Markov decision process (POMDP).
▪ The agent must construct its own state representation S_t^a, e.g.
  ▪ Complete history: S_t^a = H_t
  ▪ Beliefs of the environment state: S_t^a = (ℙ[S_t^e = s^1], …, ℙ[S_t^e = s^n])
  ▪ Recurrent neural network: S_t^a = σ(S_{t-1}^a W_s + O_t W_o)
Partially Observable MDP (POMDP)
▪ The partially observable Markov decision process framework can model a wide variety of dynamic real-world decision processes and produce a policy that fulfills a given task.
▪ A POMDP models the decision process of an agent that cannot directly observe the underlying state.
▪ The aim in the POMDP model is to maximize the obtained reward while reasoning about the state uncertainty.
POMDP
Definition (Partially Observable Markov Decision Process)
A POMDP is a 7-tuple (S, A, T, R, Ω, O, γ), where:
1. S is a set of states;
2. A is a set of actions;
3. T is a set of conditional transition probabilities T(s′|s, a) for the state transition s → s′;
4. R : S × A → ℝ is the reward function;
5. Ω is a set of observations;
6. O is a set of conditional observation probabilities O(o|s′, a);
7. γ ∈ [0, 1] is the discount factor.
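As a sketch of how an agent can maintain "beliefs of the environment state" under this definition, here is a standard belief update using T(s′|s, a) and O(o|s′, a); the dictionary indexing (T[s][a][s_next], O[s_next][a][o]) is an assumption made for the example:

```python
# Minimal sketch: updating a belief over states after taking action a and observing o.
# belief is a dict state -> probability; T and O follow the assumed indexing above.

def belief_update(belief, a, o, T, O, states):
    """b'(s') proportional to O(o|s',a) * sum_s T(s'|s,a) * b(s)."""
    new_belief = {}
    for s_next in states:
        prior = sum(T[s][a].get(s_next, 0.0) * belief[s] for s in states)
        new_belief[s_next] = O[s_next][a].get(o, 0.0) * prior
    norm = sum(new_belief.values())
    return {s: p / norm for s, p in new_belief.items()} if norm > 0 else belief
```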
Partially Observable – POMDP
▪ A POMDP is an MDP augmented with observations.
RL Components
Defining the Q-function
Q(s_t, a_t) = 𝔼[R_t | s_t, a_t]
▪ The Q-function captures the expected total future reward an agent in state s can receive by executing a certain action a.
Defining the Q-function
Q(s_t, a_t) = 𝔼[R_t | s_t, a_t]
▪ Ultimately, the agent needs a policy π(s) to infer the best action to take in its state s.
▪ Strategy: the policy should choose an action that maximizes future reward:
  π*(s) = argmax_a Q(s, a)
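A small Python sketch of this greedy policy over a tabular Q-function; the (state, action)-keyed dictionary layout and the names are assumptions for the example:

```python
# Minimal sketch: pi*(s) = argmax_a Q(s, a) for a tabular Q-function.
def greedy_policy(Q, state, actions):
    """Q is assumed to map (state, action) -> estimated return."""
    return max(actions, key=lambda a: Q[(state, a)])
```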
Major Components of an RL Agent
▪ An RL agent may include one or more of these components:
  ▪ Policy: the agent's behavior function.
  ▪ Value function: how good each state and/or action is.
  ▪ Model: the agent's representation of the environment.
Deep Reinforcement Learning Algorithms
▪ Value Learning: find Q(s, a), then act with a = argmax_a Q(s, a).
▪ Policy Learning: find π(s), then sample a ~ π(s).
Policy
▪ A policy is the agent's behavior.
▪ It is a map from state to action, e.g.
  ▪ Deterministic policy: a = π(s).
  ▪ Stochastic policy: π(a|s) = ℙ[A_t = a | S_t = s].
▪ The objective is to find the policy that maximizes the expected sum of rewards accumulated over time.
Value Function
▪ The value function is a prediction of future reward.
▪ It is used to evaluate the goodness/badness of states.
▪ And therefore to select between actions, e.g.
  v_π(s) = 𝔼_π[R_{t+1} + γ R_{t+2} + γ² R_{t+3} + … | S_t = s]
Model
▪ A model predicts what the environment will do next:
  ▪ P predicts the next state;
  ▪ R predicts the next (immediate) reward, e.g.
  P^a_{ss′} = ℙ[S_{t+1} = s′ | S_t = s, A_t = a]
  R^a_s = 𝔼[R_{t+1} | S_t = s, A_t = a]
Q-Learning: The Procedure
[Diagram: the agent–environment interaction unrolled over two steps]
▪ The agent starts with Q(s_1, a) = 0 and picks a_1 = π(s_1).
▪ The environment returns the next state s_2 = δ(s_1, a_1) and reward r_2 = r(s_1, a_1).
▪ The agent updates Q(s_1, a_1) ← Q(s_1, a_1) + Δ and picks a_2 = π(s_2).
▪ The environment returns s_3 = δ(s_2, a_2) and r_3 = r(s_2, a_2), and so on.
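A minimal sketch of the update step Δ in this loop, assuming the standard tabular Q-learning rule; the learning rate α and the dictionary layout are assumptions, not values from the slides:

```python
# Minimal sketch: the tabular Q-learning update driving the loop above.
from collections import defaultdict

Q = defaultdict(float)          # Q(s, a), zero-initialised as in the diagram
alpha, gamma = 0.1, 0.99        # learning rate and discount factor (assumed values)

def q_update(s, a, r, s_next, actions):
    """Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```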
Maze Example
▪ Rewards: -1 per time-step
▪ Actions: N, E, S, W
▪ States: the agent's location
[Figure: a maze grid with a Start cell and a Goal cell]
Maze Example: Policy
▪ Arrows represent the policy π(s) for each state s.
[Figure: the maze with an arrow in each cell]
Maze Example: Value Function
▪ Numbers represent the value v_π(s) of each state s.
[Figure: the maze with values from -1 to -24 in its cells, from Start to Goal]
Maze Example: Model
▪ The agent may have an internal model of the environment:
  ▪ Dynamics: how actions change the state.
  ▪ Rewards: how much reward comes from each state.
  ▪ The model may be imperfect.
▪ The grid layout represents the transition model P^a_{ss′}.
▪ The numbers represent the immediate reward R^a_s from each state s (the same for all a; here -1 everywhere).
[Figure: the agent's modelled portion of the maze, with -1 in each modelled cell]
Digging deeper into the Q-function
Example: Atari Breakout – Side
▪ It can be very difficult for humans to accurately estimate Q-values.
▪ Which (s, a) pair has a higher Q-value?
Lecture 2: Deep Q Learning
Deep Q-Network
How can we use deep neural networks to model the Q-function?
▪ Action and state → expected return: the deep NN takes the state s and an action a (e.g. "move right") as input, and outputs the single value Q(s, a).
▪ State → expected return for each action: the deep NN takes only the state s as input, and outputs Q(s, a_1), Q(s, a_2), …, Q(s, a_n) – one value per action in a single forward pass.
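A minimal sketch of the second design (state in, one Q-value per action out), using PyTorch as an assumed framework since the slides do not name one; `state_dim`, `n_actions` and the hidden sizes are illustrative placeholders:

```python
# Minimal sketch: a deep NN mapping a state to Q(s, a) for every action.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, n_actions),   # one output per action: Q(s, a_1..a_n)
        )

    def forward(self, state):
        return self.net(state)           # shape: (batch, n_actions)

# Greedy action selection: a = argmax_a Q(s, a)
q_net = QNetwork(state_dim=4, n_actions=2)
state = torch.zeros(1, 4)                # placeholder state
action = q_net(state).argmax(dim=1)
```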
Deep Q-Network - Training
▪ What happens if we take all the best actions?
▪ Maximize the target return → train the agent.
Deep Q-Network - Training
▪ Take all the best actions → the target return:
  Target: r + γ max_{a′} Q(s′, a′)
Deep Q-Network - Training
▪ Target: r + γ max_{a′} Q(s′, a′)
▪ Predicted: Q(s, a) – the network's prediction.
Deep Q-Network - Training
▪ Q-Loss: the squared difference between the target and the prediction:
  L = 𝔼[ ‖ (r + γ max_{a′} Q(s′, a′)) − Q(s, a) ‖² ]
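A minimal sketch of this Q-loss in code, again with PyTorch as an assumed framework; `q_net` is a state-to-Q-values network as in the earlier sketch, and the batch tensors and the `dones` mask for terminal states are illustrative additions beyond the slide:

```python
# Minimal sketch: mean squared error between the Bellman target and Q(s, a).
import torch
import torch.nn.functional as F

def q_loss(q_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # Q(s, a) for the actions actually taken (actions: int64 tensor of shape (batch,))
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                               # treat the target as fixed
        q_next = q_net(next_states).max(dim=1).values   # max_a' Q(s', a')
        target = rewards + gamma * (1.0 - dones) * q_next
    return F.mse_loss(q_pred, target)
```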
Deep Q-Network - Summary
▪ Example: the deep NN maps state s to Q(s, a_1) = 20, Q(s, a_2) = 3, Q(s, a_3) = 0.
▪ π(s) = argmax_a Q(s, a) = a_1
▪ Send the action back to the environment and receive the next state.
Downside of Q-Learning
▪ Complexity:
  ▪ Can model scenarios where the action space is discrete and small.
  ▪ Cannot handle continuous action spaces.
▪ Flexibility:
  ▪ The policy is deterministically computed from the Q-function by maximizing the reward → cannot learn stochastic policies.
▪ To address these issues, consider a new class of RL training algorithms: policy gradient methods.
Deep Reinforcement Learning Algorithms
▪ Value Learning: find Q(s, a), then act with a = argmax_a Q(s, a).
▪ Policy Learning: find π(s), then sample a ~ π(s).
Policy Gradient (PG): Key Concepts
Policy Gradient: directly optimize the policy π(s).
▪ The deep NN maps the state s to a probability for each action, e.g. P(a_1|s) = 0.9, P(a_2|s) = 0.1, P(a_3|s) = 0, with Σ_{a_i ∈ A} P(a_i|s) = 1.
▪ The action is sampled from this distribution: π(s) ~ P(a|s), here a_1.
▪ What are some advantages of this formulation?
Discrete Vs Continuous Action Spaces
▪ Discrete action space: which direction should I move?
[Figure: P(a|s) as a distribution over the discrete actions Left, Stay, Right for state s]
Discrete Vs Continuous Action Spaces
▪ Discrete action space: which direction should I move?
▪ Continuous action space: how fast should I move? (e.g. 0.7 m/s)
[Figure: P(a|s) as a density over a continuous axis from "faster left" to "faster right"]
Policy Gradient (PG): Key Concepts
Policy Gradient: enables modeling of continuous action spaces.
▪ The deep NN outputs the parameters of a distribution over actions, e.g. a Gaussian P(a|s) = N(µ, σ²) with mean µ = −1 and variance σ² = 0.5.
▪ The action is sampled from it: π(s) ~ P(a|s), e.g. −0.8 m/s.
▪ The density integrates to one: ∫_{−∞}^{∞} P(a|s) da = 1.
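A minimal sketch of such a Gaussian policy head, with PyTorch as an assumed framework; the network layout, the single action dimension, and the state-independent variance are illustrative choices:

```python
# Minimal sketch: a policy network that outputs N(mu, sigma^2) and samples an action.
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    def __init__(self, state_dim):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
        self.mean_head = nn.Linear(64, 1)              # mu(s)
        self.log_std = nn.Parameter(torch.zeros(1))    # log sigma (state-independent here)

    def forward(self, state):
        mu = self.mean_head(self.body(state))
        return torch.distributions.Normal(mu, self.log_std.exp())

policy = GaussianPolicy(state_dim=4)
dist = policy(torch.zeros(1, 4))
action = dist.sample()                 # e.g. a speed in m/s
log_prob = dist.log_prob(action)       # used later for the policy gradient
```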
Solving an MDP
▪ Objective: find the policy that maximizes the expected future reward.
▪ Goal: Q*(s_i, a_i) – the maximum expected future reward starting at state s_i, choosing action a_i, and then following an optimal policy π*.
Q Learning
▪ The max future reward for taking action a_t is the current reward plus the next step's max future reward from taking the best next action a′:
  Q*(s_t, a_t) = r_t + γ max_{a′} Q*(s_{t+1}, a′)
Function Approximation
▪ Model: a parameterized Q-function Q(s, a; θ) ≈ Q*(s, a).
▪ Training data: observed transitions (s, a, r, s′).
▪ Loss function: L(θ) = 𝔼[(y − Q(s, a; θ))²], where the target is y = r + γ max_{a′} Q(s′, a′; θ).
Implementation
▪ Action-in vs. action-out: the Q-network can either take (state, action) as input, or take the state and output one Q-value per action.
▪ Off-policy learning: the target depends in part on our model → old observations are still useful.
▪ Use a replay buffer of the most recent transitions as the dataset.
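A minimal sketch of such a replay buffer of the most recent transitions; the class and method names are assumptions for the example:

```python
# Minimal sketch: a fixed-capacity buffer of recent transitions for off-policy training.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions fall off the end

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))               # (states, actions, rewards, next_states, dones)

    def __len__(self):
        return len(self.buffer)
```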
DQN Issues
→ Convergence is not guaranteed – hope for deep magic!
→ Stabilization tricks: replay buffer, error clipping, reward scaling, using replicas.
→ Double Q-learning – decouple action selection and value estimation.
Policy Gradients
→ Parameterize the policy and update those parameters directly.
→ Enables new kinds of policies: stochastic, continuous action spaces.
→ On-policy learning → learn directly from your actions.
Policy Gradients
→ Approximate the expectation value from samples.
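A minimal sketch of this sample-based estimate in the REINFORCE style, with PyTorch as an assumed framework: minimizing the surrogate loss −log π(a_t|s_t) · R_t yields the sampled policy gradient; the names and tensor shapes are assumptions for the example:

```python
# Minimal sketch: policy gradient estimated from sampled log-probabilities and returns.
import torch

def policy_gradient_loss(log_probs, returns):
    """log_probs: list of log pi(a_t|s_t) tensors for sampled actions; returns: R_t values."""
    returns = torch.as_tensor(returns, dtype=torch.float32)
    return -(torch.stack(log_probs) * returns).mean()

# Calling loss.backward() yields the sampled estimate of -grad J(theta),
# so minimizing this loss ascends the expected return.
```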
Advanced Policy Gradient Methods
→ For stochastic functions, the gradient is not the best direction.
→ Consider the KL divergence:
  ▪ NPG → approximating the Fisher information matrix
  ▪ TRPO → computing gradients with a KL constraint
  ▪ PPO → computing gradients with a KL penalty
Advanced Policy Gradient Methods
Rajeswaran et al. (2017)
Heess et al. (2017)
https://youtu.be/frojcskMkkY
https://youtu.be/hx_bgoTF7bs
Actor Critic
▪ Critic: estimates the advantage, using a Q-learning update.
▪ Actor: proposes actions, using a policy gradient update.
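A minimal sketch of one combined update step, with PyTorch as an assumed framework; the one-step TD advantage and the function names are assumptions made for the example:

```python
# Minimal sketch: critic regresses toward a TD-style target,
# actor follows a policy gradient weighted by the estimated advantage.
import torch

def actor_critic_step(log_prob, value, next_value, reward, gamma=0.99):
    td_target = reward + gamma * next_value.detach()        # critic's bootstrapped target
    advantage = td_target - value                           # estimated advantage
    critic_loss = advantage.pow(2).mean()                   # regress value toward the target
    actor_loss = -(log_prob * advantage.detach()).mean()    # policy gradient update
    return actor_loss + critic_loss                         # backpropagate the combined loss
```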