Introduction to
Reinforcement
Learning
E0397 Lecture
Slides by Richard Sutton
(With small changes)
Reinforcement Learning: Basic Idea

 Learn to take correct actions over time, by experience
 Similar to how humans learn: "trial and error"
 Try an action –
    "see" what happens
    "remember" what happens
 Use this to change the choice of action the next time you are in the same situation
 "Converge" to learning the correct actions
 Focus on long-term return, not immediate benefit
    Some actions may not seem beneficial in the short term, but their long-term return will be good.
RL in Cloud Computing

 Why are we talking about RL in this class?
 Which problems do you think can be solved by a learning approach?
 Is a focus on long-term benefit appropriate?
RL Agent

 Objective is to affect the environment
 Environment is stochastic and uncertain

[Diagram: the Agent sends an action to the Environment; the Environment returns a new state and a reward to the Agent]
What is Reinforcement Learning?

 An approach to Artificial Intelligence
 Learning from interaction
 Goal-oriented learning
 Learning about, from, and while interacting with an external environment
 Learning what to do—how to map situations to actions—so as to maximize a numerical reward signal
Key Features of RL

 Learner is not told which actions to take
 Trial-and-error search
 Possibility of delayed reward
    Sacrifice short-term gains for greater long-term gains
 The need to explore and exploit
 Considers the whole problem of a goal-directed agent interacting with an uncertain environment
Examples of Reinforcement Learning

 Robocup Soccer Teams (Stone & Veloso, Riedmiller et al.): world's best player of simulated soccer, 1999; runner-up 2000
 Inventory Management (Van Roy, Bertsekas, Lee & Tsitsiklis): 10–15% improvement over industry standard methods
 Dynamic Channel Assignment (Singh & Bertsekas, Nie & Haykin): world's best assigner of radio channels to mobile telephone calls
 Elevator Control (Crites & Barto): (probably) world's best down-peak elevator controller
 Many Robots: navigation, bi-pedal walking, grasping, switching between skills...
 TD-Gammon and Jellyfish (Tesauro, Dahl): world's best backgammon player
Supervised Learning

[Diagram: Inputs → Supervised Learning System → Outputs; training info = desired (target) outputs; Error = (target output – actual output)]

• Best examples: classification/identification systems, e.g. fault classification, WiFi location determination, workload identification
• E.g. classification systems: give the agent lots of data that has already been classified; the agent can "train" itself using this knowledge
• The training phase is non-trivial
• (Normally) no learning happens in the on-line phase
Reinforcement Learning

[Diagram: Inputs → RL System → Outputs ("actions"); training info = evaluations ("rewards" / "penalties"); objective: get as much reward as possible]

• Absolutely no "already existing training data"
• Agent learns ONLY by experience
Elements of RL

 State: state of the system
 Actions: suggested by the RL agent, taken by actuators in the system
    Actions change the state (deterministically or stochastically)
 Reward: immediate gain when taking action a in state s
 Value: long-term benefit of taking action a in state s
    Rewards gained over a long time horizon
The n-Armed Bandit Problem

 Choose repeatedly from one of n actions; each choice is called a play
 After each play a_t, you get a reward r_t, where

    E[r_t \mid a_t] = Q^*(a_t)

    These are the unknown action values; the distribution of r_t depends only on a_t
 The objective is to maximize the reward in the long term, e.g., over 1000 plays
 To solve the n-armed bandit problem, you must explore a variety of actions and exploit the best of them
The Exploration/Exploitation Dilemma

 Suppose you form action-value estimates

    Q_t(a) \approx Q^*(a)

 The greedy action at time t is

    a_t^* = \arg\max_a Q_t(a)

    a_t = a_t^*  →  exploitation
    a_t ≠ a_t^*  →  exploration

 You can't exploit all the time; you can't explore all the time
 You can never stop exploring; but you should always reduce exploring. Maybe.
Action-Value Methods

 Methods that adapt action-value estimates and nothing else, e.g.: suppose that by the t-th play, action a has been chosen k_a times, producing rewards r_1, r_2, ..., r_{k_a}; then the "sample average" estimate is

    Q_t(a) = \frac{r_1 + r_2 + \cdots + r_{k_a}}{k_a}

 and

    \lim_{k_a \to \infty} Q_t(a) = Q^*(a)
ε-Greedy Action Selection

 Greedy action selection:

    a_t = a_t^* = \arg\max_a Q_t(a)

 ε-greedy:

    a_t = a_t^* with probability 1 − ε, a random action with probability ε

. . . the simplest way to balance exploration and exploitation
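As a concrete illustration (not part of the original slides), here is a minimal sketch of ε-greedy selection over a table of action-value estimates; the list Q and the value of epsilon are assumptions for the example.

```python
import random

def epsilon_greedy(Q, epsilon):
    """Pick the greedy action with probability 1 - epsilon, a random action otherwise.

    Q is a list of current action-value estimates, one entry per action.
    """
    if random.random() < epsilon:
        return random.randrange(len(Q))                # explore: uniform random action
    return max(range(len(Q)), key=lambda a: Q[a])      # exploit: greedy action

# Example: 10 actions, all estimates initially zero, epsilon = 0.1
Q = [0.0] * 10
a = epsilon_greedy(Q, 0.1)
```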
10-Armed Testbed

 n = 10 possible actions
 Each Q^*(a) is chosen randomly from a normal distribution with mean 0 and variance 1
 Each reward r_t is also normal, with mean Q^*(a_t) and variance 1
 1000 plays per run
 Repeat the whole thing 2000 times and average the results
ε-Greedy Methods on the 10-Armed Testbed

[Results figure not reproduced here]
Incremental Implementation

Recall the sample-average estimation method: the average of the first k rewards is (dropping the dependence on a):

    Q_k = \frac{r_1 + r_2 + \cdots + r_k}{k}

Can we do this incrementally (without storing all the rewards)? We could keep a running sum and count, or, equivalently:

    Q_{k+1} = Q_k + \frac{1}{k+1} \left[ r_{k+1} - Q_k \right]

This is a common form for update rules:

    NewEstimate = OldEstimate + StepSize [Target – OldEstimate]
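A sketch of this incremental update in code (variable names are illustrative, not from the slides):

```python
def update_sample_average(Q, N, a, reward):
    """Incrementally update the sample-average estimate for action a.

    Q[a] is the current estimate; N[a] counts how often a has been chosen.
    """
    N[a] += 1
    Q[a] += (1.0 / N[a]) * (reward - Q[a])   # step size 1/k reproduces the sample average

# Example: three rewards observed for action 0
Q, N = [0.0, 0.0], [0, 0]
for r in (1.0, 0.0, 1.0):
    update_sample_average(Q, N, 0, r)
print(Q[0])   # 0.666..., the average of the observed rewards
```

Replacing the 1/N[a] step size with a constant α gives the recency-weighted update discussed on the next slide.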
Tracking a Nonstationary Problem

Choosing Q_k to be a sample average is appropriate in a stationary problem, i.e., when none of the Q^*(a) change over time, but not in a nonstationary problem. Better in the nonstationary case is:

    Q_{k+1} = Q_k + \alpha \left[ r_{k+1} - Q_k \right],  for constant α, 0 < α ≤ 1

    Q_k = (1-\alpha)^k Q_0 + \sum_{i=1}^{k} \alpha (1-\alpha)^{k-i} r_i

an exponential, recency-weighted average.
Formal Description of RL

The Agent-Environment Interface

Agent and environment interact at discrete time steps t = 0, 1, 2, ...
 Agent observes state at step t:        s_t ∈ S
 produces action at step t:             a_t ∈ A(s_t)
 gets resulting reward:                 r_{t+1} ∈ ℜ
 and resulting next state:              s_{t+1}

[Diagram: the trajectory ... s_t --a_t--> r_{t+1}, s_{t+1} --a_{t+1}--> r_{t+2}, s_{t+2} --a_{t+2}--> r_{t+3}, s_{t+3} --a_{t+3}--> ...]
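This interface can be written as a simple interaction loop. The sketch below is illustrative only; env and agent are hypothetical objects with reset/step and act/learn methods, not part of the original slides.

```python
def run_episode(env, agent, max_steps=1000):
    """One episode of agent-environment interaction at discrete time steps."""
    s = env.reset()                       # observe the initial state s_0
    for t in range(max_steps):
        a = agent.act(s)                  # choose a_t from A(s_t)
        s_next, r, done = env.step(a)     # environment returns r_{t+1} and s_{t+1}
        agent.learn(s, a, r, s_next)      # update the agent from this experience
        s = s_next
        if done:                          # terminal state reached
            break
```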
The Agent Learns a Policy

 Policy at step t, π_t: a mapping from states to action probabilities

    \pi_t(s, a) = probability that a_t = a when s_t = s

 Reinforcement learning methods specify how the agent changes its policy as a result of experience.
 Roughly, the agent's goal is to get as much reward as it can over the long run.
Getting the Degree of Abstraction Right

 Time steps need not refer to fixed intervals of real time.
 Actions can be low level (e.g., voltages to motors), high level (e.g., accept a job offer), "mental" (e.g., shift in focus of attention), etc.
 States can be low-level "sensations", or they can be abstract, symbolic, based on memory, or subjective (e.g., the state of being "surprised" or "lost").
 An RL agent is not like a whole animal or robot.
 Reward computation is in the agent's environment because the agent cannot change it arbitrarily.
 The environment is not necessarily unknown to the agent, only incompletely controllable.
Goals and Rewards

 Is a scalar reward signal an adequate notion of a goal?—maybe not, but it is surprisingly flexible.
 A goal should specify what we want to achieve, not how we want to achieve it.
 A goal must be outside the agent's direct control—thus outside the agent.
 The agent must be able to measure success:
    explicitly;
    frequently during its lifespan.
Returns

Suppose the sequence of rewards after step t is r_{t+1}, r_{t+2}, r_{t+3}, ...
What do we want to maximize?

In general, we want to maximize the expected return, E\{R_t\}, for each step t.

Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze.

    R_t = r_{t+1} + r_{t+2} + \cdots + r_T,

where T is a final time step at which a terminal state is reached, ending an episode.
Returns for Continuing Tasks

Continuing tasks: interaction does not have natural episodes.

Discounted return:

    R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},

where γ, 0 ≤ γ ≤ 1, is the discount rate.

    shortsighted  0 ← γ → 1  farsighted
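For a finite reward sequence, the discounted return can be computed directly; a tiny illustrative helper (not from the slides):

```python
def discounted_return(rewards, gamma):
    """R_t = r_{t+1} + gamma * r_{t+2} + gamma^2 * r_{t+3} + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1, 1, 1], 0.9))   # 1 + 0.9 + 0.81 = 2.71
```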
An Example

Avoid failure: the pole falling beyond a critical angle, or the cart hitting the end of the track.

As an episodic task, where the episode ends upon failure:
    reward = +1 for each step before failure
    ⇒ return = number of steps before failure

As a continuing task with discounted return:
    reward = −1 upon failure; 0 otherwise
    ⇒ return = −γ^k, for k steps before failure

In either case, the return is maximized by avoiding failure for as long as possible.
A Unified Notation

 In episodic tasks, we number the time steps of each episode starting from zero.
 We usually do not have to distinguish between episodes, so we write s_t instead of s_{t,j} for the state at step t of episode j.
 Think of each episode as ending in an absorbing state that always produces a reward of zero.
 We can cover all cases by writing

    R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},

where γ can be 1 only if a zero-reward absorbing state is always reached.
The Markov Property

 "The state" at step t means whatever information is available to the agent at step t about its environment.
 The state can include immediate "sensations," highly processed sensations, and structures built up over time from sequences of sensations.
 Ideally, a state should summarize past sensations so as to retain all "essential" information, i.e., it should have the Markov Property:

    \Pr\{ s_{t+1} = s', r_{t+1} = r \mid s_t, a_t, r_t, s_{t-1}, a_{t-1}, \ldots, r_1, s_0, a_0 \} = \Pr\{ s_{t+1} = s', r_{t+1} = r \mid s_t, a_t \}

for all s', r, and histories s_t, a_t, r_t, s_{t-1}, a_{t-1}, ..., r_1, s_0, a_0.
Markov Decision Processes

 If a reinforcement learning task has the Markov Property, it is basically a Markov Decision Process (MDP).
 If the state and action sets are finite, it is a finite MDP.
 To define a finite MDP, you need to give:
    the state and action sets
    the one-step "dynamics", defined by transition probabilities:

        P^a_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s, a_t = a \}  for all s, s' ∈ S, a ∈ A(s)

    the expected values of reward:

        R^a_{ss'} = E\{ r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s' \}  for all s, s' ∈ S, a ∈ A(s)
An Example Finite MDP: the Recycling Robot

 At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge.
 Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad).
 Decisions are made on the basis of the current energy level: high, low.
 Reward = number of cans collected
Recycling Robot MDP

    S = {high, low}
    A(high) = {search, wait}
    A(low) = {search, wait, recharge}

    R^search = expected no. of cans while searching
    R^wait = expected no. of cans while waiting
    R^search > R^wait

[Transition graph not reproduced here]
Value Functions

 The value of a state is the expected return starting from that state; it depends on the agent's policy.

    State-value function for policy π:

        V^\pi(s) = E_\pi\{ R_t \mid s_t = s \} = E_\pi\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s \right\}

 The value of taking an action in a state under policy π is the expected return starting from that state, taking that action, and thereafter following π.

    Action-value function for policy π:

        Q^\pi(s, a) = E_\pi\{ R_t \mid s_t = s, a_t = a \} = E_\pi\left\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a \right\}
Bellman Equation for a Policy π

The basic idea:

    R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots
        = r_{t+1} + \gamma ( r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \cdots )
        = r_{t+1} + \gamma R_{t+1}

So:

    V^\pi(s) = E_\pi\{ R_t \mid s_t = s \} = E_\pi\{ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s \}

Or, without the expectation operator:

    V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]
More on the Bellman Equation

    V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]

This is a set of equations (in fact, linear), one for each state. The value function for π is its unique solution.

[Backup diagrams for V^π and Q^π not reproduced here]
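A minimal sketch of solving these linear equations by iterative sweeps (iterative policy evaluation), assuming the dictionary representation used for the recycling robot above; the discount and tolerance values are illustrative defaults.

```python
def policy_evaluation(P, policy, gamma=0.9, theta=1e-8):
    """Iteratively compute V^pi from V(s) = sum_a pi(s,a) sum_s' P [R + gamma V(s')].

    P[s][a]      : list of (next_state, probability, expected_reward) triples
    policy[s][a] : probability of taking action a in state s
    """
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for s2, p, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```

For example, evaluating an equiprobable random policy on the recycling-robot dictionaries sketched earlier converges to the unique solution of the corresponding linear system.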
Optimal Value Functions

 For finite MDPs, policies can be partially ordered:

    π ≥ π'  if and only if  V^\pi(s) ≥ V^{\pi'}(s) for all s ∈ S

 There are always one or more policies that are better than or equal to all the others. These are the optimal policies. We denote them all π*.
 Optimal policies share the same optimal state-value function:

    V^*(s) = \max_\pi V^\pi(s)  for all s ∈ S

 Optimal policies also share the same optimal action-value function:

    Q^*(s, a) = \max_\pi Q^\pi(s, a)  for all s ∈ S and a ∈ A(s)

    This is the expected return for taking action a in state s and thereafter following an optimal policy.
Bellman Optimality Equation for V*

The value of a state under an optimal policy must equal the expected return for the best action from that state:

    V^*(s) = \max_{a \in A(s)} Q^{\pi^*}(s, a)
           = \max_{a \in A(s)} E\{ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \}
           = \max_{a \in A(s)} \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]

[The relevant backup diagram is not reproduced here]

V* is the unique solution of this system of nonlinear equations.
Bellman Optimality Equation for Q*

    Q^*(s, a) = E\{ r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a \}
              = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \right]

[The relevant backup diagram is not reproduced here]

Q* is the unique solution of this system of nonlinear equations.
Gridworld

 Actions: north, south, east, west; deterministic.
 If an action would take the agent off the grid: no move, but reward = –1
 Other actions produce reward = 0, except actions that move the agent out of the special states A and B, as shown in the figure.

[Figure: the gridworld and the state-value function for the equiprobable random policy; γ = 0.9]
Why Optimal State-Value Functions are Useful

 Any policy that is greedy with respect to V* is an optimal policy.
 Therefore, given V*, one-step-ahead search produces the long-term optimal actions.

E.g., back to the gridworld:

[Figure: V* and a corresponding optimal policy π*, not reproduced here]
What About Optimal Action-Value Functions?

Given Q*, the agent does not even have to do a one-step-ahead search:

    \pi^*(s) = \arg\max_{a \in A(s)} Q^*(s, a)
Solving the Bellman Optimality Equation

 Finding an optimal policy by solving the Bellman Optimality Equation requires:
    accurate knowledge of the environment dynamics;
    enough space and time to do the computation;
    the Markov Property.
 How much space and time do we need?
    polynomial in the number of states (via dynamic programming methods; Chapter 4),
    BUT the number of states is often huge (e.g., backgammon has about 10^20 states).
 We usually have to settle for approximations.
 Many RL methods can be understood as approximately solving the Bellman Optimality Equation.
Solving MDPs

 The goal in an MDP is to find the policy that maximizes long-term reward (value).
 If all the information is there, this can be done by explicitly solving the equations, or by iterative methods such as value iteration (see the sketch below).
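A minimal sketch of value iteration, which repeatedly applies the Bellman optimality backup until the values stop changing; it assumes the same (next_state, probability, expected_reward) dictionary representation used in the earlier sketches.

```python
def value_iteration(P, gamma=0.9, theta=1e-8):
    """Compute V* by repeated Bellman optimality backups, then read off a greedy policy."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            q = {a: sum(p * (r + gamma * V[s2]) for s2, p, r in P[s][a]) for a in P[s]}
            best = max(q.values())
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    policy = {}
    for s in P:                     # greedy policy with respect to V*
        q = {a: sum(p * (r + gamma * V[s2]) for s2, p, r in P[s][a]) for a in P[s]}
        policy[s] = max(q, key=q.get)
    return V, policy
```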
Policy Iteration

    π_0 → V^{π_0} → π_1 → V^{π_1} → ... → π* → V* → π*

    policy evaluation: compute V^π for the current policy
    policy improvement ("greedification"): make the policy greedy with respect to V^π

Policy Iteration

[Algorithm box not reproduced here; a code sketch follows.]
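A compact sketch of policy iteration under the same assumptions as the earlier dynamic-programming sketches; it reuses the policy_evaluation() function defined above.

```python
def policy_iteration(P, gamma=0.9):
    """Alternate policy evaluation and greedy policy improvement until stable."""
    # start from a uniform random policy
    policy = {s: {a: 1.0 / len(P[s]) for a in P[s]} for s in P}
    while True:
        V = policy_evaluation(P, policy, gamma)        # policy evaluation (earlier sketch)
        stable = True
        for s in P:                                    # policy improvement ("greedification")
            q = {a: sum(p * (r + gamma * V[s2]) for s2, p, r in P[s][a]) for a in P[s]}
            greedy = max(q, key=q.get)
            new_row = {a: (1.0 if a == greedy else 0.0) for a in P[s]}
            if new_row != policy[s]:
                stable = False
            policy[s] = new_row
        if stable:
            return policy, V
```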
From MDPs to Reinforcement Learning

MDPs vs RL

 MDPs are great if we know the transition probabilities

    P^a_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s, a_t = a \}  for all s, s' ∈ S, a ∈ A(s)

 and the expected immediate rewards

    R^a_{ss'} = E\{ r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s' \}  for all s, s' ∈ S, a ∈ A(s)

 But in real-life tasks this is precisely what we do NOT know!
 This is what turns the task into a learning problem.
 The simplest method to learn: Monte Carlo.
Monte Carlo Methods

 Monte Carlo methods learn from complete sample returns
    Only defined for episodic tasks
 Monte Carlo methods learn directly from experience
    On-line: no model necessary, and still attains optimality
    Simulated: no need for a full model
Monte Carlo Policy Evaluation

 Goal: learn V^π(s)
 Given: some number of episodes under π which contain s
 Idea: average the returns observed after visits to s

 Every-Visit MC: average the returns for every time s is visited in an episode
 First-Visit MC: average the returns only for the first time s is visited in an episode
 Both converge asymptotically
First-Visit Monte Carlo Policy Evaluation

[Algorithm box not reproduced here; a code sketch follows.]
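A minimal sketch of first-visit MC policy evaluation. Episodes are assumed to be given as lists of (state, reward) pairs, where the reward is the one received after leaving that state; this data format is an assumption for the example.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) by averaging the returns that follow the first visit to s."""
    returns = defaultdict(list)
    for episode in episodes:
        # compute the return G_t following each step, working backwards
        G = 0.0
        G_after = [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            G = episode[t][1] + gamma * G
            G_after[t] = G
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:                 # first visit to s in this episode
                seen.add(s)
                returns[s].append(G_after[t])
    return {s: sum(g) / len(g) for s, g in returns.items()}

# Example: the eight A/B episodes used later in these slides
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]
print(first_visit_mc(episodes))   # {'A': 0.0, 'B': 0.75}
```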
Monte Carlo Estimation of Action Values (Q)

 Monte Carlo is most useful when a model is not available
    We want to learn Q*
 Q^π(s, a): average return starting from state s and action a, thereafter following π
 Also converges asymptotically if every state-action pair is visited
 Exploring starts: every state-action pair has a non-zero probability of being the starting pair
Monte Carlo Control

 MC policy iteration: policy evaluation using MC methods, followed by policy improvement
 Policy improvement step: greedify with respect to the value (or action-value) function

Monte Carlo Exploring Starts

 The fixed point is the optimal policy π*
 Now proven (almost)
 Problem with Monte Carlo: learning happens only after the end of an entire episode
TD Prediction

Policy evaluation (the prediction problem): for a given policy π, compute the state-value function V^π.

Recall the simple every-visit Monte Carlo method:

    V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]

    target: the actual return after time t

The simplest TD method, TD(0):

    V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

    target: an estimate of the return
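The TD(0) update written as a single function; V is assumed to be a dict of state-value estimates with an entry for every state, and alpha is the step size.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0, terminal=False):
    """V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]   (TD(0))

    The value of a terminal next state is taken to be 0.
    """
    target = r + (0.0 if terminal else gamma * V[s_next])
    V[s] += alpha * (target - V[s])
```

The corresponding constant-α MC update would be V[s] += alpha * (G - V[s]), but it can only be applied after the episode ends, once the full return G is known.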
Simple Monte Carlo

    V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]

where R_t is the actual return following state s_t.

[Backup diagram: the entire remainder of the episode, from s_t to the terminal state T]
Simplest TD Method

    V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]

[Backup diagram: a single sampled step, from s_t via r_{t+1} to s_{t+1}]
cf. Dynamic Programming

    V(s_t) ← E_π { r_{t+1} + γ V(s_{t+1}) }

[Backup diagram: a full expected backup over all actions and successor states of s_t]
TD methods bootstrap and sample

 Bootstrapping: the update involves an estimate
    MC does not bootstrap
    DP bootstraps
    TD bootstraps
 Sampling: the update does not involve an expected value
    MC samples
    DP does not sample
    TD samples
Example: Driving Home

State                 Elapsed Time (min)   Predicted Time to Go   Predicted Total Time
leaving office        0                    30                     30
reach car, raining    5    (5)             35                     40
exit highway          20   (15)            15                     35
behind truck          30   (10)            10                     40
home street           40   (10)            3                      43
arrive home           43   (3)             0                      43

(Parenthesized numbers are the elapsed minutes of each leg.)
Driving Home

Changes recommended by Monte Carlo methods (α = 1) vs. changes recommended by TD methods (α = 1)

[Figure not reproduced here]
Advantages of TD Learning

 TD methods do not require a model of the environment, only experience
 TD, but not MC, methods can be fully incremental
    You can learn before knowing the final outcome
        Less memory
        Less peak computation
    You can learn without the final outcome
        From incomplete sequences
 Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?
Random Walk Example

[Figure: values learned by TD(0) after various numbers of episodes]

TD and MC on the Random Walk

[Figure: learning curves, with data averaged over 100 sequences of episodes]
Optimality of TD(0)

Batch updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence.

Compute updates according to TD(0), but only update estimates after each complete pass through the data.

For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α.

Constant-α MC also converges under these conditions, but to a different answer!

Random Walk under Batch Updating

After each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence. All of this was repeated 100 times.

[Figure not reproduced here]
You are the Predictor

Suppose you observe the following 8 episodes:

    A, 0, B, 0
    B, 1
    B, 1
    B, 1
    B, 1
    B, 1
    B, 1
    B, 0

What is V(A)? What is V(B)?

You are the Predictor

[Figure: the Markov model estimated from these episodes]  V(A)?

You are the Predictor

 The prediction that best matches the training data is V(A) = 0
    This minimizes the mean-squared error on the training set
    This is what a batch Monte Carlo method gets
 If we consider the sequentiality of the problem, then we would set V(A) = 0.75
    In the data, A always leads to B with reward 0, and the return from B is 1 in 6 of the 8 episodes, so V(B) = 0.75 and hence V(A) = 0 + V(B) = 0.75
    This is correct for the maximum-likelihood estimate of a Markov model generating the data
    i.e., if we do a best-fit Markov model, assume it is exactly correct, and then compute what it predicts (how?)
    This is called the certainty-equivalence estimate
    This is what TD(0) gets
Learning an Action-Value Function

 Estimate Q^π for the current behavior policy π.
 After every transition from a nonterminal state s_t, do this:

    Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]

 If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0.
Sarsa: On-Policy TD Control

Turn this into a control method by always updating the policy to be greedy with respect to the current estimate:

[Algorithm box not reproduced here; a code sketch of the update follows.]
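The Sarsa update as a function; Q is assumed to be a dict keyed by (state, action) with an entry for every pair, and a_next is the action actually chosen in s_next by the behavior policy (e.g., ε-greedy).

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0, terminal=False):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]   (Sarsa, on-policy)."""
    target = r + (0.0 if terminal else gamma * Q[(s_next, a_next)])
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```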
Windy Gridworld
undiscounted, episodic, reward = –1 until goal
Results of Sarsa on the Windy
Gridworld
Q-Learning: Off-Policy TD Control

One-step Q-learning:

    Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t) ]
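The corresponding update in code; Q is again a dict keyed by (state, action), and actions is the set A(s') of actions available in the next state. The max over next actions is what makes the update independent of the behavior policy (off-policy).

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=1.0, terminal=False):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]   (Q-learning)."""
    best_next = 0.0 if terminal else max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```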
Planning and Learning

 Use of environment models
 Integration of planning and learning methods

The Original Idea (Sutton, 1990)

[Figures not reproduced here]
Models

 Model: anything the agent can use to predict how the environment will respond to its actions
 Distribution model: description of all possibilities and their probabilities
    e.g., P^a_{ss'} and R^a_{ss'} for all s, s', and a ∈ A(s)
 Sample model: produces sample experiences
    e.g., a simulation model
 Both types of models can be used to produce simulated experience
 Often sample models are much easier to come by
Planning

 Planning: any computational process that uses a model to create or improve a policy
 Planning in AI:
    state-space planning
    plan-space planning (e.g., partial-order planners)
 We take the following (unusual) view:
    all state-space planning methods involve computing value functions, either explicitly or implicitly
    they all apply backups to simulated experience
Planning (cont.)

 Classical DP methods are state-space planning methods
 Heuristic search methods are state-space planning methods
 A planning method based on Q-learning: Random-Sample One-Step Tabular Q-Planning
Learning, Planning, and Acting

 Two uses of real experience:
    model learning: to improve the model
    direct RL: to directly improve the value function and policy
 Improving the value function and/or policy via a model is sometimes called indirect RL or model-based RL. Here, we call it planning.
Direct vs. Indirect RL

 Indirect methods:
    make fuller use of experience: get a better policy with fewer environment interactions
 Direct methods:
    simpler
    not affected by bad models

But they are very closely related and can be usefully combined: planning, acting, model learning, and direct RL can occur simultaneously and in parallel.
The Dyna Architecture (Sutton, 1990)

[Architecture diagram not reproduced here]

The Dyna-Q Algorithm

[Algorithm box not reproduced here; it interleaves direct RL, model learning, and planning. A code sketch follows.]
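A condensed sketch of one Dyna-Q step under the usual tabular assumptions: Q is a dict keyed by (state, action), the learned model is a dict mapping (state, action) to the observed (reward, next_state) (i.e., a deterministic model), and env_step is a hypothetical environment call; n_planning simulated updates are done per real step.

```python
import random

def dyna_q_step(Q, model, env_step, s, a, n_planning=10, alpha=0.1, gamma=0.95):
    """One Dyna-Q step: act, direct RL update, model learning, then planning."""
    # (a) real experience
    r, s_next = env_step(s, a)

    # (b) direct RL: one Q-learning update from the real transition
    next_acts = [act for (st, act) in Q if st == s_next]
    best = max((Q[(s_next, act)] for act in next_acts), default=0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best - Q.get((s, a), 0.0))

    # (c) model learning: remember what the environment did
    model[(s, a)] = (r, s_next)

    # (d) planning: n_planning Q-learning updates on simulated experience from the model
    for _ in range(n_planning):
        (ps, pa), (pr, ps_next) = random.choice(list(model.items()))
        p_acts = [act for (st, act) in Q if st == ps_next]
        p_best = max((Q[(ps_next, act)] for act in p_acts), default=0.0)
        Q[(ps, pa)] = Q.get((ps, pa), 0.0) + alpha * (pr + gamma * p_best - Q.get((ps, pa), 0.0))

    return s_next
```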
Dyna-Q on a Simple Maze

Rewards are 0 until the goal is reached, when the reward is +1.
Value Prediction with FA

As usual, policy evaluation (the prediction problem): for a given policy π, compute the state-value function V^π.

In earlier chapters, value functions were stored in lookup tables. Here, the value function estimate at time t, V_t, depends on a parameter vector θ_t, and only the parameter vector is updated.

e.g., θ_t could be the vector of connection weights of a neural network.
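A sketch of gradient-descent TD(0) prediction with a linear approximator, V(s) = θᵀ x(s), where x(s) is a feature vector for state s; the feature representation and step size are assumptions for the example.

```python
import numpy as np

def linear_td0_update(theta, x_s, r, x_s_next, alpha=0.01, gamma=1.0, terminal=False):
    """theta <- theta + alpha * [r + gamma * V(s') - V(s)] * grad V(s)

    For a linear approximator V(s) = theta @ x(s), the gradient with respect
    to theta is simply the feature vector x(s).
    """
    v_s = theta @ x_s
    v_next = 0.0 if terminal else theta @ x_s_next
    delta = r + gamma * v_next - v_s          # TD error
    theta += alpha * delta * x_s              # gradient step on the parameters
    return theta
```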
Adapt Supervised Learning Algorithms

[Diagram: Inputs → Supervised Learning System → Outputs; training info = desired (target) outputs]

Training example = {input, target output}
Error = (target output – actual output)