
Reinforcement Learning:
A Tutorial
Rich Sutton
AT&T Labs -- Research
www.research.att.com/~sutton
Outline
• Core RL Theory
– Value functions
– Markov decision processes
– Temporal-Difference Learning
– Policy Iteration
• The Space of RL Methods
– Function Approximation
– Traces
– Kinds of Backups
• Lessons for Artificial Intelligence
• Concluding Remarks
RL is Learning from Interaction
[Diagram: Agent–Environment interaction loop — the agent sends actions to the environment; the environment returns state and reward]
• complete agent
• temporally situated
• continual learning & planning
• object is to affect environment
• environment stochastic & uncertain
Ball-Balancing Demo
• RL applied to a real physical system
• Illustrates online learning
• An easy problem...but learns as you watch!
• Courtesy of GTE Labs:
Hamid Benbrahim
Judy Franklin
John Doleac
Oliver Selfridge
• Same learning algorithm as early pole-balancer
Markov Decision Processes (MDPs)
In RL, the environment is modeled as an MDP, defined by
S – set of states of the environment
A(s) – set of actions possible in state s ∈ S
P(s, s', a) – probability of transition from s to s' given a
R(s, s', a) – expected reward on transition s to s' given a
γ – discount rate for delayed reward
discrete time t = 0, 1, 2, . . . :
s_t, a_t → r_{t+1}, s_{t+1}, a_{t+1} → r_{t+2}, s_{t+2}, a_{t+2} → r_{t+3}, s_{t+3}, a_{t+3} → . . .
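To make the definition concrete, here is a minimal sketch (not from the tutorial) of how a small finite MDP might be written down in Python; the state and action names, probabilities, and rewards are all made up for illustration.

import random

# A minimal finite-MDP sketch (illustrative names, not from the tutorial):
# P[(s, a)] is a list of (next_state, probability); R[(s, a, s2)] is the expected reward.
P = {
    ("low", "wait"):     [("low", 1.0)],
    ("low", "recharge"): [("high", 1.0)],
    ("high", "work"):    [("high", 0.7), ("low", 0.3)],
    ("high", "wait"):    [("high", 1.0)],
}
R = {
    ("low", "wait", "low"): 0.0,
    ("low", "recharge", "high"): -1.0,
    ("high", "work", "high"): 2.0,
    ("high", "work", "low"): 2.0,
    ("high", "wait", "high"): 0.5,
}
gamma = 0.9  # discount rate for delayed reward (used by the learning updates, not by step())

def step(s, a):
    # Sample one transition s, a -> s', r from the MDP.
    next_states, probs = zip(*P[(s, a)])
    s2 = random.choices(next_states, weights=probs)[0]
    return s2, R[(s, a, s2)]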
The Objective is to Maximize
Long-term Total Discounted Reward
Find a policy π : s ∈ S → a ∈ A(s)   (could be stochastic)
that maximizes the value (expected future reward) of each state s:
V^π(s) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + . . . | s_t = s, π }
and each s, a pair:
Q^π(s, a) = E{ r_{t+1} + γ r_{t+2} + γ² r_{t+3} + . . . | s_t = s, a_t = a, π }
These are called value functions - cf. evaluation functions
Optimal Value Functions and Policies
There exist optimal value functions:
V*(s) = max_π V^π(s)
Q*(s, a) = max_π Q^π(s, a)
And corresponding optimal policies:
π*(s) = argmax_a Q*(s, a)
π* is the greedy policy with respect to Q*
1-Step Tabular Q-Learning
On each state transition s_t, a_t → r_{t+1}, s_{t+1}, update one table entry:
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
(the bracketed term is the TD error)
As t → ∞:  Q(s, a) → Q*(s, a)  and  π_t → π*   (Watkins, 1989)
Optimal behavior found without a model of the environment!
Assumes a finite MDP
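A sketch of that update in Python (assuming a tabular Q and a fixed, finite action set; the helper name is ours). Under the usual conditions — every pair visited infinitely often, suitable step sizes — Q converges to Q* as stated on the slide.

from collections import defaultdict

def q_learning_update(Q, s, a, r, s2, actions, alpha=0.1, gamma=0.9):
    # 1-step tabular Q-learning backup: move Q(s, a) toward r + gamma * max_a' Q(s', a')
    td_error = r + gamma * max(Q[(s2, a2)] for a2 in actions) - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error

# The "table" can simply be a dictionary of entries, initialized to zero.
Q = defaultdict(float)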
Example of Value Functions
[Figure: 4×4 gridworld example. Left column: V_k for the equiprobable random policy at k = 0, 1, 2, 3, 10, and ∞ (all zeros at k = 0, falling to roughly −14 to −22 at k = ∞). Right column: the greedy policy w.r.t. V_k, which becomes the optimal policy.]
Iterative policy evaluation:  V_{k+1}(s) = E{ r + γ V_k(s') }
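The gridworld values above can be reproduced with a few lines of iterative policy evaluation, V_{k+1}(s) = E{ r + γ V_k(s') }. This sketch assumes the standard 4×4 setup (two terminal corners, reward −1 per step, undiscounted, equiprobable random policy); names are ours.

# Iterative policy evaluation for the 4x4 gridworld on this slide.
N = 4
TERMINAL = {(0, 0), (N - 1, N - 1)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def step(state, action):
    if state in TERMINAL:
        return state, 0.0
    r, c = state
    nr, nc = r + action[0], c + action[1]
    if not (0 <= nr < N and 0 <= nc < N):   # bumping a wall leaves the state unchanged
        nr, nc = r, c
    return (nr, nc), -1.0

def evaluate_random_policy(k_sweeps, gamma=1.0):
    V = {(r, c): 0.0 for r in range(N) for c in range(N)}
    for _ in range(k_sweeps):
        # V_{k+1}(s) = sum over equiprobable actions of (reward + gamma * V_k(s'))
        V = {s: sum(0.25 * (rwd + gamma * V[s2])
                    for a in ACTIONS
                    for s2, rwd in [step(s, a)])
             for s in V}
    return V

V3 = evaluate_random_policy(3)   # matches the k = 3 values shown (about -2.4 to -3.0)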
Policy Iteration
Summary of RL Theory:
Generalized Policy Iteration
[Diagram: policy evaluation (value learning) takes the policy π to its value functions V^π, Q^π; policy improvement (greedification) takes V, Q to a greedy policy. Alternating the two converges to π*, V*, Q*.]
Outline
• Core RL Theory
– Value functions
– Markov decision processes
– Temporal-Difference Learning
– Policy Iteration
• The Space of RL Methods
– Function Approximation
– Traces
– Kinds of Backups
• Lessons for Artificial Intelligence
• Concluding Remarks
Sarsa Backup
On transition s_t, a_t → r_{t+1}, s_{t+1}, a_{t+1}   (Rummery & Niranjan; Holland; Sutton)
Update:
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]
(the bracketed term is the TD error)
[Backup diagram: the value of the pair s_t, a_t is updated from the value of the next pair s_{t+1}, a_{t+1}]
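A tabular sketch of this backup (helper name ours; a function approximator could stand in for the table):

from collections import defaultdict

def sarsa_update(Q, s, a, r, s2, a2, alpha=0.1, gamma=0.9):
    # Sarsa backup: the (s, a) value is updated from the value of the next pair (s2, a2).
    td_error = r + gamma * Q[(s2, a2)] - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return td_error

Q = defaultdict(float)   # tabular case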
Dimensions of RL algorithms
1. Action Selection and Exploration
2. Function Approximation
3. Eligibility Trace
4. On-line Experience vs Simulated Experience
5. Kind of Backups
. . .
1. Action Selection (Exploration)
OK, you've learned a value function, Q ≈ Q*. How do you pick actions?
Greedy Action Selection:
Always select the action that looks best:
π(s) = argmax_a Q(s, a)
ε-Greedy Action Selection:
Be greedy most of the time
Occasionally take a random action
Surprisingly, this is the state of the art.
(Exploitation vs. Exploration!)
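A sketch of ε-greedy selection, assuming a tabular Q indexed by (state, action) pairs (e.g., a defaultdict as in the earlier sketches):

import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    # Be greedy most of the time; occasionally (with probability epsilon) take a random action.
    if random.random() < epsilon:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit: the action that looks best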
2. Function Approximation
[Diagram: (s, a) → Function Approximator → Q(s, a), trained from targets or errors]
Could be:
• table
• Backprop Neural Network
• Radial-Basis-Function Network
• Tile Coding (CMAC)
• Nearest Neighbor, Memory Based
• Decision Tree
(the neural-network, RBF, and tile-coding approximators are typically trained by gradient-descent methods)
Adjusting Network Weights
Q(s, a) = f(s, a, w),   where w is the weight vector
e.g., gradient-descent Sarsa (using the standard backprop gradient):
w ← w + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ] ∇_w f(s_t, a_t, w)
(target value minus estimated value, times the gradient)
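A sketch of this update for the simplest case, a linear approximator over a feature vector; here `features(s, a)` is a stand-in for whatever re-representation is used, and for the linear case ∇_w Q is just that feature vector.

import numpy as np

def gradient_sarsa_update(w, features, s, a, r, s2, a2, alpha=0.1, gamma=1.0):
    # Gradient-descent Sarsa for a linear approximator Q(s, a) = w . features(s, a).
    x, x2 = features(s, a), features(s2, a2)
    td_error = r + gamma * np.dot(w, x2) - np.dot(w, x)
    w += alpha * td_error * x     # w <- w + alpha * [target - estimate] * grad_w Q
    return td_error

# Example setup: w = np.zeros(num_features); features(s, a) returns a NumPy vector of that length.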
Sparse Coarse Coding
[Diagram: a fixed, expansive re-representation maps the input to many binary features; a linear last layer combines the features]
Coarse: Large receptive fields
Sparse: Few features present at one time
The Mountain Car Problem
Moore, 1990
Goal
Gravity wins
Minimum-Time-to-Goal Problem
SITUATIONS: car's position
and velocity
ACTIONS: three thrusts:
forward, reverse, none
REWARDS: always –1 until
car reaches the goal
No Discounting
RL Applied to the Mountain Car
1. TD Method = Sarsa Backup
2. Action Selection = Greedy
with initial optimistic values
Q_0(s, a) = 0
3. Function approximator = CMAC
4. Eligibility traces = Replacing
Video Demo
(Sutton, 1995)
Tile Coding applied to Mountain Car
a.k.a. CMACs (Albus, 1980)
Shape/size of tiles → generalization
Number of tilings → resolution of the final approximation
[Figure: a 5×5 grid tiling; 10 tilings]
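A rough sketch of the idea (not the CMAC code used in the demo): several offset grid tilings over a 2-D input, each contributing one active tile, so the resulting binary feature vector is sparse and coarse. All parameter names and the offset scheme here are illustrative.

def tile_coding_features(x, y, num_tilings=10, tiles_per_dim=5,
                         x_range=(0.0, 1.0), y_range=(0.0, 1.0)):
    # Return the indices of the active tiles, one per tiling.
    active = []
    for t in range(num_tilings):
        offset = t / num_tilings          # each tiling is shifted by a fraction of a tile
        xs = (x - x_range[0]) / (x_range[1] - x_range[0]) * tiles_per_dim + offset
        ys = (y - y_range[0]) / (y_range[1] - y_range[0]) * tiles_per_dim + offset
        col = min(int(xs), tiles_per_dim)  # the shifted grid gets one extra row/column
        row = min(int(ys), tiles_per_dim)
        active.append((t, row, col))       # one active tile per tiling -> sparse features
    return active

# Each active tuple indexes a weight; Q(s, a) is then just the sum of the active weights.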
The Acrobot Problem
Goal: Raise tip above line
e.g., Dejong & Spong, 1994
Sutton, 1995
[Figure: two-link acrobot; torque applied at the second joint; joint angles θ_1, θ_2; tip marked]
Minimum–Time–to–Goal:
4 state variables:
2 joint angles
2 angular velocities
CMAC of 48 layers
RL same as Mountain Car
Reward = -1 per time step
3. Eligibility Traces (λ)
• The λ in TD(λ), Q(λ), Sarsa(λ), . . .
• A more primitive way of dealing with delayed reward
• The link to Monte Carlo methods
• Passing credit back farther than one action
• The general form is:
change in Q(s, a) at time t  ~  (TD error at time t) · (eligibility of s, a at time t)
The TD error is global; the eligibility is specific to s, a —
e.g., was s, a taken at t, or shortly before t?
Eligibility Traces
[Figure: visits to a pair s, a over time, with the resulting accumulating trace and replacing trace]
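A tabular Sarsa(λ) sketch showing how the traces are used: the TD error is computed once per step, and every eligible pair is credited in proportion to its trace (pass replacing=False for accumulating traces). Names and defaults are illustrative.

from collections import defaultdict

def sarsa_lambda_step(Q, e, s, a, r, s2, a2,
                      alpha=0.1, gamma=0.9, lam=0.9, replacing=True):
    # The TD error is global; the eligibility trace e is specific to each (s, a) pair.
    td_error = r + gamma * Q[(s2, a2)] - Q[(s, a)]
    # Bump the trace of the pair just taken: replacing traces reset to 1, accumulating traces add 1.
    e[(s, a)] = 1.0 if replacing else e[(s, a)] + 1.0
    # Credit every eligible pair in proportion to its trace, then decay the traces.
    for key in list(e):
        Q[key] += alpha * td_error * e[key]
        e[key] *= gamma * lam
    return td_error

Q, e = defaultdict(float), defaultdict(float)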
Complex TD Backups (*)
[Backup diagram for TD(λ): a weighted mixture of primitive TD backups over a trial, with weights 1−λ, (1−λ)λ, (1−λ)λ², . . . , and the remaining weight on the full-trial backup]
TD(λ) incrementally computes a weighted mixture of these backups as states are visited.
λ = 0: simple TD.   λ = 1: simple Monte Carlo.
Kinds of Backups
[Diagram: the space of backups has two dimensions — full backups vs. sample backups, and shallow backups vs. deep backups (depth controlled by bootstrapping, λ). Dynamic programming: full, shallow. Exhaustive search: full, deep. Temporal-difference learning: sample, shallow. Monte Carlo: sample, deep.]
Outline
• Core RL Theory
– Value functions
– Markov decision processes
– Temporal-Difference Learning
– Policy Iteration
• The Space of RL Methods
– Function Approximation
– Traces
– Kinds of Backups
• Lessons for Artificial Intelligence
• Concluding Remarks
World-Class Applications of RL
• TD-Gammon and Jellyfish
Tesauro, Dahl
World's best backgammon player
• Elevator Control
Crites & Barto
World's best down-peak elevator controller
• Inventory Management
Van Roy, Bertsekas, Lee & Tsitsiklis
10-15% improvement over industry standard methods
• Dynamic Channel Assignment
Singh & Bertsekas, Nie & Haykin
World's best assigner of radio channels to mobile telephone calls
• All these applications are large,
stochastic, optimal control problems
– Too hard for conventional methods to
solve exactly
– Require problem to be simplified
• RL just finds an approximate solution
– . . . which can be much better!
Lesson #1:
Approximate the Solution, Not the Problem
Backgammon
Tesauro, 1992,1995
SITUATIONS: configurations of the playing board (about 10^20)
ACTIONS: moves
REWARDS: win: +1;  lose: −1;  else: 0
Pure delayed reward
Tesauro, 1992–1995
TD-Gammon
[Network diagram: board position → neural network → estimated value; trained from the TD error V_{t+1} − V_t; action selection by 2–3 ply search]
Start with a random network
Play millions of games against self
Learn a value function from this simulated experience
This produces arguably the best player in the world
Lesson #2:
The Power of Learning from Experience
[Plot (Tesauro, 1992): performance against Gammontool vs. number of hidden units (0, 10, 20, 40, 80), with marks at 50% and 70%. TD-Gammon, trained by self-play, outperforms Neurogammon — the same network, but trained from 15,000 expert-labeled examples.]
Expert examples are expensive and scarce
Experience is cheap and plentiful!
And teaches the real solution
Lesson #3:
The Central Role of Value Functions
...of modifiable moment-by-moment estimates of
how well things are going
All RL methods learn value functions
All state-space planners compute value functions
Both are based on "backing up" value
Recognizing and reacting to the ups and downs of
life is an important part of intelligence
Lesson #4:
Learning and Planning
can be radically similar
Historically, planning and trial-and-error
learning have been seen as opposites
But RL treats both as processing of experience
[Diagram: on-line interaction with the environment produces experience, as does simulation from an environment model; learning and planning are both processes that turn experience into values and policies]
1-Step Tabular Q-Planning
1. Generate a state, s, and action, a
2. Consult model for next state, s', and reward, r
3. Learn from the experience, s,a,r,s' :
Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]
4. go to 1
With function approximation and cleverness in search
control (Step 1), this is a viable, perhaps even
superior, approach to state-space planning
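A sketch of that loop, assuming a deterministic tabular model where `model[(s, a)]` returns (next state, reward) and a fixed list of known (state, action) pairs to sample from; step 1 here is just uniform random selection, and smarter search control is where the cleverness would go.

import random

def q_planning(Q, model, state_actions, n_updates=1000, alpha=0.1, gamma=0.9):
    # 1-step tabular Q-planning: learn from simulated experience drawn from the model.
    for _ in range(n_updates):
        s, a = random.choice(state_actions)          # 1. generate a state and action
        s2, r = model[(s, a)]                        # 2. consult the model for s', r
        next_actions = [a2 for (s_, a2) in state_actions if s_ == s2]
        bootstrap = max((Q[(s2, a2)] for a2 in next_actions), default=0.0)
        Q[(s, a)] += alpha * (r + gamma * bootstrap - Q[(s, a)])   # 3. Q-learning backup
    return Q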
Lesson #5:
Accretive Computation
"Solves" Planning Dilemmas
[Dyna architecture diagram: acting produces experience; experience drives both direct RL and model learning; planning uses the model; direct RL and planning both update the value function/policy]
Processes only loosely
coupled
Can proceed in parallel,
asynchronously
Quality of solution
accumulates in value
& model memories
Reactivity/Deliberation dilemma
"solved" simply by not opposing search and
memory
Intractability of planning
"solved" by anytime improvement of solution
Dyna Algorithm
1. s  current state
2. Choose an action, a, and take it
3. Receive next state, s’, and reward, r
4. Apply RL backup to s, a, s’, r
e.g., Q-learning update
5. Update Model(s, a) with s’, r
6. Repeat k times:
- select a previously seen state-action pair s,a
- s’, r  Model(s, a)
- Apply RL backup to s, a, s’, r
7. Go to 1
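A Dyna-Q style sketch of one pass through that loop, assuming a deterministic environment (so the model can just memorize the last observed outcome), the same action set in every state, and an ε-greedy choice at step 2; `env_step(s, a)` stands in for the real environment, and all names are illustrative.

import random

def dyna_q_step(Q, model, env_step, s, actions, k=10,
                alpha=0.1, gamma=0.9, epsilon=0.1):
    # 2-3. choose an action (epsilon-greedy) and take it
    a = random.choice(actions) if random.random() < epsilon else \
        max(actions, key=lambda x: Q[(s, x)])
    s2, r = env_step(s, a)
    # 4. direct RL backup (Q-learning update) from the real experience
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, a2)] for a2 in actions) - Q[(s, a)])
    # 5. update the (deterministic, tabular) model
    model[(s, a)] = (s2, r)
    # 6. planning: k backups from previously seen state-action pairs
    for _ in range(k):
        ps, pa = random.choice(list(model))
        ps2, pr = model[(ps, pa)]
        Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, a2)] for a2 in actions) - Q[(ps, pa)])
    return s2   # 7. the caller continues from the new state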
Actions –> Behaviors
MDPs seem too flat, too low-level
Need to learn/plan/design agents at a higher level
at the level of behaviors
rather than just low-level actions
e.g., open-the-door rather than twitch-muscle-17
walk-to-work rather than 1-step-forward
Behaviors are based on entire policies,
directed toward subgoals
enable abstraction, hierarchy, modularity, transfer...
Rooms Example
[Gridworld diagram: 4 rooms connected by 4 hallways; hallway options o_1, o_2 marked; two candidate goal locations marked G?]
4 unreliable primitive actions: up, down, left, right (fail 33% of the time)
8 multi-step options (to each room's 2 hallways)
Given goal location,
quickly plan shortest route
Goal states are given
a terminal value of 1
All rewards zero
γ = 0.9
Planning by Value Iteration
[Figure: value iteration with V(goal) = 1, iterations #1–#3, shown both with primitive actions (cell-to-cell) and with behaviors (room-to-room)]
Behaviors are
much like Primitive Actions
• Define a higher level Semi-MDP,
overlaid on the original MDP
• Can be used with same planning methods
- faster planning
- guaranteed correct results
• Interchangeable with primitive actions
• Models and policies can be learned by TD methods
Provide a clear semantics for multi-level, re-usable
knowledge for the general MDP context
Lesson #6: Generality is no impediment to
working with higher-level knowledge
Summary of Lessons
1. Approximate the Solution, Not the Problem
2. The Power of Learning from Experience
3. The Central Role of Value Functions
in finding optimal sequential behavior
4. Learning and Planning can be Radically Similar
5. Accretive Computation "Solves Dilemmas"
6. A General Approach need not be Flat, Low-level
All stem from trying to solve MDPs
Outline
• Core RL Theory
– Value functions
– Markov decision processes
– Temporal-Difference Learning
– Policy Iteration
• The Space of RL Methods
– Function Approximation
– Traces
– Kinds of Backups
• Lessons for Artificial Intelligence
• Concluding Remarks
Strands of History of RL
[Timeline diagram: three strands converging in modern RL — trial-and-error learning (Thorndike, psychology, 1911), temporal-difference learning (secondary reinforcement, psychology), and optimal control / value functions (Hamilton, physics, 1800s; Bellman/Howard, OR) — running through Shannon, Minsky, Samuel, Klopf, Holland, Witten, Werbos, Barto et al., Sutton, and Watkins]
Radical Generality
of the RL/MDP problem
RL/MDPs: general stochastic dynamics, general goals, uncertainty, reactive decision-making
Classical: determinism, goals of achievement, complete knowledge, closed world, unlimited deliberation
Getting the degree of abstraction right
• Actions can be low level (e.g. voltages to
motors), high level (e.g., accept a job offer),
"mental" (e.g., shift in focus of attention)
• Situations: direct sensor readings, or symbolic
descriptions, or "mental" situations (e.g., the
situation of being lost)
• An RL agent is not like a whole animal or robot,
which may consist of many RL agents as well as
other components
• The environment is not necessarily unknown to
the RL agent, only incompletely controllable
• Reward computation is in the RL agent's
environment because the RL agent cannot
change it arbitrarily
Some of the Open Problems
• Incomplete state information
• Exploration
• Structured states and actions
• Incorporating prior knowledge
• Using teachers
• Theory of RL with function approximators
• Modular and hierarchical architectures
• Integration with other problem–solving
and planning methods
Dimensions of RL Algorithms
• TD --- Monte Carlo (λ)
• Model --- Model free
• Kind of Function Approximator
• Exploration method
• Discounted --- Undiscounted
• Computed policy --- Stored policy
• On-policy --- Off-policy
• Action values --- State values
• Sample model --- Distribution model
• Discrete actions --- Continuous actions
A Unified View
• There is a space of methods, with tradeoffs and mixtures
- rather than discrete choices
• Value functions are key
- all the methods are about learning, computing, or
estimating values
- all the methods store a value function
• All methods are based on backups
- different methods just have backups of different
shapes and sizes,
sources and mixtures
Psychology and RL
PAVLOVIAN CONDITIONING
The single most common and important kind of learning
Ubiquitous - from people to paramecia
learning to predict and anticipate, learning affect
TD methods are a compelling match to the data
Most modern, real-time methods are based on Temporal-Difference Learning
INSTRUMENTAL LEARNING (OPERANT CONDITIONING)
"Actions followed by good outcomes are repeated"
Goal-directed learning, The Law of Effect
Biological version of Reinforcement Learning
But careful comparisons to data have not yet been made
Hawkins & Kandel
Hopfield & Tank
Klopf
Moore et al.
Byrne et al.
Sutton & Barto
Malaka
Neuroscience and RL
RL interpretations of brain structures are
being explored in several areas:
Basal Ganglia
Dopamine Neurons
Value Systems
Final Summary
• RL is about a problem:
- goal-directed, interactive, real-time decision making
- all hail the problem!
• Value functions seem to be the key to solving it,
and TD methods the key new algorithmic idea
• There are a host of immediate applications
- large-scale stochastic optimal control and
scheduling
• There are also a host of scientific problems
• An exciting time
Bibliography
Sutton, R.S., Barto, A.G., Reinforcement Learning: An
Introduction. MIT Press 1998.
Kaelbling, L.P., Littman, M.L., and Moore, A.W.
"Reinforcement learning: A survey". Journal of Artificial
Intelligence Research, 4, 1996.
Special issues on Reinforcement Learning: Machine
Learning, 8(3/4),1992, and 22(1/2/3), 1996.
Neuroscience: Schultz, W., Dayan, P., and Montague, P.R. "A
neural substrate of prediction and reward". Science,
275:1593--1598, 1997.
Animal Learning: Sutton, R.S. and Barto, A.G. "Time-derivative
models of Pavlovian reinforcement". In Gabriel, M. and
Moore, J., Eds., Learning and Computational Neuroscience,
pp. 497–537. MIT Press, 1990.
See also web pages, e.g., starting from
http://www.cs.umass.edu/~rich.