Reinforcement Learning
HUT Spatial Intelligence course August/September 2004
Bram Bakker
Computer Science, University of Amsterdam
bram@science.uva.nl
Overview day 1 (Monday 13-16)
 Basic concepts
 Formalized model
 Value functions
 Learning value functions
 In-class assignment & discussion
Overview day 2 (Tuesday 9-12)
 Learning value functions more efficiently
 Generalization
 Case studies
 In-class assignment & discussion
Overview day 3 (Thursday 13-16)
 Models and planning
 Multi-agent reinforcement learning
 Other advanced RL issues
 Presentation of home assignments & discussion
Machine Learning
• What is it?
– Subfield of Artificial Intelligence
– Making computers learn tasks rather than directly programming them
• Why is it interesting?
– Some tasks are very difficult to program, or difficult to optimize, so learning might be better
• Relevance for geoinformatics/spatial intelligence:
– Geoinformatics deals with many such tasks: transport optimization, water management, etc.
Classes of Machine Learning techniques
 Supervised learning
 Works by instructing the learning system what output
to give for each input
 Unsupervised learning
 Clustering inputs based on similarity (e.g. Kohonen
Self-organizing maps)
 Reinforcement learning
 Works by letting the learning system learn
autonomously what is good and bad
Some well-known Machine Learning techniques
• Neural networks
– Work in a way analogous to brains; can be used with supervised, unsupervised, or reinforcement learning, or with genetic algorithms
• Genetic algorithms
– Work in a way analogous to evolution
• Ant Colony Optimization
– Works in a way analogous to ant colonies
What is Reinforcement Learning?
 Learning from interaction
 Goal-oriented learning
 Learning about, from, and while interacting with an
external environment
 Learning what to do—how to map situations to
actions—so as to maximize a numerical reward signal
Some Notable RL Applications
 TD-Gammon: Tesauro
– world’s best backgammon program
 Elevator Control: Crites & Barto
– high performance elevator controller
 Dynamic Channel Assignment: Singh & Bertsekas, Nie & Haykin
– high performance assignment of radio channels to mobile
telephone calls
 Traffic light control: Wiering et al., Choy et al.
– high performance control of traffic lights to optimize traffic
flow
 Water systems control: Bhattacharya et al.
– high performance control of water levels of regional water
systems
Relationships to other fields
[Diagram: RL sits at the crossroads of several fields — artificial intelligence (planning methods), control theory and operations research, psychology, neuroscience, and artificial neural networks.]
Recommended literature
• Sutton & Barto (1998). Reinforcement Learning: An Introduction. MIT Press.
• Kaelbling, Littman, & Moore (1996). Reinforcement learning: a survey. Journal of Artificial Intelligence Research, vol. 4, pp. 237–285.
Complete Agent
 Temporally situated
 Continual learning and planning
 Agent affects the environment
 Environment is stochastic and uncertain
[Diagram: the agent–environment loop — the agent sends actions to the environment; the environment returns the new state and a reward.]
Supervised Learning
Training info = desired (target) outputs
Inputs → Supervised Learning System → Outputs
Error = (target output – actual output)
Reinforcement Learning (RL)
Training info = evaluations (“rewards” / “penalties”)
Inputs → RL System → Outputs (“actions”)
Objective: get as much reward as possible
Key Features of RL
 Learner is not told which actions to take
 Trial-and-Error search
 Possibility of delayed reward
 Sacrifice short-term gains for greater long-term gains
 The need to explore and exploit
 Considers the whole problem of a goal-directed agent
interacting with an uncertain environment
What is attractive about RL?
 Online, “autonomous” learning without a need for
preprogrammed behavior or instruction
 Learning to satisfy long-term goals
 Applicable to many tasks
Some RL History
[Timeline figure: three historical threads converging on modern RL — trial-and-error learning (Thorndike, 1911), temporal-difference learning (growing out of work on secondary reinforcement), and optimal control / value functions (Hamilton in 1800s physics; Bellman and Howard in operations research) — with contributions by Shannon, Minsky, Samuel, Klopf, Holland, Witten, Werbos, Barto et al., Sutton, and Watkins.]
Elements of RL
 Policy: what to do
 Maps states to actions
 Reward: what is good
 Value: what is good because it predicts reward
 Reflects total, long-term reward
 Model: what follows what
 Maps states and actions to new states and rewards
An Extended Example: Tic-Tac-Toe
[Figure: a tic-tac-toe game tree, branching on x's moves and o's moves through possible board positions.]
Assume an imperfect opponent: he/she sometimes makes mistakes.
An RL Approach to Tic-Tac-Toe
1. Make a table with one entry per state:
State                                   V(s) – estimated probability of winning
typical non-terminal state              0.5 (initially unknown)
three x's in a row (win)                1
three o's in a row (loss)               0
board full, no three in a row (draw)    0
2. Now play lots of games. To pick our moves, look ahead one step from the current state to the various possible next states.
Usually just pick the next state with the highest estimated probability of winning — the largest V(s); a greedy move.
But 10% of the time pick a move at random; an exploratory move.
RL Learning Rule for Tic-Tac-Toe
Let s denote the state before our greedy move and s' the state after our greedy move (exploratory moves are not backed up).
We increment each V(s) toward V(s') — a backup:
$V(s) \leftarrow V(s) + \alpha \left[ V(s') - V(s) \right]$
where $\alpha$ is the step-size parameter, a small positive fraction, e.g. $\alpha = 0.1$.
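Below is a minimal Python sketch of this update rule (my own illustration; the dictionary V and the helper name td_update are hypothetical, with board states represented as strings):

```python
# Temporal-difference update for the tic-tac-toe value table.
# V is a dict mapping (hashable) board states to estimated win probabilities.

ALPHA = 0.1  # step-size parameter

def td_update(V, s, s_next, default=0.5):
    """Move V[s] a fraction ALPHA toward V[s_next] (a 'backup')."""
    v_s = V.get(s, default)
    v_next = V.get(s_next, default)
    V[s] = v_s + ALPHA * (v_next - v_s)

# Example: after a greedy move takes us from one board position to the next,
# back up the value of the earlier state toward the value of the later one.
V = {}
td_update(V, "x..o.....", "x..ox....")
print(V["x..o....."])  # 0.5 here, since both states still have the default value
```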
How can we improve this T.T.T. player?
 Take advantage of symmetries
 representation/generalization
 Do we need “random” moves? Why?
 Do we always need a full 10%?
 Can we learn from “random” moves?
 Can we learn offline?
 Pre-training from self play?
 Using learned models of opponent?
...
How is Tic-Tac-Toe easy?
 Small number of states and actions
 Small number of steps until reward
...
RL Formalized
Agent and environment interact at discrete time steps $t = 0, 1, 2, \ldots$
Agent observes state at step t: $s_t \in S$
produces action at step t: $a_t \in A(s_t)$
gets resulting reward: $r_{t+1} \in \mathbb{R}$
and resulting next state: $s_{t+1}$
This produces the trajectory $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, \ldots$
The Agent Learns a Policy
Policy at step t, $\pi_t$:
a mapping from states to action probabilities,
$\pi_t(s, a) = $ probability that $a_t = a$ when $s_t = s$
 Reinforcement learning methods specify how the agent
changes its policy as a result of experience.
 Roughly, the agent’s goal is to get as much reward as it can
over the long run.
Getting the Degree of Abstraction Right
 Time steps need not refer to fixed intervals of real time.
 Actions can be low level (e.g., voltages to motors), or high
level (e.g., accept a job offer), “mental” (e.g., shift in focus
of attention), etc.
 States can be low-level “sensations”, or they can be
abstract, symbolic, based on memory, or subjective (e.g.,
the state of being “surprised” or “lost”).
 Reward computation is in the agent’s environment because
the agent cannot change it arbitrarily.
Goals and Rewards
 Is a scalar reward signal an adequate notion of a goal?—
maybe not, but it is surprisingly flexible.
 A goal should specify what we want to achieve, not how
we want to achieve it.
 A goal must be outside the agent’s direct control—thus
outside the agent.
 The agent must be able to measure success:
 explicitly;
 frequently during its lifespan.
Returns
Suppose the sequence of rewards after step t is $r_{t+1}, r_{t+2}, r_{t+3}, \ldots$
What do we want to maximize?
In general, we want to maximize the expected return, $E\{R_t\}$, for each step t.
Episodic tasks: interaction breaks naturally into episodes, e.g., plays of a game, trips through a maze. Then
$R_t = r_{t+1} + r_{t+2} + \cdots + r_T$,
where T is a final time step at which a terminal state is reached, ending an episode.
Returns for Continuing Tasks
Continuing tasks: interaction does not have natural episodes.
Discounted return:
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$,
where $\gamma$, $0 \le \gamma \le 1$, is the discount rate:
shortsighted $\gamma \to 0$, farsighted $\gamma \to 1$.
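As a toy illustration of the discounted return (my own numbers, not from the slides):

```python
# Discounted return R_t = sum_k gamma^k * r_{t+k+1} for a finite reward sequence.
def discounted_return(rewards, gamma):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0, 0.0, 0.0, 5.0]                 # r_{t+1}, r_{t+2}, r_{t+3}, r_{t+4}
print(discounted_return(rewards, gamma=0.9))   # 1 + 0.9**3 * 5 = 4.645
```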
An Example
Avoid failure: the pole falling beyond
a critical angle or the cart hitting end of
track.
As an episodic task where the episode ends upon failure:
reward = +1 for each step before failure
⇒ return = number of steps before failure
As a continuing task with discounted return:
reward = −1 upon failure; 0 otherwise
⇒ return = $-\gamma^k$, for k steps before failure
In either case, return is maximized by
avoiding failure for as long as possible.
Another Example
Get to the top of the hill
as quickly as possible.
reward = −1 for each step where not at top of hill
⇒ return = −(number of steps before reaching top of hill)
Return is maximized by minimizing the number of steps taken to reach the top of the hill.
A Unified Notation
• Think of each episode as ending in an absorbing state that always produces a reward of zero.
• We can cover all cases by writing
$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$,
where $\gamma$ can be 1 only if a zero-reward absorbing state is always reached.
The Markov Property
 A state should retain all “essential” information, i.e., it should
have the Markov Property:
Prst 1  s,rt 1  r st ,at ,rt ,st 1 , at 1 , , r1 , s0 , a0  
Prst 1  s,rt 1  r st , at 
for all s,r ,and histories st ,at ,rt ,st 1 , at 1 , , r1 , s0 , a0 .
Markov Decision Processes
 If a reinforcement learning task has the Markov Property, it is
a Markov Decision Process (MDP).
 If state and action sets are finite, it is a finite MDP.
 To define a finite MDP, you need to give:
 state and action sets
 one-step “dynamics” defined by state transition
probabilities:
$P^a_{ss'} = \Pr\{ s_{t+1} = s' \mid s_t = s, a_t = a \}$ for all $s, s' \in S$, $a \in A(s)$
• expected rewards:
$R^a_{ss'} = E\{ r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s' \}$ for all $s, s' \in S$, $a \in A(s)$
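A minimal sketch of one possible way to write such a finite MDP down in code (the nested-dictionary layout, state names, and numbers below are my own illustration):

```python
# A tiny finite MDP: mdp[s][a] is a list of (probability, next_state, expected_reward).
mdp = {
    "s0": {
        "left":  [(1.0, "s0", 0.0)],
        "right": [(0.8, "s1", 0.0), (0.2, "s0", 0.0)],
    },
    "s1": {
        "left":  [(1.0, "s0", 0.0)],
        "right": [(1.0, "terminal", 1.0)],
    },
    "terminal": {},   # absorbing state: no actions available
}

def expected_value(mdp, s, a, V, gamma=0.9):
    """One-step lookahead: sum over s' of P[s][a] * (R + gamma * V[s'])."""
    return sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in mdp[s][a])
```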
Value Functions
 The value of a state is the expected return starting from
that state; depends on the agent’s policy:
State-value function for policy $\pi$:
$V^\pi(s) = E_\pi\{ R_t \mid s_t = s \} = E_\pi\Big\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s \Big\}$
• The value of taking an action in a state under policy $\pi$ is the expected return starting from that state, taking that action, and thereafter following $\pi$:
Action-value function for policy $\pi$:
$Q^\pi(s, a) = E_\pi\{ R_t \mid s_t = s, a_t = a \} = E_\pi\Big\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a \Big\}$
Bellman Equation for a Policy 
The basic idea:
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots$
$\;\;\, = r_{t+1} + \gamma \big( r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \cdots \big)$
$\;\;\, = r_{t+1} + \gamma R_{t+1}$
So:
$V^\pi(s) = E_\pi\{ R_t \mid s_t = s \} = E_\pi\{ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s \}$
Or, without the expectation operator:
$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V^\pi(s') \big]$
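Turning this equation into an update rule gives iterative policy evaluation. A minimal sketch, assuming the hypothetical dictionary MDP layout from the earlier sketch and an equiprobable random policy:

```python
# Iterative policy evaluation: repeatedly apply the Bellman equation as an update,
# V(s) <- sum_a pi(s,a) sum_s' P[s][a] (R + gamma V(s')), until values stop changing.
def evaluate_random_policy(mdp, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s, actions in mdp.items():
            if not actions:          # terminal / absorbing state keeps value 0
                continue
            pi = 1.0 / len(actions)  # equiprobable random policy
            v_new = sum(
                pi * sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V
```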
Gridworld
 Actions: north, south, east, west; deterministic.
 If action would take agent off the grid: no move but reward = –1
 Other actions produce reward = 0, except actions that move agent out
of special states A and B as shown.
[Figure: the resulting state-value function for the equiprobable random policy, with $\gamma$ = 0.9.]
Optimal Value Functions
• For finite MDPs, policies can be partially ordered:
$\pi \ge \pi'$ if and only if $V^\pi(s) \ge V^{\pi'}(s)$ for all $s \in S$
• There is always at least one (and possibly many) policy that is better than or equal to all the others. This is an optimal policy. We denote them all by $\pi^*$.
• Optimal policies share the same optimal state-value function:
$V^*(s) = \max_\pi V^\pi(s)$ for all $s \in S$
• Optimal policies also share the same optimal action-value function:
$Q^*(s, a) = \max_\pi Q^\pi(s, a)$ for all $s \in S$ and $a \in A(s)$
This is the expected return for taking action a in state s and thereafter following an optimal policy.
Bellman Optimality Equation for V*
The value of a state under an optimal policy must equal the expected return for the best action from that state:
$V^*(s) = \max_{a \in A(s)} Q^*(s, a)$
$\;\;\;\;\;\;\;\; = \max_{a \in A(s)} E\{ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \}$
$\;\;\;\;\;\;\;\; = \max_{a \in A(s)} \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V^*(s') \big]$
$V^*$ is the unique solution of this system of nonlinear equations.
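The same equation suggests value iteration: sweep over the states, replacing each value by the best one-step lookahead. A minimal sketch under the same hypothetical MDP layout:

```python
# Value iteration: V(s) <- max_a sum_s' P[s][a] (R + gamma V(s')).
def value_iteration(mdp, gamma=0.9, theta=1e-6):
    V = {s: 0.0 for s in mdp}
    while True:
        delta = 0.0
        for s, actions in mdp.items():
            if not actions:
                continue
            v_new = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # A greedy policy with respect to V* is optimal.
    policy = {
        s: max(actions, key=lambda a: sum(p * (r + gamma * V[s2])
                                          for p, s2, r in actions[a]))
        for s, actions in mdp.items() if actions
    }
    return V, policy
```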
Bellman Optimality Equation for Q*
$Q^*(s, a) = E\big\{ r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \,\big|\, s_t = s, a_t = a \big\}$
$\;\;\;\;\;\;\;\;\;\;\; = \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma \max_{a'} Q^*(s', a') \big]$
$Q^*$ is the unique solution of this system of nonlinear equations.
Why Optimal State-Value Functions are Useful
Any policy that is greedy with respect to $V^*$ is an optimal policy.
Therefore, given $V^*$, one-step-ahead search produces the long-term optimal actions.
E.g., back to the gridworld: [figure showing $V^*$ and the corresponding greedy, optimal policy]
What About Optimal Action-Value Functions?
Given $Q^*$, the agent does not even have to do a one-step-ahead search:
$\pi^*(s) = \arg\max_{a \in A(s)} Q^*(s, a)$
Solving the Bellman Optimality Equation
 Finding an optimal policy by solving the Bellman Optimality Equation
exactly requires the following:

accurate knowledge of environment dynamics;

we have enough space and time to do the computation;

the Markov Property.
 How much space and time do we need?

polynomial in number of states (via dynamic programming
methods),
 BUT, number of states is often huge (e.g., backgammon has about
10**20 states).
 We usually have to settle for approximations.
 Many RL methods can be understood as approximately solving the
Bellman Optimality Equation.
Temporal Difference (TD) Learning
 Basic idea: transform the Bellman Equation into an update
rule, using two consecutive timesteps
 Policy Evaluation: learn approximation to the value
function of the current policy
 Policy Improvement: Act greedily with respect to the
intermediate, learned value function
 Repeating this over and over again leads to
approximations of the optimal value function
Q-Learning: TD-learning of action values
One-step Q-learning:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big]$
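A minimal tabular Q-learning sketch of this update (the environment interface with reset(), step(), and actions() is my own assumption, not something given in the course; it uses ε-greedy exploration, discussed on the following slides):

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """One-step Q-learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    Q = defaultdict(float)                      # Q[(state, action)] -> value
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(s, a)
            target = r if done else r + gamma * max(Q[(s_next, a_)] for a_ in env.actions(s_next))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```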
Exploration/Exploitation revisited
 Suppose you form estimates
$Q_t(s, a) \approx Q^*(s, a)$ — action-value estimates
• The greedy action at t is $a_t^* = \arg\max_a Q_t(s, a)$
$a_t = a_t^*$: exploitation
$a_t \ne a_t^*$: exploration
 You can’t exploit all the time; you can’t explore all the time
 You can never stop exploring; but you should always reduce
exploring
ε-Greedy Action Selection
• Greedy action selection:
$a_t = a_t^* = \arg\max_a Q_t(s, a)$
• ε-Greedy:
$a_t = a_t^*$ with probability $1 - \varepsilon$, a random action with probability $\varepsilon$
. . . the simplest way to try to balance exploration and exploitation
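A minimal ε-greedy selection helper (hypothetical names; Q is assumed to be a dict keyed by (state, action)):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```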
Softmax Action Selection
 Softmax action selection methods grade action probs. by
estimated values.
 The most common softmax uses a Gibbs, or Boltzmann,
distribution:
Choose action a on play t with probability
$\dfrac{e^{Q_t(s,a)/\tau}}{\sum_{b=1}^{n} e^{Q_t(s,b)/\tau}}$,
where $\tau$ is the “computational temperature”.
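A minimal sketch of Boltzmann (softmax) action selection (same hypothetical Q layout; tau is the temperature):

```python
import math
import random

def softmax_action(Q, state, actions, tau=1.0):
    """Sample an action with probability proportional to exp(Q(s,a)/tau)."""
    prefs = [math.exp(Q.get((state, a), 0.0) / tau) for a in actions]
    total = sum(prefs)
    probs = [p / total for p in prefs]
    return random.choices(actions, weights=probs, k=1)[0]
```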
Pole balancing learned using RL
Improving the basic TD learning scheme
 Can we learn more efficiently?
 Can we update multiple values at the same timestep?
 Can we look ahead further in time, rather than just use the
value at the next timestep?
 Yes! All these can be done simultaneously with one
extension: eligibility traces
N-step TD Prediction
 Idea: Look farther into the future when you do TD backup
(1, 2, 3, …, n steps)
Mathematics of N-step TD Prediction
 Monte Carlo:
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$
• TD:
$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$
– Use V to estimate the remaining return
• n-step TD:
– 2-step return: $R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$
– n-step return: $R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$
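A small sketch computing the n-step return from a stored trajectory (hypothetical variable layout: rewards[k] holds $r_{k+1}$ and states[k] holds $s_k$, and the trajectory is assumed long enough):

```python
def n_step_return(rewards, states, V, t, n, gamma):
    """R_t^(n) = r_{t+1} + ... + gamma^{n-1} r_{t+n} + gamma^n V(s_{t+n})."""
    G = 0.0
    for k in range(n):
        G += (gamma ** k) * rewards[t + k]         # rewards[t+k] is r_{t+k+1}
    G += (gamma ** n) * V.get(states[t + n], 0.0)  # bootstrap from V(s_{t+n})
    return G
```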
Learning with N-step Backups
 Backup (on-line or off-line):
$V_t(s_t) \leftarrow V_t(s_t) + \alpha \big[ R_t^{(n)} - V_t(s_t) \big]$
Random Walk Example
 How does 2-step TD work here?
 How about 3-step TD?
Forward View of TD(λ)
• TD(λ) is a method for averaging all n-step backups
– weight the n-step return by $\lambda^{n-1}$ (time since visitation)
– λ-return:
$R_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$
• Backup using the λ-return:
$V_t(s_t) \leftarrow V_t(s_t) + \alpha \big[ R_t^\lambda - V_t(s_t) \big]$
λ-return Weighting Function
Relation to TD(0) and MC
• The λ-return can be rewritten as:
$R_t^\lambda = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R_t^{(n)} + \lambda^{T-t-1} R_t$
(the sum covers the n-step returns until termination, the final term the full return after termination)
• If λ = 1, you get MC:
$R_t^\lambda = (1 - 1) \sum_{n=1}^{T-t-1} 1^{n-1} R_t^{(n)} + 1^{T-t-1} R_t = R_t$
• If λ = 0, you get TD(0):
$R_t^\lambda = (1 - 0) \sum_{n=1}^{T-t-1} 0^{n-1} R_t^{(n)} + 0^{T-t-1} R_t = R_t^{(1)}$
Forward View of TD(λ) II
 Look forward from each state to determine update from
future states and rewards:
λ-return on the Random Walk
• Same random walk as before, but now with 19 states
• Why do you think intermediate values of λ are best?
Backward View of TD(λ)
• The forward view was for theory
• The backward view is for mechanism
• New variable called the eligibility trace: $e_t(s) \in \mathbb{R}^+$
• On each step, decay all traces by $\gamma\lambda$ and increment the trace for the current state by 1 (accumulating trace):
$e_t(s) = \gamma\lambda\, e_{t-1}(s)$ if $s \ne s_t$;  $e_t(s) = \gamma\lambda\, e_{t-1}(s) + 1$ if $s = s_t$
Backward View
$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$
• Shout $\delta_t$ backwards over time
• The strength of your voice decreases with temporal distance by $\gamma\lambda$
Relation of the Backward View to MC & TD(0)
• Using the update rule:
$\Delta V_t(s) = \alpha\, \delta_t\, e_t(s)$
• As before, if you set λ to 0, you get TD(0)
• If you set λ to 1, you get MC but in a better way
– Can apply TD(1) to continuing tasks
– Works incrementally and on-line (instead of waiting until the end of the episode)
Forward View = Backward View
• The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view
• Sutton & Barto's book shows:
$\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \sum_{t=0}^{T-1} \Delta V_t^{\lambda}(s_t)\, I_{s s_t}$
(backward updates on the left, forward updates on the right; the algebra is shown in the book)
Q(λ)-learning
• Zero out the eligibility trace after a non-greedy action. Do the max when backing up at the first non-greedy choice:
$e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 & \text{if } s = s_t,\ a = a_t,\ Q_{t-1}(s_t,a_t) = \max_a Q_{t-1}(s_t,a) \\ 0 & \text{if } Q_{t-1}(s_t,a_t) \ne \max_a Q_{t-1}(s_t,a) \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$
$Q_{t+1}(s,a) = Q_t(s,a) + \alpha\, \delta_t\, e_t(s,a)$
$\delta_t = r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)$
Q(λ)-learning
Initialize Q(s, a) arbitrarily and e(s, a) = 0, for all s, a
Repeat (for each episode):
  Initialize s, a
  Repeat (for each step of episode):
    Take action a, observe r, s'
    Choose a' from s' using policy derived from Q (e.g. ε-greedy)
    a* ← argmax_b Q(s', b)  (if a' ties for the max, then a* ← a')
    δ ← r + γ Q(s', a*) − Q(s, a)
    e(s, a) ← e(s, a) + 1
    For all s, a:
      Q(s, a) ← Q(s, a) + α δ e(s, a)
      If a' = a*, then e(s, a) ← γ λ e(s, a)
      else e(s, a) ← 0
    s ← s'; a ← a'
  Until s is terminal
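A runnable sketch of the tabular algorithm above, i.e. Watkins's Q(λ) with an ε-greedy behaviour policy (the environment interface is again an assumption):

```python
import random
from collections import defaultdict

def watkins_q_lambda(env, episodes=500, alpha=0.1, gamma=0.9, lam=0.9, epsilon=0.1):
    """Tabular Watkins's Q(lambda): traces are cut after non-greedy (exploratory) actions."""
    Q = defaultdict(float)                       # Q[(state, action)]

    def choose(s):                               # epsilon-greedy behaviour policy
        if random.random() < epsilon:
            return random.choice(env.actions(s))
        return max(env.actions(s), key=lambda b: Q[(s, b)])

    for _ in range(episodes):
        e = defaultdict(float)                   # eligibility traces e[(state, action)]
        s = env.reset()
        a = choose(s)
        done = False
        while not done:
            s2, r, done = env.step(s, a)
            if done:
                delta = r - Q[(s, a)]
                next_was_greedy = True
            else:
                a2 = choose(s2)                  # next action from the behaviour policy
                a_star = max(env.actions(s2), key=lambda b: Q[(s2, b)])
                delta = r + gamma * Q[(s2, a_star)] - Q[(s, a)]
                next_was_greedy = Q[(s2, a2)] == Q[(s2, a_star)]
            e[(s, a)] += 1.0
            for key in list(e):                  # update all traced state-action pairs
                Q[key] += alpha * delta * e[key]
                e[key] = gamma * lam * e[key] if next_was_greedy else 0.0
            if not done:
                s, a = s2, a2
    return Q
```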
Q(λ) Gridworld Example
 With one trial, the agent has much more information about how to get
to the goal

not necessarily the best way
 Can considerably accelerate learning
Conclusions: TD(λ)/Q(λ) methods
 Can significantly speed learning
 Robustness against unreliable value estimations (e.g.
caused by Markov violation)
 Does have a cost in computation
Generalization and Function Approximation
• Look at how experience with a limited part of the state set can be used to produce good behavior over a much larger part
• Overview of function approximation (FA) methods and how they can be adapted to RL
Generalization
[Figure: left, a lookup table with a separate value entry for each state s1, s2, s3, …, sN, where training updates only the single entry trained on; right, a generalizing function approximator mapping states to values V.]
So with function approximation a single value update affects a larger region of the state space.
Value Prediction with FA
Before, value functions were stored in lookup tables.
Now, the value function estimate at time t, $V_t$, depends on a parameter vector $\vec\theta_t$, and only the parameter vector is updated.
E.g., $\vec\theta_t$ could be the vector of connection weights of a neural network.
Adapt Supervised Learning Algorithms
Training info = desired (target) outputs
Inputs → Supervised Learning System → Outputs
Training example = {input, target output}
Error = (target output – actual output)
Backups as Training Examples
e.g., the TD(0) backup:
$V(s_t) \leftarrow V(s_t) + \alpha \big[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \big]$
As a training example:
input = description of $s_t$,  target output = $r_{t+1} + \gamma V(s_{t+1})$
Any FA Method?
In principle, yes:
 artificial neural networks
 decision trees
 multivariate regression methods
 etc.
But RL has some special requirements:
 usually want to learn while interacting
 ability to handle nonstationarity
 other?
Gradient Descent Methods
$\vec\theta_t = \big( \theta_t(1), \theta_t(2), \ldots, \theta_t(n) \big)^T$
Assume $V_t$ is a (sufficiently smooth) differentiable function of $\vec\theta_t$, for all $s \in S$.
Assume, for now, training examples of this form:
{description of $s_t$, $V^\pi(s_t)$}
Performance Measures for Gradient Descent
• Many are applicable but…
• a common and simple one is the mean-squared error (MSE) over a distribution P:
$MSE(\vec\theta_t) = \sum_{s \in S} P(s) \big[ V^\pi(s) - V_t(s) \big]^2$
Gradient Descent
Let f be any function of the parameter space. Its gradient at any point $\vec\theta_t$ in this space is:
$\nabla_{\vec\theta} f(\vec\theta_t) = \Big( \dfrac{\partial f(\vec\theta_t)}{\partial \theta(1)}, \dfrac{\partial f(\vec\theta_t)}{\partial \theta(2)}, \ldots, \dfrac{\partial f(\vec\theta_t)}{\partial \theta(n)} \Big)^T$
Iteratively move down the gradient:
$\vec\theta_{t+1} = \vec\theta_t - \alpha \nabla_{\vec\theta} f(\vec\theta_t)$
Control with FA
Learning state-action values. Training examples of the form:
{description of $(s_t, a_t)$, $v_t$}
The general gradient-descent rule:
$\vec\theta_{t+1} = \vec\theta_t + \alpha \big[ v_t - Q_t(s_t, a_t) \big] \nabla_{\vec\theta} Q_t(s_t, a_t)$
Gradient-descent Q(λ) (backward view):
$\vec\theta_{t+1} = \vec\theta_t + \alpha\, \delta_t\, \vec e_t$
where
$\delta_t = r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)$
$\vec e_t = \gamma\lambda\, \vec e_{t-1} + \nabla_{\vec\theta} Q_t(s_t, a_t)$
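A minimal sketch of the λ = 0 special case of this rule with linear function approximation, where $Q(s,a) = \vec\theta \cdot \phi(s,a)$ and therefore $\nabla_{\vec\theta} Q(s,a) = \phi(s,a)$ (the feature function below is a made-up placeholder; a real system would use e.g. tile coding):

```python
import numpy as np

def features(state, action, n_features=8):
    """Placeholder feature vector phi(s, a); deterministic random projection."""
    rng = np.random.default_rng(abs(hash((state, action))) % (2**32))
    return rng.standard_normal(n_features)

def linear_q_update(theta, s, a, r, s_next, next_actions, alpha=0.01, gamma=0.9):
    """theta <- theta + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)] * phi(s,a),
    since with linear FA the gradient of Q(s,a) w.r.t. theta is phi(s,a)."""
    phi = features(s, a)
    q_sa = float(theta @ phi)
    q_next = max(float(theta @ features(s_next, b)) for b in next_actions) if next_actions else 0.0
    delta = r + gamma * q_next - q_sa
    return theta + alpha * delta * phi

# Example usage with hypothetical states and actions:
theta = np.zeros(8)
theta = linear_q_update(theta, "s0", "right", 1.0, "s1", ["left", "right"])
```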
Linear Gradient Descent Q(λ)
Mountain-Car Task
Mountain-Car Results
Summary
 Generalization can be done in those cases where there are
too many states
 Adapting supervised-learning function approximation
methods
 Gradient-descent methods
Case Studies
 Illustrate the promise of RL
 Illustrate the difficulties, such as long learning times,
finding good state representations
TD Gammon
Tesauro 1992, 1994, 1995, ...
 Objective is to advance all pieces
to points 19-24
 30 pieces, 24 locations implies
enormous number of
configurations
 Effective branching factor of 400
A Few Details
 Reward: 0 at all times except those in which the game is
won, when it is 1
 Episodic (game = episode), undiscounted
 Gradient descent TD(l) with a multi-layer neural network
 weights initialized to small random numbers
 backpropagation of TD error
 four input units for each point; unary encoding of
number of white pieces, plus other features
 Learning during self-play
Multi-layer Neural Network
Summary of TD-Gammon Results
The Acrobot
Spong 1994
Sutton 1996
Acrobot Learning Curves for Q(λ)
Typical Acrobot Learned Behavior
Elevator Dispatching
Crites and Barto 1996
State Space
• 18 hall call buttons: 2^18 combinations
• positions and directions of cars: 18^4 (rounding to nearest floor)
• motion states of cars (accelerating, moving, decelerating, stopped, loading, turning): 6
• 40 car buttons: 2^40
• Set of passengers waiting at each floor, each passenger's arrival time and destination: unobservable. However, 18 real numbers are available giving elapsed time since hall buttons pushed; we discretize these.
• Set of passengers riding each car and their destinations: observable only through the car buttons
Conservatively about 10^22 states
Control Strategies
• Zoning: divide building into zones; park in zone
when idle. Robust in heavy traffic.
• Search-based methods: greedy or non-greedy.
Receding Horizon control.
• Rule-based methods: expert systems/fuzzy logic;
from human “experts”
• Other heuristic methods: Longest Queue First (LQF),
Highest Unanswered Floor First (HUFF), Dynamic
Load Balancing (DLB)
• Adaptive/Learning methods: NNs for prediction,
parameter space search using simulation, DP on
simplified model, non-sequential RL
Performance Criteria
Minimize:
• Average wait time
• Average system time (wait + travel time)
• % waiting > T seconds (e.g., T = 60)
• Average squared wait time (to encourage fast and fair service)
Average Squared Wait Time
Instantaneous cost:
$r_\tau = \sum_p \big( \mathrm{wait}_p(\tau) \big)^2$
Define the return as an integral rather than a sum (Bradtke and Duff, 1994):
$\sum_{t=0}^{\infty} \gamma^t r_t \quad\text{becomes}\quad \int_0^{\infty} e^{-\beta\tau}\, r_\tau \, d\tau$
Algorithm
Repeat forever:
1. In state x at time t_x, car c must decide to STOP or CONTINUE
2. It selects an action using a Boltzmann distribution (with decreasing temperature) based on current Q values
3. The next decision by car c is required in state y at time t_y
4. Implement the gradient-descent version of the following backup using backprop:
$Q(x,a) \leftarrow Q(x,a) + \alpha \Big[ \int_{t_x}^{t_y} e^{-\beta(\tau - t_x)} r_\tau \, d\tau + e^{-\beta(t_y - t_x)} \max_{a'} Q(y, a') - Q(x,a) \Big]$
5. $x \leftarrow y$, $t_x \leftarrow t_y$
Neural Networks
47 inputs, 20 sigmoid hidden units, 1 or 2
output units
Inputs:
• 9 binary: state of each hall down button
• 9 real: elapsed time of hall down button if pushed
• 16 binary: one on at a time: position and direction
of car making decision
• 10 real: location/direction of other cars
• 1 binary: at highest floor with waiting passenger?
• 1 binary: at floor with longest waiting passenger?
• 1 bias unit, constant 1
Elevator Results
Dynamic Channel Allocation
Details in:
Singh and Bertsekas 1997
Helicopter flying
 Difficult nonlinear control problem
 Also difficult for humans
 Approach: learn in simulation, then transfer to real
helicopter
 Uses function approximator for generalization
 Bagnell, Ng, and Schneider (2001, 2003, …)
In-class assignment
 Think again of your own RL problem, with states, actions,
and rewards
 This time think especially about how uncertainty may
play a role, and about how generalization may be
important
 Discussion
Homework assignment
 Due Thursday 13-16
 Think again of your own RL problem, with states, actions,
and rewards
 Do a web search on your RL problem or related work
 What is there already, and what, roughly, have they done to
solve the RL problem?
 Present briefly in class
Overview day 3
 Summary of what we’ve learnt about RL so far
 Models and planning
 Multi-agent RL
 Presentation of homework assignments and discussion
RL summary
 Objective: maximize the total amount of (discounted)
reward
 Approach: estimate a value function (defined over state
space) which represents this total amount of reward
 Learn this value function incrementally by doing updates
based on values of consecutive states (temporal differences).
One-step Q-learning:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \big[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \big]$
 After having learnt optimal value function, optimal behavior
can be obtained by taking action which has or leads to
highest value
 Use function approximation techniques for generalization if
state space becomes too large for tables
RL weaknesses
 Still “art” involved in defining good state (and action)
representations
 Long learning times
Planning and Learning
 Use of environment models
 Integration of planning and learning methods
Models
 Model: anything the agent can use to predict how the
environment will respond to its actions
 Models can be used to produce simulated experience
Planning
 Planning: any computational process that uses a model to
create or improve a policy
Learning, Planning, and Acting
 Two uses of real experience:

model learning: to improve
the model

direct RL: to directly
improve the value function
and policy
 Improving value function
and/or policy via a model is
sometimes called indirect RL or
model-based RL. Here, we call
it planning.
Direct vs. Indirect RL
Indirect methods:
 make fuller use of
experience: get
better policy with
fewer environment
interactions
Direct methods
 simpler
 not affected by bad
models
But they are very closely related and can be usefully combined:
planning, acting, model learning, and direct RL can occur
simultaneously and in parallel
The Dyna Architecture (Sutton 1990)
The Dyna-Q Algorithm
[Pseudocode figure: each time step combines direct RL (a Q-learning update from real experience), model learning (storing the observed next state and reward for the taken state-action pair), and planning (repeated Q-learning updates from simulated experience generated by the model); a runnable sketch follows below.]
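A minimal tabular Dyna-Q sketch combining the three components named above (direct RL, model learning, planning); the environment interface is assumed as in the earlier sketches:

```python
import random
from collections import defaultdict

def dyna_q(env, episodes=50, planning_steps=10, alpha=0.1, gamma=0.95, epsilon=0.1):
    Q = defaultdict(float)      # Q[(state, action)]
    model = {}                  # model[(state, action)] = (reward, next_state, done)

    def choose(s):
        if random.random() < epsilon:
            return random.choice(env.actions(s))
        return max(env.actions(s), key=lambda b: Q[(s, b)])

    def backup(s, a, r, s2, done):
        target = r if done else r + gamma * max(Q[(s2, b)] for b in env.actions(s2))
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = choose(s)
            s2, r, done = env.step(s, a)
            backup(s, a, r, s2, done)            # direct RL from real experience
            model[(s, a)] = (r, s2, done)        # model learning (deterministic model)
            for _ in range(planning_steps):      # planning from simulated experience
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                backup(ps, pa, pr, ps2, pdone)
            s = s2
    return Q
```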
Dyna-Q on a Simple Maze
rewards = 0 until the goal is reached, when reward = 1
Dyna-Q Snapshots: Midway in 2nd Episode
Using Dyna-Q for real-time robot learning
Before learning
After learning (approx. 15 minutes)
Multi-agent RL
 So far considered only single-agent RL
 But many domains have multiple agents!
 Group of industrial robots working on a single car
 Robot soccer
 Traffic
 Can we extend the methods of single-agent RL to multiagent RL?
Dimensions of multi-agent RL
 Is the objective to maximize individual rewards or to
maximize global rewards?
 Competition vs. cooperation
 Do the agents share information?
 Shared state representation?
 Communication?
 Homogeneous or heterogeneous agents?
 Do some agents have special capabilities?
Competition
 Like multiple single-agent cases simultaneously
 Related to game theory
 Nash equilibria etc.
 Research goals
 study how to optimize individual rewards in the face of
competition
 study group dynamics
Cooperation
 More different from single-agent case than competition
 How can we make the individual agents work together?
 Are rewards shared among the agents?
 should all agents be punished for individual mistakes?
Robot soccer example: cooperation
 Riedmiller group in Karlsruhe
 Robots must play together to beat other groups of robots in
Robocup tournaments
 Riedmiller group uses reinforcement learning techniques to
do this
Opposite approaches to cooperative case
 Consider the multi-agent system as a collection of
individual reinforcement learners
 Design individual reward functions such that
cooperation “emerges”
 They may become “selfish”, or may not cooperate in a
desirable way
 Consider the whole multi-agent system as one big MDP
with a large action vector
 State-action space may become very large, but perhaps
possible with advanced function approximation
Interesting intermediate approach
 Let agents learn mostly individually
 Assign (or learn!) a limited number of states where agents
must coordinate, and at those points consider those agents
as a larger single agent
 This can be represented and computed efficiently using
coordination graphs
 Guestrin & Koller (2003), Kok & Vlassis (2004)
Robocup simulation league
 Kok & Vlassis (2002-2004)
Advanced Generalization Issues
• Generalization over states
– tables
– linear methods
– nonlinear methods
• Generalization over actions
• Proving convergence with generalization methods
Non-Markov case
 Try to do the best you can with non-Markov states
 Partially Observable MDPs (POMDPs)
– Bayesian approach: belief states
– construct state from sequence of observations
Other issues
 Model-free vs. model-based
 Value functions vs. directly searching for good policies
(e.g. using genetic algorithms)
 Hierarchical methods
 Incorporating prior knowledge
 advice and hints
 trainers and teachers
 shaping
 Lyapunov functions
 etc.
The end!