
An Overview of
Reinforcement Learning
Angela Yu
Cogs 118A
February 26, 2009
Outline
• A formal framework for learning from reinforcement
– Markov decision problems
– Interactions between an agent and its environment
• Dynamic programming as a formal solution
– Policy iteration
– Value iteration
• Temporal difference methods as a practical solution
– Actor-critic learning
– Q-learning
• Extensions
– Exploration vs. exploitation
– Representation and neural networks
RL as a Markov Decision Process
[Figure: agent-environment diagram. The Markov blanket for $r_t$ and $x_{t+1}$: actions $a_{t-1}, a_t$; states $x_{t-1}, x_t, x_{t+1}$; rewards $r_{t-1}, r_t, r_{t+1}$; transition probabilities $P_{xy}(a)$; reward function $r_t = r(x_t, a_t)$.]
RL as a Markov Decision Process
Goal: find the optimal policy $\pi: x \to a$ by maximizing the return:

$V^\pi(x) = \Big\langle \sum_{t=0}^{\infty} \gamma^t r(t) \Big\rangle_{x,r}$

[Figure: agent-environment diagram as above, with transition probabilities $P_{xy}(a)$ and reward function $r_t = r(x_t, a_t)$.]
RL as a Markov Decision Process
Simple case: assume transition and reward probabilities are known
[Figure: agent-environment diagram with known transition probabilities $P_{xy}(a)$ and known reward function $r_t = r(x_t, a_t)$.]
Dynamic Programming I: Policy Iteration
Policy Evaluation (system of linear equations)
$V^\pi(x) = \Big\langle \sum_{t=0}^{\infty} \gamma^t r(t) \Big\rangle_{x,r} = \hat{r}(x, \pi(x)) + \gamma \Big\langle \sum_{t=0}^{\infty} \gamma^t r(t+1) \Big\rangle_{x,r}$
$\phantom{V^\pi(x)} = \hat{r}(x, \pi(x)) + \gamma \sum_{x_{t+1}} P(x_{t+1} \mid x_t, \pi(x)) \, V^\pi(x_{t+1})$
Policy Improvement

Q ( xt , at )  rˆ( xt , at )   P( xt 1 | xt , at )V  ( xt 1 )
xt 1
Based on the values of these state-action pairs, incrementally improve policy:
$\pi'(x_t) = \arg\max_{a_t} Q^\pi(x_t, a_t)$
Guaranteed to converge on (one set of) optimal values: $V^*(x)$, $Q^*(x, a)$
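A minimal policy-iteration sketch in Python, assuming the hypothetical P, R, gamma arrays defined earlier. Policy evaluation solves the linear system for V^pi exactly; policy improvement takes the greedy argmax over Q^pi; the loop stops when the policy no longer changes.

import numpy as np

def policy_iteration(P, R, gamma):
    n_states, n_actions = R.shape
    pi = np.zeros(n_states, dtype=int)              # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi  (system of linear equations)
        P_pi = P[np.arange(n_states), pi]           # P(x' | x, pi(x))
        r_pi = R[np.arange(n_states), pi]           # r(x, pi(x))
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: Q^pi(x, a) = r(x, a) + gamma * sum_y P(y | x, a) V(y)
        Q = R + gamma * P @ V
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):
            return pi, V, Q                         # converged: optimal policy, V*, Q*
        pi = pi_new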
Dynamic Programming II: Value Iteration
Q-value Update
$Q'(x_t, a_t) = \hat{r}(x_t, a_t) + \gamma \sum_{x_{t+1}} P(x_{t+1} \mid x_t, a_t) \max_{a_{t+1}} Q(x_{t+1}, a_{t+1})$
Guaranteed to converge on (one set of) optimal values: $Q^*(x, a)$
Policy
$\pi'(x_t) = \arg\max_{a_t} Q'(x_t, a_t)$
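The corresponding value-iteration sketch, again assuming the hypothetical P, R, gamma arrays above: each sweep applies the Q-value update to every state-action pair until the values stop changing, and the policy is read off at the end.

import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    while True:
        # Q'(x, a) = r(x, a) + gamma * sum_y P(y | x, a) * max_a' Q(y, a')
        Q_new = R + gamma * P @ Q.max(axis=1)
        if np.max(np.abs(Q_new - Q)) < tol:
            break
        Q = Q_new
    pi = Q.argmax(axis=1)                           # greedy policy from Q*
    return pi, Q

# usage with the toy arrays above: pi_star, Q_star = value_iteration(P, R, gamma)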
Temporal Difference Learning
Difficult (realistic) case:
transition and reward probabilities are unknown
[Figure: agent-environment diagram with unknown transition probabilities $P_{xy}(a)$? and unknown reward function $r_t = r(x_t, a_t)$?]
Solution: temporal difference (TD) learning

Actor-Critic Learning
(related to policy iteration)
Critic improves value estimation incrementally (stochastic gradient ascent):

$V^\pi(x_t) = \hat{r}(x_t, \pi(x_t)) + \gamma \sum_{x_{t+1}} P(x_{t+1} \mid x_t, \pi(x_t)) \, V^\pi(x_{t+1})$

• MC (Monte Carlo) samples for the expectation ⟨·⟩
• Boot-strapping: current estimate $V(x_{t+1})$
• Delta-rule update, with learning rate $\epsilon$ and temporal difference $\delta_t$:

$V'(x_t) = V(x_t) + \epsilon \,[\, r_t + \gamma V(x_{t+1}) - V(x_t) \,] = V(x_t) + \epsilon\, \delta_t$

Actor improves policy execution incrementally:

• Stochastic (softmax) policy: $\pi_a(x) = \dfrac{\exp(\beta M_{xa})}{\sum_b \exp(\beta M_{xb})}$
• Delta-rule update (Monte Carlo samples, learning rate $\epsilon$):

$M'_{xa} = M_{xa} + \epsilon \,[\, Q^\pi(x_t, a_t) - V^\pi(x_t) \,]$
$\phantom{M'_{xa}} = M_{xa} + \epsilon \Big( \hat{r}(x_t, a_t) + \gamma \sum_{x_{t+1}} P(x_{t+1} \mid x_t, a_t) \, V^\pi(x_{t+1}) - V^\pi(x_t) \Big)$
$\phantom{M'_{xa}} \approx M_{xa} + \epsilon \,\big( r_t + \gamma V^\pi(x_{t+1}) - V^\pi(x_t) \big) = M_{xa} + \epsilon\, \delta_t$

• Mutual dependence: $V^\pi(x) = \sum_a \pi_a(x) \, Q^\pi(x, a)$
• Convergence?
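A compact actor-critic sketch of the update loop above, assuming a hypothetical environment interface env_step(x, a) that returns (r, x_next); the constants and the tabular representation are illustrative choices, not the slides' specification.

import numpy as np

def actor_critic(env_step, n_states, n_actions, x0,
                 gamma=0.9, eps=0.1, beta=1.0, n_steps=10000, seed=0):
    rng = np.random.default_rng(seed)
    V = np.zeros(n_states)                    # critic: value estimates V(x)
    M = np.zeros((n_states, n_actions))       # actor: action preferences M[x, a]
    x = x0
    for _ in range(n_steps):
        # Softmax policy: pi_a(x) = exp(beta*M[x,a]) / sum_b exp(beta*M[x,b])
        z = beta * M[x] - np.max(beta * M[x])
        probs = np.exp(z) / np.exp(z).sum()
        a = rng.choice(n_actions, p=probs)
        r, x_next = env_step(x, a)            # one Monte Carlo sample of (r_t, x_{t+1})
        delta = r + gamma * V[x_next] - V[x]  # temporal difference delta_t
        V[x] += eps * delta                   # critic: delta-rule value update
        M[x, a] += eps * delta                # actor: preference update
        x = x_next
    return V, M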
Actor-Critic Learning
Exploration vs. Exploitation
$\pi_a(x) = \dfrac{\exp(\beta M_{xa})}{\sum_b \exp(\beta M_{xb})}$
Best annealing schedule?
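One common answer (not prescribed by the slides) is to anneal the inverse temperature beta: a small beta makes the softmax nearly uniform (explore), a large beta makes it nearly greedy (exploit). The linear schedule below is an arbitrary illustrative choice.

import numpy as np

def softmax_policy(M_x, beta):
    """pi_a(x) = exp(beta * M[x, a]) / sum_b exp(beta * M[x, b])."""
    z = beta * M_x - np.max(beta * M_x)       # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

def beta_schedule(t, beta0=0.1, rate=1e-3):
    # Hypothetical annealing schedule: near-uniform early, near-greedy late.
    return beta0 + rate * t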
Q-Learning
(related to value iteration)
State-action value estimation

$Q'(x_t, a_t) = \hat{r}(x_t, a_t) + \gamma \sum_{x_{t+1}} P(x_{t+1} \mid x_t, a_t) \max_{a_{t+1}} Q(x_{t+1}, a_{t+1})$
$\phantom{Q'(x_t, a_t)} \approx r_t + \gamma \max_{a_{t+1}} Q(x_{t+1}, a_{t+1})$

• MC (Monte Carlo) samples for ⟨·⟩
• Boot-strapping: current estimate $Q(x_{t+1}, a_{t+1})$
• Proven convergence
• No explicit parameter to control explore/exploit

Policy

$\pi(x_t) = \arg\max_{a_t} Q(x_t, a_t)$
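A minimal Q-learning sketch under the same assumed env_step(x, a) -> (r, x_next) interface; since Q-learning itself has no explicit explore/exploit parameter, an epsilon-greedy rule is added here as one illustrative choice.

import numpy as np

def q_learning(env_step, n_states, n_actions, x0,
               gamma=0.9, lr=0.1, explore=0.1, n_steps=10000, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    x = x0
    for _ in range(n_steps):
        # epsilon-greedy action selection (illustrative; not part of the update rule itself)
        if rng.random() < explore:
            a = int(rng.integers(n_actions))
        else:
            a = int(Q[x].argmax())
        r, x_next = env_step(x, a)
        # Q(x,a) <- Q(x,a) + lr * [ r + gamma * max_a' Q(x',a') - Q(x,a) ]
        Q[x, a] += lr * (r + gamma * Q[x_next].max() - Q[x, a])
        x = x_next
    pi = Q.argmax(axis=1)                     # greedy policy read off the learned Q
    return Q, pi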
Pros and Cons of TD Learning
TD learning is practically appealing:
– no representation of sequences of states & actions
– relatively simple computations
– TD in the brain: dopamine signals the temporal difference $\delta_t$
TD suffers from several disadvantages
– local optima
– can be (exponentially) slow to converge
– actor-critic not guaranteed to converge
– no principled way to trade off exploration and exploitation
– cannot easily deal with non-stationary environments
TD in the Brain
[Figure slides: dopamine neuron responses interpreted as the temporal-difference signal $\delta_t$ (Schultz, Dayan, & Montague, 1997).]
Extensions to basic TD Learning
• A continuum of improvements possible
– more complete partial models of the effects of actions
– estimate expected reward $\langle r(x_t) \rangle$
– representing & processing longer sequences of actions & states
$V(x(t)) \leftarrow V(x(t)) + \epsilon\,[\, r(0) + \gamma r(1) + \gamma^2 V(x(t+2)) - V(x(t)) \,]$
– faster learning & more efficient use of agent’s experiences
– parameterize value function (versus look-up table); a TD sketch with linear features follows after this list:

$V(x(t)) \approx V(x) = \mathbf{w} \cdot \boldsymbol{\phi}(x)$
$\mathbf{w} \leftarrow \mathbf{w} + \epsilon\, \delta(t)\, \boldsymbol{\phi}(x(t))$
• Timing and partial observability in reward prediction

– state not (always) directly observable

– delayed payoffs
– reward-prediction only (no instrumental contingencies)
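A TD(0) sketch for the parameterized value function V(x) ≈ w · φ(x); the feature map phi, the policy function, and the environment interface env_step are hypothetical placeholders.

import numpy as np

def td0_linear(env_step, policy, phi, n_features, x0,
               gamma=0.9, eps=0.05, n_steps=10000):
    """TD(0) with a linear value function V(x) = w . phi(x)."""
    w = np.zeros(n_features)
    x = x0
    for _ in range(n_steps):
        a = policy(x)
        r, x_next = env_step(x, a)
        # delta(t) = r + gamma * w.phi(x') - w.phi(x);  w <- w + eps * delta(t) * phi(x)
        delta = r + gamma * w @ phi(x_next) - w @ phi(x)
        w += eps * delta * phi(x)
        x = x_next
    return w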
References
Sutton, RS & Barto, AG (1998). Reinforcement Learning: An Introduction.
Cambridge, MA: MIT Press.
Bellman, RE (1957). Dynamic Programming. Princeton, NJ: Princeton
University Press.
Daw, ND, Courville, AC, & Touretzky, DS (2003). Timing and partial
observability in the dopamine system. In Advances in Neural Information
Processing Systems 15. Cambridge, MA: MIT Press.
Dayan, P & Watkins, CJCH (2001). Reinforcement learning.
Encyclopedia of Cognitive Science. London, England: MacMillan Press.
Dayan, P & Abbott, LF (2001). Theoretical Neuroscience. Cambridge, MA:
MIT Press.
Gittins, JC (1979). Bandit processes and dynamic allocation indices. Journal
of the Royal Statistical Society, Series B, 41: 148-177.
Schultz, W, Dayan, P, & Montague, PR (1997). A neural substrate of prediction
and reward. Science, 275, 1593-1599.