Chapter 7: Eligibility Traces

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction
Midterm
Mean = 77.33, Median = 82
N-step TD Prediction
❐ Idea: Look farther into the future when you do a TD backup (1, 2, 3, …, n steps)
Mathematics of N-step TD Prediction
❐ Monte Carlo: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$
❐ TD: $R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$
  - Use V to estimate the remaining return
❐ n-step TD:
  - 2-step return: $R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$
  - n-step return: $R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$
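To make the n-step return concrete, here is a minimal Python sketch (not from the slides or the book): it computes $R_t^{(n)}$ from one recorded episode, assuming `rewards[k]` holds r_{k+1}, `states[k]` holds s_k, and `V` is a dict of value estimates; all of these names are illustrative conventions.

```python
# Minimal sketch of the n-step return
#   R_t^(n) = r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-1)*r_{t+n} + gamma^n * V(s_{t+n})
# for a single recorded episode. Names and data layout are assumptions for illustration.

def n_step_return(rewards, states, V, t, n, gamma):
    T = len(rewards)                     # the episode delivers rewards r_1 ... r_T
    G = 0.0
    for k in range(n):
        if t + k >= T:                   # episode ended within n steps: plain return
            return G
        G += (gamma ** k) * rewards[t + k]
    if t + n < T:                        # bootstrap from the current estimate of s_{t+n}
        G += (gamma ** n) * V.get(states[t + n], 0.0)
    return G
```

With n = 1 this is the TD target $r_{t+1} + \gamma V_t(s_{t+1})$; once n reaches past the end of the episode it is the Monte Carlo return.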
Learning with N-step Backups
❐ Backup (on-line or off-line):
$\Delta V_t(s_t) = \alpha \left[ R_t^{(n)} - V_t(s_t) \right]$
❐ Error reduction property of n-step returns:
$\max_s \left| E_\pi\{ R_t^{(n)} \mid s_t = s \} - V^\pi(s) \right| \le \gamma^n \max_s \left| V(s) - V^\pi(s) \right|$
  - left side: maximum error using the n-step return; right side: maximum error using V
❐ Using this, you can show that n-step methods converge
Random Walk Examples
❐ How does 2-step TD work here?
❐ How about 3-step TD?
A Larger Example
❐ Task: 19-state random walk
❐ Do you think there is an optimal n (for everything)?
Averaging N-step Returns
❐ n-step methods were introduced mainly to help with understanding TD(λ)
❐ Idea: back up an average of several returns
  - e.g., back up half of the 2-step return and half of the 4-step return:
$R_t^{\mathrm{avg}} = \tfrac{1}{2} R_t^{(2)} + \tfrac{1}{2} R_t^{(4)}$
❐ Called a complex backup
  - Draw each component of the backup diagram
  - Label each component with its weight
Forward View of TD(λ)
❐ TD(λ) is a method for averaging all n-step backups
  - weight each n-step backup by $\lambda^{n-1}$ (time since visitation)
  - λ-return: $R_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$
❐ Backup using the λ-return:
$\Delta V_t(s_t) = \alpha \left[ R_t^\lambda - V_t(s_t) \right]$
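Building on the earlier n-step sketch, here is an illustrative λ-return calculation for an episodic task (again an assumption-laden sketch, not the book's code); it uses the finite rewriting shown on the next slide, where all weight past termination falls on the complete return $R_t$.

```python
# Minimal sketch of the lambda-return for an episodic task:
#   R_t^lambda = (1 - lam) * sum_{n=1}^{T-t-1} lam^(n-1) * R_t^(n)  +  lam^(T-t-1) * R_t
# Reuses the n_step_return sketch from earlier; all names are illustrative.

def lambda_return(rewards, states, V, t, gamma, lam):
    T = len(rewards)
    G_lam = 0.0
    for n in range(1, T - t):                                  # n = 1, ..., T-t-1
        G_lam += (1 - lam) * (lam ** (n - 1)) * n_step_return(rewards, states, V, t, n, gamma)
    R_t = n_step_return(rewards, states, V, t, T - t, gamma)   # complete (Monte Carlo) return
    return G_lam + (lam ** (T - t - 1)) * R_t
```

Setting lam = 0 leaves only the one-step return, and lam = 1 leaves only R_t, matching the TD(0) and MC special cases discussed below.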
λ-return Weighting Function
Relation to TD(0) and MC
❐ The λ-return can be rewritten as:
$R_t^\lambda = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R_t^{(n)} + \lambda^{T-t-1} R_t$
  - the first term sums the n-step returns until termination; the second term carries the weight given after termination
❐ If λ = 1, you get MC:
$R_t^\lambda = (1 - 1) \sum_{n=1}^{T-t-1} 1^{n-1} R_t^{(n)} + 1^{T-t-1} R_t = R_t$
❐ If λ = 0, you get TD(0):
$R_t^\lambda = (1 - 0) \sum_{n=1}^{T-t-1} 0^{n-1} R_t^{(n)} + 0^{T-t-1} R_t = R_t^{(1)}$
Forward View of TD(λ) II
❐ Look forward from each state to determine the update from future states and rewards:
λ-return on the Random Walk
❐ Same 19 state random walk as before
❐ Why do you think intermediate values of λ are best?
Backward View of TD(λ)
❐ The forward view was for theory
❐ The backward view is for mechanism
❐ New variable called the eligibility trace: $e_t(s) \in \mathbb{R}^+$
  - On each step, decay all traces by γλ and increment the trace for the current state by 1
  - Accumulating trace:
$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & \text{if } s \ne s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$
On-line Tabular TD(λ)
Initialize V(s) arbitrarily and e(s) = 0, for all s ∈ S
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        a ← action given by π for s
        Take action a, observe reward r and next state s′
        δ ← r + γV(s′) − V(s)
        e(s) ← e(s) + 1
        For all s:
            V(s) ← V(s) + αδe(s)
            e(s) ← γλe(s)
        s ← s′
    Until s is terminal
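Here is a minimal Python rendering of the tabular algorithm above with accumulating traces. The `env` and `policy` interfaces (env.reset() returning a start state, env.step(a) returning (next_state, reward, done), policy(s) returning an action) are assumptions for illustration only.

```python
from collections import defaultdict

# Sketch of on-line tabular TD(lambda) with accumulating traces, following the
# pseudocode above. The env/policy interface is an assumed convention.

def tabular_td_lambda(env, policy, num_episodes, alpha=0.1, gamma=1.0, lam=0.9):
    V = defaultdict(float)                        # value estimates, initialized to 0
    for _ in range(num_episodes):
        e = defaultdict(float)                    # eligibility traces, reset each episode
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            delta = r + gamma * (0.0 if done else V[s_next]) - V[s]
            e[s] += 1.0                           # accumulating trace for the current state
            for state in list(e.keys()):          # only states with nonzero traces matter
                V[state] += alpha * delta * e[state]
                e[state] *= gamma * lam           # decay every trace
            s = s_next
    return V
```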
Backward View
$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$
❐ Shout $\delta_t$ backwards over time
❐ The strength of your voice decreases with temporal distance by γλ
Relation of Backwards View to MC & TD(0)
❐ Using the update rule $\Delta V_t(s) = \alpha\, \delta_t\, e_t(s)$:
❐ As before, if you set λ to 0, you get TD(0)
❐ If you set λ to 1, you get MC, but in a better way
  - Can apply TD(1) to continuing tasks
  - Works incrementally and on-line (instead of waiting until the end of the episode)
Forward View = Backward View
❐ The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating
❐ The book shows:
$\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \sum_{t=0}^{T-1} \Delta V_t^{\lambda}(s_t)\, I_{s s_t}$
  - left side: backward updates; right side: forward updates ($I_{s s_t} = 1$ if $s = s_t$, else 0)
❐ The algebra shown in the book reduces both sides to the same expression:
$\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \sum_{t=0}^{T-1} \alpha\, I_{s s_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t} \delta_k$
$\sum_{t=0}^{T-1} \Delta V_t^{\lambda}(s_t)\, I_{s s_t} = \sum_{t=0}^{T-1} \alpha\, I_{s s_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t} \delta_k$
❐ On-line updating with small α is similar
On-line versus Off-line on Random Walk
❐ Same 19 state random walk
❐ On-line performs better over a broader range of parameters
Control: Sarsa(λ)
❐ Save eligibility traces for state-action pairs instead of just states:
$e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 & \text{if } s = s_t \text{ and } a = a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$
$Q_{t+1}(s,a) = Q_t(s,a) + \alpha\, \delta_t\, e_t(s,a)$
$\delta_t = r_{t+1} + \gamma\, Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$
Sarsa(λ) Algorithm
Initialize Q(s, a) arbitrarily and e(s, a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g., ε-greedy)
        δ ← r + γQ(s′, a′) − Q(s, a)
        e(s, a) ← e(s, a) + 1
        For all s, a:
            Q(s, a) ← Q(s, a) + αδe(s, a)
            e(s, a) ← γλe(s, a)
        s ← s′; a ← a′
    Until s is terminal
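A corresponding Python sketch of Sarsa(λ) with accumulating traces and an ε-greedy policy; `env.actions` and the rest of the environment interface are assumptions carried over from the earlier sketch.

```python
import random
from collections import defaultdict

# Sketch of tabular Sarsa(lambda), following the pseudocode above.
# env.reset(), env.step(a) -> (s', r, done), and env.actions are assumed.

def sarsa_lambda(env, num_episodes, alpha=0.05, gamma=0.9, lam=0.9, epsilon=0.05):
    Q = defaultdict(float)                            # Q[(state, action)]

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        e = defaultdict(float)                        # traces over state-action pairs
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            delta = r + gamma * (0.0 if done else Q[(s_next, a_next)]) - Q[(s, a)]
            e[(s, a)] += 1.0                          # accumulating trace
            for sa in list(e.keys()):
                Q[sa] += alpha * delta * e[sa]
                e[sa] *= gamma * lam
            s, a = s_next, a_next
    return Q
```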
Sarsa(λ) Gridworld Example
❐ With one trial, the agent has much more information about how to get to the goal
  - not necessarily the best way
❐ Can considerably accelerate learning
Three Approaches to Q(λ)
❐ How can we extend this to Q-learning?
❐ If you mark every state-action pair as eligible, you back up over a non-greedy policy
  - Watkins: Zero out the eligibility trace after a non-greedy action; do the max when backing up at the first non-greedy choice
$e_t(s,a) = \begin{cases} 1 + \gamma\lambda\, e_{t-1}(s,a) & \text{if } s = s_t,\ a = a_t,\ Q_{t-1}(s_t, a_t) = \max_a Q_{t-1}(s_t, a) \\ 0 & \text{if } Q_{t-1}(s_t, a_t) \ne \max_a Q_{t-1}(s_t, a) \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$
$Q_{t+1}(s,a) = Q_t(s,a) + \alpha\, \delta_t\, e_t(s,a)$
$\delta_t = r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)$
Watkins’s Q(λ)
Initialize Q(s, a) arbitrarily and e(s, a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s′
        Choose a′ from s′ using policy derived from Q (e.g., ε-greedy)
        a* ← arg max_b Q(s′, b)   (if a′ ties for the max, then a* ← a′)
        δ ← r + γQ(s′, a*) − Q(s, a)
        e(s, a) ← e(s, a) + 1
        For all s, a:
            Q(s, a) ← Q(s, a) + αδe(s, a)
            If a′ = a*, then e(s, a) ← γλe(s, a)
            else e(s, a) ← 0
        s ← s′; a ← a′
    Until s is terminal
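And a Python sketch of Watkins's Q(λ), following the pseudocode above: traces decay only while the agent keeps choosing greedy actions and are cut to zero otherwise. The environment interface is the same assumed one as in the earlier sketches.

```python
import random
from collections import defaultdict

# Sketch of Watkins's Q(lambda). env.reset(), env.step(a) -> (s', r, done),
# and env.actions are assumed interfaces, used for illustration only.

def watkins_q_lambda(env, num_episodes, alpha=0.05, gamma=0.9, lam=0.9, epsilon=0.05):
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        e = defaultdict(float)
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            a_star = max(env.actions, key=lambda b: Q[(s_next, b)])
            if Q[(s_next, a_next)] == Q[(s_next, a_star)]:
                a_star = a_next                   # if a' ties for the max, treat a' as greedy
            delta = r + gamma * (0.0 if done else Q[(s_next, a_star)]) - Q[(s, a)]
            e[(s, a)] += 1.0
            for sa in list(e.keys()):
                Q[sa] += alpha * delta * e[sa]
                e[sa] = gamma * lam * e[sa] if a_next == a_star else 0.0   # cut traces on exploration
            s, a = s_next, a_next
    return Q
```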
Peng’s Q(λ)
❐ Disadvantage of Watkins's method:
  - Early in learning, the eligibility trace will be "cut" (zeroed out) frequently, resulting in little advantage from traces
❐ Peng:
  - Back up the max action except at the end
  - Never cut traces
❐ Disadvantage:
  - Complicated to implement
Naïve Q(λ)
❐ Idea: is it really a problem to back up exploratory actions?
  - Never zero traces
  - Always back up the max at the current action (unlike Peng's or Watkins's)
❐ Is this truly naïve?
❐ Works well in preliminary empirical studies
What is the backup diagram?
Comparison Task
❐ Compared Watkins's, Peng's, and Naïve (called McGovern's here) Q(λ) on several tasks
  - See McGovern and Sutton (1997), "Towards a Better Q(λ)," for other tasks and results (stochastic tasks, continuing tasks, etc.)
❐ Deterministic gridworld with obstacles
  - 10x10 gridworld
  - 25 randomly generated obstacles
  - 30 runs
  - α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05, accumulating traces
From McGovern and Sutton (1997). Towards a better Q(λ)
Comparison Results
From McGovern and Sutton (1997). Towards a better Q(λ)
Convergence of the Q(λ)’s
❐ None of the methods are proven to converge
  - Much extra credit if you can prove any of them
❐ Watkins's is thought to converge to $Q^*$
❐ Peng's is thought to converge to a mixture of $Q^\pi$ and $Q^*$
❐ Naïve: $Q^*$?
Eligibility Traces for Actor-Critic Methods
❐ Critic: On-policy learning of $V^\pi$; use TD(λ) as described before
❐ Actor: Needs eligibility traces for each state-action pair
❐ We change the update equation from
$p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \alpha\, \delta_t & \text{if } a = a_t \text{ and } s = s_t \\ p_t(s,a) & \text{otherwise} \end{cases}$
to
$p_{t+1}(s,a) = p_t(s,a) + \alpha\, \delta_t\, e_t(s,a)$
❐ Can change the other actor-critic update from
$p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \alpha\, \delta_t \left[ 1 - \pi_t(s_t, a_t) \right] & \text{if } a = a_t \text{ and } s = s_t \\ p_t(s,a) & \text{otherwise} \end{cases}$
to
$p_{t+1}(s,a) = p_t(s,a) + \alpha\, \delta_t\, e_t(s,a)$
where
$e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 - \pi_t(s_t, a_t) & \text{if } s = s_t \text{ and } a = a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$
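A small sketch (names assumed for illustration) of the second trace update above: the visited pair's trace is bumped by $1 - \pi_t(s_t, a_t)$ rather than by 1, and every trace decays by γλ.

```python
# Sketch of the actor's eligibility-trace update. `e` maps (state, action) pairs to
# traces and `pi_prob` stands for pi_t(s_t, a_t), the probability the current policy
# assigned to the action actually taken; both are assumed inputs.

def update_actor_traces(e, s_t, a_t, pi_prob, gamma, lam):
    for sa in list(e.keys()):
        e[sa] *= gamma * lam                                       # decay every trace
    e[(s_t, a_t)] = e.get((s_t, a_t), 0.0) + (1.0 - pi_prob)       # bump only the visited pair
    return e
```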
Replacing Traces
❐ Using accumulating traces, frequently visited states can have eligibilities greater than 1
  - This can be a problem for convergence
❐ Replacing traces: instead of adding 1 when you visit a state, set that trace to 1
$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & \text{if } s \ne s_t \\ 1 & \text{if } s = s_t \end{cases}$
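A tiny sketch (illustrative only) contrasting the two trace styles for the state just visited; `e` is assumed to be a dict mapping states to trace values.

```python
def decay_and_mark(e, s_t, gamma, lam, replacing=True):
    """Decay all traces by gamma*lambda, then mark the state s_t just visited."""
    for s in list(e.keys()):
        e[s] *= gamma * lam
    if replacing:
        e[s_t] = 1.0                              # replacing trace: reset to 1
    else:
        e[s_t] = e.get(s_t, 0.0) + 1.0            # accumulating trace: can grow past 1
    return e
```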
Replacing Traces Example
❐ Same 19 state random walk task as before
❐ Replacing traces perform better than accumulating traces over more values of λ
Why Replacing Traces?
❐ Replacing traces can significantly speed learning
❐ They can make the system perform well for a broader set of parameters
❐ Accumulating traces can do poorly on certain types of tasks
Why is this task particularly onerous for accumulating traces?
More Replacing Traces
❐ Off-line replacing trace TD(1) is identical to first-visit MC
❐ Extension to action-values:
  - When you revisit a state, what should you do with the traces for the other actions?
  - Singh and Sutton say to set them to zero:
$e_t(s,a) = \begin{cases} 1 & \text{if } s = s_t \text{ and } a = a_t \\ 0 & \text{if } s = s_t \text{ and } a \ne a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{if } s \ne s_t \end{cases}$
Implementation Issues
❐ Could require much more computation
  - But most eligibility traces are VERY close to zero
❐ If you implement it in Matlab, the backup is only one line of code and is very fast (Matlab is optimized for matrices)
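To illustrate the point about matrix-oriented implementations, here is a NumPy sketch (an assumption-laden analogue of the Matlab remark, not the authors' code) in which the whole backup and trace decay are a couple of vectorized lines.

```python
import numpy as np

n_states = 21                       # size chosen arbitrarily for illustration
V = np.zeros(n_states)              # value estimates
e = np.zeros(n_states)              # eligibility traces

def td_lambda_step(s, delta, alpha, gamma, lam):
    """One backward-view TD(lambda) step for state index s and TD error delta."""
    e[s] += 1.0                     # accumulating trace for the current state
    V[:] += alpha * delta * e       # back up every state at once
    e[:] *= gamma * lam             # decay every trace at once
```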
Variable λ
❐ Can generalize to variable λ
$e_t(s) = \begin{cases} \gamma\lambda_t\, e_{t-1}(s) & \text{if } s \ne s_t \\ \gamma\lambda_t\, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$
❐ Here λ is a function of time
  - Could define $\lambda_t = \lambda(s_t)$ or $\lambda_t = \lambda^t$
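A small sketch of the variable-λ trace update above; `lambda_of` is an assumed function mapping states to values in [0, 1], so the decay applied at step t is γλ(s_t).

```python
def update_traces_variable_lambda(e, s_t, gamma, lambda_of):
    """Accumulating-trace update with a state-dependent lambda_t = lambda(s_t)."""
    lam_t = lambda_of(s_t)
    for s in list(e.keys()):
        e[s] *= gamma * lam_t          # decay all traces by gamma * lambda_t
    e[s_t] = e.get(s_t, 0.0) + 1.0     # increment the trace for the current state
    return e
```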
Conclusions
❐ Provides an efficient, incremental way to combine MC and TD
  - Includes the advantages of MC (can deal with lack of the Markov property)
  - Includes the advantages of TD (using the TD error, bootstrapping)
❐ Can significantly speed learning
❐ Does have a cost in computation
Something Here is Not Like the Other