Chapter 7: Eligibility Traces
R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

Midterm
❐ Mean = 77.33, Median = 82

N-step TD Prediction
❐ Idea: Look farther into the future when you do the TD backup (1, 2, 3, …, n steps)

Mathematics of N-step TD Prediction
❐ Monte Carlo: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$
❐ TD: $R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$
   Use V to estimate the remaining return
❐ n-step TD:
   2-step return: $R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$
   n-step return: $R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$

Learning with N-step Backups
❐ Backup (on-line or off-line): $\Delta V_t(s_t) = \alpha \left[ R_t^{(n)} - V_t(s_t) \right]$
❐ Error reduction property of n-step returns:
   $\max_s \left| E_\pi\{ R_t^{(n)} \mid s_t = s \} - V^\pi(s) \right| \le \gamma^n \max_s \left| V(s) - V^\pi(s) \right|$
   (the maximum error using the n-step return is at most $\gamma^n$ times the maximum error using V)
❐ Using this, you can show that n-step methods converge

Random Walk Examples
❐ How does 2-step TD work here?
❐ How about 3-step TD?

A Larger Example
❐ Task: 19-state random walk
❐ Do you think there is an optimal n (for everything)?

Averaging N-step Returns
❐ n-step methods were introduced to help with understanding TD(λ)
❐ Idea: backup an average of several returns
   e.g. backup half of the 2-step and half of the 4-step return:
   $R_t^{avg} = \tfrac{1}{2} R_t^{(2)} + \tfrac{1}{2} R_t^{(4)}$
❐ Called a complex backup
   Draw each component
   Label with the weights for that component

Forward View of TD(λ)
❐ TD(λ) is a method for averaging all n-step backups
   weight by $\lambda^{n-1}$ (time since visitation)
   λ-return: $R_t^\lambda = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$
❐ Backup using the λ-return: $\Delta V_t(s_t) = \alpha \left[ R_t^\lambda - V_t(s_t) \right]$

λ-return Weighting Function

Relation to TD(0) and MC
❐ The λ-return can be rewritten as:
   $R_t^\lambda = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R_t^{(n)} + \lambda^{T-t-1} R_t$
   (first term: until termination; second term: after termination)
❐ If λ = 1, you get MC:
   $R_t^\lambda = (1-1) \sum_{n=1}^{T-t-1} 1^{n-1} R_t^{(n)} + 1^{T-t-1} R_t = R_t$
❐ If λ = 0, you get TD(0):
   $R_t^\lambda = (1-0) \sum_{n=1}^{T-t-1} 0^{n-1} R_t^{(n)} + 0^{T-t-1} R_t = R_t^{(1)}$

Forward View of TD(λ) II
❐ Look forward from each state to determine its update from future states and rewards

λ-return on the Random Walk
❐ Same 19-state random walk as before
❐ Why do you think intermediate values of λ are best?

Backward View of TD(λ)
❐ The forward view was for theory
❐ The backward view is for mechanism
❐ New variable called the eligibility trace, $e_t(s) \in \mathbb{R}^+$
   On each step, decay all traces by γλ and increment the trace for the current state by 1
   Accumulating trace:
   $e_t(s) = \begin{cases} \gamma\lambda e_{t-1}(s) & \text{if } s \ne s_t \\ \gamma\lambda e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$
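A minimal Python sketch of one on-line step with the accumulating trace just defined, assuming a tabular value function and traces held in NumPy arrays (the state indices and the α, γ, λ values are illustrative, not part of the slides):

```python
import numpy as np

def td_lambda_step(V, e, s, r, s_next, alpha=0.1, gamma=0.9, lam=0.8):
    """One backward-view TD(lambda) step with an accumulating trace.

    V, e: 1-D arrays indexed by state (value estimates and eligibility traces).
    """
    delta = r + gamma * V[s_next] - V[s]  # TD error for this transition
    e[s] += 1.0                           # accumulate: bump the trace of the visited state
    V += alpha * delta * e                # every state updates in proportion to its trace
    e *= gamma * lam                      # decay all traces toward zero
    return V, e

V, e = np.zeros(19), np.zeros(19)         # e.g. the 19-state random walk
V, e = td_lambda_step(V, e, s=9, r=0.0, s_next=10)
```

The full algorithm on the next slide just wraps this step inside the episode loop and the policy's action selection.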
On-line Tabular TD(λ)

   Initialize V(s) arbitrarily and e(s) = 0, for all s ∈ S
   Repeat (for each episode):
      Initialize s
      Repeat (for each step of episode):
         a ← action given by π for s
         Take action a, observe reward r and next state s'
         δ ← r + γV(s') − V(s)
         e(s) ← e(s) + 1
         For all s:
            V(s) ← V(s) + αδe(s)
            e(s) ← γλe(s)
         s ← s'
      until s is terminal

Backward View
❐ $\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$
❐ Shout $\delta_t$ backwards over time
❐ The strength of your voice decreases with temporal distance by γλ

Relation of Backwards View to MC & TD(0)
❐ Using the update rule $\Delta V_t(s) = \alpha \delta_t e_t(s)$
❐ As before, if you set λ to 0, you get TD(0)
❐ If you set λ to 1, you get MC but in a better way
   Can apply TD(1) to continuing tasks
   Works incrementally and on-line (instead of waiting until the end of the episode)

Forward View = Backward View
❐ The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating
❐ The book shows that the sum of the backward updates equals the sum of the forward updates:
   $\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \sum_{t=0}^{T-1} \Delta V_t^{\lambda}(s_t) I_{s s_t}$
   with both sides (by the algebra shown in the book) equal to
   $\sum_{t=0}^{T-1} \alpha I_{s s_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t} \delta_k$
❐ On-line updating with small α is similar

On-line versus Off-line on Random Walk
❐ Same 19-state random walk
❐ On-line performs better over a broader range of parameters

Control: Sarsa(λ)
❐ Save eligibility for state-action pairs instead of just states:
   $e_t(s,a) = \begin{cases} \gamma\lambda e_{t-1}(s,a) + 1 & \text{if } s = s_t \text{ and } a = a_t \\ \gamma\lambda e_{t-1}(s,a) & \text{otherwise} \end{cases}$
   $Q_{t+1}(s,a) = Q_t(s,a) + \alpha \delta_t e_t(s,a)$
   $\delta_t = r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$

Sarsa(λ) Algorithm

   Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
   Repeat (for each episode):
      Initialize s, a
      Repeat (for each step of episode):
         Take action a, observe r, s'
         Choose a' from s' using policy derived from Q (e.g. ε-greedy)
         δ ← r + γQ(s',a') − Q(s,a)
         e(s,a) ← e(s,a) + 1
         For all s, a:
            Q(s,a) ← Q(s,a) + αδe(s,a)
            e(s,a) ← γλe(s,a)
         s ← s'; a ← a'
      until s is terminal

Sarsa(λ) Gridworld Example
❐ With one trial, the agent has much more information about how to get to the goal
   (not necessarily the best way)
❐ Can considerably accelerate learning

Three Approaches to Q(λ)
❐ How can we extend this to Q-learning?
❐ If you mark every state-action pair as eligible, you back up over the non-greedy policy
   Watkins: zero out the eligibility trace after a non-greedy action; do the max when backing up at the first non-greedy choice.
   $e_t(s,a) = \begin{cases} 1 + \gamma\lambda e_{t-1}(s,a) & \text{if } s = s_t,\ a = a_t,\ Q_{t-1}(s_t,a_t) = \max_a Q_{t-1}(s_t,a) \\ 0 & \text{if } Q_{t-1}(s_t,a_t) \ne \max_a Q_{t-1}(s_t,a) \\ \gamma\lambda e_{t-1}(s,a) & \text{otherwise} \end{cases}$
   $Q_{t+1}(s,a) = Q_t(s,a) + \alpha \delta_t e_t(s,a)$
   $\delta_t = r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)$
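A minimal sketch of the trace handling in Watkins's approach just described, assuming tabular Q and e stored as 2-D NumPy arrays (states × actions); the tie handling, array sizes, and parameter values are illustrative:

```python
import numpy as np

def watkins_q_lambda_step(Q, e, s, a, r, s_next, a_next,
                          alpha=0.1, gamma=0.9, lam=0.9):
    """One backward-view step of Watkins's Q(lambda) on tabular Q (states x actions)."""
    a_star = np.argmax(Q[s_next])                    # greedy action in the next state
    if Q[s_next, a_next] == Q[s_next, a_star]:       # if the chosen action ties for the max,
        a_star = a_next                              # treat it as the greedy one
    delta = r + gamma * Q[s_next, a_star] - Q[s, a]  # back up toward the greedy target
    e[s, a] += 1.0
    Q += alpha * delta * e
    if a_next == a_star:
        e *= gamma * lam                             # greedy choice: decay traces as usual
    else:
        e[:] = 0.0                                   # exploratory choice: cut (zero) all traces
    return Q, e

Q, e = np.zeros((25, 4)), np.zeros((25, 4))          # e.g. a small gridworld with 4 actions
Q, e = watkins_q_lambda_step(Q, e, s=0, a=1, r=0.0, s_next=5, a_next=2)
```

The final `else` branch is exactly the cut that Peng's and the naïve variants avoid, as discussed on the following slides.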
Watkins's Q(λ)

   Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
   Repeat (for each episode):
      Initialize s, a
      Repeat (for each step of episode):
         Take action a, observe r, s'
         Choose a' from s' using policy derived from Q (e.g. ε-greedy)
         a* ← arg max_b Q(s', b)   (if a' ties for the max, then a* ← a')
         δ ← r + γQ(s', a*) − Q(s, a)
         e(s,a) ← e(s,a) + 1
         For all s, a:
            Q(s,a) ← Q(s,a) + αδe(s,a)
            If a' = a*, then e(s,a) ← γλe(s,a)
            else e(s,a) ← 0
         s ← s'; a ← a'
      until s is terminal

Peng's Q(λ)
❐ Disadvantage of Watkins's method: early in learning, the eligibility trace will be "cut" (zeroed out) frequently, resulting in little advantage from traces
❐ Peng:
   Backup the max action except at the end
   Never cut traces
❐ Disadvantage: complicated to implement

Naïve Q(λ)
❐ Idea: is it really a problem to back up exploratory actions?
   Never zero traces
   Always backup the max at the current action (unlike Peng's or Watkins's)
❐ Is this truly naïve?
❐ Works well in preliminary empirical studies
❐ What is the backup diagram?

Comparison Task
❐ Compared Watkins's, Peng's, and Naïve (called McGovern's here) Q(λ) on several tasks.
   See McGovern and Sutton (1997), "Towards a Better Q(λ)", for other tasks and results (stochastic tasks, continuing tasks, etc.)
❐ Deterministic gridworld with obstacles
   10x10 gridworld
   25 randomly generated obstacles
   30 runs
   α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05, accumulating traces
From McGovern and Sutton (1997), "Towards a Better Q(λ)"

Comparison Results
From McGovern and Sutton (1997), "Towards a Better Q(λ)"

Convergence of the Q(λ)'s
❐ None of the methods are proven to converge. Much extra credit if you can prove any of them.
❐ Watkins's is thought to converge to Q*
❐ Peng's is thought to converge to a mixture of Qπ and Q*
❐ Naïve - Q*?

Eligibility Traces for Actor-Critic Methods
❐ Critic: on-policy learning of Vπ. Use TD(λ) as described before.
❐ Actor: needs eligibility traces for each state-action pair.
❐ We change the update equation
   $p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \alpha \delta_t & \text{if } a = a_t \text{ and } s = s_t \\ p_t(s,a) & \text{otherwise} \end{cases}$
   to
   $p_{t+1}(s,a) = p_t(s,a) + \alpha \delta_t e_t(s,a)$
❐ Can change the other actor-critic update
   $p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \alpha \delta_t [1 - \pi_t(s_t,a_t)] & \text{if } a = a_t \text{ and } s = s_t \\ p_t(s,a) & \text{otherwise} \end{cases}$
   to
   $p_{t+1}(s,a) = p_t(s,a) + \alpha \delta_t e_t(s,a)$
   where
   $e_t(s,a) = \begin{cases} \gamma\lambda e_{t-1}(s,a) + 1 - \pi_t(s_t,a_t) & \text{if } s = s_t \text{ and } a = a_t \\ \gamma\lambda e_{t-1}(s,a) & \text{otherwise} \end{cases}$

Replacing Traces
❐ Using accumulating traces, frequently visited states can have eligibilities greater than 1
   This can be a problem for convergence
❐ Replacing traces: instead of adding 1 when you visit a state, set that trace to 1:
   $e_t(s) = \begin{cases} \gamma\lambda e_{t-1}(s) & \text{if } s \ne s_t \\ 1 & \text{if } s = s_t \end{cases}$

Replacing Traces Example
❐ Same 19-state random walk task as before
❐ Replacing traces perform better than accumulating traces over more values of λ
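A small sketch contrasting the accumulating and replacing trace updates for a state visited repeatedly, assuming a tiny tabular NumPy setup (the state index, revisit loop, and γ, λ values are illustrative):

```python
import numpy as np

def decay_and_mark(e, s, gamma=0.9, lam=0.9, replacing=False):
    """Decay all traces by gamma*lambda, then mark the visited state s."""
    e *= gamma * lam
    if replacing:
        e[s] = 1.0       # replacing trace: reset to exactly 1 on every visit
    else:
        e[s] += 1.0      # accumulating trace: can grow past 1 under frequent revisits
    return e

acc, rep = np.zeros(5), np.zeros(5)
for _ in range(4):       # revisit state 2 four times in a row
    acc = decay_and_mark(acc, 2, replacing=False)
    rep = decay_and_mark(rep, 2, replacing=True)
print(acc[2], rep[2])    # accumulating trace has grown past 1; replacing trace is still 1
```

That growth past 1 is exactly the behavior the next slide flags as problematic on some tasks.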
Why Replacing Traces?
❐ Replacing traces can significantly speed learning
❐ They can make the system perform well for a broader set of parameters
❐ Accumulating traces can do poorly on certain types of tasks
   Why is this task particularly onerous for accumulating traces?

More Replacing Traces
❐ Off-line replacing-trace TD(1) is identical to first-visit MC
❐ Extension to action values: when you revisit a state, what should you do with the traces for the other actions?
   Singh and Sutton say to set them to zero:
   $e_t(s,a) = \begin{cases} 1 & \text{if } s = s_t \text{ and } a = a_t \\ 0 & \text{if } s = s_t \text{ and } a \ne a_t \\ \gamma\lambda e_{t-1}(s,a) & \text{if } s \ne s_t \end{cases}$

Implementation Issues
❐ Could require much more computation
   But most eligibility traces are VERY close to zero
❐ If you implement it in Matlab, the backup is only one line of code and is very fast (Matlab is optimized for matrices); see the vectorized NumPy sketch after the final slide

Variable λ
❐ Can generalize to a variable λ:
   $e_t(s) = \begin{cases} \gamma\lambda_t e_{t-1}(s) & \text{if } s \ne s_t \\ \gamma\lambda_t e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$
❐ Here λ is a function of time
   Could define $\lambda_t = \lambda(s_t)$ or $\lambda_t = \lambda^t$

Conclusions
❐ Eligibility traces provide an efficient, incremental way to combine MC and TD
   Includes advantages of MC (can deal with lack of the Markov property)
   Includes advantages of TD (using the TD error, bootstrapping)
❐ Can significantly speed learning
❐ Does have a cost in computation

Something Here is Not Like the Other
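As flagged on the Implementation Issues slide above, the whole "for all s" backup vectorizes. A NumPy sketch, where the array size, parameter values, and truncation threshold are illustrative assumptions:

```python
import numpy as np

V, e = np.zeros(19), np.zeros(19)              # values and traces for all states
alpha, gamma, lam, delta = 0.1, 0.9, 0.8, 0.5  # illustrative step size, discounting, TD error

# The entire backup over all states is two vectorized lines:
V += alpha * delta * e
e *= gamma * lam

# Optional: most traces are very close to zero, so they can be truncated
# to keep the effective update sparse.
e[e < 1e-8] = 0.0
```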