TD-Gammon's Neural Net Training Rule

As specified in Tesauro's paper "Temporal Difference Learning and TD-Gammon", the formula for weight change in the neural net is:

$$ w_{t+1} = w_t + \alpha \,(Y_{t+1} - Y_t) \sum_{k=1}^{t} \lambda^{t-k} \, \nabla_w Y_k $$

We now show how to calculate the gradient $\nabla_w Y_k$. For concreteness, let us suppose we have a 2-layer feed-forward network, with $h$ inputs, $h$ hidden units, and one output unit. We use the following notation:

$X_{@t} = [\,x_{1@t} \;\cdots\; x_{h@t}\,]$ is the vector of inputs to the neural net at time $t$.

$W_o = [\,w_{o,1} \;\cdots\; w_{o,h}\,]$ are the weights on the output unit.

$W_H = \begin{bmatrix} w_{1,1} & \cdots & w_{1,h} \\ \vdots & & \vdots \\ w_{h,1} & \cdots & w_{h,h} \end{bmatrix}$ are the weights on the hidden units, where each row is a particular hidden unit. $W_{H,i}$ is the $i$-th row of $W_H$.

$Y_{o@t}$ is the value output by the output unit at time $t$.

$Y_{i@t}$ is the value output by hidden unit $i$ at time $t$.

The sigmoid function applied to a vector is defined as the vector of the sigmoid applied to each element: $\sigma([\,x_1 \;\cdots\; x_k\,]) = [\,\sigma(x_1) \;\cdots\; \sigma(x_k)\,]$.

Using the @-sign notation consistently to represent time indices, the update formula is written:

$$ W_{@t+1} = W_{@t} + \alpha \,(Y_{@t+1} - Y_{@t}) \sum_{k=1}^{t} \lambda^{t-k} \, \nabla_w Y_{@k} $$

The following formula represents the neural network:

$$ f(x) = \sigma\!\left(W_o \, \sigma(W_H X^T)\right) $$

Now we can calculate the partial derivative of the network with respect to the weight on hidden unit $i$ that receives input $j$. This will allow us to calculate the update for that weight.

$$
\begin{aligned}
\frac{\partial f(x)}{\partial w_{ij}}
&= \frac{\partial}{\partial w_{ij}} \, \sigma\!\left(W_o \, \sigma(W_H X^T)\right) \\
&= \sigma\!\left(W_o \, \sigma(W_H X^T)\right)\!\left(1 - \sigma\!\left(W_o \, \sigma(W_H X^T)\right)\right) \frac{\partial}{\partial w_{ij}} \, W_o \, \sigma(W_H X^T)
&& \text{chain rule} \\
&= (Y_o)(1 - Y_o) \, \frac{\partial}{\partial w_{ij}} \, W_o \, \sigma(W_H X^T)
&& \text{definition of } Y_o \\
&= (Y_o)(1 - Y_o) \, \frac{\partial}{\partial w_{ij}} \, w_{o,i} \, \sigma(W_{H,i} X^T)
&& \text{eliminate terms not dependent on } w_{ij} \\
&= (Y_o)(1 - Y_o) \, w_{o,i} \, \sigma(W_{H,i} X^T)\left(1 - \sigma(W_{H,i} X^T)\right) \frac{\partial}{\partial w_{ij}} \, W_{H,i} X^T
&& \text{chain rule} \\
&= (Y_o)(1 - Y_o) \, w_{o,i} \, Y_i (1 - Y_i) \, x_j
&& \text{definition of } Y_i
\end{aligned}
$$

This result makes good intuitive sense: the weight update is based on the network output error, the network output, the hidden unit output, the weight between the hidden unit and the output unit, and the input.
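As a sanity check on the derivation, the analytic gradient can be compared against a finite-difference estimate. The following sketch (dimensions, seeds, and variable names are illustrative, not from Tesauro's paper) implements $f(x) = \sigma(W_o \, \sigma(W_H X^T))$ and the derived expression $(Y_o)(1-Y_o)\,w_{o,i}\,Y_i(1-Y_i)\,x_j$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h = 3                                # illustrative size: h inputs, h hidden units
rng = np.random.default_rng(0)
X = rng.standard_normal(h)           # input vector
W_H = rng.standard_normal((h, h))    # hidden weights; row i feeds hidden unit i
W_o = rng.standard_normal(h)         # output-unit weights

def f(W_H, W_o, X):
    """The network: f(x) = sigma(W_o . sigma(W_H X^T))."""
    return sigmoid(W_o @ sigmoid(W_H @ X))

# Analytic partial derivative from the derivation above, for weight w_{ij}.
i, j = 1, 2
Y_hidden = sigmoid(W_H @ X)
Y_o = sigmoid(W_o @ Y_hidden)
analytic = Y_o * (1 - Y_o) * W_o[i] * Y_hidden[i] * (1 - Y_hidden[i]) * X[j]

# Central finite-difference estimate of the same partial derivative.
eps = 1e-6
W_plus = W_H.copy();  W_plus[i, j] += eps
W_minus = W_H.copy(); W_minus[i, j] -= eps
numeric = (f(W_plus, W_o, X) - f(W_minus, W_o, X)) / (2 * eps)

print(abs(analytic - numeric))       # agreement to roughly machine precision
```

The same check, applied to $\partial f / \partial w_{o,i}$, confirms the simpler output-unit gradient derived next.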
To update the weights on the output unit, the calculation is simpler:

$$
\begin{aligned}
\frac{\partial f(x)}{\partial w_{o,i}}
&= \frac{\partial}{\partial w_{o,i}} \, \sigma\!\left(W_o \, \sigma(W_H X^T)\right) \\
&= \sigma\!\left(W_o \, \sigma(W_H X^T)\right)\!\left(1 - \sigma\!\left(W_o \, \sigma(W_H X^T)\right)\right) \frac{\partial}{\partial w_{o,i}} \, W_o \, \sigma(W_H X^T)
&& \text{chain rule} \\
&= (Y_o)(1 - Y_o) \, \frac{\partial}{\partial w_{o,i}} \, W_o \, \sigma(W_H X^T)
&& \text{definition of } Y_o \\
&= (Y_o)(1 - Y_o) \, Y_i
\end{aligned}
$$

Therefore, the update rules for a hidden unit and the output unit respectively, where again we use the @-sign notation for clarity, are:

$$ w_{i,j@t+1} = w_{i,j@t} + \alpha \,(Y_{o@t+1} - Y_{o@t}) \sum_{k=1}^{t} \lambda^{t-k} (Y_{o@k})(1 - Y_{o@k}) \, w_{o,i} \, Y_{i@k} (1 - Y_{i@k}) \, x_{j@k} $$

$$ w_{o,i@t+1} = w_{o,i@t} + \alpha \,(Y_{o@t+1} - Y_{o@t}) \sum_{k=1}^{t} \lambda^{t-k} (Y_{o@k})(1 - Y_{o@k}) \, Y_{i@k} $$

We now compare this update rule to that for backpropagation. Using our notation, the backpropagation rules for a hidden unit and the (single) output unit respectively are, where $T_o$ is the target output value:

$$ w_{i,j@t+1} = w_{i,j@t} + \alpha \,(T_{o@t} - Y_{o@t})(Y_{o@t})(1 - Y_{o@t}) \, w_{o,i@t} \, Y_{i@t} (1 - Y_{i@t}) \, x_{j@t} $$

$$ w_{o,i@t+1} = w_{o,i@t} + \alpha \,(T_{o@t} - Y_{o@t})(Y_{o@t})(1 - Y_{o@t}) \, Y_{i@t} $$

Define the error signal $e = T_o - Y_o$. Then the backpropagation update functions can be written:

$$ B(e, i, j, t) = \alpha \, e \,(Y_{o@t})(1 - Y_{o@t}) \, w_{o,i@t} \, Y_{i@t} (1 - Y_{i@t}) \, x_{j@t} $$

$$ B(e, i, t) = \alpha \, e \,(Y_{o@t})(1 - Y_{o@t}) \, Y_{i@t} $$

Thus we can rewrite the TD-lambda update equations using the backpropagation update functions as follows:

$$ w_{i,j@t+1} = w_{i,j@t} + \sum_{k=1}^{t} B\!\left(\lambda^{t-k}(Y_{o@t+1} - Y_{o@t}),\, i,\, j,\, k\right) $$

$$ w_{o,i@t+1} = w_{o,i@t} + \sum_{k=1}^{t} B\!\left(\lambda^{t-k}(Y_{o@t+1} - Y_{o@t}),\, i,\, k\right) $$

Thus we see that one can implement TD-lambda using routines for backpropagation, simply by adjusting the error signal provided to the backpropagation update function. If the inputs to the backpropagation update function are the output and target values rather than the error signal, you simply let:

$$ T_o = Y_{o@k} + \lambda^{t-k}(Y_{o@t+1} - Y_{o@t}) $$

There is still one subtle problem in implementing TD-lambda using a standard backpropagation routine. As described, TD-lambda updates the weights once per time step, while a sequence of backpropagation calls would update the weights many times, and the later updates would then use the new weights rather than the original weights.
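The sums over $k$ above need not be recomputed at every step: because $\sum_{k=1}^{t} \lambda^{t-k} \nabla_w Y_{@k}$ satisfies $e_t = \lambda e_{t-1} + \nabla_w Y_{@t}$, the updates can be carried out incrementally with accumulated gradients (eligibility traces). The sketch below is one possible implementation under that standard reformulation; the function names and hyperparameter values are illustrative, not from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_and_grads(W_H, W_o, X):
    """Network output plus the gradients derived above:
    dY_o/dw_{o,i} = Y_o(1-Y_o) Y_i
    dY_o/dw_{i,j} = Y_o(1-Y_o) w_{o,i} Y_i(1-Y_i) x_j
    """
    Y_h = sigmoid(W_H @ X)
    Y_o = sigmoid(W_o @ Y_h)
    g_o = Y_o * (1 - Y_o) * Y_h
    g_H = np.outer(Y_o * (1 - Y_o) * W_o * Y_h * (1 - Y_h), X)
    return Y_o, g_H, g_o

def td_lambda_episode(W_H, W_o, inputs, alpha=0.1, lam=0.7):
    """One pass of TD(lambda) over a sequence of input vectors.
    Traces accumulate lambda^{t-k}-discounted gradients, so each weight
    change is alpha * (Y_{o@t+1} - Y_{o@t}) * trace, matching the sums above.
    """
    e_H = np.zeros_like(W_H)          # eligibility trace for hidden weights
    e_o = np.zeros_like(W_o)          # eligibility trace for output weights
    Y_prev, g_H, g_o = output_and_grads(W_H, W_o, inputs[0])
    for X in inputs[1:]:
        e_H = lam * e_H + g_H
        e_o = lam * e_o + g_o
        Y_next, g_H, g_o = output_and_grads(W_H, W_o, X)
        delta = Y_next - Y_prev       # TD error, in place of backprop's (T_o - Y_o)
        W_H = W_H + alpha * delta * e_H
        W_o = W_o + alpha * delta * e_o
        Y_prev = Y_next
    return W_H, W_o

# Usage on a toy episode of random input vectors.
rng = np.random.default_rng(1)
h = 3
W_H = rng.standard_normal((h, h))
W_o = rng.standard_normal(h)
episode = [rng.standard_normal(h) for _ in range(5)]
W_H2, W_o2 = td_lambda_episode(W_H, W_o, episode)
```

The trace form trades the $O(t)$ sum per step for $O(1)$ bookkeeping, at the cost of evaluating each stored gradient with the weights current at time $k$ rather than the original weights, which is exactly the subtlety discussed below.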
I believe this is not a significant difference in practice.
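To make the size of that difference concrete, the toy sketch below (all numbers hypothetical) applies the hidden- and output-unit update functions $B$ two ways for a single weight: summing all contributions first and updating once, versus calling $B$ sequentially so that later calls see the already-updated $w_{o,i}$:

```python
# Hypothetical scalars for one hidden weight w_hij and one output weight w_o.
alpha, lam = 0.1, 0.7
delta = 0.05                          # Y_{o@t+1} - Y_{o@t}
# Stored activations per step k (hypothetical): (Y_{o@k}, Y_{i@k}, x_{j@k})
steps = [(0.6, 0.4, 1.0), (0.55, 0.5, 0.5)]
t = len(steps)

def B_hidden(e, w_o, Y_o, Y_i, x_j):
    """Backprop update function for a hidden weight, as derived above."""
    return alpha * e * Y_o * (1 - Y_o) * w_o * Y_i * (1 - Y_i) * x_j

def B_out(e, Y_o, Y_i):
    """Backprop update function for an output weight."""
    return alpha * e * Y_o * (1 - Y_o) * Y_i

w_hij, w_o = 0.3, 0.8

# (a) TD-lambda as specified: accumulate all contributions, update once.
once_h = w_hij + sum(B_hidden(lam**(t - k - 1) * delta, w_o, *s)
                     for k, s in enumerate(steps))
once_o = w_o + sum(B_out(lam**(t - k - 1) * delta, s[0], s[1])
                   for k, s in enumerate(steps))

# (b) Naive reuse of a backprop routine: weights change between calls,
# so later hidden-weight calls see the already-updated w_o.
seq_h, seq_o = w_hij, w_o
for k, s in enumerate(steps):
    e = lam**(t - k - 1) * delta
    seq_h += B_hidden(e, seq_o, *s)
    seq_o += B_out(e, s[0], s[1])

print(abs(once_h - seq_h))            # nonzero, but tiny relative to the update
```

The discrepancy is second-order in the step size $\alpha$, which is consistent with the claim that it does not matter much in practice.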