TD-Gammon's Neural Net Training Rule

As specified in Tesauro’s paper “Temporal Difference Learning and TD-Gammon”, the
formula for weight change in the neural net is:
$$w_{t+1} = w_t + \alpha\,(Y_{t+1} - Y_t)\sum_{k=1}^{t}\lambda^{t-k}\,\nabla_w Y_k$$
We now show how to calculate the gradient $\nabla_w Y_k$.
For concreteness, let us suppose we have a 2-layer feed-forward network with $h$ inputs, $h$
hidden units, and one output unit. We use the following notation:
$X_{@t} = [x_{1@t} \;\cdots\; x_{h@t}]$ is the vector of inputs to the neural net at time $t$.
$W_o = [w_{o,1} \;\cdots\; w_{o,h}]$ are the weights on the output unit.
$$W_H = \begin{bmatrix} w_{1,1} & \cdots & w_{1,h} \\ \vdots & & \vdots \\ w_{h,1} & \cdots & w_{h,h} \end{bmatrix}$$
are the weights on the hidden units, where each row is a particular hidden unit.
$W_{H,i}$ is the $i$-th row of $W_H$.
$Y_{o@t}$ is the value output by the output unit at time $t$.
$Y_{i@t}$ is the value output by hidden unit $i$ at time $t$.
The sigmoid function applied to a vector is defined as the vector of
the sigmoid applied to each element:
$$\sigma([x_1 \;\cdots\; x_k]) = [\sigma(x_1) \;\cdots\; \sigma(x_k)]$$
Using the @-sign notation consistently to represent time indices, the update
formula is written:
$$W_{@t+1} = W_{@t} + \alpha\,(Y_{@t+1} - Y_{@t})\sum_{k=1}^{t}\lambda^{t-k}\,\nabla_w Y_{@k}$$
The following formula represents the neural network:
$$f(x) = \sigma(W_o \cdot \sigma(W_H \cdot X^T))$$
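As a concrete reference, here is a minimal NumPy sketch of this forward pass (the function and variable names are ours, not from Tesauro's paper):

```python
import numpy as np

def sigmoid(z):
    # elementwise logistic sigmoid, as defined above
    return 1.0 / (1.0 + np.exp(-z))

def forward(W_H, W_o, x):
    """Forward pass f(x) = sigma(W_o . sigma(W_H . x)).

    W_H: (h, h) hidden-unit weights, one row per hidden unit
    W_o: (h,)   output-unit weights
    x:   (h,)   input vector
    Returns (Y_o, Y), where Y holds the hidden-unit outputs.
    """
    Y = sigmoid(W_H @ x)     # hidden-unit outputs Y_i
    Y_o = sigmoid(W_o @ Y)   # network output Y_o
    return Y_o, Y
```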
Now we can calculate the partial derivative of the network with respect to the weight on
hidden unit $i$ that receives input $j$. This will allow us to calculate the update for that weight.
$$
\begin{aligned}
\frac{\partial f(x)}{\partial w_{ij}}
&= \frac{\partial\,\sigma(W_o \cdot \sigma(W_H \cdot X^T))}{\partial w_{ij}} \\
&= \sigma(W_o \cdot \sigma(W_H \cdot X^T))\bigl(1 - \sigma(W_o \cdot \sigma(W_H \cdot X^T))\bigr)\,\frac{\partial\,(W_o \cdot \sigma(W_H \cdot X^T))}{\partial w_{ij}} && \text{chain rule} \\
&= (Y_o)(1 - Y_o)\,\frac{\partial\,(W_o \cdot \sigma(W_H \cdot X^T))}{\partial w_{ij}} && \text{definition of } Y_o \\
&= (Y_o)(1 - Y_o)\,\frac{\partial\,(w_{o,i}\,\sigma(W_{H,i} \cdot X^T))}{\partial w_{ij}} && \text{eliminate terms not dependent on } w_{ij} \\
&= (Y_o)(1 - Y_o)\,w_{o,i}\,\sigma(W_{H,i} \cdot X^T)\bigl(1 - \sigma(W_{H,i} \cdot X^T)\bigr)\,\frac{\partial\,(W_{H,i} \cdot X^T)}{\partial w_{ij}} && \text{chain rule} \\
&= (Y_o)(1 - Y_o)\,w_{o,i}\,Y_i(1 - Y_i)\,\frac{\partial\,(W_{H,i} \cdot X^T)}{\partial w_{ij}} && \text{definition of } Y_i \\
&= (Y_o)(1 - Y_o)\,w_{o,i}\,Y_i(1 - Y_i)\,x_j
\end{aligned}
$$
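A quick way to validate a derivative like this is a finite-difference check. The sketch below (helper names are ours) computes the analytic expression so it can be compared against a central-difference approximation of $\partial f/\partial w_{ij}$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def f(W_H, W_o, x):
    # f(x) = sigma(W_o . sigma(W_H . x))
    return sigmoid(W_o @ sigmoid(W_H @ x))

def grad_hidden(W_H, W_o, x, i, j):
    # The result derived above:
    # (Y_o)(1 - Y_o) * w_{o,i} * Y_i (1 - Y_i) * x_j
    Y = sigmoid(W_H @ x)
    Y_o = sigmoid(W_o @ Y)
    return Y_o * (1 - Y_o) * W_o[i] * Y[i] * (1 - Y[i]) * x[j]
```

Perturbing the single entry `W_H[i, j]` by a small epsilon and differencing the two network outputs should agree with `grad_hidden` to high precision.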
This result makes good intuitive sense: the weight update is based on the network output
error, the network output, the hidden unit output, the weight between the hidden unit and
the output unit, and the input.
To update weights on the output unit, the calculation is simpler:
$$
\begin{aligned}
\frac{\partial f(x)}{\partial w_{o,i}}
&= \frac{\partial\,\sigma(W_o \cdot \sigma(W_H \cdot X^T))}{\partial w_{o,i}} \\
&= \sigma(W_o \cdot \sigma(W_H \cdot X^T))\bigl(1 - \sigma(W_o \cdot \sigma(W_H \cdot X^T))\bigr)\,\frac{\partial\,(W_o \cdot \sigma(W_H \cdot X^T))}{\partial w_{o,i}} && \text{chain rule} \\
&= (Y_o)(1 - Y_o)\,\frac{\partial\,(W_o \cdot \sigma(W_H \cdot X^T))}{\partial w_{o,i}} && \text{definition of } Y_o \\
&= (Y_o)(1 - Y_o)\,Y_i
\end{aligned}
$$
Therefore, the update rules for a hidden unit and the output unit respectively, where again
we use the @-sign notation for clarity, are:
$$w_{i,j@t+1} = w_{i,j@t} + \alpha\,(Y_{o@t+1} - Y_{o@t})\sum_{k=1}^{t}\lambda^{t-k}\,(Y_{o@k})(1 - Y_{o@k})\,w_{o,i}\,Y_{i@k}(1 - Y_{i@k})\,x_{j@k}$$
$$w_{o,i@t+1} = w_{o,i@t} + \alpha\,(Y_{o@t+1} - Y_{o@t})\sum_{k=1}^{t}\lambda^{t-k}\,(Y_{o@k})(1 - Y_{o@k})\,Y_{i@k}$$
We now compare this update rule to that for backpropagation. Using our notation, and
writing $T_o$ for the target output value, the backpropagation rules for a hidden unit and the
(single) output unit respectively are:
$$w_{i,j@t+1} = w_{i,j@t} + \alpha\,(T_{o@t} - Y_{o@t})(Y_{o@t})(1 - Y_{o@t})\,w_{o,i@t}\,Y_{i@t}(1 - Y_{i@t})\,x_{j@t}$$
$$w_{o,i@t+1} = w_{o,i@t} + \alpha\,(T_{o@t} - Y_{o@t})(Y_{o@t})(1 - Y_{o@t})\,Y_{i@t}$$
Define the error signal $e = T_o - Y_o$. Then the backpropagation update functions can be written:
$$B(e, i, j, t) = \alpha\,(e)(Y_{o@t})(1 - Y_{o@t})\,w_{o,i@t}\,Y_{i@t}(1 - Y_{i@t})\,x_{j@t}$$
$$B(e, i, t) = \alpha\,(e)(Y_{o@t})(1 - Y_{o@t})\,Y_{i@t}$$
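Expressed directly as code, these two update functions might look like the following sketch (argument names are ours; the per-time-step quantities are passed in explicitly rather than indexed by $t$):

```python
def B_hidden(alpha, e, Y_o, w_oi, Y_i, x_j):
    # B(e, i, j, t) = alpha * e * Y_o (1 - Y_o) * w_{o,i} * Y_i (1 - Y_i) * x_j
    return alpha * e * Y_o * (1 - Y_o) * w_oi * Y_i * (1 - Y_i) * x_j

def B_output(alpha, e, Y_o, Y_i):
    # B(e, i, t) = alpha * e * Y_o (1 - Y_o) * Y_i
    return alpha * e * Y_o * (1 - Y_o) * Y_i
```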
Thus we can rewrite the TD-lambda update equations using the backpropagation update
functions as follows (note that each term in the sum is evaluated at time $k$):
$$w_{i,j@t+1} = w_{i,j@t} + \sum_{k=1}^{t} B\bigl(\lambda^{t-k}(Y_{o@t+1} - Y_{o@t}),\; i,\; j,\; k\bigr)$$
$$w_{o,i@t+1} = w_{o,i@t} + \sum_{k=1}^{t} B\bigl(\lambda^{t-k}(Y_{o@t+1} - Y_{o@t}),\; i,\; k\bigr)$$
Thus we see that one can implement TD-lambda using routines for backpropagation,
simply by adjusting the error signal provided to the backpropagation update function. If
the inputs to the backpropagation update function are the output and target values rather
than the error signal, you simply let:
$$T_o = Y_{o@t} + \lambda^{t-k}(Y_{o@t+1} - Y_{o@t})$$
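For example, a small helper (name ours) that builds this pseudo-target; feeding it to a backprop routine that expects (target, output) reproduces the scaled TD error, since $T_o - Y_{o@t} = \lambda^{t-k}(Y_{o@t+1} - Y_{o@t})$:

```python
def td_pseudo_target(Y_t, Y_t1, lam, t, k):
    # Pseudo-target for a backprop routine that takes (target, output):
    # T_o - Y_t then equals lam**(t - k) * (Y_t1 - Y_t).
    return Y_t + lam ** (t - k) * (Y_t1 - Y_t)
```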
There is still one subtle problem in implementing TD-lambda using a standard
backpropagation routine. As described, TD-lambda updates the weights once per time step,
while calling a backpropagation function for each term of the sum would update the
weights many times, and the later updates would then use the new weights rather than the
original weights. I believe this is not a significant difference in practice.