Gradient Descent/LMS

Slide 1: Gradient Descent/LMS
Rao Vemuri

Slide 2: LMS Learning
LMS = Least Mean Square learning; it is more general than the previous perceptron learning rule. The idea is to minimize the total error, measured over all D training examples, where y is the raw output, as calculated by:

    Error(LMS) = (1/2) Σ_{d∈D} (t_d − y_d)²

For example, if we have two patterns with t1 = 1, y1 = 0.8 and t2 = 0, y2 = 0.5, then E = (0.5)[(1 − 0.8)² + (0 − 0.5)²] = 0.145. We want to minimize this LMS error.
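The worked example above can be checked numerically; here is a minimal sketch in plain Python (the function name `lms_error` is just an illustrative choice):

```python
# LMS error: E = (1/2) * sum over examples d of (t_d - y_d)^2
def lms_error(targets, outputs):
    return 0.5 * sum((t - y) ** 2 for t, y in zip(targets, outputs))

# The two-pattern example from the slide: t1=1, y1=0.8, t2=0, y2=0.5
E = lms_error([1.0, 0.0], [0.8, 0.5])
# E = 0.5 * ((0.2)^2 + (0.5)^2) = 0.145
```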
[Figure: error curve E versus weight w, showing a step from w(old) to w(new); C = learning rate.]

Slide 3: Minimizing Error
• Using LMS, we want to minimize the error.
• We can do this by finding the direction on the error surface that most rapidly reduces the error.
• This is done by finding the slope of the error function, i.e., by taking its derivative.
• The approach is called gradient descent (similar to hill climbing).

Slide 4: Single-Layer Networks
[Figure: a single-layer network.]
Slide 5: Gradient Descent: Linear Unit
• Let us first consider a linear unit:

    y = w0 + w1 x1 + ... + wn xn = w · x

• Let us learn the weight values that minimize the square of the error:

    E(w) = (1/2) Σ_{d∈D} (t_d − y_d)²

• Here D = {training examples} = the rows in the Experience Table; d is a specific row.

Slide 6: Gradient Descent
Gradient of E with respect to the weights:

    ∇E(w) = [∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wn]ᵀ

Training rule: make a step in the negative direction of the gradient to minimize E:

    Δw = −η ∇E(w)

That is,

    Δw_j = −η ∂E/∂w_j
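The training rule Δw = −η ∇E(w) is ordinary gradient descent; a tiny illustration on a known one-dimensional function (E(w) = (w − 3)², chosen here purely for demonstration):

```python
def grad_descent(grad, w, eta, steps):
    # Repeatedly apply the training rule: w <- w - eta * grad E(w)
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# E(w) = (w - 3)^2 has gradient 2*(w - 3) and its minimum at w = 3
w = grad_descent(lambda w: 2.0 * (w - 3.0), w=0.0, eta=0.1, steps=100)
# w is now very close to 3.0
```

Each step moves w against the slope, so w converges geometrically to the minimizer as long as η is small enough.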
Slide 7: Gradient Calculation

    ∂E/∂w_j = ∂/∂w_j (1/2) Σ_d (t_d − y_d)²
            = (1/2) Σ_d ∂/∂w_j (t_d − y_d)²
            = (1/2) Σ_d 2 (t_d − y_d) ∂/∂w_j (t_d − y_d)
            = Σ_d (t_d − y_d) ∂/∂w_j (t_d − w · x_d)

    ∂E/∂w_j = Σ_d (t_d − y_d)(−x_{j,d}) = −Σ_d (t_d − y_d) x_{j,d}
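The closed-form gradient −Σ_d (t_d − y_d) x_{j,d} can be sanity-checked against a finite-difference estimate of ∂E/∂w_j; a small sketch (the weights and data values below are made up for illustration):

```python
# Analytic gradient of E(w) = (1/2) * sum_d (t_d - w.x_d)^2 versus a
# central finite-difference estimate, for a linear unit with a bias w0.
def predict(w, x):
    # x includes a leading 1.0 so that w[0] acts as the bias weight
    return sum(wi * xi for wi, xi in zip(w, x))

def error(w, data):
    return 0.5 * sum((t - predict(w, x)) ** 2 for x, t in data)

def analytic_grad(w, data, j):
    # dE/dw_j = -sum_d (t_d - y_d) * x_{j,d}
    return -sum((t - predict(w, x)) * x[j] for x, t in data)

def numeric_grad(w, data, j, h=1e-6):
    wp = list(w); wp[j] += h
    wm = list(w); wm[j] -= h
    return (error(wp, data) - error(wm, data)) / (2 * h)

# Tiny made-up data set: rows of ([1, x1, x2], target)
data = [([1.0, 0.5, -1.2], 1.0), ([1.0, -0.3, 0.8], 0.0)]
w = [0.1, -0.2, 0.4]
for j in range(3):
    assert abs(analytic_grad(w, data, j) - numeric_grad(w, data, j)) < 1e-5
```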
Slide 8: Note the Negative Sign
• The negative sign in front of the gradient expression came about because we defined the error as (true − actual).
• The two negative signs cancel each other out in the final formula.
• Had we defined the error as (actual − true), the story would have been different.

Slide 9: Gradient-Descent(training-examples, η)
Supply training examples {(x, t)}, where x is the pattern vector and t is the target class label. η is the learning rate, chosen small, say 0.05.
• Step 1. Initialize each w_j to a small random value.
• Step 2. Until the termination condition is met, do:
  – Initialize each Δw_j to zero.
  – For each x from the training examples, do:
    • Calculate the output y.
    • For each weight w_j of the linear unit, do: Δw_j ← Δw_j + η (t − y) x_j

Slide 10: Weight Update Formulas

    w_j ← w_j + Δw_j
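The training procedure on this slide can be sketched in Python as follows; the fixed epoch count used as the termination condition, and the toy data set, are illustrative assumptions, not from the slides:

```python
import random

def train_linear_unit(examples, eta=0.05, epochs=500):
    """Gradient-descent training of a linear unit.

    examples: list of (x, t) pairs, where x includes a leading 1.0
    so that w[0] acts as the bias weight.
    """
    n = len(examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]  # step 1: small random weights
    for _ in range(epochs):          # step 2: until termination (fixed epoch count here)
        delta = [0.0] * n            # initialize each Delta w_j to zero
        for x, t in examples:
            y = sum(wi * xi for wi, xi in zip(w, x))      # calculate the output y
            for j in range(n):
                delta[j] += eta * (t - y) * x[j]          # accumulate eta * (t - y) * x_j
        for j in range(n):
            w[j] += delta[j]         # weight update: w_j <- w_j + Delta w_j
    return w

random.seed(0)  # for reproducibility of this example
# Learn the made-up target t = 2*x (inputs augmented with a bias input of 1.0)
data = [([1.0, x], 2.0 * x) for x in [-1.0, -0.5, 0.5, 1.0]]
w = train_linear_unit(data, eta=0.05, epochs=500)
# w should approach [0.0, 2.0]
```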
Slide 11: Note the Negative Sign
• The negative sign in front of Δw came about because we want to travel in a direction opposite to that of the gradient.
• We want to travel against the gradient because we want to climb down the error surface (not up).

Slide 12: Nonlinear Case
• LMS (gradient descent) can also be applied when the squashing function is non-linear.
• But we need the function to be differentiable.
• Two popular non-linear, differentiable functions are:
  – Sigmoid
  – Hyperbolic tangent
• Then you have to re-calculate the Δw's for each case, because the derivatives are different.

Slide 13: Gradient Method: Summary
• The perceptron training rule is guaranteed to succeed if:
  – The training examples are linearly separable
  – The learning rate is sufficiently small
• The linear unit training rule that uses gradient descent:
  – Is guaranteed to converge to a minimum of the squared error
  – Works if the learning rate is sufficiently small
  – Works even if there is noise in the training data
  – Works even when the training data is NOT linearly separable
• That is, the gradient descent method gives you some solution for any data set, as long as you keep η small.
  – Convergence may be slow or oscillatory.

Slide 14: Batch Mode
• Batch mode of the gradient method.
• Define the batch mode error:
 2
ED (w) = ∑ (td − yd (w))
2 d∈D
•  Do unYl error criterion is saYsfied 
–  Compute the gradient: ∇E (w)
D
 

w ← w − η ∇ED (w)
–  Update the weights: 14 Incremental Mode •  Incremental Mode of Gradient Method •  Define incremental mode error: 
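A single batch-mode update, w ← w − η ∇E_D(w), can be sketched as follows (the data values and learning rate are made up for illustration):

```python
def batch_step(w, data, eta):
    # One batch-mode update: w <- w - eta * grad E_D(w),
    # where grad_j E_D = -sum_d (t_d - y_d) * x_{j,d}
    n = len(w)
    grad = [0.0] * n
    for x, t in data:
        y = sum(wi * xi for wi, xi in zip(w, x))
        for j in range(n):
            grad[j] += -(t - y) * x[j]
    return [wi - eta * g for wi, g in zip(w, grad)]

# Two made-up examples: ([bias, x1], target); exact fit is w = [0.5, 1.0]
data = [([1.0, 0.5], 1.0), ([1.0, -0.5], 0.0)]
w = [0.0, 0.0]
for _ in range(200):
    w = batch_step(w, data, 0.1)
# the error E_D(w) is now near its minimum
```

Note that the whole data set D is visited before the weights change, which is what distinguishes batch mode from updating after each example.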
 2
1
Ed (w) = (td − yd (w))
2
•  Do unYl error criterion is saYsfied •  Compute the gradient: 
∇Ed (w)
•  Update the weights  

w ← w − η ∇Ed (w)
15 Both are approximately same •  Incremental gradient method will approximate Batch version if η is sufficiently small 16 Gradient Descent: Sigmoid Unit •If the squashing funcYon is sigmoid, then its derivaYve enters the calculaYon •The sigmoid and its derivaYve are 1
    σ(x) = 1 / (1 + e^(−x))

    dσ(x)/dx = σ(x)(1 − σ(x))

• Now calculate the gradient as we did before.
• The resulting method is called the DELTA RULE.

Slide 18: Gradient Calculation: DELTA RULE

    ∂E/∂w_j = ∂/∂w_j (1/2) Σ_d (t_d − y_d)² = (1/2) Σ_d ∂/∂w_j (t_d − y_d)²
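The sigmoid derivative identity dσ/dx = σ(x)(1 − σ(x)), which the delta rule relies on, can be verified numerically; a small sketch:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_deriv(x):
    # the identity: d(sigma)/dx = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# Compare against a central finite difference at a few points
h = 1e-6
for x in [-2.0, 0.0, 1.5]:
    fd = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    assert abs(sigmoid_deriv(x) - fd) < 1e-8
```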
Slide 19: Delta Rule: Step 1

    ∂E/∂w_j = ∂/∂w_j (1/2) Σ_d (t_d − y_d)² = (1/2) Σ_d ∂/∂w_j (t_d − y_d)²

Up to this point, the derivation is the same as before. Now y_d is the output of a sigmoid; that is,

    y_d = σ(net_d)
Slide 20: Delta Rule: Step 2

    ∂E/∂w_j = (1/2) Σ_d ∂/∂w_j (t_d − y_d)² = Σ_d (t_d − y_d) ∂/∂w_j (t_d − σ(net_d))

Slide 21: Delta Rule: Step 3
• Using the chain rule of differentiation:

    Σ_d (t_d − y_d) ∂/∂w_j (−σ(net_d)) = −Σ_d (t_d − y_d) [∂σ(net_d)/∂(net_d)] [∂(net_d)/∂w_j]
Slide 22: DELTA RULE: Final Step
But we know:

    ∂σ(net_d)/∂(net_d) = σ(net_d)(1 − σ(net_d))

    ∂(net_d)/∂w_j = ∂(w · x_d)/∂w_j = x_{j,d}

Therefore:

    ∂E/∂w_j = −Σ_{d∈D} (t_d − y_d) y_d (1 − y_d) x_{j,d}

Slide 23: Weight Update Formulas

    Δw_j = −η ∂E/∂w_j
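Combining the last two slides, Δw_j = η Σ_d (t_d − y_d) y_d (1 − y_d) x_{j,d}; here is a minimal sketch of delta-rule training for a single sigmoid unit, on a made-up linearly separable data set (the learning rate and epoch count are illustrative assumptions):

```python
import math

def train_sigmoid_unit(examples, eta=0.5, epochs=2000):
    """Delta-rule training of a single sigmoid unit.

    examples: list of (x, t) with x including a leading 1.0 (bias input)
    and t in {0, 1}.
    """
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        delta = [0.0] * n
        for x, t in examples:
            net = sum(wi * xi for wi, xi in zip(w, x))
            y = 1.0 / (1.0 + math.exp(-net))
            for j in range(n):
                # delta rule: eta * (t - y) * y * (1 - y) * x_j
                delta[j] += eta * (t - y) * y * (1.0 - y) * x[j]
        for j in range(n):
            w[j] += delta[j]
    return w

# Made-up data: target is 1 exactly when x1 > 0
data = [([1.0, -2.0], 0.0), ([1.0, -1.0], 0.0), ([1.0, 1.0], 1.0), ([1.0, 2.0], 1.0)]
w = train_sigmoid_unit(data)
# sigmoid(w . x) is now above 0.5 for the positive examples and below 0.5 for the negative ones
```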
Slide 24: Gradient Descent: tanh Unit
• You can repeat these calculations if you choose a hyperbolic tangent as your non-linear function; you will get slightly modified formulas, because the derivative of tanh is different from the derivative of the sigmoid. You can do this at home.
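As a hint for the exercise, the only change is the derivative factor: d(tanh x)/dx = 1 − tanh²(x), so Δw_j = η Σ_d (t_d − y_d)(1 − y_d²) x_{j,d}. A sketch under that assumption, with targets in {−1, +1} to match tanh's range (the data values are made up):

```python
import math

def tanh_delta_w(examples, w, eta):
    # One batch of delta-rule accumulation for a tanh unit:
    # d(tanh net)/d(net) = 1 - tanh(net)^2, so
    # Delta w_j = eta * sum_d (t_d - y_d) * (1 - y_d^2) * x_{j,d}
    n = len(w)
    delta = [0.0] * n
    for x, t in examples:
        net = sum(wi * xi for wi, xi in zip(w, x))
        y = math.tanh(net)
        for j in range(n):
            delta[j] += eta * (t - y) * (1.0 - y * y) * x[j]
    return delta

# Made-up data with targets in {-1, +1}; x includes a bias input of 1.0
data = [([1.0, -1.0], -1.0), ([1.0, 1.0], 1.0)]
w = [0.0, 0.0]
for _ in range(200):
    d = tanh_delta_w(data, w, eta=0.1)
    w = [wi + di for wi, di in zip(w, d)]
# tanh(w . x) now has the correct sign on both examples
```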