Gradient Descent/LMS
Rao Vemuri

LMS Learning
LMS = Least Mean Square learning. LMS systems are more general than the previous perceptron learning rule. The concept is to minimize the total error, as measured over all training examples in D. With y the raw output calculated by the unit, the error is

    Error(LMS) = \frac{1}{2} \sum_{d \in D} (t_d - y_d)^2

E.g., if we have two patterns and t_1 = 1, y_1 = 0.8, t_2 = 0, y_2 = 0.5, then E = (0.5)[(1 - 0.8)^2 + (0 - 0.5)^2] = 0.145.
We want to minimize the LMS error.
[Figure: error surface E plotted against weight W, showing a step from W(old) to W(new), with C the learning rate]

Minimizing Error
• Using LMS, we want to minimize the error.
• We can do this by finding the direction on the error surface that most rapidly reduces the error.
• This is done by finding the slope of the error function by taking the derivative.
• The approach is called gradient descent (similar to hill climbing).

Single-Layer Networks
[Figure: a single-layer network]

Gradient Descent: Linear Unit
• Let us first consider a linear unit:

    y = w_0 + w_1 x_1 + \dots + w_n x_n = \mathbf{w} \cdot \mathbf{x}

• Let us learn the weight values that minimize the squared error:

    E(\mathbf{w}) = \frac{1}{2} \sum_{d \in D} (t_d - y_d)^2

• Here D = {training examples} = the rows in the Experience Table; d is a specific row.

Gradient Descent
Gradient of E with respect to the weights:

    \nabla E(\mathbf{w}) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \dots, \frac{\partial E}{\partial w_n} \right]^T

Training rule: make a step in the negative direction of the gradient to minimize E:

    \Delta \mathbf{w} = -\eta \, \nabla E(\mathbf{w})

That is,

    \Delta w_j = -\eta \frac{\partial E}{\partial w_j}

Gradient Calculation

    \frac{\partial E}{\partial w_j} = \frac{\partial}{\partial w_j} \frac{1}{2} \sum_d (t_d - y_d)^2 = \frac{1}{2} \sum_d \frac{\partial}{\partial w_j} (t_d - y_d)^2
    = \frac{1}{2} \sum_d 2 (t_d - y_d) \frac{\partial}{\partial w_j} (t_d - y_d)
    = \sum_d (t_d - y_d) \frac{\partial}{\partial w_j} (t_d - \mathbf{w} \cdot \mathbf{x}_d)

    \frac{\partial E}{\partial w_j} = \sum_d (t_d - y_d)(-x_{j,d}) = -\sum_d (t_d - y_d) \, x_{j,d}

Note the Negative Sign
• The negative sign in front of the gradient expression came about because we defined the error as (true - actual).
• The two negative signs cancel each other out in the final formula.
• Had we defined the error as (actual - true), the story would have been different.

Gradient-Descent(training-examples, η)
Supply training examples {(x, t)}, where x is the pattern vector and t is the target class label. η is the learning rate, chosen small, say 0.05.
• Step 1. Initialize each w_j to a small random value.
• Step 2. Until the termination condition is met, do
  – Initialize each Δw_j to zero.
  – For each x from the training examples, do
    · Calculate the output y.
    · For each of the weights w_j of the linear unit, do
        Δw_j ← Δw_j + η (t - y) x_j

Weight Update Formulas

    w_j \leftarrow w_j + \Delta w_j

Note the Negative Sign
• The negative sign in front of Δw came about because we want to travel in the direction opposite to that of the gradient.
• We want to travel against the gradient because we want to climb down the error surface (not up).

Nonlinear Case
• LMS (gradient descent) can also be applied when the squashing function is non-linear.
• But we need the function to be differentiable.
• Two popular non-linear, differentiable functions are
  – the sigmoid
  – the hyperbolic tangent
• Then you have to re-calculate the Δw's for each case, because the derivatives are different.
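To make the linear-unit algorithm above concrete, here is a minimal Python/NumPy sketch (not part of the original slides); the function names, the prepended bias input x_0 = 1, and the fixed epoch count are illustrative assumptions.

```python
import numpy as np

def lms_error(t, y):
    """Error(LMS) = 1/2 * sum_d (t_d - y_d)^2, as defined on the slide."""
    return 0.5 * np.sum((t - y) ** 2)

def train_linear_unit(X, t, eta=0.05, epochs=100):
    """Batch gradient descent for a linear unit y = w . x.

    X: (num_examples, num_inputs) array of pattern vectors (rows of the
       Experience Table); t: (num_examples,) array of targets.
    """
    rng = np.random.default_rng(0)
    Xb = np.column_stack([np.ones(len(X)), X])    # prepend x_0 = 1 so w_0 acts as the bias
    w = rng.normal(scale=0.01, size=Xb.shape[1])  # step 1: small random weights
    for _ in range(epochs):                       # step 2: until termination
        y = Xb @ w                                # output y for every example
        # dE/dw_j = -sum_d (t_d - y_d) x_{j,d}, so
        # Delta w_j = -eta * dE/dw_j = eta * sum_d (t_d - y_d) x_{j,d}
        delta_w = eta * Xb.T @ (t - y)
        w += delta_w                              # w_j <- w_j + Delta w_j
    return w

# The slide's worked example: t = (1, 0), y = (0.8, 0.5) gives E = 0.145.
print(lms_error(np.array([1.0, 0.0]), np.array([0.8, 0.5])))  # -> 0.145 (up to rounding)
```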
Gradient Method: Summary
• The perceptron training rule is guaranteed to succeed if
  – the training examples are linearly separable, and
  – the learning rate is sufficiently small.
• The linear unit training rule that uses gradient descent
  – is guaranteed to converge to the weights that minimize the squared error,
  – works if the learning rate is sufficiently small,
  – works even if there is noise in the training data,
  – works even when the training data is NOT linearly separable.
• That is, the gradient descent method gives you some solution, for any data set, as long as you keep η small.
  – Convergence may be slow or oscillatory.

Batch Mode
• Batch mode of the gradient method.
• Define the batch-mode error:

    E_D(\mathbf{w}) = \frac{1}{2} \sum_{d \in D} (t_d - y_d(\mathbf{w}))^2

• Do until the error criterion is satisfied:
  – Compute the gradient: \nabla E_D(\mathbf{w})
  – Update the weights: \mathbf{w} \leftarrow \mathbf{w} - \eta \, \nabla E_D(\mathbf{w})

Incremental Mode
• Incremental mode of the gradient method.
• Define the incremental-mode error for a single example d:

    E_d(\mathbf{w}) = \frac{1}{2} (t_d - y_d(\mathbf{w}))^2

• Do until the error criterion is satisfied, for each example d:
  – Compute the gradient: \nabla E_d(\mathbf{w})
  – Update the weights: \mathbf{w} \leftarrow \mathbf{w} - \eta \, \nabla E_d(\mathbf{w})

Both Are Approximately the Same
• The incremental gradient method will approximate the batch version if η is sufficiently small.

Gradient Descent: Sigmoid Unit
• If the squashing function is a sigmoid, then its derivative enters the calculation.
• The sigmoid and its derivative are

    f(x) = \sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))

• Now calculate the gradient as we did before.
• The resulting method is called the DELTA RULE.

Gradient Calculation: DELTA RULE

    \frac{\partial E}{\partial w_j} = \frac{\partial}{\partial w_j} \frac{1}{2} \sum_d (t_d - y_d)^2 = \frac{1}{2} \sum_d \frac{\partial}{\partial w_j} (t_d - y_d)^2

Delta Rule: Step 1

    \frac{\partial E}{\partial w_j} = \frac{1}{2} \sum_d \frac{\partial}{\partial w_j} (t_d - y_d)^2

Up to this point the derivation is the same as before. Now y_d is the output of a sigmoid; that is, y_d = \sigma(net_d).

Delta Rule: Step 2

    \frac{\partial E}{\partial w_j} = \frac{1}{2} \sum_d 2 (t_d - y_d) \frac{\partial}{\partial w_j} (t_d - \sigma(net_d)) = \sum_d (t_d - y_d) \frac{\partial}{\partial w_j} (-\sigma(net_d))

Delta Rule: Step 3
• Using the chain rule of differentiation:

    \sum_d (t_d - y_d) \frac{\partial}{\partial w_j} (-\sigma(net_d)) = -\sum_d (t_d - y_d) \frac{\partial \sigma(net_d)}{\partial (net_d)} \frac{\partial (net_d)}{\partial w_j}

DELTA RULE: Final Step
But we know

    \frac{\partial \sigma(net_d)}{\partial (net_d)} = \sigma(net_d)(1 - \sigma(net_d)), \qquad \frac{\partial (net_d)}{\partial w_j} = \frac{\partial (\mathbf{w} \cdot \mathbf{x}_d)}{\partial w_j} = x_{j,d}

Therefore

    \frac{\partial E}{\partial w_j} = -\sum_{d \in D} (t_d - y_d) \, y_d (1 - y_d) \, x_{j,d}

Weight Update Formulas

    \Delta w_j = -\eta \frac{\partial E}{\partial w_j} = \eta \sum_{d \in D} (t_d - y_d) \, y_d (1 - y_d) \, x_{j,d}

Gradient Descent: tanh Unit
• You can repeat these calculations if you choose a hyperbolic tangent as your non-linear function, and you will get slightly modified formulas because the derivative of tanh (namely 1 - tanh^2(x)) is different from the derivative of the sigmoid. You can do this at home.
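As a concrete illustration of the delta rule in incremental mode, here is a short Python/NumPy sketch (again not from the slides; the function names, bias convention, and the OR-function usage example are my own assumptions).

```python
import numpy as np

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^{-x}); note d(sigma)/dx = sigma(x) * (1 - sigma(x))."""
    return 1.0 / (1.0 + np.exp(-x))

def train_sigmoid_unit(X, t, eta=0.05, epochs=100):
    """Incremental-mode delta rule for a single sigmoid unit.

    Each example d triggers an immediate step against the gradient of
    E_d(w) = 1/2 (t_d - y_d)^2, where y_d = sigma(net_d) = sigma(w . x_d).
    """
    rng = np.random.default_rng(0)
    Xb = np.column_stack([np.ones(len(X)), X])    # prepend x_0 = 1 as the bias input
    w = rng.normal(scale=0.01, size=Xb.shape[1])  # small random initial weights
    for _ in range(epochs):
        for x_d, t_d in zip(Xb, t):
            y_d = sigmoid(w @ x_d)                # y_d = sigma(net_d)
            # dE_d/dw_j = -(t_d - y_d) y_d (1 - y_d) x_{j,d}, so the step
            # w <- w - eta * grad E_d becomes:
            w += eta * (t_d - y_d) * y_d * (1 - y_d) * x_d
    return w

# Usage: learn the OR of two binary inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0.0, 1.0, 1.0, 1.0])
w = train_sigmoid_unit(X, t, eta=0.5, epochs=2000)
print(sigmoid(np.column_stack([np.ones(len(X)), X]) @ w))  # outputs approach t
```

For batch mode, accumulate the per-example steps over all of D before updating the weights, exactly as in the linear-unit sketch earlier.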