Gradient Descent/LMS
Rao Vemuri

LMS Learning
LMS = Least Mean Square learning. LMS systems are more general than the previous perceptron learning rule. The concept is to minimize the total error, as measured over all training examples in D. With y the raw output calculated by the unit, the error is

    Error(LMS) = \frac{1}{2} \sum_{d \in D} (t_d - y_d)^2

E.g., if we have two patterns and t_1 = 1, y_1 = 0.8, t_2 = 0, y_2 = 0.5, then E = (0.5)[(1 - 0.8)^2 + (0 - 0.5)^2] = 0.145.
We want to minimize the LMS error.
[Figure: error surface E plotted against weight W, showing a step from W(old) to W(new), with C the learning rate]

Minimizing Error
• Using LMS, we want to minimize the error.
• We can do this by finding the direction on the error surface that most rapidly reduces the error.
• This is done by finding the slope of the error function by taking the derivative.
• The approach is called gradient descent (similar to hill climbing).

Single-Layer Networks
[Figure: a single-layer network]

Gradient Descent: Linear Unit
• Let us first consider a linear unit:

    y = w_0 + w_1 x_1 + \dots + w_n x_n = \mathbf{w} \cdot \mathbf{x}

• Let us learn the weight values that minimize the squared error:

    E(\mathbf{w}) = \frac{1}{2} \sum_{d \in D} (t_d - y_d)^2

• Here D = {training examples} = the rows in the Experience Table; d is a specific row.

Gradient Descent
Gradient of E with respect to the weights:

    \nabla E(\mathbf{w}) = \left[ \frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \dots, \frac{\partial E}{\partial w_n} \right]^T

Training rule: make a step in the negative direction of the gradient to minimize E:

    \Delta \mathbf{w} = -\eta \, \nabla E(\mathbf{w})

That is,

    \Delta w_j = -\eta \frac{\partial E}{\partial w_j}

Gradient Calculation

    \frac{\partial E}{\partial w_j} = \frac{\partial}{\partial w_j} \frac{1}{2} \sum_d (t_d - y_d)^2 = \frac{1}{2} \sum_d \frac{\partial}{\partial w_j} (t_d - y_d)^2
    = \frac{1}{2} \sum_d 2 (t_d - y_d) \frac{\partial}{\partial w_j} (t_d - y_d)
    = \sum_d (t_d - y_d) \frac{\partial}{\partial w_j} (t_d - \mathbf{w} \cdot \mathbf{x}_d)

    \frac{\partial E}{\partial w_j} = \sum_d (t_d - y_d)(-x_{j,d}) = -\sum_d (t_d - y_d) \, x_{j,d}

Note the Negative Sign
• The negative sign in front of the gradient expression came about because we defined the error as (true - actual).
• The two negative signs cancel each other out in the final formula.
• Had we defined the error as (actual - true), the story would have been different.

Gradient-Descent(training-examples, η)
Supply training examples {(x, t)}, where x is the pattern vector and t is the target class label. η is the learning rate, chosen small, say 0.05.
• Step 1. Initialize each w_j to a small random value.
• Step 2. Until the termination condition is met, do
  – Initialize each Δw_j to zero.
  – For each x from the training examples, do
    · Calculate the output y.
    · For each of the weights w_j of the linear unit, do
        Δw_j ← Δw_j + η (t - y) x_j

Weight Update Formulas

    w_j \leftarrow w_j + \Delta w_j

Note the Negative Sign
• The negative sign in front of Δw came about because we want to travel in the direction opposite to that of the gradient.
• We want to travel against the gradient because we want to climb down the error surface (not up).

Nonlinear Case
• LMS (gradient descent) can also be applied when the squashing function is non-linear.
• But we need the function to be differentiable.
• Two popular non-linear, differentiable functions are
  – the sigmoid
  – the hyperbolic tangent
• Then you have to re-calculate the Δw's for each case, because the derivatives are different.
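To make the linear-unit algorithm above concrete, here is a minimal Python/NumPy sketch (not part of the original slides); the function names, the prepended bias input x_0 = 1, and the fixed epoch count are illustrative assumptions.

```python
import numpy as np

def lms_error(t, y):
    """Error(LMS) = 1/2 * sum_d (t_d - y_d)^2, as defined on the slide."""
    return 0.5 * np.sum((t - y) ** 2)

def train_linear_unit(X, t, eta=0.05, epochs=100):
    """Batch gradient descent for a linear unit y = w . x.

    X: (num_examples, num_inputs) array of pattern vectors (rows of the
       Experience Table); t: (num_examples,) array of targets.
    """
    rng = np.random.default_rng(0)
    Xb = np.column_stack([np.ones(len(X)), X])    # prepend x_0 = 1 so w_0 acts as the bias
    w = rng.normal(scale=0.01, size=Xb.shape[1])  # step 1: small random weights
    for _ in range(epochs):                       # step 2: until termination
        y = Xb @ w                                # output y for every example
        # dE/dw_j = -sum_d (t_d - y_d) x_{j,d}, so
        # Delta w_j = -eta * dE/dw_j = eta * sum_d (t_d - y_d) x_{j,d}
        delta_w = eta * Xb.T @ (t - y)
        w += delta_w                              # w_j <- w_j + Delta w_j
    return w

# The slide's worked example: t = (1, 0), y = (0.8, 0.5) gives E = 0.145.
print(lms_error(np.array([1.0, 0.0]), np.array([0.8, 0.5])))  # -> 0.145 (up to rounding)
```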
Gradient Method: Summary
• The perceptron training rule is guaranteed to succeed if
  – the training examples are linearly separable, and
  – the learning rate is sufficiently small.
• The linear unit training rule that uses gradient descent
  – is guaranteed to converge to the weights that minimize the squared error,
  – works if the learning rate is sufficiently small,
  – works even if there is noise in the training data,
  – works even when the training data is NOT linearly separable.
• That is, the gradient descent method gives you some solution, for any data set, as long as you keep η small.
  – Convergence may be slow or oscillatory.

Batch Mode
• Batch mode of the gradient method.
• Define the batch-mode error:

    E_D(\mathbf{w}) = \frac{1}{2} \sum_{d \in D} (t_d - y_d(\mathbf{w}))^2

• Do until the error criterion is satisfied:
  – Compute the gradient: \nabla E_D(\mathbf{w})
  – Update the weights: \mathbf{w} \leftarrow \mathbf{w} - \eta \, \nabla E_D(\mathbf{w})

Incremental Mode
• Incremental mode of the gradient method.
• Define the incremental-mode error for a single example d:

    E_d(\mathbf{w}) = \frac{1}{2} (t_d - y_d(\mathbf{w}))^2

• Do until the error criterion is satisfied, for each example d:
  – Compute the gradient: \nabla E_d(\mathbf{w})
  – Update the weights: \mathbf{w} \leftarrow \mathbf{w} - \eta \, \nabla E_d(\mathbf{w})

Both Are Approximately the Same
• The incremental gradient method will approximate the batch version if η is sufficiently small.

Gradient Descent: Sigmoid Unit
• If the squashing function is a sigmoid, then its derivative enters the calculation.
• The sigmoid and its derivative are

    f(x) = \sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \frac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))

• Now calculate the gradient as we did before.
• The resulting method is called the DELTA RULE.

Gradient Calculation: DELTA RULE

    \frac{\partial E}{\partial w_j} = \frac{\partial}{\partial w_j} \frac{1}{2} \sum_d (t_d - y_d)^2 = \frac{1}{2} \sum_d \frac{\partial}{\partial w_j} (t_d - y_d)^2

Delta Rule: Step 1

    \frac{\partial E}{\partial w_j} = \frac{1}{2} \sum_d \frac{\partial}{\partial w_j} (t_d - y_d)^2

Up to this point the derivation is the same as before. Now y_d is the output of a sigmoid; that is, y_d = \sigma(net_d).

Delta Rule: Step 2

    \frac{\partial E}{\partial w_j} = \frac{1}{2} \sum_d 2 (t_d - y_d) \frac{\partial}{\partial w_j} (t_d - \sigma(net_d)) = \sum_d (t_d - y_d) \frac{\partial}{\partial w_j} (-\sigma(net_d))

Delta Rule: Step 3
• Using the chain rule of differentiation:

    \sum_d (t_d - y_d) \frac{\partial}{\partial w_j} (-\sigma(net_d)) = -\sum_d (t_d - y_d) \frac{\partial \sigma(net_d)}{\partial (net_d)} \frac{\partial (net_d)}{\partial w_j}

DELTA RULE: Final Step
But we know

    \frac{\partial \sigma(net_d)}{\partial (net_d)} = \sigma(net_d)(1 - \sigma(net_d)), \qquad \frac{\partial (net_d)}{\partial w_j} = \frac{\partial (\mathbf{w} \cdot \mathbf{x}_d)}{\partial w_j} = x_{j,d}

Therefore

    \frac{\partial E}{\partial w_j} = -\sum_{d \in D} (t_d - y_d) \, y_d (1 - y_d) \, x_{j,d}

Weight Update Formulas

    \Delta w_j = -\eta \frac{\partial E}{\partial w_j} = \eta \sum_{d \in D} (t_d - y_d) \, y_d (1 - y_d) \, x_{j,d}

Gradient Descent: tanh Unit
• You can repeat these calculations if you choose a hyperbolic tangent as your non-linear function, and you will get slightly modified formulas because the derivative of tanh (namely 1 - tanh^2(x)) is different from the derivative of the sigmoid. You can do this at home.
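As a concrete illustration of the delta rule in incremental mode, here is a short Python/NumPy sketch (again not from the slides; the function names, bias convention, and the OR-function usage example are my own assumptions).

```python
import numpy as np

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^{-x}); note d(sigma)/dx = sigma(x) * (1 - sigma(x))."""
    return 1.0 / (1.0 + np.exp(-x))

def train_sigmoid_unit(X, t, eta=0.05, epochs=100):
    """Incremental-mode delta rule for a single sigmoid unit.

    Each example d triggers an immediate step against the gradient of
    E_d(w) = 1/2 (t_d - y_d)^2, where y_d = sigma(net_d) = sigma(w . x_d).
    """
    rng = np.random.default_rng(0)
    Xb = np.column_stack([np.ones(len(X)), X])    # prepend x_0 = 1 as the bias input
    w = rng.normal(scale=0.01, size=Xb.shape[1])  # small random initial weights
    for _ in range(epochs):
        for x_d, t_d in zip(Xb, t):
            y_d = sigmoid(w @ x_d)                # y_d = sigma(net_d)
            # dE_d/dw_j = -(t_d - y_d) y_d (1 - y_d) x_{j,d}, so the step
            # w <- w - eta * grad E_d becomes:
            w += eta * (t_d - y_d) * y_d * (1 - y_d) * x_d
    return w

# Usage: learn the OR of two binary inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0.0, 1.0, 1.0, 1.0])
w = train_sigmoid_unit(X, t, eta=0.5, epochs=2000)
print(sigmoid(np.column_stack([np.ones(len(X)), X]) @ w))  # outputs approach t
```

For batch mode, accumulate the per-example steps over all of D before updating the weights, exactly as in the linear-unit sketch earlier.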