Regression and Classification with Neural Networks

Note to other teachers and users of these slides: Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use of a significant portion of these slides in your own lecture, please include this message, or the following link to the source repository of Andrew's tutorials: http://www.cs.cmu.edu/~awm/tutorials . Comments and corrections gratefully received.

Andrew W. Moore, Professor, School of Computer Science, Carnegie Mellon University
www.cs.cmu.edu/~awm · awm@cs.cmu.edu · 412-268-7599
Copyright © 2001, 2003, Andrew W. Moore. Sep 25th, 2001.

Linear Regression

DATASET (inputs and outputs):
x1 = 1     y1 = 1
x2 = 3     y2 = 2.2
x3 = 2     y3 = 2
x4 = 1.5   y4 = 1.9
x5 = 4     y5 = 3.1

Linear regression assumes that the expected value of the output given an input, E[y|x], is linear. Simplest case: Out(x) = wx for some unknown w. Given the data, we can estimate w.

1-parameter linear regression

Assume that the data is formed by $y_i = w x_i + \mathrm{noise}_i$, where
• the noise signals are independent, and
• the noise has a normal distribution with mean 0 and unknown variance σ².
Then P(y|w,x) has a normal distribution with mean wx and variance σ².

Bayesian Linear Regression

P(y|w,x) = Normal(mean wx, variance σ²).
We have a set of datapoints (x1,y1), (x2,y2), …, (xn,yn) which are EVIDENCE about w. We want to infer w from the data: P(w | x1, …, xn, y1, …, yn).
• You can use BAYES rule to work out a posterior distribution for w given the data.
• Or you could do Maximum Likelihood Estimation.

Maximum likelihood estimation of w

Asks the question: "For which value of w is this data most likely to have happened?"
⇔ For what w is $P(y_1, \ldots, y_n \mid x_1, \ldots, x_n, w)$ maximized?
⇔ For what w is $\prod_{i=1}^{n} P(y_i \mid w, x_i)$ maximized?
⇔ For what w is $\prod_{i=1}^{n} \exp\left(-\frac{1}{2}\left(\frac{y_i - w x_i}{\sigma}\right)^2\right)$ maximized?
⇔ For what w is $\sum_{i=1}^{n} -\frac{1}{2}\left(\frac{y_i - w x_i}{\sigma}\right)^2$ maximized?
⇔ For what w is $\sum_{i=1}^{n} (y_i - w x_i)^2$ minimized?

Linear Regression

The maximum likelihood w is the one that minimizes the sum of squares of the residuals:
$E(w) = \sum_i (y_i - w x_i)^2 = \sum_i y_i^2 - \left(2\sum_i x_i y_i\right) w + \left(\sum_i x_i^2\right) w^2$
We want to minimize a quadratic function of w. It is easy to show the sum of squares is minimized when
$w = \frac{\sum_i x_i y_i}{\sum_i x_i^2}$
The maximum likelihood model is Out(x) = wx, and we can use it for prediction.
Note: in Bayesian stats you'd have ended up with a probability distribution over w, and predictions would have given a probability distribution over the expected output. It is often useful to know your confidence. Maximum likelihood can give some kinds of confidence too.
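A quick numerical check of the closed form above: a minimal Python sketch (not from the original slides), using the dataset from the Linear Regression slide; the query point 2.5 is just an illustrative choice.

```python
# Minimal sketch: 1-parameter linear regression MLE, w = sum(x*y) / sum(x^2),
# using the five datapoints from the Linear Regression slide.
xs = [1.0, 3.0, 2.0, 1.5, 4.0]
ys = [1.0, 2.2, 2.0, 1.9, 3.1]

w = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
print("MLE w =", w)                      # the maximum likelihood slope
print("prediction Out(2.5) =", w * 2.5)  # using the model for prediction
```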
Multivariate Regression

What if the inputs are vectors? (2-d input example: each datapoint is a point in the x1-x2 plane with an associated output.) The dataset has the form
x_1 → y_1
x_2 → y_2
 ⋮      ⋮
x_R → y_R

Write the matrices X and Y thus:
$X = \begin{pmatrix} \cdots \mathbf{x}_1 \cdots \\ \cdots \mathbf{x}_2 \cdots \\ \vdots \\ \cdots \mathbf{x}_R \cdots \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & & & \vdots \\ x_{R1} & x_{R2} & \cdots & x_{Rm} \end{pmatrix}, \qquad Y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_R \end{pmatrix}$
(There are R datapoints; each input has m components.)
The linear regression model assumes a vector w such that
Out(x) = wᵀx = w1 x[1] + w2 x[2] + … + wm x[m]
The maximum likelihood w is w = (XᵀX)⁻¹(XᵀY).
IMPORTANT EXERCISE: PROVE IT!!!!!

Multivariate Regression (con't)

The maximum likelihood w is w = (XᵀX)⁻¹(XᵀY).
XᵀX is an m × m matrix: its i,j'th element is $\sum_{k=1}^{R} x_{ki} x_{kj}$.
XᵀY is an m-element vector: its i'th element is $\sum_{k=1}^{R} x_{ki} y_k$.

What about a constant term?

We may expect linear data that does not go through the origin. Statisticians and neural net folks all agree on a simple, obvious hack. Can you guess?

The constant term

• The trick is to create a fake input "X0" that always takes the value 1.

Before:            After:
X1  X2  Y          X0  X1  X2  Y
 2   4  16          1   2   4  16
 3   4  17          1   3   4  17
 5   5  20          1   5   5  20

Before: Y = w1 X1 + w2 X2 …has to be a poor model.
After: Y = w0 X0 + w1 X1 + w2 X2 = w0 + w1 X1 + w2 X2 …has a fine constant term.
In this example, you should be able to see the MLE w0, w1 and w2 by inspection.

Regression with varying noise

• Suppose you know the variance of the noise that was added to each datapoint:

x_i   y_i   σ_i²
 ½     ½     4
 1     1     1
 2     1     ¼
 2     3     4
 3     2     ¼

Assume $y_i \sim N(w x_i,\ \sigma_i^2)$. What's the MLE of w?

MLE estimation with varying noise

$\operatorname*{argmax}_{w} \log p(y_1, \ldots, y_R \mid x_1, \ldots, x_R, \sigma_1^2, \ldots, \sigma_R^2, w)$
  (assuming i.i.d. data, plugging in the Gaussian and simplifying)
$= \operatorname*{argmin}_{w} \sum_{i=1}^{R} \frac{(y_i - w x_i)^2}{\sigma_i^2}$
  (setting dLL/dw equal to zero)
$=$ the w such that $\sum_{i=1}^{R} \frac{x_i (y_i - w x_i)}{\sigma_i^2} = 0$
  (trivial algebra)
$= \frac{\sum_{i=1}^{R} x_i y_i / \sigma_i^2}{\sum_{i=1}^{R} x_i^2 / \sigma_i^2}$

This is Weighted Regression

• We are asking to minimize the weighted sum of squares
$\operatorname*{argmin}_{w} \sum_{i=1}^{R} \frac{(y_i - w x_i)^2}{\sigma_i^2}$
where the weight for the i'th datapoint is $1 / \sigma_i^2$.

Weighted Multivariate Regression

The maximum likelihood w is w = (WXᵀWX)⁻¹(WXᵀWY).
(WXᵀWX) is an m × m matrix: its i,j'th element is $\sum_{k=1}^{R} \frac{x_{ki} x_{kj}}{\sigma_k^2}$.
(WXᵀWY) is an m-element vector: its i'th element is $\sum_{k=1}^{R} \frac{x_{ki} y_k}{\sigma_k^2}$.
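To make the closed form concrete, here is a minimal Python sketch (not from the original slides) for the one-input case, using the dataset from the "Regression with varying noise" slide; the variable names are mine.

```python
# Minimal sketch: MLE for one-input regression with known, varying noise,
#   w = sum(x*y/sigma^2) / sum(x*x/sigma^2),
# using the dataset from the "Regression with varying noise" slide.
xs  = [0.5, 1.0, 2.0, 2.0, 3.0]
ys  = [0.5, 1.0, 1.0, 3.0, 2.0]
s2s = [4.0, 1.0, 0.25, 4.0, 0.25]   # per-datapoint noise variances sigma_i^2

num = sum(x * y / s2 for x, y, s2 in zip(xs, ys, s2s))
den = sum(x * x / s2 for x, s2 in zip(xs, s2s))
print("weighted MLE w =", num / den)
```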
Non-linear Regression

• Suppose you know that y is related to a function of x in such a way that the predicted values have a non-linear dependence on w, e.g.:

x_i   y_i
 ½     ½
 1     2.5
 2     3
 3     2
 3     3

Assume $y_i \sim N(\sqrt{w + x_i},\ \sigma^2)$. What's the MLE of w?

Non-linear MLE estimation

$\operatorname*{argmax}_{w} \log p(y_1, \ldots, y_R \mid x_1, \ldots, x_R, \sigma^2, w)$
  (assuming i.i.d. data, plugging in the Gaussian and simplifying)
$= \operatorname*{argmin}_{w} \sum_{i=1}^{R} \left(y_i - \sqrt{w + x_i}\right)^2$
  (setting dLL/dw equal to zero)
$=$ the w such that $\sum_{i=1}^{R} \frac{y_i - \sqrt{w + x_i}}{\sqrt{w + x_i}} = 0$

We're down the algebraic toilet. So guess what we do?

Common (but not the only) approach: numerical solutions.
• Line search
• Simulated annealing
• Gradient descent
• Conjugate gradient
• Levenberg-Marquardt
• Newton's method
Also, special-purpose statistical-optimization-specific tricks such as E.M. (see the Gaussian Mixtures lecture for an introduction).

GRADIENT DESCENT

Suppose we have a scalar function $f(w): \Re \rightarrow \Re$ and we want to find a local minimum. Assume our current weight is w.
GRADIENT DESCENT RULE:  $w \leftarrow w - \eta\, \frac{\partial}{\partial w} f(w)$
η is called the LEARNING RATE: a small positive number, e.g. η = 0.05 (recall Andrew's favorite default value for anything).
QUESTION: Justify the Gradient Descent Rule.

Gradient Descent in "m" Dimensions

Given $f(\mathbf{w}): \Re^m \rightarrow \Re$,
$\nabla f(\mathbf{w}) = \begin{pmatrix} \frac{\partial}{\partial w_1} f(\mathbf{w}) \\ \vdots \\ \frac{\partial}{\partial w_m} f(\mathbf{w}) \end{pmatrix}$ points in the direction of steepest ascent, and $|\nabla f(\mathbf{w})|$ is the gradient in that direction.
GRADIENT DESCENT RULE:  $\mathbf{w} \leftarrow \mathbf{w} - \eta\, \nabla f(\mathbf{w})$
Equivalently, $w_j \leftarrow w_j - \eta\, \frac{\partial}{\partial w_j} f(\mathbf{w})$ …where $w_j$ is the j'th weight.
"Just like a linear feedback system."

What's all this got to do with Neural Nets, then, eh??

For supervised learning, neural nets are also models with vectors of w parameters in them. They are now called weights. As before, we want to compute the weights to minimize the sum of squared residuals, which turns out, under the "Gaussian i.i.d. noise" assumption, to be maximum likelihood. Instead of explicitly solving for the maximum likelihood weights, we use GRADIENT DESCENT to SEARCH for them.
("Why?" you ask, a querulous expression in your eyes. "Aha!!" I reply: "We'll see later.")
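For instance, the non-linear MLE problem a few slides back yields to nothing fancier than the gradient descent rule. A minimal Python sketch (not from the original slides), assuming the model Out(x) = √(w + x) and the small dataset from the Non-linear Regression slide; the starting point, learning rate, and step count are illustrative choices.

```python
import math

# Minimal sketch: gradient descent on E(w) = sum_i (y_i - sqrt(w + x_i))^2,
# the non-linear MLE problem from the slides above (assuming Out(x) = sqrt(w + x)).
xs = [0.5, 1.0, 2.0, 3.0, 3.0]
ys = [0.5, 2.5, 3.0, 2.0, 3.0]

def dE_dw(w):
    # dE/dw = -sum_i (y_i - sqrt(w + x_i)) / sqrt(w + x_i)
    return -sum((y - math.sqrt(w + x)) / math.sqrt(w + x) for x, y in zip(xs, ys))

w, eta = 1.0, 0.05
for _ in range(1000):
    w = w - eta * dE_dw(w)    # the gradient descent rule: w <- w - eta * dE/dw
print("numerically estimated w =", w)
```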
Linear Perceptrons

They are multivariate linear models: Out(x) = wᵀx. And "training" consists of minimizing the sum of squared residuals by gradient descent:
$E = \sum_k \left(\mathrm{Out}(\mathbf{x}_k) - y_k\right)^2 = \sum_k \left(\mathbf{w}^\mathsf{T}\mathbf{x}_k - y_k\right)^2$
QUESTION: Derive the perceptron training rule.

Linear Perceptron Training Rule

$E = \sum_{k=1}^{R} (y_k - \mathbf{w}^\mathsf{T}\mathbf{x}_k)^2$
Gradient descent tells us we should update w thusly if we wish to minimize E:
$w_j \leftarrow w_j - \eta\, \frac{\partial E}{\partial w_j}$
So what's $\partial E / \partial w_j$?
$\frac{\partial E}{\partial w_j} = \sum_{k=1}^{R} \frac{\partial}{\partial w_j}(y_k - \mathbf{w}^\mathsf{T}\mathbf{x}_k)^2 = \sum_{k=1}^{R} 2\,(y_k - \mathbf{w}^\mathsf{T}\mathbf{x}_k)\,\frac{\partial}{\partial w_j}(y_k - \mathbf{w}^\mathsf{T}\mathbf{x}_k) = -2\sum_{k=1}^{R} \delta_k\, \frac{\partial}{\partial w_j}\mathbf{w}^\mathsf{T}\mathbf{x}_k = -2\sum_{k=1}^{R} \delta_k\, \frac{\partial}{\partial w_j}\sum_{i=1}^{m} w_i x_{ki} = -2\sum_{k=1}^{R} \delta_k\, x_{kj}$
…where $\delta_k = y_k - \mathbf{w}^\mathsf{T}\mathbf{x}_k$. So the update is
$w_j \leftarrow w_j + 2\eta \sum_{k=1}^{R} \delta_k\, x_{kj}$
We frequently neglect the 2 (meaning we halve the learning rate).

The "Batch" perceptron algorithm

1) Randomly initialize the weights w1, w2, …, wm.
2) Get your dataset (append 1's to the inputs if you don't want to go through the origin).
3) for i = 1 to R:  $\delta_i := y_i - \mathbf{w}^\mathsf{T}\mathbf{x}_i$
4) for j = 1 to m:  $w_j \leftarrow w_j + \eta \sum_{i=1}^{R} \delta_i\, x_{ij}$
5) If $\sum_i \delta_i^2$ stops improving then stop; else loop back to 3.

A RULE KNOWN BY MANY NAMES

$\delta_i \leftarrow y_i - \mathbf{w}^\mathsf{T}\mathbf{x}_i, \qquad w_j \leftarrow w_j + \eta\, \delta_i\, x_{ij}$
The LMS rule, the delta rule, the Widrow-Hoff rule, the adaline rule, classical conditioning.

If data is voluminous and arrives fast

Input-output pairs (x, y) come streaming in very quickly. THEN don't bother remembering old ones. Just keep using new ones. On observing (x, y):
$\delta \leftarrow y - \mathbf{w}^\mathsf{T}\mathbf{x}$
$\forall j:\; w_j \leftarrow w_j + \eta\, \delta\, x_j$
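A minimal Python/NumPy sketch of the batch rule above (not from the original slides). It reuses the 1-d dataset from the Linear Regression slide with a constant-1 input appended; the learning rate, tolerance, and iteration cap are hand-picked illustrative values.

```python
import numpy as np

def batch_perceptron(X, y, eta=0.02, tol=1e-12, max_iters=100000):
    """The batch perceptron / delta rule from the slides above:
    delta_i := y_i - w.x_i, then w_j <- w_j + eta * sum_i delta_i * x_ij,
    stopping when the sum of squared deltas stops improving."""
    R, m = X.shape
    w = np.zeros(m)                    # initial weights (zeros here, for repeatability)
    prev_sse = np.inf
    for _ in range(max_iters):
        delta = y - X @ w              # step 3: residuals
        w = w + eta * (X.T @ delta)    # step 4: update every weight
        sse = float(delta @ delta)     # step 5: sum of squared residuals
        if prev_sse - sse < tol:
            break
        prev_sse = sse
    return w

# Usage: the 1-d dataset from the Linear Regression slide, with a column of 1's
# appended (the constant-term hack) so the fit need not pass through the origin.
xs = np.array([1.0, 3.0, 2.0, 1.5, 4.0])
ys = np.array([1.0, 2.2, 2.0, 1.9, 3.1])
X = np.column_stack([np.ones_like(xs), xs])
print("w0, w1 =", batch_perceptron(X, ys))
# The learning rate is hand-picked; too large a value makes this iteration diverge.
```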
Gradient Descent vs Matrix Inversion for Linear Perceptrons

GD Advantages (MI disadvantages):
• Biologically plausible.
• With very, very many attributes, each iteration costs only O(mR). If fewer than m iterations are needed, we've beaten matrix inversion.
• More easily parallelizable (or implementable in wetware)?

GD Disadvantages (MI advantages):
• It's moronic.
• It's essentially a slow implementation of a way to build the XᵀX matrix and then solve a set of linear equations.
• If m is small it's especially outrageous. If m is large then the direct matrix inversion method gets fiddly, but not impossible, if you want to be efficient.
• Hard to choose a good learning rate.
• Matrix inversion takes predictable time. You can't be sure when gradient descent will stop.

But we'll see soon that GD has an important extra trick up its sleeve.

Perceptrons for Classification

What if all outputs are 0's or 1's? We can do a linear fit, and our prediction is
  0 if Out(x) ≤ ½
  1 if Out(x) > ½
(In the figures: blue = Out(x), green = the resulting classification.)
WHAT'S THE BIG PROBLEM WITH THIS???

Classification with Perceptrons I

Don't minimize $\sum_i (y_i - \mathbf{w}^\mathsf{T}\mathbf{x}_i)^2$. Minimize the number of misclassifications instead [assume outputs are +1 and -1, not 1 and 0]:
$\sum_i \left(y_i - \mathrm{Round}(\mathbf{w}^\mathsf{T}\mathbf{x}_i)\right)$, where Round(x) = -1 if x < 0 and 1 if x ≥ 0.
NOTE: CUTE & NON-OBVIOUS WHY THIS WORKS!!
The gradient descent rule can be changed to:
  if (x_i, y_i) is correctly classified: don't change w
  if wrongly predicted as 1:  w ← w − x_i
  if wrongly predicted as −1: w ← w + x_i
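A minimal Python/NumPy sketch of that update rule (not from the original slides); the toy data (the AND function, with a constant-1 input appended) and the epoch cap are my choices.

```python
import numpy as np

def train_perceptron_classifier(X, y, max_epochs=100):
    """The classification update from the slide above: leave w alone when a point
    is correctly classified; w <- w - x if it is wrongly predicted as +1;
    w <- w + x if it is wrongly predicted as -1 (labels are +1 / -1)."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        changed = False
        for x, target in zip(X, y):
            pred = 1 if w @ x >= 0 else -1
            if pred != target:
                w = w + x if target == 1 else w - x
                changed = True
        if not changed:        # every point correctly classified
            return w
    return w

# Toy usage: 2-d points with a constant input of 1 appended (for the w0 term).
# The target is the AND function, which a single perceptron can learn.
X = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0], [1.0, 1.0, 1.0]])
y = np.array([-1, -1, -1, 1])
print("learned w =", train_perceptron_classifier(X, y))
```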
Classification with Perceptrons II: Sigmoid Functions

A least squares fit is useless here; a different fit would classify much better, but it would not be a least squares fit.
SOLUTION: instead of Out(x) = wᵀx we'll use Out(x) = g(wᵀx), where $g(x): \Re \rightarrow (0,1)$ is a squashing function.

The Sigmoid

$g(h) = \frac{1}{1 + \exp(-h)}$
Note that if you rotate this curve through 180° centered on (0, ½) you get the same curve, i.e. g(h) = 1 − g(−h). Can you prove this?
Now we choose w to minimize
$\sum_{i=1}^{R} \left[y_i - \mathrm{Out}(\mathbf{x}_i)\right]^2 = \sum_{i=1}^{R} \left[y_i - g(\mathbf{w}^\mathsf{T}\mathbf{x}_i)\right]^2$

Linear Perceptron Classification Regions

(Figure: a 2-d input space with 0-labelled points on one side and 1-labelled points on the other.)
We'll use the model Out(x) = g(wᵀ(x,1)) = g(w1 x1 + w2 x2 + w0).
Which region of the diagram is classified with +1, and which with 0?

Gradient descent with sigmoid on a perceptron

First, notice that $g'(x) = g(x)\,(1 - g(x))$, because
$g(x) = \frac{1}{1 + e^{-x}} \;\Rightarrow\; g'(x) = \frac{e^{-x}}{(1 + e^{-x})^2}$, and
$g(x)\,(1 - g(x)) = \frac{1}{1 + e^{-x}}\left(1 - \frac{1}{1 + e^{-x}}\right) = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \frac{e^{-x}}{(1 + e^{-x})^2} = g'(x)$
With $\mathrm{Out}(\mathbf{x}) = g\left(\sum_k w_k x_k\right)$ and $E = \sum_i \left(y_i - g\left(\sum_k w_k x_{ik}\right)\right)^2$,
$\frac{\partial E}{\partial w_j} = \sum_i 2\left(y_i - g\left(\sum_k w_k x_{ik}\right)\right)\left(-\frac{\partial}{\partial w_j} g\left(\sum_k w_k x_{ik}\right)\right) = \sum_i -2\left(y_i - g\left(\sum_k w_k x_{ik}\right)\right) g'\left(\sum_k w_k x_{ik}\right) \frac{\partial}{\partial w_j}\sum_k w_k x_{ik} = \sum_i -2\,\delta_i\, g(\mathrm{net}_i)\,(1 - g(\mathrm{net}_i))\, x_{ij}$
where $\delta_i = y_i - \mathrm{Out}(\mathbf{x}_i)$ and $\mathrm{net}_i = \sum_k w_k x_{ik}$.
The sigmoid perceptron update rule:
$w_j \leftarrow w_j + \eta \sum_{i=1}^{R} \delta_i\, g_i\,(1 - g_i)\, x_{ij}$, where $g_i = g\left(\sum_{j=1}^{m} w_j x_{ij}\right)$ and $\delta_i = y_i - g_i$.

Other Things about Perceptrons

• Invented and popularized by Rosenblatt (1962).
• Even with the sigmoid nonlinearity, correct convergence is guaranteed.
• Stable behavior for overconstrained and underconstrained problems.

Perceptrons and Boolean Functions

If inputs are all 0's and 1's and outputs are all 0's and 1's…
• Can learn the function x1 ∧ x2.
• Can learn the function x1 ∨ x2.
• Can learn any conjunction of literals, e.g. x1 ∧ ~x2 ∧ ~x3 ∧ x4 ∧ x5.
QUESTION: WHY?
• Can learn any disjunction of literals, e.g. x1 ∨ ~x2 ∨ ~x3 ∨ x4 ∨ x5.
• Can learn the majority function: f(x1, x2, …, xn) = 1 if n/2 or more of the xi's are 1, and 0 if fewer than n/2 of the xi's are 1.
• What about the exclusive-or function? f(x1, x2) = x1 ⊕ x2 = (x1 ∧ ~x2) ∨ (~x1 ∧ x2)

Multilayer Networks

The class of functions representable by perceptrons is limited:
$\mathrm{Out}(\mathbf{x}) = g\left(\mathbf{w}^\mathsf{T}\mathbf{x}\right) = g\left(\sum_j w_j x_j\right)$
Use a wider representation!
$\mathrm{Out}(\mathbf{x}) = g\left(\sum_j W_j\, g\left(\sum_k w_{jk}\, x_k\right)\right)$
This is a nonlinear function of a linear combination of nonlinear functions of linear combinations of inputs.

A 1-HIDDEN LAYER NET

NINPUTS = 2, NHIDDEN = 3. Each hidden unit computes
$v_j = g\left(\sum_{k=1}^{N_{INS}} w_{jk}\, x_k\right)$, and the output is $\mathrm{Out} = g\left(\sum_{k=1}^{N_{HID}} W_k\, v_k\right)$

OTHER NEURAL NETS

• 2 hidden layers, plus a constant term.
• "JUMP" connections: $\mathrm{Out} = g\left(\sum_{k=1}^{N_{INS}} w_{0k}\, x_k + \sum_{k=1}^{N_{HID}} W_k\, v_k\right)$

Backpropagation

$\mathrm{Out}(\mathbf{x}) = g\left(\sum_j W_j\, g\left(\sum_k w_{jk}\, x_k\right)\right)$
Find a set of weights $\{W_j\}, \{w_{jk}\}$ to minimize $\sum_i \left(y_i - \mathrm{Out}(\mathbf{x}_i)\right)^2$ by gradient descent.
That's it! That's the backpropagation algorithm.

Backpropagation Convergence

Convergence to a global minimum is not guaranteed.
• In practice, this is not a problem, apparently.
Tweaking to find the right number of hidden units, or a useful learning rate η, is more hassle, apparently.
IMPLEMENTING BACKPROP: differentiate the monster sum-square residual and write down the gradient descent rule. It turns out to be easier and computationally efficient to use lots of local variables with names like h_j, o_k, v_j, net_i, etc.
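To make "that's it" concrete, here is a minimal Python/NumPy sketch of backpropagation for a 1-hidden-layer sigmoid net (not from the original slides). The gradients follow from the same chain rule as the sigmoid perceptron rule above; the XOR data, the constant-1 input, the network size, the learning rate, and the iteration count are all my illustrative choices, and (as the convergence slide warns) reaching the global minimum is not guaranteed.

```python
import numpy as np

def g(h):                                   # the sigmoid squashing function
    return 1.0 / (1.0 + np.exp(-h))

rng = np.random.default_rng(0)
n_in, n_hid = 3, 3                          # 2 inputs plus a constant-1 input
w = rng.normal(size=(n_hid, n_in))          # input-to-hidden weights w_jk
W = rng.normal(size=n_hid)                  # hidden-to-output weights W_j

# Exclusive-or, the function a single perceptron cannot represent
# (the third input column is the constant 1).
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])

eta = 0.5
for _ in range(20000):
    for x, target in zip(X, y):
        v = g(w @ x)                              # hidden activations v_j
        out = g(W @ v)                            # network output Out(x)
        d_out = (target - out) * out * (1 - out)  # (y - Out) * g'(net) at the output
        d_hid = d_out * W * v * (1 - v)           # chain rule back to each hidden unit
        W = W + eta * d_out * v                   # gradient descent step on W_j
        w = w + eta * np.outer(d_hid, x)          # gradient descent step on w_jk

# With luck the outputs approach [0, 1, 1, 0]; a local minimum is possible.
print([round(float(g(W @ g(w @ xi))), 2) for xi in X])
```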
Choosing the learning rate

• This is a subtle art.
• Too small: it can take days instead of minutes to converge.
• Too large: it diverges (the MSE gets larger and larger while the weights increase and usually oscillate).
• Sometimes the "just right" value is hard to find.

Learning-rate problems

(Figure illustrating learning-rate problems, from J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation, Addison-Wesley, 1994.)

Improving Simple Gradient Descent: Momentum

Don't just change the weights according to the current datapoint; re-use changes from earlier iterations. Let Δw(t) be the weight change at time t, and let $-\eta\, \frac{\partial E}{\partial \mathbf{w}}$ be the change we would make with regular gradient descent. Instead we use
$\Delta\mathbf{w}(t+1) = -\eta\, \frac{\partial E}{\partial \mathbf{w}} + \alpha\, \Delta\mathbf{w}(t)$
$\mathbf{w}(t+1) = \mathbf{w}(t) + \Delta\mathbf{w}(t)$
where α is the momentum parameter. Momentum damps oscillations. A hack? Well, maybe.
(Momentum illustration: figure.)

Improving Simple Gradient Descent: Newton's method

$E(\mathbf{w} + \mathbf{h}) = E(\mathbf{w}) + \mathbf{h}^\mathsf{T}\frac{\partial E}{\partial \mathbf{w}} + \frac{1}{2}\,\mathbf{h}^\mathsf{T}\frac{\partial^2 E}{\partial \mathbf{w}^2}\,\mathbf{h} + O(|\mathbf{h}|^3)$
If we neglect the O(h³) terms, this is a quadratic form.
Quadratic form fun facts: if $y = c + \mathbf{b}^\mathsf{T}\mathbf{x} - \frac{1}{2}\,\mathbf{x}^\mathsf{T}A\,\mathbf{x}$ and A is SPD, then $\mathbf{x}_{opt} = A^{-1}\mathbf{b}$ is the value of x that maximizes y. So take the step
$\mathbf{w} \leftarrow \mathbf{w} - \left(\frac{\partial^2 E}{\partial \mathbf{w}^2}\right)^{-1}\frac{\partial E}{\partial \mathbf{w}}$
This should send us directly to the global minimum if the function is truly quadratic, and it might get us close if it's locally quadratic-ish.
BUT (and it's a big but)… that second-derivative matrix can be expensive and fiddly to compute. If we're not already in the quadratic bowl, we'll go nuts.

Improving Simple Gradient Descent: Conjugate Gradient

Another method which attempts to exploit the "local quadratic bowl" assumption, but does so while only needing to use $\frac{\partial E}{\partial \mathbf{w}}$ and not $\frac{\partial^2 E}{\partial \mathbf{w}^2}$. It is also more stable than Newton's method if the local quadratic bowl assumption is violated. It's complicated and outside our scope, but it often works well. More details are in Numerical Recipes in C.

BEST GENERALIZATION

Intuitively, you want to use the smallest, simplest net that seems to fit the data. HOW TO FORMALIZE THIS INTUITION?
1. Don't. Just use intuition.
2. Bayesian methods get it right.
3. Statistical analysis explains what's going on.
4. Cross-validation (discussed in the next lecture).
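A minimal Python sketch of the momentum update above (not from the original slides); the 1-d quadratic error and the values of η and α are illustrative choices.

```python
# Minimal sketch: gradient descent with momentum,
#   dw(t+1) = -eta * dE/dw + alpha * dw(t),   then   w <- w + dw,
# on a 1-d quadratic error E(w) = (w - 3)^2.
def dE_dw(w):
    return 2.0 * (w - 3.0)

w, dw = 0.0, 0.0
eta, alpha = 0.05, 0.9       # alpha is the momentum parameter
for _ in range(200):
    dw = -eta * dE_dw(w) + alpha * dw
    w = w + dw
print("w after momentum gradient descent =", w)   # settles near the minimum at w = 3
```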
What You Should Know

• How to implement multivariate least-squares linear regression.
• The derivation of least squares as the maximum likelihood estimator of the linear coefficients.
• The general gradient descent rule.
• Perceptrons:
  → linear output, least squares
  → sigmoid output, least squares
• Multilayer nets:
  → the idea behind backprop
  → awareness of better minimization methods
• Generalization: what it means.

APPLICATIONS

To discuss:
• What can non-linear regression be useful for?
• What can neural nets (used as non-linear regressors) be useful for?
• What are the advantages of neural nets for non-linear regression?
• What are the disadvantages?

Other Uses of Neural Nets…

• Time series with recurrent nets
• Unsupervised learning (clustering, principal components, and non-linear versions thereof)
• Combinatorial optimization with Hopfield nets and Boltzmann machines
• Evaluation function learning (in reinforcement learning)