Connectionist Models: Backprop
Jerome Feldman
CS182/CogSci110/Ling109, Spring 2008

Hebb's rule is not sufficient
What happens if the neural circuit fires perfectly, but the result is very bad for the animal, like eating something sickening? A pure invocation of Hebb's rule would strengthen all participating connections, which can't be good. On the other hand, it isn't right to weaken all the active connections involved; much of the activity was just recognizing the situation – we would like to change only those connections that led to the wrong decision. No one knows how to specify a learning rule that will change exactly the offending connections when an error occurs. Computer systems, and presumably nature as well, rely upon statistical learning rules that tend to make the right changes over time. More in later lectures.

Hebb's rule is insufficient
[Figure: a small circuit in which a tastebud signals "tastes rotten", the animal eats the food, gets sick, and drinks water. Should you "punish" all the connections?]

Models of Learning
- Hebbian – coincidence
- Supervised – correction (backprop)
- Recruitment – one-trial
- Reinforcement learning – delayed reward
- Unsupervised – similarity

Abstract Neuron
Inputs i_1 … i_n arrive with weights w_1 … w_n, plus a bias input i_0 = 1 with weight w_0.
Net input: net = Σ_{i=0..n} w_i · i_i
Threshold activation function: y = 1 if net > 0, and y = 0 otherwise.

Boolean XOR
x1  x2  output
0   0   0
0   1   1
1   0   1
1   1   0
A single threshold unit cannot compute XOR, but a two-layer network can. Hidden unit h1 computes OR (weights 1, 1; threshold 0.5) and hidden unit h2 computes AND (weights 1, 1; threshold 1.5). The output unit o (threshold 0.5) receives weight +1 from h1 and -1 from h2, so it fires exactly when OR is true and AND is false, i.e., XOR. (A small code sketch of this construction follows the sigmoid slide below.)

Supervised Learning - Backprop
How do we train the weights of the network? Basic concepts:
- Use a continuous, differentiable activation function (the sigmoid)
- Use the idea of gradient descent on the error surface
- Extend to multiple layers

Backprop
To learn on data which is not linearly separable:
- Build multiple-layer networks (with a hidden layer)
- Use a sigmoid squashing function instead of a step function

Tasks
- Unconstrained pattern classification
- Credit assessment
- Digit classification
- Speech recognition
- Function approximation
- Learning control
- Stock prediction

Sigmoid Squashing Function
net = Σ_{i=0..n} w_i · y_i, with bias input y_0 = 1
y = 1 / (1 + e^(-net))
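To make the XOR construction and the squashing function concrete, here is a minimal Python sketch (not from the original slides) using the weights shown above: an OR unit with threshold 0.5, an AND unit with threshold 1.5, and an output unit with weights +1 and -1 and threshold 0.5. The thresholds are written explicitly rather than folded into a bias weight, and the function names are my own.

```python
import math

def step(net, threshold):
    """Threshold unit: fire (output 1) when the net input exceeds the threshold."""
    return 1 if net > threshold else 0

def sigmoid(net):
    """Sigmoid squashing function, used in place of the step function for backprop."""
    return 1.0 / (1.0 + math.exp(-net))

def xor_net(x1, x2):
    """Two-layer threshold network from the XOR slide."""
    h1 = step(1 * x1 + 1 * x2, 0.5)    # OR unit (threshold 0.5)
    h2 = step(1 * x1 + 1 * x2, 1.5)    # AND unit (threshold 1.5)
    return step(1 * h1 - 1 * h2, 0.5)  # output fires when OR is true and AND is false

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))  # reproduces the XOR truth table
```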
The Sigmoid Function
[Plots of y = σ(net): the output saturates near 0 for large negative net and near 1 for large positive net; the steep middle region is where the output is most sensitive to the input.]

Gradient Descent
Gradient descent on an error surface.

Learning Rule – Gradient Descent on the Root Mean Square (RMS) Error
Learn the w_i that minimize the squared error
E[w] = ½ Σ_{k∈O} (t_k - o_k)²,
where O is the set of output units, t_k is the target, and o_k is the network output.

Gradient Descent
Gradient: ∇E[w] = [∂E/∂w_0, ∂E/∂w_1, …, ∂E/∂w_n]
Training rule: Δw = -η ∇E[w], i.e. Δw_i = -η ∂E/∂w_i, with E[w] = ½ Σ_{k∈O} (t_k - o_k)².

Gradient Descent
[Figure: the error surface plotted over two weights, with the global minimum marked: "this is your goal". It should really be 4-D (3 weights), but you get the idea.]

Backpropagation Algorithm
Generalization to multiple layers and multiple output units.

Backprop Details
Here we go…

The output layer
Weights W_jk connect layer k to hidden layer j, and weights W_ij connect layer j to output layer i, with targets t_i.
E = Error = ½ Σ_i (t_i - y_i)²
ΔW_ij = -η ∂E/∂W_ij   (η is the learning rate)
∂E/∂W_ij = (∂E/∂y_i)(∂y_i/∂x_i)(∂x_i/∂W_ij) = -(t_i - y_i) · f′(x_i) · y_j = -(t_i - y_i) · y_i(1 - y_i) · y_j
ΔW_ij = η (t_i - y_i) y_i (1 - y_i) y_j = η δ_i y_j, where δ_i = (t_i - y_i) y_i (1 - y_i).

Nice Property of Sigmoids
The derivative of the sigmoid can be written in terms of its output: f′(x) = f(x)(1 - f(x)), which is where the factor y_i(1 - y_i) above comes from.

The hidden layer
ΔW_jk = -η ∂E/∂W_jk
∂E/∂W_jk = (∂E/∂y_j)(∂y_j/∂x_j)(∂x_j/∂W_jk)
∂E/∂y_j = -Σ_i (t_i - y_i) f′(x_i) W_ij = -Σ_i δ_i W_ij
∂y_j/∂x_j = f′(x_j) = y_j(1 - y_j),  ∂x_j/∂W_jk = y_k
ΔW_jk = η [Σ_i δ_i W_ij] y_j (1 - y_j) y_k = η δ_j y_k, where δ_j = y_j(1 - y_j) Σ_i δ_i W_ij.

Let's just do an example
A single sigmoid unit with inputs i_1, i_2 and a bias input b = 1, and weights W_01 = 0.8, W_02 = 0.6, W_0b = 0.5. The target outputs t_0 are those of the OR function:
i1  i2  t0
0   0   0
0   1   1
1   0   1
1   1   1
For the input (i_1, i_2) = (0, 0), the target is t_0 = 0:
x_0 = 0.8·0 + 0.6·0 + 0.5·1 = 0.5
y_0 = 1 / (1 + e^(-0.5)) = 0.6224
E = ½ (t_0 - y_0)² = ½ (0 - 0.6224)² = 0.1937
δ_0 = (t_0 - y_0) · y_0 (1 - y_0) = (0 - 0.6224) · 0.6224 · (1 - 0.6224) = -0.1463
ΔW_01 = η δ_0 i_1 = 0 and ΔW_02 = η δ_0 i_2 = 0 (those inputs are zero), while ΔW_0b = η δ_0 b. Suppose η = 0.5; then ΔW_0b = 0.5 × (-0.1463) = -0.0731, so the bias weight drops from 0.5 to roughly 0.427.

An informal account of BackProp
For each pattern in the training set:
- Compute the error at the output nodes.
- Compute Δw for each weight in the 2nd layer.
- Compute delta (the generalized error expression) for the hidden units.
- Compute Δw for each weight in the 1st layer.
After amassing Δw for all of the weights, change each weight a little bit, as determined by the learning rate:
Δw_ij = η δ_i^p o_j^p   (for training pattern p).

Backpropagation Algorithm
- Initialize all weights to small random numbers.
- For each training example do:
  - Propagate the activations forward: for each hidden unit h, y_h = σ(Σ_i w_hi x_i); for each output unit k, y_k = σ(Σ_h w_kh y_h).
  - Propagate the errors backward: for each output unit k, δ_k = y_k (1 - y_k)(t_k - y_k); for each hidden unit h, δ_h = y_h (1 - y_h) Σ_k w_kh δ_k.
  - Update each network weight: w_ij ← w_ij + Δw_ij, where Δw_ij = η δ_j x_ij.
(A code sketch of this loop appears after the overfitting slide below.)

Momentum term
The speed of learning is governed by the learning rate η:
- If the rate is low, convergence is slow.
- If the rate is too high, the error oscillates without reaching the minimum.
Adding a momentum term helps:
Δw_ij(n) = α Δw_ij(n-1) + η δ_i(n) y_j(n),  with 0 ≤ α < 1.
- Momentum tends to smooth out small weight-error fluctuations.
- Momentum accelerates the descent in steady downhill directions.
- Momentum has a stabilizing effect in directions that oscillate in time.

Convergence
- May get stuck in local minima.
- Weights may diverge.
- …but it often works well in practice.
Representation power:
- 2-layer networks: any continuous function
- 3-layer networks: any function

Pattern Separation and NN architecture
[Figure: pattern separation ability of different network architectures.]

Overfitting and generalization
Too many hidden nodes tends to overfit.
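The following is a minimal Python/NumPy sketch of the backpropagation algorithm above: a 2-2-1 sigmoid network trained on XOR with per-pattern updates and a momentum term. The architecture, learning rate, momentum value, and random seed are my own illustrative choices, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# XOR training patterns and targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

n_in, n_hid, n_out = 2, 2, 1
# Initialize all weights to small random numbers; the last column of each matrix is the bias weight.
W_hid = rng.uniform(-0.5, 0.5, size=(n_hid, n_in + 1))
W_out = rng.uniform(-0.5, 0.5, size=(n_out, n_hid + 1))
dW_hid = np.zeros_like(W_hid)   # previous updates, kept for the momentum term
dW_out = np.zeros_like(W_out)

eta, alpha = 0.5, 0.8            # learning rate and momentum (illustrative values)

for epoch in range(10000):
    for x, t in zip(X, T):
        x_b = np.append(x, 1.0)                  # input plus bias
        y_hid = sigmoid(W_hid @ x_b)             # hidden activations
        y_hid_b = np.append(y_hid, 1.0)
        y_out = sigmoid(W_out @ y_hid_b)         # output activations

        # Error terms: delta_k = y_k(1 - y_k)(t_k - y_k); delta_h = y_h(1 - y_h) * sum_k w_kh delta_k
        delta_out = y_out * (1 - y_out) * (t - y_out)
        delta_hid = y_hid * (1 - y_hid) * (W_out[:, :n_hid].T @ delta_out)

        # Weight updates with momentum: dw(n) = alpha * dw(n-1) + eta * delta * input
        dW_out = alpha * dW_out + eta * np.outer(delta_out, y_hid_b)
        dW_hid = alpha * dW_hid + eta * np.outer(delta_hid, x_b)
        W_out += dW_out
        W_hid += dW_hid

# The outputs should approach 0, 1, 1, 0 (training can occasionally stall in a local minimum).
for x in X:
    y_hid_b = np.append(sigmoid(W_hid @ np.append(x, 1.0)), 1.0)
    print(x, sigmoid(W_out @ y_hid_b))
```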
Stopping criteria
Sensible stopping criteria:
- Total mean squared error change: backprop is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small (in the range [0.01, 0.1]).
- Generalization-based criterion: after each epoch the network is tested for generalization; if the generalization performance is adequate, then stop. If this stopping criterion is used, the part of the training set used for testing generalization must not be used for updating the weights.

Overfitting in ANNs
[Figure: overfitting in ANNs.]

Summary
- Multiple-layer feed-forward networks.
- Replace the step function with a sigmoid (differentiable) activation function.
- Learn the weights by gradient descent on an error function.
- Backpropagation algorithm for learning.
- Avoid overfitting by early stopping.

ALVINN drives 70 mph on highways (a neural network trained to steer a vehicle).

Use MLP Neural Networks when …
- You have (vectored) real inputs and (vectored) real outputs.
- You're not interested in understanding how it works.
- Long training times are acceptable.
- Short execution (prediction) times are required.
- Robustness to noise in the dataset is needed.

Applications of FFNN
Classification, pattern recognition: FFNNs can be applied to tackle non-linearly separable learning problems.
- Recognizing printed or handwritten characters
- Face recognition
- Classification of loan applications into credit-worthy and non-credit-worthy groups
- Analysis of sonar/radar signals to determine the nature of the source
Regression and forecasting: FFNNs can be applied to learn non-linear functions (regression), and in particular functions whose input is a sequence of measurements over time (time series).

Extensions of Backprop Nets
- Recurrent architectures
- Backprop through time
- Elman nets & Jordan nets

Elman Nets & Jordan Nets
[Figures: two recurrent architectures. In the Elman net the context layer is a copy of the hidden layer; in the Jordan net the context layer is fed from the output layer, with a decay factor α.]
- The context is updated as we receive each input.
- In Jordan nets we model "forgetting" as well.
- The recurrent (copy) connections have fixed weights.
- You can train these networks using good ol' backprop (a small forward-pass sketch appears at the end of these notes).

Recurrent Backprop
[Figure: a small recurrent network with units a, b, c and weights w1–w4, unrolled for 3 iterations into a feed-forward network that reuses copies of the same weights.]
- We'll pretend to step through the network one iteration at a time.
- Backprop as usual, but average the updates to equivalent weights (e.g., all 3 highlighted copies of the same edge on the right).
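To make the Elman-style context layer concrete, here is a minimal Python/NumPy sketch of the forward pass over a sequence; the layer sizes, names, and random weights are my own illustrative choices. The context units simply hold a copy of the previous hidden activations (the copy connections are fixed), while the context-to-hidden weights, like the others, would be trained with ordinary backprop as the slides describe; the training loop is omitted here.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_in, n_hid, n_out = 3, 4, 2
W_in = rng.uniform(-0.5, 0.5, size=(n_hid, n_in))    # input   -> hidden (trainable)
W_ctx = rng.uniform(-0.5, 0.5, size=(n_hid, n_hid))  # context -> hidden (trainable)
W_out = rng.uniform(-0.5, 0.5, size=(n_out, n_hid))  # hidden  -> output (trainable)

def run_sequence(inputs):
    """Forward pass of an Elman net: the context is a copy of the previous hidden layer."""
    context = np.zeros(n_hid)          # context starts empty
    outputs = []
    for x in inputs:
        hidden = sigmoid(W_in @ x + W_ctx @ context)
        outputs.append(sigmoid(W_out @ hidden))
        context = hidden.copy()        # fixed one-to-one copy connections
    return outputs

sequence = [rng.uniform(0, 1, n_in) for _ in range(5)]
for y in run_sequence(sequence):
    print(y)
```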