Part 7: Neural Networks & Learning 11/2/05 Supervised Learning • Produce desired outputs for training inputs • Generalize reasonably & appropriately to other inputs • Good example: pattern recognition • Feedforward multilayer networks Neural Networks and Back-Propagation Learning 11/2/05 1 11/2/05 2 Credit Assignment Problem Feedforward Network input layer 11/2/05 input layer 3 Control Parameters 11/2/05 ... ... ... ... output layer 4 Gradient Evaluation Function (Fitness, Figure of Merit) F measures how F is altered by variation of Pk Pk F P1 M F = F P k M F Pm F S P1 … Pk … Pm hidden layers Desired output 11/2/05 Adaptive System System ... ... ... ... ... ... ... output layer hidden layers ... ... How do we adjust the weights of the hidden layers? Control Algorithm F points in direction of maximum increase in F C 5 11/2/05 6 1 Part 7: Neural Networks & Learning 11/2/05 Gradient Ascent on Fitness Surface Gradient Ascent by Discrete Steps F F + gradient ascent + – 11/2/05 – 7 11/2/05 Gradient Ascent is Local But Not Shortest 8 Gradient Ascent Process P˙ = F (P) Change in fitness : m F d P m dF k F˙ = = = (F ) k P˙k k=1 k=1 dt Pk d t ˙ ˙ F = F P + – 2 F˙ = F F = F 0 Therefore gradient ascent increases fitness (until reaches 0 gradient) 11/2/05 9 11/2/05 10 General Ascent on Fitness Surface General Ascent in Fitness Note that any adaptive process P( t ) will increase fitness provided : 0 < F˙ = F P˙ = F P˙ cos + where is angle between F and P˙ F – Hence we need cos > 0 or < 90 o 11/2/05 11 11/2/05 12 2 Part 7: Neural Networks & Learning 11/2/05 Gradient of Fitness Fitness as Minimum Error Suppose for Q different inputs we have target outputs t1,K,t Q Suppose for parameters P the corresponding actual outputs F = E q = E q q q D(t q ,y q ) y qj E q = D(t q ,y q ) = Pk Pk y qj Pk j are y1,K,y Q Suppose D(t,y ) [0,) measures difference between target & actual outputs Let E = D(t ,y q q q d D(t q ,y q ) y q d yq Pk q q q = D t ,y y = ) be error on qth sample Q Q Let F (P) = E q (P) = D[t q ,y q (P)] q=1 yq q=1 11/2/05 13 y1q y1q P1 L Pm O M Define Jacobian matrix J = M y q y nq n P L Pm 1 Note J and D(t ,y q q ) 14 k q q y q D(t ,y ) E = j , q Pk y j j Pk = T E q = (J q ) D(t q ,y q ) 15 q q ) y = 2( y q t q ) d( t j y j ) dyj 2 2 = 2( t j y j ) 11/2/05 16 Recap T P˙ = ( J q ) (t q y q ) q q Pk To know how to decrease the differences between y q Pk actual & desired outputs, we need to know elements of Jacobian, y qj Pk , which says how jth output varies with kth parameter (given the qth input) y q = 2 ( y t ) j j Pk q j i d D(t,y ) = 2( y t ) dy Gradient of Error on qth Input E q d D(t ,y = Pk d yq i D(t y) (t y ) = (ti y i )2 = iy i y j i y j j i n1 11/2/05 2 i q Since (E q ) = Pk 11/2/05 2 q nm ) Derivative of Squared Euclidean Distance Suppose D(t,y ) = t y = (t y ) Jacobian Matrix q ( q j The Jacobian depends on the specific form of the system, in this case, a feedforward neural network 11/2/05 17 11/2/05 18 3 Part 7: Neural Networks & Learning 11/2/05 Equations Net input: Neuron output: Multilayer Notation n hi = w ij s j j=1 h = Ws xq W1 s1 si = ( hi ) WL–2 WL–1 W2 s2 sL–1 yq sL s = (h) 11/2/05 19 11/2/05 20 Typical Neuron Notation • • • • • • • L layers of neurons labeled 1, …, L Nl neurons in layer l sl = vector of outputs from neurons in layer l input layer s1 = xq (the input pattern) output layer sL = yq (the actual output) Wl = weights between layers l and l+1 Problem: find how outputs yiq vary with weights Wjkl (l = 1, …, L–1) 11/2/05 21 s1l–1 sjl–1 Wi1l–1 Wijl–1 hil sil WiNl–1 sNl–1 11/2/05 22 Error Back-Propagation Delta Values Convenient to break derivatives by chain rule : E q We will compute starting with last layer (l = L 1) W ijl E q E q hil = W ijl1 hil W ijl1 and working back to earlier layers (l = L 2,K,1) Let il = So 11/2/05 23 11/2/05 E q hil E q hil = il W ijl1 W ijl1 24 4 Part 7: Neural Networks & Learning 11/2/05 Output-Layer Neuron Output-Layer Derivatives (1) s1L–1 sjL–1 Wi1L–1 WijL–1 iL = hiL tiq siL = yiq = WiNL–1 Eq sNL–1 11/2/05 = 2(siL t iq ) d siL d hiL 11/2/05 26 s1l s1l–1 sjl–1 q E = iL sL1 j W ijL1 W1il Wi1l–1 Wijl–1 hil sil Wkil s1l+1 skl+1 Eq l–1 WiN 11/2/05 WNil sNl–1 where iL = 2( siL t iq ) ( hiL ) 27 Hidden-Layer Derivatives (1) sNl sNl+1 11/2/05 28 Hidden-Layer Derivatives (2) E q hil = il l1 W ij W ijl1 l1 l1 dW s hil = W ikl1skl1 = dWij l1j = sl1j W ijl1 W ijl1 k ij E q E q h l +1 h l +1 = l = l +1 k l = kl +1 k l h i h h h i k i k k l i l l d ( hil ) hkl +1 m W km sm W kil sil = = = W kil = W kil ( hil ) hil hil hil d hil E q = il sl1 j W ijl1 where il = ( hil )kl +1W kil il = kl +1W kil (hil ) = ( hil )kl +1W kil 11/2/05 L i Hidden-Layer Neuron hiL = W ikL1skL1 = sL1 j W ijL1 W ijL1 k k dh 2 = 2( s t iq ) ( hiL ) Output-Layer Derivatives (2) Recall d( siL t iq ) L i 25 2 E q = (skL tkq ) hiL hiL k k k 29 11/2/05 30 5 Part 7: Neural Networks & Learning 11/2/05 Summary of Back-Propagation Algorithm Derivative of Sigmoid Suppose s = ( h) = 1 (logistic sigmoid) 1+ exp(h ) 1 Output layer : iL = 2siL (1 siL )( siL t iq ) 2 E q = iL sL1 j W ijL1 Dh s = Dh [1+ exp(h )] = [1+ exp(h )] Dh (1+ eh ) = (1+ eh ) 2 (e ) = e h h (1+ eh ) 2 Hidden layers : il = sil (1 sil )kl +1W kil 1+ eh 1 1 e h = = s h 1+ eh 1+ eh 1+ eh 1+ e k E q = il sl1 j W ijl1 = s(1 s) 11/2/05 31 11/2/05 32 Hidden-Layer Computation Output-Layer Computation W s1L–1 sjL–1 Wi1 WijL–1 L1 ij L L1 i j = s hiL siL WiNL–1 – Wi1l–1 tiq Wijl–1 sjl–1 Wijl–1 L i )( t q i L i s sNl–1 ) Training Procedures Wkil WNil 1– il = sil (1 sil )kl +1W kil 33 sil il 2 11/2/05 hil WiNl–1 = 2s (1 s L i = yi q 1– iL L i W1il L–1 WijL–1 sNL–1 W ijl1 = il sl1 j s1l–1 s1l+1 1l+1 Eq skl+1 kl+1 sNl+1 Nl+1 k 11/2/05 34 Summation of Error Surfaces • Batch Learning – – – – on each epoch (pass through all the training pairs), weight changes for all patterns accumulated weight matrices updated at end of epoch accurate computation of gradient E E1 E2 • Online Learning – weight are updated after back-prop of each training pair – usually randomize order for each epoch – approximation of gradient • Doesn’t make much difference 11/2/05 35 11/2/05 36 6 Part 7: Neural Networks & Learning 11/2/05 Gradient Computation in Batch Learning Gradient Computation in Online Learning E E E1 E1 E2 11/2/05 E2 37 11/2/05 38 The Golden Rule of Neural Nets Neural Networks are the second-best way to do everything! 11/2/05 39 7