Adaptive System

[Figure: block diagram of an adaptive system — a system S with evaluation function F (fitness, figure of merit), control parameters $P_1, \dots, P_k, \dots, P_m$, and a control algorithm C that adjusts the parameters.]

Gradient

The partial derivative $\partial F / \partial P_k$ measures how $F$ is altered by variation of $P_k$:

$$\nabla F = \begin{pmatrix} \partial F / \partial P_1 \\ \vdots \\ \partial F / \partial P_k \\ \vdots \\ \partial F / \partial P_m \end{pmatrix}$$

$\nabla F$ points in the direction of maximum increase in $F$.

Gradient Ascent on Fitness Surface

[Figure: fitness surface with a gradient-ascent trajectory following $\nabla F$ from the low (–) region toward the high (+) region.]

Gradient Ascent by Discrete Steps

[Figure: the same fitness surface climbed by discrete gradient-ascent steps.]

Gradient Ascent is Local But Not Shortest

[Figure: the gradient-ascent path compared with the direct path from – to +.]

Gradient Ascent Process

$$\dot P = \eta\,\nabla F(P)$$

Change in fitness:

$$\dot F = \frac{dF}{dt} = \sum_{k=1}^{m} \frac{\partial F}{\partial P_k}\,\frac{dP_k}{dt} = \sum_{k=1}^{m} (\nabla F)_k\,\dot P_k$$

$$\dot F = \nabla F \cdot \dot P$$

$$\dot F = \nabla F \cdot \eta\,\nabla F = \eta\,\|\nabla F\|^2 \ge 0$$

Therefore gradient ascent increases fitness (until it reaches a point of zero gradient).

General Ascent in Fitness

Note that any parameter-adjustment process $P(t)$ will increase fitness provided

$$0 < \dot F = \nabla F \cdot \dot P = \|\nabla F\|\,\|\dot P\|\cos\varphi,$$

where $\varphi$ is the angle between $\nabla F$ and $\dot P$. Hence we need $\cos\varphi > 0$, that is, $\varphi < 90^\circ$.

General Ascent on Fitness Surface

[Figure: fitness surface showing the range of ascent directions within 90° of $\nabla F$.]

Fitness as Minimum Error

Suppose for $Q$ different inputs we have target outputs $t^1, \dots, t^Q$.
Suppose for parameters $P$ the corresponding actual outputs are $y^1, \dots, y^Q$.
Suppose $D(t, y) \in [0, \infty)$ measures the difference between target and actual outputs.
Let $E^q = D(t^q, y^q)$ be the error on the $q$th sample, and let

$$F(P) = -\sum_{q=1}^{Q} E^q(P) = -\sum_{q=1}^{Q} D\!\left[t^q, y^q(P)\right].$$

Gradient of Fitness

$$\nabla F = \nabla\Big(-\sum_q E^q\Big) = -\sum_q \nabla E^q$$

$$\frac{\partial E^q}{\partial P_k} = \frac{\partial}{\partial P_k} D(t^q, y^q)
= \sum_j \frac{\partial D(t^q, y^q)}{\partial y_j^q}\,\frac{\partial y_j^q}{\partial P_k}
= \frac{d D(t^q, y^q)}{d y^q} \cdot \frac{\partial y^q}{\partial P_k}
= \nabla_{y^q} D(t^q, y^q) \cdot \frac{\partial y^q}{\partial P_k}$$

Jacobian Matrix

Define the Jacobian matrix

$$J^q = \begin{pmatrix}
\dfrac{\partial y_1^q}{\partial P_1} & \cdots & \dfrac{\partial y_1^q}{\partial P_m} \\
\vdots & \ddots & \vdots \\
\dfrac{\partial y_n^q}{\partial P_1} & \cdots & \dfrac{\partial y_n^q}{\partial P_m}
\end{pmatrix}$$

Note $J^q \in \Re^{n \times m}$ and $\nabla D(t^q, y^q) \in \Re^{n \times 1}$.

Since

$$(\nabla E^q)_k = \frac{\partial E^q}{\partial P_k} = \sum_j \frac{\partial D(t^q, y^q)}{\partial y_j^q}\,\frac{\partial y_j^q}{\partial P_k},$$

$$\therefore\; \nabla E^q = (J^q)^{\mathrm T}\,\nabla D(t^q, y^q).$$

Derivative of Squared Euclidean Distance

Suppose $D(t, y) = \|t - y\|^2 = \sum_i (t_i - y_i)^2$. Then

$$\frac{\partial D(t, y)}{\partial y_j} = \frac{\partial}{\partial y_j}\sum_i (t_i - y_i)^2
= \sum_i \frac{\partial (t_i - y_i)^2}{\partial y_j}
= \frac{d(t_j - y_j)^2}{d y_j}
= -2(t_j - y_j)$$

$$\therefore\; \frac{d D(t, y)}{d y} = 2(y - t)$$

Gradient of Error on qth Input

$$\frac{\partial E^q}{\partial P_k} = \frac{d D(t^q, y^q)}{d y^q} \cdot \frac{\partial y^q}{\partial P_k}
= 2(y^q - t^q) \cdot \frac{\partial y^q}{\partial P_k}
= 2\sum_j (y_j^q - t_j^q)\,\frac{\partial y_j^q}{\partial P_k}$$

Recap

$$\dot P = \eta \sum_q (J^q)^{\mathrm T} (t^q - y^q)$$

To know how to decrease the differences between actual and desired outputs, we need to know the elements of the Jacobian, $\partial y_j^q / \partial P_k$, which say how the $j$th output varies with the $k$th parameter (given the $q$th input). The Jacobian depends on the specific form of the system, in this case a feedforward neural network.
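Before specializing to the network, here is a minimal NumPy sketch (not from the slides) of the batch update $\dot P = \eta \sum_q (J^q)^{\mathrm T}(t^q - y^q)$ for an assumed linear system $y^q = W x^q$, where each row of the Jacobian is simply $x^q$ and the factor of 2 from the squared-error derivative is absorbed into $\eta$. The data, sizes, learning rate, and variable names are all illustrative.

```python
import numpy as np

# Hypothetical linear system y^q = W x^q: the parameters P are the entries of W,
# and dy^q/dW_k = x^q_k, so (J^q)^T (t^q - y^q) becomes the outer product
# (t^q - y^q)(x^q)^T, summed over the patterns q.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))               # 20 input patterns x^q
T = X @ np.array([[1.0, -2.0, 0.5]]).T     # target outputs t^q
W = np.zeros((1, 3))                       # parameters P
eta = 0.01                                 # step size along the gradient

for step in range(200):
    Y = X @ W.T                            # actual outputs y^q
    W += eta * (T - Y).T @ X               # discrete step of P-dot = eta * sum_q (J^q)^T (t^q - y^q)
    if step % 50 == 0:
        print("fitness F =", -np.sum((T - Y) ** 2))   # F = -sum_q ||t^q - y^q||^2
```

For a small enough $\eta$ the printed fitness never decreases, matching $\dot F = \eta\,\|\nabla F\|^2 \ge 0$ above.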
Multilayer Notation

[Figure: multilayer feedforward network — input $x^q$, weight matrices $W^1, W^2, \dots, W^{L-2}, W^{L-1}$, layer outputs $s^1, s^2, \dots, s^{L-1}, s^L$, with $y^q = s^L$.]

Notation

• $L$ layers of neurons labeled $1, \dots, L$
• $N_l$ neurons in layer $l$
• $s^l$ = vector of outputs from neurons in layer $l$
• input layer: $s^1 = x^q$ (the input pattern)
• output layer: $s^L = y^q$ (the actual output)
• $W^l$ = weights between layers $l$ and $l+1$
• Problem: find how the outputs $y_i^q$ vary with the weights $W_{jk}^l$ ($l = 1, \dots, L-1$)

Typical Neuron

[Figure: neuron $i$ in layer $l$ — inputs $s_1^{l-1}, \dots, s_j^{l-1}, \dots, s_N^{l-1}$, weighted by $W_{i1}^{l-1}, \dots, W_{ij}^{l-1}, \dots, W_{iN}^{l-1}$, summed into $h_i^l$ and passed through $\sigma$ to give $s_i^l$.]

Error Back-Propagation

We will compute $\dfrac{\partial E^q}{\partial W_{ij}^l}$ starting with the last layer ($l = L-1$) and working back to earlier layers ($l = L-2, \dots, 1$).

Delta Values

It is convenient to break the derivatives up by the chain rule:

$$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \frac{\partial E^q}{\partial h_i^l}\,\frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$$

Let

$$\delta_i^l = \frac{\partial E^q}{\partial h_i^l}.$$

So

$$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l\,\frac{\partial h_i^l}{\partial W_{ij}^{l-1}}.$$

Output-Layer Neuron

[Figure: output-layer neuron $i$ — inputs $s_j^{L-1}$ weighted by $W_{ij}^{L-1}$, summed into $h_i^L$ and passed through $\sigma$ to give $s_i^L = y_i^q$, which is compared with the target $t_i^q$ to produce the error $E^q$.]

Output-Layer Derivatives (1)

$$\delta_i^L = \frac{\partial E^q}{\partial h_i^L}
= \frac{\partial}{\partial h_i^L}\sum_k (s_k^L - t_k^q)^2
= \frac{d(s_i^L - t_i^q)^2}{d h_i^L}
= 2(s_i^L - t_i^q)\,\frac{d s_i^L}{d h_i^L}
= 2(s_i^L - t_i^q)\,\sigma'(h_i^L)$$

Output-Layer Derivatives (2)

$$\frac{\partial h_i^L}{\partial W_{ij}^{L-1}} = \frac{\partial}{\partial W_{ij}^{L-1}}\sum_k W_{ik}^{L-1} s_k^{L-1} = s_j^{L-1}$$

$$\therefore\; \frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L\,s_j^{L-1},
\qquad \text{where } \delta_i^L = 2(s_i^L - t_i^q)\,\sigma'(h_i^L)$$

Hidden-Layer Neuron

[Figure: hidden-layer neuron $i$ in layer $l$ — inputs $s_j^{l-1}$ weighted by $W_{ij}^{l-1}$, summed into $h_i^l$ and passed through $\sigma$ to give $s_i^l$, which feeds the neurons of layer $l+1$ through the weights $W_{ki}^l$ and thereby contributes to $E^q$.]

Hidden-Layer Derivatives (1)

Recall

$$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l\,\frac{\partial h_i^l}{\partial W_{ij}^{l-1}}.$$

$$\delta_i^l = \frac{\partial E^q}{\partial h_i^l}
= \sum_k \frac{\partial E^q}{\partial h_k^{l+1}}\,\frac{\partial h_k^{l+1}}{\partial h_i^l}
= \sum_k \delta_k^{l+1}\,\frac{\partial h_k^{l+1}}{\partial h_i^l}$$

$$\frac{\partial h_k^{l+1}}{\partial h_i^l}
= \frac{\partial \sum_m W_{km}^l s_m^l}{\partial h_i^l}
= W_{ki}^l\,\frac{\partial s_i^l}{\partial h_i^l}
= W_{ki}^l\,\frac{d\sigma(h_i^l)}{d h_i^l}
= W_{ki}^l\,\sigma'(h_i^l)$$

$$\therefore\; \delta_i^l = \sum_k \delta_k^{l+1} W_{ki}^l\,\sigma'(h_i^l) = \sigma'(h_i^l)\sum_k \delta_k^{l+1} W_{ki}^l$$

Hidden-Layer Derivatives (2)

$$\frac{\partial h_i^l}{\partial W_{ij}^{l-1}}
= \frac{\partial}{\partial W_{ij}^{l-1}}\sum_k W_{ik}^{l-1} s_k^{l-1}
= \frac{d\,W_{ij}^{l-1} s_j^{l-1}}{d W_{ij}^{l-1}}
= s_j^{l-1}$$

$$\therefore\; \frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l\,s_j^{l-1},
\qquad \text{where } \delta_i^l = \sigma'(h_i^l)\sum_k \delta_k^{l+1} W_{ki}^l$$

Derivative of Sigmoid

Suppose $s = \sigma(h) = \dfrac{1}{1 + \exp(-\alpha h)}$ (logistic sigmoid). Then

$$D_h s = D_h\,[1 + \exp(-\alpha h)]^{-1}
= -[1 + \exp(-\alpha h)]^{-2}\,D_h (1 + e^{-\alpha h})
= -(1 + e^{-\alpha h})^{-2}(-\alpha e^{-\alpha h})$$

$$= \alpha\,\frac{e^{-\alpha h}}{(1 + e^{-\alpha h})^2}
= \alpha\,\frac{1}{1 + e^{-\alpha h}}\cdot\frac{e^{-\alpha h}}{1 + e^{-\alpha h}}
= \alpha s\left(\frac{1 + e^{-\alpha h}}{1 + e^{-\alpha h}} - \frac{1}{1 + e^{-\alpha h}}\right)
= \alpha s(1 - s)$$

Summary of Back-Propagation Algorithm

Output layer:

$$\delta_i^L = 2\alpha\,s_i^L (1 - s_i^L)(s_i^L - t_i^q),
\qquad \frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L\,s_j^{L-1}$$

Hidden layers:

$$\delta_i^l = \alpha\,s_i^l (1 - s_i^l)\sum_k \delta_k^{l+1} W_{ki}^l,
\qquad \frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l\,s_j^{l-1}$$
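As a concreteness check, here is a minimal NumPy sketch (not from the slides) of these summary formulas for a single training pattern, assuming the squared-error $D$ and logistic sigmoid derived above. The 3-2-1 network shape, learning rate, and variable names are illustrative; code layers are 0-based, so weights[l] plays the role of the slides' $W^{l+1}$.

```python
import numpy as np

ALPHA = 1.0  # sigmoid gain (the slides' alpha)

def sigma(h):
    """Logistic sigmoid s = 1 / (1 + exp(-ALPHA*h)); its derivative is ALPHA*s*(1-s)."""
    return 1.0 / (1.0 + np.exp(-ALPHA * h))

def forward(weights, x):
    """Forward pass: weights[l] maps layer l to layer l+1 (0-based).
    Returns the list of layer outputs s^1, ..., s^L (s[0] is the input pattern)."""
    s = [x]
    for W in weights:
        s.append(sigma(W @ s[-1]))
    return s

def backprop_gradients(weights, s, t):
    """Per-pattern deltas and weight gradients, per the summary above:
      output layer:  delta^L = 2*ALPHA*s^L*(1 - s^L)*(s^L - t)
      hidden layers: delta^l = ALPHA*s^l*(1 - s^l) * (W^l)^T delta^{l+1}
      gradients:     dE^q/dW^{l-1}_{ij} = delta^l_i * s^{l-1}_j  (outer products)."""
    delta = 2 * ALPHA * s[-1] * (1 - s[-1]) * (s[-1] - t)  # output-layer delta
    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        grads[l] = np.outer(delta, s[l])                   # dE^q/dW for this weight layer
        if l > 0:                                          # propagate delta back one layer
            delta = ALPHA * s[l] * (1 - s[l]) * (weights[l].T @ delta)
    return grads

# Illustrative usage: a 3-2-1 network trained on one pattern by stepping down the
# error gradient (equivalently, ascending the fitness F = -E^q).
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.5, size=(2, 3)), rng.normal(scale=0.5, size=(1, 2))]
x, t, eta = np.array([0.2, 0.7, 0.1]), np.array([0.3]), 0.5

for _ in range(2000):
    s = forward(weights, x)
    for W, g in zip(weights, backprop_gradients(weights, s, t)):
        W -= eta * g
print("output after training:", forward(weights, x)[-1], "target:", t)
```

Summing the per-pattern gradients over all $q$ before each update recovers the batch rule $\dot P = \eta \sum_q (J^q)^{\mathrm T}(t^q - y^q)$ from the recap.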