Part 4A: Artificial Neural Network Learning (2/23/12)

IV. Neural Network Learning
A. Artificial Neural Network Learning

Supervised Learning
• Produce desired outputs for training inputs
• Generalize reasonably & appropriately to other inputs
• Good example: pattern recognition
• Feedforward multilayer networks

Feedforward Network
[Figure: feedforward network with an input layer, several hidden layers, and an output layer]

Typical Artificial Neuron
[Figure: artificial neuron with inputs, connection weights, threshold, and output]
[Figure: the same neuron showing the linear combination, the net input (local field), and the activation function]

Equations
Net input:   h_i = (Σ_{j=1}^n w_ij s_j) − θ,   or in vector form   h = Ws − θ
Neuron output:   s′_i = σ(h_i),   s′ = σ(h)

Single-Layer Perceptron
[Figure: single-layer perceptron network]

Variables
[Figure: perceptron unit with inputs x_1, …, x_j, …, x_n, weights w_1, …, w_j, …, w_n, net input h, threshold θ, step function Θ, and output y]

Single-Layer Perceptron Equations
Binary threshold activation function:
   σ(h) = Θ(h) = 1 if h > 0, 0 if h ≤ 0
Hence
   y = 1 if Σ_j w_j x_j > θ, 0 otherwise
     = 1 if w·x > θ, 0 if w·x ≤ θ

2D Weight Vector
w·x = ‖w‖‖x‖ cos φ
Let v = ‖x‖ cos φ be the length of the projection of x onto w; then w·x = ‖w‖v
w·x > θ ⇔ ‖w‖v > θ ⇔ v > θ/‖w‖
[Figure: weight vector w, input x, projection v, and the + / − decision regions on either side of the line v = θ/‖w‖]

N-Dimensional Weight Vector
[Figure: w as the normal vector of the separating hyperplane, with + and − regions]

Goal of Perceptron Learning
• Suppose we have training patterns x^1, x^2, …, x^P with corresponding desired outputs y^1, y^2, …, y^P
• where x^p ∈ {0, 1}^n, y^p ∈ {0, 1}
• We want to find w, θ such that y^p = Θ(w·x^p − θ) for p = 1, …, P

Treating Threshold as Weight
h = (Σ_{j=1}^n w_j x_j) − θ = −θ + Σ_{j=1}^n w_j x_j
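As an aside, the threshold unit and its threshold-as-weight rewriting can be sketched in a few lines of Python; the particular weights (an AND gate with w = (0.5, 0.5), θ = 0.7) are an illustrative choice, not from the slides.

```python
import numpy as np

def perceptron(w, theta, x):
    """y = Theta(w . x - theta): fire iff the net input exceeds the threshold."""
    return 1 if np.dot(w, x) - theta > 0 else 0

def perceptron_aug(w_aug, x_aug):
    """The same unit with the threshold folded in: weight w0 = theta on input x0 = -1."""
    return 1 if np.dot(w_aug, x_aug) > 0 else 0

w, theta = np.array([0.5, 0.5]), 0.7          # illustrative AND gate
w_aug = np.concatenate(([theta], w))          # (theta, w1, ..., wn)
for p in [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]:
    x = np.array(p)
    x_aug = np.concatenate(([-1.0], x))       # (-1, x1, ..., xn)
    assert perceptron(w, theta, x) == perceptron_aug(w_aug, x_aug)
```

The loop confirms that moving the threshold into the weight vector leaves the unit's outputs unchanged on every input pattern.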
Treating Threshold as Weight
h = (Σ_{j=1}^n w_j x_j) − θ = −θ + Σ_{j=1}^n w_j x_j
Let x_0 = −1 and w_0 = θ. Then
   h = w_0 x_0 + Σ_{j=1}^n w_j x_j = Σ_{j=0}^n w_j x_j = w̃·x̃
[Figure: the same perceptron unit with an extra input x_0 = −1 whose weight is w_0 = θ]

Augmented Vectors
w̃ = (θ, w_1, …, w_n)ᵀ,   x̃^p = (−1, x_1^p, …, x_n^p)ᵀ
We want y^p = Θ(w̃·x̃^p), p = 1, …, P

Reformulation as Positive Examples
We have positive (y^p = 1) and negative (y^p = 0) examples
Want w̃·x̃^p > 0 for positive, w̃·x̃^p ≤ 0 for negative
Let z^p = x̃^p for positive, z^p = −x̃^p for negative
Want w̃·z^p ≥ 0 for p = 1, …, P
This is a hyperplane through the origin with all the z^p on one side

Adjustment of Weight Vector
[Figure: example vectors z^1, …, z^11 arranged around the weight vector]

Outline of Perceptron Learning Algorithm
1. initialize weight vector randomly
2. until all patterns classified correctly, do:
   a) for p = 1, …, P do:
      1) if z^p classified correctly, do nothing
      2) else adjust weight vector to be closer to correct classification

Weight Adjustment
[Figure: successive weight vectors w̃, w̃′ = w̃ + ηz^p, w̃″ = w̃′ + ηz^p rotating toward z^p]

Improvement in Performance
If w̃·z^p < 0,
   w̃′·z^p = (w̃ + ηz^p)·z^p
           = w̃·z^p + η z^p·z^p
           = w̃·z^p + η‖z^p‖²
           > w̃·z^p

Perceptron Learning Theorem
• If there is a set of weights that will solve the problem,
• then the PLA will eventually find it
• (for a sufficiently small learning rate)
• Note: this only applies if the positive & negative examples are linearly separable

NetLogo Simulation of Perceptron Learning
Run Perceptron-Geometry.nlogo
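The algorithm outlined above can also be sketched directly in Python; the AND task, learning rate, and epoch cap are illustrative choices. Correct classification of z^p is taken to mean w̃·z^p > 0 (strictly), so the trained weights satisfy the perceptron decision rule exactly.

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, max_epochs=1000, seed=0):
    """Perceptron learning on augmented vectors.

    Positive/negative examples are folded into z^p (z^p = x~^p if y^p = 1,
    z^p = -x~^p if y^p = 0), so correct classification means w~ . z^p > 0.
    """
    rng = np.random.default_rng(seed)
    P, n = X.shape
    X_aug = np.hstack([-np.ones((P, 1)), X])        # x~^p = (-1, x1, ..., xn)
    Z = np.where(y[:, None] == 1, X_aug, -X_aug)
    w = rng.uniform(-1.0, 1.0, n + 1)               # 1. random initial weight vector
    for _ in range(max_epochs):                     # 2. until all classified correctly
        done = True
        for zp in Z:
            if np.dot(w, zp) <= 0:                  # misclassified
                w += eta * zp                       # move w~ closer to z^p
                done = False
        if done:
            return w
    raise RuntimeError("no separating weights found (data may not be separable)")

# AND is linearly separable, so the perceptron learning theorem guarantees success
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w = train_perceptron(X, y)
predict = lambda x: 1 if np.dot(w, np.concatenate(([-1.0], x))) > 0 else 0
```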
Classification Power of Multilayer Perceptrons
• Perceptrons can function as logic gates
• Therefore an MLP can form intersections, unions, and differences of linearly separable regions
• Classes can be arbitrary hyperpolyhedra
• Minsky & Papert criticism of perceptrons
• No one succeeded in developing an MLP learning algorithm

Hyperpolyhedral Classes
[Figure: class regions formed as hyperpolyhedra]

Credit Assignment Problem
How do we adjust the weights of the hidden layers?
[Figure: feedforward network with input layer, hidden layers, and output layer, compared against the desired output]

NetLogo Demonstration of Back-Propagation Learning
Run Artificial Neural Net.nlogo

Adaptive System
[Figure: system S with control parameters P_1, …, P_k, …, P_m, evaluation function F (fitness, figure of merit), and control algorithm C]

Gradient
∂F/∂P_k measures how F is altered by variation of P_k
∇F = (∂F/∂P_1, …, ∂F/∂P_k, …, ∂F/∂P_m)ᵀ
∇F points in the direction of maximum local increase in F

Gradient Ascent on Fitness Surface
[Figure: fitness surface with ∇F indicating the direction of ascent between the + and − extremes]

Gradient Ascent by Discrete Steps
[Figure: discrete gradient-ascent steps climbing the fitness surface]

Gradient Ascent is Local But Not Shortest
[Figure: gradient-ascent path curving toward the maximum rather than taking the straight-line route]

Gradient Ascent Process
Ṗ = η∇F(P)
Change in fitness:
   Ḟ = dF/dt = Σ_{k=1}^m (∂F/∂P_k)(dP_k/dt) = Σ_{k=1}^m (∇F)_k Ṗ_k = ∇F·Ṗ
   Ḟ = ∇F·η∇F = η‖∇F‖² ≥ 0
Therefore gradient ascent increases fitness (until it reaches a zero gradient)

General Ascent in Fitness
Note that any adaptive process P(t) will increase fitness provided:
   0 < Ḟ = ∇F·Ṗ = ‖∇F‖‖Ṗ‖ cos φ
where φ is the angle between ∇F and Ṗ
Hence we need cos φ > 0, i.e., φ < 90°

General Ascent on Fitness Surface
[Figure: any trajectory whose angle with ∇F stays below 90° climbs the fitness surface]
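Discrete-step gradient ascent can be sketched numerically; the quadratic fitness surface below is a made-up example with a single maximum at (1, −2), not anything from the slides.

```python
import numpy as np

def gradient_ascent(grad_F, P0, eta=0.1, steps=100):
    """Discrete-step gradient ascent: P <- P + eta * grad F(P)."""
    P = np.array(P0, dtype=float)
    for _ in range(steps):
        P += eta * grad_F(P)
    return P

# Example fitness surface: F(P) = -(P1 - 1)^2 - (P2 + 2)^2, maximum at (1, -2)
F = lambda P: -(P[0] - 1) ** 2 - (P[1] + 2) ** 2
grad_F = lambda P: np.array([-2 * (P[0] - 1), -2 * (P[1] + 2)])

P_star = gradient_ascent(grad_F, [0.0, 0.0])
```

Since each step moves along ∇F, the fitness never decreases (Ḟ = η‖∇F‖² ≥ 0 in the continuous limit), and for this surface the iterates converge to the maximum.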
Fitness as Minimum Error
Suppose for Q different inputs we have target outputs t^1, …, t^Q
Suppose for parameters P the corresponding actual outputs are y^1, …, y^Q
Suppose D(t, y) ∈ [0, ∞) measures the difference between target & actual outputs
Let E^q = D(t^q, y^q) be the error on the qth sample
Let F(P) = −Σ_{q=1}^Q E^q(P) = −Σ_{q=1}^Q D[t^q, y^q(P)]

Gradient of Fitness
∇F = ∇(−Σ_q E^q) = −Σ_q ∇E^q
∂E^q/∂P_k = ∂D(t^q, y^q)/∂P_k
          = Σ_j [∂D(t^q, y^q)/∂y_j^q] (∂y_j^q/∂P_k)
          = [dD(t^q, y^q)/dy^q] · (∂y^q/∂P_k)
          = ∇_{y^q} D(t^q, y^q) · (∂y^q/∂P_k)

Jacobian Matrix
Define the Jacobian matrix J^q = [∂y_i^q/∂P_k], with rows i = 1, …, n and columns k = 1, …, m
Note J^q ∈ ℝ^{n×m} and ∇D(t^q, y^q) ∈ ℝ^{n×1}
Since (∇E^q)_k = ∂E^q/∂P_k = Σ_j [∂D(t^q, y^q)/∂y_j^q] (∂y_j^q/∂P_k),
   ∴ ∇E^q = (J^q)ᵀ ∇D(t^q, y^q)

Derivative of Squared Euclidean Distance
Suppose D(t, y) = ‖t − y‖² = Σ_i (t_i − y_i)²
∂D(t, y)/∂y_j = ∂/∂y_j Σ_i (t_i − y_i)² = d(t_j − y_j)²/dy_j = −2(t_j − y_j)
∴ dD(t, y)/dy = 2(y − t)

Gradient of Error on qth Input
∂E^q/∂P_k = [dD(t^q, y^q)/dy^q] · (∂y^q/∂P_k)
          = 2(y^q − t^q) · (∂y^q/∂P_k)
          = 2 Σ_j (y_j^q − t_j^q)(∂y_j^q/∂P_k)
∇E^q = 2(J^q)ᵀ (y^q − t^q)

Recap
Ṗ = η Σ_q (J^q)ᵀ (t^q − y^q)
To know how to decrease the differences between actual & desired outputs, we need to know the elements of the Jacobian, ∂y_j^q/∂P_k, which say how the jth output varies with the kth parameter (given the qth input)
The Jacobian depends on the specific form of the system; in this case, a feedforward neural network

Multilayer Notation
[Figure: layered network mapping x^q through weight matrices W^1, W^2, …, W^{L−1} and layer outputs s^1, s^2, …, s^{L−1}, s^L = y^q]

Notation
• L layers of neurons labeled 1, …, L
• N_l neurons in layer l
• s^l = vector of outputs from neurons in layer l
• input layer: s^1 = x^q (the input pattern)
• output layer: s^L = y^q (the actual output)
• W^l = weights between layers l and l+1
• Problem: find out how the outputs y_i^q vary with the weights W_jk^l (l = 1, …, L−1)
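The identity ∇E^q = 2(J^q)ᵀ(y^q − t^q) can be checked numerically on a toy system whose Jacobian is known in closed form; the linear parameter dependence y^q(P) = M_q P below is an assumption made purely for the check (so J^q = M_q), not the network's actual form.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, Q = 3, 4, 5                      # outputs, parameters, samples
M = rng.normal(size=(Q, n, m))         # made-up Jacobians J^q = M_q
T = rng.normal(size=(Q, n))            # target outputs t^q
P = rng.normal(size=m)                 # current parameter vector

def E(P):
    """Total squared error: sum_q ||y^q(P) - t^q||^2, with y^q(P) = M_q P."""
    return sum(np.sum((Mq @ P - tq) ** 2) for Mq, tq in zip(M, T))

# Analytic gradient from the slide: grad E^q = 2 (J^q)^T (y^q - t^q)
grad_analytic = sum(2 * Mq.T @ (Mq @ P - tq) for Mq, tq in zip(M, T))

# Central finite differences for comparison
eps = 1e-6
basis = np.eye(m)
grad_fd = np.array([(E(P + eps * basis[k]) - E(P - eps * basis[k])) / (2 * eps)
                    for k in range(m)])
```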
Typical Neuron
[Figure: neuron i in layer l receiving s_1^{l−1}, …, s_j^{l−1}, …, s_N^{l−1} through weights W_i1^{l−1}, …, W_ij^{l−1}, …, W_iN^{l−1}, forming net input h_i^l and output s_i^l = σ(h_i^l)]

Error Back-Propagation
We will compute ∂E^q/∂W_ij^l starting with the last layer (l = L−1) and working back to earlier layers (l = L−2, …, 1)

Delta Values
It is convenient to break the derivatives up by the chain rule:
   ∂E^q/∂W_ij^{l−1} = (∂E^q/∂h_i^l)(∂h_i^l/∂W_ij^{l−1})
Let δ_i^l = ∂E^q/∂h_i^l
So ∂E^q/∂W_ij^{l−1} = δ_i^l (∂h_i^l/∂W_ij^{l−1})

Output-Layer Neuron
[Figure: output neuron i with net input h_i^L, output s_i^L = y_i^q, target t_i^q, and error E^q]

Output-Layer Derivatives (1)
δ_i^L = ∂E^q/∂h_i^L = ∂/∂h_i^L Σ_k (s_k^L − t_k^q)²
      = d(s_i^L − t_i^q)²/dh_i^L
      = 2(s_i^L − t_i^q) ds_i^L/dh_i^L
      = 2(s_i^L − t_i^q) σ′(h_i^L)

Output-Layer Derivatives (2)
∂h_i^L/∂W_ij^{L−1} = ∂/∂W_ij^{L−1} Σ_k W_ik^{L−1} s_k^{L−1} = s_j^{L−1}
∴ ∂E^q/∂W_ij^{L−1} = δ_i^L s_j^{L−1}, where δ_i^L = 2(s_i^L − t_i^q) σ′(h_i^L)

Hidden-Layer Neuron
[Figure: hidden neuron i in layer l, fed by layer l−1 and feeding every neuron k of layer l+1 through weights W_ki^l, which in turn determine E^q]

Hidden-Layer Derivatives (1)
Recall ∂E^q/∂W_ij^{l−1} = δ_i^l (∂h_i^l/∂W_ij^{l−1})
δ_i^l = ∂E^q/∂h_i^l = Σ_k (∂E^q/∂h_k^{l+1})(∂h_k^{l+1}/∂h_i^l) = Σ_k δ_k^{l+1} (∂h_k^{l+1}/∂h_i^l)
∂h_k^{l+1}/∂h_i^l = ∂(Σ_m W_km^l s_m^l)/∂h_i^l = W_ki^l (∂s_i^l/∂h_i^l) = W_ki^l dσ(h_i^l)/dh_i^l = W_ki^l σ′(h_i^l)
∴ δ_i^l = Σ_k δ_k^{l+1} W_ki^l σ′(h_i^l) = σ′(h_i^l) Σ_k δ_k^{l+1} W_ki^l

Hidden-Layer Derivatives (2)
∂h_i^l/∂W_ij^{l−1} = ∂/∂W_ij^{l−1} Σ_k W_ik^{l−1} s_k^{l−1} = d(W_ij^{l−1} s_j^{l−1})/dW_ij^{l−1} = s_j^{l−1}
∴ ∂E^q/∂W_ij^{l−1} = δ_i^l s_j^{l−1}, where δ_i^l = σ′(h_i^l) Σ_k δ_k^{l+1} W_ki^l
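A minimal numerical check of these delta formulas, assuming the logistic sigmoid with α = 1 (so σ′(h) = s(1 − s)); the 2-3-2 network, input, and target below are made up for illustration. The analytic derivatives are then compared against finite differences of E^q.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = lambda h: 1.0 / (1.0 + np.exp(-h))

# Tiny 2-3-2 network (L = 3 layers): s^1 = x, s^l = sigma(W^{l-1} s^{l-1})
W = [rng.normal(size=(3, 2)), rng.normal(size=(2, 3))]
x = np.array([0.3, 0.9])          # made-up input pattern
t = np.array([1.0, 0.0])          # made-up target

def forward(W):
    s = [x]
    for Wl in W:
        s.append(sigma(Wl @ s[-1]))
    return s

def error(W):
    return np.sum((forward(W)[-1] - t) ** 2)

s = forward(W)
sp = [sl * (1 - sl) for sl in s]               # sigma'(h^l) = s^l (1 - s^l) for alpha = 1
delta3 = 2 * (s[2] - t) * sp[2]                # output layer: 2(s^L - t) sigma'(h^L)
delta2 = sp[1] * (W[1].T @ delta3)             # hidden layer: sigma'(h^l) sum_k delta_k^{l+1} W_ki^l
dE_dW2 = np.outer(delta3, s[1])                # dE^q/dW_ij^{L-1} = delta_i^L s_j^{L-1}
dE_dW1 = np.outer(delta2, s[0])                # dE^q/dW_ij^{l-1} = delta_i^l s_j^{l-1}
```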
Derivative of Sigmoid
Suppose s = σ(h) = 1 / (1 + exp(−αh)) (logistic sigmoid)
D_h s = D_h [1 + exp(−αh)]⁻¹ = −[1 + exp(−αh)]⁻² D_h (1 + e^{−αh})
      = −(1 + e^{−αh})⁻² (−αe^{−αh})
      = α e^{−αh} / (1 + e^{−αh})²
      = α [1/(1 + e^{−αh})] [e^{−αh}/(1 + e^{−αh})]
      = αs [(1 + e^{−αh})/(1 + e^{−αh}) − 1/(1 + e^{−αh})]
      = αs(1 − s)

Summary of Back-Propagation Algorithm
Output layer:
   δ_i^L = 2α s_i^L (1 − s_i^L)(s_i^L − t_i^q)
   ∂E^q/∂W_ij^{L−1} = δ_i^L s_j^{L−1}
Hidden layers:
   δ_i^l = α s_i^l (1 − s_i^l) Σ_k δ_k^{l+1} W_ki^l
   ∂E^q/∂W_ij^{l−1} = δ_i^l s_j^{l−1}

Output-Layer Computation
ΔW_ij^{L−1} = η δ_i^L s_j^{L−1}, with δ_i^L = 2α s_i^L (1 − s_i^L)(t_i^q − s_i^L)
(the update steps down the error gradient, hence the sign t_i^q − s_i^L)
[Figure: dataflow for the output-layer weight update, combining s_j^{L−1}, the error t_i^q − s_i^L, and the factor s_i^L(1 − s_i^L)]

Hidden-Layer Computation
ΔW_ij^{l−1} = η δ_i^l s_j^{l−1}, with δ_i^l = α s_i^l (1 − s_i^l) Σ_k δ_k^{l+1} W_ki^l
[Figure: dataflow for the hidden-layer weight update, back-propagating the deltas δ_k^{l+1} through the weights W_ki^l]

Training Procedures
• Batch learning
  – on each epoch (pass through all the training pairs),
  – weight changes for all patterns are accumulated
  – weight matrices are updated at the end of the epoch
  – accurate computation of the gradient
• Online learning
  – weights are updated after back-propagation of each training pair
  – usually randomize the order of patterns for each epoch
  – approximation of the gradient
• Doesn't make much difference in practice

Summation of Error Surfaces
[Figure: total error surface E as the sum of per-pattern surfaces E^1 and E^2]

Gradient Computation in Batch Learning
[Figure: batch steps follow the gradient of the summed surface E]

Gradient Computation in Online Learning
[Figure: online steps follow the gradients of E^1 and E^2 alternately, approximating the gradient of E]

Testing Generalization
[Figure: the available data from the domain is split into training data and test data]

Problem of Rote Learning
[Figure: error on training data keeps decreasing with epochs while error on test data eventually rises; stop training at the minimum of the test-data error]

Improving Generalization
[Figure: the available data is split into training data, validation data, and test data]
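Putting the summary equations to work, here is a minimal online-learning sketch; the 2-2-1 architecture, XOR task, learning rate, and seed are illustrative choices (with α = 1 and the thresholds treated as weights on a fixed −1 input, as earlier in the deck).

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = lambda h: 1.0 / (1.0 + np.exp(-h))
aug = lambda v: np.concatenate(([-1.0], v))   # prepend the fixed -1 "threshold" input

# XOR training pairs (illustrative task; any small dataset would do)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0.0], [1.0], [1.0], [0.0]])

W1 = rng.normal(size=(2, 3))                  # hidden weights (first column = threshold)
W2 = rng.normal(size=(1, 3))                  # output weights (first column = threshold)

def forward(x):
    s1 = aug(x)
    s2 = sigma(W1 @ s1)
    s3 = sigma(W2 @ aug(s2))
    return s1, s2, s3

def total_error():
    return sum(np.sum((forward(x)[2] - t) ** 2) for x, t in zip(X, T))

eta = 0.5
err0 = total_error()
for epoch in range(5000):                     # online learning, randomized order
    for p in rng.permutation(len(X)):
        s1, s2, s3 = forward(X[p])
        d3 = 2 * (s3 - T[p]) * s3 * (1 - s3)          # output-layer delta
        d2 = s2 * (1 - s2) * (W2[:, 1:].T @ d3)       # hidden deltas (skip threshold column)
        W2 -= eta * np.outer(d3, aug(s2))             # step down the error gradient
        W1 -= eta * np.outer(d2, s1)
err1 = total_error()
```

With so few hidden units the net can still get stuck in a local minimum, which is exactly the situation the restart tip below addresses; the error nevertheless drops from its initial value.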
A Few Random Tips
• Too few neurons, and the ANN may not be able to decrease the error enough
• Too many neurons can lead to rote learning
• Preprocess the data to:
  – standardize it
  – eliminate irrelevant information
  – capture invariances
  – keep relevant information
• If stuck in a local minimum, restart with different random weights

Run Example BP Learning

Beyond Back-Propagation
• Adaptive Learning Rate
• Adaptive Architecture
  – add/delete hidden neurons
  – add/delete hidden layers
• Radial Basis Function Networks
• Recurrent BP
• Etc., etc., etc. …

What is the Power of Artificial Neural Networks?
• With respect to Turing machines?
• As function approximators?

Can ANNs Exceed the "Turing Limit"?
• There are many results, which depend sensitively on assumptions; for example:
• Finite NNs with real-valued weights have super-Turing power (Siegelmann & Sontag '94)
• Recurrent nets with Gaussian noise have sub-Turing power (Maass & Sontag '99)
• Finite recurrent nets with real weights can recognize all languages, and thus are super-Turing (Siegelmann '99)
• Stochastic nets with rational weights have super-Turing power (but only P/poly, BPP/log*) (Siegelmann '99)
• But computing classes of functions is not a very relevant way to evaluate the capabilities of neural computation
A Universal Approximation Theorem
Suppose f is a continuous function on [0,1]ⁿ
Suppose σ is a nonconstant, bounded, monotone increasing real function on ℝ
For any ε > 0, there is an m such that there exist a ∈ ℝ^m, b ∈ ℝ^m, W ∈ ℝ^{m×n} such that if
   F(x_1, …, x_n) = Σ_{i=1}^m a_i σ(Σ_{j=1}^n W_ij x_j + b_i)
   [i.e., F(x) = a·σ(Wx + b)]
then |F(x) − f(x)| < ε for all x ∈ [0,1]ⁿ
(see, e.g., Haykin, Neural Networks, 2nd ed., pp. 208–209)

One Hidden Layer is Sufficient
• Conclusion: one hidden layer is sufficient to approximate any continuous function arbitrarily closely
[Figure: network with inputs x_1, …, x_n and a bias input 1, one hidden layer of m sigmoid units with weights W and biases b_1, …, b_m, and a linear output unit with weights a_1, …, a_m]

The Golden Rule of Neural Nets
Neural networks are the second-best way to do everything!
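The theorem's one-hidden-layer form F(x) = a·σ(Wx + b) can be illustrated directly for n = 1: steep sigmoids stepping up at grid points build a staircase approximation of a continuous f. The target function, grid size, and gain below are arbitrary illustrative choices, not from the slides.

```python
import numpy as np

sigma = lambda h: 1.0 / (1.0 + np.exp(-np.clip(h, -500, 500)))
f = lambda x: np.sin(2 * np.pi * x)       # illustrative continuous target on [0, 1]

m, k = 200, 4000.0                        # hidden units and sigmoid gain (W_i = k)
grid = np.linspace(0.0, 1.0, m)
delta = grid[1] - grid[0]
a = np.diff(np.concatenate(([0.0], f(grid))))   # step heights a_i = f(x_i) - f(x_{i-1})
s = grid - delta / 2                            # each sigmoid steps up just below x_i
F = lambda x: a @ sigma(k * (x - s))            # F(x) = a . sigma(Wx + b) with b = -k s

xs = np.linspace(0.0, 1.0, 501)
max_err = max(abs(F(x) - f(x)) for x in xs)
```

Each hidden unit contributes one nearly sharp step of height f(x_i) − f(x_{i−1}), so the sum tracks f to within roughly one grid step; increasing m (and the gain) shrinks the error toward any desired ε.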