II. Neural Network Learning

A. Neural Network Learning

Supervised Learning
• Produce desired outputs for training inputs
• Generalize reasonably & appropriately to other inputs
• Good example: pattern recognition
• Feedforward multilayer networks

Feedforward Network
[Figure: layered feedforward network with an input layer, several hidden layers, and an output layer]

Typical Artificial Neuron
[Figure: neuron with inputs, connection weights, threshold, and output]

Typical Artificial Neuron
[Figure: the neuron forms a linear combination of its inputs (the net input, or local field) and passes it through an activation function]

Equations
Net input: $h_i = \left( \sum_{j=1}^{n} w_{ij} s_j \right) - \theta$, or in vector form $h = Ws - \theta$
Neuron output: $s'_i = \sigma(h_i)$, or $s' = \sigma(h)$

Single-Layer Perceptron
[Figure: a single layer of perceptrons connecting inputs directly to outputs]

Variables
[Figure: perceptron with inputs $x_1, \dots, x_n$, weights $w_1, \dots, w_n$, summation $\Sigma$, net input $h$, threshold $\theta$, step function $\Theta$, and output $y$]

Single-Layer Perceptron Equations
Binary threshold activation function:
$$\sigma(h) = \Theta(h) = \begin{cases} 1, & \text{if } h > 0 \\ 0, & \text{if } h \le 0 \end{cases}$$
Hence,
$$y = \begin{cases} 1, & \text{if } \sum_j w_j x_j > \theta \\ 0, & \text{otherwise} \end{cases} \;=\; \begin{cases} 1, & \text{if } w \cdot x > \theta \\ 0, & \text{if } w \cdot x \le \theta \end{cases}$$

2D Weight Vector
$w \cdot x = \|w\| \, \|x\| \cos\phi$. Let $v = \|x\| \cos\phi$ (the projection of $x$ onto $w$); then $w \cdot x = \|w\| \, v$, so
$$w \cdot x > \theta \iff \|w\| \, v > \theta \iff v > \frac{\theta}{\|w\|}$$
[Figure: weight vector $w$ in the $(w_1, w_2)$ plane; the decision line lies perpendicular to $w$ at distance $\theta / \|w\|$ from the origin]

N-Dimensional Weight Vector
[Figure: $w$ as the normal vector of a separating hyperplane, with the + region on the side $w$ points toward and the – region on the other]

Goal of Perceptron Learning
• Suppose we have training patterns $x^1, x^2, \dots, x^P$ with corresponding desired outputs $y^1, y^2, \dots, y^P$
• where $x^p \in \{0,1\}^n$, $y^p \in \{0,1\}$
• We want to find $w, \theta$ such that $y^p = \Theta(w \cdot x^p - \theta)$ for $p = 1, \dots, P$

Treating Threshold as Weight
$$h = \left( \sum_{j=1}^{n} w_j x_j \right) - \theta = -\theta + \sum_{j=1}^{n} w_j x_j$$
Let $x_0 = -1$ and $w_0 = \theta$. Then
$$h = w_0 x_0 + \sum_{j=1}^{n} w_j x_j = \sum_{j=0}^{n} w_j x_j = \tilde{w} \cdot \tilde{x}$$
[Figure: the same perceptron with an extra constant input $x_0 = -1$ whose weight $w_0 = \theta$ absorbs the threshold]

Augmented Vectors
$$\tilde{w} = (\theta, w_1, \dots, w_n)^T, \qquad \tilde{x}^p = (-1, x_1^p, \dots, x_n^p)^T$$
We want $y^p = \Theta(\tilde{w} \cdot \tilde{x}^p)$, $p = 1, \dots, P$

Reformulation as Positive Examples
We have positive ($y^p = 1$) and negative ($y^p = 0$) examples.
Want $\tilde{w} \cdot \tilde{x}^p > 0$ for positive, $\tilde{w} \cdot \tilde{x}^p \le 0$ for negative.
Let $z^p = \tilde{x}^p$ for positive, $z^p = -\tilde{x}^p$ for negative.
Want $\tilde{w} \cdot z^p \ge 0$ for $p = 1, \dots, P$: a hyperplane through the origin with all $z^p$ on one side.

Adjustment of Weight Vector
[Figure: vectors $z^1, \dots, z^{11}$ scattered around the origin; the weight vector must be rotated until all of them lie on its positive side]

Outline of Perceptron Learning Algorithm
1. initialize weight vector randomly
2. until all patterns classified correctly, do:
   a) for p = 1, …, P do:
      1) if $z^p$ classified correctly, do nothing
      2) else adjust weight vector to be closer to correct classification

Weight Adjustment
[Figure: a misclassified $z^p$ pulls the weight vector toward it in steps: $\tilde{w}' = \tilde{w} + \eta z^p$, then $\tilde{w}'' = \tilde{w}' + \eta z^p$, …]

Improvement in Performance
$$\tilde{w}' \cdot z^p = (\tilde{w} + \eta z^p) \cdot z^p = \tilde{w} \cdot z^p + \eta \, z^p \cdot z^p = \tilde{w} \cdot z^p + \eta \|z^p\|^2 > \tilde{w} \cdot z^p$$
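The algorithm above is compact enough to state directly in code. Below is a minimal NumPy sketch under the slides' conventions (augmented vectors, reformulation as positive examples); the function name and default parameters are illustrative, not from the lecture.

```python
import numpy as np

def perceptron_learn(X, y, eta=0.1, max_epochs=1000, rng=None):
    """Perceptron learning with augmented vectors.

    X : (P, n) array of binary input patterns
    y : (P,) array of desired outputs in {0, 1}
    Returns the augmented weight vector w~ = (theta, w_1, ..., w_n).
    """
    rng = np.random.default_rng() if rng is None else rng
    P, n = X.shape
    # Augment each pattern with x_0 = -1 so the threshold becomes weight w_0.
    X_aug = np.hstack([-np.ones((P, 1)), X])
    # Reformulate as positive examples: z^p = x~^p if y^p = 1, else -x~^p.
    Z = np.where(y[:, None] == 1, X_aug, -X_aug)
    w = rng.standard_normal(n + 1)       # 1. initialize weight vector randomly
    for _ in range(max_epochs):          # 2. until all patterns classified correctly
        all_correct = True
        for zp in Z:
            if w @ zp <= 0:              # z^p misclassified (want w.z^p > 0)
                w += eta * zp            # move w toward the correct side
                all_correct = False
        if all_correct:
            return w
    return w  # may never separate the data if they are not linearly separable

# Example: learn Boolean AND, which is linearly separable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w = perceptron_learn(X, y, rng=np.random.default_rng(0))
```

Each update increases $\tilde{w} \cdot z^p$ by $\eta \|z^p\|^2$, exactly the improvement computed above; the Perceptron Learning Theorem that follows gives the condition under which the loop terminates.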
Perceptron Learning Theorem
• If there is a set of weights that will solve the problem,
• then the perceptron learning algorithm will eventually find it
• (for a sufficiently small learning rate)
• Note: this only applies if the positive & negative examples are linearly separable

NetLogo Simulation of Perceptron Learning
Run Perceptron-Geometry.nlogo

Classification Power of Multilayer Perceptrons
• Perceptrons can function as logic gates
• Therefore an MLP can form intersections, unions, and differences of linearly separable regions
• Classes can be arbitrary hyperpolyhedra
• Minsky & Papert's criticism of perceptrons
• At the time, no one had succeeded in developing an MLP learning algorithm

Hyperpolyhedral Classes
[Figure: class regions built as hyperpolyhedra from intersections of half-spaces]

Credit Assignment Problem
How do we adjust the weights of the hidden layers?
[Figure: feedforward network with input layer, hidden layers, and output layer; the desired output is given only at the output layer]

NetLogo Demonstration of Back-Propagation Learning
Run Artificial Neural Net.nlogo

Adaptive System
[Figure: system $S$ with control parameters $P_1, \dots, P_k, \dots, P_m$, an evaluation function $F$ (fitness, figure of merit), and a control algorithm $C$ that adjusts the parameters]

Gradient
$\partial F / \partial P_k$ measures how $F$ is altered by variation of $P_k$:
$$\nabla F = \left( \frac{\partial F}{\partial P_1}, \;\dots,\; \frac{\partial F}{\partial P_k}, \;\dots,\; \frac{\partial F}{\partial P_m} \right)^T$$
$\nabla F$ points in the direction of maximum local increase in $F$.

Gradient Ascent on Fitness Surface
[Figure: fitness surface with $\nabla F$ arrows showing gradient ascent toward the + peak, away from the – valley]

Gradient Ascent by Discrete Steps
[Figure: the same surface climbed in discrete steps along $\nabla F$]

Gradient Ascent is Local But Not Shortest
[Figure: the gradient-ascent path curves with the local slope rather than heading straight to the peak]

Gradient Ascent Process
$$\dot{P} = \eta \nabla F(P)$$
Change in fitness:
$$\dot{F} = \frac{dF}{dt} = \sum_{k=1}^{m} \frac{\partial F}{\partial P_k} \frac{dP_k}{dt} = \sum_{k=1}^{m} (\nabla F)_k \dot{P}_k = \nabla F \cdot \dot{P}$$
$$\dot{F} = \nabla F \cdot \eta \nabla F = \eta \|\nabla F\|^2 \ge 0$$
Therefore gradient ascent increases fitness (until it reaches a zero gradient).

General Ascent in Fitness
Note that any adaptive process $P(t)$ will increase fitness provided:
$$0 < \dot{F} = \nabla F \cdot \dot{P} = \|\nabla F\| \, \|\dot{P}\| \cos\varphi$$
where $\varphi$ is the angle between $\nabla F$ and $\dot{P}$. Hence we need $\cos\varphi > 0$, i.e., $\varphi < 90°$.

General Ascent on Fitness Surface
[Figure: fitness surface with ascent directions anywhere within 90° of $\nabla F$]

Fitness as Minimum Error
Suppose for $Q$ different inputs we have target outputs $t^1, \dots, t^Q$.
Suppose for parameters $P$ the corresponding actual outputs are $y^1, \dots, y^Q$.
Suppose $D(t, y) \in [0, \infty)$ measures the difference between target & actual outputs.
Let $E^q = D(t^q, y^q)$ be the error on the $q$th sample, and let
$$F(P) = -\sum_{q=1}^{Q} E^q(P) = -\sum_{q=1}^{Q} D[t^q, y^q(P)]$$

Gradient of Fitness
$$\nabla F = \nabla \left( -\sum_q E^q \right) = -\sum_q \nabla E^q$$
$$\frac{\partial E^q}{\partial P_k} = \frac{\partial}{\partial P_k} D(t^q, y^q) = \sum_j \frac{\partial D(t^q, y^q)}{\partial y_j^q} \frac{\partial y_j^q}{\partial P_k} = \frac{d D(t^q, y^q)}{d y^q} \cdot \frac{\partial y^q}{\partial P_k} = \nabla_{y^q} D(t^q, y^q) \cdot \frac{\partial y^q}{\partial P_k}$$

Jacobian Matrix
Define the Jacobian matrix
$$J^q = \begin{pmatrix} \dfrac{\partial y_1^q}{\partial P_1} & \cdots & \dfrac{\partial y_1^q}{\partial P_m} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial y_n^q}{\partial P_1} & \cdots & \dfrac{\partial y_n^q}{\partial P_m} \end{pmatrix}$$
Note $J^q \in \Re^{n \times m}$ and $\nabla D(t^q, y^q) \in \Re^{n \times 1}$. Since
$$(\nabla E^q)_k = \frac{\partial E^q}{\partial P_k} = \sum_j \frac{\partial D(t^q, y^q)}{\partial y_j^q} \frac{\partial y_j^q}{\partial P_k},$$
$$\therefore \; \nabla E^q = (J^q)^T \, \nabla D(t^q, y^q)$$
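The Jacobian need not be derived analytically to check this formula. The following is a small finite-difference sketch of $\nabla E^q = (J^q)^T \nabla D(t^q, y^q)$; the toy map `y_of_P` and the targets are made up for illustration, and the gradient $2(y - t)$ of the squared Euclidean distance is taken from the next slide.

```python
import numpy as np

def jacobian_fd(y_of_P, P, eps=1e-6):
    """Central-difference Jacobian J[j, k] = dy_j / dP_k."""
    y0 = y_of_P(P)
    J = np.zeros((y0.size, P.size))
    for k in range(P.size):
        dP = np.zeros_like(P)
        dP[k] = eps
        J[:, k] = (y_of_P(P + dP) - y_of_P(P - dP)) / (2 * eps)
    return J

# Toy "system": two outputs depending nonlinearly on three parameters.
y_of_P = lambda P: np.array([np.sin(P[0]) + P[1], P[1] * P[2]])
P = np.array([0.5, 1.0, -2.0])
t = np.array([1.0, 0.0])                     # target outputs
y = y_of_P(P)
grad_D = 2 * (y - t)                         # grad of squared Euclidean distance
grad_E = jacobian_fd(y_of_P, P).T @ grad_D   # grad E^q = (J^q)^T grad_y D
print(grad_E)
```

Back-propagation, derived below, is just an efficient way of computing this same $(J^q)^T \nabla D$ product for a layered network, without ever forming $J^q$ explicitly.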
Derivative of Squared Euclidean Distance
Suppose $D(t, y) = \|t - y\|^2 = \sum_i (t_i - y_i)^2$. Then
$$\frac{\partial D(t, y)}{\partial y_j} = \frac{\partial}{\partial y_j} \sum_i (t_i - y_i)^2 = \sum_i \frac{\partial (t_i - y_i)^2}{\partial y_j} = \frac{d (t_j - y_j)^2}{d y_j} = -2 (t_j - y_j)$$
$$\therefore \; \frac{d D(t, y)}{d y} = 2 (y - t)$$

Gradient of Error on qth Input
$$\frac{\partial E^q}{\partial P_k} = \frac{d D(t^q, y^q)}{d y^q} \cdot \frac{\partial y^q}{\partial P_k} = 2 (y^q - t^q) \cdot \frac{\partial y^q}{\partial P_k} = 2 \sum_j (y_j^q - t_j^q) \frac{\partial y_j^q}{\partial P_k}$$
$$\nabla E^q = 2 (J^q)^T (y^q - t^q)$$

Recap
$$\dot{P} = \eta \sum_q (J^q)^T (t^q - y^q)$$
To know how to decrease the differences between actual & desired outputs, we need to know the elements of the Jacobian, $\partial y_j^q / \partial P_k$, which says how the $j$th output varies with the $k$th parameter (given the $q$th input). The Jacobian depends on the specific form of the system, in this case a feedforward neural network.

Multilayer Notation
[Figure: input $x^q = s^1$ passes through weight matrices $W^1, W^2, \dots, W^{L-2}, W^{L-1}$, producing layer outputs $s^2, \dots, s^{L-1}, s^L = y^q$]

Notation
• $L$ layers of neurons labeled $1, \dots, L$
• $N_l$ neurons in layer $l$
• $s^l$ = vector of outputs from neurons in layer $l$
• input layer: $s^1 = x^q$ (the input pattern)
• output layer: $s^L = y^q$ (the actual output)
• $W^l$ = weights between layers $l$ and $l+1$
• Problem: find out how the outputs $y_i^q$ vary with the weights $W_{jk}^l$ ($l = 1, \dots, L-1$)

Typical Neuron
[Figure: neuron $i$ in layer $l$ sums the inputs $s_1^{l-1}, \dots, s_j^{l-1}, \dots, s_N^{l-1}$, weighted by $W_{i1}^{l-1}, \dots, W_{ij}^{l-1}, \dots, W_{iN}^{l-1}$, to form $h_i^l$, then applies $\sigma$ to produce $s_i^l$]

Error Back-Propagation
We will compute $\dfrac{\partial E^q}{\partial W_{ij}^l}$ starting with the last layer ($l = L-1$) and working back to earlier layers ($l = L-2, \dots, 1$).

Delta Values
It is convenient to break the derivatives up by the chain rule:
$$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \frac{\partial E^q}{\partial h_i^l} \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$$
Let $\delta_i^l = \dfrac{\partial E^q}{\partial h_i^l}$. So
$$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l \, \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$$

Output-Layer Neuron
[Figure: output neuron $i$ with net input $h_i^L$ and output $s_i^L = y_i^q$, compared with the target $t_i^q$ to give the error $E^q$]

Output-Layer Derivatives (1)
$$\delta_i^L = \frac{\partial E^q}{\partial h_i^L} = \frac{\partial}{\partial h_i^L} \sum_k (s_k^L - t_k^q)^2 = \frac{d (s_i^L - t_i^q)^2}{d h_i^L} = 2 (s_i^L - t_i^q) \frac{d s_i^L}{d h_i^L} = 2 (s_i^L - t_i^q) \, \sigma'(h_i^L)$$

Output-Layer Derivatives (2)
$$\frac{\partial h_i^L}{\partial W_{ij}^{L-1}} = \frac{\partial}{\partial W_{ij}^{L-1}} \sum_k W_{ik}^{L-1} s_k^{L-1} = s_j^{L-1}$$
$$\therefore \; \frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L s_j^{L-1}, \quad \text{where } \delta_i^L = 2 (s_i^L - t_i^q) \, \sigma'(h_i^L)$$

Hidden-Layer Neuron
[Figure: hidden neuron $i$ in layer $l$ receives $s_j^{l-1}$ through weights $W_{ij}^{l-1}$ and produces $s_i^l$, which feeds every neuron $k$ of layer $l+1$ through weights $W_{ki}^l$, eventually contributing to the error $E^q$]

Hidden-Layer Derivatives (1)
Recall $\dfrac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l \dfrac{\partial h_i^l}{\partial W_{ij}^{l-1}}$.
$$\delta_i^l = \frac{\partial E^q}{\partial h_i^l} = \sum_k \frac{\partial E^q}{\partial h_k^{l+1}} \frac{\partial h_k^{l+1}}{\partial h_i^l} = \sum_k \delta_k^{l+1} \frac{\partial h_k^{l+1}}{\partial h_i^l}$$
$$\frac{\partial h_k^{l+1}}{\partial h_i^l} = \frac{\partial \sum_m W_{km}^l s_m^l}{\partial h_i^l} = \frac{\partial W_{ki}^l s_i^l}{\partial h_i^l} = W_{ki}^l \frac{d \sigma(h_i^l)}{d h_i^l} = W_{ki}^l \, \sigma'(h_i^l)$$
$$\therefore \; \delta_i^l = \sum_k \delta_k^{l+1} W_{ki}^l \, \sigma'(h_i^l) = \sigma'(h_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$$

Hidden-Layer Derivatives (2)
$$\frac{\partial h_i^l}{\partial W_{ij}^{l-1}} = \frac{\partial}{\partial W_{ij}^{l-1}} \sum_k W_{ik}^{l-1} s_k^{l-1} = \frac{d W_{ij}^{l-1} s_j^{l-1}}{d W_{ij}^{l-1}} = s_j^{l-1}$$
$$\therefore \; \frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l s_j^{l-1}, \quad \text{where } \delta_i^l = \sigma'(h_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$$

Derivative of Sigmoid
Suppose $s = \sigma(h) = \dfrac{1}{1 + \exp(-\alpha h)}$ (logistic sigmoid). Then
$$D_h s = D_h [1 + \exp(-\alpha h)]^{-1} = -[1 + \exp(-\alpha h)]^{-2} D_h (1 + e^{-\alpha h}) = -(1 + e^{-\alpha h})^{-2} (-\alpha e^{-\alpha h})$$
$$= \alpha \frac{e^{-\alpha h}}{(1 + e^{-\alpha h})^2} = \alpha \frac{1}{1 + e^{-\alpha h}} \cdot \frac{e^{-\alpha h}}{1 + e^{-\alpha h}} = \alpha s \left( \frac{1 + e^{-\alpha h}}{1 + e^{-\alpha h}} - \frac{1}{1 + e^{-\alpha h}} \right) = \alpha s (1 - s)$$

Summary of Back-Propagation Algorithm
Output layer:
$$\delta_i^L = 2 \alpha s_i^L (1 - s_i^L)(s_i^L - t_i^q), \qquad \frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L s_j^{L-1}$$
Hidden layers:
$$\delta_i^l = \alpha s_i^l (1 - s_i^l) \sum_k \delta_k^{l+1} W_{ki}^l, \qquad \frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l s_j^{l-1}$$
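Taken together, the summary equations translate into a short online update. Below is a minimal NumPy sketch of one back-propagation step for a layered network as in the slides (thresholds omitted, matching the multilayer derivation; function and variable names are illustrative):

```python
import numpy as np

def sigmoid(h, alpha=1.0):
    """Logistic sigmoid s = 1 / (1 + exp(-alpha h))."""
    return 1.0 / (1.0 + np.exp(-alpha * h))

def backprop_step(Ws, x, t, alpha=1.0, eta=0.1):
    """One online back-prop update for a single training pair (x, t).

    Ws : list of weight matrices; Ws[l] maps layer l+1 outputs to layer l+2.
    Returns the updated weight matrices.
    """
    # Forward pass: s^1 = x, then s^{l+1} = sigma(W^l s^l).
    s = [x]
    for W in Ws:
        s.append(sigmoid(W @ s[-1], alpha))
    # Output layer: delta^L = 2 alpha s^L (1 - s^L) (s^L - t).
    delta = 2 * alpha * s[-1] * (1 - s[-1]) * (s[-1] - t)
    new_Ws = [None] * len(Ws)
    for l in range(len(Ws) - 1, -1, -1):
        # dE/dW_ij = delta_i s_j; step down the gradient (DW = eta delta s).
        new_Ws[l] = Ws[l] - eta * np.outer(delta, s[l])
        if l > 0:
            # Hidden layer: delta^l = alpha s^l (1 - s^l) * (W^l)^T delta^{l+1}.
            delta = alpha * s[l] * (1 - s[l]) * (Ws[l].T @ delta)
    return new_Ws

# Example: one step on a 2-3-1 network for a single training pair.
rng = np.random.default_rng(1)
Ws = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
Ws = backprop_step(Ws, x=np.array([0.0, 1.0]), t=np.array([1.0]))
```

Accumulating these per-pair gradients over an epoch before updating gives the batch variant described under Training Procedures below.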
Output-Layer Computation
$$\Delta W_{ij}^{L-1} = \eta \, \delta_i^L s_j^{L-1}$$
[Figure: circuit computing the output-layer update: $h_i^L \to \sigma \to s_i^L = y_i^q$; the factors $2\alpha$, $s_i^L$, $(1 - s_i^L)$, and $(t_i^q - s_i^L)$ are multiplied to form $\delta_i^L = 2 \alpha s_i^L (1 - s_i^L)(t_i^q - s_i^L)$, which is multiplied by $\eta$ and $s_j^{L-1}$ to give $\Delta W_{ij}^{L-1}$]

Hidden-Layer Computation
$$\Delta W_{ij}^{l-1} = \eta \, \delta_i^l s_j^{l-1}$$
[Figure: circuit computing the hidden-layer update: the back-propagated deltas $\delta_1^{l+1}, \dots, \delta_k^{l+1}, \dots, \delta_N^{l+1}$ are combined through the weights $W_{ki}^l$ to form $\delta_i^l = \alpha s_i^l (1 - s_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$, which is multiplied by $\eta$ and $s_j^{l-1}$ to give $\Delta W_{ij}^{l-1}$]

Training Procedures
• Batch Learning
  – on each epoch (pass through all the training pairs),
  – weight changes for all patterns are accumulated
  – weight matrices are updated at the end of the epoch
  – accurate computation of the gradient
• Online Learning
  – weights are updated after back-prop of each training pair
  – usually randomize the order for each epoch
  – approximation of the gradient
• In practice it doesn't make much difference

Summation of Error Surfaces
[Figure: the total error surface $E$ as the sum of the per-pattern surfaces $E^1$ and $E^2$]

Gradient Computation in Batch Learning
[Figure: batch learning descends the gradient of the summed surface $E$]

Gradient Computation in Online Learning
[Figure: online learning alternates steps down the individual surfaces $E^1$ and $E^2$]

Testing Generalization
[Figure: available data from the domain is split into training data and test data]

Problem of Rote Learning
[Figure: error on the training data keeps decreasing with epochs, while error on the test data eventually rises; stop training at the epoch where test error is minimal]

Improving Generalization
[Figure: available data from the domain is split into training data, validation data, and test data]

A Few Random Tips
• Too few neurons and the ANN may not be able to decrease the error enough
• Too many neurons can lead to rote learning
• Preprocess data to:
  – standardize
  – eliminate irrelevant information
  – capture invariances
  – keep relevant information
• If stuck in a local minimum, restart with different random weights

Run Example BP Learning

Beyond Back-Propagation
• Adaptive Learning Rate
• Adaptive Architecture
  – Add/delete hidden neurons
  – Add/delete hidden layers
• Radial Basis Function Networks
• Recurrent BP
• Etc., etc., etc.…

What is the Power of Artificial Neural Networks?
• With respect to Turing machines?
• As function approximators?

Can ANNs Exceed the "Turing Limit"?
• There are many results, which depend sensitively on assumptions; for example:
• Finite NNs with real-valued weights have super-Turing power (Siegelmann & Sontag '94)
• Recurrent nets with Gaussian noise have sub-Turing power (Maass & Sontag '99)
• Finite recurrent nets with real weights can recognize all languages, and thus are super-Turing (Siegelmann '99)
• Stochastic nets with rational weights have super-Turing power (but only P/POLY, BPP/log*) (Siegelmann '99)
• But computing classes of functions is not a very relevant way to evaluate the capabilities of neural computation

A Universal Approximation Theorem
Suppose $f$ is a continuous function on $[0,1]^n$, and suppose $\sigma$ is a nonconstant, bounded, monotone increasing real function on $\Re$. For any $\varepsilon > 0$ there is an $m$ such that $\exists a \in \Re^m$, $b \in \Re^m$, $W \in \Re^{m \times n}$ such that if
$$F(x_1, \dots, x_n) = \sum_{i=1}^{m} a_i \, \sigma\!\left( \sum_{j=1}^{n} W_{ij} x_j + b_i \right)$$
[i.e., $F(x) = a \cdot \sigma(Wx + b)$], then $|F(x) - f(x)| < \varepsilon$ for all $x \in [0,1]^n$.
(see, e.g., Haykin, Neural Networks, 2/e, pp. 208–9)

One Hidden Layer is Sufficient
• Conclusion: One hidden layer is sufficient to approximate any continuous function arbitrarily closely (a numeric sketch follows at the end of this section)
[Figure: network with inputs $x_1, \dots, x_n$ plus a constant input 1 supplying the biases $b_i$, a layer of hidden units $\Sigma \to \sigma$, output weights $a_1, a_2, \dots, a_m$, and a final linear summation unit]

The Golden Rule of Neural Nets
Neural Networks are the second-best way to do everything!
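As a closing illustration of the theorem's $F(x) = a \cdot \sigma(Wx + b)$ form, the sketch below fits only the output weights $a$ by least squares against a fixed random hidden layer. This is a crude numerical demonstration that one hidden layer can track a continuous target, not the theorem's construction; the sizes, seed, and target function are arbitrary choices, and $\sigma = \tanh$ is one bounded, monotone increasing, nonconstant option.

```python
import numpy as np

def one_hidden_layer(x, a, W, b, sigma=np.tanh):
    """F(x) = a . sigma(W x + b), the form in the approximation theorem.

    W : (m, n) hidden weights, b : (m,) hidden biases, a : (m,) output weights.
    """
    return a @ sigma(W @ x + b)

# Toy check: approximate f(x) = sin(2 pi x) on [0, 1].
rng = np.random.default_rng(0)
m, n = 50, 1
W = rng.standard_normal((m, n)) * 5          # fixed random hidden weights
b = rng.standard_normal(m) * 5               # fixed random hidden biases
xs = np.linspace(0, 1, 200).reshape(-1, 1)
H = np.tanh(xs @ W.T + b)                    # hidden activations, all samples
target = np.sin(2 * np.pi * xs[:, 0])
a, *_ = np.linalg.lstsq(H, target, rcond=None)   # fit output weights only
err = np.max(np.abs(H @ a - target))
print(f"max |F(x) - f(x)| on the grid: {err:.3f}")   # shrinks as m grows
```

Increasing $m$ drives the maximum discrepancy down, in the spirit of choosing $m$ large enough to get within any desired $\varepsilon$.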