Part 4A: Neural Network Learning

Supervised Learning
• Produce the desired outputs for the training inputs
• Generalize reasonably and appropriately to other inputs
• Good example: pattern recognition
• Here implemented with feedforward multilayer networks

Feedforward Network
[Figure: activity flows from an input layer through one or more hidden layers to an output layer.]

Typical Artificial Neuron
[Figure: inputs are scaled by connection weights and summed; the output is produced by comparing the sum against a threshold.]

Typical Artificial Neuron
[Figure: the linear combination of the inputs gives the net input (local field), which is passed through an activation function.]

Equations
Net input:
h_i = \left( \sum_{j=1}^{n} w_{ij} s_j \right) - \theta, \qquad \mathbf{h} = W\mathbf{s} - \boldsymbol{\theta}
Neuron output:
s'_i = \sigma(h_i), \qquad \mathbf{s}' = \sigma(\mathbf{h})

Single-Layer Perceptron
[Figure: a single layer of threshold units, each connected directly to all the inputs.]

Variables
[Figure: one unit with inputs x_1, …, x_j, …, x_n, weights w_1, …, w_j, …, w_n, summation Σ producing the net input h, threshold function Θ with threshold θ, and output y.]

Single-Layer Perceptron Equations
Binary threshold activation function:
\sigma(h) = \Theta(h) = \begin{cases} 1, & \text{if } h > 0 \\ 0, & \text{if } h \le 0 \end{cases}
Hence
y = \begin{cases} 1, & \text{if } \sum_j w_j x_j > \theta \\ 0, & \text{otherwise} \end{cases} = \begin{cases} 1, & \text{if } \mathbf{w} \cdot \mathbf{x} > \theta \\ 0, & \text{if } \mathbf{w} \cdot \mathbf{x} \le \theta \end{cases}

2D Weight Vector
\mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| \, \|\mathbf{x}\| \cos\varphi
Let v = \|\mathbf{x}\| \cos\varphi be the length of the projection of x onto w, so that \mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| v. Then
\mathbf{w} \cdot \mathbf{x} > \theta \iff \|\mathbf{w}\| v > \theta \iff v > \theta / \|\mathbf{w}\|
[Figure: the decision boundary is the line perpendicular to w at distance θ/‖w‖ from the origin, with the + region on the side that w points toward and the − region on the other.]

N-Dimensional Weight Vector
[Figure: in n dimensions, w is the normal vector of the separating hyperplane between the + and − regions.]

Goal of Perceptron Learning
• Suppose we have training patterns x^1, x^2, …, x^P with corresponding desired outputs y^1, y^2, …, y^P,
• where x^p ∈ {0, 1}^n and y^p ∈ {0, 1}.
• We want to find w, θ such that y^p = \Theta(\mathbf{w} \cdot \mathbf{x}^p - \theta) for p = 1, …, P.

Treating Threshold as Weight
h = \left( \sum_{j=1}^{n} w_j x_j \right) - \theta = -\theta + \sum_{j=1}^{n} w_j x_j
Let x_0 = -1 and w_0 = \theta. Then
h = w_0 x_0 + \sum_{j=1}^{n} w_j x_j = \sum_{j=0}^{n} w_j x_j = \tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}
[Figure: the threshold becomes an ordinary weight w_0 = θ on a constant input x_0 = −1.]

Augmented Vectors
\tilde{\mathbf{w}} = (\theta, w_1, \ldots, w_n)^T, \qquad \tilde{\mathbf{x}}^p = (-1, x_1^p, \ldots, x_n^p)^T
We want y^p = \Theta(\tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p) for p = 1, …, P.

Reformulation as Positive Examples
We have positive (y^p = 1) and negative (y^p = 0) examples.
We want \tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p > 0 for positive examples and \tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p \le 0 for negative ones.
Let \mathbf{z}^p = \tilde{\mathbf{x}}^p for positive examples and \mathbf{z}^p = -\tilde{\mathbf{x}}^p for negative examples.
Then we want \tilde{\mathbf{w}} \cdot \mathbf{z}^p \ge 0 for p = 1, …, P:
a hyperplane through the origin with all the z^p on one side.

Adjustment of Weight Vector
[Figure: the examples z^1, …, z^11 as vectors; the weight vector must be rotated until every z^p lies on its positive side.]

Outline of Perceptron Learning Algorithm
1. Initialize the weight vector randomly.
2. Until all patterns are classified correctly, do:
   a) for p = 1, …, P do:
      1) if z^p is classified correctly, do nothing;
      2) else adjust the weight vector to be closer to correct classification.

Weight Adjustment
[Figure: each misclassified z^p pulls the weight vector toward it: w′ = w + ηz^p, w″ = w′ + ηz^p, ….]

Improvement in Performance
\mathbf{w}' \cdot \mathbf{z}^p = (\mathbf{w} + \eta \mathbf{z}^p) \cdot \mathbf{z}^p = \mathbf{w} \cdot \mathbf{z}^p + \eta \, \mathbf{z}^p \cdot \mathbf{z}^p = \mathbf{w} \cdot \mathbf{z}^p + \eta \, \|\mathbf{z}^p\|^2 > \mathbf{w} \cdot \mathbf{z}^p
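A minimal sketch of this learning loop in Python (NumPy), using the augmented-vector reformulation above. The AND dataset, the learning rate eta, and the epoch cap are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def perceptron_learn(X, y, eta=0.1, max_epochs=1000):
    """Perceptron learning on augmented vectors z^p.

    X: (P, n) array of {0,1} patterns; y: (P,) array of {0,1} targets.
    Returns the augmented weight vector w~ = (theta, w_1, ..., w_n).
    """
    P, n = X.shape
    X_aug = np.hstack([-np.ones((P, 1)), X])       # x_0 = -1, so theta = w_0
    Z = np.where(y[:, None] == 1, X_aug, -X_aug)   # z^p = x~^p or -x~^p
    w = np.random.randn(n + 1)                     # 1. random initialization
    for _ in range(max_epochs):                    # 2. repeat until correct
        errors = 0
        for zp in Z:
            if w @ zp <= 0:                        # misclassified:
                w += eta * zp                      # move w toward z^p
                errors += 1
        if errors == 0:
            return w
    return w    # may not terminate if the classes are not linearly separable

# Illustrative use: learn the (linearly separable) AND function.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w = perceptron_learn(X, y)
print((X @ w[1:] > w[0]).astype(int))   # y = Theta(w . x - theta) -> [0 0 0 1]
```

Each update increases w·z^p by η‖z^p‖², which is exactly the improvement computed on the slide above.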
Perceptron Learning Theorem
• If there is a set of weights that will solve the problem,
• then the perceptron learning algorithm will eventually find it
• (for a sufficiently small learning rate).
• Note: this applies only if the positive and negative examples are linearly separable.

NetLogo Simulation of Perceptron Learning
Run Perceptron-Geometry.nlogo

Classification Power of Multilayer Perceptrons
• Perceptrons can function as logic gates
• Therefore a multilayer perceptron (MLP) can form intersections, unions, and differences of linearly separable regions
• Classes can be arbitrary hyperpolyhedra
• Minsky & Papert's criticism of perceptrons
• At the time, no one had succeeded in developing an MLP learning algorithm

Hyperpolyhedral Classes
[Figure: class regions formed as hyperpolyhedra, i.e., intersections and unions of half-spaces.]

Credit Assignment Problem
[Figure: a feedforward network with input layer, hidden layers, and output layer, whose output is compared against a desired output.]
How do we adjust the weights of the hidden layers?

NetLogo Demonstration of Back-Propagation Learning
Run Artificial Neural Net.nlogo

Adaptive System
[Figure: a system S with control parameters P_1, …, P_k, …, P_m; an evaluation function F (fitness, figure of merit); and a control algorithm C that adjusts the parameters.]

Gradient
∂F/∂P_k measures how F is altered by variation of P_k:
\nabla F = \left( \frac{\partial F}{\partial P_1}, \ldots, \frac{\partial F}{\partial P_k}, \ldots, \frac{\partial F}{\partial P_m} \right)^{T}
∇F points in the direction of maximum local increase in F.

Gradient Ascent on Fitness Surface
[Figure: ∇F on a fitness surface; gradient ascent climbs from the − region toward the + peak.]

Gradient Ascent by Discrete Steps
[Figure: a sequence of discrete gradient steps climbing the fitness surface.]

Gradient Ascent is Local But Not Shortest
[Figure: the gradient path is locally steepest but follows a curved route, not the shortest path to the peak.]

Gradient Ascent Process
\dot{\mathbf{P}} = \eta \, \nabla F(\mathbf{P})
Change in fitness:
\dot{F} = \frac{dF}{dt} = \sum_{k=1}^{m} \frac{\partial F}{\partial P_k} \frac{d P_k}{d t} = \sum_{k=1}^{m} (\nabla F)_k \dot{P}_k = \nabla F \cdot \dot{\mathbf{P}}
\dot{F} = \nabla F \cdot \eta \nabla F = \eta \, \|\nabla F\|^2 \ge 0
Therefore gradient ascent increases fitness (until it reaches a zero gradient).

General Ascent in Fitness
Note that any adaptive process P(t) will increase fitness provided:
0 < \dot{F} = \nabla F \cdot \dot{\mathbf{P}} = \|\nabla F\| \, \|\dot{\mathbf{P}}\| \cos\varphi,
where φ is the angle between ∇F and Ṗ. Hence we need cos φ > 0, i.e., φ < 90°.

General Ascent on Fitness Surface
[Figure: any trajectory whose velocity stays within 90° of ∇F climbs the fitness surface.]
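A minimal sketch of gradient ascent by discrete steps in Python (NumPy). The quadratic fitness function, the step size, and the starting point are illustrative assumptions, chosen so the maximum is known in advance.

```python
import numpy as np

def gradient_ascent(grad_F, P0, eta=0.1, steps=100):
    """Discrete-step version of P-dot = eta * grad F(P)."""
    P = np.array(P0, dtype=float)
    for _ in range(steps):
        P += eta * grad_F(P)   # step in the direction of maximum local increase
    return P

# Illustrative fitness: F(P) = -(P_1 - 1)^2 - (P_2 + 2)^2, maximized at (1, -2).
F = lambda P: -(P[0] - 1)**2 - (P[1] + 2)**2
grad_F = lambda P: np.array([-2 * (P[0] - 1), -2 * (P[1] + 2)])

P = gradient_ascent(grad_F, P0=[5.0, 5.0])
print(P, F(P))   # P approaches (1, -2) and F approaches its maximum, 0
```

For small enough η each discrete step increases F, matching Ḟ = η‖∇F‖² ≥ 0 for the continuous process; too large an η can overshoot the peak.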
Fitness as Minimum Error
Suppose for Q different inputs we have target outputs t^1, …, t^Q.
Suppose for parameters P the corresponding actual outputs are y^1, …, y^Q.
Suppose D(t, y) ∈ [0, ∞) measures the difference between target and actual outputs.
Let E^q = D(\mathbf{t}^q, \mathbf{y}^q) be the error on the qth sample, and let
F(\mathbf{P}) = -\sum_{q=1}^{Q} E^q(\mathbf{P}) = -\sum_{q=1}^{Q} D[\mathbf{t}^q, \mathbf{y}^q(\mathbf{P})]

Gradient of Fitness
\nabla F = \nabla \left( -\sum_q E^q \right) = -\sum_q \nabla E^q
\frac{\partial E^q}{\partial P_k} = \frac{\partial}{\partial P_k} D(\mathbf{t}^q, \mathbf{y}^q) = \sum_j \frac{\partial D(\mathbf{t}^q, \mathbf{y}^q)}{\partial y_j^q} \frac{\partial y_j^q}{\partial P_k} = \frac{d\, D(\mathbf{t}^q, \mathbf{y}^q)}{d\, \mathbf{y}^q} \cdot \frac{\partial \mathbf{y}^q}{\partial P_k} = \nabla_{\mathbf{y}^q} D(\mathbf{t}^q, \mathbf{y}^q) \cdot \frac{\partial \mathbf{y}^q}{\partial P_k}

Jacobian Matrix
Define the Jacobian matrix
J^q = \begin{pmatrix} \frac{\partial y_1^q}{\partial P_1} & \cdots & \frac{\partial y_1^q}{\partial P_m} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_n^q}{\partial P_1} & \cdots & \frac{\partial y_n^q}{\partial P_m} \end{pmatrix}
Note that J^q ∈ ℜ^{n×m} and \nabla D(\mathbf{t}^q, \mathbf{y}^q) ∈ ℜ^{n×1}.
Since
(\nabla E^q)_k = \frac{\partial E^q}{\partial P_k} = \sum_j \frac{\partial D(\mathbf{t}^q, \mathbf{y}^q)}{\partial y_j^q} \frac{\partial y_j^q}{\partial P_k},
\therefore \; \nabla E^q = (J^q)^T \, \nabla D(\mathbf{t}^q, \mathbf{y}^q)

Derivative of Squared Euclidean Distance
Suppose D(\mathbf{t}, \mathbf{y}) = \|\mathbf{t} - \mathbf{y}\|^2 = \sum_i (t_i - y_i)^2. Then
\frac{\partial D(\mathbf{t}, \mathbf{y})}{\partial y_j} = \frac{\partial}{\partial y_j} \sum_i (t_i - y_i)^2 = \frac{d (t_j - y_j)^2}{d y_j} = -2 (t_j - y_j)
\therefore \; \frac{d\, D(\mathbf{t}, \mathbf{y})}{d\, \mathbf{y}} = 2 (\mathbf{y} - \mathbf{t})

Gradient of Error on qth Input
\frac{\partial E^q}{\partial P_k} = \frac{d\, D(\mathbf{t}^q, \mathbf{y}^q)}{d\, \mathbf{y}^q} \cdot \frac{\partial \mathbf{y}^q}{\partial P_k} = 2 (\mathbf{y}^q - \mathbf{t}^q) \cdot \frac{\partial \mathbf{y}^q}{\partial P_k} = 2 \sum_j (y_j^q - t_j^q) \frac{\partial y_j^q}{\partial P_k}
\nabla E^q = 2 (J^q)^T (\mathbf{y}^q - \mathbf{t}^q)

Recap
\dot{\mathbf{P}} = \eta \sum_q (J^q)^T (\mathbf{t}^q - \mathbf{y}^q)
To know how to decrease the differences between actual and desired outputs, we need to know the elements of the Jacobian, ∂y_j^q/∂P_k, which says how the jth output varies with the kth parameter (given the qth input). The Jacobian depends on the specific form of the system; in this case, a feedforward neural network.

Multilayer Notation
[Figure: the input x^q enters as s^1; weight matrices W^1, W^2, …, W^{L−1} connect successive layers, producing s^2, …, s^{L−1}, and finally s^L = y^q.]

Notation
• L layers of neurons, labeled 1, …, L
• N_l neurons in layer l
• s^l = vector of outputs from the neurons in layer l
• input layer: s^1 = x^q (the input pattern)
• output layer: s^L = y^q (the actual output)
• W^l = weights between layers l and l+1
• Problem: find out how the outputs y_i^q vary with the weights W_jk^l (l = 1, …, L−1)

Typical Neuron
[Figure: neuron i in layer l receives s_1^{l−1}, …, s_j^{l−1}, …, s_N^{l−1} through weights W_{i1}^{l−1}, …, W_{ij}^{l−1}, …, W_{iN}^{l−1}; the sum Σ gives h_i^l, and σ gives s_i^l.]

Error Back-Propagation
We will compute \frac{\partial E^q}{\partial W_{ij}^{l}} starting with the last layer (l = L−1) and working back to earlier layers (l = L−2, …, 1).

Delta Values
It is convenient to break the derivatives up by the chain rule:
\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \frac{\partial E^q}{\partial h_i^l} \, \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}
Let \delta_i^l = \frac{\partial E^q}{\partial h_i^l}, so that
\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l \, \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}

Output-Layer Neuron
[Figure: output neuron i receives s_1^{L−1}, …, s_N^{L−1} through weights W_{ij}^{L−1}; its output s_i^L = y_i^q is compared with the target t_i^q to give the error E^q.]

Output-Layer Derivatives (1)
\delta_i^L = \frac{\partial E^q}{\partial h_i^L} = \frac{\partial}{\partial h_i^L} \sum_k (s_k^L - t_k^q)^2 = \frac{d (s_i^L - t_i^q)^2}{d h_i^L} = 2 (s_i^L - t_i^q) \frac{d s_i^L}{d h_i^L} = 2 (s_i^L - t_i^q) \, \sigma'(h_i^L)

Output-Layer Derivatives (2)
\frac{\partial h_i^L}{\partial W_{ij}^{L-1}} = \frac{\partial}{\partial W_{ij}^{L-1}} \sum_k W_{ik}^{L-1} s_k^{L-1} = s_j^{L-1}
\therefore \; \frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L s_j^{L-1}, \quad \text{where } \delta_i^L = 2 (s_i^L - t_i^q) \, \sigma'(h_i^L)

Hidden-Layer Neuron
[Figure: hidden neuron i in layer l feeds every neuron k in layer l+1 through the weights W_{ki}^l, so its effect on the error E^q flows through all of the h_k^{l+1}.]

Hidden-Layer Derivatives (1)
Recall \frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l \, \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}.
\delta_i^l = \frac{\partial E^q}{\partial h_i^l} = \sum_k \frac{\partial E^q}{\partial h_k^{l+1}} \frac{\partial h_k^{l+1}}{\partial h_i^l} = \sum_k \delta_k^{l+1} \frac{\partial h_k^{l+1}}{\partial h_i^l}
\frac{\partial h_k^{l+1}}{\partial h_i^l} = \frac{\partial \sum_m W_{km}^l s_m^l}{\partial h_i^l} = W_{ki}^l \frac{d\, \sigma(h_i^l)}{d\, h_i^l} = W_{ki}^l \, \sigma'(h_i^l)
\therefore \; \delta_i^l = \sum_k \delta_k^{l+1} W_{ki}^l \, \sigma'(h_i^l) = \sigma'(h_i^l) \sum_k \delta_k^{l+1} W_{ki}^l

Hidden-Layer Derivatives (2)
\frac{\partial h_i^l}{\partial W_{ij}^{l-1}} = \frac{\partial}{\partial W_{ij}^{l-1}} \sum_k W_{ik}^{l-1} s_k^{l-1} = \frac{d\, W_{ij}^{l-1} s_j^{l-1}}{d\, W_{ij}^{l-1}} = s_j^{l-1}
\therefore \; \frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l s_j^{l-1}, \quad \text{where } \delta_i^l = \sigma'(h_i^l) \sum_k \delta_k^{l+1} W_{ki}^l

Derivative of Sigmoid
Suppose s = \sigma(h) = \frac{1}{1 + \exp(-\alpha h)} (the logistic sigmoid). Then
D_h s = D_h [1 + \exp(-\alpha h)]^{-1} = -[1 + \exp(-\alpha h)]^{-2} D_h (1 + e^{-\alpha h}) = -(1 + e^{-\alpha h})^{-2} (-\alpha e^{-\alpha h})
 = \alpha \frac{e^{-\alpha h}}{(1 + e^{-\alpha h})^2} = \alpha s \, \frac{e^{-\alpha h}}{1 + e^{-\alpha h}} = \alpha s \left( \frac{1 + e^{-\alpha h}}{1 + e^{-\alpha h}} - \frac{1}{1 + e^{-\alpha h}} \right) = \alpha s (1 - s)

Summary of Back-Propagation Algorithm
Output layer:
\delta_i^L = 2 \alpha s_i^L (1 - s_i^L)(s_i^L - t_i^q), \qquad \frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L s_j^{L-1}
Hidden layers:
\delta_i^l = \alpha s_i^l (1 - s_i^l) \sum_k \delta_k^{l+1} W_{ki}^l, \qquad \frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l s_j^{l-1}
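A minimal sketch of this back-propagation summary in Python (NumPy), for a single training pair on a network with one hidden layer. The layer sizes, α, η, and the random input/target are illustrative assumptions; descending the error E^q (i.e., ascending the fitness F = −ΣE^q) gives ΔW = −η ∂E^q/∂W.

```python
import numpy as np

alpha, eta = 1.0, 0.1
sigma = lambda h: 1.0 / (1.0 + np.exp(-alpha * h))   # logistic sigmoid

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 2                # illustrative layer sizes
W1 = rng.normal(size=(n_hid, n_in))         # weights between layers 1 and 2
W2 = rng.normal(size=(n_out, n_hid))        # weights between layers 2 and 3
x = rng.random(n_in)                        # input pattern x^q
t = rng.random(n_out)                       # target output t^q

# Forward pass: s^1 = x^q, h^l = W^{l-1} s^{l-1}, s^l = sigma(h^l).
s1 = x
s2 = sigma(W1 @ s1)                         # hidden layer
s3 = sigma(W2 @ s2)                         # output layer, s^L = y^q

# Output layer: delta_i^L = 2 alpha s_i^L (1 - s_i^L)(s_i^L - t_i^q).
delta3 = 2 * alpha * s3 * (1 - s3) * (s3 - t)
# Hidden layer: delta_i^l = alpha s_i^l (1 - s_i^l) sum_k delta_k^{l+1} W_ki^l.
delta2 = alpha * s2 * (1 - s2) * (W2.T @ delta3)

# dE^q/dW_ij^{l-1} = delta_i^l s_j^{l-1}; step downhill on the error.
W2 -= eta * np.outer(delta3, s2)
W1 -= eta * np.outer(delta2, s1)

print("squared error:", np.sum((s3 - t) ** 2))
```

This matches the dataflow of the two computation slides that follow; there ΔW = ηδs with δ written in terms of (t_i^q − s_i^L), which is the same update with the sign absorbed into δ.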
Output-Layer Computation
\Delta W_{ij}^{L-1} = \eta \, \delta_i^L s_j^{L-1}, \qquad \delta_i^L = 2 \alpha s_i^L (1 - s_i^L)(t_i^q - s_i^L)
[Figure: dataflow for the output layer: the difference t_i^q − s_i^L is multiplied by 2α s_i^L (1 − s_i^L) to give δ_i^L, which is multiplied by s_j^{L−1} and η to give ΔW_ij^{L−1}.]

Hidden-Layer Computation
\Delta W_{ij}^{l-1} = \eta \, \delta_i^l s_j^{l-1}, \qquad \delta_i^l = \alpha s_i^l (1 - s_i^l) \sum_k \delta_k^{l+1} W_{ki}^l
[Figure: dataflow for a hidden layer: the deltas δ_1^{l+1}, …, δ_N^{l+1} are propagated back through the weights W_{ki}^l, summed, and multiplied by α s_i^l (1 − s_i^l) to give δ_i^l.]

Training Procedures
• Batch learning
  – on each epoch (pass through all the training pairs),
  – weight changes for all patterns are accumulated,
  – and the weight matrices are updated at the end of the epoch
  – accurate computation of the gradient
• Online learning
  – weights are updated after back-propagation of each training pair
  – usually randomize the order for each epoch
  – an approximation of the gradient
• In practice it doesn't make much difference

Summation of Error Surfaces
[Figure: the total error surface E is the sum of the per-pattern error surfaces E^1, E^2, ….]

Gradient Computation in Batch Learning
[Figure: batch learning steps along the gradient of the summed error surface E.]

Gradient Computation in Online Learning
[Figure: online learning alternates steps along the gradients of the individual surfaces E^1, E^2, …, approximating the gradient of E.]

Testing Generalization
[Figure: the available data from the domain are split into training data and test data.]

Problem of Rote Learning
[Figure: error versus training epoch: the error on the training data keeps decreasing, but the error on the test data eventually starts to rise; stop training at the minimum of the test error.]

Improving Generalization
[Figure: the available data from the domain are split into training data, validation data, and test data.]

A Few Random Tips
• Too few neurons and the ANN may not be able to decrease the error enough
• Too many neurons can lead to rote learning
• Preprocess data to:
  – standardize
  – eliminate irrelevant information
  – capture invariances
  – keep relevant information
• If stuck in a local minimum, restart with different random weights

Run Example BP Learning

Beyond Back-Propagation
• Adaptive learning rate
• Adaptive architecture
  – add/delete hidden neurons
  – add/delete hidden layers
• Radial basis function networks
• Recurrent BP
• Etc., etc., etc.…

Deep Belief Networks
• Inspired by hierarchical representations in mammalian sensory systems
• Use "deep" (multilayer) feed-forward nets
• Layers self-organize to represent the input at progressively more abstract, task-relevant levels
• Supervised training (e.g., BP) can be used to tune network performance
• Each layer is a Restricted Boltzmann Machine

Restricted Boltzmann Machine
• Goal: the hidden units become a model of the input domain
• They should capture the statistics of the input
• Evaluate the model by testing its ability to reproduce the input statistics
• Change the weights to decrease the difference
(fig. from Wikipedia)

Unsupervised RBM Learning
• Stochastic binary units
• Assume bias units x_0 = y_0 = 1
• Set y_i = 1 with probability \sigma\bigl(\sum_j W_{ij} x_j\bigr)
• Set x'_j = 1 with probability \sigma\bigl(\sum_i W_{ij} y_i\bigr)
• Set y'_i = 1 with probability \sigma\bigl(\sum_j W_{ij} x'_j\bigr)
• After several cycles of sampling, update the weights based on the statistics (see the sketch after the next slide):
\Delta W_{ij} = \eta \bigl( \langle y_i x_j \rangle - \langle y'_i x'_j \rangle \bigr)

Training a DBN Network
• Present inputs and do RBM learning with the first hidden layer to develop a model
• When it has converged, do RBM learning between the first and second hidden layers to develop a higher-level model
• Continue until all weight layers are trained
• May further train with BP or other supervised learning algorithms
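A minimal sketch of the RBM update above in Python (NumPy), run per pattern with a single sampling cycle (the contrastive-divergence shortcut) rather than the several cycles the slide describes; the layer sizes, learning rate, random training data, and omission of the bias units are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda h: 1.0 / (1.0 + np.exp(-h))
sample = lambda p: (rng.random(p.shape) < p).astype(float)  # stochastic binary units

n_vis, n_hid, eta = 6, 3, 0.1                  # illustrative sizes and rate
W = 0.1 * rng.normal(size=(n_hid, n_vis))      # W_ij: hidden unit i, visible unit j
data = sample(np.full((100, n_vis), 0.5))      # illustrative binary training patterns

for x in data:
    y = sample(sigma(W @ x))        # set y_i  with prob sigma(sum_j W_ij x_j)
    x2 = sample(sigma(W.T @ y))     # set x'_j with prob sigma(sum_i W_ij y_i)
    y2 = sample(sigma(W @ x2))      # set y'_i with prob sigma(sum_j W_ij x'_j)
    # Move <y_i x_j> toward the data and away from the model's reconstruction.
    W += eta * (np.outer(y, x) - np.outer(y2, x2))
```

Stacking this step layer by layer, with each trained layer's hidden activities serving as the next layer's input, is the DBN training procedure of the preceding slide.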
What is the Power of Artificial Neural Networks?
• With respect to Turing machines?
• As function approximators?

Can ANNs Exceed the "Turing Limit"?
• There are many results, which depend sensitively on assumptions; for example:
• Finite NNs with real-valued weights have super-Turing power (Siegelmann & Sontag '94)
• Recurrent nets with Gaussian noise have sub-Turing power (Maass & Sontag '99)
• Finite recurrent nets with real weights can recognize all languages, and thus are super-Turing (Siegelmann '99)
• Stochastic nets with rational weights have super-Turing power (but only P/POLY, BPP/log*) (Siegelmann '99)
• But computing classes of functions is not a very relevant way to evaluate the capabilities of neural computation

A Universal Approximation Theorem
Suppose f is a continuous function on [0, 1]^n, and suppose σ is a nonconstant, bounded, monotone increasing real function on ℜ.
For any ε > 0, there is an m such that there exist \mathbf{a} \in \Re^m, \mathbf{b} \in \Re^m, W \in \Re^{m \times n} such that if
F(x_1, \ldots, x_n) = \sum_{i=1}^{m} a_i \, \sigma\!\left( \sum_{j=1}^{n} W_{ij} x_j + b_i \right)
[i.e., F(\mathbf{x}) = \mathbf{a} \cdot \sigma(W \mathbf{x} + \mathbf{b})],
then |F(\mathbf{x}) - f(\mathbf{x})| < ε for all \mathbf{x} \in [0, 1]^n.
(See, e.g., Haykin, Neural Networks, 2nd ed., pp. 208–9.)

One Hidden Layer is Sufficient
• Conclusion: one hidden layer is sufficient to approximate any continuous function arbitrarily closely (see the sketch at the end of this part).
[Figure: inputs x_1, …, x_n and a constant unit feed m sigmoid hidden units through the weights W and biases b; the hidden outputs are combined with the weights a_1, …, a_m by a linear output unit.]

The Golden Rule of Neural Nets
Neural networks are the second-best way to do everything!
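To close, a minimal sketch in Python (NumPy) of the one-hidden-layer form F(x) = a·σ(Wx + b) from the approximation theorem above. Fixing random hidden weights and solving for the output weights a by least squares is an illustrative shortcut, not the theorem's construction; the target function, m, and the weight scales are likewise assumptions.

```python
import numpy as np

f = lambda x: np.sin(2 * np.pi * x)          # a continuous function on [0, 1]
sigma = lambda h: 1.0 / (1.0 + np.exp(-h))   # nonconstant, bounded, increasing

m = 50                                       # number of hidden units
rng = np.random.default_rng(0)
W = rng.normal(scale=10.0, size=(m, 1))      # hidden weights (n = 1 here)
b = rng.uniform(-10.0, 10.0, size=m)         # hidden biases

x = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
H = sigma(x @ W.T + b)                       # hidden-layer outputs, one row per x

# Output weights a chosen by least squares, so that F(x) = a . sigma(Wx + b).
a, *_ = np.linalg.lstsq(H, f(x).ravel(), rcond=None)

F = H @ a
print("max |F(x) - f(x)| =", np.max(np.abs(F - f(x).ravel())))
```

Increasing m drives the maximum error down, which is the content of the theorem: with enough hidden units, one hidden layer suffices.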