Part 4A: Neural Network Learning

IV. Neural Networks and Learning
A. Artificial Neural Net Learning

Supervised Learning
• Produce desired outputs for training inputs
• Generalize reasonably & appropriately to other inputs
• Good example: pattern recognition
• Feedforward multilayer networks

Feedforward Network
[Figure: feedforward network with an input layer, several hidden layers, and an output layer]

Typical Artificial Neuron
[Figure: inputs, connection weights, threshold, and output of a single neuron]

Typical Artificial Neuron
[Figure: a linear combination of the inputs forms the net input (local field), which passes through the activation function]

Equations
Net input: $h_i = \left( \sum_{j=1}^{n} w_{ij} s_j \right) - \theta$, or in vector form $\mathbf{h} = W\mathbf{s} - \boldsymbol{\theta}$
Neuron output: $s'_i = \sigma(h_i)$, i.e., $\mathbf{s}' = \sigma(\mathbf{h})$

Single-Layer Perceptron
[Figure: single-layer perceptron]

Variables
[Figure: perceptron with inputs $x_1, \dots, x_n$, weights $w_1, \dots, w_n$, net input $h$, threshold $\theta$, step activation $\Theta$, and output $y$]

Single-Layer Perceptron Equations
Binary threshold activation function:
$\sigma(h) = \Theta(h) = \begin{cases} 1, & \text{if } h > 0 \\ 0, & \text{if } h \le 0 \end{cases}$
Hence,
$y = \begin{cases} 1, & \text{if } \sum_j w_j x_j > \theta \\ 0, & \text{otherwise} \end{cases} = \begin{cases} 1, & \text{if } \mathbf{w} \cdot \mathbf{x} > \theta \\ 0, & \text{if } \mathbf{w} \cdot \mathbf{x} \le \theta \end{cases}$

2D Weight Vector
$\mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| \, \|\mathbf{x}\| \cos\varphi$. Let $v = \|\mathbf{x}\| \cos\varphi$ (the projection of $\mathbf{x}$ onto $\mathbf{w}$); then $\mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| v$, so
$\mathbf{w} \cdot \mathbf{x} > \theta \iff \|\mathbf{w}\| v > \theta \iff v > \theta / \|\mathbf{w}\|$
[Figure: the separating line is perpendicular to $\mathbf{w}$ at distance $\theta/\|\mathbf{w}\|$ from the origin, with the + region on the side $\mathbf{w}$ points toward and the – region on the other]

N-Dimensional Weight Vector
[Figure: $\mathbf{w}$ is the normal vector of the separating hyperplane, with + and – regions on either side]

Goal of Perceptron Learning
• Suppose we have training patterns $\mathbf{x}^1, \mathbf{x}^2, \dots, \mathbf{x}^P$ with corresponding desired outputs $y^1, y^2, \dots, y^P$
• where $\mathbf{x}^p \in \{0,1\}^n$, $y^p \in \{0,1\}$
• We want to find $\mathbf{w}, \theta$ such that $y^p = \Theta(\mathbf{w} \cdot \mathbf{x}^p - \theta)$ for $p = 1, \dots, P$

Treating Threshold as Weight
$h = \left( \sum_{j=1}^{n} w_j x_j \right) - \theta = -\theta + \sum_{j=1}^{n} w_j x_j$
Let $x_0 = 1$ and $w_0 = -\theta$. Then
$h = w_0 x_0 + \sum_{j=1}^{n} w_j x_j = \sum_{j=0}^{n} w_j x_j = \tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}$

Augmented Vectors
$\tilde{\mathbf{w}} = (\theta, w_1, \dots, w_n)^T, \qquad \tilde{\mathbf{x}}^p = (-1, x_1^p, \dots, x_n^p)^T$
We want $y^p = \Theta(\tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p)$, $p = 1, \dots, P$

Reformulation as Positive Examples
We have positive ($y^p = 1$) and negative ($y^p = 0$) examples.
Want $\tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p > 0$ for positive, $\tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p \le 0$ for negative.
Let $\mathbf{z}^p = \tilde{\mathbf{x}}^p$ for positive, $\mathbf{z}^p = -\tilde{\mathbf{x}}^p$ for negative.
Want $\tilde{\mathbf{w}} \cdot \mathbf{z}^p \ge 0$ for $p = 1, \dots, P$: a hyperplane through the origin with all the $\mathbf{z}^p$ on one side.

Adjustment of Weight Vector
[Figure: points $\mathbf{z}^1, \dots, \mathbf{z}^{10}$ scattered around the origin; the weight vector must be rotated until all of them lie on its positive side]

Outline of Perceptron Learning Algorithm
1. initialize the weight vector randomly
2. until all patterns are classified correctly, do:
   a) for p = 1, …, P do:
      1) if $\mathbf{z}^p$ is classified correctly, do nothing
      2) else adjust the weight vector to be closer to correct classification

Weight Adjustment
[Figure: successive updates $\tilde{\mathbf{w}}' = \tilde{\mathbf{w}} + \eta \mathbf{z}^p$, $\tilde{\mathbf{w}}'' = \tilde{\mathbf{w}}' + \eta \mathbf{z}^p$ rotate the weight vector toward $\mathbf{z}^p$]

Improvement in Performance
$\tilde{\mathbf{w}}' \cdot \mathbf{z}^p = (\tilde{\mathbf{w}} + \eta \mathbf{z}^p) \cdot \mathbf{z}^p = \tilde{\mathbf{w}} \cdot \mathbf{z}^p + \eta \, \mathbf{z}^p \cdot \mathbf{z}^p = \tilde{\mathbf{w}} \cdot \mathbf{z}^p + \eta \|\mathbf{z}^p\|^2 > \tilde{\mathbf{w}} \cdot \mathbf{z}^p$
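The following is a minimal Python sketch (not from the slides) of the perceptron learning algorithm just outlined, using the augmented-vector reformulation with $\mathbf{z}^p$; the function name and the toy AND dataset are my own illustration.

```python
import numpy as np

def perceptron_learn(X, y, eta=0.1, max_epochs=1000):
    """Perceptron learning with augmented vectors.

    X : (P, n) array of 0/1 patterns; y : (P,) array of 0/1 targets.
    Returns the augmented weight vector w~ = (theta, w_1, ..., w_n).
    """
    P, n = X.shape
    # Augment each pattern with x0 = -1 so w~[0] plays the role of theta.
    X_aug = np.hstack([-np.ones((P, 1)), X])
    # Reformulate as positive examples: z^p = x~^p if y^p = 1, else -x~^p.
    Z = np.where(y[:, None] == 1, X_aug, -X_aug)
    rng = np.random.default_rng(0)
    w = rng.normal(size=n + 1)        # 1. initialize weights randomly
    for _ in range(max_epochs):       # 2. until all patterns correct
        errors = 0
        for z in Z:
            if w @ z <= 0:            # misclassified: move w toward z
                w += eta * z
                errors += 1
        if errors == 0:
            break
    return w

# Toy example (my own): learn AND, which is linearly separable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w = perceptron_learn(X, y)
theta, weights = w[0], w[1:]
print((X @ weights > theta).astype(int))   # expected: [0 0 0 1]
```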
Perceptron Learning Theorem
• If there is a set of weights that will solve the problem,
• then the PLA will eventually find it
• (for a sufficiently small learning rate)
• Note: this only applies if the positive & negative examples are linearly separable

NetLogo Simulation of Perceptron Learning
Run Perceptron-Geometry.nlogo

Classification Power of Multilayer Perceptrons
• Perceptrons can function as logic gates
• Therefore an MLP can form intersections, unions, and differences of linearly separable regions
• Classes can be arbitrary hyperpolyhedra
• Minsky & Papert's criticism of perceptrons
• At the time, no one had succeeded in developing a learning algorithm for MLPs

Hyperpolyhedral Classes
[Figure: class regions formed as hyperpolyhedra]

Credit Assignment Problem
[Figure: feedforward network with input layer, hidden layers, and output layer; only the desired output is given]
How do we adjust the weights of the hidden layers?

NetLogo Demonstration of Back-Propagation Learning
Run Artificial Neural Net.nlogo

Adaptive System
[Figure: a system S with control parameters $P_1, \dots, P_k, \dots, P_m$ and an evaluation function F (fitness, figure of merit); a control algorithm C adjusts the parameters]

Gradient
$\partial F / \partial P_k$ measures how F is altered by variation of $P_k$
$\nabla F = \left( \frac{\partial F}{\partial P_1}, \dots, \frac{\partial F}{\partial P_k}, \dots, \frac{\partial F}{\partial P_m} \right)^T$
$\nabla F$ points in the direction of maximum local increase in F

Gradient Ascent on Fitness Surface
[Figure: fitness surface with a gradient-ascent path climbing from the – region toward the + peak]

Gradient Ascent by Discrete Steps
[Figure: the same ascent taken in discrete steps]

Gradient Ascent is Local But Not Shortest
[Figure: the gradient-ascent path is locally steepest but not the shortest route to the peak]

Gradient Ascent Process
$\dot{\mathbf{P}} = \eta \nabla F(\mathbf{P})$
Change in fitness:
$\dot{F} = \frac{dF}{dt} = \sum_{k=1}^{m} \frac{\partial F}{\partial P_k} \frac{dP_k}{dt} = \sum_{k=1}^{m} (\nabla F)_k \dot{P}_k = \nabla F \cdot \dot{\mathbf{P}}$
$\dot{F} = \nabla F \cdot \eta \nabla F = \eta \|\nabla F\|^2 \ge 0$
Therefore gradient ascent increases fitness (until it reaches a point of zero gradient).

General Ascent in Fitness
Note that any adaptive process $\mathbf{P}(t)$ will increase fitness provided:
$0 < \dot{F} = \nabla F \cdot \dot{\mathbf{P}} = \|\nabla F\| \, \|\dot{\mathbf{P}}\| \cos\varphi$
where $\varphi$ is the angle between $\nabla F$ and $\dot{\mathbf{P}}$. Hence we need $\cos\varphi > 0$, i.e., $\varphi < 90°$.

General Ascent on Fitness Surface
[Figure: any path whose direction stays within 90° of $\nabla F$ climbs the fitness surface]
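As an illustration (not from the slides), here is a minimal sketch of gradient ascent by discrete steps, $\mathbf{P} \leftarrow \mathbf{P} + \eta \nabla F(\mathbf{P})$; the quadratic fitness surface and all names are my own assumptions.

```python
import numpy as np

def gradient_ascent(grad_F, P0, eta=0.1, steps=100):
    """Discrete-step gradient ascent: P <- P + eta * grad F(P)."""
    P = np.asarray(P0, dtype=float)
    for _ in range(steps):
        P = P + eta * grad_F(P)
    return P

# Hypothetical fitness surface (my own choice): F(P) = -||P - P*||^2,
# maximized at P* = (1, -2); its gradient is -2(P - P*).
P_star = np.array([1.0, -2.0])
grad_F = lambda P: -2.0 * (P - P_star)
print(gradient_ascent(grad_F, P0=[0.0, 0.0]))  # approaches [1, -2]
```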
Fitness as Minimum Error
Suppose for Q different inputs we have target outputs $\mathbf{t}^1, \dots, \mathbf{t}^Q$.
Suppose for parameters $\mathbf{P}$ the corresponding actual outputs are $\mathbf{y}^1, \dots, \mathbf{y}^Q$.
Suppose $D(\mathbf{t}, \mathbf{y}) \in [0, \infty)$ measures the difference between target & actual outputs.
Let $E^q = D(\mathbf{t}^q, \mathbf{y}^q)$ be the error on the qth sample.
Let $F(\mathbf{P}) = -\sum_{q=1}^{Q} E^q(\mathbf{P}) = -\sum_{q=1}^{Q} D[\mathbf{t}^q, \mathbf{y}^q(\mathbf{P})]$.

Gradient of Fitness
$\nabla F = \nabla \left( -\sum_q E^q \right) = -\sum_q \nabla E^q$
$\frac{\partial E^q}{\partial P_k} = \frac{\partial}{\partial P_k} D(\mathbf{t}^q, \mathbf{y}^q) = \sum_j \frac{\partial D(\mathbf{t}^q, \mathbf{y}^q)}{\partial y_j^q} \frac{\partial y_j^q}{\partial P_k} = \frac{d D(\mathbf{t}^q, \mathbf{y}^q)}{d \mathbf{y}^q} \cdot \frac{\partial \mathbf{y}^q}{\partial P_k} = \nabla_{\mathbf{y}^q} D(\mathbf{t}^q, \mathbf{y}^q) \cdot \frac{\partial \mathbf{y}^q}{\partial P_k}$

Jacobian Matrix
Define the Jacobian matrix
$J^q = \begin{pmatrix} \partial y_1^q / \partial P_1 & \cdots & \partial y_1^q / \partial P_m \\ \vdots & \ddots & \vdots \\ \partial y_n^q / \partial P_1 & \cdots & \partial y_n^q / \partial P_m \end{pmatrix}$
Note $J^q \in \mathbb{R}^{n \times m}$ and $\nabla D(\mathbf{t}^q, \mathbf{y}^q) \in \mathbb{R}^{n \times 1}$.
Since $(\nabla E^q)_k = \frac{\partial E^q}{\partial P_k} = \sum_j \frac{\partial y_j^q}{\partial P_k} \frac{\partial D(\mathbf{t}^q, \mathbf{y}^q)}{\partial y_j^q}$,
$\therefore \nabla E^q = (J^q)^T \, \nabla D(\mathbf{t}^q, \mathbf{y}^q)$

Derivative of Squared Euclidean Distance
Suppose $D(\mathbf{t}, \mathbf{y}) = \|\mathbf{t} - \mathbf{y}\|^2 = \sum_i (t_i - y_i)^2$
$\frac{\partial D(\mathbf{t}, \mathbf{y})}{\partial y_j} = \frac{\partial}{\partial y_j} \sum_i (t_i - y_i)^2 = \frac{d (t_j - y_j)^2}{d y_j} = -2 (t_j - y_j)$
$\therefore \frac{d D(\mathbf{t}, \mathbf{y})}{d \mathbf{y}} = 2 (\mathbf{y} - \mathbf{t})$

Gradient of Error on qth Input
$\frac{\partial E^q}{\partial P_k} = \frac{d D(\mathbf{t}^q, \mathbf{y}^q)}{d \mathbf{y}^q} \cdot \frac{\partial \mathbf{y}^q}{\partial P_k} = 2 (\mathbf{y}^q - \mathbf{t}^q) \cdot \frac{\partial \mathbf{y}^q}{\partial P_k} = 2 \sum_j (y_j^q - t_j^q) \frac{\partial y_j^q}{\partial P_k}$
$\nabla E^q = 2 (J^q)^T (\mathbf{y}^q - \mathbf{t}^q)$

Recap
$\dot{\mathbf{P}} = \eta \sum_q (J^q)^T (\mathbf{t}^q - \mathbf{y}^q)$
To know how to decrease the differences between actual & desired outputs, we need to know the elements of the Jacobian, $\partial y_j^q / \partial P_k$, which says how the jth output varies with the kth parameter (given the qth input). The Jacobian depends on the specific form of the system, in this case a feedforward neural network.

Multilayer Notation
[Figure: the input $\mathbf{x}^q = \mathbf{s}^1$ passes through weight matrices $W^1, W^2, \dots, W^{L-1}$ and layer outputs $\mathbf{s}^2, \dots, \mathbf{s}^{L-1}$ to the output $\mathbf{s}^L = \mathbf{y}^q$]

Notation
• L layers of neurons labeled 1, …, L
• $N_l$ neurons in layer l
• $\mathbf{s}^l$ = vector of outputs from neurons in layer l
• input layer: $\mathbf{s}^1 = \mathbf{x}^q$ (the input pattern)
• output layer: $\mathbf{s}^L = \mathbf{y}^q$ (the actual output)
• $W^l$ = weights between layers l and l+1
• Problem: find out how the outputs $y_i^q$ vary with the weights $W_{jk}^l$ (l = 1, …, L–1)

Typical Neuron
[Figure: neuron i in layer l receives inputs $s_1^{l-1}, \dots, s_N^{l-1}$ through weights $W_{i1}^{l-1}, \dots, W_{iN}^{l-1}$; their weighted sum is the net input $h_i^l$, and $\sigma$ yields the output $s_i^l$]

Error Back-Propagation
We will compute $\frac{\partial E^q}{\partial W_{ij}^l}$ starting with the last layer ($l = L-1$) and working back to earlier layers ($l = L-2, \dots, 1$).

Delta Values
It is convenient to break the derivatives apart by the chain rule:
$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \frac{\partial E^q}{\partial h_i^l} \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$
Let $\delta_i^l = \frac{\partial E^q}{\partial h_i^l}$. So $\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$.

Output-Layer Neuron
[Figure: output neuron i receives $s_j^{L-1}$ through weights $W_{ij}^{L-1}$; its output $s_i^L = y_i^q$ is compared with the target $t_i^q$ to give the error $E^q$]

Output-Layer Derivatives (1)
$\delta_i^L = \frac{\partial E^q}{\partial h_i^L} = \frac{\partial}{\partial h_i^L} \sum_k (s_k^L - t_k^q)^2 = \frac{d (s_i^L - t_i^q)^2}{d h_i^L} = 2 (s_i^L - t_i^q) \frac{d s_i^L}{d h_i^L} = 2 (s_i^L - t_i^q) \, \sigma'(h_i^L)$

Output-Layer Derivatives (2)
$\frac{\partial h_i^L}{\partial W_{ij}^{L-1}} = \frac{\partial}{\partial W_{ij}^{L-1}} \sum_k W_{ik}^{L-1} s_k^{L-1} = s_j^{L-1}$
$\therefore \frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L s_j^{L-1}$, where $\delta_i^L = 2 (s_i^L - t_i^q) \, \sigma'(h_i^L)$

Hidden-Layer Neuron
[Figure: hidden neuron i in layer l receives $s_j^{l-1}$ through weights $W_{ij}^{l-1}$ and sends its output $s_i^l$ through weights $W_{ki}^l$ to the neurons k of layer l+1, which feed into the error $E^q$]

Hidden-Layer Derivatives (1)
Recall $\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$
$\delta_i^l = \frac{\partial E^q}{\partial h_i^l} = \sum_k \frac{\partial E^q}{\partial h_k^{l+1}} \frac{\partial h_k^{l+1}}{\partial h_i^l} = \sum_k \delta_k^{l+1} \frac{\partial h_k^{l+1}}{\partial h_i^l}$
$\frac{\partial h_k^{l+1}}{\partial h_i^l} = \frac{\partial \sum_m W_{km}^l s_m^l}{\partial h_i^l} = W_{ki}^l \frac{\partial s_i^l}{\partial h_i^l} = W_{ki}^l \frac{d \sigma(h_i^l)}{d h_i^l} = W_{ki}^l \, \sigma'(h_i^l)$
$\therefore \delta_i^l = \sum_k \delta_k^{l+1} W_{ki}^l \, \sigma'(h_i^l) = \sigma'(h_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$

Hidden-Layer Derivatives (2)
$\frac{\partial h_i^l}{\partial W_{ij}^{l-1}} = \frac{\partial}{\partial W_{ij}^{l-1}} \sum_k W_{ik}^{l-1} s_k^{l-1} = s_j^{l-1}$
$\therefore \frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l s_j^{l-1}$, where $\delta_i^l = \sigma'(h_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$

Derivative of Sigmoid
Suppose $s = \sigma(h) = \dfrac{1}{1 + \exp(-\alpha h)}$ (logistic sigmoid)
$D_h s = D_h [1 + \exp(-\alpha h)]^{-1} = -[1 + \exp(-\alpha h)]^{-2} D_h (1 + e^{-\alpha h}) = -(1 + e^{-\alpha h})^{-2} (-\alpha e^{-\alpha h})$
$= \alpha \dfrac{e^{-\alpha h}}{(1 + e^{-\alpha h})^2} = \alpha \dfrac{1}{1 + e^{-\alpha h}} \dfrac{e^{-\alpha h}}{1 + e^{-\alpha h}} = \alpha s \left( \dfrac{1 + e^{-\alpha h}}{1 + e^{-\alpha h}} - \dfrac{1}{1 + e^{-\alpha h}} \right) = \alpha s (1 - s)$

Summary of Back-Propagation Algorithm
Output layer: $\delta_i^L = 2 \alpha s_i^L (1 - s_i^L)(s_i^L - t_i^q), \qquad \frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L s_j^{L-1}$
Hidden layers: $\delta_i^l = \alpha s_i^l (1 - s_i^l) \sum_k \delta_k^{l+1} W_{ki}^l, \qquad \frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l s_j^{l-1}$
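Here is a compact Python sketch of the summarized algorithm for a small fully connected network with logistic activations, using the delta recursions just derived. The layer sizes, training data, and all names are illustrative assumptions, and bias/threshold weights are omitted to keep the sketch small.

```python
import numpy as np

def sigmoid(h, alpha=1.0):
    """Logistic sigmoid s = 1 / (1 + exp(-alpha*h))."""
    return 1.0 / (1.0 + np.exp(-alpha * h))

def backprop_step(W, x, t, eta=0.1, alpha=1.0):
    """One online BP update for a net with weight matrices W[0..L-2].

    Uses the slides' conventions: squared error E = sum_i (s_i^L - t_i)^2,
    delta^L = 2*alpha*s^L*(1-s^L)*(s^L - t), and
    delta^l = alpha*s^l*(1-s^l) * (W^l)^T delta^(l+1).
    """
    # Forward pass: s[0] = x is the input layer, s[-1] = y the output.
    s = [np.asarray(x, dtype=float)]
    for Wl in W:
        s.append(sigmoid(Wl @ s[-1], alpha))
    # Backward pass: output-layer delta, then hidden-layer deltas.
    delta = 2.0 * alpha * s[-1] * (1.0 - s[-1]) * (s[-1] - t)
    for l in reversed(range(len(W))):
        grad = np.outer(delta, s[l])   # dE/dW_ij = delta_i * s_j (prev layer)
        delta = alpha * s[l] * (1.0 - s[l]) * (W[l].T @ delta)
        W[l] = W[l] - eta * grad       # descend the error gradient
    return W, s[-1]

# Illustrative use (my own toy data): a 2-3-1 network on one training pair.
rng = np.random.default_rng(1)
W = [rng.normal(scale=0.5, size=(3, 2)), rng.normal(scale=0.5, size=(1, 3))]
for _ in range(2000):
    W, y = backprop_step(W, x=[0.0, 1.0], t=np.array([1.0]))
print(y)   # the output approaches the target 1.0
```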
Output-Layer Computation
$\Delta W_{ij}^{L-1} = \eta \delta_i^L s_j^{L-1}$, where $\delta_i^L = 2 \alpha s_i^L (1 - s_i^L)(t_i^q - s_i^L)$
(The $t_i^q - s_i^L$ factor is sign-flipped relative to the summary because the weight change descends the error gradient: $\Delta W_{ij}^{L-1} = -\eta \, \partial E^q / \partial W_{ij}^{L-1}$.)
[Figure: dataflow diagram of the output-layer update]

Hidden-Layer Computation
$\Delta W_{ij}^{l-1} = \eta \delta_i^l s_j^{l-1}$, where $\delta_i^l = \alpha s_i^l (1 - s_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$
[Figure: dataflow diagram of the hidden-layer update]

Training Procedures
• Batch learning
  – on each epoch (pass through all the training pairs),
  – weight changes for all patterns are accumulated
  – weight matrices are updated at the end of the epoch
  – accurate computation of the gradient
• Online learning
  – weights are updated after back-propagation of each training pair
  – usually randomize the order for each epoch
  – approximation of the gradient
• In practice it doesn't make much difference

Summation of Error Surfaces
[Figure: the total error surface E is the sum of the per-sample surfaces E¹ and E²]

Gradient Computation in Batch Learning
[Figure: batch learning descends the summed surface E directly]

Gradient Computation in Online Learning
[Figure: online learning alternates steps down the individual surfaces E¹ and E²]

Testing Generalization
[Figure: the available data from the domain are split into training data and test data]

Problem of Rote Learning
[Figure: error on the training data keeps decreasing with each epoch, while error on the test data eventually rises again; stop training where the test error is minimal]

Improving Generalization
[Figure: the available data are split into training data, validation data, and test data]

A Few Random Tips
• Too few neurons and the ANN may not be able to decrease the error enough
• Too many neurons can lead to rote learning
• Preprocess data to:
  – standardize
  – eliminate irrelevant information
  – capture invariances
  – keep relevant information
• If stuck in a local minimum, restart with different random weights

Run Example BP Learning

Beyond Back-Propagation
• Adaptive learning rate
• Adaptive architecture
  – Add/delete hidden neurons
  – Add/delete hidden layers
• Radial basis function networks
• Recurrent BP
• Etc., etc., etc.

Deep Belief Networks
• Inspired by hierarchical representations in mammalian sensory systems
• Use "deep" (multilayer) feed-forward nets
• Layers self-organize to represent the input at progressively more abstract, task-relevant levels
• Supervised training (e.g., BP) can be used to tune network performance
• Each layer is a Restricted Boltzmann Machine

Restricted Boltzmann Machine
• Goal: hidden units become a model of the input domain
• Should capture the statistics of the input
• Evaluate by testing its ability to reproduce input statistics
• Change weights to decrease the difference
[Figure: bipartite graph of visible and hidden units (fig. from Wikipedia)]

Unsupervised RBM Learning
• Stochastic binary units; assume bias units $x_0 = y_0 = 1$
• Set $y_i$ with probability $\sigma\left( \sum_j W_{ij} x_j \right)$
• Set $x'_j$ with probability $\sigma\left( \sum_i W_{ij} y_i \right)$
• Set $y'_i$ with probability $\sigma\left( \sum_j W_{ij} x'_j \right)$
• After several cycles of sampling, update the weights based on the statistics: $\Delta W_{ij} = \eta \left( \langle y_i x_j \rangle - \langle y'_i x'_j \rangle \right)$

Training a DBN Network
• Present inputs and do RBM learning with the first hidden layer to develop a model
• When converged, do RBM learning between the first and second hidden layers to develop a higher-level model
• Continue until all weight layers are trained
• May further train with BP or other supervised learning algorithms
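A minimal sketch of the RBM sampling-and-update cycle above, in its one-step ("contrastive divergence") form; the batch handling, data, and all names are my own assumptions, and the bias units are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda h: 1.0 / (1.0 + np.exp(-h))
sample = lambda p: (rng.random(p.shape) < p).astype(float)

def rbm_update(W, X, eta=0.05):
    """One sampling cycle of the slide's RBM update (bias units omitted).

    W : (n_hidden, n_visible) weights, so y_i ~ sigma(sum_j W_ij x_j).
    X : (batch, n_visible) batch of binary input vectors.
    """
    y = sample(sigmoid(X @ W.T))              # sample hidden units from data
    x_recon = sample(sigmoid(y @ W))          # reconstruct visible units x'
    y_recon = sample(sigmoid(x_recon @ W.T))  # resample hidden units y'
    # Delta W_ij = eta * (<y_i x_j> - <y'_i x'_j>), averaged over the batch.
    W += eta * (y.T @ X - y_recon.T @ x_recon) / len(X)
    return W

# Illustrative data (my own): two repeated binary patterns.
X = np.array([[1, 1, 0, 0], [0, 0, 1, 1]] * 50, dtype=float)
W = rng.normal(scale=0.1, size=(2, 4))        # 2 hidden, 4 visible units
for _ in range(500):
    W = rbm_update(W, X)
```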
What is the Power of Artificial Neural Networks?
• With respect to Turing machines?
• As function approximators?

Can ANNs Exceed the "Turing Limit"?
• There are many results, which depend sensitively on the assumptions; for example:
• Finite NNs with real-valued weights have super-Turing power (Siegelmann & Sontag '94)
• Recurrent nets with Gaussian noise have sub-Turing power (Maass & Sontag '99)
• Finite recurrent nets with real weights can recognize all languages, and thus are super-Turing (Siegelmann '99)
• Stochastic nets with rational weights have super-Turing power (but only P/poly, BPP/log*) (Siegelmann '99)
• But computing classes of functions is not a very relevant way to evaluate the capabilities of neural computation

A Universal Approximation Theorem
Suppose $f$ is a continuous function on $[0,1]^n$ and $\sigma$ is a nonconstant, bounded, monotone-increasing real function on $\mathbb{R}$. For any $\varepsilon > 0$, there is an $m$ such that $\exists \mathbf{a} \in \mathbb{R}^m, \mathbf{b} \in \mathbb{R}^m, W \in \mathbb{R}^{m \times n}$ such that if
$F(x_1, \dots, x_n) = \sum_{i=1}^{m} a_i \, \sigma\left( \sum_{j=1}^{n} W_{ij} x_j + b_i \right)$
[i.e., $F(\mathbf{x}) = \mathbf{a} \cdot \sigma(W\mathbf{x} + \mathbf{b})$]
then $|F(\mathbf{x}) - f(\mathbf{x})| < \varepsilon$ for all $\mathbf{x} \in [0,1]^n$
(see, e.g., Haykin, Neural Networks, 2nd ed., pp. 208–209)

One Hidden Layer is Sufficient
• Conclusion: one hidden layer is sufficient to approximate any continuous function arbitrarily closely (see the numerical sketch at the end of this part)
[Figure: network with inputs $x_1, \dots, x_n$, a single hidden layer of $m$ sigmoidal units with biases $b_i$, and a linear output unit with weights $a_1, \dots, a_m$]

The Golden Rule of Neural Nets
Neural networks are the second-best way to do everything!
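To close, a small numerical illustration of the approximation form $F(\mathbf{x}) = \mathbf{a} \cdot \sigma(W\mathbf{x} + \mathbf{b})$ for $n = 1$. As a simplification of my own (not the theorem's construction), the hidden weights and biases are fixed at random and only the output weights $\mathbf{a}$ are fitted by least squares; the maximum error typically shrinks as $m$ grows.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = lambda h: 1.0 / (1.0 + np.exp(-h))  # bounded, monotone, nonconstant

f = lambda x: np.sin(2 * np.pi * x)         # a continuous target on [0, 1]
x = np.linspace(0.0, 1.0, 200)

for m in (5, 20, 80):                       # hidden-layer sizes
    W = rng.normal(scale=10.0, size=(m, 1)) # random hidden weights
    b = rng.normal(scale=10.0, size=m)      # random hidden biases
    H = sigma(x[:, None] * W.T + b)         # (200, m) hidden-unit outputs
    a, *_ = np.linalg.lstsq(H, f(x), rcond=None)  # fit output weights
    err = np.max(np.abs(H @ a - f(x)))
    print(f"m = {m:3d}: max |F(x) - f(x)| = {err:.4f}")
```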