Part 4A: Neural Network Learning (3/23/16)

IV. Neural Networks and Learning

Supervised Learning

• Produce desired outputs for training inputs
• Generalize reasonably & appropriately to other inputs
• Good example: pattern recognition

A. Artificial Neural Net Learning

• Feedforward multilayer networks

Feedforward Network

[Figure: feedforward network with an input layer, hidden layers, and an output layer]

Typical Artificial Neuron

[Figure: inputs arrive through connection weights; a linear combination of them forms the net input (local field), which is offset by a threshold and passed through an activation function to produce the output]

Equations

Net input:
$$h_i = \left( \sum_{j=1}^{n} w_{ij} s_j \right) - \theta, \qquad \mathbf{h} = \mathbf{W}\mathbf{s} - \boldsymbol{\theta}$$

Neuron output:
$$s'_i = \sigma(h_i), \qquad \mathbf{s}' = \sigma(\mathbf{h})$$

Single-Layer Perceptron

[Figure: inputs $x_1, \dots, x_n$ with weights $w_1, \dots, w_n$ feed a summation $\Sigma$ with threshold $\theta$; the net input $h$ passes through the step function $\Theta$ to give the output $y$]

Single-Layer Perceptron Equations

Binary threshold activation function:
$$\sigma(h) = \Theta(h) = \begin{cases} 1, & \text{if } h > 0 \\ 0, & \text{if } h \le 0 \end{cases}$$

Hence,
$$y = \begin{cases} 1, & \text{if } \sum_j w_j x_j > \theta \\ 0, & \text{otherwise} \end{cases} \;=\; \begin{cases} 1, & \text{if } \mathbf{w}\cdot\mathbf{x} > \theta \\ 0, & \text{if } \mathbf{w}\cdot\mathbf{x} \le \theta \end{cases}$$

2D Weight Vector

$$\mathbf{w}\cdot\mathbf{x} = \lVert\mathbf{w}\rVert\,\lVert\mathbf{x}\rVert \cos\varphi$$

Let $v = \lVert\mathbf{x}\rVert \cos\varphi$ be the length of the projection of $\mathbf{x}$ on $\mathbf{w}$. Then
$$\mathbf{w}\cdot\mathbf{x} = \lVert\mathbf{w}\rVert\, v, \qquad \mathbf{w}\cdot\mathbf{x} > \theta \iff v > \frac{\theta}{\lVert\mathbf{w}\rVert}$$

[Figure: the line $\mathbf{w}\cdot\mathbf{x} = \theta$ divides the plane; points whose projection on $\mathbf{w}$ exceeds $\theta/\lVert\mathbf{w}\rVert$ lie on the positive side]

N-Dimensional Weight Vector

[Figure: in $n$ dimensions, $\mathbf{w}$ is the normal vector of a separating hyperplane]

Goal of Perceptron Learning

• Suppose we have training patterns $\mathbf{x}^1, \mathbf{x}^2, \dots, \mathbf{x}^P$ with corresponding desired outputs $y^1, y^2, \dots, y^P$,
• where $\mathbf{x}^p \in \{0,1\}^n$ and $y^p \in \{0,1\}$.
• We want to find $\mathbf{w}, \theta$ such that $y^p = \Theta(\mathbf{w}\cdot\mathbf{x}^p - \theta)$ for $p = 1, \dots, P$.

Treating Threshold as Weight

$$h = \left( \sum_{j=1}^n w_j x_j \right) - \theta = -\theta + \sum_{j=1}^n w_j x_j$$

Let $x_0 = -1$ and $w_0 = \theta$. Then
$$h = w_0 x_0 + \sum_{j=1}^n w_j x_j = \sum_{j=0}^n w_j x_j = \tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}$$

Augmented Vectors

$$\tilde{\mathbf{w}} = \begin{pmatrix} \theta \\ w_1 \\ \vdots \\ w_n \end{pmatrix}, \qquad \tilde{\mathbf{x}}^p = \begin{pmatrix} -1 \\ x_1^p \\ \vdots \\ x_n^p \end{pmatrix}$$

We want $y^p = \Theta(\tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p)$ for $p = 1, \dots, P$.

Reformulation as Positive Examples

We have positive ($y^p = 1$) and negative ($y^p = 0$) examples.

Want $\tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p > 0$ for positive examples, $\tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p \le 0$ for negative examples.

Let $\mathbf{z}^p = \tilde{\mathbf{x}}^p$ for positive examples and $\mathbf{z}^p = -\tilde{\mathbf{x}}^p$ for negative examples.

Want $\tilde{\mathbf{w}} \cdot \mathbf{z}^p \ge 0$ for $p = 1, \dots, P$: a hyperplane through the origin with all $\mathbf{z}^p$ on one side.

Adjustment of Weight Vector

[Figure: example vectors $z^1, \dots, z^{10}$; the weight vector must be rotated until all of them lie on its positive side]

Outline of Perceptron Learning Algorithm

1. Initialize the weight vector randomly.
2. Until all patterns are classified correctly, do:
   a) for p = 1, …, P do:
      1) if $\mathbf{z}^p$ is classified correctly, do nothing;
      2) else adjust the weight vector to be closer to correct classification.

Weight Adjustment

$$\tilde{\mathbf{w}}' = \tilde{\mathbf{w}} + \eta\,\mathbf{z}^p, \qquad \tilde{\mathbf{w}}'' = \tilde{\mathbf{w}}' + \eta\,\mathbf{z}^p, \quad \dots$$

[Figure: each update adds $\eta\mathbf{z}^p$, rotating the weight vector toward $\mathbf{z}^p$]
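The outline above maps directly onto code. Here is a minimal sketch in Python/NumPy (the function name, defaults, and the AND example are my additions, not the slides'): each pattern is augmented with a leading −1, negative examples are negated to form the $\mathbf{z}^p$, and $\eta\mathbf{z}^p$ is added whenever $\tilde{\mathbf{w}}\cdot\mathbf{z}^p$ fails to be positive.

```python
import numpy as np

def perceptron_learn(X, y, eta=0.1, max_epochs=1000, seed=0):
    """Perceptron learning on augmented vectors.

    X: (P, n) array of input patterns x^p; y: (P,) desired outputs in {0, 1}.
    Returns the augmented weight vector w~ = (theta, w_1, ..., w_n).
    """
    rng = np.random.default_rng(seed)
    P, n = X.shape
    X_aug = np.hstack([-np.ones((P, 1)), X])       # x~^p = (-1, x^p)
    Z = np.where(y[:, None] == 1, X_aug, -X_aug)   # z^p = +/- x~^p
    w = rng.normal(size=n + 1)                     # 1. initialize randomly
    for _ in range(max_epochs):                    # 2. until all classified correctly
        errors = 0
        for zp in Z:                               # a) for p = 1, ..., P
            if w @ zp <= 0:                        # misclassified (want w~ . z^p > 0)
                w += eta * zp                      # move w~ toward z^p
                errors += 1
        if errors == 0:
            break
    return w

# AND is linearly separable, so convergence is guaranteed by the theorem below
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w = perceptron_learn(X, y)
print((np.hstack([-np.ones((4, 1)), X]) @ w > 0).astype(int))   # -> [0 0 0 1]
```

Note that the loop treats $\tilde{\mathbf{w}}\cdot\mathbf{z}^p = 0$ as an error even for negative examples; that is slightly stricter than the $\ge 0$ condition above but keeps the test uniform.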
Improvement in Performance

$$\tilde{\mathbf{w}}' \cdot \mathbf{z}^p = (\tilde{\mathbf{w}} + \eta\,\mathbf{z}^p)\cdot\mathbf{z}^p = \tilde{\mathbf{w}}\cdot\mathbf{z}^p + \eta\,\lVert\mathbf{z}^p\rVert^2 > \tilde{\mathbf{w}}\cdot\mathbf{z}^p$$

Perceptron Learning Theorem

• If there is a set of weights that will solve the problem,
• then the PLA will eventually find it
• (for a sufficiently small learning rate).
• Note: this applies only if the positive & negative examples are linearly separable.

NetLogo Simulation of Perceptron Learning

Run Perceptron-Geometry.nlogo

Classification Power of Multilayer Perceptrons

• Perceptrons can function as logic gates
• Therefore an MLP can form intersections, unions, and differences of linearly-separable regions
• Classes can be arbitrary hyperpolyhedra
• Minsky & Papert's criticism of perceptrons carried force because no one had yet succeeded in developing an MLP learning algorithm

Hyperpolyhedral Classes

[Figure: a class region formed from intersections and unions of half-spaces, i.e., a hyperpolyhedron]

Credit Assignment Problem

[Figure: feedforward net with input layer, hidden layers, output layer, and desired output]

How do we adjust the weights of the hidden layers?

NetLogo Demonstration of Back-Propagation Learning

Run Artificial Neural Net.nlogo

Adaptive System

[Figure: a system S with control parameters $P_1, \dots, P_k, \dots, P_m$, an evaluation function F (fitness, figure of merit), and a control algorithm C that adjusts the parameters]

Gradient

$\partial F / \partial P_k$ measures how F is altered by variation of $P_k$:

$$\nabla F = \begin{pmatrix} \partial F / \partial P_1 \\ \vdots \\ \partial F / \partial P_k \\ \vdots \\ \partial F / \partial P_m \end{pmatrix}$$

$\nabla F$ points in the direction of maximum local increase in F.

Gradient Ascent on Fitness Surface

[Figure: fitness surface with $\nabla F$ arrows pointing uphill toward a maximum]

Gradient Ascent by Discrete Steps

[Figure: a path of discrete steps climbing the fitness surface]

Gradient Ascent is Local But Not Shortest

[Figure: the gradient path curves with the surface; it is not the straight-line route to the maximum]

Gradient Ascent Process

$$\dot{\mathbf{P}} = \eta\, \nabla F(\mathbf{P})$$

Change in fitness:
$$\dot{F} = \frac{dF}{dt} = \sum_{k=1}^m \frac{\partial F}{\partial P_k}\frac{dP_k}{dt} = \sum_{k=1}^m (\nabla F)_k \dot{P}_k = \nabla F \cdot \dot{\mathbf{P}}$$

$$\dot{F} = \nabla F \cdot \eta \nabla F = \eta\,\lVert\nabla F\rVert^2 \ge 0$$

Therefore gradient ascent increases fitness (until it reaches a zero gradient).

General Ascent in Fitness

Note that any adaptive process $\mathbf{P}(t)$ will increase fitness provided:
$$0 < \dot{F} = \nabla F \cdot \dot{\mathbf{P}} = \lVert\nabla F\rVert\,\lVert\dot{\mathbf{P}}\rVert \cos\varphi,$$
where $\varphi$ is the angle between $\nabla F$ and $\dot{\mathbf{P}}$. Hence we need $\cos\varphi > 0$, i.e., $\varphi < 90^\circ$.

General Ascent on Fitness Surface

[Figure: any step within 90° of the gradient direction still climbs the surface]
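A small numeric illustration of ascent by discrete steps, $\mathbf{P} \leftarrow \mathbf{P} + \eta\,\nabla F(\mathbf{P})$ (a minimal sketch; the quadratic fitness surface and the finite-difference gradient estimate are assumptions for the demo, not from the slides):

```python
import numpy as np

def grad(F, P, eps=1e-6):
    """Central-difference estimate of the gradient of F at parameter vector P."""
    g = np.zeros_like(P)
    for k in range(len(P)):
        dP = np.zeros_like(P)
        dP[k] = eps
        g[k] = (F(P + dP) - F(P - dP)) / (2 * eps)   # ~ dF/dP_k
    return g

# A made-up fitness surface whose single maximum is at (1, 2)
F = lambda P: -(P[0] - 1.0)**2 - (P[1] - 2.0)**2

P = np.array([4.0, -3.0])          # initial control parameters
eta = 0.1                          # step size (learning rate)
for _ in range(100):
    P = P + eta * grad(F, P)       # discrete step up the gradient
print(np.round(P, 4))              # -> [1. 2.] (approximately)
```

Since $\dot{F} = \eta\,\lVert\nabla F\rVert^2 \ge 0$, each sufficiently small step increases the fitness until the gradient vanishes, here at the surface's unique maximum.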
Fitness as Minimum Error

Suppose for Q different inputs we have target outputs $\mathbf{t}^1, \dots, \mathbf{t}^Q$.

Suppose for parameters $\mathbf{P}$ the corresponding actual outputs are $\mathbf{y}^1, \dots, \mathbf{y}^Q$.

Suppose $D(\mathbf{t}, \mathbf{y}) \in [0, \infty)$ measures the difference between target & actual outputs.

Let $E^q = D(\mathbf{t}^q, \mathbf{y}^q)$ be the error on the qth sample, and let
$$F(\mathbf{P}) = -\sum_{q=1}^Q E^q(\mathbf{P}) = -\sum_{q=1}^Q D[\mathbf{t}^q, \mathbf{y}^q(\mathbf{P})]$$

Gradient of Fitness

$$\nabla F = \nabla\left(-\sum_q E^q\right) = -\sum_q \nabla E^q$$

$$\frac{\partial E^q}{\partial P_k} = \frac{\partial}{\partial P_k} D(\mathbf{t}^q, \mathbf{y}^q) = \sum_j \frac{\partial D(\mathbf{t}^q,\mathbf{y}^q)}{\partial y_j^q} \frac{\partial y_j^q}{\partial P_k} = \frac{d\, D(\mathbf{t}^q,\mathbf{y}^q)}{d\mathbf{y}^q} \cdot \frac{\partial \mathbf{y}^q}{\partial P_k} = \nabla_{\mathbf{y}^q} D(\mathbf{t}^q,\mathbf{y}^q) \cdot \frac{\partial \mathbf{y}^q}{\partial P_k}$$

Jacobian Matrix

Define the Jacobian matrix
$$\mathbf{J}^q = \begin{pmatrix} \partial y_1^q/\partial P_1 & \cdots & \partial y_1^q/\partial P_m \\ \vdots & \ddots & \vdots \\ \partial y_n^q/\partial P_1 & \cdots & \partial y_n^q/\partial P_m \end{pmatrix}$$

Note $\mathbf{J}^q \in \mathbb{R}^{n\times m}$ and $\nabla D(\mathbf{t}^q,\mathbf{y}^q) \in \mathbb{R}^{n\times 1}$.

Since
$$(\nabla E^q)_k = \frac{\partial E^q}{\partial P_k} = \sum_j \frac{\partial y_j^q}{\partial P_k}\frac{\partial D(\mathbf{t}^q,\mathbf{y}^q)}{\partial y_j^q},$$

$$\therefore\; \nabla E^q = (\mathbf{J}^q)^{\mathsf{T}}\, \nabla D(\mathbf{t}^q, \mathbf{y}^q)$$

Derivative of Squared Euclidean Distance

Suppose $D(\mathbf{t},\mathbf{y}) = \lVert \mathbf{t}-\mathbf{y}\rVert^2 = \sum_i (t_i - y_i)^2$:
$$\frac{\partial D(\mathbf{t},\mathbf{y})}{\partial y_j} = \frac{\partial}{\partial y_j}\sum_i (t_i - y_i)^2 = \frac{d(t_j-y_j)^2}{dy_j} = -2(t_j - y_j) = 2(y_j - t_j)$$

$$\therefore\; \frac{d\,D(\mathbf{t},\mathbf{y})}{d\mathbf{y}} = 2(\mathbf{y} - \mathbf{t})$$

Gradient of Error on qth Input

$$\frac{\partial E^q}{\partial P_k} = \frac{d\,D(\mathbf{t}^q,\mathbf{y}^q)}{d\mathbf{y}^q}\cdot\frac{\partial\mathbf{y}^q}{\partial P_k} = 2(\mathbf{y}^q - \mathbf{t}^q)\cdot\frac{\partial\mathbf{y}^q}{\partial P_k} = 2\sum_j (y_j^q - t_j^q)\frac{\partial y_j^q}{\partial P_k}$$

$$\nabla E^q = 2\, (\mathbf{J}^q)^{\mathsf{T}} (\mathbf{y}^q - \mathbf{t}^q)$$

Recap

$$\dot{\mathbf{P}} = \eta \sum_q (\mathbf{J}^q)^{\mathsf{T}} (\mathbf{t}^q - \mathbf{y}^q)$$

To know how to decrease the differences between actual & desired outputs, we need to know the elements of the Jacobian, $\partial y_j^q / \partial P_k$, which say how the jth output varies with the kth parameter (given the qth input). The Jacobian depends on the specific form of the system; in this case, a feedforward neural network.

Multilayer Notation

[Figure: layer outputs $\mathbf{s}^1, \mathbf{s}^2, \dots, \mathbf{s}^{L-1}, \mathbf{s}^L$ connected by weight matrices $\mathbf{W}^1, \dots, \mathbf{W}^{L-1}$, with $\mathbf{s}^1 = \mathbf{x}^q$ and $\mathbf{s}^L = \mathbf{y}^q$]

Notation

• L layers of neurons, labeled 1, …, L
• $N_l$ neurons in layer l
• $\mathbf{s}^l$ = vector of outputs from the neurons in layer l
• $\mathbf{s}^1 = \mathbf{x}^q$ (the input pattern)
• $\mathbf{s}^L = \mathbf{y}^q$ (the actual output)
• $\mathbf{W}^l$ = weights between layers l and l+1
• Problem: find how the outputs $y_i^q$ vary with the weights $W_{jk}^l$ (l = 1, …, L−1)

Typical Neuron

[Figure: inputs $s_1^{l-1}, \dots, s_N^{l-1}$, weighted by $W_{i1}^{l-1}, \dots, W_{iN}^{l-1}$, are summed into the net input $h_i^l$, which passes through $\sigma$ to give $s_i^l$]

Error Back-Propagation

We will compute $\dfrac{\partial E^q}{\partial W_{ij}^l}$ starting with the last layer ($l = L-1$) and working back to earlier layers ($l = L-2, \dots, 1$).

Delta Values

It is convenient to break the derivatives apart by the chain rule:
$$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \frac{\partial E^q}{\partial h_i^l}\, \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$$

Let $\delta_i^l = \dfrac{\partial E^q}{\partial h_i^l}$. So
$$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l\, \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$$

Output-Layer Neuron

[Figure: output neuron i with inputs $s_j^{L-1}$, weights $W_{ij}^{L-1}$, net input $h_i^L$, and output $s_i^L = y_i^q$, compared with the target $t_i^q$ to give the error $E^q$]

Output-Layer Derivatives (1)

$$\delta_i^L = \frac{\partial E^q}{\partial h_i^L} = \frac{\partial}{\partial h_i^L} \sum_k (s_k^L - t_k^q)^2 = \frac{d\,(s_i^L - t_i^q)^2}{d h_i^L} = 2 (s_i^L - t_i^q)\, \frac{d s_i^L}{d h_i^L} = 2 (s_i^L - t_i^q)\, \sigma'(h_i^L)$$

Output-Layer Derivatives (2)

$$\frac{\partial h_i^L}{\partial W_{ij}^{L-1}} = \frac{\partial}{\partial W_{ij}^{L-1}} \sum_k W_{ik}^{L-1} s_k^{L-1} = s_j^{L-1}$$

$$\therefore\; \frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L\, s_j^{L-1}, \qquad \text{where } \delta_i^L = 2(s_i^L - t_i^q)\,\sigma'(h_i^L)$$

Hidden-Layer Neuron

[Figure: hidden neuron i in layer l receives $s_j^{l-1}$ through $W_{ij}^{l-1}$ and feeds every neuron k in layer l+1 through $W_{ki}^{l}$; all of those paths contribute to $E^q$]

Hidden-Layer Derivatives (1)

Recall
$$\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l\, \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}$$

$$\delta_i^l = \frac{\partial E^q}{\partial h_i^l} = \sum_k \frac{\partial E^q}{\partial h_k^{l+1}}\, \frac{\partial h_k^{l+1}}{\partial h_i^l} = \sum_k \delta_k^{l+1}\, \frac{\partial h_k^{l+1}}{\partial h_i^l}$$

$$\frac{\partial h_k^{l+1}}{\partial h_i^l} = \frac{\partial \sum_m W_{km}^l s_m^l}{\partial h_i^l} = W_{ki}^l\, \frac{\partial s_i^l}{\partial h_i^l} = W_{ki}^l\, \frac{d\sigma(h_i^l)}{d h_i^l} = W_{ki}^l\, \sigma'(h_i^l)$$

$$\therefore\; \delta_i^l = \sum_k \delta_k^{l+1} W_{ki}^l\, \sigma'(h_i^l) = \sigma'(h_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$$

Hidden-Layer Derivatives (2)

$$\frac{\partial h_i^l}{\partial W_{ij}^{l-1}} = \frac{\partial}{\partial W_{ij}^{l-1}} \sum_k W_{ik}^{l-1} s_k^{l-1} = \frac{d\, W_{ij}^{l-1} s_j^{l-1}}{d W_{ij}^{l-1}} = s_j^{l-1}$$

$$\therefore\; \frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l\, s_j^{l-1}, \qquad \text{where } \delta_i^l = \sigma'(h_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$$

Derivative of Sigmoid

Suppose $s = \sigma(h) = \dfrac{1}{1 + \exp(-\alpha h)}$ (logistic sigmoid). Then

$$D_h s = D_h [1 + \exp(-\alpha h)]^{-1} = -[1 + \exp(-\alpha h)]^{-2}\, D_h (1 + e^{-\alpha h}) = -(1 + e^{-\alpha h})^{-2} (-\alpha e^{-\alpha h})$$

$$= \alpha\, \frac{e^{-\alpha h}}{(1 + e^{-\alpha h})^2} = \alpha\, \frac{1}{1+e^{-\alpha h}}\, \frac{e^{-\alpha h}}{1+e^{-\alpha h}} = \alpha s \left( \frac{1 + e^{-\alpha h}}{1+e^{-\alpha h}} - \frac{1}{1+e^{-\alpha h}} \right) = \alpha s (1 - s)$$

Summary of Back-Propagation Algorithm

Output layer:
$$\delta_i^L = 2\alpha\, s_i^L (1 - s_i^L)(s_i^L - t_i^q), \qquad \frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L\, s_j^{L-1}$$

Hidden layers:
$$\delta_i^l = \alpha\, s_i^l (1 - s_i^l) \sum_k \delta_k^{l+1} W_{ki}^l, \qquad \frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l\, s_j^{l-1}$$
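As a concrete illustration, here is the summarized algorithm for one hidden layer, trained online on XOR (a minimal sketch assuming NumPy; the architecture, learning rate, seed, and task are my choices, not the slides'). The weight change is a descent step on the error, $\Delta W = -\eta\,\partial E^q/\partial W$, so the output delta carries $(t - s)$ rather than $(s - t)$, matching the computation slides that follow.

```python
import numpy as np

alpha, eta = 1.0, 0.5
sigma = lambda h: 1.0 / (1.0 + np.exp(-alpha * h))  # logistic; sigma' = alpha*s*(1-s)

rng = np.random.default_rng(1)
W1 = rng.normal(scale=0.5, size=(4, 3))   # input (2 + bias) -> 4 hidden units
W2 = rng.normal(scale=0.5, size=(1, 5))   # hidden (4 + bias) -> 1 output unit

def backprop_step(x, t):
    """One online-learning step: forward pass, deltas, weight update."""
    global W1, W2
    s1 = np.append(x, 1.0)                 # input layer plus bias unit
    s2 = np.append(sigma(W1 @ s1), 1.0)    # hidden layer plus bias unit
    s3 = sigma(W2 @ s2)                    # output layer
    d3 = 2 * alpha * s3 * (1 - s3) * (t - s3)   # output delta
    d2 = alpha * s2 * (1 - s2) * (W2.T @ d3)    # hidden deltas (bias entry is 0)
    W2 += eta * np.outer(d3, s2)           # Delta W_ij = eta * delta_i * s_j
    W1 += eta * np.outer(d2[:-1], s1)
    return float(np.sum((t - s3) ** 2))    # squared error on this pair

# XOR: the classic task a single-layer perceptron cannot represent
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
for epoch in range(10000):   # if it sticks in a local minimum, re-run with a new seed
    for x, t in data:
        backprop_step(np.array(x, float), np.array(t, float))
for x, _ in data:
    s2 = np.append(sigma(W1 @ np.append(x, 1.0)), 1.0)
    print(x, np.round(sigma(W2 @ s2), 2))
```

Batch learning (described below) would instead accumulate these per-pattern updates over an epoch and apply them once at the end.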
Output-Layer Computation

$$\Delta W_{ij}^{L-1} = \eta\, \delta_i^L\, s_j^{L-1}, \qquad \delta_i^L = 2\alpha\, s_i^L (1 - s_i^L)(t_i^q - s_i^L)$$

(The sign flips relative to the derivative above because the weight change is a gradient-descent step on the error: $\Delta W_{ij}^{L-1} = -\eta\, \partial E^q / \partial W_{ij}^{L-1}$.)

[Figure: dataflow diagram of the output-layer update]

Hidden-Layer Computation

$$\Delta W_{ij}^{l-1} = \eta\, \delta_i^l\, s_j^{l-1}, \qquad \delta_i^l = \alpha\, s_i^l (1 - s_i^l) \sum_k \delta_k^{l+1} W_{ki}^l$$

[Figure: dataflow diagram of the hidden-layer update; the deltas $\delta_k^{l+1}$ are propagated back through the weights $W_{ki}^l$]

Training Procedures

• Batch learning:
  – on each epoch (pass through all the training pairs),
  – weight changes for all patterns are accumulated,
  – weight matrices are updated at the end of the epoch;
  – accurate computation of the gradient.
• Online learning:
  – weights are updated after back-propagation of each training pair;
  – usually randomize the order of patterns for each epoch;
  – an approximation of the gradient.
• In practice it doesn't make much difference.

Summation of Error Surfaces

[Figure: the total error surface E is the sum of the per-pattern surfaces $E^1, E^2, \dots$]

Gradient Computation in Batch Learning

[Figure: batch learning descends the gradient of the summed surface E]

Gradient Computation in Online Learning

[Figure: online learning descends the gradient of one per-pattern surface at a time, approximating the batch gradient]

Testing Generalization

[Figure: the available data, drawn from the domain, is split into training data and test data]

Problem of Rote Learning

[Figure: error on the training data keeps decreasing over the epochs, but error on the test data eventually rises; stop training at the minimum of the test error]

Improving Generalization

[Figure: the available data is split into training data, validation data, and test data]

A Few Random Tips

• Too few neurons and the ANN may not be able to decrease the error enough
• Too many neurons can lead to rote learning
• Preprocess data to:
  – standardize
  – eliminate irrelevant information
  – capture invariances
  – keep relevant information
• If stuck in a local minimum, restart with different random weights

Run Example BP Learning

Beyond Back-Propagation

• Adaptive learning rate
• Adaptive architecture
  – add/delete hidden neurons
  – add/delete hidden layers
• Radial basis function networks
• Recurrent BP
• Etc., etc., etc.…

Deep Belief Networks

• Inspired by hierarchical representations in mammalian sensory systems
• Use "deep" (multilayer) feed-forward nets
• Layers self-organize to represent the input at progressively more abstract, task-relevant levels
• Supervised training (e.g., BP) can be used to tune network performance
• Each layer is a Restricted Boltzmann Machine

Restricted Boltzmann Machine

• Goal: the hidden units become a model of the input domain
• They should capture the statistics of the input
• Evaluate the model by testing its ability to reproduce the input statistics
• Change the weights to decrease the difference

[Figure: bipartite RBM graph with visible units $x_j$ and hidden units $y_i$ connected by weights $W_{ij}$, and no within-layer connections (fig. from Wikipedia)]

Unsupervised RBM Learning

• Stochastic binary units; assume bias units $x_0 = y_0 = 1$
• Set $y_i$ with probability $\sigma\!\left(\sum_j W_{ij} x_j\right)$
• Set $x_j$ with probability $\sigma\!\left(\sum_i W_{ij} y_i\right)$
• After several cycles of sampling, update the weights based on the statistics:
$$\Delta W_{ij} = \eta \left( \langle y_i x_j \rangle - \langle y_i' x_j' \rangle \right),$$
where the primed values come from the network's own reconstructions.

Training a DBN

• Present inputs and do RBM learning with the first hidden layer to develop a model
• When it has converged, do RBM learning between the first and second hidden layers to develop a higher-level model
• Continue until all weight layers are trained
• May further train with BP or other supervised learning algorithms
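A minimal sketch of this learning rule for a single RBM layer (NumPy assumed; the two-prototype toy distribution and all names are mine). The slide's statistics gathered "after several cycles of sampling" are approximated here by a single up-down-up pass per pattern, i.e., CD-1 in Hinton's contrastive-divergence terminology, and the bias units $x_0 = y_0 = 1$ are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda h: 1.0 / (1.0 + np.exp(-h))

n_vis, n_hid, eta = 6, 3, 0.05
W = rng.normal(scale=0.1, size=(n_hid, n_vis))    # W_ij couples hidden y_i, visible x_j

def rbm_update(x):
    """One weight update: sample hiddens from data, reconstruct, compare statistics."""
    p_y = sigma(W @ x)                             # P(y_i = 1) = sigma(sum_j W_ij x_j)
    y = (rng.random(n_hid) < p_y).astype(float)
    p_x = sigma(W.T @ y)                           # P(x'_j = 1) = sigma(sum_i W_ij y_i)
    x_r = (rng.random(n_vis) < p_x).astype(float)  # reconstruction x'
    p_y_r = sigma(W @ x_r)                         # hidden probabilities for x'
    # Delta W_ij = eta * ( <y_i x_j>_data - <y'_i x'_j>_reconstruction )
    return eta * (np.outer(p_y, x) - np.outer(p_y_r, x_r))

# Toy input domain: two binary prototypes plus a little bit-flip noise
prototypes = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]], dtype=float)
for step in range(5000):
    x = prototypes[rng.integers(2)].copy()
    flips = rng.random(n_vis) < 0.05
    x[flips] = 1 - x[flips]
    W += rbm_update(x)

# The hidden units should now respond differently to the two prototypes
print(np.round(sigma(W @ prototypes.T).T, 2))
```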
What is the Power of Artificial Neural Networks?

• With respect to Turing machines?
• As function approximators?

Can ANNs Exceed the "Turing Limit"?

• There are many results, which depend sensitively on assumptions; for example:
• Finite NNs with real-valued weights have super-Turing power (Siegelmann & Sontag '94)
• Recurrent nets with Gaussian noise have sub-Turing power (Maass & Sontag '99)
• Finite recurrent nets with real weights can recognize all languages, and thus are super-Turing (Siegelmann '99)
• Stochastic nets with rational weights have super-Turing power (but only P/POLY, BPP/log*) (Siegelmann '99)
• But computing classes of functions is not a very relevant way to evaluate the capabilities of neural computation

A Universal Approximation Theorem

Suppose $f$ is a continuous function on $[0,1]^n$, and suppose $\sigma$ is a nonconstant, bounded, monotone increasing real function on $\mathbb{R}$.

For any $\varepsilon > 0$ there is an $m$ such that there exist $\mathbf{a} \in \mathbb{R}^m$, $\mathbf{b} \in \mathbb{R}^m$, and $\mathbf{W} \in \mathbb{R}^{m \times n}$ such that if

$$F(x_1, \dots, x_n) = \sum_{i=1}^m a_i\, \sigma\!\left( \sum_{j=1}^n W_{ij} x_j + b_i \right) \qquad \left[\text{i.e., } F(\mathbf{x}) = \mathbf{a} \cdot \sigma(\mathbf{W}\mathbf{x} + \mathbf{b})\right],$$

then $|F(\mathbf{x}) - f(\mathbf{x})| < \varepsilon$ for all $\mathbf{x} \in [0,1]^n$.

(See, e.g., Haykin, Neural Networks, 2nd ed., pp. 208–9.)

One Hidden Layer is Sufficient

[Figure: a network with inputs $x_1, \dots, x_n$, one layer of m sigmoidal units with biases $b_i$, and a linear output unit with weights $a_1, \dots, a_m$]

Conclusion: one hidden layer is sufficient to approximate any continuous function arbitrarily closely.

The Golden Rule of Neural Nets

Neural networks are the second-best way to do everything!

(Continued in Part IVB.)