Part 4A: Neural Network Learning

Supervised Learning
• Produce the desired outputs for the training inputs
• Generalize reasonably and appropriately to other inputs
• Good example: pattern recognition
• Here implemented with feedforward multilayer networks

Feedforward Network
[Figure: activity flows from an input layer through one or more hidden layers to an output layer.]

Typical Artificial Neuron
[Figure: inputs are scaled by connection weights and summed; the output is produced by comparing the sum against a threshold.]

Typical Artificial Neuron
[Figure: the linear combination of the inputs gives the net input (local field), which is passed through an activation function.]

Equations
Net input:
h_i = \left( \sum_{j=1}^{n} w_{ij} s_j \right) - \theta, \qquad \mathbf{h} = W\mathbf{s} - \boldsymbol{\theta}
Neuron output:
s'_i = \sigma(h_i), \qquad \mathbf{s}' = \sigma(\mathbf{h})

Single-Layer Perceptron
[Figure: a single layer of threshold units, each connected directly to all the inputs.]

Variables
[Figure: one unit with inputs x_1, …, x_j, …, x_n, weights w_1, …, w_j, …, w_n, summation Σ producing the net input h, threshold function Θ with threshold θ, and output y.]

Single-Layer Perceptron Equations
Binary threshold activation function:
\sigma(h) = \Theta(h) = \begin{cases} 1, & \text{if } h > 0 \\ 0, & \text{if } h \le 0 \end{cases}
Hence
y = \begin{cases} 1, & \text{if } \sum_j w_j x_j > \theta \\ 0, & \text{otherwise} \end{cases} = \begin{cases} 1, & \text{if } \mathbf{w} \cdot \mathbf{x} > \theta \\ 0, & \text{if } \mathbf{w} \cdot \mathbf{x} \le \theta \end{cases}

2D Weight Vector
\mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| \, \|\mathbf{x}\| \cos\varphi
Let v = \|\mathbf{x}\| \cos\varphi be the length of the projection of x onto w, so that \mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| v. Then
\mathbf{w} \cdot \mathbf{x} > \theta \iff \|\mathbf{w}\| v > \theta \iff v > \theta / \|\mathbf{w}\|
[Figure: the decision boundary is the line perpendicular to w at distance θ/‖w‖ from the origin, with the + region on the side that w points toward and the − region on the other.]

N-Dimensional Weight Vector
[Figure: in n dimensions, w is the normal vector of the separating hyperplane between the + and − regions.]

Goal of Perceptron Learning
• Suppose we have training patterns x^1, x^2, …, x^P with corresponding desired outputs y^1, y^2, …, y^P,
• where x^p ∈ {0, 1}^n and y^p ∈ {0, 1}.
• We want to find w, θ such that y^p = \Theta(\mathbf{w} \cdot \mathbf{x}^p - \theta) for p = 1, …, P.

Treating Threshold as Weight
h = \left( \sum_{j=1}^{n} w_j x_j \right) - \theta = -\theta + \sum_{j=1}^{n} w_j x_j
Let x_0 = -1 and w_0 = \theta. Then
h = w_0 x_0 + \sum_{j=1}^{n} w_j x_j = \sum_{j=0}^{n} w_j x_j = \tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}
[Figure: the threshold becomes an ordinary weight w_0 = θ on a constant input x_0 = −1.]

Augmented Vectors
\tilde{\mathbf{w}} = (\theta, w_1, \ldots, w_n)^T, \qquad \tilde{\mathbf{x}}^p = (-1, x_1^p, \ldots, x_n^p)^T
We want y^p = \Theta(\tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p) for p = 1, …, P.

Reformulation as Positive Examples
We have positive (y^p = 1) and negative (y^p = 0) examples.
We want \tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p > 0 for positive examples and \tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}^p \le 0 for negative ones.
Let \mathbf{z}^p = \tilde{\mathbf{x}}^p for positive examples and \mathbf{z}^p = -\tilde{\mathbf{x}}^p for negative examples.
Then we want \tilde{\mathbf{w}} \cdot \mathbf{z}^p \ge 0 for p = 1, …, P:
a hyperplane through the origin with all the z^p on one side.

Adjustment of Weight Vector
[Figure: the examples z^1, …, z^11 as vectors; the weight vector must be rotated until every z^p lies on its positive side.]

Outline of Perceptron Learning Algorithm
1. Initialize the weight vector randomly.
2. Until all patterns are classified correctly, do:
   a) for p = 1, …, P do:
      1) if z^p is classified correctly, do nothing;
      2) else adjust the weight vector to be closer to correct classification.

Weight Adjustment
[Figure: each misclassified z^p pulls the weight vector toward it: w′ = w + ηz^p, w″ = w′ + ηz^p, ….]

Improvement in Performance
\mathbf{w}' \cdot \mathbf{z}^p = (\mathbf{w} + \eta \mathbf{z}^p) \cdot \mathbf{z}^p = \mathbf{w} \cdot \mathbf{z}^p + \eta \, \mathbf{z}^p \cdot \mathbf{z}^p = \mathbf{w} \cdot \mathbf{z}^p + \eta \, \|\mathbf{z}^p\|^2 > \mathbf{w} \cdot \mathbf{z}^p
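A minimal sketch of this learning loop in Python (NumPy), using the augmented-vector reformulation above. The AND dataset, the learning rate eta, and the epoch cap are illustrative assumptions, not part of the original slides.

```python
import numpy as np

def perceptron_learn(X, y, eta=0.1, max_epochs=1000):
    """Perceptron learning on augmented vectors z^p.

    X: (P, n) array of {0,1} patterns; y: (P,) array of {0,1} targets.
    Returns the augmented weight vector w~ = (theta, w_1, ..., w_n).
    """
    P, n = X.shape
    X_aug = np.hstack([-np.ones((P, 1)), X])       # x_0 = -1, so theta = w_0
    Z = np.where(y[:, None] == 1, X_aug, -X_aug)   # z^p = x~^p or -x~^p
    w = np.random.randn(n + 1)                     # 1. random initialization
    for _ in range(max_epochs):                    # 2. repeat until correct
        errors = 0
        for zp in Z:
            if w @ zp <= 0:                        # misclassified:
                w += eta * zp                      # move w toward z^p
                errors += 1
        if errors == 0:
            return w
    return w    # may not terminate if the classes are not linearly separable

# Illustrative use: learn the (linearly separable) AND function.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w = perceptron_learn(X, y)
print((X @ w[1:] > w[0]).astype(int))   # y = Theta(w . x - theta) -> [0 0 0 1]
```

Each update increases w·z^p by η‖z^p‖², which is exactly the improvement computed on the slide above.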
Perceptron Learning Theorem
• If there is a set of weights that will solve the problem,
• then the perceptron learning algorithm will eventually find it
• (for a sufficiently small learning rate).
• Note: this applies only if the positive and negative examples are linearly separable.

NetLogo Simulation of Perceptron Learning
Run Perceptron-Geometry.nlogo

Classification Power of Multilayer Perceptrons
• Perceptrons can function as logic gates
• Therefore a multilayer perceptron (MLP) can form intersections, unions, and differences of linearly separable regions
• Classes can be arbitrary hyperpolyhedra
• Minsky & Papert's criticism of perceptrons
• At the time, no one had succeeded in developing an MLP learning algorithm

Hyperpolyhedral Classes
[Figure: class regions formed as hyperpolyhedra, i.e., intersections and unions of half-spaces.]

Credit Assignment Problem
[Figure: a feedforward network with input layer, hidden layers, and output layer, whose output is compared against a desired output.]
How do we adjust the weights of the hidden layers?

NetLogo Demonstration of Back-Propagation Learning
Run Artificial Neural Net.nlogo

Adaptive System
[Figure: a system S with control parameters P_1, …, P_k, …, P_m; an evaluation function F (fitness, figure of merit); and a control algorithm C that adjusts the parameters.]

Gradient
∂F/∂P_k measures how F is altered by variation of P_k:
\nabla F = \left( \frac{\partial F}{\partial P_1}, \ldots, \frac{\partial F}{\partial P_k}, \ldots, \frac{\partial F}{\partial P_m} \right)^{T}
∇F points in the direction of maximum local increase in F.

Gradient Ascent on Fitness Surface
[Figure: ∇F on a fitness surface; gradient ascent climbs from the − region toward the + peak.]

Gradient Ascent by Discrete Steps
[Figure: a sequence of discrete gradient steps climbing the fitness surface.]

Gradient Ascent is Local But Not Shortest
[Figure: the gradient path is locally steepest but follows a curved route, not the shortest path to the peak.]

Gradient Ascent Process
\dot{\mathbf{P}} = \eta \, \nabla F(\mathbf{P})
Change in fitness:
\dot{F} = \frac{dF}{dt} = \sum_{k=1}^{m} \frac{\partial F}{\partial P_k} \frac{d P_k}{d t} = \sum_{k=1}^{m} (\nabla F)_k \dot{P}_k = \nabla F \cdot \dot{\mathbf{P}}
\dot{F} = \nabla F \cdot \eta \nabla F = \eta \, \|\nabla F\|^2 \ge 0
Therefore gradient ascent increases fitness (until it reaches a zero gradient).

General Ascent in Fitness
Note that any adaptive process P(t) will increase fitness provided:
0 < \dot{F} = \nabla F \cdot \dot{\mathbf{P}} = \|\nabla F\| \, \|\dot{\mathbf{P}}\| \cos\varphi,
where φ is the angle between ∇F and Ṗ. Hence we need cos φ > 0, i.e., φ < 90°.

General Ascent on Fitness Surface
[Figure: any trajectory whose velocity stays within 90° of ∇F climbs the fitness surface.]
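A minimal sketch of gradient ascent by discrete steps in Python (NumPy). The quadratic fitness function, the step size, and the starting point are illustrative assumptions, chosen so the maximum is known in advance.

```python
import numpy as np

def gradient_ascent(grad_F, P0, eta=0.1, steps=100):
    """Discrete-step version of P-dot = eta * grad F(P)."""
    P = np.array(P0, dtype=float)
    for _ in range(steps):
        P += eta * grad_F(P)   # step in the direction of maximum local increase
    return P

# Illustrative fitness: F(P) = -(P_1 - 1)^2 - (P_2 + 2)^2, maximized at (1, -2).
F = lambda P: -(P[0] - 1)**2 - (P[1] + 2)**2
grad_F = lambda P: np.array([-2 * (P[0] - 1), -2 * (P[1] + 2)])

P = gradient_ascent(grad_F, P0=[5.0, 5.0])
print(P, F(P))   # P approaches (1, -2) and F approaches its maximum, 0
```

For small enough η each discrete step increases F, matching Ḟ = η‖∇F‖² ≥ 0 for the continuous process; too large an η can overshoot the peak.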
Fitness as Minimum Error
Suppose for Q different inputs we have target outputs t^1, …, t^Q.
Suppose for parameters P the corresponding actual outputs are y^1, …, y^Q.
Suppose D(t, y) ∈ [0, ∞) measures the difference between target and actual outputs.
Let E^q = D(\mathbf{t}^q, \mathbf{y}^q) be the error on the qth sample, and let
F(\mathbf{P}) = -\sum_{q=1}^{Q} E^q(\mathbf{P}) = -\sum_{q=1}^{Q} D[\mathbf{t}^q, \mathbf{y}^q(\mathbf{P})]

Gradient of Fitness
\nabla F = \nabla \left( -\sum_q E^q \right) = -\sum_q \nabla E^q
\frac{\partial E^q}{\partial P_k} = \frac{\partial}{\partial P_k} D(\mathbf{t}^q, \mathbf{y}^q) = \sum_j \frac{\partial D(\mathbf{t}^q, \mathbf{y}^q)}{\partial y_j^q} \frac{\partial y_j^q}{\partial P_k} = \frac{d\, D(\mathbf{t}^q, \mathbf{y}^q)}{d\, \mathbf{y}^q} \cdot \frac{\partial \mathbf{y}^q}{\partial P_k} = \nabla_{\mathbf{y}^q} D(\mathbf{t}^q, \mathbf{y}^q) \cdot \frac{\partial \mathbf{y}^q}{\partial P_k}

Jacobian Matrix
Define the Jacobian matrix
J^q = \begin{pmatrix} \frac{\partial y_1^q}{\partial P_1} & \cdots & \frac{\partial y_1^q}{\partial P_m} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_n^q}{\partial P_1} & \cdots & \frac{\partial y_n^q}{\partial P_m} \end{pmatrix}
Note that J^q ∈ ℜ^{n×m} and \nabla D(\mathbf{t}^q, \mathbf{y}^q) ∈ ℜ^{n×1}.
Since
(\nabla E^q)_k = \frac{\partial E^q}{\partial P_k} = \sum_j \frac{\partial D(\mathbf{t}^q, \mathbf{y}^q)}{\partial y_j^q} \frac{\partial y_j^q}{\partial P_k},
\therefore \; \nabla E^q = (J^q)^T \, \nabla D(\mathbf{t}^q, \mathbf{y}^q)

Derivative of Squared Euclidean Distance
Suppose D(\mathbf{t}, \mathbf{y}) = \|\mathbf{t} - \mathbf{y}\|^2 = \sum_i (t_i - y_i)^2. Then
\frac{\partial D(\mathbf{t}, \mathbf{y})}{\partial y_j} = \frac{\partial}{\partial y_j} \sum_i (t_i - y_i)^2 = \frac{d (t_j - y_j)^2}{d y_j} = -2 (t_j - y_j)
\therefore \; \frac{d\, D(\mathbf{t}, \mathbf{y})}{d\, \mathbf{y}} = 2 (\mathbf{y} - \mathbf{t})

Gradient of Error on qth Input
\frac{\partial E^q}{\partial P_k} = \frac{d\, D(\mathbf{t}^q, \mathbf{y}^q)}{d\, \mathbf{y}^q} \cdot \frac{\partial \mathbf{y}^q}{\partial P_k} = 2 (\mathbf{y}^q - \mathbf{t}^q) \cdot \frac{\partial \mathbf{y}^q}{\partial P_k} = 2 \sum_j (y_j^q - t_j^q) \frac{\partial y_j^q}{\partial P_k}
\nabla E^q = 2 (J^q)^T (\mathbf{y}^q - \mathbf{t}^q)

Recap
\dot{\mathbf{P}} = \eta \sum_q (J^q)^T (\mathbf{t}^q - \mathbf{y}^q)
To know how to decrease the differences between actual and desired outputs, we need to know the elements of the Jacobian, ∂y_j^q/∂P_k, which says how the jth output varies with the kth parameter (given the qth input). The Jacobian depends on the specific form of the system; in this case, a feedforward neural network.

Multilayer Notation
[Figure: the input x^q enters as s^1; weight matrices W^1, W^2, …, W^{L−1} connect successive layers, producing s^2, …, s^{L−1}, and finally s^L = y^q.]

Notation
• L layers of neurons, labeled 1, …, L
• N_l neurons in layer l
• s^l = vector of outputs from the neurons in layer l
• input layer: s^1 = x^q (the input pattern)
• output layer: s^L = y^q (the actual output)
• W^l = weights between layers l and l+1
• Problem: find out how the outputs y_i^q vary with the weights W_jk^l (l = 1, …, L−1)

Typical Neuron
[Figure: neuron i in layer l receives s_1^{l−1}, …, s_j^{l−1}, …, s_N^{l−1} through weights W_{i1}^{l−1}, …, W_{ij}^{l−1}, …, W_{iN}^{l−1}; the sum Σ gives h_i^l, and σ gives s_i^l.]

Error Back-Propagation
We will compute \frac{\partial E^q}{\partial W_{ij}^{l}} starting with the last layer (l = L−1) and working back to earlier layers (l = L−2, …, 1).

Delta Values
It is convenient to break the derivatives up by the chain rule:
\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \frac{\partial E^q}{\partial h_i^l} \, \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}
Let \delta_i^l = \frac{\partial E^q}{\partial h_i^l}, so that
\frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l \, \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}

Output-Layer Neuron
[Figure: output neuron i receives s_1^{L−1}, …, s_N^{L−1} through weights W_{ij}^{L−1}; its output s_i^L = y_i^q is compared with the target t_i^q to give the error E^q.]

Output-Layer Derivatives (1)
\delta_i^L = \frac{\partial E^q}{\partial h_i^L} = \frac{\partial}{\partial h_i^L} \sum_k (s_k^L - t_k^q)^2 = \frac{d (s_i^L - t_i^q)^2}{d h_i^L} = 2 (s_i^L - t_i^q) \frac{d s_i^L}{d h_i^L} = 2 (s_i^L - t_i^q) \, \sigma'(h_i^L)

Output-Layer Derivatives (2)
\frac{\partial h_i^L}{\partial W_{ij}^{L-1}} = \frac{\partial}{\partial W_{ij}^{L-1}} \sum_k W_{ik}^{L-1} s_k^{L-1} = s_j^{L-1}
\therefore \; \frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L s_j^{L-1}, \quad \text{where } \delta_i^L = 2 (s_i^L - t_i^q) \, \sigma'(h_i^L)

Hidden-Layer Neuron
[Figure: hidden neuron i in layer l feeds every neuron k in layer l+1 through the weights W_{ki}^l, so its effect on the error E^q flows through all of the h_k^{l+1}.]

Hidden-Layer Derivatives (1)
Recall \frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l \, \frac{\partial h_i^l}{\partial W_{ij}^{l-1}}.
\delta_i^l = \frac{\partial E^q}{\partial h_i^l} = \sum_k \frac{\partial E^q}{\partial h_k^{l+1}} \frac{\partial h_k^{l+1}}{\partial h_i^l} = \sum_k \delta_k^{l+1} \frac{\partial h_k^{l+1}}{\partial h_i^l}
\frac{\partial h_k^{l+1}}{\partial h_i^l} = \frac{\partial \sum_m W_{km}^l s_m^l}{\partial h_i^l} = W_{ki}^l \frac{d\, \sigma(h_i^l)}{d\, h_i^l} = W_{ki}^l \, \sigma'(h_i^l)
\therefore \; \delta_i^l = \sum_k \delta_k^{l+1} W_{ki}^l \, \sigma'(h_i^l) = \sigma'(h_i^l) \sum_k \delta_k^{l+1} W_{ki}^l

Hidden-Layer Derivatives (2)
\frac{\partial h_i^l}{\partial W_{ij}^{l-1}} = \frac{\partial}{\partial W_{ij}^{l-1}} \sum_k W_{ik}^{l-1} s_k^{l-1} = \frac{d\, W_{ij}^{l-1} s_j^{l-1}}{d\, W_{ij}^{l-1}} = s_j^{l-1}
\therefore \; \frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l s_j^{l-1}, \quad \text{where } \delta_i^l = \sigma'(h_i^l) \sum_k \delta_k^{l+1} W_{ki}^l

Derivative of Sigmoid
Suppose s = \sigma(h) = \frac{1}{1 + \exp(-\alpha h)} (the logistic sigmoid). Then
D_h s = D_h [1 + \exp(-\alpha h)]^{-1} = -[1 + \exp(-\alpha h)]^{-2} D_h (1 + e^{-\alpha h}) = -(1 + e^{-\alpha h})^{-2} (-\alpha e^{-\alpha h})
 = \alpha \frac{e^{-\alpha h}}{(1 + e^{-\alpha h})^2} = \alpha s \, \frac{e^{-\alpha h}}{1 + e^{-\alpha h}} = \alpha s \left( \frac{1 + e^{-\alpha h}}{1 + e^{-\alpha h}} - \frac{1}{1 + e^{-\alpha h}} \right) = \alpha s (1 - s)

Summary of Back-Propagation Algorithm
Output layer:
\delta_i^L = 2 \alpha s_i^L (1 - s_i^L)(s_i^L - t_i^q), \qquad \frac{\partial E^q}{\partial W_{ij}^{L-1}} = \delta_i^L s_j^{L-1}
Hidden layers:
\delta_i^l = \alpha s_i^l (1 - s_i^l) \sum_k \delta_k^{l+1} W_{ki}^l, \qquad \frac{\partial E^q}{\partial W_{ij}^{l-1}} = \delta_i^l s_j^{l-1}
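A minimal sketch of this back-propagation summary in Python (NumPy), for a single training pair on a network with one hidden layer. The layer sizes, α, η, and the random input/target are illustrative assumptions; descending the error E^q (i.e., ascending the fitness F = −ΣE^q) gives ΔW = −η ∂E^q/∂W.

```python
import numpy as np

alpha, eta = 1.0, 0.1
sigma = lambda h: 1.0 / (1.0 + np.exp(-alpha * h))   # logistic sigmoid

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 4, 2                # illustrative layer sizes
W1 = rng.normal(size=(n_hid, n_in))         # weights between layers 1 and 2
W2 = rng.normal(size=(n_out, n_hid))        # weights between layers 2 and 3
x = rng.random(n_in)                        # input pattern x^q
t = rng.random(n_out)                       # target output t^q

# Forward pass: s^1 = x^q, h^l = W^{l-1} s^{l-1}, s^l = sigma(h^l).
s1 = x
s2 = sigma(W1 @ s1)                         # hidden layer
s3 = sigma(W2 @ s2)                         # output layer, s^L = y^q

# Output layer: delta_i^L = 2 alpha s_i^L (1 - s_i^L)(s_i^L - t_i^q).
delta3 = 2 * alpha * s3 * (1 - s3) * (s3 - t)
# Hidden layer: delta_i^l = alpha s_i^l (1 - s_i^l) sum_k delta_k^{l+1} W_ki^l.
delta2 = alpha * s2 * (1 - s2) * (W2.T @ delta3)

# dE^q/dW_ij^{l-1} = delta_i^l s_j^{l-1}; step downhill on the error.
W2 -= eta * np.outer(delta3, s2)
W1 -= eta * np.outer(delta2, s1)

print("squared error:", np.sum((s3 - t) ** 2))
```

This matches the dataflow of the two computation slides that follow; there ΔW = ηδs with δ written in terms of (t_i^q − s_i^L), which is the same update with the sign absorbed into δ.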
Output-Layer Computation
\Delta W_{ij}^{L-1} = \eta \, \delta_i^L s_j^{L-1}, \qquad \delta_i^L = 2 \alpha s_i^L (1 - s_i^L)(t_i^q - s_i^L)
[Figure: dataflow for the output layer: the difference t_i^q − s_i^L is multiplied by 2α s_i^L (1 − s_i^L) to give δ_i^L, which is multiplied by s_j^{L−1} and η to give ΔW_ij^{L−1}.]

Hidden-Layer Computation
\Delta W_{ij}^{l-1} = \eta \, \delta_i^l s_j^{l-1}, \qquad \delta_i^l = \alpha s_i^l (1 - s_i^l) \sum_k \delta_k^{l+1} W_{ki}^l
[Figure: dataflow for a hidden layer: the deltas δ_1^{l+1}, …, δ_N^{l+1} are propagated back through the weights W_{ki}^l, summed, and multiplied by α s_i^l (1 − s_i^l) to give δ_i^l.]

Training Procedures
• Batch learning
  – on each epoch (pass through all the training pairs),
  – weight changes for all patterns are accumulated,
  – and the weight matrices are updated at the end of the epoch
  – accurate computation of the gradient
• Online learning
  – weights are updated after back-propagation of each training pair
  – usually randomize the order for each epoch
  – an approximation of the gradient
• In practice it doesn't make much difference

Summation of Error Surfaces
[Figure: the total error surface E is the sum of the per-pattern error surfaces E^1, E^2, ….]

Gradient Computation in Batch Learning
[Figure: batch learning steps along the gradient of the summed error surface E.]

Gradient Computation in Online Learning
[Figure: online learning alternates steps along the gradients of the individual surfaces E^1, E^2, …, approximating the gradient of E.]

Testing Generalization
[Figure: the available data from the domain are split into training data and test data.]

Problem of Rote Learning
[Figure: error versus training epoch: the error on the training data keeps decreasing, but the error on the test data eventually starts to rise; stop training at the minimum of the test error.]

Improving Generalization
[Figure: the available data from the domain are split into training data, validation data, and test data.]

A Few Random Tips
• Too few neurons and the ANN may not be able to decrease the error enough
• Too many neurons can lead to rote learning
• Preprocess data to:
  – standardize
  – eliminate irrelevant information
  – capture invariances
  – keep relevant information
• If stuck in a local minimum, restart with different random weights

Run Example BP Learning

Beyond Back-Propagation
• Adaptive learning rate
• Adaptive architecture
  – add/delete hidden neurons
  – add/delete hidden layers
• Radial basis function networks
• Recurrent BP
• Etc., etc., etc.…

Deep Belief Networks
• Inspired by hierarchical representations in mammalian sensory systems
• Use "deep" (multilayer) feed-forward nets
• Layers self-organize to represent the input at progressively more abstract, task-relevant levels
• Supervised training (e.g., BP) can be used to tune network performance
• Each layer is a Restricted Boltzmann Machine

Restricted Boltzmann Machine
• Goal: the hidden units become a model of the input domain
• They should capture the statistics of the input
• Evaluate the model by testing its ability to reproduce the input statistics
• Change the weights to decrease the difference
(fig. from Wikipedia)

Unsupervised RBM Learning
• Stochastic binary units
• Assume bias units x_0 = y_0 = 1
• Set y_i = 1 with probability \sigma\bigl(\sum_j W_{ij} x_j\bigr)
• Set x'_j = 1 with probability \sigma\bigl(\sum_i W_{ij} y_i\bigr)
• Set y'_i = 1 with probability \sigma\bigl(\sum_j W_{ij} x'_j\bigr)
• After several cycles of sampling, update the weights based on the statistics (see the sketch after the next slide):
\Delta W_{ij} = \eta \bigl( \langle y_i x_j \rangle - \langle y'_i x'_j \rangle \bigr)

Training a DBN Network
• Present inputs and do RBM learning with the first hidden layer to develop a model
• When it has converged, do RBM learning between the first and second hidden layers to develop a higher-level model
• Continue until all weight layers are trained
• May further train with BP or other supervised learning algorithms
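A minimal sketch of the RBM update above in Python (NumPy), run per pattern with a single sampling cycle (the contrastive-divergence shortcut) rather than the several cycles the slide describes; the layer sizes, learning rate, random training data, and omission of the bias units are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda h: 1.0 / (1.0 + np.exp(-h))
sample = lambda p: (rng.random(p.shape) < p).astype(float)  # stochastic binary units

n_vis, n_hid, eta = 6, 3, 0.1                  # illustrative sizes and rate
W = 0.1 * rng.normal(size=(n_hid, n_vis))      # W_ij: hidden unit i, visible unit j
data = sample(np.full((100, n_vis), 0.5))      # illustrative binary training patterns

for x in data:
    y = sample(sigma(W @ x))        # set y_i  with prob sigma(sum_j W_ij x_j)
    x2 = sample(sigma(W.T @ y))     # set x'_j with prob sigma(sum_i W_ij y_i)
    y2 = sample(sigma(W @ x2))      # set y'_i with prob sigma(sum_j W_ij x'_j)
    # Move <y_i x_j> toward the data and away from the model's reconstruction.
    W += eta * (np.outer(y, x) - np.outer(y2, x2))
```

Stacking this step layer by layer, with each trained layer's hidden activities serving as the next layer's input, is the DBN training procedure of the preceding slide.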
What is the Power of Artificial Neural Networks?
• With respect to Turing machines?
• As function approximators?

Can ANNs Exceed the "Turing Limit"?
• There are many results, which depend sensitively on assumptions; for example:
• Finite NNs with real-valued weights have super-Turing power (Siegelmann & Sontag '94)
• Recurrent nets with Gaussian noise have sub-Turing power (Maass & Sontag '99)
• Finite recurrent nets with real weights can recognize all languages, and thus are super-Turing (Siegelmann '99)
• Stochastic nets with rational weights have super-Turing power (but only P/POLY, BPP/log*) (Siegelmann '99)
• But computing classes of functions is not a very relevant way to evaluate the capabilities of neural computation

A Universal Approximation Theorem
Suppose f is a continuous function on [0, 1]^n, and suppose σ is a nonconstant, bounded, monotone increasing real function on ℜ.
For any ε > 0, there is an m such that there exist \mathbf{a} \in \Re^m, \mathbf{b} \in \Re^m, W \in \Re^{m \times n} such that if
F(x_1, \ldots, x_n) = \sum_{i=1}^{m} a_i \, \sigma\!\left( \sum_{j=1}^{n} W_{ij} x_j + b_i \right)
[i.e., F(\mathbf{x}) = \mathbf{a} \cdot \sigma(W \mathbf{x} + \mathbf{b})],
then |F(\mathbf{x}) - f(\mathbf{x})| < ε for all \mathbf{x} \in [0, 1]^n.
(See, e.g., Haykin, Neural Networks, 2nd ed., pp. 208–9.)

One Hidden Layer is Sufficient
• Conclusion: one hidden layer is sufficient to approximate any continuous function arbitrarily closely (see the sketch at the end of this part).
[Figure: inputs x_1, …, x_n and a constant unit feed m sigmoid hidden units through the weights W and biases b; the hidden outputs are combined with the weights a_1, …, a_m by a linear output unit.]

The Golden Rule of Neural Nets
Neural networks are the second-best way to do everything!
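To close, a minimal sketch in Python (NumPy) of the one-hidden-layer form F(x) = a·σ(Wx + b) from the approximation theorem above. Fixing random hidden weights and solving for the output weights a by least squares is an illustrative shortcut, not the theorem's construction; the target function, m, and the weight scales are likewise assumptions.

```python
import numpy as np

f = lambda x: np.sin(2 * np.pi * x)          # a continuous function on [0, 1]
sigma = lambda h: 1.0 / (1.0 + np.exp(-h))   # nonconstant, bounded, increasing

m = 50                                       # number of hidden units
rng = np.random.default_rng(0)
W = rng.normal(scale=10.0, size=(m, 1))      # hidden weights (n = 1 here)
b = rng.uniform(-10.0, 10.0, size=m)         # hidden biases

x = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
H = sigma(x @ W.T + b)                       # hidden-layer outputs, one row per x

# Output weights a chosen by least squares, so that F(x) = a . sigma(Wx + b).
a, *_ = np.linalg.lstsq(H, f(x).ravel(), rcond=None)

F = H @ a
print("max |F(x) - f(x)| =", np.max(np.abs(F - f(x).ravel())))
```

Increasing m drives the maximum error down, which is the content of the theorem: with enough hidden units, one hidden layer suffices.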