VII. Neural Networks and Learning (11/24/04)

Supervised Learning
• Produce desired outputs for training inputs.
• Generalize reasonably & appropriately to other inputs.
• Good example: pattern recognition.
• Method: feedforward multilayer networks.

Typical Artificial Neuron
[Figure: a neuron receiving inputs through connection weights, with a threshold, producing an output.]

Feedforward Network
[Figure: input layer, hidden layers, output layer.]

Typical Artificial Neuron: Equations
Net input (local field), a linear combination of the inputs:
$h_i = \sum_{j=1}^{n} w_{ij} s_j$, or in matrix form $\mathbf{h} = \mathbf{W}\mathbf{s}$.
Neuron output (activation function applied to the net input):
$s_i = \sigma(h_i)$, or $\mathbf{s} = \sigma(\mathbf{h})$.

Single-Layer Perceptron
[Figure: inputs $x_1, \dots, x_n$ with weights $w_1, \dots, w_n$, threshold $\theta$, output $y$.]

Single-Layer Perceptron Equations
Binary threshold activation function:
$\Theta(h) = 1$ if $h > 0$; $\Theta(h) = 0$ if $h \le 0$.
Hence $y = \Theta(\mathbf{w} \cdot \mathbf{x} - \theta)$, i.e. $y = 1$ if $\mathbf{w} \cdot \mathbf{x} > \theta$ and $y = 0$ otherwise.

2D Weight Vector
$\mathbf{w} \cdot \mathbf{x} = \lVert\mathbf{w}\rVert\,\lVert\mathbf{x}\rVert \cos\phi$.
Let $v = \lVert\mathbf{x}\rVert \cos\phi$ be the projection of $\mathbf{x}$ onto $\mathbf{w}$; then $\mathbf{w} \cdot \mathbf{x} = \lVert\mathbf{w}\rVert\, v$, so
$\mathbf{w} \cdot \mathbf{x} > \theta \iff \lVert\mathbf{w}\rVert\, v > \theta \iff v > \theta / \lVert\mathbf{w}\rVert$.
[Figure: the decision boundary is the line perpendicular to $\mathbf{w}$ at distance $\theta / \lVert\mathbf{w}\rVert$ from the origin, separating the $+$ and $-$ regions.]

N-Dimensional Weight Vector
[Figure: $\mathbf{w}$ is the normal vector of a separating hyperplane, with the $+$ region on one side and the $-$ region on the other.]

Goal of Perceptron Learning
• Suppose we have training patterns $x^1, x^2, \dots, x^P$ with corresponding desired outputs $y^1, y^2, \dots, y^P$,
• where $x^p \in \{0,1\}^n$ and $y^p \in \{0,1\}$.
• We want to find $\mathbf{w}$, $\theta$ such that $y^p = \Theta(\mathbf{w} \cdot x^p - \theta)$ for $p = 1, \dots, P$.

Treating Threshold as Weight
$h = \sum_{j=1}^{n} w_j x_j - \theta = -\theta + \sum_{j=1}^{n} w_j x_j$.
Let $x_0 = -1$ and $w_0 = \theta$. Then
$h = w_0 x_0 + \sum_{j=1}^{n} w_j x_j = \sum_{j=0}^{n} w_j x_j = \tilde{\mathbf{w}} \cdot \tilde{\mathbf{x}}$.

Augmented Vectors
$\tilde{\mathbf{w}} = (\theta, w_1, \dots, w_n)^{\mathrm{T}}$, $\quad \tilde{x}^p = (-1, x_1^p, \dots, x_n^p)^{\mathrm{T}}$.
We want $y^p = \Theta(\tilde{\mathbf{w}} \cdot \tilde{x}^p)$ for $p = 1, \dots, P$.

Reformulation as Positive Examples
We have positive ($y^p = 1$) and negative ($y^p = 0$) examples.
Want $\tilde{\mathbf{w}} \cdot \tilde{x}^p > 0$ for positive examples and $\tilde{\mathbf{w}} \cdot \tilde{x}^p \le 0$ for negative examples.
Let $z^p = \tilde{x}^p$ for positive examples and $z^p = -\tilde{x}^p$ for negative examples.
Want $\tilde{\mathbf{w}} \cdot z^p \ge 0$ for $p = 1, \dots, P$:
a hyperplane through the origin with all the $z^p$ on one side.

Outline of Perceptron Learning Algorithm
1. Initialize the weight vector randomly.
2. Until all patterns are classified correctly, do:
   a) for p = 1, …, P do:
      1) if $z^p$ is classified correctly, do nothing;
      2) else adjust the weight vector to be closer to correct classification.
(A code sketch of this loop appears after the perceptron slides below.)

Adjustment of Weight Vector
[Figure: training vectors $z^1, \dots, z^{11}$; the weight vector is rotated toward the misclassified ones.]

Weight Adjustment
If $\tilde{\mathbf{w}} \cdot z^p < 0$, update
$\tilde{\mathbf{w}}' = \tilde{\mathbf{w}} + \eta z^p$.
[Figure: $\tilde{\mathbf{w}}'$ is obtained by adding $\eta z^p$ to $\tilde{\mathbf{w}}$.]

Improvement in Performance
$\tilde{\mathbf{w}}' \cdot z^p = (\tilde{\mathbf{w}} + \eta z^p) \cdot z^p = \tilde{\mathbf{w}} \cdot z^p + \eta\, z^p \cdot z^p = \tilde{\mathbf{w}} \cdot z^p + \eta \lVert z^p \rVert^2 > \tilde{\mathbf{w}} \cdot z^p$.

Perceptron Learning Theorem
• If there is a set of weights that will solve the problem,
• then the PLA will eventually find it
• (for a sufficiently small learning rate).
• Note: this applies only if the positive & negative examples are linearly separable.

Classification Power of Multilayer Perceptrons
• Perceptrons can function as logic gates.
• Therefore an MLP can form intersections, unions, and differences of linearly separable regions.
• Classes can be arbitrary hyperpolyhedra.
• Minsky & Papert's criticism of perceptrons:
• at that time no one had succeeded in developing an MLP learning algorithm.
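The perceptron learning loop outlined above is simple enough to state directly in code. Below is a minimal sketch (Python/NumPy; the function name perceptron_learn and the AND example are illustrative, not taken from the slides), assuming the augmented-vector reformulation in which $z^p = \tilde{x}^p$ for positive examples and $z^p = -\tilde{x}^p$ for negative ones, so a pattern counts as correctly classified when $\tilde{\mathbf{w}} \cdot z^p > 0$.

    import numpy as np

    def perceptron_learn(X, y, eta=1.0, max_epochs=1000):
        """X: (P, n) array of 0/1 patterns; y: (P,) array of 0/1 targets."""
        P, n = X.shape
        # Augment each pattern with x0 = -1 so the threshold theta becomes weight w0.
        X_aug = np.hstack([-np.ones((P, 1)), X])
        # Reformulate as positive examples: flip the sign of the negative patterns.
        Z = np.where(y[:, None] == 1, X_aug, -X_aug)
        w = np.random.randn(n + 1)        # random initial augmented weight vector
        for _ in range(max_epochs):
            errors = 0
            for z in Z:
                if w @ z <= 0:            # misclassified (or on the boundary): move w toward z
                    w += eta * z
                    errors += 1
            if errors == 0:               # all patterns classified correctly
                break
        return w                          # w[0] is the threshold, w[1:] the weights

    # Example: the AND function is linearly separable, so the PLA converges.
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    y = np.array([0, 0, 0, 1])
    w = perceptron_learn(X, y)
    print((X @ w[1:] > w[0]).astype(int))  # expected: [0 0 0 1]

The termination test is exactly the condition of the Perceptron Learning Theorem: the loop is guaranteed to stop only when the examples are linearly separable.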
Credit Assignment Problem
How do we adjust the weights of the hidden layers?
[Figure: feedforward network with input layer, hidden layers, and output layer; the desired output is given only for the output layer.]

Adaptive System
[Figure: a system S with an evaluation function (fitness, figure of merit) F, control parameters $P_1, \dots, P_k, \dots, P_m$, and a control algorithm C that adjusts the parameters.]

Gradient Ascent on Fitness Surface
$\partial F / \partial P_k$ measures how $F$ is altered by variation of $P_k$:
$\nabla F = \left( \frac{\partial F}{\partial P_1}, \dots, \frac{\partial F}{\partial P_m} \right)^{\mathrm{T}}$.
$\nabla F$ points in the direction of maximum increase in $F$.
[Figure: fitness surface with the gradient vector at a point; gradient ascent moves uphill.]

Gradient Ascent by Discrete Steps
[Figure: a sequence of discrete uphill steps on the fitness surface.]

Gradient Ascent is Local But Not Shortest
[Figure: the gradient path climbs locally but is generally not the shortest path to the maximum.]

Gradient Ascent Process
$\dot{\mathbf{P}} = \eta\, \nabla F(\mathbf{P})$.
Change in fitness:
$\dot F = \frac{d F}{d t} = \sum_{k=1}^{m} \frac{\partial F}{\partial P_k} \frac{d P_k}{d t} = \sum_{k=1}^{m} (\nabla F)_k \dot P_k = \nabla F \cdot \dot{\mathbf{P}}$.
Hence $\dot F = \nabla F \cdot \eta \nabla F = \eta \lVert \nabla F \rVert^2 \ge 0$.
Therefore gradient ascent increases fitness (until it reaches a point of zero gradient).

General Ascent in Fitness
Note that any parameter-adjustment process $\mathbf{P}(t)$ will increase fitness provided
$0 < \dot F = \nabla F \cdot \dot{\mathbf{P}} = \lVert \nabla F \rVert\, \lVert \dot{\mathbf{P}} \rVert \cos\phi$,
where $\phi$ is the angle between $\nabla F$ and $\dot{\mathbf{P}}$.
Hence we need $\cos\phi > 0$, i.e. $\phi < 90^{\circ}$.

General Ascent on Fitness Surface
[Figure: any trajectory whose velocity stays within 90° of the gradient climbs the fitness surface.]

Fitness as Minimum Error
Suppose for $Q$ different inputs we have target outputs $t^1, \dots, t^Q$.
Suppose for parameters $\mathbf{P}$ the corresponding actual outputs are $y^1, \dots, y^Q$.
Suppose $D(t, y) \in [0, \infty)$ measures the difference between target & actual outputs.
Let $E^q = D(t^q, y^q)$ be the error on the $q$th sample.
Let $F(\mathbf{P}) = -\sum_{q=1}^{Q} E^q(\mathbf{P}) = -\sum_{q=1}^{Q} D[t^q, y^q(\mathbf{P})]$,
so that maximizing fitness minimizes total error.

Gradient of Fitness
$\nabla F = -\nabla \sum_q E^q = -\sum_q \nabla E^q$, where
$\frac{\partial E^q}{\partial P_k} = \frac{\partial}{\partial P_k} D(t^q, y^q) = \sum_j \frac{\partial D(t^q, y^q)}{\partial y_j^q} \frac{\partial y_j^q}{\partial P_k} = \frac{d\, D(t^q, y^q)}{d\, y^q} \cdot \frac{\partial y^q}{\partial P_k} = \nabla_{y^q} D(t^q, y^q) \cdot \frac{\partial y^q}{\partial P_k}$.

Jacobian Matrix
Define the Jacobian matrix
$J^q = \begin{pmatrix} \frac{\partial y_1^q}{\partial P_1} & \cdots & \frac{\partial y_1^q}{\partial P_m} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_n^q}{\partial P_1} & \cdots & \frac{\partial y_n^q}{\partial P_m} \end{pmatrix}$.
Note that $J^q \in \mathbb{R}^{n \times m}$ and $\nabla D(t^q, y^q) \in \mathbb{R}^{n \times 1}$.
Since $(\nabla E^q)_k = \frac{\partial E^q}{\partial P_k} = \sum_j \frac{\partial D(t^q, y^q)}{\partial y_j^q} \frac{\partial y_j^q}{\partial P_k}$, we have
$\nabla E^q = (J^q)^{\mathrm{T}}\, \nabla D(t^q, y^q)$.

Derivative of Squared Euclidean Distance
Suppose $D(t, y) = \lVert t - y \rVert^2 = \sum_i (t_i - y_i)^2$. Then
$\frac{\partial D(t, y)}{\partial y_j} = \frac{\partial}{\partial y_j} \sum_i (t_i - y_i)^2 = \sum_i \frac{\partial (t_i - y_i)^2}{\partial y_j} = \frac{d (t_j - y_j)^2}{d y_j} = -2 (t_j - y_j) = 2 (y_j - t_j)$.
Hence $\frac{d\, D(t, y)}{d\, y} = 2 (y - t)$.

Gradient of Error on qth Input
$\frac{\partial E^q}{\partial P_k} = \frac{d\, D(t^q, y^q)}{d\, y^q} \cdot \frac{\partial y^q}{\partial P_k} = 2 (y^q - t^q) \cdot \frac{\partial y^q}{\partial P_k} = 2 \sum_j (y_j^q - t_j^q) \frac{\partial y_j^q}{\partial P_k}$.

Recap
$\dot{\mathbf{P}} = \eta \sum_q (J^q)^{\mathrm{T}} (t^q - y^q)$
(the constant factor of 2 from the squared-error derivative is absorbed into the learning rate $\eta$).
To know how to decrease the differences between actual & desired outputs, we need to know the elements of the Jacobian, $\partial y_j^q / \partial P_k$, which say how the $j$th output varies with the $k$th parameter (given the $q$th input).
The Jacobian depends on the specific form of the system, in this case a feedforward neural network.
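To make the Recap formula concrete, here is a small sketch (Python/NumPy; the single layer of logistic neurons and the names sigmoid and gradient_step are illustrative assumptions, not the network treated in the slides) of one discrete step $\mathbf{P} \leftarrow \mathbf{P} + \eta (J^q)^{\mathrm{T}} (t^q - y^q)$. For a single layer $y = \sigma(Wx)$ the Jacobian entries are $\partial y_i / \partial W_{jk} = \delta_{ij}\, \sigma'(h_i)\, x_k$, so the update collapses to an outer product.

    import numpy as np

    def sigmoid(h):
        return 1.0 / (1.0 + np.exp(-h))

    def gradient_step(W, x, t, eta=0.1):
        """One discrete gradient step on D(t, y) = ||t - y||^2 for y = sigmoid(W x)."""
        h = W @ x                    # net input (local field)
        y = sigmoid(h)               # actual output
        dy_dh = y * (1.0 - y)        # sigma'(h) for the logistic function
        # (J^q)^T (t - y) for this one-layer system: weight W[j, k] affects
        # only output j, so the whole update is a single outer product.
        delta = (t - y) * dy_dh      # the factor of 2 is absorbed into eta
        return W + eta * np.outer(delta, x)

    # Example: drive a 2-output, 3-input layer toward its targets for one sample.
    rng = np.random.default_rng(0)
    W = rng.normal(size=(2, 3))
    x = np.array([0.5, -1.0, 2.0])
    t = np.array([1.0, 0.0])
    for _ in range(200):
        W = gradient_step(W, x, t)
    print(sigmoid(W @ x))            # outputs approach [1, 0]

For a network with hidden layers the Jacobian is no longer this simple, which is precisely the credit assignment problem raised above.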