Back-Propagation Algorithm

Outline: Perceptron, Gradient Descent, Multi-layer neural networks, Back-Propagation, More on Back-Propagation, Examples

Inner product
A unit's net input is the inner product of its weight vector and the input vector,

    net = \mathbf{w} \cdot \mathbf{x} = \|\mathbf{w}\| \, \|\mathbf{x}\| \cos\theta = \sum_{i=1}^{n} w_i x_i,

a measure of the projection of one vector onto another.

Activation function
The unit's output is an activation function f applied to the net input:

    o = f(net) = f\left(\sum_{i=1}^{n} w_i x_i\right)

Common choices of f:

    sign function:       f(x) = sgn(x) = 1 if x \ge 0, -1 if x < 0
    threshold function:  f(x) = \varphi(x) = 1 if x \ge 0, 0 if x < 0
    piecewise linear:    f(x) = 1 if x > 0.5, x if -0.5 \le x \le 0.5, 0 if x < -0.5
    sigmoid function:    f(x) = \sigma(x) = \frac{1}{1 + e^{-ax}}

Gradient Descent
To understand gradient descent, consider a simpler linear unit, whose output is

    o = \sum_{i=0}^{n} w_i x_i

Let's learn weights w_i that minimize the squared error over the training set

    D = \{(x^1, t^1), (x^2, t^2), \ldots, (x^d, t^d), \ldots, (x^m, t^m)\}   (t stands for target)

    E[\mathbf{w}] = \frac{1}{2} \sum_{d \in D} (t^d - o^d)^2

The error for different hypotheses forms a surface over weight space; for two weights w_0 and w_1 (a 2-dimensional weight space) it is a bowl with a single global minimum. We want to move the weight vector in the direction that decreases E:

    w_i \leftarrow w_i + \Delta w_i, \quad \mathbf{w} \leftarrow \mathbf{w} + \Delta\mathbf{w}

Differentiating E gives \partial E / \partial w_i = -\sum_{d \in D} (t^d - o^d) x_i^d, so the update rule for gradient descent is

    \Delta w_i = \eta \sum_{d \in D} (t^d - o^d) x_i^d

Stochastic approximation to gradient descent

    \Delta w_i = \eta \, (t - o) \, x_i

The gradient-descent training rule updates the weights after summing over all training examples in D. Stochastic gradient descent approximates it by updating the weights incrementally, calculating the error for each example as it is presented. This is known as the delta rule or LMS (least mean square) weight update; it is the Adaline rule used for adaptive filters (Widrow and Hoff, 1960). A small code sketch of this incremental update is given just before the back-propagation derivation below.

XOR problem and Perceptron
A single perceptron cannot represent the XOR function, as shown by Minsky and Papert in the mid-1960s.

Multi-layer Networks
The limitations of the simple perceptron do not apply to feed-forward networks with intermediate ("hidden") nonlinear units. A network with just one hidden layer can represent any Boolean function. The great power of multi-layer networks was realized long ago, but it was only in the eighties that it was shown how to make them learn. Note that multiple layers of cascaded linear units still produce only linear functions; since we want networks capable of representing nonlinear functions, the units must use nonlinear activation functions, such as the sigmoid above.

XOR example
The XOR function can be represented by a network with one hidden layer of nonlinear units.

Back-propagation is a learning algorithm for multi-layer neural networks. It was invented independently several times:
    Bryson and Ho [1969]
    Werbos [1974]
    Parker [1985]
    Rumelhart et al. [1986], Parallel Distributed Processing, Vol. 1: Foundations, by David E. Rumelhart, James L. McClelland and the PDP Research Group ("What makes people smarter than computers? These volumes by a pioneering neurocomputing group ...")

Back-propagation
The algorithm gives a prescription for changing the weights w_ij in any feed-forward network so as to learn a training set of input-output pairs \{x^d, t^d\}. We consider a simple two-layer network with five inputs, three hidden units and two output units.
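Before turning to the two-layer derivation, here is a minimal sketch of the stochastic delta-rule (LMS) update described in the gradient-descent part above, for a single linear unit. It assumes Python with NumPy; the function name train_linear_unit, the learning rate eta and the toy data are illustrative choices, not part of the original slides.

    import numpy as np

    def train_linear_unit(X, t, eta=0.05, epochs=100):
        """Stochastic delta-rule (LMS) training of a linear unit o = w . x.

        X: (m, n) array of input patterns (append a constant column to get a bias weight).
        t: (m,) array of target values.
        """
        w = np.zeros(X.shape[1])                 # start from small (here zero) weights
        for _ in range(epochs):
            for x_d, t_d in zip(X, t):           # one incremental update per pattern
                o = w @ x_d                      # linear unit output o = sum_i w_i x_i
                w += eta * (t_d - o) * x_d       # delta rule: dw_i = eta * (t - o) * x_i
        return w

    # toy usage: recover t = 2*x1 - 3*x2 + 1 from noisy samples (x3 = 1 acts as the bias input)
    rng = np.random.default_rng(0)
    X = np.hstack([rng.normal(size=(50, 2)), np.ones((50, 1))])
    t = 2 * X[:, 0] - 3 * X[:, 1] + 1 + 0.01 * rng.normal(size=50)
    print(train_linear_unit(X, t))               # approximately [ 2. -3.  1.]

Applying a threshold to o would turn the unit into a perceptron; the delta rule itself is defined on the unthresholded linear output.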
In the two-layer network the inputs are denoted x_k, k = 1, \ldots, 5; the hidden units are indexed by j = 1, 2, 3 and the output units by i = 1, 2.

Given the pattern x^d, hidden unit j receives a net input

    net_j^d = \sum_{k=1}^{5} w_{jk} x_k^d

and produces the output

    V_j^d = f(net_j^d) = f\left(\sum_{k=1}^{5} w_{jk} x_k^d\right)

Output unit i thus receives

    net_i^d = \sum_{j=1}^{3} W_{ij} V_j^d = \sum_{j=1}^{3} W_{ij} \, f\left(\sum_{k=1}^{5} w_{jk} x_k^d\right)

and produces the final output

    o_i^d = f(net_i^d) = f\left(\sum_{j=1}^{3} W_{ij} V_j^d\right) = f\left(\sum_{j=1}^{3} W_{ij} \, f\left(\sum_{k=1}^{5} w_{jk} x_k^d\right)\right)

Our usual error function, for l outputs and m input-output pairs \{x^d, t^d\}, is

    E[\mathbf{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{l} (t_i^d - o_i^d)^2

In our example (l = 2) E becomes

    E[\mathbf{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{2} (t_i^d - o_i^d)^2
                  = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{2} \left(t_i^d - f\left(\sum_{j=1}^{3} W_{ij} \, f\left(\sum_{k=1}^{5} w_{jk} x_k^d\right)\right)\right)^2

E[\mathbf{w}] is differentiable provided f is differentiable, so gradient descent can be applied.

For the hidden-to-output connections W_{ij} the gradient-descent rule gives

    \Delta W_{ij} = -\eta \frac{\partial E}{\partial W_{ij}} = \eta \sum_{d=1}^{m} (t_i^d - o_i^d) \, f'(net_i^d) \, V_j^d

Defining

    \delta_i^d = f'(net_i^d) (t_i^d - o_i^d)

this becomes

    \Delta W_{ij} = \eta \sum_{d=1}^{m} \delta_i^d V_j^d

For the input-to-hidden connections w_{jk} we must differentiate with respect to w_{jk}, which enters E only through V_j^d. Using the chain rule we obtain

    \Delta w_{jk} = -\eta \sum_{d=1}^{m} \frac{\partial E}{\partial V_j^d} \frac{\partial V_j^d}{\partial w_{jk}}
                  = \eta \sum_{d=1}^{m} \sum_{i=1}^{2} (t_i^d - o_i^d) \, f'(net_i^d) \, W_{ij} \, f'(net_j^d) \, x_k^d

With \delta_i^d = f'(net_i^d)(t_i^d - o_i^d) as before,

    \Delta w_{jk} = \eta \sum_{d=1}^{m} \sum_{i=1}^{2} \delta_i^d W_{ij} \, f'(net_j^d) \, x_k^d

Defining

    \delta_j^d = f'(net_j^d) \sum_{i=1}^{2} W_{ij} \delta_i^d

we get

    \Delta w_{jk} = \eta \sum_{d=1}^{m} \delta_j^d x_k^d

Comparing

    \Delta W_{ij} = \eta \sum_{d=1}^{m} \delta_i^d V_j^d    and    \Delta w_{jk} = \eta \sum_{d=1}^{m} \delta_j^d x_k^d

we see that both updates have the same form, only with a different definition of \delta. In general, with an arbitrary number of layers, the back-propagation update rule always has the form

    \Delta w_{ij} = \eta \sum_{d=1}^{m} \delta_{\text{output}} \, V_{\text{input}}

where "output" and "input" refer to the two ends of the connection concerned, V stands for the appropriate input-end activation (a hidden unit or a real input x^d), and \delta depends on the layer concerned. The equation

    \delta_j^d = f'(net_j^d) \sum_{i=1}^{2} W_{ij} \delta_i^d

allows us to determine the \delta of a given hidden unit V_j in terms of the \delta's of the units o_i it feeds. The activations are propagated forward as usual, but the errors \delta are propagated backward, hence the name back-propagation.

We have to use a nonlinear, differentiable activation function. Examples:

    f(x) = \sigma(x) = \frac{1}{1 + e^{-x}},   f'(x) = \sigma'(x) = \sigma(x)(1 - \sigma(x))
    f(x) = \tanh(x),                           f'(x) = 1 - f(x)^2

Notation for a network with M layers m = 1, 2, \ldots, M:
    V_i^m is the output of the i-th unit of the m-th layer,
    V_i^0 is a synonym for the i-th input x_i,
    the index m labels layers, not patterns,
    w_{ij}^m denotes the connection from V_j^{m-1} to V_i^m.

Stochastic Back-Propagation Algorithm (mostly used)
1. Initialize the weights to small random values.
2. Choose a pattern x^d and apply it to the input layer: V_k^0 = x_k^d for all k.
3. Propagate the signal forward through the network: V_i^m = f(net_i^m) = f\left(\sum_j w_{ij}^m V_j^{m-1}\right).
4. Compute the deltas for the output layer: \delta_i^M = f'(net_i^M)(t_i^d - V_i^M).
5. Compute the deltas for the preceding layers, for m = M, M-1, \ldots, 2: \delta_i^{m-1} = f'(net_i^{m-1}) \sum_j w_{ji}^m \delta_j^m.
6. Update all connections: \Delta w_{ij}^m = \eta \, \delta_i^m V_j^{m-1}, \quad w_{ij}^{new} = w_{ij}^{old} + \Delta w_{ij}.
7. Go to step 2 and repeat for the next pattern.
(A code sketch of this procedure, including a momentum term, follows at the end of this section.)

More on Back-Propagation
    Gradient descent over the entire network weight vector.
    Easily generalized to arbitrary directed graphs.
    Will find a local, not necessarily global, error minimum; in practice it often works well (one can run it multiple times with different initial weights).
    Gradient descent can be very slow if \eta is too small, and can oscillate widely if \eta is too large.
    A weight momentum term \alpha is therefore often included:

        \Delta w_{pq}(t+1) = -\eta \frac{\partial E}{\partial w_{pq}} + \alpha \, \Delta w_{pq}(t)

    The momentum parameter \alpha is chosen between 0 and 1; 0.9 is a good value.
    Back-propagation minimizes the error over the training examples; will it generalize well to unseen examples?
    Training can take thousands of iterations and is slow; using the network after training is very fast.
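The following is a minimal sketch of steps 1-7 of the stochastic algorithm above, including the momentum term, for the 5-3-2 network shape used in the derivation. It assumes Python with NumPy and uses the sigmoid as f, so f'(net) = f(net)(1 - f(net)). The function name train_backprop and the default values of eta and alpha are illustrative choices, and bias weights are omitted because the slide derivation does not include them.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def train_backprop(X, T, n_hidden=3, eta=0.5, alpha=0.9, epochs=1000, seed=0):
        """Stochastic back-propagation for a two-layer (one hidden layer) network.

        X: (m, n_in) input patterns, T: (m, n_out) targets in (0, 1).
        Returns the input-to-hidden weights w[j, k] and hidden-to-output weights W[i, j].
        """
        rng = np.random.default_rng(seed)
        w = rng.normal(scale=0.1, size=(n_hidden, X.shape[1]))      # step 1: small random w_jk
        W = rng.normal(scale=0.1, size=(T.shape[1], n_hidden))      #         and W_ij
        dw_prev, dW_prev = np.zeros_like(w), np.zeros_like(W)

        for _ in range(epochs):
            for d in rng.permutation(len(X)):                       # step 2: pick a pattern
                x, t = X[d], T[d]
                V = sigmoid(w @ x)                                  # step 3: forward pass, V_j = f(net_j)
                o = sigmoid(W @ V)                                  #         o_i = f(net_i)
                delta_o = o * (1 - o) * (t - o)                     # step 4: delta_i = f'(net_i)(t_i - o_i)
                delta_h = V * (1 - V) * (W.T @ delta_o)             # step 5: delta_j = f'(net_j) sum_i W_ij delta_i
                dW = eta * np.outer(delta_o, V) + alpha * dW_prev   # step 6: updates with momentum
                dw = eta * np.outer(delta_h, x) + alpha * dw_prev
                W += dW
                w += dw
                dW_prev, dw_prev = dW, dw                           # step 7: continue with the next pattern
        return w, W

    # smoke test with the 5-3-2 shape from the derivation (random data, no claim about the fit)
    rng = np.random.default_rng(1)
    X, T = rng.random((20, 5)), rng.random((20, 2))
    w, W = train_backprop(X, T, epochs=200)
    print(w.shape, W.shape)    # (3, 5) (2, 3)

In practice one would also provide bias weights, for instance by appending a constant value of 1 to the input vector and to the hidden activations; the update rules are unchanged.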
Convergence of Back-Propagation
Gradient descent converges to some local minimum, which is perhaps not the global minimum. Remedies:
    add momentum,
    use stochastic gradient descent,
    train multiple nets with different initial weights.
Nature of convergence: the weights are initialized near zero, so the initial network is near-linear; increasingly non-linear functions become representable as training progresses.

Expressive Capabilities of ANNs
Boolean functions: every Boolean function can be represented by a network with a single hidden layer, but this might require a number of hidden units that is exponential in the number of inputs (a small worked XOR construction is given at the end of these notes).
Continuous functions: every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]. Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].

NETtalk
Sejnowski et al. 1987: a network trained with back-propagation to pronounce English text.

Prediction

Outline: Perceptron, Gradient Descent, Multi-layer neural networks, Back-Propagation, More on Back-Propagation, Examples; next: RBF Networks, Support Vector Machines.
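As a concrete illustration of the Boolean-function claim in the expressiveness part above, here is the classic XOR construction with a single hidden layer of threshold units: one hidden unit computes OR, the other computes AND, and the output fires when OR is true but AND is not. The sketch is in Python with NumPy; the particular weights and thresholds are one illustrative choice, not taken from the slides.

    import numpy as np

    def step(z):
        """Threshold activation: 1 where the net input exceeds 0, else 0."""
        return (np.asarray(z) > 0).astype(int)

    def xor_net(x1, x2):
        x = np.array([x1, x2])
        # hidden layer: unit 1 computes x1 OR x2 (threshold 0.5),
        #               unit 2 computes x1 AND x2 (threshold 1.5)
        h = step(np.array([[1.0, 1.0],
                           [1.0, 1.0]]) @ x - np.array([0.5, 1.5]))
        # output unit: fires when OR is true and AND is false, i.e. exactly one input is 1
        return int(step(np.array([1.0, -1.0]) @ h - 0.5))

    for a in (0, 1):
        for b in (0, 1):
            print(f"{a} XOR {b} = {xor_net(a, b)}")   # prints 0, 1, 1, 0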