CS-363 Neural Networks and Machine Learning

MULTILAYER FEEDFORWARD NEURAL NETWORK. ERROR BACKPROPAGATION. LEARNING ALGORITHM

MLF: Learning Problem
• How can we train an MLF?
• While the error of the entire network can be calculated, because its desired outputs are known, the desired outputs of the hidden neurons are unknown…
• How can this problem be solved?

Error Backpropagation
• The heuristic idea behind error backpropagation is to share the errors of the output neurons, which can be calculated because their desired outputs are known, with all the hidden neurons.
• Error backpropagation is the sequential propagation of the neurons' errors from the output layer through all the layers, from the "right-hand" side to the "left-hand" side, up to the first hidden layer, in order to calculate the errors of all the other neurons.

Multilayer Feedforward Neural Network (MLF)
[Figure: an MLF with inputs $x_1, x_2, \dots, x_n$; signals flow forward through the layers, while errors flow backward (error backpropagation).]

MLF Learning
• Basically, the entire learning process consists of two passes through all the layers of the network: a forward pass and a backward pass.
• In the forward pass, the inputs are propagated from the input layer of the network to the first hidden layer; then, layer by layer, the output signals of the hidden neurons are propagated to the corresponding inputs of the following layer's neurons. Finally, a set of outputs is produced as the actual response of the network.
• During the forward pass the synaptic weights of the network are all fixed.
• (A short code sketch of the forward pass is given below, after the historical notes.)

MLF Learning
• During the backward pass, first the errors of all the neurons are calculated, and then the weights of all the neurons are adjusted in accordance with the learning rule.
• One complete iteration (epoch) of the learning process consists of a forward pass and a backward pass.

MLF Learning
• The MLF learning algorithm is based on a generalization of the error-correction learning rule to the case of an MLF.
• Specifically, the actual response of the network is subtracted from the desired response to produce an error signal. This error signal is then propagated backward through the network, against the direction of the synaptic connections – hence the name "backpropagation".
• The weights are adjusted so as to move the actual output of the network closer to the desired output.

MLF Learning
• The learning algorithm for the classical MLF is derived from the consideration that the global error of the network, in terms of the mean square error (MSE) or root mean square error (RMSE), must be minimized.
• This minimization is treated as an optimization problem – finding a global minimum of the error function.

Paul Werbos
• Suggested and developed the error backpropagation learning algorithm in his Harvard Ph.D. dissertation in 1974.
• (It was not published at that time and remained unknown to the research community until 1986, when the algorithm was re-invented by D. Rumelhart.)

David Rumelhart (1942-2011)
• Re-invented error backpropagation and redeveloped it; presented the MLF learning algorithm in 1986.
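Before introducing the formal notation, the two-pass scheme described above can be illustrated with a short sketch of the forward pass. This is only an illustration, not part of the lecture material: the names (`forward_pass`, `weights`, `logistic`) are hypothetical, and it assumes a NumPy setup in which each layer's weights are stored as a matrix whose column 0 holds the bias weight $w_0$.

```python
import numpy as np

def logistic(z):
    """Logistic activation: phi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(x, weights):
    """Forward pass through an MLF with fixed weights.

    weights[j] is the weight matrix of layer j: row k holds the weights of the
    k-th neuron of that layer, with the bias weight w_0 in column 0.
    Returns the layer outputs Y and the weighted sums z, both of which are
    needed later for the backward pass.
    """
    outputs, sums = [], []
    y = np.asarray(x, dtype=float)
    for W in weights:
        y_with_bias = np.concatenate(([1.0], y))  # Y_0 = 1 multiplies the bias weight w_0
        z = W @ y_with_bias                       # weighted sums z_k of this layer
        y = logistic(z)                           # outputs Y_k = phi(z_k)
        sums.append(z)
        outputs.append(y)
    return outputs, sums
```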
Feedforward Neural Network
• Let us use the following notation:
• $m$ – the number of layers;
• $w_i^{kj}$ – the $i$th weight of the $k$th neuron from the $j$th layer;
• $D_{kj}$ – the desired output of the $k$th neuron from the $j$th layer;
• $Y_{kj}$ – the actual output of the $k$th neuron from the $j$th layer;
• $\delta^*_{km} = D_{km} - Y_{km}$ – the network error for the $k$th neuron from the output layer;
• $\delta_{km}$ – the error for the $k$th neuron from the output layer;
• $\delta_{kj}$ – the error for the $k$th neuron from the $j$th layer;
• $N_j$ – the number of all neurons in the layer $j$.

Error of the network
• MSE of the network is $E = \frac{1}{N}\sum_{s=1}^{N} E_s$
• $N$ is the total number of samples (patterns) in the learning set
• $E_s = (D_s - Y_s)^2 = (\delta^*_s)^2,\; s=1,\dots,N$, is the square error of the network for the $s$th pattern for a single output neuron
• $E_s = \frac{1}{N_m}\sum_{k=1}^{N_m}(D_{ks} - Y_{ks})^2 = \frac{1}{N_m}\sum_{k=1}^{N_m}(\delta^*_{ks})^2,\; s=1,\dots,N$, is the square error of the network for the $s$th pattern for $N_m$ output neurons

Square Error
• For simplicity, but without loss of generality, instead of MSE minimization we may consider minimization of the square error (SE)
$$E = \sum_{k=1}^{N_m}(\delta^*_{km})^2$$
• where $\delta^*_{km} = D_k - Y_k$, $s=1,\dots,N$, is the error of the $k$th output neuron for the $s$th learning sample

Error Function
• To minimize the square error (SE) by changing the weights, we have to consider the SE as a function of the weights.
• We will consider minimization of the function
$$E(W) = \frac{1}{2}\sum_{k=1}^{N_m}(\delta^*_{km})^2$$
• The factor 1/2 is used to simplify the subsequent derivations resulting from the minimization of $E$.

Correction of the weights
• To ensure movement toward the global minimum on each iteration, the correction of the weights of all the neurons has to be organized in such a way that each weight $w_i^{kj}$ is corrected by an amount $\Delta w_i^{kj}$, which must be proportional to the partial derivative $\frac{\partial E}{\partial w_i^{kj}}$ of the error function $E(W)$ with respect to this weight (and taken with the opposite sign, so that the correction moves against the gradient).

Derivative of the Square Error
• Let $Y_{kj} = \varphi_{kj}(z_{kj})$ be the output value of the $k$th neuron at the $j$th layer, represented as a function of its weighted sum.
• Evidently, $\frac{\partial E}{\partial W} = \frac{\partial E}{\partial z}\cdot\frac{\partial z}{\partial W}$
• Applying the chain rule for differentiation, for the $k$th neuron at the output ($m$th) layer we obtain
$$\frac{\partial E(W)}{\partial w_i^{km}} = \frac{\partial E(W)}{\partial \delta^*_{km}}\cdot\frac{\partial \delta^*_{km}}{\partial z_{km}}\cdot\frac{\partial z_{km}}{\partial w_i^{km}},\quad i=0,1,\dots,N_{m-1}$$

Derivative of the Square Error Function
$$\frac{\partial E(W)}{\partial \delta^*_{km}} = \frac{\partial}{\partial \delta^*_{km}}\left(\frac{1}{2}\sum_{k}(\delta^*_{km})^2\right) = \frac{1}{2}\cdot\frac{\partial(\delta^*_{km})^2}{\partial \delta^*_{km}} = \delta^*_{km} = D_{km}-Y_{km}$$
$$Y_{km} = \varphi_{km}(z_{km}),\qquad \frac{\partial \delta^*_{km}}{\partial z_{km}} = -\frac{\partial Y_{km}}{\partial z_{km}} = -\varphi'_{km}(z_{km})$$
$$z_{km} = w_0^{km} + w_1^{km}Y_{1,m-1} + \dots + w_{N_{m-1}}^{km}Y_{N_{m-1},m-1};\qquad \frac{\partial z_{km}}{\partial w_i^{km}} = Y_{i,m-1},\; i=0,1,\dots,N_{m-1}$$

Derivative of the Square Error Function
$$\frac{\partial E(W)}{\partial w_i^{km}} = \frac{\partial E(W)}{\partial z_{km}}\cdot\frac{\partial z_{km}}{\partial w_i^{km}} = -\delta^*_{km}\,\varphi'_{km}(z_{km})\,Y_{i,m-1},\quad i=0,1,\dots,N_{m-1};\qquad Y_{0,m-1}\equiv 1$$

Correction of the output neuron weights
• To correct the $km$th output neuron weights, we obtain the following adjusting term
$$\Delta w_i^{km} = -\beta\frac{\partial E(W)}{\partial w_i^{km}} = \beta\,\delta^*_{km}\,\varphi'_{km}(z_{km})\,\frac{\partial z_{km}}{\partial w_i^{km}} = \begin{cases}\beta\,\delta_{km}\,Y_{i,m-1}, & i=1,\dots,N_{m-1}\\ \beta\,\delta_{km}, & i=0\end{cases}$$
$$\delta_{km} = \delta^*_{km}\,\varphi'_{km}(z_{km});\quad k=1,\dots,N_m$$
• $\delta_{km}$ is the local error of the $km$th output neuron
• $\beta>0$ is the learning rate

Correction of the output neuron weights
• Finally, to correct the $km$th output neuron weights, we obtain the following adjusting term
$$\Delta w_i^{km} = -\beta\frac{\partial E(W)}{\partial w_i^{km}} = \begin{cases}\beta\,\delta_{km}\,Y_{i,m-1}, & i=1,\dots,N_{m-1}\\ \beta\,\delta_{km}, & i=0\end{cases}$$
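The output-layer adjustment derived above can be written compactly in code. The sketch below is an illustration under assumptions, not the lecture's implementation: the names (`update_output_layer`, `W_out`, `phi_prime`) are hypothetical, the bias weight is stored in column 0 as in the earlier sketch, and the update implements $\Delta w_i^{km} = \beta\,\delta_{km}Y_{i,m-1}$ with $\delta_{km} = \delta^*_{km}\varphi'(z_{km})$.

```python
import numpy as np

def update_output_layer(W_out, y_prev, y_out, d_out, z_out, phi_prime, beta=0.1):
    """One error-correction step for the output layer of an MLF.

    W_out     : weight matrix of the output layer (row k = weights of neuron k,
                column 0 = bias weight w_0).
    y_prev    : outputs Y_{i,m-1} of the last hidden layer.
    y_out     : actual outputs Y_km; d_out : desired outputs D_km.
    z_out     : weighted sums z_km of the output neurons.
    phi_prime : derivative of the activation function, phi'(z).
    Returns the updated weights and the local errors delta_km, which are then
    backpropagated to the hidden layers.
    """
    network_error = d_out - y_out                         # delta*_km = D_km - Y_km
    delta = network_error * phi_prime(z_out)              # delta_km = delta*_km * phi'(z_km)
    y_with_bias = np.concatenate(([1.0], y_prev))         # Y_{0,m-1} = 1 for the bias weight
    W_new = W_out + beta * np.outer(delta, y_with_bias)   # w_i^{km} += beta * delta_km * Y_{i,m-1}
    return W_new, delta
```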
Error Backpropagation
• To propagate the output neurons' errors to the neurons of all the hidden layers, a sequential error backpropagation through the network has to be performed: from the $m$th layer to the $(m-1)$st one, from the $(m-1)$st to the $(m-2)$nd one, ..., from the 3rd to the 2nd one, and from the 2nd to the 1st one.
• When the error is propagated from the layer $j+1$ to the layer $j$, the local error of each neuron of the $(j+1)$st layer is multiplied by the weight of the path connecting the corresponding input of this neuron at the $(j+1)$st layer with the corresponding output of the neuron at the $j$th layer.

Error Backpropagation
• The error of the $k$th neuron from the $j$th layer is obtained according to the error backpropagation rule as follows
$$\delta_{kj} = \varphi'_{kj}(z_{kj})\sum_{i=1}^{N_{j+1}}\delta_{i,j+1}\,w_k^{i,j+1};\quad k=1,\dots,N_j$$

Derivative of the Logistic Function
• The derivative of the activation function is a part of the error backpropagation equation.
• A sigmoid function (particularly, the logistic function) has a "beautiful" derivative (suppose, for simplicity, $\alpha=1$):
$$\varphi(z) = \frac{1}{1+e^{-z}}$$
$$\varphi'(z) = \left((1+e^{-z})^{-1}\right)' = -(1+e^{-z})^{-2}\cdot(-e^{-z}) = \frac{1}{1+e^{-z}}\cdot\frac{e^{-z}}{1+e^{-z}} = \varphi(z)\bigl(1-\varphi(z)\bigr)$$
• because
$$1-\varphi(z) = 1-\frac{1}{1+e^{-z}} = \frac{e^{-z}}{1+e^{-z}}$$

Error Backpropagation Rule for the Logistic Activation Function
• The error of the $k$th neuron from the $j$th layer with the logistic activation function is finally obtained according to the error backpropagation rule as follows
$$\delta_{kj} = \varphi_{kj}(z_{kj})\bigl(1-\varphi_{kj}(z_{kj})\bigr)\sum_{i=1}^{N_{j+1}}\delta_{i,j+1}\,w_k^{i,j+1};\quad k=1,\dots,N_j$$

Correction of the Hidden Neurons Weights
• Finally, to correct the $kj$th hidden neuron weights, we obtain the following adjusting terms:
• For the 1st hidden layer
$$\Delta w_i^{k1} = -\beta\frac{\partial E(W)}{\partial w_i^{k1}} = \begin{cases}\beta\,\delta_{k1}\,x_i, & i=1,\dots,n\\ \beta\,\delta_{k1}, & i=0\end{cases}$$
• For the other hidden layers ($j=2,\dots,m-1$)
$$\Delta w_i^{kj} = -\beta\frac{\partial E(W)}{\partial w_i^{kj}} = \begin{cases}\beta\,\delta_{kj}\,Y_{i,j-1}, & i=1,\dots,N_{j-1}\\ \beta\,\delta_{kj}, & i=0\end{cases}$$

Error-Correction Rule for the MLF neurons
$$\tilde{w}_i^{kj} = w_i^{kj} + \Delta w_i^{kj};\quad i=0,\dots,N_{j-1};\; k=1,\dots,N_j;\; j=1,\dots,m$$
(here $N_0 = n$ is the number of the network inputs)

MLF Learning Algorithm
• Step 1. Find the MSE (or RMSE) over all the learning samples. If it drops below a pre-determined threshold, the learning process is completed. If not, set i=1.
• Step 2. Find the network error for the learning sample S_i and backpropagate it, step by step, to find the local errors of all the output and hidden neurons.
• Step 3. Correct the weights of all neurons, from the 1st hidden layer, then the 2nd hidden layer, etc., up to the output layer, according to the error-correction learning rule.
• Step 4. Set i=i+1. If i>N (the number of the learning samples), then go to Step 1; otherwise go to Step 2.
• (A code sketch of this learning loop is given at the end of the section.)

Local Minima Phenomenon
• The backpropagation learning algorithm for the MLF is developed as a method of solving an optimization problem. Its target is to find a global minimum of the error function. Like all other such optimization methods, it suffers from the "local minima" problem.

Local Minima Phenomenon
• The error function may have many local minima. The gradient descent method, which is used in the MLF backpropagation learning algorithm, may lead the learning process to the closest local minimum, where the learning process may get stuck. This is a serious problem, and it has no regular solution.
• The only method of jumping over a local minimum is to "play" with the learning rate $\beta$, increasing the learning step.
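To tie the pieces together, the learning algorithm (Steps 1-4) can be sketched for the simplest case of one hidden layer and the logistic activation. This is a minimal illustration under those assumptions, with hypothetical names (`train_mlf`, `n_hidden`, `mse_threshold`); it is not a reference implementation of the lecture's algorithm.

```python
import numpy as np

def logistic(z):
    """Logistic activation: phi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def train_mlf(X, D, n_hidden=4, beta=0.1, mse_threshold=1e-3, max_epochs=10000,
              rng=np.random.default_rng(0)):
    """Backpropagation training of an MLF with a single hidden layer (m = 2).

    X : (N, n) array of input samples; D : (N, N_m) array of desired outputs.
    W1, W2 store the bias weight w_0 in column 0, as in the sketches above.
    """
    N, n = X.shape
    N_m = D.shape[1]
    W1 = rng.uniform(-0.5, 0.5, size=(n_hidden, n + 1))    # hidden layer weights
    W2 = rng.uniform(-0.5, 0.5, size=(N_m, n_hidden + 1))  # output layer weights
    mse = np.inf

    for epoch in range(max_epochs):
        # Step 1: compute the MSE over all learning samples with the current weights.
        errors = []
        for x, d in zip(X, D):
            y1 = logistic(W1 @ np.concatenate(([1.0], x)))
            y2 = logistic(W2 @ np.concatenate(([1.0], y1)))
            errors.append(np.mean((d - y2) ** 2))
        mse = float(np.mean(errors))
        if mse < mse_threshold:
            break

        # Steps 2-4: loop over the learning samples, backpropagate, correct the weights.
        for x, d in zip(X, D):
            # forward pass
            x_b = np.concatenate(([1.0], x))
            y1 = logistic(W1 @ x_b)
            y1_b = np.concatenate(([1.0], y1))
            y2 = logistic(W2 @ y1_b)

            # Step 2: local errors, using the logistic derivative phi'(z) = y * (1 - y)
            delta2 = (d - y2) * y2 * (1.0 - y2)                # output layer: delta*_km * phi'(z_km)
            delta1 = y1 * (1.0 - y1) * (W2[:, 1:].T @ delta2)  # hidden layer (bias weights excluded)

            # Step 3: error-correction rule, hidden layer first, then the output layer
            W1 = W1 + beta * np.outer(delta1, x_b)
            W2 = W2 + beta * np.outer(delta2, y1_b)

    return W1, W2, mse
```

Calling `train_mlf(X, D)` with `X` of shape (N, n) and `D` of shape (N, N_m) runs forward passes, backpropagates the local errors, and applies the error-correction rule sample by sample until the MSE drops below the threshold or the epoch limit is reached; being plain gradient descent, it is of course still subject to the local-minima issue discussed above.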