CS-363 Neural Networks and Machine Learning
MULTILAYER FEEDFORWARD NEURAL
NETWORK. ERROR BACKPROPAGATION.
LEARNING ALGORITHM
1
MLF: Learning Problem
• How can we train an MLF?
• The error of the entire network can be calculated, because its desired outputs are known; for the hidden neurons, however, the desired outputs are unknown…
• How can this problem be solved?
2
Error Backpropagation
• The heuristic idea behind error backpropagation is to share the errors of the output neurons, which can be calculated because their desired outputs are known, with all the hidden neurons
• Error backpropagation is the sequential propagation of the neurons' errors from the output layer through all the layers, from the "right-hand" side to the "left-hand" side, up to the first hidden layer, in order to calculate the errors of all the other neurons
3
Multilayer Feedforward Neural
Network (MLF)
[Figure: an MLF with inputs x1, x2, …, xn; signals flow from the inputs toward the output layer, while errors flow in the opposite direction (error backpropagation)]
4
MLF Learning
• Basically, the entire learning process consists of two
passes through all the layers of the network: a forward
pass and a backward pass.
• In the forward pass, the inputs are propagated from the input layer of the network to the first hidden layer, and then, layer by layer, the output signals of the hidden neurons are propagated to the corresponding inputs of the neurons of the following layer. Finally, a set of outputs is produced as the actual response of the network (a minimal sketch of this pass is given below).
• During the forward pass the synaptic weights of the
network are all fixed.
5
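The following is a minimal Python/NumPy sketch of the forward pass just described; the weight layout (one row per neuron, bias weight in column 0) and the logistic activation are assumptions for illustration, not part of the slides.

```python
import numpy as np

def logistic(z):
    # Logistic activation, assumed here for all neurons
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(x, weights):
    """Propagate the input x through all m layers.
    weights[j] has shape (N_j, N_{j-1} + 1); column 0 holds the bias weight w_0,
    whose input is fixed to 1."""
    outputs = []                       # Y_kj of every layer (kept for the backward pass)
    y = np.asarray(x, dtype=float)
    for W in weights:                  # the synaptic weights stay fixed during this pass
        z = W[:, 0] + W[:, 1:] @ y     # weighted sums z_kj of the current layer
        y = logistic(z)                # outputs Y_kj of the current layer
        outputs.append(y)
    return outputs                     # outputs[-1] is the actual response of the network
```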
MLF Learning
• During the backward pass, first the errors of all the neurons are calculated, and then the weights of all the neurons are adjusted in accordance with the learning rule.
• One complete iteration (epoch) of the learning
process consists of a forward pass and a
backward pass.
6
MLF Learning
• The MLF learning algorithm is based on the
generalization of the error-correction learning rule for
the case of MLF.
• Specifically, the actual response of the network is
subtracted from a desired response to produce an
error signal. This error signal is then propagated
backward through the network, against the direction of
synaptic connections – hence the name
"backpropagation".
• The weights are adjusted so as to make the actual
output of the network move closer to the desired
output.
7
MLF Learning
• The learning algorithm for the classical MLF is
derived from the consideration that the global
error of the network in terms of the mean
square error (MSE) or root mean square error
(RMSE) must be minimized.
• This minimization is treated as an optimization problem: finding a global minimum of the error function (stated formally below)
8
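Stated formally (this just restates the bullet above; $W$ denotes the vector of all network weights):

$$W^{*} = \arg\min_{W} E(W)$$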
Paul Werbos
• Suggested and developed the
error backpropagation learning
algorithm in his Harvard Ph.D.
dissertation in 1974
• (it was not published at that time and remained unknown to the research community until 1986, when the algorithm was re-invented by D. Rumelhart)
9
David Rumelhart (1942-2011)
• Re-invented error backpropagation and redeveloped it; presented the MLF learning algorithm in 1986.
10
Feedforward Neural Network
• Let us use the following notations
• $m$ – the number of layers
• $w_i^{kj}$ – the $i$th weight of the $k$th neuron from the $j$th layer
• $D_{kj}$ – a desired output of the $k$th neuron from the $j$th layer
• $Y_{kj}$ – an actual output of the $k$th neuron from the $j$th layer
• $\delta_{km}^{*} = D_{km} - Y_{km}$ – the network error for the $k$th neuron from the output layer
• $\delta_{km}$ – the error for the $k$th neuron from the output layer
• $\delta_{kj}$ – the error for the $k$th neuron from the $j$th layer
• $N_j$ – the number of all neurons in the layer $j$
11
Error of the network
• MSE of the network is
$$E = \frac{1}{N}\sum_{s=1}^{N} E_{s}$$
• $N$ is the total number of samples (patterns) in the learning set
• $E_{s} = \left(D_{s} - Y_{s}\right)^{2} = \left(\delta_{s}^{*}\right)^{2},\; s = 1,\ldots,N$ is the square error of the network for the $s$th pattern for a single output neuron
• $$E_{s} = \frac{1}{N_{m}}\sum_{k=1}^{N_{m}}\left(D_{ks} - Y_{ks}\right)^{2} = \frac{1}{N_{m}}\sum_{k=1}^{N_{m}}\left(\delta_{ks}^{*}\right)^{2},\quad s = 1,\ldots,N$$ is the square error of the network for the $s$th pattern for $N_{m}$ output neurons (a small numeric sketch of these formulas is given below)
12
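A small numeric sketch of these formulas; the toy arrays D and Y (desired and actual outputs for N = 2 samples and N_m = 2 output neurons) are assumptions for illustration.

```python
import numpy as np

# Toy desired and actual outputs: N = 2 samples, N_m = 2 output neurons
D = np.array([[1.0, 0.0],
              [0.0, 1.0]])
Y = np.array([[0.8, 0.1],
              [0.3, 0.7]])

E_s = ((D - Y) ** 2).mean(axis=1)  # square error per pattern: [0.025, 0.09]
E = E_s.mean()                     # MSE of the network: 0.0575
```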
Square Error
• For simplicity, but without loss of generality, instead of MSE minimization we may consider minimization of the square error (SE)
$$E = \sum_{k=1}^{N_{m}}\left(\delta_{km}^{*}\right)^{2}$$
• where $\delta_{km}^{*} = D_{ks} - Y_{ks},\; s = 1,\ldots,N$ is the error of the $k$th output neuron for the $s$th learning sample
13
Error Function
• To minimize the square error (SE) by changing
the weights, we have to consider SE as a
function of the weights.
• We will consider minimization of the function
$$E(W) = \frac{1}{2}\sum_{k=1}^{N_{m}}\left(\delta_{km}^{*}\right)^{2}$$
• the factor 1/2 is used to simplify subsequent derivations resulting from the minimization of $E$ (see the worked derivative below)
14
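The factor 1/2 cancels the exponent 2 when the square error is differentiated, e.g. for a single term:

$$\frac{\partial}{\partial Y_{km}}\left(\frac{1}{2}\left(D_{km}-Y_{km}\right)^{2}\right) = -\left(D_{km}-Y_{km}\right) = -\delta_{km}^{*}$$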
Correction of the weights
• To ensure movement toward the global minimum on each iteration, the correction of the weights of all the neurons has to be organized in such a way that each weight $w_{i}^{kj}$ is corrected by an amount $\Delta w_{i}^{kj}$, which must be proportional to the partial derivative $\partial E / \partial w_{i}^{kj}$ of the error function $E(W)$ with respect to that weight (written as a gradient-descent step below)
15
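Written as a gradient-descent step (with $\beta > 0$ the learning rate introduced on a later slide):

$$\Delta w_{i}^{kj} = -\beta\,\frac{\partial E(W)}{\partial w_{i}^{kj}}$$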
Derivative of the Square Error
• Let $Y_{kj} = \varphi_{kj}\left(z_{kj}\right)$ be the output value of the $k$th neuron at the $j$th layer, represented as a function of its weighted sum
• Evidently, $E(W) = E\bigl(\varphi\bigl(z(W)\bigr)\bigr)$
• Applying the chain rule for the differentiation, for the $k$th neuron at the output ($m$th) layer we obtain
$$\frac{\partial E(W)}{\partial w_{i}^{km}} = \frac{\partial E(W)}{\partial \varphi_{km}}\,\frac{\partial \varphi_{km}}{\partial z_{km}}\,\frac{\partial z_{km}}{\partial w_{i}^{km}}, \qquad i = 0,1,\ldots,N_{m-1}$$
16
Derivative of the Square Error Function
$$\frac{\partial E(W)}{\partial \varphi_{km}} = \frac{\partial}{\partial \varphi_{km}}\left(\frac{1}{2}\sum_{k}\left(\delta_{km}^{*}\right)^{2}\right) = \frac{1}{2}\cdot 2\,\delta_{km}^{*}\,\frac{\partial \delta_{km}^{*}}{\partial \varphi_{km}} = -\delta_{km}^{*}, \qquad \text{since } \delta_{km}^{*} = D_{km} - Y_{km} = D_{km} - \varphi_{km}\left(z_{km}\right);$$
$$\frac{\partial \varphi_{km}}{\partial z_{km}} = \varphi'_{km}\left(z_{km}\right);$$
$$\frac{\partial z_{km}}{\partial w_{i}^{km}} = \frac{\partial}{\partial w_{i}^{km}}\left(w_{0}^{km} + w_{1}^{km}Y_{1,m-1} + \ldots + w_{N_{m-1}}^{km}Y_{N_{m-1},m-1}\right) = Y_{i,m-1}, \qquad i = 0,1,\ldots,N_{m-1}.$$
17
Derivative of the Square Error Function
• Combining the three factors of the chain rule:
$$\frac{\partial E(W)}{\partial w_{i}^{km}} = \frac{\partial E(W)}{\partial \varphi_{km}}\,\frac{\partial \varphi_{km}}{\partial z_{km}}\,\frac{\partial z_{km}}{\partial w_{i}^{km}} = -\delta_{km}^{*}\,\varphi'_{km}\left(z_{km}\right)\,Y_{i,m-1}, \qquad i = 0,1,\ldots,N_{m-1}; \quad Y_{0,m-1} = 1$$
18
Correction of the output neuron weights
• To correct the $km$th output neuron weights, we obtain the following adjusting term
$$\Delta w_{i}^{km} = -\beta\,\frac{\partial E(W)}{\partial w_{i}^{km}} = \begin{cases} \beta\,\delta_{km}^{*}\,\varphi'_{km}\left(z_{km}\right)\,Y_{i,m-1}, & i = 1,\ldots,N_{m-1}\\ \beta\,\delta_{km}^{*}\,\varphi'_{km}\left(z_{km}\right), & i = 0 \end{cases}$$
$$\delta_{km}^{*}\,\varphi'_{km}\left(z_{km}\right) = \delta_{km}; \qquad k = 1,\ldots,N_{m}$$
• $\delta_{km}$ is the local error of the $km$th output neuron
• $\beta > 0$ is the learning rate
19
Correction of the output neuron weights
• Finally, to correct the $km$th output neuron weights, we obtain the following adjusting term (an update sketch in code is given below)
$$\Delta w_{i}^{km} = -\beta\,\frac{\partial E(W)}{\partial w_{i}^{km}} = \begin{cases} \beta\,\delta_{km}\,Y_{i,m-1}, & i = 1,\ldots,N_{m-1}\\ \beta\,\delta_{km}, & i = 0 \end{cases}$$
20
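A minimal sketch of this output-layer update in Python/NumPy, assuming the logistic activation (so that $\varphi'(z) = Y(1-Y)$); the function and variable names are illustrative assumptions.

```python
import numpy as np

def update_output_layer(W_out, d, y_out, y_prev, beta):
    """One error-correction step for the output layer.
    W_out:   (N_m, N_{m-1}+1) weights, column 0 = bias weight w_0
    d, y_out: desired and actual outputs of the output neurons
    y_prev:  outputs Y_{i,m-1} of the last hidden layer."""
    delta_star = d - y_out                          # network errors D_km - Y_km
    delta = delta_star * y_out * (1 - y_out)        # local errors delta_km (logistic phi')
    W_out[:, 0] += beta * delta                     # i = 0: bias weights
    W_out[:, 1:] += beta * np.outer(delta, y_prev)  # i >= 1: beta * delta_km * Y_{i,m-1}
    return delta                                    # delta is reused for backpropagation
```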
Error Backpropagation
• To propagate the output neurons' errors to the neurons of all the hidden layers, a sequential error backpropagation through the network has to be done: from the $m$th layer to the $(m-1)$st, from the $(m-1)$st to the $(m-2)$nd, ..., from the 3rd to the 2nd, and from the 2nd to the 1st.
• When the error is propagated from layer $j+1$ to layer $j$, the local error of each neuron of the $(j+1)$st layer is multiplied by the weight of the path connecting the corresponding input of this neuron at the $(j+1)$st layer with the corresponding output of the neuron at the $j$th layer.
21
Error Backpropagation
• The error of the $k$th neuron from the $j$th layer is obtained according to the error backpropagation rule as follows (a sketch of this step in code is given below)
$$\delta_{kj} = \varphi'_{kj}\left(z_{kj}\right)\sum_{i=1}^{N_{j+1}}\delta_{i,j+1}\,w_{k}^{i,j+1}; \qquad k = 1,\ldots,N_{j}$$
22
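A sketch of this backpropagation step in Python/NumPy; the weight layout (one row per neuron of layer j+1, bias weight in column 0) is an assumption carried over from the earlier sketches.

```python
import numpy as np

def backpropagate_errors(delta_next, W_next, phi_prime_z):
    """Local errors of layer j from the local errors of layer j+1.
    delta_next:  (N_{j+1},) local errors delta_{i,j+1}
    W_next:      (N_{j+1}, N_j + 1) weights of layer j+1 (column 0 = bias weight)
    phi_prime_z: (N_j,) values of phi'_kj(z_kj) for the neurons of layer j."""
    weighted_sum = W_next[:, 1:].T @ delta_next   # sum_i delta_{i,j+1} * w_k^{i,j+1}
    return phi_prime_z * weighted_sum             # delta_kj for k = 1..N_j
```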
Derivative of the Logistic Function
• The derivative of the activation function is a part of the error backpropagation equation.
• A sigmoid function (particularly, the logistic function) has a "beautiful" derivative (suppose, for simplicity, α = 1; a quick numeric check is given below):
$$\varphi'(z) = \left(\frac{1}{1+e^{-z}}\right)' = -\left(1+e^{-z}\right)^{-2}\left(1+e^{-z}\right)' = -\left(1+e^{-z}\right)^{-2}\left(-e^{-z}\right) = \frac{e^{-z}}{\left(1+e^{-z}\right)\left(1+e^{-z}\right)} = \varphi(z)\,\frac{e^{-z}}{1+e^{-z}} = \varphi(z)\bigl(1-\varphi(z)\bigr)$$
• because
$$1-\varphi(z) = 1-\frac{1}{1+e^{-z}} = \frac{1+e^{-z}-1}{1+e^{-z}} = \frac{e^{-z}}{1+e^{-z}}$$
23
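A quick numeric check of this identity (purely illustrative):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 101)
analytic = logistic(z) * (1.0 - logistic(z))             # phi(z) * (1 - phi(z))
h = 1e-6
numeric = (logistic(z + h) - logistic(z - h)) / (2 * h)  # central finite difference
print(np.max(np.abs(analytic - numeric)))                # tiny: the identity holds
```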
Error Backpropagation Rule for the
Logistic Activation Function
• The error of the kth neuron from the jth layer
with the logistic activation function is finally
obtained according to the error
backpropagation rule as follows
N j 1
 kj  kj  zkj   1  kj ( zkj )    i , j 1wki , j 1; k  1,..., N j
i 1
24
Correction of the
Hidden Neurons Weights
• Finally, to correct the $kj$th neuron weights, we obtain the following adjusting terms:
• For the 1st hidden layer
$$\Delta w_{i}^{k1} = -\beta\,\frac{\partial E(W)}{\partial w_{i}^{k1}} = \begin{cases} \beta\,\delta_{k1}\,x_{i}, & i = 1,\ldots,n\\ \beta\,\delta_{k1}, & i = 0 \end{cases}$$
• For the other hidden layers ($j = 2,\ldots,m-1$)
$$\Delta w_{i}^{kj} = -\beta\,\frac{\partial E(W)}{\partial w_{i}^{kj}} = \begin{cases} \beta\,\delta_{kj}\,Y_{i,j-1}, & i = 1,\ldots,N_{j-1}\\ \beta\,\delta_{kj}, & i = 0 \end{cases}$$
25
Error-Correction Rule for
the MLF neurons
w  w  w ;
kj
i
kj
i
kj
i
i  0,..., N j 1; k  1,..., N j ; j  1,..., m
26
MLF Learning Algorithm
• Step 1. Find the MSE (or RMSE) over all the learning samples. If it drops below a pre-determined threshold, the learning process is completed. If not, set i = 1
• Step 2. Find the network error for the learning sample $S_i$ and backpropagate it step by step, to find the local errors of all the output and hidden neurons
• Step 3. Correct the weights of all neurons from the 1st hidden layer, then from the 2nd hidden layer, etc., up to the output layer, according to the error-correction learning rule
• Step 4. Set i = i + 1. If i > N (the number of learning samples), then go to Step 1; otherwise go to Step 2 (a compact sketch of this loop is given below)
27
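A compact, self-contained Python/NumPy sketch of Steps 1–4 for an MLF with one hidden layer and logistic activations; the network size, learning rate, threshold and the XOR example are assumptions for illustration, not part of the slides.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlf(X, D, n_hidden=4, beta=0.5, threshold=1e-3, max_epochs=20000, seed=0):
    """Steps 1-4 of the slide for an MLF with one hidden layer (logistic activations).
    X: (N, n) learning samples, D: (N, N_m) desired outputs."""
    rng = np.random.default_rng(seed)
    n, n_out = X.shape[1], D.shape[1]
    W1 = rng.normal(scale=0.5, size=(n_hidden, n + 1))   # column 0 = bias weight
    W2 = rng.normal(scale=0.5, size=(n_out, n_hidden + 1))
    for _ in range(max_epochs):
        # Step 1: MSE over all learning samples; stop if it is below the threshold
        Y1 = logistic(W1[:, 0] + X @ W1[:, 1:].T)
        Y2 = logistic(W2[:, 0] + Y1 @ W2[:, 1:].T)
        if np.mean((D - Y2) ** 2) < threshold:
            break
        for x, d in zip(X, D):                           # Steps 2-4: sample by sample
            y1 = logistic(W1[:, 0] + W1[:, 1:] @ x)      # forward pass
            y2 = logistic(W2[:, 0] + W2[:, 1:] @ y1)
            d2 = (d - y2) * y2 * (1 - y2)                # Step 2: output local errors
            d1 = y1 * (1 - y1) * (W2[:, 1:].T @ d2)      # Step 2: backpropagated errors
            W1[:, 0] += beta * d1                        # Step 3: correct 1st hidden layer
            W1[:, 1:] += beta * np.outer(d1, x)
            W2[:, 0] += beta * d2                        #         then the output layer
            W2[:, 1:] += beta * np.outer(d2, y1)
    return W1, W2

# Example: the XOR problem (may converge to a local minimum; cf. the next slides)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = train_mlf(X, D)
```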
Local Minima Phenomenon
• The backpropagation learning algorithm for MLF is developed as a method of solving an optimization problem: its target is to find a global minimum of the error function. Like all other such optimization methods, it suffers from the "local minima" problem
28
Local Minima Phenomenon
• The error function may have many local minima. The gradient-descent method used in the MLF backpropagation learning algorithm may lead the learning process to the closest local minimum, where it may get stuck. This is a serious problem, and it has no general solution
• The only way to jump over a local minimum is to "play" with the learning rate β, increasing the learning step
29