
Back
Propagation
Amir Ali Hooshmandan
Mehran Najafi
Mohamad Ali Honarpisheh
Contents
• What is it?
• History
• Architecture
• Activation Function
• Learning Algorithm
• EBP Heuristics
• How Long to Train
• Virtues and Limitations of BP
• About Initialization
• Accelerating Training
• An Application
• Different Problems Require Different Learning Rate Adaptive Methods
What is it?
• A supervised learning algorithm
• Based on the error-correction learning rule
• A generalization of the adaptive filtering algorithm
History
• 1986
  – Rumelhart
    • Paper: "Why are 'what' and 'where' processed by separate cortical visual systems?"
    • Book: Parallel Distributed Processing: Explorations in the Microstructure of Cognition
  – Parker
    • Optimal algorithms for adaptive networks: second order back propagation, second order direct propagation
• 1974 & 1969
Architecture
(Figure: three-layer network with input units Xi (layer i), weights Vi,j, hidden units Zj (layer j), weights Wj,k, and output units Zk (layer k).)
Activation Function
$$f(v_j(n)) = \frac{1}{1 + \exp(-a\, v_j(n))}$$
Characteristics:
$$f'(v_j(n)) = a\, y_j(n)\,[1 - y_j(n)]$$
where $y_j(n)$ is the output of neuron $j$.
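As a quick illustration, here is a minimal NumPy sketch of this logistic activation and its derivative (the slope parameter a and the function names are ours, not from the slides):

```python
import numpy as np

def sigmoid(v, a=1.0):
    """Logistic activation f(v) = 1 / (1 + exp(-a*v))."""
    return 1.0 / (1.0 + np.exp(-a * v))

def sigmoid_prime(v, a=1.0):
    """Derivative expressed through the output: f'(v) = a * y * (1 - y)."""
    y = sigmoid(v, a)
    return a * y * (1.0 - y)
```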
Learning Algorithm
Error of output neuron k at step n:
$$e_{y_k}(n) = d_k(n) - y_k(n)$$
Total error energy at the output of the net:
$$E(n) = \frac{1}{2} \sum_{k \in C} e_{y_k}^2(n)$$
where $\frac{1}{2}\, e_{y_k}^2(n)$ is the energy of the error produced by output neuron $k$ in epoch $n$.
Learning Algorithm (cont'd)
Purpose: minimizing $E(n)$, which requires the gradient $\dfrac{\partial E(n)}{\partial w_{j,k}(n)}$.
$$E(n) = \frac{1}{2} \sum_{k \in C} e_{y_k}^2(n) \quad\Rightarrow\quad \frac{\partial E(n)}{\partial e_{y_k}(n)} = e_{y_k}(n)$$
$$e_{y_k}(n) = d_k(n) - y_k(n) \quad\Rightarrow\quad \frac{\partial e_{y_k}(n)}{\partial y_k(n)} = -1$$
$$y_k(n) = f(y\_in_k(n)) \quad\Rightarrow\quad \frac{\partial y_k(n)}{\partial y\_in_k(n)} = f'(y\_in_k(n))$$
Local field:
$$y\_in_k(n) = \sum_{j \in H} w_{j,k}(n)\, z_j(n) \quad\Rightarrow\quad \frac{\partial y\_in_k(n)}{\partial w_{j,k}(n)} = z_j(n)$$
Learning Algorithm (cont'd)
Chain rule in the derivation:
$$\frac{\partial E(n)}{\partial w_{j,k}(n)} = \frac{\partial E(n)}{\partial e_{y_k}(n)} \cdot \frac{\partial e_{y_k}(n)}{\partial y_k(n)} \cdot \frac{\partial y_k(n)}{\partial y\_in_k(n)} \cdot \frac{\partial y\_in_k(n)}{\partial w_{j,k}(n)}$$
$$\frac{\partial E(n)}{\partial w_{j,k}(n)} = e_{y_k}(n) \cdot (-1) \cdot f'(y\_in_k(n)) \cdot z_j(n)$$
Local gradient:
$$\delta_k(n) = -\frac{\partial E(n)}{\partial y\_in_k(n)} = e_{y_k}(n)\, f'(y\_in_k(n))$$
Computing Δvi,j for non-output layers
Problem: a hidden neuron has no explicit error term of its own, because it contributes to the errors of many output neurons. We must find another way to compute δj:
$$\delta_j(n) = -\frac{\partial E(n)}{\partial z\_in_j(n)}$$
Chain rule:
$$\delta_j(n) = -\frac{\partial E(n)}{\partial z_j(n)} \cdot \frac{\partial z_j(n)}{\partial z\_in_j(n)}$$
Computing Δvi,j for non-output layers (cont'd)
$$E(n) = \frac{1}{2} \sum_{k \in C} e_{y_k}^2(n) \quad\Rightarrow\quad \frac{\partial E(n)}{\partial z_j(n)} = \sum_{k \in C} e_{y_k}(n)\, \frac{\partial e_{y_k}(n)}{\partial z_j(n)}$$
$$\frac{\partial e_{y_k}(n)}{\partial z_j(n)} = \frac{\partial e_{y_k}(n)}{\partial y\_in_k(n)} \cdot \frac{\partial y\_in_k(n)}{\partial z_j(n)}$$
$$e_{y_k}(n) = d_k(n) - y_k(n) = d_k(n) - f(y\_in_k(n)) \quad\Rightarrow\quad \frac{\partial e_{y_k}(n)}{\partial y\_in_k(n)} = -f'(y\_in_k(n))$$
$$y\_in_k(n) = \sum_{j \in H} w_{j,k}(n)\, z_j(n) \quad\Rightarrow\quad \frac{\partial y\_in_k(n)}{\partial z_j(n)} = w_{j,k}(n)$$
Computing Δvi,j for non-output layers (cont'd)
$$\frac{\partial E(n)}{\partial z_j(n)} = \sum_{k \in C} e_{y_k}(n)\, \frac{\partial e_{y_k}(n)}{\partial z_j(n)} = -\sum_{k \in C} e_{y_k}(n)\, f'(y\_in_k(n))\, w_{j,k}(n) = -\sum_{k \in C} \delta_k(n)\, w_{j,k}(n)$$
$$z_j(n) = f(z\_in_j(n)) \quad\Rightarrow\quad \frac{\partial z_j(n)}{\partial z\_in_j(n)} = f'(z\_in_j(n))$$
$$\delta_j(n) = -\frac{\partial E(n)}{\partial z_j(n)} \cdot \frac{\partial z_j(n)}{\partial z\_in_j(n)} = f'(z\_in_j(n)) \sum_{k \in C} \delta_k(n)\, w_{j,k}(n)$$
Computing Weight Correction
(weight correction) = (learning rate parameter) × (local gradient) × (input signal of the previous-layer neuron)
$$\Delta v_{i,j}(n) = \eta\, \delta_j(n)\, x_i(n)$$
$$\Delta w_{j,k}(n) = \eta\, \delta_k(n)\, z_j(n)$$
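As a small sketch of these two rules in NumPy (variable names are ours; eta is the learning rate, delta_j and delta_k the local gradients of the hidden and output layers):

```python
import numpy as np

def weight_corrections(eta, delta_j, delta_k, x, z):
    """Weight corrections: learning rate * local gradient * input of the previous layer."""
    dV = eta * np.outer(x, delta_j)   # Δv_ij = η δ_j x_i (input-to-hidden)
    dW = eta * np.outer(z, delta_k)   # Δw_jk = η δ_k z_j (hidden-to-output)
    return dV, dW
```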
Training Algorithm
Step 0: Initialize weights
(set to random values with zero mean and unit variance).
Step 1: While the stopping condition is false, do Steps 2–9.
Step 2: For each training pair, do Steps 3–8.
Feedforward:
Step 3: Each input unit (Xi, i = 1,…,n) receives input signal xi and broadcasts this signal to all units in the layer above (the hidden units).
Training Algorithm (cont'd)
Step 4: Each hidden unit (Zj, j = 1,…,p) sums its weighted input signals,
$$z\_in_j = v_{0j} + \sum_{i=1}^{n} x_i v_{ij},$$
applies its activation function to compute its output signal,
$$z_j = f(z\_in_j),$$
and sends this signal to all units in the layer above (the output units).
Training Algorithm (cont'd)
Step 5: Each output unit (Yk, k = 1,…,m) sums its weighted input signals,
$$y\_in_k = w_{0k} + \sum_{j=1}^{p} z_j w_{jk},$$
and applies its activation function to compute its output signal,
$$y_k = f(y\_in_k).$$
Training Algorithm (cont'd)
Backpropagation of error:
Step 6: Each output unit (Yk, k = 1,…,m) receives a target pattern corresponding to the input training pattern and computes its error information term,
$$\delta_k = (t_k - y_k)\, f'(y\_in_k),$$
calculates its weight correction term (used to update wjk later),
$$\Delta w_{jk} = \alpha\, \delta_k\, z_j,$$
calculates its bias correction term (used to update w0k later),
$$\Delta w_{0k} = \alpha\, \delta_k,$$
and sends δk to units in the layer below.
Training Algorithm (cont'd)
Step 7: Each hidden unit (Zj, j = 1,…,p) sums its delta inputs from units in the layer above,
$$\delta\_in_j = \sum_{k=1}^{m} \delta_k w_{jk},$$
multiplies by the derivative of its activation function to calculate its error information term,
$$\delta_j = \delta\_in_j\, f'(z\_in_j),$$
calculates its weight correction term (used to update vij later),
$$\Delta v_{ij} = \alpha\, \delta_j\, x_i,$$
and calculates its bias correction term (used to update v0j later),
$$\Delta v_{0j} = \alpha\, \delta_j.$$
Training Algorithm (cont'd)
Update weights and biases:
Step 8: Each output unit (Yk, k = 1,…,m) updates its bias and weights (j = 0,…,p):
$$w_{jk}(\mathrm{new}) = w_{jk}(\mathrm{old}) + \Delta w_{jk}.$$
Each hidden unit (Zj, j = 1,…,p) updates its bias and weights (i = 0,…,n):
$$v_{ij}(\mathrm{new}) = v_{ij}(\mathrm{old}) + \Delta v_{ij}.$$
Step 9: Test the stopping condition.
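The steps above can be collected into a short program. Here is a minimal NumPy sketch of Steps 0–9 for a single hidden layer with logistic activations; the function and variable names are ours, and a fixed epoch count stands in for the stopping condition of Steps 1 and 9:

```python
import numpy as np

def train_backprop(X, T, p=4, alpha=0.2, epochs=1000, seed=None):
    """Sketch of Steps 0-9: one hidden layer, logistic activations.
    X: (N, n) inputs, T: (N, m) targets, p: hidden units, alpha: learning rate."""
    rng = np.random.default_rng(seed)
    n, m = X.shape[1], T.shape[1]
    # Step 0: initialize weights (zero mean, unit variance); row 0 holds the bias.
    V = rng.normal(0.0, 1.0, (n + 1, p))    # input  -> hidden
    W = rng.normal(0.0, 1.0, (p + 1, m))    # hidden -> output
    f = lambda s: 1.0 / (1.0 + np.exp(-s))  # activation
    fprime = lambda out: out * (1.0 - out)  # derivative written via the output

    for _ in range(epochs):                 # Step 1 (fixed-epoch stopping condition)
        for x, t in zip(X, T):              # Step 2
            # Steps 3-5: feedforward
            z = f(V[0] + x @ V[1:])
            y = f(W[0] + z @ W[1:])
            # Step 6: output error terms and corrections
            delta_k = (t - y) * fprime(y)
            dW = alpha * np.vstack([delta_k, np.outer(z, delta_k)])
            # Step 7: hidden error terms and corrections
            delta_j = (delta_k @ W[1:].T) * fprime(z)
            dV = alpha * np.vstack([delta_j, np.outer(x, delta_j)])
            # Step 8: update weights and biases
            W += dW
            V += dV
    return V, W                             # Step 9 folded into the epoch limit
```

For example, train_backprop(np.array([[0,0],[0,1],[1,0],[1,1]]), np.array([[0],[1],[1],[0]]), p=4, alpha=0.5, epochs=5000) trains it on binary XOR.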
EBP Heuristics
• Number of hidden layers:
  – Theoretical and simulation results showed that there is no need to have more than two hidden layers.
  – One or two hidden layers…?
    • Chester (1990): "Why two hidden layers are better than one"
    • Gallant (1990): "Never try a multilayer model for fitting data until you have first tried a single-layer model"
  – Both architectures are theoretically able to approximate any continuous function to the desired degree of accuracy.
EBP Heuristics (cont'd)
• Number of hidden layers (cont'd):
  – It is difficult to say which topology is better with respect to:
    • size of the NN
    • learning time
    • implementability in hardware
    • accuracy
  – Solving the problem first with a one-hidden-layer NN seems appropriate.
EBP Heuristics (cont'd)
– Every adjustable network parameter of the cost function should have its own individual learning rate (LR) parameter.
– Every learning rate parameter should be allowed to vary from one iteration to the next.
– Increase the LR of a weight whose derivative has kept the same sign for several iterations.
– Decrease the LR of a weight whose derivative alternates in sign.
How Long to Train
• The aim is to balance generalization and memorization
(minimizing the cost function is not necessarily a good idea).
  – Hecht-Nielsen (1990): use two disjoint sets for training:
    • Training set
    • Training-testing set
  – As long as the error on the training-testing set decreases, training continues.
  – When that error begins to increase, the net is starting to memorize.
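A minimal sketch of this stop rule, assuming a train_epoch / evaluate pair like the routine sketched earlier (both callables and the patience threshold are our own placeholders):

```python
def train_with_early_stop(train_epoch, evaluate, max_epochs=10000, patience=10):
    """Stop when the error on the held-out training-testing set keeps increasing."""
    best_err, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_epoch()              # one pass over the training set
        err = evaluate()           # error on the training-testing set
        if err < best_err:
            best_err, best_epoch, bad_epochs = err, epoch, 0
        else:
            bad_epochs += 1        # error increased: the net may be starting to memorize
            if bad_epochs >= patience:
                break
    return best_epoch, best_err
```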
Virtues and Limitations of BP
• Connectionism
  – Biological reasons:
    • No excitatory or inhibitory distinction as for real neurons
    • No global connections in an MLP
    • No backward propagation in real neurons
  – Useful in parallel hardware implementation
  – Fault tolerance
Virtues… (cont'd)
• Computational efficiency
  – The computational complexity of an algorithm is measured in terms of multiplications, additions, …
  – A learning algorithm is said to be computationally efficient when its complexity is polynomial.
  – The BP algorithm is computationally efficient: in an MLP with a total of W weights, its complexity is linear in W.
Virtues… (cont'd)
• Convergence
  – Saarinen (1992): local convergence rates of the BP algorithm are linear.
  – Too flat or too curved
  – Wrong direction
• Local minima
  – BP learning is basically a hill-climbing technique.
  – Presence of local minima (isolated valleys).
About Initialization
About Init… (cont'd)
• Other issues:
  – Initialization of the OL (output layer) weights shouldn't result in small weights:
    • If the output layer weights are small, then so is the contribution of the HL (hidden layer) neurons to the output error, and consequently the effect of the hidden layer weights is not visible enough.
    • If the OL weights are too small, the deltas (for the HLs) also become very small, which in turn leads to small initial changes in the hidden layer weights.
About Init… (cont'd)
• Initialization with random numbers is very important in avoiding the effects of symmetry in the network; all HL neurons should start with guaranteed different weights.
  – If they have similar (or, even worse, the same) weights, they will perform similarly (the same) on all data pairs, changing their weights in similar (the same) directions.
  – This makes the whole learning process unnecessarily long (or learning will be the same for all neurons, and there will practically be no learning).
Nguyen–Widrow Initialization
• Two-layer NNs have been proven capable of approximating any arbitrary function…
  – How does this work?
  – And a method for speeding up the training process…
Behavior of hidden nodes
• For simplicity, a two-layer network with one input is trained with the BP algorithm to approximate a function of one variable, d(x), with "x" as the input.
  – The output is a weighted sum of terms of the form f(wi x + wbi), one per hidden node.
  – Sigmoid function (tanh):
    • approximately linear with slope 1 for x between -1 and 1
    • saturates to -1 or 1 as x becomes large in magnitude
  – Each term in the above sum is a linear function of x over a small interval:
    • the size of each interval is determined by wi
    • the location of the interval is determined by wbi
  – The network learns to implement the desired function by building piecewise-linear approximations; the pieces are summed to form the complete approximation.
(Figure: approximation obtained with random initialization)
Improving Learning Speed
• Main idea:
  – Divide the desired region into small intervals.
  – Set the weights so that each hidden node is assigned to its own interval at the start of training.
  – Training then proceeds as before.
Improving … (cont'd)
• Desired region: (-1, 1), which has length 2.
• H hidden units:
  – So each hidden unit is responsible for an interval of length 2/H.
  – sigmoid(wi x + wbi) is approximately linear over the interval where -1 < wi x + wbi < 1,
  – which has length 2/wi; therefore wi is chosen so that 2/wi = 2/H, i.e. wi = H.
  – It is preferable to have the intervals overlap, so the magnitude is set to wi = 0.7 H.
  – For wbi: a uniform random value between -|wi| and |wi|, so that the intervals cover the desired region.
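A small sketch of this initialization for the one-input case described above (the 0.7 overlap factor and the bias range follow the Nguyen–Widrow scheme as summarized here; the function and variable names are ours):

```python
import numpy as np

def nguyen_widrow_init_1d(H, seed=None):
    """Hidden weights/biases so that each of the H units covers its own slice of (-1, 1)."""
    rng = np.random.default_rng(seed)
    w = 0.7 * H * rng.choice([-1.0, 1.0], size=H)  # magnitude 0.7*H, random sign
    wb = rng.uniform(-np.abs(w), np.abs(w))        # biases spread the intervals over (-1, 1)
    return w, wb
```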
(Figure: training of a net initialized as discussed, compared with a net whose weights are initialized to random values between -.5 and .5.)
• The improvement is best when a large number of hidden units is used with a complicated desired response.
• Training time decreased from 2 days to 4 hours for the Truck-Backer-Upper problem.
Momentum
• Weight change direction: a combination of the current gradient and the previous gradient.
  – Advantage: reduces the influence of outliers.
  – Does not adjust the LR directly.
$$w_{jk}(t+1) = w_{jk}(t) + \alpha\, \delta_k\, z_j + \mu\,[\,w_{jk}(t) - w_{jk}(t-1)\,]$$
or
$$\Delta w_{jk}(t+1) = \alpha\, \delta_k\, z_j + \mu\, \Delta w_{jk}(t)$$
where µ is the momentum parameter, in the range from 0 to 1.
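A one-line sketch of the second form above (grad_term stands for the α δk zj term; names are ours):

```python
def momentum_step(W, dW_prev, grad_term, mu=0.9):
    """One momentum update: new change = gradient term + mu * previous change."""
    dW = grad_term + mu * dW_prev
    return W + dW, dW    # updated weights and the change to reuse at the next step
```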
Momentum (cont'd)
– Allows large weight adjustments as long as the corrections are in the same direction.
– Forms an exponentially weighted sum:
$$\Delta w_{jk}(t+1) = \alpha \sum_{\tau=0}^{t} \mu^{\,t-\tau}\, \delta_k(\tau)\, z_j(\tau)$$
– BP vs. MOM: XOR function with bipolar representation

Method   µ    α    #Epochs
BP       –    .2   387
MOM      .9   .2   38
Delta-Bar-Delta
• Allows each weight to have its own learning rate.
• Lets learning rates vary with time.
• Two heuristics are used to determine the appropriate changes:
  – If the weight change is in the same direction for several time steps, the LR for that weight should be increased.
  – If the direction of the weight change alternates, the LR should be decreased.
• Note: these heuristics won't always improve the performance.
DBD (cont'd)
• The DBD rule consists of:
  – a weight update rule
  – an LR update rule
• The DBD rule changes the weights with a per-weight learning rate,
$$w_{jk}(t+1) = w_{jk}(t) - \alpha_{jk}(t)\, \frac{\partial E(t)}{\partial w_{jk}(t)},$$
and uses information from the current and past derivatives to form the "delta-bar",
$$\bar{\delta}_{jk}(t) = (1-\beta)\, \delta_{jk}(t) + \beta\, \bar{\delta}_{jk}(t-1), \qquad \delta_{jk}(t) = \frac{\partial E(t)}{\partial w_{jk}(t)}.$$
DBD (cont'd)
• The 1st heuristic is implemented by increasing the LR by a constant amount κ:
$$\Delta\alpha_{jk}(t+1) = \kappa$$
• The 2nd heuristic is implemented by decreasing the LR by a proportion γ of its current value:
$$\Delta\alpha_{jk}(t+1) = -\gamma\, \alpha_{jk}(t)$$
• LRs increase linearly and decrease exponentially.
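A compact per-weight sketch of the two heuristics and the delta-bar trace (κ, γ and the smoothing factor β are problem-dependent choices; the names are ours):

```python
import numpy as np

def dbd_step(w, lr, grad, delta_bar, kappa=0.01, gamma=0.1, beta=0.7):
    """One delta-bar-delta step: per-weight LR update, then the weight update."""
    lr = np.where(grad * delta_bar > 0, lr + kappa, lr)        # same sign: increase linearly
    lr = np.where(grad * delta_bar < 0, lr * (1 - gamma), lr)  # alternating: decrease exponentially
    w = w - lr * grad                                          # weight update with per-weight LRs
    delta_bar = (1 - beta) * grad + beta * delta_bar           # update the "delta-bar" trace
    return w, lr, delta_bar
```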
DBD (cont’d)
Results for XOR …
Computer Network Intrusion Detection via Neural Network Methods
Goals
• Show that neural network (NN) techniques can be used to detect intruders logging on to a computer network.
• Compare the performance of four neural network methods in intrusion detection.
• The NN techniques used are:
  1. gradient descent back propagation (BP)
  2. gradient descent BP with momentum
  3. variable learning rate gradient descent BP
  4. conjugate gradient BP (CGP)
Background on Intrusion Detection Systems (IDS)
• Information assurance is a field that deals with protecting information on computers or computer networks from being compromised.
• Intrusion detection: detecting unauthorized users accessing the information on those computers.
• Current intrusion detection techniques cannot detect new and novel attacks.
• The relevance of NNs to intrusion detection becomes apparent when one views the intrusion detection problem as a pattern classification problem.
Pattern Classification Problem
• By building profiles of authorized computer users, one can train the NN to classify incoming computer traffic as authorized or not authorized.
• The task of intrusion detection is to construct a model that captures a set of user attributes and determines whether that user's set of attributes belongs to the authorized user or to an intruder.
Problem Definition
• The attribute set consists of the unique characteristics of the user logging onto a computer network:
  – authorized user and intruder.
• The problem can be stated as the mapping
$$y = f(x)$$
where:
x = input vector consisting of a user's attributes
y = {authorized user, intruder}
We want to map the input set x to an output y.
Solving the Intrusion Detection Problem Using Back Propagation
• Multilayer perceptron with two hidden layers.
• The error of our model is
$$e = d - y$$
where d = desired output and y = actual output.
Continue
• Activation functions: sigmoidal.
• Users in the UNIX OS environment can be profiled via four attributes: command, host, time, execution time.
• For simplicity in testing the back propagation methods, we decided to generate a user profile data file without profile drift.
Continue
• The generated data used here was organized into two parts:
  – Training data: 90% authorized traffic, 10% intrusion traffic.
  – Testing data: 98% authorized traffic, 2% intrusion traffic.
Continue
• Our objective is to train the neural networks to detect intrusion traffic with the fewest number of intrusion samples.
  – File1 consists of 5 CUs in each input sample.
  – File2 consists of 6 CUs in each input sample.
  – File3 consists of 7 CUs in each input sample.
  – Each CU has 4 elements.
• We investigate three kinds of error back propagation neural network and their results.
Gradient Descent BP (GD)
• This method updates the network weights and biases in the direction in which the performance function decreases most rapidly, i.e. the negative of the gradient. The new weight vector wk+1 is adjusted according to:
$$w_{k+1} = w_k - \alpha\, g_k$$
• α is the learning rate and gk is the gradient of the error with respect to the weight vector.
• The negative sign indicates that the new weight vector wk+1 moves in a direction opposite to that of the gradient.
Gradient Descent BP with Momentum (GDM)
• Makes each weight change equal to the sum of a fraction of the last weight change and the new change suggested by the gradient descent BP rule.
• Advantages:
  1. Momentum allows a network to respond not only to the local gradient, but also to recent trends in the error surface.
  2. Momentum allows the network to ignore small features in the error surface.
  3. Without momentum a network may get stuck in a shallow local minimum; with momentum the network can slide through such a minimum.
Gradient Descent BP with Momentum (GDM)
• The momentum constant µ can be any number between 0 and 1.
Variable Learning Rate BP with Momentum (GDX)
• The learning rate parameter determines how fast the BP method converges to the minimum solution.
• If the learning rate is made too large, the algorithm becomes unstable.
• If the learning rate is set too small, the algorithm takes a long time to converge.
• To speed up the convergence time, the variable learning rate gradient descent BP uses a larger learning rate α when the neural network model is far from the solution and a smaller learning rate α when the neural net is near the solution.
Variable Learning Rate BP with Momentum (GDX)
• The new weight vector wk+1 is adjusted the same as in the gradient descent with momentum above, but with a varying αk.
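To make the three update rules concrete, here is a small sketch; the momentum form and the increase/decrease factors are common choices and not taken from the paper itself:

```python
import numpy as np

def gd_step(w, grad, lr):
    """Plain gradient descent: w_{k+1} = w_k - alpha * g_k."""
    return w - lr * grad

def gdm_step(w, grad, lr, prev_dw, mu=0.9):
    """Gradient descent with momentum: add a fraction of the last weight change."""
    dw = -lr * grad + mu * prev_dw
    return w + dw, dw

def gdx_lr(lr, new_err, old_err, inc=1.05, dec=0.7):
    """Variable learning rate: grow alpha while the error drops, shrink it otherwise."""
    return lr * inc if new_err < old_err else lr * dec
```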
Conjugate Gradient BP (CGP)
• In the conjugate gradient algorithms a search is performed along conjugate directions, which generally produces faster convergence than steepest descent directions.
• A search is made along the conjugate gradient direction to determine the step size which will minimize the performance function along that line.
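A rough sketch of the direction update this describes, using the Fletcher–Reeves formula as one common choice (the line search is left abstract; all names are ours):

```python
import numpy as np

def conjugate_direction(grad, prev_grad, prev_dir):
    """Fletcher-Reeves conjugate direction: d_k = -g_k + beta_k * d_{k-1}."""
    beta = (grad @ grad) / (prev_grad @ prev_grad)
    return -grad + beta * prev_dir

def cgp_step(w, grad, prev_grad, prev_dir, line_search):
    """One CGP step: pick the conjugate direction, then line-search the step size."""
    d = conjugate_direction(grad, prev_grad, prev_dir)
    alpha = line_search(w, d)   # step size that minimizes the performance function along d
    return w + alpha * d, d
```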
Performance Comparison
• Result when Input is File1
Performance Comparison
• Result when Input is File2
Performance Comparison
• Result when Input is File3
Results From Tables
• 1. The gradient descent with momentum method was not as good at separating the intrusion traffic from the authorized traffic as the gradient descent method.
• 2. The gradient descent with momentum methods were able to classify intrusion traffic versus authorized traffic, but with higher MSE values.
• 3. The number of samples used as inputs affected the classification performance.
Output Values of the Two Classes
Results From Tables
• The gradient descent, the gradient descent with momentum, and the variable rate gradient descent with momentum methods converged to an MSE of exp(-3); they were able to classify the intrusion traffic from the authorized traffic.
• The input sample that yielded the best performance for all five methods contained 6 CUs.
• The number of neurons used in the hidden layer depended on the number of CUs in the input samples. In these cases, the NN topology of {24, 10, 1} yielded the best results.
• When the input data files were File1 and File3 (i.e. input samples consisting of 5 or 7 CUs), we did not get good results compared to the case when the input is File2 (memorize & generalize problem).
• Fourth, the conjugate gradient descent had the best performance.
Different Problems Require Different Learning Rate Adaptive Methods
General parameter settings
• Network architecture: standard feedforward neural network.
• The maximum number of parameters was constant for all tasks.
• AFs for non-input units: standard hyperbolic tangent.
• Correctness: all output units must produce the correct answer.
• Seven networks for each configuration…
• Fixed total number of iterations.
Different Problems
• Each algorithm was tested on three different tasks:
  – Parity bit: whether the number of activated input units is odd or even. In this study the input layer is composed of nine units, each with two states, giving 512 different patterns. Only one output unit is needed.
  – N-M-N encoder: the encoder task consists of reproducing the same output activation pattern as the input one. In the activation patterns, one unit is active and the others are not. The complexity of this task resides in the fact that the number of hidden units is less than the number of input and output units (M < N). [M = 7, N = 16]
  – Texture: the task consists of detecting the orientation, either horizontal or vertical, of stripes defined by texture in an image.
Algorithm dependent parameter settings
– The main problem with AMs is that they require many parameters that are, like the LR, problem dependent. It is impossible to test all possible parameter combinations. Usually, to compare AMs for a given task, authors affirm that they tried to find "good" parameter settings! (The ease of finding these parameters matters.)
• For MOM, the free parameter is the momentum factor. Tested values are: 0, .5, .7, .9 and .99.
• For DBD, the other parameter is the increase factor.
Results for Parity-Bit
(Table of results: rows = LRs, columns = free-parameter values.)
Each element represents the number of times the AM solved the task (max = 7) for a given parameter combination. The rows correspond to different LRs and the columns to different free-parameter values.
MOM and DBD achieved a performance of 6/7.
Results For Encoder
For DBD and MOM, although many parameter combinations solved the task, none resulted in one hundred percent efficacy.
Results for Texture
No parameter combination was found that could solve the texture task using MOM.
DBD solved the task only if the initial LR was 4^(-2). With the proper initial LR, the other DBD parameter (the incremental constant) showed greater flexibility.
Discussion
– The first and obvious conclusion that can be drawn from these results is that no AM attained a better performance than all the others on all tasks.
– MOM and DBD had similar behavior when they were used on the encoder and parity tasks.
– The only task on which they clearly differed was the texture task, where MOM never solved the task, as opposed to DBD.
Comparison of NN & SVM
• Problem: recognizing young-old gait patterns
  – The gaits of 12 young and 12 elderly participants were recorded and analyzed.
  – 24 gait parameters (features) were extracted for training and testing the NN and SVM systems.
  – NNs have been employed to classify normal and pathological gait, with good success.
  – SVMs have emerged as a new and powerful technique for learning from data, and in particular for solving classification and regression problems, with reported better performance.
Gait Parameters
• Three types of gait parameters:
  – Basic gait data (9 variables): walking speed, stride length, …
  – Kinetic data (5 variables): foot-ground forces, …
  – Kinematic data (10 variables): knee and ankle joint angles, …
Experimental Results
• A total of 24 subjects were used, of which 20 subjects' data were used to train the NN and SVM, and the remaining 4 subjects were used to test the generalization ability of both techniques.
• Due to the small sample size, a cross-validation technique was employed. In this way, all 24 subjects appeared in the testing phase of the NN and SVM models (6 groups, 4 subjects in each).
• Each algorithm is tested 20 times.
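As an illustration of this evaluation scheme, here is a scikit-learn sketch; the random features, the labels and the model settings are placeholders, not the study's actual data or configurations:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Placeholder data: 24 subjects x 24 gait parameters, 12 young (0) and 12 elderly (1).
X = np.random.rand(24, 24)
y = np.repeat([0, 1], 12)

cv = KFold(n_splits=6, shuffle=True, random_state=0)        # 6 groups of 4 subjects
for name, model in [("NN", MLPClassifier(max_iter=2000)), ("SVM", SVC())]:
    scores = cross_val_score(model, X, y, cv=cv)            # every subject is tested once
    print(name, scores.mean())
```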