Advantages of NNs
1. High computation rates provided by massive parallelism (MP)
2. A great degree of robustness
3. Adaptation of connection weights based on current results so as to improve performance
4. Non-parametric
The Artificial Neuron
The basic computational element of a NN is called the artificial neuron, or simply a neuron or node.
A general neuron model
(Figure: inputs x1 … xi … xn with weights w1 … wi … wn and a bias φ feed the basis fn. u(i); the result passes through the activation fn. f( ), and the output is sent to another neuron.)
The net input is represented by a net fn. u(i), also called the basis fn.; the neuron activation is given by the activation fn. f( ).
The bias acts exactly like a weight on a connection from a unit whose activation is always 1.
The simplest node sums the weighted inputs and passes the result through a non-linearity, i.e.
y = f(∑ xi wi + φ)
The same non-linear fn. f, the activation fn., is typically used for all the neurons in any particular layer of a neural net. Some common activation fns. are the step, ramp, and sigmoid functions.
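As a minimal sketch of this node computation (the bipolar step activation and all numbers below are illustrative assumptions, not values from the text):

```python
import numpy as np

def step(x):
    """Bipolar step (hard limiter): +1 if the net input is >= 0, else -1."""
    return 1.0 if x >= 0 else -1.0

def neuron(x, w, bias, f=step):
    """Simplest node: weighted sum of the inputs plus a bias, passed through f."""
    return f(np.dot(w, x) + bias)

x = np.array([0.5, -1.2, 0.3])   # inputs x_i (made up)
w = np.array([0.4, 0.1, -0.7])   # weights w_i (made up)
print(neuron(x, w, bias=0.2))    # y = f(sum_i x_i w_i + bias)
```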
The basis fn
The basis fn. is represented as u(w, x), where w stands for the weight matrix and x for the input vector. The basis fn. has two common forms:
1. Linear basis fn. (LBF), a hyperplane-type fn. This is a first-order linear basis fn.; the net value is a linear combination of the inputs:
ui(W, X) = ∑ wij xj   (j = 1 to n)
2. Radial basis fn. (RBF), a hypersphere-type fn. This involves a second-order non-linear basis fn.; the net value represents the distance to a reference pattern:
ui(W, X) = ∑ (xj - wij)²   (j = 1 to n)
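A short sketch of the two basis fns. (the weight matrix and input vector are made-up illustrations):

```python
import numpy as np

def linear_basis(W, x):
    """LBF: u_i(W, x) = sum_j W[i, j] * x[j]  (hyperplane-type net value)."""
    return W @ x

def radial_basis(W, x):
    """RBF: u_i(W, x) = sum_j (x[j] - W[i, j])**2  (distance to a reference pattern)."""
    return np.sum((x - W) ** 2, axis=1)

W = np.array([[0.2, -0.5, 0.1],
              [0.7,  0.3, -0.4]])   # one row of weights per unit i
x = np.array([1.0, 0.5, -1.0])

print(linear_basis(W, x))   # net values of the two units under the LBF
print(radial_basis(W, x))   # net values of the two units under the RBF
```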
Common activation functions
1- Step fn. Converts the net input (a continuously valued variable) to a binary (0 or 1) or bipolar (1 or -1) output signal. The binary step fn. is also known as the threshold fn. or hard limiter.
(Figure: step function.)
This type of fn. is used in both supervised and unsupervised learning; its main advantage is its simplicity. (A code sketch of the step, ramp, and sigmoid fns. is given after this list.)
2- The threshold logic unit (TLU) or ramp fn.
(Figure: ramp function, rising from -1 at x = -x0 to 1 at x = x0.)
f(x) = -1 for x ≤ -x0
f(x) = x for -x0 < x < x0
f(x) = 1 for x ≥ x0
3- Sigmoid fns. (S-shaped curves) are useful activation fns. The logistic fn. and the hyperbolic tangent fn. are the most common. They are especially useful for NNs trained by back propagation because of the simple relationship between the value of the fn. at a point and the value of the derivative at that point (i.e. it has a simple derivative).
The logistic fn., with a range from 0 to 1, is often used for NNs in which the desired output values either are binary or are in the range (interval) between 0 and 1:
f(x) = 1 / (1 + e^(-σx))
f'(x) = σ f(x) [1 - f(x)]
(Figure: binary sigmoid for σ = 1 and σ = 3.)
The logistic fn. is also called the squashing fn.; σ is the slope parameter.
N.B.
The slope parameter is used to modify the steepness of the sigmoid so that the sigmoid fn. achieves a particular desired value for a given value of x.
The bipolar sigmoid is often used when the desired range of output values is between -1 and 1:
f(x) = (1 - e^(-σx)) / (1 + e^(-σx))
f'(x) = (σ/2) [1 + f(x)] [1 - f(x)]
(Figure: bipolar sigmoid, ranging from -1 to 1.)
The bipolar sigmoid is similar to the logistic fn. but is symmetric about the origin. The sigmoid fns. are continuous, differentiable, and monotonically non-decreasing; hence the sigmoid fns. are used in the implementation of the back propagation algorithm (which requires that the fn. be everywhere differentiable).
The hyperbolic tangent (σ = 2):
f(x) = (1 - e^(-2x)) / (1 + e^(-2x))
and f'(x) = [1 + f(x)] [1 - f(x)]
(Figure: hyperbolic tangent.)
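A combined sketch of the activation fns. above, with each derivative expressed through the fn. value as in the formulas (the ramp is written with x0 = 1 assumed, and the test inputs are made up):

```python
import numpy as np

def bipolar_step(x):
    """1- Step fn. (hard limiter): -1 for negative net input, +1 otherwise."""
    return np.where(x >= 0, 1.0, -1.0)

def ramp(x, x0=1.0):
    """2- TLU / ramp: -1 for x <= -x0, +1 for x >= x0, and x itself in between."""
    return np.where(x <= -x0, -1.0, np.where(x >= x0, 1.0, x))

def logistic(x, sigma=1.0):
    """3- Binary sigmoid in (0, 1); f'(x) = sigma * f(x) * (1 - f(x))."""
    return 1.0 / (1.0 + np.exp(-sigma * x))

def bipolar_sigmoid(x, sigma=1.0):
    """Bipolar sigmoid in (-1, 1); f'(x) = (sigma / 2) * (1 + f(x)) * (1 - f(x))."""
    return (1.0 - np.exp(-sigma * x)) / (1.0 + np.exp(-sigma * x))

# np.tanh(x) equals the bipolar sigmoid with sigma = 2; f'(x) = (1 + f(x)) * (1 - f(x))
x = np.array([-2.0, -0.3, 0.0, 0.3, 2.0])
print(bipolar_step(x), ramp(x), logistic(x, 3.0), np.tanh(x), sep="\n")
```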
Note
The f' factor in the learning rule scales the error so as to force a stronger correction when the net input is near the rapid rise in the sigmoid curve, i.e. the peak of the derivative is at the same position as the rise in the sigmoid curve.
Taxonomy of NNs
There are two phases in neural information processing:
1- Learning phase: In the learning phase a training data set is used to determine the weight parameters that define the neural model. NNs learn by adaptively updating the weights that characterize the strength of the connections; the adaptation may also involve altering the pattern of connections. The weights are updated according to the information extracted from the training patterns. Usually the optimal weights are obtained by optimizing a criterion fn., e.g. minimizing the least-square error between the teacher value and the actual output value.
2- Retrieving phase: In the retrieving phase the trained neural model is used to process real patterns and yield classification results; the computed neuron values represent the desired output.
Supervised and unsupervised networks
NNs are commonly categorized in terms of their training algorithms:
- Fixed-weight networks
- Unsupervised networks
- Supervised networks
No learning is required for the fixed-weight nets, so a learning model is either supervised or unsupervised.
Supervised networks: require the pairing of each input vector with a target vector representing the desired output; put together, these are called a training pair. Usually a network is trained over a no. of such training pairs: an input vector is applied, the network's actual output is compared with the target vector, and the weights are adjusted accordingly.
Basic Delta Rule
This rule is also known as the Least Mean Squares (LMS) or Widrow-Hoff rule. The basic delta rule is given by
Δwij = α (tj - yj) xi
where
tj = the teacher value (or desired value) for unit j,
yj = the actual output for unit j,
α = the learning rate.
In other words
Δwij = α δ xi
where
δ = tj - yj
i.e. the difference between the desired or target output and the actual output. The delta rule modifies the weights appropriately for both continuous and binary inputs and outputs.
The delta rule minimizes the squares of the differences between the actual and the desired O/P values. The squared error for a particular training pattern is
E = ∑j (tj - yj)²
where E is a fn. of all the weights. The gradient of E is a vector consisting of the partial derivatives of E with respect to each of the weights. This vector gives the direction of most rapid increase in E; the opposite direction gives the direction of most rapid decrease in the error. The error can therefore be reduced most rapidly by adjusting the weight wij in the direction of -∂E/∂wij.
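A minimal sketch of one delta-rule (LMS) update for a single layer of linear units, using made-up numbers:

```python
import numpy as np

def delta_rule_update(w, x, t, y, alpha=0.1):
    """Basic delta rule: w_ij <- w_ij + alpha * (t_j - y_j) * x_i."""
    delta = t - y                       # difference between target and actual output
    return w + alpha * np.outer(x, delta)

x = np.array([1.0, 0.5, -0.2])          # inputs (illustrative)
w = np.array([[0.1, -0.3],
              [0.2,  0.4],
              [0.0,  0.1]])             # weights: one column per output unit j
t = np.array([1.0, 0.0])                # teacher values
y = x @ w                               # actual outputs of the linear units
w = delta_rule_update(w, x, t, y)
print(w)
```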
Generalized delta rule
In a multilayer network, the determination of the error is a recursive process which starts with the O/P units, and the error is back propagated to the input units. Therefore the rule is called error back propagation (BP).
Δwij = α δj xi
δj = (tj - yj) f'(y inj)
where
y inj = ∑i wij xi + b
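As a small sketch of the output-unit case of this rule (the logistic activation and all numbers are illustrative assumptions):

```python
import numpy as np

x = np.array([0.4, -0.9, 0.2])      # inputs to output unit j (made up)
w = np.array([0.3, 0.1, -0.2])      # weights w_ij (made up)
b, t, alpha = 0.05, 1.0, 0.2        # bias, target for unit j, learning rate

y_in = w @ x + b                    # y_inj = sum_i w_ij x_i + b
y = 1.0 / (1.0 + np.exp(-y_in))     # logistic activation
delta_j = (t - y) * y * (1.0 - y)   # (t_j - y_j) f'(y_inj), with f' = f (1 - f)
w = w + alpha * delta_j * x         # delta w_ij = alpha * delta_j * x_i
print(y, delta_j, w)
```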
Back Propagation
Backpropagation is the generalization of the Widrow-Hoff
learning rule to multiple-layer networks and nonlinear
differentiable transfer functions. Input vectors and the
corresponding target vectors are used to train a network until it
can approximate a function, associate input vectors with specific
output vectors, or classify input vectors in an appropriate way as
defined by you. Networks with biases, a sigmoid layer, and a
linear output layer are capable of approximating any function
with a finite number of discontinuities. Standard
backpropagation is a gradient descent algorithm, as is the
Widrow-Hoff learning rule, in which the network weights are
moved along the negative of the gradient of the performance
function. The term backpropagation refers to the manner in
which the gradient is computed for nonlinear multilayer
networks. There are a number of variations on the basic
algorithm that are based on other standard optimization
techniques, such as conjugate gradient and Newton methods.
Properly trained backpropagation networks tend to give
reasonable answers when presented with inputs that they have
never seen. Typically, a new input leads to an output similar to
the correct output for input vectors used in training that are
similar to the new input being presented. This generalization
property makes it possible to train a network on a representative
set of input/target pairs and get good results without training the
network on all possible input/output pairs. There are two
features of Neural Network software that are designed to
improve network generalization: regularization and early
stopping.
The popularity of this algorithm lies in the network's flexibility in capturing hidden features in the application data. It overcomes the limitations of the simple perceptron, which can represent linearly separable fns. only. The BP algorithm provides a general learning rule for the more powerful feed-forward multilayer networks. The basis for the algorithm is the optimization technique known as gradient descent, i.e. during learning, each of the connection wts. is adjusted by an amount proportional to the gradient of a global measure of the error of the network at the present location. The algorithm involves three stages:
1- Feed forward of the input patterns;
2- Back propagation of the associated error;
3- Adjustment of the wts.
The algorithm may be used with one or several hidden layers.
Algorithm
Step 0: For a network with one hidden layer, initialize wts. and biases to small random values. Set the learning rate α such that 0.1 < α ≤ 1.
(Figure: a feed-forward network with input units x1 … xn plus a bias unit x0 = 1, hidden units z1 … zp, and output units y1 … ym.)
Step 1: While the stopping condition is false, do steps 2 to 7.
Step 2: For each training pattern, do steps 3 to 6.
Feed forward
Step 3: a. For each hidden unit zj, the net input is computed as follows:
zj in = ∑i xi wij
where x0 is the bias. The output signal is computed:
zj = f(zj in)
The output signal is sent to all units in the layer above.
b. For each output unit yk, the weighted input is
yk in = ∑j zj vjk
and the output signal is
yk = f(yk in)
Back Propagation of Error
Step 4: For each output unit yk (k = 1, 2, …, m)
a. Compute its error term (or delta term)
δk = (tk - yk) f'(yk in), where f'(yk in) is the first derivative of the activation fn.
b. Compute its wt. correction term to update the relevant wt. later:
Δvjk = α δk zj
c. Send δk to the units in the layer below.
Step 5: For each hidden unit zj (j = 1, 2, …, p)
a. Sum up its delta inputs (from the layer above):
δj in = ∑k δk vjk   (k = 1 to m)
b. Calculate its error term δj:
δj = δj in · f'(zj in)
c. Calculate its wt. correction term to update the relevant wt. later:
Δwij = α δj xi
Update weights
Step 6:
a. Each output unit yk updates its weights:
vjk(new) = vjk(old) + Δvjk
b. Each hidden unit zj updates its wts.:
wij(new) = wij(old) + Δwij
Step 7: Test the stopping condition.
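A compact sketch of steps 0 to 7 for one hidden layer (logistic activation; the layer sizes, learning rate, and XOR training data are illustrative assumptions, and a fixed epoch count stands in for the stopping test of steps 1 and 7):

```python
import numpy as np

def f(x):                        # logistic activation
    return 1.0 / (1.0 + np.exp(-x))

def f_prime_from_output(a):      # f'(x) written via the output: f(x) * (1 - f(x))
    return a * (1.0 - a)

def train_bp(X, T, n_hidden=4, alpha=0.5, epochs=5000, seed=0):
    """One-hidden-layer back propagation; the bias is an extra weight (x0 = 1)."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    W = rng.uniform(-0.5, 0.5, (n_in + 1, n_hidden))    # Step 0: input  -> hidden
    V = rng.uniform(-0.5, 0.5, (n_hidden + 1, n_out))   #         hidden -> output
    for _ in range(epochs):                             # Steps 1 and 7 (fixed count here)
        for x, t in zip(X, T):                          # Step 2: each training pair
            x1 = np.append(x, 1.0)                      # x0 = 1 for the bias
            z = f(x1 @ W)                               # Step 3a: hidden outputs
            z1 = np.append(z, 1.0)
            y = f(z1 @ V)                               # Step 3b: output signals
            delta_k = (t - y) * f_prime_from_output(y)  # Step 4a: output error terms
            dV = alpha * np.outer(z1, delta_k)          # Step 4b: weight corrections
            delta_in_j = delta_k @ V[:-1].T             # Step 5a: sum of delta inputs
            delta_j = delta_in_j * f_prime_from_output(z)   # Step 5b
            dW = alpha * np.outer(x1, delta_j)          # Step 5c
            V += dV                                     # Step 6: update the weights
            W += dW
    return W, V

# Toy usage: try to learn XOR (illustrative data, not from the text)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W, V = train_bp(X, T)
z = f(np.c_[X, np.ones(4)] @ W)
print(f(np.c_[z, np.ones(4)] @ V).round(2))
```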
The BP algorithm may be used for a NN with an arbitrary no. of hidden layers. The rule has the form
Δwpq = α δoutput · vinput
where input and output refer to the two ends p and q of the connection concerned, and v stands for the appropriate input-end activation of a hidden unit or a real input.
Note that the meaning of δ differs for the last layer from all the other hidden layers.
Notes
1. The BP network consists of at least one hidden layer.
2. Each layer is fully connected to the next layer.
3. The no. of neurons in the hidden layer varies and depends on the particular problem.
4. The activation fn. used is a sigmoid fn.
5. The commonly used error fn. is E = 1/2 ∑µ ∑k (tk - yk)², where µ is the pattern no.; i.e. the objective is to minimize the squared error between the teacher and the actual response.
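For concreteness, the error fn. of note 5 over a couple of made-up patterns:

```python
import numpy as np

T = np.array([[0.0, 1.0],
              [1.0, 0.0]])       # teacher values t_k, one row per pattern mu
Y = np.array([[0.2, 0.8],
              [0.7, 0.4]])       # actual responses y_k
E = 0.5 * np.sum((T - Y) ** 2)   # E = 1/2 * sum over mu and k of (t_k - y_k)^2
print(E)                         # 0.165
```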
Convergence
When a network is trained successfully, it produces correct answers more and more often as training progresses. The root mean squared (RMS) error is usually calculated to reflect the degree to which learning has taken place. This measure reflects how close the network is to getting the correct answers. As the network learns, its RMS error decreases, and generally an RMS value below 0.1 indicates that a network has learned its training set. The network gets closer and closer to the target values incrementally with each step. Hence it is possible to define a cutoff point at which the network's output is said to match the target values.
Convergence is the process whereby the RMS value for the network gets closer to 0. However, it is not always easy to achieve because
1- The process may take an exceedingly long time.
2- The network may get stuck in a local minimum and stop learning altogether. A local minimum is surrounded by higher values, and the network usually does not leave a local minimum easily.
Special steps may be used to get out of a local minimum:
(Figure: error surface E versus weight w1 in two dimensions, showing a local minimum and the global minimum; a weight update moves w1 old towards w1 new.)
1- Change the learning rate.
2- Change the no. of hidden units.
3- Add small random values to the wts.
4- Start the learning process again with different initial wts.
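A small sketch of the RMS error measure described above, with the 0.1 value from the text used as an illustrative cutoff:

```python
import numpy as np

def rms_error(targets, outputs):
    """Root mean squared error over all patterns and output units."""
    return float(np.sqrt(np.mean((targets - outputs) ** 2)))

# Illustrative values (not from the text)
T = np.array([[0.0, 1.0], [1.0, 0.0]])
Y = np.array([[0.1, 0.9], [0.8, 0.2]])
err = rms_error(T, Y)
print(err, "learned" if err < 0.1 else "keep training")
```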
Generalization
This is the ability of the net to produce reasonable responses to input
patterns that are similar but not identical to training patterns. A balance between
memorization and generalization is usually desired.
It is common for the generalization performance of a neural network to become suboptimal if training is allowed to continue indefinitely, as the model overfits (or memorizes) the training data.
The most commonly used technique to avoid overtraining is to monitor generalization performance on an independent data set (or test set) as training proceeds. Training is terminated when generalization performance fails to improve as training continues. This technique (called cross validation) has some disadvantages. It can be wasteful when training data is limited. Also, it can be very noisy between different test data sets and hence provide only an approximate indication of when to stop.
Another technique is to restrict the size or interconnection density of the NN, thus restricting performance on the training set. One can also alter the network complexity by changing the no. of hidden nodes while maintaining full connectivity between layers, or both.
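A minimal sketch of the cross-validation (early-stopping) idea; train_step and test_error are placeholder callables supplied by the caller, and the error curve below is made up:

```python
def early_stopping(train_step, test_error, max_epochs=1000, patience=10):
    """Train while monitoring error on an independent test set; stop when the
    test error fails to improve for `patience` consecutive epochs."""
    best_err, best_epoch, wait = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_step()                 # e.g. one BP pass over the training set
        err = test_error()           # e.g. RMS error on the held-out test set
        if err < best_err:
            best_err, best_epoch, wait = err, epoch, 0
        else:
            wait += 1
            if wait >= patience:     # generalization stopped improving
                break
    return best_epoch, best_err

# Toy usage: a fake test-error curve that improves and then overfits
errs = iter([0.5, 0.3, 0.2, 0.15, 0.16, 0.18, 0.2, 0.25] + [0.3] * 20)
print(early_stopping(lambda: None, lambda: next(errs), max_epochs=20, patience=3))
```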
Example
(Figure: an example network with inputs x1 = 1.1, x2 = 2.4, x3 = 3.2, x4 = 5.1, x5 = 3.9 feeding hidden units H1 and H2, which feed the output units; the weights shown in the figure are listed below.)
Input vector = (1.1 2.4 3.2 5.1 3.9)
Target = (0.52 0.25 0.75 0.97)
Let the weights between hidden unit H and the output layer be vjk = (0.85 0.62 -0.1 0.21).
Weights V = (-0.33 0.07 -0.45 0.13 0.37) (the weights entering hidden unit H2).
n = -0.33*1.1 + 0.07*2.4 - 0.45*3.2 + 0.13*5.1 + 0.37*3.9 + 0.679 (bias b)
a = 1/(1 + exp(-n))
a = 0.7595 (the output of unit H2)
We need the computed output vector; assume:
Computed actual output = (0.61 0.41 0.57 0.53)
Target (desired) = (0.52 0.25 0.75 0.97)
Compute the error = T - O = (-0.09 -0.16 0.18 0.44)
δk = (tk - yk) f'(yk in)
f'(yk in) = computed actual output × (1 - computed actual output)
δk (error vector) = (-0.02 -0.04 0.04 0.11)
Let α = 0.2.
Δvjk = α δk zj
= 0.2 * (-0.02 -0.04 0.04 0.11) * 0.7595 = (-0.003 -0.006 0.006 0.017)
(the adjustment to the weights between the hidden layer and the output layer)
vjk(new) = vjk(old) + Δvjk
= (0.85 0.62 -0.1 0.21) + (-0.003 -0.006 0.006 0.017) = (0.847 0.614 -0.094 0.227)
δj in = ∑ δk vjk
= (0.85 * -0.02) + (0.62 * -0.04) + (-0.1 * 0.04) + (0.21 * 0.11) = -0.0227
δj = δj in · f'(zj in)
= -0.0227 * 0.7595 * (1 - 0.7595)
= -0.0041
Δwij = α δj xi
Let α = 0.15 and xi = x3 = 3.2.
Δwij = 0.15 * 3.2 * (-0.0041) = -0.002
wij(new) = wij(old) + Δwij
= -0.45 + (-0.002) = -0.452
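The example's arithmetic can be checked with a few lines (the values are copied from above; the logistic activation and the 0.679 bias are as given in the example):

```python
import numpy as np

x = np.array([1.1, 2.4, 3.2, 5.1, 3.9])             # input vector
w_h2 = np.array([-0.33, 0.07, -0.45, 0.13, 0.37])   # weights entering hidden unit H2
b = 0.679

n = w_h2 @ x + b                    # net input to H2
a = 1.0 / (1.0 + np.exp(-n))        # output of H2
print(round(float(a), 4))           # 0.7595, as in the example

t = np.array([0.52, 0.25, 0.75, 0.97])   # targets
o = np.array([0.61, 0.41, 0.57, 0.53])   # assumed computed outputs
delta_k = (t - o) * o * (1 - o)          # delta_k = (t_k - y_k) f'(yk in)
print(delta_k.round(2))                  # [-0.02 -0.04  0.04  0.11]
```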