Neural Networks

Based on sets of notes prepared by Dr. Patrick H. Corr, Brendan Coburn and John
Neural Networks (also known as Connectionist models or Parallel Distributed
Processing models) are information processing systems which model the brain's
cognitive process by imitating some of its basic structures and operations. Interest in
these networks was originally biologically motivated: they were developed with the
expectation of gaining new insight into the workings of the brain. In the 1940s and
1950s there was a certain amount of success in the research and development of
neural networks, but eventually the attraction of these systems declined due to a
number of factors. However, within the past decade interest in neural networks has
been revived. Although these networks are still helpful in the research of the brain's
cognitive process, today they are also applied to real-world information processing problems.
In 1943 two scientists, Warren McCulloch and Walter Pitts, proposed the first
artificial model of a biological neuron [McC]. This synthetic neuron is still the basis
for most of today’s neural networks.
Rosenblatt then introduced his two-layer perceptron, which was subsequently shown
by Minsky and Papert to be limited to linearly separable problems, a result which led
to a huge decline in funding and interest in Neural Networks.
2.3.5 Other Developments
During this period, even though there was a lack of funding and interest in neural
networks, a small number of researchers continued to investigate the potential of
neural models. A number of papers were published, but none had any great impact.
Many of these reports concentrated on the potential of neural networks for aiding in
the explanation of biological behaviour (e.g. [Mal], [Bro], [Mar], [Bie], [Coo]).
Others focused on real world implementations. In 1972 Teuvo Kohonen and James A.
Anderson independently proposed the same model for associative memory [Koh],
[An1] and in 1976 Marr and Poggio applied a neural network to a realistic problem in
computational vision, stereopsis [Mar]. Other projects included [Lit], [Gr1], [Gr2],
[Ama], [An2], [McC].
2.4.1 The Discovery of Backpropagation
The backpropagation learning algorithm was developed independently by Rumelhart
[Ru1], [Ru2], Le Cun [Cun] and Parker [Par] in 1986. It was subsequently discovered
that the algorithm had also been described by Paul Werbos in his Harvard Ph.D. thesis
in 1974 [Wer]. Error backpropagation networks are the most widely used neural
network model as they can be applied to almost any problem that requires pattern
mapping. It was the discovery of this paradigm that brought neural networks out of
the research area and into real world implementation.
What and why?
Neural Networks: a bottom-up attempt to model the
functionality of the brain.
Two main areas of activity:
o Trying to model biological neural systems
o Artificial neural networks, which are biologically inspired but not necessarily
biologically plausible
o Hence other terms may be used: Connectionism, Parallel Distributed
Processing, Adaptive Systems Theory.
A simplified view of a neuron is shown in the diagram below.
Signals move from neuron to neuron via electrochemical reactions. The synapses
release a chemical transmitter which enters the dendrite. This raises or lowers the
electrical potential of the cell body.
The soma sums the inputs it receives and once a threshold level is reached an
electrical impulse is sent down the axon (often known as firing).
These impulses eventually reach synapses and the cycle continues.
Synapses which raise the potential within a cell body are called excitatory. Synapses
which lower the potential are called inhibitory.
It has been found that synapses exhibit plasticity. This means that long-term changes
in the strengths of the connections can be formed depending on the firing patterns
of other neurons. This is thought to be the basis for learning in our brains.
Marking Scheme
1 x 5 basic components
2 for diagram.
(c) How is the neuron modelled in Artificial Neural Nets?
(6 marks)
Modelling a Neuron
To model the brain we need to model a neuron. Each neuron performs a simple
computation. It receives signals from its input links and it uses these values to
compute the activation level (or output) for the neuron. This value is passed to other
neurons via its output links.
The input value received by a neuron is calculated by summing the weighted input
values from its input links. That is

in_i = Σ_j W_{j,i} a_j
An activation function takes the neuron input value and produces a value which
becomes the output value of the neuron. This value is passed to other neurons in the
network.
This is summarised in this diagram and the notes below.
a_j — Activation value of unit j
W_{j,i} — Weight on the link from unit j to unit i
in_i — Weighted sum of inputs to unit i
a_i — Activation value of unit i (also known as the output value)
g — Activation function
Or, in English:
A neuron is connected to other neurons via its input and output links. Each incoming
neuron has an activation value and each connection has a weight associated with it.
The neuron sums the incoming weighted values and this value is input to an activation
function. The output of the activation function is the output from the neuron.
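The summation-and-activation step just described can be sketched in a few lines of Python. This is a minimal illustration, not code from the notes; the step threshold of 0.5 and the weights are arbitrary illustrative values.

```python
# A minimal sketch of the neuron model described above: sum the weighted
# inputs, then apply an activation function (here a simple step function).
# The threshold 0.5 and the example weights are arbitrary choices.

def neuron_output(inputs, weights, threshold=0.5):
    """Sum the weighted inputs, then apply a step activation function."""
    total = sum(w * a for w, a in zip(weights, inputs))
    return 1 if total >= threshold else 0

# Two input links with activations 1 and 0, weighted 0.8 and 0.4:
print(neuron_output([1, 0], [0.8, 0.4]))  # weighted sum 0.8 >= 0.5, so it fires: 1
```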
Some common activation functions are shown below.
These functions can be defined as follows.
Step_t(x) = 1 if x >= t, else 0
Sign(x) = +1 if x >= 0, else -1
Sigmoid(x) = 1 / (1 + e^-x)
On occasion an identity function is also used (i.e. where the input to the neuron
becomes the output). This function is normally used in the input layer, where the
inputs to the neural network are passed into the network unchanged.
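These activation functions can be sketched directly in Python. The function names (step, sign, sigmoid, identity) are descriptive labels, not names from the notes.

```python
import math

# Sketches of the activation functions listed above.

def step(x, t=0.0):
    """1 if x is at or above the threshold t, else 0."""
    return 1 if x >= t else 0

def sign(x):
    """+1 if x >= 0, else -1."""
    return 1 if x >= 0 else -1

def sigmoid(x):
    """1 / (1 + e^-x), a smooth squashing function between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def identity(x):
    """Used in the input layer: the input passes through unchanged."""
    return x

print(step(0.3), sign(-2), sigmoid(0), identity(7))  # 1 -1 0.5 7
```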
Interest in neural networks differs according to profession.
Neurobiologists and psychologists
understanding our brain
Engineers and physicists
a tool to recognise patterns in noisy data
Business analysts and engineers
a tool for modelling data
Computer scientists and mathematicians
networks offer an alternative model of computing: machines that may be
taught rather than programmed
Artificial intelligentsia, cognitive scientists and philosophers
subsymbolic processing: reasoning with patterns rather than with
digital/Boolean symbols (true or false)
Some Application Areas
A good overview of NN applications is provided on the pages set up by the DTI
Neural Applications programme and the later Smart software for decision makers
NCTT programme
I've combined their overview with Biggus's classification of NN applications in
further notes on neural network applications.
Neural computing provides an approach which is closer to human perception and
recognition than traditional computing.
Neural computing systems are adept at many pattern recognition tasks, more so than
both traditional statistical and expert systems.
Combinatorial Problems
Neural computing systems have shown some promise for solving 'NP-complete'
problems. In solving this type of problem, neural networks offer the facility to find a
'dirty' solution quickly, rather than using significantly more resources to find the
optimal solution for little extra gain.
The multi-layer perceptron has been applied to a wide variety of problems calling for
a non-linear mapping between input and output.
DECTALK is a system developed by DEC which reads English characters and
produces, with 95% accuracy, the correct pronunciation for an input word or text.
DECTALK is an expert system which took 20 years to finalise.
It uses a list of pronunciation rules and a large dictionary of exceptions.
NETTALK, a neural network version of DECTALK, was constructed over one
summer vacation.
After 16 hours of training the system could read a 100-word sample text with
98% accuracy!
When trained with 15,000 words it achieved 86% accuracy on a test set.
NETTALK is an example of one of the advantages of the neural network approach
over the symbolic approach to Artificial Intelligence.
It is difficult to simulate the learning process in a symbolic system; rules and
exceptions must be known.
On the other hand, neural systems exhibit learning very clearly; the network
learns by example.
Neural networks are not a panacea for problems in information processing which
traditional methods find difficult or complex.
The neural approach is simply an alternative paradigm.
Neural Networks are inherently parallel and naturally amenable to expression in a
parallel notation and implementation on parallel hardware.
Capacity for Adaptation
In general, neural systems are capable of learning.
Some networks have the capacity to self-organise, ensuring their stability as dynamic
systems.
A self-organising network can take account of a change in the problem that it is
solving, or may learn to resolve the problem in a new manner.
Distributed Memory
In neural networks 'memory' corresponds to an activation map of the neurons.
Memory is thus distributed over many units giving resistance to noise.
In distributed memories, such as neural networks, it is possible to start with noisy data
and to recall the correct data.
Fault Tolerance
Distributed memory is also responsible for fault tolerance.
In most neural networks, if some PEs (processing elements) are destroyed, or their
connections altered slightly, then the behaviour of the network as a whole is only
slightly degraded.
The characteristic of graceful degradation makes neural computing systems
extremely well suited to applications where failure of control equipment could be
catastrophic.
Capacity for Generalisation
Designers of Expert Systems have difficulty in formulating rules which encapsulate
an expert's knowledge in relation to some problem.
A neural system may learn the rules simply from a set of examples.
The generalisation capacity of a neural network is its capacity to give a satisfactory
response for an input which is not part of the set of examples on which it was trained.
The capacity for generalisation is an essential feature of a classification system.
Certain aspects of generalisation behaviour are interesting because they are intuitively
quite close to human generalisation.
Ease of Construction
Computer simulations of small applications can be implemented relatively quickly.
Neural systems are inherently parallel but are normally simulated on sequential
machines.
o Processing time can rise quickly as the size of the problem grows - the
Scaling Problem.
o However, a direct hardware approach would lose the flexibility offered
by a software implementation.
o In consequence, neural networks have been used to address only small
problems.
The performance of a network can be sensitive to the quality and type of
preprocessing of the input data.
Neural networks cannot explain the results they obtain; their rules of operation
are completely unknown.
Performance is measured by statistical methods giving rise to distrust on the
part of potential users.
Many of the design decisions required in developing an application are not
well understood.
Comparison of neural techniques and symbolic artificial intelligence
Early work on Neural systems was largely abandoned after serious limitations of
earlier models were highlighted in 1969.
Artificial Intelligence grew based on the hypothesis that thought processes could
be modelled using a set of symbols and applying a set of logical transformation rules.
The symbolic approach has a number of limitations:
It is essentially sequential and difficult to parallelise
When the quantity of data increases, the methods may suffer a combinatorial explosion
An item of knowledge is represented by a precise object, perhaps a byte in
memory, or a production rule. This localised representation of knowledge does
not lend itself to a robust system.
The learning process seems difficult to simulate in a symbolic system.
The connectionist approach offers the following advantages over the symbolic approach:
parallel and real-time operation of many different components
the distributed representation of knowledge
learning by modifying connection weights.
Both approaches are likely to be combined in the future. For now, here are some rules
of thumb for choosing the approach to use:
Input -> Output                     Knowledge representation                                   Approach
Facts -> Decision                   Logic representation (facts and rules) of human expertise  Expert system
Numbers (measurements) -> Decision  Mathematical calculation learned from data                 Neural network (decision support system)
Multiple Layer Feed Forward Networks
Solving non linearly separable problems
As pointed out before, XOR is an example of a non-linearly separable problem which
two-layer neural nets cannot solve. By adding another layer, the hidden layer, such
problems can be solved.
Solving XOR with NETS
With 3 hidden nodes in one hidden layer
Example weight file - xor3.pwt
With 1 hidden node
Example weight file - xor31.pwt
In the diagram you can roughly trace how the trained net works. Assume nodes have
step-function thresholds at 0.0 and output 0 or 1. (The trained nets actually use the sigmoid
If you apply x1 = 0 and x2 = 0, no signal arrives at the middle node. Since the bias is
2.12 the node fires and outputs 1 through the -6.51 weight. The outer connections also receive
no inputs, so the output node receives nothing from those lines. The -6.51 signal
outweighs the 3.93 bias so there is no output at y, a correct result.
If x1 = 1 and x2 = 1, then the weighted inputs into the centre node overwhelm the 2.12
bias on the node and it produces no output along the line weighted -6.51. Similarly,
along the outer lines the weighted input to the output node is 1 * -2.64 + 1 * -2.66 = -5.3.
This overwhelms the 3.93 bias so that the net input to the node is -1.37, which is
below the 0 threshold, so there is no output at y. This is also correct.
If x1 = 1 and x2 = 0, the net input to the centre node, including the bias is only -3.24,
well below 0.0 so it does not fire. Therefore there is no input to the output node along
the center connection.
On the left outer line we have an input of 1 * -2.64. On the right outer line the input
to the output node is 0. The bias is 3.93 so the net input to the output node is 3.93 - 2.64 = 1.29
> 0.0, so the node fires and an output of 1 appears at y.
If x1 = 0 and x2 = 1, the result is similar to x1 = 1 and x2 = 0. In all cases the net
correctly implements the XOR function.
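The traced net can be checked in code. The weights -6.51, -2.64, -2.66 and the biases 2.12 and 3.93 come from the trace above; the two input-to-middle weights are not stated explicitly, so the value -5.36 for each is an inference from the quoted net input of -3.24 when one input is on (2.12 - 5.36 = -3.24).

```python
# The XOR net traced above, with step-threshold nodes (threshold 0.0).
# Weights -6.51, -2.64, -2.66 and biases 2.12, 3.93 are taken from the trace;
# the input-to-middle weights of -5.36 each are inferred, not stated in the notes.

def step(x):
    return 1 if x > 0.0 else 0

def xor_net(x1, x2):
    centre = step(2.12 + -5.36 * x1 + -5.36 * x2)            # middle (hidden) node
    y_in = 3.93 + -6.51 * centre + -2.64 * x1 + -2.66 * x2   # output node
    return step(y_in)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, '->', xor_net(a, b))  # reproduces the XOR truth table
```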
It turns out that 3-layer or 4-layer nets can mimic any computable function. Thus they
have the computing power of a Turing Machine. In other words, they are as powerful
as any computational device can be.
One might ask, just what is the net learning, and how is it trained?
People have tried, with limited success, to better understand these networks with
hidden layers by associating the hidden nodes with features of the input data. This
diagram illustrates the idea of feature detection.
The idea is that each of the two hidden nodes becomes associated with a feature of the
input data set. The following sketches illustrate what might happen.
Don't take the lines too seriously - they are just for visual effect. The first input
pattern is 0101010 which produces output 100. Similarly 0100010 ==> 010, and
0001000 ==> 001.
Diagram (b) shows the input and output nodes. But what about the 'hidden' nodes?
They might look like this
Fig. (c) represents the features detected by the network's hidden nodes. The points
represent the weights along the connections coming from the 7 input nodes to each of
the hidden nodes.
The connections from the hidden nodes to the output nodes combine these features to
produce the appropriate output. The joining of the points is meant to provide an aid in
visualizing how the features 'add up' to produce the output.
Many people take this feature analysis 'with a grain of salt', and prefer to visualize the
network as a black box which, when trained, does what it is supposed to.
Training multilayer networks with back propagation.
The discovery of the backpropagation algorithm led to an explosion of interest in
Neural Networks. Feed-forward multilayer networks trained with the backpropagation
algorithm are still the most common kind today. How does the algorithm work?
The big picture
Calculating errors for a 3 layer network
First of all, the appropriate error signals must be calculated. For an output node, j,

d(j) = f'(I)(t(j) - y(j))

and for a hidden node, i,

e(i) = f'(I) Σ_j w(i, j) d(j)

or, more simply, since the derivative of the sigmoid is f'(I) = f(I)(1 - f(I)),

d(j) = y(j)(1 - y(j))(t(j) - y(j))
e(i) = y(i)(1 - y(i)) Σ_j w(i, j) d(j)

Here f(I) is the threshold function. The total weighted signal (sometimes called
'activation') into a node is represented by I. The y's are outputs from various nodes
and the t's are target outputs. The index, j, numbers output nodes. Index, i, numbers
the hidden nodes (middle layer). w(i, j) is the weight on the line going from node i in
the hidden layer to node j in the output layer. It is assumed in the second set of
equations that f(I) is the sigmoid function.
The errors at the output nodes are smoothed by multiplying them by the derivative of
the sigmoid function.
This has the effect of reducing the effects of larger errors.
Once the errors at the output nodes are calculated, they are treated as inputs starting
from the output nodes. The net is run 'backwards' as the error signals are propagated
from the output node layer towards the input node layer. This backward error
propagation (backward in the sense that the error signals flow in the opposite
direction to that of the activations caused by the normal inputs) enables the
calculation of errors at the hidden layer corresponding to the output errors.
The second or fourth of the above equations determines the errors at the hidden nodes.
The calculation is similar to the delta rule.
The error at each hidden node, i, coming from an output node, j, is proportional to the
weight of the line from i to j and to the error at output node, j (calculated using the
first or third equation). The total error at hidden node i is the sum of the errors coming
from all the output nodes receiving activation from node i. Once again, the derivative
of the sigmoid function smooths the results by diminishing the effects of large errors.
Using the errors to adjust the weights
Once the errors are known, the weights can be adjusted, layer by layer. The weights
into the hidden nodes are adjusted by the same rule. The formula is

Δw(i, j) = β y(i) d(j)

where β (beta) is the learning rate.
The correction works layer by layer, backwards, from output layer to input layer. In
the 3 layer (of nodes) net there are only two layers of weights.
The errors at the output nodes are used to calculate the new weights between the
hidden (middle) layer and the output layer. Then the delta formula is used again to
calculate the new weights between the input and the hidden layer, using the errors at
the middle layer derived from the 2nd or 4th error equations above.
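The whole procedure can be sketched in a short pure-Python program that trains a net on XOR: forward pass, output and hidden error signals in their sigmoid-derivative form, then delta-rule updates applied backwards, layer by layer. The network size (3 hidden nodes), learning rate, seed and epoch count are illustrative choices, not values from the notes.

```python
import math, random

# A compact sketch of backpropagation as described above, trained on XOR.
# Network size, learning rate, seed and epoch count are arbitrary choices.

random.seed(1)
sig = lambda x: 1.0 / (1.0 + math.exp(-x))

n_in, n_hid = 2, 3
# each row of w1 holds the input weights plus a bias weight as the last entry
w1 = [[random.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hid)]
w2 = [random.uniform(-1, 1) for _ in range(n_hid + 1)]
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
beta = 0.5

def forward(x):
    h = [sig(sum(w * v for w, v in zip(row, list(x) + [1.0]))) for row in w1]
    o = sig(sum(w * v for w, v in zip(w2, h + [1.0])))
    return h, o

def total_error():
    return sum((t - forward(x)[1]) ** 2 for x, t in data)

err_start = total_error()
for _ in range(5000):
    for x, t in data:
        h, o = forward(x)
        d = o * (1 - o) * (t - o)                                  # output error signal
        e = [h[i] * (1 - h[i]) * w2[i] * d for i in range(n_hid)]  # hidden error signals
        for i in range(n_hid):                                     # hidden -> output weights
            w2[i] += beta * h[i] * d
        w2[n_hid] += beta * d                                      # output bias
        for i in range(n_hid):                                     # input -> hidden weights
            for j in range(n_in):
                w1[i][j] += beta * x[j] * e[i]
            w1[i][n_in] += beta * e[i]                             # hidden biases

print(err_start, '->', total_error())  # total squared error falls during training
```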
Here is an attempt to put it all together. The notation is slightly different from the
above notation.
2.3.4 Linear Separability and Minsky and Papert’s Perceptrons
During the years following the introduction of the perceptron and the ADALINE,
research in neural networks prospered. However, both Rosenblatt’s network and
Widrow and Hoff’s adaptive neurons suffered from one major problem. To illustrate
this problem, consider the way the neurons in both models compute their net inputs.
The equation used is

I = Σ_{i=1}^{n} w_i x_i
This can be rewritten (for a two-dimensional input pattern) as

I = x1 w1 + x2 w2
This equation is also used to calculate the dot product of two vectors in a Cartesian
co-ordinate system. Therefore what is really happening is the dot product between the
weight vector and the input vector is being calculated. Another form of this equation is

x · w = |x| |w| cos θ
Since the length of a vector must always be positive, the only term in the above
equation that can affect the sign of the result is the cosine of the angle between the
two vectors. The dot product will be positive if the angle between the vectors is less
than 90 and negative if the angle is greater than 90. It can be seen from figure 2.6
that the ADALINE (the outputs used in the figure are for the ADALINE. The only
difference with the perceptron is that the outputs would be +1 or 0) produces an
output of +1 for input patterns that are within 90 of its weight vector and a -1 for all
other patterns. Therefore in order for the ADALINE or perceptron to successfully
solve a particular problem, that problem must be linearly separable, i.e. the
categories of the inputs can be separated by a straight line. This is a simple
representation of this problem, but even so, it is easy to see that the capabilities of
these neural models were extremely limited since the majority of problems do not
meet this criterion (the XOR problem being a classic example [Day]).
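The limitation can be demonstrated with a short sketch: a single layer of weights trained with the perceptron rule learns AND (linearly separable) but can never get XOR right, because no straight line separates the two XOR classes. The learning rate and epoch count below are arbitrary illustrative values.

```python
# Demonstration of the linear-separability limit: a perceptron (one layer of
# weights, step output) learns AND but cannot learn XOR.
# Learning rate and epoch count are arbitrary choices.

def train_perceptron(data, epochs=50, lr=0.1):
    w = [0.0, 0.0, 0.0]                     # two input weights plus a bias weight
    for _ in range(epochs):
        for (x1, x2), t in data:
            out = 1 if w[0] * x1 + w[1] * x2 + w[2] > 0 else 0
            for i, v in enumerate((x1, x2, 1)):
                w[i] += lr * (t - out) * v  # perceptron learning rule
    return w

def errors(w, data):
    return sum(t != (1 if w[0] * x1 + w[1] * x2 + w[2] > 0 else 0)
               for (x1, x2), t in data)

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
print('AND errors:', errors(train_perceptron(AND), AND))  # 0: linearly separable
print('XOR errors:', errors(train_perceptron(XOR), XOR))  # never reaches 0
```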
Figure 2.6
Separation of two different classes (Class A: +1 output; Class B: -1 output; the
weight vector defines the dividing line)

Minsky and Papert, Perceptrons
In the years leading up to the publication of this book, the appeal of neural
networks had begun to dwindle. There were a number of reasons for this decline
in interest. Firstly, neural networks failed to meet the expectations of the scientific
community. Although originally greeted with great excitement, no major progress
was made in the follow-up to their discovery, and excessive hype put pressure on
the developers of these networks to
produce results quickly. Another obstacle that neural networks had to overcome was
the apprehension which inevitably meets any attempt to develop artificially intelligent
machines. There was constant resistance to the idea of building a ‘chunk of brain’.
Rosenblatt once quoted a headline from an Oklahoma newspaper [Ros2] which said
“Frankenstein Monster Designed by Navy Robot That Thinks.” The new field of
Artificial Intelligence also lessened the attraction of neural computing as it seemed
potentially more successful in the design of machine intelligence.
All these factors contributed to the decline of neural networks. However, the final
blow came in 1969 with the arrival of the book Perceptrons by Marvin Minsky and
Seymour Papert [Min]. This book highlighted many difficulties with the existing
neural models, particularly the problem of linear separability. Although this problem
was already recognised, its full extent was never really realised. By using
mathematical analysis, Minsky and Papert showed that one and two layered neural
networks would only ever be able to solve linearly separable problems and therefore
their usefulness was extremely limited. This book all but wiped out funding and
interest in neural computation. To most scientists it seemed that these networks were
a lost cause and the majority of the scientific community turned their attention
elsewhere, in particular to the field of Artificial Intelligence.
2.4.2 An Overview of Backpropagation Networks
In their book Perceptrons, Minsky and Papert concluded that any network with only
one layer of adjustable weights would only ever be able to solve linearly separable
problems. They also concluded that no algorithm could ever be constructed that
would allow the proper modification of the weights in networks with two or more
layers, and that therefore the potential of neural networks was extremely limited.
However, with the discovery of backpropagation, this obstacle was overcome.
Backpropagation Network Architecture
A backpropagation network typically consists of three or more layers of nodes. The
first layer is known as the input layer and the last layer is known as the output
layer. Any layers of nodes in between the input and output layers are known as
hidden layers. Each unit in a layer is connected to every unit in the next layer. There
are no connections between units within the same layer. The operation of the network
consists of a forward pass of the input through the network and then a backward pass
of an error value which is used in the weight modification (figure 2.7).

Figure 2.7
A Backpropagation Network (input layer, hidden layer, output layer)
Forward Propagation
A forward propagation step is initiated when an input pattern is presented to the
network. No processing is performed at the input layer. The pattern is propagated
forward to the next layer, and each node in this layer performs a weighted sum on all
its inputs (as in the perceptron and ADALINE). After this sum has been calculated, a
function is used to compute the unit’s output. The function used to perform this
operation is the sigmoid function,

f(x) = 1 / (1 + e^-x)
The main reason why this particular function is chosen is that its derivative, which is
used in the learning law, is easily computed. The result obtained after applying this
function to the net input is taken to be the node’s output value. This process is
continued until the pattern has been propagated through the entire network and
reaches the output layer. The activation pattern at the output layer is taken as the
network’s result.
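The claim that the sigmoid's derivative is easily computed can be checked directly from the definition above:

```latex
f(x) = \frac{1}{1+e^{-x}}
\qquad
f'(x) = \frac{e^{-x}}{\left(1+e^{-x}\right)^{2}}
      = \frac{1}{1+e^{-x}}\cdot\frac{e^{-x}}{1+e^{-x}}
      = f(x)\,\bigl(1 - f(x)\bigr)
```

so the derivative needed by the learning law is obtained from the node's output value alone, with no extra function evaluation.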
Backward Propagation
The first step in the backpropagation stage is the calculation of the error between the
network’s result and the desired response. This occurs when the forward propagation
phase is completed. Each processing unit in the output layer is compared to its
corresponding entry in the desired pattern and an error is calculated for each node in
the output layer. The weights are then modified for all of the connections going into
the output layer. Next, the error is backpropagated to the hidden layers and by using
the generalised delta rule, the weights are adjusted for all connections going into the
hidden layer. The procedure is continued until the last layer of weights has been
modified. The forward and backward propagation phases are repeated until the
network’s output is acceptably close to the desired result.
The Backpropagation Learning Law
The learning law used in the backpropagation network is a modification of the
original delta rule used in the ADALINE. The new form, known as the Generalised
Delta Rule, allows for the adjustment of the weights in the hidden layer, a feat
deemed impossible by Minsky and Papert. It uses the derivative of the activation
function of nodes (which in most cases is the sigmoid function) to determine the
extent of the adjustment to the weights connecting to the hidden layers. A full
mathematical description of the generalised delta rule can be found in [Bra].
2.4.3 Components of a Backpropagation Network
The Input Layer
The input layer of a backpropagation network acts solely as a buffer to hold the
patterns being presented to the network. Each node in the input layer corresponds to
one entry in the pattern. No processing is done at the input layer. The pattern is fed
forward from the input layer to the next layer.
The Hidden Layers
It is the hidden layers which give the backpropagation network its exceptional
computational abilities. The units in the hidden layers act as “feature detectors”.
They extract information from the input patterns which can be used to distinguish
between particular classes. The network creates its own internal representation of
the data. An example of this feature detection is given in [Day]. A backpropagation
network with one hidden layer comprising two units was trained on the patterns
shown in figure 2.8. The patterns were chosen so that the distinguishing features of
each pattern were obvious.

Figure 2.8
Example Network Training Set

From figure 2.8, it can be seen that only two feature detectors are needed; the
second and third patterns are combined in the first. Thus, if the feature detectors are
organised to respond to the second and third patterns, then the first pattern can be
identified when both feature detectors are activated. After training, the features
encoded by the network were found by reading weight values from the trained
network. The weights were read from the first layer of weights, for connections that
originated at the input layer and terminated at the two hidden “feature detector”
units. Graphs of these weights were used to visually represent the features to which
each of the hidden units responded.

Figure 2.9

It is apparent from figure 2.9 that these features matched the contours of the second
and third training patterns, providing distinguishing characteristics for all three
patterns.
The Output Layer
The output layer of a network uses the response of the feature detectors in the hidden
layer. Each unit in the output layer emphasises each feature according to the values of
the connecting weights. The pattern of activation at this layer is taken as the
network’s response.
The backpropagation neural network is by far the most powerful and adaptive
neural model available. It is an excellent choice for any form of pattern mapping
problem. For this study, backpropagation networks were chosen because of their
effectiveness and versatility.
(10 marks)
d) Calculate the weight adjustments in the following network for expected outputs
of {1,1} and the learning rate is 1:
(9 marks)
The Target Values are {1, 1} and the learning rate is 1.
Use F(x) = 1 / (1 + e^-x) as the Activation Function.

Forward pass:
iW1 = {1 * 1 + 0 * -1, 1 * -1 + 0 * 1} = {1, -1}
h = F(iW1) = {F(1), F(-1)} = {0.73, 0.27}
hW2 = {0.73 * -1 + 0.27 * 0, 0.73 * 0 + 0.27 * -1} = {-0.73, -0.27}
o = F(hW2) = {F(-0.73), F(-0.27)} = {0.325, 0.433}

Output node deltas, dj = oj(1 - oj)(tj - oj):
d1 = 0.325(1 - 0.325)(1 - 0.325) = 0.325 * 0.675 * 0.675 = 0.148
d2 = 0.433(1 - 0.433)(1 - 0.433) = 0.433 * 0.567 * 0.567 = 0.139

Hidden node errors, ei = hi(1 - hi) Σj W2(i, j) dj:
e1 = 0.73(1 - 0.73)(-1 * 0.148 + 0 * 0.139) = 0.197 * -0.148 = -0.029
e2 = 0.27(1 - 0.27)(0 * 0.148 + -1 * 0.139) = 0.197 * -0.139 = -0.027

Weight adjustments for the second layer, ΔW2(i, j)t = β hi dj + α ΔW2(i, j)t-1,
with β = 1 and no momentum (α = 0):

ΔW2 = ( 0.73 * 0.148   0.73 * 0.139 )   ( 0.108  0.101 )
      ( 0.27 * 0.148   0.27 * 0.139 ) = ( 0.040  0.038 )
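The worked answer can be checked numerically. The matrices W1 and W2 below are those implied by the products in the working above (an inference from the arithmetic, since the original network diagram is not reproduced here).

```python
import math

# Numerical check of the worked example. W1 and W2 are inferred from the
# products in the working; the network diagram itself is not reproduced.

sig = lambda x: 1.0 / (1.0 + math.exp(-x))

i = [1.0, 0.0]
W1 = [[1.0, -1.0], [-1.0, 1.0]]   # input -> hidden weights
W2 = [[-1.0, 0.0], [0.0, -1.0]]   # hidden -> output weights
t = [1.0, 1.0]                    # target values
beta = 1.0                        # learning rate

h = [sig(sum(i[k] * W1[k][j] for k in range(2))) for j in range(2)]
o = [sig(sum(h[k] * W2[k][j] for k in range(2))) for j in range(2)]

# generalised delta rule: output deltas, hidden errors, weight adjustments
d = [o[j] * (1 - o[j]) * (t[j] - o[j]) for j in range(2)]
e = [h[k] * (1 - h[k]) * sum(W2[k][j] * d[j] for j in range(2)) for k in range(2)]
dW2 = [[beta * h[k] * d[j] for j in range(2)] for k in range(2)]

print([round(v, 3) for v in h])   # [0.731, 0.269]
print([round(v, 3) for v in o])   # [0.325, 0.433]
print([round(v, 3) for v in d])   # [0.148, 0.139]
print([[round(v, 3) for v in row] for row in dW2])
```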