NEURAL NETWORKS

Based on sets of notes prepared by Dr. Patrick H. Corr, Brendan Coburn and John Gilligan

Neural Networks (also known as Connectionist models or Parallel Distributed Processing models) are information processing systems which model the brain's cognitive process by imitating some of its basic structures and operations. Interest in these networks was originally biologically motivated: they were developed with the expectation of gaining new insight into the workings of the brain. In the 1940s and 1950s there was a certain amount of success in the research and development of neural networks, but eventually the attraction of these systems declined due to a number of factors. Within the past decade, however, interest in neural networks has been revived. Although these networks are still helpful in research on the brain's cognitive processes, today they are also used in practical information processing.

In 1943 two scientists, Warren McCulloch and Walter Pitts, proposed the first artificial model of a biological neuron [McC]. This synthetic neuron is still the basis for most of today's neural networks. Rosenblatt later developed his two-layer perceptron, which Minsky and Papert subsequently showed to be severely limited; this led to a huge decline in funding and interest in neural networks.

2.3.5 Other Developments

During this period, even though there was a lack of funding and interest in neural networks, a small number of researchers continued to investigate the potential of neural models. A number of papers were published, but none had any great impact. Many of these reports concentrated on the potential of neural networks for aiding in the explanation of biological behaviour (e.g. [Mal], [Bro], [Mar], [Bie], [Coo]). Others focused on real world implementations. In 1972 Teuvo Kohonen and James A.
Anderson independently proposed the same model for associative memory [Koh], [An1] and in 1976 Marr and Poggio applied a neural network to a realistic problem in computational vision, stereopsis [Mar]. Other projects included [Lit], [Gr1], [Gr2], [Ama], [An2], [McC].

2.4.1 The Discovery of Backpropagation

The backpropagation learning algorithm was developed independently by Rumelhart [Ru1], [Ru2], Le Cun [Cun] and Parker [Par] in 1986. It was subsequently discovered that the algorithm had also been described by Paul Werbos in his Harvard Ph.D thesis in 1974 [Wer]. Error backpropagation networks are the most widely used neural network model as they can be applied to almost any problem that requires pattern mapping. It was the discovery of this paradigm that brought neural networks out of the research area and into real world implementation.

What and why?

Neural Networks: a bottom-up attempt to model the functionality of the brain. Two main areas of activity:

Biological
o Try to model biological neural systems

Computational
o Artificial neural networks are biologically inspired but not necessarily biologically plausible
o So may use other terms: Connectionism, Parallel Distributed Processing, Adaptive Systems Theory

A simplified view of a neuron is shown in the diagram below. Signals move from neuron to neuron via electrochemical reactions. The synapses release a chemical transmitter which enters the dendrite. This raises or lowers the electrical potential of the cell body. The soma sums the inputs it receives and once a threshold level is reached an electrical impulse is sent down the axon (often known as firing). These impulses eventually reach synapses and the cycle continues. Synapses which raise the potential within a cell body are called excitatory. Synapses which lower the potential are called inhibitory. It has been found that synapses exhibit plasticity.
This means that long-term changes in the strengths of the connections can be formed depending on the firing patterns of other neurons. This is thought to be the basis for learning in our brains.

Marking Scheme: 1 x 5 basic components, 2 for diagram.

(c) How is the neuron modelled in Artificial Neural Nets? (6 marks)

Modelling a Neuron

To model the brain we need to model a neuron. Each neuron performs a simple computation. It receives signals from its input links and it uses these values to compute the activation level (or output) for the neuron. This value is passed to other neurons via its output links. The input value received by a neuron is calculated by summing the weighted input values from its input links. That is,

in_i = sum_j w_j,i * a_j

An activation function takes the neuron input value and produces a value which becomes the output value of the neuron. This value is passed to other neurons in the network. This is summarised in this diagram and the notes below.

a_j : Activation value of unit j
w_j,i : Weight on the link from unit j to unit i
in_i : Weighted sum of inputs to unit i
a_i : Activation value of unit i (also known as the output value)
g : Activation function

Or, in English: a neuron is connected to other neurons via its input and output links. Each incoming neuron has an activation value and each connection has a weight associated with it. The neuron sums the incoming weighted values and this value is input to an activation function. The output of the activation function is the output from the neuron.

Some common activation functions are shown below. These functions can be defined as follows.

Step_t(x) = 1 if x >= t, else 0
Sign(x) = +1 if x >= 0, else -1
Sigmoid(x) = 1 / (1 + e^-x)

On occasion an identity function is also used (i.e. where the input to the neuron becomes the output). This function is normally used in the input layer, where the inputs to the neural network are passed into the network unchanged.
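The computation just described can be sketched directly in code. This is a minimal illustration of the model neuron, not part of the original notes; the function names and the three-input example values are illustrative.

```python
# A minimal sketch of the artificial neuron described above: a weighted sum
# of incoming activations followed by an activation function g.
import math

def step(x, t=0.0):
    """Step_t(x) = 1 if x >= t, else 0."""
    return 1 if x >= t else 0

def sign(x):
    """Sign(x) = +1 if x >= 0, else -1."""
    return 1 if x >= 0 else -1

def sigmoid(x):
    """Sigmoid(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(a, w, g=sigmoid):
    """in_i = sum_j w[j] * a[j]; the neuron's output is a_i = g(in_i)."""
    in_i = sum(wj * aj for wj, aj in zip(w, a))
    return g(in_i)

# Illustrative example: three incoming activations and weights, in_i = 0.5
print(neuron_output([1.0, 0.5, -1.0], [0.4, 0.6, 0.2], g=step))     # -> 1
print(neuron_output([1.0, 0.5, -1.0], [0.4, 0.6, 0.2], g=sigmoid))  # ~ 0.62
```

Swapping `g` between the step, sign and sigmoid functions changes only how the weighted sum is squashed, not how it is computed.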
Interest in neural networks differs according to profession.

Neurobiologists and psychologists: understanding our brain
Engineers and physicists: a tool to recognise patterns in noisy data
Business analysts and engineers: a tool for modelling data
Computer scientists and mathematicians: networks offer an alternative model of computing - machines that may be taught rather than programmed
Artificial intelligentsia, cognitive scientists and philosophers: subsymbolic processing (reasoning with patterns, not symbols)

Symbol                          Subsymbol
top-down                        bottom-up
explicit                        implicit
rules                           examples
serial                          parallel
digital/boolean (true or false) analog/fuzzy
brittle                         robust

Some Application Areas

A good overview of NN applications is provided on the pages set up by the DTI Neural Applications programme and the later Smart software for decision makers programme: NCTT programme; Smart software for decision makers. I've combined their overview with Biggus's classification of NN applications in further notes on neural network applications.

Neural computing provides an approach which is closer to human perception and recognition than traditional computing. Neural computing systems are adept at many pattern recognition tasks, more so than both traditional statistical and expert systems.

Combinatorial Problems

Neural computing systems have shown some promise for solving 'NP-complete' problems. In solving this type of problem neural networks offer the facility to find a 'dirty' solution quickly, rather than using significantly more resources to find the optimal solution for little extra gain. The multi-layer perceptron has been applied to a wide variety of problems calling for a non-linear mapping between input and output.

NETTALK AND DECTALK

DECTALK is a system developed by DEC which reads English characters and produces, with 95% accuracy, the correct pronunciation for an input word or text pattern. DECTALK is an expert system which took 20 years to finalise.
It uses a list of pronunciation rules and a large dictionary of exceptions. NETTALK, a neural network version of DECTALK, was constructed over one summer vacation. After 16 hours of training the system could read a 100-word sample text with 98% accuracy! When trained with 15,000 words it achieved 86% accuracy on a test set. NETTALK is an example of one of the advantages of the neural network approach over the symbolic approach to Artificial Intelligence. It is difficult to simulate the learning process in a symbolic system; rules and exceptions must be known. On the other hand, neural systems exhibit learning very clearly; the network learns by example.

COMPARISONS

Neural networks are not a panacea for problems in information processing which traditional methods find difficult or complex. The neural approach is simply an alternative paradigm.

ATTRACTIVE PROPERTIES OF NEURAL NETWORKS

Parallelism
Neural networks are inherently parallel and naturally amenable to expression in a parallel notation and implementation on parallel hardware.

Capacity for Adaptation
In general, neural systems are capable of learning. Some networks have the capacity to self-organise, ensuring their stability as dynamic systems. A self-organising network can take account of a change in the problem that it is solving, or may learn to resolve the problem in a new manner.

Distributed Memory
In neural networks 'memory' corresponds to an activation map of the neurons. Memory is thus distributed over many units, giving resistance to noise. In distributed memories, such as neural networks, it is possible to start with noisy data and to recall the correct data.

Fault Tolerance
Distributed memory is also responsible for fault tolerance. In most neural networks, if some PEs (processing elements) are destroyed, or their connections altered slightly, then the behaviour of the network as a whole is only slightly degraded.
The characteristic of graceful degradation makes neural computing systems extremely well suited to applications where failure of control equipment means disaster.

Capacity for Generalisation
Designers of expert systems have difficulty formulating rules which encapsulate an expert's knowledge in relation to some problem. A neural system may learn the rules simply from a set of examples. The generalisation capacity of a neural network is its capacity to give a satisfactory response for an input which is not part of the set of examples on which it was trained. The capacity for generalisation is an essential feature of a classification system. Certain aspects of generalisation behaviour are interesting because they are intuitively quite close to human generalisation.

Ease of Construction
Computer simulations of small applications can be implemented relatively quickly.

LIMITATIONS IN THE USE OF NEURAL NETWORKS

Neural systems are inherently parallel but are normally simulated on sequential machines.
o Processing time can rise quickly as the size of the problem grows - the Scaling Problem.
o However, a direct hardware approach would lose the flexibility offered by a software implementation.
o In consequence, neural networks have been used to address only small problems.
The performance of a network can be sensitive to the quality and type of preprocessing of the input data.
Neural networks cannot explain the results they obtain; their rules of operation are completely unknown.
Performance is measured by statistical methods, giving rise to distrust on the part of potential users.
Many of the design decisions required in developing an application are not well understood.

Comparison of neural techniques and symbolic artificial intelligence

Early work on neural systems was largely abandoned after serious limitations of the earlier models were highlighted in 1969.
Growth of Artificial Intelligence was based on the hypothesis that thought processes could be modelled using a set of symbols and applying a set of logical transformation rules.

The symbolic approach has a number of limitations:
It is essentially sequential and difficult to parallelise.
When the quantity of data increases, the methods may suffer a combinatorial explosion.
An item of knowledge is represented by a precise object, perhaps a byte in memory, or a production rule. This localised representation of knowledge does not lend itself to a robust system.
The learning process seems difficult to simulate in a symbolic system.

The connectionist approach offers the following advantages over the symbolic approach:
parallel and real-time operation of many different components
the distributed representation of knowledge
learning by modifying connection weights.

Both approaches are likely to be combined in the future. For now, here are some rules of thumb for choosing the approach to use:

Input -> Output                                  Model                                                      System
Facts -> Decision                                Logic representation (facts and rules) of human expertise  Expert system
Facts -> Decision                                Learned from data                                          Machine induction
Numbers (measurements, predictions) -> Decision  Mathematical calculation                                   Decision support system
Numbers -> Decision                              Learned from data                                          Neural network

Multiple Layer Feed Forward Networks

Solving non-linearly separable problems

As pointed out before, XOR is an example of a non-linearly separable problem which two-layer neural nets cannot solve. By adding another layer, the hidden layer, such problems can be solved.

Solving XOR with NETS
With 3 hidden nodes in one hidden layer - xor3.net. Example weight file - xor3.pwt
With 1 hidden node - xor31.net. Example weight file - xor31.pwt

In xor31.net you can roughly trace how the trained net works. Assume nodes have step function thresholds at 0.0 and output 0 or 1. (They actually use the sigmoid function.) If you apply x1 = 0 and x2 = 0, no signal arrives at the middle node. Since the bias is 2.2 the node outputs 1 through the -6.51 weight.
The outer connections also receive no inputs, so the output node receives nothing from those lines. The -6.51 signal outweighs the 3.93 bias so there is no output at y, a correct result.

If x1 = 1 and x2 = 1, then the weighted inputs into the centre node overwhelm the 2.12 bias on the node and it produces no output along the line weighted -6.51. Similarly, along the outer lines the weighted input to the output node is 1 * -2.64 + 1 * -2.66 = -5.3. This outweighs the 3.93 bias so that the net input to the node is -1.37, which is below the 0 threshold, so there is no output at y. This is also correct.

If x1 = 1 and x2 = 0, the net input to the centre node, including the bias, is only -3.24, well below 0.0, so it does not fire. Therefore there is no input to the output node along the centre connection. On the left outer line we have an input of 1 * -2.64. On the right outer line the input to the output node is 0. The bias is 3.93 so the net input to the output node is 3.93 - 2.64 = 1.29 > 0.0, so the node fires and an output of 1 appears at y.

If x1 = 0 and x2 = 1, the result is similar to x1 = 1 and x2 = 0. In all cases the net correctly implements the XOR function.

It turns out that 3 layer or 4 layer nets can mimic any computable function. Thus they have the computing power of a Turing machine. In other words, they are as powerful as any computational device can be. One might ask, just what is the net learning, and how is it trained?

Features

People have tried, with limited success, to better understand these networks with hidden layers by associating the hidden nodes with features of the input data. This diagram illustrates the idea of feature detection. The idea is that each of the two hidden nodes becomes associated with a feature of the input data set. The following sketches illustrate what might happen. Don't take the lines too seriously - they are just for visual effect. The first input pattern is 0101010 which produces output 100.
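The trace above can be replayed in a few lines of code. The output-node weights (-2.64, -2.66, -6.51) and the biases (2.12 and 3.93) are the values quoted in the walkthrough; the two input-to-hidden weights are not stated explicitly, so the value -5.36 used below is inferred from the quoted net input of -3.24 for a single active input (2.12 - 5.36 = -3.24) and is an assumption, not a value read from xor31.net.

```python
# Replaying the xor31.net trace with step-function thresholds at 0.0.
# Input-to-hidden weight -5.36 is inferred from the stated net input -3.24.
def step(x):
    return 1 if x >= 0.0 else 0

def xor_net(x1, x2):
    # Single hidden node: fires only when both inputs are off
    h = step(2.12 - 5.36 * x1 - 5.36 * x2)
    # Output node combines the hidden node and the two direct outer lines
    return step(3.93 - 6.51 * h - 2.64 * x1 - 2.66 * x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))
```

Running this prints 0, 1, 1, 0 for the four input pairs, confirming that the quoted weights implement XOR under the step-function simplification.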
Similarly 0100010 ==> 010, and 0001000 ==> 001. Diagram (b) shows the input and output nodes. But what about the 'hidden' nodes? They might look like this. Fig. (c) represents the features detected by the network's hidden nodes. The points represent the weights along the connections coming from the 7 input nodes to each of the hidden nodes. The connections from the hidden nodes to the output nodes combine these features to produce the appropriate output. The joining of the points is meant to provide an aid in visualising how the features 'add up' to produce the output. Many people take this feature analysis with a grain of salt, and prefer to visualise the network as a black box which, when trained, does what it is supposed to.

Training multilayer networks with back propagation

The discovery of the backpropagation algorithm led to an explosion of interest in neural networks. Feed forward multilayer networks trained with the backpropagation algorithm are still the most common kind today. How does the algorithm work?

The big picture

Calculating errors for a 3 layer network

First of all, the appropriate error signals must be calculated. For each output node j,

D(j) = (t(j) - y(j)) * f'(I(j))

and for each hidden node i,

e(i) = f'(I(i)) * sum over j of w(i, j) * D(j)

or, more simply, when f(I) is the sigmoid function,

D(j) = (t(j) - y(j)) * y(j) * (1 - y(j))
e(i) = y(i) * (1 - y(i)) * sum over j of w(i, j) * D(j)

Here f(I) is the threshold function. The total weighted signal (sometimes called 'activation') into a node is represented by I. The y's are outputs from various nodes and the t's are the target outputs. The index j numbers the output nodes. The index i numbers the hidden nodes (middle layer). w(i, j) is the weight on the line going from node i in the hidden layer to node j in the output layer. It is assumed in the second set of equations that f(I) is the sigmoid function, whose derivative is f'(I) = f(I)(1 - f(I)). The errors at the output nodes are smoothed by multiplying them by this derivative of the sigmoid function. This has the effect of reducing the effects of larger errors. Once the errors at the output nodes are calculated, they are treated as inputs starting from the output node.
The net is run 'backwards' as the error signals are propagated from the output node layer towards the input node layer. This backward error propagation (backward in the sense that the error signals flow in the opposite direction to the direction of the activations caused by the normal inputs) enables the calculation of errors at the hidden layer corresponding to the output errors. The second or fourth of the above equations determines the errors at the hidden nodes. The calculation is similar to the delta rule. The error at each hidden node i coming from output node j is proportional to the weight of the line from i to j and to the error at output node j (calculated using the first or third equation). The total error at hidden node i is the sum of the errors coming from all the output nodes receiving activation from node i. Once again, the derivative of the sigmoid function smooths the results by diminishing the effects of large errors.

Using the errors to adjust the weights

Once the errors are known, the weights can be adjusted, layer by layer. The weights into the hidden nodes are also adjusted correctly. The formula is

new w(i, j) = w(i, j) + beta * D(j) * y(i)

where beta is the learning rate, D(j) is the error at the downstream node j and y(i) is the output of the upstream node i. The correction works layer by layer, backwards, from output layer to input layer. In the 3 layer (of nodes) net there are only two layers of weights. The errors at the output nodes are used to calculate the new weights between the hidden (middle) layer and the output layer. Then the delta formula is used again to calculate the new weights between the input and the hidden layer, using the errors at the middle layer derived from the 2nd or 4th error equations above. Here is an attempt to put it all together. The notation is slightly different from the above notation.

2.3.4 Linear Separability and Minsky and Papert's Perceptrons

During the years following the introduction of the perceptron and the ADALINE, research in neural networks prospered.
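The whole procedure - forward pass, output errors, backpropagated hidden errors, delta-rule weight updates - can be sketched for a small sigmoid network. This is a generic illustration rather than the exact notation of these notes; the 2-2-1 architecture, learning rate, epoch count and random initialisation are illustrative choices, and whether XOR is learned perfectly depends on the random starting weights.

```python
# Sketch of backpropagation for a 2-2-1 sigmoid network trained on XOR.
import math, random

random.seed(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

n_in, n_hid, n_out = 2, 2, 1
# weights[to][from]; the last entry of each row is the bias weight
W1 = [[random.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hid)]
W2 = [[random.uniform(-1, 1) for _ in range(n_hid + 1)] for _ in range(n_out)]
beta = 0.5
data = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]

def forward(x):
    h = [sigmoid(sum(w * v for w, v in zip(row, list(x) + [1.0]))) for row in W1]
    y = [sigmoid(sum(w * v for w, v in zip(row, h + [1.0]))) for row in W2]
    return h, y

def total_error():
    return sum((t[0] - forward(x)[1][0]) ** 2 for x, t in data)

err_before = total_error()
for epoch in range(20000):
    for x, t in data:
        h, y = forward(x)
        # output errors: D(j) = (t(j) - y(j)) * y(j) * (1 - y(j))
        d = [(tj - yj) * yj * (1 - yj) for tj, yj in zip(t, y)]
        # hidden errors: e(i) = h(i) * (1 - h(i)) * sum_j w(i, j) * D(j)
        e = [h[i] * (1 - h[i]) * sum(W2[j][i] * d[j] for j in range(n_out))
             for i in range(n_hid)]
        # delta-rule updates, layer by layer, backwards
        for j in range(n_out):
            for i in range(n_hid):
                W2[j][i] += beta * d[j] * h[i]
            W2[j][n_hid] += beta * d[j]      # bias weight
        for i in range(n_hid):
            for k in range(n_in):
                W1[i][k] += beta * e[i] * x[k]
            W1[i][n_in] += beta * e[i]       # bias weight
err_after = total_error()
print("total squared error before %.3f, after %.3f" % (err_before, err_after))
```

The per-pattern updates follow the two error equations and the delta formula above; training drives the total squared error down from its initial value.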
However, both Rosenblatt's network and Widrow and Hoff's adaptive neurons suffered from one major problem. To illustrate this problem, consider the way the neurons in both models compute their net inputs. The equation used is

I = sum from i=1 to n of w_i * x_i

This can be rewritten (for a two dimensional input pattern) as

I = x1*w1 + x2*w2

This equation is also used to calculate the dot product of two vectors in a Cartesian co-ordinate system. Therefore what is really happening is that the dot product between the weight vector and the input vector is being calculated. Another form of this equation is

x . w = |x| |w| cos(theta)

Since the length of a vector must always be positive, the only term in the above equation that can affect the sign of the result is the cosine of the angle between the two vectors. The dot product will be positive if the angle between the vectors is less than 90 degrees and negative if the angle is greater than 90 degrees. It can be seen from figure 2.6 that the ADALINE (the outputs used in the figure are for the ADALINE; the only difference with the perceptron is that the outputs would be +1 or 0) produces an output of +1 for input patterns that are within 90 degrees of its weight vector and a -1 for all other patterns. Therefore in order for the ADALINE or perceptron to successfully solve a particular problem, that problem must be linearly separable, i.e. the categories of the inputs can be separated by a straight line. This is a simple representation of this problem, but even so, it is easy to see that the capabilities of these neural models were extremely limited, since the majority of problems do not meet this criterion (the XOR problem being a classic example [Day]).

Minsky and Papert, Perceptrons

In the years leading up to the publication of this book, the appeal of neural networks had begun to dwindle. There were a number of reasons for this decline in interest.
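The dot-product view can be checked numerically: the sign of w . x tracks the angle between the vectors. The weight vector and example inputs below are illustrative, not taken from the notes.

```python
# The perceptron's net input is the dot product w . x = |w| |x| cos(theta):
# its sign depends only on the angle between the weight and input vectors.
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

w = [1.0, 1.0]  # illustrative weight vector
for x in ([2.0, 0.5], [-1.0, 3.0], [-2.0, -0.5]):
    cos_theta = dot(w, x) / (norm(w) * norm(x))
    angle = math.degrees(math.acos(cos_theta))
    output = 1 if dot(w, x) >= 0 else -1  # ADALINE-style output
    print(x, "angle %.0f deg" % angle, "output", output)
```

The first two inputs lie within 90 degrees of w and produce +1; the third lies beyond 90 degrees and produces -1, exactly the separation by a straight line (the line perpendicular to w) described above.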
[Figure 2.6: Separation of two different classes - the weight vector with Class A (+1 output) on one side and Class B (-1 output) on the other]

Firstly, neural networks failed to meet the expectations of the scientific community. Although originally greeted with great excitement, no major progress was made in the follow-up to their discovery, and excessive hype put pressure on the developers of these networks to produce results quickly. Another obstacle that neural networks had to overcome was the apprehension which inevitably meets any attempt to develop artificially intelligent machines. There was constant resistance to the idea of building a 'chunk of brain'. Rosenblatt once quoted a headline from an Oklahoma newspaper [Ros2] which said "Frankenstien Monster Designed by Navy Robot That Thinks." The new field of Artificial Intelligence also lessened the attraction of neural computing as it seemed potentially more successful in the design of machine intelligence. All these factors contributed to the decline of neural networks. However, the final blow came in 1969 with the arrival of the book Perceptrons by Marvin Minsky and Seymour Papert [Min]. This book highlighted many difficulties with the existing neural models, particularly the problem of linear separability. Although this problem was already recognised, its full extent was never really realised. By using mathematical analysis, Minsky and Papert showed that one and two layered neural networks would only ever be able to solve linearly separable problems and therefore their usefulness was extremely limited. This book all but wiped out funding and interest in neural computation. To most scientists it seemed that these networks were a lost cause and the majority of the scientific community turned their attention elsewhere, in particular to the field of Artificial Intelligence.

2.4.2 An Overview of Backpropagation Networks

In their book Perceptrons, Minsky and Papert concluded that any network with only one layer of adjustable weights would only ever be able to solve linearly separable problems.
They also concluded that no algorithm could ever be constructed that would allow the proper modification of the weights in networks with two or more layers, and therefore that the potential of neural networks was extremely limited. However, with the discovery of backpropagation, this obstacle was overcome.

Backpropagation Network Architecture

A backpropagation network typically consists of three or more layers of nodes. The first layer is known as the input layer and the last layer is known as the output layer. Any layers of nodes in between the input and output layers are known as hidden layers. Each unit in a layer is connected to every unit in the next layer. There are no intra-layer connections. The operation of the network consists of a forward pass of the input through the network and then a backward pass of an error value which is used in the weight modification (figure 2.7).

[Figure 2.7: A Backpropagation Network - input, hidden and output layers, with the input propagated forward and the error propagated backward]

Forward Propagation

A forward propagation step is initiated when an input pattern is presented to the network. No processing is performed at the input layer. The pattern is propagated forward to the next layer, and each node in this layer performs a weighted sum on all its inputs (as in the perceptron and ADALINE). After this sum has been calculated, a function is used to compute the unit's output. The function used to perform this operation is the sigmoid function,

f(x) = 1 / (1 + e^-x)

The main reason why this particular function is chosen is that its derivative, which is used in the learning law, is easily computed. The result obtained after applying this function to the net input is taken to be the node's output value. This process is continued until the pattern has been propagated through the entire network and reaches the output layer. The activation pattern at the output layer is taken as the network's result.
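The claim about the sigmoid's derivative is easy to verify: f'(x) = f(x)(1 - f(x)), so during learning the derivative can be computed from the node's output alone, with no extra function evaluation. A quick numerical check (the sample points are illustrative):

```python
# The sigmoid's derivative can be written in terms of its own output:
# f'(x) = f(x) * (1 - f(x)), compared here against a finite difference.
import math

def f(x):
    return 1.0 / (1.0 + math.exp(-x))

for x in (-2.0, 0.0, 1.5):
    analytic = f(x) * (1 - f(x))
    # central finite difference as an independent check
    h = 1e-6
    numeric = (f(x + h) - f(x - h)) / (2 * h)
    print(x, round(analytic, 6), round(numeric, 6))
```

The two columns agree at every sample point, which is why backpropagation implementations simply reuse each node's output value when computing the weight adjustments.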
Backward Propagation

The first step in the backpropagation stage is the calculation of the error between the network's result and the desired response. This occurs when the forward propagation phase is completed. Each processing unit in the output layer is compared to its corresponding entry in the desired pattern and an error is calculated for each node in the output layer. The weights are then modified for all of the connections going into the output layer. Next, the error is backpropagated to the hidden layers and, by using the generalised delta rule, the weights are adjusted for all connections going into the hidden layer. The procedure is continued until the last layer of weights has been modified. The forward and backward propagation phases are repeated until the network's output is equal to the desired result.

The Backpropagation Learning Law

The learning law used in the backpropagation network is a modification of the original delta rule used in the ADALINE. The new form, known as the Generalised Delta Rule, allows for the adjustment of the weights in the hidden layer, a feat deemed impossible by Minsky and Papert. It uses the derivative of the activation function of the nodes (which in most cases is the sigmoid function) to determine the extent of the adjustment to the weights connecting to the hidden layers. A full mathematical description of the generalised delta rule can be found in [Bra].

2.4.3 Components of a Backpropagation Network

The Input Layer

The input layer of a backpropagation network acts solely as a buffer to hold the patterns being presented to the network. Each node in the input layer corresponds to one entry in the pattern. No processing is done at the input layer. The pattern is fed forward from the input layer to the next layer.

The Hidden Layers

It is the hidden layers which give the backpropagation network its exceptional computational abilities. The units in the hidden layers act as "feature detectors".
They extract information from the input patterns which can be used to distinguish between particular classes. The network creates its own internal representation of the data. An example of this feature detection is given in [Day]. A backpropagation network with one hidden layer comprising two units was trained on the patterns shown in figure 2.8. The patterns were chosen so that the distinguishing features of each pattern were obvious.

[Figure 2.8: Example Network Training Set]

From figure 2.8, it can be seen that only two feature detectors are needed; the second and third patterns are combined in the first. Thus, if the feature detectors are organised to respond to the second and third patterns, then the first pattern can be identified when both feature detectors are activated. After training, the features encoded by the network were found by reading weight values from the trained network. The weights were read from the first layer of weights, for connections that originated at the input layer and terminated at the two hidden "feature detector" units. Graphs of these weights were used to visually represent the features to which each of the hidden units responded.

[Figure 2.9: Feature Detectors]

It is apparent from figure 2.9 that these features matched the contours of the second and third training patterns, providing distinguishing characteristics for all three patterns.

The Output Layer

The output layer of a network uses the response of the feature detectors in the hidden layer. Each unit in the output layer emphasises each feature according to the values of the connecting weights. The pattern of activation at this layer is taken as the network's response.

Summary

The backpropagation neural network is by far the most powerful and adaptive neural model available. It is an excellent choice for any form of pattern mapping problem. For this study, backpropagation networks were chosen because of their effectiveness and versatility.
(10 marks)

d) Calculate the weight adjustments in the following network for expected outputs of {1, 1}, where the learning rate is 1: (9 marks)

The target values are t = {1, 1} and the learning rate is 1. Use F(x) = 1 / (1 + e^-x) as the activation function. The input is i = {1, 0}, the first layer of weights is W1 = [[1, -1], [-1, 1]] and the second layer is W2 = [[-1, 0], [0, -1]].

Forward pass:

iW1 = {1*1 + 0*-1, 1*-1 + 0*1} = {1, -1}
h = F(iW1) = {F(1), F(-1)} = {0.73, 0.27}
hW2 = {0.73*-1 + 0.27*0, 0.73*0 + 0.27*-1} = {-0.73, -0.27}
o = F(hW2) = {F(-0.73), F(-0.27)} = {0.33, 0.43}

Errors at the output nodes, d_j = o_j (1 - o_j)(t_j - o_j):

d1 = 0.33(1 - 0.33)(1 - 0.33) = 0.33 * 0.67 * 0.67 = 0.148
d2 = 0.43(1 - 0.43)(1 - 0.43) = 0.43 * 0.57 * 0.57 = 0.140

Errors at the hidden nodes, e_i = h_i (1 - h_i)(W2_i1 * d1 + W2_i2 * d2):

e1 = 0.73(1 - 0.73)(-1 * 0.148 + 0 * 0.140) = 0.197 * -0.148 = -0.029
e2 = 0.27(1 - 0.27)(0 * 0.148 + -1 * 0.140) = 0.197 * -0.140 = -0.028

Weight adjustments for the second layer, W2(t) = W2(t-1) + beta * h^T d, with beta = 1:

h^T d = (0.73) (0.148  0.140) = (0.73*0.148  0.73*0.140) = (0.108  0.102)
        (0.27)                   (0.27*0.148  0.27*0.140)   (0.040  0.038)

W2(t) = (-1 + 0.108   0 + 0.102) = (-0.892   0.102)
        ( 0 + 0.040  -1 + 0.038)   ( 0.040  -0.962)

The first layer of weights is adjusted in the same way using the hidden-node errors: W1(t) = W1(t-1) + beta * i^T e.
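The hand calculation can be checked mechanically. The script below re-runs the forward pass, the error calculation and both weight updates with the input and weight values implied by the working, using the generalised delta rule with output error d = o(1 - o)(t - o); small differences from the hand figures come from rounding to two or three decimal places.

```python
# Numerical check of the worked example: a 2-2-2 sigmoid network,
# input {1, 0}, targets {1, 1}, learning rate 1.
import math

def F(x):
    """Sigmoid activation, F(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

i = [1.0, 0.0]                    # input pattern
t = [1.0, 1.0]                    # target outputs
W1 = [[1.0, -1.0], [-1.0, 1.0]]  # input -> hidden weights, W1[from][to]
W2 = [[-1.0, 0.0], [0.0, -1.0]]  # hidden -> output weights
beta = 1.0                       # learning rate

# Forward pass
h = [F(sum(i[k] * W1[k][j] for k in range(2))) for j in range(2)]
o = [F(sum(h[k] * W2[k][j] for k in range(2))) for j in range(2)]

# Output errors: d_j = o_j (1 - o_j) (t_j - o_j)
d = [o[j] * (1 - o[j]) * (t[j] - o[j]) for j in range(2)]

# Hidden errors: e_i = h_i (1 - h_i) * sum_j W2[i][j] d_j
e = [h[k] * (1 - h[k]) * sum(W2[k][j] * d[j] for j in range(2)) for k in range(2)]

# Weight updates: new weight = old weight + beta * (upstream output) * (error)
W2_new = [[W2[k][j] + beta * h[k] * d[j] for j in range(2)] for k in range(2)]
W1_new = [[W1[k][j] + beta * i[k] * e[j] for j in range(2)] for k in range(2)]

print("h  =", h)        # ~ [0.731, 0.269]
print("o  =", o)        # ~ [0.325, 0.433]
print("d  =", d)
print("e  =", e)
print("W2 ->", W2_new)
print("W1 ->", W1_new)
```

Working in full precision rather than two decimal places gives W2 entries of roughly -0.892, 0.102, 0.040 and -0.963, matching the hand calculation to rounding.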