Introduction to the multilayer perceptron

To be able to solve nonlinearly separable problems, a number of neurons are connected in layers to build a multilayer perceptron. Each of the perceptrons is used to identify small linearly separable sections of the inputs. Outputs of the perceptrons are combined into another perceptron to produce the final output.

The hard-limiting (step) function used for producing the output prevents information on the real inputs flowing on to inner neurons. To solve this problem, the step function is replaced with a continuous function, usually the sigmoidal function.

The Architecture of the Multilayer Perceptron

In a multilayer perceptron, the neurons are arranged into an input layer, an output layer and one or more hidden layers.

533563331 1

The Generalised Delta Rule

The learning rule for the multilayer perceptron is known as "the generalised delta rule" or the "backpropagation rule". The generalised delta rule repetitively calculates an error function for each input and backpropagates the error from one layer to the previous one. The weights for a particular node are adjusted in direct proportion to the error in the units to which it is connected.

Let

$E_p$ = error function for pattern p
$t_{pj}$ = target output for pattern p on node j
$o_{pj}$ = actual output for pattern p on node j
$w_{ij}$ = weight from node i to node j

The error function $E_p$ is defined to be proportional to the square of the difference $t_{pj} - o_{pj}$:

$$E_p = \frac{1}{2} \sum_j (t_{pj} - o_{pj})^2 \qquad (1)$$

The activation of each unit j, for pattern p, can be written as

$$net_{pj} = \sum_i w_{ij} \, o_{pi} \qquad (2)$$

The output from each unit j is determined by the non-linear transfer function $f_j$:

$$o_{pj} = f_j(net_{pj})$$

We assume $f_j$ to be the sigmoid function,

$$f(net) = \frac{1}{1 + e^{-k \cdot net}}$$

where k is a positive constant that controls the "spread" of the function.

The delta rule implements weight changes that follow the path of steepest descent on a surface in weight space. The height of any point on this surface is equal to the error measure $E_p$.
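The steepest-descent claim can be checked numerically for a single sigmoid output unit. The following is a minimal sketch, not a full network: the weights, inputs, target, gain of 1 and k = 1 are hypothetical values chosen for illustration, and the function names are assumptions of this example. It compares the weight change $\eta \, \delta_{pj} \, o_{pi}$, using the output-unit error term $\delta_{pj} = k \, o_{pj}(1 - o_{pj})(t_{pj} - o_{pj})$ given later in the learning algorithm, with a finite-difference estimate of $-\partial E_p / \partial w_{ij}$.

```python
import math

K = 1.0  # sigmoid "spread" constant k (value assumed for illustration)


def sigmoid(net, k=K):
    """Transfer function f(net) = 1 / (1 + exp(-k * net))."""
    return 1.0 / (1.0 + math.exp(-k * net))


def forward(weights, inputs, k=K):
    """Output of one unit: net = sum_i w_i * o_i (Eq. 2), then o = f(net)."""
    net = sum(w * o for w, o in zip(weights, inputs))
    return sigmoid(net, k)


def error(weights, inputs, target, k=K):
    """E_p = 1/2 * (t - o)^2, i.e. Eq. (1) with a single output node."""
    return 0.5 * (target - forward(weights, inputs, k)) ** 2


def delta_rule_change(weights, inputs, target, eta=1.0, k=K):
    """Weight change eta * delta * o_i with delta = k * o * (1 - o) * (t - o)."""
    o = forward(weights, inputs, k)
    delta = k * o * (1.0 - o) * (target - o)
    return [eta * delta * o_i for o_i in inputs]


def numerical_gradient(weights, inputs, target, h=1e-6, k=K):
    """Central-difference estimate of dE_p/dw_i for each weight."""
    grads = []
    for i in range(len(weights)):
        up = list(weights)
        up[i] += h
        down = list(weights)
        down[i] -= h
        diff = error(up, inputs, target, k) - error(down, inputs, target, k)
        grads.append(diff / (2.0 * h))
    return grads


# Hypothetical example values, not taken from the text.
weights = [0.3, -0.2, 0.1]
inputs = [1.0, 0.5, -1.5]
target = 1.0

change = delta_rule_change(weights, inputs, target)
grad = numerical_gradient(weights, inputs, target)
for dw, g in zip(change, grad):
    print(f"delta-rule change {dw:+.6f}   -dE/dw {-g:+.6f}")
```

With a gain of 1 the two printed columns should agree to within the finite-difference error, which is what "a negative constant of proportionality" means for this unit.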
The steepest-descent property can be demonstrated by showing that the derivative of the error measure with respect to each weight is proportional to the weight change dictated by the delta rule, with a negative constant of proportionality, i.e.,

$$\Delta_p w_{ij} \propto -\frac{\partial E_p}{\partial w_{ij}}$$

The multilayer perceptron learning algorithm using the generalised delta rule

1. Initialise weights (to small random values) and transfer function
2. Present input
3. Adjust weights by starting from output layer and working backwards

$$w_{ij}(t + 1) = w_{ij}(t) + \eta \, \delta_{pj} \, o_{pi}$$

Here $w_{ij}(t)$ represents the weight from node i to node j at time t, $\eta$ is a gain term, and $\delta_{pj}$ is an error term for pattern p on node j.

For output layer units

$$\delta_{pj} = k \, o_{pj} (1 - o_{pj}) (t_{pj} - o_{pj})$$

For hidden layer units

$$\delta_{pj} = k \, o_{pj} (1 - o_{pj}) \sum_k \delta_{pk} \, w_{jk}$$

where the sum is over the k nodes in the following layer. (The summation index k is distinct from the sigmoid constant k that appears as the leading factor.)

Problem solving by multilayer perceptrons - the XOR example

Fig. A solution to the XOR problem.

The hidden unit acts as the feature detector, detecting when both inputs are on. It is possible to produce different network topologies to solve the same problem, and direct connections between the input and output layers are not essential (see Fig. below).

[Figure: an XOR network without direct connections between the input and output layers. Values recovered from the figure, in extraction order: 0.5, 1, -1, 0.5, 1.5, 1, 1, 1, 1; label: input.]

During training using the generalised delta rule, the network is provided with information on the correct output for every input pattern; accordingly, such a training scheme is termed supervised training. The learning rule in a multilayer perceptron is not guaranteed to converge, and it is possible for the network to fall into a situation (the so-called local minima) in which it is unable to learn the correct output.

Multilayer Perceptrons as Classifiers

The single-layer perceptron is limited to calculating a single line of separation between classes. Let us consider a two-layer perceptron with two units in the input layer.
If one unit is set to respond with a 1 if the input is above its decision line, and the other responds with a 1 if the input is below its decision line, the second layer produces a solution in the form of a 1 if its input is above line 1 and below line 2.

Fig. A 2-layer perceptron and the resulting decision region (figure labels: line 1, line 2).

A three-layer perceptron can therefore produce arbitrarily shaped decision regions and is capable of separating any classes. This statement is referred to as the Kolmogorov theorem. Considering pattern recognition as a mapping function from unknown inputs to known classes, any function, no matter how complex, can be represented by a multilayer perceptron of no more than three layers.

The energy landscape

The behaviour of a neural network as it attempts to arrive at a solution can be visualised in terms of the error or energy function $E_p$. The energy is a function of the input and the weights. For a given pattern, $E_p$ can be plotted against the weights to give the so-called energy surface. The energy surface is a landscape of hills and valleys, with points of minimum energy corresponding to wells and maximum energy found on peaks. The generalised delta rule aims to minimise $E_p$ by adjusting weights so that they correspond to points of lowest energy. It does this by the method of gradient descent, where the changes are made in the steepest downward direction. All possible solutions are depressions in the energy surface, known as basins of attraction.

Learning Difficulties in Multilayer Perceptrons

Occasionally, the multilayer perceptron fails to settle into the global minimum of the energy surface and instead finds itself in one of the local minima. This is a consequence of the gradient descent strategy, which only ever moves downhill from the current point.
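How the starting point decides which basin of attraction gradient descent ends up in can be illustrated with a minimal sketch. The one-dimensional surface `energy`, the gain of 0.01, the number of steps and the two starting weights are all assumptions made for illustration; the surface is a toy function with two wells, not the error surface of a trained network.

```python
def energy(w):
    """Toy 1-D energy surface with two wells (assumed for illustration only)."""
    return (w * w - 1.0) ** 2 + 0.3 * w


def energy_gradient(w):
    """Analytic derivative dE/dw of the toy surface above."""
    return 4.0 * w * (w * w - 1.0) + 0.3


def gradient_descent(w, gain=0.01, steps=2000):
    """Repeatedly step in the steepest downward direction from weight w."""
    for _ in range(steps):
        w -= gain * energy_gradient(w)
    return w


# Two hypothetical starting weights, one on each side of the central hill.
w_left = gradient_descent(-2.0)
w_right = gradient_descent(2.0)

print(f"start -2.0 -> w = {w_left:+.4f}, energy = {energy(w_left):+.4f}")
print(f"start +2.0 -> w = {w_right:+.4f}, energy = {energy(w_right):+.4f}")
```

Both runs stop where the gradient vanishes, but only the run started on the left reaches the deeper well. The run started on the right settles in the shallower well and cannot leave it, because every small step away from it increases the energy.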
A number of approaches can be taken to reduce the possibility of settling in a local minimum:

- Lowering the gain term progressively.
- Addition of more nodes for better representation of patterns.
- Introduction of a momentum term, which determines the effect of past weight changes on the current direction of movement in weight space:

$$w_{ij}(t + 1) = w_{ij}(t) + \eta \, \delta_{pj} \, o_{pi} + \alpha \, (w_{ij}(t) - w_{ij}(t - 1))$$

  where the momentum term $\alpha$ satisfies $0 < \alpha < 1$.
- Addition of random noise to perturb the system out of a local minimum.

Advantages of Multilayer Perceptrons

The following two features characterise multilayer perceptrons and artificial neural networks in general. They are mainly responsible for the "edge" these networks have over conventional computing systems.

Generalisation

Neural networks are capable of generalisation, that is, they classify an unknown pattern with other known patterns that share the same distinguishing features. This means noisy or incomplete inputs will be classified because of their similarity with pure and complete inputs.

Fault Tolerance

Neural networks are highly fault tolerant. This characteristic is also known as "graceful degradation". Because of its distributed nature, a neural network keeps on working even when a significant fraction of its neurons and interconnections fail. Also, relearning after damage can be relatively quick.

Applications of Multilayer Perceptrons

The multilayer perceptron with backpropagation has been used in numerous applications ranging from OCR (Optical Character Recognition) to medicine. Brief accounts of a few are given below.

Speech synthesis

A very well known use of the multilayer perceptron is NETtalk [2], a text-to-speech conversion system developed by Sejnowski and Rosenberg in 1987. It consists of 203 input units, 120 hidden units, and 26 output units, with over 27000 synapses. Each output unit represents one basic unit of sound, known as a phoneme.
Context is utilised in training by presenting seven successive letters to the input; the net learns to pronounce the middle letter. The network achieved 90% correct pronunciation on the training set (80-87% on an unseen set). It is resistant to damage and displays graceful degradation.

Multilayer perceptrons are also being used for speech recognition in voice-activated control systems.

Financial applications

Examples include bond rating, loan application evaluation and stock market prediction. Bond rating involves categorising the bond issuer's capability. There are no hard and fast rules for determining these ratings. Statistical regression is inappropriate because the factors to be used are not well defined. Neural networks trained with backpropagation have consistently outperformed standard statistical techniques [3].

Pattern Recognition

For many of the applications of neural networks, the underlying principle is that of pattern recognition.

Networks for target identification from sonar echoes have been developed. Given only a day of training, the net produced 100% correct identification of the target, compared to 93% scored by a Bayesian classifier.

There are many commercial applications of networks in character recognition. One such system performs signature verification on bank cheques.

Networks have been applied to the problems of aircraft identification, and to terrain matching for automatic navigation.

Limitations of Multilayer Perceptrons

1. Computationally expensive learning process
   A large number of iterations is required for learning, so the method is not suitable for real-time learning.
2. No guaranteed solution
   Remedies such as the "momentum term" add to computational cost.
   Other remedies:
   - using estimates of transfer functions
   - using transfer functions with easy-to-compute derivatives
   - using estimates of error values, e.g., a single global error value for the hidden layer
3. Scaling problem
   Multilayer perceptrons do not scale up well from small research systems to larger real systems.
   Both too many and too few units slow down learning.

Biological arguments against Backpropagation

Backpropagation is either not used in biological systems or is carried out through different pathways. Biological systems use only local information for self-adjustment. The question one might ask at this point is: does an effective system need to mimic nature exactly?

REFERENCES

[1] Beale, R., & Jackson, T., "Neural Computing: An Introduction", Bristol: Hilger, c1990.
[2] Sejnowski, T., & Rosenberg, C.R., "Parallel Networks that Learn to Pronounce English Text", Complex Systems, 1987, pp. 145-168.
[3] Dutta, S., & Sekhar, S., "Bond Rating: A Non-Conservative Application of Neural Networks", IEEE Int. Conf. on Neural Networks, San Diego, CA, July 24-27, II: 443-450, 1988.
[4] Gupta, L., Sayeh, M., Tamanna, R., "A Neural Network Approach to Robust Shape Classification", Pattern Recognition, Vol. 23, No. 6, pp. 563-568, 1990.
[5] Rumelhart, D., et al., "Parallel Distributed Processing", Vol. 1, Ch. 8. For a derivation of the generalised delta rule.
[6] http://www.cc.utah.edu/~jb1554/nnet/nnet1.html For a useful discussion of some of the issues in training a multilayer perceptron.