Unsupervised Learning and Self-Organising Networks

References: Textbook chapter on unsupervised learning and self-organising maps.

Summary from last week: We explained what local minima are and described ways of escaping them. We investigated how the backpropagation algorithm can be improved by changing various parameters and re-training.

Aim: To introduce an alternative form of learning, unsupervised learning.

Objectives: You should be able to:
1. Describe unsupervised learning and the principle of clustering patterns according to their similarity.
2. Describe the learning process used by Kohonen's Self Organising Map.

Lecture overview

This lecture describes neural networks which can be trained by unsupervised learning and can exhibit self-organisation. It outlines how inputs and weights can be represented by vectors, what is meant by competitive learning, and how self-organisation is achieved in Kohonen networks (Self-Organising Maps).

The neural network models discussed so far have been examples of supervised learning, where an external 'teacher' presents input patterns one after the other to the neural net and compares the output pattern produced with the desired target result. In unsupervised learning, weight adjustments are not made by comparison with some target output; there is no 'teaching signal' to control the weight adjustments. This property is also known as self-organisation: the network is trained by showing it examples of the patterns that are to be classified, and it is allowed to produce its own output representation for the classification. It is then up to the user to interpret the output. During training, input patterns are shown, and when the corresponding output patterns are produced, the user knows that each output code corresponds to the class which contains the input pattern.

Self-organising networks require three properties:
1. The weights in each neuron should be representative of a class of patterns (so each neuron represents a different class). Input patterns are presented to all of the neurons, and each neuron produces an output whose value is used as a measure of the match between the input pattern and the pattern stored in the neuron.
2. A competitive learning strategy, which selects the neuron with the largest response.
3. A method of reinforcing the largest response.

Self-organising maps

The inspiration for many of these networks came from biology. They have been developed either to model some biological function (particularly in cognitive modelling) or in response to the demand for biological plausibility in neural networks. One important organising principle of sensory pathways in the brain is that the placement of neurons is orderly and often reflects some physical characteristic of the external stimulus being sensed. For example, at each level of the auditory pathway, nerve cells and fibres are arranged anatomically in relation to the frequency which elicits the greatest response from each neuron. This tonotopic organisation in the auditory pathway extends up to the auditory cortex. Although much of this low-level organisation is genetically pre-determined, it is likely that some of the organisation at higher levels is created during learning by algorithms which promote self-organisation. Kohonen took inspiration from this physical structure of the brain to produce self-organising feature maps (topology-preserving maps).
In a self-organising map, units located physically next to one another will respond to input vectors that are in some way 'next to one another'. Although it is easy to visualise units being next to one another in a two-dimensional array, it is not so easy to determine which classes of vectors are next to each other in a high-dimensional space. High-dimensional input vectors are, in a sense, 'projected down' onto the two-dimensional map in a way that maintains the natural order of the input vectors. This dimensionality reduction can allow us to visualise important relationships among the data that might otherwise go unnoticed.

Kohonen networks

Teuvo Kohonen was the originator of this type of self-organising network. The aim of a Kohonen network is to produce a pattern classifier which is self-organising and uses a form of unsupervised learning to adjust the weights. Typically, a Kohonen network consists of a two-dimensional array of neurons with all of the inputs arriving at all of the neurons. Each neuron has its own set of weights, which can be regarded as an exemplar pattern. When an input pattern arrives at the network, the neuron with the exemplar pattern that is most similar to the input pattern will give the largest response. One difference from other self-organising systems, however, is that the exemplar patterns are stored in such a way that similar exemplars are found in neurons that are physically close to one another, while exemplars that are very different are situated far apart.

Self-Organising Maps (SOMs) aim to produce a network where the weights represent the coordinates of some kind of topological system or map, and the individual elements in the network are arranged in an ordered way. For example, a two-dimensional coordinate system would ideally be produced in a two-dimensional array of elements where the weights in each element correspond to the coordinates, as shown in Figure 1.

Figure 1. A two-dimensional map represented as a two-dimensional array of neurons: each neuron's weights (x1, x2) correspond to its position on a 3 x 3 grid, with each coordinate taking the values -1, 0 or 1.

Lateral inhibition and excitation

During the learning process there is positive excitatory feedback between a unit and its nearest neighbours; this causes all the units in the neighbourhood of a winning unit to learn. As the lateral distance from a unit increases, the degree of excitation falls until it becomes an inhibition, which continues for a significant distance. Finally, a weak positive excitation extends a considerable distance away from the unit. The network exhibits a 'bubble' of activity (a neighbourhood) around the unit with the largest value of net input; this lateral interaction profile is called the Mexican-hat function (Figure 2).

Figure 2. Mexican hat function.

Initially, each weight is set to some random number. Then, pairs of randomly selected coordinates are presented to the system with no indication that these coordinates are taken from a square grid. The system has to order itself so that the weights correspond to the coordinates, and so that the position of each element also corresponds to its position in the coordinate system. The method for achieving this is to use a matching procedure. Various matching criteria can be used, but one that is often used is the Euclidean distance, found by taking the square root of the sum of the squared differences:
D_j = \sqrt{\sum_{i=1}^{n} (x_i - w_{ij})^2}

For a two-dimensional problem, the distance calculated in each neuron is:

D_j = \sqrt{\sum_{i=1}^{2} (x_i - w_{ij})^2} = \sqrt{(x_1 - w_{1j})^2 + (x_2 - w_{2j})^2}

An input vector is simultaneously compared to all of the elements in the network, and the one with the lowest value of D is selected, i.e. a winning processing element is determined for each input vector based on the similarity between the input vector and the weight vector. Instead of updating the weights of the winning unit only, we define a neighbourhood around the winning unit, and all the units within this neighbourhood participate in the weight update process. As learning proceeds, the size of this neighbourhood is diminished until it encompasses only a single unit. If the winning element is denoted by c, then a neighbourhood around c is defined as those elements which lie within a distance Nc of c. The exact nature of the neighbourhood can vary, but one that is frequently used is shown in Figure 3.

Figure 3. The neighbourhood of element c.

Having identified the element c at the centre of the neighbourhood, and the elements that are included in the neighbourhood, the weights of those elements are adjusted using:

\Delta w_{ij} = k (x_i - w_{ij}) Y_j

where Y_j is a value obtained from the Mexican hat function (itself controlled by the size of the neighbourhood Nc) and k is a value which changes over time. This means that if a unit lies within the neighbourhood of the winning unit, its weight is changed by the difference between its weight vector and the input vector x, multiplied by the time-dependent factor k and the function Y_j. Each weight vector participating in the update process rotates slightly toward the input vector x. As training continues with different input points, the size of the neighbourhood is decreased gradually until it encompasses only a single unit. Once training has progressed sufficiently, the weight vector of each unit will converge to a value that is representative of the coordinates of the points near the physical location of the unit.

The decisions about the size of Nc and the value of k are important. The sideways 'spread' of the Mexican hat function must change over time, hence changing the size of the neighbourhood of the units. Both Nc and the value of k must decrease with time, and there are several ways of doing this. The value of k and the size of Nc could decrease linearly with time; however, it has been pointed out that there are two distinct phases: an initial ordering phase, in which the elements find their correct topological order, and a final convergence phase, in which the accuracy of the weights improves. For example, the initial ordering phase might take 1000 iterations, where k decreases linearly from, say, 0.9 to 0.01, and Nc decreases linearly from half the diameter of the network to one spacing. During the final convergence phase, k may decrease from 0.01 to 0 while Nc stays at one spacing. This final stage could take from 10 to 100 times longer than the initial stage, depending on the desired accuracy.
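To make the competition and neighbourhood update concrete, the following is a minimal Python/NumPy sketch of the training loop just described. It is an illustration rather than the lecture's reference implementation: the function name train_som, the grid size, and the use of a Gaussian bump in place of the full Mexican-hat profile for Y_j are assumptions made for brevity.

```python
import numpy as np

def train_som(data, grid_rows=10, grid_cols=10, n_iterations=1000,
              k_start=0.9, k_end=0.01, nc_start=5.0, nc_end=1.0, seed=0):
    """One iteration: pick a random input, find the winning unit by
    Euclidean distance, then move the weights of every unit in the
    winner's neighbourhood towards the input."""
    rng = np.random.default_rng(seed)
    n_inputs = data.shape[1]

    # Random initial weights near the centre, one weight vector per grid unit.
    weights = rng.random((grid_rows, grid_cols, n_inputs)) - 0.5

    # Grid coordinates of every unit, used to measure how far (physically)
    # each unit is from the winner.
    rows, cols = np.meshgrid(np.arange(grid_rows), np.arange(grid_cols),
                             indexing="ij")

    for t in range(n_iterations):
        frac = t / n_iterations
        k = k_start + (k_end - k_start) * frac      # learning rate decays linearly
        nc = nc_start + (nc_end - nc_start) * frac  # neighbourhood radius shrinks linearly

        x = data[rng.integers(len(data))]

        # Competition: the unit with the smallest Euclidean distance D_j wins.
        distances = np.sqrt(((weights - x) ** 2).sum(axis=2))
        win_r, win_c = np.unravel_index(np.argmin(distances), distances.shape)

        # Neighbourhood function Y_j: a Gaussian centred on the winner
        # (standing in here for the Mexican-hat profile).
        grid_dist_sq = (rows - win_r) ** 2 + (cols - win_c) ** 2
        y = np.exp(-grid_dist_sq / (2.0 * nc ** 2))

        # Cooperative update: delta w_ij = k * (x_i - w_ij) * Y_j
        weights += k * y[:, :, np.newaxis] * (x - weights)

    return weights
```

Called on a few thousand points drawn uniformly from a square, a sketch like this should produce weight vectors that gradually spread out to fill the square, which is exactly the behaviour illustrated in the example that follows.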
An example is shown below in which a two-dimensional array of elements is arranged in a square to map a rectangular two-dimensional coordinate space onto this array (the simplest case to imagine). Figure 4 shows the network for this example, where units are trained to recognise their relative positions in two-dimensional space; the figure illustrates the dynamics of the learning process. Instead of plotting the position of the processing elements according to their physical location, they are plotted according to their location in weight space. As training proceeds, the map evolves. In the initial map, the weight vectors have random values near the centre of the map coordinates (i.e. if these values were plotted on a two-dimensional image, they would appear as a set of randomly distributed points). To indicate that some elements are next to other elements, a line is drawn between adjacent elements, so that the image ends up as a set of lines with the elements situated at the points where the lines intersect. These lines are not physical, in the sense that the elements are not joined together, but they show which units are neighbours in physical space. The system is presented with a set of randomly chosen coordinates. As the map begins to evolve, the weights spread out from the centre, and eventually the final structure of the map begins to appear. Finally, the relationship between the weight vectors mimics the relationship between the physical coordinates of the processing elements (i.e., as time elapses, the weights order themselves so that they correspond to the positions in the coordinate system). Another way of thinking about this is that the weights distribute themselves evenly across the coordinate space so that, in effect, they learn to 'fill the space'.

Figure 4. Weight vectors during the ordering phase (panels (a) to (e) show successive stages of training).

Although the above example uses input points that are uniformly distributed within the region, they can in fact be distributed according to any distribution function. Once the SOM has been trained, the weight vectors will be organised into an approximation of the distribution function of the input vectors. More formally, Kohonen has shown that "the point density function of the weight vectors tends to approximate the probability density function p(x) of the input vectors x, and the weight vectors tend to be ordered according to their mutual similarity".

The network output need not be two-dimensional, even though the layout of the physical devices might be two-dimensional. If there are n weights, then each weight corresponds to a coordinate. So although a two-dimensional image of which elements are active is produced when patterns are presented to the input, the interpretation of that map might have many more dimensions. For example, a system where each element has three weights would organise itself so that the different pattern classes occupy different parts of a three-dimensional space. If the network is observed, only individual elements firing would be seen, so it is misleading to think in terms of the physical layout. This (and other examples) shows how two-dimensional arrays which map onto a coordinate system can arrange the weights so that the 'nodes' in that system are distributed evenly.

One thing that has not been mentioned yet is the output. What is the output of a Kohonen network? Training involves grouping similar patterns close together in the pattern space, so that clusters of similar patterns cause neurons to fire that are physically located close together in the network. The outputs still need to be interpreted, but it should be possible to identify which regions belong to which class by showing the network known patterns and seeing which areas are active.
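That interpretation step, showing the trained network known patterns and recording which units respond, could be sketched as follows. This is a hypothetical helper, not part of the lecture material: it assumes a weights array of shape (rows, cols, n_inputs) such as the one built in the earlier training sketch, and the names calibrate and classify are illustrative.

```python
import numpy as np

def calibrate(weights, examples, labels):
    """Label SOM units by presenting known patterns and recording,
    for each pattern, which unit wins the competition."""
    rows, cols, _ = weights.shape
    unit_labels = np.full((rows, cols), None, dtype=object)
    for x, label in zip(examples, labels):
        dist = np.sqrt(((weights - x) ** 2).sum(axis=2))
        r, c = np.unravel_index(np.argmin(dist), dist.shape)
        unit_labels[r, c] = label      # this unit is active for this class
    return unit_labels                 # units never visited stay unlabelled (None)

def classify(weights, unit_labels, x):
    """Classify a new pattern by the label attached to its winning unit."""
    dist = np.sqrt(((weights - x) ** 2).sum(axis=2))
    r, c = np.unravel_index(np.argmin(dist), dist.shape)
    return unit_labels[r, c]
```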
The feature map classifier

An advantage of the SOM is that large amounts of unlabelled data can be organised quickly into a configuration that may illuminate underlying structure within the data. Following the self-organisation process, it may be desirable to associate certain inputs with certain output values (as is done with backpropagation networks). The Feature Map Classifier has an additional layer of units that form an output layer, which can be trained by several methods (including the delta rule) to produce a particular output given a particular pattern of activation on the SOM (Figure 5).

Figure 5. The feature map classifier.

In this network, the SOM classifies the input vectors, and the output layer (connected to the SOM by trainable weighted connections) associates desired output values with certain input vectors.

The neural phonetic typewriter

It has long been a goal of computer scientists to endow a machine with the ability to recognise and understand human speech. Despite many years of research, currently available commercial products are limited by their small vocabularies, by dependence on extensive training by a particular speaker, or by both. The neural phonetic typewriter demonstrates the potential for neural networks to aid in the endeavour to build speaker-independent speech recognition into a computer. It also shows how neural network technology can be merged with traditional signal processing and standard AI techniques to solve a particular problem. It can transcribe speech into written text from an unlimited vocabulary in real time, with an accuracy of 92 to 97 per cent. Training for an individual speaker requires the dictation of only about 100 words and takes about 10 minutes on a personal computer.

A two-dimensional array of units is trained using inputs that are 15-component spectral analyses of spoken words, sampled every 10 milliseconds. These input vectors are produced by a series of preprocessing steps performed on the spoken words: a noise-cancelling microphone, a 12-bit analogue-to-digital conversion, a 256-point fast Fourier transform performed every 10 milliseconds, grouping of the spectral channels into 15 groups, and additional filtering and normalisation of the resulting 15-component input vector. Using Kohonen's clustering algorithm, nodes in the two-dimensional array are allowed to organise themselves in response to the input vectors. After training, the resulting map is calibrated by using the spectra of phonemes as input vectors. Even though phonemes were not used explicitly to train the network, most units responded to a single phoneme.

As a word is spoken, it is sampled, analysed, and submitted to the network as a sequence of input vectors. As the nodes in the network respond, a path is traced on the map that corresponds to the sequence of input patterns. This path results in a phonetic transcription of the word, which can then be used as input to a rule-based system and compared with known words. As words are spoken into the microphone, their transcription appears on the computer screen.

The SOM is used in diverse applications, including speech recognition and processing, image analysis, control of industrial processes and novelty detection.

References (not essential): Teuvo Kohonen. The 'neural' phonetic typewriter. Computer, 21(3), pp. 11-22, March 1988.

Lecture summary

This lecture has discussed a self-organising neural network called the Self-Organising Map.
This network uses unsupervised learning to arrange the patterns that it stores so that similar patterns are represented by neurons that are physically close to each other, and dissimilar patterns by neurons that are far apart. In unsupervised learning there is no 'teaching input'; learning is based on the principle of clustering patterns according to their similarity. A common form of unsupervised learning is that used by Kohonen's Self Organising Map.

Self Assessment Questions (answers below)

1. For the network shown in Figure (i), what will the response be to the input values shown?

Figure (i). Network for question 1.

2. The input pattern [X] = [1 0 1] and a set of weights [W] = [0.2 1.5 2.0] in a neuron can be interpreted as two vectors.
(a) What are the lengths of the two vectors?
(b) What is the Euclidean distance between them?

Answers

1. With these input values, the outputs of the three neurons are, from the top:
0.5 - 2.0 = -1.5, 0.2 + 2.0 = 2.2, -1.5 + 0.1 = -1.4
Therefore the middle neuron wins the competition, so the values for y are: y1 = 0, y2 = 1, y3 = 0 (if the Mexican hat function is localised; otherwise y2 will have a value of 1 and the others will have values somewhere between 0 and 1).

2. (a) The length of a vector is found by squaring all of the elements of the vector, summing the squares, then taking the square root.

|X| = \sqrt{1^2 + 0^2 + 1^2} = \sqrt{2} = 1.414
|W| = \sqrt{0.2^2 + 1.5^2 + 2.0^2} = \sqrt{6.29} = 2.508

(b) The Euclidean distance, D_j, for neuron j is given by:

D_j = \sqrt{\sum_{i=1}^{n} (x_i - w_{ij})^2}

With input pattern [X] = [1 0 1] and weights [W] = [0.2 1.5 2.0]:

D_j = \sqrt{(1 - 0.2)^2 + (0 - 1.5)^2 + (1 - 2.0)^2} = \sqrt{3.89} = 1.97
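The arithmetic in answer 2 can be checked with a few lines of Python (a quick verification sketch, not part of the original notes; NumPy is assumed):

```python
import numpy as np

x = np.array([1.0, 0.0, 1.0])
w = np.array([0.2, 1.5, 2.0])

print(np.linalg.norm(x))      # |X| = 1.414...
print(np.linalg.norm(w))      # |W| = 2.508...
print(np.linalg.norm(x - w))  # D_j = 1.97...
```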