
Unsupervised Learning and Self-Organising Networks
References: Textbook chapter on unsupervised learning and self-organising maps.
Summary from last week:
 We explained what local minima are, and described ways of escaping them.
 We investigated how the backpropagation algorithm can be improved by changing various
parameters and re-training.
Aim: To introduce an alternative form of learning, unsupervised learning
Objectives: You should be able to:
 Describe unsupervised learning and the principle of clustering patterns according to their similarity.
 Describe the learning process used by Kohonen’s Self Organising Map
Lecture overview
This lecture gives a description of neural networks which can be trained by unsupervised learning and
can exhibit self-organisation. It outlines how inputs and weights can be represented by vectors, what is
meant by competitive learning and how self-organisation is achieved in Kohonen networks (Self-Organising Maps).
The neural network models discussed so far which perform learning have been examples of
supervised learning, where an external 'teacher' presents input patterns one after the other to the
neural net and compares the output pattern produced with the desired target result. In unsupervised
learning, weight adjustments are not made by comparison with some target output; there is no 'teaching signal' to control the weight adjustments. This property is also known as self-organisation: the network is trained by showing it examples of the patterns that are to be classified, and it is allowed to produce its own output representation for the classification. It is then up to the user to interpret the output. During training, input patterns are shown, and when the corresponding output patterns are produced, the user knows that the code produced corresponds to the class which contains the input pattern. In self-organising networks the following properties are required:
 The weights in the neurons should be representative of a class of patterns (so each neuron
represents a different class).
 Input patterns are presented to all of the neurons, and each neuron produces an output. The value
of the output of each neuron is used as a measure of the match between the input pattern and the
pattern stored in the neuron.
 A competitive learning strategy, which selects the neuron with the largest response.
 A method of reinforcing the largest response.
Self-organising maps
The inspiration for many of these networks came from biology. They have been developed either to
model some biological function (particularly in cognitive modelling) or in response to the demand for
biological plausibility in neural networks. One important organising principle of sensory pathways in the
brain is that the placement of neurons is orderly and often reflects some physical characteristic of the
external stimulus being sensed. For example, at each level of the auditory pathway, nerve cells and fibres
are arranged anatomically in relation to the frequency which elicits the greatest response from each
neuron. This tonotopic organisation in the auditory pathway also extends up to the auditory cortex.
Although much of this low-level organisation is genetically pre-determined, it is likely that some of the
organisation at higher levels is created during learning by algorithms which promote self-organisation.
Kohonen took inspiration from this physical structure of the brain to produce self-organising feature maps
(topology preserving maps). In a self-organising map, units located physically next to one another will
respond to input vectors that are in some way 'next to one another'. Although it is easy to visualise units
being next to one another in a two-dimensional array, it is not so easy to determine which classes of
vectors are next to each other in a high-dimensional space. High-dimensional input vectors are, in a sense, 'projected down' onto the two-dimensional map in a way that maintains the natural order of the input vectors. This dimensionality reduction allows us to visualise important relationships among the data that might otherwise go unnoticed.
Kohonen networks
Teuvo Kohonen was the originator of this type of self-organising network. The aim of a Kohonen
network is to produce a pattern classifier, which is self-organizing and uses a form of unsupervised
learning to adjust the weights. Typically, a Kohonen network consists of a two-dimensional array of
neurons with all of the inputs arriving at all of the neurons. Each neuron has its own set of weights
which can be regarded as an exemplar pattern. When an input pattern arrives at the network, the
neuron with the exemplar pattern that is most similar to the input pattern will give the largest response.
One difference from other self-organizing systems, however, is that the exemplar patterns are stored
in such a way that similar exemplars are to be found in neurons that are physically close to one
another and exemplars that are very different are situated far apart. Self-Organizing Maps (SOMs) aim
to produce a network where the weights represent the coordinates of some kind of topological system
or map and the individual elements in the network are arranged in an ordered way. For example, a
two-dimensional coordinate system would ideally be produced in a two-dimensional array of elements
where the weights in each element correspond to the coordinates, as shown in Figure 1.
[Figure 1 shows a 3 × 3 array of neurons; each neuron stores two weights (x1, x2) whose values correspond to one of the nine grid coordinates with components in {−1, 0, 1}.]
Figure 1. A two-dimensional map represented as a two-dimensional array of neurons.
Lateral inhibition and excitation
During the learning process there is positive excitatory feedback between a unit and its nearest neighbours, which causes all the units in the neighbourhood of a winning unit to learn. As the lateral distance from a unit increases, the degree of excitation falls until it becomes an inhibition, which continues for a significant distance; finally, a weak positive excitation extends a considerable distance away from the unit. This interaction profile is called the Mexican-hat function (Figure 2), and it produces a bubble of activity (the neighbourhood) around the unit with the largest value of net input.
Figure 2. Mexican hat function.
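To make this lateral-interaction profile concrete, the short Python sketch below models a Mexican-hat function as a difference of Gaussians. The functional form, parameter values and function name are assumptions (the lecture only describes the shape), and this simple two-Gaussian form captures the central excitation and surrounding inhibition but not the weak far-field excitation.

```python
import numpy as np

def mexican_hat(distance, sigma_excite=1.0, sigma_inhibit=3.0, a=1.0, b=0.5):
    """Difference-of-Gaussians stand-in for the Mexican-hat interaction profile.

    Near the winning unit (small distance) the response is excitatory;
    further away it becomes inhibitory before decaying towards zero.
    """
    return (a * np.exp(-distance ** 2 / (2 * sigma_excite ** 2))
            - b * np.exp(-distance ** 2 / (2 * sigma_inhibit ** 2)))

# Response at increasing lateral distances from the winning unit:
for d in [0, 1, 2, 4, 6]:
    print(d, round(float(mexican_hat(d)), 3))
```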
Initially, each weight is set to some random number. Then, pairs of randomly selected coordinates are
presented to the system with no indication that these coordinates are taken from a square grid. The
system then has to order itself so that the weights correspond to the coordinates, and so that the
position of the element also corresponds to the position in the coordinate system.
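For concreteness, a minimal Python set-up for this example might look as follows; the grid size, value ranges and variable names are illustrative assumptions rather than values given in the lecture.

```python
import numpy as np

rows, cols = 10, 10                      # physical layout of the array of elements
rng = np.random.default_rng(0)

# Each element holds two weights (one per input), initialised to small random values.
weights = rng.uniform(-0.1, 0.1, size=(rows * cols, 2))

# Physical grid position of each element (used later to define neighbourhoods).
positions = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)

# Training data: randomly selected coordinate pairs from a square region.
samples = rng.uniform(-1.0, 1.0, size=(5000, 2))
```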
The method for achieving this is to use a matching procedure. Various matching criteria can be used, but one that is often used is the Euclidean distance, which is found by taking the square root of the sum of the squared differences between the input values and the corresponding weights.
$$D_j = \sqrt{\sum_{i=1}^{n} (x_i - w_{ij})^2}$$
For a 2-dimensional problem, the distance calculated in each neuron is:
$$D_j = \sqrt{\sum_{i=1}^{2} (x_i - w_{ij})^2} = \sqrt{(x_1 - w_{1j})^2 + (x_2 - w_{2j})^2}$$
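As a concrete illustration (a minimal sketch using numpy, which the lecture does not prescribe), the following code computes the distance Dj between an input vector and every neuron's weight vector; as described next, the unit with the smallest distance wins the competition.

```python
import numpy as np

def distances_to_input(x, weights):
    """Euclidean distance D_j between input x and each neuron's weight vector.

    weights has shape (number_of_neurons, number_of_inputs); each row is one
    neuron's exemplar pattern.
    """
    return np.sqrt(((weights - x) ** 2).sum(axis=1))

# Example: three neurons with 2-dimensional weight vectors.
weights = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])
x = np.array([0.9, 0.8])
D = distances_to_input(x, weights)
print("distances:", np.round(D, 3), "winning unit:", int(np.argmin(D)))
```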
An input vector is simultaneously compared to all of the elements in the network, and the one with the
lowest value for D is selected, i.e. a winning processing element is determined for each input vector
based on the similarity between the input vector and the weight vector. Instead of updating the weights of
the winning unit only, we define a neighbourhood around the winning unit, and all the units within this
neighbourhood participate in the weight update process. As learning proceeds, the size of this
neighbourhood is diminished until it encompasses only a single unit. If the winning element is denoted by c, then a neighbourhood around c is defined as those elements which lie within a distance of Nc from c. The exact nature of the neighbourhood can vary, but one that is frequently used
is shown in Figure 3.
Figure 3. The neighbourhood Nc of element c.
Having identified the element c, the centre of the neighbourhood, and the elements that are included in
the neighbourhood, the weights of those elements are adjusted using:
$$\Delta w_{ij} = k\,(x_i - w_{ij})\,Y_j$$
where Yj is a value obtained from the Mexican hat function (itself controlled by the size of the
neighbourhood Nc) and k is a value which changes over time. This means that if the unit lies within the
neighbourhood of the winning unit its weight is changed by the difference between its weight vector
and the vector x multiplied by the time factor k and the function Yj. Each weight vector participating in
the update process rotates slightly toward the input vector x. As training continues with different input
points, the size of the neighbourhood is decreased gradually until it encompasses only a single unit.
Once training has progressed sufficiently, the weight vector on each unit will converge to a value that
is representative of the coordinates of the points near the physical location of the unit.
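A minimal sketch of this update step, assuming the units sit on a square grid and using a Gaussian-shaped value as a stand-in for Yj (the lecture uses the Mexican-hat function; the simpler Gaussian here, and the function and variable names, are assumptions):

```python
import numpy as np

def update_weights(weights, positions, x, c, k, Nc):
    """Move the weight vectors of units near the winner c towards the input x.

    weights   : (units, inputs) array, one weight vector per unit
    positions : (units, 2) array of each unit's physical grid coordinates
    x         : current input vector
    c         : index of the winning unit
    k         : learning factor (decreases over time)
    Nc        : current neighbourhood radius
    """
    lateral = np.linalg.norm(positions - positions[c], axis=1)  # grid distance from the winner
    inside = lateral <= Nc                                      # units in the neighbourhood
    Y = np.exp(-lateral ** 2 / (2 * max(Nc, 1e-9) ** 2))        # stand-in for Yj
    weights[inside] += k * Y[inside][:, None] * (x - weights[inside])
    return weights
```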
The decisions about the size of Nc and the value of k are important. The sideways 'spread' of the
Mexican hat function must change over time, hence changing the size of the neighbourhood of the units.
Both Nc and the value of k must decrease with time, and there are several ways of doing this. The
value of k and the size of Nc could decrease linearly with time, however, it has been pointed out that
there are two distinct phases - an initial ordering phase, in which the elements find their correct
topological order, and a final convergence phase in which the accuracy of the weights improves. For
example, the initial ordering phase might take 1000 iterations where k decreases linearly from 0.9 to
0.01 say, and Nc decreases linearly from half the diameter of the network to one spacing. During the
final convergence phase k may decrease from 0.01 to 0 while Nc stays at one spacing. This final stage could take between 10 and 100 times longer than the initial stage, depending on the desired accuracy.
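Putting the update rule and the two-phase schedule together, the sketch below is one reasonable reading of the text rather than a prescribed implementation; it reuses the update_weights function from the earlier sketch and decays k and Nc linearly as described above.

```python
import numpy as np

def train_som(weights, positions, samples, ordering_steps=1000, convergence_steps=20000):
    """Two-phase SOM training: an ordering phase followed by a convergence phase."""
    grid_diameter = positions.max() - positions.min()
    for t in range(ordering_steps + convergence_steps):
        x = samples[np.random.randint(len(samples))]        # randomly chosen training point
        if t < ordering_steps:                               # ordering phase
            frac = t / ordering_steps
            k = 0.9 + frac * (0.01 - 0.9)                    # k: 0.9 -> 0.01 linearly
            Nc = grid_diameter / 2 + frac * (1.0 - grid_diameter / 2)  # Nc: half diameter -> one spacing
        else:                                                # convergence phase
            frac = (t - ordering_steps) / convergence_steps
            k = 0.01 * (1.0 - frac)                          # k: 0.01 -> 0
            Nc = 1.0                                         # Nc stays at one spacing
        c = int(np.argmin(np.linalg.norm(weights - x, axis=1)))  # winning unit
        weights = update_weights(weights, positions, x, c, k, Nc)
    return weights
```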
An example is shown below where a two-dimensional array of elements is arranged in a square to
map a rectangular two-dimensional coordinate space onto this array (which is the simplest case to
imagine). Figure 4 shows the network for the example where units are trained to recognise their relative
positions in two-dimensional space. This figure illustrates the dynamics of the learning process. Instead
of plotting the position of the processing elements according to their physical location, they are plotted
according to their location in weight space. As training proceeds, the map evolves. In the initial map,
weight vectors have random values near the centre of the map coordinates (i.e. if these values are
plotted on a two-dimensional image, they would be shown as a set of randomly distributed points). We
want to indicate that some elements are next to other elements. This is done by drawing a line
between adjacent elements so that the image ends up as a set of lines, the elements being situated at
the points where the lines intersect. These lines are not physical, in the sense that the elements are
not joined together, but show units that are neighbours in physical space.
The system is presented with a set of randomly chosen coordinates. As the map begins to evolve,
weights spread out from the centre. Eventually the final structure of the map begins to appear. Finally the
relationship between the weight vectors mimics the relationship between the physical coordinates of the
processing elements (i.e., as time elapses, the weights order themselves so that they correspond to the
positions in the coordinate system). Another way of thinking about this is that the weights distribute
themselves in an even manner across the coordinate space so that, in effect, they learn to ‘fill the
space’.
Figure 4. Weight vectors during the ordering phase (panels (a)–(e)).
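A small sketch, assuming numpy and matplotlib, of how such a weight-space picture can be drawn: each unit is plotted at the point given by its two weights, and lines join units that are neighbours in the physical grid.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_weight_space(weights, rows, cols):
    """Plot a (rows*cols, 2) weight array as a grid in weight space."""
    W = weights.reshape(rows, cols, 2)
    for r in range(rows):
        plt.plot(W[r, :, 0], W[r, :, 1], 'k-')   # join neighbours along each row
    for c in range(cols):
        plt.plot(W[:, c, 0], W[:, c, 1], 'k-')   # join neighbours along each column
    plt.scatter(weights[:, 0], weights[:, 1], s=10)
    plt.xlabel('w1')
    plt.ylabel('w2')
    plt.show()
```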
Although the above example uses input points that are uniformly distributed within the region, they can in
fact be distributed according to any distribution function. Once the SOM has been trained, the weight
vectors will be organised into an approximation of the distribution function of the input vectors. Kohonen
has shown that (more formally) “The point density function of the weight vectors tends to approximate
the probability density function p(x) of the input vectors x, and the weight vectors tend to be ordered
according to their mutual similarity”.
The network output need not be two-dimensional, even though the layout of the physical devices might
be two-dimensional. If there are n weights, then each weight corresponds to a coordinate. So although
a two-dimensional image of which elements are active is produced when patterns are presented to the
input, the interpretation of that map might have many more dimensions. For example, a system where
each element has three weights would organize itself so that the different pattern classes occupy
different parts of a three-dimensional space. If the network is observed, only individual elements firing
would be seen, so it is misleading to think in terms of the physical layout.
This (and other examples) shows how two-dimensional arrays which map on to a coordinate system
can arrange the weights so that the ‘nodes’ in that system are distributed evenly. One thing that has
not been mentioned yet is the output. What is the output of a Kohonen network? Training involves
grouping similar patterns in close proximity in this pattern space, so that clusters of similar patterns
cause neurons to fire that are physically located close together in the network. Clearly, the outputs
need to be interpreted, but it should be possible to identify which regions belong to which class by
showing the network known patterns and seeing which areas are active.
The feature map classifier
An advantage of the SOM is that large amounts of unlabelled data can be organised quickly into a
configuration that may illuminate underlying structure within the data. Following the self-organisation
process, it may be desirable to associate certain inputs with certain output values (such as is done with
backpropagation networks). The Feature Map Classifier has an additional layer of units that form an
output layer, which can be trained by several methods (including the delta rule) to produce a particular
output given a particular pattern of activation on the SOM (Figure 5).
Figure 5. The feature map classifier.
In this network, the SOM classifies the input vectors, and the output layer (connected by weighted
connections to the SOM which can be trained) associates desired output values with certain input
vectors.
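As a hedged illustration (the lecture specifies only that the output layer can be trained by methods such as the delta rule), the sketch below trains an output layer with the delta rule, using a one-hot encoding of the winning SOM unit as the activation pattern; the function and variable names are assumptions.

```python
import numpy as np

def train_output_layer(som_weights, inputs, targets, n_outputs, eta=0.1, epochs=50):
    """Delta-rule training of a linear output layer on top of a trained SOM."""
    n_units = len(som_weights)
    V = np.zeros((n_outputs, n_units))            # output-layer weights
    for _ in range(epochs):
        for x, t in zip(inputs, targets):
            c = int(np.argmin(np.linalg.norm(som_weights - x, axis=1)))  # winning SOM unit
            a = np.zeros(n_units)
            a[c] = 1.0                            # activation pattern on the SOM
            y = V @ a                             # output-layer response
            V += eta * np.outer(t - y, a)         # delta rule: move y towards target t
    return V
```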
The neural phonetic typewriter
It has long been a goal of computer scientists to endow a machine with the ability to recognise and
understand human speech. Despite many years of research, currently available commercial products are
limited by their small vocabularies, by dependence on extensive training by a particular speaker, or by
both. The neural phonetic typewriter demonstrates the potential for neural networks to aid in the
endeavour to build speaker-independent speech recognition into a computer. It also shows how neural
network technology can be merged with traditional signal processing and standard techniques in AI to
solve a particular problem. It can transcribe speech into written text from an unlimited vocabulary in real
time, with an accuracy of 92 to 97 per cent. Training for an individual speaker requires the dictation of
only about 100 words, and requires about 10 minutes of time on a personal computer.
A 2-dimensional array of units is trained using inputs that are 15-component spectral analyses of spoken
words sampled every 10 milliseconds. These input vectors are produced from a series of preprocessing
steps performed on the spoken words. This preprocessing involves the use of a noise-cancelling
microphone, a 12-bit analogue-to-digital conversion, a 256 point fast Fourier transform performed every
10 milliseconds, grouping of the spectral channels into 15 groups, and additional filtering and
normalisation of the resulting 15-component input vector. Using Kohonen's clustering algorithm, nodes in
the 2-dimensional array are allowed to organise themselves in response to the input vectors. After
training the resulting map is calibrated by using the spectra of phonemes as input vectors. Even though
phonemes were not used explicitly to train the network, most units responded to a single phoneme. As a
word is spoken, it is sampled, analysed, and submitted to the network as a sequence of input vectors. As
the nodes in the network respond, a path is traced on the map that corresponds to the sequence of input
patterns. This path results in a phonetic transcription of the word, which can then be used as input to a
rule-based system to be compared with known words. As words are spoken into the microphone, their
transcription appears on the computer screen.
The SOM is used in diverse applications, including speech recognition and processing, image
analysis, controlling industrial processes and novelty detection.
References (not essential): Teuvo Kohonen. The 'neural' phonetic typewriter. Computer, 21(3), pp. 11–22, March 1988.
Lecture summary
This lecture has discussed a self-organizing neural network called a Self-organising Map. This
network uses unsupervised learning to physically arrange its neurons so that the patterns that it stores
are arranged such that similar patterns are close to each other and dissimilar patterns are far apart.
 In unsupervised learning there is no ‘teaching input’.
 Learning is based on the principle of clustering patterns according to their similarity.
 A common form of unsupervised learning is that used by Kohonen’s Self Organising Map.
Self Assessment Questions (answers on following page)
1. For the network shown in Figure (i), what will the response be to the input values shown?
Figure (i), Network for question 1
2. The input pattern [X] = [1 0 1] and a set of weights [W] = [0.2 1.5 2.0] in a neuron can be
interpreted as two vectors.
(a)
What are the lengths of the two vectors?
(b)
What is the Euclidean distance between them?
Answers
1
With these input values, the outputs of the three neurons are, from the top:
0.5 – 2.0 = –1.5, 0.2 + 2.0 = 2.2, –1.5 + 0.1 = –1.4
Therefore the middle neuron wins the competition, so that the values for y are: y1 = 0, y2 = 1, y3 = 0 (if the Mexican hat function is localised; otherwise y2 will have a value of 1 and the others will have values somewhere between 0 and 1).
2
(a)
The length of a vector is found by squaring all of the elements of the vector, summing the
squares then finding the square root.
Length of [X] = |X| = $\sqrt{1^2 + 0^2 + 1^2} = \sqrt{2} = 1.414$
|W| = $\sqrt{0.2^2 + 1.5^2 + 2.0^2} = \sqrt{6.29} = 2.508$
(b)
The Euclidean distance, Dj, for neuron j is given as:
$$D_j = \sqrt{\sum_{i=1}^{n} (x_i - w_{ij})^2}$$
The input pattern [X] = [1 0 1] and the weights are [W] = [0.2 1.5 2.0].
$$D_j = \sqrt{(1 - 0.2)^2 + (0 - 1.5)^2 + (1 - 2.0)^2} = 1.97$$
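For completeness, a quick numerical check of these answers in Python (using numpy purely as an illustration):

```python
import numpy as np

x = np.array([1.0, 0.0, 1.0])
w = np.array([0.2, 1.5, 2.0])

print(np.linalg.norm(x))        # |X| = 1.414
print(np.linalg.norm(w))        # |W| = 2.508
print(np.linalg.norm(x - w))    # Euclidean distance D = 1.97
```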