Computational Models of Cognitive Processes Practical 1

Connectionist Modelling of Language and Cognitive Processes
Practical with OXlearn
Gert Westermann August 2011
In order to understand and appreciate connectionist modelling it is very useful to learn
how to use neural networks in practice. This is what this practical is about. It is a
mixture of enabling you to set up simple neural network models, train and test them,
and at the same time to deepen your understanding of neural modelling.
The practical handout contains tasks and questions which form part of the assessment
for this course. Some of these are revision questions from the seminars and/or the
texts on the reading list, but most will require you to use the OXlearn simulator and
build your own neural network models.
Please enter your answers to the questions in this sheet in bold. Where you are asked
to provide graphs of your results, please paste the pictures directly into the sheet
(use .jpg or .gif format to limit the size of your document). You can either take
screenshots or extract the figures from OXlearn (using the ‘extract’ button and File ->
Save As).
For each question, the number in parentheses indicates the maximum number of points
you can get for answering correctly. Answering the more advanced FoodForThought
questions is not strictly part of the coursework. However, I would be delighted if you
had a go at (some of) these and I might be tempted to allocate extra points for answers
that are extraordinarily concise or informed.
Note: In the practical sessions, you can work in small groups if you wish. However, if
you choose to do so you need to indicate in your portfolio who you have been
working with in a particular session. Also, you still need to formulate your answers
in your own words, even if you have discussed them as a group.
1. Understanding how activation is sent through a neural network.
Question 1:
What are the two tasks performed by each individual node (or
unit) in a neural network? (1)
Question 2:
Why are input units functionally different from all the other
units in a network? (1)
Question 3:
x1 = 3; x2 = -2; x3 = 1.5; x4 = 0.7. What is Σi xi (the sum over all four values)? (1)
Imagine a minimal neural network as drawn here (A), whose output unit uses the
sigmoidal (or logistic) activation function shown on the right hand side (B).
[Figure: (A) network architecture: two input units, each with an input value of 1, are connected to a single output unit by connection weights of -0.25 and 0.6. (B) activation function: the output unit's activation (ranging from 0 to 1) plotted against its net input; the sigmoid passes through an activation of 0.5 at a net input of 0.]
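To make the mechanics concrete, here is a minimal MATLAB sketch of how a single unit computes its net input and squashes it through the logistic function. The numbers are arbitrary example values, not the ones in the figure above.

    % Forward pass for one unit (arbitrary example values).
    x = [0.8, -0.3];                         % input values
    w = [0.5,  1.2];                         % connection weights onto the unit
    netinput   = sum(w .* x);                % weighted sum of the inputs
    activation = 1 / (1 + exp(-netinput));   % logistic (sigmoidal) activation
    fprintf('net input = %.3f, activation = %.3f\n', netinput, activation);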
Question 4:
Given the input values and connection weights shown above, what is the net input received by the output unit? (1)
Question 5:
Given the sigmoidal activation function (B), what is the rough
activation value of the output unit (look it up in the graph)? (1)
Question 6:
If both connection weights were 0, what would the activation
value of the output unit be? (1)
Question 7:
And if the weights had the values 1.2 and 0.7 and both inputs
were 0, what would be the resulting output activation? (1)
Question 8:
Name three possible activation functions for neural network
units. For each of these functions, (a) indicate if the function is linear or not
and (b) describe verbally how the function works (in which way the net input
is transformed into an activation value). (2)
Bias units: A bias node is an extra node in many neural networks that always
(automatically) has an activation value of 1 (it is always on). The bias is usually
connected to all units in the network, and its weights are adjusted in the same manner
as all other weights in the network (they can end up being negative or positive).
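In other words, the bias simply adds one further term to the unit's net input, driven by a constant input of 1. A minimal sketch (arbitrary example values):

    % Net input with a bias node: the bias always sends an activation of 1,
    % so its (learned) weight is simply added to the weighted sum of the inputs.
    x      = [0, 0];                       % input activations
    w      = [1.2, 0.7];                   % input-to-output weights
    w_bias = -0.4;                         % weight from the bias node (example value)
    netinput = sum(w .* x) + w_bias * 1;   % the bias shifts the net input by w_bias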
Question 9:
Explain why a bias node might be necessary for a network to
perform correctly (maybe with an example?). (2)
FoodForThought 1:
Is there a simple way of describing the effect of the bias
node on a unit’s activation function?
FoodForThought 2:
Is there an equivalent to the bias node’s functionality in biological neurons?
2. Setting up your first neural network with OXlearn
This practical introduces you to the OXlearn simulation software that you will be
using throughout this course. OXlearn is a high-level neural network simulation program that
enables you to easily define connectionist models and run simulations with them.
OXlearn is actually a MATLAB program, which means that it only works when you
have a recent version (7.3 or better) of MATLAB installed on your computer (we
have it installed here). However, you do not need to do any programming yourself; all
the functionality of OXlearn is accessible via its graphical user interface - basically by
using buttons, dropdown menus, etc.
OXlearn is the successor of the t-learn software which is used in the recommended
books for this module (McLeod et al, 1998; and Plunkett & Elman, 1997; see reading
list). All the exercises contained in these books can also be done with OXlearn, and
doing so has become a lot easier, too.
Today’s instructions are exceedingly detailed (you can skip what you know already)
in order to give you a chance to familiarize yourself with OXlearn. However, after
today you will be expected to know your way around and you will not get step-by-step instructions any more. If you have problems using OXlearn, please consult the
user’s manual, and if this doesn’t help speak to me. There will be a second supervised
OXlearn session in week 2 of the course.
Simulating an AND network in OXlearn
In this task you will use OXlearn to set up and train a neural network to learn the
AND function. Similar to all simulation projects, this requires setting up the
simulation (defining the train patterns, the network architecture and the training
options) and then training the network and analysing its performance (during training
or subsequent testing).
The very first thing you’ll have to do is download the OXlearn program onto
your personal USB drive. You can find the OXlearn folder at
http://psych.brookes.ac.uk/oxlearn/ (you might need to unzip the folder). This page
also has the OXlearn manual.
Next you have to start MATLAB, just click on the MATLAB icon on your
desktop. Once it has opened, you’ll need to load the OXlearn program. To do so, click
on File->Open and then browse to the OXlearn folder, and double click on
OXlearn.m. The OXlearn interface will open.
The first thing you will notice is the simulation overview display which has opened in
the OXlearn interface. Most of the panels will be empty and the red attention signs at
the left indicate that the simulation is not yet set up properly to be trained or tested.
This is not surprising, as we are currently dealing with an empty simulation.
The first thing to do is to assign a name to your new simulation project. Go to File ->
Save Simulation As and browse to a suitable location on your USB drive, for
example you could create a new folder called “practical1” in the default
“Simulations” folder. Also, you have to provide a name, e.g. “myAND”, to which the
extension ‘.mat’ will be attached automatically. To add subsequent changes to this
simulation file you just choose File -> Save Simulation or press CTRL + S, as in
most programs.
Note: Please make sure to save all simulation files that you are producing on your
USB drive! When you log out, all data will be deleted from the computer.
Now you are ready to add content to your simulation. There are three essential parts
of a simulation that need to be set up properly before you can train the network.
These three parts concern the train patterns, the network architecture and the
training options, all accessible from the Set-up menu in the menu bar at the top of the
OXlearn window - or directly from the simulation overview (Inspect -> Simulation)
by clicking on the corresponding edit button (the blue pen).
Table 1: the AND function

Patterns       Input                 Output target
               unit 1    unit 2      unit 1
Pattern 1        0         0           0
Pattern 2        0         1           0
Pattern 3        1         0           0
Pattern 4        1         1           1

Note: In binary problems like this you can also read a value of 1 as “on”, “true” or
“given”, while a value of 0 represents “off”, “false” or “absent”.
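For reference, the content of Table 1 written as plain matrices (rows = patterns, columns = units). This is only an illustration of the data; in OXlearn you will enter the values through the graphical pattern editor described below.

    % The AND training set as plain matrices (rows = patterns, columns = units).
    inputs  = [0 0;
               0 1;
               1 0;
               1 1];
    targets = [0;
               0;
               0;
               1];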
Go to Set-up -> Train Patterns. In this window you can define the Input and Target
Patterns you want the network to learn from, essentially feeding the program with the
content of Table 1. There are, in fact, several ways of doing so (see the OXlearn User
Manual); only one will be described here. First, you have to choose an appropriate
number of input and output (target) units, as well as the number of different patterns
that you are going to define. This can be done by clicking on the “change size of
patterns” button on the left hand side of the display (use the tooltip information that is
shown when the mouse pointer hovers over a button). Once you have entered these
numbers (4 patterns, 2 input units, 1 target unit) you can see that the two graphs have
adapted their size. However, you still need to enter the values of the desired input and
target activation. This is simple: just right-click on the appropriate area of a graph and
enter the value (0 or 1). You might also want to change the pattern labels (right click
on them to edit) to something more descriptive. For now you are done, you can close
the Set-up -> Train Patterns window by clicking on OK.
Note: In fact, changes in all set-up windows are applied immediately, not only when
the window is closed.
Go to Set-up -> Network. In this window you can define the exact network
architecture by choosing a network type from a dropdown menu and defining the
appropriate number of units for each layer, as well as whether you want a bias node or
not. For now we want a “2-layer feed-forward” network with a bias unit and a
sigmoidal activation function in the output layer. The input and output layer size
should correspond with the train patterns.
Go to Set-up -> Training Options. In this window you define all the parameters
pertaining to the exact way in which the network is trained, e.g. which learning
algorithm is used, for how long the network should be trained, in which order the
individual patterns are presented during training, as well as some options to control
which information is logged during the training process. For the current simulation,
you can just leave all these parameters at their default values and close the window.
Note: You will realize that the status of the train options in the simulation overview
display has changed: it now shows a green tick instead of the red attention sign. The
reason is this: whenever you open a set-up window, all the corresponding parameters
that are not yet defined will be created with pre-defined default values. Thus, even if
you have not changed any of the values, the act of opening the window has led to the
creation of all the relevant parameters.
Note: The grey ticks in front of the bottommost four elements (weights, train
performance, verify performance and test performance) indicate that all is well, but
the corresponding parameters are still empty. The reason is simple: the corresponding
parameters are obtained by training, verifying or testing the network. Once you have
done this, the ticks will turn green.
At a later stage, you might also encounter yellow attention signs here. They indicate
that the set-up parameters (the topmost five) and the results parameters (the last four)
might be inconsistent. This can happen when, in a simulation that has already been
trained, you change the set-up (e.g. if you choose a different learning rate). Because
the weights, train performance, etc., are the outcome of training with the previous setup, your simulation is temporarily inconsistent. However, you don’t need to worry:
training the network with the novel set-up will replace the results parameters and thus
make the simulation consistent again.
The first three elements in the status panel (in the simulation overview display) should
now all be ticked, which means that you are ready to train the network. Before doing
so, however, you should take the time to familiarize yourself with some of the other
displays provided by OXlearn, all of which let you inspect different aspects of a
simulation in more detail. You can change displays by selecting the options under the
Inspect menu – the first entry (Inspect -> Simulation) corresponds to the currently
shown simulation overview display. Because much of the information you might want
to display concerns the network’s performance during and after training, the displays
will of course be more informative once you have trained the network.
And this is what you’ll be doing now. Switch to the performance display (Inspect ->
Performance), and then go to Run -> Train Network. Provided you have set up
everything correctly, this will lead to the network being trained for the given number
of sweeps (1000, as per default setting).
Note: A sweep is the processing of one individual training pattern, where the identity
of the pattern is determined by the chosen presentation order (you might want to bring
up the Training Options window again to check). If you have left this parameter at
“sequential”, this means that the first pattern is presented in the first training sweep,
then the second pattern, then the third, and so on, starting over with the first pattern
again after the last pattern (here the fourth) has been presented. The other options either
randomly shuffle the order of presenting each pattern once (“random without
replacement”) or draw a pattern at random for each individual sweep (“random with
replacement”). A related term is an epoch, which usually means one pass through all
the training patterns. Thus, in the current simulation, four sweeps make one epoch.
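The three presentation orders can be illustrated with a few lines of MATLAB (a sketch of the idea only, not OXlearn's actual code):

    % Presentation order over one epoch of 4 patterns (illustration only).
    nPatterns = 4;
    sequential      = 1:nPatterns;                          % 1,2,3,4, then start over
    without_replace = randperm(nPatterns);                  % each pattern once, in shuffled order
    with_replace    = ceil(nPatterns * rand(1, nPatterns)); % drawn anew for every sweep
    % One epoch = one pass through all training patterns, i.e. 4 sweeps here.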
While the network is being trained, the performance display is updated continuously.
Thus, you will see the error curve creeping from left to right until the maximum
number of sweeps is reached. Each point of this line represents the mean squared
deviation of the network’s actual output activation from the intended target activation
in a specific sweep. You will (hopefully) notice that the line goes down gradually,
thus indicating that the network is getting increasingly better at performing the task.
But has it actually mastered the AND function? You might get a better impression by
bringing up the display for the other performance measure, just tick the box next to
“%Correct” in the options panel at the bottom of the figure. The fact that the network
has not reached 100% by the end of training already indicates that not everything is
entirely well.
Note: The criterion for judging performance in an individual sweep is always
“deviation < 0.1” for training performance. This means that the deviation of the actual
output activation from target is evaluated for each of the output units (well, currently
there is only one), and that a specific sweep will be deemed correct if none of the
output units deviates by more than 0.1 from the corresponding target value. This
criterion is rather conservative, i.e., the network’s performance needs to be quite close
to target in order to be judged as correct. Nevertheless, it is important to note that any
such criterion, while often necessary, is essentially arbitrary. Two things follow from
this insight: A) when evaluating existing models, pay attention to how loose or strict a
criterion is used; and B) you will have to make (and report) a similar choice with your
own simulations. You can change the correctness criterion in OXlearn when
inspecting verify or test performance (but not for training performance).
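As a sketch, the plotted error measure and the correctness criterion for a single sweep amount to the following (output and target are hypothetical values; with several output units they would be vectors):

    % Per-sweep error and correctness check (hypothetical values).
    output = 0.93;                                 % actual output activation(s) in this sweep
    target = 1;                                    % corresponding target value(s)
    sweep_error = mean((target - output).^2);      % mean squared deviation shown in the Error curve
    correct     = all(abs(target - output) < 0.1); % "deviation < 0.1": every output unit must be
                                                   % within 0.1 of its target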
Have a closer look at the %Correct display for train performance; there is something
that should make you suspicious: if every sweep is classified as either correct or
incorrect, and if the line has one point for each sweep, why are there intermediate
values such as 35% correct? You have just discovered the functionality of the
smoothing filter which is governed by the number in the little box below the buttons
on the left hand side of the display. Right now you should find a 20 there (number of
logged sweeps/50, the default setting), which means that each point in a line (Error
and %Correct) actually represents the average of 20 consecutive sweeps – that is why
the graph has these distinct steps in it. If you set this number to 1 no smoothing occurs
and you will get the expected binary correctness display – however, you will also see
that it is not very informative. Similarly, setting this smoothing factor to 1000 (= the
maximum number of sweeps) results in a horizontal line representing the average
value of all 1000 sweeps.
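The smoothing itself is nothing more than averaging blocks of consecutive sweeps. With a factor of 20, each plotted point could be computed like this (a sketch; err stands in for the logged per-sweep error values):

    % Block-average smoothing of a 1000-sweep error curve with factor 20 (sketch).
    err      = rand(1, 1000);            % stand-in for the per-sweep error values
    factor   = 20;                       % the number in the little box
    blocks   = reshape(err, factor, []); % 20 consecutive sweeps per column
    smoothed = mean(blocks, 1);          % one plotted point per block (50 points in total)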
But let us come back to the question whether the network has mastered the task or not.
Actually, the training performance can provide a general impression, but not the direct
answer. Recall that learning in a neural network means adapting the weights, and each
change of the weights configuration may impact on all patterns processed thereafter.
Because the weights are updated constantly (after every sweep) during training you
should not, strictly speaking, compare the network’s behaviour over adjacent sweeps.
It could be that the weight update in response to the second pattern has compromised
the network’s ability to deal with the first one (unlikely, but possible). Luckily, there
is a clean solution: one can simply record the network’s responses to any number of
patterns without adjusting the weights while doing so. This is often called ‘testing the
network with frozen weights’. In the current simulation this means exposing the
network (with the weights configuration after 1000 sweeps of training) to all training
patterns once – in sequential order, to keep it simple. OXlearn will carry out exactly
this process when you select Run -> Verify the network has learned. To be clear on
terminology, the verify option is actually performing a test, but with the exact set of
patterns with which the network has been trained. Run -> Test Network, conversely,
presents the network with Test Patterns that must be set up separately.
Actually, you don’t even have to Verify that the network has learned now, because
OXlearn has performed this operation automatically, at the end of training (see the
“auto verify” tickbox in the training options window + ‘>> more’). To inspect the
results, select “verify performance” from the dropdown menu in the options panel.
The display is similar to “training performance”, but you get only one point per
pattern and, because we are now interested in the individual patterns, smoothing is set
to 1. Again you can inspect the mean square error and you will find that it is rather
low. Keep in mind, however, that squaring a small number will make it even smaller
and, if we had several output units, taking the average might further reduce the error
value. The bottom line is, you should not read too much into absolute error values.
Comparing error values between patterns, on the other hand, is somewhat more
informative. In the current simulation, for example, the network seems to have more
problems with the last pattern (where both input units are on) as compared to the
others. This is also reflected by the fact that this last pattern is sometimes classified as
incorrect under conservative correctness criteria (you can now select different
criteria from the dropdown menu at the bottom). Usually you should find that a less
conservative criterion (e.g. deviation < 0.3 or binary) classifies all patterns as correct
(details might vary from simulation to simulation, we will investigate this in the next
practical). And here we have the answer to our question: given a specific criterion, the
network has mastered the AND function.
The performance display provides rather high-level information about the network’s
actual output. Look at the Inspect -> Patterns display (note the different scales, add
colorbars if in doubt) and/or the Inspect -> Activations display and try to find out
how exactly they relate to the information in the performance display.
Question 10:
What is the difference between the Patterns and the Activations
display in OXlearn? (1)
Question 11:
What is the network’s exact numerical output activation for
each of the 4 patterns? (1)
Inspect the weights configuration that resulted from training the network (Inspect ->
Weights). The weight display will look something like this:
This is called a Hinton Diagram. Positive weights are represented by red boxes and
negative weights by blue boxes. The size of the boxes indicates the connection
strength (use the colorbar for a rough indication, the datatip for exact information
concerning the weight values). The weights are from the units marked on the x-axis to
the units on the y-axis. Thus, the three weights displayed here are from i1 to o1 (input
unit 1 to output unit 1), from i2 to o1, and from the bias node to o1.
Question 12:
What are the exact weight values of the three connections in
your simulation after training? (1)
Note: It is quite easy to save any of the OXlearn displays: clicking on the topmost
button at the left hand side (“extract”) will send the graphs contained in the current
display to a novel MATLAB figure. If ever you wish to change anything about the
visual appearance of a graph (e.g. changing colors or labels, adding annotations, etc.),
you can use the graphics tools at the top of this window (MATLAB comes with an
excellent help - press F1 - where you can find out about using the graphics tools). You
can save the figure in several formats by selecting File -> Save As. Make sure to
choose a suitable format from the dropdown menu at the bottom; to keep files small
you should usually go for .jpg or .gif.
Question 13:
Extract the weights display of your simulation and include it in
your portfolio. (1)
Train the network a few more times with different settings in the training options
window (you could change the presentation order, the number of sweeps, the learning
rate or even add a momentum). You should be able to find a setting for which the
network, after training, gets all four patterns right when evaluated against the criterion
of “deviation < 0.1”.
Question 14:
Which settings have led to such good performance?
Please report all settings with other than default values. (1)
Question 15:
Explain why performance has increased as compared to the
initial settings. (2)
Question 16:
Report the exact mean square error for all four patterns, as well
as the average mean square error. (2)
Question 17:
Report the exact difference between the output and the target
activation (i.e., the raw error; not squared) for all four patterns. (1)
If you have come this far today, you have done well indeed. In case there is time left,
feel free to explore variations of the current simulation and/or play around some more
with OXlearn to make yourself familiar with some of the other functionalities
provided by the program. Just to mention one, you could explore what happens when
you click on “pause” (or press the ‘p’ key) during training.
Note: In general, it is your responsibility to save your simulations frequently. If you
quit OXlearn without saving, recent changes (made since the simulation was last
saved) will be lost. In order to decrease the risk of this happening, however, OXlearn
has an autosave function. Essentially, whenever a network is trained or tested
OXlearn creates a simulation file which is named like your current simulation plus the
affix ‘_sw<number of sweeps>’. For the present simulation this means that you
should find a file called ‘myAND_sw1000.mat’ alongside your original simulation file.
This snapshot of your simulation right after training or testing can be used as a
recovery file in case you have forgotten to save after training was completed and
closed OXlearn. As we will see later, these automatically created files also come in
handy when training several instances of the same network or when you want to dump
intermediate states of the network during training.
3. Learning XOR (Exclusive-Or)
The exclusive-Or (XOR) function is very similar to both the AND and the OR
function. Again we have two inputs and one output. The output is 1 if either of the
inputs is 1, but NOT if BOTH inputs are 1.
Question 18:
Please provide a table (similar to Table 1 from the last practical)
illustrating the XOR function. (1)
As you will have noticed, there really is not much of a difference to the AND function
from the previous practical. It therefore should be straightforward to adapt one of your
simulation files to accommodate the XOR function. Please do so (don’t forget to save
that file under a different name) and train the network again – what a difference!
Usually when something like this happens, i.e. the network does not learn a task as
expected, you’d have to go back and check the set-up: have you allowed enough
sweeps, is the learning rate too small or too large, and so on. In this case, however, there is a more
principled reason for the network’s failure to learn. Recall the story from the
introductory lecture, there was an argument that almost led to the whole neural
network approach being abandoned before it even properly got started. You have just
discovered this argument: Two-layer networks are in principle incapable of acquiring
tasks that are not linearly separable. The XOR function, of course, is only the simplest
example, but linear separability plays a role in many naturalistic tasks. Therefore it
would be really worrying if a model that is meant to simulate how the brain works
was limited to linearly separable problems! As it turns out, however, this limitation
does not apply to networks that have a hidden layer and a non-linear activation
function.
Figure 1: A 3-layer feed-forward network
Let’s try and understand this properly. For very simple tasks there is a nice way to
illustrate linear separability: If, in a graph like the one below, you can draw a straight
line that separates all the ones from all the zeros, then the problem is linearly
separable.
[Figure: the four input combinations plotted with input 1 on the x-axis and input 2 on the y-axis; each point is labelled with the corresponding XOR output, i.e. 0 at (0,0) and (1,1), and 1 at (0,1) and (1,0).]
As you can see, a single line split is impossible in the case of XOR. Adding a hidden
layer helps because, roughly speaking, the network can use the first processing step
(mapping input activations onto hidden layer activations) to create a re-representation
of the original problem (a bit like moving the zeros and ones in the above graph
around) such that, in the second step (mapping hidden layer activation onto output
layer activation), it can be solved by finding the dividing line. From this you can also
understand why at least the hidden layer units need a non-linear (e.g., sigmoid)
activation function: applying several linear transformations in a row will still result,
overall, in a linear transformation and thus would not help with the separation
problem.
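To see the re-representation idea at work, here is a sketch with a hand-picked (not learned) weight configuration for a 2-2-1 network, using hard threshold units as a stand-in for steep sigmoids. These weights are just one illustrative solution; a trained network will usually find a different one.

    % Hand-picked XOR solution for a 2-2-1 network with threshold units (illustration).
    % Hidden unit 1 detects "at least one input on", hidden unit 2 detects "both inputs on";
    % the output unit then computes h1 AND NOT h2, which is exactly XOR.
    step = @(z) double(z > 0);                 % hard threshold instead of a steep sigmoid
    X        = [0 0; 0 1; 1 0; 1 1];           % the four input patterns
    W_hidden = [1 1;                           % weights onto the two hidden units
                1 1];                          % (columns = hidden units)
    b_hidden = [-0.5, -1.5];                   % h1 fires if the input sum > 0.5, h2 if > 1.5
    W_out    = [1; -2];                        % h1 excites the output, h2 strongly inhibits it
    b_out    = -0.5;
    H = step(X * W_hidden + repmat(b_hidden, 4, 1));  % hidden re-representation of the inputs
    Y = step(H * W_out + b_out);                      % output: [0; 1; 1; 0], i.e. XOR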
Of course all of this becomes more complicated for more complex tasks and such neat
visualisation in two dimensions is not possible anymore, but the general principles
still hold: a hidden layer with a non-linear activation function is needed to enable a
neural network to deal with many tasks. The astonishing fact now is this: if there is
such a hidden layer, you can (mathematically) prove that a neural network is, in
principle, capable of implementing any possible task. Yes, any possible task, similar
to a computer which also is, in principle, capable of implementing any possible
function. In the case of a computer, of course, the problem is to find the right program
(or, indeed, a programmer smart enough to write it), whereas for neural networks, the
problem is to find the right weights configuration (through the learning algorithm). In
the remainder of this practical you will explore some of the factors that have an
impact on whether such a weights configuration is found or not.
First, you need to change the network architecture to a 3-layer feed-forward network.
Give the novel hidden layer 2 units and a bias. If you just run a training now, you will
probably find that the network still does not learn. There are several reasons for this:
1. Because XOR is a considerably more difficult task, you might want to give the
network a bit more time to learn it, try training for 20000 sweeps.
Note: Depending on how powerful your computer is, you might find that training has
become slow. A simple way to improve training speed is to go to the training options
window and untick the box next to “update display during training”. You will only
see the results once training has finished, but you will get there much quicker. It
might also help to increase the logging interval. If you set “log performance every n
sweeps” to 10 you should still retain sufficient information about training
performance while having decreased the amount of data in your simulation file by
almost an order of magnitude.
2. The training options chosen might not be suitable to the given architecture
and/or the task at hand. As a first guess you might try with a learning rate of
0.1, no momentum, and a random presentation order (with replacement).
Are we getting there?
3. When judging the network’s ability to learn a task, you should not rely on a
single run. Even if you do not change any settings there are two things that can
differ between runs: the initial weights configuration (if it is set to “random
seed”) and the presentation order (if it is not “sequential”).
Question 19:
Train the network 3 times with the same settings. Report all
relevant settings with other than default values and, for each run, the average
correctness performance (specify the criterion used) and the highest mean
square error. (3)
Now, play around with different values for the learning rate (typically between 0.001
and 1.0) and the momentum (typically 0 to 0.9). At this stage it might be useful to
memorize some of the shortcuts, e.g., pressing CTRL+T to train the network or
CTRL+O to bring up the Training Options window.
Note: Two other features of OXlearn facilitate tasks like these. First, you can train
several instances of the same network in one go. If you select Run -> Train Several
Networks, OXlearn will train several instances of the network with the current set up
and include the run number in the autosave file (e.g. “myXOR(run3)_sw20000.mat”).
The very last run will remain open under the original name. Second, the network
comparison tool (Tools -> Compare Networks) allows you to select several
simulation files on your hard drive and to compare performances. Note that this tool
uses the currently loaded file (“current simulation”) as a reference in terms of the
networks’ set-up. If you select files that deviate in important aspects from the current
simulation (e.g., different number of input units, different log interval) you might get
invalid comparisons or outright errors.
Question 20:
What effects do you observe when experimenting with learning
rate and momentum? (think in terms of speed of learning, stability of learning,
quality of the performance after learning) (3)
Question 21:
Report the settings for a particularly successful version and
comment on how you judged “success”. Please include relevant graphs of
this network’s performance in your portfolio. (3)
Question 22:
Does the error decrease with each training sweep? Why or why
not? (Don’t forget about the smoothing option!) (2)
Question 23:
Define a grouping vector (under Set-up ->
Train Patterns) and comment on what the performance
display(s) can tell you when the training sweeps are organised
into groups. (2)
Question 24:
Certainly, you will have observed some networks that failed to
learn. Choose one specific simulation and speculate on the possible reasons for
its failure. You should back up your speculations with concrete settings and
your experience with other, successful runs. (2)
4. Generalisation in Neural Networks
In this practical you will explore the generalisation ability of connectionist networks
with an extension of the XOR-net (in fact, this already is a minimal categorisation
task).
The first step is to determine the input and target patterns we want to present to the
network. The task looks like this: you have four input nodes, each of which can take
the values 1 (on, true, present) or 0 (off, false, not present). This gives a maximum of
16 combinations, displayed below (maybe you see the logic in determining all
combinations?).
Table 2: 4-bit combinatorics

Pattern    i1   i2   i3   i4
P1          0    0    0    0
P2          0    0    0    1
P3          0    0    1    0
P4          0    0    1    1
N5          0    1    0    0
N6          0    1    0    1
P7          0    1    1    0
P8          0    1    1    1
P9          1    0    0    0
P10         1    0    0    1
P11         1    0    1    0
P12         1    0    1    1
P13         1    1    0    0
P14         1    1    0    1
P15         1    1    1    0
P16         1    1    1    1

The network’s task is to distinguish the patterns with exactly 2 active input units
from all the rest. If the input pattern belongs to the Duo category the output node
should be on (=1), in all other cases the output should be off (=0). However, as we
want to investigate generalisation today, we will exclude two of the patterns (one that
has exactly two ones and one that hasn’t; for example the 5th and the 6th pattern)
from the training set. We will test later if the network is able to deduce the correct
classification ‘rule’ (if two ones -> on; else -> off) from the training set and transfer
it to the patterns it has not yet seen.

You can load the patterns described above from
OXlearn\Simulations\practical4\DuoPatterns.mat. Next, rename the Simulation
(File -> Save As) and create a 4x3x1 net, that is, a network with 4 inputs, 3 hidden
nodes and 1 output node. Also, make sure to have a bias node connected to all
internal nodes.
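Just to spell out the logic behind the 16 combinations and the Duo rule, the input patterns are simply the 4-bit binary numbers 0 to 15, and the target is 1 exactly when two input bits are on. A MATLAB sketch (for illustration only; today you will load the ready-made patterns from DuoPatterns.mat):

    % The 16 input patterns of Table 2 are the 4-bit binary numbers 0..15;
    % the "Duo" target is 1 exactly when two of the four bits are on.
    inputs  = dec2bin(0:15, 4) - '0';          % 16x4 matrix of 0s and 1s
    targets = double(sum(inputs, 2) == 2);     % 1 for P4, N6, P7, P10, P11 and P13; 0 otherwise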
With the experience from the last practicals you should be able to find settings that do
the job (this net is very similar to the XOR). If you need a hint: try with a momentum
of around 0.3 and a learning rate of ~0.1 and you should find nets that converge
within 10000 sweeps. Train a few networks to get an impression of the stability of
your settings.
Let’s assume you have found a net that solves the task, i.e., it classifies the 14 training
exemplars correctly. In order to judge this, as always, you need to inspect the verify
performance. Today, however, we want to go further: we want to see how well the
networks generalize. A network (or any kind of information processing system) shows
good generalisation when it can transfer the lessons learned on the basis of a restricted
set of examples to other exemplars that it has not dealt with before. Looking at the
face of a perfect stranger and being able to tell their gender, for example, is an instance
of generalisation; even if you have never seen this face before you will usually be able
to transfer your knowledge about (or experience with) a large number of male and
female faces to correctly classify this novel exemplar.
But how do we capture generalisation in a neural network model? Actually, this is
quite simple. A neural network acquires all its ‘knowledge’ from the experience with
the exemplars (patterns) that are processed during training, resulting in a weights
configuration which implements all this ‘knowledge’. All we have to do to test
generalisation performance is to expose the trained network to novel patterns and
assess how it fares. In contrast to a computer, we don’t have to fear a “syntax error”
response from a neural network; the worst that can happen is an incorrect classification.
In OXlearn, this process is straightforward: we simply need to set up some Test
Patterns which will be presented to the network (with ‘frozen weights’) after training
has finished. For the purpose of our analyses it is more informative if the Test
Patterns do not only consist of the two novel patterns but also include the 14 old
ones – distinguished by informative labels. Actually, such a set of Test Patterns was
included in the DuoPatterns.mat file, so you should be able to find and inspect it in
your current simulation. To run the test, select Run -> Test Network.
Question 25:
Include the output activations for all 16 test patterns here (label
appropriately) and comment on them. Has it worked out? Did the net generalize to
the two new patterns? How well? (2)
Repeat this a few times until you have found one net that generalises well and another
one that performs worse on the novel patterns – they both should have low error
scores and perform correctly on the original training set, though. Consider checking
the autotest option (Set-up -> Training Options – more>>). Make sure to save the two
simulations you have found.
Note: you need to rename your current simulation when you have found a promising
candidate; OXlearn will overwrite older autosave files if the simulation name remains
unchanged and the number of sweeps is equal.
Question 26:
Include the output activation in response to the two new
patterns for the two nets you have found. (2)
Novel patterns:        N0100        N0101
Desired output:
Good gen net (G):
Bad gen net (B):
Let’s see if we can find out why one net generalizes better than the other. In order to
do this we need to have a look at the distributed representations that have developed
in our hidden layer of three units. OXlearn provides two tools to analyse hidden layer
activations: Cluster Plot and PCA, in the Tools menu. Both are visualisation tools,
that is, they operate on whichever data you care to feed them (two dimensional
matrices where rows correspond to observations and columns represent variables).
When analysing neural networks, each column usually corresponds to the activation
of a specific node. It is often useful to think of each node/column as representing one
dimension in a high dimensional space, often termed ‘activation space’ with as many
dimensions as there are nodes. A PCA finds (and shows) the most informative planes
within that space, whereas the cluster plot collapses across all these dimensions by
looking at (Euclidean) distances only. Since we are interested in the generalisation
performance of the network, you should choose the hidden layer activations during
testing (aka ‘OXtestHidden’) from the dropdown menu at the bottom of the displays.
To appreciate how useful such tools are, you should maybe have a direct look at the
hidden layer activations first. To do so, go to the MATLAB main window and locate
the workspace browser (usually on the upper left hand side, else type “workspace” in
the command window). The workspace browser provides a direct look at the diverse
variables that MATLAB is manipulating, everything starting with ‘OX’ has to do with
OXlearn. Now you can do two things: either you type OXtestHidden in the command
window, or you double click on OXtestHidden in the workspace browser. Both
methods will display the 16x3 matrix that we want to analyse, and you will see that it
is quite difficult to deduce anything from it (and imagine a simulation with many
more hidden layer nodes).
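For instance, the pairwise (Euclidean) distances that the cluster plot works with can be computed directly from that matrix; a sketch, assuming the 16x3 matrix OXtestHidden is in your workspace:

    % Pairwise Euclidean distances between the hidden-layer representations of the
    % 16 test patterns (rows of OXtestHidden); patterns that are close together here
    % will also end up in the same branch of the cluster plot.
    H = OXtestHidden;                  % rows = patterns, columns = hidden units
    n = size(H, 1);
    D = zeros(n);                      % distance matrix
    for i = 1:n
        for j = 1:n
            D(i, j) = sqrt(sum((H(i, :) - H(j, :)).^2));
        end
    end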
Now let’s see what a cluster plot of these values looks like: go to Tools -> Cluster
Plot and select OXtestHidden in the options panel. This is where informative labels
come in handy.
Question 27:
For both your (B) and (G) network, provide an appropriately
labelled cluster plot along with all settings that deviate from the default values.
(2)
Question 28:
Choose one of the plots and interpret it. Where do the new
patterns end up? Does the plot show anything useful with respect to the
network’s ability (or failure) to generalize? (2)
FoodForThought 3:
What exactly do you think has gone wrong with the (B)
net? Try to explain why good generalisation, in the current setting, is not
guaranteed.
Because the ability to generalize is one of the most interesting aspects of
connectionist networks, I have included a few more optional questions on this topic. If
you are having a go at them, solutions can be found either by trying (basically
repeating the things we have done today) or by thinking about how connectionist
models solve their tasks (or reading about it, “overfitting” is a useful keyword). Or
you can think about it first and then try to verify your thoughts with OXlearn.
Here we go: Imagine (or implement) the same task, but with a net that has many more
(say 14) hidden units instead of three. What would change? More specifically:
FoodForThought 4:
Would it be easier or harder for such a net to learn the
14 training exemplars? Why?
FoodForThought 5:
Given the network performs well on the original
patterns, do you think a net with many hidden units is more likely or less
likely to generalize well? Why?
FoodForThought 6:
What would happen if we were to train the net on only 8
patterns and then test for generalisation? And with 5 exemplars for training?
Or 2? And if we allow the input patterns to have 5 bits/units instead of 4? The
more general question is this: How much information is necessary for
successful generalisation?
5. Letter prediction in a recurrent network
In this practical you will implement an SRN that learns to predict the next letter
in a very simple artificial language. This model constitutes a simplified version of
the one that learned to predict word boundaries (see lecture).
In the folder Simulations\practical6 you will find a file called ‘PredSRNinput.txt’.
The content of this file is meant to represent a continuous speech signal (one
phoneme/letter per time step/line) from an artificial language that only has three
words. These three words, however, were concatenated in random order to make up one long
sequence.
Question 29:
Give the three words that make up our artificial language. (1)
All right, now we need to import those letters and convert them into a set of numbers
which can be presented to the network.
We will use the following coding scheme:
letter    bit 1   bit 2   bit 3   bit 4
b           1       1       0       0
d           1       0       1       0
g           1       0       0       1
a           0       1       0       0
i           0       0       1       0
u           0       0       0       1
Have a look at the coding. You should realize that a) this is a distributed coding (i.e.,
a letter may be represented by more than one active unit) and b) there is some
structure to it: the first bit (input) codes whether the letter is a consonant (1) or a
vowel (0), the next 3 bits encode the identity of the respective consonant or vowel.
This coding thus constitutes a mixture of distributed and localist coding (overall
amounting to distributed, as mentioned above).
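Translating the letter sequence into these codes is just a table lookup. The idea can be sketched as follows (illustration only; OXlearn's Translate tool, introduced below, does this for you):

    % Translating a letter sequence into the 4-bit codes by table lookup (sketch).
    letters = {'b','d','g','a','i','u'};
    codes   = [1 1 0 0;      % b
               1 0 1 0;      % d
               1 0 0 1;      % g
               0 1 0 0;      % a
               0 0 1 0;      % i
               0 0 0 1];     % u
    sequence = {'b','a','d','i','i'};          % a short example sequence
    patterns = zeros(numel(sequence), 4);
    for t = 1:numel(sequence)
        patterns(t, :) = codes(strcmp(sequence{t}, letters), :);
    end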
Question 30:
How many bits (input units) would be required for a fully
localist coding? (1)
Question 31:
Give an example of a distributed coding that uses 3 bits (inputs) only. (1)
But let’s stick to the coding scheme given above for the moment. First, you need to
import the content of the file into your current simulation (File -> Import Selected).
You should assign the name ‘OXinputLabels’, either during import or afterwards, in
the Matlab workspace browser. Next, you have to use OXlearn’s translation facility
(Tools -> Translate) to translate the sequence of letters into distributed numerical
representations, using the coding scheme above (also in trnslTableBaDiiGuuu.txt).
The translation source, evidently, is the name you have assigned to the sequence of
letters. As translation target you should specify ‘OXinput’, so OXlearn will recognize
the result of the translation (or, again, you can rename post hoc). As translation
function, you choose OX_trslTlearnStyle and specify the text file
trnslTableBaDiiGuuu.txt as coding scheme when prompted.
The network is a so-called ‘predictive SRN’, which is to say that its task consists of
learning to predict the next element in a sequence. Given this information, you now
should be able to create the target patterns (and the target labels).
If you are having too many problems with the above, you can find the resulting
patterns in practical6\BaDiiGuuPatterns.mat.
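One way to think about it: the target at time step t is simply the input at time step t+1. Sketched in MATLAB, assuming the translated inputs sit in OXinput with one pattern per row ('inputs' and 'targets' are just placeholder names for what ends up in your train pattern set):

    % For a predictive SRN the target at time t is the input at time t+1 (sketch).
    targets = OXinput(2:end, :);       % the pattern at t+1 becomes the target at t
    inputs  = OXinput(1:end-1, :);     % the last input has no successor and is dropped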
Once your train patterns are all set up, you need to define the network architecture.
Choose an SRN with ~10 hidden units. You should be able to determine appropriate
values for the other network settings on your own. In terms of training options, you
might try ~ 0.1 as a learning rate and a momentum of 0.3. Also, you should make sure
to present the patterns sequentially.
Question 32:
Why is sequential presentation important in this simulation? (1)
Train the network for 5000 sweeps.
Question 33:
What (approximately) is the average squared error at the end of
training? Comment on this value: is it high or low? What does it tell us? (2)
You can also inspect the verify performance, but this goes through the whole corpus
and thus looks rather messy. Because there are only three words in our artificial
language, you could set up a test that presents each word only once. Create the
necessary test patterns (+ labels), run the test and inspect the results.
Question 34:
When looking at the overall errors for your test patterns, what
do you find (general picture in terms of the performance for the prediction of
each letter)? (1)
Question 35:
Have a more detailed look at the individual test pattern
activations, specifically at the error of the first unit as compared to the error of
the other three units. What do you find (recall the coding)? (2)
Question 36:
Even more specifically, compare the outputs for the prediction
of the three consecutive u’s. You will notice that the last one produces the
highest error. Why is that? (1)
Question 37:
Why is there no need to ‘start small’ in this specific model? (1)
FoodForThought 7:
Strictly speaking, having a test set that includes only the
three words was suboptimal. Can you see what the problem is? Can you come
up with a cleaner solution to testing an SRN?
FoodForThought 8:
Create a Cluster Plot/PCA for this test and see if you
find clues about the internal representations the SRN has developed
(specifically about the interaction between type and position of an input
pattern). Can you deduce information on how the network solves its task?
6. Mini project with OXlearn
Recognizing LED digits (10)
In this mini project you have to implement the
task of recognising the LED digits 1 – 9 in
OXlearn.
Your task is straightforward: find a connectionist
model that is able to recognise all 9 possible
digits from an LED display. You can use
whichever set-up you like, as long as it works.
The minimal requirements should be 7 input
nodes (A-G, see picture on the right) and 9 output
nodes (numbers 1-9, localist coding), but you are
welcome to find a better solution. An input node
is on (value of 1) when the corresponding diode is
lit, off (value of 0) otherwise. For the digit ‘2’, for
example, all inputs except A and G should be on.
[Figure: a seven-segment LED display with the segments labelled A to G.]
The aim is to find a working implementation. If you are ambitious, try to find the
most efficient net, i.e. one that reaches the lowest overall error with the smallest
amount of training. Again, you may work in groups of up to three, but report who you
have worked with.
When reporting your project in the portfolio, please include
- All relevant set-up parameter values
- The display showing the final weights configuration
- The train performance display including Error and %Correct performance
- The numerical activations in the output layer for each of the possible input patterns when tested after training
- A short paragraph summarizing the results