Connectionist Modelling of Language and Cognitive Processes
Practical with OXlearn
Gert Westermann
August 2011

In order to understand and appreciate connectionist modelling, it is very useful to learn how to use neural networks in practice. This is what this practical is about. It aims to enable you to set up simple neural network models and to train and test them, and at the same time to deepen your understanding of neural modelling.

The practical handout contains tasks and questions which form part of the assessment for this course. Some of these are revision questions from the seminars and/or the texts on the reading list, but most will require you to use the OXlearn simulator and build your own neural network models. Please enter your answers to the questions in this sheet in bold. Where you are asked to provide graphs of your results, please paste the pictures directly into the sheet (use .jpg or .gif format to limit the size of your document). You can either take screenshots or extract the figures from OXlearn (using the ‘extract’ button and File -> Save As). For each question, the number in parentheses indicates the maximum number of points you can get for answering correctly.

Answering the more advanced FoodForThought questions is not strictly part of the coursework. However, I would be delighted if you had a go at (some of) these, and I might be tempted to allocate extra points for answers that are extraordinarily concise or informed.

Note: In the practical sessions, you can work in small groups if you wish. However, if you choose to do so, you need to indicate in your portfolio who you have been working with in a particular session. Also, you still need to formulate your answers in your own words, even if you have discussed them as a group.

1. Understanding how activation is sent through a neural network

Question 1: What are the two tasks performed by each individual node (or unit) in a neural network?
(1)

Question 2: Why are input units functionally different from all the other units in a network? (1)

Question 3: x1 = 3; x2 = -2; x3 = 1.5; x4 = 0.7. What is Σ_i x_i? (1)

Imagine a minimal neural network as drawn in (A), whose output unit uses the sigmoidal (or logistic) activation function shown in (B).

[Figure: (A) network architecture: two input units, each with an input value of 1, connected to a single output unit by connection weights of -0.25 and 0.6. (B) activation function: unit activation (0 to 1) plotted against netinput.]

Question 4: Given that the input values and connection weights are as shown above, what is the net input received by the output unit? (1)

Question 5: Given the sigmoidal activation function (B), what is the rough activation value of the output unit (look it up in the graph)? (1)

Question 6: If both connection weights were 0, what would the activation value of the output unit be? (1)

Question 7: And if the weights had the values 1.2 and 0.7 and both inputs were 0, what would be the resulting output activation? (1)

Question 8: Name three possible activation functions for neural network units. For each of these functions, (a) indicate if the function is linear or not and (b) describe verbally how the function works (in which way the net input is transformed into an activation value). (2)

Bias units: A bias node is an extra node in many neural networks that always (automatically) has an activation value of 1 (it is always on). The bias is usually connected to all units in the network, and its weights are adjusted in the same manner as all other weights in the network (they can end up being negative or positive).

Question 9: Explain why a bias node might be necessary for a network to perform correctly (maybe with an example?). (2)

FoodForThought 1: Is there a simple way of describing the effect of the bias node on a unit’s activation function?

FoodForThought 2: Is there an equivalent to the bias node’s functionality in biological neurons?

2. Setting up your first neural network with OXlearn

This practical introduces you to OXlearn, the simulation software that you will be using throughout. OXlearn is a high-level neural network simulation program that enables you to easily define connectionist models and run simulations with them. OXlearn is actually a MATLAB program, which means that it only works when you have a recent version (7.3 or later) of MATLAB installed on your computer (we have it installed here). However, you do not need to do any programming yourself; all the functionality of OXlearn is accessible via its graphical user interface, basically by using buttons, dropdown menus, etc.

OXlearn is the successor of the t-learn software which is used in the recommended books for this module (McLeod et al., 1998; Plunkett & Elman, 1997; see reading list). All the exercises contained in these books can also be done with OXlearn, and doing them has become a lot easier, too.

Today’s instructions are exceedingly detailed (you can skip what you know already) in order to give you a chance to familiarize yourself with OXlearn. However, after today you will be expected to know your way around and you will not get step-by-step instructions any more. If you have problems using OXlearn, please consult the user’s manual, and if this doesn’t help, speak to me. There will be a second supervised OXlearn session in week 2 of the course.

Simulating an AND network in OXlearn

In this task you will use OXlearn to set up and train a neural network to learn the AND function. As in all simulation projects, this requires setting up the simulation (defining the train patterns, the network architecture and the training options) and then training the network and analysing its performance (during training or subsequent testing).

The very first thing you’ll have to do is download the OXlearn program onto your personal USB drive.
You can find the OXlearn folder at http://psych.brookes.ac.uk/oxlearn/ (you might need to unzip the folder). This page also has the OXlearn manual.

Next you have to start MATLAB: just click on the MATLAB icon on your desktop. Once it has opened, you’ll need to load the OXlearn program. To do so, click on File -> Open, browse to the OXlearn folder, and double-click on OXlearn.m. The OXlearn interface will open.

The first thing you will notice is the simulation overview display, which has opened in the OXlearn interface. Most of the panels will be empty, and the red attention signs on the left indicate that the simulation is not yet set up properly to be trained or tested. This is not surprising, as we are currently dealing with an empty simulation.

The first thing to do is to assign a name to your new simulation project. Go to File -> Save Simulation As and browse to a suitable location on your USB drive; for example, you could create a new folder called “practical1” in the default “Simulations” folder. You also have to provide a name, e.g. “myAND”, to which the extension ‘.mat’ will be attached automatically. To add subsequent changes to this simulation file, just choose File -> Save Simulation or press CTRL+S, as in most programs.

Note: Please make sure to save all simulation files that you are producing on your USB drive! When you log out, all data will be deleted from the computer.

Now you are ready to add content to your simulation. There are three essential parts of a simulation that need to be set up properly before you can train the network. These three parts concern the train patterns, the network architecture and the training options, all accessible from the Set-up menu in the menu bar at the top of the OXlearn window, or directly from the simulation overview (Inspect -> Simulation) by clicking on the corresponding edit button (the blue pen).
Table 1: the AND function

Pattern      Input unit 1   Input unit 2   Target unit 1
Pattern 1    0              0              0
Pattern 2    0              1              0
Pattern 3    1              0              0
Pattern 4    1              1              1

Note: In binary problems like this you can also read a value of 1 as “on”, “true” or “given”, while a value of 0 represents “off”, “false” or “absent”.

Go to Set-up -> Train Patterns. In this window you can define the Input and Target Patterns you want the network to learn from, essentially feeding the program with the content of Table 1. There are, in fact, several ways of doing so (see the OXlearn User Manual); only one will be described here. First, you have to choose an appropriate number of input and output (target) units, as well as the number of different patterns that you are going to define. This can be done by clicking on the “change size of patterns” button on the left hand side of the display (use the tooltip information that is shown when the mouse pointer hovers over a button). Once you have entered these numbers (4 patterns, 2 input units, 1 target unit) you can see that the two graphs have adapted their size. However, you still need to enter the values of the desired input and target activation. This is simple: just right-click on the appropriate area of a graph and enter the value (0 or 1). You might also want to change the pattern labels (right-click on them to edit) to something more descriptive. For now you are done; you can close the Set-up -> Train Patterns window by clicking on OK.

Note: In fact, changes in all set-up windows are applied immediately, not only when the window is closed.

Go to Set-up -> Network. In this window you can define the exact network architecture by choosing a network type from a dropdown menu and defining the appropriate number of units for each layer, as well as whether you want a bias node or not. For now we want a “2-layer feed-forward” network with a bias unit and a sigmoidal activation function in the output layer.
The input and output layer size should correspond with the train patterns.

Go to Set-up -> Training Options. In this window you define all the parameters pertaining to the exact way in which the network is trained, e.g. which learning algorithm is used, for how long the network should be trained, and in which order the individual patterns are presented during training, as well as some options to control which information is logged during the training process. For the current simulation, you can just leave all these parameters at their default values and close the window.

Note: You will notice that the status of the train options in the simulation overview display has changed: it now shows a green tick instead of the red attention sign. The reason is this: whenever you open a set-up window, all the corresponding parameters that are not yet defined will be created with pre-defined default values. Thus, even if you have not changed any of the values, the act of opening the window has led to the creation of all the relevant parameters.

Note: The grey ticks in front of the bottom four elements (weights, train performance, verify performance and test performance) indicate that all is well, but that the corresponding parameters are still empty. The reason is simple: these parameters are obtained by training, verifying or testing the network. Once you have done this, the ticks will turn green. At a later stage, you might also encounter yellow attention signs here. They indicate that the set-up parameters (the topmost five) and the results parameters (the last four) might be inconsistent. This can happen when, in a simulation that has already been trained, you change the set-up (e.g. if you choose a different learning rate). Because the weights, train performance, etc., are the outcome of training with the previous set-up, your simulation is temporarily inconsistent.
However, you don’t need to worry: training the network with the new set-up will replace the results parameters and thus make the simulation consistent again.

The first three elements in the status panel (in the simulation overview display) should now all be ticked, which means that you are ready to train the network. Before doing so, however, you should take the time to familiarize yourself with some of the other displays provided by OXlearn, all of which let you inspect different aspects of a simulation in more detail. You can change displays by selecting the options under the Inspect menu; the first entry (Inspect -> Simulation) corresponds to the currently shown simulation overview display. Because much of the information you might want to display concerns the network’s performance during and after training, the displays will of course be more informative once you have trained the network. And this is what you’ll be doing now.

Switch to the performance display (Inspect -> Performance), and then go to Run -> Train Network. Provided you have set up everything correctly, this will lead to the network being trained for the given number of sweeps (1000, as per default setting).

Note: A sweep is the processing of one individual training pattern, where the identity of the pattern is determined by the chosen presentation order (you might want to bring up the Training Options window again to check). If you have left this parameter at “sequential”, the first pattern is presented in the first training sweep, then the second pattern, then the third, and so on, starting over with the first pattern again after the last pattern (here the fourth) has been presented. The other options either randomly shuffle the order of presenting each pattern once (“random without replacement”) or draw a pattern at random for each individual sweep (“random with replacement”). A related term is an epoch, which usually means one pass through all the training patterns.
Thus, in the current simulation, four sweeps make one epoch. While the network is being trained, the performance display is updated immediately. Thus, you will see the error curve creeping from left to right until the maximum number of sweeps is reached. Each point of this line represents the mean squared deviation of the network’s actual output activation from the intended target activation in a specific sweep. You will (hopefully) notice that the line goes down gradually, thus indicating that the network is getting increasingly better at performing the task. But has it actually mastered the AND function? You might get a better impression by bringing up the display for the other performance measure, just tick the box next to “%Correct” in the options panel at the bottom of the figure. The fact that the network has not reached 100% by the end of training already indicates that everything is not entirely well. Note: The criterion for judging performance in an individual sweep is always “deviation < 0.1” for training performance. This means that the deviation of the actual output activation from target is evaluated for each of the output units (well, currently there is only one), and that a specific sweep will be deemed correct if none of the output units deviates by more than 0.1 from the corresponding target value. This criterion is rather conservative, i.e., the network’s performance needs to be quite close to target in order to be judged as correct. Nevertheless, it is important to note that any such criterion, while often necessary, is essentially arbitrary. Two things follow from this insight: A) when evaluating existing models, pay attention to how loose or strict a criterion is used; and B) you will have to make (and report) a similar choice with your own simulations. You can change the correctness criterion in OXlearn when inspecting verify or test performance (but not for training performance). 
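The error measure and the correctness criterion described in this note can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `mean_squared_error` and `is_correct` are made up for this example and are not part of OXlearn, the error is assumed to be averaged over the output units of a single sweep, and the output activations used are hypothetical.

```python
def mean_squared_error(outputs, targets):
    """Mean squared deviation of the actual outputs from the targets (one sweep)."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)


def is_correct(outputs, targets, criterion=0.1):
    """A sweep counts as correct only if every output unit is within the criterion."""
    return all(abs(o - t) < criterion for o, t in zip(outputs, targets))


# Hypothetical output activations for a single output unit whose target is 1.
for output in (0.93, 0.82):
    print(output,
          round(mean_squared_error([output], [1.0]), 4),
          is_correct([output], [1.0], criterion=0.1),
          is_correct([output], [1.0], criterion=0.3))
```

With these made-up values, an output of 0.93 passes both criteria, whereas an output of 0.82 passes only the looser one. The same activation is thus judged correct or incorrect depending purely on the criterion chosen, which is why the criterion should always be reported.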
Have a closer look at the %Correct display for train performance. There is something that should make you suspicious: if every sweep is classified as either correct or incorrect, and if the line has one point for each sweep, why are there intermediate values such as 35% correct? You have just discovered the functionality of the smoothing filter, which is governed by the number in the little box below the buttons on the left hand side of the display. Right now you should find a 20 there (number of logged sweeps/50, the default setting), which means that each point in a line (Error and %Correct) actually represents the average of 20 consecutive sweeps; that is why the graph has these distinct steps in it. If you set this number to 1, no smoothing occurs and you will get the expected binary correctness display; however, you will also see that it is not very informative. Similarly, setting this smoothing factor to 1000 (= the maximum number of sweeps) results in a horizontal line representing the average value of all 1000 sweeps.

But let us come back to the question of whether the network has mastered the task or not. Actually, the training performance can provide a general impression, but not the direct answer. Recall that learning in a neural network means adapting the weights, and each change of the weights configuration may have an impact on all patterns processed thereafter. Because the weights are updated constantly (after every sweep) during training you should not, strictly speaking, compare the network’s behaviour over adjacent sweeps. It could be that the weight update in response to the second pattern has compromised the network’s ability to deal with the first one (unlikely, but possible). Luckily, there is a clean solution: one can simply record the network’s responses to any number of patterns without adjusting the weights while doing so. This is often called ‘testing the network with frozen weights’.
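The difference between training sweeps and testing with frozen weights can be sketched as follows. This is a minimal illustration, not OXlearn's implementation: the delta rule for a single sigmoid unit, the learning rate of 0.5 and the all-zero initial weights are assumptions chosen for the example, and the names `train` and `verify` are hypothetical.

```python
import math

# The AND patterns from Table 1: ((input 1, input 2), target).
PATTERNS = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]


def sigmoid(net_input):
    return 1.0 / (1.0 + math.exp(-net_input))


def activate(weights, inputs):
    """Output of a single sigmoid unit; weights = [w1, w2, bias weight]."""
    w1, w2, w_bias = weights
    return sigmoid(w1 * inputs[0] + w2 * inputs[1] + w_bias * 1.0)


def train(weights, sweeps=1000, learning_rate=0.5):
    """Sequential presentation; the weights change after every single sweep."""
    weights = list(weights)
    for sweep in range(sweeps):
        inputs, target = PATTERNS[sweep % len(PATTERNS)]
        output = activate(weights, inputs)
        # Delta rule for a sigmoid unit (gradient of the squared error).
        delta = (target - output) * output * (1.0 - output)
        weights[0] += learning_rate * delta * inputs[0]
        weights[1] += learning_rate * delta * inputs[1]
        weights[2] += learning_rate * delta  # the bias activation is always 1
    return weights


def verify(weights):
    """Present every training pattern once; the weights are never modified."""
    return [activate(weights, inputs) for inputs, _ in PATTERNS]


trained = train([0.0, 0.0, 0.0])
for (inputs, target), output in zip(PATTERNS, verify(trained)):
    print(inputs, target, round(output, 3))
```

Because `verify` only reads the weights, all four responses come from exactly the same weights configuration and can be compared with one another, which is not true of four adjacent training sweeps.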
In the current simulation this means exposing the network (with the weights configuration after 1000 sweeps of training) to all training patterns once, in sequential order to keep it simple. OXlearn will carry out exactly this process when you select Run -> Verify the network has learned. To be clear on terminology, the verify option is actually performing a test, but with the exact set of patterns with which the network has been trained. Run -> Test Network, conversely, presents the network with Test Patterns that must be set up separately.

Actually, you don’t even have to verify that the network has learned now, because OXlearn has performed this operation automatically at the end of training (see the “auto verify” tickbox in the training options window + ‘>> more’). To inspect the results, select “verify performance” from the dropdown menu in the options panel. The display is similar to “training performance”, but you get only one point per pattern and, because we are now interested in the individual patterns, smoothing is set to 1.

Again you can inspect the mean square error and you will find that it is rather low. Keep in mind, however, that squaring a small number will make it even smaller and, if we had several output units, taking the average might further reduce the error value. The bottom line is that you should not read too much into absolute error values. Comparing error values between patterns, on the other hand, is somewhat more informative. In the current simulation, for example, the network seems to have more problems with the last pattern (where both input units are on) than with the others. This is also reflected by the fact that this last pattern is sometimes classified as incorrect with conservative correctness criteria (you can now select different criteria from the dropdown menu at the bottom). Usually you should find that a less conservative criterion (e.g.
deviation < 0.3 or binary) classifies all patterns as correct (details might vary from simulation to simulation, we will investigate this in the next practical). And here we have the answer to our question: given a specific criterion, the network has mastered the AND function. The performance display provides rather high-level information about the network’s actual output. Look at the Inspect -> Patterns display (note the different scales, add colorbars if in doubt) and/or the Inspect -> Activations display and try to find out how exactly they relate to the information in the performance display. Question 10: What is the difference between the Patterns and the Activations display in OXlearn? (1) Question 11: What is the network’s exact numerical output activation for each of the 4 patterns? (1) Inspect the weights configuration that resulted from training the network (Inspect -> Weights). The weight display will look something like this: This is called a Hinton Diagram. Positive weights are represented by red boxes and negative weights by blue boxes. The size of the boxes indicates the connection strength (use the colorbar for a rough indication, the datatip for exact information concerning the weight values). The weights are from the units marked on the x-axis to the units on the y-axis. Thus, the three weights displayed here are from i1 to o1 (input unit 1 to output unit 1), from i2 to o1, and from the bias node to o1. Question 12: What are the exact weight values of the three connections in your simulation after training? (1) Note: It is quite easy to save any of the OXlearn displays: clicking on the topmost button at the left hand side (“extract”) will send the graphs contained in the current display to a novel MATLAB figure. If ever you wish to change anything about the visual appearance of a graph (e.g. 
changing colors or labels, adding annotations, etc.), you can use the graphics tools at the top of this window (MATLAB comes with an excellent help, press F1, where you can find out about using the graphics tools). You can save the figure in several formats by selecting File -> Save As. Make sure to choose a suitable format from the dropdown menu at the bottom; to keep files small you should usually go for .jpg or .gif.

Question 13: Extract the weights display of your simulation and include it in your portfolio. (1)

Train the network a few more times with different settings in the training options window (you could change the presentation order, the number of sweeps, the learning rate or even add a momentum). You should be able to find a setting for which the network, after training, gets all four patterns right when evaluated against the criterion of “deviation < 0.1”.

Question 14: Which settings have led to such a good performance? Please report all settings with other than default values. (1)

Question 15: Explain why performance has increased as compared to the initial settings. (2)

Question 16: Report the exact mean square error for all four patterns, as well as the average mean square error. (2)

Question 17: Report the exact difference between the output and the target activation (i.e., the raw error; not squared) for all four patterns. (1)

If you have come this far today, you have done well indeed. In case there is time left, feel free to explore variations of the current simulation and/or play around some more with OXlearn to make yourself familiar with some of the other functionalities provided by the program. Just to mention one, you could explore what happens when you click on “pause” (or press the ‘p’ key) during training.

Note: In general, it is your responsibility to save your simulations frequently. If you quit OXlearn without saving, recent changes (made since the simulation was last saved) will be lost.
In order to decrease the risk of this happening, however, OXlearn has an autosave function. Essentially, whenever a network is trained or tested, OXlearn creates a simulation file which is named like your current simulation plus the affix ‘_sw<number of sweeps>’. For the present simulation this means that you should find a file called ‘myAND_sw1000.mat’ (if you named your simulation “myAND”) alongside your original simulation file. This snapshot of your simulation right after training or testing can be used as a recovery file in case you closed OXlearn without saving after training was completed. As we will see later, these automatically created files also come in handy when training several instances of the same network or when you want to dump intermediate states of the network during training.

3. Learning XOR (Exclusive-Or)

The Exclusive-Or (XOR) function is very similar to both the AND and the OR function. Again we have two inputs and one output. The output is 1 if either of the inputs is 1, but NOT if BOTH inputs are 1.

Question 18: Please provide a table (similar to Table 1 from the last practical) illustrating the XOR function. (1)

As you will have noticed, there really is not much of a difference from the AND function in the previous practical. It should therefore be straightforward to adapt one of your simulation files to accommodate the XOR function. Please do so (don’t forget to save that file under a different name) and train the network again. What a difference!

Usually when something like this happens, i.e. the network does not learn a task as expected, you’d have to go back and check the set-up: see if you have allowed enough sweeps, whether the learning rate is too small or too large, etc. In this case, however, there is a more principled reason for the network’s failure to learn. Recall the story from the introductory lecture: there was an argument that almost led to the whole neural network approach being abandoned before it even properly got started.
You have just discovered this argument: two-layer networks are in principle incapable of acquiring tasks that are not linearly separable. The XOR function, of course, is only the simplest example; many naturalistic tasks are not linearly separable either. Therefore it would be really worrying if a model that is meant to simulate how the brain works were limited to linearly separable problems! As it turns out, however, this limitation does not apply to networks that have a hidden layer and a non-linear activation function.

[Figure 1: A 3-layer feed-forward network]

Let’s try and understand this properly. For very simple tasks there is a nice way to illustrate linear separability: if, in a graph like the one below, you can draw a straight line that separates all the ones from all the zeros, then the problem is linearly separable.

[Graph: XOR output plotted over input 1 (horizontal axis) and input 2 (vertical axis)]

  input 2
     1 |  1    0
     0 |  0    1
       +----------
          0    1   input 1

As you can see, a single line split is impossible in the case of XOR. Adding a hidden layer helps because, roughly speaking, the network can use the first processing step (mapping input activations onto hidden layer activations) to create a re-representation of the original problem (a bit like moving the zeros and ones in the above graph around) such that, in the second step (mapping hidden layer activation onto output layer activation), it can be solved by finding the dividing line. From this you can also understand why at least the hidden layer units need a non-linear (e.g., sigmoid) activation function: applying several linear transformations in a row will still result, overall, in a linear transformation and thus would not help with the separation problem.

Of course all of this becomes more complicated for more complex tasks, and such neat visualisation in two dimensions is not possible anymore, but the general principles still hold: a hidden layer with a non-linear activation function is needed to enable a neural network to deal with many tasks.
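The re-representation idea can be made concrete with a small Python sketch. Everything in it is an assumption made for illustration: the threshold (step) activation stands in for a steep sigmoid, the hidden-layer weights are hand-picked rather than learned, and the grid search over single-unit weights is a demonstration on a coarse grid, not a proof.

```python
from itertools import product

INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = [0, 0, 0, 1]
XOR = [0, 1, 1, 0]


def step(net_input):
    """Threshold activation: a simple non-linear function."""
    return 1 if net_input > 0 else 0


def single_unit_solution(targets):
    """Search a coarse grid of weights for one threshold unit that fits the targets."""
    grid = [i * 0.25 for i in range(-8, 9)]  # -2.0, -1.75, ..., 2.0
    for w1, w2, bias in product(grid, repeat=3):
        outputs = [step(w1 * x1 + w2 * x2 + bias) for x1, x2 in INPUTS]
        if outputs == targets:
            return (w1, w2, bias)
    return None  # no dividing line found on this grid


def hidden_layer(x1, x2):
    """Hand-picked weights: hidden unit 1 acts like OR, hidden unit 2 like AND."""
    return step(x1 + x2 - 0.5), step(x1 + x2 - 1.5)


def xor_net(x1, x2):
    h1, h2 = hidden_layer(x1, x2)
    return step(h1 - h2 - 0.5)  # on if h1 is on and h2 is off


print("AND, single unit:", single_unit_solution(AND))
print("XOR, single unit:", single_unit_solution(XOR))
print("hidden activations:", [hidden_layer(*x) for x in INPUTS])
print("XOR, 2-2-1 network:", [xor_net(*x) for x in INPUTS])
```

The hidden layer maps the four input patterns onto only three points, (0, 0), (1, 0) and (1, 1), and in this new space a single line does separate the ones from the zeros. A trained network will generally find different weights, but the principle is the same.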
The astonishing fact now is this: if there is such a hidden layer, you can (mathematically) prove that a neural network is, in principle, capable of implementing essentially any input-output mapping, provided it has enough hidden units. Yes, essentially any mapping, similar to a computer, which also is, in principle, capable of implementing any computable function. In the case of a computer, of course, the problem is to find the right program (or, indeed, a programmer smart enough to write it), whereas for neural networks, the problem is to find the right weights configuration (through the learning algorithm). In the remainder of this practical you will explore some of the factors that have an impact on whether such a weights configuration is found or not.

First, you need to change the network architecture to a 3-layer feed-forward network. Give the new hidden layer 2 units and a bias. If you just run a training now, you will probably find that the network still does not learn. There are several reasons for this:

1. Because XOR is a considerably more difficult task, you might want to give the network a bit more time to learn it; try training for 20000 sweeps.

Note: Depending on how powerful your computer is, you might find that training has become slow. A simple way to improve training speed is to go to the training options window and untick the box next to “update display during training”. You will only see the results once training has finished, but you will get there much quicker. It might also help to increase the logging interval. If you set “log performance every n sweeps” to 10, you should still retain sufficient information about training performance while having decreased the amount of data in your simulation file by almost an order of magnitude.

2. The training options chosen might not be suitable for the given architecture and/or the task at hand. As a first guess you might try a learning rate of 0.1, no momentum, and a random presentation order (with replacement). Are we getting there?

3. When judging the network’s ability to learn a task, you should not rely on a single run. Even if you do not change any settings there are two things that can differ between runs: the initial weights configuration (if it is set to “random seed”) and the presentation order (if it is not “sequential”).

Question 19: Train the network 3 times with the same settings. Report all relevant settings with other than default values and, for each run, the average correctness performance (specify the criterion used) and the highest mean square error. (3)

Now, play around with different values for the learning rate (typically between 0.001 and 1.0) and the momentum (typically 0 to 0.9). At this stage it might be useful to memorize some of the shortcuts, e.g., pressing CTRL+T to train the network or CTRL+O to bring up the Training Options window.

Note: Two other features of OXlearn facilitate tasks like these. First, you can train several instances of the same network in one go. If you select Run -> Train Several Networks, OXlearn will train several instances of the network with the current set-up and include the run number in the autosave file (e.g. “myXOR(run3)_sw20000.mat”). The very last run will remain open under the original name. Second, the network comparison tool (Tools -> Compare Networks) allows you to select several simulation files on your hard drive and to compare performances. Note that this tool uses the currently loaded file (“current simulation”) as a reference in terms of the networks’ set-up. If you select files that deviate in important aspects from the current simulation (e.g., different number of input units, different log interval) you might get invalid comparisons or outright errors.

Question 20: What effects do you observe when experimenting with learning rate and momentum?
(Think in terms of speed of learning, stability of learning, quality of the performance after learning.) (3)

Question 21: Report the settings for a particularly successful version and comment on how you judged “success”. Please include relevant graphs of this network’s performance in your portfolio. (3)

Question 22: Does the error decrease with each training sweep? Why or why not? (Don’t forget about the smoothing option!) (2)

Question 23: Define a grouping vector (under Set-up -> Train Patterns) and comment on what the performance display(s) can tell you when the training sweeps are organised into groups. (2)

Question 24: Certainly, you will have observed some networks that failed to learn. Choose one specific simulation and speculate on the possible reasons for its failure. You should back up your speculations with concrete settings and your experience with other, successful runs. (2)

4. Generalisation in Neural Networks

In this practical you will explore the generalisation ability of connectionist networks with an extension of the XOR-net (in fact, this already is a minimal categorisation task). The first step is to determine the input and target patterns we want to present to the network. The task looks like this: you have four input nodes, each of which can take the values 1 (on, true, present) or 0 (off, false, not present). This gives a maximum of 16 combinations, displayed below (maybe you see the logic in determining all combinations?).

Table 2: 4 bit combinatorics

The network’s task is to distinguish the patterns with exactly 2 active input units (the “Duo” category) from all the rest. If the input pattern belongs to the Duo category the output node should be on (=1); in all other cases the output should be off (=0). However, as we want to investigate generalisation today, we will exclude two of the patterns (one that has exactly two ones and one that hasn’t; for example the 5th and the 6th pattern) from the training set.
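The logic behind the 16 combinations and the split into training and novel patterns can be sketched in Python. This is only an illustration: the variable names are hypothetical, the 5th and 6th patterns are held out simply because the text above suggests them as an example, and the sketch is not how the pattern file used in the practical was produced.

```python
from itertools import product

# All 16 combinations of four binary inputs, in the order 0000, 0001, ..., 1111.
patterns = list(product([0, 1], repeat=4))

# Target is 1 for patterns with exactly two active input units, otherwise 0.
targets = [1 if sum(p) == 2 else 0 for p in patterns]

# Hold out the 5th and 6th pattern (list indices 4 and 5) for the generalisation test.
held_out = {4, 5}
train_set = [(p, t) for i, (p, t) in enumerate(zip(patterns, targets)) if i not in held_out]
novel_set = [(p, t) for i, (p, t) in enumerate(zip(patterns, targets)) if i in held_out]

print(len(train_set), "training patterns")
for p, t in novel_set:
    print("novel:", p, "-> target", t)
```

Counting in binary from 0000 to 1111 yields every combination exactly once. Of the two held-out patterns, 0100 has a single active unit (target 0) and 0101 has two (target 1), so the novel set contains one member of each category.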
We will test later if the network is able to deduce the correct classification ‘rule’ (if two ones -> on; else -> off) from the training set and transfer it to the patterns it has not yet seen.

You can load the patterns described above from OXlearn\Simulations\practical4\DuoPatterns.mat. Next, rename the simulation (File -> Save As) and create a 4x3x1 net, that is, a network with 4 inputs, 3 hidden nodes and 1 output node. Also, make sure to have a bias node connected to all internal nodes.

    P1   P2   P3   P4   N5   N6   P7   P8   P9   P10  P11  P12  P13  P14  P15  P16
i1  0    0    0    0    0    0    0    0    1    1    1    1    1    1    1    1
i2  0    0    0    0    1    1    1    1    0    0    0    0    1    1    1    1
i3  0    0    1    1    0    0    1    1    0    0    1    1    0    0    1    1
i4  0    1    0    1    0    1    0    1    0    1    0    1    0    1    0    1

With the experience from the last practicals you should be able to find settings that do the job (this net is very similar to the XOR). If you need a hint: try with a momentum of around 0.3 and a learning rate of ~0.1 and you should find nets that converge within 10000 sweeps. Train a few networks to get an impression of the stability of your settings.

Let’s assume you have found a net that solves the task, i.e., it classifies the 14 training exemplars correctly. In order to judge this, as always, you need to inspect the verify performance. Today, however, we want to go further: we want to see how well the networks generalize. A network (or any kind of information processing system) shows good generalisation when it can transfer the lessons learned on the basis of a restricted set of examples to other exemplars that it has not dealt with before. Looking at the face of a perfect stranger and being able to tell its gender, for example, is an instance of generalisation; even if you have never seen this face before you will usually be able to transfer your knowledge about (or experience with) a large number of male and female faces to correctly classify this novel exemplar.

But how do we capture generalisation in a neural network model? Actually, this is quite simple.
A neural network acquires all its ‘knowledge’ from the experience with the exemplars (patterns) that are processed during training, resulting in a weights configuration which implements all this ‘knowledge’. All we have to do to test generalisation performance is to expose the trained network to novel patterns and assess how it fares. In contrast to a computer, we don’t have to fear a “syntax error” response with neural networks; the worst that can happen is an incorrect classification.

In OXlearn, this process is straightforward: we simply need to set up some Test Patterns which will be presented to the network (with ‘frozen weights’) after training has finished. For the purpose of our analyses it is more efficient if the Test Patterns do not only consist of the two novel patterns, but also include the 14 old ones, distinguished by informative labels. Actually, such a set of Test Patterns was included in the DuoPatterns.mat file, so you should be able to find and inspect it in your current simulation. To run the test, select Run -> Test Network.

Question 25: Include the output activations for all 16 test patterns here (label appropriately) and comment on them. Has it worked out? Did the net generalize to the two new patterns? How well? (2)

Repeat this a few times until you have found one net that generalises well and another one that performs worse on the novel patterns; they both should have low error scores and perform correctly on the original training set, though. Consider checking the autotest option (Set up -> Training options – more>>). Make sure to save the two simulations you have found.

Note: you need to rename your current simulation when you have found a promising candidate; OXlearn will overwrite older autosave files if the simulation name remains unchanged and the number of sweeps is equal.

Question 26: Include the output activation in response to the two new patterns for the two nets you have found.
(2)

Novel patterns      N0100   N0101
Desired output
Good gen net (G)
Bad gen net (B)

Let’s see if we can find out why one net generalizes better than the other. In order to do this we need to have a look at the distributed representations that have developed in our hidden layer of three units.

OXlearn provides two tools to analyse hidden layer activations: Cluster Plot and PCA, in the Tools menu. Both are visualisation tools, that is, they operate on whichever data you care to feed them (two-dimensional matrices where rows correspond to observations and columns represent variables). When analysing neural networks, each column usually corresponds to the activation of a specific node. It is often useful to think of each node/column as representing one dimension in a high-dimensional space, often termed ‘activation space’, with as many dimensions as there are nodes. A PCA finds (and shows) the most informative planes within that space, whereas the cluster plot collapses across all these dimensions by looking at (Euclidean) distances only. Since we are interested in the generalisation performance of the network, you should choose the hidden layer activations during testing (aka ‘OXtestHidden’) from the dropdown menu at the bottom of the displays.

To appreciate how useful such tools are, you should maybe have a direct look at the hidden layer activations first. To do so, go to the MATLAB main window and locate the workspace browser (usually on the upper left hand side, else type “workspace” in the command window). The workspace browser provides a direct look at the diverse variables that MATLAB is manipulating; everything starting with ‘OX’ has to do with OXlearn. Now you can do two things: either you type OXtestHidden in the command window, or you double click on OXtestHidden in the workspace browser.
Both methods will display the 16x3 matrix that we want to analyse, and you will see that it is quite difficult to deduce anything from it (and imagine a simulation with many more hidden layer nodes). Now let’s see what a cluster plot of these values looks like: open Tools -> Cluster Plot and select OXtestHidden in the options panel. This is where informative labels come in handy.

Question 27: For both your (B) and (G) network, provide an appropriately labelled cluster plot along with all settings that deviate from the default values. (2)

Question 28: Choose one of the plots and interpret it. Where do the new patterns end up? Does the plot show anything useful with respect to the network’s ability (or failure) to generalize? (2)

FoodForThought 3: What exactly do you think has gone wrong with the (B) net? Try to explain why good generalisation, in the current setting, is not guaranteed.

Because the ability to generalize is one of the most interesting aspects of connectionist networks, I have included a few more optional questions on this topic. If you are having a go at them, solutions can be found either by trying (basically repeating the things we have done today) or by thinking about how connectionist models solve their tasks (or reading about it; “overfitting” is a useful keyword). Or you can think about it first and then try to verify your thoughts with OXlearn. Here we go:

Imagine (or implement) the same task, but with a net that has many more (say 14) hidden units instead of three. What would change? More specifically:

FoodForThought 4: Would it be easier or harder for such a net to learn the 14 training exemplars? Why?

FoodForThought 5: Given the network performs well on the original patterns, do you think a net with many hidden units is more likely or less likely to generalize well? Why?

FoodForThought 6: What would happen if we were to train the net on only 8 patterns and then test for generalisation? And with 5 exemplars for training? Or 2?
And if we allow the input patterns to have 5 bits/units instead of 4? The more general question is this: How much information is necessary for successful generalisation?

5. Letter prediction in a recurrent network

In this practical you will implement an SRN (simple recurrent network) that learns to predict the next letter in a very simple artificial language. This model constitutes a simplified version of the one that learned to predict word boundaries (see lecture).

In the folder Simulations\practical6 you will find a file called ‘PredSRNinput.txt’. The content of this file is meant to represent a continuous speech signal (one phoneme/letter per time step/line) from an artificial language that only has three words. These three words, however, were collocated randomly to make up one long sequence.

Question 29: Give the three words that make up our artificial language. (1)

All right, now we need to import those letters and convert them into a set of numbers which can be presented to the network. We will use the following coding scheme:

b  d  g  a  i  u
1  1  1  0  0  0
1  0  0  1  0  0
0  1  0  0  1  0
0  0  1  0  0  1

Have a look at the coding. You should realize that a) this is a distributed coding (i.e., a letter may be represented by more than one active unit) and b) there is some structure to it: the first bit (input) codes whether the letter is a consonant (1) or a vowel (0), the next 3 bits encode the identity of the respective consonant or vowel. This coding thus constitutes a mixture of distributed and localist coding (overall amounting to distributed, as mentioned above).

Question 30: How many bits (input units) would be required for a fully localist coding? (1)

Question 31: Give an example for a distributed coding that uses 3 bits (inputs) only. (1)

But let’s stick to the coding scheme given above for the moment. First, you need to import the content of the file into your current simulation (File -> Import Selected).
You should assign the name ‘OXinputLabels’, either during import or afterwards, in the MATLAB workspace browser. Next, you have to use OXlearn’s translation facility (Tools -> Translate) to translate the sequence of letters into distributed numerical representations, using the coding scheme above (also in trnslTableBaDiiGuuu.txt). The translation source, evidently, is the name you have assigned to the sequence of letters. As translation target you should specify ‘OXinput’, so OXlearn will recognize the result of the translation (or, again, you can rename post hoc). As translation function, you choose OX_trslTlearnStyle and specify the text file trnslTableBaDiiGuuu.txt as coding scheme when prompted.

The network is a so-called ‘predictive SRN’, which is to say that its task consists of learning to predict the next element in a sequence. Given this information, you now should be able to create the target patterns (and the target labels). If you are having too many problems with the above, you can find the resulting patterns in practical6\BaDiiGuuPatterns.mat.

Once your train patterns are all set up, you need to define the network architecture. Choose an SRN with ~10 hidden units. You should be able to determine appropriate values for the other network settings on your own. In terms of training options, you might try ~0.1 as a learning rate and a momentum of 0.3. Also, you should make sure to present the patterns sequentially.

Question 32: Why is sequential presentation important in this simulation? (1)

Train the network for 5000 sweeps.

Question 33: What (approximately) is the average squared error at the end of training? Comment on this value: is it high or low? What does it tell us? (2)

You can also inspect the verify performance, but this goes through the whole corpus and thus looks rather messy. Because there are only three words in our artificial language, you could set up a test that presents each word only once.
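
For readers who want to see what the translation step and the predictive targets amount to, here is a minimal Python sketch (it does not use OXlearn or OX_trslTlearnStyle). The letter codes are the ones from the coding scheme given earlier in this section; the example sequence and the function names are hypothetical, and dropping the final letter (which has no successor) is an assumption about how inputs and targets are aligned.

```python
# Coding scheme from the handout: the first bit marks consonant (1) or
# vowel (0); the remaining three bits identify the letter within its class.
CODE = {
    "b": [1, 1, 0, 0],
    "d": [1, 0, 1, 0],
    "g": [1, 0, 0, 1],
    "a": [0, 1, 0, 0],
    "i": [0, 0, 1, 0],
    "u": [0, 0, 0, 1],
}


def encode(letters):
    """Translate a sequence of letters into a list of 4-bit patterns."""
    return [CODE[ch] for ch in letters]


def predictive_pairs(letters):
    """Pair each letter with its successor as the prediction target.

    Assumption: the last letter has no successor and is dropped as an input.
    """
    inputs = encode(letters[:-1])
    targets = encode(letters[1:])
    return inputs, targets


# Hypothetical example sequence, not taken from PredSRNinput.txt.
sequence = list("bagudi")
inputs, targets = predictive_pairs(sequence)

for letter, inp, tgt in zip(sequence, inputs, targets):
    print(letter, inp, "->", tgt)
```

The point of the sketch is only that the target at each time step is the input of the following time step, which is why the order of the patterns matters.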
Create the necessary test patterns (+ labels), run the test and inspect the results.

Question 34: When looking at the overall errors for your test patterns, what do you find (general picture in terms of the performance for the prediction of each letter)? (1)

Question 35: Have a more detailed look at the individual test pattern activations, specifically at the error of the first unit as compared to the error of the other three units. What do you find (recall the coding)? (2)

Question 36: Even more specifically, compare the outputs for the prediction of the three consecutive u’s. You will notice that the last one produces the highest error. Why is that? (1)

Question 37: Why is there no need to ‘start small’ in this specific model? (1)

FoodForThought 7: Strictly speaking, having a test set that includes only the three words was suboptimal. Can you see what the problem is? Can you come up with a cleaner solution to testing an SRN?

FoodForThought 8: Create a Cluster Plot/PCA for this test and see if you find clues about the internal representations the SRN has developed (specifically about the interaction between type and position of an input pattern). Can you deduce information on how the network solves its task?

6. Mini project with OXlearn

Recognizing LED digits (10)

In this mini project you have to implement the task of recognising the LED digits 1-9 in OXlearn. Your task is straightforward: find a connectionist model that is able to recognise all the possible 9 digits from an LED display. You can use whichever set-up you like, as long as it works. The minimal requirements should be 7 input nodes (A-G, see picture on the right) and 9 output nodes (numbers 1-9, localist coding), but you are welcome to find a better solution. An input node is on (value of 1) when the corresponding diode is lit, off (value of 0) otherwise. For the digit ‘2’, for example, all inputs except A and G should be on.

[Figure: LED display with segments labelled A to G]

The aim is to find a working implementation.
If you are ambitious, try to find the most efficient net, i.e. one that reaches the lowest overall error with the smallest amount of training. Again, you may work in groups of up to three, but report who you have worked with.

When reporting your project in the portfolio, please include:
- All relevant set-up parameter values
- The display showing the final weights configuration
- The train performance display including Error and %Correct performance
- The numerical activations in the output layer for each of the possible input patterns when tested after training
- A short paragraph summarizing the results
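
The input and target coding for the mini project can be sketched in Python as follows. This is only an illustration of the minimal 7-input, 9-output set-up; the helper names are invented here, the order A to G of the input nodes is an assumption, and only the pattern for the digit ‘2’ is filled in because it is the only one spelled out in the text.

```python
# Assumed order of the seven input nodes (one per LED segment).
SEGMENTS = ["A", "B", "C", "D", "E", "F", "G"]
# The nine digits to be recognised, one output node each (localist coding).
DIGITS = list(range(1, 10))


def segments_to_input(lit_segments):
    """Return the 7-element input pattern: 1 if a segment is lit, else 0."""
    return [1 if s in lit_segments else 0 for s in SEGMENTS]


def localist_target(digit):
    """Return the 9-element target with exactly one active output node."""
    return [1 if d == digit else 0 for d in DIGITS]


# Digit '2' as described in the handout: all inputs except A and G are on.
two_input = segments_to_input(set(SEGMENTS) - {"A", "G"})
two_target = localist_target(2)

print("input :", two_input)
print("target:", two_target)
```

The remaining eight input patterns have to be read off the LED picture in the same way.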