A computational model of learned behavioural sequences in prefrontal cortex

Edwin van der Ham

July 20, 2007

Contents

1 Introduction
2 Literature review
  2.1 The function of prefrontal cortex
    2.1.1 The evolution of the human brain
    2.1.2 Behavioural evidence for the importance of the PFC
    2.1.3 Why stimulus-response behaviour is insufficient
    2.1.4 Towards a theory of PFC behaviour
  2.2 Prediction and reward
  2.3 The role of dopamine in reward prediction
  2.4 Dopamine mediated learning
  2.5 The temporal difference algorithm
  2.6 A theory of cognitive control
3 Design of a simple stimulus-response network
  3.1 Architecture
  3.2 Learning stimulus-response behaviour
  3.3 Performance of the network
  3.4 A fully connected version of the network
  3.5 Limitations of the current network
4 A model of behavioural sequences
  4.1 Why do I want to learn behavioural sequences?
  4.2 Existing models of sequences
  4.3 A neural model for learning sequences
    4.3.1 A first impression of the final network
    4.3.2 Extending the stimulus-response network
    4.3.3 Sequence learning without PFC
  4.4 A final model including a PFC layer
    4.4.1 The design of a PFC layer
    4.4.2 Learning inside PFC
5 Implementation and evaluation of a neural network including PFC layer
  5.1 Implementation
  5.2 The performance of the final model
6 Conclusions and further work
  6.1 Discussion
  6.2 Conclusions
  6.3 Recommendations
  6.4 Future work
A Package network
  A.1 Classes
    A.1.1 Class Connection
    A.1.2 Class Layer
    A.1.3 Class Neuron
Chapter 1 Introduction

There is an ongoing scientific effort to understand the working of the human brain. Starting with the psychological method of introspection, the subject has fascinated us for many years now. Unfortunately it proves hard to study the one thing we use for thinking in the first place. In more recent years, scientists have begun to gain a better understanding of the processes that take place in the brain. MRI scanning devices allow us to see the complex network of neurons and other brain structures which ultimately provides us with the ability to think and act in a sensible way. This network is so intricate that to date it has proven impossible to comprehend at the level of a single cell how the physical properties of (various parts of) the brain determine the way we think and act. A lot of today's brain knowledge is gathered from experiments with groups of patients having a certain disability or damage to a specific part of the brain. These experiments have led to a basic understanding of the functional responsibilities of different parts of the brain.

More recently, the field of artificial intelligence (AI) arose within computer science. The ultimate goal in AI is to create a machine that mimics human behaviour so flawlessly that the average person cannot tell the difference from a behavioural perspective. Initially, techniques devised in other areas of computer science were used to try to accomplish this goal. However, a major problem with traditional computer models is that they can only do exactly what they were designed to do. Unlike a human, a traditional computer system does not perform very well in a novel situation. With the progress made in neurophysiological brain research, an improved understanding of the physical properties of brain processes arose. As a result, an effort started to simulate those processes in an artificial computer model. Such a model is expected to be much more flexible than the more traditional approach.

The human brain is an extremely complex system containing a large number of small building blocks called neurons. A neuron is a small cell in the nervous system that can generate a tiny electrical current. Apart from generating an electrical pulse it is also sensitive to pulses generated by adjacent neurons. The excitation of a neuron can raise its membrane potential until it fires an electrical pulse of its own. A group of neurons communicates by means of these electrical pulses, thereby creating communication pathways through the brain. Based on neurons found in the brain of humans and other animals, computer scientists have created an artificial counterpart to the biological neuron. An artificial neural network provides a new and biologically founded method for modelling all kinds of processes, including those that take place in the brain.

A large portion of the brain is concerned with processing information received through the body's sensory system. Information from various sources is processed and combined to finally select an action to take. But performing a single action is not enough. In order to reach a desired goal, often a whole series of actions needs to be performed. A part of the brain believed to have a coordinating influence on the stimulus-response system is the prefrontal cortex (PFC).
Damage to the PFC strongly limits our ability to consciously control behaviour. In order to understand how it is that the brain can guide our behaviour such that novel tasks can be completed successfully, we need to understand how the PFC influences the stimulus-response pathways responsible for the selection of actions to take. Artificial neural networks have been devised that are capable of learning a mapping between input and output without being programmed specifically for the task at hand. Such a network can be seen as a very high-level abstraction of the stimulus-response pathways running through the brain. However, networks like this are usually only capable of learning single input-output combinations. It proves much harder to learn sequences of actions.

My hypothesis is that a computational model of the PFC can provide a mechanism for guiding a stimulus-response network towards desired behaviour, the same way that the PFC in the brain coordinates and orchestrates behaviour. With the guiding influence of the PFC model, an existing artificial neural network, currently incapable of learning behavioural sequences, could be enabled to flexibly select a sequence of actions. Since the objective is to gain a better understanding of the human brain, biological plausibility is a major concern. The main question I will answer in this report is therefore: How can a biologically plausible model of PFC be developed to guide a neural network towards learning behavioural sequences?

The first goal of my research is to develop a computational model of both the PFC and the stimulus-response pathways in the brain. By itself, the latter model should be able to learn simple input-output combinations. More complicated behaviour such as performing a sequence of actions is much harder to learn and requires the guiding influence of the PFC model. Without the PFC, the stimulus-response model is expected to perform poorly and prove incapable of learning a sequence of actions. The objective is for both models to be consistent with results from brain studies, i.e. the characteristics of the computational model must correspond to the general characteristics of the brain structures involved.

Without a basic understanding of brain functioning it becomes difficult to establish the biological plausibility of an artificial network. Therefore, the first step in my research comprises a literature study on the neurophysiological properties of the human brain. In this study, I will give special attention to the PFC and its effects on the ability to learn behaviour. I will also look at existing computational models of human behaviour, especially those that exhibit the functional properties assigned to PFC. The next step in my research will be to create and implement a very simple, highly abstracted model of the stimulus-response pathways in the brain that allow us to react to sensory input. This model will be capable of learning only basic stimulus-response behaviour. The final and most challenging task will be to implement a biologically plausible, computational model of the PFC. Connecting the PFC model with the stimulus-response network should enable it to learn behavioural sequences as well.

This report starts with a review of a number of relevant neuroscientific topics in chapter 2. In chapter 3, I will build up the neural model used for learning stimulus-response behaviour. The results of a performance test on this model are also presented here. In chapter 4, I will extend the existing model to include a PFC component.
The PFC can be switched off to simulate what happens when the PFC no longer functions. Chapter 5 reports on the results of a number of different tests I created to show the difference between the two situations. Finally, in chapter 6, I will present my conclusions and the further work that needs to be carried out.

Chapter 2 Literature review

To understand how a model of prefrontal cortex can contribute to an understanding of human behaviour, I will discuss some relevant literature on the subject of neuroscience. The most influential writer for my work is Jonathan D. Cohen. Together with Todd Braver he introduced a model of cognitive control [6] incorporating the PFC and dopaminergic brain systems. In section 2.1 I start by looking at the crucial role of the PFC in human behaviour. Section 2.2 explains why reward prediction is particularly important. In the human brain, dopaminergic (DA) neurons seem to encode a prediction of reward, as I will explain in section 2.3. Sections 2.4 and 2.5 discuss how the DA signal can be used to mediate learning. In section 2.6 the neurophysiological findings are put together to form an integrated theory of cognitive control.

2.1 The function of prefrontal cortex

Miller and Cohen pose the question of how coordinated, purposeful behaviour can arise from the distributed activity of billions of neurons in the brain [20]. This question has been around since neuroscience first began and will probably not be fully answered in the near future. But there is certainly progress in our understanding of the functions that different parts of the brain exhibit. A fairly simple and low-level way of describing animal behaviour is by stimulus-response mapping. In such a model any particular set of stimuli leads to a predetermined output. For simple animals with a relatively low number of neurons, such a model seems quite capable of describing and predicting behaviour. But larger animals, including primates such as ourselves, show much more complex behaviour that cannot be accounted for by simple stimulus-response mappings. The property that distinguishes humans most from other animals is the ability to control their behaviour in such a way as to accomplish a higher-level goal. It is widely recognised that the PFC plays a very important role in this. To understand why, in section 2.1.1 I will first have a look at how the human brain has evolved as a result of evolutionary processes. Section 2.1.2 provides behavioural evidence for the importance of the PFC. In section 2.1.3 I present a simple model of stimulus-response behaviour which explains why the PFC is necessary to control behaviour. Finally, in section 2.1.4 I assess the plausibility of an extended model of cognitive control including the PFC.

2.1.1 The evolution of the human brain

In 1970, neurologist Paul MacLean proposed a model of the human brain that he called the 'triune brain' [19]. According to this model, the human brain can be divided into three main components, as figure 2.1 shows. The oldest part of the brain from an evolutionary perspective is the brain stem. Located deep inside the skull together with the cerebellum, it is also called the reptilian brain. This part of a human brain is similar to the brain of a reptile and is responsible for vital body functions such as heart rate, breathing, body temperature and balance. The first mammals developed a new structure on top of the brain stem, called the limbic brain. This part includes areas such as the hippocampus, the amygdala, and the hypothalamus.
It is mainly concerned with emotion and instincts. The newest part of the brain, also found in mammals, is the neocortex. The neocortex is traditionally thought of as responsible for higher cognitive functions and is only found in higher mammals.

Figure 2.1: The triune brain (from http://www.ascd.org/)

In most animals it comprises only a small part of the total brain area. Primates, however, have a much bigger neocortical area. In the human brain, it even takes up to two thirds of the total brain mass. Human skills such as language and reasoning rely heavily on the neocortex. The cortex can be functionally subdivided into a number of regions, as shown in figure 2.2. Especially the prefrontal cortex, an area in the anterior part of frontal cortex, is much more complex in humans than it is in any other mammal. Recent studies have found that although humans have a larger frontal area than any other mammal, its relative size is not larger than that of our closest relatives, the great apes [37]. Yet, in terms of complexity and interconnectivity with other areas, the human frontal cortex is superior.

2.1.2 Behavioural evidence for the importance of the PFC

There are various sources of behavioural evidence for the crucial role of the PFC in higher-level human behaviour. A well-known task in which top-down control over behaviour is necessary in order to achieve a good result is the Stroop task [38]. In this task, subjects are presented with words that are the names of colours (red, blue, green etc.). All the words are written in a colour that does not necessarily correspond to the correct interpretation of the word. Figure 2.3 shows a few typical examples of such words. The task of the subject is to name the colour in which the word is written and ignore the word itself. Our automatic response is to read and interpret any word we see. Attending to the colour in which it is written requires our brain to suppress the automated stimulus-response behaviour that will lead to simply reading the word. Although it requires extra attentional effort, most people will be able to do this task without many mistakes. Patients with damage to the PFC area are known to have much more difficulty with (variations of) this task [28] [40].

A similar task where frontally impaired people show very low scores is the Wisconsin Card Sorting Task (WCST) [5]. In this task, subjects are shown cards with symbols on them that vary either in number, shape or colour. These cards must be sorted according to one of those three dimensions. This by itself is not so difficult, but the rule by which to sort the cards changes periodically, meaning that the currently pursued goal has to be abandoned and a new one adopted. Humans with PFC damage are quite capable of applying the initial rule, regardless of the selected dimension. However, when they have to change to a new rule they are unable to do so and usually continue sorting according to the initially learned mapping [23].

Figure 2.2: Functional subdivision of the cortex (from http://universe-review.ca/)

Figure 2.3: Example of the Stroop task

2.1.3 Why stimulus-response behaviour is insufficient

Experiments such as the Stroop task and the WCST show that the PFC is particularly important when there is more than one possible response to a given set of stimuli. For example, think of the everyday activity of crossing a street.
If you were born in a western European country such as Holland, your automatic response will be to first look to the left and then to the right to see if there are any cars coming towards you. This is something that does not require any attention; it is a completely automated behaviour. Figure 2.4 gives a schematic impression of the neural brain pathways in the situation where standing on the kerb of a road (stimulus S1) leads to looking left (response R1). In this figure, stimulus S2 represents a set of sensory stimuli associated with being in Holland. These might be street name plates written in Dutch or people on the streets speaking Dutch. The thick line between S1 and R1 represents the strong tendency to look left before crossing a street. The red lines indicate the currently active pathway from stimulus to response.

Figure 2.4: The stimulus-response pathway for crossing the street in Holland (from [20])

But first looking to the left is not appropriate when you find yourself in a country such as New Zealand, where the cars drive on the left-hand side of the road instead. In this country we need to change our behaviour to first look to the right. Stimulus S3 represents a set of stimuli associated with being in New Zealand (note that S2 and S3 cannot occur at the same time). Figure 2.5 shows what happens when you go on a vacation to New Zealand. Initially, there is still a strong tendency to look left before crossing the street because this is a highly automated behaviour. New Zealand associated stimuli are not influential enough to make you look right (response R2) instead.

Figure 2.5: Crossing the street during your vacation to New Zealand

However, after a few days you may find that you do not start by looking left any more. One possible way of explaining this is that the strong connection between S1 and R1 is unlearnt and a connection from S1 to R2 is learned instead. There are two reasons why this is not very likely. First, an automated behaviour is something that takes a considerable amount of time to develop. It would be a whole lot easier to learn to drive if this were not true. Second, when you return from your vacation you can very quickly re-establish the habit of looking left. It appears that both pathways stay intact and that a separate control mechanism is able to switch between the two pathways.

A different explanation of how the learning has taken place is through the use of the PFC. Figure 2.6 shows the situation in which you have learned to change your behaviour during your visit to New Zealand. Connections within PFC have formed that bias the stimulus-response pathways to select the appropriate behaviour. When you return to Holland, you will have no problems crossing the street, because the automatic stimulus-response behaviour of looking left still exists. On top of this, the PFC helps to select the right pathways too, as shown in figure 2.7. Interestingly, when you go back to New Zealand a few months later, you may find it much easier to revert to the correct response of looking right. You only have to turn on the correct representation in PFC to obtain the right behaviour.

Figure 2.6: PFC mediating the flow of activity in order to get the correct response

2.1.4 Towards a theory of PFC behaviour

Such an abstract picture of stimulus-response pathways might explain the basic idea, but how exactly does this work in our brain?
Figure 2.7: PFC still mediating the flow of activity back home

A fundamental principle of neural processing is that it is competitive. A pattern of activation over numerous input dimensions activates different pathways that all compete for expression in behaviour. The pathways with the strongest support will win the competition and exert control over areas that contribute to external behaviour. Miller and Cohen [20] have developed a theory that extends the notion of biased competition to a control mechanism for PFC. To assess the plausibility of their model, they define a minimal set of functional properties that a system must exhibit if it is to serve as a mechanism of cognitive control. Neural findings suggest that the PFC conforms to all of those properties. I will give a short summary of the most important properties and the supporting evidence.

Firstly, a system capable of controlling behaviour must be able to access and influence a wide range of information in other brain regions. As I mentioned in section 2.1.1, the PFC is the newest area in the brain from an anatomical perspective. During its evolution it formed direct connections to almost every other cortical area. This way it can receive input from and exert control over virtually all sensory and motor systems. Using tracing techniques in the brain of rhesus monkeys, researchers have found numerous connections from a wide range of sensory systems into the frontal cortex as well as connections from the PFC to premotor systems. For example, the parietal lobe, processing somatosensory information, has projections to the frontal cortex [29]. The superior temporal sulcus, involved in integrating somatosensory, auditory and visual information [36], connects with the frontal lobe [35]. Most of the connections do not come from the primary sensory cortices but only from the secondary parts, so the information flowing into the frontal cortex is not raw sensory information. Information is delivered to different prefrontal areas which are in turn linked to motor control structures related to their specific function [4]. Again, most of the connections link to higher-level control structures [12]. On top of this wide range of connections to other brain areas, there is an extensive set of connections that link one PFC area to another [3], suggesting that different areas within PFC are capable of sharing and intermixing information. This would be required of a system capable of producing complex coordinated behaviour.

Secondly, if the PFC is responsible for selecting plans and goals, the neural activity pattern observed should reflect the current plan and stay roughly the same as long as this plan is used. Coming back to my previous example, this means that a similar pattern of PFC activity should occur every time you want to cross a street. This same pattern is visible when crossing a street during your vacation to New Zealand. After a few days a different pattern might have developed, but when you go back to Holland the initial pattern immediately reoccurs. Asaad et al. [1] performed experiments in which a monkey learns to associate a visually presented cue with a saccade to the left or right. About 60% of the recorded neurons showed activity that depended on both the cue and the saccade direction. But the activity of only 16% of those neurons could be explained by a straightforward linear addition of the cue and saccade activation.
This gives evidence that it is not merely input-output associations that are represented, but more complex patterns representing a particular plan. In another experiment [41], a monkey viewed a video screen on which four light spots were visible: right, left, up and down from the centre. On the slight dimming of one of the four spots, the monkey had to foveate to that spot. Before the dimming, a visual cue appeared according to either a spatial or a conditional rule. The conditional cue involved a letter associated with one of the four spots. In the spatial condition, the location of the letter on the screen indicated the light spot and the identity of the letter was unimportant. During the task, the activity of a sample of 311 prefrontal neurons was measured, of which 221 neurons showed task-related activity. Between 33 and 50 percent of the task-related neurons showed statistically significant differences that could be attributed to the rule the monkey was using. This result gives a good indication that there is a significant number of neurons in PFC that encode the rules that are necessary for the task at hand.

Thirdly, the activity of PFC neurons must be resistant to wholesale updating by immediate sensory stimuli. Being able to represent plans and goals is a good start, but it is not very useful if they are updated at every opportunity. For example, say there is an ice cream shop on your way to work. If you changed your plan from going to work to having an ice cream every time you passed the shop, you would never arrive at work. The current plan in memory needs to be protected against interference from other plans until the goal is actually reached; otherwise chaotic, unordered behaviour would prevail. On the other hand, assume you did actually arrive at work and adopted the plan to carry out some work. Then you suddenly notice that the office is on fire. This time, you definitely want to abandon your current plan and make a run for it instead of first trying to reach your current goal of getting your work done. So although the plans and goals in PFC must be protected against distractions, there must be the flexibility to update them when necessary.

Fuster [11] was one of the first researchers to show that neurons in prefrontal cortex show sustained activity during the delay period in a delayed response test. In a delayed response test, a cue-response pairing is learned and a delay is introduced between the cue onset and the time of the response that could subsequently lead to reward. Monkeys showed a performance of nearly 100% on a delayed response test when the delay was somewhere between 15 and 30 seconds. This shows that they are able to keep a plan in memory for some time after developing it. Miller and Desimone carried out a delayed matching to sample task in which a stimulus is presented to a monkey that must be matched to forthcoming stimuli [21]. In order to get a reward, a response was required on the first matching stimulus, meaning that distracting intervening stimuli had to be ignored. They found that half of the recorded cells in PFC showed selectivity for whether the sample matched the test stimulus. Furthermore, the activity of these neurons was sustained throughout the trial.

It seems that there is enough evidence suggesting that the PFC is capable of playing an important role in controlling behaviour. But if PFC controls behaviour, who or what controls the PFC?
For any control theory to be successful, the controller must be able to learn by itself without having to rely on a hidden 'homunculus' to explain its behaviour. The remaining question therefore is how the PFC 'knows' when to update its representations in order to change the current plan or goal. Miller and Cohen suggest that a mechanism of prediction and reward can be used to model its behaviour. To understand how this works, I first take a few steps back and explain some of the fundamental ideas about prediction and reward.

2.2 Prediction and reward

Being able to predict future events has been a critical factor in the development and survival of animals throughout history. If a creature is unable to find food or escape predators it has a very small chance of survival. It is clear that random behaviour is not the best way of getting around. In animals, behaviour is generally guided by something that can be referred to as reward. Reward is a concept for the intrinsic positive value of an object, a behavioural act or an internal physical state. It represents something that is generally good or satisfactory. Being able to predict future reward is therefore a very valuable skill that all animals must possess in order to survive in the world.

There is a clear connection between prediction and reward. This is shown in a wide variety of conditioning experiments in which arbitrary stimuli with no intrinsic reward value become associated with rewarding objects. This effect was first described by Ivan Pavlov [27] and is known as Pavlovian or classical conditioning. Pavlov drew on the fact that dogs start producing saliva whenever they see food. The food is called an unconditioned stimulus (US) because no conditioning has taken place to associate this stimulus with a particular response, in this case salivation. Pavlov predicted that if a particular stimulus were present whenever the dog was presented with the food, this stimulus would become associated with the food and therefore trigger the dog to produce saliva. I will refer to the stimulus that will be used to condition a response as the neutral stimulus (NS). After conditioning, this (arbitrary) stimulus is called a conditioned stimulus (CS) because it has no natural association with reward but it has proven to reliably predict reward under certain conditions. In this situation a CS-US pair has developed, meaning that whenever the CS is present, any natural response associated with the US will also be triggered by the CS, which comes on earlier in time.

Some theories suggest that this learning process is triggered by the initial unpredictability of the reward by the NS. A very influential model is the Rescorla-Wagner model of classical conditioning [30]. This theory says that learning takes place whenever there exists a discrepancy between the expectation about what would happen and what actually happens. Before a CS-US pair has developed, no prediction of reward is associated with the NS. When after a delay a US comes on and reward is delivered, there is a discrepancy between the prediction made by the stimulus (no reward) and the actual reward. What this means is that the stimulus might actually predict a future reward. Therefore, a small amount of predictive power is associated with the stimulus. After repeated presentations of the NS with subsequent reward, it becomes conditioned and a CS-US pair develops. The CS now fully predicts the forthcoming reward and it also triggers the natural response to the associated US.
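To make the idea of learning from a discrepancy concrete, the following sketch applies the standard textbook form of the Rescorla-Wagner update to a single stimulus over repeated trials. It is purely illustrative: the code, variable names and parameter values are my own and are not taken from the model implemented later in this report.

```java
// Illustrative sketch of the standard Rescorla-Wagner update, not code from this thesis.
// Names and parameter values are invented for the example.
public class RescorlaWagnerDemo {
    public static void main(String[] args) {
        double v = 0.0;            // associative strength of the NS/CS (starts at zero)
        double lambda = 1.0;       // reward actually delivered on each trial
        double learningRate = 0.2; // combined salience/learning-rate parameter

        for (int trial = 1; trial <= 10; trial++) {
            double prediction = v;                    // reward predicted by the stimulus
            double discrepancy = lambda - prediction; // "what happens" minus "what was expected"
            v += learningRate * discrepancy;          // learn only in proportion to the surprise
            System.out.printf("trial %2d: prediction %.3f%n", trial, v);
        }
        // The discrepancy shrinks on every trial, so v approaches lambda:
        // the stimulus gradually becomes a CS that fully predicts the reward.
    }
}
```

With a compound of two cues the same rule uses the summed prediction of all cues that are present, which is exactly what produces the blocking effect discussed next.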
An interesting property of this model is its ability to explain the behavioural phenomenon of 'blocking'. This is demonstrated by an experiment in which a rat learns that food will be delivered whenever a light comes on [16]. When an extra cue, in this case an additional sound, is presented together with the light, this second NS will not become a CS. Apparently, the additional sound does not add any predictive information about the forthcoming reward. The Rescorla-Wagner model successfully predicts this behaviour because there is no discrepancy between the predicted reward at the time the sound is presented (because of the light, there will be a reward anyway) and the actually delivered reward.

Despite its success in predicting previously unexplained behaviour, there are still some phenomena that are predicted incorrectly by the Rescorla-Wagner model [22]. Second-order conditioning, for example, is something this model does not account for. Assume the situation we had before where a reward is delivered whenever a light comes on. Next, we use the light as a US to pair with the sound. However, this time the light is not followed by a subsequent reward. The model predicts that a negative association between the sound and the light should develop, because of the absence of the expected reward associated with the light. However, in an experimental situation, a positive association between the sound and the reward usually develops. This is just one example of a number of phenomena the Rescorla-Wagner model fails to explain. Nevertheless, it is still widely used because of its computational simplicity.

Classical conditioning experiments prove the existence of some kind of reward prediction system in the brain. However, they do not tell us about the nature of this system. It is assumed that a neurotransmitter called dopamine is closely associated with the reward prediction mechanism. In section 2.3, I look at the supposed function of this neurotransmitter in the brain and why it is thought to be involved in reward prediction.

2.3 The role of dopamine in reward prediction

Dopamine (DA) is a chemical that is naturally produced in the human brain. It functions as a neurotransmitter, meaning that it can activate particular pathways of neurons, also referred to as the dopaminergic system. Neurons sensitive to DA that make up the dopaminergic system are sometimes called DA neurons. The system influences parts of the brain involved in motivation and goal-directed behaviour. Evidence for the involvement of DA in the PFC was found in a self-stimulation experiment on rats [25]. Dopamine levels significantly increased when the rats pushed a lever to obtain an electrical pulse delivered to the medial prefrontal cortex. Later it was discovered that DA neurons specifically respond to rewarding events such as the delivery of food, as well as to conditioned stimuli associated with a future reward [32]. DA neurons respond to a wide range of somatosensory, visual and auditory stimuli. They do not seem to discriminate according to the nature of the sensory cue, but merely distinguish between rewarding and non-rewarding cues [33].

Another indication for the involvement of DA in reward systems comes from research on drugs like amphetamine and cocaine. Those and other stimulating drugs (ab)used by humans were found to have a positive effect on the dopamine concentrations in the mesolimbic system of rats [9].
Since DA neurons naturally respond to events that predict reward, the intake of the drug signals an enormous upcoming reward to the body. Unfortunately for the addict, this reward fails to occur, which makes the subject feel miserable after the influence of the drug has ceased. Normally, because of this discrepancy between predicted reward and delivered reward, the brain would make sure that the next time the predictive cue comes on, no DA is released. However, because of the direct influence of the drugs on the dopamine system, the body is forced to release more dopamine, again incorrectly signalling reward. This explains the addictive effect those drugs have on the human body.

Various experiments have been conducted to find out how DA neurons respond in different situations. As was expected, they respond both to unexpected rewards and to predictive cues. According to the theory of classical conditioning, learning takes place in the face of prediction and reward. Before any cue-response pairings have developed, DA neurons mostly respond to the (unexpected) reward. After successful training with a particular cue-response pair, the neurons come to respond to the cue more than to the now predicted reward [18]. These results were replicated in an experiment on monkeys, performed to find out how the activity of dopamine neurons changed during a delayed response learning task [33]. During learning, 25% of the recorded neurons responded to the delivery of a liquid reward. After learning, only 9% of the neurons were activated by reward. In a later experiment, it was shown that this dopamine response is related to the temporal unpredictability of an upcoming reward [24].

Figure 2.8 depicts the typical activity of DA neurons in the presence of an unexpected reward. Just after reward delivery, DA neurons respond to this unexpected event by increased activity for a short period of time.

Figure 2.8: Dopamine activity upon delivery of an unexpected reward

When a neutral stimulus is repeatedly presented at a fixed time before reward delivery, it will become a predictor of reward according to the theory of classical conditioning. Figure 2.9 shows what happens. While dopamine neurons initially responded to the reward, they have now come to respond to the CS instead. This will only happen when the reward is consistently delivered at the same time after cue onset. After training, if a reward is delivered earlier than expected, dopamine neurons do respond to this unexpected reward. When an expected reward is not delivered at all, a decrease in DA activity occurs at the time the reward was expected to occur. This is shown in figure 2.10, where a CS is not followed by reward. There is an increase in activity just after CS onset, but at the expected time of reward delivery activity drops below baseline level.

According to the Rescorla-Wagner model of classical conditioning, learning takes place whenever there exists a discrepancy between the expectation about what would happen and what actually happens. Dopamine neurons seem to provide information about exactly this discrepancy. Given the involvement of DA in the PFC, it is highly likely that it serves to enable learning in the PFC. In section 2.4 I will explain the neural mechanisms by which connections between cortical cells are strengthened or weakened in the presence of DA. Another question left unanswered is how DA neurons are able to learn the correct timing.
There is, however, a well-established algorithm by which artificial systems can learn to predict reward, called the temporal difference algorithm. In section 2.5 I will explain how this algorithm works. There is a striking similarity between the temporal difference given by the algorithm at any time and the timing of midbrain dopamine neuron firing [34]. It has therefore been suggested that this algorithm could be used for a biologically plausible model of DA activity. In section 2.6 I present the model in which Braver and Cohen use the TD algorithm as a learning signal for a simple delayed response learning system.

Figure 2.9: Dopamine responding to a CS in the case of an expected reward

Figure 2.10: Failure of occurrence of an expected reward

2.4 Dopamine mediated learning

In the previous sections I explained how the presence of dopamine in the brain aids in learning neural connections in the cortical brain areas. In order to create a model of learning behaviour, a more detailed analysis of the learning process is required. Is the presence of dopamine required and if so, how does it bring about long-lasting changes to the neural network responsible for behaviour?

The ability of a connection between two neurons to change in strength is called synaptic plasticity. In general, two kinds of changes can happen: the connection can either be strengthened or weakened. The former is called long-term potentiation (LTP), the latter long-term depression (LTD). This mechanism of plasticity is believed to underlie behavioural learning as well as the formation of memories. The idea of synaptic plasticity was described long before its existence in the brain was proven. In 1949, Donald Hebb developed a theory describing synaptic plasticity [15]. The theory is based on the general idea that two cells that repeatedly fire at the same time are related in some way. Whenever one of those cells becomes activated, the other cell will tend to become activated as well. To accommodate this, there must be a strong connection between the two. During learning, this means that cells that happen to fire together are likely to be part of the same global activation pattern. If this pattern belongs to a desired or good situation, it is fruitful to make this pattern more likely to occur in a future situation. This can be achieved by increasing the strength of the connections between all cells that are part of this desired pattern. Effectively, the theory gives a computational account of the neural process of LTP. A similar situation exists for LTD: if two cells repeatedly show decreased activity together, the strength of the connection between them is decreased.

A number of researchers have suggested DA to be involved in establishing synaptic plasticity, i.e. the ability of the connection, or synapse, between two neurons to change in strength. For example, Calabresi et al. [7] showed that the induction of striatal LTD is blocked by the application of a DA antagonist. In other words, the application of a chemical suppressing DA action in the striatum, located in the basal ganglia which are part of the central nervous system, partly disables the process of weakening of neural connections over time. By the application of DA, the process of LTD could be restored. The results were found in an experiment on very small slices of rat brain striatum, submerged in a solution containing the DA antagonist.
But, because of the general application of dopamine to a whole slice of brain tissue, other effects of dopamine cannot be excluded. In 1996 Wickens et al. [42] investigated the effects of a directed pulsatile application of dopamine. In addition, the timing of the dopamine application was set to coincide with experimentally induced presynaptic and postsynaptic activity of the neurons involved. Both LTP and LTD could be induced by using the correct timing of dopamine application. It seems that the same timing requirements that apply to reinforcement learning also apply to its neural correlate. As I suggested before, the DA timing aspect can be modelled by the temporal difference algorithm that I will discuss in section 2.5. Assuming the correct temporal behaviour, the level of DA activation can be used as a learning parameter for performing LTP and LTD. Consequently, learning, and thus change in synaptic strength, only takes place in the presence of dopamine.

2.5 The temporal difference algorithm

In section 2.1 I posed the question of how the PFC can learn to update represented plans appropriately without the need for a separate control mechanism to explain its behaviour. Sections 2.2 and 2.3 suggested the midbrain dopaminergic system to be involved. The initial question has been answered insofar as the firing rate of DA neurons can be used to learn the correct behaviour. But how exactly do DA neurons come to respond to the earliest predictor of reward? An influential idea based on animal learning theories was introduced by Richard Sutton [39]. He created a computational procedure for the prediction of reward using a temporal difference algorithm. This algorithm can be successfully used to model the DA response required for PFC updates. In this section, I explain how the temporal difference algorithm can give a reliable prediction about future events.

The basic idea behind the temporal difference algorithm is fairly easy to understand. At any point in time, it tries to make an estimate of all expected future reward. This would be easy if one could look ahead in time to observe all future events and associated rewards. Unfortunately, the future is highly dependent on your own future actions, and even if those were all predetermined, a completely deterministic environment would be required to reliably predict future reward. It is clear that looking into the future in order to get a reliable reward estimate is not feasible. The only thing we can observe about our environment is the current sensory input, including information about the current reward. And we can remember this information, meaning we can also keep a short history of past sensory input, undertaken actions and consequences. By evaluating past experiences and storing this information in our brain, it is possible to construct a reward expectation for the future, based on our current context.

To see how this works mathematically, I will start by looking at the simplest form of temporal difference, one-step-ahead prediction. Suppose $P_t$ is the output of a simple linear connectionist unit:

$P_t = \sum_{i=1}^{m} w_t^i x_t^i$  (2.1)

where $w_t^i$ is the connection weight for unit $i$ at time $t$ and $x_t^i$ is the input activation. If at any time only one input is active, the output of this unit represents the predicted reward for the given input at this time. If this prediction is higher than 0, a reward is expected in the next time step. You can compare this to a CS-US pair in classical conditioning (see figure 2.9).
If the reward fails to occur, apparently the prediction was too high. Similarly, if an unexpected reward occurs, the prediction was too low. To make a better prediction in the future, the expectation needs to be changed. The amount by which we want to change it depends on the difference between the predicted reward at time $t$ and the perceived reward at time $t+1$, $r_{t+1}$. This is called the temporal difference:

$TD_t = r_{t+1} - P_t$  (2.2)

This temporal difference can be used to update the connection weights. We only want to change the weight associated with the current input and we do not want to change it too radically. A well-established method for doing this is the delta rule [43]. Basically what it says is that the amount by which to change a connection weight is given by the difference between the expected and actual output times a learning constant between 0 and 1. The delta learning rule for one-step-ahead prediction is:

$w_{t+1}^i = w_t^i + \eta \, TD_t \, x_t^i$  (2.3)

where $\eta > 0$ is the learning rate. An important observation we can make here is that in a classical conditioning experiment only one sensory stimulus is paired with another. Since only one of the inputs in equation 2.3 is active at a time, only one of the weights will be updated. Another way of putting this is to say that only one connection weight is eligible for modification at time $t$. Applied at time $t$ instead of $t+1$, the delta rule is given by the following equation:

$w_t^i = w_{t-1}^i + \eta (r_t - P_{t-1}) x_{t-1}^i$  (2.4)

The interpretation of this equation is that the eligible connection weights are updated by subtracting the previously made reward prediction for the specific input we are dealing with from the currently observed reward. If the prediction was correct, no changes are made. Otherwise the weights are updated to better predict the current situation in the future.

This all works fine if the predicted reward immediately follows the cue that predicts it. This is sufficient for learning stimulus-response combinations, but fails for more complex situations in which reward only comes after completing a sequence of actions. We then want the sensory cue to predict a reward that only comes after performing a sequence of actions taking more than one time step. Ultimately, we want to know about all the future rewards that the current sensory input might lead us to. What this effectively means is that we would have to remember everything we did in the past and where it has led us. This would make the algorithm unnecessarily complex. But there is a solution. Let's assume that we can in fact make an infinite-step-ahead prediction of future reward. This means that at any time, we know about the immediate reward following our stimulus-response pair as well as all rewards that will follow later on because of our next sequence of actions. The expected reward is simply the sum of all expected future rewards. Presumably, we will get a reward at some time and therefore our prediction would always be very high, even if the expected reward is still many time steps ahead. To account for this, a discount factor needs to be introduced to give less value to predictions of reward still far ahead. The prediction we can make now looks like this:

$P_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots$  (2.5)

where $0 \le \gamma < 1$ is the discount factor. The farther we look ahead in time, the more influential the discount factor becomes. Applied at time $t-1$ it gives us:

$P_{t-1} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots$  (2.6)

Now notice that $P_{t-1}$ can be rewritten as follows:

$P_{t-1} = r_t + \gamma (r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots)$  (2.7)

$P_{t-1} = r_t + \gamma P_t$  (2.8)

Apparently, the prediction at any time can be derived from the reward at that time and the prediction at the previous time step. Even better, if the predictions were perfect, they would satisfy equation 2.8. The amount by which two adjacent predictions fail to satisfy this equation can be used as an error measure for changing the weights of equation 2.1. This temporal difference error is:

$TD_t = r_t + \gamma P_t - P_{t-1}$  (2.9)

Note that this temporal difference equation is very similar to equation 2.2. When we take $\gamma = 0$, meaning that future reward is simply discarded, it is exactly the same, only applied one time step later. The full equation for updating the weights, applied at time step $t$, is now as follows:

$w_t^i = w_{t-1}^i + \eta (r_t + \gamma P_t - P_{t-1}) x_{t-1}^i$  (2.10)

Put into words, this equation says that the weights are updated according to the difference between the currently observed reward plus the discounted current prediction on the one hand, and the previous prediction on the other. In other words, if a reward was predicted but has not come yet, the next prediction must take this into account. The weights for the previous input are adapted to ensure a better prediction in similar future situations.
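To make the update concrete, here is a minimal sketch of equation 2.10 applied to a single trial type: a cue appears at time step 0 and a reward is delivered a few steps later. This is my own illustration, not the implementation described later in this report; the tapped-delay-line input representation (one unit per time step since cue onset) and all names and parameter values are assumptions made for the example. Over repeated trials the prediction propagates back from the reward to the cue, mirroring the shift of DA activity from the US to the CS described in section 2.3.

```java
// Illustration of the temporal difference update of equation 2.10.
// Not the thesis implementation; names and parameters are invented for this sketch.
public class TemporalDifferenceDemo {
    public static void main(String[] args) {
        int steps = 6;                  // time steps per trial; cue at t = 0, reward at the last step
        double[] w = new double[steps]; // one weight per time-step "input unit" (tapped delay line)
        double eta = 0.1;               // learning rate
        double gamma = 0.9;             // discount factor

        for (int trial = 0; trial < 200; trial++) {
            double previousPrediction = 0.0;
            for (int t = 0; t < steps; t++) {
                // The input is one-hot: the active unit codes how long ago the cue appeared,
                // so the prediction P_t is simply the weight of that unit.
                double prediction = w[t];
                double reward = (t == steps - 1) ? 1.0 : 0.0;                    // reward only at the end
                double tdError = reward + gamma * prediction - previousPrediction; // eq. 2.9
                if (t > 0) {
                    w[t - 1] += eta * tdError * 1.0; // eq. 2.10, with x_{t-1}^i = 1 for the active unit
                }
                previousPrediction = prediction;
            }
        }
        // After training, w[0] (the prediction made at cue onset) has grown towards the
        // discounted value of the reward (roughly 0.66 with these settings), even though
        // the reward itself only arrives at the end of the trial.
        System.out.printf("prediction at cue onset: %.3f%n", w[0]);
    }
}
```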
2.6 A theory of cognitive control

Having gained a basic understanding of the processes in the brain that enable humans to express such complex behaviour, I will now look at a model of control that Braver and Cohen introduce in [6]. This model focuses on the idea that any system capable of controlling behaviour needs to be able to attend only to contextually relevant information while ignoring contextually irrelevant sensory input. DA neurons are believed to be involved in updating PFC representations. Braver and Cohen hypothesise that the effect of DA neurons is to modulate the responsivity of PFC units to their input, meaning that DA serves as a gate between sensory input and PFC representations. This explains how the PFC can update its representations when necessary while protecting them against interference from other, distracting stimuli. DA comes to respond to the earliest predictor of reward, so if DA is to open the gate between sensory input and the PFC, updates are made only when an unexpected stimulus comes in that reliably predicts a future reward. Any stimulus not associated with reward will not develop a predictive DA response and will therefore allow representations in PFC to be maintained. A second effect of DA is to strengthen the associations between sensory stimuli that predict reward and the DA neurons themselves. This corresponds to the reward prediction in temporal difference learning.

However, there is one problem with this situation. If the gating system is learned by observing external reward, but the reward acquisition in turn depends on a correctly working gating mechanism, then how can this process get started? This is a classic example of a bootstrapping problem. To show that their theory of gated cognitive control is capable of bootstrapping, Braver and Cohen constructed a neural model to carry out a simple cognitive control task. The task they used is a variant of a delayed-response task in which a cue is presented at the beginning of each trial. This cue can be either the letter A or B written in black or white, meaning four different cues can be distinguished. After a delay of variable length, the letter X is given as a probe to which the network must reply with one of two possible responses.
One of those, called the target response, must be made when the probe follows one particular cue (e.g. a black A). In all other cases the nontarget response is required. During the delay period, the system can be presented with both target and nontarget stimuli. In order to give the correct response, the network needs to 'remember' seeing the target cue while ignoring nontarget cues. The only feedback given to the system is a value of reward for that particular trial.

Figure 2.11: Network used by Braver and Cohen (from [6])

Figure 2.11 shows the network they used. There is a stimulus layer with five inputs separated into two different pools to represent the identity and colour of the stimulus. All five units have an excitatory connection to a corresponding unit in the context layer. Two network responses are possible, represented by two units in the output layer. Both the context and the input layer are fully connected to the output, so every input and context unit has a connection to both output units. In every layer there are lateral inhibitory connections to enable competition between units. Units in the context layer have strong self-excitatory connections as well as inhibitory connections from a tonically active bias unit. This is used to simulate active maintenance of context information in PFC. The most interesting component is the reward-prediction unit. This unit receives input from both the stimulus and the context layer. In addition, this unit observes the external reward value at the current time step. Its behaviour is supposed to mimic the DA activity observed in the brain. Its activity is therefore used to adjust all the weights in the system. For a more detailed description, see [6].

After training, the network was able to correctly respond to the given inputs while ignoring distractor cues. Ten runs were performed and on every run the network was initialised with random weights. The network converged to perfect performance on every run, meaning that it was perfectly capable of bootstrapping. The results suggest that a gating mechanism can indeed be used to exert control over behaviour in order to successfully carry out a delayed response task.

Chapter 3 Design of a simple stimulus-response network

The theory so far is mostly concerned with the influence of the PFC on human behaviour. It is clear that the PFC is very important, but to understand how the PFC can exert its influence we first need a model of the evolutionarily older stimulus-response pathways on which the PFC exerts a modulatory influence. In this chapter I will introduce a highly simplified model of the cortical stimulus-response pathways in the absence of PFC. The goal of this model is to simulate basic stimulus-response behaviour that allows animals to respond appropriately to sensory input by performing those actions that lead to a desired situation. I will start in section 3.1 by building up an architecture for the network. Then I will discuss the learning regime, including my own implementation of the temporal difference algorithm, in section 3.2. I created a few test cases to assess the overall performance of the network; section 3.3 discusses the results. Next, in section 3.4 I introduce a slightly modified, more biologically plausible architecture and discuss its performance. Finally, in section 3.5 the limitations of this network design are discussed.
3.1 Architecture

Animal behaviour is dependent on sensory stimuli. Without any input from the environment, no living creature has any chance of survival. The key to survival is interaction with the world. Nature has given us the ability to see, hear, taste, smell and feel. All those things are handled by the sensory system, comprising sensory receptors, neural pathways and brain structures involved in processing and combining the raw sensory information. Generalising the basic function of this system, it generates high-level representations of sensory stimuli which analyse the scene in terms of its affordances for action. Milner and Goodale hypothesise that there are two functionally and anatomically separate pathways for processing visual sensory input [14]. The 'what' pathway, going through inferior temporal cortex, specialises in the identification of objects. This is a very important aspect when it comes to understanding the world, but not so much for reflexive motor behaviour. The 'how' pathway, on the other hand, is located in posterior parietal cortex and specialises in tasks involving spatial perception. In experiments on patients with lesions to the parietal region, subjects showed great difficulty in reaching for and picking up common objects, even though they had no problems recognising and naming them. Stimulus-response behaviour relies on the possibilities for interaction with an object much more than on its identification. Based primarily on the 'how' pathway, a plan is generated to react to the environment in an appropriate manner. This plan is then sequentially fed to the basic motor pathway, a brain system responsible for the control of all body motor systems. It basically transforms a high-level motor command, like moving your arm, into a correct set of muscle control signals to actually move the arm in the right direction.

For my brain model, I will take this complex system to an abstraction level where only three basic components remain. The first component models the sensory system, the second component integrates the sensory information and selects an action. The third component mimics the motor pathway and is thus responsible for executing the resulting action.

The most natural way of implementing this model is by the use of an artificial neural network. Based on biological neural networks found in the brain, a number of basic computational units called artificial neurons are connected to form a complex network capable of orchestrating complex behaviour. Every neuron has at least one input and zero or more outputs. The output models the neuron's firing rate. The inputs model the firing rates of neurons which connect to it. Upon activation of a neuron, the weighted sum of its inputs is computed and passed through a non-linear function called an activation function.

My stimulus-response (SR) network is inspired by the network designed by Braver and Cohen [6]. It is composed of an input layer, an output layer and a hidden layer. The input layer is used to represent a high-level abstraction of sensory input values. For example, activation of the first input neuron, which I will call S1 from now on, could represent the sight of a coffee cup. The input layer is unidirectionally connected to the hidden layer. This layer is hidden in the sense that the pattern of activation does not simply reflect one particular event or action but rather a particular action in a particular sensory context.
This pattern cannot be mapped directly onto an observable external event or behaviour, hence the name ‘hidden layer’. The hidden layer is unidirectionally connected to the output layer. The output or response layer represents a high-level abstraction of an action, for example walking to the coffee machine. Figure 3.1 shows this three-layer network.

Figure 3.1: A simple stimulus-response network with three layers

It is connected in such a way that there is exactly one way to get from every input to every output; in other words, anything is possible. To make sure that only one stimulus-response pathway is selected at any time, there are inhibitory (negative) connections in between the hidden neurons as well as self-excitatory (positive) connections. This brings about winner-take-all behaviour, meaning that only one neuron in this layer can be active at a time. In neurophysiological terms this phenomenon is known as lateral inhibition. The idea of using inhibitory competition in a neural network is not new. Randall O’Reilly [26] states that ‘roughly 15 to 20% of the neurons in the cortex are ... inhibitory interneurons’ and ‘any realistic model of the cortex should include a role for inhibitory competition’. A sparse representation of the world allows for categorisation, which is very useful in an ever-changing environment. Inhibitory competition allows the network to attend only to the strongest sensory input without even considering processing much less important stimuli. In an emergency situation where your life depends on an appropriate and quick response, the brain needs to be able to focus on one thing only. Lateral inhibition allows a strong sensory stimulus like pain to suppress other, less important stimuli completely.

The network is activated when a sensory stimulus arrives. One of the input neurons will activate and in turn activate the hidden layer, which will then activate the output layer. The pattern of activation over the output layer determines the motor action the system has chosen to take. Correctly choosing the parameters for the activation function is crucial for the performance of the network. We want the output layer to make a clear decision for taking an action in any situation. The most commonly used and also biologically plausible activation function is the sigmoid function shown in equation 3.1. The graph of this function is plotted in figure 3.2.

P(x) = \frac{1}{1 + e^{-x}} \qquad (3.1)

Figure 3.2: Standard sigmoid function

Since the output of this function is always between zero and one, the output of any neuron is constrained to fall between those two numbers. For both the input and output neurons, we need to think of a sensible way to map the output value of the neuron to a useful interpretation. In other words, we need to figure out what the real-life correlate of a maximally activated input neuron is. First note that one input neuron in my network does not necessarily represent exactly one biological neuron. The activation of a single artificial neuron can encode the existence of a distributed representation of a complex sensory event. An excited artificial input neuron tells us that the sensory stimulus encoded by this neuron is present. This is a binary event, either the stimulus is present or it is not, so we can encode the absence of the stimulus by an activation value of zero. A similar situation exists for the output neurons, either an action is performed or it is not.
I interpret a value between zero and one as the likelihood that the action is a good one in the current circumstances. A value of zero means it is definitely not a good idea, while a value of one means the exact opposite. Compare this to the firing rate of a biological neuron. The average firing rate of neurons on an actively used neural pathway is relatively high. This is the kind of activity that is measured during a functional MRI scan. But this does not mean that inactive connections show no activity at all. The average firing rate of those neurons is lower than that of actively used neural pathways, but they still show electrical activity. The average firing rate of unused neurons is also called the baseline activity. Neurons can also show an average activity below their baseline. In my network, I encode the baseline activity by the value 0.5. This value signals a high level of uncertainty, or rather indifference, about the current state of affairs.

The activation value of the output neurons is determined by the pattern of activation over the hidden neurons. Using the same sigmoid function for those neurons, the output again varies between zero and one and so does the input to the output neurons. However, looking at the graph in figure 3.2 we can see that an input of zero gives an output of 0.5, meaning indifference. This is not a desirable situation; input activation close to zero should generate an output close to zero. To tackle this problem, I make a small adjustment to the sigmoid function, as shown in equation 3.2.

P(x) = \frac{1}{1 + e^{1 - 2x}} \qquad (3.2)

As you can see in figure 3.3, an input of 0.5 now generates an output of 0.5.

Figure 3.3: Adapted sigmoid function

Until now, I regarded the interneural connections as if they were static. This is incorrect; in order to learn behaviour their strength needs to be dynamically changed to suit a behavioural pattern rewarded by the environment. The synaptic strength of biological neurons can be translated into the strength of a connection between two model neurons. A strong connection will cause the neuron on the receiving side to fire more quickly and more strongly, whereas a weak connection will do the opposite. As I explained in section 2.4, under the right conditions the connection can be strengthened or weakened. A stronger connection between two neurons makes it more likely that the neuron on the receiving side will fire. Because of the lateral inhibition in the hidden layer, for a neuron in this layer to fire it needs a relatively strong connection with its input neuron in order to receive enough input to win the competition. This is demonstrated in figure 3.4, where the red connection from S1 to H2 (the second hidden neuron from the top) depicts a strong connection and the blue connection from S1 to H1 depicts a weaker connection.

Figure 3.4: Example behaviour S1-R2

In this situation the first input is triggered, causing S1 to fire. This is depicted by colouring this neuron red. Because of the strong connection between S1 and H2, this hidden neuron easily wins the competition between the four neurons. H1 receives much less input from S1 because of the weaker connection, and H3 and H4 receive no input at all because S2 did not fire. Since H2 is connected to R2, this output neuron is the only one to receive any input at all and therefore fires. In this situation the network chose to select R2 when S1 came on; under different circumstances other behaviour might have come about.
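To make this concrete, the following is a minimal sketch of how a rate-coded neuron could compute its net input and pass it through the adapted sigmoid of equation 3.2. The class and method names are illustrative only; this is not the thesis code described in appendix A.

// Minimal sketch of rate-coded neuron activation with the adapted sigmoid
// of equation 3.2. Class and method names are illustrative, not the thesis code.
public class SimpleNeuron {

    /** Adapted sigmoid: maps a net input of 0.5 to an output of 0.5. */
    static double activation(double netInput) {
        return 1.0 / (1.0 + Math.exp(1.0 - 2.0 * netInput));
    }

    /** Weighted sum of the incoming firing rates. */
    static double netInput(double[] inputs, double[] weights) {
        double sum = 0.0;
        for (int i = 0; i < inputs.length; i++) {
            sum += inputs[i] * weights[i];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] inputs  = {1.0, 0.0};    // S1 active, S2 silent
        double[] weights = {0.55, 0.45};  // initial weights near 0.5
        double out = activation(netInput(inputs, weights));
        System.out.printf("hidden unit output = %.3f%n", out);
    }
}

In the actual network the hidden units additionally compete through lateral inhibition, so the unit with the largest net input suppresses the others before the output layer is driven.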
3.2 Learning stimulus-response behaviour

Now it is time to introduce the learning aspect. Learning is triggered by the environment, so the first thing I need to introduce is environmental feedback. The immediate environment of the SR system consists of other subcortical brain structures including the dopaminergic subsystem. As I explained in sections 2.2 and 2.4, this system aids in learning behaviour by releasing small amounts of the neurotransmitter DA. I also explained the striking similarity between the DA signal and the temporal difference algorithm. I will now explain how the two are combined to achieve learning.

The DA signal is not targeted on single neurons; instead it is used as a measure for changing the connection strengths throughout the network. To understand how this works, I will revisit Hebbian theory [15]. In a situation where learning is driven by reinforcement, the only patterns we want to strengthen are those which lead to reward. We therefore need to add an extra constraint: regular Hebbian learning only takes place if there is a DA burst. (Recall Wickens’ empirical support for this extra constraint, as described in section 2.4.) DA neurons fire whenever an unexpected reward is delivered or when a reward-signalling stimulus comes on. Vice versa, when something undesirable happens such as the absence of an expected reward, DA levels drop below baseline, signalling something that could be regarded as a negative reward. Using this signal as a learning parameter for either strengthening (in case DA levels are high) or weakening (in case DA levels are low) ‘active’ connections, i.e. connections between neurons that are firing, our network is capable of learning stimulus-response combinations.

In the following example I assume there is a correctly working DA system that responds to the reward schedule S1 → R1. This means that the network is rewarded for selecting R1 whenever S1 comes on. The initial network shown in figure 3.5a has no preference for choosing any particular hidden neuron. If all neural connections were initialised with equal strength, the network could never decide which hidden neuron to use.

Figure 3.5: Example of a situation where the behaviour S1-R1 is learned

In every reinforcement learning problem there is a tradeoff between exploration and exploitation. A solution based solely on exploitation has bootstrapping problems when faced with a new, unexplored environment. Exploration is necessary early in learning, but as the expectation of reward gets higher, exploitation gets more rewarding and is preferable over exploration. So early in learning there is a need for more exploration, while in later stages the focus needs to be on exploitation. This is implemented in the network by applying a certain amount of randomness to the activation of every neuron. As learning progresses, this exploration factor decreases, favouring exploitative behaviour.

Figure 3.5b shows what happens when stimulus S1 comes on. One of the two hidden neurons connected to the first input neuron activates at random (subject to the exploration factor) and response R1 is selected as an output. This triggers an external reward, causing the DA system to send a positive reward signal which is delivered to every single connection in the network (3.5c). Hebbian theory says that only the active connections have their weights altered, in this case the connection from S1 to H1 and the connection from H1 to R1. The strength of those connections is increased (figure 3.5d), making it more likely that in a future situation they are preferred over other, weaker connections.
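One simple way to express the constraint that Hebbian learning only occurs in the presence of a DA signal is sketched below. This is an illustrative formulation in which the DA (temporal difference) value scales the change applied to connections whose pre- and postsynaptic neurons were both active; the exact update rule used in this report appears later as equation 3.7.

// Sketch of dopamine-gated Hebbian learning: weights change only for
// connections that were active, and the sign and size of the change follow
// the DA (temporal difference) signal. Illustrative only; the report's
// own update rule is given in equation 3.7.
public class GatedHebbian {

    /**
     * @param weight current connection weight
     * @param pre    firing rate of the sending neuron
     * @param post   firing rate of the receiving neuron
     * @param da     dopamine / temporal difference signal (positive or negative)
     * @param lr     learning rate
     * @return       updated weight
     */
    static double update(double weight, double pre, double post, double da, double lr) {
        // An inactive connection (pre or post near zero) is barely changed.
        return weight + lr * da * pre * post;
    }

    public static void main(String[] args) {
        double w = 0.5;
        // Reward arrived: a DA burst strengthens the active S1 -> H1 connection.
        w = update(w, 1.0, 0.9, 0.4, 0.1);
        System.out.printf("new weight = %.3f%n", w);
    }
}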
For use with my neural network, I implemented the temporal difference algorithm in a dopamine unit. To understand how this dopamine unit works, first have a look at the temporal difference function in equation 3.3.

TD_t = r_t + \gamma P_t - P_{t-1} \qquad (3.3)

The first thing we need is information about external reward. The neural correlate of external reward is stimulation of the dopaminergic system. This stimulation usually comes from outside the brain and can be considered external to the brain structures I am modelling. The reward value is dependent on the specific stimulus-response combination being learned and can be determined by observing the current input-output state of the network. The second and final variable needed for calculating the temporal difference is a prediction of reward. In my implementation of the dopamine unit this prediction is the output of a linear connectionist unit:

P_t = \sum_{i=1}^{m} w_t^i x_t^i \qquad (3.4)

The connection weights w_t^i are internal variables of the DA unit; the input values x_t^i are the inputs to the SR network.

The timing of the DA system is crucial to its correct functioning. For the SR network one time step consists of the selection of an input value, activation of the network and the selection of the appropriate action. In the same time step I want to receive feedback from the DA system in order to update the connection weights. Looking at the temporal difference function in equation 3.3, the temporal difference available at the end of this time step is based on both the prediction made for this time step and the one for the previous time step. Actually this provides us with an evaluation of our previous action. Unfortunately, we want to have an evaluation of our current action and not of the one we did before.

The solution to this problem can be found in the observation that we are actually doing one-step-ahead prediction. The SR network is trying to learn stimulus-response combinations, which means that the predictions we make are predictions about immediate reward. Taking only immediate reward into consideration, the equation for updating the weights of the dopamine unit (see section 2.5) is:

w_t^i = w_{t-1}^i + \eta (r_t - P_{t-1}) x_{t-1}^i \qquad (3.5)

Applying this formula at time t + 1, it can be rewritten as follows:

w_{t+1}^i = w_t^i + \eta (r_{t+1} - P_t) x_t^i \qquad (3.6)

Now what does the factor r_{t+1} actually mean? It is the reward observed by the dopamine unit in the next time step. The interesting thing is that this reward is actually the environmental reward based on the current output of the system. It does not depend on the next input anymore because we only do one-step-ahead prediction. In other words, the dopamine unit can provide the reinforcement signal even before the new input to the system is known. This is the immediate evaluative feedback necessary for updating the connection weights throughout the network. Provided that the DA unit works correctly, this network is capable of learning every possible stimulus-response combination by strengthening connections on the pathway from input to output and weakening other connections. In theory it all looks very promising. Section 3.3 describes a number of tests I created to assess the performance of the network design.
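Putting equations 3.4 and 3.6 together, a dopamine unit doing one-step-ahead prediction can be sketched as follows. The class structure and variable names are mine and do not reproduce the implementation described in chapter 5.

// Sketch of a dopamine unit doing one-step-ahead reward prediction.
// The prediction is a linear unit over the network inputs (equation 3.4);
// its weights follow the update of equation 3.6. Names are illustrative.
public class DopamineUnit {
    private final double[] w;   // prediction weights, one per input
    private final double eta;   // learning rate

    DopamineUnit(int nInputs, double eta) {
        this.w = new double[nInputs];
        this.eta = eta;
    }

    /** P_t: predicted reward for the current input pattern. */
    double predict(double[] x) {
        double p = 0.0;
        for (int i = 0; i < x.length; i++) p += w[i] * x[i];
        return p;
    }

    /**
     * Called after the network has acted and the reward r has been observed.
     * Returns the prediction error r - P_t, which serves as the DA signal
     * gating Hebbian learning, and updates the prediction weights.
     */
    double step(double[] x, double r) {
        double p = predict(x);
        double delta = r - p;                 // one-step-ahead temporal difference
        for (int i = 0; i < x.length; i++) {
            w[i] += eta * delta * x[i];       // equation 3.6, immediate reward only
        }
        return delta;
    }
}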
3.3 Performance of the network

To show that the SR network described in this chapter is capable of learning stimulus-response behaviour by observing only evaluative environmental feedback, I made an implementation in Java. The code includes a computational network component and a Graphical User Interface (GUI). Figure 3.6 shows what the GUI looks like. For a description of the functionality of the different buttons I refer to chapter 5.

Figure 3.6: Graphical User Interface for the SR network

The neural network was constructed using a set of modular Java classes like Neuron and Connection. Multiple Neuron instances are grouped together into a Layer component, which can be connected to another Layer, thus creating new instances of the Connection class. A functional overview of those three Java classes can be found in appendix A.

To test the performance of the SR network I created three test cases. For each case, a different set of reward schedules was used, i.e. the external reward used as input for the dopamine unit was the only variable factor between the three cases. Those three test cases are:

1. S1 → R1
2. S1 → R2 ; S2 → R1
3. S1 → R1 ; S2 → R1

In each case the network architecture depicted in figure 3.1 was used, with two neurons in the input layer connected to the four-neuron hidden layer, in turn connected to an output layer with two neurons. The weights of the connections between the input and hidden neurons are dynamic, meaning that Hebbian learning is applied after each stimulus-response epoch. The connections between the hidden and output neurons are statically set to 0.5. In theory these connections could be learnable too, but since there is only one connection from every hidden neuron to an output neuron, the strength of the connection is not decisive in choosing a particular output. To avoid unnecessary complexity, these connections are made static. The dynamic weights between the input and hidden layer are initialised with a random value between 0.4 and 0.6, adding to a random initial behaviour of the network at the start of the exploration phase.

To test the performance of the network I created a batch file which can be interpreted by the network. It sets the reward schedules according to the test case and then presents the network with an input 100 times. On each input presentation the network is activated and selects an output. Next, the temporal difference value is calculated and every learnable connection is updated using the Hebbian update rule:

\Delta weight = learningrate \cdot output \cdot (input - weight) \qquad (3.7)

For the first test case, I presented the network only with S1 to show that the network is actually capable of learning this simple schedule. Every session consisting of 100 epochs was run 100 times. At the start of every session the network was reinitialised with new random values for the connection weights. In the worst of those 100 sessions the wrong output was chosen in 7 of the 100 epochs, a fraction of 7%. Note that this includes ‘mistakes’ made by the network during the training phase. The average number of incorrect trials (i.e. an epoch in which a non-rewarding response was chosen) was 2.8. After 10 epochs the average number of incorrect trials was only 0.3 out of the 90 remaining epochs, a fraction of 0.4%. After 17 epochs the network did not select the wrong output anymore. We can safely say that this is near-perfect performance.

For the second test, the input to the network was randomly generated; in every epoch one of the two possible input neurons was selected.
The network was then activated and the correct response needed to be selected, dependent on the selected input. In the worst-case session, the network now made 14 errors, a fraction of 14%. On average only 6.5 incorrect responses were selected. Again, the network was initialised with random weights, so there was no bias towards any response before the first trial. This time an average of 3 mistakes were made after the first 10 trials. After 20 trials, the average number of mistakes had gone down to 1.2. After another 10 epochs, less than one mistake was made on average. Although the overall performance was not as good as in the first test case, this makes perfect sense considering that this time two correct responses needed to be learned instead of one.

Given that in the third case two responses need to be learned as well, one would expect similar figures as in the second test. The batch file was changed to set and test the correct rewards and run once more. In the worst-case session 12 errors were made this time; the average number of incorrect responses was 6.9. After 10 trials, only 3.2 mistakes were made on average, after 20 trials only 1.1. Table 3.1 summarises the results.

                      test case 1   test case 2   test case 3
maximum                    7             14            12
average                   2.8            6.5           6.9
avg after 10 trials       0.3            3.0           3.2
avg after 20 trials       0.0            1.2           1.1
avg after 30 trials       0.0            0.3           0.4

Table 3.1: Number of incorrect trials for each test case

We can conclude that the SR network performs very well when faced with various input-output reward schedules. For every test case it took only a few trials to figure out which connections to strengthen and weaken in order to receive reward. However, there are some comments to be made about the current setup. Section 3.4 describes an attempt to create a more realistic situation in which the input and hidden layers are fully connected. Finally, some other limitations are discussed in section 3.5.

3.4 A fully connected version of the network

The current network design might give a very good performance, but the following issue needs to be considered. The connections between the input and output are set up to connect every input neuron with exactly one output neuron. There is no neurological basis for this sparse and very specific connectivity. If one of the connections were to somehow fail, a stimulus-response pathway would be disabled and could not be restored anymore. This is obviously an undesirable situation that needs to be resolved. A much more robust situation is achieved when multiple pathways from input to output are formed. A solution would be to simply add more hidden neurons. Now if a connection fails, another neuron in the hidden layer can be used to select the correct output instead. But how many neurons are required to have a robust enough system? Probably quite a lot. A more efficient solution is to fully connect the input and hidden layer so that every input neuron is connected to every hidden neuron. This way, in case one of the connections fails the network can still rely on a different hidden neuron to select the correct output. The hidden neurons are now shared between the input neurons, meaning that the same number of neurons can be used. Figure 3.7 shows this new design with a fully connected input and hidden layer.

Figure 3.7: A network design with full connectivity between input and hidden layer

This new design probably has an impact on the ability to learn stimulus-response combinations; a sketch of what the fully connected architecture amounts to is given below, before I turn to the tests.
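The sketch uses plain weight matrices rather than the Layer, Neuron and Connection classes of appendix A; the class name, the initialisation range and the particular hidden-to-output assignment shown are my own assumptions for illustration.

// Sketch of the fully connected 2-4-2 architecture of figure 3.7. Input-to-hidden
// weights are learnable and initialised between 0.4 and 0.6; hidden-to-output
// weights are fixed at 0.5, as in the sparsely connected network.
import java.util.Random;

public class FullyConnectedNet {
    static final int INPUTS = 2, HIDDEN = 4, OUTPUTS = 2;

    double[][] inToHid  = new double[INPUTS][HIDDEN];   // learnable
    double[][] hidToOut = new double[HIDDEN][OUTPUTS];  // static

    FullyConnectedNet(Random rng) {
        for (int i = 0; i < INPUTS; i++)
            for (int h = 0; h < HIDDEN; h++)
                inToHid[i][h] = 0.4 + 0.2 * rng.nextDouble();
        // One pathway from every hidden neuron to one output, weight 0.5
        // (the assignment of hidden neurons to outputs is chosen arbitrarily here).
        hidToOut[0][0] = 0.5; hidToOut[1][0] = 0.5;
        hidToOut[2][1] = 0.5; hidToOut[3][1] = 0.5;
    }
}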
To test the new design, I reran the three test cases described in section 3.3:

1. S1 → R1
2. S1 → R2 and S2 → R1
3. S1 → R1 and S2 → R1

Figure 3.8 gives a comparison between the performance of the sparsely connected and the fully connected network. It clearly shows a higher average number of errors at the beginning. Since there are more connections running between the input and hidden layers, there are more options for the network to explore. This explains why it takes more time for the network to settle into a situation where it hardly makes a mistake anymore. On average, it takes the fully connected network 12 trials to reach the point of making only one more mistake in the remaining 88 epochs. Now have a look at figure 3.9. Again, the overall error rate is higher for the fully connected version of the network, but the learning curve is steep and after 20 epochs the average remaining number of mistakes has gone down to 6. The one-mistake threshold is reached in epoch 41. Figure 3.10 displays a similar situation: the learning phase is short and the learning curve steep. In general I can conclude that although the performance has gone down a little bit because of the full connectivity, the network is still capable of achieving a very good performance. Since full connectivity is more biologically plausible, I will use it for my final model as well.

Figure 3.8: Performance of the network for the first test case: S1 → R1 (average number of errors in the remaining epochs against number of lapsed epochs, sparsely versus fully connected)
Figure 3.9: Error rate of the network for the second test case: S1 → R2; S2 → R1
Figure 3.10: Error rate of the network for the third test case: S1 → R1; S2 → R1

3.5 Limitations of the current network

The network described in this chapter seems perfectly capable of learning any combination of stimulus-response behaviour. But the network design also imposes some major limitations. The first and most obvious limitation is the inability of the network to handle distractor input. My network is based on a design by Braver and Cohen [6], who implemented a network capable of learning task-relevant behaviour while ignoring other irrelevant sensory input. Their network can handle distractors because they included a gating mechanism, which supposedly is one of the functions of PFC. My network reacts to any input regardless of its relevance in the current context. However, my focus is not on the capability of the brain to ignore irrelevant sensory input. The focus of my research is on the output side of the network. Although the Braver and Cohen network is able to act only on task-relevant sensory input, it is unable to learn a sequence of actions triggered by a single sensory event.
In this report I present a solution for this specific problem, also based on processes thought to take place in PFC. Adding an input gating mechanism to include the ability to ignore irrelevant stimuli would be a possible extension to the network, left for future work.

Being able to learn stimulus-response combinations is a good start, but it is not enough to explain human behaviour. Not only can we learn to react to a stimulus, sometimes a whole sequence of actions needs to be learned before getting rewarded. A critical question must be answered: is the current network design sufficient for learning sequences of actions? Until now I only looked at single stimulus-response combinations. Both the sparsely connected and the fully connected network were very good at learning those combinations. But with the current network design we can never learn behavioural sequences. The output layer can only select one output at a time, and even if it could select more than one output, no information about the serial order of the actions is provided. My hypothesis is that the DA signal, which is currently only used for learning the correct input-output combinations, can also be used for learning the serial order of actions. This requires some changes to the network architecture. In chapter 4 I will describe a model that can learn behavioural sequences. This model is an extension of the SR network presented in this chapter.

Chapter 4

A model of behavioural sequences

In this chapter I will introduce a neural network capable of learning behavioural sequences. Before I start talking about the necessary changes to the network architecture I will briefly explain in section 4.1 why learning behavioural sequences is crucial for explaining human intelligence. In section 4.2 I will discuss a well-established theory for learning and executing action sequences. Sections 4.3 and 4.4 elaborate on the design of my network.

4.1 Why do I want to learn behavioural sequences?

The short answer to the question why I want to learn behavioural sequences is that stimulus-response behaviour by itself is just too limited. Only in very rare situations do we immediately receive feedback on actions in real life. More often, it takes time and the right sequence of actions to accomplish a goal. For example, if I feel like having a cup of coffee I first need to grab a cup. Then I want to take the cup to the coffee vending machine in order to fill it with coffee. Filling the cup requires a particular sequence of actions to perform on the coffee machine. It is not until I finally drink the coffee that I get positive feedback from feeling a bit less thirsty. There are numerous situations like this where a specific sequence of actions needs to be performed in order to get positive environmental feedback. If I were only able to learn stimulus-response combinations I would get to drink my cup of coffee only after randomly performing the correct sequence of actions. The only thing I would learn is that drinking out of a coffee cup might help me lessen my thirst, because this is the last action I did before feeling satisfied. Unfortunately, drinking out of an empty coffee cup is not very rewarding.

Learning a sequence of actions may not sound too complicated for a human being. All you have to do is remember exactly what you did before receiving the reward. But how can we know whether every action we did was strictly necessary for getting the current result? And even if we did figure it out, another dilemma awaits.
We can never consciously consider every action we take in life, let alone the exact set of possible future consequences. After profound rehearsal of a successful behavioural sequence, we are able to perform it without having to memorise it every single time. It is very useful indeed to be able to execute more complicated behaviour automatically.

4.2 Existing models of sequences

So we need to be able to unconsciously execute a sequence of actions; how could this work in our brain? A common solution for learning sequential tasks involves an (associative) chaining mechanism. The basic idea behind associative chaining is that a subject first learns to associate a pleasant experience with the preceding action. The action is then associatively chained with another, earlier action. The process continues until the first action is associated with the initial stimulus that triggered this chain of events and the chain is complete.

Another interesting range of models are the so-called competitive queueing (CQ) models. These models differ from associative chaining models in that a learned sequence is activated in parallel instead of sequentially. Behavioural evidence suggests that a CQ approach to learning shows much more resemblance to learning processes actually taking place in the brain [13]. In this section I will explain how a CQ model works. I focus on the execution of an already learned plan representation; learning the plan in the first place is an issue I will address later.

Let’s go back to my favourite example and assume that we know how to get ourselves a nice cup of coffee. The sight of an empty coffee cup combined with a lingering urge for caffeine triggers the generation of an action plan in the brain. Based on previous experiences with getting coffee, we come up with the following action plan:

1. pick up the empty coffee cup
2. walk to the vending machine
3. place the cup in the slot
4. push the button for coffee
5. pick up the cup
6. drink from the cup

We know that we have to perform this entire sequence of actions from the moment the plan is triggered. The only thing we need to do is to execute it in the right order. A CQ model can explain how the sequential ordering of those actions can be determined. CQ can be implemented in a neural network comprising two layers, a selection layer and an action layer. The selection layer represents the plan to be executed; the action layer is used for determining the immediate action to take. The basic architecture is depicted in figure 4.1.

Figure 4.1: Generic competitive queueing architecture

There is an excitatory connection from every selection neuron to the corresponding action neuron. Conversely, there is an inhibitory connection going back to the selection layer. We only want one action to be taken at a time, so the action layer implements lateral inhibition to ensure that only one of the neurons in the layer wins. The action layer receives input from the selection layer, so this winner will be the neuron with the highest input from the selection layer. For the plan representation this means that if the neuron representing the first action to be taken is activated most strongly, this will certainly be the first neuron to exert its influence on the action layer. Interestingly, there is an inhibitory connection from neurons in the action layer back to their corresponding neurons in the selection layer. Self-inhibition of a dominant representation is a widespread cognitive phenomenon, often called inhibition-of-return [17]. Immediately after the activation of the neuron in the action layer, the corresponding neuron in the selection layer is inhibited. This neuron will no longer be the most active, making way for the next step in the sequence. A minimal simulation of this selection-and-inhibition cycle is sketched below.
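The sketch uses the illustrative 0 to 10 activation scale of the worked example in figure 4.2; it is my own illustration of the CQ mechanism, not part of the model developed in this report.

// Minimal simulation of competitive queueing: on every step the most active
// selection neuron drives the corresponding action, and inhibition of return
// then suppresses it. Recovery of inhibited neurons is not modelled.
import java.util.Arrays;

public class CompetitiveQueueing {
    public static void main(String[] args) {
        double[] selection = {9, 6, 8, 10, 7};   // plan as a gradient of activation
        double baseline = 5.0;

        for (int step = 0; step < selection.length; step++) {
            // Action layer: winner-take-all on input from the selection layer.
            int winner = 0;
            for (int i = 1; i < selection.length; i++)
                if (selection[i] > selection[winner]) winner = i;
            if (selection[winner] <= baseline) break;   // nothing left above baseline
            System.out.println("execute action E" + (winner + 1));
            selection[winner] = 1.0;                    // inhibition of return
        }
        System.out.println("remaining plan activity: " + Arrays.toString(selection));
    }
}

Run on this gradient, the sketch executes E4, E1, E3, E5 and finally E2, i.e. the actions come out in the order imposed by the activation gradient rather than the order of the neurons.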
Figure 4.2 shows the first steps in the execution of the coffee-making example. In this example I represent the activity of a neuron in the selection layer by a number between 0 and 10. I assume the baseline activity of all neurons to be 5.

Figure 4.2: Example execution of a sequence in a competitive queueing model

Figure 4.2a shows the initial situation. The action plan is represented by a gradient of activation over the selection neurons. In the example neuron S4 is given the highest level of activation. The neurons in the action layer are competing with each other. The winner will be the one with the most input from the selection layer. Because of the one-on-one connectivity between the two layers, this winner is determined solely by the pattern of activation over the selection layer. This is represented in the figure by the red colouring of E4. In the process of making coffee, action E4 would represent picking up the cup. The next step is shown in figure 4.2b. The inhibitory connection from E4 back to S4 depresses the activity of the neuron. This is the inhibition of return, ensuring that once the first part of our planned sequence has been executed, processing of the second part of the plan can start. The depression of neuron S4 makes way for S1, which now has the highest activation level. In figure 4.2c we can see that E1 is selected in the same way as S4 selected action E4 to be taken. E1 would encode walking to the coffee vending machine, the second step in getting coffee. Again, immediately after the selection of E1 the inhibition of return depresses S1, making way for S3, placing the cup in the slot. In the meantime, neuron S4 has recovered a bit from the strong inhibitory influence from the action layer. But unless a new plan is put into the selection layer the activity will not rise above 5 anymore. This process continues until all the actions in the sequence have been done in the correct order.

Competitive queueing is a very interesting technique for explaining how sequences can be learned in the brain. Indeed there is good evidence that a CQ-type paradigm is used in PFC to plan certain types of motor sequences [2]. Behavioural evidence also comes from the domain of error correction in language technology. Transposition of individual letters is a common type of error made by people typing sentences on a keyboard. However, this error hardly occurs in handwriting. It seems that the CQ paradigm only works in a situation where a fast succession of (low-level) actions is made, like making key presses on a keyboard or drawing basic shapes. The type of actions generated by my model are higher-level actions requiring more attention and possibly taking much longer to execute. Transposition errors like we sometimes see in typing are rarely found in those higher-level sequences. For example, when making coffee people hardly ever make the mistake of picking up the (empty) cup before pushing the button for coffee.
It is therefore unlikely that a CQ type of learning is used for learning those kinds of sequences in PFC. Another important issue with CQ is that a plan representation is destructively updated, i.e. as soon as an action is performed, inhibition of return makes sure that the neuron(s) representing the action become inactive. After the sequence has been executed, no traces of it remain. This is problematic, because there are suggestions that sensorimotor sequences can be retained and replayed after they have been executed. By actively thinking about a recent experience, the learning procedure that just took place can be activated once more to strengthen its result and speed up the learning process. This can only be done when the PFC activity is not destroyed while executing the plan, as happens in a CQ model. Plan destruction also does not support intention recognition in action observation. The intention of people’s actions is often only worked out retrospectively, when the action itself has been completed. The brain needs access to the original plan representation in order to recognise the intention behind someone else’s actions.

A type of learning that is believed to be unique to the human species is imitation learning. Humans possess the unique ability to learn by observing other people getting rewarded or making mistakes. It was recently discovered that humans seem to have so-called mirror neurons [31]. Mirror neurons respond when someone performs an object-oriented action such as reaching for a cup. This is exactly the kind of action that can trigger the learning process in PFC. It is plausible to assume that the PFC neurons used for learning are mostly mirror neurons. Interestingly, they respond in exactly the same way when the subject sees someone else perform the same action. This allows us to learn by observation or imitation. Given that the brain structures involved in learning are made up of mirror neurons, it makes no difference to them what triggered the learning process. For mirror neurons, it is also useful to remain active during plan execution. For example, assume you have practised making coffee and you know how to go to the coffee machine, place a cup and drink from it. Now if you see someone walk away from the coffee machine with a filled coffee cup, you can infer that he must have placed his cup and used the machine even though you did not actually see him do this. This is the intention recognition process I mentioned earlier.

4.3 A neural model for learning sequences

We are now faced with the problem that we want to have a highly adaptive model of learning sequences that does not destructively update its plans. Instead we have a model that can only learn stimulus-response behaviour. It seems there is no suitable, biologically plausible sequence learning model readily available for implementation. The solution I present lies in the addition of a PFC component. Before getting into the implementation details, it is useful to understand where I’m headed. I will therefore first present a general description of my final model in section 4.3.1. In section 4.3.2 I continue by making changes to the design of the stimulus-response network in order to allow sequence learning.

4.3.1 A first impression of the final network

The goal of my research is to have a biologically plausible model of the brain that can perform a sequence of actions when presented with a single sensory stimulus.
Somewhere in the model I want a representation of the plan to be visible throughout the execution of the sequence. Research has shown that the PFC is likely to be the locus of this plan representation. The PFC does not exert its influence directly onto the brain’s neural motor pathways. Instead it relies on basic stimulus-response pathways controlling the automated execution of actions. It is on those pathways that the PFC exerts its influence, exciting some of them and depressing others.

On the basis of this idea, I can make the black-box representation of the final network shown in figure 4.3.

Figure 4.3: A black-box representation of the final network

This model consists of three main components. The SR model presented earlier is included as one component. It still comprises an input layer, a hidden layer and an output layer. Above the SR network lies the PFC. Because of the black-box representation, the internal working of the PFC is not revealed yet. The important thing to note here is the connectivity between the PFC and the hidden layer. There are multidirectional connections between the two. Connections running from PFC towards the hidden layer are used to bias the SR pathways in such a way that the desired behaviour results. This is accomplished by having a PFC neuron bias a particular hidden neuron responsible for selecting the desired action. The DA unit works independently of the other two components. Information from the DA unit, responsible for signalling both current and future reward, is available to every unit at any time.

Still it seems this design only allows basic stimulus-response behaviour to take place. Even with the biasing influence of the PFC component, the SR network does not activate unless a sensory stimulus is received. The execution of a sequence of actions requires the network to select a time-separated series of outputs on the presentation of a single input that may or may not be present throughout the sequence. To solve this serial order problem, I will introduce the concept of reafferent inputs, i.e. sensory stimuli which are generated by the agent himself when he performs an action (see section 4.3.2 for details).

Before the PFC layer can exert its bias on the hidden layer, it needs to learn how and when to activate. This can be learned by observation. The PFC layer actively observes the sensory input and the choices made by the SR network. This is what the connections from the input layer and the hidden layer into PFC are for. Unlike the SR component, the PFC can store the information it receives for an extended period of time. Integrating the current state of the network with actions taken in the past, the PFC layer can record a sequence of events and the associated reward. Based on this information the internal state of the PFC can be updated to maximise future reward by putting a bias on the hidden layer at the right time.

Before explaining the implementation details of the PFC layer, the serial order problem needs to be solved. It was suggested that PFC activity roughly stays the same throughout the execution of a learned action plan. In order to execute the plan, at some stage a transformation from a (parallel) activation pattern to a sequential selection of actions is made. Consider the coffee-making example where a sequence of actions leads to a reward in the form of a nice cup of coffee. If this activity has been practised very well, the sight of an empty coffee cup will immediately trigger an action plan.
A pattern of activation is set in PFC to ensure the correct execution of the plan. Possibly, the SR network has already learned that it needs to take an action when the cup is spotted. A number of different actions may be taken at this point, but the PFC can exert some influence on the hidden layer to help it choose the first action in the sequence, picking up the empty cup. Now it is time to take the next action. But the SR network only activates when sensory input is present. Of course there is the sight of a picked-up coffee cup now. But even if we closed our eyes we would still know how to continue the process we just started. In other words, the presence of sensory input is not strictly necessary for taking actions. We need to be able to act even when there is no sensory input. This requires some changes to the SR network.

4.3.2 Extending the stimulus-response network

If we want the network to perform a sequence of actions it needs to be able to act independently of sensory input. In terms of network design this means that the network needs to be activated every single time step, even when no input is available. This is somewhat impractical because activating the network on no input leads to a highly unreliable activation pattern. For example, say we want the network to first select R1 and then R2 after S1 comes on. The first step is shown in figure 4.4a.

Figure 4.4: Trying to learn a sequence in the stimulus-response network

So far so good: the network can learn to select R1 when presented with S1 by strengthening the neural pathway between the two. We get into trouble trying to take the next step, shown in figure 4.4b. Even though there is no input, inhibitory interaction in the hidden layer selects a winner anyway. Hidden neuron H4 is selected as the winner of the inhibitory competition, but there is no good reason for H4 being activated. The subsequent release of DA will enable learning to take place. In this particular situation the connection between H4 and R2 is strengthened. But there is no guarantee that the same hidden neuron will be used next time because there is no support from the input layer.

So how can this problem be solved? It is true that there is no sensory input coming from the environment, or at least no unexpected input is present. On the other hand, in a situation like this, the execution of R1 in the first step provides us with important tactile feedback. Think about picking up a coffee cup. Even with your eyes closed you can feel the cup in your hand. This tactile input is processed by the brain just like any other external sensory event. If the feedback corresponds with your expectations it leads to a state of awareness or confirmation of just having done an action. This awareness is an important aspect of performing an action plan in a nondeterministic environment. In case the confirmation feedback fails to occur it is no use to continue processing the rest of your plan, since its success depends on the correct sequential execution of every step.

I represent the tactile feedback in the neural model by an additional set of inputs. The new network is shown in figure 4.5.

Figure 4.5: Network model with reafferent inputs

Two neurons are added to the input layer, labelled did-R1 and did-R2. They are activated only when the corresponding actions have successfully been carried out in the previous time step. My model assumes that the selection of an action will always lead to successful execution, so every time R1 is selected by the network it will be presented with did-R1 in the next time step; a small sketch of this bookkeeping is given below.
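This is illustrative code for generating the reafferent input between time steps, under the simplifying assumption just stated that execution always succeeds; the input layout is the one of figure 4.5.

// Sketch of reafferent (tactile feedback) input generation. The input vector
// is [S1, S2, did-R1, did-R2]; after the network selects R1 or R2, the matching
// did-R unit is activated on the next time step. Assumes execution always succeeds.
public class ReafferentInput {
    static final int S1 = 0, S2 = 1, DID_R1 = 2, DID_R2 = 3;

    /** Builds the next input pattern from the action just taken (0 = R1, 1 = R2). */
    static double[] nextInput(int actionTaken) {
        double[] next = new double[4];
        next[actionTaken == 0 ? DID_R1 : DID_R2] = 1.0;
        return next;
    }

    public static void main(String[] args) {
        double[] t0 = {1, 0, 0, 0};        // S1 comes on
        int action = 0;                    // network selects R1
        double[] t1 = nextInput(action);   // did-R1 is presented next
        System.out.println("t+1 input: " + java.util.Arrays.toString(t1));
    }
}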
Now the second step in the sequence can be learned by strengthening the pathway between did-R1 and R2.

A second issue that needs some attention is the temporal difference algorithm. For the SR network it was sufficient to do one-step-ahead prediction (see section 3.2) because we only looked at stimulus-response behaviour. Now that we want to learn sequences of actions, a full prediction of future reward is needed to appropriately perform the necessary weight updates. Equation 4.1 shows what the calculation for updating the weights of the DA unit looks like (for details see section 2.5).

w_t^i = w_{t-1}^i + \eta (r_t + \gamma P_t - P_{t-1}) x_{t-1}^i \qquad (4.1)

In order to compute the new weights for the DA unit we need the observed reward (r_t), the values of the current and previous prediction of reward (P_t and P_{t-1}) and the previous input to the system (x_{t-1}^i). Note that the factor r_t is actually the reward given to the result of the previous action. This means that we can already observe it after the network activation in the previous time step. Instead of applying the weight updates at the start of the activation sequence, we can try to set the new weights for the next time step after activation of the network. Applying equation 4.1 at time t + 1, it can be rewritten as follows:

w_{t+1}^i = w_t^i + \eta (r_{t+1} + \gamma P_{t+1} - P_t) x_t^i \qquad (4.2)

Now the factor r_{t+1} is the environmental reward based on the current output of the system. This reward can be observed immediately after the network has been activated and the output selected. The factor P_{t+1} needs some more attention. It stands for the reward prediction in the next time step. For making a prediction of reward based on the current weights of the DA unit, the only variable needed is the input to the system. In a nondeterministic environment there is no way of fully predicting future input values. But is our environment completely nondeterministic? I just introduced the idea of internal tactile feedback, represented in the network by an additional external input. We can be pretty certain that this tactile feedback will follow every action we take. Consider the situation in which the input S1 has a high prediction of reward based on the successfully learned behavioural sequence S1 → {R1,R2}. In the first step the output R1 will be chosen but no DA is released. Before observing the world in the next step, we already know that action R1 has been selected. So even before the actual tactile feedback from our body is received, we can assume with a high degree of certainty that did-R1 will be the next input. Using this knowledge about the next input to the system, a new prediction of reward can be made and the weights of the prediction unit updated; the sketch below spells out this full temporal difference update.
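This extends the one-step-ahead dopamine unit of section 3.2 with the discount term of equation 4.2 and uses the expected reafferent input as the next input. Names and structure are again illustrative rather than the chapter 5 implementation.

// Sketch of the full temporal difference update of equation 4.2. After the
// network has acted, the expected reafferent input (did-R1 or did-R2) is used
// to form the next prediction P_{t+1} before the weights are updated.
public class FullTdUnit {
    private final double[] w;
    private final double eta, gamma;

    FullTdUnit(int nInputs, double eta, double gamma) {
        this.w = new double[nInputs];
        this.eta = eta;
        this.gamma = gamma;
    }

    /** Linear prediction of reward for an input pattern (equation 3.4). */
    double predict(double[] x) {
        double p = 0.0;
        for (int i = 0; i < x.length; i++) p += w[i] * x[i];
        return p;
    }

    /**
     * @param x     input at time t
     * @param xNext expected input at time t+1 (the anticipated did-R unit)
     * @param r     reward observed for the action just taken
     * @return      the temporal difference, used as the DA signal
     */
    double step(double[] x, double[] xNext, double r) {
        double td = r + gamma * predict(xNext) - predict(x);   // equation 4.2
        for (int i = 0; i < x.length; i++) {
            w[i] += eta * td * x[i];
        }
        return td;
    }
}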
4.3.3 Sequence learning without PFC

The two extensions to the SR model, tactile feedback and an updated reward prediction mechanism, already allow the network to learn a sequence of actions. Even better, a computational implementation of the prefrontal cortex has not even been included yet. Is this model then capable of learning sequences without the need for PFC to mediate behaviour? Have a look at the following example, which uses the reward schedule S1 → {R1,R2}.

The first step in learning this sequence, depicted in figure 4.6a, looks familiar. Just like in the SR network without extensions, upon the presentation of stimulus S1 the network activates and selects a winner in the hidden layer. Hidden neuron H1 activates R1 and the associated action is taken. Unlike before, there is no DA activity because there is no reward or expectation of future reward. This means that, at this time, no learning takes place. After the selection and execution of R1, the latter of which is not explicitly modelled, the reafferent stimulus did-R1 comes on. Figure 4.6b shows what happens next.

Figure 4.6: Example of learning a sequence without the aid of the PFC

The network activates once more and randomly selects R2. The sequence S1 → {R1,R2} has now been completed, therefore the DA unit fires. This allows the connections on the pathway from did-R1 to R2 to be strengthened. Also, the DA unit increases the weight of its connection with stimulus did-R1. Effectively, this means that did-R1 has now become an above-average predictor of reward. But the learning process is far from complete; only the last step in the sequence is more likely to be taken correctly. The connections from S1 to R1 have not been strengthened at all. This is where the temporal difference algorithm implemented in the DA unit becomes important. Observe what happens when the network is presented with S1 once more. In figure 4.6c response R1 is randomly selected again. The amount of reward predicted by did-R1 is slightly above average, meaning there now exists a temporal difference. And a temporal difference means a release of DA into the system, which allows learning to take place. The amount of DA released may be small now, but as the network continues to correctly perform the sequence the external reward will become expected and the temporal difference after completing the sequence (as in figure 4.6d) decreases. As a result, the reward prediction of did-R1 increases and so does the temporal difference between R1 and did-R1. This in turn enables more learning of the connections between S1 and R1, until the reward can be fully predicted. At this point no more learning takes place, at least for as long as the network does not make a mistake.

Then why do we need a PFC after all? Have a look at the following example, in which I concurrently use two reward schedules:

• S1 → {R1,R2}
• S2 → {R2,R1}

I assume that the input to the network depicted in figure 4.5 is generated randomly. Consider a situation where the first input has been selected a few times and by chance the network has selected the correct sequence of outputs, such that a reward has been delivered and both the network and the DA unit have updated their weights. The second input has been selected once or twice, but the correct sequence has been performed only once. In this situation the prediction of reward for the first input will be slightly above average, but not as much as the prediction for input did-R1. Figure 4.7 gives an example distribution of the prediction of reward for every input. Remember that an expectation of 0.5 means complete uncertainty. Anything higher than 0.5 means a chance of getting a positive reward at some time in the future. The pathways from did-R1 to R2 are coloured red to indicate a significantly higher weight due to learning.
Figure 4.7: Example situation showing the prediction of reward for every input to the system

Now observe what happens when the network is presented with S2 and it chooses to select R2. This is a good start, but since the reward prediction for did-R2 is very low, the temporal difference is very low as well. In the example situation the temporal difference equals 0.57 − 0.54 = 0.03. The temporal difference is used for learning the connection between S2 and R2. Not much learning will take place before the predictions of reward of S2 and did-R2 have increased. Now observe what happens when the network selects R1 instead of R2. This is obviously not a desirable situation, because the correct output sequence starts with R2, not R1. Looking at the values in figure 4.7, it can be concluded that the temporal difference is now 0.94 − 0.54 = 0.40! This time learning does take place, but the behaviour that is reinforced here is the incorrect behaviour. Unfortunately there is no way of knowing this at the time.

A second example of incorrect behaviour uses the following two reward schedules:

• S1 → {R1,R1}
• S2 → {R1,R2}

Consider the situation in which the first sequence has been performed correctly a few times, but the second one has not. When the network is presented with S2 and R1 is selected (see figure 4.8a), the temporal difference is quite high. By chance this is a good situation and the correct pathways are strengthened, but things turn bad on the next step. Figure 4.8b shows this situation.

Figure 4.8: Example of unlearning correct behaviour

The network has developed a preference for the behaviour did-R1 → R1 and this is likely to happen again this time. Unfortunately this is not the correct behaviour in this situation. The pathway between did-R1 and R1 is weakened again, even though we need it for the sequence S1 → {R1,R1}.

4.4 A final model including a PFC layer

The solution for the problems sketched in section 4.3 is one I have been working towards from the start of this report. I will add a PFC layer to the network. The function of this layer is to bias the pathways in the SR network towards the desired behaviour. In section 4.4.1 I will first look at the architecture of this new layer. Section 4.4.2 discusses the updated learning regime.

4.4.1 The design of a PFC layer

I want the PFC layer to have the same properties observed in neurological experiments on (primate) PFC, to make it as biologically plausible as possible. I try to achieve this by designing a model that has the same characteristics as the biological PFC. An important observation made in section 2.1.4 is that the representation of a plan in PFC remains visible throughout the sequence. This was one of the reasons why a competitive queueing model was insufficient to explain PFC behaviour. In my model, I want the PFC to show sustained activity of a plan representation. The task ascribed to PFC is mediative: it indirectly biases particular pathways involved in behavioural decisions, as in the model of Braver and Cohen. PFC has no direct influence on input or output selection. The neurons in the hidden layer are a good candidate for top-down PFC influence. If PFC exerts a high influence on one particular hidden neuron, specific input-output behaviour can be enforced. Without this bias the network reverts to learned habitual behaviour. Figure 4.9 shows the new network design including a PFC layer connected to the hidden layer.
Figure 4.9: A neural network with PFC layer

The hidden layer now consists of eight neurons. This way every neuron can theoretically be used to represent a unique input-output combination. For display purposes the input and hidden layer are only sparsely connected. Similar to the first design for the SR network, there is exactly one pathway from every input neuron to every output. In chapter 5 I will show that full connectivity between the input and hidden layer yields the same results. There are also eight neurons in the PFC layer. Every one of those neurons is connected to a specific hidden neuron. Before any learning has taken place there will be no activity in PFC, meaning no bias is placed on the hidden layer. Without top-down influence from the PFC the SR component behaves just like the SR network from chapter 3. The added value of the PFC layer becomes clear when demanding tasks like the ones described in section 4.3.3 are given. Recall the task with the reward schedules:

• S1 → {R1,R1}
• S2 → {R1,R2}

Without PFC this task was impossible to learn because the response to the input did-R1 is ambiguous. A decision can only be based on the input value initially presented to the network. A correctly learned pattern of activation over the PFC neurons can bias the network in such a way that this decision can be made without having to remember the initial sensory input. Observe what happens when the network is faced with the same situation, only this time neurons PFC3 and PFC5 are actively biasing the network throughout the sequence. Figure 4.10 shows what happens in the second step of the sequence.

Figure 4.10: PFC biasing correct behaviour

Even though the network has a strong tendency to select R2 whenever did-R1 comes on (represented by the red line), the top-down bias from the PFC layer forces the hidden layer to select a different winner. This allows the connection between did-R1 and R1 to be strengthened too. In the fully trained network, both connections from did-R1 to the hidden layer will have become quite strong and PFC activation will be decisive in the selection of an output.

4.4.2 Learning inside PFC

Using a specific pattern of activation, the PFC can bias the hidden layer towards any behaviour. A question left unanswered is how the PFC can learn this pattern in the first place. To accomplish this, I subdivided the PFC into a bias layer and a history layer. Figure 4.11 gives an overview of the network architecture. The basic structure of the SR network is still intact. The input layer is fully connected to the hidden layer, visualised in the figure by the solid arrow. Instead of one PFC layer, there is now a history layer and a bias layer. The hidden layer connects to the history layer, which is in turn connected to the bias layer. Finally the bias layer reconnects to the hidden layer. In the figure there are two copies of the same bias layer to avoid crossing lines.

Figure 4.11: A neural network with a two-layer PFC component

The bias layer performs the function of biasing the hidden layer in order to enforce a desired output sequence. The PFC layer shown in figure 4.10 is actually a visualisation of only the bias layer. Activation in the bias layer means that some sensory event has triggered a plan. By biasing the hidden neurons the PFC can ensure a correct execution of the action plan, even when the SR network below has a different output selection preference.
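Before turning to the history layer, the effect of the bias layer on the hidden competition can be pictured as an extra term in every hidden neuron's net input, as in the sketch below. This is a simplified view under my own assumptions about activation values and bias strength; in the real model the strength of the bias depends on the gated PFC activity described next.

// Sketch of top-down PFC bias: each hidden neuron's net input is the sum of
// its bottom-up sensory input and the activity of the PFC bias neuron that
// projects to it; the most strongly driven neuron wins the competition.
// Simplified view of the mechanism in figure 4.10.
public class PfcBias {

    /** Returns the index of the hidden neuron that wins the competition. */
    static int selectWinner(double[] bottomUp, double[] pfcBias, double biasWeight) {
        int winner = 0;
        double best = Double.NEGATIVE_INFINITY;
        for (int h = 0; h < bottomUp.length; h++) {
            double net = bottomUp[h] + biasWeight * pfcBias[h];
            if (net > best) { best = net; winner = h; }
        }
        return winner;
    }

    public static void main(String[] args) {
        // did-R1 drives the habitual pathway more strongly, but PFC biases the
        // hidden neuron on an alternative pathway, which then wins instead.
        double[] bottomUp = {0.2, 0.1, 0.8, 0.1, 0.4, 0.1, 0.1, 0.1};
        double[] pfcBias  = {0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0};
        System.out.println("winning hidden neuron: H" + (selectWinner(bottomUp, pfcBias, 0.6) + 1));
    }
}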
The history layer, also embedded in PFC, simply keeps track of things going on in the brain. Every hidden neuron is connected to a specific history neuron. Two extra neurons are present to record the sensory input observed by the input layer. I will call those neurons history-input1 and history-input2 from now on.

[Figure 4.11: A neural network with a two-layer PFC component]

The other history neurons are named history1 through to history8. The task of the history layer is to keep track of sensory events as well as the response of the brain to those events. The neurons are interconnected to allow integration of multiple sources of information. Figure 4.12 shows the connections within this layer.

[Figure 4.12: Connections internal to the history layer]

The two history-input neurons, used to record the sensory input, are both fully connected to the eight history neurons connected to the hidden layer. Those connections are all learnable. A strong connection from one of the two history-input neurons to a history neuron in the layer represents a tendency to select one or more specific hidden neurons when the stimulus comes on. This way, a sensory stimulus can trigger the selection of an action plan in PFC.

To learn the internal PFC connections, no source of information is available other than the signal provided by the DA unit. Recall the shift in DA activation that takes place over time when a sequence of actions consistently leads to subsequent reward. Initially, DA fires when the reward is delivered, but as learning takes place the DA activation shifts towards the earliest reliable predictor of reward. This will always be an external sensory event like S1 or S2, not a reafferent input like did-R1. One of the reasons why the SR network was incapable of learning sequences was that no information about the initial stimulus is available at the time of reward delivery. Unlike the hidden layer, the history layer in PFC has sustained activity, meaning that information can be remembered over time. The two leftmost neurons in the history layer are used to remember the external sensory input that triggered the sequence.

I will use the reward schedule S1 → {R1,R2} to explain the role of the PFC in learning this sequence. Assume that the sequence is performed correctly. Step 1 is shown in figure 4.13a. Stimulus S1 comes on and the network selects R1, going through hidden neuron H1. This activity is recorded by the PFC history layer. After taking the next step, shown in figure 4.13b, the activity from the previous step is still visible in PFC. The selection of did-R1 leads to response R2 and this too is recorded by the PFC history layer. The correct sequence has now been completed and the DA unit activates because of the unexpected reward. As in previous examples, the connection from did-R1 to R2 is strengthened. But this time there are connections in PFC that are strengthened as well; in this case the connections from history-input1 to both history1 and history6. The delivery of a reward tells the brain that the sequence has been successfully completed. The expected reward was delivered, so there is no need to remember the preceding sequence anymore. Activity in PFC will be reset; the only things remaining are the somewhat stronger internal connections. Resetting activity clears the way for trying to receive a new reward. Figure 4.14a shows what happens when S1 comes on again.
This time neurons history1 and history6 are activated simultaneously. There already is a possible plan for getting reward.

[Figure 4.13: Network activation while learning a sequence]

[Figure 4.14: Network activation with plan available]

Observe what happens after the selection of R1 as the first output. As shown in figure 4.14b, the input did-R1 is activated next. In the previous epoch a little bit of learning took place when R2 was selected. The prediction of reward for did-R1 has consequently gone up a bit, and therefore a small temporal difference exists. This means a small release of DA will take place. The gate between the history and bias layer (see figure 4.11) has been closed so far. The small amount of DA released opens this gate a little bit, allowing some of the activity from the history layer into the bias layer. The bias layer now exerts a tiny influence on the hidden layer. The bias may not be very large at the moment, but as the behaviour is learned the temporal differences will grow and so will the bias from the PFC. Eventually the temporal difference will have shifted to the time when the initial stimulus appears. Before taking the first action, a plan is generated in the history layer and, because of the high concentration of DA that opens the gate, copied to the bias layer. The bias on the hidden neurons remains until the sequence is completed.

This all looks quite nice, at least when everything goes exactly according to plan. But what happens when a part of the required action sequence is incorrect? For example, assume that the sequence S1 → {R1,R2} leads to reward but S1 → {R1,R1} is executed instead. After completing the sequence, the activation in the network is as shown in figure 4.15.

[Figure 4.15: Incorrect activation of the second part of the sequence]

The connections from history-input1 to history1 and history5 are weakened. The behaviour S1 → R1 is discouraged even though we need it for getting a reward. Next time, the network is likely to try something different. As expected, this does not yield any result and S1 → R2 will be discouraged too, thus increasing the possibility of S1 → R1 happening again. When the entire sequence is performed correctly for the first time, the temporal difference will be relatively high. By now, the network has learned two things: the selection of R2 is a good option when the initial stimulus was S1, and selecting R1 is not good in this situation. This knowledge is encoded in the strength of the connections in PFC and stored separately from the underlying SR network. This is exactly the kind of information needed to overcome the problems sketched in section 4.3.3.

Chapter 5 Implementation and evaluation of a neural network including PFC layer

In chapter 4 I presented a neural network model of the human brain including the PFC. In this chapter I will show that this model is actually capable of learning sequences of actions. In section 5.1 I will discuss some implementation details, focusing mainly on the graphical user interface. In section 5.2 I will report on a number of performance tests I carried out on the network.

5.1 Implementation

To show that the network I designed in chapter 4 is indeed capable of learning to perform sequences of actions triggered by a single input, I made an implementation in Java. I used the same basic Java framework I designed for the SR network (see section 3.3).
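One detail of this framework worth making concrete is the dopamine-controlled gate between the history and bias layers described in section 4.4.2. The sketch below illustrates the idea under simple assumptions; the class name, the linear gating rule and the clamping are illustrative choices, not the actual implementation.

```java
/** Sketch of the DA-gated copy from the history layer to the bias layer
 *  (section 4.4.2). Names and the linear gating rule are assumptions. */
final class GateSketch {
    /** The temporal difference (the DA burst) determines how far the gate opens. */
    static double[] gate(double[] historyActivity, double temporalDifference) {
        // Clamp the gate opening to [0, 1]; without DA the gate stays closed.
        double opening = Math.max(0.0, Math.min(1.0, temporalDifference));
        double[] bias = new double[historyActivity.length];
        for (int i = 0; i < historyActivity.length; i++) {
            // Only a fraction of the history activity reaches the bias layer;
            // a large DA burst copies the plan almost completely.
            bias[i] = opening * historyActivity[i];
        }
        return bias;
    }
}
```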
A functional overview of the Java classes used for building the network can be found in appendix A. The code consists of a separate network component and a Graphical User Interface (GUI). The GUI only contains code for displaying the network and retrieving user input. The network component provides the neural network functions and carries out the calculations. Two different network architectures can be selected by the user: one for the SR network from chapter 3 and one for the network extended with the PFC layer. Each has a different user interface showing the network architecture.

Figure 5.1 shows what the GUI looks like for the SR network.

[Figure 5.1: Graphical User Interface for the SR network]

The three different layers are visible from left to right. The input layer has two neurons, the hidden layer four and the output layer two. In between the layers, the connections are depicted by a red line. The strength of the connection is printed on top of the line, and the colour of the line also gives an indication of the strength of the connection: the stronger the connection, the deeper the red. Inside every neuron, two numbers are printed. The number printed in black is the current level of activation; the green number is the output value observed at the adjacent neuron. On the very left side of the screen, the user can provide input to the network. By clicking on the number to the left of a neuron, the user can simulate external sensory input. Selection of an input does not yet trigger the network to activate; it only selects the sensory input that will be presented to the network in the next activation cycle. Only one input can be selected at a time, and selecting a new input automatically deselects all others. The two numbers on the very right represent the response chosen by the network.

There is another GUI for the network including the PFC. Figure 5.2 shows the extended GUI. The lower part of the screen shows the SR component. From left to right there is an input layer, hidden layer and output layer, all connected with each other. The input layer has two additional neurons: the upper two are the usual input neurons S1 and S2, the lower two represent the reafferent did-R1 and did-R2 inputs. The topmost part of the GUI shows the neurons residing in PFC. The two neurons on the very left represent the two history-input neurons (see section 4.4.2). The rest of the history neurons are depicted in the centre of the screen. Internal to the history layer there are only connections from the history-input neurons to the other neurons. Initially, those connections have a strength of 0 because there is no learned plan yet. The history layer is in turn connected to the bias layer shown on the right. The gate between the layers is not visualised in the GUI. Also left out are the connections between the hidden layer and the PFC; those are not very interesting because they are static and cannot be learned.

When the user selects an input, the current prediction of all future reward is immediately visible in the upper part of the screen. This is the reward prediction the dopamine unit would use for calculating the temporal difference if the current input were actually chosen. Right next to the dopamine prediction, the most recent temporal difference value is visible. This is the temporal difference value used in the last learning step.
[Figure 5.2: Graphical User Interface for the SR network including PFC]

The buttons on the left and right side of the text fields can be used to initialise a new network or quit the program. The buttons in the lower part of the screen can be used to perform a number of different actions. When the user clicks the run button, the network is activated on the selected input. This simulates the presentation of the selected stimulus to the brain and has it react to the stimulus. Simultaneously, the dopaminergic network is activated on the same stimulus and the output of the DA unit is used to update the strength of every learnable connection in the network.

In order to understand the activation process that takes place in the model, it has been broken up into a number of sequential steps. Upon the manual selection of an input, the DA unit only provides the reward prediction. Selection of a different input provides the prediction for that input. After input selection the user can first simulate an activation cycle by clicking the activate button. Every neuron in the network is now activated and a candidate output is selected. No learning or temporal difference calculation has taken place yet; the network only shows a possible activation pattern, and this pattern is not final. Clicking the activate button once more shows a different possible activation pattern with a different candidate output. During testing this is a useful feature for having the network 'randomly' select a desired output. After activation the user can click the go button to activate the learning and temporal difference processes on the current pattern. The button sequence activate-go has exactly the same consequences as clicking run once. The only difference is that the activation is visualised before learning has taken place.

The learn button is added to enhance the learning effect of the current network activation. Selecting the button allows the user to 'play god' by providing the network with the same DA activation once more, without calculating a new temporal difference. It should only be used for testing purposes, as it is not biologically plausible for learning to take place more than once in a single activation cycle.

Finally there is the clear button. The reason for introducing this button has to do with the temporal difference algorithm. Remember that in order to compute the temporal difference after network activation it was sufficient to know the next input value (see section 4.3.2). In case a sequence of two actions needs to be learned (e.g. S1 → {R1,R2}), the input following the first response will be the reafferent did-R1. After the second response, the sequence should have been completed. Any reward prediction associated with did-R1 is really a prediction for immediate reward. Equation 5.1 shows how the temporal difference was computed.

TD = r_t + γ P_t − P_{t−1}    (5.1)

Assume that the required behaviour has been learned and the network has successfully completed the required sequence. The factor r_t will then be 1 because of the externally provided reward. The behaviour is well practised, so the factor P_{t−1} will be close to 1 as well. In a normal situation the weight updates should be small, because no more learning needs to take place. If it were not for the factor γ P_t (in this case the prediction of reward for did-R2), the previous prediction (P_{t−1}) and the external reward (r_t) would cancel each other out. So upon delivery of an external reward, the next input needs to be ignored.
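As a concrete illustration of how this value could be computed, here is a minimal sketch of equation 5.1 together with the reward-delivery rule just described. The class and parameter names are hypothetical and do not correspond to the actual implementation.

```java
/** Minimal sketch of the temporal-difference computation of equation 5.1.
 *  Class and parameter names are hypothetical, not the project's own code. */
final class TemporalDifferenceSketch {
    static double temporalDifference(double reward,         // r_t, the external reward
                                     double nextPrediction, // P_t, prediction for the next input
                                     double prevPrediction, // P_{t-1}, the previous prediction
                                     double gamma) {        // discount factor
        // Once an external reward has been delivered the sequence is over, so the
        // prediction for the next input is ignored (treated as zero), as argued above.
        double futureTerm = (reward > 0.0) ? 0.0 : gamma * nextPrediction;
        return reward + futureTerm - prevPrediction;
    }
}
```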
This is exactly what happens; whenever an external reward comes in, the next input provided to the DA unit is 0. But what if a sequence of two actions does not lead to subsequent reward? When the system is trained on a set of reward schedules comprised of two actions, it is no use trying to predict reward beyond the second action. The clear button can be used to manually provide zero input to the DA unit. This clears all network activation and breaks the multi-step-ahead learning chain.

5.2 The performance of the final model

In order to show that my network can learn to select sequences by the use of the PFC component, I created a few test scenarios in which the network needs to learn to generate a sequence of two consecutive actions after presentation of a single input. There are two inputs available to the system and two possible actions to be taken. In section 4.3.3 we have seen that there are some cases in which the PFC-less network is incapable of learning the correct mappings. The addition of a PFC layer should theoretically solve these issues. I will compare the performance of the full network for different sets of reward schedules with that of a network with a disabled PFC. My hypothesis is that the PFC-less network shows a much worse performance than the network including PFC.

Biological plausibility is one of my main concerns for the current network design. I already fully connected the input and hidden layers for this reason. I have not looked at the connections between the hidden layer and the output layer yet. Currently, the connections between the layers are perfectly evenly distributed. In reality, such a nicely balanced distribution probably does not exist. To simulate this more realistic situation, I adapted the network such that every hidden neuron is randomly connected to exactly one output neuron during the initialisation of a new network. This means that theoretically there can be only one, or in the worst case even no, connection to one of the outputs. The network must be able to adapt to this situation.

To test the performance of the network at several steps in the learning process, the test comprises two functionally different types of epoch. In a learn epoch, the network is presented with a random input and then activated, followed by an update of the connection weights. Next, the network is activated again using the reafferent input and another weight update takes place. This is exactly what happens in a normal situation: the network receives input and learns from every action it takes. This does not allow an extensive test at a specific time in the learning process. A test epoch therefore consists of the presentation of the first input followed by network activation, but no learning takes place during a test epoch. After selecting an output, the reafferent input is presented and the network is activated again. The same sequence is repeated for the second input. Effectively, the network has now been tested once on both inputs.

The tests I carried out differ only in the reward schedules used. A test starts with 100 test epochs; this gives an indication of the initial performance before any learning has taken place at all. Next, a series of 10 learn epochs is run with random values for the inputs. This is followed by a series of alternations of 100 test epochs and 10 learn epochs until a total of 250 learn epochs has taken place. By then, the system is expected to have achieved its optimal performance.
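For clarity, the schedule above can be summarised in a short sketch. The Net interface and the method names are stand-ins for the real test harness, which is not shown in this report.

```java
/** Sketch of the learn/test schedule used in the performance tests.
 *  The Net interface and method names stand in for the real harness. */
final class TestScheduleSketch {
    interface Net { void learnEpoch(); boolean testEpoch(); }

    static double runSession(Net net) {
        double score = fractionCorrect(net, 100);            // 100 test epochs before any learning
        for (int learned = 0; learned < 250; learned += 10) {
            for (int i = 0; i < 10; i++) net.learnEpoch();   // 10 learn epochs with random inputs
            score = fractionCorrect(net, 100);               // 100 test epochs, no weight updates
        }
        return score;                                        // performance after 250 learn epochs
    }

    static double fractionCorrect(Net net, int epochs) {
        int correct = 0;
        for (int i = 0; i < epochs; i++) {
            if (net.testEpoch()) correct++;                  // one test epoch covers both inputs
        }
        return correct / (double) epochs;
    }
}
```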
A measure of the average network performance is obtained by running 100 sessions of 250 learn epochs intertwined with test epochs. At the start of every session, a new network is initialised with new random initial values for every connection. For the first test, I used the following reward schedules:

• S1 → {R1,R2}
• S2 → {R2,R1}

The problem the network has to face here is that the correct execution of the first of the two schedules affects the second, because the prediction of reward for the reafferent action is independent of the initial stimulus (recall section 4.4.1). The two schedules can therefore not easily be learned independently of each other. Because of the ability of the PFC to remember stimulus activation, it can bias the SR component towards selection of the correct response.

I first ran the aforementioned test on the PFC-less network. As expected, the network did not perform too well. After 250 learn epochs, it selected the correct sequence in less than half of the test epochs. Closer inspection of the test results reveals that in most sessions at least one of the reward schedules has in fact been learned. In rare cases the network has managed to learn the other schedule as well, because of coincidental random successful choices. In some other cases the network did not manage to find any reliably rewarding schedule at all.

Enabling the PFC layer is expected to yield a much better performance. I ran the test again on the network with PFC. The results of this test and the previous one are shown in figure 5.3.

[Figure 5.3: Performance of the network for the first test case. Average number of correct test epochs against number of learn epochs, with and without PFC]

The initial performance is equal to the performance of the PFC-less network. This is also the expected performance of a system choosing actions entirely at random: since only two out of eight possible sequences are correct, the expected fraction of correctly executed sequences is 25%. In the first 40 learn epochs, performance increases for both networks. But even at this early stage in learning, the advantages of the PFC are already clear. The performance of the PFC network keeps increasing while the PFC-less network's performance flattens out and stays well below 50%. It is not until the PFC network reaches a level of 80% that its performance starts to level off.

The second test I ran comprised the following two reward schedules:

• S1 → {R1,R1}
• S2 → {R1,R2}

For comparison, I first ran the test on the PFC-less network. A very poor performance is expected, because without the PFC there is no way for the network to learn which action to associate with the reafferent input did-R1. The correct output is solely dependent on the initial stimulus, which is not remembered in the network. Figure 5.4 confirms this hypothesis. Performance levels never reach anything above the initial performance of the network. After 150 learn epochs, it has resorted to choosing R2 as the first action in the sequence every time, because this reliably leads to not getting reward. With a reward expectation of 0 for every input, this is an undesirable but stable situation in which there are no temporal differences and no learning takes place anymore. Again, the PFC makes a big difference.
[Figure 5.4: Performance of the network for the second test case. Average number of correct test epochs against number of learn epochs, with and without PFC]

After 100 learn epochs, the performance exceeds 75%, and at 150 learn epochs a 90% performance level is reached. The asymptotic 95% performance level is reached after 200 learn epochs.

An important characteristic of human learning is our ability to adapt when circumstances change. When a learned behaviour suddenly fails to get rewarded, it is necessary for the brain to react to the new situation, possibly unlearning current plans and replacing them with updated ones. I created a test in which this situation is simulated by first using the reward schedules from the second test case:

• S1 → {R1,R1}
• S2 → {R1,R2}

After having run the network for 250 learn epochs, the network will show near perfect performance for every input. At this point I replace the reward schedules with the ones from the first test case:

• S1 → {R1,R2}
• S2 → {R2,R1}

From this point on, I tested the network's performance by running a series of alternations of 100 test epochs followed by 10 learn epochs until 250 learn epochs have been run. Both sequences that were rewarded before are not being rewarded anymore, and a different sequence needs to be executed in order to receive the reward. For the network this means that it first has to unlearn the current associations and then learn new connections. As figure 5.5 shows, the performance level immediately after changing the reward schedules is close to 0%. This is expected because the network has had no chance to adapt to the new situation. With the temporal differences being extremely high, the learning curve is steep at the beginning.

[Figure 5.5: Performance of the network facing changing conditions. Average number of correct test epochs against number of learn epochs in the second reward set]

After 20 learn epochs the reward expectation has been adjusted and more random exploratory actions are taken. This slows down the learning rate but allows the new set of reward schedules to be discovered. After around 200 learn epochs an asymptote of 75% is reached.

Chapter 6 Conclusions and further work

The goal of this research project was to create a biologically plausible model of PFC in order to guide a neural network towards learning behavioural sequences. In this chapter I will assess this goal and look at what still needs to be done. Section 6.1 presents a summary of the general line of reasoning I used in this report. In section 6.2 I present the general conclusions of this project. A number of recommendations for improvement are given in section 6.3. Finally, in section 6.4 I present some suggestions for future work.

6.1 Discussion

The network my PFC model is connected to is an abstract simplification of the neural processes taking place in the (human) brain that allow basic stimulus-response behaviour. There is good evidence that the brain implements a form of Hebbian learning on the neural pathways used for guiding behaviour. This learning regime can also be applied to connections in an artificial neural network. This form of unsupervised learning only requires an external feedback signal giving an indication of the effectiveness of the last action.
The activation level of dopamine neurons found throughout the brain seems to encode exactly this. DA activation patterns show a strong resemblance to the computational method of temporal difference learning. I created a three-layer neural network to simulate stimulus-response behaviour. The output of a temporal difference function was used as a parameter for applying Hebbian learning throughout the network. This network was perfectly capable of learning any possible stimulus-response combination. However, human behaviour almost invariably depends on the correct execution of a sequence of actions before a goal or desired situation is achieved. The basic stimulus-response network is incapable of learning or even performing a sequence of actions.

To allow the network to perform behavioural sequences, the notion of reafferent stimuli was introduced. To the network, a reafferent stimulus is regarded like any other stimulus. The characterising difference with a normal stimulus is that it is not generated by an external event, but internally by the brain itself. It is a confirmation of having done a certain action instead of a reaction to an external event. With this extension, the network architecture is suitable for performing sequences of actions. Although the network is quite capable of learning to execute a series of two consecutive actions, the performance drastically goes down when more than one reward schedule involving the same actions needs to be learned concurrently.

Just as the PFC orchestrates complex behaviour in the human brain, the PFC component I created aids the stimulus-response network in learning behavioural sequences, even under more difficult circumstances. I already stressed the importance of biological plausibility for the design of the PFC component. Although it is not (yet) known exactly how it works, researchers have gained an understanding of the important properties of human PFC from neuroimaging experiments on human and primate PFC. Unlike most other brain areas, the PFC displays sustained activity during the learning and execution of a behavioural sequence. This property is also present in my PFC component. The initial stimulus is remembered and, if an execution plan is available, a pattern of activation becomes visible that is kept active throughout the execution of the sequence. Another property of human PFC is that it does not directly control the brain structures responsible for selecting motor outputs. Instead, it biases the same pathways used by the stimulus-response network responsible for highly automated and instinctive behaviour. My artificial PFC model exerts its influence by biasing the hidden neurons in the stimulus-response network as opposed to controlling the output layer. Similar to the biological PFC, the bias is not directly on the output but on the neurons in between the input and output layers. This allows very strong automated behaviour to overrule the influence of the PFC. In a life-threatening situation, this can be critical for survival.

With the addition of the PFC, the network is now capable of learning more complex behavioural sequences of two consecutive actions. The test results show performance levels close to 95% for a fully learned network. Disabling the PFC drastically impairs its capability to learn the more complex reward schedules.

6.2 Conclusions

Before turning to my suggestions for research that could follow on from this work, I will summarise the conclusions I can draw from my research.
Firstly, I explained how a computational model for learning stimulus-response behaviour can be constructed using biologically plausible methods. An artificial neural network provides a biologically founded implementation of a system capable of processing information. Hebbian theory describes a biological mechanism for strengthening and weakening neural connections under the right circumstances. In the brain, the neurotransmitter dopamine plays an important role as well. Researchers such as Wickens [42] have shown that (Hebbian) learning only takes place when DA levels are high. There is a striking similarity between the computational method of temporal difference learning and the timing of DA firing. I used the concept of dopamine-mediated Hebbian learning in a neural network modelling the stimulus-response pathways in the brain.

The second conclusion I can draw is that the aforementioned neural network is indeed capable of learning arbitrary stimulus-response behaviour if it is consistently rewarded at the right time. The results in section 3.3 support this claim. After an initial learning phase, the network showed near perfect performance every time it was run.

My third conclusion is that, although this SR network is capable of learning one sequence, the network performance drops if more than one set of sequences gets rewarded. This was demonstrated by the performance tests in section 5.2. Learning to perform a sequence of actions is crucial for intelligent creatures such as ourselves, because it allows us to suppress instinctive behaviour and consciously reach for more rewarding future goals instead. Behavioural as well as neurological evidence suggests that the PFC is the brain region responsible for orchestrating complex behaviour. A neural model of PFC should enable my network to learn more complex sequential behaviour.

The last and most important conclusion I can draw is that the addition of a PFC to my SR network solves the sequence learning problems in a biologically plausible way. Just like its biological counterpart, the PFC layer I designed records a history of events and integrates this information into an observable pattern of activation. As in the brain, motor actions are still selected by the SR network; the PFC only biases the stimulus-response pathways to bring about the desired behaviour. This allows the brain to react very quickly to an emergency situation.

6.3 Recommendations

The results of the performance tests provide preliminary support for the hypothesis that a neural model of PFC can be successfully developed to guide a neural network towards learning behavioural sequences. The model I developed seems to work nicely under the conditions I created. But before it can be used in a more challenging environment, a number of issues need to be resolved.

It proved to be difficult to implement lateral inhibition in the hidden and output layers. Recall that lateral inhibition is a mechanism for putting the focus on the strongest of a set of input signals while at the same time depressing the other inputs. It brings about winner-take-all behaviour, which is very useful in a noisy and unreliable environment. In my model, I implemented lateral inhibition by creating inhibitory connections from every neuron to every other neuron in the same layer. To compensate for the very low baseline activity of the neurons, I added an excitatory connection from every neuron to itself. The strength of these connections is not learnable and needs to be set to a fixed initial value.
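To illustrate the kind of computation involved, the sketch below shows a settling loop with fixed inhibition and self-excitation strengths. The update rule, the sigmoidal activation and all names are illustrative assumptions rather than the actual implementation; how the loop behaves depends entirely on the two fixed strengths, which is where the tuning problem arises.

```java
/** Illustrative settling loop with fixed lateral inhibition and self-excitation.
 *  The update rule and parameter names are assumptions, not the thesis code. */
final class LateralInhibitionSketch {
    static void settle(double[] act, double[] external,
                       double inhibition, double selfExcitation, int cycles) {
        for (int c = 0; c < cycles; c++) {
            double sum = 0.0;
            for (double a : act) sum += a;
            for (int i = 0; i < act.length; i++) {
                // Each neuron is driven by its external input and its own activity,
                // and inhibited by the summed activity of the other neurons in the layer.
                double input = external[i] + selfExcitation * act[i]
                             - inhibition * (sum - act[i]);
                act[i] = 1.0 / (1.0 + Math.exp(-input)); // sigmoidal activation
            }
        }
    }
}
```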
Setting both the inhibitory and excitatory strength relatively high produces random, unpredictable behaviour: the relative strength of the inputs to the layer is not sufficient to have much influence on the final activation pattern. Setting the inhibitory and excitatory connection strength too low results in an undifferentiated activation pattern in which no clear winner can be selected. Too much self-excitation can produce more than one winner; with too little self-excitation, no winner comes up at all. The amount of inhibition and excitation to apply depends on the number of neurons in the layer. After a change in network configuration, such as additional input or output neurons, the lateral inhibition parameters must be tuned again. The new values can probably be derived from a linear function of the number of neurons. However, the parameters also depend on the number and average strength of the inputs external to the layer. Those cannot be derived linearly, since they in turn depend on the lateral inhibition processes taking place inside the other layers.

Currently, the lateral inhibition parameters are statically set upon initialisation of a new network. An option would be to set them at the start of a network activation cycle, as soon as the average input activation to the layer can be determined. This would work if the network was activated in parallel, i.e. layer by layer. But this is not the case: activation of the network happens by activating the neurons in the network in a random order. The notion of a layer is disregarded during the activation process. By activating every neuron 100 times, the network settles into a stable activation pattern. My suggestion would be to adjust the parameters after every activation cycle. Initially, the adjustments made will be quite radical, but as the network settles, the lateral inhibition parameters will settle into a stable state as well.

The current network design only allows sequences of two consecutive actions to be learned. In a real-life situation, behavioural sequences hardly ever consist of only two actions. In order to have the network learn a sequence of arbitrary length, we probably need more output neurons. Having more output neurons implies a growth of the number of reafferent input neurons as well as the number of hidden and PFC neurons. Although the complexity of the network increases, this should be relatively easy to implement.

A related issue has to do with the temporal differences. In the performance tests run on the network, I manually cleared the reward prediction values after the network had selected two consecutive actions. It makes sense to clear this prediction when a reward is delivered, but what if an unrewarded sequence of two actions has been executed? How can the network know whether the expected reward is still to come or whether it failed to perform the correct sequence? This requires a notion of time-until-reward to be associated with the reward prediction value. There are indications that the DA neurons in the brain do activate in a time-dependent manner. In classical conditioning experiments, animals have been trained to expect a reward when presented with a conditioned stimulus. If after training the expected reward fails to occur, a drop in dopamine activation level can be observed. This drop occurs at the very time that the animal was expecting the reward to come. The exact time of reward delivery thus seems to be encoded in the DA signal.
Although little is yet known about the neural mechanism underlying this time-of-reward encoding, I can use the concept in my model to allow sequences of arbitrary length to be learned correctly. Assume that the lateral inhibition and time-until-reward issues are resolved, and say we want the network to learn the reward schedule S1 → {R1,R1,R2}. A strong connection between S1 and R1 will develop, as well as a strong connection between did-R1 and R1. For the final step in the sequence, a strong connection between did-R1 and R2 is required. Recall why the reafferent input did-R1 was introduced in the first place. We needed a way to activate the network even when no external input is present, so that one action can be generated from the completion of the previous one. The stimulus did-R1 is regarded by the stimulus-response system like any other input, but unlike the other stimuli it is generated by an internal representation of just having done a certain action. We are now faced with the problem that did-R1 requires a different response at a different moment in sequence execution. But are the two did-R1 stimuli really the same? Not entirely: the context in which the stimuli present themselves is somewhat different. Instead of having the reafferent input represent the fact that a particular action has just been carried out, a more complex context representation could be used as input to the system. More research is needed to assess the nature and the biological plausibility of such a context representation.

6.4 Future work

My network design is inspired by Braver and Cohen's model of cognitive control [6]. Where my focus was aimed at the execution of sequences, they focused on the brain's ability to selectively update or keep existing context information. This allows us to attend only to contextually important sensory events while ignoring others. Braver and Cohen theorise that phasic changes in DA activity serve the function of gating information into active memory in PFC. They implement this in their network by means of a gated connection between the stimulus input layer and the context layer (see figure 2.11). The reward prediction layer, which models DA activity, opens the gate when a contextually relevant stimulus is presented. We can compare this with the predictive DA release in the brain whenever an unexpected opportunity for getting reward presents itself.

In my model, I assume that only one stimulus is presented to the network at a time and that this stimulus really is contextually relevant. Similar to the Braver and Cohen design, the DA signal could be used to gate information from the input layer into PFC. When a potentially rewarding stimulus is then presented together with a less relevant stimulus, PFC will activate and bias the hidden layer towards the selection of the rewarding sequence. Upon the presentation of one or more non-rewarding stimuli, connections from the hidden layer to the PFC still allow it to learn by observation. The PFC will not put an unnecessary bias onto the hidden layer, and the stimulus-response network is free to select whatever habitual or random exploratory response it likes, unaffected by expectation.

In a broader context, my model of PFC can be used to explain the behavioural phenomenon of intention recognition. Imagine that an observer has a plan to do action A1 followed by action A2 in context C. Now imagine that this observer is watching another agent in this same context, and the agent performs action A1.
The mirror system hypothesis [31] tells us that the observer's own representation of action A1 will be activated in this situation. If we configure my network to allow activation not only to flow forward from the input neurons but also backward from the output layer to the hidden layer, this will activate the hidden neuron associated with the action the agent would perform if the observer had the same plan. Figure 6.1a shows the network in this situation.

[Figure 6.1 (a, b): intention recognition in the network, showing S1, A1, A2 and did-A1]

This in turn will activate the observer's representation of one component of the PFC plan which he would use to generate that observed behaviour. The observer now has a representation of a part of the agent's plan. A remaining question is how to activate the entire plan in order to actually understand the agent's intentions. This requires a form of pattern auto-completion inside PFC. In [8], Gregory Caza reports on a neural network model of this plan competition behaviour. The general idea is that units which are frequently active together end up activating each other. After training, this partial PFC plan will activate the entire plan using auto-completion of commonly found activation patterns. Figure 6.1b shows the next step in doing intention recognition. Once the complete plan is active and the observer detects the consequences of action A1, the network will activate action A2 of its own accord. Effectively, the observer will use his plan to anticipate the observed agent's next action. This type of anticipation of a likely successor action in premotor mirror areas has indeed been found [10].

Bibliography

[1] W.F. Asaad, G. Rainer, and E.K. Miller. Neural activity in the primate prefrontal cortex during associative learning. Neuron, 21:1399–1407, 1998.
[2] Bruno B. Averbeck, Matthew V. Chafee, David A. Crowe, and Apostolos P. Georgopoulos. Parallel processing of serial movements in prefrontal cortex. Proceedings of the National Academy of Sciences, 99(20):13172–13177, 2002.
[3] H. Barbas and D.N. Pandya. Architecture and intrinsic connections of the prefrontal cortex in the rhesus monkey. Journal of Comparative Neurology, 286:353–375, 1989.
[4] Helen Barbas. Connections underlying the synthesis of cognition, memory, and emotion in primate prefrontal cortices. Brain Research Bulletin, 52:319–330, 2000.
[5] E. A. Berg. A simple objective technique for measuring flexibility in thinking. Journal of General Psychology, page 15, 1948.
[6] Todd S. Braver and Jonathan D. Cohen. On the control of control: The role of dopamine in regulating prefrontal function and working memory. In Stephen Monsell and Jon Driver, editors, Attention and Performance XVIII: Control of Cognitive Processes, pages 713–737. The MIT Press, London, England, 2000.
[7] P. Calabresi, R. Maj, N.B. Mercuri, and G. Bernardi. Coactivation of D1 and D2 dopamine receptors is required for long-term synaptic depression in the striatum. Neuroscience Letters, 142:95–99, August 1992.
[8] Gregory A. Caza. Computational model of plan competition in the prefrontal cortex. In Proceedings of NZCSRSC '07, the Fifth New Zealand Computer Science Research Student Conference, April 2007.
[9] G. Di Chiara and A. Imperato. Drugs abused by humans preferentially increase synaptic dopamine concentrations in the mesolimbic system of freely moving rats. Proceedings of the National Academy of Sciences of the United States of America, 85:5274–5278, 1988.
[10] L Fogassi, P F Ferrari, B Gesierich, S Rozzi, F Chersi, and G Rizzolatti. Parietal lobe: from action organisation to intention understanding. Science, 308:662–667, 2005. [11] J.M. Fuster. Neuron activity related to short-term memory. Science, 173:652–654, August 1971. [12] S. Geyer, M. Matelli, G. Luppino, and K. Zilles. Functional neuroanatomy of the primate isocortical motor system. Anatomy and embryology, 202:443–474, 2000. [13] D.W. Glasspool and Houghton D. Dynamic representation of structural constraints in models of serial behaviour. In J. Bullinaria, D. Glasspool, and G. Houghton, editors, Connectionist Representations, pages 269–282. Springer-Verlag, London, 1997. [14] Melvyn A. Goodale and A. David Milner. Separate visual pathways for perception and action. Trends in Neuroscience, 15:20–25, 1992. [15] D. Hebb. Organization of Behavior. J. Wiley & Sons, New York, 1949. 57 [16] L.J. Kamin. Predictability, surprise, attention and conditioning. In R. Church and B. Campbell, editors, Punishment and Aversive Behavior. Appleton-Century-Crofts, New York, 1969. [17] Raymond M. Klein. Inhibition of return. Trends in Cognitive Sciences, 4:138–147, 2000. [18] T. Ljungberg, P. Apicella, and W. Schultz. Responses of monkey dopamine neurons during learning of behavioral reactions. Journal of neurophysiology, 67:145–163, 1992. [19] P. Maclean. The triune brain, emotion and scientific basis. In F.O. Schmitt, editor, The neurosciences: second study program. Rockefeller University Press, New York, 1970. [20] Earl K. Miller and Jonathan Cohen. An integrative theory of prefrontal cortex function. Annual Revivew Neuroscience, 24:167–202, 2001. [21] E.K. Miller. Neural mechanisms of visual working memory in prefrontal cortex of the macaque. Journal of neuroscience, 16:5154–5167, 1996. [22] Ralph R. Miller, Robert C. Barnet, and Nicholas J. Grahame. Assessment of the rescorla-wagner model. Psychological Bulletin, 117:363–386, May 1995. [23] B. Milner. Effects of different brain lesions on card sorting. the role of the frontal lobes. Archives of neurology, 9:90–100, 1963. [24] J. Mirenowicz and W. Schultz. Importance of unpredictability for reward responses in primate dopamine neurons. Journal of Neurophysiology, 72:1024–1027, 1994. [25] F. Mora and R.D. Myers. Brain self-stimulation: direct evidence for the involvement of dopamine in the prefrontal cortex. Science, 197:1387–1389, September 1977. [26] Randall C. O’Reilly. Generalization in interactive networks: The benefits of inhibitory competition and hebbian learning. Neural Computation, 13:1199–1241, 2001. [27] Ivan P. Pavlov. Conditioned reflexes. Routledge & Kegan Paul, London, 1927. [28] E. Perret. The left frontal lobe of man and the suppression of habitual responses in verbal categorical behaviour. Neuropsychologia, 12:323–330, 1974. [29] M. Petrides and D.N. Pandya. Dorsolateral prefrontal cortex: comparative cytoarchitectonic analysis in the human and the macaque brain and corticocortical connection patterns. European Journal of Neuroscience, 11(3):1011–1036, 1999. [30] R.A. Rescorla and A.R. Wagner. A theory of pavlovian conditioning: Variations in the effectiveness of reinforcement and non-reinforcement. In A.H. Black and W.F. Prokasy, editors, Classical conditioning II: Current research and theory. Appleton-Century-Crofts, New York, 1972. [31] G. Rizzolatti and L. Craighero. The mirror-neuron system. Annual Review of Neuroscience, 27:169– 192, 2004. [32] W. Schultz. 
Responses of midbrain dopamine neurons to behavioral trigger stimuli in the monkey. Journal of neurophysiology, 56:1439–1461, 1986. [33] W Schultz, P Apicella, and T Ljungberg. Responses of monkey dopamine neurons to reward and conditioned stimuli during successive steps of learning a delayed response task. The Journal of Neuroscience, 13:900–913, March 1993. [34] Wolfram Schultz, Peter Dayan, and P. Read Montague. A neural substrate of prediction and reward. Science, 275(5306):1593–1599, March 1997. [35] B Seltzer and D.N. Pandya. Frontal lobe connections of the superior temporal sulcus in the rhesus monkey. Journal of Comparative Neurology, 281:97–113, 1989. 58 – 59 [36] B. Seltzer and D.N. Pandya. Parietal, temporal, and occipita projections to cortex of the superior temporal sulcus in the rhesus monkey: A retrograde tracer study. Journal of Comparative Neurology, 343:445–463, 1994. [37] A.L. Semendeferi, N. Schenker, and H. Damasio. Humans and great apes share a large frontal cortex. Nature neuroscience, 5:272–276, March 2002. [38] J. R. Stroop. Studies of interference in serial verbal reactions. Journal of Experimental Psychology, 18:643–662, 1935. [39] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988. [40] Pere Vendrell, Carme Junque, Jesus Pujol, M. Angeles Jurado, Joan Molet, and Jordan Grafman. The role of prefrontal regions in the stroop task. Neuropsychologia, 33:341–352, March 1995. [41] Ilsun M. White and Steven P. Wise. Rule-dependent neuronal activity in the prefrontal cortex. Experimental brain research, 126:315–335, May 1999. [42] J. R. Wickens, A. J. Begg, and G. W. Arbuthnott. Dopamine reverses the depression of rat corticostriatal synapses which normally follows high-frequency stimulation of cortex in vitro. Neuroscience, 70:1–5, 1996. [43] B. Widrow and M.E. Hoff. Adaptive switching circuits. Institute of Radio Engineers, Western Electronic Show and Convention, Convention Record, part 4, pages 96–104, 1960. July 20, 2007 Appendix A Package network Package Contents Page Classes Connection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 This class represents a connection between two artifical neurons in an artificial neural network. Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 This class represents a layer of neurons in an artificial neural network. Neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 This class represents an artificial neuron in an artificial neural network. 60 network– Connection A.1 Classes A.1.1 C LASS Connection This class represents a connection between two artifical neurons in an artificial neural network. D ECLARATION public class Connection extends java.lang.Object S ERIALIZABLE F IELDS • private double threshold – Threshold for activating this connection. • private double random – Random value to add to this connections weight. • private boolean modifiable – Is the weight of this connection modifiable. • private boolean gated – Is there a (dopamine) gate on this connection . • private int activationtype – Type of activation. Can either be ConnectionType.EXCITATORY or ConnectionType.INHIBITORY. 
• private int location – Location of the connection. Can be either Layer.INTERNAL or Layer.EXTERNAL • private String name – Name of this connection. • private int learningtype – Type of learning to apply for this connection. Can be either ConnectionType.LTP or ConnectionType.LTD. • public Hashtable<Integer,Neuron>neurons – A table containing the neurons on either side of this connection. The table contains a mapping from one neuron to the other as well as mapping from type (INPUT or OUTPUT) to neuron. F IELDS • public Hashtable<Integer,Neuron>neurons – A table containing the neurons on either side of this connection. The table contains a mapping from one neuron to the other as well as mapping from type (INPUT or OUTPUT) to neuron. 61 network– Connection 62 C ONSTRUCTORS • Connection public Connection( network.Neuron direction, int location ) source, network.Neuron target, byte – Usage ∗ Creates a new connection between two artificial neurons. The source neuron is unidirectionally connected to the target neuron. If the connection is bidirectional, the output neuron is unidirectionally connected to the input neuron as well. By default, a modifiable excitatory connection is created. Activation of the source neuron leads to excitation of the target neuron. If the learning function is called on the connection the weight is updated according to a Hebbian learning rule. The default learning type is LTP. The connection has an initial weight of 0.5. – Parameters ∗ inputNeuron M ETHODS • getLocation public int getLocation( ) – Usage ∗ Returns the location of this connection. This can be Layer.INTERNAL or Layer.EXTERNAL. – Returns - the location of this connection. • getName public String getName( ) – Usage ∗ Returns the name of this connection. – Returns - the current name of this connection. • getOutput public double getOutput( int hash ) – Usage ∗ Returns the activation of the source neuron multiplied by the weight of the connection. The provided hash identifies the source neuron. – Parameters ∗ hash - the hash value of the source neuron. – Returns - the activation level observed by the target neuron. • getWeight public double getWeight( ) – Usage ∗ Returns the weight of this connection. – Returns - the current weight of this connection. • initWeight public void initWeight( double – Usage weightCentre, double weightRange ) network– Connection 63 ∗ Initializes the weight of this connection to a random value that lies in the weight range around the weight centre. – Parameters ∗ weightCentre - the value used as the weight centre. ∗ weightRange - the maximum value by which to stray from the centre. • isModifiable public boolean isModifiable( ) – Usage ∗ Returns the modifiability of this connection. – Returns - true if the weight of this connection is modifiable. • learn public void learn( double lr ) – Usage ∗ Updates the weight of this connection by applying a Hebbian learning rule on the output values of the source and target neurons. – Parameters ∗ lr - the learning rate for the Hebbian learning rule. • setActivationType public void setActivationType( int activationtype ) – Usage ∗ Sets the connection type of this connection. The type can be either ConnectionType.EXCITATORY or ConnectionType.INHIBITORY. – Parameters ∗ activationtype - the connection type for this connection. • setLearningType public void setLearningType( int learningtype ) – Usage ∗ Sets the learning type for this connection. The type can be either ConnectionType.LTP or ConnectionType.LTD. 
– Parameters ∗ learningtype - the learning type for this connection. • setModifiable public void setModifiable( boolean modifiable ) – Usage ∗ Sets the modifiability of this connection. – Parameters ∗ modifiable - true if this connection should be learned and consequently have its weight updated. • setName public void setName( String name ) – Usage ∗ Sets the name of this connection. – Parameters ∗ name - the name to be given to this connection. network– Layer 64 • setRandom public void setRandom( double random ) – Usage ∗ Sets the random factor for this connection. If the getOutput() function is called on the connection it will add the random factor to the output. – Parameters ∗ random - a random amount to temporarily add to the weight of this connection. • setThreshold public void setThreshold( double threshold ) – Usage ∗ Sets the threshold value for this connection. If the getOutput() function is called on the connection it will only return a positive value if the output is above the threshold value. – Parameters ∗ threshold - the threshold value to use for this connection. • setWeight public void setWeight( double weight ) – Usage ∗ Sets the weight for this connection – Parameters ∗ weight - the weight for this connection. A.1.2 C LASS Layer This class represents a layer of neurons in an artificial neural network. A layer can be connected to another layer in various ways. D ECLARATION public class Layer extends java.lang.Object S ERIALIZABLE F IELDS • private int activation function – The activation function currently used for activating this layer. • private String name – This layer’s name. network– Layer 65 F IELDS • public static final int FULLY – • public static final int ONE – • public static final int ONEONE – • public static final int ONETWO – • public static final int ONESOME – • public static final int EXTERNAL – • public static final int INTERNAL – • public static final int SELF – • public static final int OTHER – C ONSTRUCTORS • Layer public Layer( String name, int neuronCount ) – Usage ∗ The constructor creates a new layer with the specified number of neurons. – Parameters ∗ name - the name given to this layer. ∗ neuronCount - the number of neurons in this layer. M ETHODS • activate public void activate( ) – Usage network– Layer 66 ∗ Activates the neurons in this layer by activating every single neuron in the layer. Consequently, the outputs of the neurons are set using a ’winner takes all / output one’ strategy. • average public void average( ) • clearActivation public void clearActivation( ) – Usage ∗ Clears the activation of the neurons in this layer. The activation of every neuron is set to 0. • connect public Vector<Connection>connect( network.Layer network.ConnectionType ct ) layer, int type, – Usage ∗ Connects this layer to another layer with full connectivity. – Parameters ∗ layer - the layer to connect this layer to. ∗ type - the connection type for the new layer connection (FULLY, ONEONE or ONETWO) ∗ ct - the ConnectionType Object describing important connection features • getActivation public double getActivation( ) – Usage ∗ Returns the activation of all the neurons in this layer. – Returns - the activation of the neurons in this layer • getConnections public HashSet<Connection>getConnections( ) – Usage ∗ Returns the external connections to and from this layer – Returns - an unordered set of Connection objects containing all the external connections of this layer. network– Layer 67 • getName public String getName( ) – Usage ∗ Returns the name of this layer. 
– Returns - the name of this layer. • getNeuronCount public int getNeuronCount( ) – Usage ∗ Returns the number of neurons in this layer. – Returns - the number of neurons in this layer. • getNeuronSet public HashSet<Neuron>getNeuronSet( ) – Usage ∗ Returns the neurons in this layer. – Returns - an unordered set of Neuron objects containing all the neurons in this layer. • getOutputs public double getOutputs( ) – Usage ∗ Returns the outputs of this layer. – Returns - the current outputs of this layer. • getWeights public Hashtable<String,Double>getWeights( ) – Usage ∗ Returns the weights of all external connections to and from this layer as a hash table. The name of the connection is used as a key for looking up its weight. – Returns - a hash table with the weights of the connections to and from this layer. • initWeights public void initWeights( double – Usage weightCentre, double weightRange ) network– Layer 68 ∗ Initialises the weights of all external connections of this layer to a value that lies in the weight range around the weight centre. – Parameters ∗ the - value used as the weight centre. ∗ the - maximum value by which to stray from the centre. • printActivation public void printActivation( ) • printConnections public void printConnections( ) • selfBias public void selfBias( double weight ) – Usage ∗ Creates a bias unit with excitatory connections to every neuron in this layer. – Parameters ∗ weight - the weight for the connections. • selfConnect public void selfConnect( int type, double weight ) – Usage ∗ Connects this layer to itself with full connectivity using the specified connection type. – Parameters ∗ type - the connection type (inhibitory or excititory). ∗ weight - the weight for the connections. • selfExcite public void selfExcite( double weight ) – Usage ∗ Creates a self-excitatory connection to every neuron in this layer. – Parameters ∗ weight - the weight for the connections. • selfInhibit public void selfInhibit( double weight ) network– Layer 69 – Usage ∗ Creates inhibitory connections from every neuron to every other neuron in this layer. – Parameters ∗ weight - the weight for the connections. • setActivation public void setActivation( double [] activation ) – Usage ∗ Manually sets the activation of the neurons in this layer to the given activation pattern. – Parameters ∗ the - activation pattern set for the neurons in this layer. • setActivationFunction public void setActivationFunction( int activation function ) • setInternalConnections public void setInternalConnections( double weight ) – Usage ∗ Sets all the weights of the internal (inhibitory) connections to the same value. – Parameters ∗ weight - the weight for the connections. • setOutputs public void setOutputs( ) – Usage ∗ Sets the outputs of the neurons in this layer by applying a ’winner takes all / output one’ function on their current activation. • setRandom public void setRandom( double randominput ) • setSelfConnections public void setSelfConnections( double – Usage weight ) network– Neuron ∗ Sets all the weights of the self-excitatory (internal) connections to the same value. – Parameters ∗ weight - the weight for the connections. A.1.3 C LASS Neuron This class represents an artificial neuron in an artificial neural network. Neuron objects can be connected with another Neuron using a Connection. D ECLARATION public class Neuron extends java.lang.Object S ERIALIZABLE F IELDS • private int hash – Hash value of this neuron for use in a hash table. 
• private int activation function – The currently used activation function. • private Vector<Connection>ext connections – External connections to this neuron (from other layers). • private Vector<Connection>int connections – Internal connections to this neuron (from within the same layer). • private String name – Name of this neuron. • private String status info – Status information about the input and activation values. • private StringBuffer buf – Buffer used for status info. F IELDS • public static final int SIGMOID – Use a sigmoidal activation function. • public static final int LINEAR – Use a linear activation function. • public static final int PFC – Use a PFC specific sigmoidal activation function 70 network– Neuron 71 C ONSTRUCTORS • Neuron public Neuron( ) – Usage ∗ Constructs a new Neuron with a sigmoidal activation function. M ETHODS • activate public void activate( ) – Usage ∗ Activates this neuron by computing its activation value. This is done by taking the weighted sum of both the internal and the external connections to this neuron and finally feeding this value to the set activation function. • connect public void connect( network.Connection connection ) – Usage ∗ Associates this neuron with a connection. – Parameters ∗ connection - the connection to associate this neuron with. • getActivation public double getActivation( ) – Usage ∗ Returns the activation of this neuron. – Returns - the most recent activation value of this neuron. • getName public String getName( ) – Usage ∗ Returns the name of this neuron. – Returns - the name that has been given to this neuron. • getOutput public double getOutput( ) – Usage ∗ Returns the latest set output of this neuron. – Returns - the most recent output of this neuron. • getStatus public String getStatus( ) – Usage ∗ Gives some information about the input values received from other neurons and used for activating this neuron. – Returns - a string containing status information about the neuron. • setActivation public void setActivation( double activation ) network– Neuron 72 – Usage ∗ Activates this neuron by manually setting its activation value. – Parameters ∗ activation - the activation to manually set for this neuron. • setActivationFunction public void setActivationFunction( int activation function ) – Usage ∗ Sets the activation function to use for this neuron. Possible activation functions are SIGMOID, LINEAR or PFC. – Parameters ∗ activation function - the activation function to use for this neuron. • setName public void setName( String name ) – Usage ∗ Sets the name of this neuron. – Parameters ∗ name - the preferred name for this neuron. • setOutput public void setOutput( double output ) – Usage ∗ Sets the output of this neuron to a given value. – Parameters ∗ output - the output value to set for this neuron.
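To round off this appendix, here is a hypothetical usage example based on the signatures documented above. The ConnectionType construction, the chosen constants and the weight values are assumptions (ConnectionType itself is referenced but not documented here), so this is a sketch of how the package might be driven rather than code taken from the project.

```java
import network.ConnectionType;
import network.Layer;

/** Hypothetical usage of the network package, based only on the documented API.
 *  The ConnectionType constructor and the parameter values are assumptions. */
public class PackageUsageSketch {
    public static void main(String[] args) {
        // The small SR network from chapter 3: two inputs, four hidden, two outputs.
        Layer input  = new Layer("input", 2);   // S1, S2
        Layer hidden = new Layer("hidden", 4);
        Layer output = new Layer("output", 2);  // R1, R2

        ConnectionType excitatory = new ConnectionType(); // assumed default: excitatory, modifiable

        input.connect(hidden, Layer.FULLY, excitatory);   // learnable input-hidden pathways
        hidden.connect(output, Layer.FULLY, excitatory);  // learnable hidden-output pathways

        hidden.selfInhibit(0.4);    // lateral inhibition for winner-take-all behaviour
        hidden.selfExcite(0.2);     // compensate for the low baseline activity

        input.initWeights(0.5, 0.1);                      // random weights around 0.5
        hidden.initWeights(0.5, 0.1);

        input.setActivation(new double[] {1.0, 0.0});     // present stimulus S1
        hidden.activate();
        output.activate();
        output.printActivation();                         // inspect the selected response
    }
}
```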