A computational model of learned behavioural sequences in prefrontal cortex

Edwin van der Ham

July 20, 2007
Contents

1 Introduction

2 Literature review
  2.1 The function of prefrontal cortex
    2.1.1 The evolution of the human brain
    2.1.2 Behavioural evidence for the importance of the PFC
    2.1.3 Why stimulus-response behaviour is insufficient
    2.1.4 Towards a theory of PFC behaviour
  2.2 Prediction and reward
  2.3 The role of dopamine in reward prediction
  2.4 Dopamine mediated learning
  2.5 The temporal difference algorithm
  2.6 A theory of cognitive control

3 Design of a simple stimulus-response network
  3.1 Architecture
  3.2 Learning stimulus-response behaviour
  3.3 Performance of the network
  3.4 A fully connected version of the network
  3.5 Limitations of the current network

4 A model of behavioural sequences
  4.1 Why do I want to learn behavioural sequences?
  4.2 Existing models of sequences
  4.3 A neural model for learning sequences
    4.3.1 A first impression of the final network
    4.3.2 Extending the stimulus-response network
    4.3.3 Sequence learning without PFC
  4.4 A final model including a PFC layer
    4.4.1 The design of a PFC layer
    4.4.2 Learning inside PFC

5 Implementation and evaluation of a neural network including PFC layer
  5.1 Implementation
  5.2 The performance of the final model

6 Conclusions and further work
  6.1 Discussion
  6.2 Conclusions
  6.3 Recommendations
  6.4 Future work

A Package network
  A.1 Classes
    A.1.1 Class Connection
    A.1.2 Class Layer
    A.1.3 Class Neuron
Chapter 1
Introduction
There is an ongoing scientific effort to understand the working of the human brain. Starting with the
psychological method of introspection, the subject has fascinated us for many years now. Unfortunately it
proves hard to study the one thing we use for thinking in the first place. In more recent years, scientists have
begun to gain a better understanding of the processes that take place in the brain. MRI scanning devices
allow us to see the complex network of neurons and other brain structures which ultimately provides us
with the ability to think and act in a sensible way. This network is so intricate that to date it has proven
to be impossible to comprehend at the level of a single cell how the physical properties of (various parts
of) the brain determine the way we think and act. A lot of today’s brain knowledge is gathered from
experiments with groups of patients having a certain disability or damage to a specific part of the brain.
These experiments have led to a basic understanding of the functional responsibilities of different parts of
the brain.
More recently, the field of artificial intelligence (AI) arose within computer science. The ultimate
goal in AI is to create a machine that is capable of mimicking human behaviour so flawlessly that the
average person will not be able to tell the difference from a behavioural perspective. Initially, techniques
devised in other areas of computer science were used to try to accomplish this goal. However, a major
problem with traditional computer models is that they can only do exactly what they were designed to do.
Unlike a human, a traditional computer system does not perform very well in a novel situation. With the
progress made in neurophysiological brain research, an improved understanding of the physical properties
of brain processes arose. As a result, an effort started to simulate those processes in an artificial computer
model. Such a model is expected to be much more flexible than the more traditional approach.
The human brain is an extremely complex system containing a large number of small building blocks
called neurons. A neuron is a small cell in the nervous system that can generate a tiny electrical current.
Apart from generating an electrical pulse it is also sensitive to pulses generated by adjacent neurons. The
excitation of a neuron can lead to a rise in its action potential causing it to fire another electrical pulse.
A group of neurons communicates by means of these electrical pulses, herewith creating communication
pathways trough the brain. Based on neurons found in the brain of humans and other animals, computer
scientists have created an artificial counterpart to the biological neuron. An artificial neural network provides a new and biologically founded method for modeling all kinds of processes including those that take
place in the brain.
A large portion of the brain is concerned with processing information received through the body’s
sensory system. Information from various sources is processed and combined to finally select an action to
take. But performing a single action is not enough. In order to reach a desired goal, often a whole series of
actions needs to be performed. A part of the brain believed to have a coordinating influence on the stimulus-response system is the prefrontal cortex (PFC). Damage to the PFC strongly limits our ability to consciously
control behaviour. In order to understand how it is that the brain can guide our behaviour such that novel
tasks can be completed successfully, we need to understand how the PFC influences the stimulus-response
pathways responsible for the selection of actions to take. Artificial neural networks have been devised
that are capable of learning a mapping between input and output without programming it specifically for
the task at hand. Such a network can be seen as a very high-level abstraction of the stimulus-response
pathways running through the brain. However, networks like this are usually only capable of learning
single input-output combinations. It proves much harder to learn sequences of actions.
My hypothesis is that a computational model of the PFC can provide a mechanism for guiding a
stimulus-response network towards desired behaviour, the same way that the PFC in the brain coordinates
and orchestrates behaviour. With the guiding influence of the PFC model, an existing artificial neural network, currently incapable of learning behavioural sequences, could be enabled to flexibly select a sequence
of actions. Since the objective is to gain a better understanding of the human brain, biological plausibility
is a major concern. The main question I will answer in this report is therefore:
How can a biologically plausible model of PFC be developed to guide a neural network towards
learning behavioural sequences?
The first goal of my research is to develop a computational model of both the PFC and the stimulus-response pathways in the brain. By itself, the latter model should be able to learn simple input-output
combinations. More complicated behaviour such as performing a sequence of actions is much harder to
learn and requires the guiding influence of the PFC model. Without the PFC, the stimulus-response model
is expected to perform poorly and prove incapable of learning a sequence of actions. The objective is for
both models to be consistent with results from brain studies, i.e. the characteristics of the computational
model must correspond to the general characteristics of the brain structures involved.
Without a basic understanding of brain functioning it becomes difficult to prove the biological plausibility of an artificial network. Therefore, the first step in my research comprises a literature study on the
neurophysiological properties of the human brain. In this study, I will give special attention to the PFC
and its effects on the ability to learn behaviour. I will also look at existing computational models of human
behaviour, especially those that exhibit the functional properties assigned to PFC. The next step in my
research will be to create and implement a very simple, highly abstracted model of the stimulus-response
pathways in the brain that allow us to react to sensory input. This model will be capable of learning only
basic stimulus-response behaviour. The final and most challenging task will be to implement a biologically
plausible, computational model of the PFC. Connecting the PFC model with the stimulus-response network
should enable it to learn behavioural sequences as well.
This report starts with a review of a number of relevant neuroscientific topics in chapter 2. In chapter 3, I
will build up the neural model used for learning stimulus-response behaviour. The results of a performance
test on this model are also presented here. In chapter 4, I will extend the existing model to include a
PFC component. The PFC can be switched off to simulate what happens when the PFC does not function
any more. Chapter 5 reports on the results of a number of different tests I created to show the difference
between the two situations. Finally, in chapter 6, I will present the final conclusions and the further work that
needs to be carried out.
Chapter 2
Literature review
To understand how a model of prefrontal cortex can contribute to an understanding of human behaviour,
I will discuss some relevant literature on the subject of neuroscience. The most influential writer for my
work is Jonathan D. Cohen. Together with Todd Braver he introduced a model of cognitive control [6] incorporating the PFC and dopaminergic brain systems. In section 2.1 I start by looking at the crucial role of
the PFC in human behaviour. Section 2.2 explains why reward prediction is particularly important. In the
human brain, dopaminergic (DA) neurons seem to encode a prediction of reward as I will explain in section 2.3. Sections 2.4 and 2.5 talk about how the DA signal can be used to mediate learning. In section 2.6
the neurophysiological findings are put together to form an integrated theory of cognitive control.
2.1 The function of prefrontal cortex
Miller and Cohen pose the question of how coordinated, purposeful behaviour can arise from the distributed
activity of billions of neurons in the brain [20]. This question has been around since neuroscience first
began and will probably not be fully answered in the near future. But there is certainly progress in our
understanding of the functions that different parts of the brain exhibit. A fairly simple and low-level way
of describing animal behaviour is by stimulus-response mapping. In such a model any particular set of
stimuli leads to a predetermined output. For simple animals with a relatively low number of neurons,
such a model seems quite capable of describing and predicting behaviour. But larger animals, including
primates such as ourselves, show a much more complex behaviour that cannot be accounted for by simple
stimulus-response mappings. The property that distinguishes humans most from animals is the ability to
control their behaviour in such a way as to accomplish a higher-level goal. It is widely recognised that
the PFC plays a very important role in this. To understand why, in section 2.1.1 I will first have a look at
how the human brain has evolved as a result of evolutionary processes. Section 2.1.2 provides behavioural
evidence for the importance of the PFC. In section 2.1.3 I present a simple model of stimulus-response
behaviour which explains why the PFC is necessary to control behaviour. Finally, in section 2.1.4 I assess
the plausibility of an extended model of cognitive control including the PFC.
2.1.1 The evolution of the human brain
In 1970, neurologist Paul MacLean proposed a model of the human brain that he called the ‘triune brain’ [19].
According to this model, the human brain can be divided into three main components, as figure 2.1 shows.
The oldest part of the brain from an evolutionary perspective is the brain stem. Located deep inside the
skull together with the cerebellum, it is also called the reptilian brain. This part of a human brain is similar to the brain of a reptile and is responsible for vital body functions such as heart rate, breathing, body
temperature and balance. The first mammals developed a new structure on top of the brain stem, called the
limbic brain. This part includes areas such as the hippocampus, the amygdala, and the hypothalamus. It is
mainly concerned with emotion and instincts. The newest part of the brain, also found in mammals, is the
neocortex. Neocortex is traditionally thought of as responsible for higher cognitive functions and is only
found in higher mammals.

Figure 2.1: The triune brain (from http://www.ascd.org/)

In most animals it comprises only a small part of the total brain area. Primates,
however, have a much bigger neocortical area. In the human brain, it even takes up to two-thirds of the total
brain mass. Human skills such as language and reasoning rely heavily on the neocortex. The cortex can be
functionally subdivided into a number of regions as shown in figure 2.2.
Especially the prefrontal cortex, an area in the anterior part of frontal cortex, is much more complex
in humans than it is in any other mammal. Recent studies have found that although humans have a larger
frontal area than any other mammal, its relative size is not larger than that of our closest relatives, the great
apes [37]. Yet, in terms of complexity and interconnectivity with other areas, the human frontal cortex is
superior.
2.1.2 Behavioural evidence for the importance of the PFC
There are various sources of behavioural evidence for the crucial role of the PFC in higher-level human
behaviour. A well-known task in which top-down control over behaviour is necessary in order to achieve
a good result is the Stroop task [38]. In this task, subjects are presented with words that are the names of
colours (red, blue, green etc.). All the words are written in a colour that does not necessarily correspond
to the correct interpretation of the word. Figure 2.3 shows a few typical examples of such words. The
task of the subject is to name the colour in which the word is written and ignore the word itself. Our
automatic response is to read and interpret any word we see. Attending to the colour in which it is written
requires our brain to suppress the automated stimulus-response behaviour that will lead to simply reading
the word. Although it requires extra attentional effort, most people will be able to do this task without many
mistakes. Patients with damage to the PFC area are known to have much more difficulty with (variations
of) this task [28] [40].
A similar task where frontally impaired people show very low scores is the Wisconsin Card Sorting
Task (WCST) [5]. In this task, subjects are shown cards with symbols on them that vary either in number,
shape or colour. These cards must be sorted according to one of those three dimensions. This by itself
is not so difficult, but the rule by which to sort the cards changes periodically, meaning that the currently
pursued goal has to be abandoned and a new one adopted. Humans with PFC damage are quite capable of
applying the initial rule, regardless of the selected dimension. However, when they have to change to a new
rule they are unable to do so and usually continue sorting according to the initially learned mapping [23].
Figure 2.2: Functional subdivision of the cortex (from http://universe-review.ca/)

Figure 2.3: Example of the Stroop task
2.1.3 Why stimulus-response behaviour is insufficient
Experiments such as the Stroop task and the WCST show that the PFC is particularly important when there
is more than one possible response to a given set of stimuli. For example, think of the everyday activity of
crossing a street. If you are born in a western European country such as Holland, your automatic response
will be to first look to the left and then to the right to see if there are any cars coming towards you. This is
something that does not require any attention, it is a completely automated behaviour.
Figure 2.4: The stimulus-response pathway for crossing the street in Holland (from [20])

Figure 2.4 gives a schematic impression of the neural brain pathways in the situation where standing on
the kerb of a road (stimulus S1) leads to looking left (response R1). In this figure, stimulus S2 represents a
set of sensory stimuli associated with being in Holland. These might be street name plates written in Dutch
or people on the streets speaking Dutch. The thick line between S1 and R1 represents the strong tendency
to look left before crossing a street. The red lines indicate the currently active pathway from stimulus to
response. But first looking to the left is not appropriate when you find yourself in a country such as New
Zealand where the cars drive on the left-hand side of the road instead. In this country we need to change
our behaviour to first look to the right. Stimulus S3 represents a set of stimuli associated with being in New
Zealand (note that S2 and S3 can not occur at the same time).
Figure 2.5: Crossing the street during your vacation to New Zealand.

Figure 2.5 shows what happens when you go on a vacation to New Zealand. Initially, there is still a
strong tendency to look left before crossing the street because this is a highly automated behaviour. New
Zealand associated stimuli are not influential enough to force you to look right (response R2) instead.
However, after a few days you may find that you do not start by looking left any more. One possible way
of explaining this is that the strong connection between S1 and R1 is unlearnt and a connection from S1
to R2 is learned instead. There are two reasons why this is not very likely. First, an automated behaviour
is something that takes a reasonable amount of time to develop. It would be a whole lot easier to learn to
drive if this was not true. Second, when you return from your vacation you can very quickly reestablish
the habit of looking left. It appears that both pathways stay intact and that a separate control mechanism is
able to switch between the two pathways.
Figure 2.6: PFC mediating the flow of activity in order to get the correct response.

A different explanation of how the learning has taken place is through the use of the PFC. Figure 2.6
shows the situation in which you have learned to change your behaviour during your visit to New Zealand.
Connections within PFC have formed that bias the stimulus-response pathways to select the appropriate
behaviour. When you return to Holland, you will have no problems crossing the street, because the automatic stimulus-response behaviour of looking left still exists. On top of this, the PFC helps to select the
right pathways too, as shown in figure 2.7. Interestingly, when you go back to New Zealand a few months
later, you may find it much easier to revert to the correct initial response of looking right. You only have to
turn on the correct representation in PFC to obtain the right behaviour.
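To make this idea slightly more concrete, the following fragment sketches the biasing effect of PFC in its simplest possible form. The class, the numbers and the additive combination of support and bias are purely illustrative choices of my own, not taken from [20]: each response receives support from its stimulus-response pathway plus a bias from the currently active PFC representation, and the response with the strongest total support is selected.

    // Illustrative sketch only: each candidate response is supported by its
    // stimulus-response weight plus a bias from the currently active PFC
    // representation; the response with the most support wins the competition.
    public final class BiasedCompetition {
        public static int selectResponse(double[] srSupport, double[] pfcBias) {
            int winner = 0;
            double best = Double.NEGATIVE_INFINITY;
            for (int r = 0; r < srSupport.length; r++) {
                double support = srSupport[r] + pfcBias[r];
                if (support > best) {
                    best = support;
                    winner = r;
                }
            }
            return winner;
        }

        public static void main(String[] args) {
            // Habitual support strongly favours R1 (look left), but a PFC bias
            // for the 'New Zealand' representation tips the competition to R2.
            double[] srSupport = {1.0, 0.2};   // support for R1, R2
            double[] pfcBias   = {0.0, 1.5};   // bias from the active PFC representation
            System.out.println("Selected response: R" + (selectResponse(srSupport, pfcBias) + 1));
        }
    }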
2.1.4 Towards a theory of PFC behaviour
Such an abstract picture of stimulus-response pathways might explain the basic idea, but how exactly does
this work in our brain? A fundamental principle of neural processing is that processing in the brain is
competitive. A pattern of activation over numerous input dimensions activates different pathways that all
compete for expression in behaviour. The pathways with the strongest support will win the competition
and exert control over areas that contribute to external behaviour. Miller and Cohen [20] have developed
a theory that extends the notion of biased competition to a control mechanism for PFC. To assess the
plausibility of their model, they define a minimal set of functional properties that a system must exhibit if
it can serve as a mechanism of cognitive control. Neural findings suggest that the PFC conforms to all of
those properties. I will give a short summary of the most important properties and the supporting evidence.
Figure 2.7: PFC still mediating the flow of activity back home

Firstly, a system capable of controlling behaviour must be able to access and influence a wide range
of information in other brain regions. As I mentioned in section 2.1.1, the PFC is the newest area in the
brain from an anatomical perspective. During its evolution it formed direct connections to almost every
other cortical area. This way it can receive input from and exert control over virtually all sensory and
motor systems. Using tracing techniques in the brain of rhesus monkeys, researchers have found numerous
connections from a wide range of sensory systems into the frontal cortex as well as connections from
the PFC to premotor systems. For example, the parietal lobe, processing somatosensory information, has
projections to the frontal cortex [29]. The superior temporal sulcus, involved in integrating somatosensory,
auditory and visual information [36] connects with the frontal lobe [35]. Most of the connections do not
come from the primary sensory cortices but only from the secondary parts, so the information flowing into
the cortex is not raw sensory information. Information is delivered to different prefrontal areas which are in
turn linked to motor control structures related to their specific function [4]. Again, most of the connections
link to higher-level control structures [12]. On top of this wide range of connections to other brain areas,
there is an extensive amount of connections that connect one PFC area to another [3], suggesting that
different areas within PFC are capable of sharing and intermixing information. This would be required
from a system capable of producing complex coordinated behaviour.
Secondly, if the PFC is responsible for selecting plans and goals, the neural activity pattern observed
should reflect the current plan and stay roughly the same as long as this plan is used. Coming back to my
previous example, this means that a similar pattern of PFC activity should occur every time you want to
cross a street. This same pattern is visible when crossing a street during your vacation to New Zealand.
After a few days a different pattern might have developed, but when you go back to Holland the initial
pattern immediately reoccurs. Asaad et al. [1] performed experiments in which a monkey learns to associate
a visually presented cue with a saccade to the left or right. About 60% of the recorded neurons showed
activity that depended on both the cue and the saccade direction. But the activity of only 16% of those
neurons could be explained by a straightforward linear addition of both the input and cue activation. This
gives evidence for the fact that it is not merely input-output associations that are represented, but more
complex patterns representing a particular plan. In another experiment [41], a monkey viewed a video
screen on which four light spots were visible: right, left, up and down from the centre. On the slight
dimming of one of the four spots, the monkey had to foveate to that spot. Before the dimming, a visual cue
appeared according to either a spatial or a conditional rule. The conditional cue involved a letter associated
with one of the four spots. In the spatial condition, the location of the letter on the screen indicated the
light spot and the identity of the letter was unimportant. During the task, the activity of a sample of 311
prefrontal neurons was measured of which 221 neurons showed task-related activity. Between 33 and 50
percent of the task-related neurons showed statistically significant differences that could be attributed to
the rule the monkey was using. This result gives a good indication that there is a significant number of
neurons in PFC that encode the rules that are necessary for the task at hand.
Thirdly, the activity of PFC neurons must be resistant to wholesale updating by immediate sensory
stimuli. Being able to represent plans and goals is a good start, but it is not very useful if they are updated
at every opportunity. For example, say there is an ice cream shop on your way to work. If you changed
your plan from going to work to having an ice cream every time you passed the shop, you would never
arrive at work. The current plan in memory needs to be protected against interference from other plans until
the goal is actually reached, lest chaotic, unordered behaviour prevail. On the other hand, assume you
did actually arrive at work and adopted the plan to carry out some work. Then you suddenly notice that the
office is on fire. This time, you definitely want to abandon your current plan and make a run for it instead of
first trying to reach your current goal of getting your work done. Although the plans and goals in PFC must
be protected against distractions, there must be a flexibility to update them when necessary. Fuster [11]
was one of the first researchers to show that neurons in prefrontal cortex show sustained activity during
the delay period in a delayed response test. In a delayed response test, a cue-response pairing is learned
and a delay is introduced between the cue onset and the time of the response that could subsequently lead
to reward. Monkeys showed a performance of nearly 100% on a delayed response test when the delay
was somewhere between 15 and 30 seconds. This shows that they are able to keep a plan in memory for
some time after developing it. Miller and Desimone carried out a delayed matching to sample task in
which a stimulus is presented to a monkey that must be matched to forthcoming stimuli [21]. In order to
get a reward, a response was required on the first matching stimulus, meaning that distracting intervening
stimuli must be ignored. They found that half of the recorded cells in PFC showed selectivity for whether
the sample matched the test stimulus. Furthermore, the activity of these neurons was sustained throughout
the trial.
It seems that there is enough evidence suggesting that the PFC is capable of playing an important role
in controlling behaviour. But if PFC controls behaviour, who or what controls the PFC? For any control
theory to be successful, the controller must be able to learn by itself without having to rely on a hidden
‘homunculus’ to explain its behaviour. The remaining question therefore is how the PFC ‘knows’ when
to update its representations in order to change the current plan or goal. Miller and Cohen suggest that a
mechanism of prediction and reward can be used to model its behaviour. To understand how this works, I
first take a few steps back and explain some of the fundamental ideas about prediction and reward.
2.2 Prediction and reward
Being able to predict future events has been a critical factor in the development and survival of animals
throughout history. If a creature is unable to find food or escape predators it has a very small chance of
survival. It is clear that random behaviour is not the best way of getting around. In animals, behaviour
is generally guided by something that can be referred to as reward. Reward is a concept for the intrinsic
positive value of an object, a behavioural act or an internal physical state. It represents something that is
generally good or satisfactory. Being able to predict future reward is therefore a very valuable skill that all
animals must possess in order to survive in the world.
There is a clear connection between prediction and reward. This is shown in a wide variety of conditioning experiments in which arbitrary stimuli with no intrinsic reward value become associated with
rewarding objects. This effect was first described by Ivan Pavlov [27] and is known as Pavlovian or classical conditioning. Pavlov observed that dogs start producing saliva whenever they see food. The
food is called an unconditioned stimulus (US) because no conditioning has taken place to associate this
stimulus with a particular response, in this case salivation. Pavlov predicted that if a particular stimulus
would be present whenever the dog was presented with the food, this stimulus would become associated
with the food and therefore trigger the dog to produce saliva. I will refer to the stimulus that will be used
to condition a response as the neutral stimulus (NS). After conditioning, this (arbitrary) stimulus is called
a conditioned stimulus (CS) because it has no natural association with reward but it has proven to reliably
predict reward under certain conditions. In this situation a CS-US pair has developed, meaning that whenever the CS is present, any natural response associated with the US will also be triggered by the CS which
comes on earlier in time.
Some theories suggest that this learning process is triggered by the initial unpredictability of the reward
by the NS. A very influential model is the Rescorla-Wagner model of classical conditioning [30]. This
theory says that learning takes place whenever there exists a discrepancy between the expectation about
what would happen and what actually happens. Before a CS-US pair has developed, no prediction of
reward is associated with the NS. When after a delay a US comes on and reward is delivered, a discrepancy
between the prediction made by the stimulus (no reward) and the actual reward exists. What this means is
that the stimulus might actually predict a future reward. Therefore, a small amount of predictive power is
associated with the stimulus. After repeated presentations of the NS with subsequent reward, it becomes
conditioned and a CS-US pair develops. The CS now fully predicts the forthcoming reward and it also
triggers the natural response to the associated US.
An interesting property of this model is its ability to explain the behavioural phenomenon of ‘blocking’.
This is demonstrated by an experiment in which a rat learns that food will be delivered whenever a light
comes on [16]. When an extra cue, in this case an additional sound, is presented together with the light, this
second NS will not become a CS. Apparently, the additional sound does not add any predictive information
about the forthcoming reward. The Rescorla-Wagner model successfully predicts this behaviour because
there is no discrepancy between the predicted reward at the time the sound is presented (because of the
light, there will be a reward anyway) and the actually delivered reward.
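For completeness, the standard formalisation of the Rescorla-Wagner rule makes the blocking result easy to see. The symbols below (V, lambda, alpha, beta) are the conventional ones associated with [30] and are not used elsewhere in this report; this is only an illustrative aside.

    \Delta V_X = \alpha_X \, \beta \left( \lambda - \sum_i V_i \right)

Here V_X is the associative (predictive) strength of stimulus X, lambda is the maximum strength the US can support, and alpha_X and beta are salience and learning-rate parameters. Once the light alone fully predicts the reward (V_light = lambda), presenting light and sound together gives Delta V_sound = alpha beta (lambda - V_light) = 0, so the sound acquires no predictive strength, which is exactly the blocking effect described above.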
Despite its enormous success in predicting previously unexplained behaviour, there are
still some phenomena that are predicted incorrectly by the Rescorla-Wagner model [22]. Second-order
conditioning, for example, is something this model does not account for. Assume the situation we had
before where a reward is delivered whenever a light comes on. Next, we use the light as a US to pair
with the sound. However, this time the light is not followed by a subsequent reward. The model predicts
that a negative association between the sound and the light should develop, because of the absence of
the expected reward associated with the light. However, in an experimental situation, a positive association
between the sound and the reward usually develops. This is just one example of a number of phenomena the
Rescorla-Wagner model fails to explain. Nevertheless, it is still widely used because of its computational
simplicity.
Classical conditioning experiments demonstrate the existence of some kind of reward prediction system in the
brain. However, they do not tell us about the nature of this system. It is assumed that a neurotransmitter called dopamine is closely associated with the human reward prediction mechanism. In section 2.3,
I look at the supposed function of this neurotransmitter in the brain and why it is thought to be involved in
reward prediction.
2.3 The role of dopamine in reward prediction
Dopamine (DA) is a chemical that is naturally produced in the human brain. It functions as a neurotransmitter, meaning that it can activate particular pathways of neurons, also referred to as the dopaminergic
system. Neurons sensitive to DA that make up the dopaminergic system are sometimes called DA neurons.
The system influences parts of the brain involved in motivation and goal-directed behaviour. Evidence for
the involvement of DA in the PFC was found in a self-stimulation experiment on rats [25]. Dopamine levels significantly increased when the rats pushed a lever to obtain an electrical pulse delivered to the medial
prefrontal cortex. Later it was discovered that DA neurons specifically respond to rewarding events such
as the delivery of food as well as to conditioned stimuli associated with a future reward [32]. DA neurons
respond to a wide range of somatosensory, visual and auditory stimuli. They do not seem to discriminate
according to the nature of the sensory cue, but merely distinguish between rewarding and non-rewarding
cues [33].
Another indication for the involvement of DA in reward systems comes from research on drugs like
amphetamine and cocaine. Those and other stimulating drugs (ab)used by humans were found to have a
positive effect on the dopamine concentrations in the mesolimbic system of rats [9]. Since DA neurons
naturally respond to events that predict reward, the intake of the drug signals an enormous upcoming
reward to the human body. Unfortunately for the addict, this reward fails to occur and it makes the subject
feel miserable after the influence of the drug has ceased. Normally, because of this discrepancy between
predicted reward and delivered reward, the human body would make sure that next time the predictive cue
comes on, no DA will be delivered. However, because of the direct influence of the drugs on the dopamine
system, the body is forced to release more dopamine, again incorrectly signalling reward. This explains
the addictive effect those drugs have on the human body.
Various experiments have been conducted to find out how DA neurons respond in different situations.
As was expected, they respond to both unexpected rewards as well as predictive cues. According to the
theory of classical conditioning, learning takes place in the face of prediction and reward. Before any
cue-response pairings have developed, DA neurons mostly respond to the (unexpected) reward. After
successful training with a particular cue-response pair, the neurons come to respond to the cue more than
to the now predicted reward [18]. These results were replicated in an experiment on monkeys, performed
to find out how the activity of dopamine neurons changed during a delayed response learning task [33].
During learning, 25% of the recorded neurons responded to the delivery of a liquid reward. After learning,
only 9% of the neurons were activated by reward. In a later experiment, it was shown that this dopamine
response is related to the temporal unpredictability of an upcoming reward [24].
Figure 2.8: Dopamine activity upon delivery of an unexpected reward

Figure 2.8 depicts the typical activity of DA neurons in the presence of an unexpected reward. Just after
reward delivery, DA neurons respond to this unexpected event by increased activity for a short period of
time. When the US shown in this picture is repeatedly presented at the same time before reward delivery,
it will become a predictor of reward according to the theory of classical conditioning. Figure 2.9 shows
what happens. While dopamine neurons initially responded to the reward, they have now come to respond
to the CS instead. This will only happen when the reward is consistently delivered at the same time after
cue onset. After training, if a reward is delivered earlier than expected, dopamine neurons do respond to
this unexpected reward. When an expected reward is not delivered at all, a decrease in DA activity occurs
at the time the reward was expected to occur. This is shown in figure 2.10 where a CS is not followed by
reward. There is an increase in activity just after CS onset, but at the expected time of reward delivery
activity drops below baseline level.
Figure 2.9: Dopamine responding to a CS in the case of an expected reward

Figure 2.10: Failure of occurrence of an expected reward

According to the Rescorla-Wagner model of classical conditioning, learning takes place whenever
there exists a discrepancy between the expectation about what would happen and what actually happens.
Dopamine neurons seem to provide information about exactly this discrepancy. Given the involvement of
DA in the PFC, it is highly likely that it serves to enable learning in the PFC. In section 2.4 I will explain
the neural mechanisms by which connections between cortical cells are strengthened or weakened in the
presence of DA. Another question left unanswered is how DA neurons are able to learn the correct timing. There is however a well-established algorithm by which artificial systems can learn to predict reward,
called the temporal difference algorithm. In section 2.5 I will explain how this algorithm works. There is
a striking similarity between the temporal difference given by the algorithm at any time and the timing of
midbrain dopamine neuron firing [34]. It is therefore suggested that this algorithm could be successfully
used for a biologically plausible model of DA activity. In section 2.6 I present the model in which Braver
and Cohen use the TD algorithm as a learning signal for a simple delayed response learning system.
2.4 Dopamine mediated learning
In the previous sections I explained how the presence of dopamine in the brain aids in learning neural
connections in the cortical brain areas. In order to create a model of learning behaviour, a more detailed
analysis of the learning process is required. Is the presence of dopamine required and if so, how does it
bring about long-lasting changes to the neural network responsible for behaviour?
The ability of a connection between two neurons to change in strength is called synaptic plasticity. In
general, two kinds of changes can happen; the connection can either be strengthened or weakened. The
former is called long-term potentiation (LTP), the latter is long-term depression (LTD). This mechanism of
plasticity is believed to underlie behavioural learning as well as the formation of memories. The idea of
synaptic plasticity has been described long before its existence in the brain was proven. In 1949, Donald
Hebb developed a theory describing synaptic plasticity [15]. The theory is based on the general idea that
two cells that repeatedly fire at the same time are related in some way. Whenever one of those cells becomes
activated, the other cell will tend to become activated as well. To accommodate for this, there must be a
strong connection between the two. During learning, this means that cells that happen to fire together are
likely to be part of the same global activation pattern. If this pattern belongs to a desired or good situation, it
is fruitful to make this pattern more likely to occur in a future situation. This can be achieved by increasing
the strength of the connections between all cells that are part of this desired pattern. Effectively, the theory
gives a computational implementation of the neural process of LTP. A similar situation exists for LTD; if
two cells repeatedly show decreased activity, the strength of the connection between them is decreased.
A number of researchers have suggested DA to be involved in establishing synaptic plasticity, i.e. the
ability of the connection, or synapse, between two neurons to change in strength. For example, Calabresi
et al. [7] showed that the induction of striatal LTD is blocked by the application of a DA antagonist. In
other words, the application of a chemical suppressing DA levels in the striatum, located in the basal
ganglia which are part of the human central nervous system, partly disables the process of weakening of
neural connections over time. By the application of DA, the process of LTD could be restored. The results
were found in an experiment on very small slices of rat brain striatum, submerged in a solution containing
the DA antagonist. But, because of the general application of dopamine to a whole slice of brain tissue,
other effects of dopamine can not be excluded. In 1996 Wickens et al. [42] investigated the effects of a
directed pulsatile application of dopamine. In addition, the timing of the dopamine application was set to
coincide with experimentally induced presynaptic and postsynaptic activity of the neurons involved. Both
LTP and LTD could be induced by using the correct timing of dopamine application. It seems that the
timing requirements known from reinforcement learning also apply to its neural correlate.
As I suggested before, the DA timing aspect can be modelled by the temporal difference algorithm that
I will discuss in section 2.5. Assuming the correct temporal behaviour, the level of DA activation can be
used as a learning parameter for performing LTP and LTD. Consequently, learning, and thus any change in
synaptic strength, only takes place in the presence of dopamine.
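As a sketch of how this could be written down computationally (my own illustrative formulation, not a rule taken from the studies cited above), a dopamine-gated Hebbian update multiplies the usual product of pre- and postsynaptic activity by the DA signal:

    \Delta w = \eta \, d \, x_{\text{pre}} \, x_{\text{post}}

where d is the dopamine signal relative to baseline. With d > 0 the Hebbian product strengthens the synapse (LTP), with d < 0 it is weakened (LTD), and with d = 0 no change occurs, which is consistent with the finding that both LTP and LTD require appropriately timed dopamine.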
2.5 The temporal difference algorithm
In section 2.1 I posed the question how it is that the PFC can learn to update represented plans appropriately
without the need for a control mechanism to explain its behaviour. Sections 2.2 and 2.3 suggested the
midbrain dopaminergic system to be involved. The initial question has been answered insofar as the
firing rate of DA neurons can be used to learn the correct behaviour. But how exactly do DA neurons
come to respond to the earliest predictor of reward? An influential idea based on animal learning theories
was introduced by Richard Sutton [39]. He created a computational procedure for the prediction of reward
using a temporal difference algorithm. This algorithm can be successfully used to model the DA response
required for PFC updates. In this section, I explain how the temporal difference algorithm can give a
reliable prediction about future events.
The basic idea behind the temporal difference algorithm is fairly easy to understand. At any point in
time, it tries to make an estimate of all expected future reward. This would be easy if one could look ahead
in time to observe all future events and associated rewards. Unfortunately, the future is highly dependent on
your own future actions and even if those were all predetermined, a completely deterministic environment
would be required to reliably predict future reward.
It is clear that looking into the future in order to get a reliable reward estimate is not feasible. The
only thing we can observe about our environment is the current sensory input including information about
the current reward. And we can remember this information, meaning we can also keep a short history of
past sensory input, undertaken actions and consequences. By evaluating past experiences and storing this
information in our brain, it is possible to construct a reward expectation for the future, based on our current
context.
To see how this works mathematically, I will start by looking at the simplest form of temporal difference,
one-step-ahead prediction. Suppose P_t is the output of a simple linear connectionist unit:

    P_t = \sum_{i=1}^{m} w_t^i x_t^i                                        (2.1)

where w_t^i is the connection weight for unit i at time t and x_t^i is the input activation. If at any time only one
input is active, the output of this unit represents the predicted reward for the given input at this time. If this
prediction is higher than 0, a reward is expected in the next time step. You can compare this to a CS-US
pair in classical conditioning (see figure 2.9). If the reward fails to occur, apparently the prediction was
too high. Similarly, if an unexpected reward occurs, the prediction was too low. To make a better future
prediction, the expectation needs to be changed. The amount by which we want to change it depends on
the difference between the predicted reward at time t and the perceived reward at time t + 1, r_{t+1}. This is
called the temporal difference:

    TD_t = r_{t+1} - P_t                                                    (2.2)
This temporal difference can be used to update the connection weights. We only want to change the
weight associated with the current input and we do not want to change it too radically. A well-established
method for doing this is by using the delta rule [43]. Basically what it says is that the amount by which
to change a connection weight is given by the difference between the expected and actual output times a
learning constant between 0 and 1. The delta learning rule for one-step-ahead prediction is:
    w_{t+1}^i = w_t^i + \eta \, TD_t \, x_t^i                               (2.3)
where η > 0 is the learning rate. An important observation we can make here is that in a classical conditioning experiment only one sensory stimulus is paired with another. Since only one of the inputs in
equation 2.3 is active at a time, only one of the weights will be updated. Another way of putting this is to
say that only one connection weight is eligible for modification at time t. Applied at time t instead of t + 1
the delta-rule is given by the following equation:
    w_t^i = w_{t-1}^i + \eta (r_t - P_{t-1}) x_{t-1}^i                      (2.4)
The interpretation of this equation is that the eligible connection weights are updated by subtracting the
previously made reward prediction for the specific input we are dealing with from the currently observed
reward. If the prediction was correct, no changes are made. Otherwise, the weights are updated to better predict
the current situation in the future.
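To make this concrete, here is a small worked example with numbers chosen purely for illustration: suppose a single stimulus is always followed by a reward of r = 1, the initial weight is w_0 = 0 and the learning rate is eta = 0.1. With the stimulus activation equal to 1, equation 2.4 gives on successive trials

    w_1 = 0 + 0.1(1 - 0) = 0.1
    w_2 = 0.1 + 0.1(1 - 0.1) = 0.19
    w_3 = 0.19 + 0.1(1 - 0.19) = 0.271

so the prediction associated with the stimulus climbs gradually towards the reward value of 1, mirroring the gradual acquisition of a CS-US association in classical conditioning.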
This all works fine if the predicted reward immediately follows the cue that predicts it. This is sufficient
for learning stimulus-response combinations, but fails for more complex situations in which reward only
comes after completing a sequence of actions. We then want the sensory cue to predict a reward that only
comes after performing a sequence of actions taking more than one time step. Ultimately, we want to know
about all the future rewards that the current sensory input might lead us to. What this effectively means is
that we will have to remember everything we did in the past and where it has led us. This would make
the algorithm unnecessarily complex.
But there is a solution. Let’s assume that we can in fact make an infinite-step-ahead prediction of future
reward. This means that at any time, we know about the immediate reward following our stimulus-response
pair as well as all rewards that will follow later on because of our next sequence of actions. The expected
reward is simply the sum of all expected future rewards. Presumably, we will get a reward some time and
therefore our prediction will always be very high, even if the expected reward is still many time steps ahead.
To account for this, a discount factor needs to be introduced to give less value to predictions of reward still
far ahead. The prediction we can make now looks something like this:
    P_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots               (2.5)

where 0 ≤ γ < 1 is the discount factor. The farther we look ahead in time, the more influential the discount
factor becomes. Applied at time t − 1 it gives us:

    P_{t-1} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots               (2.6)
Now notice that Pt−1 can be rewritten as follows:
    P_{t-1} = r_t + \gamma (r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots)    (2.7)
            = r_t + \gamma P_t                                                      (2.8)
Apparently, the prediction at any time can be derived from the reward and the prediction at the
next time step. Even better, if this prediction were perfect, it would satisfy equation 2.7. The amount
by which two adjacent predictions fail to satisfy this equation can be used as an error measure for changing
the weights of equation 2.1. This temporal difference error is:
    TD_t = r_t + \gamma P_t - P_{t-1}                                       (2.9)
Note that this temporal difference is very similar to the one-step temporal difference of equation 2.2, as applied at time t in equation 2.4. When we take γ = 0, meaning
that future reward is simply discarded, it is exactly the same. The full equation for updating the weights
applied at time step t is now as follows:
    w_t^i = w_{t-1}^i + \eta (r_t + \gamma P_t - P_{t-1}) x_{t-1}^i         (2.10)
Put into words, this equation says that the weights are updated in proportion to the currently observed
reward plus the discounted current prediction, minus the previous prediction. In other words,
if a reward was predicted but has not come yet, the next prediction must take this into account. The weights
for the previous input are adapted to ensure a better prediction for similar future situations.
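To illustrate how equations 2.1, 2.9 and 2.10 fit together, the following sketch implements a single linear reward-prediction unit in Java. The class and variable names are my own illustrative choices; this is a minimal sketch under the assumptions stated in the comments, not part of the model described in this chapter nor of the implementation discussed later in this report.

    // Minimal sketch of a linear reward-prediction unit trained with the
    // temporal-difference rule (equations 2.1, 2.9 and 2.10); names are illustrative.
    public class TDUnit {
        private final double[] weights;     // w^i, one weight per input
        private final double gamma;         // discount factor, 0 <= gamma < 1
        private final double eta;           // learning rate, eta > 0
        private double[] previousInput;     // x_{t-1}
        private double previousPrediction;  // P_{t-1}

        public TDUnit(int numInputs, double gamma, double eta) {
            this.weights = new double[numInputs];
            this.previousInput = new double[numInputs];
            this.gamma = gamma;
            this.eta = eta;
        }

        // Equation 2.1: the prediction is the weighted sum of the input activations.
        public double predict(double[] input) {
            double p = 0.0;
            for (int i = 0; i < weights.length; i++) {
                p += weights[i] * input[i];
            }
            return p;
        }

        // One time step: compute the TD error (equation 2.9) and update the
        // weights that were eligible at the previous time step (equation 2.10).
        public double step(double[] input, double reward) {
            double prediction = predict(input);
            double tdError = reward + gamma * prediction - previousPrediction;
            for (int i = 0; i < weights.length; i++) {
                weights[i] += eta * tdError * previousInput[i];
            }
            previousInput = input.clone();
            previousPrediction = prediction;
            return tdError;  // plays the role of the phasic DA signal
        }
    }

Run over repeated trials in which a cue is presented a few time steps before a reward, the value returned by step shifts from the time of reward delivery to the time of cue onset, which is the pattern of DA activity discussed in section 2.3.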
2.6 A theory of cognitive control
Having gained a basic understanding of the processes in the brain that enable humans to express such
complex behaviour, I will now look at a model of control that Braver and Cohen introduce in [6]. This
model focuses on the idea that any system capable of controlling behaviour needs to be able to attend only
to contextually relevant information while ignoring other contextually irrelevant sensory input.
DA neurons are believed to be involved in updating PFC representations. Braver and Cohen hypothesise
that the effect of DA neurons is to modulate the responsivity of PFC units to their input, meaning that DA
serves as a gate between sensory input and PFC representations. This explains how the PFC can update its
representations when necessary while protecting them against interference from other, distracting stimuli.
DA comes to respond to the earliest predictor of reward so if DA is to open the gate between sensory input
and the PFC, updates are made only when an unexpected stimulus comes in that reliably predicts a future
reward. Any stimulus not associated with reward will not develop a predictive DA response and therefore
allow representations in PFC to be maintained. A second effect of DA is to strengthen the associations
between sensory stimuli that predict reward and the DA neurons themselves. This corresponds to the
reward prediction in temporal difference learning. However, there is one problem with this situation. If
the gating system is learned by observing external reward, but the reward acquisition in turn depends on a
correctly working gating mechanism, then how can this process get started? This is a classic example of a
bootstrapping problem.
To show that their theory of gated cognitive control is capable of bootstrapping, Braver and Cohen
constructed a neural model to carry out a simple cognitive control task. The task they used is a variant of a
delayed-response task in which a cue is presented at the beginning of each trial. This cue can be either the
letter A or B written in black or white, meaning four different cues can be distinguished. After a delay of
variable length, the letter X is given as a probe to which the network must reply with one of two possible
responses. One of those, called the target response, must be made when the probe follows one particular
cue (e.g. a black A). In all other cases the nontarget response is required. During the delay period, the
system can be presented with both target and nontarget stimuli. In order to give the correct response, the
network needs to ‘remember’ seeing the target cue while ignoring nontarget cues. The only feedback given
to the system is a value of reward for that particular trial.
Figure 2.11: Network used by Braver and Cohen (from [6])
Figure 2.11 shows the network they used. There is a stimulus layer with five inputs separated into two
different pools to represent identity and colour of the stimulus. All five units have an excitatory connection
to a corresponding unit in the context layer. Two network responses are possible, represented by two units
in the output layer. Both the context and the input layer are fully connected to the output so every input and
context unit has a connection to both output units. In every layer there are lateral inhibitory connections
to enable competition between units. Units in the context layer have strong self-excitatory connections as
well as inhibitory connections from a tonically active bias unit. This is used to simulate active maintenance
of context information in PFC. The most interesting component is the reward-prediction unit. This unit receives
input from both the stimulus and the context layer. In addition, this unit observes the external reward value
at the current time step. Its behaviour is supposed to mimic the DA activity observed in the brain. Its
activity is therefore used to adjust all the weights in the system. For a more detailed description, see [6].
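One possible reading of the gating idea, reduced to its bare essentials, is sketched below. The update form and the names are my own simplifications for illustration only; the actual model in [6] uses more detailed activation dynamics.

    // Sketch of a DA-gated context unit: the reward-prediction (DA) signal
    // determines how strongly new sensory input can overwrite the currently
    // maintained context. A gate near 0 maintains the old context; a gate
    // near 1 lets the current input replace it.
    public class GatedContextUnit {
        private double activation;  // maintained context activation

        public void update(double input, double gate) {
            // gate is derived from the reward-prediction unit, assumed in [0, 1]
            activation = (1.0 - gate) * activation + gate * input;
        }

        public double getActivation() {
            return activation;
        }
    }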
After training, the network was able to correctly respond to the given inputs while ignoring distractor
cues. Ten runs were performed and on every run the network was initialised with random weights. The
network converged to perfect performance on every run meaning that it was perfectly capable of bootstrapping. The results suggest that a gating mechanism can indeed be used to exert control over behaviour in
order to successfully carry out a delayed response task.
Chapter 3
Design of a simple stimulus-response network
The theory so far is mostly concerned with the influence of the PFC on human behaviour. It is clear
that the PFC is very important, but to understand how it is that the PFC can exert its influence we first
need a model of the evolutionarily older stimulus-response pathways which the PFC exerts a modulatory
influence on. In this chapter I will introduce a highly simplified model of the cortical stimulus-response
pathways in the absence of PFC. The goal of this model is to simulate basic stimulus-response behaviour
that allows animals to appropriately respond to sensory input by performing those actions that lead to
a desired situation. I will start in section 3.1 by building up an architecture for the network. Then I
will discuss the learning regime including my own implementation of the temporal difference algorithm in
section 3.2. I created a few test cases to assess the overall performance of the network; section 3.3 discusses
the results. Next, in section 3.4 I introduce a slightly modified, more biologically plausible architecture and
discuss its performance. Finally, in section 3.5 the limitations of this network design are discussed.
3.1 Architecture
Animal behaviour is dependent on sensory stimuli. Without any input from the environment, no living
creature has any chance of survival. The key to survival is interaction with the world. Nature has given us
the ability to see, hear, taste, smell and feel. All those things are handled by the sensory system comprising sensory receptors, neural pathways and brain structures involved in processing and combining the raw
sensory information. Generalising, the basic function of this system is to generate high-level representations
of sensory stimuli which analyse the scene in terms of its affordances for actions. Milner and Goodale hypothesise that there are two functionally and anatomically separate pathways for processing visual sensory
input [14]. The ‘what’ pathway, going through inferior temporal cortex, specialises in the identification
of objects. This is a very important aspect when it comes to understanding the world, but not so much
for reflexive motor behaviour. The ‘how’ pathway on the other hand is located in posterior parietal cortex
and specialises in tasks involving spatial perception. In experiments on patients with lesions to the parietal
region, subjects showed great difficulty in reaching for and picking up common objects, even though they
had no problems recognising and naming them. Stimulus-response behaviour relies on the possibilities for
interaction with an object much more than on its identification.
Based primarily on the ‘how’ pathway, a plan is generated to react to the environment in an appropriate
manner. This plan is then sequentially fed to the basic motor pathway, a brain system responsible for the
control of all body motor systems. It basically transforms a high-level motor command like moving your
arm into a correct set of muscle control signals to actually move the arm in the right direction. For my
brain model, I will take this complex system to an abstraction level where only three basic components
remain. The first component models the sensory system, the second component integrates the sensory pathways and selects an action. The third component mimics the motor pathway and is thus responsible
for executing the resulting action.
The most natural way of implementing this model is by the use of an artificial neural network. Based
on biological neural networks found in the brain, a number of basic computational units called artificial
neurons are connected to form a complex network capable of orchestrating complex behaviour. Every
neuron has at least one input and zero or more outputs. The output models the neuron’s firing rate. The
inputs model the firing rates of neurons which connect to it. Upon activation of a neuron, the weighted sum
of its inputs is computed and passed through a non-linear function called an activation function.
My stimulus-response (SR) network is inspired by the network designed by Braver and Cohen [6]. It is composed of an input layer, an output layer and a hidden layer. The input layer is used to represent a high-level abstraction of sensory input values. For example, activation of the first input neuron, which I will call
S1 from now on, could represent the sight of a coffee cup. The input layer is unidirectionally connected
to the hidden layer. This layer is hidden in the sense that the pattern of activation does not simply reflect
one particular event or action but rather a particular action in a particular sensory context. This pattern
can not be mapped directly onto an observable external event or behaviour, hence the name ‘hidden layer’.
The hidden layer is unidirectionally connected to the output layer. The output or response layer represents
a high-level abstraction of an action, for example walking to the coffee machine. Figure 3.1 shows this
three-layer network.

Figure 3.1: A simple stimulus-response network with three layers

It is connected in such a way that there is exactly one way to get from every input to every output; in other words, any stimulus-response mapping is possible. To make sure that only one stimulus-response pathway
is selected at any time, there are inhibitory (negative) connections in between the hidden neurons as well
as self-excitatory (positive) connections. This brings about winner-take-all behaviour meaning that only
one neuron in this layer can be active at a time. In neurophysiological terms this phenomenon is known as
lateral inhibition.
The idea of using inhibitory competition in a neural network is not new. Randall O’Reilly [26] states
that ‘roughly 15 to 20% of the neurons in the cortex are ... inhibitory interneurons’ and ‘any realistic model
of the cortex should include a role for inhibitory competition’. A sparse representation of the world allows
for categorisation which is very useful in an ever-changing environment. Inhibitory competition allows
the network to attend only to the strongest sensory input without even considering processing much less
important stimuli. In an emergency situation where your life depends on appropriate and quick response
to the situation, the brain needs to be able to focus on one thing only. Lateral inhibition allows a strong
sensory stimulus like pain to suppress other less important stimuli completely.
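To make the winner-take-all idea concrete, the fragment below sketches one way inhibitory competition could be simulated: every hidden unit excites itself and inhibits its neighbours until a single unit remains active. This is only an illustration under my own parameter choices, not the settling procedure used in the actual implementation of chapter 5.

// Illustrative sketch of lateral inhibition in the hidden layer: iterate
// self-excitation and mutual inhibition until one unit wins the competition.
static double[] winnerTakeAll(double[] initialInput) {
    double selfExcitation = 0.2;    // assumed strength of the self-excitatory connection
    double inhibition = 0.15;       // assumed strength of the lateral inhibitory connections
    double[] act = initialInput.clone();
    for (int step = 0; step < 50; step++) {
        double[] next = new double[act.length];
        for (int i = 0; i < act.length; i++) {
            double suppression = 0.0;
            for (int j = 0; j < act.length; j++) {
                if (j != i) suppression += inhibition * act[j];
            }
            // activity grows through self-excitation and shrinks through inhibition,
            // clipped to the range [0, 1] like the output of the activation function
            next[i] = Math.max(0.0, Math.min(1.0, act[i] + selfExcitation * act[i] - suppression));
        }
        act = next;
    }
    return act;   // only the unit with the strongest initial input remains active
}

Starting from the net inputs {0.3, 0.7, 0.1, 0.2}, for example, only the second unit survives the competition; a strong stimulus therefore suppresses the weaker ones completely.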
The network is activated when a sensory stimulus arrives. One of the input neurons will activate and in
turn activate the hidden layer which will then activate the output layer. The pattern of activation over the
output layer determines the motor action the system has chosen to take. Correctly choosing the parameters
for the activation function is crucial for the performance of the network. We want the output layer to
make a clear decision for taking an action in any situation. The most commonly used and also biologically
plausible activation function is the sigmoid function shown in equation 3.1. The graph of this function is
plotted in figure 3.2.
P(x) = \frac{1}{1 + e^{-x}}    (3.1)
Since the output of this function is always between zero and one, the output of any neuron is constrained
to fall between those two numbers.

Figure 3.2: Standard sigmoid function

For both the input and output neurons, we need to think of a sensible
way to map the output value of the neuron to a useful interpretation. In other words, we need to figure
out what the real-life correlate of a maximally activated input neuron is. First note that one input neuron
in my network does not necessarily represent exactly one biological neuron. The activation of a single
artificial neuron can encode the existence of a distributed representation of a complex sensory event. An
excited artificial input neuron tells us that the sensory stimulus encoded by this neuron is present. This is a
binary event, either the stimulus is present or it is not, so we can encode the absence of the stimulus by an
activation value of zero.
A similar situation exists for the output neurons, either an action is performed or it is not. I interpret a
value between zero and one as the likelihood that the action is a good one in the current circumstances. A
value of zero means it is definitely not a good idea, while the value one means the exact opposite. Compare
this to the firing rate of a biological neuron. The average firing rate of neurons on an actively used neural
pathway is relatively high. This is, indirectly, the activity measured during an fMRI scan. But this does not
mean that inactive connections show no activity at all. The average firing rate of those neurons is lower
than that of actively used neural pathways, but they still show electrical activity. The average firing rate of
unused neurons is also called the baseline activity. Neurons can also show an average activity below their
baseline. In my network, I encode the baseline activity by the value 0.5. This value signals a high level of
uncertainty or rather indifference about the current state of affairs.
The activation value of the output neurons is determined by the pattern of activation over the hidden
neurons. Using the same sigmoid function for those neurons, the output again varies between zero and one
and so does the input to the output neurons. However, looking at the graph in figure 3.2 we can see that an
input of zero gives an output of 0.5, meaning indifference. This is not a desirable situation: input activation
close to zero should generate an output close to zero. To tackle this problem, I make a small adjustment to
the sigmoid function, as shown in equation 3.2.

P(x) = \frac{1}{1 + e^{1-2x}}    (3.2)
As you can see in figure 3.3 an input of 0.5 now generates an output of 0.5.
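As a small illustration, the adapted activation function of equation 3.2 and the weighted-sum activation of a single neuron can be written in a few lines of Java. The method names are mine; the actual Neuron class is documented in appendix A.

// Adapted sigmoid of equation 3.2: a net input of 0.5 now maps to an output of 0.5.
static double activation(double x) {
    return 1.0 / (1.0 + Math.exp(1.0 - 2.0 * x));
}

// Output of one neuron: the weighted sum of its inputs passed through the activation function.
static double fire(double[] inputs, double[] weights) {
    double net = 0.0;
    for (int i = 0; i < inputs.length; i++) {
        net += weights[i] * inputs[i];
    }
    return activation(net);
}

For example, activation(0.5) returns exactly 0.5 (indifference), while activation(0.0) returns roughly 0.27, so inputs close to zero no longer produce the ambiguous output of 0.5 that the standard sigmoid would give.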
Until now, I regarded the interneural connections as if they were static. This is incorrect; in order to
learn behaviour their strength needs to be dynamically changed to suit a behavioural pattern rewarded by the
environment. The synaptic strength of biological neurons can be translated into the strength of a connection
between two model neurons.

Figure 3.3: Adapted sigmoid function

A strong connection will cause the neuron on the receiving side to fire more quickly and strongly, whereas a weak connection will do the opposite. As I explained in section 2.4, under
the right conditions the connection can be strengthened or weakened. A stronger connection between two
neurons makes it more likely that the neuron on the receiving side will fire. Because of the lateral inhibition
in the hidden layer, for a neuron in this layer to fire it needs a relatively strong connection with its input
neuron in order to receive enough input to win the competition. This is demonstrated in figure 3.4 where
the red connection from S1 to H2 (the second hidden neuron from the top) depicts a strong connection and the blue connection from S1 to H1 depicts a weaker connection.

Figure 3.4: Example behaviour S1-R2

In this situation the first input is triggered, causing S1 to fire. This is depicted by colouring this neuron red. Because of the strong connection between
S1 and H2, this hidden neuron easily wins the competition between the four neurons. H1 receives much
less input from S1 because of the weaker connection and H3 and H4 receive no input at all because S2
did not fire. Since H2 is connected to R2, this output neuron is the only one to receive any input at all
and therefore fires. In this situation the network chose to select R2 when S1 came on, under different
circumstances other behaviour might have come about.
3.2 Learning stimulus-response behaviour
Now it is time to introduce the learning aspect. Learning is triggered by the environment so the first thing
I need to introduce is environmental feedback. The immediate environment of the SR system consists of
other subcortical brain structures including the dopaminergic subsystem. As I explained in sections 2.2
and 2.4 this system aids in learning behaviour by releasing small amounts of the neurotransmitter DA. I
also explained the striking similarity between the DA signal and the temporal difference algorithm. I will
now explain how the two are combined to achieve learning.
The DA signal is not targeted at single neurons; instead it is used as a global measure for changing connection strengths throughout the network. To understand how this works, I will revisit Hebbian theory [15].
In a situation where learning is driven by reinforcement, the only patterns we want to strengthen are those
which lead to reward. We therefore need to add an extra constraint: regular Hebbian learning only takes
place if there is a DA burst. (Recall Wickens’ empirical support for this extra constraint, as described in
section 2.4). DA neurons fire whenever an unexpected reward is delivered or when a reward signalling
stimulus comes on. Vice versa, when something undesirable happens such as the absence of an expected
reward, DA levels drop below baseline level signalling something that could be regarded as a negative
reward. Using this signal as a learning parameter for either strengthening (in case DA levels are high) or
weakening (in case DA levels are low) ‘active’ connections, i.e. connections between neurons that are firing,
our network is capable of learning stimulus-response combinations.
In the following example I assume there is a correctly working DA system that responds to the reward schedule S1 → R1. This means that the network is rewarded for selecting R1 whenever S1 comes on.
The initial network shown in figure 3.5a has no preference for choosing any particular hidden neuron.

Figure 3.5: Example of a situation where the behaviour S1-R1 is learned

If
all neural connections were initialised with equal strength, the network could never decide which hidden
neuron to use. In every reinforcement learning problem there is a tradeoff between exploration and exploitation.
A solution based solely on exploitation has bootstrapping problems when faced with a new unexplored
environment. Exploration is necessary early in learning, but as the expectation of reward gets higher, exploitation gets more rewarding and is preferable over exploration. So early in learning there is a need
for more exploration while in later stages the focus needs to be on exploitation. This is implemented in
the network by applying a certain amount of randomness to the activation of every neuron. As learning
progresses, this exploration factor decreases, favouring exploitative behaviour.
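The report does not prescribe a particular noise model, so the sketch below is my own assumption of how such an exploration factor could be implemented: zero-mean noise is added to every neuron's net input and its amplitude shrinks after each epoch.

import java.util.Random;

// Sketch of the exploration/exploitation tradeoff: early on the noise can flip the
// outcome of the hidden-layer competition, later the learned weights dominate.
class ExplorationSketch {
    private final Random rng = new Random();
    private double exploration = 0.5;      // initial noise amplitude (assumed value)

    double noisyNetInput(double netInput) {
        return netInput + exploration * (rng.nextDouble() - 0.5);
    }

    void endOfEpoch() {
        exploration *= 0.95;               // assumed decay rate per epoch
    }
}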
Figure 3.5b shows what happens when stimulus S1 comes on. One of the two hidden neurons connected
to the first input neuron activates at random (subject to the exploration factor) and response R1 is selected
as an output. This triggers an external reward causing the DA system to send a positive reward signal
which is delivered to every single connection in the network (3.5c). Hebbian theory says that only the
active connections have their weights altered, in this case the connection from S1 to H1 and the connection
from H1 to R1. The strength of those connections is increased (figure 3.5d) making it more likely that in a
future situation they are preferred over other, weaker connections.
For use with my neural network, I implemented the temporal difference algorithm in a dopamine unit.
To understand how this dopamine unit works, first have a look at the temporal difference function in
equation 3.3.
TD_t = r_t + \gamma P_t - P_{t-1}    (3.3)
The first thing we need is information about external reward. The neural correlate of external reward is
stimulation of the dopaminergic system. This stimulation usually comes from outside the brain and can be
considered external to the brain structures I am modelling. The reward value is dependent on the specific
stimulus-response combination being learned and can be determined by observing the current input-output
state of the network. The second and final variable needed for calculating the temporal difference is a
prediction of reward. In my implementation of the dopamine unit this prediction is the output of a linear
connectionist unit:
P_t = \sum_{i=1}^{m} w_t^i x_t^i    (3.4)

The connection weights w_t^i are internal variables of the DA unit; the input values x_t^i are the inputs to the SR network.
The timing of the DA system is crucial to its correct functioning. For the SR network one time step
consists of the selection of an input value, activation of the network and the selection of the appropriate
action. In the same time step I want to receive feedback from the DA system in order to update the
connection weights. Looking at the temporal difference function in equation 3.3, the temporal difference
available at the end of this time step is based on both the prediction made for this time step and the one for
the previous time step. Actually this provides us with an evaluation of our previous action. Unfortunately,
we want to have an evaluation of our current action and not of the one we did before.
The solution to this problem can be found in the observation that we are actually doing one-step-ahead prediction. The SR network is trying to learn stimulus-response combinations, which means that
the predictions we make are predictions about immediate reward. Taking only immediate reward into
consideration, the equation for updating the weights of the dopamine unit (see section 2.5) is:
w_t^i = w_{t-1}^i + \eta (r_t - P_{t-1}) x_{t-1}^i    (3.5)
Applying this formula at time t + 1 it can be rewritten as follows:
w_{t+1}^i = w_t^i + \eta (r_{t+1} - P_t) x_t^i    (3.6)
Now what does the factor r_{t+1} actually mean? It is the reward observed by the dopamine unit in the
next time step. The interesting thing is that this reward is actually the environmental reward based on the
current output of the system. It does not depend on the next input anymore because we only do one-step-ahead prediction. In other words, the dopamine unit can provide the reinforcement signal even before the
new input to the system is known. This is the immediate evaluative feedback necessary for updating the
connection weights throughout the network.
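The complete dopamine unit for this one-step-ahead case can be summarised in a short sketch. The class below is my own minimal rendering of equations 3.4 and 3.6, with assumed names and an assumed learning rate; it is not the code described in chapter 5.

// Minimal sketch of the dopamine unit for one-step-ahead prediction.
// predict() implements equation 3.4, feedback() implements equation 3.6.
class DopamineUnitSketch {
    private final double[] w;            // one prediction weight per input of the SR network
    private final double eta = 0.1;      // learning rate (assumed value)

    DopamineUnitSketch(int numInputs) {
        w = new double[numInputs];
    }

    // P_t: linear prediction of reward for the current input pattern.
    double predict(double[] x) {
        double p = 0.0;
        for (int i = 0; i < w.length; i++) p += w[i] * x[i];
        return p;
    }

    // Returns the temporal difference r_{t+1} - P_t, where r_{t+1} is the environmental
    // reward for the output just selected, and updates the prediction weights.
    double feedback(double reward, double[] x) {
        double td = reward - predict(x);
        for (int i = 0; i < w.length; i++) w[i] += eta * td * x[i];
        return td;                       // this value gates the Hebbian weight changes
    }
}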
Provided that the DA unit works correctly, this network is capable of learning every possible stimulus-response combination by strengthening connections on the pathway from input to output and weakening
other connections. In theory it all looks very promising. Section 3.3 describes a number of tests I created
to assess the performance of the network design.
3.3 Performance of the network
To show that the SR network described in this chapter is capable of learning stimulus-response behaviour by
observing only evaluative environmental feedback, I made an implementation in Java. The code includes
a computational network component and a Graphical User Interface (GUI). Figure 3.6 shows what the
GUI looks like. For a description of the functionality of the different buttons I refer to chapter 5.

Figure 3.6: Graphical User Interface for the SR network

The
neural network was constructed using a set of modular Java classes like Neuron and Connection. Multiple
Neuron classes are grouped together into a Layer component which can be connected to another Layer,
thus creating new instances of the Connection class. A functional overview of those three Java classes can
be found in appendix A.
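To give an impression of how such modular classes can fit together, the fragment below sketches a possible, simplified structure in which connecting one layer to another creates a Connection object per neuron pair. The real interfaces are documented in appendix A and may differ from this illustration.

import java.util.ArrayList;
import java.util.List;

// Simplified sketch only; see appendix A for the actual class descriptions.
class Connection {
    final Neuron from, to;
    double weight;
    Connection(Neuron from, Neuron to, double weight) {
        this.from = from; this.to = to; this.weight = weight;
    }
}

class Neuron {
    double output;                                   // current firing rate
    final List<Connection> incoming = new ArrayList<>();
}

class Layer {
    final List<Neuron> neurons = new ArrayList<>();

    // Fully connect this layer to a target layer, creating new Connection instances.
    void connectTo(Layer target, double initialWeight) {
        for (Neuron from : neurons)
            for (Neuron to : target.neurons)
                to.incoming.add(new Connection(from, to, initialWeight));
    }
}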
To test the performance of the SR network I created three test cases. For each case, a different set
of reward schedules was used, i.e. the external reward used as input for the dopamine unit was the only
variable factor between the three cases. Those three test cases are:
1. S1 → R1
2. S1 → R2 ; S2 → R1
3. S1 → R1 ; S2 → R1
In each case the network architecture depicted in figure 3.1 was used, with two neurons in the input layer
connected to the four-neuron hidden layer in turn connected to an output layer with two neurons. The
weights of the connections between the input and hidden neurons are dynamic, meaning that Hebbian
learning is applied after each stimulus-response epoch. The connections between the hidden and output
neurons are statically set to 0.5. In theory these connections could be learnable too but since there is only
one connection from every hidden neuron to an output neuron, the strength of the connection is not decisive
in choosing a particular output. To avoid unnecessary complexity, these connections are made static. The
dynamic weights between the input and hidden layer are initialised with a random value between 0.4 and
0.6, adding to a random initial behaviour of the network at the start of the exploration phase.
To test the performance of the network I created a batch file which can be interpreted by the network. It
sets the reward schedules according to the test case and then presents the network with an input 100 times.
On each input presentation the network is activated and selects an output. Next, the temporal difference
value is calculated and every learnable connection is updated using the Hebbian update rule:
∆weight = learningrate ∗ output ∗ (input − weight)    (3.7)
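In code, the learning step for one epoch might look as follows. How exactly the temporal difference value enters the rule is my own assumption (here it simply scales the learning rate, so that a DA dip reverses the direction of the change); the actual implementation may differ in detail.

// Sketch of the DA-gated Hebbian update of equation 3.7, applied to every learnable
// connection between the input layer and the hidden layer after an epoch.
static void hebbianUpdate(double[][] weights, double[] inputs, double[] hiddenOutputs,
                          double learningRate, double td) {
    for (int i = 0; i < inputs.length; i++) {
        for (int h = 0; h < hiddenOutputs.length; h++) {
            // the change is proportional to the postsynaptic activity and moves the weight
            // towards the presynaptic activity; its sign and size follow the TD (DA) signal
            weights[i][h] += learningRate * td * hiddenOutputs[h] * (inputs[i] - weights[i][h]);
        }
    }
}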
For the first test case, I presented the network only with S1 to show that the network is actually capable
of learning this simple schedule. Every session consisting of 100 epochs was run 100 times. At the start
of every session the network was reinitialised with new random values for the connection weights. In the
worst of those 100 sessions the wrong output was chosen in 7 of the 100 epochs, a fraction of 7%. Note that
this includes ‘mistakes’ made by the network during the training phase. The average number of incorrect
trials (i.e. an epoch in which a non-rewarding response was chosen) was 2.8. After 10 epochs the average
number of incorrect trials was only 0.3 out of 90 remaining epochs, a fraction of 0.4%. After 17 epochs the
network did not select the wrong output anymore. We can safely say that this is near perfect performance.
For the second test, the input to the network was randomly generated, in every epoch one of the two
possible input neurons was selected. The network was then activated and the correct response needed to
be selected, dependent on the selected input. In the worst-case session, the network now made 14 errors, a
fraction of 14%. On average only 6.5 incorrect responses were selected. Again, the network was initialised
with random weights so there was no bias towards any response before the first trial. This time an average
of 3 mistakes were made after the first 10 trials. After 20 trials, the average number of mistakes had gone
down to 1.2. After another 10 epochs, less than one mistake was made on average. Although the overall
performance was not as good as in the first test case, this makes perfect sense considering that this time
two correct responses needed to be learned instead of one.
Given that in the third case two responses need to be learned as well, one would expect similar figures
as in the second test. The batch file was changed to set and test the correct rewards and run once more. In
the worst case session 12 errors were made this time, the average number of incorrect responses was 6.9.
After 10 trials, only 3.2 mistakes were made on average, after 20 trials only 1.1. Table 3.1 summarises the
results.
                        test case 1   test case 2   test case 3
maximum                       7            14            12
average                      2.8           6.5           6.9
avg after 10 trials          0.3            3             3.2
avg after 20 trials          0.0           1.2            1.1
avg after 30 trials          0.0           0.3            0.4

Table 3.1: Number of incorrect trials for each test case
We can conclude that the SR network performs very well when faced with various input-output reward
schedules. For every test case it took only a few trials to figure out which connections to strengthen and
weaken in order to receive reward. However, there are some comments to be made about the current setup.
Section 3.4 describes an attempt to create a more realistic situation in which the input and hidden layers
are fully connected. Finally some other limitations are discussed in section 3.5.
3.4 A fully connected version of the network
The current network design might give a very good performance, but the following issue needs to be
considered. The connections between the input and output are set up to connect every input neuron with
exactly one output neuron. There is no neurological basis for this sparse and very specific connectivity. If
one of the connections should somehow fail, the corresponding stimulus-response pathway would be disabled and could
not be restored anymore. This is obviously an undesirable situation that needs to be resolved. A much more
robust situation is achieved when multiple pathways from input to output are formed. A solution would be
to simply add more hidden neurons. Now if a connection fails, another neuron in the hidden layer can be
used to select the correct output instead.
But how many neurons are required to have a robust enough system? Probably quite a lot. A more
efficient solution would be to fully connect the input and hidden layer so that every input neuron is connected to every hidden neuron. This way, in case one of the connections fails the network can still rely on a
different hidden neuron to select the correct output. The hidden neurons are now shared between the input
neurons meaning that the same number of neurons can be used. Figure 3.7 shows this new design with a
fully connected input and hidden layer.
Figure 3.7: A network design with full connectivity between input and hidden layer
This new design probably affects the ability to learn stimulus-response combinations. To test
the new design, I reran the three test cases described in section 3.3:
1. S1 → R1
2. S1 → R2 and S2 → R1
3. S1 → R1 and S2 → R1
Figure 3.8 gives a comparison between the performance of the sparsely connected and the fully connected
network. It clearly shows a higher average number of errors at the beginning. Since there are more connections running between the input and hidden layers, there are more options for the network to explore. This
explains why it takes more time for the network to settle into a situation where it hardly makes a mistake
anymore. On average, the fully connected network needs 12 trials before it makes only one more mistake in the remaining 88 epochs.
Now have a look at figure 3.9. Again, the overall error rate is higher for the fully connected version of
the network, but the learning curve is steep and after 20 epochs the average remaining number of mistakes
has gone down to 6. The one-mistake threshold is reached in epoch 41. Figure 3.10 displays a similar situation: the learning phase is short and the learning curve steep. In general I can conclude that although
the performance has gone down a little bit because of the full connectivity, the network is still capable of
achieving a very good performance. Since full connectivity is more biologically plausible, I will use it for
my final model as well.

[Plots of the average number of errors in the remaining epochs against the number of lapsed epochs, comparing the sparsely connected and the fully connected network.]

Figure 3.8: Performance of the network for the first test case: S1 → R1

Figure 3.9: Error rate of the network for the second test case: S1 → R2; S2 → R1

Figure 3.10: Error rate of the network for the third test case: S1 → R1; S2 → R1

3.5 Limitations of the current network

The network described in this section seems perfectly capable of learning any combination of stimulus-response behaviour. But the network design also imposes some major limitations. The first and most obvious limitation is the incapability of the network to handle distractor input. My network is based on a design by Braver and Cohen [6] who implemented a network capable of learning task-relevant behaviour while ignoring other irrelevant sensory input. Their network can handle distractors because they included
a gating mechanism which supposedly is one of the functions of PFC. My network reacts to any input
regardless of its relevance in the current context. However, my focus is not on the capability of the brain to
ignore irrelevant sensory input. The focus of my research is on the output side of the network. Although the
Braver and Cohen network is able to act only on task-relevant sensory input, it is unable to learn a sequence
of actions triggered by a single sensory event. In this report I present a solution for this specific problem,
also based on processes thought to take place in PFC. Adding an input gating mechanism to include the
ability to ignore irrelevant stimuli would be a possible extension to the network, left for future work.
Being able to learn stimulus-response combinations is a good start, but it is not enough to explain
human behaviour. Not only can we learn to react to a stimulus, sometimes a whole sequence of actions
needs to be learned before getting rewarded. A critical question must be answered: Is the current network
design sufficient for learning sequences of actions? Until now I only looked at single stimulus-response
combinations. Both the sparsely connected and the fully connected network were very good at learning
those combinations. But with the current network design we can never learn behavioural sequences. The
output layer can only select one output at a time and even if it could select more than one output, no
information about the serial order of the actions is provided. My hypothesis is that the DA signal which
is currently only used for learning the correct input-output combinations can also be used for learning the
serial order of actions. This requires some changes to the network architecture. In chapter 4 I will describe
a model that can learn behavioural sequences. This model is an extension of the SR-network presented in
this chapter.
Chapter 4
A model of behavioural sequences
In this chapter I will introduce a neural network capable of learning behavioural sequences. Before I start
talking about the necessary changes to the network architecture I will briefly introduce in section 4.1 why
learning behavioural sequences is crucial for explaining human intelligence. In section 4.2 I will discuss a
well established theory for learning and executing action sequences. Sections 4.3 and 4.4 elaborate on the
design of my network.
4.1 Why do I want to learn behavioural sequences?
The short answer to the question, why do I want to learn behavioural sequences, is that stimulus-response
behaviour by itself is just too limited. Only in very rare situations do we immediately receive feedback
on actions in real life. More often, it takes time and the right sequence of actions to accomplish a goal.
For example, if I feel like having a cup of coffee I first need to grab a cup. Then I want to take the cup
to the coffee vending machine in order to fill it with coffee. Filling the cup requires a particular sequence
of actions to perform on the coffee machine. It is not until I finally drink the coffee that I get a positive
feedback from feeling a bit less thirsty. There are numerous situations like this where a specific sequence of
actions needs to be performed in order to get positive environmental feedback. If I were only able to learn
stimulus-response combinations I would get to drink my cup of coffee only after randomly performing the
correct sequence of actions. The only thing I will learn is that drinking out of a coffee cup might help me
lessen my thirst, because this is the last action I did before feeling satisfied. Unfortunately, drinking out of
an empty coffee cup is not very rewarding.
Learning a sequence of actions may not sound too complicated for a human being. All you have to do
is remember exactly what you did before receiving the reward. But how can we know if every action we
did was strictly necessary for getting the current result? And even if we did figure it out, another dilemma
awaits. We can never consciously be considering every action we take in life, let alone the exact set of
possible future consequences. After thorough rehearsal of a successful behavioural sequence, we are able to perform it without having to consciously recall it every single time. It is very useful indeed to be able to execute
more complicated behaviour automatically.
4.2 Existing models of sequences
So we need to be able to unconsciously execute a sequence of actions; how could this work in our brain?
A common solution for learning sequential tasks involves an (associative) chaining mechanism. The basic
idea behind associative chaining is that a subject first learns to associate a pleasant experience with the preceding action. The action is then associatively chained with another earlier action. The process continues
until the first action is associated with the initial stimulus that triggered this chain of events and the chain
is complete.
Another interesting class of models is the so-called competitive queueing (CQ) models. These models differ from associative chaining models in that a learned sequence is activated in parallel instead of
sequentially. Behavioural evidence suggests that a CQ approach to learning shows much more resemblance to learning processes actually taking place in the brain [13]. In this section I will explain how a CQ
model works. I focus on the execution of an already learned plan representation. Learning the
plan in the first place is an issue I will address later.
Let’s go back to my favourite example and assume that we know how to get ourselves a nice cup of
coffee. The sight of an empty coffee cup combined with a lingering urge for caffeine triggers the generation
of an action plan in the brain. Based on previous experiences with getting coffee, we come up with the
following action plan:
1. pick up the empty coffee cup
2. walk to the vending machine
3. place the cup in the slot
4. push the button for coffee
5. pick up the cup
6. drink from the cup
We know that we have to perform this entire sequence of actions from the moment the plan is triggered.
The only thing we need to do is to execute it in the right order. A CQ model can explain how the sequential
ordering of those actions can be determined.
CQ can be implemented in a neural network comprising two layers, a selection layer and an action
layer. The selection layer represents the plan to be executed, the action layer is used for determining
the immediate action to take. The basic architecture is depicted in figure 4.1.

Figure 4.1: Generic competitive queueing architecture

There is an excitatory connection from every selection
neuron to the corresponding action neuron. Conversely, there is an inhibitory connection going back to
the selection layer. We only want one action to be taken at a time, so the action layer implements lateral
inhibition to ensure that only one of the neurons in the layer wins. The action layer receives input from
the selection layer, so this winner will be the neuron with the highest input from the selection layer. For
the plan representation this means that if the neuron representing the first action to be taken is activated
most strongly, this will certainly be the first neuron to exert its influence on the action layer. Interestingly,
there is an inhibitory connection from neurons in the action layer back to their corresponding neurons in
the selection layer. Self-inhibition of a dominant representation is a widespread cognitive phenomenon,
often called inhibition-of-return [17]. Immediately after the activation of the neuron in the action layer, the
neuron in the selection layer is inhibited. This neuron is then no longer the most active, making
way for the next step in the sequence.
Figure 4.2 shows the first steps in the execution of the coffee making example. In this example I
represent the activity of a neuron in the selection layer by a number between 0 and 10. I assume the baseline
activity of all neurons to be 5.

[Panels a) to d) of figure 4.2 show the activation levels of the selection neurons S1–S5 and the winning action neurons E1–E5 at four successive steps.]

Figure 4.2: Example execution of a sequence in a competitive queueing model

Figure 4.2a shows the initial situation. The action plan is represented by a gradient of activation over the selection neurons. In the example, neuron S4 is given the highest level of
activation. The neurons in the action layer are competing with each other. The winner will be the one with
the most input from the selection layer. Because of the one-on-one connectivity between the two layers,
this winner is determined solely by the pattern of activation over the selection layer. This is represented
in the figure by the red colouring of E4. In the process of making coffee, action E4 would represent
picking up the cup. The next step is shown in figure 4.2b. The inhibitory connection from E4 back to S4
depresses the activity of the neuron. This is the inhibition of return ensuring that once the first part of our
planned sequence has been executed, processing of the second part of the plan can start. The depression
of neuron S4 makes way for S1 which now has the highest activation level. In figure 4.2c we can see that
E1 is selected in the same way as S4 selected action E4 to be taken. E1 would encode walking to the
coffee vending machine, the second step in getting coffee. Again, immediately after the selection of E1 the
inhibition of return depresses S1, making way for S3, placing the cup in the slot. In the meantime, neuron
S4 has recovered a bit from the strong inhibitory influence from the action layer. But unless a new plan is
put into the selection layer the activity will not rise above 5 anymore. This process continues until all the
actions in the sequence have been done in the correct order.
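The core of the CQ execution mechanism can be sketched in a few lines of Java: the most active element of the plan is selected, executed, and then suppressed by inhibition of return. This is a deliberately simplified illustration; the suppressed element is simply taken out of the competition rather than slowly recovering towards baseline.

// Sketch of competitive queueing: a parallel plan (a gradient of activation over the
// selection layer) is converted into a serial order of actions.
static int[] executePlan(double[] selectionActivity) {
    double[] act = selectionActivity.clone();
    int[] order = new int[act.length];
    for (int step = 0; step < act.length; step++) {
        int winner = 0;                              // action layer: pick the most active element
        for (int i = 1; i < act.length; i++) {
            if (act[i] > act[winner]) winner = i;
        }
        order[step] = winner;                        // the corresponding action is executed
        act[winner] = Double.NEGATIVE_INFINITY;      // inhibition of return suppresses it
    }
    return order;
}

For the gradient {9, 6, 8, 10, 7} of figure 4.2 this yields the order S4, S1, S3, S5, S2: the most strongly activated plan element is executed first and every element is executed exactly once.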
Competitive queueing is a very interesting technique for explaining how sequences can be learned in
the brain. Indeed there is good evidence that a CQ-type paradigm is used in PFC to plan certain types of
motor sequences [2]. Behavioural evidence also comes from the domain of error correction in language
technology. Transposition of individual letters is a common type of error made by people typing sentences
on a keyboard. However, this error hardly occurs in handwriting. It seems that the CQ paradigm only works
in a situation where a fast succession of (low level) actions is made, like making key presses on a keyboard
or drawing basic shapes. The types of actions generated by my model are higher-level actions requiring
more attention and possibly taking much longer to execute. Transposition errors like we sometimes see in
typing are rarely found in those higher level sequences. For example, when making coffee people hardly
ever make the mistake of picking up the (empty) cup before pushing the button for coffee. It is therefore
unlikely that a CQ type of learning is used for learning those kinds of sequences in PFC.
Another important issue with CQ is that a plan representation is destructively updated, i.e. as soon as an
action is performed, inhibition of return makes sure that the neuron(s) representing the action will become
inactive. After the sequence has been executed, no traces of it remain. This is problematic, because there
are suggestions that sensorimotor sequences can be retained and replayed after they have been executed.
By actively thinking about a recent experience the learning procedure that just took place can be activated
once more to strengthen its result and speed up the learning process. This can only be done when the PFC
activity is not destroyed while executing the plan, like a CQ model does. Plan destruction also does not
support intention recognition in action observation. The intention of people's actions is often only worked
out retrospectively, when the action itself has been completed. The brain needs access to the original plan
representation in order to recognise the intention behind someone else’s actions.
A type of learning that is believed to be unique to the human species is imitation learning. Humans
possess the unique ability to learn by observing other people getting rewarded or making mistakes. It
was recently discovered that humans seem to have so-called mirror neurons [31]. Mirror neurons respond
when someone performs an object-oriented action such as reaching for a cup. This is exactly the kind of
action that can trigger the learning process in PFC. It is plausible to assume that the PFC neurons used for
learning are mostly mirror neurons. Interestingly, they respond in exactly the same way when the subject
sees someone else perform the same action. This allows us to learn by observation or imitation. Given that
the brain structures involved in learning are made up of mirror neurons, it makes no difference to them what
triggered the learning process. For mirror neurons, it is also useful to remain active during plan execution.
For example, assume you have practised making coffee and you know how to go to the coffee machine,
place a cup and drink from it. Now if you see someone walk away from the coffee machine with a filled
coffee cup, you can infer that he must have placed his cup and used the machine even though you did not
actually see him do this. This is the intention recognition process I mentioned earlier.
4.3 A neural model for learning sequences
We are now faced with the problem that we want to have a highly adaptive model of learning sequences
that does not destructively update its plans. Instead we have a model that can only learn stimulus-response
behaviour. It seems there is no suitable biologically plausible sequence learning model readily available
for implementation. The solution I present lies in the addition of a PFC component. Before getting into the
implementation details, it is useful to understand where I’m headed. I will therefore first present a general
description of my final model in section 4.3.1. In section 4.3.2 I continue by making changes to the design
of the stimulus-response network in order to allow sequence learning.
4.3.1 A first impression of the final network
The goal of my research is to have a biologically plausible model of the brain that can perform a sequence
of actions when presented with a single sensory stimulus. Somewhere in the model I want a representation
of the plan to be visible throughout the execution of the sequence. Research has shown that the PFC is
likely to be the locus of this plan representation. The PFC does not exert its influence directly onto the brain's neural motor pathways. Instead it relies on basic stimulus-response pathways controlling the
automated execution of actions. It is on those pathways that the PFC exerts its influence by exciting some
of them and depressing others.
On the basis of this idea, I can make the black-box representation of the final network shown in figure 4.3.

Figure 4.3: A black-box representation of the final network

This model consists of three main components. The SR model presented earlier is included as one
component. It still comprises an input layer, a hidden layer and an output layer. Above the SR network lies
the PFC. Because of the black box representation, the internal working of the PFC is not revealed yet. The
important thing to note here is the connectivity between the PFC and the hidden layer. There are connections running in both directions between the two. Connections running from PFC towards the hidden layer are
being used to bias the SR pathways in such a way that the desired behaviour results. This is accomplished
by having a PFC neuron bias a particular hidden neuron responsible for selecting the desired action. The
DA unit works independently of the other two components. Information from the DA unit, responsible for
signalling both current and future reward, is available to every unit at any time.
Still it seems this design only allows basic stimulus-response behaviour to take place. Even with the
biasing influence of the PFC component, the SR network does not activate unless a sensory stimulus is
received. The execution of a sequence of actions requires the network to select a time-separated series of
outputs on the presentation of a single input that may or may not be present throughout the sequence. To
solve this serial order problem, I will introduce the concept of reafferent inputs, i.e. sensory stimuli which
are generated by the agent himself when he performs an action (see section 4.3.2 for details).
Before the PFC layer can exert its bias on the hidden layer, it needs to learn how and when to activate.
This can be learned by observation. The PFC layer actively observes the sensory input and the choices
made by the SR network. This is what the connections from the input layer and the hidden layer into PFC
are for. Unlike the SR component, the PFC can store the information it receives for an extended period of
time. Integrating the current state of the network with actions taken in the past, the PFC layer can record a
sequence of events and the associated reward. Based on this information the internal state of the PFC can
be updated to maximise future reward by putting a bias on the hidden layer at the right time.
Before explaining the implementation details of the PFC layer, the serial order problem needs to be
solved. It was suggested that PFC activity roughly stays the same throughout the execution of a learned
action plan. In order to execute the plan, at some stage a transformation from a (parallel) activation pattern
to a sequential selection of actions is made. Consider the coffee making example where a sequence of
actions leads to a reward in the form of a nice cup of coffee. If this activity has been practised very well,
the sight of an empty coffee cup will immediately trigger an action plan. A pattern of activation is set in
PFC to ensure the correct execution of the plan. Possibly, the SR network has already learned that it needs
to take an action when the cup is spotted. A number of different actions may be taken at this point, but the
PFC can exert some influence on the hidden layer to help it choose the first action in the sequence, picking
up the empty cup. Now it is time to take the next action. But the SR network only activates when sensory
input is present. Of course there is the sight of a picked up coffee cup now. But even if we closed our eyes
we would still know how to continue the process we just started. In other words, the presence of sensory
input is not strictly necessary for taking actions. We need to be able to act even when there is no sensory
input. This requires some changes to the SR network.
4.3.2 Extending the stimulus-response network
If we want the network to perform a sequence of actions it needs to be able to act independently of sensory
input. In terms of network design this means that the network needs to be activated every single time step,
even when no input is available. This is somewhat impractical because activating the network on no input
leads to a highly unreliable activation pattern. For example, say we want the network to first select R1
and then R2 after S1 comes on. The first step is shown in figure 4.4a.

Figure 4.4: Trying to learn a sequence in the stimulus-response network

So far so good, the network can
learn to select R1 when presented with S1 by strengthening the neural pathway between the two. We get
into trouble trying to take the next step shown in figure 4.4b. Even though there is no input, inhibitory
interaction in the hidden layer selects a winner anyway. Hidden neuron H4 is selected as the winner of
the inhibitory competition, but there is no good reason for H4 being activated. The subsequent release of
DA will enable learning to take place. In this particular situation the connection between H4 and R2 is
strengthened. But there is no guarantee that the same hidden neuron will be used next time because there
is no support from the input layer.
So how can this problem be solved? It is true that there is no sensory input coming from the environment; at least, no unexpected input is present. On the other hand, in a situation like this, the execution
of R1 in the first step provides us with important tactile feedback. Think about picking up a coffee cup.
Even with your eyes closed you can feel the cup in your hand. This tactile input is processed by the brain
just like any other external sensory event. If the feedback corresponds with your expectations it leads to
a state of awareness or confirmation of just having done an action. This awareness is an important aspect
in performing an action plan in a nondeterministic environment. In case the confirmation feedback fails to
occur it is no use to continue processing the rest of your plan, since its success depends on the correct sequential execution of every step. I represent the tactile feedback in the neural model by an additional set of
inputs. The new network is shown in figure 4.5. Two neurons are added to the input layer, labelled did-R1
and did-R2. They are activated only when the corresponding actions have successfully been carried out in
the previous time step. My model assumes that the selection of an action will always lead to a successful
execution of the action, so every time R1 is selected by the network it will be presented with did-R1 in the
next time step. Now the second step in the sequence can be learned by strengthening the pathway between
did-R1 and R2.
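A minimal sketch of this assumption, that a selected action is always executed successfully and announces itself through a reafferent input on the next time step, is given below. The input layout (S1, S2, did-R1, did-R2) follows figure 4.5; the function itself is purely illustrative.

// Build the input vector for the next time step from the action selected now.
// Indices 0 and 1 are the external stimuli S1 and S2; indices 2 and 3 are the
// reafferent inputs did-R1 and did-R2.
static double[] nextInput(int selectedAction, double s1, double s2) {
    double[] x = new double[4];
    x[0] = s1;                              // S1, if it is still present
    x[1] = s2;                              // S2, if it is still present
    if (selectedAction == 0) x[2] = 1.0;    // did-R1: R1 has just been carried out
    if (selectedAction == 1) x[3] = 1.0;    // did-R2: R2 has just been carried out
    return x;
}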
A second issue that needs some attention is the temporal difference algorithm. For the SR network
it was sufficient to do one-step-ahead prediction (see section 3.2) because we only looked at stimulus-response behaviour. Now that we want to learn sequences of actions, a full prediction of future reward is
needed to appropriately perform the necessary weight updates.

Figure 4.5: Network model with reafferent inputs

Equation 4.1 shows what the calculation for updating the weights of the DA unit looks like (for details see section 2.5).
w_t^i = w_{t-1}^i + \eta (r_t + \gamma P_t - P_{t-1}) x_{t-1}^i    (4.1)
In order to compute the new weights for the DA unit we need the observed reward (r_t), the values of the current and previous predictions of reward (P_t and P_{t-1}) and the previous input to the system (x_{t-1}^i). Note that the factor r_t is actually the reward given to the result of the previous action. This means that we can
already observe it after the network activation in the previous time step. Instead of applying the weight
updates at the start of the activation sequence, we can try to set the new weights for the next time step after
activation of the network. Applying equation 4.1 at time t + 1 it can be rewritten as follows:
w_{t+1}^i = w_t^i + \eta (r_{t+1} + \gamma P_{t+1} - P_t) x_t^i    (4.2)
Now the factor r_{t+1} is the environmental reward based on the current output of the system. This reward can be observed immediately after the network has been activated and the output selected. The factor P_{t+1} needs some more attention. It stands for the reward prediction in the next time step. For making
a prediction of reward based on the current weights of the DA unit, the only variable needed is the input
to the system. In a nondeterministic environment there is no way of fully predicting future input values.
But is our environment completely nondeterministic? I just introduced the idea of internal tactile feedback
represented in the network by an additional external input. We can be pretty certain that this tactile feedback
will follow every action we take. Consider the situation in which the input S1 has a high prediction of
reward based on the successfully learned behavioural sequence S1 → {R1,R2}. In the first step the output
R1 will be chosen but no DA is released. Before observing the world in the next step, we already know
that action R1 has been selected. So even before the actual tactile feedback from our body is received, we
can assume with a high degree of certainty that did-R1 will be the next input. Using this knowledge about
the next input to the system, a new prediction of reward can be made and the weights of the prediction unit
updated.
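Putting these pieces together, the weight update of equation 4.2 might be implemented as in the sketch below. It extends the one-step-ahead sketch of section 3.2: P_{t+1} is computed on the assumed next input (the reafferent did-R unit of the action that was just selected). The names and the helper function are my own.

// Sketch of the full temporal difference update of equation 4.2. xNow is the current
// input pattern, xNext the predicted next input (known as soon as an action is selected).
static double tdUpdate(double[] w, double[] xNow, double[] xNext,
                       double reward, double eta, double gamma) {
    double pNow = dot(w, xNow);                    // P_t
    double pNext = dot(w, xNext);                  // P_{t+1}, based on the predicted next input
    double td = reward + gamma * pNext - pNow;     // r_{t+1} + gamma * P_{t+1} - P_t
    for (int i = 0; i < w.length; i++) {
        w[i] += eta * td * xNow[i];                // equation 4.2
    }
    return td;                                     // the DA signal that gates Hebbian learning
}

static double dot(double[] a, double[] b) {
    double s = 0.0;
    for (int i = 0; i < a.length; i++) s += a[i] * b[i];
    return s;
}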
4.3.3 Sequence learning without PFC
The two extensions to the SR model, tactile feedback and an updated reward prediction mechanism, already
allow the network to learn a sequence of actions. Even better, a computational implementation of the
prefrontal cortex has not even been included yet. Is this model then capable of learning sequences without
the need for PFC mediating behaviour? Have a look at the following example. In this example I use the
reward schedule S1 → {R1,R2}. The first step in learning this sequence, depicted in figure 4.6a, looks
familiar. Just like in the SR network without extensions, upon the presentation of stimulus S1 the network
activates and selects a winner in the hidden layer. Hidden neuron H1 activates R1 and the associated action
is taken. Unlike before, there is no DA activity because there is no reward or expectation of future reward.
This means that, at this time, no learning takes place. After the selection and execution of R1, the latter
of which is not explicitly modelled, the reafferent stimulus did-R1 comes on.

Figure 4.6: Example of learning a sequence without the aid of the PFC

Figure 4.6b shows what happens next. The network activates once more and randomly selects R2. The sequence S1 → {R1,R2}
has now been completed, therefore the DA unit fires. This allows the connections on the pathway from
did-R1 to R2 to be strengthened. Also, the DA unit increases the weight of its connection with stimulus
did-R1. Effectively, this means that did-R1 has now become an above average predictor of reward.
But the learning process is far from complete; only the last step in the sequence is more likely to be
taken correctly. The connections from S1 to R1 have not been strengthened at all. This is where the
temporal difference algorithm implemented in the DA unit becomes important. Observe what happens
when the network is presented with S1 once more. In figure 4.6c response R1 is randomly selected again.
The amount of reward predicted by did-R1 is slightly above average, meaning there now exists a temporal
difference. And a temporal difference means a release of DA into the system which allows learning to take
place. The amount of DA released may be small now, but as the network continues to correctly perform
the sequence the external reward will become expected and the temporal difference after completing the
sequence (as in figure 4.6d) decreases. As a result, the reward prediction of did-R1 increases and so does
the temporal difference between R1 and did-R1. This in turn enables more learning of the connections
between S1 and R1, until the reward can be fully predicted. At this point no more learning takes place, at
least for as long as the network does not make a mistake.
Then why do we need a PFC after all? Have a look at the following example. In this example I
concurrently use the two reward schedules:
• S1 → {R1,R2}
• S2 → {R2,R1}
I assume that the input to the network depicted in figure 4.5 is generated randomly. Consider a situation
where the first input has been selected a few times and by chance the network has selected the correct
sequence of outputs such that a reward has been delivered and both the network and the DA unit have
updated their weights. The second input has been selected once or twice, but the correct sequence has been
performed only once. In this situation the prediction of reward for the first input will be slightly above
average but not as much as the prediction for input did-R1. Figure 4.7 gives an example distribution of the
prediction of reward for every input. Remember that an expectation of 0.5 means complete uncertainty.
Anything higher than 0.5 means a chance of getting a positive reward at some time in the future. The
pathways from did-R1 to R2 are coloured red to indicate a significantly higher weight due to learning.
[In the example of figure 4.7 the reward predictions are 0.61 for S1, 0.54 for S2, 0.94 for did-R1 and 0.57 for did-R2.]

Figure 4.7: Example situation showing the prediction of reward for every input to the system
Now observe what happens when the network is presented with S2 and it chooses to select R2. This is
a good start, but since the reward prediction for did-R2 is very low, the temporal difference is very low as
well. In the example situation the temporal difference equals 0.57 - 0.54 = 0.03. The temporal difference
is used for learning the connection between S2 and R2. Not much learning will take place before the
predictions of reward of S2 and did-R2 will have increased. Now observe what happens when the network
selects R1 instead of R2. This is obviously not a desirable situation, because the correct output sequence
starts with R2, not R1. Looking at the values in figure 4.7, it can be concluded that the temporal difference
is now 0.94 - 0.54 = 0.40! This time learning does take place, but the behaviour that is reinforced here is
the incorrect behaviour. Unfortunately, there is no way for the network to know this at the time.
A second example of incorrect behaviour uses the following two reward schedules:
• S1 → {R1,R1}
• S2 → {R1,R2}
Consider the situation in which the first sequence has been performed correctly a few times, but the second
one has not. When the network is presented with S2 and R1 is selected (see figure 4.8a), the temporal
difference is quite high. By chance this is a good situation and the correct pathways are strengthened but
things turn bad on the next step. Figure 4.8b shows this situation.

Figure 4.8: Example of unlearning correct behaviour

The network has developed a preference
for the behaviour did-R1 → R1 and this is likely to happen again this time. Unfortunately this is not the
correct behaviour in this situation. The pathway between did-R1 and R1 is weakened again, even though
we need it for the sequence S1 → {R1,R1}.
4.4 A final model including a PFC layer
The solution for the problems sketched in section 4.3 is one I have been working towards from the start of
this report. I will add a PFC layer to the network. The function of this layer is to bias the pathways in the
SR network towards the desired behaviour. In section 4.4.1 I will first look at the architecture of this new
layer. Section 4.4.2 discusses the updated learning regime.
4.4.1 The design of a PFC layer
I want the PFC layer to have the same properties observed in neurological experiments on (primate) PFC
to make it as biologically plausible as possible. I try to achieve this by designing a model that has the
same characteristics as the biological PFC. An important observation made in section 2.1.4 is that the
representation of a plan in PFC remains visible throughout the sequence. This was one of the reasons why
a competitive queueing model was insufficient to explain PFC behaviour. In my model, I want the PFC to
show sustained activity of a plan representation. The task ascribed to PFC is a mediating one: it indirectly
biases particular pathways involved in behavioural decisions, as in the model of Braver and Cohen. PFC has no
direct influence on input or output selection. The neurons in the hidden layer are a good candidate for top-down
PFC influence. If PFC exerts a strong influence on one particular hidden neuron, specific input-output
behaviour can be enforced. Without this bias the network reverts to learned habitual behaviour.
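As a rough illustration of this mediating role, the sketch below adds a top-down bias to the bottom-up drive of each hidden neuron before a winner is chosen. The additive combination and the explicit argmax are simplifications of mine; in the actual network the winner emerges through lateral inhibition (see section 6.3).

    // Illustrative only: hidden neuron i receives its bottom-up drive plus a
    // top-down PFC bias; the neuron with the largest combined input wins.
    class PfcBiasSketch {
        static int selectHiddenWinner(double[] bottomUpDrive, double[] pfcBias) {
            int winner = 0;
            double best = Double.NEGATIVE_INFINITY;
            for (int i = 0; i < bottomUpDrive.length; i++) {
                double net = bottomUpDrive[i] + pfcBias[i];
                if (net > best) {
                    best = net;
                    winner = i;
                }
            }
            return winner;
        }
    }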
Figure 4.9 shows the new network design including a PFC layer connected to the hidden layer.
Figure 4.9: A neural network with PFC layer (PFC1 to PFC8; inputs S1, S2, did-R1 and did-R2; outputs R1 and R2)
The
hidden layer now consists of eight neurons. This way every neuron can theoretically be used to represent
a unique input-output combination. For display purposes the input and hidden layer are only sparsely
connected. Similar to the first design for the SR network, there is exactly one pathway from every input
neuron to every output. In chapter 5 I will show that full connectivity between the input and hidden layer
yields the same results. There are also eight neurons in the PFC layer. Every one of those neurons is
connected to a specific hidden neuron.
Before any learning has taken place there will be no activity in PFC meaning no bias is placed on
the hidden layer. Without top-down influence from the PFC the SR component behaves just like the SR
network from chapter 3. The added value of the PFC layer becomes clear when demanding tasks like the
ones described in section 4.3.3 are given. Recall the task with the reward schedules:
• S1 → {R1,R1}
• S2 → {R1,R2}
Without PFC this task was impossible to learn because the response to the input did-R1 is ambiguous.
A decision can only be based on the input value initially presented to the network. A correctly learned
pattern of activation over the PFC neurons can bias the network in such a way that this decision can be
made without having to remember the initial sensory input. Observe what happens when the network is
faced with the same situation, only this time neurons PFC3 and PFC5 are actively biasing the network
throughout the sequence. Figure 4.10 shows what happens in the second step of the sequence.
Figure 4.10: PFC biasing correct behaviour
Even though
the network has a strong tendency to select R2 whenever did-R1 comes on (represented by the red line),
the top-down bias from the PFC layer forces the hidden layer to select a different winner. This allows the
connection between did-R1 and R1 to be strengthened too. In the fully trained network, both connections
from did-R1 to the hidden layer will have become quite strong and PFC activation will be decisive in the
selection of an output.
4.4.2 Learning inside PFC
Using a specific pattern of activation, the PFC can bias the hidden layer towards any behaviour. A question
left unanswered is how the PFC can learn this pattern in the first place. To accomplish this, I subdivided
the PFC into a bias layer and a history layer. Figure 4.11 gives an overview of the network architecture.
The basic structure of the SR network is still intact. The input layer is fully connected to the hidden layer,
visualised in the figure by the solid arrow. Instead of one PFC layer, there now is a history layer and a bias
layer. The hidden layer connects to the history layer, which is in turn connected to the bias layer. Finally
the bias layer reconnects to the hidden layer. In the figure there are two copies of the same bias layer to
avoid crossing lines.
The bias layer performs the function of biasing the hidden layer in order to enforce a desired output
sequence. The PFC layer shown in figure 4.10 is actually a visualisation of only the bias layer. Activation
in the bias layer means that some sensory event has triggered a plan. By biasing the hidden neurons the
PFC can ensure a correct execution of the action plan, even when the SR network below has a different
output selection preference. The history layer, also embedded in PFC, simply keeps track of things going
on in the brain. Every hidden neuron is connected to a specific history neuron. Two extra neurons are
present to record the sensory input observed by the input layer. I will call those neurons history-input1 and
history-input2 from now on. The other history neurons are named history1 through to history8.
Figure 4.11: A neural network with a two-layer PFC component (input, hidden and output layers in the SR network; history, gate and bias layers in PFC)
The task
of the history layer is to keep track of sensory events as well as the response of the brain to those events.
The neurons are interconnected to allow integration of multiple sources of information. Figure 4.12 shows
the connections within this layer.
Figure 4.12: Connections internal to the history layer
The two history-input neurons, used to record the sensory input, are
both fully connected to the eight history neurons connected to the hidden layer. Those connections are all
learnable. A strong connection from one of the two history-input neurons to a history neuron in the layer
represents a tendency to select one or more specific hidden neurons when the stimulus comes on. This way,
a sensory stimulus can trigger the selection of an action plan in PFC.
For learning the internal PFC connections, no source of information is available other than the signal provided by the DA unit. Recall the shift in DA activation that takes place over time when a sequence of
actions consistently leads to subsequent reward. Initially, DA fires when the reward is delivered but as
learning takes place the DA activation shifts towards the earliest reliable predictor of reward. This will always be an external sensory event like S1 or S2, not a reafferent input like did-R1. One of the reasons why
the SR network was incapable of learning sequences was because no information about the initial stimulus
is available at the time of reward delivery. Unlike the hidden layer, the history layer in PFC has sustained
activity, meaning that information can be remembered over time. The two leftmost neurons in the history
layer are used to remember the external sensory input that triggered the sequence.
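The following sketch captures the bookkeeping role of the history layer under the assumptions just described: activity is sustained (set once and kept) until the delivery of a reward resets the layer. The field and method names are mine; the real layer is built from the Neuron and Layer classes of appendix A.

    // Sketch of the history layer: sustained activity until a reward resets it.
    class HistoryLayerSketch {
        double[] historyInput = new double[2]; // remembers the external stimulus (S1 or S2)
        double[] history = new double[8];      // remembers which hidden neurons were selected

        void recordStimulus(int stimulusIndex)   { historyInput[stimulusIndex] = 1.0; }
        void recordHiddenWinner(int hiddenIndex) { history[hiddenIndex] = 1.0; }

        // Reward delivery: the sequence is complete, so the trace is cleared,
        // leaving only the strengthened internal connections behind.
        void resetOnReward() {
            java.util.Arrays.fill(historyInput, 0.0);
            java.util.Arrays.fill(history, 0.0);
        }
    }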
I will use the reward schedule S1 → {R1,R2} to explain the role of the PFC in learning this sequence.
Assume that the sequence is performed correctly. Step 1 is shown in figure 4.13a. Stimulus S1 comes
on and the network selects R1 going through hidden neuron H1. This activity is recorded by the PFC
history layer. After taking the next step, shown in figure 4.13b, the activity from the previous step is
still visible in PFC. The selection of did-R1 leads to response R2 and this too is recorded by the PFC
history layer. The correct sequence has now been completed and the DA unit activates because of the
unexpected reward. Like in previous examples, the connection from did-R1 to R2 is strengthened. But
this time there are connections in PFC that are strengthened as well. In this case the connection between
history-input1 and both history1 and history6 is strengthened. The delivery of a reward tells the brain that
the sequence has been successfully completed. The expected reward was delivered so there is no need to
remember the preceding sequence anymore. Activity in PFC is reset; the only things remaining are
the somewhat stronger internal connections. Resetting activity clears the way for pursuing a new
reward. Figure 4.14a shows what happens when S1 comes on again. This time neurons history1 and history6
are activated simultaneously. There already is a possible plan for obtaining reward. Observe what happens
after the selection of R1 as the first output.
Figure 4.13: Network activation while learning a sequence (panels a and b)
Figure 4.14: Network activation with plan available (panels a and b)
As shown in figure 4.14b, the input did-R1 is activated next. In
the previous epoch a little bit of learning has taken place when R2 was selected. The prediction of reward
for did-R1 has consequently gone up a bit and therefore a small temporal difference exists. This means a
small release of DA will take place. The gate between the history and bias layer (see figure 4.11) has been
closed so far. The small amount of DA released opens this gate a little bit, allowing some of the activity
from the history layer into the bias layer. The bias layer now exerts a tiny influence on the hidden layer.
The bias may not be very large at the moment, but as the behaviour is learned the temporal differences will
grow and so will the bias from the PFC. Eventually the temporal difference will have shifted to the time
when the initial stimulus appears. Before taking the first action, a plan is generated in the history layer
and, because of the high concentration of DA that opens the gate, copied to the bias layer. The bias on the
hidden neurons remains until the sequence is completed.
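A minimal way to express this gating, assuming the gate opening is simply proportional to the DA signal (the text above does not commit to a particular gating function), is sketched below; the class and method names are mine.

    // The gate scales how much history activity reaches the bias layer.
    // Proportional gating, clipped to [0, 1], is an assumption made here.
    class GateSketch {
        static double[] biasFromHistory(double[] historyActivity, double daSignal) {
            double gate = Math.max(0.0, Math.min(1.0, daSignal));
            double[] bias = new double[historyActivity.length];
            for (int i = 0; i < historyActivity.length; i++) {
                bias[i] = gate * historyActivity[i];
            }
            return bias;
        }
    }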
This all looks quite nice, at least as long as everything goes exactly according to plan. But what happens
when a part of the required action sequence is incorrect? For example, assume that the sequence S1
→ {R1,R2} leads to reward but S1 → {R1,R1} is executed instead. After completing the sequence the
activation in the network is as shown in figure 4.15.
Figure 4.15: Incorrect activation of the second part of the sequence
The connections from history-input1 to history1 and
history5 are weakened. The behaviour S1 → R1 is discouraged even though we need it for getting a reward.
Next time, the network is likely to try something different. As expected, this does not yield any result, and
S1 → R2 will be discouraged too, thus increasing the probability of S1 → R1 happening again. When the
entire sequence is performed correctly for the first time, the temporal difference will be relatively high. By
now, the network has learned two things: the selection of R2 is a good option when the initial stimulus
was S1 and selecting R1 is not good in this situation. This knowledge is encoded in the strength of the
connections in PFC and stored separately from the underlying SR network. This is exactly the kind of
information needed to overcome the problems sketched in section 4.3.3.
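Expressed as a rule, the internal PFC connections can be updated with the same DA signal that trains the SR network: connections whose history units were both active in the trace are strengthened by a positive temporal difference and weakened by a negative one. The sketch below is my paraphrase of that idea; the array shapes and the weight clipping are chosen for illustration and are not taken from the implementation of chapter 5.

    // Sketch of learning the history-input -> history connections from the trace.
    class PfcTraceLearningSketch {
        static void updatePlanWeights(double[][] weights, double[] historyInput,
                                      double[] history, double temporalDifference,
                                      double learningRate) {
            for (int i = 0; i < historyInput.length; i++) {
                for (int j = 0; j < history.length; j++) {
                    double w = weights[i][j]
                             + learningRate * temporalDifference * historyInput[i] * history[j];
                    weights[i][j] = Math.max(0.0, w); // keep weights non-negative (my choice)
                }
            }
        }
    }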
Chapter 5
Implementation and evaluation of a neural network including PFC layer
In chapter 4 I presented a neural network model of the human brain including the PFC. In this chapter I
will demonstrate that this model is actually capable of learning sequences of actions. In section 5.1 I will discuss
some implementation details focusing mainly on the graphical user interface. In section 5.2 I will report
on a number of performance tests I carried out on the network.
5.1 Implementation
To show that the network I designed in chapter 4 is indeed capable of learning to perform sequences
of actions triggered by a single input, I made an implementation in Java. I used the same basic Java
framework I designed for the SR network (see section 3.3). A functional overview of the Java classes used
for building the network can be found in appendix A. The code consists of a separate network component
and Graphical User Interface (GUI). The GUI only contains code for displaying the network and retrieving
user input. The network component provides the neural network functions and carries out the calculations.
Two different network architectures can be selected by the user, one for the SR network from chapter 3 and
one for the network extended with PFC layer. Both have a different user interface showing the network
architecture.
Figure 5.1 shows what the GUI looks like for the SR network. The three different layers are visible
from left to right. The input layer has two neurons, the hidden layer four and the output layer two neurons.
Between the layers, the connections are depicted by red lines. The strength of a connection is printed
on top of its line, and the colour of the line also gives an indication of that strength: the
stronger the connection, the deeper the red. Inside every neuron, two numbers are printed.
The number printed in black is the current level of activation; the green number is the output value observed
at the adjacent neuron. On the very left side of the screen, the user can provide input to the network and
thereby simulate external sensory input by clicking on the number to the left of a neuron. Selecting
an input does not yet trigger the network to activate; it only selects the sensory input that will
be presented to the network in the next activation cycle. Only one input can be selected at a time.
The selection of a new input automatically leads to the deselection of all others. The two numbers on the
very right represent the response chosen by the network.
There is another GUI for the network including the PFC. Figure 5.2 shows the extended GUI. The
lower part of the screen shows the SR component. From left to right there is an input layer, hidden layer
and output layer, all connected with each other. The input layer has two additional neurons, the upper two
are the usual input neurons S1 and S2. The lower two represent the reafferent did-R1 and did-R2 inputs.
The topmost part of the GUI shows the neurons residing in PFC. The two neurons on the very left represent
the two history-input neurons (see section 4.4.2). The rest of the history neurons are depicted in the center
of the screen. Internal to the history layer there are only connections from the history-input neurons to the
other neurons. Initially, those connections have a strength of 0 because there is no learned plan yet. The
history layer is in turn connected to the bias layer shown on the right. The gate between the layers is not
visualised in the GUI. Also left out are the connections between the hidden layer and the PFC. Those are
not very interesting because they are static and cannot be learned.
Figure 5.1: Graphical User Interface for the SR network
When the user selects an input, the current prediction of all future reward is immediately visible in
the upper part of the screen. This is the reward prediction the dopamine unit would use for calculating
the temporal difference if the current input were actually chosen. Right next to the dopamine prediction
the most recent temporal difference value is visible. This is the temporal difference value used in the last
learning step. The buttons on the left and right side of the text fields can be used to initialise a new network
or quit the program. The buttons in the lower part of the screen can be used to perform a number of different
actions. When the user clicks the run button, the network is activated on the selected input. This simulates
the presentation of the selected stimulus to the brain and has it react to the stimulus. Simultaneously the
dopaminergic network is activated on the same stimulus and the output of the DA unit is used to update the
strength of every learnable connection in the network.
In order to understand the activation process that takes place in the model, it has been broken up into a
number of sequential steps. Upon the manual selection of an input the DA unit only provides the reward
prediction. Selection of a different input provides the prediction for that input. After input selection the
user can first simulate an activation cycle by clicking the activate button. Every neuron in the network is
now activated and a candidate output is selected. No learning or temporal difference calculation has taken
place yet. The network only shows a possible activation pattern; this pattern is not final. Clicking the
activate button once more shows a different possible activation pattern with a different candidate output.
During testing this is a useful feature for having the network ‘randomly’ select a desired output. After
activation the user can click the go button to activate the learning and temporal difference processes on the
current pattern. The button sequence activate-go has exactly the same consequences as clicking run once.
The only difference is that the activation is visualised before learning has taken place. The learn button is
added to enhance the learning effect of the current network activation. Selection of the button allows the
user to ‘play god’ by providing the network with the same DA activation once more without calculating a
new temporal difference. It should only be used for testing purposes as it is not biologically plausible for
learning to take place more than once in a single activation cycle.
Figure 5.2: Graphical User Interface for the SR network including PFC
Finally there is the clear button. The reason for introducing this button has to do with the temporal
difference algorithm. Remember that in order to compute the temporal difference after network activation
it was sufficient to know the next input value (see section 4.3.2). In case a sequence of two actions needs
to be learned (e.g. S1 → {R1,R2}), the input following the first response will be the reafferent did-R1.
After the second response, the sequence should have been completed. Any reward prediction associated
with did-R1 is really a prediction for immediate reward. Equation 5.1 shows how the temporal difference
was computed.
TD = r_t + γ P_t − P_{t−1}          (5.1)
Assume that the required behaviour has been learned and the network has successfully completed the required
sequence; the factor r_t will then be 1 because of the externally provided reward. The behaviour is well practised,
so the factor P_{t−1} will be close to 1 as well. In a normal situation the weight updates should be small
because no more learning needs to take place. If it were not for the factor γP_t (in this case the prediction of
reward for did-R2), the previous prediction (P_{t−1}) and the external reward (r_t) would cancel each other out.
So upon delivery of an external reward the next input needs to be ignored. This is exactly what happens;
whenever an external reward comes in, the next input provided to the DA unit is 0. But what if a sequence
of two actions does not lead to subsequent reward? When the system is trained on a set of reward schedules
comprising two actions, there is no point in trying to predict reward beyond the second action. The clear button
can be used to manually provide zero input to the DA unit. This clears all network activation and breaks
the multi-step ahead learning chain.
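The behaviour of the clear button and the automatic zeroing of the next input can be summarised in one method. The discount factor value and the boolean flag are mine, and the actual DA unit is implemented differently, but the arithmetic follows equation 5.1.

    // Equation 5.1 with the terminal handling described above: when an external
    // reward has been delivered (or the user presses 'clear'), the next prediction
    // is forced to zero so the reward and the previous prediction can cancel out.
    class TemporalDifferenceSketch {
        static final double GAMMA = 0.9; // discount factor; the value is assumed here

        static double temporalDifference(double reward, double nextPrediction,
                                         double previousPrediction, boolean sequenceEnded) {
            double next = sequenceEnded ? 0.0 : nextPrediction;
            return reward + GAMMA * next - previousPrediction;
        }
    }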
5.2 The performance of the final model
In order to demonstrate that my network can learn to select sequences using the PFC component, I created a
few test scenarios where the network needs to learn to generate a sequence of two consecutive actions after
presentation of a single input. There are two inputs available to the system and two possible actions to be
taken. In section 4.3.3 we have seen that there are some cases in which the PFC-less network is incapable
of learning the correct mappings. The addition of a PFC layer would theoretically solve these issues. I will
compare the performance of the full network for different sets of reward schedules with that of a network
with a disabled PFC. My hypothesis is that the PFC-less network shows a much worse performance than
the network including PFC.
Biological plausibility is one of my main concerns for the current network design. I already fully
connected the input and hidden layers for this reason. I have not looked at the connections between the
hidden layer and the output layer yet. Currently, the connections between the layers are perfectly evenly
distributed. In reality, such a nicely balanced distribution probably does not exist. To simulate this more
realistic situation I adapted the network such that every hidden neuron is randomly connected to exactly
one output neuron during the initialisation of a new network. This means that one of the outputs may
theoretically end up with only one or, in the worst case, even no incoming connection. The network must be able to adapt
to this situation.
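The random wiring described here amounts to drawing, for every hidden neuron, a single output neuron at initialisation time. A small sketch under that reading follows; the method name and the use of java.util.Random are mine, not part of the appendix A API.

    import java.util.Random;

    // Every hidden neuron is connected to exactly one randomly chosen output
    // neuron, so an output may end up with many, one or even no connections.
    class RandomWiringSketch {
        static int[] randomHiddenToOutputWiring(int hiddenCount, int outputCount, Random rng) {
            int[] targetOutput = new int[hiddenCount];
            for (int h = 0; h < hiddenCount; h++) {
                targetOutput[h] = rng.nextInt(outputCount);
            }
            return targetOutput;
        }
    }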
To test the performance of the network at several steps in the learning process, the test comprises two
functionally different types of epoch. In a learn epoch, the network is presented with a random input
and then activated, followed by an update of the connection weights. Next, the network is activated again
using the reafferent input and another weight update takes place. This is exactly what happens in a normal
situation: the network receives input and learns from every action it takes. This does not, however, allow an extensive
test at a specific time in the learning process. A test epoch therefore consists of the presentation of the first
input followed by network activation. But no learning takes place during a test epoch. After selecting an
output, the reafferent input is presented and the network is activated again. The same sequence is repeated
for the second input. Effectively, the network has been tested once on both inputs now.
The tests I carried out differ only in the reward schedules used. A test starts with 100 test epochs; this
will give an indication of the initial performance before any learning has taken place at all. Next, a series
of 10 learn epochs is run with random values for the inputs. This is followed by a series of alternations of
100 test epochs and 10 learn epochs until a total of 250 learn epochs has taken place. By then, the system
is expected to have achieved its optimal performance. A measure of the average network performance is
obtained by running 100 sessions of 250 learn epochs intertwined with test epochs. At the start of every
session, a new network is initialised with new random initial values for every connection.
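The test protocol can be written down compactly as a sketch in Java; the two epoch methods stand in for the real harness and are not part of the implementation in appendix A.

    // One session of the test protocol: a baseline of 100 test epochs, then
    // alternating blocks of 10 learn epochs and 100 test epochs until 250 learn
    // epochs have been run. The reported measurements average over 100 such
    // sessions, each starting from a freshly initialised network.
    class TestProtocolSketch {
        static void runSession() {
            runTestEpochs(100);                     // initial performance, no learning yet
            for (int learned = 0; learned < 250; learned += 10) {
                runLearnEpochs(10);                 // random inputs, weights updated
                runTestEpochs(100);                 // performance snapshot, no learning
            }
        }
        static void runLearnEpochs(int n) { /* present random input, activate, learn */ }
        static void runTestEpochs(int n)  { /* present both inputs, record correct sequences */ }
    }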
For the first test, I used the following reward schedules:
• S1 → {R1,R2}
• S2 → {R2,R1}
The problem the network has to face here is that the correct execution of the first of the two schedules
affects the second because the prediction of reward for the reafferent action is independent of the initial
stimulus (recall section 4.4.1). The two schedules can therefore not easily be learned independently of each
other. Because of the ability of the PFC to remember stimulus activation, it can bias the SR component
towards selection of the correct response. I first ran the aforementioned test in the PFC-less network. As
expected, the network did not perform well. After 250 learn epochs, it selected the correct sequence in
less than half of the test epochs. Closer inspection of the test results reveals that in most sessions at least
one of the reward schedules has in fact been learned. In rare cases the network has managed to learn the
other schedule as well because of coincidental random successful choices. In some other cases the network
did not manage to find any reliably rewarding schedule at all.
Enabling the PFC layer is expected to yield a much better performance. I ran the test again in the
network with PFC. The results of this test and the previous one are shown in figure 5.3.
Figure 5.3: Performance of the network for the first test case (average number of correct test epochs, 0 to 1, against the number of learn epochs, 0 to 250; with and without PFC)
The initial performance is equal to the performance of the PFC-less network. This is also the expected performance of
a system choosing actions entirely at random. Since there are only two out of eight possible sequences
correct, the expected fraction of correctly executed sequences is 25%. In the first 40 learn epochs, performance increases for both networks. But even at this early stage in learning, the advantages of the PFC
are already clear. The performance of the PFC network keeps increasing while the PFC-less network’s
performance flattens out and stays well below 50%. It is not until the PFC network reaches a level of 80% that
its performance starts to level off.
The second test I ran comprised the following two reward schedules:
• S1 → {R1,R1}
• S2 → {R1,R2}
For comparison, I first ran the test on the PFC-less network. A very poor performance is expected because
without the PFC there is no way for the network to learn which action to associate with the reafferent input
did-R1. The correct output depends solely on the initial stimulus, which is not remembered in the
network. Figure 5.4 confirms this hypothesis. Performance levels never rise above the initial
performance of the network. After 150 learn epochs, it has resorted to choosing R2 as the first action in
the sequence every time, because this reliably leads to no reward. With a reward expectation of 0
for every input, this is an undesirable but stable situation in which there are no temporal differences and
no learning takes place anymore. Again, the PFC makes a big difference.
Figure 5.4: Performance of the network for the second test case (average number of correct test epochs, 0 to 1, against the number of learn epochs, 0 to 250; with and without PFC)
After 100 learn epochs, the
performance exceeds 75% and at 150 learn epochs a 90% performance level is reached. The asymptotic
95% performance level is reached after 200 learn epochs.
An important characteristic of human learning is our ability to adapt to changing circumstances. When a learned
behaviour suddenly fails to get rewarded, the brain needs to react to the new situation, possibly
unlearning current plans and replacing them with updated ones. I created a test where this situation is
simulated by first having the reward schedules from the second test case:
• S1 → {R1,R1}
• S2 → {R1,R2}
After having run the network for 250 learn epochs, the network will show near perfect performance for
every input. At this point I replace the reward schedules with the ones from the first test case:
• S1 → {R1,R2}
• S2 → {R2,R1}
From this point on, I tested the network’s performance by running a series of alternations of 100 test epochs
followed by 10 learn epochs until 250 learn epochs have been run. Neither of the sequences that was rewarded
before is rewarded anymore, and a different sequence needs to be executed in order to receive
the reward. For the network this means that it first has to unlearn the current associations and then learn new
connections. As figure 5.5 shows, the performance level immediately after changing the reward schedules
is close to 0%. This is expected because the network has had no chance to adapt to the new situation. With
the temporal differences being extremely high, the learning curve is steep at the beginning.
Figure 5.5: Performance of the network facing changing conditions (average number of correct test epochs, 0 to 1, against the number of learn epochs in the second reward set, 0 to 250; with PFC)
After 20 learn
epochs the reward expectation has been adjusted and more random exploratory actions are taken. This
slows down the learning rate but allows the new set of reward schedules to be discovered. After around
200 learn epochs an asymptote of 75% is reached.
Chapter 6
Conclusions and further work
The goal of this research project was to create a biologically plausible model of PFC in order to guide a
neural network towards learning behavioural sequences. In this chapter I will assess this goal and look at
what still needs to be done. Section 6.1 presents a summary of the general line of reasoning I used in this
report. In section 6.2 I present the general conclusions of this project. A number of recommendations for
improvement are given in section 6.3. Finally in section 6.4 I present some suggestions for future work.
6.1 Discussion
The network my PFC model is connected to is an abstract simplification of neural processes taking place
in the (human) brain that allow basic stimulus-response behaviour. There is good evidence that the brain
implements a form of Hebbian learning on neural pathways used for guiding behaviour. This learning
regime can also be applied to connections in an artificial neural network. This form of unsupervised
learning only requires an external feedback signal giving an indication of the effectiveness of the last
action. The activation level of dopamine neurons found throughout the brain seems to encode exactly this.
DA activation patterns show a strong resemblance to the computational method of temporal difference
learning.
I created a three-layer neural network to simulate stimulus-response behaviour. The output of a temporal difference function was used as a parameter for applying Hebbian learning throughout the network.
This network was perfectly capable of learning any possible stimulus-response combination. However,
human behaviour almost invariably depends on the correct execution of a sequence of actions before a
goal or desired situation is achieved. The basic stimulus-response network is incapable of learning or even
performing a sequence of actions. To allow the network to perform behavioural sequences, the notion of
reafferent stimuli was introduced. The network treats a reafferent stimulus like any other stimulus.
The characteristic difference from a normal stimulus is that it is not generated by an external event, but
internally by the brain itself. It is a confirmation of having done a certain action instead of a reaction to
an external event. With this extension, the network architecture is suitable for performing sequences of
actions. Although the network is quite capable of learning to execute a series of two consecutive actions,
the performance drastically goes down when more than one reward schedule involving the same actions
needs to be learned concurrently.
Just as the PFC orchestrates complex behaviour in the human brain, the PFC component I created aids the stimulus-response network in learning behavioural sequences, even under more
difficult circumstances. I already stressed the importance of biological plausibility for the design of the
PFC component. Although it is not (yet) known exactly how it works, from neuroimaging experiments on
human and primate PFC, researchers have gained an understanding of the important properties of human
PFC. Unlike most other brain areas, the PFC displays sustained activity during the learning and execution
of a behavioural sequence. This property is also present in my PFC component. The initial stimulus is
remembered and if an execution plan is available, a pattern of activation becomes visible that is kept active
throughout the execution of the sequence. Another property of human PFC is that it does not directly
control the brain structures responsible for selecting motor outputs. Instead, it biases the same pathways
used by the stimulus-response network responsible for highly automated and instinctive behaviour. My
artificial PFC model exerts its influence by biasing the hidden neurons in the stimulus-response network as
opposed to controlling the output layer. Similar to the biological PFC, the bias is not directly on the output
but on the neurons in between the input and output layers. This allows very strong automated behaviour to
overrule the influence of the PFC. In a life-threatening situation, this can be critical for survival.
With the addition of the PFC, the network is now capable of learning more complex behavioural sequences of two consecutive actions. The test results show performance levels close to 95% for a fully
trained network. Disabling the PFC drastically impairs the network's capability to learn the more complex reward
schedules.
6.2 Conclusions
Before turning to my suggestions for research that could follow on from this work, I will
summarise the conclusions I can draw from my research.
Firstly, I explained how a computational model for learning stimulus-response behaviour can be constructed using biologically plausible methods. An artificial neural network provides a biologically founded
implementation of a system capable of processing information. Hebbian theory describes a biological
mechanism for strengthening and weakening neural connections under the right circumstances. In the
brain, the neurotransmitter dopamine plays an important role as well. Researchers such as Wickens [42]
have shown that (Hebbian) learning only takes place when DA levels are high. There is a striking similarity
between the computational method of temporal difference learning and the timing of DA firing. I used the concept
of dopamine mediated Hebbian learning in a neural network modeling the stimulus-response pathways in
the brain.
The second conclusion I can draw is that the aforementioned neural network is indeed capable of
learning arbitrary stimulus-response behaviour if it is consistently rewarded at the right time. The results in
section 3.3 support this claim. After an initial learning phase, the network showed near perfect performance
every time it was run.
My third conclusion is that, although this SR network is capable of learning one sequence, its performance
drops when more than one sequence is rewarded concurrently. This was demonstrated by the performance tests in section 5.2. Learning to perform a sequence of actions is crucial for intelligent creatures
such as ourselves, because it allows us to suppress instinctive behaviour and consciously reach for more
rewarding future goals instead. Behavioural as well as neurological evidence suggests that the PFC is the
brain region responsible for orchestrating complex behaviour. A neural model of PFC should enable my
network to learn more complex sequential behaviour.
The last and most important conclusion I can draw is that the addition of a PFC to my SR network
solves the sequence learning problems in a biologically plausible way. Just like its biological counterpart,
the PFC layer I designed records a history of events and integrates this information into an observable
pattern of activation. As in the brain, motor actions are still selected by the SR network; the PFC only
biases the stimulus-response pathways to bring about the desired behaviour. This allows the brain to very
quickly react to an emergency situation.
6.3 Recommendations
The results of the performance tests provide preliminary support for the hypothesis that a neural model of
PFC can be successfully developed to guide a neural network towards learning behavioural sequences. The
model I developed seems to work nicely under the conditions I created. But before it can be used in a more
challenging environment, a number of issues need to be resolved.
It proved to be difficult to implement lateral inhibition in the hidden and output layers. Recall that
lateral inhibition is a mechanism for putting the focus on the strongest of a set of input signals while at the
same time depressing the other inputs. It brings about winner-take-all behaviour which is very useful in
a noisy and unreliable environment. In my model, I implemented lateral inhibition by creating inhibitory
connections from every neuron to every other neuron in the same layer. To compensate for the very low
baseline activity of the neurons, I added an excitatory connection from every neuron to itself. The strength
of these connections is not learnable and needs to be set to a fixed initial value. Setting both the inhibitory
and excitatory strengths relatively high produces random, unpredictable behaviour: the relative strength of
the inputs to the layer then has too little influence on the final activation pattern. Setting the
inhibitory and excitatory connection strengths too low results in an undifferentiated activation pattern in
which no clear winner can be selected. Too much self-excitation can produce more than one winner; with
too little self-excitation no winner emerges at all.
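For reference, a bare-bones relaxation step of such a winner-take-all layer might look as follows. The two strength parameters are exactly the quantities whose tuning is discussed here; the rectification and the synchronous update are simplifications of mine and this sketch is not the implementation used in the model.

    // One relaxation step: every neuron receives its external input, excites
    // itself and is inhibited by all other neurons in the layer. Iterating this
    // step should let a single winner emerge, provided the parameters are tuned.
    class LateralInhibitionSketch {
        static double[] settleStep(double[] activation, double[] externalInput,
                                   double selfExcitation, double lateralInhibition) {
            double[] next = new double[activation.length];
            for (int i = 0; i < activation.length; i++) {
                double inhibition = 0.0;
                for (int j = 0; j < activation.length; j++) {
                    if (j != i) {
                        inhibition += lateralInhibition * activation[j];
                    }
                }
                double net = externalInput[i] + selfExcitation * activation[i] - inhibition;
                next[i] = Math.max(0.0, net); // simple rectification
            }
            return next;
        }
    }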
The amount of inhibition and excitation to apply depends on the number of neurons in the layer. After
a change in network configuration such as additional inputs or output neurons, the lateral inhibition parameters must be tuned again. The new values can probably be derived from a linear function of the number of
neurons. However, the parameters also depend on the number and average strength of the inputs external
to the layer. Those cannot be derived linearly since they in turn depend on the lateral inhibition processes
taking place inside the other layers. Currently, the lateral inhibition parameters are statically set upon initialisation of a new network. An option would be to set them at the start of a network activation cycle
as soon as the average input activation to the layer can be determined. This would work if the network
was activated in parallel, i.e. layer by layer. But this is not the case, activation of the network happens
by activating the neurons in the network in a random order. The notion of a layer is disregarded during
the activation process. By activating every neuron 100 times, the network settles into a stable activation
pattern. My suggestion would be to adjust the parameters after every activation cycle. Initially, the adjustments made will be quite radical but as the network settles, the lateral inhibition parameters will settle into
a stable state as well.
The current network design only allows sequences of two consecutive actions to be learned. In a real-life situation, behavioural sequences hardly ever consist of only two actions. In order to have the network
learn a sequence of arbitrary length we probably need more output neurons. Having more output neurons
implies a growth of the number of reafferent input neurons as well as the number of hidden and PFC
neurons. Although the complexity of the network increases, this should be relatively easy to implement.
A related issue has to do with the temporal differences. In the performance tests run on the network, I
manually cleared the reward prediction values after the network had selected two consecutive actions. It
makes sense to clear this prediction when a reward is delivered, but what if an unrewarded sequence of
two actions has been executed? How can the network know whether the expected reward is still to come
or if it failed to perform the correct sequence? This requires a notion of time-until-reward to be associated
with the reward prediction value. There are indications that the DA neurons in the brain do activate in
a time-dependent manner. In classical conditioning experiments animals have been trained to expect a
reward when presented with a conditioned stimulus. If after training the expected reward fails to occur,
a drop in dopamine activation level can be observed. This drop occurs at the very time that the animal
was expecting the reward to come. The exact time of reward delivery thus seems to be encoded in the DA
signal. Although little is yet known about the neural mechanism underlying this time-of-reward encoding,
I can use the concept in my model to allow for sequences of arbitrary length to be learned correctly.
Assume that the lateral inhibition and time-until-reward issues are resolved and say we want the network to learn the reward schedule S1 → {R1,R1,R2}. A strong connection between S1 and R1 will
develop, as well as a strong connection between did-R1 and R1. For the final step in the sequence, a strong
connection between did-R1 and R2 is required. Recall why the reafferent input did-R1 was introduced in
the first place. We needed a way to activate the network even when there is no external input present to
provide a way for generating one action from the completion of a previous action. The stimulus did-R1 is
regarded by the stimulus-response system like any other input, but unlike the other stimuli it is generated
by an internal representation of just having done a certain action. We are now faced with the problem that
did-R1 requires a different response at a different moment in sequence execution. But are the two did-R1
stimuli really the same? Not entirely: the context in which the stimuli present themselves is somewhat different. Instead of having the reafferent input represent the fact that a particular action has just been carried
out, a more complex context representation could be used as input to the system. More research is needed
to assess the nature and the biological plausibility of such a context representation.
6.4 Future work
My network design is inspired by Braver and Cohen's model of cognitive control [6]. Where my focus
was on the execution of sequences, they focused on the brain's ability to selectively update or keep
existing context information. This allows us to attend only to contextually important sensory events while
ignoring others. Braver and Cohen theorise that phasic changes in DA activity serve the function of gating information into active memory in PFC. They implement this in their network by means of a gated
connection between the stimulus input layer and the context layer (see figure 2.11). The reward prediction
layer, which models DA activity, opens the gate when a contextually relevant stimulus is presented. We can
compare this with the predictive DA release in the brain whenever an unexpected opportunity for getting
reward presents itself. In my model, I assume that only one stimulus is presented to the network at a time
and that this stimulus really is contextually relevant. Similar to the Braver and Cohen design, the DA signal
could be used to gate information from the input layer into PFC. When a potentially rewarding stimulus is
then presented together with a less relevant stimulus, PFC will activate and bias the hidden layer towards
the selection of the rewarding sequence. Upon the presentation of one or more non-rewarding stimuli,
connections from the hidden layer to the PFC still allow it to learn by observation. The PFC will not put
an unnecessary bias onto the hidden layer, and the stimulus-response network is free to select any
habitual or random exploratory response, unaffected by expectation.
In a broader context, my model of PFC can be used to explain the behavioural phenomenon of intention
recognition. Imagine that an observer has a plan to do action A1 followed by action A2 in context C. Now
imagine that this observer is watching another agent in this same context, and the agent performs action
A1. The mirror system hypothesis [31] tells us that the observer’s own representation of action A1 will
be activated in this situation. If we configure my network to allow activation not only to flow forward
from the input neurons but also backward from the output layer to the hidden layer, this will activate
the hidden neuron associated with the action the agent would perform if the observer had the same plan.
Figure 6.1a shows the network in this situation.
Figure 6.1: Network activation during intention recognition (panels a and b; stimulus S1, reafferent input did-A1, actions A1 and A2)
This in turn will activate the observer's representation of
one component of the PFC plan which he would use to generate that observed behaviour. The observer
now has a representation of a part of the agent’s plan. A remaining question is how to activate the entire
plan in order to actually understand the agent’s intentions. This requires a form of pattern auto-completion
inside PFC. In [8] Gregory Caza reports on a neural network model of this plan competition behaviour. The
general idea is that units which are frequently active together end up activating each other. After training,
this partial PFC plan will activate the entire plan using auto-completion of commonly found activation
patterns. Figure 6.1b shows the next step in doing intention recognition. Once the complete plan is active
and the observer detects the consequences of action A1, the network will activate action A2 of its own
accord. Effectively, the observer will use his plan to anticipate the observed agent’s next action. This type
of anticipation of a likely successor action in premotor mirror areas has indeed been found [10].
Bibliography
[1] W.F. Asaad, G. Rainer, and E.K. Miller. Neural activity in the primate prefrontal cortex during
associative learning. Neuron, 21:1399–1407, 1998.
[2] Bruno B. Averbeck, Matthew V. Chafee, David A. Crowe, and Apostolos P. Georgopoulos. Parallel processing of serial movements in prefrontal cortex. Proceedings of the National Academy of
Sciences, 99(20):13172–13177, 2002.
[3] H. Barbas and D.N. Pandya. Architecture and intrinsic connections of the prefrontal cortex in the
rhesus monkey. Journal of Comparative Neurology, 286:353–375, 1989.
[4] Helen Barbas. Connections underlying the synthesis of cognition, memory, and emotion in primate
prefrontal cortices. Brain Research Bulletin, 52:319–330, 2000.
[5] E. A. Berg. A simple objective technique for measuring flexibility in thinking. Journal of General
Psychology, page 15, 1948.
[6] Todd S. Braver and Jonathan D. Cohen. On the control of control: The role of dopamine in regulating
prefrontal function and working memory. In Stephen Monsell and Jon Driver, editors, Attention
and Performance XVIII; Control of cognitive processes, pages 713–737. The MIT Press, London,
England, 2000.
[7] P. Calabresi, R. Maj, N.B. Mercuri, and G. Bernardi. Coactivation of d1 and d2 dopamine receptors is
required for long-term synaptic depression in the striatum. Neuroscience letters, 142:95–99, August
1992.
[8] Gregory A. Caza. Computational model of plan competition in the prefrontal cortex. In Proceedings
of NZCSRSC ’07, the Fifth New Zealand Computer Science Research Student Conference, April 2007.
[9] G. Di Chiara and A. Imperato. Drugs abused by humans preferentially increase synaptic dopamine
concentrations in the mesolimbic system of freely moving rats. Proceedings of the National Academy
of Sciences of the United States of America, 85:5274–5278, 1988.
[10] L Fogassi, P F Ferrari, B Gesierich, S Rozzi, F Chersi, and G Rizzolatti. Parietal lobe: from action
organisation to intention understanding. Science, 308:662–667, 2005.
[11] J.M. Fuster. Neuron activity related to short-term memory. Science, 173:652–654, August 1971.
[12] S. Geyer, M. Matelli, G. Luppino, and K. Zilles. Functional neuroanatomy of the primate isocortical
motor system. Anatomy and embryology, 202:443–474, 2000.
[13] D.W. Glasspool and Houghton D. Dynamic representation of structural constraints in models of serial
behaviour. In J. Bullinaria, D. Glasspool, and G. Houghton, editors, Connectionist Representations,
pages 269–282. Springer-Verlag, London, 1997.
[14] Melvyn A. Goodale and A. David Milner. Separate visual pathways for perception and action. Trends
in Neuroscience, 15:20–25, 1992.
[15] D. Hebb. Organization of Behavior. J. Wiley & Sons, New York, 1949.
[16] L.J. Kamin. Predictability, surprise, attention and conditioning. In R. Church and B. Campbell,
editors, Punishment and Aversive Behavior. Appleton-Century-Crofts, New York, 1969.
[17] Raymond M. Klein. Inhibition of return. Trends in Cognitive Sciences, 4:138–147, 2000.
[18] T. Ljungberg, P. Apicella, and W. Schultz. Responses of monkey dopamine neurons during learning
of behavioral reactions. Journal of neurophysiology, 67:145–163, 1992.
[19] P. Maclean. The triune brain, emotion and scientific basis. In F.O. Schmitt, editor, The neurosciences:
second study program. Rockefeller University Press, New York, 1970.
[20] Earl K. Miller and Jonathan Cohen. An integrative theory of prefrontal cortex function. Annual
Revivew Neuroscience, 24:167–202, 2001.
[21] E.K. Miller. Neural mechanisms of visual working memory in prefrontal cortex of the macaque.
Journal of neuroscience, 16:5154–5167, 1996.
[22] Ralph R. Miller, Robert C. Barnet, and Nicholas J. Grahame. Assessment of the rescorla-wagner
model. Psychological Bulletin, 117:363–386, May 1995.
[23] B. Milner. Effects of different brain lesions on card sorting. the role of the frontal lobes. Archives of
neurology, 9:90–100, 1963.
[24] J. Mirenowicz and W. Schultz. Importance of unpredictability for reward responses in primate
dopamine neurons. Journal of Neurophysiology, 72:1024–1027, 1994.
[25] F. Mora and R.D. Myers. Brain self-stimulation: direct evidence for the involvement of dopamine in
the prefrontal cortex. Science, 197:1387–1389, September 1977.
[26] Randall C. O’Reilly. Generalization in interactive networks: The benefits of inhibitory competition
and hebbian learning. Neural Computation, 13:1199–1241, 2001.
[27] Ivan P. Pavlov. Conditioned reflexes. Routledge & Kegan Paul, London, 1927.
[28] E. Perret. The left frontal lobe of man and the suppression of habitual responses in verbal categorical
behaviour. Neuropsychologia, 12:323–330, 1974.
[29] M. Petrides and D.N. Pandya. Dorsolateral prefrontal cortex: comparative cytoarchitectonic analysis
in the human and the macaque brain and corticocortical connection patterns. European Journal of
Neuroscience, 11(3):1011–1036, 1999.
[30] R.A. Rescorla and A.R. Wagner. A theory of pavlovian conditioning: Variations in the effectiveness of reinforcement and non-reinforcement. In A.H. Black and W.F. Prokasy, editors, Classical
conditioning II: Current research and theory. Appleton-Century-Crofts, New York, 1972.
[31] G. Rizzolatti and L. Craighero. The mirror-neuron system. Annual Review of Neuroscience, 27:169–
192, 2004.
[32] W. Schultz. Responses of midbrain dopamine neurons to behavioral trigger stimuli in the monkey.
Journal of neurophysiology, 56:1439–1461, 1986.
[33] W Schultz, P Apicella, and T Ljungberg. Responses of monkey dopamine neurons to reward and
conditioned stimuli during successive steps of learning a delayed response task. The Journal of
Neuroscience, 13:900–913, March 1993.
[34] Wolfram Schultz, Peter Dayan, and P. Read Montague. A neural substrate of prediction and reward.
Science, 275(5306):1593–1599, March 1997.
[35] B Seltzer and D.N. Pandya. Frontal lobe connections of the superior temporal sulcus in the rhesus
monkey. Journal of Comparative Neurology, 281:97–113, 1989.
[36] B. Seltzer and D.N. Pandya. Parietal, temporal, and occipital projections to cortex of the superior
temporal sulcus in the rhesus monkey: A retrograde tracer study. Journal of Comparative Neurology,
343:445–463, 1994.
[37] A.L. Semendeferi, N. Schenker, and H. Damasio. Humans and great apes share a large frontal cortex.
Nature neuroscience, 5:272–276, March 2002.
[38] J. R. Stroop. Studies of interference in serial verbal reactions. Journal of Experimental Psychology,
18:643–662, 1935.
[39] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning,
3:9–44, 1988.
[40] Pere Vendrell, Carme Junque, Jesus Pujol, M. Angeles Jurado, Joan Molet, and Jordan Grafman. The
role of prefrontal regions in the stroop task. Neuropsychologia, 33:341–352, March 1995.
[41] Ilsun M. White and Steven P. Wise. Rule-dependent neuronal activity in the prefrontal cortex. Experimental brain research, 126:315–335, May 1999.
[42] J. R. Wickens, A. J. Begg, and G. W. Arbuthnott. Dopamine reverses the depression of rat corticostriatal synapses which normally follows high-frequency stimulation of cortex in vitro. Neuroscience,
70:1–5, 1996.
[43] B. Widrow and M.E. Hoff. Adaptive switching circuits. Institute of Radio Engineers, Western Electronic Show and Convention, Convention Record, part 4, pages 96–104, 1960.
Appendix A
Package network
Package Contents
Page
Classes
Connection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
This class represents a connection between two artificial neurons in an artificial neural
network.
Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
This class represents a layer of neurons in an artificial neural network.
Neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
This class represents an artificial neuron in an artificial neural network.
A.1 Classes
A.1.1 CLASS Connection
This class represents a connection between two artificial neurons in an artificial neural network.
DECLARATION
public class Connection
extends java.lang.Object
SERIALIZABLE FIELDS
• private double threshold
– Threshold for activating this connection.
• private double random
– Random value to add to this connections weight.
• private boolean modifiable
– Is the weight of this connection modifiable.
• private boolean gated
– Is there a (dopamine) gate on this connection.
• private int activationtype
– Type of activation. Can either be ConnectionType.EXCITATORY or
ConnectionType.INHIBITORY.
• private int location
– Location of the connection. Can be either Layer.INTERNAL or Layer.EXTERNAL
• private String name
– Name of this connection.
• private int learningtype
– Type of learning to apply for this connection. Can be either ConnectionType.LTP or
ConnectionType.LTD.
• public Hashtable<Integer,Neuron> neurons
– A table containing the neurons on either side of this connection. The table contains a mapping
from one neuron to the other as well as mapping from type (INPUT or OUTPUT) to neuron.
FIELDS
• public Hashtable<Integer,Neuron> neurons
– A table containing the neurons on either side of this connection. The table contains a mapping
from one neuron to the other as well as mapping from type (INPUT or OUTPUT) to neuron.
CONSTRUCTORS
• Connection
public Connection( network.Neuron source, network.Neuron target, byte direction, int location )
– Usage
∗ Creates a new connection between two artificial neurons. The source neuron is
unidirectionally connected to the target neuron. If the connection is bidirectional, the
output neuron is unidirectionally connected to the input neuron as well.
By default, a modifiable excitatory connection is created. Activation of the source neuron
leads to excitation of the target neuron. If the learning function is called on the
connection the weight is updated according to a Hebbian learning rule. The default
learning type is LTP.
The connection has an initial weight of 0.5.
– Parameters
∗ inputNeuron
METHODS
• getLocation
public int getLocation( )
– Usage
∗ Returns the location of this connection. This can be Layer.INTERNAL or
Layer.EXTERNAL.
– Returns - the location of this connection.
• getName
public String getName( )
– Usage
∗ Returns the name of this connection.
– Returns - the current name of this connection.
• getOutput
public double getOutput( int hash )
– Usage
∗ Returns the activation of the source neuron multiplied by the weight of the connection.
The provided hash identifies the source neuron.
– Parameters
∗ hash - the hash value of the source neuron.
– Returns - the activation level observed by the target neuron.
• getWeight
public double getWeight( )
– Usage
∗ Returns the weight of this connection.
– Returns - the current weight of this connection.
• initWeight
public void initWeight( double weightCentre, double weightRange )
– Usage
∗ Initializes the weight of this connection to a random value that lies in the weight range
around the weight centre.
– Parameters
∗ weightCentre - the value used as the weight centre.
∗ weightRange - the maximum value by which to stray from the centre.
• isModifiable
public boolean isModifiable( )
– Usage
∗ Returns the modifiability of this connection.
– Returns - true if the weight of this connection is modifiable.
• learn
public void learn( double lr )
– Usage
∗ Updates the weight of this connection by applying a Hebbian learning rule on the output
values of the source and target neurons.
– Parameters
∗ lr - the learning rate for the Hebbian learning rule.
• setActivationType
public void setActivationType( int activationtype )
– Usage
∗ Sets the connection type of this connection. The type can be either
ConnectionType.EXCITATORY or ConnectionType.INHIBITORY.
– Parameters
∗ activationtype - the connection type for this connection.
• setLearningType
public void setLearningType( int learningtype )
– Usage
∗ Sets the learning type for this connection. The type can be either ConnectionType.LTP or
ConnectionType.LTD.
– Parameters
∗ learningtype - the learning type for this connection.
• setModifiable
public void setModifiable( boolean modifiable )
– Usage
∗ Sets the modifiability of this connection.
– Parameters
∗ modifiable - true if this connection should be learned and consequently have its
weight updated.
• setName
public void setName( String name )
– Usage
∗ Sets the name of this connection.
– Parameters
∗ name - the name to be given to this connection.
• setRandom
public void setRandom( double random )
– Usage
∗ Sets the random factor for this connection. If the getOutput() function is called on the
connection it will add the random factor to the output.
– Parameters
∗ random - a random amount to temporarily add to the weight of this connection.
• setThreshold
public void setThreshold( double threshold )
– Usage
∗ Sets the threshold value for this connection. If the getOutput() function is called on the
connection it will only return a positive value if the output is above the threshold value.
– Parameters
∗ threshold - the threshold value to use for this connection.
• setWeight
public void setWeight( double weight )
– Usage
∗ Sets the weight of this connection.
– Parameters
∗ weight - the weight for this connection.
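USAGE EXAMPLE
The sketch below illustrates how a Connection might be created and trained in isolation, based only on the signatures documented above; it is illustrative, not part of the model code. The byte value passed as the direction argument and the activation values are assumptions made for the example.

import network.Connection;
import network.Layer;
import network.Neuron;

public class ConnectionExample {
    public static void main(String[] args) {
        // Two neurons to be linked by a single modifiable, excitatory connection.
        Neuron source = new Neuron();
        Neuron target = new Neuron();
        source.setName("source");
        target.setName("target");

        // Direction byte value 0 is assumed here to mean a unidirectional connection.
        Connection c = new Connection(source, target, (byte) 0, Layer.EXTERNAL);
        c.setName("source-target");
        c.initWeight(0.5, 0.1);   // random initial weight in [0.4, 0.6]

        // Give both neurons an output and apply one Hebbian weight update.
        source.setActivation(1.0);
        source.setOutput(1.0);
        target.setActivation(0.8);
        target.setOutput(0.8);
        c.learn(0.05);            // learning rate 0.05

        System.out.println("weight after learning: " + c.getWeight());
    }
}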
A.1.2 CLASS Layer
This class represents a layer of neurons in an artificial neural network.
A layer can be connected to another layer in various ways.
D ECLARATION
public class Layer
extends java.lang.Object
SERIALIZABLE FIELDS
• private int activation_function
– The activation function currently used for activating this layer.
• private String name
– This layer’s name.
FIELDS
• public static final int FULLY
–
• public static final int ONE
–
• public static final int ONEONE
–
• public static final int ONETWO
–
• public static final int ONESOME
–
• public static final int EXTERNAL
–
• public static final int INTERNAL
–
• public static final int SELF
–
• public static final int OTHER
–
CONSTRUCTORS
• Layer
public Layer( String name, int neuronCount )
– Usage
∗ Creates a new layer with the given name and the specified number of neurons.
– Parameters
∗ name - the name given to this layer.
∗ neuronCount - the number of neurons in this layer.
METHODS
• activate
public void activate( )
– Usage
∗ Activates this layer by activating each of its neurons in turn. The outputs of the
neurons are then set using a 'winner takes all / output one' strategy.
• average
public void average( )
• clearActivation
public void clearActivation( )
– Usage
∗ Clears the activation of the neurons in this layer.
The activation of every neuron is set to 0.
• connect
public Vector<Connection> connect( network.Layer layer, int type, network.ConnectionType ct )
– Usage
∗ Connects this layer to another layer using the specified connection pattern.
– Parameters
∗ layer - the layer to connect this layer to.
∗ type - the connection type for the new layer connection (FULLY, ONEONE or
ONETWO)
∗ ct - the ConnectionType Object describing important connection features
• getActivation
public double getActivation( )
– Usage
∗ Returns the activation of all the neurons in this layer.
– Returns - the activation of the neurons in this layer
• getConnections
public HashSet<Connection> getConnections( )
– Usage
∗ Returns the external connections to and from this layer
– Returns - an unordered set of Connection objects
containing all the external connections of this layer.
• getName
public String getName( )
– Usage
∗ Returns the name of this layer.
– Returns - the name of this layer.
• getNeuronCount
public int getNeuronCount( )
– Usage
∗ Returns the number of neurons in this layer.
– Returns - the number of neurons in this layer.
• getNeuronSet
public HashSet<Neuron> getNeuronSet( )
– Usage
∗ Returns the neurons in this layer.
– Returns - an unordered set of Neuron objects
containing all the neurons in this layer.
• getOutputs
public double getOutputs( )
– Usage
∗ Returns the outputs of this layer.
– Returns - the current outputs of this layer.
• getWeights
public Hashtable<String,Double> getWeights( )
– Usage
∗ Returns the weights of all external connections
to and from this layer as a hash table. The name of the
connection is used as a key for looking up its weight.
– Returns - a hash table with the weights of the connections
to and from this layer.
• initWeights
public void initWeights( double weightCentre, double weightRange )
– Usage
∗ Initialises the weights of all external connections
of this layer to a value that lies in the weight
range around the weight centre.
– Parameters
∗ weightCentre - the value used as the weight centre.
∗ weightRange - the maximum value by which to stray from the centre.
• printActivation
public void printActivation( )
• printConnections
public void printConnections( )
• selfBias
public void selfBias( double weight )
– Usage
∗ Creates a bias unit with excitatory connections to
every neuron in this layer.
– Parameters
∗ weight - the weight for the connections.
• selfConnect
public void selfConnect( int type, double weight )
– Usage
∗ Connects this layer to itself with
full connectivity using the specified
connection type.
– Parameters
∗ type - the connection type (inhibitory or excitatory).
∗ weight - the weight for the connections.
• selfExcite
public void selfExcite( double weight )
– Usage
∗ Creates a self-excitatory connection to every neuron
in this layer.
– Parameters
∗ weight - the weight for the connections.
• selfInhibit
public void selfInhibit( double weight )
– Usage
∗ Creates inhibitory connections from
every neuron to every other neuron
in this layer.
– Parameters
∗ weight - the weight for the connections.
• setActivation
public void setActivation( double [] activation )
– Usage
∗ Manually sets the activation of the neurons in this
layer to the given activation pattern.
– Parameters
∗ activation - the activation pattern to set for the neurons in this layer.
• setActivationFunction
public void setActivationFunction( int activation_function )
• setInternalConnections
public void setInternalConnections( double weight )
– Usage
∗ Sets all the weights of the internal
(inhibitory) connections to the same
value.
– Parameters
∗ weight - the weight for the connections.
• setOutputs
public void setOutputs( )
– Usage
∗ Sets the outputs of the neurons in this layer
by applying a ’winner takes all / output one’ function
on their current activation.
• setRandom
public void setRandom( double randominput )
• setSelfConnections
public void setSelfConnections( double weight )
– Usage
∗ Sets all the weights of the self-excitatory
(internal) connections to the same value.
– Parameters
∗ weight - the weight for the connections.
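USAGE EXAMPLE
A comparable sketch for the Layer class, again based only on the methods documented above. The layer size, weights and activation pattern are arbitrary illustration values, not taken from the model code.

import network.Layer;

public class LayerExample {
    public static void main(String[] args) {
        // A layer of five neurons with internal wiring.
        Layer hidden = new Layer("hidden", 5);
        hidden.selfExcite(0.8);    // self-excitatory connection for every neuron
        hidden.selfInhibit(0.3);   // inhibitory connections between all neuron pairs

        // Clamp an activation pattern and derive the outputs from it.
        hidden.setActivation(new double[] { 1.0, 0.0, 0.2, 0.0, 0.0 });
        hidden.setOutputs();       // 'winner takes all / output one' on the clamped pattern
        hidden.printActivation();  // print the current activation for inspection

        System.out.println(hidden.getName() + " contains "
                + hidden.getNeuronCount() + " neurons");
    }
}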
A.1.3 CLASS Neuron
This class represents an artificial neuron in an artificial neural network. Neuron objects can be connected to one another using a Connection.
D ECLARATION
public class Neuron
extends java.lang.Object
SERIALIZABLE FIELDS
• private int hash
– Hash value of this neuron for use in a hash table.
• private int activation_function
– The currently used activation function.
• private Vector<Connection> ext_connections
– External connections to this neuron (from other layers).
• private Vector<Connection> int_connections
– Internal connections to this neuron (from within the same layer).
• private String name
– Name of this neuron.
• private String status_info
– Status information about the input and activation values.
• private StringBuffer buf
– Buffer used for status info.
FIELDS
• public static final int SIGMOID
– Use a sigmoidal activation function.
• public static final int LINEAR
– Use a linear activation function.
• public static final int PFC
– Use a PFC-specific sigmoidal activation function.
CONSTRUCTORS
• Neuron
public Neuron( )
– Usage
∗ Constructs a new Neuron with a sigmoidal activation function.
METHODS
• activate
public void activate( )
– Usage
∗ Activates this neuron by computing its activation value. This is done by taking the
weighted sum of the inputs over both the internal and the external connections to this
neuron and feeding this sum to the selected activation function.
• connect
public void connect( network.Connection connection )
– Usage
∗ Associates this neuron with a connection.
– Parameters
∗ connection - the connection to associate this neuron with.
• getActivation
public double getActivation( )
– Usage
∗ Returns the activation of this neuron.
– Returns - the most recent activation value of this neuron.
• getName
public String getName( )
– Usage
∗ Returns the name of this neuron.
– Returns - the name that has been given to this neuron.
• getOutput
public double getOutput( )
– Usage
∗ Returns the most recently set output of this neuron.
– Returns - the most recent output of this neuron.
• getStatus
public String getStatus( )
– Usage
∗ Summarises the input values received from other neurons and used to activate this
neuron.
– Returns - a string containing status information about the neuron.
• setActivation
public void setActivation( double activation )
– Usage
∗ Activates this neuron by manually setting its activation value.
– Parameters
∗ activation - the activation to manually set for this neuron.
• setActivationFunction
public void setActivationFunction( int activation_function )
– Usage
∗ Sets the activation function to use for this neuron. Possible activation functions are
SIGMOID, LINEAR or PFC.
– Parameters
∗ activation_function - the activation function to use for this neuron.
• setName
public void setName( String name )
– Usage
∗ Sets the name of this neuron.
– Parameters
∗ name - the preferred name for this neuron.
• setOutput
public void setOutput( double output )
– Usage
∗ Sets the output of this neuron to a given value.
– Parameters
∗ output - the output value to set for this neuron.
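USAGE EXAMPLE
Finally, a sketch of the Neuron class in isolation, based only on the methods documented above. Whether the Connection constructor already registers itself with its neurons is not documented here, so the connection is associated explicitly; the direction byte value is again an assumption.

import network.Connection;
import network.Layer;
import network.Neuron;

public class NeuronExample {
    public static void main(String[] args) {
        // A neuron with the default sigmoidal activation function.
        Neuron unit = new Neuron();
        unit.setName("unit");

        // One input neuron that drives it through an external connection.
        Neuron input = new Neuron();
        input.setActivation(1.0);
        input.setOutput(1.0);
        Connection c = new Connection(input, unit, (byte) 0, Layer.EXTERNAL);
        unit.connect(c);           // associate the connection with the target neuron

        unit.activate();           // weighted input sum fed to the activation function
        System.out.println(unit.getStatus());
        System.out.println("activation: " + unit.getActivation());
    }
}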