Stimulus detection and decision making via spike-based
reinforcement learning
Giancarlo La Camera
Department of Neurobiology and Behavior
Stony Brook University
Stony Brook, NY 11794, USA
giancarlo.lacamera@stonybrook.edu
Robert Urbanczik
Department of Physiology
University of Bern
Bühlplatz 5, Bern, Switzerland
urbanczik@pyl.unibe.ch
Walter Senn
Department of Physiology
University of Bern
Bühlplatz 5, Bern, Switzerland
senn@pyl.unibe.ch
Abstract
In theoretical and experimental investigations of decision-making, the main task has typically been one of classification,
wherein the relevant stimuli cueing decisions are known to the decision maker: the latter knows which stimuli are
relevant, and knows when it is being presented with one. However, in many real-life situations it is not clear which
segments in a continuous sensory stream are action relevant, and relevant segments may blend seamlessly into irrelevant
ones. Then the decision problem is just as much about when to act as about choosing the right action. Here, we present
a spiking neuron network which learns to classify hidden relevant segments of a continuous sensory stream of spatio-temporal patterns of spike trains. The network has no a priori knowledge of the stimuli, of when they are being presented, or of their behavioral significance, i.e., whether or not they are action-relevant. The network is trained by the reward
received for taking correct decisions in the presence of relevant stimuli. Simulation results show that by maximizing
expected reward, the spiking network learns to distinguish behaviorally relevant segments in the input stream from
irrelevant ones, performing a task akin to temporal stimulus segmentation.
Keywords:
spiking neuron; temporal segmentation; signal detection; gradient learning; synaptic plasticity; spike-timing patterns; firing rate
patterns; neural circuit
Acknowledgements
We acknowledge support from the National Science Foundation (Grant IIS-1161852) and the Swiss National Science
Foundation through the SystemsX.ch initiative (“Neurochoice”).
Spiking network models aspire to provide biologically plausible accounts of learning and decision making (see e.g. [1]).
For concreteness, consider the following two-choice classification task: a set of input stimuli is to be associated with one of
two possible correct actions – e.g., ‘go left’ vs. ‘go right’. The correct decision is rewarded whereas the incorrect decision
is punished. In a ‘canonical’ spiking network model designed to learn this task [2, 3], populations of sensory neurons
project to populations of ‘decision neurons’ via plastic synapses, as shown in Fig. 1a. Each stimulus is represented by the
activation of a predefined sensory population, such as the orange population in Fig. 1a. After an input is presented to the
network, some competition occurs at the level of the decision populations, which ends when one of the two populations
settles into a state of activity with a higher firing rate than the other (or, in alternative models, when its activity reaches a pre-defined
threshold earlier than the other population). The winning population initiates the corresponding action. If that action
is correct, the network is rewarded, otherwise it is punished. Based on this outcome, the synapses between the input
neurons and the decision neurons are modified so as to increase the chance of producing the correct action in response to
future presentations of the same stimulus. This class of models is able to capture much of the physiology and behavior observed in the typical laboratory tasks that inspired them [1]; however, they are designed to work in a somewhat limited
scenario, in which: 1) every stimulus presented to the agent is relevant, in the sense that, if met with the correct action,
a reward is obtained; 2) the agent knows the identity of all the stimuli and when they are being presented; 3) there is a
well-defined time period during which a decision must be made (decisions are enforced); 4) all decisions lead to feedback
(either reward or punishment) – hence, feedback is received for each stimulus presentation. Also, the network model of
Fig. 1a has as many input populations as relevant stimuli: to introduce a new stimulus, one has to augment the model
with an additional population of neurons encoding that stimulus.
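To make the above scenario concrete, the sketch below caricatures the 'canonical' circuit of Fig. 1a as a rate-based model with a winner-take-all readout and reward-modulated plasticity; the parameter values, the one-hot stimulus encoding, and the specific update rule are illustrative assumptions, not the actual spiking models of [2, 3].

```python
import numpy as np

rng = np.random.default_rng(0)

# Rate-based caricature of Fig. 1a: each of n_stim stimuli activates a dedicated
# input population; two decision units (L, R) compete through noisy rates.
n_stim, lr = 4, 0.1
W = rng.normal(0.0, 0.1, size=(2, n_stim))          # plastic weights: inputs -> {L, R}
correct = rng.integers(0, 2, size=n_stim)           # arbitrary stimulus-action mapping

for trial in range(2000):
    s = rng.integers(n_stim)                         # index of the active input population
    x = np.zeros(n_stim); x[s] = 1.0                 # one-hot population activation
    rates = W @ x + rng.normal(0.0, 0.05, size=2)    # noisy 'firing rates' of L and R
    action = int(np.argmax(rates))                   # winner-take-all decision
    reward = 1.0 if action == correct[s] else -1.0   # reward or punishment
    W[action] += lr * reward * x                     # reward-modulated plasticity
```

In this toy setting the weights onto the correct decision unit quickly come to dominate for every stimulus, mirroring the reward-driven association described above.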
Here, we present a spiking network model (the learning agent, or ‘agent’ for short) which learns to segment a continuous
input stream by identifying those segments of the stream that are action-relevant (see Fig. 1b). The relevant stimuli are
spatio-temporal patterns of spike trains hidden among a host of non-relevant patterns in the same continuous stream.
Learning is achieved with an online, spike-based learning rule that tries to maximize reward. Compared with the learning scenario outlined above, here 1) the a-priori relevance of the stimuli is not known to the agent; 2) the agent does not
know when or whether a stimulus is being presented; 3) the agent is not required to make a decision at any prescribed time; and 4) only
correct decisions made in the presence of a relevant stimulus lead to feedback. This is the fundamental distinction between
relevant and non-relevant stimuli: if any decision is made in the presence of a non-relevant stimulus, nothing happens
– in particular, no rewarding feedback is given. If every action is costly (as assumed below), the optimal behavior in
the presence of non-relevant stimuli is to do nothing.¹ Finally, note that, contrary to the model of Fig. 1a, in our network
additional stimuli can be represented as new segments of the stream, with no need to add populations of input neurons.
Network architecture and decision dynamics. The spiking network model we propose in this work is illustrated in
Fig. 1b. Two decision populations of N = 100 spiking neurons each (labeled as L and R respectively) receive input
spike trains via plastic synapses (Fig. 1b). When the difference in spike counts between the two populations exceeds
a threshold ΘD , |spk(L) − spk(R)| > ΘD , a decision occurs. As long as |spk(L) − spk(R)| < ΘD , no decisions are
taken. Each stimulus is randomly deemed either relevant or irrelevant, with relevant stimuli arbitrarily associated with
one of two correct decisions, either ‘go left’ (accomplished if spk(L) − spk(R) > ΘD ), or ‘go right’ (accomplished if
spk(L) − spk(R) < −ΘD). When a decision occurs, the corresponding feedback is delivered after a short delay (50 ms), the
stimulus is removed, and the population activity is reset to zero. Every decision (whether correct or incorrect) incurs a
small cost of −0.1 (to prevent the agent from taking decisions continuously), and a positive reward (R = 1) is given only for a correct decision in the presence of a relevant stimulus (netting a total reward of R = 0.9). Incorrect decisions are not punished (and thus only incur the cost R = −0.1). The rationale for this choice is that an additional negative reward for
an incorrect response to a relevant stimulus would signal the presence of a relevant stimulus at the time of a decision,
aiding the solution of the identification task. In case of multiple correct responses to the same relevant stimulus, only
the first such response is rewarded. We tested the model with both precise spike timing patterns (task 1) and firing rate
patterns (task 2), as detailed in a later section. In both tasks, stimuli were of random duration around a mean of 500ms.
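A minimal sketch of the decision readout and reward protocol just described is given below; the threshold value is an assumed placeholder, and the helper names (readout, feedback) are hypothetical.

```python
# Schematic decision readout and reward protocol (threshold value is assumed).
THETA_D = 15          # decision threshold on the spike-count difference
ACTION_COST = -0.1    # cost incurred by every decision
REWARD = 1.0          # reward for a correct decision on a relevant stimulus

def readout(spk_L, spk_R, theta=THETA_D):
    """Map the spike counts of the two decision populations to an action (or none)."""
    diff = spk_L - spk_R
    if diff > theta:
        return "left"
    if diff < -theta:
        return "right"
    return None                       # |spk(L) - spk(R)| <= theta: no decision yet

def feedback(decision, stimulus_is_relevant, correct_action, already_rewarded):
    """Total feedback delivered ~50 ms after a decision (net 0.9 for a first correct response)."""
    if decision is None:
        return 0.0
    r = ACTION_COST
    if stimulus_is_relevant and decision == correct_action and not already_rewarded:
        r += REWARD
    return r
```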
Decision neurons and learning rule. In the following, we denote by Ps a smoothed version of the readout spk(L) − spk(R). The neurons contributing to the population activity responsible for making decisions were modeled as spike
response models with a noisy escape mechanism for action potential emission [4] – i.e., a spike is emitted with a given
probability φ(u(t)) depending on the current value of the membrane potential u at time t. Learning occurred via the
online learning rule introduced in [5],
dwiν/dt = η |Rt| a(Ps) (rν − 1) Eiν ,        (1)
where wiν is the synaptic weight between pre-synaptic (input) neuron i and post-synaptic (decision) neuron ν, η is the
learning rate, Rt is the reward at time t, rν is an individualized reward signal that equals 1 if neuron ν made the right
decision, and −1 otherwise. The factor |Rt| ensures that the synaptic update is confined to a temporal window around reward delivery. Eiν(t) ∝ (Σ_{tν} δ(t − tν) − φ(uν(t))) PSPi(t) is a low-pass filter of the time-derivative of the gradient (with respect to the synaptic weights) of the log-likelihood of producing the output spike train {tν} given the input spike pattern, where PSPi(t) denotes the post-synaptic potential caused by input neuron i at time t (see [4, 5] for details).
¹ Note how this is different from a 3-way classification task where the stimuli are to be separated into 3 classes (‘go left’, ‘go right’, and ‘do nothing’), and in which ignoring non-relevant stimuli would be rewarded as the correct response to those stimuli.
Figure 1: Alternative neural circuit models for decision making tasks. a) In a ‘canonical’ decision-making circuit, each input stimulus
is represented by an increase in firing rate in a dedicated population of neurons (here, the orange population). Two decision populations code for ‘go left’ (red) and ‘go right’ (green), respectively. A read-out initiates either action, and the decision is met with a
reward or punishment. The outcome modulates synaptic plasticity (dashed curves) at either one of the pathways or both. To represent
a new relevant stimulus, a new population of neurons must be added to the network. b) In the type of cortical circuit studied in this
paper, the input is a spatio-temporal pattern of spike trains (each spike train coming from a different input neuron). Relevant inputs
are hidden segments of this pattern (shaded areas): if met with the appropriate response, a reward is delivered. The network has no
a priori knowledge of the relevant segments: these are formed by segmenting the input through a process of reinforcement learning.
No additional populations are required to represent additional stimuli – whether relevant or not. See the text for details.
Note that only the synapses targeting neurons voting for the wrong decision (rν = −1) are updated according to the above learning rule; the update is full (a(Ps) = 1) in case of an incorrect population decision and is attenuated by a factor a(Ps) ∝ exp(−Ps²/N) in case of a correct one.
This allows for synapses to undergo a full update only when most needed (i.e., following a wrong population decision).
Moreover, synaptic updates for neurons voting for incorrect decisions are smaller for a larger population readout Ps
because of the value of the attenuation factor a(Ps ) in this case. Since Ps can be interpreted as an internal measure of the
agent’s ‘confidence’ in its decision, the synaptic update is small for correct decisions taken with large confidence.
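The following sketch illustrates how Eq. 1 and the eligibility trace Eiν could be integrated in discrete time for a single decision neuron; the escape function φ, the time constants, and the learning rate are assumed placeholder values rather than the parameters used in the simulations.

```python
import numpy as np

eta, tau_e, dt, N = 5e-4, 0.5, 1e-3, 100     # assumed learning rate, trace time constant,
                                             # time step (s), and population size

def phi(u):
    """Spike probability per time step from an assumed exponential escape rate [4]."""
    return np.clip(np.exp(u - 1.0) * dt, 0.0, 1.0)

def update_synapses(w, E, psp, spiked, u, R_t, r_nu, P_s, decision_correct):
    """One Euler step of the eligibility trace E and of the weights w (Eq. 1).

    w, E, psp are arrays over presynaptic inputs i; spiked is 1 if the neuron
    fired in this time step; r_nu is the individualized reward signal (+1/-1).
    """
    # low-pass filter of the time-derivative of the spike-train log-likelihood gradient
    E += (dt / tau_e) * (-E) + (spiked - phi(u)) * psp
    # full update after an incorrect population decision, attenuated after a correct one
    a = np.exp(-P_s**2 / N) if decision_correct else 1.0
    w += eta * abs(R_t) * a * (r_nu - 1.0) * E * dt
    return w, E
```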
In the case of episodic learning, the learning rule Eq. 1 performs stochastic gradient ascent in a monotonic function of
reward and population activity [5]. This learning rule can be understood as an improvement over Williams’s general
gradient learning rule [6]. The need to introduce the individualized reward signal rν arises because otherwise learning
worsens as the population size increases, as demonstrated in [5]. The individualized reward signal can be made available
locally at each synapse by broadcasting feedback from the population readout Ps (e.g., through a neurotransmitter such
as acetylcholine or serotonin) and from each neuron’s own activity St (e.g., through intracellular calcium transients), in
addition to the global reward feedback Rt (see [5] for details).
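Purely for illustration, one way such an individualized signal could be assembled from the broadcast quantities is sketched below; this is a guess at a plausible scheme consistent with the description above, not necessarily the exact construction of [5].

```python
def individual_reward(P_s, own_vote, decision_was_correct):
    """+1 if neuron nu voted for the action that turned out to be correct, else -1.

    P_s: broadcast population readout; own_vote: +1/-1 depending on the neuron's
    own activity; decision_was_correct: derivable from the global reward R_t.
    """
    population_vote = 1 if P_s > 0 else -1
    voted_with_population = (own_vote == population_vote)
    return 1 if voted_with_population == decision_was_correct else -1
```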
Finally, learning occurred only on synapses targeting neurons in the L population, with the synapses projecting onto the
R population kept fixed. This way, the R population was a ‘contrast population’ used as reference for making decisions.
Since the only variable responsible for decisions is the difference between the activities of the two populations, this choice
is legitimate and allows for a minimal implementation. Note that there is no a priori preference for which population
should be the learning one: their roles can be interchanged without affecting the results.
Simulation results with spike timing patterns (task 1). In this scenario, stimuli were patterns of 60 spike trains. Each
spike train was obtained as a realization of a Poisson process with a constant firing rate of 6Hz. The choice of a Poisson
process is convenient but not strictly necessary: any other spike-train statistics could have been used instead [7]. Once
created, the spike patterns were presented each time unmodified, i.e., for each pattern, the spike trains were kept fixed
across repeated presentations (‘frozen’ patterns; a biologically unrealistic simplification that will be relaxed later). Note
that all spike patterns have exactly the same statistics, and thus they cannot be encoded or decoded by firing rate. As
shown in [5], stimuli of this sort can be classified by a single decision population equipped with the learning rule Eq. 1
within a so-called ‘time controlled’ paradigm [8], whereby an action is required at the end of stimulus presentation and
no stimulus identification is involved. Here, however, relevant stimuli must be identified first, and decisions can be
made at any time, or not at all. A simulation run for this classification task with the online learning rule
Eq. 1 is shown in Fig. 2 for the case of 6 stimuli (the same model can also learn tens of stimuli; not shown here for ease of
illustration). The network was able to learn to identify relevant segments and make the correct decision in response to
them (pink and green shaded areas in panel a), while withholding action in the presence of non-relevant segments (at least in a large fraction of them, given the limited simulation times). Performance tends to increase with learning (panel b, top).
Figure 2: Simulation results with ‘frozen’ patterns of spike trains. a) Dynamics of decisions after learning for 3000 trials in the
architecture of Fig. 1b. The population readout (Ps in the main text) correctly makes the decision to go left in response to the ‘pink’
segments of the input stream, and to go right in response to the ‘green’ segments, by crossing a threshold (dashed horizontal lines,
positive for decision ‘left’ and negative for decision ‘right’). Correct decisions cause a transient increase in the reward feedback Rt (red line).
The numbers below the negative decision threshold label the segments. After a decision is taken, the current segment disappears and
reward or penalty is given after a delay of 50ms. Stimuli were presented in random order. b) Top: performance as % correct in
response to relevant stimuli steadily improves with learning and converges to a value close to optimal in 3000 trials (asymptotic overall
performance was only slightly worse; not shown). Bottom: decision times for two stimuli vs. number of presentations of those stimuli.
In both panels, curves were smoothed out with a low pass filter x̄n = (1 − λ)x̄n−1 + λxn , with λ = 0.05. c) Detail of decision times
(top) and performance (bottom) for all 6 stimuli used in the task. In the top panel, the squares represent the total durations of the
stimuli, dots are the sampled decision times in the last 100 trials, and crosses are the average decision times. After learning, the fastest
decisions were in response to relevant segments, whereas decisions were fewer and occurred closer to the maximal stimulus duration (∼500 ms) for non-relevant segments (key: L = ‘go left’, R = ‘go right’, N = ‘non-relevant stimuli’).
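The smoothing mentioned in the caption is a standard exponential moving average; a minimal sketch:

```python
def smooth(xs, lam=0.05):
    """Low-pass filter x̄_n = (1 - λ) x̄_{n-1} + λ x_n used for the learning curves."""
    out, xbar = [], xs[0]
    for x in xs:
        xbar = (1 - lam) * xbar + lam * x
        out.append(xbar)
    return out
```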
The evolution of the decision times for the best and the worst stimuli (panel b, bottom) shows that as the network became more confident about a decision, its response to the related stimulus became faster (best stimulus), whereas when stimuli had not yet been correctly identified, the decision times tended to stay flat or to increase during learning to allow for
more information to be accrued (worst stimulus). In panel c, mean decision times and performance are shown for all
stimuli (stimuli marked N were non-relevant stimuli). The best stimulus (panel b) was stimulus 5, for which performance
reached 100% correct after training; the worst stimulus was stimulus 4, a non-relevant segment (like segment 3 in panel
a). Note that the end-performance with this stimulus after training was ∼75% correct (see panel c, bottom), which
means that ∼ 25% of the time the agent took an action during the presentation of this stimulus (the agent, however, is
still learning to ignore this stimulus; see panel b, bottom, dashed line). Note that in the case of stimuli 1 and 2 the agent
had become very confident of the correct decision, as implied by the short reaction times and the population activity
overshoots above the decision threshold in the time interval between the decision and the rewarding feedback.
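For reference, the 'frozen' spike-timing stimuli of task 1 could be generated along the following lines; the stimulus duration is fixed at 500 ms here for simplicity, whereas in the simulations durations were random around that mean.

```python
import numpy as np

rng = np.random.default_rng(1)

def frozen_poisson_pattern(n_trains=60, rate_hz=6.0, duration_s=0.5):
    """One stimulus: a list of Poisson spike-time arrays, one per input neuron."""
    pattern = []
    for _ in range(n_trains):
        n_spikes = rng.poisson(rate_hz * duration_s)
        pattern.append(np.sort(rng.uniform(0.0, duration_s, size=n_spikes)))
    return pattern

# drawn once and then replayed unchanged at every presentation ('frozen' patterns)
stimuli = [frozen_poisson_pattern() for _ in range(6)]
```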
Simulation results with firing rate patterns (task 2). The previous scenario assumed that patterns of spike trains are
reproduced exactly unmodified at each stimulus presentation (‘frozen’ spike patterns). This is clearly only a convenient
starting point. A more realistic scenario could be based on firing rate coding, whereby a stimulus is defined by the firing
rates of its input spike trains, collectively, while the spike times are generated anew during each stimulus presentation. This is also a more challenging learning task for our spike-based learning rule. The stimuli were
patterns of 60 Poisson spike trains with constant firing rates, each randomly sampled from the values 6, 22, 40, and 60 spikes/s. The pattern of firing rates defining a stimulus was always fixed, but the actual spike times were generated anew at
each stimulus presentation to produce Poisson spike trains with the given firing rates. By construction, all stimuli have
the same overall firing rate, i.e., the stimuli could not be distinguished by unsupervised processing based on the overall
firing rate of the input. The simulation shown in Fig. 3 confirms that the agent is able to identify relevant stimuli coded
as patterns of firing rates, despite using a learning rule not explicitly designed to learn firing rates.
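A sketch of how the rate-coded stimuli of task 2 could be generated is given below; assigning each rate to an equal number of input neurons (a permutation) is an assumption used here to guarantee the same overall firing rate across stimuli.

```python
import numpy as np

rng = np.random.default_rng(2)
RATES = np.array([6.0, 22.0, 40.0, 60.0])           # spikes/s, from the text

def make_rate_pattern(n_trains=60):
    """Fixed rate assignment defining one stimulus (same overall rate by construction)."""
    return rng.permutation(np.repeat(RATES, n_trains // len(RATES)))

def sample_spikes(rate_pattern, duration_s=0.5):
    """Fresh Poisson spike times, regenerated at every presentation of the stimulus."""
    return [np.sort(rng.uniform(0.0, duration_s, size=rng.poisson(r * duration_s)))
            for r in rate_pattern]

stimulus_rates = make_rate_pattern()                 # defined once per stimulus
spikes = sample_spikes(stimulus_rates)               # drawn anew on each presentation
```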
Conclusions and discussion. Learning to abstract relevant information from the environment is a crucial component
of decision making; yet, current models typically assume that the relevant inputs are known to the decision maker, and
defined once and for all. Here, we put forward a spiking network model able to detect stimuli from the environment
based on their behavioral relevance. Since the stimuli are presented in sequence in a continuous stream, with unknown
starting and ending points, the task is akin to temporal stimulus segmentation, i.e., the task of discovering boundaries
between successive stimuli. Segmentation tasks such as ours are typically solved by methods such as Hidden Markov Models [9], which require a priori knowledge of the relevant stimuli (or at least of their number), are not based on online algorithms, and lack biological plausibility.
Figure 3: Simulation results with stationary firing rate patterns, same keys as Fig. 2. See the text for details.
In contrast, our spiking network model learns online, does not require a priori knowledge of the relevant stimuli or even of when they are being presented, and allows for direct
comparison with neurobiological data and thus could help uncover potential correlates of decision confidence and other
aspects of decision making. Another hallmark of our study is the use of ‘information-controlled’ tasks, which allows
subjects to respond whenever they feel confident [8].
Our model differs from the class of neural-circuit models of decision making depicted in Fig. 1a, which require a neural
population encoding each stimulus, a priori knowledge of the relevant stimuli, and knowledge of when they start and end. Following an approach more similar to ours, the ‘tempotron’ [10] can learn to separate spike patterns into two classes, which
could be interpreted as ‘relevant’ vs. ‘non-relevant’. However, the tempotron needs to know when stimuli start and end,
and is given feedback for non-responses to relevant stimuli, which helps their identification. Also, the tempotron is only
capable of binary decisions. Moreover, if applied to a population of neurons rather than to a single neuron, learning again slows down as the population size increases unless feedback from the population activity (providing an individualized reward signal) is given.
This work can be extended in a number of directions. One could consider a visual segmentation task wherein a sequence
of images slowly appear and disappear on top of a noisy background, and the task of the agent is to identify the images
that are action-relevant. Preliminary simulations with a simple version of this task show encouraging results. A second
direction is to go beyond 2-choice tasks. This could be achieved by subdividing the decision neurons into as many
subpopulations as alternative decisions, with each subpopulation encoding a different decision. Each subpopulation
would obey the same learning rule, which is aesthetically appealing and biologically plausible. Preliminary simulations
show that with this modified architecture, the network also does a better job at learning to ignore non-relevant stimuli.
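As an illustration of this extension, a multi-choice readout could generalize the two-population rule by triggering a decision when one subpopulation leads all the others by more than the threshold; this is a hypothetical sketch, not a description of the preliminary simulations.

```python
import numpy as np

def multi_choice_readout(spike_counts, theta):
    """spike_counts: length-K array, one entry per decision subpopulation."""
    order = np.argsort(spike_counts)[::-1]           # best and runner-up subpopulations
    lead = spike_counts[order[0]] - spike_counts[order[1]]
    return int(order[0]) if lead > theta else None   # None: no decision yet
```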
References
[1] X.-J. Wang. Decision making in recurrent neuronal circuits. Neuron, 60(2):215–34, Oct 2008.
[2] X.-J. Wang. Probabilistic decision making by slow reverberation in cortical circuits. Neuron, 36:955–968, 2002.
[3] S. Fusi, W. F. Asaad, E. K. Miller, and X.-J. Wang. A neural circuit model of flexible sensorimotor mapping: learning and forgetting
on multiple timescales. Neuron, 54:319–333, Apr 2007.
[4] J.-P. Pfister, T. Toyoizumi, D. Barber, and W. Gerstner. Optimal spike-timing-dependent plasticity for precise action potential
firing in supervised learning. Neural Comput, 18(6):1318–48, Jun 2006.
[5] R. Urbanczik and W. Senn. Reinforcement learning in populations of spiking neurons. Nat Neurosci, 12(3):250–2, Mar 2009.
[6] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
[7] M. C. Wiener and B. J. Richmond. Model based decoding of spike trains. Biosystems, 67(1-3):295–300, 2002.
[8] J. Zhang, R. Bogacz, and P. Holmes. A comparison of bounded diffusion models for choice in time controlled tasks. J Math
Psychol, 53(4):231–241, Aug 2009.
[9] L. Rabiner. A tutorial on Hidden Markov Models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[10] R. Gütig and H. Sompolinsky. The tempotron: a neuron that learns spike timing-based decisions. Nat Neurosci, 9(3):420–8, Mar
2006.