Stimulus detection and decision making via spike-based reinforcement learning

Giancarlo La Camera
Department of Neurobiology and Behavior, Stony Brook University, Stony Brook, NY 11794, USA
giancarlo.lacamera@stonybrook.edu

Robert Urbanczik
Department of Physiology, University of Bern, Bühlplatz 5, Bern, Switzerland
urbanczik@pyl.unibe.ch

Walter Senn
Department of Physiology, University of Bern, Bühlplatz 5, Bern, Switzerland
senn@pyl.unibe.ch

Abstract

In theoretical and experimental investigations of decision making, the main task has typically been one of classification, wherein the relevant stimuli cueing decisions are known to the decision maker: the latter knows which stimuli are relevant and knows when it is being presented with one. In many real-life situations, however, it is not clear which segments of a continuous sensory stream are action-relevant, and relevant segments may blend seamlessly into irrelevant ones. The decision problem is then just as much about when to act as about choosing the right action. Here, we present a spiking neural network that learns to classify hidden relevant segments of a continuous sensory stream of spatio-temporal patterns of spike trains. The network has no a priori knowledge of the stimuli, of when they are being presented, or of their behavioral significance, i.e., whether or not they are action-relevant. The network is trained by the reward received for taking correct decisions in the presence of relevant stimuli. Simulation results show that, by maximizing expected reward, the spiking network learns to distinguish behaviorally relevant segments of the input stream from irrelevant ones, performing a task akin to temporal stimulus segmentation.

Keywords: spiking neuron; temporal segmentation; signal detection; gradient learning; synaptic plasticity; spike-timing patterns; firing rate patterns; neural circuit

Acknowledgements: We acknowledge support from the National Science Foundation (Grant IIS-1161852) and the Swiss National Science Foundation through the SystemsX.ch initiative ("Neurochoice").

Spiking network models aspire to produce biologically plausible models of learning and decision making (see e.g. [1]). For concreteness, consider the following two-choice classification task: a set of input stimuli is to be associated with one of two possible correct actions, e.g., 'go left' vs. 'go right'. The correct decision is rewarded, whereas the incorrect decision is punished. In a 'canonical' spiking network model designed to learn this task [2, 3], populations of sensory neurons project to populations of 'decision neurons' via plastic synapses, as shown in Fig. 1a. Each stimulus is represented by the activation of a predefined sensory population, such as the orange population in Fig. 1a. After an input is presented to the network, a competition takes place between the decision populations, which ends when one of the two populations settles into a state of activity with a higher firing rate than the other (or, in alternative models, when its activity reaches a predefined threshold earlier than the other population's). The winning population initiates the corresponding action. If that action is correct, the network is rewarded; otherwise it is punished. Based on this outcome, the synapses between the input neurons and the decision neurons are modified so as to increase the chance of producing the correct action in response to future presentations of the same stimulus.
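For concreteness, the sketch below caricatures one trial of this canonical scheme as a rate-based, winner-take-all readout with a reward-modulated Hebbian update. It is an illustrative toy of ours, not the spiking models of [2, 3]; all variable names, rates, and the specific three-factor update are assumptions made only to convey the logic of the circuit in Fig. 1a.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_stimuli, eta = 60, 4, 0.05
W = rng.normal(0.0, 0.1, size=(2, n_inputs))        # plastic synapses onto the two decision populations
stimuli = rng.random((n_stimuli, n_inputs))         # one dedicated input activity pattern per stimulus
targets = rng.integers(0, 2, size=n_stimuli)        # correct action per stimulus: 0 = 'go left', 1 = 'go right'

def trial(s):
    x = stimuli[s]
    choice = int(np.argmax(W @ x))                  # competition resolved as winner-take-all
    reward = 1.0 if choice == targets[s] else -1.0  # every stimulus is relevant: reward or punishment follows
    W[choice] += eta * reward * x                   # reward-modulated Hebbian update of the winning pathway
    return reward

hits = sum(trial(rng.integers(n_stimuli)) > 0 for _ in range(1000))
```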
This class of models is able to capture much of the physiology and behavior observed in the typical laboratory tasks that inspired them [1]; however, these models are designed to work in a somewhat limited scenario, in which: 1) every stimulus presented to the agent is relevant, in the sense that, if met with the correct action, a reward is obtained; 2) the agent knows the identity of all the stimuli and when they are being presented; 3) there is a well-defined time period during which a decision must be made (decisions are enforced); 4) all decisions lead to feedback (either reward or punishment), hence feedback is received for each stimulus presentation. Also, the network model of Fig. 1a has as many input populations as relevant stimuli: to introduce a new stimulus, one has to augment the model with an additional population of neurons encoding that stimulus.

Here, we present a spiking network model (the learning agent, or 'agent' for short) which learns to segment a continuous input stream by identifying those segments of the stream that are action-relevant (see Fig. 1b). The relevant stimuli are spatio-temporal patterns of spike trains hidden among a host of non-relevant patterns in the same continuous stream. Learning is achieved with an online, spike-based learning rule that tries to maximize reward. Compared with the learning scenario outlined above, here 1) the a priori relevance of the stimuli is not known to the agent; 2) the agent does not know when, or whether, a stimulus is being presented; 3) the agent is not required to make a decision at any particular time; and 4) only correct decisions made in the presence of a relevant stimulus lead to feedback. This is the fundamental distinction between relevant and non-relevant stimuli: if a decision is made in the presence of a non-relevant stimulus, nothing happens; in particular, no rewarding feedback is given. If every action is costly (as assumed below), the optimal behavior in the presence of non-relevant stimuli is to do nothing.¹ Finally, note that, contrary to the model of Fig. 1a, in our network additional stimuli can be represented as new segments of the stream, with no need to add populations of input neurons.

Figure 1: Alternative neural circuit models for decision-making tasks. a) In a 'canonical' decision-making circuit, each input stimulus is represented by an increase in firing rate in a dedicated population of neurons (here, the orange population). Two decision populations code for 'go left' (red) and 'go right' (green), respectively. A read-out initiates either action, and the decision is met with a reward or a punishment. The outcome modulates synaptic plasticity (dashed curves) in either one of the pathways or both. To represent a new relevant stimulus, a new population of neurons must be added to the network. b) In the type of cortical circuit studied in this paper, the input is a spatio-temporal pattern of spike trains (each spike train coming from a different input neuron). Relevant inputs are hidden segments of this pattern (shaded areas): if met with the appropriate response, a reward is delivered. The network has no a priori knowledge of the relevant segments: these are formed by segmenting the input through a process of reinforcement learning. No additional populations are required to represent additional stimuli, whether relevant or not. See the text for details.

¹ Note that this differs from a 3-way classification task in which the stimuli are to be separated into 3 classes ('go left', 'go right', and 'do nothing'), and in which ignoring non-relevant stimuli would be rewarded as the correct response to those stimuli.

Network architecture and decision dynamics. The spiking network model we propose in this work is illustrated in Fig. 1b. Two decision populations of N = 100 spiking neurons each (labeled L and R, respectively) receive input spike trains via plastic synapses (Fig. 1b). When the difference in spike counts between the two populations exceeds a threshold Θ_D, i.e., |spk(L) − spk(R)| > Θ_D, a decision occurs. As long as |spk(L) − spk(R)| < Θ_D, no decision is taken. Each stimulus is randomly deemed either relevant or irrelevant, with relevant stimuli arbitrarily associated with one of two correct decisions, either 'go left' (accomplished if spk(L) − spk(R) > Θ_D) or 'go right' (accomplished if spk(R) − spk(L) > Θ_D). When a decision occurs, the corresponding feedback is delivered after a short delay (50 ms), the stimulus is removed, and the population activity is reset to zero. Every decision (whether correct or incorrect) incurs a small cost of −0.1 (to prevent the agent from taking decisions continuously), and a positive reward (R = 1) is given only for a correct decision in the presence of a relevant stimulus (netting a total reward of R = 0.9). Incorrect decisions are not punished (and thus only incur the cost R = −0.1). The rationale for this choice is that an additional negative reward for an incorrect response to a relevant stimulus would signal the presence of a relevant stimulus at the time of a decision, aiding the solution of the identification task. In case of multiple correct responses to the same relevant stimulus, only the first such response is rewarded.
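The following minimal sketch illustrates the decision dynamics and reward scheme just described, with the two populations caricatured as Poisson spike counters. The 1 ms time step, the leak standing in for the smoothed readout, and the numerical value of Θ_D are our assumptions; the population size N = 100, the cost of 0.1 per decision, and the reward of 1 for correct decisions on relevant segments are taken from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100                 # neurons per decision population (as stated in the text)
theta_D = 15.0          # decision threshold on the spike-count difference (illustrative value)
dt = 0.001              # 1 ms time step (assumption)

def run_segment(rate_L, rate_R, duration=0.5, relevant=True, correct='L'):
    """Accumulate a leaky spike-count difference and apply the reward scheme of the text."""
    Ps = 0.0
    for step in range(int(duration / dt)):
        # instantaneous population spike counts, caricatured as Poisson
        Ps += rng.poisson(rate_L * N * dt) - rng.poisson(rate_R * N * dt)
        Ps *= 0.98                                   # leak: a crude stand-in for the smoothed readout P_s
        if abs(Ps) > theta_D:                        # a decision is triggered
            choice = 'L' if Ps > 0 else 'R'
            reward = -0.1                            # every decision incurs the cost 0.1
            if relevant and choice == correct:
                reward += 1.0                        # correct decision on a relevant segment nets 0.9
            return choice, reward, (step + 1) * dt
    return None, 0.0, duration                       # no decision taken: no feedback at all

print(run_segment(rate_L=10.0, rate_R=4.0))
```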
We tested the model with both precise spike-timing patterns (task 1) and firing-rate patterns (task 2), as detailed in the sections below. In both tasks, stimuli had a random duration around a mean of 500 ms.

Decision neurons and learning rule. In the following, we denote by P_s a smoothed version of the readout spk(L) − spk(R). The neurons contributing to the population activity responsible for making decisions were modeled as spike response models with a noisy escape mechanism for action potential emission [4], i.e., a spike is emitted with a probability φ(u(t)) that depends on the current value of the membrane potential u at time t. Learning occurred via the online learning rule introduced in [5],

$$\frac{dw_{i\nu}}{dt} = \eta \, |R_t| \, a(P_s) \, (r_\nu - 1) \, E_{i\nu}, \qquad (1)$$

where w_iν is the synaptic weight between pre-synaptic (input) neuron i and post-synaptic (decision) neuron ν, η is the learning rate, R_t is the reward at time t, and r_ν is an individualized reward signal that equals 1 if neuron ν voted for the right decision and −1 otherwise. The factor |R_t| ensures that the synaptic update is confined to a temporal window around reward delivery. The eligibility trace $E_{i\nu}(t) \propto \big(\sum_{t^\nu} \delta(t - t^\nu) - \phi(u_\nu(t))\big)\, \mathrm{PSP}_i(t)$ is a low-pass filter of the time-derivative of the gradient (with respect to the synaptic weights) of the log-likelihood of producing the output spike train {t^ν} given the input spike pattern, where PSP_i(t) is the post-synaptic potential caused by input neuron i at time t (see [4, 5] for details). Note that only the synapses targeting neurons that voted for the wrong decision (r_ν = −1) are updated by the above learning rule; the update is full (a(P_s) = 1) after an incorrect population decision and is attenuated by a factor a(P_s) ∝ e^(−P_s²/N) after a correct one. This allows synapses to undergo a full update only when most needed, i.e., following a wrong population decision. Moreover, synaptic updates for neurons voting for incorrect decisions are smaller when the population readout P_s is larger, because of the value of the attenuation factor a(P_s) in this case. Since P_s can be interpreted as an internal measure of the agent's 'confidence' in its decision, the synaptic update is small for correct decisions taken with large confidence.
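A discrete-time sketch of one Euler step of the update in Eq. 1 for the synapses onto a single decision neuron ν. The exponential eligibility filter, the specific escape-rate function φ, and all numerical constants are illustrative assumptions; see [4, 5] for the actual spike response model and filter.

```python
import numpy as np

dt, tau_e, eta, N = 0.001, 0.1, 0.01, 100   # time step, eligibility time constant, learning rate (assumed values)

def phi(u):
    """Assumed escape rate: instantaneous spike probability density given membrane potential u."""
    return 20.0 * np.exp(u)                 # illustrative form only

def update_synapses(w, psp, spiked, u, R_t, r_nu, P_s, correct_decision, E):
    """One Euler step of Eq. 1 for the synapses onto decision neuron nu.

    w, psp, E : arrays over presynaptic inputs i
    spiked    : 1 if neuron nu emitted a spike in this time step, else 0
    r_nu      : +1 if neuron nu voted for the right decision, -1 otherwise
    """
    # eligibility trace: low-pass filter of (output spikes - escape rate) * PSP_i(t)
    E += dt / tau_e * (-E + (spiked / dt - phi(u)) * psp)
    # attenuation: full update after a wrong population decision, reduced otherwise
    a = np.exp(-P_s**2 / N) if correct_decision else 1.0
    # Eq. 1: only neurons that voted for the wrong decision (r_nu = -1) are updated
    w += dt * eta * abs(R_t) * a * (r_nu - 1.0) * E
    return w, E
```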
In the case of episodic learning, the learning rule of Eq. 1 performs stochastic gradient ascent on a monotonic function of reward and population activity [5]. This learning rule can be understood as an improvement over Williams's general gradient learning rule [6]. The individualized reward signal r_ν is needed because, without it, learning worsens as the population size increases, as demonstrated in [5]. The individualized reward signal can be made available locally at each synapse by broadcasting feedback from the population readout P_s (e.g., through a neurotransmitter such as acetylcholine or serotonin) and from each neuron's own activity S_t (e.g., through intracellular calcium transients), in addition to the global reward feedback R_t (see [5] for details). Finally, learning occurred only on the synapses targeting neurons in the L population, with the synapses projecting onto the R population kept fixed. In this way, the R population served as a 'contrast population' used as a reference for making decisions. Since the only variable responsible for decisions is the difference between the activities of the two populations, this choice is legitimate and allows for a minimal implementation. Note that there is no a priori preference for which population should be the learning one: their roles can be interchanged without affecting the results.

Simulation results with spike timing patterns (task 1). In this scenario, stimuli were patterns of 60 spike trains. Each spike train was obtained as a realization of a Poisson process with a constant firing rate of 6 Hz. The choice of a Poisson process is convenient but not strictly necessary: any other distribution could have been used instead [7]. Once created, the spike patterns were presented unmodified each time, i.e., for each pattern the spike trains were kept fixed across repeated presentations ('frozen' patterns; this is an un-biological simplification that will be relaxed later). Note that all spike patterns have exactly the same statistics, and thus they cannot be encoded or decoded by firing rate. As shown in [5], stimuli of this sort can be classified by a single decision population equipped with the learning rule of Eq. 1 within a so-called 'time-controlled' paradigm [8], whereby an action is required at the end of stimulus presentation and no stimulus identification is involved. Here, however, relevant stimuli must be identified first, and decisions can be made at any time, or not made at all.
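A minimal sketch of how the 'frozen' task-1 stimuli can be generated: 60 homogeneous Poisson spike trains at 6 Hz, drawn once and then replayed verbatim at every presentation. The uniform distribution used for the random duration around the 500 ms mean is our assumption; the text only states that durations are random with a 500 ms mean.

```python
import numpy as np

rng = np.random.default_rng(2)

def frozen_pattern(n_trains=60, rate=6.0, mean_duration=0.5):
    """Draw one spatio-temporal spike pattern once; it is then replayed unchanged on every presentation."""
    duration = rng.uniform(0.7, 1.3) * mean_duration           # random duration around the mean (assumed uniform)
    trains = []
    for _ in range(n_trains):
        n_spikes = rng.poisson(rate * duration)
        trains.append(np.sort(rng.uniform(0.0, duration, size=n_spikes)))
    return duration, trains                                     # list of spike-time arrays, one per input neuron

# e.g., 6 stimuli, each later labeled relevant ('go left'/'go right') or non-relevant at random
stimuli = [frozen_pattern() for _ in range(6)]
```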
Figure 2: Simulation results with 'frozen' patterns of spike trains. a) Dynamics of decisions after learning for 3000 trials in the architecture of Fig. 1b. The population readout (P_s in the main text) correctly makes the decision to go left in response to the 'pink' segments of the input stream, and to go right in response to the 'green' segments, by crossing a threshold (dashed horizontal lines, positive for decision 'left' and negative for decision 'right'). Correct decisions cause a transient increase in the reward feedback R_t (red line). The numbers below the negative decision threshold label the segments. After a decision is taken, the current segment disappears and reward or penalty is given after a delay of 50 ms. Stimuli were presented in random order. b) Top: performance, as % correct in response to relevant stimuli, steadily improves with learning and converges to a value close to optimal within 3000 trials (asymptotic overall performance was only slightly worse; not shown). Bottom: decision times for two stimuli vs. the number of presentations of those stimuli. In both panels, curves were smoothed with the low-pass filter x̄_n = (1 − λ) x̄_{n−1} + λ x_n, with λ = 0.05. c) Detail of decision times (top) and performance (bottom) for all 6 stimuli used in the task. In the top panel, squares represent the total durations of the stimuli, dots are the sampled decision times in the last 100 trials, and crosses are the average decision times. After learning, the fastest decisions were in response to relevant segments, whereas for non-relevant segments decisions were fewer and their times closer to the maximal stimulus duration (∼500 ms). Key: L = 'go left', R = 'go right', N = 'non-relevant stimuli'.

A simulation run for this classification task with the online learning rule of Eq. 1 is shown in Fig. 2 for the case of 6 stimuli (the same model can also learn tens of stimuli; not shown here to ease illustration). The network learned to identify the relevant segments and to make the correct decision in response to them (pink and green shaded areas in panel a), while withholding actions in the presence of non-relevant segments, at least in a large fraction of them and given the limited simulation times. Performance tends to increase with learning (panel b, top). The evolution of decision times for the best and the worst stimuli (panel b, bottom) shows that, as the network became more confident about a decision, its response to the related stimulus became faster (best stimulus), whereas for stimuli that had not yet been correctly identified, the decision times tended to stay flat or to increase during learning, allowing more information to be accrued (worst stimulus). In panel c, mean decision times and performance are shown for all stimuli (stimuli marked N were non-relevant). The best stimulus (panel b) was stimulus 5, for which performance reached 100% correct after training; the worst stimulus was stimulus 4, a non-relevant segment (like segment 3 in panel a). Note that the end performance for this stimulus after training was ∼75% correct (see panel c, bottom), which means that ∼25% of the time the agent took an action during the presentation of this stimulus (the agent, however, was still learning to ignore this stimulus; see panel b, bottom, dashed line). Note also that in the case of stimuli 1 and 2 the agent had become very confident of the correct decision, as implied by the short reaction times and by the population activity overshooting the decision threshold in the interval between the decision and the rewarding feedback.

Simulation results with firing rate patterns (task 2). The previous scenario assumed that the patterns of spike trains are reproduced exactly, unmodified, at each stimulus presentation ('frozen' spike patterns). This is clearly only a convenient starting point. A more realistic scenario is based on firing-rate coding, whereby a stimulus is defined collectively by the firing rates of its input spike trains, while the spike times are generated anew during each stimulus presentation. This is both a more realistic scenario and a more challenging learning task for our spike-based learning rule. The stimuli were patterns of 60 Poisson spike trains with constant firing rates, each randomly sampled from the values 6, 22, 40, and 60 spikes/s. The pattern of firing rates defining a stimulus was always fixed, but the actual spike times were generated anew at each stimulus presentation to produce Poisson spike trains with the given rates.
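A sketch of the task-2 stimulus construction: the rate of each of the 60 input channels is drawn once per stimulus from {6, 22, 40, 60} spikes/s and kept fixed, while fresh Poisson spike times are drawn at every presentation. The balanced assignment of rates (15 channels per rate), which guarantees that all stimuli share the same overall rate, is our reading of 'by construction' below; the exact sampling scheme is not spelled out in the text.

```python
import numpy as np

rng = np.random.default_rng(3)
rates_pool = np.repeat([6.0, 22.0, 40.0, 60.0], 15)   # 60 channels, balanced so every stimulus has the same mean rate

def make_rate_stimulus():
    """Fixed rate pattern defining one stimulus (drawn once)."""
    return rng.permutation(rates_pool)

def present(rate_pattern, duration=0.5):
    """Fresh Poisson spike times for every presentation of the same rate pattern."""
    return [np.sort(rng.uniform(0.0, duration, size=rng.poisson(r * duration)))
            for r in rate_pattern]

stimulus = make_rate_stimulus()
spikes_1 = present(stimulus)     # two presentations of the same stimulus:
spikes_2 = present(stimulus)     # same rates, different spike times
```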
By construction, all stimuli have the same overall firing rate, i.e., the stimuli could not be distinguished by unsupervised processing based on the overall firing rate of the input. The simulation shown in Fig. 3 confirms that the agent is able to identify relevant stimuli coded as patterns of firing rates, despite using a learning rule not explicitly designed to learn firing rates.

Figure 3: Simulation results with stationary firing rate patterns; same keys as Fig. 2. See the text for details.

Conclusions and discussion. Learning to abstract relevant information from the environment is a crucial component of decision making; yet, current models typically assume that the relevant inputs are known to the decision maker and are defined once and for all. Here, we put forward a spiking network model able to detect stimuli from the environment based on their behavioral relevance. Since the stimuli are presented in sequence in a continuous stream, with unknown starting and ending points, the task is akin to temporal stimulus segmentation, i.e., the task of discovering boundaries between successive stimuli. Segmentation tasks such as ours are typically solved by methods such as Hidden Markov Models [9], which require a priori knowledge of the relevant stimuli (or at least of their number), are not based on online algorithms, and lack biological plausibility. In contrast, our spiking network model learns online, does not require a priori knowledge of the relevant stimuli or even of when they are being presented, and allows for a direct comparison with neurobiological data; it could thus help uncover potential correlates of decision confidence and other aspects of decision making. Another hallmark of our study is the use of 'information-controlled' tasks, which allow subjects to respond whenever they feel confident [8].

Our model differs from the class of neural-circuit models of decision making depicted in Fig. 1a, which require a neural population encoding each stimulus as well as a priori knowledge of the relevant stimuli and of when they start and end. Following an approach more similar to ours, the 'tempotron' [10] can learn to separate spike patterns into two classes, which could be interpreted as 'relevant' vs. 'non-relevant'. However, the tempotron needs to know when stimuli start and end, and it is given feedback for non-responses to relevant stimuli, which helps their identification. Also, the tempotron is only capable of binary decisions. Moreover, if the rule is applied to a population of neurons rather than to a single neuron, learning again slows down unless feedback from the population activity (resulting in an individualized reward signal) is provided.

This work can be extended in a number of directions. One could consider a visual segmentation task wherein a sequence of images slowly appears and disappears on top of a noisy background, and the task of the agent is to identify the images that are action-relevant. Preliminary simulations with a simple version of this task show encouraging results. A second direction is to go beyond two-choice tasks. This could be achieved by subdividing the decision neurons into as many subpopulations as there are alternative decisions, with each subpopulation encoding a different decision. Each subpopulation would obey the same learning rule, which is aesthetically appealing and biologically plausible. Preliminary simulations show that, with this modified architecture, the network also does a better job at learning to ignore non-relevant stimuli.
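One hypothetical way the multi-alternative readout could work, shown only to make the extension concrete: K subpopulations, each encoding one decision, with a decision triggered when one subpopulation's (smoothed) spike count leads every other's by the threshold Θ_D. This margin criterion is our own generalization of the two-population rule |spk(L) − spk(R)| > Θ_D and is not specified in the text.

```python
import numpy as np

def population_decision(counts, theta_D):
    """counts: array of (smoothed) spike counts, one entry per decision subpopulation.

    Returns the index of the chosen action, or None if no subpopulation leads
    all others by at least theta_D (i.e., no decision is taken yet).
    This margin criterion is a hypothetical generalization of |spk(L) - spk(R)| > theta_D.
    """
    counts = np.asarray(counts, dtype=float)
    leader = int(np.argmax(counts))
    runner_up = np.max(np.delete(counts, leader)) if counts.size > 1 else -np.inf
    if counts[leader] - runner_up > theta_D:
        return leader
    return None

print(population_decision([42, 18, 25], theta_D=15))   # -> 0 (first action leads by more than theta_D)
print(population_decision([42, 35, 25], theta_D=15))   # -> None (no decision yet)
```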
References

[1] X.-J. Wang. Decision making in recurrent neuronal circuits. Neuron, 60(2):215–234, Oct 2008.
[2] X.-J. Wang. Probabilistic decision making by slow reverberation in cortical circuits. Neuron, 36:955–968, 2002.
[3] S. Fusi, W. F. Asaad, E. K. Miller, and X.-J. Wang. A neural circuit model of flexible sensorimotor mapping: learning and forgetting on multiple timescales. Neuron, 54:319–333, Apr 2007.
[4] J.-P. Pfister, T. Toyoizumi, D. Barber, and W. Gerstner. Optimal spike-timing-dependent plasticity for precise action potential firing in supervised learning. Neural Comput, 18(6):1318–1348, Jun 2006.
[5] R. Urbanczik and W. Senn. Reinforcement learning in populations of spiking neurons. Nat Neurosci, 12(3):250–252, Mar 2009.
[6] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
[7] M. C. Wiener and B. J. Richmond. Model based decoding of spike trains. Biosystems, 67(1-3):295–300, 2002.
[8] J. Zhang, R. Bogacz, and P. Holmes. A comparison of bounded diffusion models for choice in time controlled tasks. J Math Psychol, 53(4):231–241, Aug 2009.
[9] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
[10] R. Gütig and H. Sompolinsky. The tempotron: a neuron that learns spike timing-based decisions. Nat Neurosci, 9(3):420–428, Mar 2006.