Asymmetric Coding of Temporal Difference Errors:
Implications for Dopamine Firing Patterns
Y. Niv¹,², M.O. Duff² and P. Dayan²
[Figure: brain schematic of the dopaminergic midbrain (ventral tegmental area, substantia nigra) and its targets: prefrontal cortex, nucleus accumbens (ventral striatum), amygdala, dorsal striatum (caudate, putamen)]
(1) Interdisciplinary Center for Neural Computation, Hebrew University, Jerusalem, yaelniv@alice.nc.huji.ac.il  (2) Gatsby Computational Neuroscience Unit, University College London
Overview
• Substantial evidence suggests that phasic dopaminergic firing represents a temporal difference (TD) error in the predictions of future reward.
• Recent experiments probe the way information about outcomes propagates back to the stimuli predicting them. These use stochastic rewards (e.g., Fiorillo et al., 2003), which allow systematic study of persistent prediction errors even in well-learned tasks.
• We use a novel theoretical analysis to show that across-trial ramping in DA activity may be a signature of this process. Importantly, we address the asymmetric coding of positive and negative TD errors in DA activity, and acknowledge the constant learning that results from ongoing prediction errors.

Introduction: What does phasic Dopamine encode?
DA single-cell recordings from the lab of Wolfram Schultz:
• Unpredicted reward (neutral/no stimulus)
• Predicted reward (learned task)
• Omitted reward (probe trial)
-> DA encodes a temporally sophisticated reward signal

Computational hypothesis – DA encodes reward prediction error
(Sutton & Barto, 1987; Montague, Dayan & Sejnowski, 1996)

V(t) = ⟨ Σ_{τ>t} r(τ) ⟩ = ⟨ r(t+1) + r(t+2) + r(t+3) + … ⟩ ≈ r(t+1) + V(t+1)

Temporal difference error:  δ(t+1) = r(t+1) + V(t+1) − V(t)

Experimental results: measuring propagating errors
Fiorillo et al. (2003)
Classical conditioning paradigm (delay conditioning) using probabilistic outcomes -> generates ongoing prediction errors even in a learned task:
• 2 s visual stimulus indicating reward probability: 100%, 75%, 50%, 25% or 0%
• Probabilistic reward (drops of juice)
Single DA cell recordings in VTA/SNc:
• At stimulus time, DA represents the mean expected reward (consistent with the TD hypothesis)
• Surprising ramping of activity in the delay
-> Fiorillo et al.'s hypothesis: coding of uncertainty
However:
• No prediction error to 'justify' the ramp
• TD learning predicts away any predictable quantity
• Uncertainty is not available for control
-> The uncertainty hypothesis seems contradictory to the TD hypothesis

A TD resolution: Ramps result from backpropagating prediction errors

Simulating TD with asymmetric coding
• Negative δ(t) scaled by d = 1/6 prior to PSTH summation (see the sketch below)
• Learning proceeds normally (without scaling): necessary to produce the right predictions, and can be biologically plausible
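Below is a minimal sketch (not the authors' code) of the kind of simulation described above: tabular TD(0) with a tapped-delay-line representation of the trial, a probabilistic reward at the end of the delay, and negative δ(t) scaled by d = 1/6 only in the "recorded" signal that is summed into the PSTH, while learning always uses the unscaled error. The trial length, reward probability, learning rate and trial counts are illustrative choices.

```python
import numpy as np

T = 30               # timesteps per trial; stimulus onset at t = 0, reward at t = T - 1
p = 0.5              # reward probability signalled by the stimulus
alpha = 0.1          # learning rate
d = 1.0 / 6.0        # scaling applied to negative delta(t) prior to PSTH summation
rng = np.random.default_rng(0)

V = np.zeros(T + 1)  # V[t]: predicted future reward from timestep t onward (V[T] stays 0)

def run_trial():
    """One trial of TD(0) learning; returns delta(t) = r(t) + V[t+1] - V[t]."""
    rewarded = rng.random() < p
    delta = np.zeros(T)
    for t in range(T):
        r = 1.0 if (rewarded and t == T - 1) else 0.0
        delta[t] = r + V[t + 1] - V[t]
        V[t] += alpha * delta[t]          # learning always uses the unscaled error
    return delta

for _ in range(1000):                     # learn the task
    run_trial()

# "Recorded" signal: negative errors are scaled by d before averaging over trials,
# while learning continues with the full, unscaled error (ongoing prediction errors).
n_trials = 5000
psth = np.zeros(T)
for _ in range(n_trials):
    delta = run_trial()
    psth += np.where(delta >= 0, delta, d * delta) / n_trials

print(np.round(psth, 3))  # mean unscaled delta is ~0 in the delay; this average ramps toward reward
```

With d = 1 (symmetric coding) the delay-period average stays near zero; with d < 1 the summed PSTH ramps toward the time of reward, even though no single trial contains a ramp.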
Visualizing Temporal-Difference Learning:
[Figure: simulated values V(t) and TD errors δ(t) over the timesteps of a trial – after the first trial, after the third trial, as learning continues (~10 trials), and once the task is learned]
-> Ongoing (intertwined) backpropagation of asymmetrically coded positive and negative errors causes ramps to appear in the summed PSTH (the ramps result from summation over different reward histories)
-> The ramp itself is a between-trial, not a within-trial, phenomenon
Solution: A lower learning rate in trace conditioning eliminates the ramp (see "Trace conditioning: A puzzle solved" below).
Morris et al. (2004) (see also Fiorillo et al., 2003)
Indeed, the learning rate computed from Morris et al.'s data is near zero (personal communication).
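One way to see the learning-rate dependence, assuming the tabular TD(0) update V <- V + αδ used in the sketch above (reward magnitude 1): consider the prediction at the last timestep before the probabilistic reward. After a rewarded trial it becomes V + α(1 − V); after an unrewarded trial it becomes V − αV; the difference is

[V + α(1 − V)] − [V − αV] = α

so the across-trial spread of the delay-period predictions, and hence the ongoing prediction errors that sum into the ramp, scales with the learning rate and vanishes as α -> 0.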
With asymmetric coding of errors, the mean TD error at the
time of reward is proportional to p(1-p)
-> Indeed maximal at p=50%
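To see this (with reward magnitude normalized to 1 and negative errors scaled by a factor d, as above): at the time of reward, δ = 1 − p on rewarded trials (probability p) and δ = −p on unrewarded trials (probability 1 − p), so the mean represented error is

p·(1 − p) + (1 − p)·d·(−p) = (1 − d)·p(1 − p)

which is proportional to p(1 − p) and therefore maximal at p = 50%.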
[Figure: experiment vs. model comparison]
However:
• No need to assume explicit coding of uncertainty – ramping in DA activity is explained by neural constraints.
• Explanation for the puzzling absence of a ramp in trace conditioning results.
• Experimental tests:
  - Is the ramp a within-trial or a between-trial phenomenon?
  - What is the relationship between ramp size and learning rate (within/between experiments)?
Challenges to TD remain: TD and noise; conditioned inhibition; additivity…
Selected References
[1] Fiorillo, Tobler & Schultz (2003). Discrete coding of reward probability and uncertainty by dopamine neurons. Science, 299, 1898–1902.
[2] Morris, Arkadir, Nevet, Vaadia & Bergman (2004). Coincident but distinct messages of midbrain dopamine and striatal tonically active neurons. Neuron, 43, 133–143.
[3] Montague, Dayan & Sejnowski (1996). A framework for mesencephalic dopamine systems based on predictive Hebbian learning. J Neurosci, 16, 1936–1947.
[4] Sutton & Barto (1998). Reinforcement Learning: An Introduction. MIT Press.
Trace conditioning: A puzzle solved
[Figure: trace conditioning task – short visual stimulus, trace period, probabilistic reward (drops of juice)]
• Same (if not more) uncertainty
• But: no DA ramping

[Figure: DA firing-rate data – Bayer and Glimcher (270%); Schultz lab (55%)]
Conclusion: Uncertainty or Temporal Difference?
Note that according to TD, activity at the time of reward should cancel out – but it doesn't. This is because:
• Prediction errors can be positive or negative
• However, firing rate is positive -> encoding of negative errors is relative to baseline activity
• But: baseline activity in DA cells is low (2–5 Hz)
-> asymmetric coding of errors
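One simple way to write this down (δ_rep is just a label introduced here for the represented, i.e. recorded, error; d = 1/6 is the scaling used in the simulations above):

δ_rep(t) = δ(t) if δ(t) ≥ 0,  and  δ_rep(t) = d·δ(t) if δ(t) < 0,  with 0 < d < 1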
• Precise computational theory for the generation of DA firing patterns
• Compelling account for the role of DA in appetitive conditioning
-> Phasic DA encodes reward prediction error
Acknowledgements
This research was funded by an EC Thematic Network short-term fellowship to YN and The Gatsby Charitable Foundation.