Evolution of Reinforcement Learning in Unpredictable Environments

Dopamine, Uncertainty and TD Learning Yael Niv Michael Duff Peter Dayan Gatsby Computational Neuroscience Unit, UCL CNS 2004 What is the function of Dopamine? Prefrontal Cortex Dorsal Striatum (Caudate, Putamen) Parkinson’s Disease -> Movement control? Intracranial selfstimulation; Drug addiction -> Reward pathway? -> Learning? Nucleus Accumbens (Ventral Striatum) Amygdala Ventral Tegmental Area Substantia Nigra Also involved in: - Working memory - Novel situations - ADHD - Schizophrenia … What does phasic Dopamine encode? Unpredicted reward (neutral/no stimulus) Predicted reward (learned task) Omitted reward (probe trial) (Schultz et al.) The TD Hypothesis of Dopamine V (t )   r ( )  r (t  1)  r (t  2)  r (t  3)  ...  t  r (t  1)  V (t  1)  (t  1)  r (t  1)  V (t  1)  V (t ) Temporal V (t )   (t  1)  Phasic DA encodes a reward prediction error • Precise theory for generation of DA firing patterns • Compelling account for the role of DA in classical conditioning (Sutton+Barto 1987, Schultz,Dayan,Montague 1997) V  value r  reward difference error But: Fiorillo, Tobler & Schultz 2003 • Introduce inherent uncertainty into the classical conditioning paradigm Stimulus = 2 sec visual stimulus Reward (probabilistic) = drops of juice • Five visual stimuli indicating different reward probabilities: P= 100%, 75%, 50%, 25%, 0% Fiorillo, Tobler & Schultz 2003 At stimulus time - DA represents mean expected reward Delay activity - A ramp in activity up to reward Hypothesis: DA ramp encodes uncertainty in reward “Uncertainty Ramping” and TD error? • The uncertainty is predictable from the stimulus • TD predicts away predictable quantities  If it represents uncertainty, the ramping activity should disappear with learning according to TD.  Uncertainty ramping is not easily compatible with the TD hypothesis Are the ramps really coding uncertainty? A closer look at FTS’s results At time of reward: • Prediction errors result from probabilistic reward delivery p = 50% p = 75% • Crucially: Positive and negative errors cancel out A TD Resolution: • TD prediction error δ(t) can be positive or negative • Neuronal firing rate is only positive (negative values can be encoded relative to base firing rate) But: DA base firing rate is low -> asymmetric encoding of δ(t) DA 55% 270% δ(t) Simulating TD with asymmetric errors Negative δ(t) scaled by d=1/6 prior to PSTH summation Learning proceeds normally (without scaling) − Necessary to produce the right predictions − Can be biologically plausible DA - Uncertainty or Temporal Difference? With asymmetric coding of errors, the mean TD error at the time of reward  p(1-p) => Maximal at p=50% Experiment Model However: • No need to assume explicit coding of uncertainty Ramping is explained by neural constraints. • Explanation for puzzling absence of ramp in trace conditioning results. • Experimental test: Ramp as within or between trial phenomenon? Challenges: TD and noise; Conditioned inhibition, additivity Trace conditioning: A puzzle and its resolution CS = short visual stimulus Trace period US (probabilistic) = drops of juice • Same (if not more) uncertainty, but no DA ramping (Fiorillo et al.; Morris, Arkadir, Nevet, Vaadia & Bergman) • Resolution: lower learning rate in trace conditioning eliminates ramp Other sources of uncertainty: Representational Noise (1) • Rate coding is inherently stochastic • Add noise to tapped delay line representation σ = 0.0577 σ = 0.0866 Mirenowicz and Schultz (1996) σ = 0.1155 prediction error weights => TD learning is robust to this type of noise Other sources of uncertainty: Representational Noise (2) • Neural timing of events is necessarily inaccurate • Add temporal noise to tapped delay line representation ε = 0.05 ε = 0.10 => Devastating effects of even small amounts of temporal noise on TD predictions

Evolution of Reinforcement Learning in Unpredictable Environments

Related documents

Products

Support

Evolution of Reinforcement Learning in Unpredictable Environments

Related documents

Add this document to collection(s)

Add this document to saved

Suggest us how to improve StudyLib