Dopamine and prediction error

TD error: δ(t) = r(t) + V(t+1) − V(t)
(Figure: dopamine responses under three conditions – no prediction, reward; prediction, reward; prediction, no reward. Schultz 1997.)
(A toy code sketch of this update appears after the Experiment slide below.)

Humans are no different
• dorsomedial striatum/PFC – goal-directed control
• dorsolateral striatum – habitual control
• ventral striatum – Pavlovian control; value signals
• dopamine...

... in humans
• 5 stimuli, worth 40¢, 20¢, 0/40¢, 0¢, 0¢
• stimulus < 1 sec; 5 sec ISI; 0.5 sec outcome ("You won 40 cents")
• 19 subjects (dropped 3 non-learners, N = 16)
• 3T scanner, TR = 2 sec, interleaved
• 234 trials: 130 choice, 104 single-stimulus; randomly ordered and counterbalanced; 2–5 sec ITI

What would a prediction error look like (in BOLD)?

Prediction errors in NAc
• Unbiased anatomical ROI in nucleus accumbens (marked per subject*)
• Raw BOLD (averaged over all subjects)
• Can actually decide between different neuroeconomic models of risk
* thanks to Laura deSouza

Polar Exploration
Peter Dayan, Nathaniel Daw, John O'Doherty, Ray Dolan

Exploration vs. exploitation
• Classic dilemma in learned decision making: for unfamiliar outcomes, how to trade off learning about their values against exploiting knowledge already gained
(Figure: reward over time under exploitation vs. exploration.)
• Exploitation
  – Choose the action expected to be best
  – May never discover something better
• Exploration
  – Choose an action expected to be worse
  – If it is worse, go back to the original; if it turns out better, exploit it in the future
  – Balanced by the long-term gain if it turns out better
  – (Even for risk- or ambiguity-averse subjects)
• NB: learning is non-trivial when outcomes are noisy or changing

Bayesian analysis (Gittins 1972)
• Tractable dynamic program in a restricted class of problems – the "n-armed bandit"
• Solution requires balancing
  – Expected outcome values
  – Uncertainty (need for exploration)
  – Horizon/discounting (time to exploit)
• Optimal policy: explore systematically
  – Choose the action with the best sum of value plus bonus
  – Bonus increases with uncertainty
(Figure: value plus uncertainty bonus per action.)
• Intractable in general settings – various heuristics used in practice

Experiment
• How do humans handle the tradeoff?
• Computation: which strategies fit behavior?
  – Several popular approximations
• Difference: what information influences exploration?
• Neural substrate: what systems are involved?
  – PFC, high-level control
  – Competitive decision systems (Daw et al. 2005)
  – Neuromodulators: dopamine (Kakade & Dayan 2002), norepinephrine (Usher et al. 1999)
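To tie the prediction-error equation above to something executable, here is a minimal TD-style cue–reward sketch; the learning rate, reward magnitude, and variable names are illustrative assumptions, not values from any of the studies discussed.

```python
# Minimal TD sketch of the prediction-error story above.
# Assumptions (not from the slides): learning rate 0.1, reward magnitude 1.0,
# and the cue itself arrives unpredictably, so the pre-cue prediction is 0.
alpha, reward = 0.1, 1.0
V_cue = 0.0                                # learned value of the cue

for trial in range(200):
    delta_cue = V_cue - 0.0                # error when the unpredicted cue appears
    delta_rew = reward + 0.0 - V_cue       # error at reward time (next-state value = 0)
    V_cue += alpha * delta_rew             # cue value learns from the outcome error

print(delta_cue, delta_rew)
# After training: delta_rew ≈ 0 (reward fully predicted) while delta_cue ≈ 1,
# i.e. the response has transferred to the cue. Omitting the reward on a trained
# trial gives delta_rew ≈ -1: the dip in the "prediction, no reward" condition.
```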
Task design
• Subjects (14 healthy, right-handed) repeatedly choose between four slot machines for points ("money"), in the scanner
• Trial onset: slots revealed
• +~430 ms: subject makes choice; the chosen slot spins
• +~3000 ms: outcome – payoff revealed ("obtained 57 points")
• +~1000 ms: screen cleared; trial ends

Payoff structure
• Noisy, to require integration of data – subjects learn about payoffs only by sampling them
• Nonstationary, to encourage ongoing exploration (Gaussian drift with decay)
(Figure: example payoff time courses.)

Analysis strategy
• Behavior: fit an RL model to choices
  – Find the best-fitting parameters
  – Compare different exploration models
• Imaging: use the model to estimate subjective factors (explore vs. exploit, value, etc.)
  – Use these as regressors for the fMRI signal
  – After Sugrue et al.

Behavior
(Figure slides: observed choices and payoffs over trials.)

Behavior model
1. Estimate payoffs: μ_green, μ_red, etc. and σ_green, σ_red, etc.
   – Kalman filter: error update (like TD), exact inference
   – Mean update: μ_red ← μ_red + κδ, with Kalman gain κ = σ²_red / (σ²_red + σ²_o)
   – Variance update: σ²_red ← (1 − κ) σ²_red
   – (cf. Behrens et al. on volatility)
(Figure: posterior over one machine's payoff, updated from trial t to t+1.)
2. Derive choice probabilities: P_green, P_red, etc.; choose randomly according to these
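As a concrete reading of step 1, here is a minimal Kalman-filter sketch for tracking one machine's drifting payoff; the drift, decay, and noise parameters and the function names are illustrative assumptions, not the task's or the fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameters (assumptions): decay toward a long-run mean THETA,
# per-trial drift noise SIGMA_D, and payoff observation noise SIGMA_O.
LAMBDA, THETA, SIGMA_D, SIGMA_O = 0.98, 50.0, 2.8, 4.0

def drift(true_mean):
    """Generative process: decaying Gaussian random walk for the machine's mean payoff."""
    return LAMBDA * true_mean + (1 - LAMBDA) * THETA + rng.normal(0, SIGMA_D)

def kalman_update(mu, var, payoff):
    """Error-driven (TD-like) update after observing a payoff."""
    kappa = var / (var + SIGMA_O ** 2)       # Kalman gain
    mu = mu + kappa * (payoff - mu)          # mean moves toward the observation
    var = (1 - kappa) * var                  # uncertainty shrinks
    return mu, var

def kalman_predict(mu, var):
    """Between trials the payoff drifts, so the estimate decays and uncertainty grows."""
    mu = LAMBDA * mu + (1 - LAMBDA) * THETA
    var = LAMBDA ** 2 * var + SIGMA_D ** 2
    return mu, var

true_mean, mu, var = 60.0, 50.0, 100.0
for t in range(100):
    payoff = true_mean + rng.normal(0, SIGMA_O)   # this machine is sampled every trial
    mu, var = kalman_update(mu, var, payoff)
    mu, var = kalman_predict(mu, var)
    true_mean = drift(true_mean)

# Machines that are not sampled get only the predict step, so their uncertainty
# grows over time — exactly the quantity an uncertainty bonus would reward.
```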
Compare rules: how is exploration directed? (dumber → smarter)
• Randomly – "ε-greedy": P_red = 1 − 3ε if μ_red = max(all μ); ε otherwise
• By value – "softmax": P_red ∝ exp(β μ_red)
• By value and uncertainty – "uncertainty bonuses": P_red ∝ exp(β [μ_red + φ σ_red])
(Figure: choice probability as a function of estimated value under each rule.)

Model comparison
• Assess models based on the likelihood of the actual choices
  – Product over subjects and trials of the modeled probability of each choice
  – Find maximum-likelihood parameters (inference parameters, choice parameters)
• Parameters yoked between subjects
  – (… except choice noisiness, to model all heterogeneity)

Behavioral results
  model                  −log likelihood (smaller is better)   # parameters
  ε-greedy               4208.3                                 19
  softmax                3972.1                                 19
  uncertainty bonuses    3972.1                                 20
• Strong evidence for exploration directed by value
• No evidence for direction by uncertainty
  – Tried several variations
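The sketch below spells out the three choice rules and the likelihood score used to compare them; the example means, uncertainties, and parameter values are placeholders, and the actual maximum-likelihood fitting loop is omitted.

```python
import numpy as np

def p_egreedy(mu, eps):
    """ε-greedy: the best machine gets 1 − 3ε, each of the others gets ε (4 machines)."""
    p = np.full(len(mu), eps)
    p[np.argmax(mu)] = 1 - (len(mu) - 1) * eps
    return p

def p_softmax(mu, beta):
    """Softmax: choice probability grows with estimated value."""
    z = beta * np.asarray(mu)
    e = np.exp(z - z.max())              # subtract max for numerical stability
    return e / e.sum()

def p_bonus(mu, sigma, beta, phi):
    """Uncertainty bonus: softmax over value plus φ times the estimate's uncertainty."""
    return p_softmax(np.asarray(mu) + phi * np.asarray(sigma), beta)

def neg_log_likelihood(trial_probs, choices):
    """−log of the product, over trials, of the probability given to each actual choice."""
    return -sum(np.log(p[c]) for p, c in zip(trial_probs, choices))

# Placeholder single-trial estimates (not real data):
mu = [45.0, 60.0, 52.0, 30.0]            # Kalman means
sigma = [4.0, 3.0, 8.0, 12.0]            # Kalman standard deviations
print(p_egreedy(mu, eps=0.05))
print(p_softmax(mu, beta=0.1))
print(p_bonus(mu, sigma, beta=0.1, phi=1.0))
```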
Imaging methods
• 1.5 T Siemens Sonata scanner
• Sequence optimized for OFC (Deichmann et al. 2003)
• 2 × 385 volumes; 36 slices; 3 mm thickness; TR = 3.24 sec
• SPM2 random-effects model
• Regressors generated using the fit model and the trial-by-trial sequence of actual choices/payoffs

Imaging results
• TD error: dopamine targets (dorsal and ventral striatum)
• Replicates previous studies, though weakly – graded payoffs?
(Figure: ventral striatum at x,y,z = 9,12,−9 and dorsal striatum at x,y,z = 9,0,18; thresholds p<0.01 and p<0.001.)

Value-related correlates
• Probability (or expected value) of the chosen action: vmPFC (x,y,z = −3,45,−18)
• Payoff amount: mOFC (x,y,z = 3,30,−21)
(Figures: % signal change vs. probability of the chosen action in vmPFC, and vs. payoff in OFC; thresholds p<0.01 and p<0.001.)

Exploration
• Non-greedy > greedy choices: exploration
• Frontopolar cortex (x,y,z = −27,48,4; 27,57,6)
• Survives whole-brain correction
(Figure: thresholds p<0.01 and p<0.001; timecourses in frontal pole and IPS.)

Checks
• Do other factors explain the differential BOLD activity better?
  – Multiple regression vs. RT, actual reward, predicted reward, choice probability, stay vs. switch, uncertainty, and more
  – Only explore/exploit is significant
  – (But 5 additional putative explore areas are eliminated)
• Individual subjects: BOLD differences are stronger for subjects the model fits better behaviorally

Frontal poles
• "One of the least well understood regions of the human brain"
• No cortical connections outside PFC ("PFC for PFC")
• Rostrocaudal hierarchy in PFC (Christoff & Gabrieli 2000; Koechlin et al. 2003)
• Imaging: high-level control
  – Coordinating goals/subgoals (Koechlin et al. 1999; Braver & Bongiolatti 2002; Badre & Wagner 2004)
  – Mediating cognitive processes (Ramnani & Owen 2004)
  – Nothing this computationally specific
• Lesions: task switching (Burgess et al. 2000); more generically, perseveration

Interpretation
• Cognitive decision to explore overrides habit circuitry? Via parietal?
  – Higher FP response when exploration is chosen most against the odds
  – Exploratory RTs are longer
• Exploration and exploitation are neurally distinct
• Computationally surprising, and especially bad for uncertainty-bonus schemes
  – Proper exploration requires computational integration
  – No behavioral evidence for bonuses either
• Why softmax? It can misexplore
  – Deterministic bonus schemes are bad in adversarial/multiagent settings
  – Dynamic temperature control? (norepinephrine; Usher et al.; Doya)

Conclusions
• Subjects direct exploration by value but not by uncertainty
• Cortical regions differentially implicated in exploration
  – computational consequences
• Integrative approach: computation, behavior, imaging
  – Quantitatively assess & constrain models using raw behavior
  – Infer subjective states using the model; study their neural correlates

Open Issues
• model-based vs. model-free vs. Pavlovian control
  – environmental priors vs. naive optimism vs. neophilic compulsion?
• environmental priors and generalization
  – curiosity / 'intrinsic motivation' from expected future reward