models: reinforcement learning & fMRI
Nathaniel Daw
11/28/2007

overview
• reinforcement learning
  – simple example
  – tracking
  – choice
• model fitting: behavior
• model fitting: fMRI

Reinforcement learning: the problem
Optimal choice learned by repeated trial and error
– e.g., between slot machines that pay off with different probabilities
But…
– payoff amounts & probabilities may be unknown
– they may additionally be changing over time
– decisions may be sequentially structured (chess, mazes: this we won't consider today)
Very hard computational problem; computational shortcuts are essential
Interplay between what you can do and what you should do
Both have behavioral & neural consequences

Simple example
n-armed bandit with unknown but IID payoffs – a surprisingly rich problem
Vague strategy to maximize expected payoff:
1) predict the expected payoff for each option
2) choose the best (?)
3) learn from the outcome to improve predictions

Simple example (continued)
1) Predict: take V_L = the last reward received on option L (more generally, some weighted average of past rewards). This is an unbiased, albeit lousy, estimator.
2) Choose the best (more generally, choose stochastically such that the machine judged richer is more likely to be chosen).
Say the left machine pays 10 with probability 10%, 0 otherwise; the right machine pays 1 always. What happens? (Niv et al. 2000; Bateson & Kacelnik)

Behavioral anomalies
• Apparent risk aversion arises from learning, i.e., from the way payoffs are estimated
  – even though the learner is trying to optimize expected reward, i.e., is risk neutral
  – easy to construct other examples yielding risk proneness or "probability matching"
• Behavioral anomalies can have computational roots
• Sampling and choice interact in subtle ways

what can we do?

Reward prediction
(figure: exponentially decaying weights on rewards as a function of trials into the past)
Exponentially weighted running average of the rewards on an option:
V_t = \eta r_t + \eta (1-\eta) r_{t-1} + \eta (1-\eta)^2 r_{t-2} + \ldots
Convenient form because it can be maintained recursively ('exponential filter'):
V_t = \eta r_t + (1-\eta) V_{t-1} = V_{t-1} + \eta \delta_t, \quad \text{where } \delta_t = r_t - V_{t-1}
'error-driven learning', 'delta rule', 'Rescorla-Wagner'

what should we do? [learning]

Bayesian view
Specify a 'generative model' for payoffs
• Assume the payoff following a choice of A is Gaussian with unknown mean m_A and known variance s^2_{PAYOFF}
• Assume the mean m_A changes via a Gaussian random walk with zero mean and variance s^2_{WALK}
(figure: the payoff mean for A drifting across trials)

Bayesian view (continued)
Describe prior beliefs about the parameters as a probability distribution
• Assume they are Gaussian with mean \hat{m}_A and variance \hat{s}^2_A
Update beliefs in light of experience with Bayes' rule:
P(m_A \mid \text{payoff}) \propto P(\text{payoff} \mid m_A) \, P(m_A)

Bayesian belief updating
(figures: the posterior over the mean payoff for A, updated after each successive observation)
The resulting updates are the Kalman filter:
\hat{m}_A(t) = \hat{m}_A(t-1) + \eta(t) \, (r(t) - \hat{m}_A(t-1))
\hat{s}^2_A(t) = (1 - \eta(t)) \, \hat{s}^2_A(t-1) + s^2_{WALK}
\eta(t) = \hat{s}^2_A(t-1) / (\hat{s}^2_A(t-1) + s^2_{PAYOFF})

Notes on the Kalman filter
Looks like Rescorla-Wagner, but
• we track uncertainty as well as the mean
• the learning rate is a function of uncertainty (asymptotically constant but nonzero)
Why do we exponentially weight past rewards? With an (asymptotically) constant learning rate, the Kalman update reduces to the exponential filter above.
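As a concrete illustration of the update equations above, here is a minimal sketch in Python (not from the original slides). It implements the Kalman-filter update for a single arm as reconstructed above; with a fixed learning rate it reduces to the Rescorla-Wagner / delta rule. The function names, variable names, and noise parameters in the demo are illustrative assumptions.

```python
import numpy as np

def kalman_update(m_hat, s2_hat, r, s2_payoff=1.0, s2_walk=0.1):
    """One Kalman-filter update of the estimated mean payoff for one bandit arm.

    m_hat, s2_hat : current posterior mean and variance for the arm's payoff mean
    r             : reward just observed on this arm
    s2_payoff     : known payoff (observation) variance
    s2_walk       : variance of the random walk on the true mean
    """
    eta = s2_hat / (s2_hat + s2_payoff)       # uncertainty-dependent learning rate
    m_hat = m_hat + eta * (r - m_hat)         # delta-rule-like mean update
    s2_hat = (1.0 - eta) * s2_hat + s2_walk   # variance shrinks with data, grows with drift
    return m_hat, s2_hat
    # with eta held fixed instead, this is exactly the exponential filter / delta rule

# toy demo (illustrative parameters): track a drifting mean
rng = np.random.default_rng(0)
true_mean, m_hat, s2_hat = 0.0, 0.0, 1.0
for t in range(200):
    true_mean += rng.normal(0.0, np.sqrt(0.1))  # random walk on the true mean
    r = true_mean + rng.normal(0.0, 1.0)        # noisy payoff
    m_hat, s2_hat = kalman_update(m_hat, s2_hat, r)
```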
what should we do? [choice]

The n-armed bandit
• n slot machines
• binary payoffs, unknown fixed probabilities
• you get some limited (technically: random, exponentially distributed) number of spins
• you want to maximize income
A surprisingly rich problem

The n-armed bandit: 1. Track payoff probabilities
Bayesian: learn a distribution over the possible payoff probabilities for each machine
This is easy: it just requires counting wins and losses (Beta posterior)

The n-armed bandit: 2. Choose
This is hard. Why?

The explore-exploit dilemma
Simply choosing the apparently best machine might miss something better: you must balance exploration and exploitation
Simple heuristics exist, e.g., choose at random once in a while

Explore / exploit
(figure: posterior beliefs over payoff probability; left bandit: 4/8 spins rewarded; right bandit: 1/2 spins rewarded; both distributions have mean 50%)
Which should you choose?
The right (green) bandit is more uncertain (its distribution has larger variance)
Although the green bandit has a larger chance of being worse, it also has a larger chance of being better…
…which would be useful to find out, if true
Trade off uncertainty, expected value, and horizon
'Value of information': exploring improves future choices
How do we quantify this?

Optimal solution
This is really a sequential choice problem; it can be solved with dynamic programming
Naïve approach: each machine has k 'states' (the number of wins/losses so far); the state of the total game is the product over all machines; curse of dimensionality (k^n states)
Clever approach (Gittins 1972): the problem decouples to one with k states – consider continuing on a single bandit versus switching to a bandit that always pays some known amount. The amount for which you'd switch is the 'Gittins index'. It properly balances mean, uncertainty & horizon.
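To make the track-and-choose loop above concrete, here is a small sketch (not from the slides): Beta-posterior tracking of each machine's payoff probability by counting wins and losses, combined with a softmax choice rule as one of the simple exploration heuristics mentioned above (not the optimal Gittins policy). The arm probabilities, prior counts, and temperature parameter are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_probs = [0.4, 0.55, 0.7]        # unknown to the learner
wins = np.ones(len(true_probs))      # Beta(1, 1) uniform prior on each arm
losses = np.ones(len(true_probs))

def softmax(x, beta=5.0):
    """Softmax choice probabilities; higher beta means more exploitation."""
    e = np.exp(beta * (x - np.max(x)))
    return e / e.sum()

for t in range(500):
    means = wins / (wins + losses)               # posterior mean payoff probability per arm
    choice = rng.choice(len(true_probs), p=softmax(means))
    reward = rng.random() < true_probs[choice]   # binary payoff
    if reward:
        wins[choice] += 1                        # Beta posterior update: count wins...
    else:
        losses[choice] += 1                      # ...and losses
```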
overview
• reinforcement learning
• model fitting: behavior
  – pooling multiple subjects
  – example
• model fitting: fMRI

Model estimation
What is a model?
– a parameterized stochastic data-generation process
Model m predicts data D given parameters θ
Estimate the parameters: posterior distribution over θ by Bayes' rule
P(\theta \mid D, m) \propto P(D \mid \theta, m) \, P(\theta \mid m)
Typically we use a maximum likelihood point estimate instead:
\hat{\theta} = \arg\max_\theta P(D \mid \theta, m)
i.e., the parameters for which the data are most likely. We can still study the uncertainty around the peak: interactions, identifiability.

Application to RL
e.g., D for a subject is the ordered list of choices c_t and rewards r_t
P(D \mid \theta, m) = \prod_t P(c_t \mid c_{1..t-1}, r_{1..t-1}, \theta, m)
for example, P(c_t \mid c_{1..t-1}, r_{1..t-1}, \theta, m) \propto \exp(V_t(c_t))
where V might be learned by an exponential filter with decay θ

Example behavioral task
Reinforcement learning for reward & punishment:
• participants (31) repeatedly choose between boxes
• each box has a (hidden, changing) chance of giving money (20p)
• also, an independent chance of giving an electric shock (8 on a 1-10 pain scale)
(figure: the drifting shock and money probabilities across 300 trials)

This is good for what?
• parameters may measure something of interest
  – e.g., learning rate, monetary value of shock
• allows us to quantify & study neural representations of subjective quantities
  – expected value, prediction error
• compare models
• compare groups

Compare models
P(m \mid D) \propto P(D \mid m) \, P(m)
P(D \mid m) = \int d\theta_m \, P(D \mid \theta_m, m) \, P(\theta_m \mid m)
In principle: an 'automatic Occam's razor'
In practice: approximate the integral as maximum likelihood plus a penalty: Laplace, BIC, AIC, etc.
Frequentist version: likelihood ratio test
Or: a holdout set; difficult in the sequential case
Good example refs: Ho & Camerer

Compare groups
• How should we model data for a group of subjects?
• We want to account for (potential) inter-subject variability in the parameters θ
  – this is called treating the parameters as "random effects"
  – i.e., random variables instantiated once per subject
  – hierarchical model:
    • each subject's parameters are drawn from a population distribution
    • her choices are drawn from the model given those parameters
P(D \mid \theta, m) = \prod_s \int d\theta_s \, P(\theta_s \mid \theta) \prod_t P(c_{s,t} \mid c_{s,1..t-1}, r_{s,1..t-1}, \theta_s, m)

Random effects model
Hierarchical model:
– What is θ_s? e.g., a learning rate
– What is P(θ_s | θ)? e.g., a Gaussian, or a mixture of Gaussians
– What is θ? e.g., the mean and variance, over the population, of the regression weights
We are interested in identifying the population characteristics θ (all multi-subject fMRI analyses work this way)

Random effects model (continued)
To identify the population characteristics θ:
– method 1: summary statistics of individual ML fits (cheap & cheerful; used in fMRI)
– method 2: estimate the integral over parameters, e.g., with Monte Carlo
What good is this?
– we can make statistical statements about parameters in the population
– we can compare groups
– we can regularize the individual parameter estimates, i.e., P(θ_s | θ): "empirical Bayes"

Behavioral analysis
(recall the example behavioral task: boxes with drifting probabilities of money and shock)
Fit trial-by-trial choices using a "conditional logit" regression model
The coefficients estimate the effects on choice of past rewards, shocks, & choices (Lau & Glimcher; Corrado et al.)
Is there a selective effect of acute tryptophan depletion?

Each box's value is a weighted sum of its lagged reward, shock, and choice histories:
value(box 1) = [past rewards; past shocks; past choices for box 1] · [weights]
value(box 2) = [past rewards; past shocks; past choices for box 2] · [weights]
etc.
Values map to choice probabilities using a logistic ('softmax') rule:
\text{prob(box 1)} \propto \exp(\text{value(box 1)})
Probabilities generate choices stochastically.
Estimate the weights by maximizing the joint likelihood of the choices, conditional on the rewards.
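Here is a small illustrative sketch (not from the slides) of the fitting strategy just described, under simplifying assumptions: a single exponential-filter learning rate and a softmax inverse temperature are fit to each subject's choices by maximum likelihood, and the "summary statistics" random-effects approach then tests the fitted parameters across subjects. The function names, data layout, and the use of scipy are assumptions for illustration, not the original analysis code.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import ttest_1samp

def neg_log_lik(params, choices, rewards, n_options=2):
    """Negative log likelihood of one subject's choices under an
    exponential-filter (delta-rule) value model with softmax choice."""
    lr, beta = params
    V = np.zeros(n_options)
    nll = 0.0
    for c, r in zip(choices, rewards):
        p = np.exp(beta * V - np.max(beta * V))   # softmax choice probabilities
        p /= p.sum()
        nll -= np.log(p[c] + 1e-12)
        V[c] += lr * (r - V[c])                   # delta-rule update of the chosen option
    return nll

def fit_subject(choices, rewards):
    """Maximum-likelihood point estimate of (learning rate, inverse temperature)."""
    res = minimize(neg_log_lik, x0=[0.3, 1.0],
                   args=(np.asarray(choices), np.asarray(rewards)),
                   bounds=[(1e-3, 1.0), (1e-3, 20.0)])
    return res.x

# "summary statistics" random-effects approach: fit each subject separately,
# then test the fitted parameters across subjects, e.g.
#   subjects = [(choices_1, rewards_1), (choices_2, rewards_2), ...]
#   learning_rates = [fit_subject(c, r)[0] for c, r in subjects]
#   print(ttest_1samp(learning_rates, popmean=0.5))
```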
Results: summary statistics of individual ML fits
(figure: fitted weights for past rewards, shocks, and choices at lags 1-6 trials, on an avoid-choose axis)
The individual fits are fairly noisy (unconstrained model, unregularized fits)
(figure: the same weights overlaid with exponential-decay curves)
Models predict exponential decays in the reward & shock weights & typically neglect choice-choice autocorrelation

Fit of a TD model (with exponentially decaying choice sensitivity), visualized the same way
(5x fewer parameters, an essentially as-good fit to the data; the estimates are better regularized)

Quantify the value of pain
(figure: TD-model weights annotated with fitted monetary equivalents: £0.20, -£0.12, £0.04)

Effect of acute tryptophan depletion?
(figure: fitted weights for control vs. depleted participants)
Depleted participants are:
• equally shock-driven
• more 'sticky' (driven to repeat choices)
• less money-driven (this effect is less reliable)

Linear effects of blood tryptophan levels:
(figures: per-subject sensitivities plotted against change in blood tryptophan)
• shock sensitivity: no relationship (p > .5)
• choice ('stickiness') sensitivity: p < .005
• reward sensitivity: p < .01

overview
• reinforcement learning
• model fitting: behavior
• model fitting: fMRI
  – random effects
  – RL regressors

Random effects in fMRI
(figure: statistical maps of frontopolar activations (rFP, LFP) thresholded at p<0.01 and p<0.001)
What does this mean when there are multiple subjects?
• regression coefficients as random effects
• if we drew more subjects from this population, would the expected effect size be > 0?

History
1990-1991 – SPM paper, software released, used for PET; low ratio of samples to subjects (within-subject variance not important)
1992-1997 – development of fMRI; more samples per subject
1998 – Holmes & Friston introduce the distinction between fixed- and random-effects analysis in a conference presentation; reveal SPM had been fixed effects all along
1999 – series of papers semi-defending fixed effects; but the software is fixed

RL & fMRI
• Common approach: fit models to behavior, use the models to generate regressors for the fMRI GLM
  – e.g., predicted value; error in predicted value
  – where in the brain does the BOLD signal correlate with the computationally generated signal (convolved with the HRF)?
  – quantify & study the neural representation of subjective factors
• reward prediction error (O'Doherty et al. 2003 and lots of other papers; Schoenberg et al. 2007)
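As a sketch of the regressor-generation step just described (not from the slides, and not tied to any particular analysis package): compute a trial-by-trial reward prediction error from the fitted behavioral model, place it at the outcome onsets, convolve with a hemodynamic response function, and downsample to scan times for the GLM. The double-gamma HRF shape, time grid, and data layout are common conventions assumed here for illustration.

```python
import numpy as np

def canonical_hrf(dt, duration=32.0):
    """Simple double-gamma HRF sampled every dt seconds (a common default shape)."""
    from scipy.stats import gamma
    t = np.arange(0, duration, dt)
    hrf = gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0
    return hrf / hrf.sum()

def prediction_error_regressor(choices, rewards, onsets, lr, n_scans, tr, n_options=2):
    """Build an fMRI regressor from model-derived reward prediction errors.

    choices, rewards : per-trial choice indices and outcomes
    onsets           : outcome onset times in seconds
    lr               : learning rate taken from the behavioral fit
    n_scans, tr      : number of scans and repetition time in seconds
    """
    dt = 0.1                                   # fine time grid for convolution
    grid = np.zeros(int(n_scans * tr / dt))
    V = np.zeros(n_options)
    for c, r, onset in zip(choices, rewards, onsets):
        delta = r - V[c]                       # reward prediction error at outcome
        V[c] += lr * delta                     # delta-rule value update
        grid[int(onset / dt)] += delta         # stick function scaled by the PE
    conv = np.convolve(grid, canonical_hrf(dt))[:len(grid)]   # convolve with the HRF
    scan_times = np.arange(n_scans) * tr
    return np.interp(scan_times, np.arange(len(conv)) * dt, conv)  # resample at the TRs
```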
Examples: value expectation
(exactly the same approach is common in animal physiology)
• Sugrue et al. (2004): primate LIP neurons (% signal change)
• Daw, O'Doherty et al. (2006): vmPFC activity in humans correlates with the probability of the chosen action
Note: you can also fit parametric models to neural signals and compare the neural & behavioral fits (Kable et al. 2007; Tom et al. 2007)
Note 2: you must, as always, be suspicious about spurious correlations – it is still good to use controls (e.g., does the regressor load better in this condition than in another?)

Examples: loss aversion
Tom et al. (2007): compare loss aversion estimated from neural value signals (utility vs. money) to behavioral loss aversion estimated from choices

Example: positional uncertainty in a navigation task (Yoshida et al. 2006)
• model: subjects assume they are someplace until proven wrong; then they try assuming somewhere else
• estimate where the subject thinks they are at each step
• correlate the uncertainty in the position estimate with the BOLD signal

summary
• trial-and-error learning & choice
  – interaction between the two
  – rich theory even for simple tasks
• model fits to choice behavior
  – hierarchical model of the population
  – quantify subjective factors
• same methods for fMRI, ephys
  – but keep your wits about you