models: reinforcement learning & fMRI

Nathaniel Daw
11/28/2007
overview
• reinforcement learning
• model fitting: behavior
• model fitting: fMRI
overview
• reinforcement learning
– simple example
– tracking
– choice
• model fitting: behavior
• model fitting: fMRI
Reinforcement learning: the problem
Optimal choice learned by repeated trial-and-error
– eg between slot machines that pay off with different probabilities
But…
– Payoff amounts & probabilities may be unknown
– May additionally be changing
– Decisions may be sequentially structured (chess, mazes: this we won't consider today)
Very hard computational problem; computational shortcuts
essential
Interplay between what you can and should do
Both have behavioral & neural consequences
Simple example
n-armed bandit, unknown but IID payoffs
– surprisingly rich problem
Vague strategy to maximize expected payoff:
1) Predict expected payoff for each option
2) Choose the best (?)
3) Learn from outcome to improve predictions
Simple example
1) Predict expected payoff for each option
– Take V_L = last reward received on option L
(more generally, some weighted average of past rewards)
– This is an unbiased, albeit lousy, estimator
2) Choose the best
– (more generally, choose stochastically s.t. the machine judged richer is more likely to be chosen)

Say left machine pays 10 with prob 10%, 0 otherwise
Say right machine pays 1 always
What happens? (Niv et al. 2000; Bateson & Kacelnik)
Behavioral anomalies
• Apparent risk aversion arises due to learning, i.e.
due to the way payoffs are estimated
– Even though we are trying to optimize expected reward, and are therefore risk neutral
– Easy to construct other examples for risk proneness,
“probability matching”
• Behavioral anomalies can have computational
roots
• Sampling and choice interact in subtle ways
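As an illustration (not from the slides), here is a minimal Python simulation of the two machines above under the "last reward received" estimator with greedy choice; the forced initial sample of each machine, the greedy rule, and the run lengths are assumptions made for the sketch:

```python
import numpy as np

# Minimal sketch: simulate the two-machine example with V = "last reward received"
# and greedy choice, to show apparent risk aversion arising purely from learning.
rng = np.random.default_rng(0)

def pull(arm):
    # left (0): 10 with prob 10%, else 0; right (1): 1 always
    return (10.0 if rng.random() < 0.1 else 0.0) if arm == 0 else 1.0

n_runs, n_trials = 1000, 200
left_choice_rate = 0.0
for _ in range(n_runs):
    V = np.array([pull(0), pull(1)])   # one forced sample of each machine (an assumption)
    picks_left = 0
    for _ in range(n_trials):
        arm = int(np.argmax(V)) if V[0] != V[1] else int(rng.integers(2))
        V[arm] = pull(arm)             # estimator: last reward received on that machine
        picks_left += (arm == 0)
    left_choice_rate += picks_left / n_trials
print("mean P(choose left):", left_choice_rate / n_runs)
# Although both machines pay 1 on average, the agent almost never returns to the
# risky one: after a typical 0 payoff, V_left drops below 1 and the greedy rule
# never revisits it.
```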
what can we do?

Reward prediction
Exponentially weighted running average of rewards on an option:

V_t = η·r_t + η(1−η)·r_{t−1} + η(1−η)²·r_{t−2} + …

[figure: weight on past rewards falls off exponentially with trials into the past (−10 to −1)]

Convenient form because it can be recursively maintained ('exponential filter'):

V_t = η·r_t + (1−η)·V_{t−1} = V_{t−1} + η·δ_t, where δ_t = r_t − V_{t−1}

'error-driven learning', 'delta rule', 'Rescorla-Wagner'
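A minimal sketch of this update rule in Python; the learning rate η here is an arbitrary illustrative value:

```python
import numpy as np

# Minimal sketch of the delta rule / exponential filter above.
def update_value(V, r, eta=0.1):
    delta = r - V            # prediction error
    return V + eta * delta   # error-driven update

# Example: track a noisy reward stream with true mean 1.0
rng = np.random.default_rng(1)
V = 0.0
for _ in range(100):
    V = update_value(V, rng.normal(loc=1.0, scale=0.5))
print(round(V, 2))   # V hovers near 1.0, weighting recent rewards most heavily
```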
what should we do? [learning]
Bayesian view
Specify 'generative model' for payoffs
• Assume payoff following choice of A is Gaussian with unknown mean μ_A, known variance σ²_PAYOFF
• Assume mean μ_A changes via a Gaussian random walk with zero mean and variance σ²_WALK
[figure: payoff for A and its drifting mean μ_A across trials]
Bayesian view
Describe prior beliefs about parameters as a probability distribution
• Assume they are Gaussian with mean μ̂_A and variance σ̂²_A
[figure: prior distribution over the mean of payoff for A]
Update beliefs in light of experience with Bayes' rule:
P(μ_A | payoff) ∝ P(payoff | μ_A)·P(μ_A)
Bayesian belief updating
[figure: posterior over the mean of payoff for A, updated over successive trials]

μ̂_A(t) = μ̂_A(t−1) + η(t)·(r(t) − μ̂_A(t−1))
σ̂²_A(t) = (1 − η(t))·σ̂²_A(t−1) + σ²_WALK
η(t) = σ̂²_A(t−1) / (σ̂²_A(t−1) + σ²_PAYOFF)
Notes on Kalman filter
Looks like Rescorla/Wagner but
• We track uncertainty as well as mean
• Learning rate is function of uncertainty (asymptotically
constant but nonzero)
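A minimal sketch of these updates in Python; the noise variances are illustrative values, treated as known as in the generative model above:

```python
# Sketch of the Kalman-filter updates above: track both the mean and the uncertainty
# of an option's payoff; the learning rate eta(t) depends on the current uncertainty.
def kalman_update(mu, sigma2, r, sigma2_walk=0.1, sigma2_payoff=1.0):
    eta = sigma2 / (sigma2 + sigma2_payoff)        # uncertainty-dependent learning rate
    mu_new = mu + eta * (r - mu)                   # Rescorla-Wagner-like mean update
    sigma2_new = (1 - eta) * sigma2 + sigma2_walk  # uncertainty falls with data, rises with drift
    return mu_new, sigma2_new

# Asymptotically eta settles at a constant, nonzero value, which is why an
# exponentially weighted average of past rewards is (approximately) the right thing to do.
mu, sigma2 = 0.0, 10.0
for r in [1.0, 0.5, 1.5, 1.0, 0.8]:
    mu, sigma2 = kalman_update(mu, sigma2, r)
print(round(mu, 2), round(sigma2, 2))
```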
Why do we exponentially weight past rewards?
μ̂_A(t) = μ̂_A(t−1) + η(t)·(r(t) − μ̂_A(t−1))
σ̂²_A(t) = (1 − η(t))·σ̂²_A(t−1) + σ²_WALK
η(t) = σ̂²_A(t−1) / (σ̂²_A(t−1) + σ²_PAYOFF)
what should we do? [choice]
The n-armed bandit
n slot machines
binary payoffs, unknown fixed probabilities
you get some limited (technically: random,
exponentially distributed) number of spins
want to maximize income
surprisingly rich problem
The n-armed bandit
1. Track payoff probabilities
Bayesian: learn a distribution over
possible probs for each machine
This is easy: Just requires counting
wins and losses (Beta posterior)
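A minimal sketch of this counting in Python, assuming a uniform Beta(1,1) prior (the slide doesn't specify one):

```python
from scipy import stats

# With a uniform Beta(1,1) prior, after w wins and l losses the posterior over a
# machine's payoff probability is Beta(1 + w, 1 + l).
wins, losses = 4, 4                       # e.g. a machine rewarded on 4 of 8 spins
posterior = stats.beta(1 + wins, 1 + losses)
print(posterior.mean(), posterior.std())  # point estimate and how uncertain we are about it
```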
The n-armed bandit
2. Choose
This is hard. Why?
The explore-exploit dilemma
2. Choose
Simply choosing apparently best machine
might miss something better: must balance
exploration and exploitation
simple heuristics, eg choose at random once
in a while
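One such heuristic, sketched in Python (ε-greedy; the value of ε is an arbitrary assumption):

```python
import numpy as np

# Sketch of the "choose at random once in a while" heuristic (epsilon-greedy).
def epsilon_greedy(values, epsilon=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    if rng.random() < epsilon:
        return int(rng.integers(len(values)))   # explore: pick a machine at random
    return int(np.argmax(values))               # exploit: pick the apparently best machine
```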
Explore / exploit
[figure: Beta posterior beliefs over payoff probability (%) for each bandit]
– left bandit: 4/8 spins rewarded
– right bandit: 1/2 spins rewarded
– mean of both distributions: 50%
– green bandit more uncertain (distribution has larger variance)
– although the green bandit has a larger chance of being worse… it also has a larger chance of being better… which would be useful to find out, if true
Which should you choose?
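A small sketch quantifying the point in the figure, assuming uniform Beta(1,1) priors and the counts given above:

```python
from scipy import stats

# Both posteriors have mean 50%, but the bandit observed on fewer spins is wider,
# so it carries more probability mass on very high (and very low) payoff probabilities.
left = stats.beta(1 + 4, 1 + 4)    # 4/8 spins rewarded
right = stats.beta(1 + 1, 1 + 1)   # 1/2 spins rewarded

print(left.mean(), right.mean())               # both 0.5
print(left.std(), right.std())                 # right posterior is more uncertain
print(1 - left.cdf(0.7), 1 - right.cdf(0.7))   # right has a larger chance of paying off above 70%
```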
Trade off uncertainty, exp value, horizon
‘Value of information’: exploring improves future choices
How to quantify?
Optimal solution
This is really a sequential choice problem; can be solved with dynamic
programming
Naïve approach:
Each machine has k 'states' (number of wins/losses so far); the state of the total game is the product over all machines; curse of dimensionality (k^n states)
Clever approach: (Gittins 1972)
Problem decouples to one with k states – consider continuing on a
single bandit versus switching to a bandit that always pays some
known amount. The amount for which you’d switch is the ‘Gittins index’.
It properly balances mean, uncertainty & horizon
overview
• reinforcement learning
• model fitting: behavior
– pooling multiple subjects
– example
• model fitting: fMRI
Model estimation
What is a model?
– parameterized stochastic data-generation process
Model m predicts data D given parameters θ
Estimate parameters: posterior distribution over θ by Bayes' rule
P(θ | D, m) ∝ P(D | θ, m)·P(θ | m)
Typically use a maximum likelihood point estimate instead:
argmax_θ P(D | θ, m)
i.e. the parameters for which the data are most likely.
Can still study uncertainty around the peak: interactions, identifiability
application to RL
e.g. D for a subject is an ordered list of choices c_t and rewards r_t
P(D | θ, m) = ∏_t P(c_t | c_{1..t−1}, r_{1..t−1}, θ, m)
for example:
P(c_t | c_{1..t−1}, r_{1..t−1}, θ, m) ∝ exp(V_t(c_t))
where V might be learned by an exponential filter with decay θ
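A minimal sketch of such a fit in Python, assuming a two-armed task, a delta-rule learner with learning rate α, and a softmax with inverse temperature β; the parameter names, bounds, and starting values are illustrative, not the slides':

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, choices, rewards, n_arms=2):
    alpha, beta = params
    V = np.zeros(n_arms)
    nll = 0.0
    for c, r in zip(choices, rewards):
        p = np.exp(beta * V) / np.sum(np.exp(beta * V))  # softmax choice probabilities
        nll -= np.log(p[c])                              # likelihood of the observed choice
        V[c] += alpha * (r - V[c])                       # delta-rule update of the chosen value
    return nll

def fit_subject(choices, rewards):
    res = minimize(neg_log_likelihood, x0=[0.2, 1.0],
                   args=(np.asarray(choices), np.asarray(rewards)),
                   bounds=[(0.001, 1.0), (0.001, 20.0)])
    return res.x, res.fun   # ML parameter estimates and negative log likelihood at the optimum
```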
Example behavioral task
Reinforcement learning for reward & punishment:
• participants (31) repeatedly choose between boxes
• each box has a (hidden, changing) chance of giving money (20p)
• also, an independent chance of giving an electric shock (8 on a 1-10 pain scale)
[figure: drifting probabilities of shock and money for each box over 300 trials]
This is good for what?
• parameters may measure something of interest
– e.g. learning rate, monetary value of shock
• allows us to quantify & study neural representations of subjective quantities
– expected value, prediction error
• compare models
• compare groups
Compare models
P(m | D) ∝ P(D | m)·P(m)
P(D | m) = ∫ dθ_m P(D | θ_m, m)·P(θ_m | m)
In principle: 'automatic Occam's razor'
In practice: approximate the integral as max likelihood + penalty: Laplace, BIC, AIC etc. (sketched below). Frequentist version: likelihood ratio test
Or: holdout set; difficult in the sequential case
Good example refs: Ho & Camerer
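For concreteness, the penalty approximations mentioned above can be computed from an ML fit like so (a sketch; nll is the negative log likelihood at the optimum, k the number of parameters, n the number of choices):

```python
import numpy as np

def bic(nll, k, n):
    return 2 * nll + k * np.log(n)

def aic(nll, k):
    return 2 * nll + 2 * k
# Lower scores are better; compare candidate models fit to the same choice data.
```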
Compare groups
• How to model data for a group of subjects?
• Want to account for (potential) inter-subject variability in parameters θ
– this is called treating the parameters as "random effects"
– i.e. random variables instantiated once per subject
– hierarchical model:
• each subject’s parameters drawn from population distribution
• her choices drawn from model given those parameters
P(D | θ, m) = ∏_s ∫ dθ_s P(θ_s | θ) ∏_t P(c_{s,t} | c_{s,1..t−1}, r_{s,1..t−1}, θ_s, m)
Random effects model
P(D | θ, m) = ∏_s ∫ dθ_s P(θ_s | θ) ∏_t P(c_{s,t} | c_{s,1..t−1}, r_{s,1..t−1}, θ_s, m)
Hierarchical model:
– What is θ_s? e.g., a learning rate
– What is P(θ_s | θ)? e.g. a Gaussian, or a mixture of Gaussians
– What is θ? e.g. the mean and variance, over the population, of the regression weights
Interested in identifying population characteristics θ
(all multisubject fMRI analyses work this way)
Random effects model
P(D | θ, m) = ∏_s ∫ dθ_s P(θ_s | θ) ∏_t P(c_{s,t} | c_{s,1..t−1}, r_{s,1..t−1}, θ_s, m)
Interested in identifying population characteristics θ
– method 1: summary statistics of individual ML fits (cheap & cheerful: used in fMRI; sketched below)
– method 2: estimate the integral over parameters, e.g. with Monte Carlo
What good is this?
– can make statistical statements about parameters in the population
– can compare groups
– can regularize individual parameter estimates, i.e. via the population-level prior P(θ_s | θ): "empirical Bayes"
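A sketch of method 1 in Python, assuming per-subject ML estimates have already been obtained (e.g. with the fitting sketch above):

```python
import numpy as np
from scipy import stats

# Summary statistics of individual ML fits: fit each subject separately, then treat
# the per-subject estimates as data points for ordinary group statistics.
def group_stats(subject_estimates):
    """subject_estimates: array of shape (n_subjects, n_params) of per-subject ML fits."""
    est = np.asarray(subject_estimates)
    t, p = stats.ttest_1samp(est, popmean=0.0, axis=0)   # is the population mean nonzero?
    return est.mean(axis=0), t, p

# Group comparisons (e.g. control vs. depleted) work the same way, e.g. with
# stats.ttest_ind on the two groups' per-subject estimates.
```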
Example behavioral task (recap)
Reinforcement learning for reward & punishment: participants (31) repeatedly choose between boxes; each box has a (hidden, changing) chance of giving money (20p) and an independent chance of giving an electric shock (8 on a 1-10 pain scale).
Behavioral analysis
Fit trial-by-trial choices using a "conditional logit" regression model
– coefficients estimate effects on choice of past rewards, shocks, & choices (Lau & Glimcher; Corrado et al)
– selective effect of acute tryptophan depletion?

(indicator blocks, left to right: choice, shock, reward histories)
value(box 1) = [ 0 0 1 0 0 0 1 0… 0 1 1 0 0 0 0 0… 0 1 1 0 0 1 1 0… ] • [weights]
value(box 2) = [ 1 0 0 0 1 0 0 1… 0 0 0 0 1 0 0 1… 1 0 0 0 1 0 0 1… ] • [weights]
etc.

values → choice probabilities using logistic ('softmax') rule:
prob(box 1) ∝ exp(value(box 1))
probabilities → choices stochastically
Estimate weights by maximizing the joint likelihood of choices, conditional on rewards
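A sketch of how the design and choice probabilities might be assembled in Python; the six-lag window and the helper names are illustrative assumptions:

```python
import numpy as np

# Each box's "value" is a dot product of lagged indicator regressors (that box's past
# choices, shocks, and rewards) with a weight vector, as in the slide above.
def lagged_design(indicator, n_lags=6):
    """Stack indicator[t-1], ..., indicator[t-n_lags] as columns (zeros before trial 1)."""
    T = len(indicator)
    X = np.zeros((T, n_lags))
    for lag in range(1, n_lags + 1):
        X[lag:, lag - 1] = indicator[:T - lag]
    return X

def box_values(choices, shocks, rewards, w_choice, w_shock, w_reward):
    # choices/shocks/rewards are 0/1 indicators for this particular box on each trial
    return (lagged_design(choices) @ w_choice
            + lagged_design(shocks) @ w_shock
            + lagged_design(rewards) @ w_reward)

def choice_probs(values_per_box):
    # softmax over boxes on each trial: prob(box) proportional to exp(value(box))
    v = np.stack(values_per_box, axis=1)
    v -= v.max(axis=1, keepdims=True)
    e = np.exp(v)
    return e / e.sum(axis=1, keepdims=True)

# The weights are then estimated by maximizing the joint likelihood of the observed
# choices under these probabilities (e.g. with scipy.optimize.minimize).
```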
Summary statistics of individual ML fits
– fairly noisy (unconstrained model, unregularized fits)
[figure: fitted weights (y-axis: <- avoid – choose ->) on past rewards, shocks, and choices at lags −1 to −6 trials]
Models predict exponential decays in reward & shock weights
– & typically neglect choice-choice autocorrelation
[figure: same lag-weight plot, alongside the predicted exponential decay of weights over the previous 10 trials]
Fit of TD model (w/ exponentially decaying choice sensitivity), visualized the same way
(5x fewer parameters, essentially as good a fit to the data; estimates better regularized)
[figure: model-implied lag weights for reward, shock, and choice]
Quantify value of pain
[figure: fitted weights rescaled to monetary units; values shown include £0.20, £0.04, −£0.12]
Effect of acute tryptophan depletion?
[figure: lag weights for reward, shock, and choice, control vs. depleted groups]
Depleted participants are:
• equally shock-driven
• more 'sticky' (driven to repeat choices)
• less money-driven (this effect less reliable)
Linear effects of blood tryptophan levels:
• shock sensitivity: no reliable effect (p > .5)
• choice sensitivity: p < .005
• reward sensitivity: p < .01
[figures: per-subject sensitivities plotted against change in blood tryptophan, alongside control vs. depleted lag-weight plots]
overview
• reinforcement learning
• model fitting: behavior
• model fitting: fMRI
– random effects
– RL regressors
[figure: group map showing left and right frontopolar (LFP, rFP) activations at p<0.01 and p<0.001]
What does this mean when there are multiple subjects?
• regression coefficients as random effects
• if we drew more subjects from this population, is the expected effect size > 0?
History
1990-1991 – SPM paper, software released, used for PET
low ratio of samples to subjects (within-subject variance
not important)
1992-1997 – Development of fMRI
more samples per subject
1998 – Holmes & Friston introduce distinction between fixed
and random effects analysis in conference presentation;
reveal SPM had been fixed effects all along
1999 – Series of papers semi-defending fixed effects; but
software fixed
RL & fMRI
• Common approach: fit models to behavior, use models
to generate regressors for fMRI GLM
– eg predicted value; error in predicted value
– where in brain does BOLD signal correlate with computationally
generated signal (convolved with HRF)?
– quantify & study neural representation of subjective factors
[figures: reward prediction error correlates (O’Doherty et al 2003 and lots of other papers; Schoenberg et al 2007)]
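A sketch of how such a regressor might be built in Python; the gamma-difference HRF, TR, and z-scoring below are common choices assumed here, not taken from the slides or the cited papers:

```python
import numpy as np
from scipy.stats import gamma

# Put the model-derived quantity (e.g. the trial-by-trial prediction error) at event
# onsets, convolve with a canonical HRF, and enter the result as a column of the fMRI GLM.
def canonical_hrf(tr=2.0, duration=32.0):
    t = np.arange(0, duration, tr)
    return gamma.pdf(t, 6) - gamma.pdf(t, 16) / 6.0   # peak ~5 s, late undershoot

def model_regressor(onsets, values, n_scans, tr=2.0):
    """onsets in seconds; values = one model-derived number per event."""
    signal = np.zeros(n_scans)
    for onset, v in zip(onsets, values):
        idx = int(round(onset / tr))
        if idx < n_scans:
            signal[idx] += v
    reg = np.convolve(signal, canonical_hrf(tr))[:n_scans]
    return (reg - reg.mean()) / reg.std()   # standardize before entering the GLM
```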
Examples: Value expectation
(exactly the same approach is common in animal physiology)
– Sugrue et al. (2004): primate LIP neurons
– Daw, O’Doherty et al. (2006): vmPFC activity in humans
[figure: % signal change vs. probability of chosen action]
note: can also fit parametric models to neural signals; compare neural & behavioral fits (Kable et al 2007; Tom et al 2007)
note 2: must as always be suspicious about spurious correlations
– still good to use controls (e.g. is the regressor loading better in this condition than another)
Examples: loss aversion
Tom et al (2007): compare loss aversion estimated from neural value signals to behavioral loss aversion from choices
[figure: utility as a function of money]
example
positional uncertainty in navigation task (Yoshida et al 2006)
model: subjects assume they are someplace until proven wrong; then try assuming somewhere else
estimate where subject thinks they are at each step
correlate uncertainty in position estimate with BOLD signal
summary
• trial and error learning & choice
– interaction between the two
– rich theory even for simple tasks
• model fits to choice behavior
– hierarchical model of population
– quantify subjective factors
• same methods for fMRI, ephys
– but keep your wits about you