dopamine and prediction error
TD error
$\delta_t = r_t + V_{t+1} - V_t$
[Figure: dopamine neuron recordings (Schultz 1997), three conditions: no prediction (reward delivered), prediction followed by reward, prediction with reward omitted; traces show $R$, $V_t$, and $\delta(t)$]
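A minimal sketch of this TD error in Python, with the three Schultz conditions as hypothetical values (discount factor of 1, matching the formula above):

```python
# Temporal-difference prediction error: delta_t = r_t + V_{t+1} - V_t
def td_error(reward, v_next, v_current):
    return reward + v_next - v_current

# The three conditions from the figure (values are illustrative):
print(td_error(reward=1.0, v_next=0.0, v_current=0.0))  # no prediction, reward: +1
print(td_error(reward=1.0, v_next=0.0, v_current=1.0))  # prediction, reward: 0
print(td_error(reward=0.0, v_next=0.0, v_current=1.0))  # prediction, no reward: -1
```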
humans are no different
• dorsomedial striatum/PFC
– goal-directed control
• dorsolateral striatum
– habitual control
• ventral striatum
– Pavlovian control; value signals
• dopamine...
in humans…
5 stimuli: 40¢, 20¢, 0/40¢, 0¢, 0¢
[Trial diagram: stimulus < 1 sec; ISI 5 sec; outcome ("You won 40 cents") 0.5 sec]
19 subjects (dropped 3 non-learners, N=16)
3T scanner, TR = 2 sec, interleaved
234 trials: 130 choice, 104 single-stimulus
randomly ordered and counterbalanced
2–5 sec ITI
what would a prediction error look like (in BOLD)?
prediction errors in NAC
• Unbiased anatomical ROI in nucleus accumbens (marked per subject*)
• Raw BOLD (averaged over all subjects)
• Can actually decide between different neuroeconomic models of risk
* thanks to Laura deSouza
Polar Exploration
Peter Dayan
Nathaniel Daw, John O’Doherty, Ray Dolan
Exploration vs. exploitation
[Figure: reward vs. time]
• Exploitation
– Choose the action expected to be best
– May never discover something better
• Exploration
– Choose an action expected to be worse
– If it is worse, go back to the original; if it is better, exploit it in the future
– Balanced by the long-term gain if it turns out better
– (Even for risk- or ambiguity-averse subjects)
NB: learning is nontrivial when outcomes are noisy or changing
Bayesian analysis (Gittins 1972)
• Tractable dynamic program in a restricted class of problems: the “n-armed bandit”
• Solution requires balancing
– Expected outcome values
– Uncertainty (need for exploration)
– Horizon/discounting (time to exploit)
– Choose the best sum of value plus bonus; the bonus increases with uncertainty
• Optimal policy: explore systematically
• Intractable in general settings
– Various heuristics used in practice
[Figure: value plus uncertainty bonus, per action]
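As an illustration of the “value plus bonus” rule, a minimal sketch follows; the additive bonus form and all numbers are assumptions for illustration, not the Gittins index itself:

```python
import numpy as np

def choose_arm(means, stds, bonus_weight=1.0):
    # Pick the arm with the best sum of estimated value plus an
    # uncertainty bonus; the bonus grows with the estimate's uncertainty.
    return int(np.argmax(means + bonus_weight * stds))

means = np.array([2.0, 1.5, 1.8, 1.0])  # estimated payoffs (hypothetical)
stds  = np.array([0.1, 0.8, 0.3, 1.2])  # uncertainty in each estimate
print(choose_arm(means, stds))          # arm 1: mediocre mean, high uncertainty
```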
Experiment
• How do humans handle the tradeoff?
• Computation: which strategies fit behavior?
– Several popular approximations
– Key difference: what information influences exploration?
• Neural substrate: what systems are involved?
– PFC, high-level control
– Competitive decision systems (Daw et al. 2005)
– Neuromodulators: dopamine (Kakade & Dayan 2002); norepinephrine (Usher et al. 1999)
Task design
Subjects (14 healthy, right-handed) repeatedly choose between four slot machines for points (“money”), in the scanner.
Trial timeline:
– Trial onset: slots revealed
– +~430 ms: subject makes choice; chosen slot spins
– +~3000 ms: outcome: payoff revealed (“obtained 57 points”)
– +~1000 ms: screen cleared; trial ends
Payoff structure
• Noisy, to require integration of data: subjects learn about payoffs only by sampling them
• Nonstationary, to encourage ongoing exploration (Gaussian drift w/ decay)
[Figure: example payoff trajectories over trials]
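A minimal sketch of one way to simulate such payoffs: each slot’s mean follows a Gaussian random walk that decays toward a central value, and observed payoffs add Gaussian noise. All parameter values here are illustrative assumptions, not the experiment’s actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_slots = 300, 4
decay, center = 0.98, 50.0    # decay pulls each mean back toward `center`
drift_sd, obs_sd = 3.0, 4.0   # diffusion noise and observation noise

means = np.full(n_slots, center)
payoffs = np.empty((n_trials, n_slots))
for t in range(n_trials):
    payoffs[t] = rng.normal(means, obs_sd)          # noisy payoff if sampled
    means = (decay * means + (1 - decay) * center
             + rng.normal(0.0, drift_sd, n_slots))  # Gaussian drift w/ decay
```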
Analysis strategy
• Behavior: fit an RL model to choices
– Find best-fitting parameters
– Compare different exploration models
• Imaging: use the model to estimate subjective factors (explore vs. exploit, value, etc.)
– Use these as regressors for the fMRI signal
– After Sugrue et al.
Behavior
Behavior model
1. Estimate payoffs: Kalman filter (error update, like TD, but exact inference)
   $\mu_{green}, \mu_{red}$, etc.; $\sigma_{green}, \sigma_{red}$, etc.
   Update after observing the payoff on trial $t$:
   $\mu_{red} \leftarrow \mu_{red} + \kappa \delta$
   $\kappa = \sigma_{red}^2 / (\sigma_{red}^2 + \sigma_o^2)$
   $\sigma_{red}^2 \leftarrow (1 - \kappa)\, \sigma_{red}^2$
2. Derive choice probabilities $P_{green}, P_{red}$, etc.
   Choose randomly according to these
[Figure: estimated payoff distribution for one slot, updated from trial $t$ to trial $t+1$]
(Behrens & volatility)
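A minimal sketch of this Kalman update for one slot; the observation-noise value and the prior are illustrative assumptions:

```python
def kalman_update(mu, sigma2, payoff, sigma2_obs=16.0):
    delta = payoff - mu                     # prediction error (like TD)
    kappa = sigma2 / (sigma2 + sigma2_obs)  # Kalman gain = learning rate
    mu = mu + kappa * delta                 # mean moves toward the payoff
    sigma2 = (1 - kappa) * sigma2           # uncertainty shrinks after observing
    return mu, sigma2

mu, sigma2 = 50.0, 100.0  # prior belief about the red slot (hypothetical)
mu, sigma2 = kalman_update(mu, sigma2, payoff=57.0)
print(mu, sigma2)         # ~56.0, ~13.8: large gain because the prior was uncertain
```

Unlike a fixed-learning-rate TD update, the gain $\kappa$ adapts: it is large when the estimate is uncertain and shrinks as evidence accumulates.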
(smarter)
mgreen , mred etc
sgreen , sred etc
Compare rules:
How is
2. Derive choice
exploration
probabilities
directed?
By value
“softmax”
Action
By value and uncertainty
“uncertainty bonuses”
Probability
Randomly
“e-greedy”
Value
Behavior model
1  3e
Pred  
 e
(dumber)
if mred  max(all m )
otherwise
Pred  exp( βμred )
Pred  exp( β[ μred +φσred ])
(smarter)
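Minimal sketches of the three rules for four slots; $\beta$, $\epsilon$, and $\phi$ are free parameters fit to behavior, and the values below are illustrative assumptions:

```python
import numpy as np

def epsilon_greedy(mu, eps=0.05):
    p = np.full(len(mu), eps)                   # each non-greedy slot gets eps
    p[np.argmax(mu)] = 1 - (len(mu) - 1) * eps  # greedy slot: 1 - 3*eps for 4 slots
    return p

def softmax(mu, beta=0.2):
    e = np.exp(beta * (mu - mu.max()))          # subtract max for numerical stability
    return e / e.sum()

def uncertainty_bonus(mu, sigma, beta=0.2, phi=1.0):
    return softmax(mu + phi * sigma, beta)      # bonus increases with uncertainty

mu    = np.array([55.0, 48.0, 60.0, 40.0])      # Kalman means (hypothetical)
sigma = np.array([ 4.0, 12.0,  3.0, 15.0])      # Kalman SDs (hypothetical)
print(epsilon_greedy(mu))
print(softmax(mu))
print(uncertainty_bonus(mu, sigma))
```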
Model comparison
• Assess models based on the likelihood of actual choices
– Product over subjects and trials of the modeled probability of each choice
– Find maximum-likelihood parameters (inference parameters, choice parameters)
• Parameters yoked between subjects
– (… except choice noisiness, to model all heterogeneity)
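A minimal sketch of this fitting procedure for the softmax rule: sum the negative log probabilities the model assigns to the actual choices and minimize over parameters (scipy’s optimizer is used here as a convenience; the data are toy stand-ins, and the real analysis is a product over all subjects and trials using the Kalman-filter estimates):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, mu_history, choices):
    beta = params[0]
    nll = 0.0
    for mu, c in zip(mu_history, choices):
        e = np.exp(beta * (mu - mu.max()))
        p = e / e.sum()          # softmax choice probabilities
        nll -= np.log(p[c])      # log-probability of the actual choice
    return nll

# Toy data: 3 trials, 4 slots (hypothetical Kalman means and choices).
mu_history = [np.array([50.0, 48.0, 60.0, 40.0])] * 3
choices = [2, 2, 0]
fit = minimize(neg_log_likelihood, x0=[0.1], args=(mu_history, choices),
               bounds=[(0.0, 5.0)])
print(fit.x[0], fit.fun)  # best-fitting beta and its -log likelihood
```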
Behavioral results

Model                 −log likelihood (smaller is better)   # parameters
ε-greedy              4208.3                                19
softmax               3972.1                                19
uncertainty bonuses   3972.1                                20

• Strong evidence for exploration directed by value
• No evidence for direction by uncertainty
– Tried several variations
Imaging methods
• 1.5 T Siemens Sonata scanner
• Sequence optimized for OFC (Deichmann et al. 2003)
• 2×385 volumes; 36 slices; 3 mm thickness
• TR = 3.24 sec
• SPM2 random-effects model
• Regressors generated using the fitted model and the trial-by-trial sequence of actual choices/payoffs
Imaging results
• TD error: dopamine targets (dorsal and ventral striatum)
• Replicates previous studies, but weakish (graded payoffs?)
[Figure: vStr (x,y,z = 9,12,−9) and dStr (x,y,z = 9,0,18), thresholded at p<0.01 and p<0.001]
Value-related correlates
• Probability (or expected value) of chosen action: vmPFC (x,y,z = −3,45,−18)
• Payoff amount: mOFC (x,y,z = 3,30,−21)
[Figure: activation maps thresholded at p<0.01 and p<0.001; % signal change plotted against probability of chosen action (vmPFC) and against payoff (mOFC)]
Exploration
• Non-greedy > greedy choices: exploration
• Frontopolar cortex, bilaterally (x,y,z = −27,48,4; 27,57,6)
• Survives whole-brain correction
[Figure: left and right frontopolar (LFP, rFP) activations, thresholded at p<0.01 and p<0.001]
Timecourses
[Figure: BOLD timecourses in frontal pole and IPS]
Checks
• Do other factors explain the differential BOLD activity better?
– Multiple regression vs. RT, actual reward, predicted reward, choice probability, stay vs. switch, uncertainty, more
– Only explore/exploit is significant
– (But 5 additional putative explore areas eliminated)
• Individual subjects: BOLD differences stronger for better behavioral fit
Frontal poles
• “One of the least well understood regions of the human brain”
• No cortical connections outside PFC (“PFC for PFC”)
• Rostrocaudal hierarchy in PFC (Christoff & Gabrieli 2000; Koechlin et al. 2003)
• Imaging: high-level control
– Coordinating goals/subgoals (Koechlin et al. 1999; Braver & Bongiolatti 2002; Badre & Wagner 2004)
– Mediating cognitive processes (Ramnani & Owen 2004)
– Nothing this computationally specific
• Lesions: task switching (Burgess et al. 2000)
– More generically: perseveration
Interpretation
• Cognitive decision to explore overrides habit circuitry? Via parietal?
– Higher FP response when exploration is chosen most against the odds
– Explore RTs are longer
• Exploration and exploitation are neurally distinct
• Computationally surprising, and especially bad for uncertainty-bonus schemes
– Proper exploration requires computational integration
– No behavioral evidence either
• Why softmax? It can misexplore
– Deterministic bonus schemes are bad in adversarial/multi-agent settings
– Dynamic temperature control? (norepinephrine; Usher et al.; Doya)
Conclusions
• Subjects direct exploration by value but not uncertainty
• Cortical regions differentially implicated in exploration
– Computational consequences
• Integrative approach: computation, behavior, imaging
– Quantitatively assess & constrain models using raw behavior
– Infer subjective states using the model, study neural correlates
Open Issues
• Model-based vs. model-free vs. Pavlovian control
– Environmental priors vs. naive optimism vs. neophilic compulsion?
• Environmental priors and generalization
– Curiosity/“intrinsic motivation” from expected future reward