Supplemental Methods
Experimental Paradigm
Subjects completed a total of 200 trials (4 conditions × 50 trials per condition) for each
drug/placebo session, with the order of the conditions counterbalanced across runs and
subjects. Within each run of 50 trials, an intertrial interval of 0-8 seconds separated each
trial. Although subjects were not directly cued to the nature of the reward space, the
clock face changed color between runs to indicate that a different task condition was
present. A different set of four colors was used on the second day to avoid cuing subjects
to the reward structure.
Power Analysis
For power computations, the primary data analysis endpoint for the exploration parameter
was the interaction between the factors of drug and genotype in a two-way repeated-measures
analysis of variance (ANOVA). For a repeated measure with two levels (drug condition) and a
between-subjects factor with three levels (genotype), we calculated the number of
subjects needed to detect an effect with 80% power at an alpha of 0.05, assuming a
moderate effect size (0.25) and a modest correlation between repeated measures (0.25).
Based on these parameters, the critical F value was 3.15, and the necessary number of
subjects was 63. Selecting greater than 80% power, a smaller effect size, or a smaller
correlation between repeated measures would naturally increase the number of subjects
required.
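The reported critical F value can be checked by hand under one assumption that the text does not state explicitly: that the interaction test uses the conventional degrees of freedom for a design with k groups and m repeated measures, df1 = (k - 1)(m - 1) and df2 = (N - k)(m - 1). When df1 = 2 the F survival function has a closed form, so the sketch below needs no statistics library. The function names are illustrative only.

```python
def interaction_dfs(n_subjects, n_groups, n_measures):
    """Degrees of freedom for the group x repeated-measure interaction
    (assumed conventional mixed-design ANOVA formulas)."""
    df1 = (n_groups - 1) * (n_measures - 1)
    df2 = (n_subjects - n_groups) * (n_measures - 1)
    return df1, df2

def critical_f_df1_is_2(df2, alpha):
    """Critical F for numerator df = 2 only.
    For df1 = 2, P(F > x) = (1 + 2x/df2) ** (-df2/2), which inverts to
    x = (df2/2) * (alpha ** (-2/df2) - 1)."""
    return (df2 / 2.0) * (alpha ** (-2.0 / df2) - 1.0)

df1, df2 = interaction_dfs(n_subjects=63, n_groups=3, n_measures=2)
f_crit = critical_f_df1_is_2(df2, alpha=0.05)
print(df1, df2, round(f_crit, 2))
```

Under these assumptions the sketch gives df = (2, 60) and a critical F of about 3.15, in agreement with the value reported above. It does not reproduce the power calculation itself, which requires the noncentral F distribution.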
Computational Model: Overview
Previous studies have shown that RTs in the exploration-exploitation task can be well
predicted by assuming that participants track the outcome statistics associated with two
general classes of responses (‘fast’ vs. ‘slow’ – they are instructed in advance that
sometimes they will do better by responding faster and sometimes slower). However,
this assumption does not limit the ability of the model to account for continuous
variations in RT.
By tracking statistics only with these two classes, subjects can
nevertheless adjust their RTs in a continuous way as a function of the relative difference
in the expected values of these options (Badre et al, 2012; Cavanagh et al, 2012; Frank et
al, 2009; Strauss et al, 2011). Indeed, the exploitation component of the model estimates
the degree to which responses are adjusted as a function of the relative difference in the
mean expected values for fast and slow responses.
More specifically, both exploitation of the RTs producing the highest rewards and
exploration for even better rewards are driven by errors of prediction in tracking expected
reward value V. On trial t, the expected reward for each clock face is calculated as
V(t) = V(t-1) + αδ(t-1)
where α determines how rapidly V is updated and δ is the reward prediction error (RPE).
This V value represents the average expected value of rewards in each clock face, but the
goal of the learner is to try to maximize their rewards and hence select RTs that produce
the largest number of positive deviations from this average. Because it is unreasonable to
assume that learners track a separate value for each RT from 0 to 5000ms, a simplifying
assumption is that the learners can track the values of responses that are faster or slower
than average.
A strategic exploitation component thus tracks the reward structure
associated with distinct response classes (fast or slow, respectively). This component is
intended to capture how participants track the probability of realizing a better than
expected outcome (positive RPE) following faster or slower responses, allowing them to
continuously adjust RTs in proportion to their relative value differences.
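The expected-value update can be sketched in a few lines. The sketch takes the RPE to be the obtained reward minus the current expected value (the standard definition), uses the fixed learning rate of 0.1 reported later in this supplement, and assumes a starting value of zero. The reward magnitudes are made up for illustration.

```python
def update_expected_value(v_prev, reward_prev, alpha=0.1):
    """One step of V(t) = V(t-1) + alpha * delta(t-1).
    delta is the reward prediction error: obtained reward minus V."""
    delta = reward_prev - v_prev
    return v_prev + alpha * delta, delta

v = 0.0  # assumed initial expected value
for reward in [40, 60, 20, 80]:  # hypothetical reward magnitudes
    v, delta = update_expected_value(v, reward)
    sign = "positive" if delta > 0 else "negative"
    print(f"V = {v:.2f}, RPE = {delta:.2f} ({sign})")
```

The sign of each RPE, rather than its raw size, is what the fast/slow tracking described above relies on.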
The model contains two factors relevant to exploitation: the parameter ρ and the quantity
αG - αN (described below). Unlike the explore parameter ε, which scales the difference in
the uncertainties of the two (fast/slow) distributions, the parameter ρ scales the degree to
which subjects adjust their responses as a function of the relative difference in the mean
expected values:
ρ[μ_slow(t) − μ_fast(t)]
Thus, subjects with larger ρ parameters tend to choose the option that has led to a larger
number of positive prediction errors in the past. Similarly, the model also incorporates a
potential bias for subjects to speed / slow RTs differentially for positive and negative
reward prediction errors, captured by parameters αG and αN that represent learning rates in
the striatal “Go” and “NoGo” pathways, respectively. Importantly for this study, our
previous work has demonstrated that these factors are affected by striatal genes and/or
more diffuse dopaminergic changes (Frank et al, 2009; Moustafa et al, 2008), leading to
the prediction that they should be insensitive to the effects of COMT inhibition.
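As a minimal sketch of how the ρ and ε terms relate to the fast and slow belief distributions, the code below tracks each response class with a beta distribution. Several details are assumptions of the sketch rather than statements from this supplement: a uniform Beta(1, 1) prior, an increment of one count per positive or negative RPE, and an explore term written as ε times the slow-minus-fast difference in standard deviations, by analogy with the exploitation term. All class and function names are hypothetical.

```python
import math

class BetaBelief:
    """Belief about the probability of a positive RPE for one response class."""

    def __init__(self, a=1.0, b=1.0):  # assumed uniform prior
        self.a, self.b = a, b

    def update(self, rpe):
        # Count a positive RPE as a success, anything else as a failure.
        if rpe > 0:
            self.a += 1.0
        else:
            self.b += 1.0

    @property
    def mean(self):
        return self.a / (self.a + self.b)

    @property
    def sd(self):
        n = self.a + self.b
        return math.sqrt(self.a * self.b / (n * n * (n + 1.0)))

def exploit_term(rho, slow, fast):
    """rho scales the difference in mean expected values."""
    return rho * (slow.mean - fast.mean)

def explore_term(eps, slow, fast):
    """eps scales the difference in uncertainties (standard deviations)."""
    return eps * (slow.sd - fast.sd)

slow, fast = BetaBelief(), BetaBelief()
for rpe in (5.0, 2.0, 8.0):   # three better-than-expected slow responses
    slow.update(rpe)
for rpe in (3.0, -4.0):       # one better, one worse fast response
    fast.update(rpe)

print(exploit_term(300.0, slow, fast))  # positive: shift toward slower RTs
print(explore_term(300.0, slow, fast))  # negative: fast is the more uncertain class
```

In this toy case the slow class has the higher mean (0.8 versus 0.5) but the lower uncertainty, so the two terms push RT in opposite directions.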
Computational Model: Detailed Explanation
In addition to the components described above and in the main text, the full RT model
included additional contributions to responding that were not a focus of the present
experiment, for consistency with prior reports. The full model estimates reaction time
( R̂T ) on trial t as follows:
R̂T(t) = K + λRT(t-1) − Go(t) + NoGo(t) + ρ[μ_slow(t) − μ_fast(t)] + ν[RTbest − RTavg] + Explore(t)
where K is a free parameter capturing baseline response speed (irrespective of reward), λ
reflects autocorrelation between the current and previous RT, and ν captures the
tendency to adapt RTs toward the single largest reward experienced thus far (“going for
gold”). For details on these parameters, see Frank et al. (2009). Their inclusion allows the
model to better fit overall RTs but we have found that they do not change the pattern of
findings associated with exploration or exploitation.
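The way the components combine into a single predicted RT can be written as one function. This is only a sketch of the summation described above; the argument names and all numerical values are hypothetical and chosen to make the arithmetic easy to follow.

```python
def predicted_rt(K, lam, rt_prev, go, nogo, rho, mu_slow, mu_fast,
                 nu, rt_best, rt_avg, explore):
    """Predicted RT (ms) as the sum of the model components."""
    return (K                              # baseline response speed
            + lam * rt_prev                # autocorrelation with the previous RT
            - go                           # Go learning speeds responding
            + nogo                         # NoGo learning slows responding
            + rho * (mu_slow - mu_fast)    # strategic exploitation
            + nu * (rt_best - rt_avg)      # "going for gold"
            + explore)                     # uncertainty-driven exploration

rt_hat = predicted_rt(K=1000.0, lam=0.2, rt_prev=2000.0, go=100.0, nogo=50.0,
                      rho=500.0, mu_slow=0.8, mu_fast=0.5,
                      nu=0.1, rt_best=3000.0, rt_avg=2000.0, explore=-60.0)
print(round(rt_hat, 1))
```

With these made-up values the terms are 1000 + 400 - 100 + 50 + 150 + 100 - 60, for a predicted RT of 1540 ms.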
Go and NoGo learning reflect a striatal bias to speed responding as a function of positive
reward prediction errors (RPEs) and to slow responding as a function of negative RPEs.
Evidence for speeding and slowing in the task is separately tracked:
Go(t) = Go(t-1) + αG δ+(t-1)
NoGo(t) = NoGo(t-1) + αN δ−(t-1)
where αG and αN are the previously-described learning rates scaling the effects of positive (δ+)
and negative (δ−) errors in expected value prediction V (i.e., positive and
negative RPE). Go learning speeds RT, while NoGo learning slows it. This bias to speed
and slow RTs as a function of positive and negative RPEs is adaptive in this task given
that subjects tend to initially make relatively fast responses, and prior studies have found
that these biases and model parameters are influenced by striatal dopaminergic
manipulations and genetics (Frank et al, 2009; Moustafa et al, 2008). However, this
approach does not consider when it is best to respond in a strategic manner, and in fact, it
is not adaptive in environments where slow responses yield higher rewards (in which
case positive and negative RPEs will lead to maladaptive RT adjustments).
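The separate tracking of speeding and slowing evidence described above can be sketched as follows. The sketch assumes that the positive and negative RPE components enter as magnitudes, so that both quantities grow over time and NoGo adds to RT while Go subtracts from it. The function name and the example learning rates are hypothetical.

```python
def update_go_nogo(go, nogo, delta, alpha_g, alpha_n):
    """Accumulate positive RPEs in Go and negative RPEs in NoGo.
    Both RPE components are taken as magnitudes (an assumption)."""
    delta_plus = max(delta, 0.0)    # positive part of the RPE
    delta_minus = max(-delta, 0.0)  # size of the negative part of the RPE
    return go + alpha_g * delta_plus, nogo + alpha_n * delta_minus

go, nogo = 0.0, 0.0
for delta in [20.0, -10.0, 5.0]:  # hypothetical RPEs
    go, nogo = update_go_nogo(go, nogo, delta, alpha_g=0.5, alpha_n=0.25)
print(go, nogo)  # net RT adjustment is -go + nogo
```

Because the adjustment depends only on the sign of the RPE and not on which response class produced it, the same rule speeds responding after any good outcome, which is why it is maladaptive when slow responses are the better option.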
For the more strategic exploitative component, reward statistics were computed via
Bayesian updating of “fast” or “slow” actions, as described in the main text. Fast or slow
actions were classified based on whether they were faster or slower than the local
average, which was computed as:
RTavg(t) = RTavg(t-1) + [RT(t-1) - RTavg(t-1)]
However, fast and slow responses can be defined in other ways – for example, based on
whether the clock hand is in the first or second half of the clock face – and outcomes
from the model are the same. (The use of an adaptive version of the boundary is more
general and would allow the algorithm to converge to an appropriate RT even if reward
functions are non-monotonic).
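A minimal sketch of the adaptive boundary is given below. It assumes that the local average is updated with a step size, here called `alpha` (a value of 1 would make the average equal to the previous RT), and that an RT exactly equal to the average counts as fast. Both choices, the starting average, and the function names are assumptions of the sketch.

```python
def update_rt_avg(rt_avg_prev, rt_prev, alpha=0.1):
    """Move the local average a fraction alpha toward the last RT."""
    return rt_avg_prev + alpha * (rt_prev - rt_avg_prev)

def classify(rt, rt_avg):
    """Label a response relative to the current local average."""
    return "slow" if rt > rt_avg else "fast"

rt_avg = 2000.0  # hypothetical starting average, in ms
labels = []
for rt in [3000.0, 2500.0, 1500.0, 2050.0]:  # hypothetical RTs, in ms
    labels.append(classify(rt, rt_avg))
    rt_avg = update_rt_avg(rt_avg, rt)
print(labels, round(rt_avg, 1))
```

Because the boundary follows the subject's own recent RTs, the same response time can be labeled slow early in a run and fast later on.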
Free parameters were estimated for each subject via the Simplex method as those
minimizing the sum of squared error between predicted and observed RTs. Multiple
starting points were used for each optimization process to reduce the likelihood of local
minima. All parameters were free to vary for each participant, with the exception of the learning rate α
used in the expected value (V) update, which was set to 0.1 for all participants to
prevent model degeneracy (Frank et al, 2009). For other details regarding the primary
continuous RT model, including alternative models that provide poorer behavioral fits to
the data, please see Frank et al (2009). Note that among the alternative models tested in
that paper is a Kalman filter model in which the mean expected reward values and their
uncertainties are estimated with Normal distributions, rather than the beta distributions
used here. However, the variance (uncertainty) in the Kalman filter tends to be overly
dominated by the first trial: subjects are given no information about the number of
possible points that they might gain, leading to large variance in initial estimates, which
then declines more dramatically after a few trials when rewards are experienced than
does the variance in the probability distributions for the beta priors. This phenomenon
means that, under this model, the neural estimate of relative uncertainty would largely reflect the
contributions of only a few trials. For the “RT difference model”, the same
procedure was used, except that parameters were optimized to predict the change in RT
from one trial to the next, rather than the raw RT (and hence parameters related to
autocorrelation and “going for gold” were dropped, because these quantities reflect
attempts to model trends in overall RTs, not their differences).
For the “negative-permitting model”, a similar procedure was used, except that epsilon values were
permitted to take negative values. However, because participants tend to repeat the same
action (fast or slow) after they have learned, this exploitation tendency results in the most
selected action becoming the least uncertain (due to more sampling), and hence can cause
a negative exploration parameter simply due to repeated action selection. To account for
this phenomenon we also allowed a “sticky choice” parameter that captures the tendency
for RTs in a given trial to be autocorrelated not only with the last trial but with an
exponentially decaying history (see Badre et al 2012 and Cavanagh et al 2012). To
maintain the same total number of parameters we removed the “going for gold”
component in this model.
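The fitting procedure (simplex search on the sum of squared errors, repeated from several starting points) can be illustrated with a self-contained toy example. This is a sketch and not the code used for the reported fits: the simplex routine is a deliberately minimal Nelder-Mead variant, the model has only two parameters (a baseline K and an autocorrelation weight lam), and the data and starting ranges are invented. In practice a library routine such as SciPy's `scipy.optimize.minimize` with `method="Nelder-Mead"` would normally be preferred.

```python
import random

def sse(predicted, observed):
    """Sum of squared errors between predicted and observed RTs."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))

def nelder_mead(f, x0, step=1.0, iters=1000):
    """Minimal Nelder-Mead simplex search: reflection, expansion,
    inside contraction, and shrink, with the usual coefficients."""
    n = len(x0)
    simplex = [list(x0)]
    for i in range(n):
        p = list(x0)
        p[i] += step
        simplex.append(p)
    for _ in range(iters):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        centroid = [sum(p[i] for p in simplex[:-1]) / n for i in range(n)]
        reflect = [c + (c - w) for c, w in zip(centroid, worst)]
        if f(reflect) < f(best):
            expand = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            simplex[-1] = expand if f(expand) < f(reflect) else reflect
        elif f(reflect) < f(simplex[-2]):
            simplex[-1] = reflect
        else:
            contract = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            if f(contract) < f(worst):
                simplex[-1] = contract
            else:  # shrink every point halfway toward the best one
                simplex = [best] + [[b + 0.5 * (x - b) for b, x in zip(best, p)]
                                    for p in simplex[1:]]
    return min(simplex, key=f)

def fit_multistart(objective, n_starts=5, seed=0):
    """Run the simplex search from several random starts; keep the best fit."""
    rng = random.Random(seed)
    fits = []
    for _ in range(n_starts):
        start = [rng.uniform(0.0, 3.0), rng.uniform(-1.0, 1.0)]
        fits.append(nelder_mead(objective, start))
    return min(fits, key=objective)

# Toy data (RTs in seconds), generated from K = 1.0 and lam = 0.3.
prev_rt = [1.0, 2.5, 1.5, 3.0, 2.0, 0.5, 3.5, 1.2]
observed = [1.0 + 0.3 * r for r in prev_rt]

def objective(params):
    K, lam = params
    return sse([K + lam * r for r in prev_rt], observed)

K_hat, lam_hat = fit_multistart(objective)
print(round(K_hat, 3), round(lam_hat, 3))
```

Keeping the best of several starts is what guards against local minima; for the full model the objective would be the SSE between the model's predicted RTs and each subject's observed RTs.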
Model Fits
Goodness of fit was assessed by a sum of squared errors (SSE) term that reflected the
difference between subject behavior (main paper, Figure 3A) and model predictions
(main paper, Figure 3B).
Because the number of data points and the number of
parameters were identical across subjects, more complicated measures of goodness of fit
(e.g. the Akaike Information Criterion) devolve to this simpler value. Mean SSE values
were not significantly different across drug conditions: 6.3 × 10⁷ ± 2.7 × 10⁶ (s.e.m.) for
tolcapone versus 6.4 × 10⁷ ± 2.7 × 10⁶ for placebo (F(1,62) = 0.54, p = 0.82 (n.s.)). A
trend-level difference in SSE between genotypes (F(2,62) = 2.68, p = 0.076) was driven
by a trend difference between SSE values for Met/Met (7.1 × 10⁷ ± 4.5 × 10⁶) and
Val/Val (5.8 × 10⁷ ± 3.9 × 10⁶) subjects (Tukey’s test, p = 0.067). All other comparisons
between groups, including Met/Val subjects (6.2 × 10⁷ ± 3.5 × 10⁶), were not significant.
Finally, there was no drug x genotype interaction (F(2,62) = 0.38, p = 0.69 (n.s.)),
arguing that even these trend-level differences cannot explain reliable differences in the
Explore parameter across genotypes.
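The statement that the Akaike Information Criterion devolves to SSE when the number of data points and parameters is fixed can be illustrated numerically. The sketch assumes the common least-squares form of the criterion under Gaussian errors, AIC = n ln(SSE/n) + 2k, up to an additive constant. The parameter count used below is a placeholder, and the SSE values are the genotype means quoted above.

```python
import math

def aic_least_squares(sse, n_points, n_params):
    """AIC for a least-squares fit with Gaussian errors (up to a constant)."""
    return n_points * math.log(sse / n_points) + 2 * n_params

n_points = 200   # trials per session
n_params = 8     # placeholder parameter count, identical for every fit
sse_values = {"Val/Val": 5.8e7, "Met/Val": 6.2e7, "Met/Met": 7.1e7}

aic_values = {g: aic_least_squares(s, n_points, n_params)
              for g, s in sse_values.items()}

rank_by_sse = sorted(sse_values, key=sse_values.get)
rank_by_aic = sorted(aic_values, key=aic_values.get)
print(rank_by_sse == rank_by_aic)
```

With n and k held constant, AIC is a monotonically increasing function of SSE, so any comparison between fits gives the same ordering under either measure.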
We have also investigated whether the balance between reduced model complexity and
increased goodness of fit might be better addressed by removing/adding model
parameters. Our previous analyses converged upon the core model used in this report;
their details are beyond the current scope but can be found in our previous work (Frank et
al, 2009).
Genetic Analysis
Genotype and Locus of Control.
Genotype groups did not differ significantly with
respect to age, alcohol use, anxiety, impulsivity, or locus of control scores. In particular,
as also noted in the body of the paper, no difference in LOC values was seen between the
three COMT genotypes (F(2,62) = 0.88, p = 0.42 (n.s.)). Moreover, the median LOC
value of 13 was well approximated by the mean values across genotypes: Met/Met =
12.1 ± 1.0 (s.e.m.), Met/Val = 13.6 ± 0.7, and Val/Val = 12.9 ± 0.6.
Supplementary Figure Legends
Figure S1. Reaction time (RT) data across all 50 trials of each task condition for the
different subsets of drug and subject (left column) paired with the corresponding model
predictions (right column). From top to bottom, these panels demonstrate placebo and
tolcapone RT data, followed by RT data for subjects with the Met/Met, Met/Val, and
Val/Val genotypes, respectively, at the COMT Val158Met polymorphism. The number
of subjects contributing to each graph is indicated to the right of the subset label.
Figure S2. The correlation between the change in reaction time from one trial to the next
(“RT swing”) and the difference in the standard deviations of the fast and slow belief
distributions (“Relative uncertainty”) for all sessions with a positive explore parameter
across subjects, independent of drug condition. Both RT swing and relative uncertainty
values have been converted to Z scores to facilitate comparison. As predicted, greater
relative uncertainty on a given trial correlates with greater changes in RT from that trial
to the next (mean r across participants = 0.33, p < 0.00001 for the t-test evaluating
whether these coefficients are different from zero across participants). This positive
correlation provides important support for the idea that relative uncertainty drives
exploration of the reward space.
Figure S3. A. Shown are the individual changes in the explore parameter (epsilon), in
units of milliseconds per unit standard deviation of the belief distributions (ms/σ), from
placebo (abscissa) to tolcapone (ordinate) for subjects of each genotype (red: Met/Met;
blue: Met/Val; black: Val/Val). B. The same data shown in A are re-plotted, this time
segregated by locus of control score for subjects with a more external (red: LOC ≤ 13) or
more internal (blue: LOC > 13) score. In both plots, the preponderance of the red
symbols lies above the dashed equality line. The panels also demonstrate the distribution
of the explore parameter across subjects.
