Supplementary Methods

Procedure

All tasks were presented using E-Prime 2.0 (Psychology Software Tools) on a laptop running the Windows XP operating system. For XG, testing was done over four 2-hour sessions at his residence. Each task was administered twice, in two different testing sessions, to ensure that behavior was replicable across days. Like XG, elderly controls completed two iterations of each task over four 2-hour sessions; their testing was done at the University of Pennsylvania. In all cases, new stimuli were used when a task was repeated to prevent carryover learning effects.

In all but one task (the Probabilistic Selection Task), all participants (including XG) made incentive-compatible decisions and could earn additional performance-based compensation. At the end of each testing session, compensation ($15/hour) was provided for participation and for performance in one randomly chosen task. Only one task was chosen, rather than paying out all tasks, so that participants had an incentive to do well on every task (e.g., doing well on the first few tasks alone did not guarantee a good payment) and so that the compensation in a single task was at a meaningful level (e.g., each trial was worth $0.05 rather than less than a penny). Participants were fully informed of these compensation contingencies before each testing session.

Tasks

Weather prediction task

The Weather Prediction Task is a probabilistic classification task often used to test implicit category learning. Because patients with Parkinson's disease are impaired on this task, it has been argued that the striatum is necessary for this kind of learning (Knowlton et al., 1996; Shohamy et al., 2004). This task aims to evaluate XG's ability to learn and combine stimulus values in a probabilistic learning environment without reversals.

This task requires participants to predict "rain" or "shine" based on combinations of one to three cues, which look like playing cards (Supplementary Figure 5). We used a modified version of this task (Shohamy et al., 2004) to prevent any possibility of a floor effect in performance. In this modified version, there are only four cues in total. Each cue is associated with a fixed conditional probability of "rain", and the probability of "rain" for a given combination of cues is the product of the individual cues' conditional probabilities. Both outcomes are equally likely across the 200 testing trials. All possible combinations of 1, 2, or 3 cues (14 combinations in total) are pseudorandomized so that the same pattern does not appear consecutively. Two lists of 200 pseudorandomized trials are used, and the order in which the lists are used is counterbalanced across participants for the two testing sessions.

On each trial, participants are shown a combination of 1, 2, or 3 cues in the center of the screen. Based on the combination of cues, participants decide between "rain" or "shine" by pressing a button on the keyboard. After each response, auditory feedback for correct (high-pitched ring) or incorrect (low-pitched buzzer) choices, as well as visual feedback (sun or rain cloud above the cue combination), is provided. If participants fail to respond within 2.5 seconds, "Answer Now" appears above the cue combination to indicate that a response is needed. If participants fail to respond within five seconds, the low-pitched buzzer sounds and the trial is terminated. A score bar, initialized at 200 points, is displayed at the bottom of the screen to provide feedback on overall performance. Participants gain one point for each correct response and lose one point for each incorrect response. Each point has a monetary value of $0.02. Performance was measured as the percentage of trials on which participants chose the more likely answer, that is, whether their response ("rain" or "shine") was the more likely outcome given the cue combination.
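For concreteness, the sketch below (Python) illustrates how the outcome probability for a cue combination and the "more likely answer" performance measure described above can be computed. It is a minimal sketch only: the per-cue probabilities are placeholders (the actual values are those shown in Supplementary Figure 5), the function names are ours, and the combination rule simply follows the description above (product of the individual cues' conditional probabilities).

```python
import random

# Hypothetical per-cue conditional probabilities of "rain" (placeholders; the
# actual values used in the task are those shown in Supplementary Figure 5).
CUE_P_RAIN = {"C1": 0.9, "C2": 0.7, "C3": 0.4, "C4": 0.2}

def p_rain(combo):
    """Rain probability for a cue combination, taken (as described above) as
    the product of the individual cues' conditional probabilities."""
    p = 1.0
    for cue in combo:
        p *= CUE_P_RAIN[cue]
    return p

def more_likely_answer(combo):
    """The 'more likely' response used to score performance."""
    return "rain" if p_rain(combo) >= 0.5 else "shine"

def play_trial(combo, response, score, point_value=0.02):
    """Draw the weather, apply the +/-1 point feedback, and return the updated
    score bar and its monetary value ($0.02 per point)."""
    outcome = "rain" if random.random() < p_rain(combo) else "shine"
    correct = response == outcome
    score += 1 if correct else -1
    return correct, score, round(score * point_value, 2)

print(more_likely_answer(("C1", "C3")))
print(play_trial(("C1", "C3"), "rain", score=200))
```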
Probabilistic selection task

The Probabilistic Selection Task has been used to study the mechanisms that govern learning from positive and negative feedback (Frank et al., 2004). With this task, it has been shown that Parkinson's patients on dopaminergic medication (such as levodopa) exhibit enhanced learning from positive feedback and reduced learning from negative feedback (Frank et al., 2004). Conversely, patients off medication exhibit reduced learning from positive feedback and enhanced learning from negative feedback. A computational model of learning was developed around the idea that dopamine bursts and dips in the striatum represent positive and negative reward prediction errors, respectively (Cohen and Frank, 2009). This task aims to evaluate XG's ability to process positive and negative feedback and his ability to learn stimulus values in a probabilistic learning environment without reversals.

This task has two phases: a training phase and a test phase (Supplementary Figure 6). In the training phase, participants become familiar with three different pairs of Hiragana characters (AB, CD, and EF). The stimulus pairs are presented in a random order. On any given trial, only one pair of stimuli is shown, and each character is positioned randomly at one of two fixed locations (left and right of the central divide) on the computer screen. Participants respond by pressing a button to choose either the left- or right-positioned stimulus. Feedback is provided after every trial indicating whether the choice is correct or incorrect. Feedback of "Correct" or "Incorrect" is provided probabilistically: 80/20 for AB, 70/30 for CD, and 60/40 for EF. For example, choosing A leads to positive feedback on 80% of AB trials, while choosing B leads to negative feedback equally often. Conversely, choosing A leads to negative feedback on 20% of AB trials, while choosing B leads to positive feedback equally often.

Typically, participants complete training once criteria are met (at least 65% choices of A, 60% choices of C, and 50% choices of E). Criteria are evaluated after every 60 trials, with a maximum of 480 trials. However, since XG did not reach criteria during training, we retested control participants on a modified version of the task that removed the criteria from the training component. XG was tested twice on the task. Controls were tested once with criteria and twice without criteria. Different sets of Hiragana characters were used to limit possible learning effects across testing sessions. Performance during the training phase is measured as the proportion of trials on which participants choose the richer stimulus from each pair (stimuli A, C, and E).

After finishing the training phase, participants proceed to the test phase of the task, in which they are tested with new stimulus combinations involving A (AC, AD, AE, AF) and B (BC, BD, BE, BF), along with the three familiar combinations (AB, CD, EF). In this part of the task, participants are told to use the knowledge they gained about the individual stimuli from the training phase and/or to use their "gut instincts" in order to choose the best option from each pair of stimuli. There are 66 trials during the test phase (6 trials per stimulus pair). Participants are not given feedback during this phase, to ensure that any choices made are a result of learning from the first part of the experiment. In this phase of the task, performance is measured as the percentage of trials on which participants choose A or avoid B in the novel pairs.
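As an illustration of the probabilistic feedback schedule and the choose-A/avoid-B performance measures described above, here is a minimal sketch (Python). The function names and the example trials are ours, not part of the task software.

```python
import random

# Probability that choosing the richer (first-listed) stimulus of each training
# pair yields "Correct" feedback: 80/20 for AB, 70/30 for CD, 60/40 for EF.
P_RICH_CORRECT = {("A", "B"): 0.8, ("C", "D"): 0.7, ("E", "F"): 0.6}

def training_feedback(pair, choice):
    """Probabilistic 'Correct'/'Incorrect' feedback for one training trial."""
    rich = pair[0]
    p = P_RICH_CORRECT[pair] if choice == rich else 1.0 - P_RICH_CORRECT[pair]
    return "Correct" if random.random() < p else "Incorrect"

def test_phase_performance(trials):
    """Choose-A and avoid-B percentages over novel test pairs.
    `trials` is a list of (pair, choice) tuples, e.g. (("A", "C"), "A")."""
    choose_a = [choice == "A" for pair, choice in trials
                if "A" in pair and "B" not in pair]
    avoid_b = [choice != "B" for pair, choice in trials
               if "B" in pair and "A" not in pair]
    pct = lambda flags: 100.0 * sum(flags) / len(flags) if flags else float("nan")
    return pct(choose_a), pct(avoid_b)

print(training_feedback(("A", "B"), "A"))
print(test_phase_performance([(("A", "C"), "A"), (("B", "D"), "D"), (("A", "F"), "F")]))
```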
Crab game

The Crab Game is a dynamic foraging task that has been used to study reinforcement learning in Parkinson's disease patients on and off levodopa, compared with healthy young adults and elderly controls (Rutledge et al., 2009). This task aims to evaluate XG's ability to engage in reinforcement learning in an environment with changing probabilities in which matching rather than maximizing is favored (Lau and Glimcher, 2008).

In this dynamic foraging task, participants search for crabs in one of two static traps marked by a green and a red buoy (Supplementary Figure 7). Each buoy is attached to a cage that is visible only after a choice has been made. On any given trial, an individual cage can hold a maximum of one crab. Once a cage is baited with a crab, it remains armed until chosen. Buoys are baited on a variable-ratio reinforcement schedule of 6:1 with a total reward probability of 0.3 (i.e., the participant can find a maximum of 12 crabs every 40 trials at the richer buoy). The identity of the richer buoy switches every 40 trials. Such variable-ratio schedules are often used to elicit matching behavior in animals, since matching outperforms maximizing on these tasks (Lau and Glimcher, 2008). Participants practice for 40 trials with no contingency reversals before beginning the dynamic foraging task proper, which lasts 320 trials. Participants are informed that no reversals will occur during the practice block. During the dynamic foraging task, block transitions are not signaled, and participants are informed that these reversals will occur periodically throughout the task. For this task, performance is measured as a participant's probability of choosing the richer option.

Fish game

The Fish Game is a probabilistic reversal-learning task with rewards. In the Fish Game, each option is baited with a fixed probability on each trial, but uncollected rewards are not carried over between trials. Under these conditions, maximizing, rather than matching, optimizes performance. Thus, this task aims to evaluate XG's ability to engage in reinforcement learning in an environment with changing probabilities in which maximizing rather than matching is favored.

The Fish Game is adapted from the Crab Game (Supplementary Figure 8). In the Fish Game, participants can fish at either end of the lake. The options are marked by a red and a green fishing rod placed on the left and right ends of the lake, respectively. Participants practice for 40 trials with no contingency reversals before playing the game for 320 trials. Reward ratios, total reward probabilities, and reversal contingencies are the same as in the Crab Game. Performance is measured as a participant's probability of choosing the richer option.
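The key difference between the Crab Game and the Fish Game is whether an uncollected reward stays available on later trials. The sketch below (Python) illustrates both baiting rules under the 6:1 ratio with a total reward probability of 0.3; the choice policy and function names are illustrative assumptions, not the tasks' actual implementation.

```python
import random

RICH_P = 0.3 * 6 / 7   # per-trial baiting probability of the richer option
POOR_P = 0.3 * 1 / 7   # per-trial baiting probability of the poorer option

def run_block(n_trials, choose, carryover, rich_side=0):
    """Simulate one block of foraging for a choice policy `choose()` -> 0 or 1.
    carryover=True mimics the Crab Game (a baited trap stays armed until it is
    chosen); carryover=False mimics the Fish Game (uncollected rewards do not
    carry over between trials)."""
    armed = [False, False]
    collected = 0
    for _ in range(n_trials):
        for side in (0, 1):
            baited = random.random() < (RICH_P if side == rich_side else POOR_P)
            armed[side] = (armed[side] or baited) if carryover else baited
        side = choose()
        if armed[side]:
            collected += 1
            armed[side] = False
    return collected

always_richer = lambda: 0   # a maximizing policy, for illustration
print("Crab-style block:", run_block(40, always_richer, carryover=True))
print("Fish-style block:", run_block(40, always_richer, carryover=False))
```

With carryover, occasionally sampling the leaner option collects rewards that have accumulated there, which is why matching outperforms strict maximizing in the Crab Game; without carryover, always choosing the richer option does best, as in the Fish Game.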
Bait game

The Bait Game is a variant of the Fish Game in which participants learn to avoid losses rather than collect gains. Thus, this task aims to evaluate XG's ability to engage in probabilistic reversal learning based on punishment instead of reward.

In this task, participants fish in a barren lake (Supplementary Figure 9). Although no fish can be caught, participants are told that a large fish is roaming the lake and can eat the bait off the hook if that rod is chosen. If the bait (an image of a worm) is eaten, a replacement is automatically bought at a cost of $0.05. Participants are given $5.00 prior to testing and are told to minimize their losses during the task. For this task, the punishment ratio was 6:1 with a total punishment probability of 0.3. Participants practice for 40 trials with no contingency reversals before playing the game for 320 trials. Reversal contingencies are the same as in the Fish and Crab Games. Performance is measured as a participant's probability of choosing the richer option.

Stimulus-value learning

This reversal-learning task has been used to differentiate stimulus-value from action-value representation in the ventromedial prefrontal cortex (Glascher et al., 2009). This task aims to cleanly evaluate XG's ability to learn stimulus values in a probabilistic learning environment with reversals.

In this task, participants choose between two distinct fractal stimuli, which are positioned randomly at one of two fixed locations (left and right of a central white dot) on the screen (Supplementary Figure 10). On each trial, participants respond by pressing a button on the keyboard to choose between the two fractals. Positive feedback is provided if the chosen fractal is armed with a reward (a picture of a coin); otherwise, negative feedback is provided (a red X overlaying a coin, indicating that no coin was found). The fractals are probabilistically rewarded, with the richer fractal rewarded 70% of the time and the poorer fractal rewarded 30% of the time. Participants are informed that on any given trial one fractal has a higher likelihood of delivering a reward and that this association reverses periodically throughout the task. Participants practice for 40 trials with no contingency reversals before beginning a full run of 320 trials. Each reward (coin) has a monetary value of $0.05. Performance is measured as a participant's probability of choosing the richer option.

All participants were tested with two different iterations of the task. In the first iteration, reversals were dependent on performance (Glascher et al., 2009): after four consecutive choices of the richer fractal, a contingency reversal occurred with probability 0.25 on every subsequent trial. If the run of consecutive choices of the richer fractal was broken before a reversal occurred, the reversal criterion was reset. Because this criterion makes reversals dependent on correct choices, block lengths vary among participants. In order to have more comparable data between XG and controls for this task, we also tested participants on a modified version of the task with a fixed number of blocks and transition points (8 equal blocks of 40 trials, 7 transition points). New fractals were used for each testing session.
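For clarity, the performance-dependent reversal rule used in the first iteration of the stimulus-value (and, below, the action-value) task can be summarized with the following sketch (Python); it is an illustration of the rule described above, not the task's actual code, and the trial sequence is made up.

```python
import random

def should_reverse(run_length, p_reverse=0.25, criterion=4):
    """Performance-dependent reversal rule: once the richer option has been
    chosen on `criterion` consecutive trials, a reversal occurs with
    probability `p_reverse` on each subsequent trial."""
    return run_length >= criterion and random.random() < p_reverse

# Illustration: track the run of richer-option choices across a few trials.
richer, run_length = "fractal_1", 0
for chose_richer in [True, True, True, True, True, False, True]:
    run_length = run_length + 1 if chose_richer else 0   # criterion resets when the run is broken
    if should_reverse(run_length):
        richer = "fractal_2" if richer == "fractal_1" else "fractal_1"
        run_length = 0
print("Richer option at the end of this stretch:", richer)
```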
Action-value learning

This reversal-learning task has been used to differentiate action-value from stimulus-value representation in the ventromedial prefrontal cortex (Glascher et al., 2009). This task aims to cleanly evaluate XG's ability to learn action values in a probabilistic learning environment with reversals.

In this task, participants decide between two actions on a trackball mouse: either clicking the left mouse button with the index finger or swiping the trackball upward with the thumb (Supplementary Figure 11). The richer action is rewarded 70% of the time and the poorer action is rewarded 30% of the time. As in the stimulus-value learning task, participants are informed that on any given trial one action provides a higher likelihood of receiving a reward and that this association reverses periodically throughout the task. Participants practice for 40 trials with no contingency reversals before beginning a full run of 320 trials. Each reward (coin) has a monetary value of $0.05. Performance is measured as a participant's probability of choosing the richer option. Again, all participants are tested on two iterations of the task with different reversal contingencies: one with choice-dependent transitions and one with fixed transitions. For the action-value and stimulus-value learning tasks, we report in the main manuscript the data from the version of the task with fixed reversal contingencies. We observed the same dissociation in the version of the task with choice-dependent reversal contingencies (Supplementary Figure 12).

Statistical tools

In all tasks, binomial probability tests were used to compare XG's performance against chance. To compare XG's performance against matched controls, we used a modified t-test specifically designed for case studies (Crawford and Howell, 1998):

$$ t = \frac{X_1 - \bar{X}_2}{S_2 \sqrt{(N_2 + 1)/N_2}} $$

In the above formula, X1 is the patient's score, X̄2 is the mean of the normative sample, S2 is the standard deviation of the normative sample, and N2 is the size of the normative sample. The degrees of freedom, ν, are N1 + N2 − 2, which reduces to N2 − 1 because the patient "sample" contains a single case (N1 = 1). We used the t-distribution to estimate what percentage of the normative population would be expected to exhibit a score as extreme as XG's, at an error rate of 5%. This approach does not overestimate the rarity of the patient's score and is more conservative than the typical method of using a z-score. Reported t-tests are all one-sided, based on the a priori assumption that the patient will perform worse than the normative sample.
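For reference, the Crawford and Howell (1998) statistic above can be computed as in the following sketch (Python, using SciPy for the t-distribution). The example scores are made up and are not data from the study.

```python
from math import sqrt
from statistics import mean, stdev
from scipy.stats import t as t_dist

def crawford_howell(patient_score, control_scores):
    """Crawford & Howell (1998) modified t-test comparing a single case with a
    small normative sample. Returns t, degrees of freedom, and the one-sided
    p-value for the a priori direction (patient scoring below controls)."""
    n2 = len(control_scores)
    t = (patient_score - mean(control_scores)) / (stdev(control_scores) * sqrt((n2 + 1) / n2))
    df = n2 - 1
    return t, df, t_dist.cdf(t, df)

# Illustrative numbers only; these are not data from the study.
print(crawford_howell(0.55, [0.80, 0.78, 0.83, 0.76, 0.81]))
```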
Choice data from the action-value and stimulus-value learning tasks were fit with a linear regression in MATLAB to estimate the influence of past rewards and past choices on current choices. This model includes terms for the past five rewards (lags t−1 through t−5) and the most recent choice (c).

Data were also fit with a 3-parameter reinforcement learning model. The inputs to this model are the sequences of choices and outcomes from each participant. The model assumes that, on every trial, choices are a function of the participant's perceived value of each option (V1 and V2). Values are initialized at zero at the beginning of the experiment and are updated with the following rule (where V1(t) is the value of option 1 on trial t):

$$ V_1(t+1) = V_1(t) + \alpha\,\delta(t) \qquad (1) $$
$$ \delta(t) = R_1(t) - V_1(t) \qquad (2) $$

Here, the value of the chosen option is updated by δ(t), the reward prediction error, which is the difference between the reward experienced and the reward expected. R1(t) is a binary vector of experienced reward (1 = reward, 0 otherwise) from choosing option 1 on trial t. On each trial, the reward prediction error is scaled by α, the learning rate. If the learning rate is low (α close to 0), values change very slowly. If the learning rate is high (α close to 1), values are updated very quickly, and recent outcomes have a much greater influence than less recent outcomes. The value of the unchosen option (V2 in this example) is not updated. Given V1 and V2, the model then computes the probability of choosing option 1, P1(t), with the following logistic (logit) choice rule:

$$ P_1(t) = \frac{1}{1 + e^{-z}} \qquad (3) $$
$$ z = \beta\,\big(V_1(t) - V_2(t)\big) + c \qquad (4) $$

In equation 4, the inverse noise parameter β is the regression weight connecting the values of each option to the choices, and c is a constant term that captures a bias toward one or the other option.

Likelihood ratio tests were used to compare the full model (all three parameters α, β, and c) to a reduced model with no learning (only c). The test statistic has a chi-squared distribution with degrees of freedom equal to the difference in the number of parameters between the two models, so that model complexity is accounted for. We also used Bayesian model comparison, computing Bayesian Information Criterion (BIC) scores, which penalize each model for its number of parameters. The model was fit to the data by maximum likelihood using the Optimization Toolbox in MATLAB. This reinforcement learning model was also fit to data from the Crab, Fish, and Bait Games. We could not fit the model to the training phase of the probabilistic selection task because of a coding error that affected which data were saved in that task.
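The following sketch (Python) illustrates the likelihood that is maximized when fitting the 3-parameter model (equations 1-4) and the likelihood-ratio comparison with the no-learning model. The actual fits were performed with MATLAB's Optimization Toolbox, so the optimizer, bounds, and starting values shown here are illustrative assumptions only.

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(params, choices, rewards):
    """Negative log-likelihood of the 3-parameter model (alpha, beta, c).
    `choices` codes the chosen option (0 or 1) and `rewards` the outcome (0/1)."""
    alpha, beta, c = params
    v = np.zeros(2)                           # values initialized at zero
    nll = 0.0
    for choice, reward in zip(choices, rewards):
        z = beta * (v[0] - v[1]) + c          # equation 4
        p_option0 = 1.0 / (1.0 + np.exp(-z))  # equation 3
        p_choice = p_option0 if choice == 0 else 1.0 - p_option0
        nll -= np.log(max(p_choice, 1e-12))
        v[choice] += alpha * (reward - v[choice])   # equations 1-2: update chosen option only
    return nll

def fit_model(choices, rewards):
    """Maximum-likelihood fit plus the likelihood-ratio statistic against the
    reduced, no-learning model that contains only the bias term c."""
    choices, rewards = np.asarray(choices), np.asarray(rewards)
    full = minimize(neg_log_likelihood, x0=[0.3, 3.0, 0.0],
                    args=(choices, rewards),
                    bounds=[(0.0, 1.0), (0.0, 50.0), (-5.0, 5.0)])
    reduced = minimize(lambda c, ch, rw: neg_log_likelihood([0.0, 0.0, c[0]], ch, rw),
                       x0=[0.0], args=(choices, rewards))
    lr_stat = 2.0 * (reduced.fun - full.fun)   # ~ chi-squared, df = 2
    return full.x, lr_stat
```

BIC scores for the two models follow directly from the same negative log-likelihoods (BIC = 2·NLL + k·ln n, for k free parameters and n trials).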
References

Cohen MX, Frank MJ. Neurocomputational models of basal ganglia function in learning, memory and choice. Behav Brain Res 2009; 199: 141-56.

Crawford JR, Howell DC. Comparing an individual's test score against norms derived from small samples. Clin Neuropsychol 1998; 12: 482-6.

Frank MJ, Seeberger LC, O'Reilly RC. By carrot or by stick: cognitive reinforcement learning in Parkinsonism. Science 2004; 306: 1940-3.

Glascher J, Hampton AN, O'Doherty JP. Determining a role for ventromedial prefrontal cortex in encoding action-based value signals during reward-related decision making. Cereb Cortex 2009; 19: 483-95.

Knowlton BJ, Mangels JA, Squire LR. A neostriatal habit learning system in humans. Science 1996; 273: 1399-402.

Lau B, Glimcher PW. Value representations in the primate striatum during matching behavior. Neuron 2008; 58: 451-63.

Rutledge RB, Lazzaro SC, Lau B, Myers CE, Gluck MA, Glimcher PW. Dopaminergic drugs modulate learning rates and perseveration in Parkinson's patients in a dynamic foraging task. J Neurosci 2009; 29: 15104-14.

Shohamy D, Myers CE, Onlaor S, Gluck MA. Role of the basal ganglia in category learning: how do patients with Parkinson's disease learn? Behav Neurosci 2004; 118: 676-86.

Supplementary Tables

Supplementary Table 1: Neuropsychological evaluation

Ability Tested | Test Name | Score
Dementia | MMSE | 28/30
Premorbid IQ | AMNART Revised | 111.0
Cognitive Ability | WAIS-III Information Subscale | 12 (mean = 10, SD = 3)
Visual Acuity | Rosenbaum vision screener | OD 20/25 -2, OS 20/30 -2
Contrast Sensitivity | Cambridge Low Contrast Gratings | OD 28, OS 29 (mean = 28.4, SD = 6.5)
Color Vision | Ishihara Color Vision Deficiency Test | 38/38
Intermediate Vision; Object and Spatial Location Perception | Visual Object and Space Perception Battery |
 | Shape Detection Screening Test | 20/20
 | Test 1, Incomplete Letters | 20/20
 | Test 5, Dot Counting | 10/10
 | Test 6, Position Discrimination | 19/20
 | Test 7, Number Location | 10/10
Visuospatial Memory | Brief Visuospatial Memory Test – Revised | Percentiles:
 | Trial 1 | 54%
 | Trial 2 | 96%
 | Trial 3 | 92%
 | Total Recall | 88%
 | Learning | 90%
 | Delayed Recall | 93%
Executive Function | Wisconsin Card Sort – 64 | Percentiles:
 | Total Errors | 93%
 | Perseverative Responses | 87%
 | Perseverative Errors | 92%
 | Non-perseverative Errors | 66%
 | Conceptual Level Responses | 86%
Executive Function | D-KEFS Tower Test | Percentiles:
 | Total Achievement Score | 50%
 | Total Rule Violation | 46%
 | Mean First-Move Time | 84%
 | Time-Per-Move Ratio | 37%
 | Move Accuracy Ratio | 16%
 | Rule-Violation-Per-Item Ratio | 50%

Supplementary Table 2: Testing schedule

Day 1: Probabilistic Selection¹, Stimulus-Value Learning³, Action-Value Learning³, Fish Game, Crab Game
Day 2: Action-Value Learning³, Stimulus-Value Learning³, Crab Game, Fish Game, Bait Game, Probabilistic Selection²
Day 3: Probabilistic Selection², Stimulus-Value Learning⁴, Action-Value Learning⁴, Bait Game, Weather Prediction
Day 4: Action-Value Learning⁴, Stimulus-Value Learning⁴, Weather Prediction

¹ With criteria. ² Without criteria. ³ Dynamic reversal blocks (performance-dependent reversals). ⁴ Static reversal blocks (defined reversal points).

Supplementary figures

Supplementary Figure 1. Photograph showing dystonic posturing of XG's hands.

Supplementary Figure 2. Median response times for XG and healthy controls across five tasks. XG was significantly slower than controls during the Bait Game (t(10) = 2.15, P = 0.028). Error bars denote standard deviation.

Supplementary Figure 3. XG and healthy controls' performance in the training phase of the Probabilistic Selection Task. (A) Plotted is the probability of choosing the richer option given a specific left-right configuration (e.g., AB means symbol A on the left and B on the right; BA means the reverse). XG performed above chance for only one left-right configuration of each stimulus pair (BA, DC, and FE, respectively) and not for the other (AB, CD, and EF, respectively). XG was significantly impaired at choosing the richer option for the AB configuration (t(10) = -2.830, P = 0.009) and the CD configuration (t(10) = -3.047, P = 0.006). (B) Plotted is the probability of repeating the choice of the richer option conditional on the same ('consistent') left-right configuration (e.g., AB on trial t-1 followed by AB on trial t) or a different ('inconsistent') left-right configuration (e.g., AB on trial t-1 followed by BA on trial t). XG was impaired relative to controls for the inconsistent configuration of AB trials (t(10) = -2.84, P = 0.009) as well as the inconsistent configuration of CD trials (t(10) = -2.923, P = 0.008), and did not repeat the choice of the richer option at greater than chance levels in these configurations. XG was also impaired relative to controls for the consistent configuration of AB trials (t(10) = -1.93, P = 0.041). Error bars denote standard deviation.
Supplementary Figure 4. XG and healthy controls' performance as a function of time across all tasks. In each panel, performance is shown broken down into blocks of 40 trials. In the bottom row, these blocks correspond to reversals of the reward contingencies. In the top row, the blocks correspond only to time, as there are no reversals of the reward contingencies in the Weather Prediction or Probabilistic Selection Tasks. In the tasks that require the learning of stimulus values (Weather Prediction, Probabilistic Selection, Stimulus-Value Learning), XG's performance is impaired from the first block. In the tasks without reversals (top row), XG's performance does not show a clear decline with time, as would be indicative of a problem with the long-term stability of values or with maintaining set during learning. In the tasks with reversals (bottom row), XG's performance also does not show a clear decline with time, as would be indicative of a problem specific to reversal learning as opposed to initial learning.

Supplementary Figure 5. Weather Prediction Task. (A) The task used four individual cues (C1-C4). (B) Each trial presents one of the fourteen different combinations of 1-3 cues from A. (C) Probability table for rain vs. shine for each card combination pattern.

Supplementary Figure 6. Probabilistic Selection Task. In the training phase, three pairs of stimuli are presented and the stimuli are probabilistically rewarded at different rates (reward probabilities are shown in parentheses). In the testing phase, novel pairs of stimuli are presented without feedback. These novel pairs include all possible combinations of the stimulus most associated with positive feedback (A: AC, AD, AE, AF) and the stimulus most associated with negative feedback (B: BC, BD, BE, BF).

Supplementary Figure 7. Sequence of events across trials in the Crab Game.

Supplementary Figure 8. Sequence of events across trials in the Fish Game.

Supplementary Figure 9. Sequence of events across trials in the Bait Game.

Supplementary Figure 10. Sequence of events across trials in the Stimulus-Value Learning Task.

Supplementary Figure 11. Sequence of events across trials in the Action-Value Learning Task.

Supplementary Figure 12. (A, B) XG and healthy controls' performance in the stimulus-value learning (A) and action-value learning (B) tasks in which reversals were dependent on performance. Plotted is the probability of choosing the richer option averaged around all transitions (i.e., reversals of the reward contingencies). The vertical dashed line denotes the transition; the horizontal line denotes chance performance. Data smoothing kernel = 11. As in the versions of these tasks with fixed transitions presented in the main manuscript, XG was impaired at stimulus-value learning but performed well at action-value learning.