Appendix A
We adapted the win-stay-lose-shift (WSLS) model from Worthy & Maddox (2014). $P(\text{stay}\mid\text{win})$ and $P(\text{shift}\mid\text{loss})$ are defined as:

$$P(\text{stay}\mid\text{win}) = P\big(a_{t+1} = a \mid \text{choice}_t = a \text{ and } r(t) \geq r(t-1)\big), \qquad (1)$$

$$P(\text{shift}\mid\text{loss}) = P\big(a_{t+1} \neq a \mid \text{choice}_t = a \text{ and } r(t) < r(t-1)\big). \qquad (2)$$

On every trial, these probabilities are updated:

$$P(\text{stay}\mid\text{win})_{t+1} = P(\text{stay}\mid\text{win})_{t} + lr_{\text{stay}\mid\text{win}} \times \big(P(\text{stay}\mid\text{win})_{\text{final}} - P(\text{stay}\mid\text{win})_{t}\big), \qquad (3)$$

$$P(\text{shift}\mid\text{loss})_{t+1} = P(\text{shift}\mid\text{loss})_{t} + lr_{\text{shift}\mid\text{loss}} \times \big(P(\text{shift}\mid\text{loss})_{\text{final}} - P(\text{shift}\mid\text{loss})_{t}\big). \qquad (4)$$

$P(\text{stay}\mid\text{win})_{\text{final}}$, $lr_{\text{stay}\mid\text{win}}$, $P(\text{shift}\mid\text{loss})_{\text{final}}$, and $lr_{\text{shift}\mid\text{loss}}$ are free parameters.
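For concreteness, here is a minimal Python sketch of the trial-by-trial updates in equations (3) and (4). The variable names and the numerical values are illustrative placeholders of our own, not the fitted parameters reported in Table S1.

```python
# Minimal sketch of the WSLS probability updates (equations 3-4).
# Parameter names and values are illustrative placeholders.

def update_wsls_probs(p_stay_win, p_shift_loss, params):
    """Move each WSLS probability one step toward its asymptotic (final) value."""
    p_stay_win += params["lr_stay_win"] * (params["p_stay_win_final"] - p_stay_win)
    p_shift_loss += params["lr_shift_loss"] * (params["p_shift_loss_final"] - p_shift_loss)
    return p_stay_win, p_shift_loss

params = {
    "lr_stay_win": 0.5,         # learning rate for P(stay|win)
    "lr_shift_loss": 0.5,       # learning rate for P(shift|loss)
    "p_stay_win_final": 0.8,    # asymptotic P(stay|win)
    "p_shift_loss_final": 0.7,  # asymptotic P(shift|loss)
}
p_stay_win, p_shift_loss = 0.5, 0.5  # initial values (also free parameters)
p_stay_win, p_shift_loss = update_wsls_probs(p_stay_win, p_shift_loss, params)
```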
At the same time, the Q value of each state-action pair is updated with reinforcement learning:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big(r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\big). \qquad (5)$$
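As a sketch of how equation (5) could be implemented, the snippet below stores Q as a dictionary keyed by (state, action) pairs; the data structure and the assumption of four actions (indexed 0-3, as in the softmax sum further below) are our own choices.

```python
# Standard Q-learning update (equation 5); the Q container is our own choice.
from collections import defaultdict

Q = defaultdict(float)   # Q[(state, action)] -> value, initialized to 0
N_ACTIONS = 4            # assumes actions indexed 0..3, as in the softmax sum

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference step toward r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[(s_next, a_prime)] for a_prime in range(N_ACTIONS))
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```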
The win-stay-lose-shift and reinforcement-learning assumptions are combined as follows:

if $r(t) \geq r(t-1)$,
$$Q(s_t, a_t) \leftarrow K_{\text{wsls}} \times P(\text{stay}\mid\text{win})_{t} + (1 - K_{\text{wsls}}) \times Q(s_t, a_t); \qquad (6)$$

if $r(t) < r(t-1)$,
$$Q(s_t, a_t) \leftarrow K_{\text{wsls}} \times P(\text{shift}\mid\text{loss})_{t} + (1 - K_{\text{wsls}}) \times Q(s_t, a_t). \qquad (7)$$
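A hedged sketch of how equations (6) and (7) might be applied on a given trial: the relevant WSLS probability is mixed into the chosen action's Q value with weight K_wsls. Function and variable names are ours.

```python
# Blending the WSLS probabilities into the chosen action's Q value (equations 6-7).
def combine_wsls_q(Q, s, a, r_t, r_prev, p_stay_win, p_shift_loss, k_wsls):
    """Blend the relevant WSLS probability with the RL value of the chosen action."""
    wsls_term = p_stay_win if r_t >= r_prev else p_shift_loss
    Q[(s, a)] = k_wsls * wsls_term + (1.0 - k_wsls) * Q[(s, a)]
```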
Finally, we also assume softmax action selection:

$$p(s_t, a_t) = \frac{\exp\big(\beta \cdot Q(s_t, a_t)\big)}{\sum_{a'=0}^{3} \exp\big(\beta \cdot Q(s_t, a')\big)}. \qquad (8)$$
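A sketch of the softmax choice rule in equation (8), assuming four actions indexed 0-3 as in the sum; beta corresponds to the exploration (inverse-temperature) parameter.

```python
# Softmax action selection (equation 8); subtracting the max is for numerical stability.
import math

def softmax_policy(Q, s, beta, n_actions=4):
    """Return choice probabilities over the actions available in state s."""
    logits = [beta * Q[(s, a)] for a in range(n_actions)]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]
```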
This model has 10 free parameters and yielded worse BIC scores than our FPEQ model (see Table 1).
Table S1. WSLS Model Parameters

WSLS-Q Parameter     Young Mean (SD)    Elderly Mean (SD)
lr                   0.08 (0.06)        0.04 (0.05)
PStayWin_initial     0.50 (0.02)        0.50 (0.01)
PStayWin_final       0.41 (0.4)         0.50 (0.39)
PShiftLoss_initial   0.46 (0.47)        0.37 (0.48)
PShiftLoss_final     0.68 (0.33)        0.61 (0.35)
lr_StayWin           0.64 (0.42)        0.76 (0.35)
lr_ShiftLoss         0.74 (0.30)        0.81 (0.25)
Decay                0.65 (0.46)        0.56 (0.49)
Kwsls                0.11 (0.08)        0.12 (0.08)
Exploration          13.21 (5.76)       12.12 (6.35)
The model-based Q model
We also implemented a model-based Q learner, which uses experience with state transitions to estimate the probability $T(s, a, s')$ of transitioning from state $s$ to state $s'$ after taking action $a$:
$$T(s, a, s') \leftarrow T(s, a, s') + \alpha_1\big(\delta_{s,s'} - T(s, a, s')\big). \qquad (9)$$

$\delta_{s,s'} \in \{0, 1\}$ is a binary indicator: $\delta_{s,s'} = 1$ for the observed transition and $\delta_{s,s'} = 0$ for all states that were not reached.
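A sketch of how the transition estimate in equation (9) could be maintained for a small discrete state space; storing T as nested lists and the names alpha_1 and s_observed are our own assumptions.

```python
# Delta-rule update of the transition model (equation 9).
# T[s][a][s_next] holds the estimated probability of reaching s_next from s via a.
def update_transition_model(T, s, a, s_observed, alpha_1, n_states):
    """Move T(s, a, .) toward a one-hot vector for the observed successor state."""
    for s_next in range(n_states):
        delta = 1.0 if s_next == s_observed else 0.0  # binary indicator delta_{s,s'}
        T[s][a][s_next] += alpha_1 * (delta - T[s][a][s_next])
```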
The reward at each state is estimated as:
$$R(s, a) \leftarrow R(s, a) + \alpha_2\big(r - R(s, a)\big). \qquad (10)$$

Then, the value function can be calculated with the transition and reward functions:

$$Q(s, a) \leftarrow R(s, a) + \gamma \sum_{s'} T(s, a, s') \max_{a'} Q(s', a'). \qquad (11)$$
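A sketch of one sweep of the value computation in equation (11), reusing the hypothetical nested-list containers T (transition model) and R (reward estimates from equation 10) and the Q dictionary keyed by (state, action) used in the earlier snippets; gamma is the discount factor.

```python
# One sweep of the model-based value update (equation 11), combining the learned
# transition model T and reward estimates R; container layouts are our own choice.
def update_values(Q, T, R, gamma, n_states, n_actions):
    for s in range(n_states):
        for a in range(n_actions):
            future = sum(
                T[s][a][s_next] * max(Q[(s_next, a_next)] for a_next in range(n_actions))
                for s_next in range(n_states)
            )
            Q[(s, a)] = R[s][a] + gamma * future
```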
Having learned the value function $Q(s, a)$, an action can be selected at state $s$ according to the values of the actions available in that state. Here we use a softmax distribution:
$$p(s, a) = \frac{\exp\big(\beta \cdot Q(s, a)\big)}{\sum_{a'=0}^{3} \exp\big(\beta \cdot Q(s, a')\big)}. \qquad (12)$$
There are four free parameters in this model: the learning rate for estimating the transition model, $\alpha_1$; the learning rate for estimating the reward function, $\alpha_2$; the discount factor in the value-function update, $\gamma$; and the exploration parameter $\beta$. The model-based Q model yielded worse BIC scores than our FPEQ model (see Table 1).
Table S2. Model-based Q Parameters

Model-Based-Q Parameter   Young Mean (SD)   Elderly Mean (SD)
lr_R                      0.25 (0.27)       0.09 (0.07)
lr_S                      0.32 (0.34)       0.33 (0.28)
decay                     0.85 (0.26)       0.76 (0.37)
exploration               7.07 (6.22)       3.22 (3.69)
Table S3. Correlations between measures of intelligence and FPEQ parameters in the elderly group

       TD Learn Rate   FPE+   FPE-   Decay   Exploitation   QSA    Abs.TD
LPS3   -.23            .02    -.13   0       .05            -.24   .35
LPS4   -.01            -.08   .22    0       -.35           .17    .001

All p's > .2
Appendix B
FPEQ generative model.
To examine the effect of the empirically observed model parameters on choice behavior, we simulated the FPEQ model with the parameters fixed to either the young or the elderly group's best-fitting parameters. The simulation produces a sequence of states in exactly the same format as the participants' data.
The figure below plots the fraction of trials on which each path was chosen, for the observed (left) and simulated (right) data in the young (top) and elderly (bottom) age groups. Paths 1-4 end in states 4-7, respectively, so that path 1 is the most lucrative path and path 4 the least lucrative. The young group increasingly chose the most lucrative path 1 (blue line), whereas the elderly group increasingly chose the least lucrative path 4 (purple line). Model simulations yoked to the young and elderly groups' best-fitting parameters, respectively, reproduced these preferences for paths 1 and 4.
Figure S1.