Appendix A
We adapted the win-stay-lose-shift (WSLS) model from Worthy & Maddox (2014). $P(\text{stay}\mid\text{win})$ and $P(\text{shift}\mid\text{loss})$ are defined as:

$$P(\text{stay}\mid\text{win}) = P\big(a_{t+1} = a \mid \text{choice}_t = a \text{ and } r(t) \geq r(t-1)\big), \qquad (1)$$

$$P(\text{shift}\mid\text{loss}) = P\big(a_{t+1} \neq a \mid \text{choice}_t = a \text{ and } r(t) < r(t-1)\big). \qquad (2)$$

On every trial, these probabilities are updated:

$$P(\text{stay}\mid\text{win})_{t+1} = P(\text{stay}\mid\text{win})_{t} + lr_{\text{stay}\mid\text{win}} \times \big(P(\text{stay}\mid\text{win})_{\text{final}} - P(\text{stay}\mid\text{win})_{t}\big), \qquad (3)$$

$$P(\text{shift}\mid\text{loss})_{t+1} = P(\text{shift}\mid\text{loss})_{t} + lr_{\text{shift}\mid\text{loss}} \times \big(P(\text{shift}\mid\text{loss})_{\text{final}} - P(\text{shift}\mid\text{loss})_{t}\big). \qquad (4)$$

$P(\text{stay}\mid\text{win})_{\text{final}}$, $lr_{\text{stay}\mid\text{win}}$, $P(\text{shift}\mid\text{loss})_{\text{final}}$, and $lr_{\text{shift}\mid\text{loss}}$ are free parameters.
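For concreteness, here is a minimal Python sketch of the trial-by-trial updates in equations (3) and (4). The variable names and the numerical values are illustrative placeholders of our own, not the fitted parameters reported in Table S1.

```python
# Minimal sketch of the WSLS probability updates (equations 3-4).
# Parameter names and values are illustrative placeholders.

def update_wsls_probs(p_stay_win, p_shift_loss, params):
    """Move each WSLS probability one step toward its asymptotic (final) value."""
    p_stay_win += params["lr_stay_win"] * (params["p_stay_win_final"] - p_stay_win)
    p_shift_loss += params["lr_shift_loss"] * (params["p_shift_loss_final"] - p_shift_loss)
    return p_stay_win, p_shift_loss

params = {
    "lr_stay_win": 0.5,         # learning rate for P(stay|win)
    "lr_shift_loss": 0.5,       # learning rate for P(shift|loss)
    "p_stay_win_final": 0.8,    # asymptotic P(stay|win)
    "p_shift_loss_final": 0.7,  # asymptotic P(shift|loss)
}
p_stay_win, p_shift_loss = 0.5, 0.5  # initial values (also free parameters)
p_stay_win, p_shift_loss = update_wsls_probs(p_stay_win, p_shift_loss, params)
```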
At the same time, the Q value of each state-action pair is updated with reinforcement learning:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big(r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\big). \qquad (5)$$
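As a sketch of how equation (5) could be implemented, the snippet below stores Q as a dictionary keyed by (state, action) pairs; the data structure and the assumption of four actions (indexed 0-3, as in the softmax sum further below) are our own choices.

```python
# Standard Q-learning update (equation 5); the Q container is our own choice.
from collections import defaultdict

Q = defaultdict(float)   # Q[(state, action)] -> value, initialized to 0
N_ACTIONS = 4            # assumes actions indexed 0..3, as in the softmax sum

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One temporal-difference step toward r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[(s_next, a_prime)] for a_prime in range(N_ACTIONS))
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```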
The win-stay-lose-shift and reinforcement-learning assumptions are combined as follows:

if $r(t) \geq r(t-1)$,
$$Q(s_t, a_t) \leftarrow K_{\text{wsls}} \times P(\text{stay}\mid\text{win})_{t} + (1 - K_{\text{wsls}}) \times Q(s_t, a_t); \qquad (6)$$

if $r(t) < r(t-1)$,
$$Q(s_t, a_t) \leftarrow K_{\text{wsls}} \times P(\text{shift}\mid\text{loss})_{t} + (1 - K_{\text{wsls}}) \times Q(s_t, a_t). \qquad (7)$$
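A hedged sketch of how equations (6) and (7) might be applied on a given trial: the relevant WSLS probability is mixed into the chosen action's Q value with weight K_wsls. Function and variable names are ours.

```python
# Blending the WSLS probabilities into the chosen action's Q value (equations 6-7).
def combine_wsls_q(Q, s, a, r_t, r_prev, p_stay_win, p_shift_loss, k_wsls):
    """Blend the relevant WSLS probability with the RL value of the chosen action."""
    wsls_term = p_stay_win if r_t >= r_prev else p_shift_loss
    Q[(s, a)] = k_wsls * wsls_term + (1.0 - k_wsls) * Q[(s, a)]
```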
Finally, we also assume softmax action selection:

$$p(s_t, a_t) = \frac{\exp\big(\beta \cdot Q(s_t, a_t)\big)}{\sum_{a'=0}^{3} \exp\big(\beta \cdot Q(s_t, a')\big)}. \qquad (8)$$
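A sketch of the softmax choice rule in equation (8), assuming four actions indexed 0-3 as in the sum; beta corresponds to the exploration (inverse-temperature) parameter.

```python
# Softmax action selection (equation 8); subtracting the max is for numerical stability.
import math

def softmax_policy(Q, s, beta, n_actions=4):
    """Return choice probabilities over the actions available in state s."""
    logits = [beta * Q[(s, a)] for a in range(n_actions)]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]
```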
This model has 10 free parameters and yielded worse BIC scores than our FPEQ model (see Table 1).
Table S1. WSLS Model Parameters

WSLS-Q Parameter     Young Mean (SD)    Elderly Mean (SD)
lr                   0.08 (0.06)        0.04 (0.05)
PStayWin_initial     0.50 (0.02)        0.50 (0.01)
PStayWin_final       0.41 (0.4)         0.50 (0.39)
PShiftLoss_initial   0.46 (0.47)        0.37 (0.48)
PShiftLoss_final     0.68 (0.33)        0.61 (0.35)
lr_StayWin           0.64 (0.42)        0.76 (0.35)
lr_ShiftLoss         0.74 (0.30)        0.81 (0.25)
Decay                0.65 (0.46)        0.56 (0.49)
Kwsls                0.11 (0.08)        0.12 (0.08)
Exploration          13.21 (5.76)       12.12 (6.35)
The model-based Q model
We also implemented a model-based Q learner, which uses experience with state transitions to estimate the probability $T(s, a, s')$ of transitioning from state $s$ to state $s'$ after taking action $a$:
$$T(s, a, s') \leftarrow T(s, a, s') + \alpha_1\big(\delta_{s,s'} - T(s, a, s')\big). \qquad (9)$$

$\delta_{s,s'} \in \{0, 1\}$ is a binary indicator: $\delta_{s,s'} = 1$ for the observed transition and $\delta_{s,s'} = 0$ for all states that were not reached.
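A sketch of how the transition estimate in equation (9) could be maintained for a small discrete state space; storing T as nested lists and the names alpha_1 and s_observed are our own assumptions.

```python
# Delta-rule update of the transition model (equation 9).
# T[s][a][s_next] holds the estimated probability of reaching s_next from s via a.
def update_transition_model(T, s, a, s_observed, alpha_1, n_states):
    """Move T(s, a, .) toward a one-hot vector for the observed successor state."""
    for s_next in range(n_states):
        delta = 1.0 if s_next == s_observed else 0.0  # binary indicator delta_{s,s'}
        T[s][a][s_next] += alpha_1 * (delta - T[s][a][s_next])
```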
The reward at each state is estimated as:
$$R(s, a) \leftarrow R(s, a) + \alpha_2\big(r - R(s, a)\big). \qquad (10)$$

Then, the value function can be calculated with the transition and reward functions:

$$Q(s, a) \leftarrow R(s, a) + \gamma \sum_{s'} T(s, a, s') \max_{a'} Q(s', a'). \qquad (11)$$
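A sketch of one sweep of the value computation in equation (11), reusing the hypothetical nested-list containers T (transition model) and R (reward estimates from equation 10) and the Q dictionary keyed by (state, action) used in the earlier snippets; gamma is the discount factor.

```python
# One sweep of the model-based value update (equation 11), combining the learned
# transition model T and reward estimates R; container layouts are our own choice.
def update_values(Q, T, R, gamma, n_states, n_actions):
    for s in range(n_states):
        for a in range(n_actions):
            future = sum(
                T[s][a][s_next] * max(Q[(s_next, a_next)] for a_next in range(n_actions))
                for s_next in range(n_states)
            )
            Q[(s, a)] = R[s][a] + gamma * future
```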
Having learned the value function $Q(s, a)$, an action can be selected at state $s$ according to the values of the actions available in that state. Here we use a softmax distribution:
$$p(s, a) = \frac{\exp\big(\beta \cdot Q(s, a)\big)}{\sum_{a'=0}^{3} \exp\big(\beta \cdot Q(s, a')\big)}. \qquad (12)$$
There are four free parameters in this model: the learning rate for estimating the transition model, $\alpha_1$; the learning rate for estimating the reward function, $\alpha_2$; the discount factor in the value-function update, $\gamma$; and the exploration parameter $\beta$. The model-based Q model yielded worse BIC scores than our FPEQ model (see Table 1).
Table S2. Model-based Q Parameters

Model-Based-Q Parameter   Young Mean (SD)   Elderly Mean (SD)
lr_R                      0.25 (0.27)       0.09 (0.07)
lr_S                      0.32 (0.34)       0.33 (0.28)
decay                     0.85 (0.26)       0.76 (0.37)
exploration               7.07 (6.22)       3.22 (3.69)
Table S3. Correlations between measures of intelligence and FPEQ parameters in the elderly group

       TD Learn Rate   FPE+   FPE-   Decay   Exploitation   QSA    Abs.TD
LPS3   -.23            .02    -.13   0       .05            -.24   .35
LPS4   -.01            -.08   .22    0       -.35           .17    .001

All p's > .2
Appendix B
FPEQ generative model.
To examine the effect of the empirically observed model parameters on choice behavior, we simulated the FPEQ model with the parameters fixed to either the young or the elderly group's best-fitting parameters. The simulation produces a sequence of states in exactly the same format as the participants' data.
The figure below plots the fraction of trials on which each path was chosen, for the observed (left) and simulated (right) data in the young (top) and elderly (bottom) age groups. Paths 1-4 end in states 4-7, respectively, so that path 1 is the most lucrative path and path 4 the least lucrative. The young group increasingly chose the most lucrative path 1 (blue line), whereas the elderly group increasingly chose the least lucrative path 4 (purple line). Model simulations yoked to the young and elderly groups' best-fitting parameters, respectively, reproduced these preferences for paths 1 and 4.
Figure S1.