S2. Matching strategy in state-dependent choice behaviors
Stationary condition of the matching strategy
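The stationary condition given as Eq. 7 in the main text was derived for the purpose of
avoiding the difficulty in calculating $\partial P(s)/\partial w_j$ in Markov decision
processes [29]. We provide another derivation suitable for the present framework. To
transform $\partial \langle r \rangle/\partial w_j$, we use the recursive relation
$P(s') = \sum_{a,s} P(s'|a,s)\, p_{as} P(s)$ for the long-term state distribution, where
$P(s'|a,s) \equiv P(s_{t+1}=s' \mid a_t=a, s_t=s)$. By setting
$\partial \langle r|a,s \rangle/\partial w_j = 0$ and $\partial P(s'|a,s)/\partial w_j = 0$
according to the matching strategy, we obtain
\[
\begin{aligned}
\frac{\partial \langle r \rangle}{\partial w_j}
&= \frac{\partial}{\partial w_j} \sum_{s,a} \langle r|a,s \rangle\, p_{as} P(s)
 = \sum_{s,a} \langle r|a,s \rangle \frac{\partial p_{as}}{\partial w_j} P(s)
 + \sum_{s,a} \langle r|a,s \rangle\, p_{as} \frac{\partial P(s)}{\partial w_j} \\
&= \sum_{s,a} \langle r|a,s \rangle \frac{\partial p_{as}}{\partial w_j} P(s)
 + \sum_{s,a} \langle r|a,s \rangle\, p_{as} \frac{\partial}{\partial w_j}
   \sum_{s',a'} P(s|a',s')\, p_{a's'} P(s') \\
&= \sum_{s,a} \langle r|a,s \rangle \frac{\partial p_{as}}{\partial w_j} P(s)
 + \sum_{s',a'} \sum_{s,a} \langle r|a,s \rangle\, p_{as} P(s|a',s')
   \frac{\partial \left( p_{a's'} P(s') \right)}{\partial w_j} \\
&= \sum_{s,a} \langle r|a,s \rangle \frac{\partial p_{as}}{\partial w_j} P(s)
 + \sum_{s',a'} \langle r_{t+1}|a',s' \rangle
   \frac{\partial \left( p_{a's'} P(s') \right)}{\partial w_j} \\
&= \sum_{s,a} \langle r|a,s \rangle \frac{\partial p_{as}}{\partial w_j} P(s)
 + \sum_{s',a'} \langle r_{t+1}|a',s' \rangle \frac{\partial p_{a's'}}{\partial w_j} P(s')
 + \sum_{s',a'} \langle r_{t+1}|a',s' \rangle\, p_{a's'} \frac{\partial P(s')}{\partial w_j} \\
&= \sum_{s,a} \left( \langle r|a,s \rangle + \langle r_{t+1}|a,s \rangle \right)
   \frac{\partial p_{as}}{\partial w_j} P(s)
 + \sum_{s,a} \langle r_{t+1}|a,s \rangle\, p_{as} \frac{\partial P(s)}{\partial w_j},
\end{aligned}
\]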
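where we used the simplified expression
$\langle r_{t+\tau}|a',s' \rangle \equiv \langle r_{t+\tau} \mid a_t=a', s_t=s' \rangle$ and the
relation
$\langle r_{t+\tau+1}|a',s' \rangle = \sum_{a,s} \langle r_{t+\tau}|a,s \rangle\, p_{as} P(s|a',s')$.
Transforming $\partial \langle r \rangle/\partial w_j$ repeatedly in the same way, we obtain
\[
\begin{aligned}
\frac{\partial \langle r \rangle}{\partial w_j}
&= \sum_{s,a} \sum_{\tau=0}^{\tau'} \langle r_{t+\tau}|a,s \rangle
   \frac{\partial p_{as}}{\partial w_j} P(s)
 + \sum_{s,a} \langle r_{t+\tau'}|a,s \rangle\, p_{as} \frac{\partial P(s)}{\partial w_j} \\
&= \sum_{s,a} \sum_{\tau=0}^{\tau'} \left( \langle r_{t+\tau}|a,s \rangle - \langle r \rangle \right)
   \frac{\partial p_{as}}{\partial w_j} P(s)
 + \sum_{s,a} \langle r_{t+\tau'}|a,s \rangle\, p_{as} \frac{\partial P(s)}{\partial w_j},
\end{aligned}
\]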
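where we used the relation $\langle r \rangle \sum_a \partial p_{as}/\partial w_j = 0$ derived
from the probability conservation. If the infinite sum
$\sum_{\tau=0}^{\infty} \left( \langle r_{t+\tau}|a,s \rangle - \langle r \rangle \right)$
converges, then $\lim_{\tau \to \infty} \langle r_{t+\tau}|a,s \rangle = \langle r \rangle$ and we
can simplify $\partial \langle r \rangle/\partial w_j$ by taking the limit $\tau' \to \infty$.
However, the infinite sum may oscillate. In order to avoid oscillations in the individual
terms, we take the average over $\tau' = 0, 1, \ldots$ to obtain
\[
\begin{aligned}
\frac{\partial \langle r \rangle}{\partial w_j}
&= \lim_{T \to \infty} \frac{1}{T} \sum_{\tau'=0}^{T-1} \left[
   \sum_{s,a} \sum_{\tau=0}^{\tau'} \left( \langle r_{t+\tau}|a,s \rangle - \langle r \rangle \right)
   \frac{\partial p_{as}}{\partial w_j} P(s)
 + \sum_{s,a} \langle r_{t+\tau'}|a,s \rangle\, p_{as} \frac{\partial P(s)}{\partial w_j} \right] \\
&= \sum_{s,a} \lim_{T \to \infty} \frac{1}{T} \sum_{\tau'=0}^{T-1} \sum_{\tau=0}^{\tau'}
   \left( \langle r_{t+\tau}|a,s \rangle - \langle r \rangle \right)
   \frac{\partial p_{as}}{\partial w_j} P(s)
 + \sum_{s,a} \lim_{T \to \infty} \frac{1}{T} \sum_{\tau'=0}^{T-1}
   \langle r_{t+\tau'}|a,s \rangle\, p_{as} \frac{\partial P(s)}{\partial w_j}.
\end{aligned}
\]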
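The effect of averaging over $\tau'$ can be illustrated with a minimal sketch. It assumes a hypothetical two-state cycle, not taken from the main text, in which the expected reward alternates between 1 and 0 so that $\langle r \rangle = 1/2$; the function name `cesaro_of_partial_sums` is likewise only illustrative. The partial sums over $\tau$ keep oscillating, whereas their average over $\tau'$ settles.

```python
def cesaro_of_partial_sums(deviations):
    """Return the partial sums of `deviations` and the average of those sums.

    deviations[tau] plays the role of <r_{t+tau}|a,s> - <r>.
    """
    partial_sums = []
    running = 0.0
    for d in deviations:
        running += d
        partial_sums.append(running)
    return partial_sums, sum(partial_sums) / len(partial_sums)


# Hypothetical periodic example: expected reward 1, 0, 1, 0, ... so <r> = 0.5.
T = 1000
mean_reward = 0.5
expected_rewards = [1.0 if tau % 2 == 0 else 0.0 for tau in range(T)]
deviations = [r - mean_reward for r in expected_rewards]

partial_sums, q = cesaro_of_partial_sums(deviations)
print(partial_sums[:6])  # the partial sums do not converge
print(q)                 # their average over tau' does
```

For this toy sequence the partial sums alternate between 0.5 and 0, while their average over $\tau'$ equals 0.25 for any even $T$.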
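Noting that
$\lim_{T \to \infty} \frac{1}{T} \sum_{\tau=0}^{T-1} \langle r_{t+\tau}|a,s \rangle
= \left\langle \lim_{T \to \infty} \frac{1}{T} \sum_{\tau=0}^{T-1} r_{t+\tau} \,\middle|\, a,s \right\rangle
= \langle \langle r \rangle \mid a,s \rangle = \langle r \rangle$, we obtain
\[
\begin{aligned}
\frac{\partial \langle r \rangle}{\partial w_j}
&= \sum_{s,a} \lim_{T \to \infty} \frac{1}{T} \sum_{\tau'=0}^{T-1} \sum_{\tau=0}^{\tau'}
   \left( \langle r_{t+\tau}|a,s \rangle - \langle r \rangle \right)
   \frac{\partial p_{as}}{\partial w_j} P(s)
 + \langle r \rangle \sum_{s,a} p_{as} \frac{\partial P(s)}{\partial w_j} \\
&= \sum_{s,a} \lim_{T \to \infty} \frac{1}{T} \sum_{\tau'=0}^{T-1} \sum_{\tau=0}^{\tau'}
   \left( \langle r_{t+\tau}|a,s \rangle - \langle r \rangle \right)
   \frac{\partial p_{as}}{\partial w_j} P(s)
 + \langle r \rangle \sum_{s} \frac{\partial P(s)}{\partial w_j} \\
&= \sum_{s,a} \lim_{T \to \infty} \frac{1}{T} \sum_{\tau'=0}^{T-1} \sum_{\tau=0}^{\tau'}
   \left( \langle r_{t+\tau}|a,s \rangle - \langle r \rangle \right)
   \frac{\partial p_{as}}{\partial w_j} P(s),
\end{aligned}
\]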
which coincides with the right hand side of Eq. 7 in the main text.
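As a numerical sanity check, the expression above can be compared with a finite-difference derivative of $\langle r \rangle$. The sketch below is not part of the original derivation. It assumes a hypothetical two-state, two-action problem in which $\langle r|a,s \rangle$ and $P(s'|a,s)$ are fixed tables (so the two assumptions of the matching strategy hold exactly), a logistic choice rule with one parameter per state, and a finite $T$ in place of the limit. All names and numerical values are illustrative.

```python
import math

# Hypothetical toy problem (not from the source): 2 states, 2 actions.
TRANS = [[[0.9, 0.1], [0.2, 0.8]],   # TRANS[s][a][s2] = P(s2 | a, s)
         [[0.7, 0.3], [0.1, 0.9]]]
REWARD = [[1.0, 0.0], [0.0, 2.0]]    # REWARD[s][a] = <r | a, s>


def policy(w):
    """p[s][a]: logistic choice probabilities, one parameter w[s] per state."""
    p = []
    for ws in w:
        q = 1.0 / (1.0 + math.exp(-ws))
        p.append([q, 1.0 - q])
    return p


def state_chain(p):
    """M[s][s2] = sum_a p_as P(s2 | a, s)."""
    return [[sum(p[s][a] * TRANS[s][a][s2] for a in range(2)) for s2 in range(2)]
            for s in range(2)]


def stationary(p):
    """Closed-form long-term state distribution of a two-state chain."""
    m = state_chain(p)
    z = m[0][1] + m[1][0]
    return [m[1][0] / z, m[0][1] / z]


def average_reward(w):
    p = policy(w)
    dist = stationary(p)
    return sum(dist[s] * p[s][a] * REWARD[s][a] for s in range(2) for a in range(2))


def q_values(w, T=5000):
    """Q[s][a]: average over tau' of sum_{tau<=tau'} (<r_{t+tau}|a,s> - <r>)."""
    p = policy(w)
    m = state_chain(p)
    rbar = [sum(p[s][a] * REWARD[s][a] for a in range(2)) for s in range(2)]
    r_avg = average_reward(w)
    q = [[0.0, 0.0], [0.0, 0.0]]
    for s in range(2):
        for a in range(2):
            dist = TRANS[s][a][:]            # state distribution at time t+1
            partial = REWARD[s][a] - r_avg   # tau = 0 term
            total = partial
            for _ in range(1, T):
                partial += sum(dist[x] * rbar[x] for x in range(2)) - r_avg
                total += partial
                dist = [sum(dist[x] * m[x][y] for x in range(2)) for y in range(2)]
            q[s][a] = total / T
    return q


def matching_gradient(w):
    """sum_{s,a} Q_as (dp_as/dw_j) P(s) for j = 0, 1."""
    p = policy(w)
    dist = stationary(p)
    q = q_values(w)
    grad = []
    for j in range(2):
        slope = p[j][0] * p[j][1]  # dp_{0j}/dw_j = -dp_{1j}/dw_j
        grad.append(dist[j] * slope * (q[j][0] - q[j][1]))
    return grad


def finite_difference(w, h=1e-5):
    """Central-difference estimate of d<r>/dw_j."""
    grad = []
    for j in range(2):
        up = list(w)
        up[j] += h
        down = list(w)
        down[j] -= h
        grad.append((average_reward(up) - average_reward(down)) / (2 * h))
    return grad


w = [0.3, -0.5]
print(matching_gradient(w))
print(finite_difference(w))
```

Under these assumptions the two printed vectors should agree up to the truncation error of the finite $T$, which shrinks roughly as $1/T$ for a chain that mixes quickly.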
Extension of the matching law
Using vectors $d\mathbf{p}_s \equiv (dp_{1s}, dp_{2s}, \ldots, dp_{ns})$ and
$\mathbf{Q}_s \equiv (Q_{1s}, Q_{2s}, \ldots, Q_{ns})$, where
\[
Q_{as} \equiv \lim_{T \to \infty} \frac{1}{T} \sum_{\tau'=0}^{T-1} \sum_{\tau=0}^{\tau'}
\left( \langle r_{t+\tau}|a,s \rangle - \langle r \rangle \right),
\]
the stationary condition for the matching strategy (Eq. 7 in the text) is written as a
vector form: $\sum_s P(s)\, \mathbf{Q}_s \cdot d\mathbf{p}_s = 0$. If the stationary choice
probability $p_{as} \neq 0$ for $\forall a$ and $\forall s$, changes in the probabilities
$\{d\mathbf{p}_s\}$ are allowed to be in an arbitrary direction that satisfies
$\mathbf{1} \cdot d\mathbf{p}_s = 0$ for $\forall s$ because of the assumption for the
functions $\{p_{as}(\mathbf{w})\}$ (Eq. 9 in the main text). Therefore, the vector
$\mathbf{Q}_s$ should be parallel to the vector $\mathbf{1}$ for $\forall s$. This implies
$Q_{1s} = Q_{2s} = \cdots = Q_{ns}$ for $\forall s$.
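To obtain a stationary point along a boundary on which $p_{as} = 0$ for some option in
some state, we forbid the changes in this direction ($dp_{as} = 0$), and obtain that the
components of $\mathbf{Q}_s$ should be identical for all options exhibiting non-zero choice
probabilities. Using the summation $\sum_a Q_{as} p_{as} \equiv V_s$, the above-mentioned
condition is written as “$p_{as} = 0$ or $Q_{as} = V_s$ for $\forall a, \forall s$”. Using
$V_s = \lim_{T \to \infty} \frac{1}{T} \sum_{\tau'=0}^{T-1} \sum_{\tau=0}^{\tau'}
\left( \langle r_{t+\tau}|s \rangle - \langle r \rangle \right)$,
\[
\begin{aligned}
Q_{as} - V_s
&= \lim_{T \to \infty} \frac{1}{T} \sum_{\tau'=0}^{T-1} \sum_{\tau=0}^{\tau'}
   \left( \langle r_{t+\tau}|a,s \rangle - \langle r \rangle \right)
 - \lim_{T \to \infty} \frac{1}{T} \sum_{\tau'=0}^{T-1} \sum_{\tau=0}^{\tau'}
   \left( \langle r_{t+\tau}|s \rangle - \langle r \rangle \right) \\
&= \lim_{T \to \infty} \frac{1}{T} \sum_{\tau'=0}^{T-1} \sum_{\tau=0}^{\tau'}
   \left( \langle r_{t+\tau}|a,s \rangle - \langle r_{t+\tau}|s \rangle \right).
\end{aligned}
\]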
Thus, we obtain Eq. 8 in the text. The extended matching law is derived from the
matching strategy that attempts to maximize the average reward $\langle r \rangle$. In
contrast, if the brain’s decision system simply attempts to maximize the average reward in
each state, $\langle r|s \rangle = \sum_a \langle r|a,s \rangle\, p_{as}(\mathbf{w})$, without
taking the total average $\langle r \rangle = \sum_s \langle r|s \rangle P(s)$ into account,
then the matching strategy leads to the normal matching law in the individual states:
$p_{as} = 0$ or $\langle r|a,s \rangle - \langle r|s \rangle = 0$ for $\forall a$ and
$\forall s$. It is interesting to investigate which matching law, the normal one or the
extended version, animals’ state-dependent choice behavior exhibits.
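The difference between the two laws can be made concrete with a minimal sketch, again under assumptions that are not in the source: a hypothetical two-state, two-action problem with fixed tables for $P(s'|a,s)$, $\langle r|a,s \rangle$ and $p_{as}$, and a finite $T$ in place of the limit. The function `residuals` is an illustrative name. It returns the quantity that the normal matching law sets to zero, $\langle r|a,s \rangle - \langle r|s \rangle$, next to the quantity that the extended law sets to zero, $Q_{as} - V_s$. The tabulated choice probabilities are arbitrary, so neither law is expected to hold here.

```python
# Hypothetical toy problem (values are illustrative, not from the source).
TRANS = [[[0.9, 0.1], [0.2, 0.8]],   # TRANS[s][a][s2] = P(s2 | a, s)
         [[0.7, 0.3], [0.1, 0.9]]]
REWARD = [[1.0, 0.0], [0.0, 2.0]]    # REWARD[s][a] = <r | a, s>
POLICY = [[0.6, 0.4], [0.3, 0.7]]    # POLICY[s][a] = p_as


def residuals(T=5000):
    """Return (normal, extended), both indexed as [s][a].

    normal[s][a]   = <r|a,s> - <r|s>
    extended[s][a] = Q_as - V_s, with the limit truncated at a finite T.
    """
    n = 2
    # State-to-state transition matrix under the fixed choice probabilities.
    m = [[sum(POLICY[s][a] * TRANS[s][a][y] for a in range(n)) for y in range(n)]
         for s in range(n)]
    # rbar[s] = <r | s>
    rbar = [sum(POLICY[s][a] * REWARD[s][a] for a in range(n)) for s in range(n)]
    normal = [[REWARD[s][a] - rbar[s] for a in range(n)] for s in range(n)]
    extended = [[0.0] * n for _ in range(n)]
    for s in range(n):
        for a in range(n):
            d_as = TRANS[s][a][:]   # state distribution at t+1 given (a, s)
            d_s = m[s][:]           # state distribution at t+1 given s only
            partial = normal[s][a]  # tau = 0 term
            total = partial
            for _ in range(1, T):
                partial += sum((d_as[x] - d_s[x]) * rbar[x] for x in range(n))
                total += partial
                d_as = [sum(d_as[x] * m[x][y] for x in range(n)) for y in range(n)]
                d_s = [sum(d_s[x] * m[x][y] for x in range(n)) for y in range(n)]
            extended[s][a] = total / T
    return normal, extended


normal, extended = residuals()
for s in range(2):
    for a in range(2):
        print(s, a, normal[s][a], extended[s][a])
```

In this sketch the two residuals generally differ because the extended one also counts how the chosen option shifts the later states and hence the later rewards. Both are zero on average over options within each state, since $\sum_a p_{as} Q_{as} = V_s$ and $\sum_a p_{as} \langle r|a,s \rangle = \langle r|s \rangle$.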