VB free learner
J. Daunizeau 1,2
1 Brain and Spine Institute, Paris, France
2 Wellcome Trust Centre for Neuroimaging, University College London, United Kingdom
Address for Correspondence:
Jean Daunizeau
Motivation, Brain and Behaviour
Brain and Spine Institute
47, bvd de l'Hopital, Paris, France
Email: jean.daunizeau@gmail.com
Abstract
Notation and acronyms
We list below the acronyms and mathematical notations that may not be familiar to the reader.

BDT: Bayesian Decision Theory
VB: Variational Bayes
COTP: Cue-Outcome Transition Probability
CRP: Chinese Restaurant Process
$\triangleq$: 'is defined as'
$E[x]$ or $\langle x \rangle$: the expectation of the random vector $x$.
$V[x]$: the variance-covariance matrix of the random vector $x$.
$x \sim N(\mu, \Sigma)$: 'the random vector $x$ follows a Gaussian distribution with mean $\mu$ and covariance $\Sigma$'.
1. Introduction
2. Theory
2.1 Description of the task
In the following, we will assume the task can be framed as a variant of the so-called multi-armed bandit
problem. In brief, participants are asked to bet on the outcome of an idealized experiment (the 'bandit'),
which is influenced by the subject's actions.
A bandit is modelled as a random number generator, with specific properties we detail below. First, we
assume that bandits can deliver different outcomes $u^{(o)}$, which may be associated with different
subjective rewards. We assume the latter are captured by the utility of the outcome, i.e. $U(u^{(o)})$, where $U$
is the so-called utility function.
Second, a set of cues (e.g., the bandit label) is accessible prior to the outcome, and we assume they
are sampled from a set of $L$ symbols. Third, the agent can choose between $M$ admissible actions $a$
(cf. 'multi-armed bandit'). Last, we assume that the contingencies between the cues and the outcomes
depend upon the action chosen at each trial. In other words, the conditional distribution
$P(u^{(o)} \mid u^{(c)}, a)$ fully determines the bandit's behaviour. Critically, and though it may be experimentally
controlled, this conditional distribution is unknown to the agent. This means that the agent has to learn
the above cue-outcome transition probabilities (COTP) in order to act optimally, i.e. maximize its
expected reward.
A key idea here is that the COTP may vary over time or trials, such that, for example, cues that were
associated with winning may later become associated with losing. This is what we refer to as an 'unstable'
environment. Thus, the agent has to quickly revise what it may have learned at any point in time in order
to adapt to these changes. From a Bayesian perspective, this means that the agent must have
priors about the structure of the environment (i.e., the bandit) that allow flexible learning and adaptive
behaviour. We will detail such Bayesian agents below.
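For concreteness, the sketch below simulates one such unstable bandit; it is only an illustration (the variable names, dimensions and noise levels are assumptions, not taken from the actual experiment): Gaussian outcomes whose cue- and action-dependent means drift across trials, so that the COTP slowly changes.

```python
import numpy as np

rng = np.random.default_rng(0)
L, M = 4, 2          # number of cues and of admissible actions (illustrative values)
n_trials = 200
drift_sd = 0.1       # how quickly the contingencies change (the 'unstable' part)

# mean outcome for each (cue, action) pair: this is what the agent must track
mean_outcome = rng.normal(0.0, 1.0, size=(L, M))

for t in range(n_trials):
    cue = rng.integers(L)        # cue presented on this trial
    action = rng.integers(M)     # action chosen by the agent (here: at random)
    # outcome drawn according to the current cue-outcome contingency (COTP)
    outcome = rng.normal(mean_outcome[cue, action], 1.0)
    # the contingency itself drifts from trial to trial (random walk on the means)
    mean_outcome += rng.normal(0.0, drift_sd, size=(L, M))
```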
2.2 Perceptual and response models
The first step is to define the candidate response models that we wish to consider. In what follows, we
will assume the agent summarizes the COTP by its first two moments, i.e. mean and variance. This is
what we call the perceptual model, i.e. the prior assumptions under which the observer assimilates new
information. We will see that this yields a Bayesian belief update that is very close in form to the classical
Rescorla-Wagner (delta) learning rule.
A complete BDT response model is fully specified by a perceptual model and an action emission law.
We will see an example of this below.
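The action emission law is not specified at this point; as a purely illustrative placeholder (not necessarily the law used in this work), one common choice is a softmax mapping from expected utilities to choice probabilities:

```python
import numpy as np

def softmax_action_probabilities(expected_utility, beta=1.0):
    """Map the expected utilities of the M admissible actions onto choice probabilities.

    `beta` is an assumed inverse-temperature parameter: large values yield nearly
    deterministic utility maximization, small values yield more exploration.
    """
    z = beta * (expected_utility - np.max(expected_utility))  # subtract max for stability
    p = np.exp(z)
    return p / p.sum()

# example: three actions whose expected utilities the agent has estimated
print(softmax_action_probabilities(np.array([0.2, 1.0, -0.5]), beta=2.0))
```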
2.2.1 The perceptual model
In the following, we will represent the cues and actions with binary vectors, whose dimension matches
the number of possibilities, and with all-zero entries except for the relevant dimension; e.g.,
$u^{(c)} = \begin{bmatrix} 0 & 0 & 1 & 0 \end{bmatrix}^T$ actually means "third type of cue". In contradistinction, the outcome $u^{(o)}$ is
continuous. We assume that the observer summarizes the COTP using its first two moments. This
allows us to represent the outcome's likelihood as a mixture of Gaussian distributions, as follows:
$$p\!\left(u_t^{(o)} \mid u_t^{(c)}, a_t, x_t\right) = \prod_l \prod_m N\!\left(x^{(1)}_{l,m,t},\ \exp\!\left(x^{(2)}_{l,m,t}\right)\right)^{u^{(c)}_{t,l}\, a_{t,m}} \qquad (1)$$
where $x^{(1)}_{l,m,t}$ (resp. $x^{(2)}_{l,m,t}$) is a hidden state that captures the first-order moment (resp. the log of the
second-order moment) of $u^{(o)}$, given the cue and the chosen action.
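In practice, the exponent $u^{(c)}_{t,l} a_{t,m}$ in Equation 1 simply selects the Gaussian component indexed by the current cue-action pair. A minimal sketch of the resulting (log-)likelihood, with illustrative names and a scalar outcome, reads:

```python
import numpy as np

def outcome_log_likelihood(u_o, cue, action, x1, x2):
    """log p(u_o | cue, action, x) under the mixture of Equation 1 (sketch).

    x1[l, m] holds the predicted outcome mean and x2[l, m] the predicted
    log-variance for cue l and action m; only the selected pair contributes,
    mirroring the exponent u_{t,l}^{(c)} a_{t,m}.
    """
    mean = x1[cue, action]
    var = np.exp(x2[cue, action])
    return -0.5 * (np.log(2.0 * np.pi * var) + (u_o - mean) ** 2 / var)
```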
 
Specifying priors $p(x_t)$ over the hidden states effectively shapes how subjects learn to predict outcomes from previous
exposures to cue-outcome-action triplets $\left(u_t^{(o)}, u_t^{(c)}, a_t\right)$, and thus completes the definition of the
perceptual model.
The key idea here is that the hidden states $x_t$ (and hence the COTP) are constrained to vary (smoothly)
across time or trials. Without loss of generality, we will assume $x_t$ behaves as a random walk, with two
variants:
- D0: Subjects assumed a priori that the hidden states $x_t$ vary smoothly over time, according to
a first-order Markov process. This is modelled as a random walk with a Gaussian transition
density:
$$\begin{aligned}
p\!\left(x^{(1)}_{t+1,l,m} \mid x^{(1)}_{t,l,m}, \theta\right) &= N\!\left(x^{(1)}_{t,l,m},\ \theta_1\right) \\
p\!\left(x^{(2)}_{t+1,l,m} \mid x^{(2)}_{t,l,m}, \theta\right) &= N\!\left(x^{(2)}_{t,l,m},\ \theta_2\right)
\end{aligned} \qquad (8)$$
Here, the $\theta$ are variance hyperparameters that represent the static roughness of changes in the hidden
states.
- D1: Similarly to model D0 above, subjects assumed a priori that the hidden states $x^{(\cdot,1)}_t$
vary smoothly over time, according to a first-order Markov process:
$$\begin{aligned}
p\!\left(x^{(1,1)}_{t+1,l,m} \mid x^{(1,1)}_{t,l,m}, \theta\right) &= N\!\left(x^{(1,1)}_{t,l,m},\ \exp\!\left(\theta_1 x^{(1,2)}_{t+1,l,m}\right)\right) \\
p\!\left(x^{(2,1)}_{t+1,l,m} \mid x^{(2,1)}_{t,l,m}, \theta\right) &= N\!\left(x^{(2,1)}_{t,l,m},\ \exp\!\left(\theta_2 x^{(2,2)}_{t+1,l,m}\right)\right)
\end{aligned} \qquad (9)$$
Here, the roughness now varies in time and is controlled by other hidden states $x^{(\cdot,2)}_t$, which we
refer to as volatility states; these are assumed to also vary according to a similar first-order Markov
process:
$$\begin{aligned}
p\!\left(x^{(1,2)}_{t+1,l,m} \mid x^{(1,2)}_{t,l,m}, \vartheta\right) &= N\!\left(x^{(1,2)}_{t,l,m},\ \vartheta_1\right) \\
p\!\left(x^{(2,2)}_{t+1,l,m} \mid x^{(2,2)}_{t,l,m}, \vartheta\right) &= N\!\left(x^{(2,2)}_{t,l,m},\ \vartheta_2\right)
\end{aligned} \qquad (10)$$
Here, the variance hyperparameters $\vartheta$ represent the roughness of changes in volatility, which we
have assumed to be the same across cues and actions.
One can see that it would be trivial to extend the hierarchy up to an arbitrary number of levels, where
each level's smoothness is controlled by the level above, but we do not consider these extensions in the
remainder of this manuscript. This concludes the description of the dynamic perceptual models, whose
complete generative models derive from the product of the likelihood $p\!\left(u_{t'}^{(o)} \mid u_{t'}^{(c)}, a_{t'}, x_{t'}\right)$ in Equation 7
and the Markov priors in Equation 8 or 9-10, depending on the model variant. Again, the parameter $\theta$
that controls the prior belief will be estimated empirically, given the series of cues $u^{(c)}_{1:t}$, outcomes $u^{(o)}_{1:t}$
and actions $a_{1:t}$ of each subject.
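To make the difference between the two priors tangible, the sketch below samples one hidden-state trajectory under each variant (illustrative code; the hyperparameter names and the D1 variance law follow the reconstruction of Equations 8-10 above):

```python
import numpy as np

rng = np.random.default_rng(1)
n_trials = 500

# D0: random walk with a fixed (static) roughness theta_1
theta_1 = 0.05
x_d0 = np.zeros(n_trials)
for t in range(1, n_trials):
    x_d0[t] = x_d0[t - 1] + rng.normal(0.0, np.sqrt(theta_1))

# D1: the roughness is itself driven by a volatility state, which also drifts
coupling, vartheta_1 = 1.0, 0.01      # assumed values for theta_1 and vartheta_1 of D1
x_d1 = np.zeros(n_trials)
vol = np.full(n_trials, -3.0)         # volatility state x^(1,2), starts at low volatility
for t in range(1, n_trials):
    vol[t] = vol[t - 1] + rng.normal(0.0, np.sqrt(vartheta_1))
    x_d1[t] = x_d1[t - 1] + rng.normal(0.0, np.sqrt(np.exp(coupling * vol[t])))
```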
2.2.2 Variational Bayesian recognition of the perceptual models
Given the perceptual models described above, we can now specify the recognition process in terms of
their variational Bayesian (VB) inversion. In brief, subjects update their beliefs on-line, using successive
triplets $\left(u_t^{(c)}, u_t^{(o)}, a_t\right)$ to optimise $\lambda_t$, the sufficient statistics of the posterior density at trial $t$. Suffice it
to say that the VB recognition process invariably derives from the minimization of the statistical
surprise conveyed by outcomes at each trial.
2.2.2.1 Model D0.
At trial $t$, the free energy of model D0 reads:
$$F_t = \left\langle \ln p\!\left(u_t^{(o)} \mid u_t^{(c)}, a_t, x_t\right) + \ln p\!\left(x_t \mid u_{1:t-1}, a_{1:t-1}\right) \right\rangle_q + S\!\left[q\right] \qquad (19)$$
where the prior predictive density $p\!\left(x_t^{(i)} \mid u_{1:t-1}, a_{1:t-1}\right) = N\!\left(\mu^{(i)}_{l,m,t-1},\ \sigma^{(i)}_{l,m,t-1} + \theta_i\right)$ derives from
marginalizing out the transition density (Equation 8) with respect to the previous trial's approximate posterior
distribution.
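For completeness, the prior predictive density quoted above follows from a standard Gaussian marginalization step (written here for the D0 transition density of Equation 8):
$$p\!\left(x^{(i)}_t \mid u_{1:t-1}, a_{1:t-1}\right) = \int p\!\left(x^{(i)}_t \mid x^{(i)}_{t-1}, \theta\right) q\!\left(x^{(i)}_{t-1}\right) dx^{(i)}_{t-1} = N\!\left(\mu^{(i)}_{l,m,t-1},\ \sigma^{(i)}_{l,m,t-1} + \theta_i\right)$$
i.e., the convolution of two Gaussians simply adds their variances.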
We first characterize the variational energy of $x^{(1)}_t$ and the ensuing VB update:
$$I\!\left(x^{(1)}_t\right) = -\frac{1}{2}\sum_l\sum_m u^{(c)}_{t,l}\, a_{t,m}\left\langle \exp\!\left(-x^{(2)}_{l,m,t-1}\right)\right\rangle\left(u_t^{(o)} - x^{(1)}_{l,m,t}\right)^2 - \frac{1}{2}\sum_l\sum_m \frac{\left(x^{(1)}_{l,m,t} - \mu^{(1)}_{l,m,t-1}\right)^2}{\sigma^{(1)}_{l,m,t-1} + \theta_1} + \mathrm{const}$$
$$\left\{\begin{aligned}
\mu^{(1)}_{l,m,t} &= \mu^{(1)}_{l,m,t-1} + \sigma^{(1)}_{l,m,t}\left\langle \exp\!\left(-x^{(2)}_{l,m,t-1}\right)\right\rangle\left(u_t^{(o)} - \mu^{(1)}_{l,m,t-1}\right) \\
\sigma^{(1)}_{l,m,t} &= \left[\left\langle \exp\!\left(-x^{(2)}_{l,m,t-1}\right)\right\rangle + \frac{1}{\sigma^{(1)}_{l,m,t-1} + \theta_1}\right]^{-1}
\end{aligned}\right.
\quad \text{if } u^{(c)}_{t,l} = 1,\ a_{t,m} = 1 \qquad (20)$$
From Equation 20, one can see that the density $q\!\left(x^{(1)}\right) = N\!\left(\mu^{(1)}_{l,m,t},\ \sigma^{(1)}_{l,m,t}\right)$ is separable across the $l$
and $m$ indices (no cross-talk between cue- and action-dependent estimates). Equation 20 is
similar in form to a Rescorla-Wagner learning rule, where $u_t^{(o)} - \mu^{(1)}_{l,m,t-1}$ is the prediction error and
$\sigma^{(1)}_{l,m,t}$ is the learning rate (which is controlled by $\theta_1$).
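Written out in code, Equation 20 is a delta rule whose learning rate adapts to the relative precisions of the prediction and of the outcome. The following is a minimal sketch for the selected cue-action pair (scalar case, illustrative names; the expected outcome precision is taken from Equation 22 below):

```python
import numpy as np

def update_x1(mu1, sigma1, u_o, mu2, sigma2, theta1):
    """One D0 update of the first-order state for the selected (cue, action) pair.

    Follows the reconstructed Equation 20, with <exp(-x2)> = exp(-mu2 + sigma2/2)
    standing in for the expected outcome precision (cf. Equation 22).
    """
    precision_likelihood = np.exp(-mu2 + 0.5 * sigma2)   # expected outcome precision
    precision_prior = 1.0 / (sigma1 + theta1)            # prior predictive precision
    sigma1_new = 1.0 / (precision_likelihood + precision_prior)
    prediction_error = u_o - mu1                         # Rescorla-Wagner-like term
    mu1_new = mu1 + sigma1_new * precision_likelihood * prediction_error
    return mu1_new, sigma1_new
```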
Let us now focus on the variational energy of $x^{(2)}_t$ and its first two derivatives:
$$I\!\left(x^{(2)}_t\right) = -\frac{1}{2}\sum_l\sum_m u^{(c)}_{t,l}\, a_{t,m}\left[x^{(2)}_{l,m,t} + \exp\!\left(-x^{(2)}_{l,m,t}\right)\left\langle\left(u_t^{(o)} - x^{(1)}_{l,m,t}\right)^2\right\rangle\right] - \frac{1}{2}\sum_l\sum_m \frac{\left(x^{(2)}_{l,m,t} - \mu^{(2)}_{l,m,t-1}\right)^2}{\sigma^{(2)}_{l,m,t-1} + \theta_2} + \mathrm{const}$$
$$\frac{\partial I}{\partial x^{(2)}_{t,l,m}} = -\frac{1}{2}\left[1 - \exp\!\left(-x^{(2)}_{l,m,t}\right)\left\langle\left(u_t^{(o)} - x^{(1)}_{l,m,t}\right)^2\right\rangle\right] - \frac{x^{(2)}_{l,m,t} - \mu^{(2)}_{l,m,t-1}}{\sigma^{(2)}_{l,m,t-1} + \theta_2}$$
$$\frac{\partial^2 I}{\partial \left(x^{(2)}_{t,l,m}\right)^2} = -\frac{1}{2}\exp\!\left(-x^{(2)}_{l,m,t}\right)\left\langle\left(u_t^{(o)} - x^{(1)}_{l,m,t}\right)^2\right\rangle - \frac{1}{\sigma^{(2)}_{l,m,t-1} + \theta_2}$$
$$\left\{\begin{aligned}
\mu^{(2)}_{l,m,t} &= \mu^{(2)}_{l,m,t-1} - \frac{\sigma^{(2)}_{l,m,t}}{2}\left[1 - \exp\!\left(-\mu^{(2)}_{l,m,t-1}\right)\left\langle\left(u_t^{(o)} - x^{(1)}_{l,m,t}\right)^2\right\rangle\right] \\
\sigma^{(2)}_{l,m,t} &= \left[\frac{1}{2}\exp\!\left(-\mu^{(2)}_{l,m,t-1}\right)\left\langle\left(u_t^{(o)} - x^{(1)}_{l,m,t}\right)^2\right\rangle + \frac{1}{\sigma^{(2)}_{l,m,t-1} + \theta_2}\right]^{-1}
\end{aligned}\right.
\quad \text{if } u^{(c)}_{t,l} = 1,\ a_{t,m} = 1 \qquad (21)$$
One then needs to derive the expectations under both approximate marginal posterior densities:
$$\begin{aligned}
\left\langle \exp\!\left(-x^{(2)}_{l,m,t-1}\right)\right\rangle_{q\left(x^{(2)}\right)} &= \exp\!\left(-\mu^{(2)}_{l,m,t-1} + \frac{1}{2}\sigma^{(2)}_{l,m,t-1}\right) \\
\left\langle\left(u_t^{(o)} - x^{(1)}_{l,m,t}\right)^2\right\rangle_{q\left(x^{(1)}\right)} &= \left(u_t^{(o)} - \mu^{(1)}_{l,m,t}\right)^2 + \sigma^{(1)}_{l,m,t}
\end{aligned} \qquad (22)$$
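Both lines of Equation 22 are standard Gaussian identities: for $x \sim N(\mu, \sigma)$, $\left\langle \exp(-x)\right\rangle = \exp\!\left(-\mu + \tfrac{1}{2}\sigma\right)$ (the mean of a log-normal variable) and $\left\langle (u - x)^2 \right\rangle = (u - \mu)^2 + \sigma$.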
This completes the VB belief update rules for model D0.
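Putting Equations 20-22 together, one trial of the D0 recognition process can be sketched as follows (illustrative code with assumed variable names; scalar outcomes, and only the selected cue-action pair is updated while all other pairs keep their previous moments):

```python
import numpy as np

def d0_trial_update(u_o, l, m, mu1, s1, mu2, s2, theta1, theta2):
    """One VB/Laplace update of model D0 for the observed (cue l, action m) pair.

    mu1, s1, mu2, s2 are (L, M) arrays holding the posterior moments of the
    first-order states x1 and of the log-variance states x2; the rules follow
    the reconstructed Equations 20-22.
    """
    # expectation needed by the x1 update (Equation 22, first line)
    e_prec = np.exp(-mu2[l, m] + 0.5 * s2[l, m])          # <exp(-x2)>

    # first-order state (Equation 20): precision-weighted delta rule
    s1_prior = s1[l, m] + theta1                          # prior predictive variance
    s1[l, m] = 1.0 / (e_prec + 1.0 / s1_prior)
    mu1[l, m] = mu1[l, m] + s1[l, m] * e_prec * (u_o - mu1[l, m])

    # second-order (log-variance) state (Equation 21), using Equation 22, second line
    e_sq_err = (u_o - mu1[l, m]) ** 2 + s1[l, m]          # <(u - x1)^2>
    s2_prior = s2[l, m] + theta2
    s2[l, m] = 1.0 / (0.5 * np.exp(-mu2[l, m]) * e_sq_err + 1.0 / s2_prior)
    mu2[l, m] = mu2[l, m] - 0.5 * s2[l, m] * (1.0 - np.exp(-mu2[l, m]) * e_sq_err)
    return mu1, s1, mu2, s2
```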
2.2.2.2 Model D1.
To be continued…
Appendix 1: VB recognition: deriving the learning rules from the perceptual models
Within a variational Bayesian framework, negative surprise is measured (or, more precisely, lower-bounded)
via the so-called perceptual free energy $F\!\left[q\right] = E\!\left[\ln p\!\left(x, u \mid m\right)\right] + S\!\left[q\right]$, where the
expectation is taken under the approximate posterior density (representation) $q\!\left(x\right)$ and $S\!\left[\cdot\right]$ denotes
the Shannon entropy. Now, variational Bayesian update equations derive from noting that:
$$\frac{\partial F}{\partial q^{(i)}} = 0 \;\Longrightarrow\; \left\{\begin{aligned}
q\!\left(x^{(i)}\right) &\propto \exp\!\left(I\!\left(x^{(i)}\right)\right) \\
I\!\left(x^{(i)}\right) &= E\!\left[\ln p\!\left(x, u \mid m\right)\right]
\end{aligned}\right. \qquad \text{(A1)}$$
where $p\!\left(x, u \mid m\right)$ is the joint pdf and the expectation is taken under all other approximate posterior
densities $q\!\left(x^{(j)}\right)$, $j \neq i$. Whenever the generative model relies upon prior densities that are conjugate to
the likelihood function, iteratively applying Equation A1 above simply yields the update rules for the
approximate posterior sufficient statistics. Otherwise, the so-called Laplace approximation simply
performs a parametric (Gaussian) approximation of this density, whereby the first-order moment is placed at the
mode and the second-order moment is the inverse curvature of the variational energy $I\!\left(x^{(i)}\right)$ at the
mode. We can even further simplify this by truncating a Taylor expansion of $I\!\left(x^{(i)}\right)$ to second order at
the last representation, which yields the actual update rule:
$$\begin{aligned}
I\!\left(x^{(i)}_t\right) &= E\!\left[\ln p\!\left(x_t, u_{1:t} \mid m\right)\right] \\
&\approx I\!\left(\mu^{(i)}_{t-1}\right) + \left.\frac{\partial I}{\partial x^{(i)}_t}\right|_{\mu^{(i)}_{t-1}} \left(x^{(i)}_t - \mu^{(i)}_{t-1}\right) + \frac{1}{2}\left(x^{(i)}_t - \mu^{(i)}_{t-1}\right)^T \left.\frac{\partial^2 I}{\partial \left(x^{(i)}_t\right)^2}\right|_{\mu^{(i)}_{t-1}} \left(x^{(i)}_t - \mu^{(i)}_{t-1}\right)
\end{aligned}$$
$$\left.\frac{\partial I\!\left(x^{(i)}_t\right)}{\partial x^{(i)}_t}\right|_{\mu^{(i)}_t} = 0 \;\Longrightarrow\; \left\{\begin{aligned}
\mu^{(i)}_t &= \mu^{(i)}_{t-1} + \Sigma^{(i)}_t \left.\frac{\partial I}{\partial x^{(i)}_t}\right|_{\mu^{(i)}_{t-1}} \\
\Sigma^{(i)}_t &= -\left[\left.\frac{\partial^2 I}{\partial \left(x^{(i)}_t\right)^2}\right|_{\mu^{(i)}_{t-1}}\right]^{-1}
\end{aligned}\right. \qquad \text{(A2)}$$
where all derivatives are evaluated at the last expectation $\mu^{(i)}_{t-1}$, and $\mu^{(i)}_t$ and $\Sigma^{(i)}_t$ are the first- and second-
order moments of the representation of the approximate posterior density $q\!\left(x^{(i)}\right)$. Equation A2 is
actually the first iteration of a full Laplace approximation (see Mathys et al., 2011). We will see that $\Sigma^{(i)}_t$
plays the role of a learning rate, and the gradient $\partial I / \partial x^{(i)}_t$ that of a prediction error.
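In generic form, the Laplace-VB step of Equation A2 is a single Newton-like move away from the previous posterior mode; a schematic one-dimensional sketch (illustrative only) is:

```python
def laplace_vb_step(mu_prev, grad_I, hess_I):
    """One-step update of Equation A2 (scalar case, sketch).

    grad_I and hess_I are the first and second derivatives of the variational
    energy I, both evaluated at the previous expectation mu_prev.
    """
    sigma_new = -1.0 / hess_I          # minus the inverse curvature
    mu_new = mu_prev + sigma_new * grad_I
    return mu_new, sigma_new
```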