VB free learner
J. Daunizeau 1,2
1 Brain and Spine Institute, Paris, France
2 Wellcome Trust Centre for Neuroimaging, University College London, United Kingdom
Address for Correspondence:
Jean Daunizeau
Motivation, Brain and Behaviour
Brain and Spine Institute
47, bvd de l'Hopital, Paris, France
Email: jean.daunizeau@gmail.com
Abstract
Notation and acronyms
We list below the acronyms and mathematical notations that may not be familiar to the reader.

BDT: Bayesian Decision Theory
VB: Variational Bayes
COTP: Cue-Outcome Transition Probability
CRP: Chinese Restaurant Process
$\triangleq$: 'is defined as'
$E[x]$ or $\langle x \rangle$: the expectation of the random vector $x$.
$V[x]$: the variance-covariance matrix of the random vector $x$.
$x \sim N(\mu, \Sigma)$: 'the random vector $x$ follows a Gaussian distribution with mean $\mu$ and covariance $\Sigma$'.
1. Introduction
2. Theory
2.1 Description of the task
In the following, we will assume the task can be framed as a variant of the so-called multi-armed bandit
problem. In brief, participants are asked to bet on the outcome of an idealized experiment (the 'bandit'),
which is influenced by the subject's actions.
A bandit is modelled as a random number generator, with specific properties we detail below. First, we
assume that bandits can deliver different outcomes $u^{(o)}$, which may be associated with different
subjective rewards. We assume the latter are captured by the utility of the outcome, i.e. $U(u^{(o)})$, where $U$
is the so-called utility function.
Second, a set of cues (e.g., the bandit label) is accessible prior to the outcome, and we assume they
are sampled from a set of $L$ symbols. Third, the agent can choose between $M$ admissible actions $a$
(cf. 'multi-armed bandit'). Last, we assume that the contingencies between the cues and the outcomes
depend upon the action chosen at each trial. In other words, the conditional distribution
$P(u^{(o)} \mid u^{(c)}, a)$ fully determines the bandit's behaviour. Critically, and though it may be experimentally
controlled, this conditional distribution is unknown to the agent. This means that the agent has to learn
the above cue-outcome transition probabilities (COTP) in order to act optimally, i.e. maximize its
expected reward.
A key idea here is that the COTP may vary over time or trials, such that, for example, cues that were
associated with winning may later become associated with losing. This is what we refer to as an 'unstable'
environment. Thus, the agent has to quickly revise what it may have learned at any point in time in order
to adapt to these changes. From a Bayesian perspective, this means that the agent must have
priors about the structure of the environment (i.e., the bandit) that allow flexible learning and adaptive
behaviour. We will detail such Bayesian agents below.
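For concreteness, the sketch below simulates one such unstable bandit; it is only an illustration (the variable names, dimensions and noise levels are assumptions, not taken from the actual experiment): Gaussian outcomes whose cue- and action-dependent means drift across trials, so that the COTP slowly changes.

```python
import numpy as np

rng = np.random.default_rng(0)
L, M = 4, 2          # number of cues and of admissible actions (illustrative values)
n_trials = 200
drift_sd = 0.1       # how quickly the contingencies change (the 'unstable' part)

# mean outcome for each (cue, action) pair: this is what the agent must track
mean_outcome = rng.normal(0.0, 1.0, size=(L, M))

for t in range(n_trials):
    cue = rng.integers(L)        # cue presented on this trial
    action = rng.integers(M)     # action chosen by the agent (here: at random)
    # outcome drawn according to the current cue-outcome contingency (COTP)
    outcome = rng.normal(mean_outcome[cue, action], 1.0)
    # the contingency itself drifts from trial to trial (random walk on the means)
    mean_outcome += rng.normal(0.0, drift_sd, size=(L, M))
```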
2.2 Perceptual and response models
The first step is to define the candidate response models that we wish to consider. In what follows, we
will assume the agent summarizes the COTP by its first two moments, i.e. mean and variance. This is
what we call the perceptual model, i.e. the prior assumptions under which the observer assimilates new
information. We will see that this yields a Bayesian belief update that is very close in form to the classical
Rescorla-Wagner (delta) learning rule.
A complete BDT response model is fully specified by a perceptual model and an action emission law.
We will see an example of this below.
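The action emission law is not specified at this point; as a purely illustrative placeholder (not necessarily the law used in this work), one common choice is a softmax mapping from expected utilities to choice probabilities:

```python
import numpy as np

def softmax_action_probabilities(expected_utility, beta=1.0):
    """Map the expected utilities of the M admissible actions onto choice probabilities.

    `beta` is an assumed inverse-temperature parameter: large values yield nearly
    deterministic utility maximization, small values yield more exploration.
    """
    z = beta * (expected_utility - np.max(expected_utility))  # subtract max for stability
    p = np.exp(z)
    return p / p.sum()

# example: three actions whose expected utilities the agent has estimated
print(softmax_action_probabilities(np.array([0.2, 1.0, -0.5]), beta=2.0))
```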
2.2.1 The perceptual model
In the following, we will represent the cues and actions with binary vectors, whose dimension matches
the number of possibilities, and with all-zero entries except for the relevant dimension; e.g.,
$u^{(c)} = \begin{bmatrix} 0 & 0 & 1 & 0 \end{bmatrix}^T$ actually means "third type of cue". In contradistinction, the outcome $u^{(o)}$ is
continuous. We assume that the observer summarizes the COTP using its first two moments. This
allows us to represent the outcome's likelihood as a mixture of Gaussian distributions, as follows:
$$p\!\left(u_t^{(o)} \mid u_t^{(c)}, a_t, x_t\right) = \prod_l \prod_m N\!\left(x^{(1)}_{l,m,t},\ \exp\!\left(x^{(2)}_{l,m,t}\right)\right)^{u^{(c)}_{t,l}\, a_{t,m}} \qquad (1)$$
where $x^{(1)}_{l,m,t}$ (resp. $x^{(2)}_{l,m,t}$) is a hidden state that captures the first-order moment (resp. the log of the
second-order moment) of $u^{(o)}$, given the cue and the chosen action.
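In practice, the exponent $u^{(c)}_{t,l} a_{t,m}$ in Equation 1 simply selects the Gaussian component indexed by the current cue-action pair. A minimal sketch of the resulting (log-)likelihood, with illustrative names and a scalar outcome, reads:

```python
import numpy as np

def outcome_log_likelihood(u_o, cue, action, x1, x2):
    """log p(u_o | cue, action, x) under the mixture of Equation 1 (sketch).

    x1[l, m] holds the predicted outcome mean and x2[l, m] the predicted
    log-variance for cue l and action m; only the selected pair contributes,
    mirroring the exponent u_{t,l}^{(c)} a_{t,m}.
    """
    mean = x1[cue, action]
    var = np.exp(x2[cue, action])
    return -0.5 * (np.log(2.0 * np.pi * var) + (u_o - mean) ** 2 / var)
```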
 
Specifying priors $p(x_t)$ over the hidden states effectively shapes how subjects learn to predict outcomes from previous
exposures to cue-outcome-action triplets $\left(u_t^{(o)}, u_t^{(c)}, a_t\right)$, and thus completes the definition of the
perceptual model.
The key idea here is that the hidden states $x_t$ (and hence the COTP) are constrained to vary (smoothly)
across time or trials. Without loss of generality, we will assume $x_t$ behaves as a random walk, with two
variants:
- D0: Subjects assumed a priori that the hidden states $x_t$ vary smoothly over time, according to
a first-order Markov process. This is modelled as a random walk with a Gaussian transition
density:
$$\begin{aligned}
p\!\left(x^{(1)}_{t+1,l,m} \mid x^{(1)}_{t,l,m}, \theta\right) &= N\!\left(x^{(1)}_{t,l,m},\ \theta_1\right) \\
p\!\left(x^{(2)}_{t+1,l,m} \mid x^{(2)}_{t,l,m}, \theta\right) &= N\!\left(x^{(2)}_{t,l,m},\ \theta_2\right)
\end{aligned} \qquad (8)$$
Here, the $\theta$ are variance hyperparameters that represent the static roughness of changes in the hidden
states.
- D1: Similarly to model D0 above, subjects assumed a priori that the hidden states $x^{(\cdot,1)}_t$
vary smoothly over time, according to a first-order Markov process:
$$\begin{aligned}
p\!\left(x^{(1,1)}_{t+1,l,m} \mid x^{(1,1)}_{t,l,m}, \theta\right) &= N\!\left(x^{(1,1)}_{t,l,m},\ \exp\!\left(\theta_1 x^{(1,2)}_{t+1,l,m}\right)\right) \\
p\!\left(x^{(2,1)}_{t+1,l,m} \mid x^{(2,1)}_{t,l,m}, \theta\right) &= N\!\left(x^{(2,1)}_{t,l,m},\ \exp\!\left(\theta_2 x^{(2,2)}_{t+1,l,m}\right)\right)
\end{aligned} \qquad (9)$$
Here, the roughness now varies in time and is controlled by other hidden states $x^{(\cdot,2)}_t$, which we
refer to as volatility states; these are assumed to also vary according to a similar first-order Markov
process:
$$\begin{aligned}
p\!\left(x^{(1,2)}_{t+1,l,m} \mid x^{(1,2)}_{t,l,m}, \vartheta\right) &= N\!\left(x^{(1,2)}_{t,l,m},\ \vartheta_1\right) \\
p\!\left(x^{(2,2)}_{t+1,l,m} \mid x^{(2,2)}_{t,l,m}, \vartheta\right) &= N\!\left(x^{(2,2)}_{t,l,m},\ \vartheta_2\right)
\end{aligned} \qquad (10)$$
Here, the variance hyperparameters $\vartheta$ represent the roughness of changes in volatility, which we
have assumed to be the same across cues and actions.
One can see that it would be trivial to extend the hierarchy up to an arbitrary number of levels, where
each level's smoothness is controlled by the level above, but we do not consider these extensions in the
remainder of this manuscript. This concludes the description of the dynamic perceptual models, whose
complete generative models derive from the product of the likelihood $p\!\left(u_{t'}^{(o)} \mid u_{t'}^{(c)}, a_{t'}, x_{t'}\right)$ in Equation 7
and the Markov priors in Equation 8 or 9-10, depending on the model variant. Again, the parameter $\theta$
that controls the prior belief will be estimated empirically, given the series of cues $u^{(c)}_{1:t}$, outcomes $u^{(o)}_{1:t}$
and actions $a_{1:t}$ of each subject.
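To make the difference between the two priors tangible, the sketch below samples one hidden-state trajectory under each variant (illustrative code; the hyperparameter names and the D1 variance law follow the reconstruction of Equations 8-10 above):

```python
import numpy as np

rng = np.random.default_rng(1)
n_trials = 500

# D0: random walk with a fixed (static) roughness theta_1
theta_1 = 0.05
x_d0 = np.zeros(n_trials)
for t in range(1, n_trials):
    x_d0[t] = x_d0[t - 1] + rng.normal(0.0, np.sqrt(theta_1))

# D1: the roughness is itself driven by a volatility state, which also drifts
coupling, vartheta_1 = 1.0, 0.01      # assumed values for theta_1 and vartheta_1 of D1
x_d1 = np.zeros(n_trials)
vol = np.full(n_trials, -3.0)         # volatility state x^(1,2), starts at low volatility
for t in range(1, n_trials):
    vol[t] = vol[t - 1] + rng.normal(0.0, np.sqrt(vartheta_1))
    x_d1[t] = x_d1[t - 1] + rng.normal(0.0, np.sqrt(np.exp(coupling * vol[t])))
```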
2.2.2 Variational Bayesian recognition of the perceptual models
Given the perceptual models described above, we can now specify the recognition process in terms of
their variational Bayesian (VB) inversion. In brief, subjects update their beliefs on-line, using successive
triplets $\left(u_t^{(c)}, u_t^{(o)}, a_t\right)$ to optimise $\lambda_t$, the sufficient statistics of the posterior density at trial $t$. Suffice it
to say that the VB recognition process invariably derives from the minimization of the statistical
surprise conveyed by outcomes at each trial.
2.2.2.1 Model D0.
At trial $t$, the free energy of model D0 reads:
$$F_t = \left\langle \ln p\!\left(u_t^{(o)} \mid u_t^{(c)}, a_t, x_t\right) + \ln p\!\left(x_t \mid u_{1:t-1}, a_{1:t-1}\right) \right\rangle_q + S\!\left[q\right] \qquad (19)$$
where the prior predictive density $p\!\left(x_t^{(i)} \mid u_{1:t-1}, a_{1:t-1}\right) = N\!\left(\mu^{(i)}_{l,m,t-1},\ \sigma^{(i)}_{l,m,t-1} + \theta_i\right)$ derives from
marginalizing out the transition density (Equation 8) with respect to the previous trial's approximate posterior
distribution.
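For completeness, the prior predictive density quoted above follows from a standard Gaussian marginalization step (written here for the D0 transition density of Equation 8):
$$p\!\left(x^{(i)}_t \mid u_{1:t-1}, a_{1:t-1}\right) = \int p\!\left(x^{(i)}_t \mid x^{(i)}_{t-1}, \theta\right) q\!\left(x^{(i)}_{t-1}\right) dx^{(i)}_{t-1} = N\!\left(\mu^{(i)}_{l,m,t-1},\ \sigma^{(i)}_{l,m,t-1} + \theta_i\right)$$
i.e., the convolution of two Gaussians simply adds their variances.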
We first characterize the variational energy of $x^{(1)}_t$ and the ensuing VB update:
$$I\!\left(x^{(1)}_t\right) = -\frac{1}{2}\sum_l\sum_m u^{(c)}_{t,l}\, a_{t,m}\left\langle \exp\!\left(-x^{(2)}_{l,m,t-1}\right)\right\rangle\left(u_t^{(o)} - x^{(1)}_{l,m,t}\right)^2 - \frac{1}{2}\sum_l\sum_m \frac{\left(x^{(1)}_{l,m,t} - \mu^{(1)}_{l,m,t-1}\right)^2}{\sigma^{(1)}_{l,m,t-1} + \theta_1} + \mathrm{const}$$
$$\left\{\begin{aligned}
\mu^{(1)}_{l,m,t} &= \mu^{(1)}_{l,m,t-1} + \sigma^{(1)}_{l,m,t}\left\langle \exp\!\left(-x^{(2)}_{l,m,t-1}\right)\right\rangle\left(u_t^{(o)} - \mu^{(1)}_{l,m,t-1}\right) \\
\sigma^{(1)}_{l,m,t} &= \left[\left\langle \exp\!\left(-x^{(2)}_{l,m,t-1}\right)\right\rangle + \frac{1}{\sigma^{(1)}_{l,m,t-1} + \theta_1}\right]^{-1}
\end{aligned}\right.
\quad \text{if } u^{(c)}_{t,l} = 1,\ a_{t,m} = 1 \qquad (20)$$
From Equation 20, one can see that the density $q\!\left(x^{(1)}\right) = N\!\left(\mu^{(1)}_{l,m,t},\ \sigma^{(1)}_{l,m,t}\right)$ is separable across the $l$
and $m$ indices (no cross-talk between cue- and action-dependent estimates). Equation 20 is
similar in form to a Rescorla-Wagner learning rule, where $u_t^{(o)} - \mu^{(1)}_{l,m,t-1}$ is the prediction error and
$\sigma^{(1)}_{l,m,t}$ is the learning rate (which is controlled by $\theta_1$).
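Written out in code, Equation 20 is a delta rule whose learning rate adapts to the relative precisions of the prediction and of the outcome. The following is a minimal sketch for the selected cue-action pair (scalar case, illustrative names; the expected outcome precision is taken from Equation 22 below):

```python
import numpy as np

def update_x1(mu1, sigma1, u_o, mu2, sigma2, theta1):
    """One D0 update of the first-order state for the selected (cue, action) pair.

    Follows the reconstructed Equation 20, with <exp(-x2)> = exp(-mu2 + sigma2/2)
    standing in for the expected outcome precision (cf. Equation 22).
    """
    precision_likelihood = np.exp(-mu2 + 0.5 * sigma2)   # expected outcome precision
    precision_prior = 1.0 / (sigma1 + theta1)            # prior predictive precision
    sigma1_new = 1.0 / (precision_likelihood + precision_prior)
    prediction_error = u_o - mu1                         # Rescorla-Wagner-like term
    mu1_new = mu1 + sigma1_new * precision_likelihood * prediction_error
    return mu1_new, sigma1_new
```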
Let us now focus on the variational energy of $x^{(2)}_t$ and its first two derivatives:
$$I\!\left(x^{(2)}_t\right) = -\frac{1}{2}\sum_l\sum_m u^{(c)}_{t,l}\, a_{t,m}\left[x^{(2)}_{l,m,t} + \exp\!\left(-x^{(2)}_{l,m,t}\right)\left\langle\left(u_t^{(o)} - x^{(1)}_{l,m,t}\right)^2\right\rangle\right] - \frac{1}{2}\sum_l\sum_m \frac{\left(x^{(2)}_{l,m,t} - \mu^{(2)}_{l,m,t-1}\right)^2}{\sigma^{(2)}_{l,m,t-1} + \theta_2} + \mathrm{const}$$
$$\frac{\partial I}{\partial x^{(2)}_{t,l,m}} = -\frac{1}{2}\left[1 - \exp\!\left(-x^{(2)}_{l,m,t}\right)\left\langle\left(u_t^{(o)} - x^{(1)}_{l,m,t}\right)^2\right\rangle\right] - \frac{x^{(2)}_{l,m,t} - \mu^{(2)}_{l,m,t-1}}{\sigma^{(2)}_{l,m,t-1} + \theta_2}$$
$$\frac{\partial^2 I}{\partial \left(x^{(2)}_{t,l,m}\right)^2} = -\frac{1}{2}\exp\!\left(-x^{(2)}_{l,m,t}\right)\left\langle\left(u_t^{(o)} - x^{(1)}_{l,m,t}\right)^2\right\rangle - \frac{1}{\sigma^{(2)}_{l,m,t-1} + \theta_2}$$
$$\left\{\begin{aligned}
\mu^{(2)}_{l,m,t} &= \mu^{(2)}_{l,m,t-1} - \frac{\sigma^{(2)}_{l,m,t}}{2}\left[1 - \exp\!\left(-\mu^{(2)}_{l,m,t-1}\right)\left\langle\left(u_t^{(o)} - x^{(1)}_{l,m,t}\right)^2\right\rangle\right] \\
\sigma^{(2)}_{l,m,t} &= \left[\frac{1}{2}\exp\!\left(-\mu^{(2)}_{l,m,t-1}\right)\left\langle\left(u_t^{(o)} - x^{(1)}_{l,m,t}\right)^2\right\rangle + \frac{1}{\sigma^{(2)}_{l,m,t-1} + \theta_2}\right]^{-1}
\end{aligned}\right.
\quad \text{if } u^{(c)}_{t,l} = 1,\ a_{t,m} = 1 \qquad (21)$$
One then needs to derive the expectations under both approximate marginal posterior densities:
$$\begin{aligned}
\left\langle \exp\!\left(-x^{(2)}_{l,m,t-1}\right)\right\rangle_{q\left(x^{(2)}\right)} &= \exp\!\left(-\mu^{(2)}_{l,m,t-1} + \frac{1}{2}\sigma^{(2)}_{l,m,t-1}\right) \\
\left\langle\left(u_t^{(o)} - x^{(1)}_{l,m,t}\right)^2\right\rangle_{q\left(x^{(1)}\right)} &= \left(u_t^{(o)} - \mu^{(1)}_{l,m,t}\right)^2 + \sigma^{(1)}_{l,m,t}
\end{aligned} \qquad (22)$$
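Both lines of Equation 22 are standard Gaussian identities: for $x \sim N(\mu, \sigma)$, $\left\langle \exp(-x)\right\rangle = \exp\!\left(-\mu + \tfrac{1}{2}\sigma\right)$ (the mean of a log-normal variable) and $\left\langle (u - x)^2 \right\rangle = (u - \mu)^2 + \sigma$.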
This completes the VB belief update rules for model D0.
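Putting Equations 20-22 together, one trial of the D0 recognition process can be sketched as follows (illustrative code with assumed variable names; scalar outcomes, and only the selected cue-action pair is updated while all other pairs keep their previous moments):

```python
import numpy as np

def d0_trial_update(u_o, l, m, mu1, s1, mu2, s2, theta1, theta2):
    """One VB/Laplace update of model D0 for the observed (cue l, action m) pair.

    mu1, s1, mu2, s2 are (L, M) arrays holding the posterior moments of the
    first-order states x1 and of the log-variance states x2; the rules follow
    the reconstructed Equations 20-22.
    """
    # expectation needed by the x1 update (Equation 22, first line)
    e_prec = np.exp(-mu2[l, m] + 0.5 * s2[l, m])          # <exp(-x2)>

    # first-order state (Equation 20): precision-weighted delta rule
    s1_prior = s1[l, m] + theta1                          # prior predictive variance
    s1[l, m] = 1.0 / (e_prec + 1.0 / s1_prior)
    mu1[l, m] = mu1[l, m] + s1[l, m] * e_prec * (u_o - mu1[l, m])

    # second-order (log-variance) state (Equation 21), using Equation 22, second line
    e_sq_err = (u_o - mu1[l, m]) ** 2 + s1[l, m]          # <(u - x1)^2>
    s2_prior = s2[l, m] + theta2
    s2[l, m] = 1.0 / (0.5 * np.exp(-mu2[l, m]) * e_sq_err + 1.0 / s2_prior)
    mu2[l, m] = mu2[l, m] - 0.5 * s2[l, m] * (1.0 - np.exp(-mu2[l, m]) * e_sq_err)
    return mu1, s1, mu2, s2
```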
2.2.2.2 Model D1.
To be continued…
Appendix 1: VB recognition: deriving the learning rules from the perceptual models
Within a variational Bayesian framework, negative surprise is measured (or, more precisely, lower-bounded)
via the so-called perceptual free energy $F\!\left[q\right] = E\!\left[\ln p\!\left(x, u \mid m\right)\right] + S\!\left[q\right]$, where the
expectation is taken under the approximate posterior density (representation) $q\!\left(x\right)$ and $S\!\left[\cdot\right]$ denotes
the Shannon entropy. Now, variational Bayesian update equations derive from noting that:
$$\frac{\partial F}{\partial q^{(i)}} = 0 \;\Longrightarrow\; \left\{\begin{aligned}
q\!\left(x^{(i)}\right) &\propto \exp\!\left(I\!\left(x^{(i)}\right)\right) \\
I\!\left(x^{(i)}\right) &= E\!\left[\ln p\!\left(x, u \mid m\right)\right]
\end{aligned}\right. \qquad \text{(A1)}$$
where $p\!\left(x, u \mid m\right)$ is the joint pdf and the expectation is taken under all other approximate posterior
densities $q\!\left(x^{(j)}\right)$, $j \neq i$. Whenever the generative model relies upon prior densities that are conjugate to
the likelihood function, iteratively applying Equation A1 above simply yields the update rules for the
approximate posterior sufficient statistics. Otherwise, the so-called Laplace approximation simply
performs a parametric (Gaussian) approximation of this density, whereby the first-order moment is placed at the
mode and the second-order moment is the inverse curvature of the variational energy $I\!\left(x^{(i)}\right)$ at the
mode. We can even further simplify this by truncating a Taylor expansion of $I\!\left(x^{(i)}\right)$ to second order at
the last representation, which yields the actual update rule:
$$\begin{aligned}
I\!\left(x^{(i)}_t\right) &= E\!\left[\ln p\!\left(x_t, u_{1:t} \mid m\right)\right] \\
&\approx I\!\left(\mu^{(i)}_{t-1}\right) + \left.\frac{\partial I}{\partial x^{(i)}_t}\right|_{\mu^{(i)}_{t-1}} \left(x^{(i)}_t - \mu^{(i)}_{t-1}\right) + \frac{1}{2}\left(x^{(i)}_t - \mu^{(i)}_{t-1}\right)^T \left.\frac{\partial^2 I}{\partial \left(x^{(i)}_t\right)^2}\right|_{\mu^{(i)}_{t-1}} \left(x^{(i)}_t - \mu^{(i)}_{t-1}\right)
\end{aligned}$$
$$\left.\frac{\partial I\!\left(x^{(i)}_t\right)}{\partial x^{(i)}_t}\right|_{\mu^{(i)}_t} = 0 \;\Longrightarrow\; \left\{\begin{aligned}
\mu^{(i)}_t &= \mu^{(i)}_{t-1} + \Sigma^{(i)}_t \left.\frac{\partial I}{\partial x^{(i)}_t}\right|_{\mu^{(i)}_{t-1}} \\
\Sigma^{(i)}_t &= -\left[\left.\frac{\partial^2 I}{\partial \left(x^{(i)}_t\right)^2}\right|_{\mu^{(i)}_{t-1}}\right]^{-1}
\end{aligned}\right. \qquad \text{(A2)}$$
where all derivatives are evaluated at the last expectation $\mu^{(i)}_{t-1}$, and $\mu^{(i)}_t$ and $\Sigma^{(i)}_t$ are the first- and second-
order moments of the representation of the approximate posterior density $q\!\left(x^{(i)}\right)$. Equation A2 is
actually the first iteration of a full Laplace approximation (see Mathys et al., 2011). We will see that $\Sigma^{(i)}_t$
plays the role of a learning rate, and the gradient $\partial I / \partial x^{(i)}_t$ that of a prediction error.
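In generic form, the Laplace-VB step of Equation A2 is a single Newton-like move away from the previous posterior mode; a schematic one-dimensional sketch (illustrative only) is:

```python
def laplace_vb_step(mu_prev, grad_I, hess_I):
    """One-step update of Equation A2 (scalar case, sketch).

    grad_I and hess_I are the first and second derivatives of the variational
    energy I, both evaluated at the previous expectation mu_prev.
    """
    sigma_new = -1.0 / hess_I          # minus the inverse curvature
    mu_new = mu_prev + sigma_new * grad_I
    return mu_new, sigma_new
```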