Bayesian model selection and estimation: Simultaneous mixed effects for models and parameters

Daniel J. Schad (a,*), Michael A. Rapp (a,b), Quentin J. M. Huys (c)
(a) Klinik für Psychiatrie und Psychotherapie, Campus Mitte
* contact: danieljschad@gmail.com

Introduction

Bayesian model selection and estimation (BMSE): powerful methods for determining the most likely among sets of competing hypotheses about the mechanisms and parameters that generated observed data, e.g., from experiments on decision-making.

Mixed-effects (or empirical / hierarchical Bayes) models provide full inference in group studies – with repeated observations for each individual – by adequately capturing:
- individual differences (random effects / posteriors), and
- mechanisms & parameters common to all individuals (fixed effects / priors).

Previous models have assumed mixed effects
- either for the model parameters: Huys et al. [1] applied empirical Bayes via Expectation Maximization (EM) to reinforcement learning models,
- or for the model identity: Stephan et al. [2] developed a Variational Bayes (VB) method for treating models as random effects.

Here:
A) We evaluate the empirical Bayes method assuming mixed effects for the parameters of reinforcement learning models [1].
B) We present a novel Variational Bayes (VB) model which considers mixed effects for models and parameters simultaneously.

A) Empirical Bayes: Theory

In an empirical Bayes approach, the prior is inferred from the group information by integrating over the vector of individual subject parameters, θ:

P(X \mid \mu_\theta, \Sigma_\theta) \propto \int_{-\infty}^{\infty} d\theta \; P(X, \theta \mid \mu_\theta, \Sigma_\theta) \qquad (4)

The prior parameters are set to their maximum likelihood (ML) estimates:

(\hat{\mu}_\theta, \hat{\Sigma}_\theta) = \arg\max_{\mu_\theta, \Sigma_\theta} P(X \mid \mu_\theta, \Sigma_\theta) = \arg\max_{\mu_\theta, \Sigma_\theta} \int_{-\infty}^{\infty} d\theta \; P(X \mid \theta) \, P(\theta \mid \mu_\theta, \Sigma_\theta) \qquad (5)
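For concreteness, here is a minimal Python sketch of the per-subject likelihood P(X | θ) that enters Equations (4) and (5), for the simple RL ("ab") model used in the simulations: one state, two actions, a learning rate a and an inverse noisiness b. The function name and the unconstrained parameterization are illustrative assumptions, not the poster's actual implementation.

    import numpy as np
    from scipy.special import expit, logsumexp

    def ab_loglik(theta, choices, rewards):
        """Log-likelihood log P(X | theta) for one subject under the simple
        RL ('ab') model: one state, two actions, learning rate a, inverse
        noisiness b. theta is unconstrained; a is mapped to (0, 1) and b to
        (0, inf), a common convention (our assumption, not the poster's)."""
        a, b = expit(theta[0]), np.exp(theta[1])
        Q = np.zeros(2)                        # action values
        ll = 0.0
        for c, r in zip(choices, rewards):
            ll += b * Q[c] - logsumexp(b * Q)  # log softmax choice probability
            Q[c] += a * (r - Q[c])             # Rescorla-Wagner update
        return ll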
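To see what Equation (5) is doing, the integral can be approximated by naive Monte Carlo over draws from the prior, and the prior parameters optimized numerically (maximum likelihood type II). The sketch below makes these assumptions, reuses the hypothetical ab_loglik from above, and restricts the prior covariance to a diagonal; in practice the EM/Laplace scheme described next is used instead, as it is far more efficient.

    from scipy.optimize import minimize

    def neg_log_marglik(prior, data, n_samples=2000, seed=0):
        """Monte Carlo approximation to -log P(X | mu, Sigma) of Eq. (5):
        draw theta ~ N(mu, diag(sd^2)) and average P(X | theta) over draws,
        independently for every subject. data = list of (choices, rewards)."""
        mu, log_sd = prior[:2], prior[2:]
        rng = np.random.default_rng(seed)  # fixed draws keep the objective smooth
        thetas = mu + np.exp(log_sd) * rng.standard_normal((n_samples, 2))
        total = 0.0
        for choices, rewards in data:
            lls = np.array([ab_loglik(th, choices, rewards) for th in thetas])
            total += logsumexp(lls) - np.log(n_samples)  # log mean likelihood
        return -total

    # Empirical Bayes / ML-II, cf. Eq. (5): optimize the prior parameters themselves
    # res = minimize(neg_log_marglik, x0=np.zeros(4), args=(data,), method="Nelder-Mead")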
Optimization algorithm: Expectation Maximization with Laplace approximation

To estimate model parameters we use Expectation Maximization (EM), where we iterate between an expectation step, in which we approximate the distribution of the latent subject parameters, and a maximization step, in which we search for maximum likelihood estimates (MLE) of the prior parameters. In this procedure, each parameter update is obtained conditional on the currently best estimates for the other parameters. The likelihood for the prior is approximately Gaussian, providing a basis for a Laplace approximation to derive error bars with normative alpha error rates (see Equation 6).

To ease computations with the likelihood of the prior parameters we take its logarithm. As the distribution of individual subject parameters cannot in general be derived analytically and directly, we introduce the distribution q(\theta) \equiv \mathrm{Normal}(\theta; \theta_0, S) to approximate the distribution over θ via a normal distribution (i.e., a Laplace approximation). In the E-step, we approximate this distribution based on the prior parameters and the data. In the M-step, we subsequently update our estimates for the other (prior) parameters given this approximation to θ. To solve Equation (7) we utilize Jensen's inequality and move the logarithm into the integral in (8):

\log p(X \mid \mu_\theta, \Sigma_\theta) = \log \int_{-\infty}^{\infty} d\theta \; p(X, \theta \mid \mu_\theta, \Sigma_\theta) \qquad (6)

= \log \int_{-\infty}^{\infty} d\theta \; q(\theta) \, \frac{p(X, \theta \mid \mu_\theta, \Sigma_\theta)}{q(\theta)} \qquad (7)

\geq \int_{-\infty}^{\infty} d\theta \; q(\theta) \, \log \frac{p(X, \theta \mid \mu_\theta, \Sigma_\theta)}{q(\theta)} \qquad (8)
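The following schematic (our illustrative reading of the procedure, not the authors' code) implements this EM scheme with a diagonal prior covariance, assuming a per-subject log-likelihood such as ab_loglik above. The E-step computes each subject's Laplace approximation q(θ) = Normal(θ; θ₀, S) – posterior mode plus curvature, here taken from BFGS's inverse-Hessian approximation – and the M-step re-estimates the prior mean and variance from these moments.

    import numpy as np
    from scipy.optimize import minimize

    def em_laplace(data, loglik, n_params=2, n_iter=50):
        """Empirical Bayes via EM with a Laplace approximation (sketch).
        data: list of per-subject observation tuples; loglik(theta, *subj)."""
        mu = np.zeros(n_params)
        var = np.ones(n_params)                 # diagonal prior covariance
        for _ in range(n_iter):
            modes, covs = [], []
            for subj in data:
                # E-step: mode of log P(X | theta) + log P(theta | mu, var)
                negpost = lambda th: (-loglik(th, *subj)
                                      + 0.5 * np.sum((th - mu) ** 2 / var))
                res = minimize(negpost, mu)          # BFGS by default
                modes.append(res.x)
                covs.append(np.diag(res.hess_inv))   # Laplace curvature at the mode
            modes, covs = np.array(modes), np.array(covs)
            # M-step: prior moments from the approximate posteriors, each
            # update conditional on the current estimates of the others
            mu = modes.mean(axis=0)
            var = (modes ** 2 + covs).mean(axis=0) - mu ** 2
        return mu, var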
B) Variational Bayes: Methods

Our new VB model considers mixed effects for models and parameters simultaneously. We compare two implementations:
- Sufficient statistics approach: combine empirical Bayes [1] with random effects for models [2].
- Full random-effects inference: Variational Bayes.

[Figure: Bayesian dependency graph for our random-effects generative model for multi-subject data. Rectangles denote deterministic parameters and shaded circles represent observed values. α = parameters of the Dirichlet distribution (number of model "occurrences"); r = parameters of the multinomial distribution (probabilities of the models); m = model labels; θ = individual subject parameters; μ = prior group mean; σ² = prior group variance; μ₀, ν = hyper-priors for the μ parameter; a₀, b₀ = hyper-priors for the σ² parameter; y = observed data; p(y | m = k) = probability of the data given model k; k = model index; K = number of models; n = subject index; N = number of subjects.]
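The dependency graph above corresponds to a generative process that can be sampled directly; the sketch below illustrates it (the model set, hyper-parameter values, and the omitted data-generation step are placeholders of ours).

    import numpy as np

    def sample_group(n_subjects=30, models=("rep", "ab"), alpha0=1.0, seed=1):
        """One synthetic group from the mixed-effects generative model:
        r ~ Dirichlet(alpha), m_n ~ Multinomial(r), theta_n ~ N(mu_m, sigma2_m),
        and finally y_n ~ P(y | theta_n, m_n) (not implemented here)."""
        rng = np.random.default_rng(seed)
        K = len(models)
        r = rng.dirichlet(alpha0 * np.ones(K))           # model probabilities
        group = []
        for n in range(n_subjects):
            m = rng.choice(K, p=r)                       # model label of subject n
            mu_m, sigma2_m = np.zeros(2), np.ones(2)     # per-model prior (placeholder)
            theta = rng.normal(mu_m, np.sqrt(sigma2_m))  # subject parameters
            group.append((models[m], theta))             # y_n would be simulated
        return r, group                                  # from the model's likelihood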
A) Empirical Bayes: Results

Simulations from decision processes with known parameters (N = 90 simulated subjects): the simple RL ("ab") model, assuming one state, two actions, a learning rate (a) and an inverse noisiness (b) parameter (N = 60), and the Rep model (N = 30). The generating prior parameters can be recovered from the simulated data, and the precision of the group-level estimates scales with the number of data points as theoretically expected.

[Figure: (A, B) Group-level precision (1/SE) of the prior mean and prior variance for the beta and learning-rate parameters against the total number of observations (sqrt(N × T); # subjects = 30, 100, 200, 300; # trials = 60, 150, 300, 600, 1200). (C) Log likelihood relative to the maximum against deviation from the MLE. (D) Distributions of BIC₀ − BIC₁, t-values, and p-values across 1000 simulations (# sim = 1000/1000): ΔBIC < −2 in N = 0/1000 and N = 1/1000 simulations; P(|t| > T_krit) = 0.048 and 0.059; p-values approximately uniform.]

BICint extracts the true generating model from the data.

Table. ΔBIC scores (model BIC minus BIC of the best model for the dataset) for six different models fitted to six different datasets generated from these same models. The highlighted diagonal shows that for all simulated datasets the generating model was also recovered from the data.

Generating model | Fitted model:
                   m2b2alr   mr     2b2alr   m2b2al   m      2b2al
m2b2alr            0         337    49       441      1297   531
mr                 42        0      428      800      801    1490
2b2alr             12        841    0        280      2678   271
m2b2al             6         452    95       0        514    83
m                  40        21     408      45       0      436
2b2al              16        1391   5        18       2271   0

B) Variational Bayes: Results

Simple RL + 200 trials (N = 30): with sufficient data, the correct model can be identified for all subjects and for both methods with high certainty. [Figure: average correct probability mass and parameter values (true values and estimates ± 95% CI, for EM, sufficient statistics, and full Variational Bayes) for the Rep and ab models.]

Simple RL + 20 trials (N = 30): with scarce data per subject, the full VB method improves model comparison compared to the sufficient statistics approach. [Figure: as above, for 20 trials per subject.]

2step task [3] + 201 trials (N = 30): hybrid (model-based + model-free), model-free, and non-learner models. Using three similar models with differing model parameters in the 2step task also yields posterior uncertainty, and the full VB performs best. [Figure: average correct probability mass for the hybrid (MB+MF+Rep), model-free (MF+Rep), and non-learner (Rep) models, and parameter values (hybrid: b1, b2, a1, a2, l, w, r; model-free: b1, b2, a1, a2, l, r; non-learner: r1, r2).]

2step + 201 trials, with parameters based on real data [4]: the advantage of the full VB is visible also for parameters obtained from real (observed) data. [Figure: as above, with true model frequencies from the real data (e.g., N_true = 21 for the hybrid model, N_true = 3 for the non-learner model).]

Conclusions

- Fitting empirical Bayes models of reinforcement learning using Expectation Maximization (EM) [1] exhibits desirable normative properties.
- Our new Variational Bayes method suggests that we can and should understand the heterogeneity and homogeneity observed in group studies of decision-making by investigating contributions of both the underlying mechanisms and their parameters.
- We find increased accuracy in Bayesian model comparison for our new VB method compared to previous approaches [1, 2].
- We expect that this new mixed-effects method will prove useful for a wide range of computational modeling approaches in group studies of cognition and biology.

References

[1] Huys, Q. J. M., Eshel, N., O'Nions, E., Sheridan, L., Dayan, P., & Roiser, J. P. (2012). Bonsai trees in your head: How the Pavlovian system sculpts goal-directed choices by pruning decision trees. PLoS Computational Biology, 8(3), e1002410.
[2] Stephan, K. E., Penny, W. D., Daunizeau, J., Moran, R. J., & Friston, K. J. (2009). Bayesian model selection for group studies. NeuroImage, 46(4), 1004-1017.
[3] Daw, N. D., Gershman, S. J., Seymour, B., Dayan, P., & Dolan, R. J. (2011). Model-based influences on humans' choices and striatal prediction errors. Neuron, 69(6), 1204-1215.
[4] Schad, D. J., Jünger, E., Garbusow, M., Sebold, M., Bernhardt, N., Javadi, A. H., et al. (2014). Individual differences in processing speed and working memory capacity moderate the balance between habitual and goal-directed choice behaviour. Submitted.

Acknowledgements

This research was supported by the Deutsche Forschungsgemeinschaft (DFG FOR 1617, Alcohol and Learning), Grant RA1047/2-1.