Bayesian models for fMRI data
Klaas Enno Stephan
Translational Neuromodeling Unit (TNU), Institute for Biomedical Engineering, University of Zurich & ETH Zurich
Laboratory for Social & Neural Systems Research (SNS), University of Zurich
Wellcome Trust Centre for Neuroimaging, University College London
With many thanks for slides & images to: FIL Methods group, particularly Guillaume Flandin and Jean Daunizeau
SPM Course Zurich, 13-15 February 2013

The Reverend Thomas Bayes (1702-1761)

Bayes' theorem
$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}$   (Posterior = Likelihood × Prior / Evidence)
"Bayes' theorem describes how an ideally rational person processes information." (Wikipedia)

Bayes' theorem
Given data $y$ and parameters $\theta$, the joint probability is
$p(y, \theta) = p(\theta \mid y)\, p(y) = p(y \mid \theta)\, p(\theta)$.
Eliminating $p(y, \theta)$ gives Bayes' rule:
$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}$   with likelihood $p(y \mid \theta)$, prior $p(\theta)$, posterior $p(\theta \mid y)$ and evidence $p(y)$.

Bayesian inference: an animation

Principles of Bayesian inference
• Formulation of a generative model:
  – likelihood $p(y \mid \theta) = N(y;\, f(\theta), \sigma_e^2)$ for the observation model $y = f(\theta) + \varepsilon$
  – prior distribution $p(\theta) = N(\theta;\, \mu_p, \sigma_p^2)$
• Observation of data $y$
• Update of beliefs based upon observations, given a prior state of knowledge:
  $p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$

Posterior mean & variance of univariate Gaussians
Likelihood & prior: $p(y \mid \theta) = N(y;\, \theta, \sigma_e^2)$, $p(\theta) = N(\theta;\, \mu_p, \sigma_p^2)$
Posterior: $p(\theta \mid y) = N(\theta;\, \mu, \sigma^2)$ with
$\frac{1}{\sigma^2} = \frac{1}{\sigma_e^2} + \frac{1}{\sigma_p^2}$,   $\mu = \sigma^2 \left( \frac{y}{\sigma_e^2} + \frac{\mu_p}{\sigma_p^2} \right)$
The posterior mean is a variance-weighted combination of the prior mean and the data.

Same thing – but expressed as precision weighting
Likelihood & prior: $p(y \mid \theta) = N(y;\, \theta, \lambda_e^{-1})$, $p(\theta) = N(\theta;\, \mu_p, \lambda_p^{-1})$
Posterior: $p(\theta \mid y) = N(\theta;\, \mu, \lambda^{-1})$ with
$\lambda = \lambda_e + \lambda_p$,   $\mu = \frac{\lambda_e}{\lambda}\, y + \frac{\lambda_p}{\lambda}\, \mu_p$
Relative precision weighting of data and prior.

Same thing – but from an explicit hierarchical perspective
Likelihood & prior: $p(y \mid \theta^{(1)}) = N(y;\, \theta^{(1)}, 1/\lambda^{(1)})$, $p(\theta^{(1)}) = N(\theta^{(1)};\, \theta^{(2)}, 1/\lambda^{(2)})$
Posterior: $p(\theta^{(1)} \mid y) = N(\theta^{(1)};\, \mu, 1/\lambda)$ with
$\lambda = \lambda^{(1)} + \lambda^{(2)}$,   $\mu = \frac{\lambda^{(1)}}{\lambda}\, y + \frac{\lambda^{(2)}}{\lambda}\, \theta^{(2)}$
Relative precision weighting across levels of the hierarchy.

Why should I know about Bayesian stats?
Because Bayesian principles are fundamental for
• statistical inference in general
• sophisticated analyses of (neuronal) systems
• contemporary theories of brain function

Problems of classical (frequentist) statistics
p-value: probability of obtaining the observed (or more extreme) data if the effect is absent, i.e. $p(y \mid H_0)$ with $H_0: \theta = 0$.
Limitations:
• One can never accept the null hypothesis.
• Given enough data, one can always demonstrate a significant effect.
• Correction for multiple comparisons is necessary.
Solution: infer the posterior probability of the effect, $p(\theta \mid y)$.

Generative models: forward and inverse problems
Forward problem: likelihood $p(y \mid \theta, m)$ and prior $p(\theta \mid m)$ predict the data.
Inverse problem: posterior distribution $p(\theta \mid y, m)$ over the parameters, given the data.

Dynamic causal modeling (DCM)
For EEG, MEG and fMRI.
Forward model: predicting measured activity given a putative neuronal state:
$\frac{dx}{dt} = f(x, u, \theta)$,   $y = g(x, \theta)$
Model inversion: estimating neuronal mechanisms from brain activity measures.
Friston et al. (2003) NeuroImage

The Bayesian brain hypothesis & free-energy principle
Prediction error = sensations - predictions.
It can be reduced either by changing sensory input (action) or by changing predictions (perception).
Maximizing the evidence of the brain's generative model = minimizing the surprise about the data (sensory inputs).
Friston et al. 2006, J Physiol Paris

Individual hierarchical Bayesian learning
A hierarchy of hidden states evolving over trials $k$: volatility $x_3^{(k)}$, associations $x_2^{(k)}$ and events in the world $x_1^{(k)}$, which generate the sensory stimuli $u^{(k)}$.
Belief updates take the form of precision-weighted prediction errors: the posterior mean $\mu_i^{(k)}$ at level $i$ is updated from $\mu_i^{(k-1)}$ in proportion to the precision-weighted prediction error $PE_{i-1}$ from the level below.
Mathys et al. 2011, Front. Hum. Neurosci.
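To make the precision-weighting rule from the univariate-Gaussian slides above concrete, here is a minimal numerical sketch (plain Python/NumPy, not SPM code; the function name and example numbers are illustrative). It computes the posterior mean and precision, and shows that the same update can be rewritten as a precision-weighted prediction error, as in the hierarchical learning scheme.

```python
import numpy as np

def gaussian_posterior(mu_p, lambda_p, y, lambda_e):
    """Combine a Gaussian prior N(mu_p, 1/lambda_p) with a Gaussian
    likelihood N(y; theta, 1/lambda_e) for a single observation y.
    Returns the posterior mean and posterior precision."""
    lam = lambda_e + lambda_p                      # precisions add
    mu = (lambda_e * y + lambda_p * mu_p) / lam    # precision-weighted mean
    return mu, lam

# Example: a vague prior combined with a fairly precise observation
mu, lam = gaussian_posterior(mu_p=0.0, lambda_p=1.0, y=2.0, lambda_e=4.0)
print(mu, 1.0 / lam)   # posterior mean 1.6, posterior variance 0.2

# The same posterior mean, written as a precision-weighted prediction error:
# prior mean + (data precision / posterior precision) * (y - prior mean)
mu_pe = 0.0 + (4.0 / lam) * (2.0 - 0.0)
print(mu_pe)           # 1.6 again
```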
Aberrant Bayesian message passing in schizophrenia
[Figure: hierarchical predictive coding architecture (levels $i$ of a cortical hierarchy, driven by sensory input $u$): forward & lateral connections convey (precision-weighted) prediction errors (forward recognition effects, de-correlating lateral interactions); backward & lateral connections convey predictions (backward generation effects $g_i$, lateral interactions mediating priors).]
• abnormal (precision-weighted) prediction errors
• abnormal modulation of NMDAR-dependent synaptic plasticity at forward connections of cortical hierarchies
Stephan et al. 2006, Biol. Psychiatry

Why should I know about Bayesian stats?
Because SPM is getting more and more Bayesian:
• Segmentation & spatial normalisation
• Posterior probability maps (PPMs)
  – 1st level: specific spatial priors
  – 2nd level: global spatial priors
• Dynamic Causal Modelling (DCM)
• Bayesian Model Selection (BMS)
• EEG: source reconstruction

[Figure: the standard SPM analysis pipeline – realignment of the image time-series, smoothing (kernel), general linear model (design matrix, parameter estimates), statistical inference on the statistical parametric map (Gaussian field theory, p < 0.05), spatial normalisation to a template – annotated with its Bayesian components: Bayesian segmentation and normalisation, spatial priors on activation extent, posterior probability maps (PPMs), and dynamic causal modelling.]

Spatial normalisation: Bayesian regularisation
Deformations are modelled as a linear combination of smooth basis functions (3D DCT).
Find maximum a posteriori (MAP) estimates of the deformation parameters $\theta$:
$\log p(\theta \mid y) = \log p(y \mid \theta) + \log p(\theta) - \log p(y)$
where $\log p(y \mid \theta)$ measures the "difference" between template and source image, and $\log p(\theta)$ penalizes the squared distance between the parameters and their expected values (regularisation).

Spatial normalisation: overfitting
Affine registration ($\chi^2$ = 472.1); non-linear registration using regularisation ($\chi^2$ = 302.7); non-linear registration without regularisation ($\chi^2$ = 287.3). Without regularisation the fit to the template improves, but the deformations overfit.

Bayesian segmentation with empirical priors
• Goal: for each voxel, compute the probability that it belongs to a particular tissue type, given its intensity:
  p(tissue | intensity) ∝ p(intensity | tissue) · p(tissue)
• Likelihood: intensities are modelled by a mixture of Gaussian distributions representing different tissue classes (e.g. GM, WM, CSF).
• Priors: obtained from tissue probability maps (segmented images of 151 subjects).
Ashburner & Friston 2005, NeuroImage

Bayesian fMRI analyses
General Linear Model: $y = X\theta + \varepsilon$ with $\varepsilon \sim N(0, C_\varepsilon)$
What are the priors?
• In "classical" SPM, no priors (= "flat" priors)
• Full Bayes: priors are predefined
• Empirical Bayes: priors are estimated from the data, assuming a hierarchical generative model
  – the parameters of one level act as priors on the distribution of the parameters at the level below
  – parameters and hyperparameters at each level can be estimated using EM

Posterior Probability Maps (PPMs)
Posterior distribution: probability of the effect given the data, $p(\theta \mid y)$ – its mean reflects the size of the effect, its precision the variability.
Posterior probability map: image of the probability that the activation exceeds some specified threshold $\gamma$, given the data: $p(\theta > \gamma \mid y) \geq \alpha$
Two thresholds:
• activation threshold $\gamma$: e.g. a percentage of the whole-brain mean signal
• probability $\alpha$ that voxels must exceed to be displayed (e.g. 95%)
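A minimal sketch of how the PPM quantity above could be computed at each voxel, assuming a Gaussian posterior with known mean and variance (e.g. from the empirical-Bayes GLM described on the following slides). This is not SPM's implementation; the function name `ppm` and the toy numbers are made up for illustration.

```python
import numpy as np
from scipy.stats import norm

def ppm(post_mean, post_var, gamma=0.0, alpha=0.95):
    """Posterior probability that the effect exceeds the activation
    threshold gamma, given a Gaussian posterior at each voxel, plus a
    boolean map of voxels exceeding the probability threshold alpha."""
    post_sd = np.sqrt(post_var)
    p_exceed = 1.0 - norm.cdf(gamma, loc=post_mean, scale=post_sd)
    return p_exceed, p_exceed >= alpha

# Toy example: three voxels with different posterior means and variances
post_mean = np.array([0.1, 1.2, 2.5])
post_var = np.array([0.5, 0.4, 0.3])
p, shown = ppm(post_mean, post_var, gamma=1.0, alpha=0.95)
print(np.round(p, 3), shown)   # only the third voxel exceeds the 95% threshold
```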
2nd level PPMs with global priors
1st level (GLM): $y = X\theta^{(1)} + \varepsilon^{(1)}$
2nd level (shrinkage prior): $\theta^{(1)} = 0 + \varepsilon^{(2)}$
Heuristically: use the variance of the mean-corrected activity over voxels as the prior variance of $\theta$ at any particular voxel.
• $\theta^{(1)}$ reflects regionally specific effects
• assume that it is zero on average over voxels
• the variance of this prior is implicitly estimated by estimating $\varepsilon^{(2)}$
$p(\varepsilon^{(1)}) = N(0, C_\varepsilon^{(1)})$,   $p(\varepsilon^{(2)}) = N(0, C_\varepsilon^{(2)})$; the prior expectation at the 2nd level is zero.
In the absence of evidence to the contrary, parameters will shrink to zero.

2nd level PPMs with global priors
1st level (GLM): $y = X\theta + \varepsilon^{(1)}$ with $p(\varepsilon^{(1)}) = N(0, C_\varepsilon)$ (voxel-specific) and $p(\theta) = N(0, C_\theta)$ (global, pooled estimate over voxels)
2nd level (shrinkage prior): $\theta = 0 + \varepsilon^{(2)}$
Compute $C_\varepsilon$ and $C_\theta$ via ReML/EM, and apply the usual rule for the posterior mean & covariance of Gaussians:
$C_{\theta \mid y} = \left( X^T C_\varepsilon^{-1} X + C_\theta^{-1} \right)^{-1}$
$m_{\theta \mid y} = C_{\theta \mid y}\, X^T C_\varepsilon^{-1}\, y$
Friston & Penny 2003, NeuroImage

PPMs vs. SPMs
PPMs rest on the posterior, $p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$; SPMs rest on the distribution of a test statistic $t = f(y)$ under the null.
Bayesian test: $p(\theta > \gamma \mid y) \geq \alpha$
Classical t-test: $p(t > u \mid \theta = 0) \leq \alpha$

PPMs and multiple comparisons
Friston & Penny (2003): no need to correct for multiple comparisons.
Thresholding a PPM at 95% confidence means that, in every voxel displayed, the posterior probability of an activation $\geq \gamma$ is at least 95%. At most 5% of the voxels identified could have activations less than $\gamma$. Independent of the search volume, thresholding a PPM thus puts an upper bound on the false discovery rate.
NB: this claim is being debated.

PPMs vs. SPMs
[Figure: SPM results windows for the same dataset (C:\home\spm\analysis_PET), "rest" contrast. Left: PPM with effect-size threshold 2.06, height threshold P = 0.95, extent threshold k = 0 voxels. Right: SPM{T_39} with height threshold T = 5.50, extent threshold k = 0 voxels; design matrices shown alongside.]
PPMs: show activations greater than a given size.
SPMs: show voxels with non-zero activations.

PPMs: pros and cons
Advantages:
• One can infer that a cause did not elicit a response.
• Inference is independent of the search volume.
• PPMs do not conflate effect size and effect variability.
Disadvantages:
• Estimating priors over voxels is computationally demanding.
• Practical benefits are yet to be established.
• Thresholds other than zero require justification.

Model comparison and selection
Given competing hypotheses about the structure & functional mechanisms of a system, which model is the best?
Which model represents the best balance between model fit and model complexity?
For which model m does p(y | m) become maximal?
Pitt & Myung (2002) TICS

Bayesian model selection (BMS)
Model evidence: $p(y \mid m) = \int p(y \mid \theta, m)\, p(\theta \mid m)\, d\theta$
$\log p(y \mid m) = \langle \log p(y \mid \theta, m) \rangle_q - KL\left[ q(\theta),\, p(\theta \mid m) \right] + KL\left[ q(\theta),\, p(\theta \mid y, m) \right]$
The evidence accounts for both the accuracy and the complexity of the model; it is a measure of generalizability, since $p(y \mid m)$ defines a probability distribution over all possible datasets $y$ (Ghahramani, 2004).
Various approximations exist, e.g. the negative free energy, AIC and BIC.
MacKay 1992, Neural Comput.; Penny et al. 2004a, NeuroImage

Approximations to the model evidence
Maximizing the log model evidence = maximizing the model evidence (the logarithm is a monotonic function).
Log model evidence = balance between fit and complexity:
$\log p(y \mid m) = \text{accuracy}(m) - \text{complexity}(m) = \log p(y \mid \theta, m) - \text{complexity}(m)$
In SPM2 & SPM5, the interface offers two approximations (with $p$ = number of parameters, $N$ = number of data points):
Akaike Information Criterion:   $AIC = \log p(y \mid \theta, m) - p$
Bayesian Information Criterion:   $BIC = \log p(y \mid \theta, m) - \frac{p}{2} \log N$
Penny et al. 2004a, NeuroImage
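As a toy illustration of the two approximations above (using the sign conventions on the slide, where larger values indicate a better model), the following sketch compares two hypothetical GLMs. The log-likelihoods and parameter counts are made-up numbers, not output of any SPM analysis.

```python
import numpy as np

def aic_bic(log_lik, n_params, n_data):
    """AIC and BIC as approximations to the log model evidence,
    following the slide's convention (higher value = better model)."""
    aic = log_lik - n_params
    bic = log_lik - (n_params / 2.0) * np.log(n_data)
    return aic, bic

# Hypothetical comparison: model 1 has 4 parameters, model 2 has 8,
# both fitted to N = 200 scans.
aic1, bic1 = aic_bic(log_lik=-350.0, n_params=4, n_data=200)
aic2, bic2 = aic_bic(log_lik=-345.0, n_params=8, n_data=200)
print(aic1, aic2)   # -354.0 vs -353.0: AIC favours the richer model 2
print(bic1, bic2)   # ~-360.6 vs ~-366.2: BIC's heavier penalty favours model 1
```

The example shows why the two criteria can disagree: they penalize the same increase in fit with different complexity costs.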
The (negative) free energy approximation
Under Gaussian assumptions about the posterior (Laplace approximation):
$\log p(y \mid m) = \langle \log p(y \mid \theta, m) \rangle_q - KL\left[ q(\theta),\, p(\theta \mid m) \right] + KL\left[ q(\theta),\, p(\theta \mid y, m) \right]$
$F = \log p(y \mid m) - KL\left[ q(\theta),\, p(\theta \mid y, m) \right] = \langle \log p(y \mid \theta, m) \rangle_q - KL\left[ q(\theta),\, p(\theta \mid m) \right]$, i.e. $F$ = accuracy - complexity.

The complexity term in F
• In contrast to AIC & BIC, the complexity term of the negative free energy F accounts for parameter interdependencies:
$KL\left[ q(\theta),\, p(\theta \mid m) \right] = \frac{1}{2} \ln |C_\theta| - \frac{1}{2} \ln |C_{\theta \mid y}| + \frac{1}{2} (\mu_{\theta \mid y} - \mu_\theta)^T C_\theta^{-1} (\mu_{\theta \mid y} - \mu_\theta)$
• The complexity term of F is higher
  – the more independent the prior parameters (↑ effective degrees of freedom)
  – the more dependent the posterior parameters
  – the more the posterior mean deviates from the prior mean
• NB: since SPM8, only F is used for model selection!

Bayes factors
To compare two models, we could just compare their log evidences. But the log evidence is just some number – not very intuitive!
A more intuitive interpretation of model comparisons is provided by Bayes factors (positive values in $[0, \infty[$):
$B_{12} = \frac{p(y \mid m_1)}{p(y \mid m_2)}$
Kass & Raftery classification (Kass & Raftery 1995, J. Am. Stat. Assoc.):

  B12         p(m1 | y)   Evidence
  1 to 3      50-75%      weak
  3 to 20     75-95%      positive
  20 to 150   95-99%      strong
  >= 150      >= 99%      very strong

BMS in SPM8: an example
[Figure: four DCMs (M1-M4) of the attention-to-motion network, each containing V1, V5 and PPC with visual stimulation as driving input; the models differ in which connections are modulated by attention.]
M2 better than M1: BF = 2966, ΔF = 7.995
M3 better than M2: BF = 12, ΔF = 2.450
M4 better than M3: BF = 23, ΔF = 3.144

Thank you
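A small sketch of the Bayes-factor arithmetic behind the example above: converting a difference in (approximate) log evidence, such as the ΔF values reported for M1-M4, into a Bayes factor and a posterior model probability. This assumes equal prior probabilities for the two models; the function name and numbers are illustrative, not SPM code.

```python
import numpy as np

def compare_models(logev1, logev2):
    """Bayes factor B12 and posterior probability p(m1 | y) from two
    (approximate) log evidences, e.g. negative free energies, assuming
    equal prior model probabilities."""
    bf12 = np.exp(logev1 - logev2)     # Bayes factor in favour of model 1
    p_m1 = bf12 / (1.0 + bf12)         # posterior probability of model 1
    return bf12, p_m1

# The M2-vs-M1 comparison from the slide: a log-evidence difference of 7.995
bf, p = compare_models(7.995, 0.0)
print(round(bf), round(p, 4))   # ~2966, ~0.9997 -> "very strong" on the Kass & Raftery scale
```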