Bayesian models for fMRI data
Klaas Enno Stephan
Translational Neuromodeling Unit (TNU), Institute for Biomedical Engineering, University of Zurich & ETH Zurich
Laboratory for Social & Neural Systems Research (SNS), University of Zurich
Wellcome Trust Centre for Neuroimaging, University College London
With many thanks for slides & images to: FIL Methods group, particularly Guillaume Flandin and Jean Daunizeau
SPM Course Zurich, 13-15 February 2013

The Reverend Thomas Bayes (1702-1761)

Bayes' theorem
$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}$   (Posterior = Likelihood × Prior / Evidence)
"Bayes' theorem describes how an ideally rational person processes information." (Wikipedia)

Bayes' theorem
Given data $y$ and parameters $\theta$, the joint probability is
$p(y, \theta) = p(\theta \mid y)\, p(y) = p(y \mid \theta)\, p(\theta)$.
Eliminating $p(y, \theta)$ gives Bayes' rule:
$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}$   with likelihood $p(y \mid \theta)$, prior $p(\theta)$, posterior $p(\theta \mid y)$ and evidence $p(y)$.

Bayesian inference: an animation

Principles of Bayesian inference
• Formulation of a generative model:
  – likelihood $p(y \mid \theta) = N(y;\, f(\theta), \sigma_e^2)$ for the observation model $y = f(\theta) + \varepsilon$
  – prior distribution $p(\theta) = N(\theta;\, \mu_p, \sigma_p^2)$
• Observation of data $y$
• Update of beliefs based upon observations, given a prior state of knowledge:
  $p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$

Posterior mean & variance of univariate Gaussians
Likelihood & prior: $p(y \mid \theta) = N(y;\, \theta, \sigma_e^2)$, $p(\theta) = N(\theta;\, \mu_p, \sigma_p^2)$
Posterior: $p(\theta \mid y) = N(\theta;\, \mu, \sigma^2)$ with
$\frac{1}{\sigma^2} = \frac{1}{\sigma_e^2} + \frac{1}{\sigma_p^2}$,   $\mu = \sigma^2 \left( \frac{y}{\sigma_e^2} + \frac{\mu_p}{\sigma_p^2} \right)$
The posterior mean is a variance-weighted combination of the prior mean and the data.

Same thing – but expressed as precision weighting
Likelihood & prior: $p(y \mid \theta) = N(y;\, \theta, \lambda_e^{-1})$, $p(\theta) = N(\theta;\, \mu_p, \lambda_p^{-1})$
Posterior: $p(\theta \mid y) = N(\theta;\, \mu, \lambda^{-1})$ with
$\lambda = \lambda_e + \lambda_p$,   $\mu = \frac{\lambda_e}{\lambda}\, y + \frac{\lambda_p}{\lambda}\, \mu_p$
Relative precision weighting of data and prior.

Same thing – but from an explicit hierarchical perspective
Likelihood & prior: $p(y \mid \theta^{(1)}) = N(y;\, \theta^{(1)}, 1/\lambda^{(1)})$, $p(\theta^{(1)}) = N(\theta^{(1)};\, \theta^{(2)}, 1/\lambda^{(2)})$
Posterior: $p(\theta^{(1)} \mid y) = N(\theta^{(1)};\, \mu, 1/\lambda)$ with
$\lambda = \lambda^{(1)} + \lambda^{(2)}$,   $\mu = \frac{\lambda^{(1)}}{\lambda}\, y + \frac{\lambda^{(2)}}{\lambda}\, \theta^{(2)}$
Relative precision weighting across levels of the hierarchy.

Why should I know about Bayesian stats?
Because Bayesian principles are fundamental for
• statistical inference in general
• sophisticated analyses of (neuronal) systems
• contemporary theories of brain function

Problems of classical (frequentist) statistics
p-value: probability of obtaining the observed (or more extreme) data if the effect is absent, i.e. $p(y \mid H_0)$ with $H_0: \theta = 0$.
Limitations:
• One can never accept the null hypothesis.
• Given enough data, one can always demonstrate a significant effect.
• Correction for multiple comparisons is necessary.
Solution: infer the posterior probability of the effect, $p(\theta \mid y)$.

Generative models: forward and inverse problems
Forward problem: likelihood $p(y \mid \theta, m)$ and prior $p(\theta \mid m)$ predict the data.
Inverse problem: posterior distribution $p(\theta \mid y, m)$ over the parameters, given the data.

Dynamic causal modeling (DCM)
For EEG, MEG and fMRI.
Forward model: predicting measured activity given a putative neuronal state:
$\frac{dx}{dt} = f(x, u, \theta)$,   $y = g(x, \theta)$
Model inversion: estimating neuronal mechanisms from brain activity measures.
Friston et al. (2003) NeuroImage

The Bayesian brain hypothesis & free-energy principle
Prediction error = sensations - predictions.
It can be reduced either by changing sensory input (action) or by changing predictions (perception).
Maximizing the evidence of the brain's generative model = minimizing the surprise about the data (sensory inputs).
Friston et al. 2006, J Physiol Paris

Individual hierarchical Bayesian learning
A hierarchy of hidden states evolving over trials $k$: volatility $x_3^{(k)}$, associations $x_2^{(k)}$ and events in the world $x_1^{(k)}$, which generate the sensory stimuli $u^{(k)}$.
Belief updates take the form of precision-weighted prediction errors: the posterior mean $\mu_i^{(k)}$ at level $i$ is updated from $\mu_i^{(k-1)}$ in proportion to the precision-weighted prediction error $PE_{i-1}$ from the level below.
Mathys et al. 2011, Front. Hum. Neurosci.
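To make the precision-weighting rule from the univariate-Gaussian slides above concrete, here is a minimal numerical sketch (plain Python/NumPy, not SPM code; the function name and example numbers are illustrative). It computes the posterior mean and precision, and shows that the same update can be rewritten as a precision-weighted prediction error, as in the hierarchical learning scheme.

```python
import numpy as np

def gaussian_posterior(mu_p, lambda_p, y, lambda_e):
    """Combine a Gaussian prior N(mu_p, 1/lambda_p) with a Gaussian
    likelihood N(y; theta, 1/lambda_e) for a single observation y.
    Returns the posterior mean and posterior precision."""
    lam = lambda_e + lambda_p                      # precisions add
    mu = (lambda_e * y + lambda_p * mu_p) / lam    # precision-weighted mean
    return mu, lam

# Example: a vague prior combined with a fairly precise observation
mu, lam = gaussian_posterior(mu_p=0.0, lambda_p=1.0, y=2.0, lambda_e=4.0)
print(mu, 1.0 / lam)   # posterior mean 1.6, posterior variance 0.2

# The same posterior mean, written as a precision-weighted prediction error:
# prior mean + (data precision / posterior precision) * (y - prior mean)
mu_pe = 0.0 + (4.0 / lam) * (2.0 - 0.0)
print(mu_pe)           # 1.6 again
```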
Aberrant Bayesian message passing in schizophrenia
[Figure: hierarchical predictive coding architecture (levels $i$ of a cortical hierarchy, driven by sensory input $u$): forward & lateral connections convey (precision-weighted) prediction errors (forward recognition effects, de-correlating lateral interactions); backward & lateral connections convey predictions (backward generation effects $g_i$, lateral interactions mediating priors).]
• abnormal (precision-weighted) prediction errors
• abnormal modulation of NMDAR-dependent synaptic plasticity at forward connections of cortical hierarchies
Stephan et al. 2006, Biol. Psychiatry

Why should I know about Bayesian stats?
Because SPM is getting more and more Bayesian:
• Segmentation & spatial normalisation
• Posterior probability maps (PPMs)
  – 1st level: specific spatial priors
  – 2nd level: global spatial priors
• Dynamic Causal Modelling (DCM)
• Bayesian Model Selection (BMS)
• EEG: source reconstruction

[Figure: the standard SPM analysis pipeline – realignment of the image time-series, smoothing (kernel), general linear model (design matrix, parameter estimates), statistical inference on the statistical parametric map (Gaussian field theory, p < 0.05), spatial normalisation to a template – annotated with its Bayesian components: Bayesian segmentation and normalisation, spatial priors on activation extent, posterior probability maps (PPMs), and dynamic causal modelling.]

Spatial normalisation: Bayesian regularisation
Deformations are modelled as a linear combination of smooth basis functions (3D DCT).
Find maximum a posteriori (MAP) estimates of the deformation parameters $\theta$:
$\log p(\theta \mid y) = \log p(y \mid \theta) + \log p(\theta) - \log p(y)$
where $\log p(y \mid \theta)$ measures the "difference" between template and source image, and $\log p(\theta)$ penalizes the squared distance between the parameters and their expected values (regularisation).

Spatial normalisation: overfitting
Affine registration ($\chi^2$ = 472.1); non-linear registration using regularisation ($\chi^2$ = 302.7); non-linear registration without regularisation ($\chi^2$ = 287.3). Without regularisation the fit to the template improves, but the deformations overfit.

Bayesian segmentation with empirical priors
• Goal: for each voxel, compute the probability that it belongs to a particular tissue type, given its intensity:
  p(tissue | intensity) ∝ p(intensity | tissue) · p(tissue)
• Likelihood: intensities are modelled by a mixture of Gaussian distributions representing different tissue classes (e.g. GM, WM, CSF).
• Priors: obtained from tissue probability maps (segmented images of 151 subjects).
Ashburner & Friston 2005, NeuroImage

Bayesian fMRI analyses
General Linear Model: $y = X\theta + \varepsilon$ with $\varepsilon \sim N(0, C_\varepsilon)$
What are the priors?
• In "classical" SPM, no priors (= "flat" priors)
• Full Bayes: priors are predefined
• Empirical Bayes: priors are estimated from the data, assuming a hierarchical generative model
  – the parameters of one level act as priors on the distribution of the parameters at the level below
  – parameters and hyperparameters at each level can be estimated using EM

Posterior Probability Maps (PPMs)
Posterior distribution: probability of the effect given the data, $p(\theta \mid y)$ – its mean reflects the size of the effect, its precision the variability.
Posterior probability map: image of the probability that the activation exceeds some specified threshold $\gamma$, given the data: $p(\theta > \gamma \mid y) \geq \alpha$
Two thresholds:
• activation threshold $\gamma$: e.g. a percentage of the whole-brain mean signal
• probability $\alpha$ that voxels must exceed to be displayed (e.g. 95%)
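A minimal sketch of how the PPM quantity above could be computed at each voxel, assuming a Gaussian posterior with known mean and variance (e.g. from the empirical-Bayes GLM described on the following slides). This is not SPM's implementation; the function name `ppm` and the toy numbers are made up for illustration.

```python
import numpy as np
from scipy.stats import norm

def ppm(post_mean, post_var, gamma=0.0, alpha=0.95):
    """Posterior probability that the effect exceeds the activation
    threshold gamma, given a Gaussian posterior at each voxel, plus a
    boolean map of voxels exceeding the probability threshold alpha."""
    post_sd = np.sqrt(post_var)
    p_exceed = 1.0 - norm.cdf(gamma, loc=post_mean, scale=post_sd)
    return p_exceed, p_exceed >= alpha

# Toy example: three voxels with different posterior means and variances
post_mean = np.array([0.1, 1.2, 2.5])
post_var = np.array([0.5, 0.4, 0.3])
p, shown = ppm(post_mean, post_var, gamma=1.0, alpha=0.95)
print(np.round(p, 3), shown)   # only the third voxel exceeds the 95% threshold
```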
2nd level PPMs with global priors
1st level (GLM): $y = X\theta^{(1)} + \varepsilon^{(1)}$
2nd level (shrinkage prior): $\theta^{(1)} = 0 + \varepsilon^{(2)}$
Heuristically: use the variance of the mean-corrected activity over voxels as the prior variance of $\theta$ at any particular voxel.
• $\theta^{(1)}$ reflects regionally specific effects
• assume that it is zero on average over voxels
• the variance of this prior is implicitly estimated by estimating $\varepsilon^{(2)}$
$p(\varepsilon^{(1)}) = N(0, C_\varepsilon^{(1)})$,   $p(\varepsilon^{(2)}) = N(0, C_\varepsilon^{(2)})$; the prior expectation at the 2nd level is zero.
In the absence of evidence to the contrary, parameters will shrink to zero.

2nd level PPMs with global priors
1st level (GLM): $y = X\theta + \varepsilon^{(1)}$ with $p(\varepsilon^{(1)}) = N(0, C_\varepsilon)$ (voxel-specific) and $p(\theta) = N(0, C_\theta)$ (global, pooled estimate over voxels)
2nd level (shrinkage prior): $\theta = 0 + \varepsilon^{(2)}$
Compute $C_\varepsilon$ and $C_\theta$ via ReML/EM, and apply the usual rule for the posterior mean & covariance of Gaussians:
$C_{\theta \mid y} = \left( X^T C_\varepsilon^{-1} X + C_\theta^{-1} \right)^{-1}$
$m_{\theta \mid y} = C_{\theta \mid y}\, X^T C_\varepsilon^{-1}\, y$
Friston & Penny 2003, NeuroImage

PPMs vs. SPMs
PPMs rest on the posterior, $p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$; SPMs rest on the distribution of a test statistic $t = f(y)$ under the null.
Bayesian test: $p(\theta > \gamma \mid y) \geq \alpha$
Classical t-test: $p(t > u \mid \theta = 0) \leq \alpha$

PPMs and multiple comparisons
Friston & Penny (2003): no need to correct for multiple comparisons.
Thresholding a PPM at 95% confidence means that, in every voxel displayed, the posterior probability of an activation $\geq \gamma$ is at least 95%. At most 5% of the voxels identified could have activations less than $\gamma$. Independent of the search volume, thresholding a PPM thus puts an upper bound on the false discovery rate.
NB: this claim is being debated.

PPMs vs. SPMs
[Figure: SPM results windows for the same dataset (C:\home\spm\analysis_PET), "rest" contrast. Left: PPM with effect-size threshold 2.06, height threshold P = 0.95, extent threshold k = 0 voxels. Right: SPM{T_39} with height threshold T = 5.50, extent threshold k = 0 voxels; design matrices shown alongside.]
PPMs: show activations greater than a given size.
SPMs: show voxels with non-zero activations.

PPMs: pros and cons
Advantages:
• One can infer that a cause did not elicit a response.
• Inference is independent of the search volume.
• PPMs do not conflate effect size and effect variability.
Disadvantages:
• Estimating priors over voxels is computationally demanding.
• Practical benefits are yet to be established.
• Thresholds other than zero require justification.

Model comparison and selection
Given competing hypotheses about the structure & functional mechanisms of a system, which model is the best?
Which model represents the best balance between model fit and model complexity?
For which model m does p(y | m) become maximal?
Pitt & Myung (2002) TICS

Bayesian model selection (BMS)
Model evidence: $p(y \mid m) = \int p(y \mid \theta, m)\, p(\theta \mid m)\, d\theta$
$\log p(y \mid m) = \langle \log p(y \mid \theta, m) \rangle_q - KL\left[ q(\theta),\, p(\theta \mid m) \right] + KL\left[ q(\theta),\, p(\theta \mid y, m) \right]$
The evidence accounts for both the accuracy and the complexity of the model; it is a measure of generalizability, since $p(y \mid m)$ defines a probability distribution over all possible datasets $y$ (Ghahramani, 2004).
Various approximations exist, e.g. the negative free energy, AIC and BIC.
MacKay 1992, Neural Comput.; Penny et al. 2004a, NeuroImage

Approximations to the model evidence
Maximizing the log model evidence = maximizing the model evidence (the logarithm is a monotonic function).
Log model evidence = balance between fit and complexity:
$\log p(y \mid m) = \text{accuracy}(m) - \text{complexity}(m) = \log p(y \mid \theta, m) - \text{complexity}(m)$
In SPM2 & SPM5, the interface offers two approximations (with $p$ = number of parameters, $N$ = number of data points):
Akaike Information Criterion:   $AIC = \log p(y \mid \theta, m) - p$
Bayesian Information Criterion:   $BIC = \log p(y \mid \theta, m) - \frac{p}{2} \log N$
Penny et al. 2004a, NeuroImage
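As a toy illustration of the two approximations above (using the sign conventions on the slide, where larger values indicate a better model), the following sketch compares two hypothetical GLMs. The log-likelihoods and parameter counts are made-up numbers, not output of any SPM analysis.

```python
import numpy as np

def aic_bic(log_lik, n_params, n_data):
    """AIC and BIC as approximations to the log model evidence,
    following the slide's convention (higher value = better model)."""
    aic = log_lik - n_params
    bic = log_lik - (n_params / 2.0) * np.log(n_data)
    return aic, bic

# Hypothetical comparison: model 1 has 4 parameters, model 2 has 8,
# both fitted to N = 200 scans.
aic1, bic1 = aic_bic(log_lik=-350.0, n_params=4, n_data=200)
aic2, bic2 = aic_bic(log_lik=-345.0, n_params=8, n_data=200)
print(aic1, aic2)   # -354.0 vs -353.0: AIC favours the richer model 2
print(bic1, bic2)   # ~-360.6 vs ~-366.2: BIC's heavier penalty favours model 1
```

The example shows why the two criteria can disagree: they penalize the same increase in fit with different complexity costs.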
The (negative) free energy approximation
Under Gaussian assumptions about the posterior (Laplace approximation):
$\log p(y \mid m) = \langle \log p(y \mid \theta, m) \rangle_q - KL\left[ q(\theta),\, p(\theta \mid m) \right] + KL\left[ q(\theta),\, p(\theta \mid y, m) \right]$
$F = \log p(y \mid m) - KL\left[ q(\theta),\, p(\theta \mid y, m) \right] = \langle \log p(y \mid \theta, m) \rangle_q - KL\left[ q(\theta),\, p(\theta \mid m) \right]$, i.e. $F$ = accuracy - complexity.

The complexity term in F
• In contrast to AIC & BIC, the complexity term of the negative free energy F accounts for parameter interdependencies:
$KL\left[ q(\theta),\, p(\theta \mid m) \right] = \frac{1}{2} \ln |C_\theta| - \frac{1}{2} \ln |C_{\theta \mid y}| + \frac{1}{2} (\mu_{\theta \mid y} - \mu_\theta)^T C_\theta^{-1} (\mu_{\theta \mid y} - \mu_\theta)$
• The complexity term of F is higher
  – the more independent the prior parameters (↑ effective degrees of freedom)
  – the more dependent the posterior parameters
  – the more the posterior mean deviates from the prior mean
• NB: since SPM8, only F is used for model selection!

Bayes factors
To compare two models, we could just compare their log evidences. But the log evidence is just some number – not very intuitive!
A more intuitive interpretation of model comparisons is provided by Bayes factors (positive values in $[0, \infty[$):
$B_{12} = \frac{p(y \mid m_1)}{p(y \mid m_2)}$
Kass & Raftery classification (Kass & Raftery 1995, J. Am. Stat. Assoc.):

  B12         p(m1 | y)   Evidence
  1 to 3      50-75%      weak
  3 to 20     75-95%      positive
  20 to 150   95-99%      strong
  >= 150      >= 99%      very strong

BMS in SPM8: an example
[Figure: four DCMs (M1-M4) of the attention-to-motion network, each containing V1, V5 and PPC with visual stimulation as driving input; the models differ in which connections are modulated by attention.]
M2 better than M1: BF = 2966, ΔF = 7.995
M3 better than M2: BF = 12, ΔF = 2.450
M4 better than M3: BF = 23, ΔF = 3.144

Thank you
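A small sketch of the Bayes-factor arithmetic behind the example above: converting a difference in (approximate) log evidence, such as the ΔF values reported for M1-M4, into a Bayes factor and a posterior model probability. This assumes equal prior probabilities for the two models; the function name and numbers are illustrative, not SPM code.

```python
import numpy as np

def compare_models(logev1, logev2):
    """Bayes factor B12 and posterior probability p(m1 | y) from two
    (approximate) log evidences, e.g. negative free energies, assuming
    equal prior model probabilities."""
    bf12 = np.exp(logev1 - logev2)     # Bayes factor in favour of model 1
    p_m1 = bf12 / (1.0 + bf12)         # posterior probability of model 1
    return bf12, p_m1

# The M2-vs-M1 comparison from the slide: a log-evidence difference of 7.995
bf, p = compare_models(7.995, 0.0)
print(round(bf), round(p, 4))   # ~2966, ~0.9997 -> "very strong" on the Kass & Raftery scale
```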