Bayesian models for fMRI data

Klaas Enno Stephan

Laboratory for Social and Neural Systems Research, Institute for Empirical Research in Economics, University of Zurich
Functional Imaging Laboratory (FIL), Wellcome Trust Centre for Neuroimaging, University College London

With many thanks for slides & images to: FIL Methods group, particularly Guillaume Flandin

[Image: portrait of the Reverend Thomas Bayes (1702–1761)]

Methods & models for fMRI data analysis in neuroeconomics, April 2010

Why do I need to learn about Bayesian stats?

Because SPM is getting more and more Bayesian:
• Segmentation & spatial normalisation
• Posterior probability maps (PPMs)
  – 1st level: specific spatial priors
  – 2nd level: global spatial priors
• Dynamic Causal Modelling (DCM)
• Bayesian Model Selection (BMS)
• EEG: source reconstruction

[Figure: the standard SPM analysis pipeline — image time-series → realignment → smoothing (kernel) → general linear model (design matrix, parameter estimates) → statistical parametric map (SPM) → statistical inference via Gaussian field theory (p < 0.05), with normalisation to a template — annotated with where Bayesian methods enter: Bayesian segmentation and normalisation, spatial priors on activation extent, posterior probability maps (PPMs), and Dynamic Causal Modelling.]

Problems of classical (frequentist) statistics

p-value: the probability of obtaining data at least as extreme as those observed, given that the effect is absent. If this probability is small, we reject the null hypothesis that there is no effect.

$$H_0: \theta = 0$$
$$p(y \mid H_0) = \text{probability of observing the data } y \text{, given no effect } (\theta = 0)$$

Limitations:
• One can never accept the null hypothesis.
• Given enough data, one can always demonstrate a significant effect.
• Correction for multiple comparisons is necessary.

Solution: infer the posterior probability of the effect,
$$p(\theta \mid y) = \text{probability of the effect, given the observed data.}$$

Overview of topics
• Bayes' rule
• Bayesian update rules for Gaussian densities
• Bayesian analyses in SPM
  – Segmentation & spatial normalisation
  – Posterior probability maps (PPMs)
    • 1st level: specific spatial priors
    • 2nd level: global spatial priors
  – Bayesian Model Selection (BMS)

Bayes' Theorem

$$P(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}$$

Posterior = (Likelihood × Prior) / Evidence

"Bayes' Theorem describes how an ideally rational person processes information." (Wikipedia)

Reverend Thomas Bayes, 1702–1761

Bayes' Theorem

Given data y and parameters θ, the joint probability factorises in two ways:

$$p(y, \theta) = p(y \mid \theta)\, p(\theta) = p(\theta \mid y)\, p(y)$$

Eliminating $p(y, \theta)$ gives Bayes' rule:

$$P(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}$$

Bayesian statistics

posterior ∝ likelihood ∙ prior:
$$p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$$
where the likelihood $p(y \mid \theta)$ carries the new data and $p(\theta)$ the prior knowledge.

Bayes' theorem allows one to formally incorporate prior knowledge into the computation of statistical probabilities. Priors can be of different sorts: empirical, principled, or shrinkage priors. The "posterior" probability of the parameters given the data is an optimal combination of prior knowledge and new data, weighted by their relative precision.
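To make Bayes' rule concrete, here is a minimal numerical sketch (not from the original slides), assuming NumPy is available; the Gaussian prior, the likelihood, and the datum y = 3 are invented for illustration. It evaluates posterior = likelihood × prior / evidence on a grid for a single parameter θ.

```python
import numpy as np

# Grid over a single parameter theta
theta = np.linspace(-10.0, 10.0, 2001)
dtheta = theta[1] - theta[0]

# Invented example: Gaussian prior N(0, 2^2) and one datum y with
# Gaussian likelihood N(y; theta, 1^2)
prior = np.exp(-0.5 * theta**2 / 4.0)
prior /= prior.sum() * dtheta                    # normalise on the grid

y = 3.0
likelihood = np.exp(-0.5 * (y - theta)**2 / 1.0)

# Bayes' rule: posterior = likelihood * prior / evidence
evidence = np.sum(likelihood * prior) * dtheta   # p(y) by numerical integration
posterior = likelihood * prior / evidence

print("posterior mean:", np.sum(theta * posterior) * dtheta)
```

With these numbers the posterior mean should come out at about 2.4, between the prior mean (0) and the datum (3), pulled towards the more precise source — exactly the precision weighting derived analytically on the next slides.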
[Animated slide: Bayes in motion — an animation.]

Principles of Bayesian inference
• Formulation of a generative model:
  – likelihood $p(y \mid \theta)$
  – prior distribution $p(\theta)$
• Observation of data y
• Update of beliefs based upon observations, given a prior state of knowledge:
$$p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$$

Posterior mean & variance of univariate Gaussians

Likelihood & prior:
$$p(y \mid \theta) = N(y;\ \theta, \sigma_e^2)$$
$$p(\theta) = N(\theta;\ \mu_p, \sigma_p^2)$$

Posterior: $p(\theta \mid y) = N(\theta;\ \mu, \sigma^2)$ with
$$\frac{1}{\sigma^2} = \frac{1}{\sigma_e^2} + \frac{1}{\sigma_p^2}, \qquad \mu = \sigma^2 \left( \frac{\mu_p}{\sigma_p^2} + \frac{y}{\sigma_e^2} \right)$$

The posterior mean is a variance-weighted combination of the prior mean and the data.

Same thing – but expressed as precision weighting

Likelihood & prior:
$$p(y \mid \theta) = N(y;\ \theta, \lambda_e^{-1})$$
$$p(\theta) = N(\theta;\ \mu_p, \lambda_p^{-1})$$

Posterior: $p(\theta \mid y) = N(\theta;\ \mu, \lambda^{-1})$ with
$$\lambda = \lambda_e + \lambda_p, \qquad \mu = \frac{\lambda_e}{\lambda}\, y + \frac{\lambda_p}{\lambda}\, \mu_p$$

Relative precision weighting.

Same thing – but from an explicit hierarchical perspective

Likelihood & prior:
$$p(y^{(1)} \mid \theta^{(1)}) = N(y^{(1)};\ \theta^{(1)}, 1/\lambda^{(1)})$$
$$p(\theta^{(1)}) = N(\theta^{(1)};\ \theta^{(2)}, 1/\lambda^{(2)})$$

Posterior: $p(\theta^{(1)} \mid y) = N(\theta^{(1)};\ \mu, 1/\lambda)$ with
$$\lambda = \lambda^{(1)} + \lambda^{(2)}, \qquad \mu = \frac{\lambda^{(1)}}{\lambda}\, y^{(1)} + \frac{\lambda^{(2)}}{\lambda}\, \theta^{(2)}$$

Relative precision weighting.

Bayesian regression: univariate case

Univariate linear model: $y = x\theta + e$

Normal densities:
$$p(y \mid \theta) = N(y;\ x\theta, \sigma_e^2)$$
$$p(\theta) = N(\theta;\ \mu_p, \sigma_p^2)$$

Posterior: $p(\theta \mid y) = N(\theta;\ \mu_{\theta|y}, \sigma^2_{\theta|y})$ with
$$\frac{1}{\sigma^2_{\theta|y}} = \frac{x^2}{\sigma_e^2} + \frac{1}{\sigma_p^2}, \qquad \mu_{\theta|y} = \sigma^2_{\theta|y} \left( \frac{xy}{\sigma_e^2} + \frac{\mu_p}{\sigma_p^2} \right)$$

Relative precision weighting.

Bayesian GLM: multivariate case

General Linear Model: $y = X\theta + e$

Normal densities:
$$p(\theta) = N(\theta;\ \eta_p, C_p)$$
$$p(y \mid \theta) = N(y;\ X\theta, C_e)$$

Posterior: $p(\theta \mid y) = N(\theta;\ \eta_{\theta|y}, C_{\theta|y})$ with
$$C_{\theta|y} = \left( X^T C_e^{-1} X + C_p^{-1} \right)^{-1}$$
$$\eta_{\theta|y} = C_{\theta|y} \left( X^T C_e^{-1} y + C_p^{-1} \eta_p \right)$$

This is a one-step computation if $C_e$ is known; otherwise it requires iterative estimation with EM.

[Figure: three bivariate examples — "An intuitive example", "Still intuitive", "Less intuitive" — showing prior, likelihood and posterior as contours in the $(\theta_1, \theta_2)$ plane over the range −10 to 10; depending on the relative shapes and precisions of prior and likelihood, the posterior can sit in progressively less intuitive positions.]

Bayesian (fixed effects) group analysis

Likelihood distributions from different subjects are independent, so under Gaussian assumptions the group posterior is easy to compute: one can use the posterior from one subject as the prior for the next:

$$p(\theta \mid y_1) \propto p(y_1 \mid \theta)\, p(\theta)$$
$$p(\theta \mid y_1, y_2) \propto p(y_2 \mid \theta)\, p(\theta \mid y_1)$$
$$\vdots$$
$$p(\theta \mid y_1, \ldots, y_N) \propto p(y_N \mid \theta)\, p(\theta \mid y_1, \ldots, y_{N-1})$$

"Today's posterior is tomorrow's prior."

From the individual posterior covariances and means, the group posterior covariance and mean are:
$$C_{\theta|y_1,\ldots,y_N} = \left( \sum_{i=1}^{N} C_{\theta|y_i}^{-1} \right)^{-1}$$
$$\eta_{\theta|y_1,\ldots,y_N} = C_{\theta|y_1,\ldots,y_N} \sum_{i=1}^{N} C_{\theta|y_i}^{-1}\, \eta_{\theta|y_i}$$
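The two preceding slides — the multivariate posterior and "today's posterior is tomorrow's prior" — can be checked numerically. A minimal sketch (not from the original slides), assuming NumPy and a known noise covariance C_e; the design matrix, covariances and the two "subjects" are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented setup: 2 regressors, 20 scans per subject
X = rng.standard_normal((20, 2))        # design matrix
theta_true = np.array([1.0, -0.5])
C_e = 0.5 * np.eye(20)                  # known observation noise covariance
eta_p = np.zeros(2)                     # prior mean
C_p = 10.0 * np.eye(2)                  # prior covariance

def glm_posterior(y, X, C_e, eta_p, C_p):
    """Posterior mean & covariance for y = X theta + e (Gaussian prior/noise)."""
    C_post = np.linalg.inv(X.T @ np.linalg.inv(C_e) @ X + np.linalg.inv(C_p))
    eta_post = C_post @ (X.T @ np.linalg.inv(C_e) @ y
                         + np.linalg.inv(C_p) @ eta_p)
    return eta_post, C_post

# Two subjects' data: shared theta, independent noise
y1 = X @ theta_true + rng.multivariate_normal(np.zeros(20), C_e)
y2 = X @ theta_true + rng.multivariate_normal(np.zeros(20), C_e)

# Sequential updating: subject 1's posterior is the prior for subject 2
eta1, C1 = glm_posterior(y1, X, C_e, eta_p, C_p)
eta12, C12 = glm_posterior(y2, X, C_e, eta1, C1)

# Identical to analysing the stacked data in one step
eta_joint, C_joint = glm_posterior(np.concatenate([y1, y2]),
                                   np.vstack([X, X]),
                                   0.5 * np.eye(40), eta_p, C_p)
print(np.allclose(eta12, eta_joint), np.allclose(C12, C_joint))  # True True
```

The final check prints True twice: with Gaussian densities, sequential updating over subjects and a single joint analysis give exactly the same posterior.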
Bayesian analyses in SPM8
• Segmentation & spatial normalisation
• Posterior probability maps (PPMs)
  – 1st level: specific spatial priors
  – 2nd level: global spatial priors
• Dynamic Causal Modelling (DCM)
• Bayesian Model Selection (BMS)
• EEG: source reconstruction

Spatial normalisation: Bayesian regularisation

Deformations consist of a linear combination of smooth basis functions: the lowest frequencies of a 3D discrete cosine transform.

Find maximum a posteriori (MAP) estimates of the deformation parameters θ, i.e. simultaneously minimise
– the squared difference between template and source image, and
– the squared difference between the parameters and their priors (regularisation).

$$\log p(\theta \mid y) = \log p(y \mid \theta) + \log p(\theta) - \log p(y)$$

Here $\log p(y \mid \theta)$ measures the "difference" between template and source image, and $\log p(\theta)$ the squared distance between the parameters and their expected values.

Bayesian segmentation with empirical priors
• Goal: for each voxel, compute the probability that it belongs to a particular tissue type, given its intensity:
  p(tissue | intensity) ∝ p(intensity | tissue) ∙ p(tissue)
• Likelihood model: intensities are modelled by a mixture of Gaussian distributions representing different tissue classes (e.g. GM, WM, CSF).
• Priors are obtained from tissue probability maps (segmented images of 151 subjects).

Ashburner & Friston 2005, NeuroImage

Unified segmentation & normalisation
• Circular relationship between segmentation & normalisation:
  – Knowing which tissue type a voxel belongs to helps normalisation.
  – Knowing where a voxel is (in standard space) helps segmentation.
• Solution: build a joint generative model:
  – model how voxel intensities result from a mixture of tissue type distributions
  – model how the tissue types of one brain have to be spatially deformed to match those of another brain
• Using a priori knowledge about the parameters: adopt a Bayesian approach and maximise the posterior probability.

Ashburner & Friston 2005, NeuroImage

Bayesian fMRI analyses

General Linear Model:
$$y = X\theta + \varepsilon, \qquad \varepsilon \sim N(0, C_\varepsilon)$$

What are the priors?
• In "classical" SPM: no priors (= "flat" priors)
• Full Bayes: priors are predefined on a principled or empirical basis
• Empirical Bayes: priors are estimated from the data, assuming a hierarchical generative model

Hierarchical models and Empirical Bayes

Hierarchical model:
$$y = X^{(1)} \theta^{(1)} + \varepsilon^{(1)}$$
$$\theta^{(1)} = X^{(2)} \theta^{(2)} + \varepsilon^{(2)}$$
$$\vdots$$
$$\theta^{(n-1)} = X^{(n)} \theta^{(n)} + \varepsilon^{(n)}$$

Equivalent single-level model:
$$y = \varepsilon^{(1)} + X^{(1)} \varepsilon^{(2)} + \ldots + X^{(1)} \cdots X^{(n-1)} \varepsilon^{(n)} + X^{(1)} \cdots X^{(n)} \theta^{(n)}$$

The parameters of one level act as priors on the distribution of the parameters at the level below. Parameters and hyperparameters at each level can be estimated using EM; for these models, EM = Parametric Empirical Bayes (PEB) = Restricted Maximum Likelihood (ReML). This is the basis of PPMs in SPM.

Posterior Probability Maps (PPMs)

Posterior distribution: the probability of the effect given the data, $p(\theta \mid y)$
– mean: size of the effect
– precision: variability of the effect

Posterior probability map: an image of the probability (confidence) that an activation exceeds some specified threshold γ, given the data y:
$$p(\theta > \gamma \mid y) \geq \alpha$$

Two thresholds:
• activation threshold γ: a percentage of the whole-brain mean signal (a physiologically relevant size of effect)
• probability α that voxels must exceed to be displayed (e.g. 95%)

PPMs vs. SPMs

Bayesian test (PPM): $p(\theta > \gamma \mid y) \geq \alpha$, based on the posterior $p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$.
Classical t-test (SPM): compute $t = f(y)$ and reject $H_0: \theta = 0$ if $p(t > u \mid \theta = 0)$ is sufficiently small.

2nd level PPMs with global priors

1st level (GLM):
$$y = X\theta^{(1)} + \varepsilon^{(1)}, \qquad p(\varepsilon^{(1)}) = N(0, C_\varepsilon)$$

2nd level (shrinkage prior):
$$\theta^{(1)} = 0 + \varepsilon^{(2)}, \qquad p(\varepsilon^{(2)}) = N(0, C_\theta)$$

Basic idea: use the variance of $\theta^{(1)}$ over voxels as the prior variance of $\theta^{(1)}$ at any particular voxel. $\theta^{(1)}$ reflects regionally specific effects and is assumed to sum to zero over all voxels; this gives a shrinkage prior at the second level, whose variance is implicitly estimated by estimating $\varepsilon^{(2)}$. In the absence of evidence to the contrary, parameters will shrink towards zero.

[Figure: shrinkage priors illustrated for four cases — small & variable effect, large & variable effect, small but clear effect, large & clear effect.]

2nd level PPMs with global priors (continued)

The 1st-level noise covariance $C_\varepsilon$ is voxel-specific, whereas the prior covariance $C_\theta$ is a global, pooled estimate: since we are looking for the same effect over multiple voxels, $C_\theta$ can be estimated by pooling over voxels (Friston & Penny 2003, NeuroImage).

Once $C_\varepsilon$ and $C_\theta$ are known, we can apply the usual rule for computing the posterior mean & covariance:
$$C_{\theta|y} = \left( X^T C_\varepsilon^{-1} X + C_\theta^{-1} \right)^{-1}$$
$$\eta_{\theta|y} = C_{\theta|y}\, X^T C_\varepsilon^{-1} y$$
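A minimal sketch (not from the original slides) of what thresholding a PPM means at a single voxel, assuming NumPy/SciPy: the posterior mean and variance from the formulas above are converted into $p(\theta > \gamma \mid y)$, which is then compared against the display threshold α. All numbers are invented.

```python
import numpy as np
from scipy.stats import norm

def ppm_voxel(eta_post, var_post, gamma=0.0, alpha=0.95):
    """Posterior probability that an effect exceeds gamma, for one voxel.

    eta_post, var_post: posterior mean and variance of the effect theta
    gamma: activation threshold (e.g. a % of whole-brain mean signal)
    alpha: probability threshold for display (e.g. 0.95)
    """
    # p(theta > gamma | y) under a Gaussian posterior
    p_exceed = 1.0 - norm.cdf(gamma, loc=eta_post, scale=np.sqrt(var_post))
    return p_exceed, p_exceed >= alpha

# Two illustrative voxels with the same effect size but different variability
print(ppm_voxel(0.8, 0.04))   # precise effect: p close to 1 -> displayed
print(ppm_voxel(0.8, 1.00))   # variable effect: p ~ 0.79 -> not displayed
```

The two example voxels have identical posterior means; only the second, more variable one fails the 95% criterion, which is the sense in which PPMs separate effect size from effect variability.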
PPMs and multiple comparisons

There is no need to correct for multiple comparisons. Thresholding a PPM at 95% confidence means that, in every voxel displayed, the posterior probability of an activation exceeding γ is at least 95%; at most 5% of the voxels identified could have activations less than γ. Independent of the search volume, thresholding a PPM thus puts an upper bound on the false discovery rate.

[Figure: PPM vs. SPM for the same PET dataset and contrast. The PPM (height threshold P = 0.95, γ = 2.06, extent threshold k = 0 voxels) shows activations greater than a given size; the SPM (SPM{T}, height threshold T = 5.50, extent threshold k = 0 voxels) shows voxels with non-zero activations.]

PPMs: pros and cons

Advantages:
• One can infer that a cause did not elicit a response.
• Inference is independent of the search volume.
• Unlike SPMs, PPMs do not conflate effect size and effect variability.

Disadvantages:
• Estimating priors over voxels is computationally demanding.
• Practical benefits are yet to be established.
• Thresholds other than zero require justification.

1st level PPMs with local spatial priors
• Neighbouring voxels are often not independent.
• Spatial dependencies vary across the brain.
• But spatial smoothing in SPM is uniform.
• Matched filter theorem: SNR is maximal when the data are smoothed with a kernel that matches the smoothness of the true signal.
• Basic idea: estimate regional spatial dependencies from the data (e.g. from contrast and AR(1) maps) and use this as a prior in a PPM → regionally specific smoothing → markedly increased sensitivity.

Penny et al. 2005, NeuroImage

The generative spatio-temporal model

$$Y = XW + E$$

with priors on the regression coefficients W, the autoregressive (AR) parameters A, and the observation noise precisions λ:
$$p(W) = \prod_{k=1}^{K} N\!\left( w_k^T;\ 0,\ \alpha_k^{-1} (S^T S)^{-1} \right), \qquad p(\alpha) = \prod_{k=1}^{K} \mathrm{Ga}(\alpha_k;\ q_1, q_2)$$
$$p(A) = \prod_{p=1}^{P} N\!\left( a_p;\ 0,\ \beta_p^{-1} (S^T S)^{-1} \right), \qquad p(\beta) = \prod_{p=1}^{P} \mathrm{Ga}(\beta_p;\ r_1, r_2)$$
$$p(\lambda) = \prod_{n=1}^{N} \mathrm{Ga}(\lambda_n;\ u_1, u_2)$$

where α = spatial precisions of the regression coefficients, β = precisions of the AR coefficients, and λ = observation noise precisions.

Penny et al. 2005, NeuroImage

The spatial prior

Prior for the k-th parameter:
$$p(w_k^T) = N\!\left( w_k^T;\ 0,\ \alpha_k^{-1} (S^T S)^{-1} \right)$$

This is a shrinkage prior; the spatial precision $\alpha_k$ determines the amount of smoothness. Different choices are possible for the spatial kernel matrix S; currently SPM uses a Laplacian prior (the same as in LORETA).

[Figure: example application to event-related fMRI data — contrast maps for familiar vs. non-familiar faces, obtained with (i) smoothing, (ii) a global spatial prior, and (iii) a Laplacian prior.]

Bayesian model selection (BMS)

Given competing hypotheses about the structure & functional mechanisms of a system, which model is the best?
• Which model represents the best balance between model fit and model complexity?
• For which model m does p(y|m) become maximal?

Pitt & Myung (2002), TICS

Bayesian model selection (BMS)

Bayes' rule, conditioned on a model m:
$$p(\theta \mid y, m) = \frac{p(y \mid \theta, m)\, p(\theta \mid m)}{p(y \mid m)}$$

Model evidence:
$$p(y \mid m) = \int p(y \mid \theta, m)\, p(\theta \mid m)\, d\theta$$
• accounts for both accuracy and complexity of the model
• allows for inference about the structure (generalisability) of the model

Various approximations exist, e.g. the negative free energy, AIC, BIC (Penny et al. 2004, NeuroImage).

Model comparison via the Bayes factor:
$$BF = \frac{p(y \mid m_1)}{p(y \mid m_2)}$$
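For linear-Gaussian models the evidence integral has a closed form, so the Bayes factor can be computed exactly in that special case. Below is a sketch under that assumption (not from the original slides; DCM itself relies on approximations such as the negative free energy), using NumPy/SciPy with invented data and models.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)

def log_evidence(y, X, C_e, eta_p, C_p):
    """Log model evidence log p(y|m) for a linear-Gaussian model.

    For y = X theta + e with Gaussian prior and noise, the marginal
    likelihood is Gaussian: p(y|m) = N(y; X eta_p, C_e + X C_p X^T).
    """
    mean = X @ eta_p
    cov = C_e + X @ C_p @ X.T
    return multivariate_normal.logpdf(y, mean=mean, cov=cov)

# Invented example: data generated with two regressors
n = 50
X_full = rng.standard_normal((n, 2))
y = X_full @ np.array([1.0, 0.8]) + 0.5 * rng.standard_normal(n)
C_e = 0.25 * np.eye(n)

# m1: only the first regressor; m2: both regressors
log_ev1 = log_evidence(y, X_full[:, :1], C_e, np.zeros(1), 10.0 * np.eye(1))
log_ev2 = log_evidence(y, X_full, C_e, np.zeros(2), 10.0 * np.eye(2))

# Bayes factor BF_21 = p(y|m2) / p(y|m1); the complexity penalty is
# built into the evidence through the integral over the prior
print("log BF_21 =", log_ev2 - log_ev1)
```

Because averaging the likelihood over the prior automatically penalises unnecessary parameters, the evidence rewards accuracy and penalises complexity without any explicit correction term.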
Example: BMS of dynamic causal models

[Figure: four DCMs of the attention-to-motion dataset, each comprising regions V1, V5 and PPC with driving input "stim" entering V1.]
• M1 vs. M2 — attentional modulation of the backward or of the forward connection? BF = 2966: M2 better than M1.
• M3 — additional driving effect of attention on PPC? BF = 12: M3 better than M2.
• M4 — bilinear or nonlinear modulation of the forward connection? BF = 23: M4 better than M3.

Stephan et al. (2008) NeuroImage

Thank you