Bayesian Inference
Chris Mathys, Wellcome Trust Centre for Neuroimaging, UCL
SPM Course London, May 12, 2014
Thanks to Jean Daunizeau and Jérémie Mattout for previous versions of this talk

A spectacular piece of information
Messerli, F. H. (2012). Chocolate Consumption, Cognitive Function, and Nobel Laureates. New England Journal of Medicine, 367(16), 1562–1564.

So will I win the Nobel prize if I eat lots of chocolate?
This is a question about uncertain quantities. Like almost all scientific questions, it cannot be answered by deductive logic. Nonetheless, quantitative answers can be given – but only in terms of probabilities.
Our question can be rephrased as a conditional probability:
p(\text{Nobel} \mid \text{lots of chocolate}) = \; ?
To answer it, we have to learn to calculate such quantities. The tool for this is Bayesian inference.

«Bayesian» = logical and logical = probabilistic
«The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful, none of which (fortunately) we have to reason on. Therefore the true logic for this world is the calculus of probabilities, which takes account of the magnitude of the probability which is, or ought to be, in a reasonable man's mind.»
— James Clerk Maxwell, 1850

«Bayesian» = logical and logical = probabilistic
But in what sense is probabilistic reasoning (i.e., reasoning about uncertain quantities according to the rules of probability theory) «logical»?
R. T. Cox showed in 1946 that the rules of probability theory can be derived from three basic desiderata:
1. Representation of degrees of plausibility by real numbers
2. Qualitative correspondence with common sense (in a well-defined sense)
3. Consistency

The rules of probability
By mathematical proof (i.e., by deductive reasoning), the three desiderata set out by Cox imply the rules of probability (i.e., the rules of inductive reasoning). This means that anyone who accepts the desiderata must accept the following rules:
1. \sum_a p(a) = 1   (Normalization)
2. p(b) = \sum_a p(a, b)   (Marginalization – also called the sum rule)
3. p(a, b) = p(a \mid b)\, p(b) = p(b \mid a)\, p(a)   (Conditioning – also called the product rule)
«Probability theory is nothing but common sense reduced to calculation.»
— Pierre-Simon Laplace, 1819

Conditional probabilities
The probability of a given b is denoted by p(a \mid b). In general, this is different from the probability of a alone (the marginal probability of a), as we can see by applying the sum and product rules:
p(a) = \sum_b p(a, b) = \sum_b p(a \mid b)\, p(b)
Because of the product rule, we also have the following rule (Bayes' theorem) for going from p(b \mid a) to p(a \mid b):
p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)} = \frac{p(b \mid a)\, p(a)}{\sum_{a'} p(b \mid a')\, p(a')}

The chocolate example
In our example, it is immediately clear that p(\text{Nobel} \mid \text{chocolate}) is very different from p(\text{chocolate} \mid \text{Nobel}). While the first is hopeless to determine directly, the second is much easier to find out: ask Nobel laureates how much chocolate they eat. Once we know that, we can use Bayes' theorem (under a model m):
p(\text{Nobel} \mid \text{chocolate}) = \frac{p(\text{chocolate} \mid \text{Nobel})\, p(\text{Nobel})}{p(\text{chocolate})}
posterior = likelihood × prior / evidence
Inference on the quantities of interest in fMRI/DCM studies has exactly the same general structure.
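A minimal numerical sketch of the sum and product rules applied to the chocolate question. All numbers (base rate, likelihoods) are invented purely to illustrate the arithmetic; they are not taken from Messerli (2012) or from the talk.

```python
import numpy as np

# Hypothetical illustration of Bayes' theorem for the chocolate example.
# Every number below is invented for the sake of the arithmetic.

p_nobel = 1e-6               # prior p(Nobel): base rate of becoming a laureate
p_choc_given_nobel = 0.7     # likelihood p(lots of chocolate | Nobel)
p_choc_given_not = 0.3       # p(lots of chocolate | no Nobel)

# evidence p(chocolate), by marginalization (sum rule)
p_choc = p_choc_given_nobel * p_nobel + p_choc_given_not * (1 - p_nobel)

# posterior p(Nobel | chocolate), by the product rule / Bayes' theorem
p_nobel_given_choc = p_choc_given_nobel * p_nobel / p_choc
print(p_nobel_given_choc)    # still tiny: a modest likelihood ratio cannot rescue a minute prior
```

Note how the evidence in the denominator is itself obtained by marginalization, so the two rules of probability are all that is needed.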
Inference in SPM
Forward problem: likelihood p(y \mid \theta, m)
Inverse problem: posterior distribution p(\theta \mid y, m)

Inference in SPM
Likelihood: p(y \mid \theta, m)
Prior: p(\theta \mid m)
Bayes' theorem: p(\theta \mid y, m) = \frac{p(y \mid \theta, m)\, p(\theta \mid m)}{p(y \mid m)}
Together, likelihood and prior constitute the generative model m.

A simple example of Bayesian inference (adapted from Jaynes (1976))
Two manufacturers, A and B, deliver the same kind of components that turn out to have the following lifetimes (in hours): [lifetime samples for A and B shown as dot plots on the slide]
Assuming prices are comparable, from which manufacturer would you buy?

A simple example of Bayesian inference
How do we compare such samples?

A simple example of Bayesian inference
What next?

A simple example of Bayesian inference
The procedure in brief:
• Determine your question of interest («What is the probability that...?»)
• Specify your model (likelihood and prior)
• Calculate the full posterior using Bayes' theorem
• [Pass to the uninformative limit in the parameters of your prior]
• Integrate out any nuisance parameters
• Ask your question of interest of the posterior
All you need is the rules of probability theory. (OK, sometimes you'll encounter a nasty integral – but that's a technical difficulty, not a conceptual one.)

A simple example of Bayesian inference
The question:
• What is the probability that the components from manufacturer B have a longer lifetime than those from manufacturer A?
• More specifically: given how much more expensive they are, how much longer do I require the components from B to live?
• Example of a decision rule: if the components from B live at least 3 hours longer than those from A with a probability of at least 80%, I will choose those from B.

A simple example of Bayesian inference
The model (bear with me, this will turn out to be simple):
• Likelihood (Gaussian):
  p(\{x_i\} \mid \mu, \lambda) = \prod_{i=1}^{n} \left(\frac{\lambda}{2\pi}\right)^{1/2} \exp\left(-\frac{\lambda}{2}(x_i - \mu)^2\right)
• Prior (Gaussian-gamma):
  p(\mu, \lambda \mid \mu_0, \kappa_0, a_0, b_0) = \mathcal{N}\left(\mu \mid \mu_0, (\kappa_0 \lambda)^{-1}\right) \mathrm{Gam}(\lambda \mid a_0, b_0)

A simple example of Bayesian inference
The posterior (Gaussian-gamma):
p(\mu, \lambda \mid \{x_i\}) = \mathcal{N}\left(\mu \mid \mu_n, (\kappa_n \lambda)^{-1}\right) \mathrm{Gam}(\lambda \mid a_n, b_n)
Parameter updates:
\mu_n = \mu_0 + \frac{n}{\kappa_0 + n}(\bar{x} - \mu_0), \quad \kappa_n = \kappa_0 + n, \quad a_n = a_0 + \frac{n}{2}, \quad b_n = b_0 + \frac{n}{2}\left(s^2 + \frac{\kappa_0}{\kappa_0 + n}(\bar{x} - \mu_0)^2\right)
with \bar{x} \equiv \frac{1}{n}\sum_{i=1}^{n} x_i, \quad s^2 \equiv \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2

A simple example of Bayesian inference
The limit for which the prior becomes uninformative:
• For \kappa_0 = 0, a_0 = 0, b_0 = 0, the updates reduce to:
  \mu_n = \bar{x}, \quad \kappa_n = n, \quad a_n = \frac{n}{2}, \quad b_n = \frac{n}{2} s^2
• As promised, this is really simple: all you need is n, the number of data points; \bar{x}, their mean; and s^2, their variance.
• This means that only the data influence the posterior; all influence from the parameters of the prior has been eliminated.
• The uninformative limit should only ever be taken after the calculation of the posterior using a proper prior.

A simple example of Bayesian inference
Integrating out the nuisance parameter \lambda gives rise to a t-distribution: [marginal posterior of \mu shown on the slide]

A simple example of Bayesian inference
The joint posterior p(\mu_A, \mu_B \mid \{x_i\}_A, \{x_i\}_B) of our two independent posteriors p(\mu_A \mid \{x_i\}_A) and p(\mu_B \mid \{x_i\}_B) is simply their product. It now gives us the answer to our question:
p(\mu_B - \mu_A > 3) = \int_{-\infty}^{\infty} \mathrm{d}\mu_A\, p(\mu_A \mid \{x_i\}_A) \int_{\mu_A + 3}^{\infty} \mathrm{d}\mu_B\, p(\mu_B \mid \{x_i\}_B) = 0.9501
Note that the t-test told us that there was «no significant difference», even though there is a >95% probability that the parts from B will last at least 3 hours longer than those from A.
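A minimal sketch of the Gaussian-gamma updates above, followed by a Monte Carlo version of the final integral. It assumes the uninformative limit of the prior (\kappa_0 = a_0 = b_0 = 0); the lifetime samples are placeholders (the real values appear only graphically on the slide), so the printed probability will not reproduce the 0.9501 quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_params(x, mu0=0.0, kappa0=0.0, a0=0.0, b0=0.0):
    """Conjugate Gaussian-gamma update for a Gaussian with unknown mean and precision.
    Defaults correspond to the uninformative limit on the slide (proper once n >= 2)."""
    x = np.asarray(x, dtype=float)
    n, xbar, s2 = x.size, x.mean(), x.var()      # s2 = (1/n) * sum (x_i - xbar)^2
    kappa_n = kappa0 + n
    mu_n = mu0 + n / kappa_n * (xbar - mu0)
    a_n = a0 + n / 2
    b_n = b0 + n / 2 * (s2 + kappa0 / kappa_n * (xbar - mu0) ** 2)
    return mu_n, kappa_n, a_n, b_n

def sample_mu(params, size):
    """Draw mu from its marginal posterior (a t-distribution) by ancestral sampling:
    lambda ~ Gam(a_n, rate=b_n), then mu | lambda ~ N(mu_n, 1/(kappa_n * lambda))."""
    mu_n, kappa_n, a_n, b_n = params
    lam = rng.gamma(shape=a_n, scale=1.0 / b_n, size=size)   # numpy uses scale = 1/rate
    return rng.normal(mu_n, 1.0 / np.sqrt(kappa_n * lam))

# Placeholder lifetime data (hours) - replace with the actual samples from the slide.
x_A = [35.0, 38.2, 41.7, 44.1, 46.8, 50.3, 52.9]
x_B = [48.1, 49.6, 51.2, 53.0, 55.7]

mu_A = sample_mu(posterior_params(x_A), 200_000)
mu_B = sample_mu(posterior_params(x_B), 200_000)
print("p(mu_B - mu_A > 3) approx.", np.mean(mu_B - mu_A > 3))
```

Sampling the precision and then the mean is just a convenient way of integrating out the nuisance parameter \lambda numerically; the double integral above could equally be evaluated with the closed-form t-distributions.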
Bayesian inference
The procedure in brief:
• Determine your question of interest («What is the probability that...?»)
• Specify your model (likelihood and prior)
• Calculate the full posterior using Bayes' theorem
• [Pass to the uninformative limit in the parameters of your prior]
• Integrate out any nuisance parameters
• Ask your question of interest of the posterior
All you need is the rules of probability theory.

Frequentist (or: orthodox, classical) versus Bayesian inference: hypothesis testing
Classical:
• define the null, e.g. H_0: \theta = 0
• estimate parameters (obtain the test statistic t \equiv t(Y) and its distribution p(t \mid H_0))
• apply the decision rule: if p(t > t^* \mid H_0) \le \alpha, then reject H_0
Bayesian:
• invert the model (obtain the posterior pdf p(\theta \mid y))
• define the null, e.g. H_0: \theta > \theta_0
• apply the decision rule: if p(H_0 \mid y) \ge \alpha, then accept H_0

Model comparison: general principles
• Principle of parsimony: «plurality should not be assumed without necessity»
• Automatically enforced by Bayesian model comparison («Occam's razor»: a model that spreads its evidence p(y \mid m) over the space of all possible data sets cannot assign much of it to any particular data set y = f(x))
Model evidence:
p(y \mid m) = \int p(y \mid \theta, m)\, p(\theta \mid m)\, \mathrm{d}\theta \approx \exp(\mathrm{accuracy} - \mathrm{complexity})

Model comparison: negative variational free energy F
\text{log-model evidence} \equiv \log p(y \mid m) = \log \int p(y, \theta \mid m)\, \mathrm{d}\theta   (sum rule)
= \log \int q(\theta)\, \frac{p(y, \theta \mid m)}{q(\theta)}\, \mathrm{d}\theta   (multiply by 1 = q(\theta)/q(\theta))
\ge \int q(\theta) \log \frac{p(y, \theta \mid m)}{q(\theta)}\, \mathrm{d}\theta   (Jensen's inequality)
=: F, the negative variational free energy – a lower bound on the log-model evidence.
F \equiv \int q(\theta) \log \frac{p(y, \theta \mid m)}{q(\theta)}\, \mathrm{d}\theta
= \int q(\theta) \log \frac{p(y \mid \theta, m)\, p(\theta \mid m)}{q(\theta)}\, \mathrm{d}\theta   (product rule)
= \int q(\theta) \log p(y \mid \theta, m)\, \mathrm{d}\theta - \mathrm{KL}\left[q(\theta),\, p(\theta \mid m)\right]
= accuracy (expected log-likelihood) − complexity (Kullback-Leibler divergence)

Model comparison: F in relation to Bayes factors, AIC, BIC (see the numerical sketch at the end of this section)
Bayes factor \equiv \frac{p(y \mid m_1)}{p(y \mid m_0)} = \exp\left(\log p(y \mid m_1) - \log p(y \mid m_0)\right) \approx \exp(F_1 - F_0)
[Meaning of the Bayes factor: \frac{p(m_1 \mid y)}{p(m_0 \mid y)} = \frac{p(y \mid m_1)}{p(y \mid m_0)} \cdot \frac{p(m_1)}{p(m_0)}, i.e. posterior odds = Bayes factor × prior odds]
F = \int q(\theta) \log p(y \mid \theta, m)\, \mathrm{d}\theta - \mathrm{KL}\left[q(\theta),\, p(\theta \mid m)\right] = \mathrm{Accuracy} - \mathrm{Complexity}
\mathrm{AIC} \equiv \mathrm{Accuracy} - p   (p: number of parameters)
\mathrm{BIC} \equiv \mathrm{Accuracy} - \frac{p}{2} \log N   (N: number of data points)

A note on informative priors
• Any model consists of two parts: likelihood and prior.
• The choice of likelihood requires as much justification as the choice of prior because it is just as «subjective» as that of the prior.
• The data never speak for themselves. They only acquire meaning when seen through the lens of a model. However, this does not mean that all is subjective, because models differ in their validity.
• In this light, the widespread concern that informative priors might bias results (while the form of the likelihood is taken as a matter of course requiring no justification) is misplaced.
• Informative priors are an important tool, and their use can be justified by establishing the validity (face, construct, and predictive) of the resulting model as well as by model comparison.
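A minimal numerical sketch of the Bayes-factor arithmetic referenced above, assuming the negative free energies F_0 and F_1 have already been obtained from a variational inversion of two models; the values used here are invented placeholders standing in for the output of a model-fitting routine.

```python
import numpy as np

# Placeholder approximate log-evidences (free energies) for two models m0 and m1.
F = np.array([-1520.4, -1512.7])          # [F_0, F_1]

# log Bayes factor and Bayes factor in favour of m1 over m0
log_bf_10 = F[1] - F[0]
bf_10 = np.exp(log_bf_10)

# Posterior model probabilities under equal prior odds p(m0) = p(m1) = 1/2
# (posterior odds = Bayes factor * prior odds); subtract the max for numerical stability.
post = np.exp(F - F.max())
post /= post.sum()

print(f"log BF_10 = {log_bf_10:.1f}, BF_10 = {bf_10:.3g}")
print("posterior model probabilities:", post)
```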
Applications of Bayesian inference

[Overview of the SPM analysis pipeline: realignment, smoothing, segmentation and normalisation against a template, general linear model, and statistical inference via Gaussian field theory (p < 0.05) – with Bayesian counterparts such as posterior probability maps (PPMs), dynamic causal modelling, and multivariate decoding]

Segmentation (mixture-of-Gaussians model)
[Graphical model: the ith voxel value y_i is generated from its voxel label c_i, drawn according to the class frequencies \lambda, with class means \mu_1, ..., \mu_k and class variances \sigma_1^2, ..., \sigma_k^2; example segmentation into grey matter, white matter, and CSF]

fMRI time series analysis
[Generative model of the fMRI time series: GLM coefficients with their prior variance, AR coefficients for correlated noise, prior variance of the data noise, and design matrices X; PPMs show the regions best explained by the short-term versus the long-term memory model]

Dynamic causal modeling (DCM)
[Four models m1–m4 of the attention-to-motion network (stim → V1 → V5, with PPC), differing in which connection attention modulates; the marginal likelihood \ln p(y \mid m) favours m4, and the estimated effective synaptic strengths for the best model (m4) are shown]

Model comparison for group studies
Differences in log-model evidences \ln p(y \mid m_1) - \ln p(y \mid m_2) across subjects:
• Fixed effect: assume all subjects correspond to the same model (see the sketch below)
• Random effect: assume different subjects might correspond to different models

Thanks
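As referenced in the fixed-effect bullet above, a minimal sketch of fixed-effects group-level model comparison: under the assumption that all subjects correspond to the same model, the per-subject log-evidence differences simply add up. The numbers are invented placeholders; a random-effects analysis (e.g. the spm_BMS routine in SPM) instead treats the model identity as varying across subjects and is not sketched here.

```python
import numpy as np

# ln p(y_s | m1) - ln p(y_s | m2) for each subject s (placeholder values)
dlogev = np.array([2.3, -0.8, 1.9, 3.1, 0.4, 2.7, -1.2, 1.5])

# Fixed-effects assumption: one model for all subjects, so log evidences add.
group_log_bf = dlogev.sum()
print(f"group log Bayes factor (m1 vs m2): {group_log_bf:.1f}")
print(f"group Bayes factor: {np.exp(group_log_bf):.3g}")
```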