Bayesian Inference
Chris Mathys
Wellcome Trust Centre for Neuroimaging, UCL
SPM Course (M/EEG), London, May 14, 2013

Thanks to Jean Daunizeau and Jérémie Mattout for previous versions of this talk

A spectacular piece of information
Messerli, F. H. (2012). Chocolate Consumption, Cognitive Function, and Nobel Laureates. New England Journal of Medicine, 367(16), 1562–1564.

So will I win the Nobel prize if I eat lots of chocolate?
This is a question about uncertain quantities. Like almost all scientific questions, it cannot be answered by deductive logic. Nonetheless, quantitative answers can be given – but they can only be given in terms of probabilities. Our question here can be rephrased in terms of a conditional probability:
$$p(\text{Nobel} \mid \text{chocolate}) = \;?$$
To answer it, we have to learn to calculate such quantities. The tool for doing that is Bayesian inference.

«Bayesian» = logical, and logical = probabilistic
«The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful, none of which (fortunately) we have to reason on. Therefore the true logic for this world is the calculus of probabilities, which takes account of the magnitude of the probability which is, or ought to be, in a reasonable man's mind.»
— James Clerk Maxwell, 1850

But in what sense is probabilistic reasoning (i.e., reasoning about uncertain quantities according to the rules of probability theory) «logical»?
R. T. Cox showed in 1946 that the rules of probability theory can be derived from three basic desiderata:
1. Representation of degrees of plausibility by real numbers
2. Qualitative correspondence with common sense (in a well-defined sense)
3. Consistency

The rules of probability
By mathematical proof (i.e., by deductive reasoning), the three desiderata set out by Cox imply the rules of probability (i.e., the rules of inductive reasoning). This means that anyone who accepts the desiderata must accept the following rules:
1. $\sum_a p(a) = 1$  (normalization)
2. $p(b) = \sum_a p(a, b)$  (marginalization – also called the sum rule)
3. $p(a, b) = p(a \mid b)\, p(b) = p(b \mid a)\, p(a)$  (conditioning – also called the product rule)
«Probability theory is nothing but common sense reduced to calculation.»
— Pierre-Simon Laplace, 1819

Conditional probabilities
The probability of $a$ given $b$ is denoted by $p(a \mid b)$. In general, this is different from the probability of $a$ alone (the marginal probability of $a$), as we can see by applying the sum and product rules:
$$p(a) = \sum_b p(a, b) = \sum_b p(a \mid b)\, p(b)$$
Because of the product rule, we also have the following rule (Bayes' theorem) for going from $p(b \mid a)$ to $p(a \mid b)$:
$$p(a \mid b) = \frac{p(b \mid a)\, p(a)}{p(b)} = \frac{p(b \mid a)\, p(a)}{\sum_{a'} p(b \mid a')\, p(a')}$$

The chocolate example
In our example, it is immediately clear that $p(\text{Nobel} \mid \text{chocolate})$ is very different from $p(\text{chocolate} \mid \text{Nobel})$. While the first is hopeless to determine directly, the second is much easier to assess: ask Nobel laureates how much chocolate they eat. Once we know that, we can use Bayes' theorem:
$$p(\text{Nobel} \mid \text{chocolate}) = \frac{p(\text{chocolate} \mid \text{Nobel})\, p(\text{Nobel})}{p(\text{chocolate})}$$
Here the left-hand side is the posterior, the numerator is the likelihood times the prior, and the denominator is the evidence (all implicitly conditional on a model).
Inference on the quantities of interest in M/EEG studies has exactly the same general structure.
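To make the mechanics of the sum and product rules concrete, here is a minimal Python sketch of Bayes' theorem applied to the chocolate question. All numerical values are hypothetical placeholders chosen for illustration only; they are not estimates from the talk or from Messerli (2012).

```python
# Minimal sketch of Bayes' theorem for the chocolate question.
# Every probability below is a hypothetical placeholder, not an estimate
# from the talk or from Messerli (2012).

def bayes(likelihood: float, prior: float, evidence: float) -> float:
    """Bayes' theorem: p(a | b) = p(b | a) * p(a) / p(b)."""
    return likelihood * prior / evidence

p_nobel = 1e-7                # prior p(Nobel): winning a Nobel prize is very rare
p_choc_given_nobel = 0.9      # likelihood p(chocolate | Nobel), hypothetical
p_choc_given_no_nobel = 0.3   # p(chocolate | no Nobel), hypothetical

# Evidence p(chocolate) from the sum and product rules (marginalization):
p_choc = p_choc_given_nobel * p_nobel + p_choc_given_no_nobel * (1 - p_nobel)

p_nobel_given_choc = bayes(p_choc_given_nobel, p_nobel, p_choc)
print(f"p(Nobel | chocolate) = {p_nobel_given_choc:.2e}")
# Even a strong association leaves the posterior tiny: the prior dominates.
```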
Inference in M/EEG
The forward problem is described by the likelihood $p(y \mid \vartheta, m)$; solving the inverse problem means obtaining the posterior distribution $p(\vartheta \mid y, m)$.

Likelihood: $p(y \mid \vartheta, m)$
Prior: $p(\vartheta \mid m)$
Bayes' theorem:
$$p(\vartheta \mid y, m) = \frac{p(y \mid \vartheta, m)\, p(\vartheta \mid m)}{p(y \mid m)}$$
The likelihood and the prior together specify the generative model $m$.

A simple example of Bayesian inference (adapted from Jaynes (1976))
Two manufacturers, A and B, deliver the same kind of components that turn out to have the following lifetimes (in hours):
[Figure: the lifetime samples for A and B]
Assuming prices are comparable, from which manufacturer would you buy?

How do we compare such samples?
• By comparing their arithmetic means
Why do we take means?
• If we take the mean as our estimate, the error in our estimate is the mean of the errors in the individual measurements
• Taking the mean as maximum-likelihood estimate implies a Gaussian error distribution
• A Gaussian error distribution appropriately reflects our prior knowledge about the errors whenever we know nothing about them except perhaps their variance

What next?
• Let's do a t-test: not significant!
Is this satisfactory?
• No. So what can we learn by turning to probability theory (i.e., Bayesian inference)?

The procedure in brief:
• Determine your question of interest («What is the probability that...?»)
• Specify your model (likelihood and prior)
• Calculate the full posterior using Bayes' theorem
• [Pass to the uninformative limit in the parameters of your prior]
• Integrate out any nuisance parameters
• Ask your question of interest of the posterior
All you need is the rules of probability theory. (OK, sometimes you'll encounter a nasty integral – but that's a technical difficulty, not a conceptual one.)

The question:
• What is the probability that the components from manufacturer B have a longer lifetime than those from manufacturer A?
• More specifically: given how much more expensive they are, how much longer do I require the components from B to live?
• Example of a decision rule: if the components from B live at least 3 hours longer than those from A with a probability of at least 80%, I will choose those from B.

The model:
• Likelihood (Gaussian):
$$p(\{x_i\} \mid \mu, \lambda) = \prod_{i=1}^{n} \left(\frac{\lambda}{2\pi}\right)^{\!1/2} \exp\!\left(-\frac{\lambda}{2}(x_i - \mu)^2\right)$$
• Prior (Gaussian-gamma):
$$p(\mu, \lambda \mid \mu_0, \kappa_0, a_0, b_0) = \mathcal{N}\!\left(\mu \mid \mu_0, (\kappa_0 \lambda)^{-1}\right) \mathrm{Gam}(\lambda \mid a_0, b_0)$$

The posterior (Gaussian-gamma):
$$p(\mu, \lambda \mid \{x_i\}) = \mathcal{N}\!\left(\mu \mid \mu_n, (\kappa_n \lambda)^{-1}\right) \mathrm{Gam}(\lambda \mid a_n, b_n)$$
Parameter updates:
$$\mu_n = \mu_0 + \frac{n}{\kappa_0 + n}\,(\bar{x} - \mu_0), \qquad \kappa_n = \kappa_0 + n, \qquad a_n = a_0 + \frac{n}{2}, \qquad b_n = b_0 + \frac{n}{2}\left(s^2 + \frac{\kappa_0\,(\bar{x} - \mu_0)^2}{\kappa_0 + n}\right)$$
with
$$\bar{x} \equiv \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s^2 \equiv \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

The limit for which the prior becomes uninformative:
• For $\kappa_0 = 0$, $a_0 = 0$, $b_0 = 0$, the updates reduce to:
$$\mu_n = \bar{x}, \qquad \kappa_n = n, \qquad a_n = \frac{n}{2}, \qquad b_n = \frac{n}{2}\,s^2$$
• This means that only the data influence the posterior; all influence from the parameters of the prior has been eliminated.
• This limit should only ever be taken after the calculation of the posterior using a proper prior.
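To make the updates above concrete, here is a minimal Python/NumPy sketch that computes the Gaussian-gamma posterior parameters in the uninformative limit and then answers the question of interest by Monte Carlo sampling. The lifetime samples `x_A` and `x_B` are hypothetical placeholders (the talk shows the actual data only graphically), so the resulting probability will differ from the value of 0.9501 reported later in the talk.

```python
# Minimal sketch of the conjugate Gaussian-gamma updates from the slides and a
# Monte Carlo answer to the question of interest. The lifetime data below are
# hypothetical placeholders, so the result will not match the slides' 0.9501.
import numpy as np

rng = np.random.default_rng(0)

def posterior_params(x, mu0=0.0, kappa0=0.0, a0=0.0, b0=0.0):
    """Update (mu0, kappa0, a0, b0) -> (mu_n, kappa_n, a_n, b_n) for the
    Gaussian-gamma posterior p(mu, lambda | x). The defaults correspond to the
    uninformative limit discussed above."""
    x = np.asarray(x, dtype=float)
    n, xbar, s2 = x.size, x.mean(), x.var()   # s2 = (1/n) * sum (x_i - xbar)^2
    mu_n    = mu0 + n / (kappa0 + n) * (xbar - mu0)
    kappa_n = kappa0 + n
    a_n     = a0 + n / 2
    b_n     = b0 + n / 2 * (s2 + kappa0 * (xbar - mu0) ** 2 / (kappa0 + n))
    return mu_n, kappa_n, a_n, b_n

def sample_mu(params, size):
    """Draw mu from the posterior by first drawing lambda ~ Gam(a_n, b_n)
    (rate parametrisation) and then mu ~ N(mu_n, 1 / (kappa_n * lambda))."""
    mu_n, kappa_n, a_n, b_n = params
    lam = rng.gamma(a_n, 1.0 / b_n, size=size)
    return rng.normal(loc=mu_n, scale=1.0 / np.sqrt(kappa_n * lam))

# Hypothetical lifetime samples (hours) for manufacturers A and B:
x_A = [41.0, 46.5, 39.0, 42.5, 44.0, 40.5, 43.5, 45.0, 38.5]
x_B = [49.5, 52.0, 47.5, 50.5]

mu_A = sample_mu(posterior_params(x_A), size=100_000)
mu_B = sample_mu(posterior_params(x_B), size=100_000)

# Question of interest: with what probability do B's components last at least
# 3 hours longer than A's?
print("P(mu_B - mu_A > 3) ≈", np.mean(mu_B - mu_A > 3))
```

Sampling $\lambda$ from its Gamma posterior and then $\mu$ from the conditional Gaussian integrates out the nuisance parameter $\lambda$ numerically, which is exactly what the next step of the talk does analytically.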
Integrating out the nuisance parameter $\lambda$ gives rise to a t-distribution: the marginal posterior $p(\mu \mid \{x_i\}) = \int p(\mu, \lambda \mid \{x_i\})\, \mathrm{d}\lambda$ is a Student's t-distribution with $2a_n$ degrees of freedom, location $\mu_n$, and squared scale $b_n / (a_n \kappa_n)$.

The joint posterior $p(\mu_A, \mu_B \mid \{x_i\}_A, \{x_i\}_B)$ is simply the product of our two independent posteriors $p(\mu_A \mid \{x_i\}_A)$ and $p(\mu_B \mid \{x_i\}_B)$. It will now give us the answer to our question:
$$p(\mu_B - \mu_A > 3) = \int_{-\infty}^{\infty} \mathrm{d}\mu_A\; p(\mu_A \mid \{x_i\}_A) \int_{\mu_A + 3}^{\infty} \mathrm{d}\mu_B\; p(\mu_B \mid \{x_i\}_B) = 0.9501$$
Note that the t-test told us that there was «no significant difference», even though there is a more than 95% probability that the parts from B will last at least 3 hours longer than those from A.

Bayesian inference
The procedure in brief:
• Determine your question of interest («What is the probability that...?»)
• Specify your model (likelihood and prior)
• Calculate the full posterior using Bayes' theorem
• [Pass to the uninformative limit in the parameters of your prior]
• Integrate out any nuisance parameters
• Ask your question of interest of the posterior
All you need is the rules of probability theory.

Frequentist (or: orthodox, classical) versus Bayesian inference: parameter estimation
Classical:
• define the null hypothesis, e.g. $H_0: \vartheta = 0$
• estimate parameters (obtain the test statistic $t \equiv t(y)$)
• apply the decision rule: if $p(t > t^* \mid H_0) \leq \alpha$, then reject $H_0$
Bayesian:
• invert the model (obtain the posterior pdf $p(\vartheta \mid y)$)
• define the null hypothesis, e.g. $H_0: \vartheta > 0$
• apply the decision rule: if $p(H_0 \mid y) \geq \alpha$, then accept $H_0$

Model comparison
• Principle of parsimony: «plurality should not be assumed without necessity»
• Automatically enforced by Bayesian model comparison
Model evidence:
$$p(y \mid m) = \int p(y \mid \vartheta, m)\, p(\vartheta \mid m)\, \mathrm{d}\vartheta \approx \text{accuracy} - \text{complexity}$$
[Figure: «Occam's razor» – the model evidence $p(y \mid m)$ plotted over the space of all data sets $y = f(x)$ for models of different complexity]

Frequentist (or: orthodox, classical) versus Bayesian inference: model comparison
• Define the null and the alternative hypothesis in terms of priors, e.g.:
$$H_0: \; p(\vartheta \mid H_0) = \begin{cases} 1 & \text{if } \vartheta = 0 \\ 0 & \text{otherwise} \end{cases} \qquad H_1: \; p(\vartheta \mid H_1) = \mathcal{N}(0, \Sigma)$$
[Figure: the evidences $p(Y \mid H_0)$ and $p(Y \mid H_1)$ plotted over the space of all data sets $Y$]
• Apply the decision rule: if $\dfrac{p(H_0 \mid y)}{p(H_1 \mid y)} \leq 1$, then reject $H_0$

Applications of Bayesian inference
[Figure: the SPM analysis pipeline – realignment, smoothing, normalisation to a template, segmentation and normalisation, general linear model, statistical inference (Gaussian field theory, p < 0.05) – with the Bayesian applications highlighted: posterior probability maps (PPMs), dynamic causal modelling, multivariate decoding]

Segmentation (mixture-of-Gaussians model)
[Figure: graphical model in which the $i$-th voxel value $y_i$ depends on the $i$-th voxel label $c_i$, the class means $\mu_1, \ldots, \mu_k$, the class variances $\sigma_1^2, \ldots, \sigma_k^2$, and the class frequencies $\lambda$; example segmentations into grey matter, white matter, and CSF]

fMRI time series analysis
[Figure: the fMRI time series is modelled by a GLM (design matrix $X$, GLM coefficients, AR coefficients for correlated noise) with priors on the variance of the GLM coefficients and on the variance of the data noise; PPMs show the regions best explained by a short-term memory model versus a long-term memory model]

Dynamic causal modeling (DCM)
[Figure: four models m1–m4 in which attention modulates different connections among stim, V1, V5, and PPC; the log model evidences $\ln p(y \mid m)$ favour m4; estimated effective synaptic strengths are shown for the best model (m4)]

Model comparison for group studies
Differences in log model evidences, $\ln p(y \mid m_1) - \ln p(y \mid m_2)$, are computed for each subject.
• Fixed effect: assume all subjects correspond to the same model (a minimal sketch follows below).
• Random effect: assume different subjects might correspond to different models.
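Under the fixed-effect assumption, the subject-wise log evidences for a model simply add up, so the group-level comparison reduces to a summed log Bayes factor. The Python sketch below illustrates this with hypothetical log evidence values for four subjects and two models; the posterior model probabilities assume equal prior probabilities for the models. The random-effect analysis, which lets different subjects use different models, requires a hierarchical model and is not sketched here.

```python
# Minimal sketch of fixed-effects Bayesian model comparison at the group level:
# assuming every subject's data were generated by the same model, the
# subject-wise log evidences add. All numbers are hypothetical placeholders.
import numpy as np

# log p(y_i | m) for each subject i (rows) and model m (columns) -- hypothetical
log_evidence = np.array([
    [-120.3, -118.9],
    [-150.1, -147.2],
    [-133.4, -133.9],
    [-141.0, -138.5],
])

group_log_evidence = log_evidence.sum(axis=0)       # fixed effect: sum over subjects
log_group_bayes_factor = group_log_evidence[1] - group_log_evidence[0]

# Posterior model probabilities, assuming equal prior probability for each model:
log_post = group_log_evidence - group_log_evidence.max()   # for numerical stability
post = np.exp(log_post) / np.exp(log_post).sum()

print("group log Bayes factor (m2 vs m1):", log_group_bayes_factor)
print("posterior model probabilities:", post)
```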
Thanks