
Bayesian Inference
Chris Mathys
Wellcome Trust Centre for Neuroimaging
UCL
SPM Course
London, May 12, 2014
Thanks to Jean Daunizeau and Jérémie Mattout for
previous versions of this talk
A spectacular piece of information
Messerli, F. H. (2012). Chocolate Consumption, Cognitive Function, and Nobel Laureates.
New England Journal of Medicine, 367(16), 1562–1564.
So will I win the Nobel prize if I eat lots of chocolate?
This is a question referring to uncertain quantities. Like almost all scientific
questions, it cannot be answered by deductive logic. Nonetheless, quantitative
answers can be given – but they can only be given in terms of probabilities.
Our question here can be rephrased in terms of a conditional probability:
๐‘ ๐‘๐‘œ๐‘๐‘’๐‘™ ๐‘™๐‘œ๐‘ก๐‘  ๐‘œ๐‘“ ๐‘โ„Ž๐‘œ๐‘๐‘œ๐‘™๐‘Ž๐‘ก๐‘’ = ?
To answer it, we have to learn to calculate such quantities. The tool for this is
Bayesian inference.
«Bayesian» = logical
and
logical = probabilistic
«The actual science of logic is conversant at present only with things either
certain, impossible, or entirely doubtful, none of which (fortunately) we have to
reason on. Therefore the true logic for this world is the calculus of probabilities,
which takes account of the magnitude of the probability which is, or ought to
be, in a reasonable man's mind.»
— James Clerk Maxwell, 1850
«Bayesian» = logical
and
logical = probabilistic
But in what sense is probabilistic reasoning (i.e., reasoning about uncertain
quantities according to the rules of probability theory) «logical»?
R. T. Cox showed in 1946 that the rules of probability theory can be derived
from three basic desiderata:
1. Representation of degrees of plausibility by real numbers
2. Qualitative correspondence with common sense (in a well-defined sense)
3. Consistency
The rules of probability
By mathematical proof (i.e., by deductive reasoning) the three desiderata as set out by
Cox imply the rules of probability (i.e., the rules of inductive reasoning).
This means that anyone who accepts the desiderata must accept the following rules:
1. Normalization:
   $\sum_a p(a) = 1$
2. Marginalization (also called the sum rule):
   $p(b) = \sum_a p(a, b)$
3. Conditioning (also called the product rule):
   $p(a, b) = p(a \mid b)\, p(b) = p(b \mid a)\, p(a)$
«Probability theory is nothing but common sense reduced to calculation.»
— Pierre-Simon Laplace, 1819
Conditional probabilities
The probability of $a$ given $b$ is denoted by $p(a \mid b)$.
In general, this is different from the probability of ๐‘Ž alone (the marginal probability of
๐‘Ž), as we can see by applying the sum and product rules:
$$p(a) = \sum_b p(a, b) = \sum_b p(a \mid b)\, p(b)$$

Because of the product rule, we also have the following rule (Bayes' theorem) for
going from $p(a \mid b)$ to $p(b \mid a)$:

$$p(b \mid a) = \frac{p(a \mid b)\, p(b)}{p(a)} = \frac{p(a \mid b)\, p(b)}{\sum_{b'} p(a \mid b')\, p(b')}$$
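These rules are easy to check numerically. The following minimal sketch (my own illustration, not from the slides; the joint distribution is hypothetical) verifies normalization, the sum rule, the product rule, and Bayes' theorem for a discrete joint distribution in Python:

import numpy as np

# Hypothetical joint distribution p(a, b) over two binary variables
# (rows index a, columns index b); any valid joint would do.
p_ab = np.array([[0.30, 0.10],
                 [0.20, 0.40]])

# Normalization: the joint sums to 1.
assert np.isclose(p_ab.sum(), 1.0)

# Sum rule (marginalization): p(a) = sum_b p(a, b), p(b) = sum_a p(a, b).
p_a = p_ab.sum(axis=1)
p_b = p_ab.sum(axis=0)

# Product rule (conditioning): p(a, b) = p(a | b) p(b).
p_a_given_b = p_ab / p_b                      # divide each column by p(b)
assert np.allclose(p_a_given_b * p_b, p_ab)

# Bayes' theorem: p(b | a) = p(a | b) p(b) / sum_b' p(a | b') p(b').
p_b_given_a = (p_a_given_b * p_b) / (p_a_given_b * p_b).sum(axis=1, keepdims=True)
assert np.allclose(p_b_given_a, p_ab / p_a[:, None])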
The chocolate example
In our example, it is immediately clear that $p(\mathrm{Nobel} \mid \mathrm{chocolate})$ is very different from
$p(\mathrm{chocolate} \mid \mathrm{Nobel})$. While the first is hopeless to determine directly, the second is
much easier to find out: ask Nobel laureates how much chocolate they eat. Once we
know that, we can use Bayes' theorem:

$$\underbrace{p(\mathrm{Nobel} \mid \mathrm{chocolate})}_{\text{posterior}} = \frac{\overbrace{p(\mathrm{chocolate} \mid \mathrm{Nobel})}^{\text{likelihood}}\ \overbrace{p(\mathrm{Nobel})}^{\text{prior}}}{\underbrace{p(\mathrm{chocolate})}_{\text{evidence}}}$$

Likelihood and prior together constitute the model.
Inference on the quantities of interest in fMRI/DCM studies has exactly the same
general structure.
Inference in SPM
forward problem: likelihood $p(y \mid \vartheta, m)$

inverse problem: posterior distribution $p(\vartheta \mid y, m)$
Inference in SPM
Likelihood: $p(y \mid \vartheta, m)$

Prior: $p(\vartheta \mid m)$

Bayes' theorem:
$$p(\vartheta \mid y, m) = \frac{p(y \mid \vartheta, m)\, p(\vartheta \mid m)}{p(y \mid m)}$$

generative model $m$
A simple example of Bayesian inference
(adapted from Jaynes (1976))
Two manufacturers, A and B, deliver the same kind of components that turn out to
have the following lifetimes (in hours):
A: [sample of lifetimes shown graphically on the slide]
B: [sample of lifetimes shown graphically on the slide]
Assuming prices are comparable, from which manufacturer would you buy?
A simple example of Bayesian inference
How do we compare such samples?
A simple example of Bayesian inference
What next?
A simple example of Bayesian inference
The procedure in brief:
• Determine your question of interest («What is the probability that...?»)
• Specify your model (likelihood and prior)
• Calculate the full posterior using Bayes’ theorem
• [Pass to the uninformative limit in the parameters of your prior]
• Integrate out any nuisance parameters
• Ask your question of interest of the posterior
All you need is the rules of probability theory.
(Ok, sometimes you’ll encounter a nasty integral – but that’s a technical difficulty,
not a conceptual one).
A simple example of Bayesian inference
The question:
• What is the probability that the components from manufacturer B
have a longer lifetime than those from manufacturer A?
• More specifically: given how much more expensive they are, how much
longer do I require the components from B to last?
• Example of a decision rule: if the components from B live 3 hours
longer than those from A with a probability of at least 80%, I will
choose those from B.
A simple example of Bayesian inference
The model (bear with me, this will turn out to be simple):
• likelihood (Gaussian):
$$p(\{x_i\} \mid \mu, \lambda) = \prod_{i=1}^{n} \left(\frac{\lambda}{2\pi}\right)^{\frac{1}{2}} \exp\!\left(-\frac{\lambda}{2}\left(x_i - \mu\right)^2\right)$$

• prior (Gaussian-gamma):
$$p(\mu, \lambda \mid \mu_0, \kappa_0, a_0, b_0) = \mathcal{N}\!\left(\mu \,\middle|\, \mu_0, (\kappa_0 \lambda)^{-1}\right) \mathrm{Gam}(\lambda \mid a_0, b_0)$$
A simple example of Bayesian inference
The posterior (Gaussian-gamma):
$$p(\mu, \lambda \mid \{x_i\}) = \mathcal{N}\!\left(\mu \,\middle|\, \mu_n, (\kappa_n \lambda)^{-1}\right) \mathrm{Gam}(\lambda \mid a_n, b_n)$$

Parameter updates:
$$\mu_n = \mu_0 + \frac{n}{\kappa_0 + n}\left(\bar{x} - \mu_0\right), \qquad \kappa_n = \kappa_0 + n, \qquad a_n = a_0 + \frac{n}{2},$$
$$b_n = b_0 + \frac{n}{2}\left(s^2 + \frac{\kappa_0}{\kappa_0 + n}\left(\bar{x} - \mu_0\right)^2\right)$$

with
$$\bar{x} := \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s^2 := \frac{1}{n}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2$$
A simple example of Bayesian inference
The limit for which the prior becomes uninformative:
• For $\kappa_0 = 0$, $a_0 = 0$, $b_0 = 0$, the updates reduce to:
$$\mu_n = \bar{x}, \qquad \kappa_n = n, \qquad a_n = \frac{n}{2}, \qquad b_n = \frac{n}{2}\, s^2$$
• As promised, this is really simple: all you need is $n$, the number of data points; $\bar{x}$, their mean; and $s^2$, their variance.
• This means that only the data influence the posterior and all influence from the parameters of the prior has been eliminated.
• The uninformative limit should only ever be taken after the calculation of the posterior using a proper prior.
A simple example of Bayesian inference
Integrating out the nuisance parameter $\lambda$ gives rise to a t-distribution for $\mu$ (the standard result for a Gaussian-gamma posterior, with location $\mu_n$, precision $a_n \kappa_n / b_n$, and $2a_n$ degrees of freedom):
$$p(\mu \mid \{x_i\}) = \mathrm{St}\!\left(\mu \,\middle|\, \mu_n,\ \frac{a_n \kappa_n}{b_n},\ 2 a_n\right)$$
A simple example of Bayesian inference
The joint posterior $p(\mu_A, \mu_B \mid \{x_i\}_A, \{x_k\}_B)$ is simply the product of our two
independent posteriors $p(\mu_A \mid \{x_i\}_A)$ and $p(\mu_B \mid \{x_k\}_B)$. It will now give us the
answer to our question:

$$p(\mu_B - \mu_A > 3) = \int_{-\infty}^{\infty} \mathrm{d}\mu_A\, p\!\left(\mu_A \mid \{x_i\}_A\right) \int_{\mu_A + 3}^{\infty} \mathrm{d}\mu_B\, p\!\left(\mu_B \mid \{x_k\}_B\right) = 0.9501$$

Note that the t-test told us that there was «no significant difference» even though
there is a >95% probability that the parts from B will last at least 3 hours longer
than those from A.
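A Monte Carlo version of this calculation, as a sketch: since the actual lifetime samples are only shown graphically on the slides, the posterior parameters below are hypothetical placeholders; with the real data the same procedure would reproduce the 0.9501 above.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def sample_mu(mu_n, kappa_n, a_n, b_n, size):
    # Marginal posterior of mu is Student-t with 2*a_n degrees of freedom,
    # location mu_n and scale sqrt(b_n / (a_n * kappa_n)).
    scale = np.sqrt(b_n / (a_n * kappa_n))
    return stats.t.rvs(df=2 * a_n, loc=mu_n, scale=scale, size=size, random_state=rng)

# Hypothetical posterior parameters for manufacturers A and B.
mu_A = sample_mu(mu_n=42.0, kappa_n=9, a_n=4.5, b_n=120.0, size=200_000)
mu_B = sample_mu(mu_n=50.0, kappa_n=4, a_n=2.0, b_n=80.0, size=200_000)

# Probability that B's components last at least 3 hours longer than A's.
p_B_better_by_3 = np.mean(mu_B - mu_A > 3)
print(p_B_better_by_3)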
Bayesian inference
The procedure in brief:
• Determine your question of interest («What is the probability that...?»)
• Specify your model (likelihood and prior)
• Calculate the full posterior using Bayes’ theorem
• [Pass to the uninformative limit in the parameters of your prior]
• Integrate out any nuisance parameters
• Ask your question of interest of the posterior
All you need is the rules of probability theory.
Frequentist (or: orthodox, classical) versus
Bayesian inference: hypothesis testing
Classical:
• define the null, e.g. $H_0\!: \vartheta = 0$
• estimate parameters (obtain the test statistic $t \equiv t(Y)$ and its null distribution $p(t \mid H_0)$)
• apply the decision rule, i.e.: if $p(t > t^* \mid H_0) \le \alpha$ then reject $H_0$

Bayesian:
• invert the model (obtain the posterior pdf $p(\vartheta \mid y)$)
• define the null, e.g. $H_0\!: \vartheta > \vartheta_0$
• apply the decision rule, i.e.: if $p(H_0 \mid y) \ge \alpha$ then accept $H_0$
Model comparison: general principles
• Principle of parsimony: «plurality should not be assumed without necessity»
• Automatically enforced by Bayesian model comparison
Model evidence:
$$p(y \mid m) = \int p(y \mid \vartheta, m)\, p(\vartheta \mid m)\, \mathrm{d}\vartheta \approx \exp\!\left(\mathrm{accuracy} - \mathrm{complexity}\right)$$

"Occam's razor": [figure: model evidence $p(y \mid m)$ plotted over the space of all data sets $y = f(x)$]
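A small numerical sketch of the evidence integral (my own, hypothetical data): a coin-flip data set scored under a simple model with no free parameter and under a more flexible model with an unknown bias and a uniform prior. The integral automatically penalizes the flexible model.

import numpy as np
from scipy import integrate, stats

# Hypothetical data: 7 heads out of 12 flips.
heads, n = 7, 12

# Model m0: fair coin, no free parameters -> evidence is just the likelihood.
log_evidence_m0 = stats.binom.logpmf(heads, n, 0.5)

# Model m1: unknown bias theta with a uniform prior on [0, 1].
# Evidence: p(y | m1) = integral of p(y | theta) p(theta) d(theta).
evidence_m1, _ = integrate.quad(lambda th: stats.binom.pmf(heads, n, th), 0.0, 1.0)
log_evidence_m1 = np.log(evidence_m1)

# With data this unremarkable, the simpler model wins (Occam's razor).
print(log_evidence_m0, log_evidence_m1)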
Model comparison: negative variational free energy F
๐ฅ๐จ๐  − ๐ฆ๐จ๐๐ž๐ฅ ๐ž๐ฏ๐ข๐๐ž๐ง๐œ๐ž โ‰” log ๐‘ ๐‘ฆ ๐‘š
= log
๐‘ ๐‘ฆ, ๐œ— ๐‘š d๐œ—
= log
๐‘ž ๐œ—
sum rule
multiply by 1 =
๐‘ž ๐œ—
๐‘ž ๐œ—
Jensen’s inequality
≥
๐‘ž ๐œ— log
๐‘ ๐‘ฆ, ๐œ— ๐‘š
d๐œ—
๐‘ž ๐œ—
a lower bound on the
log-model evidence
๐‘ ๐‘ฆ, ๐œ— ๐‘š
d๐œ—
๐‘ž ๐œ—
=: ๐‘ญ = ๐ง๐ž๐ ๐š๐ญ๐ข๐ฏ๐ž ๐ฏ๐š๐ซ๐ข๐š๐ญ๐ข๐จ๐ง๐š๐ฅ ๐Ÿ๐ซ๐ž๐ž ๐ž๐ง๐ž๐ซ๐ ๐ฒ
๐นโ‰”
=
๐‘ ๐‘ฆ, ๐œ— ๐‘š
๐‘ž ๐œ— log
d๐œ—
๐‘ž ๐œ—
๐‘ž ๐œ— log
Kullback-Leibler divergence
๐‘ ๐‘ฆ ๐œ—, ๐‘š ๐‘ ๐œ— ๐‘š
d๐œ—
๐‘ž ๐œ—
product rule
=
๐‘ž ๐œ— log ๐‘ ๐‘ฆ ๐œ—, ๐‘š d๐œ—
− ๐พ๐ฟ ๐‘ž ๐œ— , ๐‘ ๐œ— ๐‘š
๐‚๐จ๐ฆ๐ฉ๐ฅ๐ž๐ฑ๐ข๐ญ๐ฒ
๐€๐œ๐œ๐ฎ๐ซ๐š๐œ๐ฒ (๐ž๐ฑ๐ฉ๐ž๐œ๐ญ๐ž๐ ๐ฅ๐จ๐ −๐ฅ๐ข๐ค๐ž๐ฅ๐ข๐ก๐จ๐จ๐)
Model comparison: F in relation to Bayes factors, AIC, BIC
๐‘ ๐‘ฆ ๐‘š1
๐‘ ๐‘ฆ ๐‘š1
๐๐š๐ฒ๐ž๐ฌ ๐Ÿ๐š๐œ๐ญ๐จ๐ซ โ‰”
= exp log
๐‘ ๐‘ฆ ๐‘š0
๐‘ ๐‘ฆ ๐‘š0
Bayes factor
Prior odds
Posterior odds
≈ exp ๐น1 − ๐น0
[Meaning of the Bayes factor:
๐‘ญ=
= exp log ๐‘ ๐‘ฆ ๐‘š1 − log ๐‘ ๐‘ฆ ๐‘š0
๐‘
๐‘
๐‘š1 ๐‘ฆ
๐‘ ๐‘ฆ ๐‘š1
=
๐‘š0 ๐‘ฆ
๐‘ ๐‘ฆ ๐‘š0
๐‘ ๐‘š1
๐‘ ๐‘š0
]
๐‘ž ๐œ— log ๐‘ ๐‘ฆ ๐œ—, ๐‘š d๐œ— − ๐พ๐ฟ ๐‘ž ๐œ— , ๐‘ ๐œ— ๐‘š
= Accuracy − Complexity
๐€๐ˆ๐‚ โ‰” Accuracy − ๐‘
๐‘
๐๐ˆ๐‚ โ‰” Accuracy − log ๐‘
2
May 12, 2014
Number of parameters
Number of data points
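A small sketch (my own; all numbers hypothetical) of how these quantities are computed once the free energies, the accuracy term, the number of parameters, and the number of data points are available:

import numpy as np

# Hypothetical quantities for two models (e.g. from a variational inversion).
F1, F0 = -210.4, -215.1          # negative free energies (log-evidence approximations)
accuracy1, p1, N = -195.2, 12, 360

# Bayes factor for m1 over m0, approximated via the free energies.
bayes_factor = np.exp(F1 - F0)

# AIC and BIC for m1, as defined above.
AIC1 = accuracy1 - p1
BIC1 = accuracy1 - p1 / 2 * np.log(N)

# Posterior odds = Bayes factor * prior odds (here: equal prior odds).
posterior_odds = bayes_factor * 1.0
print(bayes_factor, AIC1, BIC1, posterior_odds)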
A note on informative priors
• Any model consists of two parts: likelihood and prior.
• The choice of likelihood requires as much justification as the choice of prior because
it is just as «subjective» as that of the prior.
• The data never speak for themselves. They only acquire meaning when seen through
the lens of a model. However, this does not mean that all is subjective because
models differ in their validity.
• In this light, the widespread concern that informative priors might bias results
(while the form of the likelihood is taken as a matter of course requiring no
justification) is misplaced.
• Informative priors are an important tool and their use can be justified by
establishing the validity (face, construct, and predictive) of the resulting model as
well as by model comparison.
Applications of Bayesian inference
[SPM analysis pipeline diagram, showing where Bayesian methods enter: realignment, smoothing, and normalisation (to a template) feed a general linear model and statistical inference (classically via Gaussian field theory with p < 0.05); Bayesian counterparts include segmentation and normalisation, posterior probability maps (PPMs), dynamic causal modelling, and multivariate decoding.]
Segmentation (mixture-of-Gaussians model)
[Graphical model: the value $y_i$ of the $i$-th voxel depends on its class label $c_i$ via the class means $\mu_1, \dots, \mu_k$ and class variances $\sigma_1^2, \dots, \sigma_k^2$; the labels are governed by the class frequencies $\lambda$. Tissue classes: grey matter, white matter, CSF.]
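The voxel-classification step can be sketched as follows (my own illustration with hypothetical tissue parameters, not the actual SPM segmentation code): for a given voxel intensity, Bayes' theorem yields the posterior probability of each tissue class.

import numpy as np
from scipy import stats

# Hypothetical class parameters for grey matter, white matter and CSF
# (means, standard deviations and class frequencies in arbitrary intensity units).
means = np.array([60.0, 80.0, 20.0])
sds = np.array([8.0, 6.0, 10.0])
freqs = np.array([0.45, 0.40, 0.15])          # prior p(c_i = k)

def class_posterior(y_i):
    """Posterior p(c_i = k | y_i) for one voxel value under the mixture model."""
    likelihood = stats.norm.pdf(y_i, loc=means, scale=sds)   # p(y_i | c_i = k)
    joint = likelihood * freqs
    return joint / joint.sum()                               # normalize over classes

print(class_posterior(65.0))   # mostly grey matter for this hypothetical voxel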
fMRI time series analysis
[Hierarchical GLM for an fMRI time series: GLM coefficients (with a prior variance), a design matrix (X) for the short-term memory model and one for the long-term memory model, AR coefficients modelling correlated noise, and a prior variance on the data noise. The results are posterior probability maps (PPMs) showing regions best explained by the short-term memory model and regions best explained by the long-term memory model.]
Dynamic causal modeling (DCM)
[Four candidate DCMs (m1–m4) of the attention-to-motion network with regions V1, V5, and PPC, a driving input (stim) to V1, and attention as modulatory input; the models differ in which connections attention is allowed to modulate. A bar plot of the log model evidence $\ln p(y \mid m)$ identifies m4 as the best model, and the estimated effective synaptic strengths for m4 are displayed on its connections.]
Model comparison for group studies
Differences in log-model evidences across subjects:
$$\ln p(y \mid m_1) - \ln p(y \mid m_2)$$

Fixed effects: assume all subjects correspond to the same model.
Random effects: assume different subjects might correspond to different models.
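For the fixed-effects case the group-level result follows directly from the rules above: subject-wise log-evidence differences simply add. A minimal sketch with hypothetical per-subject values (the random-effects case requires the hierarchical model used by SPM's Bayesian model selection and is not sketched here):

import numpy as np

# Hypothetical log-evidence differences ln p(y|m1) - ln p(y|m2), one per subject.
log_ev_diff = np.array([2.1, -0.4, 3.0, 1.2, 0.8, -1.1, 2.5])

# Fixed effects: all subjects are assumed to share the same model, so the
# group Bayes factor is the product of subject-level Bayes factors,
# i.e. the sum of the log-evidence differences.
group_log_bayes_factor = log_ev_diff.sum()
group_bayes_factor = np.exp(group_log_bayes_factor)
print(group_log_bayes_factor, group_bayes_factor)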
Thanks