Sparse inference and large-scale multiple comparisons

Maxima of discretely sampled random fields, with an application to ‘bubbles’

Keith Worsley, McGill
Nicholas Chamandy, McGill and Google
Jonathan Taylor, Stanford and Université de Montréal
Frédéric Gosselin, Université de Montréal
Philippe Schyns, Fraser Smith, Glasgow
What is ‘bubbles’?
Nature (2005)
Subject is shown one of 40 faces chosen at random …
Happy, Sad, Fearful, Neutral
… but the face is only revealed through random ‘bubbles’

First trial: “Sad” expression
75 random bubble centres, smoothed by a Gaussian ‘bubble’, give what the subject sees
[Figure: sad face, bubble centres, smoothed bubble mask; colour scale 0–1]


Subject is asked the expression:
Response: “Neutral” (Incorrect)
Your turn …

Trial 2
Subject response:
“Fearful”
CORRECT
Your turn …

Trial 3
Subject response:
“Happy”
INCORRECT
(Fearful)
Your turn …

Trial 4
Subject response:
“Happy”
CORRECT
Your turn …

Trial 5
Subject response:
“Fearful”
CORRECT
Your turn …

Trial 6
Subject response:
“Sad”
CORRECT
Your turn …

Trial 7
Subject response:
“Happy”
CORRECT
Your turn …

Trial 8
Subject response:
“Neutral”
CORRECT
Your turn …

Trial 9
Subject response:
“Happy”
CORRECT
Your turn …

Trial 3000
Subject response:
“Happy”
INCORRECT
(Fearful)
Bubbles analysis

E.g. Fearful (3000/4 = 750 trials):
Trial 1 + 2 + 3 + 4 + 5 + 6 + 7 + … + 750 = Sum
[Figure: per-trial bubble masks (scale 0–1) and their sum over correct trials (scale 0–300)]
Proportion of correct bubbles = (sum of correct bubbles) / (sum of all bubbles)
[Figure: proportion image, scale 0.65–0.75]
Thresholded at proportion of correct trials = 0.68, scaled to [0, 1]
Use this as a bubble mask
Results

Mask average face: Happy, Sad, Fearful, Neutral
But are these features real or just noise?
Need statistics …
Statistical analysis

Correlate bubbles with response (correct = 1, incorrect = 0), separately for each expression.
Equivalent to a 2-sample Z-statistic for correct vs. incorrect bubbles, e.g. Fearful:
[Figure: trials 1, 2, …, 750 with their bubble masks (scale 0–1), 0/1 responses, and the resulting Z image (scale −2 to 4)]
Very similar to the proportion of correct bubbles:
[Figure: Z ~ N(0,1) statistic image; proportion of correct bubbles, scale 0.65–0.75]
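The per-pixel 2-sample Z image can be sketched in a few lines. This is a minimal illustration on toy data (the shapes, trial counts and the "informative pixel" are made up, not the study's masks or pipeline):

```python
import numpy as np

def bubbles_z(masks, correct):
    """Per-pixel 2-sample Z: correct vs. incorrect trials' bubble masks.

    masks:   (n_trials, H, W) array of smoothed bubble masks
    correct: (n_trials,) array of 0/1 responses
    """
    correct = np.asarray(correct, bool)
    a, b = masks[correct], masks[~correct]       # correct / incorrect trials
    na, nb = len(a), len(b)
    # pooled standard deviation at each pixel
    sp = np.sqrt(((na - 1) * a.var(0, ddof=1) + (nb - 1) * b.var(0, ddof=1))
                 / (na + nb - 2))
    return (a.mean(0) - b.mean(0)) / (sp * np.sqrt(1 / na + 1 / nb))

# Toy example: 750 trials of 16x16 masks, one "informative" pixel
rng = np.random.default_rng(0)
masks = rng.random((750, 16, 16))
correct = rng.random(750) < 0.6
masks[correct, 8, 8] += 0.5      # correct trials tend to see this pixel
Z = bubbles_z(masks, correct)
print(Z.shape, float(Z[8, 8]))
```

Uninformative pixels give Z approximately N(0,1), which is why the search over all pixels then needs a multiple-comparisons correction.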
Results

Thresholded at Z = 1.64 (P = 0.05)
Average face: Happy, Sad, Fearful, Neutral
[Figure: Z ~ N(0,1) statistic images, colour scale 1.64–4.58]
Sparse inference and large-scale multiple comparisons: correction?

Three methods so far. The set-up:
S is a subset of a D-dimensional lattice (e.g. pixels);
Z(s) ~ N(0,1) at most points s in S;
Z(s) ~ N(μ(s),1), μ(s) > 0, at a sparse set of points;
Z(s1), Z(s2) are spatially correlated.
To control the false positive rate to ≤ α we want a good approximation to α = P(maxS Z(s) ≥ t):
 Bonferroni (1936)
 Random field theory (1970’s)
 Discrete local maxima (2005, 2007)
[Figure: P-value of the Bonferroni, random field theory and discrete local maxima approximations to P(maxS Z(s) ≥ t) = 0.05, plotted against the FWHM (Full Width at Half Maximum) of the smoothing filter, 0–10]
Random field theory:
Euler Characteristic (EC) = #blobs − #holes

Excursion set {s: Z(s) ≥ t} for the neutral face; at the threshold shown, EC = 0.
[Figure: observed vs. expected Euler characteristic as a function of the threshold t, from −4 to 4]
Heuristic: at high thresholds t, the holes disappear, EC ~ 1 or 0, so
E(EC) ~ P(max Z ≥ t).
There is an exact expression for E(EC) at all thresholds, and the approximation E(EC) ~ P(max Z ≥ t) is extremely accurate.
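The lattice version of the EC is easy to compute and makes the #blobs − #holes heuristic concrete. A sketch (not the authors' code), using the standard vertex − edge + face count on a pixel grid:

```python
import numpy as np

def euler_char(mask):
    """EC of a binary 2D image: pixels - adjacencies + unit squares.

    Counts pixels as vertices, 4-neighbour pairs as edges, and
    fully-occupied 2x2 blocks as faces, giving EC = #blobs - #holes
    (4-connectivity for blobs, 8-connectivity for holes).
    """
    m = np.asarray(mask, bool)
    v = m.sum()
    e = (m[:, :-1] & m[:, 1:]).sum() + (m[:-1, :] & m[1:, :]).sum()
    f = (m[:-1, :-1] & m[:-1, 1:] & m[1:, :-1] & m[1:, 1:]).sum()
    return int(v - e + f)

solid = np.ones((5, 5), bool)               # one blob, no holes
ring = solid.copy()
ring[1:4, 1:4] = False                      # one blob with one hole
print(euler_char(solid), euler_char(ring))  # → 1 0
```

Applying this to the excursion set {s: Z(s) ≥ t} at a grid of thresholds t reproduces the "observed EC" curve on the slide.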
Random field theory:
The details
If ¡Z (s)¢ » N(0; 1) is an isot ropic Gaussian random ¯eld, s 2 < 2 , wit h ¸ I 2£ 2 =
V @Z ,
µ
¶
@s
P max Z (s) ¸ t ¼ E(E C(S \ f s : Z (s) ¸ tg))
s2 S
Z
1
1
= E C(S)
e¡ z 2 =2 dz
(2¼) 1=2
t
Intrinsic volumes or
Minkowski functionals
¸ 1=2
+ 1 Perimet er(S)
e¡ t 2 =2
2
2¼
¸
+ Area(S)
te¡ t 2 =2 :
(2¼) 3=2
If Z (s) is whit e noise convolved wit h an isot ropic Gaussian ¯lt er of Full Widt h
at Half Maximum FWHM t hen ¸ = 4 log 2 .
FW HM
2
Random field theory:
The brain mapping version
If Z (s) is whit e noise smoot hed wit h an isot ropic Gaussian ¯lt er of Full Widt h
at Half Maximum FWHM
µ
¶
P max Z (s) ¸ t ¼ E(E C(S \ f s : Z (s) ¸ tg))
s2 S
Z
1
1
= E C(S)
e¡ z 2 =2 dz
EC0(S)
(2¼) 1=2
Resels0(S)
t
Resels1(S)
Resels2(S)
Perimet er(S) (4 log 2) 1=2
e¡
FWHM
2¼
Area(S) 4 log 2
+
te¡ t 2 =2 :
FWHM 2 (2¼) 3=2
+
Resels (Resolution elements)
1
2
t 2 =2
EC1(S)
EC2(S)
EC densities
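The 2D brain-mapping approximation is a one-liner to evaluate. A sketch with made-up region geometry (a hypothetical 100×100-pixel region, FWHM 10 pixels; not the bubbles mask):

```python
import math

def rft_pvalue(t, ec, perimeter, area, fwhm):
    """E(EC) approximation to P(max Z >= t) for a 2D search region."""
    lam_half = math.sqrt(4 * math.log(2)) / fwhm          # lambda^(1/2)
    p0 = ec * 0.5 * math.erfc(t / math.sqrt(2))           # EC(S) * P(Z >= t)
    p1 = 0.5 * perimeter * lam_half / (2 * math.pi) * math.exp(-t * t / 2)
    p2 = area * lam_half ** 2 / (2 * math.pi) ** 1.5 * t * math.exp(-t * t / 2)
    return p0 + p1 + p2

# Hypothetical square region: EC = 1, perimeter 400, area 10000, FWHM = 10
for t in (3.0, 4.0, 5.0):
    print(t, rft_pvalue(t, ec=1, perimeter=400, area=10000, fwhm=10))
```

Note the result is an approximation to a probability, not a probability itself: at low thresholds it can exceed 1, which is harmless since thresholds are chosen where it is small.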
Discrete local maxima

 Bonferroni applied to the events
{Z(s) ≥ t and Z(s) is a discrete local maximum}, i.e.
{Z(s) ≥ t and neighbouring Z’s ≤ Z(s)}
 Conservative
 If Z(s) is stationary, with Cor(Z(s1), Z(s2)) = ρ(s1 − s2), all we need is
P{Z(s) ≥ t and neighbouring Z’s ≤ Z(s)},
a (2D+1)-variate integral.
[Diagram: Z(s-1) ≤ Z(s) ≥ Z(s1), Z(s-2) ≤ Z(s) ≥ Z(s2)]
Discrete local maxima:
“Markovian” trick

 If ρ is “separable”: s = (x, y), ρ((x, y)) = ρ((x, 0)) × ρ((0, y)),
e.g. the Gaussian spatial correlation function ρ((x, y)) = exp(−½(x² + y²)/w²).
 Then Z(s) has a “Markovian” property: conditional on the central Z(s), Z’s on different axes are independent:
Z(s±1) ┴ Z(s±2) | Z(s)
 So condition on Z(s) = z, find
P{neighbouring Z’s ≤ z | Z(s) = z} = ∏d P{Z(s±d) ≤ z | Z(s) = z},
then take expectations over Z(s) = z.
 This cuts the (2D+1)-variate integral down to a bivariate integral.

T he result only involves t he correlat ion ½d between adjacent voxels along
each lat t ice axis d, d = 1; : : : ; D . First let t he Gaussian density and uncorrect ed
P values be
Z
p
1
2
Á(z) = exp(¡ z =2)= 2¼; ©(z) =
Á(u)du;
z
respect ively. T hen de¯ne
1
Q(½; z) = 1 ¡ 2©(hz) +
¼
where
® = sin¡
³p
1
Z
®
exp(¡
1 h2 z2 =sin2
2
0
r
´
(1 ¡ ½2 )=2 ;
h=
µ)dµ;
1¡ ½
:
1+ ½
T hen t he P-value of t he maximum is bounded by
µ
P
¶
max Z (s) ¸ t
s2 S
Z
· jSj
t
1
YD
Q(½d ; z) Á(z)dz;
d= 1
where jSj is t he number of voxels s in t he search region S. For a voxel on
t he boundary of t he search region wit h just one neighbour in axis direct ion d,
replace Q(½; z) by 1 ¡ ©(hz), and by 1 if it has no neighbours.
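The bound is cheap to evaluate with elementary quadrature. A sketch for interior voxels only (the voxel count, correlations and grid sizes below are illustrative assumptions, not data from the talk):

```python
import math

def Phi(z):
    """Upper-tail P(Z >= z) for Z ~ N(0,1)."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def phi(z):
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)

def Q(rho, z, m=200):
    """P(both axis-d neighbours <= z | Z(s) = z), via the formula above."""
    h = math.sqrt((1 - rho) / (1 + rho))
    alpha = math.asin(math.sqrt((1 - rho * rho) / 2))
    # midpoint rule for the theta integral (integrand vanishes at theta = 0)
    mids = (alpha * (k + 0.5) / m for k in range(m))
    integral = sum(math.exp(-0.5 * h * h * z * z / math.sin(t) ** 2)
                   for t in mids) * alpha / m
    return 1 - 2 * Phi(h * z) + integral / math.pi

def dlm_bound(t, nvox, rhos, zmax=10.0, m=400):
    """|S| * integral_t^zmax prod_d Q(rho_d, z) phi(z) dz, midpoint rule."""
    dz = (zmax - t) / m
    total = sum(math.prod(Q(r, t + (k + 0.5) * dz) for r in rhos)
                * phi(t + (k + 0.5) * dz) for k in range(m))
    return nvox * total * dz

# 2D lattice, correlation 0.7 between neighbours on both axes
p_dlm = dlm_bound(4.0, nvox=10000, rhos=(0.7, 0.7))
p_bon = 10000 * Phi(4.0)
print(p_dlm, p_bon)   # DLM bound is tighter than Bonferroni
```

A useful check: with ρ = 0 the two neighbours are conditionally standard normal, so Q(0, z) reduces to the square of the lower-tail probability, which the theta integral reproduces.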
Results, corrected for search

 Random field theory threshold: Z = 3.92 (P = 0.05)
Average face: Happy, Sad, Fearful, Neutral
[Figure: Z ~ N(0,1) statistic images, colour scale 3.92–4.58]
 DLM threshold: Z = 3.92 (P = 0.05) – same
 Bonferroni threshold: Z = 4.87 (P = 0.05) – nothing
Results, corrected for search

 FDR threshold: Z = 3.31 (Q = 0.05)
Average face: Happy, Sad, Fearful, Neutral
[Figure: Z ~ N(0,1) statistic images, colour scale 3.31–4.58; for comparison, the Bonferroni threshold is 4.87]
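An FDR threshold of this kind comes from a step-up rule on the sorted P-values; assuming the standard Benjamini–Hochberg procedure, a generic sketch on toy Z values (not the bubbles data) is:

```python
import math
import random
from statistics import NormalDist

def fdr_z_threshold(z_values, q=0.05):
    """Benjamini-Hochberg: largest k with p_(k) <= k*q/m, as a Z threshold."""
    m = len(z_values)
    p = sorted(0.5 * math.erfc(z / math.sqrt(2)) for z in z_values)  # upper-tail p
    k = max((i + 1 for i in range(m) if p[i] <= (i + 1) * q / m), default=0)
    if k == 0:
        return math.inf                          # nothing survives
    return NormalDist().inv_cdf(1 - p[k - 1])    # Z giving the critical p-value

# Toy SPM: mostly null Z's plus a minority of activated ones
random.seed(1)
zs = [random.gauss(0, 1) for _ in range(9900)] + \
     [random.gauss(4, 1) for _ in range(100)]
t = fdr_z_threshold(zs, q=0.05)
print(t)
```

The resulting threshold sits between the uncorrected Z = 1.64 and the Bonferroni value, which is exactly the behaviour seen on the slide.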
Comparison

Bonferroni (1936)
 Conservative
 Accurate if spatial correlation is low
 Simple

Discrete local maxima (2005, 2007)
 Conservative
 Accurate for all ranges of spatial correlation
 A bit messy
 Only easy for stationary separable Gaussian data on rectilinear lattices
 Even if not separable, always seems to be conservative

Random field theory (1970’s)
 Approximation based on assuming S is continuous
 Accurate if spatial correlation is high
 Elegant
 Easily extended to non-Gaussian, non-isotropic random fields
Random field theory:
Non-Gaussian, non-isotropic

If T(s) = f(Z₁(s), …, Zₙ(s)) is a function of i.i.d. Gaussian random fields Zᵢ(s) ~ Z(s) ~ N(0,1), s ∈ ℝᴰ, with Var(∂Z(s)/∂s) = Λ(s) (D × D), then replace resels by the Lipschitz-Killing curvatures L_d(S; Λ):

\[
P\Big(\max_{s\in S} T(s)\ge t\Big) \approx E\big(EC(S\cap\{s: T(s)\ge t\})\big) = \sum_{d=0}^{D} \mathcal{L}_d(S;\Lambda)\,\rho_d(t),
\]

where ρ_d(t) is the same EC density as for the isotropic case with Λ(s) = I (Taylor & Adler, 2003). Both the Lipschitz-Killing curvature L_d(S; Λ) and the EC density ρ_d(t) are defined implicitly as coefficients of a power-series expansion of the volume of a tube as a function of its radius. In the case of the Lipschitz-Killing curvature, the tube is about the search region S in the Riemannian metric induced by Λ(s); in the case of the EC densities, the tube is about the rejection region f⁻¹([t, ∞)) and volume is replaced by Gaussian probability. Fortunately there are simple ways of estimating Lipschitz-Killing curvature from sample data (Taylor & Worsley, 2007), and simple ways of calculating EC densities.
Referee report

Why bother? Why not just do simulations?

fMRI data: 120 scans, 3 scans each of hot, rest, warm, rest, hot, rest, …
[Figure: first scan of fMRI data; time courses (hot, rest, warm) against time in seconds at three voxels: a highly significant effect, T = 6.59; no significant effect, T = −0.74; drift; and the T statistic image for the hot − warm effect, scale −5 to 5]
T = (hot − warm effect) / S.d. ~ t110 if no effect
Bubbles task in fMRI scanner

 Correlate bubbles with BOLD at every voxel:
[Figure: trials 1, 2, …, 3000 with bubble masks (scale 0–1) and fMRI response (scale 0–10000)]
 Calculate Z for each pair (bubble pixel, fMRI voxel) – a 5D “image” of Z statistics …
Discussion: thresholding

Thresholding in advance is vital, since we cannot store all the ~1 billion 5D Z values
 Resels₅ = (image Resels₂ = 146.2) × (fMRI Resels₃ = 1057.2)
 For P = 0.05, the threshold is t = 6.22 (approx)
 The threshold based on Gaussian RFT can be improved using new non-Gaussian RFT based on saddle-point approximations (Chamandy et al., 2006)
 Model the bubbles as a smoothed Poisson point process
 The improved thresholds are slightly lower, so more activation is detected
Only keep 5D local maxima:
 Z(pixel, voxel) > Z(pixel, 6 neighbours of voxel)
 Z(pixel, voxel) > Z(4 neighbours of pixel, voxel)
Discussion: modeling

 The random response is Y = 1 (correct) or 0 (incorrect), or Y = fMRI
 The regressors are Xj = bubble mask at pixel j, j = 1 … 240×380 = 91200 (!)
 Logistic regression or ordinary regression:
 logit(E(Y)) or E(Y) = b0 + X1b1 + … + X91200b91200
 But there are only n = 3000 observations (trials) …
 Instead, since the regressors are independent, fit them one at a time:
 logit(E(Y)) or E(Y) = b0 + Xjbj
 However the regressors (bubbles) are random with a simple known distribution, so turn the problem around and condition on Y:
 E(Xj) = c0 + Ycj
 Equivalent to conditional logistic regression (Cox, 1962), which gives exact inference for bj conditional on sufficient statistics for b0
 Cox also suggested using saddle-point approximations to improve the accuracy of inference …
 Interactions? logit(E(Y)) or E(Y) = b0 + X1b1 + … + X91200b91200 + X1X2b1,2 + …
[Figure (repeated): P-value of the Bonferroni, random field theory and discrete local maxima approximations to P(maxS Z(s) ≥ t) = 0.05, vs. FWHM (Full Width at Half Maximum) of the smoothing filter]
Bayesian Model Selection (thanks to Ed George)
• Z-statistic at voxel i is Zi ~ N(mi, 1), i = 1, …, n
• Most of the mi’s are zero (unactivated voxels) and a few are nonzero (activated voxels), but we do not know which voxels are activated, and by how much (mi)
• This is a model selection problem, where we add an extra model parameter (mi) for the mean of each activated voxel
• Simple Bayesian set-up:
- each voxel is independently active with probability p
- the activation is itself drawn independently from a Gaussian distribution: mi ~ N(0, c)
• The hyperparameter p controls the expected proportion of activated voxels, and c controls their expected activation
• Surprise! This prior setup is related to the canonical penalized sum-of-squares criterion
AF = Σ(activated voxels) Zi² − Fq,
where q is the number of activated voxels and F is a fixed penalty for adding an activated voxel
• Popular model selection criteria simply entail maximizing AF for a particular choice of F, which is equivalent to thresholding the image at √F
• Some choices of F:
- F = 0 : all voxels activated
- F = 2 : Mallow’s Cp and AIC
- F = log n : BIC
- F = 2 log n : RIC
- P(Z > √F) = 0.05/n : Bonferroni (almost the same as RIC!)
• The Bayesian relationship with AF is obtained by re-expressing the posterior of the activated voxels, given the data:
P(activated voxels | Z’s) ∝ exp( [c/(2(1+c))] AF ),
where
F = [(1+c)/c] {2 log[(1−p)/p] + log(1+c)}
• Since p and c control the expected number and size of the activation, the dependence of F on p and c provides an implicit connection between the penalty F and the sorts of models for which its value may be appropriate
• The awful truth: p and c are unknown
• Empirical Bayes idea: use the p and c that maximize the marginal likelihood, which simplifies to
L(p, c | Z’s) ∝ Пi [ (1−p) exp(−Zi²/2) + p(1+c)^(−1/2) exp(−Zi²/(2(1+c))) ]
• This is identical to fitting a classic mixture model with
- a probability of (1−p) that Zi ~ N(0, 1)
- a probability of p that Zi ~ N(0, 1+c)
- √F is the value of Z where the two components are equal
• Using these estimated values of p and c gives us an adaptive penalty F, or equivalently a threshold √F, that is implicitly based on the SPM
• All we have to do is fit the mixture model … but does it work?
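Fitting that two-component variance mixture is a few lines of EM. A sketch on synthetic Z's (the sample sizes, true values and starting points are assumptions for illustration):

```python
import math
import random

def fit_mixture(z, p=0.5, c=4.0, iters=300):
    """EM for the marginal mixture (1-p) N(0,1) + p N(0, 1+c)."""
    for _ in range(iters):
        resp = []
        for zi in z:
            a = p * math.exp(-zi * zi / (2 * (1 + c))) / math.sqrt(1 + c)
            b = (1 - p) * math.exp(-zi * zi / 2)
            resp.append(a / (a + b))          # P(voxel i active | z_i)
        w = sum(resp)
        p = w / len(z)
        # active-component variance is 1 + c; weight by responsibilities
        c = max(sum(r * zi * zi for r, zi in zip(resp, z)) / w - 1, 1e-6)
    return p, c

# Synthetic SPM: 70% null N(0,1), 30% active with c = 5 (variance 6)
random.seed(0)
z = [random.gauss(0, 1) for _ in range(7000)] + \
    [random.gauss(0, math.sqrt(1 + 5.0)) for _ in range(3000)]
p_hat, c_hat = fit_mixture(z)
print(p_hat, c_hat)   # roughly p = 0.3, c = 5
```

The adaptive threshold √F then follows from the F(p, c) formula above; as the next slides show, it can fail badly when the fitted components never cross.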
• Same data as before: hot – warm stimulus, four runs:
- proportion of activated voxels p = 0.57
- variance of activated voxels c = 5.8 (sd = 2.4)
- penalty F = 1.59 (a bit like AIC)
- threshold √F = 1.26 (?) seems a bit low …
[Figure: histogram of the SPM (n = 30786) with the fitted mixture: 43% unactivated voxels N(0,1), 57% activated voxels N(0,5.8); √F is the threshold where the components are equal. Comparison thresholds: AIC √F = 2, FDR (0.05) √F = 2.67, BIC √F = 3.21, RIC √F = 4.55, Bonferroni (0.05) √F = 4.66]
• Same data as before: hot – warm stimulus, one run:
- proportion of activated voxels p = 0.80
- variance of activated voxels c = 1.55
- penalty F = −3.02 (?)
- all voxels activated !!!!!! What is going on?
[Figure: histogram of the SPM (n = 30768) with the fitted mixture: 20% unactivated voxels N(0,1), 80% activated voxels N(0,1.55); the null and mixture components are never equal. Comparison thresholds: AIC √F = 2, FDR (0.05) √F = 2.67, BIC √F = 3.21, RIC √F = 4.55, Bonferroni (0.05) √F = 4.66]
MS lesions and cortical thickness

 Idea: MS lesions interrupt neuronal signals, causing thinning in down-stream cortex
 Data: n = 425 mild MS patients
[Figure: average cortical thickness (mm, 1.5–5.5) vs. total lesion volume (cc, 0–80); correlation = −0.568, T = −14.20 (423 df)]
Charil et al., NeuroImage (2007)
MS lesions and cortical thickness at all pairs of points

 Dominated by total lesions and average cortical thickness, so remove these effects:
 Cortical thickness CT, smoothed 20mm; subtract average cortical thickness
 Lesion density LD, smoothed 10mm
 Find partial correlation(lesion density, cortical thickness) removing total lesion volume:
 linear model: CT − av(CT) ~ 1 + TLV + LD, test for LD
 Repeat at all voxels in 3D, nodes in 2D
 ~1 billion correlations, so thresholding essential!
 Look for high negative correlations …
Thresholding? Cross correlation random field

 Correlation between 2 fields at 2 different locations, searched over all pairs of locations
 one in R (D dimensions), one in S (E dimensions)
 sample size n
Cao & Worsley, Annals of Applied Probability (1999)
 MS lesion data: P = 0.05, c = 0.300, T = 6.46
Cluster extent rather than peak height (Friston, 1994)

 Choose a lower level, e.g. t = 3.11 (P = 0.001)
 Find clusters, i.e. connected components of the excursion set
 Measure cluster extent by its volume LD
[Figure, D = 1: Z(s) crossing the threshold t, showing peak height and cluster extent LD]
 Distribution: fit a quadratic to the peak; the cluster extent LD then has a known approximate distribution (Cao, Advances in Applied Probability, 1999)
 Dbn. of maximum cluster extent: Bonferroni on N = #clusters ~ E(EC)
How do you choose the threshold t for defining clusters?

 If the signal is focal, i.e. ~FWHM of the noise:
 choose a high threshold, i.e. peak height is better
 If the signal is broad, i.e. >> FWHM of the noise:
 choose a low threshold, i.e. cluster extent is better
 Conclusion: cluster extent is better for detecting broad signals
 Alternative: smooth the data with a filter that matches the signal (Matched Filter Theorem) … try a range of filter widths … scale space search … correct using random field theory … a lot of work …
 Cluster extent is easier!
Thresholding? Cross correlation random field

 Correlation between 2 fields at 2 different locations, searched over all pairs of locations
 one in R (D dimensions), one in S (E dimensions)

\[
P\Big(\max_{R,S}\ \mathrm{Correlation} > c\Big) \approx \sum_{d=0}^{D}\sum_{e=0}^{E} \mathcal{L}_d(R)\,\mathcal{L}_e(S)\,\rho_{d,e}(c),
\]

\[
\rho_{d,e}(c) = \frac{(d-1)!\,e!\,2^{n-2-d-e}}{\pi^{(d+e+1)/2}}
\sum_{k}\sum_{i,j} (-1)^k\, c^{\,d+e-1-2k}\,(1-c^2)^{\frac{n-d-e-1}{2}+k}\,
\frac{\Gamma\big(\frac{n-d}{2}+i\big)\,\Gamma\big(\frac{n-e}{2}+j\big)}
{i!\,j!\,(k-i-j)!\,(n-1-d-e+i+j+k)!\,(d-1-k-i+j)!\,(e-k-j+i)!}.
\]

Cao & Worsley, Annals of Applied Probability (1999)
 MS lesion data: P = 0.05, c = 0.300, T = 6.46
[Figures: histograms of correlations vs. distance (mm, 0–150), for the same hemisphere and for different hemispheres, with the threshold marked; ‘conditional’ histograms scaled to the same maximum at each distance]
The details …

[Figures: tube constructions behind the definitions. Tube(S, r) about the search region S; Tube_Λ(S, r) in the metric induced by Λ (large Λ gives a bigger tube than small Λ); tubes about U(s1), U(s2), U(s3); Tube(R, r) about the rejection region R for Z = (Z1, Z2) ~ N2(0, I)]
Summary

Comparison

 Both depend on the average correct bubbles; the rest is ~constant:
Z = (average correct bubbles − average incorrect bubbles) / pooled sd
Proportion correct bubbles = average correct bubbles / (average all bubbles × 4)
[Figure: Z image (scale −2 to 4) and proportion image (scale 0.65–0.75)]
Random field theory results

 For searching in D (= 2) dimensions, the P-value of max Z is
P(maxs Z(s) ≥ t) ~ E( Euler characteristic of {s: Z(s) ≥ t} )
= ReselsD(S) × ECD(t) (+ boundary terms)
 ReselsD(S) = image area / (bubble FWHM)² = 146.2 (unitless)
 ECD(t) = (4 log 2)^(D/2) t^(D−1) exp(−t²/2) / (2π)^((D+1)/2)
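Inverting this for the P = 0.05 threshold is a one-dimensional root-find. A sketch using only the D = 2 resels term (the boundary terms are omitted here, so the result lands near, but not exactly at, the Z = 3.92 quoted earlier):

```python
import math

def ec_density_2d(t):
    """EC_2(t) = 4 log 2 * t * exp(-t^2/2) / (2 pi)^(3/2)."""
    return (4 * math.log(2)) * t * math.exp(-t * t / 2) / (2 * math.pi) ** 1.5

def rft_threshold(resels, alpha=0.05, lo=1.0, hi=10.0):
    """Solve resels * EC_2(t) = alpha by bisection (EC_2 decreasing for t > 1)."""
    for _ in range(80):
        mid = (lo + hi) / 2
        if resels * ec_density_2d(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

t = rft_threshold(146.2)     # bubbles image: Resels_2 = 146.2
print(t)                     # close to the 3.92 threshold on the slide
```

Adding the EC(S) and perimeter boundary terms nudges the root slightly upward, which is consistent with the quoted 3.92.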