Sparse inference and large-scale multiple comparisons
Maxima of discretely sampled random fields, with an application to ‘bubbles’

Keith Worsley, McGill
Nicholas Chamandy, McGill and Google
Jonathan Taylor, Stanford and Université de Montréal
Frédéric Gosselin, Université de Montréal
Philippe Schyns, Fraser Smith, Glasgow

What is ‘bubbles’? (Nature, 2005)
The subject is shown one of 40 faces chosen at random (happy, sad, fearful or neutral), but the face is only revealed through random ‘bubbles’.

First trial: “Sad” expression. 75 random bubble centres are smoothed by a Gaussian ‘bubble’ to give what the subject sees. The subject is asked the expression. Response: “Neutral”. Incorrect.

Your turn …
Trial 2: response “Fearful”, correct
Trial 3: response “Happy”, incorrect (Fearful)
Trial 4: response “Happy”, correct
Trial 5: response “Fearful”, correct
Trial 6: response “Sad”, correct
Trial 7: response “Happy”, correct
Trial 8: response “Neutral”, correct
Trial 9: response “Happy”, correct
…
Trial 3000: response “Happy”, incorrect (Fearful)

Bubbles analysis
E.g. Fearful (3000/4 = 750 trials): sum the bubble masks over trials 1 + 2 + 3 + … + 750, for the correct trials and for all trials. Then

  proportion of correct bubbles = (sum of correct bubbles) / (sum of all bubbles).

Threshold this image at the overall proportion of correct trials (0.68), scale to [0,1], and use the result as a bubble mask.

Results
Apply the mask to the average face for each expression: Happy, Sad, Fearful, Neutral. But are these features real or just noise? We need statistics …

Statistical analysis
Correlate the bubbles with the response (correct = 1, incorrect = 0), separately for each expression. This is equivalent to a 2-sample Z-statistic for correct vs. incorrect bubbles.
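The two per-pixel summaries above (the proportion of correct bubbles, and the equivalent two-sample Z) can be sketched as follows. This is a minimal illustration on lists of 2D masks, not the original analysis code, and all names are ours:

```python
import math

def bubbles_stats(masks, correct):
    """Per-pixel proportion of correct bubbles and the equivalent
    two-sample Z statistic (correct vs. incorrect trials).
    masks: one smoothed bubble image (2D list) per trial;
    correct: one 0/1 response per trial."""
    rows, cols = len(masks[0]), len(masks[0][0])
    n1 = sum(correct)             # number of correct trials
    n0 = len(correct) - n1        # number of incorrect trials
    prop = [[0.0] * cols for _ in range(rows)]
    zstat = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            v1 = [m[r][c] for m, y in zip(masks, correct) if y == 1]
            v0 = [m[r][c] for m, y in zip(masks, correct) if y == 0]
            m1, m0 = sum(v1) / n1, sum(v0) / n0
            # proportion of correct bubbles = (sum correct) / (sum all)
            total = sum(v1) + sum(v0)
            prop[r][c] = sum(v1) / total if total else 0.0
            # two-sample Z with pooled standard deviation
            ss = sum((v - m1) ** 2 for v in v1) + sum((v - m0) ** 2 for v in v0)
            se = math.sqrt(ss / (n1 + n0 - 2)) * math.sqrt(1 / n1 + 1 / n0)
            zstat[r][c] = (m1 - m0) / se if se > 0 else 0.0
    return prop, zstat
```

In the real analysis each expression contributes 750 trials and the masks are 240 × 380 pixels; the toy grids here just exercise the arithmetic.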
E.g. Fearful: trials 1, 2, …, 750, with responses 0, 1, 0, 1, 1, …, 1. The resulting Z ~ N(0,1) statistic image is very similar to the proportion of correct bubbles.

Results
Thresholded at Z = 1.64 (P = 0.05, uncorrected), shown over the average face for Happy, Sad, Fearful and Neutral.

Sparse inference and large-scale multiple comparisons: correction?
Three methods so far. The set-up:
- S is a subset of a D-dimensional lattice (e.g. pixels);
- Z(s) ~ N(0,1) at most points s in S;
- Z(s) ~ N(μ(s),1), μ(s) > 0, at a sparse set of points;
- Z(s1), Z(s2) are spatially correlated.
To control the false positive rate to ≤ α we want a good approximation to
  α = P(max_{s∈S} Z(s) ≥ t):
- Bonferroni (1936)
- Random field theory (1970's)
- Discrete local maxima (2005, 2007)
[Figure: P-value of the maximum vs. FWHM (Full Width at Half Maximum) of the smoothing filter, at the threshold where P(max_S Z(s) ≥ t) = 0.05: Bonferroni is accurate at low FWHM, random field theory at high FWHM, and discrete local maxima across the whole range.]

Random field theory: Euler Characteristic (EC)
EC = #blobs − #holes of the excursion set {s: Z(s) ≥ t}, here for the neutral face.
[Figure: excursion sets at increasing thresholds, with EC = 0, −7, −11, 13, 14, 9, 1, 0; observed and expected EC plotted against the threshold t from −4 to 4.]
Heuristic: at high thresholds t the holes disappear, EC ~ 1 or 0, and E(EC) ~ P(max Z ≥ t).
- There is an exact expression for E(EC) for all thresholds;
- E(EC) ≈ P(max Z ≥ t) is extremely accurate.

Random field theory: The details
If Z(s) ~ N(0,1) is an isotropic Gaussian random field, s ∈ R², with λ I_{2×2} = V(∂Z/∂s), then

  P(max_{s∈S} Z(s) ≥ t) ≈ E(EC(S ∩ {s: Z(s) ≥ t}))
    = EC(S) ∫_t^∞ (2π)^{−1/2} e^{−z²/2} dz
    + ½ Perimeter(S) λ^{1/2}/(2π) e^{−t²/2}
    + Area(S) λ/(2π)^{3/2} t e^{−t²/2}.

EC(S), Perimeter(S) and Area(S) are the intrinsic volumes or Minkowski functionals of S. If Z(s) is white noise convolved with an isotropic Gaussian filter of Full Width at Half Maximum FWHM, then λ = 4 log 2 / FWHM².
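The 2D formula above can be evaluated directly. A small sketch, with the resels parameterization folded in (Resels2 = Area/FWHM², Resels1 = ½ Perimeter/FWHM, Resels0 = EC(S)); the function name is ours:

```python
import math

def rft_pvalue_2d(t, resels2, resels1=0.0, resels0=1.0):
    """P(max Z >= t) ~ E(EC) = sum_d Resels_d(S) * EC density_d(t), D = 2."""
    # EC densities for a Gaussian field smoothed to 1 FWHM
    ec0 = 0.5 * math.erfc(t / math.sqrt(2.0))               # upper Gaussian tail
    ec1 = math.sqrt(4.0 * math.log(2.0)) * math.exp(-t * t / 2.0) / (2.0 * math.pi)
    ec2 = 4.0 * math.log(2.0) * t * math.exp(-t * t / 2.0) / (2.0 * math.pi) ** 1.5
    return resels0 * ec0 + resels1 * ec1 + resels2 * ec2
```

For the bubbles search region (Resels2 = 146.2), rft_pvalue_2d(3.92, 146.2) comes out near 0.05, consistent with the corrected threshold quoted later in the talk.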
Random field theory: The brain mapping version
If Z(s) is white noise smoothed with an isotropic Gaussian filter of Full Width at Half Maximum FWHM, the same expansion can be written in terms of resels (resolution elements) and EC densities:

  P(max_{s∈S} Z(s) ≥ t) ≈ E(EC(S ∩ {s: Z(s) ≥ t}))
    = Resels0(S) EC0(t) + Resels1(S) EC1(t) + Resels2(S) EC2(t),

where the resels are
  Resels0(S) = EC(S),
  Resels1(S) = ½ Perimeter(S) / FWHM,
  Resels2(S) = Area(S) / FWHM²,
and the EC densities are
  EC0(t) = ∫_t^∞ (2π)^{−1/2} e^{−z²/2} dz,
  EC1(t) = (4 log 2)^{1/2} e^{−t²/2} / (2π),
  EC2(t) = (4 log 2) t e^{−t²/2} / (2π)^{3/2}.

Discrete local maxima
Bonferroni applied to the events {Z(s) ≥ t and Z(s) is a discrete local maximum}, i.e. {Z(s) ≥ t and the neighbouring Z's are ≤ Z(s)}: conservative. If Z(s) is stationary, with Cor(Z(s1), Z(s2)) = ρ(s1 − s2), all we need is P{Z(s) ≥ t and neighbouring Z's ≤ Z(s)}, a (2D+1)-variate integral.

Discrete local maxima: “Markovian” trick
If ρ is separable, s = (x,y) with ρ((x,y)) = ρ((x,0)) × ρ((0,y)) (e.g. the Gaussian spatial correlation function ρ((x,y)) = exp(−½(x²+y²)/w²)), then Z(s) has a “Markovian” property: conditional on the central Z(s), the Z's on different axes are independent,
  Z(s±1) ⊥ Z(s±2) | Z(s).
So condition on Z(s) = z, find
  P{neighbouring Z's ≤ z | Z(s) = z} = ∏_d P{Z(s±d) ≤ z | Z(s) = z},
then take expectations over Z(s) = z. This cuts the (2D+1)-variate integral down to a bivariate integral.

The result only involves the correlation ρ_d between adjacent voxels along each lattice axis d, d = 1, …, D. First let the Gaussian density and uncorrected P-value be
  φ(z) = exp(−z²/2)/√(2π),  Φ(z) = ∫_z^∞ φ(u) du,
respectively. Then define
  Q(ρ, z) = 1 − 2Φ(hz) + (1/π) ∫_0^α exp(−½ h²z²/sin²θ) dθ,
where
  α = sin⁻¹(√((1 − ρ²)/2)),  h = √((1 − ρ)/(1 + ρ)).
Then the P-value of the maximum is bounded by
  P(max_{s∈S} Z(s) ≥ t) ≤ |S| ∫_t^∞ ∏_{d=1}^D Q(ρ_d, z) φ(z) dz,
where |S| is the number of voxels s in the search region S.
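A numerical sketch of the discrete-local-maxima bound above, using a simple midpoint rule for both integrals (function names and grid sizes are ours). As a check, at ρ = 0 the two neighbours along an axis are independent of the centre, so Q(0, z) should reduce to P(Z ≤ z)²:

```python
import math

def phi(z):
    """Gaussian density."""
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def Phi(z):
    """Upper-tail probability P(Z >= z), as defined above."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def Q(rho, z, n=400):
    """Q(rho,z) = 1 - 2*Phi(h*z) + (1/pi) int_0^alpha exp(-h^2 z^2/(2 sin^2 u)) du."""
    h = math.sqrt((1.0 - rho) / (1.0 + rho))
    alpha = math.asin(math.sqrt((1.0 - rho * rho) / 2.0))
    du = alpha / n
    integral = sum(math.exp(-0.5 * (h * z) ** 2 / math.sin((k + 0.5) * du) ** 2)
                   for k in range(n)) * du
    return 1.0 - 2.0 * Phi(h * z) + integral / math.pi

def dlm_pvalue(t, nvox, rhos, n=400, zmax=8.0):
    """Bound |S| * int_t^inf prod_d Q(rho_d, z) phi(z) dz, truncated at zmax."""
    dz = (zmax - t) / n
    zs = [t + (k + 0.5) * dz for k in range(n)]
    return nvox * sum(math.prod(Q(r, z) for r in rhos) * phi(z) for z in zs) * dz
```

Since every Q(ρ, z) ≤ 1, the bound is never larger than the plain Bonferroni P-value |S| Φ(t); it is much smaller when the lattice correlations ρ_d are high.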
For a voxel on the boundary of the search region with just one neighbour in axis direction d, replace Q(ρ_d, z) by 1 − Φ(hz), and by 1 if it has no neighbours.

Results, corrected for search
- Random field theory threshold: Z = 3.92 (P = 0.05), shown over the average face for Happy, Sad, Fearful, Neutral.
- Discrete local maxima (DLM) threshold: Z = 3.92 (P = 0.05), the same.
- Bonferroni threshold: Z = 4.87 (P = 0.05), nothing survives.
- FDR threshold: Z = 3.31 (Q = 0.05).

Comparison
Bonferroni (1936)
- Conservative
- Accurate if the spatial correlation is low
- Simple
Discrete local maxima (2005, 2007)
- Conservative
- Accurate for all ranges of spatial correlation
- A bit messy; only easy for stationary separable Gaussian data on rectilinear lattices
- Even if not separable, it always seems to be conservative
Random field theory (1970's)
- An approximation based on assuming S is continuous
- Accurate if the spatial correlation is high
- Elegant
- Easily extended to non-Gaussian, non-isotropic random fields

Random field theory: Non-Gaussian, non-isotropic
If T(s) = f(Z1(s), …, Zn(s)) is a function of i.i.d. Gaussian random fields Zi(s) ~ Z(s) ~ N(0,1), s ∈ R^D, with V(∂Z(s)/∂s) = Λ(s)_{D×D}, then replace the resels by the Lipschitz-Killing curvatures L_d(S; Λ):

  P(max_{s∈S} T(s) ≥ t) ≈ E(EC(S ∩ {s: T(s) ≥ t})) = Σ_{d=0}^D L_d(S; Λ) ρ_d(t),

where ρ_d(t) is the same EC density as in the isotropic case with Λ(s) = I_{D×D} (Taylor & Adler, 2003). Both the Lipschitz-Killing curvature L_d(S; Λ) and the EC density ρ_d(t) are defined implicitly, as coefficients of a power series expansion of the volume of a tube as a function of its radius.
In the case of the Lipschitz-Killing curvature, the tube is about the search region S in the Riemannian metric induced by Λ(s); in the case of the EC densities, the tube is about the rejection region f⁻¹([t, ∞)) and volume is replaced by Gaussian probability. Fortunately there are simple ways of estimating Lipschitz-Killing curvature from sample data (Taylor & Worsley, 2007), and simple ways of calculating EC densities.

Referee report
Why bother? Why not just do simulations?

fMRI data
120 scans, 3 scans each of hot, rest, warm, rest, hot, rest, …
T = (hot − warm effect) / s.d. ~ t_110 if there is no effect.
[Figure: first scan of the fMRI data; voxel time courses over 0–300 seconds showing a highly significant effect (T = 6.59), no significant effect (T = −0.74), and drift; the T statistic image for the hot − warm effect.]

Bubbles task in the fMRI scanner
Correlate the bubbles with BOLD at every voxel, over trials 1, 2, …, 3000. Calculate Z for each pair (bubble pixel, fMRI voxel): a 5D “image” of Z statistics …

Discussion: thresholding
Thresholding in advance is vital, since we cannot store all ~1 billion 5D Z values.
Resels5 = (image Resels2 = 146.2) × (fMRI Resels3 = 1057.2); for P = 0.05 the threshold is t = 6.22 (approx).
The threshold based on Gaussian RFT can be improved using new non-Gaussian RFT based on saddle-point approximations (Chamandy et al., 2006): model the bubbles as a smoothed Poisson point process. The improved thresholds are slightly lower, so more activation is detected.
Only keep 5D local maxima:
  Z(pixel, voxel) > Z(pixel, 6 neighbours of voxel),
  Z(pixel, voxel) > Z(4 neighbours of pixel, voxel).

Discussion: modeling
The random response is Y = 1 (correct) or 0 (incorrect), or Y = fMRI. The regressors are Xj = the bubble mask at pixel j, j = 1, …, 240 × 380 = 91200 (!)
Logistic regression or ordinary regression:
  logit(E(Y)) or E(Y) = b0 + X1 b1 + … + X91200 b91200.
But there are only n = 3000 observations (trials) … Instead, since the regressors are independent, fit them one at a time:
  logit(E(Y)) or E(Y) = b0 + Xj bj.
However, the regressors (bubbles) are random with a simple known distribution, so turn the problem around and condition on Y:
  E(Xj) = c0 + Y cj.
This is equivalent to conditional logistic regression (Cox, 1962), which gives exact inference for bj conditional on sufficient statistics for b0. Cox also suggested using saddle-point approximations to improve the accuracy of the inference …

Interactions?
  logit(E(Y)) or E(Y) = b0 + X1b1 + … + X91200b91200 + X1X2b1,2 + …

Bayesian Model Selection (thanks to Ed George)
• The Z-statistic at voxel i is Zi ~ N(mi, 1), i = 1, …, n.
• Most of the mi's are zero (unactivated voxels) and a few are nonzero (activated voxels), but we do not know which voxels are activated, and by how much (mi).
• This is a model selection problem, where we add an extra model parameter (mi) for the mean of each activated voxel.
• Simple Bayesian set-up:
  - each voxel is independently active with probability p;
  - the activation is itself drawn independently from a Gaussian distribution: mi ~ N(0, c).
• The hyperparameter p controls the expected proportion of activated voxels, and c controls their expected activation.
• Surprise!
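The one-regressor-at-a-time logistic fit, logit(E(Y)) = b0 + Xj bj, can be sketched with a few Newton-Raphson (equivalently IRLS) steps. This is an illustration with our own names, not the conditional-logistic machinery used in the analysis:

```python
import math

def logistic_fit_single(x, y, iters=25):
    """Fit logit(P(Y = 1)) = b0 + b1 * x for one regressor by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p           # score for the intercept
            g1 += (yi - p) * xi    # score for the slope
            h00 += w               # observed information matrix
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det   # Newton step: H^{-1} g
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1
```

In the application one such fit would be repeated per pixel j, which is why conditioning on Y (a single regression of Xj on Y) is so much cheaper.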
• This prior set-up is related to the canonical penalized sum-of-squares criterion
    A_F = Σ_{activated voxels} Zi² − F q,
  where q is the number of activated voxels and F is a fixed penalty for adding an activated voxel.
• Popular model selection criteria simply entail maximizing A_F for a particular choice of F, which is equivalent to thresholding the image at √F.
• Some choices of F:
  - F = 0: all voxels activated
  - F = 2: Mallows' Cp and AIC
  - F = log n: BIC
  - F = 2 log n: RIC
  - P(Z > √F) = 0.05/n: Bonferroni (almost the same as RIC!)
• The Bayesian relationship with A_F is obtained by re-expressing the posterior of the activated voxels, given the data:
    P(activated voxels | Z's) ∝ exp( c/(2(1+c)) A_F ),
  where F = (1+c)/c {2 log[(1−p)/p] + log(1+c)}.
• Since p and c control the expected number and size of the activation, the dependence of F on p and c provides an implicit connection between the penalty F and the sorts of models for which its value may be appropriate.
• The awful truth: p and c are unknown.
• Empirical Bayes idea: use the p and c that maximize the marginal likelihood, which simplifies to
    L(p, c | Z's) ∝ Π_i [ (1−p) exp(−Zi²/2) + p (1+c)^{−1/2} exp(−Zi²/(2(1+c))) ].
• This is identical to fitting a classic mixture model with
  - a probability of (1−p) that Zi ~ N(0,1), and
  - a probability of p that Zi ~ N(0, 1+c);
  √F is the value of Z where the two components are equal.
• Using these estimated values of p and c gives an adaptive penalty F, or equivalently a threshold √F, that is implicitly based on the SPM.
• All we have to do is fit the mixture model … but does it work?
• Same data as before (hot − warm stimulus, four runs):
  - proportion of activated voxels p = 0.57
  - variance of activated voxels c = 5.8 (sd = 2.4)
  - penalty F = 1.59 (a bit like AIC)
  - threshold √F = 1.26 (?)
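Fitting the mixture can be sketched with a few lines of EM; the starting values and names are ours, and the activated component is parameterized by its marginal variance 1 + c, as in the likelihood above:

```python
import math

def fit_mixture(z, iters=100):
    """EM for the mixture: Zi ~ N(0,1) w.p. (1-p), Zi ~ N(0, 1+c) w.p. p."""
    p, c = 0.5, 4.0                        # illustrative starting values
    for _ in range(iters):
        # E-step: posterior probability that each voxel is activated
        resp = []
        for zi in z:
            a = p * math.exp(-zi * zi / (2.0 * (1.0 + c))) / math.sqrt(1.0 + c)
            b = (1.0 - p) * math.exp(-zi * zi / 2.0)
            resp.append(a / (a + b))
        # M-step: update the mixing proportion and the activated variance
        p = sum(resp) / len(z)
        v = sum(r * zi * zi for r, zi in zip(resp, z)) / sum(resp)
        c = max(v - 1.0, 1e-6)             # marginal variance of activated Z's is 1+c
    return p, c

def adaptive_threshold(p, c):
    """sqrt(F), the Z where the two components are equal (nan if they never are)."""
    F = (1.0 + c) / c * (2.0 * math.log((1.0 - p) / p) + math.log(1.0 + c))
    return math.sqrt(F) if F > 0 else float("nan")
```

Plugging in the four-run estimates (p = 0.57, c = 5.8) reproduces the quoted √F ≈ 1.26, and the one-run estimates (p = 0.80, c = 1.55) give F ≈ −3.02, the negative penalty discussed next, for which no threshold exists.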
This seems a bit low …
[Figure: histogram of the SPM (n = 30786) with the fitted mixture: 57% activated voxels, N(0, 5.8); 43% unactivated voxels, N(0,1); √F is the threshold where the two components are equal. For comparison: AIC √F = 2, FDR (0.05) √F = 2.67, BIC √F = 3.21, RIC √F = 4.55, Bonferroni (0.05) √F = 4.66.]

• Same data as before (hot − warm stimulus, one run):
  - proportion of activated voxels p = 0.80
  - variance of activated voxels c = 1.55
  - penalty F = −3.02 (?)
  - all voxels activated!!! What is going on?
[Figure: histogram of the SPM (n = 30768) with the fitted mixture: 80% activated voxels, N(0, 1.55); 20% unactivated voxels, N(0,1); the two components are never equal, so there is no threshold. Comparison thresholds as above: AIC 2, FDR (0.05) 2.67, BIC 3.21, RIC 4.55, Bonferroni (0.05) 4.66.]

MS lesions and cortical thickness
Idea: MS lesions interrupt neuronal signals, causing thinning in down-stream cortex.
Data: n = 425 mild MS patients.
[Figure: average cortical thickness (mm) vs. total lesion volume (cc); correlation = −0.568, T = −14.20 (423 df).]
Charil et al., NeuroImage (2007)

MS lesions and cortical thickness at all pairs of points
The relationship is dominated by total lesions and average cortical thickness, so remove these effects:
- lesion density LD, smoothed 10 mm;
- cortical thickness CT, smoothed 20 mm, with the average cortical thickness subtracted.
Find the partial correlation (lesion density, cortical thickness) removing total lesion volume TLV, i.e. fit the linear model
  CT − av(CT) ~ 1 + TLV + LD
and test for LD. Repeat over all voxels in 3D and all nodes in 2D: ~1 billion correlations, so thresholding is essential! Look for high negative correlations …

Thresholding?
Cross correlation random field
Correlation between 2 fields at 2 different locations, searched over all pairs of locations, one in R (D dimensions), one in S (E dimensions); sample size n.
Cao & Worsley, Annals of Applied Probability (1999)
MS lesion data: P = 0.05, c = 0.300, T = 6.46.

Cluster extent rather than peak height (Friston, 1994)
Choose a lower threshold, e.g. t = 3.11 (P = 0.001). Find the clusters, i.e. the connected components of the excursion set, and measure each cluster by its spatial extent LD.
Distribution: fit a quadratic to the peak, so that the extent of a cluster is approximately c H^{D/2}, where H is the height of the peak above t. Distribution of the maximum cluster extent: Bonferroni on N = #clusters ~ E(EC).
Cao, Advances in Applied Probability (1999)

How do you choose the threshold t for defining clusters?
- If the signal is focal, i.e. ~FWHM of the noise: choose a high threshold; peak height is better.
- If the signal is broad, i.e. >> FWHM of the noise: choose a low threshold; cluster extent is better.
Conclusion: cluster extent is better for detecting broad signals.
Alternative: smooth the data with a filter that matches the signal (Matched Filter Theorem) … try a range of filter widths … scale space search … correct using random field theory … a lot of work … Cluster extent is easier!

Thresholding?
Cross correlation random field
Correlation between 2 fields at 2 different locations, searched over all pairs of locations, one in R (D dimensions), one in S (E dimensions):

  P(max_{R,S} Correlation > c) ≈ Σ_{d=0}^D Σ_{e=0}^E L_d(R) L_e(S) ρ_{d,e}(c),

with EC densities

  ρ_{d,e}(c) = (2^{n−2−d−e} (d−1)! e! / π^{(d+e+1)/2})
    × Σ_k (−1)^k c^{d+e−1−2k} (1 − c²)^{(n−1−d−e)/2 + k}
    × Σ_{i,j} Γ((n−d)/2 + i) Γ((n−e)/2 + j)
      / [ i! j! (k−i−j)! (n−1−d−e+i+j+k)! (d−1−k−i+j)! (e−k−j+i)! ].
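The cluster-finding step above (threshold at t, take connected components of the excursion set, measure each component's extent) can be sketched in 2D with a flood fill; 4-connectivity is our assumption:

```python
from collections import deque

def cluster_extents(z, t):
    """Connected components (4-connectivity) of the excursion set
    {s: z[s] >= t} in a 2D grid; returns each cluster's extent in pixels."""
    rows, cols = len(z), len(z[0])
    seen = [[False] * cols for _ in range(rows)]
    extents = []
    for r in range(rows):
        for c in range(cols):
            if z[r][c] >= t and not seen[r][c]:
                # breadth-first flood fill of one cluster
                size, queue = 0, deque([(r, c)])
                seen[r][c] = True
                while queue:
                    i, j = queue.popleft()
                    size += 1
                    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ni, nj = i + di, j + dj
                        if (0 <= ni < rows and 0 <= nj < cols
                                and z[ni][nj] >= t and not seen[ni][nj]):
                            seen[ni][nj] = True
                            queue.append((ni, nj))
                extents.append(size)
    return extents
```

The Bonferroni step then compares the largest extent against the null distribution of a single cluster's extent, with N = #clusters ~ E(EC) comparisons.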
Cao & Worsley, Annals of Applied Probability (1999)
MS lesion data: P = 0.05, c = 0.300, T = 6.46.
[Figure: histograms of the (lesion density, cortical thickness) correlations against distance (mm), for the same hemisphere and for different hemispheres, with the threshold c = 0.300 marked; and the ‘conditional’ histograms, scaled to the same maximum at each distance.]

The details …
[Figure: tube diagrams: Tube(S, r) about the search region S, with big and small Λ; tubes about points s1, s2, s3 with the sets U(s1), U(s2), U(s3); and Tube(R, r) about the rejection region R for (Z1, Z2) ~ N2(0, I).]

Summary
Both the proportion of correct bubbles and the Z statistic depend on the average of the correct bubbles; the rest is roughly constant:
  Z = (average correct bubbles − average incorrect bubbles) / pooled sd,
  proportion of correct bubbles = average correct bubbles / (average of all bubbles × 4).

Random field theory results
For searching in D (= 2) dimensions, the P-value of the maximum Z is
  P(max_s Z(s) ≥ t) ~ E(Euler characteristic of {s: Z(s) ≥ t})
    = Resels_D(S) × EC_D(t) (+ boundary terms),
where
  Resels_D(S) = image area / (bubble FWHM)² = 146.2 (unitless),
  EC_D(t) = (4 log 2)^{D/2} t^{D−1} exp(−t²/2) / (2π)^{(D+1)/2}.
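The summary formula can be inverted for the threshold: solve Resels_D(S) × EC_D(t) = 0.05 for t by bisection (leading term only, boundary terms ignored; names are ours). With Resels = 146.2 this lands near the corrected threshold Z = 3.92 quoted earlier:

```python
import math

def ec_density_2d(t):
    """EC_2(t) = (4 log 2) t exp(-t^2/2) / (2 pi)^(3/2), the D = 2 EC density."""
    return 4.0 * math.log(2.0) * t * math.exp(-t * t / 2.0) / (2.0 * math.pi) ** 1.5

def rft_threshold(resels, alpha=0.05, lo=1.0, hi=10.0):
    """Bisection for the t with resels * EC_2(t) = alpha.
    EC_2 is decreasing for t > 1, so the bracket [1, 10] is monotone."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if resels * ec_density_2d(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Including the boundary (perimeter) resels term would nudge the solution slightly upward; the leading term alone already agrees with the quoted threshold to about 0.02.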