Statistical analysis of fMRI data, 'bubbles' data, and the connectivity between the two

Keith Worsley, McGill (and Chicago)
Nicholas Chamandy, McGill and Google
Jonathan Taylor, Université de Montréal and Stanford
Robert Adler, Technion
Philippe Schyns, Fraser Smith, Glasgow
Frédéric Gosselin, Université de Montréal
Arnaud Charil, Alan Evans, Montreal Neurological Institute

Before you start: PCA of time × space

Temporal components (sd, % variance explained): 1: 0.68, 46.9%; 2: 0.29, 8.6%; 3: 0.17, 2.9%; 4: 0.15, 2.4%.
[Figure: temporal components over ~140 frames; spatial components over slices 0–12.]
Interpretation of the components:
1: exclude the first frames
2: drift
3: long-range correlation or anatomical effect; remove by converting to % of brain
4: signal?

Bad design: 2 mins rest, 2 mins Mozart, 2 mins Eminem, 2 mins James Brown

Temporal components (sd, % variance explained): 1: 0.41, 17%; 2: 0.31, 9.5%; 3: 0.24, 5.6%; periods 5.2, 16.1, 15.6, 11.6 seconds.
[Figure: temporal components over ~200 frames; spatial components over slices 0–18.]

Effect of stimulus on brain response

• Stimulus: alternating hot and warm stimuli separated by rest (9 seconds each).
• Hemodynamic response function (HRF): difference of two gamma densities; the stimulus is delayed and dispersed by ~6 s.
• The response is modeled by convolving the stimulus with the HRF.
• Responses = stimuli * HRF, sampled every 3 seconds.

fMRI data, pain experiment, one slice

[Figure: first scan of the fMRI data; voxel time courses over 0–350 s showing a highly significant effect (T = 6.59), no significant effect (T = -0.74), and drift.]
T = (hot – warm effect) / S.d.
~ t_110 if there is no effect.

How fMRI differs from other repeated measures data

• Many reps (~200 time points), few subjects (~15).
• Df within subjects is high, so it is not worth pooling the sd across subjects.
• Df between subjects is low, so use spatial smoothing to boost the df.
• Data sets are huge (~4 GB), so they are not easy to handle in statistics packages such as R.

FMRISTAT (Matlab) / BRAINSTAT (Python) statistical analysis strategy

• Analyse each voxel separately.
• Break up the analysis into stages:
1st level: analyse each time series separately
2nd level: combine 1st level results over runs
3rd level: combine 2nd level results over subjects
• Borrow strength from neighbours when needed.
• Cut corners: do a reasonable analysis in a reasonable time (or else no one will use it!).

1st level: Linear model with AR(p) errors

Data: Y_t = fMRI data at time t; x_t = (responses, 1, t, t², t³, …)' to allow for drift.
Model: Y_t = x_t'β + ε_t, where ε_t = a_1 ε_{t-1} + … + a_p ε_{t-p} + σ_F η_t, η_t ~ N(0,1) i.i.d.
Fit in 2 stages:
1st pass: fit by least squares, find the residuals, estimate the AR parameters a_1 … a_p.
2nd pass: whiten the data, then re-fit by least squares.

Higher levels: Mixed effects model

Data: E_i = effect (contrast in β) from the previous level; S_i = sd of that effect; z_i = (1, treatment, group, gender, …)'.
Model: E_i = z_i'γ + S_i ε_i^F + σ_R ε_i^R (S_i has high df, so it is assumed fixed);
ε_i^F ~ N(0,1) i.i.d. fixed effects error; ε_i^R ~ N(0,1) i.i.d.
random effects error.
Fit by ReML; use EM for stability, 10 iterations.

Where we use spatial information

• 1st level: smooth the AR parameters to lower their variability and increase "df" ("df" is defined by a Satterthwaite approximation, a surrogate for the variance of the variance parameters).
• Higher levels: smooth the random / fixed effects sd ratio to lower its variability and increase "df".
• Final level: use random field theory to correct for multiple comparisons.

1st level: Autocorrelation

AR(1) model: ε_t = a_1 ε_{t-1} + σ_F η_t.
Fit the linear model by least squares, take residuals ε̂_t = Y_t – Ŷ_t, and set â_1 = Correlation(ε̂_t, ε̂_{t-1}).
Estimating ε_t changes their correlation structure slightly, so â_1 is slightly biased (~ -0.05); after bias correction, â_1 ~ 0.
[Figure: raw autocorrelation; bias corrected; smoothed 12.4 mm.]

How much smoothing?

• Variability in â lowers the df, and the df depends on the contrast.
• Smoothing â brings the df back up:

df_â = df_residual ( 2 FWHM_â² / FWHM_data² + 1 )^{3/2}

1/df_eff = 1/df_residual + 2 acor(contrast of data)² / df_â

• Hot stimulus: FWHM_data = 8.79 mm, residual df = 110, acor = 0.61; the target of 100 df is reached at FWHM_â = 10.3 mm.
• Hot – warm stimulus: residual df = 110, acor = 0.79; the target of 100 df is reached at FWHM_â = 12.4 mm.

2nd level: 4 runs, 3 df for random effects sd

[Figure: 2nd level effects E_i, sds S_i and T stats for runs 1–4.]
The sd is very noisy, and T > 15.96 is needed for P < 0.05 (corrected), so no response is detected.

Solution: Spatial smoothing of the sd ratio

• Basic idea: increase "df" by spatial smoothing (local pooling) of the sd.
• We can't smooth the random effects sd directly: too much anatomical structure.
• Instead, sd = smooth(random effects sd / fixed effects sd) × fixed effects sd, which removes the anatomical structure before smoothing.
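The "how much smoothing" calculation above is easy to check numerically. Below is a minimal sketch (function names are mine, not FMRISTAT's) of the df-boost formula df_â = df_residual(2·FWHM_â²/FWHM_data² + 1)^{3/2} together with 1/df_eff = 1/df_residual + 2·acor²/df_â, solving by bisection for the smallest smoothing FWHM that reaches a target df:

```python
import math

def df_boost(fwhm_a, fwhm_data, df_resid):
    """Effective df of the spatially smoothed AR estimate a-hat."""
    return df_resid * (2 * (fwhm_a / fwhm_data) ** 2 + 1) ** 1.5

def df_eff(fwhm_a, fwhm_data, df_resid, acor):
    """Satterthwaite-style effective df of the resulting T statistic."""
    return 1.0 / (1.0 / df_resid
                  + 2 * acor ** 2 / df_boost(fwhm_a, fwhm_data, df_resid))

def fwhm_for_target(target, fwhm_data, df_resid, acor):
    """Smallest FWHM of AR smoothing that reaches the target df.
    df_eff is increasing in fwhm_a, so bisection applies."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if df_eff(mid, fwhm_data, df_resid, acor) < target:
            lo = mid
        else:
            hi = mid
    return hi
```

With the hot-stimulus numbers above (FWHM_data = 8.79 mm, residual df = 110, acor = 0.61), the target of 100 df is reached near 10.4 mm of smoothing, consistent with the ~10.3 mm on the slide.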
[Figure: average S_i; random effects sd (3 df); fixed effects sd (440 df); mixed effects sd (~100 df). Divide the random effects sd by the fixed effects sd, smooth, and multiply back; the smoothed sd ratio is ~1.3 where there is a random effect.]

How much smoothing?

df_ratio = df_random ( 2 FWHM_ratio² / FWHM_data² + 1 )^{3/2}

1/df_eff = 1/df_ratio + 1/df_fixed

With df_random = 3, df_fixed = 4 × 110 = 440 and FWHM_data = 8 mm: a fixed effects analysis has df_eff = 440, a random effects analysis (no smoothing) has df_eff = 3, and the target of 100 df is reached at FWHM_ratio = 19 mm. As FWHM_ratio → ∞ the analysis approaches the fixed effects analysis.

Final result: 19 mm smoothing, 100 df

[Figure: 2nd level effects E_i, sds S_i and T stats for runs 1–4.]
The sd is now less noisy, T > 4.93 gives P < 0.05 (corrected), and now we can detect a response!

Final level: Multiple comparisons correction

The threshold t is chosen so that P(max_S Z(s) ≥ t) = 0.05.
[Figure: P value vs FWHM of the smoothing filter, for Bonferroni, random field theory, and discrete local maxima.]

Random field theory

If Z(s) is white noise smoothed with an isotropic Gaussian filter of Full Width at Half Maximum FWHM, then

P( max_{s ∈ S} Z(s) ≥ t ) ≈ Σ_{d=0}^{3} Resels_d(S) EC_d(t),

where the resels (resolution elements) and EC densities are:

Resels_0(S) = EC(S),                EC_0(t) = ∫_t^∞ (2π)^{-1/2} e^{-z²/2} dz
Resels_1(S) = Diameter(S) / FWHM,   EC_1(t) = (4 log 2)^{1/2} / (2π) · e^{-t²/2}
Resels_2(S) = Area(S) / FWHM²,      EC_2(t) = (4 log 2) / (2π)^{3/2} · t e^{-t²/2}
Resels_3(S) = Volume(S) / FWHM³,    EC_3(t) = (4 log 2)^{3/2} / (2π)² · (t² - 1) e^{-t²/2}

Discrete local maxima

Bonferroni applied to the N events {Z(s) ≥ t and Z(s) is a discrete local maximum}, i.e.
{Z(s) ≥ t and neighbour Z's ≤ Z(s)}.
This is conservative. If Z(s) is stationary, with Cor(Z(s₁), Z(s₂)) = ρ(s₁ - s₂), then the DLM P-value satisfies

P{ max_S Z(s) ≥ t } ≤ N × P{ Z(s) ≥ t and neighbour Z's ≤ Z(s) },

so we only need to evaluate a (2D+1)-variate integral over Z(s) and its 2D lattice neighbours.

Discrete local maxima: "Markovian" trick

If ρ is "separable", i.e. s = (x, y) and ρ((x, y)) = ρ((x, 0)) × ρ((0, y)) (e.g. the Gaussian spatial correlation function ρ((x, y)) = exp(-½(x² + y²)/w²)), then Z(s) has a "Markovian" property: conditional on the central Z(s), the Z's on different axes are independent, Z(s±1) ⊥ Z(s±2) | Z(s).

So condition on Z(s) = z, find

P{neighbour Z's ≤ z | Z(s) = z} = ∏_d P{Z(s±d) ≤ z | Z(s) = z},

then take expectations over Z(s) = z. This cuts the (2D+1)-variate integral down to a bivariate integral.

The result only involves the correlation ρ_d between adjacent voxels along each lattice axis d, d = 1, …, D. First let the Gaussian density and uncorrected P-value be

φ(z) = exp(-z²/2) / √(2π),   Φ(z) = ∫_z^∞ φ(u) du,

respectively. Then define

Q(ρ, z) = 1 - 2Φ(hz) + (1/π) ∫_0^α exp(-½ h² z² / sin² θ) dθ,

where α = sin⁻¹(√((1 - ρ²)/2)) and h = √((1 - ρ)/(1 + ρ)).

Then the P-value of the maximum is bounded by

P( max_{s ∈ S} Z(s) ≥ t ) ≤ |S| ∫_t^∞ ∏_{d=1}^{D} Q(ρ_d, z) φ(z) dz,

where |S| is the number of voxels s in the search region S. For a voxel on the boundary of the search region with just one neighbour in axis direction d, replace Q(ρ_d, z) by 1 - Φ(hz), and by 1 if it has no neighbours.
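The DLM bound is cheap to evaluate numerically. The sketch below uses only the standard library (midpoint-rule integration; names are illustrative), under the Q and P-value formulas as written above. A useful sanity check: for ρ = 0 the two neighbours along an axis are independent given Z(s) = z, so Q(0, z) should reduce to P(N(0,1) ≤ z)².

```python
import math

def phi(z):
    """Standard Gaussian density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def Phi_upper(z):
    """Uncorrected upper-tail P-value, P(Z >= z)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def Q(rho, z, n=200):
    """P(both neighbours along one axis <= z | Z(s) = z), for
    correlation rho between adjacent voxels (midpoint rule)."""
    h = math.sqrt((1.0 - rho) / (1.0 + rho))
    alpha = math.asin(math.sqrt((1.0 - rho * rho) / 2.0))
    integral = sum(
        math.exp(-0.5 * (h * z) ** 2 / math.sin(alpha * (k + 0.5) / n) ** 2)
        for k in range(n)
    ) * (alpha / n)
    return 1.0 - 2.0 * Phi_upper(h * z) + integral / math.pi

def dlm_pvalue(t, rhos, nvox, zmax=10.0, n=2000):
    """DLM upper bound on P(max Z >= t) for a search region of nvox
    interior voxels, with correlations rhos = (rho_1, ..., rho_D)."""
    dz = (zmax - t) / n
    total = 0.0
    for k in range(n):
        z = t + (k + 0.5) * dz
        q = 1.0
        for rho in rhos:
            q *= Q(rho, z)
        total += q * phi(z) * dz
    return nvox * total
```

Since each Q(ρ_d, z) ≤ 1, the bound is never worse than Bonferroni, |S| × Φ(t), and it tightens as the correlation between neighbours grows.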
Example: single run, hot – warm

[Figure: some peaks are detected by BON and DLM but not by RFT; others are detected by DLM but not by BON or RFT.]

Estimating the delay of the response

• The delay, or latency to the peak of the HRF, is approximated by a linear combination of two optimally chosen basis functions:
HRF(t + shift) ~ basis1(t) w1(shift) + basis2(t) w2(shift).
• Convolve the bases with the stimulus, then add them to the linear model.
• Fit the linear model and estimate w1 and w2.
• Equate w2 / w1 to its estimate, then solve for the shift (Henson et al., 2002).
• To reduce bias when the magnitude is small, use shift / (1 + 1/T²), where T = w1 / Sd(w1) is the T statistic for the magnitude. This shrinks the shift to 0 where there is little evidence for a response.

Shift of the hot stimulus

[Figure: T stat for magnitude; T stat for shift; shift (secs); sd of shift (secs).]
Where the T stat for the magnitude is > 4, the shift is ~1 sec (T ~ 2) with sd ±0.5 sec.

Combining shifts of the hot stimulus

[Figure: effects E_i, sds S_i and T stats for runs 1–4, combined by MULTISTAT; contours are T stat for magnitude > 4.]
Combined shift ~1 sec, sd ±0.25 sec, T ~ 4.

Functional Imaging Analysis Contest, HBM2005

• 15 subjects, 4 runs per subject (2 with events, 2 with blocks), 4 conditions per run:
same sentence, same speaker
same sentence, different speaker
different sentence, same speaker
different sentence, different speaker
• 3T, 191 frames, TR = 2.5 s.
• Greater %BOLD response for different – same sentences (1.08 ± 0.16%) and for different – same speaker (0.47 ± 0.08%).
• Greater latency for different – same sentences (0.148 ± 0.035 secs).

Contrasts in the data used for effects

• 9 sec blocks, 9 sec gaps: Hot Sd = 0.16, Warm Sd = 0.16,
Hot – warm Sd = 0.19 (time courses over 0–350 secs).
• 90 sec blocks, 90 sec gaps: Hot Sd = 0.28, Warm Sd = 0.43, Hot – warm Sd = 0.55. With long blocks, only the data near the block transitions are used; the data in the middle of the blocks are effectively ignored.

Optimum block design

[Figure: sd of the hot and hot – warm effects, for magnitudes and delays, as a function of block length and gap length (each 0–20 secs); an X marks the best design in each panel. Very short blocks have not enough signal.]

Optimum event design

[Figure: sd of the effect (secs for delays) vs average time between events, for uniform, random and concentrated event timing.]
• ~12 secs between events is best for magnitudes; ~7 secs is best for delays.

How many subjects?

The largest portion of the variance comes from the last stage, i.e. combining over subjects:

sd_run² / (n_run n_sess n_subj) + sd_sess² / (n_sess n_subj) + sd_subj² / n_subj.

If you want to optimize total scanner time, take more subjects. What you do at the early stages doesn't matter very much!

Features special to FMRISTAT / BRAINSTAT

• Bias correction for the AR coefficients.
• Df boosting due to smoothing of: the AR coefficients; the random/fixed effects variance.
• P-value adjustment for: peaks due to small FWHM (DLM); clusters due to spatially varying FWHM.
• Delays analysed the same way as magnitudes.
• Sd of effects available before collecting data.

What is 'bubbles'?
Nature (2005). The subject is shown one of 40 faces chosen at random (Happy, Sad, Fearful, Neutral), but the face is only revealed through random 'bubbles'.

First trial: "Sad" expression. 75 random bubble centres are smoothed by a Gaussian 'bubble' to give the mask through which the subject sees the face. The subject is asked the expression. Response: "Neutral" (incorrect).

Your turn: trial 2 "Fearful" (correct); trial 3 "Happy" (incorrect: Fearful); trial 4 "Happy" (correct); trial 5 "Fearful" (correct); trial 6 "Sad" (correct); trial 7 "Happy" (correct); trial 8 "Neutral" (correct); trial 9 "Happy" (correct); …; trial 3000 "Happy" (incorrect: Fearful).

Bubbles analysis

E.g. Fearful (3000/4 = 750 trials): sum the bubble masks over all trials, and over the correct trials only. The proportion of correct bubbles = (sum of correct bubbles) / (sum of all bubbles). Threshold this at the proportion of correct trials (= 0.68), scale to [0, 1], and use it as a bubble mask.

Results

[Figure: mask × average face for Happy, Sad, Fearful, Neutral.]
But are these features real or just noise? We need statistics.

Statistical analysis

• Correlate the bubbles with the response (correct = 1, incorrect = 0), separately for each expression.
• This is equivalent to a 2-sample Z-statistic for correct vs. incorrect bubbles (e.g. Fearful), giving a Z ~ N(0,1) statistic at each pixel, very similar to the proportion of correct bubbles.

Results

[Figure: Z thresholded at Z = 1.64 (P = 0.05, uncorrected), over the average face, for Happy, Sad, Fearful, Neutral.]
Multiple comparisons correction?
We need random field theory.

Results, corrected for search

• Random field theory threshold: Z = 3.92 (P = 0.05).
• Saddle-point approximation (Chamandy, 2007): thresholds Z ≈ 3.80–3.82 (P = 0.05).
• Bonferroni: Z = 4.87 (P = 0.05); nothing survives.
[Figure: corrected Z maps over the average face for Happy, Sad, Fearful, Neutral.]

Scale

Separate analysis of the bubbles at each scale.

Scale space: smooth Z(s) with a range of filter widths w (a continuous wavelet transform). This adds an extra dimension to the random field: Z(s, w).

[Figure: Z(s, w), w = FWHM (mm, on a log scale) vs s (mm); no signal vs one 15 mm signal.]
A 15 mm signal is best detected with a 15 mm smoothing filter.

Matched Filter Theorem (= Gauss-Markov Theorem): "to best detect signal + white noise, the filter should match the signal".

[Figure: Z(s, w) for 10 mm and 23 mm signals, and for two 10 mm signals 20 mm apart.]
But if the signals are too close together they are detected as a single signal half way between them.

Scale space can even separate two signals at the same location!
[Figure: Z(s, w) for 8 mm and 150 mm signals at the same location.]

Bubbles task in the fMRI scanner

Correlate the bubbles with BOLD at every voxel: calculate Z for each pair (bubble pixel, fMRI voxel), a 5D "image" of Z statistics.

Thresholding?

• Thresholding in advance is vital, since we cannot store all of the ~1 billion 5D Z values.
• Resels = (image resels = 146.2) × (fMRI resels = 1057.2); for P = 0.05 the threshold is Z = 6.22 (approx).
• Only keep the 5D local maxima: Z(pixel, voxel) > Z(pixel, 6 neighbours of voxel) and Z(pixel, voxel) > Z(4 neighbours of pixel, voxel).

Generalised linear models?
• The random response is Y = 1 (correct) or 0 (incorrect), or Y = fMRI.
• The regressors are X_j = bubble mask at pixel j, j = 1, …, 240 × 380 = 91200 (!).
• Logistic regression or ordinary regression: logit(E(Y)) or E(Y) = b_0 + X_1 b_1 + … + X_91200 b_91200. But there are only n = 3000 observations (trials), so instead, since the regressors are independent, fit them one at a time: logit(E(Y)) or E(Y) = b_0 + X_j b_j.
• However, the regressors (bubbles) are random with a simple known distribution, so turn the problem around and condition on Y: E(X_j) = c_0 + Y c_j.
• This is equivalent to conditional logistic regression (Cox, 1962), which gives exact inference for b_1 conditional on sufficient statistics for b_0. Cox also suggested using saddle-point approximations to improve the accuracy of the inference.

Interactions? logit(E(Y)) or E(Y) = b_0 + X_1 b_1 + … + X_91200 b_91200 + X_1 X_2 b_{1,2} + …
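The "turn the problem around" trick, fitting E(X_j) = c_0 + Y c_j one pixel at a time, amounts to a two-sample comparison of bubble values on correct vs. incorrect trials at each pixel. A minimal simulated sketch in NumPy (the trial count, mask size and signal pixel here are invented for illustration, not the actual 240 × 380 masks):

```python
import numpy as np

def bubble_z(X, y):
    """Per-pixel two-sample Z statistic for correct (y=1) vs
    incorrect (y=0) trials.
    X: (n_trials, n_pixels) bubble masks; y: (n_trials,) 0/1 responses.
    Equivalent to regressing each X_j on Y one at a time,
    E(X_j) = c0 + Y*cj, and testing cj = 0."""
    X1, X0 = X[y == 1], X[y == 0]
    n1, n0 = len(X1), len(X0)
    diff = X1.mean(axis=0) - X0.mean(axis=0)
    # standard error of the difference in means
    se = np.sqrt(X1.var(axis=0, ddof=1) / n1 + X0.var(axis=0, ddof=1) / n0)
    return diff / se

# Simulated example: 750 trials, 100 pixels; only pixel 7 carries signal.
rng = np.random.default_rng(1)
n, p, signal_pixel = 750, 100, 7
X = rng.random((n, p))                      # random bubble intensities
logit = -1.0 + 3.0 * X[:, signal_pixel]     # response depends on pixel 7
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)
Z = bubble_z(X, y)
```

Z at the signal pixel should stand far above the ~N(0,1) values at the null pixels; in the real analysis this Z map is then thresholded using random field theory as above.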