Reproducibility Across Processing Pipelines (including analysis methods)
Stephen C. Strother, Ph.D.
Rotman Research Institute, Baycrest Centre
http://www.rotman-baycrest.on.ca/rotmansite/home.php
& Medical Biophysics, University of Toronto
© S. C. Strother, 2008

Overview
• Why: pipelines as meta-models
• How: optimizing fMRI pipelines – metrics?
− ROC with simulations & data-driven
− Reproducibility (r)
− Prediction (p)
− The NPAIRS (p, r) resampling framework
• What: four examples
− ROCs vs. NPAIRS r
− NPAIRS (p, r) plots for pipeline optimization
− Sensitivity of r to changes in SPM(z)
− Measuring neural spatial scale with r

fMRI Processing Pipelines
(Diagram slide.) Stages include: reconstructed MRI/fMRI data; physiological noise correction; intra-subject motion correction; between-subject alignment; smoothing; intensity normalisation; detrending; data modeling/analysis (experimental design matrix feeding a statistical analysis engine); statistical maps; rendering of results on anatomy.

Why Test fMRI Pipelines?
"All models are wrong, but some are useful!"
• "All models are wrong." G.E. Box (1976); Marks Nester, "An applied statistician's creed," Applied Statistics, 45(4):401-410, 1996.
• The goal is to quantify and optimize utility = reproducibility?
A wide range of software is available for building fMRI processing & analysis pipelines, but
• fMRI researchers implicitly hope, but typically do not test, that their results are robust to the exact techniques, algorithms and software packages used.
Little is known about what constitutes an optimal fMRI pipeline for:
• cognitive and clinically relevant tasks;
• children, and middle-aged and older subjects who represent the age-matched controls relevant for many clinical populations.

Optimization Metric Frameworks
Simulations:
1. ROC curves
 a. Skudlarski P, et al., Neuroimage 9(3):311-329, 1999.
 b. Della-Maggiore V, et al., Neuroimage 17:19-28, 2002.
 c. Lukic AS, et al., IEEE Symp. Biomedical Imaging, 2004.
 d. Beckmann CF & Smith SM, IEEE Trans. Med. Img. 23:137-152, 2004.
Data-driven:
2. Minimize p-values
 a. Hopfinger JB, et al., Neuroimage 11:326-333, 2000.
 b. Tanabe J, et al., Neuroimage 15:902-907, 2002.
3. Model selection: maximum likelihood, Akaike's information criterion (AIC), minimum description length, Bayesian information criterion (BIC) & Bayes evidence, cross-validation.
4. Replication/reproducibility
 − Intra-class correlation coefficient.
 − Empirical ROCs – mixed multinomial model:
  a. Genovese CR, et al., Magnetic Resonance in Medicine 38:497-507, 1997.
  b. Maitra R, et al., Magnetic Resonance in Medicine 48:62-70, 2002.
  c. Liou M, et al., Neuroimage 29:383-95, 2006.
 − Empirical ROCs – lower bound on ROC:
  a. Nandy RR & Cordes D, Magnetic Resonance in Medicine 49:1152-1162, 2003.
 − Split-half resampling:
  a. Strother et al., Hum Brain Mapp 5:312-316, 1997.
5. Prediction/generalization error or accuracy
 a. Hansen et al., Neuroimage 9:534-544, 1999.
 b. Kustra R & Strother SC, IEEE Trans Med Img 20:376-387, 2001.
 c. Carlson TA, et al., J Cog Neuroscience 15:704-717, 2003.
6. NPAIRS: prediction + reproducibility
 a. Strother SC, et al., Neuroimage 15:747-771, 2002.
 b. Kjems U, et al., Neuroimage 15:772-786, 2002.
 c. Shaw ME, et al., Neuroimage 19:988-1001, 2003.
 d. LaConte S, et al., Neuroimage 18:10-23, 2003.
 e. Strother SC, et al., Neuroimage 23S1:S196-S207, 2004.

Simulation
Block design:
• N independent baseline & activation image pairs; total = 2N images (30); "cortex"/"white matter" = 4:1.
• Mean Gaussian amplitude: ∆M = 3% of background.
• Pixel noise standard deviation: 5% of background.
• Gaussian amplitude correlations: 0.0, 0.5, 0.99.
• Gaussian amplitude variation, V: 0.1–2.0.
FROM: Lukic AS, Wernick MN, Strother SC. "An evaluation of methods for detecting brain activations from PET or fMRI images." Artificial Intelligence in Medicine, 25:69-88, 2002.
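The simulation-plus-ROC evaluation outlined above can be sketched as follows. This is a minimal sketch, not the authors' code: the image size, blob width, active-voxel definition and the pAUC false-positive range are illustrative assumptions; only the 3%-amplitude and 5%-noise settings come from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative simulation: N baseline/activation image pairs with a single
# Gaussian blob whose mean amplitude is 3% of background and whose pixel
# noise SD is 5% of background (per the simulation slide).
size, N = 64, 30
background = 100.0
noise_sd = 0.05 * background
amp = 0.03 * background

# Ground truth: voxels where the blob exceeds 10% of its peak (an assumption).
y, x = np.mgrid[:size, :size]
blob = np.exp(-((x - 32) ** 2 + (y - 32) ** 2) / (2 * 4.0 ** 2))
truth = blob > 0.1

# N independent baseline and activation images.
base = background + noise_sd * rng.standard_normal((N, size, size))
act = background + amp * blob + noise_sd * rng.standard_normal((N, size, size))

# Paired t-statistic map across the N pairs.
d = act - base
t = d.mean(axis=0) / (d.std(axis=0, ddof=1) / np.sqrt(N))

# Empirical ROC: sweep all observed t-values as thresholds.
thresholds = np.sort(t.ravel())[::-1]
tpr = np.array([(t[truth] > c).mean() for c in thresholds])
fpr = np.array([(t[~truth] > c).mean() for c in thresholds])

# Partial AUC over a low false-positive range (FPR <= 0.1, an assumption).
keep = fpr <= 0.1
pauc = float(np.sum(np.diff(fpr[keep]) * (tpr[keep][1:] + tpr[keep][:-1]) / 2))
print(round(pauc, 3))
```

Restricting the area to low false-positive rates (pAUC) focuses the comparison on the operating region actually used for thresholded activation maps.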
Receiver Operating Characteristic (ROC) Curves
• P_A = P(true positive) = P(truly active voxel is classified as active) = sensitivity.
• P_I = P(false positive) = P(inactive voxel is classified as active) = false-alarm rate.
• pAUC: the partial area under the ROC curve, computed over a restricted false-positive range.
Refs: Skudlarski P, Neuroimage 9(3):311-329, 1999; Della-Maggiore V, Neuroimage 17:19-28, 2002; Lukic AS, IEEE Symp. Biomedical Imaging, 2004; Beckmann CF & Smith SM, IEEE Trans. Med. Img. 23:137-152, 2004.

ROC Generation
(Figure slide.)

Detection Performance
(Figure slide: detection performance as a function of amplitude variation V.)

Empirical ROCs
Data-driven, empirical ROCs:
• Nandy RR, Cordes D. Novel ROC-type method for testing the efficiency of multivariate statistical methods in fMRI. Magnetic Resonance in Medicine 49:1152-1162, 2003.
• P(Y) = P(voxel identified as active); P(Y|F) = P(inactive voxel identified as active).
• P(Y) vs. P(Y|F) is a lower bound for the true ROC.
• Two runs are needed: the standard experimental run AND a resting-state run for P(Y|F).
• Assumes a common noise structure for an accurate P(Y|F).

Reproducibility
Quantify replication/reproducibility because:
• replication is a fundamental criterion for a result to be considered scientific;
• smaller p-values do not imply a stronger likelihood of repeating the result;
• for "good scientific practice" it is necessary, but not sufficient, to build a measure of replication into the experimental design and data analysis;
• results are data-driven and avoid simulations.

Reproducibility and the Binomial Mixture Model
• Genovese CR, Noll DC, Eddy WF. Estimating test-retest reliability in functional MR imaging. I. Statistical methodology. Magnetic Resonance in Medicine 38:497-507, 1997.
• Maitra R, Roys SR, Gullapalli RP. Test-retest reliability estimation of functional MRI data. Magnetic Resonance in Medicine 48:62-70, 2002.
• Liou M, Su H-R, Lee J-D, Cheng PE, Huang C-C, Tsai C-H. Bridging functional MR images and scientific inference: reproducibility maps. J. Cog. Neuroscience 15:935-945, 2003.

P(R_V) = λ C(M, R_V) P_A^{R_V} (1 − P_A)^{M−R_V} + (1 − λ) C(M, R_V) P_I^{R_V} (1 − P_I)^{M−R_V}

where M = number of replications; R_V = number of replications in which voxel V exceeds threshold; P_A = P(true activation); P_I = P(false activation); λ = mixing fraction of true and false activations.

ROC framework
• Comparisons between ROC curves (here, results from three different preprocessing pipelines) need to use a common mixing-proportion estimate (i.e., assume the same underlying activation).
• Uncertainty estimates on the curves are also necessary to determine the significance of the difference between AUCs.

ROC framework IV
• A common, or joint, λ binomial mixture model is required for empirical ROC curve comparisons.
• The model still uses independent P_A and P_I across thresholds and methods, but a single value of λ must be used for the empirical ROC analysis to be meaningful.
• How do we determine the joint λ, and can the same optimization techniques be used with the joint model?

Threshold dependency problems in joint BMM I
• The most common method of determining the joint λ is to first estimate the independent model (where λ varies) and then use the average value as the joint value.

Threshold dependency problems in BMM III
• Low Z-statistic thresholds result in most voxels being above threshold more than once.
• High uncertainty is associated with low-threshold estimates of λ.
• The estimated λ tends to average to 0.5 when no threshold constraint is used.
• The literature after Genovese et al. (1997) has used a variety of threshold constraints when estimating the joint λ across thresholds/methods.

Activation shape alters BMM estimates
• A second, and potentially more serious, problem with this model concerns the spatial form of the activation.
• Simplest test of the BMM: cube vs. Gaussian-blob simulated activation.
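The mixture-model λ estimation that these slides critique can be sketched with a toy EM fit of the binomial mixture model. All parameter values are hypothetical, and P_A and P_I are held fixed, a simplification relative to Genovese et al. (1997), which estimates all parameters:

```python
import math
import random

random.seed(1)

# Hypothetical settings: M replications, true mixing fraction lam_true,
# suprathreshold probabilities for truly active (p_act) and inactive voxels.
M, n_vox, lam_true, p_act, p_inact = 10, 5000, 0.1, 0.8, 0.05

def draw(p):
    """Number of the M replications in which a voxel exceeds threshold."""
    return sum(random.random() < p for _ in range(M))

counts = [draw(p_act) if random.random() < lam_true else draw(p_inact)
          for _ in range(n_vox)]

# Binomial pmf over all possible replication counts r = 0..M.
def pmf(p):
    return [math.comb(M, r) * p**r * (1 - p) ** (M - r) for r in range(M + 1)]

f_act, f_inact = pmf(p_act), pmf(p_inact)

# EM for the mixing fraction lambda, with P_A and P_I held fixed
# (a simplification: the full model also estimates P_A and P_I).
lam = 0.5
for _ in range(100):
    post = [lam * f_act[r] / (lam * f_act[r] + (1 - lam) * f_inact[r])
            for r in counts]
    lam = sum(post) / n_vox

print(round(lam, 2))  # recovers a value close to lam_true
```

With well-separated P_A and P_I the posterior weights are nearly 0/1 and λ is recovered cleanly; the threshold-dependency problems above arise precisely when the two components overlap at realistic thresholds.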
• Simulated activity covers 10% of the brain.
• The simulation uses fixed activation placement, height and noise across subjects.

Activation shape alters estimates
• From Z-statistic thresholds of −2 to 4.5, the estimated λ for the cube is correctly 0.1, and inverse-variance weighting maintains this across all thresholds.
• The values from the Gaussian blob are threshold-dependent, and this dependence becomes more variable as the simulations become more realistic.

Can the BMM be used with multi-subject data?
• Lack of point-to-point correspondence.
• Differences in activation magnitude/size.
• Differences in noise mean/variance.
• Potentially different strategies/networks.
• The binomial mixture model framework cannot reliably estimate the mixing proportion when activation strength is not uniform across activated voxels.

Prediction and Reproducibility (Split-Half Cross-Validation Resampling)
(Diagram slide: standard SPM estimation and the split-half prediction metric.)

Activation-Pattern Reproducibility Metrics (NPAIRS: Split-Half Resampling)
• The voxel values of two independent split-half SPMs form a scatter plot whose correlation is the reproducibility r; the 2 × 2 correlation matrix [1 r; r 1] has signal eigenvalue (1 + r) on the principal axis and noise eigenvalue (1 − r) on the minor axis.
• Global SNR = SD_Signal / SD_Noise = √[((1 + r) − (1 − r)) / (1 − r)] = √[2r / (1 − r)].
• Projection onto these axes yields uncorrelated signal (rSPM) and noise (nSPM) images from any data-analysis model.
• The result is a reproducible SPM (rSPM) on a common statistical Z-score scale.

Simulations: Comparing Metrics
(Figure slide: global vs. local comparisons.)

NPAIRS: ROC-Like Prediction vs. Reproducibility
• Optimizing performance: like an ROC plot, there is a single point, (1, 1), on this prediction vs. reproducibility plot with the best performance; at this location the model has perfectly predicted the design matrix while extracting an infinite SNR.
• A bias-variance tradeoff: as model complexity increases (i.e., #PCs 10 → 100), prediction of the design matrix's class labels improves and reproducibility (i.e., activation SNR) decreases.
LaConte S, et al., Evaluating preprocessing choices in single-subject BOLD-fMRI studies using data-driven performance metrics. Neuroimage 18:10-23, 2003.
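The split-half reproducibility, principal-axis eigenvalues and global SNR described above can be sketched on synthetic maps; the shared-signal construction and all sizes are illustrative assumptions, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two synthetic split-half SPMs: a shared activation pattern plus
# independent noise (sizes and effect strength are illustrative).
n_vox = 10000
signal = np.zeros(n_vox)
signal[:500] = 2.0
z1 = signal + rng.standard_normal(n_vox)
z2 = signal + rng.standard_normal(n_vox)

# Variance-normalize each half-map.
z1 = (z1 - z1.mean()) / z1.std()
z2 = (z2 - z2.mean()) / z2.std()

# Reproducibility: correlation between the two independent SPMs.
r = float(np.corrcoef(z1, z2)[0, 1])

# Principal-axis projections: signal variance (1 + r), noise variance (1 - r).
s = (z1 + z2) / np.sqrt(2)
noise = (z1 - z2) / np.sqrt(2)

# Reproducible SPM on a common Z-score scale: rescale by the noise SD.
rspm_z = s / np.sqrt(1 - r)

# Global SNR = sqrt(((1 + r) - (1 - r)) / (1 - r)) = sqrt(2r / (1 - r)).
gsnr = float(np.sqrt(2 * r / (1 - r)))
print(round(r, 2), round(gsnr, 2))
```

For variance-normalized half-maps, var(s) = 1 + r and var(noise) = 1 − r hold exactly, which is what puts rSPM on a common Z-score scale regardless of the analysis model that produced the maps.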
Measuring Improved (p, r) Plots
(Figure slide.)

Testing Pipelines with (p, r) Plots
¹Sliding-window running means. ²Multi-taper power spectrum. ³Wilcoxon matched-pair rank-sum test (N = 16).
Zhang et al., Neuroimage 41(4):1242-1252, 2008, and Magn Reson Med (in press).

A Multi-Task Dataset as a f(age)
• Previously collected data (Grady et al., J. Cog. NSci, 2006):
− language-picture encoding/recognition memory experiment;
− 1.5 T GE MRI at Sunnybrook (TR 2.5 s).
• Multiple tasks:
− 6 different tasks – picture/word; animacy/size (case);
− 4 × encoding [4 epochs, 89 volumes];
− 2 × recognition [8 epochs, 166 volumes].
• Different age groups:
− 20 subjects: young (10) and old (10);
− 60 runs/age group (6 tasks/run × 10 subjects).

fMRI Processing Pipelines Examined
Each pipeline starts from reconstructed MRI/fMRI data and feeds either a general linear model or a multivariate CVA analysis:
• MC: within-subject motion correction → between-subject alignment.
• MC + Smoothing: within-subject motion correction → between-subject alignment → spatial smoothing.
• MC + MPER: within-subject motion correction → voxel detrending (motion params.) → between-subject alignment.
• MC + MPER + Smoothing: within-subject motion correction → voxel detrending (motion params.) → between-subject alignment → spatial smoothing.

∆% SPM(z) vs. ∆Reproducibility
(Figure slide.)

Spatial Scale of Cat Orientation Columns
(Figure slide: panels A–D showing SSPL, SPL, WM and cerebellum; 2 mm scale bars.)
• MION-enhanced, CBV-weighted fMRI.
• 1-mm-thick slice tangential to the surface of the cortex containing visual area 18.
• Gradient-echo, 9.4 Tesla.
• In-plane resolution 0.15 × 0.15 mm², TR = 2 s, TE = 10 ms.
Zhao F, et al., NeuroImage 27:416-24, 2005.

Methods & Framework
• Bootstrap resampling of the original data (samples BS1 … BS100).
• Each sample's split-half maps are smoothed with 2-D Gaussian filters over a range of FWHM (mm).
• Reproducibility (rSPM correlation) is computed at each FWHM, and the FWHM that maximizes it is recorded per bootstrap sample.

Results
(Figure slide: distributions of the optimal Gaussian-filter FWHM over 100 bootstrap samples for Dataset 1 (0 vs. 90, 0.225 × 0.225 mm²), Dataset 2 (0 vs. 90, 0.15 × 0.15 mm²), Dataset 3, and Dataset 4 (45 vs. 135, 0.15 × 0.15 mm²).)

Acknowledgements
Rotman Research Institute: Xu Chen, Ph.D.; Cheryl Grady, Ph.D.; Grigori Yourganov, M.Sc.; Simon Graham, Ph.D.; Wayne Lee, M.Sc.; Randy McIntosh, Ph.D.; Mani Fazeli, M.Sc.; Anita Oder, B.Sc.
Illinois Institute of Technology & Predictek, Inc., Chicago: Ana Lukic, Ph.D.; Miles Wernick, Ph.D.
University of Pittsburgh: Seong-Gi Kim, Ph.D.; Fuqiang Zhao, Ph.D.
FMRIB, Oxford University: Morgan Hough, B.A.; Steve Smith, Ph.D.
Principal funding sources: NIH Human Brain Project, P20-EB02013-10 & P20-MH072580-01; James S. McDonnell Foundation.
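The bootstrap smoothing-scale procedure from the Methods & Framework slides can be sketched in 1-D. The periodic pattern, noise level, FWHM grid and sample count are all hypothetical stand-ins for the cat area-18 data, and fresh noise per sample stands in for true bootstrap resampling:

```python
import numpy as np

rng = np.random.default_rng(3)

# 1-D stand-in for the orientation-column maps: a periodic pattern with a
# wavelength of 25 voxels plus independent split-half noise (all hypothetical).
n = 512
x = np.arange(n)
pattern = np.sin(2 * np.pi * x / 25.0)

def smooth(v, fwhm):
    """Gaussian smoothing with FWHM given in voxels (sigma = FWHM / 2.355)."""
    if fwhm <= 0:
        return v
    s = fwhm / 2.355
    k = np.exp(-0.5 * (np.arange(-25, 26) / s) ** 2)
    return np.convolve(v, k / k.sum(), mode="same")

fwhms = [0.0, 1.0, 2.0, 4.0, 8.0, 16.0]
best = []
for _ in range(100):  # bootstrap-style samples (fresh noise per sample)
    h1 = pattern + rng.standard_normal(n)
    h2 = pattern + rng.standard_normal(n)
    r = [np.corrcoef(smooth(h1, f), smooth(h2, f))[0, 1] for f in fwhms]
    best.append(fwhms[int(np.argmax(r))])

# The mode of the per-sample optima estimates the signal's spatial scale:
# smoothing near the pattern scale maximizes split-half reproducibility.
print(max(set(best), key=best.count))
```

Reproducibility peaks at an intermediate FWHM because smoothing below the signal's spatial scale suppresses noise faster than signal, while smoothing beyond it attenuates the columnar pattern itself.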