Reproducibility Across
Processing Pipelines
(including analysis methods)
Stephen C. Strother, Ph.D.
Rotman Research Institute, Baycrest Centre
http://www.rotman-baycrest.on.ca/rotmansite/home.php
& Medical Biophysics, University of Toronto
© S. C. Strother, 2008
Overview
• Why: Pipelines as Meta-Models
• How: Optimizing fMRI pipelines – Metrics?
− ROC with simulations & data-driven
− Reproducibility (r)
− Prediction (p)
− the NPAIRS (p, r) resampling framework
• What: Four Examples
− ROCs vs NPAIRS r
− NPAIRS (p, r) plots for pipeline optimization
− Sensitivity of r to changes in SPM(z)
− Measuring neural spatial scale with r
fMRI Processing Pipelines
[Flow diagram] Reconstructed MRI/fMRI data → physiological noise correction → intra-subject motion correction → between-subject alignment → smoothing → intensity normalisation → detrending → data modeling/analysis (experimental design matrix + statistical analysis engine) → statistical maps → rendering of results on anatomy
Why Test fMRI Pipelines?
“All models are wrong, but some are useful!”
• “All models are wrong.” G.E. Box (1976); Marks Nester, “An applied statistician’s creed,” Applied Statistics, 45(4):401-410, 1996.
• The goal is to quantify and optimize utility = reproducibility?
A wide range of software is available for building fMRI processing & analysis pipelines, but:
• fMRI researchers implicitly hope, but typically do not test, that their results are robust to the exact techniques, algorithms and software packages used.
Little is known about what constitutes an optimal fMRI pipeline, particularly for:
• cognitive and clinically relevant tasks;
• children, and middle-aged and older subjects who represent the age-matched controls relevant for many clinical populations.
Optimization Metric Frameworks

Simulations:
1. ROC curves
   − Skudlarski P, et al., NeuroImage 9(3):311-329, 1999.
   − Della-Maggiore V, et al., NeuroImage 17:19-28, 2002.
   − Lukic AS, et al., IEEE Symp. Biomedical Imaging, 2004.
   − Beckmann CF & Smith SM, IEEE Trans. Med. Img. 23:137-152, 2004.

Data-Driven:
2. Minimize p-values
   − Hopfinger JB, et al., NeuroImage 11:326-333, 2000.
   − Tanabe J, et al., NeuroImage 15:902-907, 2002.
3. Model selection: maximum likelihood, Akaike’s Information Criterion (AIC), Minimum Description Length, Bayesian Information Criterion (BIC) & Bayes evidence, cross-validation
4. Replication/Reproducibility
   a. Intra-class correlation coefficient
   b. Empirical ROCs – mixed multinomial model
      − Genovese CR, et al., Magnetic Resonance in Medicine 38:497-507, 1997.
      − Maitra R, et al., Magnetic Resonance in Medicine 48:62-70, 2002.
      − Liou M, et al., NeuroImage 29:383-395, 2006.
   c. Empirical ROCs – lower bound on ROC
      − Nandy RR & Cordes D, Magnetic Resonance in Medicine 49:1152-1162, 2003.
   d. Split-half resampling
      − Strother et al., Hum Brain Mapp 5:312-316, 1997.
5. Prediction/Generalization error or accuracy
   − Hansen et al., NeuroImage 9:534-544, 1999.
   − Kustra R & Strother SC, IEEE Trans Med Img 20:376-387, 2001.
   − Carlson TA, et al., J Cog Neuroscience 15:704-717, 2003.
6. NPAIRS: Prediction + Reproducibility
   − Strother SC, et al., NeuroImage 15:747-771, 2002.
   − Kjems U, et al., NeuroImage 15:772-786, 2002.
   − Shaw ME, et al., NeuroImage 19:988-1001, 2003.
   − LaConte S, et al., NeuroImage 18:10-23, 2003.
   − Strother SC, et al., NeuroImage 23S1:S196-S207, 2004.
Simulation
• Block design: N independent baseline & activation image pairs; total = 2N images, 30
• “Cortex”/”White Matter” = 4:1.
• Mean Gaussian amplitude: ∆M = 3% of background.
• Pixel noise standard deviation: 5% of background.
• Gaussian amplitude correlations: 0.0, 0.5, 0.99.
• Gaussian amplitude variation, V: 0.1 – 2.0.
FROM: Lukic AS, Wernick MN, Strother SC. “An evaluation of methods for detecting brain activations from PET or fMRI images.” Artificial Intelligence in Medicine, 25:69-88, 2002.
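A minimal Python sketch of this kind of block-design phantom (the grid size, blob width, active-voxel mask threshold, and the N used here are illustrative assumptions, not the paper's exact settings):

```python
import numpy as np

rng = np.random.default_rng(0)

N = 15                    # baseline/activation pairs -> 2N images (illustrative)
shape = (64, 64)
noise_sd = 0.05           # pixel noise SD = 5% of background
amp = 0.03                # mean Gaussian amplitude = 3% of background

# Background phantom with a 4:1 "cortex":"white matter" intensity ratio.
background = np.ones(shape)
background[16:48, 16:48] = 4.0

# Gaussian activation blob, added only to the activation images.
yy, xx = np.mgrid[:shape[0], :shape[1]]
blob = np.exp(-((yy - 32) ** 2 + (xx - 32) ** 2) / (2 * 4.0 ** 2))
truth = blob > 0.1        # "truly active" mask, used later for ROC scoring

def simulate_pair(V=0.5):
    """One baseline/activation image pair; V jitters the blob amplitude."""
    a = amp * (1 + V * rng.standard_normal())
    base = background + noise_sd * rng.standard_normal(shape)
    act = background * (1 + a * blob) + noise_sd * rng.standard_normal(shape)
    return base, act

pairs = [simulate_pair() for _ in range(N)]
```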
Receiver Operating Characteristic (ROC) Curves
PA = P(true positive) = P(truly active voxel is classified as active) = sensitivity
PI = P(false positive) = P(truly inactive voxel is classified as active) = false-alarm rate
pAUC = partial area under the ROC curve (AUC restricted to low false-positive rates)

Skudlarski P, NeuroImage 9(3):311-329, 1999.
Della-Maggiore V, NeuroImage 17:19-28, 2002.
Lukic AS, IEEE Symp. Biomedical Imaging, 2004.
Beckmann CF, Smith SM. IEEE Trans. Med. Img. 23:137-152, 2004.
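In code, the ROC and pAUC for a statistic map with known ground truth can be computed as in this sketch (toy synthetic data, not the talk's simulation):

```python
import numpy as np

def roc_curve(stat_map, truth, n_thresh=200):
    """Sweep a threshold from high to low over a statistic map."""
    thresholds = np.linspace(stat_map.max(), stat_map.min(), n_thresh)
    tpr = np.array([(stat_map[truth] > t).mean() for t in thresholds])   # PA
    fpr = np.array([(stat_map[~truth] > t).mean() for t in thresholds])  # PI
    return fpr, tpr

def partial_auc(fpr, tpr, fpr_max=0.1):
    """pAUC: area under the ROC restricted to low false-positive rates."""
    keep = fpr <= fpr_max
    return np.trapz(tpr[keep], fpr[keep])

# Toy usage: truly active voxels carry a shifted statistic.
truth = np.zeros(10_000, bool)
truth[:1_000] = True
stat = np.random.default_rng(1).standard_normal(10_000) + 2.0 * truth
fpr, tpr = roc_curve(stat, truth)
print(f"pAUC(FPR <= 0.1) = {partial_auc(fpr, tpr):.3f}")
```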
ROC Generation
Detection Performance
[Figure: detection performance as a function of Gaussian amplitude variation V]
Empirical ROCs
Data-driven, empirical ROCs:
• Nandy RR, Cordes D. Novel ROC-type method for testing the efficiency of multivariate statistical methods in fMRI. Magnetic Resonance in Medicine 49:1152-1162, 2003.
− P(Y) = P(voxel identified as active)
− P(Y|F) = P(truly inactive voxel identified as active)
− P(Y) vs. P(Y|F) gives a lower bound on the true ROC
• Two runs: the standard experimental run AND a resting-state run for P(Y|F).
• Assumes a common noise structure across the two runs for an accurate P(Y|F).
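A sketch of the lower-bound construction, assuming task and resting-state statistic maps produced by an identical pipeline (the array names are hypothetical):

```python
import numpy as np

def empirical_roc_lower_bound(task_stat, rest_stat, n_thresh=200):
    """Data-driven ROC lower bound in the spirit of Nandy & Cordes (2003).

    task_stat: statistic map from the experimental run (a mixture of active
               and inactive voxels), giving P(Y).
    rest_stat: statistic map from a resting-state run analysed with the same
               design; every suprathreshold voxel there is a false positive,
               giving P(Y|F) -- valid only if the noise structure is shared.
    """
    thresholds = np.linspace(task_stat.max(), task_stat.min(), n_thresh)
    p_y = np.array([(task_stat > t).mean() for t in thresholds])
    p_yf = np.array([(rest_stat > t).mean() for t in thresholds])
    return p_yf, p_y   # plot p_y against p_yf; the true ROC lies above it
```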
Reproducibility
Quantify replication/reproducibility because:
• replication is a fundamental criterion for a result to be considered scientific;
• smaller p-values do not imply a stronger likelihood of repeating the result;
• for “good scientific practice” it is necessary, but not sufficient, to build a measure of replication into the experimental design and data analysis;
• the results are data-driven and avoid simulations.
Reproducibility and the Binomial Mixture Model
Data-driven, empirical ROCs:
• Genovese CR, Noll DC, Eddy WF. Estimating test-retest reliability in functional MR imaging. I. Statistical methodology. Magnetic Resonance in Medicine, 38:497-507, 1997.
• Maitra R, Roys SR, Gullapalli RP. Test-retest reliability estimation of functional MRI data. Magnetic Resonance in Medicine, 48:62-70, 2002.
• Liou M, Su H-R, Lee J-D, Cheng PE, Huang C-C, Tsai C-H. Bridging functional MR images and scientific inference: reproducibility maps. J. Cog. Neuroscience, 15:935-945, 2003.
For each voxel, the number of replications $R_V$ (out of $M$) in which it exceeds threshold follows a two-component binomial mixture:

$$P(R_V) \;=\; \lambda \binom{M}{R_V} P_A^{R_V} (1 - P_A)^{M - R_V} \;+\; (1 - \lambda) \binom{M}{R_V} P_I^{R_V} (1 - P_I)^{M - R_V}$$
M – number of replications; R_V – number of replications in which a voxel exceeds threshold;
P_A – P(true activation); P_I – P(false activation);
λ – mixing fraction of true and false activations.
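As a concrete sketch, λ, P_A and P_I can be fit by EM from the per-voxel replication counts R_V (the starting values and iteration count here are arbitrary choices, not those of the cited papers):

```python
import numpy as np
from scipy.stats import binom

def fit_bmm(counts, M, n_iter=200):
    """EM fit of P(R_V) = lam*Bin(M, P_A) + (1 - lam)*Bin(M, P_I).

    counts: for each voxel, the number of the M replications in which it
            exceeded the threshold.
    """
    lam, p_a, p_i = 0.2, 0.7, 0.05               # arbitrary starting values
    for _ in range(n_iter):
        # E-step: posterior probability that each voxel is truly active.
        a = lam * binom.pmf(counts, M, p_a)
        b = (1.0 - lam) * binom.pmf(counts, M, p_i)
        w = a / (a + b + 1e-12)
        # M-step: update the mixing fraction and the two binomial rates.
        lam = w.mean()
        p_a = (w * counts).sum() / (M * w.sum() + 1e-12)
        p_i = ((1 - w) * counts).sum() / (M * (1 - w).sum() + 1e-12)
    return lam, p_a, p_i
```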
ROC Framework
• Comparisons between ROC curves (in this case, results from three different preprocessing pipelines) must use a common mixing-proportion estimate (i.e., assume the same underlying activation).
• Uncertainty estimates on the curves are also necessary to determine the significance of the difference between AUCs.
ROC Framework IV
• A common (joint) λ binomial mixture model is required for empirical ROC curve comparisons.
• The model still uses independent P_A and P_I across thresholds and methods, but a single value of λ must be used for the empirical ROC analysis to be meaningful.
• How do we determine the joint λ, and can the same optimization techniques be used with the joint model?
Threshold Dependency Problems in Joint BMM I
• The most common method of determining the joint λ is to first estimate the independent model (where λ varies across thresholds) and then use the average value as the joint value (see the sketch below).
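That averaging step might look like the following sketch; the inverse-variance weighted variant (used on a later slide) down-weights unstable low-threshold estimates. The per-threshold variance inputs would come from, e.g., a bootstrap, and are assumed given here:

```python
import numpy as np

def joint_lambda(lams, variances=None):
    """Combine per-threshold lambda estimates into a single joint value."""
    lams = np.asarray(lams, dtype=float)
    if variances is None:                 # plain average (most common approach)
        return lams.mean()
    w = 1.0 / np.asarray(variances, dtype=float)
    return (w * lams).sum() / w.sum()     # inverse-variance weighted average
```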
Threshold Dependency Problems in BMM III
• Low Z-statistic thresholds result in most voxels being above threshold more than once.
• High uncertainty is associated with low-threshold estimates of λ.
• The estimated λ tends toward 0.5 on average when no threshold constraint is used.
• The literature after Genovese et al. (1997) has used a variety of different threshold constraints when estimating the joint λ across thresholds/methods.
Activation Shape Alters BMM Estimates
• A second, and potentially more serious, problem with this model concerns the spatial form of the activation.
• Simplest test of the BMM: cube vs. Gaussian-blob simulated activity with 10% of the brain active.
• The simulation uses fixed activation placement, height and noise across subjects.
Activation Shape Alters Estimates
• For the cube, the estimated λ is correctly 0.1 from Z-statistic thresholds of -2 to 4.5, and inverse-variance weighting maintains that across all thresholds.
• The values from the Gaussian blob are threshold dependent, and this dependence becomes more variable as the simulations become more realistic.
Can the BMM Be Used with Multi-Subject Data?
• Lack of point-to-point correspondence
• Differences in activation magnitude/size
• Differences in noise mean/variance
• Potentially different strategies/networks
• The binomial-mixture-model framework cannot reliably estimate the mixing proportion when the activation is not of uniform strength across activated voxels.
Prediction and Reproducibility
(Split-Half Cross-Validation Resampling)
[Figure: split-half resampling schematic, with standard SPM estimation on one half and the prediction metric computed on the other]
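A sketch of the prediction half of the framework, using scikit-learn's linear discriminant as a stand-in for the talk's analysis models; note that in NPAIRS the split is over independent subjects or runs, which the scan-wise split here only approximates:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def split_half_prediction(X, y, rng=None):
    """Prediction metric p: mean held-out accuracy over the two half-splits.

    X: (n_scans, n_voxels) data matrix; y: class labels from the design matrix.
    """
    rng = rng or np.random.default_rng(0)
    idx = rng.permutation(len(y))
    half1, half2 = idx[: len(y) // 2], idx[len(y) // 2 :]
    scores = []
    for train, test in [(half1, half2), (half2, half1)]:
        model = LinearDiscriminantAnalysis().fit(X[train], y[train])
        scores.append(model.score(X[test], y[test]))   # fraction correct
    return float(np.mean(scores))
```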
Activation-Pattern Reproducibility Metrics
(NPAIRS: Split-half reSampling)
1 1

1 1





1
r
1
+
r0



 2 2
2
2



  


r1
10
r
1
1 

1 



 1




2 2

2 2

Signal eigenvalue = (1 + r)
Noise eigenvalue = (1 – r)
G
l
o
b
a
lS
N
R
=
S
D
D
S
i
g
n
a
l S
N
o
i
s
e
(
1
+
r
)
(
1
r
)(
1
r
)
2
r
/
(
1
r
)
uncorrelated signal (rSPM) and noise
(nSPM) from any data analysis model.
 a reproducible SPM (rSPM) on a
common statistical Z-score scale.
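A minimal sketch of these reproducibility quantities for two split-half maps (voxel arrays are assumed already masked and vectorized):

```python
import numpy as np

def npairs_reproducibility(spm1, spm2):
    """Reproducibility r, rSPM(z), and global SNR from two split-half SPMs."""
    z1 = (spm1 - spm1.mean()) / spm1.std()
    z2 = (spm2 - spm2.mean()) / spm2.std()
    r = float(np.corrcoef(z1, z2)[0, 1])        # scatter-plot correlation

    # Project onto the principal axes of the 2x2 correlation matrix:
    signal = (z1 + z2) / np.sqrt(2)             # variance (1 + r): signal axis
    noise = (z1 - z2) / np.sqrt(2)              # variance (1 - r): noise axis

    rspm_z = signal / noise.std()               # reproducible SPM, Z-score scale
    gsnr = np.sqrt(2 * max(r, 0.0) / (1 - r))   # sqrt(2r / (1 - r)); 0 if r <= 0
    return r, rspm_z, gsnr
```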
Simulations: Comparing Metrics
[Figure: simulation results comparing metrics; panels: GLOBAL, LOCAL]
NPAIRS: ROC-Like Prediction vs.
Reproducibility
• Optimizing performance: As on an ROC plot, there is a single point, (1, 1), on this prediction vs. reproducibility plot with the best performance; at this location the model has perfectly predicted the design matrix while extracting an infinite activation SNR.
• A bias-variance tradeoff: As model complexity increases (i.e., #PCs 10 → 100), prediction of the design matrix's class labels improves while reproducibility (i.e., activation SNR) decreases.
LaConte S, et. al. Evaluating preprocessing choices in
single-subject BOLD-fMRI studies using data-driven
performance metrics. Neuroimage 18:10-23, 2003
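One way to turn a (p, r) plot into a pipeline ranking is the Euclidean distance from the ideal point (1, 1); the (p, r) numbers below are hypothetical placeholders, not results from the talk:

```python
import numpy as np

def distance_from_ideal(p, r):
    """Distance of a pipeline's (prediction, reproducibility) point from (1, 1)."""
    return float(np.hypot(1.0 - p, 1.0 - r))

# Hypothetical (p, r) values for four candidate pipelines:
pipelines = {"MC": (0.71, 0.38), "MC+smooth": (0.74, 0.52),
             "MC+MPER": (0.72, 0.44), "MC+MPER+smooth": (0.76, 0.55)}
best = min(pipelines, key=lambda name: distance_from_ideal(*pipelines[name]))
print("pipeline closest to (1, 1):", best)
```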
Measuring Improved (p, r) Plots
Testing Pipelines with (p, r) Plots
¹Sliding-window running means. ²Multi-taper power spectrum. ³Wilcoxon matched-pairs signed-rank test (N = 16).
Zhang et al., NeuroImage 41(4):1242-1252, 2008, and Magn Reson Med (in press).
A Multi-Task Dataset as f(age)
• Previously collected data (Grady et al., J. Cog. Neurosci., 2006)
  − Language-picture encoding/recognition memory experiment
  − 1.5T GE MRI at Sunnybrook (TR = 2.5 s)
• Multiple tasks
  − 6 different tasks: picture/word; animacy/size (case)
    · 4 x Encoding [4 epochs, 89 volumes]
    · 2 x Recognition [8 epochs, 166 volumes]
  [Task block-timing diagram]
• Different age groups
  − 20 subjects: young (10) and old (10)
  − 60 runs/age group (6 tasks/run x 10 subjects)
fMRI Processing Pipelines Examined
Four pipelines were compared, each feeding either a General Linear Model or a multivariate CVA:
• MC: reconstructed MRI/fMRI data → within-subject motion correction → between-subject alignment
• MC + Smoothing: reconstructed MRI/fMRI data → within-subject motion correction → between-subject alignment → spatial smoothing
• MC + MPER: reconstructed MRI/fMRI data → within-subject motion correction → voxel detrending (motion params.) → between-subject alignment
• MC + MPER + Smoothing: reconstructed MRI/fMRI data → within-subject motion correction → voxel detrending (motion params.) → between-subject alignment → spatial smoothing
∆% SPM(z) vs. ∆Reproducibility
Spatial-Scale of Cat Orientation Columns
[Figure panels A-D: orientation-column maps; labels SSPL, SPL, WM, cerebellum; 2 mm scale bars]
• MION-enhanced, CBV-weighted fMRI
• 1-mm-thick slice tangential to the surface of the cortex containing visual area 18
• Gradient-echo, 9.4 Tesla
• In-plane resolution 0.15 x 0.15 mm², TR = 2 s, TE = 10 ms
Zhao F, et al., NeuroImage 27:416-424, 2005
Methods & Framework
[Diagram] Original data → 100 bootstrap samples (BS1, BS2, …, BS100) → 2D Gaussian smoothing over a range of FWHM (mm) → reproducibility (correlation of split-half rSPMs) → select the FWHM giving the maximum reproducibility
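A sketch of this resampling loop, assuming two independent splits of images and SciPy's Gaussian filter; the real analysis correlated rSPMs from the data-analysis model rather than the simple mean maps used here:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def optimal_fwhm_distribution(split1, split2, fwhms_mm,
                              voxel_mm=0.15, n_boot=100, seed=0):
    """Bootstrap distribution of the FWHM maximizing split-half reproducibility.

    split1, split2: (n_images, ny, nx) arrays from two independent splits.
    """
    rng = np.random.default_rng(seed)
    best = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(split1), len(split1))   # resample images
        m1, m2 = split1[idx].mean(0), split2[idx].mean(0)
        r_vals = []
        for f in fwhms_mm:
            sigma = f / voxel_mm / 2.3548                 # FWHM -> sigma (voxels)
            s1 = gaussian_filter(m1, sigma)
            s2 = gaussian_filter(m2, sigma)
            r_vals.append(np.corrcoef(s1.ravel(), s2.ravel())[0, 1])
        best.append(fwhms_mm[int(np.argmax(r_vals))])
    return np.array(best)   # distribution of the optimal FWHM over bootstraps
```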
Results
[Figure: distributions of the optimal Gaussian-filter FWHM over 100 bootstrap samples for four datasets: Dataset 1 (0 vs 90, 0.225 x 0.225 mm²), Dataset 2 (0 vs 90, 0.15 x 0.15 mm²), Dataset 3, and Dataset 4 (45 vs 135, 0.15 x 0.15 mm²)]
Acknowledgements
Rotman Research Institute
Xu Chen, Ph.D.
Cheryl Grady, Ph.D.
Grigori Yourganov, M.Sc.
Simon Graham, Ph.D.
Wayne Lee, M.Sc.
Randy McIntosh, Ph.D.
Mani Fazeli, M.Sc.
Anita Oder, B.Sc.
Illinois Institute of Technology & Predictek, Inc., Chicago
Ana Lukic, Ph.D.
Miles Wernick, Ph.D.
University of Pittsburgh
Seong-Gi Kim, Ph.D.
Fuqiang Zhao, Ph.D.
FMRIB, Oxford University
Morgan Hough, B.A.
Steve Smith, Ph.D.
Principal Funding Sources: NIH Human Brain Project, P20-EB02013-10 & P20MH072580-01, James S. McDonnell Foundation