My first 100 Terabytes of Data

STATISTICAL METHODS FOR NEW TECHNOLOGY WORKING GROUP
Ciprian M. Crainiceanu
Johns Hopkins University
http://www.biostat.jhsph.edu/smnt
Members of the group
• Key personnel
• C.M. Crainiceanu, B.S. Caffo, A.-M. Staicu, S. Greven,
D. Ruppert, C.-Z. Di
• Senior Students
• V. Zipunnikov, J.-A. Goldsmith
• Other statisticians (>20)
• Scientific collaborators
• Direct collaboration
• Solving important scientific problems
• Diverse scientific applications
Scientific Collaborators
• Susan Bassett – fMRI,
Alzheimer’s
• Danny Reich – DTI, DCE-MRI, MS
• Brian Schwartz – lead exposure,
VBM, DTI, white matter imaging
• Stewart Mostofsky – fMRI,
rsfcMRI, Autism, ADHD, Tourette’s
• Naresh Punjabi – EEG, sleep,
sleep diseases
• Dzung Pham / Pilou Bazin –
Cortical shape, thickness, lesion
detection, MS
• Dean Wong – PET, fMRI,
substance abuse
• Susan Resnick – BLSA
• Jerry Prince – BLSA, ADNI
• Jim Pekar, Peter Van Zijl – 7T
MRI, fMRI, rsfcMRI
preprocessing, scanner physics
• Christos Davatzikos – RAVENS
• Susumu Mori – DTI, tractography
• Dana Boatman – ECOG, EEG,
epilepsy
• Graham Redgrave – fMRI, DTI,
Huntington’s, anorexia/bulimia
• Tudor Badea, Bruno Jedynak –
Neuron classification,
morphometry, 3D structure and
shape
• Tom Glass – Gizmos
• Merck – EEG, neuroimaging
• Pfizer – imaging biomarkers?
Observational Studies 2.0
Longitudinal Functional Principal Component Analysis (LFPCA)
• I=1000, J=4, D=100: 15 min
• I=1000, J=8, D=200: 70 min
Greven, Crainiceanu, Caffo, Reich, 2010. LFPCA, EJS, to appear
A simple regression formula
• Data compression via longitudinal PCA
• MoM estimators of covariance matrices, smoothing
• Need: all covariance operators
• Solution: regress Y_ij(d) Y_ik(d’) on 1, T_ik, T_ij, T_ik T_ij, δ_jk
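The regression above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name, the array layout (subjects by visits by grid points), and the output labels are assumptions, the data are assumed to be already centered, and the smoothing step mentioned on the slide is omitted. The last regressor on the slide is read here as the indicator that j = k (a Kronecker delta).

```python
import numpy as np

def lfpca_moment_covariances(Y, T):
    """Method-of-moments sketch: regress Y_ij(d) * Y_ik(d') on
    1, T_ik, T_ij, T_ij * T_ik and the indicator j == k.

    Y : (I, J, D) array of centered observations (assumed layout).
    T : (I, J) array of visit times.
    Returns a dict of D x D coefficient surfaces (hypothetical labels).
    """
    I, J, D = Y.shape
    design, response = [], []
    for i in range(I):
        for j in range(J):
            for k in range(J):
                # One design row per subject i and visit pair (j, k).
                design.append([1.0, T[i, k], T[i, j],
                               T[i, j] * T[i, k], float(j == k)])
                # All products Y_ij(d) * Y_ik(d'), flattened over (d, d').
                response.append(np.outer(Y[i, j], Y[i, k]).ravel())
    X = np.asarray(design)
    R = np.asarray(response)
    # The design is the same for every (d, d'), so one least-squares
    # solve handles all D * D regressions at once.
    beta, *_ = np.linalg.lstsq(X, R, rcond=None)
    labels = ["K0", "K01", "K01_T", "K1", "KU_plus_noise"]
    return {name: b.reshape(D, D) for name, b in zip(labels, beta)}
```

Because every pair (d, d’) shares the same five regressors, the cost is dominated by a single least-squares problem with D² right-hand sides, which is one reason this kind of estimator is convenient for large D.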
Variance explained (FA, 3 years of longitudinal data)
Longitudinal Penalized Functional Regression
LPFR: recipe and ingredients
PASAT/MD (corpus callosum), PD (corticospinal tract)
Functional regression
• No paper on longitudinal functional regression
• No paper published with this data structure
• Longitudinal extensions are not “simple”
• Technical details are hard without the correct “recipe” for known and published “ingredients”
• No available method that scales up
Goldsmith, Feder, Crainiceanu, Caffo, Reich, 2010. PFR, JCGS, to appear
Goldsmith, Crainiceanu, Caffo, Reich, 2010. LPFR, to appear?
Population Value Decomposition (PVD)
PVD
Y_i = P V_i D + E_i
• P is T × A
• D is B × F
• V_i is A × B
• A << T, B << F
Singular Value Decomposition (SVD) summarizes variance
[Figure: SVD of one subject’s time-by-frequency data matrix into eigenvariates, a diagonal matrix, and eigenfrequencies]
Default PVD
[Figure: flow diagram of the default PVD, starting from the subject-specific data: (1) SVD of each subject’s data gives eigenvariates and eigenfrequencies; (2) a low-rank approximation is kept and stacked across subjects; (3) a second SVD gives the population decomposition; (4) the original data are projected onto the population bases]
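The default PVD outlined on the preceding slides can be sketched with NumPy as follows. This is an illustrative reading of the diagram rather than the authors' code: the function name `default_pvd` and the parameters `n_left` and `n_right` (how many subject-level singular vectors to keep) are hypothetical, and each subject's data are assumed to be a T × F matrix on a common grid.

```python
import numpy as np

def default_pvd(Y, n_left, n_right, A, B):
    """Sketch of a two-stage population value decomposition.

    Y       : list of T x F arrays, one per subject (assumed layout).
    n_left  : subject-level left singular vectors kept (hypothetical name).
    n_right : subject-level right singular vectors kept (hypothetical name).
    A, B    : population-level ranks, with A << T and B << F.
    Returns P (T x A), a list of V_i (A x B), and D (B x F).
    """
    lefts, rights = [], []
    for Yi in Y:
        # Stage 1: SVD of each subject's matrix, keep a low-rank part.
        U, _, Vt = np.linalg.svd(Yi, full_matrices=False)
        lefts.append(U[:, :n_left])        # eigenvariates, T x n_left
        rights.append(Vt[:n_right].T)      # eigenfrequencies, F x n_right
    # Stage 2: stack the subject-level bases across subjects and take
    # a second SVD to obtain population-level bases.
    P = np.linalg.svd(np.hstack(lefts), full_matrices=False)[0][:, :A]
    D = np.linalg.svd(np.hstack(rights), full_matrices=False)[0][:, :B].T
    # Project each subject's original data onto the population bases.
    # P has orthonormal columns and D orthonormal rows, so this is the
    # least-squares choice of V_i in Y_i = P V_i D + E_i.
    V = [P.T @ Yi @ D.T for Yi in Y]
    return P, V, D
```

After the decomposition, each subject is summarized by the small A × B matrix V_i, while P and D are shared across the population; `P @ V[i] @ D` gives the fitted approximation to subject i's data.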
Caffo BS, Crainiceanu CM, Verduzco G, Joel SE, Mostofsky SH, Bassett SS, Pekar JJ. Two-Stage decompositions for the analysis of functional
connectivity for fMRI with application to Alzheimer’s disease risk. NeuroImage (In Press).
Population eigenimages
Currently:
• Deploying PVD to the 1000 Functional Connectomes Project
http://www.nitrc.org/projects/fcon_1000/
• Comparing rsfcMRI in stroke versus normal subjects
HD-MFPCA/RAVENS Images
Multilevel Functional Principal Component Analysis (MFPCA)
MFPCA
HD-MFPCA
HD-MFPCA, Step 1
HD-MFPCA, Step 2
Main message, backed by 100 TB of data
• Eventually, good tech makes it into observational
studies and clinical trials
• Longitudinal/Multilevel FDA is the natural next
step in FDA
• Data is changing the way we do business:
availability, size, complexity
• Likely: funding will be based much more on
relevance than on technical ability