Multivariate Approaches: Joint Modeling of Imaging and Genetic Data
Bertrand Thirion
INRIA Saclay-Île-de-France, Parietal team
CEA, DSV, I2BM, Neurospin
http://parietal.saclay.inria.fr
bertrand.thirion@inria.fr
June 25th, 2011
HBM Imaging genetics educational course
Outline
● Introduction: Multivariate methods
● Neuroimaging genetics
  ● Limitations of mass univariate models
  ● Penalized multiple regression
  ● Multivariate multiple regression
● Examples on simulated and real data
Introduction: Multivariate methods
● Two main families: supervised and unsupervised
● Unsupervised: clustering, PCA, ICA
  ● Generic form: X = AD + E, where only X is known
  ● Try to model/understand some data – its distribution p(X) – without trying to fit any target
● Supervised: regularized regression, kernel machines, discriminant analysis, PLS
  ● Generic form: Y = XB + E, where Y and X are known
  ● Try to fit the target data Y given X
Introduction: Multivariate methods
[Figure: schematic examples of supervised methods vs. unsupervised methods]
● Using unsupervised techniques to solve supervised problems is simply inefficient
Neuroimaging Genetics: problem statement
Subjects s = 1, .., S:
● Brain image data: q response variables Y = {y1, .., yq}
● Genetic data: p predictor variables X = {x1, .., xp}
Genetic-Neuroimaging studies
● Small q, small p
  ● [Joyner et al. 2009] q = 4 brain size measures; p = 11 SNPs
● Small q, large p
  ● [Potkin et al. 2009] q = 1 mean BOLD signal; p = 317,503 SNPs
● Large q, small p
  ● [Filippini et al. 2009] q = 29,812 voxels; p = 1 SNP
● Large q, large p
  ● [Stein et al. 2010] q = 31,622 voxels; p = 448,293 SNPs
Mass Univariate Linear Models (MULM)
● A commonly used approach is to model one genotype and one phenotype at a time:
  1. Fit all univariate linear regression models
     yj = βjk xk + ε, j = 1, . . . , q, k = 1, . . . , p
     with two possible genotype codings: allelic dosage or a categorical model
  2. Search for a subset of p' significant genotypes with indices {k1, k2, . . . , kp'} ⊂ {1, .., p}, p' ≪ p, by testing all p × q null hypotheses of no association H0: βjk = 0
  3. Correct for multiple testing – control experiment-wise FWER or FDR
● Possible dependence patterns among genotypes and phenotypes are ignored at the modeling stage (see the sketch below)
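A minimal numpy/scipy sketch of steps 1–3 under allelic-dosage coding (illustrative; the mulm function name, the variable names and the Bonferroni line are mine, not from the talk):

    import numpy as np
    from scipy import stats

    def mulm(Y, X):
        # Fit all q x p simple regressions y_j = beta_jk x_k + intercept + eps;
        # centering the data absorbs the intercept.
        n = X.shape[0]
        Xc = X - X.mean(axis=0)                    # (n, p) genotypes
        Yc = Y - Y.mean(axis=0)                    # (n, q) phenotypes
        sx = (Xc ** 2).sum(axis=0)                 # per-SNP sum of squares
        beta = (Xc.T @ Yc) / sx[:, None]           # (p, q) matrix of slopes
        rss = (Yc ** 2).sum(axis=0)[None, :] - beta ** 2 * sx[:, None]
        se = np.sqrt(rss / (n - 2) / sx[:, None])  # standard errors
        t = beta / se                              # t-statistics, df = n - 2
        return 2 * stats.t.sf(np.abs(t), df=n - 2) # p-values for H0: beta_jk = 0

    # Step 3, e.g. Bonferroni over all p x q tests:
    # significant = mulm(Y, X) < 0.05 / (p * q)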
Problems with MULM
● We test each (voxel, SNP) pair: up to 10^12 pairs
● Power issue (peak statistic): to detect an effect with a power of 80%, based on n = 1000 subjects, the standardized effect needs to be greater than 0.26 (0.19 with n = 2000 subjects)
● Reproducibility issue: directly related to the small power
  → false negatives hamper the reproducibility of the analysis; this is known to be one of the major issues in GWAS
Multivariate predictive modelling
● Why model multiple genotypes and multiple phenotypes?
  ✔ A weak effect may be more apparent when other causal effects are already accounted for
  ✔ A false signal may be weakened by including in the model a stronger signal from a true causal association
  ✔ A weak effect may be more apparent if multiple phenotypes are affected
● Basic strategy: build a linear regression model that includes all genotypes (predictors) and all phenotypes (responses), then perform variable selection
● The models covered here are:
  ✔ Penalized multiple linear regression (any p and q = 1)
  ✔ Penalized sparse canonical correlation analysis (any p and q)
  ✔ Penalized reduced-rank regression (any p and q)
Multiple genotypes (p > 1) and one phenotype (q = 1): multiple regression
● The multiple linear regression model with univariate response: y = Xβ + ε
● Fit the multiple linear regression model by solving the normal equations X'Xβ = X'y
● Equivalently, minimize the error function (loss) L(β) = ∥y − Xβ∥²
● When n > p, the OLS solution is given by β̂OLS = (X'X)⁻¹ X'y
Penalized multivariate regression for genotype selection
✔ One-step approach: fit the multiple linear regression model while finding a subset of p' important predictors with indices {k1, k2, . . . , kp'} ⊂ {1, 2, . . . , p}, p' ≪ p, all having non-zero regression coefficients
✔ This is achieved by fitting a penalized regression model:
   β̂ = argmin_β ∥y − Xβ∥² + λ ψ(β)
✔ The penalty ψ(β) imposes a constraint on β. We use convex functions so that the optimum is unique
✔ The coupling parameter λ controls the trade-off between the OLS (unpenalized) solution and the penalized solution
Ridge regression
● The problem can be rewritten as
  β̂ridge = argmin_β Σi (yi − xi'β)² + λ Σj βj²
● or more compactly
  β̂ridge = argmin_β ∥y − Xβ∥² + λ∥β∥₂²
● λ controls the amount of shrinkage
● Some properties are:
  ✔ Closed-form solution: β̂ridge = (X'X + λI)⁻¹ X'y (see the sketch below)
  ✔ Useful when the data matrix X is singular and X'X is not invertible
  ✔ Bias-variance trade-off – better predictions
  ✔ Grouping effect – correlated variables get similar coefficients
  ✔ No variable selection
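A minimal sketch of the closed-form solution in plain numpy (the function name is mine):

    import numpy as np

    def ridge(X, y, lam):
        # beta = (X'X + lam I)^{-1} X'y: well-posed even when X'X is singular
        p = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)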
LASSO regression
● Lasso regression finds β̂ = argmin_β ∥y − Xβ∥² subject to ∥β∥₁ ≤ t
  (equivalently, β̂lasso = argmin_β ∥y − Xβ∥² + λ∥β∥₁)
● Performs both continuous shrinkage and variable selection
● λ controls the amount of sparsity
● For instance, with p = 2: [Figure: geometry of the lasso with p = 2; the L1 ball is a diamond whose corners yield sparse solutions]
Example of Lasso regularization path
[Figure: lasso coefficient paths as a function of the regularization]
● The higher the regularization, the sparser the solution
● Run it yourself: http://scikit-learn.sourceforge.net/modules/linear_model.html
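A sketch of the path computation with the current scikit-learn API (which has evolved since the 2011 release linked above); the toy data are mine:

    import numpy as np
    from sklearn.linear_model import lasso_path

    rng = np.random.RandomState(0)
    X = rng.randn(100, 200)              # n = 100 samples, p = 200 predictors
    beta = np.zeros(200)
    beta[:10] = 1.0                      # 10 truly relevant predictors
    y = X @ beta + rng.randn(100)

    alphas, coefs, _ = lasso_path(X, y)  # coefs has shape (p, n_alphas)
    for a, c in zip(alphas[::20], coefs.T[::20]):
        print("lambda = %.3f: %d nonzero coefficients" % (a, int((c != 0).sum())))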
Elastic net regression: convex combination of L1 and L2 penalties
● Elastic net regression solves
  β̂enet = argmin_β ∥y − Xβ∥² + λ₁∥β∥₁ + λ₂∥β∥₂²
● It retains the benefits of both individual penalties
● Setting α = λ₂ / (λ₁ + λ₂), the penalty simplifies to λ (α∥β∥₂² + (1 − α)∥β∥₁), with λ = λ₁ + λ₂
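In scikit-learn terms, where alpha plays the role of λ and l1_ratio sets the L1/L2 mix; a toy sketch on synthetic data of mine:

    import numpy as np
    from sklearn.linear_model import ElasticNet

    rng = np.random.RandomState(0)
    X = rng.randn(100, 200)
    y = X[:, :10] @ np.ones(10) + rng.randn(100)

    # l1_ratio = 1 recovers the lasso; l1_ratio -> 0 approaches ridge
    enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
    print((enet.coef_ != 0).sum(), "predictors selected")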
Solving the penalized regression problem
● Convex but non-smooth problem: a unique optimal solution can always be found, though not through simple techniques (gradient descent, etc.)
● Three families of methods:
  ✔ Homotopy methods: LARS algorithm [Efron et al. 2004]
  ✔ Proximal methods [Nesterov 2004]
  ✔ Coordinate descent [Friedman et al. 2009]
● Implementations available in R/Python:
  ● glmnet R package, http://cran.r-project.org/web/packages/glmnet/
  ● scikit-learn, http://scikit-learn.sourceforge.net/
Variable selection in practice: choosing λ
● A common procedure is nested cross-validation (CV)
● For each value of λ within a given range:
  1. Leave m samples out for testing
  2. Use the remaining n − m samples for training – fit the model
  3. Compute the prediction error on the test samples
  4. Repeat for all n/m folds and take the average prediction error
  → The optimal λ minimizes the cross-validated prediction error (see the sketch below)
● Various search strategies can be used to explore the space Λ
● No optimal solution guaranteed: the global problem (which includes optimizing λ) is NOT convex
● Cost = n/m × |Λ| model fits
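For the lasso, scikit-learn wraps this loop in LassoCV; a minimal sketch on synthetic data of mine:

    import numpy as np
    from sklearn.linear_model import LassoCV

    rng = np.random.RandomState(0)
    X = rng.randn(200, 50)
    y = X[:, :5].sum(axis=1) + rng.randn(200)

    # For each lambda on an automatic grid, average the test error over the
    # 5 folds, then refit on all the data with the best lambda.
    model = LassoCV(cv=5).fit(X, y)
    print("selected lambda:", model.alpha_)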
Variable selection in practice: caveat
● Learning the parameter is different from evaluating the performance of the predictive model!
● You need to have two cross-validation loops:
[Diagram] Outer loop: split the data set (X, y) into a learning set (Xl, yl) and a test set (Xt, yt); compute β on the learning set, predict ŷt = Xt β and compute the prediction accuracy ∥ŷt − yt∥². Inner loop: within the learning set, for all λ in Λ, fit on an internal learning set (Xb, yb) and predict on an internal test set (Xa, ya); choose λ* in Λ.
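A sketch of the two loops with the modern scikit-learn API (the module paths postdate this talk): GridSearchCV implements the internal loop that chooses λ*, cross_val_score the external loop that measures prediction accuracy.

    import numpy as np
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import GridSearchCV, cross_val_score

    rng = np.random.RandomState(0)
    X = rng.randn(200, 50)
    y = X[:, :5].sum(axis=1) + rng.randn(200)

    # Internal loop: pick lambda* on each learning set only
    inner = GridSearchCV(Lasso(), {"alpha": np.logspace(-3, 1, 20)}, cv=5)
    # External loop: evaluate the whole procedure on left-out test sets
    scores = cross_val_score(inner, X, y, cv=5)
    print("cross-validated R^2: %.2f" % scores.mean())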
Variable selection in practice: stability selection
● Stability selection is an alternative approach which avoids searching for an optimal regularization parameter [Meinshausen and Buhlmann, 2009]
● The procedure works as follows:
  1. Extract B subsamples (e.g. of size n/2) from the training data set
  2. For each subsample, fit the sparse regression model
  3. Estimate the probability of each predictor being selected
  4. Select all predictors whose selection probability is above a pre-determined threshold
● Under some assumptions, this procedure controls the expected number of false positives
● Unlike CV, it does not heavily depend upon the regularization parameter λ; see [Brunea et al. 2011]
● But it assumes a very sparse solution (see the sketch below)
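An illustrative sketch of the procedure (the lasso penalty, B = 100 subsamples and the 0.6 threshold are arbitrary choices of mine, not values from the paper):

    import numpy as np
    from sklearn.linear_model import Lasso

    def stability_selection(X, y, lam=0.05, B=100, threshold=0.6, seed=0):
        n, p = X.shape
        rng = np.random.RandomState(seed)
        counts = np.zeros(p)
        for _ in range(B):
            idx = rng.choice(n, size=n // 2, replace=False)  # subsample, size n/2
            coef = Lasso(alpha=lam).fit(X[idx], y[idx]).coef_
            counts += coef != 0                              # selected predictors
        # keep predictors whose selection probability exceeds the threshold
        return np.flatnonzero(counts / B >= threshold)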
Example: Lasso regression, stability path
[Figure: lasso path (left) and stability path of the lasso (right) on the vitamin gene-expression dataset. The paths of the 6 non-permuted genes are plotted as solid red lines, while the paths of the 4082 permuted genes are shown as broken black lines. From Meinshausen and Buhlmann, 2009]
Summary on multiple regression
● Latent variable models for one phenotype
● Simultaneous dimensionality reduction and variable selection
Modeling multiple genotypes and phenotypes
Multivariate multiple linear regression: Y = XC + E
✔ If n were greater than p, C could be estimated by least squares, or with adequate penalization, column-wise
✔ The estimated C would have full rank R = min(p, q)
✔ Same solutions as with q separate regression models
Reduced Rank Regression / PLS
✔ Alternative approach: impose a rank condition on the regression coefficient matrix, rank(C) ≤ min(p, q)
✔ If C has rank r, it can be written as a product of a (p × r) matrix B and an (r × q) matrix A, both of full rank
✔ The RRR model is written Y = XBA + E
✔ For a fixed rank r, the matrices A and B are obtained by minimizing the weighted least squares criterion
   M = Tr{(Y − XBA) Γ (Y − XBA)'}
   for a given (q × q) positive definite matrix Γ
Reduced Rank Regression / PLS: illustration
● Canonical Correlation Analysis (CCA) can be used to remove the variance confound, but requires regularization
RRR/PLS Solution
● The optimal A and B are obtained as
  Â = H' Γ^(-1/2), B̂ = (X'X)⁻¹ X'Y Γ^(1/2) H
● H is the (q × r) matrix whose columns are the first r normalized eigenvectors associated with the r largest eigenvalues of the (q × q) matrix
  Γ^(1/2) Y'X (X'X)⁻¹ X'Y Γ^(1/2)
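A minimal numpy sketch of this solution for the unweighted case Γ = I (for a general Γ, pre-multiply Y by Γ^(1/2) and rescale A afterwards); the function name is mine:

    import numpy as np

    def rrr(X, Y, r):
        # Reduced-rank regression Y ~ X B A with rank(BA) = r, for Gamma = I
        C_ols = np.linalg.pinv(X) @ Y   # (p, q) full-rank OLS solution
        Y_hat = X @ C_ols               # fitted values
        # columns of H = top-r eigenvectors of Y_hat' Y_hat,
        # i.e. the top-r right singular vectors of Y_hat
        _, _, Vt = np.linalg.svd(Y_hat, full_matrices=False)
        H = Vt[:r].T                    # (q, r)
        B = C_ols @ H                   # (p, r)
        A = H.T                         # (r, q)
        return B, A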
Sparse reduced rank regression (sRRR)
Vounou et al. (2010), Le Floch et al. (2011)
● Add penalties to induce sparsity on A and/or B (see the sketch below)
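One common way to optimize such penalized criteria is to alternate sparse regressions over A and B. A rank-1 sketch of mine (the actual sRRR algorithm of Vounou et al. (2010) differs in its details):

    import numpy as np
    from sklearn.linear_model import Lasso

    def sparse_rrr_rank1(X, Y, lam_b=0.05, lam_a=0.05, n_iter=20):
        q = Y.shape[1]
        a = np.ones(q) / np.sqrt(q)              # phenotype weights
        b = np.zeros(X.shape[1])                 # genotype weights
        for _ in range(n_iter):
            # a fixed: sparse regression of the phenotype score Ya on X
            b = Lasso(alpha=lam_b).fit(X, Y @ a).coef_
            if not b.any():
                break                            # everything shrunk to zero
            t = (X @ b)[:, None]                 # latent genotype score
            # b fixed: sparse regression of each phenotype on the score
            a = Lasso(alpha=lam_a).fit(t, Y).coef_.ravel()
            norm = np.linalg.norm(a)
            if norm == 0:
                break
            a /= norm                            # fix the scale indeterminacy
        return b, a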
Interpretation of PLS/RRR/CCA
● Latent variable models for multiple phenotypes: find latent variable pairs (tr, sr) satisfying some optimality properties
[Hoggart et al. (2008), Wu et al. (2009), Vounou et al. (2010), Le Floch et al. (2011)]
Statistical power comparison
Monte Carlo simulation framework [Vounou et al., 2010]
● Generate an entire population P of 10k individuals
  ✔ Use a forwards-in-time simulation approach (FREGENE)
  ✔ Reproduce features observed in real human populations
  ✔ Genotypes coded as minor-allele SNP dosage
● Generate B Monte Carlo data sets of sample size n each:
  1. Randomly sample n genotypes x from the population P
  2. Simulate the n phenotypes y from a multivariate normal distribution calibrated on real data (ADNI database)
  3. Induce an association according to an additive genetic model
     – p between 1,000 and 40,000
     – 10 predictive SNPs with small marginal effects
     – q = 111 with 6 true responses
Simulations results
[Figure: SNP sensitivity of sRRR vs. MULM, with n = 500 (left) and n = 1000 (right)]
Simulations results (ctd)
● Large p:
[Figure: ratio of SNP sensitivities (sRRR/MULM) as a function of the total number of SNPs]
More experiments: MULM vs elastic net
[Figure: comparison of the performance of MULM and elastic net at predicting one phenotype generated from a linear model of the SNPs, varying the SNR and the sparsity. Top: using hybrid simulations (real SNPs); bottom: using i.i.d. data (simulated SNPs)]
● Also: better support recovery with elastic net
Real data
● Functional asymmetries in q = 19 ROIs
● p = 1083 SNPs in 12 genetic regions
● n = 94, Localizer dataset [Pinel et al., submitted]
● One significant association detected using elastic net [Thirion et al., in preparation]
Many thanks to
● Vincent Frouin, Jean-Baptiste Poline, Edouard Duchesnay, Edith Le Floch and the genim group at Neurospin
● Gaël Varoquaux, Fabian Pedregosa, Alexandre Gramfort, Vincent Michel and the scikit-learn contributors
● Giovanni Montana for providing most of this material
Bibliography
● Filippini, N., Rao, A., Wetten, S., et al. (2009). Anatomically-distinct genetic associations of APOE epsilon4 allele load with regional cortical atrophy in Alzheimer's disease. NeuroImage, 44(3):724-8.
● Hoggart, C., Whittaker, J., De Iorio, M., and Balding, D. (2008). Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genet, 4(7).
● Joyner, A. H., Roddey, J. C., Bloss, C. S., et al. (2009). A common MECP2 haplotype associates with reduced cortical surface area in humans in two independent populations. PNAS, 106(36):15475-80.
● Meinshausen, N. and Buhlmann, P. (2009). Stability selection. Annals of Statistics.
● Potkin, S. G., Turner, J. A., Guffanti, G., et al. (2009). A genome-wide association study of schizophrenia using brain activation as a quantitative phenotype. Schizophrenia Bulletin, 35(1):96-108.
● Shen, L., Kim, S., Risacher, S. L., Nho, K., et al. (2010). Whole genome association study of brain-wide imaging phenotypes for identifying quantitative trait loci in MCI and AD: A study of the ADNI cohort. NeuroImage, pages 1-13.
Bibliography (continued)
● Stein, J. L., Hua, X., Lee, S., et al. (2010). Voxelwise genome-wide association study (vGWAS). NeuroImage.
● Vounou, M., Nichols, T., and Montana, G. (2010). Discovering genetic associations with high-dimensional neuroimaging phenotypes: a sparse reduced-rank regression approach. NeuroImage.
● Witten, D., Tibshirani, R., and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515.
● Wu, T., Chen, Y., Hastie, T., Sobel, E., and Lange, K. (2009). Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics, 25(6):714.