Multivariate Approaches: Joint Modeling of Imaging and Genetic Data
Bertrand Thirion, INRIA Saclay-Île-de-France, Parietal team
CEA, DSV, I2BM, Neurospin
http://parietal.saclay.inria.fr
bertrand.thirion@inria.fr
HBM Imaging genetics educational course, June 25th, 2011

Outline
● Introduction: multivariate methods
● Neuroimaging genetics
● Limitations of mass univariate models
● Penalized multiple regression
● Multivariate multiple regression
● Examples on simulated and real data

Introduction: Multivariate methods
● Two main families: supervised and unsupervised
● Unsupervised: clustering, PCA, ICA
  ✔ Generic form: X = AD + E, where only X is known
  ✔ Try to model/understand some data (its distribution p(X)) without trying to fit any target
● Supervised: regularized regression, kernel machines, discriminant analysis, PLS
  ✔ Generic form: Y = XB + E, where Y and X are known
  ✔ Try to fit the target data Y given X
● [Figure: illustration of supervised vs. unsupervised methods]
● Using unsupervised techniques to solve supervised problems is simply inefficient

Neuroimaging genetics: problem statement
● For each subject s = 1..S we observe brain image data and genetic data
● q response variables (brain measurements): Y = {y1, .., yq}
● p predictor variables (genetic markers): X = {x1, .., xp}

Genetic-neuroimaging studies
● Small q, small p: [Joyner et al. 2009] q = 4 brain size measures, p = 11 SNPs
● Small q, large p: [Potkin et al. 2009] q = 1 (mean BOLD signal), p = 317,503 SNPs
● Large q, small p: [Filippini et al. 2009] q = 29,812 voxels, p = 1 SNP
● Large q, large p: [Stein et al. 2010] q = 31,622 voxels, p = 448,293 SNPs

Mass Univariate Linear Models (MULM)
● A commonly used approach is to model one genotype and one phenotype at a time:
  1. Fit all univariate linear regression models yj = βjk xk + ε, j = 1, …, q, k = 1, …, p (two possible codings of the genotype: allelic dosage or a categorical model)
  2. Search for a subset of p' significant genotypes with indices {k1, k2, …, kp'} ⊂ {1, …, p}, p' ≪ p, by testing all p × q null hypotheses of no association H0: βjk = 0
  3. Correct for multiple testing: control the experiment-wise FWER or the FDR
● Possible dependence patterns among genotypes and phenotypes are ignored at the modeling stage (see the sketch below)
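To make steps 1-3 concrete, here is a minimal sketch of the MULM procedure on simulated data; the sample sizes, dosage coding, and Bonferroni threshold are illustrative assumptions, not values from the course material.

```python
# Minimal MULM sketch on simulated data (all sizes are illustrative).
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
n, p, q = 200, 50, 30                                # subjects, SNPs, phenotypes
X = rng.binomial(2, 0.3, size=(n, p)).astype(float)  # allelic dosage in {0, 1, 2}
Y = rng.randn(n, q)                                  # imaging phenotypes

# Step 1: fit all p * q univariate regressions y_j = beta_jk x_k + eps
pvals = np.empty((q, p))
for j in range(q):
    for k in range(p):
        _, _, _, pvals[j, k], _ = stats.linregress(X[:, k], Y[:, j])

# Steps 2-3: test H0: beta_jk = 0 for every pair, with a Bonferroni
# correction over all p * q tests to control the FWER.
alpha = 0.05
selected = np.argwhere(pvals < alpha / (p * q))
print("significant (phenotype, SNP) pairs:", selected)
```

With pure-noise data as above, the corrected procedure should (rightly) select nothing; the double loop also makes the cost of scanning all pairs explicit.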
Problems with MULM
● If we test each (voxel, SNP) pair, this means up to 10^12 tests
● Power issue (peak statistic): to detect an effect with a power of 80% with n = 1000 subjects, the standardized effect needs to be greater than 0.26 (0.19 with n = 2000 subjects)
● Reproducibility issue: directly related to the small power; false negatives hamper the reproducibility of the analysis, which is known to be one of the major issues in GWAS

Multivariate predictive modelling
● Why model multiple genotypes and multiple phenotypes?
  ✔ A weak effect may be more apparent when other causal effects are already accounted for
  ✔ A false signal may be weakened by inclusion in the model of a stronger signal from a true causal association
  ✔ A weak effect may be more apparent if multiple phenotypes are affected
● Basic strategy: build a linear regression model that includes all genotypes (predictors) and all phenotypes (responses), then perform variable selection
● The models covered here:
  ✔ Penalized multiple linear regression (any p, q = 1)
  ✔ Penalized sparse canonical correlation analysis (any p and q)
  ✔ Penalized reduced-rank regression (any p and q)

Multiple genotypes (p > 1) and one phenotype (q = 1): multiple regression
● The multiple linear regression model with univariate response is y = Xβ + ε
● Fit the model by solving β̂ = argmin_β ‖y − Xβ‖², i.e. minimize the error function (loss)
● When n > p, the OLS solution is given by β̂ = (XᵀX)⁻¹ Xᵀ y

Penalized multiple regression for genotype selection
✔ One-step approach: fit the multiple linear regression model while finding a subset of p' important predictors with indices {k1, k2, …, kp'} ⊂ {1, 2, …, p}, p' ≪ p, all having non-zero regression coefficients
✔ This is achieved by fitting a penalized regression model: β̂ = argmin_β ‖y − Xβ‖² + λ ψ(β)
✔ The penalization ψ(β) imposes a constraint on β; convex functions are used so that the optimum is unique
✔ The coupling parameter λ controls the trade-off between the OLS (unpenalized) solution and the penalized solution

Ridge regression
● The problem can be rewritten as β̂ = argmin_β Σi (yi − xiᵀβ)² + λ Σj βj², or more compactly β̂ = argmin_β ‖y − Xβ‖² + λ ‖β‖₂²
● λ controls the amount of shrinkage
● Some properties:
  ✔ Closed-form solution: β̂ = (XᵀX + λI)⁻¹ Xᵀ y
  ✔ Useful when the data matrix X is singular and XᵀX is not invertible
  ✔ Bias-variance trade-off: better predictions
  ✔ Grouping effect: correlated variables get similar coefficients
  ✔ No variable selection

LASSO regression
● Lasso regression finds β̂ = argmin_β ‖y − Xβ‖² subject to ‖β‖₁ ≤ t, equivalently β̂ = argmin_β ‖y − Xβ‖² + λ ‖β‖₁
● Performs both continuous shrinkage and variable selection
● λ controls the amount of sparsity
● [Figure: for instance, with p = 2, the L1 ball is a diamond whose corners lie on the coordinate axes, so the constrained solution tends to set one coefficient exactly to zero]

Example of Lasso regularization path
● The higher the regularization, the sparser the solution
● Run it yourself: http://scikit-learn.sourceforge.net/modules/linear_model.html (see also the sketch after the next slide)

Elastic net regression: convex combination of L1 and L2 penalties
● Elastic net regression solves β̂ = argmin_β ‖y − Xβ‖² + λ₁ ‖β‖₁ + λ₂ ‖β‖₂²
● It retains the benefits of both individual penalties
● Setting α = λ₂ / (λ₁ + λ₂) and λ = λ₁ + λ₂, the penalty simplifies to λ (α ‖β‖₂² + (1 − α) ‖β‖₁)
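As a companion to the two previous slides, a short scikit-learn sketch of the Lasso regularization path and an elastic net fit; the simulated data, the number of causal predictors, and the alpha/l1_ratio values are illustrative assumptions.

```python
# Lasso regularization path and elastic net fit with scikit-learn
# (simulated data; all sizes and parameter values are illustrative).
import numpy as np
from sklearn.linear_model import lasso_path, ElasticNet

rng = np.random.RandomState(0)
n, p = 100, 200
X = rng.randn(n, p)
beta = np.zeros(p)
beta[:5] = 2.0                       # 5 causal predictors out of 200
y = X.dot(beta) + rng.randn(n)

# Coefficients along a decreasing grid of regularization values:
# the higher the regularization, the sparser the solution.
alphas, coefs, _ = lasso_path(X, y)
print("support sizes along the path:", (coefs != 0).sum(axis=0))

# Elastic net: convex combination of L1 and L2 penalties
# (l1_ratio roughly plays the role of 1 - alpha in the slide's notation).
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print("elastic net support size:", (enet.coef_ != 0).sum())
```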
Solving the penalized regression problem
● Convex but non-smooth problem: a unique optimal solution can always be found, though not through simple techniques (gradient descent, etc.)
● Three families of methods:
  ✔ Homotopy methods: LARS algorithm [Efron et al. 2004]
  ✔ Proximal methods [Nesterov 2004]
  ✔ Coordinate descent [Friedman et al. 2009]
● Implementations available in R and Python:
  ● glmnet R package, http://cran.r-project.org/web/packages/glmnet/
  ● scikit-learn, http://scikit-learn.sourceforge.net/

Variable selection in practice: λ
● A common procedure is nested cross-validation (CV). For each value of λ within a given range Λ:
  1. Leave m samples out for testing
  2. Use the remaining n − m samples for training (fit the model)
  3. Compute the prediction error on the test samples
  4. Repeat for all n/m folds and take the average prediction error
  → The optimal λ minimizes the cross-validated prediction error
● Various search strategies can be used to explore the space Λ
● No optimal solution guaranteed: the global problem (which includes optimizing λ) is NOT convex
● Cost = (n/m) × |Λ| model fits

Variable selection in practice: caveat
● Learning the parameter λ is different from evaluating the performance of the predictive model!
● You need to have two cross-validation loops:
  1. Split the data set (X, y) into a learning set (Xl, yl) and a test set (Xt, yt)
  2. Inner loop on the learning set: split it into an internal learning set (Xb, yb) and an internal test set (Xa, ya); for all λ in Λ, fit on (Xb, yb) and predict on (Xa, ya); choose λ* in Λ
  3. Outer loop: compute β on the learning set with λ*, form the prediction ŷt = Xt β, and report the prediction accuracy ‖ŷt − yt‖²

Variable selection in practice
● Stability selection is an alternative approach that avoids searching for an optimal regularization parameter [Meinshausen and Bühlmann, 2009]
● The procedure works as follows:
  1. Extract B subsamples (e.g. of size n/2) from the training data set
  2. For each subsample, fit the sparse regression model
  3. Estimate the probability of each predictor being selected
  4. Select all predictors whose selection probability is above a pre-determined threshold
● Under some assumptions, this procedure controls the expected number of false positives; unlike CV, it does not heavily depend upon the regularization parameter λ (see [Brunea et al. 2011])
● But it assumes a very sparse solution (a subsampling sketch follows after the next slide)

Example: Lasso regularization and stability paths
● [Figure: Lasso path and stability path on the vitamin gene-expression dataset. The paths of the 6 non-permuted genes are plotted as solid red lines; the paths of the 4082 permuted genes are shown as broken black lines. From Meinshausen and Bühlmann, 2009]
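A minimal subsampling sketch of the stability-selection procedure above, with the Lasso as the sparse regression model; B, the fixed regularization value, and the 0.8 threshold are illustrative choices, not values recommended by the course.

```python
# Stability selection sketch: refit the Lasso on B random subsamples of
# size n/2 and keep the predictors that are selected often.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n, p = 100, 200
X = rng.randn(n, p)
y = X[:, :5].dot(np.full(5, 2.0)) + rng.randn(n)   # 5 causal predictors

B, threshold = 100, 0.8
counts = np.zeros(p)
for _ in range(B):
    idx = rng.choice(n, n // 2, replace=False)      # subsample without replacement
    counts += Lasso(alpha=0.3).fit(X[idx], y[idx]).coef_ != 0

selection_prob = counts / B                         # per-predictor selection frequency
print("selected predictors:", np.where(selection_prob >= threshold)[0])
```

In practice the selection frequencies would be aggregated over a grid of λ values rather than a single fixed one, which is what makes the method insensitive to λ.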
Summary on multiple regression
● Latent variable models for one phenotype
● Simultaneous dimensionality reduction and variable selection

Modeling multiple genotypes and phenotypes
● Multivariate multiple linear regression: Y = XC + E
  ✔ If n were greater than p, C could be estimated by least squares, or column-wise with adequate penalization
  ✔ C would have full rank R = min(p, q)
  ✔ Same solutions as with q separate regression models

Reduced-Rank Regression / PLS
✔ Alternative approach: impose a rank condition on the regression coefficient matrix so that rank(C) ≤ min(p, q)
✔ If C has rank r, it can be written as the product of a (p × r) matrix B and an (r × q) matrix A, both of full rank
✔ The RRR model is written Y = XBA + E
✔ For a fixed rank r, the matrices A and B are obtained by minimizing the weighted least-squares criterion M = Tr{(Y − XBA) Γ (Y − XBA)ᵀ} for a given (q × q) positive definite matrix Γ

Reduced-Rank Regression / PLS illustration
● [Figure: latent-variable illustration of RRR/PLS]
● Canonical Correlation Analysis (CCA) can be used to remove the variance confound, but requires regularization

RRR/PLS solution
● The optimal A and B are obtained in closed form as A = Hᵀ Γ^(−1/2) and B = (XᵀX)⁻¹ XᵀY Γ^(1/2) H
● H is the (q × r) matrix whose columns are the first r normalized eigenvectors associated with the r largest eigenvalues of the (q × q) matrix Γ^(1/2) YᵀX (XᵀX)⁻¹ XᵀY Γ^(1/2)
● (A numerical sketch for Γ = I follows after the next two slides)

Sparse reduced-rank regression (sRRR) [Vounou et al. 2010; Le Floch et al. 2011]
● Add penalties to induce sparsity on A and/or B

Interpretation of PLS/RRR/CCA
● Latent variable models for multiple phenotypes: find latent variable pairs (tr, sr) satisfying some optimality properties [Hoggart et al. 2008; Wu et al. 2009; Vounou et al. 2010; Le Floch et al. 2011]
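A numerical sketch of the closed-form RRR solution above, in the special case Γ = I; the dimensions and noise level are illustrative, and n > p is assumed so that XᵀX is invertible. With Γ = I, B reduces to the OLS estimate projected onto the top-r eigenvectors H.

```python
# Closed-form reduced-rank regression with Gamma = I (sizes illustrative).
import numpy as np

rng = np.random.RandomState(0)
n, p, q, r = 500, 20, 15, 3
X = rng.randn(n, p)
C_true = rng.randn(p, r).dot(rng.randn(r, q))    # true rank-r coefficients
Y = X.dot(C_true) + 0.1 * rng.randn(n, q)

XtX_inv = np.linalg.inv(X.T.dot(X))
C_ols = XtX_inv.dot(X.T).dot(Y)                  # unconstrained OLS estimate

# H: first r eigenvectors of the (q x q) matrix Y'X (X'X)^-1 X'Y
M = Y.T.dot(X).dot(XtX_inv).dot(X.T).dot(Y)
eigvals, eigvecs = np.linalg.eigh(M)             # ascending eigenvalues
H = eigvecs[:, ::-1][:, :r]                      # top-r eigenvectors

B = C_ols.dot(H)                                 # (p x r)
A = H.T                                          # (r x q)
C_rrr = B.dot(A)                                 # rank-r estimate of C
print("rank of C_rrr:", np.linalg.matrix_rank(C_rrr))
print("relative fit error:", np.linalg.norm(Y - X.dot(C_rrr)) / np.linalg.norm(Y))
```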
Statistical power comparison
● Monte Carlo simulation framework [Vounou et al. 2010]
● Generate an entire population P of 10k individuals:
  ✔ Use a forward-in-time simulation approach (FREGENE)
  ✔ Reproduce features observed in real human populations
  ✔ Genotypes coded as minor-allele SNP dosage
● Generate B Monte Carlo data sets of sample size n each:
  1. Randomly sample n genotypes x from the population P
  2. Simulate the n phenotypes y from a multivariate normal distribution calibrated on real data (ADNI database)
  3. Induce an association according to an additive genetic model
● Settings: p ranging from 1,000 to 40,000; 10 predictive SNPs with small marginal effects; q = 111 phenotypes with 6 true responses

Simulation results
● [Figure: SNP sensitivity with n = 500 (left) and with n = 1000 (right)]

Simulation results (ctd)
● Large p: [Figure: ratio of SNP sensitivities (sRRR/MULM) as a function of the total number of SNPs]

More experiments: MULM vs elastic net
● Comparison of the performance of MULM and the elastic net in predicting one simulated phenotype
● [Figure: performance while varying the SNR and the sparsity; top: hybrid simulation (real SNPs); bottom: i.i.d. data (simulated SNPs)]
● Also: better support recovery with the elastic net

Real data
● Functional asymmetries in q = 19 ROIs; p = 1083 SNPs in 12 genetic regions; n = 94
● Localizer dataset [Pinel et al., submitted]
● One significant association detected using the elastic net [Thirion et al., in preparation]

Many thanks to
● Vincent Frouin, Jean-Baptiste Poline, Edouard Duchesnay, Edith Le Floch and the genim group at Neurospin
● Gaël Varoquaux, Fabian Pedregosa, Alexandre Gramfort, Vincent Michel and the scikit-learn contributors
● Giovanni Montana for providing most of this material

Bibliography
● Filippini, N., Rao, A., Wetten, S., et al. (2009). Anatomically-distinct genetic associations of APOE epsilon4 allele load with regional cortical atrophy in Alzheimer's disease. NeuroImage, 44(3):724-728.
● Hoggart, C., Whittaker, J., De Iorio, M., and Balding, D. (2008). Simultaneous analysis of all SNPs in genome-wide and re-sequencing association studies. PLoS Genetics, 4(7).
● Joyner, A. H., Roddey, J. C., Bloss, C. S., et al. (2009). A common MECP2 haplotype associates with reduced cortical surface area in humans in two independent populations. PNAS, 106(36):15475-15480.
● Meinshausen, N. and Bühlmann, P. (2009). Stability selection. Annals of Statistics.
● Potkin, S. G., Turner, J. A., Guffanti, G., et al. (2009). A genome-wide association study of schizophrenia using brain activation as a quantitative phenotype. Schizophrenia Bulletin, 35(1):96-108.
● Shen, L., Kim, S., Risacher, S. L., Nho, K., et al. (2010). Whole genome association study of brain-wide imaging phenotypes for identifying quantitative trait loci in MCI and AD: a study of the ADNI cohort. NeuroImage, pages 1-13.
● Stein, J. L., Hua, X., Lee, S., et al. (2010). Voxelwise genome-wide association study (vGWAS). NeuroImage.
● Vounou, M., Nichols, T., and Montana, G. (2010). Discovering genetic associations with high-dimensional neuroimaging phenotypes: a sparse reduced-rank regression approach. NeuroImage.
● Witten, D., Tibshirani, R., and Hastie, T. (2009). A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics, 10(3):515.
● Wu, T., Chen, Y., Hastie, T., Sobel, E., and Lange, K. (2009). Genome-wide association analysis by lasso penalized logistic regression. Bioinformatics, 25(6):714.