Structural Equation Modeling analysis for causal inference from multiple -omics datasets

So-Youn Shin, Ann-Kristin Petersen, Christian Gieger, Nicole Soranzo

[J. L. Griffin and J. P. Shockcor (2004) Nature Reviews Cancer]
[K. Suhre, S.-Y. Shin, et al. (2011) Nature]

Integrative Analysis for multiple -omics

1. Motivations
– To dissect biological and genetic determinants of normal phenotypic variation and disease states
– To validate results of individual -omics levels by reducing false positives caused by technical and methodological biases

2. Several analytical challenges
– High dimensional, highly correlated datasets
  • Normalization and Missing value estimation/imputation
  • Biologically relevant dimension reduction
– Methodologies
  • Correlation → Causation
  • Linearity → Nonlinearity (e.g. interaction)
  • Multiple testing correction, Validation and Replication

Causal inference

[Diagram: DNA variation, Metabolomics, Lipids; the causal relationship among them is marked "?"]

Study aim:
• To dissect mediation at serum lipid loci using metabolomics
[A.-K. Petersen, S.-Y. Shin, et al. (2011) Under Review]

Study design

Model selection (KORA, N=~1,800)
• Linear models on 95 LipidSNPs, 151 Metabolites (and ~10,000 ratios), 4 Lipids
• [Diagram: LipidSNP, Metabolite, Lipid; edge thresholds P ≤ 3.4×10^-6, P ≤ 8.7×10^-5, P ≤ 0.05]
• 50 Principal Components (97% Variance) derived from the Metabolites

Model testing (Structural Equation Modeling), on Metabolites and on PCs
• [Diagrams: LipidSNP, Metabolite, Lipid; LipidSNP, PC, Lipid]

Replication (TwinsUK, N=~800)

Interpretation of Principal Components
[A.-K. Petersen, S.-Y. Shin, et al.
(2011) Under Review]

Structural Equation Modeling

[Path diagram and matrix form of an example model over the variables A, B, M and P]

With the structural equations written as V = BV + U, the model-implied covariance is

$$\Sigma(\theta) = E(VV') = (I - B)^{-1}\, E(UU')\, \left[(I - B)^{-1}\right]'$$

The model is fitted by minimizing the maximum-likelihood discrepancy between the sample covariance S and the implied covariance:

$$F(\theta) = \ln\lvert\Sigma(\theta)\rvert + \operatorname{tr}\!\left(S\,\Sigma(\theta)^{-1}\right) - \ln\lvert S\rvert - p$$

$$X^2 = (N - 1)\,F(\hat{\theta}) \sim \chi^2_{df}, \qquad df = \frac{p(p+1)}{2} - q(\theta)$$

Here p is the number of observed variables, N the sample size, and q(θ) the number of free parameters.

R package “sem”

Structural Equation Modeling

[Diagrams: Model 1 to Model 10, each a different path model over SNP, MET and LIP]

All possible models -> Best fit (p-value, BIC)

Structural Equation Modeling

• Assumptions
– Statistical assumptions (as in any regression model)
– Causal assumptions (based on biological knowledge)
• Pros
– Flexible hypotheses: both direct and indirect effects are allowed. (vs. Mendelian Randomization)
– A variable can be both predictor and response simultaneously. (vs. Bayesian network analysis)
• Cons
– Nonlinearity cannot be detected.
– Hidden confounders or measurement errors can mislead causal inference. (as with biological experiments)

Causal inference

[Diagram: DNA variation, Metabolomics, Lipids]

• We tested 95 loci associated with serum lipid levels.
• We applied SEM to test causal inference, on METs or PCs.
• 260 association sets met our criteria for significant edges in SNP -> MET -> Lipid at 3 loci (FADS1, GCKR, APOA1). METs and PCs showed similar results.
• We suggest that SEM is an appropriate statistical instrument to dissect the contribution of intermediate phenotypes to complex biological pathways.
[A.-K. Petersen, S.-Y. Shin, et al. (2011) Under Review]

Our ongoing project

TwinsUK
• ~600k SNPs
• ~48k Probes
• ~32k Metabolic traits
• Overlapping N = ~600

Missing Values

• Issues in multivariate analyses
• Ignore vs. Impute
• How to impute
– Impute with mean (row mean)
– K-nearest-neighbors (kNN)
– Transform-based methods (SVD, Bayesian PCA)
• BPCA and GMC (Gaussian mixtures) seemed to perform better than SVD, row mean and kNN [R. Jornsten et al.
(2005) Bioinformatics]
• BPCA and LSA (least squares adaptive) appeared to be the best [S. Oh et al. (2011) Bioinformatics]

Test of Bayesian PCA

[Plot: cumulative R^2 (0 to 1) for components 1 to 77, Probabilistic PCA (PPCA) vs. Bayesian PCA (BPCA)]

R package “pcaMethods”

Dimension Reduction Methods

• Principal Component Analysis
– kernel PCA
• Factor Analysis, Multidimensional Scaling
• Canonical Correlation Analysis
– regularized CCA
– kernel CCA
• Partial Least Squares (canonical mode)
– Sparse PLS

Test of rCCA

• No significant cross-correlation between the two datasets
• rCCA extracts features (canonical variates) while maximizing the correlation between the two datasets
• Significant cross-correlation after rCCA: is this biologically meaningful or not?

R package “CCA” or “mixOmics”

Integrative Analysis for multiple -omics

Open questions remain on how best to integrate multiple -omics datasets to understand underlying biological mechanisms and infer causal pathways.

Acknowledgements

Wellcome Trust Sanger Institute
• Nicole Soranzo, Yasin Memari, Aparna Radhakrishnan
• Panos Deloukas, Elin Grunberg

KORA
• Ann-Kristin Petersen, Christian Gieger, Karsten Suhre

TwinsUK
• Tim Spector, Massimo Mangino, Guangju Zhai, Kerrin Small
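
Supplementary sketch: SEM fit statistics for one candidate model

The Structural Equation Modeling slides rank candidate models by p-value and BIC. The sketch below illustrates those statistics for a single candidate, the full-mediation model SNP -> MET -> LIP with no direct SNP -> LIP edge. It is an illustration under stated assumptions, not the analysis pipeline used in the study (which relied on the R package “sem”). The covariance values are hypothetical, the function names are invented for this example, and the BIC form (chi-square minus df times ln N) is one common convention that is assumed here rather than taken from the slides. For a recursive path model like this one, the maximum-likelihood implied covariance has a closed form, so no numerical optimizer is needed.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def inv3(m):
    """Inverse of a 3x3 matrix via the adjugate (adequate for three variables)."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = det3(m)
    adj = [
        [e * i - f * h, c * h - b * i, b * f - c * e],
        [f * g - d * i, a * i - c * g, c * d - a * f],
        [d * h - e * g, b * g - a * h, a * e - b * d],
    ]
    return [[x / det for x in row] for row in adj]


def implied_cov_full_mediation(S):
    """ML implied covariance for SNP -> MET -> LIP (order: 0=SNP, 1=MET, 2=LIP).

    Without a direct SNP -> LIP edge, cov(SNP, LIP) is forced to equal
    cov(SNP, MET) * cov(MET, LIP) / var(MET); all other entries match S.
    """
    sigma = [row[:] for row in S]
    sigma[0][2] = sigma[2][0] = S[0][1] * S[1][2] / S[1][1]
    return sigma


def ml_discrepancy(S, sigma):
    """F = ln|Sigma| + tr(S Sigma^-1) - ln|S| - p."""
    p = len(S)
    inv = inv3(sigma)
    trace = sum(S[i][k] * inv[k][i] for i in range(p) for k in range(p))
    return math.log(det3(sigma)) + trace - math.log(det3(S)) - p


def fit_statistics(S, sigma, n_samples, n_free_params):
    """Chi-square, degrees of freedom, p-value (df == 1 only) and BIC."""
    p = len(S)
    chi2 = max((n_samples - 1) * ml_discrepancy(S, sigma), 0.0)
    df = p * (p + 1) // 2 - n_free_params
    # For df == 1 the chi-square tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2)) if df == 1 else None
    bic = chi2 - df * math.log(n_samples)  # assumed BIC convention
    return {"chi2": chi2, "df": df, "p_value": p_value, "bic": bic}


# Hypothetical correlation matrix for (SNP, MET, LIP); not study data.
S_example = [
    [1.0, 0.3, 0.1],
    [0.3, 1.0, 0.4],
    [0.1, 0.4, 1.0],
]
sigma_example = implied_cov_full_mediation(S_example)
# Five free parameters: var(SNP), two path coefficients, two error variances.
print(fit_statistics(S_example, sigma_example, n_samples=1800, n_free_params=5))
```

A large p-value means the restricted model reproduces the observed covariances well, so the data give no evidence for a direct SNP -> LIP effect beyond the path through the metabolite. Comparing all ten candidate models would repeat this calculation with each model's own implied covariance.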