Metabolomics: GWAS and beyond

Structural Equation Modeling analysis
for causal inference
from multiple -omics datasets
So-Youn Shin, Ann-Kristin Petersen
Christian Gieger, Nicole Soranzo
[J. L. Griffin and J. P. Shockcor (2004) Nature Reviews Cancer]
[K. Suhre, S.-Y. Shin, et al. (2011) Nature]
Integrative Analysis for multiple –omics
1. Motivations
– To dissect biological and genetic determinants of normal
phenotypic variation and disease states
– To validate results of individual –omics levels by reducing false
positives caused by technical and methodological biases
2. Several analytical challenges
– High dimensional, highly correlated datasets
• Normalization and Missing value estimation/imputation
• Biologically relevant dimension reduction
– Methodologies
• Correlation → Causation
• Linearity → Nonlinearity (e.g. interaction)
• Multiple testing correction, Validation and Replication
Causal inference
[Diagram: DNA variation, Metabolomics and Lipids, with the causal links between them marked "?"]
Study aim:
• To dissect mediation at serum lipid loci using
metabolomics
[A.-K. Petersen, S.-Y. Shin, et al. (2011) Under Review]
Study design
1. Model selection (KORA, N=~1,800)
– Linear models on 95 LipidSNPs, 151 Metabolites (and ~10,000 ratios), 4 Lipids
– Pairwise association thresholds among LipidSNP, Metabolite and Lipid: P ≤ 3.4×10^-6, P ≤ 8.7×10^-5, P ≤ 0.05
– 50 Principal Components (97% Variance), carried forward alongside the Metabolites
2. Model testing (Structural Equation Modeling)
– LipidSNP, Metabolite, Lipid
– LipidSNP, PC, Lipid
3. Replication (TwinsUK, N=~800)
Interpretation of Principal Components
[A.-K. Petersen, S.-Y. Shin, et al. (2011) Under Review]
Structural Equation Modeling
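[Path diagram: measured variables MA, MB, MP and variables A, B]

Structural equations in matrix form, where V is the vector of all variables, U the vector of error terms and \Gamma the matrix of path coefficients:

\[ V = \Gamma V + U \quad\Rightarrow\quad V = (I - \Gamma)^{-1} U \]

\[ E(VV') = (I - \Gamma)^{-1} \, E(UU') \, \left[ (I - \Gamma)^{-1} \right]' \]

Maximum-likelihood fit of the model-implied covariance matrix \Sigma(\theta) (the block of E(VV') for the observed variables) to the sample covariance matrix S of the p observed variables:

\[ F(\theta) = \ln|\Sigma(\theta)| + \mathrm{tr}\left( S \, \Sigma(\theta)^{-1} \right) - \ln|S| - p \]

\[ X^2 = (N - 1) \, F(\hat{\theta}) \sim \chi^2(df), \qquad df = \frac{p(p+1)}{2} - q(\theta) \]

where \hat{\theta} is the estimate minimizing F and q(\theta) is the number of free parameters.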
R package “sem”
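As a rough illustration of the fit statistic on this slide, the sketch below evaluates the standard ML discrepancy F = ln|Σ| + tr(SΣ⁻¹) − ln|S| − p and the resulting X² = (N − 1)F for small covariance matrices. It is not the "sem" package's implementation: the helper names (`det_and_inverse`, `ml_fit`, `chi_square`) and the 2×2 example matrices are assumptions made for illustration, and Σ is supplied directly rather than estimated.

```python
import math

def det_and_inverse(m):
    """Return (determinant, inverse) of a square matrix via Gauss-Jordan elimination."""
    n = len(m)
    # Augment a float copy of m with the identity matrix.
    a = [[float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
         for i, row in enumerate(m)]
    det = 1.0
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign of the determinant
        p = a[col][col]
        det *= p
        a[col] = [x / p for x in a[col]]
        for r in range(n):
            if r != col:
                f = a[r][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return det, [row[n:] for row in a]

def ml_fit(S, Sigma):
    """ML discrepancy F = ln|Sigma| + tr(S Sigma^-1) - ln|S| - p."""
    p = len(S)
    det_sigma, sigma_inv = det_and_inverse(Sigma)
    det_s, _ = det_and_inverse(S)
    trace = sum(S[i][k] * sigma_inv[k][i] for i in range(p) for k in range(p))
    return math.log(det_sigma) + trace - math.log(det_s) - p

def chi_square(S, Sigma, n_samples, n_free_params):
    """Return (X^2, df) with X^2 = (N - 1) F and df = p(p+1)/2 - q."""
    p = len(S)
    df = p * (p + 1) // 2 - n_free_params
    return (n_samples - 1) * ml_fit(S, Sigma), df

# Hypothetical example: two standardized variables with covariance 0.5,
# tested against a model that forces their covariance to zero
# (two free parameters: the two variances).
S = [[1.0, 0.5], [0.5, 1.0]]
independence = [[1.0, 0.0], [0.0, 1.0]]
x2, df = chi_square(S, independence, n_samples=101, n_free_params=2)
print(round(x2, 2), df)  # 28.77 1
```

When Σ equals S the discrepancy is zero, which is why a large X² relative to its degrees of freedom signals a poorly fitting model.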
1

Structural Equation Modeling
[Path diagrams: Models 1 to 10, each a different configuration of edges among SNP, MET and LIP]
All possible models → Best fit (p-value, BIC)
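The selection step could look roughly like the sketch below, which ranks candidate models by BIC only (the p-value criterion is left out). It assumes the convention BIC = X² − df·ln(N) that is common in SEM software; whether the study used exactly this form is an assumption. The model labels and the (X², df) values are hypothetical and do not correspond to Models 1 to 10.

```python
import math

def sem_bic(chi_sq, df, n_samples):
    """BIC under the assumed convention X^2 - df * ln(N); lower is better."""
    return chi_sq - df * math.log(n_samples)

def best_model(candidates, n_samples):
    """Score every candidate and return (name of the lowest-BIC model, all scores)."""
    scores = {name: sem_bic(x2, df, n_samples)
              for name, (x2, df) in candidates.items()}
    return min(scores, key=scores.get), scores

# Hypothetical (chi-square, degrees of freedom) pairs for three candidate models.
candidates = {
    "mediated": (0.8, 1),      # e.g. SNP -> MET -> LIP only
    "direct_only": (25.0, 1),  # e.g. SNP -> LIP and SNP -> MET, no MET -> LIP
    "independent": (30.0, 2),  # e.g. SNP -> MET only
}
best, scores = best_model(candidates, n_samples=1800)
print(best)  # mediated
```

Under this convention a model is rewarded for each degree of freedom it leaves unused, so a parsimonious model with a small X² wins over a better-fitting but more heavily parameterized one.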
Structural Equation Modeling
• Assumptions
– Statistical assumptions (like any regression models)
– Causal assumptions (based on biological knowledge)
• Pros
– Flexible hypotheses : Both direct and indirect effects are allowed.
(vs. Mendelian Randomization)
– A variable can be both predictor and response simultaneously. (vs.
Bayesian network analysis)
• Cons
– Nonlinearity cannot be detected.
– Hidden confounders or measurement errors can mislead causal
inference (as with biological experiments).
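To make "direct and indirect effects" concrete, here is a minimal sketch of the decomposition for a single SNP, metabolite and lipid. It uses per-equation least squares on synthetic, noise-free data rather than a maximum-likelihood SEM fit, so it shows only the arithmetic of the path model; the variable names (`snp`, `met`, `lip`) and all values are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)

def slope(x, y):
    """OLS slope of y on x (with intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def two_predictor_slopes(x1, x2, y):
    """OLS slopes of y on x1 and x2 (with intercept) from the 2x2 normal equations."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Synthetic data: genotype coded 0/1/2, metabolite driven by the SNP plus
# variation uncorrelated with it, lipid driven by both metabolite and SNP.
snp = [0, 1, 2, 0, 1, 2]
other = [1, 1, 1, -1, -1, -1]
met = [2 * s + e for s, e in zip(snp, other)]
lip = [3 * m + 0.5 * s for m, s in zip(met, snp)]

a = slope(snp, met)                              # SNP -> MET
b, direct = two_predictor_slopes(met, snp, lip)  # MET -> LIP and SNP -> LIP
indirect = a * b                                 # SNP -> MET -> LIP
total = slope(snp, lip)                          # SNP -> LIP ignoring MET

print(round(indirect, 6), round(direct, 6), round(total, 6))  # 6.0 0.5 6.5
```

The total effect equals the sum of the indirect and direct paths, which is the decomposition a model with both SNP → MET → LIP and SNP → LIP edges estimates.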
Causal inference
[Diagram: DNA variation, Metabolomics, Lipids]
• We tested 95 loci associated with serum lipid levels.
• We applied SEM for causal inference on individual metabolites (METs) or principal components (PCs).
• 260 association sets met our criteria for significant edges in SNP → MET → Lipid at 3 loci (FADS1, GCKR, APOA1).
• METs and PCs showed similar results.
We suggest that SEM is an appropriate statistical instrument to dissect
the contribution of intermediate phenotypes to complex biological
pathways.
[A.-K. Petersen, S.-Y. Shin, et al. (2011) Under Review]
Our ongoing project
TwinsUK
~600k SNPs
~48k Probes
~32k Metabolic traits
Overlapping N = ~600
Missing Values
• Issues in multivariate analyses
• Ignore vs. Impute
• How to impute
– Impute with mean (row mean)
– K-nearest-neighbors (kNN)
– Transform based methods (SVD, Bayesian PCA)
• BPCA and GMC (Gaussian mixtures) seemed to perform better than
SVD, row mean and kNN [R. Jornsten et al. (2005) Bioinformatics]
• BPCA and LSA (least squares adaptive) appeared to be the best [S. Oh
et al. (2011) Bioinformatics]
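The two simplest options in the list above, row-mean and kNN imputation, can be sketched as follows. This is an illustrative version only: it assumes rows are features (for example metabolites) and columns are samples, marks missing values with `None`, and measures distance over the columns two rows both have observed. BPCA and the other model-based methods are not covered; the function names and the toy matrix are hypothetical.

```python
import math

def row_mean_impute(rows):
    """Replace each missing value (None) with the mean of its row's observed values."""
    out = []
    for row in rows:
        observed = [v for v in row if v is not None]
        m = sum(observed) / len(observed)
        out.append([m if v is None else v for v in row])
    return out

def knn_impute(rows, k=2):
    """Fill each missing entry with the average of that column in the k nearest rows."""
    def distance(a, b):
        # Root-mean-square difference over columns observed in both rows.
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    out = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is not None:
                continue
            # Only rows with column j observed can donate a value.
            candidates = sorted((distance(row, other), idx)
                                for idx, other in enumerate(rows)
                                if idx != i and other[j] is not None)
            nearest = [idx for d, idx in candidates[:k] if d < math.inf]
            if nearest:  # otherwise the entry stays missing
                out[i][j] = sum(rows[idx][j] for idx in nearest) / len(nearest)
    return out

# Toy matrix: the first row is missing its third value.
data = [
    [1.0, 2.0, None],
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 3.2],
    [9.0, 9.0, 9.0],
]
by_mean = row_mean_impute(data)
by_knn = knn_impute(data, k=2)
print(by_mean[0][2], round(by_knn[0][2], 2))  # 1.5 3.1
```

In this toy case the row mean ignores the structure shared with similar rows, while kNN borrows the value from the two rows that resemble the incomplete one.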
Test of Bayesian PCA
[Plot: Probabilistic PCA (PPCA) vs. Bayesian PCA (BPCA); y-axis cumulative R^2 from 0 to 1, x-axis from 1 to 77]
R-package “pcaMethods”
Dimension Reduction Methods
• Principal Component Analysis
– kernel PCA
• Factor Analysis, Multidimensional Scaling
• Canonical Correlation Analysis
– regularized CCA
– kernel CCA
• Partial Least Squares (canonical mode)
– Sparse PLS
Test of rCCA
• No significant cross-correlation between the two datasets
• rCCA extracts features (canonical variates) while maximizing the correlation
between the two datasets
• Significant cross-correlation after rCCA: is this biologically meaningful or not?
R package “CCA” or “mixOmics”
Integrative Analysis for multiple –omics
Open questions remain on how best to integrate
multiple -omics datasets to understand underlying
biological mechanisms and infer causal pathways.
Acknowledgements
Wellcome Trust Sanger Institute
• Nicole Soranzo, Yasin Memari, Aparna Radhakrishnan
• Panos Deloukas, Elin Grunberg
KORA
• Ann-Kristin Petersen, Christian Gieger, Karsten Suhre
TwinsUK
• Tim Spector, Massimo Mangino, Guangju Zhai, Kerrin Small