Genome-Wide Association Studies
Caitlin Collins, Thibaut Jombart
MRC Centre for Outbreak Analysis and Modelling
Imperial College London
Genetic data analysis using
30-10-2014

Outline
• Introduction to GWAS
• Study design
  o GWAS design
  o Issues and considerations in GWAS
• Testing for association
  o Univariate methods
  o Multivariate methods
    • Penalized regression methods
    • Factorial methods

Genomics & GWAS

The genomics revolution
• Sequencing technology
  o 1977 – Sanger
  o 1995 – 1st bacterial genomes
    • < 10,000 bases per day per machine
  o 2003 – 1st human genome
    • > 10,000,000,000,000 bases per day per machine
• GWAS publications
  o 2005 – 1st GWAS
    • Age-related macular degeneration
  o 2014 – 1,991 publications
    • 14,342 associations

A few GWAS discoveries…

So what is GWAS?
• Genome Wide Association Study
  o Looking for SNPs… associated with a phenotype.
• Purpose:
  o Explain
    • Understanding
    • Mechanisms
    • Therapeutics
  o Predict
    • Intervention
    • Prevention
    • Understanding not required

Association
[Figure: data matrix of n individuals (cases and controls) by p SNPs]
• Definition
  o Any relationship between two measured quantities that renders them statistically dependent.
• Heritability
  o The proportion of variance explained by genetics
  o P = G + E + G*E
• Heritability > 0

Why?
• Environment, Gene-Environment interactions
• Complex traits, small effects, rare variants
• Gene expression levels
• GWAS methodology?

Study Design

GWAS design
• Case-Control
  o Well-defined “case”
  o Known heritability
• Variations
  o Quantitative phenotypic data
    • E.g. height, biomarker concentrations
  o Explicit models
    • E.g. dominant or recessive

Issues & Considerations
• Data quality
  o 1% rule
• Controlling for confounding
  o Sex, age, health profile
  o Correlation with other variables
• Population stratification**
• Linkage disequilibrium**

Population stratification
• Definition
  o “Population stratification” = population structure
  o Systematic difference in allele frequencies between subpopulations…
    • … possibly due to different ancestry
• Problem
  o Violates assumed population homogeneity, independent observations
    • → Confounding, spurious associations
  o Case population more likely to be related than Control population
    • → Over-estimation of significance of associations

Population stratification II
• Solutions
  o Visualise
    • Phylogenetics
    • PCA
  o Correct
    • Genomic Control
    • Regression on Principal Components of PCA

Linkage disequilibrium (LD)
• Definition
  o Alleles at separate loci are NOT independent of each other
• Problem?
  o Too much LD is a problem
    • → noise >> signal
  o Some (predictable) LD can be beneficial
    • → enables use of “marker” SNPs

Testing for Association

Methods for association testing
• Standard GWAS
  o Univariate methods
• Incorporating interactions
  o Multivariate methods
    • Penalized regression methods (LASSO)
    • Factorial methods (DAPC-based FS)

Univariate methods
[Figure: data matrix of n individuals (cases and controls) by p SNPs]
• Approach
  o Individual test statistics
  o Correction for multiple testing
• Variations
  o Testing
    • Fisher’s exact test, Cochran-Armitage trend test, Chi-squared test, ANOVA
    • Gold Standard: Fisher’s exact test
  o Correcting
    • Bonferroni
    • Gold Standard: FDR

Univariate – Strengths & weaknesses
Strengths
• Straightforward
• Computationally fast
• Conservative
• Easy to interpret
Weaknesses
• Multivariate system, univariate framework
• Effect size of individual SNPs may be too small
• Marginal effects of individual SNPs ≠ combined effects

What about interactions?

Interactions
• Epistasis
  o “Deviation from linearity under a general linear model”
    $Y_i = \delta + \beta_1 x_i + \beta_2 y_i$
    vs.
    $Y_i = \delta + \beta_1 x_i + \beta_2 y_i + \gamma x_i y_i$
• With p predictors, there are $\binom{p}{k} = \frac{p!}{k!\,(p-k)!}$ k-way interactions
• p = 1,000,000 → $\binom{p}{2} \approx 5 \times 10^{11}$
  o That’s 500 BILLION possible pair-wise interactions!
• Need some way to limit the number of pairwise interactions considered…

Multivariate methods
[Figure: overview of multivariate methods. Categories shown: Neural Networks, Penalized Regression, Bayesian Approaches, Factorial Methods, Logic Trees, Logic Regression, Non-parametric Methods. Methods shown: LASSO penalized regression; the elastic net; ridge regression; Bayesian partitioning; Bayesian Logistic Regression with Stochastic Search Variable Selection; Bayesian Epistasis Association Mapping; multi-factor dimensionality reduction method; Sparse-PCA; Supervised-PCA; DAPC-based FS (snpzip); odds-ratio-based MDR; genetic programming optimized neural networks; parametric decreasing method; logic feature selection; Monte Carlo logic regression; Modified Logic Regression-Gene Expression Programming; Genetic Programming for Association Studies; set association approach; random forests; restricted partitioning method; combinatorial partitioning method]

Multivariate methods
• Penalized regression methods
  o LASSO penalized regression
• Factorial methods
  o DAPC-based feature selection

Penalized regression methods
• Approach
  o Regression models multivariate association
  o Shrinkage estimation → feature selection
• Variations
  o LASSO, Ridge, Elastic net, Logic regression
    • Gold Standard: LASSO penalized regression

LASSO penalized regression
• Regression
  o Generalized linear model (“glm”)
• Penalization
  o L1 norm
  o Coefficients → 0
  o Feature selection!
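
To make the link between the L1 penalty and feature selection concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the method or software used in this talk: it fits a linear (Gaussian) model rather than a logistic GLM to keep the code short, it uses cyclic coordinate descent with soft-thresholding, and the names `soft_threshold`, `lasso`, `simulate` and `lam`, as well as the simulated genotypes and effect sizes, are hypothetical.

```python
import random


def soft_threshold(rho, lam):
    """Shrink rho towards zero by lam; anything within [-lam, lam] becomes exactly 0."""
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0


def center(values):
    """Subtract the mean so that no intercept term is needed."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def lasso(genotypes, phenotype, lam, n_sweeps=200):
    """Coordinate descent for (1/2n)*||y - Xb||^2 + lam*||b||_1 on centred data.

    genotypes: n rows, each holding p allele counts (0, 1 or 2).
    Returns one coefficient per SNP; many are exactly zero when lam is large.
    """
    n, p = len(genotypes), len(genotypes[0])
    cols = [center([row[j] for row in genotypes]) for j in range(p)]
    resid = center(phenotype)  # residual y - Xb, starting from b = 0
    scale = [sum(x * x for x in col) / n for col in cols]
    beta = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            if scale[j] == 0.0:  # monomorphic SNP: nothing to estimate
                continue
            col = cols[j]
            # Covariance of SNP j with the partial residual (its own effect added back).
            rho = sum(x * r for x, r in zip(col, resid)) / n + scale[j] * beta[j]
            new = soft_threshold(rho, lam) / scale[j]
            if new != beta[j]:
                delta = new - beta[j]
                resid = [r - delta * x for r, x in zip(resid, col)]
                beta[j] = new
    return beta


def simulate(n=200, p=20, seed=1):
    """Hypothetical data: SNP 0 raises and SNP 3 lowers a quantitative phenotype."""
    rng = random.Random(seed)
    genotypes = [[(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(p)]
                 for _ in range(n)]
    phenotype = [2.0 * row[0] - 1.5 * row[3] + rng.gauss(0.0, 0.5) for row in genotypes]
    return genotypes, phenotype


genotypes, phenotype = simulate()
beta = lasso(genotypes, phenotype, lam=0.1)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("SNPs with non-zero coefficients:", selected)
```

Increasing `lam` pushes more coefficients to exactly zero, which is the feature-selection behaviour listed on the slide; the choice of `lam` is left to the user here, whereas in practice it is usually calibrated, for example by cross-validation.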
LASSO – Strengths & weaknesses
Strengths
• Stability
• Interpretability
• Likely to accurately select the most influential predictors
• Sparsity
Weaknesses
• Multicollinearity
• Not designed for high-p
• Computationally intensive
• Calibration of penalty parameters
  o User-defined → variability
• Sparsity

Factorial methods
• Approach
  o Place all variables (SNPs) in a multivariate space
  o Identify discriminant axis → best separation
  o Select variables with the highest contributions to that axis
• Variations
  o Supervised-PCA, Sparse-PCA, DA, DAPC-based FS
  o Our focus: DAPC with feature selection (snpzip)

DAPC-based feature selection
[Figure: individuals (diseased “cases” and healthy “controls”) by alleles a–e; density of individuals along the discriminant axis; contribution of each allele a–e to the discriminant axis]

DAPC-based feature selection
• Where should we draw the line?
  o → Hierarchical clustering
[Figure: density of individuals along the discriminant axis; contribution of alleles a–e to the discriminant axis, with the selection threshold marked “?”]

Hierarchical clustering (FS)
[Figure: contribution of alleles a–e to the discriminant axis]
Hooray!

DAPC – Strengths & weaknesses
Strengths
• More likely to catch all relevant SNPs (signal)
• Computationally quick
• Good exploratory tool
• Redundancy > sparsity
Weaknesses
• Sensitive to n.pca
• N.snps.selected varies
• No “p-value”
• Redundancy > sparsity

Conclusions
• Study design
  o GWAS design
  o Issues and considerations in GWAS
• Testing for association
  o Univariate methods
  o Multivariate methods
    • Penalized regression methods
    • Factorial methods

Thanks for listening!

Questions?
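
As a supplementary illustration of the “Where should we draw the line?” step in DAPC-based feature selection, the sketch below clusters allele contributions hierarchically into two groups and keeps the group with the higher mean contribution. It is a sketch under stated assumptions, not the snpzip implementation: average linkage is chosen here purely for illustration, the names `average_linkage` and `select_features` are hypothetical, and the contribution values for alleles a–e are made up rather than read from the slides.

```python
def average_linkage(values, a, b):
    """Mean absolute difference over all pairs drawn from clusters a and b."""
    return sum(abs(values[i] - values[j]) for i in a for j in b) / (len(a) * len(b))


def select_features(contributions):
    """Agglomerative clustering of contributions, cut at two clusters.

    contributions: dict mapping feature name -> contribution to the discriminant axis
    (at least one feature is required).
    Returns the sorted names in the cluster with the higher mean contribution.
    """
    names = list(contributions)
    values = [contributions[name] for name in names]
    clusters = [[i] for i in range(len(values))]  # start with one cluster per feature
    while len(clusters) > 2:
        # Find the two closest clusters and merge them.
        pairs = [(average_linkage(values, clusters[s], clusters[t]), s, t)
                 for s in range(len(clusters))
                 for t in range(s + 1, len(clusters))]
        _, s, t = min(pairs)
        clusters[s] = clusters[s] + clusters[t]
        del clusters[t]
    top = max(clusters, key=lambda c: sum(values[i] for i in c) / len(c))
    return sorted(names[i] for i in top)


# Made-up contributions for five alleles (they sum to 1, as loadings would).
contributions = {"a": 0.04, "b": 0.41, "c": 0.10, "d": 0.37, "e": 0.08}
print(select_features(contributions))  # prints ['b', 'd']
```

Cutting the tree at two clusters avoids choosing a numeric threshold by hand, but the number of features selected still depends on the linkage rule and on the contributions themselves, which is consistent with the “N.snps.selected varies” weakness noted above.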