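Associating Genomic Variations with Phenotypes
Model comparison, rare variants, and analysis pipeline

Qunyuan Zhang
Division of Statistical Genomics & Genome Institute
Washington University School of Medicine


Data & Question

i     Y      X
1     y_1    x_11   x_12   ...   x_1m
2     y_2    x_21   x_22   ...   x_2m
...   ...    ...    ...    ...   ...
n     y_n    x_n1   x_n2   ...   x_nm

Y: phenotypes (quantitative, categorical)
X: genotypes (SNP, insertion, deletion, duplication, inversion, translocation, …)

Relationship between X and Y?


Linkage & Association

i     Y (phenotype)   X (genotype)   Q      X (genotype)
1     y_1             x_11           q_12   x_12
2     y_2             x_21           q_22   x_22
...   ...             ...            ...    ...
n     y_n             x_n1           q_n2   x_n2

Association: (Y, X)
Linkage: (Y, Q)
Q is the putative QTL. It is unobservable and lies between the two markers, at recombination fractions r_1 and r_2.


A Fixed-effect Mixture Model For Linkage

Commonly used in plant genetics.
Cross design: P1 × P2 => F1 => F2, with the QTL Q between SNP A and SNP B (recombination fractions r_1 and r_2).

$$ f(y_i) = \sum_{j=1}^{3} P(Q_j \mid X_i, r)\, \frac{1}{\sqrt{2\pi}\,\sigma_j} \exp\left( -\frac{1}{2} \left( \frac{y_i - \mu_j}{\sigma_j} \right)^2 \right) $$

$$ L(Y) = \prod_{i=1}^{n} f(y_i) $$


A Variance-component Model For Linkage

Commonly used in human genetics.
The QTL Q lies between SNP A and SNP B (recombination fractions r_1 and r_2); the marker data give the QTL IBD matrix $\Delta_Q$.

$$ L(Y) = \frac{1}{(2\pi)^{n/2} |V|^{1/2}} \exp\left( -\frac{1}{2} (Y - \mu)^T V^{-1} (Y - \mu) \right) $$

$$ V = \mathrm{Cov}(Y) = \Delta_Q \sigma_Q^2 + \Delta_g \sigma_g^2 + I \sigma_e^2 $$

$\Delta_Q$: QTL IBD matrix
$\Delta_g$: background IBD matrix
$I$: diagonal unit matrix


Variance-component Model = Random-effect Linear Model

$$ Y = \mu + Z_Q \gamma_Q + Z_g \gamma_g + e $$

Random effects:
$\gamma_Q \sim MVN(0, \Delta_Q \sigma_Q^2)$
$\gamma_g \sim MVN(0, \Delta_g \sigma_g^2)$
$e \sim N(0, \sigma_e^2)$

$$ V = \Delta_Q \sigma_Q^2 + \Delta_g \sigma_g^2 + I \sigma_e^2 $$

$$ L(Y) = \frac{1}{(2\pi)^{n/2} |V|^{1/2}} \exp\left( -\frac{1}{2} (Y - \mu)^T V^{-1} (Y - \mu) \right) $$


From Linkage to Association

Linkage model, with QTL effect(s) $\gamma_Q$:

$$ Y = \mu + Z_Q \gamma_Q + Z_g \gamma_g + e $$

Family-based association model, with marker effect(s) $\beta$ as fixed effect(s):

$$ Y = \mu + X\beta + Z_g \gamma_g + e $$

$$ V = \Delta_g \sigma_g^2 + I \sigma_e^2 $$

$$ L(Y) = \frac{1}{(2\pi)^{n/2} |V|^{1/2}} \exp\left( -\frac{1}{2} (Y - \mu - X\beta)^T V^{-1} (Y - \mu - X\beta) \right) $$


A Simple Association Model For Unrelated Subjects

$$ Y = \mu + X\beta + e $$

$$ V = I \sigma_e^2 $$

$$ L(Y) = \frac{1}{(2\pi)^{n/2} |V|^{1/2}} \exp\left( -\frac{1}{2} (Y - \mu - X\beta)^T V^{-1} (Y - \mu - X\beta) \right) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma_e} \exp\left( -\frac{1}{2} \left( \frac{y_i - \mu - X_i\beta}{\sigma_e} \right)^2 \right) $$


Covariate(s): Adjusting For Confounder(s)

$$ Y = \mu + X\beta + X_C \beta_C + e $$

Observed confounders: age, sex, etc.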
Hidden confounders: population structure

Population structure can be estimated by:
- PCA
- Clustering
- Admixture/ancestry


Modeling Hidden Genetic Correlation Between Subjects

$$ Y = \mu + X\beta + X_C \beta_C + Z_g \gamma_g + e $$

$X\beta$: marker fixed effect(s)
$X_C \beta_C$: covariate fixed effect(s)
$Z_g \gamma_g$: genetic background random effects

$$ V = \Delta_g \sigma_g^2 + I \sigma_e^2 $$

Family data: the pedigree gives $\Delta_g$ as an IBD matrix.
Population data: the correlation is hidden, and marker data give $\Delta_g$ as an IBS matrix.


Modeling Rare Variants

$$ Y = \mu + X\beta + X_C \beta_C + Z_g \gamma_g + e $$

Common variants are tested individually, H0: $\beta_1 = 0$. One p-value per variant.

$$ Y = \mu + \beta_1 X_1 + \dots $$

Rare variants are tested as an entire group (burden test), usually by gene, H0: $\beta_1 = \beta_2 = \dots = \beta_k = 0$. One p-value per group of variants.

$$ Y = \mu + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \dots $$

- Can be combined with variable selection, using loose criteria.
- $\beta$ can be treated as random effects (variance-components test) and can be weighted by prior information.


Collapsing Model

$$ Y = \mu + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \dots $$

$$ Y = \mu + \beta X + \dots $$

Collapsing multiple variables into one:

subject   X_1   X_2   X_3   X
1         0     0     0     0
2         0     1     1     1
3         1     0     0     1


Weighted Sum Model

$$ Y = \mu + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k + \dots $$

$$ Y = \mu + \beta \sum_{j=1}^{k} (w_j X_j) + \dots $$

$$ Y = \mu + \beta S + \dots $$

subject   X_1 (w_1 = 0.2)   X_2 (w_2 = 0.5)   X_3 (w_3 = 0.3)   S
1         0                 0                 0                 0.0
2         0                 1                 1                 0.8
3         1                 0                 0                 0.2


Weighting Variants

- Based on allele frequency: continuous or binary (0, 1) weight, variable threshold;
- Based on functional annotation/prediction;
- Based on sequencing quality (coverage, mapping quality, genotyping quality, validated or not, etc.);
- Data-driven: using both genotype and phenotype data, learning weights (including effect directions) from the data, which requires a permutation test;
- Any combination …

Grouping Variants

- By gene
- By transcript
- By exon
- By protein domain
- By gene set / pathway
- ……


Modeling More Data Types

Generalized Linear (Mixed) Model, where $g$ is the link function:

$$ g(Y) = \mu + X\beta + \dots + e $$

For binary Y, logistic model:

$$ g(Y) = \mathrm{logit}(Y) = \log \frac{P(Y = 1)}{P(Y = 0)} $$

$$ P(Y = 1) = \frac{\exp(\mu + X\beta + \dots + e)}{\exp(\mu + X\beta + \dots + e) + 1} $$


Longitudinal Data (quantitative)

- Fixed effect, time as covariate
- Repeated measures, random effect, correlation within subjects


Longitudinal Data (binary)

- Linear model, time as covariate
- Survival analysis, CoxPH model, etc.


Tools

SAS Procedures: REG, LOGISTIC, GENMOD, MIXED, HPMIXED, GLIMMIX, PHREG/LIFETEST
R Functions/Packages: lm(), glm(), gee, nlme, kinship2/coxme, lme4, survival
Other Programs: SOLAR, MMAP, EMMA, EMMAX, SKAT


Pipeline

1. Input (data + options)
2. Job generating/submitting module, together with a job number controlling module, submits job 1, job 2, …, job N through LSF bsub.
3. Each job i runs from its own options (Options.jobi):
   Options.jobi => self-programmed modules (SAS, R, …)
   Options.jobi => external program modules (MMAP, SKAT, ..)
4. Jobs produce Result 1, Result 2, …, Result N.
5. Job status monitoring module (all done?)
   No: wait …
   Yes: result summarizing module


gwas.sh options.gwa

gwas.sh:

    #!/bin/sh
    OPFILE=$1
    ...

Phenotype list:

Pheno   type   covar     program   analysis   run
Bmi     qt     age,sex   SASGLM    mixed      YES
Obes    ql     NA        SASGLM    gee        YES
HD      ql     age       SASGLM    gee        NO
Age     …
Sex     …
…

Programs:

Program   language   location                 Maintainer
SASGLM    SAS        /dsg1/code/sas/glm.sas   Q.Zhang
GSTAT     R          /dsg1/code/R/gstat.R     Q.Zhang
MMAP      C          /dsg1/code/sas/mmap.sh   J. Czajkowski
…

options.gwa:

    [DATA]
    database=SAS
    genotype_dir=/dsg1/gwas/fhsgeno
    genotype_file=
    phenotype_file=fhs100
    markerinfo_file=mapall
    marker_selection=MAF>0.01
    pedigree_file=pediall
    subjectID=subject
    pedgreeID=famid
    markername=snp
    …
    [ANALYSIS]
    phenolist_file=
    pheno_list=bmi/qt
    covariates=
    program=SASGLM
    analysis=mixed
    [OUTPUT]
    output_dir=/dsguser/qunyuan/fhs/bmi
    output_file=
    output_replace=no
    [RUN]
    clusterjobname=bmimixed
    memsize=1000M
    maxjobn=300
    …


Thanks!
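
Supplementary sketch for the Collapsing Model and Weighted Sum Model slides: the per-subject scores X and S in those two tables can be computed as below. This is only an illustration, not code from the pipeline. The function names `collapse` and `weighted_sum` are hypothetical, and the collapsing rule (X = 1 if the subject carries at least one variant in the group) is an assumption that matches the example table. The genotypes and weights are the ones shown on the slides.

```python
from typing import List, Sequence


def collapse(genotypes: Sequence[int]) -> int:
    """Collapsed indicator X: 1 if any variant in the group is carried, else 0.

    Assumption: this "any carrier" rule is inferred from the slide's table.
    """
    return 1 if any(g > 0 for g in genotypes) else 0


def weighted_sum(genotypes: Sequence[int], weights: Sequence[float]) -> float:
    """Weighted burden score S = sum_j w_j * X_j for one subject."""
    if len(genotypes) != len(weights):
        raise ValueError("genotypes and weights must have the same length")
    return sum(w * g for w, g in zip(weights, genotypes))


# Rows are subjects 1..3, columns are variants X_1..X_3 (values from the slides).
X: List[List[int]] = [
    [0, 0, 0],
    [0, 1, 1],
    [1, 0, 0],
]
w = [0.2, 0.5, 0.3]  # weights w_1..w_3 from the Weighted Sum Model slide

collapsed = [collapse(row) for row in X]      # collapsed column X
scores = [weighted_sum(row, w) for row in X]  # weighted-sum column S

for subject, (x, s) in enumerate(zip(collapsed, scores), start=1):
    print(f"subject {subject}: X = {x}, S = {s:.1f}")
```

Either score then replaces the k separate variant columns as a single predictor, so the group is tested with one coefficient β and yields one p-value.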