Association Analysis of Rare Genetic Variants
Qunyuan Zhang
Division of Statistical Genomics
Course M21-621 Computational Statistical Genetics

Rare Variants
• Low allele frequency: usually less than 1%
• Low power: for most analyses, due to less variation of observations
• High false positive rate: for some model-based analyses, due to sparse distribution of data, unstable/biased parameter estimation and inflated p-values

An Example of Low Power
Figure: Jonathan C. Cohen, et al. Science 305, 869 (2004)

An Example of High False Positive Rate
(Q-Q plots from GWAS data, unpublished)
Panels: N=~2500, MAF>0.03 | N=~2500, MAF<0.03 | N=~2500, MAF<0.03, Permuted | N=50000, MAF<0.03, Bootstrapped

Three Levels of Rare Variant Data
Level 1: Individual-level
Level 2: Summarized over subjects
Level 3: Summarized over both subjects and variants

Level 1: Individual-level
Subject  V1  V2  V3  V4  Trait-1  Trait-2
1        1   0   0   0   90.1     1
2        0   1   0   .   99.2     1
3        0   0   0   0   105.9    0
4        0   0   0   0   89.5     0
5        0   .   0   0   97.6     0
6        0   0   0   0   110.5    0
7        0   0   1   0   88.8     0
8        0   0   0   1   95.4     1

Level 2: Summarized over subjects (by group)
Table: Jonathan C. Cohen, et al. Science 305, 869 (2004)

Level 3: Summarized over subjects (by group) and variants (usually by gene)
                 Variant allele number  Reference allele number  Total
Low-HDL group    20                     236                      256
High-HDL group   2                      254                      256
Total            22                     490                      512

Methods For Level 3 Data

Single-variant Test vs Total Freq. Test (TFT)
Jonathan C. Cohen, et al.
Science 305, 869 (2004)

What we have learned …
• Single-variant tests of rare variants have very low power for detecting association, due to extremely low frequency (usually < 0.01)
• Testing the collective effect of a set of rare variants may increase the power (sum test, collective test, group test, collapsing test, burden test…)

Methods For Level 2 Data
• Allowing different sample sizes for different variants
• Different variants can be weighted differently

CAST: A cohort allelic sums test
Morgenthaler and Thilly, Mutation Research 615 (2007) 28–56
Under H0: S(cases)/2N(cases) − S(controls)/2N(controls) = 0
S: variant number; N: sample size
T = S(cases) − S(controls) × N(cases)/N(controls) = S(cases) − S*(controls)
(S can be calculated variant by variant and can be weighted differently; the final T = sum(WiSi))
Z = T/SQRT(Var(T)) ~ N(0,1)
Var(T) = Var(S(cases) − S*(controls))
       = Var(S(cases)) + Var(S*(controls))
       = Var(S(cases)) + Var(S(controls)) × [N(cases)/N(controls)]^2

C-alpha
PLOS Genetics, 2011 | Volume 7 | Issue 3 | e1001322
Effect direction problem

QQ Plots of Existing Methods (under the null)
Panels: EFT | CAST | TFT | C-alpha
• EFT and C-alpha are inflated with false positives
• TFT and CAST show no inflation, but assume a single effect direction
• Objective: more general, powerful methods …

More Generalized Methods For Level 2 Data

Structure of Level 2 data
One table per variant: variant 1, variant 2, variant 3, … variant i … variant k
Strategy: instead of testing total freq./number, we test the randomness of all tables.

Exact Probability Test (EPT)
1. Calculating the probability of each table based on the hypergeometric distribution
   $P_i = \dfrac{\binom{n_{i1}}{a_{i1}} \binom{n_{i2}}{a_{i2}}}{\binom{N_i}{n_{iA}}}$
   (n_{i1}, n_{i2}: group totals; a_{i1}, a_{i2}: variant counts in each group; for a hypergeometric probability, N_i = n_{i1} + n_{i2} and n_{iA} = a_{i1} + a_{i2})
2. Calculating the logarized joint probability (L) for all k tables
   $L = \sum_{i=1}^{k} \log(P_i)$
3. Enumerating all possible tables and L scores
4. Calculating p-value P = Prob.( )
ASHG Meeting 1212, Zhang

Likelihood Ratio Test (LRT)
Binomial distribution
$LR = -2 \log \dfrac{\prod_{i=1}^{k} \Pr(a_{i1}, b_{i1}, a_{i2}, b_{i2} \mid H_0: \pi_{i1} = \pi_{i2})}{\prod_{i=1}^{k} \Pr(a_{i1}, b_{i1}, a_{i2}, b_{i2} \mid H_A: \pi_{i1} \neq \pi_{i2})} \sim \chi^2_{df=k}$
(π_{i1}, π_{i2}: binomial probabilities for variant i in groups 1 and 2)
ASHG Meeting 1212, Zhang

Q-Q Plots of EPT and LRT (under the null)
Panels: EPT N=500 | LRT N=500 | EPT N=3000 | LRT N=3000

Power Comparison
Significance level = 0.00001
Variant proportion: Neutral 20%, Positive causal 80%, Negative causal 0%
Plot: Power vs Sample size

Power Comparison
Significance level = 0.00001
Variant proportion: Neutral 20%, Positive causal 60%, Negative causal 20%
Plot: Power vs Sample size

Power Comparison
Significance level = 0.00001
Variant proportion: Neutral 20%, Positive causal 40%, Negative causal 40%
Plot: Power vs Sample size

Methods For Level 1 Data
• Including covariates
• Extended to quantitative traits
• Better control for population structure
• More sophisticated models

Collapsing (C) test
Li and Leal, The American Journal of Human Genetics 2008(83): 311–321
Step 1
Step 2: logit(y) = a + b*X + e (logistic regression)

Variant Collapsing
         (+)  (+)  (.)  (.)
Subject  V1   V2   V3   V4   Collapsed  Trait
1        1    0    0    0    1          1
2        0    1    0    0    1          1
3        0    0    0    0    0          0
4        0    0    0    0    0          0
5        0    0    0    0    0          0
6        0    0    0    0    0          0
7        0    0    1    0    1          0
8        0    0    0    1    1          1

WSS

Weighted Sum Test
$s = \sum_{i=1}^{m} w_i g_i$
• Collapsing test (Li & Leal, 2008): wi = 1, and s = 1 if s > 1
• Weighted-sum test (Madsen & Browning, 2009): wi calculated based on allele freq.
in control group
• aSum: Adaptive sum test (Han & Pan, 2010): wi = -1 if b<0 and p<0.1, otherwise wi = 1
• KBAC (Liu and Leal, 2010): wi = left tail p value
• RBT (Ionita-Laza et al, 2011): wi = log scaled probability
• PWST: p-value weighted sum test (Zhang et al., 2011): wi = rescaled left tail p value, incorporating both significance and directions
• EREC (Lin et al, 2011): wi = estimated effect size

When there are only causal(+) variants …
         (+)  (+)
Subject  V1   V2   Collapsed  Trait
1        1    0    1          3.00
2        0    1    1          3.10
3        0    0    0          1.95
4        0    0    0          2.00
5        0    0    0          2.05
6        0    0    0          2.10
Plot: Trait vs Collapsed Genotype (0, 1)
Collapsing (Li & Leal, 2008) works well, power increased

When there are causal(+) and non-causal(.) variants …
         (+)  (+)  (.)  (.)
Subject  V1   V2   V3   V4   Collapsed  Trait
1        1    0    0    0    1          3.00
2        0    1    0    0    1          3.10
3        0    0    0    0    0          1.95
4        0    0    0    0    0          2.00
5        0    0    0    0    0          2.05
6        0    0    0    0    0          2.10
7        0    0    1    0    1          2.00
8        0    0    0    1    1          2.10
Plot: Trait vs Collapsed Genotype (0, 1)
Collapsing still works, power reduced

When there are causal(+), non-causal(.) and causal(-) variants …
         (+)  (+)  (.)  (.)  (-)  (-)
Subject  V1   V2   V3   V4   V5   V6   Collapsed  Trait
1        1    0    0    0    0    0    1          3.00
2        0    1    0    0    0    0    1          3.10
3        0    0    0    0    0    0    0          1.95
4        0    0    0    0    0    0    0          2.00
5        0    0    0    0    0    0    0          2.05
6        0    0    0    0    0    0    0          2.10
7        0    0    1    0    0    0    1          2.00
8        0    0    0    1    0    0    1          2.10
9        0    0    0    0    1    0    1          0.95
10       0    0    0    0    0    1    1          1.00
Plot: Trait vs Collapsed Genotype (0, 1)
Power of collapsing test significantly down

P-value Weighted Sum Test (PWST)
           (+)    (+)    (.)    (.)    (-)    (-)
Subject    V1     V2     V3     V4     V5     V6     Collapsed  pSum   Trait
1          1      0      0      0      0      0      1          0.86   3.00
2          0      1      0      0      0      0      1          0.90   3.10
3          0      0      0      0      0      0      0          0.00   1.95
4          0      0      0      0      0      0      0          0.00   2.00
5          0      0      0      0      0      0      0          0.00   2.05
6          0      0      0      0      0      0      0          0.00   2.10
7          0      0      1      0      0      0      1          -0.02  2.00
8          0      0      0      1      0      0      1          0.08   2.10
9          0      0      0      0      1      0      1          -0.90  0.95
10         0      0      0      0      0      1      1          -0.88  1.00
t          1.61   1.84   -0.04  0.11   -1.84  -1.72
p(x≤t)     0.93   0.95   0.49   0.54   0.05   0.06
2*(p-0.5)  0.86   0.90   -0.02  0.08   -0.90  -0.88
Rescaled left-tail p-value [-1,1] is used as weight

P-value Weighted Sum Test (PWST)
Plot: Trait vs pSum (-1.0 to 1.0)
Power of collapsing test is retained even when there are bidirectional effects

PWST: Q-Q Plots Under the Null
Direct test: inflation of type I error
Corrected by permutation test (permutation of phenotype)

Generalized Linear Mixed Model (GLMM) & Weighted Sum Test (WST)

GLMM & WST
$Y = \alpha + \beta \sum_{i=1}^{m} w_i g_i + X\tau + Z\gamma + \varepsilon$
Y : quantitative trait or logit(binary trait)
α : intercept
β : regression coefficient of weighted sum
m : number of RVs to be collapsed
wi : weight of variant i
gi : genotype (recoded) of variant i
Σwigi : weighted sum (WS)
X : covariate(s), such as population structure variable(s)
τ : fixed effect(s) of X
Z : design matrix corresponding to γ
γ : random polygene effects for individual subjects, ~N(0, G), G = 2σ²K, where K is the kinship matrix and σ² the additive polygene genetic variance
ε : residual

Weight
$\sum_{i=1}^{m} w_i g_i$
• Based on allele frequency: binary (0,1) or continuous, fixed or variable threshold;
• Based on function annotation/prediction: SIFT, PolyPhen etc.;
• Based on sequencing quality (coverage, mapping quality, genotyping quality etc.);
• Data-driven, using both genotype and phenotype data: learning weight from data or adaptive selection, permutation test;
• Any combination …

Application 1: Family Data
Adjusting relatedness in family data for non-data-driven test of rare variants.
Unadjusted: $Y = \alpha + \beta \sum_{i=1}^{m} w_i g_i + \varepsilon$
Adjusted: $Y = \alpha + \beta \sum_{i=1}^{m} w_i g_i + Z\gamma + \varepsilon$, γ ~ N(0, 2σ²K)

Q-Q Plots of –log10(P) under the Null
Panels:
• Li & Leal's collapsing test, ignoring family structure: inflation of type-1 error
• Li & Leal's collapsing test, modeling family structure via GLMM: inflation is corrected
(From Zhang et al, 2011, BMC Proc.)

Application 2: Permuting Family Data
MMPT: Mixed Model-based Permutation Test
Adjusting relatedness in family data for data-driven permutation test of rare variants.
$Y = \alpha + \beta \sum_{i=1}^{m} w_i g_i + Z\gamma + \varepsilon$, γ ~ N(0, 2σ²K)
Annotations on the equation: Permuted; Non-permuted, subject IDs fixed

Q-Q Plots under the Null
Permutation test, ignoring family structure: inflation of type-1 error
Panels: WSS | aSum | PWST | SPWST
(From Zhang et al, 2011, IGES Meeting)

Q-Q Plots under the Null
Mixed model-based permutation test (MMPT), modeling family structure: inflation corrected
Panels: WSS | aSum | PWST | SPWST
(From Zhang et al, 2011, IGES Meeting)

Burden Test vs. Non-burden Test
Burden test: $Y = \alpha + \beta \left( \sum_{i=1}^{k} w_i x_i \right) + \varepsilon$, $H_0: \beta = 0$
Non-burden test: $Y = \alpha + \sum_{i=1}^{k} \beta_i x_i + \dots$, $H_0: \beta_i = 0 \; (\beta_1 = \beta_2 = \dots = \beta_k = 0)$
T-test, Likelihood Ratio Test, F-test, score test, …
SKAT: sequence kernel association test

SKAT: sequence kernel association test
$Y = \alpha + \sum_{i=1}^{k} \beta_i x_i + \varepsilon$, $H_0: \beta_i = 0 \; (\beta_1 = \beta_2 = \dots = \beta_k = 0)$

Extension of SKAT to Family Data
Equation labels: kinship matrix; polygenic heritability of the trait; residual
Han Chen et al., 2012, Genetic Epidemiology

Other problems
• Missing genotypes & imputation
• Genotyping errors & QC (family consistency, sequence review)
• Population stratification
• Inherited variants and de novo mutation
• Family data & linkage information
• Variant validation and association validation
• Public databases
• And more …
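
Worked example: collapsing score vs. PWST score
The 10-subject PWST table earlier in these slides can be replayed in a few lines of Python. This is only a minimal sketch under stated assumptions: the function names (`weighted_sum`, `collapse`, `pwst_weights`, `pearson`) are illustrative and do not come from the slides, the left-tail p-values are copied from the table rather than recomputed from a t-test, and the Pearson correlation is used purely as a quick descriptive check, not as the test statistic of either method.

```python
from math import sqrt

# Genotypes (rows = subjects 1-10, columns = V1..V6) and trait values,
# copied from the PWST example table in the slides.
GENOTYPES = [
    [1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 1],
]
TRAIT = [3.00, 3.10, 1.95, 2.00, 2.05, 2.10, 2.00, 2.10, 0.95, 1.00]

# Left-tail p-values p(x<=t) for V1..V6, as printed in the table
# (assumption: taken as given, not recomputed here).
LEFT_TAIL_P = [0.93, 0.95, 0.49, 0.54, 0.05, 0.06]


def weighted_sum(genotypes, weights):
    """Return s = sum_i w_i * g_i for every subject."""
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]


def collapse(genotypes):
    """Collapsing score: w_i = 1 for all variants, and s is capped at 1."""
    unit_weights = [1] * len(genotypes[0])
    return [min(s, 1) for s in weighted_sum(genotypes, unit_weights)]


def pwst_weights(left_tail_p):
    """Rescale left-tail p-values from [0, 1] to [-1, 1]: w = 2 * (p - 0.5)."""
    return [2 * (p - 0.5) for p in left_tail_p]


def pearson(x, y):
    """Plain Pearson correlation (descriptive check only)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


collapsed = collapse(GENOTYPES)
psum = weighted_sum(GENOTYPES, pwst_weights(LEFT_TAIL_P))

print("collapsed:", collapsed)
print("pSum:     ", [round(s, 2) for s in psum])
print("r(collapsed, trait) = %.3f" % pearson(collapsed, TRAIT))
print("r(pSum, trait)      = %.3f" % pearson(psum, TRAIT))
```

On this toy data the collapsed score is essentially uncorrelated with the trait, because carriers of (+) and (-) variants cancel each other out, while the signed pSum score tracks the trait closely. That matches the qualitative message of the PWST slides; it says nothing about type I error, which the slides address separately with permutation.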