A flexible likelihood framework for detecting associations with secondary phenotypes in genetic studies using selected samples: application to sequence data

Supplemental Material

Dajiang J. Liu^{1,2} & Suzanne M. Leal^{1,2,*}
*: To whom correspondence should be addressed

1. Details for applying the MTA model to case-control, extreme-trait and multiple-trait study designs:

1.) Case-control study

In the example of a case-control study, $Y_{1i}$ represents the case-control status $A_i$, and $Y_{2i}$ represents the continuous trait $T_i$. It is assumed that $N_A^{CC}$ cases and $N_U^{CC}$ controls are sequenced.

Conditional on the primary phenotype, the sampling mechanism is independent of the genotype and secondary phenotypes. Therefore,

$$\Pr(Z_i = 1 \mid Y_{1i}, Y_{2i}, X_i, \{W_{ki}\}_k) = \Pr(Z_i = 1 \mid Y_{1i}) \tag{A1}$$

According to formulas (4) and (A1), the probability $\Pr(Y_{1i}, Y_{2i} \mid X_i, Z_i = 1)$ equals

$$\Pr(Y_{1i}, Y_{2i} \mid X_i, Z_i = 1) = \frac{\Pr(Y_{1i}, Y_{2i}, Z_i = 1 \mid X_i)}{\Pr(Z_i = 1 \mid X_i)} = \frac{\Pr(Z_i = 1 \mid Y_{1i}) \Pr(Y_{1i}, Y_{2i} \mid X_i)}{\Pr(Z_i = 1 \mid y_{1i} = 1) \Pr(y_{1i} = 1 \mid X_i) + \Pr(Z_i = 1 \mid y_{1i} = 0) \Pr(y_{1i} = 0 \mid X_i)} \tag{A2}$$

Since cases and controls are random samples from the pools of affected and unaffected individuals respectively, the sampling probabilities must satisfy

$$\frac{\Pr(Z_i = 1 \mid Y_{1i} = 1)}{\Pr(Z_i = 1 \mid Y_{1i} = 0)} = \frac{N_A^{CC} \Pr(Y_{1i} = 0)}{N_U^{CC} \Pr(Y_{1i} = 1)} \tag{A3}$$

Combining equations (A2) and (A3), that is, dividing the numerator and denominator of (A2) by the sampling probability of the observed primary phenotype, the likelihood for individual $i$ reduces to

$$\Pr(Y_{1i}, Y_{2i} \mid X_i, Z_i = 1) = \begin{cases} \dfrac{\Pr(Y_{1i}, Y_{2i} \mid X_i)}{\Pr(Y_{1i} = 1 \mid X_i) + \Pr(Y_{1i} = 0 \mid X_i) \dfrac{N_U^{CC} \Pr(Y_{1i} = 1)}{N_A^{CC} \Pr(Y_{1i} = 0)}} & \text{if } Y_{1i} = 1 \\[3ex] \dfrac{\Pr(Y_{1i}, Y_{2i} \mid X_i)}{\Pr(Y_{1i} = 0 \mid X_i) + \Pr(Y_{1i} = 1 \mid X_i) \dfrac{N_A^{CC} \Pr(Y_{1i} = 0)}{N_U^{CC} \Pr(Y_{1i} = 1)}} & \text{if } Y_{1i} = 0 \end{cases} \tag{A4}$$

2.) Extreme-trait study

In an extreme-trait study, $Y_{1i}$ represents the primary trait $B_i$, and $Y_{2i}$ represents the secondary trait $T_i$. Two cutoffs are set, i.e. an upper bound $y_1^{ub}$ and a lower bound $y_1^{lb}$. A number of $N^{ET}$ individuals with trait $B$ values beyond these cutoffs (above $y_1^{ub}$ or below $y_1^{lb}$) are selected and sequenced.
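Therefore

$$\Pr(Z_i = 1 \mid Y_{1i} \ge y_1^{ub} \text{ or } Y_{1i} \le y_1^{lb}, Y_{2i}) \propto \frac{N^{ET}}{\Pr(Y_{1i} \ge y_1^{ub}) + \Pr(Y_{1i} \le y_1^{lb})} \tag{A5}$$

The following likelihood can be obtained for the extreme-trait study design:

$$\begin{aligned} \Pr(Y_{1i}, Y_{2i} \mid X_i, Z_i = 1) &= \frac{\Pr(Y_{1i}, Y_{2i} \mid X_i) \Pr(Z_i = 1 \mid Y_{1i}, Y_{2i}, X_i)}{\displaystyle \iint_{y_{1i} \ge y_1^{ub}} \Pr(Z_i = 1 \mid y_{1i}) \Pr(y_{1i}, y_{2i} \mid X_i)\, dy_{1i}\, dy_{2i} + \iint_{y_{1i} \le y_1^{lb}} \Pr(Z_i = 1 \mid y_{1i}) \Pr(y_{1i}, y_{2i} \mid X_i)\, dy_{1i}\, dy_{2i}} \\[2ex] &= \begin{cases} \dfrac{\Pr(Y_{1i}, Y_{2i} \mid X_i)}{\Pr(Y_{1i} \ge y_1^{ub} \mid X_i) + \Pr(Y_{1i} \le y_1^{lb} \mid X_i)} & \text{if } Y_{1i} \ge y_1^{ub} \text{ or } Y_{1i} \le y_1^{lb} \\[2ex] 0 & \text{if } y_1^{lb} < Y_{1i} < y_1^{ub} \end{cases} \end{aligned} \tag{A6}$$

3.) Multiple-trait study

The example considered in this manuscript is motivated by the study of diabetes in obese people. In this study, $Y_{1i}$ represents the binary primary trait $C_i$, and $Y_{2i}$ represents the continuous secondary trait $T_i$. The affection status is determined by the binary trait $C_i$. $N_A^{MT}$ affected individuals with trait $T$ greater than the cutoff $t^C$ (written $y_2^C$ in the equations) and $N_U^{MT}$ unaffected individuals are sequenced. Similar to the case-control study, the sampling mechanism satisfies

$$\frac{\Pr(Z_i = 1 \mid Y_{1i} = 1, Y_{2i} \ge y_2^C)}{\Pr(Z_i = 1 \mid Y_{1i} = 0)} = \frac{N_A^{MT} \Pr(Y_{1i} = 0)}{N_U^{MT} \Pr(Y_{1i} = 1, Y_{2i} \ge y_2^C)} \tag{A7}$$

Following the same approach as in case-control and extreme-trait studies, for the selection mechanism that involves both the primary and secondary phenotypes, the likelihood is given by

$$\begin{aligned} \Pr(Y_{1i}, Y_{2i} \mid X_i, Z_i = 1) &= \frac{\Pr(Y_{1i}, Y_{2i} \mid X_i) \Pr(Z_i = 1 \mid Y_{1i}, Y_{2i}, X_i)}{\displaystyle \int_{y_{2i} \ge y_2^C} \Pr(Z_i = 1 \mid y_{1i} = 1, y_{2i}) \Pr(y_{1i} = 1, y_{2i} \mid X_i)\, dy_{2i} + \int \Pr(Z_i = 1 \mid y_{1i} = 0, y_{2i}) \Pr(y_{1i} = 0, y_{2i} \mid X_i)\, dy_{2i}} \\[2ex] &= \begin{cases} \dfrac{\Pr(Y_{1i}, Y_{2i} \mid X_i)}{\Pr(y_{1i} = 1, y_{2i} \ge y_2^C \mid X_i) + \Pr(y_{1i} = 0 \mid X_i) \dfrac{N_U^{MT} \Pr(y_{1i} = 1, y_{2i} \ge y_2^C)}{N_A^{MT} \Pr(y_{1i} = 0)}} & \text{if } Y_{1i} = 1,\ Y_{2i} \ge y_2^C \\[3ex] \dfrac{\Pr(Y_{1i}, Y_{2i} \mid X_i)}{\Pr(y_{1i} = 0 \mid X_i) + \Pr(y_{1i} = 1, y_{2i} \ge y_2^C \mid X_i) \dfrac{N_A^{MT} \Pr(y_{1i} = 0)}{N_U^{MT} \Pr(y_{1i} = 1, y_{2i} \ge y_2^C)}} & \text{if } Y_{1i} = 0 \\[3ex] 0 & \text{otherwise} \end{cases} \end{aligned} \tag{A8}$$

2. Derivation of MTA model when probit link function is used

When a liability threshold model and a probit link function are used to model binary phenotypes, the multivariate generalized linear model can be simplified, i.e.

$$\begin{aligned} Y_{1i} &= \beta_{01} + \beta_{11} X_i + \sum_k \eta_{k1} W_{ki} + \varepsilon_{1i} \\ Y_{2i} &= \beta_{02} + \beta_{12} X_i + \beta_{Y_1} Y_{1i} + \sum_k \eta_{k2} W_{ki} + \varepsilon_{2i} \end{aligned} \tag{B1}$$

The distribution of residual errors is assumed to be bivariate normal, i.e.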
$$(\varepsilon_{1i}, \varepsilon_{2i}) \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix} \right) \tag{B2}$$

If the primary trait is continuous, the likelihood model is given by

$$L(\boldsymbol{\beta}, \boldsymbol{\Sigma}; X, W) = \prod_{i=1}^{N} \Pr(Y_{1i}, Y_{2i} \mid X_i, Z_i = 1, \{W_{ki}\}_k) \tag{B3}$$

where $Z_i$ is an indicator of individual $i$ being sampled, and $N$ is the number of individuals in the sample. Conditional on locus genotypes and other covariates, the joint distribution for $(Y_{1i}, Y_{2i})$ is multivariate normal. It satisfies

$$\Pr(Y_{1i}, Y_{2i} \mid X_i, Z_i = 1, \{W_{ki}\}_k) = \Pr(Y_{2i} \mid X_i, Y_{1i}, Z_i = 1, \{W_{ki}\}_k) \Pr(Y_{1i} \mid X_i, Z_i = 1, \{W_{ki}\}_k) \tag{B4}$$

If the primary trait is binary with $Y_{1i} = 1$ being affected, and $Y_{1i} = 0$ being unaffected, a multivariate liability threshold model is used to model multiple phenotypes:

$$\begin{aligned} Y_{1i}^* &= \beta_{01} + \beta_{11} X_i + \sum_k \eta_{k1} W_{ki} + \varepsilon_{1i} \\ Y_{2i} &= \beta_{02} + \beta_{12} X_i + \beta_{Y_1} Y_{1i} + \sum_k \eta_{k2} W_{ki} + \varepsilon_{2i} \end{aligned} \tag{B5}$$

where $Y_{1i}^*$ is the liability trait for $Y_{1i}$.

According to the liability threshold model, the binary disorder $Y_{1i}$ is related to its underlying liability trait $Y_{1i}^*$ according to

$$Y_{1i} = \begin{cases} 1 & Y_{1i}^* \ge y_1^C \\ 0 & Y_{1i}^* < y_1^C \end{cases} \tag{B6}$$

The likelihood is given by

$$L(\boldsymbol{\beta}, \boldsymbol{\Sigma}; X, W) = \prod_{i=1}^{N} \Pr(Y_{1i}, Y_{2i} \mid X_i, Z_i = 1, \{W_{ki}\}_k) \tag{B7}$$

Each factor in (B7) satisfies

$$\Pr(Y_{1i}, Y_{2i} \mid X_i, Z_i = 1, \{W_{ki}\}_k) = \begin{cases} \Pr(Y_{1i}^* \ge y_1^C, Y_{2i} \mid X_i, \{W_{ki}\}_k, Z_i = 1) & \text{if } Y_{1i} = 1 \\ \Pr(Y_{1i}^* < y_1^C, Y_{2i} \mid X_i, \{W_{ki}\}_k, Z_i = 1) & \text{if } Y_{1i} = 0 \end{cases} \tag{B8}$$

In order to make the liability threshold model parameters identifiable, the intercept $\beta_{01}$ has to be set to 0, and the variance parameter $\sigma_1^2$ also needs to be standardized, i.e. $\sigma_1^2 = 1$.

The parameters relevant to sampling mechanisms, such as disease prevalence, are estimated independently from other data sources (e.g. prospective or cross-sectional cohorts). The remaining genetic parameters are estimated by maximizing the likelihood function. The Nelder-Mead algorithm can be applied, since it does not require analytic derivatives.

3. Population Genetics Simulation:

According to Boyko et al. [21], a rigorous population genetic model incorporating demographic change and purifying selection was used to simulate the African variant data.
A two-epoch model with two degrees of freedom was used, where the population was constant with effective population size $N_{anc} = 7{,}778$, followed by an instant population expansion 6,809 generations ago to reach its current effective population size $N_{curr} = 25{,}636$. It has been shown that this simple demographic model provides a good fit to neutral variant frequency spectra. Selection was modeled using a Gamma distribution, which has been shown to be parsimonious and to provide a good fit to data [21]. The selective disadvantages of new heterozygous and homozygous mutations are assumed to be $s$ and $2s$. The distributions for fitness effects were estimated for the scaled selective disadvantage $\gamma = 2 N_{curr} s$. For Africans, the scaled selective disadvantage follows

$$\gamma \sim f(x; \alpha, \beta), \qquad f(x; \alpha, \beta) = \frac{x^{\alpha - 1} \exp(-x / \beta)}{\Gamma(\alpha)\, \beta^{\alpha}} \tag{C1}$$

where the shape and scale parameters satisfy $\alpha = 0.184$, $\beta = 8{,}200$. The locus length is specified to be 1500 base pairs, which is the average length of a human gene coding region. A mutation rate of $\mu_S = 1.8 \times 10^{-8}$ per nucleotide site per generation was assumed.

One hundred sets of haplotype pools were generated. A haplotype pool is randomly chosen for each replicate. The multi-site genotype for an individual was obtained by pairing two randomly chosen haplotypes from the pool. Following Kryukov et al. [7], only non-synonymous variants are used in the analysis in order to reduce the impact of non-causal variants and to increase the signal-to-noise ratio.

4.
Simulation of Phenotypes:

Similarly to the case-control study, for the extreme-trait study, the phenotypes $(B_i, T_i)$ follow a bivariate normal distribution $\mathrm{MVN}(\mu_i^{ET}, \Sigma^{ET})$, with

$$\mu_i^{ET} = \left( \tilde{\beta}_B \sum_{s \in CV_B} x_i^s,\ \tilde{\beta}_T \sum_{s \in CV_T} x_i^s \right), \quad \text{and} \quad \Sigma^{ET} = \begin{pmatrix} \sigma_B^2 & \rho_{B,T}\, \sigma_B \sigma_T \\ \rho_{B,T}\, \sigma_B \sigma_T & \sigma_T^2 \end{pmatrix} \tag{D1}$$

The distribution for the augmented traits $(C_i^*, T_i)$ in the multiple-trait study is also assumed to be bivariate normal $\mathrm{MVN}(\mu_i^{MT}, \Sigma^{MT})$, with

$$\mu_i^{MT} = \left( \tilde{\beta}_{C^*} \sum_{s \in CV_{C^*}} x_i^s,\ \tilde{\beta}_T \sum_{s \in CV_T} x_i^s \right), \quad \text{and} \quad \Sigma^{MT} = \begin{pmatrix} \sigma_{C^*}^2 & \rho_{C^*,T}\, \sigma_{C^*} \sigma_T \\ \rho_{C^*,T}\, \sigma_{C^*} \sigma_T & \sigma_T^2 \end{pmatrix} \tag{D2}$$

The causative variant sites $CV_B$ and $CV_{C^*}$ are defined in the same way as $CV_{A^*}$.

5. Evaluation of Type I Error When Ascertainment Is Not Properly Adjusted

There can be substantial biases if the ascertainment mechanisms are not properly modeled. In order to illustrate this, we analyzed the simulated data under

$$L(\boldsymbol{\beta}, \boldsymbol{\Sigma}; X, W) = \prod_{i=1}^{N} \Pr(Y_{1i}, Y_{2i} \mid X_i, \{W_{ki}\}_k) \tag{E1}$$

Model (E1) does not take into account the non-random ascertainment mechanisms. The primary (liability) trait $Y_{1i}$ (or $Y_{1i}^*$) and the quantitative secondary trait $Y_{2i}$ in selected samples are assumed to follow a bivariate normal distribution. The model is clearly incorrect when samples are ascertained on the primary trait (the case-control study and the extreme-trait study) or on both the primary and secondary traits (the multiple-trait study). Association testing under model (E1) was carried out using score statistics. P-values under the null hypothesis are no longer uniformly distributed, and there can be serious biases in all three study designs. The results are shown in Supplemental Figure 1.

Supplemental Figure 1: Quantile-quantile plot of p-values under the null hypothesis in case-control (panel a), extreme-trait (panel b) and multiple-trait (panel c) studies. It is assumed that the disease prevalence (10%) is correctly specified. The simulated data were analyzed using model (E1), which does not take into account the non-random ascertainment mechanisms.
Scenarios with different combinations of primary trait genetic effects $\tilde{\beta}_{A^*}$ ($\tilde{\beta}_B$ and $\tilde{\beta}_{C^*}$) and residual correlations $\rho_{A^*,T}$ ($\rho_{B,T}$, $\rho_{C^*,T}$) were investigated. Results are shown where neither the primary nor the secondary trait is associated with the gene region (dashed red and blue lines) and where only the primary but not the secondary trait is associated with the gene region (solid green and brown lines). The results were obtained using 10,000 replicates.

Supplemental Table 1: Power to detect associations with trait T when the trait is analyzed as a primary trait using randomly ascertained population samples.

| Sample Size | Power^a |
|---|---|
| 1,000 | 0.516 |
| 2,000 | 0.666 |
| 3,000 | 0.736 |

a Power was empirically estimated using 5,000 replicates under a significance level of $\alpha = 0.05$.

Supplemental Table 2: Power to detect associations with quantitative trait T when extreme sampling is performed using a cohort of 5,000 individuals.

| Number of Individuals from the Upper Extreme | Number of Individuals from the Lower Extreme | Upper Threshold Percentile | Lower Threshold Percentile | Power^a |
|---|---|---|---|---|
| 100 | 100 | 98th | 2nd | 0.566 |
| 300 | 300 | 94th | 6th | 0.706 |
| 500 | 500 | 90th | 10th | 0.754 |

a Power was empirically estimated using 5,000 replicates under a significance level of $\alpha = 0.05$.

Supplemental Table 3: Summary statistics and results for the analyses of eight phenotypes as primary traits using the Dallas Heart Study dataset.
| Gene | Trait | Analysis of Individuals with Extreme Trait^a (p-value) | Analysis of Entire Sample (p-value) | Carrier frequency in the Upper Extreme | Carrier frequency in the Lower Extreme |
|---|---|---|---|---|---|
| ANGPTL3 | BMI | 0.924 | 0.731 | 0.012 | 0.012 |
| | DiasBP | 0.898 | 0.985 | 0.015 | 0.014 |
| | SysBP | 1 | 0.998 | 0.014 | 0.014 |
| | TCL | 0.253 | 0.397 | 0.011 | 0.014 |
| | LDL | 0.974 | 0.961 | 0.013 | 0.013 |
| | HDL | 0.733 | 0.631 | 0.014 | 0.013 |
| | TG | 0.076 | 0.061 | 0.01 | 0.015 |
| | Gluc | 0.64 | 0.566 | 0.014 | 0.015 |
| ANGPTL4 | BMI | 0.504 | 0.296 | 0.016 | 0.017 |
| | DiasBP | 0.608 | 0.493 | 0.015 | 0.017 |
| | SysBP | 0.679 | 0.754 | 0.019 | 0.018 |
| | TCL | 0.311 | 0.467 | 0.019 | 0.017 |
| | LDL | 0.179 | 0.347 | 0.017 | 0.017 |
| | HDL | 0.068 | 0.018* | 0.021 | 0.016 |
| | TG | 0.005* | 0.002* | 0.013 | 0.021 |
| | Gluc | 0.541 | 0.66 | 0.021 | 0.019 |
| ANGPTL5 | BMI | 0.003* | 0.028* | 0.02 | 0.012 |
| | DiasBP | 0.564 | 0.940 | 0.02 | 0.018 |
| | SysBP | 0.842 | 0.899 | 0.022 | 0.02 |
| | TCL | 0.355 | 0.113 | 0.019 | 0.017 |
| | LDL | 0.600 | 0.438 | 0.021 | 0.019 |
| | HDL | 0.024* | 0.019* | 0.021 | 0.016 |
| | TG | 0.894 | 0.873 | 0.018 | 0.018 |
| | Gluc | 0.665 | 0.575 | 0.021 | 0.02 |
| ANGPTL6 | BMI | 0.022* | 0.014* | 0.01 | 0.008 |
| | DiasBP | 0.11 | 0.158 | 0.014 | 0.009 |
| | SysBP | 0.487 | 0.589 | 0.008 | 0.012 |
| | TCL | 0.479 | 0.385 | 0.011 | 0.009 |
| | LDL | 0.628 | 0.498 | 0.011 | 0.01 |
| | HDL | 0.431 | 0.462 | 0.012 | 0.009 |
| | TG | 0.978 | 0.982 | 0.01 | 0.009 |
| | Gluc | 0.205 | 0.197 | 0.012 | 0.008 |

a The primary trait was analyzed using individuals with extreme trait values in the upper and lower quartiles.

Supplemental Table 4: Phenotypic correlations between the eight phenotypes using all subjects in the Dallas Heart Study.

| | BMI | DiasBP | SysBP | TCL | LDL | HDL | TG | Gluc |
|---|---|---|---|---|---|---|---|---|
| BMI | 1.000 | 0.255 | 0.017 | 0.066 | 0.109 | -0.273 | 0.227 | 0.232 |
| DiasBP | 0.255 | 1.000 | 0.181 | 0.140 | 0.111 | -0.081 | 0.201 | 0.210 |
| SysBP | 0.017 | 0.181 | 1.000 | 0.014 | -0.003 | -0.057 | 0.102 | 0.049 |
| TCL | 0.066 | 0.140 | 0.014 | 1.000 | 0.890 | 0.102 | 0.373 | 0.065 |
| LDL | 0.109 | 0.111 | -0.003 | 0.890 | 1.000 | -0.137 | 0.197 | 0.058 |
| HDL | -0.273 | -0.081 | -0.057 | 0.102 | -0.137 | 1.000 | -0.374 | -0.129 |
| TG | 0.227 | 0.201 | 0.102 | 0.373 | 0.197 | -0.374 | 1.000 | 0.191 |
| Gluc | 0.232 | 0.210 | 0.049 | 0.065 | 0.058 | -0.129 | 0.191 | 1.000 |
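
As an illustration of the extreme-trait likelihood (A6) from Section 1, the following Python sketch evaluates the contribution of a single individual when both traits are continuous and follow the model of (B1) and (B2). It is a minimal sketch under stated assumptions, not the implementation used for the reported results: covariates are omitted, the function names, the parameter dictionary keys (`b01`, `b11`, `b02`, `b12`, `bY1`, `sd1`, `sd2`) and all numerical values are hypothetical, and only the Python standard library is used.

```python
from math import exp, pi, sqrt
from statistics import NormalDist


def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) evaluated at y."""
    z = (y - mean) / sd
    return exp(-0.5 * z * z) / (sd * sqrt(2.0 * pi))


def joint_density(y1, y2, x, p):
    """Unconditional density Pr(Y1, Y2 | X) under (B1)-(B2), covariates omitted.

    Factorised as Pr(Y2 | Y1, X) * Pr(Y1 | X), with independent normal residuals.
    """
    mu1 = p["b01"] + p["b11"] * x
    mu2 = p["b02"] + p["b12"] * x + p["bY1"] * y1
    return normal_pdf(y1, mu1, p["sd1"]) * normal_pdf(y2, mu2, p["sd2"])


def extreme_trait_likelihood(y1, y2, x, p, lb, ub):
    """Likelihood contribution (A6) of one individual in an extreme-trait sample."""
    if lb < y1 < ub:
        return 0.0  # individuals between the cutoffs are never sampled
    mu1 = p["b01"] + p["b11"] * x
    marginal = NormalDist(mu1, p["sd1"])
    # Pr(Y1 >= ub | X) + Pr(Y1 <= lb | X): probability of falling in either tail
    selected = (1.0 - marginal.cdf(ub)) + marginal.cdf(lb)
    return joint_density(y1, y2, x, p) / selected


# Hypothetical parameter values, chosen only for illustration.
params = {"b01": 0.0, "b11": 0.3, "b02": 0.0, "b12": 0.2,
          "bY1": 0.5, "sd1": 1.0, "sd2": 1.0}

print(extreme_trait_likelihood(1.5, 0.4, 1, params, lb=-1.0, ub=1.0))
print(extreme_trait_likelihood(0.2, 0.4, 1, params, lb=-1.0, ub=1.0))  # 0.0
```

Summing the logarithm of this quantity over sampled individuals gives a log-likelihood that could be passed to a derivative-free optimizer such as Nelder-Mead, in the spirit of Section 2. Because the cutoffs enter only through the tail probabilities of the primary trait, the sample size $N^{ET}$ cancels and does not appear in the code.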