A flexible likelihood framework for detecting associations with secondary phenotypes in genetic studies using selected samples: application to sequence data

Supplemental Material

Dajiang J. Liu1,2 & Suzanne M. Leal1,2,*
*To whom correspondence should be addressed
1. Details for applying the MTA model to case-control, extreme-trait and multiple-trait study designs:

1.) Case-control study

In the example of a case-control study, $Y_{1i}$ represents the case-control status $A_i$, and $Y_{2i}$ represents the continuous trait $T_i$. It is assumed that $N_A^{CC}$ cases and $N_U^{CC}$ controls are sequenced.

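Conditional on the primary phenotype, the sampling mechanism is independent of the genotype and secondary phenotypes. Therefore,

$$\Pr(Z_i = 1 \mid Y_{1i}, Y_{2i}, X_i, \{W_{ki}\}_k) = \Pr(Z_i = 1 \mid Y_{1i}) \tag{A1}$$

According to formulas (4) and (A1), the probability $\Pr(Y_{1i}, Y_{2i} \mid X_i, Z_i = 1)$ equals

$$
\begin{aligned}
\Pr(Y_{1i}, Y_{2i} \mid X_i, Z_i = 1)
&= \frac{\Pr(Y_{1i}, Y_{2i}, Z_i = 1 \mid X_i)}{\Pr(Z_i = 1 \mid X_i)} \\
&= \frac{\Pr(Z_i = 1 \mid Y_{1i})\,\Pr(Y_{1i}, Y_{2i} \mid X_i)}
{\Pr(Z_i = 1 \mid y_{1i} = 1)\,\Pr(y_{1i} = 1 \mid X_i) + \Pr(Z_i = 1 \mid y_{1i} = 0)\,\Pr(y_{1i} = 0 \mid X_i)}
\end{aligned}
\tag{A2}
$$

Since cases and controls are random samples from the pools of affected and unaffected individuals respectively, the sampling probabilities must satisfy

$$\frac{\Pr(Z_i = 1 \mid Y_{1i} = 1)}{\Pr(Z_i = 1 \mid Y_{1i} = 0)} = \frac{N_A^{CC}\,\Pr(Y_{1i} = 0)}{N_U^{CC}\,\Pr(Y_{1i} = 1)} \tag{A3}$$

Combining equations (A2) and (A3), the likelihood for individual $i$ reduces to

$$
\Pr(Y_{1i}, Y_{2i} \mid X_i, Z_i = 1) =
\begin{cases}
\dfrac{\Pr(Y_{1i}, Y_{2i} \mid X_i)}
{\Pr(Y_{1i} = 1 \mid X_i) + \Pr(Y_{1i} = 0 \mid X_i)\,\dfrac{N_U^{CC}\,\Pr(Y_{1i} = 1)}{N_A^{CC}\,\Pr(Y_{1i} = 0)}}
& \text{if } Y_{1i} = 1 \\[3ex]
\dfrac{\Pr(Y_{1i}, Y_{2i} \mid X_i)}
{\Pr(Y_{1i} = 0 \mid X_i) + \Pr(Y_{1i} = 1 \mid X_i)\,\dfrac{N_A^{CC}\,\Pr(Y_{1i} = 0)}{N_U^{CC}\,\Pr(Y_{1i} = 1)}}
& \text{if } Y_{1i} = 0
\end{cases}
\tag{A4}
$$

where $\Pr(Y_{1i} = 1)$ and $\Pr(Y_{1i} = 0)$ are the population frequencies of affected and unaffected individuals that enter through the sampling ratio (A3).
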
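As an illustration of how (A2) and (A3) combine, the following Python sketch evaluates the ascertainment-adjusted likelihood contribution of a single individual. It is a minimal sketch, not the authors' implementation: the function name `case_control_likelihood` and its arguments are hypothetical, and the joint probability $\Pr(Y_{1i}, Y_{2i} \mid X_i)$, the conditional case probability $\Pr(Y_{1i} = 1 \mid X_i)$ and the population prevalence are assumed to be supplied by the caller.

```python
def case_control_likelihood(joint, p_case_given_x, y1, n_cases, n_controls, prevalence):
    """Return Pr(Y1, Y2 | X, Z=1) for one individual in a case-control sample.

    joint           -- Pr(Y1, Y2 | X) at the observed phenotypes (assumed given)
    p_case_given_x  -- Pr(Y1 = 1 | X) under the phenotype model (assumed given)
    prevalence      -- population Pr(Y1 = 1), taken from external data
    """
    p_control_given_x = 1.0 - p_case_given_x
    # Sampling ratio Pr(Z=1 | Y1=1) / Pr(Z=1 | Y1=0), as in (A3).
    ratio = (n_cases * (1.0 - prevalence)) / (n_controls * prevalence)
    if y1 == 1:
        denominator = p_case_given_x + p_control_given_x / ratio
    else:
        denominator = p_control_given_x + p_case_given_x * ratio
    return joint / denominator
```

When the numbers of cases and controls are in the same proportion as affected and unaffected individuals in the population, the sampling ratio is 1 and the adjusted likelihood coincides with the unadjusted one.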
2.) Extreme-trait study

In an extreme-trait study, $Y_{1i}$ represents the primary trait $B_i$, and $Y_{2i}$ represents the secondary trait $T_i$. Two cutoffs are set, i.e. $y_1^{ub}$ and $y_1^{lb}$. A total of $N^{ET}$ individuals with trait $B$ values above $y_1^{ub}$ or below $y_1^{lb}$ are selected and sequenced. Therefore

$$\Pr(Z_i = 1 \mid Y_{1i} > y_1^{ub} \text{ or } Y_{1i} < y_1^{lb},\, Y_{2i}) \propto \frac{N^{ET}}{\Pr(Y_{1i} > y_1^{ub}) + \Pr(Y_{1i} < y_1^{lb})} \tag{A5}$$

The following likelihood can be obtained for the extreme-trait study design:

$$
\begin{aligned}
\Pr(Y_{1i}, Y_{2i} \mid X_i, Z_i = 1)
&= \frac{\Pr(Y_{1i}, Y_{2i} \mid X_i)\,\Pr(Z_i = 1 \mid Y_{1i}, Y_{2i}, X_i)}
{\displaystyle\iint_{y_{1i} > y_1^{ub}} \Pr(Z_i = 1 \mid y_{1i})\,\Pr(y_{1i}, y_{2i} \mid X_i)\,dy_{1i}\,dy_{2i}
+ \iint_{y_{1i} < y_1^{lb}} \Pr(Z_i = 1 \mid y_{1i})\,\Pr(y_{1i}, y_{2i} \mid X_i)\,dy_{1i}\,dy_{2i}} \\[2ex]
&= \begin{cases}
\dfrac{\Pr(Y_{1i}, Y_{2i} \mid X_i)}{\Pr(Y_{1i} > y_1^{ub} \mid X_i) + \Pr(Y_{1i} < y_1^{lb} \mid X_i)}
& \text{if } Y_{1i} > y_1^{ub} \text{ or } Y_{1i} < y_1^{lb} \\[2ex]
0 & \text{if } y_1^{lb} \le Y_{1i} \le y_1^{ub}
\end{cases}
\end{aligned}
\tag{A6}
$$

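The truncation in (A6) can be sketched numerically as follows. This is an illustration only: the function name and arguments are hypothetical, and the primary trait is assumed to be normally distributed given the genotype, with mean `mu1` and standard deviation `sigma1` provided by the caller.

```python
from statistics import NormalDist

def extreme_trait_likelihood(joint_density, y1, mu1, sigma1, lower, upper):
    """Return Pr(Y1, Y2 | X, Z=1) following (A6).

    joint_density is Pr(Y1, Y2 | X) evaluated at the observed traits;
    mu1 and sigma1 describe the assumed normal distribution of Y1 given X.
    """
    if lower <= y1 <= upper:
        # Individuals between the two cutoffs are never sequenced.
        return 0.0
    y1_dist = NormalDist(mu1, sigma1)
    # Pr(Y1 > upper | X) + Pr(Y1 < lower | X)
    tail_mass = (1.0 - y1_dist.cdf(upper)) + y1_dist.cdf(lower)
    return joint_density / tail_mass
```

Dividing by the tail mass rescales the joint density so that it integrates to one over the region from which individuals can actually be sampled.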
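3.) Multiple-trait study

The example considered in this manuscript is motivated by the study of diabetes in obese people. In this study, $Y_{1i}$ represents the binary primary trait $C_i$, and $Y_{2i}$ represents the continuous secondary trait $T_i$. The affection status is determined by the binary trait $C_i$. $N_A^{MT}$ affected individuals with trait $T$ greater than the cutoff $t^C$ (written $y_2^C$ in terms of $Y_{2i}$) and $N_U^{MT}$ unaffected individuals are sequenced. Similar to the case-control study, the sampling mechanism satisfies

$$\frac{\Pr(Z_i = 1 \mid Y_{1i} = 1, Y_{2i} > y_2^C)}{\Pr(Z_i = 1 \mid Y_{1i} = 0)} = \frac{N_A^{MT}\,\Pr(Y_{1i} = 0)}{N_U^{MT}\,\Pr(Y_{1i} = 1, Y_{2i} > y_2^C)} \tag{A7}$$

Following the same approach as in the case-control and extreme-trait studies, for the selection mechanism that involves both the primary and secondary phenotypes, the likelihood is given by

$$
\begin{aligned}
\Pr(Y_{1i}, Y_{2i} \mid X_i, Z_i = 1)
&= \frac{\Pr(Y_{1i}, Y_{2i} \mid X_i)\,\Pr(Z_i = 1 \mid Y_{1i}, Y_{2i}, X_i)}
{\displaystyle\int_{y_2^C}^{\infty} \Pr(Z_i = 1 \mid y_{1i} = 1, y_{2i})\,\Pr(y_{1i} = 1, y_{2i} \mid X_i)\,dy_{2i}
+ \int_{-\infty}^{\infty} \Pr(Z_i = 1 \mid y_{1i} = 0, y_{2i})\,\Pr(y_{1i} = 0, y_{2i} \mid X_i)\,dy_{2i}} \\[2ex]
&= \begin{cases}
\dfrac{\Pr(Y_{1i}, Y_{2i} \mid X_i)}
{\Pr(y_{1i} = 1, y_{2i} > y_2^C \mid X_i) + \Pr(y_{1i} = 0 \mid X_i)\,\dfrac{N_U^{MT}\,\Pr(Y_{1i} = 1, Y_{2i} > y_2^C)}{N_A^{MT}\,\Pr(Y_{1i} = 0)}}
& \text{if } Y_{1i} = 1,\ Y_{2i} > y_2^C \\[3ex]
\dfrac{\Pr(Y_{1i}, Y_{2i} \mid X_i)}
{\Pr(y_{1i} = 0 \mid X_i) + \Pr(y_{1i} = 1, y_{2i} > y_2^C \mid X_i)\,\dfrac{N_A^{MT}\,\Pr(Y_{1i} = 0)}{N_U^{MT}\,\Pr(Y_{1i} = 1, Y_{2i} > y_2^C)}}
& \text{if } Y_{1i} = 0 \\[3ex]
0 & \text{otherwise}
\end{cases}
\end{aligned}
\tag{A8}
$$

where the marginal probabilities $\Pr(Y_{1i} = 0)$ and $\Pr(Y_{1i} = 1, Y_{2i} > y_2^C)$ enter through the sampling ratio (A7).
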
2. Derivation of MTA model when probit link function is used

When a liability threshold model and a probit link function are used to model binary phenotypes, the multivariate generalized linear model can be simplified, i.e.

$$
\begin{aligned}
Y_{1i} &= \beta_{01} + \beta_{11} X_i + \sum_k \alpha_{k1} W_{ki} + \varepsilon_{1i} \\
Y_{2i} &= \beta_{02} + \beta_{12} X_i + \gamma_{Y_1} Y_{1i} + \sum_k \alpha_{k2} W_{ki} + \varepsilon_{2i}
\end{aligned}
\tag{B1}
$$

The distribution of residual errors is assumed to be bivariate normal, i.e.

$$
(\varepsilon_{1i}, \varepsilon_{2i}) \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix} \right)
\tag{B2}
$$

If the primary trait is continuous, the likelihood model is given by

$$L = \prod_{i=1}^{N} \Pr(Y_{1i}, Y_{2i} \mid X_i, Z_i = 1, \{W_{ki}\}_k) \tag{B3}$$

where $Z_i$ is an indicator of individual $i$ being sampled, and $N$ is the number of individuals in the sample. Conditional on locus genotypes and other covariates, the joint distribution for $(Y_{1i}, Y_{2i})$ is multivariate normal. It satisfies

$$\Pr(Y_{1i}, Y_{2i} \mid X_i, Z_i = 1, \{W_{ki}\}_k) = \Pr(Y_{2i} \mid X_i, Y_{1i}, Z_i = 1, \{W_{ki}\}_k)\,\Pr(Y_{1i} \mid X_i, Z_i = 1, \{W_{ki}\}_k) \tag{B4}$$

If the primary trait is binary, with $Y_{1i} = 1$ being affected and $Y_{1i} = 0$ being unaffected, a multivariate liability threshold model is used to model multiple phenotypes:

$$
\begin{aligned}
Y_{1i}^* &= \beta_{01} + \beta_{11} X_i + \sum_k \alpha_{k1} W_{ki} + \varepsilon_{1i} \\
Y_{2i} &= \beta_{02} + \beta_{12} X_i + \gamma_{Y_1} Y_{1i} + \sum_k \alpha_{k2} W_{ki} + \varepsilon_{2i}
\end{aligned}
\tag{B5}
$$

where $Y_{1i}^*$ is the liability trait for $Y_{1i}$. According to the liability threshold model, the binary disorder $Y_{1i}$ is related to its underlying liability trait $Y_{1i}^*$ according to

$$
Y_{1i} =
\begin{cases}
1 & Y_{1i}^* > y_1^C \\
0 & Y_{1i}^* \le y_1^C
\end{cases}
\tag{B6}
$$

The joint distribution is given by

$$L = \prod_{i=1}^{N} \Pr(Y_{1i}, Y_{2i} \mid X_i, Z_i = 1, \{W_{ki}\}_k) \tag{B7}$$

Each factor in (B7) satisfies

$$
\Pr(Y_{1i}, Y_{2i} \mid X_i, Z_i = 1, \{W_{ki}\}_k) =
\begin{cases}
\Pr(Y_{1i}^* > y_1^C, Y_{2i} \mid X_i, \{W_{ki}\}_k, Z_i = 1) & \text{if } Y_{1i} = 1 \\
\Pr(Y_{1i}^* \le y_1^C, Y_{2i} \mid X_i, \{W_{ki}\}_k, Z_i = 1) & \text{if } Y_{1i} = 0
\end{cases}
\tag{B8}
$$

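For intuition, when the ascertainment indicator $Z_i = 1$ is set aside, the factorization (B4) and the uncorrelated residuals in (B2) make each likelihood factor a product of two univariate terms. The Python sketch below evaluates that unadjusted joint probability for a continuous and for a binary primary trait. It is an illustration under stated assumptions, not the authors' code: the function and argument names are hypothetical, the covariates $W_{ki}$ are omitted, and for the binary case the liability is assumed to have zero intercept and unit variance.

```python
from statistics import NormalDist

def joint_continuous(y1, y2, x, b01, b11, b02, b12, gamma, sigma1, sigma2):
    """Pr(Y1, Y2 | X) = Pr(Y2 | X, Y1) * Pr(Y1 | X) for a continuous Y1."""
    density_y1 = NormalDist(b01 + b11 * x, sigma1).pdf(y1)
    density_y2 = NormalDist(b02 + b12 * x + gamma * y1, sigma2).pdf(y2)
    return density_y1 * density_y2

def joint_binary(y1, y2, x, b11, b02, b12, gamma, sigma2, threshold):
    """Pr(Y1, Y2 | X) when Y1 = 1 exactly when the liability exceeds threshold.

    The liability is assumed to be N(b11 * x, 1): zero intercept, unit variance.
    """
    p_affected = 1.0 - NormalDist(b11 * x, 1.0).cdf(threshold)
    p_y1 = p_affected if y1 == 1 else 1.0 - p_affected
    density_y2 = NormalDist(b02 + b12 * x + gamma * y1, sigma2).pdf(y2)
    return p_y1 * density_y2
```

The ascertainment-adjusted factors are then obtained by dividing these quantities by the design-specific normalizing terms.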
In order to make the liability threshold model parameters identifiable, the intercept $\beta_{01}$ has to be set to 0, and the variance parameter $\sigma_1^2$ also needs to be standardized, i.e. $\sigma_1^2 = 1$.

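As a small numerical illustration of these constraints: if, in addition, the genetic and covariate effects are small enough that the liability is approximately standard normal in the population (an assumption made here only for illustration), the liability threshold follows directly from the disease prevalence. The helper names below are hypothetical.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(0.0, 1.0)

def liability_threshold(prevalence):
    """Threshold c with Pr(liability > c) = prevalence under N(0, 1)."""
    return STANDARD_NORMAL.inv_cdf(1.0 - prevalence)

def prob_affected(x, beta11, threshold):
    """Probit probability Pr(Y1 = 1 | X) for a liability distributed N(beta11 * x, 1)."""
    return 1.0 - STANDARD_NORMAL.cdf(threshold - beta11 * x)
```

For a prevalence of 10%, `liability_threshold(0.10)` is approximately 1.2816, and a positive `beta11` raises the probability of being affected for carriers above 10%.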
The parameters relevant to the sampling mechanisms, such as disease prevalence, are estimated independently from other data sources (e.g. prospective or cross-sectional cohorts). The remaining genetic parameters are estimated by maximizing the likelihood function. The Nelder-Mead algorithm can be applied, since it does not require the calculation of analytic derivatives.

3. Population Genetics Simulation:

Following Boyko et al.21, a rigorous population genetic model incorporating demographic change and purifying selection was used to simulate the African variant data. A two-epoch model with two degrees of freedom was used, in which the population was constant with effective population size $N_{anc} = 7{,}778$, followed by an instant population expansion 6,809 generations ago to reach its current effective population size $N_{curr} = 25{,}636$. It has been shown that this simple demographic model provides a good fit to neutral variant frequency spectra. Selection was modeled using a Gamma distribution, which has been shown to be parsimonious and to provide a good fit to data21. The selective disadvantages of new heterozygous and homozygous mutations are assumed to be $s$ and $2s$. The distributions of fitness effects were estimated for the scaled selective disadvantage $\gamma = 2 N_{curr} s$. For Africans, the scaled selective disadvantage follows

$$\gamma = -x, \qquad x \sim \frac{x^{\alpha - 1} \exp(-x/\beta)}{\Gamma(\alpha)\,\beta^{\alpha}} \tag{C1}$$

where the parameters satisfy $\alpha = 0.184$, $\beta = 8{,}200$. The locus length is specified to be 1500 base pairs, which is the average length of a human gene coding region. A mutation rate of $\mu_S = 1.8 \times 10^{-8}$ per nucleotide site per generation was assumed.

One hundred sets of haplotype pools were generated. A haplotype pool is randomly chosen for each replicate. The multi-site genotype for an individual was obtained by pairing two randomly chosen haplotypes from the pool. Following Kryukov et al.7, only non-synonymous variants are used in the analysis in order to reduce the impact of non-causal variants and to increase the signal-to-noise ratio.

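Two of the sampling steps described in this section can be sketched in Python as follows. This is a toy illustration, not the simulation software used for the study: the function names are hypothetical, `BETA` is treated as the scale parameter of the Gamma distribution (which matches Python's `random.gammavariate`), haplotypes are assumed to be drawn with replacement, and the haplotype pool is assumed to be a list of equal-length 0/1 sequences in which 1 marks the variant allele.

```python
import random

N_CURR = 25636      # current effective population size
ALPHA = 0.184       # Gamma shape parameter
BETA = 8200.0       # Gamma scale parameter (assumed parametrization)

def draw_selection_coefficient(rng):
    """Draw x ~ Gamma(ALPHA, scale=BETA) and return the magnitude s = x / (2 * N_CURR)."""
    x = rng.gammavariate(ALPHA, BETA)
    return x / (2.0 * N_CURR)

def draw_genotype(haplotype_pool, rng):
    """Pair two randomly chosen haplotypes into a multi-site genotype (0, 1 or 2 per site)."""
    hap_a = rng.choice(haplotype_pool)
    hap_b = rng.choice(haplotype_pool)
    return [a + b for a, b in zip(hap_a, hap_b)]

# Usage with a made-up pool of two haplotypes over three sites.
rng = random.Random(1)
toy_pool = [[0, 1, 0], [1, 1, 0]]
toy_genotype = draw_genotype(toy_pool, rng)
```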
4. Simulation of Phenotypes:

Similarly to the case-control study, for the extreme-trait study, the phenotypes $(B_i, T_i)$ follow a bivariate normal distribution $\mathrm{MVN}(\mu_i^{ET}, \Sigma^{ET})$, with

$$
\mu_i^{ET} = \Big( \tilde\beta_B \sum_{s \in CV_B} x_i^s,\ \tilde\beta_T \sum_{s \in CV_T} x_i^s \Big),
\quad \text{and} \quad
\Sigma^{ET} = \begin{pmatrix} \sigma_B^2 & \rho_{B,T}\,\sigma_B \sigma_T \\ \rho_{B,T}\,\sigma_B \sigma_T & \sigma_T^2 \end{pmatrix}
\tag{D1}
$$

The distribution for the augmented traits $(C_i^*, T_i)$ in the multiple-trait study is also assumed to be bivariate normal, $\mathrm{MVN}(\mu_i^{MT}, \Sigma^{MT})$, with

$$
\mu_i^{MT} = \Big( \tilde\beta_{C^*} \sum_{s \in CV_{C^*}} x_i^s,\ \tilde\beta_T \sum_{s \in CV_T} x_i^s \Big),
\quad \text{and} \quad
\Sigma^{MT} = \begin{pmatrix} \sigma_{C^*}^2 & \rho_{C^*,T}\,\sigma_{C^*} \sigma_T \\ \rho_{C^*,T}\,\sigma_{C^*} \sigma_T & \sigma_T^2 \end{pmatrix}
\tag{D2}
$$

The causative variant sites $CV_B$ and $CV_{C^*}$ are defined similarly to $CV_{A^*}$.

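Drawing a phenotype pair from a bivariate normal distribution of the form (D1) or (D2) can be sketched as below. The sketch is illustrative only: the function names and arguments are hypothetical, the two standard normal draws are combined through the Cholesky factor of the 2x2 covariance matrix, and each mean is taken to be an effect size multiplied by the individual's genotype sum over the causative sites.

```python
import math
import random

def bivariate_normal_from_standard(z1, z2, mean1, mean2, sd1, sd2, rho):
    """Map two independent N(0, 1) draws to a correlated pair.

    Uses the Cholesky factor of [[sd1^2, rho*sd1*sd2], [rho*sd1*sd2, sd2^2]].
    """
    first = mean1 + sd1 * z1
    second = mean2 + sd2 * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    return first, second

def simulate_phenotypes(genotype_sum_first, genotype_sum_second,
                        effect_first, effect_second, sd1, sd2, rho, rng):
    """Simulate one (primary, secondary) trait pair for a single individual."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    return bivariate_normal_from_standard(
        z1, z2,
        effect_first * genotype_sum_first,
        effect_second * genotype_sum_second,
        sd1, sd2, rho)
```

For the multiple-trait design, the simulated liability would additionally be dichotomized at the liability threshold to obtain the binary trait.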
5. Evaluation of Type I Error When Ascertainment Is Not Properly Adjusted

There can be substantial biases if the ascertainment mechanisms are not properly modeled. In order to illustrate this, we analyzed the simulated data under

$$L = \prod_{i=1}^{N} \Pr(Y_{1i}, Y_{2i} \mid X_i, \{W_{ki}\}_k) \tag{E1}$$

Model (E1) does not take into account the non-random ascertainment mechanisms. The primary (liability) trait $Y_{1i}$ (or $Y_{1i}^*$) and the quantitative secondary trait $Y_{2i}$ in selected samples are assumed to follow a bivariate normal distribution. The model is clearly incorrect when samples are ascertained on the primary trait (the case-control study and the extreme-trait study) or on both the primary and secondary traits (the multiple-trait study). Association testing under model (E1) was carried out using score statistics. Under the null hypothesis, the resulting p-values are no longer uniformly distributed, and there can be serious biases in all three study designs. The results are shown in Supplemental Figure 1.

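Departures of null p-values from uniformity can be summarized by the empirical type I error rate and by the points of a quantile-quantile plot. The Python sketch below shows one common way to compute both. The function names are hypothetical, and the expected quantiles use the plotting positions $k/(n+1)$, which is an assumption rather than a detail taken from the study.

```python
import math

def empirical_type1_error(p_values, alpha=0.05):
    """Fraction of null replicates with a p-value at or below alpha."""
    return sum(p <= alpha for p in p_values) / len(p_values)

def qq_points(p_values):
    """Return (expected, observed) pairs of -log10 p-values for a QQ plot."""
    observed = sorted(p_values)
    n = len(observed)
    return [(-math.log10(k / (n + 1)), -math.log10(p))
            for k, p in enumerate(observed, start=1)]
```

Under a correctly specified model the points lie close to the diagonal and the empirical type I error rate is close to the nominal level; systematic deviation indicates bias.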
Supplemental Figure 1: Quantile-quantile plots of p-values under the null hypothesis in case-control (panel a), extreme-trait (panel b) and multiple-trait (panel c) studies. It is assumed that the disease prevalence (10%) is correctly specified. The simulated data were analyzed using model (E1), which does not take into account the non-random ascertainment mechanisms. Scenarios with different combinations of primary trait genetic effects $\tilde\beta_{A^*}$ ($\tilde\beta_B$ and $\tilde\beta_{C^*}$) and residual correlations $\rho_{A^*,T}$ ($\rho_{B,T}$, $\rho_{C^*,T}$) were investigated. Results are shown where neither the primary nor the secondary trait is associated with the gene region (dashed red and blue lines) and where only the primary but not the secondary trait is associated with the gene region (solid green and brown lines). The results were obtained using 10,000 replicates.

Supplemental Table 1: Power to detect associations with trait T when the trait is analyzed as a primary trait using randomly ascertained population samples.

Sample Size    Power^a
1,000          0.516
2,000          0.666
3,000          0.736

a Power was empirically estimated using 5,000 replicates under a significance level of $\alpha = 0.05$.

Supplemental Table 2: Power to detect associations with quantitative trait T when extreme sampling is performed using a cohort of 5,000 individuals.

Number of Individuals     Number of Individuals     Upper Threshold   Lower Threshold   Power^a
from the Upper Extreme    from the Lower Extreme    Percentile        Percentile
100                       100                       98th              2nd               0.566
300                       300                       94th              6th               0.706
500                       500                       90th              10th              0.754

a Power was empirically estimated using 5,000 replicates under a significance level of $\alpha = 0.05$.

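The threshold percentiles in Supplemental Table 2 follow from the number of individuals taken from each tail of the 5,000-person cohort. A short Python check of that arithmetic, using a hypothetical function name:

```python
def threshold_percentiles(n_upper, n_lower, cohort_size):
    """Percentile cutoffs that select n_upper and n_lower individuals from the two tails."""
    upper = 100.0 * (1.0 - n_upper / cohort_size)
    lower = 100.0 * n_lower / cohort_size
    return upper, lower

# Selecting 100 individuals from each tail of a cohort of 5,000
# corresponds to the 98th and 2nd percentiles.
upper_cutoff, lower_cutoff = threshold_percentiles(100, 100, 5000)
```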
Supplemental Table 3: Summary statistics and results for the analyses of eight phenotypes as primary traits using the Dallas Heart Study dataset.

Gene     Trait   Analysis of Entire  Analysis of Individuals         Carrier frequency in  Carrier frequency in
                 Sample (p-value)    with Extreme Trait^a (p-value)  the Upper Extreme     the Lower Extreme
ANGPTL3  BMI     0.924               0.731                           0.012                 0.012
         DiasBP  0.898               0.985                           0.015                 0.014
         SysBP   1                   0.998                           0.014                 0.014
         TCL     0.253               0.397                           0.011                 0.014
         LDL     0.974               0.961                           0.013                 0.013
         HDL     0.733               0.631                           0.014                 0.013
         TG      0.076               0.061                           0.01                  0.015
         Gluc    0.64                0.566                           0.014                 0.015
ANGPTL4  BMI     0.504               0.296                           0.016                 0.017
         DiasBP  0.608               0.493                           0.015                 0.017
         SysBP   0.679               0.754                           0.019                 0.018
         TCL     0.311               0.467                           0.019                 0.017
         LDL     0.179               0.347                           0.017                 0.017
         HDL     0.068               0.018*                          0.021                 0.016
         TG      0.005*              0.002*                          0.013                 0.021
         Gluc    0.541               0.66                            0.021                 0.019
ANGPTL5  BMI     0.003*              0.028*                          0.02                  0.012
         DiasBP  0.564               0.940                           0.02                  0.018
         SysBP   0.842               0.899                           0.022                 0.02
         TCL     0.355               0.113                           0.019                 0.017
         LDL     0.600               0.438                           0.021                 0.019
         HDL     0.024*              0.019*                          0.021                 0.016
         TG      0.894               0.873                           0.018                 0.018
         Gluc    0.665               0.575                           0.021                 0.02
ANGPTL6  BMI     0.022*              0.014*                          0.01                  0.008
         DiasBP  0.11                0.158                           0.014                 0.009
         SysBP   0.487               0.589                           0.008                 0.012
         TCL     0.479               0.385                           0.011                 0.009
         LDL     0.628               0.498                           0.011                 0.01
         HDL     0.431               0.462                           0.012                 0.009
         TG      0.978               0.982                           0.01                  0.009
         Gluc    0.205               0.197                           0.012                 0.008

a The primary trait was analyzed using individuals with extreme trait values in the upper and lower quartiles.

Supplemental Table 4: Phenotypic correlations between the eight phenotypes using all subjects in the Dallas Heart Study.

        BMI     DiasBP  SysBP   TCL     LDL     HDL     TG      Gluc
BMI     1.000   0.255   0.017   0.066   0.109   -0.273  0.227   0.232
DiasBP  0.255   1.000   0.181   0.140   0.111   -0.081  0.201   0.210
SysBP   0.017   0.181   1.000   0.014   -0.003  -0.057  0.102   0.049
TCL     0.066   0.140   0.014   1.000   0.890   0.102   0.373   0.065
LDL     0.109   0.111   -0.003  0.890   1.000   -0.137  0.197   0.058
HDL     -0.273  -0.081  -0.057  0.102   -0.137  1.000   -0.374  -0.129
TG      0.227   0.201   0.102   0.373   0.197   -0.374  1.000   0.191
Gluc    0.232   0.210   0.049   0.065   0.058   -0.129  0.191   1.000