Text S1 - Figshare

Text S1 Mathematical derivation of GenoCMI and GameteCMI metrics

Conditional mutual information based on genotype (GenoCMI)

For convenience, consider two unlinked diallelic loci, G and H, with locus G having alleles a and A and locus H having alleles b and B. Let alleles A and B denote the risk alleles of loci G and H, respectively. Let i and j (i, j = 0, 1 or 2) represent the genotypes of loci G and H, respectively, where 0, 1 and 2 denote the number of risk alleles in a genotype (that is, wild-type homozygote, heterozygote and mutant homozygote, respectively). Let D denote the disease status of each individual, where D = 1 (D = 0) indicates affected (unaffected). The conditional mutual information based on genotype, denoted GenoCMI, is defined as follows:

$$
\begin{aligned}
\mathrm{GenoCMI} &= \sum_{d}\sum_{i}\sum_{j} P(G=i, H=j, D=d)\,\log\frac{P(G=i, H=j \mid D=d)}{P(G=i \mid D=d)\,P(H=j \mid D=d)} \\
&= \sum_{i}\sum_{j}\left[ P_A\,P_{ij}^{A}\,\log\frac{P_{ij}^{A}}{P_{i\cdot}^{A}\,P_{\cdot j}^{A}} + P_N\,P_{ij}^{N}\,\log\frac{P_{ij}^{N}}{P_{i\cdot}^{N}\,P_{\cdot j}^{N}} \right]
\end{aligned}
\tag{1}
$$

where $P_{ij}^{A}$ ($P_{ij}^{N}$), $P_{i\cdot}^{A}$ ($P_{i\cdot}^{N}$) and $P_{\cdot j}^{A}$ ($P_{\cdot j}^{N}$) indicate the joint genotype frequency of loci G and H, the genotype frequency of locus G, and the genotype frequency of locus H in the affected (unaffected) population, respectively (e.g. $P_{11}^{A}$, $P_{1\cdot}^{A}$ and $P_{\cdot 1}^{A}$ indicate the frequencies of AaBb, Aa and Bb in the affected population, respectively). $P_A$ ($P_N$) is the frequency of affected (unaffected) individuals in the combined population (so $P_N = 1 - P_A$). In a case-control study, it is the proportion of cases (controls), and in a cohort study, it is the estimated prevalence (1 - prevalence) in the general population.

Let $f_{ij}$, $f_{i\cdot}$ and $f_{\cdot j}$ be the penetrances of joint genotype ij of loci G and H, genotype i of locus G, and genotype j of locus H, respectively.
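As a rough illustration of equation (1), the sketch below evaluates GenoCMI from two 3×3 tables of joint genotype frequencies, one for affected and one for unaffected individuals. The function name `geno_cmi`, its argument names, the use of natural logarithms and the convention 0·log 0 = 0 are assumptions made here for illustration; the derivation itself does not fix a log base or prescribe an implementation.

```python
import math

def geno_cmi(p_case, p_ctrl, p_a):
    """Evaluate equation (1) from joint genotype frequencies.

    p_case, p_ctrl: 3x3 nested lists of joint genotype frequencies
        (rows: genotype of locus G, columns: genotype of locus H) in
        affected and unaffected individuals; each table sums to 1.
    p_a: weight of the affected group (P_A); unaffected weight is 1 - p_a.
    """
    total = 0.0
    for weight, table in ((p_a, p_case), (1.0 - p_a, p_ctrl)):
        row = [sum(r) for r in table]        # marginal of locus G in this group
        col = [sum(c) for c in zip(*table)]  # marginal of locus H in this group
        for i in range(3):
            for j in range(3):
                p = table[i][j]
                if p > 0.0:                  # assumed convention: 0 * log 0 = 0
                    total += weight * p * math.log(p / (row[i] * col[j]))
    return total
```

When the two loci are independent within both groups (each table is the outer product of its margins), every log term vanishes and the function returns a value numerically indistinguishable from zero, as expected for a mutual information.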
The penetrance is defined as the conditional probability of being affected given the genotype:

$$
f_{ij} = P(D=1 \mid G=i, H=j), \qquad
f_{i\cdot} = P(D=1 \mid G=i) = \sum_{j} \frac{P_{ij}}{P_{i\cdot}}\, f_{ij}, \qquad
f_{\cdot j} = P(D=1 \mid H=j) = \sum_{i} \frac{P_{ij}}{P_{\cdot j}}\, f_{ij}
$$

Let $P_{ij}$, $P_{i\cdot}$ and $P_{\cdot j}$ denote P(G=i, H=j), P(G=i) and P(H=j) in the general population, respectively, and let K denote the population prevalence. The genotype frequency distributions of loci G and H in cases and controls are given in the following tables.

Cases

$$
\begin{array}{c|ccc|c}
\text{locus G} \backslash \text{locus H} & 0\,(bb) & 1\,(Bb) & 2\,(BB) & \text{margin} \\ \hline
0\,(aa) & P_{00} f_{00}/K & P_{01} f_{01}/K & P_{02} f_{02}/K & P_{0\cdot} f_{0\cdot}/K \\
1\,(Aa) & P_{10} f_{10}/K & P_{11} f_{11}/K & P_{12} f_{12}/K & P_{1\cdot} f_{1\cdot}/K \\
2\,(AA) & P_{20} f_{20}/K & P_{21} f_{21}/K & P_{22} f_{22}/K & P_{2\cdot} f_{2\cdot}/K \\ \hline
\text{margin} & P_{\cdot 0} f_{\cdot 0}/K & P_{\cdot 1} f_{\cdot 1}/K & P_{\cdot 2} f_{\cdot 2}/K & 1
\end{array}
$$

Controls

$$
\begin{array}{c|ccc|c}
\text{locus G} \backslash \text{locus H} & 0\,(bb) & 1\,(Bb) & 2\,(BB) & \text{margin} \\ \hline
0\,(aa) & \frac{P_{00}(1-f_{00})}{1-K} & \frac{P_{01}(1-f_{01})}{1-K} & \frac{P_{02}(1-f_{02})}{1-K} & \frac{P_{0\cdot}(1-f_{0\cdot})}{1-K} \\
1\,(Aa) & \frac{P_{10}(1-f_{10})}{1-K} & \frac{P_{11}(1-f_{11})}{1-K} & \frac{P_{12}(1-f_{12})}{1-K} & \frac{P_{1\cdot}(1-f_{1\cdot})}{1-K} \\
2\,(AA) & \frac{P_{20}(1-f_{20})}{1-K} & \frac{P_{21}(1-f_{21})}{1-K} & \frac{P_{22}(1-f_{22})}{1-K} & \frac{P_{2\cdot}(1-f_{2\cdot})}{1-K} \\ \hline
\text{margin} & \frac{P_{\cdot 0}(1-f_{\cdot 0})}{1-K} & \frac{P_{\cdot 1}(1-f_{\cdot 1})}{1-K} & \frac{P_{\cdot 2}(1-f_{\cdot 2})}{1-K} & 1
\end{array}
$$

Hence, the parameters in equation (1) can be expressed as

$$
P_{ij}^{A} = \frac{P_{ij} f_{ij}}{K}, \qquad
P_{i\cdot}^{A} = \frac{P_{i\cdot} f_{i\cdot}}{K}, \qquad
P_{\cdot j}^{A} = \frac{P_{\cdot j} f_{\cdot j}}{K}
$$

and

$$
P_{ij}^{N} = \frac{P_{ij}(1-f_{ij})}{1-K}, \qquad
P_{i\cdot}^{N} = \frac{P_{i\cdot}(1-f_{i\cdot})}{1-K}, \qquad
P_{\cdot j}^{N} = \frac{P_{\cdot j}(1-f_{\cdot j})}{1-K}
$$

Hence,

$$
\log\frac{P_{ij}^{A}}{P_{i\cdot}^{A} P_{\cdot j}^{A}} = \Delta_{ij}^{A} + \log\frac{P_{ij}}{P_{i\cdot} P_{\cdot j}}, \qquad
\log\frac{P_{ij}^{N}}{P_{i\cdot}^{N} P_{\cdot j}^{N}} = \Delta_{ij}^{N} + \log\frac{P_{ij}}{P_{i\cdot} P_{\cdot j}}
$$

where

$$
\Delta_{ij}^{A} = \log\frac{f_{ij}/K}{(f_{i\cdot}/K)\,(f_{\cdot j}/K)}, \qquad
\Delta_{ij}^{N} = \log\frac{(1-f_{ij})/(1-K)}{[(1-f_{i\cdot})/(1-K)]\,[(1-f_{\cdot j})/(1-K)]}
$$

Then, equation (1) can be expressed as

$$
\begin{aligned}
\mathrm{GenoCMI} &= \sum_{i}\sum_{j} P_{ij}\left\{ P_A \frac{f_{ij}}{K}\,\Delta_{ij}^{A} + (1-P_A)\frac{1-f_{ij}}{1-K}\,\Delta_{ij}^{N} + \left[ P_A \frac{f_{ij}}{K} + (1-P_A)\frac{1-f_{ij}}{1-K} \right] \log\frac{P_{ij}}{P_{i\cdot} P_{\cdot j}} \right\} \\
&= \sum_{i}\sum_{j} P_{ij}\left\{ \frac{(1-P_A)(f_{ij}-1)}{1-K}\,\bigl(\Delta_{ij}^{A}-\Delta_{ij}^{N}\bigr) + \left[ \frac{(f_{ij}-K)(P_A-K)}{K(1-K)} + 1 \right] \Delta_{ij}^{A} \right\} \\
&\quad + \sum_{i}\sum_{j} P_{ij}\left[ \frac{(f_{ij}-K)(P_A-K)}{K(1-K)} + 1 \right] \log\frac{P_{ij}}{P_{i\cdot} P_{\cdot j}}
\end{aligned}
\tag{2}
$$

The second equality uses the identity $P_A f_{ij}/K + (1-P_A)(1-f_{ij})/(1-K) = (f_{ij}-K)(P_A-K)/[K(1-K)] + 1$.

In fact, if we define the relative risk (RR) of a genotype or genotype combination as the ratio between the penetrance of the given genotype and the population prevalence (also called the reference or baseline penetrance), $\Delta_{ij}^{A} = \log[RR_{ij}/(RR_{i\cdot}\,RR_{\cdot j})]$ can be regarded as an information measure for the interaction between the two loci [1]. It measures the deviation, on the log scale, of the risk of the joint genotype from the additive contribution of the marginal genotype risks (usually called main effects). Departure of $\Delta_{ij}^{A}$ from zero indicates the presence of interaction between loci G and H. Furthermore,

$$
\Delta_{ij}^{A} - \Delta_{ij}^{N}
= \log\frac{\dfrac{f_{ij}/(1-f_{ij})}{K/(1-K)}}{\dfrac{f_{i\cdot}/(1-f_{i\cdot})}{K/(1-K)}\cdot\dfrac{f_{\cdot j}/(1-f_{\cdot j})}{K/(1-K)}}
= \log\frac{OR_{ij}}{OR_{i\cdot}\,OR_{\cdot j}}
$$

where $\log[OR_{ij}/(OR_{i\cdot}\,OR_{\cdot j})]$ has been shown to be a measure of interaction between the two loci G and H [2, 3].
Hence, equation (2) can be reduced to

$$
\begin{aligned}
\mathrm{GenoCMI} &= \sum_{i}\sum_{j} P_{ij}\left\{ \frac{(1-P_A)(f_{ij}-1)}{1-K}\,\log\frac{OR_{ij}}{OR_{i\cdot}\,OR_{\cdot j}} + \left[ \frac{(f_{ij}-K)(P_A-K)}{K(1-K)} + 1 \right] \log\frac{RR_{ij}}{RR_{i\cdot}\,RR_{\cdot j}} \right\} \\
&\quad + \sum_{i}\sum_{j} P_{ij}\left[ \frac{(f_{ij}-K)(P_A-K)}{K(1-K)} + 1 \right] \log\frac{P_{ij}}{P_{i\cdot} P_{\cdot j}} \\
&\approx P_A \sum_{i}\sum_{j} P_{ij}\,RR_{ij}\,\log\frac{RR_{ij}}{RR_{i\cdot}\,RR_{\cdot j}}
+ \sum_{i}\sum_{j} P_{ij}\left[ \frac{(f_{ij}-K)(P_A-K)}{K(1-K)} + 1 \right] \log\frac{P_{ij}}{P_{i\cdot} P_{\cdot j}}
\end{aligned}
\tag{3}
$$

(using the approximation OR ≈ RR when the disease prevalence K is low, i.e., for a rare disease).

The second part of equation (3) is a function of $f_{ij}$ and $\log[P_{ij}/(P_{i\cdot} P_{\cdot j})]$. The latter is a measure of dependence between loci G and H in the general population, and hence can be viewed as a quantity describing the population linkage disequilibrium (LD). In the case of two unlinked loci, this part is approximately zero, and

$$
\mathrm{GenoCMI} \approx P_A \sum_{i}\sum_{j} P_{ij}\,RR_{ij}\,\log\frac{RR_{ij}}{RR_{i\cdot}\,RR_{\cdot j}}
\tag{4}
$$

In a cohort study where $P_A = K$, the bracketed factor in equation (3) equals 1, so equation (3) becomes

$$
\mathrm{GenoCMI} \approx K \sum_{i}\sum_{j} P_{ij}\,RR_{ij}\,\log\frac{RR_{ij}}{RR_{i\cdot}\,RR_{\cdot j}}
+ \sum_{i}\sum_{j} P_{ij}\,\log\frac{P_{ij}}{P_{i\cdot} P_{\cdot j}}
\tag{5}
$$

Conditional mutual information based on gametic disequilibrium (GameteCMI)

Similarly, we consider two unlinked diallelic loci, G and H. Let alleles A and B be the risk alleles of loci G and H, respectively. Let $h_{kl}$ be a gamete of loci G and H, where k and l (k, l = 0 or 1) indicate the carrier states of the risk alleles A and B in the gamete, respectively. The four possible gametes for the diallelic loci, a-b, a-B, A-b and A-B, are denoted by $h_{00}$, $h_{01}$, $h_{10}$ and $h_{11}$, respectively.
Similar to equation (1), the GameteCMI metric can be defined as:

$$
\mathrm{GameteCMI} = \sum_{d}\sum_{k}\sum_{l} P(h_{kl}, d)\,\log\frac{P(h_{kl} \mid d)}{P(G=k \mid d)\,P(H=l \mid d)}
\tag{6}
$$

where P(G=k|d) and P(H=l|d) are the frequencies of alleles k and l under disease status d, respectively. Following the same steps as in the derivation of equations (2) and (3), equation (6) can be approximately decomposed into two components,

$$
\mathrm{GameteCMI} \approx P_A \sum_{k}\sum_{l} q_{kl}\,RR_{kl}^{h}\,\log\frac{RR_{kl}^{h}}{RR_{k\cdot}^{h}\,RR_{\cdot l}^{h}}
+ \sum_{k}\sum_{l} q_{kl}\left[ \frac{(RR_{kl}^{h}-1)(P_A-K)}{1-K} + 1 \right] \log\frac{q_{kl}}{q_{k\cdot}\,q_{\cdot l}}
\tag{7}
$$

where $q_{kl}$, $q_{k\cdot}$ and $q_{\cdot l}$ are the frequencies of gamete $h_{kl}$ and of alleles k and l in the general population, respectively, and $RR_{kl}^{h}$, $RR_{k\cdot}^{h}$ and $RR_{\cdot l}^{h}$ are the relative risks of gamete $h_{kl}$ and of alleles k and l, respectively. Unlike in GenoCMI, the logarithm in the second term of equation (7) explicitly describes the LD status between loci G and H in the general population. For unlinked loci or loci without linkage disequilibrium, the second term equals zero, because the disequilibrium coefficient $D = q_{kl} - q_{k\cdot}\,q_{\cdot l} = 0$ in this case. Consequently, equation (7) reduces to

$$
\mathrm{GameteCMI} \approx P_A \sum_{k}\sum_{l} q_{kl}\,RR_{kl}^{h}\,\log\frac{RR_{kl}^{h}}{RR_{k\cdot}^{h}\,RR_{\cdot l}^{h}}
\tag{8}
$$

In a cohort study where $P_A = K$, equation (7) becomes

$$
\mathrm{GameteCMI} \approx K \sum_{k}\sum_{l} q_{kl}\,RR_{kl}^{h}\,\log\frac{RR_{kl}^{h}}{RR_{k\cdot}^{h}\,RR_{\cdot l}^{h}}
+ \sum_{k}\sum_{l} q_{kl}\,\log\frac{q_{kl}}{q_{k\cdot}\,q_{\cdot l}}
\tag{9}
$$

References

1. Wu X, Jin L, Xiong M. Mutual information for testing gene-environment interaction. PLoS One. 2009;4(2):e4578.
2. Wu X, Dong H, Luo L, et al. A novel statistic for genome-wide interaction analysis. PLoS Genet. 2010;6(9).
3. Ueki M, Cordell HJ. Improved statistics for genome-wide interaction analysis. PLoS Genet. 2012;8(4):e1002625.
