Data supplement 11

Supplementary Material
Simple Linear Regression for a Quantitative Trait
Consider a quantitative trait locus with two alleles A and a having frequencies $P_A$ and $P_a$,
respectively. Denote the genotypic values of the genotypes AA, Aa and aa by $G_{11}$, $G_{12}$ and $G_{22}$,
respectively. The genetic additive effect is given by (Appendix A)
$$\alpha = P_A G_{11} + (P_a - P_A) G_{12} - P_a G_{22}.$$
Let $Y_i$ be the phenotypic value of the $i$-th individual. A simple linear regression model for a
quantitative trait is given by (Appendix A)
$$Y_i = \mu + X_i \alpha + \varepsilon_i,$$
where
$$X_i = \begin{cases} 2P_a, & AA \\ P_a - P_A, & Aa \\ -2P_A, & aa \end{cases} \qquad (1)$$
and the $\varepsilon_i$ are independent and identically distributed normal variables with zero mean and variance $\sigma_e^2$.
If we add the constant $-(P_a - P_A)$ to $X_i$, the indicator variable is transformed to
$$X_i = \begin{cases} 1, & AA \\ 0, & Aa \\ -1, & aa \end{cases} \qquad (2)$$
which is a widely used indicator variable in quantitative genetic analysis.
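The following minimal sketch (ours, not from the paper) illustrates the two codings; the function names and the example frequency $P_a = 0.3$ are illustrative assumptions.

# A minimal sketch (not from the paper) of the genotype codings in equations
# (1) and (2); the function names and p_a = 0.3 are illustrative assumptions.
def code_genotype(genotype, p_a):
    """Equation (1) indicator X_i for genotype 'AA', 'Aa' or 'aa'."""
    p_A = 1.0 - p_a
    return {"AA": 2 * p_a, "Aa": p_a - p_A, "aa": -2 * p_A}[genotype]

def code_genotype_shifted(genotype):
    """Equation (2) indicator, obtained by adding -(P_a - P_A) to X_i."""
    return {"AA": 1, "Aa": 0, "aa": -1}[genotype]

# With p_a = 0.3 the shift is -(0.3 - 0.7) = 0.4, so the coded values
# 0.6, -0.4, -1.4 become (up to floating point) 1, 0, -1:
print([code_genotype(g, 0.3) + 0.4 for g in ("AA", "Aa", "aa")])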
Consider a marker locus with two alleles M and m having frequencies $P_M$ and $P_m$,
respectively. Let $D$ be the coefficient of linkage disequilibrium (LD) between the marker and the
quantitative trait locus. The relationship between the phenotypic value $Y_i$ and the genotype at the
marker locus is given by
$$Y_i = \alpha_m + X_i^m \beta_m + \varepsilon_i, \qquad (3)$$
where the indicator variable $X_i^m$ is given by
$$X_i^m = \begin{cases} 2P_m, & MM \\ P_m - P_M, & Mm \\ -2P_M, & mm. \end{cases} \qquad (4)$$
Assuming the genetic model for the quantitative trait in equation (1), it can be shown (Appendix B) that
$$\hat{\beta}_m \xrightarrow{a.s.} \frac{D}{P_M P_m}\,\alpha.$$
Since $D = P_M P_m$ when the marker coincides with the trait locus, this implies that at the trait locus
$\hat{\beta}_m \xrightarrow{a.s.} \alpha$, and hence $\hat{\beta}_m$ is a consistent estimator of the additive effect.
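For example, with $P_M = P_m = 0.5$ and $D = 0.1$, the estimate at the marker converges to $(0.1/0.25)\,\alpha = 0.4\,\alpha$: incomplete LD attenuates the estimated additive effect.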
Assume that the marker is located at the genomic position t and the trait locus is at the
genomic position s. Let $D(t, s)$ be the LD coefficient between the marker and trait loci,
$\hat{\beta}(t)$ the estimated genetic additive effect at the marker locus, and $\alpha(s)$ the true
additive effect at the trait locus. Let $P_M(t)$ and $P_m(t)$ be the frequencies of the marker alleles M
and m at the genomic position t, respectively. Then $\hat{\beta}(t)$ converges almost surely:
$$\hat{\beta}(t) \xrightarrow{a.s.} \frac{D(t, s)}{P_M(t) P_m(t)}\,\alpha(s). \qquad (5)$$
Suppose that there are K trait loci located at the genomic positions $s_1, \ldots, s_K$, and that the j-th
trait locus has additive effect $\alpha(s_j)$. Then we have
$$\hat{\beta}(t) \xrightarrow{a.s.} \frac{1}{P_M(t) P_m(t)} \sum_{j=1}^{K} D(t, s_j)\,\alpha(s_j), \qquad (6)$$
where $D(t, s_j)$ is the LD coefficient between the marker at the genomic position t and
the trait locus at the genomic position $s_j$.
Multiple Linear Regression for a Quantitative Trait
Consider L marker loci located at the genomic positions $t_1, \ldots, t_L$. The
multiple linear regression model for a quantitative trait is given by
$$Y_i = \alpha_m + \sum_{l=1}^{L} X_{il} \beta_l + \varepsilon_i, \qquad (7)$$
where the indicator variables $X_{il}$ are defined as in equation (4). Let
$P_{M_l}$ and $P_{m_l}$ be the frequencies of the alleles $M_l$ and $m_l$ of the marker located at the genomic
position $t_l$, respectively, and let $D(t_l, t_j)$ be the LD coefficient between the
marker at the genomic position $t_l$ and the marker at the genomic position $t_j$. Assume that there
are K trait loci located at $s_1, \ldots, s_K$ as defined before, and let $D(t_j, s_k)$ be the LD coefficient
between the marker at the genomic position $t_j$ and the trait locus at the genomic position $s_k$. By an
argument similar to that in Appendix B, we have
$$\hat{\beta}_m \xrightarrow{a.s.} P_A^{-1} \Lambda, \qquad (8)$$
where ˆ m  [ˆ1 ,, ˆ L ]T ,
 2 PM 1 Pm1 D(t1 , t2 )
 D (t , t ) 2 P P
2 1
M 2 m2
PA  

 

 D(t L , t1 ) D(t L , t2 )
K
K
j 1
j 1
 D(t1 , t L ) 
 D(t1 , t L ) 


 

 2 PM L PmL 
  [2 D(t1 , s j )( s j ),,2 D(t L , s j )( s j )]T .
If we assume that all markers are in linkage equilibrium, then equation (8) reduces to
$$\hat{\beta}(t_l) \xrightarrow{a.s.} \frac{1}{P_{M_l} P_{m_l}} \sum_{j=1}^{K} D(t_l, s_j)\,\alpha(s_j),$$
which is exactly the same as equation (6). In other words, under the assumption of linkage
equilibrium among markers, the multiple linear regression can be decomposed into a number of
simple regressions.
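The following numerical sketch (ours; the allele frequencies, LD values and effect size are assumed toy values) evaluates the limit $P_A^{-1}\Lambda$ of equation (8) for two markers and one trait locus, and checks that it matches the simple-regression limits of equation (6) when the markers are in linkage equilibrium.

# A numerical sketch (not from the paper); pM, D_ts and alpha_s are assumed toy values.
import numpy as np

pM = np.array([0.4, 0.5])        # marker allele frequencies P_{M_l}
D_ts = np.array([0.05, 0.02])    # D(t_l, s): LD between each marker and the trait locus
alpha_s = 0.8                    # additive effect alpha(s) at the single trait locus

def beta_limit(D12):
    """Limit P_A^{-1} Lambda of equation (8) for a marker-marker LD coefficient D12."""
    P_A = np.array([[2 * pM[0] * (1 - pM[0]), 2 * D12],
                    [2 * D12, 2 * pM[1] * (1 - pM[1])]])
    Lam = 2 * D_ts * alpha_s     # Lambda with K = 1 trait locus
    return np.linalg.solve(P_A, Lam)

# Under linkage equilibrium, D(t_1, t_2) = 0 and each coordinate equals equation (6):
print(beta_limit(0.0))                    # multiple-regression limit
print(D_ts * alpha_s / (pM * (1 - pM)))   # simple-regression limits, equation (6)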
Substituting the above equation into equation (7) and taking the limit leads to the functional
linear model (10) that will be discussed in the next section.
B-Splines
Let the domain [0, 1] be subdivided into knot spans by a set of nondecreasing numbers
$0 = u_0 \le u_1 \le u_2 \le \cdots \le u_m = 1$. The $u_i$'s are called knots. The i-th B-spline
basis function of degree p, written $B_{i,p}(t)$, is defined recursively as follows (Ramsay and
Silverman 2005):
$$B_{i,0}(t) = \begin{cases} 1 & \text{if } u_i \le t < u_{i+1} \\ 0 & \text{otherwise} \end{cases}$$
$$B_{i,p}(t) = \frac{t - u_i}{u_{i+p} - u_i}\, B_{i,p-1}(t) + \frac{u_{i+p+1} - t}{u_{i+p+1} - u_{i+1}}\, B_{i+1,p-1}(t).$$
B-spline basis functions have two important features:
(1) The basis function $B_{i,p}(t)$ is non-zero only on the $p + 1$ knot spans
$[u_i, u_{i+1}), [u_{i+1}, u_{i+2}), \ldots, [u_{i+p}, u_{i+p+1})$.
(2) On any given knot span $[u_i, u_{i+1})$, at most $p + 1$ basis functions of degree p are non-zero,
namely $B_{i-p,p}(t), B_{i-p+1,p}(t), B_{i-p+2,p}(t), \ldots, B_{i-1,p}(t)$ and $B_{i,p}(t)$.
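The recursion above translates directly into code. The sketch below (ours, not the authors' implementation) evaluates $B_{i,p}(u)$ for a clamped cubic knot vector and checks that the non-zero basis functions on a span sum to one (partition of unity).

# A minimal sketch (not the authors' code) of the Cox-de Boor recursion above.
def bspline_basis(i, p, u, knots):
    """Evaluate B_{i,p}(u) for the knot vector `knots`, using the [u_i, u_{i+1}) convention."""
    if p == 0:
        return 1.0 if knots[i] <= u < knots[i + 1] else 0.0
    left = right = 0.0
    if knots[i + p] > knots[i]:          # skip zero-width spans (repeated knots)
        left = (u - knots[i]) / (knots[i + p] - knots[i]) \
               * bspline_basis(i, p - 1, u, knots)
    if knots[i + p + 1] > knots[i + 1]:
        right = (knots[i + p + 1] - u) / (knots[i + p + 1] - knots[i + 1]) \
                * bspline_basis(i + 1, p - 1, u, knots)
    return left + right

# Cubic case (p = 3) on [0, 1] with repeated boundary knots: at any interior u,
# the non-zero basis functions sum to one.
knots = [0, 0, 0, 0, 0.25, 0.5, 0.75, 1, 1, 1, 1]
print(sum(bspline_basis(i, 3, 0.4, knots) for i in range(len(knots) - 4)))  # -> 1.0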
Methods for Basis Function Selection
Basis function selection is carried out by LASSO regression. Consider a sequence of SNP positions
$t_1, \ldots, t_k$ with $t_0 \le t_1 \le \cdots \le t_k \le t_{k+1}$ and $t_{-3} = t_{-2} = t_{-1} = t_0$,
$t_{k+1} = t_{k+2} = t_{k+3} = t_{k+4}$. Let $B_k(u, t)$ be a cubic B-spline. Assume that we have the
observed data $\{(u_i, d_i)\}_{i=1}^{n}$. A regression model is given by
$$d_i = f(u_i) + \varepsilon_i, \quad i = 1, 2, \ldots, n. \qquad (9)$$
We approximate $f(u)$ by a B-spline expansion:
$$S(u) = \sum_{l=1}^{k+m} \beta_l B_l(u, t). \qquad (10)$$
The regression spline estimate $\hat{f}(u) = \sum_{l=1}^{k+m} \hat{\beta}_l B_l(u, t)$ is obtained by minimizing
$$F(\beta) = \frac{1}{2} \sum_{i=1}^{n} \Big[ d_i - \sum_{l=1}^{k+m} \beta_l B_l(u_i, t) \Big]^2 + \lambda \sum_{l=1}^{k+m} |\beta_l|, \qquad (11)$$
where $m = 4$. Let $d = [d_1, \ldots, d_n]^T$, $\beta = [\beta_1, \ldots, \beta_{k+m}]^T$, and
$$M = \begin{bmatrix} B_1(u_1, t) & \cdots & B_{k+m}(u_1, t) \\ \vdots & \ddots & \vdots \\ B_1(u_n, t) & \cdots & B_{k+m}(u_n, t) \end{bmatrix}.$$
Then equation (11) can be reduced to
$$F(\beta) = \frac{1}{2} (d - M\beta)^T (d - M\beta) + \lambda \sum_{l=1}^{k+m} |\beta_l|.$$
A point $\beta^*$ is a minimum of the function $F(\cdot)$ if and only if it satisfies [1]
$$0 \in \partial F(\beta^*). \qquad (12)$$
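In practice the matrix M can be assembled with standard spline software. The sketch below is ours; it assumes SciPy 1.8 or later for BSpline.design_matrix, and the knot and observation values are toy stand-ins.

# A minimal sketch (assumes SciPy >= 1.8); knots and observation points are toy values.
import numpy as np
from scipy.interpolate import BSpline

p = 3                                            # cubic B-splines, m = p + 1 = 4
interior = np.linspace(0.0, 1.0, 6)              # stand-ins for the SNP positions t_l
knots = np.r_[[0.0] * p, interior, [1.0] * p]    # repeat boundary knots as in the text
u = np.sort(np.random.default_rng(0).uniform(0.0, 1.0, 50))   # observation points u_i
M = BSpline.design_matrix(u, knots, p).toarray() # row i holds B_1(u_i, t), ..., B_{k+m}(u_i, t)
print(M.shape)                                   # (50, 8): len(knots) - p - 1 columns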
The optimization problem (12) can be rewritten as
$$F(\beta) = \frac{1}{2} \Big( d - \sum_{l=1, l \ne j}^{k+m} M_l \beta_l - M_j \beta_j \Big)^T \Big( d - \sum_{l=1, l \ne j}^{k+m} M_l \beta_l - M_j \beta_j \Big) + \lambda \sum_{l=1}^{k+m} |\beta_l|. \qquad (13)$$
Differentiating equation (13) with respect to $\beta_j$ and setting the result to zero, we have
$$-M_j^T r_j + M_j^T M_j \beta_j + \lambda s_j = 0,$$
where $r_j = d - \sum_{l=1, l \ne j}^{k+m} M_l \beta_l$, $M_j$ is the j-th column vector of the matrix M, and
$$s_j = \begin{cases} \operatorname{sign}(\beta_j) & \text{if } \beta_j \ne 0 \\ \in [-1, 1] & \text{if } \beta_j = 0. \end{cases}$$
Thus,
$$M_j^T M_j \beta_j = M_j^T r_j - \lambda s_j, \quad \text{or} \quad \beta_j = \frac{M_j^T r_j}{M_j^T M_j} - \frac{\lambda}{M_j^T M_j}\, s_j.$$
Therefore,
$$\hat{\beta}_j = \operatorname{sign}\Big( \frac{M_j^T r_j}{M_j^T M_j} \Big) \Big( \Big| \frac{M_j^T r_j}{M_j^T M_j} \Big| - \frac{\lambda}{M_j^T M_j} \Big)_+, \qquad (14)$$
where
$$(x)_+ = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{otherwise.} \end{cases}$$
Algorithm:
Step 1 (initialization): $\beta^{(0)} = (M^T M)^{-1} M^T d$.
Step 2: For $v = 1, 2, \ldots$
Step 3: For $j = 1, 2, \ldots, k + m$, compute
$$r_j = d - \sum_{l=1, l \ne j}^{k+m} M_l \tilde{\beta}_l, \qquad \beta_j^{(v)} = \operatorname{sign}\Big( \frac{M_j^T r_j}{M_j^T M_j} \Big) \Big( \Big| \frac{M_j^T r_j}{M_j^T M_j} \Big| - \frac{\lambda}{M_j^T M_j} \Big)_+,$$
where $\tilde{\beta}_l = \beta_l^{(v)}$ for $l < j$ (already updated in the current sweep) and $\tilde{\beta}_l = \beta_l^{(v-1)}$ for $l > j$.
End
If $\| \beta^{(v)} - \beta^{(v-1)} \|_2 > \varepsilon$, go to Step 2; otherwise stop (convergence).
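A self-contained sketch of this coordinate descent is given below (ours, not the authors' code); the design matrix in the usage example is random, standing in for the B-spline matrix M.

# A sketch (not the authors' code) of the coordinate-descent algorithm above:
# least-squares initialization, cyclic soft-threshold updates of equation (14),
# and stopping when successive iterates are within eps.
import numpy as np

def lasso_coordinate_descent(M, d, lam, eps=1e-8, max_iter=1000):
    n, q = M.shape                               # q = k + m basis functions
    beta = np.linalg.lstsq(M, d, rcond=None)[0]  # Step 1: least-squares start
    col_norms = (M ** 2).sum(axis=0)             # M_j^T M_j for each column
    for _ in range(max_iter):                    # Step 2
        beta_old = beta.copy()
        for j in range(q):                       # Step 3
            r_j = d - M @ beta + M[:, j] * beta[j]   # partial residual r_j
            z = M[:, j] @ r_j / col_norms[j]
            beta[j] = np.sign(z) * max(abs(z) - lam / col_norms[j], 0.0)  # eq. (14)
        if np.linalg.norm(beta - beta_old) < eps:
            break                                # convergence
    return beta

# Toy usage with a random design matrix standing in for the B-spline matrix M:
rng = np.random.default_rng(1)
M = rng.normal(size=(100, 10))
d = M @ np.r_[2.0, -1.5, np.zeros(8)] + 0.1 * rng.normal(size=100)
print(np.round(lasso_coordinate_descent(M, d, lam=5.0), 3))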
Appendix A: A statistical model for genetic effects
The statistical model for the three genotypic values can be expressed as
$$G_{11} = \mu + 2\alpha_1 + e_1$$
$$G_{12} = \mu + \alpha_1 + \alpha_2 + e_2$$
$$G_{22} = \mu + 2\alpha_2 + e_3, \qquad (A1)$$
where $e_i$ $(i = 1, 2, 3)$ are the respective deviations of the genotypic values from their expectations on
the basis of a perfect fit of the model. We then obtain estimators of $\mu$, $\alpha_1$ and $\alpha_2$ by minimizing
$$Q = P_A^2 (G_{11} - \mu - 2\alpha_1)^2 + 2 P_A P_a (G_{12} - \mu - \alpha_1 - \alpha_2)^2 + P_a^2 (G_{22} - \mu - 2\alpha_2)^2.$$
Setting
$$\frac{\partial Q}{\partial \mu} = 0, \quad \frac{\partial Q}{\partial \alpha_1} = 0 \quad \text{and} \quad \frac{\partial Q}{\partial \alpha_2} = 0,$$
and solving these equations, we obtain
$$\hat{\mu} = P_A^2 G_{11} + 2 P_A P_a G_{12} + P_a^2 G_{22},$$
$$\alpha_1 = P_a [P_A G_{11} + (P_a - P_A) G_{12} - P_a G_{22}],$$
$$\alpha_2 = -P_A [P_A G_{11} + (P_a - P_A) G_{12} - P_a G_{22}].$$
The substitution effect $\alpha$ is defined as
$$\alpha = \alpha_1 - \alpha_2 = P_A G_{11} + (P_a - P_A) G_{12} - P_a G_{22},$$
which is often termed the genetic additive effect.
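As a quick sanity check (ours, with assumed genotypic values), $\alpha$ equals the population least-squares slope of the genotypic values on the equation-(1) coding under Hardy-Weinberg genotype weights:

# A numerical sketch (not from the paper); the genotypic values are assumed toy values.
import numpy as np

P_A, P_a = 0.7, 0.3
G11, G12, G22 = 1.0, 0.6, -0.2                        # assumed genotypic values
alpha = P_A * G11 + (P_a - P_A) * G12 - P_a * G22     # additive effect, Appendix A

X = np.array([2 * P_a, P_a - P_A, -2 * P_A])          # equation (1) coding
G = np.array([G11, G12, G22])
w = np.array([P_A**2, 2 * P_A * P_a, P_a**2])         # Hardy-Weinberg weights
slope = np.sum(w * X * G) / np.sum(w * X**2)          # E[XG]/E[X^2], since E[X] = 0
print(alpha, slope)                                   # both 0.52, up to rounding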
Appendix B: The convergence of the least-squares estimator of the regression coefficients
Let $Y = [Y_1, \ldots, Y_n]^T$, $X^m = [X_1^m, \ldots, X_n^m]^T$, $\gamma = [\alpha_m, \beta_m]^T$,
$\varepsilon = [\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n]^T$, $W = [\mathbf{1}, X^m]$ and
$\mathbf{1} = [1, 1, \ldots, 1]^T$. Then equation (3) can be written in matrix form:
$$Y = W\gamma + \varepsilon. \qquad (B1)$$
The least-squares estimator of the regression coefficients $\gamma$ is given by
$$\hat{\gamma} = (W^T W)^{-1} W^T Y. \qquad (B2)$$
It follows from equation (4) that
$$E[X_1^m] = 2 P_m P_M^2 + 2 (P_m - P_M) P_M P_m - 2 P_M P_m^2 = 2 P_m P_M (P_M + P_m - P_M - P_m) = 0,$$
$$E[(X_1^m)^2] = (2 P_m)^2 P_M^2 + 2 (P_m - P_M)^2 P_M P_m + (2 P_M)^2 P_m^2 = 2 P_m P_M.$$
Since
$$\frac{1}{n} W^T W = \begin{bmatrix} 1 & \frac{1}{n} \sum_i X_i^m \\ \frac{1}{n} \sum_i X_i^m & \frac{1}{n} \sum_i (X_i^m)^2 \end{bmatrix},$$
by the law of large numbers we have
$$\frac{1}{n} W^T W \xrightarrow{a.s.} \begin{bmatrix} 1 & 0 \\ 0 & 2 P_m P_M \end{bmatrix}. \qquad (B3)$$
It follows from equations (1), (3) and (4) that $E[Y_1] = \mu$ and, writing $P_{MA}$, $P_{Ma}$, $P_{mA}$ and
$P_{ma}$ for the marker-trait haplotype frequencies,
$$\begin{aligned}
E[X_1^m X_1] &= 2 P_m [(2 P_a) P_{MA}^2 + 2 (P_a - P_A) P_{MA} P_{Ma} - 2 P_A P_{Ma}^2] \\
&\quad + (P_m - P_M) [(2 P_a) 2 P_{MA} P_{mA} + 2 (P_a - P_A)(P_{MA} P_{ma} + P_{Ma} P_{mA}) - 2 (2 P_A) P_{Ma} P_{ma}] \\
&\quad - 2 P_M [(2 P_a) P_{mA}^2 + 2 (P_a - P_A) P_{mA} P_{ma} - 2 P_A P_{ma}^2] \\
&= 4 P_m P_M D + 2 (P_m - P_M)^2 D + 4 P_m P_M D = 2D.
\end{aligned}$$
By the law of large numbers,
$$\frac{1}{n} W^T Y = \begin{bmatrix} \frac{1}{n} \sum_i Y_i \\ \frac{1}{n} \sum_i X_i^m Y_i \end{bmatrix} \xrightarrow{a.s.} \begin{bmatrix} \mu \\ 2 D \alpha \end{bmatrix}. \qquad (B4)$$
Combining equations (B3) and (B4), we obtain
$$\hat{\beta}_m \xrightarrow{a.s.} \frac{2 D \alpha}{2 P_M P_m} = \frac{D}{P_M P_m}\,\alpha. \qquad (B5)$$
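The following Monte Carlo sketch (ours; the allele frequencies, D and $\alpha$ are assumed values) simulates haplotype pairs and checks result (B5) numerically:

# A Monte Carlo sketch (not from the paper) of result (B5): regress the phenotype
# on the marker coding of equation (4) and compare the slope with D*alpha/(P_M*P_m).
import numpy as np

rng = np.random.default_rng(2)
P_M, P_A, D, alpha, n = 0.6, 0.5, 0.08, 1.0, 200_000
P_m, P_a = 1 - P_M, 1 - P_A
# Haplotype frequencies (MA, Ma, mA, ma) with LD coefficient D:
hap_p = np.array([P_M * P_A + D, P_M * P_a - D, P_m * P_A - D, P_m * P_a + D])
haps = rng.choice(4, size=(n, 2), p=hap_p)          # two haplotypes per individual
marker = (haps < 2).sum(axis=1)                     # count of allele M (0, 1, 2)
trait = (haps % 2 == 0).sum(axis=1)                 # count of allele A (0, 1, 2)

X_m = np.select([marker == 2, marker == 1], [2 * P_m, P_m - P_M], -2 * P_M)
X_t = np.select([trait == 2, trait == 1], [2 * P_a, P_a - P_A], -2 * P_A)
Y = X_t * alpha + rng.normal(size=n)                # model (1) with mu = 0

slope = np.cov(X_m, Y)[0, 1] / np.var(X_m)
print(slope, D * alpha / (P_M * P_m))               # both approximately 0.333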
When the marker is at the trait locus, $D = P_M P_m$ and hence
$$\hat{\beta}_m \xrightarrow{a.s.} \alpha.$$
References
1. Friedman J, Hastie T, Höfling H and Tibshirani R. Pathwise coordinate optimization. Ann. Appl. Stat. 2007; 1: 302-332.
2. Ramsay JO and Silverman BW. Functional Data Analysis. 2nd ed. New York: Springer; 2005.