File S2 - Genetics


File S2: Derivation of various formulas

Restricted maximum likelihood estimation of variance component via eigen-decomposition:

Under the polygenic model, the restricted log likelihood function is

L(\theta) = -\frac{n-r}{2}\ln(\sigma^2) - \frac{1}{2}\ln|H| - \frac{1}{2\sigma^2}(y - X\beta)^T H^{-1}(y - X\beta) - \frac{1}{2}\ln|X^T H^{-1} X|    (1)

where \theta = \{\beta, \lambda, \sigma^2\} is the parameter vector, \beta is a vector of fixed effects, \lambda = \phi^2/\sigma^2 is the variance ratio, \phi^2 is the polygenic variance, \sigma^2 is the residual variance, n is the sample size, r is the rank of matrix X, H = K\lambda + I is the covariance structure and K is a marker-inferred kinship matrix. Given \lambda, the maximum likelihood estimates of \beta and \sigma^2 are

\hat{\beta} = (X^T H^{-1} X)^{-1} X^T H^{-1} y
\hat{\sigma}^2 = \frac{1}{n-r}(y - X\hat{\beta})^T H^{-1}(y - X\hat{\beta})    (2)

These two estimated parameters are expressed as functions of \lambda. Substituting \beta and \sigma^2 in equation (1) by \hat{\beta} and \hat{\sigma}^2 in equation (2) yields a profiled likelihood function that is only a function of \lambda, as shown below,

L(\lambda) = -\frac{1}{2}\ln|H| - \frac{1}{2}\ln|X^T H^{-1} X| - \frac{n-r}{2}\ln(y^T P y)    (3)

where

P = H^{-1} - H^{-1} X (X^T H^{-1} X)^{-1} X^T H^{-1}    (4)

A numeric solution of \lambda can be found iteratively using the Newton algorithm,

\lambda^{(t+1)} = \lambda^{(t)} - \left[\frac{\partial^2 L(\lambda)}{\partial \lambda^2}\right]^{-1} \left[\frac{\partial L(\lambda)}{\partial \lambda}\right]    (5)

The likelihood function requires the inverse and determinant of matrix H, an n \times n matrix, and the computation can be demanding for large sample sizes. We used the eigen-decomposition approach to deal with the K matrix (Kang et al. 2008; Zhou and Stephens 2012). Further investigation of equation (3) shows that the profiled restricted log likelihood function only requires the log determinant of matrix H and various quadratic forms involving H^{-1}. Let us perform eigen-decomposition for K so that K = UDU^T, where D = \mathrm{diag}\{\delta_1, \ldots, \delta_n\} is a diagonal matrix of the eigenvalues and U is the matrix of eigenvectors, an n \times n matrix. The eigenvectors have the property U^T = U^{-1}, so that UU^T = I. Now, let us rewrite matrix H as

H = K\lambda + I = UDU^T\lambda + I = U(D\lambda + I)U^T    (6)

The determinant of H is

|H| = |U(D\lambda + I)U^T| = |U|\,|D\lambda + I|\,|U^T| = |D\lambda + I|    (7)

where D\lambda + I is a diagonal matrix. Therefore, the log determinant of matrix H is

\ln|H| = \sum_{j=1}^{n} \ln(\delta_j \lambda + 1)    (8)
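As a numerical sanity check of equations (6)-(8), the sketch below (Python with NumPy; the toy kinship matrix, seed, and variable names are illustrative assumptions, not part of the original derivation) confirms that the log determinant of H = K\lambda + I equals the eigenvalue summation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy kinship: K = GG^T / m from simulated marker genotypes.
n, m = 10, 50
G = rng.choice([-1.0, 0.0, 1.0], size=(n, m))
K = G @ G.T / m

# Eigen-decomposition K = U D U^T (done once, reused for every lambda).
delta, U = np.linalg.eigh(K)

lam = 0.7  # an arbitrary variance ratio for the check
H = lam * K + np.eye(n)

# Equation (8): ln|H| = sum_j ln(delta_j * lambda + 1)
logdet_direct = np.linalg.slogdet(H)[1]
logdet_eigen = np.sum(np.log(delta * lam + 1.0))
print(np.allclose(logdet_direct, logdet_eigen))  # True
```

Because the eigen-decomposition is computed only once, \ln|H| can then be re-evaluated for any candidate \lambda in O(n) time.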

The restricted log likelihood function also involves various quadratic terms in the form of a^T H^{-1} b, for example, X^T H^{-1} X, X^T H^{-1} y and y^T H^{-1} y. Using eigenvalue decomposition, we can rewrite the quadratic form by

a^T H^{-1} b = a^T U (D\lambda + I)^{-1} U^T b = a^{*T} (D\lambda + I)^{-1} b^* = \sum_{j=1}^{n} a_j^{*T} b_j^* (\delta_j \lambda + 1)^{-1}    (9)

where a^* = U^T a and b^* = U^T b. Note that a_j^* is the j-th element (row) of vector (matrix) a^* and b_j^* is the j-th element (row) of vector (matrix) b^*. Using eigenvalue decomposition, matrix inversion and determinant calculation have been simplified into simple summations, and thus the computational speed can be substantially improved.
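The shortcuts of equations (8) and (9) make the profiled likelihood in equation (3) cheap to evaluate repeatedly. The sketch below (NumPy; simulated toy data, all names illustrative) checks the fast evaluation against a dense evaluation of equations (3)-(4), and uses a coarse grid maximization as a simple stand-in for the Newton update (5):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data (illustrative only): simulated markers give a kinship K; X is an intercept.
n, m = 60, 120
G = rng.choice([-1.0, 0.0, 1.0], size=(n, m))
K = G @ G.T / m
X = np.ones((n, 1))
y = rng.normal(size=n)
r = X.shape[1]

delta, U = np.linalg.eigh(K)   # K = U D U^T, computed once
ys, Xs = U.T @ y, U.T @ X      # rotated data: quadratic forms become weighted sums

def profiled_reml(lam):
    """Equation (3) evaluated with the shortcuts of equations (8) and (9)."""
    w = 1.0 / (delta * lam + 1.0)                 # diagonal of (D lam + I)^{-1}
    logdet_H = -np.sum(np.log(w))                 # equation (8)
    XHX = Xs.T @ (w[:, None] * Xs)                # X^T H^{-1} X via equation (9)
    XHy = Xs.T @ (w * ys)
    yHy = ys @ (w * ys)
    yPy = yHy - XHy @ np.linalg.solve(XHX, XHy)   # y^T P y from equation (4)
    return -0.5 * logdet_H - 0.5 * np.linalg.slogdet(XHX)[1] - 0.5 * (n - r) * np.log(yPy)

def profiled_reml_dense(lam):
    """Dense evaluation of equations (3)-(4) for comparison."""
    H = K * lam + np.eye(n)
    Hi = np.linalg.inv(H)
    XHX = X.T @ Hi @ X
    P = Hi - Hi @ X @ np.linalg.inv(XHX) @ X.T @ Hi
    return (-0.5 * np.linalg.slogdet(H)[1]
            - 0.5 * np.linalg.slogdet(XHX)[1]
            - 0.5 * (n - r) * np.log(y @ P @ y))

print(np.isclose(profiled_reml(0.5), profiled_reml_dense(0.5)))  # True

# Coarse grid maximization as a stand-in for the Newton iteration (5)
grid = np.exp(np.linspace(-4, 4, 81))
lam_hat = grid[np.argmax([profiled_reml(l) for l in grid])]
```

The fast version never forms the n \times n matrix H; each evaluation costs O(n) after the one-time eigen-decomposition.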

Best linear unbiased prediction (BLUP) of a marker effect under the polygenic model:

Under the polygenic model, all marker effects share the same variance, i.e., a_k \sim N(0, I\phi^2/m) for k = 1, \ldots, m, where \phi^2 = \lambda\sigma^2 is estimated from the data under the polygenic model. The BLUP estimate of a_k is

\mathrm{E}(a_k \mid y) = \frac{\hat{\phi}^2}{m} Z_k^T (K\hat{\phi}^2 + I\hat{\sigma}^2)^{-1} (y - X\hat{\beta})    (10)

We have a total of m markers and thus m effects to estimate under the polygenic model (prior to the marker-scanning step). The polygenic effect associated with marker k is \xi_k = Z_k a_k. Here, eigen-decomposition is also required to avoid direct calculation of (K\hat{\phi}^2 + I\hat{\sigma}^2)^{-1}.
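Equation (10) can likewise be evaluated without explicitly inverting the n \times n covariance matrix. A minimal NumPy sketch with toy inputs; the values of \hat{\phi}^2, \hat{\sigma}^2 and \hat{\beta} are illustrative placeholders (in practice they come from the REML fit; here \hat{\beta} is an ordinary least-squares stand-in):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical toy inputs; phi2, sigma2 and beta_hat stand in for the REML estimates.
n, m = 40, 80
Z = rng.choice([-1.0, 0.0, 1.0], size=(n, m))   # columns are marker covariates Z_k
K = Z @ Z.T / m
X = np.ones((n, 1))
y = rng.normal(size=n)
phi2, sigma2 = 0.6, 1.0
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS placeholder for the GLS beta-hat

# Equation (10) via eigen-decomposition:
# (K phi2 + I sigma2)^{-1} = U diag(1/(delta_j phi2 + sigma2)) U^T
delta, U = np.linalg.eigh(K)
resid = U.T @ (y - X @ beta_hat)
weighted = U @ (resid / (delta * phi2 + sigma2))

# BLUP of all m marker effects at once: a_hat[k] = (phi2/m) Z_k^T (...)^{-1} (y - X beta_hat)
a_hat = (phi2 / m) * Z.T @ weighted

# Direct dense check of equation (10)
V_inv = np.linalg.inv(K * phi2 + np.eye(n) * sigma2)
a_hat_dense = (phi2 / m) * Z.T @ V_inv @ (y - X @ beta_hat)
print(np.allclose(a_hat, a_hat_dense))  # True
```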

Estimating variance components via Woodbury matrix identity and eigen-decomposition:

The genomic scanning model for the k-th locus is

y = X\beta + Z_k \gamma_k + \varepsilon    (11)

where \varepsilon is a general error term that absorbs the polygene and has \mathrm{E}(\varepsilon) = 0 and \mathrm{var}(\varepsilon) = (K\hat{\lambda} + I)\sigma^2. We assume \gamma_k \sim N(0, I_8 \sigma_k^2) and perform a significance test for H_0: \sigma_k^2 = 0.

Under the null hypothesis, the k-th locus is not linked to QTL. The expectation of y remains \mathrm{E}(y) = X\beta, but the variance-covariance matrix is

\mathrm{var}(y) = Z_k Z_k^T \sigma_k^2 + K\phi^2 + I\sigma^2 = (Z_k Z_k^T \lambda_k + K\hat{\lambda} + I)\sigma^2    (12)

where \lambda_k = \sigma_k^2/\sigma^2 is the variance ratio. Let y^* = U^T y, X^* = U^T X and Z_k^* = U^T Z_k be transformed variables so that

y^* = X^*\beta + Z_k^*\gamma_k + U^T\varepsilon    (13)

The variance-covariance matrix of y^* is

\mathrm{var}(y^*) = Z_k^* Z_k^{*T} \sigma_k^2 + (D\hat{\lambda} + I)\sigma^2 = (Z_k^* Z_k^{*T} \lambda_k + R)\sigma^2    (14)

where R = D\hat{\lambda} + I is a known diagonal matrix for the general covariance structure. Let H_k = Z_k^* Z_k^{*T} \lambda_k + R and define the restricted log likelihood function for parameter vector \theta = \{\beta, \lambda_k, \sigma^2\} by

L(\theta) = -\frac{n-r}{2}\ln(\sigma^2) - \frac{1}{2}\ln|H_k| - \frac{1}{2\sigma^2}(y^* - X^*\beta)^T H_k^{-1}(y^* - X^*\beta) - \frac{1}{2}\ln|X^{*T} H_k^{-1} X^*|    (15)

Given \lambda_k, the maximum likelihood estimates of \beta and \sigma^2 are

\hat{\beta} = (X^{*T} H_k^{-1} X^*)^{-1} X^{*T} H_k^{-1} y^*
\hat{\sigma}^2 = \frac{1}{n-r}(y^* - X^*\hat{\beta})^T H_k^{-1}(y^* - X^*\hat{\beta})    (16)

The above estimated parameters are expressed as functions of \lambda_k. Substituting \beta and \sigma^2 in equation (15) by \hat{\beta} and \hat{\sigma}^2 in equation (16) yields a profiled likelihood function that is only a function of \lambda_k, as shown below,

L(\lambda_k) = -\frac{1}{2}\ln|H_k| - \frac{1}{2}\ln|X^{*T} H_k^{-1} X^*| - \frac{n-r}{2}\ln(y^{*T} P_k y^*)    (17)

where

P_k = H_k^{-1} - H_k^{-1} X^* (X^{*T} H_k^{-1} X^*)^{-1} X^{*T} H_k^{-1}    (18)

The Newton algorithm for the numeric solution of \lambda_k is

\lambda_k^{(t+1)} = \lambda_k^{(t)} - \left[\frac{\partial^2 L(\lambda_k)}{\partial \lambda_k^2}\right]^{-1} \left[\frac{\partial L(\lambda_k)}{\partial \lambda_k}\right]    (19)

Once the iteration process has converged, the solution is the MLE of \lambda_k, denoted by \hat{\lambda}_k.

Efficient matrix inversion and determinant calculation are required to evaluate the log likelihood function shown in equation (17). We used the Woodbury matrix identities to improve the computational speed (Golub and Van Loan 1996). The Woodbury matrix identities are

H_k^{-1} = (Z_k^* Z_k^{*T} \lambda_k + R)^{-1} = R^{-1} - \lambda_k R^{-1} Z_k^* (\lambda_k Z_k^{*T} R^{-1} Z_k^* + I_8)^{-1} Z_k^{*T} R^{-1}    (20)

and

|H_k| = |Z_k^* Z_k^{*T} \lambda_k + R| = |R|\,|\lambda_k Z_k^{*T} R^{-1} Z_k^* + I_8|    (21)
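The savings from equations (20) and (21) can be verified directly on a small example: both routes agree, but the Woodbury route only inverts an 8 \times 8 core matrix. A minimal NumPy sketch with simulated, purely illustrative inputs:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical dimensions: n individuals, q = 8 columns in Z*_k.
n, q = 30, 8
Zs = rng.normal(size=(n, q))              # stands in for Z*_k = U^T Z_k
R_diag = rng.uniform(0.5, 2.0, size=n)    # R = D lambda-hat + I is diagonal and positive
lam_k = 0.9

Hk = lam_k * Zs @ Zs.T + np.diag(R_diag)

# Equation (20): H_k^{-1} via an 8x8 inverse instead of an n x n inverse
Rinv_Z = Zs / R_diag[:, None]                          # R^{-1} Z*_k (diagonal solve)
core = lam_k * Zs.T @ Rinv_Z + np.eye(q)               # lambda_k Z*'R^{-1}Z* + I_8
Hk_inv = np.diag(1.0 / R_diag) - lam_k * Rinv_Z @ np.linalg.solve(core, Rinv_Z.T)

# Equation (21): |H_k| = |R| |lambda_k Z*'R^{-1}Z* + I_8|
logdet_woodbury = np.sum(np.log(R_diag)) + np.linalg.slogdet(core)[1]

print(np.allclose(Hk_inv, np.linalg.inv(Hk)))                 # True
print(np.isclose(logdet_woodbury, np.linalg.slogdet(Hk)[1]))  # True
```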

Because R = D\hat{\lambda} + I is a diagonal matrix, the Woodbury identities convert the above calculations into the inversion and determinant of matrices with dimension 8 \times 8. The restricted likelihood function also involves various quadratic terms in the form of a^T H_k^{-1} b, which can be expressed as

a^T H_k^{-1} b = a^T R^{-1} b - \lambda_k a^T R^{-1} Z_k^* (\lambda_k Z_k^{*T} R^{-1} Z_k^* + I_8)^{-1} Z_k^{*T} R^{-1} b    (22)

Note that the quadratic term involving H_k^{-1} has been expressed as a function of various simplified a^T R^{-1} b terms. The simplified quadratic term is calculated using

a^T R^{-1} b = \sum_{j=1}^{n} a_j^T b_j (\delta_j \hat{\lambda} + 1)^{-1}    (23)

where a_j and b_j are the j-th rows of matrices a and b, respectively, for j = 1, \ldots, n.
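Combining equations (22) and (23), any quadratic form in H_k^{-1} reduces to weighted sums plus one 8 \times 8 solve, without ever forming the n \times n matrix H_k. A minimal NumPy sketch with illustrative toy inputs:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical setup mirroring equations (22)-(23): R = D lambda-hat + I is diagonal.
n, q = 25, 8
delta = rng.uniform(0.0, 3.0, size=n)    # stands in for the eigenvalues of K
lam_hat, lam_k = 0.8, 1.3
R_diag = delta * lam_hat + 1.0
Zs = rng.normal(size=(n, q))             # stands in for Z*_k
a = rng.normal(size=n)
b = rng.normal(size=n)

def quad_Rinv(u, v):
    # Equation (23): u' R^{-1} v = sum_j u_j v_j / (delta_j lambda-hat + 1)
    return np.sum(u * v / R_diag)

# Equation (22): a' H_k^{-1} b without forming H_k
core = lam_k * (Zs.T / R_diag) @ Zs + np.eye(q)   # lambda_k Z*'R^{-1}Z* + I_8
aRZ = (a / R_diag) @ Zs                           # a' R^{-1} Z*_k
ZRb = Zs.T @ (b / R_diag)                         # Z*_k' R^{-1} b
quad = quad_Rinv(a, b) - lam_k * aRZ @ np.linalg.solve(core, ZRb)

# Dense check against a' H_k^{-1} b
Hk = lam_k * Zs @ Zs.T + np.diag(R_diag)
print(np.isclose(quad, a @ np.linalg.inv(Hk) @ b))  # True
```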

LITERATURE CITED

Golub, G. H., and C. F. Van Loan, 1996 Matrix Computations. Johns Hopkins University Press, Baltimore, MD: 374-426.

Kang, H. M., N. A. Zaitlen, C. M. Wade, A. Kirby, D. Heckerman et al.

, 2008 Efficient control of population structure in model organism association mapping. Genetics 178 : 1709-1723.

Zhou, X., and M. Stephens, 2012 Genome-wide efficient mixed-model analysis for association studies. Nature Genetics 44: 821-824.
