MODELS FOR GENETIC ANALYSIS OF LONGITUDINAL DATA ANS 590A

MODELS FOR GENETIC ANALYSIS OF LONGITUDINAL DATA
ANS 590A
Jack Dekkers
March, 2002
Based on Notes for Short Course on ‘Random Regression in Animal Breeding’
Julius van der Werf
Larry Schaeffer
University of Guelph, 1997
1. INTRODUCTION (J. van der Werf)
In univariate analysis the basic assumption is that a single measurement arises
from a single unit (experimental unit). In multivariate analysis, not one measurement but
a number of different characteristics are measured on each experimental unit, e.g.
milk yield, body weight and feed intake of a cow. These measurements are assumed to
have a correlation structure among them. When the same physical quantity is measured
sequentially over time on each experimental unit, we call them repeated measurements,
which can be seen as a special form of a multivariate case. Repeated measurements
deserve a special statistical treatment in the sense that their covariance pattern, which has
to be taken into account, is often very structured. Repeated measurements on the same
animal are generally more correlated than two measurements on different animals, and
the correlation between repeated measurements may decrease as the time between them
increases. Modeling the covariance structure of repeated measurements correctly is of
importance for drawing correct inference from such data.
Measurements that are taken along a trajectory can often be modeled as a function
of the parameters that define that trajectory. The most common example of a trajectory is
time, and repeated measurements are taken on a trajectory of time. The term ‘repeated
measurement’ can be taken literally in the sense that the measurements can be thought of
as being repeatedly influenced by identical effects, and it is only random noise that
causes variation between them. However, repeatedly measuring a certain characteristic
may give information about the change over time of that characteristic. The function that
describes such a change over time may be of interest since it may help us to understand or
explain, or to manipulate how the characteristic changes over time. Common examples in
animal production are growth curves and lactation curves.
Generally, we have therefore two main arguments to take special care when
dealing with repeated measurements. The first is to achieve statistically correct models
that allow correct inferences from the data. The second argument is to obtain information
on a trait that changes gradually over time.
Experiments are often set up with repeated measurements to exploit these two
features. The prime advantage of longitudinal studies (i.e. with repeated measurements
over time) is their effectiveness for studying change. Notice that the interpretation of
change may be very different if it is obtained from data across individuals (cross sectional
study) or on repeated measures on the same individuals. An example is given by Diggle
et al. (1994) where the relationship between reading ability and age is plotted. A first
glance at the data suggests a negative relationship, because older people in the data set
tended to have had less education. However, repeated observations on individuals
showed a clear improvement of reading ability over time.
The other advantage of longitudinal studies is that they often increase statistical
power. The influence of variability across experimental units is canceled if experimental
units can serve as their own control.
Both arguments are very important in animal production as well. A good example
is the estimation of a growth curve. If weight were regressed on time using data
across animals, not only would the resulting growth curve be less accurate, but
the resulting parameters might also be badly biased if differences between animals and
animals' environments were not taken into account.
Models that deal with repeated measurements have often been used in animal
production. In dairy cattle, multiple lactation records are often analyzed
using a 'repeatability model'. The typical feature of such a model from the genetic point
of view is that repeated records are thought of as expressions of the same trait, that is, the
genetic correlation between repeated lactations is assumed to be equal to unity. Models
that include data on individual test days have often used the same assumption. In contrast,
genetic evaluation models that use measures of growth often treat repeated
measurements as genetically different (but correlated) traits. Weaning weight and
yearling weight in beef cattle are usually analyzed in a multiple trait model.
Repeatability models are often used because of their simplicity. With several
measurements per animal, they require much less computational effort and fewer
parameters than a multiple trait model. A multiple trait model would often seem more
correct, since it allows genetic correlations to differ between different measurements.
However, the covariance matrix for measurements at very many different ages would be
highly overparameterised. Also, an unstructured covariance matrix may not be the most
desirable for repeated measurements that are recorded along a trajectory. As the mean of
measurements is a function of time, so also may their covariance structure be. A model
that allows the covariance between measurements to change gradually over time, and
with the change dependent upon differences between times, can make use of a covariance
function.
As was stated earlier, repeated measurements can often be used to generate
knowledge about the change of a trait over time. Whole families of models have been
especially designed to describe such changes as regression on time, e.g. lactation curves
and growth curves. The analysis may reveal causes of variation that influence this
change. Parameters that describe typical features of such change, e.g. the slope of a
growth curve, are regressions that may be influenced by feeding levels, environment, or
breeds. There may also be additive genetic variation within breeds for such parameters.
One option is then to estimate curve parameters for each animal and determine
components of variation for such parameters. Another option is to use a model for analysis
that allows regression coefficients to vary from animal to animal. Such regression
coefficients are then not fixed, but are allowed to vary according to a distribution that can
be assigned to them, and are therefore referred to as random regression coefficients.
This course will present models that use random regression in animal breeding.
Typical applications are for traits that are measured repeatedly along a trajectory, e.g.
time. Different random regression models will be presented and compared. The features
of random regression models, and estimation of their parameters will be discussed.
Alternative approaches to deal with repeatedly measured traits along a trajectory are the
use of covariance functions, and use of multiple trait models. These approaches have
much in common, since they all deal with changing covariances along a trajectory.
Differences that seem to exist are most often due to differences in the model, and
not necessarily due to the approach followed. This course will present and
discuss the different methods, and show where they can be equivalent. Different models
that allow the study of genetic aspects of changes of traits along a trajectory will be
presented and discussed.
Most of the examples will refer to test day production records in dairy cattle,
since test day models have been used mostly to develop and compare random regression
models. However, the procedures and models presented have a much wider scope for use,
since many characters have multiple expressions, and often there is an interest in how the
expression changes over time. A good example is the analysis of traits related to growth.
Another generalization is that the methodology developed does not necessarily refer only
to traits that are modeled as a function of time (i.e. regressed on a time variable).
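2. STANDARD GENETIC MODELS FOR LONGITUDINAL DATA
Animal model for a single phenotypic record of animal i:
$$y_i = \mathbf{x}_i'\mathbf{b} + u_i + e_i$$
Across animals:
$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{u} + \mathbf{e}$$
$$\mathbf{u} = (u_1 \ldots u_n)' \sim N(\mathbf{0}, \sigma_a^2\mathbf{A})$$
$$\mathbf{e} = (e_1 \ldots e_n)' \sim N(\mathbf{0}, \sigma_e^2\mathbf{I})$$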
Repeatability animal model: T common measuring times/ages for the n animals.
Observation at age/time t on animal i:
$$y_{it} = \mathbf{x}_{it}'\mathbf{b} + u_i + p_i + e_{it}$$
$\mathbf{x}_{it}'\mathbf{b}$ must include some effects to account for the age/time of measurement:
- Discrete class variable
- Polynomial function; e.g. $\mathbf{x}_{it}'\mathbf{b} = \sum_{k=0}^{q} b_k x_{it}^k$
Across animals:
$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{u} + \mathbf{W}\mathbf{p} + \mathbf{e}$$
$$\mathbf{p} = (p_1 \ldots p_n)' \sim N(\mathbf{0}, \sigma_{pe}^2\mathbf{I})$$
Assumes the same genetic trait is expressed at each age/time.
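The assumption of a single genetic trait can be made explicit by writing out the covariance structure that the repeatability model implies for records on one animal. The following is a short worked derivation, assuming (as in the model above) that $u_i$, $p_i$ and $e_{it}$ are mutually independent; the symbol $r$ for the repeatability is introduced here only for illustration.

```latex
\begin{aligned}
\mathrm{Var}(y_{it}) &= \sigma_a^2 + \sigma_{pe}^2 + \sigma_e^2 \\
\mathrm{Cov}(y_{it},\, y_{it'}) &= \sigma_a^2 + \sigma_{pe}^2 \qquad (t \neq t') \\
r = \mathrm{Corr}(y_{it},\, y_{it'}) &= \frac{\sigma_a^2 + \sigma_{pe}^2}{\sigma_a^2 + \sigma_{pe}^2 + \sigma_e^2}
\end{aligned}
```

Neither the variance nor the covariance depends on t or t', and the genetic covariance between any two ages equals the genetic variance $\sigma_a^2$, which is why the genetic correlation between ages is 1 under this model.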
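Multiple trait animal model
Assume a different genetic trait at each time/age; for measurements at age/time t on
animal i:
$$y_{it} = \mathbf{x}_{it}'\mathbf{b}_t + u_{it} + e_{it}$$
Across animals for measurements at age/time t:
$$\mathbf{y}_t = \mathbf{X}_t\mathbf{b}_t + \mathbf{Z}_t\mathbf{u}_t + \mathbf{e}_t$$
Across ages; data sorted by age/time:
$$\begin{bmatrix}\mathbf{y}_1\\ \mathbf{y}_2\\ \vdots\\ \mathbf{y}_T\end{bmatrix} =
\begin{bmatrix}\mathbf{X}_1 & \mathbf{0} & \cdots & \mathbf{0}\\ \mathbf{0} & \mathbf{X}_2 & \cdots & \mathbf{0}\\ \vdots & \vdots & \ddots & \vdots\\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{X}_T\end{bmatrix}
\begin{bmatrix}\mathbf{b}_1\\ \mathbf{b}_2\\ \vdots\\ \mathbf{b}_T\end{bmatrix} +
\begin{bmatrix}\mathbf{Z}_1 & \mathbf{0} & \cdots & \mathbf{0}\\ \mathbf{0} & \mathbf{Z}_2 & \cdots & \mathbf{0}\\ \vdots & \vdots & \ddots & \vdots\\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{Z}_T\end{bmatrix}
\begin{bmatrix}\mathbf{u}_1\\ \mathbf{u}_2\\ \vdots\\ \mathbf{u}_T\end{bmatrix} +
\begin{bmatrix}\mathbf{e}_1\\ \mathbf{e}_2\\ \vdots\\ \mathbf{e}_T\end{bmatrix}$$
$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{u} + \mathbf{e}$$
$$\mathbf{u} = (\mathbf{u}_1'\ \mathbf{u}_2'\ \ldots\ \mathbf{u}_T')' \sim N(\mathbf{0}, \mathbf{G} \otimes \mathbf{A})$$
$$\mathbf{e} = (\mathbf{e}_1'\ \mathbf{e}_2'\ \ldots\ \mathbf{e}_T')' \sim N(\mathbf{0}, \mathbf{E} \otimes \mathbf{I})$$
MME:
$$\begin{bmatrix}\mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{Z}\\ \mathbf{Z}'\mathbf{X} & \mathbf{Z}'\mathbf{Z} + \mathbf{G}^{-1} \otimes \mathbf{A}^{-1}\end{bmatrix}
\begin{bmatrix}\hat{\mathbf{b}}\\ \hat{\mathbf{u}}\end{bmatrix} =
\begin{bmatrix}\mathbf{X}'\mathbf{y}\\ \mathbf{Z}'\mathbf{y}\end{bmatrix}$$
These equations are written without the residual covariance matrix. With
$\mathbf{R} = \mathrm{Var}(\mathbf{e}) = \mathbf{E} \otimes \mathbf{I}$, the terms X'X, X'Z, Z'Z, X'y and Z'y become
X'R⁻¹X, X'R⁻¹Z, Z'R⁻¹Z, X'R⁻¹y and Z'R⁻¹y.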
Analysis of individual animal curve parameters
1) Fit separate curve to each animal's data:
   - linear polynomial function: $y_{it} = \mathbf{x}_{it}'\boldsymbol{\beta}_i = \sum_{k=0}^{q} \beta_{ik} x_{it}^k$
   - non-linear function: $y_{it} = f(\mathbf{x}_{it}, \boldsymbol{\beta}_i)$
2) Fit animal model to estimates of curve parameters:
$$\hat{\beta}_{ik} = \mathbf{x}_{ik}'\mathbf{b}_k + u_{ik} + e_{ik}$$
   Single or multiple trait model
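Step 1 of this two-step approach can be sketched in a few lines. The example below is only an illustration under stated assumptions: the function name `fit_animal_curves`, the animal labels and the records are hypothetical (the records are generated from known quadratics so the fit can be checked), and ordinary least squares via `numpy.polyfit` stands in for whatever curve-fitting method is actually used.

```python
import numpy as np

def fit_animal_curves(records, degree=2):
    """Fit a separate polynomial to each animal's records (step 1).

    records: dict mapping animal -> (ages, values).
    Returns a dict mapping animal -> coefficients beta_0 ... beta_q.
    Animals with too few records to estimate all coefficients are skipped.
    """
    curves = {}
    for animal, (ages, values) in records.items():
        if len(ages) <= degree:
            continue
        coefs = np.polyfit(ages, values, degree)  # highest power first
        curves[animal] = coefs[::-1]              # reorder to beta_0 ... beta_q
    return curves

# Hypothetical records generated from known quadratic curves.
ages = np.array([20.0, 30.0, 40.0, 50.0])
records = {
    "A": (ages, 10.0 + 1.5 * ages - 0.01 * ages**2),
    "B": (ages, 12.0 + 1.2 * ages - 0.008 * ages**2),
    "C": (ages[:2], np.array([30.0, 41.0])),  # only two records: skipped
}
curves = fit_animal_curves(records)
for animal, beta in curves.items():
    print(animal, np.round(beta, 4))
```

The estimated coefficients per animal would then be the "observations" in step 2. Animal "C" shows one practical weakness of the two-step approach: animals with few records contribute nothing, or contribute poorly estimated parameters.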
3. RANDOM REGRESSION MODELS
Suppose that the observation for animal i at time t is modeled by a quadratic polynomial
of time, with regression coefficients specific to that animal:
$$y_{it} = \mu + \beta_{0i} + \beta_{1i}x_{it} + \beta_{2i}x_{it}^2 + e_{it}$$
Now, consider that the animal-specific regression coefficients $\beta_{ki}$ are determined, apart
from some average ($b_k$) that applies to the whole population, by genetics ($a_{ki}$) as well as
by environmental factors ($p_{ki}$) that are specific to animal i:
$$\beta_{ki} = b_k + a_{ki} + p_{ki}$$
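Then, rearranging terms such that population average, genetic, and environmental effects
are grouped:
$$y_{it} = \mu + \underbrace{b_0 + b_1x_{it} + b_2x_{it}^2}_{\text{fixed}} + \underbrace{a_{0i} + a_{1i}x_{it} + a_{2i}x_{it}^2}_{\text{genetic}} + \underbrace{p_{0i} + p_{1i}x_{it} + p_{2i}x_{it}^2}_{\text{perm. envir.}} + e_{it}$$
$$y_{it} = \mu + \begin{bmatrix}1 & x_{it} & x_{it}^2\end{bmatrix}\begin{bmatrix}b_0\\ b_1\\ b_2\end{bmatrix} + \begin{bmatrix}1 & x_{it} & x_{it}^2\end{bmatrix}\begin{bmatrix}a_{0i}\\ a_{1i}\\ a_{2i}\end{bmatrix} + \begin{bmatrix}1 & x_{it} & x_{it}^2\end{bmatrix}\begin{bmatrix}p_{0i}\\ p_{1i}\\ p_{2i}\end{bmatrix} + e_{it}$$
$$y_{it} = \mu + \mathbf{x}_{it}'\mathbf{b} + \mathbf{m}_{it}'\mathbf{a}_i + \mathbf{m}_{it}'\mathbf{p}_i + e_{it}, \qquad \mathbf{m}_{it}' = \begin{bmatrix}1 & x_{it} & x_{it}^2\end{bmatrix}$$
Variance at time t:
$$\mathrm{Var}(y_{it}) = \mathbf{m}_{it}'\mathbf{G}\mathbf{m}_{it} + \mathbf{m}_{it}'\mathbf{E}\mathbf{m}_{it} + \sigma_e^2$$
Covariance between time t1 and t2:
$$\mathrm{Cov}(y_{it_1}, y_{it_2}) = \mathbf{m}_{it_1}'\mathbf{G}\mathbf{m}_{it_2} + \mathbf{m}_{it_1}'\mathbf{E}\mathbf{m}_{it_2}$$
$$\mathbf{G} = \mathrm{Var}\begin{bmatrix}a_{0i}\\ a_{1i}\\ a_{2i}\end{bmatrix} = \begin{bmatrix}\sigma_{a_0}^2 & \sigma_{a_0a_1} & \sigma_{a_0a_2}\\ \sigma_{a_0a_1} & \sigma_{a_1}^2 & \sigma_{a_1a_2}\\ \sigma_{a_0a_2} & \sigma_{a_1a_2} & \sigma_{a_2}^2\end{bmatrix}$$
$$\mathbf{E} = \mathrm{Var}\begin{bmatrix}p_{0i}\\ p_{1i}\\ p_{2i}\end{bmatrix} = \begin{bmatrix}\sigma_{pe_0}^2 & \sigma_{pe_0pe_1} & \sigma_{pe_0pe_2}\\ \sigma_{pe_0pe_1} & \sigma_{pe_1}^2 & \sigma_{pe_1pe_2}\\ \sigma_{pe_0pe_2} & \sigma_{pe_1pe_2} & \sigma_{pe_2}^2\end{bmatrix}$$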
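The variance and covariance expressions are easy to evaluate numerically. The sketch below assumes a quadratic regression on age in months and borrows the values of G, E and the residual variance from the stature example given later in these notes; the function names are illustrative, not part of the original notes.

```python
import numpy as np

# Covariance matrices of the random regression coefficients and the
# residual variance, copied from the stature example in these notes.
G = np.array([[94.0, -3.34, 0.03098],
              [-3.34, 0.15, -0.00144],
              [0.03098, -0.00144, 0.000014]])
E = np.array([[31.6981, -1.1263, 0.010447],
              [-1.1263, 0.05058, -0.00048559],
              [0.010447, -0.00048559, 0.00000472]])
VAR_E = 9.0

def covariates(x):
    """Return m_it = [1, x, x^2] for a quadratic random regression."""
    return np.array([1.0, x, x * x])

def genetic_cov(x1, x2):
    """Additive genetic covariance between ages x1 and x2: m1' G m2."""
    return covariates(x1) @ G @ covariates(x2)

def phenotypic_cov(x1, x2):
    """Covariance between two records of one animal at ages x1 and x2.

    The residual variance is added only when x1 == x2, i.e. for the
    variance of a single record.
    """
    cov = covariates(x1) @ (G + E) @ covariates(x2)
    return cov + VAR_E if x1 == x2 else cov

print("genetic variance at 24 mo:      ", genetic_cov(24, 24))
print("phenotypic variance at 24 mo:   ", phenotypic_cov(24, 24))
print("phenotypic cov, 24 and 36 mo:   ", phenotypic_cov(24, 36))
```

Because the covariates enter as a quadratic form, the same G and E yield a different variance at every age, which is exactly the flexibility the repeatability model lacks.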
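Across records for animal i:
$$\mathbf{y}_i = \begin{bmatrix}y_{i1}\\ y_{i2}\\ \vdots\\ y_{iT}\end{bmatrix} = \mathbf{1}\mu + \mathbf{M}_i\begin{bmatrix}b_0\\ b_1\\ b_2\end{bmatrix} + \mathbf{M}_i\begin{bmatrix}a_{0i}\\ a_{1i}\\ a_{2i}\end{bmatrix} + \mathbf{M}_i\begin{bmatrix}p_{0i}\\ p_{1i}\\ p_{2i}\end{bmatrix} + \begin{bmatrix}e_{i1}\\ e_{i2}\\ \vdots\\ e_{iT}\end{bmatrix}, \qquad \mathbf{M}_i = \begin{bmatrix}1 & x_{i1} & x_{i1}^2\\ 1 & x_{i2} & x_{i2}^2\\ \vdots & \vdots & \vdots\\ 1 & x_{iT} & x_{iT}^2\end{bmatrix}$$
or, with the mean and any other fixed effects written as $\mathbf{W}_i\mathbf{c}_i$ and the fixed regressions as $\mathbf{X}_i\mathbf{b}$ (here $\mathbf{X}_i = \mathbf{M}_i$):
$$\mathbf{y}_i = \mathbf{W}_i\mathbf{c}_i + \mathbf{X}_i\mathbf{b} + \mathbf{M}_i\mathbf{a}_i + \mathbf{M}_i\mathbf{p}_i + \mathbf{e}_i$$
Across animals with observations sorted by animal and trait within animal:
$$\mathbf{y} = \mathbf{W}\mathbf{c} + \mathbf{X}\mathbf{b} + \mathbf{M}\mathbf{a} + \mathbf{M}\mathbf{p} + \mathbf{e}$$
where M is block diagonal with blocks $\mathbf{M}_i$, and
$$\mathbf{b} = \begin{bmatrix}b_0\\ b_1\\ b_2\end{bmatrix}, \qquad
\mathbf{a} = \begin{bmatrix}\mathbf{a}_1\\ \mathbf{a}_2\\ \vdots\\ \mathbf{a}_n\end{bmatrix} \sim N(\mathbf{0}, \mathbf{A} \otimes \mathbf{G}), \qquad
\mathbf{p} = \begin{bmatrix}\mathbf{p}_1\\ \mathbf{p}_2\\ \vdots\\ \mathbf{p}_n\end{bmatrix} \sim N(\mathbf{0}, \mathbf{I} \otimes \mathbf{E}), \qquad
\mathbf{e} = \begin{bmatrix}\mathbf{e}_1\\ \mathbf{e}_2\\ \vdots\\ \mathbf{e}_n\end{bmatrix} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2)$$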
Example data on stature of four cows (After L.R. Schaeffer, 2001)
All cows are in the same herd and measured at four different visits (potentially by
different evaluators)

                 Visit 1            Visit 2            Visit 3            Visit 4
Cow  Sire  Dam   Age (mo) Stature   Age (mo) Stature   Age (mo) Stature   Age (mo) Stature
1    7     5     22       24        34       36        47       39        -        -
2    7     6     30       44        42       47        55       41        66       44
3    8     5     28       24        40       42        -        -         -        -
4    8     1     -        -         20       20        33       34        44       28

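$$y_{it} = b_0 + b_1x_{it} + b_2x_{it}^2 + \text{visit} + a_{0i} + a_{1i}x_{it} + a_{2i}x_{it}^2 + p_{0i} + p_{1i}x_{it} + p_{2i}x_{it}^2 + e_{it}$$
$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{W}\mathbf{c} + \mathbf{M}\mathbf{a} + \mathbf{M}\mathbf{p} + \mathbf{e}$$
$$\sigma_e^2 = 9, \qquad \sigma_c^2 = 4$$
$$\mathbf{G} = \begin{bmatrix}94 & -3.34 & 0.03098\\ -3.34 & 0.15 & -0.00144\\ 0.03098 & -0.00144 & 0.000014\end{bmatrix}, \qquad
\mathbf{E} = \begin{bmatrix}31.6981 & -1.1263 & 0.010447\\ -1.1263 & 0.05058 & -0.00048559\\ 0.010447 & -0.00048559 & 0.00000472\end{bmatrix}$$
Each row of W has a single 1, in the column of the visit at which the record was taken.

  y      X                  W
 24      1   22    484      1  0  0  0
 36      1   34   1156      0  1  0  0
 39      1   47   2209      0  0  1  0
 44      1   30    900      1  0  0  0
 47      1   42   1764      0  1  0  0
 41      1   55   3025      0  0  1  0
 44      1   66   4356      0  0  0  1
 24      1   28    784      1  0  0  0
 42      1   40   1600      0  1  0  0
 20      1   20    400      0  1  0  0
 34      1   33   1089      0  0  1  0
 28      1   44   1936      0  0  0  1

M is a 12 x 12 block-diagonal matrix with one block per cow (zeros elsewhere):
$$\mathbf{M} = \begin{bmatrix}\mathbf{M}_1 & & & \\ & \mathbf{M}_2 & & \\ & & \mathbf{M}_3 & \\ & & & \mathbf{M}_4\end{bmatrix}, \quad
\mathbf{M}_1 = \begin{bmatrix}1 & 22 & 484\\ 1 & 34 & 1156\\ 1 & 47 & 2209\end{bmatrix}, \quad
\mathbf{M}_2 = \begin{bmatrix}1 & 30 & 900\\ 1 & 42 & 1764\\ 1 & 55 & 3025\\ 1 & 66 & 4356\end{bmatrix}, \quad
\mathbf{M}_3 = \begin{bmatrix}1 & 28 & 784\\ 1 & 40 & 1600\end{bmatrix}, \quad
\mathbf{M}_4 = \begin{bmatrix}1 & 20 & 400\\ 1 & 33 & 1089\\ 1 & 44 & 1936\end{bmatrix}$$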
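The design matrices of the example can be assembled directly from the records. The sketch below assumes the records are held as (cow, visit, age, stature) tuples as read from the example table; the function name `design_matrices` and this tuple layout are illustrative choices.

```python
import numpy as np

# (cow, visit, age in months, stature), taken from the example table
records = [
    (1, 1, 22, 24), (1, 2, 34, 36), (1, 3, 47, 39),
    (2, 1, 30, 44), (2, 2, 42, 47), (2, 3, 55, 41), (2, 4, 66, 44),
    (3, 1, 28, 24), (3, 2, 40, 42),
    (4, 2, 20, 20), (4, 3, 33, 34), (4, 4, 44, 28),
]

def design_matrices(records, n_cows=4, n_visits=4):
    """Build y, X (fixed quadratic), W (visits) and M (random regressions)."""
    n = len(records)
    y = np.array([r[3] for r in records], dtype=float)
    ages = np.array([r[2] for r in records], dtype=float)
    X = np.column_stack([np.ones(n), ages, ages**2])
    W = np.zeros((n, n_visits))
    M = np.zeros((n, 3 * n_cows))
    for row, (cow, visit, _, _) in enumerate(records):
        W[row, visit - 1] = 1.0                   # one visit per record
        M[row, 3 * (cow - 1):3 * cow] = X[row]    # covariates in the cow's block
    return y, X, W, M

y, X, W, M = design_matrices(records)
print(X.shape, W.shape, M.shape)
```

The same M serves for both the genetic and the permanent environmental regressions of the cows with records, because both use the covariates [1, x, x²].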
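Mixed Model Equations:
$$\begin{bmatrix}
\mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{W} & \mathbf{X}'\mathbf{M} & \mathbf{0} & \mathbf{X}'\mathbf{M}\\
\mathbf{W}'\mathbf{X} & \mathbf{W}'\mathbf{W} + \mathbf{I}\tfrac{9}{4} & \mathbf{W}'\mathbf{M} & \mathbf{0} & \mathbf{W}'\mathbf{M}\\
\mathbf{M}'\mathbf{X} & \mathbf{M}'\mathbf{W} & \mathbf{M}'\mathbf{M} + \mathbf{A}^{nn} \otimes \mathbf{G}^{-1} & \mathbf{A}^{nb} \otimes \mathbf{G}^{-1} & \mathbf{M}'\mathbf{M}\\
\mathbf{0} & \mathbf{0} & \mathbf{A}^{bn} \otimes \mathbf{G}^{-1} & \mathbf{A}^{bb} \otimes \mathbf{G}^{-1} & \mathbf{0}\\
\mathbf{M}'\mathbf{X} & \mathbf{M}'\mathbf{W} & \mathbf{M}'\mathbf{M} & \mathbf{0} & \mathbf{M}'\mathbf{M} + \mathbf{I} \otimes \mathbf{E}^{-1}
\end{bmatrix}
\begin{bmatrix}\hat{\mathbf{b}}\\ \hat{\mathbf{c}}\\ \hat{\mathbf{a}}_n\\ \hat{\mathbf{a}}_b\\ \hat{\mathbf{p}}\end{bmatrix} =
\begin{bmatrix}\mathbf{X}'\mathbf{y}\\ \mathbf{W}'\mathbf{y}\\ \mathbf{M}'\mathbf{y}\\ \mathbf{0}\\ \mathbf{M}'\mathbf{y}\end{bmatrix}$$
$$\mathbf{A}^{-1} = \begin{bmatrix}\mathbf{A}^{nn} & \mathbf{A}^{nb}\\ \mathbf{A}^{bn} & \mathbf{A}^{bb}\end{bmatrix}$$
with n / b corresponding to animals with / without observations.
Note that $\mathbf{I}\tfrac{9}{4} = \mathbf{I}\sigma_e^2/\sigma_c^2$, i.e. the equations are multiplied through by $\sigma_e^2 = 9$; the terms in
$\mathbf{G}^{-1}$ and $\mathbf{E}^{-1}$ must be scaled by $\sigma_e^2$ in the same way.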
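A minimal numerical sketch of these equations follows. It makes several assumptions that are not spelled out in the notes: the pedigree is read from the Sire and Dam columns of the example table with animals 5 to 8 treated as unrelated base animals, A is built with the tabular method, and the equations are formed on the unscaled form (data terms divided by the residual variance, inverse covariance matrices added as they are). The solutions are therefore not guaranteed to reproduce the tabulated values digit for digit, since G and E are rounded and nearly singular.

```python
import numpy as np

# (cow, visit, age in months, stature) from the example table
records = [
    (1, 1, 22, 24), (1, 2, 34, 36), (1, 3, 47, 39),
    (2, 1, 30, 44), (2, 2, 42, 47), (2, 3, 55, 41), (2, 4, 66, 44),
    (3, 1, 28, 24), (3, 2, 40, 42),
    (4, 2, 20, 20), (4, 3, 33, 34), (4, 4, 44, 28),
]
# animal: (sire, dam), 0 = unknown; parents are listed before their offspring
pedigree = {5: (0, 0), 6: (0, 0), 7: (0, 0), 8: (0, 0),
            1: (7, 5), 2: (7, 6), 3: (8, 5), 4: (8, 1)}
G = np.array([[94.0, -3.34, 0.03098],
              [-3.34, 0.15, -0.00144],
              [0.03098, -0.00144, 0.000014]])
E = np.array([[31.6981, -1.1263, 0.010447],
              [-1.1263, 0.05058, -0.00048559],
              [0.010447, -0.00048559, 0.00000472]])
VAR_E, VAR_C = 9.0, 4.0

def relationship_matrix(pedigree):
    """Numerator relationship matrix A by the tabular method."""
    n = len(pedigree)
    A = np.zeros((n, n))
    done = []
    for animal, (sire, dam) in pedigree.items():
        i = animal - 1
        for other in done:
            j = other - 1
            value = 0.0
            if sire:
                value += 0.5 * A[j, sire - 1]
            if dam:
                value += 0.5 * A[j, dam - 1]
            A[i, j] = A[j, i] = value
        inbreeding = 0.5 * A[sire - 1, dam - 1] if sire and dam else 0.0
        A[i, i] = 1.0 + inbreeding
        done.append(animal)
    return A

n_rec = len(records)
y = np.array([r[3] for r in records], dtype=float)
ages = np.array([r[2] for r in records], dtype=float)
X = np.column_stack([np.ones(n_rec), ages, ages**2])
W = np.zeros((n_rec, 4))
Z = np.zeros((n_rec, 3 * 8))   # all 8 animals; columns of animals 5-8 stay zero
M = np.zeros((n_rec, 3 * 4))   # permanent environment: cows with records only
for row, (cow, visit, _, _) in enumerate(records):
    W[row, visit - 1] = 1.0
    Z[row, 3 * (cow - 1):3 * cow] = X[row]
    M[row, 3 * (cow - 1):3 * cow] = X[row]

A = relationship_matrix(pedigree)
T = np.hstack([X, W, Z, M])                    # 3 + 4 + 24 + 12 = 43 columns
penalty = np.zeros((43, 43))
penalty[3:7, 3:7] = np.eye(4) / VAR_C          # visit effects
penalty[7:31, 7:31] = np.kron(np.linalg.inv(A), np.linalg.inv(G))
penalty[31:43, 31:43] = np.kron(np.eye(4), np.linalg.inv(E))

lhs = T.T @ T / VAR_E + penalty
rhs = T.T @ y / VAR_E
solution = np.linalg.solve(lhs, rhs)
b_hat, c_hat = solution[:3], solution[3:7]
a_hat = solution[7:31].reshape(8, 3)           # rows: animals 1..8; columns a0, a1, a2
p_hat = solution[31:43].reshape(4, 3)          # rows: cows 1..4; columns p0, p1, p2
print(np.round(b_hat, 4))
```

Ordering the animals 1 to 8 with the recorded cows first means the partition of A⁻¹ into blocks for animals with and without observations happens implicitly: the Kronecker product with the full A⁻¹ supplies all four blocks at once.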
MME solutions:
$$\hat{\mathbf{b}}' = (0.5131 \quad 1.7794 \quad -0.01298)$$
$$\hat{\mathbf{c}}' = (14.5400 \quad 13.7436 \quad 6.3133 \quad -1.0837)$$

Animal   a0          a1          a2          p0          p1          p2
1        -1.720812    0.039069   -0.000334   -0.76127     0.012384   -0.000093
2         9.908001   -0.286581    0.002556    3.536084   -0.101685    0.000906
3        -8.723928    0.245155   -0.002166   -2.737505    0.073743   -0.000643
4        -2.972874    0.108809   -0.001022   -0.037309    0.015559   -0.00017
5        -5.215457    0.139228   -0.001218
6         5.243107   -0.150764    0.001344
7         4.086682   -0.120872    0.00108
8        -4.114334    0.132408   -0.001206

The EBV for animal i at age x months can be computed as:
$$EBV_{ix} = \hat{a}_{0i} + \hat{a}_{1i}x + \hat{a}_{2i}x^2$$
An overall EBV can be computed based on economic values. For example, if the
economic values of stature at ages 24, 36, and 48 months are 2, 1, and 0.5, the total EBV
for animal i can be computed as:
$$TEBV_i = 2 \cdot EBV_{i,24} + 1 \cdot EBV_{i,36} + 0.5 \cdot EBV_{i,48}$$

          EBV at                      Total
Animal    24 mo    36 mo    48 mo     EBV
1         -0.98    -0.75    -0.62     -3.01
2          4.50     2.90     2.04     12.93
3         -4.09    -2.71    -1.95    -11.85
4         -0.95    -0.38    -0.10     -2.33
5         -2.58    -1.78    -1.34     -7.60
6          2.40     1.56     1.10      6.91
7          1.81     1.13     0.77      5.14
8         -1.63    -0.91    -0.54     -4.44

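The EBV and total EBV columns can be recomputed from the genetic regression solutions. The sketch below copies the a0, a1, a2 solutions from the table of MME solutions; the array and function names are illustrative.

```python
import numpy as np

# Solutions a0, a1, a2 for animals 1..8, copied from the MME solutions table
a_hat = np.array([
    [-1.720812,  0.039069, -0.000334],
    [ 9.908001, -0.286581,  0.002556],
    [-8.723928,  0.245155, -0.002166],
    [-2.972874,  0.108809, -0.001022],
    [-5.215457,  0.139228, -0.001218],
    [ 5.243107, -0.150764,  0.001344],
    [ 4.086682, -0.120872,  0.00108],
    [-4.114334,  0.132408, -0.001206],
])

def ebv_at_age(a_hat, age):
    """EBV_ix = a0 + a1*x + a2*x^2 for every animal at one age."""
    return a_hat @ np.array([1.0, age, age**2])

ages = [24, 36, 48]
economic_values = np.array([2.0, 1.0, 0.5])

ebv = np.column_stack([ebv_at_age(a_hat, x) for x in ages])  # animals x ages
total_ebv = ebv @ economic_values

for animal, (row, total) in enumerate(zip(ebv, total_ebv), start=1):
    print(animal, np.round(row, 2), round(float(total), 2))
```

Because the total EBV is linear in the regression solutions, the same result is obtained by first weighting the covariate vectors of the three ages by the economic values and then multiplying once by the solutions of each animal.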
[Figures: "Genetic growth curve" and "EBV"; vertical axes Stature and EBV for stature, each plotted against Age (months).]
4. COVARIANCE FUNCTIONS (After van der Werf and Schaeffer, 1997)
A covariance function (CF) describes, in mathematical terms, the covariance between
variables on the same individual at different times. For example, the covariance
between breeding values $u_l$ and $u_m$ on an animal for traits measured at ages $x_l$ and $x_m$ can
be described in terms of a polynomial of order k as:
$$\mathrm{cov}(u_l, u_m) = f(x_l, x_m) = \sum_{i=1}^{k}\sum_{j=1}^{k} c_{ij}\, x_l^{\,i-1} x_m^{\,j-1}$$
where $c_{ij}$ are the coefficients of the CF.
Ages x are often standardized ($-1 \le x \le 1$) as
$$x_m = -1 + 2\,\frac{a_m - a_{min}}{a_{max} - a_{min}}$$
where $a_{min}$ and $a_{max}$ are the first and the latest time point on the trajectory considered.
This CF can be written in matrix form by defining vectors $\mathbf{m}_t$ with elements $x_t^{\,i-1}$ for
$1 \le i \le k$:
$$\mathbf{m}_t = \begin{bmatrix}1 & x_t & x_t^2 & x_t^3 & x_t^4 & \cdots & x_t^{k-1}\end{bmatrix}$$
and a matrix C with CF coefficients $c_{ij}$. The CF for breeding values $u_l$ and $u_m$ on an
animal for traits measured at ages $x_l$ and $x_m$ can then be expressed as:
$$\mathrm{cov}(u_l, u_m) = f(x_l, x_m) = \mathbf{m}_l\mathbf{C}\mathbf{m}_m'$$
The CF of order k for the variance-covariance matrix of breeding values at T ages can
then be expressed as:
$$\mathrm{Cov}\begin{bmatrix}u_1\\ u_2\\ \vdots\\ u_T\end{bmatrix} = \tilde{\mathbf{G}} = \mathbf{M}\mathbf{C}\mathbf{M}' \qquad \text{with } \mathbf{M} = \begin{bmatrix}\mathbf{m}_1\\ \mathbf{m}_2\\ \vdots\\ \mathbf{m}_T\end{bmatrix}$$
For a known, or previously estimated, matrix $\tilde{\mathbf{G}}$ and k = T (so that M is square), the equation
$\tilde{\mathbf{G}} = \mathbf{M}\mathbf{C}\mathbf{M}'$ can be used to estimate the coefficients of the CF, i.e. the elements of matrix C, as:
$$\hat{\mathbf{C}} = \mathbf{M}^{-1}\tilde{\mathbf{G}}\mathbf{M}^{-t}$$
where the -t superscript refers to the inverse of the transpose.
For example, assume that a trait was measured at standardized ages of x = -1, 0, and 1
and that multiple-trait analysis of the data resulted in the following estimated additive
genetic variance-covariance matrix:
$$\tilde{\mathbf{G}} = \begin{bmatrix}436 & 522.3 & 424.2\\ 522.3 & 808 & 664.7\\ 424.2 & 664.7 & 558\end{bmatrix}$$
These estimates are for the additive genetic covariances for log(body weight) of male
mice at 2, 3, and 4 weeks of age (Riska et al. 1984), as discussed by Kirkpatrick et al.
(1990, Genetics 124:979).
This matrix can be represented by a CF of order 3 as follows:
Using $\mathbf{m}_t = \begin{bmatrix}1 & x_t & x_t^2\end{bmatrix}$, the matrix M for ages $x_t$ = -1, 0, and 1 is:
$$\mathbf{M} = \begin{bmatrix}1 & -1 & 1\\ 1 & 0 & 0\\ 1 & 1 & 1\end{bmatrix}$$
Solving for $\hat{\mathbf{C}} = \mathbf{M}^{-1}\tilde{\mathbf{G}}\mathbf{M}^{-t}$ gives:
$$\hat{\mathbf{C}} = \begin{bmatrix}808.0 & 71.2 & -214.5\\ 71.2 & 36.4 & -40.7\\ -214.5 & -40.7 & 81.6\end{bmatrix}$$
and the covariance function between breeding values at times $x_l$ and $x_m$ is:
$$\mathrm{cov}(u_l, u_m) = f(x_l, x_m) = \mathbf{m}_l\hat{\mathbf{C}}\mathbf{m}_m' = \begin{bmatrix}1 & x_l & x_l^2\end{bmatrix}\hat{\mathbf{C}}\begin{bmatrix}1\\ x_m\\ x_m^2\end{bmatrix}$$
$$= 808 + 71.2(x_l + x_m) + 36.4\,x_lx_m - 40.7(x_l^2x_m + x_lx_m^2) - 214.5(x_l^2 + x_m^2) + 81.6\,x_l^2x_m^2$$
Using this function we can compute the covariance between the age combinations
represented in $\tilde{\mathbf{G}}$. Because the CF has 6 coefficients, which are used to estimate the 6
unique elements of the symmetric matrix $\tilde{\mathbf{G}}$, this CF gives an exact fit to matrix $\tilde{\mathbf{G}}$. This
represents a 'full fit' CF. The unique aspect of the CF, however, is that it can be used to
estimate, by interpolation, covariances between ages that were never measured. For example,
the covariance between weight at 3 ($x_l$ = 0) and 3.5 ($x_m$ = 0.5) weeks of age is
$$f(0, 0.5) = 808 + 71.2(0.5) - 214.5(0.5^2) = 789.975 \approx 790.0$$
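The full-fit calculation can be checked numerically. The sketch below standardizes the ages of 2, 3 and 4 weeks, solves for the coefficient matrix and evaluates the covariance function at an unmeasured age; the function names are illustrative and `numpy.vander` is used to build the rows of M.

```python
import numpy as np

def standardize(age, age_min, age_max):
    """Map an age on [age_min, age_max] to the interval [-1, 1]."""
    return -1.0 + 2.0 * (age - age_min) / (age_max - age_min)

def covariate_matrix(x, order):
    """Rows m_t = [1, x_t, ..., x_t^(order-1)] for each standardized age."""
    return np.vander(np.asarray(x, dtype=float), N=order, increasing=True)

# Estimated additive genetic covariance matrix from the example
G_tilde = np.array([[436.0, 522.3, 424.2],
                    [522.3, 808.0, 664.7],
                    [424.2, 664.7, 558.0]])

x = [standardize(a, 2, 4) for a in (2, 3, 4)]    # -1, 0, 1
M = covariate_matrix(x, order=3)
M_inv = np.linalg.inv(M)
C_hat = M_inv @ G_tilde @ M_inv.T                # full fit: k = T

def cov_function(x_l, x_m, C):
    """Evaluate f(x_l, x_m) = m_l C m_m'."""
    k = C.shape[0]
    m_l = covariate_matrix([x_l], k)[0]
    m_m = covariate_matrix([x_m], k)[0]
    return m_l @ C @ m_m

print(np.round(C_hat, 1))
print(cov_function(0.0, standardize(3.5, 2, 4), C_hat))   # 3 and 3.5 weeks
```

A full fit reproduces every element of the estimated matrix exactly, so any smoothing of sampling error has to come from a reduced-order fit.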
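Reduced fit Covariance Functions.
Here, the intent is to estimate a CF that has fewer parameters than the original matrix: k<T.
This is important in particular if T is large because it will allow the variance-covariance
structure to be described by fewer parameters.
Using M with polynomials of order k<T, find the matrix C of dimensions k x k such that
$\hat{\mathbf{G}} = \mathbf{M}\mathbf{C}\mathbf{M}'$ provides the best fit to $\tilde{\mathbf{G}}$.
To find this, set:
$$\tilde{\mathbf{G}} = \mathbf{M}\mathbf{C}\mathbf{M}'$$
Pre- and post-multiply:
$$\mathbf{M}'\tilde{\mathbf{G}}\mathbf{M} = \mathbf{M}'\mathbf{M}\mathbf{C}\mathbf{M}'\mathbf{M}$$
Solve for C:
$$\hat{\mathbf{C}} = (\mathbf{M}'\mathbf{M})^{-1}\mathbf{M}'\tilde{\mathbf{G}}\mathbf{M}(\mathbf{M}'\mathbf{M})^{-1}$$
For a 2-order fit of the example matrix $\tilde{\mathbf{G}}$ we get:
$$\mathbf{M} = \begin{bmatrix}1 & -1\\ 1 & 0\\ 1 & 1\end{bmatrix}, \qquad \mathbf{M}'\mathbf{M} = \begin{bmatrix}3 & 0\\ 0 & 2\end{bmatrix}$$
Using $\hat{\mathbf{C}} = (\mathbf{M}'\mathbf{M})^{-1}\mathbf{M}'\tilde{\mathbf{G}}\mathbf{M}(\mathbf{M}'\mathbf{M})^{-1}$ this results in:
$$\hat{\mathbf{C}} = \begin{bmatrix}558.2667 & 44.06667\\ 44.06667 & 36.4\end{bmatrix}$$
and in an estimate of matrix $\tilde{\mathbf{G}}$ that can be derived as:
$$\hat{\mathbf{G}} = \mathbf{M}\hat{\mathbf{C}}\mathbf{M}' = \begin{bmatrix}506.5333 & 514.2 & 521.8667\\ 514.2 & 558.2667 & 602.3333\\ 521.8667 & 602.3333 & 682.8\end{bmatrix}$$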
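The reduced-fit formula can be sketched as a small function. The code below assumes the example matrix and the standardized ages -1, 0, 1; the function name `reduced_fit` is illustrative, and the residual sum of squares is taken over all T x T elements of the matrix.

```python
import numpy as np

G_tilde = np.array([[436.0, 522.3, 424.2],
                    [522.3, 808.0, 664.7],
                    [424.2, 664.7, 558.0]])
x = np.array([-1.0, 0.0, 1.0])

def reduced_fit(G, x, k):
    """Least-squares CF of order k: C = (M'M)^-1 M' G M (M'M)^-1."""
    M = np.vander(x, N=k, increasing=True)      # rows [1, x_t, ..., x_t^(k-1)]
    MtM_inv = np.linalg.inv(M.T @ M)
    C = MtM_inv @ M.T @ G @ M @ MtM_inv
    G_hat = M @ C @ M.T
    rss = float(np.sum((G - G_hat) ** 2))       # over all T x T elements
    return C, G_hat, rss

C_hat, G_hat, rss = reduced_fit(G_tilde, x, k=2)
print(np.round(C_hat, 4))
print(np.round(G_hat, 4))
print(round(rss, 1))
```

Calling `reduced_fit(G_tilde, x, k=3)` recovers the full fit with a residual sum of squares of essentially zero, which is a quick consistency check on the function.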
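This procedure amounts to a least-squares regression of $\tilde{\mathbf{G}}$ on M and results in a matrix
$\hat{\mathbf{G}}$ whose elements have the lowest residual sum of squares (RSS) from the original
matrix $\tilde{\mathbf{G}}$. In fact, it gives the same results as the procedure suggested by Kirkpatrick et
al. (1990), which was to stack the elements of $\tilde{\mathbf{G}}$ into a vector $\tilde{\mathbf{g}}$ as:
$$\tilde{\mathbf{g}}' = [\tilde{G}(1,1),\ \tilde{G}(2,1),\ \ldots,\ \tilde{G}(T,1),\ \tilde{G}(1,2),\ \ldots,\ \tilde{G}(T,2),\ \ldots,\ \tilde{G}(1,T),\ \ldots,\ \tilde{G}(T,T)]$$
and fit the following LS model:
$$\tilde{\mathbf{g}} = \mathbf{X}\mathbf{c} + \mathbf{e} \qquad \text{with } \mathbf{X} = \mathbf{M} \otimes \mathbf{M}$$
Solve as:
$$\hat{\mathbf{c}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\tilde{\mathbf{g}}$$
Matrix $\hat{\mathbf{C}}$ is then derived by unstacking the elements of vector $\hat{\mathbf{c}}$.
For our example we get:
$$\mathbf{X} = \begin{bmatrix}1 & -1 & -1 & 1\\ 1 & 0 & -1 & 0\\ 1 & 1 & -1 & -1\\ 1 & -1 & 0 & 0\\ 1 & 0 & 0 & 0\\ 1 & 1 & 0 & 0\\ 1 & -1 & 1 & -1\\ 1 & 0 & 1 & 0\\ 1 & 1 & 1 & 1\end{bmatrix}, \qquad
\tilde{\mathbf{g}} = \begin{bmatrix}436\\ 522.3\\ 424.2\\ 522.3\\ 808\\ 664.7\\ 424.2\\ 664.7\\ 558\end{bmatrix}$$
$$\mathbf{X}'\mathbf{X} = \begin{bmatrix}9 & 0 & 0 & 0\\ 0 & 6 & 0 & 0\\ 0 & 0 & 6 & 0\\ 0 & 0 & 0 & 4\end{bmatrix}, \qquad
(\mathbf{X}'\mathbf{X})^{-1} = \begin{bmatrix}0.111111 & 0 & 0 & 0\\ 0 & 0.166667 & 0 & 0\\ 0 & 0 & 0.166667 & 0\\ 0 & 0 & 0 & 0.25\end{bmatrix}$$
$$\mathbf{X}'\tilde{\mathbf{g}} = \begin{bmatrix}5024.4\\ 264.4\\ 264.4\\ 145.6\end{bmatrix}, \qquad
\hat{\mathbf{c}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\tilde{\mathbf{g}} = \begin{bmatrix}558.2667\\ 44.06667\\ 44.06667\\ 36.4\end{bmatrix}, \qquad
\hat{\mathbf{C}} = \begin{bmatrix}558.2667 & 44.06667\\ 44.06667 & 36.4\end{bmatrix}$$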
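The stacked regression can be reproduced with a Kronecker product. The sketch below assumes column-wise stacking of the example matrix, matching the order of the stacked vector above, and uses `numpy.linalg.lstsq` for the least-squares solve.

```python
import numpy as np

G_tilde = np.array([[436.0, 522.3, 424.2],
                    [522.3, 808.0, 664.7],
                    [424.2, 664.7, 558.0]])
x = np.array([-1.0, 0.0, 1.0])
k = 2

M = np.vander(x, N=k, increasing=True)       # rows [1, x_t]
X = np.kron(M, M)                            # 9 x 4 design matrix
g = G_tilde.flatten(order="F")               # stack column by column

c_hat, *_ = np.linalg.lstsq(X, g, rcond=None)
C_hat = c_hat.reshape(k, k, order="F")       # unstack into a k x k matrix

print(X.T @ X)
print(np.round(c_hat, 5))
print(np.round(C_hat, 5))
```

The result matches the matrix formula of the reduced fit, which illustrates that the two procedures are the same least-squares problem written in two ways.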
The RSS for the fit of order 2 can be computed by taking the sum of squares of the
elements of the following residual matrix:
$$\hat{\mathbf{G}} - \tilde{\mathbf{G}} = \begin{bmatrix}70.53 & -8.10 & 97.67\\ -8.10 & -249.73 & -62.37\\ 97.67 & -62.37 & 124.80\end{bmatrix}$$
This results in $RSS_2$ = 109904.7.
Significance of an increase in goodness of fit for a model of order k+p over a model of
order k can be tested by an F-test as:
$$F = \frac{(RSS_k - RSS_{k+p})\,/\,(df_k - df_{k+p})}{RSS_{k+p}\,/\,df_{k+p}}$$
where $df_k$ is the residual degrees of freedom for the fit of order k, which is equal to the
number of unique elements in the original matrix minus that in a matrix of order k:
$$df_k = T(T+1)/2 - k(k+1)/2$$
This statistic can be tested against an F-value with $df_k - df_{k+p}$ and $df_{k+p}$ degrees of freedom.
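As a small sketch of the arithmetic, the degrees of freedom and the F statistic can be computed as below. The function names and the RSS values in the usage line are hypothetical; the check for a full fit is an added safeguard, since the statistic is undefined when the larger model leaves no residual degrees of freedom.

```python
def residual_df(T, k):
    """df_k = T(T+1)/2 - k(k+1)/2."""
    return T * (T + 1) // 2 - k * (k + 1) // 2

def f_statistic(rss_k, rss_kp, T, k, p):
    """F statistic for a fit of order k+p against a fit of order k."""
    df_k = residual_df(T, k)
    df_kp = residual_df(T, k + p)
    if df_kp <= 0:
        raise ValueError("order k+p is a full fit: no residual degrees of freedom")
    return ((rss_k - rss_kp) / (df_k - df_kp)) / (rss_kp / df_kp)

# Hypothetical RSS values for T = 4 ages, comparing order 1 with order 2.
F = f_statistic(rss_k=200.0, rss_kp=100.0, T=4, k=1, p=1)
print(residual_df(4, 1), residual_df(4, 2), F)
```

For the three-age example in these notes, an order-3 fit is a full fit, so only orders 1 and 2 can be compared in this way.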
The order needed to fit the matrix adequately can also be determined by evaluating the
eigenvalues of the original matrix $\tilde{\mathbf{G}}$.
Note that matrix $\hat{\mathbf{G}}$ was derived by regressing all elements of $\tilde{\mathbf{G}}$ on M, although
$\tilde{\mathbf{G}}$ is symmetric and contains only T(T+1)/2 unique elements. Since off-diagonals appear
twice in $\tilde{\mathbf{G}}$, they received twice as much emphasis in the regression analysis as the
diagonals. To prevent this, regression can be performed only on the unique elements of
$\tilde{\mathbf{G}}$:
Redefine $\tilde{\mathbf{g}}$ and c as vectors of length T(T+1)/2 and k(k+1)/2, containing only the lower
half of the matrix elements of $\tilde{\mathbf{G}}$ and C, respectively. The rows in X corresponding to
$\tilde{G}(i,j)$, for i<j, need to be deleted. Furthermore, the columns corresponding to C(i,j), for
i<j, need to be added to the columns corresponding to C(j,i), and the columns for C(i,j)
need to be deleted. Following these steps, the matrix X has dimensions T(T+1)/2 by
k(k+1)/2:
$$\mathbf{X} = \begin{bmatrix}1 & -2 & 1\\ 1 & -1 & 0\\ 1 & 0 & -1\\ 1 & 0 & 0\\ 1 & 1 & 0\\ 1 & 2 & 1\end{bmatrix}, \qquad
\tilde{\mathbf{g}} = \begin{bmatrix}436\\ 522.3\\ 424.2\\ 808\\ 664.7\\ 558\end{bmatrix}$$
Resulting in:
$$\hat{\mathbf{c}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\tilde{\mathbf{g}} = \begin{bmatrix}568.8118\\ 38.64\\ 0.329412\end{bmatrix}$$
Unstacking gives:
$$\hat{\mathbf{C}} = \begin{bmatrix}568.8118 & 38.64\\ 38.64 & 0.329412\end{bmatrix}$$
and in an estimate of matrix $\tilde{\mathbf{G}}$ that can be derived as:
$$\hat{\mathbf{G}} = \mathbf{M}\hat{\mathbf{C}}\mathbf{M}' = \begin{bmatrix}491.86 & 530.17 & 568.48\\ 530.17 & 568.81 & 607.45\\ 568.48 & 607.45 & 646.42\end{bmatrix}$$
The RSS of this matrix is equal to 116463.2 when considering all elements, but 92306.5
when considering only half-stored elements. The latter RSS has a Chi-square distribution
with [T(T+1)/2 - k(k+1)/2] degrees of freedom. The RSS of half-stored elements of $\hat{\mathbf{G}}$
derived using the full-stored matrix is equal to 96410.7.
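The construction of the half-stored design matrix can be sketched as follows. The function name `half_stored_design` is illustrative; the code assumes the lower triangle is taken column by column, which is the order of the stacked vector in the example above.

```python
import numpy as np

G_tilde = np.array([[436.0, 522.3, 424.2],
                    [522.3, 808.0, 664.7],
                    [424.2, 664.7, 558.0]])
x = np.array([-1.0, 0.0, 1.0])
k = 2
M = np.vander(x, N=k, increasing=True)

def half_stored_design(M):
    """Design matrix for the unique elements of G regressed on the unique
    elements of C (lower triangles, taken column by column)."""
    T, k = M.shape
    rows = []
    for j in range(T):
        for i in range(j, T):                    # element G(i, j), i >= j
            row = []
            for s in range(k):
                for r in range(s, k):            # coefficient C(r, s), r >= s
                    term = M[i, r] * M[j, s]
                    if r != s:                   # add the merged C(s, r) column
                        term += M[i, s] * M[j, r]
                    row.append(term)
            rows.append(row)
    return np.array(rows)

X = half_stored_design(M)
g = np.array([G_tilde[i, j] for j in range(3) for i in range(j, 3)])
c_hat, *_ = np.linalg.lstsq(X, g, rcond=None)

# Unstack c into a symmetric C and compute both versions of the RSS.
C_hat = np.zeros((k, k))
idx = 0
for s in range(k):
    for r in range(s, k):
        C_hat[r, s] = C_hat[s, r] = c_hat[idx]
        idx += 1
residual = G_tilde - M @ C_hat @ M.T
rss_full = float(np.sum(residual ** 2))
rss_half = float(np.sum(np.tril(residual) ** 2))
print(np.round(c_hat, 6), round(rss_half, 1), round(rss_full, 1))
```

Fitting only the unique elements gives diagonal and off-diagonal elements equal weight, which is why this fit has a smaller half-stored RSS than the fit obtained from the full-stored matrix.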
An additional factor to consider is that matrix $\tilde{\mathbf{G}}$ will itself consist of estimates, whose
sampling variances won't be equal. These complications can be taken into account by
using Generalized Least Squares Regression procedures, as described in Kirkpatrick et al.
(1990) as:
$$\hat{\mathbf{c}} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\tilde{\mathbf{g}}$$
where V is the variance-covariance matrix of estimates in $\tilde{\mathbf{g}}$.
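A minimal sketch of the generalized least squares step is given below, using the half-stored X and stacked vector of the example. The matrix V used in the second call is purely hypothetical (a diagonal matrix invented for illustration), since the notes do not give sampling variances for the estimates; the function name is also illustrative.

```python
import numpy as np

def gls_cf_coefficients(X, g, V):
    """c = (X' V^-1 X)^-1 X' V^-1 g, without forming V^-1 explicitly."""
    Vinv_X = np.linalg.solve(V, X)
    Vinv_g = np.linalg.solve(V, g)
    return np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_g)

# Half-stored design matrix and stacked covariances from the example
X = np.array([[1, -2, 1], [1, -1, 0], [1, 0, -1],
              [1, 0, 0], [1, 1, 0], [1, 2, 1]], dtype=float)
g = np.array([436.0, 522.3, 424.2, 808.0, 664.7, 558.0])

# With V = I, GLS reduces to ordinary least squares.
c_ols = gls_cf_coefficients(X, g, np.eye(6))

# Hypothetical V: variances assumed larger for the diagonal elements of G.
V = np.diag([4.0, 2.0, 2.0, 4.0, 2.0, 4.0])
c_gls = gls_cf_coefficients(X, g, V)

print(np.round(c_ols, 6))
print(np.round(c_gls, 6))
```

Elements of the stacked vector with larger sampling variance are down-weighted, so the GLS coefficients generally differ from the ordinary least squares ones unless V is proportional to the identity.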