MODELS FOR GENETIC ANALYSIS OF LONGITUDINAL DATA

ANS 590A
Jack Dekkers
March, 2002

Based on Notes for Short Course on ‘Random Regression in Animal Breeding’
Julius van der Werf, Larry Schaeffer
University of Guelph, 1997

1. INTRODUCTION (J. van der Werf)

In univariate analysis the basic assumption is that a single measurement arises from a single unit (experimental unit). In multivariate analysis, not one measurement but a number of different characteristics are measured on each experimental unit, e.g. milk yield, body weight and feed intake of a cow. These measurements are assumed to have a correlation structure among them. When the same physical quantity is measured sequentially over time on each experimental unit, we call them repeated measurements, which can be seen as a special form of the multivariate case.

Repeated measurements deserve special statistical treatment in the sense that their covariance pattern, which has to be taken into account, is often very structured. Repeated measurements on the same animal are generally more correlated than two measurements on different animals, and the correlation between repeated measurements may decrease as the time between them increases. Modeling the covariance structure of repeated measurements correctly is important for drawing correct inferences from such data.

Measurements that are taken along a trajectory can often be modeled as a function of the parameters that define that trajectory. The most common example of a trajectory is time, and repeated measurements are taken on a trajectory of time. The term ‘repeated measurement’ can be taken literally in the sense that the measurements can be thought of as being repeatedly influenced by identical effects, and it is only random noise that causes variation between them. However, repeatedly measuring a certain characteristic may also give information about the change over time of that characteristic.
The function that describes such a change over time may be of interest since it may help us to understand, explain, or manipulate how the characteristic changes over time. Common examples in animal production are growth curves and lactation curves.

Generally, we therefore have two main reasons to take special care when dealing with repeated measurements. The first is to achieve statistically correct models that allow correct inferences from the data. The second is to obtain information on a trait that changes gradually over time. Experiments are often set up with repeated measurements to exploit these two features.

The prime advantage of longitudinal studies (i.e. with repeated measurements over time) is their effectiveness for studying change. Notice that the interpretation of change may be very different if it is obtained from data across individuals (a cross-sectional study) or from repeated measures on the same individuals. An example is given by Diggle et al. (1994), where the relationship between reading ability and age is plotted. A first glance at the data suggests a negative relationship, because older people in the data set tended to have had less education. However, repeated observations on individuals showed a clear improvement of reading ability over time.

The other advantage of longitudinal studies is that they often increase statistical power. The influence of variability across experimental units is canceled if experimental units can serve as their own control.

Both arguments are very important in animal production as well. A good example is the estimation of a growth curve. If weight were regressed on time using data across animals, not only would the resulting growth curve be less accurate, but the resulting parameters might also be badly biased if differences between animals and between animals’ environments were not taken into account.

Models that deal with repeated measurements have often been used in animal production.
In dairy cattle, multiple lactation records are often analyzed using a ‘repeatability model’. The typical feature of such a model from the genetic point of view is that repeated records are thought of as expressions of the same trait, that is, the genetic correlation between repeated lactations is considered to be equal to unity. Models that include data on individual test days have often used the same assumption. Typically, genetic evaluation models that use measures of growth consider repeated measurements as genetically different (but correlated) traits. Weaning weight and yearling weight in beef cattle are usually analyzed in a multiple trait model.

Repeatability models are often used because of their simplicity. With several measurements per animal, they require much less computational effort and fewer parameters than a multiple trait model. A multiple trait model would often seem more correct, since it allows genetic correlations to differ between different measurements. However, the covariance matrix for measurements at very many different ages would be highly overparameterised. Also, an unstructured covariance matrix may not be the most desirable for repeated measurements that are recorded along a trajectory. As the mean of measurements is a function of time, so also may their covariance structure be. A model that allows the covariance between measurements to change gradually over time, with the change dependent upon the differences between times, can make use of a covariance function.

As was stated earlier, repeated measurements can often be used to generate knowledge about the change of a trait over time. Whole families of models have been especially designed to describe such changes as regression on time, e.g. lactation curves and growth curves. The analysis may reveal causes of variation that influence this change. Parameters that describe typical features of such change, e.g.
the slope of a growth curve, are regressions that may be influenced by feeding levels, environment, or breeds. There may also be additive genetic variation within breeds for such parameters. One option is then to estimate curve parameters for each animal and determine components of variation for such parameters. Another option is to use a model for analysis that allows regression coefficients to vary from animal to animal. Such regression coefficients are then not fixed, but are allowed to vary according to a distribution that can be assigned to them, and are therefore referred to as random regression coefficients.

This course will present models that use random regression in animal breeding. Typical applications are for traits that are measured repeatedly along a trajectory, e.g. time. Different random regression models will be presented and compared. The features of random regression models, and the estimation of their parameters, will be discussed.

Alternative approaches to deal with traits measured repeatedly along a trajectory are the use of covariance functions and the use of multiple trait models. These approaches have much in common, since they all deal with changing covariances along a trajectory. Differences that seem to exist are most often due to differences in the model, and not necessarily due to the approach followed. This course will present and discuss the different methods, and show where they can be equivalent. Different models that allow the study of genetic aspects of changes of traits along a trajectory will be presented and discussed.

Most of the examples will refer to test day production records in dairy cattle, since test day models have been used mostly to develop and compare random regression models. However, the procedures and models presented have a much wider scope for use, since many characters have multiple expressions, and often there is an interest in how the expression changes over time.
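A good example is the analysis of traits related to growth. Another generalization is that the methodology developed does not necessarily refer to traits that are modeled as a function of time (i.e. regressed on a time variable).

2. STANDARD GENETIC MODELS FOR LONGITUDINAL DATA

Animal model for a single phenotypic record of animal i:

$$y_i = \mathbf{x}_i'\mathbf{b} + u_i + e_i$$

Across animals:

$$\mathbf{y} = \mathbf{Xb} + \mathbf{Zu} + \mathbf{e}$$

$$\mathbf{u} = (u_1 \ldots u_n)' \sim N(\mathbf{0}, \sigma_a^2\mathbf{A}) \qquad \mathbf{e} = (e_1 \ldots e_n)' \sim N(\mathbf{0}, \sigma_e^2\mathbf{I})$$

Repeatability animal model

T common measuring times/ages for the n animals. Observation at age/time t on animal i:

$$y_{it} = \mathbf{x}_{it}'\mathbf{b} + u_i + p_i + e_{it}$$

$\mathbf{x}_{it}'\mathbf{b}$ must include some effects to account for the age/time of measurement:
- Discrete class variable
- Polynomial function; e.g. $\mathbf{x}_{it}'\mathbf{b} = \sum_{k=0}^{q} b_k x_{it}^k$

Across animals:

$$\mathbf{y} = \mathbf{Xb} + \mathbf{Zu} + \mathbf{Wp} + \mathbf{e} \qquad \mathbf{p} = (p_1 \ldots p_n)' \sim N(\mathbf{0}, \sigma_{pe}^2\mathbf{I})$$

Assumes the same genetic trait is expressed at each age/time.

Multiple trait animal model

Assume a different genetic trait at each time/age; for measurements at age/time t on animal i:

$$y_{it} = \mathbf{x}_{it}'\mathbf{b}_t + u_{it} + e_{it}$$

Across animals for measurements at age/time t:

$$\mathbf{y}_t = \mathbf{X}_t\mathbf{b}_t + \mathbf{Z}_t\mathbf{u}_t + \mathbf{e}_t$$

Across ages; data sorted by age/time:

$$
\begin{bmatrix}\mathbf{y}_1\\ \mathbf{y}_2\\ \vdots\\ \mathbf{y}_T\end{bmatrix}
=
\begin{bmatrix}\mathbf{X}_1 & \mathbf{0} & \cdots & \mathbf{0}\\ \mathbf{0} & \mathbf{X}_2 & \cdots & \mathbf{0}\\ \vdots & \vdots & \ddots & \vdots\\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{X}_T\end{bmatrix}
\begin{bmatrix}\mathbf{b}_1\\ \mathbf{b}_2\\ \vdots\\ \mathbf{b}_T\end{bmatrix}
+
\begin{bmatrix}\mathbf{Z}_1 & \mathbf{0} & \cdots & \mathbf{0}\\ \mathbf{0} & \mathbf{Z}_2 & \cdots & \mathbf{0}\\ \vdots & \vdots & \ddots & \vdots\\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{Z}_T\end{bmatrix}
\begin{bmatrix}\mathbf{u}_1\\ \mathbf{u}_2\\ \vdots\\ \mathbf{u}_T\end{bmatrix}
+
\begin{bmatrix}\mathbf{e}_1\\ \mathbf{e}_2\\ \vdots\\ \mathbf{e}_T\end{bmatrix}
$$

$$\mathbf{y} = \mathbf{Xb} + \mathbf{Zu} + \mathbf{e}$$

$$\mathbf{u} = (\mathbf{u}_1'\ \mathbf{u}_2' \ldots \mathbf{u}_T')' \sim N(\mathbf{0}, \mathbf{G}\otimes\mathbf{A}) \qquad \mathbf{e} = (\mathbf{e}_1'\ \mathbf{e}_2' \ldots \mathbf{e}_T')' \sim N(\mathbf{0}, \mathbf{E}\otimes\mathbf{I})$$

MME:

$$
\begin{bmatrix}\mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{Z}\\ \mathbf{Z}'\mathbf{X} & \mathbf{Z}'\mathbf{Z} + (\mathbf{G}\otimes\mathbf{A})^{-1}\end{bmatrix}
\begin{bmatrix}\hat{\mathbf{b}}\\ \hat{\mathbf{u}}\end{bmatrix}
=
\begin{bmatrix}\mathbf{X}'\mathbf{y}\\ \mathbf{Z}'\mathbf{y}\end{bmatrix}
$$

Analysis of individual animal curve parameters

1) Fit a separate curve to each animal’s data:
- linear polynomial function: $y_{it} = \mathbf{x}_{it}'\boldsymbol{\beta}_i = \sum_{k=0}^{q} \beta_{ik} x_{it}^k$
- non-linear function: $y_{it} = f(x_{it}, \boldsymbol{\beta}_i)$
2) Fit an animal model to the estimates of the curve parameters: $\hat{\beta}_{ik} = \mathbf{x}_k'\mathbf{b}_k + u_{ik} + e_{ik}$
- Single or multiple trait model

3.
RANDOM REGRESSION MODELS

Suppose that the observation for animal i at time t is modeled by a quadratic polynomial of time, with regression coefficients specific to that animal:

$$y_{it} = \beta_{0i} + \beta_{1i}x_{it} + \beta_{2i}x_{it}^2 + e_{it}$$

Now, consider that the animal-specific regression coefficients $\beta_{ki}$ are determined, apart from some average ($b_k$) that applies to the whole population, by genetics ($a_{ki}$) as well as by environmental factors ($p_{ki}$) that are specific to animal i:

$$\beta_{ki} = b_k + a_{ki} + p_{ki}$$

Then, rearranging terms such that population average (fixed), genetic, and permanent environmental effects are grouped:

$$y_{it} = \underbrace{b_0 + b_1x_{it} + b_2x_{it}^2}_{\text{fixed}} + \underbrace{a_{0i} + a_{1i}x_{it} + a_{2i}x_{it}^2}_{\text{genetic}} + \underbrace{p_{0i} + p_{1i}x_{it} + p_{2i}x_{it}^2}_{\text{perm. envir.}} + e_{it}$$

$$
y_{it} = \begin{bmatrix}1 & x_{it} & x_{it}^2\end{bmatrix}\begin{bmatrix}b_0\\ b_1\\ b_2\end{bmatrix}
+ \begin{bmatrix}1 & x_{it} & x_{it}^2\end{bmatrix}\begin{bmatrix}a_{0i}\\ a_{1i}\\ a_{2i}\end{bmatrix}
+ \begin{bmatrix}1 & x_{it} & x_{it}^2\end{bmatrix}\begin{bmatrix}p_{0i}\\ p_{1i}\\ p_{2i}\end{bmatrix}
+ e_{it}
$$

$$y_{it} = \mathbf{x}_{it}'\mathbf{b} + \mathbf{m}_{it}'\mathbf{a}_i + \mathbf{m}_{it}'\mathbf{p}_i + e_{it} \qquad \text{with } \mathbf{m}_{it}' = \begin{bmatrix}1 & x_{it} & x_{it}^2\end{bmatrix}$$

Variance at time t:

$$Var(y_{it}) = \mathbf{m}_{it}'\mathbf{G}\mathbf{m}_{it} + \mathbf{m}_{it}'\mathbf{E}\mathbf{m}_{it} + \sigma_e^2$$

Covariance between time $t_1$ and $t_2$:

$$Cov(y_{it_1}, y_{it_2}) = \mathbf{m}_{it_1}'\mathbf{G}\mathbf{m}_{it_2} + \mathbf{m}_{it_1}'\mathbf{E}\mathbf{m}_{it_2}$$

$$
\mathbf{G} = Var\begin{bmatrix}a_{0i}\\ a_{1i}\\ a_{2i}\end{bmatrix}
= \begin{bmatrix}\sigma_{a_0}^2 & \sigma_{a_0a_1} & \sigma_{a_0a_2}\\ \sigma_{a_0a_1} & \sigma_{a_1}^2 & \sigma_{a_1a_2}\\ \sigma_{a_0a_2} & \sigma_{a_1a_2} & \sigma_{a_2}^2\end{bmatrix}
\qquad
\mathbf{E} = Var\begin{bmatrix}p_{0i}\\ p_{1i}\\ p_{2i}\end{bmatrix}
= \begin{bmatrix}\sigma_{pe_0}^2 & \sigma_{pe_0pe_1} & \sigma_{pe_0pe_2}\\ \sigma_{pe_0pe_1} & \sigma_{pe_1}^2 & \sigma_{pe_1pe_2}\\ \sigma_{pe_0pe_2} & \sigma_{pe_1pe_2} & \sigma_{pe_2}^2\end{bmatrix}
$$

Across records for animal i:

$$
\begin{bmatrix}y_{i1}\\ y_{i2}\\ \vdots\\ y_{iT}\end{bmatrix}
=
\begin{bmatrix}1 & x_{i1} & x_{i1}^2\\ 1 & x_{i2} & x_{i2}^2\\ \vdots & \vdots & \vdots\\ 1 & x_{iT} & x_{iT}^2\end{bmatrix}\begin{bmatrix}b_0\\ b_1\\ b_2\end{bmatrix}
+
\begin{bmatrix}1 & x_{i1} & x_{i1}^2\\ 1 & x_{i2} & x_{i2}^2\\ \vdots & \vdots & \vdots\\ 1 & x_{iT} & x_{iT}^2\end{bmatrix}\begin{bmatrix}a_{0i}\\ a_{1i}\\ a_{2i}\end{bmatrix}
+
\begin{bmatrix}1 & x_{i1} & x_{i1}^2\\ 1 & x_{i2} & x_{i2}^2\\ \vdots & \vdots & \vdots\\ 1 & x_{iT} & x_{iT}^2\end{bmatrix}\begin{bmatrix}p_{0i}\\ p_{1i}\\ p_{2i}\end{bmatrix}
+
\begin{bmatrix}e_{i1}\\ e_{i2}\\ \vdots\\ e_{iT}\end{bmatrix}
$$

$$\mathbf{y}_i = \mathbf{X}_i\mathbf{b} + \mathbf{M}_i\mathbf{a}_i + \mathbf{M}_i\mathbf{p}_i + \mathbf{e}_i$$

Across animals, with observations sorted by animal and by time within animal:

$$\mathbf{y} = \mathbf{Xb} + \mathbf{Za} + \mathbf{Wp} + \mathbf{e}$$

$$
\mathbf{b} = \begin{bmatrix}b_0\\ b_1\\ b_2\end{bmatrix} \qquad
\mathbf{a} = \begin{bmatrix}\mathbf{a}_1\\ \mathbf{a}_2\\ \vdots\\ \mathbf{a}_n\end{bmatrix} \sim N(\mathbf{0}, \mathbf{A}\otimes\mathbf{G}) \qquad
\mathbf{p} = \begin{bmatrix}\mathbf{p}_1\\ \mathbf{p}_2\\ \vdots\\ \mathbf{p}_n\end{bmatrix} \sim N(\mathbf{0}, \mathbf{I}\otimes\mathbf{E}) \qquad
\mathbf{e} = \begin{bmatrix}\mathbf{e}_1\\ \mathbf{e}_2\\ \vdots\\ \mathbf{e}_n\end{bmatrix} \sim N(\mathbf{0}, \mathbf{I}\sigma_e^2)
$$

Example data on stature of four cows (After L.R.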
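Schaeffer, 2001)

All cows are in the same herd and were measured at four different visits (potentially by different evaluators). A dash means the cow was not measured at that visit; visit assignments follow the incidence matrix W given with the data vectors.

                  Visit 1            Visit 2            Visit 3            Visit 4
Cow  Sire  Dam    Age (mo)  Stature  Age (mo)  Stature  Age (mo)  Stature  Age (mo)  Stature
1    7     5      22        24       34        36       47        39       -         -
2    7     6      30        44       42        47       55        41       66        44
3    8     5      28        24       40        42       -         -        -         -
4    8     1      -         -        20        20       33        34       44        28

Model:

$$y_{it} = b_0 + b_1x_{it} + b_2x_{it}^2 + visit + a_{0i} + a_{1i}x_{it} + a_{2i}x_{it}^2 + p_{0i} + p_{1i}x_{it} + p_{2i}x_{it}^2 + e_{it}$$

$$\mathbf{y} = \mathbf{Xb} + \mathbf{Wc} + \mathbf{Ma} + \mathbf{Mp} + \mathbf{e}$$

$$\sigma_e^2 = 9 \qquad \sigma_c^2 = 4$$

$$
\mathbf{G} = \begin{bmatrix}94 & -3.34 & 0.03098\\ -3.34 & 0.15 & -0.00144\\ 0.03098 & -0.00144 & 0.000014\end{bmatrix}
\qquad
\mathbf{E} = \begin{bmatrix}31.6981 & -1.1263 & 0.010447\\ -1.1263 & 0.05058 & -0.00048559\\ 0.010447 & -0.00048559 & 0.00000472\end{bmatrix}
$$

Data vector y and incidence matrices X (intercept, age, age squared) and W (visits 1 to 4); each record belongs to exactly one visit:

Cow   y     X: 1   age   age^2     W: v1  v2  v3  v4
1     24       1   22    484          1   0   0   0
1     36       1   34    1156         0   1   0   0
1     39       1   47    2209         0   0   1   0
2     44       1   30    900          1   0   0   0
2     47       1   42    1764         0   1   0   0
2     41       1   55    3025         0   0   1   0
2     44       1   66    4356         0   0   0   1
3     24       1   28    784          1   0   0   0
3     42       1   40    1600         0   1   0   0
4     20       1   20    400          0   1   0   0
4     34       1   33    1089         0   0   1   0
4     28       1   44    1936         0   0   0   1

M is block-diagonal over the four cows with records, each block holding the rows [1 age age^2] of that cow:

$$
\mathbf{M} = \begin{bmatrix}\mathbf{M}_1 & \mathbf{0} & \mathbf{0} & \mathbf{0}\\ \mathbf{0} & \mathbf{M}_2 & \mathbf{0} & \mathbf{0}\\ \mathbf{0} & \mathbf{0} & \mathbf{M}_3 & \mathbf{0}\\ \mathbf{0} & \mathbf{0} & \mathbf{0} & \mathbf{M}_4\end{bmatrix}
\quad
\mathbf{M}_1 = \begin{bmatrix}1 & 22 & 484\\ 1 & 34 & 1156\\ 1 & 47 & 2209\end{bmatrix}
\quad
\mathbf{M}_2 = \begin{bmatrix}1 & 30 & 900\\ 1 & 42 & 1764\\ 1 & 55 & 3025\\ 1 & 66 & 4356\end{bmatrix}
\quad
\mathbf{M}_3 = \begin{bmatrix}1 & 28 & 784\\ 1 & 40 & 1600\end{bmatrix}
\quad
\mathbf{M}_4 = \begin{bmatrix}1 & 20 & 400\\ 1 & 33 & 1089\\ 1 & 44 & 1936\end{bmatrix}
$$

Mixed Model Equations (written in variance-ratio form, so every random-effect block is scaled by $\sigma_e^2$):

$$
\begin{bmatrix}
\mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{W} & \mathbf{X}'\mathbf{M} & \mathbf{0} & \mathbf{X}'\mathbf{M}\\
\mathbf{W}'\mathbf{X} & \mathbf{W}'\mathbf{W} + \mathbf{I}\tfrac{9}{4} & \mathbf{W}'\mathbf{M} & \mathbf{0} & \mathbf{W}'\mathbf{M}\\
\mathbf{M}'\mathbf{X} & \mathbf{M}'\mathbf{W} & \mathbf{M}'\mathbf{M} + \mathbf{A}^{nn}\otimes\mathbf{G}^{-1}\sigma_e^2 & \mathbf{A}^{nb}\otimes\mathbf{G}^{-1}\sigma_e^2 & \mathbf{M}'\mathbf{M}\\
\mathbf{0} & \mathbf{0} & \mathbf{A}^{bn}\otimes\mathbf{G}^{-1}\sigma_e^2 & \mathbf{A}^{bb}\otimes\mathbf{G}^{-1}\sigma_e^2 & \mathbf{0}\\
\mathbf{M}'\mathbf{X} & \mathbf{M}'\mathbf{W} & \mathbf{M}'\mathbf{M} & \mathbf{0} & \mathbf{M}'\mathbf{M} + \mathbf{I}\otimes\mathbf{E}^{-1}\sigma_e^2
\end{bmatrix}
\begin{bmatrix}\hat{\mathbf{b}}\\ \hat{\mathbf{c}}\\ \hat{\mathbf{a}}_n\\ \hat{\mathbf{a}}_b\\ \hat{\mathbf{p}}\end{bmatrix}
=
\begin{bmatrix}\mathbf{X}'\mathbf{y}\\ \mathbf{W}'\mathbf{y}\\ \mathbf{M}'\mathbf{y}\\ \mathbf{0}\\ \mathbf{M}'\mathbf{y}\end{bmatrix}
$$

$$\mathbf{A}^{-1} = \begin{bmatrix}\mathbf{A}^{nn} & \mathbf{A}^{nb}\\ \mathbf{A}^{bn} & \mathbf{A}^{bb}\end{bmatrix}$$

with n / b corresponding to animals with / without observations.

MME solutions:

$$\hat{\mathbf{b}}' = (0.5131\ \ 1.7794\ \ 0.01298) \qquad \hat{\mathbf{c}}' = (14.5400\ \ 13.7436\ \ 6.3133\ \ 1.0837)$$

Animal  a0          a1          a2          p0          p1          p2
1       -1.720812   0.039069    -0.000334   -0.76127    0.012384    -0.000093
2       9.908001    -0.286581   0.002556    3.536084    -0.101685   0.000906
3       -8.723928   0.245155    -0.002166   -2.737505   0.073743    -0.000643
4       -2.972874   0.108809    -0.001022   -0.037309   0.015559    -0.00017
5       -5.215457   0.139228    -0.001218   -           -           -
6       5.243107    -0.150764   0.001344    -           -           -
7       4.086682    -0.120872   0.00108     -           -           -
8       -4.114334   0.132408    -0.001206   -           -           -

The EBV for animal i at age x months can be computed as:

$$EBV_{ix} = \hat{a}_{0i} + \hat{a}_{1i}x + \hat{a}_{2i}x^2$$

An overall EBV can be computed based on economic values. For example, if the economic values of stature at ages 24, 36, and 48 months are 2, 1, and 0.5, the total EBV for animal i can be computed as:

$$TEBV_i = 2\,EBV_{i,24} + 1\,EBV_{i,36} + 0.5\,EBV_{i,48}$$

        EBV at                     Total
Animal  24 mo    36 mo    48 mo    EBV
1       -0.98    -0.75    -0.62    -3.01
2       4.50     2.90     2.04     12.93
3       -4.09    -2.71    -1.95    -11.85
4       -0.95    -0.38    -0.10    -2.33
5       -2.58    -1.78    -1.34    -7.60
6       2.40     1.56     1.10     6.91
7       1.81     1.13     0.77     5.14
8       -1.63    -0.91    -0.54    -4.44

[Figure: two panels plotted against Age (months): ‘EBV’ (y-axis: EBV for stature) and ‘Genetic growth curve’ (y-axis: Stature).]

4. COVARIANCE FUNCTIONS (After van der Werf and Schaeffer, 1997)

A covariance function (CF) describes, in mathematical terms, the covariance between variables on the same individual at different times. For example, the covariance between breeding values $u_l$ and $u_m$ on an animal for traits measured at ages $x_l$ and $x_m$ can be described in terms of a polynomial of order k as:

$$cov(u_l, u_m) = f(x_l, x_m) = \sum_{i=1}^{k}\sum_{j=1}^{k} c_{ij}\, x_l^{i-1} x_m^{j-1}$$

where $c_{ij}$ are the coefficients of the CF.

Ages are often standardized ($-1 \le x \le 1$) as

$$x_m = -1 + 2\,\frac{a_m - a_{min}}{a_{max} - a_{min}}$$

where $a_{min}$ and $a_{max}$ are the first and the latest time point on the trajectory considered.

This CF can be written in matrix form by defining vectors $\mathbf{m}_t$ with elements $x_t^{i-1}$ for $1 \le i \le k$:

$$\mathbf{m}_t = \begin{bmatrix}1 & x_t & x_t^2 & x_t^3 & x_t^4 & \ldots & x_t^{k-1}\end{bmatrix}$$

and a matrix C with CF coefficients $c_{ij}$. The CF for breeding values $u_l$ and $u_m$ on an animal for traits measured at ages $x_l$ and $x_m$ can then be expressed as:

$$cov(u_l, u_m) = f(x_l, x_m) = \mathbf{m}_l\mathbf{C}\mathbf{m}_m'$$

The CF of order k for the variance-covariance matrix of breeding values at T ages can then be expressed as:

$$
Cov\begin{bmatrix}u_1\\ \vdots\\ u_T\end{bmatrix} = \tilde{\mathbf{G}} = \mathbf{MCM}' \qquad \text{with } \mathbf{M} = \begin{bmatrix}\mathbf{m}_1\\ \mathbf{m}_2\\ \vdots\\ \mathbf{m}_T\end{bmatrix}
$$

For a known, or previously estimated, matrix $\tilde{\mathbf{G}}$, the equation $\tilde{\mathbf{G}} = \mathbf{MCM}'$ can be used to estimate the coefficients of the CF, i.e.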
the elements of matrix C, as:

$$\hat{\mathbf{C}} = \mathbf{M}^{-1}\tilde{\mathbf{G}}\mathbf{M}^{-t}$$

where the $-t$ superscript refers to the inverse of the transpose.

For example, assume that a trait was measured at standardized ages of x = -1, 0, and 1 and that multiple-trait analysis of the data resulted in the following estimated additive genetic variance-covariance matrix:

$$\tilde{\mathbf{G}} = \begin{bmatrix}436 & 522.3 & 424.2\\ 522.3 & 808 & 664.7\\ 424.2 & 664.7 & 558\end{bmatrix}$$

These estimates are the additive genetic covariances for log(body weight) of male mice at 2, 3, and 4 weeks of age (Riska et al. 1984), as discussed by Kirkpatrick et al. (1990, Genetics 124:979).

This matrix can be represented by a CF of order 3 as follows. Using $\mathbf{m}_t = [1\ \ x_t\ \ x_t^2]$, the matrix M for ages $x_t$ = -1, 0, and 1 is:

$$\mathbf{M} = \begin{bmatrix}1 & -1 & 1\\ 1 & 0 & 0\\ 1 & 1 & 1\end{bmatrix}$$

Solving $\hat{\mathbf{C}} = \mathbf{M}^{-1}\tilde{\mathbf{G}}\mathbf{M}^{-t}$ gives:

$$\hat{\mathbf{C}} = \begin{bmatrix}808.0 & 71.2 & -214.5\\ 71.2 & 36.4 & -40.7\\ -214.5 & -40.7 & 81.6\end{bmatrix}$$

and the covariance function between breeding values at times $x_l$ and $x_m$ is:

$$cov(u_l, u_m) = f(x_l, x_m) = \mathbf{m}_l\hat{\mathbf{C}}\mathbf{m}_m' = \begin{bmatrix}1 & x_l & x_l^2\end{bmatrix}\hat{\mathbf{C}}\begin{bmatrix}1\\ x_m\\ x_m^2\end{bmatrix}$$

$$= 808 + 71.2(x_l + x_m) + 36.4\,x_lx_m - 40.7(x_l^2x_m + x_lx_m^2) - 214.5(x_l^2 + x_m^2) + 81.6\,x_l^2x_m^2$$

Using this function we can compute the covariance between the age combinations represented in $\tilde{\mathbf{G}}$. Because the CF has 6 coefficients, which are used to estimate the 6 unique elements of the symmetric matrix $\tilde{\mathbf{G}}$, this CF gives an exact fit to matrix $\tilde{\mathbf{G}}$. This represents a ‘full fit’ CF.

The unique aspect of the CF, however, is that it can be used to estimate, by interpolation, covariances between ages that were never measured. For example, the covariance between weight at 3 ($x_l$ = 0) and 3.5 ($x_m$ = 0.5) weeks of age is f(0, 0.5) = 808 + 71.2(0.5) - 214.5(0.5^2) = 789.975, or about 790.0.

Reduced fit Covariance Functions

Here, the intent is to estimate a CF that has fewer parameters than the original matrix: k<T. This is important in particular if T is large, because it allows the variance-covariance structure to be described by fewer parameters.
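As a numerical check on the full-fit (order 3) example worked above, the following is a minimal sketch using NumPy. The function name `cov_fn` and the variable names are illustrative choices, not part of the original notes; the matrix values are those quoted in the text.

```python
import numpy as np

# Additive genetic covariance matrix quoted in the text
# (log body weight of male mice at 2, 3 and 4 weeks of age).
G = np.array([[436.0, 522.3, 424.2],
              [522.3, 808.0, 664.7],
              [424.2, 664.7, 558.0]])

x = np.array([-1.0, 0.0, 1.0])            # standardized ages
M = np.vander(x, N=3, increasing=True)    # rows m_t = [1, x_t, x_t^2]

# Full fit: C = M^{-1} G M^{-t}
M_inv = np.linalg.inv(M)
C = M_inv @ G @ M_inv.T


def cov_fn(xl, xm, C=C):
    """Covariance between breeding values at standardized ages xl and xm."""
    powers = np.arange(C.shape[0])
    ml = xl ** powers                     # [1, xl, xl^2, ...]
    mm = xm ** powers
    return float(ml @ C @ mm)


print(np.round(C, 1))                     # CF coefficients
print(round(cov_fn(0.0, 0.5), 3))         # interpolated: 3 vs 3.5 weeks
print(np.allclose(M @ C @ M.T, G))        # full fit reproduces G exactly
```

Because M is square for a full fit, the CF passes through every element of the original matrix; the interpolated value between measured ages depends entirely on the assumed polynomial form.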
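Using M with polynomials of order k<T, find the matrix C of dimensions k x k such that $\hat{\mathbf{G}} = \mathbf{M}\hat{\mathbf{C}}\mathbf{M}'$ provides the best fit to $\tilde{\mathbf{G}}$.

To find this, set:

$$\tilde{\mathbf{G}} = \mathbf{MCM}'$$

Pre- and post-multiply:

$$\mathbf{M}'\tilde{\mathbf{G}}\mathbf{M} = \mathbf{M}'\mathbf{MCM}'\mathbf{M}$$

Solve for C:

$$\hat{\mathbf{C}} = (\mathbf{M}'\mathbf{M})^{-1}\mathbf{M}'\tilde{\mathbf{G}}\mathbf{M}(\mathbf{M}'\mathbf{M})^{-1}$$

For a 2-order fit of the example matrix $\tilde{\mathbf{G}}$ we get:

$$\mathbf{M} = \begin{bmatrix}1 & -1\\ 1 & 0\\ 1 & 1\end{bmatrix} \qquad \mathbf{M}'\mathbf{M} = \begin{bmatrix}3 & 0\\ 0 & 2\end{bmatrix}$$

Using $\hat{\mathbf{C}} = (\mathbf{M}'\mathbf{M})^{-1}\mathbf{M}'\tilde{\mathbf{G}}\mathbf{M}(\mathbf{M}'\mathbf{M})^{-1}$, this results in:

$$\hat{\mathbf{C}} = \begin{bmatrix}558.2667 & 44.06667\\ 44.06667 & 36.4\end{bmatrix}$$

and in an estimate of matrix $\tilde{\mathbf{G}}$ that can be derived as:

$$\hat{\mathbf{G}} = \mathbf{M}\hat{\mathbf{C}}\mathbf{M}' = \begin{bmatrix}506.5333 & 514.2 & 521.8667\\ 514.2 & 558.2667 & 602.3333\\ 521.8667 & 602.3333 & 682.8\end{bmatrix}$$

This procedure amounts to a least-squares regression of $\tilde{\mathbf{G}}$ on M and results in a matrix $\hat{\mathbf{G}}$ whose elements have the lowest residual sum of squares (RSS) from the original matrix $\tilde{\mathbf{G}}$. In fact, it gives the same results as the procedure suggested by Kirkpatrick et al. (1990), which was to stack the elements of $\tilde{\mathbf{G}}$ into a vector $\tilde{\mathbf{g}}$ as:

$$\tilde{\mathbf{g}}' = [\tilde{G}(1,1), \tilde{G}(2,1), \ldots, \tilde{G}(T,1), \tilde{G}(1,2), \ldots, \tilde{G}(T,2), \ldots, \tilde{G}(1,T), \ldots, \tilde{G}(T,T)]$$

and fit the following LS model:

$$\tilde{\mathbf{g}} = \mathbf{Xc} + \mathbf{e} \qquad \text{with } \mathbf{X} = \mathbf{M}\otimes\mathbf{M}$$

Solve as:

$$\hat{\mathbf{c}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\tilde{\mathbf{g}}$$

Matrix $\hat{\mathbf{C}}$ is then derived by unstacking the elements of vector $\hat{\mathbf{c}}$. For our example we get:

Element of G    g        X: 1   x_i   x_j   x_i*x_j
(1,1)           436         1   -1    -1     1
(2,1)           522.3       1    0    -1     0
(3,1)           424.2       1    1    -1    -1
(1,2)           522.3       1   -1     0     0
(2,2)           808         1    0     0     0
(3,2)           664.7       1    1     0     0
(1,3)           424.2       1   -1     1    -1
(2,3)           664.7       1    0     1     0
(3,3)           558         1    1     1     1

$$
\mathbf{X}'\mathbf{X} = \begin{bmatrix}9 & 0 & 0 & 0\\ 0 & 6 & 0 & 0\\ 0 & 0 & 6 & 0\\ 0 & 0 & 0 & 4\end{bmatrix}
\qquad
(\mathbf{X}'\mathbf{X})^{-1} = \begin{bmatrix}0.111111 & 0 & 0 & 0\\ 0 & 0.166667 & 0 & 0\\ 0 & 0 & 0.166667 & 0\\ 0 & 0 & 0 & 0.25\end{bmatrix}
$$

$$
\mathbf{X}'\tilde{\mathbf{g}} = \begin{bmatrix}5024.4\\ 264.4\\ 264.4\\ 145.6\end{bmatrix}
\qquad
\hat{\mathbf{c}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\tilde{\mathbf{g}} = \begin{bmatrix}558.2667\\ 44.06667\\ 44.06667\\ 36.4\end{bmatrix}
\qquad
\hat{\mathbf{C}} = \begin{bmatrix}558.2667 & 44.06667\\ 44.06667 & 36.4\end{bmatrix}
$$

The RSS for the fit of order 2 can be computed by taking the sum of squares of the elements of the following residual matrix:

$$\tilde{\mathbf{G}} - \hat{\mathbf{G}} = \begin{bmatrix}-70.53 & 8.10 & -97.67\\ 8.10 & 249.73 & 62.37\\ -97.67 & 62.37 & -124.80\end{bmatrix}$$

This results in RSS2 = 109904.7.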
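The order-2 reduced fit above can be reproduced with a short NumPy sketch. It computes the coefficients both by the matrix formula and by the stacked regression with X = M ⊗ M, and then the RSS. Variable names are illustrative; `np.linalg.lstsq` is used for the stacked regression instead of forming the inverse of X'X explicitly, which gives the same solution here.

```python
import numpy as np

G = np.array([[436.0, 522.3, 424.2],
              [522.3, 808.0, 664.7],
              [424.2, 664.7, 558.0]])
x = np.array([-1.0, 0.0, 1.0])
k = 2                                       # order of the reduced fit
M = np.vander(x, N=k, increasing=True)      # rows [1, x_t]

# Route 1: C = (M'M)^{-1} M' G M (M'M)^{-1}
MtM_inv = np.linalg.inv(M.T @ M)
C_hat = MtM_inv @ M.T @ G @ M @ MtM_inv
G_hat = M @ C_hat @ M.T
rss = float(np.sum((G - G_hat) ** 2))       # sum of squares over all T*T elements

# Route 2: stack G column by column and regress on X = M kron M
X = np.kron(M, M)
g = G.reshape(-1, order="F")                # column-major stacking, as in the text
c_hat, *_ = np.linalg.lstsq(X, g, rcond=None)
C_stacked = c_hat.reshape((k, k), order="F")

print(np.round(C_hat, 4))
print(np.round(G_hat, 4))
print(round(rss, 1))
print(np.allclose(C_hat, C_stacked))        # both routes agree
```

The two routes agree because vec(MCM') = (M ⊗ M) vec(C), so the stacked regression is the same least-squares problem written in vector form.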
Significance of an increase in goodness of fit for a model of order k+p over a model of order k can be tested by an F-test as:

$$F = \frac{(RSS_k - RSS_{k+p})/(df_k - df_{k+p})}{RSS_{k+p}/df_{k+p}}$$

where $df_k$ is the residual degrees of freedom for the fit of order k, which is equal to the number of unique elements in the original matrix minus that in a matrix of order k:

$$df_k = T(T+1)/2 - k(k+1)/2$$

This statistic can be tested against an F-value with $df_k - df_{k+p}$ and $df_{k+p}$ degrees of freedom.

The order needed to fit the matrix adequately can also be determined by evaluating the eigenvalues of the original matrix $\tilde{\mathbf{G}}$.

Note that matrix $\hat{\mathbf{G}}$ was derived by regressing all elements of $\tilde{\mathbf{G}}$ on M, although $\tilde{\mathbf{G}}$ is symmetric and contains only T(T+1)/2 unique elements. Since off-diagonals appear twice in $\tilde{\mathbf{G}}$, they received twice as much emphasis in the regression analysis as the diagonals. To prevent this, the regression can be performed on the unique elements of $\tilde{\mathbf{G}}$ only:

Redefine $\tilde{\mathbf{g}}$ and c as vectors of length T(T+1)/2 and k(k+1)/2, containing only the lower half of the matrix elements of $\tilde{\mathbf{G}}$ and C, respectively. The rows in X corresponding to $\tilde{G}(i,j)$ for i<j need to be deleted. Furthermore, the columns corresponding to C(i,j) for i<j need to be added to the columns corresponding to C(j,i), and the columns for C(i,j) then need to be deleted. Following these steps, the matrix X has dimensions T(T+1)/2 by k(k+1)/2:

Element of G    g        X: c11   c21   c22
(1,1)           436         1     -2     1
(2,1)           522.3       1     -1     0
(3,1)           424.2       1      0    -1
(2,2)           808         1      0     0
(3,2)           664.7       1      1     0
(3,3)           558         1      2     1

Resulting in:

$$\hat{\mathbf{c}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\tilde{\mathbf{g}} = \begin{bmatrix}568.8118\\ 38.64\\ 0.329412\end{bmatrix}$$

Unstacking gives:

$$\hat{\mathbf{C}} = \begin{bmatrix}568.8118 & 38.64\\ 38.64 & 0.329412\end{bmatrix}$$

and an estimate of matrix $\tilde{\mathbf{G}}$ that can be derived as:

$$\hat{\mathbf{G}} = \mathbf{M}\hat{\mathbf{C}}\mathbf{M}' = \begin{bmatrix}491.86 & 530.17 & 568.48\\ 530.17 & 568.81 & 607.45\\ 568.48 & 607.45 & 646.42\end{bmatrix}$$

The RSS of this matrix is equal to 116463.2 when considering all elements, but 92306.5 when considering only half-stored elements.
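The half-stored regression described above can be sketched as follows. The helper names (`half_stored_design`, `fit_half_stored`, `resid_df`, `f_statistic`) are hypothetical and chosen only for this illustration. The F-statistic at the end compares orders 1 and 2 on unweighted RSS purely to show the arithmetic; with only T = 3 ages it is not meant as a meaningful test.

```python
import numpy as np

G = np.array([[436.0, 522.3, 424.2],
              [522.3, 808.0, 664.7],
              [424.2, 664.7, 558.0]])
x = np.array([-1.0, 0.0, 1.0])


def half_stored_design(x, k):
    """Design matrix linking the lower half of G to the lower half of C."""
    T = len(x)
    g_pairs = [(i, j) for j in range(T) for i in range(j, T)]   # G(i,j), i >= j
    c_pairs = [(a, b) for b in range(k) for a in range(b, k)]   # C(a,b), a >= b
    X = np.zeros((len(g_pairs), len(c_pairs)))
    for r, (i, j) in enumerate(g_pairs):
        for s, (a, b) in enumerate(c_pairs):
            X[r, s] = x[i] ** a * x[j] ** b
            if a != b:  # fold the column of C(b,a) into that of C(a,b)
                X[r, s] += x[i] ** b * x[j] ** a
    return X, g_pairs, c_pairs


def fit_half_stored(G, x, k):
    """Least-squares CF of order k using only the unique elements of G."""
    X, g_pairs, c_pairs = half_stored_design(x, k)
    g = np.array([G[i, j] for i, j in g_pairs])
    c, *_ = np.linalg.lstsq(X, g, rcond=None)
    C_hat = np.zeros((k, k))
    for s, (a, b) in enumerate(c_pairs):
        C_hat[a, b] = C_hat[b, a] = c[s]
    M = np.vander(x, N=k, increasing=True)
    resid = G - M @ C_hat @ M.T
    rss_all = float(np.sum(resid ** 2))
    rss_half = float(sum(resid[i, j] ** 2 for i, j in g_pairs))
    return C_hat, rss_all, rss_half


def resid_df(T, k):
    """Residual degrees of freedom: unique elements of G minus those of C."""
    return T * (T + 1) // 2 - k * (k + 1) // 2


def f_statistic(rss_k, rss_kp, df_k, df_kp):
    """F for the improvement of order k+p (rss_kp, df_kp) over order k."""
    return ((rss_k - rss_kp) / (df_k - df_kp)) / (rss_kp / df_kp)


C2, rss_all2, rss_half2 = fit_half_stored(G, x, 2)
C1, _, rss_half1 = fit_half_stored(G, x, 1)
F = f_statistic(rss_half1, rss_half2, resid_df(3, 1), resid_df(3, 2))

print(np.round(C2, 4))
print(round(rss_half2, 1), round(rss_all2, 1))
print(F)
```

Building the design matrix from index pairs rather than typing it in keeps the sketch usable for other values of T and k, under the same polynomial basis.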
The RSS over half-stored elements has a Chi-square distribution with [T(T+1)/2 - k(k+1)/2] degrees of freedom. For comparison, the RSS of the half-stored elements of $\hat{\mathbf{G}}$ derived using the full-stored matrix is equal to 96410.7.

An additional factor to consider is that matrix $\tilde{\mathbf{G}}$ will itself consist of estimates, whose sampling variances will not be equal. These complications can be taken into account by using Generalized Least Squares regression procedures, as described in Kirkpatrick et al. (1990):

$$\hat{\mathbf{c}} = (\mathbf{X}'\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}'\mathbf{V}^{-1}\tilde{\mathbf{g}}$$

where V is the variance-covariance matrix of the estimates in $\tilde{\mathbf{g}}$.
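The GLS estimator above can be sketched in a few lines. The notes do not give V for the mouse example, so the diagonal matrix `V_hyp` below is entirely hypothetical and serves only to show how unequal sampling variances change the weights. The function name `gls` is also an illustrative choice. X and g are the half-stored design matrix and vector from the order-2 example.

```python
import numpy as np

# Half-stored design matrix and data vector from the order-2 example
X = np.array([[1.0, -2.0,  1.0],
              [1.0, -1.0,  0.0],
              [1.0,  0.0, -1.0],
              [1.0,  0.0,  0.0],
              [1.0,  1.0,  0.0],
              [1.0,  2.0,  1.0]])
g = np.array([436.0, 522.3, 424.2, 808.0, 664.7, 558.0])


def gls(X, g, V):
    """Generalized least squares: c = (X'V^{-1}X)^{-1} X'V^{-1} g."""
    V_inv_X = np.linalg.solve(V, X)      # V^{-1} X without forming V^{-1}
    V_inv_g = np.linalg.solve(V, g)      # V^{-1} g
    return np.linalg.solve(X.T @ V_inv_X, X.T @ V_inv_g)


# With V = I, GLS reduces to the ordinary least-squares solution
c_ols = gls(X, g, np.eye(len(g)))

# HYPOTHETICAL sampling variances (not from the notes): elements estimated
# less precisely get a larger variance and therefore less weight.
V_hyp = np.diag([4.0, 2.0, 1.0, 4.0, 2.0, 4.0])
c_gls = gls(X, g, V_hyp)

print(np.round(c_ols, 4))
print(np.round(c_gls, 4))
```

Only the relative sizes of the elements of V matter for the estimates: multiplying V by a constant leaves the solution unchanged. In practice V would also contain sampling covariances between the elements of the estimated matrix, so it would generally not be diagonal.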