Correlation in repeated measures designs

The story

Data by Ossama Dimassi (FG Valle-Zarate)
- 14 goats
- Each goat measured at 7 dates (months)
- Two traits (percentage of cheese yield, percentage of protein)

Objective: Assess the correlation of percentage cheese yield and percentage protein yield

Dahlem Cashmere (DC) is a multipurpose goat breed developed at the end of the 1980s at the Technical University of Berlin, based on crosses between Angora and dairy goats, in particular the German White dairy goat with some influence of the German Fawn and Anglo Nubian. Along with the valuable cashmere wool, DC is used for meat and milk production. Empirical results indicate that milk of DC goats has superior processing properties compared to milk of other dairy goats conventionally kept in Germany.

In order to assess the technological potential of milk of Dahlem Cashmere (DC) goats, individual milk samples from two groups of 5 DC goats (at 2nd and 3rd lactation) and one group of 5 German Fawn dairy goats (GF) at 2nd lactation were taken fortnightly over lactation periods of 32 and 28 weeks, respectively. Along with the main components (protein, casein, fat), cheese yield was measured. Significant differences have been detected between the breeds. The next step is therefore to explain this variation, at least in part, by the levels of milk components, one of which is protein, a quantitative variable. The simple correlation, however, will not account for the complex design structure and the repeated measures nature of the data.

Exploring the correlation structure

We have different options for looking at the correlation, e.g.,
- use all data, ignoring the structure (goats, months) (Fig. 1)
- compute means across months and correlate goat means (Fig. 2)
- compute means across goats and correlate month means (Fig. 3)
- look at each month separately; within-month = between-subject correlation (Fig. 4)
- look at each goat separately; within-goat = within-subject correlation (Fig. 5)

The resulting correlation coefficients are not the same, and some differ dramatically (from r = -0.05 in Fig. 5 to r = 0.96 in Fig. 2), so the question arises which correlation is the correct one. Obviously, there is structure in the data (months, goats), which needs to be accounted for. In order to partition the correlation according to the factors month and goat, a factorial model is needed.

[In all figures, c = cheese yield [%] on the vertical axis and p = protein [%] on the horizontal axis.]

Fig. 1: Plot of cheese yield [%] vs. protein [%] for repeated measurements ignoring goats and months (r = 0.85).

Fig. 2: Plot of cheese yield [%] vs. protein [%] for goat means across months (r = 0.96).

Fig. 3: Plot of cheese yield [%] vs. protein [%] for month means across goats (r = 0.65).

Fig. 4: Plot of cheese yield [%] vs. protein [%] for goats at month = 7 (r = 0.93); within-month analysis.

Fig. 5: Plot of cheese yield [%] vs. protein [%] across months for goat = 124 (r = -0.05, n.s.); within-goat analysis.

Modelling a single trait

To develop a model, it is useful to look at a single trait first and then extend to two (or more) traits.

    $y_{it} = \mu + m_t + g_i + e_{it}$    (1)

where
    $y_{it}$ = trait value for i-th goat at t-th month
    $\mu$ = general mean
    $m_t$ = effect of t-th month
    $g_i$ = effect of i-th goat
    $e_{it}$ = residual

If goats are a random sample, then $g_i$ is a random effect. Residuals $e_{it}$ from the same goat are potentially correlated due to repeated measurements on the same animal.
Here, we use an AR(1) model, which states that errors on the same goat are correlated by

    $\mathrm{Corr}(e_{it}, e_{it'}) = \rho^{|t-t'|}$    (2)

Thus, the correlation decays with distance in time. Also, errors of different goats are uncorrelated.

The model may be modified to account for a possible linear time trend using

    $y_{it} = \mu + g_i + \beta x_t + \gamma_i x_t + e_{it}$    (3)

where $x_t$ is the time in months, $\mu + g_i$ is the intercept and $\beta + \gamma_i$ is the slope of the i-th goat. If the trend is nonlinear, a simple extension is to add quadratic and, if necessary, cubic terms. The lack-of-fit of the mean regression ($\beta$) can be tested by adding the month effect $m_t$. It is important to fit $m_t$ after $\beta x_t$ and to use a Type I (sequential) analysis. The lack-of-fit of the goat-specific regression ($\gamma_i$) cannot be tested, even with replicate data per goat and month. The reason for this is that the random term $e_{it}$, which models random fluctuation around a smooth trend, would be confounded with a goat-specific lack-of-fit effect.

For performing the regression, it is necessary to define the quantitative variable $x_t$ within a data step. Here, this variable is simply coded as "t". In addition, a "month" effect ($m_t$) is fitted to test the lack-of-fit. The following analysis is done taking goats in (3) as fixed and using an AR(1) correlation model for the residual $e_{it}$.

Protein yield (percentage):

    Type 1 Tests of Fixed Effects
    Effect      Num DF    Den DF    F Value    Pr > F
    Goat            15      15.3      41.18    <.0001
    t                1      23.1       0.10    0.7558
    month            5      54.7       1.48    0.2126
    t*Goat          15      23.2       1.83    0.0924

Cheese yield (percentage):

    Type 1 Tests of Fixed Effects
    Effect      Num DF    Den DF    F Value    Pr > F
    Goat            15      4.96      83.92    <.0001
    t                1      7.64      36.03    0.0004
    t*t              1      13.2      21.26    0.0005
    month            4      38.5       1.25    0.3077
    t*Goat          15      7.09       1.55    0.2861
    t*t*Goat        14      12.5       0.67    0.7684

For both traits there is no indication of heterogeneity among goats in the time trend (the t*Goat and t*t*Goat terms are not significant). For protein there is no overall trend at all, while for cheese yield the trend is quadratic.
Thus, we would select the following models:

Protein yield (percentage):

    $y_{it} = \mu + g_i + e_{it}$    (4)

Cheese yield (percentage):

    $y_{it} = \mu + g_i + \beta_1 x_t + \beta_2 x_t^2 + e_{it}$    (5)

SAS code:

    proc mixed data=d method=reml;
    where trait='cheesey';
    class month goat;
    model y=goat t t*t month goat*t goat*t*t/ddfm=kr solution htype=1;
    repeated month/sub=goat type=ar(1);
    run;

Joint analysis for both traits

In order to come up with a joint analysis for both traits, it will be convenient to amalgamate models (4) and (5). To do so, we introduce a dummy variable identifying the trait with the more general model (5). Indexing the trait by j, the model is

    $y_{ijt} = \mu_j + g_{ij} + \beta_{1j} w_j x_t + \beta_{2j} w_j x_t^2 + e_{ijt}$    (6)

where
    $w_j = 0$ for protein (j = 1)
    $w_j = 1$ for cheese (j = 2)

The model has two random effects: the between-goat effect $g_{ij}$ and the within-goat effect $e_{ijt}$. Both of these effects will be correlated among traits. For both effects, the correlation of effects pertaining to different animals is zero. The between-goat correlation is

    $\mathrm{corr}(g_{i1}, g_{i2}) = \rho_g$

For the pair of random effects $(g_{i1}, g_{i2})$, we impose an unstructured variance-covariance matrix:

    $\mathrm{var}(g_{i1}, g_{i2}) = \Sigma_g$

The within-goat correlation, defined for a fixed point in time and for different traits, can be defined as

    $\mathrm{corr}(e_{i1t}, e_{i2t}) = \rho_e$

For the pair of random effects $(e_{i1t}, e_{i2t})$, we impose an unstructured variance-covariance matrix:

    $\mathrm{var}(e_{i1t}, e_{i2t}) = \Sigma_e$

To complete the within-goat error model, we need to account for serial correlation among observations on the same trait at different points in time. Assuming an AR(1) model, the serial correlation for the same trait at different points in time takes the form

    $\mathrm{Corr}(e_{ijt}, e_{ijt'}) = \rho_s^{|t-t'|}$

where $\rho_s$ is the serial correlation. The variance-covariance matrix for a vector of errors on the same goat and the same trait can be denoted as

    $\mathrm{var}(e_{ij1}, e_{ij2}, \ldots, e_{ij7}) = \Sigma_s$

Now what about two observations on different traits and different points in time?
A parsimonious way of modelling this is given by the direct product of within-goat correlation and serial correlation:

    $\mathrm{Corr}(e_{ijt}, e_{ij't'}) = \rho_e \rho_s^{|t-t'|}$  for $j \neq j'$

For the whole error vector, sorted by trait and time, the variance-covariance matrix is

    $\Sigma = \Sigma_e \otimes \Sigma_s$

where $\otimes$ is the direct product operator (Kronecker product). This model can be fitted in MIXED using the TYPE=UN@AR(1) option in the REPEATED statement.

SAS code:

    proc mixed data=d;
    class trait month Goat;
    model y=trait w*trait*t w*trait*t*t;
    random trait/subject=Goat type=un;
    repeated trait month/sub=goat type=un@ar(1);
    run;

Three components of between-trait correlation

(1) Between-goat correlation ($\rho_g$)
(2) Within-goat correlation ($\rho_e$)
(3) Time-trend related correlation

The first two components have already been defined as per the joint model (xx). To illustrate the meaning of (3), assume that $g_{ij}$ and $e_{ijt}$ have zero variance (are absent) and that there is a time trend in both traits. For example, if both cheese yield and protein yield percentages increase in time, this will induce a positive correlation when plotting cheese versus protein yield percentages from different points in time. In the present case, there is no significant trend in protein, so the third component of correlation can be ignored here. But it is stressed that (3) requires attention when there is a time trend in both traits.

It is useful to define a between-within-goats correlation as

    $\rho_{bw} = \mathrm{corr}(g_{i1} + e_{i1t}, g_{i2} + e_{i2t})$

This is the correlation of both traits on a randomly selected goat i at a common point in time t.
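The between-within correlation can be written out in terms of the variance components. The following is a sketch under two assumptions that are not stated explicitly in the text: the goat effects and the errors are independent of each other, and $\sigma_{g,jj'}$ and $\sigma_{e,jj'}$ denote the elements of the between-goat and within-goat variance-covariance matrices (this element notation is introduced here for illustration only).

```latex
\rho_{bw}
  = \frac{\operatorname{cov}(g_{i1}+e_{i1t},\; g_{i2}+e_{i2t})}
         {\sqrt{\operatorname{var}(g_{i1}+e_{i1t})\,\operatorname{var}(g_{i2}+e_{i2t})}}
  = \frac{\sigma_{g,12}+\sigma_{e,12}}
         {\sqrt{(\sigma_{g,11}+\sigma_{e,11})\,(\sigma_{g,22}+\sigma_{e,22})}}
```

Under these assumptions the covariances simply add in the numerator and the variances add in the denominator, so the component with the larger variances carries more weight in $\rho_{bw}$.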

Results

The fit of model (xx) is as follows:

    Covariance Parameter Estimates
    Cov Parm          Subject    Estimate    Parameter
    UN(1,1)           Goat       0.2935      $\Sigma_g$
    UN(2,1)           Goat       0.1837      $\Sigma_g$
    UN(2,2)           Goat       0.1191      $\Sigma_g$
    trait UN(1,1)     Goat       0.03376     $\Sigma_e$
          UN(2,1)     Goat       0.004966    $\Sigma_e$
          UN(2,2)     Goat       0.02596     $\Sigma_e$
    month AR(1)       Goat       0.05760     $\rho_s$

    Fit Statistics
    -2 Res Log Likelihood        -73.5
    AIC (smaller is better)      -59.5
    AICC (smaller is better)     -58.9
    BIC (smaller is better)      -54.1

The between-goat correlation is 0.1837/√(0.2935*0.1191) = 0.9825.
The within-goat correlation is 0.004966/√(0.03376*0.02596) = 0.1677.
[The serial correlation equals 0.0576]

Thus, the between-goat correlation is much more important than the within-goat correlation. The between-within correlation is

    (0.1837+0.004966)/√[(0.2935+0.03376)*(0.1191+0.02596)] = 0.8659

Obviously, the between-goat correlation dominates the between-within correlation, which is a result of the larger between-goat variances.

To test the significance of both correlations, we fit models with $\rho_g$ = 0 and with $\rho_e$ = 0 and record the value of -2 times the residual log-likelihood. Under the null hypothesis, the difference in -2 times the residual log-likelihood between (i) a model with the correlation set to zero and (ii) a model with the correlation allowed to vary has, approximately, a chi-squared distribution with one d.f. Thus, the difference must exceed a value of 3.84 to be significant at the 5% level.

$\rho_g$ = 0:

    Fit Statistics
    -2 Res Log Likelihood        -35.6
    AIC (smaller is better)      -23.6
    AICC (smaller is better)     -23.2
    BIC (smaller is better)      -19.0

The difference to the full model is -35.6 + 73.5 = 37.9 >> 3.84, which is highly significant. Thus, the between-goat correlation is highly significant.

$\rho_e$ = 0:

    Fit Statistics
    -2 Res Log Likelihood        -71.1
    AIC (smaller is better)      -59.1
    AICC (smaller is better)     -58.6
    BIC (smaller is better)      -54.4

The difference to the full model is -71.1 + 73.5 = 2.4 < 3.84, which is not significant. Thus, the within-goat correlation is not significant.
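The arithmetic in this section can be checked with a short data step. This is a minimal sketch: the numbers are copied by hand from the output shown above, and the data set name lr_check and the variable names are illustrative assumptions, not part of the original analysis.

```sas
data lr_check;
  /* covariance parameter estimates copied from the full-model output */
  g11 = 0.2935;   g21 = 0.1837;    g22 = 0.1191;    /* between goats */
  e11 = 0.03376;  e21 = 0.004966;  e22 = 0.02596;   /* within goats  */

  /* a correlation is a covariance divided by the square root of the
     product of the two variances */
  rho_g  = g21 / sqrt(g11*g22);                          /* 0.9825 in the text */
  rho_e  = e21 / sqrt(e11*e22);                          /* 0.1677 in the text */
  rho_bw = (g21 + e21) / sqrt((g11 + e11)*(g22 + e22));  /* 0.8659 in the text */

  /* likelihood-ratio statistics: reduced minus full, on the -2 Res Log L scale */
  m2ll_full = -73.5;
  lr_g = -35.6 - m2ll_full;   /* 37.9 */
  lr_e = -71.1 - m2ll_full;   /*  2.4 */

  /* upper-tail chi-squared probabilities with 1 d.f. */
  p_g = 1 - probchi(lr_g, 1);
  p_e = 1 - probchi(lr_e, 1);
run;

proc print data=lr_check;
run;
```

Keeping the estimates in one place makes it easy to repeat the check if the model is refitted.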
SAS code for $\rho_g$ = 0:

    proc mixed data=d;
    class trait month Goat;
    model y=trait w*t*trait w*t*t*trait/ddfm=kr solution noint;
    random trait/subject=Goat type=un;
    repeated trait month/sub=goat type=un@ar(1);
    parms (1)(0)(1)(1)(0)(1)(0.5)/hold=2;
    run;

SAS code for $\rho_e$ = 0:

    proc mixed data=d;
    class trait month Goat;
    model y=trait w*t*trait w*t*t*trait/ddfm=kr solution noint;
    random trait/subject=Goat type=un;
    repeated trait month/sub=goat type=un@ar(1);
    parms (1)(0)(1)(1)(0)(1)(0.5)/hold=5;
    run;

In both cases the PARMS statement lists starting values for the seven covariance parameters in the order of the output, and HOLD= fixes the second or the fifth parameter (the between-goat or within-goat covariance) at its starting value of 0.

Alternatively, one may fit the full model and add the option COVTEST to the PROC MIXED line. This will invoke a Wald test for the covariance parameters. The test is appropriate only for the two covariances [UN(2,1) in the output] and the serial correlation [AR(1)]. It is NOT appropriate for the variance components [UN(1,1) and UN(2,2)]. Generally, likelihood-ratio tests tend to be more reliable than Wald tests, so I recommend the former. The results below show that the between-goat covariance (and thus correlation) is significant, while the within-goat covariance (correlation) is not significant.

    The Mixed Procedure
    Covariance Parameter Estimates
                                             Standard
    Cov Parm          Subject    Estimate    Error       Z Value    Pr Z
    UN(1,1)           Goat       0.2935      0.1094         2.68    0.0037
    UN(2,1)           Goat       0.1837      0.06895        2.66    0.0077
    UN(2,2)           Goat       0.1191      0.04528        2.63    0.0043
    trait UN(1,1)     Goat       0.03376     0.005392       6.26    <.0001
          UN(2,1)     Goat       0.004966    0.003244       1.53    0.1258
          UN(2,2)     Goat       0.02596     0.003928       6.61    <.0001
    month AR(1)       Goat       0.05760     0.09233        0.62    0.5327

SAS code for Wald tests of covariances:

    proc mixed data=d covtest;
    class trait month Goat;
    model y=trait w*trait*t w*trait*t*t;
    random trait/subject=Goat type=un;
    repeated trait month/sub=goat type=un@ar(1);
    run;

Finally, if an overall test of the between-within correlation is desired, we can compare the full model to the one with both correlations set to 0. The LR statistic is then compared against chi-squared with 2 d.f., which has a critical value of 5.99 at the 5% level of significance.
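The chi-squared critical values used for the likelihood-ratio tests (3.84 for one d.f., 5.99 for two d.f.) can be obtained from the CINV quantile function. A minimal sketch; the data set name crit and the variable names are illustrative only.

```sas
data crit;
  /* 95% quantiles of the chi-squared distribution */
  crit_1df = cinv(0.95, 1);   /* about 3.84 */
  crit_2df = cinv(0.95, 2);   /* about 5.99 */
run;

proc print data=crit;
run;
```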

SAS code for $\rho_e$ = 0 and $\rho_g$ = 0:

    proc mixed data=d;
    class trait month Goat;
    model y=trait w*t*trait w*t*t*trait/ddfm=kr solution noint;
    random trait/subject=Goat type=un;
    repeated trait month/sub=goat type=un@ar(1);
    parms (1)(0)(1)(1)(0)(1)(0.5)/hold=2,5;
    run;

Output:

    Fit Statistics
    -2 Res Log Likelihood        -11.5
    AIC (smaller is better)       -1.5
    AICC (smaller is better)      -1.2
    BIC (smaller is better)        2.3

The difference to the full model is -11.5 + 73.5 = 62.0 >> 5.99, which is highly significant. A p-value for the LR statistic can be computed as follows:

    data;
    chi=62;
    df=2;
    p_value=1-probchi(chi,df);
    proc print;
    run;

Output:

    Obs    chi    df    p_value
      1     62     2    3.4417E-14

Thus, the p-value is p = 3.4417 × 10^-14 < 0.001, which is highly significant.

Convergence problems:

Occasionally, MIXED does not converge for these types of model. There are many ways to tackle convergence problems. Outlying observations (typos) may be one reason for lack of convergence. Also, it is helpful to scale the data so that both traits have about the same mean. This may be effected by multiplication of one trait by a constant factor. Note that rescaling will not affect the correlation estimate. Finally, if none of these tricks work, the AR(1) model may not be adequate. An alternative model is the compound symmetry model, which states that the serial correlation is constant for any time lag:

    $\mathrm{Corr}(e_{it}, e_{it'}) = \rho$  for any $t \neq t'$    (7)

The model can be fitted using TYPE=UN@CS in place of TYPE=UN@AR(1) in the REPEATED statement.
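The two remedies just described, rescaling one trait and switching to compound symmetry, could be combined as in the following sketch. The scaling factor 1.2 and the data set name d2 are illustrative assumptions only; the rest follows the joint-model code used above with TYPE=UN@CS substituted.

```sas
/* Optional rescaling so that both traits have roughly the same mean.
   The factor 1.2 is purely illustrative, not taken from the analysis. */
data d2;
  set d;
  if trait='protein' then y = 1.2*y;
run;

/* Joint model with compound symmetry instead of AR(1) for the serial part */
proc mixed data=d2;
  class trait month Goat;
  model y=trait w*trait*t w*trait*t*t;
  random trait/subject=Goat type=un;
  repeated trait month/sub=goat type=un@cs;
run;
```

After rescaling, the variance and covariance estimates change with the factor, so only the correlations remain comparable with those from the unscaled fit.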
SAS code for reading the data:

    Data d;
    Input Goat month trait$ y;
    w=0;
    if trait='cheesey' then w=1;
    t=month;
    Cards;
    124 1 cheesey 4.30
    124 1 protein 3.57
    124 2 cheesey 4.25
    124 2 protein 3.64
    124 3 cheesey 4.35
    124 3 protein 3.28
    124 4 cheesey 4.10
    124 4 protein 3.60
    124 5 cheesey 3.86
    124 5 protein 3.77
    124 6 cheesey 4.33
    124 6 protein 3.82
    124 7 cheesey 4.42
    124 7 protein 3.96
    126 1 cheesey 4.48
    126 1 protein 3.95
    126 2 cheesey 4.36
    126 2 protein 3.72
    126 3 cheesey 4.53
    126 3 protein 3.79
    126 4 cheesey 4.23
    126 4 protein 3.70
    126 5 cheesey 4.08
    126 5 protein 4.05
    126 6 cheesey 4.63
    126 6 protein 3.81
    126 7 cheesey 4.34
    126 7 protein 3.97
    128 1 cheesey 4.95
    128 1 protein 3.92
    128 2 cheesey 4.46
    128 2 protein 3.75
    128 3 cheesey 4.67
    128 3 protein 3.18
    128 4 cheesey 4.30
    128 4 protein 3.69
    128 5 cheesey 4.20
    128 5 protein 3.69
    128 6 cheesey 4.27
    128 6 protein 3.71
    128 7 cheesey 4.14
    128 7 protein 3.83
    130 1 cheesey 4.92
    130 1 protein 3.88
    130 2 cheesey 4.32
    130 2 protein 3.95
    131 1 cheesey 4.44
    131 1 protein 3.62
    131 2 cheesey 5.14
    131 2 protein 3.81
    131 3 cheesey 4.62
    131 3 protein 3.30
    131 4 cheesey 4.67
    131 4 protein 3.71
    131 5 cheesey 4.59
    131 5 protein 3.83
    131 6 cheesey 4.51
    131 6 protein 3.87
    131 7 cheesey 4.72
    131 7 protein 3.77
    134 1 cheesey 3.89
    134 1 protein 3.24
    134 2 cheesey 3.52
    134 2 protein 3.30
    134 3 cheesey 3.62
    134 3 protein 3.08
    134 4 cheesey 3.62
    134 4 protein 3.14
    134 5 cheesey 3.45
    134 5 protein 3.09
    134 6 cheesey 3.63
    134 6 protein 3.24
    134 7 cheesey 3.45
    134 7 protein 3.06
    140 1 cheesey 3.87
    140 1 protein 3.36
    140 2 cheesey 4.16
    140 2 protein 3.52
    140 3 cheesey 3.81
    140 3 protein 3.23
    140 4 cheesey 3.94
    140 4 protein 3.35
    140 5 cheesey 3.63
    140 5 protein 3.23
    140 6 cheesey 3.72
    140 6 protein 3.34
    140 7 cheesey 3.89
    140 7 protein 3.37
    143 1 cheesey 4.40
    143 1 protein 3.59
    143 2 cheesey 4.45
    143 2 protein 3.76
    143 3 cheesey 4.25
    143 3 protein 3.72
    143 4 cheesey 3.64
    143 4 protein 3.71
    143 5 cheesey 4.20
    143 5 protein 3.61
    143 6 cheesey 4.37
    143 6 protein 3.92
    145 1 cheesey 4.64
    145 1 protein 3.88
    145 2 cheesey 4.47
    145 2 protein 3.65
    145 3 cheesey 4.45
    145 3 protein 3.87
    145 4 cheesey 3.88
    145 4 protein 3.75
    145 5 cheesey 4.20
    145 5 protein 3.76
    145 6 cheesey 4.27
    145 6 protein 3.86
    145 7 cheesey 4.34
    145 7 protein 3.76
    147 1 cheesey 4.11
    147 1 protein 3.38
    147 2 cheesey 4.45
    147 2 protein 3.32
    147 3 cheesey 4.02
    147 3 protein 3.46
    147 4 cheesey 4.01
    147 4 protein 3.54
    147 5 cheesey 3.63
    147 5 protein 3.27
    147 6 cheesey 3.90
    147 6 protein 3.41
    147 7 cheesey 3.94
    147 7 protein 3.28
    157 1 cheesey 4.24
    157 1 protein 3.36
    157 2 cheesey 4.04
    157 2 protein 3.34
    157 3 cheesey 3.96
    157 3 protein 2.82
    157 4 cheesey 3.72
    157 4 protein 3.27
    157 5 cheesey 3.64
    157 5 protein 3.35
    157 6 cheesey 4.00
    157 6 protein 3.57
    157 7 cheesey 3.78
    157 7 protein 3.50
    183 1 cheesey 3.19
    183 1 protein 2.84
    183 2 cheesey 3.02
    183 2 protein 2.71
    183 3 cheesey 3.01
    183 3 protein 3.15
    183 4 cheesey 2.97
    183 4 protein 2.69
    183 5 cheesey 2.98
    183 5 protein 2.79
    183 6 cheesey 2.98
    183 6 protein 2.72
    183 7 cheesey 3.00
    183 7 protein 2.79
    184 1 cheesey 3.60
    184 1 protein 3.19
    184 2 cheesey 3.18
    184 2 protein 2.87
    184 3 cheesey 3.10
    184 3 protein 3.25
    184 4 cheesey 2.99
    184 4 protein 2.76
    184 5 cheesey 2.89
    184 5 protein 2.73
    184 6 cheesey 2.84
    184 6 protein 2.76
    184 7 cheesey 2.75
    184 7 protein 2.70
    185 1 cheesey 3.63
    185 1 protein 3.11
    185 2 cheesey 3.51
    185 2 protein 3.35
    185 3 cheesey 3.37
    185 3 protein 3.17
    185 4 cheesey 3.12
    185 4 protein 3.11
    185 5 cheesey 3.00
    185 5 protein 2.86
    185 6 cheesey 3.16
    185 6 protein 2.94
    185 7 cheesey 3.54
    185 7 protein 3.10
    186 1 cheesey 3.33
    186 1 protein 2.99
    186 2 cheesey 3.10
    186 2 protein 2.90
    186 3 cheesey 3.11
    186 3 protein 3.13
    186 4 cheesey 2.97
    186 4 protein 2.89
    186 5 cheesey 2.92
    186 5 protein 2.96
    186 6 cheesey 2.96
    186 6 protein 2.97
    186 7 cheesey 2.94
    186 7 protein 3.03
    187 1 cheesey 4.02
    187 1 protein 3.42
    187 2 cheesey 3.68
    187 2 protein 3.11
    187 3 cheesey 3.49
    187 3 protein 3.30
    187 4 cheesey 4.08
    187 4 protein 3.29
    187 5 cheesey 3.77
    187 5 protein 3.25
    187 6 cheesey 3.56
    187 6 protein 3.33
    187 7 cheesey 3.97
    187 7 protein 3.36
    ;