Correlation in repeated measures designs

The story
Data by Ossama Dimassi (FG Valle-Zarate)



14 goats
Each goat measured at 7 dates (months)
two traits (percentage of cheese yield, percentage of protein)
Objective: Assess correlation of percentage cheese yield and percentage protein yield
Dahlem Cashmere (DC) is a multipurpose goat breed developed at the end of the 1980s at the
Technical University of Berlin, based on crosses between Angora and dairy goats, in
particular the German White dairy goat with some influence of the German Fawn and Anglo
Nubian. Along with the valuable cashmere wool, DC is used for meat and milk production.
Empirical results indicate that milk of DC goats has superior processing properties compared
to milk of other dairy goats conventionally kept in Germany. In order to assess the
technological potential of milk of Dahlem Cashmere (DC) goats individual milk samples from
two groups of 5 DC goats (at 2nd and 3rd lactation) and one group of 5 German Fawn dairy
goats (GF) at 2nd lactation, were taken fortnightly over lactation periods of 32 and 28 weeks,
respectively. Along with the main components (protein, casein, fat) cheese yield was
measured. Significant differences have been detected between the breeds, so the
next step is to try to explain this variation, at least in part, by the levels of milk components, one of
which is protein, a quantitative variable. A simple correlation, however, does not account for
the complex design structure and the repeated-measures nature of the data.
Exploring the correlation structure
We have different options for looking at the correlation, e.g.,
• use all data, ignoring the structure (goats, months) (Fig. 1)
• compute means across months and correlate goat means (Fig. 2)
• compute means across goats and correlate month means (Fig. 3)
• look at each month separately; within-month = between-subject correlation (Fig. 4)
• look at each goat separately; within-goat = within-subject correlation (Fig. 5)
The resulting correlation coefficients are not the same, and some are quite dramatically
different, so the question arises which correlation is the correct one.
Obviously, there is structure in the data (months, goats), which needs to be accounted for. In
order to partition the correlation according to the factors month and goat, a factorial model is
needed.
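The five ways of looking at the correlation listed above can be sketched on a small, made-up data set. The records below are hypothetical (three goats "A", "B", "C" at three months) and are not the goat data of this document; they are chosen so that goats with more protein also have more cheese yield, while within each goat the two traits move in opposite directions. The helper names `pearson` and `means_by` are illustrative only.

```python
from collections import defaultdict
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical records: (goat, month, protein %, cheese yield %)
records = [
    ("A", 1, 3.0, 3.4), ("A", 2, 3.2, 3.2), ("A", 3, 3.1, 3.3),
    ("B", 1, 3.5, 4.0), ("B", 2, 3.7, 3.8), ("B", 3, 3.6, 3.9),
    ("C", 1, 4.0, 4.6), ("C", 2, 4.2, 4.4), ("C", 3, 4.1, 4.5),
]


def means_by(key_index):
    """Average both traits within each level of goat (0) or month (1)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key_index]].append(rec)
    protein = [sum(r[2] for r in g) / len(g) for g in groups.values()]
    cheese = [sum(r[3] for r in g) / len(g) for g in groups.values()]
    return protein, cheese


# 1. all data, ignoring goats and months
r_all = pearson([r[2] for r in records], [r[3] for r in records])
# 2. goat means across months
r_goat_means = pearson(*means_by(0))
# 3. month means across goats
r_month_means = pearson(*means_by(1))
# 4. one month only (between-subject correlation)
month1 = [r for r in records if r[1] == 1]
r_within_month = pearson([r[2] for r in month1], [r[3] for r in month1])
# 5. one goat only (within-subject correlation)
goat_a = [r for r in records if r[0] == "A"]
r_within_goat = pearson([r[2] for r in goat_a], [r[3] for r in goat_a])

print(r_all, r_goat_means, r_month_means, r_within_month, r_within_goat)
```

In this toy example the pooled and between-goat correlations are strongly positive while the within-goat and month-mean correlations are negative, which is the kind of disagreement that motivates partitioning the correlation with a factorial model.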
Fig. 1: Plot of cheese yield [%] (c) vs. protein [%] (p) for repeated measurements, ignoring goats and
months (r = 0.85).
Fig. 2: Plot of cheese yield [%] (c) vs. protein [%] (p) for goat means across months (r = 0.96).
Fig. 3: Plot of cheese yield [%] (c) vs. protein [%] (p) for month means across goats (r =
0.65).
Fig. 4: Plot of cheese yield [%] (c) vs. protein [%] (p) for goats at month = 7 (r = 0.93); within-month
analysis.
Fig. 5: Plot of cheese yield [%] (c) vs. protein [%] (p) across months for goat = 124 (r = -0.05, n.s.);
within-goat analysis.
Modelling a single trait
To develop a model, it is useful to look at a single trait first and then extend to two (or more)
traits.
$y_{it} = \mu + m_t + g_i + e_{it}$ (1)
where
$y_{it}$ = trait value for i-th goat at t-th month
$\mu$ = general mean
$m_t$ = effect of t-th month
$g_i$ = effect of i-th goat
$e_{it}$ = residual
If goats are a random sample, then $g_i$ is a random effect.
Residuals $e_{it}$ from the same goat are potentially correlated due to repeated measurements on
the same animal. Here, we use an AR(1) model, which states that errors on the same goat are
correlated by
$\mathrm{corr}(e_{it}, e_{it'}) = \rho^{|t-t'|}$ (2)
Thus, the correlation decays with distance in time. Also, errors of different goats are
uncorrelated.
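As a small illustration of the AR(1) structure in (2), the sketch below builds the within-goat correlation matrix for 7 monthly measurements. The value rho = 0.5 is a hypothetical choice for display only, and the function name `ar1_corr` is illustrative.

```python
def ar1_corr(n_times, rho):
    """AR(1) correlation matrix: entry (t, u) equals rho ** |t - u|."""
    return [[rho ** abs(t - u) for u in range(n_times)]
            for t in range(n_times)]


# Hypothetical serial correlation, 7 equally spaced months
R = ar1_corr(7, 0.5)

for row in R:
    print(" ".join(f"{value:.3f}" for value in row))
```

Neighbouring months have correlation rho, months two apart rho squared, and so on, so the first and last month of the series are only weakly correlated.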
The model may be modified to account for a possible linear time trend using
$y_{it} = \mu + g_i + \beta x_t + \gamma_i x_t + e_{it}$ (3)
where $x_t$ is the time in months, $\mu + g_i$ is the intercept of the i-th goat and $\beta + \gamma_i$ is the slope
of the i-th goat. If the trend is nonlinear, a simple extension is to add quadratic
and cubic terms, if necessary. The lack-of-fit of the mean regression ($\beta$) can be tested by
adding the month effect $m_t$. It is important to fit $m_t$ after $\beta$ and to use a Type I (sequential) analysis. The
lack-of-fit of the goat-specific regression ($\gamma_i$) cannot be tested, even with replicate data per
goat and month. The reason for this is that the random term, $e_{it}$, which models random
fluctuation around a smooth trend, would be confounded with a goat-specific lack-of-fit
effect.
To perform the regression, it is necessary to define the quantitative variable $x_t$ within a
data step. Here, this variable is simply coded as "t". In addition, a "month" effect ($m_t$) is fitted
to test the lack-of-fit. The following analysis is done taking goats in (3) as fixed and using an
AR(1) correlation model for the residual $e_{it}$.
Protein yield (percentage):
Type 1 Tests of Fixed Effects
Effect      Num DF   Den DF   F Value   Pr > F
Goat            15     15.3     41.18   <.0001
t                1     23.1      0.10   0.7558
month            5     54.7      1.48   0.2126
t*Goat          15     23.2      1.83   0.0924
Cheese yield (percentage):
Type 1 Tests of Fixed Effects
Effect      Num DF   Den DF   F Value   Pr > F
Goat            15     4.96     83.92   <.0001
t                1     7.64     36.03   0.0004
t*t              1     13.2     21.26   0.0005
month            4     38.5      1.25   0.3077
t*Goat          15     7.09      1.55   0.2861
t*t*Goat        14     12.5      0.67   0.7684
For both traits there is no indication of heterogeneity among goats in the time trend. For
protein there is no overall trend at all, while for cheese yield the trend is quadratic. Thus, we
would select the following models:
Protein yield (percentage):
$y_{it} = \mu + g_i + e_{it}$ (4)
Cheese yield (percentage):
$y_{it} = \mu + g_i + \beta_1 x_t + \beta_2 x_t^2 + e_{it}$ (5)
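SAS code:
/* Single-trait analysis for cheese yield: sequential (Type I) tests, AR(1) residuals within goat */
proc mixed data=d method=reml;
where trait='cheesey';
class month goat;
/* t is the numeric time covariate; month (class) is fitted after t and t*t as the lack-of-fit term */
model y=goat t t*t month goat*t goat*t*t/ddfm=kr solution htype=1;
repeated month/sub=goat type=ar(1);
run;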
Joint analysis for both traits
To arrive at a joint analysis for both traits, it is convenient to amalgamate
models (4) and (5). To do so, we introduce a dummy variable identifying
the trait with the more general model (5). Indexing the trait by j, the model is
$y_{ijt} = \mu_j + g_{ij} + \beta_{1j} w_j x_t + \beta_{2j} w_j x_t^2 + e_{ijt}$ (6)
where
$w_j = 0$ for protein ($j = 1$)
$w_j = 1$ for cheese ($j = 2$)
The model has two random effects: the between-goat effect $g_{ij}$ and the within-goat effect $e_{ijt}$.
Both of these effects will be correlated among traits. For both effects, the correlation of
effects pertaining to different animals is zero. The between-goat correlation is
$\mathrm{corr}(g_{i1}, g_{i2}) = \rho_g$
For the pair of random effects $(g_{i1}, g_{i2})$, we impose an unstructured variance-covariance
matrix:
$\mathrm{var}(g_{i1}, g_{i2}) = \Sigma_g$
The within-goat correlation, defined for a fixed point in time and for different traits, can
be defined as
$\mathrm{corr}(e_{i1t}, e_{i2t}) = \rho_e$
For the pair of random effects $(e_{i1t}, e_{i2t})$, we impose an unstructured variance-covariance
matrix:
$\mathrm{var}(e_{i1t}, e_{i2t}) = \Sigma_e$
To complete the within-goats error model, we need to account for serial correlation among
observations on the same trait at different points in time. Assuming an AR(1) model, the serial
correlation for the same trait at different points in time takes the form
$\mathrm{corr}(e_{ijt}, e_{ijt'}) = \rho_s^{|t-t'|}$
where $\rho_s$ is the serial correlation. The variance-covariance matrix for a vector of errors on
the same goat and the same trait can be denoted as
$\mathrm{var}(e_{ij1}, e_{ij2}, \ldots, e_{ij7}) = \Sigma_s$
Now what about two observations on different traits and different points in time? A
parsimonious way of modelling this is given by the direct product of within-goat correlation
and serial correlation:
$\mathrm{corr}(e_{ijt}, e_{ij't'}) = \rho_e \rho_s^{|t-t'|}$
For the whole error vector, sorted by trait and time, the variance-covariance matrix is
$\Sigma = \Sigma_e \otimes \Sigma_s$
where $\otimes$ is the direct product operator (Kronecker product). This model can be fitted in
MIXED using the TYPE=UN@AR(1) option in the REPEATED statement.
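To make the direct-product structure concrete, the following sketch assembles the 14 x 14 error covariance matrix for one goat (2 traits x 7 months) as a Kronecker product. All numbers are hypothetical illustration values, not estimates from this analysis, and the helper names `ar1_corr` and `kron` are illustrative.

```python
from math import sqrt


def ar1_corr(n_times, rho):
    """AR(1) correlation matrix: entry (t, u) equals rho ** |t - u|."""
    return [[rho ** abs(t - u) for u in range(n_times)]
            for t in range(n_times)]


def kron(a, b):
    """Kronecker (direct) product of two matrices given as nested lists."""
    return [[a_ij * b_kl for a_ij in row_a for b_kl in row_b]
            for row_a in a for row_b in b]


# Hypothetical between-trait error covariance (trait 1, trait 2)
sigma_e = [[0.04, 0.01],
           [0.01, 0.03]]
rho_s = 0.5  # hypothetical serial correlation
sigma_s = ar1_corr(7, rho_s)

# Errors sorted by trait, then month: indices 0-6 are trait 1, 7-13 trait 2
sigma = kron(sigma_e, sigma_s)

# Covariance of trait 1 at month 1 (index 0) with trait 2 at month 3 (index 9)
cov_cross = sigma[0][9]
rho_e = sigma_e[0][1] / sqrt(sigma_e[0][0] * sigma_e[1][1])
corr_cross = cov_cross / sqrt(sigma[0][0] * sigma[9][9])
print(cov_cross, corr_cross, rho_e * rho_s ** 2)
```

The cross-trait, cross-time correlation obtained from the assembled matrix agrees with the product of the within-goat correlation and the serial correlation raised to the time lag, which is the parsimony assumption stated above.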
SAS code:
proc mixed data=d;
class trait month Goat;
model y=trait w*trait*t w*trait*t*t;
random trait/subject=Goat type=un;
repeated trait month/sub=goat type=un@ar(1);
run;
Three components of between-trait correlation
(1) Between-goat correlation ($\rho_g$)
(2) Within-goat correlation ($\rho_e$)
(3) Time-trend related correlation
The first two components have already been defined as per the joint model (6). To illustrate
the meaning of (3), assume that $g_{ij}$ and $e_{ijt}$ have zero variance (are absent) and that there is a
time trend in both traits. For example, if both cheese yield and protein yield percentages
increase in time, this will induce a positive correlation when plotting cheese versus protein
yield percentages from different points in time.
In the present case, there is no significant trend in protein, so the third component of
correlation can be ignored here. But it is stressed that (3) requires attention when there is a
time trend in both traits.
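The time-trend related component can be illustrated with a deliberately artificial example. In the sketch below both traits increase linearly over 8 hypothetical time points, and the deviations around the two trends are constructed to be exactly uncorrelated. All numbers are made up for illustration; the name `pearson` is an illustrative helper.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


months = list(range(8))  # hypothetical: 8 equally spaced time points

# Deviations around the trend, constructed to be orthogonal (correlation 0)
dev_protein = [1, -1, 1, -1, 1, -1, 1, -1]
dev_cheese = [1, 1, -1, -1, 1, 1, -1, -1]

# Both traits follow an increasing linear trend plus their own deviations
protein = [3.0 + 0.10 * t + 0.05 * d for t, d in zip(months, dev_protein)]
cheese = [4.0 + 0.20 * t + 0.05 * d for t, d in zip(months, dev_cheese)]

r_raw = pearson(protein, cheese)          # driven by the common time trend
r_dev = pearson(dev_protein, dev_cheese)  # deviations only
print(r_raw, r_dev)
```

The raw values are highly correlated although the deviations around the trends are not, so a correlation computed across time points without modelling the trend would reflect the trend rather than any goat-level relationship.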
It is useful to define a between-within-goats correlation as
$\rho_{bw} = \mathrm{corr}(g_{i1} + e_{i1t}, g_{i2} + e_{i2t})$
This is the correlation of both traits on a randomly selected goat i at a
common point in time t.
Results
The fit of model (6) is as follows:
Covariance Parameter Estimates
Cov Parm         Subject   Estimate    Component
UN(1,1)          Goat      0.2935      $\Sigma_g$
UN(2,1)          Goat      0.1837      $\Sigma_g$
UN(2,2)          Goat      0.1191      $\Sigma_g$
trait UN(1,1)    Goat      0.03376     $\Sigma_e$
UN(2,1)          Goat      0.004966    $\Sigma_e$
UN(2,2)          Goat      0.02596     $\Sigma_e$
month AR(1)      Goat      0.05760     $\rho_s$
Fit Statistics
-2 Res Log Likelihood       -73.5
AIC (smaller is better)     -59.5
AICC (smaller is better)    -58.9
BIC (smaller is better)     -54.1
The between-goat correlation is $0.1837/\sqrt{0.2935 \times 0.1191} = 0.9825$.
The within-goat correlation is $0.004966/\sqrt{0.03376 \times 0.02596} = 0.1677$.
[The serial correlation equals 0.0576]
Thus, the between-goat correlation is much more important than the within-goat correlation.
The between-within correlation is
$(0.1837+0.004966)/\sqrt{(0.2935+0.03376) \times (0.1191+0.02596)} = 0.8659$
Obviously, the between-goat correlation dominates the between-within-correlation, which is a
result of the larger between-goats variances.
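The three correlations can be recomputed from the covariance parameter estimates reported above; each is a covariance divided by the square root of the product of the two variances. The sketch below only repeats that arithmetic, and the variable names are illustrative.

```python
from math import sqrt

# Between-goat estimates: UN(1,1), UN(2,1), UN(2,2)
g11, g21, g22 = 0.2935, 0.1837, 0.1191
# Within-goat estimates: trait UN(1,1), UN(2,1), UN(2,2)
e11, e21, e22 = 0.03376, 0.004966, 0.02596

# correlation = covariance / sqrt(variance_1 * variance_2)
rho_g = g21 / sqrt(g11 * g22)
rho_e = e21 / sqrt(e11 * e22)
# between-within: add the two covariance matrices first
rho_bw = (g21 + e21) / sqrt((g11 + e11) * (g22 + e22))

print(f"between-goat   {rho_g:.4f}")
print(f"within-goat    {rho_e:.4f}")
print(f"between-within {rho_bw:.4f}")
```

Because the between-goat variances are roughly five to nine times larger than the within-goat variances, the combined correlation stays close to the between-goat value.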
To test the significance of both correlations, we fit models with $\rho_g = 0$ and with $\rho_e = 0$ and
record the value of -2 times the residual log-likelihood. Under the null hypothesis, the difference
in -2 times the residual log-likelihood between (i) a model with the correlation set to zero and (ii) a
model with the correlation estimated freely has a chi-squared distribution with one d.f. Thus,
the difference must exceed a value of 3.84 to be significant at the 5% level.
$\rho_g = 0$:
Fit Statistics
-2 Res Log Likelihood       -35.6
AIC (smaller is better)     -23.6
AICC (smaller is better)    -23.2
BIC (smaller is better)     -19.0
The difference to the full model is –35.6 + 73.5 = 37.9 >> 3.84, which is highly significant.
Thus, the between-goat correlation is highly significant.
$\rho_e = 0$:
Fit Statistics
-2 Res Log Likelihood       -71.1
AIC (smaller is better)     -59.1
AICC (smaller is better)    -58.6
BIC (smaller is better)     -54.4
The difference to the full model is –71.1 + 73.5 = 2.4 < 3.84, which is not significant. Thus,
the within-goat correlation is not significant.
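The two likelihood-ratio tests can also be expressed as p-values. For one degree of freedom the chi-squared upper tail probability equals erfc(sqrt(x/2)), which is available in the Python standard library. The sketch below uses the -2 Res Log Likelihood values reported above; the function name `chi2_sf_1df` is illustrative.

```python
from math import erfc, sqrt


def chi2_sf_1df(x):
    """Upper tail probability of a chi-squared variable with 1 d.f."""
    return erfc(sqrt(x / 2.0))


neg2ll_full = -73.5    # full model
neg2ll_rho_g0 = -35.6  # between-goat covariance held at 0
neg2ll_rho_e0 = -71.1  # within-goat covariance held at 0

lr_g = neg2ll_rho_g0 - neg2ll_full  # 37.9
lr_e = neg2ll_rho_e0 - neg2ll_full  # 2.4

p_g = chi2_sf_1df(lr_g)
p_e = chi2_sf_1df(lr_e)
print(f"rho_g = 0: LR = {lr_g:.1f}, p = {p_g:.3g}")
print(f"rho_e = 0: LR = {lr_e:.1f}, p = {p_e:.3g}")
```

The same function returns approximately 0.05 at the critical value 3.84, which is a quick check that the tail formula matches the threshold used in the text.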
SAS-code for $\rho_g = 0$:
proc mixed data=d;
class trait month Goat;
model y=trait w*t*trait w*t*t*trait/ddfm=kr solution noint;
random trait/subject=Goat type=un;
repeated trait month/sub=goat type=un@ar(1);
parms (1)(0)(1)(1)(0)(1)(0.5)/hold=2;
run;
SAS-code for $\rho_e = 0$:
proc mixed data=d;
class trait month Goat;
model y=trait w*t*trait w*t*t*trait/ddfm=kr solution noint;
random trait/subject=Goat type=un;
repeated trait month/sub=goat type=un@ar(1);
parms (1)(0)(1)(1)(0)(1)(0.5)/hold=5;
run;
Alternatively, one may fit the full model and add the option COVTEST to the PROC MIXED
line. This will invoke a Wald test for the covariance parameters. The test is appropriate only
for the two covariances [UN(2,1) in the output] and the serial correlation [AR(1)]. It is NOT
appropriate for the variance components [UN(1,1) and UN(2,2)]. Generally, likelihood-ratio
tests tend to be more reliable than Wald tests, so I recommend the former. The results below
show that the between-goats covariance (and thus correlation) is significant, while the within-goats
covariance (correlation) is not significant.
The Mixed Procedure
Covariance Parameter Estimates
Cov Parm         Subject   Estimate    Standard Error   Z Value   Pr Z
UN(1,1)          Goat      0.2935      0.1094           2.68      0.0037
UN(2,1)          Goat      0.1837      0.06895          2.66      0.0077
UN(2,2)          Goat      0.1191      0.04528          2.63      0.0043
trait UN(1,1)    Goat      0.03376     0.005392         6.26      <.0001
UN(2,1)          Goat      0.004966    0.003244         1.53      0.1258
UN(2,2)          Goat      0.02596     0.003928         6.61      <.0001
month AR(1)      Goat      0.05760     0.09233          0.62      0.5327
SAS code for Wald-Tests of covariances:
proc mixed data=d covtest;
class trait month Goat;
model y=trait w*trait*t w*trait*t*t;
random trait/subject=Goat type=un;
repeated trait month/sub=goat type=un@ar(1);
run;
Finally, if an overall test of the between-within correlation is desired, we can compare the full
model to the one with both correlations set to 0. The LR-statistic is then compared against a chi-squared
distribution with 2 d.f., which has a critical value of 5.99 at the 5% level of significance.
SAS-code for $\rho_e = 0$ and $\rho_g = 0$:
proc mixed data=d;
class trait month Goat;
model y=trait w*t*trait w*t*t*trait/ddfm=kr solution noint;
random trait/subject=Goat type=un;
repeated trait month/sub=goat type=un@ar(1);
parms (1)(0)(1)(1)(0)(1)(0.5)/hold= 2,5;
run;
Output:
Fit Statistics
-2 Res Log Likelihood       -11.5
AIC (smaller is better)      -1.5
AICC (smaller is better)     -1.2
BIC (smaller is better)       2.3
The difference to the full model is –11.5 + 73.5 = 62.0 >> 5.99, which is highly significant.
A p-value for the LR-statistic can be computed as follows:
data;
chi=62;
df=2;
p_value=1-probchi(chi,df);
proc print; run;
Output:
Obs   chi   df   p_value
1      62    2   3.4417E-14
Thus, the p-value is $p = 3.4417 \times 10^{-14} < 0.001$, which is highly significant.
Convergence problems: Occasionally, MIXED does not converge for these types of model.
There are many ways to tackle convergence problems. Outlying observations (typos) may be
one reason for lack of convergence. Also, it is helpful to scale the data so that both traits have
about the same mean. This may be achieved by multiplying one trait by a constant factor.
Note that rescaling will not affect the correlation estimate. Finally, if none of these tricks
works, the AR(1) model may not be adequate. An alternative model is the compound
symmetry model, which states that the serial correlation is constant for any time lag:
$\mathrm{corr}(e_{it}, e_{it'}) = \rho$ for any $t \neq t'$ (7)
The model can be fitted using TYPE=UN@CS in place of TYPE=UN@AR(1) in the
REPEATED statement.
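The difference between the two serial correlation structures can be seen by writing out both matrices. The sketch below uses a hypothetical correlation of 0.3 for 7 months; the function names `cs_corr` and `ar1_corr` are illustrative.

```python
def cs_corr(n_times, rho):
    """Compound symmetry: the same correlation rho for every pair t != u."""
    return [[1.0 if t == u else rho for u in range(n_times)]
            for t in range(n_times)]


def ar1_corr(n_times, rho):
    """AR(1): the correlation rho ** |t - u| decays with the time lag."""
    return [[rho ** abs(t - u) for u in range(n_times)]
            for t in range(n_times)]


rho = 0.3  # hypothetical serial correlation
cs = cs_corr(7, rho)
ar1 = ar1_corr(7, rho)

# Correlation between month 1 and months 2..7 under both structures
print("lag  CS     AR(1)")
for lag in range(1, 7):
    print(f"{lag}    {cs[0][lag]:.3f}  {ar1[0][lag]:.5f}")
```

Under compound symmetry the first and last month are as strongly correlated as neighbouring months, whereas under AR(1) that correlation is close to zero.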
SAS-code for reading the data:
Data d;
Input Goat month trait$ y;
w=0; if trait='cheesey' then w=1;
t=month;
Cards;
124 1 cheesey 4.30
124 1 protein 3.57
124 2 cheesey 4.25
124 2 protein 3.64
124 3 cheesey 4.35
124 3 protein 3.28
124 4 cheesey 4.10
124 4 protein 3.60
124 5 cheesey 3.86
124 5 protein 3.77
124 6 cheesey 4.33
124 6 protein 3.82
124 7 cheesey 4.42
124 7 protein 3.96
126 1 cheesey 4.48
126 1 protein 3.95
126 2 cheesey 4.36
126 2 protein 3.72
126 3 cheesey 4.53
126 3 protein 3.79
126 4 cheesey 4.23
126 4 protein 3.70
126 5 cheesey 4.08
126 5 protein 4.05
126 6 cheesey 4.63
126 6 protein 3.81
126 7 cheesey 4.34
126 7 protein 3.97
128 1 cheesey 4.95
128 1 protein 3.92
128 2 cheesey 4.46
128 2 protein 3.75
128 3 cheesey 4.67
128 3 protein 3.18
128 4 cheesey 4.30
128 4 protein 3.69
128 5 cheesey 4.20
128 5 protein 3.69
128 6 cheesey 4.27
128 6 protein 3.71
128 7 cheesey 4.14
128 7 protein 3.83
130 1 cheesey 4.92
130 1 protein 3.88
130 2 cheesey 4.32
130 2 protein 3.95
131 1 cheesey 4.44
131 1 protein 3.62
131 2 cheesey 5.14
131 2 protein 3.81
131 3 cheesey 4.62
131 3 protein 3.30
131 4 cheesey 4.67
131 4 protein 3.71
131 5 cheesey 4.59
131 5 protein 3.83
131 6 cheesey 4.51
131 6 protein 3.87
131 7 cheesey 4.72
131 7 protein 3.77
134 1 cheesey 3.89
134 1 protein 3.24
134 2 cheesey 3.52
134 2 protein 3.30
134 3 cheesey 3.62
134 3 protein 3.08
134 4 cheesey 3.62
134 4 protein 3.14
134 5 cheesey 3.45
134 5 protein 3.09
134 6 cheesey 3.63
134 6 protein 3.24
134 7 cheesey 3.45
134 7 protein 3.06
140 1 cheesey 3.87
140 1 protein 3.36
140 2 cheesey 4.16
140 2 protein 3.52
140 3 cheesey 3.81
140 3 protein 3.23
140 4 cheesey 3.94
140 4 protein 3.35
140 5 cheesey 3.63
140 5 protein 3.23
140 6 cheesey 3.72
140 6 protein 3.34
140 7 cheesey 3.89
140 7 protein 3.37
143 1 cheesey 4.40
143 1 protein 3.59
143 2 cheesey 4.45
143 2 protein 3.76
143 3 cheesey 4.25
143 3 protein 3.72
143 4 cheesey 3.64
143 4 protein 3.71
143 5 cheesey 4.20
143 5 protein 3.61
143 6 cheesey 4.37
143 6 protein 3.92
145 1 cheesey 4.64
145 1 protein 3.88
145 2 cheesey 4.47
145 2 protein 3.65
145 3 cheesey 4.45
145 3 protein 3.87
145 4 cheesey 3.88
145 4 protein 3.75
145 5 cheesey 4.20
145 5 protein 3.76
145 6 cheesey 4.27
145 6 protein 3.86
145 7 cheesey 4.34
145 7 protein 3.76
147 1 cheesey 4.11
147 1 protein 3.38
147 2 cheesey 4.45
147 2 protein 3.32
147 3 cheesey 4.02
147 3 protein 3.46
147 4 cheesey 4.01
147 4 protein 3.54
147 5 cheesey 3.63
147 5 protein 3.27
147 6 cheesey 3.90
147 6 protein 3.41
147 7 cheesey 3.94
147 7 protein 3.28
157 1 cheesey 4.24
157 1 protein 3.36
157 2 cheesey 4.04
157 2 protein 3.34
157 3 cheesey 3.96
157 3 protein 2.82
157 4 cheesey 3.72
157 4 protein 3.27
157 5 cheesey 3.64
157 5 protein 3.35
157 6 cheesey 4.00
157 6 protein 3.57
157 7 cheesey 3.78
157 7 protein 3.50
183 1 cheesey 3.19
183 1 protein 2.84
183 2 cheesey 3.02
183 2 protein 2.71
183 3 cheesey 3.01
183 3 protein 3.15
183 4 cheesey 2.97
183 4 protein 2.69
183 5 cheesey 2.98
183 5 protein 2.79
183 6 cheesey 2.98
183 6 protein 2.72
183 7 cheesey 3.00
183 7 protein 2.79
184 1 cheesey 3.60
184 1 protein 3.19
184 2 cheesey 3.18
184 2 protein 2.87
184 3 cheesey 3.10
184 3 protein 3.25
184 4 cheesey 2.99
184 4 protein 2.76
184 5 cheesey 2.89
184 5 protein 2.73
184 6 cheesey 2.84
184 6 protein 2.76
184 7 cheesey 2.75
184 7 protein 2.70
185 1 cheesey 3.63
185 1 protein 3.11
185 2 cheesey 3.51
185 2 protein 3.35
185 3 cheesey 3.37
185 3 protein 3.17
185 4 cheesey 3.12
185 4 protein 3.11
185 5 cheesey 3.00
185 5 protein 2.86
185 6 cheesey 3.16
185 6 protein 2.94
185 7 cheesey 3.54
185 7 protein 3.10
186 1 cheesey 3.33
186 1 protein 2.99
186 2 cheesey 3.10
186 2 protein 2.90
186 3 cheesey 3.11
186 3 protein 3.13
186 4 cheesey 2.97
186 4 protein 2.89
186 5 cheesey 2.92
186 5 protein 2.96
186 6 cheesey 2.96
186 6 protein 2.97
186 7 cheesey 2.94
186 7 protein 3.03
187 1 cheesey 4.02
187 1 protein 3.42
187 2 cheesey 3.68
187 2 protein 3.11
187 3 cheesey 3.49
187 3 protein 3.30
187 4 cheesey 4.08
187 4 protein 3.29
187 5 cheesey 3.77
187 5 protein 3.25
187 6 cheesey 3.56
187 6 protein 3.33
187 7 cheesey 3.97
187 7 protein 3.36
;