Karolina Sikorska, Nahid Mostafavi Montazeri, André Uitterlinden, Fernando
Rivadeneira, Paul H.C Eilers, and Emmanuel Lesaffre
1.
Appendix
Linear mixed model
The ML estimator of $\beta$ in model (1) for a balanced data set can be expressed as
\[
\hat{\beta} = \Big(\sum_{i=1}^{N} X_i^T W X_i\Big)^{-1}\Big(\sum_{i=1}^{N} X_i^T W y_i\Big), \tag{17}
\]
with $y_i = (y_{i1}, \dots, y_{in})^T$, $X_i$ the design matrix for the $i$-th subject consisting of the columns $1$, $s_i$, $t$ and $s_i t$, and $W = V^{-1}$. Using the Woodbury identity,
\[
(A + BDB^T)^{-1} = A^{-1} - A^{-1}B\big(D^{-1} + B^T A^{-1} B\big)^{-1} B^T A^{-1},
\]
$W$ can be rewritten for model (2) as
\[
W = \frac{1}{\sigma^2}\,\big(I_n - \omega\, 1_n 1_n^T\big)\Big(I_n - \gamma\, t t^T \big(I_n - \omega\, 1_n 1_n^T\big)\Big), \tag{18}
\]
where $t = (t_1, \dots, t_n)^T$ is the vector of time points, $\omega = (\sigma^2/\sigma_0^2 + n)^{-1}$, $\gamma = (\sigma^2/\sigma_1^2 + c)^{-1}$ and $c = \sum_j t_j^2 - \omega\,\big(\sum_j t_j\big)^2$.
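Equation (18) can be verified numerically. The sketch below, with arbitrary illustrative parameter values, checks that the double Woodbury expansion reproduces the direct inverse of $V = \sigma^2 I_n + \sigma_0^2\, 1_n 1_n^T + \sigma_1^2\, t t^T$, assuming (as in the text) zero covariance between random intercept and slope:

```python
import numpy as np

# Illustrative values; any positive variance components work.
n = 5
t = np.arange(1.0, n + 1)
sigma2, sigma0sq, sigma1sq = 1.3, 0.7, 0.4

ones = np.ones(n)
V = sigma2 * np.eye(n) + sigma0sq * np.outer(ones, ones) + sigma1sq * np.outer(t, t)

omega = 1.0 / (sigma2 / sigma0sq + n)
c = t @ t - omega * t.sum() ** 2
gamma = 1.0 / (sigma2 / sigma1sq + c)

P = np.eye(n) - omega * np.outer(ones, ones)          # I_n - omega * 1 1^T
W = (P @ (np.eye(n) - gamma * np.outer(t, t) @ P)) / sigma2   # equation (18)

assert np.allclose(W, np.linalg.inv(V))
```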
By inserting equation (18) into equation (17), the estimate of the interaction term in model (2) can be obtained as
\[
\hat{\beta}_3 = \frac{\operatorname{cov}(s, u) - \bar{t}\,\operatorname{cov}(s, y_{\cdot})}{n \operatorname{var}(t)\operatorname{var}(s)},
\]
where $u_i = \sum_j t_j y_{ij}$, $y_{i\cdot} = \sum_j y_{ij}$, and $\operatorname{cov}$ and $\operatorname{var}$ denote empirical moments over subjects ($s$, $u$, $y_{\cdot}$) and time points ($t$), respectively.
Likewise, by inserting $W$ in the variance–covariance matrix of the LMM estimator, i.e. $\operatorname{var}(\hat{\beta}) = \big(\sum_{i=1}^{N} X_i^T W X_i\big)^{-1}$, the variance of the interaction estimator can be written as
\[
\operatorname{var}(\hat{\beta}_3) = \frac{\sigma^2 + \sigma_1^2\, n \operatorname{var}(t)}{N\, n \operatorname{var}(t)\operatorname{var}(s)}.
\]
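Since GLS coincides with OLS for the fixed effects in the balanced case, the closed-form interaction estimate can be checked against an ordinary least-squares fit of the full interaction model. A minimal sketch on simulated balanced data (all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 200, 5                                   # subjects, time points
t = np.arange(1.0, n + 1)
s = rng.integers(0, 3, size=N).astype(float)    # SNP coded 0/1/2

# simulate balanced data from model (2) with uncorrelated random effects
b0 = rng.normal(0, 0.5, N)
b1 = rng.normal(0, 0.3, N)
beta0, beta1, beta2, beta3 = 1.0, 0.2, 0.1, 0.05
Y = (beta0 + beta1 * s[:, None] + b0[:, None]
     + (beta2 + beta3 * s[:, None] + b1[:, None]) * t
     + rng.normal(0, 0.2, (N, n)))

# closed-form estimate of the interaction term
u = Y @ t                         # u_i = sum_j t_j y_ij
ydot = Y.sum(axis=1)              # y_i. = sum_j y_ij
cov = lambda a, b: np.mean((a - a.mean()) * (b - b.mean()))
beta3_closed = (cov(s, u) - t.mean() * cov(s, ydot)) / (n * t.var() * s.var())

# reference: OLS interaction coefficient on the stacked data
S_long, T_long, y_long = np.repeat(s, n), np.tile(t, N), Y.ravel()
X = np.column_stack([np.ones(N * n), S_long, T_long, S_long * T_long])
beta3_ols = np.linalg.lstsq(X, y_long, rcond=None)[0][3]

assert abs(beta3_closed - beta3_ols) < 1e-8
```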
Two-step approach
We can reformulate model (2) as
\[
y_{ij} = \beta_0^{*} + \beta_2^{*} t_{ij} + b_{0i}^{*} + b_{1i}^{*} t_{ij} + \varepsilon_{ij}, \tag{19}
\]
with $\beta_0^{*} = \beta_0 + \beta_1 E(S)$, $\beta_2^{*} = \beta_2 + \beta_3 E(S)$, $b_{0i}^{*} = b_{0i} + \beta_1\big(s_i - E(S)\big)$ and $b_{1i}^{*} = b_{1i} + \beta_3\big(s_i - E(S)\big)$.
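Substituting the starred quantities back confirms that (19) is an exact reparameterization of model (2):
\[
\begin{aligned}
\beta_0^{*} + \beta_2^{*} t_{ij} + b_{0i}^{*} + b_{1i}^{*} t_{ij}
&= \beta_0 + \beta_1 E(S) + \big(\beta_2 + \beta_3 E(S)\big) t_{ij} \\
&\quad + b_{0i} + \beta_1\big(s_i - E(S)\big) + \Big(b_{1i} + \beta_3\big(s_i - E(S)\big)\Big) t_{ij} \\
&= \beta_0 + \beta_1 s_i + \beta_2 t_{ij} + \beta_3 s_i t_{ij} + b_{0i} + b_{1i} t_{ij},
\end{aligned}
\]
so the right-hand side of model (2) is recovered term by term.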
We call model (19) the reduced model; fitting it constitutes the first step of the two-step procedure. The second step regresses the estimated $\hat{b}_{1i}^{*}$ on the SNP with the simple regression model
\[
\hat{b}_{1i}^{*} = \beta_0^{**} + \beta_1^{**} s_i + e_i^{**}.
\]
The MLE of $\beta_1^{**}$ can be expressed as
\[
\hat{\beta}_1^{**} = \frac{\sum_i (s_i - \bar{s})\, \hat{b}_{1i}^{*}}{\sum_i (s_i - \bar{s})^2}, \tag{20}
\]
where $\hat{b}_{1i}^{*}$ is the best linear unbiased predictor (BLUP), which can be computed by the empirical Bayes approach as
\[
\hat{b}_i^{*} = D Z_i^T W \big(y_i - X_i \hat{\beta}^{*}\big), \tag{21}
\]
where $\hat{\beta}^{*}$ is the ML estimator of the vector of fixed effects of the reduced model (19). Likewise, by inserting $\hat{\beta}^{*}$ and $W$ into equation (21), the BLUP for model (19) can be obtained as
\[
\hat{b}_{1i}^{*} = \gamma\,(w_i - \bar{w}), \tag{22}
\]
where $w_i = \sum_j t_j y_{ij} - \omega\big(\sum_j t_j\big)\big(\sum_j y_{ij}\big)$, $\bar{w} = \sum_j t_j \bar{y}_j - \omega\big(\sum_j t_j\big)\big(\sum_j \bar{y}_j\big)$, $\bar{y}_j = \frac{1}{N}\sum_{i=1}^{N} y_{ij}$, $\omega = (\sigma^2/\sigma_0^{*2} + n)^{-1}$, $\gamma = (\sigma^2/\sigma_1^{*2} + c)^{-1}$ and $c = \sum_j t_j^2 - \omega\big(\sum_j t_j\big)^2$.
By inserting the two equations above into equation (20), the estimate of $\beta_1^{**}$ can be derived as
\[
\hat{\beta}_1^{**} = \frac{\operatorname{cov}(s, u) - n\,\omega\,\bar{t}\,\operatorname{cov}(s, y_{\cdot})}{\operatorname{var}(s)\,\Big(\sigma^2/\sigma_1^{*2} + \sum_j t_j^2 - \omega\big(\sum_j t_j\big)^2\Big)},
\]
with $u_i = \sum_j t_j y_{ij}$ and $y_{i\cdot} = \sum_j y_{ij}$.
Note that in the above derivations we have used the assumption that the covariance of the random intercept and slope is zero. We have argued in the text when this assumption is reasonable.
Conditional two-step approach
The transformation matrix $A$ introduced for the CLMM can be defined, using the Gram–Schmidt process, as
\[
A = \Big\langle \frac{v_1}{\|v_1\|},\, \frac{v_2}{\|v_2\|},\, \dots,\, \frac{v_{n-1}}{\|v_{n-1}\|} \Big\rangle,
\qquad \text{where } v_1 = t - \langle t, 1_n\rangle\, \frac{1_n}{n},
\]
so that the columns of $A$ form an orthonormal basis of the orthogonal complement of $1_n$.
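R's poly() realizes this construction (as used in the code of Section 5). A minimal Python analogue, shown here as an illustrative sketch, builds $A$ by QR decomposition (i.e. Gram–Schmidt) of a polynomial basis and checks two properties used below: the columns of $A$ are orthogonal to $1_n$, and the transformed random-slope design satisfies $z^{*T} z^{*} = n \operatorname{var}(t)$:

```python
import numpy as np

n = 5
t = np.arange(1.0, n + 1)

# Gram-Schmidt (via QR) on the basis [1, t, t^2, ...]; dropping the first
# column (proportional to 1_n) leaves an orthonormal basis of the
# orthogonal complement of 1_n -- the analogue of R's poly(t, n - 1).
V = np.vander(t, n, increasing=True)
Q, _ = np.linalg.qr(V)
A = Q[:, 1:]                 # n x (n - 1) transformation matrix

zstar = A.T @ t              # transformed time vector (random-slope design)

assert np.allclose(A.T @ np.ones(n), 0)          # columns orthogonal to 1_n
assert np.isclose(zstar @ zstar, n * t.var())    # z*^T z* = n var(t)
```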
In the second step of the conditional two-step approach (i.e. model (11)), the ML estimator of $\beta_1^{\dagger\dagger}$ can be expressed as
\[
\hat{\beta}_1^{\dagger\dagger} = \frac{\sum_i (s_i - \bar{s})\, \hat{b}_{1i}^{\dagger}}{\sum_i (s_i - \bar{s})^2}, \tag{23}
\]
where $\hat{b}_{1i}^{\dagger}$ is the best linear unbiased predictor (BLUP), which can be computed by the empirical Bayes approach as
\[
\hat{b}_i^{\dagger} = D Z_i^{*T} W^{*} \big(y_i^{*} - X^{*} \hat{\beta}^{\dagger}\big), \tag{24}
\]
where $\hat{\beta}^{\dagger}$ is the MLE of the reduced model (10), and $X^{*}$ and $Z^{*}$ are the transformed design matrices of the fixed and random effects, respectively. Likewise, by using the Woodbury identity, $W^{*}$ can be obtained as
\[
W^{*} = \frac{1}{\sigma^2}\,\big(I_{n-1} - \gamma\, Z^{*} Z^{*T}\big),
\qquad \text{where } \gamma = \big(\sigma^2/\sigma_1^{*2} + n \operatorname{var}(t)\big)^{-1}.
\]
By inserting $\hat{\beta}^{\dagger}$ and $W^{*}$ into equation (24), the BLUP for model (10) can be obtained as
\[
\hat{b}_{1i}^{\dagger} = \gamma\Big(u_i - \frac{1}{N}\sum_i u_i - \bar{t}\, y_{i\cdot} + \frac{\bar{t}}{N}\sum_i y_{i\cdot}\Big),
\]
with $y_{i\cdot} = \sum_j y_{ij}$ and $u_i = \sum_j t_j y_{ij}$.
By inserting the two equations above into equation (23), the ML estimator of the SNP effect in model (12) can be derived as
\[
\hat{\beta}_1^{\dagger\dagger} = \frac{\operatorname{cov}(s, u) - \bar{t}\,\operatorname{cov}(s, y_{\cdot})}{\operatorname{var}(s)\,\big(n \operatorname{var}(t) + \sigma^2/\sigma_1^{*2}\big)}.
\]
In the second step of the conditional two-step approach, model (11), the variance of $\hat{\beta}_1^{\dagger\dagger}$ can be expressed as
\[
\operatorname{var}\big(\hat{\beta}_1^{\dagger\dagger}\big) = \frac{\operatorname{var}\big(\hat{b}_{1i}^{\dagger}\big)}{\sum_i (s_i - \bar{s})^2}, \tag{25}
\]
where $\operatorname{var}(\hat{b}_{1i}^{\dagger})$, with respect to the variance–covariance matrix of the LMM, i.e.
\[
\operatorname{var}\big(\hat{b}_i^{\dagger}\big) = D Z_i^{*T} W^{*} Z_i^{*} D - D Z_i^{*T} W^{*} X^{*}\Big(\sum_i X_i^{*T} W^{*} X_i^{*}\Big)^{-1} X^{*T} W^{*} Z_i^{*} D,
\]
can be expressed as
\[
\operatorname{var}\big(\hat{b}_{1i}^{\dagger}\big) = \frac{\sigma_1^{*4}}{\sigma^2}\,\big(1 - \gamma\, n \operatorname{var}(t)\big)\, n \operatorname{var}(t). \tag{26}
\]
By inserting the former equation into equation (25), the variance of the estimator of the SNP effect can be derived as
\[
\operatorname{var}\big(\hat{\beta}_1^{\dagger\dagger}\big) = \frac{n \operatorname{var}(t)\, \sigma_1^{*2}}{N \operatorname{var}(s)\,\big(\sigma^2/\sigma_1^{*2} + n \operatorname{var}(t)\big)}.
\]
2.
Probit regression to estimate power
In the mixed-model framework the distributions of the test statistics for the Wald, t- and F-tests are generally known only under the null hypothesis [1]. Exhaustive simulation is considered the most accurate method to compute statistical power, but fitting many LMMs is time consuming, and so is the simulation-based estimation of the power curve: effect sizes and model parameters are computed repeatedly over a grid of values, and the proportion of times that a SNP qualifies as significant gives the empirical type I error rate (simulation under the null hypothesis) or the power (simulation under the alternative hypothesis). For an exhaustive simulation study this approach demands a discouraging amount of computation time.
We propose a faster way to perform power calculations based on the probit model. Helms [2] demonstrated via simulations that the distribution of the general F-test statistic for
\[
H_0: \theta \equiv L\beta - \theta_0 = 0 \quad \text{versus} \quad H_A: \theta \neq 0
\]
under the alternative can be approximated by a noncentral F-distribution with noncentrality parameter
\[
\delta = \theta^T \Big[ L \Big(\sum_{i=1}^{N} X_i^T V_i^{-1} X_i\Big)^{-1} L^T \Big]^{-1} \theta.
\]
From this it follows that the t-test for testing $H_0: \beta_3 = 0$ versus $H_A: \beta_3 = c$ has, under $H_A$, a noncentral t-distribution with noncentrality parameter $\sqrt{\delta_3}$, where $\delta_3$ is obtained from the general expression above with $L = (0, 0, 0, 1)$ and $\theta = c$. In GWAS settings the number of degrees of freedom is large, so the (noncentral) t-distribution can be approximated by a normal distribution. Consequently, the power curve plotting effect size against statistical power has approximately the shape of a cumulative normal distribution function. We observed the same shape for the approximate procedures. In our simulations, a grid of equally spaced $\beta_3$-values is chosen on the interval $[0, \beta_{3,max}]$, where $\beta_{3,max}$ is the smallest value for which the power is practically 100%. One thousand grid values were chosen; for each value one data set was simulated and the considered models were fitted. The obtained p-values can be dichotomized according to the condition p < 0.05, giving a binary significance indicator $p_{sig}$. Finally, a probit model was fitted to $p_{sig}$ as a function of $\beta_3$.
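The procedure can be sketched as follows. This is an illustrative Python outline, not the authors' implementation: a simple linear regression stands in for the (expensive) per-grid-point LMM fit, and all sample sizes and grid bounds are assumptions chosen for the demonstration.

```python
import numpy as np
from scipy import stats, optimize

def simulate_pvalue(beta, n=30, rng=None):
    # one simulated data set; simple regression stands in for the LMM fit
    x = rng.normal(size=n)
    y = beta * x + rng.normal(size=n)
    return stats.linregress(x, y).pvalue

def fit_probit(effect, sig):
    # probit MLE: P(significant) = Phi(a + b * effect)
    X = np.column_stack([np.ones_like(effect), effect])
    negll = lambda th: -np.sum(sig * stats.norm.logcdf(X @ th)
                               + (1 - sig) * stats.norm.logcdf(-(X @ th)))
    return optimize.minimize(negll, x0=np.zeros(2), method="BFGS").x

rng = np.random.default_rng(1)
grid = np.linspace(0.0, 1.0, 1000)       # effect-size grid on [0, beta_max]
pvals = np.array([simulate_pvalue(beta, rng=rng) for beta in grid])
p_sig = (pvals < 0.05).astype(float)     # dichotomized significance indicator
a, b = fit_probit(grid, p_sig)
power = stats.norm.cdf(a + b * grid)     # smooth estimated power curve
```

One data set per grid value suffices because the probit fit pools information across the whole grid, which is what makes this approach much cheaper than estimating the power at each effect size by repeated simulation.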
3.
Discussion on the power loss of the CTS approach
We observed that the behavior of the CTS approach is quite similar to that of the CLMM. That there is some loss in power with the CLMM approach might be surprising, given the results in Section 4.2 of [4]. In that section it is argued that the CLMM implies no loss of information from a Bayesian viewpoint. However, this result is based on the assumption that the random intercept has a flat prior, while we have taken the classical assumption of joint normality for the random intercept and slope. That in the balanced case there is (basically) no power loss, while in the unbalanced case there must in general be a power loss, can be seen from the following reasoning.
The results in Section 4.3 of the same paper, applied to the current simplified situation, give
\[
f\big(y \mid b_0, \beta_3, b_1, \sigma^2\big) = f\big(y^{*} \mid \beta_3, b_1, \sigma^2\big)\; f\big(\bar{y} \mid b_0, \beta_3, b_1, \sigma^2\big),
\]
with $y$ the stacked vector of responses, $b_0$ ($b_1$) the stacked vector of random intercepts (slopes), $y^{*}$ the stacked vector of $y_i^{*}$ values and $\bar{y}$ the stacked vector of profile means. Now for each of the profile means the following result holds:
\[
\bar{y}_i \mid b_{0i}, b_{1i} \sim N\big(\beta_0 + \beta_1 s_i + \beta_2 \bar{t}_i + \beta_3 s_i \bar{t}_i + b_{0i} + b_{1i} \bar{t}_i,\; \sigma^2/n_i\big),
\]
with $\bar{t}_i$ the average time for the $i$-th subject. In the balanced case, one can change the time origin such that $\bar{t}_i \equiv \bar{t} = 0$ without changing anything in the estimation of the longitudinal part of the model. This implies that there is no information on $\beta_3$ left in the second part of the likelihood (the part involving $\bar{y}$). That a minimal loss of information for the CTS approach was seen in some of the simulations for the balanced case has to do with the estimation of the variance parameters: the variance parameters of the LMM are present in both parts of the likelihood and are therefore estimated better with the LMM than with either of the two parts separately. In the unbalanced case, however, no change of origin can remove $\beta_3$ from that part.
Hence, a loss of power is expected with the CLMM, and therefore also with the CTS approach, but the loss is often minimal, as seen in the simulations.
4.
Supplementary Figures
Supplementary Figure 1: MAR case, scenario 5. Approximation of the CTS approach compared to the CLMM.
Supplementary Figure 2: Flowchart describing practical use of the CTS approach.
Supplementary Figure 3: Time needed to analyze 1 million SNPs using the CTS approach combined with the semi-parallel regression (left panel). Computation time ratio between the function lmer and the CTS approach combined with the semi-parallel regression, depending on the number of longitudinal observations (right panel).
Supplementary Figure 4: 100 SNPs from the BMD data. On the x-axis, the p-values for the SNP × time interaction effect from the mixed model assuming uncorrelated errors; on the y-axis, the corresponding p-values from the model assuming a continuous autoregressive structure for the measurement error.
Supplementary Figure 5: Balanced case. Performance of the approximate procedures when a time-varying covariate is included in the linear mixed model.
Supplementary Figure 6: Unbalanced case. Performance of the approximate procedures when a time-varying covariate is included in the linear mixed model.
5.
R code – an example of applying the CTS approach
The data are arranged in a so-called “long format”, with one row per observation. The SNP data are stored in a matrix S with N rows and ns columns. The size of ns depends on the available
RAM. The first few rows of the phenotype data (mydata) look as follows:

id  y     Time
1 1.12 1
1 1.14 2
1 1.16 3
1 1.2 4
1 1.26 5
2 0.95 1
2 0.83 2
2 0.65 3
2 0.49 4
2 0.34 5
The code below, with function cond, transforms the data for the conditional linear mixed model. It is based on the SAS macro provided by Verbeke et al. in "Conditional linear mixed models" (2001). The variable vars is a vector with the names of the response and all time-varying covariates that should be transformed.

cond = function(data, vars) {
  data = data[order(data$id), ]
  ## delete missing observations
  data1 = data[!is.na(data$y), ]
  ## do the transformations per subject
  ids = unique(data1$id)
  transdata = NULL
  for (i in ids) {
    xi = data1[data1$id == i, vars]
    xi = as.matrix(xi)
    if (nrow(xi) > 1) {
      A = cumsum(rep(1, nrow(xi)))
      ## orthonormal basis orthogonal to the intercept (Gram-Schmidt)
      A1 = poly(A, degree = length(A) - 1)
      transxi = t(A1) %*% xi
      transxi = cbind(i, transxi)
      transdata = rbind(transdata, transxi)
    }
  }
  transdata = as.data.frame(transdata)
  names(transdata) = c("id", vars)
  row.names(transdata) = 1:nrow(transdata)
  return(transdata)
}
The code below applies the conditional two-step approach. First, the data are transformed using function cond . Next, the reduced conditional linear mixed model is fit and the random slopes are extracted. Finally, the semi-parallel regression is performed.
library(lme4)

# transform data for the conditional linear mixed model
trdata = cond(mydata, vars = c("Time", "y"))

# fit the reduced model and extract the random slopes
mod2 = lmer(y ~ Time - 1 + (Time - 1 | id), data = trdata)
blups = ranef(mod2)$id
blups = as.numeric(blups[, 1])

# perform the second step using semi-parallel regression
n = length(blups)              # number of subjects
X = matrix(1, n, 1)            # covariate matrix (here: intercept only)
U1 = crossprod(X, blups)
U2 = solve(crossprod(X), U1)
ytr = blups - X %*% U2         # residualized response
ns = ncol(S)
U3 = crossprod(X, S)
U4 = solve(crossprod(X), U3)
Str = S - X %*% U4             # residualized SNP matrix
Str2 = colSums(Str^2)
b = as.vector(crossprod(ytr, Str) / Str2)    # all ns slopes at once
sig = (sum(ytr^2) - b^2 * Str2) / (n - 2)    # residual variances
err = sqrt(sig * (1 / Str2))                 # standard errors
p = 2 * pnorm(-abs(b / err))                 # normal-approximation p-values
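The semi-parallel step above can be mirrored in Python and validated against an ordinary per-SNP simple regression. This sketch uses simulated stand-in data (random "blups" and genotypes) purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, ns = 500, 10
blups = rng.normal(size=n)                           # stand-in for extracted random slopes
S = rng.integers(0, 3, size=(n, ns)).astype(float)   # SNP matrix, coded 0/1/2
X = np.ones((n, 1))                                  # covariate matrix (intercept only)

# residualize the response and every SNP column on X
ytr = blups - X @ np.linalg.lstsq(X, blups, rcond=None)[0]
Str = S - X @ np.linalg.lstsq(X, S, rcond=None)[0]

Str2 = (Str ** 2).sum(axis=0)
b = ytr @ Str / Str2                          # all ns slopes in one pass
sig = (ytr @ ytr - b ** 2 * Str2) / (n - 2)   # residual variances
err = np.sqrt(sig / Str2)                     # standard errors
p = 2 * stats.norm.cdf(-np.abs(b / err))      # normal-approximation p-values

# agreement with an ordinary per-SNP simple regression
ref = stats.linregress(S[:, 0], blups)
assert np.allclose(b[0], ref.slope)
assert np.allclose(err[0], ref.stderr)
```

The slopes and standard errors coincide exactly with per-SNP least squares; only the p-values differ slightly, since the semi-parallel step uses the normal approximation rather than the t-distribution.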
References
1. G. Verbeke, G. Molenberghs. Linear Mixed Models for Longitudinal Data. New York: Springer, 2009.
2. R.W. Helms. Intentionally incomplete longitudinal designs: I. Methodology and comparison of some full span designs. Statistics in Medicine 1992; 11(14-15): 1889–1913.
3. R.C. Littell. SAS for Mixed Models. Cary, NC: SAS Institute, 2006.
4. G. Verbeke, B. Spiessens, E. Lesaffre. Conditional linear mixed models. The American Statistician 2001; 55(1): 25–34.