Efficiency of sampling designs within a cohort for estimating
interaction effects between genetic and environmental risk
factors
Online appendix
Estimation of additive risk models in nested case-control samples
Under the sampling scheme where controls are sampled among the subjects free of disease at the
end of follow-up, estimation of an additive risk model in nested case-control samples is similar to
estimation in standard case-control studies where the stratum-specific risks $R_j$ are known from
external sources.
Let $x$ denote the vector encoding the terms of the independent variables, possibly involving the
stratification variable. For instance, with an environmental exposure variable $e$ and genotype
variable $g$, we could have $x = (1, e, g, eg)'$. We first consider the situation where the number of
levels of $x$ is small and the model is saturated. For stratum $j$, let $R_{jk}$ be the risk of disease for the
$k$th level of $x$, $x_k$, and let $n_{1jk}$ and $n_{0jk}$ be the numbers of cases and controls with $x = x_k$. With $R_j$ known,
$R_{jk}$ can be estimated by
\[
\tilde{R}_{jk} = \frac{R_j \dfrac{n_{1jk}}{n_{1j}}}{R_j \dfrac{n_{1jk}}{n_{1j}} + (1 - R_j) \dfrac{n_{0jk}}{n_{0j}}} \tag{A.1}
\]
provided that $n_{0jk} + n_{1jk} > 0$. In a nested case-control study, the $R_j$s are estimated using the data
from the full cohort. The intuitive estimator of $R_j$ is $N_{1j}/N_{+j}$, where $N_{+j} = N_{1j} + N_{0j}$. Under
the saturated model, this estimator is the maximum likelihood estimator of $R_j$.
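As a minimal sketch of estimator (A.1), the Python functions below compute the level-specific risk from a stratum risk and the case and control counts. The function and argument names are illustrative assumptions, not part of the addrisk package, and N1_j and N0_j are assumed to be the numbers of cases and non-cases in stratum j of the full cohort.

```python
def cohort_stratum_risk(N1_j, N0_j):
    """Intuitive estimator of R_j: proportion of cases in stratum j of the cohort."""
    return N1_j / (N1_j + N0_j)


def level_risk(R_j, n1_jk, n1_j, n0_jk, n0_j):
    """Estimate R_jk as in (A.1), given the stratum risk R_j.

    n1_jk, n0_jk: sampled cases and controls at level k in stratum j.
    n1_j, n0_j:   total sampled cases and controls in stratum j.
    """
    if n1_jk + n0_jk <= 0:
        raise ValueError("(A.1) requires at least one case or control at level k")
    case_part = R_j * n1_jk / n1_j
    control_part = (1.0 - R_j) * n0_jk / n0_j
    return case_part / (case_part + control_part)


if __name__ == "__main__":
    # Hypothetical counts: 10% stratum risk, 30 of 100 cases and 20 of 100
    # controls at level k.
    R_j = cohort_stratum_risk(N1_j=100, N0_j=900)
    print(level_risk(R_j, 30, 100, 20, 100))  # approximately 0.143
```

Only the within-case and within-control proportions enter the computation, which is why the sample sizes n1_j and n0_j can be fixed by design.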
Notice that the case-control sample provides information only on the proportion of subjects at
level k of the independent variables within the cases or controls of a given stratum. This allows
the denominators $n_{1j}$ and $n_{0j}$ to be set by design, as in the frequency-matched and balanced
designs. The coefficients of the additive risk model are obtained by solving the system of
equations $\tilde{R}_{jk} = \beta' x_k$ for $\beta$.
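For the saturated model with $x = (1, e, g, eg)'$, the system has a closed-form solution. The sketch below assumes binary $e$ and $g$, takes the exposure level as the stratum index, and numbers the coefficients from zero (so that index 2 is the genotype effect and index 3 the interaction); the function name and the dictionary layout are illustrative assumptions.

```python
def additive_coefficients(risk):
    """Solve risk[(e, g)] = beta' x for x = (1, e, g, e*g)', with e, g in {0, 1}.

    risk: dict mapping (e, g) to the estimated risk at that level.
    Returns (beta0, beta1, beta2, beta3).
    """
    beta0 = risk[(0, 0)]                      # baseline risk
    beta1 = risk[(1, 0)] - risk[(0, 0)]       # exposure effect
    beta2 = risk[(0, 1)] - risk[(0, 0)]       # genotype effect
    beta3 = (risk[(1, 1)] - risk[(1, 0)]) - (risk[(0, 1)] - risk[(0, 0)])  # interaction
    return beta0, beta1, beta2, beta3


if __name__ == "__main__":
    # Hypothetical risk estimates for the four (e, g) levels.
    estimates = {(0, 0): 0.05, (1, 0): 0.08, (0, 1): 0.07, (1, 1): 0.15}
    print(additive_coefficients(estimates))
```

With more levels of $x$, the same system would be solved as a general linear system rather than by hand.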
In the general situation where the risk model is not saturated, $N_{1j}/N_{+j}$ is no longer the maximum
likelihood estimator of $R_j$ [1]. We have opted for the iterative maximization algorithm proposed by
Scott and Wild [2] to obtain maximum likelihood estimates. It uses the likelihood function for case-control
studies where the $R_j$s are known as a pseudo-likelihood function, with the $R_j$s replaced by
their current estimates. The estimates of $\beta$ are updated by numerically maximizing this pseudo-likelihood
function, subject to the constraint that the predicted risk of every subject lies between
0 and 1. The algorithm alternates between solving for $\beta$ at the current estimates of the $R_j$s and
computing updated $R_j$s at the current estimates of $\beta$ until convergence.
Estimation of additive risk models in case-cohort samples
As for nested case-control sampling, we analyze the affection status at the end of follow-up,
ignoring censoring. We perform the analysis under the binomial regression framework on all the
cohort subjects, treating the genotype of subjects outside the sample as missing data. Likelihood
maximization can then be performed using a special case of the method of weights [3], a form of
expectation-maximization (EM) algorithm [4]. An alternative approach that takes censoring into account
to estimate additive risk models under the case-cohort design has been proposed by
Kulich and Lin [5].
We implement our EM algorithm by duplicating every subject with a missing genotype to obtain
pseudo-subjects, each one having one of the possible genotypes [6]. The genotype is assumed
independent of exposure and covariates. Initially, each pseudo-subject is given a weight
corresponding to the frequency of its genotype in the sub-cohort. Then, the algorithm iterates the
following expectation and maximization steps until convergence:
Expectation step: The weights of the pseudo-subjects are recomputed as the conditional
probability of their genotype given their disease status and their exposure and covariate values,
based on the current coefficient estimates.
Maximization step: The coefficients are re-estimated by maximizing the likelihood of the full
cohort, with pseudo-subjects weighted by the current estimated probability of their genotype,
under the constraint that the estimated risk of each genotyped subject lies between 0 and 1.
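The expectation step can be sketched as an application of Bayes' rule for a single subject with a missing genotype. This is a minimal illustration, not the addrisk implementation: the function name, the dictionary inputs, and the assumption that the predicted risks have already been evaluated at the subject's exposure and covariate values are all hypothetical.

```python
def e_step_weights(genotype_freq, predicted_risk, disease):
    """Posterior genotype probabilities for one subject with missing genotype.

    genotype_freq:  dict g -> current genotype frequency (genotype is assumed
                    independent of exposure and covariates).
    predicted_risk: dict g -> risk predicted by the current coefficients for
                    this subject's exposure and covariates if the genotype were g.
    disease:        1 if the subject is a case, 0 otherwise.
    """
    joint = {}
    for g, freq in genotype_freq.items():
        risk = predicted_risk[g]
        # Joint probability of genotype g and the observed disease status.
        joint[g] = freq * (risk if disease == 1 else 1.0 - risk)
    total = sum(joint.values())
    return {g: value / total for g, value in joint.items()}


if __name__ == "__main__":
    # Hypothetical values: carrier frequency 0.3, risks 0.1 and 0.2.
    print(e_step_weights({0: 0.7, 1: 0.3}, {0: 0.1, 1: 0.2}, disease=0))
```

The returned values sum to one and serve as the weights of that subject's pseudo-subjects in the next maximization step.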
Expected variance of the case-cohort design
When the number of levels of the independent variables is small and the model is saturated,
explicit estimators of the risk for each level can be derived under the case-cohort design. With a
dichotomous genotype variable g, the risk estimates take the form
\[
\hat{R}_{jg} = \frac{n_{1jg}}{n_{1jg} + n_{2jg}}, \qquad n_{2jg} = \frac{N_{+j}}{n_{+j}}\, n_{0jg}.
\]
MLEs of the coefficients of the additive form of model (1) can be derived from these risk
estimates. For instance, $\hat{\beta}_2 = \hat{R}_{01} - \hat{R}_{00}$ and $\hat{\beta}_3 = \hat{R}_{11} - \hat{R}_{10} - (\hat{R}_{01} - \hat{R}_{00})$.
Variances and covariances of the risk estimates are obtained by applying the standard formulas for the
variance and covariance of ratios of random variables and simplifying the resulting algebraic
expressions:
 
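\[
\operatorname{Var}(\hat{R}_{jg}) = \frac{R_{jg}(1 - R_{jg})}{N_{+j}\, p_{jg}} \left[ 1 - R_{jg} + \frac{N_{+j}}{n_{+j}} R_{jg} \right] - R_{jg}^2 (1 - R_{jg})^2 \left( \frac{1}{n_{+j}} - \frac{1}{N_{+j}} \right)
\]
\[
\operatorname{Cov}(\hat{R}_{j0}, \hat{R}_{j1}) = - R_{j0}(1 - R_{j0})\, R_{j1}(1 - R_{j1}) \left( \frac{1}{n_{+j}} - \frac{1}{N_{+j}} \right)
\]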
where $p_{jg}$ denotes the frequency of genotype $g$ in stratum $j$. Note that distinct strata are
independent, so that $\operatorname{Cov}(\hat{R}_{jg}, \hat{R}_{lg}) = 0$ for $j \neq l$.
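The variances of the MLEs of the coefficients of the additive form of model (1) are functions of the
variances and covariances of the risk estimates. For instance,
\[
\operatorname{Var}(\hat{\beta}_2) = \operatorname{Var}(\hat{R}_{01}) + \operatorname{Var}(\hat{R}_{00}) - 2 \operatorname{Cov}(\hat{R}_{00}, \hat{R}_{01})
\]
and
\[
\operatorname{Var}(\hat{\beta}_3) = \operatorname{Var}(\hat{R}_{11}) + \operatorname{Var}(\hat{R}_{10}) - 2 \operatorname{Cov}(\hat{R}_{10}, \hat{R}_{11}) + \operatorname{Var}(\hat{R}_{01}) + \operatorname{Var}(\hat{R}_{00}) - 2 \operatorname{Cov}(\hat{R}_{00}, \hat{R}_{01}).
\]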
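The way variances and covariances of the risk estimates combine can be sketched in Python. The signs follow from Var(A - B) = Var(A) + Var(B) - 2 Cov(A, B) and from the independence of distinct strata; the function and argument names are illustrative assumptions, and the inputs are taken as already computed.

```python
def var_risk_difference(var_carrier, var_noncarrier, cov_within_stratum):
    """Variance of R_hat[j,1] - R_hat[j,0] within a single stratum j."""
    return var_carrier + var_noncarrier - 2.0 * cov_within_stratum


def var_beta2(var_r01, var_r00, cov_r00_r01):
    """Variance of beta2_hat = R_hat[0,1] - R_hat[0,0]."""
    return var_risk_difference(var_r01, var_r00, cov_r00_r01)


def var_beta3(var_r11, var_r10, cov_r10_r11, var_r01, var_r00, cov_r00_r01):
    """Variance of beta3_hat = (R_hat[1,1] - R_hat[1,0]) - (R_hat[0,1] - R_hat[0,0]).

    The two within-stratum differences are independent because they come
    from distinct strata, so their variances add.
    """
    return (var_risk_difference(var_r11, var_r10, cov_r10_r11)
            + var_risk_difference(var_r01, var_r00, cov_r00_r01))


if __name__ == "__main__":
    # Hypothetical variance and covariance values, for illustration only.
    print(var_beta2(0.004, 0.003, -0.0005))
    print(var_beta3(0.006, 0.005, -0.001, 0.004, 0.003, -0.0005))
```

Because the within-stratum covariance of the risk estimates is negative, it increases the variance of each risk difference.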
Asymptotic variance estimators for nested case-control designs
Asymptotic variance estimators provide an estimate, for each replicate, of the total variance due
to sampling of the cohort from the population and to sampling of the genotyped subjects from the
cohort. For nested case-control sampling, we used the variance estimators proposed by Scott and
Wild [2] for the additive risk model, and by Breslow and Cain [7] for the logistic model. We observed
that the asymptotic estimator of the variance of the additive risk model MLEs for nested case-control
designs was biased upward in our cohort from the FOS.
Nonparametric bootstrap variance estimator for nested case-control designs
In order to estimate the variance of the additive risk model MLEs without bias, we have
developed a nonparametric bootstrap estimator for nested case-control designs. The entire cohort
is resampled with replacement. The nested case-control sample of a bootstrap replicate is formed
of the cases and controls from the original nested case-control sample who are also in the
bootstrap resample of the cohort (and may be included multiple times). The bootstrap variance
estimate is obtained in the usual way as the empirical variance of the coefficient MLEs across the
bootstrap resamples. This estimator gave variance estimates very close to $v_{\text{empirical}}$ in the
simulations where both estimators were computed. The computation time of the bootstrap
estimator for one nested case-control sample is of the order of a few minutes. While this is a
reasonable time for one sample, computing the bootstrap estimator in all our simulations,
with hundreds of replicates, three different nested case-control designs and dozens of scenarios,
was too time-consuming, and $v_{\text{empirical}}$ was therefore preferred.
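The resampling scheme described above can be sketched as follows. This is an illustration under stated assumptions rather than the addrisk code: subjects are represented by identifiers, and `estimate` is a hypothetical user-supplied function that returns one coefficient estimate from a resampled cohort and its nested case-control sample.

```python
import random
from statistics import variance


def bootstrap_variance(cohort_ids, ncc_ids, estimate, n_boot=200, seed=0):
    """Nonparametric bootstrap variance for a nested case-control design.

    cohort_ids: identifiers of all cohort subjects.
    ncc_ids:    identifiers of the subjects in the original nested
                case-control sample (a subset of cohort_ids).
    estimate:   hypothetical callable (cohort_resample, ncc_resample) -> float.
    """
    rng = random.Random(seed)
    cohort_ids = list(cohort_ids)
    in_sample = set(ncc_ids)
    estimates = []
    for _ in range(n_boot):
        # Resample the entire cohort with replacement.
        cohort_resample = [rng.choice(cohort_ids) for _ in cohort_ids]
        # Keep sampled subjects that were drawn, once per time they were drawn.
        ncc_resample = [i for i in cohort_resample if i in in_sample]
        estimates.append(estimate(cohort_resample, ncc_resample))
    # Empirical variance of the estimates across bootstrap resamples.
    return variance(estimates)


if __name__ == "__main__":
    # Toy estimator (sampled fraction), used only to exercise the function.
    cohort = range(100)
    sample = range(20)
    print(bootstrap_variance(cohort, sample, lambda c, s: len(s) / len(c)))
```

In practice `estimate` would refit the additive risk model on each replicate, which is what makes the procedure costly when repeated over many simulated samples.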
Software
We have implemented the methods described in this appendix in a software package called
addrisk, written in the R language [8] and available at
http://www.crulrg.ulaval.ca/pages_perso_chercheurs/bureau_a/software.html.
References
1. Wild CJ. Fitting Prospective Regression Models to Case-Control Data. Biometrika 1991;78(4):705-717.
2. Scott AJ, Wild CJ. Fitting Regression Models to Case-Control Data by Maximum Likelihood. Biometrika 1997;84(1):57-71.
3. Ibrahim JG. Incomplete Data in Generalized Linear Models. Journal of the American Statistical Association 1990;85(411):765-769.
4. Dempster AP, Laird NM, Rubin DB. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 1977;39(1):1-38.
5. Kulich M, Lin DY. Additive hazards regression for case-cohort studies. Biometrika 2000;87(1):73-87.
6. Fleiss JL, Levin BA, Paik MC. Statistical methods for rates and proportions. Wiley series in probability and statistics. 3rd ed. Hoboken, N.J.: J. Wiley, 2003, Section 16.4.
7. Breslow NE, Cain KC. Logistic Regression for Two-Stage Case-Control Data. Biometrika 1988;75(1):11-20.
8. The R Development Core Team. R. 2.1.1 ed. The R Foundation for Statistical Computing, 2005.