Efficiency of sampling designs within a cohort for estimating interaction effects between genetic and environmental risk factors

Online appendix

Estimation of additive risk models in nested case-control samples

Under the sampling scheme where controls are sampled among the subjects free of disease at the end of follow-up, estimation of an additive risk model in nested case-control samples is similar to estimation in standard case-control studies where the stratum-specific risks $R_j$ are known from external sources. Let $x$ denote the vector encoding the terms of the independent variables, possibly involving the stratification variable. For instance, with an environmental exposure variable $e$ and a genotype variable $g$, we could have $x = (1, e, g, eg)'$.

We consider first the situation where the number of levels of $x$ is small and the model is saturated. For stratum $j$, let $R_{jk}$ be the risk of disease for the $k$th level of $x$, $x_k$, and let $n_{1jk}$ and $n_{0jk}$ be the numbers of cases and controls with $x = x_k$. With $R_j$ known, $R_{jk}$ can be estimated by

$$\tilde{R}_{jk} = \frac{R_j \, n_{1jk}/n_{1j}}{R_j \, n_{1jk}/n_{1j} + (1 - R_j)\, n_{0jk}/n_{0j}} \qquad \text{(A.1)}$$

provided that $n_{0jk} + n_{1jk} > 0$.

In a nested case-control study, the $R_j$s are estimated using the data from the full cohort. The intuitive estimator of $R_j$ is $N_{1j}/N_j$, where $N_j = N_{1j} + N_{0j}$. Under the saturated model, this estimator is the maximum likelihood estimator of $R_j$. Notice that the case-control sample provides information only on the proportion of subjects at level $k$ of the independent variables within the cases or controls of a given stratum. This allows the denominators $n_{1j}$ and $n_{0j}$ to be set by design, as in the frequency matched and balanced designs.
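The estimator (A.1) can be illustrated with a short sketch. The function names `stratum_risk` and `level_risks` and the counts in the usage lines are hypothetical and chosen only for illustration; they are not taken from the addrisk package. The sketch assumes one stratum at a time, with the stratum risk estimated from the cohort counts of cases and non-cases.

```python
def stratum_risk(n_cases_cohort, n_noncases_cohort):
    """Cohort-based estimate of the stratum risk, N_1j / (N_1j + N_0j)."""
    total = n_cases_cohort + n_noncases_cohort
    if total == 0:
        raise ValueError("stratum is empty")
    return n_cases_cohort / total


def level_risks(r_j, cases_by_level, controls_by_level):
    """Apply (A.1) to every level k of x within one stratum.

    r_j               -- stratum risk R_j (known, or estimated from the cohort)
    cases_by_level    -- n_1jk for k = 0, 1, ... (sampled cases)
    controls_by_level -- n_0jk for k = 0, 1, ... (sampled controls)
    Returns one risk per level; None where n_0jk + n_1jk = 0.
    """
    n1_j = sum(cases_by_level)
    n0_j = sum(controls_by_level)
    if n1_j == 0 or n0_j == 0:
        raise ValueError("need at least one case and one control in the stratum")
    risks = []
    for n1_jk, n0_jk in zip(cases_by_level, controls_by_level):
        if n1_jk + n0_jk == 0:
            risks.append(None)  # (A.1) is undefined for an empty level
            continue
        case_part = r_j * n1_jk / n1_j              # R_j * P(x_k | case)
        control_part = (1.0 - r_j) * n0_jk / n0_j   # (1 - R_j) * P(x_k | control)
        risks.append(case_part / (case_part + control_part))
    return risks


# Illustrative (made-up) counts: 100 cases and 900 non-cases in the cohort,
# 100 sampled cases and 100 sampled controls spread over two levels of x.
r_j = stratum_risk(100, 900)
print(level_risks(r_j, [30, 70], [60, 40]))
```

Because only the within-case and within-control proportions enter the formula, rescaling the numbers of sampled cases or controls leaves the estimates unchanged, which is why the sample sizes can be fixed by design.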
The coefficients $\beta$ of the additive risk model are obtained by solving the system of equations $\tilde{R}_{jk} = \beta' x_k$ for $\beta$.

In the general situation where the risk model is not saturated, $N_{1j}/N_j$ is no longer the maximum likelihood estimator of $R_j$ [1]. We have opted for the iterative maximization algorithm proposed by Scott and Wild [2] to obtain maximum likelihood estimates. It uses the likelihood function for case-control studies where the $R_j$s are known as a pseudo-likelihood function, with the $R_j$s replaced by their current estimates. The estimates of $\beta$ are updated by numerically maximizing this pseudo-likelihood function, subject to the constraint that the predicted risk of every subject lies between 0 and 1. The algorithm alternates between solving for $\beta$ at the current estimates of the $R_j$s and computing updated $R_j$s at the current estimate of $\beta$ until convergence.

Estimation of additive risk models in case-cohort samples

As for nested case-control sampling, we analyze the affection status at the end of follow-up, ignoring censoring. We perform the analysis under the binomial regression framework on all the cohort subjects, treating the genotype of subjects outside the sample as missing data. Likelihood maximization can then be performed using a special case of the method of weights [3], a form of expectation-maximization (EM) algorithm [4]. An alternative approach to estimating additive risk models under the case-cohort design, one that takes censoring into account, has been proposed by Kulich and Lin [5].

We implement our EM algorithm by duplicating every subject with missing genotype to obtain pseudo subjects, each one having one of the possible genotypes [6]. The genotype is assumed independent of exposure and covariates. Initially, each pseudo subject is given a weight corresponding to the frequency of its genotype in the sub-cohort.
Then, the algorithm iterates the following expectation and maximization steps until convergence:

Expectation step: The weights of the pseudo subjects are re-computed as the expected probability of their genotype given their disease status and their exposure and covariate values, based on the current coefficient estimates.

Maximization step: The coefficients are re-estimated by maximizing the likelihood of the full cohort, with pseudo subjects weighted by the current estimated probability of their genotype, under the constraint that the estimated risk of each genotyped subject is between 0 and 1.

Expected variance of the case-cohort design

When the number of levels of the independent variables is small and the model is saturated, explicit estimators of the risk for each level can be derived under the case-cohort design. With a dichotomous genotype variable $g$, the risk estimates take the form

$$\hat{R}_{jg} = \frac{n_{1jg} + n_{2jg}}{n_{1jg} + n_{2jg} + \dfrac{N_j}{n_j}\, n_{0jg}}.$$

MLEs of the coefficients of the additive form of model (1) can be derived from these risk estimates. For instance, $\hat{\beta}_2 = \hat{R}_{01} - \hat{R}_{00}$ and $\hat{\beta}_3 = \hat{R}_{11} - \hat{R}_{10} - (\hat{R}_{01} - \hat{R}_{00})$.

Variances and covariances of the risk estimates are obtained by applying standard formulas for the variance and covariance of ratios of random variables and simplifying the resulting algebraic expressions:

$$\mathrm{Var}(\hat{R}_{jg}) = \frac{R_{jg}(1 - R_{jg})}{N_j\, p_{jg}} + R_{jg}^2 (1 - R_{jg})^2 \left[ \frac{1}{p_{jg}(1 - R_{jg})} - 1 \right] \left( \frac{1}{n_j} - \frac{1}{N_j} \right)$$

$$\mathrm{Cov}(\hat{R}_{j0}, \hat{R}_{j1}) = -R_{j0}(1 - R_{j0})\, R_{j1}(1 - R_{j1}) \left( \frac{1}{n_j} - \frac{1}{N_j} \right)$$

where $p_{jg}$ denotes the frequency of genotype $g$ in stratum $j$. Note that distinct strata are independent, so that $\mathrm{Cov}(\hat{R}_{jg}, \hat{R}_{lg'}) = 0$ for $j \neq l$ and any genotypes $g$ and $g'$.

The variances of the MLEs of the coefficients of the additive form of model (1) are functions of the variances and covariances of the risk estimates. For instance,

$$\mathrm{Var}(\hat{\beta}_2) = \mathrm{Var}(\hat{R}_{01}) + \mathrm{Var}(\hat{R}_{00}) - 2\,\mathrm{Cov}(\hat{R}_{00}, \hat{R}_{01})$$

and

$$\mathrm{Var}(\hat{\beta}_3) = \mathrm{Var}(\hat{R}_{11}) + \mathrm{Var}(\hat{R}_{10}) - 2\,\mathrm{Cov}(\hat{R}_{10}, \hat{R}_{11}) + \mathrm{Var}(\hat{R}_{01}) + \mathrm{Var}(\hat{R}_{00}) - 2\,\mathrm{Cov}(\hat{R}_{00}, \hat{R}_{01}).$$
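A minimal sketch of how the coefficient variances combine the risk variances and covariances is given here. It rests on assumptions that should be checked against the typeset appendix: the risk estimator is taken to be the number of cases with genotype g divided by that number plus the sub-cohort non-cases with genotype g scaled by N_j / n_j, and the variance and covariance are the corresponding delta-method expressions. The names `Stratum`, `risk_estimate`, `var_risk`, `cov_risks`, `var_beta2` and `var_beta3` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Stratum:
    r0: float            # risk for genotype g = 0 in this stratum
    r1: float            # risk for genotype g = 1 in this stratum
    p1: float            # frequency of genotype g = 1 (so p0 = 1 - p1)
    cohort_size: int     # N_j
    subcohort_size: int  # n_j


def risk_estimate(cases_g, subcohort_noncases_g, cohort_size, subcohort_size):
    """Assumed estimator: cases / (cases + (N_j / n_j) * sub-cohort non-cases)."""
    scaled_noncases = cohort_size / subcohort_size * subcohort_noncases_g
    return cases_g / (cases_g + scaled_noncases)


def var_risk(r, p, cohort_size, subcohort_size):
    """Assumed delta-method variance of one risk estimate."""
    sampling = 1.0 / subcohort_size - 1.0 / cohort_size
    full_cohort_term = r * (1.0 - r) / (cohort_size * p)
    subsampling_term = (r * (1.0 - r)) ** 2 * (1.0 / (p * (1.0 - r)) - 1.0) * sampling
    return full_cohort_term + subsampling_term


def cov_risks(r0, r1, cohort_size, subcohort_size):
    """Assumed covariance of the two genotype-specific risks in one stratum."""
    sampling = 1.0 / subcohort_size - 1.0 / cohort_size
    return -r0 * (1.0 - r0) * r1 * (1.0 - r1) * sampling


def var_beta2(s):
    """Variance of a within-stratum risk difference between genotypes 1 and 0."""
    v1 = var_risk(s.r1, s.p1, s.cohort_size, s.subcohort_size)
    v0 = var_risk(s.r0, 1.0 - s.p1, s.cohort_size, s.subcohort_size)
    return v1 + v0 - 2.0 * cov_risks(s.r0, s.r1, s.cohort_size, s.subcohort_size)


def var_beta3(stratum0, stratum1):
    """Variance of the interaction contrast; distinct strata are independent."""
    return var_beta2(stratum1) + var_beta2(stratum0)


# Illustrative (made-up) inputs: two strata, sub-cohort of 200 out of 1000.
unexposed = Stratum(r0=0.05, r1=0.10, p1=0.3, cohort_size=1000, subcohort_size=200)
exposed = Stratum(r0=0.10, r1=0.25, p1=0.3, cohort_size=1000, subcohort_size=200)
print(var_beta2(unexposed), var_beta3(unexposed, exposed))
```

When the sub-cohort is the whole cohort, the subsampling term and the covariance vanish and the variance reduces to the binomial variance of a fully genotyped cohort, which is a quick sanity check on any reading of these formulas.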
Asymptotic variance estimators for nested case-control designs

Asymptotic variance estimators provide an estimate, for each replicate, of the total variance due to sampling of the cohort from the population and to sampling of the genotyped subjects from the cohort. For nested case-control sampling, we used the variance estimators proposed by Scott and Wild [2] for the additive risk model, and by Breslow and Cain [7] for the logistic model. We observed that the asymptotic estimator of the variance of the additive risk model MLEs for nested case-control designs was biased upward in our cohort from the FOS.

Nonparametric bootstrap variance estimator for nested case-control designs

In order to estimate the variance of the additive risk model MLEs without bias, we have developed a nonparametric bootstrap estimator for nested case-control designs. The entire cohort is resampled with replacement. The nested case-control sample of a bootstrap replicate is formed of the cases and controls from the original nested case-control sample who are also in the bootstrap resample of the cohort (and may be included multiple times). The bootstrap variance estimate is obtained in the usual way as the empirical variance of the coefficient MLEs across the bootstrap resamples. This estimator gave variance estimates very close to $v_{\mathrm{empirical}}$ in the simulations where both estimators were computed.

The computation time of the bootstrap estimator for one nested case-control sample is of the order of a few minutes. While this is a reasonable time for one sample, the computation of the bootstrap estimator in all our simulations, with hundreds of replicates, three different nested case-control designs and dozens of scenarios, was too time-consuming, and $v_{\mathrm{empirical}}$ was therefore preferred.
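The resampling scheme described above can be sketched as follows. This is an illustration only, not the addrisk implementation: the function name `bootstrap_variance` is hypothetical, and `fit` stands for a user-supplied estimation routine (assumed here to take the resampled cohort identifiers and the resampled nested case-control identifiers and to return a sequence of coefficient estimates).

```python
import random
import statistics


def bootstrap_variance(cohort_ids, sample_ids, fit, n_boot=200, seed=0):
    """Nonparametric bootstrap variance for a nested case-control sample.

    cohort_ids -- identifiers of all cohort subjects
    sample_ids -- identifiers of the original nested case-control sample
    fit        -- hypothetical callable (cohort_resample, sample_resample)
                  returning a sequence of coefficient estimates
    Returns the empirical variance of each coefficient across replicates.
    """
    rng = random.Random(seed)
    in_sample = set(sample_ids)
    estimates = []
    for _ in range(n_boot):
        # Resample the entire cohort with replacement.
        cohort_resample = rng.choices(cohort_ids, k=len(cohort_ids))
        # Keep the original case-control subjects that were drawn,
        # as many times as they appear in the cohort resample.
        sample_resample = [i for i in cohort_resample if i in in_sample]
        estimates.append(list(fit(cohort_resample, sample_resample)))
    n_coef = len(estimates[0])
    return [
        statistics.variance([est[c] for est in estimates])
        for c in range(n_coef)
    ]
```

A call would look like `bootstrap_variance(cohort_ids, sample_ids, fit)`, where `fit` wraps whatever maximum likelihood routine is used for the additive risk model. Since each replicate requires a full refit, the cost grows linearly with `n_boot`, consistent with the computation times reported above.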
Software

We have implemented the methods described in this appendix in a software package called addrisk, written in the R language [8] and available at http://www.crulrg.ulaval.ca/pages_perso_chercheurs/bureau_a/software.html.

References

1. Wild CJ. Fitting Prospective Regression Models to Case-Control Data. Biometrika 1991;78(4):705-717.
2. Scott AJ, Wild CJ. Fitting Regression Models to Case-Control Data by Maximum Likelihood. Biometrika 1997;84(1):57-71.
3. Ibrahim JG. Incomplete Data in Generalized Linear Models. Journal of the American Statistical Association 1990;85(411):765-769.
4. Dempster AP, Laird NM, Rubin DB. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society. Series B (Methodological) 1977;39(1):1-38.
5. Kulich M, Lin DY. Additive hazards regression for case-cohort studies. Biometrika 2000;87(1):73-87.
6. Fleiss JL, Levin BA, Paik MC. Statistical methods for rates and proportions. Wiley series in probability and statistics. 3rd ed. Hoboken, N.J.: J. Wiley, 2003, Section 16.4.
7. Breslow NE, Cain KC. Logistic Regression for Two-Stage Case-Control Data. Biometrika 1988;75(1):11-20.
8. The R Development Core Team. R. 2.1.1 ed. The R Foundation for Statistical Computing, 2005.