The use of the bootstrap in the analysis of case-control studies with
missing data
Volkert Siersma*, Christoffer Johansen**
*Department of Biostatistics, University of Copenhagen, Copenhagen, Denmark
**Institute of Cancer Epidemiology, The Danish Cancer Society, Copenhagen, Denmark
Abstract
Background. Valid inference and efficiency are concerns when there are missing values in the risk
factors of case-control studies. Complete-case analysis is inefficient and often biased for these
studies. Multiple imputation is a more efficient approach, but valid confidence intervals require the
complex assumption that the imputation is proper, which is sometimes hard to verify.
Computationally intensive resampling methods assume less of the imputation method in exchange
for computation time.
Methods. A practical bootstrap method is presented to conduct inference in multivariate case-control studies when risk factors have incomplete data. This is illustrated through two case studies
with considerable missing data in some risk factors of interest. The first study illustrates the
applicability of the bootstrap method compared to complete-case analysis and multiple imputation.
The second study illustrates the limitations of the bootstrap method.
Results. The bootstrap approach gives very similar results to multiple imputation.
Conclusion. The bootstrap approach can be preferable when the imputation procedure cannot be
shown to be fully proper, but only to result in unbiased estimates.
Key words: nonparametric bootstrap, bootstrap confidence intervals, missing values, multiple
imputation, matched case-control study
Missing values are common in epidemiological data, and even well designed and executed
experiments can feature a considerable number of missing values. A study design often used to
assess the effects of multiple factors on the risk of a relatively rare disease is the matched case-control design. With this design individuals with the disease are sampled and, for each case, one or
more controls are sampled that are similar to the case in certain characteristics but do not have the disease. The
risk factors are often assessed using conditional logistic regression1,2. In these studies risk factors
are assessed retrospectively, and observations may be incomplete, often for reasons beyond
the scope of the problem addressed in the study.
Many analysts exclude any subject with missing values from the data and proceed using methods
for data without missing values. This so-called complete-case analysis is the default for many, if not
all, statistical computer packages when faced with data containing missing values. This approach
gives biased estimates for case-control studies if the occurrence of missing values depends on both
the case-control identifier and the risk factors3,4. Additionally, complete-case analysis is inefficient
because of the loss of information by excluding subjects with incomplete observations. Several
incompletely observed risk factors can easily leave only a few subjects with complete data.
Multiple imputation5,6 is a method that overcomes this inefficiency and claims valid inference when
values are missing at random (MAR)5, i.e. when, given the observed data, the fact that a value is
missing is unrelated to the actual value that is missing. It is just as general as the complete-case
approach and computationally inexpensive, and is therefore popular7. In this approach, several complete datasets are
created by filling in incomplete observations using information from the existing data. These completed
datasets are then analysed by standard methods. The results are combined with Rubin’s rule6,8,
where the mean of the obtained estimates has a variance estimated by a sum of between- and
within-repetition variances. Few repetitions are needed because the simulation error is relatively
small compared to the overall uncertainty and Rubin’s rule accounts explicitly for this error6,9.
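The combination step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the studies below: the function name pool_rubin and the input numbers are hypothetical, and a normal reference distribution (factor 1.96) is used for the interval where Rubin's rule prescribes a t reference distribution.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool M completed-data estimates and their squared standard errors."""
    m = len(estimates)
    q_bar = mean(estimates)        # pooled point estimate
    w = mean(variances)            # within-repetition variance
    b = variance(estimates)        # between-repetition variance (divisor M-1)
    t = w + (1 + 1 / m) * b        # total variance; the 1/M term covers simulation error
    return q_bar, math.sqrt(t)

# Hypothetical log(OR) estimates and squared standard errors from M=5 completed datasets
log_or = [0.45, 0.52, 0.48, 0.41, 0.50]
se2 = [0.020, 0.021, 0.019, 0.022, 0.020]

est, se = pool_rubin(log_or, se2)
# Simplification: normal quantile instead of the t reference distribution
ci = (math.exp(est - 1.96 * se), math.exp(est + 1.96 * se))
print(f"OR {math.exp(est):.2f} ({ci[0]:.2f}-{ci[1]:.2f})")
```

The factor (1 + 1/M) on the between-repetition variance is the explicit correction for using a finite number of repetitions.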
The variance estimate obtained by multiple imputation has been criticised as inconsistent, with possibly
progressive bias, in certain settings10. Additionally, the condition that the imputations are proper6, a
complex requirement for valid inference, is often hard to establish in practice9,11. Multiple
imputation was originally devised for large public-use datasets, where trained statisticians create a
number of completed datasets, limited for computational and logistical reasons, for distribution
to possibly many end-users with access only to standard statistical software. A case-control
study, in contrast, often has a single end-user and a single analysis, which on modern computers can use
many more simulated datasets than originally envisaged. This opens the way for computationally intensive
resampling methods that circumvent the above criticism.
The nonparametric bootstrap is a general method of inference for statistics with an in principle
unknown distribution12-15. The data generation process is mimicked through sampling with
replacement from the original sample to obtain a replica dataset of the same size. Assuming that the
original data are representative of the total population, the parameter estimates from many resampled
replica datasets form an empirical distribution for that estimate, which is used for inference. The
use of the nonparametric bootstrap for inference on imputation estimators has been acknowledged
before7,16-18, but has generally been dismissed as computationally too cumbersome.
We present a practical bootstrap method for inference in multivariate case-control studies when risk
factors have incomplete data. This is done through two case studies with considerable missing data
in risk factors of interest. The first study illustrates the applicability of the bootstrap method
compared to complete-case analysis and multiple imputation. The second study illustrates the
limitations of the bootstrap method.
RSV infection – the nonparametric bootstrap
Respiratory Syncytial Virus (RSV) infection causes hospitalisation during the first two years of life
for some 2 percent of children born each year. A matched case-control study19 features all 1272
hospitalisations for RSV infection in two Danish counties in the 5-year period from 1990 to 1995.
Whenever possible, five controls are randomly chosen from the Danish central personal register
matched on gender, age and municipality. There are 6075 controls in the study. Potential risk
factors are gestational age, birth weight, household size and space, the mother’s smoking habits and
level of maternal antibodies against RSV. Two factors have a sizable portion of missing
observations. For the smoking factor 38% of the entries are missing since this information was only
collected in the later part of the study. Maternal antibodies are thought to have a possible effect only
in the first three months after birth. Therefore this measurement is ordered only for the 286 children
that are hospitalised during the first three months of their lives and is available for 233 of these.
Additionally, since this information is expensive, it is obtained for only one corresponding, but
randomly chosen, control. As maternal antibodies have disappeared after three months, this value is
assumed zero for children over three months of age.
A multivariate conditional logistic regression model is estimated for complete data using a Cox
proportional hazards procedure available in many statistical software packages#. The estimated
log(OR)s with corresponding confidence intervals are subsequently transformed to the OR scale. To
accommodate non-linearities, continuous factors are made ordinal; if no natural categories exist,
four categories approximately corresponding to the quartiles of the distribution of the risk factor are
chosen. Table 1 lists, for these categories, three sets of ORs with corresponding confidence
intervals, one for each of three approaches to incomplete data. The ORs are relative to the
hypothesised lowest risk category of each risk factor.
Complete-case estimates are reported in the first column of Table 1 (CC). These are obtained by
excluding from the data sample the children for whom either the smoking status of the mother or the
antibody titer is missing, and applying the estimation procedure to these reduced data. This
approach is consistent here because missing smoking information depends solely on calendar time
and missing blood tests are due to administrative irregularities considered random3,4. The complete-case method is the default of most statistical computer packages and thus requires no extra time to
implement, and only one call of the estimation procedure.
Complete-case analysis of the RSV data shows significant effects of many relevant risk factors and
could support a final conclusion. Through efficiency gains, however, other methods could give more
evidence for the effect of birth weight and better assess the influence of crowding and maternal
antibody titer, the latter being the most costly information.
Multiple imputation estimates are reported in the second column of Table 1 (MI). In this approach
M datasets are simulated by replacing missing values in the original data sample S by qualified
guesses through an imputation procedure imp(S). These completed datasets are then analysed
separately with the complete data procedure and the resulting M estimates are combined using
Rubin’s rule6,8 to arrive at ORs and standard errors for the classes of the risk factors. The multiple
imputation approach in Table 1 uses M=10 simulated datasets.

# phreg in SAS or coxph in Splus, for example

Table 1: Estimated Odds Ratios (OR) with corresponding 95% Confidence Intervals (CI) in a multivariate
conditional logistic regression for the risk factors for hospitalisation for RSV infection. The estimates are constructed
through Complete-case analysis (CC), Multiple Imputation (MI) and NonParametric Bootstrap (NPB), respectively.

Risk factor                             Level            CC OR (95% CI)    MI OR (95% CI)    NPB OR (95% CI)
Gestational age                         <33 weeks        4.65 (2.44-8.85)  3.88 (2.41-6.25)  3.75 (2.74-7.75)
                                        33-35 weeks      1.64 (0.96-2.81)  1.73 (1.17-2.57)  1.66 (1.20-2.82)
                                        36-37 weeks      1.31 (0.91-1.89)  1.43 (1.07-1.92)  1.40 (1.10-1.97)
                                        38-39 weeks      1.13 (0.92-1.39)  1.18 (0.99-1.40)  1.16 (1.00-1.40)
                                        >39 weeks        1.00              1.00              1.00
Birth weight                            <3.0 kg          1.61 (1.11-2.31)  1.42 (1.06-1.91)  1.46 (1.10-1.98)
                                        3.0-3.5 kg       1.28 (0.94-1.75)  1.15 (0.89-1.47)  1.16 (0.90-1.51)
                                        3.5-4.0 kg       1.12 (0.82-1.54)  1.06 (0.83-1.26)  1.07 (0.83-1.38)
                                        >4.0 kg          1.00              1.00              1.00
Space per member of household           <22 m2           1.36 (1.01-1.82)  1.10 (0.88-1.38)  1.09 (0.87-1.42)
                                        22-28 m2         1.27 (0.97-1.67)  1.14 (0.91-1.42)  1.14 (0.92-1.48)
                                        28-36 m2         1.06 (0.81-1.39)  1.03 (0.83-1.26)  1.02 (0.82-1.29)
                                        >36 m2           1.00              1.00              1.00
Age difference with next older sibling  0-2 years        1.70 (1.26-2.28)  1.76 (1.40-2.20)  1.74 (1.45-2.32)
                                        2-4 years        1.45 (1.11-1.89)  1.23 (0.99-1.52)  1.22 (1.01-1.56)
                                        >4 years         1.61 (1.26-2.05)  1.64 (1.34-1.99)  1.62 (1.40-2.07)
                                        adult (no sibs)  1.00              1.00              1.00
Maternal antibody titer                 <210             1.35 (0.69-2.64)  1.22 (0.71-2.11)  1.57 (0.78-2.22)
                                        210-275          1.23 (0.63-2.40)  1.68 (0.97-2.91)  1.85 (1.08-2.74)
                                        275-330          1.65 (0.86-3.15)  1.75 (1.04-2.95)  1.92 (1.22-2.85)
                                        >330             1.00              1.00              1.00
Smoking status of the mother            smoking          1.64 (1.36-1.98)  1.56 (1.19-2.05)  1.57 (1.32-1.98)
                                        non-smoking      1.00              1.00              1.00
Imputation of the two missing factors is based on sequential random draws from the conditional
probability distribution of the risk factor given all complete variable sets in the study20,21. First, the
smoking factor is imputed by draws from a Bernoulli distribution for the probability of a smoking
mother conditional on all complete information – i.e. the risk factors without missing values, the
case-control identifier and the matching variables, but not the maternal antibody titer – using a
logistic regression model estimated from the part of the data for which the smoking information is
not missing. Thereafter, the antibody titer is imputed by a draw from a linear regression model,
estimated on that part of the data for which the antibody titer is not missing, on all complete
information, now including the newly imputed smoking factor.
This imputation scheme focuses on easy implementation rather than on being proper6. Single-outcome
procedures for linear and logistic regression are standard in statistical computer packages and
usually produce model predictions for missing outcomes when all covariates are present. The
procedure is constructed by iterating these standard procedures together with functions that produce
random draws from probability distributions. This imputation scheme is proper for a single factor
with missing values under an assumption of ignorability, i.e. that missing values in a factor depend on the
other variables in the study in the same way as its observed values do9. The proposed sequential
imputation algorithm cannot be proper, since not all available information is used to impute the
smoking factor. Proper imputation would be approached by iterating the above imputation
procedure between the factors with missing information, that is, by additionally using the imputed antibody
titer in the logistic model for the smoking factor in a next iteration step20,21. The uniterated
imputation procedure used here is rendered first-moment proper, i.e. it gives unbiased estimates, by an
assumption of independence of smoking and antibody titer conditional on all other variables.
The multiple imputation procedure takes some time to implement. However, software that performs the
iterations of the sequential method is available20,22. The estimation procedure and the
imputation procedure are called M=10 times, and the results are combined only once. This usually gives
short computing times, even for more elaborate imputation schemes.
Multiple imputation is more efficient than complete-case analysis, as evidenced by
narrower confidence intervals (Table 1). In particular, the evidence supporting a (non-linear) effect of
maternal antibodies results from efficiency gains. Effects of birth weight and crowding are slightly
lower than in the complete-case analysis, which might be the result of incorrectly assumed
independence.
Estimates obtained by a nonparametric bootstrap are reported in the last column of Table 1 (NPB).
The bootstrap method for inference on imputation estimates from an incomplete dataset S is
determined by a resampling procedure res(S) and an imputation procedure imp(S). Bootstrap replica
estimates $\hat{\theta}^{(b)} = \hat{\theta}(\mathrm{imp}(\mathrm{res}(S)))$ are obtained by applying the imputation procedure to datasets
obtained through the resampling procedure. A large number B of replicas forms an empirical
distribution that approximates the distribution of the parameter estimates $\hat{\theta}$.& A 95% confidence interval is
estimated by the 2.5% and 97.5% percentiles of the replica estimates $\hat{\theta}^{(b)}$; improvements to this
percentile method exist24. B=1000 bootstrap replicas are constructed in the RSV study to obtain
confidence intervals15. A condition for the bootstrap approach to give valid inference is consistency
of the estimate $\hat{\theta}$. Here the estimate is obtained through the imputation procedure imp(S) described above,
which is argued to provide consistent estimates.
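The composition of resampling, imputation and estimation can be sketched as follows. This is a toy illustration under stated assumptions, not the procedure applied to the RSV data: the sample is a hypothetical list of numbers with missing entries, imp is a simple hot-deck draw, the estimator is the mean, and the percentile positions are taken by plain index truncation.

```python
import random
from statistics import mean

def res(sample, rng):
    """Resample with replacement to a replica dataset of the same size."""
    return [rng.choice(sample) for _ in sample]

def imp(sample, rng):
    """Toy hot-deck imputation: replace each missing value by a random observed one.
    Assumes the replica holds at least one observed value."""
    observed = [x for x in sample if x is not None]
    return [x if x is not None else rng.choice(observed) for x in sample]

def bootstrap_ci(sample, estimator, B=1000, level=0.95, seed=0):
    """Percentile interval from B replica estimates estimator(imp(res(sample)))."""
    rng = random.Random(seed)
    reps = sorted(estimator(imp(res(sample, rng), rng)) for _ in range(B))
    lo = reps[int((1 - level) / 2 * B)]
    hi = reps[int((1 + level) / 2 * B) - 1]
    return lo, hi

# Hypothetical data: roughly 30% of the values are missing (None)
gen = random.Random(42)
sample = [gen.gauss(10, 2) if gen.random() > 0.3 else None for _ in range(60)]

lo, hi = bootstrap_ci(sample, mean, B=500)
print(f"95% percentile interval: {lo:.2f} to {hi:.2f}")
```

The imputation is repeated inside every replica, so the spread of the replica estimates reflects both sampling and imputation variability.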
Bootstrap point estimates can be obtained either by applying the imputation procedure to the
original dataset, $\hat{\theta}(S) = \hat{\theta}(\mathrm{imp}(S))$, or by a measure of central tendency (mean or median) of the
bootstrap replica estimates. The former, reported in Table 1, stays close to the original data, but the
latter is more robust to imputation variance and generally less biased18. This ambiguity underlines that
the bootstrap approach focuses on inference, not estimation.
The resampling scheme res(S) should mimic as closely as possible the actual sampling of the
data12,14. In a matched case-control study, the controls are not chosen randomly from the population.
Cases are therefore sampled with replacement from children hospitalised for RSV and thereafter,
ideally, for each case, controls are sampled with replacement from all children not hospitalised that
match the case, i.e. have the same sex, have the same birth month and live in the same municipality.
This can be approximated by sampling with replacement from the matched sets – i.e. the sets of
cases and their controls – since the matching in the RSV study is tight, such that a resampled dataset
consists of just as many matched sets as in the original dataset. Approximate random sampling by
matched sets increases the computation speed and is used here.
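Resampling by matched sets can be sketched as below. The record layout with a set_id key and the function name resample_matched_sets are hypothetical choices for this illustration. The relabelling of drawn sets is an added assumption: it keeps a set that is drawn twice as two separate strata in the conditional logistic regression.

```python
import random
from collections import defaultdict

def resample_matched_sets(records, rng):
    """records: dicts with a 'set_id' key linking each case to its controls."""
    sets = defaultdict(list)
    for r in records:
        sets[r["set_id"]].append(r)
    ids = list(sets)
    replica = []
    # Draw as many matched sets as in the original data, with replacement
    for new_id, old_id in enumerate(rng.choices(ids, k=len(ids))):
        # Relabel so that a set drawn twice enters as two distinct strata
        replica.extend({**r, "set_id": new_id} for r in sets[old_id])
    return replica

# Hypothetical matched sets: one case with one or two controls each
records = [
    {"set_id": "a", "case": 1}, {"set_id": "a", "case": 0}, {"set_id": "a", "case": 0},
    {"set_id": "b", "case": 1}, {"set_id": "b", "case": 0},
    {"set_id": "c", "case": 1}, {"set_id": "c", "case": 0}, {"set_id": "c", "case": 0},
]
replica = resample_matched_sets(records, random.Random(0))
```

Each replica keeps every drawn case together with its own controls, so the matching structure of the original sample is preserved.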
The nonparametric bootstrap described above is no more complex to implement than multiple
imputation. It is, however, much more computer intensive. The estimation, imputation and
resampling procedures are all called B=1000 times, which gives long computing times. Our
implementation of the bootstrap procedure applied to the RSV study gave the results in Table 1 after
approximately one hour*. These computing times should not be an obstacle in practice. The
procedure can be sped up by better implementation and faster computers.
The nonparametric bootstrap gives very similar results to multiple imputation. The largest
difference is seen in the estimates for the levels of the maternal antibodies. Whereas the confidence
intervals are in agreement, or even slightly narrower for the bootstrap approach, the point estimates
differ more, probably because of imputation bias in the estimate used.
Missing values in a considerable part of the data because of organisational and financial reasons, as
in the RSV study, are often encountered. If the missing values do not depend on the case-control
identifier, complete-case analysis gives unbiased estimates and inference, at the expense of
efficiency. The latter is seen in Table 1 from the wider confidence intervals compared to both other
methods. Even so, the conclusions from the three analyses do not differ much. The information on
the maternal antibodies is the most costly, however, and the wide confidence intervals in the
complete-case analysis are caused by inefficiency. Both multiple imputation and the nonparametric
bootstrap give narrower confidence intervals, and their results are strikingly similar. This indicates
that both approaches perform well, in spite of recent critique of multiple imputation inference10, a
possibly improper imputation procedure and approximate resampling techniques.

* Pentium 600MHz, 128Mb RAM
& Shao and Sitter23 construct a bootstrap procedure for already imputed data, relevant when imputed
data is the only data available, but not relevant in a situation when imputer and analyst are the same
person.
Welding exposure – limitations of the nonparametric bootstrap
A second study, with considerable missing values in the risk factors, assesses the effect of
occupational exposure to welding on the risk of breast cancer25. This matched case-control study features 1326 cases
of breast cancer from two occupational cohorts, covering the period 1985-1994 in
Sweden and 1975-1993 in Denmark. From these cohorts one control per case, free of breast cancer at the
corresponding case’s date of diagnosis, is chosen randomly, matched on sex, age and nationality.
Information on exposure to various forms of welding and solvents is obtained by questionnaires
sent out to employers, some of them former employers. An overview is given in Table 2. Up to 52% of the
information on exposure to the risk factors is missing, as companies could not recall exactly
what work a certain person was doing, possibly decades ago. In the Danish data the total
employment period up to diagnosis is identified from pension fund records, compulsory since 1964.
For the Swedish data, however, only information about the employment time after 1984 is present, as this
was collected from tax returns for 1985 through 1994.
Table 2: Data description and estimated Odds Ratios (OR), relative to the baseline level of no exposure, with
corresponding 95% Confidence Intervals (CI) in a multivariate conditional logistic regression for the increased risk
of development of breast cancer for exposure to various forms of welding. The last column (# undefined) lists the
number of bootstrap replica data samples out of 1000 that resulted in undefined estimates for the corresponding
exposure.

Risk factor          Level    Cases  Controls  Missing     Exposure  OR (95% CI)       # undefined
Matching variables
Country              Denmark    733       733  0 (0%)
                     Sweden     593       593
Sex                  male        14        14  0 (0%)
                     female    1312      1312
Resistance welding   yes         14        31  1083 (41%)  low       0.41 (0.13-0.95)  0
                     no         771       753              high      0.36 (0.11-0.76)  0
Arc welding          yes         13         3  1179 (44%)  low       5.10 (1.13-∞)     1
                     no         733       724              high      9.57 (1.71-∞)     0
Other welding        yes          2         2  1376 (52%)  low       0.98 (0.00-∞)     32
                     no         734       723              high      67.4 (0-∞)        574
Solvents             yes         57        74  1191 (45%)  low       0.75 (0.46-1.10)  0
                     no         582       563              high      0.91 (0.74-1.11)  0
The missing values in this welding exposure study depend on seniority and duration of the
employment, but not on the disease identifier. Consistent complete-case inference is therefore
possible. This is, however, not a real option because of efficiency loss: a complete-case analysis only
uses information from 1227 (46%) individuals, spreading the already sparse positive indications of
welding and solvents exposure thin. Both multiple imputation and the nonparametric bootstrap are
efficient options, and an implementation of the latter, as in the RSV study, is attempted. Limitations of the
nonparametric bootstrap are illustrated below.
Following the paradigm for the nonparametric bootstrap, confidence intervals for the welding
exposures are constructed from the imputation estimates from many datasets sampled with
replacement from the available matched pairs. Note that this resampling procedure, approximate in
the RSV study, is exact here. We aim at estimating ORs for low and high exposure to each welding factor, defined as
less than and more than 2000 hours, respectively, of total occupational exposure to welding or solvents. The
exposure is calculated from the approximate weekly exposure, if exposed at all, and the total
employment period.
The total employment period is only partly known for the Swedish part of the data. The Danish
employment data, transformed to be comparable to the Swedish data, i.e. with the total employment times
truncated to employment times after 1984, show no evidence of a significant difference from
the Swedish data (Wilcoxon p-value 0.3448). An assumed similarity in employment times between
the two countries is used in the imputation procedure. The incomplete Swedish total employment
information is replaced by a random draw from the distribution observed in the Danish part of
the data, restricted to values no smaller than the employment time after 1984 in the Swedish data. This non-standard part of the imputation procedure interferes with available general imputation software and
increases implementation efforts.
Relatively few subjects in the welding exposure study indicate any exposure. This might imply that
regressions in which several welding exposure indicators are included are badly identified. This might
not be the case with the original dataset, but there is a risk of data separation in resampled datasets.
Sequential regression imputation20,21, now using ordinal logistic regression for the three possible
classes of exposure, is again attempted in the welding exposure study. Blindly applying this
paradigm to a data sample where two factors are separated prohibits the imputation of a positive
indication for one factor when the other factor is present; this is a strong model assumption when
these factors are sparse and separated, often by coincidence. The imputation procedure is therefore
constructed such that no sparse factors are included as covariates in the imputation models,
thereby circumventing separation problems. This is justified by an assumption that there are no
systematic correlations between exposure to various forms of welding or solvents, all sparse factors.
This implies an imputation procedure where the ordinal logistic models depend only on the case-control identifier, age and country of residence; sex is also left out of the imputation models since
the sample consists predominantly of women.
To estimate the ORs for the various forms of welding or solvents, all exposures are included in the
multivariate conditional logistic regression for each bootstrap replica data sample. Due to the
resampling, there is a possibility that for either cases or controls or both a certain exposure is not
present in the replica data sample, such that estimates for this factor cannot be identified. The
probability of having at least one situation of no exposure indication at all in the collection of B
bootstrap resamples is given by

$$1 - \left(1 - \left(\frac{N-n}{N}\right)^{N}\right)^{B} \;\xrightarrow{\,N \to \infty\,}\; 1 - \left(1 - \exp(-n)\right)^{B}$$

where N is the total sample size and n the number of exposure indications. This probability converges
relatively fast for $N \to \infty$. It then follows that, in studies with a decent sample size and using a
suggested15 B=1000, the above probability is 4.4% for n=10. If, as a rule of thumb, a 5% possibility
of having no observations with exposure indication to a certain factor in at least one of the
resampled datasets is accepted, at least 10 exposure indications, the sum of cases and controls,
should be in the original dataset. Consequently, if either cases or controls have fewer than 10
exposure indications for a certain factor, a more than 5% probability of data separation exists,
resulting in infinite parameter estimates.
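The rule of thumb can be checked numerically with a short Python sketch. The function names are hypothetical; N=2652 is used as an example value and corresponds to the 1326 cases plus 1326 controls of the welding exposure study.

```python
import math

def p_empty_resample(n, N, B=1000):
    """P(at least one of B resamples of size N holds none of the n exposed subjects)."""
    p_one = ((N - n) / N) ** N       # one resample misses all n exposed subjects
    return 1 - (1 - p_one) ** B

def p_empty_limit(n, B=1000):
    """Limit of the above probability for N to infinity."""
    return 1 - (1 - math.exp(-n)) ** B

for n in (8, 9, 10, 11):
    print(n, round(p_empty_resample(n, 2652), 4), round(p_empty_limit(n), 4))
```

For n=10 the limiting value is about 0.044, and n=9 already exceeds the 5% threshold, which is where the minimum of 10 exposure indications comes from.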
The above rule of thumb on total exposure indication is violated for the data of the welding
exposure study, as there are only four indications of other forms of welding, of which only one is in
the high class of exposure. When for a certain exposure in a bootstrap replica data sample no cases
and controls have a positive indication, estimates are undefined for these exposures. To calculate
correct confidence intervals, undefined bootstrap replica estimates have to be taken into account.
This is done by calculating adjusted percentiles of the distribution of the defined estimates such that
with 95% confidence the estimate is defined and within the interval. This adjustment is
implemented by taking the $100(1-\alpha)\%$ confidence interval from the defined estimates, where $\alpha$ is
defined as

$$\alpha = \max\left(0,\ \left(0.05 - \frac{u}{B}\right)\frac{B}{B-u}\right)$$

with B the number of bootstrap resamples and u the number of undefined estimates. Undefined
estimates appear for both exposure classes of other forms of welding and for the low exposure class
of arc welding (Table 2) and their confidence intervals are shown adjusted accordingly. The
exposure to other forms of welding is so sparse that the confidence interval effectively covers all
possible values, i.e. no effect estimate can be given with any confidence.
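The adjustment can be written as a small Python function, applied here to the numbers of undefined replicas reported in Table 2. The function name and the guard for the case where all replicas are undefined are assumptions of this sketch.

```python
def adjusted_alpha(u, B=1000, alpha=0.05):
    """Adjusted level such that (1 - result) * (B - u) / B equals 1 - alpha."""
    if u >= B:
        return 0.0  # no defined estimates at all; nothing to adjust (assumption)
    return max(0.0, (alpha - u / B) * B / (B - u))

# Numbers of undefined replica estimates out of B=1000, taken from Table 2
for u in (0, 1, 32, 574):
    a = adjusted_alpha(u)
    print(f"u={u:4d}  alpha={a:.4f}  interval level={100 * (1 - a):.1f}%")
```

With u=32 the interval is taken at about the 98.1% level of the defined estimates, and with u=574 the adjusted level is zero, so no finite interval remains.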
The results of the bootstrap approach in Table 2 show a significant protective effect of resistance
welding on breast cancer, but a harmful effect of arc welding. The confidence intervals are
generally wide, reflecting the low number of indications in each exposure level. A conclusion is that
there is an effect of resistance welding and arc welding, but that the effect size or the existence of a dosage
effect cannot be determined.
The welding exposure study illustrates the breakdown of the bootstrap procedure on several points.
Non-standard imputation and more exposure variables with missing values, in addition to increasing the
implementation effort, increase the time needed for computation. For B=1000 the implemented
bootstrap took 3 hours¥. Imputation through sequential regression20,21 using all available
information, although possible even when complete separation exists, will depend too much on how
sparse factors enter these models. Not including sparse factors as covariates in the imputation
regression models can overcome this problem, and seems defensible when mutual exclusion cannot
be argued for, but undermines the consistency if factual correlations are disregarded. With very
sparse exposure, some resampled data will have no exposure indications at all for a certain factor,
resulting in unidentified estimates, but bootstrap confidence intervals can still be obtained. Of
course, in the welding exposure study, discarding the exposure to other forms of welding and
deleting the few individuals with this exposure could be a justified approach without much loss of
efficiency for the other estimates in the face of this increase in complexity.

¥ Pentium 600MHz, 128Mb RAM
Discussion
Multiple imputation as a general approach for handling missing data has come under fire from critics
claiming that proper imputations, necessary for valid inference, are difficult to produce, especially
in data where multiple factors are deficient9,11, and that even then multiple imputation is biased in some
cases10. Computer intensive resampling techniques are a real alternative for many case-control
studies with missing values, where imputer and analyst are one person and analysis is performed on
one computer. The nonparametric bootstrap needs only a first-moment proper imputation scheme
for valid confidence intervals, i.e. an imputation resulting in unbiased estimates, which is much
easier to obtain and assess than fully proper imputation. The main point of critique on the use of
the bootstrap to obtain confidence intervals for imputation estimates has been the long computing
times7,18,26. These times need not be unacceptable when a simple imputation procedure can be used
and estimation of the bootstrap replica estimates is straightforward, as in the RSV study. The
multiple imputation drawbacks stated above are thereby avoided. The welding exposure study
shows that sparse data can, due to separation in resampled data, force the imputation procedure to
use less information than wanted, thereby undermining the consistency of the imputation
estimate. When the study is such that the imputation can be assumed to give consistent imputation
estimates, sparse risk factors inflate the bootstrap confidence intervals only when exposure is
very rare, in which case the analyst could reasonably consider eliminating that factor from the
analysis.
A complete-case approach often gives consistent analysis in matched case-control studies with
missing values in the risk factors3,4. If evidence from this analysis is not enough to support a final
conclusion, a bootstrap approach can be used if a first-moment proper imputation procedure and
subsequent consistent estimation in each resampled dataset can be performed sufficiently fast.
Multiple imputation will in this case also give consistent inference in only a fraction of the
computing time when additionally the imputation can be assumed proper, which is sometimes the
case for particularly simple missing mechanisms, and when the estimates are approximately normally
distributed. Consequently, a bootstrap approach has an advantage over multiple imputation when
the imputation procedure can be assessed as first-moment proper but not
fully proper, or when the estimates are not approximately normal. Hot-deck and random-draw
regression methods usually satisfy the first-moment condition7 and are generally not hard to implement.
Acknowledgements
This work was supported by Public health Services Grant 2R01-CA54706-10A1 from the National
Cancer Institute. We thank H.E. Nielsen – Department of Paediatrics, Gentofte University Hospital
– for his contributions and permission to use data from the RSV-study in this paper, and B. Floderus
and N. Håkansson – Division of Epidemiology, Karolinska Institutet – for their work and
permission to use data from the Swedish arm of the welding-study in this paper.
References
1. Breslow NE and Day NE: Statistical Methods in Cancer Research, 1, The Analysis of Case-Control Studies,
Lyon, International Agency for Research on Cancer, 1980
2. Breslow NE: Statistics in epidemiology: the case-control study. Journal of the American Statistical
Association 91: 14-28, 1996
3. Breslow NE and Cain KC: Logistic regression for two-stage case-control data. Biometrika 75: 11-20, 1988
4. Lipsitz SR, Parzen M and Ewell M: Inference using conditional logistic regression with missing covariates.
Biometrics 54: 295-303, 1998
5. Rubin DB: Inference and missing data. Biometrika 63: 581-592, 1976
6. Rubin DB: Multiple Imputation for Nonresponse in Surveys, New York, Wiley & Sons, 1987
7. Rubin DB: Multiple imputation after 18+ years. Journal of the American Statistical Association 91: 473-489,
1996
8. Li KH, Raghunathan TE and Rubin DB: Large-sample significance levels from multiply-imputed data using
moment-based statistics and an F reference distribution. Journal of the American Statistical Association 86:
1065-1073, 1991
9. Schafer JL: Analysis of Incomplete Multivariate Data, London, Chapman & Hall, 1997
10. Robins JM and Wang N: Inference for imputation estimators. Biometrika 87: 113-124, 2000
11. Binder DA and Sun W: Frequency valid multiple imputation for surveys with a complex design.
Proceedings of the Survey Research Methods Section of the American Statistical Association 7: 281-286, 1996
12. Efron B and Tibshirani R: An Introduction to the Bootstrap. New York, Chapman & Hall, 1993
13. Manly BFJ: Randomization, Bootstrap and Monte Carlo Methods in Biology, 2nd edition. London, Chapman
& Hall, 1997
14. Davison AC and Hinkley DV: Bootstrap Methods and their Applications, London, Chapman & Hall, 1997
15. Carpenter J and Bithell J: Bootstrap confidence intervals: when, which, what? A practical guide for medical
statisticians. Statistics in Medicine 19: 1141-1164, 2000
16. Efron B: Missing data, imputation and the bootstrap (with discussion). Journal of the American Statistical
Association 89: 463-479, 1994
17. Laird NM and Louis TA: Empirical Bayes confidence intervals based on bootstrap samples (with
discussion). Journal of the American Statistical Association 82: 739-757, 1987
18. Little RJA and Rubin DB: Statistical Analysis with Missing Data, 2nd edition. New York, Wiley and Sons,
2002
19. Nielsen HE, Siersma V, Andersen S, Gahn-Hansen B, Mordhorst CH, Nørgaard-Pedersen B, Røder B,
Sørensen TL, Temme R and Vestergaard BF: Respiratory Syncytial Virus infection: Risk factors for hospital
admission, a population-based study. Acta Paediatrica 92: 1314-1321, 2003
20. Kennickell AB: Imputation of the 1989 Survey of Consumer Finances: Stochastic Relaxation and Multiple
Imputation, Proceedings of the Survey Research Methods Section of the American Statistical Association, 110, 1991
21. Raghunathan TE, Lepkowski JM, Solenberger P and Van Hoewyk J: A multivariate technique for multiply
imputing missing values using a sequence of regression models. Survey Methodology 27: 85-95, 2001
22. Raghunathan TE, Solenberger P and Van Hoewyk J: IVEware: imputation and variance estimation software.
Michigan, Survey Methodology Program, Survey Research Center, Institute for Social Research, University
of Michigan. http://www.isr.umich.edu/src/smp/ive/ 2002
23. Shao J and Sitter RR: Bootstrap for imputed survey data. Journal of the American Statistical Association 91:
1278-1288, 1996
24. Efron B: Better bootstrap confidence intervals. Journal of the American Statistical Association 82: 171-200,
1987
25. Johansen C, Floderus F, Håkansson N and Olsen JH: unpublished data 2003
26. Rubin DB: Comments on “Missing data, imputation and the bootstrap” by B Efron. Journal of the American
Statistical Association 89: 475-478, 1994