Chapter 5-13. Monte Carlo Simulation and Bootstrapping

In this chapter, we will learn how to perform both a Monte Carlo simulation and a bootstrap
simulation to obtain a confidence interval for the mean. We will contrast the results and compare
them to the formula approach for the CI.
In the next chapter, we will apply the bootstrap method to validate a prediction model
(prognostic model).
Exercise  Let’s begin with an example of bootstrapping being used (Kim et al, 2006).
In the Results section of the Abstract they state,
“Logistic regression analysis continued to show significance for lethargy (odds ratio,
2.20; bias-corrected 95% confidence interval, 1.11-3.63)...Bootstrap resampling validated
the importance of the significant variables identified in the regression analysis.”
In the Statistical Analysis section they state,
“We then performed bootstrap resampling procedures with 100 iterations each to obtain
95% bias-corrected CIs for each predictor variable and to assess its stability.”
In the Results section they state,
“With bootstrap analysis, the results of the multivariable analysis were validated, and new
biased-corrected 95% CIs were calculated (Table 4).”
The authors are attempting to convince the reader that their predictors of shunt malfunction are
more valid and reliable because they used a bootstrap approach.
By the end of this chapter, we should understand bootstrapping well enough to judge whether
bootstrapping added value to the Kim et al paper, or whether it simply misleads the reader into
thinking more validity was added than really was.
_________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah
School of Medicine, 2010.
Theoretical Justification for the Bootstrap Method
We begin with the theoretical justification for the bootstrap method.
Where the name “bootstrapping” comes from
Efron and Tibshirani (1998, p.5) explain,
“The use of the term bootstrap derives from the phrase to pull oneself up by one’s
bootstrap, widely thought to be based on one of the eighteenth century Adventures of
Baron Munchausen, by Rudolph Erich Raspe. (The Baron had fallen to the bottom of a
deep lake. Just when it looked like all was lost, he thought to pick himself up by his own
bootstraps.)”
Let’s recall what a sampling distribution is. Consider the sampling distribution of the mean.
Conducting a Monte Carlo simulation, we take 10,000 repeated samples (of size n=50, for
example) from the population, compute the mean from each of these samples, and then display
these means in a histogram. This histogram represents the “sampling distribution of the mean”.
In bootstrapping, we do something very similar. We begin with our sample (of size n=50, for
example). Then we take repeated samples (usually 1,000 resamples are sufficient) of size n=50 from
our sample, but we do it with replacement. That is, we randomly select an observation, record
its value, and then put it back in the hat so that it has a chance to be drawn again. From each
sample (called a “resample”, since we sampled from a sample), we compute the mean. We
display the 1,000 means computed from the resamples in a histogram. This histogram
represents the “sampling distribution of the mean”, just as it did in the Monte Carlo simulation.
Mooney and Duval (1993, p.10) give a concise description of bootstrapping,
“In bootstrapping, we treat the sample as the population and conduct a Monte Carlo-style
procedure on the sample. This is done by drawing a large number of ‘resamples’ of size n
from this original sample randomly with replacement. So, although each resample will
have the same number of elements as the original sample, through replacement
resampling each resample could have some of the original data points represented in it
more than once, and some not represented at all. Therefore, each of these resamples will
likely be slightly and randomly different from the original sample. And because the
elements in these resamples vary slightly, a statistic, θ̂* , calculated from one of these
resamples will likely take on a slightly different value from each of the other θ̂* ’s and
from the original θ̂* . The central assertion of bootstrapping is that a relative frequency
distribution of these θ̂* ’s calculated from the resamples is an estimate of the sampling
distribution of θ̂.”
If the sample is a good approximation of the population (a representative sample), bootstrapping
will provide a good approximation of the sampling distribution of θ̂ (Efron & Stein, 1981;
Mooney and Duval, 1993, p.20).
It might seem at first that there is not sufficient information in a sample to derive a sampling
distribution for a statistic. To explain why it works, we first define what a probability
distribution is (first described in mathematical language, followed by simple English).
For a discrete variable, the probability mass function p(a) of X is defined as:
p(a) = P{ X = a }
example: the pmf of a coin flip is
X=1 (if heads), p(1) = 1/2
X=0 (if tails ), p(0) = 1/2
For a continuous variable, the probability density function f(x) of X is defined as
    P{ X ∈ B } = ∫_B f(x) dx

example: the pdf of the normal distribution is

    P{ X ∈ B } = ∫_B [ 1 / (σ√(2π)) ] e^( −(x−μ)² / (2σ²) ) dx
Let f(x) denote the probability distribution for any variable (either a probability mass function or
a probability density function), which may be known or unknown. We can construct an
empirical probability distribution function (EDF) of x from the sample by placing a probability of
1/n at each point x1, x2, ... xn. This EDF of x is the nonparametric maximum likelihood estimate
(MLE) of the population distribution function, f(x) (Rao, 1987, pp. 162-166; Rohatgi, 1984,
pp.234-236). In other words, given no other information about the population, the sample is our
best estimate of the population. (Mooney and Duval, 1993, pp 10-11)
Stated in simple English, the EDF is simply the histogram of the sample data, with the heights of
the bars representing the proportion of the sample that have each specific value. When we assign
1/n to each observation (each birth weight), and 5 babies have a birth weight of 3015, the
probability of that birth weight is 5 × 1/n, or 5/n, which is the height of the bar for a birth weight
of 3015 in the histogram. The f(x) in the population is simply a similar histogram of all the
values in the population, with heights of bars scaled to represent proportions. The normal
distribution function given above is nothing more than a mathematical expression of the smooth
line drawn through the center of the top of each bar in the histogram.
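As a small numeric illustration of the 1/n idea, the height of an EDF bar at a value is just the
number of observations at that value divided by n. Here is a minimal Stata sketch, using a
hypothetical variable bweight (not in our dataset) to match the birth weight example above:

* EDF height at a single value: (count of observations at that value) / n
* bweight is a hypothetical variable used only for this illustration
quietly count
local n = r(N)
quietly count if bweight == 3015
display "EDF height at 3015 = " r(N) "/" `n' " = " r(N)/`n'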
In Chapter 5-5, when we derived the logistic regression model and introduced maximum
likelihood estimation, we said that the MLE of the logistic regression was the set of model
parameters (the α and β’s in the model) that gave the greatest probability of producing the data in
our sample.
When we say that the EDF is the nonparametric MLE of f(x), we are simply stating what
population distribution has the greatest probability (likelihood) of producing our observed
sample. It is the population distribution that has the identical shape of the sample distribution.
In other words, our sample is representative of, or looks just like, the population distribution.
Because the EDF is the nonparametric MLE of f(x), repeatedly resampling a
sample to arrive at the distribution function of a statistic, or bootstrap resampling, is analogous to
repeatedly taking random samples from a population to arrive at the distribution function of a
statistic (Monte Carlo sampling). (Mooney and Duval, 1993, p. 11)
Example
We will use the statistical formula approach, the Monte Carlo approach, and the Bootstrapping
approach to estimate a mean and its 95% confidence interval, and then compare the results.
We will illustrate the process with the Framingham Heart Study dataset.
Framingham Heart Study dataset (2.20.Framingham.dta)
This is a dataset distributed with Dupont (2002, p 77). The dataset comes from a long-term
follow-up study of cardiovascular risk factors on 4699 patients living in the town of
Framingham, Massachusetts. The patients were free of coronary heart disease at their baseline
exam (recruitment of patients started in 1948).
Data Codebook

Baseline exam:
  sdp        systolic blood pressure (SBP) in mm Hg
  dbp        diastolic blood pressure (DBP) in mm Hg
  age        age in years
  scl        serum cholesterol (SCL) in mg/100ml
  bmi        body mass index (BMI) = weight/height² in kg/m²
  sex        gender (1=male, 2=female)
  month      month of year in which baseline exam occurred
  id         patient identification variable (numbered 1 to 4699)

Follow-up information on coronary heart disease:
  followup   follow-up in days
  chdfate    CHD outcome (1=patient develops CHD at the end of follow-up, 0=otherwise)
Reading in the data,
File
Open
Find the directory where you copied the course CD
Change to the subdirectory datasets & do-files
Single click on 2.20.Framingham.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\
Biostats & Epi With Stata\datasets & do-files\
2.20.Framingham.dta", clear
*
which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\"
cd "Biostats & Epi With Stata\datasets & do-files"
use 2.20.Framingham.dta, clear
We’ll consider this sample (N=4,699) to be our population, which we will soon take a smaller
sample from. We are doing this for illustrative purposes, since we need to know the correct
value of the mean that our sample is trying to estimate.
Computing the “population” confidence interval for serum cholesterol (SCL),
Statistics
Summaries, tables & tests
Summary and descriptive statistics
Confidence intervals
Main tab: Variables: scl
OK
ci scl
    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
         scl |       4666    228.2925    .6520838        227.0141    229.5709
Interpretation of Confidence Interval
With a 95% confidence interval, we are 95% confident that the interval covers the population
mean. (population mean is considered fixed, the interval is random)
Van Belle et al (2004, p.86) provide the following interpretation for the 95% confidence interval
for the population mean,  :
“Since the sample mean, Ȳ, varies from sample to sample, it cannot mean that 95% of
the sample means will fall in the interval for a specific sample mean. The interpretation
is that the probability is 0.95 that the interval straddles the population mean.”
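This interpretation can be checked by simulation. Below is a minimal sketch that repeatedly draws
samples of n=200 from the full Framingham file (treated as the population, with mean scl =
228.2925 from the ci output above) and counts how often the formula 95% CI covers that mean.
The seed and the number of repetitions are arbitrary choices for illustration.

* Sketch: how often does the formula 95% CI cover the "population" mean?
* Treats the full Framingham file as the population (mean scl = 228.2925).
clear
local cover = 0
set seed 12345                      // arbitrary seed for this illustration
forvalues i = 1/1000 {
    use "2.20.Framingham.dta", clear
    quietly sample 200, count       // a fresh random sample each iteration
    quietly ci scl
    if r(lb) <= 228.2925 & 228.2925 <= r(ub) local cover = `cover' + 1
}
display "Proportion of intervals covering the population mean: " `cover'/1000

The displayed proportion should come out close to 0.95.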
We will take a random sample of n=200 out of n=4,699 to provide a smaller, simpler dataset for
illustration.
set seed 999
sample 200, count
Formula Approach
Using our sample of N=200 patients, which contains one missing value for serum cholesterol
(SCL), we use the ordinary formula approach to obtain the mean and 95% CI.
ci scl
    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
         scl |        199    233.4472    3.014984        227.5016    239.3928
We see that the mean differs slightly from the population mean
    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
         scl |       4666    228.2925    .6520838        227.0141    229.5709
due to sampling variation.
Looking at the histogram of the n=199 SCL values,
histogram scl , percent

[Histogram of the n=199 SCL values. x-axis: Serum Cholesterol (150 to 350); y-axis: Percent (0 to 20)]
we see that the SCL variable is skewed to the right.
The central limit theorem states that the sampling distribution of the mean SCL is normally
distributed, even though the distribution of individual SCLs is skewed.
Therefore, the 95% confidence interval around the mean, which assumes a normal sampling
distribution that is symmetrical, is still a correct confidence interval.
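As a check on the formula the ci command is using (the mean plus or minus a t quantile times the
standard error), we can reproduce the confidence limits by hand. This is just a sketch using the
numbers from the ci output above:

* Reproduce the formula CI by hand: mean ± t(0.975, n-1) × SE, with n = 199
display 233.4472 - invttail(198, .025)*3.014984   // lower limit, about 227.50
display 233.4472 + invttail(198, .025)*3.014984   // upper limit, about 239.39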
Monte Carlo Approach
From the sample we used in the formula approach, we obtain the mean and standard deviation,
which we will need in a moment,
Statistics
Summaries, tables & tests
Summary and descriptive statistics
Summary statistics
Main tab: Variables: scl
OK
summarize scl
<or abbreviate to:>
sum scl
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         scl |       199    233.4472    42.53158        150        375
We do not have a theoretical probability distribution that has exactly the skewness of our sample
that we can use in a Monte Carlo simulation to arrive at a 95% CI for the mean. Therefore, we
will just use random samples from a Normal distribution with the same mean (233.4472) and
standard deviation (42.53158) as an approximation. We will use sample sizes of n=199, the
same number of SCL values in our original sample.
* compute a Monte Carlo 95% CI from a normal
* distribution with mean of 233.4472 and SD of 42.53158
clear
set seed 999
quietly set obs 10000
gen scl=.
gen meanscl=.
forvalues i=1(1)10000 {
    quietly replace scl=233.4472+42.53158*invnorm(uniform()) in 1/199
    quietly sum scl, meanonly
    quietly replace meanscl=r(mean) in `i'/`i'
}
histogram meanscl, percent normal
sum meanscl
centile meanscl, centile(2.5 97.5)
Note: In this simulation, we used Stata’s “invnorm(uniform())” which returns a value from the
standard normal distribution. To convert this to a normal distribution with a given mean and
standard deviation, we use the fact that
    (X − Mean) / SD = z ,  which is Normal with mean=0 and SD=1 (Standard Normal)

    so  X = Mean + SD×z ,  which is Normal with the desired mean and SD
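As an aside, if you are using Stata 10.1 or later, the rnormal() function generates such draws
directly. A one-line equivalent of the replace statement in the loop above would be (shown only
as an alternative, not what was run here):

* equivalent draw in Stata 10.1+ using rnormal(mean, sd)
quietly replace scl = rnormal(233.4472, 42.53158) in 1/199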
[Histogram of the 10,000 simulated means (meanscl) with a normal curve overlaid. x-axis: meanscl (225 to 245); y-axis: Percent (0 to 8)]
. sum meanscl

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     meanscl |     10000    233.4123    3.031482   222.9115   245.1845
. centile meanscl, centile(2.5 97.5)

                                                       -- Binom. Interp. --
    Variable |      Obs  Percentile      Centile        [95% Conf. Interval]
-------------+--------------------------------------------------------------
     meanscl |    10000         2.5     227.3847        227.207     227.5685
             |                 97.5     239.3856        239.2764    239.5407
Using the 2.5-th and 97.5-th percentiles as the 95% CI, along with the computed mean of the
sample of means, we get:
Monte Carlo:
mean = 233.4123 , 95% CI (227.3847 , 239.3856)
which is very close to the original sample values from the ci command:
    Variable |        Obs        Mean    Std. Err.       [95% Conf. Interval]
-------------+---------------------------------------------------------------
         scl |        199    233.4472    3.014984        227.5016    239.3928
Original Sample:
mean = 233.4472 , 95% CI (227.5016 , 239.3928)
At this point, we have verified that Monte Carlo simulation produces the same result as statistical
formulas. In other words,
Taking a series of samples to produce the long run average
gives the same answer as statistical formulas.
In this example, we contrived the Monte Carlo population to match the sample result, by
assuming the population had the same mean and SD as the sample. That is why the Monte Carlo
approach matched the sample result so closely.
Monte Carlo:
mean = 233.4123 , 95% CI (227.3847 , 239.3856)
Original Sample:
mean = 233.4472 , 95% CI (227.5016 , 239.3928)
Population:
mean = 228.2925 , 95% CI (227.0141 , 229.5709)
Bootstrapping Approach
First, we bring our original sample back into Stata, since we cleared it from memory before
running the Monte Carlo experiment.
use "2.20.Framingham.dta", clear
set seed 999
sample 200, count
drop if scl==. // drop missing value observation
sum scl
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
         scl |       199    233.4472    42.53158        150        375
Notice that we dropped the observation with a missing scl value from Stata’s memory.
It is always a good idea to drop missing values from the dataset before performing a bootstrap.
The idea of the bootstrap is to draw resamples of equal size. Stata draws each resample from the
observations in memory, regardless of whether they contain missing values, so leaving missing
values in produces resamples with unequal numbers of usable observations from iteration to
iteration. This may not matter if the number of missing values is small relative to the sample
size, but it would matter otherwise.
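A quick sketch of a safety check that can be run before resampling (count and assert are standard
Stata commands):

* confirm no missing scl values remain before bootstrapping
quietly count if missing(scl)
assert r(N) == 0   // stops with an error if any missing values are left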
Computing the bootstrap, using 1,000 re-samples,
* compute bootstrap
set seed 999
set obs 1000
capture drop seqnum bootwt meanscl
gen seqnum=_n // create a variable to sort on
gen bootwt=.
gen meanscl=.
forvalues i=1(1)1000 {
    bsample 199 in 1/199, weight(bootwt)
    sort seqnum   // bsample unsorts the data
    quietly sum scl [fweight=bootwt], meanonly
    quietly replace meanscl=r(mean) in `i'/`i'
    * use list to see results for first two iterations
    list seqnum scl bootwt meanscl in 1/5 if `i'<=2
}
sum meanscl
centile meanscl, centile(2.5 97.5)
     +-----------------------------------+
     | seqnum   scl   bootwt    meanscl  |
     |-----------------------------------|
  1. |      1   289        0   228.0302  |
  2. |      2   275        1          .  |
  3. |      3   217        1          .  |
  4. |      4   271        1          .  |
  5. |      5   234        3          .  |
     +-----------------------------------+

     +-----------------------------------+
     | seqnum   scl   bootwt    meanscl  |
     |-----------------------------------|
  1. |      1   289        0   228.0302  |
  2. |      2   275        0   234.8744  |
  3. |      3   217        1          .  |
  4. |      4   271        2          .  |
  5. |      5   234        1          .  |
     +-----------------------------------+
. sum meanscl

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     meanscl |      1000    233.4235    3.010982   224.3467   242.1859
. centile meanscl, centile(2.5 97.5)

                                                       -- Binom. Interp. --
    Variable |      Obs  Percentile      Centile        [95% Conf. Interval]
-------------+--------------------------------------------------------------
     meanscl |     1000         2.5     227.6337        226.9178    228.0157
             |                 97.5     239.4345        238.7897    240.2121
Using the 2.5-th and 97.5-th percentiles as the 95% CI, along with the computed mean of the
sample of means, we get a mean of 233.4235 and 95% CI of (227.6337 , 239.4345).
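For comparison, here is an alternative way to code the same percentile bootstrap that some readers
may find more transparent: instead of frequency weights, each iteration physically resamples the
data with bsample inside preserve/restore and posts the resample mean to a results file. This is
only a sketch (the file name bootmeans.dta is an arbitrary choice), not the approach used above.

* Alternative sketch of the percentile bootstrap using preserve/restore
* (assumes the n=199 sample is in memory; bootmeans.dta is an arbitrary name)
set seed 999
tempname memhold
postfile `memhold' meanscl using bootmeans, replace
forvalues i = 1/1000 {
    preserve
    quietly bsample                  // resample _N observations with replacement
    quietly sum scl, meanonly
    post `memhold' (r(mean))
    restore
}
postclose `memhold'
use bootmeans, clear                 // note: this replaces the sample in memory
centile meanscl, centile(2.5 97.5)   // percentile 95% CI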
Contrasting the Results
Contrasting the results from the three approaches:
Formula:
mean = 233.4472 , 95% CI (227.5016 , 239.3928)
Monte Carlo:
mean = 233.4123 , 95% CI (227.3847 , 239.3856)
Bootstrap:
mean = 233.4235 , 95% CI (227.6337 , 239.4345)
We see that all three approaches produce very similar results.
In particular, notice how similar the bootstrapping algorithm was to the Monte Carlo experiment.
In fact, all we did was treat the sample as if it was the population, and then applied Monte Carlo
sampling to generate an empirical estimate of the statistic’s sampling distribution.
Definition  sampling without replacement: each observation can be selected only once (do not
put the observation “back in the hat” after selecting it).
Definition  sampling with replacement: each observation can be selected on every draw (put the
observation “back in the hat” after selecting it), so it may appear in the sample more than once.
In bootstrap sampling, sampling with replacement is used, and is necessary. If we sampled
n=199 observations without replacement, we would always obtain the original sample, and no
variability would be introduced into the simulation.
Sampling with replacement seems strange to us, because we do not do this when we take a
sample of patients in our research. Its use is justified in the following box.
Sampling with replacement
In bootstrapping, we take a sample with replacement.
As strange as that type of sampling might seem, it turns out that the standard error of the mean,
which is required for the 95% CI for the mean, is based on the assumption that sampling is either
with replacement or that the samples are drawn from an infinite population. (Daniel, 1995, pp.125-127)
This is because the statistical theory underlying this statistic assumes that every data point is
independently and identically distributed (i.i.d.). If sampling is done from a small population
without replacement, the way samples are usually drawn, the probability of drawing any
remaining data point becomes larger with each draw. For example, if the population size is
N=100 with a uniform distribution, so that every data point has an equal probability of 1/100,
then the actual probabilities of drawing successive data points without replacement are:
1st observation: 1/100
2nd observation: 1/99
3rd observation: 1/98
Therefore, the probability distribution changes every time a data point is removed by sampling
without replacement, and thus violates the i.i.d. assumption.
When sampling is done without replacement, then, a finite population correction is required. For
example, the correct formula for the standard error of the mean is no longer

    SE = s / √n

but is instead

    SE = (s / √n) × (finite population correction factor) = (s / √n) × √[ (N − n) / (N − 1) ]
When the population size, N, is much larger than the sample size, n, this correction factor is so
close to 1 that its effect is negligible. In practice, statisticians ignore the correction factor when
the sample is no more than 5 percent of the population size, or n/N ≤ 0.05. (Daniel, 1995,
pp.125-127)
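For our setting (a sample of n=199 drawn from the N=4,699 Framingham “population”), a quick
check shows the correction is indeed negligible:

* finite population correction for n=199 drawn from N=4,699
display sqrt((4699-199)/(4699-1))   // about 0.979, essentially 1
display 199/4699                    // about 0.042, under the 0.05 rule of thumb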
In bootstrapping however, the sample becomes the population, or sampled population, so
sampling with replacement is necessary to satisfy the i.i.d. assumption. This i.i.d. assumption
was stated above in the statistical theory justification for bootstrapping as, “We can construct an
empirical probability distribution function (EDF) of x from the sample by placing a probability of
1/n at each point x1, x2, ... xn. This EDF of x is the nonparametric maximum likelihood estimate
(MLE) of the population distribution function, f(x).” That is, sampling with replacement is
necessary to provide the 1/n probability for each data point.
Easier Approach
You can also use the bootstrap command for this bootstrapped 95% CI, but it uses the
“normal-based” confidence interval, rather than the “percentile” confidence interval used above.
Either approach is okay, but the results are not identical.
The above bootstrap approach changed the number of observations in memory to n=1,000, with a
lot of missing observations for scl. We need to return the data in Stata memory to its original
state, with a sample size of n=199, before we bootstrap again.
use "2.20.Framingham.dta", clear
set seed 999
sample 200, count
drop if scl==. // drop missing value observation
sum scl
We will compute the bootstrap CI, using a seed of 999 so our result matches what is shown
below. Using the menu,
Statistics
Resampling
Bootstrap estimation
Main tab: Stata command to run: sum scl
Other statistical expressions: r(mean)
Replications: 1000
Options tab: Sample size: 199
Advanced tab: Random-number seed: 999
OK
bootstrap r(mean), reps(1000) size(199) seed(999) : sum scl
Bootstrap results                               Number of obs      =       199
                                                Replications       =      1000

      command:  summarize scl
        _bs_1:  r(mean)

------------------------------------------------------------------------------
             |    Observed   Bootstrap                         Normal-based
             |       Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |    233.4472   3.069715    76.05   0.000     227.4307    239.4638
------------------------------------------------------------------------------
which is very close to the “percentile” method used above, which gave:
Bootstrap:
mean = 233.4235 , 95% CI (227.6337 , 239.4345)
Many Forms of Bootstrap Estimates
What has been presented is called the “percentile method” of bootstrapping, which is the easiest
to understand. Other methods add some statistical rigor and frequently result in tighter
confidence limits. A number of these are presented by Carpenter and Bithell (2000). Four of
these other forms are available in Stata, using the “estat” command.
The “BCa” method, however, must be requested explicitly (with the bca option), since it
requires more computation time.
bootstrap r(mean), reps(1000) size(199) seed(999) bca : sum scl
estat bootstrap, all
Bootstrap results                               Number of obs      =       199
                                                Replications       =      1000

      command:  summarize scl
        _bs_1:  r(mean)

------------------------------------------------------------------------------
             |    Observed   Bootstrap                         Normal-based
             |       Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |    233.4472   3.069715    76.05   0.000     227.4307    239.4638
------------------------------------------------------------------------------

. estat bootstrap, all

Bootstrap results                               Number of obs      =       199
                                                Replications       =      1000

      command:  summarize scl
        _bs_1:  r(mean)

------------------------------------------------------------------------------
             |    Observed               Bootstrap
             |       Coef.       Bias    Std. Err.  [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _bs_1 |   233.44724  -.0524171   3.0697148    227.4307    239.4638   (N)
             |                                        227.5477    239.4824   (P)
             |                                        227.5729    239.4824  (BC)
             |                                        227.6131    239.4874 (BCa)
------------------------------------------------------------------------------
(N)    normal confidence interval
(P)    percentile confidence interval
(BC)   bias-corrected confidence interval
(BCa)  bias-corrected and accelerated confidence interval
To get a feel for how the confidence limits from the four approaches center around the sample
(observed) mean, we can run the following commands from the do-file editor.
display " (N):
" observed
display " (P):
" observed
display " (BC):
" observed
display "(BCa):
" observed
observed coef-lower
coef-upper bound: "
observed coef-lower
coef-upper bound: "
observed coef-lower
coef-upper bound: "
observed coef-lower
coef-upper bound: "
bound: " 233.4472-227.4307
233.4472-239.4638
bound: " 233.4472-227.5477
233.4472-239.4824
bound: " 233.4472-227.5729
233.4472-239.4824
bound: " 233.4472-227.6131
233.4472-239.4874
///
///
///
///
which produces,
(N):
(P):
(BC):
(BCa):
observed
observed
observed
observed
coef-lower
coef-lower
coef-lower
coef-lower
bound:
bound:
bound:
bound:
6.0165
5.8995
5.8743
5.8341
observed
observed
observed
observed
coef-upper
coef-upper
coef-upper
coef-upper
bound:
bound:
bound:
bound:
-6.0166
-6.0352
-6.0352
-6.0402
Only the “normal” method attempts to have confidence bounds which are symmetric about the
mean. The other methods are wider on the right, or upper side, of the interval, which is the same
direction in which the data are skewed.
Bias
The bias that is being corrected is the same bias that statisticians are familiar with.
Letting T be a statistic and θ be a population parameter, the bias is E(T) − θ, where E(T) is the
expected value of the statistic, which is its long-run average. The bias shown in the above
output is the bootstrap estimate of this bias.
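If the meanscl variable from the manual bootstrap earlier in the chapter were still in memory (it
was cleared when we reloaded the data above, so the manual loop would need to be rerun), the
bootstrap bias estimate could be computed by hand as the average of the 1,000 resample means
minus the observed sample mean. It will differ slightly from the bias shown by estat bootstrap
because the resamples differ. A sketch:

* bootstrap bias estimate by hand: mean of the resample means minus the observed mean
quietly sum meanscl, meanonly
display "estimated bias = " r(mean) - 233.4472   // about -0.02 for the manual run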
Steyerberg (2009, p.94) explains it for the BCa approach,
“Bias-corrected percentile method: Bias in estimation of the distribution is accounted for,
based on the difference between the median of the bootstrap estimates and the sample
estimate (“BCa”).¹⁰⁸”
--------------
¹⁰⁸ Efron B, Tibshirani R. An introduction to the bootstrap. New York: Chapman & Hall, 1993.
It was the BC, or “bias-corrected,” confidence interval that Kim et al (2006) used (our example
article from the beginning of this chapter).
Here is the approach they used:
bootstrap _b, reps(1000) size(199) seed(999): logistic chdfate age scl
estat bootstrap, all
Logistic regression                             Number of obs      =       199
                                                Replications       =      1000
                                                Wald chi2(2)       =     11.93
                                                Prob > chi2        =    0.0026
Log likelihood = -117.00523                     Pseudo R2          =    0.0581

------------------------------------------------------------------------------
             |    Observed   Bootstrap                         Normal-based
     chdfate | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1.011431   .0199457     0.58   0.564     .9730842     1.05129
         scl |   1.013314   .0042096     3.18   0.001     1.005097    1.021599
------------------------------------------------------------------------------

. estat bootstrap, all

Logistic regression                             Number of obs      =       199
                                                Replications       =      1000

------------------------------------------------------------------------------
             |    Observed               Bootstrap
     chdfate |       Coef.       Bias    Std. Err.  [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .01136646  -.0002439    .0197203   -.0272846    .0500175   (N)
             |                                      -.0294536    .0515303   (P)
             |                                      -.0311151    .0496985  (BC)
         scl |   .01322643   .0006473   .00415434    .0050841    .0213688   (N)
             |                                        .0057071    .0216384   (P)
             |                                        .0042645    .0204473  (BC)
       _cons |  -4.4385145  -.1585981   1.2091945   -6.808492   -2.068537   (N)
             |                                      -7.261674   -2.431381   (P)
             |                                      -6.951354   -2.297873  (BC)
------------------------------------------------------------------------------
(N)    normal confidence interval
(P)    percentile confidence interval
(BC)   bias-corrected confidence interval
Each of these CIs is then converted to the odds ratio scale by exponentiating, for example,

display exp(-.0272846)
.97308426
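For example, the BC limits for scl can be put on the odds ratio scale the same way (the values
are taken from the estat bootstrap output above):

* convert the BC confidence limits for scl to the odds ratio scale
display exp(.0042645)    // lower limit, about 1.0043
display exp(.0204473)    // upper limit, about 1.0207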
How many resamples should we draw?
Running the same Stata commands, but this time using 10,000 resamples instead of 1,000 we get
the results in the following table.
. sum meanscl

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     meanscl |     10000    233.4252    2.976414   221.4673   244.5126

. centile meanscl, centile(2.5 97.5)

                                                       -- Binom. Interp. --
    Variable |      Obs  Percentile      Centile        [95% Conf. Interval]
-------------+--------------------------------------------------------------
     meanscl |    10000         2.5     227.6231        227.4975    227.7314
             |                 97.5     239.3166        239.1831    239.4556
The two results, this one and the preceding one, are:

Bootstrap (10,000 resamples):  mean = 233.4252 , 95% CI (227.6231 , 239.3166)
Bootstrap ( 1,000 resamples):  mean = 233.4235 , 95% CI (227.6337 , 239.4345)
We see that 1,000 resamples are as good as 10,000 resamples.
So, how many resamples should we draw?
That is an empirical question that depends on the statistics to be estimated and the accuracy
desired (Efron, 1979, sec. 2; Mooney and Duval, 1993, p 21). However, the improvement in the
sampling distribution estimation is slight for more than 1,000 resamples, in most cases (Efron &
Tibshirani, 1986, sec. 9; Mooney and Duval, 1993, p 21).
Example Applications
1) Better estimates of confidence intervals.
We saw an application of this with the Kim et al paper (2006).
2) Computing a CI for a statistic for which a formula is not available.
For example, if you wanted to compute a CI around a ratio of two physiologic measures, the
bootstrap approach allows you to do this without having to mathematically derive a proper
formula (a sketch appears after this list).
3) Validating a model.
How to do this is covered in the following chapter.
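As a sketch of application 2, suppose we wanted a bootstrap percentile CI for the ratio of mean
systolic to mean diastolic blood pressure in the Framingham data. The program name meanratio
is our own, chosen only for this illustration; the bootstrap prefix simply needs an r-class program
that returns the statistic of interest.

* Sketch: bootstrap percentile CI for a ratio of two means (no closed-form CI needed)
capture program drop meanratio
program define meanratio, rclass
    quietly summarize sdp, meanonly
    local m1 = r(mean)
    quietly summarize dbp, meanonly
    return scalar ratio = `m1' / r(mean)
end

use "2.20.Framingham.dta", clear
bootstrap r(ratio), reps(1000) seed(999): meanratio
estat bootstrap, percentile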
Exercise  Return to the Kim et al (2006) paper. The authors simply computed the bias-corrected 95% CI for the regression coefficient for each predictor variable. Then they made the
claim, “Bootstrap resampling validated the importance of the significant variables identified in
the regression analysis.” By “validated”, the authors imply that these same predictors will appear
as significant predictors in future datasets. However, all they really did was provide perhaps a
more accurate confidence interval for their sample only. Many authors make this mistake of
thinking that the bootstrap CI does a lot more than it really does.
To actually validate the importance of predictor variables, a more sophisticated approach is
required, which is covered in the following chapter.
References
Carpenter J, Bithell J. (2000). Bootstrap confidence intervals: when, which, what? A practical
guide for medical statisticians. Statist. Med. 19:1141-1164.
Daniel WW. (1995). Biostatistics: a Foundation for Analysis in the Health Sciences. 6th ed. New
York, John Wiley & Sons.
Dupont WD. (2002). Statistical Modeling for Biomedical Researchers: A Simple Introduction to
the Analysis of Complex Data. Cambridge UK, Cambridge University Press.
Efron B. (1979). Bootstrap methods: another look at the jackknife. Annals of Statistics 7:1-26.
Efron B, Stein C. (1981). The jackknife estimate of variance. Annals of Statistics 9:586-596.
Efron B, Tibshirani R. (1986). Bootstrap methods for standard errors, confidence intervals, and
other measures of statistical accuracy. Statistical Science 1:54-77.
Efron B, Tibshirani RJ. (1998). An Introduction to the Bootstrap. Boca Raton, Florida,
Chapman & Hall/CRC.
Kim TY, Stewart G, Voth M, et al. (2006). Signs and symptoms of cerebrospinal fluid shunt
malfunction in the pediatric emergency department. Pediatric Emergency Care 22(1):28-34.
Mooney CZ, Duval RD. (1993). Bootstrapping: A Nonparametric Approach to Statistical
Inference. Newbury Park, CA, Sage Publications.
Rao BLSP. (1987). Asymptotic Theory of Statistical Inference. New York, John Wiley.
Rohatgi VK. (1984). Statistical Inference. New York, John Wiley.
Steyerberg EW. (2009). Clinical Prediction Models: A Practical Approach to Development,
Validation, and Updating. New York, Springer.
van Belle G, Fisher LD, Heagerty PJ, Lumley T. (2004). Biostatistics: A Methodology for the
Health Sciences. 2nd ed. Hoboken, NJ, John Wiley & Sons.