1 Bootstrapping a Confidence Interval

In this module we learn how to construct a confidence interval by bootstrapping.

We begin by typing

> ?faithful

to obtain a description of a dataset, contained within R, named ‘faithful’. The dataset can be viewed by typing

> faithful

We will be interested in the second column, which we extract and name ‘wt’ by typing

> wt<-faithful[,2]

‘wt’ stands for ‘waiting times’. ‘wt’ is a sample of 272 waiting times (measured in minutes) between successive eruptions of the Old Faithful geyser in Yellowstone National Park. A histogram for ‘wt’ is obtained via

> hist(wt)

The output is shown below.


Output 1: The two-peaked histogram of wt (n = 272) shows that the distribution of waiting times is probably not normal.


We are treating the sample of waiting times as a simple random sample. Our goal is to infer from this sample to the population of all waiting times. We can estimate the population mean waiting time µ with the sample mean $\overline{wt}$. In R we type

> mean(wt)

[1] 70.89706

A reasonable guess for the population mean waiting time is then about 70 minutes. However, to be honest we should report more than just this guess, which depends on our particular random sample. We’d like to report on how variable we expect our guess to be. Might another sample give a wildly different estimate? Or can we expect another sample to give comparable results?

Trying to answer these questions without putting in the effort to obtain another sample may seem like trying to pull oneself up by one’s own bootstraps, yet this is precisely what we will attempt to do. The procedure is called bootstrapping, and the idea is the following. Instead of taking another sample from the population, we take another sample from our sample. We will call this resampling. We think of the empirical distribution of the original sample as approximating the theoretical distribution of the population, and thus our resampling, as long as it is done with replacement, is a substitute for actually taking additional samples from the true population. The upshot is that we can resample as many times as we’d like. Here is one way to obtain the means of the additional samples using R.

> a<-numeric(10000)

> for(i in 1:10000){a[i]<-mean(sample(wt,272,replace=T))}

‘a’ is set up as a vector of 10,000 blank numeric entries, to be filled. The second command, a loop, then fills in the bootstrapped entries of ‘a’: each pass resamples 272 waiting times from ‘wt’ with replacement and records their mean. The completed ‘a’ can then be viewed via a histogram.
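
The same resampling can also be written without an explicit loop, for instance with R’s replicate function. This is just an equivalent sketch of the loop above, using length(wt) in place of the hard-coded 272.

> a <- replicate(10000, mean(sample(wt, length(wt), replace = TRUE)))  # 10,000 bootstrapped means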

> hist(a)

The histogram is displayed below.


Output 2: A histogram showing the distribution of the 10,000 bootstrapped means.


The histogram nicely displays the variation in the resampled means.

Next consider the following two events.

• Our random sample resulted in a $\overline{wt}$ that was far from µ.

• A random entry of a is far from the fixed number $\overline{wt}$.

Due to our use of the empirical distribution of wt as a substitute for the population distribution of waiting times, and because 10,000 is such a large number, the above events are analogous. To clearly appreciate the analogy, just remember how we created the entries of a.

We can compute the probability that a random entry of a is within a given distance s of $\overline{wt}$. By the analogy above, we may reinterpret this as the probability that our $\overline{wt}$ happened to be within the same distance s of µ, or even better as the probability that µ is within s of the observed $\overline{wt}$.
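
As a concrete check, this probability can be estimated directly from ‘a’. The distance s = 1 minute below is an arbitrary choice, used only for illustration.

> s <- 1  # an arbitrary distance, in minutes
> mean(abs(a - mean(wt)) <= s)  # proportion of bootstrapped means within s of the observed mean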

The result is that we can think of µ as a random variable distributed in the same manner as a random entry of a!

Warning: µ is typically thought of as a fixed, unobservable quantity.

The central limit theorem ensures that a random entry of a is distributed approximately symmetrically about $\overline{wt}$. Thus in the following code, meant to construct a confidence interval for µ, we choose our quantiles symmetrically.

> quantile(a,c(.025,.975))

2.5% 97.5%

69.26094 72.47059

(69.26094, 72.47059) is then a 95% confidence interval for µ. Heed the warning above; don’t say that the probability that µ is within the interval is 95%. Rather, see the interval as a function of our observed sample wt. It is thus a random interval. The probability that the random interval contains µ is then 95%.

To obtain confidence intervals of different levels, just vary the quantiles used in the above code.
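
For example, a 90% confidence interval is obtained by chopping 5% off each tail of ‘a’ (the resulting endpoints will vary slightly from run to run, since the resampling is random).

> quantile(a, c(.05, .95))  # 90% confidence interval for µ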

We now know how to bootstrap confidence intervals for the mean. The procedure can be summarized as follows.

We use the original sample wt as an approximation for the true population so that resampling from wt (with replacement) can substitute for taking real, additional samples. The resampling scheme involves the creation of many synthetic samples of size equal to the original sample, and for each one we compute the mean. This results in a vector of means which we label with a . A confidence interval for µ is then obtained by chopping off the tails of a using the ‘quantile’ function.
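
Collecting the steps, the whole procedure fits in a few lines of R. This is just a recap of the code used above; the interval endpoints will differ slightly on each run because the resampling is random.

> wt <- faithful[,2]  # the original sample of waiting times
> a <- numeric(10000)  # storage for the bootstrapped means
> for(i in 1:10000){a[i] <- mean(sample(wt, length(wt), replace=T))}  # resample and record each mean
> quantile(a, c(.025, .975))  # 95% confidence interval for µ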

Exercise 1.1.

The first variable, ‘eruptions’, from the same dataset, ‘faithful’, lists the times in minutes for 272 eruptions. We treat this data as a simple random sample from the population of all eruption times. Use the above bootstrapping procedure to construct a confidence interval for the population mean eruption time.

Exercise 1.2.

How large do you think the original sample must be so that our bootstrapping method will be accurate? Conduct simulations to find out.

