Confidence interval for the mean

When we take a sample, we would like to know how far the sample mean may be from
the true population mean. To describe our estimate of how near or far the sample mean is
from the population mean, we will use the confidence interval for the mean. The
confidence interval for the mean will give us lower and upper bounds on likely values for
the mean in a large number of samples.
Recall that earlier we took 100 samples, each of sample size N=4, 9, or 25, from our
population of 200 roses. Here are the graphs we saw when we plotted the number of
flowers on each rose in the original population (top left), and the distribution of the
sample means for N = 4, 9, or 25. As we increase the number of observations in our
sample (from 4 to 25), we decrease the width of the distribution of sample means.
Here is another illustration showing how the distribution of sample means changes with
the number of observations (the sample size). Suppose that we manufacture 100,000 pills
per year in our factory. Each of the pills is supposed to have 50 mg of drug. Imagine for a
moment that we measured the amount of drug in all 100,000 pills, and got a histogram
showing the distribution of the amount of drug per pill.
It would be expensive and destructive to measure the amount of drug in every pill.
Instead, we can take a sample from the 100,000. To simulate this, let's take 200 samples
of size N = 2, 30, and 100, and plot the means for each of the 200 samples for each
sample size N.
Again, we see that, as we increase the number of observations in the sample, we decrease
the width of the distribution of sample means.
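The simulation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not give the actual distribution of drug content, so the sketch assumes it is roughly normal with mean 50 mg and standard deviation 2 mg, and the names population and sample_means are hypothetical.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# Assumed population: 100,000 pills, drug content ~ Normal(50 mg, 2 mg).
# The standard deviation of 2 mg is an assumption, not a value from the text.
population = [random.gauss(50.0, 2.0) for _ in range(100_000)]

def sample_means(population, n, num_samples=200):
    """Return the means of num_samples random samples, each of size n."""
    return [statistics.fmean(random.sample(population, n))
            for _ in range(num_samples)]

# Spread (standard deviation) of the 200 sample means for each sample size.
spread_by_n = {}
for n in (2, 30, 100):
    spread_by_n[n] = statistics.stdev(sample_means(population, n))
    print(f"N = {n:3d}: standard deviation of sample means = {spread_by_n[n]:.3f}")
```

The printed spread should shrink as N grows, which is the narrowing of the histograms described above.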
When we take a sample, we would like to know how far the sample mean may be from
the true population mean. We would like to be confident that the true population mean is
within a range of possible values. We'll call this range of possible values (from a lower
bound to an upper bound) our confidence interval for the mean.
Let's start with a 95% confidence interval for the mean. We would like to determine a
lower bound and an upper bound, such that, if we take many samples, the true population
mean will be between the lower bound and the upper bound for 95% of the samples. The
values of the lower and upper bounds will give us the 95% confidence interval for the
mean.
We'll begin with our 100,000 pills, take 2000 samples each of size N, and calculate the
mean for each sample.
Here is the graph of the means of the 2000 samples where each sample is of size N=2.
In these 2000 samples, the lowest observed sample mean is 44.99. The highest observed
sample mean is 54.25.
2.5% of the sample means are below 47.06.
2.5% of the sample means are above 52.68. (Equivalently, 97.5% of the sample means
are below 52.68.)
95% of the sample means are between 47.06 and 52.68.
So, for samples of size N=2 from the 100,000 pills, we estimate the 95% confidence
interval for the mean is 47.06 to 52.68.
The lower bound of the 95% confidence interval is 47.06.
The upper bound of the 95% confidence interval is 52.68.
The sample means corresponding to 0%, 2.5%, 97.5% and 100% of the distribution are
called quantiles of the distribution.
Quantiles of the distribution of sample means for sample size N=2 from the pills.
Percentile    0        2.5      97.5     100
Quantile      44.99    47.06    52.68    54.25
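The quantiles above can be computed directly from a list of simulated sample means. The sketch below is a rough illustration, not the procedure used to produce the numbers in the text: it assumes a normal population with mean 50 mg and standard deviation 2 mg, and the function name sample_mean_quantiles is hypothetical.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is repeatable

# Assumed population (the 2 mg standard deviation is an assumption).
population = [random.gauss(50.0, 2.0) for _ in range(100_000)]

def sample_mean_quantiles(population, n, num_samples=2000):
    """Return the 0%, 2.5%, 97.5% and 100% quantiles of the sample means."""
    means = sorted(statistics.fmean(random.sample(population, n))
                   for _ in range(num_samples))
    # Splitting into 40 equal parts gives cut points at 2.5%, 5%, ..., 97.5%.
    cuts = statistics.quantiles(means, n=40, method="inclusive")
    return {0: means[0], 2.5: cuts[0], 97.5: cuts[-1], 100: means[-1]}

q2 = sample_mean_quantiles(population, 2)
for percentile, value in q2.items():
    print(f"{percentile:>5}%  {value:.2f}")
```

Calling sample_mean_quantiles(population, 30) instead gives the corresponding quantiles for samples of 30 pills, which should lie closer together.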
If we take samples of size N = 30 instead of N=2, we should get a smaller 95%
confidence interval.
Here's the distribution of sample means when we use 30 pills per sample.
We'll re-draw the distribution using a narrower range for the x axis, and more bins for the
histogram.
In these 2000 samples, the lowest observed sample mean is 48.87. The highest observed
sample mean is 51.05.
2.5% of the sample means are below 49.30.
2.5% of the sample means are above 50.71. (Equivalently, 97.5% of the sample means
are below 50.71.)
95% of the sample means are between 49.30 and 50.71.
So, for samples of size N=30 from the 100,000 pills, we estimate the 95% confidence
interval for the mean is 49.30 to 50.71.
The lower bound of the 95% confidence interval is 49.30.
The upper bound of the 95% confidence interval is 50.71.
Quantiles of the distribution of sample means for sample size N=30 from the pills.
Percentile    0        2.5      97.5     100
Quantile      48.87    49.30    50.71    51.05
A 95% confidence interval for the mean does not tell us that there is a 95% probability
that one particular interval contains the true population mean. The confidence coefficient
(95%) is the proportion of samples of a given size whose confidence intervals may be
expected to contain the true population mean. For example, if we take many samples and
calculate a 95% confidence interval for each sample, in the long run about 95% of these
confidence intervals would contain the true population mean.
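This long-run interpretation can be checked by simulation. The sketch below draws many samples, computes an interval for each one as the sample mean plus or minus 1.96 standard errors (the classical interval), and counts how often the interval contains the true mean. The population (normal, mean 50 mg, standard deviation 2 mg), the sample size of 100, and the function names are assumptions chosen for illustration.

```python
import random
import statistics

random.seed(3)  # fixed seed so the sketch is repeatable

# Assumed population (the 2 mg standard deviation is an assumption).
population = [random.gauss(50.0, 2.0) for _ in range(100_000)]
true_mean = statistics.fmean(population)

def classical_ci(sample, z=1.96):
    """Sample mean plus or minus z standard errors of the mean."""
    mean = statistics.fmean(sample)
    sem = statistics.stdev(sample) / len(sample) ** 0.5
    return mean - z * sem, mean + z * sem

def coverage(population, true_mean, n, num_samples=2000):
    """Fraction of samples whose interval contains the true mean."""
    hits = 0
    for _ in range(num_samples):
        low, high = classical_ci(random.sample(population, n))
        if low <= true_mean <= high:
            hits += 1
    return hits / num_samples

covered = coverage(population, true_mean, n=100)
print(f"Fraction of intervals containing the true mean: {covered:.3f}")
```

The printed fraction should be close to 0.95. Any single interval either contains the true mean or it does not; the 95% describes the procedure over many samples.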
There are several ways that we can estimate the 95% confidence interval from a sample.
The classical method estimates the 95% confidence interval as the sample mean plus or
minus 1.96 times the standard error of the mean (the sample standard deviation divided by
the square root of the sample size). An alternative method, called the bootstrap, involves
taking many sub-samples, drawn with replacement, from the original sample, and examining
the distribution of the means of the sub-samples, similar to what we did when we had the
entire population.
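Both methods can be sketched for a single sample. The example below is a minimal illustration under assumptions: the sample of 30 pills is simulated (normal, mean 50 mg, standard deviation 2 mg) because the text provides no raw data, the bootstrap variant shown is the simple percentile bootstrap, and the function names are hypothetical.

```python
import random
import statistics

random.seed(4)  # fixed seed so the sketch is repeatable

# Hypothetical single sample of 30 pills (simulated, not real data).
sample = [random.gauss(50.0, 2.0) for _ in range(30)]

def classical_ci(sample, z=1.96):
    """Sample mean plus or minus z standard errors of the mean."""
    mean = statistics.fmean(sample)
    sem = statistics.stdev(sample) / len(sample) ** 0.5
    return mean - z * sem, mean + z * sem

def bootstrap_ci(sample, num_resamples=2000):
    """Percentile bootstrap: 2.5% and 97.5% quantiles of resample means."""
    n = len(sample)
    # random.choices draws with replacement, so each resample has size n.
    means = [statistics.fmean(random.choices(sample, k=n))
             for _ in range(num_resamples)]
    cuts = statistics.quantiles(means, n=40, method="inclusive")
    return cuts[0], cuts[-1]

c_low, c_high = classical_ci(sample)
b_low, b_high = bootstrap_ci(sample)
print(f"classical 95% CI: {c_low:.2f} to {c_high:.2f}")
print(f"bootstrap 95% CI: {b_low:.2f} to {b_high:.2f}")
```

For a sample of this size from a roughly normal population, the two intervals should be similar. With very small samples, a multiplier from the t distribution is usually preferred over 1.96.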