Confidence Intervals

Statistics 115
Inference: Estimation and Confidence Intervals
1. What is the distribution of colors in a bag of M&Ms? In this study the population of interest is M&Ms.
And the parameter that you will estimate is the proportion of each color in a bag of M&Ms. We start off
with some assumptions: That 30 bags of M&Ms purchased at the Carleton book store are a simple
random sample (SRS) of bags. And the 50 or so M&Ms per bag are also a random sample of M&Ms.
(These assumptions are probably not true. Why not? But they might be “true enough.” That is, they may
be sufficient to give us a representative sample and to allow us to do statistical inference.)
2. Before eating any, count the M&Ms in your bag and the number of each color, then fill in the
frequencies (Number) and relative frequencies (Proportion) below.
Color        Blue    Brown   Green   Orange  Red     Yellow  Total
Number       ____    ____    ____    ____    ____    ____    ____
Proportion   ____    ____    ____    ____    ____    ____    100%
3. The sample is your specific bag. And the statistic is the sample proportion, the proportion of each
color in the sample. To estimate the proportion of BLUE M&Ms in the population, record the sample
proportion of blue M&Ms in your sample.
p̂ = ____________.
That single number is called a point estimate. Do you believe that the true proportion p of blue M&Ms in
the population is exactly equal to this estimate? Hopefully, you said no! What are the sources of possible
error?
4. We will get a more reliable picture of the population by estimating an interval of values in which we
believe the true proportion lies. Since our sample is random, we can never be completely
certain of what p is, but we can be “highly confident” that it falls in some interval. This is the idea behind
confidence intervals.
5. We first quantify the sampling variability of the statistic. The sampling distribution of p̂ is
approximately normal with mean equal to p and standard deviation equal to
$$\mathrm{SD}(\hat{p}) = \sqrt{\frac{p(1 - p)}{n}},$$
where n is the sample size (the number of M&Ms in your bag).
The M&M website claims that the proportion of blues in a bag of milk chocolate M&Ms is actually 24%.
That is, they claim that p = 0.24. Compute the standard deviation for the sampling distribution of p̂ ,
assuming that the M&M Company’s claim is true. SD( p̂ ) = _____________ .
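As a check on the hand calculation, here is a minimal Python sketch of the same formula. The bag size n = 50 is only an assumption (the handout says "50 or so" M&Ms per bag); replace it with your own count. The function name is made up for illustration.

```python
import math

def sd_of_sample_proportion(p: float, n: int) -> float:
    """Standard deviation of the sampling distribution of p-hat: sqrt(p(1-p)/n)."""
    return math.sqrt(p * (1 - p) / n)

p_claimed = 0.24  # the company's claimed proportion of blue M&Ms
n = 50            # ASSUMED bag size; use the total you counted in step 2

sd = sd_of_sample_proportion(p_claimed, n)
print(f"SD(p-hat) = {sd:.4f}")  # about 0.0604 when n = 50
```

Note that a larger bag (larger n) gives a smaller standard deviation, so sample proportions from bigger samples cluster more tightly around p.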
6. Draw the normal curve for the sampling distribution of p̂ . Label the x-axis and find your particular
sample value of p̂ .
Do you think that your sample value ( p̂ ) is consistent with the M&M Company’s claim? Do you think
your data provides evidence for or against the M&M Company’s claim? You should be able to answer
that question visually. But to quantify it better, we can ask: how many standard deviations is your value of
p̂ from p (the true mean)? Recall that this is what the z-score (or standard score) tells us. The z-score is equal to
z = (Data value – Mean) / Standard Deviation, so
$$z = \frac{\hat{p} - p}{\sqrt{\frac{p(1 - p)}{n}}}$$
z = __________________.
Again, does it look like your data is consistent with the M&M Company claim? A common measure of
what is “extreme” or “unusual” is an observation that is more than two standard deviations from the
mean. Is your data value within two standard deviations from p?
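The z-score calculation can be sketched in Python as follows. The counts here (9 blue M&Ms in a bag of 50) are hypothetical stand-ins for your own data, and the function name is illustrative only.

```python
import math

def z_score(p_hat: float, p: float, n: int) -> float:
    """Number of standard deviations that p_hat lies from the claimed p."""
    sd = math.sqrt(p * (1 - p) / n)
    return (p_hat - p) / sd

blue_count = 9   # HYPOTHETICAL count of blue M&Ms; use your own
n = 50           # HYPOTHETICAL bag size; use your own
p_claimed = 0.24 # the company's claim for blue

p_hat = blue_count / n
z = z_score(p_hat, p_claimed, n)
print(f"p-hat = {p_hat:.2f}, z = {z:.2f}")

# The handout's rule of thumb: more than two SDs from the mean is "unusual".
if abs(z) <= 2:
    print("Within two standard deviations of p: consistent with the claim.")
else:
    print("More than two standard deviations from p: unusual under the claim.")
```

With these made-up numbers the sample proportion falls about one standard deviation below the claimed value, which would not count as unusual.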
7. Now you will estimate the proportion of GREEN M&Ms. But now, to be more realistic, we won’t
have access to the M&M website and their claim of what the true value is. From your table, write down
the new p̂ : ____________.
Again we want to determine the sampling variability of the statistic, but look again at the formula for the
standard deviation of p̂ . It uses the value p. The problem is that we don’t know p. Instead, we’ll
approximate p in the formula by using p̂ .
$$\mathrm{SD}(\hat{p}) \approx \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
In statistics, this is called the standard error of the estimator, as opposed to the
standard deviation. When we estimate the standard deviation of a statistic, it’s called a standard error. I’ll
use SE( p̂ ) instead of SD( p̂ ) even though the textbook doesn’t talk about standard errors.
Find the standard error of p̂ . SE( p̂ ) = _______________.
This last number is the approximate standard deviation of p̂ and gives a measure of how precise your
estimate is. You can interpret it as the “give or take” in your estimation. That is, “The estimated
proportion of green M&Ms is ______ ( p̂ ) give or take _____ (standard deviation).” Including this
give-or-take term turns the single point estimate into a range of plausible values, so we can be more
confident that the range captures p. If we want to be even more confident
that the interval captures the true p value, we can take the “give or take” term as two standard deviations
instead of one.
8. To make the above statement more precise, we compute a confidence interval for the true proportion p.
A 95% confidence interval for the proportion of green M&Ms is an interval of the form
Estimate ± Margin of Error, or p̂ ± (2 standard deviations).
The Margin of Error accounts for the error in the estimate due to sampling variability. (We can’t account
for any error that might arise because of bias.) Two standard deviations is the most common number of
standard deviations to use for a confidence interval. The 68-95-99.7 rule says that 95% of the area under a
normal curve is within two standard deviations of the mean. Actually, that “rule” is just a rule of thumb.
It’s not exact. The more precise result is that 95% of the area under a normal curve is within 1.96 standard
deviations of the mean. So for a 95% confidence interval we’ll use the more precise 1.96. Compute the
95% confidence interval for the true proportion of green M&Ms:
$$\hat{p} \pm 1.96 \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
9. You should always write the confidence interval in two ways:
(1) Write it in the form p̂ ± Margin of Error. This way of writing the interval makes it clear
what the actual point estimate is ( p̂ ) and what the margin of error is.
(2) Report the actual interval with left and right endpoints [ XX , XX ] to make it clear that the
confidence interval actually is an interval and so you can clearly see the range of values in that interval.
Notice that the point estimate p̂ lies in the middle of the interval. The range of the interval (right
endpoint minus left endpoint) is the width of the confidence interval.
p̂ = _________________.
Margin of Error = ____________________.
Width of interval = ____________________.
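The interval and the quantities asked for above can be computed with a short Python sketch. The counts (10 green M&Ms out of 50) are hypothetical, and `proportion_ci` is a helper name invented for this illustration.

```python
import math

def proportion_ci(count: int, n: int, z_star: float = 1.96):
    """Return (p_hat, margin_of_error, (left, right)) for a sample proportion."""
    p_hat = count / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of p-hat
    margin = z_star * se
    return p_hat, margin, (p_hat - margin, p_hat + margin)

green_count = 10  # HYPOTHETICAL count of green M&Ms; use your own
n = 50            # HYPOTHETICAL bag size; use your own

p_hat, margin, (left, right) = proportion_ci(green_count, n)

# Form (1): estimate plus or minus margin of error
print(f"{p_hat:.3f} +/- {margin:.3f}")
# Form (2): explicit endpoints
print(f"[{left:.3f}, {right:.3f}]")
print(f"Width of interval = {right - left:.3f}")
```

The width always equals twice the margin of error, which is why the point estimate sits in the middle of the interval.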
10. The M&M website states that the proportion of green M&Ms is actually 16%. Does your interval
contain p = 0.16? Do you think the M&M Company’s claim is correct?
11. In the previous example, 95% was the confidence level of the interval. We say that we are “95%
confident” that the true parameter value p lies in the interval. It would be nice if you could be 100%
confident about your interval, but you never can be (unless you use the interval [0.0, 1.0] or unless you
know the parameter value to begin with). So we would like to be as confident as possible. At the same
time, it would be nice if the width of the interval was as narrow as possible so that the estimate is very
precise. (The interval [.29, .31] is more precise than the interval [.10, .50].) Unfortunately there is a tradeoff between greater confidence and greater precision.
We can get more confidence, but at a price. To be more confident we have to accept a wider interval (and
hence more uncertainty in the estimate). On the other hand, we can make our confidence interval
narrower, but we have to accept less confidence in the final result.
12. In its most general form, a confidence interval for a population proportion takes the form
$$\hat{p} \pm z^{*} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
The z* value is called the critical value and represents the number of standard deviations that you are
using. It is determined by the confidence level. The most common values are:
Confidence Level    Critical Value (z*)
80%                 1.28
90%                 1.645
95%                 1.96
99%                 2.576
99.9%               3.29
Notice that the greater the confidence level, the larger the critical value. But making the critical value
larger increases the width of the interval.
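The critical values in the table can be reproduced from the standard normal distribution. The sketch below uses `NormalDist` from Python's standard `statistics` module; the sample (10 out of 50) is hypothetical and is there only to show how the interval widens as the confidence level rises.

```python
import math
from statistics import NormalDist

def critical_value(confidence: float) -> float:
    """z* such that the central `confidence` area of N(0, 1) lies within +/- z*."""
    upper_tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - upper_tail)

# HYPOTHETICAL sample, used only to illustrate the width: 10 of 50.
p_hat, n = 10 / 50, 50
se = math.sqrt(p_hat * (1 - p_hat) / n)

for level in (0.80, 0.90, 0.95, 0.99, 0.999):
    z_star = critical_value(level)
    width = 2 * z_star * se  # width = 2 * margin of error
    print(f"{level:.1%}  z* = {z_star:.3f}  width = {width:.3f}")
```

For a fixed sample, each step up in confidence level produces a larger z* and therefore a wider interval.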
13. For the last estimation problem, compute an 80% confidence interval for the proportion of ORANGE
M&Ms.
14. What does this 80% confidence interval really mean? Here is the most precise statement: If you were
to repeat this estimation experiment many, many times, each time taking a different random sample,
finding p̂ , and computing an 80% confidence interval, then about 80% of all these intervals would
capture the true parameter value p. This, of course, implies that about 20% of all these intervals would not
contain p. Similarly, if you were estimating the population proportion with 95% confidence intervals, then
about 95% of all the intervals would contain the true proportion value, and about 5% of them would not. That’s
why you can be so confident. But it’s also why you can never be 100% sure.
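The repeated-sampling interpretation can be explored with a small simulation. This is a rough sketch under stated assumptions: the true proportion (0.20), the bag size (50), and the number of repetitions are all made-up values, and each M&M is drawn independently. The function name is illustrative.

```python
import math
import random

def coverage(p_true: float, n: int, z_star: float, trials: int, seed: int = 0) -> float:
    """Fraction of simulated confidence intervals that contain p_true."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        # Draw one random sample of n M&Ms and count the "successes".
        count = sum(rng.random() < p_true for _ in range(n))
        p_hat = count / n
        margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin <= p_true <= p_hat + margin:
            hits += 1
    return hits / trials

# ASSUMED values for illustration only.
rate = coverage(p_true=0.20, n=50, z_star=1.28, trials=10_000)
print(f"Share of 80% intervals that captured p: {rate:.3f}")
```

The share should come out in the neighborhood of 80%, though not exactly, because the normal approximation is only approximate for a sample of about 50 and because a count can take only whole-number values.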
15. For more information about M&Ms, see http://us.mms.com/us/about/products/milkchocolate/.