Statistics 115
Inference: Estimation and Confidence Intervals

1. What is the distribution of colors in a bag of M&Ms? In this study the population of interest is M&Ms, and the parameter that you will estimate is the proportion of each color in a bag of M&Ms. We start with some assumptions: that the 30 bags of M&Ms purchased at the Carleton book store are a simple random sample (SRS) of bags, and that the 50 or so M&Ms in each bag are also a random sample of M&Ms. (These assumptions are probably not true. Why not? But they might be “true enough.” That is, they may be sufficient to give us a representative sample and to allow us to do statistical inference.)

2. Before consumption, count the M&Ms in your bag and the number of each color, filling in the frequencies and relative frequencies.

Color        Blue   Brown   Green   Orange   Red    Yellow   Total
Number       ____   ____    ____    ____     ____   ____     ____
Proportion   ____   ____    ____    ____     ____   ____     100%

3. The sample is your specific bag. The statistic is the sample proportion, the proportion of each color in the sample. To estimate the proportion of BLUE M&Ms in the population, record the sample proportion of blue M&Ms in your sample. p̂ = ____________. That single number is called a point estimate. Do you believe that the true proportion p of blue M&Ms in the population is exactly equal to this estimate? Hopefully, you said no! What are the sources of possible error?

4. We will get a more reliable picture of the population by estimating an interval of values in which we believe the true proportion lies. Since our sample is random, we can never be completely certain of what p is, but we can be “highly confident” that it falls in some interval. This is the idea behind confidence intervals.

5. We first quantify the sampling variability of the statistic. The sampling distribution of p̂ is approximately normal with mean equal to p and standard deviation equal to $\mathrm{SD}(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}$, where n is the number of M&Ms in the sample. The M&M website claims that the proportion of blues in a bag of milk chocolate M&Ms is actually 24%.
That is, they claim that p = 0.24. Compute the standard deviation for the sampling distribution of p̂, assuming that the M&M Company’s claim is true. SD(p̂) = _____________.

6. Draw the normal curve for the sampling distribution of p̂. Label the x-axis and find your particular sample value of p̂. Do you think that your sample value (p̂) is consistent with the M&M Company’s claim? Do you think your data provides evidence for or against the M&M Company’s claim? You should be able to answer that question visually. But to quantify it better, we can ask: how many standard deviations is your value of p̂ from p (the true mean)? Recall that this is what the z-score (or standard score) tells us. The z-score is equal to z = (Data value – Mean) / Standard Deviation, so

$z = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}}$ = __________________.

Again, does it look like your data is consistent with the M&M Company’s claim? A common measure of what is “extreme” or “unusual” is an observation that is more than two standard deviations from the mean. Is your data value within two standard deviations of p?

7. Now you will estimate the proportion of GREEN M&Ms. But now, to be more realistic, we won’t have access to the M&M website and their claim of what the true value is. From your table, write down the new p̂: ____________. Again we want to determine the sampling variability of the statistic, but look again at the formula for the standard deviation of p̂. It uses the value p. The problem is that we don’t know p. Instead, we’ll approximate p in the formula by using p̂:

$\mathrm{SD}(\hat{p}) = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$.

In statistics, this is called the standard error of the estimator, as opposed to the standard deviation. When we estimate the standard deviation of a statistic, it’s called a standard error. I’ll use SE(p̂) instead of SD(p̂) even though the textbook doesn’t talk about standard errors. Find the standard error of p̂. SE(p̂) = _______________.
This last number is the approximate standard deviation and gives a measure of how precise your estimate p̂ is. You can interpret it as the “give or take” in your estimation. That is, “The estimated proportion of green M&Ms is ______ (p̂) give or take _____ (standard error).” Including this give-or-take term lets us be more confident about our estimate. If we want to be even more confident that the interval captures the true value of p, we can take the “give or take” term to be two standard errors instead of one.

8. To make the above statement more precise, we compute a confidence interval for the true proportion p. A 95% confidence interval for the proportion of green M&Ms is an interval of the form Estimate ± Margin of Error, or p̂ ± (2 standard deviations). The Margin of Error accounts for the error in the estimate due to sampling variability. (We can’t account for any error that might arise because of bias.) Two standard deviations is the most common number of standard deviations to use for a confidence interval. The 68-95-99.7 rule says that 95% of the area under a normal curve is within two standard deviations of the mean. Actually, that “rule” is just a rule of thumb. It’s not exact. The more precise result is that 95% of the area under a normal curve is within 1.96 standard deviations of the mean. So for a 95% confidence interval we’ll use the more precise 1.96. Compute the 95% confidence interval for the true proportion of green M&Ms:

$\hat{p} \pm 1.96\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$.

9. You should always write the confidence interval in two ways: (1) Write it in the form p̂ ± Margin of Error. This way of writing the interval makes it clear what the actual point estimate is (p̂) and what the margin of error is. (2) Report the actual interval with left and right endpoints [ XX , XX ] to make it clear that the confidence interval actually is an interval and so that you can clearly see the range of values in that interval. Notice that the point estimate p̂ lies in the middle of the interval.
The range of the interval (right endpoint minus left endpoint) is the width of the confidence interval. p̂ = _________________. Margin of Error = ____________________. Width of interval = ____________________.

10. The M&M website states that the proportion of green M&Ms is actually 16%. Does your interval contain p = 0.16? Do you think the M&M Company’s claim is correct?

11. In the previous example, 95% was the confidence level of the interval. We say that we are “95% confident” that the true parameter value p lies in the interval. It would be nice if you could be 100% confident about your interval, but you never can be (unless you use the interval [0.0, 1.0] or unless you know the parameter value to begin with). So we would like to be as confident as possible. At the same time, it would be nice if the interval were as narrow as possible so that the estimate is very precise. (The interval [.29, .31] is more precise than the interval [.10, .50].) Unfortunately, there is a tradeoff between greater confidence and greater precision. We can get more confidence, but at a price. To be more confident we have to accept a wider interval (and hence more uncertainty in the estimate). On the other hand, we can make our confidence interval narrower, but we have to accept less confidence in the final result.

12. In its most general form, a confidence interval for a population proportion takes the form

$\hat{p} \pm z^{*}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$.

The z* value is called the critical value and represents the number of standard deviations that you are using. It is determined by the confidence level. The most common values are:

Confidence Level      80%    90%     95%    99%     99.9%
Critical Value (z*)   1.28   1.645   1.96   2.576   3.29

Notice that the greater the confidence level, the larger the critical value. But making the critical value larger increases the width of the interval.

13. For the last estimation problem, compute an 80% confidence interval for the proportion of ORANGE M&Ms.

14. What does this 80% confidence interval really mean? Here is the most precise statement: If you were to repeat this estimation experiment many, many times, each time taking a different random sample, finding p̂, and computing an 80% confidence interval, then about 80% of all these intervals would capture the true parameter value p. This, of course, implies that about 20% of all these intervals would not contain p. Similarly, if you were estimating the population proportion with 95% confidence intervals, then about 95% of all the intervals would contain the true proportion value, and about 5% of them would not. That’s why you can be so confident. But it’s also why you can never be 100% sure.

15. For more information about M&Ms, see http://us.mms.com/us/about/products/milkchocolate/.
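
If you would like to check your hand calculations, the steps above can be sketched in Python. This is only a minimal sketch, not part of the worksheet itself: the function names are made up for illustration, the bag of 9 greens out of 52 is a hypothetical example rather than real data, and the simulation simply assumes each M&M is green independently with probability p. The critical values are copied from the table in step 12.

```python
import math
import random

# Critical values z* copied from the table in step 12.
Z_STAR = {0.80: 1.28, 0.90: 1.645, 0.95: 1.96, 0.99: 2.576, 0.999: 3.29}

def sd_p_hat(p, n):
    """Standard deviation of p-hat when the true proportion p is known (step 5)."""
    return math.sqrt(p * (1 - p) / n)

def z_score(p_hat, p, n):
    """Number of standard deviations between p-hat and p (step 6)."""
    return (p_hat - p) / sd_p_hat(p, n)

def confidence_interval(count, n, level=0.95):
    """Return (p_hat, margin, left, right) for p-hat +/- z* SE (steps 7 to 12)."""
    p_hat = count / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error: p replaced by p-hat
    margin = Z_STAR[level] * se
    return p_hat, margin, p_hat - margin, p_hat + margin

def coverage(p, n, level, trials=10000, seed=1):
    """Fraction of simulated intervals that capture the true p (step 14)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Simulate one bag: each of the n candies is "green" with probability p.
        count = sum(rng.random() < p for _ in range(n))
        _, _, left, right = confidence_interval(count, n, level)
        if left <= p <= right:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    # Hypothetical bag: 9 green M&Ms out of 52 (made-up numbers, not real data).
    p_hat, margin, left, right = confidence_interval(9, 52, 0.95)
    print(f"p_hat = {p_hat:.3f}, margin of error = {margin:.3f}")
    print(f"95% interval = [{left:.3f}, {right:.3f}], width = {right - left:.3f}")
    print(f"z-score against the claimed p = 0.16: {z_score(p_hat, 0.16, 52):.2f}")
    print(f"simulated coverage of 80% intervals: {coverage(0.16, 52, 0.80):.3f}")
```

The last line repeats the estimation experiment many times, as described in step 14, and reports the fraction of intervals that contain the true p. With bags this small the fraction will usually be close to, but not exactly, 80%, because the normal approximation is itself only approximate.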