Review Exercises Chapter 21 -- The Accuracy of Percentages

Central ideas: statistical inference (sample --> box --> sampling error); bootstrapping; confidence intervals. See Exercise set A (7, 8, 9) and set C (4, 5, 6) for good exercises to master the language.

Computer simulation: the command (run cidemo) offers a demo similar to that in section 3, pages 383-386, with a plot of outcomes as on p. 385. Read the documentation (doc cidemo) and run the suggested variations.

Review Problems (p. 391-4)

Especially important problems are starred.

* Problem 1. Computer survey. In 1990, a survey showed 14.8 percent of US households had a computer. In a survey of a specific city of 25,000 households, 79 of the 500 households surveyed had a computer.

Sample percentage = 79 / 500 = 0.158 or 15.8 percent. This differs from the national survey by 1 percentage point -- but is this a real difference or due to sampling error?

Sample SD = sqrt (0.158 * (1 - 0.158)) = sqrt (0.158 * 0.842) = sqrt (0.133) = 0.3647

The sample data gives us our "box model": a thousand tickets with 158 ones and 842 zeros. Treating the "box model" as our model of reality, we find:

EV of the percentage = 0.158 or 15.8 percent
Estimated SE of the sum = (sqrt 500) * 0.3647 = 8.1559
Estimated SE of the percentage = 8.1559 / 500 = 0.0163 or 1.63 percent.

Hence we say that the percentage of households with computers is estimated, on the basis of the sample, to be 15.8 percent, give or take 1.63 percent. More formally, we talk of confidence intervals:

68 percent confidence interval for the percentage: 15.8 +/- 1.63 or 14.17 to 17.43.
95 percent confidence interval for the percentage: 15.8 +/- 2 (1.63) = 12.54 to 19.06

Since even the 68 percent confidence interval includes the US average of 14.8 percent, we can say that there is no significant difference between this city and the average US city.

* Problem 2. Refrigerator survey. Of the 500 sample households, 498 had refrigerators.
Sample percentage = 498 / 500 = 0.996 or 99.6 percent.
Sample SD = sqrt (0.996 * 0.004) = sqrt (0.003984) = 0.0631 or 6.31 percent
EV of the sample percentage = 0.996 or 99.6 percent
SE of the sample percentage = 0.0631 / (sqrt 500) = 0.0028 or 0.28 percent

The percentage of households with refrigerators is estimated, on the basis of the sample, to be 99.6 percent, give or take 0.28 percent.

The 68 percent confidence interval for the percentage is 99.6 +/- 0.28 = 99.32 to 99.88
The 95 percent confidence interval for the percentage is 99.6 +/- 0.56 = 99.04 to 100.16

The upper end of the 95 percent interval is above 100 percent, which is impossible for a percentage. This is a sign that the normal approximation behind the interval is unreliable when the box is this lopsided (only 2 zeros among the 500 draws), so the interval should be read with caution.

The problem would have been far worse if you had tried to use the sample SD of 0.0631 rather than the SE of the percentage: a 2 SD interval (which should NOT be constructed in the first place, since the SD of the sample IS what it is, and is not "estimated") would be 99.6 +/- 12.6 = 87.0 to 112.2 percent. Hence it is NOT possible to estimate the population percentage based on the SD of the box alone.

Note that the usual warning in polls that there is a "margin of error" of "about" 3 percent is typically true only if the poll results are evenly split at 50 percent for each outcome; as the results become more lopsided, the "margin of error" (usually a 2 SE confidence interval) shrinks.

* Problem 3. Automobile survey. Among 500 households, 121 have no car, 172 have one car, and 207 have more than one car. The interest in the problem is the percentage of households who have one or more cars, which is (172 + 207) / 500 = 379 / 500 = 0.758 or 75.8 percent.

Sample SD = sqrt (0.758 * 0.242) = sqrt (0.1834) = 0.4283
EV of the percentage = 0.758 or 75.8 percent
SE of the estimated percentage = (sqrt 500) * 0.4283 / 500 = 0.0192 or 1.92 percent

68 percent CI for the estimate: 75.8 +/- 1.92 = 73.88 to 77.72
95 percent CI for the estimate: 75.8 +/- 3.84 = 71.96 to 79.64

Problem 4. Achievement tests.
(Same techniques as problems 1-3.) The sample size is 6000.

a. Chaucer: 36.1 percent of students know who Chaucer was.
Sample percentage: 36.1, so the box model will have 361 ones and 639 zeros.
Sample SD: sqrt (0.361 * 0.639) = 0.4803
EV of estimated percentage = 0.361 or 36.1 percent
SE of estimated percentage = (sqrt 6000) * 0.4803 / 6000 = 0.0062 or 0.62 percent.
95 percent CI for estimated percentage: 36.1 +/- 1.24 = 34.86 to 37.34 percent.

b. Edison: 95.2 percent of students know who Edison was.
Sample percentage: 95.2, so the box model will have 952 ones and 48 zeros.
Sample SD: sqrt (0.952 * 0.048) = 0.2138
EV of estimated percentage = 0.952 or 95.2 percent
SE of estimated percentage = 0.2138 / (sqrt 6000) = 0.0028 or 0.28 percent.
95 percent CI for estimated percentage: 95.2 +/- 0.56 = 94.64 to 95.76 percent.

Problem 5. Is the sample percentage likely to be equal to the population percentage? False; it is likely to be CLOSE if the survey is well-designed; just how close is given by the SE.

* Problem 6. Stock exchange miscalculation (updated to drive home the moral of the story). An investment banker for Lehman Brothers noted that mortgage CDOs went up 131 of 252 days, or 52 percent of the time (actually, 131 / 252 = 0.5198 or 51.98 percent, but stick to the text numbers).

The sample of 252 days resulted in a percentage of 0.52 or 52 percent, so we can create a box model with 52 ones and 48 zeros.
SD of box = sqrt (0.52 * 0.48) = 0.4996 or approximately 50 percent.

The statistical mechanics of the calculation are correct. The SE of the number of days stocks are expected to go up, IF the sample is a random sample, is given by [sqrt (252)] * [sqrt (0.52 * 0.48)] = 7.9309, so a 95 percent confidence interval for the estimated number of up days is about 131 +/- 16 = 115 to 147 days, and a 95 percent confidence interval for the estimated percentage is .52 +/- 2 * .4996 / (sqrt 252) = .52 +/- 2 * 0.0315 = .52 +/- 0.063, or 45.7 to 58.3 percent.
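The arithmetic for Problem 6 can be checked with a short Python sketch. The function name and variable names are illustrative choices, not part of the course software, and the count interval is centered on the 131 observed up days. The calculation still assumes a random sample, which is exactly the assumption in question.

```python
import math


def se_for_count_and_percent(n, p):
    """SE of the count and SE of the percentage (in percentage points)
    for n draws from a 0-1 box whose fraction of ones is p."""
    box_sd = math.sqrt(p * (1 - p))      # SD of a 0-1 box
    se_count = math.sqrt(n) * box_sd     # SE of the sum of the draws
    se_percent = se_count / n * 100      # SE of the percentage
    return se_count, se_percent


n, up_days = 252, 131
p = 0.52  # rounded fraction of up days, as used in the text

se_count, se_percent = se_for_count_and_percent(n, p)

# 2 SE (roughly 95 percent) intervals for the count and the percentage
count_ci = (up_days - 2 * se_count, up_days + 2 * se_count)
percent_ci = (100 * p - 2 * se_percent, 100 * p + 2 * se_percent)

print(f"SE of count: {se_count:.4f}")
print(f"95% CI for count: {count_ci[0]:.0f} to {count_ci[1]:.0f} days")
print(f"95% CI for percentage: {percent_ci[0]:.1f} to {percent_ci[1]:.1f}")
```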
But the critical question here is whether the last year in the mortgage CDO market is truly a "random sample" -- the answer to this question is clearly, from the perspective of the recession of 2008, "NO." Without a random sample, calculating the "SE" does not make much sense, unless we assume that "the future will be like the past." Maybe so -- but only if we remember that 1929 was part of the past, too.

Problem 7. Survey of Newspaper Readers. Sample size 3500; newspaper readers 2487, so sample percentage = 2487 / 3500 = .7106 or approximately 71 percent.

We have a box model with 71 ones and 29 zeros, so the box SD = sqrt (0.71 * 0.29) = 0.4538
The SE for the estimated population percentage is 0.4538 / (sqrt 3500) = 0.0077 or about 0.8 percent, as given by the text.

The text assures us this is a simple random sample, taken in a large town, so we are justified in using this to create a confidence interval -- for a 95 percent CI, we would have 71 +/- 1.6 = 69.4 to 72.6 percent.

* Problem 8. Bank estimate of pocket change.

Data: Sample size 100, sample average 73 cents.
Calculation: SE for the number of coins in the average citizen's pocket = (sqrt 100) * (sqrt (.73 * .27))

ERROR: 0.73 is NOT a percentage (except of a dollar); it does not reflect the percentage of people who have 73 cents in their pocket. The box model for the SE of a mean is more complicated: we would need to know the exact value of change for each person in the box, and calculate the SD of the box.

Note: an example of the correct model, with a sample size of 5, would be as follows. People in the box might have 75, 31, 50, 80, 128 cents in change. The mean of this box is 72.8 cents (about 73, as in the problem) with a SD of 32.77 cents. The SE of the sum of change is (sqrt 5) * 32.77 = 73.28 and the SE of the mean of change is 73.28 / 5 = 14.66.
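The five-person box can be worked through in a few lines of Python as a check. This is a minimal sketch: the variable names are illustrative, and it assumes the box SD is the population SD (divisor n rather than n - 1), which is what `statistics.pstdev` computes.

```python
import math
import statistics

# The five example amounts of pocket change, in cents
change_cents = [75, 31, 50, 80, 128]
n = len(change_cents)

mean = statistics.mean(change_cents)
box_sd = statistics.pstdev(change_cents)  # population SD (divisor n), an assumption

se_sum = math.sqrt(n) * box_sd   # SE of the sum of n draws
se_mean = se_sum / n             # SE of the mean of n draws

print(f"mean = {mean:.1f} cents, box SD = {box_sd:.2f} cents")
print(f"SE of sum = {se_sum:.2f}, SE of mean = {se_mean:.2f}")
```

The point of the sketch is that the SD comes from the spread of the actual amounts, not from a formula such as sqrt (.73 * .27) that only applies to a 0-1 box.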
Hence a 95 percent confidence interval for the mean is 73 +/- 29 cents, or 44 cents to 102 cents.

Problem 9. Keno odds. See www.mathproblems.info/gam470/games/keno/prob-keno.html for a good treatment of rules and odds (googling "probabilities in keno" with mathproblems will get you there). The site, run by Michael Shackleford, has a very nice collection of other math problems as well, and links to other pages.

The game is a model for lottery games of the "pick 5" or Powerball type -- you choose 2 numbers, the house draws 20 of the 80 numbers, and if both of yours are drawn, you win $ 12 - $ 1 cost of ticket = $ 11; the chance of this happening is about 6 percent (20/80 * 19/79 = 0.0601).

Box model: 6 tickets labeled "$ 11" and 94 labeled "- $ 1".
Mean of box: (6 * 11 - 1 * 94) / 100 = (66 - 94) / 100 = - 28 / 100 = - 0.28
SD of box: (shortcut formula) = (11 - (- 1)) * (sqrt (0.94 * 0.06)) = 12 * 0.2375 = 2.85
EV of sum of 100 plays = 100 * (- 0.28) = - 28.00
SE of sum of 100 plays = (sqrt 100) * 2.85 = 28.50

Problem 10. Which game to play? Both games have 100 draws from the same box of numbered tickets.

Game 1. Win a dollar if the sum is greater than 710.
Game 2. Win a dollar if the average is greater than 7.10.

The box model is the same, so a sum above 710 and an average above 710 / 100 = 7.10 are the same event and are equally likely. It is also the case that the EV for the sum is 100 times the EV for the average. The SE for the sum is (sqrt 100) * SD box = 10 * SD box; the SE for the average is 10 * SD box / 100 = SD box / 10. The SE for the sum is therefore 100 times the SE for the average, so the z-scores are equal: (7.10 - EVavg) / SEavg = (710 - EVsum) / SEsum, since the numerator and the divisor are both scaled by 100. There is no difference in the probability of winning.

Problem 11. Interpreting the Polls.
The standard polling language is to say that the poll has a "two percent margin of error", or that the results are "reliable to within two percentage points". This is an attempt to translate the correct statement, which should usually be: "Assuming that the sample was in fact taken randomly, the confidence interval (1 SE or, more commonly, 2 SE) for the percentage found is the percentage reported, plus or minus two percentage points."

This does NOT mean that almost all polls will be accurate; we should expect 32 percent of polls to be outside the "margin of error" if it means a 1 SE confidence interval, and 5 percent of polls to be outside the "margin of error" if it means a 2 SE confidence interval (it normally does).

Note also that this is a measure of "chance error", which presumes the sample is truly random (and, in the case of election polls, that the preferences of the population will remain the same until election day). If there are sources of bias (under-calling cell phones, for example), the margin of error will not reflect those sources.

Problem 12. Identifying graphs. For draws from a box [1 2 2 5], the histogram of numbers drawn will have a gap between 2 and 5, and 2 will be drawn about twice as often as 1 or 5 -- graph ii matches this description. For a simulation:

    (bind box (list 1 2 2 5))
    (bind draws (draw 100 box))
    (hist draws integers)

For the sum of 100 numbers, we could reproduce the graph by our usual computer code:

    (bind sums nil)
    (dotimes (i 10000) (push (sum (draw 100 box)) sums))
    (hist sums integers)

The simulated sums will be approximately normally distributed, as in graph iii.

* Problem 13. Coin tossing; EV and SE for number of heads in 1000 tosses. (Adds basic hypothesis testing language.) If we do not presume the coin is fair, we might want to see whether or not 1000 tosses lead us to reject the hypothesis that the coin is fair at the 95 percent confidence level (that is, the 5 percent significance level).
This would translate to our result falling outside a 95 percent confidence interval centered at the hypothetical value of 500 heads.

According to the hypothesis, the box will have a mean of 0.5 and a SD of 0.5; with 1000 draws, the EV is of course 500 heads, and the SE of the sum will be (sqrt 1000) * 0.5 = 15.8114. A 95 percent CI for the sum will be 500 +/- (2 * 15.8) = 500 +/- 31.6.

In part a, our chance error of 29 is less than 2 SE from the EV; we do not have evidence sufficient to reject the hypothesis with 95 percent confidence.
In part b, our chance error is 16, just a bit more than one SE away from the EV -- again, there is not sufficient evidence to reject the hypothesis of a fair coin.
In part c, the chance error of 14 is less than one SE -- again, we do not reject the hypothesis.

Note that hypothesis testing sets up the box model on the basis of the hypothesis; confidence intervals as treated in this chapter set up the box model on the basis of the actual draw. In this chapter, the text may have wanted you to construct a box model with 529 ones out of 1000 tickets, and hence a mean of 0.529 and a SD of sqrt (0.529 * 0.471) = 0.4992 (which is practically the same as 0.5). The EV of the sum of 1000 draws will be 529 and the SE of the sum of 1000 draws will be (sqrt 1000) * 0.4992 = 15.7861; a 2 SE confidence interval around the EV would be 529 +/- 2 * 15.8 = 529 +/- 31.6 = 497.4 heads to 560.6 heads. The result expected from a fair coin (500 heads) falls into this interval, and again we would conclude that we do not have sufficient evidence against the hypothesis at the desired confidence level.

Problem 14. Renters survey. Of 1500 surveyed, 1035 were renters.

a. Within the box model built from the sample, the expected value of the percentage of people who rent is EXACTLY 1035 / 1500 = 0.69 or 69 percent. (The percentage for the whole population is unknown and is only estimated by this figure.)
b. Within the same box model, the SE for the percentage of people who rent is EXACTLY sqrt (0.69 * 0.31) / sqrt (1500) = 0.0119 or 1.19 percent.
Note that, under this box model, the percentage that we get in any sample of 1500 draws falls into the interval 69 +/- 2.4 = 66.6 to 71.4 percent about 95 percent of the time.
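The "about 95 percent" claim for Problem 14 can be checked by simulation, in the spirit of the cidemo demonstration. The following Python sketch is a stand-in for the course software, not a reproduction of it: the names, the seed, and the choice of 1000 repetitions are all assumptions made for illustration. It treats the sample as the box (69 percent ones), repeatedly draws 1500 tickets, and counts how often the sample percentage lands within 2 SE of 69.

```python
import math
import random

n = 1500                # draws per simulated survey
p = 1035 / n            # fraction of ones in the bootstrap box (0.69)
trials = 1000           # number of simulated surveys (illustrative choice)

# SE of the percentage, in percentage points, under the box model
se_percent = 100 * math.sqrt(p * (1 - p)) / math.sqrt(n)
lo = 100 * p - 2 * se_percent
hi = 100 * p + 2 * se_percent

rng = random.Random(0)  # fixed seed so the run is repeatable
inside = 0
for _ in range(trials):
    # Each draw is a 1 with chance p, otherwise a 0
    ones = sum(1 for _ in range(n) if p > rng.random())
    sample_percent = 100 * ones / n
    if hi >= sample_percent >= lo:
        inside += 1

coverage = inside / trials
print(f"SE = {se_percent:.2f} points; interval {lo:.1f} to {hi:.1f}")
print(f"fraction of simulated surveys inside the interval: {coverage:.3f}")
```

The fraction printed on the last line varies a little with the seed and the number of trials, but it should sit near 0.95.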