bootstrapping the SE for x̄

original sample → x̄ (sample statistic)
bootstrap sample 1 → x̄_1 (bootstrap statistic 1)
bootstrap sample 2 → x̄_2 (bootstrap statistic 2)
. . .
bootstrap sample k → x̄_k (bootstrap statistic k)
bootstrap distribution (k > 1,000) → SE

we have one sample, which we use to create many bootstrap samples and statistics

sampling distribution (sample statistics): μ − 2SE, μ, μ + 2SE
bootstrap distribution (bootstrap sample statistics): x̄ − 2SE_b, x̄, x̄ + 2SE_b

because SE_b ≅ SE we can create a 95% CI for μ using x̄ ± 2SE_b

Bootstrapping a 95% confidence interval

1. randomly pick a case (observation) from the sample
2. record the variable of interest for that case
3. replace (return) the case to the sample
4. repeat 1-3 n times to get a bootstrap sample
5. calculate the bootstrap statistic for the bootstrap sample
6. repeat 1-5 many times to generate a bootstrap distribution
7. compute SE_b and the CI using sample statistic ± 2 × SE_b

• this process is tedious, so we use StatKey

a sample (n = 100) of Smarties

recall, there are 52 orange Smarties, so the sample statistic is:
p̂ = 52/100 = 0.52

we need the SE of p̂ → StatKey

Bootstrapping an SE may be used to form a confidence interval (CI) for any sample statistic . . .
• proportion (p)
• mean (μ)
• difference in proportions (p_1 − p_2)
• difference in means (μ_1 − μ_2)

The basic bootstrapping process is always the same . . .
bootstrap samples → bootstrap statistics → bootstrap distribution → SE

EXAMPLE – Ford Mustangs

Suppose we want a 95% CI for the mean price of a used Ford Mustang. We have a random sample of n = 25 cars.
• what is the population parameter?
  μ = population mean price of a used Ford Mustang
• how do we get a point estimate of μ?
  calculate the sample mean x̄, which is the point estimate of μ

[Figure: dot plot of the sample MustangPrice, n = 25; horizontal axis: Price (in $1,000s), 0 to 45]

x̄ = 15.98 → the point estimate for μ is $15,980 . . .
but how accurate is this estimate?
• we need the SE of x̄
BOOTSTRAP

Original Sample
1-4. Bootstrap sample
5. Calculate the mean of the bootstrap sample
6. Repeat 1-5 1,000+ times to build up the bootstrap distribution
7. Find the standard deviation of the bootstrap distribution to get SE_b
we use StatKey for 1-7

EXAMPLE – Ford Mustangs cont. . . .

using StatKey, the 95% confidence interval is
x̄ ± 2 × SE_b
15.98 ± 2 × 2.178
(11.624, 20.336)
• interpretation: based on the sample, we are 95% confident that the population mean price of a used Ford Mustang is between $11,624 and $20,336

CHECK: how would this change if you re-did the bootstrapping?

EXAMPLE – Belief in global warming

Source: “Wide Partisan Divide Over Global Warming”, Pew Research Center, 10/27/10.

In 2010, 2,251 randomly selected Americans were asked: “Is there solid evidence of global warming?”
1,328 answered ‘yes’

Calculate and interpret a 95% CI for p, the proportion who believe there is ‘solid evidence’ of global warming
• the sample statistic p̂ = 1328/2251 = 0.590 (59%) is the point estimate of p
• to get SE_b we use StatKey

Belief in global warming cont.

using StatKey we get SE_b = 0.01
p̂ ± 2 × SE_b
0.59 ± 2 × 0.01
(0.57, 0.61)
• interpretation: we are 95% certain that the true percentage who believe there is ‘solid evidence’ of global warming lies between 57% and 61%

Belief in global warming cont.

Does belief in global warming differ by political party?
The sample proportion answering ‘yes’ was 79% among Democrats and 38% among Republicans.

Calculate a 95% CI for the difference in proportions
• sample sizes are not given; assume n = 1,000 for each party
• the sample statistic p̂_D − p̂_R = 0.79 − 0.38 = 0.41 is the point estimate of p_D − p_R

Belief in global warming cont.
using StatKey we get SE_b = 0.02
(p̂_D − p̂_R) ± 2 × SE_b → 0.41 ± 2 × 0.02 → (0.37, 0.45)
• interpretation: we are 95% sure that the difference between the proportions of Democrats and Republicans who believe in global warming is between 0.37 and 0.45 (37 to 45 percentage points)

NOTE: the confidence interval does not include zero. We are 95% sure that the difference is not zero; it is positive. (Something to think about.)

EXAMPLE – Human body temperature (°F)

Using the BodyTemp50 (Temperature) data and StatKey
98.26 ± 2 × 0.105
98.26 ± 0.21
(98.05, 98.47)
• interpretation: we are 95% sure that the population mean body temperature is between 98.05° and 98.47°

The true population value is 98.6°! What has gone wrong?

sample size v. number of bootstrap samples

• the larger n, the smaller SE_b and the narrower the CI
  larger n → more precise sample statistic
• intuition: more data means you have better information
• for a given n, increasing the number of bootstrap samples has little effect on the SE (and hence on the CI)
  larger number of bootstraps → little impact on SE
• intuition: no extra “information” about variability once the unevenness in a small number of bootstraps has been removed

Changing the confidence level

• what if we want to be more than 95% confident?
  • can we produce a 99% confidence interval?
  • what about a 99.5% confidence interval?
• what about being less than 95% confident?
  • e.g., a 90% confidence interval

If the bootstrap distribution is roughly symmetric, we can construct any confidence interval we want by finding the appropriate percentiles in the bootstrap distribution

[Figure: bootstrap distribution of the bootstrap statistics, centred on the sample statistic, with the middle P% marked between the lower bound and the upper bound]
• changing P% shifts the CI bounds
• a higher P% implies a wider CI
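The two ideas so far (sample statistic ± 2 × SE_b, and reading off percentiles of the bootstrap distribution) can be sketched in a few lines of Python. This is only a rough illustration, not what StatKey does internally: the data are simulated (they are not the BodyTemp50 data), and the helper names bootstrap_means, se_interval and percentile_interval are made up for this sketch.

```python
import random
import statistics


def bootstrap_means(sample, k=5000, seed=0):
    """Return k bootstrap means; each resample has size n and is drawn with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistics.fmean(rng.choices(sample, k=n)) for _ in range(k)]


def se_interval(sample, boot_stats):
    """Sample statistic +/- 2 * SE_b, where SE_b is the sd of the bootstrap statistics."""
    centre = statistics.fmean(sample)
    se_b = statistics.stdev(boot_stats)
    return centre - 2 * se_b, centre + 2 * se_b


def percentile_interval(boot_stats, level):
    """Bounds of the middle `level` fraction (e.g. 0.95) of the bootstrap statistics."""
    ordered = sorted(boot_stats)
    tail = (1 - level) / 2
    last = len(ordered) - 1
    return ordered[round(tail * last)], ordered[round((1 - tail) * last)]


# Simulated data for illustration only (assumption: NOT the BodyTemp50 sample).
data_rng = random.Random(1)
sample = [round(data_rng.gauss(98.3, 0.7), 1) for _ in range(50)]

boot = bootstrap_means(sample)
print("2 x SE_b interval:", se_interval(sample, boot))
for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} percentile interval:", percentile_interval(boot, level))
```

Re-running with a different seed changes the bounds slightly, and raising the level from 90% to 99% can only move the percentile bounds outwards.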
the middle P% defines the bounds of the P% CI

• for a 99% CI, the bounds are defined by the middle 99%, leaving 0.5% in each tail
  • 99% CI is (0.5th percentile, 99.5th percentile) from the bootstrap distribution
• for a 90% CI, the bounds are defined by the middle 90%, leaving 5% in each tail
  • 90% CI is (5th percentile, 95th percentile) from the bootstrap distribution
• we can use StatKey or inspect the bootstrap distribution

EXAMPLE – body temperature cont.

A 99% CI for mean body temperature is (97.996°, 98.584°) . . . wider than the 95% CI (98.05°, 98.47°)
• a 99% CI contains the middle 99% of bootstrap statistics, which is more than the middle 95% → the 99% CI is wider
• to be ‘more confident’ that the population parameter is in the CI we need a larger interval

higher confidence level → wider confidence interval

[Figure: bootstrap distribution of the bootstrap statistics, centred on the sample statistic, with the 90%, 95% and 99% intervals marked]
higher confidence level → wider CI

bootstrap CI methods

Method 1: find SE_b (the standard deviation of the bootstrap distribution) and compute a 95% confidence interval by
sample statistic ± 2 × SE_b

Method 2: generate a P% confidence interval using the bounds of the middle P% of bootstrap statistics

NOTE: if P = 95% both will give almost identical results

Illustration of the two methods for a 95% CI

Method 1:
1. use the bootstrap distribution
2. get SE_b and use the formula
98.26 ± 2 × 0.105 = (98.05, 98.47)

Method 2:
1. use the bootstrap distribution
2. click Two-Tail and read off the bounds
(98.052, 98.472)

• the two methods only give the same CI when the bootstrap distribution is smooth and symmetric

ALWAYS examine the bootstrap distribution!

• sadly, bootstrapping won’t always work
• if the bootstrap distribution is highly skewed or looks ‘spiky’ with gaps, you must use other methods (beyond introductory statistics)

EXAMPLE – Mercury and pH in Lakes

Study of the correlation between the average mercury level (ppm) in fish and the acidity (pH) of Florida lakes
r = −0.575
[Figure: mercury level v. pH, annotated ‘decreasing acidity’]

Find a 95% CI for ρ

The bootstrap distribution is not symmetric
• in this example, the bootstrap method is suspect
• plotting the bootstrap distribution is the only way to check!

• again, the bootstrap method is suspect

Week 4 homework
starts 6.00pm Tuesday 22 August
due 6.00pm next Monday (28 August)