Uploaded by lawsonbasocials

Lecture 1

bootstrapping the SE for x̄
[Diagram: the original sample gives x̄ (sample statistic); bootstrap sample 1 gives x̄_1 (bootstrap statistic 1), bootstrap sample 2 gives x̄_2 (bootstrap statistic 2), . . . , bootstrap sample k gives x̄_k (bootstrap statistic k), with k > 1,000; together the bootstrap statistics form the bootstrap distribution, whose standard deviation is the SE]
we have one sample, which we use to create many bootstrap samples and statistics
[Diagram: the sampling distribution of sample statistics is centred on μ, with μ − 2SE and μ + 2SE marked; the bootstrap distribution of bootstrap sample statistics is centred on x̄, with x̄ − 2SE_b and x̄ + 2SE_b marked]
because SE_b ≈ SE we can create a 95% CI for μ using x̄ ± 2SE_b
Bootstrapping a 95% confidence interval
1. randomly pick a case (observation) from the sample
2. record the variable of interest for that case
3. replace (return) the case to the sample
4. repeat 1-3 n times to get a bootstrap sample
5. calculate the bootstrap statistic for the bootstrap sample
6. repeat 1-5 many times to generate a bootstrap distribution
7. compute SE_b and the CI using sample statistic ± 2 × SE_b
• this process is tedious, so we use StatKey
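The seven steps can be sketched in Python as follows. This is a minimal illustration, not a description of any particular software: the function name bootstrap_means, the ten data values, the seed and the choice of 2,000 bootstrap samples are all assumptions made for the example.

```python
import random
import statistics

def bootstrap_means(sample, n_boot=2000, seed=0):
    """Return n_boot bootstrap means drawn from `sample` (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(sample)
    boot_stats = []
    for _ in range(n_boot):                 # step 6: repeat many times
        boot_sample = []
        for _ in range(n):                  # step 4: repeat n times
            case = rng.choice(sample)       # steps 1-3: pick, record, replace
            boot_sample.append(case)
        boot_stats.append(statistics.mean(boot_sample))   # step 5
    return boot_stats

# Hypothetical data, invented for illustration only
sample = [12, 15, 9, 20, 17, 14, 11, 18, 16, 13]

boot_stats = bootstrap_means(sample)
x_bar = statistics.mean(sample)
se_b = statistics.stdev(boot_stats)         # step 7: SD of the bootstrap distribution
ci = (x_bar - 2 * se_b, x_bar + 2 * se_b)   # sample statistic ± 2 × SE_b
print(f"x-bar = {x_bar:.2f}, SE_b = {se_b:.3f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

Because rng.choice never removes a case from the list, every draw is made "with replacement", which is what step 3 requires.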
a sample (n = 100) of Smarties
recall, there are 52 orange Smarties, so the sample statistic is:
p̂ = 52/100 = 0.52
→ we need the SE of p̂
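The SE of p̂ for this sample can be approximated by resampling the 100 recorded colours. The sketch below assumes 5,000 bootstrap samples and a fixed seed (neither comes from the lecture); coding orange as 1 and any other colour as 0 is only a convenience.

```python
import random
import statistics

rng = random.Random(1)
n, orange = 100, 52
sample = [1] * orange + [0] * (n - orange)   # 1 = orange, 0 = any other colour
p_hat = sum(sample) / n

boot_props = []
for _ in range(5000):                        # assumed number of bootstrap samples
    boot_sample = rng.choices(sample, k=n)   # n draws with replacement
    boot_props.append(sum(boot_sample) / n)  # bootstrap proportion

se_b = statistics.stdev(boot_props)          # SD of the bootstrap distribution
print(f"p-hat = {p_hat:.2f}, SE_b = {se_b:.3f}")
```

Each bootstrap proportion is the share of 1s in a resample, so the spread of boot_props estimates how much p̂ would vary from sample to sample.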
A bootstrapped SE may be used to form a confidence interval
(CI) for any population parameter estimated by a sample statistic . . .
• proportion (p)
• mean (μ)
• difference in proportions (p1 − p2)
• difference in means (μ1 − μ2)
The basic bootstrapping process is always the same . . .
bootstrap samples → bootstrap statistics → bootstrap distribution → SE
EXAMPLE – Ford Mustangs
Suppose we want a 95% CI for the mean price of a used Ford
Mustang. We have a random sample of n = 25 cars.
• what is the population parameter?
μ = population mean price of a used Ford Mustang
• how do we get a point estimate of μ?
calculate the sample mean x̄, which is the point estimate of μ
[Dot plot of the sample: MustangPrice, n = 25, Price axis from 0 to 45]
x̄ = 15.98
→ the point estimate for μ is $15,980
. . . but how accurate is this estimate?
• we need the SE of x̄ → BOOTSTRAP
Original Sample
1-4. Bootstrap sample
5. Calculate the mean of the bootstrap sample
6. Repeat 1-5 1,000+ times to build up the bootstrap distribution
7. Find the standard deviation of the bootstrap distribution to get SE_b
we use StatKey for 1-7
EXAMPLE – Ford Mustangs cont.
. . . using StatKey, the 95% confidence interval is
x̄ ± 2 × SE_b
15.98 ± 2 × 2.178
(11.624, 20.336)
• interpretation: based on the sample, we are 95% confident that the
population mean price of a used Ford Mustang is between
$11,624 and $20,336
CHECK: how would this change if you re-did the bootstrapping?
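The interval arithmetic can be checked with a few lines of Python. The helper name two_se_interval is hypothetical; the two inputs are the values quoted on the slide, in thousands of dollars.

```python
def two_se_interval(statistic, se_b):
    """Approximate 95% CI: statistic ± 2 × SE_b (hypothetical helper)."""
    margin = 2 * se_b
    return statistic - margin, statistic + margin

# x-bar and SE_b as quoted on the slide, in $1,000s
lower, upper = two_se_interval(15.98, 2.178)
print(f"({lower:.3f}, {upper:.3f})")
```

Only SE_b comes from random resampling, so re-running the bootstrap would be expected to move SE_b, and hence the bounds, slightly while x̄ stays at 15.98.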
EXAMPLE – Belief in global warming
Source: “Wide Partisan Divide Over Global Warming”, Pew Research Center, 10/27/10.
In 2010, 2,251 randomly selected Americans were asked: “Is
there solid evidence of global warming?” 1,328 answered ‘yes’.
Calculate and interpret a 95% CI for p, the proportion who
believe there is ‘solid evidence’ of global warming
• the sample statistic p̂ = 1328/2251 = 0.590 (59%) is the
point estimate of p
• to get SE_b we use StatKey
Belief in global warming cont.
using StatKey we get SE_b = 0.01
p̂ ± 2 × SE_b
0.59 ± 2 × 0.01
(0.57, 0.61)
• interpretation: we are 95% confident that the true percentage
who believe there is ‘solid evidence’ of global warming lies
between 57% and 61%
Belief in global warming cont.
Does belief in global warming differ by political party?
The sample proportion answering ‘yes’ was 79% among
Democrats and 38% among Republicans.
Calculate a 95% CI for the difference in proportions
• sample sizes are not given; assume n = 1,000 for each party
• the sample statistic p̂_D − p̂_R = 0.79 − 0.38 = 0.41 is the
point estimate of p_D − p_R
Belief in global warming cont.
using StatKey we get SE_b = 0.02
(p̂_D − p̂_R) ± 2 × SE_b → 0.41 ± 2 × 0.02 → (0.37, 0.45)
• interpretation: we are 95% confident that the difference between the
proportions of Democrats and Republicans who believe in
global warming is between 37 and 45 percentage points
NOTE: the confidence interval does not include zero. We are 95% confident
that the difference is not zero; it is positive. (Something to
about.)
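For a difference in proportions, each group is resampled separately and the difference is recomputed each time. The sketch below uses the slide's assumption of n = 1,000 per party; the 2,000 bootstrap samples, the seed and the 0/1 coding are assumptions added for illustration.

```python
import random
import statistics

rng = random.Random(2)
n = 1000                                # assumed sample size per party, as on the slide
dems = [1] * 790 + [0] * (n - 790)      # 79% answered 'yes'
reps = [1] * 380 + [0] * (n - 380)      # 38% answered 'yes'
diff_hat = sum(dems) / n - sum(reps) / n

boot_diffs = []
for _ in range(2000):                   # assumed number of bootstrap samples
    d = rng.choices(dems, k=n)          # resample Democrats with replacement
    r = rng.choices(reps, k=n)          # resample Republicans with replacement
    boot_diffs.append(sum(d) / n - sum(r) / n)

se_b = statistics.stdev(boot_diffs)     # SD of the bootstrap differences
ci = (diff_hat - 2 * se_b, diff_hat + 2 * se_b)
print(f"difference = {diff_hat:.2f}, SE_b = {se_b:.3f}, "
      f"95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Resampling the two groups independently keeps each group's sample size fixed, which mirrors how the original data were collected under the stated assumption.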
EXAMPLE – Human body temperature (°F)
Using the BodyTemp50 (Temperature) data and StatKey:
x̄ ± 2 × SE_b
98.26 ± 2 × 0.105
98.26 ± 0.21
(98.05, 98.47)
• interpretation: we are 95% confident that the population mean body
temperature is between 98.05° and 98.47°
The true population value is 98.6°! What has gone wrong?
sample size v. number of bootstrap samples
• the larger n, the smaller SE_b and the narrower the CI
larger n → more precise sample statistic
• intuition: more data means you have better information
• for given n, increasing the number of bootstrap samples has
little effect on SE_b (and hence on the CI)
larger number of bootstraps → little impact on SE_b
• intuition: extra bootstrap samples add no new “information” about
variability once there are enough to smooth out the unevenness
seen with a small number of bootstraps
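Both effects can be seen in a small simulation. Everything here is an assumption made for illustration: the data are drawn from a hypothetical normal population (mean 50, SD 10), and the sample sizes 25 and 400 and the bootstrap counts 1,000 and 10,000 are arbitrary choices.

```python
import random
import statistics

def bootstrap_se(sample, n_boot, rng):
    """SD of n_boot bootstrap means of `sample` (hypothetical helper)."""
    n = len(sample)
    means = [sum(rng.choices(sample, k=n)) / n for _ in range(n_boot)]
    return statistics.stdev(means)

rng = random.Random(3)
# Hypothetical samples from a normal population with mean 50 and SD 10
small = [rng.gauss(50, 10) for _ in range(25)]
large = [rng.gauss(50, 10) for _ in range(400)]

se_small = bootstrap_se(small, 1000, rng)        # n = 25, 1,000 bootstraps
se_large = bootstrap_se(large, 1000, rng)        # n = 400, 1,000 bootstraps
se_small_more = bootstrap_se(small, 10000, rng)  # n = 25, 10,000 bootstraps

print(f"n = 25,  1,000 bootstraps:  SE_b = {se_small:.3f}")
print(f"n = 400, 1,000 bootstraps:  SE_b = {se_large:.3f}")
print(f"n = 25,  10,000 bootstraps: SE_b = {se_small_more:.3f}")
```

Raising n from 25 to 400 should shrink SE_b substantially, whereas a tenfold increase in the number of bootstrap samples at n = 25 should leave it almost unchanged.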
Changing the confidence level
• what if we want to be more than 95% confident?
• can we produce a 99% confidence interval?
• what about a 99.5% confidence interval?
• what about being less than 95% confident?
• e.g., a 90% confidence interval
If the bootstrap distribution is roughly symmetric, we can
construct any confidence interval we want by finding the
appropriate percentiles in the bootstrap distribution
[Diagram: bootstrap distribution of bootstrap statistics, centred on the sample statistic, with the middle P% lying between the lower bound and the upper bound]
• changing P% shifts the CI bounds
• a higher P% implies a wider CI
. . . the middle P% defines the bounds of the P% CI
• for a 99% CI, the bounds are defined by the middle 99%, leaving
0.5% in each tail
• 99% CI is (0.5th percentile, 99.5th percentile) from the bootstrap
distribution
• for a 90% CI, the bounds are defined by the middle 90%, leaving
5% in each tail
• 90% CI is (5th percentile, 95th percentile) from the bootstrap distribution
• we can use StatKey
or inspect the bootstrap distribution
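Reading off the middle P% amounts to sorting the bootstrap statistics and picking the values at the two tail percentiles. The sketch below is one simple way to do that; the function name percentile_ci, the nearest-rank rule it uses and the simulated sample (which is not the BodyTemp50 data) are all assumptions.

```python
import random

def percentile_ci(boot_stats, level):
    """Bounds of the middle `level`% of the bootstrap statistics (nearest-rank rule)."""
    ordered = sorted(boot_stats)
    tail = (100 - level) / 2 / 100            # e.g. 0.005 in each tail when level = 99
    lo_idx = int(tail * (len(ordered) - 1))   # index of the lower percentile
    hi_idx = len(ordered) - 1 - lo_idx        # mirror-image index for the upper bound
    return ordered[lo_idx], ordered[hi_idx]

rng = random.Random(4)
# Hypothetical sample of 50 values; invented for illustration only
sample = [rng.gauss(98.3, 0.7) for _ in range(50)]
n = len(sample)
boot_means = [sum(rng.choices(sample, k=n)) / n for _ in range(5000)]

ci90 = percentile_ci(boot_means, 90)
ci95 = percentile_ci(boot_means, 95)
ci99 = percentile_ci(boot_means, 99)
for level, ci in ((90, ci90), (95, ci95), (99, ci99)):
    print(f"{level}% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```

Software that interpolates between neighbouring order statistics may report slightly different bounds from this nearest-rank version.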
EXAMPLE – body temperature cont.
A 99% CI for mean body temperature is (97.996°, 98.584°)
. . . wider than the 95% CI (98.05°, 98.47°)
• a 99% CI contains the middle 99% of bootstrap statistics, which is
more than the middle 95% → the 99% CI is wider
• to be ‘more confident’ that the population parameter is in
the CI we need to have a larger interval.
higher confidence level → wider confidence interval
[Diagram: bootstrap distribution of bootstrap statistics with nested 90%, 95% and 99% intervals around the sample statistic; higher confidence level → wider CI]
bootstrap CI methods
Method 1: find SE_b (standard deviation of the bootstrap
distribution) and compute a 95% confidence interval by
sample statistic ± 2 × SE_b
Method 2: Generate a P% confidence interval using the
bounds of the middle P% of bootstrap statistics
NOTE: if P = 95, both will give almost identical results
Illustration of the two methods for a 95% CI
Method 1:
1. use the bootstrap distribution
2. get SE_b and use the formula
98.26 ± 2 × 0.105 = (98.05, 98.47)
Method 2:
1. use the bootstrap distribution
2. click Two-Tail and read off the bounds
(98.052, 98.472)
• the two methods only give the same CI when the bootstrap
distribution is smooth and symmetric
ALWAYS examine the bootstrap distribution!
• sadly, bootstrapping won’t always work
• if the bootstrap distribution is highly skewed or looks ‘spiky’ with
gaps, you must use other methods (beyond introductory statistics)
EXAMPLE – Mercury and pH in Lakes
Study of the correlation between the average mercury level
(ppm) in fish and the acidity (pH) of Florida lakes.
[Scatterplot of mercury level and pH, annotated “decreasing acidity”]
r = −0.575
Find a 95% CI for ρ
The bootstrap distribution is not symmetric
• in this example, the bootstrap method is suspect
• plotting the bootstrap distribution is the only way to check!
• again, the bootstrap method is suspect
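A bootstrap for a correlation resamples whole cases, so each (x, y) pair stays together. The sketch below uses invented paired data with a strong negative relationship (not the Florida lakes data) and prints the two 95% intervals side by side; the helper pearson_r, the sample size of 20, the 3,000 bootstrap samples and the seed are assumptions.

```python
import math
import random
import statistics

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(5)
# Hypothetical paired data, invented for illustration only
xs = [rng.uniform(4, 9) for _ in range(20)]
ys = [2.0 - 0.2 * x + rng.gauss(0, 0.15) for x in xs]
pairs = list(zip(xs, ys))
r = pearson_r(xs, ys)

boot_rs = []
for _ in range(3000):
    resample = rng.choices(pairs, k=len(pairs))   # resample pairs, not x and y separately
    bx, by = zip(*resample)
    boot_rs.append(pearson_r(bx, by))

# Method 1: sample statistic ± 2 × SE_b
se_b = statistics.stdev(boot_rs)
se_ci = (r - 2 * se_b, r + 2 * se_b)

# Method 2: middle 95% of the bootstrap statistics (nearest-rank rule)
ordered = sorted(boot_rs)
k = int(0.025 * (len(ordered) - 1))
pct_ci = (ordered[k], ordered[-1 - k])

print(f"r = {r:.3f}")
print(f"method 1: ({se_ci[0]:.3f}, {se_ci[1]:.3f})")
print(f"method 2: ({pct_ci[0]:.3f}, {pct_ci[1]:.3f})")
```

If the two intervals disagree noticeably, or the percentile bounds sit at very different distances from r, that is a sign the bootstrap distribution is not symmetric and should be plotted before either interval is trusted.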
Week 4 homework
starts 6.00pm Tuesday 22 August; due
6.00pm next Monday (28 August)