Uploaded by lingjun han

Lecture 5.1 W5

Week 5: Foundations for Inference – Part 1
Required Reading
OpenIntro Statistics 4e: Chapter 5
Recap
• A Bernoulli random variable is a random variable which takes a value 𝑋 = 1
(success) with probability 𝑝 and a value 𝑋 = 0 (failure) with probability 1 − 𝑝.
$E(X) = p$
$\mathrm{Var}(X) = p(1 - p)$
• A binomial distribution is the distribution of the number of successes out of 𝑛
independent Bernoulli trials.
• If there is a large number of trials, the binomial distribution can be well approximated by the
normal distribution.
Recap
• Recall that if you have two independent random variables, then:
$\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y)$
• Suppose you have 𝑛 independent Bernoulli random variables, each with a
probability of success 𝑝 and variance 𝑝(1 − 𝑝). The variance of the proportion of
successes out of 𝑛 trials is:
$\mathrm{Var}\left(\frac{X_1 + X_2 + \cdots + X_n}{n}\right) = \mathrm{Var}\left(\frac{1}{n}X_1 + \frac{1}{n}X_2 + \cdots + \frac{1}{n}X_n\right) = \frac{\mathrm{Var}(X)}{n} = \frac{p(1 - p)}{n}$
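The variance result above can be checked by simulation. The sketch below is only an illustration: the values n = 50 and p = 0.3, the seed, and the helper name sample_proportion are assumptions chosen for the example, not values from the lecture.

```python
import random
import statistics

def sample_proportion(n, p, rng):
    """Proportion of successes in n independent Bernoulli(p) trials."""
    return sum(rng.random() < p for _ in range(n)) / n

rng = random.Random(0)   # fixed seed so the run is repeatable
n, p = 50, 0.3           # illustrative values, not from the lecture

# Draw many sample proportions and compare their variance with p(1 - p)/n.
props = [sample_proportion(n, p, rng) for _ in range(20000)]

simulated_var = statistics.pvariance(props)
theoretical_var = p * (1 - p) / n
print(f"simulated:   {simulated_var:.5f}")
print(f"theoretical: {theoretical_var:.5f}")
```

The two printed numbers should agree closely, and the agreement should tighten as the number of simulated samples grows.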
Lecture Outline
1. Sampling Variability and Point Estimates
2. Sampling Distributions
3. Confidence Intervals
Sampling Variability and Point Estimates
• For much of the remainder of this course, the focus will be on statistical inference.
• Through statistical inference, we will make statements about certain features of the population
based on information contained within a sample from that population.
• To introduce this concept, consider the proposed second Scottish independence
referendum.
Sampling Variability and Point Estimates
• Suppose we are interested in the proportion of voting-aged individuals in Scotland
who favour Scottish independence.
• There are roughly 4.5 million individuals in Scotland who are 16 or over. This is the population
of interest. The size of the population is denoted by 𝑁.
• For simplicity, assume everybody either supports or opposes independence (i.e. there is no “I
don’t know”).
• To measure this, one option would be to go out and survey every single person
from the population.
• But as you can imagine, this would not be possible for many reasons.
Sampling Variability and Point Estimates
• Instead of surveying every person in the population, a more sensible approach
would be to randomly sample 𝑛 people from this population.
• Based on the responses of these 𝑛 individuals, we can make certain statements about the
proportion of people in the population who favour independence.
• Complete populations are difficult to collect data on, so we use sample statistics
as point estimates for the unknown population parameters of interest.
• These are our “best guesses” of the associated population parameter.
Sampling Variability and Point Estimates
• Each respondent can either support or not support independence.
• Because we randomly select people to take part in our poll, whether or not any one
individual supports Scottish independence is a random variable.
• In fact, it is a Bernoulli random variable with parameter 𝑝, where 𝑝 is the population proportion.
• If we have 𝑛 voters in our sample, then the number of voters in the sample who
support independence follows a binomial distribution.
• This requires independence across observations. We cannot assume that observations are
independent if the population of interest is small relative to the sample.
Sampling Variability and Point Estimates
• In this random sample of 42 observations, the sample proportion is 𝑝̂ = 22/42 = 0.52.
• 𝑝̂ is our point estimator of 𝑝. 𝑝̂ is a sample statistic and 𝑝 is the population parameter.
• We can observe 𝑝̂, but not 𝑝.
• 𝑝̂ is a random variable. In this case, the number of supporters in the sample is a random
draw from a binomial distribution with 𝑛 = 42 trials and probability of success 𝑝, and 𝑝̂ is
that count divided by 42.
• If we go out and ask another 42 random people, we will get another sample proportion.
Maybe next time, just 18/42 will support independence.
Sampling Variability and Point Estimates
• Because of sampling
variability, 𝑝̂ is not going to
be exactly equal to 𝑝.
• It could be higher or lower
just due to chance.
• This notion is often
conveniently ignored by
politicians.
https://fivethirtyeight.com/features/the-art-of-cherry-picking-polls/
Sampling Variability and Point Estimates
• This is not the only possible source of error. It is also possible that we are
systematically over- or under-estimating the true value. Bias is systematic, whereas
sampling variability is just chance.
• One possible source of bias is sampling bias. This occurs where a certain subset of
the population is more likely to be sampled than others.
• In the 1948 presidential election between Dewey and Truman, most polls
suggested Dewey would win, but he ended up losing by a lot. There were several
issues with polling here, one of which is that they were conducted over the
telephone.
• At the time, telephones were much more common in wealthy households. Wealthy
voters were more likely to favour Dewey.
Sampling Variability and Point Estimates
• Bias can be eliminated through well-designed
survey and sampling strategies (see chapter
1), as well as by using the appropriate
statistical methodology.
theconversation.com
Sampling Distributions
• If we know how 𝑝̂ varies from sample to sample, we can make inferential
statements about the associated population proportion.
• If observations are independent, then the sampling distribution of 𝑝̂ can be
calculated from a binomial distribution.
• A sampling distribution is the distribution of the sample statistic that you would
observe across a very large number of different samples.
• Sampling distributions are not observed in practice – rather you can observe a
single realisation from the sampling distribution.
• Suppose the population proportion of independence supporters is 0.51. Below is
the distribution of 𝑝̂ across 5,000 different samples for 𝑛 = 10 and 𝑛 = 42.
• The mean of the sampling distribution is the population proportion: 𝐸(𝑝̂) = 𝑝.
• As the sample size increases, the mean of the sampling distribution is the same but
the variability grows smaller. Below is the sampling distribution where 𝑛 = 500.
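The behaviour described on these slides can be reproduced with a short simulation. This is a sketch rather than the code behind the lecture's figures: it uses 𝑝 = 0.51, 5,000 samples and 𝑛 = 10, 42 and 500 as stated above, while the seed and the function name simulate_sampling_distribution are assumptions made for the example.

```python
import math
import random
import statistics

def simulate_sampling_distribution(n, p, n_samples, rng):
    """Return n_samples sample proportions, each from n Bernoulli(p) draws."""
    return [sum(rng.random() < p for _ in range(n)) / n
            for _ in range(n_samples)]

rng = random.Random(1)   # fixed seed so the run is repeatable
p = 0.51                 # population proportion used on the slide
results = {}
for n in (10, 42, 500):
    props = simulate_sampling_distribution(n, p, 5000, rng)
    # Store the centre and spread of the simulated sampling distribution.
    results[n] = (statistics.mean(props), statistics.pstdev(props))
    print(f"n={n:3d}  mean={results[n][0]:.3f}  sd={results[n][1]:.4f}  "
          f"sqrt(p(1-p)/n)={math.sqrt(p * (1 - p) / n):.4f}")
```

For each sample size the simulated mean should sit near 0.51, while the spread should shrink as 𝑛 grows and stay close to the square root of 𝑝(1 − 𝑝)/𝑛 from the recap.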
Sampling Distributions
• The mean of all of these distributions is 𝐸(𝑝̂) = 𝑝 and the standard deviation is
$\sqrt{p(1 - p)/n}$.
• The standard deviation of a sampling distribution is often referred to as the
standard error, rather than the standard deviation.
$SE_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}}$
Sampling Distributions
• The binomial distribution is difficult to work with, but as discussed last week the
binomial distribution tends to the normal distribution as the sample size grows.
• This is due to the central limit theorem.
• As the sample size grows, the distribution of the sample proportion (or sample mean) can be
approximated by a normal distribution.
• If the sample size is large enough, and if draws are independent, then the
distribution of the sample proportion tends to a normal distribution with a mean of
𝐸(𝑝̂) = 𝑝 and a standard error of $\sqrt{p(1 - p)/n}$.
Sampling Distributions
• The Charlottetown Accord was
a proposed set of constitutional
amendments in Canada.
• The referendum was defeated
in 1992, with only 45% voting
“Yes” to the reforms.
• Let’s take 𝑝 = 0.45 to represent
the population proportion of
Canadians favouring the
reforms.
Sampling Distributions
• Take a random sample of 𝑛 = 100 Canadian voters. The proportion of those in the
sample who support the referendum is a random variable 𝑝̂ with a mean of:
$E(\hat{p}) = 0.45$
• and a standard error of:
$SE_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}} = \sqrt{\frac{0.45(1 - 0.45)}{100}} = 0.0497$
Sampling Distributions
• Because the sample is relatively large, the distribution of 𝑝̂ is approximately
normal.
• The textbook gives a rule for approximate normality of 𝑛𝑝 ≥ 10 and 𝑛(1 − 𝑝) ≥ 10.
[Figure: distribution of 𝑝̂, centred at 0.45]
Sampling Distributions
• What is the probability that we observe a sample proportion of at least 0.50? We
need to work out how many standard errors away from the mean this is. In other
words, let’s calculate the appropriate z score:
$Z = \frac{0.50 - 0.45}{0.0497} = 1.01$
[Figure: normal curve centred at 0.45; the shaded area above 0.50 is 15.6%]
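The Charlottetown calculation can be reproduced in a few lines. This is a minimal sketch using Python's standard-library NormalDist in place of a normal table. The variable names are chosen for the example, and the success-failure check follows the textbook rule quoted earlier.

```python
import math
from statistics import NormalDist

p, n = 0.45, 100     # values from the Charlottetown example
threshold = 0.50     # sample proportion of interest

# Success-failure check for the normal approximation.
normal_ok = n * p >= 10 and n * (1 - p) >= 10

se = math.sqrt(p * (1 - p) / n)          # standard error of p-hat
z = (threshold - p) / se                 # standard errors above the mean
upper_tail = 1 - NormalDist().cdf(z)     # P(p-hat >= 0.50) under normality

print(f"normal approximation reasonable: {normal_ok}")
print(f"SE = {se:.4f}, z = {z:.2f}, upper tail = {upper_tail:.3f}")
```

Because the code keeps the unrounded z score, the tail probability it prints is marginally larger than the 15.6% obtained by rounding z to 1.01 before looking it up.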
Confidence Intervals
• If we know the sampling distribution of a sample statistic, we can construct a
confidence interval.
• A confidence interval is a range of plausible values for the population parameter
that we are trying to estimate.
• Confidence intervals have an associated confidence level that represents a certain long-run capture rate.
• For a 95% confidence interval, 95% of random samples will produce a 95%
confidence interval that contains the true population parameter.
Confidence Intervals
• Let’s play a game. Get a sheet of paper.
• It needs to be big enough for you to write 10 pairs of numbers on it.
• I am going to ask you 10 questions, each of which has a numeric answer.
• For each question, write down two numbers: a lower bound and an upper bound.
You should be 90% confident that the correct answer lies between these two
numbers.
• Your goal is not to construct intervals that contain the true answer every time.
Your goal is to construct intervals that are just wide enough that the true answer
is contained 90% of the time in expectation.
1. What is the population of Africa?
2. In 2017/2018, how many students at UoE
were from the USA?
3. According to Google Maps, how long would it
take you to cycle from Aberdeen to Plymouth?
4. What is the diameter of the moon (in km or
miles)?
5. How many rooms are in Buckingham Palace?
6. From the lobby to the observation deck, how many
steps are there in the Empire State Building?
7. How many calories are there, on average, in a
750 ml bottle of red wine?
8. For how many days has Justin Trudeau been the
Prime Minister of Canada?
9. What was the population of Scotland in 1775?
10. How much rain does Edinburgh get a year, in
mm?
Confidence Intervals
• Now, I’ll read the correct answers. Count how many of
your intervals contained the correct answer.
How many of you had zero outside the bounds?
How many of you had exactly one outside of the bounds?
How many of you had exactly two outside of the bounds?
How many of you had exactly three outside of the bounds?
How many of you had exactly four outside of the bounds?
How many of you had five or more outside of the bounds?
• If you are calculating your intervals correctly, then the
probability distribution for the number of “surprises” is
depicted to the right.
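The slide's figure is not reproduced here, but the distribution can be sketched under one assumption: each 90% interval misses independently with probability 0.10. The number of "surprises" out of 10 questions is then binomial with 𝑛 = 10 and 𝑝 = 0.10. The helper binom_pmf is a name chosen for the example.

```python
from math import comb

n_questions = 10
miss_prob = 0.10   # a well-calibrated 90% interval misses 10% of the time

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of each possible number of misses, assuming independent questions.
pmf = [binom_pmf(k, n_questions, miss_prob) for k in range(n_questions + 1)]

for k in range(5):
    print(f"exactly {k} outside the bounds: {pmf[k]:.3f}")
print(f"five or more outside the bounds: {sum(pmf[5:]):.4f}")
```

Under this assumption, zero or one miss is the most likely outcome, and five or more misses would be very unusual for someone whose intervals really are 90% intervals.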
Confidence Intervals
• To construct a 95% confidence interval, we need to add and subtract a certain
value from our sample statistic such that 95% of the time we will capture the
population parameter.
• 97.5% of the normal distribution lies below the point 1.96 standard deviations above the
mean. Because the normal distribution is symmetric, this means that 97.5% of the
normal distribution also lies above the point 1.96 standard deviations below the mean.
Together, 95% of the distribution lies within 1.96 standard deviations of the mean.
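The 1.96 multiplier can be recovered from the standard normal distribution directly. The following sketch uses Python's standard-library NormalDist. The names z_star and central_area are chosen for the example.

```python
from statistics import NormalDist

std_normal = NormalDist()            # mean 0, standard deviation 1

# Point with 97.5% of the distribution below it.
z_star = std_normal.inv_cdf(0.975)

# Area between -z_star and +z_star, which should be 95%.
central_area = std_normal.cdf(z_star) - std_normal.cdf(-z_star)

print(f"z* = {z_star:.3f}, central area = {central_area:.3f}")
```

Changing 0.975 to 0.95 would give the multiplier for a 90% interval in the same way.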
Confidence Intervals
• To construct a 95% confidence interval, take the
sample proportion, and then add and subtract
1.96 standard errors.
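• When a confidence interval is constructed in this
manner, then 95% of confidence intervals from a
very large number of samples will contain the
population parameter (here, the population proportion 𝑝).
– And 5% will not.
[Figure: normal curve with 𝑝 − 1.96𝑆𝐸, 𝑝 and 𝑝 + 1.96𝑆𝐸 marked on the horizontal axis]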
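The recipe can be sketched in code, with two assumptions stated up front. First, since 𝑝 is unknown in practice, the standard error is estimated by plugging the sample proportion into the formula, a common choice that these slides do not spell out. Second, the coverage check needs a known "true" proportion, so it reuses 𝑝 = 0.45 and 𝑛 = 100 from the Charlottetown example. The function name proportion_ci is hypothetical.

```python
import math
import random

def proportion_ci(successes, n, z_star=1.96):
    """Interval p_hat +/- z_star * SE, with SE estimated from p_hat (assumption)."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z_star * se, p_hat + z_star * se

# Interval for the poll from earlier in the lecture: 22 supporters out of 42.
low, high = proportion_ci(22, 42)
print(f"95% CI: ({low:.3f}, {high:.3f})")

# Long-run capture rate: build many intervals and count how often they contain p.
rng = random.Random(2)   # fixed seed so the run is repeatable
true_p, n, n_samples = 0.45, 100, 10000
hits = 0
for _ in range(n_samples):
    successes = sum(rng.random() < true_p for _ in range(n))
    lo, hi = proportion_ci(successes, n)
    hits += lo <= true_p <= hi
coverage = hits / n_samples
print(f"share of intervals containing p: {coverage:.3f}")
```

The interval for the 42-person poll runs from roughly 0.37 to 0.67, so it is too wide to say which side is ahead. The simulated capture rate should land close to 95%, though not exactly on it, because the normal approximation and the estimated standard error are both approximate.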