Week 5: Foundations for Inference – Part 1

Required Reading: OpenIntro Statistics 4e, Chapter 5

Recap
• A Bernoulli random variable is a random variable X that takes the value X = 1 (success) with probability p and the value X = 0 (failure) with probability 1 − p. Its mean and variance are E(X) = p and Var(X) = p(1 − p).
• A binomial distribution is the distribution of the number of successes out of n independent Bernoulli trials.
• If the number of trials is large, the binomial distribution can be well approximated by the normal distribution.
• Recall that if X and Y are two independent random variables, then:
  Var(aX + bY) = a² Var(X) + b² Var(Y)
• Suppose you have n independent Bernoulli random variables X₁, X₂, …, Xₙ, each with probability of success p and variance p(1 − p). The variance of the proportion of successes out of n trials is:
  Var((X₁ + X₂ + ⋯ + Xₙ)/n) = (1/n²) Var(X₁ + X₂ + ⋯ + Xₙ) = n Var(X₁)/n² = p(1 − p)/n

Lecture Outline
1. Sampling Variability and Point Estimates
2. Sampling Distributions
3. Confidence Intervals

Sampling Variability and Point Estimates
• For much of the remainder of this course, the focus will be on statistical inference.
• Through statistical inference, we make statements about certain features of a population based on information contained in a sample from that population.
• To introduce this concept, consider the proposed second Scottish independence referendum.
• Suppose we are interested in the proportion of voting-aged individuals in Scotland who favour Scottish independence.
• There are roughly 4.5 million individuals in Scotland who are 16 or over. This is the population of interest. The size of the population is denoted by N.
• For simplicity, assume everybody either supports or opposes independence (i.e. there is no "I don't know").
• To measure this, one option would be to go out and survey every single person in the population.
• But as you can imagine, this would not be feasible for many reasons.
• Instead of surveying every person in the population, a more sensible approach is to randomly sample n people from this population.
• Based on the responses of these n individuals, we can make certain statements about the proportion of people in the population who favour independence.
• Complete populations are difficult to collect data on, so we use sample statistics as point estimates of the unknown population parameters of interest.
• These are our "best guesses" of the associated population parameters.
• Each respondent can either support or not support independence.
• Because we randomly select people to take part in our poll, whether or not any one individual supports Scottish independence is a random variable.
• In fact, it is a Bernoulli random variable with parameter p, where p is the population proportion.
• If we have n voters in our sample, then the number of voters in the sample who support independence follows a binomial distribution.
• This requires independence across observations. We cannot assume that observations are independent if the population of interest is small relative to the sample.
• In one random sample of 42 observations, the sample proportion is p̂ = 22/42 ≈ 0.52.
• p̂ is our point estimator of p: p̂ is a sample statistic and p is the population parameter.
• We can observe p̂, but not p.
• p̂ is a random variable. In this case, the number of supporters in the sample is a draw from a binomial distribution with n = 42 trials and probability of success p, and p̂ is that count divided by 42.
• If we go out and ask another 42 random people, we will get another sample proportion. Maybe next time only 18/42 will support independence.
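The following is a minimal simulation sketch, not part of the slides, that makes sampling variability concrete: it repeatedly draws polls of n = 42 voters from a population with an assumed true proportion p = 0.51 (the value used later in the sampling-distribution example) and prints the sample proportion from each poll.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

p = 0.51        # assumed population proportion (hypothetical, for illustration)
n = 42          # poll size, matching the lecture example
num_polls = 5   # a handful of repeated polls

for i in range(num_polls):
    # each respondent is a Bernoulli(p) draw: 1 = supports independence, 0 = does not
    responses = rng.binomial(1, p, size=n)
    supporters = responses.sum()
    p_hat = supporters / n
    print(f"Poll {i + 1}: {supporters}/{n} supporters, p_hat = {p_hat:.3f}")
```

Each run gives a different set of sample proportions scattered around p, which is exactly the sampling variability described above.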
Sampling Variability and Point Estimates
• Because of sampling variability, p̂ is not going to be exactly equal to p.
• It could be higher or lower just due to chance.
• This notion is often conveniently ignored by politicians.
  https://fivethirtyeight.com/features/the-art-of-cherry-picking-polls/
• Sampling variability is not the only possible source of error. It is also possible that we are systematically over- or under-estimating the true value. Bias is systematic, whereas sampling variability is just chance.
• One possible source of bias is sampling bias. This occurs when a certain subset of the population is more likely to be sampled than others.
• In the 1948 US presidential election between Dewey and Truman, most polls suggested Dewey would win, but he ended up losing by a clear margin. There were several problems with the polling, one of which was that it was conducted over the telephone.
• At the time, telephones were much more common in wealthy households, and wealthy voters were more likely to favour Dewey.
• Bias can be eliminated through well-designed survey and sampling strategies (see Chapter 1), as well as by using the appropriate statistical methodology.

Sampling Distributions
• If we know how p̂ varies from sample to sample, we can make inferential statements about the associated population proportion.
• If observations are independent, then the sampling distribution of p̂ can be calculated from a binomial distribution.
• A sampling distribution is the distribution of a sample statistic that you would observe across a very large number of different samples.
• Sampling distributions are not observed in practice – rather, you observe a single realisation from the sampling distribution.
• Suppose the population proportion of independence supporters is 0.51. The slides show the distribution of p̂ across 5,000 simulated samples for n = 10 and n = 42.
• The mean of the sampling distribution is the population proportion: E(p̂) = p.
• As the sample size increases, the mean of the sampling distribution stays the same but the variability becomes smaller. The slides also show the sampling distribution for n = 500.
• The mean of all of these distributions is E(p̂) = p and their standard deviation is √(p(1 − p)/n).
• The standard deviation of a sampling distribution is usually referred to as the standard error rather than the standard deviation:
  SE_p̂ = √(p(1 − p)/n)
• The binomial distribution is difficult to work with, but as discussed last week, it tends to the normal distribution as the sample size grows.
• This is due to the central limit theorem.
• As the sample size grows, the distribution of the sample proportion (or sample mean) can be approximated by a normal distribution.
• If the sample size is large enough, and if draws are independent, then the distribution of the sample proportion tends to a normal distribution with mean E(p̂) = p and standard error √(p(1 − p)/n).
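As a quick numerical check of these two results (not part of the slides), the sketch below simulates the sampling distribution of p̂ for p = 0.51 and n = 500 and compares the empirical mean and standard deviation of p̂ with p and √(p(1 − p)/n).

```python
import numpy as np

rng = np.random.default_rng(seed=2)

p, n, reps = 0.51, 500, 5000   # values from the slides' sampling-distribution example

# each replicate counts the successes in n Bernoulli(p) draws, then divides by n
p_hats = rng.binomial(n, p, size=reps) / n

se_formula = np.sqrt(p * (1 - p) / n)

print(f"mean of p_hat over {reps} samples: {p_hats.mean():.4f}   (p = {p})")
print(f"empirical SD of p_hat:             {p_hats.std():.4f}")
print(f"sqrt(p(1 - p)/n):                  {se_formula:.4f}")
# the empirical SD should be close to the formula value (about 0.022 here),
# and a histogram of p_hats would look approximately normal, as the CLT suggests
```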
Sampling Distributions
• The Charlottetown Accord was a proposed set of constitutional amendments in Canada.
• The accord was defeated in a 1992 referendum, with only about 45% voting "Yes" to the reforms.
• Let's take p = 0.45 to represent the population proportion of Canadians favouring the reforms.
• Take a random sample of n = 100 Canadian voters. The proportion of voters in the sample who support the reforms is a random variable p̂ with mean
  E(p̂) = 0.45
  and standard error
  SE_p̂ = √(p(1 − p)/n) = √(0.45 × (1 − 0.45)/100) ≈ 0.0497
• Because the sample is relatively large, the distribution of p̂ is approximately normal.
• The textbook gives a rule of thumb for approximate normality: np ≥ 10 and n(1 − p) ≥ 10. (The slide shows the approximately normal sampling distribution of p̂ centred at 0.45.)
• What is the probability that we observe a sample proportion of at least 0.50? We need to work out how many standard errors away from the mean this is. In other words, let's calculate the appropriate z-score:
  z = (0.50 − 0.45)/0.0497 ≈ 1.01
• The upper-tail area beyond z = 1.01 is about 15.6%, so the probability of observing a sample proportion of at least 0.50 is roughly 15.6%.
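Here is a small sketch (not from the slides) that recomputes this example numerically: the standard error, the z-score, and the upper-tail probability under the normal approximation.

```python
import numpy as np
from scipy.stats import norm

p, n = 0.45, 100                 # population proportion and sample size from the example
se = np.sqrt(p * (1 - p) / n)    # standard error of the sample proportion
z = (0.50 - p) / se              # how many SEs the value 0.50 lies above the mean

tail_prob = norm.sf(z)           # P(Z >= z) for a standard normal, i.e. the upper tail

print(f"SE = {se:.4f}")                       # about 0.0497
print(f"z  = {z:.2f}")                        # about 1.01
print(f"P(p_hat >= 0.50) ≈ {tail_prob:.3f}")  # about 0.157, in line with the slide's 15.6%
```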
Confidence Intervals
• If we know the sampling distribution of a sample statistic, we can construct a confidence interval.
• A confidence interval is a range of plausible values for the population parameter that we are trying to estimate.
• Confidence intervals have an associated confidence level that represents a certain long-run capture rate.
• At the 95% confidence level, 95% of random samples will produce a 95% confidence interval that contains the true population parameter.

Confidence Intervals
• Let's play a game. Get a sheet of paper.
• It needs to be big enough for you to write 10 pairs of numbers on it.
• I am going to ask you 10 questions, each of which has a numeric answer.
• For each question, write down two numbers: a lower bound and an upper bound. You should be 90% confident that the correct answer lies between these two numbers.
• Your goal is not to construct intervals that contain the true answer every time. Your goal is to construct intervals that are just wide enough that the true answer is contained 90% of the time in expectation.

1. What is the population of Africa?
2. In 2017/2018, how many students at UoE were from the USA?
3. According to Google Maps, how long would it take you to cycle from Aberdeen to Plymouth?
4. What is the diameter of the moon (in km or miles)?
5. How many rooms are in Buckingham Palace?
6. From the lobby to the observation deck, how many steps are there in the Empire State Building?
7. How many calories are there, on average, in a 750 ml bottle of red wine?
8. For how many days has Justin Trudeau been the Prime Minister of Canada?
9. What was the population of Scotland in 1775?
10. How much rain does Edinburgh get a year, in mm?

Confidence Intervals
• Now, I'll read out the correct answers. Count how many of your intervals contained the correct answer.
• How many of you had zero answers outside your bounds? Exactly one? Exactly two? Exactly three? Exactly four? Five or more?
• If you are calibrating your intervals correctly, then the probability distribution for the number of "surprises" is the one shown on the slide.

Confidence Intervals
• To construct a 95% confidence interval, we need to add and subtract a certain value from our sample statistic such that 95% of the time we will capture the population parameter.
• 97.5% of the normal distribution lies below 1.96 standard deviations above the mean. Because the normal distribution is symmetric, 97.5% of the normal distribution also lies above 1.96 standard deviations below the mean.
• So, to construct a 95% confidence interval, take the sample proportion and then add and subtract 1.96 standard errors:
  (p̂ − 1.96 SE, p̂ + 1.96 SE)
• When a confidence interval is constructed in this manner, 95% of the confidence intervals computed from a very large number of samples will contain the population proportion – and 5% will not.
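As a closing sketch (not from the slides), here is how such a 95% interval could be computed for the earlier Scottish-independence sample of 22 supporters out of 42. Since the true p is unknown, the sketch plugs p̂ into the standard error formula, which is the usual practical substitution.

```python
import numpy as np

supporters, n = 22, 42          # the sample from the earlier independence example
p_hat = supporters / n

# estimate the standard error by plugging p_hat in for the unknown p
se = np.sqrt(p_hat * (1 - p_hat) / n)

z_star = 1.96                   # critical value for a 95% confidence level
lower = p_hat - z_star * se
upper = p_hat + z_star * se

print(f"p_hat = {p_hat:.3f}, SE = {se:.3f}")
print(f"95% CI: ({lower:.3f}, {upper:.3f})")   # prints roughly (0.373, 0.675)
```

With such a small sample the interval is wide and contains values on both sides of 0.5, so this poll alone would not establish majority support either way.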