SECTION 2.2: STATISTICAL INFERENCE FROM SAMPLE TO POPULATION

BIG IDEA OF THE DAY
Chapter 1: sampling from a process. We observe a small number of attempts out of an infinite number of possible attempts (imagine Buzz and Doris doing the experiment forever).
Chapter 2: sampling from a finite population. There is a limited (finite) number of individuals, and we sample a portion of that finite population. How does this affect the chance model?

EXAMPLE 2.2: FAKE CELL PHONE CALLS
Have you ever pretended to be talking on a cell phone to avoid interacting with the people around you? The Pew Research Center surveyed a random sample of 1,858 American cell phone users and found that 13% admitted to faking a cell phone call in the past 30 days. (iPhone app: Fake-A-Call™ by Excelltech Inc.)

What are the observational units? The variable of interest? The population? The sample? The parameter of interest? The statistic for this study?

Does this survey convince you that more than 1 in 10 cell phone users in the U.S. has engaged in such fake cell phone use in the past 30 days? If not, what could be another explanation for the survey results?

The sample result (13%) is greater than 1 in 10 (or 10%). It is possible that only 10% of the population of cell phone users have faked a call and the researchers just happened to get a higher percentage by the luck of the draw. How plausible (believable) is this explanation for the higher sample percentage?

Research question: Do more than 1 in 10 cell phone users in the U.S. admit to engaging in fake cell phone use in the past 30 days?
Null hypothesis: 10% of all American cell phone users admit to faking a call in the past 30 days.
Alternative hypothesis: More than 10% of all American cell phone users admit to faking a call in the past 30 days.

Notice that the null and alternative hypotheses are statements about the unknown parameter. We don't know what proportion of all cell phone users have faked a call (we don't know the parameter); we only have information from 1,858 cell phone users (we know the statistic). Do these 1,858 respondents give us meaningful information about this proportion for the entire population of all cell phone users (the parameter)?

As before, we assume the null hypothesis is true. If the observed statistic would be unlikely to happen by chance alone, we have evidence against the null hypothesis in favor of the alternative hypothesis. The p-value assesses the probability of getting a sample proportion at least as large as 0.13 if, in fact, the proportion of all cell phone users who fake calls is only 0.10.

If the null hypothesis is true, what is the probability that the first person selected will admit to faking a call? This probability equals the parameter (the proportion of all cell phone users who admit to faking a call) under the null hypothesis, which is 0.10.

WHAT IS THE IMPACT OF SAMPLING FROM A FINITE POPULATION RATHER THAN A PROCESS?
If the first person admits to faking a call, what is the probability that the second person selected also admits to faking a call? We are sampling "without replacement": once someone is selected, they are not put back before the next person is selected. If there were 255 million cell phone users when the first person was selected, then there are 254,999,999 cell phone users left when the second person is selected, of whom 25,499,999 would admit to faking a call. The probability that the second person selected also admits to faking a call is therefore 25,499,999/254,999,999 ≈ 0.099999996.
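Here is a minimal sketch of that arithmetic (not from the text), assuming a population of exactly 255 million cell phone users of whom exactly 10% would admit to faking a call. It shows how little the probability changes once the first person is removed from the pool.

```python
from fractions import Fraction

population = 255_000_000          # assumed number of U.S. cell phone users
fakers = population // 10         # exactly 10% of them, under the null hypothesis

# First person selected: the probability of "success" is the population proportion.
p_first = Fraction(fakers, population)

# Second person selected, given that the first person admitted to faking a call.
# Sampling without replacement removes that person (a "success") from the pool.
p_second_given_first = Fraction(fakers - 1, population - 1)

print(float(p_first))               # 0.1
print(float(p_second_given_first))  # 0.09999999647... (barely below 0.10)
```

Because the second probability is still essentially 0.10, treating every selection as having the same 10% chance of success is a very good approximation here.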
When the population size is large (more than 10 or 20 times the size of the sample), we can still consider random sampling from a finite population to be equivalent to random sampling from a process. Is it safe to use that model of random sampling for the finite population in this study? Yes: we model the probability of success as the same for each observational unit in our sample. Under the null hypothesis, every person selected has a 10% chance of admitting to faking cell phone calls in the past 30 days.

We can conduct the same type of simulation we used in Chapter 1 for processes when sampling from a population. Let's go to an applet and try this (a code sketch of the same simulation appears at the end of this section). Remember, we are testing whether the population proportion of cell phone fakers is more than 10%, and the results of our poll showed that 13% of 1,858 respondents admitted to faking a call.

We have convincing evidence that the sample proportion of 0.13 didn't just "happen by chance." Thus we have very strong evidence that the population proportion of cell phone users who will admit to faking a call is larger than 0.10.

SUMMARY
Null hypothesis: 10% of all cell phone users admit to faking a call in the past 30 days.
Alternative hypothesis: More than 10% of all cell phone users admit to faking a call in the past 30 days.
We simulate and find a tiny p-value (≈ 0). Thus we have very strong evidence that the population proportion of cell phone users who will admit to faking a call is larger than 0.10.

RANDOM SAMPLING ERROR
It is still possible that the researchers were simply unlucky. However, the probability of this is so low that a more believable explanation is that the population proportion does indeed exceed 0.10. Random sampling also allows us to estimate how much "random sampling error" to expect.

The sampling error is roughly 1/√n, where n is the number of observational units (the sample size). Therefore, the sampling error in our poll is about 1/√1858 ≈ 0.023 = 2.3%, so we can expect the sample percentage to be within about 2.3 percentage points of the population percentage. The 10% we were testing is outside this range (13% ± 2.3% is roughly 10.7% to 15.3%).

What if we had sampled only 500 people? The sampling error in this smaller poll would be about 1/√500 ≈ 0.045 = 4.5%, so we could expect the sample percentage to be within about 4.5 percentage points of the population percentage. The 10% we were testing is NOT outside this range (13% ± 4.5% is roughly 8.5% to 17.5%).

We reduce variability (random sampling error) with larger sample sizes. Simple random sampling gives a lot of predictability in sample proportions, and this predictability depends strongly on the sample size, not on the population size, as long as the population size is large. There is also the possibility of nonsampling error, which we will discuss in the next section.
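The in-class applet does the simulation interactively; the sketch below (an illustration I am adding, not the applet itself) runs the same chance model in Python and also evaluates the rough 1/√n sampling error for the two sample sizes discussed above.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 1858          # Pew sample size
p_null = 0.10     # population proportion assumed under the null hypothesis
observed = 0.13   # observed sample proportion
reps = 10_000     # number of simulated samples

# Chance model: each of the 1,858 people independently has a 10% chance of
# admitting to a faked call, so the count of "fakers" is binomial(n, 0.10).
sim_props = rng.binomial(n, p_null, size=reps) / n

# Approximate p-value: how often does chance alone produce a sample proportion
# at least as large as the observed 0.13?
p_value = np.mean(sim_props >= observed)
print(f"estimated p-value: {p_value:.4f}")    # essentially 0

# Rough sampling error (1 / sqrt(n)) for the actual and hypothetical sample sizes.
for size in (1858, 500):
    print(size, round(1 / np.sqrt(size), 3))  # about 0.023 and 0.045
```

With 10,000 simulated samples, essentially none reach a proportion of 0.13 when the true proportion is 0.10, which matches the tiny p-value reported above.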
Let's work on Exploration 2.2: Gettysburg Address Revisited.

SECTION 2.3: NONSAMPLING ERRORS
Simple random sampling is an unbiased way to take a sample, but we saw that there can still be random sampling error, and we can estimate how large that error should be. However, other things can go wrong when you use a sample to infer something about a population. These other problems are lumped together as nonsampling errors.

EXAMPLE 2.3: THE BRADLEY EFFECT
In 1982, Tom Bradley (D) ran for governor of California against George Deukmejian (R) (pronounced "Duke-may-jon"). Polls showed that Bradley had a significant lead over Deukmejian shortly before the election, as well as in exit polls. However, Deukmejian narrowly defeated Bradley.

After the election, research suggested that a smaller percentage of white voters had voted for Bradley than the polls had predicted, and that a very large proportion of voters who had claimed in the polls to be undecided had voted for Deukmejian. Some people may answer polling questions the way they think the interviewer wants them to answer, the politically correct way. Some argue that this is what happened in this election: a number of white voters told an interviewer they would vote for Bradley but, in the anonymity of the voting booth, voted for Deukmejian.

OBAMA VS. CLINTON: NEW HAMPSHIRE 2008
The same sort of thing happened in the New Hampshire Democratic presidential primary in 2008. Polls showed Barack Obama with a significant lead over Hillary Clinton (41% to 28%, with a sample size of 778). Clinton won that election with 39% of the vote compared to Obama's 36%.

SOME MORE DETAILS
The poll used random digit dialing to get respondents. Only 9% of those whose phone numbers were chosen actually responded to the question; the others either didn't answer their phone or refused to answer.

OUR MODEL MAKES THE FOLLOWING ASSUMPTIONS
1. Random digit dialing is a reasonable way to get a sample of likely voters.
2. The 9% of individuals reached by phone who agreed to participate are like the 91% who didn't.
3. Voters who said they planned to vote in the upcoming Democratic primary did vote in the primary.
4. Respondents' answers about who they would vote for match who they actually voted for in the primary.

Assumption #1: Random digit dialing is a reasonable way to get a sample of likely voters.
Random digit dialing is roughly equivalent to a simple random sample of all New Hampshire residents who have a landline or cell phone, except that it slightly over-represents individuals who have more than one phone. Random digit dialing is a common survey technique when a sampling frame (a list of all members of the population) is unavailable.

Assumption #2: The 9% of individuals reached by phone who agreed to participate are like the 91% who didn't.
The assumption is that the respondents are like the non-respondents. Although the response rate was very low, it is in line with many other polls and surveys conducted by phone. So, although it is possible that non-respondents caused the bias observed here, many other political surveys conducted around the same time had similar response rates but no bias. Of course, there is no guarantee that the 9% are representative.

Assumption #3: Voters who said they planned to vote in the upcoming Democratic primary did vote in the primary.
It is typical to ask voters whether they plan to vote in the upcoming election or primary, but there is no guarantee that they actually will.

Assumption #4: Respondents' answers about who they would vote for match who they actually voted for in the primary.
There is no guarantee that people won't do something different in the voting booth than they said they would do on the phone. They could simply change their minds, or they could be dishonest with the polling interviewer.
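Before looking at what investigators concluded, it is worth asking whether random sampling error alone could plausibly explain the miss. Below is a rough, hypothetical back-of-the-envelope check (my illustration, not part of the actual investigation) that applies the 1/√n rule of thumb from Section 2.2 to this poll's sample size of 778.

```python
import math

n = 778                               # reported sample size of the primary poll
sampling_error = 1 / math.sqrt(n)
print(round(sampling_error, 3))       # about 0.036, roughly 3.6 percentage points

poll_gap = 0.41 - 0.28                # Obama's lead in the poll: about 13 points
vote_gap = 0.36 - 0.39                # Obama's actual margin: about -3 points
print(round(poll_gap - vote_gap, 2))  # a swing of about 0.16, far beyond 0.036
```

A swing of roughly 16 percentage points is far larger than the roughly 3.6-point sampling error we would expect from a sample of 778, which points toward violations of the assumptions above rather than bad luck in the random sample.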
The American Association for Public Opinion Research conducted an independent investigation and concluded that the following were among the most likely explanations for the discrepancies:
People changed their opinion about who they were voting for at the last minute (Assumption #4).
People in favor of Hillary Clinton were more likely to be non-respondents (Assumption #2).
Social desirability based on the race of the interviewer (Assumption #4): black telephone interviewers were more likely to generate respondents who favored Obama than were white interviewers. This is an example of the Bradley effect.
Clinton was listed before Obama on every ballot (Assumption #4).

If Assumption #4 is not valid, as seems plausible here, then even a pure random sample would still have exhibited this discrepancy in the results. Simple random samples should produce a representative sample, but they do nothing to control for the actions of the individuals in the sample. Respondents changing their minds or misrepresenting their answers are examples of nonsampling errors: reasons why the statistic may not be close to the parameter that are separate from sampling error.

DO YOU AGREE WITH THIS STATEMENT?
"Quality of life lies in knowledge, in culture. Values are what constitute true quality of life, the supreme quality of life, even above food, shelter and clothing." (Thomas Jefferson)

HOW ABOUT THIS ONE?
"Quality of life lies in knowledge, in culture. Values are what constitute true quality of life, the supreme quality of life, even above food, shelter and clothing." (Fidel Castro)

EXPLORATION 2.3
Let's work on Exploration 2.3. I won't be giving you a survey like the exploration says; we will just discuss the questions.