Chapter 1. Introduction to Statistical Inference: One Proportion

Learning Objectives:
• Begin to understand the process of statistical investigations as it relates to the scientific
  method: from posing a question to collecting data to analyzing data to drawing inferences
  beyond the data to communicating results.
• Be able to distinguish population from sample, parameter from statistic.
• Be able to state null and alternative hypotheses appropriate to a research question and
  conjecture.
• Begin to understand the reasoning process of statistical significance, using a 3S strategy:
  statistic, simulate, strength of evidence.
• Be able to conduct simulations and draw conclusions for inferences regarding strength of
  evidence about a (single) proportion.
• Begin to understand that a dataset allows for rejecting some hypothesized values of a
  parameter while retaining others as plausible values of the parameter.
• Begin to recognize that sample size plays an important role in assessing statistical
  significance and strength of evidence.
Outline:
Section 1.1: Introduction to Statistics
Topics:
• Anecdotal evidence
• Statistics from samples
• Parameters from populations
• Descriptive statistics versus inferential statistics
• Scientific Method

Section 1.2: Introduction to Statistical Reasoning: One Proportion
Example: Matching Names to Faces
Activity: A Preference Study: Friend or Foe?
Topics:
• Criminal Justice System as analogy for Statistics
• Null and Alternative Hypotheses
• Test of significance, p-value
• Simulation
• Strength of evidence
• Stating conclusions

Section 1.3: Statistical Significance for One Proportion—Other Null Hypotheses
Example: Appetizing Dog Food?
Activity: Cola Discrimination?
Topics:
• Testing null value other than 0.5

Section 1.4: Plausible Values
Example: Competitive advantage of red uniforms?
Activity: Kissing Right?
• Interval of plausible values
• Not finding evidence against a value does not mean confirming the value

Section 1.5: Effect of Sample Size
Example: Predicting election results?
Activity: Baseball “big bang”?
• Sample size matters
• All else being equal, larger samples produce stronger evidence against a null hypothesis
  than smaller samples do.

Case Study: Will skipping breakfast lead to a world without men?
Research Article: Stock Monkey
Practice Exercises
Exercises
Section 1.1: Introduction to Statistics
Have you ever heard statements like these?
• “I don’t wear my seatbelt anymore because a friend of mine was in a car accident
and wasn’t wearing a seatbelt when they crashed and she was the only survivor
of the accident.”
• “Don’t get your child vaccinated. I vaccinated my child and now he is autistic.”
• “I’m never going to become a runner, my friend’s dad just started running and he
died of a heart attack last week.”
The people making these statements each use a single case to support a decision about
how they will live their life. They are basing their conclusions on anecdotal evidence.
Key Idea: Anecdotal evidence is information obtained from only one or a handful of
particular cases. It doesn’t take into account that there are many other cases that may
have very different outcomes.
Thought Question: Do you think it is ever reasonable to draw a conclusion based on
anecdotal evidence?
We think most people would agree that drawing conclusions based on a single case is
often a bad idea. However, there are a lot of everyday situations where we use
anecdotal evidence to make decisions. For example, if you’ve been in a windowless
room all day and someone enters the room holding a wet umbrella, you might conclude
that it is raining outside. In this case, the use of a single case to make a conclusion is
reasonable and probably correct.
Of course, there are many scenarios where anecdotal evidence is insufficient for
drawing a conclusion or making a decision. Typically, in these scenarios the risk
involved with being wrong is higher than in situations where we are comfortable using
anecdotal evidence. For example, in the statements earlier, the ramifications of being
wrong about wearing your seatbelt or vaccinating your child are more severe than
getting your hair wet! In these scenarios, we would prefer to see many cases, which
each provide some evidence, in order to make a better conclusion or decision than what
we can with only a few cases.
Specifically, before we decide to stop wearing a seatbelt, we would like to know the
survival rates of individuals wearing seatbelts and individuals not wearing seatbelts in a
large number of similar accidents. Before we decide not to vaccinate our child, we
would like to know the rates of autism in a large number of vaccinated and unvaccinated
children. We would also like to know the rates of disease and death in a large number of
vaccinated and unvaccinated children. Before we put the skids on our running, and
before we even look at a large number of runners and non-runners, we might find out a
little bit more about that friend’s dad. Was he overweight? Was he a smoker? Did he
have a family history of heart attacks? Gathering information or data can help us to
make better informed decisions. This is Statistics at work.
Key Idea: Basing decisions on anecdotal evidence is often not appropriate. Instead, it is
often appropriate to draw conclusions only after we have gathered sufficiently large
amounts of data in a carefully planned manner.
What is Statistics?
Statistics is a discipline that guides us in weighing evidence about phenomena in the
world around us. More specifically, Statistics gives us a formal procedure for gathering
evidence, evaluating that evidence, suggesting conclusions based on that evidence, and
assessing our confidence in those conclusions.
A large part of the discipline of Statistics (capital “S”) is the use of descriptive statistics
(little “s”). You’ve likely come across and used statistics before; they are numbers like
averages and percentages or graphs like bar charts. These numbers and graphs are
called descriptive statistics because they are used to describe data that we have
collected. If we continue beyond the data we have collected and make broader claims,
this process is called Inferential Statistics.
We can use descriptive and inferential statistics when we are trying to learn about a
large and difficult to observe group of people, called the population, but we only have
data on a portion of that population, called the sample. For example, when we are trying
to learn about vaccine use and autism, the population of interest is children, but we
could never investigate all living children. Instead we use a sample of children and
investigate the relationship between vaccinations and autism in the sample. Descriptive
statistics (e.g., percentages of vaccinated children and unvaccinated children in our
sample who are autistic) are calculated, and then used to make conclusions or
inferences about the population of interest, all living children.
Key Idea: Descriptive statistics summarize, with graphs and numbers, what we see in
the sample. Inferential statistics involves weighing the evidence to make conclusions
about the population.
As we’ve already pointed out, numbers that we calculate from our sample are called
statistics. On the other hand, numbers that summarize information about the population
are called parameters. The statistics from our sample can help us draw conclusions
about the corresponding parameters in the population. In our previous example, we use
the descriptive statistic (the percentage of vaccinated children in our sample who are
autistic), to learn about the parameter (the percentage of all vaccinated children who are
autistic).
Key Idea: Descriptive statistics are numerical summaries of the sample, which can be
used to learn about parameters, which are numerical summaries of the population.
Sometimes instead of using statistics to summarize sample data and draw conclusions
about a population, we use statistics to see whether a researcher’s manipulations
have made an impact on a response of interest. For example, in an experiment we might
see whether taking a newly developed drug lowers blood pressure. Specifically, we
might measure the blood pressure of everyone in our sample, then give them the drug
and measure their blood pressure again. If, overall, we see a decrease in blood pressure
after taking the drug, Statistics can help us decide if we should be convinced that this
change can be attributed to taking the drug.
Key Idea: Statistical methods can be used to tell us whether researcher intervention is a
reasonable explanation for changes in a response.
In this book we will use Statistics in a variety of real-life situations to help us to weigh
evidence about research questions of interest, including the following:
• Is swimming with dolphins successful therapy for people diagnosed with severe to
  moderate depression?
• Do pre-verbal infants have preferences between helper toys and hinderer toys?
• Do females have a higher body temperature than males?
• Does vitamin C prevent the common cold?
• Can people correctly distinguish between two brands of cola?
• What is the average weight of newborn babies?
• Are there effects of sleep deprivation on learning?
Scientific Method
The scientific method is a series of techniques used to objectively guide scientific
inquiry. When we use the scientific method we start by asking questions, which we then
refine into testable hypotheses based on prior research (if available). Studies are
designed to test the hypotheses. Results from the studies are analyzed, conclusions are
drawn, and often new research hypotheses are formed. Statistics informs all parts of the
scientific method. See Figure 1.1 for an outline of the scientific method.
Specifically, knowledge of Statistics helps to refine questions into testable hypotheses,
guides many study design decisions, is used to evaluate whether the results of the study
are evidence in favor of the research hypotheses, and guides the ultimate conclusions
drawn from the research. In short, Statistics is used in many studies that use the
scientific method.
Figure 1.1: Flowchart of Scientific Method
Example: The Physicians’ Health Study, conducted by the Harvard Medical School, began
in 1982 and ended in 1995. One question this study hoped to answer was: Is taking an
aspirin every other day beneficial in the prevention of cardiovascular disease? Approximately twenty-two
thousand male physicians ages 40 to 84 were split randomly into two groups of
approximately 11,000 physicians each. One group took an aspirin every other day for
the duration of the study and the other group took a placebo (sugar pill) every other day
for the duration of the study. It was concluded that aspirin reduced the risk of a first
myocardial infarction (heart attack) by 44%.
The Physicians’ Health Study is an example of applying the scientific method.
Figure 1.2: Flowchart of the scientific method applied to the Physicians’ Health Study.
Section 1.2: Introduction to Statistical Reasoning: One
Proportion
Ei incumbit probatio qui dicit --- The burden of proof rests on one who asserts
Statistics and the Criminal Justice System
In our criminal justice system, each trial is designed to answer the same question: “Is the
defendant guilty?” The system involves a jury that evaluates the strength of the
evidence suggesting guilt of the defendant. An important initial instruction is given to the
jury, to assume that the defendant is innocent, and not to conclude that the defendant is
guilty unless the evidence is “beyond a reasonable doubt.”
Notice that there are two competing hypotheses here. The first hypothesis is that the
defendant is innocent. It is this first hypothesis that the jury is instructed to assume is
true. The second hypothesis is that the defendant is guilty. It is this hypothesis that the
prosecutor believes to be true. Indeed, if the prosecutor did not have a strong reason to
suspect the defendant’s guilt he would not have brought the defendant to trial. In the
trial, evidence is presented. The jury then examines the evidence in order to gauge
whether the evidence strongly points to the defendant’s guilt. The jury must examine
every piece of evidence assuming that the defendant is innocent; asking themselves the
question “Is it possible that the defendant is innocent and we still see this evidence? Is it
unlikely? How unlikely?”
The logic of the criminal justice system is similar to the approach we use for weighing
evidence in Statistics. In Statistics we have a research conjecture we want to evaluate;
we call this research conjecture the alternative hypothesis. To see whether sufficient
evidence exists to conclude that our research conjecture is reasonable, we start by
assuming that the research conjecture is not correct. The statement of the research
conjecture being incorrect, stated in such a way as to communicate no effect or equality,
is known as the null hypothesis. In short, we assume the null hypothesis is true, and
evaluate the data under that assumption in order to weigh the evidence supporting the
alternative hypothesis (our research question). And so, the null hypothesis is like
assuming the defendant is innocent, the data we gather is the evidence, and we are
looking for strong evidence in favor of the alternative hypothesis (research question; the
defendant is guilty).
Key Idea: The alternative hypothesis is the research conjecture we are trying to
establish. The null hypothesis is a statement contrary to the research conjecture, which
is often a statement of no effect or equality. As in our criminal justice system, we will
assume that the null hypothesis is true unless there is evidence (data), beyond a
reasonable doubt, supporting the alternative hypothesis. If we don’t find strong enough
evidence, the null hypothesis remains plausible.
This process of assuming the null hypothesis (or the “dull hypothesis”) to be true,
gathering data, then analyzing that data to see whether we have convincing evidence in
favor of the alternative hypothesis is called a test of significance.
Key Idea: Tests of significance use the data gathered to assess the strength of evidence
in favor of the alternative hypothesis over the null hypothesis.
Example: Matching Names to Faces: Bob or Tim?
A study in Psychonomic Bulletin and Review (Lea, Thomas, Lamkin, & Bell, 2007)
presented evidence that “people use facial prototypes when they encounter different
names.” Participants were given two faces and had to determine which one was Tim and
which one was Bob. The researchers wrote that their participants “overwhelmingly
agreed” on which face belonged to Tim and which face belonged to Bob, but did not
provide the exact results of the study. A recent class of statistics students (our sample)
replicated this study and 23 of the 33 students correctly identified the face that belonged
to Tim. So our statistic could be the 23/33 ≈ 0.697 proportion that made the correct
identification.
But do 23 of the 33 students making that choice convince us that there is something to
this theory? Maybe these students just got lucky? What does that mean here?
If our research conjecture is that people in general (our population) have a tendency to
associate certain facial features with a name, this gives us our alternative hypothesis.
Note: We might want to debate what population you are willing to consider this sample
representative of, but we will come back to that.
Alternative hypothesis: In the population, people have a tendency to associate
certain facial features with a name. In other words, the correct face is matched
with Tim more than half the time (i.e., the parameter, the proportion in the
population, is greater than 0.5).
The null hypothesis then becomes there is no such association. This means that people
trying to match names to faces are essentially blindly guessing which name goes with
which face.
Null hypothesis: People are not more likely to match Tim with one face over the
other. In other words, people match Tim with his face half the time (i.e., the
parameter, the proportion in the population, is equal to 0.5).
So, like in a criminal trial, we will begin our analysis by assuming there is nothing special
going on here; in other words that the null hypothesis is true and people are blindly
guessing which face is Tim’s. If that were the case, would it be possible that in a sample
of 33 students, more than half would guess the Tim face correctly just by chance?
Sure, this is possible. But what if all 33 students had matched Tim correctly? Well then,
we would be pretty convinced there was something going on. Why? Because it’s unlikely
that everyone would get the right answer if they were all blindly guessing. How unlikely?
That would be like flipping a coin 33 times and getting all heads! This would happen
less than 1 in a billion times!
So what about 23 people matching names to faces correctly just by chance? How
unlikely is that? We know that each time we toss a coin 33 times we will get a different
value for the number of heads. Is it unusual to get 23 heads? How unusual? We will
explore this by looking at what “could have happened” when we toss a coin 33 times.
Figure 1.3 shows the results from tossing a coin 33 times, sorted by heads or tails.
Figure 1.3: Results from tossing a coin 33 times
This time we got 15 heads and 18 tails, pretty close to the 50/50 split we would expect.
But when we did it again we got 20 heads as shown in Figure 1.4.
Figure 1.4: Results from tossing a coin 33 times a second time
There are a lot of different possible “number of heads” we could get when flipping a coin
33 times, and the result will potentially vary every time we flip the coin another 33 times.
So to help us assess whether 23 heads is unusual, we need to flip the coin 33 times
over and over again. Below is a graph of the number of heads we obtained in 33 tosses,
when we performed the 33 coin flips 1000 different times.
Figure 1.5: A graph showing 1000 repetitions of flipping a fair coin 33 times and
counting the number of heads.
[In the dotplot, the central dots are labeled as typical outcomes, and the few dots at either
end are labeled as the lower tail and the upper tail.]
Examining the graph in Figure 1.5 (a “dotplot”) confirms that while it is virtually impossible
to get all 33 heads (it never happened), there is clearly some chance we will see 23 heads.
Specifically, in these 1000 sets of 33 tosses, we obtained exactly 23 heads 11 times,
and we got 23 or more heads in 17 of the 1000 times. Therefore, a result like 23 is not
very likely. In fact, 23 lies out in the “tail” of the distribution. So our conclusion would be
that a result of 23 heads in 33 tosses is fairly unusual.
Thought Question: How does the above analysis help us answer our research question
as to whether we have convincing evidence that people are able to match names to
faces more than half the time?
So what does this have to do with our study? Well, if the null hypothesis is true and
people can’t match names to faces any better than with a coin toss, we can model their
behavior with a coin. Heads means they matched the name correctly, tails means they
did not. This doesn’t mean every sample is expected to have exactly half the people
match the right name; we know there will be some variability, just by chance. By
repeating this process over and over again, as we showed in Figure 1.5, we get a sense
for what the pattern of outcomes looks like if people are blindly guessing at which name
belongs to which face. This allows us to evaluate whether the results of our students
appear to be consistent with the random behavior we would see from coin tosses.
Figure 1.6: A graph showing 1000 repetitions of a group of 33 students picking
Tim’s face and counting the number of correct matches under the null hypothesis
that people are equally likely to match his face correctly as not.
[In the dotplot, the central dots are labeled as the typical outcomes if the null is true, and
the few dots at either end are labeled as the lower tail and the upper tail.]
The graph shown in Figure 1.6 is the same as the one shown in Figure 1.5, but we have
changed the context from repeatedly flipping 33 coins and counting the number of heads
to repeatedly sampling 33 people and counting the number of times each group correctly
identifies the correct face. In both cases, the simulation is built on the premise that the
probability of a “success” (heads or a correct guess) equals .5. This graph tells us that, if
people are simply guessing between the two faces, getting an outcome of 23 matches is
rather unlikely. But we DID see 23 correct matches. So we can conclude that the
subjects in the study appear to have behaved differently than they would have if they
were making their decisions by the result of a coin toss. Our data did not appear
consistent with the null hypothesis, so we conclude that we have evidence against the
null hypothesis, beyond a reasonable doubt.
The more unusual the outcome is under the assumption that the null hypothesis is true,
the stronger the evidence provided by the observed data against the null hypothesis that
generated the “could have happened” results.
Note that a key step in this process was assuming the null hypothesis was true so that
the people in the study were behaving like they each tossed a coin to make their choice.
To help us determine whether 23 was an unlikely outcome in 33 tosses, we could either
toss a coin 33 times, over and over, or we could use a computer to carry out the coin
tosses. Typically, instead of actually flipping coins many times, we use a computer to
simulate the “could have happened” data by assuming the people behaved like coins
and then generating random coin toss results.
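To make this concrete, here is one minimal sketch (in Python, which is our choice of
illustration and not part of the book's applets) of how a computer could carry out the 1000
repetitions of 33 coin tosses described above. The repetition counts match the text; the
exact tallies will vary from run to run because the tosses are random.

import random

random.seed(1)  # optional; included only so a run of this illustration can be reproduced

n_tosses = 33   # one repetition = 33 coin tosses (33 students guessing)
n_reps = 1000   # number of repetitions, as in Figure 1.5

head_counts = []
for _ in range(n_reps):
    # count heads in one set of 33 fair tosses (heads = a correct match)
    heads = sum(random.random() < 0.5 for _ in range(n_tosses))
    head_counts.append(heads)

# Each entry of head_counts corresponds to one dot in the dotplot.
# How often did we see a result as extreme as the observed 23 correct matches?
print("repetitions with 23 or more heads:", sum(h >= 23 for h in head_counts))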
Of course, people don’t behave exactly like coins, even when they are equally choosing
between two choices. But this model gives us a reasonable assessment of how unusual
our sample results are under this null hypothesis.
Key Idea: A simulation is the imitation of a real world process that is typically performed
by a computer. The use of a computer allows us to repeat the process many, many
times very quickly.
To test different hypotheses we will use a three-step strategy called the “three S”
strategy. This process serves as the foundation for a majority of research questions we
will investigate in this book.
Three S Strategy for Test of Significance
1. Statistic: Compute the statistic from the observed data.
2. Simulate: Simulate the process to produce “could have happened” data under the
assumption that the null hypothesis is true and calculate the value of the statistic in that
data. Repeat the simulation process to generate a large number of could have happened
data sets, always assuming the null hypothesis to be true. Examine the “what if the null
hypothesis was true” distribution of these statistics.
3. Strength of evidence: Consider where the observed statistic falls in the “what if the
null was true” distribution. If the statistic falls in the far tail of the distribution, then we
have strong evidence against the null hypothesis. Otherwise, if the value of the observed
sample statistic is not in the tail of the “what if” distribution, then consider the null
hypothesis to be plausible.
For example, in our study the statistic we computed was 23 out of 33 students who
correctly matched Tim to his face. We simulated what “could have happened” if the null
were true (people correctly match Tim with his face half the time and incorrectly half the
time) using a coin flip (heads = correct match). We repeated this simulation 1000
times, each time counting the number of heads out of 33 tosses (number of correct
matches out of 33 just guessing students). The observed number of correct matches by
the students in our study, 23, was far enough in the tail of the “what if the null was true”
distribution to give us some evidence against the null hypothesis.
How far out in the tail of the distribution does the observed statistic need to be so that we
consider the result “beyond a reasonable doubt?”
Thought Question: Looking at the dotplot in Figure 1.6, how many correct matches
would you need to see in a group of 33 students to convince you that their outcome was
better than what you would expect just by random chance?
One way to quantify how unusual an outcome is in the “what if the null was true”
distribution is to calculate a p-value.
Key Idea: The p-value is the probability that we would get a result as extreme as or
more extreme than the one that was actually observed, if the null hypothesis was true.
In our example, we found 23 or more correct matches occurred 11 + 5 + 1 = 17 times,
giving us an approximation to the p-value of 17/1000 = 0.017. [You can calculate this
p-value more exactly, as you will learn in Chapter XX, but 1000 repetitions of our
simulation should give us a very reasonable estimate.] So this gives us a measure for how
unusual it is to get 23 correct matches if everyone in the class was simply guessing.
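For readers who are curious, this exact p-value can also be computed directly from the
binomial probabilities of a fair coin rather than by simulation. The short Python sketch
below (using only the standard library) is offered as one illustration of that calculation,
not as the method used by the applet.

import math

n = 33          # number of students (coin tosses)
observed = 23   # observed number of correct matches (heads)

# P(23 or more heads in 33 tosses of a fair coin)
p_value = sum(math.comb(n, k) for k in range(observed, n + 1)) / 2**n
print(round(p_value, 4))  # about 0.0175, close to the simulated 0.017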
FAQ: Why do we need to include more than just the observed result when computing our p-value?
Student: So 23 out of 33 students correctly matched Tim’s name to his picture, why isn’t
the p-value the likelihood I would just get 23 heads out of 33 flips of a coin? Why do we
have to also include the likelihood of getting more than 23 heads?
Stat Prof: In flipping just 33 coins, the probability of getting 23 heads is very similar to the
probability of getting 23 or more heads, so I can see how this would be confusing. I
have an idea to help you understand this. Do you have any coins with you?
Student: I might, let me check. It looks like I have a few.
Stat Prof: Good. Do you think you can toss the coin “fairly”?
Student: I don’t see why not.
Stat Prof: Start flipping it. As you do this, keep track of the number of heads you get as
well as the total number of flips. Also think about whether or not your results indicate
whether this is an unfair tossing process.
Student: Okay, I flipped the coin 10 times and got 6 heads. Since this is close to 50-50, I
don’t think I could conclude that my coin tossing is unfair.
Stat Prof: Good. The probability of getting 6 heads out of 10 flips is about 0.20 while the
probability of getting 6 or more heads (our actual p-value) is about 0.37. Either way you
think of this, getting 6 heads out of 10 coin flips is not too unlikely. Let’s keep flipping.
Student: Okay, I’ve flipped it 50 times and …
Stat Prof: Keep flipping.
Student: I’ve flipped it now 100 times and…
Stat Prof: Keep flipping.
(1 hour later)
Student: I think my arm is going to fall off!
Stat Prof: I guess you can stop now. What are your results?
Student: I flipped the coin 1000 times and got 505 heads.
Stat Prof: Do you think your results show that your coin tossing is not fair?
Student: Of course not. Getting 505 heads out of 1000 flips is close enough to 50% that
the result is not unexpected.
Stat Prof: Do you mean you expected to get 505 heads?
Student: Well no, not exactly 505, but I did expect to get something close to 500.
Stat Prof: In fact, the probability of getting exactly 505 heads out of 1000 flips of a fair
coin is only about 0.02. Because there are so many different outcomes possible, the
probability of any one particular outcome is rather small. But we wouldn’t want to look at
the probability of 0.02 and consider this a surprising outcome; it is definitely among the
typical values. This is better conveyed by noting that the probability of getting 505 heads
or more in 1000 flips is about 0.39.
Student: And since this p-value is so high, I would not conclude my coin tossing is unfair!
Stat Prof: Exactly!
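The two probabilities quoted by the Stat Prof can be checked with the same kind of
binomial calculation. The Python sketch below is a rough verification offered for
illustration, not part of the dialog.

import math

n = 1000  # coin flips

def prob_exactly(k):
    # probability of exactly k heads in n flips of a fair coin
    return math.comb(n, k) / 2**n

print(round(prob_exactly(505), 3))                                 # about 0.02
print(round(sum(prob_exactly(k) for k in range(505, n + 1)), 3))   # about 0.39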
Key Idea: The p-value for a test of significance is what we base our conclusion on. A
small p-value means it is unlikely the result in our data would have occurred simply by
random chance, assuming the null hypothesis is true.
What if only 18 out of 33 students had correctly identified Tim?
In the class we reported on, 23 out of 33 students correctly identified Tim, but what if that
number had only been 18? In Figure 1.6 we see that 18 is a result that is typical if the
null hypothesis is true. When we compute the p-value we see that it is 0.340 and so we
would not consider this outcome unusual for a class that was simply guessing.
Because the p-value is not small, this says that if people really can’t correctly identify
Tim, it’s a fairly common occurrence to see around 18 students correctly identify Tim
anyway. Thus, these “new” data (18 out of 33 students correctly identifying Tim) would
not provide enough evidence that the population of students can identify Tim. In this
scenario, it is plausible that the population cannot correctly identify Tim and that we just
happened to see a slight preference in our sample by random chance.
Key Idea: If the p-value for our test of significance turns out to be large, then our data do
not provide evidence that the observed outcome is something other than what we would
have expected to see by chance alone assuming the null hypothesis to be true. Note, we
never conclude that the null is true or even that we have evidence in favor of the null.
We can only say that we don’t have evidence against the null hypothesis and so the null
is plausible.
Again, the legal analogy applies: in a trial, a verdict in favor of the defendant is not
“innocent” but rather “not guilty.” We haven’t proven anyone innocent; rather, the
evidence provided does not convince us, beyond a reasonable doubt, of their guilt.
Strength of Evidence
But still, how do we decide whether our p-value is a small number or not? How rare did
the outcome need to be in Figure 1.6 to convince you and your fellow classmates that
the result should not be attributed to random chance? How similar were your answers?
FAQ: What p-value should make us suspicious?
Stat Prof: Did you ever google “Persi Diaconis”?
Student Skeptic: No, why?
Stat Prof: Try it sometime. He’s unusual … even for a statistician he’s unusual. He was
one of the first people to win one of the MacArthur “genius” awards.
Student Skeptic: Is that all? What else can he do?
Stat Prof: Flip a coin and make it come up heads every single time.
SS: I don’t believe anyone can do that. Isn’t this whole dialog just a stat prof’s geeky
way to sneak in p-values?
SP: Sorry, I can’t help it. We stat profs are so transparent sometimes. That’s just the
data. But humor me: How many heads in a row would Persi have to get to make you
begin to think maybe it’s not just chance? Would heads on the first two tosses do it for
you?
SS: Of course not. That happens 25% of the time.
SP: You win a point for extra credit. What about heads on the first three tosses?
SS: You’re making me get serious about this. Three heads in a row wouldn’t happen
very often, but it’s still not unusual enough to be suspicious.
SP: What about four in a row? Five in a row?
SS: Now it’s really getting serious … and suspicious. I find it hard to believe that you
can get five heads in a row just by chance. Sure, it can happen, but five in a row is
enough to make me think maybe there’s something else going on.
It turns out that four heads in a row happens about 6% of the time and 5 heads in a row
happens about 3% of the time. This is about when most people would start to think
there is something suspicious going on. And in fact, many studies consider 5% the cutoff value for what we would consider strong evidence against the null hypothesis. Of
course, in some situations you may want even stronger evidence. (A civil trial only
requires a “preponderance of evidence” as opposed to “beyond a reasonable doubt” in a
criminal trial.) Keep in mind that the smaller the p-value, the stronger the evidence
against the null hypothesis in favor of the alternative hypothesis.
Statisticians have agreed on the following standards for how they evaluate the strength
of evidence conveyed by a p-value.
Standards for judging p-values
0.10 < p-value              not much evidence against null hypothesis
0.05 < p-value < 0.10       moderate evidence
0.01 < p-value < 0.05       strong evidence
0.001 < p-value < 0.01      very strong evidence
p-value < 0.001             extremely strong evidence
When the statistic provides strong evidence against the null hypothesis in favor of the
alternative hypothesis, we often say the result is statistically significant, meaning unlikely
to happen by random chance alone.
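These guidelines are easy to encode. The small Python function below simply restates the
table above as code; the wording of the categories is taken from the table, and the function
is only a convenience for later examples, not a formal statistical rule.

def strength_of_evidence(p_value):
    # translate a p-value into the descriptive categories from the table above
    if p_value > 0.10:
        return "not much evidence against null hypothesis"
    elif p_value > 0.05:
        return "moderate evidence"
    elif p_value > 0.01:
        return "strong evidence"
    elif p_value > 0.001:
        return "very strong evidence"
    else:
        return "extremely strong evidence"

print(strength_of_evidence(0.017))  # the Bob/Tim study: strong evidence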
Activity 1.2: A Preference Study: Friend or Foe?
As adults we know the difference between naughty and nice, but what about pre-verbal
infants? A study in the November 2007 issue of Nature looked at children less than 1
year old to see whether they recognized and had a preference for nice versus naughty
toys.
In one component of the study, 10-month-old infants were shown a “climber” character
(a piece of wood with “googly” eyes glued onto it) that could not make it up a hill in two
tries. Then they were alternately shown two scenarios for the climber’s next try, one
where the climber was pushed to the top of the hill by another character (“helper”) and
one where the climber was pushed back down the hill by another character (“hinderer”).
The infant was alternately shown these two scenarios several times. Then the child was
presented with both pieces of wood (the helper and the hinderer) and asked to pick one
to play with. The helper toy was chosen by 14 of the 16 children. We will consider these
16 infants (our sample) as representative of a larger population of 10-month-old infants
(our population).
To see a video of both the “helper” and “hinderer” scenarios as well as to see infants
choosing the “helper,” visit the link below:
http://www.yale.edu/infantlab/socialevaluation/Helper-Hinderer.html
Keep in mind that the shape and colors of the helper and hinderer toys were changed
across the trials, so focus only on the helper vs. hinderer roles rather than on these other
aspects of shape and color.
Setting up the hypotheses
Remember, we said that the research conjecture becomes the alternative hypothesis
and the null hypothesis is then a related statement of “no effect” or “nothing special
going on.” The “dull” hypothesis.
1. State the null and alternative hypotheses for this research question. (Keep in mind
   there are a couple of different ways to state them, but be sure to state them about the
   population.)
Examining the data
2. Statistic: In the study, 14 of the 16 infants chose the helper toy. Convert this to a
   proportion. Is this statistic in the direction of the alternative hypothesis? That is, does it
   provide initial support for the research conjecture?
Strength of Evidence
Although our data are supportive of the alternative hypothesis, a result as extreme as 14
could have arisen even if the infants were each choosing equally between the helper
and the hinderer. So we again need to consider how unusual such a result would be by
random chance alone.
If the null hypothesis is true, then infants will equally choose either the helper or the
hinderer toy and so each infant’s choice is really just like flipping a coin. Remember that
tests of significance are like the criminal justice system in that we will be making an initial
assumption (the null hypothesis is true). So let’s assume, for the time being, that infants
really don’t prefer the helper toy any more than the hinderer.
3. Simulate: Let’s use coin flipping to simulate “could have happened” results under this
   null hypothesis.
   a. Start by simulating a single infant randomly (with equal probability) choosing the
      helper or hinderer toy. Explain how you will use a coin to do this and what heads
      and tails represent.
   b. Now, simulate 16 infants randomly choosing the helper or hinderer toy. Explain how
      you will use a coin to do this, what heads and tails represent, and how you will
      obtain the statistic from your simulation.
4. Pool your results with your classmates to create the “what if the null was true”
   distribution for the number of infants choosing the helper toy. Based on these results,
   does 14 out of 16 infants choosing the helper toy when the null hypothesis is true and
   they have no genuine preference between the toys (or 14 heads in 16 coin tosses)
   appear to be an unlikely occurrence?
It’s hard to get a really secure feeling about your conclusion having replicated the study
only a few times. It would also be quite time consuming if we had to do many more of
these repetitions flipping a coin. We will use the power of the computer to repeat this
simulation of 16 children’s toy choices under the null hypothesis. The more times we
repeat the simulation, the more accurately we will be able to approximate the p-value for
this study.
5. Let’s use the “Coin Tossing” web applet to generate more repetitions.
   a. First, use the applet to simulate 16 coin flips. Report the statistic you obtained and
      explain how the simulation relates to the infants’ choice of helper or hinderer.
   b. Now change the Number of Repetitions to 20 and press the “16 Tosses” button to
      get a sense for how this statistic varies from trial to trial.
   c. Next change the number of repetitions to 979 and press the “16 Tosses” button to
      create a total of 1000 replications.
   d. Use the applet to count how many of the dots (sets of 16 tosses) had 14 or more
      heads by entering 14 in the “As extreme as” box and pressing Count. The proportion
      of repetitions reported by the applet is your approximation of the p-value: how often,
      when the null hypothesis is true, we get a statistic that is at least as extreme as what
      the researchers found in the actual study.
6. Press the Exact p-value button and the applet will calculate the p-value exactly using
   probability rules (think of repeating the above simulation infinitely many times). How
   does this value compare to your estimate from the 1000 repetitions?
7. Interpret this p-value in the context of this study. Explain what it measures.
8. Would you consider this p-value strong evidence against the null hypothesis?
Explain.
Let’s recap what you have done. We suspect you found that the proportion of repetitions
with 14 or more heads was very small. The exact p-value equals 0.002. This means that
if infants in general are just randomly choosing toys and you did this same study with 16
impartial infants 1000 times, only about 2 out of 1000 times would you get 14 or more
(87.5% or more) of the infants choosing the helper toy, as observed in the
actual study. This provides very strong evidence against the null hypothesis that infants
in the population do not have a genuine preference between the two toys.
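If you would like to double-check the applet, the following Python sketch both simulates
1000 repetitions of 16 “coin-flip” infants and computes the exact p-value from the binomial
probabilities. It is an independent illustration, not the code behind the applet.

import math
import random

random.seed(2)  # only so this illustration is reproducible

n_infants = 16
n_reps = 1000

# simulated p-value: how often do 16 "guessing" infants give 14 or more helper choices?
extreme = 0
for _ in range(n_reps):
    helpers = sum(random.random() < 0.5 for _ in range(n_infants))
    if helpers >= 14:
        extreme += 1
print("simulated p-value:", extreme / n_reps)

# exact p-value: P(14 or more heads in 16 tosses of a fair coin)
exact = sum(math.comb(n_infants, k) for k in range(14, n_infants + 1)) / 2**n_infants
print("exact p-value:", round(exact, 4))  # about 0.002, as reported above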
Warning: Keep in mind that we might be making an error here. It’s possible that this
study could be one of those 2 in 1000 studies that would produce such an extreme
sample even if the null hypothesis was true. However, the chance of that is sufficiently
small that we feel more comfortable concluding something else is going on.
Now, let’s step back a bit further and think about the implications of the fact that these 16
infants seem to be demonstrating a preference for the helper toy. Do you think this
means that all ten-month-old infants can tell the difference between naughty and nice?
An important question here is how representative these 16 infants are of all infants.
We’ll discuss this more in Chapter 2. We would also probably need to do some more
sophisticated testing to make the leap from toy preference to knowledge of difference
between naughty and nice.
9. Use the applet to investigate how many infants would need to choose the helper toy
to convince you that their result is more extreme than you would reasonably expect
from chance alone.
Section 1.3: Statistical Significance for One Proportion—Other
Null Hypotheses
In the previous section, we discussed testing whether or not a null hypothesis of
“choosing equally between two choices” was believable based on some observed
sample data. But there are situations where we may wish to consider proportions other
than .5.
Example: Appetizing Dog Food?
Does canned dog food make for a suitable and inexpensive alternative to blended meat
products such as spam or liverwurst? As unappetizing as this might sound, researchers
investigated this question in a study that was described in the February 20, 2009 issue
of Science magazine. The researchers presented 18 subjects with 5 unlabeled blended
meat products. One of these products was Newman’s Own dog food, prepared with a
food processor to have the texture and appearance of a liver mousse. After tasting all 5
products, subjects were asked to identify which was their least favorite on the basis of
taste. The research question was whether subjects are more likely to identify dog food
as their least favorite than if they were making selections at random.
Statistic: It turned out that 13 of the 18 subjects (≈72%) identified the dog food as their
least favorite.
This study has much in common with the infants’ toy study: We have 18 subjects instead
of 16 infants, and researchers keep track of which selection each subject makes. Also,
the research question is whether a certain selection is made so often that random
chance can be eliminated as a plausible explanation for that tendency. We’ll analyze
these studies in the same way: We’ll simulate the subjects’ selections assuming a null
hypothesis that selections are made by random chance alone. We’ll repeat this
simulation a large number of times, and we’ll see how often we obtain a result as extreme
as (or more extreme than) the actual result from the subjects in the real study. But there’s one
important difference between these studies, which affects how we’ll conduct the
simulation.
Thought Questions
• What is the important difference? Hint: Would it make sense to use a fair coin to
simulate the subjects’ selection process?
• If they are picking among the five foods completely at random, roughly how many of
the 18 subjects would you expect to pick dog food as their least favorite?
Hopefully it is obvious that it would not make sense to use a fair coin here, because
there are five options rather than two available to each subject (and then we will record
whether or not they pick dog food as the least favorite). In order to simulate subjects’
choices of their least favorite product as random selections, we need to give dog food a
1 in 5 chance, rather than a 1 in 2 chance, of being selected. But this is the only change
we need to make to our analysis strategy; the rest will proceed exactly as with the
infants’ toy study.
So our hypotheses will be:
Null hypothesis: Subjects in the population pick equally among the 5 foods (the
probability dog food is picked as the least favorite equals 0.20)
Alternative hypothesis: Subjects in the population are more likely to choose the
dog food as their least favorite. (The probability dog food is picked as the least
favorite is larger than 0.20.)
Simulate: To simulate a process with a 0.2 probability, we could use a computer or a
calculator. The results of 1000 sets of such random selections are shown in Figure 1.7.
Figure 1.7: The “what if the null was true” distribution of the number of 18
subjects randomly choosing dog food as their least favorite
As expected under the null hypothesis, the center of this distribution is close to (1/5)×18
= 3.6. Most of the trials in this simulation saw 2, 3, 4 or 5 people picking the dog food as
their least favorite.
Strength of Evidence: In particular, in our simulation under the null hypothesis we never
saw a result as extreme as 13 of the 18 subjects picking the dog food when subjects are
choosing equally among the five foods. (The exact p-value can be shown to equal
0.0000025, or less than 3 in a million or less often than tossing 18 heads in a row.)
Thought Question
What conclusion would you draw from this very small p-value?
This small p-value suggests that the actual result of the study would be extremely
unlikely to occur if in fact the selections were made purely at random (picking equally
among the five food choices). Therefore, this study provides extremely strong evidence
that dog food really is chosen as least favorite more than 20% of the time.
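As a rough check on these numbers, the Python sketch below simulates 1000 repetitions of
18 subjects each picking a least favorite completely at random among five products (so the
dog food is picked with probability 0.2), and also computes the exact binomial probability
of 13 or more such picks. It is offered only as an illustration of the calculation.

import math
import random

random.seed(3)  # only so this illustration is reproducible

n_subjects = 18
p_dog = 0.2      # chance a purely random pick lands on the dog food
n_reps = 1000

# simulated "what if the null were true" distribution
extreme = 0
for _ in range(n_reps):
    picks = sum(random.random() < p_dog for _ in range(n_subjects))
    if picks >= 13:
        extreme += 1
print("repetitions with 13 or more dog-food picks:", extreme, "out of", n_reps)

# exact p-value: P(13 or more of 18 picks when each pick has probability 0.2)
exact = sum(math.comb(n_subjects, k) * p_dog**k * (1 - p_dog)**(n_subjects - k)
            for k in range(13, n_subjects + 1))
print("exact p-value:", exact)  # about 0.0000025, less than 3 in a million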
Follow-up:
The 18 subjects in this study were also asked to guess which of the 5 products was dog
food. Interestingly, only 3 of the 18 subjects correctly identified which product was dog
food.
Thought Question
Explain why it’s not even necessary to conduct a simulation analysis to investigate
whether this provides strong evidence that people can correctly identify which product is
the dog food more often than random chance would suggest.
In our 3S strategy, the first S (statistic) involves looking at the sample data. One thing to
consider is whether the data are in the direction conjectured by the alternative
hypothesis. In this case, 3 out of 18 or about 16.7% of the subjects correctly identified
the dog food. Since this is less than 20%, such data will certainly not convince us that
people are more likely to correctly identify the dog food than if they were simply
guessing.
Activity 1.3: Discriminating Colas
Do you think that, overall, the cola drinkers in your class can tell the difference between
two similar colas? Let’s find an answer to this question by running a taste test. Students
who are cola drinkers will be given three plain cups. Two contain the same brand of cola
and one contains a different brand of cola. They will then taste from all three cups and
attempt to correctly determine which one is different from the others. Students who
aren’t cola drinkers can set up and run the taste tests. Record the results to see how
many of the students involved in the taste test correctly identify which cup of cola is
different from the other two. We want to know if the proportion of students who can
correctly identify the different cola is greater than what we would expect if they were just
guessing (making their decision by random chance). This is similar to the dog food
example since we have more than two outcomes from which to choose, so we won’t be
able to use a fair coin to simulate each student’s choice. (But we will still classify each
response as “correct” or “not correct.”)
1. If we assume that, overall, the class can’t tell the difference between the two colas,
or each individual is just randomly guessing as to which of the three cups is different,
what is the chance an individual will correctly choose the cup that is different? How
many correct identifications do you think you would get on average in your sample if
students are really just guessing?
2. What are your null and alternative hypotheses?
3. Statistic: Let’s examine the data we collected from our taste test.
a) How many students correctly identified which of the three cups was different?
b) What proportion of students correctly identified which of the three cups was
different?
c) Do you suspect that the evidence is strong enough to convince someone that the
alternative hypothesis is true? Why or why not? What other information would
you like to have in order to decide how convincing the evidence is?
Simulate: Let’s use an applet to simulate 1000 repetitions of the class taste tests, where
we assume that students are just guessing which cup is different (probability of 0.3333
that a student correctly identifies the different cola).
Go to the “One Proportion Inference” applet which lets you specify the conjectured
population proportion that can correctly distinguish the different cup (as specified by the
null hypothesis) as well as the number of observations in each trial (e.g., the number of
subjects in the study).
4. What value are you entering for the null value? What value are you entering for the
number of tosses?
5. Press the “Randomize” button to generate one version of the results that “could have
happened” if the null hypothesis was true. One dot is added to the dotplot. Clearly
explain what this dot represents.
6. Now change the Number of Repetitions box to 999 in order to generate a total of
1000 repetitions. Describe the resulting “what if the null is true” distribution.
7. How many students correctly identified the different cola in the class results? Enter
this value in the “As extreme as” box and press Count. What proportion of times out
of 1000 did you simulate a number as large or larger than the actual result of the
study?
8. What is your proportion from question 7 called?
9. Strength of evidence: What do you conclude about the strength of evidence that
the cola tasters are able to tell the difference between the two colas better than
random guessing based on our results? Why?
10. Is there a larger population to which you can infer your results?
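If the applet is unavailable, the whole simulation for this activity can also be carried out
with a short script. The Python sketch below wraps the “3S” simulation in one reusable
function; the class size of 25 tasters and the 12 correct identifications in the example call
are made-up numbers standing in for your own class results.

import random

def simulated_p_value(n_tasters, p_null, observed, n_reps=1000, seed=None):
    # approximate p-value: chance of 'observed' or more correct identifications
    # when each taster independently guesses correctly with probability p_null
    rng = random.Random(seed)
    extreme = 0
    for _ in range(n_reps):
        correct = sum(rng.random() < p_null for _ in range(n_tasters))
        if correct >= observed:
            extreme += 1
    return extreme / n_reps

# hypothetical class results: 25 tasters, 12 correct picks (replace with your own data)
print(simulated_p_value(n_tasters=25, p_null=1/3, observed=12, seed=4))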
Section 1.4: Plausible Values
In this section, we continue to apply the statistical process to explore a population
proportion in more detail. Often our goal is less about examining one particular
conjectured value for the parameter and more about identifying a range of plausible
values for the parameter.
Example: Competitive advantage of red uniforms?
Do athletes who wear red uniforms have an advantage over their competitors? To
investigate this question, Hill and Barton (Nature, 2005) examined the records in the
2004 Olympic Games for four combat sports: boxing, tae kwon do, Greco-Roman
wrestling, and freestyle wrestling. Competitors in these sports were randomly assigned
to wear either a red or a blue uniform. The competitor wearing red defeated the
competitor wearing blue in 248 matches, and the competitor wearing blue emerged as
the winner in 209 matches.
1. Statistic: The proportion of matches won by the competitor wearing red is 248 / (248
+ 209) = 248 / 457 ≈ .543.
The question is: Do these data provide strong evidence that competitors wearing red
really do win more than half the time? To investigate this question we ask how unlikely it
would be to obtain 248 or more winners wearing red in 457 matches, if the winner is
really equally likely to be either competitor. In other words, we ask how often a sample
proportion of .543 or higher would result if in fact the population proportion were .5. The
null hypothesis is that 50% of matches are won by the competitor wearing red, and the
alternative hypothesis is that more than 50% of matches are won by the competitor
wearing red.
2. Simulate: We have seen in previous sections that we can use simulation to evaluate
the strength of evidence in favor of the alternative hypothesis. First, we will model the
outcome of the game with a coin flip and we simulate some data that “could have
happened” had the null hypothesis been true. To do this, we flip a coin 457 times,
representing the matches in these sports, and we count how many heads appear,
representing wins by the competitor wearing red. We then repeat this process many,
many times generating the “what if the null hypothesis was true” distribution. We used a
computer applet to simulate 1000 repetitions of flipping a coin 457 times. The results
are shown in Figure 1.8.
Figure 1.8: A “what if the null is true” distribution showing the number of heads in
1000 repetitions of a coin being flipped 457 times. Note that 248 is in the upper
tail and thus an unlikely event if the red competitor wins 50% of the time.
3. Strength of Evidence: We see that the outcome 248, which is the sample number of
matches won by the competitor wearing red, is in the upper tail of the distribution. In
fact, only 35 of the 1000 simulated samples produced 248 or more heads, so the
simulated p-value is .035. This is small enough to conclude that the sample data
provide fairly strong evidence that the competitor wearing red really does have a better
than 50% chance of winning the match.
Now what? We have reason to believe that the competitor wearing red wins more than
half the time, so a natural question to ask next is: How much more than half? In other
words, we have concluded that .5 is not a plausible value for the underlying probability
that the competitor wearing red wins a match, but what do we think are plausible values
for that underlying probability? To investigate this, we can perform a similar test using a
different potential value for this probability in the null hypothesis.
Let’s start by testing the null hypothesis that the competitor wearing red has a .55
probability of winning a match. We’ll simulate 1000 sets of 457 coin tosses with a .55
probability of landing heads. The results are shown in Figure 1.9.
Figure 1.9: When the probability that a competitor wearing red wins a match is .55,
we find that 248 out of 457 matches won by the red competitor is not out in the
tail, but lies fairly close to the center of this “what if the null is true” distribution.
In this case we see that the observed data value (248) is not in the tail of the distribution.
In fact, 39.5% of the above repetitions had a smaller number of successes (red
competitor wins) than 248. So we don’t have evidence against the null hypothesis that
the probability of winning while wearing red is .55.
Notice that although any number between 0 and 1 is a possible value for the underlying
probability that the competitor wearing red wins the match, our analyses above have
revealed that 0.5 does not seem to be a plausible (or believable) value in light of the
sample data, whereas 0.55 does appear to be plausible based on the sample data. This is
not surprising, considering that the observed sample proportion of matches won by the
red-wearing competitor is .543, much closer to .55 than to .5. But we still don’t know the
true probability of winning while wearing red. Using the scientific method, let’s rehypothesize. Maybe we could close in on the true probability by finding an upper bound.
Let’s next test the null hypothesis that the competitor wearing red has a 53% chance of
winning a match. The simulation results shown in Figure 1.10 reveal that an outcome of
248 is again not a surprising outcome in this “what if the null was true” distribution.
Figure 1.10: When the probability that a red-wearing competitor wins the match is
0.53, we find that 248 out of 457 matches won by the person wearing red is among the
typical values of our “what if the null is true” distribution.
So both .53 and .55 are considered plausible values for the probability of the red
competitor winning.
Key Idea: Keep in mind that not gathering enough evidence to support the alternative
hypothesis is not the same thing as concluding the null hypothesis is true. There
are almost always other null hypotheses that would also be plausible.
Let’s next test a 60% chance of winning the match. The simulation result shown in
Figure 1.11 reveals that an outcome of 248 wins (or less, looking in the more extreme
tail) by the red competitor would be very surprising if this null hypothesis were true. The
simulated p-value is .006, so we have very strong evidence that the actual probability of
winning while wearing red is less than .6.
Figure 1.11: When the probability that a red-wearing competitor wins the match is
0.6, we find that 248 out of 457 matches won by the person wearing red is out in
the lower tail of our “what if the null is true” distribution.
We can continue to use this analysis strategy to investigate many different null
hypotheses, each using a different value for the probability that the red-wearing
competitor wins the match. We’ve already tested whether .5, .53, .55, or .6 are plausible
values for this probability, and we’ve concluded so far that the probability is above .5 but
less than .6. We also found that .53 and .55 are plausible, because the observed data do
not fall in either tail of their "what if the null is true" distributions. There must be other values between
.5 and .6 that would not be rejected and thus can also be considered plausible values for
the true probability. Let’s go ahead and test values .51, .52, .53, and so on.
We used the applet to test all of these null hypotheses, with 1000 repetitions for each
analysis. Table 1.1 shows the strength of evidence for each of these hypothesized
values.
Table 1.1: p-values when testing various null hypothesis probabilities that the
competitor wearing red wins the match, based on the sample result of 248 wins in
457 matches
Null value    Simulated p-value    Strength of evidence against null value
0.5           0.04                 Strong
0.51          0.073                Moderate
0.52          0.17                 Not much
0.53          0.319                Not much
0.54          0.469                Not much
0.55          0.375                Not much
0.56          0.253                Not much
0.57          0.109                Not much
0.58          0.044                Strong
0.59          0.021                Strong
0.6           0.008                Very strong
Note: In this table we have always computed the p-value by looking in the tail of the
"what if the null hypothesis is true" distribution. This means that when the null
hypothesis value is less than the observed sample proportion (.543) we look in the upper
tail, and when the null hypothesis value is greater than .543 we look in the lower tail. In Chapter 2 we
introduce the idea of computing our p-values using a “two-sided” testing approach where
we measure strength of evidence based on both tails of the distribution.
This table illustrates that there are many values of the underlying probability that remain
plausible. Specifically, for values from .52 to .57, the observed data do not provide much
evidence against the null hypothesis, so we can say that it’s plausible that the actual
probability that the red competitor wins a match is between .52 and .57. This suggests a
very slight advantage for the competitor wearing red. Notice that these endpoints (.52
and .57) are almost equally far from the observed sample proportion (.543).
We could refine this analysis further by considering more possible values. For instance,
we could consider values from .510 to .530 and from .560 to .580 in order to “zero in”
more precisely on reasonable endpoints. We could also modify this procedure by
deciding that we want to use a more or less stringent rule for evidence against the null
hypothesis, for example by considering a value to be plausible unless the simulation
produces very strong evidence against it.
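If you would rather automate these repeated tests than run the applet once for each null value, the following rough Python sketch (ours, not the book's applet) loops over candidate values from .50 to .60 and keeps those whose simulated p-value exceeds a cutoff. The 0.05 cutoff is an assumption made for illustration; a different cutoff, or simply different simulation runs, can shift an endpoint of the resulting interval by a step.

    # Rough sketch: repeated simulation tests for the red-uniform data
    # (248 wins in 457 matches), one test per candidate null value.
    import random

    random.seed(2)
    n, observed = 457, 248
    observed_prop = observed / n
    repetitions = 1000

    def simulated_p_value(null_prob):
        """One-sided p-value, looking in the tail that contains the observed count."""
        tail_count = 0
        for _ in range(repetitions):
            wins = sum(1 for _ in range(n) if random.random() < null_prob)
            if observed_prop < null_prob:
                tail_count += (wins <= observed)   # observed sits in the lower tail
            else:
                tail_count += (wins >= observed)   # observed sits in the upper tail
        return tail_count / repetitions

    candidates = [round(0.50 + 0.01 * i, 2) for i in range(11)]   # 0.50, 0.51, ..., 0.60
    plausible = [p for p in candidates if simulated_p_value(p) > 0.05]
    print("plausible null values:", plausible)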
Key Idea: We consider a value plausible for the population parameter if a simulation
analysis based on that value does not put the observed data in the tail of the “what if null
were true” distribution. This analysis produces an interval of plausible values for a
population parameter based on the observed sample data. This interval of all such
plausible values is called a confidence interval.
We will further explore confidence intervals in Chapters 5, 6, 7, and 8.
Activity 1.4: Kissing Right?
Most people are right-handed, and for most people the right eye is dominant as well.
Molecular biologists have suggested that late-stage human embryos tend to turn their
heads to the right. In a study reported in Nature (2003), German bio-psychologist Onur
Güntürkün conjectured that this tendency to turn to the right manifests itself in other
ways as well, so he studied kissing couples to see if they tended to lean their heads to
the right while kissing. He and his researchers observed couples in public places such
as airports, train stations, beaches, and parks. They were careful not to include couples
who were holding objects such as luggage that might have affected which direction they
turned. For each couple observed, the researchers noted whether the couple leaned
their heads to the right or to the left. They observed 124 kissing couples, finding that 80
leaned to the right.
The question is: Do these data provide strong evidence that kissing couples really do
lean to the right a majority of the time? To investigate this question we ask how unlikely
it would be to obtain 80 or more couples leaning to the right in a random sample of 124
couples, if the couples are equally likely to lean right or left.
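As a brief illustration of what the applet is doing behind the scenes, here is a sketch of this simulation in Python. It is only an illustration; the activity questions below ask you to describe and run the analysis yourself.

    # Sketch of the kissing-couples simulation: 80 of 124 couples leaning
    # right, under the null hypothesis that right and left are equally likely.
    import random

    random.seed(3)
    n_couples, observed_right = 124, 80
    repetitions = 1000

    counts = [sum(1 for _ in range(n_couples) if random.random() < 0.5)
              for _ in range(repetitions)]
    p_value = sum(1 for c in counts if c >= observed_right) / repetitions
    print("simulated p-value:", p_value)   # typically very small in runs like this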
1. Statistic: Calculate the proportion of the observed couples who leaned to the right.
2. State the null and alternative hypotheses for this study.
This time we don’t want to ask you to take the time to actually do a lot of coin flipping,
but we do want you to think about how you would conduct this simulation analysis if you
had to.
3. Simulate: Describe how you would use a coin to conduct a simulation analysis to
determine whether these data provide strong evidence that in general kissing
couples really do tend (more often than not) to lean to the right. Provide sufficient
details that someone else could implement the analysis based solely on your
description. You should include statements about how you would make a conclusion
based on the simulation results.
4. Use technology (one proportion inference applet) to simulate 1000 repetitions of 124
couples, assuming the null hypothesis that couples are equally likely to lean right or
left is true. Where is the distribution centered? Also report your simulated p-value.
Would you consider the result of 80 or more out of 124 unlikely assuming the null
hypothesis is true?
5. Strength of evidence: Based on your simulation results, would you conclude that
the researchers’ data (80 of 124 couples leaning to the right) provide strong
evidence that couples in general really do tend to lean to the right more often
than to the left? Explain the reasoning behind your answer.
Now suppose the researchers believed that couples are more likely to lean to the right
when kissing. In fact, they believe this happens about 2/3 of the time, as that is similar to
other right-sided tendencies that have been observed (right-handedness, right-eye
dominance, embryos turning to the right in the womb, and so on).
6. Carry out a test of significance to decide whether you have convincing evidence that
couples turn to the right less than 2/3 of the time. Make sure you state the null and
alternative hypotheses, explain how you generate the simulation, and report the
empirical p-value.
7. What do you conclude about whether 2/3 (roughly .667) is a plausible value for the
probability a couple leans to the right?
8. Carry out a test of significance to decide whether you have convincing evidence that
couples lean to the right less than 75% of the time. Make sure you state the null and
alternative hypotheses, explain how you generate the simulation, and report your
simulated p-value.
9. What do you conclude about whether .75 is a plausible value for the probability a
couple leans to the right?
Confidence Interval: We can continue to use our analysis strategy to investigate many
different null hypotheses, each stipulating a different value for the probability that a
couple leans to the right. We’ve already tested whether 0.5, 0.67, or 0.75 is a plausible
value. Those analyses rejected the null hypothesis in two of the three cases, and we
concluded that the proportion of all couples that lean to the right when kissing is more
than 0.5 and less than 0.75. So there must be values between 0.50 and 0.75 that would not be
rejected and thus be plausible values for the true proportion of couples that lean to the
right when kissing. Let’s go ahead and test values such as 0.51, 0.52, 0.53, and so on.
Granted, this will get tedious, but with technology it’s not too cumbersome.
10. Use the applet to test all of these null hypotheses, starting with values 0.51, 0.52,
0.53, and continuing up to and including 0.74 for the probability that a kissing couple
leans to the right. Use 1000 repetitions for each analysis. Report the hypothesized
values which do not provide strong evidence (p-value > .05) against the null
hypothesis. [Hint: For now, always put the observed result in the “tail” of the
distribution, so you may need to switch the count inequality in the applet.]
11. Where does the observed sample proportion (.645) lie compared to this interval of
plausible values?
12. Why does it make sense that you obtained a large p-value for your test of the null
hypothesis that 65% of couples lean right? Does this mean that you’ve proven that
65% of kissing couples lean right? Why or why not?
Section 1.5: Effect of Sample Size
In this section we explore the important role of sample size in considering statistical
significance.
Example: Predicting Elections from Faces?
Do voters make judgments about political candidates based on how their faces look?
Can you correctly predict the outcome of an election, more often than not, simply by
choosing the candidate whose face is judged to look more competent? Researchers
investigated this question in a study published in Science (Todorov, Mandisodza, Goren,
and Hall, 2005). Participants were shown pictures of two candidates and asked which
one looked more competent. Researchers then predicted the winner to be the one
whose face was judged to look more competent by most of the participants.
For the 32 U.S. Senate races in 2004, this method predicted the winner correctly in 23 of
them. This “competent face” method therefore succeeded in 23/32 ≈ 0.719, or 71.9% of
the 32 races. Is this significantly higher than 50%? We can use the 3S method to
investigate: Assume that the “competent face” method is no better than flipping a coin,
and simulate the set of 32 races, over and over, for a total of 1,000 times. How often do
you get 23 or more correct predictions? Our results are shown in Figure 1.12.
Figure 1.12: A distribution of the number of correct predictions in 32 races if the
“competent face” method works half the time. In our sample of 32 races we found
the “competent face” method to work 23 times.
Only 9 of these 1,000 simulated sets of 32 races show 23 or more correct predictions, so
the simulated p-value is 0.009. This p-value is small enough to provide strong evidence
against the null hypothesis, in favor of concluding that the “competent face” method
makes the correct prediction more than half the time.
These researchers also predicted the outcomes of 279 races for the U.S. House of
Representatives in 2004. [While all 435 House seats were up for reelection in 2004, the
researchers were only able to obtain pictures for the winner and runner-up in 279 of
these races.] The “competent face” method correctly predicted the winner in 189 of
those races, which is a proportion of 189/279 ≈ 0.677, or 67.7% of the 279 House races.
Notice that this percentage is similar but a bit smaller than the 71.9% of correct
predictions in 32 Senate races.
Thought Question: Do you expect the strength of evidence for the “competent face”
method to be stronger, weaker, or essentially the same for the House results as
compared to the Senate results?
Let’s investigate this key question with a simulation of 1,000 repetitions of 279 races,
assuming that the “competent face” method does no better than a coin flip. (See Figure
1.13.)
Figure 1.13: A distribution of the number of correct predictions in 279 races if the
“competent face” method worked half the time. In our sample of 279 races we
found the “competent face” method to work 189 times.
Notice that none of these 1,000 repetitions produced 189 or more correct predictions. In
fact, none came close to that many correct predictions. Let’s try 10,000 repetitions.
Figure 1.14: A larger distribution of the number of correct predictions in 279 races
if the “competent face” method worked half the time. In our sample of 279 races
we found the “competent face” method to work 189 times.
Notice that this “fills in” and “smooths out” the distribution a bit, but we still did not obtain
a single simulated study with as many as 189 correct predictions. So the simulated p-value
from these data is extremely close to zero (less than .0001). These data provide
overwhelmingly strong evidence that the “competent face” method makes a correct
prediction more than half the time.
This analysis reveals that sample size plays a very important role in assessing statistical
significance. Even with a slightly smaller sample proportion of correct predictions, the
data from the 279 House races provide much stronger evidence that the “competent
face” method works better than a coin flip, as compared to the data from the 32 Senate
races.
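To make the comparison concrete, here is a sketch (ours) that estimates both p-values with 10,000 repetitions, as in Figure 1.14; the only inputs taken from the studies are the counts 23 of 32 and 189 of 279.

    # Similar statistic, two sample sizes: Senate (23 of 32) versus
    # House (189 of 279), each simulated under a coin-flip null hypothesis.
    import random

    random.seed(4)
    repetitions = 10_000

    def simulated_p_value(n_races, observed_correct):
        counts = [sum(1 for _ in range(n_races) if random.random() < 0.5)
                  for _ in range(repetitions)]
        return sum(1 for c in counts if c >= observed_correct) / repetitions

    print("Senate (23 of 32): ", simulated_p_value(32, 23))    # around .01
    print("House (189 of 279):", simulated_p_value(279, 189))  # essentially zero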
Key idea: Whenever you need to decide whether a sample proportion is strong
evidence against the null hypothesis, be sure to take into account the sample size of the
study. If two studies have the same value of the statistic, a larger sample provides
stronger evidence against a null hypothesis than a smaller sample does.
Also keep in mind that increasing the number of repetitions in the simulation gives us a
more accurate estimate of the p-value; unlike increasing the sample size, it does not
systematically make the p-value smaller.
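The following small sketch (ours) illustrates that point with the Senate data: raising the number of repetitions only makes the estimate of the p-value more precise; it does not drive the p-value toward zero the way a larger sample does.

    # More repetitions sharpen the estimate of the p-value; they do not
    # shrink it.  (Contrast this with the effect of a larger sample size.)
    import random

    random.seed(5)
    n_races, observed_correct = 32, 23   # the Senate data again

    def estimate_p_value(repetitions):
        tail = sum(1 for _ in range(repetitions)
                   if sum(1 for _ in range(n_races) if random.random() < 0.5)
                      >= observed_correct)
        return tail / repetitions

    print("1,000 repetitions:  ", estimate_p_value(1_000))
    print("100,000 repetitions:", estimate_p_value(100_000))
    # Both estimates hover around the same small value; the second is simply
    # a more precise estimate of it.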
Frequently Asked Question: More extreme versus bigger sample?
House Member: “My big House has a lot more data than your little Senate. So
I’m betting the House evidence is stronger, which means the House p-value will
be closer to 0 than your Senate p-value.”
Senator: “Big House indeed. Your percentage is only 67.7%. Mine is 71.9%,
which is more extreme – farther from 50/50 – so my evidence is stronger.”
House Member: “Hold on! Your sample is so small that there’s a lot of shakiness in your
percentage. It could easily have been lower. My percentage is much more solid,
because it’s based on a lot more data.”
The debate between the House Member and the Senator illustrates that the strength of
evidence, as captured by the p-value, is really driven by two components. The first
component is the distance of the sample proportion from the null hypothesis proportion,
and the other is the sample size. Both components contribute to how strong the evidence is
against the null hypothesis, and the p-value takes both into account. In later chapters,
we will see how these components are directly associated with the statistical power, a
measurement of how likely the study is to find evidence in favor of the alternative
hypothesis.
Activity 1.5: Baseball Big Bang
A reader wrote in to the “Ask Marilyn” column in Parade magazine to say that his
grandfather told him that in 3/4 of all baseball games, the winning team scores more
runs in one inning than the losing team scores in the entire game. (This phenomenon is
known as a “big bang.”) Marilyn responded that this probability seemed to be too high to
be believable.
1. Identify the relevant parameter in this study.
2. State the grandfather’s claim and Marilyn’s response, in terms of this parameter.
Also identify which is the null hypothesis and which is the alternative.
3. Suppose that you take a sample of Major League Baseball (MLB) games and find
that half of them contain a big bang. What more do you need to know before you
can test whether this provides strong evidence against the grandfather’s claim?
4. Suppose that Jose and Maria both take a sample of MLB games and keep track of
which contain a big bang and which do not. Suppose that Jose finds that half of his
sample of 10 games contains a big bang, and Maria finds that half of her sample of
50 games has a big bang. Which person do you expect to have the smaller p-value,
and therefore stronger evidence against the grandfather’s claim and in favor of
Marilyn’s alternative?
5. Use the One Proportion Inference applet to conduct a simulation analysis to simulate
the p-value for both Jose and Maria’s sample data. You will need to toggle the
greater than sign to less than for this test. Report the simulated p-value, and
summarize the strength of evidence, in each case. Also comment on whether your
analysis supports or refutes your answer to question 4.
6. Collect your own data on a sample of baseball games. Use a sample size of at least
20, but preferably much larger, and try to make your sample as representative of the
population as possible. Feel free to use Major League or minor league or college or
high school or Little League games. For each game take note of whether the game
contains a big bang or not. Describe how you collect your sample data, and report
your sample proportion of games with a big bang.
7. Continuing from question 6, conduct a simulation analysis to investigate how much
evidence your sample provides against the grandfather’s claim in favor of Marilyn’s
alternative. Submit a graph of the “what if the null is true” distribution, and report the
simulated p-value. Write a paragraph summarizing your conclusions and explaining
the reasoning process by which they follow from your analysis.
Chapter Summary
Basing decisions on data is better than relying on anecdotal information. The study of
statistics provides us with a systematic procedure for gathering data (evidence),
evaluating that evidence, suggesting conclusions based on that evidence, and assessing
our confidence in those conclusions.
Descriptive statistics involves summarizing information from data to detect patterns and
tendencies. Numbers that describe a sample are called statistics, whereas numbers that
describe a population are called parameters. In practice, we almost never know the
values of parameters, and so we use sample statistics to learn about population
parameters. This process is called inferential statistics.
Statistical investigations make use of the scientific method: Ask research question, form a
hypothesis, collect data, analyze results, draw conclusions, communicate results, repeat.
The research conjecture is stated as the alternative hypothesis, to be compared against
the null hypothesis that is often a statement of no effect or equality (the “dull”
hypothesis). As with the presumption of innocence in the criminal justice system, a test
of significance begins with the assumption that the null hypothesis is true. After
gathering and examining sample data, we conduct a test of significance to assess the
strength of evidence that the data provide against the null hypothesis in favor of the
alternative hypothesis. To assess the strength of evidence, we outlined a three-step
process that we call the 3S strategy:
1. Statistic: Calculate the value of a statistic from the sample data.
2. Simulate: Use simulation to hypothetically replicate the data collection process many
times, but under the assumption that the null hypothesis is true. Examine the “what if
null were true” distribution to see what are typical and unusual values of the statistic
when the null hypothesis is true.
3. Strength of evidence: If the observed data value falls in the tail of the simulated “what
if null were true” distribution, then the sample data provide strong evidence against
the null hypothesis in favor of the alternative hypothesis.
The key component of this strategy is the p-value, which is the probability of obtaining a
result as extreme as or more extreme than the one from the sample data, assuming the
null hypothesis were true. Small p-values provide us with strong evidence against the
null hypothesis, as they indicate our observed data would be unlikely to occur by chance
alone if the null hypothesis is true. Large p-values don’t provide us with enough evidence
to reject the null in favor of the alternative, so we consider the null hypothesis to be
plausible (but not proven). Also keep in mind the distinction between a possible value for
the parameter and a plausible or believable value for the parameter.
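A compact sketch of the 3S strategy in code (ours, not the book's applet, with the kissing-study counts used only as an example input) may help tie the three steps together.

    # Generic 3S sketch for a single proportion: Statistic, Simulate,
    # Strength of evidence (one-sided simulated p-value).
    import random

    def three_s(n, observed_count, null_prob, repetitions=1000, seed=0):
        random.seed(seed)
        statistic = observed_count / n                         # 1. Statistic
        counts = [sum(1 for _ in range(n) if random.random() < null_prob)
                  for _ in range(repetitions)]                 # 2. Simulate under the null
        if statistic >= null_prob:                             # 3. Strength of evidence
            tail = sum(1 for c in counts if c >= observed_count)
        else:
            tail = sum(1 for c in counts if c <= observed_count)
        return statistic, tail / repetitions

    # Example: 80 of 124 kissing couples leaning right, tested against 0.5
    stat, p_value = three_s(124, 80, 0.5)
    print("statistic:", round(stat, 3), " simulated p-value:", p_value)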
We can use repeated tests of significance under different null hypothesis values to
determine an interval of plausible values for a population parameter. If we get a large
p-value and thus can’t reject the null, the value specified by the null hypothesis is
considered to be plausible. This interval of plausible values is called a confidence interval.
Sample size plays a large role in assessing statistical significance. If the value of a
statistic is the same with two samples, then a larger sample provides stronger evidence
against the null hypothesis than a smaller sample.
So far, you have considered how to test values for a population proportion, but the same
logic will apply to other types of parameters as well.
Case Study: Will skipping breakfast lead to a world without men?
In a recent study that made headlines around the world, researchers at Oxford
University (Mathews et al. 2008) explored whether women who had high food
consumption at the time of conception were more likely to have boys. Two hundred and
forty-one first-time mothers who were classified as having high-food consumption at the
time of pregnancy were followed for nine months until the gender of their child was
identified.
1. Before learning about how the study turned out, specify the null and alternative
hypotheses you will use to test the researchers’ conjecture that the women who had
high food consumption were more likely to have boys than girls. Since all 241 women
were first-time mothers in England, it is important to know that, overall, the proportion
of male babies born to all first-time mothers in England is 51.2%.
2. Statistic: So, how did the study turn out? In 135 of the 241 pregnancies, a boy was
born. What proportion of the births to high-food consumption mothers resulted in a
boy being born?
3. Simulate: How can you use the coin flipping approach to test the hypotheses?
Specifically describe the process you would use (remember, you can’t use a 50/50
coin) to carry out the simulation, including how to calculate the p-value and make a
conclusion. Provide sufficient details that someone else could implement the analysis
based solely on your description. Your answer should not simply explain how to use
the applet.
4. Use technology to carry out the simulation. Produce a rough sketch of the number of
correctly predicted outcomes assuming the null hypothesis is true, and indicate
where the result observed by the researchers falls in that distribution. Also report the
p-value.
5. Strength of evidence: Based on your p-value, state your conclusions about whether
high-food consuming mothers were more likely to have boys.
6. Using non-statistical language, explain the process you conducted in order to arrive
at your conclusion.
7. These researchers also explored whether the likelihood of having boys amongst the
lowest food consuming mothers was less than expected. Write the null and
alternative hypothesis you wish to test on this data.
8. One hundred eight of the 240 lowest food consuming mothers had boys. Use
technology to analyze these data with the 3-S analysis strategy. Produce a rough
sketch of the null distribution, and indicate where the observed research result falls in
that distribution. Also report the p-value. Summarize your conclusions.
9. Compare the two p-values you calculated, remembering that the p-value is a
measure of the strength of evidence. Explain why, based on the data, it makes sense
that one of the p-values is smaller than the other.
10. In another analysis presented in their paper, the researchers tried to pinpoint specific
foods consumed by the mothers associated with the gender of the child. Of 300
women in the study who consumed at least one bowl of cereal per day during the
period of conception, 181 had boys. State hypotheses, analyze the data with the 3S
strategy, report the p-value and summarize your conclusions.
11. Do you think that this analysis proves that higher food and/or breakfast cereal
consumption is causing women to have male babies, or are other explanations
possible? Why or why not?
12. Perhaps she ate a lot of breakfast cereal, but in the early 1900s Annie Grace
Buckland Jones, married to Grover C. Jones Sr., of Peterstown, West Virginia, had
15 children---all boys! The Joneses’ large family size and the 15 consecutive male
children raised quite a stir at the New York World’s Fair in 1940, where they were
invited to be guests of then-President Franklin Delano Roosevelt and were featured
nationally on the radio. Of course, there were some naysayers who couldn’t believe
this to be possible; surely the Joneses must have just been faking it!
a) Using simulation, estimate the probability of getting 15 boys in a row. Make sure
you describe how you’ve carried out your simulation (including any simplifying
assumptions you have made) and estimated the probability.
b) Do you think that this proves that the Jones’ must have been faking it?
c) What are the chances that, after 15 boy babies in a row, the next baby (16th) born
to the Joneses will be a boy? It was!
d) What are the chances that, after having had 16 children, their next baby would be
a girl? It was!
13. We would be remiss in telling you the story of the Joneses if we did not mention
something that made them even more famous than having 16 boys in a row.
While playing horseshoes with one of his sons, “Punch,” Grover C. Jones and his
son found a bluish rock which they believed to be quartz. They placed it in a cigar
box, where it stayed for 14 years while the Joneses struggled through the Great
Depression. Later, in 1942, Grover brought the rock to a geology professor at a
nearby university, who determined that it was a 34.48-carat diamond—the largest
alluvial diamond ever found in the United States. What are the chances of that?
Don’t worry, you don’t need to calculate that value.
Research Article: Stock Monkey
Have you ever heard it said that taking your stockbroker’s investment advice is no
different from letting a monkey choose your stocks? This popular notion was first
introduced by Prof. Burton Malkiel in his book A Random Walk Down Wall Street.
Since that time a number of serious and not-so-serious studies have placed real
monkeys up against experts to “test” the theory.
Read the article from “The Daily Princetonian” which investigates the results of an
ongoing “game” published in the Wall Street Journal. Answer the remaining questions.
1. Did the experts or randomly chosen stocks perform better in the WSJ’s game? What
measure of stock performance do the authors use to make their point?
2. What is Malkiel’s reason for why the expert picks did better? What justification does
he give for being right?
3. Explain intuitively why Malkiel’s “towel” approach of buying multiple stocks makes
sense.
4. How does Malkiel argue that the price of a stock is determined?
5. Why do some people argue that the collapse of Enron means Malkiel’s theory is not
true?
6. How does Malkiel argue that the collapse of Enron gives his theory more evidence?
James “Jim” Cramer is host of the popular Mad Money investment advice show airing
daily on CNBC. In a popular segment called “The Lightning Round,” callers phone in and
ask Jim’s opinion as to whether a stock will go up or down. Over a 30-day period, Jim
Cramer’s picks of whether a stock would go up or down were right 124 times and wrong
122 times. Assume that in the same 30-day period, 50% of stocks went up and 50% went
down.
7. Carry out the 3S strategy for evaluating the quality of Jim’s picks. Make sure to state
null and alternative hypotheses.
8. To see if Jim Cramer’s advice holds up, folks compared Jim’s picks to those of
Leonard the Wonder Monkey who flipped a fair coin to predict whether the stock
would go up or down over the ensuing 30 day period. Is this a fair comparison of
Jim’s picks? If yes, why? If not, what might you do to make it more fair?
9. Imagine if instead of 50% of the stocks in the market went up over the same 30 day
period, that only 40% did. Carry out the 3S strategy for evaluating the quality of Jim’s
picks. Make sure to state null and alternative hypotheses.
10. Is comparing Jim’s picks to those of Leonard the Wonder Monkey a fair comparison
in this scenario? If yes, why? If not, what might make it more fair?
Imagine that you were to test your own stockbroker to see if they could give good short-term picks (like the test of Jim Cramer). You have your stockbroker pick some stocks
that he thinks will go up over the next 30 days. You then track all stocks for the next 30
days, and find that 55% of all stocks increased in price during the 30 day time period.
11. If you had asked your broker to pick 5 stocks, how many stocks would he have
needed to get right for you to be convinced he was “better than a monkey”? Why?
12. If you had asked your broker to pick 15 stocks, how many stocks would he have
needed to get right for you to be convinced he was “better than a monkey”? Why?
13. If you had asked your broker to pick 30 stocks, how many stocks would he have
needed to get right for you to be convinced he was “better than a monkey”? Why?
14. Calculate the proportion of “correct picks” needed for each of your answers to 11 to
13. Why is this different in the different scenarios?
Practice Exercises
1. Review the graph that summarizes the simulation for the Bob or Tim example. The
graph shows the results of 1000 different sets of 33 tosses of a fair coin. The goal of
the simulation was to find out how likely it is to get 23 or more heads in 33 tosses.
For each of (a) – (d) below, indicate what it corresponds to in the example.
a. Each toss that lands heads
A. An individual student
b. Each set of 33 tosses
B. A class of students
c. Each toss of a coin
C. The number of Tim matches
d. Each bar of the graph
D. A single Tim match
2. Review Activity 1.1, Friend or Foe. For each of (a) – (d) below, indicate what it
corresponds to in the activity.
a. Each toss that lands heads
A. An individual infant
b. Each set of 16 tosses
B. The 16 infants
c. Each toss of a coin
C. The number of helpers chosen
d. Each stack of dots in the graph
D. An infant choosing the helper toy
3. Review the Dog Food example and Discriminating Cola activity. Compare them with
the Bob/Tim and Friend/Foe scenarios. In both the Dog Food and Cola scenarios, a
set of individuals is asked to make a choice from a set of options. In all four cases,
the _______ (null/alternative) hypothesis says that the _______ (options/individuals)
are equally likely. Explain.
4. Review the graphs for the Bob/Tim, Dog Food, and Kissing examples.
• Each stack of dots corresponds to a value of the __________ (parameter/statistic).
• The height of the stack reveals _____________ (choose one of a-e):
a. How many individuals were asked to make a choice
b. How many options there were
c. How many times the value of the statistic occurred
d. How many times the value of the parameter occurred
e. The total number of repetitions
5. In these graphs, the p-value corresponds to ________ (choose one):
a. The height of the bar above the observed value of the statistic.
b. The area of the bars above or to the right of the observed value
c. The area of the bars above or to the left of the observed value
d. The number of bars above or to the right of the observed value
e. The number of bars above or to the left of the observed value
6. The set of individuals you want to know about is called the ________
(sample/population). The set of individuals you get to see is called the ________
(sample/population).
7. Suppose you take several different samples from the same population. You are
likely to get several different values for the _________ (statistic/parameter), but the
value of the ________ (statistic/parameter) will not change.
8. For almost all practical problems, once you have the data, you can compute the
value of the _________ (statistic/parameter) but you can’t compute the value of the
_________ (statistic/parameter).
9. For an election poll, the set of likely voters is often used as the _______
(population/sample).
10. Which one of the following statements is meaningful? (The other three are
meaningless.)
a. “The null hypothesis was statistically significant.”
b. “The observed difference was statistically significant.”
c. “The parameter was statistically significant.”
d. “The sample size was statistically significant.”
11. True or false, and explain:
a. The p-value depends on the sample.
b. The p-value depends on the null hypothesis.
c. The p-value is the probability that the null hypothesis is true.
d. The p-value is the probability that the alternative hypothesis is true.
12. If the p-value is below 0.05, the evidence against the _______ (null/alternative)
hypothesis is ________ (strong/weak).
13. Statistical testing allows you to reject the ________ (null/alternative) hypothesis
based on the data, but cannot justify rejecting the _________ (null/alternative)
hypothesis.
14. When you use the 3S method for hypothesis testing, the simulation you use to find
the p-value is based on the assumption that the ________ (null/alternative)
hypothesis is true.
15. When you use the 3S method, you use simulation to create a large number of
hypothetical _______ (samples/populations). For each one, you compute a value of
the ___________ (statistic/population) and look at the distribution of these values.
16. Multiple Choice. In a sample of size 20, we observe 12 (60%) of the subjects with
the characteristic of interest. In conducting the related test of significance we have a
computer applet simulate counting the number of heads when 20 fair coins are
flipped. This process is repeated 1000 times and a distribution of the number of
heads is made. This distribution represents:
a. Repeated results if the null hypothesis is true.
b. Repeated results if the alternative hypothesis is true.
c. Repeated results if the population proportion is the same as your sample
percentage of 60%.
d. Repeated results if the population proportion is more than your sample proportion
of 60%.
17. Multiple Choice. Suppose you sample 100 students at your school and find that 60%
of them exercise regularly. A friend at another school samples 200 students and
finds that 60% of that sample also exercises regularly. You both conduct tests of
significance and find p-values.
a. The sample with the larger sample size will have the smaller p-value.
b. The sample with the smaller sample size will have the smaller p-value.
c. Since the sample proportions are the same, the p-values will be the same.
d. There is no way to tell which sample will have the smaller p-value.
Exercises
Section 1.1 Introduction to Statistics
1. Pro-life or pro-choice In the May 16, 2009 edition of the Grand Rapids Press, the
article entitled “Most of us say we’re pro-life, poll finds” reported the following
information: A telephone interview conducted by Gallup involved 1,015 adults
nationwide. The poll found that 51% called themselves pro-life rather than pro-choice on the issue of abortion. This is the first time that a majority gave that answer
in the 15 years that Gallup has asked the question.
a. What is the population of interest in this study?
b. What is the sample?
c. Identify the statistic(s) in this study. If you don’t know the value of the statistic,
describe it in words.
d. Describe the parameter(s) in this study. If you don’t know the value of the
parameter, describe it in words.
2. Education and Government In August 2010, a Gallup poll asked a random sample
of 1,013 U.S. adults the following question, “In terms of public education in this
country, do you think the federal government [should be more involved in education
than it currently is, should keep its involvement the same, (or) should be less
involved in education than it currently is]?”
Here’s a bar chart that shows the distribution of responses.
a. Approximately what proportion of respondents want the federal government’s
involvement in public education to remain unchanged?
b. Is your proportion from part (a) a parameter or a statistic? How do you know?
c. How many respondents want the federal government’s involvement in public
education to remain unchanged?
d. Suppose that we want to use these data to test whether less than half of
American adults want the federal government to be more involved in public
education than it currently is. State appropriate null and alternative
hypotheses for this scenario.
3. Practice Suppose that our population of interest is all households in San Luis
Obispo County. Identify which of the following are possible parameters.
a. Mean number of residents in all households in the county
b. Whether there are any children under 12 years of age living in the household
c. Proportion of households with children under 12 years of age
d. Whether household is in an urban area
e. Whether size (number of members) of the household is related to the household
expenditure
4. Oil spill Following the oil spill in the Gulf of Mexico, The Pew Research Center for
the People and the Press surveyed a random sample of 994 U.S. adults in early May
2010, and found that 547 of the respondents thought that the oil spill was a major
environmental disaster. In the context of the study, answer the following.
a. Identify the population of interest.
b. Identify the sample.
c. Describe, in words, the relevant parameter of interest.
d. Describe, in words, the statistic corresponding to the parameter described in part
(c). Also, specify the numeric value of the statistic.
5. U.S. Census Did you know that the U.S. Census is required by law? A survey of
1,504 U.S. adults conducted by The Pew Research Center for the People and the
Press in early January 2010, found only 31% of the respondents knew this fact.
a. Identify the population of interest.
b. Identify the sample.
c. Is 31% a parameter or a statistic? How do you know?
d. Do the data provide evidence that less than a third of U.S. adults are “census-aware”? State appropriate null and alternative hypotheses for this scenario.
6. Practice The data collected from a sample of students are analyzed below with a bar
chart and frequency table. Based on this information:
a. What percentage of the sample are freshmen?
b. What percentage of the sample are sophomores?
c. What percentage of the sample are juniors?
d. What percentage of the sample are seniors?
e. What should be true about these four percentages?
[Bar chart of class standing for the sample, summarized in the frequency table below.]
Class        Count
Freshman     25
Sophomore    19
Junior       30
Senior       13
Total        87
Section 1.2 Introduction to Statistical Reasoning: One Proportion
7. Injection drug use Do a majority of women in Rhode Island Prisons who test
positive for Hepatitis C virus report injection drug use? A recent article in the
American Journal of Public Health (Macalino, G.F., et al., 2005) reported that, in a
representative sample of inmates at the time of intake to the prison system, 197
women tested positive for the Hepatitis C virus. Of these 197 women, 110 reported
injection drug use.
a. What proportion of women with Hepatitis C virus in this study reported injection
drug use?
b. Specify a null hypothesis and an alternative hypothesis to reflect the research
question that a majority of women who test positive for Hepatitis C report
injection drug use.
c. Use the coin flipping applet to generate 1000 repetitions of 197 Rhode Island
women prisoners who tested positive for Hepatitis C virus under the null
hypothesis. Save a screenshot of your dotplot and p-value.
d. Based on the study’s result, what is the p-value for this test?
e. What are your conclusions based on the p-value you found in part d?
f. To what population are you willing to generalize these results? Explain.
8. Staying up late? In a random sample of 40 students from a small college, 30 went
to bed after 11 pm the previous night. From these data we want to know whether we
can conclude that the majority of all students at the college go to bed after 11 pm.
a. What are the null and alternative hypotheses?
b. Statistic: What proportion of the sample went to bed after 11pm?
c. Simulate: Use the coin flipping applet to generate 1000 repetitions of 40 students
under the null hypothesis. Save a screenshot of your dotplot and p-value.
d. Strength of evidence: What are your conclusions based on the p-value you found
in part c?
9. Credit card usage Do a majority of female college freshmen at a certain college
have at least one credit card? To answer this question, some student researchers
collected data from 42 female freshmen at their school. They found that 27 of them
had at least one credit card.
a. State the null and alternative hypotheses for a test of significance of the research
question given above.
b. Use the coin flipping applet to answer the following questions and compute the p-value for this test by using 1000 repetitions.
i) What value did you enter for the “Probability of heads”?
ii) What value did you enter for the “Number of tosses”?
iii) What value did you enter for the “Number of repetitions”?
c. The graph generated is centered at approximately what number?
d. What is your p-value?
e. Use the p-value to give a conclusion for your significance test.
10. Multiple Choice The p-value of a test of significance is:
a. The probability, assuming the null hypothesis is true, that we would get a result
as extreme as the one that was actually observed.
b. The probability, assuming the alternative hypothesis is true, that we would get a
result as extreme as the one that was actually observed.
c. The probability the null hypothesis is true.
d. The probability the alternative hypothesis is true.
11. Planning to vote? The following figure shows the results of the coin tossing applet.
Suppose the research question was, “Do a majority of students say they plan to vote
in the next election?” Explain what the following numbers shown in the applet mean
in terms of the students being sampled and the process and results from the test of
significance.
a. What does the 0.5 for the probability of heads represent?
b. What does the 100 for the number of tosses represent?
c. What do the 1000 repetitions represent?
d. What does the 55 in the “as extreme as” cell represent?
e. What does the 0.202 for the proportion of repetitions represent?
f. What would be an appropriate conclusion for this test?
12. Monkey see A recent article (Hauser, Glynn, and Wood, 2007) described a study
that investigated whether rhesus monkeys have some ability to understand gestures
made by humans. In one part of the study, the experimenter approached individual
rhesus monkeys and placed 2 boxes an equal distance from the monkey. The
experimenter then placed food in one of the boxes, making sure that the monkey
could tell that one of the boxes received food without revealing which one. Finally,
the researcher made eye contact with the monkey and then gestured toward the box
with the food by jerking his head toward that box. This process was repeated for a
total of 40 rhesus monkeys. It turned out that 30 of the monkeys approached the box
that the human had gestured toward, and 10 approached the other box.
a. Describe how you could use a coin to conduct a simulation analysis of this study
and its result. Give sufficient detail that someone else could implement this
simulation analysis based on your description. Be sure to indicate how you
would decide whether the observed data provide convincing evidence that
rhesus monkeys can read human gestures better than random chance.
b. Use software to conduct a simulation analysis with at least 1000 repetitions.
Report the approximate p-value, and summarize the conclusion that you would
draw about the research question of whether rhesus monkeys have some ability
to understand gestures made by humans.
13. CPR on pets A national survey conducted on October 1-5, 2009 asked pet owners
whether they would perform CPR on their pet in the event of a medical emergency.
In the sample of 1116 pet owners, 58% said that they are at least somewhat likely to
perform CPR on their pet. Investigate whether this sample result provides strong
evidence that more than half of all pet owners in the U.S. are at least somewhat
willing to perform CPR on their pet.
14. Practice Indicate whether each of the following p-values provides extremely strong
evidence against the null hypothesis, moderate evidence against the null hypothesis,
or not much evidence against the null hypothesis.
a. 0.052
b. 0.00035
c. 0.417
Section 1.3 Statistical Significance for One Proportion: Other Null
Hypotheses
15. Practice What are the null and alternative hypotheses in the following scenarios?
a. Do a majority of males believe that cigars smell good?
b. Hershey’s claims they put 45% orange Reese’s Pieces in their Reese’s Pieces
candy mixture. Is this true, or is the true percent different from 45%?
c. Do a majority of Americans believe in life-after-death?
d. Do less than 30% of students at your school read for pleasure during the term?
e. Do more than 12% of students at your school skip breakfast every morning?
f. Do more than 25% of the students at your school exercise at least three times
per week?
g. Do fewer than 5% of people aged 14 or older get arrested per year?
16. Psychic powers Statistician Jessica Utts has conducted extensive analysis of
studies that have investigated psychic functioning. One type of study involves
having one person (called the “sender”) concentrate on an image while a person in
another room (the “receiver”) tries to determine which image is being “sent.” The
receiver is given four images to choose from, one of which is the actual image that
the sender is concentrating on.
a. If the subjects in these studies have no psychic ability, what is the probability that
they identify the correct image? Is this a null hypothesis or an alternative
hypothesis?
b. Utts (1995) cites research from Bern and Honorton (1994) that analyzed studies
using a technique called ganzfeld. These researchers analyzed a total of 329
sessions (http://www.ics.uci.edu/~jutts/air.pdf). Use software to simulate 1000
repetitions of this study, assuming the null hypothesis to be true. Produce a well-labeled graph of the results.
c. Based on the graph of your simulation results, about how many of these 329
sessions would have to produce a “hit” (correct identification of the image being
“sent”) in order to provide very strong evidence against the null hypothesis?
Explain how you arrive at this number based on your graph.
d. Utts reported that Bern and Honorton found a total of 106 “hits” in the 329
sessions. Does this result provide very strong evidence against the null
hypothesis? Explain, and summarize your conclusion in the context of this study.
17. Reese’s pieces An astute statistics student wants to determine whether there are
more than 45% orange Reese’s Pieces in small bags of Reese’s Pieces. Being low
on funds, she purchases only one small bag of Reese’s Pieces. There are 25 total
Reese’s Pieces in her bag. Sixteen are orange. The student quickly simulates the
distribution of the number of orange Reese’s Pieces from 1000 samples of 25 total
Reese’s Pieces assuming there are indeed 45% orange in the population. The
graph resulting from 1000 repetitions of this simulation is displayed below.
[Dotplot: number of orange Reese’s Pieces in 1000 simulated samples of 25, assuming 45% orange.]
a. State a null hypothesis and an alternative hypothesis for this student.
b. What percentage of the student’s Reese’s pieces is orange? Is this more than
the 45% she is testing?
c. Use the graph above to find the p-value for this test. What conclusion would you
draw about the null hypothesis from this p-value?
d. Suppose the student’s original bag had 13 orange instead of 16. Now, what
percentage of the student’s Reese’s pieces is orange? Use the graph above to
find the p-value for this test.
e. Which p-value offers more support for the alternative hypothesis that there are
more than 45% orange Reese’s Pieces? Explain why this makes sense using
the percentage of orange Reese’s Pieces you found in (b) and (d).
18. Fantasy golf A statistics professor is in a fantasy golf league with 4 friends. Each
week one of the 5 people in the league is the winner of that week’s tournament.
During the 2010 season, this particular professor was the winner in 7 of the first 12
weeks of the season. Does this constitute strong evidence that his probability of
winning in one week was larger than would be expected if the 5 competitors were
equally likely to win? Conduct a simulation analysis to investigate this question.
Write a paragraph summarizing your conclusion and explaining the reasoning
process by which your conclusion follows from the simulation analysis.
20. Breakfast Findings at James Madison University indicate that 21% of students eat
breakfast 6 or 7 times a week. A similar question was asked of a random sample of
159 students at a different college. Of the 97 who responded, 35 reported eating
breakfast 6 or 7 times a week. Do students at the other college have healthier
breakfast habits than James Madison students? More specifically, do more than
21% of all students at the other college eat breakfast 6 or 7 times weekly?
a. State your null and alternative hypotheses.
b. Statistic: What sample proportion of students at the other college eat breakfast 6
or 7 times per week?
c. Simulate: Use the coin flipping applet to simulate “could have been” outcomes
under the null hypothesis 1000 times.
d. Based on the study’s result, what is the p-value for this test?
e. Strength of evidence: What are your conclusions based on the p-value you found
in part d?
f. What are your thoughts about the fact that only 97 out of the random sample of
159 responded?
21. Cloning humans In a May 2010 poll of a random sample of 1,029 U.S. adults,
Gallup found that 88% thought that the cloning of humans was morally unacceptable.
Explain what, if anything, is incorrect in the following statements.
a. The population is the 1,029 U.S. adults who were interviewed.
b. The population is U.S. adults who think that cloning of humans is morally
unacceptable.
c. The number 88% is a parameter.
d. The sample is all U.S. adults.
e. The statistic is 1,029 U.S. adults.
f. The statistic is the average number of U.S. adults who think cloning humans is
morally unacceptable.
g. If we repeatedly poll random samples of 1,029 U.S. adults, the percentage of
respondents who think that the cloning of humans is morally unacceptable is
always going to be 88%.
22. Cloning humans (contd.) Reconsider the previous exercise. Suppose that we want
to use the data to test whether the proportion of U.S. adults who think that the
cloning of humans is morally unacceptable is over 0.80. What, if anything, is
incorrect in the following?
a. Alternative hypothesis: The proportion of U.S. adults who think that the cloning of
humans is morally unacceptable is less than 0.80.
b. The p-value is the probability of observing a sample proportion of 0.88, if the
proportion of U.S. adults who think that the cloning of humans is morally
unacceptable is 0.80.
c. Another survey of 1,029 randomly selected U.S. adults, run by a different survey
team found 85% of their respondents think that the cloning of humans is morally
unacceptable. These data would be stronger evidence against the null
hypothesis, than the data collected by Gallup.
23. Pro-choice or pro-life (contd.) Reconsider the Pro-life or pro-choice exercise (Exercise 1). Suppose that we want to
use the data from Gallup’s survey to test whether a majority of U.S. adults are now
pro-life rather than pro-choice. Using the Simulating Coin Tossing applet the
simulated p-value was found to be 0.276. What, if anything, is incorrect in the
following statements?
a. Null hypothesis: A majority of U.S. adults are now pro-life rather than pro-choice.
b. Since the p-value is fairly large, we have strong evidence that exactly half of all
U.S. adults are pro-life and exactly half are pro-choice.
c. Since the p-value is fairly small, we have strong evidence that a majority of U.S.
adults are now pro-life.
d. Since the p-value is fairly large, this survey has provided no information about
the proportion of U.S. adults who are now pro-life.
Section 1.4 Plausible Values
24. Multiple Choice If you don’t have much evidence against the null hypothesis, you
can:
a. Conclude the null hypothesis must be true
b. Conclude that the null hypothesis value is one of a set of plausible values
25. Democrat or Republican A political poll finds that 305 of 600 likely voters are in
favor of the Democrat over the Republican in a recent local election between the two
candidates. Is this evidence that the majority of likely voters are in favor of the
Democrat candidate?
a. What is the proportion of likely voters in favor of the Democrat candidate in this
sample?
b. Use the 3S process to evaluate whether this sample is evidence that the majority
of likely voters in the population favor the Democrat candidate.
c. Does your answer to b) mean that you have proven that the population of all
likely voters is perfectly split (50/50) between the two candidates?
26. Outbreak Recently a small college in the Midwest experienced an outbreak of
norovirus which forced its temporary closure. In a survey of a random sample of
students after the outbreak, 34 out of 187 students reported experiencing symptoms
of norovirus during the outbreak.
a. Find the proportion of students reporting symptoms of norovirus.
b. What is the population parameter of interest?
c. State null and alternative hypotheses and then use the 3S process to investigate
whether this sample provides evidence that more than 10% of students at the
college experienced symptoms.
d. State null and alternative hypotheses and then use the 3S process to investigate
whether this sample provides evidence that less than 30% of students at the
college experienced symptoms.
e. Use your results from c) and d) and more computer simulations to find a range of
plausible values of the population parameter. Remember that a range of
plausible values is a set of numbers against which your sample does not provide
much evidence. Make your range accurate to the nearest 1 percentage point.
Clarify how you determined convincing evidence or not much evidence.
Section 1.5 Effect of Sample Size
27. Phone home Suppose we want to test the hypothesis that more than 50% of
college students call home at least once a week, and that we use the 3S analysis
strategy to estimate the p-value. Also suppose that four students collect data to try
to answer the question. Each student collects a different sample size (one of 10,
one of 20, one of 40, and one of 80). Surprisingly, in each case the results showed
that 70% of the students in the respective samples called home at least once a
week.
Use the following dotplots of 1,000 repetitions of these simulations to decide whether
each result is statistically significant (small p-value) and whether each student can
therefore conclude that more than 50% of students call home at least once a week.
Also comment on what happened to the p-value as the sample size increased even
though the sample proportion stayed the same. (A software version of this
simulation is sketched after the dotplots.)
[Four dotplots, one per student, each showing 1,000 simulated sample results with
the horizontal axis labeled "number of college students calling home." The captions
report what each student observed: 7 out of 10 (or 70%), 14 out of 20 (or 70%),
28 out of 40 (or 70%), and 56 out of 80 (or 70%) of the students in the sample called
home at least once a week.]
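If you would rather generate the four null distributions yourself than read them off the dotplots, the following Python sketch (numpy assumed, as in the earlier sketches) repeats the same fair-coin simulation at each of the four sample sizes; in every case the observed result is 70% of the sample.

    import numpy as np

    rng = np.random.default_rng(seed=1)
    reps = 1000  # matches the 1,000 repetitions shown in the dotplots

    for n, observed in [(10, 7), (20, 14), (40, 28), (80, 56)]:
        # Null hypothesis: 50% of students call home, so each sample is n tosses of a fair coin
        sim_counts = rng.binomial(n=n, p=0.5, size=reps)
        p_value = np.mean(sim_counts >= observed)
        print(f"n = {n:3d}: observed 70%, approximate p-value = {p_value:.3f}")

Watch how the approximate p-value changes as n grows even though the sample proportion never moves from 70%.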
28. Story falls flat A legendary campus story tells of two students who miss an exam
because they are off partying. When they return to campus, they sheepishly
approach the professor and say that they missed the exam because of a flat tire.
The students are delighted when the professor grants them an opportunity to take a
make-up exam. But when they are sent to separate rooms to take the make-up
exam, they find that one question is worth 95 points: Which tire was flat? It’s been
conjectured that when students are asked this question and forced to give an answer
(left front, left rear, right front, or right rear) off the top of their head, they tend to
answer “right front” more than would be expected by random chance. To test this
conjecture, this question was asked of a recent class of 32 students, with the
following results:
Left front   Left rear   Right front   Right rear
    5            4            18            5
a. State the appropriate null and alternative hypotheses to be tested.
b. Produce a bar graph to display the student responses, and comment on what the
graph reveals.
c. Statistic: Calculate the sample proportion who answered “right front.” Does this
statistic appear to support the research conjecture? Explain.
d. Simulate: Use software to conduct a simulation analysis investigating whether the
sample data provide strong evidence for the research conjecture. Submit a graph
of the "what if the null hypothesis were true" distribution, and report the
approximate p-value. (One possible software sketch follows this exercise.)
e. Strength of Evidence: Explain what this p-value means. In other words, this is
the probability of what, assuming what? (Do not draw a conclusion from the
p-value yet; that's the next question.)
f. Summarize the conclusion that you draw from this test. Also explain the
reasoning process behind your conclusion.
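Part (d) asks for a software simulation; any tool that simulates a spinner with a 1-in-4 chance of landing on "right front" will do. One possible Python sketch, again assuming numpy is available, is:

    import numpy as np

    rng = np.random.default_rng(seed=1)

    n = 32          # students asked which tire was flat
    observed = 18   # answered "right front"
    reps = 10_000

    # Null hypothesis: answers are random guesses, so "right front" has probability 1/4
    sim_counts = rng.binomial(n=n, p=0.25, size=reps)

    # p-value: chance of 18 or more "right front" answers from guessing alone.
    # A histogram of sim_counts (for example, matplotlib's plt.hist) is one way to draw
    # the "what if the null hypothesis were true" distribution requested in part (d).
    p_value = np.mean(sim_counts >= observed)
    print(f"approximate p-value: {p_value:.4f}")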
29. Story falls flat (contd.) Reconsider the previous exercise. Suppose another class
conducts the same study with exactly half as many students, and suppose the
proportion answering "right front" is identical to the previous exercise. In other
words, 9 out of 16 students answered "right front."
a. Before you analyze the data, would you expect to find stronger evidence for the
research conjecture (that people pick the right front tire more than ¼ of the time),
weaker evidence, or the same strength of evidence? Explain your thinking.
b. Conduct a simulation analysis to produce an approximate p-value. How does it
compare to the p-value from the previous exercise? Is this what you expected?
Explain.
30. Haiti earthquake On January 12, 2010, a 7.0 magnitude earthquake shook Haiti,
affecting about three million people. Aid started pouring in from all parts of the world
in the form of volunteer services, donations, etc. In a February 2010 survey of 1,383
randomly chosen U.S. adults, the Pew Research Center for the People and the
Press found that 719 of the respondents had made a donation to Haiti victims. We
want to use these data to investigate whether more than half of U.S. adults made a
donation to Haiti victims.
a. Identify the population of interest.
b. Describe the relevant parameter of interest.
c. State the null hypothesis and the alternative hypothesis.
d. Of the people that the Pew Research Center interviewed, what proportion made
a donation to Haiti victims?
e. Is the proportion you calculated in part (d) a parameter or a statistic? How do you
know?
f. The study description says that we are looking at data from “randomly chosen
U.S. adults.” What is the main advantage of having data from “randomly chosen
U.S. adults”?
g. Investigate whether the data provide evidence that more than half of U.S. adults
made a donation to Haiti victims. Be sure to include the p-value, an interpretation
of the p-value, and your conclusion in the context of the study.
h. A high school student surveys a random sample of 200 adults in his city, and
finds that 52% of the respondents made donations to Haiti victims. The high
school student decides to use his data to test whether more than half of the
adults in the city made a donation to Haiti victims. Will the p-value from his
analysis be larger than, smaller than, or the same as your p-value from part (g)?
Explain your choice.
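Parts (g) and (h) invite the same fair-coin simulation at two different sample sizes. In the sketch below (Python with numpy, as before), the high-school count of 104 is simply 52% of 200, our own arithmetic rather than a figure reported in the exercise; running both simulations shows how sample size affects the p-value when the sample proportions are nearly identical.

    import numpy as np

    rng = np.random.default_rng(seed=1)
    reps = 10_000

    surveys = [
        ("Pew survey", 1383, 719),         # 719 of 1,383 respondents, about 52%
        ("high-school survey", 200, 104),  # 52% of 200 respondents (our arithmetic)
    ]

    for label, n, observed in surveys:
        # Null hypothesis: exactly half of the adults in question made a donation
        sim_counts = rng.binomial(n=n, p=0.5, size=reps)
        p_value = np.mean(sim_counts >= observed)
        print(f"{label}: observed proportion = {observed / n:.3f}, "
              f"approximate p-value = {p_value:.3f}")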