Slide 1
Hypothesis Testing Part III – Applying the Concepts

Slide 2
This video is designed to accompany pages 95-116 of the workbook “Making Sense of Uncertainty: Activities for Teaching Statistical Reasoning,” a publication of the Van-Griner Publishing Company.

Slide 3
While you may not find yourself actually computing p-values in your chosen profession, you will without a doubt be trying to interpret those found by others. In particular, the phrase “statistically significant,” or some very similar wording, is widely used in the media. Have a look at the New York Times article on multivitamins, which appeared October 17th, 2012. The article addresses a clinical trial that unfolded over more than a decade. Male doctors were followed during that period, and those taking a daily multivitamin were compared to those who took a placebo (dummy pill). What did the researchers find? You can see from the article that those taking the multivitamins “experienced 8 percent fewer cancers ….” But it is the first line of the last paragraph that invokes a statistical badging of sorts: “The reduction in total cancers was small but statistically significant ….” Our goal for this video is to make sure we all know what this means, in what sense it adds to the credibility of the results, and in what sense it does not.

Slide 4
You have learned elsewhere that when the phrase “statistically significant” is used, a hypothesis has been tested. That is, a choice between a null and an alternative hypothesis has been put forward, and a decision has been made. If the results were statistically significant, then you know that the alternative was chosen. Otherwise, either the null was chosen or, at least, no endorsement was given to the alternative. Finally, you also know, or are often safe to assume, that this decision was made because an estimated false positive rate was computed that had to be less than 0.05.
That is, you know that the risk involved in choosing HA when H0 was really true is less than 5 chances in 100. All of this information is embedded in that phrase. To summarize: when reading the words “statistically significant” or “not statistically significant,” you should be able to establish what is being compared; express that choice as a null and an alternative hypothesis; say how the comparison turned out; and say what risk was associated with the conclusion. Keep in mind that saying a result is “significant” is not at all the same as saying it is “statistically significant.”

Slide 5
Let’s apply what we have just learned to the New York Times article. The comparison is between the number of cancers seen in the group taking the multivitamins and the number seen in the group taking the placebo. Remembering that the null is always, generically, “treatment is not effective,” we know that the null hypothesis in this case is that the number of cancers in the vitamin group is the same as in the placebo group. Therefore the alternative is that the number of cancers in the vitamin group is less than in the placebo group. If you are confused about how to tell which is the null and which is the alternative, the next two steps really determine that. We are told that the results were statistically significant. What results? The results that said the vitamin group had fewer total cancers than the placebo group. Since we know that the results were statistically significant, we know that the decision was made to go with the alternative. Hence, the alternative has to be that the total number of cancers in the vitamin group was less than in the placebo group. Of course, there is always the risk of wrongly saying HA is true, that is, the risk of saying HA is true when H0 really is. But that is what the estimated FPR, the p-value, quantifies for us. And in this case, that risk is less than 5 chances in 100. We know that because that is what it means to be statistically significant.
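To see the arithmetic behind a verdict like this, here is a minimal Python sketch of a one-sided two-proportion z-test. The cancer counts below are illustrative numbers chosen only to mimic the article’s “8 percent fewer cancers”; they are not presented as the trial’s actual data.

```python
import math

def two_proportion_z_test(x1, n1, x2, n2):
    """One-sided z-test of H0: p1 == p2 vs HA: p1 < p2 (group 1 has fewer events)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)          # pooled rate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Lower-tail p-value P(Z <= z), via the standard normal CDF
    p_value = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return z, p_value

# Illustrative counts (NOT the study's published data): cancers among
# roughly 7,300 men per arm, with the vitamin arm about 8 percent lower.
z, p = two_proportion_z_test(1290, 7317, 1379, 7324)
print(f"z = {z:.2f}, one-sided p-value = {p:.3f}")
if p < 0.05:
    print("Statistically significant: endorse HA (fewer cancers with vitamins).")
else:
    print("Not statistically significant: no endorsement of HA.")
```

With these made-up counts the estimated false positive rate comes out below 0.05, which is exactly the condition the phrase “statistically significant” is reporting.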
Slide 6
Let’s take a look at another example. The excerpt shown is from a 2013 article that appeared in the Chicago Tribune. You should be able to find the entire article just by searching on the title. This study was not an experiment but an observational study of adolescents who went to McDonald’s and Subway. Researchers collected data on what those adolescents actually ate at each of the restaurants and kept track of the total number of calories consumed. What did they find? Look at the last paragraph. They bought an average of 1,038 calories at McDonald’s and 955 at Subway. But that calorie difference, the article goes on to say, was “not statistically significant.” Let’s apply our template for analyzing what this means.

Slide 7
What is being compared? The calories consumed by teens at Subway and those consumed by teens at McDonald’s. How do we know what H0 and HA are? Look at the third part of the template first; that may make this easier to answer. The difference between the calories consumed at Subway and at McDonald’s was not statistically significant, we are told. That means that whatever H0 and HA are, the decision was made not to go with HA. Thus H0 has to be “the number of calories consumed at Subway is the same as at McDonald’s,” and HA has to be “the number of calories consumed at Subway is less than at McDonald’s.” Since the decision was made not to endorse HA, we know the estimated false positive rate was 0.05 or larger. Hence, it was too risky, in the sense that the chances of going with HA when H0 was really true were too high to justify that decision. Keep in mind that the risk of going with H0 when HA is really true is not explicitly monitored in hypothesis testing, but in general it is assumed to be pretty small for common hypothesis tests.

Slide 8
Let’s continue to practice. The New York Times study shown here – again, you can almost surely find the full article if you search on the title – addresses weight loss in people with serious mental illness.
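The Subway-versus-McDonald’s comparison from Slides 6 and 7 can be imitated with a small permutation test. The calorie totals below are invented (only the two means, 1,038 and 955, come from the article); the point is to show how a real difference in sample averages can still fail to be statistically significant when the person-to-person spread is large.

```python
import random
import statistics

# Invented calorie totals (NOT the study's raw data), built to have
# means of 1,038 (McDonald's) and 955 (Subway) with a wide spread.
spread = [-450, -350, -250, -150, -50, 50, 150, 250, 350, 450]
mcdonalds = [1038 + s for s in spread]
subway = [955 + s for s in spread]

observed = statistics.mean(mcdonalds) - statistics.mean(subway)  # 83 calories

# Permutation test of H0: the restaurant makes no difference, so the
# group labels are interchangeable and can be shuffled freely.
random.seed(1)
pooled = mcdonalds + subway
n = len(mcdonalds)
trials = 10_000
extreme = 0
for _ in range(trials):
    random.shuffle(pooled)
    diff = statistics.mean(pooled[:n]) - statistics.mean(pooled[n:])
    if diff >= observed:  # one-sided: McDonald's group higher
        extreme += 1
p_value = extreme / trials

print(f"observed difference = {observed:.0f} calories, p-value = {p_value:.2f}")
```

For these made-up samples the 83-calorie difference yields a p-value well above 0.05, so, just as in the article, HA is not endorsed.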
While we don’t have an abundance of information from the article, we do see in the last paragraph that 24 well-designed studies of weight-loss programs for the mentally ill were scrutinized. We are told that “most achieved statistically significant weight loss, but very few achieved ‘clinically significant’ weight loss.” What does this mean?

Slide 9
Let’s start by applying our four-step template. Each study being scrutinized compared the weights of the patients before they completed a given fitness program for the mentally ill to their weights afterwards. In terms of H0 and HA, we know that H0 is that the program didn’t work, that is, that weight before starting the program was no different from weight afterwards. HA has to be that the program being looked at worked, that is, that patients lost weight after completing the program. We know the results were statistically significant, so we know that from a statistical perspective we can safely conclude HA is true. In fact, the chances of saying HA is true, based on the data from such a study, when in fact H0 is true, are less than 5 in 100. This article has an interesting nuance, however. The researchers also concluded that few of the studies found any practically significant weight loss. That is to say, the change in weight, probably on average, for patients in a given study was big enough to be statistically significant, but not big enough to be practically significant. More will be said on this important distinction in another video.

Slide 10
Here is another Tribune article. This one looks at a study that compared the pay of presidents at various universities with measures of prestige for those universities. The findings are in the last paragraph: “….[N]o statistically significant relationship was observed between academic quality and presidential pay.” Let’s make sure we know what this is telling us.
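The “statistically but not clinically significant” pattern from Slide 9 is easy to reproduce: with enough patients, even a trivial average weight loss clears the 0.05 bar. Every number below (a 2-pound average loss, a 10-pound standard deviation, 400 patients, a 200-pound reference body weight, and a 5 percent rule of thumb for clinical meaning) is invented for illustration.

```python
import math

def one_sided_z(mean_change, sd, n):
    """z-test of H0: mean weight change == 0 vs HA: mean loss > 0."""
    z = mean_change / (sd / math.sqrt(n))
    p_value = 1 - 0.5 * (1 + math.erf(z / math.sqrt(2)))  # upper-tail p
    return z, p_value

# Hypothetical study: average loss of 2 lb, sd of 10 lb, 400 patients.
z, p = one_sided_z(2.0, 10.0, 400)
print(f"z = {z:.1f}, p-value = {p:.6f}")  # statistically significant...

# ...but 2 lb is only 1 percent of a 200-lb reference weight, far short
# of an (assumed) 5 percent threshold for clinical significance.
clinically_meaningful = 2.0 / 200 >= 0.05
print("clinically significant?", clinically_meaningful)
```

The large sample size shrinks the standard error until even a tiny effect is statistically unmistakable; statistical significance answers “is the effect real?”, not “is it big enough to matter?”.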
Slide 11
Presidential pay at low-achieving universities is being compared to presidential pay at high-achieving universities. The null would be something to the effect of “there is no correlation between presidential pay and the performance of a university.” The alternative would counter with “there is a correlation between presidential pay and the performance of a university.” The results were NOT statistically significant, so we know that the decision was made that it was unsafe to go with HA. That is, there was not enough evidence to counter the claim that presidential pay and university prestige are uncorrelated. Of course, there is always a chance this was the wrong decision. That is, the decision to go with H0 when in fact HA was true could have been the wrong one. The chance of that happening is kind of like the false negative rate for a screening test. We don’t monitor false negative rates explicitly in hypothesis testing, as we have said before, but we assume the method used to test this hypothesis had robust sensitivity.

Slide 12
One more example. Search on the title of this New York Times article, or pause the video for a chance to read the article more thoroughly. The so-called “hot hand” has been debated in sports since at least 1985. Coaches and players have long believed that players go on “streaks” and get “hot,” with a tendency to continue playing very well once they start a streak of outstanding plays. This was first challenged in 1985, when researchers studied performance records from two professional and one university basketball team, finding that players “statistically were not more likely to hit a second basket after sinking a first.” More recently, other researchers have accessed much larger amounts of data and have concluded that “basketball players experienced statistically significant and recognizable hot periods over an entire game or two ….” What does this mean?

Slide 13
This is a little harder to think about than some of our first examples.
The comparison in the recent studies seems to be between the streaks of successes a player actually had and what a player would produce at random. H0 in this case is that the streaks, also known as “runs,” are no longer than would be expected by chance. HA is that the streaks are longer than would be expected by chance. We know the results were statistically significant, so we know that the decision was to accept HA; that is, there is evidence that streaks are longer than random chance would predict. Of course, this could be the wrong decision. But the false positive rate had to be estimated to be below 5 chances in 100, since the results were said to be statistically significant.

Slide 14
This concludes our video on the application of hypothesis testing concepts. Remember, to understand the use of “statistical significance” in the media, always ask what is being compared, what the null and alternative hypotheses are, how the comparison turned out, and what risks were involved in the decision that was made.
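As a final illustration, the Slide 13 “hot hand” question can be framed as a computation: take a shot log, measure the longest run of makes, and ask how often a random shuffle of the same makes and misses produces a streak at least that long. The shot log below is fabricated to contain an unusually long streak; it is not data from any of the studies discussed.

```python
import random
from itertools import groupby

def longest_streak(shots):
    """Length of the longest run of made shots (1s) in the sequence."""
    return max((len(list(g)) for k, g in groupby(shots) if k == 1), default=0)

# Fabricated shot log (1 = make, 0 = miss) containing a 12-shot hot streak.
shots = [0,1,0,0,1,1,0,0,1,0,1,1,1,1,1,1,1,1,1,1,1,1,0,0,1,0,0,0,1,0]
observed = longest_streak(shots)

# H0: the order is random -- every shuffle of the same makes and misses
# is equally likely. Estimate P(longest streak >= observed) by shuffling.
random.seed(3)
trials = 10_000
extreme = 0
for _ in range(trials):
    random.shuffle(shots)
    if longest_streak(shots) >= observed:
        extreme += 1
p_value = extreme / trials
print(f"longest streak = {observed}, p-value = {p_value:.3f}")
```

The estimated p-value is exactly the false positive rate from the template: the chance of seeing a streak this long even though H0 (pure randomness) is true. For this made-up shot log it lands well below 0.05, so the streak would be declared statistically significant.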