Slide 1
Hypothesis Testing Part I – As a Diagnostic Test

Slide 2
This video is designed to accompany pages 95-116 of the workbook “Making Sense of Uncertainty: Activities for Teaching Statistical Reasoning,” a publication of the Van-Griner Publishing Company.

Slide 3
Diagnostic tests – just another name for screening tests – have to choose between two outcomes. A positive outcome means the test has uncovered evidence of what it was designed to find. For instance, if you score high enough on a field sobriety test then the outcome will suggest you are intoxicated. A negative outcome simply means that the test did not uncover adequate evidence of what it was looking for. If you have a low score on a field sobriety test then it will likely be assumed you are sober, whether you are or not.

Slide 4
In many ways, controlled experimental studies have to make choices similar to a screening test. Positive outcomes are when the experiment has produced adequate evidence that the treatment is effective. Negative outcomes are when the experiment failed to produce enough evidence to say the treatment is effective. What constitutes “adequate evidence” is the crux of statistical hypothesis testing and, as we will see, that discussion overlaps significantly with the language used to assess the validity of a diagnostic test.

Slide 5
Let’s start by looking at an example. The study seen here first surfaced in the media in 2009. The drug Flibanserin was originally being studied as an anti-depressant, but was shown to be inadequate for that purpose. What researchers noticed during those clinical trials was that a number of the participating women indicated that their sex drive had increased. This led the manufacturer to study Flibanserin for a completely different purpose. As you see here, a clinical trial was developed wherein 1,378 premenopausal women satisfying the experimental protocol were randomly assigned to one of two groups. One group took 100 mg of Flibanserin and the other a placebo. All participants were required to keep a daily journal about whether they had sex and, if so, whether it was, in their view, “satisfying.”

Slide 6
In what sense is this experiment behaving like a diagnostic (screening) test? First of all, the experiment is designed to produce evidence that will allow an informed choice between one of two possible outcomes. Either one concludes that Flibanserin is no better than a placebo for increasing sex drive, or one concludes that Flibanserin is better than a placebo for increasing sex drive. Just like a typical screening test, this choice will have to be made based on the evidence at hand, using some meaningful numerical summary to guide the decision. It cannot be made with complete assurance that the correct outcome has been chosen. For medical experiments in general, the choice is almost always between “treatment is not effective” and “treatment is effective.” The details just depend on the actual experiment, whether a placebo was part of the study, etc.

Slide 7
In statistical science, the paradigm for choosing between “Treatment is Not Effective” and “Treatment is Effective” is called hypothesis testing. As a diagnostic test, hypothesis testing is not a kit or a physical examination or an agility test. Rather, it is a collection of mathematical steps that take the data the experiment generated and produce something very much akin to a false positive rate to assist in choosing between the “negative” and “positive” outcomes.
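To make Slide 7’s “collection of mathematical steps” concrete, here is a minimal sketch in Python of one standard recipe, a two-proportion z-test, applied to a trial shaped like the one on Slide 5. The response counts, and the 690/688 split of the 1,378 participants, are invented for illustration; they are not the actual Flibanserin results.

```python
# Minimal sketch of the "mathematical steps" behind a two-group comparison
# shaped like the Flibanserin trial. All counts below are hypothetical,
# invented purely for illustration; they are not the actual trial results.
from math import sqrt, erfc

# Hypothetical data: number of women reporting satisfying sexual events,
# out of the participants assigned to each group.
drug_success, drug_n = 390, 690          # Flibanserin group (hypothetical)
placebo_success, placebo_n = 330, 688    # placebo group (hypothetical)

p_drug = drug_success / drug_n
p_placebo = placebo_success / placebo_n

# Pooled proportion under the null hypothesis that the drug is no better
# than the placebo (i.e., both groups share one underlying response rate).
p_pool = (drug_success + placebo_success) / (drug_n + placebo_n)
se = sqrt(p_pool * (1 - p_pool) * (1 / drug_n + 1 / placebo_n))

# z measures how many standard errors the observed difference sits above
# what the null hypothesis predicts (a difference of zero).
z = (p_drug - p_placebo) / se

# One-sided tail probability: the chance of a difference at least this
# large if the drug truly is no better than the placebo. erfc gives the
# upper tail of the normal curve.
p_value = 0.5 * erfc(z / sqrt(2))

print(f"difference = {p_drug - p_placebo:.3f}, z = {z:.2f}, p-value = {p_value:.4f}")
```

The tail probability computed at the end is exactly the kind of “false positive rate” quantity the next slides develop.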
Slide 8
The decisions made in hypothesis testing come with the same risks as, say, the decision made by a home pregnancy test, or the Beck Depression Inventory that screens for clinical depression. A test of hypothesis may produce insufficient evidence that the treatment is effective when, unbeknownst to the experimenters, it is. Likewise, the test may suggest that the data produced by the experiment are sufficient to say that the treatment is effective when, in fact, it is not. Hence, just like common screening tests, a test of hypothesis is susceptible to false negatives and false positives.

Slide 9
In general, we know that both sensitivity and specificity are used to evaluate how well a screening test performs. This, in turn, informs our confidence in the results produced by the test in practice. In hypothesis testing, the mathematical steps used to process the data are often steps that are known within statistical science to possess quite good sensitivity. This isn’t true for all types of hypothesis testing, but a deeper discussion in that direction would be beyond the scope of this video. Sensitivity in hypothesis testing is similar to what statisticians call “power” and, by and large, the hypothesis testing procedures that are reported in the media can be shown to have good power. It turns out that the specificity of the testing procedure is what drives the practical decision of choosing between “Treatment is Not Effective” and “Treatment is Effective.”

Slide 10
Let’s work a little harder on the analogy between diagnostic testing and hypothesis testing. In a typical screening scenario we have data from test subjects, both on what the screening test has predicted and on what the gold standard test has said about their actual status. From those data, typically arrayed in a two-by-two table, we can compute sensitivity and specificity as a way of assessing how well the test is working. Some will call this the reliability of the test, others the validity. Why do we care? Don’t forget, in practice the test is applied in the absence of any gold standard. You may use a home pregnancy test, get screened for depression, or be given a field sobriety test, but then what do you make of the results? Clearly, if the test has done well in its validation stage then the results are more believable. In particular, if the specificity of the test was high, and the test gave you a “yes” (positive), then you are more likely to believe that, since you’d know the chances of a false positive are low. Likewise, if the sensitivity of the test was high, and it gave you a “no” (negative), then you are more likely to believe that as well, since you would know that the chances of a false negative are small. Keep in mind, embedded in all these screening tests is a rule that decides when to report back a “yes” and when to report back a “no.” It’s part of the chemistry inside the home pregnancy test; it’s the score level set by the psychologist when using the Beck Inventory; and it’s the discretion of the police officer administering the field sobriety test. But the rule matters. Change the rule, and the sensitivity and specificity measures will almost surely change.
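Slide 10’s two-by-two arithmetic is easy to carry out in code. The sketch below uses an invented validation table, 100 subjects who truly have the condition and 900 who do not, purely to show where sensitivity and specificity come from; the counts are not taken from any real screening study.

```python
# Sketch of the Slide 10 calculation: sensitivity and specificity from a
# two-by-two validation table. The counts are invented for illustration.

# Gold-standard truth versus what the screening test predicted.
true_positive = 86    # condition present, test said "yes"
false_negative = 14   # condition present, test said "no"  (missed cases)
false_positive = 9    # condition absent,  test said "yes" (false alarms)
true_negative = 891   # condition absent,  test said "no"

# Sensitivity: of those who truly have the condition, the fraction the
# test correctly flags. 1 - sensitivity is the false negative rate.
sensitivity = true_positive / (true_positive + false_negative)

# Specificity: of those who truly lack the condition, the fraction the
# test correctly clears. 1 - specificity is the false positive rate.
specificity = true_negative / (true_negative + false_positive)

print(f"sensitivity = {sensitivity:.2f}")            # 0.86
print(f"specificity = {specificity:.2f}")            # 0.99
print(f"false positive rate = {1 - specificity:.2f}")  # 0.01
```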
Slide 11
Hypothesis testing is very similar in some important ways. Data are collected from experimental subjects. However, the truth – whether the treatment is effective or not – is not possible to know. In lieu of a tangible gold standard, we hypothesize that the treatment is ineffective, adopt a decision rule, and see how risky it would be to apply that rule if the treatment really is ineffective. This is the essence of formal inferential reasoning. The most awkward part of the analogy to screening tests is the rule. Think of the rule as being set to return an automatic “YES” (positive) based on the data from the experiment, no matter what the data say. That’s just the rule, keep in mind, not the conclusion. The decision to say the Treatment is Not Effective or to say the Treatment is Effective will be made by assessing how risky it would be to apply that rule if the treatment really is ineffective. So you are asking: “if I adjust the cutoff just enough so that I can reject H0 based on the data I have, how risky is that?” Measuring risk in this context is just measuring the false positive rate. That is, given we’ve assumed the treatment is ineffective, we are asking, “under that assumption, how likely are we to be wrong if we say the treatment is effective, based on the data at hand?” How likely are we to commit a “false positive,” saying that the treatment is effective when really it is (assumed) not? Calculating this type of FPR is different from calculating it for field sobriety tests or home pregnancy tests, but the idea is the same. If that FPR is small enough, then we will accept the “YES” being offered by the Awkward Rule, knowing the risk of being wrong is small. If the FPR is too big, then we won’t trust the recommended “YES” and will instead conclude that there was not enough evidence to say the treatment was effective.

Slide 12
If the estimated false positive rate for choosing between “Treatment is Not Effective” and “Treatment is Effective” is small enough – usually taken to be less than 5/100, or 0.05 – then the results of the experiment are said to be statistically significant.

Slide 13
There is a well-used, special notation that will simplify our discussion. The choice being made in the experiments we have described is a choice between a null hypothesis (denoted by H0) and an alternative hypothesis (denoted by HA). The null is that the treatment is not effective, and the alternative is that the treatment is effective. So, in terms of this new notation, the estimated FPR facilitates a choice between H0 and HA. If the estimated FPR is less than 0.05 then we choose HA. Otherwise we choose H0 or, more properly said, we fail to choose HA. Your instructor may or may not choose to dwell on this important distinction.

Slide 14
The estimated FPR in this context has a much more common name in the popular media, and in statistical science. It is called a “p-value.” The “p” stands for “probability.” Don’t confuse this notation with the parameter p; they are completely different concepts.
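One way to see the risk measurement of Slides 11 and 12 in action is to simulate it. The sketch below, reusing the invented counts from the earlier sketch, replays the experiment many times under the assumption that the treatment is ineffective and records how often an ineffective treatment produces data at least as extreme as what was observed. That proportion estimates the FPR, the p-value of Slide 14.

```python
# Simulation sketch of Slides 11-12: estimate the false positive rate (the
# p-value) by replaying the experiment under the null hypothesis. Reuses
# the hypothetical counts from the earlier sketch; all numbers are invented.
import random

random.seed(1)

drug_success, drug_n = 390, 690
placebo_success, placebo_n = 330, 688
observed_diff = drug_success / drug_n - placebo_success / placebo_n

# Under H0 the drug is no better than the placebo, so both groups share
# one underlying response rate; the pooled rate is our best guess at it.
p_pool = (drug_success + placebo_success) / (drug_n + placebo_n)

def simulate_diff():
    """One replay of the experiment assuming the treatment is ineffective."""
    drug = sum(random.random() < p_pool for _ in range(drug_n))
    placebo = sum(random.random() < p_pool for _ in range(placebo_n))
    return drug / drug_n - placebo / placebo_n

# The "Awkward Rule" says YES for the observed data. How risky is that?
# Count how often an ineffective treatment yields a difference at least
# as large as the one we saw; that proportion estimates the FPR (p-value).
trials = 10_000
extreme = sum(simulate_diff() >= observed_diff for _ in range(trials))
p_value = extreme / trials

print(f"estimated FPR (p-value) = {p_value:.4f}")
print("statistically significant" if p_value < 0.05 else "not significant")
```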
Slide 15
Statistical science is a very complex endeavor, not only mathematically, but also conceptually. In this brief video we have tried to lay out the idea of statistical hypothesis testing in a way that allows it to be compared to testing the validity of a screening test. The latter is accessible and a skill worth having in its own right. We have taken some liberties in our presentation, though. There are two very different approaches to classical hypothesis testing, one due to Fisher and the other attributed to Neyman and Pearson. Depending on which conceptual approach one adopts, one has to distinguish p-values from so-called “Type I error rates.” What we have described as a p-value is really more like a Type I error rate. However, we are not going to try to tease out such subtle differences, even if they are important, partly because that journey would be too long and difficult, and also because the language used to talk about p-values in practice is typically the language of Type I error rates. Your instructor may want to offer more details.

Slide 16
This concludes our video on how to think of hypothesis testing as a diagnostic (screening) test. Remember, statistical hypothesis testing amounts to a screening test that chooses between a null hypothesis and an alternative hypothesis based on the size of the estimated false positive rate.