Transcript

Slide 1
Hypothesis Testing Part I – As a Diagnostic Test
Slide 2
This video is designed to accompany pages 95-116 of the workbook “Making Sense of Uncertainty:
Activities for Teaching Statistical Reasoning,” a publication of the Van-Griner Publishing Company
Slide 3
Diagnostic tests – just another name for screening tests – have to choose between two outcomes. A
positive outcome means the test has uncovered evidence of what it was designed to find. For instance,
if you score high enough on a field sobriety test then the outcome will suggest you are intoxicated.
A negative outcome simply means that the test did not uncover adequate evidence of what the test was
looking for. If you have a low score on a field sobriety test then it will likely be assumed you are sober,
whether you are or not.
Slide 4
In many ways, controlled experimental studies have to make choices similar to a screening test. Positive
outcomes are when the experiment has produced adequate evidence that the treatment is effective.
Negative outcomes are when the experiment failed to produce enough evidence to say the treatment is
effective.
What constitutes “adequate evidence” is the crux of statistical hypothesis testing and, as we will see,
that discussion overlaps significantly with the language used to assess the validity of a diagnostic test.
Slide 5
Let’s start by looking at an example. The study seen here first surfaced in the media in 2009. The drug
Flibanserin was originally being studied as an anti-depressant, but was shown to be inadequate for that
purpose. What researchers noticed during those clinical trials was that a number of women who
participated in the trials indicated that their sex drive had increased. This led the manufacturer to study
Flibanserin for a completely different purpose.
As you see here, a clinical trial was developed wherein 1,378 premenopausal women satisfying the
experimental protocol were randomly assigned to one of two groups. One group took 100 mg of
Flibanserin and the other a placebo. All participants were required to keep a daily journal about whether
they had sex and, if so, whether it was in their view “satisfying.”
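To make the design concrete, here is a minimal Python sketch of the randomization step. The even 689/689 split is an assumption made purely for illustration; the transcript does not state the actual group sizes.

```python
import random

# A minimal sketch of randomly assigning 1,378 eligible participants to
# two groups. The even 689/689 split is an assumption for illustration.
random.seed(1)

participants = list(range(1378))        # one ID per eligible woman
random.shuffle(participants)            # random assignment
flibanserin_group = participants[:689]  # takes 100 mg of Flibanserin
placebo_group = participants[689:]      # takes the placebo

print(len(flibanserin_group), len(placebo_group))  # 689 689
```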
Slide 6
In what sense is this experiment behaving like a diagnostic (screening) test?
First of all, the experiment is designed to produce evidence that will allow an informed choice between
one of two possible outcomes. Either one concludes that Flibanserin is no better than a placebo for
increasing sex drive, or one concludes that Flibanserin is better than a placebo for increasing sex drive.
Just like a typical screening test, this choice will have to be made based on the evidence at hand, using
some meaningful numerical measure to guide the decision. It can’t be made with complete assurance that the
correct outcome was chosen.
For medical experiments in general the choice is almost always between “treatment is not effective”
and “treatment is effective.” The details just depend on the actual experiment, whether a placebo was
part of the study, etc.
Slide 7
In statistical science, the paradigm for choosing between “Treatment is Not Effective” and “Treatment is
Effective” is called hypothesis testing.
As a diagnostic test, hypothesis testing is not a kit or a physical examination or an agility test. Rather, it
is a collection of mathematical steps that take the data the experiment generated and produce
something very much akin to a false positive rate to assist in choosing between the “negative” and
“positive” outcomes.
Slide 8
The decisions made in hypothesis testing come with the same risks as, say, the decision made by a home
pregnancy test, or the Beck Depression Inventory that screens for clinical depression.
A test of hypothesis may produce insufficient evidence that the treatment is effective when,
unbeknownst to the experimenters, it is. Likewise, the test may suggest that the data produced by the
experiment are sufficient to say that the treatment is effective when, in fact, it is not.
Hence, just like common screening tests, a test of hypothesis may be susceptible to false negatives and
false positives.
Slide 9
In general we know that both sensitivity and specificity are used to evaluate how well a screening test
performs. This, in turn, informs our confidence in the results produced by the test in practice.
In hypothesis testing, the mathematical steps used to process the data are often steps known
within statistical science to possess quite good sensitivity. Now, this isn’t true for all types of hypothesis
testing, but a deeper discussion in that direction would be beyond the scope of this video. Sensitivity in
hypothesis testing is similar to what statisticians call “power” and, by and large, the hypothesis testing
procedures that are reported in the media can be shown to have good power.
It turns out that the specificity of the testing procedure is what drives the practical decision of choosing
between “Treatment is Not Effective” and “Treatment is Effective.”
Slide 10
Let’s work a little harder on the analogy between diagnostic testing and hypothesis testing. In a typical
screening scenario we have data from test subjects, both on what the screening test has predicted and
what the gold standard test has said about their actual status.
From those data, typically arrayed in a two by two table, we can compute sensitivity and specificity as a
way of assessing how well the test is working. Some will call this the reliability of the test, others the
validity.
Why do we care? Don’t forget, in practice the test is applied in the absence of any gold standard. You
may use a home pregnancy test, get screened for depression, or be given a field sobriety test, but
then what do you make of the results? Clearly, if the test has done well in its validation stage then the
results are more believable. In particular, if the specificity of the test was high, and the test gave you a
“yes” (positive), then you are more likely to believe that since you’d know the chances of a false positive
are low. Likewise, if the sensitivity of the test was high, and it gave you a “no” (negative), then you are
more likely to believe that, as well, since you would know that the chances of a false negative are small.
Keep in mind, embedded in all these screening tests is a rule that decides when to report back a “yes”
and when to report back a “no.” It’s part of the chemistry inside the home pregnancy test; it’s the score
level set by the psychologist when using the Beck Inventory, and it’s the discretion of the police officer
when using the field sobriety test. But the rule matters: change the rule and the sensitivity and
specificity measures will almost surely change.
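As a concrete illustration, here is a short Python sketch, with entirely invented scores and gold-standard statuses, that computes sensitivity and specificity from a two-by-two tally and shows how moving the cutoff changes both.

```python
# Sensitivity and specificity from a two-by-two validation tally, and how
# they shift as the test's cutoff moves. All data below are hypothetical.

def sensitivity_specificity(scores, truly_positive, cutoff):
    """Classify score >= cutoff as 'yes' and compare to the gold standard."""
    tp = sum(1 for s, pos in zip(scores, truly_positive) if pos and s >= cutoff)
    fn = sum(1 for s, pos in zip(scores, truly_positive) if pos and s < cutoff)
    tn = sum(1 for s, pos in zip(scores, truly_positive) if not pos and s < cutoff)
    fp = sum(1 for s, pos in zip(scores, truly_positive) if not pos and s >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical screening scores and gold-standard statuses.
scores = [2, 3, 5, 6, 7, 8, 4, 9, 1, 6]
truth  = [False, False, False, True, True, True, False, True, False, True]

for cutoff in (4, 6, 8):
    sens, spec = sensitivity_specificity(scores, truth, cutoff)
    print(f"cutoff={cutoff}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Notice the trade-off: raising the cutoff tends to buy specificity at the price of sensitivity, which is exactly why the rule matters.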
Slide 11
Hypothesis testing is very similar in some important ways. Data are collected from experimental
subjects. However, the truth – whether the treatment is effective or not – is not possible to know. In
lieu of a tangible gold standard, we hypothesize that the treatment is ineffective, adopt a decision rule,
and see how risky it would be to apply that rule if the treatment really is ineffective. This is the essence
of formal inferential reasoning.
The most awkward part of the analogy to screening tests is the rule. Think of the rule as being set to
an automatic “YES” (positive), no matter what the data from the experiment say.
That’s just the rule, keep in mind, not the conclusion. The decision to say the Treatment is Not Effective
or to say the Treatment is Effective will be made by assessing how risky it would be to apply that rule if
the treatment really is ineffective. So you are asking: “if I adjust the cutoff just enough so that I can
reject H0 based on the data I have, how risky is that?”
Measuring risk in this context is just measuring the false positive rate. That is, given we’ve assumed the
treatment is ineffective, we are asking, “under that assumption, how likely are we to be wrong if we say
the treatment is effective, based on the data at hand?”
How likely are we to commit a “false positive,” saying that the treatment is effective when really it is
(assumed) not?
Calculating this type of FPR is different than for field sobriety tests, or home pregnancy tests, but the
idea is the same. If that FPR is small enough, then we will accept the “YES” being offered by the
Awkward Rule and know the risk of being wrong is small. If the FPR is too big, then we won’t trust the
recommended “YES” and, instead, conclude that there was not enough evidence to say the treatment
was effective.
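This risk calculation can be made concrete with a small simulation. The numbers below are hypothetical, not the trial’s actual results: suppose 380 of 689 drug-group women and 340 of 689 placebo-group women reported satisfying sex. Under the assumption that the drug is no better than placebo, the group labels are interchangeable, so we can shuffle them and count how often chance alone produces a gap at least as large as the one observed.

```python
import random

# A rough sketch of the "how risky is a YES?" calculation, done by
# simulation. All counts are hypothetical, purely for illustration.
random.seed(1)

# 380/689 drug-group successes and 340/689 placebo-group successes.
outcomes = [1] * (380 + 340) + [0] * ((689 - 380) + (689 - 340))
observed_gap = 380 - 340

at_least_as_big = 0
trials = 10_000
for _ in range(trials):
    random.shuffle(outcomes)                   # null: labels are interchangeable
    drug, placebo = outcomes[:689], outcomes[689:]
    if sum(drug) - sum(placebo) >= observed_gap:
        at_least_as_big += 1

print("estimated FPR (p-value):", at_least_as_big / trials)
```

The printed rate plays the role of the FPR described above: if it is small, the automatic “YES” can be trusted; if it is large, we conclude there was not enough evidence.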
Slide 12
If the estimated false positive rate for choosing between “Treatment is Not Effective” and “Treatment is
Effective” is small enough – usually taken to be less than 5/100 or 0.05 – then the results of the
experiment are said to be statistically significant.
Slide 13
There is a well-used, special notation that will simplify our discussion. The choice being made in the
experiments we have described is a choice between a null hypothesis (denoted by H0) and an
alternative hypothesis (denoted by HA). The null is that the treatment is not effective, and the
alternative is that the treatment is effective.
So, in terms of this new notation, the estimated FPR facilitates a choice between H0 and HA. If the
estimated FPR is less than 0.05 then we choose HA. Otherwise we choose H0 or, more properly said, we fail to
choose HA. Your instructor may or may not choose to dwell on this important distinction.
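In code, the decision rule of the last two slides is a one-line comparison. The p-value plugged in below is a hypothetical one, of the kind the shuffle sketch earlier might produce, not a real result.

```python
# The decision rule: choose HA when the estimated FPR (p-value) is below
# 0.05; otherwise fail to choose HA. The value here is hypothetical.
p_value = 0.016

if p_value < 0.05:
    print("Choose HA: treatment is effective (statistically significant).")
else:
    print("Fail to choose HA: not enough evidence against H0.")
```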
Slide 14
The estimated FPR in this context has a much more common name in the popular media, and in
statistical science. It is called a “p-value.” The “p” stands for “probability.” Don’t confuse this notation
with the parameter p. They are completely different concepts.
Slide 15
Statistical science is a very complex endeavor, not only mathematically, but also conceptually. In this
brief video we have tried to lay out the idea of statistical hypothesis testing in a way that allows it to be
compared to testing the validity of a screening test. The latter is accessible and a skill worth having in
its own right.
We have taken some liberties in our presentation, though. There are two very different approaches to
classical hypothesis testing, one due to Fisher and the other attributed to Neyman and Pearson.
Depending on which conceptual approach one adopts, one has to distinguish p-values from so-called
“Type I error rates.” What we have described as a p-value is really more like a Type I error rate.
However, we are not going to try to tease out such subtle differences, even if they are important,
partly because that journey would be too long and difficult, and also because the language used to talk
about p-values in practice is typically the language of Type I error rates. Your instructor may want to
offer more details.
Slide 16
This concludes our video on how to think of hypothesis testing as a diagnostic (screening) test.
Remember, statistical hypothesis testing amounts to a screening test that chooses between a null
hypothesis and an alternative hypothesis based on the size of the estimated false positive rate.