Transcript

Slide 1
Hypothesis Testing Part III – Applying the Concepts
Slide 2
This video is designed to accompany pages 95-116 of the workbook “Making Sense of Uncertainty:
Activities for Teaching Statistical Reasoning,” a publication of the Van-Griner Publishing Company
Slide 3
While you may not find yourself actually computing p-values in your chosen profession, you will without
a doubt be trying to interpret those found by others. In particular, the phrase “statistically significant”
or some very similar wording is widely used in media.
Have a look at the New York Times article on multivitamins, appearing October 17th, 2012. The article
addresses a clinical trial that unfolded over more than a decade. Male doctors were followed during
that period of time and those taking a daily multivitamin were compared to those who took a placebo
(dummy pill).
What did the researchers find? You can see from the article that those taking the multivitamins
“experienced 8 percent fewer cancers ….”
But it is the first line of the last paragraph that invokes a statistical badging of sorts. “The reduction in
total cancers was small but statistically significant ….”
Our goal for this video is to make sure that all of us know what this means, to know in what sense it
adds to the credibility of the results, and in what sense it does not.
Slide 4
You have learned elsewhere that when the phrase “statistically significant” is used then a hypothesis has
been tested. That is, a choice between a null and an alternative hypothesis has been put forward, and a
decision has been made. If the results were statistically significant then you know that the alternative
has been chosen. Else, either the null was chosen or, at least, no endorsement was given for the
alternative. Finally, you also know, or are often safe to assume, that this decision was made because an
estimated false positive rate was computed and found to be less than 0.05.
That is, you know that the risk involved in choosing HA when H0 was really true is less than 5 chances in
100. All of this information is embedded in that phrase.
To summarize: when reading the words “statistically significant” or “not statistically significant” you
should be able to establish what is being compared; express that choice between a null and an
alternative hypothesis; say how the comparison turned out; and what risk was associated with the
conclusion.
Keep in mind that saying a result is “significant” is not at all the same as saying it is “statistically
significant.”
Slide 5
Let’s apply what we have just learned to the New York Times article. The comparison is between the
number of cancers seen in the group taking the multivitamins and the group taking the placebo.
Remembering that the null is always generically “treatment is not effective,” we know that the null
hypothesis in this case is that the number of cancers in the vitamin group is the same as in the placebo
group. Therefore the alternative is that the number of cancers in the vitamin group is less than in the
placebo group.
If you are confused about how to tell which is the null and which is the alternative, the next two steps
really determine that. We are told that the results were statistically significant. What results? The
results that said the vitamin group had fewer total cancers than the placebo group. Since we know that
the results were statistically significant, we know that the decision was made to go with the alternative.
Hence, the alternative has to be that the total number of cancers in the vitamin group was less than in
the placebo group.
Of course, there is always the risk of wrongly saying HA is true, that is, the risk of saying HA is true
when H0 really is. But that is what the estimated FPR, the p-value, quantifies for us. And in this case, that risk is
less than 5 chances in 100. We know that because that is what it means to be statistically significant.
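As a minimal sketch of the arithmetic behind such a claim, here is a pooled two-proportion z-test using only the Python standard library. The counts below are made up for illustration; the article reports only the 8 percent figure, not the raw numbers.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_proportion_p_value(x1, n1, x2, n2):
    """One-sided p-value for H0: p1 = p2 vs HA: p1 < p2,
    using the pooled two-proportion z-test."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return normal_cdf(z)  # area to the left: evidence that p1 < p2

# Hypothetical counts, NOT the study's actual data:
# 1,290 cancers among 7,000 vitamin takers vs 1,400 among 7,000 on placebo.
p_value = two_proportion_p_value(1290, 7000, 1400, 7000)
print(p_value < 0.05)  # a p-value below 0.05 is "statistically significant"
```

With these illustrative counts the estimated false positive rate comes out well under 5 chances in 100, which is exactly the kind of computation hiding behind the phrase in the article.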
Slide 6
Let’s take a look at another example. The excerpt shown is from a 2013 article that appeared in the
Chicago Tribune. You should be able to find the entire article by just searching on the title.
This study was not an experiment but an observational study of adolescents who went to McDonald’s
and Subway. Researchers collected data on what those adolescents actually ate at each of the
restaurants and kept track of the total number of calories consumed.
What did they find? Look at the last paragraph. They bought an average of 1,038 calories at
McDonald’s and 955 at Subway. But that calorie difference was “not statistically significant,” the
article goes on to say.
Let’s apply our template for analyzing what this means.
Slide 7
What is being compared? The calories consumed by teens at Subway and those consumed by teens at
McDonald’s.
How do we know what H0 and HA are? Look at the third part of the template first and that may make
this easier to answer. The difference between the calories consumed at Subway and McDonald’s was
not statistically significant we are told. That means that whatever H0 and HA are, the decision was
made to not go with HA.
Thus H0 has to be “number of calories consumed at Subway is the same as at McDonald’s” and HA has
to be “number of calories consumed at Subway is less than at McDonald’s.”
Since the decision was made not to endorse HA, we know the estimated false positive rate was 0.05 or
larger. Hence, it was too risky in the sense that the chances of going with HA when H0 was really true
were too high to justify that decision.
Keep in mind that the risk of going with H0 when HA is really true is not being explicitly monitored in
hypothesis testing, but in general is assumed to be pretty small for common hypothesis tests.
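The calorie example shows how a real gap in averages can still fail to be statistically significant when the data are noisy. The sketch below uses the two means from the article but entirely hypothetical standard deviations and group sizes, since the article gives neither; it is a large-sample z-test for two means, not the researchers' actual method.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_mean_p_value(mean1, sd1, n1, mean2, sd2, n2):
    """One-sided p-value for H0: mu1 = mu2 vs HA: mu1 > mu2,
    using a large-sample z-test for a difference in means."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    z = (mean1 - mean2) / se
    return 1.0 - normal_cdf(z)

# Means from the article; SDs (500 calories) and sizes (100 each) are hypothetical.
p_value = two_mean_p_value(1038, 500, 100, 955, 500, 100)
print(p_value >= 0.05)  # not statistically significant under these assumptions
```

Under these assumed spreads, an 83-calorie difference is within the range chance alone could easily produce, so the decision would be to stay with H0.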
Slide 8
Let’s continue to practice. The New York Times article shown here – again you can almost surely find the
full article if you search on the title – addresses weight loss in people with serious mental illness. While
we don’t have an abundance of information from the article, we do see in the last paragraph that 24
well-designed studies of weight loss programs for the mentally ill were scrutinized. We are told that
“most achieved statistically significant weight loss, but very few achieved ‘clinically significant’ weight
loss.”
What does this mean?
Slide 9
Let’s start by applying our four-step template.
Each study being scrutinized compared the weights of the patients before they started a given fitness
program for the mentally ill to their weights after completing it.
In terms of H0 and HA, we know that H0 is that the program didn’t work, or that the weight before
starting the program was no different than the weight afterwards. HA has to be that the program
being looked at worked, that is, there was a weight loss for patients after completing the program.
We know the results were statistically significant so we know that from a statistical perspective we can
safely assume HA is true. In fact, the chances of saying HA is true, based on the data from such a study,
when in fact H0 is true, are less than 5 in 100.
This article has an interesting nuance, however. The researchers also concluded that few of the studies
found any practically significant weight loss. That is to say, the change in weight for patients in a given
study, likely an average, was big enough to be statistically significant, but not big enough to be
practically significant. More will be said on this important distinction in another video.
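A small numeric sketch can make the statistical-versus-clinical distinction concrete. All numbers below are hypothetical, including the clinical cutoff; the point is only that a large sample can make a modest average loss statistically significant without making it clinically meaningful.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical study: 200 patients, average loss 2.0 kg, SD 6.0 kg.
n, mean_loss, sd = 200, 2.0, 6.0
se = sd / math.sqrt(n)
z = mean_loss / se                      # test of H0: average loss = 0
p_value = 1.0 - normal_cdf(z)

CLINICAL_THRESHOLD_KG = 5.0             # hypothetical cutoff for a meaningful loss
statistically_sig = p_value < 0.05      # True: the loss is real, not chance
clinically_sig = mean_loss >= CLINICAL_THRESHOLD_KG  # False: the loss is small
```

The test confidently rules out "no effect," yet the effect itself falls short of the assumed clinical bar, mirroring the article's finding.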
Slide 10
Here is another Tribune article. This one has a look at a study that compared the pay of presidents at
various universities with measures of prestige for those universities.
The findings are in the last paragraph. “….[N]o statistically significant relationship was observed
between academic quality and presidential pay.”
Let’s make sure we know what this is telling us.
Slide 11
Presidential pay at low-achieving universities is being compared to presidential pay at high-achieving
universities.
The null would be something to the effect of “there is no correlation between presidential pay and
performance of a university.” The alternative would counter with “there is a correlation between
presidential pay and performance of a university.”
The results were NOT statistically significant, so we know that the decision was made that it was unsafe
to go with HA. That is, there was not enough evidence to counter the claim that presidential pay and
university prestige are uncorrelated.
Of course, there is always a chance this was the wrong decision. That is, the researchers may have gone
with H0 when in fact HA was true. The chance of that happening is kind of like a false
negative rate for a screening test. We don’t monitor false negative rates explicitly in hypothesis testing,
as we have said before, but we assume the method they used to test this hypothesis had robust
sensitivity.
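To see what "no statistically significant relationship" might look like numerically, here is a sketch of a correlation test using Fisher's z-transformation as a large-sample approximation. The quality ranks and pay figures are invented for illustration; they are not from the study.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlation_p_value(xs, ys):
    """Two-sided p-value for H0: no correlation, via Fisher's z."""
    r = pearson_r(xs, ys)
    z = math.atanh(r) * math.sqrt(len(xs) - 3)
    return 2.0 * (1.0 - normal_cdf(abs(z)))

# Hypothetical data: academic-quality rank (1 = lowest) vs pay in $1,000s.
quality = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
pay = [620, 540, 700, 480, 650, 510, 690, 560, 600, 640]
p_value = correlation_p_value(quality, pay)
print(p_value >= 0.05)  # not statistically significant for these made-up data
```

With data this scattered, the estimated correlation is close to zero and the p-value is far above 0.05, so the decision would be to stay with H0, just as in the article.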
Slide 12
One more example. Search on the title of this New York Times article or pause the video for a chance to
read the article more thoroughly.
The so-called “hot hand” has been debated in sports since at least 1985. Coaches and players have long
believed that players go on “streaks” and get “hot,” with a tendency to continue playing very well once
they start a streak of outstanding plays. This was first challenged in 1985, when researchers studied
performance records from two professional teams and one university basketball team, finding that players
“statistically were not more likely to hit a second basket after sinking a first.”
More recently other researchers have accessed much larger amounts of data and have concluded that
“basketball players experienced statistically significant and recognizable hot periods over an entire game
or two ….”
What does this mean?
Slide 13
This is a little harder to think about than some of our first examples.
The comparison in the recent studies seems to be between the streaks of free throw successes a player
actually produced and the streaks the player would produce at random.
H0 in this case is that the streaks, also known as “runs,” are no longer than would be expected by
chance. HA is that the streaks are longer than would be expected by chance.
We know the results were statistically significant so we know that the decision was to accept HA, that is,
there is evidence that streaks are longer than expected by random chance.
Of course, this can be the wrong decision. But the false positive rate had to be estimated to be below 5
chances in 100 since the results were said to be statistically significant.
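One simple way to estimate such a false positive rate is by simulation: ask how often a purely random shooter would produce a streak as long as the one observed. This Monte Carlo sketch uses hypothetical numbers (a 50 percent shooter, 100 shots, a 12-make streak), not anything from the studies cited.

```python
import random

def longest_run(shots):
    """Length of the longest streak of consecutive makes."""
    best = cur = 0
    for made in shots:
        cur = cur + 1 if made else 0
        best = max(best, cur)
    return best

def streak_p_value(observed_longest, n_shots, p_make, sims=5000, seed=1):
    """Estimated chance that a purely random shooter produces a streak
    at least as long as the observed one (a Monte Carlo runs test)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        shots = [rng.random() < p_make for _ in range(n_shots)]
        if longest_run(shots) >= observed_longest:
            hits += 1
    return hits / sims

# Hypothetical: a 50% shooter takes 100 shots and at some point hits 12 in a row.
p_value = streak_p_value(12, 100, 0.5)
print(p_value < 0.05)  # a streak that long rarely happens by chance alone
```

Because chance alone almost never produces a streak that long, the estimated false positive rate falls below 0.05 and the decision would be to go with HA, paralleling the "hot periods" conclusion.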
Slide 14
This concludes our video on the application of hypothesis testing concepts. Remember, to understand
the usage of “statistical significance” in the media, always ask what is being compared, what the null and
alternatives are, how the comparison turned out, and what risks were involved in the decision that was
made.