Hypothesis Testing

Part I – As a Diagnostic Test

This video is designed to accompany pages 95-116 in

Making Sense of Uncertainty

Activities for Teaching Statistical Reasoning

Van-Griner Publishing Company

Diagnostic Tests

Diagnostic tests, such as field sobriety tests and home pregnancy tests, have to choose between two outcomes.

• A “positive” outcome means the test has uncovered adequate evidence of what it is designed to find.

• A “negative” outcome means the test does not have adequate evidence of what it is designed to find.


Many experiments in medicine, social science, education, etc. are designed to make similar binary choices.

• A “positive” outcome means the experiment has uncovered adequate evidence that the treatment is effective.

• A “negative” outcome means the experiment has not found adequate evidence that the treatment is effective.

Flibanserin Study

From TIME:

The Flibanserin findings are based on the study of 1,378 premenopausal women who had been in a monogamous relationship for 10 years on average. The women were randomly assigned to take 100 mg of Flibanserin or a placebo daily and to record daily whether they had sex and whether it was satisfying.

The Choice

In this case:

Flibanserin is no better than a Placebo
or
Flibanserin is better than a Placebo

In general:

Treatment is Not Effective
or
Treatment is Effective

Hypothesis Testing

The paradigm for deciding between “Treatment is Not Effective” and “Treatment is Effective” is an example of what statisticians call “hypothesis testing.”

This diagnostic test is not a kit or a physical examination.

Rather, it consists of a collection of mathematical steps.

No Surprise

This decision cannot be made risk-free. There are two potential mistakes: a false positive and a false negative.

                                   Experimental Results
                                   Treatment Ineffective    Treatment Effective
Treatment Really is Ineffective    True Negative            False Positive
Treatment Really is Effective      False Negative           True Positive

Diagnostic Due Process

Generally, sensitivity and specificity are used to evaluate how well a screening test performs. That evaluation informs our confidence in results produced by the test.

Statistical science tends to focus on specificity for a similar role in hypothesis testing.

This is partly because the sensitivity of most common hypothesis testing procedures is pretty good.

Typical Screening


Data from



Truth from Gold


Based on Some



              Actual Status
Prediction    Negative    Positive
Negative      A           B
Positive      C           D

Compute Sensitivity and Specificity

Apply the Test to a Real Person

Get Yes/No Result

More likely to believe a “Yes” if the Specificity is high; a “No” if the Sensitivity is high.
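As a minimal sketch of the "compute" step, the Python below turns the four cell counts of the table into a sensitivity and a specificity. It assumes the table is read with rows as the test's prediction and columns as the actual status, so that A counts true negatives, B false negatives, C false positives, and D true positives. The function name and the counts in the usage lines are made up for illustration.

```python
def sensitivity_specificity(a, b, c, d):
    """Return (sensitivity, specificity) from the four table counts.

    Assumed reading of the table (rows = prediction, columns = actual status):
      a = predicted negative, actually negative (true negatives)
      b = predicted negative, actually positive (false negatives)
      c = predicted positive, actually negative (false positives)
      d = predicted positive, actually positive (true positives)
    """
    # Sensitivity: share of actual positives that the test flags as positive.
    sensitivity = d / (b + d)
    # Specificity: share of actual negatives that the test flags as negative.
    specificity = a / (a + c)
    return sensitivity, specificity


# Hypothetical counts, chosen only to illustrate the arithmetic.
sens, spec = sensitivity_specificity(a=90, b=5, c=10, d=45)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

Note that 1 minus the specificity is the false positive rate, which is the quantity the hypothesis testing analogy focuses on.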

Hypothesis Testing Analogy

Data from



A Truth is Hypothesized

Adopt Awkward Rule:

“ Based on the data from the experiment, say the treatment is effective.” This is like an automatic


Compute False Positive Rate for Awkward Rule.

If FPR is small enough, accept the “YES” and conclude treatment is effective.

Else: don’t trust the recommended “YES” and conclude that the treatment is not effective.
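The slides do not say how the false positive rate is computed. One common way to estimate it, shown here only as a sketch, is a shuffling (permutation) approach: pretend the treatment is not effective, reassign the group labels at random many times, and count how often chance alone produces a difference at least as large as the one observed. The function name, the number of shuffles, and the two small data lists are assumptions for illustration; the data are invented and are not from the Flibanserin study.

```python
import random


def estimated_false_positive_rate(treatment, placebo, n_shuffles=10_000, seed=0):
    """Estimate how often a 'YES' at least this convincing arises by chance.

    Assumes a larger mean in the treatment group counts as evidence that
    the treatment is effective.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n_t = len(treatment)
    observed = sum(treatment) / n_t - sum(placebo) / len(placebo)

    pooled = list(treatment) + list(placebo)
    at_least_as_large = 0
    for _ in range(n_shuffles):
        # If the treatment is not effective, the group labels are arbitrary,
        # so shuffling them mimics "no real difference".
        rng.shuffle(pooled)
        fake_treatment = pooled[:n_t]
        fake_placebo = pooled[n_t:]
        diff = sum(fake_treatment) / n_t - sum(fake_placebo) / len(fake_placebo)
        if diff >= observed:
            at_least_as_large += 1
    return at_least_as_large / n_shuffles


# Invented outcome scores for eight people per group (illustration only).
treatment_scores = [5, 6, 7, 8, 9, 7, 6, 8]
placebo_scores = [1, 2, 3, 2, 1, 3, 2, 2]
fpr = estimated_false_positive_rate(treatment_scores, placebo_scores)
print("estimated FPR:", fpr)
```

A small result means that an automatic "YES" would rarely be this convincing if the treatment really were ineffective, which is the condition for trusting the "YES".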

Statistical Significance

If the estimated false positive rate (FPR) for deciding between “Treatment is Not Effective” and “Treatment is Effective” is low enough, typically less than 0.05, the results of the experiment are said to be statistically significant.

Important Vocabulary

Testing a hypothesis in the present context means choosing between

H0: Treatment is Not Effective (the “null” hypothesis)

and

HA: Treatment is Effective (the “alternative” hypothesis)

To make the choice, we have to compute an estimated false positive rate and compare it to 0.05. If the estimated FPR is smaller than 0.05, choose HA. Else, choose H0.
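The decision rule just stated can be written as a very small Python sketch. The function name and the default threshold argument are illustrative assumptions; the 0.05 cutoff is the one used in these slides.

```python
def choose_hypothesis(estimated_fpr, threshold=0.05):
    """Apply the rule: choose HA if the estimated FPR is below the threshold, else H0."""
    if estimated_fpr < threshold:
        return "HA"  # Treatment is Effective
    return "H0"      # Treatment is Not Effective


# Hypothetical estimated false positive rates, for illustration only.
for fpr in (0.01, 0.05, 0.20):
    print(fpr, "->", choose_hypothesis(fpr))
```

Under this rule an estimated FPR of exactly 0.05 leads to H0, because the slides require the value to be smaller than 0.05.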

Also Known As …

In hypothesis testing the estimated false positive rate is more commonly called a p-value.

That stands for “probability value”.


Statistical science, particularly statistical inference, is a very complex endeavor. In this presentation we have purposely avoided discussing a few things, including:

• The distinction between two very different approaches to hypothesis testing due to Fisher and Neyman-Pearson.

• The difference between a p-value and a Type I error rate.

• The real and important distinction between “Accepting H0” and “Failing to Reject H0”.

Your instructor may want to offer more details.

One-Sentence Reflection

Statistical hypothesis testing amounts to a screening test that chooses between a null hypothesis and an alternative hypothesis based on the size of the estimated false positive rate.