Hypothesis Testing
Learning Objectives
After completing this module, the student will be able to
• carry out a statistical test of significance
• calculate the acceptance and rejection region
• calculate and interpret the p-value of a statistical test
• calculate and interpret type 1 and type 2 errors
• calculate the power of a test
Knowledge and Skills
• Concepts: null hypothesis, alternative, test statistic, rejection region, acceptance region, p-value, significance level, type 1 error, type 2 error, false positive, false negative, power of a test
• Resampling method
• Fisher’s exact test
Prerequisites
• binomial distribution
• hypergeometric distribution
• Normal distribution
• Sample average
• Sample standard deviation
• macros in Excel
Citation: Neuhauser, C. Hypothesis Testing.
Created: November 29, 2009 Revisions:
Copyright: © 2009 Neuhauser. This is an open-access article distributed under the terms of the Creative Commons Attribution
Non-Commercial Share Alike License, which permits unrestricted use, distribution, and reproduction in any medium, and allows
others to translate, make remixes, and produce new stories based on this work, provided the original author and source are
credited and the new work will carry the same license.
Funding: This work was partially supported by a HHMI Professors grant from the Howard Hughes Medical Institute.
Prologue
The problem of decision making is ubiquitous. Almost daily, you can read in the news about studies that
lead to recommendations based on statistical evidence. The U.S. Department of Health and Human
Services’ Agency for Healthcare Research and Quality (http://www.ahrq.gov/) provides health care
recommendations, for instance, through its U.S. Preventive Services Task Force
(http://www.ahrq.gov/clinic/uspstfix.htm), an independent panel of experts in primary care and
prevention, which reviews research results and develops recommendations. These recommendations
are based on analyses of tens or hundreds of clinical studies, and recommendations may change as new
evidence accumulates over time.
Frequently, clinical studies are phrased in terms of hypothesis testing. For instance, if a new treatment
for a disease is developed, we might wish to know whether it performs better than the current
treatment. We set up a clinical trial where patients are randomly assigned to one or the other
treatment. We then compare the number of successful treatments in each group. Let’s assume that the
two groups have the same number of patients. In order to conclude that the new treatment is better
than the current treatment, we would need to demonstrate that the number of successful treatments in
the new treatment group is larger than the number of successful treatments in the current treatment
group. The question is how much larger the number of successful treatments in the new treatment
group would need to be to convince other investigators that the new treatment is indeed better. These
kinds of questions can be answered within the framework of hypothesis testing.
In-class Activity 1
Assume the current treatment for a disease is successful in 30% of all cases. A new treatment is being
developed and a preliminary clinical trial showed that 5 out of 10 patients were successfully treated. Can
you conclude that the new treatment is more successful?
If the new treatment was not better than the current treatment, we would hypothesize that the new
treatment has probability 0.3 of being successful. Alternatively, if the new treatment is better than the
current treatment, we would hypothesize that the new treatment has probability greater than 0.3 of
being successful.
If the new treatment has the same likelihood of success as the current treatment, namely probability
0.3, then the number of patients in the small clinical trial who are treated successfully under the new
treatment is binomially distributed with 10 trials and success probability 0.3. The following table was
created in EXCEL using the BINOMDIST function and shows this probability distribution:
x     P(X=x)
0     0.0282
1     0.1211
2     0.2335
3     0.2668
4     0.2001
5     0.1029
6     0.0368
7     0.0090
8     0.0014
9     0.0001
10    0.0000
We see that the probability of five or more successes when the success probability is 0.3 is
$$0.1029 + 0.0368 + 0.0090 + 0.0014 + 0.0001 + 0.0000 = 0.1502$$
Thus, it is not unlikely to see 5 (or more) out of 10 patients recover when the probability of
recovery is 0.3. There is not enough evidence to conclude that the new treatment is better.
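For readers who prefer code to a spreadsheet, the same tail probability can be sketched in Python. This is only an illustration of the calculation above, not part of the original module; the helper names `binom_pmf` and `upper_tail` are made up for this example.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def upper_tail(k, n, p):
    """Probability of k or more successes, P(X >= k)."""
    return sum(binom_pmf(j, n, p) for j in range(k, n + 1))

# Null hypothesis: the new treatment succeeds with probability 0.3.
# Observed: 5 successes among 10 patients.
p_value = upper_tail(5, 10, 0.3)
print(f"P(X >= 5) = {p_value:.3f}")
```

The result is approximately 0.150. It can differ from the hand calculation in the last digit because the table entries are rounded before they are added.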
Discuss in your group the following questions:
1. Why did we add up the probabilities in the above example?
2. Would you be able to conclude definitively from this study that the new treatment isn’t any better?
3. What would be your next step in determining whether the new treatment is better?
In-class Activity 2
Suppose you have a coin in your pocket. You want to decide whether the coin is fair or biased. You
hypothesize that the coin is fair. To test this hypothesis, you toss the coin 30 times. The number of
heads is binomially distributed with the number of trials being 30 and the probability of heads (success)
being 0.5. Below is the histogram of the probability distribution.
Suppose the experiment resulted in 18 heads and 12 tails. Discuss the following questions in your group:
1. What can you say about the coin? Is it a fair coin or a biased coin?
2. What would your conclusion be if the experiment resulted in 24 heads and 6 tails?
3. What criteria did you use to make the decision in each of the two cases?
4. Can you be sure that your decision is correct?
Some Theory
In both In-class Activities, you had to make a decision between two alternatives. In the first case, you
needed to decide whether the new treatment was better than the current treatment. In the second
case, you needed to decide whether the coin was fair or biased. In both cases, you relied on a
probability model, and you based your decision on how likely the outcome of the experiment was
compared to the expectation of the model. In both cases, there was also the possibility that you arrived
at the wrong decision.
In the following, we will discuss the basic elements of hypothesis testing. We will use the example of the
fair coin versus the biased coin because of its simplicity. One hypothesis is that the coin is fair, that is,
that the probability of heads is 0.5. The alternative hypothesis is that the coin is biased, that is, the
probability of heads is different from 0.5. We base our decision of whether or not the coin is fair on
comparing the result of our experiment to what we expect based on a probabilistic model. Namely, if
the fraction of heads in the experiment is close to 1/2, the experiment provides evidence for the coin
being fair; if the fraction of heads is either low or high, the experiment provides evidence for the coin
being biased.
The hypothesis “the coin is fair” is called the null hypothesis and is denoted by $H_0$. The alternative “the
coin is biased” is denoted by $H_1$. (We will say more about which of the two hypotheses is the null
hypothesis and which is the alternative later.) We summarize this as
$$H_0 : p = 0.5$$
$$H_1 : p \neq 0.5$$
We designed an experiment in which we tossed the coin thirty times. The data collected in the
experiment provided evidence for or against the null hypothesis. The data in our experiment were the
sequence of heads and tails in the thirty trials. The data suggest that we can calculate a single number,
namely the number of heads, which we can compare against what we would expect under the null
hypothesis. This single number is called the test statistic. A probabilistic model for tossing a fair coin
allows us to calculate the probability distribution of the test statistic. Namely, under the null hypothesis,
the number of heads is binomially distributed with 30 trials and success probability $p = 0.5$. In the
experiment, we observed 18 heads. How likely is it that we observe 18 or more heads? If $X$ denotes the
number of heads, we are asking for $P(X \ge 18)$, which can be calculated by adding up the probabilities of
the events $X = 18$, $X = 19$, ..., $X = 30$. Refer to the spreadsheet (tab “Fair Coin”) to verify that
$$P(X \ge 18) = 0.1808$$
Since the alternative is two-sided, that is, $p < 0.5$ or $p > 0.5$, we will reject the null hypothesis if the
number of heads is either too large or too small. We only calculated the probability of at least 18 heads.
The probability distribution under the null hypothesis is symmetric and so we add to this the probability
of the symmetric event “at most 12 heads” or $X \le 12$. We thus need to calculate the probability of the
event of having either at least 18 heads or at most 12 heads
$$P(X \le 12 \text{ or } X \ge 18) = 0.1808 + 0.1808 = 0.3616$$
This probability is called the p-value and denoted by p. The following guidelines are commonly accepted
by statisticians:
p<0.01: strong evidence against $H_0$
0.01<p<0.05: moderate evidence against $H_0$
p>0.10: little or no evidence against $H_0$
Since $p = 0.3616$, there is little or no evidence against $H_0$, and thus not sufficient evidence to reject the
null hypothesis.
If, instead of 18 heads, we observed 24 heads in the experiment, we find for the p-value
$$P(X \le 6 \text{ or } X \ge 24) = (2)(0.0006 + 0.0001) = 0.0014$$
We conclude that there is strong evidence against $H_0$. We reject the null hypothesis and say that the
result is highly statistically significant.
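The two p-values above can be reproduced with a short Python sketch. It assumes the fair-coin null hypothesis, so that the distribution is symmetric about n/2; the function name `two_sided_p_value` is hypothetical and chosen only for this illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def two_sided_p_value(heads, n=30):
    """Under a fair-coin null, add the probabilities of every outcome that is
    at least as far from n/2 as the observed number of heads."""
    distance = abs(heads - n / 2)
    return sum(binom_pmf(k, n, 0.5)
               for k in range(n + 1)
               if abs(k - n / 2) >= distance)

p_18 = two_sided_p_value(18)  # outcomes X <= 12 or X >= 18
p_24 = two_sided_p_value(24)  # outcomes X <= 6 or X >= 24
print(f"18 heads: p-value = {p_18:.4f}")
print(f"24 heads: p-value = {p_24:.4f}")
```

Doubling one tail, as in the text, works only because the null distribution is symmetric. Summing over all outcomes at least as far from 15 makes that assumption explicit.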
Rejection Region
In our example, we are looking for outcomes that have either a large or a small number of heads. We
can define a set of extreme outcomes a priori so that if the outcome of the experiment is in this set, we
reject the null hypothesis. The set of extreme outcomes is called the rejection region. Because the
probability distribution is symmetric about 15, and there are 15 possible outcomes below 15, namely
0,1,2,…,14, and 15 possible outcomes above 15, namely 16,17,..,30, we will define the rejection region
in a symmetric way: we will identify a number $a$ so that the rejection region is of the form
$\{0, 1, 2, \ldots, a\} \cup \{30 - a, 30 - a + 1, \ldots, 30\}$. If we are looking for moderate evidence, that is, if we want to test at
the 0.05 significance level, we will choose $a$ as large as possible so that each of the two sets has
probability close to 0.025. Looking at the table of probabilities for the outcomes “Number of heads is
equal to k,” we see that if $a = 9$, we have
$$P(X \le 9 \text{ or } X \ge 21) = 0.0214 + 0.0214 = 0.0428$$
If we choose a larger value for a, the probability would exceed 0.05; a smaller value of a would result in
a probability that is smaller than 0.0428. Thus a=9 is the best choice for defining the rejection region if
we are interested in “moderate evidence against the null hypothesis.” We thus reject the null
hypothesis if the number of heads in the experiment of tossing the coin thirty times is either less than or
equal to 9 or greater than or equal to 21. The complement of the rejection region, called the acceptance
region, is the set $\{10, 11, 12, \ldots, 19, 20\}$. In the experiment, we observed 18 heads, which is in the
acceptance region. We thus do not reject the null hypothesis.
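The search for the cutoff a can also be written as a small loop. The sketch below assumes the fair-coin null hypothesis and the symmetric rejection region described above; `lower_tail` and `rejection_cutoff` are illustrative names, not functions from the spreadsheet.

```python
from math import comb

def lower_tail(a, n=30):
    """P(X <= a) for a fair coin tossed n times."""
    return sum(comb(n, k) for k in range(a + 1)) / 2**n

def rejection_cutoff(alpha=0.05, n=30):
    """Largest a such that P(X <= a or X >= n - a) does not exceed alpha."""
    a = -1  # a = -1 stands for an empty rejection region
    while 2 * lower_tail(a + 1, n) <= alpha:
        a += 1
    return a

a = rejection_cutoff()
acceptance_region = list(range(a + 1, 30 - a))
print("a =", a)
print("acceptance region:", acceptance_region)
print(f"probability of rejection region = {2 * lower_tail(a):.4f}")
```

The loop stops as soon as enlarging the region by one more outcome on each side would push the total probability above the significance level.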
Statisticians are careful about phrasing their conclusion. If the outcome of the experiment is unlikely
under the null hypothesis, they reject a null hypothesis. If not, they will say that the null hypothesis
cannot be rejected. Statisticians do not accept a null hypothesis. There is a big difference between
saying “we do not reject a null hypothesis” and “we accept a null hypothesis.” Just because the data is
consistent with the null hypothesis, does not mean that the null hypothesis is true—there could be
many other reasons for getting a result that is consistent with the null hypothesis. That is, not rejecting a
null hypothesis does not assert its truth. The null hypothesis merely withstood a challenge. As we will
see shortly, rejecting a null hypothesis only means that the null hypothesis may not be true.
The histogram in the figure below indicates the acceptance region and rejection region. In a two-sided
test, the rejection region is the union of the two extreme events that are in the two “ends” of the
distribution, which are called the tails of the distribution.
Type I Error
The probability of the rejection region in our experiment is 0.0428. That is, there is a 4.3% probability
that we will reject the null hypothesis even though it is true. Erroneously rejecting the null hypothesis is
called committing a type I error. The type I error leads to false positives. Since there is a positive
probability that the null hypothesis is erroneously rejected, we can only conclude that the null
hypothesis may not be true when we reject the null hypothesis.
Type II Error
The other possible error is not rejecting the null hypothesis when the alternative is true. This is called a
type II error. The type II error leads to false negatives. The type II error can only be calculated if the
alternative is sufficiently specified. In our example, we only said that the coin is biased under the
alternative. There are infinitely many probability models that satisfy the assumption of the alternative,
namely any binomial distribution with $p \neq 0.5$. For a fixed value of $p \neq 0.5$, we can calculate the type II
error. For instance, let’s assume $p = 0.7$. Then
$$P(10 \le X \le 20 \mid p = 0.7) = 0.4112$$
For larger values of p, this probability will be smaller. For instance, if $p = 0.8$, then
$$P(10 \le X \le 20 \mid p = 0.8) = 0.0611$$
This means that the larger (or, by symmetry, the smaller) the probability of heads is, the better we will
be able to detect whether a coin is biased. This is quantified by the power of the test, which is defined
as 1 minus the probability of a type II error. The power of a test is therefore the probability of rejecting
the null hypothesis when the alternative is true. The following table lists the power of the test for
different values of the probability of heads:
P(Heads)    Power
0           1
0.1         0.999546
0.2         0.938913
0.3         0.588816
0.4         0.177143
0.5         0.042774
0.6         0.177143
0.7         0.588816
0.8         0.938913
0.9         0.999546
1           1
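The entries of the power table can be recomputed with a few lines of Python. This sketch assumes the acceptance region {10, ..., 20} found earlier; the function names `type_ii_error` and `power` are made up for the illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def type_ii_error(p, n=30, low=10, high=20):
    """Probability that X lands in the acceptance region {low, ..., high}
    when the true probability of heads is p."""
    return sum(binom_pmf(k, n, p) for k in range(low, high + 1))

def power(p, n=30, low=10, high=20):
    """Probability of rejecting the null hypothesis when P(Heads) = p."""
    return 1 - type_ii_error(p, n, low, high)

for tenths in range(11):
    p = tenths / 10
    print(f"P(Heads) = {p:.1f}   power = {power(p):.6f}")
```

At p = 0.5 the "power" is just the probability of the rejection region under the null hypothesis, which is why that row of the table equals the type I error.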
We can plot the power of the test as a function of the probability of heads:
Hypothesis Testing through Resampling
In our example of testing whether a coin is fair, we were able to calculate the probability distribution
under the null hypothesis exactly. In many applications, the probability distribution under the null
hypothesis is not known and must be simulated. This approach is called a resampling method. We can
illustrate this important method with our example.
In the spreadsheet (tab “Simulation”), we set up a simulation of 30 independent trials, each with
probability 0.5 of success. If we denote a success by a “1” and a failure by a “0,” then the syntax to
accomplish this in EXCEL is
“=IF(RAND()<0.5,1,0)”
(See, for instance, Cell B4.) If you add up the 30 values, you obtain the total number of successes in the
30 trials, which you can find in Cell E4. Write a macro so that you can record the outcomes of 500 such
experiments and record the number of heads in each of the 500 runs in column I. If you want the type I
error to be 5%, then since the test is two-sided, we need to determine the 2.5th and 97.5th percentiles of
the simulation outcomes to find the acceptance region. Find the acceptance region and the
corresponding rejection region. How does this compare to the exact calculations we did earlier?
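As an alternative to an Excel macro, the same simulation can be sketched in Python. The seed, the function names, and the nearest-rank percentile rule are assumptions made for this illustration; other percentile conventions may shift the cutoffs by one.

```python
import math
import random

def simulate_heads(n_tosses=30, n_runs=500, p=0.5, seed=0):
    """Repeat the coin-tossing experiment n_runs times and return the
    number of heads observed in each run."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [sum(1 for _ in range(n_tosses) if rng.random() < p)
            for _ in range(n_runs)]

def nearest_rank_percentile(sorted_values, q):
    """q-th percentile of an already sorted list, nearest-rank definition."""
    rank = max(1, math.ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

counts = sorted(simulate_heads())
lower = nearest_rank_percentile(counts, 2.5)
upper = nearest_rank_percentile(counts, 97.5)
print("2.5th percentile of simulated head counts:", lower)
print("97.5th percentile of simulated head counts:", upper)
```

Because only 500 runs are simulated, the percentiles vary from seed to seed, so the simulated acceptance region need not match the exact one.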
Summary
Statistical hypothesis testing involves the following steps:
1. Formulation of a null hypothesis and an alternative.
2. Construction of a test statistic that can discriminate between the null hypothesis and the
alternative. Calculate the probability distribution under the null hypothesis.
3. There are two ways to proceed from here. Either one allows us to control the type I error:
a. Specification of the type I error and calculation of the rejection and acceptance region
followed by data collection and decision of whether to reject or not to reject the null
hypothesis based on the data.
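b. Collection of data and calculation of the corresponding p-value followed by decision of
whether or not to reject the null hypothesis. The p-value is the probability, calculated under the
null hypothesis, of an outcome at least as extreme as the one observed. Rejecting the null
hypothesis only when the p-value is below the chosen significance level keeps the type I error at
or below that level.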
Worked-out Example
Problem: A jury pool includes 50% women and 50% men. A jury of 12 people was selected from this pool
and included 3 women. A newspaper commented on the “biased selection process.” (a) Test the
hypothesis that the jury selection was fair. (b) Repeat the test assuming now that the jury only included
2 women.
Solution: The first part of the solution applies to both (a) and (b). The null hypothesis is that the jury
selection was fair, that is, the proportion of women is 0.5. The alternative is that the selection process
was biased against women. We thus choose for the alternative that the proportion of women is less
than 0.5:
$$H_0 : p = 0.5$$
$$H_1 : p < 0.5$$
The next step is to find a test statistic and to calculate the probability distribution of the test statistic
under the null hypothesis. We can choose as the test statistic the number of women on the jury. We
denote the test statistic by X. The test statistic is binomially distributed with n, the number of trials, equal
to 12, and p, the probability of success, equal to 0.5. The EXCEL function “=BINOMDIST(k, n, p, FALSE)”
calculates the probability of k successes in n trials with success
probability p. The last entry “FALSE” indicates that it calculates the probability mass function. To
calculate the cumulative probability distribution, replace “FALSE” by “TRUE”. With $n = 12$ and $p = 0.5$,
we obtain the following table:
k     P(X=k)
0     0.000244
1     0.00293
2     0.016113
3     0.053711
4     0.12085
5     0.193359
6     0.225586
7     0.193359
8     0.12085
9     0.053711
10    0.016113
11    0.00293
12    0.000244
(a) In this part of the problem, three women were on the jury. To calculate the p-value, we compute the
probability of the event “three or fewer women”:
$$P(X \le 3) = 0.000244 + 0.00293 + 0.016113 + 0.053711 = 0.072998$$
Since the p-value is about 0.073, we conclude that there is not enough evidence to reject the null
hypothesis. In about 7.3% of jury selections from this pool, we would expect to see three or fewer
women. The result is not statistically significant.
(b) The situation is different if only two women had been selected. The probability of two or fewer
women on the jury is
$$P(X \le 2) = 0.000244 + 0.00293 + 0.016113 = 0.019287$$
Now, the p-value is only about 2%, which is statistically significant. We would now reject the null
hypothesis.
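Both parts of the worked example rest on a cumulative binomial probability, which is what BINOMDIST returns when its last argument is TRUE. A minimal Python counterpart is sketched below; the function name `binom_cdf` and the 0.05 threshold used for the decision are assumptions made for the illustration.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for a binomial distribution with n trials and success
    probability p (the counterpart of BINOMDIST(k, n, p, TRUE))."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

ALPHA = 0.05  # significance level assumed for this illustration

p_three_women = binom_cdf(3, 12, 0.5)  # part (a)
p_two_women = binom_cdf(2, 12, 0.5)    # part (b)

for label, p_value in [("(a)", p_three_women), ("(b)", p_two_women)]:
    decision = "reject H0" if p_value < ALPHA else "do not reject H0"
    print(f"{label} p-value = {p_value:.6f}: {decision}")
```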
Homework
1. A student committee composed of 20 upper division and lower division students needs to be
assembled. One third of the student population is upper division students. The committee ends up
having an equal number of upper division and lower division students. The lower division students,
expecting a higher number of them on the committee, made the accusation that the selection process
was biased against them. Test the hypothesis that the selection process was fair against the alternative
that the selection process was biased against the lower division students.
2. In a cross between heterozygous plants of genotype Cc, we expect that 50% of the offspring are
heterozygous (i.e., genotype Cc) and 50% are homozygous (i.e., either of genotype CC or of genotype
cc). Among 14 plants, we find that 3 plants are homozygous and 11 plants are heterozygous. Test the
hypothesis that the ratio of homozygous to heterozygous plants is 1:1 against the alternative that the
ratio is different from 1:1.
3. Assume that the population distribution is normal with mean $\mu$ and standard deviation $\sigma$. We take a
sample of size n from this population and calculate the sample average
$$\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$$
We know that the distribution of $\bar{X}_n$ is then again normal with mean $\mu$ and standard deviation $\sigma / \sqrt{n}$.
We can define a new random variable
$$Z = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}}$$
which is normally distributed with mean 0 and standard deviation 1. Suppose now that the average
lifetime of a sample of 50 medical devices is 5 years and 8 months with population standard deviation of
4 months. Assume that the lifetime of this medical device is listed as 6 years. Test the hypothesis
  6 years against the alternative   6 years at the 0.05 significance level by first calculating the
rejection and acceptance regions for the 0.05 significance level. Calculate the power of this test.
4. For large samples, the sampling distributions can often be approximated by normal distributions even
if the population distribution is not normal. Here is a typical example: A group of 100 students takes the
ACT math test and has an average score of 20.6. The standard deviation is $\sigma = 3.2$. The average score
nationwide was 21.2. Test whether the average score of this group of 100 students is lower than the
national average.
5. Suppose you have a biased coin in your pocket. One side shows up with probability 0.3, the other with
probability 0.7. Unfortunately, you don’t remember which side is more likely.
Here are the two scenarios:
Hypothesis A P(Heads)=0.3 P(Tails)=0.7
Hypothesis B P(Heads)=0.7 P(Tails)=0.3
To determine whether the probability of heads is 0.3 (Hypothesis A) or 0.7 (Hypothesis B), you toss the
coin 9 times and record the number of heads. Here is the outcome of the experiment:
H,T,T,T,H,H,T,T,H,T
(a) Based on this outcome, what do you think is the more likely scenario, Hypothesis A or Hypothesis B?
(b) The number of heads in the experiment is binomially distributed with 9 trials and probability of
heads equal to 0.3 in Hypothesis A and 0.7 in Hypothesis B. Calculate the probabilities for the event
“Number of Heads = k” under the two hypotheses for the ten possible values of k. For which values of k
would you reject Hypothesis A? Calculate the probability of erroneously rejecting Hypothesis A.
Calculate the probability of not rejecting Hypothesis A when in fact Hypothesis B is true.
6. Focht et al. 2002 reported in a research article on “the efficacy of duct tape vs cryotherapy in the
treatment of Verruca vulgaris (the common wart)” (Arch. Pediatr. Adolesc. Med. 2002; 156: 971-974).
Their objective was “[t]o determine if application of duct tape is as effective as cryotherapy in the
treatment of common warts.” They enrolled 61 patients into their study; 51 patients completed the
study. The main outcome measure was complete resolution of the wart being studied. Patients were
randomized to receive either cryotherapy (liquid nitrogen applied to the wart every two to three weeks) or
application of duct tape for a maximum of two months. Of the 51 patients, 26 were treated with duct
tape and 25 with cryotherapy. In the duct tape group, 22 had complete resolution of the wart; in the
cryotherapy group, 15 patients had complete resolution of the wart. Here is the data in table form
summarizing the outcome of the study:
               No Resolution    Resolution    SUM
Duct Tape      4                22            26
Cryotherapy    10               15            25
SUM            14               37            51
(a) What percentage of patients completing the study were treated with duct tape, and what
percentage were treated with cryotherapy?
(b) To test whether duct tape is more effective than cryotherapy, we design a statistical test. The null
hypothesis states that the two treatments are equally effective. The alternative is that duct tape therapy
is more effective than cryotherapy.
Under the null hypothesis, we can develop a probability model to calculate the probability that 22 patients in
the duct tape group see complete resolution. This is how: We have a group of 51 patients, which is
the population. 37 patients saw successful resolution, which is the group of successes in the population.
26 patients are randomly assigned to the duct tape group, which is the sample size. The number of
successes in the sample is 22. This is equivalent to the following urn problem, which we can solve using basic
probability theory: An urn has a total of 51 balls, 37 of which are blue and the remainder green. We take a
sample of 26 balls at random from the urn. What is the likelihood that 22 balls are blue? If we denote the
number of successes in the sample by X, we can calculate the probability of this event using the
hypergeometric distribution:
$$P(X = 22) = \frac{\binom{37}{22} \binom{14}{4}}{\binom{51}{26}}$$
Excel has a function that will calculate this probability: “=HYPGEOMDIST(22,26,37,51).” To calculate the
p-value, we need to determine the probability of at least 22 complete resolutions in a sample of size 26
when the population size is 51 and the number of successes in the population is 37. Find this probability.
What can you conclude? (The statistical test in this problem is called Fisher’s exact test.)
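If no spreadsheet is at hand, the hypergeometric probabilities can also be computed from binomial coefficients directly. The following Python sketch mirrors the argument order of HYPGEOMDIST; the function names are hypothetical, and the numerical answer is deliberately left for the reader to compute.

```python
from math import comb

def hypergeom_pmf(k, sample_size, successes, population):
    """Probability of exactly k successes in a sample drawn without
    replacement from a population containing the given number of successes."""
    return (comb(successes, k)
            * comb(population - successes, sample_size - k)
            / comb(population, sample_size))

def hypergeom_upper_tail(k, sample_size, successes, population):
    """Probability of k or more successes in the sample."""
    largest = min(sample_size, successes)
    return sum(hypergeom_pmf(j, sample_size, successes, population)
               for j in range(k, largest + 1))

# Duct tape study: 51 patients, 37 resolutions overall, 26 in the duct tape group.
p_exactly_22 = hypergeom_pmf(22, 26, 37, 51)
p_at_least_22 = hypergeom_upper_tail(22, 26, 37, 51)
print(p_exactly_22, p_at_least_22)
```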
7. In 2006, another study on the efficacy of duct tape in treating warts was done to address some of the
shortcomings of the first study, in particular the small sample size and the lack of a placebo group. The
study by de Haen et al. (2006) on the “efficacy of duct tape vs placebo in the treatment of Verruca
vulgaris (warts) in primary school children” (Arch. Pediatr. Adolesc. Med. 2006; 160: 1121-1125) had 103
participants who completed treatment; 51 patients were treated with duct tape and the remaining 52
patients received a placebo treatment. After 6 weeks, the wart had disappeared in 8 of the children in
the duct tape group and 3 of the children in the placebo group. Test whether the duct tape treatment is
more effective.