Hypothesis Testing

Learning Objectives
After completing this module, the student will be able to
  carry out a statistical test of significance
  calculate the acceptance and rejection regions
  calculate and interpret the p-value of a statistical test
  calculate and interpret type I and type II errors
  calculate the power of a test

Knowledge and Skills
  Concepts: null hypothesis, alternative, test statistic, rejection region, acceptance region, p-value, significance level, type I error, type II error, false positive, false negative, power of a test
  Resampling method
  Fisher’s exact test

Prerequisites
  binomial distribution
  hypergeometric distribution
  normal distribution
  sample average
  sample standard deviation
  macros in Excel

Citation: Neuhauser, C. Hypothesis Testing. Created: November 29, 2009 Revisions: Copyright: © 2009 Neuhauser. This is an open-access article distributed under the terms of the Creative Commons Attribution Non-Commercial Share Alike License, which permits unrestricted use, distribution, and reproduction in any medium, and allows others to translate, make remixes, and produce new stories based on this work, provided the original author and source are credited and the new work will carry the same license. Funding: This work was partially supported by a HHMI Professors grant from the Howard Hughes Medical Institute.

Prologue
The problem of decision making is ubiquitous. Almost daily, you can read in the news about studies that lead to recommendations based on statistical evidence. The U.S. Department of Health and Human Services’ Agency for Healthcare Research and Quality (http://www.ahrq.gov/) provides health care recommendations, for instance, through its U.S. Preventive Services Task Force (http://www.ahrq.gov/clinic/uspstfix.htm), an independent panel of experts in primary care and prevention, which reviews research results and develops recommendations.
These recommendations are based on analyses of tens or hundreds of clinical studies, and recommendations may change as new evidence accumulates over time.

Frequently, clinical studies are phrased in terms of hypothesis testing. For instance, if a new treatment for a disease is developed, we might wish to know whether it performs better than the current treatment. We set up a clinical trial where patients are randomly assigned to one or the other treatment. We then compare the number of successful treatments in each group. Let’s assume that the two groups have the same number of patients. In order to conclude that the new treatment is better than the current treatment, we would need to demonstrate that the number of successful treatments in the new treatment group is larger than the number of successful treatments in the current treatment group. The question is how much larger the number of successful treatments in the new treatment group would need to be to convince other investigators that the new treatment is indeed better. These kinds of questions can be answered within the framework of hypothesis testing.

In-class Activity 1
Assume the current treatment for a disease is successful in 30% of all cases. A new treatment is being developed and a preliminary clinical trial showed that 5 out of 10 patients were successfully treated. Can you conclude that the new treatment is more successful?

If the new treatment was not better than the current treatment, we would hypothesize that the new treatment has probability 0.3 of being successful. Alternatively, if the new treatment is better than the current treatment, we would hypothesize that the new treatment has probability greater than 0.3 of being successful.
If the new treatment has the same likelihood of success as the current treatment, namely probability 0.3, then the number of patients in the small clinical trial who are treated successfully under the new treatment is binomially distributed with 10 trials and success probability 0.3. The following table was created in EXCEL using the BINOMDIST function and shows this probability distribution:

   x    P(X=x)
   0    0.0282
   1    0.1211
   2    0.2335
   3    0.2668
   4    0.2001
   5    0.1029
   6    0.0368
   7    0.0090
   8    0.0014
   9    0.0001
  10    0.0000

We see that the probability of five or more successes when the success probability is 0.3 is

  $P(X \ge 5) = 0.1029 + 0.0368 + 0.0090 + 0.0014 + 0.0001 + 0.0000 = 0.1502$

Thus, it is not unlikely to see 5 (or more) out of 10 patients recover when the probability of recovery is 0.3. We conclude that there is not enough evidence to say that the new treatment is better.

Discuss in your group the following questions:
1. Why did we add up the probabilities in the above example?
2. Would you be able to conclude definitively from this study that the new treatment isn’t any better?
3. What would be your next step in determining whether the new treatment is better?

In-class Activity 2
Suppose you have a coin in your pocket. You want to decide whether the coin is fair or biased. You hypothesize that the coin is fair. To test this hypothesis, you toss the coin 30 times. The number of heads is binomially distributed with the number of trials being 30 and the probability of heads (success) being 0.5.

[Figure: histogram of the probability distribution of the number of heads in 30 tosses of a fair coin]

Suppose the experiment resulted in 18 heads and 12 tails. Discuss the following questions in your group:
1. What can you say about the coin? Is it a fair coin or a biased coin?
2. What would your conclusion be if the experiment resulted in 24 heads and 6 tails?
3. What criteria did you use to make the decision in each of the two cases?
4. Can you be sure that your decision is correct?
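The probability table and the tail sum from In-class Activity 1 can also be reproduced outside of Excel. The following is a minimal Python sketch, not part of the original spreadsheet; the helper name binomial_pmf and the variable names are illustrative assumptions. It plays the same role as BINOMDIST with the last argument set to FALSE.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials (hypothetical helper)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p_null = 10, 0.3   # number of patients and success rate of the current treatment
observed = 5          # successes observed under the new treatment

# Probability distribution of the number of successes under the null hypothesis.
table = [binomial_pmf(k, n, p_null) for k in range(n + 1)]

# Upper tail: probability of the observed count or any larger count.
tail = sum(table[observed:])

for k, prob in enumerate(table):
    print(f"{k:2d}  {prob:.4f}")
print(f"P(X >= {observed}) = {tail:.4f}")
```

Because the code sums unrounded probabilities, the tail may differ from the hand calculation in the last decimal place; both are close to 0.15.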
Some Theory
In both In-class Activities, you had to make a decision between two alternatives. In the first case, you needed to decide whether the new treatment was better than the current treatment. In the second case, you needed to decide whether the coin was fair or biased. In both cases, you relied on a probability model, and you based your decision on how likely the outcome of the experiment was compared to the expectation of the model. In both cases, there was also the possibility that you arrived at the wrong decision.

In the following, we will discuss the basic elements of hypothesis testing. We will use the example of the fair coin versus the biased coin because of its simplicity. One hypothesis is that the coin is fair, that is, that the probability of heads is 0.5. The alternative hypothesis is that the coin is biased, that is, the probability of heads is different from 0.5. We base our decision of whether or not the coin is fair on comparing the result of our experiment to what we expect based on a probabilistic model. Namely, if the fraction of heads in the experiment is close to 1/2, the experiment provides evidence for the coin being fair; if the fraction of heads is either low or high, the experiment provides evidence for the coin being biased.

The hypothesis “the coin is fair” is called the null hypothesis and is denoted by $H_0$. The alternative “the coin is biased” is denoted by $H_1$. (We will say more about which of the two hypotheses is the null hypothesis and which is the alternative later.) We summarize this as

  $H_0: p = 0.5$
  $H_1: p \ne 0.5$

We designed an experiment in which we tossed the coin thirty times. The data collected in the experiment provided evidence for or against the null hypothesis. The data in our experiment were the sequence of heads and tails in the thirty trials. The data suggest that we can calculate a single number, namely the number of heads, which we can compare against what we would expect under the null hypothesis.
This single number is called the test statistic. A probabilistic model for tossing a fair coin allows us to calculate the probability distribution of the test statistic. Namely, under the null hypothesis, the number of heads is binomially distributed with 30 trials and success probability $p = 0.5$.

In the experiment, we observed 18 heads. How likely is it that we observe 18 or more heads? If $X$ denotes the number of heads, we are asking for $P(X \ge 18)$, which can be calculated by adding up the probabilities of the events $X = 18$, $X = 19$, ..., $X = 30$. Refer to the spreadsheet (tab “Fair Coin”) to verify that

  $P(X \ge 18) = 0.1808$

Since the alternative is two-sided, that is, $p < 0.5$ or $p > 0.5$, we will reject the null hypothesis if the number of heads is either too large or too small. We only calculated the probability of at least 18 heads. The probability distribution under the null hypothesis is symmetric and so we add to this the probability of the symmetric event “at most 12 heads” or $X \le 12$. We thus need to calculate the probability of the event of having either at least 18 heads or at most 12 heads:

  $P(X \le 12 \text{ or } X \ge 18) = 0.1808 + 0.1808 = 0.3616$

This probability is called the p-value and denoted by p.
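The two-sided calculation above can be checked with a short Python sketch. It is only an illustration under the fair-coin null hypothesis; the function name two_sided_p_value is an assumption and does not come from the spreadsheet. It counts all outcomes that are at least as far from the expected 15 heads as the observed count, which is exactly the symmetric event used in the text.

```python
from math import comb

def two_sided_p_value(heads, tosses):
    """Two-sided p-value for H0: p = 0.5, using the symmetry of the binomial distribution.

    Hypothetical helper: sums the probabilities of all head counts that are
    at least as far from tosses / 2 as the observed count.
    """
    distance = abs(heads - tosses / 2)
    extreme = sum(
        comb(tosses, k)
        for k in range(tosses + 1)
        if abs(k - tosses / 2) >= distance
    )
    # Under a fair coin every sequence of tosses has probability 1 / 2**tosses.
    return extreme / 2**tosses

p_value = two_sided_p_value(18, 30)
print(f"{p_value:.4f}")   # prints 0.3616
```

Working with binomial coefficients and a single division by 2**tosses avoids rounding each term separately, so the result agrees with the spreadsheet value to four decimal places.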
A commonly used convention among statisticians is the following:
  p < 0.01: strong evidence against $H_0$
  0.01 < p < 0.05: moderate evidence against $H_0$
  p > 0.10: little or no evidence against $H_0$

Since $p = 0.3616$, there is little or no evidence against $H_0$, and thus not sufficient evidence to reject the null hypothesis. If, instead of 18 heads, we observed 24 heads in the experiment, we find for the p-value

  $P(X \le 6 \text{ or } X \ge 24) = (2)(0.0006 + 0.0001) = 0.0014$

We conclude that there is strong evidence against $H_0$. We reject the null hypothesis and say that the result is highly statistically significant.

Rejection Region
In our example, we are looking for outcomes that have either a large or a small number of heads. We can define a set of extreme outcomes a priori so that if the outcome of the experiment is in this set, we reject the null hypothesis. The set of extreme outcomes is called the rejection region. Because the probability distribution is symmetric about 15, and there are 15 possible outcomes below 15, namely 0, 1, 2, ..., 14, and 15 possible outcomes above 15, namely 16, 17, ..., 30, we will define the rejection region in a symmetric way: we will identify a number $a$ so that the rejection region is of the form

  $\{0, 1, 2, \ldots, a\} \cup \{30 - a, 30 - a + 1, \ldots, 30\}$

If we are looking for moderate evidence, say, we want to test at the 0.05 significance level, we will choose $a$ as large as possible so that each of the two sets has probability close to, but not exceeding, 0.025. Looking at the table of probabilities for the outcomes “Number of heads is equal to k,” we see that if $a = 9$, we have

  $P(X \le 9 \text{ or } X \ge 21) = 0.0214 + 0.0214 = 0.0428$

If we choose a larger value for $a$, the probability would exceed 0.05; a smaller value of $a$ would result in a probability that is smaller than 0.0428. Thus $a = 9$ is the best choice for defining the rejection region if we are interested in “moderate evidence against the null hypothesis.” We thus reject the null hypothesis if the number of heads in the experiment of tossing the coin thirty times is either less than or equal to 9 or greater than or equal to 21. The complement of the rejection region, called the acceptance region, is the set $\{10, 11, 12, \ldots, 19, 20\}$. In the experiment, we observed 18 heads, which is in the acceptance region. We thus do not reject the null hypothesis.

Statisticians are careful about phrasing their conclusion. If the outcome of the experiment is unlikely under the null hypothesis, they reject the null hypothesis. If not, they will say that the null hypothesis cannot be rejected. Statisticians do not accept a null hypothesis. There is a big difference between saying “we do not reject a null hypothesis” and “we accept a null hypothesis.” Just because the data are consistent with the null hypothesis does not mean that the null hypothesis is true; there could be many other reasons for getting a result that is consistent with the null hypothesis. That is, not rejecting a null hypothesis does not assert its truth. The null hypothesis merely withstood a challenge. As we will see shortly, rejecting a null hypothesis only means that the null hypothesis may not be true.

[Figure: histogram of the number of heads under the null hypothesis, with the acceptance region and the rejection region marked]
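The search for the cutoff a described above can be sketched in Python. This is an illustrative sketch under the fair-coin null hypothesis, not the spreadsheet's method; the function name symmetric_rejection_cutoff is an assumption. It grows the lower tail one outcome at a time and stops as soon as the combined probability of both tails would exceed the significance level.

```python
from math import comb

def symmetric_rejection_cutoff(tosses, alpha):
    """Largest a with P(X <= a or X >= tosses - a) <= alpha when p = 0.5.

    Hypothetical helper. Returns None if even a = 0 exceeds alpha.
    """
    total = 2**tosses          # number of equally likely toss sequences
    lower_tail = 0             # sequences with at most a heads
    best = None
    for a in range(tosses // 2):          # keeps the two tails disjoint
        lower_tail += comb(tosses, a)
        if 2 * lower_tail / total > alpha:  # both tails together are too likely
            break
        best = a
    return best

a = symmetric_rejection_cutoff(30, 0.05)
rejection = list(range(0, a + 1)) + list(range(30 - a, 31))
acceptance = list(range(a + 1, 30 - a))
print(a)            # prints 9
print(acceptance)   # 10 through 20
```

Calling the same function with a smaller significance level, for example 0.01, gives a smaller cutoff and therefore a wider acceptance region.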
In a two-sided test, the rejection region is the union of the two extreme events that are in the two “ends” of the distribution, which are called the tails of the distribution.

Type I Error
The probability of the rejection region in our experiment is 0.0428. That is, there is a 4.3% probability that we will reject the null hypothesis even though it is true. Erroneously rejecting the null hypothesis is called committing a type I error. The type I error leads to false positives. Since there is a positive probability that the null hypothesis is erroneously rejected, we can only conclude that the null hypothesis may not be true when we reject the null hypothesis.

Type II Error
The other possible error is not rejecting the null hypothesis when the alternative is true. This is called a type II error. The type II error leads to false negatives. The type II error can only be calculated if the alternative is sufficiently specified. In our example, we only said that the coin is biased under the alternative. There are infinitely many probability models that satisfy the assumption of the alternative, namely any binomial distribution with $p \ne 0.5$. For a fixed value of $p \ne 0.5$, we can calculate the type II error. For instance, let’s assume $p = 0.7$. Then

  $P(10 \le X \le 20 \mid p = 0.7) = 0.4112$

For larger values of $p$, this probability will be smaller.
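The type II error for a fixed alternative can be computed directly from the acceptance region. The following Python sketch is an illustration only; the function names and the default acceptance region 10 to 20 (taken from the rejection-region discussion for 30 tosses at the 0.05 level) are stated assumptions.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n tosses (hypothetical helper)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def type_two_error(p_true, n=30, accept_low=10, accept_high=20):
    """Probability that the head count falls in the acceptance region
    when the true probability of heads is p_true."""
    return sum(
        binomial_pmf(k, n, p_true)
        for k in range(accept_low, accept_high + 1)
    )

beta = type_two_error(0.7)
print(f"{beta:.4f}")   # prints 0.4112
```

Evaluating type_two_error for other values of p_true shows how the chance of missing a biased coin depends on how strongly the coin is biased.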
For instance, if $p = 0.8$, then

  $P(10 \le X \le 20 \mid p = 0.8) = 0.0611$

This means that the larger (or, by symmetry, the smaller) the probability of heads is, the better we will be able to detect whether a coin is biased. This is quantified by the power of the test, which is defined as 1 minus the probability of a type II error. The power of a test is therefore the probability of rejecting the null hypothesis when the alternative is true. The following table lists the power of the test for different values of the probability of heads:

  P(Heads)   Power
  0          1
  0.1        0.999546
  0.2        0.938913
  0.3        0.588816
  0.4        0.177143
  0.5        0.042774
  0.6        0.177143
  0.7        0.588816
  0.8        0.938913
  0.9        0.999546
  1          1

[Figure: power of the test as a function of the probability of heads]

Hypothesis Testing through Resampling
In our example of testing whether a coin is fair, we were able to calculate the probability distribution under the null hypothesis exactly. In many applications, the probability distribution under the null hypothesis is not known and must be simulated. This approach is called the resampling method. We can illustrate this important method on our example.

In the spreadsheet (tab “Simulation”), we set up a simulation of 30 independent trials, each with probability 0.5 of success. If we denote a success by a “1” and a failure by a “0,” then the syntax to accomplish this in EXCEL is “=IF(RAND()<0.5,1,0)” (see, for instance, Cell B4). If you add up the 30 values, you obtain the total number of successes in the 30 trials, which you can find in Cell E4. Write a macro that runs 500 such experiments and records the number of heads in each of the 500 runs in column I. If you want the type I error to be 5%, then since the test is two-sided, you need to determine the 2.5th and 97.5th percentiles of the simulation outcomes to find the acceptance region. Find the acceptance region and the corresponding rejection region. How does this compare to the exact calculations we did earlier?

Summary
Statistical hypothesis testing involves the following steps:
1. Formulation of a null hypothesis and an alternative.
2. Construction of a test statistic that can discriminate between the null hypothesis and the alternative. Calculate the probability distribution under the null hypothesis.
3. There are two ways to proceed from here. Either one allows us to control the type I error:
   a. Specification of the type I error and calculation of the rejection and acceptance region, followed by data collection and decision of whether to reject or not to reject the null hypothesis based on the data.
   b. Collection of data and calculation of the corresponding p-value, followed by decision of whether or not to reject the null hypothesis. The p-value is the probability, calculated under the null hypothesis, of observing an outcome at least as extreme as the one in the data. Rejecting the null hypothesis only when the p-value is at most the chosen significance level keeps the probability of a type I error at or below that level.

Worked-out Example
Problem: A jury pool includes 50% women and 50% men. A jury of 12 people was selected from this pool and included 3 women. A newspaper commented on the “biased selection process.”
(a) Test the hypothesis that the jury selection was fair.
(b) Repeat the test assuming now that the jury only included 2 women.

Solution: The first part of the solution applies to both (a) and (b). The null hypothesis is that the jury selection was fair, that is, the proportion of women is 0.5. The alternative is that the selection process was biased against women. We thus choose for the alternative that the proportion of women is less than 0.5:

  $H_0: p = 0.5$
  $H_1: p < 0.5$

The next step is to find a test statistic and to calculate the probability distribution of the test statistic under the null hypothesis. We can choose as the test statistic the number of women on the jury.
We denote the test statistic by $X$. The test statistic is binomially distributed with $n$, the number of trials, equal to 12, and $p$, the probability of success, equal to 0.5. The EXCEL function “=BINOMDIST(k, n, p, FALSE)” calculates the probability of k successes in n trials with success probability p. The last entry “FALSE” indicates that it calculates the probability mass function. To calculate the cumulative probability distribution, replace “FALSE” by “TRUE”. With $n = 12$ and $p = 0.5$, we obtain the following table:

   k    P(X=k)
   0    0.000244
   1    0.00293
   2    0.016113
   3    0.053711
   4    0.12085
   5    0.193359
   6    0.225586
   7    0.193359
   8    0.12085
   9    0.053711
  10    0.016113
  11    0.00293
  12    0.000244

(a) In this part of the problem, three women were on the jury. To calculate the p-value, we compute the probability of the event “three or fewer women”:

  $P(X \le 3) = 0.000244 + 0.00293 + 0.016113 + 0.053711 = 0.072998$

Since the p-value is about 0.073, we conclude that there is not enough evidence to reject the null hypothesis. In about 7.3% of jury selections from this pool, we would expect to see three or fewer women. The result is not statistically significant.

(b) The situation is different if only two women had been selected.
The probability of two or fewer women on the jury is

  $P(X \le 2) = 0.000244 + 0.00293 + 0.016113 = 0.019287$

Now, the p-value is only about 2%, which is statistically significant. We would now reject the null hypothesis.

Homework
1. A student committee composed of 20 upper division and lower division students needs to be assembled. One third of the student population is upper division students. The committee ends up having an equal number of upper division and lower division students. The lower division students, expecting a higher number of them on the committee, made the accusation that the selection process was biased against them. Test the hypothesis that the selection process was fair against the alternative that the selection process was biased against the lower division students.

2. In a cross between heterozygous plants of genotype Cc, we expect that 50% of the offspring are heterozygous (i.e., genotype Cc) and 50% are homozygous (i.e., either of genotype CC or of genotype cc). Among 14 plants, we find that 3 plants are homozygous and 11 plants are heterozygous. Test the hypothesis that the ratio of homozygous to heterozygous plants is 1:1 against the alternative that the ratio is different from 1:1.

3. Assume that the population distribution is normal with mean $\mu$ and standard deviation $\sigma$. We take a sample of size $n$ from this population and calculate the sample average

  $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$

We know that the distribution of $\bar{X}_n$ is then again normal with mean $\mu$ and standard deviation $\sigma / \sqrt{n}$. We can define a new random variable

  $Z = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}}$

which is normally distributed with mean 0 and standard deviation 1. Suppose now that the average lifetime of a sample of 50 medical devices is 5 years and 8 months with population standard deviation of 4 months. Assume that the lifetime of this medical device is listed as 6 years. Test the hypothesis $\mu = 6$ years against the alternative $\mu$ [relation symbol lost] 6 years at the 0.05 significance level by first calculating the rejection and acceptance regions for the 0.05 significance level. Calculate the power of this test.

4. For large samples, the sampling distributions can often be approximated by normal distributions even if the population distribution is not normal. Here is a typical example: A group of 100 students takes the ACT math test and has an average score of 20.6. The standard deviation is $\sigma = 3.2$. The average score nationwide was 21.2. Test whether the average score of this group of 100 students is lower than the national average.

5. Suppose you have a biased coin in your pocket. One side shows up with probability 0.3, the other with probability 0.7. Unfortunately, you don’t remember which side is more likely. Here are the two scenarios:

                 P(Heads)   P(Tails)
  Hypothesis A   0.3        0.7
  Hypothesis B   0.7        0.3

To determine whether the probability of heads is 0.3 (Hypothesis A) or 0.7 (Hypothesis B), you toss the coin 9 times and record the number of heads.
Here is the outcome of the experiment: H,T,T,T,H,H,T,T,H,T
(a) Based on this outcome, what do you think is the more likely scenario, Hypothesis A or Hypothesis B?
(b) The number of heads in the experiment is binomially distributed with 9 trials and probability of heads equal to 0.3 in Hypothesis A and 0.7 in Hypothesis B. Calculate the probabilities for the event “Number of Heads = k” under the two hypotheses for the ten possible values of k. For which values of k would you reject Hypothesis A? Calculate the probability of erroneously rejecting Hypothesis A. Calculate the probability of not rejecting Hypothesis A when in fact Hypothesis B is true.

6. Focht et al. 2002 reported in a research article on “the efficacy of duct tape vs cryotherapy in the treatment of Verruca vulgaris (the common wart)” (Arch. Pediatr. Adolesc. Med. 2002; 156: 971-974). Their objective was “[t]o determine if application of duct tape is as effective as cryotherapy in the treatment of common warts.” They enrolled 61 patients into their study; 51 patients completed the study. The main outcome measure was complete resolution of the wart being studied. Patients were randomized to receive either cryotherapy (liquid nitrogen applied to the wart every two to three weeks) or application of duct tape for a maximum of two months.
Of the 51 patients, 26 were treated with duct tape and 25 with cryotherapy. In the duct tape group, 22 had complete resolution of the wart; in the cryotherapy group, 15 patients had complete resolution of the wart. Here are the data in table form summarizing the outcome of the study:

                   Duct Tape   Cryotherapy   SUM
  No Resolution        4           10         14
  Resolution          22           15         37
  SUM                 26           25         51

(a) What percentage of patients completing the study were treated with duct tape, and what percentage were treated with cryotherapy?
(b) To test whether duct tape is more effective than cryotherapy, we design a statistical test. The null hypothesis states that the two treatments are equally effective. The alternative is that duct tape therapy is more effective than cryotherapy. Under the null hypothesis, the two treatments are equally effective. Under this assumption, we can develop a probability model to calculate the probability that 22 patients in the duct tape group saw complete resolution. This is how: We have a group of 51 patients, which is the population. 37 patients saw successful resolution, which is the group of successes in the population. 26 patients are randomly assigned to the duct tape group, which is the sample size. The number of successes in the sample is 22. This is equivalent to the following urn problem that we can solve using basic probability theory: An urn has a total of 51 balls, 37 of which are blue; the remainder are green. We take a sample of 26 balls at random from the urn. What is the likelihood that 22 balls are blue? If we denote the number of successes in the sample by $X$, we can calculate the probability of this event using the hypergeometric distribution:

  $P(X = 22) = \frac{\binom{37}{22} \binom{14}{4}}{\binom{51}{26}}$

Excel has a function that will calculate this probability: “=HYPGEOMDIST(22,26,37,51).” To calculate the p-value, we need to determine the probability of at least 22 complete resolutions in a sample of size 26 when the population size is 51 and the number of successes in the population is 37.
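For readers working outside of Excel, the hypergeometric probability in the formula above can be written as a short Python function. This is a sketch only; the name hypergeometric_pmf is an assumption, and the argument order was chosen here to mirror the order used in the HYPGEOMDIST call (successes in the sample, sample size, successes in the population, population size).

```python
from math import comb

def hypergeometric_pmf(k, sample_size, successes_in_pop, pop_size):
    """P(X = k): exactly k successes in a sample drawn without replacement.

    Hypothetical helper; mirrors the urn model in the text.
    """
    failures_in_pop = pop_size - successes_in_pop
    ways_successes = comb(successes_in_pop, k)               # choose the blue balls
    ways_failures = comb(failures_in_pop, sample_size - k)   # choose the green balls
    return ways_successes * ways_failures / comb(pop_size, sample_size)

# Probability of exactly 22 resolutions among the 26 duct tape patients.
p_22 = hypergeometric_pmf(22, 26, 37, 51)
print(p_22)
```

The same function can be evaluated for other values of k when a tail probability is needed.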
Find this probability. What can you conclude? (The statistical test in this problem is called Fisher’s exact test.)

7. In 2006, another study on the efficacy of duct tape in treating warts was done to address some of the shortcomings of the first study, in particular the small sample size and the lack of a placebo group. The study by de Haen et al. (2006) on the “efficacy of duct tape vs placebo in the treatment of Verruca vulgaris (warts) in primary school children” (Arch. Pediatr. Adolesc. Med. 2006; 160: 1121-1125) had 103 participants who completed treatment; 51 patients were treated with duct tape and the remaining 52 patients received a placebo treatment. After 6 weeks, the wart had disappeared in 8 of the children in the duct tape group and 3 of the children in the placebo group. Test whether the duct tape treatment is more effective.