Chapter 2-15. Equivalence Tests and Noninferiority Tests

_____________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript]. University of Utah School of Medicine, 2011. http://www.ccts.utah.edu/biostats/?pageId=5385
_____________________

Barker et al (2002) provide an example research situation where equivalence is the hypothesis of interest. Motivated by the public health policy of eliminating health disparities in vaccination coverage among various groups, vaccination coverage collected in the 2000 National Immunization Survey was analyzed. The following shows their data for polio vaccination coverage.

               Coverage    Disparity (group – white)
   White        90.6%      Ref
   Black        86.8%      -3.8%
   Hispanic     87.9%      -2.7%
   Asian        92.7%      +2.1%

What would be a good way to approach these data? Using the classical "difference testing" approach, a significant difference would be observed, at the alpha = 0.05 level, if the 95% confidence interval around the difference did not cover 0.

                               White    Black          Hispanic       Asian
   Coverage                    90.6%    86.8%          87.9%          92.7%
   Disparity (group – white)   Ref      -3.8%          -2.7%          +2.1%
   95% CI for difference                -5.9 , -1.7*   -4.5 , -0.9*   -0.7 , 4.9
   * Denotes significance, p<0.05, since 95% CI does not cover 0.

Should we conclude that disparity still exists for two of these groups, and conclude that disparity has been eliminated between Whites and Asians?

First, let's consider the White-Asian comparison: Asian – White: +2.1%, 95% CI (-0.7% , 4.9%). We might consider a conclusion of equivalent coverage because statistical significance was not achieved. However, there is a well-known competing explanation: perhaps this is simply due to an insufficient sample size. That is, if this same 2.1% difference were maintained in a sufficiently larger sample, it would be significant.

It would seem, however, that if our sample size provided adequate power to detect some smallest meaningful disparity effect, say a 3% difference, failing to achieve significance should permit the conclusion of equivalence.

So, if the power was 95% to detect a 3% difference, then we would have had a 95% probability that our sample would provide a conclusion of disparity if that difference of 3%, or one larger, existed in the sampled population. The probability of making a type II error (concluding no difference when a difference exists) is then 1 - power = beta = (1 - .95) = 0.05. With this reasoning, and our 2.1% observed difference, it seems we should be able to conclude equivalence at the beta = 0.05 level.
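For concreteness, the sample size behind such a power statement can be computed with the usual normal-approximation formula. Here is a minimal Stata sketch; the comparison coverage of 87.6% is an illustrative assumption, chosen to sit 3 points below the White coverage, and is not the actual NIS design:

   * n per group for 95% power to detect a 3-point difference in coverage
   * (90.6% vs 87.6%, two-sided alpha = 0.05; illustrative inputs only)
   scalar p1 = 0.906
   scalar p2 = 0.876
   scalar n  = (invnormal(0.975) + invnormal(0.95))^2 * ///
               (p1*(1 - p1) + p2*(1 - p2)) / (p1 - p2)^2
   display "n per group = " ceil(n)

This displays 2798, so about 2,800 subjects per group: 95% power for a 3% disparity is a demanding design requirement.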
Such reasoning has been frequently applied. Greene (2000) reviewed human subjects papers listed in Medline from 1992 to 1996 and found that 67% of those papers followed that approach.

To give a published example of this approach being used, so that you can recognize it when you see it, Rumsfeld et al (2003) published a clinical trial of high-risk patients with medically refractory ischemia randomized to percutaneous coronary intervention (PCI) versus coronary artery bypass graft (CABG) surgery. In this study, a non-significant difference of less than 2 points in multivariable models was observed in six-month health-related quality of life (HRQL) between the two study arms. The authors reported having "…97% power to detect a four-point difference in scores, where four to seven points is a clinically important difference." They concluded, "High-risk patients with medically refractory ischemia randomized to PCI versus CABG surgery have equivalent six-month HRQL."

------
APOLOGY: I apologize to Rumsfeld et al for using them as an example. They were doing what they had learned was a correct analysis, as this approach has been widely taught and advocated. I even used to teach this approach. It will take decades for this approach to go away.
------

This approach to equivalence testing was made popular by Jacob Cohen (1988), the author of the classic text, Statistical Power Analysis for the Behavioral Sciences, being advocated in both the first and second editions. He also proposed it elsewhere (Cohen, 1965). As pointed out by Hoenig and Heisey (2001, p.21), Cohen, without providing a mathematical proof, advocated using power to support the null hypothesis. Cohen claimed (1988, p.16) that if you design a study to have high power (say, power = 0.95, so beta = 0.05) to detect a stated biologically meaningful effect size, delta (or alternatively some trivial effect size), and you then fail to reject the null hypothesis (p > alpha, or p > 0.05), then you can conclude that the effect in the population is smaller than delta, with beta as the probability of error.

No one has ever been able to provide a proof that Cohen's assertion is logically consistent. Finally, in 2001, Hoenig and Heisey published a proof in The American Statistician demonstrating that Cohen's assertion is, in fact, false.

Unfortunately, Jacob Cohen died in 1998, before this equivalence testing approach was logically refuted, so his textbook will never see a new edition in which the error is corrected.

This approach is no longer acceptable in the approval process of new drugs in Europe, Japan, and the United States. The E9 guidance document states (ICH, 1999, section 3.3.2, Trials to Show Equivalence or Noninferiority),

"Concluding equivalence or noninferiority based on observing a nonsignificant test result of the null hypothesis that there is no difference between the investigational product and the active comparator is considered inappropriate."

Hoenig and Heisey used the method of proof by contradiction (see box).

Proof by Contradiction

Let P and Q be two propositions. A proposition is a statement that is either true or false. The steps used in a proof by contradiction are:

Proof of P by Contradiction (Smith et al, 1997, p.34)
   Suppose Not-P.  (where Not-P is true if P is false, and false if P is true)
   …
   Therefore, Q.
   …
   Therefore, Not-Q.
   Hence, Q and Not-Q, a contradiction (since a proposition cannot be both true and false).
   Thus, P.

Example from Mathematics of a Proof by Contradiction

Aim: prove that division by zero is meaningless (which is why division by zero is said to be undefined).

   Proof Steps                              Proof
   Suppose Not-P:                           Suppose division by 0 is meaningful.
   Therefore, Q:                            Let a ≠ 0. Let a/0 = b. (using the opening premise in this step)
   Therefore, Not-Q:                        Therefore, a = b × 0 = 0. (multiplying both sides by 0)
   Hence, Q and Not-Q, a contradiction:     Hence both a ≠ 0 and a = 0, a contradiction.
   Thus, P:                                 Thus division by 0 is meaningless.

Hoenig and Heisey's Proof by Contradiction of the Fallacy of Using a Power Analysis to Demonstrate the Null Hypothesis of No Effect

Of course, proof by contradiction is a logical construct that is not limited to mathematics.
Hoenig and Heisey state their proof in English sentences, so it would be better to say that theirs is a logical proof, rather than a mathematical proof. Most researchers, however, are not trained in logic or mathematics, so they would not recognize that a proof is being presented. Here is how Hoenig and Heisey (2001, p.21) stated their logical proof:

"A number of authors have noted that observed power may not be especially useful, but to our knowledge a fatal logical flaw has gone largely unnoticed. Consider two experiments that gave rise to non-rejected null hypotheses. Suppose that the observed power was larger in the first experiment than the second. Advocates of observed power would interpret this to mean that the first experiment gives stronger support favoring the null hypothesis. Their logic is that if power is low one might have missed detecting a real departure from the null hypothesis but if, despite high power, one fails to reject the null hypothesis, then the null is probably true or close to true. This is easily shown to be nonsense. For example, consider the one-sided Z test described above. Let Zp1 and Zp2 refer to the observed test statistics in the respective experiments. The observed power was highest in the first experiment, and we know this implies Zp1 > Zp2, because observed power is an increasing function of the Z statistic. So by usual standards of using the p value as statistical evidence, the first experiment gives the stronger support against the null, contradicting the power interpretation.

"We will refer to this inappropriate interpretation as the 'power approach paradox' (PAP): higher observed power does not imply stronger evidence for a null hypothesis that is not rejected."

Let's examine how a proof by contradiction was applied.

Step 1. Suppose Not-P.
If power is higher and the test is not significant, this implies stronger evidence in favor of the null hypothesis. (the misconception to be shown fallacious)

Step 2. Therefore, Q.
Choose any two experiments, each of which produces a test statistic (say, Z) that is not significant. Use subscript 1 for the experiment with the greater power, so that

   Power1 > Power2

Because power always increases when Z increases (a known relationship between power and the test statistic), we have

   Z1 > Z2

Therefore, Z1 provides stronger evidence in favor of the null hypothesis (from our assumption in Step 1), because the power is higher.

Step 3. Therefore, Not-Q.
We know that the larger the test statistic, the smaller the p value. So,

   p value1 < p value2

By the common standards for interpreting p values in statistics, the smaller the p value, the stronger the evidence against the null hypothesis. Therefore, Z1 provides stronger evidence against the null hypothesis.

Step 4. Hence, Q and Not-Q, a contradiction.
The conclusions of Steps 2 and 3 contradict each other.

Step 5. Thus, P.
The opposite of the proposition in Step 1 must therefore be true. That is,

   Not-{If power is higher and the test is not significant, this implies stronger evidence in favor of the null hypothesis.}

which is to say,

   If power is higher and the test is not significant, this does not imply stronger evidence in favor of the null hypothesis.

Thus, there is no logical justification for using power to support the null hypothesis.
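The paradox is easy to see numerically. Here is a minimal Stata sketch for the one-sided Z test at alpha = 0.05, using two arbitrary nonsignificant test statistics (z = 1.5 and z = 1.0 are illustrative choices, not from any real study); observed power is Φ(z − z0.95) and the p value is 1 − Φ(z):

   * two nonsignificant experiments from a one-sided Z test (alpha = 0.05)
   foreach z in 1.5 1.0 {
       display "z = `z'   p value = " %6.4f 1 - normal(`z') ///
               "   observed power = " %6.4f normal(`z' - invnormal(0.95))
   }

Experiment 1 (z = 1.5) has the higher observed power (0.44 vs 0.26) and also the smaller p value (0.067 vs 0.159), that is, the stronger evidence against the null, which is the contradiction.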
The correct way to provide a statistical argument (or probability argument) to support the null hypothesis is to apply one of the procedures known as equivalence tests or bioequivalence tests, or, if a specific direction is of interest, noninferiority and non-superiority tests.

A variant of the fallacious power approach is the post hoc power, or observed power, approach, which was also refuted in the Hoenig and Heisey article (see box).

Post Hoc Power Approach

Journal reviewers frequently request that researchers compute a post hoc power analysis when failing to reject the null hypothesis. This is done by computing the power of the test based on the observed effects, variability, and sample size. Advocates of this approach argue that there is evidence for the null hypothesis being true if statistical significance was not achieved despite high computed power at the observed effect size (Hoenig and Heisey, 2001, p.19).

As pointed out by Hoenig and Heisey (2001, p.20), several major statistical packages, such as SPSS, provide observed power. Using the menus in SPSS version 12:

   Analyze -> General Linear Model -> Univariate -> Options -> Observed Power

   Group Statistics
   outcome   group   N   Mean   Std. Deviation   Std. Error Mean
             1       5   5.40   1.140            .510
             2       5   3.80   1.924            .860

   Tests of Between-Subjects Effects
   Source            Type III SS   df   Mean Square   F        Sig.   Noncent. Parameter   Observed Power(a)
   Corrected Model   6.400(b)      1    6.400         2.560    .148   2.560                .292
   Intercept         211.600      1    211.600       84.640   .000   84.640               1.000
   group             6.400        1    6.400         2.560    .148   2.560                .292
   Error             20.000       8    2.500
   Total             238.000      10
   Corrected Total   26.400       9
   a  Computed using alpha = .05
   b  R Squared = .242 (Adjusted R Squared = .148)

We can next verify that the observed power is nothing more than plugging the observed means, standard deviations, and sample sizes into a sample size calculation. Using SamplePower version 2.0, specifying an independent-groups t test, we get:

"With the proposed sample size of 5 and 5 for the two groups, the study will have power of 29.2% to yield a statistically significant result. This computation assumes that the mean difference is 1.600 (corresponding to means of 5.400 versus 3.800) and the common within-group standard deviation is 1.581 (based on SD estimates of 1.140 and 1.924)."
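The .292 can also be verified directly: observed power is just the power of the two-sided t test evaluated at the observed effect. Here is a minimal Stata sketch using the noncentral t distribution; the pooled SE is √(2.5 × (1/5 + 1/5)) = 1 from the error mean square above, so the observed t statistic, which serves as the noncentrality parameter, is 1.6:

   * observed power for the two-sided t test, df = 8, alpha = .05
   scalar t_obs  = (5.40 - 3.80) / sqrt(2.5*(1/5 + 1/5))   // = 1.6
   scalar t_crit = invttail(8, 0.025)                      // = 2.306
   display "observed power = " %5.3f ///
           1 - nt(8, t_obs, t_crit) + nt(8, t_obs, -t_crit)

This returns the same .292 reported by SPSS and SamplePower, confirming that observed power contains no information beyond the test statistic itself.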
Equivalence Testing

The correct way to demonstrate equivalence is with the class of statistical procedures called equivalence tests. Bioequivalence tests are often used in the pharmaceutical industry. However, there are many other instances, such as in public health, when an equivalence test is what is actually required. For example, an investigator might want to show that access to health care is essentially equal for two groups of people, such as between the young working population and the elderly.

Let's take a bioequivalence example. The United States Food and Drug Administration (FDA) will grant approval of a drug if it can be shown to have the same bioavailability profile as an already approved drug, within 20% (the 20% rule). Expressed in proportional form, for a test drug (T) and a referent drug (R), bioequivalence is accepted if

   0.80 ≤ μT / μR ≤ 1.20

This suggests an interval hypothesis testing approach, which Chow and Liu (2000, p.97) describe as follows. The hypothesis of bioequivalence can be formulated as an interval hypothesis,

   H0: μT − μR ≤ θL  or  μT − μR ≥ θU
   vs
   Ha: θL < μT − μR < θU

where θL and θU are some clinically meaningful limits (such as mean differences that represent 20% of the reference mean).

To show bioequivalence, we reject the null hypothesis of not bioequivalent. Notice that this is the opposite of the usual statistical hypothesis, where the null hypothesis is a statement of equality.

The interval hypothesis can be decomposed into two sets of one-sided hypotheses,

   H01: μT − μR ≤ θL  vs  Ha1: μT − μR > θL   (to verify that the bioavailability is not too low)
   H02: μT − μR ≥ θU  vs  Ha2: μT − μR < θU   (to verify that the bioavailability is not too high)

If one concludes both Ha1 and Ha2, then it can be concluded that Ha: θL < μT − μR < θU. The first hypothesis is called a "lack of inferiority" or "non-inferiority" test, and the second is called a "lack of superiority" or "non-superiority" test.

Schuirmann's Two One-Sided Tests (TOST) Procedure

Chow and Liu (2000, p.98) describe this procedure as follows. Schuirmann's (1987) procedure suggests the conclusion of equivalence of μT and μR, at the α level of significance, if and only if both H01 and H02 are rejected at a predetermined α level of significance. The two sets of one-sided hypotheses can be tested with ordinary one-sided t tests. We conclude that μT and μR are equivalent if

   TL = (ȲT − ȲR − θL) / (σ̂d √(1/n1 + 1/n2)) > t(α, n1 + n2 − 2)

and

   TU = (ȲT − ȲR − θU) / (σ̂d √(1/n1 + 1/n2)) < −t(α, n1 + n2 − 2)

The two one-sided t tests procedure is operationally equivalent to the classical confidence interval approach. If the classical (1 − 2α) × 100% confidence interval for μT − μR is within (θL, θU), then both H01 and H02 are rejected at the α level by the two one-sided t tests procedure.

One disadvantage of the TOST procedure is that it requires reporting two p values. Another popular equivalence test, then, is Anderson and Hauck's test (Chow and Liu, 2000), which uses only one p value to reject the null hypothesis of non-equivalence.

The TOST procedure is known to be slightly conservative. Several tests have been developed that are more powerful for showing equivalence (Chow and Liu, 1992). Barker et al (2001), for example, describe eight equivalence tests for binomial (yes/no) variables and discuss their relative power.

One advantage of the TOST procedure, over other equivalence tests, is that it is easy for nonstatisticians to understand. Also, it is not limited to the t test. One could use it just as well for comparing two proportions, or with any test where it is possible to specify the hypothesized non-zero difference in the numerator of the test statistic.
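Here is a minimal Stata sketch of the TOST computation from summary statistics. All numbers are hypothetical illustrations (means 98 vs 100, common SD 10, 40 per group, equivalence limits θL = -4 and θU = +4), not from any study in this chapter:

   * Schuirmann's TOST from summary statistics (hypothetical values)
   scalar diff = 98 - 100                    // YT - YR
   scalar se   = 10 * sqrt(1/40 + 1/40)      // pooled-SD standard error
   scalar df   = 40 + 40 - 2
   scalar TL   = (diff - (-4)) / se          // test of H01 at theta_L = -4
   scalar TU   = (diff - 4) / se             // test of H02 at theta_U = +4
   display "p for H01 = " %6.4f ttail(df, TL)       // reject if < .05
   display "p for H02 = " %6.4f 1 - ttail(df, TU)   // reject if < .05

With these inputs the H02 test rejects (p ≈ 0.004) but the H01 test does not (p ≈ 0.19), so equivalence is not concluded. Equivalently, the 90% CI for the difference, -2 ± 1.665 × 2.236 ≈ (-5.7, 1.7), is not contained in (-4, 4).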
Confidence Interval Approach

Perhaps the biggest advantage of the TOST procedure is that it is equivalent to the confidence interval approach. Such confidence intervals are readily available in all existing software packages. Using the confidence interval approach, one can avoid having to report the two p values of the TOST procedure, and the confidence intervals are easy for readers to understand.

In this approach, a (1 − 2α) × 100% CI for μT − μR is constructed using the two sample means or two sample proportions. If this CI is completely contained within the interval (θL, θU), then equivalence is demonstrated. That is, first choose the equivalence window, say ±20%. Next, compute the 90% CI for the difference in means, or difference in proportions. Basically, this CI will look like (see box):

   (X̄test − X̄referent) ± 1.645 √(s²test/ntest + s²referent/nreferent)

If the endpoints of this CI are within the -20% and +20% bounds, then a conclusion of equivalence is supported.

Point and Interval Estimation for the Risk Difference (Rosner, 1995, p.363)

Let p̂1, p̂2 represent the sample proportions who develop disease in a prospective study, based on sample sizes of n1 and n2, respectively. A point estimate of the risk difference is given by p̂1 − p̂2. A 100% × (1 − α) confidence interval for the risk difference is given by (c1, c2), where

   c1 = p̂1 − p̂2 − z(1−α/2) √(p̂1q̂1/n1 + p̂2q̂2/n2)
   c2 = p̂1 − p̂2 + z(1−α/2) √(p̂1q̂1/n1 + p̂2q̂2/n2)

Use this expression for the confidence interval only if n1 p̂1 q̂1 ≥ 5 and n2 p̂2 q̂2 ≥ 5.

Notice that a 90% CI is equivalent to two one-sided t tests, using α = 0.05 for each t test. It might seem strange that a 95% CI is not used. Westlake (1981) proposed the use of a 90% CI in order to achieve an α = 0.05 level test. The FDA requires an α = 0.05 level test for demonstrating efficacy of a new drug, so the (1 − 2α), or 90%, CI for equivalence testing achieves parallelism with the efficacy testing requirement (Westlake, 1988, p.343).

Westlake's argument is based on the fact that a 95% CI is wider than a 90% CI, making it harder to demonstrate equivalence. The (1 − 2α), or 90%, CI makes it just as easy to achieve significant equivalence as it would be to achieve efficacy at the 0.05 level. If a 95% CI is used, the nominal α is at most α/2, providing a 0.025 level test.

Chow and Liu (2000, p.80) state that the FDA has adopted the approach of using a 90% CI for bioequivalence studies: "The FDA requires that the bioequivalence be concluded with 90% assurance." They then refer to the 90% confidence interval on page 81.
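Here is a minimal Stata sketch of Rosner's interval computed by hand. The inputs (proportions 0.65 and 0.60 with n = 100 per group) are illustrative placeholders, chosen to match the worked example later in this chapter:

   * 90% CI for a risk difference, Rosner's formula (illustrative inputs)
   scalar p1 = 0.65
   scalar p2 = 0.60
   scalar n1 = 100
   scalar n2 = 100
   * validity check: n*p*q = 22.75 and 24.00, both >= 5
   scalar se = sqrt(p1*(1-p1)/n1 + p2*(1-p2)/n2)
   scalar z  = invnormal(0.95)               // 1.645 for a 90% CI
   display "90% CI: " %7.4f p1-p2 - z*se "  to  " %7.4f p1-p2 + z*se

This gives (-0.0625, 0.1625); the test-based interval from Stata's cs command in the later example, (-0.0629, 0.1629), is essentially the same.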
Noninferiority Tests Using Confidence Intervals

A noninferiority test is one of the very rare instances where a one-sided comparison is appropriate. The FDA guidance document "E9 Statistical Principles for Clinical Trials" (ICH, 1999, section 5.5) states,

"For noninferiority trials, a one-sided interval should be used. The confidence interval approach has a one-sided hypothesis test counterpart for testing the null hypothesis that the treatment difference (investigational product minus control) is equal to the lower equivalence margin versus the alternative that the treatment difference is greater than the lower equivalence margin. The choice of Type I error should be a consideration separate from the use of a one-sided or two-sided procedure."

The 2-Sided 95% CI versus 1-Sided 95% CI for Noninferiority Testing (What Are Researchers Using?)

Piaggio et al (2006, p.1154) published a methods paper extending the Consolidated Standards of Reporting Trials (CONSORT) to the reporting of noninferiority and equivalence trials. In their paper, they prefer a 2-sided 95% CI:

"Many noninferiority trials based their interpretation on the upper limit of a 1-sided 97.5% CI, which is the same as the upper limit of a 2-sided 95% CI. Although both 1-sided and 2-sided CIs allow for inferences about noninferiority, we suggest that 2-sided CIs are appropriate in most noninferiority trials.29 If a 1-sided 5% significance level is deemed acceptable for the noninferiority hypothesis test42 (a decision open to question), a 90% 2-sided CI could then be used.
__________
29 Points to Consider on the Choice of Noninferiority Margin. London, England: European Medicines Agency (EMEA); February 26, 2004. Available at: http://www.emea.eu.int/pdfs/human/ewp/215899en.pdf. Accessed February 9, 2006.
42 Sackett DL. Superiority trials, non-inferiority trials, and prisoners of the 2-sided null hypothesis. ACP J Club. 2004;140:A11."

The Piaggio et al suggestion is consistent with the FDA guidance document "E9 Statistical Principles for Clinical Trials" (ICH, 1999, section 5.5), which states,

"The approach of setting Type I errors for one-sided tests at half the conventional Type I error used in two-sided tests is preferable in regulatory settings. This promotes consistency with the two-sided confidence intervals that are generally appropriate for estimating the possible size of the difference between two treatments."

In the medical literature, using a two-sided 95% CI is popular for noninferiority studies. This is what is advocated in the EMEA guidance document (reference 29, two paragraphs above). Piaggio et al (2006, p.1154) also advocate this approach (see two paragraphs above).

Personally, I (Stoddard) currently prefer to use a one-sided test based on a two-sided 95% confidence interval for noninferiority testing. This is consistent with Piaggio's recommendation and with the E9 guidance statement, two paragraphs above. One clear advantage of this approach is that it allows the reader to use your two-sided 95% confidence interval to test for an effect in the opposite direction. It is true that you only have interest in one direction, but the reader has an interest in either direction.

Even so, many researchers are still using the one-sided approach with an alpha of 0.05. For example, van der Gaag et al (N Engl J Med, 2010) used a one-sided alpha = 0.05 significance test of noninferiority, which is identical to using a one-sided 95% CI. In their Statistical Analysis section they state,

"Assuming that there would be a complication rate of 38% in the early-surgery group and 48% in the biliary-drainage group, we would consider early surgery to be noninferior if the associated percentage of serious complications was less than 10 percentage points above the percentage of serious complications in the biliary-drainage group. We used a two-group large-sample normal approximation test of proportions, with a one-sided significance level of 0.05, to test the null hypothesis that early surgery would lead to an increase of at least 10 percentage points in the rate of complications, as compared with preoperative biliary drainage, followed by surgery. To attain a power of 80% to show noninferiority of the early surgery, 94 patients were needed in each group."

A second example is Haskal et al (N Engl J Med, 2010). In their Study End Points section, they state their noninferiority hypothesis as,

"The study objective was to demonstrate that treatment with a stent graft is not inferior to treatment with balloon angioplasty alone regarding the primary end point, the 6-month primary patency of a stenotic venous anastomosis in the treatment area."

Then, in their Statistical Analysis section, they state,

"We calculated the sample size needed to test the primary noninferiority hypothesis using the methods of Blackwelder.19 The incidence of primary patency at 6 months was estimated as 60% in the stent-graft group and 50% in the balloon-angioplasty group. The two rates were considered clinically noninferior if the difference was 10 percentage points or less (with a significance threshold of P=0.05 on a one-tailed test and 80% statistical power). On this basis, the number of patients required for each of the two treatment groups was calculated to be 76. The target number of patients enrolled in each group was set at 95, to account for a dropout rate of up to 20%. Thus, the total target sample size was 190 patients.
--------
19 Blackwelder WC. 'Proving the null hypothesis' in clinical trials. Control Clin Trials 1982;3:345-53."
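The 76 per group in Haskal et al can be reproduced with the usual large-sample formula for noninferiority of two proportions, in the spirit of Blackwelder (1982). A minimal Stata sketch follows; it assumes the formula n = (z(1−α) + z(1−β))² [p1(1−p1) + p2(1−p2)] / (p1 − p2 + δ)², which is a plausible reading of their methods rather than a confirmed one:

   * noninferiority sample size, two proportions (Haskal et al inputs)
   scalar p1    = 0.60              // expected patency, stent graft
   scalar p2    = 0.50              // expected patency, balloon angioplasty
   scalar delta = 0.10              // noninferiority margin
   scalar za    = invnormal(0.95)   // one-sided alpha = 0.05
   scalar zb    = invnormal(0.80)   // power = 80%
   scalar n = (za + zb)^2 * (p1*(1-p1) + p2*(1-p2)) / (p1 - p2 + delta)^2
   display "n per group = " ceil(n)   // 76

This returns 76 per group, matching the published calculation; inflating for 20% dropout (76/0.8) gives their target of 95 per group.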
Returning to the Vaccination Disparity Example

The first step in equivalence testing is to state the smallest acceptable difference, such that anything smaller would be considered equivalent. In our beginning immunization disparity example, we might consider an absolute difference of 5% to be acceptable, so that differences in the range -5% to +5% are the same thing as a 0% difference (equivalence).

The choice of a range should depend on the context of the research question, such as what would be the public health impact of a 5% disparity. In bioequivalence testing of drugs, the FDA allows a relative 20% window in the average bioavailability of a test drug compared with a referent drug. In other situations, however, the decision should be based on what a clinician would find acceptable. In this immunization example, it is doubtful that a public health professional would find a 20% difference acceptable.

Using the difference testing approach,

                               White    Black          Hispanic       Asian
   Coverage                    90.6%    86.8%          87.9%          92.7%
   Disparity (group – white)   Ref      -3.8%          -2.7%          +2.1%
   95% CI for difference                -5.9 , -1.7*   -4.5 , -0.9*   -0.7 , 4.9
   * Denotes significance, p<0.05, since 95% CI does not cover 0.

we would conclude disparity between Whites and Blacks and between Whites and Hispanics. The White/Asian conclusion would be "insufficient evidence to conclude disparity."

Using an equivalence testing approach, however,

                               White    Black         Hispanic       Asian
   Coverage                    90.6%    86.8%         87.9%          92.7%
   Disparity (group – white)   Ref      -3.8%         -2.7%          +2.1%
   90% CI for difference                -5.5 , -2.1   -4.2 , -1.2*   -0.3 , 4.5*
   * Denotes significance, p<0.05, since 90% CI falls within the -5% to +5% equivalence window.

we would conclude equivalence between Whites and Hispanics and between Whites and Asians. There would be insufficient evidence to conclude equivalence between Whites and Blacks.

What Researchers Are Using

Now that we know the correct way to do it, let's see what researchers are actually reporting. Greene (2000) found 1209 citations in Medline (1992 through 1996) that contained the word "equivalence," out of which 88 turned out to be original research reports involving human subjects that made an equivalence claim. In the 88 articles, Greene found that:

   23% of articles correctly set an equivalence boundary and confirmed equivalence with an appropriate statistical approach (the right way)
   67% of articles declared equivalence after failing to show a significant difference (the refuted way, although refuted after these papers were published)
   10% of articles claimed equivalence without the use of statistics (the "shouldn't be doing research" way)

Henanff et al (2006) also did a review, but they limited it to papers that were specifically testing equivalence or non-inferiority, ignoring papers that claimed equivalence after failing to demonstrate a difference. Thus, their paper does not help to determine whether the situation has improved relative to the Greene paper.
Some Available Software for Equivalence Testing

Almost always, a confidence interval approach is used, and that can be done in Stata without the need to purchase any specialty software. If you really want a p value, however, here is some suggested software:

In Chow and Liu's textbook (2000), SAS code for a wide variety of equivalence tests is found in the appendix. This code is not available from a website.

Accompanying Wellek's textbook (2003) is free software (e.g., SAS procs) on the author's website: http://www.zi-mannheim.de/wktsheq

A friendly-to-use software package for equivalence testing is available, called EquivTest. A description can be found at the vendor's website: http://www.statsol.ie/

The statistical package StatXact-7 provides equivalence testing for proportions, as well as a sample size calculation.

Sample size for equivalence or noninferiority studies is easily done with some simple formulas. These can be found in Chow et al (2008).

Example

Let's consider the dichotomous case. Suppose we have data that look like:

              |       therapy
      recover |         0          1 |     Total
   -----------+----------------------+----------
            0 |        40         35 |        75
              |     40.00      35.00 |     37.50
   -----------+----------------------+----------
            1 |        60         65 |       125
              |     60.00      65.00 |     62.50
   -----------+----------------------+----------
        Total |       100        100 |       200
              |    100.00     100.00 |    100.00

In the control group, 60% recovered. In the test group, 65% recovered. If we use a 20% window, equivalence will be demonstrated if the test group is within 20% of the referent group (the control group, in this case). Since 60% × 0.8 = 48% and 60% × 1.2 = 72%, the absolute difference between the two group percentages is required to be within (-12%, +12%).

Using Stata, we compute a 90% test-based confidence interval around the percentage difference using the command

   cs recover therapy, tb level(90)

and we get

                    |       therapy          |
                    |   Exposed   Unexposed  |      Total
   -----------------+------------------------+-----------
              Cases |        65          60  |        125
           Noncases |        35          40  |         75
   -----------------+------------------------+-----------
              Total |       100         100  |        200
                    |                        |
               Risk |       .65          .6  |       .625
                    |                        |
                    |      Point estimate    | [90% Conf. Interval]
                    |------------------------+------------------------
    Risk difference |             .05        |  -.062898    .162898 (tb)
         Risk ratio |        1.083333        |  .9042128   1.297937 (tb)
    Attr. frac. ex. |        .0769231        | -.1059344   .2295465 (tb)
    Attr. frac. pop |             .04        |
                    +-------------------------------------------------
                                  chi2(1) = 0.53  Pr>chi2 = 0.4652

Our observed 90% CI for the risk difference (proportion difference) is (-0.062898, 0.162898). Since this fails to lie within the equivalence window of (-0.12, 0.12), our data fail to demonstrate equivalence.

To verify this was done correctly, an equivalence test for two independent proportions was run in StatXact-7, using these same data. The result was:

   1st one-sided test:          p = 0.006
   2nd one-sided test:          p = 0.152
   90% CI around difference:    -0.063 , 0.162

Since only one of the two one-sided tests was significant at the 0.05 level, rather than both, equivalence was not established by Schuirmann's two one-sided tests procedure.
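The StatXact p values can be reproduced by hand with large-sample Z statistics. Here is a minimal Stata sketch of Schuirmann's two one-sided tests for the risk difference; it assumes the unpooled standard error, so an exact method may differ slightly in the second decimal:

   * TOST for two proportions: 65/100 vs 60/100, window (-0.12, +0.12)
   scalar rd = 0.65 - 0.60
   scalar se = sqrt(0.65*0.35/100 + 0.60*0.40/100)
   scalar zL = (rd - (-0.12)) / se          // H01: difference <= -0.12
   scalar zU = (rd - 0.12) / se             // H02: difference >= +0.12
   display "p for H01 = " %5.3f 1 - normal(zL)   // 0.006
   display "p for H02 = " %5.3f normal(zU)       // 0.153

These agree with the StatXact values (0.006 and 0.152) to within rounding, and again only the first one-sided test rejects.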
Testing Noninferiority and Superiority in the Same Study

It is common, and acceptable, to test both the noninferiority and the superiority hypothesis in the same study, using the same nominal alpha (α = 0.05, for example) for both comparisons. There is no need to adjust this alpha for multiplicity (multiple comparisons).

Strategy 1) First test for noninferiority. If noninferiority is demonstrated (p < 0.05), using the prespecified noninferiority margin, then test for superiority using the ordinary null value (mean difference = 0, RR = 1, OR = 1) and the same alpha (p < 0.05). If noninferiority is not demonstrated, then superiority is automatically not demonstrated as well.

Strategy 2) First test for superiority. If superiority is demonstrated, then noninferiority is usually of no interest but is demonstrated as well. If superiority is not demonstrated, then go on to test noninferiority using the same alpha (p < 0.05).

Hung and Wang (2004) describe this approach:

"Morikawa and Yoshida (1995) and Dunnett and Tamhane (1997) considered the case of two δs; specifically, one δ is zero for the superiority objective and the other is a specified positive real number corresponding to the noninferiority objective described by the defined noninferiority margin. To test the superiority hypothesis and the noninferiority hypothesis, two possible stepwise strategies can be entertained. One strategy (labeled as S-NI) begins with testing the superiority hypothesis. Achievement of superiority immediately leads to achievement of the so-defined noninferiority. If superiority is not achieved, test for noninferiority. The other (labeled as NI-S) reverses the order. If noninferiority is not achieved, superiority can never be concluded. If noninferiority is concluded, test further for superiority. For each strategy, use of the same α level at each testing step is valid in the sense that the total type I error probability associated with testing for superiority and noninferiority is exactly α. They showed that the two stepwise procedures are equivalent in terms of the rejection regions for superiority and noninferiority. In practice, the two stepwise test procedures may carry with different sample size plans that are often designed primarily to achieve the first intended hypothesis (Wang et al., 2001).
________
Dunnett CW, Tamhane AC. (1997). Multiple testing to establish superiority/equivalence of a new treatment compared with k standard treatments. Statist Med 16(21):2489-2506.
Morikawa T, Yoshida M. (1995). A useful testing strategy in phase III trials: combined test of superiority and test of equivalence. J Biopharmaceutical Statist 5(3):297-306.
Wang SJ, Hung HMJ, Tsong Y, Cui L. (2001). Group sequential testing for superiority and non-inferiority hypotheses in active controlled clinical trials. Statist Med 20:1903-1912."

Some Examples of How to State This in the Statistical Methods Section of Your Article

A good way to state this approach of testing for both noninferiority and superiority in the same study, without adjustment to alpha, is illustrated in an article by Ullmann et al. Ullmann et al (N Engl J Med, 2007) describe the approach in their Statistical Methods section, being careful to state that it was a prespecified analysis:

"As stated in the protocol, the evaluation of efficacy occurred in two stages. First, the noninferiority of posaconazole to fluconazole was assessed. If noninferiority was demonstrated, then the superiority of posaconazole to fluconazole was assessed. This two-stage process allowed for control of the type I error rate."

In a protocol, you might also want to cite Hung and Wang (2004) to support that no multiple comparison adjustment is required, just to head off this question if the reviewer is not aware that this is a widely used, accepted practice.
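Here is a minimal Stata sketch of Strategy 1, using the earlier dichotomous example (65/100 vs 60/100 recovered) and a hypothetical noninferiority margin of 10 percentage points. Both decisions can be read off the lower limit of the two-sided 95% CI:

   * Strategy 1: noninferiority first, then superiority (illustrative)
   scalar rd = 0.65 - 0.60
   scalar se = sqrt(0.65*0.35/100 + 0.60*0.40/100)
   scalar lb = rd - invnormal(0.975)*se     // lower limit, 2-sided 95% CI
   display "lower 95% limit = " %6.3f lb
   display "noninferior (lb > -0.10)? " (lb > -0.10)   // 1 = yes
   display "superior    (lb > 0)?     " (lb > 0)       // 1 = yes

Here the first step succeeds (-0.084 > -0.10, so noninferiority is concluded), but the second step fails (the interval still covers zero), so superiority is not claimed. No alpha adjustment is needed for the two steps.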
Exercise

Look at the article by Munger et al (2008).

1) Notice that they describe this two-stage noninferiority-superiority testing approach in their Statistical Analysis section.

2) In their figure, they show the confidence intervals and the noninferiority bound (dashed line). It is easy to see that no CI crosses the noninferiority bound. It is also easy to see which CIs do not cross the null value of RR = 1, thus demonstrating superiority as well.

References

Barker L, Rolka H, Rolka D, Brown C. (2001). Equivalence testing for binomial random variables: which test to use? The American Statistician 55(4):279-287.

Barker LE, Luman ET, McCauley MM, Chu SY. (2002). Assessing equivalence: an alternative to the use of difference tests for measuring disparities in vaccination coverage. Am J Epidemiol 156(11):1056-1061.

Blackwelder WC. (1982). "Proving the null hypothesis" in clinical trials. Control Clin Trials 3:345-53.

Borenstein M, Rothstein H, Cohen J. (2001). SamplePower 2.0. Chicago, SPSS Inc. Software can be purchased at http://www.spss.com

Chan I. (1998). Exact tests of equivalence and efficacy with a non-zero lower bound for comparative studies. Statistics in Medicine 17:1403-1413.

Chow S-C, Liu J-P. (2000). Design and Analysis of Bioavailability and Bioequivalence Studies. 2nd ed. New York, Marcel Dekker.

Chow S-C, Shao J, Wang H. (2008). Sample Size Calculations in Clinical Research. 2nd ed. New York, Chapman & Hall/CRC.

Cohen J. (1965). Some statistical issues in psychological research. In Wolman BB (ed), Handbook of Clinical Psychology. New York, McGraw-Hill, pp. 95-121.

Cohen J. (1988). Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, New Jersey, Lawrence Erlbaum Associates.

Greene WL, Concato J, Feinstein AR. (2000). Claims of equivalence in medical research: are they supported by the evidence? Ann Intern Med 132:715-722.

Haskal ZJ, Trerotola S, Dolmatch B, et al. (2010). Stent graft versus balloon angioplasty for failing dialysis-access grafts. N Engl J Med 362(6):494-503.

Hoenig JM, Heisey DM. (2001). The abuse of power: the pervasive fallacy of power calculations for data analysis. The American Statistician 55(1):19-24.

Hung HMJ, Wang S-J. (2004). Multiple testing of noninferiority hypotheses in active controlled trials. J Biopharm Statist 14(2):327-335.

International Conference on Harmonisation E9 Expert Working Group. (1999). ICH harmonised tripartite guideline: statistical principles for clinical trials. Stat Med 18(15):1905-42. Freely available as a guidance document on the FDA website (word-for-word same content): Guidance for industry: E9 statistical principles for clinical trials. http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm073137.pdf

Johns MW. (2000). Sensitivity and specificity of the multiple sleep latency test (MSLT), the maintenance of wakefulness test and the Epworth sleepiness scale: failure of the MSLT as a gold standard. J Sleep Res 9:5-11.

Linnett K. (2000). Nonparametric estimation of reference intervals by simple and bootstrap-based procedures. Clinical Chemistry 46(6):867-869.

Munger MA, Stoddard GJ, Wenner AR, et al. (2008). Safety of prescribing PDE-5 inhibitors via e-medicine vs traditional medicine. Mayo Clin Proc 83(8):890-896.

Nicoll CD, Pignone M. (2003). Diagnostic testing & medical decision making. In Tierney LM, McPhee SJ, Papadakis MA (eds), Current Medical Diagnosis & Treatment 2003, 42nd ed. Columbus OH, The McGraw-Hill Companies, pp. 1667-1677.
Piaggio G, Elbourne DR, Altman DG, et al. (2006). Reporting of noninferiority and equivalence randomized trials: an extension of the CONSORT statement. JAMA 295:1152-1160.

Rosner B. (1995). Fundamentals of Biostatistics. 4th ed. Belmont, California, Duxbury Press.

Rumsfeld JS, Magid DJ, Plomondon ME, et al. (2003). Health-related quality of life after percutaneous coronary intervention versus coronary bypass surgery in high-risk patients with medically refractory ischemia. J Am Coll Cardiol 41(10):1732-1738.

Schuirmann DJ. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. J Pharmacokinet Biopharm 15:657-680.

Smith D, Eggen M, St. Andre R. (1997). A Transition to Advanced Mathematics. 4th ed. Pacific Grove, California, Brooks/Cole Publishing Company.

StatXact Version 6 with Cytel Studio: Statistical Software for Exact Nonparametric Inference User Manual. Cambridge, Massachusetts, Cytel Software Corporation.

Ullmann AJ, Lipton JH, Vesole DH, et al. (2007). Posaconazole or fluconazole for prophylaxis in severe graft-versus-host disease. N Engl J Med 356:335-47.

van der Gaag NA, Rauws EAJ, van Eijck CHJ, et al. (2010). Preoperative biliary drainage for cancer of the head of the pancreas. N Engl J Med 362:129-37.

Wellek S. (2003). Testing Statistical Hypotheses of Equivalence. New York, Chapman & Hall/CRC.

Westlake WJ. (1981). Bioequivalence testing - a need to rethink (reader reaction response). Biometrics 37:591-593.

Westlake WJ. (1988). Bioavailability and bioequivalence of pharmaceutical formulations. In Peace KE (ed), Biopharmaceutical Statistics for Drug Development. New York, Marcel Dekker.