Chapter 2-15. Equivalence Tests and Noninferiority Tests
Barker et al (2002) provide an example research situation where equivalence is the hypothesis of
interest.
Motivated by the public health policy of eliminating disparities in vaccination coverage among racial and ethnic groups, they analyzed vaccination coverage data from the 2000 National Immunization Survey. The following shows their data for polio vaccination coverage.
Group      Coverage   Disparity (group – white)
White      90.6%      Ref
Black      86.8%      -3.8%
Hispanic   87.9%      -2.7%
Asian      92.7%      +2.1%
What would be a good way to approach these data?
Using the classical “difference testing” approach, a significant difference would be observed, at
the alpha=0.05 level, if the 95% confidence interval around the difference did not cover 0.
Group      Coverage   Disparity (group – white)   95% CI for difference
White      90.6%      Ref
Black      86.8%      -3.8%                       (-5.9 , -1.7)*
Hispanic   87.9%      -2.7%                       (-4.5 , -0.9)*
Asian      92.7%      +2.1%                       (-0.7 , 4.9)

* Denotes significance, p<0.05, since the 95% CI does not cover 0.
Should we conclude that disparity still exists for two of these groups, and that disparity has been eliminated between Whites and Asians?
First, let’s consider the White-Asian comparison:
Asian – White: +2.1%, 95% CI(-0.7% , 4.9%).
We might be tempted to conclude equivalent coverage because statistical significance was not achieved. However, there is a well-known competing explanation: the sample size may simply have been insufficient. That is, if this same 2.1% difference were maintained in a sufficiently larger sample, it would be significant.
It would seem, however, that if our sample size provided adequate power to detect the smallest meaningful disparity, say a 3% difference, then failing to achieve significance should permit a conclusion of equivalence.
_____________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah
School of Medicine, 2011. http://www.ccts.utah.edu/biostats/?pageId=5385
So, if the power was 95% to detect a 3% difference, then we would have had a 95% probability
that our sample would provide a conclusion of disparity if that difference of 3%, or one larger,
existed in the sampled population.
The probability of making a type II error (concluding no difference when a difference exists) is then

    1 − power = β = 1 − 0.95 = 0.05.
With this reasoning, and our 2.1% observed difference, it seems we should be able to conclude
equivalence at the beta=0.05 level.
Such reasoning has been frequently applied. Greene (2000) reviewed human subjects papers
listed in Medline from 1992 to 1996 and found 67% of those papers followed that approach.
To give a published example of this approach being used, so that you can recognize it when you see it:
Rumsfeld et al (2003) published a clinical trial of high-risk patients with medically
refractory ischemia randomized to percutaneous coronary intervention (PCI) versus
coronary artery bypass graft (CABG) surgery. In this study, a nonsignificant difference of less than 2 points in six-month health-related quality of life (HRQL) was observed between the two study arms in multivariable models. The authors reported having "…97%
power to detect a four-point difference in scores, where four to seven points is a clinically
important difference.” They concluded, “High-risk patients with medically refractory
ischemia randomized to PCI versus CABG surgery have equivalent six-month HRQL.”
------APOLOGY: I apologize to Rumsfeld et al for using them as an example. They were doing what they had learned was a correct analysis, as this approach has been widely taught and advocated. I even used to teach this approach. It will take decades for it to go away.
This approach to equivalence testing was made popular by Jacob Cohen (1988), author of the classic text Statistical Power Analysis for the Behavioral Sciences; it is advocated in both the first and second editions. He also proposed it elsewhere (Cohen, 1965).
As pointed out by Hoenig and Heisey (2001, p.21), Cohen, without providing a mathematical proof, advocated using power to support the null hypothesis. Cohen claimed (1988, p.16) that if you design a study to have high power (say, power = 0.95, so β = 0.05) to detect a stated biologically meaningful effect size Δ (or alternatively some trivial effect size), and you then fail to reject the null hypothesis (p > α, or p > 0.05), then you can conclude, at the β level, that the effect in the population is smaller than Δ.
No one has ever been able to provide a proof that Cohen’s assertion is logically consistent.
Finally, in 2001, Hoenig and Heisey published a proof in The American Statistician
demonstrating that Cohen’s assertion is, in fact, false.
Unfortunately, Jacob Cohen died in 1998 before this equivalence testing approach was logically
refuted, so his textbook will never see a new edition where the error is corrected.
This approach is no longer acceptable in the approval process of new drugs in Europe, Japan, and the United States. The E9 guidance document states (ICH, 1999, section 3.3.2, Trials to Show Equivalence or Noninferiority):
“Concluding equivalence or noninferiority based on observing a nonsignificant test result
of the null hypothesis that there is no difference between the investigational product and
the active comparator is considered inappropriate.”
Hoenig and Heisey used the method of proof by contradiction (see box).
Proof by Contradiction
Let P and Q be two propositions. A proposition is a statement that is either true or false. The
steps used in a proof by contradiction are:
Proof of P by Contradiction (Smith et al., 1997, p.34)
Suppose Not-P.
(where Not-P is true if P is false, and false if P is true)
…
Therefore, Q.
….
Therefore, Not-Q.
Hence, Q and Not-Q, a contradiction (since a proposition cannot be both true and false)
Thus, P.
Example from Mathematics of a Proof by Contradiction
Aim: prove that division by zero is meaningless (which is why division by zero is said to be undefined).
Proof Steps                                Proof
Suppose Not-P:                             Suppose division by 0 is meaningful.
                                           Let a ≠ 0, and let a/0 = b (using the opening premise).
Therefore, Q:                              a ≠ 0.
Therefore, Not-Q:                          a = b × 0 = 0 (multiplying both sides of a/0 = b by 0).
Hence, Q and Not-Q, a contradiction:       Both a ≠ 0 and a = 0.
Thus, P:                                   Division by 0 is meaningless.
Hoenig and Heisey’s Proof by Contradiction of the Fallacy of Using a Power Analysis to
Demonstrate the Null Hypothesis of No Effect
Of course, proof by contradiction is a logical construct that is not limited to mathematics. Hoenig and Heisey state their proof in English sentences, so it would be better to say that theirs is a logical proof rather than a mathematical proof. Most researchers, however, are not trained in logic or mathematics, so they would not recognize that a proof is being presented.
Here is how Hoenig and Heisey (2001, p.21) stated their logical proof:
"A number of authors have noted that observed power may not be especially useful, but to our knowledge a fatal logical flaw has gone largely unnoticed. Consider two experiments that gave rise to non-rejected null hypotheses. Suppose that the observed power was larger in the first experiment than the second. Advocates of observed power would interpret this to mean that the first experiment gives stronger support favoring the null hypothesis. Their logic is that if power is low one might have missed detecting a real departure from the null hypothesis but if, despite high power, one fails to reject the null hypothesis, then the null is probably true or close to true. This is easily shown to be nonsense. For example, consider the one-sided Z test described above. Let Zp1 and Zp2 refer to the observed test statistics in the respective experiments. The observed power was highest in the first experiment and we know this implies Zp1 > Zp2 because observed power is G(Zp), which is an increasing function of the Z statistic. So by usual standards of using the p value as statistical evidence, the first experiment gives the stronger support against the null, contradicting the power interpretation. We will refer to this inappropriate interpretation as the "power approach paradox" (PAP): higher observed power does not imply stronger evidence for a null hypothesis that is not rejected."
Let's examine how the proof by contradiction was applied.
Step 1. Suppose Not-P.
If higher power and the test is not significant, this implies stronger evidence in favor of
the null hypothesis. (the misconception to be shown fallacious)
Step 2. Therefore, Q.
Choose any two experiments, each of which produces a test statistic (say, Z) that is not significant.
Use subscript 1 for the experiment with the greater power, so
Power1 > Power2
Because power always increases when Z increases (a known relationship between power
and the test statistic), we have,
Z1 > Z2
Therefore, Z1 provides stronger evidence in favor of the null hypothesis (from
our assumption in step 1) because the power is higher.
Step 3. Therefore, Not-Q.
We know that the larger the test statistic, the smaller the p value. So,
p value1 < p value2
By the common standards for interpreting p values in statistics, we know that the smaller the p value, the stronger the evidence against the null hypothesis.
Therefore, Z1 provides stronger evidence against the null hypothesis.
Step 4. Hence, Q and Not-Q, a contradiction
The conclusions of steps 2 and 3 contradict each other.
Step 5. Thus, P.
The opposite of the proposition in Step 1 must therefore be true. That is,
Not-{If higher power and the test is not significant, this implies stronger
evidence in favor of the null hypothesis.}
which is to say,
If higher power and the test is not significant, this does not imply stronger
evidence in favor of the null hypothesis.
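To see the paradox numerically, here is a minimal Stata sketch for a one-sided Z test at alpha = 0.05. The two Z values are made up for illustration, and observed power is computed here as the power at an effect equal to the one observed, 1 − Φ(zcrit − Z):

* Two hypothetical non-significant experiments, one-sided Z test, alpha = 0.05
local zcrit = invnormal(0.95)
local z1 = 1.50                                  // experiment 1 (hypothetical)
local z2 = 1.00                                  // experiment 2 (hypothetical)
display "p1 = " 1 - normal(`z1')                 // ~0.067, the smaller p value
display "p2 = " 1 - normal(`z2')                 // ~0.159
display "observed power 1 = " 1 - normal(`zcrit' - `z1')   // ~0.44, the higher power
display "observed power 2 = " 1 - normal(`zcrit' - `z2')   // ~0.26

Experiment 1 has the higher observed power, yet also the smaller p value. So it simultaneously gives "stronger support" for the null (by the power interpretation) and stronger evidence against the null (by the p value), which is exactly the contradiction in the proof.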
Thus, there is no logical justification for using power to support the null hypothesis. The correct way to provide a statistical (probability) argument to support the null hypothesis is to apply one of the procedures known as equivalence tests or bioequivalence tests, or, when a specific direction is of interest, noninferiority and non-superiority tests.
A variant of the fallacious power approach is the post hoc power, or observed power, approach,
which was also refuted in the Hoenig and Heisey article (see box).
Post Hoc Power Approach
Journal reviewers frequently ask researchers to compute a post hoc power analysis when a study fails to reject the null hypothesis. This is done by computing the power of the test based on the observed effect, variability, and sample size. Advocates of this approach argue that there is evidence for the null hypothesis being true if statistical significance was not achieved and the computed power at the observed effect size is high (Hoenig and Heisey, 2001, p.19).
As pointed out by Hoenig and Heisey (2001, p.20), several major statistical packages, such as
SPSS, provide observed power. Using the menu in SPSS version 12
Analyze -> General Linear Model -> Univariate -> Options -> Observed Power
outcome   group   N   Mean   Std. Deviation   Std. Error Mean
          1       5   5.40       1.140             .510
          2       5   3.80       1.924             .860

Source            Type III SS   df   Mean Square        F   Sig.   Noncent. Parameter   Observed Power(a)
Corrected Model      6.400(b)    1         6.400    2.560   .148                2.560                .292
Intercept          211.600       1       211.600   84.640   .000               84.640               1.000
group                6.400       1         6.400    2.560   .148                2.560                .292
Error               20.000       8         2.500
Total              238.000      10
Corrected Total     26.400       9

a Computed using alpha = .05
b R Squared = .242 (Adjusted R Squared = .148)
We can verify that the observed power is nothing more than plugging the observed means, standard deviations, and sample sizes into a power calculation. Using SamplePower version 2.0, specifying an independent-groups t test, we get:

With the proposed sample size of 5 and 5 for the two groups, the study will have power of 29.2% to yield a statistically significant result. This computation assumes that the mean difference is 1.600 (corresponding to means of 5.400 versus 3.800) and the common within-group standard deviation is 1.581 (based on SD estimates of 1.140 and 1.924).
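The same number can be reproduced in Stata directly from the ANOVA table above, using the noncentral F distribution. A small sketch (the noncentrality parameter 2.560 and the df of 1 and 8 are taken from the SPSS table):

* Reproduce the SPSS "observed power" of .292 for the group effect
local Fcrit = invFtail(1, 8, 0.05)                          // critical F at alpha = .05
display "observed power = " nFtail(1, 8, 2.560, `Fcrit')    // ~0.292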
Equivalence Testing
The correct way to demonstrate equivalence is with the class of statistical procedures called
equivalence tests.
Bioequivalence tests are often used in the pharmaceutical industry. However, there are many
other instances, such as in public health, when an equivalence test is what is actually required.
For example, an investigator might want to show that access to health care is essentially equal for
two groups of people, such as between the young working population and the elderly.
Let’s take a bioequivalence example. The United States Food and Drug Administration (FDA)
will grant approval of a drug if it can be shown to have the same bioavailability profile as an
already approved drug, within 20% (the 20% rule). Expressed in proportional form, for a test
drug (T) and a referent drug (R), bioequivalence is accepted if

    0.80 ≤ μT/μR ≤ 1.20
This suggests an interval hypothesis testing approach, which Chow and Liu (2000, p.97) describe
as follows:
The hypothesis of bioequivalence, then, can be formulated as an interval hypothesis:

    H0: μT − μR ≤ θL  or  μT − μR ≥ θU
    vs. Ha: θL < μT − μR < θU

where θL and θU are some clinically meaningful limits (such as mean differences that represent 20% of the reference mean).
To show bioequivalence, we reject the null hypothesis of not bioequivalent. Notice that this is
the opposite of the usual statistical hypothesis, where the null hypothesis is a statement of
equality.
The interval hypothesis can be decomposed into two sets of one-sided hypotheses:

    H01: μT − μR ≤ θL  vs.  Ha1: μT − μR > θL    (to verify that the bioavailability is not too low)

    H02: μT − μR ≥ θU  vs.  Ha2: μT − μR < θU    (to verify that the bioavailability is not too high)

If one concludes both Ha1 and Ha2, then it can be concluded that Ha: θL < μT − μR < θU holds.
The first hypothesis is called a "lack of inferiority" or "noninferiority" test, and the second is called a "lack of superiority" or "non-superiority" test.
Schuirmann’s Two One-Sided Tests (TOST) Procedure
Chow and Liu (2000, p.98) describe this procedure as follows:

Schuirmann's (1987) procedure suggests the conclusion of equivalence of μT and μR at the α level of significance if and only if both H01 and H02 are rejected at a predetermined α level of significance. The two sets of one-sided hypotheses can be tested with ordinary one-sided t tests. We conclude that μT and μR are equivalent if

    TL = [(ȲT − ȲR) − θL] / [σ̂d √(1/n1 + 1/n2)]  >  t(α, n1 + n2 − 2)

and

    TU = [(ȲT − ȲR) − θU] / [σ̂d √(1/n1 + 1/n2)]  <  −t(α, n1 + n2 − 2)

The two one-sided t tests procedure is operationally equivalent to the classical confidence interval approach. If the classical (1 − 2α) × 100% confidence interval for μT − μR is within (θL, θU), then both H01 and H02 are rejected at the predetermined α level by the two one-sided t tests procedure.
One disadvantage of the TOST procedure is that it requires reporting two p values. Another popular equivalence test, then, is Anderson and Hauck's test (Chow and Liu, 2000), which uses only one p value to reject the null hypothesis of non-equivalence.

The TOST procedure is known to be slightly conservative. Several tests have been developed that are more powerful for showing equivalence (Chow and Liu, 1992). Barker et al (2001), for example, describe eight equivalence tests for binomial (yes/no) variables and discuss their relative power.

One advantage of the TOST procedure over other equivalence tests is that it is easy for non-statisticians to understand. Also, it is not limited to the t test: one could use it just as well for comparing two proportions, or with any test where it is possible to specify the hypothesized non-zero difference in the numerator of the test statistic.
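To make the procedure concrete, here is a minimal Stata sketch of Schuirmann's TOST for two means computed from summary statistics. All of the numbers below are made up for illustration; the statistics are the TL and TU shown above:

* Schuirmann's TOST from hypothetical summary data
local n1 = 24                                    // test group
local m1 = 101.3
local s1 = 12.1
local n2 = 24                                    // reference group
local m2 = 98.7
local s2 = 11.4
local thetaL = -10                               // lower equivalence limit
local thetaU = 10                                // upper equivalence limit
local df = `n1' + `n2' - 2
local sp2 = ((`n1'-1)*`s1'^2 + (`n2'-1)*`s2'^2)/`df'     // pooled variance
local se = sqrt(`sp2'*(1/`n1' + 1/`n2'))
local TL = ((`m1' - `m2') - `thetaL')/`se'
local TU = ((`m1' - `m2') - `thetaU')/`se'
display "p for H01 (not too low)  = " ttail(`df', `TL')      // compare with alpha
display "p for H02 (not too high) = " 1 - ttail(`df', `TU')  // compare with alpha

Equivalence is concluded only if both one-sided p values are below the chosen alpha.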
Confidence Interval Approach
Perhaps the biggest advantage of the TOST procedure is that it is equivalent to the confidence
interval approach. Such confidence intervals are readily available in all existing software
packages. Using the confidence interval approach, one can avoid having to report the two p
values of the TOST procedure, and the confidence intervals are easy for readers to understand.
In this approach, a (1 − 2α) × 100% CI covering μT − μR is constructed using the two sample means or two sample proportions. If this CI is completely contained within the interval (θL, θU), then equivalence is demonstrated.

That is, first choose the equivalence window, say 20%. Next, compute the 90% CI for the difference in means, or difference in proportions. Basically, this CI will look like (see box):

    (X̄test − X̄referent) ± 1.645 √(s²test + s²referent)

where s²test and s²referent are the squared standard errors of the two group estimates.
If the endpoints of this CI are within the -20% and +20% bounds, then a conclusion of
equivalence is supported.
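For a difference in means, the 90% CI can be obtained directly from summary statistics with Stata's immediate t test command. A minimal sketch (the six summary numbers here are hypothetical):

ttesti 30 52 10 30 49 11, level(90)    // n, mean, SD for each group

If the reported 90% CI for the difference lies entirely within the chosen equivalence window, equivalence is supported.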
Point and Interval Estimation for the Risk Difference (Rosner, 1995, p. 363)

Let p̂1, p̂2 represent the sample proportions who develop disease in a prospective study, based on sample sizes of n1 and n2, respectively. A point estimate of the risk difference is given by p̂1 − p̂2. A 100% × (1 − α) confidence interval for the risk difference is given by (c1, c2), where

    c1 = (p̂1 − p̂2) − z(1−α/2) √(p̂1q̂1/n1 + p̂2q̂2/n2)
    c2 = (p̂1 − p̂2) + z(1−α/2) √(p̂1q̂1/n1 + p̂2q̂2/n2)

Use this expression for the confidence interval only if n1 p̂1 q̂1 ≥ 5 and n2 p̂2 q̂2 ≥ 5.
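As a quick check of this formula, we can apply it at the 90% level to the recovery example used later in this chapter (65/100 versus 60/100). This Wald-type CI comes out very close to the test-based CI that Stata reports there:

* Rosner's risk-difference CI, 90% level, p1 = 65/100, p2 = 60/100
local p1 = 0.65
local p2 = 0.60
local z = invnormal(0.95)                        // z for a 90% CI
local se = sqrt(`p1'*(1-`p1')/100 + `p2'*(1-`p2')/100)
display "90% CI: " `p1'-`p2'-`z'*`se' " to " `p1'-`p2'+`z'*`se'   // about -0.063 to 0.163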
Notice that a 90% CI corresponds to the two one-sided t tests, each at α = 0.05. It might seem strange that a 95% CI is not used. Westlake (1981) proposed the use of a 90% CI in order to achieve an α = 0.05 level test. The FDA requires an α = 0.05 level test for demonstrating efficacy of a new drug, so the (1 − 2α), or 90%, CI for equivalence testing achieves parallelism with the efficacy testing requirement (Westlake, 1988, p.343).

Westlake's argument is based on the fact that a 95% CI is wider than a 90% CI, making it harder to demonstrate equivalence. The (1 − 2α), or 90%, CI makes it just as easy to achieve significant equivalence as it would be to achieve efficacy at the 0.05 level. If a 95% CI is used, the effective α is at most 0.05/2 = 0.025, providing a 0.025 level test.
Chow and Liu (2000, p.80) state that the FDA has adopted the approach of using a 90% CI for
bioequivalence studies,
“The FDA requires that the bioequivalence be concluded with 90% assurance.”
They then refer to the 90% confidence interval on page 81.
Noninferiority Tests Using Confidence Intervals
A noninferiority test is one of the very rare instances where a one-sided comparison is
appropriate. In the FDA guidance document “E9 Statistical Principles for Clinical Trials” (ICH,
1999, section 5.5.E) it states,
“For noninferiority trials, a one-sided interval should be used. The confidence
interval approach has a one-sided hypothesis test counterpart for testing the null
hypothesis that the treatment difference (investigational product minus control) is
equal to the lower equivalence margin versus the alternative that the treatment
difference is greater than the lower equivalence margin. The choice of Type I error
should be a consideration separate from the use of a one-sided or two-sided
procedure.”
The 2-sided 95% CI versus 1-sided 95% CI for Noninferiority Testing (What Are Researchers Using?)

Piaggio et al (2006, p.1154) published a methods paper extending the Consolidated Standards of Reporting Trials (CONSORT) to the reporting of noninferiority and equivalence trials. In their paper, they prefer a 2-sided 95% CI:

"Many noninferiority trials based their interpretation on the upper limit of a 1-sided 97.5% CI, which is the same as the upper limit of a 2-sided 95% CI. Although both 1-sided and 2-sided CIs allow for inferences about noninferiority, we suggest that 2-sided CIs are appropriate in most noninferiority trials.[29] If a 1-sided 5% significance level is deemed acceptable for the noninferiority hypothesis test[42] (a decision open to question), a 90% 2-sided CI could then be used."
__________
[29] Points to Consider on the Choice of Noninferiority Margin. London, England: European Medicines Agency (EMEA); February 26, 2004. Available at: http://www.emea.eu.int/pdfs/human/ewp/215899en.pdf. Accessed February 9, 2006.
[42] Sackett DL. Superiority trials, non-inferiority trials, and prisoners of the 2-sided null hypothesis. ACP J Club. 2004;140:A11.
The Piaggio et al suggestion is consistent with the FDA guidance document "E9 Statistical Principles for Clinical Trials" (ICH, 1999, section 5.5.E), which states,

"The approach of setting Type I errors for one-sided tests at half the conventional Type I error used in two-sided tests is preferable in regulatory settings. This promotes consistency with the two-sided confidence intervals that are generally appropriate for estimating the possible size of the difference between two treatments."
In the medical literature, using a two-sided 95% CI is popular for noninferiority studies. This is what is advocated in the EMEA guidance document (reference 29, two paragraphs above). Piaggio et al (2006, p.1154) also advocate this approach (see two paragraphs above).
Personally, I (Stoddard) currently prefer a 1-sided test based on a two-sided 95% confidence interval for noninferiority testing. This is consistent with Piaggio's recommendation and with the E9 guidance statement two paragraphs above. One clear advantage of this approach is that it allows the reader to use your two-sided 95% confidence interval to test for an effect in the opposite direction. It is true that you only have interest in one direction, but the reader has an interest in either direction.
Even so, many researchers are still using the one-sided approach with an alpha of 0.05.
For example, van der Gaag et al. (N Engl J Med, 2010) used a one-sided alpha=0.05 significance
test of noninferiority, which is identical to using a one-sided 95% CI. In their Statistical Analysis
section they state,
“Assuming that there would be a complication rate of 38% in the early-surgery group and
48% in the biliary-drainage group, we would consider early surgery to be noninferior if
the associated percentage of serious complications was less than 10 percentage points
above the percentage of serious complications in the biliary-drainage group. We used a
two-group large-sample normal approximation test of proportions, with a one-sided
significance level of 0.05, to test the null hypothesis that early surgery would lead to an
increase of at least 10 percentage points in the rate of complications, as compared with
preoperative biliary drainage, followed by surgery. To attain a power of 80% to show
noninferiority of the early surgery, 94 patients were needed in each group.”
A second example is Haskal et al. (N Engl J Med, 2010). In their Study End Points section, they
state their noninferiority hypothesis as,
“The study objective was to demonstrate that treatment with a stent graft is not inferior to
treatment with balloon angioplasty alone regarding the primary end point, the 6-month
primary patency of a stenotic venous anastomosis in the treatment area.”
Then, in their Statistical Analysis section, they state,
“We calculated the sample size needed to test the primary noninferiority hypothesis using
the methods of Blackwelder.[19] The incidence of primary patency at 6 months was
estimated as 60% in the stent-graft group and 50% in the balloon-angioplasty group. The
two rates were considered clinically noninferior if the difference was 10 percentage points
or less (with a significance threshold of P=0.05 on a one-tailed test and 80% statistical
power). On this basis, the number of patients required for each of the two treatment
groups was calculated to be 76. The target number of patients enrolled in each group was set at 95, to account for a dropout rate of up to 20%. Thus, the total target sample size was 190 patients."

________
[19] Blackwelder WC. "Proving the null hypothesis" in clinical trials. Control Clin Trials 1982;3:345-53.
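Their 76 per group can be reproduced with the usual normal-approximation formula for noninferiority of two proportions. This is a sketch of a Blackwelder-style calculation; that it matches their exact method is an assumption:

* Noninferiority sample size: rates .60 vs .50, margin .10,
* one-sided alpha = .05, power = .80
local p1 = 0.60
local p2 = 0.50
local delta = 0.10                               // noninferiority margin
local za = invnormal(0.95)
local zb = invnormal(0.80)
local n = (`za'+`zb')^2 * (`p1'*(1-`p1') + `p2'*(1-`p2')) / (`p1'-`p2'+`delta')^2
display "n per group = " ceil(`n')               // 76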
Returning to the Vaccination Disparity Example
The first step in equivalence testing is to state the smallest acceptable difference, such that anything smaller would be considered equivalent. In our beginning immunization disparity example, we might consider an absolute difference of 5% to be acceptable, so that differences in the range -5% to +5% are treated as no different from a 0% difference (equivalence).
The choice of a range should depend on the context of the research question, such as what would
be the public health impact of a 5% disparity.
In bioequivalence testing of drugs, the FDA allows a relative 20% window in the average
bioavailability of a test drug to a referent drug. However, in other situations, the decision should
be based on what a clinician would find acceptable. In this immunization example, it is doubtful
that a public health professional would find a 20% difference acceptable.
Using the difference testing approach,

Group      Coverage   Disparity (group – white)   95% CI for difference
White      90.6%      Ref
Black      86.8%      -3.8%                       (-5.9 , -1.7)*
Hispanic   87.9%      -2.7%                       (-4.5 , -0.9)*
Asian      92.7%      +2.1%                       (-0.7 , 4.9)

* Denotes significance, p<0.05, since the 95% CI does not cover 0.

we would conclude disparity between Whites and Blacks and between Whites and Hispanics. For the White/Asian comparison, the conclusion would be "insufficient evidence to conclude disparity."
Using an equivalence testing approach, however,

Group      Coverage   Disparity (group – white)   90% CI for difference
White      90.6%      Ref
Black      86.8%      -3.8%                       (-5.5 , -2.1)
Hispanic   87.9%      -2.7%                       (-4.2 , -1.2)*
Asian      92.7%      +2.1%                       (-0.3 , 4.5)*

* Denotes significance, p<0.05, since the 90% CI falls within the -5% to +5% equivalence window.

we would conclude equivalence between Whites and Hispanics and between Whites and Asians. There would be insufficient evidence to conclude equivalence between Whites and Blacks.
What Researchers Are Using
Now that we know the correct way to do it, let's see what researchers actually report. Greene (2000) found 1209 citations in Medline (1992 through 1996) containing the word "equivalence," of which 88 turned out to be original research reports involving human subjects that made an equivalence claim.
In the 88 articles, Greene found that:

• 23% of articles correctly set an equivalence boundary and confirmed equivalence with an appropriate statistical approach (the right way)

• 67% of articles declared equivalence after failing to show a significant difference (the refuted way, although the refutation came after these papers were published)

• 10% of articles claimed equivalence without the use of statistics (the "shouldn't be doing research" way)
Le Henanff et al (2006) also did a review, but they limited it to papers that specifically tested equivalence or noninferiority, ignoring papers that claimed equivalence after failing to demonstrate a difference. Thus, their paper does not help determine whether the situation has improved relative to the Greene paper.
Some Available Software for Equivalence Testing
Almost always, a confidence interval approach is used, and that can be done in Stata without the need to purchase any specialty software.

If you really want a p value, however, here is some suggested software:

The appendix of Chow and Liu's textbook (2000) contains SAS code for a wide variety of equivalence tests. This code is not available from a website.

Accompanying Wellek's textbook (2003) is free software (e.g., SAS procs) on the author's website: http://www.zi-mannheim.de/wktsheq

A user-friendly software package for equivalence testing, called EquivTest, is available. A description can be found at the vendor's website: http://www.statsol.ie/

The statistical package StatXact-7 provides equivalence testing for proportions, as well as a sample size calculation.

Sample size for equivalence or noninferiority studies is easily computed with some simple formulas, which can be found in Chow et al. (2008).
Example
Let's consider the dichotomous case. Suppose we have data that look like:

           |       therapy
   recover |         0          1 |     Total
-----------+----------------------+----------
         0 |        40         35 |        75
           |     40.00      35.00 |     37.50
-----------+----------------------+----------
         1 |        60         65 |       125
           |     60.00      65.00 |     62.50
-----------+----------------------+----------
     Total |       100        100 |       200
           |    100.00     100.00 |    100.00
In the control group, 60% recovered. In the test group, 65% recovered.
If we use a 20% window, equivalence will be demonstrated if the test group is within 20% of
the referent group (control group in this case).
60% x 0.8 = 48% and 60% x 1.2 = 72%.
The absolute difference between the two group percents, then, is required to be between
( -12% , +12%)
Using Stata, we compute a 90% test-based confidence interval around the percentage difference using the command,
cs recover therapy, tb level(90)
we get
                 |       therapy          |
                 |   Exposed   Unexposed  |      Total
-----------------+------------------------+-----------
           Cases |        65          60  |        125
        Noncases |        35          40  |         75
-----------------+------------------------+-----------
           Total |       100         100  |        200
                 |                        |
            Risk |       .65          .6  |       .625
                 |                        |
                 |      Point estimate    |   [90% Conf. Interval]
                 |------------------------+------------------------
 Risk difference |         .05            |   -.062898     .162898  (tb)
      Risk ratio |     1.083333           |   .9042128    1.297937  (tb)
 Attr. frac. ex. |     .0769231           |  -.1059344    .2295465  (tb)
 Attr. frac. pop |         .04            |
                 +-------------------------------------------------
                               chi2(1) =     0.53  Pr>chi2 = 0.4652
Our observed 90% CI for the risk difference (proportion difference) is
(-0.062898 , 0.162898).
Since this fails to lie within the equivalence window of (-0.12 , 0.12), our data fail to demonstrate
equivalence.
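The same result can be obtained without loading a dataset, using Stata's immediate form of the command:

csi 65 60 35 40, tb level(90)    // cases then noncases, exposed group first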
To verify this was done correctly, an equivalence test for two independent proportions was run in StatXact-7, using these same data. The result was:

1st one-sided test: p = 0.006
2nd one-sided test: p = 0.152
90% CI around the difference: (-0.063 , 0.162)

Since only one of the two one-sided tests was significant at the 0.05 level, rather than both, equivalence was not established by Schuirmann's two one-sided tests procedure.
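As a rough check, the two one-sided p values can be approximated in Stata with a normal-approximation version of the TOST (StatXact's exact method will differ slightly in the later decimals):

* Normal-approximation TOST for the (-0.12 , +0.12) window
local d = 0.65 - 0.60
local se = sqrt(0.65*0.35/100 + 0.60*0.40/100)
display "1st one-sided p = " 1 - normal((`d' + 0.12)/`se')   // ~0.006
display "2nd one-sided p = " normal((`d' - 0.12)/`se')       // ~0.153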
Testing Noninferiority and Superiority in the Same Study
It is common, and acceptable, to test both the noninferiority and the superiority hypotheses in the same study, using the same nominal alpha (α = 0.05, for example) for both comparisons. There is no need to adjust this alpha for multiplicity (multiple comparisons).

Strategy 1) First test for noninferiority. If noninferiority is demonstrated (p<0.05), using the prespecified noninferiority margin, then test superiority using the ordinary null value (mean difference = 0, RR = 1, OR = 1) and the same alpha (p<0.05). If noninferiority is not demonstrated, then superiority is automatically not demonstrated either.

Strategy 2) First test for superiority. If superiority is demonstrated, then noninferiority is usually of no interest but is demonstrated as well. If superiority is not demonstrated, then go on to test noninferiority using the same alpha (p<0.05).
Hung and Wang (2004) describe this approach:
“Morikawa and Yoshida (1995) and Dunnett and Tamhane (1997) considered the case of
two δs; specifically, one δ is zero for the superiority objective and the other is a specified
positive real number corresponding to the noninferiority objective described by the
defined noninferiority margin. To test the superiority hypothesis and the noninferiority
hypothesis, two possible stepwise strategies can be entertained. One strategy (labeled as
S-NI) begins with testing the superiority hypothesis. Achievement of superiority
immediately leads to achievement of the so-defined noninferiority. If superiority is not
achieved, test for noninferiority. The other (labeled as NI-S) reverses the order. If
noninferiority is not achieved, superiority can never be concluded. If noninferiority is
concluded, test further for superiority. For each strategy, use of the same α level at each
testing step is valid in the sense that the total type I error probability associated with
testing for superiority and noninferiority is exactly α. They showed that the two stepwise
procedures are equivalent in terms of the rejection regions for superiority and
noninferiority. In practice, the two stepwise test procedures may carry with different
sample size plans that are often designed primarily to achieve the first intended
hypothesis (Wang et al., 2001).
________
Dunnett CW, Tamhane AC. (1997). Multiple testing to establish superiority/equivalence of a new treatment compared with kappa standard treatments. Statist. Med. 16(21):2489-2506.
Morikawa T, Yoshida M. (1995). A useful testing strategy in phase III trials: combined test of superiority and test of equivalence. J Biopharmaceutical Statist. 5(3):297-306.
Wang SJ, Hung HMJ, Tsong Y, Cui L. (2001). Group sequential testing for superiority and non-inferiority hypotheses in active controlled clinical trials. Statist. Med. 20:1903-1912."
Some examples of how to state this in your Statistical Methods section of your article
A good way to state this approach of testing for both noninferiority and superiority in the same
study, without adjustment to alpha, is illustrated in an article by Ullmann et al.
Ullmann et al. (N Engl J Med, 2007) describe this approach in their Statistical Methods section, being careful to note that this approach was a prespecified analysis:
“As stated in the protocol, the evaluation of efficacy occurred in two stages. First, the
noninferiority of posaconazole to fluconazole was assessed. If noninferiority was
demonstrated, then the superiority of posaconazole to fluconazole was assessed. This
two-stage process allowed for control of the type I error rate.”
In a protocol, you might also want to cite Hung and Wang (2004) to support that no multiple comparison adjustment is required, just to head off this question in case the reviewer is not aware that this is a widely used, accepted practice.
Exercise: Look at the article by Munger et al (2008).

1) Notice that they describe this two-stage noninferiority-superiority testing approach in their Statistical Analysis section.

2) In their figure, they show the confidence intervals and the noninferiority bound (dashed line). It is easy to see that no CI crosses the noninferiority bound. It is also easy to see which CIs do not cross the null value of RR=1, thus demonstrating superiority as well.
References
Barker L, Rolka H, Rolka D, Brown C. (2001). Equivalence testing for binomial random
variables: which test to use? The American Statistician. 55(4):279-287.
Barker LE, Luman ET, McCauley MM, Chu SY. (2002). Assessing equivalence: an alternative
to the use of difference tests for measuring disparities in vaccination coverage. Am J
Epidemiol 156(11):1056-1061.
Blackwelder WC. (1982). “Proving the null hypothesis” in clinical trials. Control Clin Trials
3:345-53.
Borenstein M, Rothstein H, Cohen J. (2001). SamplePower® 2.0. Chicago, SPSS Inc. The software can be purchased at http://www.spss.com
Chan I. (1998). Exact tests of equivalence and efficacy with a non-zero lower bound for
comparative studies. Statistics in Medicine 17, 1403-1413.
Chow S-C, Liu J-P. (2000). Design and analysis of bioavailability and bioequivalence studies.
2nd edition, New York, Marcel Dekker.
Chow S-C, Shao Jun, Wang H. (2008). Sample Size Calculations in Clinical Research. 2nd ed.
New York, Chapman & Hall/CRC.
Cohen J. (1965). Some statistical issues in psychological research. In B.B. Wolman (Ed.),
Handbook of Clinical Psychology. New York, McGraw-Hill. pp. 95-121.
Cohen J. (1988). Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, New Jersey, Lawrence Erlbaum Associates.
Greene WL, Concato J, Feinstein AR. (2000). Claims of equivalence in medical research: are they supported by the evidence? Ann Intern Med 132:715-722.
Haskal ZJ, Trerotola S, Dolmatch B, et al. (2010). Stent graft versus balloon angioplasty for
failing dialysis-access grafts. N Engl J Med 362(6):494-503.
Hoenig JM, Heisey DM. (2001). The abuse of power: the pervasive fallacy of power calculations for data analysis. The American Statistician 55(1):19-24.
Hung HMJ, Wang, S-J. (2004). Multiple testing of noninferiority hypotheses in active controlled
trials. J Biopharm Statist 14(2):327-335.
International Conference on Harmonisation E9 Expert Working Group. (1999). ICH harmonised
tripartite guideline: statistical principles for clinical trials. Stat Med 18(15):1905-42.
Freely available as a guidance document on the FDA website (word for word same
content): Guidance for industry: E9 statistical principles for clinical trials.
http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/ucm073137.pdf
Johns MW. (2000). Sensitivity and specificity of the multiple sleep latency test (MSLT), the
maintenance of wakefulness test and the Epworth sleepiness scale: Failure of the MSLT
as a gold standard. J Sleep Res 9:5-11.
Linnett K. (2000). Nonparametric estimation of reference intervals by simple and
bootstrap-based procedures. Clinical Chemistry 46(6):867-869.
Munger MA, Stoddard GJ, Wenner AR, et al. (2008). Safety of prescribing PDE-5 inhibitors via
e-medicine vs traditional medicine. Mayo Clin Proc 83(8):890-896.
Nicoll CD, Pignone M. Diagnostic testing & medical decision making. In Tierney LM, McPhee
SJ, Papadakis MA (eds). Current Medical Diagnosis & Treatment 2003, 42nd ed.
Columbus OH, The McGraw-Hill Companies, 2003, pp. 1667-1677.
Piaggio G, Elbourne DR, Altman DG, et al. (2006). Reporting of noninferiority and equivalence
randomized trials: an extension of the CONSORT Statement. JAMA 295:1152-1160.
Rosner B. (1995). Fundamentals of Biostatistics. 4th ed. Belmont, California, Duxbury
Press.
Rumsfeld JS, Magid DJ, Plomondon ME, et al. (2003). Health-related quality of life after
percutaneous coronary intervention versus coronary bypass surgery in high-risk patients
with medically refractory ischemia. J Am Coll Cardiology 41(10):1732-1738.
Schuirmann DJ. (1987). A comparison of the two one-sided tests procedure and the power approach for assessing the equivalence of average bioavailability. J Pharmacokinet Biopharm 15:657-680.
Smith D, Eggen M, St. Andre R. (1997). A Transition to Advanced Mathematics, 4th ed.,
Pacific Grove, California, Brooks/Cole Publishing Company.
StatXact® Version 6 with Cytel StudioTM: Statistical Software for Exact Nonparametric Inference
User Manual. Cambridge, Massachusetts, Cytel Software Corporation.
Ullmann AJ, Lipton JH, Vesole DH, et al. (2007). Posaconazole or fluconazole for prophylaxis in severe graft-versus-host disease. N Engl J Med 356:335-47.
Van der Gaag NA, Rauws EAJ, van Eijck CHJ, et al. (2010). Preoperative biliary drainage for
cancer of the head of the pancreas. N Engl J Med 362:129-37.
Wellek S. (2003). Testing Statistical Hypotheses of Equivalence. New York, Chapman &
Hall/CRC.
Westlake WJ. (1981). Bioequivalence testing—a need to rethink (reader reaction response). Biometrics 37:591-593.
Westlake WJ. (1988). Bioavailability and bioequivalence of pharmaceutical formulations. In,
Peace KE (ed), Biopharmaceutical Statistics for Drug Development. New York, Marcel
Dekker.