Chapter 2-8. Multiplicity and Comparisons of 3 or More Independent Groups

Multiplicity
Whether our dependent variable is dichotomous, nominal, ordinal, or interval, if we have more
than two groups in our independent variable, then we have the potential for the statistical
problem referred to as the multiple-comparison problem, or synonymously, multiplicity.
Table 4-1. Extending the two-sample comparison hypothesis to more than two samples

Dependent      2-sample                   K-sample comparison
Variable       comparison*                (global hypothesis)**
dichotomous    H0: no association         H0: no association
               H0: π1 = π2                H0: π1 = π2 = … = πk
nominal        H0: no association         H0: no association
(e.g., 3       H0: π11 = π12              H0: π11 = π12 = … = π1k
categories)        π21 = π22                  π21 = π22 = … = π2k
                   π31 = π32                  π31 = π32 = … = π3k
ordinal        H0: no association         H0: no association
               H0: median1 = median2      H0: median1 = median2 = … = mediank
interval       H0: no association         H0: no association
               H0: μ1 = μ2                H0: μ1 = μ2 = … = μk

*where π denotes the population proportion (reserving p to denote the sample proportion)
 μ denotes the population mean (reserving X̄ to denote the sample mean)
**The global hypothesis is the simultaneous equality of all averages, rather than the equality of
specific pairs of averages.
In the two-sample comparison problem, we used a statistical test to give a single p value to test
the hypothesis of no association between the grouping variable and the dependent variable. This
is identically a hypothesis of no difference in averages between the two groups.
For the interval dependent variable, two-sample case, it is clear that the hypothesis can be tested
with a single t test. For a three-sample case, the null hypothesis would be false if
1  2 or 1  3 or 2  3 .
We could test this hypothesis using three tests, such as t tests, using one test for each of the three
mean comparisons. If any of the three tests are significant, then the overall hypothesis of no
difference among the three groups would be rejected. If we do this, however, with each p value
being compared to alpha = 0.05, it turns out that the overall hypothesis of no difference among
the three means is actually being tested at an alpha > 0.05.
_____________________
Source: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript]. University of Utah
School of Medicine, 2011. http://www.ccts.utah.edu/biostats/?pageId=5385
Statisticians use the term family-wise alpha when referring to the alpha for a group (family) of
comparisons. This inflated Type I error (rejecting H0 more often than we should) is called the
multiple-comparison problem.
An intuitive analogy is flipping a coin. If you flip the coin once, the probability of it coming up
heads is 1/2. If you continue flipping the coin, the probability that it comes up heads at least one
time approaches 1.0 (a sure thing). The chance of rejecting the hypothesis H0: μ1 = μ2 = … = μk
is analogously inflated when it is rejected whenever any of the possible pairwise t tests is
statistically significant.
A classic example of this situation is a study comparing three doses of a drug against a placebo.
If low dose has greater effectiveness than placebo, or moderate dose has greater effectiveness
than placebo, or high dose has greater effectiveness than placebo, then you intend to conclude
that the drug is effective. In this situation, you give yourself 3 chances to get significance (any of
which will lead to the same conclusion of drug effectiveness relative to placebo).
We can discover how much inflation occurs using a Monte Carlo simulation.
Here is the simulation problem:
Suppose that we are interested in the absorption profile resulting from the administration
of ointment containing 20, 30, and 40 mg of progesterone to the nasal mucosa of
women. Women are to be randomized into these three dose groups, giving a parallel
groups design (rather than a crossover design). We want to be able to detect the
following profile, expecting to see differences in absorption of this magnitude or greater:
Absorption of Serum Progesterone Peak Value (nmol/l)
Dosage    mean    std.dev
20 mg      20       10
30 mg      25       10
40 mg      30       10
A power analysis, for an alpha=0.05, two-sided t-test with equal variances, tells us that
we need N=64 in each group to have 80% power to detect this difference between groups
20mg and 30mg, similarly between 30mg and 40mg (both have a mean difference of 5).
An N=64 gives us 100% power to detect this difference (mean difference of 10) between
20mg and 40mg.
Simulation: Draw a random sample of size N=3x64=192 (N=64 in each of three groups)
from the same normally distributed population with mean 25 and standard deviation 10.
The null hypothesis is:
H0: 1 = 2 = 3
which is actually correct because all three groups have a population mean of 25.
Compare the three groups using t tests. Repeat this process of sampling and making three
pairwise comparisons with t tests 10,000 times. Tally the number of significant results
(number of times p < 0.05).
The result is:
Grp 1 vs 2 significant  488 out of 10000 samples (4.88%)
Grp 1 vs 3 significant  504 out of 10000 samples (5.04%)
Grp 2 vs 3 significant  524 out of 10000 samples (5.24%)
Times at least one of the three comparisons significant by
  chance was 1244 out of 10000 samples (12.44%)
For the curious, the Stata code for this simulation is:
clear
set obs 192
gen peakval = .
gen group = 1 in 1/64
replace group = 2 in 65/128
replace group = 3 in 129/192
set seed 999
scalar sumsig12 = 0     // sumsig12 counts times grp 1 vs 2 has p<0.05
scalar sumsig13 = 0     // sumsig13 counts times grp 1 vs 3 has p<0.05
scalar sumsig23 = 0     // sumsig23 counts times grp 2 vs 3 has p<0.05
scalar atleastone = 0   // counts times at least one significant for same sample
scalar n_times = 0      // number of times a sample was drawn
forvalues i=1(1)10000 {
    scalar sig12 = 0    // initialize to false
    scalar sig13 = 0
    scalar sig23 = 0
    quietly replace peakval = 25 + invnorm(uniform())*10  // Normal (mean=25, SD=10)
    quietly ttest peakval if group==1 | group==2 , by(group)
    if r(p)<0.05 {
        scalar sig12 = 1    // if significant, change to true
    }
    quietly ttest peakval if group==1 | group==3 , by(group)
    if r(p)<0.05 {
        scalar sig13 = 1
    }
    quietly ttest peakval if group==2 | group==3 , by(group)
    if r(p)<0.05 {
        scalar sig23 = 1
    }
    if sig12==1 {
        scalar sumsig12 = sumsig12 + 1    // if significant, add 1 to counter
    }
    if sig13==1 {
        scalar sumsig13 = sumsig13 + 1
    }
    if sig23==1 {
        scalar sumsig23 = sumsig23 + 1
    }
    if (sig12==1 | sig13==1 | sig23==1) {
        scalar atleastone = atleastone + 1
    }
    scalar n_times = n_times + 1    // increment the samples drawn counter
}
display "Grp 1 vs 2 significant " sumsig12 " out of " n_times /*
    */ " samples (" sumsig12/n_times*100 "%)"
display "Grp 1 vs 3 significant " sumsig13 " out of " n_times /*
    */ " samples (" sumsig13/n_times*100 "%)"
display "Grp 2 vs 3 significant " sumsig23 " out of " n_times /*
    */ " samples (" sumsig23/n_times*100 "%)"
display "Times at least one of the three comparisons " /*
    */ "significant by"
display " chance was " atleastone " out of " n_times /*
    */ " samples (" atleastone/n_times*100 "%)"
Redisplaying the simulation results,
Grp 1 vs 2 significant  488 out of 10000 samples (4.88%)
Grp 1 vs 3 significant  504 out of 10000 samples (5.04%)
Grp 2 vs 3 significant  524 out of 10000 samples (5.24%)
Times at least one of the three comparisons significant by
  chance was 1244 out of 10000 samples (12.44%)
If we set our alpha (probability of Type I error) at 0.05, then we should get 5% significant
differences in this simulation, if we just consider one pair (either 20mg vs 30mg, or 20mg vs
40mg, or 30mg vs 40mg). And indeed, that is what happened.
If the research hypothesis is: “HA: dosage is related to amount absorbed” and we intend to
conclude our research hypothesis is demonstrated if any of the three pairwise comparisons come
out significant, then we see that our probability of making a Type I Error is inflated to 12.44%.
This multiple comparison problem can be more generally described as multiplicity.
The problem of “multiplicity” is that in any substantial clinical trial, or in any observation study,
it is all too easy to think up a whole multiplicity of hypotheses, each one geared to exploring
different aspects of response to treatment. (Pocock, 1983, p. 228)
The multiplicity problem has five main aspects (Pocock, 1983, p. 228):
(1) Multiple treatments Some trials have more than two treatments. The number of
possible treatment comparisons increases rapidly with the number of treatments.
(2) Multiple end-points There may be many different ways of evaluating how each patient
responds to treatment. It is possible to make a separate treatment comparison for each
end-point.
(3) Repeated measurements In some trials one can monitor each patient’s progress by
recording his disease state at several fixed time points after start of treatment. One could
then produce a separate analysis for each time point.
(4) Subgroup analyses One may record prognostic information about each patient prior to
treatment. Patients may then be classified into prognostic subgroups and each subgroup
analyzed separately.
(5) Interim analyses In most trials there is a gradual accumulation of data as more and
more patients are evaluated. One may undertake repeated interim analyses of the
accumulating data while the trial is in progress.
The problem arises in that the more significance tests one performs, the more likely at least one test
will be significant by chance alone (sampling variability).
If the comparisons are independent, the probability can be determined as follows:
For one comparison,
P(significant by chance) = alpha
P(not significant by chance) = 1 – alpha
For two comparisons,
P(at least one significant by chance)
= 1 - P(neither is significant)
= 1 - (1 – alpha)(1 – alpha)
For k comparisons,
P(at least one significant by chance)
= 1 - (1-alpha)^k
For alpha=0.05, the formula produces the following probabilities:
k    Prob
1    .050
2    .098
3    .143
4    .185
5    .226
Thus, multiplicity increases the risk of committing a false-positive error, or Type I error
(concluding a significant effect when it does not exist in the sampled population).
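These probabilities are easy to verify. Here is a minimal Stata sketch (an illustration, not part of
the original simulation code) that reproduces the table directly from the formula:

forvalues k = 1/5 {
    display "k = `k'   P(at least one significant by chance) = " /*
        */ %5.3f 1-(1-0.05)^`k'
}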
The formula

P(at least one significant by chance) = 1 - (1-alpha)^k

assumes that the k comparisons are “independent” of each other. This independence
approximately holds when comparing several study groups on the same outcome. Some lack of
independence is introduced by using each group more than once (1 vs 2)(1 vs 3)(2 vs 3), as
pointed out by Ludbrook (1998), which explains why our simulation resulted in 0.124, instead of
0.143 as predicted by the formula.
If we set up the simulation so that 6 groups are sampled 10,000 times, and then only use each
group once (1 vs 2)(3 vs 4)(5 vs 6), the comparisons will be independent. The results of such a
simulation are:
Grp 1 vs 2 significant  483 out of 10000 samples (4.83%)
Grp 3 vs 4 significant  518 out of 10000 samples (5.18%)
Grp 5 vs 6 significant  484 out of 10000 samples (4.84%)
Times at least one of the three comparisons significant by
  chance was 1413 out of 10000 samples (14.13%)
which is very close to the formula (simulation: 14.1% , formula: 14.3%).
For the curious, the Stata code for this simulation was:
clear
set obs 384    // n=64 x 6
set seed 999
capture drop peakval group
gen peakval = 25 + invnorm(uniform())*10   // Normal (mean=25, SD=10)
quietly gen group = 1
quietly replace group = 2 in 65/128
quietly replace group = 3 in 129/192
quietly replace group = 4 in 193/256
quietly replace group = 5 in 257/320
quietly replace group = 6 in 321/384
tab group
*
scalar sumsig12 = 0     // sumsig12 counts times grp 1 vs 2 has p<0.05
scalar sumsig34 = 0     // sumsig34 counts times grp 3 vs 4 has p<0.05
scalar sumsig56 = 0     // sumsig56 counts times grp 5 vs 6 has p<0.05
scalar atleastone = 0   // times at least one significant for same sample
scalar n_times = 0      // number of times a sample was drawn
forvalues i=1(1)10000 {
    scalar sig12 = 0    // initialize to false
    scalar sig34 = 0
    scalar sig56 = 0
    quietly replace peakval = 25 + invnorm(uniform())*10  // Normal (mean=25, SD=10)
    quietly ttest peakval if group==1 | group==2 , by(group)
    if r(p)<0.05 {
        scalar sig12 = 1                  // if significant, change to true
        scalar sumsig12 = sumsig12 + 1
    }
    quietly ttest peakval if group==3 | group==4 , by(group)
    if r(p)<0.05 {
        scalar sig34 = 1
        scalar sumsig34 = sumsig34 + 1
    }
    quietly ttest peakval if group==5 | group==6 , by(group)
    if r(p)<0.05 {
        scalar sig56 = 1
        scalar sumsig56 = sumsig56 + 1
    }
    if (sig12==1 | sig34==1 | sig56==1) {
        scalar atleastone = atleastone + 1
    }
    scalar n_times = n_times + 1    // increment the samples drawn counter
}
display "Grp 1 vs 2 significant " sumsig12 " out of " n_times /*
    */ " samples (" sumsig12/n_times*100 "%)"
display "Grp 3 vs 4 significant " sumsig34 " out of " n_times /*
    */ " samples (" sumsig34/n_times*100 "%)"
display "Grp 5 vs 6 significant " sumsig56 " out of " n_times /*
    */ " samples (" sumsig56/n_times*100 "%)"
display "Times at least one of the three comparisons " /*
    */ "significant by"
display " chance was " atleastone " out of " n_times /*
    */ " samples (" atleastone/n_times*100 "%)"
For interim analysis problems, however, the interim tests are clearly not independent, since the
data from the earlier test is included in each later test. The following table (third column) shows
how the probability of at least one significant test changes with additional looks at the data
(Pocock, 1983, p. 148):
k    Independent Tests    Interim Tests
     Probability          Probability
1    .050                 0.05
2    .098                 0.08
3    .143                 0.11
4    .185                 0.13
5    .226                 0.14
which we can see inflates much less rapidly than the formula used above for the independent
comparison case (2nd column).
Exercise Read section 5.6 Adjustment of Significance and Confidence Levels in the guidance
document E9 Statistical Principles for Clinical Trials.
How Frequently Are Multiple Comparison Procedures Used?
Horton and Switzer (2006) surveyed what statistical methods are used in research articles
published in NEJM. They found that 23% of research articles published in 2004-2005 reported
using a multiple comparison procedure.
P Value Based Multiple-Comparison Procedures
There are many multiple-comparison procedures designed for interval scale data and independent
groups. The statistical package SPSS has about 20 of these. However, they don’t apply to
ordinal or nominal scaled data, nor to paired samples.
Fortunately, there are a number of multiple-comparison procedures that simply adjust the p
value, and thus it makes no difference which test was used to produce the p value. So, these
procedures work for all comparisons, regardless of the level of measurement, or whether it is an
independent sample or related sample case.
Below are several such procedures. In these formulas, alpha is replaced with 0.05, which is
almost always what alpha is set to be.
Bonferroni procedure (Ludbrook, 1998):
p = unadjusted P value (P value from test statistic, not yet adjusted for multiplicity)
adjusted p = kp , where k=number of comparisons
For k=3 comparisons, this amounts to comparing
each p to adjusted alpha= α/k = 0.05/3=0.0167
which is identical to multiplying each p value by 3 and then comparing to
alpha=0.05.
If an adjusted p value is greater than 1, set the adjusted p value to 1 (since p>1 is
undefined).
This is the most conservative p value adjustment procedure of all (known to be needlessly
conservative).
____________________________________________________________________________
Note on Bonferroni Procedure: algebraic identity of adjusted p value formula and
adjusted alpha formula (if you are curious)
In most introductory statistics textbooks, the Bonferroni procedure is presented simply as,
adjusted alpha = α/k , where k = number of comparisons.
This is not very helpful, since it requires informing the reader what the adjusted alpha is that the
reader is supposed to compare the p value against.
The reader really appreciates just having an adjusted p value, instead, so only one alpha is needed
in your article, the alpha almost always being 0.05. Coming up with the adjusted p value is
simple enough. We want p ≤ adjusted α, as our adjusted rule for statistical significance. Solving
this inequality,
p ≤ adjusted α
p ≤ α/k , where k = number of comparisons, and α = 0.05, the original alpha
kp ≤ α
so, adjusted p = kp is now compared against the original, or nominal, alpha.
____________________________________________________________________________
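To make the Bonferroni adjustment concrete, here is a minimal Stata sketch applied to three
hypothetical unadjusted p values (0.010, 0.030, and 0.040 are made up for illustration),
including the capping of adjusted values at 1:

local k = 3                        // number of comparisons
foreach p in 0.010 0.030 0.040 {   // hypothetical unadjusted p values
    local padj = min(`k'*`p',1)    // Bonferroni: kp, capped at 1
    display "unadjusted p = " %6.4f `p' "   Bonferroni adjusted p = " %6.4f `padj'
}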
Holm procedure (Sankoh et al., 1997):
p = unadjusted P value arranged in sort order (smallest to largest)
adjusted p = (k-i)p , where k=number of comparisons, and i=0,1,…,(k-1)
For 3 comparisons, this amounts to comparing
smallest p to adjusted alpha=0.05/3=0.0167
middle p to adjusted alpha=0.05/2=0.025
largest p to adjusted alpha=0.05/1=0.05
which is identical to multiplying each p value as follows:
3 × smallest p , 2 × middle p , 1 × largest p
If an adjusted p value is greater than 1, set the adjusted p value to 1 (since p>1 is
undefined). Also, with the p values in ascending sort order, if an adjusted p value
is smaller than the previous adjusted p value (which is illogical), then set the adjusted p
value to the value of the previous adjusted p value.
It is obvious that this procedure is a big improvement over the Bonferroni procedure, since every
p value but the smallest is compared against a larger alpha.
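A minimal Stata sketch of the Holm step-down adjustment on the same three hypothetical
p values (0.010, 0.030, 0.040, already sorted ascending), including the rule that adjusted
p values may not decrease:

local k = 3
local i = 0
local prev = 0                          // previous adjusted p value
foreach p in 0.010 0.030 0.040 {        // hypothetical sorted p values
    local padj = min((`k'-`i')*`p',1)   // Holm: (k-i)p, capped at 1
    local padj = max(`padj',`prev')     // adjusted p may not fall below the previous one
    display "unadjusted p = " %6.4f `p' "   Holm adjusted p = " %6.4f `padj'
    local prev = `padj'
    local i = `i' + 1
}

Here the largest adjusted value (0.040 × 1 = 0.040) would fall below the middle one
(0.030 × 2 = 0.060), so it is raised to 0.060.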
Šidák procedure (Ludbrook, 1998):
p = unadjusted P value
adjusted p = 1 - (1-p)^k , where k = number of comparisons
For 3 comparisons, this amounts to comparing
each p to adjusted alpha = 1 - (1-0.05)^(1/3) = 0.01695
This procedure provides a trivial improvement over Bonferroni, since we get to compare our p
values against a trivially larger alpha.
Holm-Šidák (Ludbrook, 1998):
p = unadjusted P value
adjusted p = 1 - (1-p)^(k-i) , where k = number of comparisons, and i = 0,1,…,(k-1) sort order
For 3 comparisons, this amounts to comparing
smallest p to adjusted alpha = 1 - (1-0.05)^(1/3) = 0.01695
middle p to adjusted alpha = 1 - (1-0.05)^(1/2) = 0.0253
largest p to adjusted alpha = 1 - (1-0.05)^(1/1) = 0.05
With the p values in ascending sort order, if an adjusted p value is smaller than the
previous adjusted p value (which is illogical), then set the adjusted p value to the value of
the previous adjusted p value.
This procedure provides a trivial improvement over the Holm procedure, since we get to compare
our p values against a trivially larger alpha.
Hochberg procedure (Wright, 1992):
p = unadjusted P value arranged in sort order (smallest to largest)
adjusted p = (k-i)p , where k=number of comparisons, and i=0,1,…,(k-1)
For 3 comparisons, this amounts to comparing
smallest p to adjusted alpha=0.05/3=0.0167
middle p to adjusted alpha=0.05/2=0.025
largest p to adjusted alpha=0.05/1=0.05
which looks just like the Holm procedure at this stage.
It gains its advantage over the Holm procedure in the way the following adjustment is
made. Adjustments of anomalies are opposite of the Holm’s procedure. With the p
values in ascending sort order, if an adjusted p value is smaller than the previous adjusted
p value (which is illogical), then set the previous adjusted p value to the value of the
adjusted p value. With this approach, no adjusted p value can be larger than the largest
unadjusted p value.
This procedure is more powerful than the Holm procedure (Ludbrook, 1998).
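A minimal Stata sketch of the Hochberg adjustment on the same hypothetical p values; the
multipliers are Holm's, but the anomaly correction runs from the largest p value downward,
taking a running minimum:

clear
input double pval    // hypothetical p values, sorted ascending
0.010
0.030
0.040
end
gen double padj = (_N - _n + 1) * pval            // multipliers k, k-1, ..., 1
gen negorder = -_n
sort negorder                                     // largest p value first
replace padj = min(padj, padj[_n-1]) if _n > 1    // running minimum downward
sort pval                                         // restore ascending order
list pval padj, clean

This yields 0.030, 0.040, 0.040, so no adjusted p value exceeds the largest unadjusted p value
(0.040), illustrating the behavior described above.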
Finner’s procedure (Finner, 1993):
p = unadjusted P value
adjusted p = 1 - (1-p)^(k/i) , where k = number of comparisons, and i = 1,…,k sort order
For k=3 comparisons, this amounts to comparing
smallest p to adjusted alpha = 1 - (1-0.05)^(1/3) = 0.0170
middle p to adjusted alpha = 1 - (1-0.05)^(2/3) = 0.0336
largest p to adjusted alpha = 1 - (1-0.05)^(3/3) = 0.05
If an adjusted p value is greater than 1, set the adjusted p value to 1 (since p>1 is
undefined). Also, with the p values in ascending sort order, if an adjusted p value
is smaller than the previous adjusted p value (which is illogical), then set the adjusted p
value to the value of the previous adjusted p value.
This procedure is an improvement over all the above procedures since we get to compare our p
values against a larger alpha. However, by working backwards to correct illogically ordered
adjusted p values, the Hochberg procedure will win out over Finner’s procedure in many cases
when the largest unadjusted p value is 0.05 or just under it.
____________________________________________________________________________
Note on Finner’s Procedure: algebraic identity of adjusted p value formula and adjusted
alpha formula (if you are curious)
Starting with Finner (1993, p.922, Corollary 3.1),
αi = 1 - (1-α)^(i/k) , i = 1,…,k
in ascending sort order from smallest p to largest p
We want, pi ≤ αi , as our adjusted rule for statistical significance. Dropping the subscripts, which
are now assumed, and solving
p ≤ αi
p ≤ 1 - (1-α)^(i/k)
p - 1 ≤ -(1-α)^(i/k)
1 - p ≥ (1-α)^(i/k)
(1-p)^(k/i) ≥ 1 - α
(1-p)^(k/i) - 1 ≥ -α
1 - (1-p)^(k/i) ≤ α
This adjusted p value formula, and the above stated correction for anomalies after applying
the formula, can be found in Abramson and Gahlinger (2001, p.13).
____________________________________________________________________________
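For k = 3, the adjusted alphas quoted above can be checked in Stata directly from Finner's
formula:

forvalues i = 1/3 {
    display "i = `i'   adjusted alpha = " %6.4f 1-(1-0.05)^(`i'/3)
}

which displays 0.0170, 0.0336, and 0.0500.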
Hommel’s procedure (Wright, 1992):
p = unadjusted P value arranged in sort order (smallest to largest).
Let j be the number of comparisons in the largest subset of comparisons for which
p > i(0.05/k), which represent the nonsignificant P values in the Simes (1986) procedure,
where k=number of comparisons, and i=1,…,k. If there are no nonsignificant Simes tests
[all p ≤ i(0.05/k)], then all comparisons are significant. Otherwise, a comparison is
significant when p ≤ 0.05/j.
This procedure is more powerful than the Hochberg procedure (Wright, 1992).
General Comments
All of the above p value adjustment procedures maintain the desired alpha (0.05) for the
combined set of k comparisons when the comparisons are independent.
The formulas are not self-adjusting when the comparisons are correlated (such as with repeated
measures data), which makes some researchers nervous about using them. We will discuss this
below.
However, both the Bonferroni and the Holm procedures maintain the desired alpha (0.05),
regardless of independence or dependence among the p-values (Wright, 1992). [Wright bases
this claim on the existence of mathematical proofs that show that the Bonferroni and Holm
procedures maintain the alpha at 0.05, regardless of the correlation structure.]
Given that the more powerful Holm procedure shares that feature with the Bonferroni procedure,
there is no rational reason to prefer the Bonferroni over the Holm procedure. Researchers using
Bonferroni’s procedure are simply doing so because they are uninformed. Three papers
presenting mathematical proofs that Holm's procedure accomplishes the same protection against a
Type I error as Bonferroni's procedure are:
1) Holm’s original paper (1979): sophisticated and elegant proof
2) Aickin and Geisler (1996): rendition of Holm’s proof, but easier to read
3) Levin (1996): simplified justification of why it works (more like an outline of the proof)
Both the Bonferroni and Holm procedures are too conservative if the endpoints are correlated, so
you do not get significance as often as you should (Sankoh et al., 1997).
The procedures of Hochberg and Hommel are even more powerful than the Holm procedure; but
strictly speaking, they are known to maintain the family-wise alpha only for independent p-values
(Wright, 1992, p.1011).
However, that is a weak reason to prefer Holm’s over the more powerful procedures, such as the
Hochberg or Hommel’s procedures. Simulations by Sankoh et al (1997) have shown the
Chapter 2-8 (revision 8 May 2011)
p. 13
Hochberg and Hommel’s procedures to maintain the alpha of 0.05 over a wide range of
correlation structures in the data, so the lack of a mathematical proof that they maintain alpha in
all situations is of little concern.
Protocol Suggestion for Holm Procedure
If you feel timid about using the Holm procedure over Bonferroni, because you see everyone else
using Bonferroni, you could educate your reader by saying something like:
The p values for sets of multiple comparisons will be adjusted for multiplicity using
Holm’s multiple comparison procedure. Like the Bonferroni procedure, the Holm
procedure maintains the desired alpha (0.05) regardless of the correlation structure of the
endpoints, while being more powerful than the Bonferroni procedure. (Sankoh et al.,
1997)
…or, if you are comparing groups on the same variable, you could be more specific:
The p values for pairwise group comparisons will be adjusted for multiplicity using
Holm’s multiple comparison procedure. Like the Bonferroni procedure, the Holm
procedure maintains the desired alpha (0.05) regardless of the correlation structure of the
outcome variable among the groups, while being more powerful than the Bonferroni
procedure. (Sankoh et al., 1997)
It is sufficient, however, to just say:
The p values for pairwise group comparisons will be adjusted for multiplicity using
Holm’s multiple comparison procedure. (Sankoh et al., 1997)
Protocol Suggestion for Any of These Procedures
For any of the procedures given in this chapter, it is sufficient to say:
The p values for pairwise group comparisons will be adjusted for multiplicity using
<fill in the blank> multiple comparison procedure. (fill in suggested citation)
Example. Cummings et al. (N Engl J Med 2010) performed a randomized controlled trial of
lasofoxifene for the treatment of osteoporosis. The study had two active drug groups, 0.25 mg
lasofoxifene and 0.5 mg lasofoxifene, both compared to placebo. In their sample size paragraph,
they state,
“For the primary analyses, each dose of lasofoxifene was compared with placebo, and the
Hochberg procedure was used to control for multiple comparisons.8”
________
8. Hochberg Y. A sharper Bonferroni procedure for multiple tests of significance.
Biometrika 1988;75:800-2.
Response to Reviewer for Use of the Hommel’s Procedure
Although the single sentence suggested above is sufficient for a protocol or article, here is some
wording you can use to justify Hommel's procedure if you need to. You might need to do this,
for example, if a journal reviewer questions the use of the procedure out of his or her own lack of
familiarity.
Here is a cut-and-paste response you can use for such a situation:
The p values for pairwise group comparisons were adjusted for multiplicity using Hommel’s
multiple comparison procedure (Wright, 1992). A good statistical practice is to a priori
choose the most powerful statistical test available for which the assumptions are justified,
which is why the Student t-test is commonly selected over its nonparametric alternatives.
Dozens of multiple comparison procedures are available to choose from, for both the analysis
of variance based approaches and for the more generally applicable Bonferroni-like p value
adjustment procedures. A good statistical practice, then, is to a priori choose a specific
multiple comparison procedure over less powerful procedures. The Bonferroni-like p value
adjustment class of procedures was selected because these procedures have fewer assumptions
and apply to any test statistic. The Hommel’s p value adjustment procedure was selected
from this class, because it is known to be more powerful than several alternative procedures,
including the Bonferroni, Holm’s, and Hochberg’s procedures (Wright, 1992). Furthermore,
simulations have shown that Hommel’s procedure maintains alpha at 0.05 for both
independent comparisons and even for non-independent comparisons over a wide range of
correlation structures in the data (Sankoh et al, 1997).
_____
Wright SP. (1992). Adjusted P-Values for Simultaneous Inference. Biometrics 48:1005-1013.
Sankoh AJ, Huque MF, Dubey SD. (1997). Some comments on frequently used multiple
endpoint adjustment methods in clinical trials. Statistics in Medicine 16:2529-2542.
Example An example of a paper that uses Holm’s procedure is Florez et al (N Engl J Med,
2006). In their Statistical Methods section, they state,
“Nominal two-sided P values are reported and adjusted for multiple comparisons (three
genotypic groups within each trait) with the use of the Holm procedure.19”
Explanatory note: Statisticians use the term “nominal significance level” to refer to the original
selected alpha (almost universally this is 0.05). Florez’s use of “nominal” is nonstandard and
will probably confuse most readers. They are simply saying that they took the original P values
from the statistical tests and then adjusted them for multiple comparisons before reporting. They
apparently were looking for a way to avoid saying “unadjusted P values” because they were
applying the Holm’s adjustment to results from a regression model that adjusted for covariates
(so where adjusted P values already, in that sense). Their reference 19 is Holm’s original article
(Holm, 1979), which is the most correct citation, but the Sankoh paper, which is recommended
above, is far easier to read and therefore more useful to a reader who wants more information.
It would have been clearer to say,
Two-sided P values are reported after adjusting for multiple comparisons (three
genotypic groups within each trait) with the use of the Holm procedure.
The Correlated Endpoints Situation
Above, we compared the independent comparison situation to the correlated endpoint situation
(illustrated with multiple comparisons performed as interim analyses):
k    Independent Tests    Interim Tests
     Probability          Probability
1    .050                 0.05
2    .098                 0.08
3    .143                 0.11
4    .185                 0.13
5    .226                 0.14
The above p value adjustment procedures, which attempt to hold the independent tests alpha at
0.05, will be conservative, and thus not get significance often enough, for the correlated endpoint
situation. This is because the procedures are making greater adjustment than is necessary in the
correlated endpoint situation.
The Tukey-Ciminera-Heyse procedure (1985) was proposed as a p value adjustment procedure
specifically for correlated hypotheses (dependent tests, which would include repeated measures
data). The example of correlated hypotheses presented in that original paper was the use of
several outcome variables used for toxicity testing, where if any one of the outcomes was
significant, a conclusion of toxicity was demonstrated.
Tukey-Ciminera-Heyse procedure (Sankoh, 1997; Tukey et al, 1985):
p = unadjusted P value
adjusted p = 1 - (1-p)^√k , where k = the number of comparisons
For 3 comparisons, this amounts to comparing each p to
adjusted alpha = 1 - (1-0.05)^(1/√3) = 0.029
This procedure is generally an improvement over many of the other procedures.
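A quick Stata check of the adjusted alpha quoted above for k = 3:

display "TCH adjusted alpha for k=3: " %5.3f 1-(1-0.05)^(1/sqrt(3))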
Sankoh et al (1997) demonstrated with simulation that the Tukey-Ciminera-Heyse procedure
maintains alpha appropriately in the situation where the endpoints are highly correlated (r=.90), but
rejects too often with lesser correlation (see Sankoh, Tables II and III).
Sankoh et al (1997) also demonstrated that the other procedures described above are too
conservative in the correlated endpoint situation (Tables II and III, shown for Hochberg and
Hommel procedures).
Some correction factors are listed in the Sankoh paper to adjust for the correlation structure of
the endpoints, so that the alpha is maintained right at 0.05 (not too conservative, not too liberal).
P Value Adjustments in Stata (mcpi command)
[Multiple Comparison Procedures Immediate]
These p value adjustment procedures are not available in Stata. However, I wrote a program to
do them, mcpi.ado, which is available in the datasets and do files subdirectory of the electronic
course manual. It is also available as an appendix to this chapter, which you can use to create it.
Note: Using mcpi.ado
In the command window, execute the command
sysdir
This tells you the directories Stata searches to find commands, or ado files.
It will look like:
STATA: C:\Program Files\Stata10\
UPDATES: C:\Program Files\Stata10\ado\updates\
BASE: C:\Program Files\Stata10\ado\base\
SITE: C:\Program Files\Stata10\ado\site\
PLUS: c:\ado\plus\
PERSONAL: c:\ado\personal\
OLDPLACE: c:\ado\
I suggest you copy the file mcpi.ado and mcpi.sthlp from the electronic course manual to the
c:\ado\personal\ directory. Having done that, mcpi becomes an executable command in your
installation of Stata. If the directory c:\ado\personal\ does not exist, then you should create it
using Windows Explorer (My Documents icon), and then copy the two files into this directory.
These two files are also available in the appendix to this chapter.
To get help for mcpi, use help mcpi in the command window. A nice feature of the help file is
that it also contains the suggested citations for the procedures.

To execute, use the command mcpi followed by a list of p values you want to adjust, each p value
separated by a space.
The following is example output from the mcpi command, using the unadjusted p
values found in the illustrative example of Abramson and Gahlinger (2001, pp 12-13).
mcpi .702 .045 .003 .0001 .135 .007 .110 .135 .004 .018
SORTED ORDER: before anomaly corrected
Unadj  ---------------------- Adjusted --------------------------
P Val   TCH    Homml  Finnr  Hochb  Ho-Si  Holm   Sidak  Bonfr
0.0001  0.000  0.001  0.001  0.001  0.001  0.001  0.001  0.001
0.0030  0.009  0.024  0.015  0.027  0.027  0.027  0.030  0.030
0.0040  0.013  0.028  0.013  0.032  0.032  0.032  0.039  0.040
0.0070  0.022  0.049  0.017  0.049  0.048  0.049  0.068  0.070
0.0180  0.056  0.108  0.036  0.108  0.103  0.108  0.166  0.180
0.0450  0.135  0.180  0.074  0.225  0.206  0.225  0.369  0.450
0.1100  0.308  0.220  0.153  0.440  0.373  0.440  0.688  1.100
0.1350  0.368  0.270  0.166  0.405  0.353  0.405  0.765  1.350
0.1350  0.368  0.270  0.149  0.270  0.252  0.270  0.765  1.350
0.7020  0.978  0.702  0.702  0.702  0.702  0.702  1.000  7.020

SORTED ORDER: anomaly corrected
(1) If Finner or Holm or Bonfer P > 1 (undefined) then set to 1
(2) If Finner or Hol-Sid or Holm P < preceding smaller P
    (illogical) then set to preceding P
(3) Working from largest to smallest, if Hochberg preceding
    smaller P > P then set preceding smaller P to P
Unadj  ---------------------- Adjusted --------------------------
P Val   TCH    Homml  Finnr  Hochb  Ho-Si  Holm   Sidak  Bonfr
0.0001  0.000  0.001  0.001  0.001  0.001  0.001  0.001  0.001
0.0030  0.009  0.024  0.015  0.027  0.027  0.027  0.030  0.030
0.0040  0.013  0.028  0.015  0.032  0.032  0.032  0.039  0.040
0.0070  0.022  0.049  0.017  0.049  0.048  0.049  0.068  0.070
0.0180  0.056  0.108  0.036  0.108  0.103  0.108  0.166  0.180
0.0450  0.135  0.180  0.074  0.225  0.206  0.225  0.369  0.450
0.1100  0.308  0.220  0.153  0.270  0.373  0.440  0.688  1.000
0.1350  0.368  0.270  0.166  0.270  0.373  0.440  0.765  1.000
0.1350  0.368  0.270  0.166  0.270  0.373  0.440  0.765  1.000
0.7020  0.978  0.702  0.702  0.702  0.702  0.702  1.000  1.000

ORIGINAL ORDER: anomaly corrected
Unadj  ---------------------- Adjusted --------------------------
P Val   TCH    Homml  Finnr  Hochb  Ho-Si  Holm   Sidak  Bonfr
0.7020  0.978  0.702  0.702  0.702  0.702  0.702  1.000  1.000
0.0450  0.135  0.180  0.074  0.225  0.206  0.225  0.369  0.450
0.0030  0.009  0.024  0.015  0.027  0.027  0.027  0.030  0.030
0.0001  0.000  0.001  0.001  0.001  0.001  0.001  0.001  0.001
0.1350  0.368  0.270  0.166  0.270  0.373  0.440  0.765  1.000
0.0070  0.022  0.049  0.017  0.049  0.048  0.049  0.068  0.070
0.1100  0.308  0.220  0.153  0.270  0.373  0.440  0.688  1.000
0.1350  0.368  0.270  0.166  0.270  0.373  0.440  0.765  1.000
0.0040  0.013  0.028  0.015  0.032  0.032  0.032  0.039  0.040
0.0180  0.056  0.108  0.036  0.108  0.103  0.108  0.166  0.180
-----------------------------------------------------------------
*Adjusted for 10 multiple comparisons
KEY: TCH   = Tukey-Ciminera-Heyse procedure
     Homml = Hommel procedure
     Finnr = Finner procedure
     Hochb = Hochberg procedure
     Ho-Si = Holm-Sidak procedure
     Holm  = Holm procedure
     Sidak = Sidak procedure
     Bonfr = Bonferroni procedure
We notice that Finner’s procedure provides the greatest number of significant adjusted p values
for this combination of unadjusted p values.
Finner is certainly not the winner in all cases, as the following example illustrates, where
Hommel’s procedure does better.
SORTED ORDER: before anomaly corrected
Unadj  ---------------------- Adjusted --------------------------
P Val   TCH    Homml  Finnr  Hochb  Ho-Si  Holm   Sidak  Bonfr
0.0210  0.036  0.023  0.062  0.063  0.062  0.063  0.062  0.063
0.0220  0.038  0.023  0.033  0.044  0.044  0.044  0.065  0.066
0.0230  0.040  0.023  0.023  0.023  0.023  0.023  0.067  0.069

SORTED ORDER: anomaly corrected
(1) If Finner or Holm or Bonfer P > 1 (undefined) then set to 1
(2) If Finner or Hol-Sid or Holm P < preceding smaller P
    (illogical) then set to preceding P
(3) Working from largest to smallest, if Hochberg preceding
    smaller P > P then set preceding smaller P to P
Unadj  ---------------------- Adjusted --------------------------
P Val   TCH    Homml  Finnr  Hochb  Ho-Si  Holm   Sidak  Bonfr
0.0210  0.036  0.023  0.062  0.023  0.062  0.063  0.062  0.063
0.0220  0.038  0.023  0.062  0.023  0.062  0.063  0.065  0.066
0.0230  0.040  0.023  0.062  0.023  0.062  0.063  0.067  0.069

ORIGINAL ORDER: anomaly corrected
Unadj  ---------------------- Adjusted --------------------------
P Val   TCH    Homml  Finnr  Hochb  Ho-Si  Holm   Sidak  Bonfr
0.0210  0.036  0.023  0.062  0.023  0.062  0.063  0.062  0.063
0.0220  0.038  0.023  0.062  0.023  0.062  0.063  0.065  0.066
0.0230  0.040  0.023  0.062  0.023  0.062  0.063  0.067  0.069
-----------------------------------------------------------------
*Adjusted for 3 multiple comparisons
KEY: TCH   = Tukey-Ciminera-Heyse procedure
     Homml = Hommel procedure
     Finnr = Finner procedure
     Hochb = Hochberg procedure
     Ho-Si = Holm-Sidak procedure
     Holm  = Holm procedure
     Sidak = Sidak procedure
     Bonfr = Bonferroni procedure
A very nice feature of Hochberg's procedure is that if the largest unadjusted p value is ≤ 0.05,
then all the p values will be significant (no matter how many comparisons are done), since the
adjusted p values never exceed the largest unadjusted p value for that procedure.
SORTED ORDER: before anomaly corrected
Unadj  ---------------------- Adjusted --------------------------
P Val   TCH    Homml  Finnr  Hochb  Ho-Si  Holm   Sidak  Bonfr
0.0470  0.080  0.049  0.134  0.141  0.134  0.141  0.134  0.141
0.0480  0.082  0.049  0.071  0.096  0.094  0.096  0.137  0.144
0.0490  0.083  0.049  0.049  0.049  0.049  0.049  0.140  0.147

SORTED ORDER: anomaly corrected
(1) If Finner or Holm or Bonfer P > 1 (undefined) then set to 1
(2) If Finner or Hol-Sid or Holm P < preceding smaller P
    (illogical) then set to preceding P
(3) Working from largest to smallest, if Hochberg preceding
    smaller P > P then set preceding smaller P to P
Unadj  ---------------------- Adjusted --------------------------
P Val   TCH    Homml  Finnr  Hochb  Ho-Si  Holm   Sidak  Bonfr
0.0470  0.080  0.049  0.134  0.049  0.134  0.141  0.134  0.141
0.0480  0.082  0.049  0.134  0.049  0.134  0.141  0.137  0.144
0.0490  0.083  0.049  0.134  0.049  0.134  0.141  0.140  0.147

ORIGINAL ORDER: anomaly corrected
Unadj  ---------------------- Adjusted --------------------------
P Val   TCH    Homml  Finnr  Hochb  Ho-Si  Holm   Sidak  Bonfr
0.0470  0.080  0.049  0.134  0.049  0.134  0.141  0.134  0.141
0.0480  0.082  0.049  0.134  0.049  0.134  0.141  0.137  0.144
0.0490  0.083  0.049  0.134  0.049  0.134  0.141  0.140  0.147
-----------------------------------------------------------------
*Adjusted for 3 multiple comparisons
If the largest p value does exceed 0.05, however, then the advantage shown in the previous
example is lost.
SORTED ORDER: before anomaly corrected
Unadj  ---------------------- Adjusted --------------------------
P Val   TCH    Homml  Finnr  Hochb  Ho-Si  Holm   Sidak  Bonfr
0.0490  0.083  0.051  0.140  0.147  0.140  0.147  0.140  0.147
0.0500  0.085  0.051  0.074  0.100  0.098  0.100  0.143  0.150
0.0510  0.087  0.051  0.051  0.051  0.051  0.051  0.145  0.153

SORTED ORDER: anomaly corrected
(1) If Finner or Holm or Bonfer P > 1 (undefined) then set to 1
(2) If Finner or Hol-Sid or Holm P < preceding smaller P
    (illogical) then set to preceding P
(3) Working from largest to smallest, if Hochberg preceding
    smaller P > P then set preceding smaller P to P
Unadj  ---------------------- Adjusted --------------------------
P Val   TCH    Homml  Finnr  Hochb  Ho-Si  Holm   Sidak  Bonfr
0.0490  0.083  0.051  0.140  0.051  0.140  0.147  0.140  0.147
0.0500  0.085  0.051  0.140  0.051  0.140  0.147  0.143  0.150
0.0510  0.087  0.051  0.140  0.051  0.140  0.147  0.145  0.153

ORIGINAL ORDER: anomaly corrected
Unadj  ---------------------- Adjusted --------------------------
P Val   TCH    Homml  Finnr  Hochb  Ho-Si  Holm   Sidak  Bonfr
0.0490  0.083  0.051  0.140  0.051  0.140  0.147  0.140  0.147
0.0500  0.085  0.051  0.140  0.051  0.140  0.147  0.143  0.150
0.0510  0.087  0.051  0.140  0.051  0.140  0.147  0.145  0.153
-----------------------------------------------------------------
*Adjusted for 3 multiple comparisons
KEY: TCH   = Tukey-Ciminera-Heyse procedure
     Homml = Hommel procedure
     Finnr = Finner procedure
     Hochb = Hochberg procedure
     Ho-Si = Holm-Sidak procedure
     Holm  = Holm procedure
     Sidak = Sidak procedure
     Bonfr = Bonferroni procedure
P Value Adjustments in PEPI
The PEPI-4.0 ADJUSTP module (Abramson JH and Gahlinger PM, 2001) provides three of
these procedures. The results agree with the above mcpi output.

ADJUSTP - Multiple Significance Tests: Adjusted P Values

DATA
Total number of tests in the set = 3

       Original   HOLM's       HOMMEL's     FINNER's
No.    P          adjusted P   adjusted P   adjusted P
1      0.051      0.1470       0.0510       0.1399
2      0.050      0.1470       0.0510       0.1399
3      0.049      0.1470       0.0510       0.1399
Approaches to the multiple comparison problem (k sample comparisons)
The two most common approaches to the multiple comparison problem are:
1) For the k-sample comparison, we simply generalize (or extend) the two-sample statistical test
to simultaneously compare more than two samples, the test generating a single p value (only
one chance of being statistically significant so alpha remains at 0.05 without inflation). These
are sometimes referred to as tests of the global hypothesis. A oneway analysis of variance is
an example of this approach.
2) For the k-sample comparison, we perform as many two-sample comparisons as we are
interested in, generating many p values, but we compare these p values to alphas smaller than
0.05, so that taken as a set (family) of comparisons, the family alpha never exceeds 0.05.
This is the same thing as using a p value adjustment procedure, such as Hommel’s procedure.
It turns out that the 2nd option is more useful, since we almost always want to know which
groups differ, rather than only knowing that there are one or more differences among them. The
2nd option is also more powerful, giving you significance more often, while still controlling for
the Type I error (family alpha).
The 1st option is useful for our Table 1 “Patient Characteristics” since it allows us to report only
one p value to make our point of equivalence among the three or more study groups.
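As a concrete sketch of the two approaches (assuming the peakval and group variables from the
earlier simulation data), the global hypothesis gets one p value from a oneway ANOVA, while
the pairwise approach produces three p values to feed into a p value adjustment procedure such
as mcpi:

* Approach 1: global hypothesis, a single p value
oneway peakval group

* Approach 2: pairwise comparisons, p values then adjusted for multiplicity
ttest peakval if group==1 | group==2 , by(group)   // p for 1 vs 2
ttest peakval if group==1 | group==3 , by(group)   // p for 1 vs 3
ttest peakval if group==2 | group==3 , by(group)   // p for 2 vs 3
* then, for example:  mcpi <p12> <p13> <p23>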
Primary-Secondary Hypothesis Approach
If you consider one endpoint, or one comparison in general, to be the most important endpoint of
interest, you can call this the primary outcome and test it at the nominal significance level of
alpha = 0.05, with no need for a multiple comparison adjustment. Or, if you have a set of
primary outcomes, you can apply a multiple comparison procedure to this set. All other
endpoints are considered secondary endpoints, which do not need a multiple comparison
adjustment. These unadjusted secondary endpoints, however, are considered descriptive, or
exploratory, and thus do not provide confirmatory evidence for the research hypothesis. A good
citation for this approach is Freemantle (2001).
Browner et al (1988) is another citation for this. In their discussion of multiple hypotheses, they
suggest,
“A good rule is to establish in advance as many hypotheses as make sense, but specify
one as the primary hypothesis. This helps to focus the study on its main objective, and
provides a clear basis for the main sample size calculation. In addition, a single primary
hypothesis can be tested statistically without argument about whether to adjust for
multiple hypothesis testing.”
The FDA Guidance Document, E-9, Section 5.6, allows the selection of a primary outcome
variable to avoid adjustment for multiple comparisons on this outcome, or to limit the multiple
comparison adjustment to a set of primary variables.
Common Misconception of Thinking Analysis of Variance (ANOVA) Must Precede
Pairwise Comparisons
Although there is no need to test the global hypothesis first, many researchers and reviewers have
been misled into thinking that it is a necessary step, and they think pairwise comparisons can
only be done if the global hypothesis (oneway ANOVA) comes out significant. The pairwise
comparisons using a multiple comparison procedure have gained the name post hoc tests,
implying they are only done following a significant F statistic from a oneway ANOVA.
Using this approach leads to lost opportunities to demonstrate significant results because the
oneway ANOVA is a very conservative test. Going straight to the pairwise comparisons that are
adjusted for multiple comparisons will lead to significant findings more often, while still
controlling the Type I error rate. Discussing the situation where the means of four groups are
compared, H0: 1 = 2 = 3 = 4 , Zolman (1993, p.109) explains the conservatism of the
ANOVA test,
“The omnibus F-test, which is the proper analysis to tests these hypotheses, is the most
conservative test that could be used. This overall F-test requires a very large difference
between means to attain significance and reject H0. The reason for this conservatism is
that there are 25 possible combinations of mean comparisons (pairs and combinations)
when there are four treatment means (6.9). The F-test evaluates all these 25 comparisons
and maintains the overall level of significance at some previously specified level (e.g., .05
or .01).”
-----
Note: the “(6.9)” is simply referring to a section of Zolman's book.
This misconception that ANOVA must be used first before multiplicity adjusted pairwise
comparisons are performed arises because it is still being presented in statistics textbooks and
taught in statistics courses. Purposely not singling out any one book, these books make
statements like,
Before proceeding to pairwise comparisons, called post-hoc tests, the one-way analysis of
variance (ANOVA) F-test that simultaneously compares all the means must first be
significant. If it is not significant, making pairwise comparisons is unjustified.
This approach is propagated by authors of statistics textbooks basically cut-and-pasting ideas
presented in already published textbooks, rather than the author becoming familiar with the
multiple comparison method literature, which has long since outdated this approach. It takes decades
for new ideas to make their way into statistical textbooks.
Apparently the idea started with Fisher (1935), in reference to his LSD procedure. [Fisher is the
same famous statistician who derived the Fisher’s exact test, and who popularized hypothesis
testing using the alpha=0.05 level of significance. Snedecor and Cochran (1980, p.234) state,
“...Fisher (8) wrote, “When the z test (i.e., the F test) does not demonstrate significance,
much caution should be used before claiming significance for special comparisons.” In
line with this remark, investigators are sometimes advised to use the LSD method only if
F is significant. This method has been called the protected LSD, or PSD method. If F is
not significant, we declare no significant differences between means.”
___
8. Fisher RA. (1935). The Design of Experiments. Edinburgh, Oliver & Boyd.
The Fisher LSD, or PSD, multiple comparison procedure is a two-stage procedure. First,
compute a oneway ANOVA. If significant, go on to step 2, which is basically do all of the
pairwise comparisons using an equal variance independent groups Student t-test. The type I error
is nearly kept at the nominal alpha (0.05) by first requiring a significant ANOVA.
Since then, multiple comparison procedures have been proposed that do not require the ANOVA
step to protect against an inflated type I error. Proponents of the practice of requiring a
significant F test from an ANOVA before using any of the newer multiple comparison procedures
have lost track of the context in which Fisher's original idea was proposed.
Almost no multiple comparison procedure introduced after the Fisher LSD procedure requires
significant ANOVA. These multiple comparison procedures are designed to control the Type I
error rate by themselves, not requiring any calculation from the ANOVA and not requiring
ANOVA be significant in order to control the Type I error rate.
In an attempt to correct this erroneous thinking, Dunnett and Goldsmith (2006, p.438), in their
discussion of multiple-comparison adjusted pairwise comparisons, state,
“...it is not necessary to do a preliminary F test before proceeding with multiple
comparisons (as some texts recommend). Performing a preliminary F test may miss
important single effects which become diluted (averaged out) with other effects.”
Aickin and Gensler (1996) make a similar statement in their discussion of the Holm’s multiple
comparison procedure, which is one of the p value adjustment procedures that improve upon the
Bonferroni approach,
“The second point is that it is traditional to approach data such as these using analysis of
variance. However, in the traditional F test, only the null hypothesis of no differences
among means is tested. The F test gives no guidance concerning which groups are
different when the F statistic is significant and provides little power to detect a small
number of differences when most means coincide. For this reason, if individual group
differences are to be interpreted, there is no reason to perform the analysis of variance; it
is better to proceed directly to Holm’s procedure.”
Cut-and-Paste Response to Editor Insisting on ANOVA In Place Of or Preceding a
Multiple Comparison Procedure
If you report a multiple comparison procedure and the journal editor insists that ANOVA is
appropriate, here is a cut-and-paste response you can use:
The editor makes a comment that ANOVA is appropriate for these data, rather than simply
reporting multiple-comparison adjusted p values. Although we sincerely appreciate that the
editor’s point-of-view is shared by many, it is well-known among statisticians who have kept
up on the multiple comparison literature that such an approach is not needed, in general, and
particularly when using any of the Bonferroni-extension p value adjustment procedures, such
as the one we used. Since our approach is correct, we did not revise our paper to add any tests
of the global hypothesis using ANOVA, but instead kept our multiple comparison procedure
adjustments to the p values, which is adequate to protect against a Type I error. To meet the
editor halfway, we added two methods-paper citations that explicitly state that the ANOVA is
not needed. Our Statistical Methods Section sentence now reads,
“P values are adjusted for multiple comparisons using <fill in> multiple comparison procedure, which
controls the type I error without the need to first test the global hypothesis with ANOVA.[
Dunnett and Goldsmith (2006); Aickin and Gensler (1996)]”
------
Dunnett C, Goldsmith C. When and how to do multiple comparisons. In Buncher CR,
Tsay J-Y, eds., Statistics in the Pharmaceutical Industry. 3rd ed. New York, Chapman &
Hall/CRC, 2006, pp.421-452.
Aickin M, Gensler H. Adjusting for multiple testing when reporting research results: the
Bonferroni vs Holm methods. Am J Public Health 1996;86:726-728.
Here is a detailed explanation of why adjusting for multiple comparisons, without also using
ANOVA, is the best approach:
Although there is no need to test the global hypothesis first, many researchers and reviewers
have been misled into thinking that it is a necessary step, and they think pairwise
comparisons can only be done if the global hypothesis (oneway ANOVA) comes out
significant. The pairwise comparisons using a multiple comparison procedure have gained
the name post hoc tests, implying they are only done following a significant F statistic from a
oneway ANOVA.
This misconception arises because it is still being presented in statistics textbooks and taught
in statistics courses. Purposely not singling out any specific book, these textbooks make
statements like,
Before proceeding to pairwise comparisons, called post-hoc tests, the one-way analysis
of variance (ANOVA) F-test that simultaneously compares all the means must first be
significant. If it is not significant, making pairwise comparisons is unjustified.
This approach is propagated by authors of statistics textbooks basically cut-and-pasting ideas
presented in already published textbooks, rather than the author becoming familiar with the
multiple comparison method literature, which has long since outdated this approach. It takes
decades for new ideas to make their way into statistical textbooks.
Apparently the idea started with Fisher (1935), in reference to his LSD procedure. [Fisher is
the same famous statistician who derived the Fisher’s exact test, and who popularized
hypothesis testing using the alpha=0.05 level of significance. Snedecor and Cochran (1980,
p.234) state,
“...Fisher (8) wrote, “When the z test (i.e., the F test) does not demonstrate
significance, much caution should be used before claiming significance for special
comparisons.” In line with this remark, investigators are sometimes advised to use the
LSD method only if F is significant. This method has been called the protected LSD,
or PSD method. If F is not significant, we declare no significant differences between
means.”
___
8. Fisher RA. (1935). The Design of Experiments. Edinburgh, Oliver & Boyd.
The Fisher LSD, or PSD, multiple comparison procedure is a two-stage procedure. First,
compute a oneway ANOVA. If significant, go on to step 2, which is basically do all of the
pairwise comparisons using an equal variance independent groups Student t-test. The type I
error is nearly kept at the nominal alpha (0.05) by first requiring a significant ANOVA.
Since then, multiple comparison procedures have been proposed that do not require the
ANOVA step to protect against an inflated type I error. Proponents of the practice of
requiring a significant F test from an ANOVA before using any of the newer multiple
comparison procedure have lost tract of the context in which Fisher’s original idea was
proposed.
Almost no multiple comparison procedure introduced after the Fisher LSD procedure requires
a significant ANOVA. These multiple comparison procedures are designed to control the
Type I error rate by themselves, not requiring any calculation from the ANOVA and not
requiring ANOVA be significant in order to control the Type I error rate.
In an attempt to correct this erroneous thinking, Dunnett and Goldsmith (2006, p.438), in their
discussion of multiple-comparison adjusted pairwise comparisons, state,
“...it is not necessary to do a preliminary F test before proceeding with multiple
comparisons (as some texts recommend). Performing a preliminary F test may miss
important single effects which become diluted (averaged out) with other effects.”
Aickin and Gensler (1996) make a similar statement in their discussion of the Holm’s
multiple comparison procedure, which is one of the p value adjustment procedures that
improves upon the Bonferroni approach,
“The second point is that it is traditional to approach data such as these using analysis
of variance. However, in the traditional F test, only the null hypothesis of no
differences among means is tested. The F test gives no guidance concerning which
groups are different when the F statistic is significant and provides little power to
detect a small number of differences when most means coincide. For this reason, if
individual group differences are to be interpreted, there is no reason to perform the
analysis of variance; it is better to proceed directly to Holm’s procedure.”
References
Aickin M, Gensler H. (1996). Adjusting for multiple testing when reporting research results:
the Bonferroni vs Holm methods. Am J Public Health 86:726-728.
Dunnett C, Goldsmith C. (2006). When and how to do multiple comparisons. In Buncher CR,
Tsay J-Y, eds., Statistics in the Pharmaceutical Industry. 3rd ed. New York, Chapman
& Hall/CRC, pp. 421-452.
Fisher RA. (1935). The Design of Experiments. Edinburgh, Oliver & Boyd.
Munro BH. (2001). Statistical Methods for Health Care Research. 4th ed. Philadelphia,
Lippincott.
Snedecor GW, Cochran WG. (1980). Statistical Methods, 7th Ed. Ames, Iowa, The Iowa
State University Press.
Special Case of Multiplicity Adjustment: Controlling the False Discovery Rate
Above, we discussed the most commonly taught and understood approach to multiplicity.
Statisticians call this controlling the Familywise Error Rate (FWER). In that situation, a set of
comparisons, called a family, is made, and a conclusion of a statistically significant effect is
reached if any of the individual comparisons turns out to be statistically significant. For example,
if three groups are compared, the conclusion will be whether or not there is a difference among the
groups.
In some special situations, we are interested in the individual comparisons themselves, rather
than in making a single global statement about them. In this situation, we want to keep the false
positive rate among the declared discoveries at the nominal alpha value, usually 5%. Doing this,
at most 5% (in expectation) of the comparisons declared significant will be ones where the effect
is not real. This is called controlling the False Discovery Rate (FDR).
The FDR procedures are more powerful than the FWER procedures, leading to more significant
findings (Benjamini and Hochberg, 1995).
The best known procedure for doing this is the Benjamini-Hochberg procedure (Benjamini and
Hochberg, 1995). Benjamini and Hochberg suggest three situations where controlling the FDR is
more appropriate than controlling the FWER (Benjamini and Hochberg, 1995, p.292 section 2.2):
1) multiple endpoints problem. In this situation, an overall decision or recommendation is
reached from examining the individual comparisons, but the individual comparisons are each
important. Deciding to develop a new drug based on multiple discoveries of its benefit is an
example. We wish to make as many discoveries as possible about the benefits of the drug,
which will enhance a decision in favor of the new drug, while controlling the FDR.
2) multiple subgroups problem. Here we make multiple individual decisions, without an overall
decision being required. We might compare two treatments in several subgroups, with decisions
about the treatments made separately for each subgroup. We are willing to accept a pre-specified
expected proportion of false discoveries, say 5%, by controlling the FDR.
3) screening problem. Here multiple potential effects are screened to weed out the null effects.
For example, we might screen various chemicals for potential drug development. We want to
obtain as many discoveries as possible, but still wish to control the FDR, because a large fraction
of leads would burden the second phase of the confirmatory analysis.
The screening problem is what is encountered in genetics studies, where a large list of alleles, or
single-nucleotide polymorphisms (SNPs), is examined. Rosner (2006, pp.579-581) points out
that control of the FWER is not appropriate in genetics studies, while an FDR procedure is.
Moyé (2008, pp. 630-631) states the same thing,
“In addition, False Discovery Rate (FDR) offers a new and interesting perspective on the
multiple comparisons problem. Instead of controlling the chance of any false positives (as
Bonferroni does), the FDR controls the expected proportion of false positives among all
tests (Benjamini and Hochberg, 1995; Benjamini and Yekutieli, 2001). It is very useful in
micro-array analyses in which thousands of significance tests are executed.”
The Benjamini-Hochberg procedure is a p value adjustment procedure. Like the procedures
discussed above for controlling the FWER, the p values are obtained from whatever test statistic
is appropriate, and then the p values are adjusted for multiplicity.
Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995; Rosner, 2006):

   p = unadjusted p value, arranged in sort order (smallest to largest)
   adjusted p = (k/i)p, where k = number of comparisons and i = 1,2,…,k is the
                rank from smallest to largest
For 3 comparisons, this amounts to comparing
smallest p to adjusted alpha=(1/3)(0.05)=0.0167
middle p to adjusted alpha=(2/3)(0.05)=0.0333
largest p to adjusted alpha=(3/3)(0.05)=0.05
Example: applying the formula to the following unadjusted p values:

   .001  .020  .029

Adjusted p values are: (3/1)(.001) = .003
                       (3/2)(.020) = .030
                       (3/3)(.029) = .029

We discover an anomaly: the second adjusted p value (.030) is now larger than
the third adjusted p value (.029), which is illogical since the p values were in
the opposite order before adjustment.
The correction for anomalies works like the Hochberg procedure. With the p values in
ascending sort order, working from the largest down, if an adjusted p value is smaller
than the preceding adjusted p value (which is illogical), then set the preceding
adjusted p value to that smaller value.

After correction for anomalies,

Adjusted p values are: (3/1)(.001) = .003
                       (3/2)(.020) = .030 -> .029
                       (3/3)(.029) = .029
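
The adjustment and its anomaly correction are easy to verify by hand in Stata. The following is a
minimal sketch (rank and p_bh are illustrative names) that reproduces the adjusted values above;
the fdri command described next automates the same calculation.

* Minimal sketch of the Benjamini-Hochberg adjustment for the example above
clear
input double p
.001
.020
.029
end
sort p
gen rank = _n                      // i = 1,...,k, smallest to largest
local k = _N                       // k = number of comparisons
gen p_bh = (`k'/rank)*p            // adjusted p = (k/i)p
* anomaly correction: working from largest to smallest, an adjusted
* p value is never allowed to exceed the adjusted p value that follows it
forvalues i = `=`k'-1'(-1)1 {
    quietly replace p_bh = min(p_bh, p_bh[`i'+1]) in `i'
}
list p p_bh, noobs                 // gives .003, .029, .029 as above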
P Value Adjustment to Control FDR in Stata (fdri command)
[False Discovery Rate Immediate]
Such a p value adjustment procedure is not built into Stata. However, I wrote a program to do
it, fdri.ado, which is available in the datasets and do-files directory of the electronic course
manual. It is also listed in the appendix to this chapter, from which you can create the file
yourself.
Note: Using fdri.ado
In the command window, execute the command
sysdir
This tells you the directories Stata searches to find commands, or ado files.
It will look like:
STATA: C:\Program Files\Stata10\
UPDATES: C:\Program Files\Stata10\ado\updates\
BASE: C:\Program Files\Stata10\ado\base\
SITE: C:\Program Files\Stata10\ado\site\
PLUS: c:\ado\plus\
PERSONAL: c:\ado\personal\
OLDPLACE: c:\ado\
I suggest you copy the files fdri.ado and fdri.sthlp from the course CD to the c:\ado\personal\
directory. Having done that, fdri becomes an executable command in your installation of Stata.
If the directory c:\ado\personal\ does not exist, then you should create it using Windows Explorer
(My Documents icon), and then copy the two files into this directory.
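
Alternatively, the directory can be created and the files copied from within Stata itself. In this
minimal sketch, the source path d:\BiostatsCourse is only an assumption; replace it with wherever
your copy of the course files actually resides.

* create the personal ado directory (capture ignores the error if the
* directory already exists), then copy the files; the source path is assumed
capture mkdir "c:\ado\personal"
copy "d:\BiostatsCourse\fdri.ado" "c:\ado\personal\fdri.ado"
copy "d:\BiostatsCourse\fdri.sthlp" "c:\ado\personal\fdri.sthlp"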
These two files are also available in the appendix to this chapter.
To get help for fdri, type help fdri in the command window. A nice feature of the help file is
that it also lists the suggested citations for the procedure.
To execute, use the command fdri followed by the list of p values you want to adjust, each
p value separated by a space.
The following is example output from the fdri command, using the three p values
shown in the Benjamini-Hochberg example above,
fdri .001 .020 .029

P Value Adjustment for Controlling False Discovery Rate

SORTED ORDER: before anomaly corrected
Unadj   Adjusted
P Val   BenHoc
0.0010  0.003
0.0200  0.030
0.0290  0.029

SORTED ORDER: anomaly corrected
 Working from largest to smallest, if Benjamini-Hochberg
 preceding smaller P > P then set preceding smaller P to P
Unadj   Adjusted
P Val   BenHoc
0.0010  0.003
0.0200  0.029
0.0290  0.029

ORIGINAL ORDER: anomaly corrected
Unadj   Adjusted
P Val   BenHoc
0.0010  0.003
0.0200  0.029
0.0290  0.029
-----------------------------------------
*Adjusted for 3 multiple comparisons
KEY: BenHoc = Benjamini-Hochberg procedure
Next, using the example from Benjamini and Hochberg (1995),
fdri .0001 .0004 .0019 .0095 .0201 .0278 .0298 .0344 ///
.0459 .03240 .4262 .5719 .6528 .7590 1.000
P Value Adjustment for Controlling False Discovery Rate

SORTED ORDER: before anomaly corrected
Unadj   Adjusted
P Val   BenHoc
0.0001  0.002
0.0004  0.003
  …
1.0000  1.000

SORTED ORDER: anomaly corrected
 Working from largest to smallest, if Benjamini-Hochberg
 preceding smaller P > P then set preceding smaller P to P
Unadj   Adjusted
P Val   BenHoc
0.0001  0.002
0.0004  0.003
  …
1.0000  1.000

ORIGINAL ORDER: anomaly corrected
Unadj   Adjusted
P Val   BenHoc
0.0001  0.002
0.0004  0.003
0.0019  0.010
0.0095  0.036
0.0201  0.057
0.0278  0.057
0.0298  0.057
0.0344  0.057
0.0459  0.069
0.0324  0.057
0.4262  0.581
0.5719  0.715
0.6528  0.753
0.7590  0.813
1.0000  1.000
-----------------------------------------
*Adjusted for 15 multiple comparisons
KEY: BenHoc = Benjamini-Hochberg procedure
We see that we get to keep the significant findings for the four smallest p values, which is
consistent with the result in the Benjamini and Hochberg article.
An example of an article that used the Benjamini-Hochberg procedure is:
Chen P, Liang J, Wang Z, et al. Association of common PALB2 polymorphisms with breast
cancer risk: a case-control study. Clin Cancer Res 2008 Sep 15;14(18):5931-7. In their Abstract,
they report,
“RESULTS: Based on the multiple hypothesis testing with the Benjamini-Hochberg
method, tagging SNPs (tSNP) rs249954, rs120963, and rs16940342 were found to be
associated with an increase of breast cancer risk (false discovery rate-adjusted P values
of 0.004, 0.028, and 0.049, respectively) under the dominant model.”
Article Suggestion
Here is a suggestion for your Statistical Methods Section,
Given that we had a genetics study, in which a large list of SNPs was examined, we report
Benjamini-Hochberg adjusted p values, which maintain the false discovery rate (FDR) at
the nominal alpha = 0.05 level (Benjamini and Hochberg, 1995). In such studies, controlling
for multiplicity in the standard fashion, such as with the Bonferroni procedure, which
controls the family-wise error rate (FWER), is not justified, while control of the FDR
provides the correct control for multiplicity (Benjamini and Hochberg, 1995; Rosner, 2006;
Moyé, 2008).
An example of an investigator making a statement like this, only more detailed, to justify
controlling the FDR is Scott et al. (2010),
“Statistical tests of the univariate relationships between these baseline predictor
variables and the 6 outcome variables at each of the 2 follow-up visits resulted in 204
(CRVO analyses) and 240 (BRVO analyses) P values. If so many hypotheses are tested
without special precautions, some relationships would likely appear significant by chance
alone (i.e., type I error). To mitigate this, we controlled the false discovery rate (FDR)8,9
at 5% separately within the CRVO and BRVO disease area analyses.
Modern clinical trials may feature multiple, co-primary endpoints, with the
statistical significance of any one of the end points potentially serving as a basis for a
claim of efficacy. In that situation, one typically controls family-wide type I error (FWE).
However, the aim of this article is not to claim efficacy of a particular treatment but to
nominate important predictive relationships. Here, controlling FDR is more appropriate.
Controlling FWE at a level of 0.05 ensures that the probability of incorrectly rejecting at
least 1 null hypothesis is only 5%. In contrast, controlling FDR at a level of 0.05 ensures
that the expected proportion, among all rejected null hypotheses, of incorrectly rejected
null hypotheses is only 5%. The FDR is often implemented in genomics research areas
such as gene chips, where multiplicity is a well-recognized phenomenon of concern.
Benjamini and Hochberg8 introduced FDR methodology for independent hypothesis
tests. Benjamini and Yekutieli9 later showed that the original method suffices for some
types of dependence and introduced a conservative correction that works for all types of
dependence. We chose the FDR criterion to try to ensure that no more than 5% of
the results we claim to be significant would fail to be confirmed if subsequently
investigated with new data, consistent with recommendations by Benjamini et al.10”
---------------
8. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and
powerful approach to multiple testing. J R Stat Soc Series B Stat Methodol 1995;57:
289–300.
9. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing
under dependency. Ann Stat 2001;29:1165–88.
10. Benjamini Y, Drai D, Elmer G, et al. Controlling the false discovery rate in behavior
genetics research. Behav Brain Res 2001;125:279–84.
Dichotomous Dependent Variable
In the two-sample case in the previous chapter, we analyzed data like this using the chi-square
test (if the minimum expected cell frequency assumption was met) and Fisher’s exact test
otherwise. For the k-sample case, we use the chi-square test again (with the same assumption)
and the Fisher-Freeman-Halton test otherwise (Barnard’s test is only for 2 × 2 tables). This gives
us one p value (testing the global hypothesis) so no multiple comparison adjustment is required.
For pairwise comparisons, we would use the same tests (including Barnard’s test) and then adjust
the p values using one of the procedures discussed above. We would then report the adjusted p
values, rather than the original unadjusted p values.
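
As a point of Stata mechanics, there is no separately named Fisher-Freeman-Halton command:
for tables larger than 2 × 2, the exact option of tabulate (or its immediate form, tabi) computes
the Fisher-Freeman-Halton generalization in place of the ordinary Fisher exact test. A minimal
sketch, using the same hypothetical counts as the illustration below:

* global tests for a 2 x 3 table entered as immediate data; chi2 gives the
* Pearson chi-square and, because the table is larger than 2 x 2, exact
* gives the Fisher-Freeman-Halton test
tabi 10 14 25 \ 90 86 75, chi2 exact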
To illustrate, consider the following data:

Five-Year Survival Following Treatment for Unspecified Cancer,
by Three Therapies (hypothetical data)

                   Chemo   Surgery   Radiation   Total
survived 5 years      10        14          25      49
died                  90        86          75     251
total                100       100         100     300
tabi 10 14 25 \ 90 86 75 , expect col chi2

+--------------------+
| Key                |
|--------------------|
|     frequency      |
| expected frequency |
| column percentage  |
+--------------------+

           |               col
       row |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |        10         14         25 |        49
           |      16.3       16.3       16.3 |      49.0
           |     10.00      14.00      25.00 |     16.33
-----------+---------------------------------+----------
         2 |        90         86         75 |       251
           |      83.7       83.7       83.7 |     251.0
           |     90.00      86.00      75.00 |     83.67
-----------+---------------------------------+----------
     Total |       100        100        100 |       300
           |     100.0      100.0      100.0 |     300.0
           |    100.00     100.00     100.00 |    100.00

          Pearson chi2(2) =   8.8300   Pr = 0.012
The global test is known to be conservative, and so this chi-square would not normally be done.
However, if we wanted to, we could use it and stop here, just reporting: “There was a significant
difference among the three therapies in 5-year survival (p = 0.012).”
The reader would naturally want to know if Radiation therapy is significantly better than Surgery,
as well as significantly better than Chemo, and if Surgery is significantly better than Chemo.
Our next step would be (which would normally be our first step, skipping the global test):
tabi 10 14 \ 90 86 , expect chi2
tabi 10 25 \ 90 75 , expect chi2
tabi 14 25 \ 86 75 , expect chi2
+--------------------+
| Key                |
|--------------------|
|     frequency      |
| expected frequency |
+--------------------+

           |          col
       row |         1          2 |     Total
-----------+----------------------+----------
         1 |        10         14 |        24
           |      12.0       12.0 |      24.0
-----------+----------------------+----------
         2 |        90         86 |       176
           |      88.0       88.0 |     176.0
-----------+----------------------+----------
     Total |       100        100 |       200
           |     100.0      100.0 |     200.0

          Pearson chi2(1) =   0.7576   Pr = 0.384

           |          col
       row |         1          2 |     Total
-----------+----------------------+----------
         1 |        10         25 |        35
           |      17.5       17.5 |      35.0
-----------+----------------------+----------
         2 |        90         75 |       165
           |      82.5       82.5 |     165.0
-----------+----------------------+----------
     Total |       100        100 |       200
           |     100.0      100.0 |     200.0

          Pearson chi2(1) =   7.7922   Pr = 0.005

           |          col
       row |         1          2 |     Total
-----------+----------------------+----------
         1 |        14         25 |        39
           |      19.5       19.5 |      39.0
-----------+----------------------+----------
         2 |        86         75 |       161
           |      80.5       80.5 |     161.0
-----------+----------------------+----------
     Total |       100        100 |       200
           |     100.0      100.0 |     200.0

          Pearson chi2(1) =   3.8541   Pr = 0.050
Obtaining adjusted p values with mcpi,

mcpi .384 .005 .050

ORIGINAL ORDER: anomaly corrected
Unadj  ---------------------- Adjusted ---------------------------
P Val   TCH    Homml  Finnr  Hochb  Ho-Si  Holm   Sidak  Bonfr
0.3840  0.568  0.384  0.384  0.384  0.384  0.384  0.766  1.000
0.0050  0.009  0.015  0.015  0.015  0.015  0.015  0.015  0.015
0.0500  0.085  0.100  0.074  0.100  0.098  0.100  0.143  0.150
-----------------------------------------------------------------
*Adjusted for 3 multiple comparisons
Now we are really frustrated, because we lost our significance between surgery and radiation.
The question, then, is whether it was really necessary to adjust the p values. After all, this is
not like 3 doses of the same drug--these are 3 distinct therapies (more like 3 distinct hypotheses).
There is a lot of confusion around this issue, and you might encounter a reviewer who claims you
need a multiple comparison procedure here.
Actually, you do not. Look back at the 5 aspects of clinical trials in which Pocock states
multiplicity arises. Notice that in all five aspects listed by Pocock, you are testing one global
hypothesis with multiple comparisons. That is not the case with the three cancer treatments,
unless the research question is merely whether or not differences exist among cancer treatments
(and you don’t really care which treatments outperform which other treatments). If the interest is
in how specific cancer treatments compare with other specific treatments, which of course it is,
then you had three hypotheses to test when you designed the study--reporting the global
hypothesis test as we did above is nothing but a waste of space in the article. You should instead
report three separate statements, something to the effect of:
Radiation therapy was significantly better than chemotherapy (p = 0.005). Radiation
therapy was also significantly better than surgery (p = 0.050). There was no significant
difference between chemotherapy and surgery (p = 0.384).
The editor might say, “A multiple comparison procedure is needed since the three therapies were
tested using the same sample of patients.” On the surface, this sounds credible. After all, the p
value is the probability of observing the effect you did simply by taking a sample, and you only
took one sample.
What we need is a good reference to support our position. Here it is (see box).
Reference for Not Adjusting Multiple Arm Comparisons for Multiplicity
Dunnett and Goldsmith (2006) state:
“Here, some typical examples arising in pharmaceutical research to illustrate some of the
reasons for using (or not using) multiple comparison procedures will be considered. In
general, the use of an appropriate multiple comparison test to make inferences concerning
treatment contrasts is indicated in the following situations:
1. To make an inference concerning a particular contrast which has been selected on the
basis of how the data have turned out.
2. To make an inference which requires the simultaneous examination of several
treatment contrasts.
3. In “data dredging,” viz., assembling the data in various ways to determine whether
some interesting differences will emerge.
On the other hand, multiple comparison procedures are usually not appropriate when
particular contrasts to be tested are selected in advance and are reported individually
rather than as a group. In such situations, the comparison error rate is usually of primary
concern and the standard tests of significance can be used, rather than a multiple
comparison test.”
Note: The statistical term “contrast” is used by Dunnett and Goldsmith simply to make their
statements more general. You can think of a contrast as any comparison, such as group 1 vs
group 2 (the usual type of comparison), or perhaps something more unusual such as (group 1 +
group 2)/2 vs group 3. For our purposes, simply change “contrasts” to “comparisons” in situation 2.
We would state in our protocol:
“The three treatment arms will be compared with each other using chi-square tests, or
with Fisher’s exact tests if the minimum expected cell frequency assumption is not met.
No adjustment for multiplicity is required, as the inference related to the study aim does
not require the simultaneous examination of the three comparisons (Dunnett and
Goldsmith, 2006). That is, three separate comparisons are made to test three separate
hypotheses, which are reported and discussed separately, rather than using the three
comparisons to support a single conclusion; therefore, applying a multiple-comparison
procedure would not be appropriate (Dunnett and Goldsmith, 2006).”
If we made this statement of not needing a multiplicity adjustment, and the misinformed reviewer
came back with a request to make the adjustment anyway, we would include the entire
three-paragraph Dunnett and Goldsmith quote given above in our response to support our position.
When to Use a Global Test
Regardless of the level of measurement, global tests are fine for the Table 1 patient
characteristics, since we really don’t care if we miss significance, and one p value is easier to
deal with than many. It is common practice to use a global test for Table 1 comparisons.
As an example, this was done in the Florez et al. (2006) paper. They refer to global tests in their
Table 1 footnote, and they use Holm adjusted p values in their results, Table 2.
The global test is conservative, however, not giving significance often enough. If you really care
about significance, such as when the comparison is for the outcome variable, then you should
skip the global test and go straight to the pairwise comparisons, adjusting the pairwise
comparisons for multiplicity using a p value adjustment procedure. The smallest adjusted
pairwise p value is the p value for the global hypothesis, if you want to report a global hypothesis
conclusion.
Nominal Dependent Variable
In the two-sample case in the previous chapter, we analyzed data like this using the chi-square
test (if the minimum expected cell frequency assumption was met) and the Fisher-Freeman-Halton
test otherwise. For the k-sample case, we use the chi-square test again (with the same
assumption) and the Fisher-Freeman-Halton test otherwise. This gives us one p value (testing the
global hypothesis) so no multiple comparison adjustment is required.
For pairwise comparisons, we would use the same tests and then adjust the p values using one of
the procedures discussed above. We would then report the adjusted p values, rather than the
original unadjusted p values. There is really no need to test the global hypothesis, so we should
just use pairwise comparisons to test for a study effect.
Ordinal Dependent Variable
In the two-sample case in the previous chapter, we analyzed data like this using the
Wilcoxon-Mann-Whitney test. For the k-sample case, we use the Kruskal-Wallis analysis of
variance by ranks test (also called the Kruskal-Wallis nonparametric analysis of variance).
Analysis of variance is popularly abbreviated as ANOVA. This test is nothing more than a
k-sample extension of the Wilcoxon-Mann-Whitney test (for 2 groups, it is identically the
Wilcoxon-Mann-Whitney test). This gives us one p value (testing the global hypothesis) so no
multiple comparison adjustment is required.
For pairwise comparisons, we would use Wilcoxon-Mann-Whitney tests and then adjust the p
values using one of the procedures discussed above. We would then report the adjusted p values,
rather than the original unadjusted p values.
Dalton et al. (1987) report a study where they obtain absorption profiles from women following
the administration of ointment containing 20, 30, and 40 mg of progesterone to the nasal mucosa.
Their dataset is reproduced in Altman’s biostatistics textbook (1991). A subset of these data is
found in the file progesterone.dta. The dependent variable is actually interval scaled, but let’s
analyze it as an ordinal variable anyway so we can compare the result to an analysis of it as an
interval scale variable in the next section.
Start the Stata program and read in the data,
File
Open
Find the directory where you copied the course CD: BiostatsCourse
Find the subdirectory datasets & do-files
Single click on progesterone.dta
Open
use "C:\Documents and Settings\u0032770.SRVR\Desktop\BiostatsCourse\
datasets & do-files\ progesterone.dta", clear
which must be all on one line, or use:
cd "C:\Documents and Settings\u0032770.SRVR\Desktop\BiostatsCourse\"
cd "datasets & do-files"
use progesterone.dta, clear
Simultaneously comparing all four groups:
Statistics
Summaries, tables & tests
Nonparametric tests of hypotheses
Kruskal-Wallis rank test
Main tab: Outcome variable: peakval
Variable defining groups: group
OK
kwallis peakval, by(group)

Test: Equality of populations (Kruskal-Wallis test)

+-----------------------------------------------------------+
| group                                   | Obs | Rank Sum  |
|-----------------------------------------+-----+-----------|
| Grp 1 (0.2ml of 100 mg/ml one nostril)  |   6 |     76.50 |
| Grp 2 (0.3ml of 100 mg/ml one nostril)  |   6 |     45.00 |
| Grp 3 (0.2ml of 200 mg/ml one nostril)  |   4 |     34.50 |
| Grp 4 (0.2ml of 100 mg/ml each nostril) |   4 |     54.00 |
+-----------------------------------------------------------+

chi-squared =            3.841 with 3 d.f.
probability =           0.2791

chi-squared with ties =  3.844 with 3 d.f.
probability =           0.2788    <- use this one (should always correct for ties)
Alternatively, we can skip the Kruskal-Wallis test, and instead compute three
Wilcoxon-Mann-Whitney tests and adjust the p values for multiplicity.
ranksum peakval if group==1 | group==2, by(group)
ranksum peakval if group==1 | group==3, by(group)
ranksum peakval if group==2 | group==3, by(group)

mcpi 0.1093 0.3359 1.000

ORIGINAL ORDER: anomaly corrected
Unadj  ---------------------- Adjusted ---------------------------
P Val   TCH    Homml  Finnr  Hochb  Ho-Si  Holm   Sidak  Bonfr
0.1093  0.182  0.328  0.293  0.328  0.293  0.328  0.293  0.328
0.3359  0.508  0.672  0.459  0.672  0.559  0.672  0.707  1.000
1.0000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
-----------------------------------------------------------------
*Adjusted for 3 multiple comparisons
Interval Dependent Variable
In the two-sample case in the previous chapter, we analyzed data like this using the t test. For the
k-sample case, we use one-way analysis of variance (or one-way ANOVA). This test is nothing
more than a k-sample extension of the equal variance t test (for two groups, the tests are
identical). This gives us one p value (testing the global hypothesis) so no multiple comparison
adjustment is required.
Like the t test, the one-way ANOVA also has the assumptions of normally distributed data and
equal variances among the groups (called the homogeneity of variance assumption). The
homogeneity of variance assumption is tested using Bartlett’s test when we use Stata’s oneway
command. In actuality, the t test and one-way ANOVA are very robust to these assumptions, so
there generally is no need to test if the assumptions are met.
How to test the assumptions, however, is shown next, in case you ever really need to do that
(to satisfy a demanding journal reviewer, for example).
The syntax for the one-way ANOVA is,

oneway depvar groupvar                                <-- one-way ANOVA
oneway depvar groupvar, tabulate means standard obs   <-- one-way ANOVA with descriptive statistics
Computing a one-way ANOVA,
Statistics
Linear models and related
ANOVA/MANOVA
One-way ANOVA
Main tab: Response variable: peakval
Factor variable: group
Output: produce summary table
OK
oneway peakval group, tabulate

            | Summary of Serum Progesterone Peak
            |           Value (nmol/l)
 Dose Group |        Mean   Std. Dev.       Obs.
------------+------------------------------------
  Grp 1 (0. |   27.033334   9.8798115          6
  Grp 2 (0. |        18.8   5.9275625          6
  Grp 3 (0. |      22.025   13.131482          4
  Grp 4 (0. |       26.85   6.8656151          4
------------+------------------------------------
      Total |      23.525    9.129125         20

                        Analysis of Variance
    Source              SS         df      MS            F     Prob > F
------------------------------------------------------------------------
Between groups      261.026689      3   87.0088963      1.05     0.3965
 Within groups      1322.45087     16   82.6531791
------------------------------------------------------------------------
    Total           1583.47755     19   83.3409239

Bartlett's test for equal variances:  chi2(3) =   2.6306   Prob>chi2 = 0.452
We did not observe a significant difference among the four groups (p = 0.397). Bartlett’s test
was not significant (p = 0.452) so the assumption of equal variances was not shown to be
violated.
However, Bartlett’s test is sensitive to departures from normality, so a better test for equality of
variances is Levene’s test (which works for both the t test and one-way ANOVA). Levene’s test
is said to be “robust to the normality assumption,” meaning that it provides an accurate
comparison of the variances even if the normality assumption is violated. I suggest, then, that
you ignore the Bartlett’s test printed with the oneway output, and use the following command to
test equality of variances using Levene’s test, if you really want to test this assumption:
Statistics
Summaries, tables & tests
Classical tests of hypotheses
Robust equal variance test
Main tab: Variable: peakval
Variable defining two comparison groups: group
OK
robvar peakval, by(group)

            | Summary of Serum Progesterone Peak
            |           Value (nmol/l)
 Dose Group |        Mean   Std. Dev.      Freq.
------------+------------------------------------
  Grp 1 (0. |   27.033334   9.8798115          6
  Grp 2 (0. |        18.8   5.9275625          6
  Grp 3 (0. |      22.025   13.131482          4
  Grp 4 (0. |       26.85   6.8656151          4
------------+------------------------------------
      Total |      23.525    9.129125         20

W0  = .66609748   df(3, 16)   Pr > F = .58499634   <- This one is Levene’s test—use this
W50 = .55434711   df(3, 16)   Pr > F = .65261607
W10 = .66609748   df(3, 16)   Pr > F = .58499634
If the homogeneity of variance assumption is not satisfied, we can either transform the data or
drop back to the Kruskal-Wallis ANOVA. Using the Kruskal-Wallis ANOVA when the variances
are not equal among the groups is apparently a controversy among statisticians. Daniel (1995,
p.598) advocates using the Kruskal-Wallis ANOVA when either the normality assumption or the
equal variances assumption for the interval-scaled one-way ANOVA is not met. On the other
hand, Glantz and Slinker (2001, pp.327-328) claim that the Kruskal-Wallis ANOVA assumes the
distributions have the same shape, and thus the same dispersion (or variance), and so is not a
suitable alternative to the one-way ANOVA when there are unequal variances. [Glantz and
Slinker propose using either the Brown-Forsythe F statistic or the Welch W statistic when the
variances are not equal. These tests are not available in Stata, but they are available as an option
in the SPSS ONEWAY procedure.] Siegel and Castellan (1988) never mention this “same
shape” assumption of the Kruskal-Wallis ANOVA in their nonparametric statistics text.
Usually, statisticians will use the Kruskal-Wallis ANOVA if the homogeneity of variance
assumption is not met, despite the above controversy. It turns out the parametric ANOVA is
robust to the homogeneity of variance assumption, so you generally don’t have to worry about
the assumption anyway.
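
If we did judge the variances to be unequal and preferred the transformation route mentioned
above, a log transformation is the usual first attempt for positive, right-skewed data. A minimal
sketch using the progesterone example (log_peakval is an illustrative name):

* re-run the analysis on the log scale (log_peakval is illustrative)
gen log_peakval = log(peakval)
robvar log_peakval, by(group)        // re-check Levene's test (W0) on the log scale
oneway log_peakval group, tabulate   // one-way ANOVA on the transformed outcome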
For pairwise comparisons, we would use t tests and then adjust the p values using one of the
procedures discussed above. We would then report the adjusted p values, rather than the original
unadjusted p values. We’ll do that below.
Getting back to the progesterone example above, we saw that the Kruskal-Wallis ANOVA was
more powerful (smaller p value) than the parametric one-way ANOVA, which is normally
not the case. This suggests that perhaps we violated the normality assumption, since the equality
of variances assumption appeared adequately satisfied.
Investigating this with a boxplot,
Graphics
Box plot
Main tab: Variables: peakval
By tab: Draw subgraphs for unique values of variables: group
OK
graph box peakval, by(group)

[Box plots of peakval (Serum Progesterone Peak Value, nmol/l), scaled roughly 10 to 50,
shown separately for Groups 1-4; the long group value labels overlap on the axis.]
Besides realizing we need shorter value labels for the x-axis to be labeled properly, we see an
outlier in Group 1. Checking normality with the Shapiro-Wilk test,
Statistics
Summaries, tables & tests
Distributional plots & tests
Shapiro-Wilk normality test
Main tab: Variables: peakval
by/if/in tab: Repeat command by groups:
Variables that define groups: group
OK
by group, sort : swilk peakval
<or>
bysort group: swilk peakval
_______________________________________________________________
-> group = Grp 1 (0.2ml of 100 mg/ml one nostril)

                   Shapiro-Wilk W test for normal data

    Variable |    Obs         W          V         z     Prob>z
-------------+--------------------------------------------------
     peakval |      6    0.85346      1.815     0.963    0.16784

_______________________________________________________________
-> group = Grp 2 (0.3ml of 100 mg/ml one nostril)

                   Shapiro-Wilk W test for normal data

    Variable |    Obs         W          V         z     Prob>z
-------------+--------------------------------------------------
     peakval |      6    0.95127      0.604    -0.676    0.75052

_______________________________________________________________
-> group = Grp 3 (0.2ml of 200 mg/ml one nostril)

                   Shapiro-Wilk W test for normal data

    Variable |    Obs         W          V         z     Prob>z
-------------+--------------------------------------------------
     peakval |      4    0.88967      1.272     0.301    0.38160

_______________________________________________________________
-> group = Grp 4 (0.2ml of 100 mg/ml each nostril)

                   Shapiro-Wilk W test for normal data

    Variable |    Obs         W          V         z     Prob>z
-------------+--------------------------------------------------
     peakval |      4    0.98355      0.190    -1.422    0.92248
we see that there is not sufficient evidence to reject the normality assumption for Group 1.
However, this might largely be due to the small sample size (n = 6).
Let’s compute the pairwise t tests and adjust them for multiple comparisons. This is the best
approach anyway, since the ANOVA global test is very conservative.
Statistics
Summaries, tables & tests
Classical tests of hypotheses
Two-group mean-comparison test
Main tab: Variable name: peakval
Group variable name: group
by/if/in tab: If (expression): group==1 | group==2
OK
ttest peakval if group==1 | group==2, by(group)
Doing this for all pairwise comparisons, by changing the “if” expression,
ttest peakval if group==1 | group==2, by(group)
ttest peakval if group==1 | group==3, by(group)
ttest peakval if group==2 | group==3, by(group)
. ttest peakval if group==1 | group==2, by(group)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
Grp 1 (0 |       6    27.03333    4.033416    9.879811    16.66511    37.40156
Grp 2 (0 |       6        18.8    2.419917    5.927563     12.5794     25.0206
---------+--------------------------------------------------------------------
combined |      12    22.91667    2.562989    8.878456    17.27557    28.55777
---------+--------------------------------------------------------------------
    diff |            8.233334    4.703663                -2.24708    18.71375
------------------------------------------------------------------------------
    diff = mean(Grp 1 (0) - mean(Grp 2 (0)                        t =   1.7504
Ho: diff = 0                                     degrees of freedom =       10

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9447         Pr(|T| > |t|) = 0.1106          Pr(T > t) = 0.0553

. ttest peakval if group==1 | group==3, by(group)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
Grp 1 (0 |       6    27.03333    4.033416    9.879811    16.66511    37.40156
Grp 3 (0 |       4      22.025    6.565741    13.13148    1.129881    42.92012
---------+--------------------------------------------------------------------
combined |      10       25.03    3.440867    10.88098    17.24622    32.81378
---------+--------------------------------------------------------------------
    diff |            5.008334    7.236197               -11.67837    21.69503
------------------------------------------------------------------------------
    diff = mean(Grp 1 (0) - mean(Grp 3 (0)                        t =   0.6921
Ho: diff = 0                                     degrees of freedom =        8

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.7458         Pr(|T| > |t|) = 0.5084          Pr(T > t) = 0.2542

. ttest peakval if group==2 | group==3, by(group)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
Grp 2 (0 |       6        18.8    2.419917    5.927563     12.5794     25.0206
Grp 3 (0 |       4      22.025    6.565741    13.13148    1.129881    42.92012
---------+--------------------------------------------------------------------
combined |      10       20.09    2.824396    8.931523    13.70077    26.47923
---------+--------------------------------------------------------------------
    diff |              -3.225    6.007753                -17.0789     10.6289
------------------------------------------------------------------------------
    diff = mean(Grp 2 (0) - mean(Grp 3 (0)                        t =  -0.5368
Ho: diff = 0                                     degrees of freedom =        8

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.3030         Pr(|T| > |t|) = 0.6060          Pr(T > t) = 0.6970
We next adjust the three p values for 3 multiple comparisons:

mcpi 0.1106 0.5084 0.6060

ORIGINAL ORDER: anomaly corrected
Unadj  ---------------------- Adjusted ---------------------------
P Val   TCH    Homml  Finnr  Hochb  Ho-Si  Holm   Sidak  Bonfr
0.1106  0.184  0.332  0.296  0.332  0.296  0.332  0.296  0.332
0.5084  0.708  0.606  0.655  0.606  0.758  1.000  0.881  1.000
0.6060  0.801  0.606  0.655  0.606  0.758  1.000  0.939  1.000
-----------------------------------------------------------------
*Adjusted for 3 multiple comparisons

KEY: TCH   = Tukey-Ciminera-Heyse procedure
     Homml = Hommel procedure
     Finnr = Finner procedure
     Hochb = Hochberg procedure
     Ho-Si = Holm-Sidak procedure
     Holm  = Holm procedure
     Sidak = Sidak procedure
     Bonfr = Bonferroni procedure
No multiple comparison procedure helped, as can be expected since they only make the p values
the same or larger. Still, it is the p values from our choice of one of these procedures that we
would report. Our choice should not include TCH, however, since that is for highly correlated
comparisons (repeated measurements, for example).
References
Abramson JH, Gahlinger PM. (2001). Computer Programs for Epidemiologists: PEPI Version
4.0. Salt Lake City, UT, Sagebrush Press.
Aickin M, Gensler H. (1996). Adjusting for multiple testing when reporting research results: the
Bonferroni vs Holm methods. Am J Public Health 86:726-728.
Altman DG. (1991). Practical Statistics for Medical Research. New York, Chapman &
Hall/CRC, 1991, pp.426-433.
Benjamini Y, Hochberg Y. (1995). Controlling the false discovery rate: a practical and powerful
approach to multiple testing. J R Statist Soc, Series B (Methodological) 57(1):289-300.
Benjamini Y, Yekutieli D. (2001). The control of the false discovery rate in multiple testing
under dependency. Ann Statistics 29(4):1165-1188.
Browner WS, Newman TB, Cummings SR, Hulley SB. (1988). Getting ready to estimate sample
size: hypotheses and underlying principles. In Hulley SB, Cummings SR, eds., Designing
Clinical Research: An Epidemiologic Approach. Baltimore, Williams & Wilkins.
Cummings SR, Ensrud K, Delmas PD, et al. (2010). Lasofoxifene in postmenopausal women
with osteoporosis. N Engl J Med 362(8):686-96.
Dalton ME, Bromhan DR, Ambrose CL, Osborne J, Dalton KD. (1987). Nasal absorption of
progesterone in women. Br J Obstet Gynaecol 94(1):85-8.
Daniel WW. (1995). Biostatistics: A Foundation for Analysis in the Health Sciences. 6th ed. New
York, John Wiley & Sons.
Dunnett C, Goldsmith C. (2006). When and how to do multiple comparisons. In Buncher CR,
Tsay J-Y, eds., Statistics in the Pharmaceutical Industry. 3rd ed. New York, Chapman &
Hall/CRC, pp. 421-452.
Finner H. (1993). On a monotonicity problem in step-down multiple test procedures. Journal of
the American Statistical Association 88:920-923.
Fisher RA. (1935). The Design of Experiments. Edinburgh, Oliver & Boyd.
Florez JC, Jablonski KA, Bayley N, et al. (2006). TCF7L2 polymorphisms and progression to
diabetes in the diabetes prevention program. NEJM 355(3):241-250.
Freemantle N. (2001). Interpreting the results of secondary end points and subgroup analyses in
clinical trials: should we lock the crazy aunt in the attic? BMJ 322:989-91.
Glantz SA, Slinker BK. (2001). Primer of Applied Regression and Analysis of Variance. 2nd ed.
New York, McGraw-Hill.
Holm S. (1979). A simple sequentially rejective multiple test procedure. Scan J Stat 6:65-70.
Hommel G. (1989). A comparison of two modified Bonferroni procedures. Biometrika 76:624-625.
Horton NJ, Switzer SS. (2005). Statistical methods in the Journal. [letter] NEJM 353(18):1977-1979.
Levin B. (1996). Annotation: on the Holm, Simes, and Hochberg multiple test procedures. Am J
Public Health 86(5):628-629.
Ludbrook J. (1998). Multiple comparison procedures updated. Clinical and Experimental
Pharmacology and Physiology 25:1032-1037.
Moyé LA. (2008). The multiple comparison issue in health care research. In, Rao CR, Miller JP,
Rao DC (eds), Handbook of Statistics 27: Epidemiology and Medical Statistics, New
York, Elsevier, pp.616-655.
Pocock SJ. (1983). Clinical Trials: A Practical Approach. New York, John Wiley & Sons.
Rosner B. (2006). Fundamentals of Biostatistics, 6th ed. Belmont CA, Thomson Brooks/Cole.
Sankoh AJ, Huque MF, Dubey SD. (1997). Some comments on frequently used multiple
endpoint adjustment methods in clinical trials. Statistics in Medicine 16:2529-2542.
Scott IU, VanVeldhuisen PC, Oden NL, et al. (2010). Baseline predictors of visual acuity and
retinal thickness outcomes in patients with retinal vein occlusion: standard care versus
corticosteroid for retinal vein occlusion study report 10. Ophthalmology (in press).
Snedecor GW, Cochran WG. (1980). Statistical Methods, 7th Ed. Ames, Iowa, The Iowa State
University Press.
Siegel S, Castellan NJ Jr. (1988). Nonparametric Statistics for the Behavioral Sciences. 2nd ed.
New York, McGraw-Hill.
Simes RJ. (1986). An improved Bonferroni procedure for multiple tests of significance.
Biometrika 73:751-754.
Tukey JW, Ciminera JL, Heyse JF. (1985). Testing the statistical certainty of a response to
increasing doses of a drug. Biometrics 41:295-301.
Witte JS, Elston RC, Cardon LR. (2000). On the relative sample size required for multiple
comparisons. Statistics in Medicine 19:369-372.
Wright SP. (1992). Adjusted p-values for simultaneous inference. Biometrics 48:1005-1013.
Zolman JF. (1993). Biostatistics: Experimental Design and Statistical Inference. New York,
Oxford University Press.
Appendix. mcpi, fdri, and help files
mcpi.ado
If you do not have access to the mcpi.ado file, then you can create it yourself. Cut-and-paste
the following into the Stata do-file editor.
* file: mcpi.ado
* p-value adjusted multiple-comparison procedures, immediate
* author: Greg Stoddard   updated: 23Mar2011
* compute adjusted p values using several p-value based
* multiple-comparison procedures, providing the p value list on
* the same command line
* syntax:  mcpi <p-value list>, where "p-value list" is a list
*          of p values to be adjusted
capture program drop mcpi
program define mcpi
    version 10
    preserve
    clear
    quietly set obs 0
    tempvar pval
    gen `pval'=.
    local arg=1
    local stop=0
    while (`"`arg'"'~="" & `"`arg'"'~="," & `stop'==0 ) {
        quietly set obs `arg'
        quietly capture replace `pval'=``arg'' in l
        if `pval'[_N]==. {      // stop when the argument list is exhausted
            local stop=1
            quietly drop in l
        }
        local arg=`arg'+1
    }
    tempvar f_p tch_p hs_p h_p ho_p s_p b_p hom_p c origorder sortorder
    quietly gen `origorder'=_n
    local K=_N /* K comparisons, where K = # p-values entered */
    sort `pval'
    quietly gen `sortorder'=_n
    quietly gen `f_p' = 1-(1-`pval')^(`K'/`sortorder')       // Finner adjusted p
    quietly gen `tch_p' = 1-(1-`pval')^sqrt(`K')             // Tukey-Ciminera-Heyse adjusted p
    quietly gen `hs_p' = 1-(1-`pval')^(`K'-(`sortorder'-1))  // Holm-Sidak adjusted p
    quietly gen `h_p' = (`K'-(`sortorder'-1))*`pval'         // Holm adjusted p
    quietly gen `ho_p' = (`K'-(`sortorder'-1))*`pval'        // Hochberg adjusted p
    quietly gen `s_p' = 1-(1-`pval')^`K'                     // Sidak adjusted p
    quietly gen `b_p' = `K'*`pval'                           // Bonferroni adjusted p
    * -- begin Hommel
    * using algorithm in appendix of Wright (1992) for computing Hommel's
    * procedure on sorted p values
    quietly gen `hom_p' = `pval'
    quietly gen `c'=.
    forvalues m=`K'(-1)2 {
        local km = `K'-`m'
        local km1 = `km'+1
        forvalues i=`km1'(1)`K' {
            quietly replace `c' = (`m'*`pval')/(`m'+`i'-`K') if `i'==_n
        }
        quietly sum `c' if _n >= `km1'
        local cmin = r(min)
        quietly replace `hom_p' = `cmin' if `hom_p'<`cmin' & _n >=`km1'
        forvalues i=1(1)`km' {
            quietly replace `c' = min(`cmin',`m'*`pval') if `i'==_n
            quietly replace `hom_p'=`c' if `hom_p' < `c' & `i'==_n
        }
    }
    * -- end Hommel
    display as text _newline "SORTED ORDER: before anomaly corrected"
    display as text "Unadj ---------------------- Adjusted ---------------------------"
    display as text "P Val   TCH    Homml  Finnr  Hochb  Ho-Si  Holm   Sidak  Bonfr"
    forvalues i=1(1)`K' {
        display as result %6.4f `pval'[`i'] %7.3f `tch_p'[`i'] _continue
        display as result %7.3f `hom_p'[`i'] %7.3f `f_p'[`i'] _continue
        display as result %7.3f `ho_p'[`i'] %7.3f `hs_p'[`i'] _continue
        display as result %7.3f `h_p'[`i'] %7.3f `s_p'[`i'] _continue
        display as result %7.3f `b_p'[`i']
    }
    quietly replace `h_p' = 1 if(`h_p' > 1)   // If Holm P > 1 (undefined) then set to 1
    quietly replace `b_p' = 1 if(`b_p' > 1)   // If Bonferroni P > 1 (undefined) then set to 1
    quietly replace `f_p' = 1 if(`f_p' > 1)   // If Finner P > 1 (undefined) then set to 1
    quietly replace `f_p' = `f_p'[_n-1] ///
        if (_n>1 & `f_p'[_n] < `f_p'[_n-1])   // set to preceding p if smaller
    quietly replace `hs_p' = `hs_p'[_n-1] ///
        if (_n>1 & `hs_p'[_n] < `hs_p'[_n-1])   // set to preceding p if smaller
    quietly replace `h_p' = `h_p'[_n-1] ///
        if (_n>1 & `h_p'[_n] < `h_p'[_n-1])   // set to preceding p if smaller
    * for Hochberg, set to preceding p if larger, working backwards
    forvalues i=`K'(-1)1 {
        quietly replace `ho_p' = `ho_p'[`i'+1] ///
            if (`i'<`K' & `ho_p'[`i'] > `ho_p'[`i'+1]) & `i'==_n
    }
    display as text _newline "SORTED ORDER: anomaly corrected"
    display as text " (1) If Finner or Holm or Bonfer P > 1 (undefined) then set to 1"
    display as text " (2) If Finner or Hol-Sid or Holm P < preceding smaller P"
    display as text "     (illogical) then set to preceding P"
    display as text " (3) Working from largest to smallest, if Hochberg preceding"
    display as text "     smaller P > P then set preceding smaller P to P"
    display as text "Unadj ---------------------- Adjusted ---------------------------"
    display as text "P Val   TCH    Homml  Finnr  Hochb  Ho-Si  Holm   Sidak  Bonfr"
    forvalues i=1(1)`K' {
        display as result %6.4f `pval'[`i'] %7.3f `tch_p'[`i'] _continue
        display as result %7.3f `hom_p'[`i'] %7.3f `f_p'[`i'] _continue
        display as result %7.3f `ho_p'[`i'] %7.3f `hs_p'[`i'] _continue
        display as result %7.3f `h_p'[`i'] %7.3f `s_p'[`i'] _continue
        display as result %7.3f `b_p'[`i']
    }
    sort `origorder' /* restore original input order */
    display as text _newline "ORIGINAL ORDER: anomaly corrected"
    display as text "Unadj ---------------------- Adjusted ---------------------------"
    display as text "P Val   TCH    Homml  Finnr  Hochb  Ho-Si  Holm   Sidak  Bonfr"
    forvalues i=1(1)`K' {
        display as result %6.4f `pval'[`i'] %7.3f `tch_p'[`i'] _continue
        display as result %7.3f `hom_p'[`i'] %7.3f `f_p'[`i'] _continue
        display as result %7.3f `ho_p'[`i'] %7.3f `hs_p'[`i'] _continue
        display as result %7.3f `h_p'[`i'] %7.3f `s_p'[`i'] _continue
        display as result %7.3f `b_p'[`i']
    }
    display as text "-----------------------------------------------------------------"
    display as text "*Adjusted for " _N " multiple comparisons" _newline
    display as text "KEY: TCH   = Tukey-Ciminera-Heyse procedure"
    display as text "             (use TCH only with highly correlated comparisons)"
    display as text "     Homml = Hommel procedure"
    display as text "     Finnr = Finner procedure"
    display as text "     Hochb = Hochberg procedure"
    display as text "     Ho-Si = Holm-Sidak procedure"
    display as text "     Holm  = Holm procedure"
    display as text "     Sidak = Sidak procedure"
    display as text "     Bonfr = Bonferroni procedure"
    restore
end
exit
Next, from inside the do-file editor, on the menu bar, click on,
File
Save As…
Save in: < see footnote >
File name: mcpi
Save as type: Ado files (*.ado)
Save
-------
The goal for “Save in” is to save the file mcpi.ado in the directory C:\ado\personal.
In Windows XP, the C: is listed as “Preload (C:)”; then you click on or create the
“ado” subdirectory, then you click on or create the “personal” subdirectory.
mcpi.sthlp
If you do not have access to the mcpi.sthlp file, which is the help file for mcpi, then you can
create it yourself. Cut-and-paste the following into the Stata do-file editor.
.help for ^mcpi^
.-
(Greg Stoddard)

Syntax for ^mcpi^
----------------------------------------------------------------------
^mcpi^ pvaluelist , where pvaluelist is a list of the
                    p values to be adjusted, separated
                    by spaces

Description
-----------
^mcpi^ computes several p value adjustment multiple comparison procedures
and outputs three tables:
    1) sorted multiple comparison adjusted p values after applying
       the procedure equation
    2) sorted multiple comparison adjusted p values after correcting
       the values in the first table for anomalies (such as adjusted
       p > 1)
    3) the final table, with the adjusted p values in the original
       sort order (this is the only table actually needed--the first
       two tables are useful only for verifying the calculations)

All procedures adjust for the number of comparisons equal to the number
of p values in pvaluelist.

Multiple Comparison Procedures Used
-----------------------------------
    TCH   = Tukey-Ciminera-Heyse procedure (Sankoh, 1997)
            (TCH assumes highly correlated comparisons)
    Homml = Hommel procedure (Wright, 1992)
    Finnr = Finner procedure (Finner, 1993)
    Hochb = Hochberg procedure (Wright, 1992)
    Ho-Si = Holm-Sidak procedure (Ludbrook, 1998)
    Holm  = Holm procedure (Ludbrook, 1998)
    Sidak = Sidak procedure (Ludbrook, 1998)
    Bonfr = Bonferroni procedure (Ludbrook, 1998)

Suggested References

Finner H. On a monotonicity problem in step-down multiple test procedures.
    Journal of the American Statistical Association 1993;88:920-923.
Ludbrook J. Multiple comparison procedures updated. Clinical and
    Experimental Pharmacology and Physiology 1998;25:1032-1037.
Sankoh AJ, Huque MF, Dubey SD. Some comments on frequently used multiple
    endpoint adjustment methods in clinical trials. Statistics in Medicine
    1997;16:2529-2542.
Wright SP. Adjusted p-values for simultaneous inference. Biometrics
    1992;48:1005-1013.

Examples
--------
    . mcpi .013 .023 .045 .150

Author
------
Greg Stoddard, University of Utah School of Medicine, Salt Lake City, Utah
USA
Email: ^greg.stoddard@@hsc.utah.edu^
Next, from inside the do-file editor, on the menu bar, click on,
File
Save As…
Save in: < see footnote 1 >
File name: mcpi
Save as type: Help files (*.sthlp) < see footnote 2 >
Save
-------
1) The goal for “Save in” is to save the file mcpi.sthlp in the directory C:\ado\personal.
   In Windows XP, the C: is listed as “Preload (C:)”; then you click on or create the
   “ado” subdirectory, then you click on or create the “personal” subdirectory.
2) If you have Stata version 9, then use the “hlp” file extension.
fdri.ado
If you do not have access to the fdri.ado file, then you can create it yourself. Cut-and-paste
the following into the Stata do-file editor.
* file: fdri.ado
* false discovery rate (FDR) multiple-comparison procedure, immediate
* author: Greg Stoddard   updated: 23Mar2011
* compute adjusted p values using the Benjamini-Hochberg procedure
* syntax:  fdri <p-value list>, where "p-value list" is a list of
*          p values to be adjusted
capture program drop fdri
program define fdri
    version 10
    preserve
    clear
    quietly set obs 0
    tempvar pval
    gen `pval'=.
    local arg=1
    local stop=0
    while (`"`arg'"'~="" & `"`arg'"'~="," & `stop'==0 ) {
        quietly set obs `arg'
        quietly capture replace `pval'=``arg'' in l
        if `pval'[_N]==. {      // stop when the argument list is exhausted
            local stop=1
            quietly drop in l
        }
        local arg=`arg'+1
    }
    tempvar bh_p origorder sortorder
    quietly gen `origorder'=_n
    local K=_N /* K comparisons, where K = # p-values entered */
    sort `pval'
    quietly gen `sortorder'=_n
    quietly gen `bh_p' = (`K'/`sortorder')*`pval'   // Benjamini-Hochberg adjusted p
    display as text _newline "P Value Adjustment for Controlling False Discovery Rate"
    display as text _newline "SORTED ORDER: before anomaly corrected"
    display as text "Unadj   Adjusted"
    display as text "P Val   BenHoc"
    forvalues i=1(1)`K' {
        display as result %6.4f `pval'[`i'] %7.3f `bh_p'[`i']
    }
    * for Hochberg-style correction, set to preceding p if larger, working backwards
    forvalues i=`K'(-1)1 {
        quietly replace `bh_p' = `bh_p'[`i'+1] ///
            if (`i'<`K' & `bh_p'[`i'] > `bh_p'[`i'+1]) & `i'==_n
    }
    display as text _newline "SORTED ORDER: anomaly corrected"
    display as text " Working from largest to smallest, if Benjamini-Hochberg"
    display as text " preceding smaller P > P then set preceding smaller P to P"
    display as text "Unadj   Adjusted"
    display as text "P Val   BenHoc"
    forvalues i=1(1)`K' {
        display as result %6.4f `pval'[`i'] %7.3f `bh_p'[`i']
    }
    sort `origorder' /* restore original input order */
    display as text _newline "ORIGINAL ORDER: anomaly corrected"
    display as text "Unadj   Adjusted"
    display as text "P Val   BenHoc"
    forvalues i=1(1)`K' {
        display as result %6.4f `pval'[`i'] %7.3f `bh_p'[`i']
    }
    display as text "-----------------------------------------"
    display as text "*Adjusted for " _N " multiple comparisons" _newline
    display as text "KEY: BenHoc = Benjamini-Hochberg procedure"
    restore
end
exit
Next, from inside the do-file editor, on the menu bar, click on,
File
Save As…
Save in: < see footnote >
File name: fdri
Save as type: Ado files (*.ado)
Save
-------
The goal for “Save in” is to save the file fdri.ado in the directory C:\ado\personal.
In Windows XP, the C: is listed as “Preload (C:)”; then you click on or create the
“ado” subdirectory, then you click on or create the “personal” subdirectory.
fdri.sthlp
If you do not have access to the fdri.sthlp file, which is the help file for fdri, then you can create
it yourself. Cut-and-paste the following into the Stata do-file editor.
.help for ^fdri^
.-
(Greg Stoddard)

Syntax for ^fdri^
----------------------------------------------------------------------
^fdri^ pvaluelist , where pvaluelist is a list of the
                    p values to be adjusted, separated
                    by spaces

Description
-----------
^fdri^ computes the Benjamini-Hochberg p value adjustment multiple
comparison procedure to control the False Discovery Rate (FDR) and
outputs three tables:
    1) sorted multiple comparison adjusted p values after applying
       the procedure equation
    2) sorted multiple comparison adjusted p values after correcting
       the values in the first table for anomalies (p values switching
       their order of magnitude after adjustment)
    3) the final table, with the adjusted p values in the original
       sort order (this is the only table actually needed--the first
       two tables are useful only for verifying the calculations)

The procedure adjusts for the number of comparisons equal to the number
of p values in pvaluelist.

Multiple Comparison Procedures Used
-----------------------------------
    BenHoc = Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995)

Suggested Reference

Benjamini Y, Hochberg Y. Controlling the false discovery rate:
    a practical and powerful approach to multiple testing. J R
    Statist Soc, Series B (Methodological) 1995;57(1):289-300.

Example
-------
    . fdri .013 .023 .045 .150

Author
------
Greg Stoddard, University of Utah School of Medicine, Salt Lake City, Utah
USA
Email: ^greg.stoddard@@hsc.utah.edu^
Next, from inside the do-file editor, on the menu bar, click on,
File
Save As…
Save in: < see footnote 1 >
File name: fdri
Save as type: Help files (*.sthlp) < see footnote 2 >
Save
-------
1) The goal for “Save in” is to save the file fdri.sthlp in the directory C:\ado\personal.
   In Windows XP, the C: is listed as “Preload (C:)”; then you click on or create the
   “ado” subdirectory, then you click on or create the “personal” subdirectory.
2) If you have Stata version 9, then use the “hlp” file extension.