Chapter 2-8. Multiplicity and Comparisons of 3 or More Independent Groups Multiplicity Whether our dependent variable is dichotomous, nominal, ordinal, or interval, if we have more than two groups in our independent variable, then we have the potential for the statistical problem referred to as the multiple-comparison problem, or synonymously, multiplicity. Table 4-1. Extending the two-sample comparison hypothesis to more than two samples Dependent Variable dichotomous nominal (e.g., 3 categories) ordinal interval 2-sample comparison* H0: no association H0: 1 = 2 H0: no association H0: 11 = 12 21 = 22 31 = 32 H0: no association H0: median1 = median2 H0: no association H0: 1 = 2 K-sample comparison (global hypothesis)** H0: no association H0: 1 = 2 = … = k H0: no association H0: 11 = 12 = … = 1k 21 = 22 = … = 2k 31 = 32 = … = 3k H0: no association H0: median1 = median2 = … = mediank H0: no association H0: 1 = 2 = … = k *where denotes the population proportion (reserving p to denote the sample proportion) denotes the population mean (reserving X to denote the sample mean) **The global hypothesis is the simultaneous equality of all averages, rather than the equality of specific pairs of averages. In the two-sample comparison problem, we used a statistical test to give a single p value to test the hypothesis of no association between the grouping variable and the dependent variable. This is identically a hypothesis of no difference in averages between the two groups. For the interval dependent variable, two-sample case, it is clear that the hypothesis can be tested with a single t test. For a three-sample case, the null hypothesis would be false if 1 2 or 1 3 or 2 3 . We could test this hypothesis using three tests, such as t tests, using one test for each of the three mean comparisons. If any of the three tests are significant, then the overall hypothesis of no difference among the three groups would be rejected. If we do this, however, with each p value being compared to alpha = 0.05, it turns out that the overall hypothesis of no difference among the three means is actually being tested at an alpha > 0.05. _____________________ ource: Stoddard GJ. Biostatistics and Epidemiology Using Stata: A Course Manual [unpublished manuscript] University of Utah School of Medicine, 2011. http://www.ccts.utah.edu/biostats/?pageId=5385 Chapter 2-8 (revision 8 May 2011) p. 1 Statisticians use the term family-wise alpha to when referring to a group of comparisons. This inflated Type I error (rejecting H0 more often than we should) is called the multiplecomparison problem. An intuitive analogy is flipping a coin. If you flip the coin once, the probability of it coming up heads is 1/2. If you continue flipping the coin, the probability that it comes up heads at least one time approaches 1.0 (a sure thing). The hypothesis H0: 1 = 2 = … = k being rejected if any of the possible pairwise t tests is statistically significant is analogously inflated. A classic example of this situation is a study comparing three doses of a drug against a placebo. If low dose has greater effectiveness than placebo, or moderate dose has greater effectiveness than placebo, or high dose has greater effectiveness than placebo, then you intend to conclude that the drug is effective. In this situation, you give yourself 3 changes to get significance (any of which will lead to the same conclusion of drug effectiveness relative to placebo). We can discover how much inflation occurs using a Monte Carlo simulation. 
Chapter 2-8 (revision 8 May 2011) p. 2 Here is the simulation problem: Suppose that we are interested in the absorption profile resulting from the administration of ointment containing 20, 30, and 40 mg of progesterone to the nasal mucosa of women. Women are to be randomized into these three dose groups, giving a parallel groups design (rather than a crossover design). We want to be able to detect the following profile, expecting to see differences in absorption of this magnitude or greater: Absorption of Serum Progesterone Peak Value (nmol/l) Dosagemean std.dev 20mg 20 10 30mg 25 10 40mg 30 10 A power analysis, for an alpha=0.05, two-sided t-test with equal variances, tells us that we need N=64 in each group to have 80% power to detect this difference between groups 20mg and 30mg, similarly between 30mg and 40mg (both have a mean difference of 5). An N=64 gives us 100% power to detect this difference (mean difference of 10) between 20mg and 40mg. Simulation: Draw a random sample of size N=3x64=192 (N=64 in each of three groups) from the same normally distributed population with mean 25 and standard deviation 10. The null hypothesis is: H0: 1 = 2 = 3 which is actually correct because all three groups have a population mean of 25. Compare the three groups using t tests. Repeat this process of sampling and making three pairwise comparisons with t tests 10,000 times. Tally the number of significant results (number of times p < 0.05). The result is: Grp 1 vs Grp 1 vs Grp 2 vs Times at chance 2 significant 488 out of 10000 samples (4.88%) 3 significant 504 out of 10000 samples (5.04%) 3 significant 524 out of 10000 samples (5.24%) least one of the three comparisons significant by was 1244 out of 10000 samples (12.44%) Chapter 2-8 (revision 8 May 2011) p. 3 For the curious, the Stata code for this simulation is: clear set obs 192 gen peakval=. 
gen group=1 in 1/64 replace group=2 in 65/128 replace group=3 in 129/192 set seed 999 scalar sumsig12 = 0 // sig12 counts times grp 1 vs 2 has p<0.05 scalar sumsig13 = 0 // sig13 counts times grp 1 vs 3 has p<0.05 scalar sumsig23 = 0 // sig23 counts times grp 2 vs 3 has p<0.05 scalar atleastone = 0 // atleastone counts times at least one // significant for same sample scalar n_times = 0 // number of times a sample was drawn forvalues i=1(1)10000{ scalar sig12=0 // initialize to false scalar sig13=0 scalar sig23=0 quietly replace peakval=25+invnorm(uniform())*10 // Normal (mean=25, SD=10) quietly ttest peakval if group==1 | group==2 , by(group) if r(p)<0.05 { scalar sig12=1 } // if significant, change to true quietly ttest peakval if group==1 | group==3 , by(group) if r(p)<0.05 { scalar sig13=1 } quietly ttest peakval if group==2 | group==3 , by(group) if r(p)<0.05 { scalar sig23=1 } if sig12==1 { scalar sumsig12=sumsig12+1 } // if significant, add 1 to counter if sig13==1 { scalar sumsig13=sumsig13+1 } if sig23==1 { scalar sumsig23=sumsig23+1 } if (sig12==1 | sig13==1 | sig23==1) { scalar atleastone=atleastone+1 } scalar n_times=n_times+1 // increment the samples drawn counter } display "Grp 1 vs 2 significant " sumsig12 " out of " n_times /* */ " samples (" sumsig12/n_times*100 "%)" display "Grp 1 vs 3 significant " sumsig13 " out of " n_times /* */ " samples (" sumsig13/n_times*100 "%)" display "Grp 2 vs 3 significant " sumsig23 " out of " n_times /* */ " samples (" sumsig23/n_times*100 "%)" display "Times at least one of the three comparisons " /* */ "significant by" display " chance was " atleastone " out of " n_times /* */ " samples (" atleastone/n_times*100 "%)" Chapter 2-8 (revision 8 May 2011) p. 4 Redisplaying the simulation results, Grp 1 vs Grp 1 vs Grp 2 vs Times at chance 2 significant 488 out of 10000 samples (4.88%) 3 significant 504 out of 10000 samples (5.04%) 3 significant 524 out of 10000 samples (5.24%) least one of the three comparisons significant by was 1244 out of 10000 samples (12.44%) If we set our alpha (probability of Type I error) at 0.05, then we should get 5% significant differences in this simulation, if we just consider one pair (either 20mg vs 30mg, or 20mg vs 40mg, or 30mg vs 40mg). As it should be, that this is what happened. If the research hypothesis is: “HA: dosage is related to amount absorbed” and we intend to conclude our research hypothesis is demonstrated if any of the three pairwise comparisons come out significant, then we see that our probability of making a Type I Error is inflated to 12.44%. This multiple comparison problem can be more generally described as multiplicity. The problem of “multiplicity” is that in any substantial clinical trial, or in any observation study, it is all too easy to think up a whole multiplicity of hypotheses, each one geared to exploring different aspects of response to treatment. (Pocock, 1983, p. 228) The multiplicity problem has five main aspects (Pocock, 1983, p. 228): (1) Multiple treatments Some trials have more than two treatments. The number of possible treatment comparisons increases rapidly with the number of treatments. (2) Multiple end-points There may be many different ways of evaluating how each patient responds to treatment. It is possible to make a separate treatment comparison for each end-point. (3) Repeated measurements In some trials one can monitor each patient’s progress by recording his disease state at several fixed time points after start of treatment. 
One could then produce a separate analysis for each time point. (4) Subgroup analyses One may record prognostic information about each patient prior to treatment. Patients may then be classified into prognostic subgroups and each subgroup analyzed separately. (5) Interim analyses In most trials there is a gradual accumulation of data as more and more patients are evaluated. One may undertake repeated interim analyses of the accumulating data while the trial is in progress. Chapter 2-8 (revision 8 May 2011) p. 5 The problem arises in that the more significant test one performs, the more likely at least one test will be significant by chance alone (sampling variability). If the comparisons are independent, the probability can be determined as follows: For one comparison, P(significant by chance) = alpha P(not significant by chance) = 1 – alpha For two comparisons, P(at least one significant by chance) = 1 - P(neither is significant) = 1 - (1 – alpha)(1 – alpha) For k comparions, P(at least one significant by chance) = 1 – (1-alpha)k For alpha=0.05, the formula produces the following probabilities: k 1 2 3 4 5 Prob .050 .098 .143 .185 .226 Thus, multiplicity increases the risk of committing a false-positive error, or Type I error (concluding a significant effect when it does not exist in the sampled population). The formula P(at least one significant by chance) = 1 – (1-alpha)k assumes that the k comparison are “independent” of each other. This independence approximately holds when comparing several study groups on the same outcome. Some lack of independence is introduced by using each group more than once (1 vs 2)(1 vs 3)(2 vs 3), as pointed out by Ludbrook (1998), which explains why our simultion resulted in 0.124, instead of 0.143 as predicted by the formula. If we set up the simulation so that 6 groups are sampled 10,000 times, and then only use each group once (1 vs 2)(3 vs 4)(5 vs 6), the comparisons will be independent. The results of such a simulation are: Grp 1 vs Grp 3 vs Grp 5 vs Times at chance 2 significant 483 out of 10000 samples (4.83%) 4 significant 518 out of 10000 samples (5.18%) 6 significant 484 out of 10000 samples (4.84%) least one of the three comparisons significant by was 1413 out of 10000 samples (14.13%) which is very close to the formula (simulation: 14.1% , formula: 14.3%). Chapter 2-8 (revision 8 May 2011) p. 
6 For the curious, the Stata code for this simulation was: clear set obs 384 // n=64 x 6 set seed 999 capture drop peakval group gen peakval =25+invnorm(uniform())*10 // Normal (mean=25, SD=10) quietly gen group = 1 quietly replace group = 2 in 65/128 quietly replace group = 3 in 129/192 quietly replace group = 4 in 193/256 quietly replace group = 5 in 257/320 quietly replace group = 6 in 321/384 tab group * scalar sumsig12 = 0 // sig12 counts times grp 1 vs 2 has p<0.05 scalar sumsig34 = 0 // sig34 counts times grp 3 vs 4 has p<0.05 scalar sumsig56 = 0 // sig56 counts times grp 5 vs 6 has p<0.05 scalar atleastone = 0 // times at least one significant for same sample scalar n_times = 0 // n_times is number of times a sample was drawn forvalues i=1(1)10000{ scalar sig12=0 // initialize to false scalar sig34=0 scalar sig56=0 quietly replace peakval=25+invnorm(uniform())*10 // Normal (mean=25, SD=10) quietly ttest peakval if group==1 | group==2 , by(group) if r(p)<0.05 { scalar sig12=1 scalar sumsig12=sumsig12+1 } // if significant, change to true quietly ttest peakval if group==3 | group==4 , by(group) if r(p)<0.05 { scalar sig34=1 scalar sumsig34=sumsig34+1 } quietly ttest peakval if group==5 | group==6 , by(group) if r(p)<0.05 { scalar sig56=1 scalar sumsig56=sumsig56+1 } if (sig12==1 | sig34==1 | sig56==1) { scalar atleastone=atleastone+1 } scalar n_times=n_times+1 // increment the samples drawn counter } display "Grp 1 vs 2 significant " sumsig12 " out of " n_times /* */ " samples (" sumsig12/n_times*100 "%)" display "Grp 3 vs 4 significant " sumsig34 " out of " n_times /* */ " samples (" sumsig34/n_times*100 "%)" display "Grp 5 vs 6 significant " sumsig56 " out of " n_times /* */ " samples (" sumsig56/n_times*100 "%)" display "Times at least one of the three comparisons " /* */ "significant by" display " chance was " atleastone " out of " n_times /* */ " samples (" atleastone/n_times*100 "%)" Chapter 2-8 (revision 8 May 2011) p. 7 For interim analysis problems, however, the interim tests are clearly not independent, since the data from the earlier test is included in each later test. The following table (third column) shows how the probability of at least one significant test changes with additional looks at the data (Pocock, 1983, p. 148): k 1 2 3 4 5 Independent Tests Probability .050 .098 .143 .185 .226 Interim Tests Probability 0.05 0.08 0.11 0.13 0.14 which we can see inflates much less rapidly than the formula used above for the independent comparison case (2nd column). Exercise Read section 5.6 Adjustment of Significance and Confidence Levels in the guidance document E9 Statistical Principles for Clinical Trials. How Frequently Are Multiple Comparison Procedures Used Horton and Switzer (2006) surveyed what statistical methods are used in research articles published in NEJM. They found that 23% of research articles published in 2004-2005 reported using a multiple comparison procedure. P Value Based Multiple-Comparison Procedures There are many multiple-comparison procedures designed for interval scale data and independent groups. The statistical package SPSS has about 20 of these. However, they don’t apply to ordinal or nominal scaled data, nor to paired samples. Fortunately, there are a number of multiple-comparison procedures that simply adjust the p value, and thus it makes no difference which test was used to produce the p value. So, these procedures work for all comparisons, regardless of the level of measurement, or whether it is an independent sample or related sample case. 
Below are several such procedures. In these formulas, alpha is replaced with 0.05, which is almost always what alpha is set to be. Chapter 2-8 (revision 8 May 2011) p. 8 Bonferroni procedure (Ludbrook, 1998): p = unadjusted P value (P value from test statistic, not yet adjusted for multiplicity) adjusted p = kp , where k=number of comparisons For k=3 comparisons, this amounts to comparing each p to adjusted alpha= α/k = 0.05/3=0.0167 which is identical to multiplying each p vaue by 3 and then comparing to alpha=0.05. If an adjusted p value is greater than 1, set the adjusted p value to 1 (since p>1 is undefined). This is the most conservative p value adjustment procedure of all (known to be needlessly conservative). ____________________________________________________________________________ Note on Bonferroni Procedure: algebraic identity of adjusted p value formula and adjusted alpha formula (if you are curious) In most introductory statistics textbooks, the Bonferroni procedure is presented simply as, adjusted alpha = α/k , where k = number of comparisons. This is not very helpful, since it requires informing the reader what the adjusted alpha is that the reader is supposed to compare the p value against. The reader really appreciates just having an adjusted p value, instead, so only one alpha is needed in your article, the alpha almost always being 0.05. Coming up with the adjusted p value is simple enough. We want p ≤ adjusted α, as our adjusted rule for statistical significance. Solving this inequality, p ≤ adjusted α p ≤ α/k , where k = number of comparisons, and α = 0.05, the original alpha kp ≤ α so, adjusted p = kp is now compared against the original, or nominal, alpha. ____________________________________________________________________________ Chapter 2-8 (revision 8 May 2011) p. 9 Holm procedure (Sankoh et al., 1997): p = unadjusted P value arranged in sort order (smallest to largest) adjusted p = (k-i)p , where k=number of comparisons, and i=0,1,…,(k-1) For 3 comparisons, this amounts to comparing smallest p to adjusted alpha=0.05/3=0.0167 middle p to adjusted alpha=0.05/2=0.025 largest p to adjusted alpha=0.05/1=0.05 which is identical to multiplying each p value as follows: 3 × smallest p , 2 × middle p , 1 × largest p If an adjusted p value is greater than 1, set the adjusted p value to 1 (since p>1 is undefined). Also, with the p values in ascending sort order, if an adjusted p value is smaller than the previous adjusted p value (which is illogical), then set the adjusted p value to the value of the previous adjusted p value. It is obvious that this procedure is a big improvement over the Bonferroni procedure, since every p value but the smallest is compared against a larger alpha. Šidák procedure (Ludbrook, 1998): p = unadjusted P value adjusted p = 1-(1-p)k , where k=number of comparisons For 3 comparisons, this amounts to comparing each p to adjusted alpha=1-(1-0.05)1/3 = 0.01695 This procedure provides a trivial improvement over Bonferroni, since we get to compare our p values against a trivially larger alpha. 
Holm-Šidák (Ludbrook, 1998): p = unadjusted P value adjusted p = 1-(1-p)k-i , where k=number of comparisons, and i=0,1,…,(k-1) sort order For 3 comparisons, this amounts to comparing smallest p to adjusted alpha=1-(1-0.05)1/3 = 0.01695 middle p to adjusted alpha=1-(1-0.05)1/2 = 0.0253 largest p to adjusted alpha=1-(1-0.05)1/1 = 0.05 With the p values in ascending sort order, if an adjusted p value is smaller than the previous adjusted p value (which is illogical), then set the adjusted p value to the value of the previous adjusted p value. This procedure provides a trivial improvement over the Holm procedure, since we get to compare our p values against a trivially larger alpha. Chapter 2-8 (revision 8 May 2011) p. 10 Hochberg procedure (Wright, 1992): p = unadjusted P value arranged in sort order (smallest to largest) adjusted p = (k-i)p , where k=number of comparisons, and i=0,1,…,(k-1) For 3 comparisons, this amounts to comparing smallest p to adjusted alpha=0.05/3=0.0167 middle p to adjusted alpha=0.05/2=0.025 largest p to adjusted alpha=0.05/1=0.05 which looks just like the Holm procedure at this stage. It gains its advantage over the Holm procedure in the way the following adjustment is made. Adjustments of anomalies are opposite of the Holm’s procedure. With the p values in ascending sort order, if an adjusted p value is smaller than the previous adjusted p value (which is illogical), then set the previous adjusted p value to the value of the adjusted p value. With this approach, no adjusted p value can be larger than the largest unadjusted p value. This procedure is more powerful than the Holm procedure (Ludbrook, 1998). Chapter 2-8 (revision 8 May 2011) p. 11 Finner’s procedure (Finner, 1993): p = unadjusted P value adjusted p = 1-(1-p)k/i , where k=number of comparisons, and i=1,…,k sort order For k=3 comparisons, this amounts to comparing smallest p to adjusted alpha=1-(1-0.05)1/3 = 0.0170 middle p to adjusted alpha=1-(1-0.05)2/3 = 0.0336 largest p to adjusted alpha=1-(1-0.05)3/3 = 0.05 If an adjusted p value is greater than 1, set the adjusted p value to 1 (since p>1 is undefined). Also, with the p values in ascending sort order, if an adjusted p value is smaller than the previous adjusted p value (which is illogical), then set the adjusted p value to the value of the previous adjusted p value. This procedure is an improvement over all the above procedures since we get to compare our p values against a larger alpha. However, by working backwards to correct illogically ordered adjusted p values, the Hochberg procedure will win out over Finner’s procedure in many cases when the largest unadjusted p value is 0.05 or just under it. ____________________________________________________________________________ Note on Finner’s Procedure: algebraic identity of adjusted p value formula and adjusted alpha formula (if you are curious) Starting with Finner (1993, p.922, Corollary 3.1), αi = 1 – (1 – α)i/k , i = 1,…,k in ascending sort order from smallest p to largest p We want, pi ≤ αi , as our adjusted rule for statistical significance. Dropping the subscripts, which are now assumed, and solving p≤α p ≤ 1 – (1 – α)i/k p – 1 ≤ -(1 - α)i/k 1 – p ≥ (1 - α)i/k (1 – p) k/i ≥ (1 - α) (1 – p) k/i – 1 ≥ - α 1 – (1– p) k/i ≤ α This adjusted p value formula, and the above stated correction for anomolies following applying the formula, can be found in Adramson and Gahlinger (2001, p.13). ____________________________________________________________________________ Chapter 2-8 (revision 8 May 2011) p. 
12 Hommel’s procedure (Wright, 1992): p = unadjusted P value arranged in sort order (smallest to largest). Let j be the number of comparisons in the largest subset of comparisons for which p > i(0.05/k), which represent the nonsignificant P values in the Simes (1986) procedure, where k=number of comparisons, and i=1,…,k. If there are no nonsignificant Simes tests [all p i(0.05/k)], then all comparisons are significant. Otherwise, the comparison is significant when p 0.05/j. This procedure is more powerful than the Hochberg procedure (Wright, 1992). General Comments All of the above p value adjustment procedures maintain the desired alpha (0.05) for the combined set of k comparisons when the comparisons are independent. The formulas are not self adjusting when the comparisons are correlated (such with repeated measures data), which makes some researchers nervous about using them. We will discuss this below. However, both the Bonferroni and the Holm procedures maintain the desired alpha (0.05) , regardless of independence or dependence among the p-values (Wright, 1992). [Wright bases this claim on the existance of mathematical proofs that show that the Bonferroni and Holm’s procedures maintain the alpha at 0.05, regardless of the correlation structure.] Given that the more powerful Holm procedure shares that feature with the Bonferroni procedure, there is no rational reason to prefer the Bonferroni over the Holm procedure. Researchers using Bonferroni’s procedure are simply doing so because they are uninformed. Three papers presenting mathematic proofs that Holm’s procedure accomplishes the same protection against a Type I error as Bonferroni’s procedures, are: 1) Holm’s original paper (1979): sophisticated and elegant proof 2) Aickin and Geisler (1996): rendition of Holm’s proof, but easier to read 3) Levin (1996): simplified justification of why it works (more like an outline of the proof) Both the Bonferroni and Holm procedures are too conservative if the endpoints are correlated, so you do not get significance as often as you should.( Sankoh et al, 1997) The procedures of Hochberg and Hommel are even more powerful than the Holm procedure; but strictly speaking, they are known to maintain the family-wise alpha only for independent p-values (Wright, 1992, p.1011). However, that is a weak reason to prefer Holm’s over the more powerful procedures, such as the Hochberg or Hommel’s procedures. Simulations by Sankoh et al (1997) have shown the Chapter 2-8 (revision 8 May 2011) p. 13 Hochberg and Hommel’s procedures to maintain the alpha of 0.05 over a wide range of correlation structures in the data, so the lack of a mathematical proof that they maintain alpha in all situations is of little concern. Protocol Suggestion for Holm Procedure If you feel timid about using the Holm procedure over Bonferroni, because you see everyone else using Bonferroni, you could educated your reader by saying something like: The p values for sets of multiple comparisons will be adjusted for multiplicity using Holm’s multiple comparison procedure. Like the Bonferroni procedure, the Holm procedure maintains the desired alpha (0.05) regardless of the correlation structure of the endpoints, while being more powerful than the Bonferroni procedure. (Sankoh et al., 1997) …or, if you are comparing groups on the same variable, you could be more specific: The p values for pairwise group comparisons will be adjusted for multiplicity using Holm’s multiple comparison procedure. 
Like the Bonferroni procedure, the Holm procedure maintains the desired alpha (0.05) regardless of the correlation structure of the outcome variable among the groups, while being more powerful than the Bonferroni procedure. (Sankoh et al., 1997) It is sufficient, however, to just say: The p values for pairwise group comparisons will be adjusted for multiplicity using Holm’s multiple comparison procedure. (Sankoh et al., 1997) Protocol Suggestion for Any of These Procedures For any of the procedures given in this chapter, it is sufficient to say: The p values for pairwise group comparisons will be adjusted for multiplicity using <fill in the blank> multiple comparison procedure. (fill in suggested citation) Example. Cummings et al. (N Engl J Med 2010) performed a randomized controlled trial of lasofoxifene for the treatment of osteoporosis. The study had two active drug groups, 0.25 mg lasofoxifene and 0.5 mg lasofoxifene, both compared to placebo. In their sample size paragraph, they state, “For the primary analyses, each dose of lasofoxifene was compared with placebo, and the Hochberg procedure was used to control for multiple comparisons.8” --------8 Hochberg Y. A sharper Bonferroni procedure for multiple tests of significance. Biometrika 1988;75:800-2. Chapter 2-8 (revision 8 May 2011) p. 14 Response to Reviewer for Use of the Hommel’s Procedure Although the single sentence suggested above is sufficient for a protocol or article, here is some wording you can use to justify Hommel’s procedure if you need to. You might need do this, for example, if a journal reviewer questions the use of the procedure out of his or her own lack of familarity. Here is a cut-and-paste response you can use for such a situation: The p values for pairwise group comparisons were adjusted for multiplicity using Hommel’s multiple comparison procedure (Wright, 1992). A good statistical practice is to a priori choose the most powerful statistical test available for which the assumptions are justified, which is why the Student t-test is commonly selected over its nonparametric alternatives. Dozens of multiple comparison procedures are available to choose from, for both the analysis of variance based approaches and for the more generally applicable Bonferroni-like p value adjustment procedures. A good statistical practice, then, is to a priori choose a specific multiple comparison procedure over less powerful procedures. The Bonferroni-like p value adjustment class of procedures was selected because these procedures have fewer assumptions and apply to any test statistic. The Hommel’s p value adjustment procedure was selected from this class, because it is known to be more powerful than several alternative procedures, including the Bonferroni, Holm’s, and Hochberg’s procedures (Wright, 1992). Furthermore, simulations have shown that Hommel’s procedure maintains alpha at 0.05 for both independent comparisons and even for non-independent comparisons over a wide range of correlation structures in the data (Sankoh et al, 1997). _____ Wright SP. (1992). Adjusted P-Values for Simultaneous Inference. Biometrics 48:1005-1013. Sankoh AJ, Huque MF, Dubey SD. (1997). Some comments on frequently used multiple endpoint adjustment methods in clinical trials. Statistics in Medicine 16:2529-2542. Chapter 2-8 (revision 8 May 2011) p. 15 Example An example of a paper that uses Holm’s procedure is Florez et al (N Engl J Med, 2006). 
In their Statistical Methods section, they state, “Nominal two-sided P values are reported and adjusted for multiple comparisons (three genotypic groups within each trait) with the use of the Holm procedure.19” Explanatory note: Statisticians use the term “nominal significance level” to refer to the original selected alpha (almost universally this is 0.05). Florez’s use of “nominal” is nonstandard and will probably confuse most readers. They are simply saying that they took the original P values from the statistical tests and then adjusted them for multiple comparisons before reporting. They apparently were looking for a way to avoid saying “unadjusted P values” because they were applying the Holm’s adjustment to results from a regression model that adjusted for covariates (so where adjusted P values already, in that sense). Their reference 19 is Holm’s original article (Holm, 1979), which is the most correct citation, but the Sankoh paper, which is recommended above, is far easier to read and therefore more useful to a reader who wants more information. It would have been more clear to say, Two-sided P values are reported after adjusting for multiple comparisons (three genotypic groups within each trait) with the use of the Holm procedure. Chapter 2-8 (revision 8 May 2011) p. 16 The Correlated Endpoints Situation Above, we compared the independent comparison situation to the correlated endpoint situation (illustrated with multiple comparisons performed as interim analyses): k 1 2 3 4 5 Independent Tests Probability .050 .098 .143 .185 .226 Interim Tests Probability 0.05 0.08 0.11 0.13 0.14 The above p value adjustment procedures, which attempt to hold the independent tests alpha at 0.05, will be conservative, and thus not get significance often enough, for the correlated endpoint situation. This is because the procedures are making greater adjustment than is necessary in the correlated endpoint situation. The Tukey-Ciminera-Heyse procedure (1985) was proposed as a p value adjustment procedure specifically for correlated hypotheses (dependent tests, which would include repeated measures data). The example of correlated hypotheses presented in that original paper was the use of several outcome variables used for toxicity testing, where if any one of the outcomes was significant, a conclusion of toxicity was demonstrated. Tukey-Ciminera-Heyse procedure (Sankoh, 1997; Tukey et al, 1985): p = unadjusted P value adjusted p = 1 (1 p) k , where k = the number of comparisons For 3 comparisons, this amounts to comparing each p to adjusted alpha= 1 (1 0.05)1 / 3 = 0.029 This procedure is generally an improvement over many of the other procedures. Sankoh et al (1997) demonstrated with simulation that the Tukey-Ciminera-Heyse procedure maintains alpha appropriately in the situation with the endpoints are highly correlated (r=.90), but rejects too often with lesser correlation (see Sankoh, Tables II and III). Sankoh et al (1997) also demonstrated that the other procedures described above are too conservative in the correlated endpoint situation (Tables II and III, shown for Hochberg and Hommell procedures). Some correction factors are listed in the Sankoh paper to adjust for the correlation structure of the endpoints, so that the alpha is maintained right at 0.05 (not too conservative, not too liberal). Chapter 2-8 (revision 8 May 2011) p. 17 P Value Adjustments in Stata (mcpi command) [Multiple Comparison Procedures Immediate] These p value adjustment procedures are not available in Stata. 
However, I wrote a program to do them, mcpi.ado, which is available in the datasets and do files subdirectory of the electronic course manual. It is also available as an appendix to this chapter, which you can use to create it. Note: Using mcpi.ado In the command window, execute the command sysdir This tells you the directories Stata searches to find commands, or ado files. It will look like: STATA: C:\Program Files\Stata10\ UPDATES: C:\Program Files\Stata10\ado\updates\ BASE: C:\Program Files\Stata10\ado\base\ SITE: C:\Program Files\Stata10\ado\site\ PLUS: c:\ado\plus\ PERSONAL: c:\ado\personal\ OLDPLACE: c:\ado\ I suggest you copy the file mcpi.ado and mcpi.sthlp from the electronic course manual to the c:\ado\personal\ directory. Having done that, mcpi becomes an executable command in your installation of Stata. If the directory c:\ado\personal\ does not exit, then you should create it using Windows Explorer (My Documents icon), and then copy the two files into this directory. These two files are also available in the appendix to this chapter. To get help for mcpi, use help mcpi in the command window. This help file also contains the suggested citations, which is helpful. To execute, use the command mcpi followed by a list of p values you want to adjust, each p value separated by a space. A nice feature of my mcpi command is that the suggested references are cited in the help file. To see this, help mcpi Chapter 2-8 (revision 8 May 2011) p. 18 The following is an example output from the Stata do-file mcpi.ado, using the unadjusted p values found in the illustrative example of Abramson and Gahlinger (2001, pp 12-13). mcpi .702 .045 .003 .0001 .135 .007 .110 .135 .004 .018 SORTED ORDER: before anomaly Unadj ---------------------P Val TCH Homml Finnr 0.0001 0.000 0.001 0.001 0.0030 0.009 0.024 0.015 0.0040 0.013 0.028 0.013 0.0070 0.022 0.049 0.017 0.0180 0.056 0.108 0.036 0.0450 0.135 0.180 0.074 0.1100 0.308 0.220 0.153 0.1350 0.368 0.270 0.166 0.1350 0.368 0.270 0.149 0.7020 0.978 0.702 0.702 corrected Adjusted --------------------------Hochb Ho-Si Holm Sidak Bonfr 0.001 0.001 0.001 0.001 0.001 0.027 0.027 0.027 0.030 0.030 0.032 0.032 0.032 0.039 0.040 0.049 0.048 0.049 0.068 0.070 0.108 0.103 0.108 0.166 0.180 0.225 0.206 0.225 0.369 0.450 0.440 0.373 0.440 0.688 1.100 0.405 0.353 0.405 0.765 1.350 0.270 0.252 0.270 0.765 1.350 0.702 0.702 0.702 1.000 7.020 SORTED ORDER: anomaly corrected (1) If Finner or Holm or Bonfer P > 1 (undefined) then set to 1 (2) If Finner or Hol-Sid or Holm P < preceding smaller P (illogical) then set to preceding P (3) Working from largest to smallest, if Hochberg preceding smaller P > P then set preceding smaller P to P Unadj ---------------------- Adjusted --------------------------P Val TCH Homml Finnr Hochb Ho-Si Holm Sidak Bonfr 0.0001 0.000 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.0030 0.009 0.024 0.015 0.027 0.027 0.027 0.030 0.030 0.0040 0.013 0.028 0.015 0.032 0.032 0.032 0.039 0.040 0.0070 0.022 0.049 0.017 0.049 0.048 0.049 0.068 0.070 0.0180 0.056 0.108 0.036 0.108 0.103 0.108 0.166 0.180 0.0450 0.135 0.180 0.074 0.225 0.206 0.225 0.369 0.450 0.1100 0.308 0.220 0.153 0.270 0.373 0.440 0.688 1.000 0.1350 0.368 0.270 0.166 0.270 0.373 0.440 0.765 1.000 0.1350 0.368 0.270 0.166 0.270 0.373 0.440 0.765 1.000 0.7020 0.978 0.702 0.702 0.702 0.702 0.702 1.000 1.000 ORIGINAL ORDER: anomaly corrected Unadj ---------------------- Adjusted --------------------------P Val TCH Homml Finnr Hochb Ho-Si Holm Sidak Bonfr 0.7020 0.978 0.702 0.702 0.702 0.702 
0.702 1.000 1.000 0.0450 0.135 0.180 0.074 0.225 0.206 0.225 0.369 0.450 0.0030 0.009 0.024 0.015 0.027 0.027 0.027 0.030 0.030 0.0001 0.000 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.1350 0.368 0.270 0.166 0.270 0.373 0.440 0.765 1.000 0.0070 0.022 0.049 0.017 0.049 0.048 0.049 0.068 0.070 0.1100 0.308 0.220 0.153 0.270 0.373 0.440 0.688 1.000 0.1350 0.368 0.270 0.166 0.270 0.373 0.440 0.765 1.000 0.0040 0.013 0.028 0.015 0.032 0.032 0.032 0.039 0.040 0.0180 0.056 0.108 0.036 0.108 0.103 0.108 0.166 0.180 ----------------------------------------------------------------*Adjusted for 10 multiple comparisons KEY: TCH Homml Finnr Hochb Ho-Si Holm Sidak Bonfr = = = = = = = = Tukey-Ciminera-Heyse procedure Hommel procedure Finner procedure Hochberg procedure Holm-Sidak procedure Holm procedure Sidak procedure Bonferroni procedure We notice that Finner’s procedure provides the greatest number of significant adjusted p values for this combination of unadjusted p values. Chapter 2-8 (revision 8 May 2011) p. 19 Finner is certainly not the winner in all cases, as the following example illustrates, where Hommel’s procedure does better. SORTED ORDER: before anomaly Unadj ---------------------P Val TCH Homml Finnr 0.0210 0.036 0.023 0.062 0.0220 0.038 0.023 0.033 0.0230 0.040 0.023 0.023 corrected Adjusted --------------------------Hochb Ho-Si Holm Sidak Bonfr 0.063 0.062 0.063 0.062 0.063 0.044 0.044 0.044 0.065 0.066 0.023 0.023 0.023 0.067 0.069 SORTED ORDER: anomaly corrected (1) If Finner or Holm or Bonfer P > 1 (undefined) then set to 1 (2) If Finner or Hol-Sid or Holm P < preceding smaller P (illogical) then set to preceding P (3) Working from largest to smallest, if Hochberg preceding smaller P > P then set preceding smaller P to P Unadj ---------------------- Adjusted --------------------------P Val TCH Homml Finnr Hochb Ho-Si Holm Sidak Bonfr 0.0210 0.036 0.023 0.062 0.023 0.062 0.063 0.062 0.063 0.0220 0.038 0.023 0.062 0.023 0.062 0.063 0.065 0.066 0.0230 0.040 0.023 0.062 0.023 0.062 0.063 0.067 0.069 ORIGINAL ORDER: anomaly corrected Unadj ---------------------- Adjusted --------------------------P Val TCH Homml Finnr Hochb Ho-Si Holm Sidak Bonfr 0.0210 0.036 0.023 0.062 0.023 0.062 0.063 0.062 0.063 0.0220 0.038 0.023 0.062 0.023 0.062 0.063 0.065 0.066 0.0230 0.040 0.023 0.062 0.023 0.062 0.063 0.067 0.069 ----------------------------------------------------------------*Adjusted for 3 multiple comparisons KEY: TCH Homml Finnr Hochb Ho-Si Holm Sidak Bonfr = = = = = = = = Tukey-Ciminera-Heyse procedure Hommel procedure Finner procedure Hochberg procedure Holm-Sidak procedure Holm procedure Sidak procedure Bonferroni procedure Chapter 2-8 (revision 8 May 2011) p. 20 A very nice feature of Hochberg’s procedure is that if the largest unadjusted p value in 0.05, then all the p value will be significant (no matter how many comparisons are done), since the adjusted p values never exceed the largest unadjusted p value for that procedure. 
SORTED ORDER: before anomaly Unadj ---------------------P Val TCH Homml Finnr 0.0470 0.080 0.049 0.134 0.0480 0.082 0.049 0.071 0.0490 0.083 0.049 0.049 corrected Adjusted --------------------------Hochb Ho-Si Holm Sidak Bonfr 0.141 0.134 0.141 0.134 0.141 0.096 0.094 0.096 0.137 0.144 0.049 0.049 0.049 0.140 0.147 SORTED ORDER: anomaly corrected (1) If Finner or Holm or Bonfer P > 1 (undefined) then set to 1 (2) If Finner or Hol-Sid or Holm P < preceding smaller P (illogical) then set to preceding P (3) Working from largest to smallest, if Hochberg preceding smaller P > P then set preceding smaller P to P Unadj ---------------------- Adjusted --------------------------P Val TCH Homml Finnr Hochb Ho-Si Holm Sidak Bonfr 0.0470 0.080 0.049 0.134 0.049 0.134 0.141 0.134 0.141 0.0480 0.082 0.049 0.134 0.049 0.134 0.141 0.137 0.144 0.0490 0.083 0.049 0.134 0.049 0.134 0.141 0.140 0.147 ORIGINAL ORDER: anomaly corrected Unadj ---------------------- Adjusted --------------------------P Val TCH Homml Finnr Hochb Ho-Si Holm Sidak Bonfr 0.0470 0.080 0.049 0.134 0.049 0.134 0.141 0.134 0.141 0.0480 0.082 0.049 0.134 0.049 0.134 0.141 0.137 0.144 0.0490 0.083 0.049 0.134 0.049 0.134 0.141 0.140 0.147 ----------------------------------------------------------------*Adjusted for 3 multiple comparisons Chapter 2-8 (revision 8 May 2011) p. 21 If the largest value does exceed 0.05, however, than the advantage shown in the previous example is lost. SORTED ORDER: before anomaly Unadj ---------------------P Val TCH Homml Finnr 0.0490 0.083 0.051 0.140 0.0500 0.085 0.051 0.074 0.0510 0.087 0.051 0.051 corrected Adjusted --------------------------Hochb Ho-Si Holm Sidak Bonfr 0.147 0.140 0.147 0.140 0.147 0.100 0.098 0.100 0.143 0.150 0.051 0.051 0.051 0.145 0.153 SORTED ORDER: anomaly corrected (1) If Finner or Holm or Bonfer P > 1 (undefined) then set to 1 (2) If Finner or Hol-Sid or Holm P < preceding smaller P (illogical) then set to preceding P (3) Working from largest to smallest, if Hochberg preceding smaller P > P then set preceding smaller P to P Unadj ---------------------- Adjusted --------------------------P Val TCH Homml Finnr Hochb Ho-Si Holm Sidak Bonfr 0.0490 0.083 0.051 0.140 0.051 0.140 0.147 0.140 0.147 0.0500 0.085 0.051 0.140 0.051 0.140 0.147 0.143 0.150 0.0510 0.087 0.051 0.140 0.051 0.140 0.147 0.145 0.153 ORIGINAL ORDER: anomaly corrected Unadj ---------------------- Adjusted --------------------------P Val TCH Homml Finnr Hochb Ho-Si Holm Sidak Bonfr 0.0490 0.083 0.051 0.140 0.051 0.140 0.147 0.140 0.147 0.0500 0.085 0.051 0.140 0.051 0.140 0.147 0.143 0.150 0.0510 0.087 0.051 0.140 0.051 0.140 0.147 0.145 0.153 ----------------------------------------------------------------*Adjusted for 3 multiple comparisons KEY: TCH Homml Finnr Hochb Ho-Si Holm Sidak Bonfr = = = = = = = = Tukey-Ciminera-Heyse procedure Hommel procedure Finner procedure Hochberg procedure Holm-Sidak procedure Holm procedure Sidak procedure Bonferroni procedure P Value Adjustments in PEPI The PEPI-4.0 ADJUSTP module (Abramson JH and Gahlinger PM., 2001) provides three of these procedures. The results agree with the above mcp.ado output. ADJUSTP - Multiple Significance Tests: Adjusted P Values DATA Total number of tests in the set = 3 No. 1 2 3 Original P 0.051 0.050 0.049 HOLM's adjusted P 0.1470 0.1470 0.1470 Chapter 2-8 (revision 8 May 2011) HOMMEL's adjusted P 0.0510 0.0510 0.0510 FINNER's adjusted P 0.1399 0.1399 0.1399 p. 
22 Approaches to the multiple comparison problem (k sample comparisons) The two most common approaches to the multiple comparison problem are: 1) For the k-sample comparison, we simply generalize (or extend) the two-sample statistical test to simultaneously compare more than two samples, the test generating a single p value (only one chance of being statistically significant so alpha remains at 0.05 without inflation). These are sometimes referred to as tests of the global hypothesis. A oneway analysis of variance is an example of this approach. 2) For the k-sample comparison, we perform as many two-sample comparisons as we are interested in, generating many p values, but we compare these p values to alphas smaller than 0.05, so that taken as a set (family) of comparisons, the family alpha never exceeds 0.05. This is the same thing as using a p value adjustment procedure, such as Hommel’s procedure. It turns out that the 2nd option is more useful, since we almost always want to know which groups differ, rather than only knowing that there are one or more differences among them. The 2nd option is also more powerful, giving you significance more often, while still controlling for the Type I error (family alpha). The 1st option is useful for our Table 1 “Patient Characteristics” since it allows us to report only one p value to make our point of equivalence among the three or more study groups. Primary-Secondary Hypothesis Approach If you consider one endpoint, or one comparison in general, to be the most important endpoint of interest, you can call this the primary outcome and test it at the nominal significance level of alpha = 0.05, with no need for a multiple comparison adjustment. Or, if you have a set of primary outcomes, you can apply a multiple comparison procedure to this set. All other endpoints are considered secondary endpoints, which do not need a multiple comparison adjustment. These unadjusted secondary endpoints, however, are considered descriptive, or exploratory, and thus do not provide confirmatory evidence for the research hypothesis. A good citation for this approach is Freemantle (2001). Browner et al (1988) is another citation for this. In their discussion of multiple hypotheses, they suggest, “A good rule is to establish in advance as many hypotheses as make sense, but specify one as the primary hypothesis. This helps to focus the study on its main objective, and provides a clear basis for the main sample size calculation. In addition, a single primary hypothesis can be tested statistically without argument about whether to adjust for multiple hypothesis testing.” The FDA Guidance Document, E-9, Section 5.6, allows the selection of a primary outcome variable to avoid adjustment for multiple comparisons on this outcome, or to limit the multiple comparison adjustment to a set of primary variables. Chapter 2-8 (revision 8 May 2011) p. 23 Common Misconception of Thinking Analysis of Variance (ANOVA) Must Precede Pairwise Comparisons Although there is no need to test the global hypothesis first, many researchers and reviewers have been mislead into thinking that it is a necessary step, and they think pairwise comparisons can only be done if the global hypothesis (oneway ANOVA) comes out significant. The pairwise comparisons using a multiple comparison procedure have gained the name post hoc tests, implying they are only done following a significant F statistic from a oneway ANOVA. 
Using this approach leads to lost opportunities to demonstrate significant results because the oneway ANOVA is a very conservative test. Going straight to the pairwise comparisons that are adjusted for multiple comparisons will lead to significant findings more often, while still controlling the Type I error rate. Discussing the situation where the means of four groups are compared, H0: 1 = 2 = 3 = 4 , Zolman (1993, p.109) explains the conservatism of the ANOVA test, “The omnibus F-test, which is the proper analysis to tests these hypotheses, is the most conservative test that could be used. This overall F-test requires a very large difference between means to attain significance and reject H0. The reason for this conservatism is that there are 25 possible combinations of mean comparisons (pairs and combinations) when there are four treatment means (6.9). The F-test evaluates all these 25 comparisons and maintains the overall level of significance at some previously specified level (e.g., .05 or .01).” -----Note: the “(6.9)” is simply referring to a section of Zolman’s book. This misconception that ANOVA must be used first before multiplicity adjusted pairwise comparisons are performed arises because it is still being presented in statistics textbooks and taught in statistics courses. Purposely not singling out any one book, these books make statements like, Before proceeding to pairwise comparisons, call post-hoc tests, the one-way analysis of variance (ANOVA) F-test that simultaneously compares all the means must first be significant. If it is not significant, making pairwise comparisons is unjustified. This approach is propagated by authors of statistics textbooks basically cut-and-pasting ideas presented by already published textbooks, rather than the author becoming familar with the multiple comparison method literature which has long outdated this approach. It takes decades for new ideas to make their way into statistical textbooks. Apparently the idea started with Fisher (1935), in reference to his LSD procedure. [Fisher is the same famous statistician who derived the Fisher’s exact test, and who popularized hypothesis testing using the alpha=0.05 level of significance. Snedecor and Cochran (1980, p.234) state, “...Fisher (8) wrote, “When the z test (i.e., the F test) does not demonstrate significance, much caution should be used before claiming significance for special comparisons.” In line with this remark, investigators are sometimes advised to use the LSD method only if Chapter 2-8 (revision 8 May 2011) p. 24 F is significant. This method has been called the protected LSD, or PSD method. If F is not significant, we declare no significant differences between means.” ___ 8. Fisher RA. (1935). The Design of Experiments. Edinburgh, Oliver & Boyd. The Fisher LSD, or PSD, multiple comparison procedure is a two-stage procedure. First, compute a oneway ANOVA. If significant, go on to step 2, which is basically do all of the pairwise comparisons using an equal variance independent groups Student t-test. The type I error is nearly kept at the nominal alpha (0.05) by first requiring a significant ANOVA. Since then, multiple comparison procedures have been proposed that do not require the ANOVA step to protect against an inflated type I error. Proponents of the practice of requiring a significant F test from an ANOVA before using any of the newer multiple comparison procedure have lost tract of the context in which Fisher’s original idea was proposed. 
Almost no multiple comparison procedure introduced after the Fisher LSD procedure require a significant ANOVA. These multiple comparison procedures are designed to control the Type I error rate by themselves, not requiring any calculation from the ANOVA and not requiring ANOVA be significant in order to control the Type I error rate. In an attempt to correct this erroneous thinking, Dunnett and Goldsmith (2006, p.438), in their discussion of multiple-comparison adjusted pairwise comparisons, state, “...it is not necessary to do a preliminary F test before proceeding with multiple comparisons (as some texts recommend). Performing a preliminary F test may miss important single effects which become diluted (averaged out) with other effects.” Aickin and Gensler (1996) make a similar statement in their discussion of the Holm’s multiple comparison procedure, which is one of the p value adjustment procedures that improve upon the Bonferroni approach, “The second point is that it is traditional to approach data such as these using analysis of variance. However, in the traditional F test, only the null hypothesis of no differences among means is tested. The F test gives no guidance concerning which groups are different when the F statistic is significant and provides little power to detect a small number of differences when most means coincide. For this reason, if individual group differences are to be interpreted, there is no reason to perform the analysis of variance; it is better to proceed directly to Holm’s procedure.” Chapter 2-8 (revision 8 May 2011) p. 25 Cut-and-Paste Response to Editor Insisting on ANOVA In Place Of or Preceding a Multiple Comparison Procedure If you report a multiple comparison procedure and the journal editor insists that ANOVA is appropriate, here is a cut-and-paste response you can use: The editor makes a comment that ANOVA is appropriate for these data, rather than simply reporting multiple-comparison adjusted p values. Although we sincerely appreciate that the editor’s point-of-view is shared by many, it is well-known among statisticians who have kept up on the multiple comparison literature that such an approach is not needed, in general, and particularly when using any of the Bonferoni-extension p value adjustment procedures, such as the one we used. Since our approach is correct, we did not revise our paper to add any tests of the global hypothesis using ANOVA, but instead keep our multiple comparison procedure adjustments to the p values, which is adequate to protect against a Type I error. To meet the editor half way, we added two methods paper citations that explicitly state that the ANOVA is not needed. Our Statistical Methods Section sentence now reads, “P values are adjusted for multiple comparisons using <fill in> multiple procedure, which controls the type I error without the need to first test the global hypothesis with ANOVA.[ Dunnett and Goldsmith (2006); Aickin and Gensler (1996)]” ------Dunnett C, Goldsmith C. When and how to do multiple comparisons. In Buncher CR, Tsay J-Y, eds., Statistics in the Pharmaceutical Industry. 3rd ed. New York, Chapman & Hall/CRC, 2006, pp.421-452. Aickin M, Gensler H. Adjusting for multiple testing when reporting research results: the Bonferroni vs Holm methods. Am J Public Health 1996;86:726-728. 
Here is a detailed explanation of why adjusting for multiple comparisons, without also using ANOVA, is the best approach: Although there is no need to test the global hypothesis first, many researchers and reviewers have been mislead into thinking that it is a necessary step, and they think pairwise comparisons can only be done if the global hypothesis (oneway ANOVA) comes out significant. The pairwise comparisons using a multiple comparison procedure have gained the name post hoc tests, implying they are only done following a significant F statistic from a oneway ANOVA. This misconception arises because it is still being presented in statistics textbooks and taught in statistics courses. Purposely not singling out any specific book, these textbooks make statements like, Before proceeding to pairwise comparisons, call post-hoc tests, the one-way analysis of variance (ANOVA) F-test that simultaneously compares all the means must first be significant. If it is not significant, making pairwise comparisons is unjustified. This approach is propagated by authors of statistics textbooks basically cut-and-pasting ideas presented by already published textbooks, rather than the author becoming familiar with the Chapter 2-8 (revision 8 May 2011) p. 26 multiple comparison method literature which has long outdated this approach. It takes decades for new ideas to make their way into statistical textbooks. Apparently the idea started with Fisher (1935), in reference to his LSD procedure. [Fisher is the same famous statistician who derived the Fisher’s exact test, and who popularized hypothesis testing using the alpha=0.05 level of significance. Snedecor and Cochran (1980, p.234) state, “...Fisher (8) wrote, “When the z test (i.e., the F test) does not demonstrate significance, much caution should be used before claiming significance for special comparisons.” In line with this remark, investigators are sometimes advised to use the LSD method only if F is significant. This method has been called the protected LSD, or PSD method. If F is not significant, we declare no significant differences between means.” ___ 8. Fisher RA. (1935). The Design of Experiments. Edinburgh, Oliver & Boyd. The Fisher LSD, or PSD, multiple comparison procedure is a two-stage procedure. First, compute a oneway ANOVA. If significant, go on to step 2, which is basically do all of the pairwise comparisons using an equal variance independent groups Student t-test. The type I error is nearly kept at the nominal alpha (0.05) by first requiring a significant ANOVA. Since then, multiple comparison procedures have been proposed that do not require the ANOVA step to protect against an inflated type I error. Proponents of the practice of requiring a significant F test from an ANOVA before using any of the newer multiple comparison procedure have lost tract of the context in which Fisher’s original idea was proposed. Almost no multiple comparison procedure introduced after the Fisher LSD procedure requires a significant ANOVA. These multiple comparison procedures are designed to control the Type I error rate by themselves, not requiring any calculation from the ANOVA and not requiring ANOVA be significant in order to control the Type I error rate. In an attempt to correct this erroneous thinking, Dunnett and Goldsmith (2006, p.438), in their discussion of multiple-comparison adjusted pairwise comparisons, state, “...it is not necessary to do a preliminary F test before proceeding with multiple comparisons (as some texts recommend). 
Performing a preliminary F test may miss important single effects which become diluted (averaged out) with other effects.” Aickin and Gensler (1996) make a similar statement in their discussion of the Holm’s multiple comparison procedure, which is one of the p value adjustment procedures that improves upon the Bonferroni approach, “The second point is that it is traditional to approach data such as these using analysis of variance. However, in the traditional F test, only the null hypothesis of no differences among means is tested. The F test gives no guidance concerning which Chapter 2-8 (revision 8 May 2011) p. 27 groups are different when the F statistic is significant and provides little power to detect a small number of differences when most means coincide. For this reason, if individual group differences are to be interpreted, there is no reason to perform the analysis of variance; it is better to proceed directly to Holm’s procedure.” References Aickin M, Gensler H. (1996). Adjusting for multiple testing when reporting research results: the Bonferroni vs Holm methods. Am J Public Health 86:726-728. Dunnett C, Goldsmith C. (2006). When and how to do multiple comparisons. In Buncher CR, Tsay J-Y, eds., Statistics in the Pharmaceutical Industry. 3rd ed. New York, Chapman & Hall/CRC, pp. 421-452. Fisher RA. (1935). The Design of Experiments. Edinburgh, Oliver & Boyd. Munro BH. (2001). Statistical Methods for Health Care Research. 4th ed. Philadelphia, Lippincott. Snedecor GW, Cochran WG. (1980). Statistical Methods, 7th Ed. Ames, Iowa, The Iowa State University Press. Chapter 2-8 (revision 8 May 2011) p. 28 Special Case of Multiplicity Adjustment: Controlling the False Discovery Rate Above, we discussed the most commonly taught and understood approach to multiplicity. Statisticians call this controlling the Familywise Error Rate (FWER). In that situation, a set of comparisons, called a family, are made and a conclusion of a statistically significant effect is made if any of the individual comparisons turns out statistically significance. For example, if three groups are compared, the conclusion will be whether or not there is a difference among the groups. In some special situations, we are interested in the individual comparisons themselves, rather than making a single global statement about them. In this situation, we want to keep the false positive error rate at the nominal alpha value, usually 5%. Doing this, we will have at most 5% of the individual comparisons determined to be significant when the effects are not real. This is called controlling the False Discovery Rate (FDR). The FDR procedures are more powerful than the FWER procedures, leading to more significant findings (Benjamini and Hochberg, 1995). The best known procedure for doing this is the Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995). Benjamini and Hochberg suggest three situations where controlling the FDR is more appropriate than controlling the FWER (Benjamini and Hochberg, 1995, p.292 section 2.2): 1) multiple endpoints problem. In this situation, an overall decision or recommendation is reached from examining the individual comparisons, but the individual comparisons are each important. Deciding to develop a new drug based on multiple discoveries of its benefit is an example. We wish to make as many discoveries as possible about the benefits of the drug, which will enhance a decision in favor of the new drug, while controlling the FDR. 2) multiple subgroups problem. 
Here we make multiple individual decisions, without an overall decision being required. We might compare two treatments in several subgroups, with decisions about the treatments made separately for each subgroup. We are willing to accept a prespecified proportion of misses, say 5%, by controlling the FDR.

3) screening problem. Here multiple potential effects are screened to weed out the null effects. For example, we might screen various chemicals for potential drug development. We want to obtain as many discoveries as possible, but still wish to control the FDR, because a large fraction of false leads would burden the second, confirmatory phase of the analysis.

The screening problem is what is encountered in genetics studies, where a large list of alleles, or single-nucleotide polymorphisms (SNPs), is examined. Rosner (2006, pp.579-581) points out that control of the FWER is not appropriate in genetics studies, while an FDR procedure is. Moyé (2008, pp.630-631) states the same thing,

“In addition, False Discovery Rate (FDR) offers a new and interesting perspective on the multiple comparisons problem. Instead of controlling the chance of any false positives (as Bonferroni does), the FDR controls the expected proportion of false positives among all tests (Benjamini and Hochberg, 1995; Benjamini and Yekutieli, 2001). It is very useful in micro-array analyses in which thousands of significance tests are executed.”

The Benjamini-Hochberg procedure is a p value adjustment procedure. Like the procedures discussed above for controlling the FWER, the p values are obtained from whatever test statistic is appropriate, and then the p values are adjusted for multiplicity.

Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995; Rosner, 2006):

p = unadjusted p value arranged in sort order (smallest to largest)
adjusted p = (k/i)p, where k = number of comparisons, and i = 1, 2, …, k is the rank from smallest to largest

For 3 comparisons, this amounts to comparing

smallest p to adjusted alpha = (1/3)(0.05) = 0.0167
middle p   to adjusted alpha = (2/3)(0.05) = 0.0333
largest p  to adjusted alpha = (3/3)(0.05) = 0.05

Example: applying the formula to the following unadjusted p values:

.001  .020  .029

Adjusted p values are:

(3/1)(.001) = .003
(3/2)(.020) = .030
(3/3)(.029) = .029

We discover an anomaly, since the third adjusted p value is smaller than the second adjusted p value, which is illogical because it was the other way around before adjustment. Anomalies are corrected just as in the Hochberg procedure: with the p values in ascending sort order, if an adjusted p value is smaller than the preceding adjusted p value (which is illogical), then set the preceding adjusted p value to the value of that adjusted p value. After correction for anomalies, the adjusted p values are:

(3/1)(.001) = .003
(3/2)(.020) = .030 -> 0.029
(3/3)(.029) = .029
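If you want to verify this arithmetic yourself before using the fdri command introduced below, here is a minimal sketch in Stata (the variable names pval, rank, and bh_p are ours, used only for illustration):

* minimal sketch of the Benjamini-Hochberg adjustment for k = 3 p values
clear
input pval
.001
.020
.029
end
sort pval
gen rank = _n
gen bh_p = (_N/rank)*pval        // adjusted p = (k/i)*p, with k = _N = 3
* anomaly correction: working from largest to smallest, force the
* adjusted p values to be non-decreasing
forvalues i = `=_N-1'(-1)1 {
    quietly replace bh_p = min(bh_p, bh_p[`i'+1]) in `i'
}
list pval bh_p                   // shows .003, .029, .029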
P Value Adjustment to Control FDR in Stata (fdri command) [False Discovery Rate Immediate]

Such a p value adjustment procedure is not available in Stata. However, I wrote a program to do it, fdri.ado, which is available in the datasets and do-files directory of the electronic course manual. It is also available in the appendix to this chapter, which you can use to create it.

Note: Using fdri.ado

In the command window, execute the command

sysdir

This tells you the directories Stata searches to find commands, or ado-files. It will look like:

   STATA:  C:\Program Files\Stata10\
 UPDATES:  C:\Program Files\Stata10\ado\updates\
    BASE:  C:\Program Files\Stata10\ado\base\
    SITE:  C:\Program Files\Stata10\ado\site\
    PLUS:  c:\ado\plus\
PERSONAL:  c:\ado\personal\
OLDPLACE:  c:\ado\

I suggest you copy the files fdri.ado and fdri.sthlp from the course CD to the c:\ado\personal\ directory. Having done that, fdri becomes an executable command in your installation of Stata. If the directory c:\ado\personal\ does not exist, then you should create it using Windows Explorer (My Documents icon), and then copy the two files into this directory. These two files are also available in the appendix to this chapter.

To get help for fdri, which also displays the suggested citations, use

help fdri

in the command window.

To execute, use the command fdri followed by a list of the p values you want to adjust, each p value separated by a space. The following is example output from the fdri command, using the three p values shown in the Benjamini-Hochberg description above,

fdri .001 .020 .029

P Value Adjustment for Controlling False Discovery Rate

SORTED ORDER: before anomaly corrected
Unadj   Adjusted
P Val   BenHoc
0.0010    0.003
0.0200    0.030
0.0290    0.029

SORTED ORDER: anomaly corrected
  Working from largest to smallest, if Benjamini-Hochberg
  preceding smaller P > P then set preceding smaller P to P
Unadj   Adjusted
P Val   BenHoc
0.0010    0.003
0.0200    0.029
0.0290    0.029

ORIGINAL ORDER: anomaly corrected
Unadj   Adjusted
P Val   BenHoc
0.0010    0.003
0.0200    0.029
0.0290    0.029
-----------------------------------------
*Adjusted for 3 multiple comparisons

KEY: BenHoc = Benjamini-Hochberg procedure

Next, using the example from Benjamini and Hochberg (1995),

fdri .0001 .0004 .0019 .0095 .0201 .0278 .0298 .0344 ///
     .0459 .0324 .4262 .5719 .6528 .7590 1.000

P Value Adjustment for Controlling False Discovery Rate

SORTED ORDER: before anomaly corrected
Unadj   Adjusted
P Val   BenHoc
0.0001    0.002
0.0004    0.003
…
1.0000    1.000

SORTED ORDER: anomaly corrected
  Working from largest to smallest, if Benjamini-Hochberg
  preceding smaller P > P then set preceding smaller P to P
Unadj   Adjusted
P Val   BenHoc
0.0001    0.002
0.0004    0.003
…
1.0000    1.000

ORIGINAL ORDER: anomaly corrected
Unadj   Adjusted
P Val   BenHoc
0.0001    0.002
0.0004    0.003
0.0019    0.010
0.0095    0.036
0.0201    0.057
0.0278    0.057
0.0298    0.057
0.0344    0.057
0.0459    0.069
0.0324    0.057
0.4262    0.581
0.5719    0.715
0.6528    0.753
0.7590    0.813
1.0000    1.000
-----------------------------------------
*Adjusted for 15 multiple comparisons

KEY: BenHoc = Benjamini-Hochberg procedure

We see that we get to keep the significant findings for the four smallest p values, which is consistent with the result in the Benjamini and Hochberg article.

An example of an article that used the Benjamini-Hochberg procedure is:

Chen P, Liang J, Wang Z, et al. Association of common PALB2 polymorphisms with breast cancer risk: a case-control study. Clin Cancer Res 2008 Sep 15;14(18):5931-7.
In their Abstract, they report,

“RESULTS: Based on the multiple hypothesis testing with the Benjamini-Hochberg method, tagging SNPs (tSNP) rs249954, rs120963, and rs16940342 were found to be associated with an increase of breast cancer risk (false discovery rate-adjusted P values of 0.004, 0.028, and 0.049, respectively) under the dominant model.”

Article Suggestion

Here is a suggestion for your Statistical Methods section:

Given that we had a genetics study, where a large list of SNPs was examined, we report Benjamini-Hochberg adjusted p values, which maintain the false discovery rate (FDR) at the nominal alpha = 0.05 level (Benjamini and Hochberg, 1995). In such studies, controlling for multiplicity in the standard fashion, such as with the Bonferroni procedure, which controls the family-wise error rate (FWER), is not justified, while control of the FDR provides the correct control for multiplicity (Benjamini and Hochberg, 1995; Rosner, 2006; Moyé, 2008).

An example of an investigator making a statement like this, only more detailed and even better, to justify controlling the FDR is Scott et al. (2010),

“Statistical tests of the univariate relationships between these baseline predictor variables and the 6 outcome variables at each of the 2 follow-up visits resulted in 204 (CRVO analyses) and 240 (BRVO analyses) P values. If so many hypotheses are tested without special precautions, some relationships would likely appear significant by chance alone (i.e., type I error). To mitigate this, we controlled the false discovery rate (FDR)8,9 at 5% separately within the CRVO and BRVO disease area analyses. Modern clinical trials may feature multiple, co-primary endpoints, with the statistical significance of any one of the end points potentially serving as a basis for a claim of efficacy. In that situation, one typically controls family-wide type I error (FWE). However, the aim of this article is not to claim efficacy of a particular treatment but to nominate important predictive relationships. Here, controlling FDR is more appropriate. Controlling FWE at a level of 0.05 ensures that the probability of incorrectly rejecting at least 1 null hypothesis is only 5%. In contrast, controlling FDR at a level of 0.05 ensures that the expected proportion, among all rejected null hypotheses, of incorrectly rejected null hypotheses is only 5%. The FDR is often implemented in genomics research areas such as gene chips, where multiplicity is a well-recognized phenomenon of concern. Benjamini and Hochberg8 introduced FDR methodology for independent hypothesis tests. Benjamini and Yekutieli9 later showed that the original method suffices for some types of dependence and introduced a conservative correction that works for all types of dependence. We chose the FDR criterion to try to ensure that no more than 5% of the results we claim to be significant would fail to be confirmed if subsequently investigated with new data, consistent with recommendations by Benjamini et al.10”
--------------
8. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B Stat Methodol 1995;57:289-300.
9. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Ann Stat 2001;29:1165-88.
10. Benjamini Y, Drai D, Elmer G, et al. Controlling the false discovery rate in behavior genetics research. Behav Brain Res 2001;125:279-84.
Dichotomous Dependent Variable

In the two-sample case in the previous chapter, we analyzed data like this using the chi-square test (if the minimum expected cell frequency assumption was met) and Fisher’s exact test otherwise. For the k-sample case, we use the chi-square test again (with the same assumption) and the Fisher-Freeman-Halton test otherwise (Barnard’s test is only for 2 × 2 tables). This gives us one p value (testing the global hypothesis), so no multiple comparison adjustment is required.

For pairwise comparisons, we would use the same tests (including Barnard’s test) and then adjust the p values using one of the procedures discussed above. We would then report the adjusted p values, rather than the original unadjusted p values.

To illustrate, consider the following data:

Five-year Survival Following Treatment for Unspecified Cancer (Hypothetical data)
By Three Therapies

                    Chemo   Surgery   Radiation   total
survived 5 years       10        14          25      49
died                   90        86          75     251
total                 100       100         100     300

tabi 10 14 25 \ 90 86 75 , expect col chi2

+--------------------+
| Key                |
|--------------------|
| frequency          |
| expected frequency |
| column percentage  |
+--------------------+

           |                col
       row |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |        10         14         25 |        49
           |      16.3       16.3       16.3 |      49.0
           |     10.00      14.00      25.00 |     16.33
-----------+---------------------------------+----------
         2 |        90         86         75 |       251
           |      83.7       83.7       83.7 |     251.0
           |     90.00      86.00      75.00 |     83.67
-----------+---------------------------------+----------
     Total |       100        100        100 |       300
           |     100.0      100.0      100.0 |     300.0
           |    100.00     100.00     100.00 |    100.00

          Pearson chi2(2) =   8.8300   Pr = 0.012

The global test is known to be conservative, and so this chi-square would not normally be done. However, if we wanted to, we could use it and stop here, just reporting: “There was a significant difference among the three therapies in 5-year survival (p = 0.012).” The reader would naturally want to know if Radiation therapy is significantly better than Surgery, as well as significantly better than Chemo, and if Surgery is significantly better than Chemo.
Our next step would be (which would normally be our first step, skipping the global test):

tabi 10 14 \ 90 86 , expect chi2
tabi 10 25 \ 90 75 , expect chi2
tabi 14 25 \ 86 75 , expect chi2

+--------------------+
| Key                |
|--------------------|
| frequency          |
| expected frequency |
+--------------------+

           |          col
       row |         1          2 |     Total
-----------+----------------------+----------
         1 |        10         14 |        24
           |      12.0       12.0 |      24.0
-----------+----------------------+----------
         2 |        90         86 |       176
           |      88.0       88.0 |     176.0
-----------+----------------------+----------
     Total |       100        100 |       200
           |     100.0      100.0 |     200.0

          Pearson chi2(1) =   0.7576   Pr = 0.384

           |          col
       row |         1          2 |     Total
-----------+----------------------+----------
         1 |        10         25 |        35
           |      17.5       17.5 |      35.0
-----------+----------------------+----------
         2 |        90         75 |       165
           |      82.5       82.5 |     165.0
-----------+----------------------+----------
     Total |       100        100 |       200
           |     100.0      100.0 |     200.0

          Pearson chi2(1) =   7.7922   Pr = 0.005

           |          col
       row |         1          2 |     Total
-----------+----------------------+----------
         1 |        14         25 |        39
           |      19.5       19.5 |      39.0
-----------+----------------------+----------
         2 |        86         75 |       161
           |      80.5       80.5 |     161.0
-----------+----------------------+----------
     Total |       100        100 |       200
           |     100.0      100.0 |     200.0

          Pearson chi2(1) =   3.8541   Pr = 0.050

Obtaining adjusted p values with mcpi,

mcpi .384 .005 .050

ORIGINAL ORDER: anomaly corrected
Unadj  ---------------------- Adjusted ---------------------------
P Val     TCH  Homml  Finnr  Hochb  Ho-Si   Holm  Sidak  Bonfr
0.3840  0.568  0.384  0.384  0.384  0.384  0.384  0.766  1.000
0.0050  0.009  0.015  0.015  0.015  0.015  0.015  0.015  0.015
0.0500  0.085  0.100  0.074  0.100  0.098  0.100  0.143  0.150
-----------------------------------------------------------------
*Adjusted for 3 multiple comparisons

Now we are really frustrated, because we lost our significance between surgery and radiation. The question, then, is whether it was really necessary to adjust the p values. After all, it is not like three doses of the same drug--it is three distinct therapies (more like three distinct hypotheses). There is a lot of confusion around this issue, and you might encounter a reviewer who claims you need a multiple comparison procedure here. Actually, you do not.

Look back at the five aspects of clinical trials in which Pocock states multiplicity arises. Notice that in all five aspects listed by Pocock, you are testing one global hypothesis with multiple comparisons. That is not the case with the three cancer treatments, unless the research question is merely whether or not differences exist among cancer treatments (and you don’t really care which treatments outperform which other treatments). If the interest is in how specific cancer treatments compare with other specific treatments, which of course it is, then you had three hypotheses to test when you designed the study--reporting the global hypothesis test as we did above is nothing but a waste of space in the article. You should instead report three separate statements, something to the effect of:

Radiation therapy was significantly better than chemotherapy (p = 0.005). Radiation therapy was also significantly better than surgery (p = 0.050). There was no significant difference between chemotherapy and surgery (p = 0.384).

The editor might say, “A multiple comparison procedure is needed since the three therapies were tested using the same sample of patients.” On the surface, this sounds credible. After all, the p value is the probability of observing the effect you did simply by taking a sample, and you only took one sample.
What we need is a good reference to support our position. Here it is (see box).

Reference for Not Adjusting Multiple Arm Comparisons for Multiplicity

Dunnett and Goldsmith (2006) state:

“Here, some typical examples arising in pharmaceutical research to illustrate some of the reasons for using (or not using) multiple comparison procedures will be considered. In general, the use of an appropriate multiple comparison test to make inferences concerning treatment contrasts is indicated in the following situations:

1. To make an inference concerning a particular contrast which has been selected on the basis of how the data have turned out.
2. To make an inference which requires the simultaneous examination of several treatment contrasts.
3. In “data dredging,” viz., assembling the data in various ways to determine whether some interesting differences will emerge.

On the other hand, multiple comparison procedures are usually not appropriate when particular contrasts to be tested are selected in advance and are reported individually rather than as a group. In such situations, the comparison error rate is usually of primary concern and the standard tests of significance can be used, rather than a multiple comparison test.”

Note: The statistical term “contrast” is used by Dunnett and Goldsmith simply to make their statements more general. You can think of a contrast as any comparison, such as group 1 vs group 2 (the usual type of comparison) or perhaps something more unusual such as (group 1 + group 2)/2 vs. group 3. For our purposes, simply change “contrasts” to “comparisons” in situation 2.

We would state in our protocol:

“The three treatment arms will be compared with each other using chi-square tests, or with Fisher’s exact tests if the minimum expected cell frequency assumption is not met. No adjustment for multiplicity is required, as the inference related to the study aim does not require the simultaneous examination of the three comparisons (Dunnett and Goldsmith, 2006). That is, three separate comparisons are made to test three separate hypotheses, which are reported and discussed separately, rather than using the three comparisons to support a single conclusion; therefore, applying a multiple-comparison procedure would not be appropriate (Dunnett and Goldsmith, 2006).”

If we made this statement of not needing a multiplicity adjustment, and a misinformed reviewer came back with a request to make the adjustment anyway, we would include the entire three-paragraph Dunnett and Goldsmith quote given above in our response to support our position.

When to Use a Global Test

Regardless of the level of measurement, global tests are fine for Table 1 (patient characteristics), since we really don’t care if we miss significance, and one p value is easier to deal with than many. It is common practice to use a global test for Table 1 comparisons. As an example, this was done in the Florez et al. (2006) paper. They refer to global tests in their Table 1 footnote, and they use Holm-adjusted p values in their results, Table 2.

The global test is conservative, however, not giving significance often enough. If you really care about significance, such as when the comparison is for the outcome variable, then you should skip the global test and go straight to the pairwise comparisons, adjusting the pairwise comparisons for multiplicity using a p value adjustment procedure.
The smallest adjusted pairwise p value is the p value for the global hypothesis, if you want to report a global hypothesis conclusion.

Nominal Dependent Variable

In the two-sample case in the previous chapter, we analyzed data like this using the chi-square test (if the minimum expected cell frequency assumption was met) and the Fisher-Freeman-Halton test otherwise. For the k-sample case, we use the chi-square test again (with the same assumption) and the Fisher-Freeman-Halton test otherwise. This gives us one p value (testing the global hypothesis), so no multiple comparison adjustment is required.

For pairwise comparisons, we would use the same tests and then adjust the p values using one of the procedures discussed above. We would then report the adjusted p values, rather than the original unadjusted p values. There is really no need to test the global hypothesis, so we should just use pairwise comparisons to test for a study effect.

Ordinal Dependent Variable

In the two-sample case in the previous chapter, we analyzed data like this using the Wilcoxon-Mann-Whitney test. For the k-sample case, we use the Kruskal-Wallis analysis of variance by ranks test (also called the Kruskal-Wallis nonparametric analysis of variance). Analysis of variance is popularly abbreviated as ANOVA. This test is nothing more than a k-sample extension of the Wilcoxon-Mann-Whitney test (for 2 groups, it is identically the Wilcoxon-Mann-Whitney test). This gives us one p value (testing the global hypothesis), so no multiple comparison adjustment is required.

For pairwise comparisons, we would use Wilcoxon-Mann-Whitney tests and then adjust the p values using one of the procedures discussed above. We would then report the adjusted p values, rather than the original unadjusted p values.

Dalton et al. (1987) report a study where they obtained absorption profiles from women following the administration of ointment containing 20, 30, and 40 mg of progesterone to the nasal mucosa. Their dataset is reproduced in Altman’s biostatistics textbook (1991). A subset of these data is found in the file progesterone.dta. The dependent variable is actually interval scaled, but let’s analyze it as an ordinal variable anyway, so we can compare the result to an analysis of it as an interval-scaled variable in the next section.
Start the Stata program and read in the data,

File
  Open
    Find the directory where you copied the course CD: BiostatsCourse
    Find the subdirectory: datasets & do-files
    Single click on: progesterone.dta
    Open

use "C:\Documents and Settings\u0032770.SRVR\Desktop\BiostatsCourse\datasets & do-files\progesterone.dta", clear

which must be all on one line, or use:

cd "C:\Documents and Settings\u0032770.SRVR\Desktop\BiostatsCourse\"
cd "datasets & do-files"
use progesterone.dta, clear

Simultaneously comparing all four groups:

Statistics
  Summaries, tables & tests
    Nonparametric tests of hypotheses
      Kruskal-Wallis rank test
        Main tab: Outcome variable: peakval
                  Variable defining groups: group
        OK

kwallis peakval, by(group)

Test: Equality of populations (Kruskal-Wallis test)

+-----------------------------------------------------------+
| group                                   | Obs | Rank Sum  |
|-----------------------------------------+-----+-----------|
| Grp 1 (0.2ml of 100 mg/ml one nostril)  |   6 |    76.50  |
| Grp 2 (0.3ml of 100 mg/ml one nostril)  |   6 |    45.00  |
| Grp 3 (0.2ml of 200 mg/ml one nostril)  |   4 |    34.50  |
| Grp 4 (0.2ml of 100 mg/ml each nostril) |   4 |    54.00  |
+-----------------------------------------------------------+

chi-squared =     3.841 with 3 d.f.
probability =     0.2791

chi-squared with ties =     3.844 with 3 d.f.
probability =     0.2788        <- use this one (should always correct for ties)

Alternatively, we can skip the Kruskal-Wallis test, and instead compute three Wilcoxon-Mann-Whitney tests and adjust the p values for multiplicity.

ranksum peakval if group==1 | group==2, by(group)
ranksum peakval if group==1 | group==3, by(group)
ranksum peakval if group==2 | group==3, by(group)

mcpi 0.1093 0.3359 1.000

ORIGINAL ORDER: anomaly corrected
Unadj  ---------------------- Adjusted ---------------------------
P Val     TCH  Homml  Finnr  Hochb  Ho-Si   Holm  Sidak  Bonfr
0.1093  0.182  0.328  0.293  0.328  0.293  0.328  0.293  0.328
0.3359  0.508  0.672  0.459  0.672  0.559  0.672  0.707  1.000
1.0000  1.000  1.000  1.000  1.000  1.000  1.000  1.000  1.000
-----------------------------------------------------------------
*Adjusted for 3 multiple comparisons

Interval Dependent Variable

In the two-sample case in the previous chapter, we analyzed data like this using the t test. For the k-sample case, we use one-way analysis of variance (or one-way ANOVA). This test is nothing more than a k-sample extension of the equal-variance t test (for two groups, the tests are identical). This gives us one p value (testing the global hypothesis), so no multiple comparison adjustment is required.

Like the t test, the one-way ANOVA also has the assumptions of normally distributed data and equal variances among the groups (called the homogeneity of variance assumption). The homogeneity of variance assumption is tested using Bartlett’s test when we use Stata’s oneway command. In actuality, the t test and one-way ANOVA are very robust to these assumptions, so there generally is no need to test whether the assumptions are met. How to test the assumptions, however, is shown next, in case you ever really need to do that (to satisfy a demanding journal reviewer, for example).

The syntax for the one-way ANOVA is,

oneway depvar groupvar                                 <-- one-way ANOVA
oneway depvar groupvar , tabulate means standard obs   <-- one-way ANOVA with descriptive statistics
Computing a one-way ANOVA,

Statistics
  Linear models and related
    ANOVA/MANOVA
      One-way ANOVA
        Main tab: Response variable: peakval
                  Factor variable: group
        Output: produce summary table
        OK

oneway peakval group, tabulate

            | Summary of Serum Progesterone Peak
            |           Value (nmol/l)
 Dose Group |        Mean   Std. Dev.       Obs.
------------+------------------------------------
  Grp 1 (0. |   27.033334   9.8798115          6
  Grp 2 (0. |        18.8   5.9275625          6
  Grp 3 (0. |      22.025   13.131482          4
  Grp 4 (0. |       26.85   6.8656151          4
------------+------------------------------------
      Total |      23.525    9.129125         20

                        Analysis of Variance
    Source              SS         df      MS           F     Prob > F
------------------------------------------------------------------------
Between groups      261.026689      3   87.0088963     1.05     0.3965
 Within groups      1322.45087     16   82.6531791
------------------------------------------------------------------------
    Total           1583.47755     19   83.3409239

Bartlett's test for equal variances:  chi2(3) =  2.6306   Prob>chi2 = 0.452

We did not observe a significant difference among the four groups (p = 0.397). Bartlett’s test was not significant (p = 0.452), so the assumption of equal variances was not shown to be violated. However, Bartlett’s test is sensitive to non-normality, so a better test for equality of variances is Levene’s test (which works for both the t test and one-way ANOVA). Levene’s test is said to be “robust to the normality assumption,” meaning that it provides an accurate comparison of the variances even if the normality assumption is violated. I suggest, then, that you ignore the Bartlett’s test printed with the oneway output, and use the following command to test equality of variances using Levene’s test, if you really want to test this assumption:

Statistics
  Summaries, tables & tests
    Classical tests of hypotheses
      Robust equal variance test
        Main tab: Variable: peakval
                  Variable defining two comparison groups: group
        OK

robvar peakval , by(group)

            | Summary of Serum Progesterone Peak
            |           Value (nmol/l)
 Dose Group |        Mean   Std. Dev.      Freq.
------------+------------------------------------
  Grp 1 (0. |   27.033334   9.8798115          6
  Grp 2 (0. |        18.8   5.9275625          6
  Grp 3 (0. |      22.025   13.131482          4
  Grp 4 (0. |       26.85   6.8656151          4
------------+------------------------------------
      Total |      23.525    9.129125         20

W0  = .66609748   df(3, 16)   Pr > F = .58499634   <- This one is Levene’s test—use this
W50 = .55434711   df(3, 16)   Pr > F = .65261607
W10 = .66609748   df(3, 16)   Pr > F = .58499634

If the homogeneity of variance assumption is not satisfied, we can either transform the data (a quick sketch of this option follows this discussion) or drop back to the Kruskal-Wallis ANOVA. Using the Kruskal-Wallis ANOVA when the variances are not equal among the groups is apparently a controversy among statisticians. Daniel (1995, p.598) advocates using the Kruskal-Wallis ANOVA when either the normality assumption or the equal variances assumption for the interval-scaled one-way ANOVA is not met. On the other hand, Glantz and Slinker (2001, pp.327-328) claim that the Kruskal-Wallis ANOVA assumes the distributions have the same shape, and thus the same dispersion (or variance), and so is not a suitable alternative to the one-way ANOVA when there are unequal variances. [Glantz and Slinker propose using either the Brown-Forsythe F statistic or the Welch W statistic when the variances are not equal. These tests are not available in Stata, but they are available as an option in the SPSS ONEWAY procedure.] Siegel and Castellan (1988) never mention this “same shape” assumption of the Kruskal-Wallis ANOVA in their nonparametric statistics text.
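As a quick illustration of the “transform the data” option mentioned above, here is a minimal sketch (assuming the progesterone data are still in memory; the variable name logpeak is ours, for illustration only): re-run the one-way ANOVA on log-transformed values, which often stabilizes unequal variances for right-skewed measurements.

gen logpeak = ln(peakval)        // log transform (peakval is strictly positive)
oneway logpeak group, tabulate   // same one-way ANOVA, now on the log scale

Any conclusions would then refer to differences on the log scale (equivalently, ratios of geometric means on the original scale).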
Usually, statisticians will use the Kruskal-Wallis ANOVA if the homogeneity of variance assumption is not met, despite the above controversy. It turns out the parametric ANOVA is robust to the homogeneity of variance assumption, so you generally don’t have to worry about this assumption anyway.

For pairwise comparisons, we would use t tests and then adjust the p values using one of the procedures discussed above. We would then report the adjusted p values, rather than the original unadjusted p values. We’ll do that below.

Getting back to the above progesterone example, we saw that the Kruskal-Wallis ANOVA was more powerful (smaller p value) than the parametric one-way ANOVA, which is normally not the case. This suggests that perhaps we violated the normality assumption, since the equality of variances assumption seemed adequately justified. Investigating this with a boxplot,

Graphics
  Box plot
    Main tab: Variables: peakval
    By tab: Draw subgraphs for unique values of variables: group
    OK

graph box peakval, by(group)

[Box plot of peakval for the four dose groups (Grp 1 through Grp 4); an outlier is visible in Group 1, and the long group value labels do not display properly on the axis.]

Besides realizing we need shorter value labels for the x-axis to be labeled properly, we see an outlier in Group 1. Checking normality with the Shapiro-Wilk test,

Statistics
  Summaries, tables & tests
    Distributional plots & tests
      Shapiro-Wilk normality test
        Main tab: Variables: peakval
        by/if/in tab: Repeat command by groups:
                      Variables that define groups: group
        OK

by group, sort : swilk peakval
<or>
bysort group: swilk peakval

_______________________________________________________________
-> group = Grp 1 (0.2ml of 100 mg/ml one nostril)

                   Shapiro-Wilk W test for normal data

    Variable |    Obs        W          V         z     Prob>z
-------------+-------------------------------------------------
     peakval |      6    0.85346      1.815     0.963   0.16784

_______________________________________________________________
-> group = Grp 2 (0.3ml of 100 mg/ml one nostril)

                   Shapiro-Wilk W test for normal data

    Variable |    Obs        W          V         z     Prob>z
-------------+-------------------------------------------------
     peakval |      6    0.95127      0.604    -0.676   0.75052

_______________________________________________________________
-> group = Grp 3 (0.2ml of 200 mg/ml one nostril)

                   Shapiro-Wilk W test for normal data

    Variable |    Obs        W          V         z     Prob>z
-------------+-------------------------------------------------
     peakval |      4    0.88967      1.272     0.301   0.38160

_______________________________________________________________
-> group = Grp 4 (0.2ml of 100 mg/ml each nostril)

                   Shapiro-Wilk W test for normal data

    Variable |    Obs        W          V         z     Prob>z
-------------+-------------------------------------------------
     peakval |      4    0.98355      0.190    -1.422   0.92248

we see that there is not sufficient evidence to reject the normality assumption, even for Group 1 with its outlier. However, this might largely be due to the small sample size (n = 6). Let’s compute the pairwise t tests and adjust them for multiple comparisons. This is the best approach, anyway, since ANOVA is very conservative.
Statistics
  Summaries, tables & tests
    Classical tests of hypotheses
      Two-group mean-comparison test
        Main tab: Variable name: peakval
                  Group variable name: group
        by/if/in tab: If (expression): group==1 | group==2
        OK

ttest peakval if group==1 | group==2, by(group)

Doing this for all pairwise comparisons, by changing the “if” expression,

ttest peakval if group==1 | group==2, by(group)
ttest peakval if group==1 | group==3, by(group)
ttest peakval if group==2 | group==3, by(group)

. ttest peakval if group==1 | group==2, by(group)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
Grp 1 (0 |       6    27.03333    4.033416    9.879811    16.66511    37.40156
Grp 2 (0 |       6        18.8    2.419917    5.927563     12.5794     25.0206
---------+--------------------------------------------------------------------
combined |      12    22.91667    2.562989    8.878456    17.27557    28.55777
---------+--------------------------------------------------------------------
    diff |            8.233334    4.703663                -2.24708    18.71375
------------------------------------------------------------------------------
    diff = mean(Grp 1 (0) - mean(Grp 2 (0)                        t =   1.7504
Ho: diff = 0                                     degrees of freedom =       10

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.9447         Pr(|T| > |t|) = 0.1106          Pr(T > t) = 0.0553

. ttest peakval if group==1 | group==3, by(group)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
Grp 1 (0 |       6    27.03333    4.033416    9.879811    16.66511    37.40156
Grp 3 (0 |       4      22.025    6.565741    13.13148    1.129881    42.92012
---------+--------------------------------------------------------------------
combined |      10       25.03    3.440867    10.88098    17.24622    32.81378
---------+--------------------------------------------------------------------
    diff |            5.008334    7.236197               -11.67837    21.69503
------------------------------------------------------------------------------
    diff = mean(Grp 1 (0) - mean(Grp 3 (0)                        t =   0.6921
Ho: diff = 0                                     degrees of freedom =        8

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.7458         Pr(|T| > |t|) = 0.5084          Pr(T > t) = 0.2542

. ttest peakval if group==2 | group==3, by(group)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
Grp 2 (0 |       6        18.8    2.419917    5.927563     12.5794     25.0206
Grp 3 (0 |       4      22.025    6.565741    13.13148    1.129881    42.92012
---------+--------------------------------------------------------------------
combined |      10       20.09    2.824396    8.931523    13.70077    26.47923
---------+--------------------------------------------------------------------
    diff |              -3.225    6.007753                -17.0789     10.6289
------------------------------------------------------------------------------
    diff = mean(Grp 2 (0) - mean(Grp 3 (0)                        t =  -0.5368
Ho: diff = 0                                     degrees of freedom =        8

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.3030         Pr(|T| > |t|) = 0.6060          Pr(T > t) = 0.6970
We next adjust the three p values for 3 multiple comparisons:

mcpi 0.1106 0.5084 0.6060

ORIGINAL ORDER: anomaly corrected
Unadj  ---------------------- Adjusted ---------------------------
P Val     TCH  Homml  Finnr  Hochb  Ho-Si   Holm  Sidak  Bonfr
0.1106  0.184  0.332  0.296  0.332  0.296  0.332  0.296  0.332
0.5084  0.708  0.606  0.655  0.606  0.758  1.000  0.881  1.000
0.6060  0.801  0.606  0.655  0.606  0.758  1.000  0.939  1.000
-----------------------------------------------------------------
*Adjusted for 3 multiple comparisons

KEY: TCH   = Tukey-Ciminera-Heyse procedure
     Homml = Hommel procedure
     Finnr = Finner procedure
     Hochb = Hochberg procedure
     Ho-Si = Holm-Sidak procedure
     Holm  = Holm procedure
     Sidak = Sidak procedure
     Bonfr = Bonferroni procedure

No multiple comparison procedure helped, as can be expected, since they only make the p values the same or larger. Still, it is the p values from our choice of one of these procedures that we would report. Our choice should not include TCH, however, since that is for highly correlated comparisons (repeated measurements, for example).
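To demystify what mcpi is doing, here is a minimal sketch (our own check, not part of the chapter's required commands) verifying the Holm column by hand. Holm multiplies the i-th smallest of the k p values by k - (i - 1), caps any product at 1, and never lets an adjusted p value fall below the preceding one:

* Holm adjustment by hand for the sorted p values .1106, .5084, .6060 (k = 3)
display 3*.1106    // .3318 -> reported as 0.332
display 2*.5084    // 1.0168 -> capped at 1.000
display 1*.6060    // .6060 -> raised to 1.000, since an adjusted p value
                   //   cannot be smaller than the preceding adjusted p value

These match the Holm column above (0.332, 1.000, 1.000).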
References

Abramson JH, Gahlinger PM. (2001). Computer Programs for Epidemiologists: PEPI Version 4.0. Salt Lake City, UT, Sagebrush Press.

Aickin M, Gensler H. (1996). Adjusting for multiple testing when reporting research results: the Bonferroni vs Holm methods. Am J Public Health 86:726-728.

Altman DG. (1991). Practical Statistics for Medical Research. New York, Chapman & Hall/CRC, pp.426-433.

Benjamini Y, Hochberg Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc, Series B (Methodological) 57(1):289-300.

Benjamini Y, Yekutieli D. (2001). The control of the false discovery rate in multiple testing under dependency. Ann Statistics 29(4):1165-1188.

Browner WS, Newman TB, Cummings SR, Hulley SB. (1988). Getting ready to estimate sample size: hypotheses and underlying principles. In Hulley SB, Cummings SR, eds., Designing Clinical Research: An Epidemiologic Approach. Baltimore, Williams & Wilkins.

Cummings SR, Ensrud K, Delmas PD, et al. (2010). Lasofoxifene in postmenopausal women with osteoporosis. N Engl J Med 362(8):686-96.

Dalton ME, Bromham DR, Ambrose CL, Osborne J, Dalton KD. (1987). Nasal absorption of progesterone in women. Br J Obstet Gynaecol 94(1):85-8.

Daniel WW. (1995). Biostatistics: A Foundation for Analysis in the Health Sciences. 6th ed. New York, John Wiley & Sons.

Dunnett C, Goldsmith C. (2006). When and how to do multiple comparisons. In Buncher CR, Tsay J-Y, eds., Statistics in the Pharmaceutical Industry. 3rd ed. New York, Chapman & Hall/CRC, pp. 421-452.

Finner H. (1993). On a monotonicity problem in step-down multiple test procedures. Journal of the American Statistical Association 88:920-923.

Fisher RA. (1935). The Design of Experiments. Edinburgh, Oliver & Boyd.

Florez JC, Jablonski KA, Bayley N, et al. (2006). TCF7L2 polymorphisms and progression to diabetes in the diabetes prevention program. NEJM 355(3):241-250.

Freemantle N. (2001). Interpreting the results of secondary end points and subgroup analyses in clinical trials: should we lock the crazy aunt in the attic? BMJ 322:989-91.

Glantz SA, Slinker BK. (2001). Primer of Applied Regression and Analysis of Variance. 2nd ed. New York, McGraw-Hill.

Holm S. (1979). A simple sequentially rejective multiple test procedure. Scand J Stat 6:65-70.

Hommel G. (1989). A comparison of two modified Bonferroni procedures. Biometrika 76:624-625.

Horton NJ, Switzer SS. (2005). Statistical methods in the Journal. [letter] NEJM 353(18):1977-1979.

Levin B. (1996). Annotation: on the Holm, Simes, and Hochberg multiple test procedures. Am J Public Health 86(5):628-629.

Ludbrook J. (1998). Multiple comparison procedures updated. Clinical and Experimental Pharmacology and Physiology 25:1032-1037.

Moyé LA. (2008). The multiple comparison issue in health care research. In Rao CR, Miller JP, Rao DC, eds., Handbook of Statistics 27: Epidemiology and Medical Statistics. New York, Elsevier, pp.616-655.

Pocock SJ. (1983). Clinical Trials: A Practical Approach. New York, John Wiley & Sons.

Rosner B. (2006). Fundamentals of Biostatistics. 6th ed. Belmont, CA, Thomson Brooks/Cole.

Sankoh AJ, Huque MF, Dubey SD. (1997). Some comments on frequently used multiple endpoint adjustment methods in clinical trials. Statistics in Medicine 16:2529-2542.

Scott IU, VanVeldhuisen PC, Oden NL, et al. (2010). Baseline predictors of visual acuity and retinal thickness outcomes in patients with retinal vein occlusion: Standard Care versus Corticosteroid for Retinal Vein Occlusion study report 10. Ophthalmology (in press).

Siegel S, Castellan NJ Jr. (1988). Nonparametric Statistics for the Behavioral Sciences. 2nd ed. New York, McGraw-Hill.

Simes RJ. (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika 73:751-754.

Snedecor GW, Cochran WG. (1980). Statistical Methods. 7th ed. Ames, Iowa, The Iowa State University Press.

Tukey JW, Ciminera JL, Heyse JF. (1985). Testing the statistical certainty of a response to increasing doses of a drug. Biometrics 41:295-301.

Witte JS, Elston RC, Cardon LR. (2000). On the relative sample size required for multiple comparisons. Statist Med 19:369-372.

Wright SP. (1992). Adjusted p-values for simultaneous inference. Biometrics 48:1005-1013.

Zolman JF. (1993). Biostatistics: Experimental Design and Statistical Inference. New York, Oxford University Press.

Appendix. mcpi, fdri, and help files

mcpi.ado

If you do not have access to the mcpi.ado file, then you can create it yourself. Cut-and-paste the following into the Stata do-file editor.

* file: mcpi.ado
* p-value adjusted multiple-comparison procedures immediate
* author: Greg Stoddard    updated: 23Mar2011
* compute adjusted p values using several p-value based
* multiple-comparison procedures, providing p value list on
* same command line
* syntax:
*   mcpi <p-value list>, where "p-value list" is a list of p values to be adjusted
capture program drop mcpi
program define mcpi
version 10
preserve
clear
quietly set obs 0
tempvar pval
gen `pval'=.
local arg=1
local stop=0
while (`"``arg''"'~="" & `"``arg''"'~="," & `stop'==0 ) {
    quietly set obs `arg'
    quietly capture replace `pval'=``arg'' in l
    if `pval'[_N]==. {
        local stop=1
        quietly drop in l
    }
    local arg=`arg'+1
}
tempvar f_p tch_p hs_p h_p ho_p s_p b_p hom_p c origorder sortorder
quietly gen `origorder'=_n
local K=_N                        /* K comparisons, where K = # p-values entered */
sort `pval'
quietly gen `sortorder'=_n
quietly gen `f_p'   = 1-(1-`pval')^(`K'/`sortorder')      // Finner adjusted p
quietly gen `tch_p' = 1-(1-`pval')^sqrt(`K')              // Tukey-Ciminera-Heyse adjusted p
quietly gen `hs_p'  = 1-(1-`pval')^(`K'-(`sortorder'-1))  // Holm-Sidak adjusted p
quietly gen `h_p'   = (`K'-(`sortorder'-1))*`pval'        // Holm adjusted p
quietly gen `ho_p'  = (`K'-(`sortorder'-1))*`pval'        // Hochberg adjusted p
quietly gen `s_p'   = 1-(1-`pval')^`K'                    // Sidak adjusted p
quietly gen `b_p'   = `K'*`pval'                          // Bonferroni adjusted p
* -- begin Hommel
* using algorithm in appendix of Wright (1992) for computing Hommel's
* procedure on sorted p values
quietly gen `hom_p' = `pval'
quietly gen `c'=.
forvalues m=`K'(-1)2 {
    local km = `K'-`m'
    local km1 = `km'+1
    forvalues i=`km1'(1)`K' {
        quietly replace `c' = (`m'*`pval')/(`m'+`i'-`K') if `i'==_n
    }
    quietly sum `c' if _n >= `km1'
    local cmin = r(min)
    quietly replace `hom_p' = `cmin' if `hom_p'<`cmin' & _n >=`km1'
    forvalues i=1(1)`km' {
        quietly replace `c' = min(`cmin',`m'*`pval') if `i'==_n
        quietly replace `hom_p'=`c' if `hom_p' < `c' & `i'==_n
    }
}
* -- end Hommel
display as text _newline "SORTED ORDER: before anomaly corrected"
display as text "Unadj ---------------------- Adjusted " _continue
display as text "---------------------------"
display as text "P Val    TCH  Homml  Finnr  Hochb  Ho-Si" _continue
display as text "   Holm  Sidak  Bonfr"
forvalues i=1(1)`K' {
    display as result %6.4f `pval'[`i'] %7.3f `tch_p'[`i'] _continue
    display as result %7.3f `hom_p'[`i'] %7.3f `f_p'[`i'] _continue
    display as result %7.3f `ho_p'[`i'] %7.3f `hs_p'[`i'] _continue
    display as result %7.3f `h_p'[`i'] %7.3f `s_p'[`i'] _continue
    display as result %7.3f `b_p'[`i']
}
quietly replace `h_p' = 1 if (`h_p' > 1)   // if Holm p > 1 (undefined) then set to 1
quietly replace `b_p' = 1 if (`b_p' > 1)   // if Bonferroni p > 1 (undefined) then set to 1
quietly replace `f_p' = 1 if (`f_p' > 1)   // if Finner p > 1 (undefined) then set to 1
quietly replace `f_p' = `f_p'[_n-1] ///
    if (_n>1 & `f_p'[_n] < `f_p'[_n-1])    // set to preceding p if smaller
quietly replace `hs_p' = `hs_p'[_n-1] ///
    if (_n>1 & `hs_p'[_n] < `hs_p'[_n-1])  // set to preceding p if smaller
quietly replace `h_p' = `h_p'[_n-1] ///
    if (_n>1 & `h_p'[_n] < `h_p'[_n-1])    // set to preceding p if smaller
* for Hochberg, set to preceding p if larger, working backwards
forvalues i=`K'(-1)1 {
    quietly replace `ho_p' = `ho_p'[`i'+1] ///
        if (`i'<`K' & `ho_p'[`i'] > `ho_p'[`i'+1]) & `i'==_n
}
display as text _newline "SORTED ORDER: anomaly corrected"
display as text " (1) If Finner or Holm or Bonfer P > 1 " _continue
display as text "(undefined) then set to 1"
display as text " (2) If Finner or Hol-Sid or Holm P < " _continue
display as text "preceding smaller P"
display as text "     (illogical) then set to preceding P"
display as text " (3) Working from largest to smallest," _continue
display as text " if Hochberg preceding"
display as text "     smaller P > P then set preceding smaller P to P"
display as text "Unadj ---------------------- Adjusted " _continue
display as text "---------------------------"
display as text "P Val    TCH  Homml  Finnr  Hochb  Ho-Si" _continue
display as text "   Holm  Sidak  Bonfr"
forvalues i=1(1)`K' {
    display as result %6.4f `pval'[`i'] %7.3f `tch_p'[`i'] _continue
    display as result %7.3f `hom_p'[`i'] %7.3f `f_p'[`i'] _continue
    display as result %7.3f `ho_p'[`i'] %7.3f `hs_p'[`i'] _continue
    display as result %7.3f `h_p'[`i'] %7.3f `s_p'[`i'] _continue
    display as result %7.3f `b_p'[`i']
}
sort `origorder'                  /* restore original input order */
display as text _newline "ORIGINAL ORDER: anomaly corrected"
display as text "Unadj ---------------------- Adjusted " _continue
display as text "---------------------------"
display as text "P Val    TCH  Homml  Finnr  Hochb  Ho-Si" _continue
display as text "   Holm  Sidak  Bonfr"
forvalues i=1(1)`K' {
    display as result %6.4f `pval'[`i'] %7.3f `tch_p'[`i'] _continue
    display as result %7.3f `hom_p'[`i'] %7.3f `f_p'[`i'] _continue
    display as result %7.3f `ho_p'[`i'] %7.3f `hs_p'[`i'] _continue
    display as result %7.3f `h_p'[`i'] %7.3f `s_p'[`i'] _continue
    display as result %7.3f `b_p'[`i']
}
display as text "-----------------------------------" _continue
display as text "------------------------------"
display as text "*Adjusted for " _N " multiple comparisons" _newline
display as text "KEY: TCH   = Tukey-Ciminera-Heyse procedure"
display as text "     (use TCH only with highly " _continue
display as text "correlated comparisons)"
display as text "     Homml = Hommel procedure"
display as text "     Finnr = Finner procedure"
display as text "     Hochb = Hochberg procedure"
display as text "     Ho-Si = Holm-Sidak procedure"
display as text "     Holm  = Holm procedure"
display as text "     Sidak = Sidak procedure"
display as text "     Bonfr = Bonferroni procedure"
restore
end
exit

Next, from inside the do-file editor, on the menu bar, click on,

File
  Save As…
    Save in: < see footnote >
    File name: mcpi
    Save as type: Ado files (*.ado)
    Save
-------
The goal for “Save in” is to save the file mcpi.ado in the directory C:\ado\personal. In Windows XP, the C: drive is listed as “Preload (C:)”; then you click on or create the “ado” subdirectory, then you click on or create the “personal” subdirectory.

mcpi.sthlp

If you do not have access to the mcpi.sthlp file, which is the help file for mcpi, then you can create it yourself. Cut-and-paste the following into the Stata do-file editor.

.help for ^mcpi^
.-
(Greg Stoddard)

Syntax for ^mcpi^
----------------------------------------------------------------------
^mcpi^ pvaluelist

where pvaluelist is a list of the p values to be adjusted, separated by spaces

Description
-----------
^mcpi^ computes several p value adjustment multiple comparison procedures and outputs three tables:

1) sorted multiple comparison adjusted p values after applying the procedure equation
2) sorted multiple comparison adjusted p values after correcting the values in the first table for anomalies (such as adjusted p > 1)
3) the final table, with the adjusted p values in the original sort order (this is the only table actually needed--the first two tables are useful only for verifying the calculations)

All procedures adjust for the number of comparisons equal to the number of p values in pvaluelist.
Multiple Comparison Procedures Used
-----------------------------------
TCH   = Tukey-Ciminera-Heyse procedure (Sankoh, 1997)
        (TCH assumes highly correlated comparisons)
Homml = Hommel procedure (Wright, 1992)
Finnr = Finner procedure (Finner, 1993)
Hochb = Hochberg procedure (Wright, 1992)
Ho-Si = Holm-Sidak procedure (Ludbrook, 1998)
Holm  = Holm procedure (Ludbrook, 1998)
Sidak = Sidak procedure (Ludbrook, 1998)
Bonfr = Bonferroni procedure (Ludbrook, 1998)

Suggested References

Finner H. On a monotonicity problem in step-down multiple test procedures. Journal of the American Statistical Association 1993;88:920-923.

Ludbrook J. Multiple comparison procedures updated. Clinical and Experimental Pharmacology and Physiology 1998;25:1032-1037.

Sankoh AJ, Huque MF, Dubey SD. Some comments on frequently used multiple endpoint adjustment methods in clinical trials. Statistics in Medicine 1997;16:2529-2542.

Wright SP. Adjusted p-values for simultaneous inference. Biometrics 1992;48:1005-1013.

Examples
--------
. mcpi .013 .023 .045 .150

Author
------
Greg Stoddard, University of Utah School of Medicine, Salt Lake City, Utah USA
Email: ^greg.stoddard@@hsc.utah.edu^

Next, from inside the do-file editor, on the menu bar, click on,

File
  Save As…
    Save in: < see footnote 1 >
    File name: mcpi
    Save as type: Help files (*.sthlp)   < see footnote 2 >
    Save
-------
1) The goal for “Save in” is to save the file mcpi.sthlp in the directory C:\ado\personal. In Windows XP, the C: drive is listed as “Preload (C:)”; then you click on or create the “ado” subdirectory, then you click on or create the “personal” subdirectory.
2) If you have Stata version 9, then use the “hlp” file extension.

fdri.ado

If you do not have access to the fdri.ado file, then you can create it yourself. Cut-and-paste the following into the Stata do-file editor.

* file: fdri.ado
* false discovery rate (FDR) multiple-comparison procedure immediate
* author: Greg Stoddard    updated: 23Mar2011
* compute adjusted p values using the Benjamini-Hochberg procedure
* syntax:
*   fdri <p-value list>, where "p-value list" is a list of p values to be adjusted
capture program drop fdri
program define fdri
version 10
preserve
clear
quietly set obs 0
tempvar pval
gen `pval'=.
local arg=1
local stop=0
while (`"``arg''"'~="" & `"``arg''"'~="," & `stop'==0 ) {
    quietly set obs `arg'
    quietly capture replace `pval'=``arg'' in l
    if `pval'[_N]==. {
        local stop=1
        quietly drop in l
    }
    local arg=`arg'+1
}
tempvar bh_p c origorder sortorder
quietly gen `origorder'=_n
local K=_N                        /* K comparisons, where K = # p-values entered */
sort `pval'
quietly gen `sortorder'=_n
quietly gen `bh_p' = (`K'/`sortorder')*`pval'     // Benjamini-Hochberg adjusted p
display as text _newline "P Value Adjustment for " _continue
display as text "Controlling False Discovery Rate"
display as text _newline "SORTED ORDER: before anomaly corrected"
display as text "Unadj   Adjusted"
display as text "P Val   BenHoc"
forvalues i=1(1)`K' {
    display as result %6.4f `pval'[`i'] %7.3f `bh_p'[`i']
}
* anomaly correction: set to preceding p if larger, working backwards
forvalues i=`K'(-1)1 {
    quietly replace `bh_p' = `bh_p'[`i'+1] ///
        if (`i'<`K' & `bh_p'[`i'] > `bh_p'[`i'+1]) & `i'==_n
}
display as text _newline "SORTED ORDER: anomaly corrected"
display as text "  Working from largest to" _continue
display as text " smallest, if Benjamini-Hochberg"
display as text "  preceding smaller P > P" _continue
display as text " then set preceding smaller P to P"
display as text "Unadj   Adjusted"
display as text "P Val   BenHoc"
forvalues i=1(1)`K' {
    display as result %6.4f `pval'[`i'] %7.3f `bh_p'[`i']
}
sort `origorder'                  /* restore original input order */
display as text _newline "ORIGINAL ORDER: anomaly corrected"
display as text "Unadj   Adjusted"
display as text "P Val   BenHoc"
forvalues i=1(1)`K' {
    display as result %6.4f `pval'[`i'] %7.3f `bh_p'[`i']
}
display as text "-----------------------------------------"
display as text "*Adjusted for " _N " multiple comparisons" _newline
display as text "KEY: BenHoc = Benjamini-Hochberg procedure"
restore
end
exit

Next, from inside the do-file editor, on the menu bar, click on,

File
  Save As…
    Save in: < see footnote >
    File name: fdri
    Save as type: Ado files (*.ado)
    Save
-------
The goal for “Save in” is to save the file fdri.ado in the directory C:\ado\personal. In Windows XP, the C: drive is listed as “Preload (C:)”; then you click on or create the “ado” subdirectory, then you click on or create the “personal” subdirectory.

fdri.sthlp

If you do not have access to the fdri.sthlp file, which is the help file for fdri, then you can create it yourself. Cut-and-paste the following into the Stata do-file editor.

.help for ^fdri^
.-
(Greg Stoddard)

Syntax for ^fdri^
----------------------------------------------------------------------
^fdri^ pvaluelist

where pvaluelist is a list of the p values to be adjusted, separated by spaces

Description
-----------
^fdri^ computes the Benjamini-Hochberg p value adjustment multiple comparison procedure to control the False Discovery Rate (FDR) and outputs three tables:

1) sorted multiple comparison adjusted p values after applying the procedure equation
2) sorted multiple comparison adjusted p values after correcting the values in the first table for anomalies (p values switching their order of magnitude after adjustment)
3) the final table, with the adjusted p values in the original sort order (this is the only table actually needed--the first two tables are useful only for verifying the calculations)

The procedure adjusts for the number of comparisons equal to the number of p values in pvaluelist.

Multiple Comparison Procedure Used
----------------------------------
BenHoc = Benjamini-Hochberg procedure (Benjamini and Hochberg, 1995)

Suggested Reference

Benjamini Y, Hochberg Y.
Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc, Series B (Methodological) 1995;57(1):289-300.

Example
-------
. fdri .013 .023 .045 .150

Author
------
Greg Stoddard, University of Utah School of Medicine, Salt Lake City, Utah USA
Email: ^greg.stoddard@@hsc.utah.edu^

Next, from inside the do-file editor, on the menu bar, click on,

File
  Save As…
    Save in: < see footnote 1 >
    File name: fdri
    Save as type: Help files (*.sthlp)   < see footnote 2 >
    Save
-------
1) The goal for “Save in” is to save the file fdri.sthlp in the directory C:\ado\personal. In Windows XP, the C: drive is listed as “Preload (C:)”; then you click on or create the “ado” subdirectory, then you click on or create the “personal” subdirectory.
2) If you have Stata version 9, then use the “hlp” file extension.