Topic 9: Multiple Comparisons
Multiple Comparisons of Treatment Means
Reading: 17.7-17.8

Overview
Brief Review of One-Way ANOVA
Pairwise Comparisons of Treatment Means
Multiplicity of Testing
Linear Combinations & Contrasts of Treatment Means

Review: One-Way ANOVA
Analysis of Variance (ANOVA) models provide an efficient way to compare multiple groups.
In a single-factor ANOVA, the model F-test tests the equality of all group means at the same time.
If this test is significant, then our next goal is to identify specific differences. This is the main topic of this lesson.

Review: Cell Means Model
The basic ANOVA model is:
$$Y_{ij} = \mu_i + \varepsilon_{ij}, \quad \text{where } \varepsilon_{ij} \sim N(0, \sigma^2)$$
Notation:
The "i" subscript indicates the level of the factor: $i = 1, 2, 3, \ldots, a$
The "j" subscript indicates the observation number within the group: $j = 1, 2, 3, \ldots, n_i$

Review: Factor Effects Model
$$Y_{ij} = \mu + \tau_i + \varepsilon_{ij}, \quad i = 1, 2, \ldots, a, \quad j = 1, 2, \ldots, n_i$$
$$\varepsilon_{ij} \sim N(0, \sigma^2), \qquad \sum_i \tau_i = 0$$
Relationship to cell means: $\mu_i = \mu + \tau_i$

Review: Notation
DOT indicates "sum".
BAR indicates "average" or "divide by cell/sample size".
$\bar{Y}_{\cdot\cdot}$ is the mean for all observations.
$\bar{Y}_{i\cdot}$ is the mean for the observations in Level i of Factor A.
Sometimes we omit the "dots" for brevity, but the meaning is the same.

Review: Components of Variation
Variation between groups gets "explained" by allowing the groups to have different means. This variation contributes to MSR (labeled MSA in the ANOVA table).
Variation within groups is unexplained, and contributes to MSE.
The ratio F = MSR / MSE forms the basis for testing the hypothesis that all group means are the same.

Review: Components of Variation
Of course the individual deviations would sum to zero, so we must square them.
It turns out that all cross-product terms cancel, and we have:
$$\underbrace{\sum_{i,j} (Y_{ij} - \bar{Y}_{\cdot\cdot})^2}_{SST} = \underbrace{\sum_{i,j} (\bar{Y}_{i\cdot} - \bar{Y}_{\cdot\cdot})^2}_{SSA\ \text{(between groups)}} + \underbrace{\sum_{i,j} (Y_{ij} - \bar{Y}_{i\cdot})^2}_{SSE\ \text{(within groups)}}$$

Review: ANOVA Table
Source      SS     DF       MS     F
Factor A    SSA    a - 1    MSA    MSA / MSE
Error       SSE    N - a    MSE
Total       SST    N - 1

Review: Model F Test
Null hypothesis (cell means): $H_0: \mu_1 = \mu_2 = \cdots = \mu_a$
Alternative hypothesis: $H_a$: there exists some pair of population means that are not equal.
If we conclude the alternative, then it makes sense to try to determine specific differences.
For the factor effects model: $H_0: \tau_i = 0$ for all i

Further Comparisons
The F-test is significant... what next?

Pairwise Comparisons
Generally our next step is to find out more specifics about the actual differences between treatment groups.
Which groups are actually different?
We can compare two groups by looking at the difference between their means.
$$H_0: \mu_i = \mu_j \qquad H_a: \mu_i \ne \mu_j$$

Pairwise Comparisons (2)
We can rewrite the null hypothesis as $\mu_i - \mu_j = 0$ and so proceed to look at the difference between means.
Estimate the difference by $\bar{Y}_{i\cdot} - \bar{Y}_{j\cdot}$. (Note that to this point, it is the same as a two-sample t-test.)
A critical value and standard error are all we need for a confidence interval.

Variance for Difference
Recall that the variance of the mean of any given sample is $\sigma^2 / n$.
So if we take the difference in means for two of our samples, the variance will be
$$\operatorname{Var}(\bar{Y}_{i\cdot} - \bar{Y}_{j\cdot}) = \frac{\sigma^2}{n_i} + \frac{\sigma^2}{n_j}$$
Remember we have assumed equal variances across groups, but we don't know $\sigma^2$.

SE for Difference in Means
Estimate $\sigma^2$ by the MSE and then take the square root in order to get the SE:
$$SE(\bar{Y}_{i\cdot} - \bar{Y}_{j\cdot}) = \sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$
If the cell sizes happen to be equal:
$$SE(\bar{Y}_{i\cdot} - \bar{Y}_{j\cdot}) = \sqrt{\frac{2\,MSE}{n}}$$

Confidence Interval
So the confidence interval will be
$$\bar{Y}_{i\cdot} - \bar{Y}_{j\cdot} \pm t_{CRIT}\sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$
Is the use of a t critical value appropriate? What critical value should be used?

Multiple Comparisons
We need to compare all of the treatment means. How many comparisons is this?
Suppose we decide to just look at the "largest" difference? Does this mean we don't need to adjust for multiple comparisons?

Multiple Comparisons (2)
The fact that we are effectively doing a large number of pairwise comparisons means:
Each test run at α = 0.05 carries a 5% chance of a Type I error (showing a difference where in reality none exists).
The overall Type I error rate (the chance of at least one Type I error) will be much larger than 5%.
Effectively, the testing procedure becomes biased in favor of rejecting at least one $H_0$.

Valid Approaches
How do we adjust for this multiplicity issue?
Least Significant Differences procedure (unadjusted!): relies on a significant F-test.
Bonferroni adjustment: turns out to be too conservative for all pairwise comparisons.
Tukey adjustment: best for all pairwise comparisons; usually best for our class because we usually compare all pairs.
Dunnett adjustment: appropriate for comparing each treatment to a control (fewer tests).

Least Significant Differences
No adjustment. The LSD procedure goes as follows:
Verify that the model F-test is significant to confirm the existence of differences.
Unadjusted differences are used (t-tests).
At a minimum, the means that are the furthest apart are presumed to be different.
Note: Our textbook mislabels this (what they call LSD is actually Bonferroni-adjusted LSD).

LSD (2)
If we use the LSD procedure, then we are NOT making any formal adjustment.
Type I error IS inflated by the number of tests.
Some things we can do:
Use strict requirements on the F-test: use α = 0.005 instead of α = 0.05.
Additionally, we could strengthen the requirements on the t-tests: use α = 0.01.
Neither is a formal adjustment; Type I error is uncontrolled.

LSD: Why It Works
Some p-values are stronger than others.
When the F-test is "very" significant, we can be more sure that some groups do have different means, and LSD will find those.
We are informally adjusting for multiplicity by "strengthening" our requirements for alpha.
This works well when we are exploring (perhaps to be followed by a more rigorous study) and are not too concerned about Type I errors.

Example
Treatment 1 Mean = 13
Treatment 2 Mean = 27
Treatment 3 Mean = 14
Treatment 4 Mean = 24
Overall F-test: significant, p < 0.001
Pairwise tests (p-values):
1v2: <0.001    1v3: 0.8721    1v4: <0.001
2v3: <0.001    2v4: 0.0473    3v4: <0.001

Example (2)
There are two clear groups here: (1,3) and (2,4). Between these groups the differences are clear.
Because the p-value for 2v4 is so borderline, we should not consider these to be different.

Lines Plot (Example)
A convenient way to represent this information is via a "lines" plot.
Treatment    Mean    Grouping
TRT 2        27      A
TRT 4        24      A
TRT 3        14      B
TRT 1        13      B

Lines Plot (2)
There can be overlapping groups. For example, we might wind up with something like:
Treatment    Mean    Grouping
TRT 2        27      A
TRT 4        24      A
TRT 5        19      A B
TRT 3        14        B
TRT 1        13        B
TRT 6         1          C

Bonferroni Adjustment
Still uses a t critical value, but we formally adjust our t-tests and use a Bonferroni t.
There are a(a - 1)/2 pairwise tests. Divide alpha by this number for the pairwise comparisons (this can be expensive):
6 treatments, 15 pairs: effective α = 0.05/15 = 0.00333
8 treatments, 28 pairs: effective α = 0.05/28 = 0.00179
We are formally adjusting the t critical value to avoid Type I error inflation.

Bonferroni (2)
The advantage here is that you don't need to worry about the F-test. (It is possible to have significant t-tests without a significant F-test!)
Bonferroni works best when:
you are only interested in a few of the comparisons (not all pairs are being compared, so you don't have to break up α as much), and
you have planned your tests in advance (you know which ones you want to compare before the analysis).

Comparison: LSD vs. Bonferroni
Control of the Type I error rate?
Power?

Tukey's Method
Concept: The pairwise comparisons are dependent (they involve the same means). We can take advantage of that dependence to get more power than a Bonferroni adjustment (with the same alpha).
The change is in the critical value. Instead of a t distribution, we use the studentized range distribution (q).
Critical values are in Table A-6 (similar to the F tables); to get a usable critical value Q we must divide the tabled q by $\sqrt{2}$:
$$Q = \frac{q_{\alpha,\, a,\, N-a}}{\sqrt{2}}$$

Tukey's Method (2)
Our CI becomes:
$$\bar{Y}_{i\cdot} - \bar{Y}_{j\cdot} \pm Q\sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$
For all pairwise comparisons, this CI will be narrower than the Bonferroni interval, but still wider than the LSD interval, since it does control the overall Type I error rate.
The Tukey method can only be used for pairwise comparisons of means.
It also works better when cell sizes are equal.
It is best for all pairwise comparisons!

Tukey vs. Bonferroni
Remember the only thing that changes is the critical value!
Tukey is always better if you are doing ALL pairwise comparisons.
If you only need a small number (planned in advance), Bonferroni can be superior.
So by comparing the critical values (Bonferroni t vs. Tukey Q) you can see which method is advantageous (you'll do this in the homework).
The smaller critical value gives more power!

Minimum Significant Differences
Because of the structure of the confidence interval, zero will be included in the interval if and only if the absolute difference in means is less than:
$$CRIT \times \sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}$$
Or, if the cell sizes are the same:
$$CRIT \times \sqrt{\frac{2\,MSE}{n}}$$

Minimum Significant Difference (2)
This is the half-width of the CI, and is called the minimum significant difference.
Any two means that differ by a larger value will be considered statistically different.
Note that this value will generally be shown in the SAS output, and it depends upon the comparison method in use.
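
As a worked check of the minimum significant difference formula, the sketch below plugs in the summary values reported for the blood type example in the SAS output later in these slides (MSE = 2.083333 on 8 error degrees of freedom, n = 3 per group, studentized range critical value 4.52880, Bonferroni t critical value 3.47888). The intermediate rounding is ours, not part of the original output.

```latex
\begin{align*}
\sqrt{\frac{2\,MSE}{n}} &= \sqrt{\frac{2(2.083333)}{3}} \approx 1.1785 \\[4pt]
\text{Tukey:}\quad Q &= \frac{q}{\sqrt{2}} = \frac{4.52880}{\sqrt{2}} \approx 3.2024,
  \qquad \text{MSD} = Q\sqrt{\frac{2\,MSE}{n}} \approx 3.2024 \times 1.1785 \approx 3.774 \\[4pt]
\text{Bonferroni:}\quad t &= 3.47888,
  \qquad \text{MSD} = t\sqrt{\frac{2\,MSE}{n}} \approx 3.47888 \times 1.1785 \approx 4.100
\end{align*}
```

Both values agree with the Minimum Significant Difference lines printed by SAS for that example (3.774 and 4.0999). The smaller Tukey value illustrates why Tukey is preferred when all pairs are compared.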

Example
Suppose that you have six treatment groups and the treatment means are:
TRT 1: 52    TRT 2: 76    TRT 3: 58    TRT 4: 54    TRT 5: 83    TRT 6: 46
Suppose we want to compare all 6 treatments. Which adjustment is appropriate? ______
From this adjustment, we calculate the Minimum Significant Difference as 10.
Which groups are significantly different? Construct a "LINES" plot.

Example (2)
First sort the means (increasing or decreasing order):
Treatment    Mean    Grouping
TRT 5        83
TRT 2        76
TRT 3        58
TRT 4        54
TRT 1        52
TRT 6        46

Example (3)
Now, starting at the top, form the first group (remember the Tukey MSD is 10).
Treatment    Mean    Grouping
TRT 5        83      A
TRT 2        76      A
TRT 3        58        B
TRT 4        54
TRT 1        52
TRT 6        46

Example (4)
Continue down the table (algorithmically):
Treatment    Mean    Grouping
TRT 5        83      A
TRT 2        76      A
TRT 3        58        B
TRT 4        54        B C
TRT 1        52        B C
TRT 6        46          C

Example (5)
Notice that when a group ends, you simply drop down to the next group mean and start comparing again.
It is not unusual at all to have some overlap between groups, so you may have to check back against the groups above.
Remember this process only works for cell sizes that are the same (or very similar). WHY?

Dunnett's Method
Specifically designed for comparing each treatment to a control group!
Based on another distribution (similar to Tukey) that reflects the dependence between these a - 1 tests.
Just as Tukey is for "all pairwise comparisons", Dunnett is the most powerful method for "treatment vs. control" comparisons.
Our book does not have these critical values, but it is easy to use Dunnett in SAS (and it will provide you with the minimum significant difference as well).

Example
Suppose in our previous example, treatment 6 was a control. We should have used Dunnett's instead of Tukey.
We calculate the Dunnett MSD as 7.
Treatment    Mean
TRT 5        83
TRT 2        76
TRT 3        58
TRT 4        54
TRT 1        52
CONTROL      46
Which groups are now different?

Summary: Pairwise Comps.
For pairwise comparison of treatments:
Dunnett is the most powerful if considering treatments versus a control.
Tukey is the most powerful if considering ALL pairwise comparisons.
Bonferroni should only be used if you have a relatively small number of pre-planned comparisons of interest.
LSD is appropriate for exploratory studies (to be followed up by a better-planned study).

SAS Code & Output
A MEANS statement is added to PROC GLM in order to compare levels of a variable listed in the CLASS statement.
    proc glm data=bloodtype;
      class type;
      model resp=type / solution;
      means type / tukey lines;
    run;

Other Options / Formatting
BON: use Bonferroni instead of Tukey (will produce full output, but you should want only part of it, right?)
ALPHA=: changes your significance level
CLM: calls for CIs for the means (BON would apply)
CLDIFF: calls for CIs for the differences
DUNNETT <('xxx')>: uses Dunnett's method, where xxx is the name of the control group
DUNNETTU / DUNNETTL: use these if you want one-sided comparisons (strictly better or worse than control)

Output (Tukey, Lines)
Blood Type Example
Alpha                                   0.05
Error Degrees of Freedom                8
Error Mean Square                       2.083333
Critical Value of Studentized Range     4.52880
Minimum Significant Difference          3.774
Means with the same letter are not significantly different.
GROUP    Mean      N    type
A        33.667    3    B
A        32.667    3    AB
B        27.667    3    A
C        22.667    3    O

Output (CLDIFF, BON)
NOTE: This test controls the Type I experimentwise error rate, but it generally has a higher Type II error rate than Tukey's for all pairwise comparisons.
Alpha                                   0.05
Error Degrees of Freedom                8
Error Mean Square                       2.083333
Critical Value of t                     3.47888
Minimum Significant Difference          4.0999
Comparisons significant at the 0.05 level are indicated by ***.

Output (CLDIFF, BON) (2)
type          Difference       Simultaneous 95%
Comparison    Between Means    Confidence Limits
B  - AB          1.000          -3.100     5.100
B  - A           6.000           1.900    10.100    ***
B  - O          11.000           6.900    15.100    ***
AB - B          -1.000          -5.100     3.100
AB - A           5.000           0.900     9.100    ***
AB - O          10.000           5.900    14.100    ***
A  - B          -6.000         -10.100    -1.900    ***
A  - AB         -5.000          -9.100    -0.900    ***
A  - O           5.000           0.900     9.100    ***
O  - B         -11.000         -15.100    -6.900    ***
O  - AB        -10.000         -14.100    -5.900    ***
O  - A          -5.000          -9.100    -0.900    ***

Different Sample Sizes
For illustration, we just delete one of the points from Type B. Sample sizes are now 3, 2, 3, 3.
What will happen to the CIs?

Different Sample Sizes
Tukey's Studentized Range (HSD) Test for resp
Alpha                                   0.05
Error Degrees of Freedom                7
Error Mean Square                       2
Critical Value of Studentized Range     4.68124
Minimum Significant Difference          4.0541
Harmonic Mean of Cell Sizes             2.666667
NOTE: Cell sizes are not equal.
Means with the same letter are not significantly different.
GROUP    Mean      N    type
A        33.000    2    B
A        32.667    3    AB
B        27.667    3    A
C        22.667    3    O

Confidence Limits
Tukey's Studentized Range (HSD) Test for resp
Alpha                                   0.05
Error Degrees of Freedom                7
Error Mean Square                       2
Critical Value of Studentized Range     4.68124
type          Difference       Simultaneous 95%
Comparison    Between Means    Confidence Limits
B  - AB          0.333          -3.940     4.607
B  - A           5.333           1.060     9.607    ***
B  - O          10.333           6.060    14.607    ***
AB - A           5.000           1.178     8.822    ***
AB - O          10.000           6.178    13.822    ***
A  - O           5.000           1.178     8.822    ***

Confidence Limits
Confidence limits involving Type B are of width 8.55. Those not involving Type B are of width 7.64. Why?

Questions?

Beyond Pairwise Comparisons
We may want to compare "groupings" of the means rather than individual means.
This involves linear combinations and contrasts of means.

Linear Combination of Means
A linear combination of means is the sum of means that have been multiplied by constants.
Constants may be anything you like.
Sometimes some of them will be zero.
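If the constants sum to zero, then we call the linear combination a contrast.
You should note that any pairwise comparison is a contrast.

Linear Combination (2)
Consider the fixed effects model
$$Y_{ij} = \mu_i + \varepsilon_{ij} = \mu + \tau_i + \varepsilon_{ij}$$
It is not difficult to conduct a hypothesis test related to any linear combination of means that we choose:
$$L = \sum_i c_i \mu_i$$

Linear Combinations (Examples)
$H_0: \mu_1 + \mu_2 = \mu_3 + \mu_4$
$H_0: 3\mu_1 = \mu_2 + \mu_3 + \mu_4$
$H_0: 2\mu_1 + 7\mu_3 + 0.69\mu_4 = 0$
$H_0: \frac{1}{a}\sum_{i=1}^{a} \mu_i = \mu_a$
$H_0: \mu_i = \mu_j$

Linear Combinations (Example)
Take one example: $H_0: \mu_1 + \mu_2 = \mu_3 + \mu_4$
Let's put it in "standard" form: $H_0: 1\mu_1 + 1\mu_2 - 1\mu_3 - 1\mu_4 = 0$
Do the constants sum to zero? What does this mean?
Contrasts are "fair" comparisons.
Not all linear combinations are contrasts.

Linear Combinations (Examples)
$H_0: \mu_1 + \mu_2 = \mu_3 + \mu_4$
$H_0: 3\mu_1 = \mu_2 + \mu_3 + \mu_4$
$H_0: 2\mu_1 + 7\mu_3 + 0.69\mu_4 = 0$
$H_0: \frac{1}{a}\sum_{i=1}^{a} \mu_i = \mu_a$
$H_0: \mu_i = \mu_j$
Which of these are contrasts?

Construction of the t-test
Our statistic under $H_0$ has a t distribution with N - a (error) degrees of freedom:
$$\hat{L} = \sum_i c_i \bar{Y}_{i\cdot}, \qquad t_0 = \frac{\hat{L} - L_0}{\sqrt{\widehat{\operatorname{Var}}(\hat{L})}}$$
$$\operatorname{Var}(\hat{L}) = \operatorname{Var}\Big(\sum_i c_i \bar{Y}_{i\cdot}\Big) = \sum_i c_i^2 \operatorname{Var}(\bar{Y}_{i\cdot}) = \sigma^2 \sum_i \frac{c_i^2}{n_i}, \qquad \widehat{\operatorname{Var}}(\hat{L}) = MSE \sum_i \frac{c_i^2}{n_i}$$

Linear Combinations (Example)
Take one example: $H_0: 1\mu_1 + 1\mu_2 - 1\mu_3 - 1\mu_4 = 0$, so $L_0 = 0$.
$$\hat{L} = 1\bar{Y}_{1\cdot} + 1\bar{Y}_{2\cdot} - 1\bar{Y}_{3\cdot} - 1\bar{Y}_{4\cdot}, \qquad t_0 = \frac{\hat{L} - L_0}{\sqrt{\widehat{\operatorname{Var}}(\hat{L})}}$$
$$\widehat{\operatorname{Var}}(\hat{L}) = MSE\left(\frac{(1)^2}{n_1} + \frac{(1)^2}{n_2} + \frac{(-1)^2}{n_3} + \frac{(-1)^2}{n_4}\right)$$

Linear Combinations (Pairwise Example)
Another example: $H_0: \mu_i - \mu_j = 0$, so $L_0 = 0$.
$$\hat{L} = \bar{Y}_{i\cdot} - \bar{Y}_{j\cdot}, \qquad t_0 = \frac{\bar{Y}_{i\cdot} - \bar{Y}_{j\cdot}}{\sqrt{MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)}}$$
$$\widehat{\operatorname{Var}}(\hat{L}) = MSE\left(\frac{(1)^2}{n_i} + \frac{(-1)^2}{n_j}\right) = MSE\left(\frac{1}{n_i} + \frac{1}{n_j}\right)$$

Why t-test instead of overall F-test?
With t-tests, you can address the specific hypotheses that you are interested in rather than just testing the overall equality of means.
A note on the F-test:
The ANOVA F-test in reality jointly tests all possible contrasts.
This decreases the power relative to what we would get if we only tested the contrasts of interest to the experiment.
This is why, on occasion, individual t-tests may be significant while the overall F-test is not.
The ANOVA F-test just can't look closely enough to see what is going on!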
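
To make the t-test construction concrete, here is a worked sketch using the blood type summary values that appear in the SAS output in these slides (means 27.667, 32.667, 33.667, 22.667 for types A, AB, B, O; n = 3 per group; MSE = 2.083333 on 8 error degrees of freedom). The contrast chosen, $L = \mu_A - \mu_{AB} - \mu_B + \mu_O$, is the one used in the SAS contrast example; the rounding is ours.

```latex
\begin{align*}
\hat{L} &= 27.667 - 32.667 - 33.667 + 22.667 = -16.000 \\[4pt]
\widehat{\operatorname{Var}}(\hat{L}) &= MSE \sum_i \frac{c_i^2}{n_i}
  = 2.083333 \left(\frac{1 + 1 + 1 + 1}{3}\right) \approx 2.7778,
  \qquad SE(\hat{L}) = \sqrt{2.7778} \approx 1.6667 \\[4pt]
t_0 &= \frac{\hat{L} - 0}{SE(\hat{L})} = \frac{-16.000}{1.6667} \approx -9.60
  \quad \text{on } N - a = 8 \text{ degrees of freedom}
\end{align*}
```

These match the Estimate, Std Error, and t Value that PROC GLM reports for this contrast, and $t_0^2 = 92.16$ matches the F Value on the corresponding CONTRAST line.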

Multiplicity Issues
Because we are often looking at multiple tests or confidence intervals, if we use standard t critical values the overall Type I error rate (as we've seen in the past) will not be well controlled.
Another issue is that not all of the linear combinations you test will be independent. This actually turns out to be a good thing, because it is possible to take advantage of the dependencies in developing adjustments (e.g., Tukey or Dunnett).

Multiplicity Issues (2)
Another issue of particular importance is data snooping.
This is often done in an exploratory study where we want to search for differences.
In this case, we'll probably decide what to test after seeing the sample means.
By doing this, we effectively perform all possible tests, and as we've discussed before, the testing procedure becomes biased in favor of rejecting the null for at least one test.

Can we data snoop?
It turns out that in some cases, we can "data snoop" in a fair and reasonable manner.
We've already seen that the Tukey adjustment may be used to perform all pairwise comparisons (we sacrifice a bit of power for control of alpha).
It is possible to expand this to all possible contrasts using a Scheffé adjustment.

Scheffé's Method
Scheffé's method obtains a critical value S that may be used to set up simultaneous CIs for all contrasts.
Again, we sacrifice power for control of the significance level.
The critical value is based on the F distribution (the upper-α critical value on a - 1 and N - a degrees of freedom):
$$S = \sqrt{(a - 1)\,F_{\alpha,\, a-1,\, N-a}}$$
The CI is given by:
$$\sum_i c_i \bar{Y}_{i\cdot} \pm S\sqrt{MSE \sum_i \frac{c_i^2}{n_i}}$$

Scheffé's Method (2)
Remember: to apply Scheffé, you MUST have a contrast.
That is, you must have:
$$\sum_i c_i = 0$$
Scheffé is chosen when you have unplanned contrasts.
It is also chosen AND recommended even for pairwise comparisons if you have vastly different cell sizes.

Comparison of Methods
The LSD procedure will always have the most power (but won't control Type I errors). It is usually used for exploratory studies to be followed by a better-planned experiment.
Bonferroni will be the most powerful for a few pre-planned comparisons while controlling the Type I error rate.
Tukey will be the most powerful for all pairwise comparisons while controlling the Type I error rate.

Comparison of Methods
Dunnett will be the most powerful for comparing treatments to a control while controlling the Type I error rate.
Scheffé will usually be the least powerful! But it will control the Type I error rate for ALL CONTRASTS.
It allows data snooping! It is also useful if cell sizes are vastly different.

General Form of Test / CI
A confidence interval for any linear combination may be obtained by considering:
$$\sum_i c_i \bar{Y}_{i\cdot} \pm CRIT\sqrt{MSE \sum_i \frac{c_i^2}{n_i}}$$
As long as we make an appropriate choice for the critical value, everything else is identical.

Contrasts in SAS
Consider testing whether the B/AB groups are the same as the A/O groups in the blood type example.
$$L_1 = \mu_A - \mu_{AB} - \mu_B + \mu_O$$
$$L_2 = \tfrac{1}{2}\mu_A - \tfrac{1}{2}\mu_{AB} - \tfrac{1}{2}\mu_B + \tfrac{1}{2}\mu_O$$
    proc glm data=bloodtype;
      class type;
      model resp=type;
      contrast 'L1' type 1 -1 -1 1;
      contrast 'L2' type 0.5 -0.5 -0.5 0.5;
      estimate 'L1' type 1 -1 -1 1;
      estimate 'L2' type 0.5 -0.5 -0.5 0.5;
    run;

SAS Output
Contrast    DF    Contrast SS    Mean Square    F Value    Pr > F
L1           1    192.00000      192.00000      92.16      <.0001
L2           1    192.00000      192.00000      92.16      <.0001

Parameter    Estimate    Std Error    t Value    Pr > |t|
L1           -16.000     1.66667      -9.60      <.0001
L2            -8.000     0.83333      -9.60      <.0001

SAS Output (comments)
The only difference is that one set of estimates is double the other. Test statistics and p-values are the same.
Scheffé is not (and for some reason cannot be) applied.
So if this is an unplanned comparison, you would need to use the estimate and SE, along with the appropriate Scheffé critical value, to develop your CI.

Example
Suppose we did want to use Scheffé.
Get the F critical value on 3 and 8 DF from the tables: 4.07
Take $S = \sqrt{(a - 1)F} = \sqrt{3 \times 4.07} = 3.49$
The Scheffé-adjusted CI is $-16 \pm 3.49 \times 1.667 = (-21.8, -10.2)$
Conclude the groupings are different.

Summary: Scheffé
Scheffé's main advantage is that it works for UNPLANNED contrasts. It effectively "allows" data snooping.
It is still better to have a few PLANNED contrasts or linear combinations. Then you can apply Bonferroni with a little bit more power.

Questions?

CLG Activity
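
To tie the methods together, the following SAS data step is a minimal sketch of how the competing critical values could be computed side by side for the blood type example. It assumes a = 4 groups, 8 error degrees of freedom, MSE = 2.083333 and n = 3 per group (all copied from the output in these slides); the data set name `crit` and all variable names are hypothetical. It relies on the Base SAS functions TINV, FINV and PROBMC, and it is an illustration rather than part of the original course code.

```sas
/* Sketch only: critical values and minimum significant differences   */
/* for all pairwise comparisons in the blood type example.            */
/* Data set and variable names are hypothetical; a, dfe, mse and n    */
/* are assumptions copied from the PROC GLM output in these slides.   */
data crit;
  alpha = 0.05;
  a     = 4;            /* number of treatment groups              */
  dfe   = 8;            /* error degrees of freedom, N - a         */
  mse   = 2.083333;     /* error mean square                       */
  n     = 3;            /* common cell size                        */

  m  = a*(a - 1)/2;     /* number of pairwise comparisons          */
  se = sqrt(2*mse/n);   /* SE of a difference between two means    */

  t_lsd = tinv(1 - alpha/2, dfe);                      /* unadjusted t        */
  t_bon = tinv(1 - alpha/(2*m), dfe);                  /* Bonferroni t        */
  q     = probmc('RANGE', ., 1 - alpha, dfe, a);       /* studentized range q */
  q_tuk = q/sqrt(2);                                   /* Tukey Q = q/sqrt(2) */
  s_sch = sqrt((a - 1)*finv(1 - alpha, a - 1, dfe));   /* Scheffe S           */

  /* Minimum significant difference = critical value times SE */
  msd_lsd = t_lsd*se;
  msd_bon = t_bon*se;
  msd_tuk = q_tuk*se;
  msd_sch = s_sch*se;
run;

proc print data=crit noobs;
  var t_lsd t_bon q q_tuk s_sch msd_lsd msd_bon msd_tuk msd_sch;
run;
```

If the assumptions above hold, `t_bon` and `q` should agree with the critical values printed in the Bonferroni and Tukey output (3.47888 and 4.52880), and `s_sch` should be close to the 3.49 used in the Scheffé example. Ordering the four `msd_` values from smallest to largest is one way to see the power ranking described in the comparison of methods.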