One-way Analysis of Variance (ANOVA) (Chapter 8, Daniel) Suppose we wish to compare k population means ( k 2 ). This situation can arise in two ways. If the study is observational, we are obtaining independently drawn samples from k distinct populations and we wish to compare the population means for some numerical response of interest. If the study is experimental, then we are using a completely randomized design to obtain our data from k distinct treatment groups. In a completely randomized design the experimental units are randomly assigned to one of k treatments and the response value from each unit is obtained. The mean of the numerical response of interest is then compared across the different treatment groups. There two main questions of interest: 1) Are there at least two population means that differ? 2) If so, which population means differ and how much do they differ by? More formally: H o : 1 2 ... k H a : at least two population means differ, i.e. i j for some i j. If we reject the null then we use comparative methods to answer question 2 above. Assumptions: 1. Samples are drawn independently (completely randomized design) 2. Population variances are equal, i.e. 12 22 k2 . 3. Populations are normally distributed. We examine checking these assumptions in the examples that follow. The test procedure compares the variation in observations between samples to the variation within samples. If the variation between samples is large relative to the variation within samples we are likely to conclude that the population means are not all equal. The diagrams below illustrate this idea... Between Group Variation >> Within Group Variation (Conclude population means differ) Between Group Variation Within Group Variation (Fail to conclude the population means differ) The name analysis of variance gets its name because we are using variation to decide whether the population means differ. 1 Within Group Variation To measure the variation within groups we use the following formula: (n1 1) s12 (n2 1) s22 (nk 1) sk2 s n1 n2 nk k 2 W Note: This is an extension of pooled variance estimate from the two-sample t-Test 2 for independent samples ( s p ) to more than two populations. This is an estimate of the variance common to all k populations ( 2 ). Between Group Variation To measure the variation between groups we use the following formula: n1 ( x1 x ) 2 n2 ( x2 x ) 2 nk ( xk x ) 2 s k 1 where, 2 B xi sample mean response for the i th treatment group x grand mean of the response from all treatment groups The larger the differences between the k sample means for the treatment groups are, the larger s B2 will be. If the null hypothesis is true, i.e. the k treatments have the same population mean, we expect the between group variation to the same as the within group variation ( s B2 sW2 ). If s B2 sW2 we have evidence against the null hypothesis in support of the alternative, i.e. we have evidence to suggest that at least two of the population means are different. Test Statistic F n = overall sample size s B2 ~ F-distribution with numerator df = k -1 and denominator df = n – k sW2 The larger F is the strong the evidence against the null hypothesis, i.e. Big FReject Ho. 2 EXAMPLE 1 - Weight Gain in Anorexia Patients Data File: Anorexia.JMP These data give the pre- and post-weights of patients being treated for anorexia nervosa. There are actually three different treatment plans being used in this study, and we wish to compare their performance in terms of the mean weight gain of the patients. The patients in this study were randomly assigned to one of three therapies. The variables in the data file are: group - treatment group (1 = Family therapy, 2 = Standard therapy, 3 = Cognitive Behavioral therapy) Treatment – treatment group by name (Behavioral, Family, Standard) prewt - weight at the beginning of treatment postwt - weight at the end of the study period Weight Gain - weight gained (or lost) during treatment (postwt-prewt) We begin our analysis by examining comparative displays for the weight gained across the three treatment methods. To do this select Fit Y by X from the Analyze menu and place the grouping variable, Treatment, in the X box and place the response, Weight Gain, in the Y box and click OK. Here boxplots, mean diamonds, normal quantile plots, and comparison circles have been added. Things to consider from this graphical display: Do there appear to be differences in the mean weight gain? Are the weight changes normally distributed? Is the variation in weight gain equal across therapies? 3 Checking the Equality of Variance Assumption To test whether it is reasonable to assume the population variances are equal for these three therapies select UnEqual Variances from the Oneway Analysis pull down-menu. Equality of Variance Test H o : 12 22 k2 H a : the population variances are not all equal We have no evidence to conclude that the variances/standard deviations of the weight gains for the different treatment programs differ (p >> .05). ONE-WAY ANOVA TEST FOR COMPARING THE THERAPY MEANS To test the null hypothesis that the mean weight gain is the same for each of the therapy methods we will perform the standard one-way ANOVA test. To do this in JMP select Means, Anova/t-Test from the Oneway Analysis pull-down menu. The results of the test are shown in the Analysis of Variance box. s B2 307.22 sW2 56.68 The between group variation is 5.42 times larger than the within group variation! This provides strong evidence against the null hypothesis (p = .0065). The p-value contained in the ANOVA table is .0065, thus we reject the null hypothesis at the .05 level and conclude statistically significant differences in the mean weight gain experienced by patients in the different therapy groups exist. MULTIPLE COMPARISONS Because we have concluded that the mean weight gains across treatment method are not all equal it is natural to ask the secondary question: Which means are significantly different from one another? We could consider performing a series of two-sample t-Tests and constructing confidence intervals for independent samples to compare all possible pairs of means, however if the number of treatment groups is large we will almost certainly find two treatment means as being significantly different. Why? Consider a situation where we have k = 7 different treatments that we wish to compare. To compare all possible pairs of means (1 vs. 2, 1 k (k 1) 21 two-sample t-Tests. If vs. 3, …, 6 vs. 7) would require performing a total of 2 we used P(Type I Error) .05 for each test we expect to make 21(.05) 1 Type I Error, i.e. we expect to find one pair of means as being significantly different when in fact they are not. This problem only becomes worse as the number of groups, k , gets larger. 4 Experiment-wise Error Rate Another way to think about this is to consider the probability of making no Type I Errors when making our pair-wise comparisons. When k = 7 for example, the probability of making no Type I Errors is (.95) 21 .3406 , i.e. the probability that we make at least one Type I Error is therefore .6596 or a 65.96% chance. Certainly this unacceptable! Why would conduct a statistical analysis when you know that you have a 66% of making an error in your conclusions? This probability is called the experiment-wise error rate. Bonferroni Correction There are several different ways to control the experiment-wise error rate. One of the easiest ways to control experiment-wise error rate is use the Bonferroni Correction. If we plan on making m comparisons or conducting m significance tests the Bonferroni Correction is to simply use as our significance level rather than . This simple m correction guarantees that our experiment-wise error rate will be no larger than . This correction implies that our p-values will have to be less than rather than to be m considered statistically significant. Multiple Comparison Procedures for Pair-wise Comparisons of k Population Means When performing pair-wise comparison of population means in ANOVA there are several different methods that can be employed. These methods depend on the types of pair-wise comparisons we wish to perform. The different types available in JMP are summarized briefly below: Compare each pair using the usual two-sample t-Test for independent samples. This choice does not provide any experiment-wise error rate protection! (DON’T USE) Compare all pairs using Tukey’s Honest Significant Difference (HSD) approach. This is best choice if you are interested comparing each possible pair of treatments. Compare with the means to the “Best” using Hsu’s method. The best mean can either be the minimum (if smaller is better for the response) or maximum (if bigger is better for the response). Compare each mean to a control group using Dunnett’s method. Compares each treatment mean to a control group only. You must identify the control group in JMP by clicking on an observation in your comparative plot corresponding to the control group before selecting this option. 5 Multiple Comparison Options in JMP EXAMPLE 1 - Weight Gain in Anorexia Patients (cont’d) For these data we are probably interested in comparing each of the treatments to one another. For this we will use Tukey’s multiple comparison procedure for comparing all pairs of population means. Select Compare Means from the Oneway Analysis menu and highlight the All Pairs, Tukey HSD option. Beside the graph you will now notice there are circles plotted. There is one circle for each group and each circle is centered at the mean for the corresponding group. The size of the circles are inversely proportional to the sample size, thus larger circles will drawn for groups with smaller sample sizes. These circles are called comparison circles and can be used to see which pairs of means are significantly different from each other. To do this, click on the circle for one of the treatments. Notice that the treatment group selected will be appear in the plot window and the circle will become red & bold. The means that are significantly different from the treatment group selected will have circles that are gray. These color differences will also be conveyed in the group labels on the horizontal axis. In the plot below we have selected Standard treatment group. 6 The results of the pair-wise comparisons are also contained the output window. The matrix labeled Comparison for all pairs using Tukey-Kramer HSD identifies pairs of means that are significantly different using positive entries in this matrix. Here we see only treatments 2 and 3 significantly differ. The next table conveys the same information by using different letters to represent populations that have significantly different means. Notice treatments 2 and 3 are not connected by the same letter so they are significantly different. Finally the CI’s in the Ordered Differences section give estimates for the differences in the population means. Here we see that mean weight gain for patients in treatment 3 is estimated to be between 2.09 lbs. and 13.34 lbs. larger than the mean weight gain for patients receiving treatment 2 (see highlighted section below). 7 EXAMPLE 2 – Motor Skills in Children and Lead Exposure These data come from a study of children living in close proximity to a lead smelter in El Paso, TX. A study was conducted in 1972-73 where children living near the lead smelter were sampled and their blood lead levels were measured in both 1972 and in1973. IQ tests and motor skills tests were conducted on these children as well. In this example we will examine difference in finger-wrist tap scores for three groups of children. The groups were determined using their blood lead levels from 1972 and 1973 as follows: Lead Group 1 = the children in this group had blood levels below 40 micrograms/dl in both 1972 and 1973 (we can think of these children as a “control group”). Lead Group 2 = the children in this group had lead levels above 40 micrograms/dl in 1973 (we can think of these children as the “currently exposed” group). Lead Group 3 = the children in this group had lead levels above 40 microgram/dl in 1972, but had levels below 40 in 1973 (we can think of these children as the “previously exposed” group). The response that was measured (MAXFWT) was the maximum finger-wrist tapping score for the child using both their left and right hands to do the tapping (to remove hand dominance). These data are contained in the JMP file: Maxfwt Lead.JMP. Select Fit Y by X and place Lead Group in the X,Factor box and MAXFWT in the Y,Response box. After selecting Quantiles and Normal Quantile Plots we obtain the following comparative display. The wrist-tap scores appear to be approximately normally distributed. The variance in the wrist-tap scores may appear to be a bit larger for Lead Group 1, however because this is the largest group we expect to see more observations in the extremes. The interquartile ranges for the three lead groups appear to be approximately equal. We can check the equality of variance assumption formally by selecting the UnEqual Variance option. 8 Formally Checking the Equality of the Population Variances None of the equality of variance tests suggests that the population variances are unequal (p > .05). ONE-WAY ANOVA FOR COMPARING THE MEAN WRIST-TAP SCORES ACROSS LEAD GROUP To test the null hypothesis that the mean finger wrist-tap score is the same for each of the lead exposure groups we will perform the standard one-way ANOVA test. To do this in JMP select Means, Anova from the Oneway Analysis pull-down menu. The results of the test are shown in the Analysis of Variance box. The p-value contained in the ANOVA table is .0125, thus we reject the null hypothesis at the .05 level and conclude that statistically significant differences in the mean finger wrist tap scores of the children in the different lead exposure groups exist (p < .05). Multiple Comparisons using Tukey’s HSD Only the mean finger wrist-tap scores of lead groups 1 and 2 significantly differ, i.e. children who currently have high lead levels have a significantly smaller mean than the children who did not have high lead levels in either year of the study. 9 Tukey’s HSD Pair-wise Comparisons (cont’d) The results of the pair-wise comparisons are also contained the output window. The matrix labeled Comparison for all pairs using Tukey-Kramer HSD identifies pairs of means that are significantly different using positive entries in this matrix. Here we see only lead groups 1 and 2 significantly differ. The next table conveys the same information by using different letters to represent populations that have significantly different means. Notice lead groups 1 and 2 are not connected by the same letter so they are significantly different. Finally the CI’s in the Ordered Differences section give estimates for the differences in the population means. Here we see that the mean finger wrist tap score for children who currently have a high blood lead level is estimated to be between .83 and 14.18 taps smaller than the mean finger wrist tap score for children who did not have high lead levels in either year of the study. 10 Randomized Complete Block (RCB) Designs (Section 8.3 Daniel) EXAMPLE 1 – Comparing Methods of Determining Blood Serum Level Data File: Serum-Meth.JMP The goal of this study was determine if four different methods for determining blood serum levels significantly differ in terms of the readings they give. Suppose we plan to have 6 readings from each method which we will then use to make our comparisons. One approach we could take would be to find 24 volunteers and randomly allocate six subjects to each method and compare the readings obtained using the four methods. (Note: this is called a completely randomized design). There is one major problem with this approach, what is it? Instead of taking this approach it would clearly be better to use each method on the same subject. This removes subject to subject variation from the results and will allow us to get a clearer picture of the actual differences in the methods. Also if we truly only wish to have 6 readings for each method, this approach will only require the use of 6 subjects versus the 24 subjects the completely randomized approach discussed above requires, thus reducing the “cost” of the experiment. The experimental design where each patient’s serum level is determined using each method is called a randomized complete block (RCB) design. Here the patients serve as the blocks; the term randomized refers to the fact that the methods will be applied to the patients blood sample in a random order, and complete refers to the fact that each method is used on each patients blood sample. In some experiments where blocking is used it is not possible to apply each treatment to each block resulting in what is called an incomplete block design. These are less common and we will not discuss them in this class. The table below contains the raw data from the RCB experiment to compare the serum determination methods. Method 2 3 4 Subject 1 1 360 435 391 502 2 1035 1152 1002 1230 3 632 750 591 804 4 581 703 583 790 5 463 520 471 502 6 1131 1340 1144 1300 11 Visualizing the need for Blocking Select Fit Y by X from the Analyze menu and place Serum Level in the Y, Response box and Method in the X, Factor box. The resulting comparative plot is shown below. Does there appear to be any differences in the serum levels obtained from the four methods? This plot completely ignores the fact that the same six blood samples were used for each method. We can incorporate this fact visually by selecting Oneway Analysis > Matching Column... > then highlight Patient in the list. This will have the following effect on the plot. Now we can clearly see that ignoring the fact the same six blood samples were used for each method is a big mistake! On the next page we will show how to correctly analyze these data. 12 Correct Analysis of RCB Design Data in JMP First select Fit Y by X from the Analyze menu and place Serum Level in the Y, Response box, Method in the X, Factor box, and Patient in the Block box. The results from JMP are shown below. Notice the Y axis is “Serum Level – Block Centered”. This means that what we are seeing in the display is the differences in the serum level readings adjusting for the fact that the readings for each method came from the same 6 patients. Examining the data in this way we can clearly see that the methods differ in the serum level reported when measuring blood samples from the same patient. The results of the ANOVA clearly show we have strong evidence that the four methods do not give the same readings when measuring the same blood sample (p < .0001). The tables below give the block corrected mean for each method and the block means used to make the adjustment. As was the case with one-way ANOVA (completely randomized) we may still wish to determine which methods have significantly different mean serum level readings when measuring the same blood sample. We can again use multiple comparison procedures. 13 Select Compare Means... > All Pairs, Tukey’s HSD. We can see that methods 4 & 2 differ significantly from methods 1 & 3 but not each other. The same can be said for methods 1 & 3 when compared to methods 4 & 2. The confidence intervals quantify the size of the difference we can expect on average when measuring the same blood samples. For example, we see that method 4 will give between 90.28 and 255.05 higher serum levels than method 3 on average when measuring the same blood sample. Other comparisons can be interpreted in similar fashion. 14