Lecture 2: ANOVA – Review and Refresh

Last class we discussed comparing two means to each other. Now we want to generalize to the case where we have more than two means. Examples:

1. Are there differences in mean yield for three varieties of wheat?
2. Is there a difference in the mean hourly wages for three different ethnic groups?
3. Is there a difference in the mean Pb content of the three main lakes in Eastern WA (Coeur d'Alene, Liberty Lake, and Newman Lake)?
4. Is there a difference in the cholesterol ratio among the four groups: young males, young females, older males, and older females?

In each case we are interested in comparing multiple means to each other, and in each case we are assuming that the treatments were randomly assigned to homogeneous units, or that the units were selected at random from homogeneous groups. Hence the DESIGN that goes along with one-way ANOVA is the COMPLETELY RANDOMIZED DESIGN (CRD).

So the hypothesis of interest is:

$$ H_0: \mu_1 = \mu_2 = \cdots = \mu_t \qquad \text{vs.} \qquad H_a: \text{at least one mean is unequal to another.} $$

So we are interested in finding whether or not at least one of the treatments is different from another. We are also interested in identifying WHICH ones are different. First we will discuss the ANALYSIS aspect of the problem and then we will address the design aspect of the problem.

The logic behind ANOVA:
The idea is that we decide whether the means are the same or not based on their variability. We do not know the population means; we can only observe the sample means and standard deviations. Obviously the sample means are not going to be exactly equal to each other, so the question is how different the sample means are relative to their variability. If the variability across the samples is a lot bigger than the variability within the samples (the normal variation), we decide that the underlying means are different.

Let's consider the following example from your book. Assume that we wish to compare the mean hourly wages for three ethnic groups based on samples of five workers selected from each group.

Table 1
Sample    Hourly wages                        Mean
1         5.90  5.92  5.91  5.89  5.88        5.90
2         5.51  5.50  5.50  5.49  5.50        5.50
3         5.01  5.00  4.99  4.98  5.02        5.00

Table 2
Sample    Hourly wages                        Mean
1         5.90  4.42  7.51  7.89  3.78        5.90
2         6.31  3.54  4.73  7.20  5.72        5.50
3         4.52  6.93  4.48  5.55  3.52        5.00

What can you say about the variability in the three groups in Table 1 compared with Table 2?

Do the data in Table 1 present sufficient evidence to indicate differences among the three population means? An inspection of the data indicates very little variation within a sample, whereas the variability among the sample means is much larger. Since the variability among the sample means is so large in comparison to the within-sample variation, we might conclude that the corresponding population means are different. Table 2 illustrates a situation in which the sample means are the same as in Table 1 but the variability within a sample is much larger. In contrast to the data in Table 1, the between-sample variation is small relative to the within-sample variability, and we would be less likely to conclude that the corresponding population means differ based on these data.

Let's look at the means and standard deviations:

            Table 1            Table 2
Sample    Mean     SD        Mean     SD
1         5.9      .016      5.9      1.82
2         5.5      .007      5.5      1.42
3         5.0      .016      5.0      1.30

The sample means are identical in the two tables, but the within-sample standard deviations are far larger in Table 2.
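The group means and standard deviations above are exactly the summaries PROC MEANS reports. The following is only a minimal sketch of how Table 1 could be entered and summarized in SAS; the data set and variable names (table1, factor, wage) are my own labels, not something prescribed in these notes.

/* Sketch: enter Table 1 and compute the group means and SDs with PROC MEANS.
   Data set and variable names (table1, factor, wage) are illustrative only. */
data table1;
   input factor $ wage @@;
   datalines;
A 5.90 A 5.92 A 5.91 A 5.89 A 5.88
B 5.51 B 5.50 B 5.50 B 5.49 B 5.50
C 5.01 C 5.00 C 4.99 C 4.98 C 5.02
;
run;

proc means data=table1 n mean std min max;
   class factor;     /* one row of summaries per ethnic group */
   var wage;
run;

Repeating the same two steps with the Table 2 values gives the second pair of columns in the summary above.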
So in ANOVA what we do is find the WITHIN-sample variance and compare it to the ACROSS-sample (treatment) variance. If the variation across the samples is MUCH greater than the variation within the samples, we conclude that the means are different from each other. If the across-sample variation is not large compared to the within-sample variance, we cannot conclude that the means are different. So the main work is to calculate the WITHIN and ACROSS sample variation. In ANOVA we call the WITHIN variation the ERROR variation and the ACROSS variation the TREATMENT variation.

Measuring Variability within and across:
So how do we measure variability within a sample? Let's go back to the two-sample situation: there we measured variability within the samples with the pooled variance (assuming equal variances for both populations). It is the same idea extended to the case of multiple means: we pool across all the samples.

For the data in Table 1, let's first define some notation. Here t = number of treatments = 3 and N = n1 + n2 + n3 = 15.

Level (i)        1         2         3
n_i              5         5         5
ybar_i.          5.9000    5.5000    5.0000
s_i              0.0158    0.0071    0.0158

For the variation within, for which we use the notation $s_W^2$, we pool the variances across the three groups:

$$ s_W^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2 + (n_3-1)s_3^2}{n_1+n_2+n_3-3}. $$

Now, how do we get the variability across samples? We look at how much the sample means vary around the overall mean, weighting each group by its sample size:

$$ s_B^2 = \frac{n_1(\bar y_{1.}-\bar y_{..})^2 + n_2(\bar y_{2.}-\bar y_{..})^2 + n_3(\bar y_{3.}-\bar y_{..})^2}{3-1},
\qquad \text{where } \bar y_{..} = \frac{n_1\bar y_{1.}+n_2\bar y_{2.}+n_3\bar y_{3.}}{n_1+n_2+n_3}. $$

(With equal sample sizes this is just n times the variance of the three sample means; the weights n_i put $s_B^2$ on the same scale as $s_W^2$ when the population means are equal.)

Let's calculate the terms for our example:

$$ s_W^2 = \frac{(5-1)(.0158)^2 + (5-1)(.0071)^2 + (5-1)(.0158)^2}{5+5+5-3} = .000183 $$

$$ \bar y_{..} = \frac{5(5.9) + 5(5.5) + 5(5.0)}{5+5+5} = 5.47 $$

$$ s_B^2 = \frac{5(5.9-5.47)^2 + 5(5.5-5.47)^2 + 5(5.0-5.47)^2}{3-1} = 1.0167 $$

So a logical choice of test statistic is the ratio $s_B^2 / s_W^2$. In our case, is this ratio bigger than 1? By how much?

A bit of distribution THEORY:
Before we find a mathematical way to decide how different is different, we will digress a bit and talk about distributions and their relationships. I will assume you ALL have used and know what the NORMAL distribution is.

Normal distribution: Let Y follow a normal distribution with mean $\mu$ and standard deviation $\sigma$:

$$ f(y) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^2 \right\}. $$

If Y* = a + bY, then Y* follows a normal distribution with E(Y*) = a + b\mu and Var(Y*) = b^2\sigma^2. A normal distribution with mean 0 and variance 1 is called the STANDARD normal distribution.

Chi-square distribution: Let z_1, ..., z_n be iid N(0,1). Then $X^2 = z_1^2 + \cdots + z_n^2$ follows a chi-square distribution with n degrees of freedom. If there are linear constraints on the z's, such as $z_1 + \cdots + z_n = a$ (some constant), then the degrees of freedom DECREASE by the number of constraints.

Student's t-distribution: Let z follow a N(0,1) and let $\chi^2(\nu)$ follow a chi-square distribution with $\nu$ degrees of freedom, independent of z. Then

$$ t = \frac{z}{\sqrt{\chi^2(\nu)/\nu}} $$

follows a t distribution with $\nu$ degrees of freedom.

F-distribution: Let $\chi^2(\nu_1)$ and $\chi^2(\nu_2)$ follow INDEPENDENT chi-square distributions with $\nu_1$ and $\nu_2$ degrees of freedom. Then

$$ F = \frac{\chi^2(\nu_1)/\nu_1}{\chi^2(\nu_2)/\nu_2} $$

follows an F distribution with $\nu_1$ and $\nu_2$ degrees of freedom.

So back to our problem:
We want to figure out how large is large for the ratio $s_B^2 / s_W^2$. So we need to look at the ANOVA model and its assumptions to understand the distribution of this statistic.

Model Based Approach To ANOVA (AOV):

Cell-means model:            $y_{ij} = \mu_i + \varepsilon_{ij}$
Treatment-effects model (the most commonly used form):   $y_{ij} = \mu + \tau_i + \varepsilon_{ij}$

where y_ij is our response variable, $\mu$ is our overall (grand) mean, $\tau_i$ is the effect of treatment i, and the $\varepsilon_{ij}$ are our error terms. Our assumption is that the $\varepsilon_{ij}$ are independent and follow a normal distribution with mean 0 and variance $\sigma^2$.

So our assumptions are:
1. Normality of the Y's.
2. Independence among the Y's.
3. All the Y's have equal variance (the variance does not depend on the group they belong to).
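To make the model and its assumptions concrete, here is a minimal SAS sketch, not part of the original notes, that simulates one CRD data set from the treatment-effects model; all parameter values (mu, tau, sigma, 3 groups of 5) are arbitrary choices for illustration.

/* Illustrative sketch only: simulate y_ij = mu + tau_i + eps_ij with
   eps_ij ~ iid N(0, sigma^2).  The parameter values below are made up. */
data sim;
   call streaminit(20250101);                 /* fix the seed so the run repeats      */
   mu    = 5.5;                               /* overall (grand) mean                 */
   sigma = 0.015;                             /* common error SD (equal variance)     */
   array tau[3] _temporary_ (0.4 0.0 -0.5);   /* treatment effects tau_i              */
   do trt = 1 to 3;
      do rep = 1 to 5;
         y = mu + tau[trt] + rand('normal', 0, sigma);  /* independent normal errors  */
         output;
      end;
   end;
   keep trt rep y;
run;

Feeding a data set like sim through the same PROC MEANS and ANOVA steps used for Tables 1 and 2 shows how the within- and across-sample variation behave when the model really holds.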
So the hypotheses we are testing, in the two parameterizations, are:

$$ H_0: \mu_1 = \mu_2 = \cdots = \mu_t \quad \text{vs.} \quad H_a: \text{at least one inequality}, $$
$$ H_0: \tau_1 = \tau_2 = \cdots = \tau_t = 0 \quad \text{vs.} \quad H_a: \text{at least one } \tau_i \neq 0. $$

Let's consider the following identity:

$$ y_{ij} - \bar y_{..} = (y_{ij} - \bar y_{i.}) + (\bar y_{i.} - \bar y_{..}). $$

Squaring and summing over all observations (the cross terms cancel),

$$ \sum_{i,j} (y_{ij} - \bar y_{..})^2 = \sum_{i,j} (y_{ij} - \bar y_{i.})^2 + \sum_i n_i (\bar y_{i.} - \bar y_{..})^2, $$

that is, Total Sum of Squares = Within Sum of Squares + Between Sum of Squares.

Now let's reason this out. Y is normal, so $(y_{ij} - \bar y_{..})$ is also normal, and $\sum_{i,j}(y_{ij} - \bar y_{..})^2/\sigma^2$ is chi-square with $(\sum n_i - 1)$ degrees of freedom. Similarly, $\sum_{i,j}(y_{ij} - \bar y_{i.})^2/\sigma^2$ is chi-square with $(\sum n_i - t)$ degrees of freedom, and, when H0 is true, $\sum_i n_i(\bar y_{i.} - \bar y_{..})^2/\sigma^2$ is chi-square with $(t-1)$ degrees of freedom. These last two pieces are independent, so under H0 the ratio

$$ \frac{s_B^2}{s_W^2} = \frac{\sum_i n_i(\bar y_{i.} - \bar y_{..})^2/(t-1)}{\sum_{i,j}(y_{ij} - \bar y_{i.})^2/(\sum n_i - t)} $$

follows an F distribution with $(t-1, \sum n_i - t)$ degrees of freedom. So we determine how large is large using the F table. This leads us to the ANOVA table:

Source                        Degrees of freedom     Sums of squares                          Mean squares    F statistic   p-value
Between samples (Treatment)   dff = t - 1            SSF = sum_i n_i (ybar_i. - ybar_..)^2    MSF = SSF/dff   F = MSF/MSE   from F(dff, dfe)
Within samples (Error)        dfe = sum_i n_i - t    SSE = sum_{i,j} (y_ij - ybar_i.)^2       MSE = SSE/dfe
Total                         dft = sum_i n_i - 1    TSS = sum_{i,j} (y_ij - ybar_..)^2

This is the ANOVA table used in practice. In theory we generally add a column for the expected mean squares, E(MS).

Let's look at the ANOVA table for the data in Table 1 and how we complete it:

Source                                   Degrees of freedom   Sums of squares   Mean squares   F statistic   p-value*
Between samples (Treatment: ethnicity)   dff = 3 - 1 = 2      SSF = 2.03333     1.0167         5545.45       < .001
Within samples (Error)                   dfe = 15 - 3 = 12    SSE = 0.00220     0.000183
Total                                    dft = 15 - 1 = 14    TSS = 2.03553

*To find the p-value we look at the F distribution with 2 and 12 degrees of freedom. From the F table, F(.001, 2, 12) = 12.97; 5545.45 is far bigger than that, so the p-value is < .001.

As we are using SAS in our labs, let's look at what the SAS output gives us for Data Set 1.

The MEANS Procedure (analysis variable: wage)

factor   N   Mean        Std Dev     Minimum     Maximum
A        5   5.9000000   0.0158114   5.8800000   5.9200000
B        5   5.5000000   0.0070711   5.4900000   5.5100000
C        5   5.0000000   0.0158114   4.9800000   5.0200000

The GLM Procedure

Class Level Information:  Class = factor, Levels = 3, Values = A B C
Number of Observations Read = 15, Used = 15

Dependent Variable: wage

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2   2.03333333       1.01666667    5545.45   <.0001
Error             12   0.00220000       0.00018333
Corrected Total   14   2.03553333

R-Square   Coeff Var   Root MSE   wage Mean
0.998919   0.247684    0.013540   5.466667

Source   DF   Type I SS     Mean Square   F Value   Pr > F
factor    2   2.03333333    1.01666667    5545.45   <.0001

Source   DF   Type III SS   Mean Square   F Value   Pr > F
factor    2   2.03333333    1.01666667    5545.45   <.0001
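For reference, SAS code along the following lines produces output like the above. This is a sketch rather than the exact lab program; it reuses the table1 data set from the PROC MEANS sketch earlier.

/* Sketch: one-way ANOVA for Data Set 1 with PROC GLM,
   reusing the table1 data set entered earlier. */
proc glm data=table1;
   class factor;           /* ethnic group is the classification factor          */
   model wage = factor;    /* fits the one-way model y_ij = mu + tau_i + eps_ij  */
run;
quit;

In the GLM output, the "Model" line is the Between samples (Treatment) line of our ANOVA table, "Error" is the Within samples line, and "Corrected Total" is the Total line.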
Let's look at the ANOVA table for Data Set 2:

The MEANS Procedure (analysis variable: wage)

factor   N   Mean        Std Dev     Minimum     Maximum
A1       5   5.9000000   1.8191344   3.7800000   7.8900000
B1       5   5.5000000   1.4167745   3.5400000   7.2000000
C1       5   5.0000000   1.2960131   3.5200000   6.9300000

The GLM Procedure

Class Level Information:  Class = factor, Levels = 3, Values = A1 B1 C1
Number of Observations Read = 15, Used = 15

Dependent Variable: wage

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2    2.03333333      1.01666667    0.44      0.6565
Error             12   27.98460000      2.33205000
Corrected Total   14   30.01793333

R-Square   Coeff Var   Root MSE   wage Mean
0.067737   27.93485    1.527105   5.466667

Source   DF   Type I SS     Mean Square   F Value   Pr > F
factor    2   2.03333333    1.01666667    0.44      0.6565

Source   DF   Type III SS   Mean Square   F Value   Pr > F
factor    2   2.03333333    1.01666667    0.44      0.6565

So our findings from the analysis are:
1. For Data Set 1 there is a significant difference across the ethnic groups, with a p-value < .001.
2. For Data Set 2 there is NOT a significant difference among the three ethnic groups.

How did these results emerge?
1. The SSF (treatment sum of squares) was the SAME for both data sets, but the SSE was MUCH smaller in Data Set 1 than in Data Set 2.
2. So in the first data set almost all of the total variability was explained by the differences among the groups and very little by error. That was not the case for the second data set.
3. So essentially ANOVA detects differences by looking at how the total variability is broken up into variability for treatments and variability for error.

One thing we need to check now: was ANOVA appropriate in this scenario? Were the assumptions met? Let's discuss the assumptions:
1. Normality
2. Independence
3. Equal variance

The procedure for checking whether the assumptions are met is called diagnostics. It is similar to being a doctor and looking at symptoms to see whether the model is "well" or not. In each case we can either "test" or look at plots. We will start with the easier approach, looking at plots. Before we do that, let's talk about what these assumptions are about:
1. Normality: we assume that our response variable (and in turn our errors) is normally distributed.
2. Independence: we assume that each error term is independent of the others.
3. Equal variance: we assume that each error term comes from a distribution with the same variance.

So let's think about how to diagnose problems, and take a long hard look at our model for one-way ANOVA with fixed effects (we will discuss random effects in Lecture 4):

$$ y_{ij} = \mu + \tau_i + \varepsilon_{ij} $$
observed value = overall mean + deviation for the i-th group + random error.

Here $y_{ij}$ is random, $\mu + \tau_i$ is NOT random (fixed), and $\varepsilon_{ij}$ is random, with $\varepsilon_{ij} = y_{ij} - (\mu + \tau_i)$. Since we never know the true value of the mean of the $y_{ij}$'s, we can never observe $\varepsilon_{ij}$. How do we estimate it? Using $y_{ij} - \bar y_{i.}$, our observed error, seems reasonable. We call this OBSERVED error the RESIDUAL and denote it by $e_{ij}$, so that

$$ e_{ij} = y_{ij} - \bar y_{i.}. $$

REMEMBER: all our assumptions are on the ERRORS, but we use the RESIDUALS as a proxy to check those assumptions.

So, to check normality we check whether the residuals are normally distributed. We do this with the normal probability plot. What is the normal probability plot? It is a plot of the ranked residuals against the expected values of the ranked residuals if the distribution is normal. So if the distribution is normal, we expect the plot to look like a STRAIGHT line; any deviation from a straight line indicates a departure from normality.

For equality of variance we look at the plot of residuals versus the explanatory variable. The idea is that if the variances are equal, the residuals should show roughly equal spread across the different X's.

For non-independence we look at the plot of residuals versus predicted values, where in general we do not want to see any patterns.

Consider the diagnostic plots for Data Set 1 and for Data Set 2 (normal probability plots and plots of residuals versus predicted values).
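In SAS these diagnostic plots can be produced by saving the residuals and predicted values from PROC GLM. The sketch below does this for the table1 data; the output data set and variable names (diag, resid, pred) are my own labels.

/* Sketch: save residuals and fitted values, then draw the two
   diagnostic plots discussed above.  Names diag, resid, pred are illustrative. */
proc glm data=table1;
   class factor;
   model wage = factor;
   output out=diag r=resid p=pred;    /* residuals e_ij and predicted values ybar_i. */
run;
quit;

/* Normal probability (Q-Q) plot of the residuals, plus formal normality tests */
proc univariate data=diag normal;
   var resid;
   qqplot resid / normal(mu=est sigma=est);
run;

/* Residuals versus predicted values: look for unequal spread or patterns */
proc sgplot data=diag;
   scatter x=pred y=resid;
   refline 0 / axis=y;
run;

A roughly straight Q-Q plot and a patternless, equal-spread residual plot are what we hope to see.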
Some words of advice:
1. We can also do formal tests for these assumptions; we will talk about them in LAB.
2. While we make the assumptions about the errors, we test them on the residuals, so use these checks as guidelines for diagnosing problems, not as something irrevocable.

If the assumptions are not met, we have options. More and more new techniques are appearing that deal with non-standard situations, so often it is just a matter of finding the right procedure. We can also transform our data. This is not ALWAYS the best option, because the analysis is then done on the transformed response, which might not have much physical meaning. The book discusses some standard techniques in Section 8.5; you can read that. My advice: unless the transformation leaves the response physically interpretable, transformations are best avoided.

An Example
A development engineer is interested in determining whether varying the cotton content of a synthetic fiber affects its tensile strength.

% Cotton    Observed tensile strength
15           7    7   15   11    9
20          12   17   12   18   18
25          14   18   18   19   19
30          19   25   22   19   23
35           7   10   11   15   11

One-Way Analysis of Variance

Analysis of Variance for resp
Source    DF      SS        MS       F       P
Trt        4    475.76    118.94    14.76   0.000
Error     20    161.20      8.06
Total     24    636.96

[Figure: Residuals versus the fitted values (response is strength).]
[Figure: Normal probability plot of the residuals (RESI1); residual average -0.0000000, StDev 2.59165, N = 25; Anderson-Darling normality test: A-squared = 0.519, p-value = 0.170.]

Example
The given observations are tomato yields (kg/plot) for four different levels of electrical conductivity (EC) of the soil. The chosen EC levels were 1.6, 3.8, 6.0, and 10.2 nmhos/cm.

EC level    Yield
1.6         59.5   53.3   56.8   63.1
3.8         55.2   59.1   52.8   54.5
6.0         51.7   48.8   53.9   49.0
10.2        44.6   48.5   41.0   47.3

Construct the analysis of variance table and use it to test the null hypothesis of no difference in true mean yield across the four EC levels. (A SAS sketch for this example is given at the end of these notes.)

One-way ANOVA: yeild versus trt

Analysis of Variance for yeild
Source    DF      SS       MS      F       P
trt        3    377.8    125.9    12.20   0.001
Error     12    123.8     10.3
Total     15    501.6

[Figure: Normal probability plot of the residuals (response is yeild).]
[Figure: Residuals versus the fitted values (response is yeild).]

So our next step is to look at the follow-up to ANOVA, i.e., multiple comparisons.

Review: Things we learned in this set of notes:
1. Completely Randomized Design
2. Logic of ANOVA
3. Partitioning of the Sums of Squares
4. Model for ANOVA
5. Diagnostics
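Appendix: since our labs use SAS, here is a minimal sketch of how the tomato-yield example could be run there. The data set and variable names (tomato, ec, yield, tomdiag) are my own labels, not part of the original notes.

/* Appendix sketch: one-way ANOVA for the tomato-yield example.
   Data set and variable names (tomato, ec, yield, tomdiag) are illustrative only. */
data tomato;
   input ec yield @@;
   datalines;
 1.6 59.5  1.6 53.3  1.6 56.8  1.6 63.1
 3.8 55.2  3.8 59.1  3.8 52.8  3.8 54.5
 6.0 51.7  6.0 48.8  6.0 53.9  6.0 49.0
10.2 44.6 10.2 48.5 10.2 41.0 10.2 47.3
;
run;

proc glm data=tomato;
   class ec;                            /* treat EC level as a classification factor */
   model yield = ec;                    /* one-way ANOVA                             */
   output out=tomdiag r=resid p=pred;   /* residuals for the diagnostic plots        */
run;
quit;

PROC GLM's Model, Error, and Corrected Total lines correspond to the trt, Error, and Total lines in the table above, and the residuals saved in tomdiag can be used for the same diagnostic plots as before.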