Analysis of Variance
Lecture 19, Dec 4, 2006

A. Introduction
1. When we observe an association in sample data between a dichotomous variable and an interval-level variable, we use a difference-between-means test to decide whether those variables are associated in the population from which our sample was drawn or whether the observed sample association was due to some random process.
2. What do we do when we have sample data and want to infer whether a multi-category independent variable is associated with an interval-level variable in the population the sample came from?
3. For example, you want to know whether some of the faculty who teach required courses are tougher graders than others. You find data showing the 203 grades given by 9 faculty members in required courses.
   a. How can we tell whether some of these faculty (a multi-category nominal variable) tend to grade more toughly than others?
   b. We could use a difference-in-means test to compare the grades of each possible pair of professors, testing the H0 that µ1 = µ2.
   c. With 9 categories, this would require 36 difference-in-means tests.
   d. It would be difficult to draw any general conclusions from 36 tests.

B. One-way ANOVA is a more straightforward way to assess an association between a multi-category variable and an interval-level variable, and it allows us to directly test the H0 that µ1 = µ2 = µ3 = ... = µg (where g = the number of categories in the categorical variable).
1. Illustration with just three faculty: Professors H, M, and L.
   a. What if H's grades were 4.0, 3.9, 3.8, 3.8, 3.7, 3.7, 3.6, 3.6, 3.5, 3.5, 3.5, 3.4, 3.4, 3.3, 3.3, 3.2, 3.2, 3.1, and 3.0, and M and L had the same distributions?
      (1) Given these data, could we reject the H0 that µh = µm = µl? <no> Why not?
         (a) their means would be identical
         (b) their dispersions would be identical
   b. What if H gave every student a grade between 3.7 and 4.0 with a mean of 3.85; M gave every student a grade between 3.3 and 3.6 with a mean of 3.45; and L gave every student a grade between 2.9 and 3.2 with a mean of 3.15?
      (1) Based just on these data, would you conclude that these professors differ in their propensities to give high or low grades? In other words, do you suspect that we could reject the H0 that µh = µm = µl? <yes> Why?
      (2) Their means differ, which is inconsistent with the H0 that µh = µm = µl.
      (3) Their distributions do not overlap, suggesting that they differ in how tough their grading is, so we would suspect that µh ≠ µm ≠ µl.
2. These two extremes illustrate two sources of variation in Y (the interval variable):
   a. variation of the category (professors') means ($\bar{Y}_H$, $\bar{Y}_M$, $\bar{Y}_L$) around the grand mean $\bar{Y}$
   b. variation of each $Y_i$ around its category-specific mean ($\bar{Y}_H$, $\bar{Y}_M$, or $\bar{Y}_L$)
3. How can we measure these different sources of grade dispersion?
   a. We can measure the dispersion of the professors' mean grades around the grand mean $\bar{Y}$ with a variant of the numerator of the variance formula, summing over all cases: $\sum_i (\bar{Y}_p - \bar{Y})^2$, where $\bar{Y}_p$ is the mean grade of the professor who graded case i.
   b. We can measure the dispersion of the grades for each professor around that professor's mean grade $\bar{Y}_p$ and then pool these dispersions across all the professors: $\sum_p \sum_i (Y_{ip} - \bar{Y}_p)^2$.
4. The sum of these two sources of variation in Y equals the total variation in Y (demonstrated in the sketch below):
   $\sum_i (\bar{Y}_p - \bar{Y})^2 + \sum_p \sum_i (Y_{ip} - \bar{Y}_p)^2 = \sum_i (Y_i - \bar{Y})^2$
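A minimal numpy sketch of this decomposition; the grades and group sizes are invented purely for illustration:

```python
import numpy as np

# Hypothetical grades for three professors (numbers invented for illustration)
grades = {
    "H": np.array([4.0, 3.9, 3.8, 3.8, 3.7]),
    "M": np.array([3.6, 3.5, 3.4, 3.4, 3.3]),
    "L": np.array([3.2, 3.1, 3.0, 3.0, 2.9]),
}

all_grades = np.concatenate(list(grades.values()))
grand_mean = all_grades.mean()

# Between SS: each case's category mean deviates from the grand mean
bss = sum(len(y) * (y.mean() - grand_mean) ** 2 for y in grades.values())
# Within SS: each grade deviates from its own category mean
wss = sum(((y - y.mean()) ** 2).sum() for y in grades.values())
# Total SS: each grade deviates from the grand mean
tss = ((all_grades - grand_mean) ** 2).sum()

print(f"BSS={bss:.4f}  WSS={wss:.4f}  BSS+WSS={bss + wss:.4f}  TSS={tss:.4f}")
assert np.isclose(bss + wss, tss)  # the identity in B.4
```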
5. We use these two estimates of variation in Y (the between variation and the within variation) in statistical inference regarding the association between X and Y.
   a. The between variation refers to the variation in Y between the means of Y for each category of X around $\bar{Y}$. Because knowing X explains this proportion of the variation in Y, it is called the explained variance.
   b. The within variation is how much Y varies within each category of X around that category's mean of Y, summed over the categories of X. Here we are pooling estimates of variation for each value of X, and since knowing X does not explain this variation in Y, we also call it the unexplained variance.
6. These estimates of the variation in Y are actually sums of squares because they do not take into account the number of cases (actually, degrees of freedom) on which each is based. Now let's turn them into estimates of the variance in Y.
   a. The between sum of squares is based on the number of categories of X, so it has g - 1 df.
   b. The within sum of squares is based on the size of the sample minus the number of df we have already used, so it has n - g df.
   c. Notice that the between and within df sum to the df for the sample variance: n - 1 = (n - g) + (g - 1).
7. Formulae for the variance estimates when g = 3 (H, M, L):
   a. total variance = $\sum_i (Y_i - \bar{Y})^2 / (n - 1)$
   b. between variance = $\sum_i (\bar{Y}_g - \bar{Y})^2 / (g - 1)$
   c. pooled estimate of the within variance = $\sum_g \sum_i (Y_{ig} - \bar{Y}_g)^2 / (n - g)$
8. The quotient from dividing a sum of squares by its df is called the mean squared deviation (or mean square).
9. Summary:

   Sum of squared deviations in Y                df       Mean squared deviation
   (1) TSS = $\sum_i (Y_i - \bar{Y})^2$          n - 1    TSS / (n - 1)
   (2) BSS = $\sum_i (\bar{Y}_g - \bar{Y})^2$    g - 1    BSS / (g - 1)
   (3) WSS = $\sum_i (Y_i - \bar{Y}_g)^2$        n - g    WSS / (n - g)

C. The logic of one-way ANOVA
1. ANOVA compares two independent estimates of the variation in Y to assess whether their ratio is more consistent with the H0 that X and Y are not related in the population or the Ha that they are related in the population.
2. If X and Y are not related, then we would expect $\sigma^2_{Y \cdot x_1} = \sigma^2_{Y \cdot x_2} = \dots = \sigma^2_{Y \cdot x_g}$: the pooled estimate of the within-X variance in Y will approximately equal the total variance in Y, while the between variance will approach 0.
3. In contrast, according to Ha, enough of the variation in Y will be explained by X that the between variance will be large compared to the within variance.
4. We compare the between and within estimates of the variance in Y to see whether the sample-level association between X and Y is more consistent with H0 or Ha.
5. The statistical test for the association between X and Y is the ratio of the between estimate to the within estimate of the variance in Y (computed in the sketch below).
   a. If the population means of Y for each category of X were identical, then any variation in sample values of Y within these categories would result from random variation of Y around $\bar{Y}_x$.
   b. WSS estimates the variation in Y that X cannot explain. If X and Y are unrelated, then the estimate of the explained variance in Y should not exceed the estimate of the unexplained variance, $\sum_i (Y_{xi} - \bar{Y}_x)^2 / (n - g)$, and the ratio of the between- to the within-category mean square should be about 1.
   c. The greater the explained (between) variance relative to the unexplained (within) variance, the less likely it is that the association between X and Y in the sample data stemmed from some random process, and the more likely it is that it reflects a real association between X and Y in the population from which the sample was drawn.
   d. The stronger the association between X and Y in the sample, the larger BSS is relative to WSS. And the larger the ratio of BSS to WSS, the less likely it will be that the association observed in the sample stemmed from some random process and the more likely it reflects an association between X and Y in the population.
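A short sketch of the between and within mean squares and their ratio, using the same invented grade data as above; the comparison at the end assumes scipy is installed:

```python
import numpy as np
from scipy import stats

# Same invented grade data as in the earlier sketch
groups = [
    np.array([4.0, 3.9, 3.8, 3.8, 3.7]),  # Professor H
    np.array([3.6, 3.5, 3.4, 3.4, 3.3]),  # Professor M
    np.array([3.2, 3.1, 3.0, 3.0, 2.9]),  # Professor L
]
g = len(groups)                              # number of categories of X
n = sum(len(y) for y in groups)              # total number of cases
grand_mean = np.concatenate(groups).mean()

bss = sum(len(y) * (y.mean() - grand_mean) ** 2 for y in groups)
wss = sum(((y - y.mean()) ** 2).sum() for y in groups)

between_ms = bss / (g - 1)   # explained mean square, df1 = g - 1
within_ms = wss / (n - g)    # unexplained mean square, df2 = n - g
f_ratio = between_ms / within_ms

# scipy's one-way ANOVA should reproduce the hand computation
f_scipy, p_value = stats.f_oneway(*groups)
print(f"F = {f_ratio:.3f} (scipy: {f_scipy:.3f}), p = {p_value:.6f}")
```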
D. The F distribution and the F test of statistical significance
1. The F distribution is the sampling distribution of the ratio of two independent estimates of the same variance.
   a. This is what we rely on when we use F to test whether a multiple regression equation is statistically significant.
2. Table D shows the probability of getting any particular F ratio, given the number of categories and the number of cases.
3. As Table D shows, F has a distinct sampling distribution for each combination of df1 (the df for the between estimate) and df2 (the df for the within estimate).
4. The test statistic for the F test is the ratio of the between mean square to the within mean square:
   $F = \dfrac{\sum_i (\bar{Y}_g - \bar{Y})^2 / (g - 1)}{\sum_i (Y_i - \bar{Y}_g)^2 / (n - g)}$
5. Illustration of the decomposition of variation in Y, revisiting our earlier examples in which the professor accounted for none of the variation in grades or for all of the variation in grades around her/his mean grade ($\bar{Y}_g$):
   a. When the means and variances for each professor were equal, X didn't explain any of the variance in grades, and the between mean squared deviation is very small compared to the within mean squared deviation.
   b. When the 3 professors differed in their mean grades and their distributions did not overlap, X explained much of the variation in grading, and the ratio of the between mean square to the within mean square would exceed 1 and probably be large.

E. Example
1. Hypotheses
   a. H0: µa = µb = µc, so X and Y are not associated. If H0 is true, the between mean square will approximately equal the within mean square, and the ratio of the two will be about 1.
   b. Ha: the µs are not all equal; X and Y are associated.
2. Alpha level: We'll take a 5% risk of concluding that the two variables are related when they are in fact unrelated in the population.
   a. Our text simplifies finding the cutoff point for the region of rejection by presenting three tables: one for α = .05, one for α = .01, and one for α = .001.
   b. For α = .05, we'll use the table on p. 871. These sampling distributions give the probability of getting any particular F ratio if H0 is true.
   c. We can't determine the cutoff for the region of rejection until we know the sample size and the number of categories of X.
3. Data
   a. To find the value that marks off the critical region, we must calculate the dfs associated with the BSS and the WSS.
      (1) The df associated with the BSS (df1 = g - 1) appears across the top of the table.
      (2) The df associated with the WSS (df2 = n - g) appears on the left side of the table.
   b. We'll reject H0 if F > 3.88. [draw rejection region]
   c. estimate of the between mean square = 30/2
   d. estimate of the within mean square = 16/12
4. Calculate the test statistic, the F ratio:
   F = (30/2) / (16/12) = 15 / 1.333 = 11.25
5. Decision: 11.25 > 3.88, so we reject the H0 that X and Y are not related, with a 5% chance of a Type I error. If H0 were true, the means for the three categories should be closer together. Instead, the between-category mean square was much larger than the within-category mean square, and the F ratio was far greater than 1.

F. The F test can be applied whenever you are comparing variances across multiple independent groups.
1. Difference between two means
   a. The number of df associated with the BSS will be 1, so you will look in the first column of the F table to find the cutoff for the region of rejection associated with the WSS df.
   b. If you look closely at this column, you will see that these values are the squares of the corresponding critical values in the t-distribution table.
   c. $F_{1, df_2} = t^2_{df_2}$
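A quick numerical check of this F-to-t relationship, along with the critical value used in the example in section E; it uses scipy.stats, and the df2 values are chosen arbitrarily for illustration:

```python
from scipy import stats

# Critical value used in E: F at alpha = .05 with df1 = 2, df2 = 12
print(round(stats.f.ppf(0.95, dfn=2, dfd=12), 3))    # ~3.885, the 3.88 cutoff

# With df1 = 1, the F critical value is the square of the two-tailed t critical value
for df2 in (10, 30, 120):
    f_crit = stats.f.ppf(0.95, dfn=1, dfd=df2)
    t_crit = stats.t.ppf(0.975, df=df2)              # two-tailed .05 => .975 quantile
    print(df2, round(f_crit, 4), round(t_crit ** 2, 4))
```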
2. We can use the F test more generally to compare any two or more variances; for example, we can use it to test for homoskedasticity ($\sigma^2_{x_1} = \sigma^2_{x_2} = \sigma^2_{x_3} = \dots = \sigma^2_{x_g}$).
3. We use the F test when assessing whether a multiple regression model significantly explains variation in Y.

G. Assessing the strength of the association between a categorical independent variable and an interval-level dependent variable
1. A related measure, eta squared (η²), assesses the strength of the association between a categorical variable with more than two categories and an interval-level variable. Eta squared is also called the correlation ratio. It is not used much, but you should know it exists.
2. η² = BSS / TSS: the ratio of the amount of variation in the dependent variable that is due to X to the total amount of variation in the dependent variable. A minimal sketch of the computation follows.
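Using the same invented grade data as in the earlier sketches:

```python
import numpy as np

# Hypothetical grades for three professors (numbers invented for illustration)
groups = [
    np.array([4.0, 3.9, 3.8, 3.8, 3.7]),  # Professor H
    np.array([3.6, 3.5, 3.4, 3.4, 3.3]),  # Professor M
    np.array([3.2, 3.1, 3.0, 3.0, 2.9]),  # Professor L
]
all_y = np.concatenate(groups)
grand_mean = all_y.mean()

bss = sum(len(y) * (y.mean() - grand_mean) ** 2 for y in groups)
tss = ((all_y - grand_mean) ** 2).sum()

eta_sq = bss / tss  # proportion of the variation in Y attributable to X
print(f"eta squared = {eta_sq:.3f}")
```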