Simple ANalysis Of VAriance (ANOVA)

• Oftentimes we have more than two groups that we want to compare. The purpose of ANOVA is to allow us to compare group means from several independent samples.

• In general, ANOVA procedures are generalizations of the t-test: it can be shown that, if one is only interested in the difference between two groups on one categorical (i.e., grouping) variable, the independent-samples t-test is a special case of ANOVA.

• A one-way ANOVA refers to having only one grouping variable or factor, which is the independent variable. It is possible to have more than one grouping variable, but we will start with the simplest case.

• If one has only two levels of the grouping variable then one can simply conduct an independent-samples t-test, but if one has more than two levels of the grouping variable then one needs to conduct an ANOVA.

• Since we have more than two groups in ANOVA we need a way to describe the differences among all the means. One way to do this is to compute the variance of the sample means, because a large variance implies that the sample means differ a lot, whereas a small variance implies that the sample means are not that different. This gives us a single numeric value for the difference among all the sample means.

• The statistic used in ANOVA partitions the variance into two components: (1) the between-treatment[1] variability and (2) the within-treatment variability.

[1] Note that groups do not always represent treatments. Oftentimes ANOVA is used to determine differences in intact groups, such as those that differ by ethnicity or gender.

• Whenever means from different samples are compared, there are three sources that can cause differences to be observed between the sample means:
1. Differences due to treatment
2. Individual differences
3. Differences due to experimental error
All three of these sources can cause one to observe differences between treatment groups, so together they are referred to as the between-treatment variability. Only two of these sources, individual differences and experimental error, can be observed within a treatment group, and these are referred to as the within-treatment variability[2].

[2] It should be noted that your book, and many statistical software packages, refer to the within-treatment variability as the error variability.

• The statistic used in ANOVA, the F-statistic, uses the ratio of between-treatment variability to within-treatment variability to test whether or not there is a difference among treatments. Specifically:

    F = between-treatment variability / within-treatment variability
      = (treatment effect + individual differences + experimental error) / (individual differences + experimental error)

• If the treatment effect is small then the ratio will be close to one. Therefore, an F-statistic close to one would be expected if the null hypothesis were true and there were no treatment differences.

• If the treatment effect is large then the ratio will be much greater than one, because the between-treatment variability will be much larger than the within-treatment variability.

• The hypotheses tested in ANOVA are:

    H0: µ1 = µ2 = µ3 = ... = µK
    H1: at least one mean is different from the rest

where K = the total number of groups or sample means being compared.
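A quick way to see this ratio logic is by simulation: when the null hypothesis is true the F-statistic hovers around one, and a treatment effect pushes it well above one. A minimal Python sketch (the population means, group size, spread, and replication count below are made-up illustration values):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    def average_f(pop_means, n=20, sigma=1.0, reps=2000):
        # Simulate `reps` experiments, each with K normal groups of size n,
        # and return the average of the resulting F-statistics.
        fs = []
        for _ in range(reps):
            groups = [rng.normal(m, sigma, size=n) for m in pop_means]
            f, p = stats.f_oneway(*groups)
            fs.append(f)
        return np.mean(fs)

    print(average_f([5, 5, 5]))   # H0 true: average F is close to 1
    print(average_f([4, 5, 6]))   # treatment effect present: average F is far above 1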
• In the population, group j has mean µj and variance σj². In the sample, group j has mean X̄j and variance sj². The sample size for group j is nj, and the total number of observations is N = n1 + n2 + n3 + ... + nK. The grand mean of all observations is X̄.

• The assumptions underlying the test are the same as the assumptions underlying the t-test for independent samples. Specifically:
1. Each group j in the population is normally distributed with mean µj.
2. The variance in each group is the same, so that σ1² = σ2² = ... = σK² = σ², otherwise known as the homogeneity of variance assumption.
3. Each observation is independent of the others.

• The computations underlying a simple one-way ANOVA are pretty straightforward if you remember that a variance is composed of two parts: (1) the sum of squared deviations from the mean (SS) and (2) the degrees of freedom (df), which can be thought of as the number of potentially different values used to compute the SS, minus one.

• Therefore the total variability, across all groups, is computed using SStotal = Σ(X − X̄)² with dftotal = N − 1. We partition this variability into two parts, the within-treatment variability and the between-treatment variability. Note that SStotal is simply the sum of the within-treatment SS and the between-treatment SS, and dftotal is simply the sum of the df associated with the within-treatment variance and the between-treatment variance.

• The within-treatment or within-group variability is computed using SSwithin = SSerror = Σ(X − X̄j)², which represents the sum of the squared deviations of each observation from its group mean, with dfwithin = dferror = (n1 − 1) + (n2 − 1) + (n3 − 1) + ... + (nK − 1) = (total number of observations) − (number of groups) = N − K. The ratio of SSwithin to dfwithin is known as the Mean Square within groups (MSwithin) or Mean Square Error (MSerror).

• The between-treatment variability is computed using SSbetween = SStreatment = Σ nj(X̄j − X̄)², which represents the sum of the squared deviations of the group means from the grand (overall) mean, weighted by group size, with dfbetween = dftreatment = K − 1, or the number of groups minus one. The ratio of SSbetween to dfbetween is known as the Mean Square between groups (MSbetween) or Mean Square Treatment (MStreatment).

• The F-statistic is calculated as the ratio of the Mean Square between groups (MSbetween or MStreatment) to the Mean Square within groups (MSwithin or MSerror). Specifically:

    F = MSbetween / MSwithin

• This ratio follows a sampling distribution known as the F distribution, which is a family of distributions indexed by the df of the numerator and the df of the denominator.
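These formulas translate almost line-for-line into code, as the sketch below shows; the three small groups are made up purely to exercise the arithmetic, and scipy's f_oneway serves as a cross-check:

    import numpy as np
    from scipy import stats

    # Made-up scores for three groups, just to exercise the formulas
    groups = [np.array([3.0, 4, 5, 4]), np.array([6.0, 7, 5, 6]), np.array([2.0, 3, 2, 3])]

    N = sum(len(g) for g in groups)
    K = len(groups)
    grand_mean = np.concatenate(groups).mean()

    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    ss_total = ((np.concatenate(groups) - grand_mean) ** 2).sum()
    assert np.isclose(ss_total, ss_between + ss_within)  # the partition holds

    F = (ss_between / (K - 1)) / (ss_within / (N - K))   # MSbetween / MSwithin
    p = stats.f.sf(F, K - 1, N - K)                      # upper tail of the F distribution
    print(F, p)
    print(stats.f_oneway(*groups))                       # scipy computes the same F and p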
Example

A psychologist is interested in determining the extent to which physical attractiveness may influence a person's judgment of other personal characteristics, such as intelligence or ability. So he selects three groups of subjects and asks them to pretend to be a company personnel manager, and he gives them all a stack of identical job applications which include pictures of the applicants. One group of subjects is given only pictures of very attractive people, another group is given only pictures of average-looking people, and a third group is given only pictures of unattractive people. Subjects are asked to rate the quality of each applicant on a scale of 0 (which represents very poor qualities) to 10 (which represents excellent qualities). The following data are obtained:

    Attractive:    5 3 4 3 4 5 3 5 4 6 8       (n1 = 11)
    Average:       6 6 5 8 5 6 4 7 3 7 6 8     (n2 = 12)
    Unattractive:  4 3 2 2 3 1 4 1 ...         (n3 = 11)

What should he conclude?

Well, we first need to calculate the grand mean and the means for each of the three groups:

    X̄  = (5 + 4 + 4 + 6 + 5 + 3 + 4 + ... + 1) / 34 = 4.32
    X̄1 = (5 + 4 + 4 + 3 + ... + 5) / 11 = 4.55
    X̄2 = (6 + 5 + 3 + 6 + ... + 8) / 12 = 5.92
    X̄3 = (4 + 3 + 1 + ... + 1) / 11 = 2.36

Now we can calculate[3]

    MSwithin = Σ(X − X̄j)² / (N − K)
             = [(5 − 4.55)² + (4 − 4.55)² + ... + (6 − 5.92)² + (5 − 5.92)² + ... + (4 − 2.36)² + ... + (1 − 2.36)²] / (34 − 3)
             = 1.94

and

    MSbetween = Σ nj(X̄j − X̄)² / (K − 1)
              = [11(4.55 − 4.32)² + 12(5.92 − 4.32)² + 11(2.36 − 4.32)²] / 2
              = (.54 + 30.46 + 42.25) / 2 = 73.25 / 2 = 36.63

(the components here are computed with unrounded group means).

[3] Answers obtained by hand, from Excel, or from a statistical software package will most likely vary slightly due to rounding error.

So the F-statistic = 36.63/1.94 = 18.88, but how likely is it to have obtained this value if the null hypothesis is true? With 2 and 31 df the critical F at α = .05 is approximately 3.32. So the psychologist can reject the null hypothesis and conclude that a person's judgment of the job qualifications of prospective applicants appears to be influenced by how attractive the prospective applicant is.
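Because a few of the raw scores are no longer legible in the table above, the hand calculation can still be checked from summary statistics alone, namely the group sizes, the (rounded) group means, and MSwithin; small discrepancies from the values above are rounding error:

    from scipy import stats

    ns    = [11, 12, 11]         # group sizes
    means = [4.55, 5.92, 2.36]   # rounded group means from above
    ms_within = 1.94             # error mean square from the hand calculation

    N, K = sum(ns), len(ns)
    grand = sum(n * m for n, m in zip(ns, means)) / N

    ss_between = sum(n * (m - grand) ** 2 for n, m in zip(ns, means))
    F = (ss_between / (K - 1)) / ms_within
    print(F)                                  # about 18.9, matching the 18.88 above
    print(stats.f.ppf(0.95, K - 1, N - K))    # critical F at alpha = .05, about 3.30
    print(stats.f.sf(F, K - 1, N - K))        # p-value, far below .05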
• The ANOVA procedure is robust to violations of its assumptions, especially the assumption of normality. Violating the assumption of homogeneity of variance is especially problematic if the groups have different sample sizes.

• Levene's test, which we talked about before in terms of the t-test, can be used to test whether the homogeneity of variance assumption has been violated. If it has, then the Welch procedure can be used to adjust the df used in ANOVA, similar to what we talked about for the t-test.

• If the normality assumption is violated then the data can be transformed (this won't change the substance of the statistical test; it will just re-scale things) to be more normally distributed. Common transformations include:
1. Taking the square root of each observation, which is beneficial if the data are very skewed.
2. Taking the log of each observation, which is beneficial if the data are very positively skewed.
3. Taking the reciprocal of each observation (i.e., 1/observation), which is beneficial if there are very large values in the positive tail of the distribution.
Another approach to dealing with a violation of the normality assumption is to use a trimmed sample, which removes a fixed percentage of the extreme values in each tail of the distribution, or a Winsorized sample, which replaces the trimmed values with the most extreme observations left in each tail. In the latter case the df need to be adjusted by the number of values that are replaced.

• As we explore more complicated ANOVA models (models with more than one grouping variable) it will become important to be able to differentiate between fixed factors (or groups) and random factors.

• A fixed factor is one in which the researcher is only interested in the various levels of the different groups being studied. These levels are not assumed to be representative of, nor generalizable to, other levels of the group.

• A random factor is one in which the researcher considers the various levels of the grouping variable to be a random sample from all possible levels. In this situation the results of the statistical test may be generalized to other levels of the group.

• It should be noted that there is a direct relationship between the t-test for independent samples and the ANOVA when K = 2. Specifically, it can be shown mathematically that the F-statistic equals the t-statistic squared (i.e., F = t²).

Power and Effect Size

• Similar to the t-test, finding statistical significance does not tell us whether the differences are important from a practical perspective. Several measures of effect size have been proposed, all of which differ in terms of how biased they are.

• η² (eta-squared), or the correlation ratio, is one of the oldest measures of effect size. It represents the percentage of total variability that can be "accounted for" by differences in the grouping variable, or the percentage by which the error variability (i.e., the within-treatment variability) is reduced by considering group membership. It is calculated as the ratio of SSbetween to SStotal. Specifically:

    η² = SSbetween / SStotal

For our previous example, η² = 73.25/133.44 = .55, meaning 55% of the variation in ratings can be accounted for by differences in the independent variable (i.e., the groups). This effect size measure is biased upwards, meaning it is larger than would be expected if it had been calculated from the population rather than estimated from the sample.

• An alternative effect size measure to η² is ω² (omega-squared). It also measures the percentage of total variability that can be "accounted for" by between-group variability, but it does so using MS values as well as SS values, thereby making use of sample size information. Specifically, for a fixed-effect[4] ANOVA:

    ω² = [SSbetween − (K − 1)MSwithin] / (SStotal + MSwithin)

For our previous example, ω² = [73.25 − (3 − 1)(1.942)] / (133.44 + 1.942) = 69.37/135.38 = .51.

This measure of effect size has been found to be less biased than η². Note that it is smaller than what we obtained for η².

[4] Note that this measure of effect size is computed slightly differently for a random-effects ANOVA model; the formula for the random-effects model is not presented here.
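Both measures are one-line computations once the sums of squares are in hand; a quick check of the numbers above:

    ss_between, ss_total = 73.25, 133.44
    ms_within, K = 1.942, 3

    eta_sq = ss_between / ss_total
    omega_sq = (ss_between - (K - 1) * ms_within) / (ss_total + ms_within)
    print(round(eta_sq, 2), round(omega_sq, 2))   # 0.55 and 0.51, as above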
• Estimating power for ANOVA is a straightforward extension of how power was estimated for the t-test. We simply use different notation and different tables. Moreover, we assume equal sample sizes in each group, which is the optimal situation.

• In an ANOVA context, φ′ is comparable to d in the independent t-test context, in that it separates the effect size from the sample size. However, we need to incorporate the fact that we are using variance estimates in the ANOVA context. Specifically:

    φ′ = √{ [Σ(µj − µ)² / K] / MSwithin }

where µ is the grand (overall) population mean. So, if we were to assume that the population values correspond exactly to what we obtained in our example (unlikely as this may be), then

    φ′ = √{ [((4.55 − 4.32)² + (5.92 − 4.32)² + (2.36 − 4.32)²) / 3] / 1.942 } = √(2.143/1.942) = 1.05

• Furthermore, in an ANOVA context, φ is comparable to δ in the independent t-test context, in that it incorporates sample size to allow us to determine how large a sample we need to detect meaningful differences from a practical perspective. Even though we may wind up with unequal sample sizes in our groups, we calculate power based on the assumption of equal sample sizes. Specifically:

    φ = φ′√n, where n = the number of subjects in each group

So, if we were to assume that we expected 12 subjects in each of our groups in our example, then φ = φ′√n = 1.05√12 = 3.64.

• In an ANOVA context we can use the non-centrality parameter of the F distribution, which shifts the mean of the F-distribution upward when the null hypothesis is false, with K − 1 and N − K df for the numerator and denominator, respectively. For our example we will use the table entry corresponding to φ = 3.0, because our table in the book does not go any higher, and we will use 2 df for the numerator and 30 df for the denominator (because our book does not have very fine gradations for the df in the denominator). Using the table in the book we find that β = .03 if we want to conduct our test at α = .01. Therefore, since Power = 1 − β, the power of the experiment we ran was approximately .97.
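The table lookup can be cross-checked directly against the non-central F distribution in scipy. The sketch below assumes the usual Pearson–Hartley scaling, in which the non-centrality parameter is λ = Kφ² (worth verifying against the convention your own book's tables use):

    from scipy import stats

    K, df1, df2 = 3, 2, 30    # number of groups; numerator and denominator df, as in the table lookup
    phi = 3.0                 # the tabled value used above
    lam = K * phi ** 2        # non-centrality parameter, assuming lambda = K * phi^2

    crit = stats.f.ppf(0.99, df1, df2)         # critical F at alpha = .01
    power = stats.ncf.sf(crit, df1, df2, lam)  # P(F > critical value) under the alternative
    print(round(power, 2))                     # roughly .97, in line with the table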