Chapter 12: One-Way Analysis of Variance
Introduction to the Practice of Statistics, Seventh Edition
Moore / McCabe / Craig
Lecture Presentation Slides

Chapter 12: One-Way Analysis of Variance
12.1 Inference for One-Way Analysis of Variance
12.2 Comparing the Means

12.1 Inference for One-Way Analysis of Variance
The Idea of ANOVA
Comparing Several Means
The Problem of Multiple Comparisons
The ANOVA F Test

Introduction
The two-sample t procedures of Chapter 7 compared the means of two populations, or the mean responses to two treatments in an experiment. In this chapter we compare any number of means using analysis of variance. Note: we are comparing means even though the procedure is called analysis of variance.

The Idea of ANOVA
When comparing different populations or treatments, the data are subject to sampling variability. We can pose the question for inference in terms of the mean response. Analysis of variance (ANOVA) is the technique used to compare several means. One-way ANOVA is used for situations in which there is only one way to classify the populations of interest; two-way ANOVA is used to analyze the effect of two factors.

The Idea of ANOVA
The details of ANOVA are a bit daunting. The main idea is that when we ask whether a set of sample means gives evidence for differences among the population means, what matters is not how far apart the sample means are, but how far apart they are relative to the variability of individual observations.

The Analysis of Variance Idea
Analysis of variance compares the variation due to specific sources with the variation among individuals who should be similar. In particular, ANOVA tests whether several populations have the same mean by comparing how far apart the sample means are with how much variation there is within the samples.

The Idea of ANOVA
The sample means for the three samples are the same for each set. The variation among sample means for (a) is identical to (b).
The variation among the individuals within the three samples is much less for (b). Conclusion: the samples in (b) contain a larger amount of variation among the sample means relative to the amount of variation within the samples, so ANOVA will find more significant differences among the means in (b), assuming equal sample sizes here for (a) and (b). Note: larger samples will find more significant differences.

Comparing Several Means
Do SUVs and trucks have lower gas mileage than midsize cars?
Response variable: gas mileage (mpg)
Groups: vehicle classification (31 midsize cars, 31 SUVs, 14 standard-size pickup trucks)
Data from the Environmental Protection Agency's Model Year 2003 Fuel Economy Guide, www.fueleconomy.gov.

Comparing Several Means
Means: midsize 27.903, SUV 22.677, pickup 21.286.
Mean gas mileage for SUVs and pickups appears less than for midsize cars. Are these differences statistically significant?

Comparing Several Means
Null hypothesis: the true mean gas mileage is the same for all groups (the three vehicle classifications). We could run separate t tests to compare each pair of means: 27.903 vs. 22.677, 27.903 vs. 21.286, and 22.677 vs. 21.286; that is, H0: μ1 = μ2, H0: μ1 = μ3, and H0: μ2 = μ3. However, this gives rise to the problem of multiple comparisons!

The Problem of Multiple Comparisons
We have the problem of how to do many comparisons at the same time with some overall measure of confidence in all the conclusions. Statistical methods for dealing with this problem usually have two steps:
1. An overall test to find any differences among the parameters we want to compare.
2. A detailed follow-up analysis to decide which groups differ and how large the differences are.
Follow-up analyses can be quite complex; we will look only at the overall test for a difference in several means and examine the data to draw follow-up conclusions.
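The inflation from doing many comparisons at once can be made concrete: if each of k tests is run at significance level α, and the tests were independent, the chance of at least one false rejection is 1 − (1 − α)^k. A minimal sketch (the independence assumption is a simplification; in practice the pairwise tests share data and are correlated):

```python
# Probability of at least one Type I error among k independent
# tests, each run at significance level alpha.
def familywise_error(alpha, k):
    return 1 - (1 - alpha) ** k

# Three pairwise comparisons (midsize vs. SUV, midsize vs. pickup,
# SUV vs. pickup), each at alpha = 0.05:
fwe = familywise_error(0.05, 3)
print(round(fwe, 4))  # 0.1426
```

For the three pairwise gas-mileage tests the overall error rate is already about 14%, nearly triple the nominal 5%, which is why an overall ANOVA F test is run first.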
The One-Way ANOVA Model
Random sampling always produces chance variation. Any "factor effect" would thus show up in our data as the factor-driven differences plus chance variation ("error"):

    DATA = FIT + RESIDUAL

The one-way ANOVA model analyzes situations in which the chance variations are Normally distributed, N(0, σ):

    xij = μi + εij,   with the εij independent N(0, σ)

where xij is the jth observation from the ith population and μi is the mean of the ith population.

The ANOVA F Test
We want to test the null hypothesis that there are no differences among the means of the populations:

    H0: μ1 = μ2 = ... = μI

The basic conditions for inference are that we have random samples from each population and that each population is Normally distributed. The alternative hypothesis is that there is some difference; that is, not all the means are equal. This hypothesis is not one-sided or two-sided. It is "many-sided." This test is called the analysis of variance F test (ANOVA).

Conditions for ANOVA
Like all inference procedures, ANOVA is valid only in some circumstances. The conditions under which we can use ANOVA are:
- We have I independent SRSs, one from each population. We measure the same response variable for each sample.
- The ith population has a Normal distribution with unknown mean μi. One-way ANOVA tests the null hypothesis that all the population means are the same.
- All of the populations have the same standard deviation σ, whose value is unknown.
Checking standard deviations in ANOVA: the results of the ANOVA F test are approximately correct when the largest sample standard deviation is no more than twice as large as the smallest sample standard deviation.
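The two-standard-deviation rule of thumb at the end of this slide is easy to check in code. A minimal sketch; the two input lists are illustrative (the first uses the sample standard deviations from the nematode example later in the chapter):

```python
# Rule of thumb for the equal-standard-deviation condition:
# ANOVA results are approximately correct when
# max(s_i) <= 2 * min(s_i).
def sd_condition_ok(sample_sds):
    return max(sample_sds) <= 2 * min(sample_sds)

print(sd_condition_ok([2.053, 1.486, 1.244, 1.771]))  # True: 2.053 <= 2 * 1.244
print(sd_condition_ok([5.0, 1.0, 2.0]))               # False: 5.0 > 2 * 1.0
```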
The ANOVA F Statistic
To determine statistical significance, we need a test statistic that we can calculate. The analysis of variance F statistic for testing the equality of several means has this form:

    F = (variation among the sample means) / (variation among individuals in the same sample)

- F must be zero or positive: F is zero only when all the sample means are identical, and F gets larger as the means move farther apart.
- Large values of F are evidence against H0: equal means.
- The F test is upper one-sided.

F Distributions
The F distributions are a family of right-skewed distributions that take only values greater than 0. A specific F distribution is determined by the degrees of freedom of the numerator and denominator of the F statistic. When describing an F distribution, always give the numerator degrees of freedom first. Our brief notation will be F(df1, df2), with df1 degrees of freedom in the numerator and df2 degrees of freedom in the denominator.

Degrees of Freedom for the F Test
We want to compare the means of I populations. We have an SRS of size ni from the ith population, so the total number of observations in all samples combined is

    N = n1 + n2 + ... + nI

If the null hypothesis that all population means are equal is true, the ANOVA F statistic has the F distribution with I − 1 degrees of freedom in the numerator and N − I degrees of freedom in the denominator.
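The F statistic and its degrees of freedom can be computed directly from raw samples: the numerator is the mean square for groups (between-group variation) and the denominator is the mean square for error (pooled within-group variation), as spelled out on the next slide. A minimal sketch with small made-up samples (the raw EPA observations are not listed here, so the data below are purely illustrative):

```python
# One-way ANOVA: F statistic and its degrees of freedom.
#   F = MSG / MSE, with df = (I - 1, N - I).
def one_way_anova(groups):
    I = len(groups)                                # number of groups
    N = sum(len(g) for g in groups)                # total observations
    grand = sum(x for g in groups for x in g) / N  # grand mean
    means = [sum(g) / len(g) for g in groups]
    # MSG: between-group mean square
    msg = sum(len(g) * (m - grand) ** 2
              for g, m in zip(groups, means)) / (I - 1)
    # MSE: pooled within-group mean square (the pooled variance s_p^2)
    mse = sum(sum((x - m) ** 2 for x in g)
              for g, m in zip(groups, means)) / (N - I)
    return msg / mse, (I - 1, N - I)

# Hypothetical data: three groups of three observations each.
f, df = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(f, 6), df)  # 13.0 (2, 6)
```

With 9 observations in 3 groups, the null distribution is F(2, 6); for the gas-mileage study (groups of 31, 31, and 14), the degrees of freedom would be (2, 73).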
The ANOVA F Test

    F = (variation among the sample means) / (variation among individuals in the same sample)

The measures of variation in the numerator and denominator are mean squares.
Numerator: mean square for groups (MSG),

    MSG = [ n1(x̄1 − x̄)² + n2(x̄2 − x̄)² + ... + nI(x̄I − x̄)² ] / (I − 1)

Denominator: mean square for error (MSE),

    MSE = [ (n1 − 1)s1² + (n2 − 1)s2² + ... + (nI − 1)sI² ] / (N − I)

MSE is also called the pooled sample variance, written sp² (sp is the pooled standard deviation); sp² estimates the common variance σ².

The ANOVA Table

    Source of variation         Sum of squares SS       DF      Mean square MS   F         P-value
    Among or between "groups"   SSG = Σ ni(x̄i − x̄)²     I − 1   MSG = SSG/DFG    MSG/MSE   tail area above the value of F
    Within groups or "error"    SSE = Σ (ni − 1)si²     N − I   MSE = SSE/DFE
    Total                       SST = Σ (xij − x̄)²      N − 1

R² = SSG/SST is the coefficient of determination; √MSE = sp is the pooled standard deviation. (Software output may also report F crit, the critical value of F for the chosen significance level.)
The sums of squares represent variation in the data: SST = SSG + SSE. The degrees of freedom reflect the ANOVA model, DFT = DFG + DFE: DATA ("total") = FIT ("groups") + RESIDUAL ("error").

Example
P-value < 0.05, so there are significant differences, and a follow-up analysis is warranted. There is significant evidence that the three types of vehicle do not all have the same mean gas mileage. From the confidence intervals (and looking at the original data), we see that SUVs and pickups have similar fuel economy and both are distinctly poorer than midsize cars.

12.2 Comparing the Means
Contrasts
Multiple Comparisons
Power of the F Test*

Introduction
You have calculated a P-value for your ANOVA test. Now what? If you found a significant result, you still need to determine which treatments were different from which. You can gain insight by looking back at your plots (boxplot, mean ± s). There are several tests of statistical significance designed specifically for multiple tests: you can choose contrasts or multiple comparisons. You can also find the confidence interval for each mean μi shown to be significantly different from the others.
Introduction
Contrasts can be used only when there are clear expectations BEFORE starting an experiment, and these are reflected in the experimental design. Contrasts are planned comparisons. Example: patients are given either drug A, drug B, or a placebo. The three treatments are not symmetrical; the placebo is meant to provide a baseline against which the other drugs can be compared.
Multiple comparisons should be used when there are no justified expectations. These are pairwise tests of significance. Example: we compare gas mileage for eight brands of SUVs, with no prior knowledge to expect any brand to perform differently from the rest. Pairwise comparisons should be performed here, but only if an ANOVA test on all eight brands reached statistical significance first.
It is NOT appropriate to use a contrast test when the suggested comparisons appear only after the data are collected.

Contrasts
When an experiment is designed to test a specific hypothesis that some treatments are different from other treatments, we can use contrasts to test for significant differences between these specific treatments. Contrasts are more powerful than multiple comparisons because they are more specific; they are better able to pick up a significant difference. You can use a t test on a contrast or calculate a t confidence interval. The results are valid regardless of the results of your multiple-sample ANOVA test (you are still testing a valid hypothesis).

Contrasts
A contrast is a combination of population means of the form

    ψ = Σ ai μi

where the coefficients ai have sum 0. To test the null hypothesis H0: ψ = 0, use the t statistic

    t = c / SEc

with the degrees of freedom DFE that are associated with sp. The alternative hypothesis can be one- or two-sided.
The corresponding sample contrast is

    c = Σ ai x̄i

The standard error of c is

    SEc = sp √( Σ ai²/ni ) = √( MSE Σ ai²/ni )

A level C confidence interval for ψ is

    c ± t* SEc

where t* is the critical value defining the middle C% of the t distribution with DFE degrees of freedom.

Contrasts
Contrasts are not always readily available in statistical software packages (when they are, you need to assign the coefficients ai), or they may be limited to comparing each sample to a control. If your software doesn't provide an option for contrasts, you can test your contrast hypothesis with a regular t test using the formulas just highlighted. Remember to use the pooled variance and degrees of freedom, as they reflect your better estimate of the population variance. You can then look up your P-value in a table of the t distribution.

Example
Do nematodes affect plant growth? A botanist prepares 16 identical planting pots and adds different numbers of nematodes to the pots. Seedling growth (in mm) is recorded two weeks later. One group contains no nematodes at all. If the botanist planned this group as a baseline/control, then a contrast of all the nematode groups against the control would be valid.

Example
Contrast of all the nematode groups against the control. Combined contrast hypotheses: H0: μ1 = (1/3)(μ2 + μ3 + μ4) vs. Ha: μ1 > (1/3)(μ2 + μ3 + μ4), one-tailed.

    Group                    x̄i       si
    G1: 0 nematodes          10.65    2.053
    G2: 1,000 nematodes      10.425   1.486
    G3: 5,000 nematodes       5.6     1.244
    G4: 10,000 nematodes      5.45    1.771

Contrast coefficients: (+1, −1/3, −1/3, −1/3), or equivalently (+3, −1, −1, −1).

    c = Σ ai x̄i = 3(10.65) − 10.425 − 5.6 − 5.45 = 10.475
    SEc = sp √( Σ ai²/ni ) = √( 2.78 × (3² + 1² + 1² + 1²)/4 ) = √(2.78 × 3) ≈ 2.9
    t = c / SEc = 10.475 / 2.9 ≈ 3.6,   df = DFE = N − I = 12

In Excel: TDIST(3.6, 12, 1) = tdist(t, df, tails) ≈ 0.002 (P-value). Nematodes result in significantly shorter seedlings (α = 1%).
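The contrast computation above can be reproduced from the summary statistics alone. A minimal sketch, using the group means, standard deviations, and sizes from the nematode example (sp² = MSE is recomputed by pooling the four sample variances):

```python
import math

# Contrast t statistic from summary statistics:
#   c = sum(a_i * xbar_i),  SE_c = s_p * sqrt(sum(a_i**2 / n_i)),
#   t = c / SE_c  with df = DFE = N - I.
def contrast_t(coefs, means, sds, sizes):
    N = sum(sizes)
    I = len(means)
    # pooled variance s_p^2 = MSE
    sp2 = sum((n - 1) * s ** 2 for n, s in zip(sizes, sds)) / (N - I)
    c = sum(a * m for a, m in zip(coefs, means))
    se = math.sqrt(sp2 * sum(a ** 2 / n for a, n in zip(coefs, sizes)))
    return c / se, N - I

# Nematode example: coefficients (+3, -1, -1, -1), n = 4 per group.
t, df = contrast_t([3, -1, -1, -1],
                   [10.65, 10.425, 5.6, 5.45],
                   [2.053, 1.486, 1.244, 1.771],
                   [4, 4, 4, 4])
print(round(t, 2), df)  # 3.63 12
```

This matches the hand computation above (and the software output that follows) to rounding: t ≈ 3.63 with 12 degrees of freedom.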
Example
ANOVA: H0: all μi are equal vs. Ha: not all μi are equal. SPSS ANOVA output for seedling length:

    Source           Sum of Squares   df   Mean Square   F        Sig.
    Between Groups        100.647      3        33.549   12.080   .001
    Within Groups          33.328     12         2.777
    Total                 133.974     15

The ANOVA is significant: not all μi are equal.

Planned comparison: H0: μ1 = (1/3)(μ2 + μ3 + μ4) vs. Ha: μ1 > (1/3)(μ2 + μ3 + μ4), one-tailed. SPSS contrast coefficients (SPSS codes the contrast as (−3, +1, +1, +1), which only flips the sign of the contrast value):

    NematodesLevel:    0    1000   5000   10000
    Contrast 1:       −3       1      1       1

    Contrast Tests (SeedlingLength)     Value of Contrast   Std. Error   t        df      Sig. (2-tailed)
    Assume equal variances                       −10.4750      2.88650   −3.629   12      .003
    Does not assume equal variances              −10.4750      3.34823   −3.129   4.139   .034

The reported Sig. is two-tailed; halve it for the one-tailed alternative. Nematodes result in significantly shorter seedlings (α = 1%).

Multiple Comparisons
Multiple comparison tests are variants on the two-sample t test. They use the pooled standard deviation sp = √MSE and the pooled degrees of freedom DFE, and they compensate for making multiple comparisons. We compute the t statistic for all pairs of means:

    tij = (x̄i − x̄j) / ( sp √(1/ni + 1/nj) )

A given test is significant (μi and μj significantly different) when |tij| ≥ t** (df = DFE). The value of t** depends on which procedure you choose to use.

The Bonferroni Procedure
The Bonferroni procedure performs a number of pairwise comparisons with t tests and then multiplies each P-value by the number of comparisons made. This ensures that the probability of making any false rejection among all the comparisons made is no greater than the chosen significance level α. As a consequence, the more pairwise comparisons you make, the more difficult it is to show statistical significance for each test. But the chance of committing a Type I error also increases with the number of tests made; the Bonferroni procedure lowers the working significance level of each test to compensate for the increased chance of Type I errors among all the tests performed.
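The Bonferroni adjustment itself is just arithmetic: each raw P-value is multiplied by the number of comparisons made (and capped at 1). A minimal sketch with hypothetical raw P-values:

```python
# Bonferroni correction: multiply each raw P-value by the number
# of comparisons k, capping the result at 1.
def bonferroni(p_values):
    k = len(p_values)
    return [min(1.0, p * k) for p in p_values]

# Hypothetical raw P-values from three pairwise t tests:
adjusted = bonferroni([0.003, 0.02, 0.40])
print([round(p, 3) for p in adjusted])  # [0.009, 0.06, 1.0]
```

Only the first comparison would remain significant at α = 0.05 after adjustment, illustrating how the procedure trades power for control of the familywise error rate.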
Simultaneous Confidence Intervals
We can also calculate simultaneous level C confidence intervals for all pairwise differences (μi − μj) between population means:

    CI: (x̄i − x̄j) ± t** sp √(1/ni + 1/nj)

Here sp = √MSE is the pooled standard deviation, and t** is the t critical value with degrees of freedom DFE = N − I, adjusted for multiple, simultaneous comparisons (e.g., by the Bonferroni procedure).

Power*
The power, or sensitivity, of a one-way ANOVA is the probability that the test will be able to detect a difference among the groups (i.e., reach statistical significance) when there really is a difference. Estimate the power of your test while designing your experiment, so that you select sample sizes appropriate to detect an amount of difference between means that you deem important. Too small a sample is a wasted experiment, but too large a sample is also a waste of resources. A power of about 80% is often suggested.

Power Computations*
ANOVA power is affected by:
- the significance level α,
- the sample sizes and the number of groups being compared,
- the differences between the group means μi, and
- the guessed population standard deviation σ.
You need to decide what alternative Ha you would consider important to detect statistically for the means μi, and guess the common standard deviation σ (from similar studies or preliminary work). The power computation then requires calculating a noncentrality parameter λ; under the alternative, the F statistic follows a noncentral F distribution with DFG and DFE degrees of freedom, from which the power of the test is obtained.

Chapter 12: One-Way Analysis of Variance
12.1 Inference for One-Way Analysis of Variance
12.2 Comparing the Means