252anova 1/26/07 (Open this document in 'Outline' view!) Roger Even Bove

F. ANALYSIS OF VARIANCE

1. 1-Way Analysis of Variance

a. The ANOVA model - relation to regression

The one-way ANOVA model is used to compare the means of more than two samples, taken from populations that are all assumed to have the same variance. Each sample (called a treatment) is usually represented as a column, but there is no requirement that each column have the same number of items in it. We will assume the model $x_{ij} = \mu + a_j + e_{ij}$, where $i = 1$ through $n_j$ and $n = \sum_j n_j$. We thus have $m$ treatments ($j = 1, \dots, m$), $n_j$ items in each column and a total of $n$ observations. (Thus $x_{ij}$ is the number in row $i$ and column $j$.)

b. An ANOVA problem

The following data describe monthly expenses for energy in three random samples of essentially identical homes. Each column represents expenses on one fuel. $\alpha = .05$.

Fuel     1     2     3
        89   104    86
       101   120    98
        87    98   100
        87   110    96
Sum    364   432   380

Our hypotheses are $H_0\!: \mu_1 = \mu_2 = \mu_3$ against $H_1\!:$ not all means equal. In the notation used here, the subscript $i$ is replaced by a dot to indicate that a mean has been taken, so that $\bar{x}_{.j}$ is the mean of column $j$. In particular, the mean of column 1 is $\bar{x}_{.1} = \frac{364}{4} = 91$, and likewise $\bar{x}_{.2} = \frac{432}{4} = 108$ and $\bar{x}_{.3} = \frac{380}{4} = 95$. The overall or grand mean is the mean of all the numbers in the problem; it is often indicated by $\bar{\bar{x}}$, but $\bar{x}$ seems to be a more appealing notation. Here $\bar{x} = \frac{364 + 432 + 380}{12} = 98$.

We compute three sums of squares.

(i) The total sum of squares is the same thing as the numerator of the sample variance of the numbers in the problem.
$SST = \sum_j \sum_i (x_{ij} - \bar{x})^2 = (89-98)^2 + (101-98)^2 + (87-98)^2 + (87-98)^2 + (104-98)^2 + (120-98)^2 + (98-98)^2 + (110-98)^2 + (86-98)^2 + (98-98)^2 + (100-98)^2 + (96-98)^2 = 1148$

(ii) The sum of squares within treatments has the same number of terms, but it highlights the contribution to the total sum of squares generated by the difference between the individual numbers and the column (treatment) means.
$SSW = \sum_j \sum_i (x_{ij} - \bar{x}_{.j})^2 = (89-91)^2 + (101-91)^2 + (87-91)^2 + (87-91)^2 + (104-108)^2 + (120-108)^2 + (98-108)^2 + (110-108)^2 + (86-95)^2 + (98-95)^2 + (100-95)^2 + (96-95)^2 = 516$

(iii) The sum of squares between treatments also has the same number of terms, but it highlights the contribution to the total sum of squares generated by the difference between the column (treatment) means and the overall mean.
$SSB = \sum_j \sum_i (\bar{x}_{.j} - \bar{x})^2$, a sum in which $(91-98)^2$, $(108-98)^2$ and $(95-98)^2$ each appear four times, once for every item in the column. Because of this repetition of the column means, the sum simplifies to
$SSB = \sum_j n_j (\bar{x}_{.j} - \bar{x})^2 = 4(91-98)^2 + 4(108-98)^2 + 4(95-98)^2 = 632$.

But note that $SSB + SSW = SST$, so that the computation of one of the three sums of squares is unnecessary. The material is summarized in a table like the one below.

Source    SS    DF    MS                 F
Between   SSB   m-1   MSB = SSB/(m-1)    F = MSB/MSW
Within    SSW   n-m   MSW = SSW/(n-m)
Total     SST   n-1

We fill in the table with the numbers we have computed and compare the F that we have computed with a table F with the appropriate significance level and the degrees of freedom shown in the DF column. If the F that we have computed is larger than the table F, reject the null hypothesis.

Source    SS     DF   MS       F        F.05                H0
Between   632    2    316      5.51 s   F.05(2,9) = 4.26    Column means equal
Within    516    9    57.333
Total     1148   11

The 's' for 'significant difference' indicates that the null hypothesis of equality of means has been rejected. 'ns' for 'no significant difference' would indicate that the null hypothesis has not been rejected.
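The handout points to Minitab output (252anovaex1) for this problem. As an optional cross-check that is not part of the original notes, the short Python sketch below reproduces the same sums of squares and F statistic from the definitions above and compares them with SciPy's built-in routine; it assumes NumPy and SciPy are available, and the variable names are mine.

```python
# Minimal sketch: one-way ANOVA for the fuel-expense data, from the definitions above.
import numpy as np
from scipy import stats

fuel1 = np.array([89, 101, 87, 87])
fuel2 = np.array([104, 120, 98, 110])
fuel3 = np.array([86, 98, 100, 96])
columns = [fuel1, fuel2, fuel3]

all_data = np.concatenate(columns)
n, m = all_data.size, len(columns)      # n = 12 observations, m = 3 treatments
grand_mean = all_data.mean()            # x-bar = 98

SST = ((all_data - grand_mean) ** 2).sum()                               # 1148
SSW = sum(((col - col.mean()) ** 2).sum() for col in columns)            # 516
SSB = sum(col.size * (col.mean() - grand_mean) ** 2 for col in columns)  # 632

MSB, MSW = SSB / (m - 1), SSW / (n - m)   # 316 and 57.333
F = MSB / MSW                             # about 5.51
F_crit = stats.f.ppf(0.95, m - 1, n - m)  # about 4.26, so reject H0
print(SST, SSW, SSB, F, F_crit)

# Cross-check with SciPy's one-way ANOVA
F_scipy, p_value = stats.f_oneway(fuel1, fuel2, fuel3)
print(F_scipy, p_value)                   # F about 5.51, p about 0.027 < .05
```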
c. A format for ANOVA

If we use the same simplifications that we use in calculating a sample variance, we can get the tableau below.

Fuel                   1         2         3
                      89       104        86
                     101       120        98
                      87        98       100
                      87       110        96
Sum                  364   +   432   +   380   =   1176 = Σ_j Σ_i x_ij
n_j                    4   +     4   +     4   =     12 = n
x̄_.j               91.00    108.00     95.00
Σ_i x_ij²          33260   + 46920   + 36216   = 116396 = Σ_j Σ_i x_ij²
x̄_.j²               8281   + 11664   +  9025   =  28970 = Σ_j x̄_.j²

$\bar{x} = \frac{\sum_j \sum_i x_{ij}}{n} = \frac{1176}{12} = 98$

$SST = \sum_j \sum_i x_{ij}^2 - n\bar{x}^2 = 116396 - 12(98)^2 = 1148$

$SSB = \sum_j n_j \bar{x}_{.j}^2 - n\bar{x}^2 = 4(91)^2 + 4(108)^2 + 4(95)^2 - 12(98)^2 = 4(28970) - 12(98)^2 = 632$

Source    SS     DF   MS       F        F.05                H0
Between   632    2    316      5.51 s   F.05(2,9) = 4.26    Column means equal
Within    516    9    57.333
Total     1148   11

Explanation: Since the Sum of Squares (SS) column must add up, 516 is found by subtracting 632 from 1148. Since $n = 12$, the total degrees of freedom are $n - 1 = 11$. Since there are 3 random samples or columns, the degrees of freedom for Between is 3 - 1 = 2. Since the Degrees of Freedom (DF) column must add up, 9 = 11 - 2. The Mean Square (MS) column is found by dividing the SS column by the DF column; 316 is MSB and 57.333 is MSW. $F = \frac{MSB}{MSW}$, and it is compared with $F_{.05}$ from the F table with $df_1 = 2$ and $df_2 = 9$. To see this as Minitab output go to 252anovaex1.

d. Confidence Intervals

i. A Single Confidence Interval

If we desire a single interval, we use the formula for the difference between two means when the variances are assumed equal, with MSW serving as the pooled estimate of the common variance. For example, for the difference between the means of column 1 and column 2,
$\mu_1 - \mu_2 = (\bar{x}_{.1} - \bar{x}_{.2}) \pm t^{(n-m)}_{\alpha/2}\, s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$, where $s = \sqrt{MSW}$.

ii. Scheffé Confidence Interval

If we desire intervals that will simultaneously be valid for a given confidence level for all possible intervals between column means, use
$\mu_1 - \mu_2 = (\bar{x}_{.1} - \bar{x}_{.2}) \pm \sqrt{(m-1)F^{(m-1,\,n-m)}_{\alpha}}\; s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$.

iii. Bonferroni Confidence Interval

If we only need $k$ different intervals, use
$\mu_1 - \mu_2 = (\bar{x}_{.1} - \bar{x}_{.2}) \pm t^{(n-m)}_{\alpha/2k}\, s\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$.

iv. Tukey Confidence Interval

This also applies to all possible differences between column means.
$\mu_1 - \mu_2 = (\bar{x}_{.1} - \bar{x}_{.2}) \pm q^{(m,\,n-m)}_{\alpha}\, \frac{s}{\sqrt{2}}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$
This gives rise to Tukey's HSD (Honestly Significant Difference) procedure. Two sample means $\bar{x}_{.1}$ and $\bar{x}_{.2}$ are significantly different if $\left|\bar{x}_{.1} - \bar{x}_{.2}\right|$ is greater than $q^{(m,\,n-m)}_{\alpha}\, \frac{s}{\sqrt{2}}\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}$.
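As an aside that is not in the original handout, the sketch below evaluates the four interval half-widths above for $\mu_1 - \mu_2$ in the fuel example, using SciPy for the t, F and studentized-range quantiles. scipy.stats.studentized_range requires SciPy 1.7 or later, and the variable names are my own.

```python
# Hedged sketch: half-widths of the single, Scheffe, Bonferroni and Tukey
# intervals for mu1 - mu2 in the fuel example (MSW = 57.333, n - m = 9 df).
import numpy as np
from scipy import stats

MSW, df_within, m = 57.333, 9, 3
n1 = n2 = 4
alpha, k = 0.05, 3                        # k = number of Bonferroni intervals
s = np.sqrt(MSW)
se = s * np.sqrt(1 / n1 + 1 / n2)         # s * sqrt(1/n1 + 1/n2)

single     = stats.t.ppf(1 - alpha / 2, df_within) * se
scheffe    = np.sqrt((m - 1) * stats.f.ppf(1 - alpha, m - 1, df_within)) * se
bonferroni = stats.t.ppf(1 - alpha / (2 * k), df_within) * se
tukey      = stats.studentized_range.ppf(1 - alpha, m, df_within) * se / np.sqrt(2)

diff = 91 - 108                           # x-bar.1 - x-bar.2 = -17
for name, h in [("single", single), ("Scheffe", scheffe),
                ("Bonferroni", bonferroni), ("Tukey", tukey)]:
    print(f"{name:10s}: {diff} +/- {h:.2f}")
```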
2. 2-Way Analysis of Variance

a. The 2-way model

We will assume $R$ rows, $C$ columns and $P$ observations per cell. Thus our model reads $x_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \varepsilon_{ijk}$, where $i = 1$ through $R$, $j = 1$ through $C$, and $k = 1$ through $P$. We will be testing three pairs of hypotheses:
(i) $H_{01}$: All row means equal (all $\alpha_i$ zero); $H_{11}$: Not all row means equal.
(ii) $H_{02}$: All column means equal (all $\beta_j$ zero); $H_{12}$: Not all column means equal.
(iii) $H_{03}$: No interaction (all $\gamma_{ij}$ zero); $H_{13}$: Interaction.

This is similar to one-way ANOVA with $RC$ groups, but the 'between' variation is itemized as to whether it is due to variation between row means, variation between column means or interaction. If we remember that $n = RCP$ and $m = RC$, we can rewrite the one-way ANOVA table diagram of the previous section as below. As previously, we get the items in the MS column by dividing the numbers in the SS column by the numbers in the DF column. The F is then found by dividing MSB by MSW.

Source    SS    DF        MS    F         F.05   H0
Between   SSB   RC-1      MSB   MSB/MSW   ___    Treatment means equal
Within    SSW   RC(P-1)   MSW
Total     SST   RCP-1

We can now rewrite the same table with the 'between' items split up.

Source        SS    DF           MS    F         F.05   H0
Rows          SSR   R-1          MSR   MSR/MSW   ___    Row means equal
Columns       SSC   C-1          MSC   MSC/MSW   ___    Column means equal
Interaction   SSI   (R-1)(C-1)   MSI   MSI/MSW   ___    No interaction
Within        SSW   RC(P-1)      MSW
Total         SST   RCP-1

b. An example

                       Insulation 1   Insulation 2
                       (Factor B1)    (Factor B2)
Fuel 1 (Factor A1)       89, 101        87,  87
Fuel 2 (Factor A2)      120, 110        98, 104
Fuel 3 (Factor A3)      100,  98        86,  96

This problem has $R = 3$ rows, $C = 2$ columns and, within each cell, $P = 2$ measurements. We can compute a table of means which shows means for each cell, row and column, as well as an overall mean.

                       Insulation 1      Insulation 2      Row means
                       (Factor B1)       (Factor B2)
Fuel 1 (Factor A1)     x̄_11 = 95         x̄_12 = 87         x̄_1. = 91
Fuel 2 (Factor A2)     x̄_21 = 115        x̄_22 = 101        x̄_2. = 108
Fuel 3 (Factor A3)     x̄_31 = 99         x̄_32 = 91         x̄_3. = 95
Column means           x̄_.1 = 103        x̄_.2 = 93         x̄ = 98

Now we do the computation of sums of squares, using the same simplification that we use in computing a sample variance.

$SST = \sum_i \sum_j \sum_k (x_{ijk} - \bar{x})^2 = \sum_i \sum_j \sum_k x_{ijk}^2 - n\bar{x}^2 = 89^2 + 101^2 + 87^2 + \cdots + 96^2 - 12(98)^2 = 116396 - 115248 = 1148$

$SSW = \sum_i \sum_j \sum_k (x_{ijk} - \bar{x}_{ij})^2 = \sum_i \sum_j \sum_k x_{ijk}^2 - P\sum_i \sum_j \bar{x}_{ij}^2 = 116396 - 2\left(95^2 + 87^2 + 115^2 + 101^2 + 99^2 + 91^2\right) = 192$

$SSR = CP\sum_i (\bar{x}_{i.} - \bar{x})^2 = 2 \cdot 2\left[(91-98)^2 + (108-98)^2 + (95-98)^2\right] = 632$

$SSC = RP\sum_j (\bar{x}_{.j} - \bar{x})^2 = 3 \cdot 2\left[(103-98)^2 + (93-98)^2\right] = 300$

$SSI = P\sum_i \sum_j (\bar{x}_{ij} - \bar{x}_{i.} - \bar{x}_{.j} + \bar{x})^2$, but we do not compute this directly because $SST = SSR + SSC + SSI + SSW$, so that $SSI = SST - SSR - SSC - SSW = 1148 - 632 - 300 - 192 = 24$.

Our ANOVA table is thus:

Source             SS     DF   MS    F         F.05               H0
Rows (A)           632    2    316   9.88 s    F.05(2,6) = 5.14   Row means equal
Columns (B)        300    1    300   9.38 s    F.05(1,6) = 5.99   Column means equal
Interaction (AB)   24     2    12    0.38 ns   F.05(2,6) = 5.14   No interaction
Within             192    6    32
Total              1148   11

So we reject $H_{01}$ and $H_{02}$, but do not reject $H_{03}$. To explain further: we get the degrees of freedom for rows by taking the number of rows minus 1, and we do the same for columns. Then the interaction degrees of freedom are the product of row and column degrees of freedom. The total degrees of freedom come from subtracting 1 from the total number of items in the problem. The 'within' degrees of freedom come from subtracting the other degrees of freedom from the 'total' degrees of freedom. The 'MS' column comes from dividing the SS column by the DF column. The 'F' column is calculated by dividing the items in the 'MS' column by $MSW = s^2 = 32$. To see this as Minitab output go to 252anovaex2. An example of 2-way ANOVA with one measurement per cell is in 252anovaex3.

c. Confidence Intervals

i. A Single Confidence Interval

If we desire a single interval, we use the formula for a Bonferroni confidence interval below with $m = 1$.

ii. Scheffé Confidence Interval

If we desire intervals that will simultaneously be valid for a given confidence level for all possible intervals between means, use the following formulas.
For cell means, use $\mu_{11} - \mu_{21} = (\bar{x}_{11} - \bar{x}_{21}) \pm \sqrt{(RC-1)F^{(RC-1,\,RC(P-1))}_{\alpha}}\sqrt{\frac{2MSW}{P}}$.
For row means, use $\mu_{1.} - \mu_{2.} = (\bar{x}_{1.} - \bar{x}_{2.}) \pm \sqrt{(R-1)F^{(R-1,\,RC(P-1))}_{\alpha}}\sqrt{\frac{2MSW}{PC}}$.
For column means, use $\mu_{.1} - \mu_{.2} = (\bar{x}_{.1} - \bar{x}_{.2}) \pm \sqrt{(C-1)F^{(C-1,\,RC(P-1))}_{\alpha}}\sqrt{\frac{2MSW}{PR}}$.
Note that if $P = 1$, replace $RC(P-1)$ with $(R-1)(C-1)$.

iii. Bonferroni Confidence Interval

If we only need $m$ different intervals, use for cell means $\mu_{11} - \mu_{21} = (\bar{x}_{11} - \bar{x}_{21}) \pm t^{(RC(P-1))}_{\alpha/2m}\sqrt{\frac{2MSW}{P}}$.
Use for row means $\mu_{1.} - \mu_{2.} = (\bar{x}_{1.} - \bar{x}_{2.}) \pm t^{(RC(P-1))}_{\alpha/2m}\sqrt{\frac{2MSW}{PC}}$.
Use for column means $\mu_{.1} - \mu_{.2} = (\bar{x}_{.1} - \bar{x}_{.2}) \pm t^{(RC(P-1))}_{\alpha/2m}\sqrt{\frac{2MSW}{PR}}$.

iv. Tukey Confidence Interval

For cell means, use $\mu_{11} - \mu_{21} = (\bar{x}_{11} - \bar{x}_{21}) \pm q^{(RC,\,RC(P-1))}_{\alpha}\sqrt{\frac{MSW}{P}}$.
For row means, use $\mu_{1.} - \mu_{2.} = (\bar{x}_{1.} - \bar{x}_{2.}) \pm q^{(R,\,RC(P-1))}_{\alpha}\sqrt{\frac{MSW}{PC}}$.
For column means, use $\mu_{.1} - \mu_{.2} = (\bar{x}_{.1} - \bar{x}_{.2}) \pm q^{(C,\,RC(P-1))}_{\alpha}\sqrt{\frac{MSW}{PR}}$.
Note that if $P = 1$, replace $RC(P-1)$ with $(R-1)(C-1)$.

3. More than 2-Way Analysis of Variance

See 252anovaex4.
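Again as an optional cross-check on the Minitab output (252anovaex2), and not part of the original handout, here is a short NumPy sketch that reproduces the two-way sums of squares and F ratios directly from the definitions in section 2b; the array layout and names are my own choices.

```python
# Minimal sketch: two-way ANOVA sums of squares for the fuel-by-insulation data.
import numpy as np

# data[i, j, k]: fuel i (row), insulation j (column), replicate k; R=3, C=2, P=2
data = np.array([[[ 89, 101], [87,  87]],
                 [[120, 110], [98, 104]],
                 [[100,  98], [86,  96]]], dtype=float)
R, C, P = data.shape

grand = data.mean()                      # 98
cell  = data.mean(axis=2)                # cell means x-bar_ij
row   = data.mean(axis=(1, 2))           # row means 91, 108, 95
col   = data.mean(axis=(0, 2))           # column means 103, 93

SST = ((data - grand) ** 2).sum()                  # 1148
SSW = ((data - cell[:, :, None]) ** 2).sum()       # 192
SSR = C * P * ((row - grand) ** 2).sum()           # 632
SSC = R * P * ((col - grand) ** 2).sum()           # 300
SSI = SST - SSR - SSC - SSW                        # 24

MSW = SSW / (R * C * (P - 1))                      # 32
print("F rows       :", (SSR / (R - 1)) / MSW)                # about 9.88
print("F columns    :", (SSC / (C - 1)) / MSW)                # about 9.38
print("F interaction:", (SSI / ((R - 1) * (C - 1))) / MSW)    # about 0.38
```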
4. Kruskal-Wallis Test

Equivalent to one-way ANOVA when the underlying distribution is non-normal. $H_0$: Columns come from the same distribution, or medians equal.

Example: Use the same example as for one-way ANOVA, but assume that the data come from a non-normal source. Assume that $\alpha = .05$. There are $n = 12$ data items, so rank them from 1 to 12. Let $n_i$ be the number of items in column $i$ and $SR_i$ be the rank sum of column $i$.

              Original Data                            Ranked Data
      Treatment  Treatment  Treatment          Treatment  Treatment  Treatment
          1          2          3                  1          2          3
         89        104         86                  4         10          1
        101        120         98                  9         12          6.5
         87         98        100                  2.5        6.5        8
         87        110         96                  2.5       11          5
n_i       4          4          4         SR_i    18.0       39.5       20.5

To check the ranking, note that the sum of the three rank sums is 18.0 + 39.5 + 20.5 = 78.0, and that the sum of the first $n$ numbers is $\frac{n(n+1)}{2} = \frac{12(13)}{2} = 78$.

Now, compute the Kruskal-Wallis statistic
$H = \frac{12}{n(n+1)}\sum_i \frac{SR_i^2}{n_i} - 3(n+1) = \frac{12}{12(13)}\left[\frac{18.0^2}{4} + \frac{39.5^2}{4} + \frac{20.5^2}{4}\right] - 3(13) = \frac{1}{13}(576.125) - 39 = 5.3173$.

If we look up this result in the (4, 4, 4) section of the Kruskal-Wallis table (Table 9), we find that the p-value for $H = 5.6538$ is .054 and that the p-value for $H = 4.6539$ is .097, so the p-value for $H = 5.3173$ must lie between these two. Since both are above $\alpha = .05$, do not reject $H_0$.

If the size of the problem is larger than those shown in Table 9, use the $\chi^2$ distribution with $df = m - 1$, where $m$ is the number of columns. For example, if each of $m = 3$ columns contains 6 items, $\alpha = .05$ and $H = 5.3173$, compare $H$ with $\chi^2_{.05(2)} = 5.9915$. Since $H$ is smaller than $\chi^2_{.05}$, do not reject the null hypothesis.

5. Friedman Test

Equivalent to two-way ANOVA with one observation per cell when the underlying distribution is non-normal. $H_0$: Columns come from the same distribution, or medians equal. Note that the only difference between this and the Kruskal-Wallis test is that the data are cross-classified in the Friedman test.

Example: Three groups of 4 matched workers are trained to do a task by four different methods. When each worker is observed later, he or she is given a grade of 1 through 10 on performance of the task. Note that because these data are ordinal, ANOVA is not appropriate. Assume that $\alpha = .05$. In the data below, the methods are represented by $c = 4$ columns, and the groups by $r = 3$ rows. In each row the numbers are ranked from 1 to $c = 4$. For each column, compute $SR_i$, the rank sum of column $i$.

                   Original Data                          Ranked Data
          Method  Method  Method  Method        Method  Method  Method  Method
             1       2       3       4             1       2       3       4
Group 1      9       4       1       7             4       2       1       3
Group 2      6       5       2       8             3       2       1       4
Group 3      9       1       2       6             4       1       2       3
                                          SR_i    11       5       4      10

To check the ranking, note that the sum of the four rank sums is 11 + 5 + 4 + 10 = 30, and that the sum of the $c$ ranks in a row is $\frac{c(c+1)}{2}$. However, there are $r$ rows, so we must multiply this expression by $r$. So we have $\sum_i SR_i = \frac{rc(c+1)}{2} = \frac{3(4)(5)}{2} = 30$.

Now compute the Friedman statistic
$F_2 = \frac{12}{rc(c+1)}\sum_i SR_i^2 - 3r(c+1) = \frac{12}{3(4)(5)}\left[11^2 + 5^2 + 4^2 + 10^2\right] - 3(3)(5) = \frac{1}{5}(262) - 45 = 7.4$.

If we find the place on the Friedman Table (Table 8) for 4 columns and 3 rows, we find that the p-value for $F_2 = 7.4$ is .033. Since the p-value is below $\alpha = .05$, reject the null hypothesis.

If the size of the problem is larger than those shown in Table 8, use the $\chi^2$ distribution with $df = c - 1$, where $c$ is the number of columns. For example, if each of $c = 5$ columns contains 6 items, $\alpha = .05$ and $F_2 = 7.4$, compare $F_2$ with $\chi^2_{.05(4)} = 9.4877$. Since $F_2$ is not larger than $\chi^2_{.05}$, do not reject the null hypothesis.

6. Tests for Equality of Variances – See 252mvar.
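Returning to the Kruskal-Wallis and Friedman examples of sections 4 and 5: for readers who prefer software to the tables, the hedged sketch below (not part of the original handout) runs both tests through SciPy. Note that SciPy reports chi-square-approximation p-values and applies a tie correction to H, so for samples this small its numbers will not exactly match the hand computations or the exact-table p-values quoted above.

```python
# Hedged sketch: the Kruskal-Wallis and Friedman examples run through SciPy.
from scipy import stats

# Kruskal-Wallis on the fuel data, one list per treatment (column)
H, p_kw = stats.kruskal([89, 101, 87, 87],
                        [104, 120, 98, 110],
                        [86, 98, 100, 96])
print(H, p_kw)    # H about 5.35 (tie-corrected; 5.3173 by hand), p about 0.07

# Friedman test on the worker-training data, one list per method,
# each giving that method's grades for Groups 1, 2 and 3
F2, p_fr = stats.friedmanchisquare([9, 6, 9],    # Method 1
                                   [4, 5, 1],    # Method 2
                                   [1, 2, 2],    # Method 3
                                   [7, 8, 6])    # Method 4
print(F2, p_fr)   # F2 = 7.4; chi-square(3) gives p about 0.06 vs .033 exact
```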