Chapter 5: Comparisons Among Several Samples We are now going to extend the t-tools to analyze data from three or more independent groups. In the two-sample case we were interested in µ2 − µ1 . Now that we have more than two groups, what are we interested in? • Possibilities: – a few pairwise differences – all pairwise differences – specific linear combinations of means • It is important to figure out exactly how questions of interest can be answered through the model parameters (i.e. means) before doing the analysis. 5.1 Case Studies 5.1.1 Diet Restriction and Longevity • What are the questions of interest? How can they be answered through model parameters (µ’s)? • Are all pairwise comparisons of means meaningful? 5.1.2 The Spock Conspiracy Trial • Two questions: 1. Q1: Is there evidence that the women are underrepresented on the Spock judge’s venires compared to the venires of the other judges? 2. Q2: Is there any evidence that there are differences in women’s representation in the venires of the other six judges? The question of interest for this study will be addressed more by comparing different models, than comparing any pairs of means. 1 5.2 Comparing Any Two of the Several Means 5.2.1 An ideal model for several-sample comparisons • The populations model assumes: 1. Normal distributions for each population 2. Equal standard deviations for the populations 3. Independent observations within each group AND among groups. • WHY do we assume equal standard deviations? 1. It is conceptually and statistically difficult to compare populations with unequal variability 2. For many problems, a treatment is associated with an effect on the mean, but not on the SD (additive) 3. We can often transform to a scale where the spread is the approximatly the same. • Notation: – I = total number of samples (or groups) – µi = ith population mean (estimated by Y i ) – σ = population standard deviation – ni = number of observations within sample (group) i – How many parameters are in the model (assuming different populations for each group)? • Treatment Effects: – The text will use the terminology of samples instead of treatment groups. But, all methods apply also to data collected from randomized experiments (as in the case study). – What does the additive treatment effects model look like now? – Once again, we analyze the data from randomized experiments and random samples from populations in exactly the same way, but the concluding statements are worded in terms of treatment effects rather than differences in population means. 2 5.2.2 The Pooled estimate of Standard Deviation • How should we estimate σ 2 ? • Pooled estimate of variance, s2p : – A weighted average of sample variances with weights equal the their d.f.’s s2p = (n1 − 1)s21 + (n2 − 1)s22 + . . . + (nI − 1)s2I (n1 − 1) + (n2 − 1) + . . . + (nI − 1) with d.f. = n − I 5.2.3 t-Tests and Confidence Intervals for Differences of Means • Calculating a standard error for the difference between two means: – Where does the sp come from? – What are the degrees of freedom associated with it? • So, HOW do I compare two groups when I have several? 3 5.3 The One-way Analysis of Variance (ANOVA) F -test • QUESTION: • An analysis of variance (ANOVA) F -test assesses evidence regarding this question. • Make no mistake, the ANOVA F -test is a test of !! 5.3.1 The Extra-Sum-of-Squares Principle • An ANOVA F -test compares two models using an Extra Sum-of-Squares F -test (ESS F-test) • We will call one of the models the FULL model and the other the REDUCED model. – The reduced model is a special case of the full model. It is obtained by setting parameters in the full model equal to eachother. – The reduced model is the null hypothesis. – The full model always has MORE parameters. • The usual (default) ANOVA comparison: FULL Separate means model: REDUCED Equal means model: • Fitting the two models: FULL Separate means model: REDUCED Equal means model: • NOTE: – An ESS F-test always compares a full model to a reduced model, but they are not always the separate means model and the equal means model...it depends on the question! – To do the default ANOVA using a stats package, you only need to specify the FULL model and it automatically chooses the equal means model as the reduced model. 4 • RESIDUALS: – Residual for observation i = (Observed value - estimated mean) – Residuals depend on the model being fit! ∗ Residual for separate means model: ∗ Residual for equal means model: ∗ For which model will the residuals generally be larger? ∗ If the null hypothesis of no difference between any of the means is true, what do we expect when comparing residuals between the two models? • Residual Sums of Squares (SS): – RSS = A single summary of magnitude of the residuals from a particular model – Residual SS = sum of the squared residuals 1. Calculate the residuals for the specified model (i.e. subtract the estimated mean (based on the model) from each observation). 2. Square each residual 3. Add up all the squared residuals – It measures the variability in the observations that remains unexplained by the model for the means. – EACH model has a Residual SS associated with it! We will just refer to it as the model’s SS. 5 • Extra SS F-statistic: – IDEA: compare the SS for two models after taking into account the number of parameters in each to decide which is a more reasonable model for the data – A model with more parameters will always have a smaller SS. – Null hypothesis = – Questions: ∗ Is the increased number of parameters in the full model worth it? ∗ Is a “large” difference between the SS of the full model and reduced model due to a poor fit of the reduced model or just due to sampling variability (i.e. to chance)? – The calculations: Extra SS = Reduced SS − Full SS Extra d.f. = Reduced d.f. − Full d.f. Reduced SS−Full SS Extra SS/Extra d.f. ESS F-statistic = Reduced d.f.−Full d.f. = Full SS σ̂ 2 Full Full d.f. 2 – σ̂Full is our best estimate of σ 2 which comes from the residual SS for fitting the full model divided by the d.f. for the full model. – What other value did we previously calculate as our “best estimate” of σ 2 ? How is 2 it related to σ̂Full ? 6 • The F-test: – What results in a large F-statistic? – What is a large F-stat evidence against? – What does a p-value associated with an F-stat tell us? – How do we specify a particular F -distribution? ∗ numerator d.f. ∗ denominator d.f. – Using R to get a p-value from the F -distribution: ∗ If your F-stat is 3.2 with 3 numerator d.f. and 20 denominator d.f: 1-pf(3.2, 3, 20) 5.3.2 The Analysis of Variance Table for One-way Classification • What is an ANOVA table? – A table that organizes and displays the calculations used in the ESS-F-test comparing the equal means model to the separate means model. – The table can be extended to more complicated situations comparing multiple models (i.e. multiple F-tests and p-values in one table). • An ANOVA table in words: Source of Variation Sum of Squares Between Groups Extra SS Within Groups Full SS Total Red. SS d.f. Extra d.f. Full d.f. Red. d.f. 7 Mean Square Extra SS/Extra df Full SS/Full d.f. F-stat F ∗ = Extra MS/Full MS – Steps to fill out an ANOVA table OR doing an ESS F-test: 1. 2. 3. 4. 5. 6. 7. Calculate the Residual SS for the Reduced model. Calculate the Residual SS for the Full model. Subtract the Full SS from the Reduced SS to get the Extra SS. Subtract the Full d.f. from the Reduced d.f. to get the Extra d.f. Divide the Extra SS by the Extra d.f. to get the Between Groups MS Divide the Full SS by the Full d.f. to get the Within Groups MS Divide the Between Groups MS by the Within Groups MS to obtain the F statistic. 8. Obtain a p-value to accompany the F -statistic. • Let’s fill in an ANOVA table by doing an ESS F-test for comparing the separate means model to the equal means model for the Diet Restriction Longevity case study: – I don’t feel like calculating all the SS by hand, so let’s have R do it for us: diet.data <- read.csv("DietData.csv",header=TRUE) attach(diet.data) diet.data #contains the columns LIFETIME and DIET(group indicator) #need to be sure that DIET is a factor variable is.factor(DIET) #checks whether it is or not (TRUE or FALSE) DIET <- factor(DIET) # Changes DIET to a factor variable options(show.signif.stars=FALSE) sep.vs.eq.out <- lm(LIFETIME ~ DIET) anova(sep.vs.eq.out) Analysis of Variance Table Response: LIFETIME Df Sum Sq Mean Sq F value Pr(>F) DIET 5 12733.9 2546.8 57.104 < 2.2e-16 Residuals 343 15297.4 44.6 eq.out <- lm(LIFETIME ~ 1) anova(eq.out) Response: LIFETIME Df Sum Sq Mean Sq F value Pr(>F) Residuals 348 28031.4 80.5 8 anova(eq.out, sep.vs.eq.out) Model 1: Model 2: Res.Df 1 348 2 343 LIFETIME ~ 1 LIFETIME ~ DIET RSS Df Sum of Sq F Pr(>F) 28031 15297 5 12734 57.104 < 2.2e-16 9 The Spock Conspiracy Trial Case Study: Questions of interest: 1. Is there evidence that women are underrepresented on the Spock judge’s venires compared to the venires of the other judges? 2. Is there any evidence that there are differences in women’s representation in the venires of the other six judges? (If all other judges all had about the same percentage of women on their venires while the Spock judge had significantly fewer women, this would make a strong statement about that particular judge.) ********************************************* 50 Spock trial data 30 20 10 Percent Women in Venires 40 ● ● A B C D E F SPOCK Judge • Load data, check names, check that grouping variable is a factor variable, get summary stats, make boxplots, etc.: >spock.data <- read.csv("DrSpockData.csv",head=TRUE) >attach(spock.data) >names(spock.data) [1] "PERCENT" "JUDGE" 10 > is.factor(JUDGE) [1] TRUE > table(JUDGE) JUDGE A B 5 6 #calculate the individual group sample sizes C 9 D 2 E 6 F SPOCK 9 9 > tapply(PERCENT,JUDGE,mean) #calculate the group averages A B C D E F SPOCK 34.12000 33.61667 29.10000 27.00000 26.96667 26.80000 14.62222 > tapply(PERCENT,JUDGE,sd) #calculate the group SDs A B C D E F 11.941817 6.582224 4.592929 3.818377 9.010142 5.968878 SPOCK 5.038794 >boxplot(PERCENT~JUDGE, col=c(3,3,3,3,3,3,"yellow"), main="Spock trial data", ylab="Percent Women in Venires", xlab="Judge") #side-by-side boxplots 1. Is there evidence that women are underrepresented on the Spock judge’s venires compared to the venires of the other judges? • What two models should we compare to help answer this question? • Here is the default ANOVA table from the lm() function in R. > eq.v.sep.lm <- lm(PERCENT~JUDGE) > anova(eq.v.sep.lm) Analysis of Variance Table #calculates the SS Response: PERCENT Df Sum Sq Mean Sq F value Pr(>F) JUDGE 6 1927.08 321.18 6.7184 6.096e-05 Residuals 39 1864.45 47.81 – What is the estimate of σ 2 ? What are the d.f. associated with it? – What are the SS and d.f. associated with the full model? 11 – What are the SS and d.f. associated with the reduced model? – Perform the Extra SS F-test (results should match the above anova table): – Other checks using R: > eq.lm <- lm(PERCENT ~ 1) > anova(eq.lm) Response: PERCENT Df Sum Sq Mean Sq F value Pr(>F) Residuals 45 3791.5 84.3 > anova(eq.lm, eq.v.sep.lm) Analysis of Variance Table Model 1: Model 2: Res.Df 1 45 2 39 PERCENT ~ 1 PERCENT ~ JUDGE RSS Df Sum of Sq F Pr(>F) 3791.5 1864.4 6 1927.1 6.7184 6.096e-05 • Summary: There is convincing evidence that at least one of the judges’ means differs from one of the others’ (two-sided p-value = 0.00006 from a one-way analysis of variance F -test, F -stat=6.7184 on 6 and 39 degrees of freedom). 12 2. Is there any evidence that there are differences in women’s representation in the venires of the other six judges? • What are the full and reduced models we can use to help answer this question? Group : 1 2 3 4 5 6 7 2 3 4 5 6 7 Full model (separate-means) : Reduced model (others-equal): : • The estimated means for the two models are: Group : 1 Full model (separate-means) : Reduced model (others-equal): : • This test is not automatically computed via our usual ANOVA table. But, we can compute it since we know how to do Extra-SS F-tests on our own! (a) Notice that the full model is still the separate means model. Where can get the SS to go with this model? What is it? (b) Do we have the SS for the reduced model? i. Calculate it using the following formula: Red. SS (others-equal model) : n1 X j=1 How can we get it? (YSj − Y S· )2 + ni 7 X X (Yij − Y 0 )2 i=2 j=1 ii. Use R by creating a new group indicator column that puts all the other judges into one group and the Spock judge into the other group: >two.grp.ind <- c(rep("SPOCK",9),rep("OTHER",37)) #two.grp.ind <- ifelse(JUDGE=="SPOCK", "SPOCK", "OTHER") > two.grp.ind <- factor(two.grp.ind) > two.grp.lm <- lm(PERCENT~two.grp.ind) > anova(two.grp.lm) Response: PERCENT Df Sum Sq Mean Sq F value Pr(>F) two.grp.ind 1 1600.62 1600.62 32.145 1.030e-06 Residuals 44 2190.90 49.79 13 (c) Write down all the numbers we need to do the ESS F-test: Full SS = Full d.f. = Reduced SS = Reduced d.f. = (d) Calculate the Extra-SS F -test: F − statistic = #Do the calculation "by hand" in R: FullSS <- 1864.45 Fulldf <- 39 ReducedSS <- 2190.90 Reduceddf <- 44 ESS.F <- ((ReducedSS - FullSS)/(Reduceddf-Fulldf))/(FullSS/Fulldf) ESS.F #1.3658 ESS.pval <- 1-pf(ESS.F,5,39) ESS.pval #0.2582 (e) Check answer using the anova() function in R: >sep.lm <- lm(PERCENT ~ JUDGE) >anova(two.grp.lm, sep.lm) Model 1: Model 2: Res.Df 1 44 2 39 PERCENT ~ two.grp.ind PERCENT ~ JUDGE RSS Df Sum of Sq F Pr(>F) 2190.90 1864.45 5 326.46 1.3658 0.2582 (f) Conclusion: The ESS F-test indicates appropriateness of the 2 mean othersequal model (the reduced model), that is... appropriateness of lumping the 6 other judges together under one mean. (g) Summary: There is no evidence of differences in means among the six other judges (p-value = from an Extra Sum-of-Squares F-test, F-statistic= on and degrees of freedom). 14 3. Now, we have justification to assume that the other six judges have equal means. We now want to test whether the Spock judge’s mean is equal to the common mean of the other six: (a) Set up the full and reduced model: Group : 1 2 3 4 5 6 7 Full model (others-equal) : Reduced model (equal-means): : (b) How can we get the Reduced SS? (c) How can we get the Full SS? (d) Calculate the Extra-SS F test: F − statistic = (e) Check with anova() function: > Model 1: Model 2: Res.Df 1 45 2 44 anova(eq.lm,two.grp.lm) PERCENT ~ 1 PERCENT ~ two.grp.ind RSS Df Sum of Sq F Pr(>F) 3791.5 2190.9 1 1600.6 32.145 1.030e-06 (f) Summary: There is convincing evidence that the Spock judge’s mean differs from the common mean of the other six judges (p-value = from an Extra Sum-of-Squares F-test, F-statistic= on and degrees of freedom). 15 An ANOVA table summarizing more than two model comparisons: • Additive nature of the extra sums of squares for comparing nested models: – View the residual SS from each model as a distance from the data. The inclusion of more parameters gets the model closer to the data, but we must decide if the extra parameters are worth it. 16 5.6.1 Further Illustration of Different Sources of Variability • Variation when all means are equal: (Equal Means Model) – Study and understand Display 5.16 (see handout) – Notes: 1. Within group variation is always less than total variation (even if means are equal) 2. Sample averages are less variable than the individual sample values. What does this imply about the magnitude of between group (extra) variation compared to total variation? 3. If means are equal, the numerator and denominator of the F -statistic are unbiased estimates of σ 2 . What does this imply about what the magnitude of an F-statistic should be if the means are equal? • Variation when the means are different: (Separate Means Model) – Study and understand Display 5.17 (see handout) – Notice: 1. The within group variation is not changed from the equal means scenario. Why does this make sense? 2. The denominator of the F -test is always an estimate of σ 2 . The more different the means are, the more the numerator value differs from σ 2 and the farther the F -stat gets from 1. 17 5.5 Robustness and Model Checking 5.5.1 Robustness to Assumptions 1. Normality is not critical. (Exception: small and/or different sample sizes coupled with extremely long-tailed or skewed distributions) 2. Assumptions of independence within and across groups are critical. 3. Assumption of equal population standard deviations is crucial. • We pool the estimates of variability to make the most powerful comparisons possible. However, if one has a very different spread, then serious problems may result, even if comparisons do not involve a mean from that population. • Unlike the 2-sample t-tools the effect of unequal SD’s can be serious even if sample sizes are equal (See Display 5.13) 4. The tools are not resistant to severely outlying observations. 5.5.2. Using RESIDUAL PLOTS to assess assumptions • Assessing assumption of equal standard deviations: – Residuals exhibit the variation of the observations without the visual interference caused by differences among group means. – A scatter plot of the residuals vs. the group means can reveal a relationship between the spread and the group means. (See Display 5.15) LOOK FOR: 1. an increase in spread from left to right (funnel shaped pattern) suggests need for log or some other transformation. 2. seriously outlying observations. 3. If data are collected over time, plot the residuals vs. the time or order of data collection to reveal any serial effects. • How do I use R to make a residual vs. fitted (group means) plot? sep.lm <- lm(PERCENT~JUDGE) #calculates the SS for the specified full model sep.lm.resids <- sep.lm$residuals #gets the residuals from the separate means sep.lm.fits <- sep.lm$fitted #gets the grp averages associated with the #separate means model for each observation dev.new() plot(sep.lm.fits, sep.lm.resids, type="n", main="Residual plot for spock data (Equal vs. Separate)") points(eq.v.sep.fits, eq.v.sep.resids, pch=16, col=4) 18 Example residual plot Example residual plot ● ● ● ● ● ● ● ● 2 2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 1 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −1 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −2 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 0 ● ● fake.resids ● 0 fake.resids ● ● ● ● −1 1 ● ● ● ● ● ● ● ● ● 2.0 2.5 3.0 3.5 4.0 4.5 5.0 2.0 2.5 3.0 3.5 fake.fits 4.0 4.5 5.0 fake.fits Figure 1: Example residual plots (from truly normal data with equal variances) • Assessing assumption of normality of the residuals: – Use a Normal QQ-plot (or normal probability plot) to assess the assumption of normality – How do I make them in R? qqnorm(sep.lm.resids) qqline(sep.lm.resids) #To see plots R will automatically make for you par(mfrow=c(1,2)) plot(sep.lm, which=c(1,2)) Normal Q−Q Plot Normal Q−Q Plot ● 2 ● 3 ●● ●●●● ● −2 ● 1 0 ● ● ● −2 −1 Sample Quantiles ● ● ● ● ●●●●●●● ● ● ●●● ●●●● ●● ●● ●●● ● ●●●● ●●●●● ●●●● ●● ●●● ● ● −2 0 1 ● ● ●● ● ● ●●●● ●● ● ●●●●● ●● ●● ●●● ●●●●● ●●● ● ●●●● ● ●●● ●● ●●● ●● ● ● ● −1 Sample Quantiles 2 ● ● ●● ● ● ● −1 0 1 2 −2 Theoretical Quantiles −1 0 1 2 Theoretical Quantiles Figure 2: Example normal probability plots (from truly normal residuals!) • Also use side-by-side boxplots and any other plots that seem useful! 19 5.6.2 Kruskal-Wallis Nonparametric Analysis of Variance • To deal with seriously outlying observations. (However, still need equal spread for test to be meaningful) • Relies on ranks of the observations and known variance of ranks. • Kruskal-Wallis test statistic: KW = Between group SS of ranks Extra SS of Ranks = 2 Variance of the ranks σR 2 is the variance of all n ranks (using divisor of n − 1) where σR • A two-sided p-value is found using the Chi-squared distribution with d.f = (Number of Groups - 1). – R-code: 1-pchisq(KW, d.f.) *************************************** 5.6.3 Random Effects • Fixed effects: there is direct interest in the particular groups chosen. • Random effects: the group means are a random sample from a population of means, and there is not direct interest in the particular groups chosen. – Examples ∗ Measurements taken on the yield of a machine operated by several operators. ∗ An ecologist measures ground cover at several randomly chosen sites in each region. The sites are chosen to be representative of the population of sites within the region. Site is then considered a random factor and site effects are random effects. ∗ Researchers randomly selected 6 brands of beer that make three types of interest (lager, pale ale, and porter). The 6 brands were selected to be representative of all brands that make all three types of beer. Brand is considered a random factor and brand effects are random effects. 20 • How do I know if the random effects model is the right one to use? 1. Is inference desired to a larger set from which these groups are a sample? 2. Are the groups truly a random sample from the larger set? – If answer YES to Question 1., then it is clear you would like to use a random effects model...however, statistical inference to the larger population is only justified IF you answer yes to both questions! • Other notes: – If the researchers were to repeat the studies, different levels (operators, sites, or brands) would be used. – IDEA: the levels are selected to be representative of some broader population of levels and we want to make inference to the broader population! – We assume that the population of levels is infinite. (In practice we check if the number of randomly chosen levels is small relative to the number of levels in the population (i.e. population size is at least 10 times as large). • The Random Effects Model: – Analysis of a one-way random effects model: ∗ Estimate of interest is often: σµ2 Intraclass correlation = 2 (σµ + σ 2 ) · It can be interpreted as the proportion of the total variance of the measurments that is explained by the between-group variability. – Estimates of interest and interpretation are different than for the analogous fixed effects analysis!! 21