ANOVA Handout

#1. 1-Way ANOVA Example
#We'll analyze the Filling Machines data from Problem 16.11, pp. 725-726 of the
#textbook: a company uses six filling machines of the same make and model to place
#detergent into cartons that show a label weight of 32 ounces.
fillmach<-read.table("c://Classes//Stat214//CH16PR11.txt")
## what the data look like
fillmach[1:3,]
V1 V2 V3
1 -0.14 1 1
2 0.20 1 2
3 0.07 1 3
#assign variable names:
names(fillmach)=c("amount","machine","replication")
#We can make the variables more accessible using the attach statement:
attach(fillmach)
## machine is the factor in this experiment and has to be declared as such in R
machine<-factor(machine)
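## A possible alternative to attach(), sketched under assumptions: keep the factor inside
## the data frame and pass data= to the plotting and modelling functions. The data frame
## `demo` below is simulated for illustration only; it is not the textbook data.

```r
# Simulated stand-in for the filling-machine data (hypothetical values)
set.seed(1)
demo <- data.frame(
  amount  = rnorm(12, mean = 0.1, sd = 0.2),
  machine = rep(1:3, each = 4)
)

# Declare the factor inside the data frame instead of attaching it
demo$machine <- factor(demo$machine)

# Formula interfaces look the variables up in the data frame given by data=
boxplot(amount ~ machine, data = demo)
fit_demo <- aov(amount ~ machine, data = demo)
nlevels(demo$machine)
```

## With this style no copies of the columns are placed on the search path, so a later
## change to the data frame cannot get out of sync with an attached variable.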
# Let’s visualize the data
boxplot(amount~ machine)
## The boxplots suggest that the six machines differ in the
## amount of detergent they put in the cartons
#We can apply single-factor ANOVA. Notice that we use aov() and not lm().
#aov() calls lm() internally and returns an object suited to ANOVA-specific analysis,
#including the Tukey multiple comparisons.
## Is there a difference between machines?
summary(fm1 <- aov(amount ~ machine))
             Df Sum Sq Mean Sq F value    Pr(>F)
machine       5 2.2893  0.4579  14.784 3.636e-11 ***
Residuals   114 3.5306  0.0310
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# The F-test is significant, hence we conclude that the mean amount of fill differs among the
#six machines
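## As a check on how the table is built, the F value and its p-value can be recomputed from
## the printed sums of squares and degrees of freedom. This is only a sketch; the variable
## names are ours, and the inputs are the rounded values shown in the table above.

```r
# Values copied from the one-way ANOVA table for the filling machines
ss_machine <- 2.2893; df_machine <- 5
ss_resid   <- 3.5306; df_resid   <- 114

# Mean squares are sums of squares divided by their degrees of freedom
ms_machine <- ss_machine / df_machine
ms_resid   <- ss_resid / df_resid

# F statistic and its upper-tail probability under the F(5, 114) distribution
f_stat  <- ms_machine / ms_resid
p_value <- pf(f_stat, df_machine, df_resid, lower.tail = FALSE)

round(f_stat, 3)
p_value
```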
## Let’s check the distributional assumptions
par(mfrow=c(2,2))
plot(fm1)
## These diagnostic plots check the basic assumptions of the classic ANOVA model: independent,
## normally distributed errors with constant variance. Homoscedasticity (the error variance
## is the same at every factor level) is very important.
## The Q-Q plot checks the normality of the residuals. Influential points can be detected
## with Cook's distance.
## There are a few potentially problematic points (8, 32 and 71).
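## The plots can be complemented by formal tests. The sketch below is one possible choice,
## not part of the original analysis: Bartlett's test for equal variances and the
## Shapiro-Wilk test on the residuals. It uses R's built-in PlantGrowth data as a stand-in,
## because the textbook file is not assumed to be available.

```r
# One-way ANOVA on a built-in data set (stand-in for the filling-machine data)
fit_pg <- aov(weight ~ group, data = PlantGrowth)

# Bartlett's test: null hypothesis of equal variances across the factor levels
bt <- bartlett.test(weight ~ group, data = PlantGrowth)

# Shapiro-Wilk test: null hypothesis that the residuals are normally distributed
sw <- shapiro.test(residuals(fit_pg))

bt$p.value
sw$p.value
```

## Small p-values would cast doubt on the corresponding assumption. Bartlett's test is itself
## sensitive to non-normality, so it is best read together with the plots.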
#Next question is: where are the differences?
# We’ll consider all pairwise comparisons: Tukey’s Method
#Multiple comparisons with TukeyHSD() must be run on a fitted model object returned by
#aov()
## We usually do not look for pairwise differences if the ANOVA null hypothesis is
## not rejected
## The function TukeyHSD() implements Tukey multiple comparisons
fm1Tukey<-TukeyHSD(fm1,"machine")
fm1Tukey
  Tukey multiple comparisons of means
    95% family-wise confidence level
Fit: aov(formula = amount ~ machine)
$machine
       diff        lwr        upr     p adj
2-1  0.1170 -0.0443194  0.2783194 0.2934937
3-1  0.3865  0.2251806  0.5478194 0.0000000
4-1  0.2920  0.1306806  0.4533194 0.0000106
5-1  0.0515 -0.1098194  0.2128194 0.9392011
6-1  0.0780 -0.0833194  0.2393194 0.7260015
3-2  0.2695  0.1081806  0.4308194 0.0000588
4-2  0.1750  0.0136806  0.3363194 0.0252432
5-2 -0.0655 -0.2268194  0.0958194 0.8469184
6-2 -0.0390 -0.2003194  0.1223194 0.9815028
4-3 -0.0945 -0.2558194  0.0668194 0.5359056
5-3 -0.3350 -0.4963194 -0.1736806 0.0000003
6-3 -0.3085 -0.4698194 -0.1471806 0.0000029
5-4 -0.2405 -0.4018194 -0.0791806 0.0004684
6-4 -0.2140 -0.3753194 -0.0526806 0.0026737
6-5  0.0265 -0.1348194  0.1878194 0.9968910
# A difference between two machines is significant at the 5% family-wise level if the
# confidence interval for the estimated difference does not contain zero.
#This can be visualized by plotting the TukeyHSD object:
par(mfrow=c(1,1))
plot(fm1Tukey)
## Statistically significant differences exist between machines 1-3, 1-4, 2-3, 2-4, 3-5, 3-6, 4-5 and 4-6
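## To see where the Tukey intervals come from, the half-width can be recomputed by hand.
## This sketch assumes a balanced design with 20 cartons per machine (120 observations over
## 6 machines, consistent with the equal interval widths in the table) and uses the rounded
## residual sum of squares printed earlier. The variable names are ours.

```r
# Residual mean square from the ANOVA table (Sum Sq / Df)
mse <- 3.5306 / 114

# Assumption: balanced design, 20 observations per machine
n_per_machine <- 20

# Upper 5% point of the studentized range for 6 means and 114 error df
q_crit <- qtukey(0.95, nmeans = 6, df = 114)

# Tukey half-width: (q / sqrt(2)) * sqrt(MSE * (1/n + 1/n)) = q * sqrt(MSE / n)
half_width <- q_crit * sqrt(mse / n_per_machine)

# Interval for the difference between machines 2 and 1 (diff copied from the table)
diff_21 <- 0.1170
c(lwr = diff_21 - half_width, upr = diff_21 + half_width)
```

## Because the design is balanced, the same half-width applies to all 15 pairs, which is why
## every interval in the table has the same length.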
#2. 2-Way ANOVA Example
#We’ll analyze the Cash Offers data from Problem 16.10, p. 725 from the textbook:
#A consumer organization studied the effect of age of automobile owner on size of cash
#offer for a used car by utilizing 12 persons in each of three age groups. Six of the twelve
#were females and six males.
cash<-read.table("c://Classes//Stat214//CH19PR10.txt")
## what the data look like
cash[1:3,]
V1 V2 V3 V4
1 21 1 1 1
2 23 1 1 2
3 19 1 1 3
names(cash)<-c('offer','age','sex','replicate')
cash$sex<-factor(cash$sex)
cash$age<-factor(cash$age)
## Next we plot an interaction plot
interaction.plot(cash$age,cash$sex,cash$offer)
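## The interaction plot draws the cell means, one line per level of the second factor;
## roughly parallel lines suggest little interaction. A small sketch of the same idea,
## using R's built-in warpbreaks data as a stand-in (an assumption, since the textbook
## file may not be at hand):

```r
# Table of cell means: rows = wool type, columns = tension level
cell_means <- with(warpbreaks, tapply(breaks, list(wool, tension), mean))
cell_means

# The interaction plot shows these same means, one line per wool type
with(warpbreaks, interaction.plot(tension, wool, breaks))
```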
## start with a full interaction model
FullMod<-lm(cash$offer~cash$sex+cash$age+cash$sex:cash$age)
#use the 'anova' function to get the ANOVA-table ( type I SS or sequential extra sum of
#squares):
anova(FullMod)
Analysis of Variance Table
Response: cash$offer
                  Df Sum Sq Mean Sq F value   Pr(>F)
cash$sex           1   5.44    5.44  2.2791   0.1416
cash$age           2 316.72  158.36 66.2907 9.79e-12 ***
cash$sex:cash$age  2   5.06    2.53  1.0581   0.3597
Residuals         30  71.67    2.39
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Only age appears to be important, as we expected from the interaction plot
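## A note on the model formula, sketched with the built-in warpbreaks data as a stand-in:
## writing a + b + a:b is equivalent to the shorthand a * b, and passing data= avoids
## repeating the data frame name in every term (the table rows are then labelled with the
## plain variable names).

```r
# Main effects plus interaction, written out in full
fit_long  <- lm(breaks ~ wool + tension + wool:tension, data = warpbreaks)

# The same model using the * shorthand
fit_short <- lm(breaks ~ wool * tension, data = warpbreaks)

# Both fits give the same sequential (type I) sums of squares
all.equal(anova(fit_long)[["Sum Sq"]], anova(fit_short)[["Sum Sq"]])
```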
## let’s fit 1-way ANOVA with age only
summary(aov(cash$offer ~ cash$age))
            Df Sum Sq Mean Sq F value    Pr(>F)
cash$age     2 316.72  158.36  63.601 4.769e-12 ***
Residuals   33  82.17    2.49
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
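## Dropping sex and the interaction can also be judged with a single partial F-test that
## compares the age-only model with the full model. The sketch below does the arithmetic
## from the two printed tables (rounded values; variable names are ours). In R the same
## comparison is what anova() reports when given the two fitted lm objects.

```r
# Residual sums of squares and degrees of freedom from the two tables
sse_full    <- 71.67; df_full    <- 30   # offer ~ sex + age + sex:age
sse_reduced <- 82.17; df_reduced <- 33   # offer ~ age

# The extra sum of squares equals SS(sex) + SS(sex:age) = 5.44 + 5.06
extra_ss <- sse_reduced - sse_full
extra_df <- df_reduced - df_full

# Partial F statistic and its upper-tail p-value
f_drop <- (extra_ss / extra_df) / (sse_full / df_full)
p_drop <- pf(f_drop, extra_df, df_full, lower.tail = FALSE)

round(f_drop, 3)
p_drop
```

## A large p-value here supports the simpler model with age only.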
###Problem 19.14—Hay Fever (p. 868)
hay<-read.table("c://Classes//Stat214//CH19PR14.txt")
names(hay)<-c('relief','A','B','replicate')
hay$A<-factor(hay$A)
hay$B<-factor(hay$B)
## Next we plot an interaction plot
interaction.plot(hay$A,hay$B,hay$relief)
## interaction is evident
FullMod<-lm(hay$relief~hay$A*hay$B)
anova(FullMod)
Analysis of Variance Table
Response: hay$relief
            Df  Sum Sq Mean Sq F value    Pr(>F)
hay$A        2 220.020 110.010 1827.86 < 2.2e-16 ***
hay$B        2 123.660  61.830 1027.33 < 2.2e-16 ***
hay$A:hay$B  4  29.425   7.356  122.23 < 2.2e-16 ***
Residuals   27   1.625   0.060
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## interaction is significant as expected
# check the residuals for violations of model assumptions
par(mfrow=c(2,2))
plot(FullMod)
### Multiple Comparisons:
## One approach when the interaction is important is to perform a one-way ANOVA on a single
## factor whose levels are all the combinations of A and B.
#A simple trick (note that A and B must first be converted to numeric codes):
C<-as.numeric(hay$A)*10+as.numeric(hay$B)
C
# [1] 11 11 11 11 12 12 12 12 13 13 13 13 21 21 21 21 22 22 22 22 23 23 23 23 31 31 31 31 32 32 32 32 33
#[34] 33 33 33
C<- factor(C)
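## An alternative to the numeric trick, offered as a sketch: the base R function
## interaction() builds the combined factor directly. The data frame `demo` is a
## hypothetical 3 x 3 layout with 4 replicates, matching the shape of the hay fever
## data but not its values.

```r
# Hypothetical balanced layout: 3 levels of A, 3 levels of B, 4 replicates per cell
demo <- expand.grid(replicate = 1:4, B = 1:3, A = 1:3)
demo$A <- factor(demo$A)
demo$B <- factor(demo$B)

# The numeric trick: as.numeric() on a factor returns the integer level codes,
# which match the labels here only because the levels are 1, 2, 3
C_numeric <- factor(as.numeric(demo$A) * 10 + as.numeric(demo$B))

# interaction() pastes the level labels together; lex.order = TRUE makes the
# levels of the first factor vary slowest, giving the order 11, 12, 13, 21, ...
C_inter <- interaction(demo$A, demo$B, sep = "", lex.order = TRUE)

# Both constructions label every observation identically
all(as.character(C_numeric) == as.character(C_inter))
```

## interaction() also works when the factor levels are not numbers, or when a factor has
## ten or more levels, in which case the multiply-by-10 coding would produce clashing codes.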
CMod<-lm(hay$relief~C)
anova(CMod)
Analysis of Variance Table
Response: hay$relief
          Df Sum Sq Mean Sq F value    Pr(>F)
C          8 373.10   46.64  774.91 < 2.2e-16 ***
Residuals 27   1.63    0.06
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## we now compare all the combinations of A and B coded in the new combined factor C
##
## We can apply Tukey multiple comparisons.
TukeyHSD(aov(hay$relief~C), "C", ordered = TRUE)
## The result is not shown because it is very long.
## Here is the plot instead:
par(mfrow=c(1,1))
plot(TukeyHSD(aov(hay$relief~C), "C", ordered = TRUE))
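## With 9 combined levels there are choose(9, 2) = 36 pairwise comparisons, so it can help
## to print only the significant ones. A sketch of one way to do this, using the built-in
## warpbreaks data as a stand-in (6 cells, hence 15 comparisons); the column name `cell`
## is our own choice.

```r
# Build a combined factor from the two experimental factors
wb <- warpbreaks
wb$cell <- interaction(wb$wool, wb$tension)

# One-way ANOVA on the combined factor, followed by Tukey comparisons
fit_cell <- aov(breaks ~ cell, data = wb)
tk <- TukeyHSD(fit_cell, "cell", ordered = TRUE)

# The comparisons are stored as a matrix with columns diff, lwr, upr, p adj
tab <- tk$cell
nrow(tab)

# Keep only the pairs whose adjusted p-value is below 0.05
sig <- tab[tab[, "p adj"] < 0.05, , drop = FALSE]
sig
```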