Hmielowski Nov 13

ANOVA in R
Tracy Hmielowski (thmiel1@lsu.edu)
11/13/09

Performing Analysis of Variance in R using the biomass.txt dataset

Read in the biomass.txt dataset and check its dimensions. The dataset should have 2323 observations and five variables.

> biomass=read.table("C:/. . . . . . . /biomass.txt", header=T)
> dim(biomass)
[1] 2323    5

Look at the first ten entries in the dataset.

> biomass[1:10,]
   genet species cover rough total.biomass
1    113    Qufa     n     1          1.68
2    115    Quin     n     1         55.99
3    771    Qufa     n     1         13.83
4    773    Quin     n     1         35.37
5    774    Quin     n     1         92.29
6    777    Quni     n     1          9.98
7    779    Quin     n     1          2.99
8    780    Quin     n     1         18.05
9    781    Quni     n     1         47.07
10   782    Quni     n     1          5.36

Five variables:
  genet = ID for a given genet
  species = species code of the genet (all are oaks)
  cover = groundcover type, either ‘n’ (native ground cover) or ‘o’ (old field)
  rough = time since last fire in years (1-3)
  total.biomass = the aboveground biomass of a genet

qqnorm() draws a normal quantile-quantile plot of a given variable. The second command creates a new variable, called logbm, which is the natural-log (LN) transformation of total.biomass.

> qqnorm(biomass$total.biomass)
> biomass$logbm=log(biomass$total.biomass)

Make sure that the new variable shows up.

> biomass[1:10,]
   genet species cover rough total.biomass     logbm
1    113    Qufa     n     1          1.68 0.5187938
2    115    Quin     n     1         55.99 4.0251731
3    771    Qufa     n     1         13.83 2.6268401
4    773    Quin     n     1         35.37 3.5658640
5    774    Quin     n     1         92.29 4.5249358
6    777    Quni     n     1          9.98 2.3005831
7    779    Quin     n     1          2.99 1.0952734
8    780    Quin     n     1         18.05 2.8931457
9    781    Quni     n     1         47.07 3.8516359
10   782    Quni     n     1          5.36 1.6789640

Check for a normal distribution of the new variable. We now have a measure of biomass that we can put into an ANOVA.

> qqnorm(biomass$logbm)

The str() function looks at the structure of the data.

> str(biomass)
'data.frame':   2323 obs. of  6 variables:
 $ genet        : int  113 115 771 773 774 777 779 780 781 782 ...
 $ species      : Factor w/ 12 levels "Qu","Qual","Qufa",..: 3 4 3 4 4 10 4 4 10 10 ...
 $ cover        : Factor w/ 2 levels "n","o": 1 1 1 1 1 1 1 1 1 1 ...
 $ rough        : num  1 1 1 1 1 1 1 1 1 1 ...
 $ total.biomass: num  1.68 55.99 13.83 35.37 92.29 ...
 $ logbm        : num  0.519 4.025 2.627 3.566 4.525 ...

We will need ‘factors’ to put into the ANOVA model. species and cover are already factors, but rough is numeric, so we need to change rough into a factor before using it in the ANOVA. factor() changes a variable into a factor, and levels() shows the levels of a factor. rough_f now shows as having three levels.

> rough_f=factor(biomass$rough)
> levels(rough_f)
[1] "1" "2" "3"

Now that we have a variable with an approximately normal distribution and have created rough_f as a factor, we can perform an analysis of variance. Check out help(aov) for further details on the built-in R function that we are using in the following steps.

First we will look at cover types. logbm is the dependent variable in the model, and cover is the independent variable, as a factor. The data argument must direct R to the dataset these variables come from.

> cover.anova=aov(logbm~cover, data=biomass)

summary() produces the ANOVA table. Note that aov() will give Type I SS.

> summary(cover.anova)
              Df Sum Sq Mean Sq F value    Pr(>F)
cover          1   99.6    99.6  44.857 2.651e-11 ***
Residuals   2321 5151.7     2.2
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

We now know that there is a difference in the means. Use plotmeans() to see the means with 95% CI. You will need to load the gplots package first. plotmeans() takes the same model formula that is used in the aov() function.

> library(gplots)
> plotmeans(logbm~cover, data=biomass)

Next we can use rough_f to determine if there are differences in the mean biomass with time since fire.

> rough.anova=aov(logbm~rough_f, data=biomass)
> summary(rough.anova)
              Df Sum Sq Mean Sq F value    Pr(>F)
rough_f        2  371.5   185.7    88.3 < 2.2e-16 ***
Residuals   2320 4879.8     2.1
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

plot() draws diagnostic plots of the residuals of rough.anova. There are multiple figures; hit Enter to open each new figure.

> plot(rough.anova)
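The steps above (log transform, conversion to a factor, one-way aov()) can be tried without the real file. The following is a minimal sketch on simulated data; the data frame sim and all of its values are hypothetical stand-ins for biomass.txt, and only base R functions are used. It also shows why the conversion to a factor matters: a numeric rough is fitted as a single linear trend with 1 df, while the factor rough_f gets 2 df.

```r
# Simulated stand-in for biomass.txt (hypothetical data, not the real dataset)
set.seed(1)
sim <- data.frame(
  cover = factor(rep(c("n", "o"), each = 30)),
  rough = rep(1:3, times = 20),
  total.biomass = rlnorm(60, meanlog = 2, sdlog = 1)
)
sim$logbm <- log(sim$total.biomass)   # natural-log transform, as for logbm
sim$rough_f <- factor(sim$rough)      # numeric years -> factor with 3 levels

fit_fac <- aov(logbm ~ rough_f, data = sim)  # one-way ANOVA on the factor
fit_num <- aov(logbm ~ rough, data = sim)    # numeric rough: a linear trend

tab <- summary(fit_fac)[[1]]                 # the ANOVA table as a data frame
df_fac <- tab[["Df"]][1]                     # df for rough_f (levels - 1)
df_num <- summary(fit_num)[[1]][["Df"]][1]   # df for numeric rough

# The F value is the between-group mean square over the residual mean square
f_manual <- tab[["Mean Sq"]][1] / tab[["Mean Sq"]][2]
print(tab)
```

Storing rough_f inside the data frame, as this sketch does, keeps every model variable in one place; the handout instead keeps rough_f as a separate object, which aov() also finds.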
In the diagnostic plots produced by plot(rough.anova), R automatically labels the most extreme observations, which helps to identify possible outliers.

The TukeyHSD() function in R looks at the pairwise comparisons between groups.

> TukeyHSD(rough.anova)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = logbm ~ rough_f, data = biomass)

$rough_f
         diff        lwr      upr     p adj
2-1 0.7726830  0.6136161 0.931750 0.0000000
3-1 0.9537106  0.7059758 1.201445 0.0000000
3-2 0.1810276 -0.0845878 0.446643 0.2465235

Here we see the comparisons with adjusted p values. Years 2 and 3 both differ from year 1, while the 3-2 interval includes zero (p adj = 0.25), so those two groups are not distinguishable. Using the plot() function in combination with the TukeyHSD() function, we can plot the pairwise differences.

> plot(TukeyHSD(rough.anova))

We can also use the aov() function to look for interaction effects between two variables. Here we will use ‘cover’ and ‘rough_f’. For the interaction we use * between the two variables.

> int.anova=aov(logbm~cover*rough_f, data=biomass)
> summary(int.anova)
                Df Sum Sq Mean Sq F value    Pr(>F)
cover            1   99.6    99.6  48.703 3.873e-12 ***
rough_f          2  380.6   190.3  93.083 < 2.2e-16 ***
cover:rough_f    2   34.4    17.2   8.424 0.0002263 ***
Residuals     2317 4736.7     2.0
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Remember that this summary is Type I SS.

> int.anova2=aov(logbm~rough_f*cover, data=biomass)
> summary(int.anova2)
                Df Sum Sq Mean Sq F value    Pr(>F)
rough_f          2  371.5   185.7  90.851 < 2.2e-16 ***
cover            1  108.7   108.7  53.167 4.184e-13 ***
rough_f:cover    2   34.4    17.2   8.424 0.0002263 ***
Residuals     2317 4736.7     2.0
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

If we change the order of the variables, we see that the Sum Sq values for the main effects change. Type I SS are sequential, and the design is unbalanced, so each term is adjusted only for the terms entered before it. Order of variables matters! The interaction and residual rows are the same in both tables.
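The order dependence of Type I SS can be checked numerically. In the two tables above, the main-effect rows sum to the same total either way (99.6 + 380.6 = 371.5 + 108.7 = 480.2); only the split between cover and rough_f changes. The sketch below reproduces that behaviour on a small simulated, deliberately unbalanced design. The data frame sim, the cell sizes in n_per_cell, and the helper ss() are all hypothetical and not part of the original analysis.

```r
# Hypothetical unbalanced 2 x 3 design (cell sizes chosen arbitrarily)
set.seed(42)
cells <- expand.grid(cover = c("n", "o"), rough_f = c("1", "2", "3"))
n_per_cell <- c(12, 4, 6, 10, 3, 9)
sim <- cells[rep(seq_len(nrow(cells)), times = n_per_cell), ]
sim$cover <- factor(sim$cover)
sim$rough_f <- factor(sim$rough_f)
sim$logbm <- rnorm(
  nrow(sim),
  mean = 2 + 0.5 * (sim$cover == "o") + 0.4 * as.integer(sim$rough_f)
)

# Helper (hypothetical): named vector of sequential (Type I) sums of squares
ss <- function(fit) {
  tab <- summary(fit)[[1]]
  setNames(tab[["Sum Sq"]], trimws(rownames(tab)))
}

ss_a <- ss(aov(logbm ~ cover * rough_f, data = sim))  # cover entered first
ss_b <- ss(aov(logbm ~ rough_f * cover, data = sim))  # rough_f entered first

# Main-effect SS are split differently, but their total is unchanged
main_a <- ss_a[["cover"]] + ss_a[["rough_f"]]
main_b <- ss_b[["cover"]] + ss_b[["rough_f"]]
print(rbind(cover_first = ss_a[c("cover", "rough_f")],
            rough_first = ss_b[c("cover", "rough_f")]))
```

With equal cell sizes the two orderings would give identical tables, so the changed rows are themselves a sign that the biomass data are unbalanced.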
Using the drop1() function we are able to calculate the Type III SS. Each row shows the effect of deleting that single term from the full model; note that the main-effect rows depend on the contrast coding in effect.

> drop1(int.anova, ~., test="F")
Single term deletions

Model:
logbm ~ cover * rough_f
              Df Sum of Sq    RSS    AIC F value     Pr(F)
<none>                     4736.7 1667.1
cover          1      29.4 4766.1 1679.5  14.393 0.0001521 ***
rough_f        2      88.2 4824.8 1705.9  21.569 5.233e-10 ***
cover:rough_f  2      34.4 4771.1 1679.9   8.424 0.0002263 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

We can look at where this interaction is occurring using the interaction.plot() function. interaction.plot() will plot the means of each of the six groups. The variables are referenced through biomass$ because interaction.plot() has no data argument.

> interaction.plot(biomass$cover, biomass$rough, biomass$logbm)

If we imagine that this was a randomized block design, we can set up the ANOVA in R to take a block variable into account.

> block.anova=aov(logbm~rough_f+cover, data=biomass)

Here logbm is still the dependent variable, cover is being used as the block variable, and rough_f is the treatment factor. Notice that a block variable follows a +, where the interaction uses a *.
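As a rough illustration of the last two steps, the sketch below computes the six cell means that interaction.plot() draws, and then fits the additive (block-style) model in both orders. It assumes simulated data: sim and n_per_cell are hypothetical and unrelated to biomass.txt. In the additive model, drop1() gives each term's sum of squares adjusted for the other term, so its rows do not depend on the order of the formula, unlike the Type I rows from summary().

```r
# Hypothetical unbalanced 2 x 3 design standing in for the biomass data
set.seed(7)
cells <- expand.grid(cover = c("n", "o"), rough_f = c("1", "2", "3"))
n_per_cell <- c(12, 4, 6, 10, 3, 9)
sim <- cells[rep(seq_len(nrow(cells)), times = n_per_cell), ]
sim$cover <- factor(sim$cover)
sim$rough_f <- factor(sim$rough_f)
sim$logbm <- rnorm(nrow(sim), mean = 2 + 0.3 * as.integer(sim$rough_f))

# The six group means (2 cover types x 3 rough levels) that
# interaction.plot(sim$cover, sim$rough_f, sim$logbm) would plot
cell_means <- with(sim, tapply(logbm, list(cover = cover, rough_f = rough_f), mean))
print(cell_means)

# Additive model: treatment plus block, entered in either order
fit1 <- aov(logbm ~ rough_f + cover, data = sim)
fit2 <- aov(logbm ~ cover + rough_f, data = sim)

# Single-term deletions: each term adjusted for the other one
d1 <- drop1(fit1, test = "F")
d2 <- drop1(fit2, test = "F")

# In summary(), only the last-entered term matches its drop1() row
seq_cover_last <- summary(fit1)[[1]][["Sum Sq"]][2]
print(d1)
```

A common convention is to enter the block variable first, so that the sequential test of the treatment factor is already adjusted for blocks; comparing d1 with summary(fit1) shows which rows that choice affects.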