Hmielowski Nov 13
ANOVA in R
Tracy Hmielowski (thmiel1@lsu.edu)
11/13/09
Performing Analysis of Variance in R using biomass.txt dataset
> biomass=read.table("C:/. . . . . . . /biomass.txt", header=T)
> dim(biomass)
[1] 2323 5
Read in the biomass.txt dataset
Check dimensions of biomass.txt
The dataset should have 2323 observations
and five variables
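The read-and-check step can be tried without the real file. This is a minimal sketch: the three-row file below is a made-up stand-in for biomass.txt (the values and the temporary file name are assumptions for illustration only).

```r
# Hypothetical miniature stand-in for biomass.txt (values are made up).
demo_file <- tempfile(fileext = ".txt")
writeLines(c(
  "genet species cover rough total.biomass",
  "1 Qufa n 1 1.5",
  "2 Quin o 2 20.0",
  "3 Quni n 3 7.25"
), demo_file)

# header = TRUE tells read.table() that the first line holds column names.
demo <- read.table(demo_file, header = TRUE)

dim(demo)      # number of rows, then number of columns
names(demo)    # column names taken from the header line
head(demo, 2)  # first two rows
```

If dim() does not report the expected number of rows and columns, the usual suspects are a wrong separator or a missing header=T.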
> biomass[1:10,]
   genet species cover rough total.biomass
1    113    Qufa     n     1          1.68
2    115    Quin     n     1         55.99
3    771    Qufa     n     1         13.83
4    773    Quin     n     1         35.37
5    774    Quin     n     1         92.29
6    777    Quni     n     1          9.98
7    779    Quin     n     1          2.99
8    780    Quin     n     1         18.05
9    781    Quni     n     1         47.07
10   782    Quni     n     1          5.36
Look at the first ten entries in the dataset
Five variables: genet=ID for a given genet
species = species code of the genet (all are oaks)
cover = groundcover type, either ‘n’ or ‘o’
n – native ground cover
o – old field
rough = time since last fire in years (1-3)
total.biomass = the aboveground biomass of a genet
> qqnorm(biomass$total.biomass)
> biomass$logbm=log(biomass$total.biomass)
qqnorm() draws a normal Q-Q plot of a given variable
The second command creates a new variable, called logbm, which is a
natural-log (ln) transformation of total.biomass
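To see why the log transformation helps, here is a small sketch on simulated right-skewed values (not the real biomass data; the sample size and distribution parameters are assumptions). It compares how straight the Q-Q points are before and after log().

```r
set.seed(1)
# Hypothetical right-skewed "biomass" values (simulated, not the real data).
bm <- rlnorm(500, meanlog = 2.5, sdlog = 1.2)
log_bm <- log(bm)  # log() is the natural logarithm by default

# plot.it = FALSE returns the Q-Q coordinates without drawing the plot.
qq_raw <- qqnorm(bm, plot.it = FALSE)
qq_log <- qqnorm(log_bm, plot.it = FALSE)

# A correlation near 1 means the Q-Q points lie close to a straight line.
r_raw <- cor(qq_raw$x, qq_raw$y)
r_log <- cor(qq_log$x, qq_log$y)
c(raw = r_raw, log = r_log)
```

Leaving out plot.it = FALSE draws the plot as in the handout, and qqline(log_bm) adds a reference line to it.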
> biomass[1:10,]
   genet species cover rough total.biomass     logbm
1    113    Qufa     n     1          1.68 0.5187938
2    115    Quin     n     1         55.99 4.0251731
3    771    Qufa     n     1         13.83 2.6268401
4    773    Quin     n     1         35.37 3.5658640
5    774    Quin     n     1         92.29 4.5249358
6    777    Quni     n     1          9.98 2.3005831
7    779    Quin     n     1          2.99 1.0952734
8    780    Quin     n     1         18.05 2.8931457
9    781    Quni     n     1         47.07 3.8516359
10   782    Quni     n     1          5.36 1.6789640
Make sure that the new variable shows up
> qqnorm(biomass$logbm)
Check for normal distribution of the new variable
We now have a measure of biomass that we can
put into an ANOVA
> str(biomass)
'data.frame': 2323 obs. of 6 variables:
 $ genet        : int 113 115 771 773 774 777 779 780 781 782 ...
 $ species      : Factor w/ 12 levels "Qu","Qual","Qufa",..: 3 4 3 4 4 10 4 4 10 10 ...
 $ cover        : Factor w/ 2 levels "n","o": 1 1 1 1 1 1 1 1 1 1 ...
 $ rough        : num 1 1 1 1 1 1 1 1 1 1 ...
 $ total.biomass: num 1.68 55.99 13.83 35.37 92.29 ...
 $ logbm        : num 0.519 4.025 2.627 3.566 4.525 ...
str() function looks at the structure of the data
We will need ‘factors’ to put into the ANOVA model
species and cover are already factors; rough is numeric, so we will need to
change rough into a factor to use it in the ANOVA
> rough_f=factor(biomass$rough)
> levels(rough_f)
[1] "1" "2" "3"
factor() is used to change a variable into a factor
levels() shows the levels of a factor
rough_f now shows as having three levels
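A small sketch of the numeric-to-factor conversion on a made-up data frame (the name demo and its values are hypothetical). Two points are worth knowing here: since R 4.0.0, read.table() and data.frame() no longer turn text columns into factors unless stringsAsFactors=TRUE is given, and storing the new factor as a column of the data frame, rather than as a free-standing vector, is one way to keep it lined up with the other variables.

```r
# Hypothetical mini data frame standing in for biomass.
demo <- data.frame(
  rough = c(1, 2, 3, 1, 2, 3),
  cover = c("n", "o", "n", "o", "n", "o"),
  stringsAsFactors = TRUE  # text columns become factors, as in the str() output above
)

is.factor(demo$rough)               # FALSE: rough is still numeric
demo$rough_f <- factor(demo$rough)  # store the factor inside the data frame
levels(demo$rough_f)                # one level per distinct value
nlevels(demo$rough_f)               # number of levels
str(demo)
```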
Now that we have a variable with a normal distribution and have created rough_f as a factor we can perform an
analysis of variance. Check out help(aov) for further details on the canned function in R that we are using in the
following steps.
> cover.anova=aov(logbm~cover, data=biomass)
First we will look at cover types
logbm is the dependent variable in the model
cover is the independent variable, as a factor
must direct R to the dataset these variables are from
> summary(cover.anova)
              Df Sum Sq Mean Sq F value    Pr(>F)
cover          1   99.6    99.6  44.857 2.651e-11 ***
Residuals   2321 5151.7     2.2
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
summary() produces the ANOVA table
*** aov() will give Type I SS ***
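The ANOVA table can also be pulled apart as an object, which makes it easier to see where the F value comes from. The sketch below uses simulated data (group sizes and means are assumptions) and shows that F is the ratio of the two mean squares. In the table above the same check gives 99.6 / 2.2, roughly 45, which matches the reported 44.857 up to rounding of the printed mean squares.

```r
set.seed(42)
# Simulated data: two hypothetical cover types with different means.
demo <- data.frame(
  cover = factor(rep(c("n", "o"), times = c(60, 40))),
  logbm = c(rnorm(60, mean = 3.0), rnorm(40, mean = 2.0))
)

fit <- aov(logbm ~ cover, data = demo)
tab <- summary(fit)[[1]]  # the ANOVA table as a data frame
tab

# F is the mean square of the factor divided by the residual mean square.
f_by_hand <- tab[1, "Mean Sq"] / tab[2, "Mean Sq"]
p_value <- tab[1, "Pr(>F)"]
c(F = f_by_hand, p = p_value)
```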
> library(gplots)
> plotmeans(logbm~cover, data=biomass)
Will need to load gplots package
We know that we have a difference in the means
Use plotmeans() to see the means with 95% CI
Same model that is used in the aov() function
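If gplots is not installed, the numbers behind such a plot can be computed in base R. This is only a sketch on simulated data, and it assumes a t-based 95% interval computed separately for each group; plotmeans() may differ in its details.

```r
set.seed(7)
# Simulated data: two hypothetical cover types, 50 genets each.
demo <- data.frame(
  cover = factor(rep(c("n", "o"), each = 50)),
  logbm = c(rnorm(50, mean = 3.0), rnorm(50, mean = 2.2))
)

grp_mean <- tapply(demo$logbm, demo$cover, mean)
grp_sd   <- tapply(demo$logbm, demo$cover, sd)
grp_n    <- tapply(demo$logbm, demo$cover, length)

# Half-width of a t-based 95% confidence interval for each group mean.
half_width <- qt(0.975, df = grp_n - 1) * grp_sd / sqrt(grp_n)

ci <- data.frame(
  cover = names(grp_mean),
  mean  = as.vector(grp_mean),
  lower = as.vector(grp_mean - half_width),
  upper = as.vector(grp_mean + half_width)
)
ci
```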
Next we can use rough_f to determine if there are differences in the mean biomass with time since fire
> rough.anova=aov(logbm~rough_f, data=biomass)
> summary(rough.anova)
              Df Sum Sq Mean Sq F value    Pr(>F)
rough_f        2  371.5   185.7    88.3 < 2.2e-16 ***
Residuals   2320 4879.8     2.1
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> plot(rough.anova)
plot() draws diagnostic plots of the residuals of rough.anova
There are multiple figures, hit enter to open each
new figure. R will automatically label the most extreme points
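To avoid stepping through the figures one at a time, all four diagnostic plots can be placed on one page. The sketch below uses simulated data and writes the page to a PDF in the temporary directory; the file name is an arbitrary choice for illustration.

```r
set.seed(3)
# Simulated data: three hypothetical rough levels, 40 genets each.
demo <- data.frame(
  rough_f = factor(rep(1:3, each = 40)),
  logbm   = rnorm(120, mean = rep(c(2.0, 2.8, 3.0), each = 40))
)
fit <- aov(logbm ~ rough_f, data = demo)

out_file <- file.path(tempdir(), "rough_diagnostics.pdf")  # hypothetical file name
pdf(out_file)
old_par <- par(mfrow = c(2, 2))  # 2 x 2 grid: all four plots on one page
plot(fit)
par(old_par)
dev.off()

# The residuals can also be inspected directly.
res <- residuals(fit)
summary(res)
```

On screen, the pdf() and dev.off() lines can be dropped; par(mfrow=c(2,2)) followed by plot(fit) is enough.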
> TukeyHSD(rough.anova)
  Tukey multiple comparisons of means
    95% family-wise confidence level
Fit: aov(formula = logbm ~ rough_f, data = biomass)
$rough_f
         diff        lwr      upr     p adj
2-1 0.7726830  0.6136161 0.931750 0.0000000
3-1 0.9537106  0.7059758 1.201445 0.0000000
3-2 0.1810276 -0.0845878 0.446643 0.2465235
TukeyHSD() function in R will look at the pairwise
comparisons between groups
> plot(TukeyHSD(rough.anova))
Here we see the comparisons with adjusted p values
Using the plot() function, in combination with the
TukeyHSD() function we can plot the pairwise
differences
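The TukeyHSD() result is a list with one matrix per factor, so the comparisons can be filtered in code. A minimal sketch on simulated data (group means are assumptions): a pair differs at the 5% family-wise level when its interval excludes zero, as for 2-1 and 3-1 but not 3-2 in the output above.

```r
set.seed(11)
# Simulated data: three hypothetical rough levels, 30 genets each.
demo <- data.frame(
  rough_f = factor(rep(1:3, each = 30)),
  logbm   = rnorm(90, mean = rep(c(2.0, 3.0, 3.1), each = 30))
)
fit <- aov(logbm ~ rough_f, data = demo)

tuk <- TukeyHSD(fit)
tab <- tuk$rough_f  # matrix with columns diff, lwr, upr, p adj
tab

# Pairs whose 95% family-wise interval excludes zero.
excludes_zero <- tab[, "lwr"] > 0 | tab[, "upr"] < 0
rownames(tab)[excludes_zero]
```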
We can also use the aov() function to look for interaction effects between two variables. Here we will use ‘cover’
and ‘rough_f’
For main effects plus their interaction we use * between
the two variables (cover*rough_f expands to cover + rough_f + cover:rough_f)
> int.anova=aov(logbm~cover*rough_f, data=biomass)
> summary(int.anova)
                Df Sum Sq Mean Sq F value    Pr(>F)
cover            1   99.6    99.6  48.703 3.873e-12 ***
rough_f          2  380.6   190.3  93.083 < 2.2e-16 ***
cover:rough_f    2   34.4    17.2   8.424 0.0002263 ***
Residuals     2317 4736.7     2.0
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Remember that this summary is Type I SS
> int.anova2=aov(logbm~rough_f*cover, data=biomass)
> summary(int.anova2)
                Df Sum Sq Mean Sq F value    Pr(>F)
rough_f          2  371.5   185.7  90.851 < 2.2e-16 ***
cover            1  108.7   108.7  53.167 4.184e-13 ***
rough_f:cover    2   34.4    17.2   8.424 0.0002263 ***
Residuals     2317 4736.7     2.0
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
If we change the order of the variables
we see that the values for the Sum Sq
change – Order of variables matters!
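The order effect comes from unequal group sizes: with Type I (sequential) SS each term is adjusted only for the terms listed before it. In the two tables above the main-effect SS move between the two orderings, while their total (99.6 + 380.6 = 371.5 + 108.7 = 480.2), the interaction SS and the residual SS stay the same. The sketch below reproduces that pattern on a simulated unbalanced design (cell sizes and effect sizes are assumptions).

```r
set.seed(5)
# Hypothetical unbalanced design: the six cells have different sizes.
cells <- data.frame(
  cover   = rep(c("n", "o"), each = 3),
  rough_f = rep(c("1", "2", "3"), times = 2),
  n       = c(30, 10, 5, 8, 20, 25)
)
demo <- cells[rep(seq_len(nrow(cells)), times = cells$n), c("cover", "rough_f")]
demo$cover   <- factor(demo$cover)
demo$rough_f <- factor(demo$rough_f)
demo$logbm   <- rnorm(nrow(demo),
                      mean = 2 + 0.5 * (demo$cover == "o") +
                             0.4 * as.integer(demo$rough_f))

# Helper (illustrative): the Sum Sq column of an aov summary table.
ss <- function(fit) summary(fit)[[1]][["Sum Sq"]]

ss_cover_first <- ss(aov(logbm ~ cover * rough_f, data = demo))  # cover, rough_f, interaction, residuals
ss_rough_first <- ss(aov(logbm ~ rough_f * cover, data = demo))  # rough_f, cover, interaction, residuals

rbind(cover_first = ss_cover_first, rough_first = ss_rough_first)
```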
> drop1(int.anova, ~., test="F")
Single term deletions
Model:
logbm ~ cover * rough_f
              Df Sum of Sq    RSS    AIC F value     Pr(F)
<none>                     4736.7 1667.1
cover          1      29.4 4766.1 1679.5  14.393 0.0001521 ***
rough_f        2      88.2 4824.8 1705.9  21.569 5.233e-10 ***
cover:rough_f  2      34.4 4771.1 1679.9   8.424 0.0002263 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Using the drop1() function we are able to
calculate the Type III SS (each term adjusted for all the
others; the main-effect rows depend on the contrast coding in use)
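One caution on the drop1() approach: with R's default treatment contrasts, the main-effect rows test each factor at the reference level of the other factor, so they are not the usual Type III tests. Sum-to-zero contrasts (contr.sum) are the common way to get those. The sketch below, on simulated unbalanced data (cell sizes and effects are assumptions), fits the same model under both codings; the interaction row agrees, the main-effect rows do not.

```r
set.seed(9)
# Hypothetical unbalanced design: the six cells have different sizes.
cells <- data.frame(
  cover   = rep(c("n", "o"), each = 3),
  rough_f = rep(c("1", "2", "3"), times = 2),
  n       = c(30, 10, 5, 8, 20, 25)
)
demo <- cells[rep(seq_len(nrow(cells)), times = cells$n), c("cover", "rough_f")]
demo$cover   <- factor(demo$cover)
demo$rough_f <- factor(demo$rough_f)
demo$logbm   <- rnorm(nrow(demo),
                      mean = 2 + 0.5 * (demo$cover == "o") +
                             0.4 * as.integer(demo$rough_f))

# Same model, two contrast codings.
fit_trt <- aov(logbm ~ cover * rough_f, data = demo)  # default treatment contrasts
fit_sum <- aov(logbm ~ cover * rough_f, data = demo,
               contrasts = list(cover = "contr.sum", rough_f = "contr.sum"))

drop_trt <- drop1(fit_trt, ~ ., test = "F")
drop_sum <- drop1(fit_sum, ~ ., test = "F")

# Rows: <none>, cover, rough_f, cover:rough_f
data.frame(term        = rownames(drop_sum),
           treatment   = drop_trt[["Sum of Sq"]],
           sum_to_zero = drop_sum[["Sum of Sq"]])
```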
We can look at where this interaction is occurring using the interaction.plot() function
> interaction.plot(biomass$cover, biomass$rough, biomass$logbm)
interaction.plot() will plot the means of each
of the six groups (2 cover types x 3 rough levels).
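The six means that the plot draws can be tabulated directly. A minimal sketch on simulated data (the extra bump for old field at rough = 3 is an assumption built in to create an interaction): if the cover difference changes from one rough level to the next, the lines in the plot are not parallel.

```r
set.seed(21)
# Simulated balanced data: 20 genets in each of the six cover x rough cells.
demo <- expand.grid(rep = 1:20, cover = c("n", "o"), rough_f = c("1", "2", "3"))
demo$logbm <- rnorm(
  nrow(demo),
  mean = 2 + 0.3 * as.integer(demo$rough_f) +
         ifelse(demo$cover == "o" & demo$rough_f == "3", 1, 0)
)

# Rows are cover types, columns are rough levels.
cell_means <- with(demo, tapply(logbm, list(cover, rough_f), mean))
cell_means

# Difference between cover types within each rough level.
cover_diff <- cell_means["o", ] - cell_means["n", ]
cover_diff
```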
If we imagine that this was a randomized block design, we can set up the ANOVA in R to take into account a
block variable
> block.anova=aov(logbm~rough_f+cover, data=biomass)
Here logbm is still the dependent variable
cover is being used as the block variable
rough_f is the treatment factor
Notice that a block variable is added with +,
whereas the interaction model uses *
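The difference between + and * can be checked by asking R which terms a formula expands to. The sketch below also fits an additive model on simulated balanced data. Listing the block first is a choice made here, not the handout's: since aov() reports Type I SS, it gives a treatment SS that is adjusted for the block when the design is unbalanced (in a balanced design the order makes no difference).

```r
# Which terms does each formula contain?
additive_terms    <- attr(terms(logbm ~ rough_f + cover), "term.labels")
interaction_terms <- attr(terms(logbm ~ rough_f * cover), "term.labels")
additive_terms     # main effects only
interaction_terms  # main effects plus the rough_f:cover interaction

set.seed(8)
# Simulated balanced data: 20 genets in each of the six cells.
demo <- expand.grid(rep = 1:20, cover = c("n", "o"), rough_f = c("1", "2", "3"))
demo$logbm <- rnorm(nrow(demo),
                    mean = 2 + 0.5 * (demo$cover == "o") +
                           0.3 * as.integer(demo$rough_f))

# Block (cover) listed first, treatment (rough_f) second.
block_fit <- aov(logbm ~ cover + rough_f, data = demo)
summary(block_fit)
df.residual(block_fit)  # 120 observations - 1 intercept - 1 cover - 2 rough_f
```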