Introduction to Statistics
in R
Tutorial Material
BIOL 243
2012
Queen’s University
Edited by WA Nelson
Table of Contents

Tutorial 1: Getting Started with R!
   How to get and install R
   Great, let's get started!
   Working with vectors
   Working with data
   Getting help on R functions
   Let's get plotting
   The editor
   Concluding remarks
   Resources
   Function List
   Overview of the R software
   Customizing the work environment

Tutorial 2: Graphs and Descriptive Statistics in R
   Descriptive statistics
   Contingency tables
   Histograms
   Bar plots
   Box plots
   Scatter plots
   Function List

Tutorial 3: Hypothesis Testing in R
   Finding critical values
   Chi-squared test
   One-sample t-test
   Paired two-sample t-tests
   Independent two-sample t-tests
   Function List

Tutorial 4: Regression and correlation in R
   Fitting the linear regression
   Plotting the fit linear regression
   Evaluating the assumptions of linear regression
   Correlation
   Function List

Tutorial 5: One-Factor ANOVA in R
   Creating data frames for ANOVA
   Fitting the ANOVA model
   Evaluating the assumptions of ANOVA
   A posteriori contrasts
   Function List

Tutorial 6: Two-Factor ANOVA in R
   Creating data frames for ANOVA
   Fitting the ANOVA model
   Evaluating the assumptions of ANOVA
   A posteriori contrasts
   Function List

Reference Cards
   Functions
   Arithmetic, Indexing and Logic Operators
   Plotting Options
Tutorial 1: Getting Started with R!
W.A. Nelson
This guide is an introduction to the statistical software environment R. There are
many excellent books and online resources (see Resources) that give a broader
background and overview of the wonderful world of R. Rather than try to emulate
these resources, the intention here is to provide a beginner’s guide that will help
you become familiar with the basics of R quickly, while giving you the tools to
learn more as you encounter new problems.
What is R? R is an object-oriented programming language and computational
environment for a wide range of statistical and mathematical analyses. R can be
used as a calculator; as a versatile graphing tool; to run basic and advanced
statistics; for numerically solving differential equations and matrix models; and
for more specialized analyses such as inferring phylogenies, calculating genetic
distances and bioinformatics. R is supported and developed by academics, and
has an amazing array of references and help resources available on-line and in
print.
Here's a short list of reasons to use R, even in an intro stats course (adapted from Why use R? by J. Holland Jones, monkeysuncle.stanford.edu/?p=367).
• R is used by the majority of academic statisticians.
• R is free!
• R is effectively platform independent. If you live in an all-Windows
environment, this may not be such a big deal but for those of us who use
Linux/Mac and work with people who use Windows, it’s a tremendous
advantage.
• R has unrivaled help through books and on-line resources (but the immediate
help functions in R can be difficult to understand).
• R makes the best graphics.
• The command-line interface — perhaps counterintuitively — is much, much
better for teaching. You can post your code exactly and students can
reproduce your work exactly.
• R is more than a statistics application. It is a full programming language.
• The online distribution system beats anything out there.
Along with its growing popularity, however, R has a reputation for being difficult
to learn because it uses command-line programming rather than the graphical
user interface (menus, etc.) used by most commercial statistics programs.
The contents of this primer are shaped by our perspective on the challenges of learning R, with the intent of providing a straightforward roadmap to start using R efficiently. The first sections introduce you to using the
console (the window where commands are typed), understanding the structure of
arithmetic operations, plotting simple graphs, and getting help.
How to get and install R
R is free. Go to http://cran.r-project.org (CRAN stands for the Comprehensive R
Archive Network) and click on one of the Linux, MacOS X or Windows links in the
Download and Install R box (you won't need any of the source code files; these are
for developers). Follow the links for the base files (for Windows users) and
download the latest version (‘R-2.15.1-win32.exe’ for Windows users and
‘R-2.15.1.dmg’ for MacOSX users). The downloaded file has an installer in it, so
double clicking the file and following the prompts is all you need to do to install
R.
Great, let’s get started!
Start R as you would any program. You should have a single window open, which
is called the console (see Overview of the R software for more details). This is the
R program. When you move your cursor to the console, you should see
>
(note that we will use a green background throughout this guide to indicate what
you see in the console). This is where commands are entered. Let's try it. Type
"2+2" and then press the enter key:
> 2+2
You should see the answer appear (preceded by “[1]”) below the line you typed:
> 2+2
[1] 4
The number in square brackets just indicates that this is the first entry in the
vector being returned. Now let’s create some variables. First, create a new variable
‘dan’ and assign it the value 2.
> dan=2
then create a new variable ‘ted’ and assign it the value 5
> ted=5
Now we can have some fun manipulating dan and ted.
> dan-ted
[1] -3
> ted/dan
[1] 2.5
> dan*ted
[1] 10
If you want to raise something to an exponent, use the '^' symbol
> ted^dan
[1] 25
The Reference Card shows a list of common arithmetic operations. Now that you have your feet wet, here's a short list of useful R commands and operations (from Starting to Use R by R. Montgomerie, 2010):
• > is the prompt from R on the console, indicating that it is waiting for you to
enter a command
• Each command entered by hitting <return> (<enter> on a Windows PC) is
executed immediately and often generates a response; commands and
responses are shown in different colours (can be changed in Preferences, see
below) and this really helps when you are trying to find things on the console
after a long session.
• When you hit <return> (<enter> on a Windows PC) before a command is completely typed, a + will appear at the beginning of the next line.
• R is case sensitive, so House is a different variable from house.
• R functions are followed by (), with or without something inside the brackets
depending on the function (e.g., min(x) gives the minimum value of the x
vector).
• Enter citation() to see how to cite R in your paper
• Don’t worry about spaces around symbols in a typed line; they do not have
any effect, so win=3+4 is the same as win = 3 + 4.
• The # sign begins a comment that will not be executed, and can either follow
a command, or begin a new line. This is a great way to make notes to
yourself in the code.
• To assign variables, use the = assignment operator; xxx = yyy assigns yyy
to a new object xxx; most advanced R users and books use <- instead of =
but this is equivalent for our purposes and = is easier to remember and type.
• The up arrow ↑ scrolls back through previous commands that you have
entered on the console; you can either press <return> (<enter> on a Windows
PC) to execute the command again or edit it to correct a mistake. Very handy.
• If you make a mistake when typing and R returns ‘syntax error’ or has a ‘+’
sign rather than the prompt, press the escape key. This will give you a fresh
prompt.
Working with vectors
A vector (i.e., a list of numbers) is created in R using the function c(). In R, all
functions use round brackets to accept input with each entry separated by a
comma. To create a vector all we need to do is to provide the function c() with a
list of numbers.
> julie=c(3.2,4.1,5.5,6.2,7.1,8.4,9.5)
To see what is in the variable, just type its name.
> julie
[1] 3.2 4.1 5.5 6.2 7.1 8.4 9.5
To access a particular entry in the vector, use square brackets at the end of the
name. Here the [3] indicates that you want the number in the third spot.
> julie[3]
[1] 5.5
We can even access a specific set of the elements in the vector
> a=c(3,4,5)
> julie[a]
[1] 5.5 6.2 7.1
Here the ‘a’ vector is used to specify which of the elements in ‘julie’ were needed.
The same thing can be achieved by entering:
> julie[c(3,4,5)]
[1] 5.5 6.2 7.1
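As an aside, R's colon operator generates sequences of consecutive integers, so the same selection can be written more compactly:

> julie[3:5]
[1] 5.5 6.2 7.1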
Mathematical operations can be done directly on vectors. Here are some examples:
> julie-7
[1] -3.8 -2.9 -1.5 -0.8 0.1 1.4 2.5
> julie-julie
[1] 0 0 0 0 0 0 0
> julie^2
[1] 10.24 16.81 30.25 38.44 50.41 70.56 90.25
Notice that the operation is carried out on each entry of the vector independently.
The output from a new operation can also be assigned to a new variable:
> b=julie-7
We can look at the sum of all entries in the vector (using the function sum())
> sum(b)
[1] -5
or skip a step and nest these two operations in one command
> sum(julie-7)
[1] -5
Some other functions that are useful for summarizing information in vectors are mean(), min() and max(). Remember that spaces within a function have no influence, but R is very picky about commas and about whether or not letters are capitalized.
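For example, applying these functions to the 'julie' vector created above:

> mean(julie)
[1] 6.285714
> min(julie)
[1] 3.2
> max(julie)
[1] 9.5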
Working with data
Vectors are a great way to store certain types of data in R, but they consist of only
a single column of data. In this section, we will learn to store more than one
column of data in data frames, as well as how to import data into R. Data frames are much
like a collection of vectors, and are created using the data.frame() function.
Begin by creating two vectors (they must have the same number of elements)
> x=c(1,2,3,4,5,6,7,8)
> y=c(2.1,1.3,4.2,7.1,8.7,12.1,11.2,14.5)
Then use the data.frame() function to bind these two vectors into one
> steve=data.frame('X'=x,'Y'=y)   #here 'X' and 'Y' become the column labels
This will create an 8 x 2 data frame with column titles X and Y (you can choose
any name you like for the column titles). Type the variable name steve to see what
is in it. To access just the X column, you can type
> steve[,1]
[1] 1 2 3 4 5 6 7 8
where the square brackets give you access to specific rows and columns in the
format [rows , columns]. The blank before the comma means to select all rows. As
an alternative, data frames let you call a column by its name with the '$' sign:
> steve$X
[1] 1 2 3 4 5 6 7 8
If you want a specific value from the data frame, you can access it using either of
the indexing methods.
> steve[4,1]
[1] 4
> steve$X[4]
[1] 4
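Leaving the space after the comma blank selects an entire row instead; for example, the fourth row of 'steve':

> steve[4,]
  X   Y
4 4 7.1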
Data frames are one of the more common ways to store data in R, and are often
the form required by functions (the function help file will always indicate the
form that the argument needs to be in).
Now we can learn about importing and exporting data. R can read and write many
types of files (e.g., .txt, .csv, .xls), but here we will just explore working with CSV
(comma-separated values) files. CSV files have a '.csv' ending to the file name,
and can be created using a variety of different spreadsheet software, such as
Excel or Numbers. Reading data from .csv files is done with the read.csv()
function. To start, make sure that your data are organized in the .csv file the way
you want to use it in R, and give each column a title (we recommend that each
column have a title without spaces, like ‘BillWidth’, and that you keep these short
as you will have to type them out later). For example, the file
WidthLengthBeetles.csv (opened in Excel) is shown in Figure 1.1. If you create your data file in Excel, make sure you save it as a CSV file.
Figure 1.1 Example Excel file for creating data sets.
The data file is imported using
> my.dat=read.csv(file.choose())
The function file.choose() opens up a new window that lets you browse for
the file, and the contents of the file are assigned to the variable my.dat (note that
you can use a different name here if you want, but be aware that there are some
restrictions) as a data frame. Once you have imported the .csv file, type my.dat to
9
Queen’s University!
September 2012
see its contents. Notice that we used a period in the new variable name. In R,
variable names can include letters, numbers, periods and underscores, but not spaces or symbols that are used for other purposes, such as "!". This gives you the basics
of importing data, as well as creating and using data frames.
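Exporting works much the same way in reverse, using the write.csv() function. A minimal sketch (the file name here is just an example):

> write.csv(steve, file='MyData.csv', row.names=FALSE)   #writes the 'steve' data frame to a .csv file in the working directory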
Getting help on R functions
This is probably the most important section of this primer because there are
thousands of functions in R, and it is essential to learn how to use the help files
(which can be difficult to understand). There are numerous sources for help, such
as web forums and blogs, books, and internet help sites (see Resources). Another
good source is the help files in R itself. For example, let's have a closer look at the
mean() function, which returns the average value of a vector. To get the help-file
for a function in R, simply type a question mark followed by the name.
> ?mean
This will open a new window with the help file. All help files have the same basic sections: Description, Usage, Arguments, Value, References, See Also, and Examples. Figure 1.2 shows an annotated help file for the mean() function. The
key parts for any function help-file are the ‘Arguments’ and ‘Value’ sections since
these describe the input and output of a function. For the mean() function, the
arguments are the data object x that you are computing the mean for, and two
options—trim and na.rm. The typical use is
> y=c(2.1,1.3,4.2,7.1,8.7,12.1,11.2,14.5)
> mean(y)
[1] 7.65
but if you have values that are missing or not a number (designated by NA or NaN in R), the na.rm=TRUE option allows you to ignore these entries when calculating the mean.
For example, the following commands return NA
> y=c(2.1,1.3,4.2,7.1,8.7,12.1,11.2,14.5,NA)
> mean(y)
[1] NA
while the following will drop the NA entry and calculate the correct mean
> mean(y,na.rm=TRUE)
[1] 7.65
The help files give an example of the function in use, and often give suggestions
for closely related functions.
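If you do not know the name of the function you need, you can also search the help system by keyword with the help.search() function (or its shorthand ??). For example:

> help.search('median')   #equivalent to typing ??median

This opens a window listing the help pages whose titles or keywords match the search term.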
Figure 1.2 Navigating the help files for R functions. This
marked up sheet is for the mean() function.
Let’s get plotting
R can be used to generate some of the nicest graphs of any statistical software.
This section introduces you to two common plots to illustrate what can be done.
We start with a histogram of some data, which shows the number of observations
in your data set that fall in bins along the x-axis. Import the
WidthLengthBeetles.csv data set as described above. This data set has two
columns of data: Width and Length. To plot a histogram of just the wing width
data, type
> hist(my.dat$Width)
where the command my.dat$Width accesses just the width data as described
above. The plot should look like the image in Figure 1.3.
Figure 1.3 Histogram using the hist() function.
It’s as simple as that. Graphs are plotted in a separate window, and a plotting
function (such as hist()) will open a new window if none are open.
The second type of plot we consider here is an X-Y scatter plot, which we can use
to plot the width and length at the same time. Scatter plots are created using the
plot() function, which requires the arguments X and Y (as well as other options
if desired), where X is a vector of all x values you want to plot, and Y is a vector of
the y values. The first entry of the X vector corresponds with the first entry of the
Y vector, and so on. For example, to create a basic plot of the weevil data we
would type:
> plot(my.dat$Width,my.dat$Length)
Axis labels are added using the xlab='...' and ylab='...' options, and a plot title is added using the main='...' option, where the '...' is where you place the
label text. There are many options available to make plots look nice (see Plotting
options), but we will wait until the second tutorial to go beyond adding labels.
The weevil data plot with labels added looks like:
> plot(my.dat$Width,my.dat$Length,xlab='Wing Width
(mm)',ylab='Wing Length (mm)',main='Plot of Wing Width by
Length in Cowpea Weevils')
Figure 1.4 Scatterplot using the plot() function.
The editor
Now that we have covered a number of functions and tools, we will want to save
our work for future use, which is done using an editor. The editor also lets you
build your own library of commonly used functions and notes on how you
implemented them. R has an editor embedded within it. The editor is linked to
the console, so we can ‘submit’ our work to the console right from the editor if we
want. To start a new editor file in MacOSX, click on ‘File’ from the menu bar and
select ‘new document’; in Windows OS, click on ‘File’ and select ‘New Script’. Save
this file (it should have a ‘.r’ ending). Now you can type your program. It’s a good
idea to add comments to help remind yourself of what each line does. Here’s an
example; notice that the prompts are shown in the console but not in the editor.
#This file calculates the mean wing width and length, and plots the weevil data
my.dat=read.csv(file.choose())    #reads in the data file
mean(my.dat$Width)                #calculates the mean wing width
mean(my.dat$Length)               #calculates the mean wing length

#Scatter plot of the weevil data
plot(my.dat$Width,my.dat$Length,xlab='Wing Width (mm)',ylab='Wing Length (mm)',main='Plot of Wing Width by Length in Cowpea Weevils')
To submit the program in MacOSX, select the text with your mouse, hold the command (⌘) key and press return; in Windows OS, click 'Edit' from the menu and select 'run all'. In Windows OS, you can also run each line by selecting the line and pressing Ctrl+R.
Concluding remarks
This first tutorial provides the elements to begin using R, which we will build on
in the following tutorials to expand your ability to do a range of statistics,
calculations and graphing. The Resource section below provides a list of on-line
books and websites for more information on these topics, as well as a series of
quick reference cards. The final sections provides a complete overview of the R
environment that will become a useful reference as you gain more experience
using the software.
Resources
Websites
1. Quick R by Robert Kabacoff is a good resource for a quick introduction to
statistical functions in R. (www.statmethods.net)
2. R Tutorial by Chi Yau is a really good quick reference website and primer.
(www.r-tutor.com/)
Online Books and Manuals
3. Simple R by John Verzani is a more extensive guide to standard statistics, from one-way hypothesis testing to ANOVAs and regression. (cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf)
4. R Primer by Christopher Green is a primer for an upper year stats course, and
has a detailed introduction to logic tests and looping (as well as statistical
functions). (www.stat.washington.edu/cggreen/rprimer/first-edition/rprimer.09212004.firsted.pdf)
5. An Introduction to R by Venables & Smith is a great resource for matrices and
matrix functions, as well as writing your own functions and more advanced
plotting functions. (cran.r-project.org/doc/manuals/R-intro.pdf)
6. Statistics Using R with Biological Examples by Seefeld & Linder is a nice reference for statistical methods that go beyond frequentist statistics. (cran.r-project.org/doc/contrib/Seefeld_StatsRBio.pdf)
7. A Beginner's Guide to R by A. Zuur is a comprehensive R book that is available
online through Queen’s library (type the title into QCAT).
8. Introductory Statistics with R by P. Dalgaard is a good statistics text book that
uses R, and is available online through Queen’s library (type the title into
QCAT).
Function List
Here is a list of the functions, operators and options covered in tutorial 1; see the
quick Reference Card for more details.
Functions: c( ), min( ), max( ), sum( ), mean( ), data.frame( ), read.csv( ),
file.choose( ), hist( ), plot( ).
Arithmetic operators: +, -, /, *, ^, =, [ ], ( ).
Plotting options: xlim=c( ), ylim=c( ), xlab=' ', ylab=' '.
Overview of the R software
This overview is from Starting to Use R by R. Montgomerie (2010). The diagram below shows
the relation between the R programming language and various internal and
external components, all described below:
While this probably looks a bit daunting, we think it helps to understand what is
linked to the console, since the console is what you will see and work with most of
the time.
Console
This is your main window that you see when you open the R application, where
you will type commands for R to implement, and where you will see the text
responses to those commands. The console has a toolbar at the top for quick
access to some commands.
R Language
This is the interpreter that is the heart and soul, or more correctly the brain, of R. It is invisible to you.
Working Directory
This is where R will look for files that you specify in commands, unless you give
the full path to a file. Set this in the Preferences.
Packages
R is complemented by packages that contain specific functions and data sets that
you can use once the Package is available in your workspace. Many packages are
built in and are immediately available when you download and install R; others
can be downloaded and installed into R. See http://CRAN.R-project.org for a long
list of available packages.
Editor
This is either an internal module or an external program (that you specify) to
compose and edit R commands and programs and then to save them in a text file
for later use. The R editor highlights functions, commands and other things in
different colours and automatically closes brackets for you etc. R is easiest to
work with if you do all your basic work in the editor, then transfer commands to
the console, then save your editor files for easy later reference.
Package Installer
Use this to download new packages so that they can be loaded into your
workspace as needed using the Package Manager.
Quartz Window
All of your graphs will appear here. Size and shape can be adjusted by just
changing the window size using the handle in the lower right corner of the
window. You can save the window as a pdf file that can be opened and edited in
Adobe Illustrator.
Package Manager
Use this to load into the workspace packages that have been downloaded and
installed.
Data Manager
Lets you see all data sets and variables that are in your workspace, including
those in loaded Packages.
Data Editor
This is a window where you can edit your data. Use fix(dat) to see your data in the
Data Editor, where dat is the name of a data set object in your workspace (note
that we use dat for the generic name of your data set throughout this document).
Workspace
This is an invisible area where all of the work gets done, using the R Language to
manipulate objects. When you are working in an R session, all of the objects that
you have loaded or created are stored in a workspace that can be saved at the end
of your session. You can also set a preference so that R always saves the current
workspace when you close R, then opens that same workspace when you re-open
R.
Workspace Browser
Use this menu item to see a listing of the objects in the current workspace in
spreadsheet format. You can use the toolbar on this window to delete objects or
edit them, which will open the Data Editor window.
History
Use the History icon in the toolbar to see all the commands used since you last
cleared the history. Double click on a command in this list to have it entered anew
at the prompt. You can also load previous history that you might have saved.
External Files
These are text files containing data, commands, functions, history or other
information that is either saved during an R session or read into the workspace
during a session. These are shown as diamonds in the diagram.
Customizing the work environment
You could just start using R, but your experience will be better if you customize the working environment to your personal taste. You need to do this only once (on each of your computers), and the environment will be the same each time you start R. First, customize the toolbar by control-clicking on it, choosing Customize Toolbar, and setting it up the way you like it. Second, select the R>Preferences menu item and customize each of the settings. You might want to make the background colour something other than plain white, to make it easy to distinguish from other windows on your screen, or you may want a bigger font than the default. Don't put anything on the toolbar that you do not understand or use; you can always add things later.
Tutorial 2: Graphs and Descriptive Statistics in R
M. Kelly
In this tutorial, we will learn how to calculate summary statistics and create graphs in R.
Descriptive statistics
Descriptive statistics are quantitative attributes that characterize a data set, such
as the mean, median, and standard deviation. For example, Table 2.0 shows
observations of mercury concentration in the sediments of two imaginary lakes.
Table 2.0 Sediment Concentrations in Two Lakes Studied for Mercury Distribution

Depth (cm)   Lucky Lake (ppb)   OddBall Lake (ppb)
    1              64.07             122.17
    2              55.36             100.29
    3              61.17              79.54
    4              72.51              86.07
    5              87.68              78.24
    6              72.31              77.25
    7              76.67              69.89
    8              63.05              66.08
    9              68.33              91.26
   10              59.87              73.68
   11              86.48              67.58
   12              80.71              73.08
   13              58.85             102.54
   14              45.02              83.73
   15              63.17              88.86
   16              72.06             106.63
   17              71.51              67.72
   18              69.88              82.56
   19              63.55              93.73
   20              87.78              71.41
At first glance, it can be difficult to see patterns in the data set. However, by
comparing mercury concentrations between lakes using descriptive statistics, you
can get some first impressions of the data. Since this data set is short, we can
enter it directly in the console.
> OddBall=c(122.17, 100.29, 79.54, 86.07, 78.24, 77.25, 69.89,
66.08, 91.26, 73.68, 67.58, 73.08, 102.54, 83.73, 88.86,
106.63, 67.72, 82.56, 93.73, 71.41)
> Lucky=c(64.07, 55.36, 61.17, 72.51, 87.68, 72.31, 76.67,
63.05, 68.33, 59.87,86.48, 80.71, 58.85, 45.02, 63.17, 72.06,
71.51,69.88, 63.55, 87.78)
The mean() function is used to calculate the average mercury concentration for each lake

> mean(Lucky)
[1] 69.0015
> mean(OddBall)
[1] 84.1155

which reveals that OddBall lake has a higher mean mercury concentration than Lucky. Similarly, the standard deviation function sd() is one way to describe the amount of variation within a data set; using it, we find that OddBall lake has more variation as well.
> sd(Lucky)
[1] 11.18146
> sd(OddBall)
[1] 15.03170
At this point, we could summarize the two data sets by saying that the mean mercury concentration in Lucky lake is 69.00 (s=11.18) and in OddBall lake is 84.12 (s=15.03), where s represents a sample estimate of the standard deviation.
Other useful descriptive statistics are the minimum and maximum values, which
are calculated using the min() and max() functions respectively.
> max(Lucky)
[1] 87.78
> min(Lucky)
[1] 45.02
> max(OddBall)
[1] 122.17
> min(OddBall)
[1] 66.08
The mean is a suitable description of central tendency when the data does not
have extreme values. When a data set has extreme values, the median and
quantiles are often more valuable descriptors of central tendency and variation.
Medians are calculated using the median() function
> median(Lucky)
[1] 69.105
> median(OddBall)
[1] 81.05
and quantiles (including the 50% quantile, which is the median) using the
quantile() function.
> quantile(Lucky)
    0%    25%    50%    75%   100%
45.020 62.580 69.105 73.550 87.780
> quantile(OddBall)
      0%      25%      50%      75%     100%
 66.0800  72.6625  81.0500  91.8775 122.1700
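The quantile() function also accepts a probs option when you want specific quantiles rather than the default quartiles; a small sketch:

> quantile(Lucky, probs=c(0.1,0.9))   #just the 10% and 90% quantiles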
Contingency tables
Contingency tables summarize the frequency of observations that fall into one or
more categories, and provide a compact way to visualize relationships among the
categories. For example, mercury concentrations above 100 ppb might be
considered ‘at risk’, while those below 100 ppb could be considered ‘acceptable’.
Table 2.1 shows the contingency table for Lucky lake and OddBall lake.
Table 2.1 Contingency Table of Lake Sediments at Risk of Contamination

Sediment Status    Lucky Lake    OddBall Lake    Row Total
At Risk                 0              4               4
Acceptable             20             16              36
Column Total           20             20              40
To create a contingency table in R, we first need to create two new categorical
variables that indicate the source lake for the observation, and whether it is
acceptable or at risk. This type of data structure is referred to as stacked because
the data from each category is stacked on top of each other. Here, we will call the
vector indicating the source lake ‘lake’ (‘O’ is for OddBall lake, and ‘L’ is for Lucky
lake), and the vector indicating risk ‘risk’ (‘Y’ is for at risk, ‘N’ is for not at risk).
> lake=c('O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'L', 'L',
'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L',
'L', 'L', 'L', 'L', 'L', 'L')
> risk=c('Y', 'Y', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N',
'N', 'N', 'Y', 'N', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'N',
'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N',
'N', 'N', 'N', 'N', 'N', 'N')
The contingency table is created using the table() function.
> table(lake,risk)
risk
lake N Y
L 20 0
O 16 4
The table shows that no sediment layers in Lucky lake are of concern, but four
layers in OddBall lake are.
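As an aside, typing long categorical vectors by hand is tedious; the same 'lake' and 'risk' vectors can be built with the rep() and ifelse() functions (a sketch using the concentration vectors entered earlier):

> lake=rep(c('O','L'),each=20)                #20 copies of 'O' followed by 20 copies of 'L'
> risk=ifelse(c(OddBall,Lucky)>100,'Y','N')   #'Y' wherever the concentration exceeds 100 ppb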
Graphs
Tutorial 1 introduced some basic plotting functions. Here, we will introduce new
functions that allow you to create versatile graphs of your data.
Figure 2.1 Data Distribution for Sediment Mercury
Concentration at Lucky Lake (A) and OddBall Lake (B)
Histograms
A histogram is a plot of the observed frequency of an event, and is created in R
using the hist() function. A basic histogram is created as follows:
> hist(Lucky)
23
Queen’s University!
September 2012
To create custom text for the figure, use the xlab='...', ylab='...', and main='...' options, which add the desired text (entered in place of ...) to the x-axis, y-axis and plot title respectively.
> hist(Lucky,main="Distribution of Hg Concentrations in Lucky
Lake Sediment",xlab="Hg Concentration (ppb)",ylab="Observed
Frequency")
These options work with most of the plotting functions used in R. The histogram
for each lake is shown in Figure 2.1.
Bar plots
The barplot() function is used to create bar plots of your data. The number of bars in the figure is equal to the number of observations in the data set, and the y-axis shows the value of the observations within a variable. For example, a labelled bar plot of the Lucky lake data is created by
> barplot(Lucky,main='Sediment Mercury Concentration in Lucky
Lake')
With the default values, the function creates a bar plot with a y-axis that ranges from 0 to 80, which leaves some bars overhanging the axis range. To tidy this up, we can use the ylim option, which has the form ylim=c(a,b), where the desired minimum and maximum are entered in 'a' and 'b' respectively. We can also add color to the plot using the col option, which has the form col='...' (e.g., col='blue'), where the color is entered as text (type colors() into R to see a list of all colors). The more polished bar plot (Figure 2.2) is created by
> barplot(Lucky,ylim=c(0,100),col='red',main='Sediment Mercury
Concentration in Lucky Lake',xlab='Sediment
Sample',ylab='Mercury Concentration (ppb)')
To visualize more than one category, we can use stacked or grouped bar plots. To
do this, we first create a data frame of the entire data set.
> BothLakes=data.frame('OddBall'=OddBall, 'Lucky'=Lucky)
The data frame (“BothLakes”) has two columns with each row a different depth.
Since we want to plot the different depths across the x-axis, the barplot()
function requires us to switch the rows and columns so that the data frame has
two rows (OddBall and Lucky), and 20 columns that represent the depths.
Figure 2.2 Bar plots of Mercury Concentrations (ppb) in
Sediments.
This is done using the t() function, which transposes the data set (i.e., switches the rows and columns). Because t() works on matrices, first convert the data frame to a matrix using the as.matrix() function.
> BothLakes=as.matrix(BothLakes)
> BothLakes=t(BothLakes)
A stacked bar plot (Figure 2.3) is then created using
> barplot(BothLakes,main='Comparison of Mercury Concentration
by Depth Interval', xlab="Depth Interval", ylab='Concentration
of Mercury (ppb)', ylim=c(0,200),col=c('red', 'lightblue'))
A grouped bar plot is created using the beside option, which has the form beside=TRUE.
> barplot(BothLakes,main='Comparison of Mercury Concentration
by Depth Interval', xlab="Depth Interval", ylab='Concentration
of Mercury (ppb)', ylim=c(0,150),col=c('red',
'lightblue'),beside=TRUE)
Figure 2.3 Stacked Barplot of Sediment Mercury in OddBall Lake (red)
and Lucky Lake (blue).
Box plots
A box and whisker plot illustrates the median and quantile levels. The function, boxplot(), indicates the median by a dark band within a "box" which is bound by the 1st and 3rd quartiles. "Whiskers" in this plot extend to the most extreme data point within 1.5 times the interquartile range (i.e., the 75% quantile minus the 25% quantile). Values outside the whiskers are plotted as points. To visualize the
mercury concentrations in each lake, we need to create stacked data similar to
what was used for the contingency tables. The first vector will be the mercury
concentrations for both lakes, and the second vector will indicate which lake the
observation is from.
> mercury=c(OddBall,Lucky)
> lake=c('O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O',
'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'L', 'L',
'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L',
'L', 'L', 'L', 'L', 'L', 'L')
Figure 2.4 Mercury Concentrations in Lucky and OddBall lake.
The boxplot is created using
> boxplot(mercury~lake,main="Mercury Concentrations by Lake",
ylab='Mercury Concentration (ppb)',xlab='Lake',col=
c('red','lightblue'))
where the term mercury~lake is a formula that tells the boxplot() function that the mercury quantiles should be calculated for each level in lake.
Scatter plots
A scatterplot displays quantitative data for two variables. The plot() function
was encountered in Tutorial 1, and provides a great way to create scatterplots. For
our example of mercury concentrations in two lakes, we can compare the mercury
at each sediment depth as follows:
> plot(OddBall,Lucky,main='Comparing Lake Mercury
Concentrations',xlab='OddBall Lake Concentration', ylab='Lucky
Lake Concentration',xlim=c(60,130),ylim=c(40,90))
We can highlight changes relative to the means by adding two lines that represent
the mean for each lake. The abline() function is used to create straight lines on
a plot. For a vertical line, the abline() function has the form abline(v=value),
where “value” indicates where you want the line drawn.
> abline(v=84.1155,col= 'blue')
For a horizontal line the abline() function has the form abline(h=value) as
follows:
> abline(h=69.0015,col='red')
The final graph is shown in Figure 2.5.
Figure 2.5 Concentrations of Mercury in Sediments with Lucky
and OddBall Lake means, blue and red respectively.
Function List
Here is a list of the functions and options covered in tutorial 2; see the quick
Reference Card for more details.
Functions: mean( ), median( ), sd( ), quantile( ), min( ), max( ), table( ), hist( ),
barplot( ), boxplot( ), plot( ), abline( ), t(), as.matrix().
Plotting options: xlim=c( ), ylim=c( ), xlab='...', ylab='...', main='...', col='...'.
Tutorial 3: Hypothesis Testing in R
H. Haig
Finding critical values
An important component of hypothesis testing is to find the critical value for a test statistic, such as from a Student's t, F or Chi-squared distribution. While these values can be looked up in tables (such as those found in your textbook), they can also be calculated in R using the qt(), qf() and qchisq() functions. All three functions take the same input: the cumulative probability and the degrees of freedom. The critical value returned is the location on the x-axis where the requested cumulative probability lies entirely to the left (Figure 3.1).
For example, consider calculating the critical t-score for a one-sided t-test with 19 degrees of freedom and α=0.05. To find the critical value, type:
> qt(0.95,19)
[1] 1.729133
and for a two-tailed test you would type
> qt(0.975,19)
[1] 2.093024
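The pt() function works in the opposite direction, converting an observed t score into a cumulative probability, which is handy for computing p-values by hand; for example:

> pt(1.729133,19)     #cumulative probability up to t=1.729133
[1] 0.95
> 1-pt(1.729133,19)   #the upper-tail probability
[1] 0.05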
These functions are quite handy, and will be used in the following sections.
Chi-squared test
A Chi-squared test is used to test for independence among categorical variables.
For example, we might be interested in whether colour blindness is independent
of gender. If males are more likely to be colour blind than females, then we
expect that the relative frequency of people with colour blindness would not be
independent of gender. The null hypothesis for the Chi-squared test is that the
categorical variables are independent of one another.
HO: Categorical variables are independent
HA: Categorical variables are not independent
For our example with colour blindness a Chi-squared test hypothesis would be:
HO: There is no difference in the degree of colour blindness between males
and females
HA: There is a difference in the degree of colour blindness between males
and females
Figure 3.1 Test scores in the qt(), qf() and qchisq()
functions. The blue area shows the cumulative probability
from the left up to the T-crit threshold. The green
hatched areas are the critical p-value thresholds.
For a fully worked example, let’s look at some hypothetical data on an invasive
species, Bythotrephes longimanus. This invasive zooplankton, commonly called
the spiny water flea, entered the Great Lakes region in the mid 1980s, and caused
a dramatic change in zooplankton composition in many lakes. The main
mechanism of spread among inland lakes is fishing nets and boats moving between invaded and non-invaded lakes. One area where Bythotrephes invasion has been studied in detail is the 'cottage country' region of the Muskokas in
southern Ontario. The following table shows the number of lakes with and
without cottages, as well as the state of Bythotrephes invasion in the lake (not
invaded, invaded but not abundant, and invaded and dominant).
                Not invaded   Invaded, but not abundant   Invaded and abundant   Row total
Cottages             25                  60                        65                150
No Cottages          40                   7                         3                 50
Column total         65                  67                        68                200
The statistical hypotheses are:
HO: Presence of cottages and lake invasion status are independent
HA: Presence of cottages and lake invasion status are not independent
The chisq.test() function in R can be used to do a Chi-squared test. The first step
is to get the data into R. This can be done by creating a .csv file and importing the
data as was done in Tutorial 1 or, as we illustrate here, we can create the data
directly in R. Begin by creating the data vectors
> cottage=c(25,60,65)
> no.cottage=c(40,7,3)
> dat=data.frame('C'=cottage,'NC'=no.cottage)
It is a good idea to have a look at your data frame to make sure the variables are
in the right order:
> dat
C NC
1 25 40
2 60 7
3 65 3
The Chi-squared test is done by typing
> chisq.test(dat)

        Pearson's Chi-squared test

data:  dat
X-squared = 69.2218, df = 2, p-value = 9.304e-16
The output shows you:
• The type of test: Pearson's Chi-squared test
• The Chi-squared value: X-squared = 69.2218
• degrees of freedom: df = 2
• and p-value: p-value = 9.304e-16
Since p<0.05, we reject our null hypothesis that Bythotrephes longimanus abundance in Muskoka lakes is independent of the presence or absence of cottages.
Equivalently, we can do the same test by comparing the observed versus critical
test statistics. The observed χ2 value is given in the output of the chisq.test()
function, and the critical χ2 value can be found using qchisq() with a confidence
value of 0.95 (1-α) and degrees of freedom 2.
> qchisq(0.95,2)
[1] 5.991465
Since the observed χ2 value (69.2) exceeds the critical χ2 value (5.99) the null
hypothesis is rejected.
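To see the expected counts that the observed counts are compared against, they can be extracted from the chisq.test() output (a sketch; the values follow from the row and column totals):

> chisq.test(dat)$expected
      C    NC
1 48.75 16.25
2 50.25 16.75
3 51.00 17.00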
One-sample t-test
A one-sample t-test compares the mean of an observed set of data to a known
value. This test assumes that the observed data is normally distributed. To
demonstrate a one-sample t-test in R, let's look at some hypothetical data from a
fish farm that raises trout. As a measure of stress in adult fish, the farm
monitors the number of eggs produced per female to ensure that the fish are
under optimal conditions for reproduction. The procedure is to randomly select
ten fish during each egg harvest, and count the total eggs per fish. Experience
shows that the minimum number of eggs from a non-stressed fish is 1100. The
following table shows the egg count from the most recent harvest.
Fish ID    Eggs/fish
F1             778
F2            1367
F3             947
F4            1002
F5             521
F6             656
F7            1082
F8            1144
F9             735
F10           1283
To determine if this sample is different from the minimum threshold, a one-sample t-test can be used. Since the data set is small, it is easiest to enter the data
directly into R
> eggs=c(778,1367,947,1002,521,656,1082,1144,735,1283)
The generalized linear model function glm() is used to compute the observed t score
and p value for the data. In order to obtain the full output from this function, the
summary() function will also be needed. This is done by first assigning the results
of the glm() function to a variable using
> my.fit=glm(...)
where the "..." are the arguments used for a particular test. To run a one-sample t-test, we write the formula as

y - μ ~ 1
where y is your data set, and μ is the known value you are comparing the data
against. The ~1 tells the program that you do not have any categories. Putting it
all together, the R code for a one-sample t-test is
> my.fit=glm(eggs-1100~1)
A summary is created by typing
> summary(my.fit)
Call:
glm(formula = eggs - 1100 ~ 1)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-430.5  -205.8    23.0   177.0   415.5

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -148.50      87.48  -1.697    0.124

(Dispersion parameter for gaussian family taken to be 76534.94)

    Null deviance: 688814  on 9  degrees of freedom
Residual deviance: 688814  on 9  degrees of freedom
AIC: 143.78

Number of Fisher Scoring iterations: 2
The key values to look for in this output are the degrees of freedom, observed t
value, and the p value. In this case the degrees of freedom is 9, t-value is -1.697,
and the p-value is 0.124. To interpret this further, we need to decide if the test is
a one-tailed or two-tailed test. If the statistical hypotheses are:
HO: There is no difference between the mean number of eggs per fish in the
sample and the threshold of 1100
HA: There is a difference between the mean number of eggs per fish in the
sample and the minimum threshold of 1100.
then it is a two-tailed test because we are not concerned with the direction of the
difference. In this case, we would fail to reject the null hypothesis because
p>0.05. The same result is obtained by using the qt() function and comparing the
observed and critical test statistic.
> qt(0.975,9)
[1] 2.262157
For a two-tailed test, the absolute value of the critical test statistic is t.crit=2.262, which is greater than the absolute value of the observed statistic (t.obs=-1.697). The intent of sampling the
fish, however, is to evaluate whether the eggs per fish are less than the minimum
threshold, so a one-tailed hypothesis would be more appropriate. The statistical
hypotheses would be:
HO: The mean number of eggs per fish in the sample is not less than the threshold of 1100
HA: The mean number of eggs per fish in the sample is less than the threshold of 1100
A one-tailed test can still be done using the output from the glm() function by
dividing the stated p-value in half. The mean number of eggs per fish is less than
the threshold of 1100 as shown by the negative sign in the estimate (-148.50), so
the sign of the observed test statistic is negative (t.obs=-1.697). The one-sided p-value is p=0.062, which is greater than 0.05, so we fail to reject the null
hypothesis. Thus, we can conclude that the observed eggs per fish are not
reflective of stressed fish.
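As an aside, R also has a dedicated t.test() function that runs the same one-tailed test in a single call; it is not the glm() approach used in these tutorials, but it reports the same t value and one-sided p-value:

> t.test(eggs, mu=1100, alternative='less')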
Paired two-sample t-tests
A paired t-test can be used to determine if there is a difference between two groups. This is tested using the null hypothesis that there is no difference in the mean between two treatments. Before-and-after studies commonly use paired t-tests to show how a specific population changes with the application of some treatment. For example, changes in a biological population before and after a dramatic event can be used to determine if the event has caused a significant change to the population. The general hypothesis for this type of test is:
HO: There is no difference between the two groups
HA: There is a difference between the two groups
An application of a paired t-test would be to determine if the densities of coral species before and after the widespread bleaching event in 1998 are statistically different. Coral density was monitored on a reef before and after the event, giving paired observations of the 1997 and 1999 coral species densities. Observations from a study on a nearby reef categorized coral species into
Observations from a study on a nearby reef categorized coral species into
‘Winners’ and ‘Losers’ based on their abilities to survive bleaching events (Loya et
al. 2001).
Species                        1997 (indv/m2)   1999 (indv/m2)   Difference
Winners
Porites lutea                       18.1             33.1            15.0
Porites lobata                      21.3             39.4            18.1
Leptastrea transvera                16.0             36.1            20.1
Goniastera aspera                   21.6             42.0            20.4
Goniastrea pectinata                21.0             33.2            12.2
Leptastrea purpurea                 20.3             43.6            23.3
Platrygra ryukuenis                 25.2             39.4            14.2
Porites rus                         22.0             39.1            17.1
Favites halicora                    22.9             37.8            14.9
Favia favus                         23.5             45.1            21.6
Losers
Millepora intricata                 24.0              1.2           -22.8
Millepora dichotoma                 26.1              0.8           -25.3
Acropora digitifera                 18.1              1.7           -16.4
Porites attenuata                   24.2             12.1           -12.1
Porites sillimaniani                21.0              9.7           -11.3
Stylophora pistillata               19.2             10.1            -9.1
Porites cylindrica                  24.0              4.8           -19.2
Montipora aequituberculata          21.8              4.1           -17.7
Porites nigrescens                  19.2             10.2            -9.0
Pocillopora damicornis              19.8              5.2           -14.6
Millepora platphylla                26.7              6.3           -20.4
Porites aranetai                    23.1              8.1           -15.0
Porites horizontalata               25.0              4.1           -20.9
Seriatopora hystrix                 14.8              2.3           -12.5
You can create a csv file of the data, or enter the data directly into R. If the data are entered directly into R, simply make two vectors for the 1997 and 1999 data, and a third for the difference between the pre- and post-bleaching values, as follows:
> pre=c(18.1, 21.3, 16, 21.6, 21, 20.3, 25.2, 22, 22.9, 23.5,
24, 26.1, 18.1, 24.2, 21, 19.2, 24, 21.8, 19.2, 19.8, 26.7,
23.1, 25, 14.8)
> post=c(33.1, 39.4, 36.1, 42, 33.2, 43.6, 39.4, 39.1, 37.8,
45.1, 1.2, 0.8, 1.7, 12.1, 9.7, 10.1, 4.8, 4.1, 10.2, 5.2,
6.3, 8.1, 4.1, 2.3)
Create a difference vector as follows:
> diff=post-pre
> diff
[1] 15.0 18.1 20.1 20.4 12.2 23.3 14.2 17.1
[9] 14.9 21.6 -22.8 -25.3 -16.4 -12.1 -11.3 -9.1
[17] -19.2 -17.7 -9.0 -14.6 -20.4 -15.0 -20.9 -12.5
Alternatively, it may be easier to load the csv file

> my.dat=read.csv(file.choose())
With the difference between the pre and post bleaching calculated, the t-test is
treated the same way as a one-sample t-test. In this example we assume that the
difference between the pairs is zero for the null hypotheses, but this is not always
the case.
HO: μd=0
HA: μd≠0
As before, the glm() and summary() functions provide R tools for the test. In this
example, we will name the model ‘pt’ for paired t-test. In the one-sample t-test
from the previous section, the model formula was y-μ~1, where μ is the
population parameter that the data are being tested against. Here, μ=0 so the
formula is y~1.
> pt=glm(diff~1)
> summary(pt)

Call:
glm(formula = diff ~ 1)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-23.242  -14.667   -8.142   17.583   25.358

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -2.058      3.597  -0.572    0.573

(Dispersion parameter for gaussian family taken to be 310.5025)

    Null deviance: 7141.6  on 23  degrees of freedom
Residual deviance: 7141.6  on 23  degrees of freedom
AIC: 208.80

Number of Fisher Scoring iterations: 2
The summary shows p>0.05, so we fail to reject the null hypothesis. A histogram of the data (Figure 3.2) reveals that there have been changes in most species of coral, but not always in the same direction. Try

> hist(diff)

Figure 3.2 Histogram of pre and post coral densities.
Recall that we expect some species to do better than others (i.e., 'winners' and
‘losers’). To tease apart the patterns in more depth, let’s consider just the ‘losers’
data.
> loser.post=c(1.2, 0.8, 1.7, 12.1, 9.7, 10.1, 4.8, 4.1, 10.2,
5.2, 6.3, 8.1, 4.1, 2.3)
> loser.pre=c(24, 26.1, 18.1, 24.2, 21, 19.2, 24, 21.8, 19.2,
19.8, 26.7, 23.1, 25, 14.8)
> diff.loser=loser.post-loser.pre
> diff.loser
 [1] -22.8 -25.3 -16.4 -12.1 -11.3  -9.1 -19.2 -17.7  -9.0 -14.6
[11] -20.4 -15.0 -20.9 -12.5
Then rerun the analysis
> fit.loser=glm(diff.loser~1)
> summary(fit.loser)

Call:
glm(formula = diff.loser ~ 1)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-9.1357  -3.9357   0.4643   3.9643   7.1643

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -16.164      1.363  -11.86 2.41e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 26.01016)

    Null deviance: 338.13  on 13  degrees of freedom
Residual deviance: 338.13  on 13  degrees of freedom
AIC: 88.312

Number of Fisher Scoring iterations: 2
If only the 'loser' species of corals are analyzed, we find that p<0.05 and reject the null hypothesis that the difference in density is zero. Similar to the one-sample t-test, a critical t-score can be determined using qt(). For the two-tailed test with 13 degrees of freedom:

> qt(0.975,13)
[1] 2.160369

Here the absolute value of the observed statistic (|t.obs|=11.86) exceeds t.crit=2.16, again indicating a significant difference in the density of coral before and after the bleaching event. To take this one step farther, we could evaluate the hypothesis that there has been a decrease in coral density after bleaching

HO: μd≥0
HA: μd<0

which is a one-tailed test. Since the p-value reported in the glm summary is for a two-tailed test, we need to use the qt() function to find the correct critical t-score as follows:

> qt(0.95,13)
[1] 1.770933

Since the observed statistic falls beyond the critical value in the left tail (t.obs=-11.86 is less than -t.crit=-1.77), we reject the null hypothesis and conclude that the coral density of these particular species has decreased since the 1998 bleaching event.
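As with the one-sample case, the same one-tailed paired test can be run in a single call with R's t.test() function (an aside; these tutorials use the glm() approach):

> t.test(loser.post, loser.pre, paired=TRUE, alternative='less')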
Loya, Y., Sakai, K., Yamazato, K., and van Woesik, R. 2001. Coral bleaching: the winners and the losers. Ecology Letters 4: 122-131.
Independent two-sample t-tests
Independent two-sample t-tests are used to evaluate whether two populations have different means, and are an invaluable tool for differentiating between the outcomes of two trials. For example, imagine that you started a new job working for an engineering firm as an aquatic biologist and risk assessment expert. The company is planning to dam a large river, which will cause a decrease in the water flow during the spawning season for trout. The change in water flow might decrease the amount of oxygen flowing over the eggs, and thereby have an impact on the fish population. As the biologist in the group, you have been given access to a flow tunnel to see if the fish eggs will be able to survive under this new flow regime. You have taken many eggs from the same fish stock and used 100 eggs for each trial. After 14 replicate trials at each flow speed, your preliminary data on the number of eggs that hatched are as follows:
Trial    1   2   3   4   5   6   7   8   9  10  11  12  13  14
Normal  78  72  88  80  73  81  62  76  73  90  92  76  71  74
Slow    58  57  45  56  66  60  49  51  52  65  50  48  58  57
To test if these two treatments have a different impact on hatching success, you
can use an independent two-sample t-test. The first step in performing a two-sample independent t-test in R is to input the data in the correct form.
Specifically, one column must contain the data and the other column a coding
variable indicating the treatment for each trial (i.e., in stacked form). Start by
entering the data in two vectors. The data vector is
> egg.count=c(78, 72, 88, 80, 73, 81, 62, 76, 73, 90, 92, 76,
71, 74, 58, 57, 45, 56, 66, 60, 49, 51, 52, 65, 50, 48, 58,
57)
To produce the categorical vector a code for normal and slow water movement
can be used. Let N, and S denote normal and slow respectively
> trial=c('N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N',
'N', 'N', 'N', 'N', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S',
'S', 'S', 'S', 'S', 'S', 'S')
Then load these vectors into a single data frame
> my.data=data.frame(egg.count,trial)
The data set should look like
> my.data
   egg.count trial
1         78     N
2         72     N
3         88     N
4         80     N
5         73     N
6         81     N
7         62     N
8         76     N
9         73     N
10        90     N
11        92     N
12        76     N
13        71     N
14        74     N
15        58     S
16        57     S
17        45     S
18        56     S
19        66     S
20        60     S
21        49     S
22        51     S
23        52     S
24        65     S
25        50     S
26        48     S
27        58     S
28        57     S
With the data set up in this fashion, the glm() function can be used for the test by including the categorical variable as a predictor (independent) variable. The formula is y~x, where y is the full data set, and x is the categorical vector that indicates what group the data belong to.
> fit.ind=glm(egg.count~trial, data=my.data)
The data option in glm indicates to R where to look for the data
> summary(fit.ind)

Call:
glm(formula = egg.count ~ trial)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-15.5714   -4.7143   -0.5714    3.0000   14.4286

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   77.571      1.942  39.939  < 2e-16 ***
trialS       -22.429      2.747  -8.165 1.20e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 52.81319)

    Null deviance: 4894.4  on 27  degrees of freedom
Residual deviance: 1373.1  on 26  degrees of freedom
AIC: 194.45

Number of Fisher Scoring iterations: 2
The output from this type of t-test displays two sets of t and p-values: '(Intercept)' and 'trialS'. Figure 3.3 illustrates the meaning of each. The 'Intercept' is the estimated mean for one treatment, and the 'trialS' row (which will have a different label depending on the glm formula used) is the difference between the groups. The second set of values should be used here because it reflects the treatment effect. For all other t-tests, the intercept is used because you are asking if the mean value is different from a given value in your hypothesis.
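You can confirm this interpretation by computing the group means directly with the tapply() function, which applies a function (here, mean) to each group of a vector:

> tapply(egg.count, trial, mean)
       N        S
77.57143 55.14286

The intercept (77.571) is the mean of the normal-flow group, and adding the trial coefficient (-22.429) gives the mean of the slow-flow group (55.143).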
Figure 3.3 Determining which values to use from
summary output. Demonstrating the difference between
intercept and trial values from the above output. For
one sample and two sample paired outputs the intercept
values can be used. For two sample independent t-test
the second set of values (in this case called Trial) must
be used.
From the summary, p<0.05 so we reject the null hypothesis and conclude that
stream flow changes the mean hatching rate of fish eggs.
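The same conclusion can be reached by comparing the observed t-score against the
critical t-score from qt(). A quick check, using the 26 residual degrees of freedom
reported in the summary:

> qt(1-0.05/2, df=26)   # critical t-score, about 2.06

Since the observed absolute t-score of 8.165 is larger than 2.06, we again reject
the null hypothesis.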
Function List
Here is a list of the functions and options covered in Tutorial 3; see the quick
Reference Card for more details.
Functions: qt( ), qf( ), qchisq( ), glm( ), summary( ).
Tutorial 4: Regression and correlation in R
B. Wiltse
This tutorial covers linear regression, the assumptions and appropriate
significance tests associated with it, and correlation.
The equation for a linear regression is given by,
Y = a + bX
where Y is the response variable (dependent), X is the explanatory variable
(independent), and the coefficients a and b are the intercept and slope
respectively. We use linear regression to predict the response variable (Y) based
on the explanatory variable (X). The regression coefficients (a and b) are
calculated from your data using the method of least squares (see your textbook for
more details). Here we can use the glm() function in R, which stands for
Generalized Linear Model, to fit the model and to do hypothesis testing on
regression parameters.
To illustrate linear regression, we can use an example data set already in R called
‘cars’. To see the data type:
> cars
You should see two columns, speed and dist, which give the speed the car is
traveling at the time of braking and the distance the car travels before coming to
a stop. To reduce clutter in the regression statement code, we can assign these
columns to new vector names:
> speed=cars$speed
> dist=cars$dist
Fitting the linear regression
The linear regression is entered into the glm() function in the format Y~X, where
Y is the dependent variable, X is the independent variable, and the tilde means
“given by.” For the cars data set, we want to predict the braking distance given
by the speed of the car, so the code is input as follows:
> glm(dist~speed)
which produces the following output:
Call:  glm(formula = dist ~ speed)

Coefficients:
(Intercept)        speed
    -17.579        3.932

Degrees of Freedom: 49 Total (i.e. Null);  48 Residual
Null Deviance:     32540
Residual Deviance: 11350    AIC: 419.2
To reduce clutter in the code, it first helps to save the fitted statistical model (i.e.,
the glm object) to a variable. For example, we can type
> my.fit=glm(dist~speed)
The basic output of glm() is rather short, but there is a lot more going on behind
the scenes. The output of glm(), as with many R functions, is an object that
contains a range of information about the statistical model. We can use extractor
functions to pull information out of the object. A list of the components stored in
the object can be seen by using the names() function.
> names(my.fit)
 [1] "coefficients"      "residuals"         "fitted.values"
 [4] "effects"           "R"                 "rank"
 [7] "qr"                "family"            "linear.predictors"
[10] "deviance"          "aic"               "null.deviance"
[13] "iter"              "weights"           "prior.weights"
[16] "df.residual"       "df.null"           "y"
[19] "converged"         "boundary"          "model"
[22] "call"              "formula"           "terms"
[25] "data"              "offset"            "control"
[28] "method"            "contrasts"         "xlevels"
We can use extractor functions on the fit object to gather more information
about the model. To do this we simply include the fit object inside the extractor
function. For example,
> formula(my.fit)
dist ~ speed
gives us the formula used in our model.
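Other pieces of the model can be pulled out in the same way. For example (a
sketch using standard base-R accessors; either form gives the same values):

> coef(my.fit)          # the estimated coefficients (a and b)
> my.fit$coefficients   # the same values, accessed with the $ operator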
A basic extractor function that can be used on almost all R objects is the
summary() function, which gives a list of the more pertinent information about
that object. We will use a slightly modified version of this function that gives
information particularly useful for linear regression, which is summary.lm(). If
you are curious, you can run both summary() and summary.lm() on the glm
object to see the difference, but we will use summary.lm() as it provides more
relevant information for linear regression. Let’s go ahead and run this on our
model.
> summary.lm(my.fit)
1) Call:
glm(formula = dist ~ speed)

2) Residuals:
     Min       1Q   Median       3Q      Max
 -29.069   -9.525   -2.272    9.215   43.201

3) Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791     6.7584  -2.601   0.0123 *
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

4) Residual standard error: 15.38 on 48 degrees of freedom
5) Multiple R-squared: 0.6511,    Adjusted R-squared: 0.6438
6) F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.490e-12
There is a lot of information in the output, and it is useful to discuss each piece
individually:
1. The first lines indicate the formula and variables used in the glm() function,
which is handy when you save the output and come back to it at a later time.
Call:
glm(formula = dist ~ speed)
2. The next lines show the distribution of residuals in the form of quantiles,
which gives a qualitative sense of the residual characteristics.
Residuals:
     Min       1Q   Median       3Q      Max
 -29.069   -9.525   -2.272    9.215   43.201
3. The next section shows the estimated linear regression coefficients, and
one-sample tests of hypothesis for each.
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791     6.7584  -2.601   0.0123 *
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The first column shows each parameter in the model (i.e., Intercept, slope),
which will vary depending on what terms were in the original model. The next
column gives the estimate for each parameter. In this example, the estimated
intercept is a = -17.58, and the estimated slope for the speed covariate is b =
3.93. We are often interested in whether the slope of the regression line (b) or
intercept (a) is significantly different from zero, which can be evaluated using a
one-sample t-test. The next two columns show the standard error of the
estimate for each coefficient, as well as the observed t-score. The t-test can be
done by comparing the observed t-score against the critical t-score (using the
qt() function). For this example, the critical t-score for the slope parameter is
2.01 (df=48, two-tailed, α=0.05), which is less than the observed absolute value
of 9.464, so we reject the null hypothesis that the slope is equal to zero.
Equivalently, we can conduct the test using the p-value, which is shown in the
last column (assumes a two-tailed test). Since the p-value is less than 0.05, we
come to the same conclusion. R includes a series of graphical significance
codes next to the p-values to help give you a quick assessment of the
significance of your regression parameters.
4. The final section of the output provides information that is useful for Analysis
of Variance and correlation. The top line in this section gives residual standard
error, which is a measure of the variation in the observations around the fitted
line. The smaller this number the closer the observed values are to the fitted
line.
Residual standard error: 15.38 on 48 degrees of freedom
5. The next line is the R2 value of the regression, which you can think of as the
% variance explained. One problem with R-squared values is that they tend to
be inflated as predictor variables are added and by small sample sizes, so if
we intend to compare R2 values between regressions we need to take this into
account. The adjusted R2 value does this, and is therefore a more accurate
estimate of the % variance explained. If our X values perfectly predicted our Y
values, then our R2 value would be 1.0. In our case, we see that about 64% of
the variation in braking distance is explained by the speed of the car.
Multiple R-squared: 0.6511,    Adjusted R-squared: 0.6438
6. The final line shows the F-statistic for the test of whether the ratio of the
explained variance over the unexplained variance is different from one. The
F-test in a linear regression is equivalent to a t-test of whether the slope is
different from zero, but this is not general to other types of statistical models.

F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.490e-12
Using the qf() function, the critical F-score is about 4.04. Since the observed
F-score is greater than the critical F-score, we reject the null hypothesis that the
ratio is equal to one. Equivalently, we could perform the test using the p-value
provided (see the quick check below).
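As a quick check of the critical values quoted in points 3 and 6, both can be
computed directly with qt() and qf():

> qt(1-0.05/2, df=48)       # critical t-score for the slope, about 2.01
> qf(1-0.05, df1=1, df2=48) # critical F-score, about 4.04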
Plotting the fit linear regression
After fitting the linear regression using glm(), plot the raw data and fit line to see
if the fit statistical model makes sense. Begin by plotting the raw data:
> plot(speed, dist, xlab="Speed (mph)", ylab="Braking Distance (feet)")
The fit linear regression can be added using the abline() function, which plots a
line with the intercept (a) and slope (b) from the fit glm object.
> abline(my.fit)
You should see the plot shown in Figure 4.1, which suggests that a linear
regression fits the data fairly well.
Figure 4.1 Scatterplot (points) and linear regression
(line) of stopping distance by speed.
Evaluating the assumptions of linear regression
There are four key assumptions that need to be met before the results of a linear
regression can be trusted. They are: 1) the relationship between the independent
and dependent variables is linear, 2) the residuals are Normally distributed, 3) the
residuals have equal variance across the range of the independent variable
(homoscedasticity), and 4) the residuals are independent.
Evaluating the assumptions of linear regression begins with a qualitative
assessment using two kinds of plots: residual plots and histograms. A plot of the
residuals (y-axis) against the predicted values (x-axis) allows you to visually assess
the assumptions of linearity and homoscedasticity. Try the fitted() function
> fitted(my.fit)
to see the predicted value for each value of the independent variable (speed in
this example), and the resid() function
> resid(my.fit)
to see the residuals. To plot the residuals against the predicted values, type:
> plot(fitted(my.fit),resid(my.fit))
The abline() function can be used to add a horizontal line that helps visualize
trends.
> abline(0,0)
The resulting plot (Figure 4.2) suggests that a linear relationship is appropriate,
but there is a small trend of increasing residual variance as the predicted values
increase. A histogram of the residuals is created using the hist() function:
> hist(resid(my.fit))

Figure 4.3 Histogram of the residuals.
In our example, the histogram (Figure 4.3) suggests that the residuals may be
slightly skewed. Plots of the raw data, fit line, and residuals provide a qualitative
overview of the linear regression. All four assumptions, however, can be
addressed more formally with the gvlma() function. The gvlma() function is not
part of the base R software and must first be installed using
> install.packages("gvlma")
Load the gvlma library (this needs to be done every time you start R) using
> library(gvlma)
To run the gvlma() function on our model, type:
> gvlma(lm(my.fit))
The lm() function is important here because it converts the glm object into an lm
object, as required by the gvlma package.
Figure 4.2 Plot of the residuals vs the fitted values.
The output is
Call:
lm(formula = my.fit)

Coefficients:
(Intercept)        speed
    -17.579        3.932

ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05

Call:
gvlma(x = lm(my.fit))

                     Value  p-value                   Decision
Global Stat         15.801 0.003298 Assumptions NOT satisfied!
Skewness             6.528 0.010621 Assumptions NOT satisfied!
Kurtosis             1.661 0.197449    Assumptions acceptable.
Link Function        2.329 0.126998    Assumptions acceptable.
Heteroscedasticity   5.283 0.021530 Assumptions NOT satisfied!
The first part of the output shows the basic regression parameters and the level
of significance used to evaluate the assumptions. The latter part of the output
shows the results of the assumption tests. The ‘Global Stat’ test statistic is a
combination of the four component tests: ‘Skewness’, ‘Kurtosis’, ‘Link Function’,
and ‘Heteroscedasticity’. The component tests measure the effect of violating
the four assumptions. Violations of the assumption of a linear relationship
will influence ‘Skewness’ and ‘Kurtosis’; violations of Normally distributed
residuals will influence ‘Kurtosis’ and ‘Link Function’; and violations of
independence and homoscedasticity will influence ‘Heteroscedasticity’. For our
example, the ‘Global Stat’ indicates a violation of the assumptions through
skewness and heteroscedasticity, which confirms our earlier graphical assessment
(Figure 4.2). Thus, a transformation or a modified statistical model is required
before we can draw conclusions about the data set.
Correlation
Correlation measures the tendency of two dependent variables to change together
(i.e., when one increases the other increases, or when one decreases the other
decreases). The Pearson’s correlation coefficient is used to estimate this tendency,
which is given by
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / √[ Σ(xᵢ − x̄)² Σ(yᵢ − ȳ)² ]
The correlation coefficient (r) varies from -1 to 1, where the absolute value of the
coefficient represents the strength of the correlation while the sign represents the
direction. A positive correlation means that when one variable either increases or
decreases the other variable does the same, a negative correlation means that
when one variable either increases or decreases the other does the opposite. The
Pearson correlation coefficient can be obtained from the glm() function by taking
the square root of the multiple R-squared value given in the regression summary.
In our example, the Multiple R-squared is 0.6511, which yields a correlation
coefficient of r=0.807.
> sqrt(0.6511)
[1] 0.8069077
In practice, it is often easier to use the cor.test() function directly than to
pull the correlation from the glm output (see the sketch below).
We can test the null hypothesis that the correlation equals zero by looking at the
p-value for the slope parameter of the regression. The reason we can do this is
that the correlation coefficient can be converted to a t-distributed variable, at
which point the outcome of a t-test on the correlation coefficient is the same as
that of the test on the slope coefficient.
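A minimal sketch of the direct approach with cor.test(), which reports the
correlation coefficient together with the t-score, degrees of freedom, and p-value
in one step:

> cor.test(speed, dist)

For the cars data this gives the same t-score and p-value as the slope test in
the regression summary (t = 9.464 on 48 df, p = 1.49e-12), along with r = 0.807.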
Function List
Here is a list of the functions and options covered in Tutorial 4; see the quick
Reference Card for more details.
Functions: glm( ), names( ), summary.lm( ), abline( ), resid( ), gvlma( ).
Tutorial 5: One-Factor ANOVA in R
W.A. Nelson
Analysis of variance (ANOVA) is similar to linear regression, but uses categorical
rather than numerical variables as independent variables. In fact, there is no
difference in the machinery used to fit ANOVA models and linear regressions, so
we can use the now familiar glm() function. The only difference we need to worry
about in R is how the data frame is created, and some nuances of hypothesis
testing. A single-factor ANOVA means that we are interested in one factor (e.g.,
nutrient concentrations), but there is more than one level in the factor (e.g., 0μM,
1.2μM, 3.2μM, and 5μM of Nitrogen).
Creating data frames for ANOVA
Similar to a two-sample t-test (which is a one factor ANOVA with two levels), the
data must be in stacked form with one column as the response variable
(dependent variable) and one or more columns as ‘coded’ categorical variables
(independent variables). Consider the following example of the influence that
substrate type has on the growth rates of benthic algae. The algae was relocated
from a common substrate (pebbles) to each of the four test substrate types. The
area of living algae was used to calculate the per-capita growth rate. Each column
is a different substrate and each row is a replicate.
Sand    Silt    Pebbles   Glass
1.45    1.24    2.24      1.18
0.76    1.93    3.71      0.59
1.11    1.96    2.92      0.52
1.71    2.20    3.01     -0.74
0.97    3.93    6.33     -0.99
To create the data set, first create the data vector and the categorical vector
> growth=c(1.45, 0.76, 1.11, 1.71, 0.97, 1.24, 1.93, 1.96,
2.20, 3.93, 2.24, 3.71, 2.92, 3.01, 6.33, 1.18, 0.59, 0.52,
-0.74, -0.99)
> substrate=c('sand', 'sand', 'sand', 'sand', 'sand', 'silt',
'silt', 'silt', 'silt', 'silt', 'pebbles', 'pebbles',
'pebbles', 'pebbles', 'pebbles', 'glass', 'glass', 'glass',
'glass', 'glass')
and combine these into a data frame
> my.data=data.frame('growth'=growth, 'substrate'=substrate)
Type my.data to see the contents of the data frame, and compare this with the
above table.
   growth substrate
1    1.45      sand
2    0.76      sand
3    1.11      sand
4    1.71      sand
5    0.97      sand
6    1.24      silt
7    1.93      silt
8    1.96      silt
9    2.20      silt
10   3.93      silt
11   2.24   pebbles
12   3.71   pebbles
13   2.92   pebbles
14   3.01   pebbles
15   6.33   pebbles
16   1.18     glass
17   0.59     glass
18   0.52     glass
19  -0.74     glass
20  -0.99     glass
To get an impression of the influence of substrate type on algal growth rates,
begin by plotting the data with the boxplot() function (Figure 5.1):

> boxplot(growth~substrate, ylab='Algal Growth Rate (/day)',
xlab='Substrate Type')
The plot suggests that algae have higher growth rates on pebbles, and lower on
glass. Let’s run the model to see whether these trends are statistically significant.
Fitting the ANOVA model
The ANOVA model is fit using the glm() function just as was done for linear
regression, but the X variable is now categorical. This difference is dealt with
‘behind the scenes’ in R, so we do not need to make any changes to the formula.
> my.fit=glm(growth~substrate, data=my.data)

Figure 5.1 Box plot of algal growth rate by substrate.
The data=my.data option tells R where to look for the variables. The output is
the same as for a linear regression:
> summary.lm(my.fit)

Call:
glm(formula = growth ~ substrate, data = my.data)

Residuals:
    Min      1Q  Median      3Q     Max
-1.4020 -0.6545 -0.1600  0.4255  2.6880

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
1) (Intercept)        0.1120     0.4770   0.235  0.81733
2) substratepebbles   3.5300     0.6746   5.233  8.2e-05 ***
3) substratesand      1.0880     0.6746   1.613  0.12631
4) substratesilt      2.1400     0.6746   3.172  0.00591 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.067 on 16 degrees of freedom
Multiple R-squared: 0.6516,    Adjusted R-squared: 0.5862
F-statistic: 9.973 on 3 and 16 DF,  p-value: 0.0006026
As with the independent t-test, the statistical tests presented are all relative to one
factor level (Figure 3.3). In this example, they are all relative to the ‘glass’
treatment, which is shown by the fact that ‘glass’ is missing from all of the term
labels (e.g., substratepebbles). Let’s go through the output line by line.
1. The first line is labeled (Intercept), which in this example is the glass substrate
(analogous to the intercept in an independent t-test, see Figure 3.3). The value
under the ‘Estimate’ heading is the least squares means (LS means) estimate
from the fit model. The LS mean per-capita algal growth rate is 0.1120 per day.
The t-test tests the hypothesis that the estimate is different from zero. It is a
two-tailed test, and may not always be the one you want to evaluate. In this
example, p>0.05, so we fail to reject the null hypothesis that the growth rate is
zero.
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.1120     0.4770   0.235  0.81733
2. The second line is the first of the treatment levels relative to the glass
treatment. The LS mean algal growth rate is 3.53 per day more than glass,
which means the actual growth rate estimate is 3.64 per day. The t-test tests
the hypothesis that the difference in algal growth rate relative to the glass is
not zero. Since p<0.05, we can conclude that the pebbles substrate had an
influence on growth relative to glass.
substratepebbles   3.5300     0.6746   5.233  8.2e-05 ***
3. The third line is the next treatment level relative to the glass treatment. The LS
mean algal growth rate is 1.088 per day more than glass, which means the
actual growth rate estimate is 1.2 per day. Since p>0.05, we fail to reject the
null hypothesis that the difference in growth rate is not different from zero.
substratesand      1.0880     0.6746   1.613  0.12631
4. The last line is the final treatment level relative to the glass treatment. The LS
mean algal growth rate is 2.14 per day more than glass, which means the
actual growth rate estimate is 2.25 per day. Since p<0.05, we can conclude that
the silt substrate had an influence on growth relative to glass.
substratesilt      2.1400     0.6746   3.172  0.00591 **
The remainder of the output is interpreted the same as for a linear regression (see
Tutorial 4 for details).
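As a check on the group-mean arithmetic above, the four means can be computed
directly. A sketch using the base-R aggregate() function, which is not otherwise
covered in these tutorials:

> aggregate(growth~substrate, data=my.data, FUN=mean)
  substrate growth
1     glass  0.112
2   pebbles  3.642
3      sand  1.200
4      silt  2.252

Each mean equals the ‘(Intercept)’ estimate plus the corresponding coefficient
from the summary output.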
The output from the summary.lm() function provides t-test evaluations of the
term against zero for each level. The F-table will let you evaluate the general
significance of the factor (substrate in this case) in the model. Use the aov()
function to get the F-table:
> summary(aov(my.fit))
            Df Sum Sq Mean Sq F value    Pr(>F)
substrate    3 34.033 11.3443  9.9726 0.0006026 ***
Residuals   16 18.201  1.1376
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The line of interest is the one labelled ‘substrate’, which is the hypothesis test for
the substrate factor. Since p<0.05, we conclude that the substrate had an
influence on the per-capita algal growth rate.
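As with the regression F-test, the observed F-score can also be compared against
the critical value from qf() (a quick check):

> qf(1-0.05, df1=3, df2=16)   # critical F-score, about 3.24

Since 9.9726 is greater than 3.24, we reach the same conclusion as with the
p-value.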
Figure 5.2 Plot of residuals against predicted values.
Evaluating the assumptions of ANOVA
The evaluation of assumptions for a one-factor ANOVA is similar to linear
regression. Begin by plotting the residuals. For ANOVAs, this is done by plotting
them against the fitted mean for each level (which are the LS means).
> plot(fitted(my.fit),resid(my.fit))
The variance appears to change across the substrate types (Figure 5.2), which we
will investigate quantitatively below. The histogram of raw residuals also appears
slightly skewed (Figure 5.3).
> hist(resid(my.fit))
Figure 5.3 Histogram of residuals.
A more formal test is done using the gvlma() function in the gvlma library (see
Tutorial 4 for details).
> library(gvlma)
> gvlma(lm(my.fit))
Call:
lm(formula = my.fit)

Coefficients:
     (Intercept)  substratepebbles     substratesand     substratesilt
           0.112             3.530             1.088             2.140
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05

Call:
gvlma(x = lm(my.fit))

                        Value p-value                   Decision
Global Stat         7.648e+00 0.10538    Assumptions acceptable.
Skewness            4.057e+00 0.04398 Assumptions NOT satisfied!
Kurtosis            1.131e+00 0.28748    Assumptions acceptable.
Link Function       1.599e-15 1.00000    Assumptions acceptable.
Heteroscedasticity  2.459e+00 0.11687    Assumptions acceptable.
The output of the gvlma test indicates that while the residuals are slightly
skewed, overall the assumptions are sufficiently well met.
Posteriori contrasts
The F-test indicated that the substrates had an impact on algal growth, and the
summary of the fit model (i.e., from summary.lm()) tested each level against the
glass substrate. However, since glass is not a control for this experiment, we are
interested in how all substrates compare to each other (e.g., pebbles versus sand).
To evaluate these hypotheses, we use a posteriori contrasts. One way to do this is
to use Tukey’s honest significant difference test, which compares all possible
pair-wise contrasts. In R, this is done using the TukeyHSD() function.
> TukeyHSD(aov(my.fit))
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = my.fit)

$substrate
               diff        lwr        upr     p adj
pebbles-glass  3.530  1.6000921  5.4599079 0.0004306
sand-glass     1.088 -0.8419079  3.0179079 0.3994762
silt-glass     2.140  0.2100921  4.0699079 0.0272350
sand-pebbles  -2.442 -4.3719079 -0.5120921 0.0110907
silt-pebbles  -1.390 -3.3199079  0.5399079 0.2078944
silt-sand      1.052 -0.8779079  2.9819079 0.4276917
The aov() function is used to pass the proper elements of the ANOVA fit to the
TukeyHSD() function. Each line of the output gives a test between two substrate
types. For example, the fourth line tests the hypothesis that the growth rate is not
different between the sand and pebble substrates. Since p<0.05, we conclude that
the per-capita growth rates are different between the two substrates.
Function List
Here is a list of the functions and options covered in Tutorial 5; see the quick
Reference Card for more details.
Functions: glm( ), TukeyHSD( ), aov( ).
Tutorial 6: Two-Factor ANOVA in R
W.A. Nelson
The previous tutorial considered ANOVAs with just a single factor. However, it is
often the case that you are interested in the influence of two factors (e.g.,
nutrients and temperature), and whether they interact with each other (e.g.,
high temperature amplifies the impact of nutrients). The overall analysis for two
factors is similar to the one-factor case, but they differ in a number of
details.
Creating data frames for ANOVA
Similar to a single-factor ANOVA, the data must be in stacked form with one
column as the response variable (dependent variable) and two columns as ‘coded’
categorical variables (independent variables). Consider the following example that
looks at the influence of pine tree sap on the growth of a wood fungus. Different
species of pine trees contain different concentrations of antifungal agents. The
following table shows growth rates (mm/day) of wood fungus in agar (which
provides a controlled medium to evaluate growth) containing sap from two
species in isolation, a mix of each, and a control (i.e., no tree sap). Each row is a
replicate.
Treatment A    Treatment B    Treatment C       Treatment D
Pinus taeda    Pinus pinea    Pinus taeda &     Control
                              Pinus pinea
3.08           3.22           2.31              4.99
3.52           2.42           2.51              5.21
2.61           2.48           2.51              5.03
2.36           2.45           2.15              5.29
2.66           2.58           2.09              5.02
3.00           2.67           2.30              4.06
Since the data set has observations for all combinations of presence/absence for
both species, it is a two-factor experiment as shown in the following table:
                          Pinus taeda
                      absent        present
Pinus pinea  absent   Treatment D   Treatment A
             present  Treatment B   Treatment C
To create the data set, first create the data vector and the two categorical
vectors.
> Growth=c(3.08, 3.52, 2.61, 2.36, 2.66, 3.00, 3.22, 2.42,
2.48, 2.45, 2.58, 2.67, 2.31, 2.51, 2.51, 2.15, 2.09, 2.30,
4.99, 5.21, 5.03, 5.29, 5.02, 4.06)
> P.taeda=c('yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'no',
'no', 'no', 'no', 'no', 'no', 'yes', 'yes', 'yes', 'yes',
'yes', 'yes', 'no', 'no', 'no', 'no', 'no', 'no')
> P.pinea=c('no', 'no', 'no', 'no', 'no', 'no', 'yes', 'yes',
'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes',
'yes', 'no', 'no', 'no', 'no', 'no', 'no')
Here we have used ‘yes’ to denote that the tree sap from a species was present,
and ‘no’ to denote that it was absent. Even though two of the vectors are
categorical, all three can be combined into a single data frame
> my.data=data.frame('Growth'=Growth, 'P.taeda'=P.taeda,
'P.pinea'=P.pinea)
Type my.data to see the contents of the data frame, and compare this with the
two-factor table shown above.
> my.data
   Growth P.taeda P.pinea
1    3.08     yes      no
2    3.52     yes      no
3    2.61     yes      no
4    2.36     yes      no
5    2.66     yes      no
6    3.00     yes      no
7    3.22      no     yes
8    2.42      no     yes
9    2.48      no     yes
10   2.45      no     yes
11   2.58      no     yes
12   2.67      no     yes
13   2.31     yes     yes
14   2.51     yes     yes
15   2.51     yes     yes
16   2.15     yes     yes
17   2.09     yes     yes
18   2.30     yes     yes
19   4.99      no      no
20   5.21      no      no
21   5.03      no      no
22   5.29      no      no
23   5.02      no      no
24   4.06      no      no
The new data frame has the same structure as we used for one-factor ANOVAs,
but with two categorical variables we can investigate a two-factor statistical
model, including testing for an interaction between the two species. Let’s begin
by plotting the data with the boxplot() function as before (Figure 6.1):

> boxplot(Growth~P.taeda+P.pinea, ylab='Fungal Growth (mm/day)',
xlab='Presence/Absence of Pine sap (P.taeda:P.pinea)')
It is easy to see that the fungal growth rate decreases with the presence of sap
from either species of pine, but it is more difficult to see if there is an interaction
between the species. To see the interaction more clearly, we can use the
interaction.plot(X1,X2,Y) function, where X1 and X2 are the categorical
predictor variables, and Y is the quantitative response variable. The
interaction plot shows the mean of each treatment cell for each level of the X1
variable on the x-axis. Each level of the X2 predictor variable is shown with a
different style of line, and a different number.
> interaction.plot(P.taeda,P.pinea,Growth, type='b')

Figure 6.2 Interaction plot of fungal growth by treatment.
The interaction plot (Figure 6.2) shows that the presence of sap from either
species of pine reduces the fungal growth rate, and that there is a negative
interaction between the species such that the presence of both is not as effective
as you would expect from the reduction in growth caused by the species in
isolation.
Fitting the ANOVA model
The ANOVA model is fit using the glm() function as before. For the two-factor
ANOVA, we must add both predictor variables to the formula, as well as their
interaction. This is done using the ‘+’ and ‘:’ symbols as y~X1+X2+X1:X2, where
the ‘+’ symbol separates the various independent variables. Writing a variable
name denotes a main effect, and the ‘:’ symbol denotes an interaction.
> my.fit=glm(Growth~P.taeda + P.pinea + P.taeda:P.pinea,
data=my.data)
The data=my.data option tells R where to look for the variables. The output is
the same as for a linear regression:
> summary.lm(my.fit)

Call:
glm(formula = Growth ~ P.taeda + P.pinea + P.taeda:P.pinea,
    data = my.data)

Residuals:
     Min       1Q   Median       3Q      Max
-0.87333 -0.19292  0.01583  0.19833  0.64833

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
1) (Intercept)             4.9333     0.1428  34.548  < 2e-16 ***
2) P.taedayes             -2.0617     0.2019 -10.209 2.23e-09 ***
3) P.pineayes             -2.2967     0.2019 -11.373 3.49e-10 ***
4) P.taedayes:P.pineayes   1.7367     0.2856   6.081 6.07e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3498 on 20 degrees of freedom
Multiple R-squared: 0.9118,    Adjusted R-squared: 0.8986
F-statistic: 68.96 on 3 and 20 DF,  p-value: 1.006e-10
The output reveals that the sap from both pine species, as well as the interaction,
are significantly different from the control. Let’s work through this line by line.
Figure 6.1 Box plot of fungal growth by treatment.
1. The first line is labeled (Intercept), which in this example is the control
treatment (analogous to the intercept in an independent t-test, see Figure 3.3).
The value under the ‘Estimate’ heading is the least squares means (LS means)
estimate from the fit model. The LS mean fungal growth rate is 4.93 mm/day.
The t-test tests the hypothesis that the estimate is different from zero. It is a
two-tailed test, and may not always be the one you want to evaluate. In this
example, the growth rate of the control is different from zero (p<0.05).
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.9333     0.1428  34.548  < 2e-16 ***
2. The main effect of Pinus taeda sap on fungal growth relative to the control.
The LS mean fungal growth rate is 2.06 mm/day less than the control, which
means the actual growth rate is 2.87 mm/day (try this for yourself). The t-test
tests the hypothesis that the difference in fungal growth rate relative to the
control is not zero. Since p<0.05, we can conclude that the Pinus taeda sap had
an influence on fungal growth.
P.taedayes   -2.0617     0.2019 -10.209 2.23e-09 ***
3. The main effect of Pinus pinea sap on fungal growth relative to the control.
The LS mean fungal growth rate is 2.30 mm/day less than the control, which
means the actual growth rate is 2.63 mm/day. Since p<0.05, we can conclude
that the Pinus pinea sap had an influence on fungal growth.
P.pineayes   -2.2967     0.2019 -11.373 3.49e-10 ***
4. The final line shows the interaction between the Pinus taeda and Pinus pinea
sap on fungal growth, relative to the control and to the main effects of each.
If the two tree species were purely additive, then the fungal growth rate would
be 4.36 mm/day slower than the control (2.06+2.30). The LS mean fungal growth
rate is 1.74 mm/day faster than the purely additive case, which means the
actual growth rate is 2.31 mm/day. The t-test tests the hypothesis that the
interaction, which is the difference between the observed growth and the
purely additive case, is zero. Since p<0.05, we conclude that there is an
interaction between the sap of the two species.
P.taedayes:P.pineayes   1.7367     0.2856   6.081 6.07e-06 ***
The remainder of the output is interpreted the same as for a linear regression (see
Tutorial 4 for details).
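To ‘try this for yourself’ as suggested in point 2, the four treatment-cell means
can be computed directly. A sketch using the base-R tapply() function with both
factors:

> tapply(Growth, list(P.taeda, P.pinea), mean)
          no      yes
no  4.933333 2.636667
yes 2.871667 2.311667

The rows are the levels of P.taeda and the columns the levels of P.pinea; the
yes:yes cell (2.31 mm/day) matches the interaction arithmetic in point 4.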
The output from the summary.lm() function provides t-test evaluations of the
term against zero, which is not always useful. The F-table, however, will let you
evaluate the general significance of each factor in the model, rather than just a
test against zero. To get the F-table, we can use the aov() function:
> summary(aov(my.fit))
                Df  Sum Sq Mean Sq F value    Pr(>F)
P.taeda          1  8.5443  8.5443  69.839 5.900e-08 ***
P.pinea          1 12.2408 12.2408 100.054 3.149e-09 ***
P.taeda:P.pinea  1  4.5240  4.5240  36.978 6.066e-06 ***
Residuals       20  2.4468  0.1223
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The first line is the F-test for the main effect of Pinus taeda, the second line is
the F-test for the main effect of Pinus pinea, and the last line is the F-test for
the interaction.
Evaluating the assumptions of ANOVA
The evaluation of assumptions for a two-factor ANOVA is the same as for a
one-factor ANOVA. Begin by plotting the residuals against the fitted values (which
are the LS means).
> plot(fitted(my.fit),resid(my.fit))
The variance appears relatively constant across the fitted values (Figure 6.3). The
histogram of raw residuals also appears roughly normal given the small number
of replicates.
> hist(resid(my.fit))
A more formal test is done using the gvlma() function in the gvlma library (see
Tutorial 4 for details).
> library(gvlma)
> gvlma(lm(my.fit))
Call:
lm(formula = my.fit)

Coefficients:
          (Intercept)             P.taedayes             P.pineayes  P.taedayes:P.pineayes
                4.933                 -2.062                 -2.297                  1.737

ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05

Call:
gvlma(x = lm(my.fit))

                     Value p-value                Decision
Global Stat        1.49217  0.8280 Assumptions acceptable.
Skewness           0.57877  0.4468 Assumptions acceptable.
Kurtosis           0.89226  0.3449 Assumptions acceptable.
Link Function      0.00000  1.0000 Assumptions acceptable.
Heteroscedasticity 0.02114  0.8844 Assumptions acceptable.
The output of the gvlma test indicates that the assumptions of linearity,
normality, homoscedasticity, and independence of the residuals are satisfied in
our example. Since the assumptions are satisfied, we can move on to interpret the
analysis in more detail.
Figure 6.3 Plot of residuals by the predicted value.
Posteriori contrasts
The fit ANOVA model indicates a significant interaction, which means that the
sap from the two species of pine has a non-additive impact on fungal growth
rates. It also means that the main effects (e.g., Pinus taeda influences fungal
growth rates) are not directly interpretable, because each includes treatments with
the other species as well. For example, it could be that Pinus taeda in isolation
does not influence fungal growth rates, but in the presence of Pinus pinea it does.
A posteriori contrasts provide a way to disentangle interactions by comparing
sub-components of the ANOVA table (e.g., Pinus taeda in isolation versus the control).
One way to do this is to use Tukey’s honest significant difference test, which
compares all possible pair-wise contrasts. In R, this is done using the TukeyHSD()
function.
> TukeyHSD(aov(my.fit))
1) Tukey multiple comparisons of means
     95% family-wise confidence level

   Fit: aov(formula = my.fit)

2) $P.taeda
          diff       lwr        upr p adj
   yes-no -1.193333 -1.491197 -0.8954692 1e-07

   $P.pinea
          diff       lwr       upr p adj
   yes-no -1.428333 -1.726197 -1.130469     0

3) $`P.taeda:P.pinea`
                          diff        lwr          upr     p adj
   a) yes:no-no:no   -2.061667 -2.6268893 -1.496444013 0.0000000
   b) no:yes-no:no   -2.296667 -2.8618893 -1.731444013 0.0000000
   c) yes:yes-no:no  -2.621667 -3.1868893 -2.056444013 0.0000000
   d) no:yes-yes:no  -0.235000 -0.8002227  0.330222654 0.6557255
   e) yes:yes-yes:no -0.560000 -1.1252227  0.005222654 0.0527069
   f) yes:yes-no:yes -0.325000 -0.8902227  0.240222654 0.3960647
The aov() function is used to pass the proper elements of the ANOVA fit to the
TukeyHSD() function. Let’s go through each section of the output from the
Tukey test.
1. The first section indicates that the Tukey test uses a 5% error rate across all
contrasts.
2. The next two sections ($P.taeda and $P.pinea) show the results for the main
effects, which are of little value here because of the significant interaction.
3. The final section ($P.taeda:P.pinea) is really what we are after. It shows the
results of all pair-wise contrasts between the treatments. For example, the first
line is ‘yes:no-no:no’, which is the contrast between the treatment with just
Pinus taeda (read as P.taeda present, P.pinea absent) and the control (read as
P.taeda absent, P.pinea absent). Since p<0.05, we can conclude that Pinus taeda
in isolation causes a change in fungal growth rate.
The Tukey test indicates that each pine species had an influence on fungal
growth rates in isolation and in combination relative to the control (contrasts a, b
& c). The two species of pine did not differ significantly in their effect on fungal
growth in isolation (contrast d), and the addition of Pinus taeda to Pinus pinea
had no significant effect relative to Pinus pinea in isolation (contrast f). The
interaction is mostly the result of the addition of Pinus pinea to Pinus taeda
relative to Pinus taeda in isolation (contrast e).
Function List
Here is a list of the functions and options covered in Tutorial 6; see the quick
Reference Card for more details.
Functions: glm( ), TukeyHSD( ), aov( ), interaction.plot( ).
Reference Cards
Functions
Function                          Description                                           Tutorial
c( x, y, z, ... )                 Creates a data vector from the x, y, z, ...           1
                                  entries.
max( x ), min( x )                Returns the maximum/minimum value in the x            1, 2
                                  vector.
sum( x )                          Returns the sum of all entries in the x vector.       1
mean( x ), median( x )            Returns the mean or median of the x vector.           1, 2
data.frame( ‘X’=x, ‘Y’=y, ... )   Creates a data frame of the vectors x, y, ...,        1
                                  each with the titles given by ‘X’, ‘Y’, ...
t( x )                            Returns the transpose of the data frame x.            2
as.matrix( x )                    Converts a data frame x into a matrix object.         2
read.csv()                        Reads in .csv files (comma-separated values).         1
                                  Use in conjunction with file.choose().
file.choose()                     Brings up a graphical window to select the file       1
                                  to upload. Use in conjunction with read.csv().
hist( x )                         Creates a histogram of the x vector. See Plotting     1, 2
                                  Options to customize your plots.
plot( x , y )                     Creates a scatter plot of the x and y vectors.        1, 2
                                  See Plotting Options to customize your plots.
interaction.plot( x1, x2, y )     Creates a line plot of the mean response for          6
                                  each combination of x1 and x2. The x1 categories
                                  are shown on the x-axis, and the x2 categories
                                  are denoted with different lines.
sd( x )                           Returns the standard deviation of the x vector.       2
quantile( x )                     Returns the quantiles of the x vector.                2
table( x , y )                    Creates a contingency table from the vector x,        2
                                  using the categorical values in y. In this case,
                                  x is always a stacked vector.
barplot( x )                      Creates a bar plot of the data in x. If x is a        2
barplot( x , beside=TRUE )        vector, then it is a simple bar plot. If x is a
                                  data frame, then each column of x is a separate
                                  group. The default is a stacked bar plot, but
                                  the option beside=TRUE can be used to create a
                                  grouped bar plot.
boxplot( x , y )                  Creates a box plot of the data in x. If provided,     2
                                  y is a categorical vector describing the grouping
                                  of x (i.e., x can be a stacked vector).
abline( h=a )                     Draws a horizontal (‘h’) or vertical (‘v’) line       2, 4
abline( v=a )                     on a plot that crosses the axis at a.
qt(1-α,df)                        Displays the critical value from a t-distribution     3
or qt(1-α/2,df)                   given a confidence value and the degrees of
                                  freedom. For a one-tailed t-test, the confidence
                                  value is 1-α. The two-tailed test requires 1-α/2
                                  for the right-hand tail and α/2 for the left-hand
                                  tail.
qf(1-α,df1,df2)                   Displays the critical value from an F-distribution    3
                                  for a given confidence value and degrees of
                                  freedom. The order for the degrees of freedom is:
                                  df1 is the numerator, df2 is the denominator.
qchisq(1-α,df)                    Displays the critical chi-squared value for a         3
                                  given confidence value and degrees of freedom.
glm(y-μ~1)                        One-sample t-test: tests the null hypothesis that     3
(one-sample t-test)               the data contained in the vector y are different
                                  from a population mean μ.
glm(y~1)                          Paired two-sample t-test: tests the null              3
(two-sample paired t-test)        hypothesis that the difference between two
                                  populations, y, is different from zero.
glm(y~x)                          Two-sample t-test: tests the null hypothesis          3, 4
(two-sample independent           that the means of two data sets are different.
t-test, regression)               Regression: runs a regression on x and y, where
                                  y is your response variable and x is your
                                  explanatory variable. We can think of the tilde
                                  as “given by”, therefore y “given by” x.
summary()                         Extracts additional information from a glm            3, 4
summary.lm()                      object and displays it on the screen.
names()                           Shows the list of components stored in a given        4
                                  object.
fitted()                          Lists the fitted values calculated from the           4
                                  regression coefficients.
resid()                           Lists the residuals, which are the differences        4
                                  between the fitted and observed values.
gvlma(lm())                       Used to assess the assumptions of a linear            4
                                  regression. lm() must be used inside this
                                  function to convert a glm object to an lm
                                  object.
glm(y~A+B+A:B)                    Performs an ANOVA on the dependent variable y         5, 6
                                  and independent categorical variables A and B.
                                  The ‘~’ separates independent and dependent
                                  variables, the ‘+’ separates the various
                                  independent variables, and the ‘:’ denotes an
                                  interaction.
TukeyHSD(aov(my.fit))             Performs a posteriori Tukey’s HSD test on the         5, 6
                                  glm object ‘my.fit’. The aov() function is used
                                  to pass specific parts of the glm object to the
                                  TukeyHSD() function.
aov(my.fit)                       Extracts the ANOVA table from a fit glm object.       5, 6
Arithmetic, indexing and Logic Operators

Symbol    Description                                          Tutorial
+         Addition                                             1
-         Subtraction                                          1
/         Division                                             1
*         Multiplication                                       1
^         Raise to the power (i.e., x^y is x raised to the     1
          power y)
=         Assigns a value                                      1
[,]       Used to access entries of data structures such as    1
          vectors, arrays and matrices. Commas are used to
          separate dimensions.
()        Used for order of operations (i.e., do what is in    1
          the brackets first) and for the arguments of
          functions.
Plotting Options

Option                  Description                                          Tutorial
xlim, ylim = c(a,b)     Specifies the axis limits to be plotted using a      1, 2
                        vector with two entries. The first entry ‘a’ is
                        the minimum, the second entry ‘b’ is the maximum.
xlab, ylab = “ ”        Labels the axes with what is written in the          1, 2
                        quotes.
main = “ ”              Labels the plot title with what is written in        2
                        the quotes.
col =                   Specifies a color for the line/point/bar. Can be     1, 2
                        either a color name in quotes (e.g., ‘red’) or a
                        number: 1-black, 2-red, 3-green, 4-blue, 5-cyan,
                        6-pink, 7-yellow and 8-grey. See colors() for a
                        full list of color names.
lty =                   Specifies the type of line as 1-solid, 2-dashed,     1
                        3-dotted, 4-dotdash, 5-longdash, 6-twodash.
lwd =                   Specifies the line width. Takes a positive number.   1
pch =                   Specifies the point type. Some useful values are     1
                        19-solid circle, 20-bullet (smaller circle),
                        21-filled circle, 22-filled square, 23-filled
                        diamond, 24-filled triangle point-up and 25-filled
                        triangle point-down.