Introduction to Statistics in R
Tutorial Material

BIOL 243, 2012
Queen's University
Edited by WA Nelson

Table of Contents

Tutorial 1: Getting Started with R
   How to get and install R
   Great, let's get started!
   Working with vectors
   Working with data
   Getting help on R functions
   Let's get plotting
   The editor
   Concluding remarks
   Resources
   Function List
   Overview of the R software
   Customizing the work environment
Tutorial 2: Graphs and Descriptive Statistics in R
   Descriptive statistics
   Contingency tables
   Histograms
   Bar plots
   Box plots
   Scatter plots
   Function List
Tutorial 3: Hypothesis Testing in R
   Finding critical values
   Chi-squared test
   One-sample t-test
   Paired two-sample t-tests
   Independent two-sample t-tests
   Function List
Tutorial 4: Regression and correlation in R
   Fitting the linear regression
   Plotting the fitted linear regression
   Evaluating the assumptions of linear regression
   Correlation
   Function List
Tutorial 5: One-Factor ANOVA in R
   Creating data frames for ANOVA
   Fitting the ANOVA model
   Evaluating the assumptions of ANOVA
   Posteriori contrasts
   Function List
Tutorial 6: Two-Factor ANOVA in R
   Creating data frames for ANOVA
   Fitting the ANOVA model
   Evaluating the assumptions of ANOVA
   Posteriori contrasts
   Function List
Reference Cards
   Functions
   Arithmetic, Indexing and Logic Operators
   Plotting Options

Tutorial 1: Getting Started with R
W.A. Nelson

This guide is an introduction to the statistical software environment R. There are many excellent books and online resources (see Resources) that give a broader background and overview of the wonderful world of R. Rather than try to emulate these resources, the intention here is to provide a beginner's guide that will help you become familiar with the basics of R quickly, while giving you the tools to learn more as you encounter new problems.

What is R?

R is an object-oriented programming language and computational environment for a wide range of statistical and mathematical analyses. R can be used as a calculator; as a versatile graphing tool; to run basic and advanced statistics; to numerically solve differential equations and matrix models; and for more specialized analyses such as inferring phylogenies, calculating genetic distances and bioinformatics. R is supported and developed by academics, and has an amazing array of references and help resources available online and in print. Here is a short list of reasons to use R, even in an intro stats course [1].

• R is used by the majority of academic statisticians.
• R is free!
• R is effectively platform independent. If you live in an all-Windows environment, this may not be such a big deal, but for those of us who use Linux/Mac and work with people who use Windows, it's a tremendous advantage.
• R has unrivaled help through books and online resources (but the immediate help functions in R can be difficult to understand).
• R makes the best graphics.
• The command-line interface — perhaps counterintuitively — is much, much better for teaching. You can post your code exactly and students can reproduce your work exactly.
• R is more than a statistics application. It is a full programming language.
• The online distribution system beats anything out there.

[1] Why use R? by J. Holland Jones (monkeysuncle.stanford.edu/?p=367)
Along with its growing popularity, however, R has a reputation for being difficult to learn because it uses command-line programming rather than the graphical user interface (menus, etc.) used by most commercial statistics programs. The contents of this primer are influenced to a large extent by our perspective on the challenges of learning R, and are intended to provide a straightforward roadmap to start using R efficiently. The first sections introduce you to using the console (the window where commands are typed), understanding the structure of arithmetic operations, plotting simple graphs, and getting help.

How to get and install R

R is free. Go to http://cran.r-project.org (CRAN stands for the Comprehensive R Archive Network) and click on one of the Linux, MacOS X or Windows links in the Download and Install R box (you won't need any of the source code files; these are for developers). Follow the links for the base files (for Windows users) and download the latest version ('R-2.15.1-win32.exe' for Windows users and 'R-2.15.1.dmg' for MacOSX users). The downloaded file has an installer in it, so double-clicking the file and following the prompts is all you need to do to install R.

Great, let's get started!

Start R as you would any program. You should have a single window open, which is called the console (see Overview of the R software for more details). This is the R program. When you move your cursor to the console, you should see > (note that we will use a green background throughout this guide to indicate what you see in the console). This is where commands are entered. Let's try it. Type "2+2" and then press the enter key:

> 2+2

You should see the answer appear (preceded by "[1]") below the line you typed:

> 2+2
[1] 4

The number in square brackets just indicates that this is the first entry in the vector being returned. Now let's create some variables. First, create a new variable 'dan' and assign it the value 2.

> dan=2

then create a new variable 'ted' and assign it the value 5

> ted=5

now we can have some fun manipulating dan and ted.

> dan-ted
[1] -3
> ted/dan
[1] 2.5
> dan*ted
[1] 10

If you want to raise something to an exponent, use the '^' symbol

> ted^dan
[1] 25

The Reference Card shows a list of common arithmetic operations. Now that you have your feet wet, here is a short list of useful R commands and operations [2]:

• > is the prompt from R on the console, indicating that it is waiting for you to enter a command.
• Each command entered by hitting <return> (<enter> on a Windows PC) is executed immediately and often generates a response; commands and responses are shown in different colours (this can be changed in Preferences, see below), which really helps when you are trying to find things on the console after a long session.
• When you hit <return> (<enter> on a Windows PC) before a command is completely typed, a + will appear at the beginning of the next line.
• R is case sensitive, so House is a different variable from house.
• R functions are followed by (), with or without something inside the brackets depending on the function (e.g., min(x) gives the minimum value of the x vector).
• Enter citation() to see how to cite R in your paper.
• Don't worry about spaces around symbols in a typed line; they do not have any effect, so win=3+4 is the same as win = 3 + 4.
• The # sign begins a comment that will not be executed, and can either follow a command or begin a new line. This is a great way to make notes to yourself in the code.
• To assign variables, use the = assignment operator; xxx = yyy assigns yyy to a new object xxx. Most advanced R users and books use <- instead of =, but the two are equivalent for our purposes and = is easier to remember and type.
• The up arrow ↑ scrolls back through previous commands that you have entered on the console; you can either press <return> (<enter> on a Windows PC) to execute the command again or edit it to correct a mistake. Very handy.
• If you make a mistake when typing and R returns 'syntax error' or shows a '+' sign rather than the prompt, press the escape key. This will give you a fresh prompt.

[2] Starting to Use R by R. Montgomerie 2010
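To make two of these points concrete (the comment sign and the arrow assignment), here is a short example you can type yourself; it reuses the win variable from the list above:

> win <- 3+4   #the arrow assignment, equivalent to win=3+4
> win          #typing the name prints the value
[1] 7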
Working with vectors

A vector (i.e., a list of numbers) is created in R using the function c(). In R, all functions use round brackets to accept input, with each entry separated by a comma. To create a vector, all we need to do is provide the function c() with a list of numbers.

> julie=c(3.2,4.1,5.5,6.2,7.1,8.4,9.5)

To see what is in the variable, just type its name.

> julie
[1] 3.2 4.1 5.5 6.2 7.1 8.4 9.5

To access a particular entry in the vector, use square brackets at the end of the name. Here the [3] indicates that you want the number in the third spot.

> julie[3]
[1] 5.5

We can even access a specific set of the elements in the vector

> a=c(3,4,5)
> julie[a]
[1] 5.5 6.2 7.1

Here the 'a' vector is used to specify which of the elements in 'julie' are needed. The same thing can be achieved by entering:

> julie[c(3,4,5)]
[1] 5.5 6.2 7.1

Mathematical operations can be done directly on vectors. Here are some examples:

> julie-7
[1] -3.8 -2.9 -1.5 -0.8  0.1  1.4  2.5
> julie-julie
[1] 0 0 0 0 0 0 0
> julie^2
[1] 10.24 16.81 30.25 38.44 50.41 70.56 90.25

Notice that the operation is carried out on each entry of the vector independently. The output from a new operation can also be assigned to a new variable:

> b=julie-7

We can look at the sum of all entries in the vector (using the function sum())

> sum(b)
[1] -5

or skip a step and nest these two operations in one command

> sum(julie-7)
[1] -5

Some other functions that are useful for summarizing information in vectors are mean(), min() and max(). Remember that spaces within a function have no influence, but R is very picky about commas and whether or not letters are capitalized.
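As a quick worked check of these summary functions, here is a small example (the vector and its name are arbitrary):

> kate=c(2,4,6,8)
> mean(kate)
[1] 5
> sum(kate)/4            #the mean is just the sum divided by the number of entries
[1] 5
> max(kate)-min(kate)    #the range of the data
[1] 6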
Working with data

Vectors are a great way to store certain types of data in R, but they consist of only a single column of data. In this section, we will learn to store more than one column of data in data frames, as well as to import data into R. Data frames are much like a collection of vectors, and are created using the data.frame() function. Begin by creating two vectors (they must have the same number of elements)

> x=c(1,2,3,4,5,6,7,8)
> y=c(2.1,1.3,4.2,7.1,8.7,12.1,11.2,14.5)

Then use the data.frame() function to bind these two vectors into one

> steve=data.frame('X'=x,'Y'=y)   #here 'X' and 'Y' become the column labels

This will create an 8 x 2 data frame with column titles X and Y (you can choose any name you like for the column titles). Type the variable name steve to see what is in it. To access just the X column, you can type

> steve[,1]
[1] 1 2 3 4 5 6 7 8

where the square brackets give you access to specific rows and columns in the format [rows, columns]. The blank before the comma means to select all rows. As an alternative, data frames let you call a column by its name with the '$' sign:

> steve$X
[1] 1 2 3 4 5 6 7 8

If you want a specific value from the data frame, you can access it using either of the indexing methods.

> steve[4,1]
[1] 4
> steve$X[4]
[1] 4

Data frames are one of the more common ways to store data in R, and are often the form required by functions (the function help file will always indicate the form that the argument needs to be in). Now we can learn about importing and exporting data. R can read and write many types of files (e.g., .txt, .csv, .xls), but here we will just explore working with CSV (comma separated values) files. CSV files have a '.csv' ending to the file name, and can be created using a variety of different spreadsheet software, such as Excel or Numbers. Reading data from .csv files is done with the read.csv() function. To start, make sure that your data are organized in the .csv file the way you want to use them in R, and give each column a title (we recommend that each column have a title without spaces, like 'BillWidth', and that you keep these short as you will have to type them out later). For example, the file WidthLengthBeetles.csv (opened in Excel) is shown in Figure 1.1. If you create your data file in Excel, make sure you save it as a CSV file.

Figure 1.1 Example Excel file for creating data sets.

The data file is imported using

> my.dat=read.csv(file.choose())

The function file.choose() opens up a new window that lets you browse for the file, and the contents of the file are assigned to the variable my.dat (note that you can use a different name here if you want, but be aware that there are some restrictions) as a data frame. Once you have imported the .csv file, type my.dat to see its contents. Notice that we used a period in the new variable name. In R, variable names can include text, numbers, periods and underscores, but not spaces or other symbols used for arithmetic such as "!". This gives you the basics of importing data, as well as creating and using data frames.
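The tutorials only read .csv files, but exporting works much the same way; a minimal sketch using the base R function write.csv() (not covered elsewhere in these tutorials; the file name is just illustrative):

> write.csv(steve,file='MyData.csv',row.names=FALSE)   #writes the data frame to a .csv file in the working directory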
Getting help on R functions

This is probably the most important section of this primer because there are thousands of functions in R, and it is essential to learn how to use the help files (which can be difficult to understand). There are numerous sources for help, such as web forums and blogs, books, and internet help sites (see Resources). Another good source is the help files in R itself. For example, let's have a closer look at the mean() function, which returns the average value of a vector. To get the help file for a function in R, simply type a question mark followed by the name.

> ?mean

This will open a new window with the help file. All help files have the same basic sections: description, usage, arguments, values, references, see also, and examples. Figure 1.2 shows an annotated help file for the mean() function. The key parts for any function help file are the 'Arguments' and 'Value' sections, since these describe the input and output of a function. For the mean() function, the arguments are the data object x that you are computing the mean for, and two options, trim and na.rm. The typical use is

> y=c(2.1,1.3,4.2,7.1,8.7,12.1,11.2,14.5)
> mean(y)
[1] 7.65

but if you have values that are not a number (designated by NA or NaN in R), the na.rm=TRUE option allows you to ignore these entries when calculating the mean. For example, the following commands return NA

> y=c(2.1,1.3,4.2,7.1,8.7,12.1,11.2,14.5,NA)
> mean(y)
[1] NA

while the following will drop the NA entry and calculate the correct mean

> mean(y,na.rm=TRUE)
[1] 7.65

The help files give an example of the function in use, and often give suggestions for closely related functions.

Figure 1.2 Navigating the help files for R functions. This marked-up sheet is for the mean() function, with annotations pointing out the library the function comes from (here the 'base' package), the general description of the function, the order of the arguments and their default values, the details of the arguments the function accepts, what the function returns, related functions, and a worked example.

Let's get plotting

R can be used to generate some of the nicest graphs of any statistical software. This section introduces you to two common plots to illustrate what can be done. We start with a histogram of some data, which shows the number of observations in your data set that fall in bins along the x-axis. Import the WidthLengthBeetles.csv data set as described above. This data set has two columns of data: Width and Length. To plot a histogram of just the wing width data, type

> hist(my.dat$Width)

where the command my.dat$Width accesses just the width data as described above. The plot should look like the image in Figure 1.3.

Figure 1.3 Histogram using the hist() function.

It's as simple as that. Graphs are plotted in a separate window, and a plotting function (such as hist()) will open a new window if none are open. The second type of plot we consider here is an X-Y scatter plot, which we can use to plot the width and length at the same time. Scatter plots are created using the plot() function, which requires the arguments X and Y (as well as other options if desired), where X is a vector of all the x values you want to plot, and Y is a vector of the y values. The first entry of the X vector corresponds with the first entry of the Y vector, and so on. For example, to create a basic plot of the weevil data we would type:

> plot(my.dat$Width,my.dat$Length)

Axis labels are added using the xlab='...' and ylab='...' options, and a plot title is added using the main='...' option, where the '...' is where you place the label text. There are many options available to make plots look nice (see Plotting Options), but we will wait until the second tutorial to go beyond adding labels. The weevil data plot with labels added looks like:

> plot(my.dat$Width,my.dat$Length,xlab='Wing Width (mm)',ylab='Wing Length (mm)',main='Plot of Wing Width by Length in Cowpea Weevils')

Figure 1.4 Scatterplot using the plot() function.
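The Function List below also mentions the xlim and ylim plotting options, which set the range of the x- and y-axes; a minimal sketch of their use with the weevil plot (the limits chosen here are arbitrary):

> plot(my.dat$Width,my.dat$Length,xlim=c(0,10),ylim=c(0,100))   #axes forced to run 0-10 and 0-100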
The editor

Now that we have covered a number of functions and tools, we will want to save our work for future use, which is done using an editor. The editor also lets you build your own library of commonly used functions and notes on how you implemented them. R has an editor embedded within it. The editor is linked to the console, so we can 'submit' our work to the console right from the editor if we want. To start a new editor file in MacOSX, click on 'File' from the menu bar and select 'New Document'; in Windows OS, click on 'File' and select 'New Script'. Save this file (it should have a '.r' ending). Now you can type your program. It's a good idea to add comments to help remind yourself of what each line does. Here's an example; notice that the prompts are shown in the console but not in the editor.

#This file calculates the mean wing width and length, and plots the weevil data
my.dat=read.csv(file.choose())   #reads in the data file
mean(my.dat$Width)               #calculates the mean wing width
mean(my.dat$Length)              #calculates the mean wing length

#Scatter plot of the weevil data
plot(my.dat$Width,my.dat$Length,xlab='Wing Width (mm)',ylab='Wing Length (mm)',main='Plot of Wing Width by Length in Cowpea Weevils')

To submit the program in MacOSX, select the text with your mouse, hold the command (⌘) key and press return; in Windows OS, click 'Edit' from the menu and select 'Run all'. In Windows OS, you can also run each line by selecting the line and pressing Ctrl followed by the 'R' key.

Concluding remarks

This first tutorial provides the elements to begin using R, which we will build on in the following tutorials to expand your ability to do a range of statistics, calculations and graphing. The Resources section below provides a list of online books and websites for more information on these topics, as well as a series of quick reference cards. The final sections provide a complete overview of the R environment that will become a useful reference as you gain more experience using the software.

Resources

Websites
1. Quick R by Robert Kabacoff is a good resource for a quick introduction to statistical functions in R. (www.statmethods.net)
2. R Tutorial by Chi Yau is a really good quick reference website and primer. (www.r-tutor.com/)

Online Books and Manuals
3. Simple R by John Verzani is a more extensive guide to standard statistics, from one-way hypothesis testing to ANOVAs and regression. (cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf)
4. R Primer by Christopher Green is a primer for an upper-year stats course, and has a detailed introduction to logic tests and looping (as well as statistical functions). (www.stat.washington.edu/cggreen/rprimer/first-edition/rprimer.09212004.firsted.pdf)
5. An Introduction to R by Venables & Smith is a great resource for matrices and matrix functions, as well as writing your own functions and more advanced plotting functions. (cran.r-project.org/doc/manuals/R-intro.pdf)
6. Statistics Using R with Biological Examples by Seefeld & Linder is a nice reference for statistical methods that go beyond frequentist statistics. (cran.r-project.org/doc/contrib/Seefeld_StatsRBio.pdf)
7. A Beginner's Guide to R by A. Zuur is a comprehensive R book that is available online through Queen's library (type the title into QCAT).
8. Introductory Statistics with R by P. Dalgaard is a good statistics textbook that uses R, and is available online through Queen's library (type the title into QCAT).

Function List

Here is a list of the functions, operators and options covered in Tutorial 1; see the quick Reference Card for more details.
Functions: c( ), min( ), max( ), sum( ), mean( ), data.frame( ), read.csv( ), file.choose( ), hist( ), plot( ).
Arithmetic operators: +, -, /, *, ^, =, [ ], ( ).
Plotting options: xlim=c( ), ylim=c( ), xlab=' ', ylab=' '.
Overview of the R software

This overview is from Starting to Use R by R. Montgomerie (2010). The diagram below shows the relation between the R programming language and various internal and external components, all described below. While this probably looks a bit daunting, we think it helps to understand what is linked to the console, and the console is what you will see and work with most of the time.

Console
This is your main window that you see when you open the R application, where you will type commands for R to implement, and where you will see the text responses to those commands. The console has a toolbar at the top for quick access to some commands.

R Language
This is the interpreter that is the heart and soul, or more correctly the brain, of R. Invisible to you.

Working Directory
This is where R will look for files that you specify in commands, unless you give the full path to a file. Set this in the Preferences.

Packages
R is complemented by packages that contain specific functions and data sets that you can use once the package is available in your workspace. Many packages are built in and are immediately available when you download and install R; others can be downloaded and installed into R. See http://CRAN.R-project.org for a long list of available packages.

Editor
This is either an internal module or an external program (that you specify) used to compose and edit R commands and programs and then save them in a text file for later use. The R editor highlights functions, commands and other things in different colours, automatically closes brackets for you, etc. R is easiest to work with if you do all your basic work in the editor, then transfer commands to the console, then save your editor files for easy later reference.

Package Installer
Use this to download new packages so that they can be loaded into your workspace as needed using the Package Manager.

Quartz Window
All of your graphs will appear here. Size and shape can be adjusted by just changing the window size using the handle in the lower right corner of the window. You can save the window as a pdf file that can be opened and edited in Adobe Illustrator.

Package Manager
Use this to load into the workspace packages that have been downloaded and installed.
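Installing and loading a package can also be done directly from the console; a minimal sketch using the base R functions install.packages() and library() (the package name 'vegan' is just an example, not one used in these tutorials):

> install.packages('vegan')   #downloads and installs the package from CRAN
> library(vegan)              #loads the installed package into the workspace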
Data Manager
Lets you see all data sets and variables that are in your workspace, including those in loaded packages.

Data Editor
This is a window where you can edit your data. Use fix(dat) to see your data in the Data Editor, where dat is the name of a data set object in your workspace (note that we use dat for the generic name of your data set throughout this document).

Workspace
This is an invisible area where all of the work gets done, using the R Language to manipulate objects. When you are working in an R session, all of the objects that you have loaded or created are stored in a workspace that can be saved at the end of your session. You can also set a preference so that R always saves the current workspace when you close R, then opens that same workspace when you re-open R.

Workspace Browser
Use this menu item to see a listing of the objects in the current workspace in spreadsheet format. You can use the toolbar on this window to delete objects or edit them, which will open the Data Editor window.

History
Use the History icon in the toolbar to see all the commands used since you last cleared the history. Double click on a command in this list to have it entered anew at the prompt. You can also load previous history that you might have saved.

External Files
These are text files containing data, commands, functions, history or other information that is either saved during an R session or read into the workspace during a session. These are shown as diamonds in the diagram.

Customizing the work environment

You could just start using R, but your experience will be better if you customize the working environment to your personal taste. You need to do this only once (on each of your computers), and the environment will be the same each time you start R. First, customize the toolbar by control-clicking on it, choosing Customize Toolbar, and setting it up the way you like it. Second, select the R>Preferences menu item and customize each of the settings. You might want to make the background colour something other than plain white, to make it easy to distinguish from other windows on your screen, or you may want a bigger font than the default. Don't put anything on the toolbar that you do not understand or use; you can always add things later.

Tutorial 2: Graphs and Descriptive Statistics in R
M. Kelly

In this tutorial, we will learn how to calculate summary statistics and create graphs with R.

Descriptive statistics

Descriptive statistics are quantitative attributes that characterize a data set, such as the mean, median, and standard deviation. For example, Table 2.1 shows observations of mercury concentration in the sediments of two imaginary lakes.

Table 2.1 Sediment Concentrations in Two Lakes Studied for Mercury Distribution

Depth (cm)   Lucky Lake (ppb)   OddBall Lake (ppb)
 1           64.07              122.17
 2           55.36              100.29
 3           61.17               79.54
 4           72.51               86.07
 5           87.68               78.24
 6           72.31               77.25
 7           76.67               69.89
 8           63.05               66.08
 9           68.33               91.26
10           59.87               73.68
11           86.48               67.58
12           80.71               73.08
13           58.85              102.54
14           45.02               83.73
15           63.17               88.86
16           72.06              106.63
17           71.51               67.72
18           69.88               82.56
19           63.55               93.73
20           87.78               71.41

At first glance, it can be difficult to see patterns in the data set. However, by comparing mercury concentrations between lakes using descriptive statistics, you can get some first impressions of the data. Since this data set is short, we can enter it directly in the console.

> OddBall=c(122.17, 100.29, 79.54, 86.07, 78.24, 77.25, 69.89, 66.08, 91.26, 73.68, 67.58, 73.08, 102.54, 83.73, 88.86, 106.63, 67.72, 82.56, 93.73, 71.41)
> Lucky=c(64.07, 55.36, 61.17, 72.51, 87.68, 72.31, 76.67, 63.05, 68.33, 59.87, 86.48, 80.71, 58.85, 45.02, 63.17, 72.06, 71.51, 69.88, 63.55, 87.78)

The mean() function is used to calculate the average mercury concentration for each lake

> mean(Lucky)
[1] 69.0015
> mean(OddBall)
[1] 84.1155

which reveals that OddBall lake has a higher mean mercury concentration than Lucky lake. Similarly, the standard deviation function sd() is one way to describe the amount of variation within a data set; here we find that OddBall lake has more variation as well.

> sd(Lucky)
[1] 11.18146
> sd(OddBall)
[1] 15.03170
At this point, we could summarize the two data sets by saying that the mean mercury concentration in Lucky lake is 69.00 (s=11.18) and in OddBall lake is 84.12 (s=15.03), where s represents a sample estimate of the standard deviation. Other useful descriptive statistics are the minimum and maximum values, which are calculated using the min() and max() functions respectively.

> max(Lucky)
[1] 87.78
> min(Lucky)
[1] 45.02
> max(OddBall)
[1] 122.17
> min(OddBall)
[1] 66.08

The mean is a suitable description of central tendency when the data do not have extreme values. When a data set has extreme values, the median and quantiles are often more valuable descriptors of central tendency and variation. Medians are calculated using the median() function

> median(Lucky)
[1] 69.105
> median(OddBall)
[1] 81.05

and quantiles (including the 50% quantile, which is the median) using the quantile() function.

> quantile(Lucky)
    0%    25%    50%    75%   100% 
45.020 62.580 69.105 73.550 87.780 
> quantile(OddBall)
      0%      25%      50%      75%     100% 
 66.0800  72.6625  81.0500  91.8775 122.1700 

Contingency tables

Contingency tables summarize the frequency of observations that fall into one or more categories, and provide a compact way to visualize relationships among the categories. For example, mercury concentrations above 100 ppb might be considered 'at risk', while those below 100 ppb could be considered 'acceptable'. Table 2.2 shows the contingency table for Lucky lake and OddBall lake.

Table 2.2 Contingency Table of Lake Sediments at Risk of Contamination

Sediment Status   Lucky Lake   OddBall Lake   Row Total
At Risk            0            4              4
Acceptable        20           16             36
Column Total      20           20             40

To create a contingency table in R, we first need to create two new categorical variables that indicate the source lake for each observation, and whether it is acceptable or at risk. This type of data structure is referred to as stacked because the data from each category are stacked on top of each other. Here, we will call the vector indicating the source lake 'lake' ('O' is for OddBall lake, and 'L' is for Lucky lake), and the vector indicating risk 'risk' ('Y' is for at risk, 'N' is for not at risk).

> lake=c('O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L')
> risk=c('Y', 'Y', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'Y', 'N', 'N', 'Y', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N')

The contingency table is created using the table() function.

> table(lake,risk)
    risk
lake  N  Y
   L 20  0
   O 16  4

The table shows that no sediment layers in Lucky lake are of concern, but four layers in OddBall lake are.
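As an aside, typing out 40 repeated entries by hand is tedious and error prone. The base R function rep() (not covered elsewhere in these tutorials) builds the same vector more compactly; a minimal sketch:

> lake=c(rep('O',20),rep('L',20))   #20 copies of 'O' followed by 20 copies of 'L'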
Graphs

Tutorial 1 introduced some basic plotting functions. Here, we will introduce new functions that allow you to create versatile graphs of your data.

Figure 2.1 Data Distribution for Sediment Mercury Concentration at Lucky Lake (A) and OddBall Lake (B).

Histograms

A histogram is a plot of the observed frequency of an event, and is created in R using the hist() function. A basic histogram is created as follows:

> hist(Lucky)

To create custom text for the figure, use the xlab='...', ylab='...', and main='...' options, which add the desired text (entered in place of ...) to the x-axis, y-axis and plot title respectively.

> hist(Lucky,main="Distribution of Hg Concentrations in Lucky Lake Sediment",xlab="Hg Concentration (ppb)",ylab="Observed Frequency")

These options work with most of the plotting functions used in R. The histogram for each lake is shown in Figure 2.1.

Bar plots

The barplot() function is used to create bar plots of your data. The number of bars in the figure is equal to the number of observations in the data set, and the y-axis shows the value of the observations within a variable. For example, a labelled bar plot of the Lucky lake data is created by

> barplot(Lucky,main='Sediment Mercury Concentration in Lucky Lake')

With the default values, the function creates a bar plot with a y-axis that ranges from 0 to 80, which leaves some bars extending beyond the plotted axis range. To tidy this up, we can use the ylim option, which has the form ylim=c(a,b), where the desired minimum and maximum are entered in 'a' and 'b' respectively. We can also add colour to the plot using the col option, which has the form col='...' (e.g., col='blue'), where the colour is entered as text (type colors() into R to see a list of all colours). The more polished bar plot (Figure 2.2) is created by

> barplot(Lucky,ylim=c(0,100),col='red',main='Sediment Mercury Concentration in Lucky Lake',xlab='Sediment Sample',ylab='Mercury Concentration (ppb)')

Figure 2.2 Bar plot of Mercury Concentrations (ppb) in Sediments.

To visualize more than one category, we can use stacked or grouped bar plots. To do this, we first create a data frame of the entire data set.

> BothLakes=data.frame('OddBall'=OddBall, 'Lucky'=Lucky)

The data frame ("BothLakes") has two columns, with each row a different depth. Since we want to plot the different depths across the x-axis, the barplot() function requires us to switch the rows and columns so that the data frame has two rows (OddBall and Lucky) and 20 columns that represent the depths. This is done using the t() function, which transposes the data set (i.e., switches the rows and columns), after first converting the data frame to a matrix using the as.matrix() function.

> BothLakes=as.matrix(BothLakes)
> BothLakes=t(BothLakes)

A stacked bar plot (Figure 2.3) is then created using

> barplot(BothLakes,main='Comparison of Mercury Concentration by Depth Interval', xlab="Depth Interval", ylab='Concentration of Mercury (ppb)', ylim=c(0,200),col=c('red', 'lightblue'))

A grouped bar plot is created using the beside option, which has the form beside=TRUE.

> barplot(BothLakes,main='Comparison of Mercury Concentration by Depth Interval', xlab="Depth Interval", ylab='Concentration of Mercury (ppb)', ylim=c(0,150),col=c('red', 'lightblue'),beside=TRUE)

Figure 2.3 Stacked Barplot of Sediment Mercury in OddBall Lake (red) and Lucky Lake (blue).
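The figure caption identifies the colours, but a legend can also be drawn on the plot itself; a minimal sketch using the base R function legend() (not covered elsewhere in these tutorials), run after one of the barplot() commands above:

> legend('topright',legend=c('OddBall','Lucky'),fill=c('red','lightblue'))   #adds a colour key to the current plot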
Box plots

A box and whisker plot illustrates the median and quantile levels. The function boxplot() indicates the median by a dark band within a "box" that is bounded by the 1st and 3rd quartiles. "Whiskers" in this plot extend to the most extreme data point within 1.5 times the interquartile range (i.e., the 75% quantile minus the 25% quantile). Values outside the whiskers are plotted as points. To visualize the mercury concentrations in each lake, we need to create stacked data similar to what was used for the contingency tables. The first vector will be the mercury concentrations for both lakes, and the second vector will indicate which lake the observation is from.

> mercury=c(OddBall,Lucky)
> lake=c('O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L', 'L')

Figure 2.4 Mercury Concentrations in Lucky and OddBall lake.

The boxplot is created using

> boxplot(mercury~lake,main="Mercury Concentrations by Lake", ylab='Mercury Concentration (ppb)',xlab='Lake',col=c('red','lightblue'))

where the term mercury~lake is a formula that tells the boxplot() function that the mercury quantiles should be calculated for each level in lake.

Scatter plots

A scatterplot displays quantitative data for two variables. The plot() function was encountered in Tutorial 1, and provides a great way to create scatterplots. For our example of mercury concentrations in two lakes, we can compare the mercury at each sediment depth as follows:

> plot(OddBall,Lucky,main='Comparing Lake Mercury Concentrations',xlab='OddBall Lake Concentration', ylab='Lucky Lake Concentration',xlim=c(60,130),ylim=c(40,90))

We can highlight changes relative to the means by adding two lines that represent the mean for each lake. The abline() function is used to create straight lines on a plot. For a vertical line, the abline() function has the form abline(v=value), where "value" indicates where you want the line drawn.

> abline(v=84.1155,col='blue')

For a horizontal line, the abline() function has the form abline(h=value) as follows:

> abline(h=69.0015,col='red')

The final graph is shown in Figure 2.5.

Figure 2.5 Concentrations of Mercury in Sediments with the OddBall and Lucky Lake means, blue and red respectively.
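Beyond horizontal and vertical lines, abline() can also draw a line from an intercept and slope using the form abline(a=intercept,b=slope); a quick sketch (the values here are arbitrary):

> abline(a=0,b=1,col='grey')   #a 1:1 line through the origin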
Function List

Here is a list of the functions and options covered in Tutorial 2; see the quick Reference Card for more details.
Functions: mean( ), median( ), sd( ), quantile( ), min( ), max( ), table( ), hist( ), barplot( ), boxplot( ), plot( ), abline( ), t( ), as.matrix( ).
Plotting options: xlim=c( ), ylim=c( ), xlab='...', ylab='...', main='...', col='...'.

Tutorial 3: Hypothesis Testing in R
H. Haig

Finding critical values

An important component of hypothesis testing is to find the critical value for a test statistic, such as a Student's t, F or Chi-squared distribution. While these values can be looked up in tables (such as those found in your textbook), they can also be calculated in R using the qt(), qf() and qchisq() functions. All three functions use the same input: the cumulative probability and the degrees of freedom. The critical value returned is the location on the x-axis where the requested cumulative probability lies entirely to the left (Figure 3.1). For example, to calculate the critical t-score for a one-sided t-test with 19 degrees of freedom and α=0.05, type:

> qt(0.95,19)
[1] 1.729133

and for a two-tailed test you would type

> qt(0.975,19)
[1] 2.093024

These functions are quite handy, and will be used in the following sections.

Figure 3.1 Test scores in the qt(), qf() and qchisq() functions. The blue area shows the cumulative probability from the left up to the T-crit threshold. The green hatched areas are the critical p-value thresholds.

Chi-squared test

A Chi-squared test is used to test for independence among categorical variables. For example, we might be interested in whether colour blindness is independent of gender. If males are more likely to be colour blind than females, then we expect that the relative frequency of people with colour blindness would not be independent of gender. The null hypothesis for the Chi-squared test is that the categorical variables are independent of one another.

HO: Categorical variables are independent
HA: Categorical variables are not independent

For our example with colour blindness, the Chi-squared test hypotheses would be:

HO: There is no difference in the degree of colour blindness between males and females
HA: There is a difference in the degree of colour blindness between males and females

For a fully worked example, let's look at some hypothetical data on an invasive species, Bythotrephes longimanus. This invasive zooplankton, commonly called the spiny water flea, entered the Great Lakes region in the mid 1980s and caused a dramatic change in zooplankton composition in many lakes. The main mechanism of spread through inland lakes is fishnets and boats moving between invaded and non-invaded lakes. One area where Bythotrephes invasion has been studied in detail is the 'cottage country' region of the Muskokas in southern Ontario. The following table shows the number of lakes with and without cottages, as well as the state of Bythotrephes invasion in the lake (not invaded, invaded but not abundant, and invaded and abundant).

               Not invaded   Invaded, but not abundant   Invaded and abundant   Row total
Cottages       25            60                          65                     150
No Cottages    40             7                           3                      50
Column total   65            67                          68                     200

The statistical hypotheses are:

HO: Presence of cottages and lake invasion status are independent
HA: Presence of cottages and lake invasion status are not independent

The chisq.test() function in R can be used to do a Chi-squared test. The first step is to get the data into R. This can be done by creating a .csv file and importing the data as was done in Tutorial 1 or, as we illustrate here, we can create the data directly in R. Begin by creating the data vectors

> cottage=c(25,60,65)
> no.cottage=c(40,7,3)
> dat=data.frame('C'=cottage,'NC'=no.cottage)

It is a good idea to have a look at your data frame to make sure the variables are in the right order:

> dat
   C NC
1 25 40
2 60  7
3 65  3

The Chi-squared test is done by typing

> chisq.test(dat)

	Pearson's Chi-squared test

data:  dat
X-squared = 69.2218, df = 2, p-value = 9.304e-16

The output shows you:
• The type of test: Pearson's Chi-squared test
• The Chi-squared value: X-squared = 69.2218
• The degrees of freedom: df = 2
• The p-value: p-value = 9.304e-16

Since p<0.05, we reject our null hypothesis that Bythotrephes longimanus abundance in Muskoka lakes is independent of the presence or absence of cottages. Equivalently, we can do the same test by comparing the observed versus critical test statistics. The observed χ2 value is given in the output of the chisq.test() function, and the critical χ2 value can be found using qchisq() with a confidence value of 0.95 (1-α) and 2 degrees of freedom.

> qchisq(0.95,2)
[1] 5.991465

Since the observed χ2 value (69.2) exceeds the critical χ2 value (5.99), the null hypothesis is rejected.

One-sample t-test

A one-sample t-test compares the mean of an observed set of data to a known value. This test assumes that the observed data are normally distributed. To demonstrate a one-sample t-test in R, let's look at some hypothetical data from a fish farm that raises trout. As a measure of stress in adult fish, the farm monitors the number of eggs produced per female to ensure that the fish are under optimal conditions for reproduction. The procedure is to randomly select ten fish during each egg harvest, and count the total eggs per fish. Experience shows that the minimum number of eggs from a non-stressed fish is 1100. The following table shows the egg count from the most recent harvest.

Fish ID   Eggs/fish
F1         778
F2        1367
F3         947
F4        1002
F5         521
F6         656
F7        1082
F8        1144
F9         735
F10       1283

To determine if this sample is different from the minimum threshold, a one-sample t-test can be used. Since the data set is small, it is easiest to enter the data directly into R

> eggs=c(778,1367,947,1002,521,656,1082,1144,735,1283)
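Before running the formal test, it is worth a quick look at the sample mean (this check is our own addition, not part of the testing procedure):

> mean(eggs)
[1] 951.5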
The general linear model function glm() is used to compute the observed t-score and p-value for the data. In order to obtain the full output from this function, the summary() function will also be needed. This is done by first assigning the results of the glm() function to a variable using

> my.fit=glm(...)

where the "..." are the arguments used for a particular test. To run a one-sample t-test, we write the formula as

y - μ ~ 1

where y is your data set, and μ is the known value you are comparing the data against. The ~1 tells the program that you do not have any categories. Putting it all together, the R code for a one-sample t-test is

> my.fit=glm(eggs-1100~1)

A summary is created by typing

> summary(my.fit)

Call:
glm(formula = eggs - 1100 ~ 1)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-430.5  -205.8    23.0   177.0   415.5  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -148.50      87.48  -1.697    0.124

(Dispersion parameter for gaussian family taken to be 76534.94)

    Null deviance: 688814  on 9  degrees of freedom
Residual deviance: 688814  on 9  degrees of freedom
AIC: 143.78

Number of Fisher Scoring iterations: 2

The key values to look for in this output are the degrees of freedom, the observed t-value, and the p-value. In this case the degrees of freedom are 9, the t-value is -1.697, and the p-value is 0.124.
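As an aside, base R also provides the t.test() function, which runs the same one-sample test in a single step; a minimal sketch (this is not the approach used in these tutorials):

> t.test(eggs,mu=1100)   #should report the same t-value (-1.697) and p-value (0.124)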
To interpret this further, we need to decide if the test is a one-tailed or two-tailed test. If the statistical hypotheses are:

HO: There is no difference between the mean number of eggs per fish in the sample and the threshold of 1100
HA: There is a difference between the mean number of eggs per fish in the sample and the minimum threshold of 1100

then it is a two-tailed test because we are not concerned with the direction of the difference. In this case, we would fail to reject the null hypothesis because p>0.05. The same result is obtained by using the qt() function and comparing the observed and critical test statistics.

> qt(0.975,9)
[1] 2.262157

For a two-tailed test, the absolute value of the critical test statistic is t.crit=2.262, which is greater than the observed absolute value t.obs=1.697. The intent of sampling the fish, however, is to evaluate whether the eggs per fish are less than the minimum threshold, so a one-tailed hypothesis would be more appropriate. The statistical hypotheses would be:

HO: The mean number of eggs per fish in the sample is not less than the threshold of 1100
HA: The mean number of eggs per fish in the sample is less than the threshold of 1100

A one-tailed test can still be done using the output from the glm() function by dividing the stated p-value in half. The mean number of eggs per fish is less than the threshold of 1100, as shown by the negative sign of the estimate (-148.50), so the sign of the observed test statistic is negative (t.obs=-1.697). The one-sided p-value is p=0.062, which is greater than 0.05, so we fail to reject the null hypothesis. Thus, we can conclude that the observed eggs per fish are not reflective of stressed fish.
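If you prefer to compute the one-tailed p-value directly rather than halving the two-tailed one, the base R function pt() (the counterpart of qt(), not used elsewhere in these tutorials) returns the cumulative probability for an observed t-score; a sketch:

> pt(-1.697,9)   #returns roughly 0.062, matching the halved p-value above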
Paired two-sample t-tests

A paired t-test can be used to determine if there is a difference between two groups. This is tested using the null hypothesis that there is no difference in the mean between two treatments. Before-and-after studies commonly use paired t-tests to show how a specific population changes with the application of some treatment. For example, changes in a biological population before and after a dramatic event can be used to determine if the event has caused a significant change to the population. The general hypotheses for this type of test are:

HO: There is no difference between the two groups
HA: There is a difference between the two groups

An application of a paired t-test would be to determine if the densities of coral species before and after the widespread bleaching event in 1998 are statistically different. Coral density was monitored on a reef before and after the event, showing changes between the 1997 and 1999 coral species densities. Observations from a study on a nearby reef categorized coral species into 'Winners' and 'Losers' based on their abilities to survive bleaching events (Loya et al. 2001).

Species                       Density (indv/m2)
                              1997    1999    Difference
Winners
Porites lutea                 18.1    33.1     15.0
Porites lobata                21.3    39.4     18.1
Leptastrea transvera          16      36.1     20.1
Goniastera aspera             21.6    42       20.4
Goniastrea pectinata          21      33.2     12.2
Leptastrea purpurea           20.3    43.6     23.3
Platrygra ryukuenis           25.2    39.4     14.2
Porites rus                   22      39.1     17.1
Favites halicora              22.9    37.8     14.9
Favia favus                   23.5    45.1     21.6
Losers
Millepora intricata           24       1.2    -22.8
Millepora dichotoma           26.1     0.8    -25.3
Acropora digitifera           18.1     1.7    -16.4
Porites attenuata             24.2    12.1    -12.1
Porites sillimaniani          21       9.7    -11.3
Stylophora pistillata         19.2    10.1     -9.1
Porites cylindrica            24       4.8    -19.2
Montipora aequituberculata    21.8     4.1    -17.7
Porites nigrescens            19.2    10.2     -9.0
Pocillopora damicornis        19.8     5.2    -14.6
Millepora platphylla          26.7     6.3    -20.4
Porites aranetai              23.1     8.1    -15.0
Porites horizontalata         25       4.1    -20.9
Seriatopora hystrix           14.8     2.3    -12.5

You can create a csv file of the data, or enter the data directly into R. If the data are entered directly into R, simply make two vectors for the 1997 and 1999 data and a third for the difference between the pre- and post-bleaching values, as follows:

> pre=c(18.1, 21.3, 16, 21.6, 21, 20.3, 25.2, 22, 22.9, 23.5, 24, 26.1, 18.1, 24.2, 21, 19.2, 24, 21.8, 19.2, 19.8, 26.7, 23.1, 25, 14.8)
> post=c(33.1, 39.4, 36.1, 42, 33.2, 43.6, 39.4, 39.1, 37.8, 45.1, 1.2, 0.8, 1.7, 12.1, 9.7, 10.1, 4.8, 4.1, 10.2, 5.2, 6.3, 8.1, 4.1, 2.3)

Create a difference vector as follows:

> diff=post-pre
> diff
 [1]  15.0  18.1  20.1  20.4  12.2  23.3  14.2  17.1
 [9]  14.9  21.6 -22.8 -25.3 -16.4 -12.1 -11.3  -9.1
[17] -19.2 -17.7  -9.0 -14.6 -20.4 -15.0 -20.9 -12.5

Alternatively, it may be easier to load the csv file

> my.dat=read.csv(file.choose())

With the difference between the pre- and post-bleaching values calculated, the t-test is treated the same way as a one-sample t-test. In this example we assume that the difference between the pairs is zero under the null hypothesis, but this is not always the case.

HO: μd=0
HA: μd≠0

As before, the glm() and summary() functions provide the R tools for the test. In this example, we will name the model 'pt' for paired t-test. In the one-sample t-test from the previous section, the model formula was y-μ~1, where μ is the population parameter that the data are being tested against. Here, μ=0 so the formula is y~1.

> pt=glm(diff~1)
> summary(pt)

Call:
glm(formula = diff ~ 1)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-23.242  -14.667   -8.142   17.583   25.358  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -2.058      3.597  -0.572    0.573

(Dispersion parameter for gaussian family taken to be 310.5025)

    Null deviance: 7141.6  on 23  degrees of freedom
Residual deviance: 7141.6  on 23  degrees of freedom
AIC: 208.80

Number of Fisher Scoring iterations: 2

The summary shows p>0.05, so we fail to reject the null hypothesis.
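The same paired test can also be run directly with base R's t.test() function; a minimal sketch (again, not the approach used in these tutorials):

> t.test(post,pre,paired=TRUE)   #equivalent to the one-sample test on diff above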
A histogram of the data reveals that there have been changes in most species of coral, but not always in the same direction. Try

> hist(diff)

Figure 3.2 Histogram of pre and post coral densities.

Recall that we expect some species to do better than others (i.e., 'winners' and 'losers'). To tease apart the patterns in more depth, let's consider just the 'losers' data.

> loser.post=c(1.2, 0.8, 1.7, 12.1, 9.7, 10.1, 4.8, 4.1, 10.2, 5.2, 6.3, 8.1, 4.1, 2.3)
> loser.pre=c(24, 26.1, 18.1, 24.2, 21, 19.2, 24, 21.8, 19.2, 19.8, 26.7, 23.1, 25, 14.8)
> diff.loser=loser.post-loser.pre
> diff.loser
 [1] -22.8 -25.3 -16.4 -12.1 -11.3  -9.1 -19.2 -17.7  -9.0 -14.6 -20.4 -15.0 -20.9 -12.5

Then rerun the analysis

> fit.loser=glm(diff.loser~1)
> summary(fit.loser)

Call:
glm(formula = diff.loser ~ 1)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-9.1357  -3.9357   0.4643   3.9643   7.1643  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -16.164      1.363  -11.86 2.41e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 26.01016)

    Null deviance: 338.13  on 13  degrees of freedom
Residual deviance: 338.13  on 13  degrees of freedom
AIC: 88.312

Number of Fisher Scoring iterations: 2

If only the 'loser' species of corals are analyzed, we find that p<0.05 and reject the null hypothesis that the difference in density is zero. Similar to the one-sample t-test, a critical t-score can be determined using qt(). For this two-tailed test:

> qt(0.975,13)
[1] 2.160369

Here the observed absolute value |t.obs|=11.86 exceeds t.crit=2.16 (equivalently, t.obs falls below -t.crit), again indicating that there is a significant difference in the density of coral before and after the bleaching event. To take this one step further, we could evaluate the hypothesis that there has been a decrease in coral density after bleaching

HO: μd≥0
HA: μd<0

which is a one-tailed test. Since the p-value reported in the glm summary is for a two-tailed test, we need to use the qt() function to find the correct critical t-score, as follows:

> qt(0.95,13)
[1] 1.770933

Since t.obs=-11.86 is less than -t.crit=-1.77, we reject the null hypothesis and conclude that the coral density of these particular species has decreased since the 1998 bleaching event.

Loya, Y., Sakai, K., Yamazato, K., and van Woesik, R. 2001. Coral bleaching: the winners and the losers. Ecology Letters 4: 122-131.

Independent two-sample t-tests

Independent two-sample t-tests are used to evaluate whether two populations have different means, and are an invaluable tool for differentiating between the outcomes of two trials. For example, imagine that you have started a new job working for an engineering firm as an aquatic biologist and risk assessment expert. The company is planning to dam a large river, which will cause a decrease in the water flow during the spawning season for trout. The change in water flow might decrease the amount of oxygen flowing over the eggs, and thereby have an impact on the fish population. As the biologist in the group, you have been given access to a flow tunnel to see if the fish eggs will be able to survive under this new flow regime. You have taken many eggs from the same fish stock and used 100 eggs for each trial. After 14 replicate trials of each flow speed, your preliminary data on the number of eggs that hatched are as follows:

Trial   Normal   Slow
 1      78       58
 2      72       57
 3      88       45
 4      80       56
 5      73       66
 6      81       60
 7      62       49
 8      76       51
 9      73       52
10      90       65
11      92       50
12      76       48
13      71       58
14      74       57

To test if these two treatments have a different impact on hatching success, you can use an independent two-sample t-test. The first step in performing a two-sample independent t-test in R is to input the data in the correct form. Specifically, one column must contain the data and the other column a coding variable indicating the treatment for each trial (i.e., in stacked form). Start by entering the data in two vectors. The data vector is

> egg.count=c(78, 72, 88, 80, 73, 81, 62, 76, 73, 90, 92, 76, 71, 74, 58, 57, 45, 56, 66, 60, 49, 51, 52, 65, 50, 48, 58, 57)

To produce the categorical vector, a code for normal and slow water movement can be used. Let N and S denote normal and slow respectively

> trial=c('N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'N', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S', 'S')

Then load these vectors into a single data frame

> my.data=data.frame(egg.count,trial)

The data set should look like

> my.data
   egg.count trial
1         78     N
2         72     N
3         88     N
4         80     N
5         73     N
6         81     N
7         62     N
8         76     N
9         73     N
10        90     N
11        92     N
12        76     N
13        71     N
14        74     N
15        58     S
16        57     S
17        45     S
18        56     S
19        66     S
20        60     S
21        49     S
22        51     S
23        52     S
24        65     S
25        50     S
26        48     S
27        58     S
28        57     S
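A quick way to see the mean hatching count in each group before testing is the base R function tapply() (not covered elsewhere in these tutorials); a sketch:

> tapply(my.data$egg.count,my.data$trial,mean)
       N        S 
77.57143 55.14286 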
With the data set up in this fashion, the glm() function can be used for the test by including the categorical variable as a predictor (independent) variable. The formula is y~x, where y is the full data set, and x is the categorical vector that indicates which group the data belong to.

> fit.ind=glm(egg.count~trial, data=my.data)

The data option in glm() indicates to R where to look for the data.

> summary(fit.ind)

Call:
glm(formula = egg.count ~ trial)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-15.5714   -4.7143   -0.5714    3.0000   14.4286  

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   77.571      1.942  39.939  < 2e-16 ***
trialS       -22.429      2.747  -8.165 1.20e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 52.81319)

    Null deviance: 4894.4  on 27  degrees of freedom
Residual deviance: 1373.1  on 26  degrees of freedom
AIC: 194.45

Number of Fisher Scoring iterations: 2

The output from this type of t-test displays two sets of t and p-values: 'Intercept' and 'trialS'. Figure 3.3 illustrates the meaning of each. The 'Intercept' is the estimated mean for one treatment, and the 'trialS' row (which will have a different label depending on the glm formula used) is the difference between the groups. The second set of values, labelled trialS, should be used because it reflects the treatment effect. For all other t-tests the Intercept row is used, because the hypothesis asks whether the mean value differs from a given value.

Figure 3.3 Determining which values to use from the summary output. For one-sample and paired two-sample outputs, the Intercept values can be used. For an independent two-sample t-test, the second set of values (in this case labelled trialS) must be used.

From the summary, p<0.05, so we reject the null hypothesis and conclude that stream flow changes the mean hatching rate of fish eggs.

Function List

Here is a list of the functions covered in Tutorial 3; see the quick Reference Card for more details.
Functions: qt( ), qf( ), qchisq( ), glm( ), summary( ).

Tutorial 4: Regression and correlation in R
B. Wiltse

This tutorial covers linear regression, the assumptions and appropriate significance tests associated with it, and correlation. The equation for a linear regression is given by

Y = a + bX

where Y is the response variable (dependent), X is the explanatory variable (independent), and the coefficients a and b are the intercept and slope respectively. We use linear regression to predict the response variable (Y) based on the explanatory variable (X). The regression coefficients (a and b) are calculated from your data using the method of least squares (see your textbook for more details). Here we can use the glm() function in R, which stands for Generalized Linear Model, to fit the model and to do hypothesis testing on the regression parameters. To illustrate linear regression, we can use an example data set already in R called 'cars'. To see the data, type:

> cars

You should see two columns, speed and dist, which indicate the speed the car is traveling at the time of braking and the distance the car travels before coming to a stop.
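A scatter plot of the raw data is a good first step before fitting any regression; a quick sketch using the plotting tools from Tutorial 2 (the axis labels are our own additions; the cars data are recorded in mph and feet):

> plot(cars$speed,cars$dist,xlab='Speed (mph)',ylab='Stopping distance (ft)')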
To reduce clutter in the regression statement code, we can assign these columns to new vector names:

> speed=cars$speed
> dist=cars$dist

Fitting the linear regression

The linear regression is entered into the glm() function in the format Y~X, where Y is the dependent variable, X is the independent variable, and the tilde means “given by.” For the cars data set, we want to predict the braking distance given by the speed of the car, so the code is input as follows:

> glm(dist~speed)

which produces the following output:

Call:  glm(formula = dist ~ speed)

Coefficients:
(Intercept)        speed
    -17.579        3.932

Degrees of Freedom: 49 Total (i.e. Null);  48 Residual
Null Deviance:      32540
Residual Deviance: 11350        AIC: 419.2

To reduce clutter in the code, it first helps to save the fitted statistical model (i.e., the glm object) to a variable. For example, we can type

> my.fit=glm(dist~speed)

The basic output of glm() is rather short, but there is a lot more going on behind the scenes. The output of glm(), as with many R functions, is an object that contains a range of information about the statistical model, and we can use extractor functions to pull information out of the object. A list of the components stored in the object can be seen using the names() function.

> names(my.fit)
 [1] "coefficients"      "residuals"         "fitted.values"
 [4] "effects"           "R"                 "rank"
 [7] "qr"                "family"            "linear.predictors"
[10] "deviance"          "aic"               "null.deviance"
[13] "iter"              "weights"           "prior.weights"
[16] "df.residual"       "df.null"           "y"
[19] "converged"         "boundary"          "model"
[22] "call"              "formula"           "terms"
[25] "data"              "offset"            "control"
[28] "method"            "contrasts"         "xlevels"

We can use extractor functions on the fitted model to gather more information. To do this we simply include the fitted object inside the extractor function. For example,

> formula(my.fit)
dist ~ speed

gives us the formula used in our model. A basic extractor function that can be used on almost all R objects is the summary() function, which gives a list of the more pertinent information about that object. We will use a slightly modified version of this function that gives information particularly useful for linear regression, which is summary.lm(). If you are curious, you can run both summary() and summary.lm() on the glm object to see the difference, but we will use summary.lm() as it provides more relevant information for linear regression. Let’s go ahead and run this on our model.

> summary.lm(my.fit)

Call:
glm(formula = dist ~ speed)

Residuals:
    Min      1Q  Median      3Q     Max
-29.069  -9.525  -2.272   9.215  43.201

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791     6.7584  -2.601   0.0123 *
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511,     Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.490e-12

There is a lot of information in the output, and it is useful to discuss each piece individually:

1. The first lines indicate the formula and variables used in the glm() function, which is handy when you save the output and come back to it at a later time.

Call:
glm(formula = dist ~ speed)

2. The next lines show the distribution of residuals in the form of quantiles, which gives a qualitative sense of the residual characteristics.

Residuals:
    Min      1Q  Median      3Q     Max
-29.069  -9.525  -2.272   9.215  43.201
3. The next section shows the estimated linear regression coefficients, and one-sample tests of hypothesis for each.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791     6.7584  -2.601   0.0123 *
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The first column shows each parameter in the model (i.e., intercept, slope), which will vary depending on what terms were in the original model. The next column gives the estimate for each parameter. In this example, the estimated intercept is a = -17.58, and the estimated slope for the speed covariate is b = 3.93. We are often interested in whether the slope of the regression line (b) or intercept (a) is significantly different from zero, which can be evaluated using a one-sample t-test. The next two columns show the standard error of the estimate for each coefficient, as well as the observed t-score. The t-test can be done by comparing the observed t-score against the critical t-score (using the qt() function). For this example, the critical t-score for the slope parameter is 2.01 (df=48, two-tailed, α=0.05), which is less than the observed absolute value of 9.464, so we reject the null hypothesis that the slope is equal to zero. Equivalently, we can conduct the test using the p-value, which is shown in the last column (and assumes a two-tailed test). Since the p-value is less than 0.05, we come to the same conclusion. R includes a series of graphical significance codes next to the p-values to help give you a quick assessment of the significance of your regression parameters.

4. The final section of the output provides information that is useful for analysis of variance and correlation. The top line in this section gives the residual standard error, which is a measure of the variation of the observations around the fitted line. The smaller this number, the closer the observed values are to the fitted line.

Residual standard error: 15.38 on 48 degrees of freedom

5. The next line is the R2 value of the regression, which you can think of as the % variance explained. If the X values perfectly predicted the Y values, then the R2 value would be 1.0. One problem with R-squared values is that they tend to be biased upward, particularly when the sample size is small relative to the number of model parameters, so comparing R2 values between regressions with different sample sizes can be misleading. The adjusted R2 value corrects for this, and is therefore a more accurate estimate of the % variance explained. In our case, we see that about 64% of the variation in braking distance is explained by the speed of the car.

Multiple R-squared: 0.6511,     Adjusted R-squared: 0.6438

6. The final line shows the F-statistic, which tests whether the ratio of the explained variance over the unexplained variance is different from one. The F-test in a linear regression is equivalent to a t-test of whether the slope is different from zero, but this equivalence does not extend to other types of statistical model.

F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.490e-12

Using the qf() function, the critical F-score is about 4.04. Since the observed F-score is greater than the critical F-score, we reject the null hypothesis that the ratio is equal to one. Equivalently, we could perform the test using the p-value provided.

Plotting the fit linear regression

After fitting the linear regression using glm(), plot the raw data and fit line to see if the fit statistical model makes sense.
Begin by plotting the raw data:

> plot(speed, dist, xlab='Speed (mph)', ylab='Braking Distance (feet)')

The fit linear regression can be added using the abline() function, which plots a line with the intercept (a) and slope (b) taken from the fit glm object.

> abline(my.fit)

You should see the plot shown in Figure 4.1, which suggests that a linear regression fits the data fairly well.

Figure 4.1 Scatterplot (points) and linear regression (line) of stopping distance by speed.

Evaluating the assumptions of linear regression

There are four key assumptions that need to be met before the results of a linear regression can be trusted. They are: 1) the relationship between the independent and dependent variable is linear; 2) the residuals are Normally distributed; 3) the residuals have equal variance across the range of the independent variable (homoscedasticity); and 4) the residuals are independent.

Evaluating the assumptions of linear regression begins with a qualitative assessment using two kinds of plots: residual plots and histograms. A plot of the residuals (y-axis) against the predicted values (x-axis) allows you to visualize the assumptions of linearity and homoscedasticity. Try the fitted() function

> fitted(my.fit)

to see the predicted value for each value of the independent variable (speed in this example), and the resid() function

> resid(my.fit)

to see the residuals. To plot the residuals against the predicted values, type:

> plot(fitted(my.fit), resid(my.fit))

The abline() function can be used to add a horizontal line that helps visualize trends.

> abline(0,0)

The resulting plot (Figure 4.2) suggests that a linear relationship is appropriate, but there is a small trend of increasing residual variance as the predicted values increase. A histogram of the residuals is created using the hist() function:

> hist(resid(my.fit))

In our example, the histogram (Figure 4.3) suggests that the residuals may be slightly skewed. Plots of the raw data, fit line, and residuals provide a qualitative overview of the linear regression. All four assumptions, however, can be addressed more formally with the gvlma() function. The gvlma() function is not part of the base R software and must first be installed using

> install.packages('gvlma')

Load the gvlma library (this needs to be done every time you start R) using

> library(gvlma)

To run the gvlma() function on our model, type:

> gvlma(lm(my.fit))

The lm() function is important here because it converts the glm object into an lm object, as required by the gvlma package.

Figure 4.2 Plot of the residuals vs fitted values.

The output is

Call:
lm(formula = my.fit)

Coefficients:
(Intercept)        speed
    -17.579        3.932

ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05

Call:
 gvlma(x = lm(my.fit))

                    Value  p-value                   Decision
Global Stat        15.801 0.003298 Assumptions NOT satisfied!
Skewness            6.528 0.010621 Assumptions NOT satisfied!
Kurtosis            1.661 0.197449   Assumptions acceptable.
Link Function       2.329 0.126998   Assumptions acceptable.
Heteroscedasticity  5.283 0.021530 Assumptions NOT satisfied!

The first part of the output shows the basic regression parameters and the level of significance used to evaluate the assumptions. The latter part of the output shows the results of the assumption tests.
The ‘Global Stat’ test statistic is a combination of the four component tests: ‘Skewness’, ‘Kurtosis’, ‘Link Function’, and ‘Heteroscedasticity’. The component tests measure departures from the four assumptions. Violations of the assumption of a linear relationship will influence ‘Skewness’ and ‘Kurtosis’; violations of Normally distributed residuals will influence ‘Kurtosis’ and ‘Link Function’; and violations of independence and homoscedasticity will influence ‘Heteroscedasticity’. For our example, the ‘Global Stat’ indicates a violation of the assumptions through skewness and heteroscedasticity, which confirms our earlier graphical assessment (Figure 4.2). Thus, a transformation or a modified statistical model is required before drawing conclusions about this data set.

Correlation

Correlation measures the tendency of two dependent variables to change together (i.e., when one increases the other increases, or when one decreases the other decreases). Pearson’s correlation coefficient is used to estimate this tendency, and is given by

r = Σ(xi − x̄)(yi − ȳ) / √( Σ(xi − x̄)² Σ(yi − ȳ)² )

The correlation coefficient (r) varies from -1 to 1, where the absolute value of the coefficient represents the strength of the correlation and the sign represents the direction. A positive correlation means that when one variable increases or decreases the other variable does the same; a negative correlation means that when one variable increases or decreases the other does the opposite. The Pearson correlation coefficient can be obtained from the glm() function by taking the square root of the multiple R-squared value given in the regression summary. In our example, the Multiple R-squared is 0.6511, which yields a correlation coefficient of r=0.807.

> sqrt(0.6511)
[1] 0.8069077

A simpler alternative is the cor.test() function (e.g., cor.test(speed, dist)), which computes the correlation coefficient and its significance test directly, rather than pulling them from the glm output.

Figure 4.3 Histogram of the residuals.

We can test the null hypothesis that the correlation equals zero by looking at the p-value of the slope parameter from the regression. The reason we can do this is that the correlation coefficient can be converted to a t-distributed variable, at which point the outcome of a t-test on the correlation coefficient is the same as that of the test on the slope coefficient.

Function List

Here is a list of the functions and options covered in Tutorial 4; see the quick Reference Card for more details.

Functions: glm( ), names( ), summary.lm( ), abline( ), resid( ), gvlma( ).

Tutorial 5: One-Factor ANOVA in R
W.A. Nelson

Analysis of variance (ANOVA) is similar to linear regression, but uses categorical rather than numerical variables as independent variables. In fact, there is no difference in the machinery used to fit ANOVA models and linear regressions, so we can use the now familiar glm() function. The only differences we need to worry about in R are how the data frame is created, and some nuances of hypothesis testing. A single-factor ANOVA means that we are interested in one factor (e.g., nutrient concentration), but there is more than one level in the factor (e.g., 0μM, 1.2μM, 3.2μM, and 5μM of Nitrogen).

Creating data frames for ANOVA

Similar to a two-sample t-test (which is a one-factor ANOVA with two levels), the data must be in stacked form, with one column as the response variable (dependent variable) and one or more columns as ‘coded’ categorical variables (independent variables).
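If your data start in unstacked (wide) form, with one column per group, base R’s stack() function will convert them to stacked form for you. A minimal sketch with made-up wide data (the column names here are hypothetical):

> wide=data.frame('control'=c(1.2, 1.5, 1.1), 'treated'=c(2.1, 2.4, 2.6))
> stack(wide)
  values     ind
1    1.2 control
2    1.5 control
3    1.1 control
4    2.1 treated
5    2.4 treated
6    2.6 treated

The ‘values’ column holds the stacked data, and the ‘ind’ column is the coding variable.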
Consider the following example of the influence that substrate type has on the growth rate of benthic algae. The algae were relocated from a common substrate (pebbles) to each of four test substrate types, and the area of living algae was used to calculate the per-capita growth rate. Each column is a different substrate and each row is a replicate.

Sand   Silt   Pebbles   Glass
1.45   1.24     2.24     1.18
0.76   1.93     3.71     0.59
1.11   1.96     2.92     0.52
1.71   2.20     3.01    -0.74
0.97   3.93     6.33    -0.99

To create the data set, first create the response vector and the categorical vector

> growth=c(1.45, 0.76, 1.11, 1.71, 0.97, 1.24, 1.93, 1.96, 2.20, 3.93, 2.24, 3.71, 2.92, 3.01, 6.33, 1.18, 0.59, 0.52, -0.74, -0.99)
> substrate=c('sand', 'sand', 'sand', 'sand', 'sand', 'silt', 'silt', 'silt', 'silt', 'silt', 'pebbles', 'pebbles', 'pebbles', 'pebbles', 'pebbles', 'glass', 'glass', 'glass', 'glass', 'glass')

and combine these into a data frame

> my.data=data.frame('growth'=growth, 'substrate'=substrate)

Type my.data to see the contents of the data frame, and compare this with the above table.

   growth substrate
1    1.45      sand
2    0.76      sand
3    1.11      sand
4    1.71      sand
5    0.97      sand
6    1.24      silt
7    1.93      silt
8    1.96      silt
9    2.20      silt
10   3.93      silt
11   2.24   pebbles
12   3.71   pebbles
13   2.92   pebbles
14   3.01   pebbles
15   6.33   pebbles
16   1.18     glass
17   0.59     glass
18   0.52     glass
19  -0.74     glass
20  -0.99     glass

To get an impression of the influence of substrate type on algal growth rate, begin by plotting the data with the boxplot() function (Figure 5.1)

> boxplot(growth~substrate, ylab='Algal Growth Rate (/day)', xlab='Substrate Type')

The plot suggests that algae have higher growth rates on pebbles, and lower growth rates on glass. Let’s run the model to see whether these trends are statistically significant.

Fitting the ANOVA model

The ANOVA model is fit using the glm() function just as was done for linear regression, but the X variable is now categorical. This difference is dealt with ‘behind the scenes’ in R, so we do not need to make any changes to the formula.

> my.fit=glm(growth~substrate, data=my.data)

Figure 5.1 Box plot of algal growth rate by substrate.

The data=my.data option tells R where to look for the variables. The output is the same as for a linear regression:

> summary.lm(my.fit)

Call:
glm(formula = growth ~ substrate, data = my.data)

Residuals:
    Min      1Q  Median      3Q     Max
-1.4020 -0.6545 -0.1600  0.4255  2.6880

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)
(Intercept)        0.1120     0.4770   0.235  0.81733
substratepebbles   3.5300     0.6746   5.233  8.2e-05 ***
substratesand      1.0880     0.6746   1.613  0.12631
substratesilt      2.1400     0.6746   3.172  0.00591 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.067 on 16 degrees of freedom
Multiple R-squared: 0.6516,     Adjusted R-squared: 0.5862
F-statistic: 9.973 on 3 and 16 DF,  p-value: 0.0006026

As with the independent t-test, the statistical tests presented are all relative to one factor level (Figure 3.3). In this example, they are all relative to the ‘glass’ treatment, which you can tell because ‘glass’ is missing from all of the coefficient labels (e.g., substratepebbles). Let’s go through the output line by line.

1. The first line is labeled (Intercept), which in this example is the glass substrate (analogous to the intercept in an independent t-test, see Figure 3.3).
The value under the ‘Estimate’ heading is the least square means (LS means) estimate from the fit model. The LS mean per-capita algal growth rate on glass is 0.1120 per day. The t-test tests the hypothesis that the estimate is different from zero. It is a two-tailed test, and may not always be the one you want to evaluate. In this example, p>0.05, so we fail to reject the null hypothesis that the growth rate on glass is zero.

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.1120     0.4770   0.235  0.81733

2. The second line is the first of the treatment levels relative to the glass treatment. The LS mean algal growth rate is 3.53 per day more than glass, which means the actual growth rate estimate is 3.64 per day. The t-test tests the null hypothesis that the difference in algal growth rate relative to glass is zero. Since p<0.05, we can conclude that the pebbles substrate had an influence on growth relative to glass.

substratepebbles   3.5300     0.6746   5.233  8.2e-05 ***

3. The third line is the next treatment level relative to the glass treatment. The LS mean algal growth rate is 1.088 per day more than glass, which means the actual growth rate estimate is 1.2 per day. Since p>0.05, we fail to reject the null hypothesis that the difference in growth rate relative to glass is zero.

substratesand      1.0880     0.6746   1.613  0.12631

4. The last line is the final treatment level relative to the glass treatment. The LS mean algal growth rate is 2.14 per day more than glass, which means the actual growth rate estimate is 2.25 per day. Since p<0.05, we can conclude that the silt substrate had an influence on growth relative to glass.

substratesilt      2.1400     0.6746   3.172  0.00591 **

The remainder of the output is interpreted the same as for a linear regression (see Tutorial 4 for details).

The output from the summary.lm() function provides t-test evaluations of each term against zero for each level. The F-table will let you evaluate the general significance of the factor (substrate in this case) in the model. Use the aov() function to get the F-table:

> summary(aov(my.fit))
            Df  Sum Sq Mean Sq F value    Pr(>F)
substrate    3  34.033 11.3443  9.9726 0.0006026 ***
Residuals   16  18.201  1.1376
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The line of interest is the one labelled ‘substrate’, which is the hypothesis test for the substrate factor. Since p<0.05, we conclude that the substrate had an influence on the per-capita algal growth rate.

Figure 5.2 Plot of residuals against predicted values.

Evaluating the assumptions of ANOVA

The evaluation of assumptions for a one-factor ANOVA is similar to that for linear regression. Begin by plotting the residuals. For ANOVAs, this is done by plotting them against the fitted mean for each level (which are the LS means).

> plot(fitted(my.fit), resid(my.fit))

The variance appears to change across the substrate types (Figure 5.2), which we will investigate quantitatively below. The histogram of raw residuals also appears slightly skewed (Figure 5.3).

> hist(resid(my.fit))

Figure 5.3 Histogram of residuals.

A more formal test is done using the gvlma() function in the gvlma library (see Tutorial 4 for details).
> library(gvlma)
> gvlma(lm(my.fit))

Call:
lm(formula = my.fit)

Coefficients:
     (Intercept)  substratepebbles     substratesand     substratesilt
           0.112             3.530             1.088             2.140

ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05

Call:
 gvlma(x = lm(my.fit))

                       Value p-value                   Decision
Global Stat        7.648e+00 0.10538   Assumptions acceptable.
Skewness           4.057e+00 0.04398 Assumptions NOT satisfied!
Kurtosis           1.131e+00 0.28748   Assumptions acceptable.
Link Function      1.599e-15 1.00000   Assumptions acceptable.
Heteroscedasticity 2.459e+00 0.11687   Assumptions acceptable.

The output of the gvlma test indicates that while the residuals are slightly skewed, overall the assumptions are sufficiently well met.

Posteriori contrasts

The F-test indicated that the substrates had an impact on algal growth, and the summary of the fit model (i.e., from summary.lm()) tested each level against the glass substrate. However, since glass is not a control for this experiment, we are interested in how all substrates compare to each other (e.g., pebbles versus sand). To evaluate these hypotheses, we use posteriori contrasts. One way to do this is to use Tukey’s honest significant difference test, which compares all possible pair-wise contrasts. In R, this is done using the TukeyHSD() function.

> TukeyHSD(aov(my.fit))
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = my.fit)

$substrate
                diff        lwr        upr     p adj
pebbles-glass  3.530  1.6000921  5.4599079 0.0004306
sand-glass     1.088 -0.8419079  3.0179079 0.3994762
silt-glass     2.140  0.2100921  4.0699079 0.0272350
sand-pebbles  -2.442 -4.3719079 -0.5120921 0.0110907
silt-pebbles  -1.390 -3.3199079  0.5399079 0.2078944
silt-sand      1.052 -0.8779079  2.9819079 0.4276917

The aov() function is used to pass the proper elements of the ANOVA fit to the TukeyHSD() function. Each line of the output gives a test between two substrate types. For example, the fourth line tests the null hypothesis that the growth rate does not differ between the sand and pebble substrates. Since p<0.05, we conclude that the per-capita growth rates are different between these two substrates.

Function List

Here is a list of the functions and options covered in Tutorial 5; see the quick Reference Card for more details.

Functions: glm( ), TukeyHSD( ), aov( ).

Tutorial 6: Two-Factor ANOVA in R
W.A. Nelson

The previous tutorial considered ANOVAs with just a single factor. However, it is often the case that you are interested in the influence of two factors (e.g., nutrients and temperature), and whether they interact with each other (e.g., high temperature amplifies the impact of nutrients). The overall analysis for two factors is similar to the one-factor case, but they differ in a number of details.

Creating data frames for ANOVA

Similar to a single-factor ANOVA, the data must be in stacked form, with one column as the response variable (dependent variable) and two columns as ‘coded’ categorical variables (independent variables). Consider the following example that looks at the influence of pine tree sap on the growth of a wood fungus. Different species of pine trees contain different concentrations of antifungal agents.
The following table shows growth rates (mm/day) of wood fungus in agar (which provides a controlled medium to evaluate growth) containing sap from two species in isolation, a mix of the two, and a control (i.e., no tree sap). Each row is a replicate.

Treatment A    Treatment B    Treatment C            Treatment D
Pinus taeda    Pinus pinea    P. taeda & P. pinea    Control
3.08           3.22           2.31                   4.99
3.52           2.42           2.51                   5.21
2.61           2.48           2.51                   5.03
2.36           2.45           2.15                   5.29
2.66           2.58           2.09                   5.02
3.00           2.67           2.30                   4.06

Since the data set has observations for all combinations of presence/absence of both species, it is a two-factor experiment, as shown in the following table:

                                Pinus taeda
                          absent         present
Pinus pinea   absent      Treatment D    Treatment A
              present     Treatment B    Treatment C

To create the data set, first create the response vector and the two categorical vectors.

> Growth=c(3.08, 3.52, 2.61, 2.36, 2.66, 3.00, 3.22, 2.42, 2.48, 2.45, 2.58, 2.67, 2.31, 2.51, 2.51, 2.15, 2.09, 2.30, 4.99, 5.21, 5.03, 5.29, 5.02, 4.06)
> P.taeda=c('yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'no', 'no', 'no', 'no', 'no', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'no', 'no', 'no', 'no', 'no', 'no')
> P.pinea=c('no', 'no', 'no', 'no', 'no', 'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'yes', 'no', 'no', 'no', 'no', 'no', 'no')

Here we have used ‘yes’ to denote that sap from a species is present, and ‘no’ to denote that it is absent. Even though two of the vectors are categorical, all three can be combined into a single data frame

> my.data=data.frame('Growth'=Growth, 'P.taeda'=P.taeda, 'P.pinea'=P.pinea)

Type my.data to see the contents of the data frame, and compare this with the two-factor table shown above.

> my.data
   Growth P.taeda P.pinea
1    3.08     yes      no
2    3.52     yes      no
3    2.61     yes      no
4    2.36     yes      no
5    2.66     yes      no
6    3.00     yes      no
7    3.22      no     yes
8    2.42      no     yes
9    2.48      no     yes
10   2.45      no     yes
11   2.58      no     yes
12   2.67      no     yes
13   2.31     yes     yes
14   2.51     yes     yes
15   2.51     yes     yes
16   2.15     yes     yes
17   2.09     yes     yes
18   2.30     yes     yes
19   4.99      no      no
20   5.21      no      no
21   5.03      no      no
22   5.29      no      no
23   5.02      no      no
24   4.06      no      no

The new data frame has the same structure as we used for one-factor ANOVAs, but with two categorical variables we can investigate a two-factor statistical model, including a test for an interaction between the two species. Let’s begin by plotting the data with the boxplot() function as before (Figure 6.1)

> boxplot(Growth~P.taeda+P.pinea, ylab='Fungal Growth (mm/day)', xlab='Presence/Absence of Pine sap (P.taeda:P.pinea)')

It is easy to see that the fungal growth rate decreases with the presence of sap from either species of pine, but it is more difficult to see whether there is an interaction between the species. To see the interaction more clearly, we can use the interaction.plot(X1,X2,Y) function, where X1 and X2 are the categorical predictor variables, and Y is the quantitative response variable. The interaction plot shows the mean of each treatment cell, with the levels of the X1 variable on the x-axis. Each level of the X2 predictor variable is shown with a different style of line and a different number.
> interaction.plot(P.taeda, P.pinea, Growth, type='b')

The interaction plot (Figure 6.2) shows that the presence of sap from either species of pine reduces the fungal growth rate, and that there is a negative interaction between the species, such that the presence of both is not as effective as you would expect from the reductions in growth caused by each species in isolation.

Figure 6.2 Interaction plot of fungal growth by treatment.

Fitting the ANOVA model

The ANOVA model is fit using the glm() function as before. For the two-factor ANOVA, we must add both predictor variables to the formula, as well as their interaction. This is done using the ‘+’ and ‘:’ symbols as y~X1+X2+X1:X2, where the ‘+’ symbol separates the various independent variables. Writing the variable name denotes a main effect, and the ‘:’ symbol denotes an interaction.

> my.fit=glm(Growth~P.taeda + P.pinea + P.taeda:P.pinea, data=my.data)

The data=my.data option tells R where to look for the variables. The output is the same as for a linear regression:

> summary.lm(my.fit)

Call:
glm(formula = Growth ~ P.taeda + P.pinea + P.taeda:P.pinea, data = my.data)

Residuals:
     Min       1Q   Median       3Q      Max
-0.87333 -0.19292  0.01583  0.19833  0.64833

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)             4.9333     0.1428  34.548  < 2e-16 ***
P.taedayes             -2.0617     0.2019 -10.209 2.23e-09 ***
P.pineayes             -2.2967     0.2019 -11.373 3.49e-10 ***
P.taedayes:P.pineayes   1.7367     0.2856   6.081 6.07e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3498 on 20 degrees of freedom
Multiple R-squared: 0.9118,     Adjusted R-squared: 0.8986
F-statistic: 68.96 on 3 and 20 DF,  p-value: 1.006e-10

The output reveals that the sap from both pine species, as well as their interaction, differs significantly from the control. Let’s work through this line by line.

Figure 6.1 Box plot of fungal growth by treatment.

1. The first line is labeled (Intercept), which in this example is the control treatment (analogous to the intercept in an independent t-test, see Figure 3.3). The value under the ‘Estimate’ heading is the least square means (LS means) estimate from the fit model. The LS mean fungal growth rate is 4.93 mm/day. The t-test tests the hypothesis that the estimate is different from zero. It is a two-tailed test, and may not always be the one you want to evaluate. In this example, the growth rate of the control is different from zero (p<0.05).

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.9333     0.1428  34.548  < 2e-16 ***

2. The main effect of Pinus taeda sap on fungal growth relative to the control. The LS mean fungal growth rate is 2.06 mm/day less than the control, which means the actual growth rate is 2.87 mm/day (try this for yourself). The t-test tests the null hypothesis that the difference in fungal growth rate relative to the control is zero. Since p<0.05, we can conclude that the Pinus taeda sap had an influence on fungal growth.

P.taedayes             -2.0617     0.2019 -10.209 2.23e-09 ***

3. The main effect of Pinus pinea sap on fungal growth relative to the control. The LS mean fungal growth rate is 2.30 mm/day less than the control, which means the actual growth rate is 2.63 mm/day.
Since p<0.05, we can conclude that the Pinus pinea sap had an influence on fungal growth.

P.pineayes             -2.2967     0.2019 -11.373 3.49e-10 ***

4. The final line shows the interaction between the Pinus taeda and Pinus pinea sap on fungal growth, relative to the control and to the main effects of each. If the two tree species were purely additive, then the fungal growth rate would be 4.36 mm/day slower than the control (2.06+2.30). The LS mean fungal growth rate is 1.74 mm/day faster than the purely additive case, which means the actual growth rate is 2.31 mm/day. The t-test tests the hypothesis that the interaction, which is the difference between the observed growth and the purely additive case, is zero. Since p<0.05, we conclude that there is an interaction between the sap of the two species.

P.taedayes:P.pineayes   1.7367     0.2856   6.081 6.07e-06 ***

The remainder of the output is interpreted the same as for a linear regression (see Tutorial 4 for details).

The output from the summary.lm() function provides t-test evaluations of each term against zero, which is not always useful. The F-table, however, will let you evaluate the general significance of each factor in the model, rather than just a test against zero. To get the F-table, we can use the aov() function:

> summary(aov(my.fit))
                Df  Sum Sq Mean Sq F value    Pr(>F)
P.taeda          1  8.5443  8.5443  69.839 5.900e-08 ***
P.pinea          1 12.2408 12.2408 100.054 3.149e-09 ***
P.taeda:P.pinea  1  4.5240  4.5240  36.978 6.066e-06 ***
Residuals       20  2.4468  0.1223
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The first line is the F-test for the main effect of Pinus taeda, the second line is the F-test for the main effect of Pinus pinea, and the last line is the F-test for the interaction.

Evaluating the assumptions of ANOVA

The evaluation of assumptions for a two-factor ANOVA is the same as for a one-factor ANOVA. Begin by plotting the residuals against the fitted values (which are the LS means).

> plot(fitted(my.fit), resid(my.fit))

The variance appears relatively constant across the fitted values (Figure 6.3). The histogram of raw residuals also appears roughly normal, given the small number of replicates.

> hist(resid(my.fit))

A more formal test is done using the gvlma() function in the gvlma library (see Tutorial 4 for details).

> library(gvlma)
> gvlma(lm(my.fit))

Call:
lm(formula = my.fit)

Coefficients:
          (Intercept)             P.taedayes             P.pineayes  P.taedayes:P.pineayes
                4.933                 -2.062                 -2.297                  1.737

ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05

Call:
 gvlma(x = lm(my.fit))

                     Value p-value                Decision
Global Stat        1.49217  0.8280 Assumptions acceptable.
Skewness           0.57877  0.4468 Assumptions acceptable.
Kurtosis           0.89226  0.3449 Assumptions acceptable.
Link Function      0.00000  1.0000 Assumptions acceptable.
Heteroscedasticity 0.02114  0.8844 Assumptions acceptable.

The output of the gvlma test indicates that the assumptions of linearity, normality, homoscedasticity, and independence of the residuals are satisfied in our example. Since the assumptions are satisfied, we can move on to interpret the analysis in more detail.

Figure 6.3 Plot of residuals by the predicted value.

Posteriori contrasts

The fit ANOVA model indicates a significant interaction, which means that the sap from the two species of pine has a non-additive impact on fungal growth rates.
It also means that the main effects (e.g., Pinus taeda influences fungal growth rates) are not directly interpretable, because each main effect includes treatments with the other species as well. For example, it could be that Pinus taeda in isolation does not influence fungal growth rates, but in the presence of Pinus pinea it does. Posteriori contrasts provide a way to disentangle interactions by comparing subcomponents of the ANOVA table (e.g., Pinus taeda in isolation versus the control). One way to do this is to use Tukey’s honest significant difference test, which compares all possible pair-wise contrasts. In R, this is done using the TukeyHSD() function.

> TukeyHSD(aov(my.fit))
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = my.fit)

$P.taeda
            diff       lwr        upr p adj
yes-no -1.193333 -1.491197 -0.8954692 1e-07

$P.pinea
            diff       lwr       upr p adj
yes-no -1.428333 -1.726197 -1.130469     0

$`P.taeda:P.pinea`
                       diff        lwr          upr     p adj
a) yes:no-no:no   -2.061667 -2.6268893 -1.496444013 0.0000000
b) no:yes-no:no   -2.296667 -2.8618893 -1.731444013 0.0000000
c) yes:yes-no:no  -2.621667 -3.1868893 -2.056444013 0.0000000
d) no:yes-yes:no  -0.235000 -0.8002227  0.330222654 0.6557255
e) yes:yes-yes:no -0.560000 -1.1252227  0.005222654 0.0527069
f) yes:yes-no:yes -0.325000 -0.8902227  0.240222654 0.3960647

The aov() function is used to pass the proper elements of the ANOVA fit to the TukeyHSD() function, and the letters a-f label the six pairwise contrasts for reference in the discussion below. Let’s go through each section of the output from the Tukey’s test.

1. The first section indicates that the Tukey test is using a 5% error rate across all contrasts.

2. The next two sections ($P.taeda and $P.pinea) show the results for the main effects, which are of little value here because of the significant interaction.

3. The final section ($`P.taeda:P.pinea`) is really what we are after. It shows the results of all pair-wise contrasts between the treatments. For example, the first line is ‘yes:no-no:no’, which is the contrast between the treatment with just Pinus taeda (read as P.taeda present, P.pinea absent) and the control (read as P.taeda absent, P.pinea absent). Since p<0.05, we can conclude that Pinus taeda in isolation causes a change in fungal growth rate.

The Tukey’s test indicates that each pine species had an influence on fungal growth rates, in isolation and in combination, relative to the control (contrasts a, b & c). The two species of pine did not differ significantly in their effect on fungal growth in isolation (contrast d), and the addition of Pinus taeda to Pinus pinea had no significant effect relative to Pinus pinea in isolation (contrast f). The interaction is thus mostly the result of adding Pinus pinea to Pinus taeda relative to Pinus taeda in isolation (contrast e), although note that this contrast is marginal (p = 0.053).

Function List

Here is a list of the functions and options covered in Tutorial 6; see the quick Reference Card for more details.

Functions: glm( ), TukeyHSD( ), aov( ), interaction.plot( ).

Reference Cards

Functions

Function | Description | Tutorial
c( x, y, z, ... ) | Creates a data vector from the x, y, z, ... entries. | 1
max( x ), min( x ) | Returns the maximum/minimum value in the x vector. | 1, 2
sum( x ) | Returns the sum of all entries in the x vector. | 1
mean( x ), median( x ) | Returns the mean or median of the x vector. | 1, 2
data.frame( ‘X’=x, ‘Y’=y, ... ) | Creates a data frame of the vectors x, y, ..., each with the titles given by ‘X’, ‘Y’, ... | 1
t( x ) | Returns the transpose of the data frame x. | 2
as.matrix( x ) | Converts a data frame x into a matrix object. | 2
read.csv( ) | Reads in .csv files (comma-separated values). Use in conjunction with file.choose(). | 1
file.choose( ) | Brings up a graphical window to select the file to upload. Use in conjunction with read.csv(). | 1
hist( x ) | Creates a histogram of the x vector. See Plotting Options to customize your plots. | 1, 2
plot( x , y ) | Creates a scatter plot of the x and y vectors. See Plotting Options to customize your plots. | 1, 2
interaction.plot( x1, x2, y ) | Creates a line plot of the mean response for each combination of x1 and x2. The x1 categories are shown on the x-axis, and the x2 categories are denoted with different lines. | 6
sd( x ) | Returns the standard deviation of the x vector. | 2
quantile( x ) | Returns the quantiles of the x vector. | 2
table( x , y ) | Creates a contingency table from the vector x, using the categorical values in y. In this case, x is always a stacked vector. | 2
barplot( x ), barplot( x , beside=TRUE ) | Creates a bar plot of the data in x. If x is a vector, then it is a simple bar plot. If x is a data frame, then each column of x is a separate group. The default is a stacked bar plot, but the option beside=TRUE can be used to create a grouped bar plot. | 2
boxplot( x , y ) | Creates a box plot of the data in x. If provided, y is a categorical vector describing the grouping of x (i.e., x can be a stacked vector). | 2
abline( h=a ), abline( v=a ) | Draws a horizontal (‘h’) or vertical (‘v’) line on a plot that crosses the axis at a. | 2, 4
qt(1-α,df) or qt(1-α/2,df) | Displays the critical value from a t-distribution given a confidence value and the degrees of freedom. For a one-tailed t-test, the confidence value is 1-α. A two-tailed test requires 1-α/2 for the right-hand tail and α/2 for the left-hand tail. | 3
qf(1-α,df1,df2) | Displays the critical value from an F-distribution for a given confidence value and degrees of freedom. For the degrees of freedom, df1 is the numerator and df2 the denominator. | 3
qchisq(1-α,df) | Displays the critical chi-squared value for a given confidence value and degrees of freedom. | 3
glm(y-μ~1) (one-sample t-test) | Tests the null hypothesis that the mean of the data contained in the vector y equals the population mean μ. | 3
glm(y~1) (two-sample paired t-test) | Tests the null hypothesis that the mean paired difference, y, between two populations is zero. | 3
glm(y~x) (two-sample independent t-test, regression) | Two-sample t-test: tests the null hypothesis that the means of the two groups coded in x are equal. Regression: runs a regression on x and y, where y is your response variable and x is your explanatory variable. We can think of the tilde as “given by”, therefore y “given by” x. | 3, 4
summary( ), summary.lm( ) | Extracts additional information from a glm model and displays it on the screen. | 3, 4
names( ) | Shows the list of components stored in a certain object. | 4
fitted( ) | Lists the fitted values calculated from the regression coefficients. | 4
resid( ) | Lists the residuals, which are the differences between the fitted and observed values. | 4
gvlma(lm( )) | Used to assess the assumptions of a linear regression. lm() must be used inside this function to convert a glm object to an lm object. | 4
glm(y~A+B+A:B) | Performs an ANOVA on the dependent variable y and independent categorical variables A and B. The ‘~’ separates dependent from independent variables, the ‘+’ separates the various independent variables, and the ‘:’ denotes an interaction. | 5, 6
TukeyHSD(aov(my.fit)) | Performs a posteriori Tukey’s HSD test on the glm object ‘my.fit’. The aov() function is used to pass specific parts of the glm object to the TukeyHSD() function. | 5, 6
aov(my.fit) | Extracts the ANOVA table from a fit glm object. | 5, 6

Arithmetic, indexing and Logic Operators

Symbol | Description | Tutorial
+ | Addition | 1
- | Subtraction | 1
/ | Division | 1
* | Multiplication | 1
^ | Raise to the power (i.e., x^y is x raised to the power y) | 1
= | Assigns a value | 1
[,] | Used to access entries of data structures such as vectors, arrays and matrices. Commas are used to separate dimensions. | 1
() | Used for order of operations (i.e., do what is in the brackets first) and for the arguments of functions. | 1

Plotting Options

Option | Description | Tutorial
xlim, ylim = c(a,b) | Specifies the axis limits to be plotted using a vector with two entries. The first entry ‘a’ is the minimum, the second entry ‘b’ is the maximum. | 1, 2
xlab, ylab = “ ” | Labels the axes with what is written in the quotes. | 1, 2
main = “ ” | Labels the plot title with what is written in the quotes. | 2
col = | Specifies a color for the line/point/bar. Can be either a color name in quotes (e.g., ‘red’) or a number. Numbers one to eight define colors as 1-black, 2-red, 3-green, 4-blue, 5-cyan, 6-pink, 7-yellow and 8-grey. See colors() for a full list of color names. | 1, 2
lty = | Specifies the type of line as 1-solid, 2-dashed, 3-dotted, 4-dotdash, 5-longdash, 6-twodash. | 1
lwd = | Specifies the line width. Takes a positive number. | 1
pch = | Specifies the point type. Some useful values are 19-solid circle, 20-bullet (smaller circle), 21-filled circle, 22-filled square, 23-filled diamond, 24-filled triangle point-up and 25-filled triangle point-down. | 1
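Many of these plotting options can be combined in a single call. A brief sketch using the cars data from Tutorial 4 (the labels, colour, and axis limits here are arbitrary illustrative choices, not requirements):

> plot(cars$speed, cars$dist, xlab='Speed (mph)', ylab='Braking Distance (feet)',
+      main='Stopping Distance by Speed', col='blue', pch=19, xlim=c(0,30), ylim=c(0,130))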