Introduction to using R

The purpose of this manual is to help you learn how to use the free R software to produce graphs, run simulations, and perform data analyses. This manual describes how to obtain R and gives a brief introduction to the software.

Two methods are used to demonstrate R code. The first method uses screen copies. A screen copy shows exactly what you will see when running R. The red text shown in the screen copy is what you type. You do not need to type the initial “>” prompt at the start of each line. The blue text in the screen copy is output produced by R. A simple example is the screen copy under “Basic commands” of this lab that assigns the value 10 to the variable n. The second method is simply a display of the text you would type. You type the command after the “>” prompt, and R responds after you press the [Enter] or [Return] key. An example of this is the command under “Basic commands” of this lab that creates the variable heights.

What is R?

R is a computer language and environment that was developed with statistical graphics and analysis in mind. Consequently, it is commonly thought of as a statistical software package, like the proprietary Minitab and SPSS packages. In the growing atmosphere of free software, scientists are constantly making available new packages that enable R to perform very advanced modern statistics.

There are several consequences of R being free software developed by scientists for scientists. First, it is very powerful. If you continue in a career that depends heavily on statistics, such as economics, biology, medicine, or marketing, R will allow you to develop your own statistical functions specific to your own immediate needs. Second, it was created as a tool for scientists rather than for mass marketing to make money. Thus it is driven by line commands and lacks features such as pull-down menus and point-and-click commands. The result is software with a high “nerd factor,” as you will notice when looking at the help commands and manuals.
S-Plus, a proprietary software package, is almost identical to R with respect to line commands, but includes pull-down menus and some point-and-click commands.

Obtaining R

R is freely available from the website www.r-project.org, as are its online manuals. To install R on your computer, follow the links that originate from the “Download CRAN” link found on the left-hand side of the R project homepage. This link leads to a web page where you select the mirror site closest to your location, e.g., University of California at Berkeley, California, USA. If you are using Microsoft Windows, for example, you would then click on the “Windows (95 and later)” link, followed by the “base” link, and then the “rw2010.exe” link to install R.

Basic commands

The most essential features and commands to keep in mind when using R are:

• R is a line command driven software: commands are typed at the > prompt and followed by the [Enter] or [Return] key.

• Values are assigned using the two-key arrow <-, which is created by typing < followed by a dash. To see the value of a variable, simply type its name and press [Enter]. The following screen copy shows the basic command of assigning the value 10 to the variable n.

    > n <- 10
    > n
    [1] 10

  (Note: The bracketed [1] is used to count the values in the output, as you will notice when working with larger datasets.)

• The # sign is used to add comments to a command line. R ignores everything on the command line typed after #. For example, at the prompt, type the following command to assign 4, 6, and 3 to the variable “heights”.

    > heights <- c(4,6,3)   # c() combines the numbers into a vector

  Although comments are provided throughout this manual to give extra instruction, it is not necessary for you to type the comments to run the commands.

• R commands are followed by parentheses, with variables and options placed within the parentheses.
  Example:

    > sort( heights )
    [1] 3 4 6

  Typing the command without the parentheses will result in the software code flashing on the screen; however, it causes no harm.

• R is case sensitive, which means that R will distinguish between the variables Heights and heights.

• To list the objects you have available, use the ls() command. Example:

    > ls( )
    [1] "heights" "Heights" "n"

• To delete an object, use the remove command, remove(). Example:

    > remove( heights )
    > ls()
    [1] "Heights" "n"

• If you want to change a previous command, you can press the up arrow key and edit your old commands.

• If you press [Enter] before a command is completed, R will go to the next line and respond with a “+” to denote that the command is not finished and you can continue typing. To terminate the command, press the [Escape] or [Esc] key.

• To quit R, type q().

Importing data into R

R is best designed to import and export text data. Go to www.humboldt.edu/~mar13/datasets.html and click on the Trillium dataset. Save the dataset to your computer on the C drive, for example, in a folder called Temp, as “c:/Temp/trillium.txt”. Using the Notepad program, edit the trillium.txt file to remove the header text, so that the first line of the file is the row of variable names. The read.table( ) command is used to import data. R uses forward rather than backward slashes to denote a folder location. Try importing the trillium.txt dataset.

    > trillium <- read.table("c:/Temp/trillium.txt", header=TRUE)

The header=TRUE option lets R know that the first row in the dataset contains the variable names. To confirm that the data were read, type trillium and watch the data flash by. Also try the following command to see the names of the variables in the trillium data.

    > names(trillium)
    [1] "leaf" "stem" "flower" "site"

Storing data: Objects, vectors, matrices, and data frames

Anything stored by R is called an object. Thus the function sort( ) is an object, as were the variables heights and n in the previous examples. We will focus primarily on objects that store data, such as heights and n.
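One way to see that functions and data are both objects is to ask R what kind of object each one is. The short sketch below recreates n and heights from the earlier examples so that it runs on its own; class() and is.function() are standard base R functions that this manual does not otherwise use, so treat them as an optional aside.

```r
# Recreate the example variables so the snippet is self-contained
n <- 10
heights <- c(4, 6, 3)

class(n)           # "numeric": a data object
class(heights)     # "numeric": also a data object, of length 3
class(sort)        # "function": the sort command is itself an object
is.function(sort)  # TRUE
length(heights)    # 3
```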
Data will most commonly be stored in either vectors or data frames. A vector is simply a string of numbers: n is a vector of length 1 and heights is a vector of length 3. If you are entering a single variable with only a few data values, you will often simply type the values into a vector. R allows mathematical operations to be carried out on an entire vector. Example:

    > x <- c( 5,2,4 )
    > x + 6
    [1] 11  8 10

Elements of a vector can be specified by use of square brackets [ ]. Either a single element or more than one can be specified. Examples:

    > x[3]
    [1] 4
    > x[2:3]
    [1] 2 4
    > x[c(1,3)]
    [1] 5 4

A matrix is a set of vectors of the same mode (numeric, character, or logical) and of the same length bound together. The dimension of a matrix is typically described by its number of rows and columns. Analogous to a vector, the elements within a matrix are described by their row and column position in the matrix, e.g., X[ row, column ]. Leaving either the row or the column unspecified is the same as specifying them all. Examples:

    > X <- c(2,4,6,8,10,12,14,16,18,20,22,24)   # vector
    > X <- matrix( X, nrow=3, byrow=FALSE )     # turn into matrix
    > X
         [,1] [,2] [,3] [,4]
    [1,]    2    8   14   20
    [2,]    4   10   16   22
    [3,]    6   12   18   24
    > X[2,3]        # row 2, column 3
    [1] 16
    > X[1:2, 3:4]   # rows 1 through 2 and columns 3 and 4
         [,1] [,2]
    [1,]   14   20
    [2,]   16   22
    > X[1,]         # row 1, all columns
    [1]  2  8 14 20
    > X[,3]         # all rows, column 3
    [1] 14 16 18
    > X+3           # add 3 to all values of X
         [,1] [,2] [,3] [,4]
    [1,]    5   11   17   23
    [2,]    7   13   19   25
    [3,]    9   15   21   27

A data frame is like a matrix, but one column of the data frame may consist of numbers and another column of words. The read.table() function automatically puts the data into a data frame. Try the following commands:

    > trillium   # scroll back up to see the column headers
    > trillium$flower

The first column (1, 2, ..., 582) is simply the row labels, while the data are the columns listed under “leaf”, “stem”, “flower”, and “site”.
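If the trillium file is not at hand, the mixed-column idea can be tried on a tiny data frame typed in directly. The sketch below is only an illustration: the data frame plants and its three rows of values are made up for this example and are not taken from the trillium dataset.

```r
# Hypothetical toy data (NOT the trillium dataset): one numeric column
# and one column of words, which a matrix could not hold together
plants <- data.frame(
  leaf   = c(12.1, 9.8, 14.3),
  flower = c("w", "p", "w")
)

plants                   # print the whole data frame with its row labels
plants$leaf              # pull out one column with the $ symbol
is.numeric(plants$leaf)  # TRUE: this column holds numbers
nrow(plants)             # 3 rows
ncol(plants)             # 2 columns
plants[2, ]              # row 2, all columns, just as with a matrix
```

The flower column is stored either as text or as a factor depending on the version of R, but in both cases it sits alongside the numeric leaf column in the same object.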
Columns within a data frame can be accessed directly using the $ symbol. The variables within a data frame can be accessed even more directly by attaching the data frame. For example, suppose you typed flower. R would say that it could not find the variable. That is because it looks in the workspace (displayed by ls()) and sees the data frame trillium, but does not look inside trillium for the variable flower. The attach(trillium) command tells R to also look inside the data frame trillium if it cannot find flower in the workspace. Once you are finished using the dataset, detach the data frame. Examples:

    > attach(trillium)
    > flower
    [R prints all 582 values of flower, each coded p, s, or w, followed by the factor levels]
    > detach("trillium")
    > flower
    Error: Object "flower" not found

Of course, basic statistical commands can be carried out on objects.

    > attach(trillium)
    > mean(leaf)
    [1] 12.78847
    > sd(leaf)
    [1] 2.521759
    > summary(leaf)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
       4.80   11.10   12.70   12.79   14.50   21.50
    > plot(leaf,stem)
    > cor(leaf,stem)
    [1] 0.6173458

Editing data

Objects in R can be edited using the edit() function, where the output is assigned either to a new object or to the original object. A spreadsheet editor will appear for data frames and a simple text editor for vectors. To end the editing session, simply click on the X icon in the upper right-hand corner of the window. Examples:

    > newtrillium <- edit( trillium )
    > x <- edit( x )

Exporting data

The function write.table() is used to export a data frame to a text file on your computer. There are many advanced optional features to the command, but you must provide an R object to export and a file destination for the object. The row.names=FALSE option avoids writing the row numbers at the start of each row. Example:

    > write.table(newtrillium, file="c:/Temp/newtrillium.txt", row.names=FALSE)

Learning more about R

The potential for using R for statistical analysis is almost endless. There always seems to be more about this software and statistics that a person can learn, no matter how introductory or advanced the user. Besides the R manuals available through the Help menu at the top of the R window, there are a number of books written at introductory and advanced levels that describe how to use R and the similar S-Plus package.

Learning R through application

The best way to learn software is to try to use it to accomplish a goal. Goal-oriented computing can initially be frustrating, but it is ultimately the best teacher for learning advanced commands. Probability simulation will be used to introduce for-loops, simulations, and graphics.

Birthday Game.
To calculate the probability of at least two people sharing the same birthday, it is easier to calculate the probability that no two people share a birthday and subtract this probability from one. For example, with four people in the room, the probability would be $1 - \frac{365}{365} \cdot \frac{364}{365} \cdot \frac{363}{365} \cdot \frac{362}{365}$. The equivalent R code 1-prod((365:362)/365) returns the probability 0.016. The general formula for n people in a room is $1 - \prod_{i=1}^{n} \frac{366 - i}{365}$. Notice how the term $\frac{366 - i}{365}$ approaches 0 as i approaches 366. This is because 366 is one more person than there are days in a year. Try the following R code and inspect the probabilities.

    for ( n in 1:366 ) {
      nobody <- prod( ( 365:(366-n) )/365 )   # probability that nobody shares a birthday
      p <- 1 - nobody                         # probability that at least two people share one
      print( c(p,n) )
    }

Question 1: What are the probabilities for 22 people in a room? 30? 41? 57?

Estimating π. Remember that the area of a circle is $\pi r^2$, where r is the radius of the circle. To estimate π, we can throw “darts” randomly into a two-by-two unit grid and count the proportion that fall into a circle centered inside the grid with a radius of one. The area of the circle is π and the area of the grid is four. Consequently, the proportion of darts that fall into the circle, $p = \frac{\text{number of darts in circle}}{\text{number of darts thrown}}$, should be about equal to $\frac{\pi}{4}$. Using algebra, we get $\pi \approx 4p$. In R we can create “darts” by randomly choosing x and y coordinates, each from the uniform distribution with a -1 to +1 range, thus having the circle centered at the origin. If the distance, $d = \sqrt{x^2 + y^2}$, is less than one, consider the dart to be within the circle. The following R code throws 100,000 darts and estimates π. Cut and paste it repeatedly to watch the variability of the estimate.

    N <- 100000
    xdarts <- runif(N, min=-1, max=1 )    # x coordinates of the darts
    ydarts <- runif(N, min=-1, max=1 )    # y coordinates of the darts
    d <- sqrt( xdarts^2 + ydarts^2 )      # distance of each dart from the origin
    4 * sum( d < 1 ) / N                  # estimate of pi

Question 2: Create two vectors of 100 zeroes and call the vectors piA and piB. (The rep function will be useful.)
Using a for-loop around the code and changing N to 5000, save 100 estimates of π to vector piA. Repeat the process, but change N to 100,000 and save the 100 estimates of π to vector piB. Create side-by-side boxplots of the two vectors (boxplot(piA, piB)). Use the function summary() on each vector to find the first and second quartiles of each vector. For this problem you only need to show the boxplots and state the quartiles.

Writing functions in R. A nice feature of R is that you can write functions to perform sequences of commands that are frequently used and where options may be altered. A simple function to add two numbers x and y would be:

    add <- function( x, y ) {
      xandy <- x + y
      return( xandy )
    }

Suppose we wanted defaults to be available in case x and/or y is not specified by the user. The following alterations would set x=1 and y=3 unless specified otherwise. Play with both functions, sometimes providing only one or none of the x and y values.

    add2 <- function( x=1, y=3 ) {
      xandy <- x + y
      return( xandy )
    }

Question 3: Create a function that will throw N darts into a grid with -1 and 1 being the lower and upper bounds for the x and y coordinates. Let the default for N be 100. Calculate and return the proportion of darts that were inside the circle with radius one. Also plot the points, drawing those inside the circle as circles and those outside the circle as crosses. Provide your function and a graph produced from N=50. The following graphing code will be useful.

    plot( xdarts, ydarts, type="n" )                   # plot nothing, but create frame
    points( xdarts[d<1], ydarts[d<1], pch=1 )          # pch = point character
    points( xdarts[d>=1], ydarts[d>=1], pch=3 )
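As a small illustration of the default arguments described under “Writing functions in R”, the calls below leave out one or both arguments of add2. The function is repeated here so the snippet runs on its own, and the argument values 10, 2, and 5 are arbitrary choices made for this sketch.

```r
# add2 as defined in the text: x defaults to 1 and y defaults to 3
add2 <- function(x = 1, y = 3) {
  xandy <- x + y
  return(xandy)
}

add2()        # both defaults used: 1 + 3 gives 4
add2(10)      # x = 10, y falls back to 3: gives 13
add2(y = 10)  # x falls back to 1, y named explicitly: gives 11
add2(2, 5)    # both supplied: gives 7
```

By contrast, the first version, add, has no defaults, so calling it with an argument missing stops with an error instead of returning a value.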