STAT 411/511 - A short introduction to R (Fall 2015)
DUE Friday, August 28 by 4:00 in D2L Dropbox
Complete this before starting Assignment 1

Statistical analysis now requires the use of a computer and hence computer software. For Stat 411/511 it will not suffice to do statistics in a spreadsheet or in a point-and-click environment. You will need to learn to use some software program, and we have chosen R as the software for this course. It has these advantages:

- R is capable, fast, and accurate.
- Open Source and Free
- Excellent plotting capabilities
- Huge number of contributed packages (7000?) for specialized tasks
- Interfaces with other programs

There are some tradeoffs with free software, mainly in terms of support, so it’s very important to learn how to use the help system and the network of people who are willing to answer questions. Do not equate “free” with “second-rate”. The people who built R have taken great pains to use the best computer algorithms available. We might disagree with some choices the creators have made, but then we can make whatever changes we desire and pass them back to the R community. That’s what open source is all about.

Developing some skill as an R programmer will pay off in your future endeavors. You’ll want to put this on your resume.

Learn to tell when you’re stuck. Signs:

- You keep repeating the same steps, and keep hitting the same roadblocks.
- Error messages make no sense, seem unrelated to your task.
- Fill in:

Try to not get stuck in a time sink! It really helps to step back and “define the question”. State “Here’s what I want to do” and “here’s the roadblock I’m hitting”. Often just defining the issue helps you see how to proceed.

Learn where to get help:

- R help pages.
- Google it: rseek, R stackoverflow
- Classmates / Instructor
- Fill in:

R Intro Exercises

1. Download and install the current version of R from CRAN (or use an MSU computer lab).
2. Load up RStudio from rstudio.org (or use MSU lab).
   RStudio is not the only way to run R, but it’s a nice way which works for any computer. I will be using it in class.
3. Start RStudio and under the File menu open a New File as an R script.

To turn in: Use DROPBOX in D2L to turn in your script file after completing this introduction. At the end of your script file you should do your best to answer the questions at the end of the assignment (R terms you should know). Be organized. Include comments in the code to organize and explain what you are doing (see below for more details). This will be worth 10 points for thoroughly completing it.

1 Create a script file

1. To start, use File −→ New File −→ R Script. Use File −→ Save As to save your file in the folder where you keep your work for this class. Name it Rintro and the .R extension should automatically be added to the file. A script file is just a text file that stores all your R code. Only rarely should you find yourself typing in the Console window.
2. Type all your work for this introduction into your script file and SAVE often. You can run specific parts of the code by highlighting just the code you want to run and then hitting the “Run” button at the top of the window or using the shortcut keys described below. You can also run individual lines without highlighting, just by placing your cursor in the line before running it. The “answer” will pop up in the Console window. The shortcut keys for running commands from a script file are:
   Mac: Command-Return (Command-Enter)
   PC: Ctrl-R
3. For this tutorial, you should type the code I give you, or something very similar (you do not have to use the same numbers or names). The “answers” that R should give you in the Console window are shown for many parts of the tutorial behind the comment symbols (number signs). You should be checking what appears in the Console window.

2 Defining Some Simple Objects in R

1. First, let’s just use R as a calculator:

    4 + 5 + 33
    ## [1] 42

2.
   Now, let’s give the quantity a name. You can choose whatever name you want, but it shouldn’t start with a number or have spaces. Check the Console window after you run this. Does it show you the answer?

    objct <- 4 + 5 + 33

3. Let’s see how the comment symbol works:

    #How does the comment symbol work?
    #objct

4. If we want to see what the object objct is defined as, we need to type (or run) the name by itself and we will see how it is defined. You can also look under Environment in the upper righthand window if you are using RStudio.

    objct
    ## [1] 42

5. Let’s define another object and look at it:

    a <- 2*sqrt(3) #2 times the square root of 3
    a
    ## [1] 3.464102

   This is our first use of a function. Arguments to the function go in the parentheses.
6. Now let’s use the names to perform some mathematical operations on objct and a:

    objct + a #addition
    ## [1] 45.4641
    objct - a #subtraction
    ## [1] 38.5359
    objct / a #division
    ## [1] 12.12436
    objct * a #multiplication
    ## [1] 145.4923
    objct ^ a #raise to the power of a
    ## [1] 419856.1
    log(objct) #natural logarithm
    ## [1] 3.73767
    exp(a) #exponentiate
    ## [1] 31.94775

7. I decided I don’t like the name objct, so I’m going to change it to something simpler and see how R’s memory changes. Be sure to run all lines one at a time and see what is in the Console window.

    b <- objct
    b
    objct #is objct still in R's memory?
    objct <- 100
    object #oops, I spelled it wrong!
    Objct #oops, I accidentally capitalized it!
    objct
    b #what about b?

8. Can we define objects of characters (letters and words), instead of numbers?

    letterA <- "A"
    letterA
    ## [1] "A"
    my.name <- "Jim R-C"
    my.name
    ## [1] "Jim R-C"

9. There are several basic functions that are very handy for putting things together. Concatenate means to put things (numbers, letters, or words) together into one string or vector. We use c() to do this, where the things to put in the string are within the parentheses and separated with commas.
    some.letters <- c("I","L","O","V","E","S","t","a","t","s")
    some.letters
    ## [1] "I" "L" "O" "V" "E" "S" "t" "a" "t" "s"
    some.numbers <- c(2,4,6,8,20,29,10,34,20000,-20)
    some.numbers
    ## [1]     2     4     6     8    20    29    10    34 20000   -20

   It is often helpful to create a data frame to use for analysis. We can combine vectors that are the same length into data frames. Let’s first check the length of the vectors to make sure we won’t get an error.

    length(some.letters)
    ## [1] 10
    length(some.numbers)
    ## [1] 10
    data.frame(letters=some.letters, nums=some.numbers)
    ##    letters  nums
    ## 1        I     2
    ## 2        L     4
    ## 3        O     6
    ## 4        V     8
    ## 5        E    20
    ## 6        S    29
    ## 7        t    10
    ## 8        a    34
    ## 9        t 20000
    ## 10       s   -20

   Let’s give the data frame a name so that we can refer to it when we would like to use the data stored in it.

    letternum.df <- data.frame(letters=some.letters, nums=some.numbers)
    letternum.df

   We may also want to create a matrix. All entries of a matrix must be of the same type, and we will usually work with matrices of numbers. Let’s play around with this a little. A common way to make a matrix is to use the function cbind(), which stands for column-bind because we want to bind columns into a matrix (there is also an rbind() for binding rows).

    cbind(some.numbers, some.numbers) #make a 10x2 matrix
    ##       some.numbers some.numbers
    ##  [1,]            2            2
    ##  [2,]            4            4
    ##  [3,]            6            6
    ##  [4,]            8            8
    ##  [5,]           20           20
    ##  [6,]           29           29
    ##  [7,]           10           10
    ##  [8,]           34           34
    ##  [9,]        20000        20000
    ## [10,]          -20          -20

   What happens to our numbers if we try to combine a vector of letters with a vector of numbers to make a matrix?

    cbind(some.letters, some.numbers) #what happens here?
    ##       some.letters some.numbers
    ##  [1,] "I"          "2"
    ##  [2,] "L"          "4"
    ##  [3,] "O"          "6"
    ##  [4,] "V"          "8"
    ##  [5,] "E"          "20"
    ##  [6,] "S"          "29"
    ##  [7,] "t"          "10"
    ##  [8,] "a"          "34"
    ##  [9,] "t"          "20000"
    ## [10,] "s"          "-20"

   We can check whether vectors are numbers or characters using the following functions, which return TRUE or FALSE.
    is.numeric(some.numbers)
    ## [1] TRUE
    is.numeric(some.letters)
    ## [1] FALSE
    is.character(some.numbers)
    ## [1] FALSE
    is.character(some.letters)
    ## [1] TRUE

10. What if we just want to look at parts of the data frame letternum.df? These commands are useful after importing data files.

    names(letternum.df) #look at names
    head(letternum.df) #look at first 6 rows
    tail(letternum.df) #look at last 6 rows
    letternum.df$letters #look only at letters column
    letternum.df$nums #look only at numbers column
    ## Subsetting a data frame or matrix
    letternum.df[8:10, ] #look at rows 8-10, for both columns
    letternum.df[1:2, 1] #look at rows 1-2, for column 1
    letternum.df[1, 2] #look at value in 1st row and 2nd column

3 Reading in Data Sets

For this class, we’re going to focus on reading in data sets from text files. We will usually use .csv files, but a similar method can be used for others. RStudio can also import data sets directly, and it lets you look at the data in a spreadsheet view, which is nice. See the Import Dataset option under Environment in the upper righthand window in RStudio.

We will play around here with a data set that accompanies The Statistical Sleuth. It contains names of mammal species, with records of average brain weight, average body size, average litter size, and average gestation length.

1. First, we need to think about our Working Directory. The working directory is the folder on your computer that R thinks it is working in. It will look for and save files here. You can set it using command lines, or to begin with we’ll just use the drop down menu. In RStudio, Session −→ Set Working Directory −→ Choose Directory. If you have already opened your script file in the location you want to be the working directory, you can just choose To Source File Location under Set Working Directory, instead of browsing to the folder using Choose Directory.
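   The command-line route mentioned above can be sketched as follows. This is only an illustration, not part of the assignment: the folder name stat511 is hypothetical, and it is created inside R's temporary directory purely so that the lines run on any computer. In practice you would give setwd() the path to your own class folder.

```r
# Remember where R is working now so we can return afterwards
old.dir <- getwd()
old.dir

# Hypothetical class folder; tempdir() is used only so this sketch runs anywhere
class.dir <- file.path(tempdir(), "stat511")
dir.create(class.dir, showWarnings = FALSE)

# Point R at the class folder and confirm the change
setwd(class.dir)
getwd()

# List the files R can see in the working directory (none yet in a new folder)
list.files()

# Return to the original working directory
setwd(old.dir)
```

   Running getwd() before reading a file is a quick way to check where R will look for it.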
   You could save the data file http://www.math.montana.edu/~jimrc/classes/stat511/data/gestationBrain.csv in your working directory and read it from there:

    brain.data <- read.csv("gestationBrain.csv", header=TRUE)

   Or use file.choose to browse for it:

    brain.data <- read.csv(file.choose(), header=TRUE)

   Or, R can grab it from my web site:

    brain.data <- read.csv("http://www.math.montana.edu/~jimrc/classes/stat511/data/gestationBrain.csv")

2. Now brain.data is loaded, but we have not looked at it yet. Let’s check it out and make sure we understand what is in it. You can look at the whole thing within the Console window with the following:

    brain.data

3. To see it in spreadsheet view if you are using RStudio, look in the Environment window in the upper right, find brain.data and click on the little spreadsheet view icon on the right of the line. This will open a brain.data tab in the same window frame where your script file is.
4. We can also look at other parts of it within the Console window with the following commands. Each uses a different function on the data frame.

    names(brain.data) #check names of variables
    head(brain.data) #look at first 6 rows
    summary(brain.data) #get summary stats for all variables
    dim(brain.data) #dimension of brain.data

4 Loading Packages

Before moving to plotting, we will learn how to install and load packages in R. Packages are just collections of functions we can use to do things more easily in R (thanks to others we don’t have to reinvent the wheel). Anyone can submit a package, and no one independently verifies that its functions give correct answers. That openness is a wonderful thing that makes R very powerful, but we should keep in mind that the checking comes from people trying the functions out. For plotting, we are going to use the basic plotting commands in R, as well as the ggplot2 package and the mosaic package. You will need to install the package on your computer one time and then after that, you will just have to load the package in R using library(package name) if you want to use it during a session.

1.
   Here is the code for installing a package, OR you can use the drop down menus to do it. In RStudio the drop down menu is under Tools −→ Install Packages....

    install.packages("ggplot2")

2. To load the package after installing it, we will use library() (require() also works, but is not preferred):

    library(ggplot2)

5 Plotting

We will use the base R plotting commands (you will see these in Assignments 1 and 2), but we will also use the package ggplot2 for making some fancier plots. The syntax for ggplot is different from the usual R plotting commands, but it makes prettier plots and many people from various disciplines are using it.

1. Construct a scatter plot with brain weight on the y-axis and gestation length on the x-axis. Here is the code you could use directly to make the plot, and the plot should look like that shown below.

    qplot(data=brain.data, x=Gestation, y=Brain, main = "") + theme(legend.position="none")

2. Let’s try natural log transforming both brain weight and gestation length. We can do this using ggplot2 with the code below. It looks like it should be simple in mPlot(), but the default is log base 10 instead of natural log, and usually we are going to want the natural log.

    qplot(data=brain.data, x=Gestation, y=Brain) +
      scale_x_continuous(trans="log") +
      scale_y_continuous(trans="log")

3. Or we could tell R to transform before plotting.

    qplot(x = log(Brain), y = log(Gestation), data = brain.data)

   [Figure: scatter plot with log(Brain) on the x-axis and log(Gestation) on the y-axis]

4. Another option is to actually make the log transformed variables part of the data frame and directly plot them. This allows us to use them in other ways later on as well.

    #take the natural log of Brain and define new variable ln.brain
    brain.data$ln.brain <- log(brain.data$Brain)
    #take the natural log of Body and define new variable ln.body
    brain.data$ln.body <- log(brain.data$Body)
    #take the natural log of Gestation and define new variable ln.gest
    brain.data$ln.gest <- log(brain.data$Gestation)

5.
   One very helpful feature is being able to code points or split up scatter plots by another variable that is composed of category labels. First, we will create a new variable, called body6, that assigns each species to one of six body weight categories. This command will add body6 to the data frame brain.data.

    brain.data$body6 <- cut(brain.data$ln.body, breaks=6)
    names(brain.data) #check to see that the new variable body6 is there
    ## [1] "Gestation" "Brain"     "Body"      "Litter"    "Species"   "ln.brain"
    ## [7] "ln.body"   "ln.gest"   "body6"

   I was asked: “Why not just use cbind to combine this new column with the data frame?” You could do that, but then the column name would not appear.
6. Now, let’s color the points in the scatter plot according to what body weight category they are in. Can you change the color of the points in the first scatter plot you made according to the body size category?

    qplot(data=brain.data, x=ln.gest, y=ln.brain, colour=body6, main = "") +
      theme(legend.position="none")

7. Alternatively, we could split up the plot into facets or panels according to the body weight category to end up with 6 scatter plots within the same plot. Try to do this using the Facets menu.

    qplot(data=brain.data, x=ln.gest, y=ln.brain, colour=body6, main = "") +
      facet_wrap(~body6, ncol=3) +
      theme(legend.position="none")

8. Can you add a legend to the scatter plot with color coding?

    ggplot(data=brain.data, aes(x=ln.gest, y=ln.brain)) +
      geom_point() +
      aes(colour=body6) +
      theme(legend.position="right") +
      labs(title="")

6 Closing down R after working for a while

SAVE your R script!! This is your record of your work and keeps you from ever having to type the same command twice. It allows someone to check your work and makes your work reproducible. If you come to me for help I will ask to see your script file. No need to save your WORKSPACE.

7 R terms you should know

Using the help files, books, and the internet, do your best to answer these questions.
We will go over most of these.

- How do we name and save an object?
- What are some common functions? How are they called?
- What is a package?
- What is a vector or matrix?
- What is a factor?
- What is the console window?
- What is a script file and why should I use it?
- What is RStudio and do I have to use it?
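One way to explore several of these terms is to build small objects and ask R what they are. The lines below are a minimal sketch, not a set of answers: the object names and the values in them are made up for illustration, and only base R functions are used.

```r
# A vector of numbers, built with c()
nums <- c(2, 4, 6, 8)
is.vector(nums)      # is this object a vector?
class(nums)          # what kind of object is it?

# A matrix made by binding two copies of the vector together as columns
num.mat <- cbind(nums, nums)
is.matrix(num.mat)   # is this object a matrix?
dim(num.mat)         # number of rows, then number of columns

# A factor stores category labels; these labels are made up
size <- factor(c("small", "large", "small", "medium"))
is.factor(size)      # is this object a factor?
levels(size)         # the distinct categories R has recorded

# To read more, run help("factor") or ?factor in the Console
```

Typing a question mark followed by a function name in the Console opens that function's help page, which is a good starting point for the questions above.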