STAT 411/511 - A short introduction to R (Fall 2015)

DUE Friday, August 28 by 4:00 in D2L Dropbox
Complete this before starting Assignment 1
Statistical analysis now requires the use of a computer and hence computer software. For Stat 411/511
it will not suffice to do statistics in a spreadsheet or in a point-and-click environment. You will need to learn
to use some software program, and we have chosen R as the software for this course. It has these advantages:
• R is capable, fast, and accurate.
• Open Source and Free
• Excellent plotting capabilities
• Huge number of contributed packages (7000?) for specialized tasks
• Interfaces with other programs
There are some tradeoffs with free software – mainly in terms of support – so it’s very important to learn how
to use the help system and the network of people who are willing to answer questions. Do not equate “free”
with “second-rate”. The people who built R have taken great pains to use the best computer algorithms
available. We might disagree with some choices the creators have made, but then we can make whatever
changes we desire and pass them back to the R community. That’s what open source is all about.
Developing some skill as an R programmer will pay off in your future endeavors. You’ll want to put this
on your resume.
Learn to tell when you’re stuck.
Signs:
• You keep repeating the same steps, and keep hitting the same roadblocks.
• Error messages make no sense or seem unrelated to your task.
• Fill in:
Try to not get stuck in a time sink! It really helps to step back and “define the question”. State “Here’s
what I want to do” and “here’s the roadblock I’m hitting”. Often just defining the issue helps you see how
to proceed.
Learn where to get help:
• R help pages.
• Google it: rseek, R questions on Stack Overflow
• Classmates / Instructor
• Fill in:
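As a minimal sketch of the first item in the list above, here are a few standard ways to reach R's built-in help. The functions looked up (sqrt, read.csv) are only examples; substitute whatever you are stuck on.

```r
# Help pages open in a viewer, so only ask for them in an interactive session
if (interactive()) {
  help("sqrt")                 # same as typing ?sqrt
  help.search("square root")   # same as typing ??"square root"
}

# These work anywhere, including inside a script run from start to finish
args(read.csv)                    # show the arguments a function accepts
found <- apropos("^read\\.csv")   # names of objects matching a pattern
found
```

When the help page is not enough, copy the exact error message into your search; that is usually the fastest way to find someone who hit the same roadblock.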
R Intro Exercises
1. Load the current version of R onto your computer from CRAN (or use an MSU computer lab).
2. Load up RStudio from rstudio.org (or use MSU lab). RStudio is not the only way to run R, but it’s a
nice way which works for any computer. I will be using it in class.
3. Start RStudio and under the File menu open a New File as an R script.
To turn in: Use DROPBOX in D2L to turn in your script file after completing this introduction. At
the end of your script file you should do your best to answer the questions at the end of the assignment (R
terms you should know). Be organized. Include comments in the code to organize and explain what you are
doing (see below for more details). This will be worth 10 points for thoroughly completing it.
1 Create a script file
1. To start, use File −→ New File −→ R Script. Use File −→ Save As to save your file in the folder where
you want to keep your work for this class. Name it Rintro; the .R extension should be added to the file
automatically. A script file is just a text file that stores all your R code. Only
rarely should you find yourself typing in the Console window.
2. Type all your work for this introduction into your script file and SAVE often. You can run specific
parts of the code by highlighting just the code you want to run and then hitting the “Run” button at
the top of the window or using the shortcut keys described below. You can also run individual lines
without highlighting - just by placing your cursor in the line before running it. The “answer” will pop
up in the Console window. The shortcut keys for running commands from a script file are:
• Mac: Command-Return (Command-Enter)
• PC: Ctrl-R (Ctrl-Enter also works in RStudio)
3. For this tutorial, you should type the code I give you, or something very similar (you do not have to
use the same numbers or names). The “answers” that R should give you in the Console Window are
shown for many parts of the tutorial behind the comment symbols (number signs). You should be
checking what appears in the Console window.
2 Defining Some Simple Objects in R
1. First, let’s just use R as a calculator:
4 + 5 + 33
## [1] 42
2. Now, let’s give the quantity a name. You can choose whatever name you want, but it shouldn’t start
with a number or have spaces. Check the console window after you run this. Does it show you the
answer?
objct <- 4 + 5 + 33
3. Let’s see how the comment symbol works
#How does the comment symbol work?
#objct
4. If we want to see what the object objct is defined as, we need to type (or run) the name by itself and
we will see how it is defined. You can also look under Environment in the upper righthand window if
you are using RStudio.
objct
## [1] 42
5. Let’s define another object and look at it:
a <- 2*sqrt(3) #2 times the square root of 3
a
## [1] 3.464102
This is our first use of a function. Arguments to the function go in the parentheses.
6. Now let’s use the names to perform some mathematical operations on objct and a
objct + a #addition
## [1] 45.4641
objct - a #subtraction
## [1] 38.5359
objct / a #division
## [1] 12.12436
objct * a #multiplication
## [1] 145.4923
objct ^ a #raise to the power of a
## [1] 419856.1
log(objct) #natural logarithm
## [1] 3.73767
exp(a) #exponentiate
## [1] 31.94775
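Functions such as sqrt(), log(), and exp() take their arguments inside the parentheses. Many functions take more than one argument, and you can supply them by name or by position. The sketch below uses round(), which is not part of the tutorial, purely as an illustration.

```r
a <- 2 * sqrt(3)

round(a)              # second argument (digits) is left at its default of 0
round(a, digits = 2)  # argument supplied by name
round(a, 2)           # same argument supplied by position

args(round)           # lists the arguments that round() accepts
```

Naming arguments makes a script easier to read later, which matters once functions have five or six of them.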
7. I decided I don’t like the name objct, so I’m going to change it to something simpler and see how R’s
memory changes. Be sure to run all lines one at a time and see what is in the Console window.
b <- objct
b
objct #is objct still in R's memory?
objct <- 100
object #oops, I spelled it wrong!
Objct #oops, I accidentally capitalized it!
objct
b #what about b?
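If you want to check your answers to the questions in the comments above, here is one way to do it, sketched with the base functions exists() and ls() (neither is used elsewhere in this tutorial). It shows that b is a copy that does not change when objct is redefined, and that names are case sensitive.

```r
objct <- 4 + 5 + 33
b <- objct        # b gets a copy of the current value of objct
objct <- 100      # redefining objct does not touch b

b                 # still the old value
objct             # the new value

exists("objct")   # is there an object with exactly this name?
exists("Objct")   # capital O is a different name
ls()              # everything currently defined in the workspace
```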
8. Can we define objects of characters (letters and words), instead of numbers?
letterA <- "A"
letterA
## [1] "A"
my.name <- "Jim R-C"
my.name
## [1] "Jim R-C"
9. There are several basic functions that are very handy for putting things together.
• Concatenate means to put things (numbers, letters, or words) together into one string or vector.
We use c() to do this, where the things to put in the string are within the parentheses and
separated with commas.
some.letters <- c("I","L","O","V","E","S","t","a","t","s")
some.letters
##  [1] "I" "L" "O" "V" "E" "S" "t" "a" "t" "s"
some.numbers <- c(2,4,6,8,20,29,10,34,20000,-20)
some.numbers
##  [1]     2     4     6     8    20    29    10    34 20000   -20
• It is often helpful to create a data frame to use for analysis. We can combine vectors that are the
same length into data frames. Let’s first check the length of the vectors to make sure we won’t
get an error.
length(some.letters)
## [1] 10
length(some.numbers)
## [1] 10
data.frame(letters=some.letters, nums=some.numbers)
##    letters  nums
## 1        I     2
## 2        L     4
## 3        O     6
## 4        V     8
## 5        E    20
## 6        S    29
## 7        t    10
## 8        a    34
## 9        t 20000
## 10       s   -20
• Let’s give the data frame a name so that we can refer to it when we would like to use the data
stored in it.
letternum.df <- data.frame(letters=some.letters, nums=some.numbers)
letternum.df
• We may also want to create a matrix. All entries of a matrix must be of the same type, and we will
mostly use matrices built from numeric vectors (vectors of numbers). Let’s play around with this a
little. A common way to make a matrix is to use the function cbind(), which stands for column-bind
because we want to bind columns into a matrix (there is also an rbind() for binding rows).
cbind(some.numbers, some.numbers) #make a 10x2 matrix
##       some.numbers some.numbers
##  [1,]            2            2
##  [2,]            4            4
##  [3,]            6            6
##  [4,]            8            8
##  [5,]           20           20
##  [6,]           29           29
##  [7,]           10           10
##  [8,]           34           34
##  [9,]        20000        20000
## [10,]          -20          -20
• What happens to our numbers if we try to combine a vector of letters with a vector of numbers
to make a matrix?
cbind(some.letters, some.numbers) #what happens here?
##       some.letters some.numbers
##  [1,] "I"          "2"
##  [2,] "L"          "4"
##  [3,] "O"          "6"
##  [4,] "V"          "8"
##  [5,] "E"          "20"
##  [6,] "S"          "29"
##  [7,] "t"          "10"
##  [8,] "a"          "34"
##  [9,] "t"          "20000"
## [10,] "s"          "-20"
• We can check whether vectors are numbers or characters using the following functions which will
spit out a TRUE or a FALSE for us.
is.numeric(some.numbers)
## [1] TRUE
is.numeric(some.letters)
## [1] FALSE
is.character(some.numbers)
## [1] FALSE
is.character(some.letters)
## [1] TRUE
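The quotation marks around the numbers in the cbind() output above are the clue: because a matrix holds only one type of value, the numbers were converted to characters. The sketch below checks this with the same functions and then recovers the numbers with as.numeric(). The object names mixed and recovered are just illustrative choices.

```r
some.letters <- c("I","L","O","V","E","S","t","a","t","s")
some.numbers <- c(2,4,6,8,20,29,10,34,20000,-20)

mixed <- cbind(some.letters, some.numbers)
is.numeric(mixed)     # the whole matrix is no longer numeric
is.character(mixed)   # everything was turned into character strings

# Pull the second column back out and convert it to numbers again
recovered <- as.numeric(mixed[, "some.numbers"])
recovered
is.numeric(recovered)
```

This is one reason a data frame is usually the better container when columns have different types.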
10. What if we just want to look at parts of the data frame letternum.df? These commands are useful
after importing data files.
names(letternum.df) #look at names
head(letternum.df) #look at first 6 rows
tail(letternum.df) #look at last 6 rows
letternum.df$letters #look only at letters column
letternum.df$nums #look only at numbers column
## Subsetting a data frame or matrix
letternum.df[8:10, ] #look at rows 8-10, for both columns
letternum.df[1:2, 1] #look at rows 1-2, for column 1
letternum.df[1, 2] #look at value in 1st row and 2nd column
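The same square-bracket notation also accepts column names and TRUE/FALSE conditions, which goes slightly beyond the commands above. A minimal sketch, with the data frame rebuilt here so the code runs on its own and big as an illustrative object name:

```r
letternum.df <- data.frame(
  letters = c("I","L","O","V","E","S","t","a","t","s"),
  nums    = c(2,4,6,8,20,29,10,34,20000,-20)
)

str(letternum.df)                 # compact overview: size, column names, column types

letternum.df[, "nums"]            # pick a column by name instead of by position

# Keep only the rows where nums is larger than 20
big <- letternum.df[letternum.df$nums > 20, ]
big
```

Leaving the spot after the comma empty means "all columns", just as leaving the spot before it empty means "all rows".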
3 Reading in Data Sets
For this class, we’re going to focus on reading in data sets from text files. We will usually use .csv files,
but a similar method can be used for others. There are also ways to directly import data sets when using
RStudio and it does allow you to look at the data in a spreadsheet view, which is nice. See Import Dataset
option under Environment in upper righthand window in RStudio. We will play around here with a data
set that accompanies The Statistical Sleuth. It contains names of mammal species, with records of average
brain weight, average body size, average litter size, and average gestation length.
1. First, we need to think about our Working Directory. The working directory is the folder on your
computer that R thinks it is working in. It will look for and save files here. You can set it using
command lines, or to begin with we’ll just use the drop down menu. In RStudio, Session −→ Set
Working Directory −→ Choose Directory. If you have already opened your script file in the location
you want to be the working directory you can just choose To Source File Location under Set Working
Directory, instead of browsing to the folder using Choose Directory.
You could save the data file http://www.math.montana.edu/~jimrc/classes/stat511/data/gestationBrain.csv
in your working directory and then read it in:
brain.data <- read.csv("gestationBrain.csv", header=TRUE)
Or use file.choose() to browse for it:
brain.data <- read.csv(file.choose(), header=TRUE)
Or, R can grab it from my web site:
brain.data <- read.csv("http://www.math.montana.edu/~jimrc/classes/stat511/data/gestationBrain.csv")
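The working directory can also be checked and set from the command line with getwd() and setwd(). In this sketch tempdir() is only a stand-in so the code runs anywhere; in practice you would give setwd() the path to your own class folder.

```r
old.dir <- getwd()     # where is R working right now?
old.dir

setwd(tempdir())       # stand-in folder; replace with the path to your class folder
getwd()                # confirm the change

# Is the data file in the working directory? (FALSE until you save it there)
file.exists("gestationBrain.csv")
list.files()           # everything R can see in this folder

setwd(old.dir)         # go back to where we started
```

If read.csv() complains that it cannot open a file, running getwd() and list.files() is usually the quickest way to see why.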
2. Now brain.data is loaded, but we have not looked at it yet. Let’s check it out and make sure we
understand what is in it. You can look at the whole thing within the console window with the following:
brain.data
3. To see it in spreadsheet view if you are using RStudio, look in the Environment window in upper right,
find brain.data and click on the little spreadsheet view icon on the right of the line. This will open
brain.data window in the same window frame where your script file is.
4. We can also look at other parts of it within the console window with the following commands. Each
uses a different function on the data frame.
names(brain.data) #check names of variables
head(brain.data) #look at first 6 rows
summary(brain.data) #get summary stats for all variables
dim(brain.data) #dimension of brain.data
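If you cannot get to the real data file yet, you can practice the same commands on a tiny stand-in. The sketch below writes a two-row .csv file to a temporary location and reads it back. The rows are made-up placeholder values, not data from The Statistical Sleuth, and the names tmp and toy are hypothetical.

```r
# Write a small made-up .csv file (placeholder values only)
tmp <- tempfile(fileext = ".csv")
writeLines(c("Species,Brain,Gestation",
             "Hypothetical A,10,30",
             "Hypothetical B,200,150"),
           tmp)

# Read it in exactly as we read gestationBrain.csv
toy <- read.csv(tmp, header = TRUE)

names(toy)    # column names come from the first line of the file
dim(toy)      # number of rows, then number of columns
str(toy)      # type of each column
summary(toy)
```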
4 Loading Packages
Before moving to plotting, we will learn how to install and load packages in R. Packages are just collections
of functions we can use to do things more easily in R (thanks to others we don’t have to reinvent the wheel).
Anyone can submit a package, and nobody guarantees that contributed functions are correct, so while this
openness is a wonderful thing that makes R very powerful, we should keep in mind that most checking comes
from people trying the code.
For plotting, we are going to use the basic plotting commands in R, as well as the ggplot2 package and
the mosaic package. You will need to install the package on your computer one time and then after that,
you will just have to load the package in R using library(package name) if you want to use it during a
session.
1. You can install a package with the install.packages() function OR use the drop-down menus. In RStudio
the drop-down menu is under Tools −→ Install Packages....
2. To load the package after installing it, we will use library() (require() also works, but is not
preferred)
library(ggplot2)
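A minimal sketch of the install-once, load-every-session pattern: first find out which of the packages named in this section are missing, then install only those. The object names needed and to.install are illustrative, and the install line is left as a comment because it needs an internet connection and only has to be run one time.

```r
needed <- c("ggplot2", "mosaic")

# Which of these are not yet installed on this computer?
to.install <- setdiff(needed, rownames(installed.packages()))
to.install

# Run this line once, from the Console, if to.install is not empty:
# install.packages(to.install)

# After installing, load a package at the top of each script that uses it:
# library(ggplot2)
```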
5 Plotting
We will use the base R plotting commands (you will see these in Assignments 1 and 2), but we will also
use the package ggplot2 for making some fancier plots. The syntax for ggplot2 is different from the usual R
plotting commands, but it certainly makes prettier plots and many people from various disciplines are using
it.
1. Construct a scatter plot with brain weight on the y-axis and gestation length on the x-axis. Here is
the code you could use directly to make the plot, and the plot should look like that shown below.
qplot(data=brain.data, x=Gestation, y=Brain,
main = "") + theme(legend.position="none")
2. Let’s try natural log transforming both brain weight and gestation length. We can do this using
ggplot2 with the code below. It looks like it should be simple in mPlot(), but the default is log base
10 instead of natural log, and usually we are going to want the natural log.
qplot(data=brain.data, x=Gestation, y=Brain) + scale_x_continuous(trans="log") +
  scale_y_continuous(trans="log")
3. Or we could tell R to transform before plotting.
qplot(x = log(Brain), y = log(Gestation), data = brain.data)
[Figure: scatter plot with log(Brain) on the x-axis (ticks 0 to 8) and log(Gestation) on the y-axis (ticks 3 to 6).]
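Since the difference between natural log and log base 10 matters in item 2, it is worth checking which one a function gives you. A quick sketch with an arbitrary test value of 100:

```r
x <- 100

log(x)              # natural log (base e); this is R's default
log10(x)            # log base 10
## [1] 2
log(x, base = 10)   # same thing, using the base argument of log()
exp(log(x))         # exp() undoes the natural log
```

Trying a value whose base-10 log you know by heart, like 100 or 1000, makes it obvious which base you are looking at.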
4. Another option is to actually make the log transformed variables part of the dataframe and directly
plot them. This allows us to use them in other ways later on as well.
#take the natural log of Brain and define new variable ln.brain
brain.data$ln.brain <- log(brain.data$Brain)
#take the natural log of Body and define new variable ln.body
brain.data$ln.body <- log(brain.data$Body)
#take the natural log of Gestation and define new variable ln.gest
brain.data$ln.gest <- log(brain.data$Gestation)
5. One very helpful feature is being able to code points or split up scatter plots by another variable that
is composed of category labels. First, we will create a new variable that assigns each species to one of
six body weight categories, and it is called body6. This command will add body6 into the dataframe
brain.data.
brain.data$body6 <- cut(brain.data$ln.body, breaks=6)
names(brain.data) #check to see that the new variable body6 is there
## [1] "Gestation" "Brain"     "Body"      "Litter"    "Species"   "ln.brain"
## [7] "ln.body"   "ln.gest"   "body6"
I was asked: “Why not just use cbind to combine this new column with the data frame?” You could
do that, but then the new column would not automatically get a tidy name like body6.
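To see what cut() produces before relying on it in a plot, it can help to try it on a handful of numbers. This sketch uses six made-up body weights (placeholder values, not from the data set) and three categories instead of six; toy and body3 are illustrative names.

```r
# Made-up body weights, just to see what cut() does
toy <- data.frame(Body = c(0.5, 3, 40, 250, 1200, 6000))
toy$ln.body <- log(toy$Body)

# Split the range of ln.body into 3 equal-width intervals
toy$body3 <- cut(toy$ln.body, breaks = 3)

toy
levels(toy$body3)   # the interval labels that cut() created
table(toy$body3)    # how many rows fall in each interval
is.factor(toy$body3)
```

The result is a factor, which is why the plotting commands can use it to assign colors or split panels.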
6. Now, let’s color the points in the scatter plot according to what body weight category they are in.
Can you change the color of the points in the first scatter plot you made according to the body size
category?
qplot(data=brain.data, x=ln.gest, y=ln.body, colour=body6, main = "") + theme(legend.position="none")
7. Alternatively, we could split up the plot into facets or panels according to the body weight category
to end up with 6 scatter plots within the same plot. Try to do this using the Facets menu.
qplot(data=brain.data, x=ln.gest, y=ln.brain, colour=body6, main = "") + facet_wrap(~body6, ncol=3) +
  theme(legend.position="none")
8. Can you add a legend to scatterplot with color coding?
ggplot(data=brain.data, aes(x=ln.gest, y=ln.brain)) + geom_point() +
  aes(colour=body6) + theme(legend.position="right") + labs(title="")
6 Closing down R after working for a while
SAVE your R script!! This is your record of your work and keeps you from ever having to type the same
command twice. It allows someone to check your work and makes your work reproducible. If you come to
me for help I will ask to see your script file.
No need to save your WORKSPACE.
7 R terms you should know
Using the help files, books, and the internet, do your best to answer these questions. We will go over most
of these.
• How do we name and save an object?
• What are some common functions? How are they called?
• What is a package?
• What is a vector or matrix?
• What is a factor?
• What is the console window?
• What is a script file and why should I use it?
• What is RStudio and do I have to use it?