Before we begin, please download the
“SwissNotes.csv” and “cardiac.txt” files from the
ISCC website, under the R workshop (more info).
www.iub.edu/~iscc
Workshop in Methods from the Indiana Statistical Consulting
Center
Thomas A. Jackson
February 15, 2013
The R Project for Statistical Computing http://cran.r-project.org
“R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now
Lucent Technologies) by John Chambers and Colleagues.
R can be considered as a different implementation of S.
There are some important differences, but much code written for S runs unaltered under R.”
- Description from CRAN Website
R …
• is free
• is interactive: we can type something in and work with it
▫ How we analyze data can be broken into small steps
• is interpreted: we give it commands and it translates them into mathematical procedures or data management steps
• can be run in batch mode: nice because the script documents what was done
• is a calculator: it is unlike other calculators though because you can create variables and objects
• How to open R
→ Start Menu
→ Programs
→ Departmentally Supported
→ Stat/Math
→ R
Three Environments
• Command Window (aka Console)
• Script Window
• Plot Window
To quit: type q()
Save workspace image? Copies the objects in memory to the hard drive
Storing variable in memory
• <- , -> , or =
• a<- 5 stores the number 5 in the object “a”
• pi -> b stores the number π= 3.141593 in “b”
• x = 1 + 2 stores the result of the calculation (3) in “x”
• “=“ requires left-hand assignment
Try not to overwrite built-in names such as t, c, and pi! (R will let you, which is the problem.)
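The three assignment forms can be tried side by side. This is a minimal sketch; a, b, and x follow the bullets above, while radius and area are made-up names for illustration.

```r
a <- 5          # leftward assignment
pi -> b         # rightward assignment: value on the left, name on the right
x = 1 + 2       # "=" also assigns, but only to the name on its left

# Illustrative names (not from the slides): reuse stored objects in a calculation
radius <- 2
area <- b * radius^2
print(area)
```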
Printing to output
• Calculations that are not stored print to output
> 3 + 5
[1] 8
• Type name to view stored object
> a
[1] 5
• Use print()
> print(a)
[1] 5
View objects in workspace
• objects() or ls()
Clearing the console (command window)
• Mac: Edit → Clear Console, or Alt + Command + L
• Windows: Edit → Clear Console, or Ctrl + L
Removing variables from memory
• rm() or remove()
> x <- 4
> rm(x)
• rm(list = ls()) remove all variables
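A short sketch of listing and removing objects. The object y is a made-up second object so that something is left after the removal.

```r
x <- 4
y <- "hello"                             # illustrative second object
print(ls())                              # names of the objects in the workspace
rm(x)                                    # removes x only; y is untouched
print(exists("x", inherits = FALSE))     # FALSE once x has been removed
```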
Saving syntax (code)
• Mac: File → New
• Windows: File → New Script
Documenting code: # comments out everything after it on the same line
Running code from Script Window
• Mac: Apple + Enter
• Windows: F5 or Ctrl + r
Obtaining working directory
• getwd()
• Mac: Misc → Get Working Directory
• Windows: File → Change dir...
Changing working directory
• setwd()
• Mac: Misc → Change Working Directory
• Windows: File → Change dir...
Specify with forward slashes or double backslashes
Enclose in single or double quotation marks
Examples
• setwd("C:/Program Files/R/R-2.6.1")
• setwd('C:\\Program Files\\R\\R-2.6.1')
Helpful commands
• If you know the function name: help() or ?
> help(log)
> ?exp
• If you do not know the function name: help.search() or ??
> help.search("anova")
> ??regression
Elements of a documentation file
• Function{Package}
• Description
• Usage: What your code should look like, “=“ gives default
• Arguments: Inputs to the function
• Details
• Value: What the function will return
• See Also: Related functions
• Examples
• CRAN Website: http://cran.r-project.org/
• R Seek: http://www.rseek.org/
• Quick-R tutorial: http://www.statmethods.net/
• R Tutor: http://www.r-tutor.com/
• UCLA: http://www.ats.ucla.edu/stat/r/
• R listservs
Google tip: include “[R]” (instead of just “R”) with search topic to help filter out non-R websites
Over 2,500 add-on packages are listed on the CRAN website!
• Use with caution
• Initial download of R: base, graphics, stats, utils
1) Installing a package:
• Mac: Packages & Data → Package Installer
Use Package Search to locate and press ‘Install Selected’
• Windows: Packages → Install Packages
Locate desired package and press ‘OK’
• install.packages("MASS")
2) Using an installed package:
You MUST call it into active memory with library()
> library(MASS)
R has several basic types (or “classes”) of data:
• Numeric - Numbers
• Character – Strings (letters, words, etc.)
• Logical – TRUE or FALSE
• Vector
• Matrix
• Array
• Data Frame
• List
NOTE: There are other classes, but these are most common. Understanding differences will save you some headache.
• Find class of data
• Unknown class: class()
• Check particular class: is.classname(), e.g. is.character()
> a <- 5
> class(a)
[1] "numeric"
> is.character(a)
[1] FALSE
Change class: as.classname()
> as.character(a)
[1] "5"
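Checking and converting classes usually go together. A minimal sketch, where b and flag are names introduced only for this example:

```r
a <- 5
class(a)                 # "numeric"
b <- as.character(a)     # convert the number to a string
class(b)                 # "character"
is.numeric(b)            # FALSE: "5" is now text
as.numeric(b) + 1        # convert back before doing arithmetic
flag <- a > 3            # comparisons produce logical values
class(flag)              # "logical"
```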
Combine items into vector: c()
> c(1,2,3,4,5,6)
[1] 1 2 3 4 5 6
Repeat a number or a sequence of numbers: rep()
> rep(1,5)
[1] 1 1 1 1 1
> rep (c(2,5,7), times = 3)
[1] 2 5 7 2 5 7 2 5 7
Sequence generation: seq()
> seq(1,5)
[1] 1 2 3 4 5
> seq(1,5, by = .5)
[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
Try 1:10 or 10:1
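A few more variations on c(), rep(), and seq(), as a sketch. The each and length.out arguments are standard R but are not shown in the examples above.

```r
v1 <- c(1, 2, 3)
v2 <- rep(c("A", "B"), each = 2)      # repeats each element in turn: A A B B
v3 <- seq(0, 1, length.out = 5)       # 5 evenly spaced values from 0 to 1
v4 <- 10:1                            # the colon operator can count down
length(v4)
```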
Create matrix: matrix()
• 6 x 1 matrix: matrix(1:6, ncol = 1)
• 2 x 3 matrix: matrix(1:6, nrow =2, ncol =3)
• 2 x 3 matrix filling across rows first: matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
Create matrix of more than two dimensions (array): array()
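The effect of byrow and the shape of an array are easiest to see by indexing the results. A small sketch; m_col, m_row, and arr are illustrative names.

```r
m_col <- matrix(1:6, nrow = 2, ncol = 3)                # fills down the columns
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)  # fills across the rows
m_col[1, ]    # first row: 1 3 5
m_row[1, ]    # first row: 1 2 3

arr <- array(1:24, dim = c(2, 3, 4))   # 2 rows, 3 columns, 4 layers
dim(arr)
arr[1, 2, 3]  # row 1, column 2, layer 3
```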
Create a list: list()
• Holds vectors, matrices, arrays, etc. of varying lengths
• Objects in the list can be named or unnamed
> list(matrix(0, 2, 2), y = rep(c("A", "B"), each = 2))
[[1]]
     [,1] [,2]
[1,]    0    0
[2,]    0    0

$y
[1] "A" "A" "B" "B"
Data Frame: specialized list that holds variables of same length
Create a data frame: data.frame()
• Like a matrix, holds specified number of rows and columns
> x <- 1:4
> y <- rep(c("A", "B"), each = 2)
> data.frame(x,y)
  x y
1 1 A
2 2 A
3 3 B
4 4 B
• Unnamed variables get assigned names
> data.frame(1:2, c("A", "B"))
  X1.2 c..A....B..
1    1           A
2    2           B
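Naming the columns when the data frame is created avoids the generated names. A sketch with made-up column names id and group:

```r
df <- data.frame(id = 1:4, group = rep(c("A", "B"), each = 2))
names(df)       # "id" "group"
df$group        # extract one column by name
nrow(df)        # number of rows
ncol(df)        # number of columns
str(df)         # compact summary of the structure
```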
• Arithmetic: +, -, *, /
• Order of operations: ()
• Exponentiation: ^, exp()
• Other: log(), sqrt()
• Evaluate standard Normal density curve, at x = 3
> x <- 3
> 1/sqrt(2*pi)*exp(-(x^2)/2)
[1] 0.004431848
R is great at vectorizing operations
• Feed a matrix or vector into an expression
• Receive an object of similar dimension as output
For example, evaluate at x = 0,1,2,3
> x <- c(0,1,2,3)
> 1/sqrt(2*pi)*exp(-(x^2)/2)
[1] 0.398942280 0.241970725 0.053990967 0.004431848
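The hand-typed density can be checked against R's built-in dnorm(), which evaluates the Normal density (standard Normal by default). A sketch; by_hand and built_in are illustrative names.

```r
x <- c(0, 1, 2, 3)
by_hand  <- 1 / sqrt(2 * pi) * exp(-(x^2) / 2)   # formula from the slide
built_in <- dnorm(x)                             # mean = 0, sd = 1 by default
all.equal(by_hand, built_in)                     # TRUE when they agree up to rounding
```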
• Compare: ==, >, <, >=, <=, !=
> a <- c(1,1,2,4,3,1)
> a == 2
[1] FALSE FALSE  TRUE FALSE FALSE FALSE
• And: & or &&
• Or: | or ||
• Find location of TRUEs: which()
> which(a == 1)
[1] 1 2 6
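Comparisons can be combined and counted. In this sketch, & works element by element on vectors, whereas && is meant for single TRUE/FALSE values such as the condition of an if(). The name in_range is made up.

```r
a <- c(1, 1, 2, 4, 3, 1)
in_range <- a >= 2 & a <= 3    # element-wise "and"
which(in_range)                # positions 3 and 5
sum(a == 1)                    # TRUE counts as 1, so this counts the ones
any(a > 3)                     # is at least one element greater than 3?
all(a > 0)                     # are all elements positive?
```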
> a <- 1:5
> b <- matrix(1:12,nrow = 3)
Use Square brackets []
• Pick range of elements: a[1:3]
• Pick particular elements: a[c(1,3,5)]
• Do not include elements: a[-c(1,4)]
Use commas in more than one dimension (matrices & data frames)
• Pick particular elements: b[1:2,2:4]
• Give all rows and specified columns: b[,1:2]
• Give all columns and specified rows: b[1:2,]
• Note: b[2] treats the matrix as a vector (filled column by column) and returns the specified element
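Square brackets also accept logical conditions. A sketch using the same a and b as above; drop = FALSE is a standard argument that is not mentioned on the slide.

```r
a <- 1:5
b <- matrix(1:12, nrow = 3)
a[a > 2]               # logical index: keeps 3 4 5
b[2, 3]                # single element: row 2, column 3
b[b[, 1] > 1, ]        # rows whose first-column value exceeds 1
b[, 2, drop = FALSE]   # keep a one-column result as a matrix
```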
SwissNotes.csv Data set
• Compiled by Bernard Flury
• Contains measurements on 200 Swiss bank notes
• 100 genuine and 100 counterfeit notes
Most general function: read.table()
read.table(file, header = FALSE, sep = "", ...)
• Creates a data frame
• File name must be in quotes, single or double
• File name is case sensitive
• Include the file extension, and the full path if the file is not in the working directory
> read.table("C:/Users/jacksota/Desktop/SwissNotes.csv", header = TRUE, sep = ",")
Don't know the exact path or extension? Browse for the file with file.choose()
> read.table(file.choose(), header = TRUE, sep = ",")
• sep defines the separator, e.g. "," or "\t" or ""
• header indicates variable names should be read from first row
• header indicates variable names should be read from first row
For comma delimited files: read.csv()
For tab delimited files: read.delim()
For Minitab, SPSS, SAS, STATA, etc. data:
foreign package
• Contains functions to read variety of file formats
• Functions operate like read.table()
• Contains functions for writing data into these file formats
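The read functions can be tried without the workshop files by writing a tiny CSV to a temporary file first. This is only a sketch: the values in toy are made up and are not taken from SwissNotes.csv.

```r
# Made-up stand-in data; these values are NOT from SwissNotes.csv
toy <- data.frame(Length = c(214.8, 214.6, 215.1),
                  Type   = c("Genuine", "Genuine", "Counterfeit"))
path <- tempfile(fileext = ".csv")       # scratch file for this session
write.csv(toy, path, row.names = FALSE)

d1 <- read.table(path, header = TRUE, sep = ",")
d2 <- read.csv(path)                     # header = TRUE and sep = "," are its defaults
identical(names(d1), names(d2))
str(d2)
```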
• Identify variable names in data frame: names()
> data1 <- read.table("SwissNotes.csv", sep = ",", header = TRUE)
> names(data1)
[1] "Length"           "LeftHeight"       "RightHeight"      "LowerInner.Frame"
[5] "UpperInner.Frame" "Diagonal"         "Type"
Assign names to data frame variables
> names(data1) <- c("Length", "LeftHeight", "RightHeight",
                    "LowerInner.Frame", "UpperInner.Frame", "Diagonal", "Type")
Note: names are strings and MUST be contained in quotes
Create objects out of each data frame variable: attach()
In the Swiss Note data, to refer to Type as its own object
> attach(data1)
> Type
[1] Genuine Genuine Genuine ….
Remove attached objects from workspace: detach()
> detach(data1)
> Type
Error: object 'Type' not found
Note: Type is still part of original data frame, but is no longer a separate object.
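Variables can also be reached without attaching anything, using $ or with(). A sketch on a made-up stand-in for data1 (the values are invented for illustration):

```r
# Made-up stand-in for data1; not the real Swiss note measurements
data1 <- data.frame(Length = c(214.8, 214.6, 215.1),
                    Type   = c("Genuine", "Genuine", "Counterfeit"))
data1$Type                   # one variable, referenced through its data frame
with(data1, mean(Length))    # evaluate an expression "inside" the data frame
table(data1$Type)            # counts per category
```

Because nothing is attached, there is nothing to detach afterwards and no risk of masking other objects named Type or Length.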
plot() is the primary plotting function
Calling plot will open a new plotting window
Documentation: ?plot
For complete list of graphical parameters to manipulate: ?par
Let’s visualize the SwissNotes.csv data.
After loading the data into R, attach the data frame using attach(), e.g. attach(data1).
Let’s try a scatter plot of LeftHeight by RightHeight.
> plot(LeftHeight, RightHeight)
Change symbols: Option pch =
See ?par and ?points for details.
> plot(LeftHeight, RightHeight, pch = 2)
Change symbol color: Option col =
Specify by number or by name: col = 2 or col = "red"
Hint: Type palette() to see colors associated with number
Type colors() to see all possible colors
> plot(LeftHeight, RightHeight, col = "red")
Change plot type: Option type =
"p" for points
"l" for lines
"b" for both
"c" for the lines part alone of "b"
"o" for both overplotted
"h" for histogram-like (or high-density) vertical lines
"s" for stair steps
"S" for other steps, see Details in ?plot
"n" for no plotting
Points with lines… works better on a sorted list of points
> plot(LeftHeight, RightHeight, type = "o")
Use plot() with points() to plot different groups in same plot
Genuine notes vs. Counterfeit notes
> plot(LeftHeight[Type=="Genuine"], RightHeight[Type=="Genuine"], col="red")
> points(LeftHeight[Type=="Counterfeit"], RightHeight[Type=="Counterfeit"], col="blue")
The plot() command call has options to
• Specify x-axis label: xlab = "X Label"
• Specify y-axis label: ylab = "Y Label"
• Specify plot title: main = "Main Title"
• Specify subtitle: sub = "Subtitle"
> plot(LeftHeight[Type=="Genuine"], RightHeight[Type=="Genuine"], col="red",
       main="Plot of Bank Note Heights", sub="Measurements are in mm",
       xlab="Height of Left Side", ylab="Height of Right Side")
> points(LeftHeight[Type=="Counterfeit"],
         RightHeight[Type=="Counterfeit"], col="blue")
> legend("topleft", c("Genuine Notes", "Counterfeit Notes"), pch=c(21,21), col=c("red","blue"))
To add straight lines to plot: abline()
abline() refers to standard equation for a line: y = bx + a
• Horizontal line: abline(h= )
• Vertical Line: abline(v= )
• Otherwise: abline(a= , b= ) or abline(coef=c(a,b))
> abline(coef=c(21.7104,0.8319))
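The plotting steps can be put together on simulated data. This is a sketch under stated assumptions: the numbers are random and are not the Swiss note measurements, the names left, right, grp, and fit are invented, and using lm() to get the intercept and slope for abline() is an assumption about one common way to obtain such coefficients, not something the slides state.

```r
set.seed(1)                                   # simulated data, not the Swiss notes
left  <- rnorm(40, mean = 130, sd = 0.4)
right <- 0.8 * left + 26 + rnorm(40, sd = 0.2)
grp   <- rep(c("Genuine", "Counterfeit"), each = 20)

# xlim/ylim cover both groups so the points() layer is not cut off
plot(left[grp == "Genuine"], right[grp == "Genuine"], col = "red",
     xlim = range(left), ylim = range(right),
     xlab = "Height of Left Side", ylab = "Height of Right Side",
     main = "Simulated heights")
points(left[grp == "Counterfeit"], right[grp == "Counterfeit"], col = "blue")
legend("topleft", c("Genuine", "Counterfeit"), pch = 1, col = c("red", "blue"))

fit <- lm(right ~ left)                       # least-squares line (assumed approach)
abline(coef = coef(fit))                      # coef(fit) is c(intercept a, slope b)
```

Setting the axis limits in the first plot() call matters because points() never rescales the axes.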
Histograms are another popular plotting option.
> hist(Length)
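Scatterplot matrix: pairs()
Using the SwissNotes data, read into a data frame named swiss (R also has a built-in data set called swiss, so read the file in first)
> pairs(swiss)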
To create boxplots: boxplot()
Specify one or more variables to plot.
> boxplot(swiss$Length)
> boxplot(swiss[,2:3])
Use a formula specification for side-by-side boxplots.
Note: boxplot() has many options, e.g. notches. See
?boxplot.
> boxplot(Length~Type,notch=TRUE,data=swiss)
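Note: column 7 (Type) is not numeric, so restrict these functions to columns 1:6.
• Mean: mean()
> mean(swiss[,"Length"])
> sapply(swiss[,1:6], mean)
• rowMeans()
> rowMeans(swiss[,1:6])
• colMeans()
> colMeans(swiss[,1:6])
• Variance: var()
> var(swiss[,"Length"])
> var(swiss[,1:6])
• Covariance: cov()
> cov(swiss[,1:6])
• Correlation: cor()
> cor(swiss[,1:6])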
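The same summaries on a small made-up data frame, as a sketch. The values in toy are invented; tapply() is a standard function, not shown on the slides, for computing a statistic within groups.

```r
# Made-up values for illustration; not the real Swiss note data
toy <- data.frame(Length   = c(214.8, 214.6, 215.1, 214.9),
                  Diagonal = c(141.0, 141.7, 139.8, 139.5),
                  Type     = c("Genuine", "Genuine", "Counterfeit", "Counterfeit"))
num <- toy[, c("Length", "Diagonal")]   # keep only the numeric columns
colMeans(num)                           # one mean per column
sapply(num, sd)                         # apply sd() to each column
cor(num)                                # 2 x 2 correlation matrix
grp_means <- tapply(toy$Diagonal, toy$Type, mean)   # mean Diagonal within each Type
grp_means
```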
> summary(swiss[1:3])
     Length        LeftHeight     RightHeight
 Min.   :213.8   Min.   :129.0   Min.   :129.0
 1st Qu.:214.6   1st Qu.:129.9   1st Qu.:129.7
 Median :214.9   Median :130.2   Median :130.0
 Mean   :214.9   Mean   :130.1   Mean   :130.0
 3rd Qu.:215.1   3rd Qu.:130.4   3rd Qu.:130.2
 Max.   :216.3   Max.   :131.0   Max.   :131.1
table() produces crosstabs of factors or categorical variables
Using the cardiac data:
> table(cardiac[,7:9])
, , newMI = 0

      chestpain
gender   0   1
     F   6  10
     M   4   8

, , newMI = 1

      chestpain
gender   0   1
     F 100 222
     M  62 146
t.test() produces 1- and 2-sample (paired or independent) t-tests.
• 1-sample t-test
> t.test(x, alternative="two.sided", mu=0, conf.level=0.95)
• 2 independent samples t-test
> t.test(x, y, alternative="two.sided", mu=0, paired=FALSE, var.equal=TRUE, conf.level=0.95)
• paired t-test
> t.test(x, y, alternative="two.sided", mu=0, paired=TRUE, conf.level=0.95)
x: diagonal measurements for Genuine bank notes
y: diagonal measurements for Counterfeit bank notes
> x = swiss[Type=="Genuine","Diagonal"]
> y = swiss[Type=="Counterfeit","Diagonal"]
> t.test(x, y, alternative="greater", mu=0, paired=FALSE, var.equal=TRUE)

        Two Sample t-test

data:  x and y
t = 28.9149, df = 198, p-value < 2.2e-16
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 1.948864      Inf
sample estimates:
mean of x mean of y
  141.517   139.450
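The result of t.test() can be stored and its pieces pulled out by name. A sketch on simulated data; the means and standard deviations here are invented and only loosely echo the example, and res is an illustrative name.

```r
set.seed(42)                      # simulated data, not the Swiss notes
x <- rnorm(30, mean = 141.5, sd = 0.5)
y <- rnorm(30, mean = 139.5, sd = 0.5)
res <- t.test(x, y, alternative = "greater", var.equal = TRUE)
names(res)        # components stored in the result
res$statistic     # the t statistic
res$parameter     # degrees of freedom (n1 + n2 - 2 with var.equal = TRUE)
res$p.value       # the p-value as a plain number
res$conf.int      # the one-sided confidence interval
```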
R contains functions for generating random numbers from many well-known distributions.
Random number from standard normal distribution:
> rnorm(1,mean=0,sd=1)
[1] 0.5308293
Vector of random numbers from uniform distribution:
> runif(3, min=0, max=1)
[1] 0.6578880 0.3261863 0.3093383
To reproduce results: set.seed()
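Setting the same seed before each draw gives the same numbers. A minimal sketch; the seed value 123 and the names first and second are arbitrary.

```r
set.seed(123)
first <- rnorm(3)
set.seed(123)              # reset to the same starting point
second <- rnorm(3)
identical(first, second)   # TRUE: the draws repeat exactly
```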
if() statement
> n = rnorm(1)
> if (n < 0){
    n = abs(n)
  }
if() statement with else
> n = rnorm(1)
> if (n < 0){
    n = abs(n)
  } else {
    n = 0
  }
for() loop
> temp = rep(0,10)
> for (i in 1:10){
    temp[i] = i+1
  }
> temp
[1] 2 3 4 5 6 7 8 9 10 11
while() loop
> n = 1
> while (n < 10){
    n = n+1
  }
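Loops and conditions are often combined. This sketch counts negative draws with a for() loop and an if(), then gets the same count from a vectorized expression; draws and n_neg are made-up names.

```r
set.seed(1)
draws <- rnorm(10)
n_neg <- 0
for (d in draws) {          # d takes each value of draws in turn
  if (d < 0) {
    n_neg <- n_neg + 1      # count the negative values
  }
}
n_neg
sum(draws < 0)              # vectorized equivalent of the loop
```

The vectorized form is usually shorter and faster in R; the loop is shown to illustrate the control structures.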
test.function = function(input arguments){
  commands to execute
}
For example, let’s define a new function average to find the average of a set of numbers.
average = function(x){
  n = length(x)
  average = sum(x)/n
  print(average)
}
After writing a function in a script file, bring it into working memory using source().
source("pathname/test.function.R")
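Writing a function and loading one from a file can be tried end to end with a temporary script. A sketch only: average2, double_it, and the temporary file are all invented for illustration.

```r
# A variant of the example that returns its result instead of printing it
average2 <- function(x) {
  sum(x) / length(x)       # the value of the last expression is returned
}

# Write a one-line script to a scratch file, then load it with source()
script <- tempfile(fileext = ".R")
writeLines("double_it <- function(x) 2 * x", script)
source(script)             # defines double_it in the workspace

double_it(average2(c(1, 2, 3)))
```

Returning the value (rather than only printing it) lets the result be stored or passed to another function, as in the last line.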