

Welcome to the R intro Workshop

Before we begin, please download the “SwissNotes.csv” and “cardiac.txt” files from the ISCC website, under the R workshop (more info).

www.iub.edu/~iscc

Introduction to R

Workshop in Methods from the Indiana Statistical Consulting Center

Thomas A. Jackson

February 15, 2013

Overview

The R Project for Statistical Computing http://cran.r-project.org

“R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.”

- Description from CRAN Website

Benefits

R …

• is free

• is interactive: we can type something in and work with it

▫ How we analyze data can be broken into small steps

• is interpreted: we give it commands and it translates them into mathematical procedures or data management steps

• can be run in batch mode: convenient because the script documents every step of the analysis

• is a calculator: it is unlike other calculators though because you can create variables and objects

Let’s Get R Started

• How to open R

→ Start Menu

→ Programs

→ Departmentally Supported

→ Stat/Math

→ R

Graphical User Interface (GUI)

Three Environments

• Command Window (aka Console)

• Script Window

• Plot Window

Command Window Basics

To quit: type q()

Save workspace image? Answering yes writes the objects in memory to the hard drive (a .RData file in the working directory)

Storing variables in memory

• <- , -> , or =

• a<- 5 stores the number 5 in the object “a”

• pi -> b stores the number π= 3.141593 in “b”

• x = 1 + 2 stores the result of the calculation (3) in “x”

• “=” only assigns to the name on its left: x = 3 works, 3 = x does not

Try not to overwrite built-in names such as t, c, and pi!
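
The assignment forms above can be tried directly in the console. This is a minimal sketch; the object names are only examples, and the last lines illustrate what happens when a built-in name such as pi is overwritten.

```r
# Three ways to assign; each creates an object in the workspace.
a <- 5          # left assignment
pi -> b         # right assignment: stores the built-in constant pi in b
x = 1 + 2       # "=" also assigns, but only to the name on its left

print(c(a, b, x))

# Overwriting a built-in name hides the original; it does not destroy it.
pi <- 3         # a workspace copy now masks the built-in pi
rm(pi)          # removing the copy makes the built-in value visible again
print(pi)
```

Masking is the reason to avoid names such as t, c, and pi: code that expects the built-in object silently picks up the workspace copy instead.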

Command Window Basics

Printing to output

• Calculations that are not stored print to output

> 3 + 5

[1] 8

• Type name to view stored object

> a

[1] 5

• Use print()

> print(a)

[1] 5

View objects in workspace

• objects() or ls()

Command Window Basics

Clearing the console (command window)

• Menu (Mac and Windows): Edit → Clear Console

• Keyboard: Alt + Command + L (Mac) or Ctrl + L (Windows)

Removing variables from memory

• rm() or remove()

> x <- 4

> rm(x)

• rm(list = ls()) removes all variables

Script Window Basics

Saving syntax (code)

• Mac: File → New

• Windows: File → New Script

Documenting code: # comments out everything after it on the same line

Running code from Script Window

• Mac: Command (Apple) + Enter

• Windows: F5 or Ctrl + R

Working Directory

Obtaining working directory

• getwd()

• Mac: Misc → Get Working Directory

• Windows: File → Change dir...

Changing working directory

• setwd()

• Mac: Misc → Change Working Directory

• Windows: File → Change dir...

Path Names

Specify with forward slashes or double backslashes

Enclose in single or double quotation marks

Examples

• setwd("C:/Program Files/R/R-2.6.1")

• setwd('C:\\Program Files\\R\\R-2.6.1')

R Help

Helpful commands

• If you know the function name: help() or ?

> help(log)

> ?exp

• If you do not know the function name: help.search() or ??

> help.search("anova")

> ??regression

Documentation

Elements of a documentation file

• Function{Package}

• Description

• Usage: What a call to the function should look like; “=” shows the default value of an argument

• Arguments: Inputs to the function

• Details

• Value: What the function will return

• See Also: Related functions

• Examples

Online Resources

• CRAN Website: http://cran.r-project.org/

• R Seek: http://www.rseek.org/

• Quick-R tutorial: http://www.statmethods.net/

• R Tutor: http://www.r-tutor.com/

• UCLA: http://www.ats.ucla.edu/stat/r/

• R listservs

• Google

Google tip: include “[R]” (instead of just “R”) with search topic to help filter out non-R websites

Additional Packages

Over 2,500 listed on the CRAN website!

• Use with caution

• Initial download of R: base, graphics, stats, utils

1) Installing a package:

• Mac: Packages & Data → Package Installer

Use Package Search to locate and press ‘Install Selected’

• Windows: Packages → Install Packages

Locate desired package and press ‘OK’

• install.packages("MASS")

2) Using an installed package:

You MUST call it into active memory with library()

> library(MASS)

Data Structures

R has several basic types (or “classes”) of data:

• Numeric - Numbers

• Character – Strings (letters, words, etc.)

• Logical – TRUE or FALSE

• Vector

• Matrix

• Array

• Data Frame

• List

NOTE: There are other classes, but these are most common. Understanding differences will save you some headache.

Data Structures

• Find class of data

• Unknown class: class()

• Check particular class: is.classname()

> a <- 5

> class(a)

[1] "numeric"

> is.character(a)

[1] FALSE

Change class: as.classname()

> as.character(a)

[1] "5"
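
Why the class matters can be seen in a short sketch: the same value behaves differently as a number and as a string. The object names here are illustrative only.

```r
a <- 5
class(a)                      # "numeric"
is.numeric(a)                 # TRUE

b <- as.character(a)          # b is now the string "5"
class(b)                      # "character"

num <- as.numeric("3.2") + 1  # converting back to numeric allows arithmetic: 4.2

# A string that is not a number converts to NA (R also issues a warning,
# which is suppressed here to keep the output clean).
bad <- suppressWarnings(as.numeric("abc"))
is.na(bad)                    # TRUE
```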

Vectors

Combine items into vector: c()

> c(1,2,3,4,5,6)

[1] 1 2 3 4 5 6

Repeat a number or a sequence of numbers: rep()

> rep(1,5)

[1] 1 1 1 1 1

> rep (c(2,5,7), times = 3)

[1] 2 5 7 2 5 7 2 5 7

Vectors

Sequence generation: seq()

> seq(1,5)

[1] 1 2 3 4 5

> seq(1,5, by = .5)

[1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0

Try 1:10 or 10:1
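
A few more variations on c(), rep(), seq(), and the colon operator, as a small sketch with illustrative values:

```r
v <- c(1, 2, 3)

r_times <- rep(v, times = 2)         # whole vector repeated: 1 2 3 1 2 3
r_each  <- rep(v, each = 2)          # each element repeated: 1 1 2 2 3 3

s_by  <- seq(0, 1, by = 0.25)        # fixed step: 0.00 0.25 0.50 0.75 1.00
s_len <- seq(0, 1, length.out = 3)   # fixed number of values: 0.0 0.5 1.0

down <- 10:1                         # the colon operator also counts down
length(seq(1, 5, by = .5))           # 9 values, matching the output above
```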

Matrices

Create matrix: matrix()

• 6 x 1 matrix: matrix(1:6, ncol = 1)

• 2 x 3 matrix: matrix(1:6, nrow =2, ncol =3)

• 2 x 3 matrix filling across rows first: matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

Create matrix of more than two dimensions (array): array()
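
The effect of byrow and the shape of an array are easiest to see by inspecting individual rows and elements. A minimal sketch with illustrative object names:

```r
m1 <- matrix(1:6, nrow = 2, ncol = 3)                # default: fills down the columns
m2 <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)  # fills across the rows

m1[1, ]   # first row is 1 3 5
m2[1, ]   # first row is 1 2 3

arr <- array(1:24, dim = c(2, 3, 4))  # 2 rows, 3 columns, 4 layers
dim(arr)                              # 2 3 4
arr[2, 3, 4]                          # last element: 24
```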

Lists

Create a list: list()

• Holds vectors, matrices, arrays, etc. of varying lengths

• Objects in the list can be named or unnamed

> list(matrix(0, 2, 2), y = rep(c("A", "B"), each = 2))

[[1]]
     [,1] [,2]
[1,]    0    0
[2,]    0    0

$y
[1] "A" "A" "B" "B"

Data Frame: specialized list that holds variables of same length

Data Frames

Create a data frame: data.frame()

• Like a matrix, holds specified number of rows and columns

> x <- 1:4

> y <- rep(c("A", "B"), each = 2)

> data.frame(x, y)
  x y
1 1 A
2 2 A
3 3 B
4 4 B

• Unnamed variables get assigned names

> data.frame(1:2, c("A", "B"))
  X1.2 c..A....B..
1    1           A
2    2           B
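
Naming the elements up front avoids the generated names shown above and makes the pieces easy to retrieve. A small sketch; the names m, y, id, and group are made up for illustration.

```r
# A list can mix objects of different types and sizes.
lst <- list(m = matrix(0, 2, 2), y = rep(c("A", "B"), each = 2))
lst$y        # access an element by name
lst[[1]]     # access an element by position (here, the matrix)

# A data frame is a list of equal-length columns.
df <- data.frame(id = 1:2, group = c("A", "B"))
names(df)    # "id" "group"
df$group     # one column, extracted as a vector
dim(df)      # 2 rows, 2 columns
```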

Basic Operations

• Arithmetic: +, -, *, /

• Order of operations: ()

• Exponentiation: ^, exp()

• Other: log(), sqrt()

• Evaluate standard Normal density curve, at x = 3

> x <- 3

> 1/sqrt(2*pi)*exp(-(x^2)/2)

[1] 0.004431848

Vectorization

R is great at vectorizing operations

• Feed a matrix or vector into an expression

• Receive an object of similar dimension as output

For example, evaluate at x = 0,1,2,3

> x <- c(0,1,2,3)

> 1/sqrt(2*pi)*exp(-(x^2)/2)

[1] 0.398942280 0.241970725 0.053990967 0.004431848
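
As a check on the hand-written formula, the result can be compared with R's built-in standard Normal density, dnorm(). The slides do not use dnorm(); it is brought in here only for verification.

```r
x <- c(0, 1, 2, 3)

# Density written out by hand; every operation is applied element by element.
by_hand <- 1 / sqrt(2 * pi) * exp(-(x^2) / 2)

# Built-in standard Normal density (mean 0, sd 1 by default).
built_in <- dnorm(x)

all.equal(by_hand, built_in)   # TRUE: the two agree up to rounding error
```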

Logical Operations

• Compare: ==, >, <, >=, <=, !=

> a <- c(1,1,2,4,3,1)

> a == 2

[1] FALSE FALSE  TRUE FALSE FALSE FALSE

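• And: & (element by element, for vectors) or && (single TRUE/FALSE values, as in if() conditions)

• Or: | (element by element, for vectors) or || (single TRUE/FALSE values)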

• Find location of TRUEs: which()

> which(a == 1)

[1] 1 2 6
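
Comparisons can be combined and counted. A minimal sketch using the same vector a:

```r
a <- c(1, 1, 2, 4, 3, 1)

both   <- a > 1 & a < 4     # element by element: FALSE FALSE TRUE FALSE TRUE FALSE
either <- a == 1 | a == 4   # element by element: TRUE TRUE FALSE TRUE FALSE TRUE

which(both)                 # positions of the TRUEs: 3 5
sum(a == 1)                 # TRUE counts as 1, so this counts the ones: 3

# && works on single values, e.g. the results of any() and all()
ok <- any(a > 3) && all(a > 0)   # TRUE
```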

Subsetting

> a <- 1:5

> b <- matrix(1:12,nrow = 3)

Use square brackets []

• Pick range of elements: a[1:3]

• Pick particular elements: a[c(1,3,5)]

• Do not include elements: a[-c(1,4)]

Subsetting (cont.)

Use commas in more than one dimension (matrices & data frames)

• Pick particular elements: b[1:2,2:4]

• Give all rows and specified columns: b[,1:2]

• Give all columns and specified rows: b[1:2,]

• Note: b[2] coerces into a vector then gives specified element
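
The subsetting rules can be combined with logical conditions. A short sketch using the a and b defined above; drop = FALSE is an extra argument not mentioned on the slides.

```r
a <- 1:5
b <- matrix(1:12, nrow = 3)      # 3 x 4 matrix, filled down the columns

a[a > 2]                         # logical subsetting: 3 4 5
b[2, 3]                          # row 2, column 3: 8
b[2]                             # matrix treated as one long vector: 2

col1_vec <- b[, 1]               # a single column comes back as a plain vector
col1_mat <- b[, 1, drop = FALSE] # drop = FALSE keeps it as a 3 x 1 matrix
```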

Reading External Data Files

SwissNotes.csv Data set

• Compiled by Bernard Flury

• Contains measurements on 200 Swiss bank notes

• 100 genuine and 100 counterfeit notes

Reading External Data Files (cont.)

Most general function: read.table()

read.table(file, header = FALSE, sep = "", …)

• Creates a data frame

• File name must be in quotes, single or double

• File name is case sensitive

• Include the file extension; include the full path as well if the file is not in the working directory

> read.table("C:/Users/jacksota/Desktop/SwissNotes.csv", header = TRUE, sep = ",")

Don’t know the file path? Try: file.choose(), which opens a file browser

> read.table(file.choose(), header = TRUE, sep = ",")

• sep defines the separator, e.g. "," or "\t" or ""

• header indicates variable names should be read from first row
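
To see what header and sep do without needing the workshop files, the sketch below writes a tiny comma-separated file to a temporary location and reads it back. The file contents are made up for illustration and are not the real SwissNotes data.

```r
# Write a small, made-up CSV file to a temporary path.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Length,Type", "214.8,Genuine", "214.6,Counterfeit"), tmp)

# header = TRUE: the first row supplies the variable names.
d <- read.table(tmp, header = TRUE, sep = ",")
names(d)        # "Length" "Type"
nrow(d)         # 2 rows of data

# header = FALSE: the first row is read as data and names are generated.
d_nohead <- read.table(tmp, header = FALSE, sep = ",")
names(d_nohead) # "V1" "V2"
nrow(d_nohead)  # 3 rows, including the header line

unlink(tmp)     # remove the temporary file
```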

Reading External Data Files

For comma delimited files: read.csv()

For tab delimited files: read.delim()

For Minitab, SPSS, SAS, STATA, etc. data:

foreign package

• Contains functions to read variety of file formats

• Functions operate like read.table()

• Contains functions for writing data into these file formats

Data Frame Hints

• Identify variable names in data frame: names()

> data1 <- read.table("SwissNotes.csv", sep = ",", header = TRUE)

> names(data1)

[1] "Length" "LeftHeight" "RightHeight" "LowerInner.Frame"

[5] "UpperInner.Frame" "Diagonal" "Type"

Assign names to data frame variables

> names(data1) <- c("Length", "LeftHeight", "RightHeight",
  "LowerInner.Frame", "UpperInner.Frame", "Diagonal", "Type")

Note: names are strings and MUST be contained in quotes

Data Frame Hints (cont.)

Create objects out of each data frame variable: attach()

In the Swiss Note data, to refer to Type as its own object

> attach(data1)

> Type

[1] Genuine Genuine Genuine ….

Data Frame Hints (cont.)

Remove attached objects from workspace: detach()

> detach(data1)

> Type

Error: object 'Type' not found

Note: Type is still part of original data frame, but is no longer a separate object.
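
attach() is convenient, but attached variables can be masked by, or confused with, workspace objects of the same name. Two common alternatives that leave the workspace untouched are the $ operator and with(). The sketch below uses a small made-up data frame called notes, not the real Swiss bank note data.

```r
# Made-up stand-in for the bank note data.
notes <- data.frame(Length = c(214.8, 214.6, 215.1),
                    Type   = c("Genuine", "Genuine", "Counterfeit"))

# $ extracts one variable without attaching anything.
notes$Type

# with() evaluates an expression using the data frame's variables.
gen_mean <- with(notes, mean(Length[Type == "Genuine"]))
gen_mean        # mean of 214.8 and 214.6
```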

plot() function

plot() is the primary plotting function

Calling plot will open a new plotting window

Documentation: ?plot

For complete list of graphical parameters to manipulate: ?par

plot() function

Let’s visualize the SwissNotes.csv data.

After loading the data into R, attach the data frame using attach(), e.g. attach(data1) for the data frame created earlier.

Let’s try a scatter plot of LeftHeight by RightHeight.

>plot(LeftHeight, RightHeight)

plot() function

Change symbols: option pch =

See ?par for details.

>plot(LeftHeight,RightHeight,pch=2)

plot() Function

Change symbol color: Option col=

Specify by number or by name: col = 2 or col = "red"

Hint: Type palette() to see colors associated with number

Type colors() to see all possible colors

> plot(LeftHeight, RightHeight, col = "red")

What types of points can we get?

plot() Function

Change plot type: Option type =

"p" for points

"l" for lines

"b" for both

"c" for the lines part alone of "b"

"o" for both overplotted

"h" for histogram-like (or high-density) vertical lines

"s" for stair steps

"S" for other steps (see Details in ?plot)

"n" for no plotting

plot() Function

Points with lines: works better when the points are sorted

> plot(LeftHeight, RightHeight, type = "o")

Scatterplots for Multiple Groups

Use plot() with points() to plot different groups in same plot

Genuine notes vs. Counterfeit notes

> plot(LeftHeight[Type == "Genuine"], RightHeight[Type == "Genuine"], col = "red")

> points(LeftHeight[Type == "Counterfeit"], RightHeight[Type == "Counterfeit"], col = "blue")
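
One thing to watch with this approach: plot() chooses the axis limits from the first group only, so points from the second group can fall outside the plotting region. A possible workaround, sketched below on simulated stand-in data (not the real bank notes), is to set up the axes with all the data using type = "n" and then add each group with points().

```r
# Simulated stand-in data; the means and sds are made up for illustration.
set.seed(1)
notes <- data.frame(
  LeftHeight  = c(rnorm(10, 129.9, 0.2), rnorm(10, 130.3, 0.2)),
  RightHeight = c(rnorm(10, 129.7, 0.2), rnorm(10, 130.2, 0.2)),
  Type        = rep(c("Genuine", "Counterfeit"), each = 10)
)
gen <- notes$Type == "Genuine"

pdf(NULL)  # null graphics device so the sketch runs without a display;
           # omit this line (and dev.off()) to see the plot on screen

# Empty plot whose axes cover every point in both groups.
plot(notes$LeftHeight, notes$RightHeight, type = "n")

# Add each group in its own color.
points(notes$LeftHeight[gen],  notes$RightHeight[gen],  col = "red")
points(notes$LeftHeight[!gen], notes$RightHeight[!gen], col = "blue")

dev.off()
```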

Axis Labels and Plot Titles

The plot() command call has options to

• Specify x-axis label: xlab = "X Label"

• Specify y-axis label: ylab = "Y Label"

• Specify plot title: main = "Main Title"

• Specify subtitle: sub = "Subtitle"

Axis Labels and Plot Titles

> plot(LeftHeight[Type == "Genuine"], RightHeight[Type == "Genuine"], col = "red", main = "Plot of Bank Note Heights", sub = "Measurements are in mm", xlab = "Height of Left Side", ylab = "Height of Right Side")

> points(LeftHeight[Type == "Counterfeit"], RightHeight[Type == "Counterfeit"], col = "blue")

Legends

> legend("topleft", c("Genuine Notes", "Counterfeit Notes"), pch = c(21, 21), col = c("red", "blue"))

Adding Lines

To add straight lines to plot: abline()

abline() refers to standard equation for a line: y = bx + a (a is the intercept, b is the slope)

• Horizontal line: abline(h= )

• Vertical Line: abline(v= )

• Otherwise: abline(a= , b= ) or abline(coef=c(a,b))

Adding Lines

> abline(coef=c(21.7104,0.8319))
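
The pieces above (labels, lines, legend) fit together as in the sketch below, which uses simulated data. The slides do not say where the intercept and slope passed to abline() come from; fitting a least-squares line with lm() is one possibility and is used here purely as an assumption. lty (line type) is a graphical parameter documented in ?par.

```r
# Simulated data for illustration only.
set.seed(2)
x <- rnorm(30, mean = 130, sd = 0.3)
y <- x + rnorm(30, mean = 0, sd = 0.2)

pdf(NULL)  # null graphics device; omit (with dev.off()) to view the plot

plot(x, y, main = "Simulated heights", xlab = "Left", ylab = "Right")

abline(h = mean(y), col = "gray")   # horizontal line at the mean of y
abline(v = mean(x), col = "gray")   # vertical line at the mean of x

# Assumption: intercept and slope taken from a least-squares fit.
fit <- lm(y ~ x)
abline(coef = coef(fit), lty = 2)   # coef(fit) is c(intercept, slope)

legend("topleft", legend = c("Means", "Fitted line"),
       lty = c(1, 2), col = c("gray", "black"))

dev.off()
```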

Histograms

Histograms are another popular plotting option.

> hist(Length)

pairs() Function

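Using the SwissNotes data, assumed here to be stored in a data frame named swiss (this name masks R's built-in swiss dataset)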

> pairs(swiss)

Boxplots

To create boxplots: boxplot()

Specify one or more variables to plot.

> boxplot(swiss$Length)

> boxplot(swiss[,2:3])

Boxplots

Use a formula specification for side-by-side boxplots.

Note: boxplot() has many options, e.g. notches. See

?boxplot.

> boxplot(Length~Type,notch=TRUE,data=swiss)

Mean or Average

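• mean()

> mean(swiss[,"Length"])

Note: mean(swiss) on a whole data frame returns NA with a warning in current versions of R; use colMeans() on the numeric columns instead

• rowMeans()

> rowMeans(swiss[,1:6])

• colMeans()

> colMeans(swiss[,1:6])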

Variability

• Variance: var()

> var(swiss[,"Length"])

> var(swiss[,1:6])

• Covariance: cov()

> cov(swiss[,1:6])

• Correlation: cor()

> cor(swiss[,1:6])
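
These functions expect numeric input, which is why the examples select columns 1 to 6 and leave out Type. The sketch below shows the same idea on a small made-up data frame with one non-numeric column.

```r
# Made-up data: two numeric columns and one grouping column.
d <- data.frame(a = c(1, 2, 3, 4),
                b = c(2, 4, 6, 8),
                g = c("x", "x", "y", "y"))

# Keep only the numeric columns before summarizing.
num <- d[, sapply(d, is.numeric)]

colMeans(num)    # one mean per column: a = 2.5, b = 5
rowMeans(num)    # one mean per row: 1.5 3.0 4.5 6.0
var(d$a)         # variance of a single variable: 5/3
cov(num)         # covariance matrix of the numeric columns
cor(num)         # correlations are all 1 here because b is exactly 2 * a
```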

Five-number Summary

> summary(swiss[1:3])
     Length        LeftHeight     RightHeight
 Min.   :213.8   Min.   :129.0   Min.   :129.0
 1st Qu.:214.6   1st Qu.:129.9   1st Qu.:129.7
 Median :214.9   Median :130.2   Median :130.0
 Mean   :214.9   Mean   :130.1   Mean   :130.0
 3rd Qu.:215.1   3rd Qu.:130.4   3rd Qu.:130.2
 Max.   :216.3   Max.   :131.0   Max.   :131.1

Creating Tables

table() produces crosstabs of factors or categorical variables

Using the cardiac data:

> table(cardiac[,7:9])
, , newMI = 0

      chestpain
gender   0   1
     F   6  10
     M   4   8

, , newMI = 1

      chestpain
gender   0   1
     F 100 222
     M  62 146
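
A two-way version of the same idea, on a handful of made-up observations (not the cardiac data). prop.table() is not covered on the slides; it converts counts to proportions.

```r
# Made-up observations for six patients.
gender    <- c("F", "M", "F", "F", "M", "M")
chestpain <- c(0, 1, 1, 0, 1, 1)

tab <- table(gender, chestpain)   # rows: gender, columns: chestpain
tab

tab["F", "1"]                     # one cell: females with chest pain (1)
rowSums(tab)                      # totals per gender: F = 3, M = 3

prop.table(tab, margin = 1)       # proportions within each row
```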

Univariate t-tests

t.test() produces 1- and 2-sample (paired or independent) t-tests.

• 1-sample t-test

> t.test(x, alternative = "two.sided", mu = 0, conf.level = 0.95)

• 2 independent samples t-test

> t.test(x, y, alternative = "two.sided", mu = 0, paired = FALSE, var.equal = TRUE, conf.level = 0.95)

• paired t-test

> t.test(x, y, alternative = "two.sided", mu = 0, paired = TRUE, conf.level = 0.95)

2 Independent Samples t-test

x: diagonal measurements for Genuine bank notes

y: diagonal measurements for Counterfeit bank notes

> x = swiss[swiss$Type == "Genuine", "Diagonal"]

> y = swiss[swiss$Type == "Counterfeit", "Diagonal"]

> t.test(x, y, alternative = "greater", mu = 0, paired = FALSE, var.equal = TRUE)

2 Independent Samples t-test

> t.test(x, y, alternative = "greater", mu = 0, paired = FALSE, var.equal = TRUE)

	Two Sample t-test

data:  x and y
t = 28.9149, df = 198, p-value < 2.2e-16
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 1.948864      Inf
sample estimates:
mean of x mean of y
  141.517   139.450
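
The printed output is only a display; t.test() returns an object whose pieces can be stored and reused. The sketch below runs the same kind of test on simulated data (the means and standard deviations are made up, not taken from the bank notes) and pulls out individual results.

```r
# Simulated samples for illustration only.
set.seed(42)
x <- rnorm(50, mean = 141.5, sd = 0.5)
y <- rnorm(50, mean = 139.5, sd = 0.5)

res <- t.test(x, y, alternative = "greater", var.equal = TRUE)

res$statistic   # the t statistic
res$parameter   # degrees of freedom: 50 + 50 - 2 = 98
res$p.value     # the p-value as a number
res$conf.int    # one-sided interval: the upper limit is Inf
res$estimate    # the two sample means
```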

Generating Random Numbers

R contains functions for generating random numbers from many well-known distributions.

Random number from standard normal distribution:

> rnorm(1,mean=0,sd=1)

[1] 0.5308293

Vector of random numbers from uniform distribution:

> runif(3, min=0, max=1)

[1] 0.6578880 0.3261863 0.3093383

To reproduce results: set.seed()
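
A minimal sketch of how set.seed() makes random draws repeatable; the seed value 123 is arbitrary.

```r
set.seed(123)            # fix the state of the random number generator
first <- rnorm(3)

set.seed(123)            # same seed again
second <- rnorm(3)

identical(first, second) # TRUE: the same seed gives the same draws

third <- rnorm(3)        # no reset: the generator moves on to new values
```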

Function Basics

if() statement

> n = rnorm(1)
> if (n < 0) {
    n = abs(n)
  }

if() statement with else

> n = rnorm(1)
> if (n < 0) {
    n = abs(n)
  } else {
    n = 0
  }

Function Basics

for() loop

> temp = rep(0,10)

> for (i in 1:10) {
    temp[i] = i + 1
  }

> temp

[1] 2 3 4 5 6 7 8 9 10 11

Function Basics

while() loop

> n = 1

> while (n < 10) {
    n = n + 1
  }
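
The three constructs can be tried with fixed inputs so the results are predictable. This is a sketch with illustrative values; note that at the console, else has to start on the same line as the closing brace of the if block, otherwise R treats the if statement as finished.

```r
# if / else with a known input
n <- -2.5
if (n < 0) {
  n <- abs(n)
} else {
  n <- 0
}
n                        # 2.5

# for loop filling a pre-allocated vector
squares <- numeric(5)
for (i in 1:5) {
  squares[i] <- i^2
}
squares                  # 1 4 9 16 25
(1:5)^2                  # the vectorized equivalent, usually preferable in R

# while loop: keep doubling until the value reaches 10
k <- 1
steps <- 0
while (k < 10) {
  k <- k * 2
  steps <- steps + 1
}
c(k, steps)              # k ends at 16 after 4 doublings
```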

Creating Functions

test.function = function(input arguments) {
  commands to execute
}

Creating Functions

For example, let’s define a new function average to find the average of a set of numbers.

average = function(x) {
  n = length(x)
  average = sum(x)/n
  print(average)
}
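
A possible variation, shown as a sketch rather than as the workshop's own version: the function below returns its result instead of printing it, and adds a hypothetical na.rm argument with a default value so that missing values can be dropped on request.

```r
# Variant of average(): returns the value and can ignore missing values.
# The na.rm argument is an illustrative addition, not part of the slide.
average <- function(x, na.rm = FALSE) {
  if (na.rm) {
    x <- x[!is.na(x)]    # drop missing values when asked to
  }
  sum(x) / length(x)     # the last expression evaluated is the return value
}

avg1 <- average(c(2, 4, 9))                  # 5
avg2 <- average(c(2, NA, 4), na.rm = TRUE)   # 3
avg3 <- average(c(2, NA, 4))                 # NA, since a value is missing
```

Returning the value, rather than only printing it, lets the result be stored and used in later calculations.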

Sourcing

After writing a function in a script file, bring it into working memory using source().

source("pathname/test.function.R")
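
To try source() without an existing script, the sketch below writes a one-line script to a temporary file and then sources it. The function name double_it and the file are made up for illustration.

```r
# Create a small script file containing one function definition.
script <- tempfile(fileext = ".R")
writeLines("double_it <- function(x) 2 * x", script)

# source() runs the file, so double_it now exists in the workspace.
source(script)
double_it(21)     # 42

unlink(script)    # remove the temporary file
```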
