Introduction to R Workshop
June 23-25, 2010
Southwest Fisheries Science Center
3333 North Torrey Pines Court
La Jolla, CA 92037

Eric Archer
eric.archer@noaa.gov
858-546-7121

Introduction to R
1) How R thinks
   • Environment
   • Data Structures
   • Data Input/Output
2) Becoming a codeR
   • Data Selection and Manipulation
   • Data Summary
   • Functions
3) Visualization and analysis
   • Data Processing ('apply' family)
   • Plotting & Graphics
   • Statistical Distributions
   • Statistical Tests
   • Model Fitting
   • Packages, Path, Options

S, S-Plus, R
S
   Chambers, Becker, Wilks
   1984: Bell Labs
S-Plus
   1988: Statistical Sciences
   1993: MathSoft
   2001: Insightful
   2008: TIBCO
R
   Ihaka & Gentleman 1996 (The R Project)

"Programming ought to be regarded as an integral part of effective and responsible data analysis" - Venables and Ripley. 1999. S Programming

Why R?
• Free
• Open source
• Many packages
• Large support base
• Multi-platform
• Vectorization

Workspace
Entering commands
• commands and assignments executed or evaluated immediately
• separated by new line (Enter/Return) or semicolon
• recall commands with ↑ or ↓
• case sensitive
• everything is some sort of function that does something

Getting help
> help(mean)
> ?median
> help("[")
> example(mean)
> help.search("regression")
> RSiteSearch("genetics")
http://www.r-project.org/

Workspace
ls()               list objects in workspace
rm(...)            remove objects from workspace
rm(list = ls())    remove all objects from workspace
save.image()       saves workspace
load(".RData")     loads saved workspace
history()          view command history
loadhistory()      load command history
savehistory()      save command history
#                  comments

Assignment and data creation
<-                 assign
c(...)             combine arguments into a vector
seq(x)             generate sequence from 1 to x
seq(from,to,by)    generate sequence with increment by
from:to            generate sequence from "from" to "to"
rep(x,times)       replicate x
letters,LETTERS    vector of 26 lower and upper case letters

> x <- 1
> y <- "A"
> my.vec <- c(1, 5, 6, 10)
> my.nums <- 12:24
> x
[1] 1
> y
[1] "A"
> my.vec
[1] 1 5 6 10
> my.nums
[1] 12 13 14 15 16 17 18 19 20 21 22 23 24

Data Structures
Object modes (atomic structures)
integer      whole numbers (15, 23, 8, 42, 4, 16)
numeric      real numbers (double precision: 3.14, 0.0002, 6.022E23)
character    text string ("Hello World", "ROFLMAO", "A")
logical      TRUE/FALSE or T/F

Object classes
vector       object with atomic mode
factor       vector object with discrete groups (ordered/unordered)
array        multiple dimensions
matrix       2-dimensional array
list         vector of components
data.frame   "matrix-like" list of variables of same # of rows

Special Values
NULL         object of zero length, test with is.null(x)
NA           Not Available / missing value, test with is.na(x)
NaN          Not a number, test with is.nan(x) (e.g. 0/0, log(-1))
Inf, -Inf    Positive/negative infinity, test with is.infinite(x) (e.g. 1/0)

Vectors
Creation and info
vector(mode,length)    create vector
length(x)              number of elements
names(x)               get or set names

Indexing (number, character (name), or logical)
x[n]               nth element
x[-n]              all but the nth element
x[a:b]             elements a to b
x[-(a:b)]          all but elements a to b
x[c(...)]          specific elements
x["name"]          "name" element
x[x > a]           all elements greater than a
x[x %in% c(...)]   all elements in the set

Vectors
Create a vector
> x <- 1:10
Give the elements some names (only the first five; the remaining names are NA)
> names(x) <- c("first","second","third","fourth","fifth")
Select elements based on another vector
> i <- c(1,5)
> x[i]
first fifth
    1     5
> x[-c(i,8)]
second  third fourth   <NA>   <NA>   <NA>   <NA>
     2      3      4      6      7      9     10

Vectors
logical testing
==        equals
>, <      greater, less than
>=, <=    greater, less than or equal to
!         not
&, &&     and (single is element-by-element, double is first element)
|, ||     or
Note: in R >= 4.3.0, && and || signal an error if an operand has length greater than one.

Select elements based on a condition
> x <- 1:10
> x < 5
[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
> x[x < 5]
[1] 1 2 3 4

& vs &&
> x < 5 & x > 2
[1] FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
> x < 5 && x > 2
[1] FALSE

Vectorization
Operator recycles smaller object enough times to cover larger object
> x <- 4
> y <- c(5, 6, 7, 8, 9, 10)
> z <- x + y
> z
[1] 9 10 11 12 13 14
> x <- c(3, 5)
> z <- x + y
> z
[1] 8 11 10 13 12 15
> i <- 1:10
> j <- c(T, T, F)
> i[j]
[1] 1 2 4 5 7 8 10

Object Information
summary(x)         generic summary of object
str(x)             display object structure
mode(x)            get or set storage mode
class(x)           name of object class
is.<class>(x)      test type of object (is.numeric, is.logical, etc.)
attr(x, which)     get or set the attribute of an object
attributes(x)      get or set all attributes of an object

Object Information
> y <- 1:10
> str(y)
 int [1:10] 1 2 3 4 5 6 7 8 9 10
> mode(y)
[1] "numeric"
> class(y)
[1] "integer"
> is.character(y)
[1] FALSE
> is.integer(y)
[1] TRUE
> is.double(y)
[1] FALSE
> is.numeric(y)
[1] TRUE

Object Information
> x <- 1:4
> names(x) <- c("first","second","third","four")
> x
 first second  third   four
     1      2      3      4
> str(x)
 Named int [1:4] 1 2 3 4
 - attr(*, "names")= chr [1:4] "first" "second" "third" "four"
> attributes(x)
$names
[1] "first" "second" "third" "four"

> attr(x, "notes") <- "This is a really important vector."
> attributes(x)
$names
[1] "first" "second" "third" "four"
$notes
[1] "This is a really important vector."

> attr(x, "date") <- 20090624
> attributes(x)
$names
[1] "first" "second" "third" "four"
$notes
[1] "This is a really important vector."
$date
[1] 20090624

> x
 first second  third   four
     1      2      3      4
attr(,"notes")
[1] "This is a really important vector."
attr(,"date")
[1] 20090624

coercion
as.<class>(x) coerces object x to <class> if possible
> x <- 1:10
> x.char <- as.character(x)
> as.numeric(x.char)
[1] 1 2 3 4 5 6 7 8 9 10
> y <- letters[1:10]
> as.numeric(y)
[1] NA NA NA NA NA NA NA NA NA NA
Warning message:
NAs introduced by coercion
> z <- "1char"
> as.numeric(z)
[1] NA
Warning message:
NAs introduced by coercion
> logic.chars <- c("TRUE", "FALSE", "T", "F", "t", "f", "0", "1")
> as.logical(logic.chars)
[1] TRUE FALSE TRUE FALSE NA NA NA NA
> logic.nums <- c(-2, -1, 0, 1.5, 2, 100)
> as.logical(logic.nums)
[1] TRUE TRUE FALSE TRUE TRUE TRUE

Factors
• Discrete ordered or unordered data
• Internally represented numerically

factor(x, levels, labels, exclude, ordered)
levels(x)
labels(x)
is.factor(x), is.ordered(x)

Factors
> x <- c("b", "a", "a", "c", "B", "d", "a", "d")
> x.fac <- factor(x)
> x.fac
[1] b a a c B d a d
Levels: a b B c d
> str(x.fac)
 Factor w/ 5 levels "a","b","B","c",..: 2 1 1 4 3 5 1 5
> levels(x.fac)
[1] "a" "b" "B" "c" "d"
> labels(x.fac)
[1] "1" "2" "3" "4" "5" "6" "7" "8"
> as.numeric(x.fac)
[1] 2 1 1 4 3 5 1 5
> as.character(x.fac)
[1] "b" "a" "a" "c" "B" "d" "a" "d"

Factors
> x.fac.lvl <- factor(x, levels = c("a", "c"))
> x.fac.lvl
[1] <NA> a a c <NA> <NA> a <NA>
Levels: a c
> x.fac.exc <- factor(x, exclude = c("a", "c"))
> x.fac.exc
[1] b <NA> <NA> <NA> B d <NA> d
Levels: b B d
> x.fac.lbl <- factor(x, labels = c("L1", "L2", "L3", "L4", "L5"))
> x.fac.lbl
[1] L2 L1 L1 L4 L3 L5 L1 L5
Levels: L1 L2 L3 L4 L5
> x.fac[2] < x.fac[1]
[1] NA
Warning message:
In Ops.factor(x.fac[2], x.fac[1]) : < not meaningful for factors
> x.ord <- factor(x, ordered = TRUE)
> x.ord
[1] b a a c B d a d
Levels: a < b < B < c < d
> x.ord[2] < x.ord[1]
[1] TRUE

Arrays and Matrices
array(data, dim, dimnames)            create array (row index varies fastest, i.e. filled column by column)
matrix(data, nrow, ncol, dimnames)    create matrix

x[row, col]         element at row, col
x[row,], x[, col]   vector of row and col
x["name", ] etc.    vector of row "name"

dim(x)         retrieve or set dimensions
nrow(x)        number of rows
ncol(x)        number of columns

dimnames(x)    retrieve or set dimension names
rownames(x)    retrieve or set row names
colnames(x)    retrieve or set column names

cbind(...)     create array from columns
rbind(...)     create array from rows
t(x)           transpose (matrices)

Arrays and Matrices
Create an array
> x <- array(1:10, dim = c(4, 6))
> x
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    1    5    9    3    7    1
[2,]    2    6   10    4    8    2
[3,]    3    7    1    5    9    3
[4,]    4    8    2    6   10    4
> str(x)
 int [1:4, 1:6] 1 2 3 4 5 6 7 8 9 10 ...
> attributes(x)
$dim
[1] 4 6
> dim(x)
[1] 4 6
> dimnames(x)
NULL

Arrays and Matrices
Set column or row names
> colnames(x) <- c("col1", "col2", "col3", "col4", "5", "6")
> x
     col1 col2 col3 col4  5 6
[1,]    1    5    9    3  7 1
[2,]    2    6   10    4  8 2
[3,]    3    7    1    5  9 3
[4,]    4    8    2    6 10 4
> colnames(x) <- c("column1", "column2")
Error in dimnames(x) <- dn :
  length of 'dimnames' [2] not equal to array extent
> colnames(x)[1] <- "column1"
> x
     column1 col2 col3 col4  5 6
[1,]       1    5    9    3  7 1
[2,]       2    6   10    4  8 2
[3,]       3    7    1    5  9 3
[4,]       4    8    2    6 10 4

Arrays and Matrices
Set row and column names using dimnames
> dimnames(x) <- list(c("first", "second", "third", "4"), NULL)
> x
       [,1] [,2] [,3] [,4] [,5] [,6]
first     1    5    9    3    7    1
second    2    6   10    4    8    2
third     3    7    1    5    9    3
4         4    8    2    6   10    4
Setting dimension names
> dimnames(x) <- list(my.rows = c("first", "second", "third", "4"), my.cols = NULL)
> x
        my.cols
my.rows  [,1] [,2] [,3] [,4] [,5] [,6]
  first     1    5    9    3    7    1
  second    2    6   10    4    8    2
  third     3    7    1    5    9    3
  4         4    8    2    6   10    4

Arrays
Change dimensionality of array
> dim(x) <- c(6, 4)
> x
     [,1] [,2] [,3] [,4]
[1,]    1    7    3    9
[2,]    2    8    4   10
[3,]    3    9    5    1
[4,]    4   10    6    2
[5,]    5    1    7    3
[6,]    6    2    8    4
> dim(x) <- c(3, 4, 2)
> x
, , 1
     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8    1
[3,]    3    6    9    2
, , 2
     [,1] [,2] [,3] [,4]
[1,]    3    6    9    2
[2,]    4    7   10    3
[3,]    5    8    1    4

Arrays and Matrices
Bind several vectors into an array
> i1 <- seq(from = 1, to = 20, length = 10)
> i2 <- seq(from = 3.4, to = 25, length = 10)
> i3 <- seq(from = 15, to = 25, length = 10)
> i <- cbind(i1, i2, i3)
> i
             i1   i2       i3
 [1,]  1.000000  3.4 15.00000
 [2,]  3.111111  5.8 16.11111
 [3,]  5.222222  8.2 17.22222
 [4,]  7.333333 10.6 18.33333
 [5,]  9.444444 13.0 19.44444
 [6,] 11.555556 15.4 20.55556
 [7,] 13.666667 17.8 21.66667
 [8,] 15.777778 20.2 22.77778
 [9,] 17.888889 22.6 23.88889
[10,] 20.000000 25.0 25.00000

Arrays and Matrices
> j <- rbind(i1, i2, i3)
> j
   [,1]      [,2]      [,3]      [,4]      [,5]     [,6]     [,7]     [,8]     [,9]
i1  1.0  3.111111  5.222222  7.333333  9.444444 11.55556 13.66667 15.77778 17.88889
i2  3.4  5.800000  8.200000 10.600000 13.000000 15.40000 17.80000 20.20000 22.60000
i3 15.0 16.111111 17.222222 18.333333 19.444444 20.55556 21.66667 22.77778 23.88889
   [,10]
i1    20
i2    25
i3    25
> i <- cbind(col1 = i1, col2 = i2, col3 = i3)

Lists
• Special vector
• Collection of elements of different modes
• Often used as return type of functions

list(...), vector("list", length)   create list
x[i]                                list of element i
x[[i]]                              element i
x["name"]                           list of element name
x[["name"]] or x$name               element name
unlist                              transform list to a vector

Lists
> x <- list(1:10, c("a", "b"), c(TRUE, TRUE, FALSE, TRUE), 5)
> x
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10
[[2]]
[1] "a" "b"
[[3]]
[1] TRUE TRUE FALSE TRUE
[[4]]
[1] 5
> is.list(x)
[1] TRUE
> is.vector(x)
[1] TRUE
> is.numeric(x)
[1] FALSE

Lists
What are the elements in a list?
> x[1]
[[1]]
[1] 1 2 3 4 5 6 7 8 9 10
> str(x[1])
List of 1
 $ : int [1:10] 1 2 3 4 5 6 7 8 9 10
> mode(x[1])
[1] "list"
> x[[1]]
[1] 1 2 3 4 5 6 7 8 9 10
> str(x[[1]])
 int [1:10] 1 2 3 4 5 6 7 8 9 10
> mode(x[[1]])
[1] "numeric"

Lists
> y <- list(numbers = c(5, 10:25), initials = c("rnm", "fds"))
> y
$numbers
[1] 5 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
$initials
[1] "rnm" "fds"
> y$initials
[1] "rnm" "fds"
> y["numbers"]
$numbers
[1] 5 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
> y$new.element <- "This is new"
> y
$numbers
[1] 5 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
$initials
[1] "rnm" "fds"
$new.element
[1] "This is new"

Data Frames
• Like matrices, but columns of different modes
• Organized list where components are columns of equal length

x[["name"]] or x$name    column "name"
x[row, column], etc.     rows and columns

> age <- c(1:5)
> color <- c("neonate", "two-tone", "speckled", "mottled", "adult")
> juvenile <- c(TRUE, TRUE, FALSE, FALSE, FALSE)
> spotted <- data.frame(age, color, juvenile)
> spotted
  age    color juvenile
1   1  neonate     TRUE
2   2 two-tone     TRUE
3   3 speckled    FALSE
4   4  mottled    FALSE
5   5    adult    FALSE

Data Frames
> is.matrix(spotted)
[1] FALSE
> is.array(spotted)
[1] FALSE
> is.list(spotted)
[1] TRUE
> is.data.frame(spotted)
[1] TRUE
> spotted$age
[1] 1 2 3 4 5
> spotted$age[2]
[1] 2
> spotted$color[2]
[1] two-tone
Levels: adult mottled neonate speckled two-tone
> spotted[spotted$age < 3, ]
  age    color juvenile
1   1  neonate     TRUE
2   2 two-tone     TRUE

Data Frames
Forcing character columns
(Output below is from R < 4.0.0, where data.frame() converted strings to factors by default; since R 4.0.0 the default is stringsAsFactors = FALSE.)
> str(spotted)
'data.frame': 5 obs. of 3 variables:
 $ age     : int 1 2 3 4 5
 $ color   : Factor w/ 5 levels "adult","mottled",..: 3 5 4 2 1
 $ juvenile: logi TRUE TRUE FALSE FALSE FALSE
> spotted2 <- data.frame(age.class = age, color.pattern = color,
+   juvenile.stat = juvenile, stringsAsFactors = FALSE)
> spotted2
  age.class color.pattern juvenile.stat
1         1       neonate          TRUE
2         2      two-tone          TRUE
3         3      speckled         FALSE
4         4       mottled         FALSE
5         5         adult         FALSE
> str(spotted2)
'data.frame': 5 obs. of 3 variables:
 $ age.class    : int 1 2 3 4 5
 $ color.pattern: chr "neonate" "two-tone" "speckled" "mottled" ...
 $ juvenile.stat: logi TRUE TRUE FALSE FALSE FALSE

Data Frames
Deleting columns
> spotted$age <- NULL
> spotted
     color juvenile
1  neonate     TRUE
2 two-tone     TRUE
3 speckled    FALSE
4  mottled    FALSE
5    adult    FALSE
Creating new columns
> spotted$freq <- c(0.3, 0.2, 0.2, 0.15, 0.15)
> spotted$have.data <- TRUE
> spotted
     color juvenile freq have.data
1  neonate     TRUE 0.30      TRUE
2 two-tone     TRUE 0.20      TRUE
3 speckled    FALSE 0.20      TRUE
4  mottled    FALSE 0.15      TRUE
5    adult    FALSE 0.15      TRUE

Data Frames
subset(x, subset, select)
> subset(spotted, age >= 3)
  age    color juvenile
3   3 speckled    FALSE
4   4  mottled    FALSE
5   5    adult    FALSE
> subset(spotted, juvenile == FALSE & age <= 4)
  age    color juvenile
3   3 speckled    FALSE
4   4  mottled    FALSE
> subset(spotted, age <= 2, select = c("color", "juvenile"))
     color juvenile
1  neonate     TRUE
2 two-tone     TRUE

Data Input/Output
Directory management
dir()          list files in directory
setwd(path)    set working directory
getwd()        get working directory
?files         File and Directory Manipulation

Standard ASCII Format
read.table     creates a data frame from text file
read.csv       read comma-delimited file
read.delim     read tab-delimited file
read.fwf       read fixed width format
write.table    write data to text file
write.csv      write comma-delimited file

R Binary Format
save           writes binary R objects
save.image     writes current environment in binary R
load           reload files written with save

R Text Format
dump           creates text representation of R objects
source         accept input from text file (scripts)

Data Input/Output
Reading ASCII
> sets <- read.csv("Sets_All.csv", header = TRUE)
> sets$Ordered.Year <- ordered(sets$Year)
> sets$SpotCd.Fac <- factor(sets$SpotCd, exclude = NULL)
> spotted.sets <- sets[sets$Sp1Cd == 2, ]
> write.table(spotted.sets, file = "spotted.txt",
+   row.names = FALSE)
Reading R binary
> save(spotted.sets, file = "spotted.RData")
> rm(list = ls())
> load("spotted.RData")
Reading R commands
> positions <- spotted.sets[, c("Latitude", "Longitude")]
> dump("positions", file = "set_positions.R")
> rm(list = ls())
> source("set_positions.R")

Writing Scripts
• Text files containing commands and comments written as if executed on command line (usually end with .r)
• From R GUI : File|New script
• Any text editor (Notepad, Tinn-R, VEDIT, etc.)
Commands executed with:
• source("filename.r")
• Copy/paste
• From R Editor : Edit|Run...

Exercise 1A : Assemble data frame
1. Assemble a data frame from "Homework 1" files with only these columns (make these names and in this order): boat (character), skipper (character), lat, lon, year, month, day, mammals, turtles, fish
2. Add a column classifying each trip by season:
   Winter: Dec - Feb, Spring: Mar - May, Summer: Jun - Aug, Fall: Sep - Nov
3. Add three columns classifying bycatch size for each of:
   fish : < 15 (small), 15 - 200 (medium), > 200 (large)
   turtles : < 4 (small), >= 4 (large)
   mammals : < 2 (small), >= 2 (large)
4. Add column indicating that boat needs to be inspected if any bycatch class is "large"
5. Write your new data frame to a .csv file

Exercise 1B : Make a list
1. Read .csv file from 1A into clean R environment
2. Create a list with one element for the entire data set and one element per bycatch type (4 elements total). Each bycatch element should contain a named vector of the number of trips with small, medium, and large bycatches
3. How many trips needed to be inspected?
4. How many trips had no bycatch at all?
5. Save list and results from 3 & 4 in an R workspace

End Day 1

Data Selection and Manipulation
sample(x, size, replace, prob)    take a random sample from x
cut(x, breaks, labels)            divide vector into intervals
%in%                              return logical vector of matches
which(x)                          return index of TRUE results
all(...), any(...)                return TRUE if all or any arguments are TRUE
unique(x)                         return unique observations in vector
duplicated(x)                     return duplicated observations
sort                              sort vector or factor
order                             sort based on multiple arguments
merge()                           merge two data frames by common cols or rows
ceiling, floor, trunc, round, signif    rounding functions

sample
> x <- 1:5
Sample x (jumble or permute)
> sample(x)
[1] 2 1 4 5 3
Sample from x
> sample(x, 3)
[1] 2 4 3
Sample with replacement
> sample(x, 10, replace = TRUE)
[1] 2 3 5 3 3 4 2 1 4 4
Sample with modified probabilities
> cars <- c("Ford", "GM", "Toyota", "VW", "Subaru", "Honda")
> male.wts <- c(6, 5, 3, 1, 3, 3)
> female.wts <- c(3, 3, 4, 8, 3, 6)
> male.survey <- sample(cars, 100, replace = TRUE, prob = male.wts)
> female.survey <- sample(cars, 100, replace = TRUE, prob = female.wts)

cut
cut(x, breaks, labels = NULL, include.lowest = FALSE, right = TRUE, dig.lab = 3, ordered_result = FALSE, ...)
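The cut() signature above includes a labels argument that replaces the default interval names with class names. A minimal sketch, assuming made-up ages and hypothetical class names ("newborn", "juvenile", "adult") that are not part of the workshop data:

```r
# Hypothetical ages (years); values are for illustration only
ages <- c(0.5, 3, 3.1, 12, 40)

# With the default right = TRUE the bins are (0,3], (3,15], (15,Inf]
age.class <- cut(ages, breaks = c(0, 3, 15, Inf),
                 labels = c("newborn", "juvenile", "adult"))

# Result is a factor whose levels are the supplied labels
age.class
table(age.class)
```

Because the intervals are closed on the right by default, the age of exactly 3 falls in the first class; passing right = FALSE would move it into the second.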
> y <- c(4, 5, 6, 10, 11, 30, 49, 50, 51)

Bins : 5 < y <= 10, 10 < y <= 30, 30 < y <= 50
> y.cut <- cut(y, breaks = c(5, 10, 30, 50))
> y.cut
[1] <NA> <NA> (5,10] (5,10] (10,30] (10,30] (30,50] (30,50] <NA>
Levels: (5,10] (10,30] (30,50]
> str(y.cut)
 Factor w/ 3 levels "(5,10]","(10,30]",..: NA NA 1 1 2 2 3 3 NA

Bins : 5 <= y <= 10, 10 < y <= 30, 30 < y <= 50
> cut(y, breaks = c(5, 10, 30, 50), include.lowest = TRUE)
[1] <NA> [5,10] [5,10] [5,10] (10,30] (10,30] (30,50] (30,50] <NA>
Levels: [5,10] (10,30] (30,50]

Bins : 5 <= y < 10, 10 <= y < 30, 30 <= y < 50
> cut(y, breaks = c(5, 10, 30, 50), right = FALSE)
[1] <NA> [5,10) [5,10) [10,30) [10,30) [30,50) [30,50) <NA> <NA>
Levels: [5,10) [10,30) [30,50)

Bins : 5 <= y < 10, 10 <= y < 30, 30 <= y <= 50
> cut(y, breaks = c(5, 10, 30, 50), include.lowest = TRUE, right = FALSE)
[1] <NA> [5,10) [5,10) [10,30) [10,30) [30,50] [30,50] [30,50] <NA>
Levels: [5,10) [10,30) [30,50]

%in%, which
> x <- sample(1:10, 20, replace = TRUE)
> x
 [1]  4 10  2  3  4  3  6  4  7  3  9  1  3  4  7  1  3  2  8
[20]  5
> x %in% c(3, 10, 2, 1)
 [1] FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE
[10]  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
[19] FALSE FALSE
> x[x %in% c(3, 10, 2, 1)]
[1] 10 2 3 3 3 1 3 1 3 2
> which(x %in% c(3, 10, 2, 1))
[1] 2 3 4 6 10 12 13 16 17 18
> which(x < 5)
[1] 1 3 4 5 6 8 10 12 13 14 16 17 18
> x[which(x > 6)]
[1] 10 7 9 7 8

any, all
> x <- sample(1:10, 20, replace = TRUE)
> x
 [1]  2  7  8  1  1  7  5  8  6  7  3  7  2  1  5 10  3  9  1
[20]  2
> any(x == 6)
[1] TRUE
> all(x < 5)
[1] FALSE

unique, duplicated
> x <- sample(1:10, 20, replace = TRUE)
> x
 [1]  6  5  1  8  9  6  2  3  8  9  8 10 10  2  9  3  4  3  4
[20] 10
> unique(x)
[1] 6 5 1 8 9 2 3 10 4
> duplicated(x)
 [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE
[10]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
[19]  TRUE  TRUE

sort, order
> x <- sample(1:10, 20, replace = TRUE)
> x
 [1]  3  6  7  1  5  3 10  3  7  2  3  9  1  8  4  3  8  2
[19]  4  1
> sort(x)
 [1]  1  1  1  2  2  3  3  3  3  3  4  4  5  6  7  7  8  8
[19]  9 10
> sort(x, decreasing = TRUE)
 [1] 10  9  8  8  7  7  6  5  4  4  3  3  3  3  3  2  2  1
[19]  1  1
> order(x)
 [1]  4 13 20 10 18  1  6  8 11 16 15 19  5  2  3  9 14 17
[19] 12  7
> trips <- read.csv("homework 1a df.csv")
> month.sort <- trips[order(trips$month), ]
> month.days.sort <- trips[order(trips$month, trips$day), ]

merge
merge(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all, sort = TRUE, suffixes = c(".x",".y"), ...)
> rm(list = ls())
> load("merge data.rdata")
> str(cranial)
'data.frame': 20 obs. of 2 variables:
 $ id   : Factor w/ 20 levels "Specimen-1","Specimen-12",..: 14 11 13 7 20 18 3 10 5 17 ...
 $ skull: num 260 266 259 273 262 ...
> str(haps)
'data.frame': 20 obs. of 2 variables:
 $ id  : Factor w/ 20 levels "Specimen-1","Specimen-10",..: 16 12 15 18 8 7 3 13 6 9 ...
 $ haps: Factor w/ 5 levels "A","B","C","D",..: 1 4 4 5 5 3 1 3 3 4 ...
> merge(haps, cranial)
           id haps    skull
1  Specimen-1    A 255.4461
2 Specimen-12    A 262.5730
3 Specimen-16    E 256.2258
4 Specimen-22    E 259.2000
...

merge
Merging on key columns with different names (by.x, by.y)
> str(sex)
'data.frame': 40 obs. of 2 variables:
 $ specimens: Factor w/ 40 levels "Specimen-1","Specimen-10",..: 1 12 23 34 36 37 38 39 40 2 ...
 $ sex      : Factor w/ 2 levels "F","M": 1 2 1 2 2 2 2 2 2 1 ...
> str(trials)
'data.frame': 30 obs. of 2 variables:
 $ id   : Factor w/ 23 levels "Specimen-1","Specimen-18",..: 5 6 1 9 3 7 8 2 10 4 ...
 $ value: num 30.1 23.1 24.3 22.6 36.7 ...
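Before merging data frames whose key columns have different names, the effect of by.x, by.y, and all can be checked on a toy case. A minimal sketch, assuming two hypothetical data frames (left and right, with made-up values) that stand in for the workshop objects:

```r
# Hypothetical toy data, not the workshop's sex/trials objects
left  <- data.frame(specimens = c("S1", "S2", "S3"),
                    sex = c("F", "M", "F"))
right <- data.frame(id = c("S2", "S3", "S4"),
                    value = c(10.5, 12.1, 9.8))

# Default (all = FALSE): keep only keys found in both data frames
inner <- merge(left, right, by.x = "specimens", by.y = "id")

# all = TRUE: keep every key from either side, filling gaps with NA
outer <- merge(left, right, by.x = "specimens", by.y = "id", all = TRUE)

inner
outer
```

The key column in the result takes its name from by.x, and rows are sorted on the key because sort = TRUE by default.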
> merge(sex, trials, by.x = "specimens", by.y = "id")
    specimens sex    value
1  Specimen-1   F 24.28745
2 Specimen-11   F 23.90455
3 Specimen-12   M 27.41010
4 Specimen-14   M 36.84547
5 Specimen-15   M 20.08898

String Manipulation
nchar(x)                     number of characters in string
substr(x, start, stop)       extract or replace substrings
strsplit(x, split)           split string
paste(..., sep, collapse)    concatenate vectors
format                       format object for printing
grep, sub, gsub              pattern matching and replacement

nchar, substr, strsplit
> x <- "This is a sentence."
> nchar(x)
[1] 19
> substr(x, 3, 9)
[1] "is is a"
> substr(x, 1, 4) <- "That"
> x
[1] "That is a sentence."
> strsplit(x, " ")
[[1]]
[1] "That" "is" "a" "sentence."
> strsplit(x, "a")
[[1]]
[1] "Th" "t is " " sentence."

paste
> sites <- LETTERS[1:6]
> paste("Site", sites)
[1] "Site A" "Site B" "Site C" "Site D" "Site E" "Site F"
> paste("Site", sites, sep = "-")
[1] "Site-A" "Site-B" "Site-C" "Site-D" "Site-E" "Site-F"
> paste("Site", sites, sep = "_", collapse = ",")
[1] "Site_A,Site_B,Site_C,Site_D,Site_E,Site_F"

Data Summary
summary               summarizes object - different for each class
table                 create contingency table
sum(x), prod(x)       sum and product of vector
cumsum(x)             vector of cumulative sums
rowSums, colSums      compute row or column sums
rowMeans, colMeans    compute row or column means
rowsum(x, group)      compute column sums for a grouping variable

table
> trips <- read.csv("homework 1a df.csv")
> table(season = trips$season)
season
  Fall Spring Summer Winter
  2503   2546   2336   2615
> table(season = trips$season, fish.class = trips$fish.class)
        fish.class
season   Large Medium Small
  Fall    1499    897   107
  Spring  1505    960    81
  Summer  1380    865    91
  Winter  1550    959   106
> turtle.class.table <- as.data.frame(table(turtle.class = trips$turtle.class))
> str(turtle.class.table)
'data.frame': 2 obs. of 2 variables:
 $ turtle.class: Factor w/ 2 levels "Large","Small": 1 2
 $ Freq        : int 3443 6557
> turtle.class.table
  turtle.class Freq
1        Large 3443
2        Small 6557

row/col sums/means
> x <- matrix(1:18, nrow = 6, ncol = 3)
> x
     [,1] [,2] [,3]
[1,]    1    7   13
[2,]    2    8   14
[3,]    3    9   15
[4,]    4   10   16
[5,]    5   11   17
[6,]    6   12   18
> rowSums(x)
[1] 21 24 27 30 33 36
> colMeans(x)
[1] 3.5 9.5 15.5
> rowsum(x, c(1, 1, 2, 2, 3, 3))
  [,1] [,2] [,3]
1    3   15   27
2    7   19   31
3   11   23   35
> rowsum(x, c("a", "a", "a", "b", "b", "b"))
  [,1] [,2] [,3]
a    6   24   42
b   15   33   51

Data Summary
min, max                return minimum or maximum values
range                   return a vector of minimum and maximum values
which.min, which.max    return index of first minimum or maximum value
mean(x)                 arithmetic mean of vector
sd, var, cov, cor       standard deviation, variance, covariance, correlation
median(x)               median of vector
quantile(x, probs)      give quantiles of vector

> x <- sample(1:100, 50, replace = TRUE)
> mean(x)
[1] 55.82
> median(x)
[1] 51.5
> range(x)
[1] 1 100
> quantile(x, probs = 0.1)
 10%
21.9
> quantile(x, probs = c(0.025, 0.5, 0.975))
  2.5%    50%  97.5%
 6.825 51.500 98.325

Functions
fun.name <- function(args) {
  statements
  x or return(x)
}
• result of last statement is return value
• arguments (args) passed by value
• can give default arguments
• "..." passes unmatched arguments to other functions

Functions
F2C <- function(faren) {
  # converts Fahrenheit to Celsius
  cels <- round((faren - 32) * 5/9, 2)
  paste(faren, "deg. Fahrenheit =", cels, "deg. Celsius", sep = " ", collapse = "")
}

# fixed default sample size
sample.mean <- function(x, sample.size = 10) {
  y <- sample(x, size = sample.size, replace = TRUE)
  mean(y)
}

# default sample size computed from another argument
sample.mean <- function(x, sample.size = length(x)) {
  y <- sample(x, size = sample.size, replace = TRUE)
  mean(y)
}

# unmatched arguments passed on to sample()
sample.mean <- function(x, ...) {
  y <- sample(x, ...)
  mean(y, na.rm = TRUE)
}

Functions
if(cond) {statements} else {statements}    evaluate condition
ifelse(test, yes, no)                      evaluate test, return yes or no
for(var in seq) {statements}               execute one loop for each var in seq
while(cond) {statements}                   execute loop as long as condition is true
repeat {statements}                        execute statements on each loop until break
break                                      exits loop
next                                       moves to next iteration in loop
switch(EXPR, ...)                          select from list of alternatives
print(x)                                   prints object x to screen
stop("...")                                stop function and print error message
warning("...")                             generate warning message
stopifnot(cond)                            stop if cond not TRUE

if, ifelse
fishery.status.1 <- function(catch, catch.limit = 20) {
  result <- list(to.close = TRUE, remaining.catch = NA)
  if (catch < catch.limit) {
    result$to.close = FALSE
    result$remaining.catch = catch.limit - catch
  } else {
    result$to.close = TRUE
    result$remaining.catch = 0
  }
  result
}

fishery.status.2 <- function(catch, catch.limit = 20) {
  to.close <- catch >= catch.limit
  remaining.catch <- ifelse(catch < catch.limit, catch.limit - catch, 0)
  list(to.close = to.close, remaining.catch = remaining.catch)
}

> x <- c(TRUE, TRUE, FALSE)
> y <- c(FALSE, TRUE, FALSE)
> z <- c(TRUE, FALSE, FALSE)
> x & y
[1] FALSE TRUE FALSE
> x && y
[1] FALSE
> x && z
[1] TRUE
Note: the && results above come from older R, which used only the first element of each operand; in R >= 4.3.0 these two calls signal an error because the operands have length greater than one.

for
make.plates <- function(num.plates) {
  plate.vec <- vector("character", length = num.plates)
  for(i in 1:num.plates) {
    first.num <- sample(0:9, 1)
    chars <- sample(LETTERS, 3, replace = TRUE)
    chars <- paste(chars, collapse = "")
    last.nums <- sample(0:9, 3, replace = TRUE)
    last.nums <- paste(last.nums, collapse = "")
    plate.vec[i] <- paste(first.num, chars, last.nums, sep = "", collapse = "")
  }
  plate.vec
}

check.plates <- function(plates, reserved) {
  bad.plates <- vector("character")
  for(plate in plates) {
    plate.str <- substr(plate, 2, 4)
    if (plate.str %in% reserved) bad.plates <- c(bad.plates, plate)
  }
  bad.plates
}

bootstrap example
Question: How many trips had "small" bycatches for all categories?
More importantly: What is the variance of this measure?

trips <- read.csv("homework 1a df.csv")

boot.bycatch <- function(trip.df, nrep) {
  obs.num.small <- num.all.small(trip.df)
  boot.results <- vector("numeric", nrep)
  for(i in 1:nrep) {
    # resample rows with replacement
    boot.rows <- sample(1:nrow(trip.df), nrow(trip.df), replace = TRUE)
    boot.df <- trip.df[boot.rows, ]
    boot.results[i] <- num.all.small(boot.df)
  }
  list(observed = obs.num.small, boot.dist = boot.results)
}

num.all.small <- function(trip.df) {
  f.small <- trip.df$fish.class == "Small"
  t.small <- trip.df$turtle.class == "Small"
  m.small <- trip.df$mammal.class == "Small"
  sum(f.small & t.small & m.small)
}

Exercise 2A : Reformat dates
1) Use "Homework 2 sets.csv"
2) Write function to split Date into Year, Month, Day
3) Save function as R object
4) Create numeric Year, Month, Day columns in data frame
5) Create new Date character column that is DD-MM-YY
6) Remove old Date column and save new data frame under new name

Exercise 2B : Bootstrap fishery closures
1) Use "Homework 2 catches.txt"
2) Write and save a function that takes catch.data, a catch.limit, and a number of bootstrap replicates. The function should bootstrap the catch over all years and return two objects: 1) a distribution of the number of years with closures, and 2) a distribution of the average catch remaining.
3) Run bootstrap with catch limits of 20 and 50 at 1000 replicates each.
Extra: Create a table showing the frequency distribution of the number of closures in the bootstrap result.

End Day 2

Data Processing - 'apply' family
lapply(X, FUN, ...)            apply function to list or vector
sapply(X, FUN, ...)            simplified version of lapply
apply(X, MARGIN, FUN, ...)     apply function to margins of array
tapply(X, INDEX, FUN, ...)     apply function to ragged array
by(data, INDICES, FUN, ...)    apply function to data frame
aggregate(x, by, FUN, ...)     compute function for subsets of object

lapply
lapply returns list
> spring.trip <- trips$season == "Spring"
> spring.fish <- trips$fish[spring.trip & trips$fish > 0]
> spring.turtles <- trips$turtles[spring.trip & trips$turtles > 0]
> spring.mammals <- trips$mammals[spring.trip & trips$mammals > 0]
>
> spring <- list(fish = spring.fish, turtles = spring.turtles, mammals = spring.mammals)
>
> lapply(spring, length)
$fish
[1] 2525
$turtles
[1] 1274
$mammals
[1] 2119
> lapply(spring, mean)
$fish
[1] 250.2356
$turtles
[1] 5.49843
$mammals
[1] 3.050024

sapply
sapply returns vector or matrix
> sapply(spring, median)
   fish turtles mammals
    250       5       3
> sapply(spring, function(i) sum(i > 5 & i < 20))
   fish turtles mammals
     63     623       0
> sapply(spring, function(i) c(n = length(i), mean = mean(i), var = var(i)))
           fish    turtles     mammals
n     2525.0000 1274.00000 2119.000000
mean   250.2356    5.49843    3.050024
var  20785.6612    8.61783    1.953115

apply
> bycatch.df <- subset(trips, , c("fish", "turtles", "mammals"))
Apply across columns
> apply(bycatch.df, 2, mean)
    fish  turtles  mammals
248.6285   2.7283   2.5160
> apply(bycatch.df, 2, quantile, prob = c(0.025, 0.975))
      fish turtles mammals
2.5%     8       0       0
97.5%  489      10       5
Apply across rows
> bycatch.sum <- apply(bycatch.df, 1, sum)
> range(bycatch.sum)
[1] 0 512
> mean(bycatch.sum)
[1] 253.8728

tapply
apply function based on groups
> tapply(trips$fish, trips$season, mean)
    Fall   Spring   Summer   Winter
250.1322 248.1716 250.5051 245.9576
> tapply(trips$fish, list(season = trips$season, class = trips$fish.class), median)
        class
season   Large Medium Small
  Fall   354.0    112   6.0
  Spring 354.0    107   5.0
  Summer 353.5    111   3.0
  Winter 348.0    108   5.5

Exercise 3 : Bootstrap with apply
1) Rewrite bootstrap from Exercise 2B using apply family
2) Run bootstrap with catch limits of 10, 15, 20, 30, 50, 60.
3) Summarize mean and median of results for each catch limit in one object 67 Simulated growth data Create a function that simulates growth data according to a Gompertz model, length L 0 e k 1 e g age The output should have two columns (age and length). Age should be rounded to two decimal places. Length should be rounded to one decimal place. Try to put in checks and traps for screwy input data. sim.growth.func <- function(age.range, L0, k, g, sd, sample.size) age.range is a two element vector giving min and max ages L0 is length at birth k, g are model rate parameters sd is the standard deviation for the error term sample.size is the number of samples to return 68 80 60 Length (cm) 40 20 0 0 10 20 30 40 50 60 Age (years) 69 Simulated growth data # Gompertz growth function gomp.func <- function(age.vec, LAB, k, g) LAB * exp(k * (1 - exp(-g * age.vec))) } { # A function to created simulated growth data according # to a Gompertz equation sim.growth.func <- function(age.range, LAB, k, g, std.dev, sample.size = 1000) { # Check to make sure age.range is a reasonable vector if (!is.numeric(age.range) || !is.vector(age.range)) stop("'age.range' is not a numeric vector") if (any(age.range < 0)) stop("'age.range' < 0") if (age.range[1] >= age.range[2]) stop("'age.range[1]' >= 'age.range[2]'") # Generate some random ages between min and max of age.range random.ages <- runif(sample.size, age.range[1], age.range[2]) # Calculate the expected length for those ages from the Gompertz equation expected.length <- gomp.func(random.ages, LAB, k, g) # Add some error to the lengths and return the named array length.err <- rnorm(sample.size, 0, std.dev) as.data.frame(cbind(age = random.ages, length = expected.length + length.err)) } growth.df <- sim.growth.func(age.range = c(0, 65), LAB = 10, k = 2, g = 0.25, std.dev = 5) 70 plot(x, y = NULL, type = "p", xlim = NULL, ylim = NULL, log = "", main = NULL, sub = NULL, xlab = NULL, ylab = NULL, ann = par("ann"), axes = TRUE, frame.plot = 
     axes, panel.first = NULL, panel.last = NULL, col = par("col"), bg = NA,
     pch = par("pch"), cex = 1, lty = par("lty"), lab = par("lab"),
     lwd = par("lwd"), asp = NA, ...)

Plot

> plot(growth.df$age, growth.df$length, xlab = "Age (years)", ylab = "Length (cm)")

[Figure: scatter plot of the simulated growth data; x-axis Age (years), y-axis Length (cm)]

Hist

hist(x, breaks = "Sturges", freq = NULL, probability = !freq,
     include.lowest = TRUE, right = TRUE, density = NULL, angle = 45,
     col = NULL, border = NULL, main = paste("Histogram of", xname),
     xlim = range(breaks), ylim = NULL, xlab = xname, ylab, axes = TRUE,
     plot = TRUE, labels = FALSE, nclass = NULL, ...)

> hist(growth.df$age)
> hist(growth.df$age, breaks = c(0:5, seq(6, 12, 2), 15, 20, 40, max(growth.df$age)),
+      col = "black", border = "white")

[Figures: two histograms titled "Histogram of growth.df$age"; x-axis growth.df$age; y-axis Frequency for the first, Density for the second]

Boxplot

boxplot(formula, data = NULL, ..., subset, na.action = NULL)

boxplot(x, ..., range = 1.5, width = NULL, varwidth = FALSE, notch = FALSE,
        outline = TRUE, names, plot = TRUE, border = par("fg"), col = NULL,
        log = "", pars = list(boxwex = 0.8, staplewex = 0.5, outwex = 0.5),
        horizontal = FALSE, add = FALSE, at = NULL)

> age.breaks <- hist(growth.df$age)$breaks
> binned.age <- cut(growth.df$age, breaks = age.breaks)
> boxplot(growth.df$length ~ binned.age, xlab = "Age bin", ylab = "Length")

[Figures: histogram of growth.df$age (y-axis Density); boxplot of Length by Age bin, with bins running from (0,1] to (40,65]]

Modifying Graphs

abline            add straight lines to plot
lines             join points at coordinates with lines
points            place points on plot
title             add labels to a plot
text              write text on a plot

?plot.default     default plot options
par               set or get graphical parameters

layout(mat, ...)
split.screen(figs, ...)
divide graphical screen into matrix
divide graphical screen into sub-screens

> newborns <- growth.df[growth.df$age <= 3, ]
> adults <- growth.df[growth.df$age > 3, ]
> plot(adults$age, adults$length, xlim = range(growth.df$age),
+      ylim = range(growth.df$length), xlab = "", ylab = "", col = "red", pch = 21)
> abline(v = 3, col = "green")
> par(new = TRUE)
> plot(newborns$age, newborns$length, xlim = range(growth.df$age),
+      ylim = range(growth.df$length), xlab = "Age", ylab = "Length", col = "blue", pch = 21)
> text(3, 80, "Transition", pos = 4)

[Figure: scatter plot of Length vs. Age with a vertical line labeled "Transition"]

Modifying Graphs

> layout(matrix(c(1, 1, 2, 3), 2, 2, byrow = TRUE))
> plot(growth.df$age, growth.df$length, xlab = "Age", ylab = "Length",
+      main = "Simulated growth data")
> age.breaks <- seq(0, max(growth.df$age) + 5, 5)
> binned.age <- cut(growth.df$age, age.breaks)
> hist(growth.df$age, age.breaks, xlab = "Age", main = "")
> boxplot(growth.df$length ~ binned.age,
+         names = age.breaks[-length(age.breaks)], xlab = "Age bin")

[Figure: three-panel layout. Top: "Simulated growth data" scatter plot of Length vs. Age. Bottom: histogram of Age (y-axis Frequency) and boxplot of length by Age bin]

Curve

curve(expr, from, to, n = 101, add = FALSE, type = "l", ylab = NULL,
      log = NULL, xlim = NULL, ...)

> curve(sin, -10, 10)
> plot(growth.df$age, growth.df$length, xlab = "Age",
+      ylab = "Length", main = "")
> curve(10 * exp(2 * (1 - exp(-0.25 * x))), add = TRUE,
+       lty = "dashed", lwd = 2, col = "red")

[Figures: sin(x) for x from -10 to 10; scatter plot of Length vs. Age with the Gompertz curve overlaid]

Statistical Distributions

dunif, dnorm, dgamma, dbeta, dchisq, etc.

density   distribution function   quantile function   random number
d<dist>   p<dist>                 q<dist>             r<dist>

> library(help = "stats")
> set.seed(x)               set random number seed

[Figures: dnorm(x), pnorm(x), and qnorm(x) plotted against x]

Statistical Tests

binom.test(x, n, p = 0.5, alternative = c("two.sided", "less", "greater"),
           conf.level = 0.95)

chisq.test(x, y = NULL, correct = TRUE, p = rep(1/length(x), length(x)),
           rescale.p = FALSE, simulate.p.value = FALSE, B = 2000)

t.test(x, y = NULL, alternative = c("two.sided", "less", "greater"), mu = 0,
       paired = FALSE, var.equal = FALSE, conf.level = 0.95, ...)

t-test

> male.growth <- sim.growth.func(c(0, 65), 10, 2.05, 0.27, 5)
> female.growth <- sim.growth.func(c(0, 65), 10, 1.99, 0.23, 4)
> adult.males <- male.growth[male.growth[, "age"] > 18, ]
> adult.females <- female.growth[female.growth[, "age"] > 18, ]
> gender.test <- t.test(adult.males[, "length"], adult.females[, "length"])
> gender.test

        Welch Two Sample t-test

data:  adult.males[, "length"] and adult.females[, "length"]
t = 19.3369, df = 1427.025, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 4.146325 5.082547
sample estimates:
mean of x mean of y
 77.56675  72.95232

> str(gender.test)
List of 9
 $ statistic  : Named num 19.3
  ..- attr(*, "names")= chr "t"
 $ parameter  : Named num 1427
  ..- attr(*, "names")= chr "df"
 $ p.value    : num 3.56e-74
 $ conf.int   : atomic [1:2] 4.15 5.08
  ..- attr(*, "conf.level")= num 0.95
 $ estimate   : Named num [1:2] 77.6 73.0
  ..- attr(*, "names")= chr [1:2] "mean of x" "mean of y"
 $ null.value : Named num 0
  ..- attr(*, "names")= chr "difference in means"
 $ alternative: chr "two.sided"
 $ method     : chr "Welch Two Sample t-test"
 $ data.name  : chr "adult.males[, \"length\"] and adult.females[, \"length\"]"
 - attr(*, "class")= chr "htest"

Model Fitting

Linear Models
lm(formula, data, subset, weights, na.action, method = "qr", model = TRUE, x = FALSE, y = FALSE,
   qr = TRUE, singular.ok = TRUE, contrasts = NULL, offset, ...)

Analysis of Variance Model
aov(formula, data = NULL, projections = FALSE, qr = TRUE, contrasts = NULL, ...)

Generalized Linear Models
glm(formula, family = gaussian, data, weights, subset, na.action, start = NULL,
    etastart, mustart, offset, control = glm.control(...), model = TRUE,
    method = "glm.fit", x = FALSE, y = TRUE, contrasts = NULL, ...)

Nonlinear Least Squares
nls(formula, data, start, control, algorithm, trace, subset, weights,
    na.action, model, lower, upper, ...)

Non-Linear Minimization
nlm(f, p, hessian = FALSE, typsize = rep(1, length(p)), fscale = 1,
    print.level = 0, ndigit = 12, gradtol = 1e-6,
    stepmax = max(1000 * sqrt(sum((p/typsize)^2)), 1000), steptol = 1e-6,
    iterlim = 100, check.analyticals = TRUE, ...)

lm

> sim.growth <- sim.growth.func(c(0, 65), 10, 2, 0.25, 5)
> juv <- as.data.frame(sim.growth[sim.growth[, "age"] < 10, ])
> juv.lm <- lm(length ~ age, juv)
> juv.lm

Call:
lm(formula = length ~ age, data = juv)

Coefficients:
(Intercept)          age
     12.438        5.584

> plot(juv.lm)
Waiting to confirm page change...
Waiting to confirm page change...
Waiting to confirm page change...
Waiting to confirm page change...
> plot(juv)
> abline(coef = juv.lm$coefficients, col = "red", lty = "dashed")

[Figures: diagnostic plots "Residuals vs Fitted" (Residuals against Fitted values) and "Normal Q-Q" (Standardized residuals against Theoretical Quantiles) for lm(length ~ age); scatter plot of length vs. age]

Model Fitting

fitted      extract fitted values for models
coef        extract model coefficients
resid       extract model residuals
deviance    extract deviances for models
logLik      calculate log-likelihood for model fit
AIC         calculate AIC for model fit
predict     predictions from model results
anova       calculate analysis of variance tables

> coef(juv.lm)
(Intercept)         age
      12.88        5.28
> logLik(juv.lm)
'log Lik.'
-508 (df=3)
> AIC(juv.lm)
[1] 1023
> predict(juv.lm, data.frame(age = c(1, 5, 10)))
   1    2    3
18.2 39.3 65.6

nls

> gomp.form <- formula(length ~ LAB * exp(k * (1 - exp(-g * age))))
> growth.nls <- nls(gomp.form, sim.growth, start = c(LAB = 5, k = 5, g = 0.6))
> growth.nls
Nonlinear regression model
  model: length ~ LAB * exp(k * (1 - exp(-g * age)))
   data: sim.growth
   LAB      k      g
10.995  1.905  0.236
 residual sum-of-squares: 24793

Number of iterations to convergence: 6
Achieved convergence tolerance: 9.67e-06

> plot(sim.growth)
> age.vec <- 1:max(sim.growth$age)
> lines(age.vec, predict(growth.nls, list(age = age.vec)), col = "red",
+       lty = "dashed", lwd = 2)

[Figure: scatter plot of length vs. age with the fitted nls curve overlaid]

Packages, Path, & Options

library()                   list available packages
library(package)            load package
library(help = "package")   list info about package (build, functions, etc.)
require(package)            loads package and returns FALSE if not present

attach(x, pos)              attach database (list, data frame, or file) to search path
detach(x)                   remove database from search path
search()                    list attached packages in search path

options(...)                set and examine global options
?Startup                    Control initialization of R session
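
A short sketch of how the functions on the Packages, Path, & Options slide might be used together. It is an illustration only, not part of the workshop material: the package name notARealPackage and the list pars are made up for the example.

```r
# require() returns TRUE or FALSE, so it can guard optional code.
# 'stats' ships with R; 'notARealPackage' is a hypothetical name used
# only to show the FALSE case.
has.stats <- require(stats)
has.fake <- suppressWarnings(
  require("notARealPackage", character.only = TRUE, quietly = TRUE)
)

# search() lists everything currently attached to the search path
stats.attached <- "package:stats" %in% search()

# attach() puts a list or data frame on the search path so its
# components can be used by name; detach() removes it again
pars <- list(LAB = 10, k = 2)   # hypothetical parameter list
attach(pars)
pars.attached <- "pars" %in% search()
scaled <- LAB * k
detach(pars)

# options() returns the previous values of the options it changes,
# so the old settings can be restored afterwards
old.opts <- options(digits = 3)
print(pi)
options(old.opts)
```

Saving the return value of options() before changing a setting is a common way to avoid leaving a session in a modified state.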