Introduction to R
Data Analysis and Calculations

Katia Oleinik
koleinik@bu.edu
Scientific Computing and Visualization
Boston University


R arithmetic operations:

Operation    Description
x + y        addition
x - y        subtraction
x * y        multiplication
x / y        division
x ^ y        exponentiation
x %% y       x mod y (remainder)
x %/% y      integer division


Variable name rules:

- Case sensitive: Party ≠ party
- Letters, digits, underscores and dots can be used, e.g. DNA.data.2012
- Cannot start with a digit, an underscore, or a dot followed by a digit, e.g. 2012.DNA is invalid
- Should not use reserved words (if, else, repeat, etc.); names of built-in functions such as which are also best avoided


R atomic constant types:

1. Integer:     n <- as.integer(1)  or  n <- 1L   (a plain n <- 1 creates a numeric value)
2. Numeric:     a <- 2.5
3. Complex:     d <- 3 + 12i
4. Logical:     ans <- TRUE
5. Character:   name <- "Katia"  or  name <- 'Katia'
6. Special:     NULL, NA, Inf, NaN


R operators:

Operations            Description
+  -  *  /  %%  ^     Arithmetic
>  >=  <  <=  ==  !=  Relational
!  &  |               Logical
~                     Model formulas
<-  ->                Assignment
$                     List indexing
:                     Sequence


R built-in constants:

Constants     Description
LETTERS       26 upper-case letters of the Roman alphabet
letters       26 lower-case letters of the Roman alphabet
month.abb     3-letter abbreviations of month names
month.name    month names
pi            π: ratio of circle circumference to diameter
T, F          TRUE, FALSE (T and F are ordinary variables and can be reassigned; TRUE and FALSE cannot)


R math functions for scalars and vectors:

Function                               Description
sin, cos, tan, asin, acos, atan,       Various standard trig, log and exp. functions
  atan2, log, log10, log(x,base),
  exp, sinh, cosh, …
min(x), max(x), range(x), abs(x)       Minimum/maximum, range and absolute value
sum(x), diff(x), prod(x)               Sum, successive differences and product of vector elements
mean(x), median(x), sd(x), var(x)      Mean, median, standard deviation, variance
weighted.mean(x,w)                     Mean of x with weights w
quantile(x,probs=)                     Sample quantiles corresponding to the given probabilities
                                       (defaults to 0, .25, .5, .75, 1)
round(x, n)                            Rounds the elements of x to n decimals
Re(x), Im(x), Conj(x)                  Real part, imaginary part and conjugate of a complex number
Arg(x)                                 Angle in radians of the complex number
fft(x)                                 Fast Fourier Transform of an array
pmin(x,y,…), pmax(x,y,…)               A vector whose i-th element is the min/max of (x[i], y[i], …)
cumsum(x), cumprod(x)                  A vector whose i-th element is the sum/product of x[1] through x[i]
cummin(x), cummax(x)                   A vector whose i-th element is the min/max of x[1] through x[i]
var(x,y) or cov(x,y)                   Covariance between 2 vectors
cor(x,y)                               Linear correlation between x and y
length(x)                              Get the length of the vector
factorial(n)                           Calculate n!
choose(n,k)                            Combination function: n! / ( k! * (n - k)! )

*Note: Many math functions have a logical parameter na.rm (default FALSE) that controls whether missing values (NA) are removed before the calculation.
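The note on na.rm and a few of the vector functions from the table can be illustrated with a short sketch. The vector x and its values are made up for this example and are not part of the original handout.

```r
# Hypothetical measurements with one missing value
x <- c(2.5, NA, 4.0, 1.5)

m_default <- mean(x)                # NA: the missing value propagates
m_clean   <- mean(x, na.rm = TRUE)  # mean of the three observed values
total     <- sum(x, na.rm = TRUE)   # sum of the three observed values

# Cumulative and element-wise functions return vectors
running <- cumsum(c(1, 2, 3, 4))          # running totals
lowest  <- pmin(c(1, 5, 3), c(4, 2, 6))   # element-wise minimum

# Remainder and integer division
r <- 17 %% 5
q <- 17 %/% 5
```

With na.rm left at its default, a single NA makes the whole result NA, which is often the first sign that a data set contains missing values.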
Directories and Workspace:

Function                        Description
getwd()                         Get working directory
setwd("/projects/myR/")         Set current directory
ls()                            List objects in the current workspace
rm(x,…)                         Remove objects from the current workspace
list.files()                    List files in the current directory
list.dirs()                     List directories
file.info("myfile.xls")         Get file properties
file.exists("myfile.xls")       Check if file exists
file.remove("myfile.xls")       Delete file
file.append(file1, file2)       Append file2 to file1
file.copy(from, to, …)          Copy file
system("ls -la")                Execute command in the operating system
save.image()                    Save contents of the current workspace in the default file .RData
save.image(file="myR.Rdata")    Save contents of the current workspace in the file
save(a,b, file = "ab.Rdata")    Save a and b in the file
load("myR.Rdata")               Restore workspace from the file


Loading and Saving Data:

Function                                      Description
read.table(file="myData.txt", header=TRUE)    Read text file
read.csv(file="myData.csv")                   Read csv file ("," is the default separator)
list.files(); dir()                           List all files in current directory
file.show(file="myData.csv")                  Show file content
write.table(x, file="myData.txt",…)           Save data x into a file
write.csv(x, file="myData.csv",…)             Save data x into a csv formatted file

Performance Tip:
- For large data files, specify optional parameters if known:
    read.table(file, nrows=10000, colClasses=c("integer",…), comment.char="")
- When reading matrices, use the scan() function instead of read.table()


Exploring the data:

Function            Description
class(x)            Get class attribute of an object
names(x)            Get or set the names of an object
head(x), tail(x)    Return the first/last parts of a vector, matrix, data frame or function
str(x)              Structure of an object
dimnames(x)         Retrieve or set dimnames of an object
length(x)           Get or set the length of a vector or factor
summary(x)          Generic function: produces a summary of the data
attributes(x)       List an object's attributes
dim(x)              Retrieve or set the dimension of an object
nrow(x), ncol(x)    Return the number of rows or columns of a matrix or data frame
row.names()         Retrieve or set the names of the rows


R script file

An R script is usually saved in a file with the extension .R (or .r).
# serves as a comment indicator (every character on the line after the # sign is ignored).
source("myScript.R") will load the script into the R workspace and execute it.
source("myScript.R", echo=TRUE) will load and execute the script and also show the content of the file.

R script example (weather.R)

    # This script loads data from a table and explores the data
    # Script is written for Introduction to R tutorial

    # Load datafile
    weather <- read.csv("BostonWeather_sept2012.csv")

    # Get header names
    names(weather)

    # Get class of the loaded object
    class(weather)

    # Get attributes
    attributes(weather)

    # Get dimensions of the loaded data
    dim(weather)

    # Get structure of the loaded object
    str(weather)

    # Summary of the data
    summary(weather)


Installing and loading R packages

To install an R package from the CRAN website: install.packages("package")
library(package) loads the package into the workspace. The package has to be loaded every time you start a new R session.
Another way to load a package into the workspace is require(package), which is usually used inside functions. It returns FALSE and gives a warning (rather than an error) if the package does not exist.
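The difference between an error and a FALSE return value can be sketched as follows. The package name notARealPackage is hypothetical and chosen only so that loading fails; the message text is likewise an illustration.

```r
# "notARealPackage" is a made-up name used only to show the failure case
pkg <- "notARealPackage"

# require() returns TRUE or FALSE instead of stopping with an error;
# character.only = TRUE lets the package name be passed as a string
ok <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))

if (!ok) {
  message("Package '", pkg, "' is not installed; ",
          "install.packages(\"", pkg, "\") would be the next step")
}

# A package that ships with R loads without complaint
has_stats <- require("stats", character.only = TRUE)
```

Because the result is a logical value, a function can test it and fall back to another approach, whereas library() would stop the script.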
installed.packages()       retrieves details about all installed packages
library()                  lists all available packages
search()                   lists all attached packages (the search path)
library(help = package)    provides information about all the functions in a package


Getting help

Function                Description                                         Example
?topic                  Get R documentation on topic                        ?mean
help(topic)             Get R documentation on topic                        help(mean)
help.search("topic")    Search the help for topic                           help.search("mean")
example(topic)          Get example of function usage                       example(mean)
apropos("topic")        Get the names of all objects in the search list     apropos("mean")
                        that match the string "topic"
methods(function)       List all methods of the function                    methods(mean)
function_name           Typing a function name without parentheses          mean
                        will in most cases show its code


R object types:

o Vector - a set of elements of the same type.
o Matrix - a set of elements of the same type organized in rows and columns.
o Data Frame - a set of elements organized in rows and columns, where columns can be of different types.
o List - a collection of data objects (possibly of different types); a generalization of a vector.
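A minimal sketch of the four object types, using made-up values, shows how each is created and what happens when types are mixed inside a vector. All variable names here are illustrative.

```r
v   <- c(1.5, 2, 3)                  # vector: all elements numeric
m   <- matrix(1:6, nrow = 2)         # matrix: 2 rows, 3 columns, one type
df  <- data.frame(id = 1:2,          # data frame: columns of different types
                  label = c("a", "b"))
lst <- list(v = v, m = m, df = df)   # list: components of any type

# A vector holds a single type, so mixed input is coerced
# to the most general type present (here: character)
mixed <- c(1, "two", TRUE)

# str() gives a compact view of any of these objects
str(lst)
```

The coercion in mixed is the reason a list or a data frame is needed whenever values of different types must be kept together.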
Vector creation (examples):

    # Create a vector using concatenation of elements: c()
    v1 <- c(5, 8, 3, 9)
    v2 <- c("One", "Two", "Three")

    # Generate sequence (from:to)
    s1 <- 2:5

    # Sequence function: seq(from, to, by, length.out)
    seq(0, 1, length.out = 5)
    [1] 0.00 0.25 0.50 0.75 1.00

    seq(1, 6, by = 3)
    [1] 1 4

    seq(4)
    [1] 1 2 3 4

    # Generate vector using repeat function: rep(x, times)
    rep(7, 3)
    [1] 7 7 7


Accessing vector elements:

Indexing vectors    Description
x[n]                nth element
x[-n]               all but the nth element
x[1:n]              first n elements
x[-(1:n)]           elements starting from n+1
x[c(1,3,6)]         specific elements
x[x>3 & x<7]        all elements greater than 3 and less than 7
x[x<3 | x>7]        all elements less than 3 or greater than 7

Performance Tip:
- R is designed to work with vectors very efficiently. Avoid using loops to perform the same operation on each element; apply the function to the whole vector instead.
- For large arrays avoid dynamic expansion if possible. Allocate memory to hold the result and then fill in the values.
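The two performance tips can be sketched side by side. The vector size and variable names are arbitrary choices for illustration; actual timings depend on the machine and are not claimed here.

```r
n <- 10000
x <- seq_len(n)

# Vectorized: one call operates on the whole vector
sq_vec <- x^2

# Loop with preallocation: the result vector is created once, then filled in
sq_loop <- numeric(n)
for (i in seq_len(n)) {
  sq_loop[i] <- x[i]^2
}

# Logical indexing is also vectorized: no loop is needed to filter elements
y <- c(1, 4, 9, 5, 2, 6)
middle <- y[y > 3 & y < 7]
```

Both approaches give the same values; the vectorized form is shorter and avoids the per-element overhead of the loop, while preallocation avoids repeatedly copying a growing vector.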
Useful vector operations:

Operation           Description
sort(x)             Returns sorted vector (in increasing order)
rev(x)              Reverses elements of x
which.max(x)        Returns index of the largest element
which.min(x)        Returns index of the smallest element
which(x == a)       Returns vector of indices i for which x[i] == a
na.omit(x)          Removes the observations with missing data
x[is.na(x)] <- 0    Replaces all missing elements with zeros


Matrix creation (examples):

    # Create a matrix using function: matrix(data, nrow, ncol, byrow=F)
    matrix(1:6, nrow = 2)
         [,1] [,2] [,3]
    [1,]    1    3    5
    [2,]    2    4    6

    # Create a diagonal matrix: diag()
    diag(3)
         [,1] [,2] [,3]
    [1,]    1    0    0
    [2,]    0    1    0
    [3,]    0    0    1

    diag(4, 2, 2)
         [,1] [,2]
    [1,]    4    0
    [2,]    0    4

    # Combine arguments by column: cbind()
    cbind(c(1,2,3), c(4,5,6))
         [,1] [,2]
    [1,]    1    4
    [2,]    2    5
    [3,]    3    6

    # Combine arguments by row: rbind()
    rbind(c(1,2,3), c(4,5,6))
         [,1] [,2] [,3]
    [1,]    1    2    3
    [2,]    4    5    6

    # Create matrix using array(x, dim) function
    array(1:6, c(2,3))
         [,1] [,2] [,3]
    [1,]    1    3    5
    [2,]    2    4    6


Accessing matrix elements:

Indexing matrices    Description
x[i,j]               Element at row i, column j
x[i,]                Row i (output is a vector)
x[,j]                Column j (output is a vector)
x[c(1,5),]           Rows 1 and 5 (output is a matrix)
x[,c(2,3,6)]         Columns 2, 3 and 6 (output is a matrix)
x["name",]           Row named "name"
x[,"name"]           Column named "name"

Performance Tip:
- When calculating the mean or sum of row/column elements, use the rowSums(), rowMeans(), colSums() and colMeans() functions. For matrices they are faster than applying sum() and mean() to each row or column.
- For large matrices avoid dynamic expansion (using cbind() and rbind()) if possible. Allocate memory to hold the result and then fill in the values.
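A small sketch of the matrix tips, using the same 2 x 3 matrix as the creation examples. The result matrix res and the values written into it are made up for illustration.

```r
m <- matrix(1:6, nrow = 2)

# Dedicated functions for row/column sums and means
row_totals <- rowSums(m)
col_means  <- colMeans(m)

# The same row sums computed with apply(); the result is identical,
# but rowSums() is the more direct route
row_totals_apply <- apply(m, 1, sum)

# Preallocate the result matrix, then fill it column by column,
# instead of growing it with repeated cbind() calls
res <- matrix(NA_real_, nrow = 3, ncol = 2)
for (j in 1:2) {
  res[, j] <- (1:3) * j
}
```

Preallocating with NA values also makes it easy to spot any cells the loop failed to fill.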
Useful matrix operations:

Operation                 Description
t(x)                      Transpose
x * y                     Multiply elements of 2 matrices (element by element)
x %*% y                   Perform "normal" matrix multiplication
diag(x)                   Returns a vector of diagonal elements
det(x)                    Returns determinant of matrix
solve(x)                  Returns inverse matrix if it exists, an error otherwise
solve(a,b)                Returns solution vector x for the system a %*% x = b
rowSums(), colSums()      Returns vector with the sum of each row/column
rowMeans(), colMeans()    Returns vector with the mean value of each row/column


Data frames:

- Elements organized in rows and columns, where columns can be of different types
- All elements in the same column must have the same data type
- Usually obtained by reading a data file
- Can be created using the data.frame() function

    # Create a data frame using function: data.frame()
    name <- c("Paul", "Simon", "Robert")
    age <- c(8, 12, 3)
    height <- c(53.5, 64.8, 35.2)
    family <- data.frame(Name = name, Age = age, Height = height); family
        Name Age Height
    1   Paul   8   53.5
    2  Simon  12   64.8
    3 Robert   3   35.2

    # To sort a data frame using one column
    family[order(family$Age),]
        Name Age Height
    3 Robert   3   35.2
    1   Paul   8   53.5
    2  Simon  12   64.8


Accessing data frame elements:

Indexing data frames    Description
x[[i]]                  Accessing column i (returns vector)
x[["name"]]             Accessing column named "name" (returns vector)
x$name                  Accessing column named "name" (returns vector)
x[,i]                   Accessing column i (returns vector)
x[j,]                   Accessing row j (returns data frame!)
x[i:j,]                 Accessing rows from i to j
x[i,j]                  Accessing element in row i and column j
x[i, "name"]            Accessing element in row i and column "name"


Lists:

- Generalization of vector: ordered collection of components
- Elements can be of any mode or type
- Many R functions return a list as their output object
- Can be created using the list() function

    # Create a list using function: list()
    lst <- list(name="Fred", no.children=3, child.ages=c(12,8,3))

    # Create a list using concatenation: c()
    # (list.A, list.B and list.C are existing lists)
    list.ABC <- c(list.A, list.B, list.C)

    # List can be created from different R objects
    list.misc <- list(e1 = c(1,2,3), e2 = list.B, e3 = matrix(1:4,2))


Accessing list elements:

Indexing lists    Description
x[[i]]            Accessing component i
x[["name"]]       Accessing component named "name"
x$name            Accessing component named "name"
x[i:j]            Accessing components from i to j (returns a list)


Factors:

- A factor stores a vector as integer codes together with the set of distinct values (levels). It provides an easy way to store character strings common for categorical variables.

Performance Tip:
- Use factors to store vectors (especially character vectors) that take only a few values (categorical variables).
- Factors can take less memory and be faster to process than character vectors.

Factor operations:

Operation                         Description
factor(x)                         Convert vector to a factor
relevel(x, ref=…)                 Reorder the levels so that the level given by ref comes first
levels(x)                         List levels in a factor
attributes(x)                     Inspect attributes of a factor
table()                           Get count of elements in each level
is.factor(x)                      Checks if x is a factor. Returns TRUE or FALSE
cut(x, breaks)                    Divide x into intervals (factors)
gl(n,k,length=n*k,labels=1:n)     Generate factors by specifying pattern


Regression analysis

Function       Description
lm()           Linear regression
glm()          Generalized linear regression
nls()          Non-linear regression
residuals()    The difference between observed values and fitted values
deviance()     Returns the deviance
gls()          Fit linear model using generalized least squares (package nlme)
gnls()         Fit nonlinear model using generalized least squares (package nlme)


Miscellaneous functions for data analysis

Function            Description
optim()             General purpose optimization
nlm()               Minimize function
spline()            Spline interpolation
kmeans()            k-means clustering on a data matrix
ts()                Create a time series
t.test()            Student's t-test
binom.test()        Binomial test
merge()             Merge 2 data frames
sample()            Sampling
density()           Kernel density estimates of x
logLik(fit)         Computes the logarithm of the likelihood
predict(fit,…)      Predictions from fit based on input data
anova()             Analysis of variance (or deviance)
aov(formula)        Analysis of variance model


Distributions

Function                                Description
rnorm(n, mean=0, sd=1)                  Gaussian
runif(n, min=0, max=1)                  Uniform
rexp(n, rate=1)                         Exponential
rgamma(n, shape, scale=1)               Gamma
rpois(n, lambda)                        Poisson
rcauchy(n, location=0, scale=1)         Cauchy
rbeta(n, shape1, shape2)                Beta
rchisq(n, df)                           Chi-squared (Pearson)
rbinom(n, size, prob)                   Binomial
rgeom(n, prob)                          Geometric
rlogis(n, location=0, scale=1)          Logistic
rlnorm(n, meanlog=0, sdlog=1)           Lognormal
rt(n, df)                               Student's t
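The random-number generators and the regression functions listed above can be combined in a short sketch: simulate data from a known linear relationship, then recover it with lm(). The model y = 2 + 3x, the sample size and the seed are assumptions made for this illustration only.

```r
set.seed(42)   # fix the seed so the simulated data are reproducible
n <- 100

# Simulate a predictor and a response from a hypothetical linear model
x <- runif(n, min = 0, max = 10)
y <- 2 + 3 * x + rnorm(n, mean = 0, sd = 1)

# Fit the linear regression and inspect the result
fit <- lm(y ~ x)
est <- coef(fit)        # estimated intercept and slope
res <- residuals(fit)   # observed minus fitted values

# Predict the response for new values of the predictor
new_y <- predict(fit, newdata = data.frame(x = c(1, 5)))
```

Calling summary(fit) afterwards prints standard errors and test statistics for the estimated coefficients.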