Programming in R: Statistical Computing and Graphics R is a freely available software package used for statistical analysis, data visualization, and algebraic (matrix) computation that can run on Unix, Windows, and Mac operating systems. R is a command-based language with many objects and functions built-in. Users can also define their own objects and functions, and many specialized packages are also available. For more background, downloads, and a more thorough user-manual see: http://cran.r-project.org/ Note: On certain platforms, R will not recognize the opening and closing quotation marks (‘ and ’) found throughout this file, but will recognize the generic quotation marks. If any of the commands gives an error when copied and pasted into R, try typing in the quotation marks manually into R, or using a text version of this file. R can be used like a calculator 5+9 4 / 7 + (100-2) / 5 sqrt(16) exp(8) The assignment operator is the ‘=‘ sign; ‘<-’ can also be used a=3 x=4 x**a or x^a returns xa The workspace is defined as all objects and user-defined functions in the current environment. The command ls() returns a list of all elements in the environment The command rm(a, x) can be used to remove the two elements from the workspace that we created above Getting Help (?): ?ls ?matrix Comments: The # sign is used to denote a comment (the same is in perl) Data types: vectors – these are 1 dimensional (1 row of numbers, characters, etc.) v= 1:15 # type the name of the object, in this case v, to view it v[2] #returns the 2nd element of the vector length(v) #returns the number of elements in the vector v = c(‘a’, ‘b’, ‘c’) v = seq(1,10,by=2) v = c(1,2,5) v = rep(10,6) matrices – these can be multidimensional, but all elements must be of the same type v = 1:15 m = matrix(v, nrow = 3,ncol = 5,byrow = T) #creates a 3 (row) x 5 (column) matrix m = matrix(v, 3, byrow = T) # does the same thing ## can you create the matrix below: 111 222 333 444 dim(m) # returns the number of rows and columns of matrix m dim(m)[1] #the number of rows dim(m)[2] #the number of columns we can access elements of the matrix m using m[rows, columns], where rows and columns are the rows and columns of interest m[1:2,2:3] returns rows 1 and 2 and columns 2 and 3 m[rows, ] returns the specified rows (and all columns) m[, columns] returns the specified columns (and all rows) Note: if only 1 row or column is specified, then a vector will be returned Can you change the element in the 3rd column and the 4th row to 0? Matrix arithmetic m + 3 # adds 3 to each element of m m * 5 # multiplies all elements in m by 5 # for 2 matrices m1 and m2 of equal dimension, add corresponding elements m1 + m2 Suppose we want to evaluate a function on all rows or columns of a matrix. We can easily do this with the apply function The function mean() returns the mean value of the elements in the object passed into it. apply(m,1,mean) #returns the mean of each row apply(m,2,mean) #returns the mean of each column lists – can store elements of various types; in this respect, are similar to classes in object oriented programming languages student = list(name = ‘Bob’, age = 20) Elements of a list can be accessed according to their names… student$name # returns ‘Bob’ student$age # returns 20 We can easily add an element to the list: student$grade = ‘A’ And delete an element from the list: student$age = NULL … We can also access elements according to their index using double brackets: student[[1]] returns ‘Bob’ Important generic functions (which may or not be useful depending on the type of object): summary() mean() plot() #data coercion: as.vector() as.matrix() #check the type of an object: is.vector() is.matrix() Similar functions are defined for all data types (e.g., as.list(), as.character(), etc. ) The command typeof(object) returns the type of the object (obviously) Note: some objects may be more than one type, and only one type will be returned. ex: v = 1:1000 typeof(v) returns ‘integer’ is.vector(v) and is.integer(v) BOTH return TRUE is.matrix(v) returns FALSE General statistical functions for any object x (usually a vector or matrix) mean(x) min(x) var(x) max(x) sd(x) quantile(x) summary(x) # returns the five number summary, as well as the mean Data input A list of commands in a file can be read using source(file.name) source(‘http://www.public.iastate.edu/~gdancik/summer2007/files/setx.txt’) Reading in a file data = read.table(‘http://www.public.iastate.edu/~gdancik/summer2007/files/BigClass.txt‘, sep = ‘,’, header = T) data.frames Data frames are objects that combine features (particularly element access methods) of matrices and lists The columns of ‘data’ are ‘name’, ‘age’, ‘sex’, ‘height’, and ‘weight’ This can be determined using colnames(data) data$name data$age summary(data) Suppose we want to change the heading of ‘sex’ to ‘gender’ We can rename all of the columns using colnames(data) = new.names - Can you create a vector of the column names we want? - Can you change ‘sex’ to ‘gender’? - Can you rename the column names? Alternately, we could have used colnames(data)[3] = ‘gender’ Another data type is the logical data type (TRUE or FALSE; or alternatively T and F) 5>3 5>9 Logical operators (e.g. to compare two numbers): >, <, >=, <=, ==, != v = 1:10 index = v > 5 # for each element of v, check if that element > 5 v[index] # returns the elements of v that are > 5 - In the big class data set, retrieve a list of students greater than 15 years old o index = data$age > 15 o data[index,] - note that we need to include the ‘,’ after ‘index’. Why is this? o The previous two steps may also be combined: data[data$age > 15,] - Other examples: o data[data$gender == ‘M’,] o data[data$age == 12,] o data[data$age == 12]$height o data[data$age == 12, 4] Relationship between two variables: To reduce future typing, first enter: x = data$height y = data$weight cor(x,y) returns the correlation between x and y plot(x,y, xlab = ‘height’, ylab = ‘weight’, main = ‘scatterplot of height and weight’) Linear models: A linear model (for one input variable) has the form: y = b0 + b1x, where y is referred to as the response variable and x is an input variable. fit1 = lm(y~x) #fits a linear model of the above form summary(l) Estimates of b0 and b1 are the first and second elements of l$coeff, respectively. These can also be found through summary(l) or simply by printing the l object If we plug in our estimates of b0 and b1 into the original equation, we can predict a person’s weight (y) from their known height (x). Doing this for our known weights, we get a list of fitted values, l$fitted. plot(x,y, xlab = ‘height’, ylab = ‘weight’, main = ‘scatterplot of height and weight’) lines(x, fit1$fitted, col = ‘red’) Now let us consider the model y = b0 + b1x1 + b2x2, where y = data$weight x1 = data$age x2 = data$height fit2 = lm(y ~ x1 + x2) An alternate way to do this is to create a matrix with the 1st column = x1 and the 2nd column = x2. X = data[,c(2,4)] #returns a data.frame containing age and height (this returns a # data.frame since the original object is a data.frame) X = as.matrix(X) # X must be a matrix for the lm function to work! fit2 = lm(y~X) How does fit1 compare to fit2? Comparing two variables (box plots and the t-test) hm = data[data$gender == ‘M’,]$height hf = data[data$gender == ‘F’,]$height boxplot(hm, hf, names = c(‘male’, ‘female’), main = ‘height’) For independent samples from 2 populations, the t-test is used to test a hypothesis about the population means (μ1 and μ2). By default, the t-test function in R considers: H0: μ1 – μ2 = 0 H1: μ1 – μ2 ≠ 0 t = t.test(hm, hf) t Writing your own functions (and loops) f = function(x1, x2 = 0) { return (x1 + x2) } The return statement above is optional. We may want to have a function that simply ‘does something’. For example, suppose we have a matrix m, and we want to plot a line that corresponds to each row of the matrix. (Note: there is a function called matplot that will do this) m = matrix(1:15,ncol=5,byrow=T) plotLines = function(m, ...) { # this is a comment lower = min(m) upper = max(m) for (i in 1:dim(m)[1]) { plot(m[i,], ylim = c(lower, upper), type = ‘l’, ...) par(new=T) } } R also allows while loops: i =0 while (i < 10) { print(i) i=I+1 } Within a loop you may use break or next statements, similar to Perl. Conditional statements is5 = function(x) { if (x == 5) { print (‘x is equal to 5’) } else { print (‘x is not equal to 5’) } } Note: There is no if else expression in R – you must used nested if…else statements. Saving and Loading R objects First let’s check our current working directory. This is the directory in which files will be saved or the directory that R attempts to be read from if only a file name is specified. In order to get and set the working directory, use the functions ‘getwd’ and ‘setwd’ It is recommended that you change the working directory now…. # save the current workspace in the current working directory save.image(file = ‘file.RData’) save(x, file = x.RData’) # can be used to save a subset of objects in the workspace # will load in the specified workspace or R object (note: objects in the workspace you are # loading in will overwrite any objects currently defined load(‘file.RData’) # save the matrix m as a text file write(t(m), ncolumns = ncol(m), file = ‘m.txt’) # this can later be read in using the # read.table command. Probability distributions R can handle all common probability distributions, including the normal and (continuous) uniform distribution. For the normal distribution (standard normal by default), ‘dnorm’ gives the density, ‘pnorm’ gives the distribution function, ‘qnorm’ gives the quantile function, and ‘rnorm’ generates random deviates. Other probability functions work similarly (e.g., dunif, punif, etc. for the uniform distribution) #We can visualize the standard normal density x = seq(-5,5, by=0.1) plot(x, dnorm(x), type = ‘l’) #We can generate 1000 observations from the standard normal distribution z = rnorm(1000) hist(z) #Let Z ~ N(0,1). Then pnorm(1.645) # returns P(Z < 1.645) qnorm(.95) # returns the value z*, for which (P(Z < z*) = 0.95 # Simulate flipping a coin 100 times, where P(H) = P(T) = ½ flips = runif(100) flips[ flips < 0.5 ] = ‘H’ flips[ flips != ‘H’] = ‘T’ flips = as.factor(flips) summary(flips) ### count number of nucleotides in a sequence x countN1 = function(x) { numA = 0 numG = 0 numC = 0 numT = 0 x = toupper(x) ## convert to all uppercase for (i in 1:length(x)) { if (x[i] == ‘A’) numA = numA + 1 if (x[i] == ‘G’) numG = numG + 1 if (x[i] == ‘C’) numC = numC + 1 if (x[i] == ‘T’) numT = numT + 1 } print(paste(‘A:’, numA)) print(paste(‘G:’, numG)) print(paste(‘C:’, numC)) print(paste(‘T:’, numT)) } countN2 = function(x) { x = toupper(x) numA = length( x[ x == ‘A’] ) numG = length( x[ x == ‘G’] ) numC = length( x[ x == ‘C’] ) numT = length( x[ x == ‘T’] ) print(paste(‘A:’, numA)) print(paste(‘G:’, numG)) print(paste(‘C:’, numC)) print(paste(‘T:’, numT)) } Quitting R To quit R, type the command q(). You will be prompted to save your workspace. If you have not saved the workspace previously, R will save the workspace in a file called “.RData” in the current working directory by default. Practice Exercises (Further instructions will be given in class) The following exercises are designed primarily for programming practice in R. However, we will motivate the problems by considering the following: 1) A researcher has identified genetic structure that she believes is conserved throughout the genome. In order to determine the probability that this structure arose by chance, she generates many random sequences of the same length, with marginal probabilities for each nucleotide based on their empirical probabilities. 2) A researcher is studying promoter regions that are rich in guanine, and, from a list of candidate promoters, wants to look at all sequences where guanine content is greater than 30%. 1) Generating random sequences – a. Create a function that generates a single random nucleotide X where P(X = “G”) = 0.30, P(X = “A”) = 0.20, P(X = “C”) = 0.25, and P(X = “T”) = 0.25 Hint: You may want to use the runif() function to do this. b. Using the function you have created in (a), create another function that generates a random nucleotide sequence of length n. c. Generate a random nucleotide sequence of length 100 using the sample() function, where the probability of each nucleotide is given in (a) Hinte: type ‘?sample’ for more information. 2) Sequence analysis – a. Load the object ‘sequences’ using the following command: load(url(‘http://www.public.iastate.edu/~gdancik/summer2007/files/sequences .RData’)) to get a data.frame of dna sequences, which has the name ‘sequences’. Each column contains a 40-base nucleotide sequence. For example, sequences[,1] will return the first sequence (as a factor) b. Since the columns of sequences are factors, summary(sequences) will tell you the number of each nucleotide in each column. However, suppose that we did not know this. Modify the countN1 or countN2 functions to take a single sequence, and return a vector of 4 elements that corresponds to the number of A’s, G’s, C’s, and T’s in the sequence. (Note: you will need to remove the toupper() function, since we are now working with factors, and not characters). c. Use the apply function to return a 4 x 10 matrix with the number of A’s, G’s, C’s, and T’s in each of the 10 sequences.