Programming in R: Statistical Computing and Graphics

R is a freely available software package used for statistical analysis, data visualization,
and algebraic (matrix) computation that can run on Unix, Windows, and Mac operating
systems. R is a command-based language with many objects and functions built-in.
Users can also define their own objects and functions, and many specialized packages are
also available.
For more background, downloads, and a more thorough user-manual see:
http://cran.r-project.org/
Note: R recognizes only the generic quotation marks (' and "), not the opening and closing
quotation marks (‘ and ’) that word processors insert. If any of the commands gives an
error when copied and pasted into R, try typing the quotation marks manually into R, or
use a text version of this file.
R can be used like a calculator
5+9
4 / 7 + (100-2) / 5
sqrt(16)
exp(8)
The assignment operator is the ‘=’ sign; ‘<-’ can also be used
a=3
x=4
x**a or x^a returns x raised to the power a
The workspace is defined as all objects and user-defined functions in the current
environment. The command ls() returns a list of all elements in the environment
The command rm(a, x) can be used to remove the two elements from the workspace that
we created above
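As a short sketch, the commands above can be run in sequence as follows. The object p is only an illustrative name and is not used elsewhere in this handout.

```r
a = 3
x = 4
p = x^a        # x**a gives the same result
print(p)       # 64
print(ls())    # the names in the workspace, here including 'a', 'p' and 'x'
rm(a, x)       # remove a and x; p is still defined
print(ls())
```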
Getting Help (?):
?ls
?matrix
Comments:
The # sign is used to denote a comment (the same as in Perl)
Data types:
vectors – these are 1 dimensional (1 row of numbers, characters, etc.)
v= 1:15
# type the name of the object, in this case v, to view it
v[2] #returns the 2nd element of the vector
length(v) #returns the number of elements in the vector
v = c('a', 'b', 'c')
v = seq(1,10,by=2)
v = c(1,2,5)
v = rep(10,6)
matrices – these are 2 dimensional (rows and columns), and all elements must be of the same type
v = 1:15
m = matrix(v, nrow = 3,ncol = 5,byrow = T) #creates a 3 (row) x 5 (column) matrix
m = matrix(v, 3, byrow = T) # does the same thing
## can you create the matrix below:
1 1 1
2 2 2
3 3 3
4 4 4
dim(m) # returns the number of rows and columns of matrix m
dim(m)[1] #the number of rows
dim(m)[2] #the number of columns
we can access elements of the matrix m using
m[rows, columns], where rows and columns are the rows and columns of interest
m[1:2,2:3] returns rows 1 and 2 and columns 2 and 3
m[rows, ] returns the specified rows (and all columns)
m[, columns] returns the specified columns (and all rows)
Note: if only 1 row or column is specified, then a vector will be returned
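The indexing rules above can be sketched on the same 3 x 5 matrix. The names sub, r2 and r2m are illustrative only. The drop = FALSE argument is the standard way to keep a single row or column as a matrix.

```r
m = matrix(1:15, nrow = 3, byrow = TRUE)
sub = m[1:2, 2:3]              # 2 x 2 matrix: rows 1-2, columns 2-3
r2 = m[2, ]                    # a single row comes back as a plain vector
r2m = m[2, , drop = FALSE]     # drop = FALSE keeps it as a 1 x 5 matrix
print(sub)
print(is.matrix(r2))           # FALSE
print(dim(r2m))                # 1 5
```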
Can you change the element in the 3rd column and the 4th row to 0?
Matrix arithmetic
m + 3 # adds 3 to each element of m
m * 5 # multiplies all elements in m by 5
# for 2 matrices m1 and m2 of equal dimension, add corresponding elements
m1 + m2
Suppose we want to evaluate a function on all rows or columns of a matrix. We can
easily do this with the apply function
The function mean() returns the mean value of the elements in the object passed into it.
apply(m,1,mean) #returns the mean of each row
apply(m,2,mean) #returns the mean of each column
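A small worked sketch of apply on the 3 x 5 matrix used earlier. The second argument selects the margin (1 for rows, 2 for columns), and the third can be any function of a vector, including one written on the spot. The variable names here are illustrative.

```r
m = matrix(1:15, nrow = 3, byrow = TRUE)
row.means = apply(m, 1, mean)   # margin 1: one value per row
col.means = apply(m, 2, mean)   # margin 2: one value per column
row.range = apply(m, 1, function(r) max(r) - min(r))  # a user-defined function
print(row.means)   # 3 8 13
print(col.means)   # 6 7 8 9 10
print(row.range)   # 4 4 4
```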
lists – can store elements of various types; in this respect, they are similar to classes in
object-oriented programming languages
student = list(name = 'Bob', age = 20)
Elements of a list can be accessed according to their names…
student$name # returns ‘Bob’
student$age # returns 20
We can easily add an element to the list:
student$grade = 'A'
And delete an element from the list:
student$age = NULL
… We can also access elements according to their index using double brackets:
student[[1]] returns ‘Bob’
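Putting the list operations together, as a brief sketch. It also shows that double brackets accept a name as well as an index, while single brackets return a shorter list rather than the element itself.

```r
student = list(name = 'Bob', age = 20)
student$grade = 'A'          # add an element
student$age = NULL           # delete an element
print(names(student))        # "name" "grade"
print(student[['grade']])    # double brackets with a name: "A"
print(is.list(student[1]))   # single brackets return a list: TRUE
```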
Important generic functions (which may or may not be useful depending on the type of object):
summary()
mean()
plot()
#data coercion:
as.vector()
as.matrix()
#check the type of an object:
is.vector()
is.matrix()
Similar functions are defined for all data types (e.g., as.list(), as.character(), etc. )
The command typeof(object) returns the type of the object (obviously)
Note: some objects may be more than one type, and only one type will be returned.
ex:
v = 1:1000
typeof(v) returns ‘integer’
is.vector(v) and is.integer(v) BOTH return TRUE
is.matrix(v) returns FALSE
General statistical functions for any object x (usually a vector or matrix)
mean(x)
min(x)
var(x)
max(x)
sd(x)
quantile(x)
summary(x) # returns the five number summary, as well as the mean
Data input
A list of commands in a file can be read using
source(file.name)
source('http://www.public.iastate.edu/~gdancik/summer2007/files/setx.txt')
Reading in a file
data = read.table('http://www.public.iastate.edu/~gdancik/summer2007/files/BigClass.txt',
sep = ',', header = T)
data.frames
Data frames are objects that combine features (particularly element access methods) of
matrices and lists
The columns of ‘data’ are ‘name’, ‘age’, ‘sex’, ‘height’, and ‘weight’
This can be determined using colnames(data)
data$name
data$age
summary(data)
Suppose we want to change the heading of ‘sex’ to ‘gender’
We can rename all of the columns using colnames(data) = new.names
- Can you create a vector of the column names we want?
- Can you change ‘sex’ to ‘gender’?
- Can you rename the column names?
Alternately, we could have used colnames(data)[3] = 'gender'
Another data type is the logical data type (TRUE or FALSE; or alternatively T and F)
5>3
5>9
Logical operators (e.g. to compare two numbers):
>, <, >=, <=, ==, !=
v = 1:10
index = v > 5 # for each element of v, check if that element > 5
v[index] # returns the elements of v that are > 5
- In the big class data set, retrieve a list of students older than 15 years
    o index = data$age > 15
    o data[index,] - note that we need to include the ‘,’ after ‘index’. Why is this?
    o The previous two steps may also be combined: data[data$age > 15,]
- Other examples:
    o data[data$gender == 'M',]
    o data[data$age == 12,]
    o data[data$age == 12,]$height
    o data[data$age == 12, 4]
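The role of the comma can be tried on a small stand-in data frame. The data frame d and its values below are made up for illustration and are not the class data. The index before the comma selects rows; with no comma at all, a data frame treats the index as a column selector.

```r
# Hypothetical stand-in for the class data (values are made up)
d = data.frame(name = c('Ann', 'Ben', 'Cal', 'Dee'),
               age = c(12, 16, 12, 17),
               height = c(59, 66, 61, 68))
older = d[d$age > 15, ]          # logical index in the row position
h12 = d[d$age == 12, ]$height    # select rows first, then one column by name
h12b = d[d$age == 12, 3]         # same values, column chosen by position
print(older)
print(h12)                       # 59 61
```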
Relationship between two variables:
To reduce future typing, first enter:
x = data$height
y = data$weight
cor(x,y) returns the correlation between x and y
plot(x,y, xlab = 'height', ylab = 'weight', main = 'scatterplot of height and weight')
Linear models:
A linear model (for one input variable) has the form:
y = b0 + b1x, where
y is referred to as the response variable and x is an input variable.
fit1 = lm(y~x) #fits a linear model of the above form
summary(fit1)
Estimates of b0 and b1 are the first and second elements of fit1$coeff, respectively.
These can also be found through summary(fit1) or simply by printing the fit1 object
If we plug in our estimates of b0 and b1 into the original equation, we can predict a
person’s weight (y) from their known height (x). Doing this for our known
heights, we get a list of fitted values, fit1$fitted.
plot(x,y, xlab = 'height', ylab = 'weight', main = 'scatterplot of height and weight')
lines(x, fit1$fitted, col = 'red')
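To see that the fitted values are just the estimated line evaluated at each x, here is a small sketch on simulated data. The numbers are invented and are not from the class data set. coef() and fitted() are the standard extractor functions for the coefficients and the fitted values.

```r
set.seed(1)                              # make the simulated data reproducible
x = runif(40, 50, 70)                    # made-up heights
y = -100 + 3 * x + rnorm(40, sd = 10)    # made-up weights with noise
fit = lm(y ~ x)
b = coef(fit)                            # b[1] is the intercept b0, b[2] is the slope b1
by.hand = b[1] + b[2] * x                # plug the estimates into y = b0 + b1*x
print(b)
print(all.equal(as.numeric(by.hand), as.numeric(fitted(fit))))   # TRUE
```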
Now let us consider the model
y = b0 + b1x1 + b2x2, where
y = data$weight
x1 = data$age
x2 = data$height
fit2 = lm(y ~ x1 + x2)
An alternate way to do this is to create a matrix with the 1st column = x1 and the 2nd
column = x2.
X = data[,c(2,4)] #returns a data.frame containing age and height (this returns a
# data.frame since the original object is a data.frame)
X = as.matrix(X) # X must be a matrix for the lm function to work!
fit2 = lm(y~X)
How does fit1 compare to fit2?
Comparing two variables (box plots and the t-test)
hm = data[data$gender == 'M',]$height
hf = data[data$gender == 'F',]$height
boxplot(hm, hf, names = c('male', 'female'), main = 'height')
For independent samples from 2 populations, the t-test is used to test a hypothesis about
the population means (μ1 and μ2). By default, the t-test function in R considers:
H0: μ1 – μ2 = 0
H1: μ1 – μ2 ≠ 0
res = t.test(hm, hf) # named res rather than t, since t() is R's transpose function
res
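The object returned by t.test is a list, so individual results can be extracted by name. A sketch on two simulated samples follows; g1 and g2 are invented data, not the class heights. By default t.test runs a two-sided Welch test, which does not assume equal variances.

```r
set.seed(2)
g1 = rnorm(20, mean = 62, sd = 3)   # made-up sample from population 1
g2 = rnorm(20, mean = 60, sd = 3)   # made-up sample from population 2
res = t.test(g1, g2)                # Welch two-sample t-test, two-sided by default
print(res$statistic)                # the t statistic
print(res$p.value)                  # p-value for H0: mu1 - mu2 = 0
print(res$conf.int)                 # 95% confidence interval for mu1 - mu2
```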
Writing your own functions (and loops)
f = function(x1, x2 = 0) {
    return (x1 + x2)
}
The return statement above is optional. We may want to have a function that simply ‘does
something’.
For example, suppose we have a matrix m, and we want to plot a line that corresponds to
each row of the matrix. (Note: there is a function called matplot that will do this)
m = matrix(1:15,ncol=5,byrow=T)
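plotLines = function(m, ...) {
    # this is a comment
    lower = min(m)
    upper = max(m)
    for (i in 1:dim(m)[1]) {
        plot(m[i,], ylim = c(lower, upper), type = 'l', ...)
        par(new=T)   # draw the next row on top of the current plot
    }
    par(new=F)       # reset, so that a later plot starts a new figure
}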
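For comparison, here is a sketch of the matplot alternative mentioned above. matplot draws one line per column, so the matrix is transposed to get one line per row. The axis labels are illustrative choices.

```r
m = matrix(1:15, ncol = 5, byrow = TRUE)
# t(m) is 5 x 3: each column of t(m) is one row of m, drawn as one line
matplot(t(m), type = 'l', lty = 1, xlab = 'column of m', ylab = 'value')
```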
R also allows while loops:
i = 0
while (i < 10) {
    print(i)
    i = i + 1
}
Within a loop you may use break or next statements, similar to Perl.
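A minimal sketch of both statements in a for loop: next skips to the following iteration, and break leaves the loop altogether. The vector odds is an illustrative name.

```r
odds = c()
for (i in 1:20) {
    if (i %% 2 == 0) next    # skip even numbers and go to the next iteration
    if (i > 9) break         # leave the loop entirely once i exceeds 9
    odds = c(odds, i)
}
print(odds)                  # 1 3 5 7 9
```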
Conditional statements
is5 = function(x) {
    if (x == 5) {
        print('x is equal to 5')
    } else {
        print('x is not equal to 5')
    }
}
Note: There is no single ‘elseif’ keyword in R; to test more than two cases, you must use
nested if…else statements (an if placed directly after an else).
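A nested if…else is usually written as a chain, with the inner if placed directly after the else. A sketch with a hypothetical function compare5 (not part of the original handout):

```r
compare5 = function(x) {
    if (x < 5) {
        msg = 'x is less than 5'
    } else if (x == 5) {
        msg = 'x is equal to 5'
    } else {
        msg = 'x is greater than 5'
    }
    return(msg)
}
print(compare5(3))
print(compare5(5))
print(compare5(8))
```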
Saving and Loading R objects
First let’s check our current working directory. This is the directory in which files will be
saved or the directory that R attempts to read from if only a file name is specified.
In order to get and set the working directory, use the functions ‘getwd’ and ‘setwd’
It is recommended that you change the working directory now….
# save the current workspace in the current working directory
save.image(file = 'file.RData')
save(x, file = 'x.RData') # can be used to save a subset of objects in the workspace
# will load in the specified workspace or R object (note: objects in the workspace you are
# loading in will overwrite any objects of the same name that are currently defined)
load('file.RData')
# save the matrix m as a text file
write(t(m), ncolumns = ncol(m), file = 'm.txt') # this can later be read in using the
# read.table command.
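The transpose is needed because write() outputs the values in column order, a fixed number per line. Passing t(m) makes each line of the file correspond to one row of m. A round-trip sketch follows, using a temporary file so that nothing in the working directory is touched; the names f and m2 are illustrative.

```r
m = matrix(1:15, ncol = 5, byrow = TRUE)
f = tempfile(fileext = '.txt')          # temporary file, so nothing is overwritten
write(t(m), ncolumns = ncol(m), file = f)
m2 = as.matrix(read.table(f))           # read.table returns a data.frame
dimnames(m2) = NULL                     # drop the default V1, V2, ... column names
print(all(m2 == m))                     # TRUE
```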
Probability distributions
R can handle all common probability distributions, including the normal and (continuous)
uniform distribution.
For the normal distribution (standard normal by default),
‘dnorm’ gives the density,
‘pnorm’ gives the distribution function,
‘qnorm’ gives the quantile function, and
‘rnorm’ generates random deviates.
Other probability functions work similarly (e.g., dunif, punif, etc. for the uniform
distribution)
#We can visualize the standard normal density
x = seq(-5,5, by=0.1)
plot(x, dnorm(x), type = 'l')
#We can generate 1000 observations from the standard normal distribution
z = rnorm(1000)
hist(z)
#Let Z ~ N(0,1). Then
pnorm(1.645) # returns P(Z < 1.645)
qnorm(.95) # returns the value z* for which P(Z < z*) = 0.95
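A short check that pnorm and qnorm are inverses of each other, and that the mean and sd arguments handle other normal distributions. The values 100 and 10 in the last line are arbitrary illustrations.

```r
p = pnorm(1.645)            # P(Z < 1.645)
z = qnorm(0.95)             # the 95th percentile of the standard normal
print(round(p, 3))          # 0.95
print(round(z, 3))          # 1.645
print(all.equal(qnorm(pnorm(1.2)), 1.2))   # TRUE: qnorm inverts pnorm
print(pnorm(110, mean = 100, sd = 10))     # same as pnorm(1), since (110 - 100)/10 = 1
```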
# Simulate flipping a coin 100 times, where P(H) = P(T) = ½
flips = runif(100)
flips[ flips < 0.5 ] = 'H'
flips[ flips != 'H'] = 'T'
flips = as.factor(flips)
summary(flips)
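An alternative sketch of the same simulation uses sample(), which draws directly from a set of labels. With replace = TRUE and no prob argument, each label is equally likely. The seed is set only to make the run reproducible.

```r
set.seed(3)
flips = sample(c('H', 'T'), size = 100, replace = TRUE)   # equal probabilities by default
counts = table(flips)                                     # number of heads and tails
print(counts)
```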
### count number of nucleotides in a sequence x
countN1 = function(x) {
    numA = 0
    numG = 0
    numC = 0
    numT = 0
    x = toupper(x) ## convert to all uppercase
    for (i in 1:length(x)) {
        if (x[i] == 'A') numA = numA + 1
        if (x[i] == 'G') numG = numG + 1
        if (x[i] == 'C') numC = numC + 1
        if (x[i] == 'T') numT = numT + 1
    }
    print(paste('A:', numA))
    print(paste('G:', numG))
    print(paste('C:', numC))
    print(paste('T:', numT))
}
countN2 = function(x) {
    x = toupper(x)
    numA = length( x[ x == 'A'] )
    numG = length( x[ x == 'G'] )
    numC = length( x[ x == 'C'] )
    numT = length( x[ x == 'T'] )
    print(paste('A:', numA))
    print(paste('G:', numG))
    print(paste('C:', numC))
    print(paste('T:', numT))
}
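Both functions compare x element by element, so they assume x is a vector with one nucleotide per element. If the sequence is stored as a single string, it can be split first. The sketch below uses a made-up sequence s, and shows that summing a logical vector gives the same count as the length-of-subset idiom in countN2.

```r
s = 'aaGctTga'                       # made-up sequence stored as one string
x = toupper(strsplit(s, '')[[1]])    # split into single characters, then uppercase
numA = sum(x == 'A')                 # TRUE counts as 1, so this counts the A's
numG = sum(x == 'G')
print(x)
print(c(numA, numG))                 # 3 2
```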
Quitting R
To quit R, type the command q(). You will be prompted to save your workspace. If you
choose to save it, R will save the workspace in a file called
“.RData” in the current working directory by default.
Practice Exercises (Further instructions will be given in class)
The following exercises are designed primarily for programming practice in R. However,
we will motivate the problems by considering the following:
1) A researcher has identified a genetic structure that she believes is conserved
throughout the genome. In order to determine the probability that this structure
arose by chance, she generates many random sequences of the same length, with
marginal probabilities for each nucleotide based on their empirical probabilities.
2) A researcher is studying promoter regions that are rich in guanine, and, from a list
of candidate promoters, wants to look at all sequences where guanine content is
greater than 30%.
1) Generating random sequences –
a. Create a function that generates a single random nucleotide X where
P(X = “G”) = 0.30, P(X = “A”) = 0.20, P(X = “C”) = 0.25, and
P(X = “T”) = 0.25
Hint: You may want to use the runif() function to do this.
b. Using the function you have created in (a), create another function that
generates a random nucleotide sequence of length n.
c. Generate a random nucleotide sequence of length 100 using the
sample() function, where the probability of each nucleotide is given in (a). Hint:
type ‘?sample’ for more information.
2) Sequence analysis –
a. Load the object ‘sequences’ using the following command:
load(url('http://www.public.iastate.edu/~gdancik/summer2007/files/sequences.RData'))
to get a data.frame of dna sequences, which has the name ‘sequences’. Each
column contains a 40-base nucleotide sequence. For example, sequences[,1]
will return the first sequence (as a factor)
b. Since the columns of sequences are factors, summary(sequences) will tell
you the number of each nucleotide in each column. However, suppose that we did
not know this. Modify the countN1 or countN2 functions to take a single
sequence, and return a vector of 4 elements that corresponds to the number of A’s,
G’s, C’s, and T’s in the sequence. (Note: you will need to remove the toupper()
function, since we are now working with factors, and not characters).
c. Use the apply function to return a 4 x 10 matrix with the number of A’s, G’s, C’s,
and T’s in each of the 10 sequences.