H igher E ducation Q uality E nhancement P roject (HEQEP)
Software Training Program
(An Open Source Package)
Date: March 22-23, 2013
Organized by
Department of Statistics
Rajshahi University, Rajshahi-6205, Bangladesh.
Lecture 3
2
Matrices
X<-matrix(rpois(20,1.5),nrow=4) # rpois() is random sample from Poisson dist
X
[,1] [,2] [,3] [,4] [,5]
[1,] 1 0 2 5 3
[2,] 1 1 3 1 3
[3,] 3 1 0 2 2
[4,] 1 0 2 1 0
Suppose that the rows refer to four different trials and we want to label the rows ‘Trial.1’ etc . We employ the function rownames to do this. We could use the paste function but here we take advantage of the prefix option: rownames(X)<-rownames(X,do.NULL=FALSE, prefix="Trial.")
X
[,1] [,2] [,3] [,4] [,5]
Trial.1 1 0 2 5 3
Trial.2 1 1 3 1 3
Trial.3 3 1 0 2 2
Trial.4 1 0 2 1 0
3
Matrices
For the columns we want to supply a vector of different names for the five drugs involved in the trial, and use this to specify the colnames(X): drug.names<-c("aspirin", "paracetamol", "nurofen", "hedex", "placebo") colnames(X)<-drug.names
X
Trial.1
Trial.2
Trial.3
Trial.4 aspirin paracetamol nurofen hedex placebo
1
1
3
1
0
1
1
0
2
3
0
2
5
1
2
1 0
3
3
2
Alternatively, you can use the dimnames function to give names to the rows and/or columns of a matrix. In this example we want the rows to be unlabelled (NULL) and the column names to be of the form ‘drug.1’, ‘drug.2’, etc. dimnames(X)<-list(NULL, paste("drug.",1:5,sep=""))
X
[1,]
[2,]
[3,]
[4,] drug.1 drug.2 drug.3 drug.4 drug.5
1 0 2 5 3
1 1 3 1 3
3
1
1
0
0
2
2
1
2
0
4
We illustrate how to construct a data frame from the following car data.
Make Model
Honda Civic
Chevrolet Beretta
Ford Escort
Eagle Summit
Volkswagen Jetta
Buick Le Sabre
Mitsubishi Galant
Dodge
V4
Grand Caravan V6
Chrysler New Yorker V6
Acura Legend V6
V4
V4
V6
Cylinder Weight Mileage Type
V4
V4
V4
2170
2655
2345
33
26
33
Sporty
Compact
Small
2560
2330
3325
33
26
23
Small
Small
Large
2745
3735
3450
3265
25
18
22
20
Compact
Van
Medium
Medium
5
> Make <- c("Honda","Chevrolet","Ford","Eagle","Volkswagen","Buick","Mitsbusihi",
+ "Dodge","Chrysler","Acura")
> Model <- c("Civic","Beretta","Escort","Summit","Jetta","Le Sabre","Galant",
+ "Grand Caravan","New Yorker","Legend")
> Cylinder <-c (rep("V4",5),"V6","V4",rep("V6",3))
> Weight <- c(2170, 2655, 2345, 2560, 2330, 3325, 2745, 3735, 3450, 3265)
> Mileage <- c(33, 26, 33, 33, 26, 23, 25, 18, 22, 20)
> Type <- c("Sporty","Compact",rep("Small",3),"Large","Compact","Van",
+ rep("Medium",2))
# rep("V4",5) instructs R to repeat V4 five times.
6
Now data.frame() function combines the six vectors into a single data frame.
Car <- data.frame(Make,Model,Cylinder,Weight,Mileage,Type)
Car
Make
1 Honda
Model
Civic
2 Chevrolet Beretta
3 Ford Escort
4 Eagle Summit
5 Volkswagen Jetta
V4
V4
V4
V4
6 Buick Le Sabre
7 Mitsubishi Galant
8 Dodge
V6
V4
Grand Caravan V6
Cylinder Weight Mileage Type
V4 2170 33 Sporty
2655 26
2345 33
Compact
Small
2560
2330
3325
2745
3735
33
26
23
25
18
Small
Small
Large
Compact
Van
9 Chrysler New Yorker V6
10 Acura Legend V6
3450 22
3265 20
Medium
Medium
7
names(Car)
[1] "Make" "Model" "Cylinder“ "Weight" "Mileage" "Type"
Car[1,]
Make Model Cylinder Weight Mileage Type
1 Honda Civic V4 2170 33 Sporty
Car[10,4]
[1] 3265
Car$Mileage
[1] 33 26 33 33 26 23 25 18 22 20 mean(Car$Mileage) #average mileage of the 10 vehicles
[1] 25.9 min(Car$Weight)
[1] 2170
# minimum of car weights
8
table(Car$Type) # gives a frequency table
Compact Large Medium Small Sporty Van
2 1 2 3 1 1
table(Car$Make, Car$Type) # Cross tabulation
Compact Large Medium Small Sporty Van
Acura
Buick
0
0
Chevrolet 1 0
Chrysler 0 0
Dodge 0 0
0
1
1
0
0
1
0
0
0
0
0
0 0
0
0
0
0
0
0
0
0
1
Eagle
Ford
0 0
0 0
Honda 0 0
Mitsbusihi 1 0
Volkswagen 0 0
0
0
0
0
0
1 0
1 0
0 1
0 0
1 0
0
0
0
0
0
9
Make.Small <- Car$Make[Car$Type == "Small"] summary(Car$Mileage) # gives summary statistics
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.00 22.25 25.50 25.90 31.25 33.00
10
Rank, Sorting and Order
Price<- scan()
1: 325 201 157 162 164 101 211 188 95 117 188 121
13:
Read 12 items ranks<-rank(Price) sorted<-sort(Price) ordered<-order(Price) # position view<-data.frame(Price, ranks, sorted, ordered) view
Price ranks sorted ordered
1 325 12.0 95 9
2 201 10.0 101 6
3 157 5.0 117 10
4 162 6.0 121 12
5 164 7.0 157 3
6 101 2.0 162 4
7 211 11.0 164 5
8 188 8.5 188 8
9 95 1.0 188 11
10 117 3.0 201 2
11 188 8.5 211 7
12 121 4.0 325 1 11
Evaluating Functions with apply , sapply and lapply apply( arr, margin, fct )
Applies the function fct along some dimensions of the array arr , according to margin , and returns a vector or array of the appropriate size.
The apply function is used for applying functions to the rows or columns of matrices or dataframes.
For example: apply
(X<-matrix(1:24,nrow=4))
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] 1 5 9 13 17 21
[2,] 2 6 10 14 18 22
[3,] 3 7 11 15 19 23
[4,] 4 8 12 16 20 24 apply(X,1,sum) # to obtain the row total
[1] 66 72 78 84 apply(X,2,sum) # to obtain the column totals (six of them):
[1] 10 26 42 58 74 90 apply(X,1,sqrt) apply(X,2,sqrt) apply(X,1,sample) apply(X,1,function(x) x^2+x)
13
Vector and sapply
If you want to apply a function to a vector then use sapply (rather than apply for matrices or margins of matrices). Here is the code to generate a list of sequences from 1:3 up to 1:7 sapply(3:7, seq)
[[1]]
[1] 1 2 3
[[2]]
[1] 1 2 3 4
[[3]]
[1] 1 2 3 4 5
[[4]]
[1] 1 2 3 4 5 6
[[5]]
[1] 1 2 3 4 5 6 7
The function sapply is most useful with complex iterative calculations.
14
Example: sapply a<-seq(0.01,0.2,.005)
Now we can use sapply to apply the sum of squares function for each of these values of a (without writing a loop), and plot the deviance against the parameter value for a: sumsq<- function(x) {sum(x^2)} # function that produce sum of squares plot(a, sapply(a, sumsq), type="l")
15
Lists and lapply lapply( li, fct ) # t o each element of the list li , the function fct is applied.
a<-c("a","b","c","d") b<-c(1,2,3,4,4,3,2,1) c<-c(T,T,F) list.object<-list(a,b,c) # create a list object class(list.object) # to see the class type
[1] "list" list.object # to see the contents of the list we just type its name:
[[1]]
[1] "a" "b" "c" "d"
[[2]]
[1] 1 2 3 4 4 3 2 1
[[3]]
[1] TRUE TRUE FALSE
The function lapply applies a specified function to each of the elements of a list in turn (without the need for specifying a loop, and not requiring us to know how many elements there are in the list).
16
Lists and lapply
#To know the length of each of the vectors making up the list: lapply(list.object, length)
#To find out class, we apply the function class to the list: lapply(list.object, class)
#To find mean lapply(list.object, mean)
17
Working with Vectors and Logical Subscripts
Take the example of a vector containing the 11 numbers 0 to 10: x<-0:10
There are two quite different kinds of things we might want to do with this. We might want to add up the values of the elements: sum(x) # adds up the values of the xs
[1] 55
Alternatively, we might want to count the elements that passed some logical criterion. Suppose we wanted to know how many of the values were less than 5: sum(x<5) # counts up the number of cases that pass the logical
# condition ‘x is less than 5’
[1] 5
18
When we counted the number of cases, the counting was applied to the entire vector, using sum(x<5). To find the sum of the values of x that are less than 5: sum(x[x<5])
[1] 10
The logical condition x<5 is either true or false: x<5
[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
[10] FALSE FALSE
Imagine false as being numeric 0 and true as being numeric 1. Then the vector of subscripts [x<5] is five 1s followed by six 0s:
1*(x<5)
[1] 1 1 1 1 1 0 0 0 0 0 0
Now imagine multiplying the values of x by the values of the logical vector x*(x<5)
[1] 0 1 2 3 4 0 0 0 0 0 0
When the function sum is applied, it gives us the answer we want: the sum of the values of the numbers 0+1+2+3+4=10.
sum(x*(x<5))
[1] 10
This produces the same answer as sum(x[x<5]) 19
Addresses within Vectors
There are two important functions for finding addresses within arrays. The function which is very easy to understand. y<-c(8,3,5,7,6,6,8,9,2,3,9,4,10,4,11) y
[1] 8 3 5 7 6 6 8 9 2 3 9 4 10 4 11
Suppose we wanted to know which elements of y contained values bigger than 5 . We type which(y>5)
[1] 1 4 5 6 7 8 11 13 15
Notice that the answer to this enquiry is a set of subscripts . We don’t use subscripts inside the which function itself. The function is applied to the whole array. To see the values of y that are larger than 5 , we just type y[y>5]
[1] 8 7 6 6 8 9 9 10 11
Note that this is a shorter vector than y itself, because values of 5 or less have been left out: length(y)
[1] 15 length(y[y>5])
[1] 9
20
Finding Closest Values
Finding the value in a vector that is closest to a specified value is straightforward using which. Here, we want to find the value of xv that is closest to 108.0: which(abs(y-8.9)==min(abs(y-8.9)))
[1] 8 11
The closest value to 108.0 is in location 332. But just how close to 8.9 is this
8 th and 11 th value? We use 8 and 11 as a subscript on y to find this out y[c(8,11)]
[1] 9 9
Thus, we can write a function to return the closest value to a specified value sv closest<-function(xv, sv){ xv[which(abs(xv-sv)==min(abs(xv-sv)))] } and run it like this: closest(y,8.9)
[1] 9 9
21
The sample Function y<- scan()
[1] 8 3 5 7 6 6 8 9 2 3 9 4 10 4 11 and here are two samples of y: sample(y)
[1] 8 8 9 9 2 10 6 7 3 11 5 4 6 3 4 sample(y)
[1] 9 3 9 8 8 6 5 11 4 6 4 7 3 2 10
The order of the values is different each time that sample is invoked, but the same numbers are shuffled in every case. This is called sampling without replacement .
sample(y, 5)
[1] 9 4 10 8 11 sample(y, 5)
[1] 9 3 4 2 8
The option replace=T allows for sampling with replacement , sample(y, replace=T)
[1] 9 6 11 2 9 4 6 8 8 4 4 4 3 9 3
22
Some general programming guidelines
Writing a computer program to solve a problem can usually be reduced by following this sequence of steps:
1 Understand the problem.
2 Work out a general idea how to solve it.
3 Translate your general idea into a detailed implementation.
4 Check: Does it work?
Is it good enough?
If yes, you are done!
If no, go back to step 2.
23
24
The for() statement allows one to specify that a certain operation should be repeated a fixed number of times.
Syntax
for (variable in sequence) expression
Or,
}
for (variable in sequence) { expression expression expression
25
{ sum.x <- 0 for (i in 1:5) sum.x <- sum.x + i print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5 print() # Prints a single R object cat() # Prints multiple objects,
# one after the other sum.x
[1] 15
26
The Fibonacci sequence is a famous sequence in mathematics. The first two elements are defined as [1, 1]. Subsequent elements are defined as the sum of the preceding two elements. For example, the third element is 2 (= 1+1), the fourth element is 3 (= 1+2), the fifth element is 5 (= 2+3), and so on.
To obtain the first 12 Fibonacci numbers in R, we can use
Fibonacci <- numeric(12) # numeric array of size 12
Fibonacci[1] <- Fibonacci[2] <- 1 for (i in 3:12) Fibonacci[i] <- Fibonacci[i - 2] + Fibonacci[i - 1]
To see all 12 values, type in
Fibonacci
[1] 1 1 2 3 5 8 13 21 34 55 89 144
27
What are the outputs of the following statement?
for (x in 1:10) print(sqrt(x))
It prints the square root of the integers one to ten
28
The if() statements:
The if() statement allows us to control which statements are executed
Syntax if (condition) {commands when TRUE}
Or, if (condition) {commands when TRUE} else {commands when FALSE}
That is, if ( expr1 ) expr2 else expr3 where expr1 must evaluate to a single logical value and the result of the entire expression is then evident.
29
Entry
False
Expression
True (1)
Statement 1
Exit
30
Entry
True (1)
Body of if
Test
Expression
False (0)
Body of else
Exit
31
A simple example: x <- 3 if (x > 2) y <- 2 * x else y <- 3 * x
Since x > 2 is TRUE, y is assigned 2 * 3 = 6.
If it hadn’t been true, y would have been assigned the value of 3 * x.
32
Sometimes we want to repeat statements, but the pattern of repetition isn’t known in advance. We need to do some calculations and keep going as long as a condition holds. The while() statement accomplishes this.
Syntax while (condition) {statements}
The condition is evaluated, and if it evaluates to FALSE, nothing more is done. If it evaluates to TRUE the statements are executed, condition is evaluated again, and the process is repeated.
33
Example: while loop
Suppose we want to list all Fibonacci numbers less than 300. We don’t know beforehand how long this list is, so we wouldn’t know how to stop the for()loop at the right time, but a while()loop is perfect:
Fib1 <- 1
Fib2 <- 1
Fibonacci <- c(Fib1, Fib2)
}
{ while (Fib2 < 300)
Fibonacci <- c(Fibonacci, Fib2) oldFib2 <- Fib2
Fib2 <- Fib1 + Fib2
Fib1 <- oldFib2
To see the final result of the computation, type
Fibonacci
[1] 1 1 1 2 3 5 8 13 21 34 55 89 144 233
34
35
36
The repeat loop, and the break and next statements
Sometimes we don’t want a fixed number of repetitions of a loop, and we don’t want to put the test at the top of the loop the way it is in a while() loop. In this situation we can use a repeat loop. This loop repeats until we execute a break statement.
Syntax repeat { statements }
This causes the statements to be repeated endlessly. The statements should normally include a break statement, typically in the form if (condition) break but this is not a requirement of the syntax.
The break statement causes the loop to terminate immediately. Break statements can also be used in for() and while() loops. The next statement causes control to return immediately to the top of the loop; it can also be used in any loop.
The repeat loop and the break and next statements are used relatively infrequently.
37
}
{
We can repeat the Newton’s algorithm example from the previous example using a repeat loop:
#****** Prog Using while x <- x0<- .5
x <- x0<- .5
tolerance <- 0.000001
f <- x^3 + 2 * x^2 - 7 repeat tolerance <- 0.000001
while (abs(f) > tolerance) f <- x^3 + 2 * x^2 - 7
{ if (abs(f) < tolerance) break f.prime <- 3 * x^2 + 4 * x f.prime <- 3 * x^2 + 4 * x x <- x - f / f.prime
x <- x - f / f.prime
f <- x^3 + 2 * x^2 - 7
} x x
This version removes the need to duplicate the line that calculates f.
38
A function is defined by an assignment of the form
> name <- function(arg_1, arg_2, ...) expression
The expression is an R expression, (usually a grouped expression), that uses the arguments, arg_i, to calculate a value. The value of the expression is the value returned for the function.
A call to the function then usually takes the form
> name(expr_1, expr_2, ...) and may occur anywhere a function call is legitimate.
39
Example 1: Write a function to compute standard deviation
} sd <- function(x){ sqrt(var(x))
If X = 9, 5, 2, 3, 7; type x <- c(9,5,2,3,7) sd(x)
[1] 2.863564
Exercise: Calculate the coefficient of variation as the standard deviation of a variable, after dividing by its mean.
40
insects<-c(1,10,1000,10,1) mean(insects)
[1] 204.4
To calculate a geometric mean by finding the antilog (exp) of the average of the logarithms (log) of the data: exp(mean(log(insects)))
[1] 10
So a function to calculate geometric mean of a vector of numbers x: geometric<-function (x) {exp(mean(log(x)))} and testing it with the insect data geometric(insects)
[1] 10
41
42
Example 3: Calculation of Grade point and Letter grade from score or marks.
grade <- function (s) { n <- length(s) gp <- matrix(0, nrow = n, ncol = 1) # gp means Grade Point lg <- matrix(0, nrow = n, ncol = 1) # lg means Letter Grade for (i in 1:n) {
} if (s[i] < 40){ gp[i] = 0.00; lg[i]= "F"
} else if (s[i] >= 40 && s[i] < 45){ gp[i] = 2.00; lg[i] = "D"
43
else if (s[i] >= 45 && s[i] < 50){ gp[i] = 2.25; lg[i] = "C"
} else if (s[i] >= 50 && s[i] < 55){ gp[i] = 2.50; lg[i] = "C+"
} else if (s[i] >= 55 && s[i] < 60){
} gp[i] = 2.75; lg[i] = "B-" else if (s[i] >= 60 && s[i] < 65){
}
} gp[i] = 3.00; lg[i] = "B" else if (s[i] >= 65 && s[i] < 70){ gp[i] = 3.25; lg[i] = "B+"
44
else if (s[i] >= 70 && s[i] < 75){ gp[i] = 3.50; lg[i] = "A-"
} else if (s[i] >= 75 && s[i] < 80){
} gp[i] = 3.75; lg[i] = "A" else{
}
} gp[i] = 4.00; lg[i] = "A+"
} # end of for loop return(list(Grade.Point = gp, Letter.Grade = lg))
# end of function score <- c(80, 45, 55, 90, 75, 38, 62) result <- grade(score) result
45
Example 4: Write a function to calculate the two sample tstatistic. This is an artificial example. The function is defined as follows:
} twosam <- function(y1, y2) { n1 <- length(y1); n2 <- length(y2) yb1 <- mean(y1); yb2 <- mean(y2) s1 <- var(y1); s2 <- var(y2) s <- ((n1-1)*s1 + (n2-1)*s2)/(n1+n2-2) tst <- (yb1 - yb2)/sqrt(s*(1/n1 + 1/n2)) deg.free <- n1+n2-2 return(list(test.stat = tst, df=deg.free))
46
If, x <- c(37, 29, 35, 28, 24, 36, 40, 37, 33, 28, 39) y <- c(22, 32, 27, 30, 24, 34, 32, 20, 24, 25, 28, 26, 26)
Then twosam(x, y)
$test.stat
[1] 3.307523
$df
[1] 22
47
Example 4: Maximum likelihood estimation
(Gamma distribution as an example)
The pdf of Gamma distribution f ( x ;
,
)
(
) e
x x
1
,
The likelihood and log-likelihood are x
0
L ( x
1
, x
2
, , x n
;
,
)
i n
1 f ( x i
;
,
)
(
)
n
i n
1 x i
1 e
i n
1 x i log( L )
n
log(
)
n log
(
)
(
1 ) i n
1 log( x i
)
i n
1 x i
48
mle() allows to estimate parameters by maximum likelihood method using iterative methods of numerical calculus to minimize the negative loglikelihood (which is the same of maximizing the loglikelihood).
This requires to specify the negative log-likelihood analytical expression as argument and giving some starting parameters estimates.
49
x.gam <- rgamma(200, rate=0.5, shape=3.5)
# sample size n=200 from a gamma distribution with
# λ=0.5 (scale parameter) and α=3.5 (shape parameter) library(stats4) # loading package stats4 for mle() logL <- function(lambda, alfa) { n <-200 x <- x.gam
temp1 <- -n*alfa*log(lambda)+n*log(gamma(alfa)) temp2 <- -(alfa-1)*sum(log(x))+lambda*sum(x) temp1+temp2 # -log-likelihood function
} est <- mle(minuslog =logL, start =list(lambda =2, alfa =1))
50
summary(est)
Maximum likelihood estimation
Call: mle(minuslogl = logL, start = list(lambda = 2, alfa = 1))
Coefficients:
Estimate Std. Error lambda 0.5350503 0.05485034
alfa 3.8613209 0.37065500
-2 log L: 1051.109
51
52