Note-1 - Department of Statistics | Rajshahi University

advertisement

Training on R for Students

H igher E ducation Q uality E nhancement P roject (HEQEP)

Software Training Program

(An Open Source Package)

Date: March 22-23, 2013

Organized by

Department of Statistics

Rajshahi University, Rajshahi-6205, Bangladesh.

Lecture 3

Programming with R

2

Matrices

X<-matrix(rpois(20,1.5),nrow=4) # rpois() is random sample from Poisson dist

X

[,1] [,2] [,3] [,4] [,5]

[1,] 1 0 2 5 3

[2,] 1 1 3 1 3

[3,] 3 1 0 2 2

[4,] 1 0 2 1 0

Suppose that the rows refer to four different trials and we want to label the rows ‘Trial.1’ etc . We employ the function rownames to do this. We could use the paste function but here we take advantage of the prefix option: rownames(X)<-rownames(X,do.NULL=FALSE, prefix="Trial.")

X

[,1] [,2] [,3] [,4] [,5]

Trial.1 1 0 2 5 3

Trial.2 1 1 3 1 3

Trial.3 3 1 0 2 2

Trial.4 1 0 2 1 0

3

Matrices

For the columns we want to supply a vector of different names for the five drugs involved in the trial, and use this to specify the colnames(X): drug.names<-c("aspirin", "paracetamol", "nurofen", "hedex", "placebo") colnames(X)<-drug.names

X

Trial.1

Trial.2

Trial.3

Trial.4 aspirin paracetamol nurofen hedex placebo

1

1

3

1

0

1

1

0

2

3

0

2

5

1

2

1 0

3

3

2

Alternatively, you can use the dimnames function to give names to the rows and/or columns of a matrix. In this example we want the rows to be unlabelled (NULL) and the column names to be of the form ‘drug.1’, ‘drug.2’, etc. dimnames(X)<-list(NULL, paste("drug.",1:5,sep=""))

X

[1,]

[2,]

[3,]

[4,] drug.1 drug.2 drug.3 drug.4 drug.5

1 0 2 5 3

1 1 3 1 3

3

1

1

0

0

2

2

1

2

0

4

Making data frames (1)

We illustrate how to construct a data frame from the following car data.

Make Model

Honda Civic

Chevrolet Beretta

Ford Escort

Eagle Summit

Volkswagen Jetta

Buick Le Sabre

Mitsubishi Galant

Dodge

V4

Grand Caravan V6

Chrysler New Yorker V6

Acura Legend V6

V4

V4

V6

Cylinder Weight Mileage Type

V4

V4

V4

2170

2655

2345

33

26

33

Sporty

Compact

Small

2560

2330

3325

33

26

23

Small

Small

Large

2745

3735

3450

3265

25

18

22

20

Compact

Van

Medium

Medium

5

Making data frames (2)

> Make <- c("Honda","Chevrolet","Ford","Eagle","Volkswagen","Buick","Mitsbusihi",

+ "Dodge","Chrysler","Acura")

> Model <- c("Civic","Beretta","Escort","Summit","Jetta","Le Sabre","Galant",

+ "Grand Caravan","New Yorker","Legend")

> Cylinder <-c (rep("V4",5),"V6","V4",rep("V6",3))

> Weight <- c(2170, 2655, 2345, 2560, 2330, 3325, 2745, 3735, 3450, 3265)

> Mileage <- c(33, 26, 33, 33, 26, 23, 25, 18, 22, 20)

> Type <- c("Sporty","Compact",rep("Small",3),"Large","Compact","Van",

+ rep("Medium",2))

# rep("V4",5) instructs R to repeat V4 five times.

6

Making data frames (3)

Now data.frame() function combines the six vectors into a single data frame.

Car <- data.frame(Make,Model,Cylinder,Weight,Mileage,Type)

Car

Make

1 Honda

Model

Civic

2 Chevrolet Beretta

3 Ford Escort

4 Eagle Summit

5 Volkswagen Jetta

V4

V4

V4

V4

6 Buick Le Sabre

7 Mitsubishi Galant

8 Dodge

V6

V4

Grand Caravan V6

Cylinder Weight Mileage Type

V4 2170 33 Sporty

2655 26

2345 33

Compact

Small

2560

2330

3325

2745

3735

33

26

23

25

18

Small

Small

Large

Compact

Van

9 Chrysler New Yorker V6

10 Acura Legend V6

3450 22

3265 20

Medium

Medium

7

Few Operations in data frame Car (1)

names(Car)

[1] "Make" "Model" "Cylinder“ "Weight" "Mileage" "Type"

Car[1,]

Make Model Cylinder Weight Mileage Type

1 Honda Civic V4 2170 33 Sporty

Car[10,4]

[1] 3265

Car$Mileage

[1] 33 26 33 33 26 23 25 18 22 20 mean(Car$Mileage) #average mileage of the 10 vehicles

[1] 25.9 min(Car$Weight)

[1] 2170

# minimum of car weights

8

Few Operations in data frame Car (2)

table(Car$Type) # gives a frequency table

Compact Large Medium Small Sporty Van

2 1 2 3 1 1

table(Car$Make, Car$Type) # Cross tabulation

Compact Large Medium Small Sporty Van

Acura

Buick

0

0

Chevrolet 1 0

Chrysler 0 0

Dodge 0 0

0

1

1

0

0

1

0

0

0

0

0

0 0

0

0

0

0

0

0

0

0

1

Eagle

Ford

0 0

0 0

Honda 0 0

Mitsbusihi 1 0

Volkswagen 0 0

0

0

0

0

0

1 0

1 0

0 1

0 0

1 0

0

0

0

0

0

9

Making data frames (6)

Make.Small <- Car$Make[Car$Type == "Small"] summary(Car$Mileage) # gives summary statistics

Min. 1st Qu. Median Mean 3rd Qu. Max.

18.00 22.25 25.50 25.90 31.25 33.00

10

Rank, Sorting and Order

Price<- scan()

1: 325 201 157 162 164 101 211 188 95 117 188 121

13:

Read 12 items ranks<-rank(Price) sorted<-sort(Price) ordered<-order(Price) # position view<-data.frame(Price, ranks, sorted, ordered) view

Price ranks sorted ordered

1 325 12.0 95 9

2 201 10.0 101 6

3 157 5.0 117 10

4 162 6.0 121 12

5 164 7.0 157 3

6 101 2.0 162 4

7 211 11.0 164 5

8 188 8.5 188 8

9 95 1.0 188 11

10 117 3.0 201 2

11 188 8.5 211 7

12 121 4.0 325 1 11

Evaluating Functions with apply , sapply and lapply apply( arr, margin, fct )

Applies the function fct along some dimensions of the array arr , according to margin , and returns a vector or array of the appropriate size.

The apply function is used for applying functions to the rows or columns of matrices or dataframes.

For example: apply

(X<-matrix(1:24,nrow=4))

[,1] [,2] [,3] [,4] [,5] [,6]

[1,] 1 5 9 13 17 21

[2,] 2 6 10 14 18 22

[3,] 3 7 11 15 19 23

[4,] 4 8 12 16 20 24 apply(X,1,sum) # to obtain the row total

[1] 66 72 78 84 apply(X,2,sum) # to obtain the column totals (six of them):

[1] 10 26 42 58 74 90 apply(X,1,sqrt) apply(X,2,sqrt) apply(X,1,sample) apply(X,1,function(x) x^2+x)

13

Vector and sapply

If you want to apply a function to a vector then use sapply (rather than apply for matrices or margins of matrices). Here is the code to generate a list of sequences from 1:3 up to 1:7 sapply(3:7, seq)

[[1]]

[1] 1 2 3

[[2]]

[1] 1 2 3 4

[[3]]

[1] 1 2 3 4 5

[[4]]

[1] 1 2 3 4 5 6

[[5]]

[1] 1 2 3 4 5 6 7

The function sapply is most useful with complex iterative calculations.

14

Example: sapply a<-seq(0.01,0.2,.005)

Now we can use sapply to apply the sum of squares function for each of these values of a (without writing a loop), and plot the deviance against the parameter value for a: sumsq<- function(x) {sum(x^2)} # function that produce sum of squares plot(a, sapply(a, sumsq), type="l")

15

Lists and lapply lapply( li, fct ) # t o each element of the list li , the function fct is applied.

a<-c("a","b","c","d") b<-c(1,2,3,4,4,3,2,1) c<-c(T,T,F) list.object<-list(a,b,c) # create a list object class(list.object) # to see the class type

[1] "list" list.object # to see the contents of the list we just type its name:

[[1]]

[1] "a" "b" "c" "d"

[[2]]

[1] 1 2 3 4 4 3 2 1

[[3]]

[1] TRUE TRUE FALSE

The function lapply applies a specified function to each of the elements of a list in turn (without the need for specifying a loop, and not requiring us to know how many elements there are in the list).

16

Lists and lapply

#To know the length of each of the vectors making up the list: lapply(list.object, length)

#To find out class, we apply the function class to the list: lapply(list.object, class)

#To find mean lapply(list.object, mean)

17

Working with Vectors and Logical Subscripts

Take the example of a vector containing the 11 numbers 0 to 10: x<-0:10

There are two quite different kinds of things we might want to do with this. We might want to add up the values of the elements: sum(x) # adds up the values of the xs

[1] 55

Alternatively, we might want to count the elements that passed some logical criterion. Suppose we wanted to know how many of the values were less than 5: sum(x<5) # counts up the number of cases that pass the logical

# condition ‘x is less than 5’

[1] 5

18

When we counted the number of cases, the counting was applied to the entire vector, using sum(x<5). To find the sum of the values of x that are less than 5: sum(x[x<5])

[1] 10

The logical condition x<5 is either true or false: x<5

[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE

[10] FALSE FALSE

Imagine false as being numeric 0 and true as being numeric 1. Then the vector of subscripts [x<5] is five 1s followed by six 0s:

1*(x<5)

[1] 1 1 1 1 1 0 0 0 0 0 0

Now imagine multiplying the values of x by the values of the logical vector x*(x<5)

[1] 0 1 2 3 4 0 0 0 0 0 0

When the function sum is applied, it gives us the answer we want: the sum of the values of the numbers 0+1+2+3+4=10.

sum(x*(x<5))

[1] 10

This produces the same answer as sum(x[x<5]) 19

Addresses within Vectors

There are two important functions for finding addresses within arrays. The function which is very easy to understand. y<-c(8,3,5,7,6,6,8,9,2,3,9,4,10,4,11) y

[1] 8 3 5 7 6 6 8 9 2 3 9 4 10 4 11

Suppose we wanted to know which elements of y contained values bigger than 5 . We type which(y>5)

[1] 1 4 5 6 7 8 11 13 15

Notice that the answer to this enquiry is a set of subscripts . We don’t use subscripts inside the which function itself. The function is applied to the whole array. To see the values of y that are larger than 5 , we just type y[y>5]

[1] 8 7 6 6 8 9 9 10 11

Note that this is a shorter vector than y itself, because values of 5 or less have been left out: length(y)

[1] 15 length(y[y>5])

[1] 9

20

Finding Closest Values

Finding the value in a vector that is closest to a specified value is straightforward using which. Here, we want to find the value of xv that is closest to 108.0: which(abs(y-8.9)==min(abs(y-8.9)))

[1] 8 11

The closest value to 108.0 is in location 332. But just how close to 8.9 is this

8 th and 11 th value? We use 8 and 11 as a subscript on y to find this out y[c(8,11)]

[1] 9 9

Thus, we can write a function to return the closest value to a specified value sv closest<-function(xv, sv){ xv[which(abs(xv-sv)==min(abs(xv-sv)))] } and run it like this: closest(y,8.9)

[1] 9 9

21

The sample Function y<- scan()

[1] 8 3 5 7 6 6 8 9 2 3 9 4 10 4 11 and here are two samples of y: sample(y)

[1] 8 8 9 9 2 10 6 7 3 11 5 4 6 3 4 sample(y)

[1] 9 3 9 8 8 6 5 11 4 6 4 7 3 2 10

The order of the values is different each time that sample is invoked, but the same numbers are shuffled in every case. This is called sampling without replacement .

sample(y, 5)

[1] 9 4 10 8 11 sample(y, 5)

[1] 9 3 4 2 8

The option replace=T allows for sampling with replacement , sample(y, replace=T)

[1] 9 6 11 2 9 4 6 8 8 4 4 4 3 9 3

22

Some general programming guidelines

Writing a computer program to solve a problem can usually be reduced by following this sequence of steps:

1 Understand the problem.

2 Work out a general idea how to solve it.

3 Translate your general idea into a detailed implementation.

4 Check: Does it work?

Is it good enough?

If yes, you are done!

If no, go back to step 2.

23

Flow control

1. The for() statement

2. The if() statement

3. The while() loop

4. The repeat loop, and the break and next statements

24

Flow control

The for statement (1)

The for() statement allows one to specify that a certain operation should be repeated a fixed number of times.

Syntax

for (variable in sequence) expression

Or,

}

for (variable in sequence) { expression expression expression

25

Example: The for statement

{ sum.x <- 0 for (i in 1:5) sum.x <- sum.x + i print(i)

}

[1] 1

[1] 2

[1] 3

[1] 4

[1] 5 print() # Prints a single R object cat() # Prints multiple objects,

# one after the other sum.x

[1] 15

26

Example: for loop

The Fibonacci sequence is a famous sequence in mathematics. The first two elements are defined as [1, 1]. Subsequent elements are defined as the sum of the preceding two elements. For example, the third element is 2 (= 1+1), the fourth element is 3 (= 1+2), the fifth element is 5 (= 2+3), and so on.

To obtain the first 12 Fibonacci numbers in R, we can use

Fibonacci <- numeric(12) # numeric array of size 12

Fibonacci[1] <- Fibonacci[2] <- 1 for (i in 3:12) Fibonacci[i] <- Fibonacci[i - 2] + Fibonacci[i - 1]

To see all 12 values, type in

Fibonacci

[1] 1 1 2 3 5 8 13 21 34 55 89 144

27

The for statement (2)

What are the outputs of the following statement?

for (x in 1:10) print(sqrt(x))

It prints the square root of the integers one to ten

28

Conditional execution (1)

The if() statements:

The if() statement allows us to control which statements are executed

Syntax if (condition) {commands when TRUE}

Or, if (condition) {commands when TRUE} else {commands when FALSE}

That is, if ( expr1 ) expr2 else expr3 where expr1 must evaluate to a single logical value and the result of the entire expression is then evident.

29

Conditional execution: if statement

Entry

False

Expression

True (1)

Statement 1

Exit

30

Conditional execution: if – else statement

Entry

True (1)

Body of if

Test

Expression

False (0)

Body of else

Exit

31

Example: Conditional execution (2)

A simple example: x <- 3 if (x > 2) y <- 2 * x else y <- 3 * x

Since x > 2 is TRUE, y is assigned 2 * 3 = 6.

If it hadn’t been true, y would have been assigned the value of 3 * x.

32

The while() loop

Sometimes we want to repeat statements, but the pattern of repetition isn’t known in advance. We need to do some calculations and keep going as long as a condition holds. The while() statement accomplishes this.

Syntax while (condition) {statements}

The condition is evaluated, and if it evaluates to FALSE, nothing more is done. If it evaluates to TRUE the statements are executed, condition is evaluated again, and the process is repeated.

33

Example: while loop

Suppose we want to list all Fibonacci numbers less than 300. We don’t know beforehand how long this list is, so we wouldn’t know how to stop the for()loop at the right time, but a while()loop is perfect:

Fib1 <- 1

Fib2 <- 1

Fibonacci <- c(Fib1, Fib2)

}

{ while (Fib2 < 300)

Fibonacci <- c(Fibonacci, Fib2) oldFib2 <- Fib2

Fib2 <- Fib1 + Fib2

Fib1 <- oldFib2

To see the final result of the computation, type

Fibonacci

[1] 1 1 1 2 3 5 8 13 21 34 55 89 144 233

34

35

36

The repeat loop, and the break and next statements

Sometimes we don’t want a fixed number of repetitions of a loop, and we don’t want to put the test at the top of the loop the way it is in a while() loop. In this situation we can use a repeat loop. This loop repeats until we execute a break statement.

Syntax repeat { statements }

This causes the statements to be repeated endlessly. The statements should normally include a break statement, typically in the form if (condition) break but this is not a requirement of the syntax.

The break statement causes the loop to terminate immediately. Break statements can also be used in for() and while() loops. The next statement causes control to return immediately to the top of the loop; it can also be used in any loop.

The repeat loop and the break and next statements are used relatively infrequently.

37

Example: repeat, break

}

{

We can repeat the Newton’s algorithm example from the previous example using a repeat loop:

#****** Prog Using while x <- x0<- .5

x <- x0<- .5

tolerance <- 0.000001

f <- x^3 + 2 * x^2 - 7 repeat tolerance <- 0.000001

while (abs(f) > tolerance) f <- x^3 + 2 * x^2 - 7

{ if (abs(f) < tolerance) break f.prime <- 3 * x^2 + 4 * x f.prime <- 3 * x^2 + 4 * x x <- x - f / f.prime

x <- x - f / f.prime

f <- x^3 + 2 * x^2 - 7

} x x

This version removes the need to duplicate the line that calculates f.

38

Writing functions (1)

A function is defined by an assignment of the form

> name <- function(arg_1, arg_2, ...) expression

The expression is an R expression, (usually a grouped expression), that uses the arguments, arg_i, to calculate a value. The value of the expression is the value returned for the function.

A call to the function then usually takes the form

> name(expr_1, expr_2, ...) and may occur anywhere a function call is legitimate.

39

Writing functions (2)

Example 1: Write a function to compute standard deviation

} sd <- function(x){ sqrt(var(x))

If X = 9, 5, 2, 3, 7; type x <- c(9,5,2,3,7) sd(x)

[1] 2.863564

Exercise: Calculate the coefficient of variation as the standard deviation of a variable, after dividing by its mean.

40

Example : Geometric mean as a function

insects<-c(1,10,1000,10,1) mean(insects)

[1] 204.4

To calculate a geometric mean by finding the antilog (exp) of the average of the logarithms (log) of the data: exp(mean(log(insects)))

[1] 10

So a function to calculate geometric mean of a vector of numbers x: geometric<-function (x) {exp(mean(log(x)))} and testing it with the insect data geometric(insects)

[1] 10

41

Writing functions (3)

42

Writing functions (4)

Example 3: Calculation of Grade point and Letter grade from score or marks.

grade <- function (s) { n <- length(s) gp <- matrix(0, nrow = n, ncol = 1) # gp means Grade Point lg <- matrix(0, nrow = n, ncol = 1) # lg means Letter Grade for (i in 1:n) {

} if (s[i] < 40){ gp[i] = 0.00; lg[i]= "F"

} else if (s[i] >= 40 && s[i] < 45){ gp[i] = 2.00; lg[i] = "D"

43

Writing functions (5)

else if (s[i] >= 45 && s[i] < 50){ gp[i] = 2.25; lg[i] = "C"

} else if (s[i] >= 50 && s[i] < 55){ gp[i] = 2.50; lg[i] = "C+"

} else if (s[i] >= 55 && s[i] < 60){

} gp[i] = 2.75; lg[i] = "B-" else if (s[i] >= 60 && s[i] < 65){

}

} gp[i] = 3.00; lg[i] = "B" else if (s[i] >= 65 && s[i] < 70){ gp[i] = 3.25; lg[i] = "B+"

44

Writing functions (6)

else if (s[i] >= 70 && s[i] < 75){ gp[i] = 3.50; lg[i] = "A-"

} else if (s[i] >= 75 && s[i] < 80){

} gp[i] = 3.75; lg[i] = "A" else{

}

} gp[i] = 4.00; lg[i] = "A+"

} # end of for loop return(list(Grade.Point = gp, Letter.Grade = lg))

# end of function score <- c(80, 45, 55, 90, 75, 38, 62) result <- grade(score) result

45

Writing functions (7)

Example 4: Write a function to calculate the two sample tstatistic. This is an artificial example. The function is defined as follows:

} twosam <- function(y1, y2) { n1 <- length(y1); n2 <- length(y2) yb1 <- mean(y1); yb2 <- mean(y2) s1 <- var(y1); s2 <- var(y2) s <- ((n1-1)*s1 + (n2-1)*s2)/(n1+n2-2) tst <- (yb1 - yb2)/sqrt(s*(1/n1 + 1/n2)) deg.free <- n1+n2-2 return(list(test.stat = tst, df=deg.free))

46

Writing functions (8)

If, x <- c(37, 29, 35, 28, 24, 36, 40, 37, 33, 28, 39) y <- c(22, 32, 27, 30, 24, 34, 32, 20, 24, 25, 28, 26, 26)

Then twosam(x, y)

$test.stat

[1] 3.307523

$df

[1] 22

47

Maximum likelihood estimation

Example 4: Maximum likelihood estimation

(Gamma distribution as an example)

The pdf of Gamma distribution f ( x ;

,

)

 

(

) e

  x x

 

1

,

The likelihood and log-likelihood are x

0

L ( x

1

, x

2

,  , x n

;

,

)

 i n 

1 f ( x i

;

,

)



  

(

) 



 n

 i n 

1 x i 

 

1 e

  i n 

1 x i log( L )

 n

 log(

)

 n log

(

)

(

 

1 ) i n 

1 log( x i

)

  i n 

1 x i

48

Maximum likelihood estimation

mle() allows to estimate parameters by maximum likelihood method using iterative methods of numerical calculus to minimize the negative loglikelihood (which is the same of maximizing the loglikelihood).

This requires to specify the negative log-likelihood analytical expression as argument and giving some starting parameters estimates.

49

Maximum likelihood estimation

x.gam <- rgamma(200, rate=0.5, shape=3.5)

# sample size n=200 from a gamma distribution with

# λ=0.5 (scale parameter) and α=3.5 (shape parameter) library(stats4) # loading package stats4 for mle() logL <- function(lambda, alfa) { n <-200 x <- x.gam

temp1 <- -n*alfa*log(lambda)+n*log(gamma(alfa)) temp2 <- -(alfa-1)*sum(log(x))+lambda*sum(x) temp1+temp2 # -log-likelihood function

} est <- mle(minuslog =logL, start =list(lambda =2, alfa =1))

50

Maximum likelihood estimation

summary(est)

Maximum likelihood estimation

Call: mle(minuslogl = logL, start = list(lambda = 2, alfa = 1))

Coefficients:

Estimate Std. Error lambda 0.5350503 0.05485034

alfa 3.8613209 0.37065500

-2 log L: 1051.109

51

Thank You

52

Download