Introduction to R
Lecture 5: More loops
Andrew Jaffe
10/4/2010

Overview
Review: For Loop
Lists
Aside: Patterns
Application

Review: For Loop
The syntax is: for(var in seq) {code}
The seq determines what values var will take in the loop
The loop is performed length(seq) times
On the n’th iteration of the loop, var takes the value seq[n]
var is a completely new variable and is not directly related to any other variable

Review: For Loop
Setting up your loop requires determining the correct seq to loop over: usually easy
The real challenge of looping is relating the values of seq to the dimensions/indices of your data

Review: For Loop
From last lecture: we’re relating seq to the columns of the data
var is indirectly related to the data, as it links to the column indices, but it only takes the values 1-12

    Index = 4:15
    mean_wt <- rep(0, length(Index))
    for(i in 1:length(Index)) {
      ind = Index[i] # column index
      mean_wt[i] = mean(dog_dat[,ind])
    }

Lists
"An R list is an object consisting of an ordered collection of objects known as its components."
"Components are always numbered and may always be referred to as such": double brackets can subset lists
CRAN. Intro to R

Lists

    > L = list() # empty list
    > L[[1]] = 1:4
    > L[[2]] = 2:7
    > L[[3]] = c("a","b","c")
    > L[[4]] = matrix(rnorm(4), nrow = 2)
    > L
    [[1]]
    [1] 1 2 3 4

    [[2]]
    [1] 2 3 4 5 6 7

    [[3]]
    [1] "a" "b" "c"

    [[4]]
                [,1]       [,2]
    [1,] -1.43944849 -0.4801696
    [2,]  0.09923108  1.0783053

Lists

    > names(L) = c("seq1","seq2","letters","mat")
    > L
    $seq1
    [1] 1 2 3 4

    $seq2
    [1] 2 3 4 5 6 7

    $letters
    [1] "a" "b" "c"

    $mat
              [,1]      [,2]
    [1,]  1.824487 0.3431034
    [2,] -0.533006 0.9406285

Lists

    > L[[1]]
    [1] 1 2 3 4
    > str(L)
    List of 4
     $ seq1   : int [1:4] 1 2 3 4
     $ seq2   : int [1:6] 2 3 4 5 6 7
     $ letters: chr [1:3] "a" "b" "c"
     $ mat    : num [1:2, 1:2] 1.824 -0.533 0.343 0.941

Lists
Why know lists?
Can store data of different lengths and types
Some functions return lists

Lists
Load back in the lecture 4 data
We still have one problem to solve - the averages of weight, length, and food for each dog type at each visit

Lists
First we can create a list containing each group we care about

    Indexes = list()
    Indexes[[1]] = 4:15 # weight
    Indexes[[2]] = 16:27 # length
    Indexes[[3]] = 28:39 # food
    names(Indexes) = c("weight", "length", "food")

Lists

    > Indexes
    $weight
     [1]  4  5  6  7  8  9 10 11 12 13 14 15

    $length
     [1] 16 17 18 19 20 21 22 23 24 25 26 27

    $food
     [1] 28 29 30 31 32 33 34 35 36 37 38 39

Lists
Next, we can create an output list for our results, and recreate the unique dog list for our loop

    out <- list()
    dogs = unique(dog_dat$dog_type)

Lists
We want to loop over the different covariates (wt, len, food) and within each, the different dog types
For looping over the groups, either works:

    > seq(along = Indexes)
    [1] 1 2 3
    > 1:length(Indexes)
    [1] 1 2 3

Lists

    for(i in seq(along = Indexes)) { # 1:3
      # take the i'th index from the list
      Index = Indexes[[i]]
      # for that variable, create a new matrix
      tmp = matrix(nrow = length(dogs), ncol = length(Index))
      ...
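As a side sketch of the list-and-loop setup built so far, the toy example below stores index vectors of different lengths in a named list and loops over them. Everything in it is an assumption made for illustration: the data frame `toy` holds made-up values (it is not the lecture's dog_dat), and the names `idx` and `res` are hypothetical. It uses seq_along(), which gives the same 1, 2, ... sequence as seq(along = ...) here.

```r
# Toy data standing in for dog_dat (made-up values, not the lecture data)
toy <- data.frame(
  id    = 1:4,
  wt_1  = c(10, 12, 14, 16),
  wt_2  = c(11, 13, 15, 17),
  len_1 = c(20, 22, 24, 26),
  len_2 = c(21, 23, 25, 27),
  len_3 = c(22, 24, 26, 28)
)

# A list can hold index vectors of different lengths
idx <- list(weight = 2:3, length = 4:6)

# Loop over the groups and store one result per group in an output list
res <- list()
for (i in seq_along(idx)) {
  cols <- idx[[i]]                   # i'th component, as a plain vector
  res[[i]] <- colMeans(toy[, cols])  # column means for that group
}
names(res) <- names(idx)
res
```

The two components of `res` have different lengths (two weight columns, three length columns), which is the kind of result a list can hold but a single matrix cannot.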
Lists
We can then fill in that temporary matrix with an inner 'for' loop
Note that this is the exact same loop as last week (note the j's); Index comes from the outer loop

      for(j in 1:length(dogs)) {
        hold = dog_dat[dog_dat$dog_type == dogs[j],Index]
        tmp[j,] = colMeans(hold)
      }
      rownames(tmp) = dogs
      colnames(tmp) = paste("month",1:12,sep="_")

Lists
Lastly, we save that tmp matrix in our output list:

      out[[i]] = tmp

The complete loop:

    for(i in seq(along = Indexes)) { # groups
      Index = Indexes[[i]]
      tmp = matrix(nrow = length(dogs), ncol = length(Index))
      for(j in 1:length(dogs)) { # dogs
        hold = dog_dat[dog_dat$dog_type == dogs[j],Index]
        tmp[j,] = colMeans(hold)
      }
      rownames(tmp) = dogs
      colnames(tmp) = paste("month",1:12,sep="_")
      out[[i]] = tmp
    }
    names(out) <- c("weight","length","food")

    > out
    $weight
               month_1  month_2  month_3  month_4  month_5  month_6  month_7
    lab       49.81840 48.69200 49.03360 50.26560 50.17600 49.67280 48.41600
    poodle    49.40090 48.27297 48.61892 49.84414 49.76126 49.25856 47.99820
    husky     49.26372 48.13097 48.48142 49.70088 49.61858 49.11327 47.86195
    retriever 50.19474 49.06466 49.40602 50.62632 50.54361 50.04135 48.79248
               month_8  month_9 month_10 month_11 month_12
    lab       46.54640 44.68640 45.15040 44.30640 45.88240
    poodle    46.12613 44.26577 44.73243 43.89009 45.46306
    husky     45.98761 44.12832 44.59469 43.75221 45.31858
    retriever 46.91278 45.05263 45.51654 44.68496 46.24586

    $length
               month_1  month_2  month_3  month_4  month_5  month_6  month_7
    lab       19.91840 20.16800 20.28720 20.49600 20.57840 20.86400 20.96800
    poodle    20.63964 20.88198 21.00090 21.20991 21.29189 21.58108 21.68198
    husky     20.29115 20.54159 20.65575 20.86195 20.94867 21.23805 21.34071
    retriever 20.47068 20.71955 20.83233 21.04135 21.12556 21.41729 21.51880
               month_8  month_9 month_10 month_11 month_12
    lab       21.10400 21.20880 21.40720 21.57440 21.87440
    poodle    21.82072 21.92432 22.12342 22.29009 22.58919
    husky     21.47699 21.58142 21.77876 21.94779 22.24779
    retriever 21.64962 21.75414 21.95263 22.12406 22.42406

    $food
               month_1  month_2  month_3  month_4  month_5  month_6  month_7
    lab       30.04000 29.77200 28.77680 28.20880 29.52240 30.24960 30.90160
    poodle    30.03063 29.76306 28.77117 28.20631 29.51892 30.23874 30.89910
    husky     30.12301 29.85221 28.85841 28.29646 29.60973 30.33363 30.98584
    retriever 29.89248 29.62556 28.63008 28.06617 29.37744 30.10075 30.75564
               month_8  month_9 month_10 month_11 month_12
    lab       29.20880 30.03200 29.89120 29.54240 30.89520
    poodle    29.20631 30.02613 29.88739 29.53243 30.89550
    husky     29.29646 30.11770 29.97345 29.62389 30.98053
    retriever 29.06617 29.88722 29.74887 29.39248 30.75338

Aside
This step is potentially dangerous:

    Indexes[[1]] = 4:15 # weight
    Indexes[[2]] = 16:27 # length
    Indexes[[3]] = 28:39 # food

Is there a better way? YES!
Each group shares a common term in the name: wt, len, food

Aside
grep(pattern, x): returns the indices of the elements of vector x that match "pattern"

    > grep("wt", names(dog_dat))
     [1]  4  5  6  7  8  9 10 11 12 13 14 15
    > grep("len", names(dog_dat))
     [1] 16 17 18 19 20 21 22 23 24 25 26 27
    > grep("food", names(dog_dat))
     [1] 28 29 30 31 32 33 34 35 36 37 38 39

Aside

    > Indexes = list()
    > Indexes[[1]] = grep("wt", names(dog_dat))
    > Indexes[[2]] = grep("len", names(dog_dat))
    > Indexes[[3]] = grep("food", names(dog_dat))
    > Indexes
    [[1]]
     [1]  4  5  6  7  8  9 10 11 12 13 14 15

    [[2]]
     [1] 16 17 18 19 20 21 22 23 24 25 26 27

    [[3]]
     [1] 28 29 30 31 32 33 34 35 36 37 38 39

Aside
grep can be a lot more powerful when combined with 'regular expressions', but we're not going to get into that

Aside
Opposite of paste: strsplit(x, split) splits term 'x' on the 'split' character or pattern
Returns a list:

    > x = paste("month",1:12,sep="_")
    > head(strsplit(x,"_"),3)
    [[1]]
    [1] "month" "1"

    [[2]]
    [1] "month" "2"

    [[3]]
    [1] "month" "3"

Aside
If you want one element (in this case, the number), it is easiest to just use a 'for' loop
If you split each element separately, the output list only has 1 element: [[1]]
You then need to figure out which slot you want using the single bracket

Aside

    x = paste("month",1:12,sep="_")
    num = rep(0,length(x))
    for(i in 1:length(x)) {
      num[i] = strsplit(x[i],"_")[[1]][2] # stored as character
    }

    > i = 1
    > strsplit(x[i],"_") # list
    [[1]]
    [1] "month" "1"

    > strsplit(x[i],"_")[[1]] # vector
    [1] "month" "1"
    > strsplit(x[i],"_")[[1]][2] # element
    [1] "1"

Applied Example
Load in "lec5_data.rda" from the course website
These are the people from "lec2_data.rda" that did not have a dog at baseline
Over monthly follow-up, some of these people reported borrowing a dog over the past month

Applied Example
dog_0: baseline dog ownership; all of these people should have "no"
dog_1 - dog_12: did you borrow a dog over the past month?

Applied Example
Determine person-time at risk for dog borrowing
Create a "survival" dataset from this data with columns: ID, start, end
Note that there is missing data…

Applied Example
We want to convert each person's wide data into two numbers: start and end
Because of missing data, some people might have more than 1 row: people aren't at risk for dog borrowing if they did not report (i.e., are missing)

Applied Example
Take person 1:

    > dat[1,]
      id age sex height weight dog_0 dog_1 dog_2
    1  1  40   F   63.5  134.5    no    no   yes
      dog_3 dog_4 dog_5 dog_6 dog_7 dog_8 dog_9
    1   yes    no    no   yes   yes    no   yes
      dog_10 dog_11 dog_12
    1   <NA>     no     no

Applied Example
Person 1 in the new dataset should be:

    ID start end
     1     0   9
     1    11  12

Applied Example
Basic premise: write a for-loop that passes over each person and determines their non-missing follow-up time
Caveat: how many rows do we make our output matrix?
Perfect opportunity for using rbind()…

Applied Example
Create a matrix with 0 rows and 3 columns
Within the body of the loop, use rbind to append new rows (this is slow, though)

    > out = matrix(nrow = 0, ncol = 3)
    > dim(out)
    [1] 0 3
    > p1 = c(1,0,9)
    > out = rbind(out, p1)
    > out
       [,1] [,2] [,3]
    p1    1    0    9

Applied Example

    out = matrix(nrow = 0, ncol = 3)
    cols = grep("dog", names(dat))
    for(i in 1:nrow(dat)) {
      hold = as.numeric(dat[i,cols])
      ...

Applied Example
Here, the follow-up results are factors, which have numerical values:

    > dat[i,cols]
      dog_0 dog_1 dog_2 dog_3 dog_4 dog_5 dog_6
    1    no    no   yes   yes    no    no   yes
      dog_7 dog_8 dog_9 dog_10 dog_11 dog_12
    1   yes    no   yes   <NA>     no     no
    > as.numeric(dat[i,cols])
     [1]  1  1  2  2  1  1  2  2  1  2 NA  1  1

Applied Example
Now a cool little trick: rle(), run length encoding
Computes the lengths and values of runs of equal values in a vector
We're going to combine this with is.na()

Applied Example

    > x = rle(is.na(hold))
    > x
    Run Length Encoding
      lengths: int [1:3] 10 1 2
      values : logi [1:3] FALSE TRUE FALSE

This says that there are 10 FALSE in a row, then 1 TRUE, then 2 FALSE
We need to get this in a better format…

Applied Example

    > x = data.frame(cbind(x$values, x$lengths))
    > names(x) <- c("missing", "length")
    > x
      missing length
    1       0     10
    2       1      1
    3       0      2

Applied Example
cumsum() returns the cumulative sum of a vector

    > x$end <- cumsum(x$length)
    > x$start <- x$end - x$length + 1
    >
    > x
      missing length end start
    1       0     10  10     1
    2       1      1  11    11
    3       0      2  13    12

Applied Example
Note that we actually want all of the values to be one less, since our time starts at 0

    > x$end <- cumsum(x$length) - 1
    > x$start <- x$end - x$length + 1
    > x
      missing length end start
    1       0     10   9     0
    2       1      1  10    10
    3       0      2  12    11

Applied Example
Quick rearrangement:

    > x <- x[,c(1,2,4,3)]
    > x
      missing length start end
    1       0     10     0   9
    2       1      1    10  10
    3       0      2    11  12

Applied Example
We want the last two columns of the nonmissing visits

    > tmp = x[which(x$missing == 0),3:4]
    > tmp
      start end
    1     0   9
    3    11  12

Applied Example
We then want to add a column of the individual ID to the front

    id = dat[i,1]
    tmp = cbind(rep(id,nrow(tmp)), tmp)
    names(tmp)[1] = "ID"

    > tmp
      ID start end
    1  1     0   9
    3  1    11  12

Applied Example
Lastly, bind the tmp matrix to the growing out matrix
This finishes off our loop body

      out = rbind(out,tmp)

The complete loop:

    for(i in 1:nrow(dat)) {
      hold = as.numeric(dat[i,cols])
      x = rle(is.na(hold))
      x = data.frame(cbind(x$values, x$lengths))
      names(x) <- c("missing", "length")
      x$end <- cumsum(x$length) - 1
      x$start <- x$end - x$length + 1
      x <- x[,c(1,2,4,3)]
      tmp = x[which(x$missing == 0),3:4]
      id = dat[i,1]
      tmp = cbind(rep(id,nrow(tmp)), tmp)
      names(tmp)[1] = "ID"
      out = rbind(out,tmp)
    }
    rownames(out) = 1:nrow(out) # cleaning

    > head(out,10)
       ID start end
    1   1     0   9
    2   1    11  12
    3   2     0   5
    4   2     7  12
    5   3     0   2
    6   3     5  12
    7   4     0   3
    8   4     6  12
    9   5     0   0
    10  5     3   8
    > dim(out)
    [1] 1414    3

Applied Example
One last adjustment is needed, since we asked about borrowing a dog in the previous month
The non-0 starts must be reduced by 1, since these are currently indices of visits, not time at risk

      ID start end
    1  1     0   9
    2  1    11  12

becomes

      ID start end
    1  1     0   9
    2  1    10  12

Applied Example
What is the total time at risk of this population?

    > time = out$end - out$start
    > sum(time)
    [1] 4988 # person-months

Applied Example
Save the 'out' matrix as an rda so it can be used next week
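Since growing `out` with rbind() inside the loop was noted as slow, here is a minimal sketch of one common alternative: build one small data frame per person, then bind them all in a single call. It is not the lecture's method. The helper name `followup_intervals`, the objects `toy`, `pieces`, and `res`, and the responses themselves are all assumptions made up for illustration (the first person mirrors person 1 from the slides). The sketch stops before the final adjustment of the non-0 starts.

```r
# Hypothetical helper: non-missing follow-up intervals for one person.
# 'answers' holds that person's responses in visit order (visit 0 first).
followup_intervals <- function(id, answers) {
  runs  <- rle(is.na(answers))
  end   <- cumsum(runs$lengths) - 1   # visits are numbered from 0
  start <- end - runs$lengths + 1
  keep  <- !runs$values               # keep runs of reported (non-missing) visits
  data.frame(ID = rep(id, sum(keep)), start = start[keep], end = end[keep])
}

# Made-up responses for two people (not the lecture data)
toy <- list(
  c("no", "no", "yes", "yes", "no", "no", "yes", "yes", "no", "yes", NA, "no", "no"),
  c("no", NA, NA, "yes", "no")
)

# One data frame per person, then a single rbind over the whole list
pieces <- lapply(seq_along(toy), function(i) followup_intervals(i, toy[[i]]))
res <- do.call(rbind, pieces)
res
```

For the first toy person this gives the same two intervals as the slides (0 to 9 and 11 to 12) before the last adjustment.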