Introduction to R - Lecture 5: More loops

Andrew Jaffe
10/4/2010
Overview
- Review: For Loop
- Lists
- Aside: Patterns
- Application

Review: For Loop
- The syntax is: for(var in seq) {code}
- The seq determines what values var will take in the loop
- The loop is performed length(seq) times
- On the n'th iteration of the loop, var takes the value seq[n]
- var is a completely new variable and is not directly related to any other variable
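The points above can be checked with a tiny loop. This is only an illustrative sketch; the names seq_vals, seen, and n_iter are made up for the example and do not come from the lecture data.

```r
seq_vals <- c(10, 20, 30)   # the sequence to loop over (hypothetical values)
seen <- numeric(0)          # records the value var takes at each iteration
n_iter <- 0                 # counts how many times the body runs

for (var in seq_vals) {
  n_iter <- n_iter + 1
  seen <- c(seen, var)      # on iteration n, var equals seq_vals[n]
}

n_iter   # same as length(seq_vals)
seen     # same values, in the same order, as seq_vals
```

The loop body runs once per element of the sequence, and var simply walks through its values in order.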
Review: For Loop
- Setting up your loop requires determining the correct seq to loop over: this is usually easy
- The real challenge of looping is relating the values of seq to the dimensions/indices of your data

Review: For Loop
- From last lecture: we're relating seq to the columns of the data
- var is only indirectly related to the data: it links to the column indices, but it only takes the values 1-12

Index = 4:15                          # columns holding the 12 monthly weights
mean_wt <- rep(0, length(Index))      # pre-allocate the output vector
for(i in 1:length(Index)) {
  ind = Index[i]                      # column index
  mean_wt[i] = mean(dog_dat[,ind])    # mean weight at visit i
}
Lists
- "An R list is an object consisting of an ordered collection of objects known as its components."
- "Components are always numbered and may always be referred to as such": double brackets, e.g. L[[1]], extract a single component from a list

Source: An Introduction to R (CRAN)
Lists
> L = list() # empty list
> L[[1]] = 1:4
> L[[2]] = 2:7
> L[[3]] = c("a","b","c")
> L[[4]] = matrix(rnorm(4), nrow = 2)
> L
[[1]]
[1] 1 2 3 4
[[2]]
[1] 2 3 4 5 6 7
[[3]]
[1] "a" "b" "c"
[[4]]
            [,1]       [,2]
[1,] -1.43944849 -0.4801696
[2,]  0.09923108  1.0783053
Lists
> names(L) = c("seq1","seq2","letters","mat")
> L
$seq1
[1] 1 2 3 4
$seq2
[1] 2 3 4 5 6 7
$letters
[1] "a" "b" "c"
$mat
          [,1]      [,2]
[1,]  1.824487 0.3431034
[2,] -0.533006 0.9406285
Lists
> L[[1]]
[1] 1 2 3 4
> str(L)
List of 4
 $ seq1   : int [1:4] 1 2 3 4
 $ seq2   : int [1:6] 2 3 4 5 6 7
 $ letters: chr [1:3] "a" "b" "c"
 $ mat    : num [1:2, 1:2] 1.824 -0.533 0.343 0.941
Lists

Why know lists?
- Can store data of different lengths and types
- Some functions return lists
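Both reasons can be seen in a few lines. This is a small sketch with made-up values; strsplit() and rle() are used only as two examples of base R functions whose results are lists.

```r
# A list can hold components of different lengths and types
L <- list(seq1 = 1:4, letters = c("a", "b", "c"))

L[["seq1"]]   # double brackets: the component itself (an integer vector)
L["seq1"]     # single brackets: still a list, of length 1

# Some base R functions return lists (or list-based objects)
parts <- strsplit("month_1", "_")      # a list with one character vector
runs  <- rle(c(TRUE, TRUE, FALSE))     # an "rle" object, built on a list

is.list(parts)
is.list(runs)
```

The difference between `[[ ]]` and `[ ]` matters later in the lecture, whenever a single component has to be pulled out of a list inside a loop.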
Lists
- Load back in the lecture 4 data
- We still have one problem to solve: the averages of weight, length, and food for each dog type at each visit
Lists

First, we can create a list containing the column indices of each group we care about:
Indexes = list()
Indexes[[1]] = 4:15 # weight
Indexes[[2]] = 16:27 # length
Indexes[[3]] = 28:39 # food
names(Indexes) = c("weight", "length", "food")
Lists
> Indexes
$weight
[1]  4  5  6  7  8  9 10 11 12 13 14 15
$length
[1] 16 17 18 19 20 21 22 23 24 25 26 27
$food
[1] 28 29 30 31 32 33 34 35 36 37 38 39
Lists

Next, we can create an output list for our
results, and recreate the unique dog list for
our loop
out <- list()
dogs = unique(dog_dat$dog_type)
Lists
- We want to loop over the different covariates (wt, len, food) and, within each, over the different dog types
- For looping over the groups, either of these works when the list is not empty:

> seq(along = Indexes)
[1] 1 2 3
> 1:length(Indexes)
[1] 1 2 3
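The two forms agree for a non-empty list but differ for an empty one, which is a common reason to prefer seq(along = ...). A short sketch (the object `empty` is introduced here only for illustration):

```r
Indexes <- list(weight = 4:15, length = 16:27, food = 28:39)
seq(along = Indexes)     # 1 2 3
1:length(Indexes)        # 1 2 3

empty <- list()
seq(along = empty)       # integer(0): a loop over this never runs
1:length(empty)          # 1 0: a loop over this would run twice
```

With an empty list, 1:length() counts down from 1 to 0, so the loop body would be executed with indices that do not exist.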
Lists
for(i in seq(along = Indexes)) { # 1:3
  # take the i'th index from the list
  Index = Indexes[[i]]
  # for that variable, create a new matrix
  tmp = matrix(nrow = length(dogs),
               ncol = length(Index))
  ...
Lists
- We can then fill in that temporary matrix with an inner 'for' loop
- Note that this is the exact same loop as last week (note the j's); Index comes from the outer loop:

for(j in 1:length(dogs)) {
  hold = dog_dat[dog_dat$dog_type == dogs[j],Index]
  tmp[j,] = colMeans(hold)
}
rownames(tmp) = dogs
colnames(tmp) = paste("month",1:12,sep="_")
Lists

Lastly, we save that tmp matrix in our output list:
out[[i]] = tmp

The complete loop:
for(i in seq(along = Indexes)) { # groups
  Index = Indexes[[i]]
  tmp = matrix(nrow = length(dogs),
               ncol = length(Index))
  for(j in 1:length(dogs)) { # dogs
    hold = dog_dat[dog_dat$dog_type == dogs[j],Index]
    tmp[j,] = colMeans(hold)
  }
  rownames(tmp) = dogs
  colnames(tmp) = paste("month",1:12,sep="_")
  out[[i]] = tmp
}
names(out) <- c("weight","length","food")
> out
$weight
           month_1  month_2  month_3  month_4  month_5  month_6  month_7
lab       49.81840 48.69200 49.03360 50.26560 50.17600 49.67280 48.41600
poodle    49.40090 48.27297 48.61892 49.84414 49.76126 49.25856 47.99820
husky     49.26372 48.13097 48.48142 49.70088 49.61858 49.11327 47.86195
retriever 50.19474 49.06466 49.40602 50.62632 50.54361 50.04135 48.79248
           month_8  month_9 month_10 month_11 month_12
lab       46.54640 44.68640 45.15040 44.30640 45.88240
poodle    46.12613 44.26577 44.73243 43.89009 45.46306
husky     45.98761 44.12832 44.59469 43.75221 45.31858
retriever 46.91278 45.05263 45.51654 44.68496 46.24586

$length
           month_1  month_2  month_3  month_4  month_5  month_6  month_7
lab       19.91840 20.16800 20.28720 20.49600 20.57840 20.86400 20.96800
poodle    20.63964 20.88198 21.00090 21.20991 21.29189 21.58108 21.68198
husky     20.29115 20.54159 20.65575 20.86195 20.94867 21.23805 21.34071
retriever 20.47068 20.71955 20.83233 21.04135 21.12556 21.41729 21.51880
           month_8  month_9 month_10 month_11 month_12
lab       21.10400 21.20880 21.40720 21.57440 21.87440
poodle    21.82072 21.92432 22.12342 22.29009 22.58919
husky     21.47699 21.58142 21.77876 21.94779 22.24779
retriever 21.64962 21.75414 21.95263 22.12406 22.42406

$food
           month_1  month_2  month_3  month_4  month_5  month_6  month_7
lab       30.04000 29.77200 28.77680 28.20880 29.52240 30.24960 30.90160
poodle    30.03063 29.76306 28.77117 28.20631 29.51892 30.23874 30.89910
husky     30.12301 29.85221 28.85841 28.29646 29.60973 30.33363 30.98584
retriever 29.89248 29.62556 28.63008 28.06617 29.37744 30.10075 30.75564
           month_8  month_9 month_10 month_11 month_12
lab       29.20880 30.03200 29.89120 29.54240 30.89520
poodle    29.20631 30.02613 29.88739 29.53243 30.89550
husky     29.29646 30.11770 29.97345 29.62389 30.98053
retriever 29.06617 29.88722 29.74887 29.39248 30.75338
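For readers without the lecture 4 file, the nested loop can be tried on simulated data. The data frame below is a made-up stand-in: the column names (id, age, wt_1, len_1, food_1, ...) and the random values are assumptions chosen only so that columns 4:15, 16:27 and 28:39 hold weight, length and food. The numbers it produces will not match the output above.

```r
set.seed(1)
n <- 40
dog_types <- c("lab", "poodle", "husky", "retriever")

# Simulated stand-in for dog_dat: 3 leading columns, then 12 monthly
# columns each for weight (wt), length (len) and food
dog_dat <- data.frame(id = 1:n,
                      age = sample(1:10, n, replace = TRUE),
                      dog_type = rep(dog_types, each = n / 4),
                      stringsAsFactors = FALSE)
for (v in c("wt", "len", "food")) {
  for (m in 1:12) {
    dog_dat[[paste(v, m, sep = "_")]] <- rnorm(n, mean = 30, sd = 5)
  }
}

Indexes <- list(weight = 4:15, length = 16:27, food = 28:39)
dogs <- unique(dog_dat$dog_type)
out <- list()

for (i in seq(along = Indexes)) {            # groups
  Index <- Indexes[[i]]
  tmp <- matrix(nrow = length(dogs), ncol = length(Index))
  for (j in 1:length(dogs)) {                # dogs
    hold <- dog_dat[dog_dat$dog_type == dogs[j], Index]
    tmp[j, ] <- colMeans(hold)
  }
  rownames(tmp) <- dogs
  colnames(tmp) <- paste("month", 1:12, sep = "_")
  out[[i]] <- tmp
}
names(out) <- names(Indexes)

# Spot check one cell against a direct calculation
out$weight["lab", "month_1"]
mean(dog_dat$wt_1[dog_dat$dog_type == "lab"])
```

Each component of out is a 4 x 12 matrix (dog type by month), and the spot check at the end should print the same number twice.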
Aside

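This step is potentially dangerous, because the column positions are typed in by hand and will silently point at the wrong columns if the layout of dog_dat ever changes:
Indexes[[1]] = 4:15 # weight
Indexes[[2]] = 16:27 # length
Indexes[[3]] = 28:39 # food

Is there a better way? Yes! Each group shares a common term in its column names: wt, len, food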
Aside

grep(pattern, x): returns the indices of the elements of vector x that match "pattern"
> grep("wt", names(dog_dat))
 [1]  4  5  6  7  8  9 10 11 12 13 14 15
> grep("len", names(dog_dat))
[1] 16 17 18 19 20 21 22 23 24 25 26 27
> grep("food", names(dog_dat))
[1] 28 29 30 31 32 33 34 35 36 37 38 39
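The same calls can be tried without the data file by building a vector of names by hand. The names below are hypothetical; they only mimic the layout implied by the indices above.

```r
# Hypothetical column names mimicking the layout of dog_dat
nms <- c("id", "age", "dog_type",
         paste("wt", 1:12, sep = "_"),
         paste("len", 1:12, sep = "_"),
         paste("food", 1:12, sep = "_"))

grep("wt", nms)                  # positions of the matching names
grep("wt", nms, value = TRUE)    # the matching names themselves

# A cheap guard: each group should have exactly 12 monthly columns
stopifnot(length(grep("len", nms)) == 12)
```

Checking the number of matches is a simple way to notice when a pattern picks up a column it was not meant to.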
Aside
> Indexes = list()
> Indexes[[1]] = grep("wt", names(dog_dat))
> Indexes[[2]] = grep("len", names(dog_dat))
> Indexes[[3]] = grep("food", names(dog_dat))
> Indexes
[[1]]
 [1]  4  5  6  7  8  9 10 11 12 13 14 15
[[2]]
[1] 16 17 18 19 20 21 22 23 24 25 26 27
[[3]]
[1] 28 29 30 31 32 33 34 35 36 37 38 39
Aside

grep can be a lot more powerful when combined with regular expressions, but we're not going to get into that here
Aside
- Opposite of paste: strsplit(x, split) splits each element of 'x' on the 'split' character or pattern
- Returns a list:

> x = paste("month",1:12,sep="_")
> head(strsplit(x,"_"),3)
[[1]]
[1] "month" "1"
[[2]]
[1] "month" "2"
[[3]]
[1] "month" "3"
Aside
- If you want one element (in this case, the number), it is easiest to just use a 'for' loop
- If you split each element separately, the output list only has 1 element: [[1]]
- You then need to pick the piece you want out of that vector using single brackets

Aside
x = paste("month",1:12,sep="_")
num = rep(0,length(x))
for(i in 1:length(x)) {
  # split element i, take the resulting vector, then its second piece
  num[i] = strsplit(x[i],"_")[[1]][2]
}
> i = 1
> strsplit(x[i],"_") # list
[[1]]
[1] "month" "1"
> strsplit(x[i],"_")[[1]] # vector
[1] "month" "1"
> strsplit(x[i],"_")[[1]][2] # element
[1] "1"
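As an alternative to the loop (not the approach used in the lecture), the whole vector can be split at once and the second piece taken from every list element with sapply(). Since strsplit() returns character strings, the pieces are converted with as.numeric() if numbers are wanted.

```r
x <- paste("month", 1:12, sep = "_")

parts <- strsplit(x, "_")                  # list of 12 character vectors
num <- sapply(parts, function(p) p[2])     # second piece of each element
num <- as.numeric(num)                     # character "1", "2", ... to numbers

num
```

The anonymous function plays the same role as the `[[1]][2]` step in the loop: it picks the second slot of each split result.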
Applied Example
- Load in "lec5_data.rda" from the course website
- These are the people from "lec2_data.rda" that did not have a dog at baseline
- At each monthly follow-up, these people reported whether they had borrowed a dog over the past month

Applied Example
- dog_0: baseline dog ownership; all of these people should have "no"
- dog_1 to dog_12: did you borrow a dog over the past month?

Applied Example
- Determine person-time at risk for dog borrowing
- Create a "survival" dataset from this data with columns: ID, start, end
- Note that there is missing data…

Applied Example
- We want to convert each person's wide data into two numbers: start and end
- Because of missing data, some people might have more than 1 row: people aren't at risk for dog borrowing during visits where they did not report (i.e., are missing)

Applied Example

Take person 1:
> dat[1,]
  id age sex height weight dog_0 dog_1 dog_2
1  1  40   F   63.5  134.5    no    no   yes
  dog_3 dog_4 dog_5 dog_6 dog_7 dog_8 dog_9
1   yes    no    no   yes   yes    no   yes
  dog_10 dog_11 dog_12
1   <NA>     no     no
Applied Example
Person 1 in the new dataset should be:
ID start end
 1     0   9
 1    11  12

Applied Example
- Basic premise: write a for-loop that passes over each person and determines their non-missing follow-up time
- Caveat: how many rows should our output matrix have? We don't know in advance
- Perfect opportunity for using rbind()…

Applied Example
- Create a matrix with 0 rows and 3 columns
- Within the body of the loop, use rbind to append new rows (this is slow, though)

> out = matrix(nrow = 0, ncol = 3)
> dim(out)
[1] 0 3
> p1 = c(1,0,9)
> out = rbind(out, p1)
> out
   [,1] [,2] [,3]
p1    1    0    9
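Because rbind() copies the whole object every time a row is appended, a common alternative (a swapped-in technique, not the one used in this lecture) is to collect the rows in a list and bind them once at the end. A minimal sketch, using the first three rows of the expected result as example values:

```r
rows <- list()                 # one list slot per output row
rows[[1]] <- c(1, 0, 9)        # ID, start, end
rows[[2]] <- c(1, 11, 12)
rows[[3]] <- c(2, 0, 5)

# Bind all rows in a single call instead of once per iteration
out <- do.call(rbind, rows)
colnames(out) <- c("ID", "start", "end")
out
```

For a dataset of this size the difference is hardly noticeable, so the simpler rbind() inside the loop is a reasonable choice here.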
Applied Example
out = matrix(nrow = 0, ncol = 3)
cols = grep("dog", names(dat))
for(i in 1:nrow(dat)) {
  hold = as.numeric(dat[i,cols])
  ...
Applied Example

Here, the follow-up results are factors, which are stored as integer codes (here "no" = 1 and "yes" = 2):
> dat[i,cols]
  dog_0 dog_1 dog_2 dog_3 dog_4 dog_5 dog_6
1    no    no   yes   yes    no    no   yes
  dog_7 dog_8 dog_9 dog_10 dog_11 dog_12
1   yes    no   yes   <NA>     no     no
> as.numeric(dat[i,cols])
 [1]  1  1  2  2  1  1  2  2  1  2 NA  1  1
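The mapping from labels to numbers can be seen on a small factor built by hand (the values are made up for illustration). Levels are sorted alphabetically by default, so "no" gets code 1 and "yes" gets code 2, and a missing value stays missing.

```r
f <- factor(c("no", "yes", NA, "no"))

levels(f)        # "no" "yes"
as.numeric(f)    # 1 2 NA 1
is.na(f)         # FALSE FALSE TRUE FALSE
```

For the steps that follow, only the position of the NA values matters, not the codes themselves.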
Applied Example
- Now a cool little trick: rle(), run length encoding
- It computes the lengths and values of runs of equal values in a vector
- We're going to combine this with is.na()

Applied Example
> x = rle(is.na(hold))
> x
Run Length Encoding
  lengths: int [1:3] 10 1 2
  values : logi [1:3] FALSE TRUE FALSE

- This says that there are 10 FALSE in a row, then 1 TRUE, then 2 FALSE
- We need to get this in a better format…
Applied Example
> x = data.frame(cbind(x$values, x$lengths))
> names(x) <- c("missing", "length")
> x
  missing length
1       0     10
2       1      1
3       0      2
Applied Example

cumsum() returns the cumulative sum of a vector
> x$end <- cumsum(x$length)
> x$start <- x$end - x$length + 1
> x
  missing length end start
1       0     10  10     1
2       1      1  11    11
3       0      2  13    12
Applied Example

Note that we actually want all of the values to be one less, since our time starts at 0:
> x$end <- cumsum(x$length) - 1
> x$start <- x$end - x$length + 1
> x
  missing length end start
1       0     10   9     0
2       1      1  10    10
3       0      2  12    11
Applied Example

Quick rearrangement:
> x <- x[,c(1,2,4,3)]
> x
  missing length start end
1       0     10     0   9
2       1      1    10  10
3       0      2    11  12
Applied Example

We want the last two columns of the non-missing visits:
> tmp = x[which(x$missing == 0),3:4]
> tmp
  start end
1     0   9
3    11  12
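The steps from rle() to this point can be gathered into one small function. This is a sketch only: the helper name visit_runs is hypothetical, and hold is typed in by hand to match person 1 so that the block runs without the data file.

```r
# Person 1's 13 answers (dog_0 ... dog_12) as factor codes; NA = missing visit
hold <- c(1, 1, 2, 2, 1, 1, 2, 2, 1, 2, NA, 1, 1)

# Hypothetical helper: start/end (counting from 0) of each non-missing run
visit_runs <- function(hold) {
  r <- rle(is.na(hold))                 # runs of reported vs. missing visits
  end <- cumsum(r$lengths) - 1          # last visit index of each run
  start <- end - r$lengths + 1          # first visit index of each run
  data.frame(start = start, end = end)[!r$values, ]   # drop the missing runs
}

tmp <- visit_runs(hold)
tmp
```

For person 1 this gives the two intervals shown above, 0 to 9 and 11 to 12.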
Applied Example

We then want to add a column of the
individual ID to the front
id = dat[i,1]
tmp = cbind(rep(id,nrow(tmp)), tmp)
names(tmp)[1] = "ID"
> tmp
  ID start end
1  1     0   9
3  1    11  12
Applied Example
- Lastly, bind tmp to the growing out object
- This finishes off our loop body

out = rbind(out,tmp)

The complete loop:
for(i in 1:nrow(dat)) {
  hold = as.numeric(dat[i,cols])              # factor codes, NA = missing visit
  x = rle(is.na(hold))                        # runs of reported/missing visits
  x = data.frame(cbind(x$values, x$lengths))
  names(x) <- c("missing", "length")
  x$end <- cumsum(x$length) - 1               # time starts at 0
  x$start <- x$end - x$length + 1
  x <- x[,c(1,2,4,3)]                         # put start before end
  tmp = x[which(x$missing == 0),3:4]          # keep the non-missing runs
  id = dat[i,1]
  tmp = cbind(rep(id,nrow(tmp)), tmp)         # add the ID to the front
  names(tmp)[1] = "ID"
  out = rbind(out,tmp)
}
rownames(out) = 1:nrow(out) # cleaning
> head(out,10)
   ID start end
1   1     0   9
2   1    11  12
3   2     0   5
4   2     7  12
5   3     0   2
6   3     5  12
7   4     0   3
8   4     6  12
9   5     0   0
10  5     3   8
> dim(out)
[1] 1414    3
Applied Example
- One last adjustment is needed, since we asked about borrowing a dog in the previous month
- The non-0 starts must be reduced by 1, since they are currently visit indices, not time at risk

Before:
  ID start end
1  1     0   9
2  1    11  12

After:
  ID start end
1  1     0   9
2  1    10  12
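The slides show the result of this adjustment but not the code for it. One way it might be done is with a logical index on the non-zero starts; the small data frame below is a hand-typed stand-in for the first rows of out.

```r
# Stand-in for the first four rows of 'out'
out <- data.frame(ID    = c(1, 1, 2, 2),
                  start = c(0, 11, 0, 7),
                  end   = c(9, 12, 5, 12))

later <- out$start != 0                    # rows whose interval does not start at baseline
out$start[later] <- out$start[later] - 1   # shift those starts back by one month
out
```

Rows starting at 0 are left alone, and the end column is not changed.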
Applied Example

What is the total time at risk of this
population?
> time = out$end - out$start
> sum(time)
[1] 4988 # person-months
Applied Example

Save the 'out' matrix as an rda so it can be
used next week
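A minimal sketch of saving and reloading with save() and load(). The file name "lec5_out.rda" is hypothetical (the slides do not give one), the file is written to a temporary directory so the example leaves nothing behind, and the small data frame stands in for the real out.

```r
# Stand-in for the real 'out'
out <- data.frame(ID = c(1, 1), start = c(0, 10), end = c(9, 12))

f <- file.path(tempdir(), "lec5_out.rda")   # hypothetical file name
save(out, file = f)                         # write the object to disk

rm(out)                                     # remove it from the workspace
load(f)                                     # restores an object named 'out'
out
```

Note that load() brings the object back under the name it was saved with, so there is no assignment on that line.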