CSSS 508: Intro R

advertisement
CSSS 508: Intro R
1/25/06
Lab 3: Being Loopy and Creating Variables
In Venables and Ripley’s MASS library, there are several datasets that are available for
exploring and analyzing.
>
help.start()
Click on Packages, then MASS. Scroll through; look at short descriptions of datasets.
>library(MASS)
(loads the package in your R session)
Today our sample dataset is the Veteran’s Administration Lung Cancer Trial data.
>help(VA)
> dim(VA)
[1] 137
8
> n<-nrow(VA)
> p<-ncol(VA)
> n
[1] 137
> p
[1] 8
There are 137 subjects, 8 variables. The data is a data frame; each column is named and
can be accessed by VA$(the column name). You can reassign the columns if you want.
VA$stime:
VA$status:
VA$treat:
VA$age:
VA$Karn:
VA$diag.time:
VA$cell:
VA$prior:
survival or followup time in days
dead (1) or censored (0)
standard (1) or test (2)
patient’s age in years
patient’s Karnofsky score; on scale 0-100;
high values are for relatively well patients
times since diagnosis in months at entry to trial
one of four cell types
did the patient have prior therapy (10-Yes, 0-No)
Do summary() on each variable (or on VA) to get an idea of what kind of data you have.
Any missing data?
Rebecca Nugent, Department of Statistics, U. of Washington
-1-
Creating a quartile age category variable using a for loop and if statements:
> summary(VA$age)
Min. 1st Qu. Median
34.00
51.00
62.00
Mean 3rd Qu.
58.31
66.00
Max.
81.00
> age.cat<-rep(0,n)
>
+
+
+
+
+
for(i in 1:n){
if(34<=VA$age[i]
if(51<=VA$age[i]
if(62<=VA$age[i]
if(66<=VA$age[i]
}
&
&
&
&
VA$age[i]<51) age.cat[i]<-1
VA$age[i]<62) age.cat[i]<-2
VA$age[i]<66) age.cat[i]<-3
VA$age[i]<=81) age.cat[i]<-4
> table(age.cat)
age.cat
1 2 3 4
34 30 36 37
Note: we had an if statement for every possible age value, so we didn’t need an else
statement. Could also do this with a string of if/else statements.
> age.cat<-rep(0,n)
>
+
+
+
+
+
for(i in 1:n){
if(34<=VA$age[i] & VA$age[i]<51) age.cat[i]<-1
else if(51<=VA$age[i] & VA$age[i]<62) age.cat[i]<-2
else if(62<=VA$age[i] & VA$age[i]<66) age.cat[i]<-3
else age.cat[i]<-4
}
> table(age.cat)
age.cat
1 2 3 4
34 30 36 37
Or with a series of conditional assignments:
>
>
>
>
>
age.cat<-rep(0,n)
age.cat[34<=VA$age & VA$age<51]<-1
age.cat[51<=VA$age & VA$age<62]<-2
age.cat[62<=VA$age & VA$age<66]<-3
age.cat[age.cat==0]<-4
> table(age.cat)
age.cat
1 2 3 4
34 30 36 37
Make a matrix copy of your VA data.
> VA2<-as.matrix(VA)
Change 10 random elements to missing.
> VA2[sample(seq(1,n*p),10)]<-NA
Rebecca Nugent, Department of Statistics, U. of Washington
-2-
Double for loops:
If we want to loop over all elements in a matrix, we can index over two for loops.
The below code will loop over a matrix and return a list of the locations of the missing
variables. The rows are indexed by i; the columns by j.
missing<-NULL
for(i in 1:n){
for(j in 1:p){
if(is.na(VA2 [i,j])) missing<-rbind(missing, c(i,j))
}
}
> missing
(these are the random ones I had; yours will be different)
[,1] [,2]
[1,]
2
4
[2,]
21
3
[3,]
26
2
[4,]
41
2
[5,]
42
1
[6,]
56
5
[7,]
66
5
[8,]
90
8
[9,] 112
3
[10,] 125
7
Loops can take a long time; try to do things vector by vector if you can.
missing<-NULL
for(i in 1:n){
missing.loc<-which(is.na(VA2[i,]))
how.many<-length(missing.loc)
if(how.many!=0)
missing<-rbind(missing,cbind(rep(i,how.many),missing.loc))
}
> missing
age
2
treat
21
status 26
status 41
stime
42
Karn
56
Karn
66
prior
90
treat 112
cell
125
missing.loc
4
3
2
2
1
5
5
8
3
7
Rebecca Nugent, Department of Statistics, U. of Washington
-3-
Logistic Regression: Often we want to create variables that represent a subgroup
(1 if in the subgroup; 0 if not)
Create subgroup variables for the 4 categories:
(standard treatment / Karn < 50, standard / Karn > 50, test / Karn < 50, test / Karn > 50)
> stand.lowKarn<-stand.highKarn<-test.lowKarn<-test.highKarn<-rep(0,n)
>
>
>
>
stand.lowKarn[VA$treat==1&VA$Karn<50]<-1
stand.highKarn[VA$treat==1&VA$Karn>=50]<-1
test.lowKarn[VA$treat==2&VA$Karn<50]<-1
test.highKarn[VA$treat==2&VA$Karn>=50]<-1
> table(stand.lowKarn)
stand.lowKarn
0
1
119 18
> table(stand.highKarn)
stand.highKarn
0 1
86 51
> table(test.lowKarn)
test.lowKarn
0
1
117 20
> table(test.highKarn)
test.highKarn
0 1
89 48
Note that the number of ones adds up to 137. Each person is in one of the subgroups.
How would you do this with a for loop?
General Practice:
What is the mean survival time for each of the four cell types?
> mean(VA$stime[VA$cell==1])
[1] 200.2
> mean(VA$stime[VA$cell==2])
[1] 71.66667
> mean(VA$stime[VA$cell==3])
[1] 64.11111
> mean(VA$stime[VA$cell==4])
[1] 166.1111
Which patients were in the test treatment group and had prior therapy?
> which(VA$treat==2&VA$prior==10)
[1] 70 73 75 77 84 87 89 91
131 132 135
95
96 106 109 113 122 127 128
Rebecca Nugent, Department of Statistics, U. of Washington
-4-
Looking at More Examples of While loops:
Mostly while loops are used when you’re testing a condition and you want to know when
you’ve reached convergence or some point of interest.
How many data points do we need to simulate to get within .1 of the specified mean?
data<-NULL
check<-0
while(check==0){
data<-c(data,rpois(1,lambda=2))
new.mean<-mean(data)
if(1.9<=new.mean & new.mean<=2.1) check<-1
}
> data
[1] 2
> data
[1] 4 4 0 2 1 2 2 1
To get within .05?
data<-NULL
check<-0
while(check==0){
data<-c(data,rpois(1,lambda=2))
new.mean<-mean(data)
if(1.95<=new.mean & new.mean<=2.05) check<-1
}
> data
[1] 0 2 2 1 1 2 3 0 0 1 2 2 0 2 3 2 2 2 5 1 1 1 0 0 0 0 3 2 1 3 3 3 7
0 1 0 1 1 5 3 4 4 1 1 0 1 0 4 1 3 4 4 3 3 3 4 3 4
To get within .01?
data<-NULL
check<-0
while(check==0){
data<-c(data,rpois(1,lambda=2))
new.mean<-mean(data)
if(1.99<=new.mean & new.mean<=2.01) check<-1
}
> data
[1] 0 0 1 2 1 2 4 2 3 0 2 1 2 2 2 2 3 1 3 1 0 1 1 0 1 1 0 1 4 2 3 3 3
1 1 1 3
[38] 3 0 1 0 1 3 2 1 3 5 2 4 1 0 2 2 2 2 0 4 1 2 3 4 2 1 3 3 2 0 2 3 1
0 1 1 3
[75] 6 3 1 4 2 1 2 1 5 1 4 2 4 2 1 3 2 1 3 4 0 4 2 4 2 3 1 5
Rebecca Nugent, Department of Statistics, U. of Washington
-5-
Download