CSSS 508: Intro R
1/25/06
Lab 3: Being Loopy and Creating Variables

In Venables and Ripley's MASS library, there are several datasets that are available for exploring and analyzing.

> help.start()

Click on Packages, then MASS. Scroll through; look at short descriptions of datasets.

> library(MASS)        (loads the package in your R session)

Today our sample dataset is the Veteran's Administration Lung Cancer Trial data.

> help(VA)
> dim(VA)
[1] 137   8
> n<-nrow(VA)
> p<-ncol(VA)
> n
[1] 137
> p
[1] 8

There are 137 subjects, 8 variables. The data is a data frame; each column is named and can be accessed by VA$(the column name). You can reassign the columns if you want.

VA$stime:       survival or followup time in days
VA$status:      dead (1) or censored (0)
VA$treat:       standard (1) or test (2)
VA$age:         patient's age in years
VA$Karn:        patient's Karnofsky score; on scale 0-100; high values are for relatively well patients
VA$diag.time:   times since diagnosis in months at entry to trial
VA$cell:        one of four cell types
VA$prior:       did the patient have prior therapy (10-Yes, 0-No)

Do summary() on each variable (or on VA) to get an idea of what kind of data you have. Any missing data?

Rebecca Nugent, Department of Statistics, U. of Washington

Creating a quartile age category variable using a for loop and if statements:

> summary(VA$age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  34.00   51.00   62.00   58.31   66.00   81.00

> age.cat<-rep(0,n)
> for(i in 1:n){
+ if(34<=VA$age[i] & VA$age[i]<51) age.cat[i]<-1
+ if(51<=VA$age[i] & VA$age[i]<62) age.cat[i]<-2
+ if(62<=VA$age[i] & VA$age[i]<66) age.cat[i]<-3
+ if(66<=VA$age[i] & VA$age[i]<=81) age.cat[i]<-4
+ }

> table(age.cat)
age.cat
 1  2  3  4
34 30 36 37

Note: we had an if statement for every possible age value, so we didn't need an else statement. Could also do this with a string of if/else statements:

> age.cat<-rep(0,n)
> for(i in 1:n){
+ if(34<=VA$age[i] & VA$age[i]<51) age.cat[i]<-1
+ else if(51<=VA$age[i] & VA$age[i]<62) age.cat[i]<-2
+ else if(62<=VA$age[i] & VA$age[i]<66) age.cat[i]<-3
+ else age.cat[i]<-4
+ }

> table(age.cat)
age.cat
 1  2  3  4
34 30 36 37

Or with a series of conditional assignments:

> age.cat<-rep(0,n)
> age.cat[34<=VA$age & VA$age<51]<-1
> age.cat[51<=VA$age & VA$age<62]<-2
> age.cat[62<=VA$age & VA$age<66]<-3
> age.cat[age.cat==0]<-4

> table(age.cat)
age.cat
 1  2  3  4
34 30 36 37

Make a matrix copy of your VA data.

> VA2<-as.matrix(VA)

Change 10 random elements to missing.

> VA2[sample(seq(1,n*p),10)]<-NA

Double for loops:

If we want to loop over all elements in a matrix, we can nest two for loops. The code below loops over the matrix and returns the locations of the missing values as a two-column matrix (row, column). The rows are indexed by i; the columns by j.

missing<-NULL
for(i in 1:n){
  for(j in 1:p){
    if(is.na(VA2[i,j])) missing<-rbind(missing, c(i,j))
  }
}

> missing        (these are the random ones I had; yours will be different)
      [,1] [,2]
 [1,]    2    4
 [2,]   21    3
 [3,]   26    2
 [4,]   41    2
 [5,]   42    1
 [6,]   56    5
 [7,]   66    5
 [8,]   90    8
 [9,]  112    3
[10,]  125    7

Loops can take a long time; try to do things vector by vector if you can. The version below uses a single loop over the rows and checks each row as a whole vector.

missing<-NULL
for(i in 1:n){
  missing.loc<-which(is.na(VA2[i,]))
  how.many<-length(missing.loc)
  if(how.many!=0) missing<-rbind(missing,cbind(rep(i,how.many),missing.loc))
}

> missing
           missing.loc
age      2           4
treat   21           3
status  26           2
status  41           2
stime   42           1
Karn    56           5
Karn    66           5
prior   90           8
treat  112           3
cell   125           7

Logistic Regression:

Often we want to create variables that represent a subgroup (1 if in the subgroup; 0 if not).

Create subgroup variables for the 4 categories:
(standard treatment / Karn < 50, standard / Karn >= 50, test / Karn < 50, test / Karn >= 50)

> stand.lowKarn<-stand.highKarn<-test.lowKarn<-test.highKarn<-rep(0,n)

> stand.lowKarn[VA$treat==1&VA$Karn<50]<-1
> stand.highKarn[VA$treat==1&VA$Karn>=50]<-1
> test.lowKarn[VA$treat==2&VA$Karn<50]<-1
> test.highKarn[VA$treat==2&VA$Karn>=50]<-1

> table(stand.lowKarn)
stand.lowKarn
  0   1
119  18
> table(stand.highKarn)
stand.highKarn
 0  1
86 51
> table(test.lowKarn)
test.lowKarn
  0   1
117  20
> table(test.highKarn)
test.highKarn
 0  1
89 48

Note that the number of ones adds up to 137 (18 + 51 + 20 + 48). Each person is in exactly one of the subgroups.
How would you do this with a for loop?

General Practice:

What is the mean survival time for each of the four cell types?

> mean(VA$stime[VA$cell==1])
[1] 200.2
> mean(VA$stime[VA$cell==2])
[1] 71.66667
> mean(VA$stime[VA$cell==3])
[1] 64.11111
> mean(VA$stime[VA$cell==4])
[1] 166.1111

Which patients were in the test treatment group and had prior therapy?

> which(VA$treat==2&VA$prior==10)
[1] 70 73 75 77 84 87 89 91 95 96 106 109 113 122 127 128 131 132 135

Looking at More Examples of While loops:

While loops are mostly used when you are testing a condition and want to know when you have reached convergence or some other point of interest.

How many data points do we need to simulate to get within .1 of the specified mean?

data<-NULL
check<-0
while(check==0){
  data<-c(data,rpois(1,lambda=2))
  new.mean<-mean(data)
  if(1.9<=new.mean & new.mean<=2.1) check<-1
}

> data
[1] 2

(running the loop a second time)

> data
[1] 4 4 0 2 1 2 2 1

To get within .05?

data<-NULL
check<-0
while(check==0){
  data<-c(data,rpois(1,lambda=2))
  new.mean<-mean(data)
  if(1.95<=new.mean & new.mean<=2.05) check<-1
}

> data
[1] 0 2 2 1 1 2 3 0 0 1 2 2 0 2 3 2 2 2 5 1 1 1 0 0 0 0 3 2 1 3 3 3 7 0 1 0 1 1 5 3 4 4 1 1 0 1 0 4 1 3 4 4 3 3 3 4 3 4

To get within .01?

data<-NULL
check<-0
while(check==0){
  data<-c(data,rpois(1,lambda=2))
  new.mean<-mean(data)
  if(1.99<=new.mean & new.mean<=2.01) check<-1
}

> data
 [1] 0 0 1 2 1 2 4 2 3 0 2 1 2 2 2 2 3 1 3 1 0 1 1 0 1 1 0 1 4 2 3 3 3 1 1 1 3
[38] 3 0 1 0 1 3 2 1 3 5 2 4 1 0 2 2 2 2 0 4 1 2 3 4 2 1 3 3 2 0 2 3 1 0 1 1 3
[75] 6 3 1 4 2 1 2 1 5 1 4 2 4 2 1 3 2 1 3 4 0 4 2 4 2 3 1 5
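
The three while loops above differ only in how close the running mean has to get, so one way to avoid retyping them is to wrap the loop in a function. The sketch below is not part of the original lab: the function name draws.needed, its arguments tol and lambda, and the max.draws safety cap are hypothetical choices made for illustration. The cap is there only so the loop cannot run forever; if it is reached, the data are returned even though the mean has not come within the tolerance.

```r
# Hypothetical helper (not from the lab): the same while loop as above,
# with the tolerance passed in as an argument.
draws.needed <- function(tol, lambda = 2, max.draws = 10000) {
  data <- NULL
  check <- 0
  # max.draws is an assumed safety cap so the loop always stops
  while (check == 0 && length(data) < max.draws) {
    data <- c(data, rpois(1, lambda = lambda))
    new.mean <- mean(data)
    # stop once the running mean is within tol of lambda
    if ((lambda - tol) <= new.mean & new.mean <= (lambda + tol)) check <- 1
  }
  data
}

# How many draws did each tolerance need? (your counts will differ)
tols <- c(0.1, 0.05, 0.01)
counts <- sapply(tols, function(tol) length(draws.needed(tol)))
names(counts) <- tols
print(counts)
```

As in the lab output, the number of draws varies a lot from run to run: a first draw equal to 2 stops the loop immediately, whatever the tolerance.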