Data manipulation in R

Editing R programs

You can create and save your R code using any text editor, such as Notepad or WordPad. If you want to make the effort to learn a more powerful editor designed specifically for R, you may want to look at RStudio (www.rstudio.com) or Tinn-R (sourceforge.net/projects/tinn-r). RStudio also has tools for saving the results of your analyses as HTML or PDF files, which you may find helpful for your homework or presentations to colleagues.

Creating a data set in R

If you have a small data set, you can create it directly in R using code like the following. Suppose we have a data set of 12 observations on the flight time of three different shapes of confetti:

   confetti.type flight.time
1           Ball        0.56
2           Ball        0.59
3           Ball        0.61
4           Ball        0.61
5           Flat        1.06
6           Flat        1.09
7           Flat        1.22
8           Flat        1.56
9         Folded        1.44
10        Folded        1.42
11        Folded        1.65
12        Folded        1.95

## Execute the following commands to create the confetti flight time data set in R
mystring="ID,confetti.type,flight.time
1,Ball,0.56
2,Ball,0.59
3,Ball,0.61
4,Ball,0.61
5,Flat,1.06
6,Flat,1.09
7,Flat,1.22
8,Flat,1.56
9,Folded,1.44
10,Folded,1.42
11,Folded,1.65
12,Folded,1.95"
flight.time.data=read.table(textConnection(mystring), header=TRUE, sep=",", row.names="ID")
flight.time.data

Read data sets from a file in a directory

On my computer, I have the course data files in the directory C:/Users/Walker/Desktop/UCSD Biom 285/Data. To access the data from inside R, I use the setwd() command to tell R the working directory where the data files are:

setwd("C:/Users/Walker/Desktop/UCSD Biom 285/Data")

Change the setwd() command to point to the directory on your computer.
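If R cannot find a file, it often helps to check which directory R is currently using. The following is a minimal sketch, not part of the course files: it uses tempdir() as a stand-in for your own data directory (an assumption made only so the example runs on any computer), and old.dir is a name introduced here for illustration.

```r
# tempdir() stands in for your own data directory (an assumption for this sketch)
old.dir = setwd(tempdir())   # setwd() returns the previous working directory
getwd()                      # show the directory R is using now
list.files()                 # list the files R can see in that directory
setwd(old.dir)               # go back to the previous working directory
```

If the file you want does not appear in the output of list.files(), the working directory is probably not the one you intended.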
On Windows, make sure the slashes in the path are "/", not "\". R treats "\" inside a quoted string as an escape character, so a path copied from Windows such as
"C:\Users\Walker\Desktop\UCSD Biom 285\Data"
must have the slashes changed to:
"C:/Users/Walker/Desktop/UCSD Biom 285/Data"

Use read.table to read data from a file

If you have data in an Excel file that you want to read into R, save the Excel file as a ".csv" file. In Excel, use the following commands:

Click File
Click Save As
Click the menu "Save as type"
Select CSV
Then save the file.

The file "biomarkers.csv" is a comma-separated data file with these contents:

ID,Sex,Age,Disease,Biomarker.1,Biomarker.2,Biomarker.3
1,Female,30,Case,138,137,79
2,Female,30,Control,141,143,93
3,Female,40,Case,134,148,58
4,Female,40,Control,150,153,87
5,Female,50,Case,153,147,53
6,Female,50,Control,168,163,62
7,Female,60,Case,161,167,34
8,Female,60,Control,180,178,53
9,Male,30,Case,135,138,88
10,Male,30,Control,160,164,109
11,Male,40,Case,152,155,76
12,Male,40,Control,169,164,93
13,Male,50,Case,163,165,61
14,Male,50,Control,182,183,73
15,Male,60,Case,179,179,49
16,Male,60,Control,178,184,61

To access the data from an R session, I must tell R the working directory where the data files are. We'll read the data from the file into a variable named "biomarker.data".
setwd("C:/Users/Walker/Desktop/UCSD Biom 285/Data")
biomarker.data= read.table("biomarkers.csv", header=TRUE, sep=",")
biomarker.data

> biomarker.data
   ID    Sex Age Disease Biomarker.1 Biomarker.2 Biomarker.3
1   1 Female  30    Case         138         137          79
2   2 Female  30 Control         141         143          93
3   3 Female  40    Case         134         148          58
4   4 Female  40 Control         150         153          87
5   5 Female  50    Case         153         147          53
6   6 Female  50 Control         168         163          62
7   7 Female  60    Case         161         167          34
8   8 Female  60 Control         180         178          53
9   9   Male  30    Case         135         138          88
10 10   Male  30 Control         160         164         109
11 11   Male  40    Case         152         155          76
12 12   Male  40 Control         169         164          93
13 13   Male  50    Case         163         165          61
14 14   Male  50 Control         182         183          73
15 15   Male  60    Case         179         179          49
16 16   Male  60 Control         178         184          61

You can also use read.csv() for a CSV file:

biomarker.data= read.csv("biomarkers.csv", header=TRUE)

The file biomarkers.txt is an ordinary text file. Instead of using a comma to separate values, this text file uses the Tab delimiter, which is represented in R by "\t". The following command reads data from a .txt file.

biomarker.data2= read.table("biomarkers.txt", header=TRUE, sep="\t")

Useful R commands

Create a variable, weight, to contain a set of values as a data vector. Use c() to combine the values into a vector and assign it to the weight variable.

> weight = c(89, 122, 125, 111, 192, 111, 211, 133, 156, 79)
> sum(weight)
[1] 1329
> mean(weight)
[1] 132.9
> length(weight)
[1] 10
> rm(weight)   # remove the weight variable

Create a sequence of numbers

> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10
> seq(1,20,by=2)
 [1]  1  3  5  7  9 11 13 15 17 19

Missing values (NA)

> data1 = c(1,1,1,NA,0,0,0)
> mean(data1)
[1] NA
> mean(data1, na.rm=TRUE)
[1] 0.5

Data frames

Data frames are a convenient way to name and access data sets in R. When we read data from files earlier, R created data frames to hold the data. The data sets that you have already seen, such as malaria and cystic fibrosis, are data frames. Usually, the easiest way to get your data into R is to put it into an Excel file and then save it as a csv file.
When you use read.table() or read.csv(), R automatically creates a data frame. To help you understand data frames, we'll create a data frame for the following data using the function data.frame().

Student     Alice  James  Randy
Assignment  60     58     60
Final       40     35     37

Student=c("Alice","James", "Randy")
Assignment=c(60,58,60)
Final=c(40,35,37)
mydata=data.frame(Student,Assignment,Final)
mydata
names(mydata)

> mydata
  Student Assignment Final
1   Alice         60    40
2   James         58    35
3   Randy         60    37
> names(mydata)
[1] "Student"    "Assignment" "Final"

Indexing

Sometimes you want to extract particular rows or columns from a data frame. You might want only a few specific variables (columns) or a subset of the observations (rows). R uses indexing to extract data from a vector or a data frame. R also uses indexing to put data into a vector or data frame. Use square brackets after a variable name to specify the index of the value(s) you want.

> weight = c(89, 122, 125, 111, 192, 111, 211, 133, 156, 79)

Use indexing to extract the first element of the vector weight, then other elements:

> weight[1]
[1] 89
> weight[2]
[1] 122
> weight[3]
[1] 125
> weight[2:5]
[1] 122 125 111 192

Use indexing to find the min, sum, and mean of the first 4 elements of the vector weight:

> min(weight[1:4])
[1] 89
> sum(weight[1:4])
[1] 447
> mean(weight[1:4])
[1] 111.75

Conditional Indexing

A logical condition inside the square brackets keeps only the elements for which the condition is TRUE.

> x=c(23, 15, -5, -9, 101)
> x[x>0]
[1]  23  15 101
> x[x>0 & x < 100]
[1] 23 15

Indexing data frames

Run library(ISwR) to make the malaria data set available. Use indexing to extract the first 5 rows and first 2 columns of the malaria data frame.

malaria[1:5,1:2]

The index [row,col] specifies which row(s) and column(s) to extract values from.

> malaria[1:5,1:2]
  subject age
1       1  15
2       2  14
3       3  12
4       4  15
5       5  14

Use indexing to extract the first 5 rows and all columns of the malaria data frame.
malaria[1:5,]

> malaria[1:5,]
  subject age  ab mal
1       1  15 546   0
2       2  14 268   0
3       3  12 284   0
4       4  15  38   0
5       5  14 827   0

By default, if you don't specify a value for either row or column in [row,col], R will return all the rows or columns.

Look at the cystfibr data set.

   age sex height weight bmp fev1  rv frc tlc pemax
1    7   0    109   13.1  68   32 258 183 137    95
2    7   1    112   12.9  65   19 449 245 134    85
3    8   0    124   14.1  64   22 441 268 147   100
4    8   1    125   16.2  67   41 234 146 124    85
5    8   0    127   21.5  93   52 202 131 104    95
6    9   0    130   17.5  68   44 308 155 118    80
7   11   1    139   30.7  89   28 305 179 119    65
8   12   1    150   28.4  69   18 369 198 103   110
9   12   0    146   25.1  67   24 312 194 128    70
10  13   1    155   31.5  68   23 413 225 136    95
11  13   0    156   39.9  89   39 206 142  95   110
12  14   1    153   42.1  90   26 253 191 121    90
13  14   0    160   45.6  93   45 174 139 108   100
14  15   1    158   51.2  93   45 158 124  90    80
15  16   1    160   35.9  66   31 302 133 101   134
16  17   1    153   34.8  70   29 204 118 120   134
17  17   0    174   44.7  70   49 187 104 103   165
18  17   1    176   60.1  92   29 188 129 130   120
19  17   0    171   42.6  69   38 172 130 103   130
20  19   1    156   37.2  72   21 216 119  81    85
21  19   0    174   54.6  86   37 184 118 101    85
22  20   0    178   64.0  86   34 225 148 135   160
23  23   0    180   73.8  97   57 171 108  98   165
24  23   0    175   51.1  71   33 224 131 113    95
25  23   0    179   71.5  95   52 225 127 101   195

Extract the first 3 columns of the cystfibr data set

cystfibr[,1:3]

> cystfibr[,1:3]
   age sex height
1    7   0    109
2    7   1    112
3    8   0    124
4    8   1    125
5    8   0    127
6    9   0    130
7   11   1    139
8   12   1    150
9   12   0    146
10  13   1    155
11  13   0    156
12  14   1    153
13  14   0    160
14  15   1    158
15  16   1    160
16  17   1    153
17  17   0    174
18  17   1    176
19  17   0    171
20  19   1    156
21  19   0    174
22  20   0    178
23  23   0    180
24  23   0    175
25  23   0    179

You can specify the row or column you want by name:

cystfibr[1:6,"weight"]

> cystfibr[1:6,"weight"]
[1] 13.1 12.9 14.1 16.2 21.5 17.5

Find the youngest patient in cystfibr

min(cystfibr[,"age"])

> min(cystfibr[,"age"])
[1] 7

Tables

table(x) finds all the unique values in the data vector x and tabulates (counts) the frequencies of their occurrence.
Suppose we have a list of the outcomes for 4 patients in a cancer clinical trial:

outcomes = c("alive", "alive", "alive", "dead")

> outcomes
[1] "alive" "alive" "alive" "dead"

We would like a count of the number of patients with each outcome. Use the table() function.

> table(outcomes)
outcomes
alive  dead
    3     1

In the statistics and medical literature, this table is sometimes called a "contingency table".

Here's a table of counts for the Age variable in the biomarker.data.

> table(biomarker.data[,"Age"])

30 40 50 60
 4  4  4  4

To count the combinations of two variables, pass both columns to table():

table(biomarker.data[,c("Age", "Sex")])

Factors

How does R determine the categories to use in the table(x) function? help(table) states that the arguments must be objects that can be interpreted as factors. R uses factors to specify that a variable is categorical, and to define the levels of the categorical variable. You define factors with the function factor(), or with the function as.factor(). Factors take a specified set of values called levels. Factors are different from data vectors. Here are some examples.

# Data vector of the numbers 1 to 5
1:5

# A factor with levels 1 to 5.
> factor(1:5)
[1] 1 2 3 4 5
Levels: 1 2 3 4 5

# Notice the output from the factor definition: Levels: 1 2 3 4 5

# A data vector is numeric
mean(1:5)

# A factor is not numeric. It is just the names of the levels.
mean(factor(1:5))

> mean(factor(1:5))
[1] NA
Warning message:
In mean.default(factor(1:5)) :
  argument is not numeric or logical: returning NA

levels(x) tells us the possible levels (values) that the categorical variable can take.

outcomes = c("alive no cancer", "alive no cancer", "alive no cancer", "alive cancer", "alive cancer", "dead")
levels(as.factor(outcomes))
[1] "alive cancer"    "alive no cancer" "dead"

Statistical data often have categorical variables, such as (male, female) or (mild, moderate, severe), that are stored as numeric values (0,1,2,…) in the data set.

Use sample() to take samples

Suppose we go fishing.
If we catch a fish and put it back in the water, that is a sample with replacement. If we catch a fish and eat it, we cannot put it back in the water, so that is a sample without replacement. We use the R function sample() to take samples with or without replacement from a list of items or numbers. The same number may appear more than once in the sample when we sample with replacement.

Use seq() to create a sequence of numbers from 1 to 10 in the variable x.

x=seq(1,10)

> x
 [1]  1  2  3  4  5  6  7  8  9 10

Take a single sample (one observation) from x. Because the sample is random, your results will differ from those shown here.

sample(x,1)

> sample(x,1)
[1] 2
> sample(x,1)
[1] 10
> sample(x,1)
[1] 1

Use the function sample() to draw a random sample of 10 observations from x without replacement. Repeat this several times to see the result.

sample(x,10)

> sample(x,10)
 [1] 10  6  1  4  3  7  2  5  9  8
> sample(x,10)
 [1]  8  5  7 10  9  6  4  1  2  3
> sample(x,10)
 [1]  4  6  7  3  9  8 10  2  1  5

Notice that, by default, sample() takes samples without replacement, so each number appears exactly once.

Use the function sample() to draw a random sample of 10 observations from x with replacement.

sample(x,10, replace=TRUE)

> sample(x,10, replace=TRUE)
 [1] 8 2 4 4 4 2 5 6 7 5
> sample(x,10, replace=TRUE)
 [1]  7  3  7 10  3  1  7  7  2  8
> sample(x,10, replace=TRUE)
 [1] 1 9 3 8 9 8 7 1 4 7

When we sample with replacement, the same number may appear more than once in the sample.

Execute commands from a file using source()

You can use the source() function to execute a series of R commands from a file. The file "example source file.txt" has the following contents.

weight = c(89, 122, 125, 111, 192, 111, 211, 133, 156, 79)
print("weight data")
print(weight)

I can execute the commands in the file using the source() command.

setwd("C:/Users/Walker/Desktop/UCSD Biom 285/Biom 285 Lectures/Data")
source("example source file.txt")

R will display the following results.
[1] "weight data"
 [1]  89 122 125 111 192 111 211 133 156  79

### The following is more advanced material on for loops and functions. It is provided for students with programming experience who want to implement R programs.

For loops

Use a for loop when you want to perform the same action many times on a list of values:

for (index in values) {
  block of commands
}

The following loop stores the square of each number from 1 to 10 in result.vector, prints each pair, and then plots the results.

result.vector= c()
for (index in 1:10) {
  result.vector[index] = index^2
  print(c(index, result.vector[index]))
}
plot(1:10,result.vector)

[1] 1 1
[1] 2 4
[1] 3 9
[1]  4 16
[1]  5 25
[1]  6 36
[1]  7 49
[1]  8 64
[1]  9 81
[1]  10 100

Functions

When you perform a more complicated action many times, it is convenient to put it into a function.

# Define a function that has no arguments
my.function = function() {
  sum(1:20)
}

# Look at the function definition
my.function

> my.function
function() {
  sum(1:20)
}

# Execute the function
my.function()

> my.function()
[1] 210

# Define a function that has three arguments, with default value for one argument
my.function2 = function(first,last,step=1) {
  sum(seq(first,last, by=step))
}

# Look at the function definition
my.function2

> my.function2
function(first,last,step=1) {
  sum(seq(first,last, by=step))
}

# Execute the function
my.function2(1,20)

> my.function2(1,20)
[1] 210

my.function2(1,20, step=2)

> my.function2(1,20, step=2)
[1] 100
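
The two ideas above can be combined by putting a for loop inside a function. The following is a minimal sketch rather than part of the course materials: powers.up.to() and its arguments n and power are names introduced here for illustration only.

```r
# powers.up.to() is a hypothetical helper written for this example
powers.up.to = function(n, power=2) {
  result.vector = numeric(n)          # start with a vector of n zeros
  for (index in seq_len(n)) {         # seq_len(n) is 1, 2, ..., n
    result.vector[index] = index^power
  }
  result.vector                       # the last value is what the function returns
}

# Use the default value power=2 to get squares
powers.up.to(5)

# Override the default to get cubes
powers.up.to(3, power=3)

# R arithmetic works on whole vectors, so this gives the same squares without a loop
(1:5)^2
```

For a simple calculation like this one, the vectorized form (1:5)^2 is usually shorter and faster. A loop is more useful when each step depends on the result of an earlier step.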