1.1 Data manipulation in R

Editing R programs
You can create and save your R code using any text editor, such as Notepad or WordPad.
If you want to make the effort to learn a more powerful editor specifically designed for
R, you may want to look at RStudio (www.rstudio.com) or Tinn-R
(sourceforge.net/projects/tinn-r). RStudio also has tools for saving the results of your
analyses as HTML or PDF files, which you may find helpful for your homework or
presentations to colleagues.
Creating a data set in R
If you have a small data set, you can create it directly in R using code like the following.
Suppose we have a data set of 12 observations on the flight time of three different
shapes of confetti:
   confetti.type flight.time
1           Ball        0.56
2           Ball        0.59
3           Ball        0.61
4           Ball        0.61
5           Flat        1.06
6           Flat        1.09
7           Flat        1.22
8           Flat        1.56
9         Folded        1.44
10        Folded        1.42
11        Folded        1.65
12        Folded        1.95
## Execute the following commands to create the confetti flight time data set in R
mystring="ID,confetti.type,flight.time
1,Ball,0.56
2,Ball,0.59
3,Ball,0.61
4,Ball,0.61
5,Flat,1.06
6,Flat,1.09
7,Flat,1.22
8,Flat,1.56
9,Folded,1.44
10,Folded,1.42
11,Folded,1.65
12,Folded,1.95"
flight.time.data=read.table(textConnection(mystring), header=TRUE, sep=",",
row.names="ID")
flight.time.data
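Once the data frame exists, it is worth checking that R read it the way you intended. The following is a minimal sketch that rebuilds the same data and inspects it. It uses the text= argument of read.table() in place of textConnection(), and the calls to dim(), str() and tapply() are illustrations added here, not part of the original exercise.

```r
# Rebuild the confetti data; text= reads directly from a character string
mystring = "ID,confetti.type,flight.time
1,Ball,0.56
2,Ball,0.59
3,Ball,0.61
4,Ball,0.61
5,Flat,1.06
6,Flat,1.09
7,Flat,1.22
8,Flat,1.56
9,Folded,1.44
10,Folded,1.42
11,Folded,1.65
12,Folded,1.95"
flight.time.data = read.table(text=mystring, header=TRUE, sep=",",
                              row.names="ID")

# Number of rows and columns, then the type of each column
dim(flight.time.data)
str(flight.time.data)

# Mean flight time for each confetti shape
group.means = tapply(flight.time.data$flight.time,
                     flight.time.data$confetti.type, mean)
print(group.means)
```

Because ID was used for the row names, the data frame has two columns, not three.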
Read data sets from a file in a directory
On my computer, I have the course data files in the directory
C:/Users/Walker/Desktop/Burnham/Intermediate Statistics using R/Data.
To access the data from inside R, I tell R the working directory where the data files are
using the setwd() command:
setwd("C:/Users/Walker/Desktop/UCSD Biom 285/Data")
Change the setwd() command to point to the directory on your computer.
On Windows, make sure the slashes in the path are forward slashes ("/"), not
backslashes ("\"). For example, the path
"C:\Users\Walker\Desktop\UCSD Biom 285\Data"
must have the slashes changed to:
"C:/Users/Walker/Desktop/UCSD Biom 285/Data"
Use read.table to read data from a file
If you have data in an Excel file that you want to read into R, save the Excel file as a
".csv" file. In Excel, use the following commands.
Click File
Click Save as
Click the menu "Save as type"
Select CSV
Then save the file.
The file "biomarkers.csv" is a comma-separated data file with these contents:
ID,Sex,Age,Disease,Biomarker.1,Biomarker.2,Biomarker.3
1,Female,30,Case,138,137,79
2,Female,30,Control,141,143,93
3,Female,40,Case,134,148,58
4,Female,40,Control,150,153,87
5,Female,50,Case,153,147,53
6,Female,50,Control,168,163,62
7,Female,60,Case,161,167,34
8,Female,60,Control,180,178,53
9,Male,30,Case,135,138,88
10,Male,30,Control,160,164,109
11,Male,40,Case,152,155,76
12,Male,40,Control,169,164,93
13,Male,50,Case,163,165,61
14,Male,50,Control,182,183,73
15,Male,60,Case,179,179,49
16,Male,60,Control,178,184,61
To access the data from an R session, I must tell R the working directory where the data
files are. We'll read the data from the file into a variable named "biomarker.data".
setwd("C:/Users/Walker/Desktop/UCSD Biom 285/Data")
biomarker.data= read.table("biomarkers.csv", header=TRUE, sep=",")
biomarker.data
> biomarker.data
   ID    Sex Age Disease Biomarker.1 Biomarker.2 Biomarker.3
1   1 Female  30    Case         138         137          79
2   2 Female  30 Control         141         143          93
3   3 Female  40    Case         134         148          58
4   4 Female  40 Control         150         153          87
5   5 Female  50    Case         153         147          53
6   6 Female  50 Control         168         163          62
7   7 Female  60    Case         161         167          34
8   8 Female  60 Control         180         178          53
9   9   Male  30    Case         135         138          88
10 10   Male  30 Control         160         164         109
11 11   Male  40    Case         152         155          76
12 12   Male  40 Control         169         164          93
13 13   Male  50    Case         163         165          61
14 14   Male  50 Control         182         183          73
15 15   Male  60    Case         179         179          49
16 16   Male  60 Control         178         184          61
You can also use read.csv() for a CSV file:
biomarker.data= read.csv("biomarkers.csv", header=TRUE)
The file biomarkers.txt is an ordinary text file. Instead of using a comma to separate
values, it uses the Tab delimiter, which is represented in R by "\t".
The following command reads data from the .txt file.
biomarker.data2= read.table("biomarkers.txt", header=TRUE, sep="\t")
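If you do not have biomarkers.txt at hand, you can try the Tab delimiter on a file that R creates for you. The following sketch writes the first three rows of the biomarker data (ID, Sex and Age only) to a temporary tab-delimited file and reads it back. The names small.data and tmp.file are made up for this illustration.

```r
# A small data frame to write out
small.data = data.frame(ID=1:3,
                        Sex=c("Female", "Female", "Female"),
                        Age=c(30, 30, 40))

# Write it to a temporary tab-delimited text file
tmp.file = tempfile(fileext=".txt")
write.table(small.data, tmp.file, sep="\t", row.names=FALSE, quote=FALSE)

# Read it back using the Tab delimiter
small.data2 = read.table(tmp.file, header=TRUE, sep="\t")
print(small.data2)

# Remove the temporary file
unlink(tmp.file)
```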
Useful R commands
Create a variable, weight, to hold a set of values as a data vector. Use c() to combine the
values into a vector and assign it to the weight variable.
> weight = c(89, 122, 125, 111, 192, 111, 211, 133, 156, 79)
> sum(weight)
[1] 1329
> mean(weight)
[1] 132.9
> length(weight)
[1] 10
> rm(weight)
Create a sequence of numbers
> 1:10
 [1]  1  2  3  4  5  6  7  8  9 10
> seq(1,20,by=2)
 [1]  1  3  5  7  9 11 13 15 17 19
Missing values (NA)
> data1 = c(1,1,1,NA,0,0,0)
> mean(data1)
[1] NA
> mean(data1, na.rm=TRUE)
[1] 0.5
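Before dropping missing values it helps to know how many there are. The following short sketch uses is.na() on the same data1 vector. The names n.missing and observed are made up for this illustration.

```r
data1 = c(1, 1, 1, NA, 0, 0, 0)

# is.na() returns TRUE for each missing value
is.na(data1)

# Count the missing values (TRUE counts as 1)
n.missing = sum(is.na(data1))
print(n.missing)

# Keep only the observed values, then take the mean
observed = data1[!is.na(data1)]
print(mean(observed))
```

The last line gives the same answer as mean(data1, na.rm=TRUE).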
Data frames
Data frames are a convenient way to name and access data sets in R. When we read
data from files earlier, R created data frames to hold the data.
The data sets that you have already seen, such as malaria and cystic fibrosis, are data
frames. Usually, the easiest way to get your data into R is to put it into an Excel file and
then save it as a csv file. When you use read.table or read.csv R automatically creates a
data frame. To help you understand data frames, we'll create a data frame for the
following data using the function data.frame().
Student     Alice  James  Randy
Assignment     60     58     60
Final          40     35     37
Student=c("Alice","James", "Randy")
Assignment=c(60,58,60)
Final=c(40,35,37)
mydata=data.frame(Student,Assignment,Final)
mydata
names(mydata)
> mydata
  Student Assignment Final
1   Alice         60    40
2   James         58    35
3   Randy         60    37
> names(mydata)
[1] "Student"    "Assignment" "Final"
>
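Once the data are in a data frame, each column can be used by name. The sketch below rebuilds mydata and shows the $ operator, which the examples above do not use. The Total column is a hypothetical addition for illustration only.

```r
Student = c("Alice", "James", "Randy")
Assignment = c(60, 58, 60)
Final = c(40, 35, 37)
mydata = data.frame(Student, Assignment, Final)

# Extract one column as a vector with $
mydata$Final

# Add a new column computed from two existing columns
mydata$Total = mydata$Assignment + mydata$Final
print(mydata)

# Number of rows and columns
dim(mydata)
```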
Indexing
Sometimes you want to extract particular rows or columns from a data frame. You
might want only a few specific variables (columns) or a subset of the observations
(rows).
R uses indexing to extract data from a list or a data frame. R also uses indexing to put
data into a list or data frame. Use square brackets after a variable name to specify the
index of the value(s) you want.
> weight = c(89, 122, 125, 111, 192, 111, 211, 133, 156, 79)
Use indexing to extract the first element in the list weight
> weight[1]
[1] 89
> weight[2]
[1] 122
> weight[3]
[1] 125
> weight[2:5]
[1] 122 125 111 192
Use indexing to find the min, sum, and mean of the first 4 elements in the list weight
> min(weight[1:4])
[1] 89
> sum(weight[1:4])
[1] 447
> mean(weight[1:4])
[1] 111.75
Conditional Indexing
> x=c(23, 15, -5, -9, 101)
> x[x>0]
[1] 23 15 101
> x[x>0 & x < 100]
[1] 23 15
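Conditional indexing works because the condition inside the brackets is itself a vector of TRUE and FALSE values, one for each element. The following sketch makes that intermediate step visible. The variable name keep is made up for this illustration.

```r
x = c(23, 15, -5, -9, 101)

# The comparison produces one TRUE/FALSE value per element
keep = x > 0
print(keep)

# Indexing with the logical vector keeps the TRUE positions
print(x[keep])

# which() gives the positions of the TRUE values
print(which(keep))

# Count how many elements satisfy the condition
print(sum(keep))
```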
Indexing data frames
Load library(ISwR) to make the malaria data set available.
Use indexing to extract the first 5 rows and 2 columns of the malaria data frame.
malaria[1:5,1:2]
The index [row,col] specifies which row(s) and column(s) to extract values from.
> malaria[1:5,1:2]
  subject age
1       1  15
2       2  14
3       3  12
4       4  15
5       5  14
Use indexing to extract the first 5 rows and all columns of the malaria data frame.
malaria[1:5,]
> malaria[1:5,]
  subject age  ab mal
1       1  15 546   0
2       2  14 268   0
3       3  12 284   0
4       4  15  38   0
5       5  14 827   0
By default, if you don't specify a value for either row or column in [row,col], R will return
all the rows or columns.
Look at the cystfibr data set.
   age sex height weight bmp fev1  rv frc tlc pemax
1    7   0    109   13.1  68   32 258 183 137    95
2    7   1    112   12.9  65   19 449 245 134    85
3    8   0    124   14.1  64   22 441 268 147   100
4    8   1    125   16.2  67   41 234 146 124    85
5    8   0    127   21.5  93   52 202 131 104    95
6    9   0    130   17.5  68   44 308 155 118    80
7   11   1    139   30.7  89   28 305 179 119    65
8   12   1    150   28.4  69   18 369 198 103   110
9   12   0    146   25.1  67   24 312 194 128    70
10  13   1    155   31.5  68   23 413 225 136    95
11  13   0    156   39.9  89   39 206 142  95   110
12  14   1    153   42.1  90   26 253 191 121    90
13  14   0    160   45.6  93   45 174 139 108   100
14  15   1    158   51.2  93   45 158 124  90    80
15  16   1    160   35.9  66   31 302 133 101   134
16  17   1    153   34.8  70   29 204 118 120   134
17  17   0    174   44.7  70   49 187 104 103   165
18  17   1    176   60.1  92   29 188 129 130   120
19  17   0    171   42.6  69   38 172 130 103   130
20  19   1    156   37.2  72   21 216 119  81    85
21  19   0    174   54.6  86   37 184 118 101    85
22  20   0    178   64.0  86   34 225 148 135   160
23  23   0    180   73.8  97   57 171 108  98   165
24  23   0    175   51.1  71   33 224 131 113    95
25  23   0    179   71.5  95   52 225 127 101   195
>
Extract the first 3 columns of the cystfibr data set
cystfibr[,1:3]
> cystfibr[,1:3]
age sex height
1 7 0 109
2 7 1 112
3 8 0 124
4 8 1 125
5 8 0 127
6 9 0 130
7 11 1 139
8 12 1 150
9 12 0 146
10 13 1 155
11 13 0 156
12 14 1 153
13 14 0 160
14 15 1 158
15 16 1 160
16 17 1 153
17 17 0 174
18 17 1 176
19 17 0 171
20 19 1 156
21 19 0 174
22 20 0 178
23 23 0 180
24 23 0 175
25 23 0 179
You can specify the row or column you want by name:
cystfibr[1:6,"weight"]
> cystfibr[1:6,"weight"]
[1] 13.1 12.9 14.1 16.2 21.5 17.5
Find the youngest patient in cystfibr
min(cystfibr[,"age"])
> min(cystfibr[,"age"])
[1] 7
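Conditional indexing also works on the rows of a data frame. The sketch below copies the first five rows and first four columns of the cystfibr listing above into a small data frame named cf, so that the example runs without the ISwR package. The names cf and heaviest are made up for this illustration.

```r
# First five rows of cystfibr (age, sex, height, weight only)
cf = data.frame(age=c(7, 7, 8, 8, 8),
                sex=c(0, 1, 0, 1, 0),
                height=c(109, 112, 124, 125, 127),
                weight=c(13.1, 12.9, 14.1, 16.2, 21.5))

# Rows where age is 8; the empty column index keeps all columns
cf[cf[,"age"] == 8, ]

# Weights of the patients whose sex is coded as 1
cf[cf[,"sex"] == 1, "weight"]

# The row holding the heaviest patient
heaviest = cf[which.max(cf[,"weight"]), ]
print(heaviest)
```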
Tables
table(x) finds all the unique values in the data vector x and tabulates (counts) the
frequencies of their occurrence.
Suppose we have a list of the outcomes for 4 patients in a cancer clinical trial:
outcomes = c("alive", "alive", "alive", "dead")
> outcomes
[1] "alive" "alive" "alive" "dead"
We would like a count of the number of patients with each outcome. Use the table()
function.
table(outcomes)
outcomes
alive  dead
    3     1
In the statistics and medical literature, this table is sometimes called a "contingency
table".
Here's a table of counts for the Age variable in the biomarker.data.
table(biomarker.data[,"Age"])
30 40 50 60
 4  4  4  4
table(biomarker.data[,c("Age", "Sex")])
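When table() is given two columns it counts every combination of their values. The following sketch rebuilds only the Age and Sex columns of biomarkers.csv so that it runs without the file. The names demo.data and age.by.sex are made up for this illustration.

```r
# Age and Sex columns as they appear in biomarkers.csv
Age = rep(c(30, 30, 40, 40, 50, 50, 60, 60), times=2)
Sex = rep(c("Female", "Male"), each=8)
demo.data = data.frame(Age, Sex)

# Two-way table: rows are Age, columns are Sex
age.by.sex = table(demo.data[,c("Age", "Sex")])
print(age.by.sex)

# Add row and column totals
addmargins(age.by.sex)
```

Each age and sex combination occurs twice in these data, once as a case and once as a control.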
Factors
How does R determine the categories to use in the table(x) function?
help(table) states that the arguments must be objects that can be interpreted as factors.
R uses factors to specify that a variable is categorical, and to define the levels of the
categorical variable.
You define factors with the function factor(), or with the function as.factor().
Factors take a specified set of values called levels; the function levels() lists them.
Factors are different from data vectors.
Here are some examples.
# Data vector of the numbers 1 to 5
1:5
# A factor with levels 1 to 5.
> factor(1:5)
[1] 1 2 3 4 5
Levels: 1 2 3 4 5
# Notice the output from the factor definition:
Levels: 1 2 3 4 5
# A data vector is numeric
mean(1:5)
# A factor is not numeric. It is just the names of the levels.
mean(factor(1:5))
> mean(factor(1:5))
[1] NA
Warning message:
In mean.default(factor(1:5)) :
argument is not numeric or logical: returning NA
>
levels(x) tells us the possible levels (values) that the categorical variable can take.
outcomes = c("alive no cancer", "alive no cancer", "alive no cancer", "alive cancer",
"alive cancer", "dead")
levels(as.factor(outcomes))
[1] "alive cancer" "alive no cancer" "dead"
Statistical data often have categorical variables (male, female), (mild, moderate, severe)
that are stored as numeric values (0,1,2,…) in the data set.
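When a categorical variable is stored as numeric codes, factor() can attach a label to each code. The sketch below uses made-up severity codes, and the coding 0 = mild, 1 = moderate, 2 = severe is an assumption for this illustration.

```r
# Hypothetical severity codes: 0 = mild, 1 = moderate, 2 = severe
severity.code = c(0, 2, 1, 0, 0, 2)

# Attach a label to each numeric code
severity = factor(severity.code, levels=c(0, 1, 2),
                  labels=c("mild", "moderate", "severe"))
print(severity)
levels(severity)

# table() now reports the counts by label rather than by code
table(severity)
```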
Use sample() to take samples
Suppose we go fishing. If we catch a fish and put it back in the water, that is a sample
with replacement. If we catch a fish and eat it, we cannot put it back in the water, so
that is a sample without replacement.
We use the R function sample() to take samples with or without replacement from a list
of items or numbers. The same number may appear more than once in the sample
when we sample with replacement.
Use seq() to create a sequence of numbers from 1 to 10 in the variable x.
x=seq(1,10)
> x
 [1]  1  2  3  4  5  6  7  8  9 10
Take a single sample (one observation) from x.
sample(x,1)
> sample(x,1)
[1] 2
> sample(x,1)
[1] 10
> sample(x,1)
[1] 1
Use the function sample() to draw a random sample of 10 observations from x without
replacement. Repeat this several times to see the result. By default, sample() takes
samples without replacement.
sample(x,10)
> sample(x,10)
 [1] 10  6  1  4  3  7  2  5  9  8
> sample(x,10)
 [1]  8  5  7 10  9  6  4  1  2  3
> sample(x,10)
 [1]  4  6  7  3  9  8 10  2  1  5
Notice that, by default, sample() takes samples without replacement.
Use the function sample() to draw a random sample of 10 observations from x with
replacement.
sample(x,10, replace=TRUE)
> sample(x,10, replace=TRUE)
 [1] 8 2 4 4 4 2 5 6 7 5
> sample(x,10, replace=TRUE)
 [1]  7  3  7 10  3  1  7  7  2  8
> sample(x,10, replace=TRUE)
 [1] 1 9 3 8 9 8 7 1 4 7
When we sample with replacement, the same number may appear more than once in
the sample.
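Because sample() is random, your results will differ from the ones printed above. If you need the same sample every time, for example in homework that someone else will rerun, you can set the random number seed first. This is a minimal sketch, and the seed value 285 is arbitrary.

```r
x = seq(1, 10)

# Setting the seed makes the random draws repeatable
set.seed(285)
first.draw = sample(x, 10)
set.seed(285)
second.draw = sample(x, 10)

# TRUE: the two draws match because the seed was the same
identical(first.draw, second.draw)

# FALSE: without replacement no value appears twice
any(duplicated(first.draw))

# With replacement the sample can be larger than x itself
big.sample = sample(x, 25, replace=TRUE)
length(big.sample)
```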
Execute commands from a file using source()
You can use the source() function to execute a series of R commands from a file.
The file "example source file.txt" has the following contents.
weight = c(89, 122, 125, 111, 192, 111, 211, 133, 156, 79)
print("weight data")
print(weight)
I can execute the commands in the file using the source() command.
setwd("C:/Users/Walker/Desktop/UCSD Biom 285/Biom 285 Lectures/Data")
source("example source file.txt")
R will display the following results.
[1] "weight data"
 [1]  89 122 125 111 192 111 211 133 156  79
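To try source() without creating a file by hand, you can let R write the file. The following sketch writes the same three commands to a temporary file and then executes them. The name script.file is made up for this illustration.

```r
# Write three R commands to a temporary script file
script.file = tempfile(fileext=".R")
writeLines(c('weight = c(89, 122, 125, 111, 192, 111, 211, 133, 156, 79)',
             'print("weight data")',
             'print(weight)'),
           script.file)

# Execute the commands in the file
source(script.file)

# Variables created by the file are now available in the session
mean(weight)

# Remove the temporary file
unlink(script.file)
```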
### The following is more advanced material on for loops and functions. It is provided
for students with programming experience who want to implement R programs.
For loops
Use a for loop when you want to perform the same action many times on a list of
values:
for (index in values)
{
  block of commands
}
# Example: store the squares of 1 to 10 in result.vector and print each one
result.vector= c()
for (index in 1:10)
{
  result.vector[index] = index^2
  print(c(index, result.vector[index]))
}
plot(1:10,result.vector)
[1] 1 1
[1] 2 4
[1] 3 9
[1] 4 16
[1] 5 25
[1] 6 36
[1] 7 49
[1] 8 64
[1] 9 81
[1] 10 100
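Many R functions and operators already work on a whole vector at once, so a loop is often not needed. The sketch below computes the same squares three ways, using the loop from above, the ^ operator on a vector, and sapply(). The names squares and squares2 are made up for this illustration.

```r
# Loop version, as above
result.vector = c()
for (index in 1:10)
{
  result.vector[index] = index^2
}

# Vectorized version: ^ is applied to every element at once
squares = (1:10)^2

# sapply() applies a function to each value and collects the results
squares2 = sapply(1:10, function(i) i^2)

print(squares)
all.equal(result.vector, squares)
```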
Functions
When you perform a more complicated action many times, it is convenient to put it into
a function.
# Define a function that has no arguments
my.function = function()
{
  sum(1:20)
}
# Look at the function definition
my.function
> my.function
function()
{
  sum(1:20)
}
>
# Execute the function
my.function()
> my.function()
[1] 210
# Define a function that has three arguments, with default value for one argument
my.function2 = function(first,last,step=1)
{
  sum(seq(first,last, by=step))
}
# Look at the function definition
my.function2
> my.function2
function(first,last,step=1)
{
  sum(seq(first,last, by=step))
}
>
# Execute the function
my.function2(1,20)
> my.function2(1,20)
[1] 210
my.function2(1,20, step=2)
> my.function2(1,20, step=2)
[1] 100
>
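A function can also return several values at once as a named vector. The following sketch defines a hypothetical helper, my.summary(), that is not part of the course material. It combines the sum(), mean(), length() and missing value ideas from earlier in this handout.

```r
# Hypothetical helper: summarize a numeric vector
# na.rm defaults to TRUE, so missing values are dropped before summarizing
my.summary = function(values, na.rm=TRUE)
{
  if (na.rm)
  {
    values = values[!is.na(values)]
  }
  c(n=length(values), total=sum(values), average=mean(values))
}

weight = c(89, 122, 125, 111, 192, 111, 211, 133, 156, 79)
result = my.summary(weight)
print(result)

# Pick one element of the result by name
result["average"]
```

The value of the last expression in the function body is what the function returns.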