Lecture 3: Data Manipulation

Introduction to R
Andrew Jaffe
9/27/10
Overview
- Practice Solutions
- Indexing
- Data Management
- Data Summaries

Practice

Make a 2 x 2 table of sex and dog
> table(dat$sex, dat$dog)

     no yes
  F 264 229
  M 254 253
Practice

Create a 'BMI' variable using height and weight
> dat$bmi = dat$weight*703/dat$height^2
> head(dat$bmi)
[1] 23.44931 31.29991 25.69422 23.89881 23.11172 28.13324
Practice

Create an 'overweight' variable, which gives the value 1 for people with BMI > 30 and 0 otherwise
> dat$overweight = ifelse(dat$bmi > 30, 1, 0)
> head(dat$overweight)
[1] 0 1 0 0 0 0
Practice

Add those two variables to the dataset and save it as a text file somewhere
> write.table(dat, "lec2_practice.txt", quote = FALSE, row.names = FALSE, sep = "\t")
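The practice solutions above can be tried end to end without the course data. The sketch below uses a small made-up data frame (the values and the temporary file are assumptions for illustration, not the lecture's `dat`) and reads the saved file back in to check the round trip.

```r
# Hypothetical toy data standing in for the lecture's `dat`
dat <- data.frame(
  sex    = c("F", "M", "F", "M"),
  height = c(64, 70, 62, 68),      # inches
  weight = c(130, 220, 180, 150)   # pounds
)

# BMI from pounds and inches: weight * 703 / height^2
dat$bmi <- dat$weight * 703 / dat$height^2

# 1 if BMI > 30, otherwise 0
dat$overweight <- ifelse(dat$bmi > 30, 1, 0)

# Save as a tab-delimited text file (a temporary file here), then read it back
out_file <- tempfile(fileext = ".txt")
write.table(dat, out_file, quote = FALSE, row.names = FALSE, sep = "\t")
dat2 <- read.table(out_file, header = TRUE, sep = "\t")
```

Using `header = TRUE` and the same `sep` when reading keeps the column names and the tab delimiter consistent with how the file was written.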
Indexing

Vectors: vector[index] takes ‘index’ elements from vector and returns them
> x = c(1,3,7,34,435)
> x[1]
[1] 1
> x[c(1,4)]
[1] 1 34
> 2:4
[1] 2 3 4
> x[2:4]
[1] 3 7 34
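As a small sketch of the same idea, the index can be a single position, a vector of positions, or a range. Two further forms that are not on the slide but are standard R are shown as well: a negative index drops a position, and a logical vector keeps the elements where it is TRUE. The variable names are only for illustration.

```r
x <- c(1, 3, 7, 34, 435)

first   <- x[1]        # a single position
picked  <- x[c(1, 4)]  # several positions at once
middle  <- x[2:4]      # a range of positions
dropped <- x[-1]       # negative index: everything except position 1
big     <- x[x > 5]    # logical index: elements where the test is TRUE
```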
Indexing

Replace elements in a vector – combining
indexing, is.na(), and rep()
> x = c(1,3,NA,6,NA,8)
> which(is.na(x))
[1] 3 5
> x[is.na(x)] = 0 # or rep(0)
> x
[1] 1 3 0 6 0 8
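A short sketch of why a single 0 is enough here: R recycles the one replacement value across every selected position, so spelling out a vector of zeros with rep() gives the same result. The names `y`, `z`, and `na_pos` are made up for the example.

```r
x <- c(1, 3, NA, 6, NA, 8)
na_pos <- which(is.na(x))   # positions of the missing values

# Option 1: one value, recycled across all selected positions
y <- x
y[is.na(y)] <- 0

# Option 2: an explicit vector with one zero per missing value
z <- x
z[na_pos] <- rep(0, length(na_pos))
```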
Indexing

Data.frames/matrices: dat[row,col]
- Can subset/extract a row: dat[row,]
- Can subset/extract a column: dat[,col]
> x = matrix(c(1,2,3,4,5,6), ncol = 3)
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
Indexing
> x[1,]
[1] 1 3 5
> x[,1]
[1] 1 2
> x[1,1]
[1] 1
> x[1:2,1:2]
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
Indexing
> x
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
> x[1,] = rep(1)
> x
     [,1] [,2] [,3]
[1,]    1    1    1
[2,]    2    4    6
> x[,1] = rep(2)
> x
     [,1] [,2] [,3]
[1,]    2    1    1
[2,]    2    4    6
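The row and column assignments can be reproduced as a runnable sketch. Since rep(1) is just the single value 1, R recycles it across the whole row (and likewise 2 across the column). The names `after_row` and `after_col` are made up to keep a copy of each intermediate result.

```r
x <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 3)

# Overwrite the whole first row; the single value is recycled
x[1, ] <- 1
after_row <- x

# Overwrite the whole first column in the already-modified matrix
x[, 1] <- 2
after_col <- x
```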
Data Management
An aside: save() and load()
- save(obj_1,…,obj_n, file = "filename.rda")
  - Saves R objects (vectors, matrices, or data.frames) as an .rda file (similar to .dta)
- load("filename.rda")
  - Loads whatever objects were saved in the .rda
- Easier than reading/writing tables
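A minimal round trip with save() and load(), assuming two made-up objects and a temporary file rather than a real project file. Note that load() restores the objects under their original names and returns those names as a character vector.

```r
a <- c(1, 3, 7)
m <- matrix(1:4, ncol = 2)

# Save both objects into one .rda file (a temporary file here)
rda_file <- tempfile(fileext = ".rda")
save(a, m, file = rda_file)

# Remove them from the workspace, then bring them back
rm(a, m)
loaded <- load(rda_file)   # character vector of the restored object names
```

Because the objects come back under their saved names, there is no need to assign the result of load() to recreate them.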
Data Management

Your workspace can be saved as an .Rdata file
- You get asked this every time you close R
- save.image("filename.Rdata") saves all objects in your workspace (what ls() returns)
- Each folder might have its own .Rdata file

Doing this is personal preference - if you have a script and it’s a quick analysis, you probably don’t need a saved image
Data Management
"lec3_data.rda" can be downloaded from the website
- Similar method to read in the data: load("lec3_data.rda")
- For R to find the file, do one of the following:
  - Put it in the same directory as your script
  - Set your working directory
  - Use the full filename (including its path)
Data Management

What are the dimensions of the dataset?
> dim(dog_dat)
[1] 482   6
Data Management

How many dogs are in this dataset? Is this
dataset unique?
> length(unique(dog_dat$dog_id))
[1] 482
> length(dog_dat$dog_id)
[1] 482
Data Management

What are the column/variable names?
> head(dog_dat)
  dog_id owner_id dog_type dog_wt_mo1 dog_len_mo1 dog_food_mo1
1      1      394      lab       51.5        13.8         25.8
2      2      571      lab       48.3        24.6         33.1
3      3      986   poodle       59.3        22.7         29.2
4      4      750      lab       46.4        22.3         27.6
5      5      882    husky       48.0        20.9         28.0
6      6      762   poodle       47.0        19.1         31.0
> names(dog_dat)
[1] "dog_id"       "owner_id"     "dog_type"     "dog_wt_mo1"
[5] "dog_len_mo1"  "dog_food_mo1"
Data Management

Some explanation of the variables
- dog_id: id of dog
- owner_id: id of owner
- dog_type: type of dog
- dog_wt_mo1: dog weight at month 1 (baseline)
- dog_len_mo1: dog length at month 1
- dog_food_mo1: baseline dog food consumption
Data Management

Subsetting data: separate data into two data.frames based on a variable:
> lab = dog_dat[dog_dat$dog_type == "lab",]
> head(lab)
   dog_id owner_id dog_type dog_wt_mo1 dog_len_mo1 dog_food_mo1
1       1      394      lab       51.5        13.8         25.8
2       2      571      lab       48.3        24.6         33.1
4       4      750      lab       46.4        22.3         27.6
7       7      664      lab       53.0        18.2         25.7
13     13      713      lab       48.3        23.4         31.8
15     15      480      lab       46.6        20.8         31.3
Data Management
> lab = dog_dat[dog_dat$dog_type == "lab",]
> head(which(dog_dat$dog_type == "lab"))
[1] 1 2 4 7 13 15
Taking those specific rows, and all of the columns of the original data
Data Management
> lab2 = dog_dat[dog_dat$dog_type == "lab",1:3]
> head(lab2,3)
  dog_id owner_id dog_type
1      1      394      lab
2      2      571      lab
4      4      750      lab
Taking those specific rows, and the first 3 columns of the original data
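The subsetting steps can be sketched on a stand-in data frame built from the six rows that head(dog_dat) prints; the full 482-row dataset is not reproduced here. The names `is_lab` and `lab3` are made up for the example, and selecting columns by name is shown as an alternative to 1:3.

```r
# Stand-in for dog_dat: only the six rows shown by head(dog_dat)
dog_dat <- data.frame(
  dog_id       = 1:6,
  owner_id     = c(394, 571, 986, 750, 882, 762),
  dog_type     = c("lab", "lab", "poodle", "lab", "husky", "poodle"),
  dog_wt_mo1   = c(51.5, 48.3, 59.3, 46.4, 48.0, 47.0),
  dog_len_mo1  = c(13.8, 24.6, 22.7, 22.3, 20.9, 19.1),
  dog_food_mo1 = c(25.8, 33.1, 29.2, 27.6, 28.0, 31.0)
)

# Logical vector: TRUE for rows that are labs
is_lab <- dog_dat$dog_type == "lab"

lab  <- dog_dat[is_lab, ]      # those rows, all columns
lab2 <- dog_dat[is_lab, 1:3]   # those rows, first three columns by position
lab3 <- dog_dat[is_lab, c("dog_id", "owner_id", "dog_type")]  # same columns by name
```

Selecting by name tends to be safer than by position if the column order of the data ever changes.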
Data Management

Note (stata users…) that we have two
data.frames in our workspace! [ls()]
Data Management

Remember we used ifelse() for binary conversions?
> heavy = ifelse(dog_dat[,4] > mean(dog_dat[,4]), 1, 0)
> head(heavy)
[1] 1 0 1 0 0 0
Note that you can use column indexing instead of $name for data.frames
This is just the mean of that column:
> mean(dog_dat[,4])
[1] 49.69606
Data Management
The cut() function can split data into more groups – quintiles, tertiles, etc
- cut(dat, breaks)
  - dat is a vector of numerical or integer values
  - breaks is where to make the cuts
Data Management

If ‘breaks’ is one number (n), it splits the range of the data into ‘n’ equal-width intervals (the groups need not contain equal numbers of values)
> x = 1:5 # 1 2 3 4 5 or seq(1,5)
> cut(x, 2)
[1] (0.996,3] (0.996,3] (0.996,3] (3,5] (3,5]
Levels: (0.996,3] (3,5]
FACTORS!
> cut(x, 3)
[1] (0.996,2.33] (0.996,2.33] (2.33,3.67] (3.67,5] (3.67,5]
Levels: (0.996,2.33] (2.33,3.67] (3.67,5]
> cut(x,3, labels=F) # returns integers of groups, not factors
[1] 1 1 2 3 3
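A small sketch of what a single-number `breaks` does: the intervals have equal width, but the number of values falling in each can differ. Tabulating the result makes that visible. The names `groups`, `counts`, and `codes` are made up for the example.

```r
x <- 1:5

groups <- cut(x, 2)    # two equal-width intervals; the result is a factor
counts <- table(groups)  # how many values landed in each interval

# labels = FALSE returns the integer group codes instead of a factor
codes <- cut(x, 3, labels = FALSE)
```

Here the first interval holds three values and the second holds two, even though the intervals are equally wide.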
Data Management
What is a factor? Similar to terms like ‘category’ and ‘enumerated type’
- Has ‘levels’ associated with it – could be ordinal if factor(…,ordered = T)
- To be converted with factor(), the data only need to have an as.character() method and be sortable
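A brief sketch of factors, using a made-up character vector. By default the levels are the sorted unique values; supplying `levels` together with `ordered = TRUE` gives an ordinal factor that supports comparisons.

```r
size <- c("small", "large", "medium", "small")

# Default: levels are the sorted unique values
f <- factor(size)
f_levels <- levels(f)      # "large" "medium" "small"
f_codes  <- as.integer(f)  # position of each value among the levels

# Ordinal factor with an explicit level order
o <- factor(size, levels = c("small", "medium", "large"), ordered = TRUE)
below_large <- o < "large"  # comparisons use the level order
```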

Data Management

If ‘breaks’ are more than one number,
splits the vector by those numbers
> x = 1:10
> cut(x, c(0,3,6,10))
[1] (0,3] (0,3] (0,3] (3,6] (3,6] (3,6] (6,10] (6,10] (6,10] (6,10]
Levels: (0,3] (3,6] (6,10]
> cut(x, c(0,3,6,10), FALSE)
[1] 1 1 1 2 2 2 3 3 3 3
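As a sketch of explicit breaks, the example below also passes a character vector to `labels` (the label names are made up) and shows one consequence of the default intervals being open on the left: a value equal to the lowest break falls outside every interval and becomes NA.

```r
x <- 1:10

# Three intervals: (0,3], (3,6], (6,10], with readable labels
grp <- cut(x, breaks = c(0, 3, 6, 10), labels = c("low", "mid", "high"))
grp_counts <- table(grp)

# 0 is not inside (0,3], so it gets no group
outside <- cut(0, breaks = c(0, 3, 6, 10))
```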
Data Management
Something more applicable for cut: the quantile(x,probs) function - default ‘probs’ is seq(0,1,0.25), i.e. quartiles
- seq(start, end, by) – creates a sequence from the starting value to the ending value by the specified amount
  - seq(0,10) ~ 0:10 # 0, 1, 2, …, 9, 10
  - seq(0,10,0.5) # 0, 0.5, 1.0, …, 9.5, 10.0
Data Management
Now for stuff with our data:
> quantile(dog_dat$dog_wt_mo1)
    0%    25%    50%    75%   100%
10.600 44.600 49.200 55.275 72.500
> quantile(dog_dat$dog_wt_mo1, seq(0,1,0.5))
  0%  50% 100%
10.6 49.2 72.5
> quantile(dog_dat$dog_wt_mo1, 0.6)
 60%
51.5
> quantile(dog_dat$dog_wt_mo1, c(0.4,0.6))
  40%   60%
47.24 51.50
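To make the link between seq() and quantile() concrete, here is a sketch on a made-up vector, 1:9, chosen so the quartiles are easy to verify by hand. The default `probs` gives quartiles; passing seq(0,1,0.2) instead gives quintiles.

```r
x <- 1:9

probs_q4 <- seq(0, 1, 0.25)           # 0, 0.25, 0.5, 0.75, 1
quartiles <- quantile(x)              # default probs: quartiles
same      <- quantile(x, probs_q4)    # identical to the default call

quintiles <- quantile(x, seq(0, 1, 0.2))  # six cut points for five groups
```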
Data Management
> sp = quantile(dog_dat$dog_wt_mo1, 0.75)
> big = ifelse(dog_dat$dog_wt_mo1 > sp, 1, 0)
> head(big)
[1] 0 0 1 0 0 0
> quant = cut(dog_dat$dog_wt_mo1, quantile(dog_dat$dog_wt_mo1))
> head(quant)
[1] (49.2,55.3] (44.6,49.2] (55.3,72.5] (44.6,49.2] (44.6,49.2] (44.6,49.2]
Levels: (10.6,44.6] (44.6,49.2] (49.2,55.3] (55.3,72.5]
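One detail worth checking when cutting at quantiles: the lowest break is the minimum of the data, and because the default intervals are open on the left, the minimum value itself is assigned NA. The sketch below uses only the six weights printed by head(dog_dat) as a stand-in, and shows the standard `include.lowest = TRUE` argument of cut() as one way to keep that value.

```r
# The six weights shown by head(dog_dat), used as a stand-in
wt <- c(51.5, 48.3, 59.3, 46.4, 48.0, 47.0)
br <- quantile(wt)   # breaks at the min, quartiles, and max

q_default <- cut(wt, br)                         # minimum falls outside the first interval
q_closed  <- cut(wt, br, include.lowest = TRUE)  # first interval also includes the minimum

n_missing_default <- sum(is.na(q_default))
n_missing_closed  <- sum(is.na(q_closed))
```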
Data Summaries
This is some of the only “statistics” in the course
- R functions can perform statistics well; here are some basics for summaries

Data Summaries
- mean(dat, na.rm = F)
- median(dat, na.rm = F)
> x = c(1,2,4,6,NA)
> mean(x)
[1] NA
> mean(x, na.rm=T)
[1] 3.25
> median(x,na.rm=T)
[1] 3
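A short sketch of what `na.rm = TRUE` does: it drops the missing values before computing, which gives the same answer as removing them by hand with indexing. The variable names are made up for the example.

```r
x <- c(1, 2, 4, 6, NA)

with_na   <- mean(x)                # NA: one missing value makes the mean missing
dropped   <- mean(x, na.rm = TRUE)  # missing values removed first
by_hand   <- mean(x[!is.na(x)])     # same thing, using indexing
n_missing <- sum(is.na(x))          # count of missing values
```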
Data Summaries
> x = c(1,2,4,7,9,11)
> mean(x)
[1] 5.666667
> median(x)
[1] 5.5
> var(x)
[1] 15.86667
> sd(x)
[1] 3.983298
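For reference, var() is the sample variance (dividing by n - 1) and sd() is its square root. The sketch below recomputes both by hand for the same vector; `manual_var` and `manual_sd` are names made up for the example.

```r
x <- c(1, 2, 4, 7, 9, 11)
n <- length(x)

# Sample variance: sum of squared deviations from the mean, divided by n - 1
manual_var <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation: square root of the variance
manual_sd <- sqrt(manual_var)
```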
Data Summaries
Let’s combine some concepts!
- Take the mean food consumption of all of the labs

Data Summaries

First, figure out which entries correspond
to dogs that are labs
> Index = which(dog_dat$dog_type == "lab")
> head(Index)
[1] 1 2 4 7 13 15
Data Summaries

Then, take the mean of the data you want
> mean(dog_dat$dog_food_mo1[Index])
[1] 30.04
Note that we first created a vector of dog food, then indexed it - there are no commas needed for the indexing (because it’s a vector)
Data Summaries

Combined into 1 line/command:
> mean(dog_dat$dog_food_mo1[dog_dat$dog_type == "lab"])
[1] 30.04
> mean(dog_dat[dog_dat$dog_type == "lab",6])
[1] 30.04
> mean(dog_dat[dog_dat$dog_type == "lab","dog_food_mo1"])
[1] 30.04
Pick your favorite – they’re all the same! Note that
the first option might make the most sense…
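The three forms can be checked against each other on a stand-in data frame built from the six rows printed by head(dog_dat). Because it is only a subset, the mean differs from the 30.04 computed on the full data. The result names are made up for the example.

```r
# Stand-in for dog_dat: only the columns needed, six rows from head(dog_dat)
dog_dat <- data.frame(
  dog_id       = 1:6,
  owner_id     = c(394, 571, 986, 750, 882, 762),
  dog_type     = c("lab", "lab", "poodle", "lab", "husky", "poodle"),
  dog_wt_mo1   = c(51.5, 48.3, 59.3, 46.4, 48.0, 47.0),
  dog_len_mo1  = c(13.8, 24.6, 22.7, 22.3, 20.9, 19.1),
  dog_food_mo1 = c(25.8, 33.1, 29.2, 27.6, 28.0, 31.0)
)

is_lab <- dog_dat$dog_type == "lab"

by_dollar <- mean(dog_dat$dog_food_mo1[is_lab])     # vector first, then index
by_number <- mean(dog_dat[is_lab, 6])               # rows and column position
by_name   <- mean(dog_dat[is_lab, "dog_food_mo1"])  # rows and column name
```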
Practice
Compute the average dog weight, dog length, and dog food consumption for each dog type at baseline
- Reminder: the dog types are lab, poodle, husky, and retriever
