CSSS 508: Intro to R
1/18/06
Lab 2
Working with Your Dataset
This lab is just a practice session on reading in a dataset and asking questions about it.
1) Download the lab 2 data from the class website (cardiacoutput.csv). Save it on the C: drive.
This Excel file has been saved as a .csv file by going to the Save As option in the File
Menu and choosing CSV (comma delimited) (*.csv) in the Save as Type: option. This
format sometimes makes it easier for R to read in an Excel file.
At the R command line: data<-read.table("C:/cardiacoutput.csv",sep=",")
The sep="," option tells read.table that the values on each line are separated by commas, as they are in a .csv file.
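The read step can be tried without the real file. The sketch below is only an illustration: it writes three made-up rows to a temporary file (the values are invented and are not the study data) and reads them back two ways. It assumes, as the dimensions reported later in this lab suggest, that the file has no header row. stringsAsFactors=TRUE is requested explicitly because R 4.0.0 and later no longer turn text columns into factors by default, and the summary( ) output later in this lab relies on factors.

```r
# Made-up example rows (not the real study data): id, diagnosis, age, gender
lines <- c("1,CHF,61,m",
           "2,HTx,54,f",
           "3,HTx,47,m")
path <- tempfile(fileext = ".csv")   # temporary stand-in for the real csv file
writeLines(lines, path)

# read.table needs sep="," for comma-delimited files; header defaults to FALSE
d1 <- read.table(path, sep = ",", stringsAsFactors = TRUE)

# read.csv uses comma-friendly defaults, but header defaults to TRUE,
# so it must be switched off for a file with no header row
d2 <- read.csv(path, header = FALSE, stringsAsFactors = TRUE)

dim(d1)             # 3 rows, 4 columns
all.equal(d1, d2)   # compare the two results
```

If your own file does have a header row, use header=TRUE instead; otherwise the column names are read as a row of data.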
Now we have a data matrix - a subset from a study used to look at several different
measures of cardiac output. They were interested in determining which measures better
predicted cardiac heart failure.
2) Learn the basics of your data.
> dim(data)
[1] 64 12
You have 64 subjects, 12 variables.
Your variables (in order) are: ID, Diagnosis, Age, Gender, Pulmonary Artery Pressure,
Pulmonary Wedge Pressure, Cardiac Output Measure 1, Cardiac Output Measure 2,
Cardiac Index 1, Cardiac Index 2, Heart Rate, and Mean Arterial Pressure.
Often it is easier to name each of the variables so you can refer to them by name rather
than by column number. There are two ways to do this.
a) Individually assign each column to a name variable.
> id<-data[,1]
> dx<-data[,2]
> age<-data[,3]
> gender<-data[,4]
> pap<-data[,5]
> pwp<-data[,6]
> co.1<-data[,7]
> co.2<-data[,8]
> ci.1<-data[,9]
> ci.2<-data[,10]
> hr<-data[,11]
> map<-data[,12]
Rebecca Nugent, Department of Statistics, U. of Washington
b) Define a data frame where each column has a name.
> data<-data.frame(id=data[,1],dx=data[,2],age=data[,3],gender=data[,4],
+ pap=data[,5],pwp=data[,6],co.1=data[,7],co.2=data[,8],ci.1=data[,9],
+ ci.2=data[,10],hr=data[,11],map=data[,12])
Now when you need to access a variable, you can just type: data$age.
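A third option, not used in this lab, is to rename the columns of the existing data frame in place with names( ). The sketch below uses a made-up three-subject data frame (values invented for illustration) with the default V1, V2, ... names that read.table assigns.

```r
# Made-up 3-subject data frame with default column names V1..V4 (values invented)
d <- data.frame(V1 = 1:3,
                V2 = c("CHF", "HTx", "HTx"),
                V3 = c(61, 54, 47),
                V4 = c("m", "f", "m"))

# Rename every column in one step; the vector must have one name per column
names(d) <- c("id", "dx", "age", "gender")

names(d)      # the new column names
d$age         # access a column by name
d[["age"]]    # equivalent to d$age
```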
It’s always a good idea to take a look at your individual variables to get an overall
picture. One command that will give you the range, mean, median, etc., as well as whether
there are any missing values, is: summary( ).
> summary(dx)
CHF HTx
 22  42
> summary(age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  25.00   54.00   60.00   57.19   68.00   80.00
> summary(data$gender)
 f  m
 4 60
> summary(data$pap)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   11.0    18.0    22.0    23.9    27.5    46.0     5.0
> summary(pwp)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   6.10   13.05   16.30   18.50   19.25   47.40    5.00
> summary(co.1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.700   4.200   5.200   5.347   6.225   9.700
> summary(co.2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.700   4.300   5.550   5.623   7.225   9.300
> summary(ci.1)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.400   2.236   2.600   2.696   3.063   5.100
> summary(ci.2)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   2.175   2.700   2.787   3.525   4.600
> summary(hr)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  54.00   78.00   85.50   87.88   98.00  121.00
> summary(map)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
   76.0    94.0   110.0   109.4   123.0   145.0     2.0
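To see how summary( ) treats the two kinds of variables, here is a small sketch on invented values (not the study data). It also shows why the NA's column matters: mean( ) returns NA when a value is missing unless you ask it to drop missing values with na.rm=TRUE.

```r
# Invented values, not the study data
x <- c(11, 18, NA, 22, 46)                 # numeric with one missing value
g <- factor(c("m", "m", "f", "m", "m"))    # categorical variable

s <- summary(x)         # six-number summary plus a count of NA's
s["NA's"]               # the number of missing values
mean(x)                 # NA, because one value is missing
mean(x, na.rm = TRUE)   # mean of the four observed values

summary(g)              # counts per level
```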
So who’s missing data?
> which(is.na(pap))
[1] 1 31 47 56 57
> which(is.na(pwp))
[1] 31 47 54 56 57
> which(is.na(map))
[1] 12 64
Note that some people are missing more than one value. So the number of people who
are missing data is NOT found by adding up the number of NA’s for each variable.
> c(which(is.na(pap)),which(is.na(pwp)),which(is.na(map)))
[1] 1 31 47 56 57 31 47 54 56 57 12 64
> unique(c(which(is.na(pap)),which(is.na(pwp)),which(is.na(map))))
[1] 1 31 47 56 57 54 12 64
> sort(unique(c(which(is.na(pap)),which(is.na(pwp)),which(is.na(map)))))
[1] 1 12 31 47 54 56 57 64
(The unique( ) function removes all duplicates from a vector.)
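When many variables can have missing values, listing each one inside c( ) gets long. One alternative, sketched here on a made-up data frame (values invented for illustration), is to count the NA's in each row with rowSums(is.na( )). It should give the same row numbers as the sort(unique( )) approach above.

```r
# Made-up measurements for 6 subjects (values invented for illustration)
d <- data.frame(pap = c(NA, 20, 25, NA, 30, 18),
                pwp = c(12, NA, 15, NA, 17, 14),
                map = c(100, 110, 95, 120, NA, 105))

na.per.subject <- rowSums(is.na(d))   # number of missing values in each row
who <- which(na.per.subject > 0)      # row numbers with at least one NA

who              # already sorted, no duplicates
length(who)      # number of subjects missing something
sum(is.na(d))    # total number of NA's, which is larger
```

Here four subjects are missing data but there are five NA's in total, because subject 4 is missing two values.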
We can choose subsets of the data matrix by values of just one variable.
Let’s say we want to split up the dataset into males and females.
> females<-data[gender=="f",]
> dim(females)
[1] 4 12
> males<-data[gender=="m",]
> dim(males)
[1] 60 12
Or choose the subset of people who are NOT missing data:
> missing<-sort(unique(c(which(is.na(pap)),which(is.na(pwp)),which(is.na(map)))))
> newdata<-data[-missing,]
> dim(newdata)
[1] 56 12
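Dropping rows with a negative index works here because some rows are missing data. If no rows were missing, the index would be empty and data[-missing,] would return zero rows instead of all of them. The sketch below, on invented data, shows that pitfall and one way around it using complete.cases( ), which returns TRUE for every row with no NA's.

```r
# Invented data: 4 subjects, one with a missing value
d <- data.frame(age = c(61, 54, 47, 70),
                map = c(100, NA, 95, 120))

keep <- complete.cases(d)   # TRUE for rows with no NA in any column
newd <- d[keep, ]
nrow(newd)                  # 3 rows remain

# Pitfall with negative indexing: if nothing is missing, the index is empty
full <- d[keep, ]                       # a copy with no missing values
none <- which(!complete.cases(full))    # integer(0)
nrow(full[-none, ])                     # 0 rows: everything was dropped
nrow(full[complete.cases(full), ])      # 3 rows: nothing was dropped
```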
Or a random sample of people from our matrix:
> random.sample<-sample(seq(1,nrow(data)),10)
> random.sample
[1] 52 2 25 54 27 8 40 11 49 17
> sample.subset<-data[random.sample,]
> dim(sample.subset)
[1] 10 12
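Each call to sample( ) gives a different set of rows, so your numbers will not match the ones printed above. If you need to get the same sample again later, set the random seed first. In this sketch the seed value 508 is arbitrary and n is set to 64 to match the number of rows in this lab.

```r
n <- 64   # number of rows, matching dim(data) in this lab

set.seed(508)                      # arbitrary seed chosen for this sketch
first <- sample(seq_len(n), 10)    # 10 distinct row numbers, without replacement

set.seed(508)                      # same seed again
second <- sample(seq_len(n), 10)

identical(first, second)           # the two draws match
anyDuplicated(first)               # 0 means no row was picked twice
```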
We can also subset on any combination of variables:
> data[gender=="m"&age<30,]
(males who are younger than 30)
> data[co.1<4&co.2<4,]
(subjects with both cardiac output measures < 4)
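The variables used in the conditions above have no missing values. If a condition involves a variable that does have NA's, the logical index contains NA and R returns an extra row of all NA's for it. A minimal sketch on invented data, showing two common ways to avoid that:

```r
# Invented data: the third subject has a missing age
d <- data.frame(gender = c("m", "m", "m", "f"),
                age    = c(28, 45, NA, 25))

cond <- d$gender == "m" & d$age < 30   # TRUE, FALSE, NA, FALSE

nrow(d[cond, ])          # 2: the NA produces an extra row of all NA's
nrow(d[which(cond), ])   # 1: which() keeps only the TRUE positions
nrow(subset(d, gender == "m" & age < 30))   # 1: subset() also drops NA's
```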
We can also select a group of variables (instead of patients).
> new.vars<-cbind(gender,age,dx)
> dim(new.vars)
[1] 64 3
Or
> newvars<-data[,c(4,3,2)]
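The two approaches do not return the same kind of object. Indexing the data frame keeps it a data frame, while cbind( ) builds a matrix, and a matrix can hold only one type of value, so factors are stored as their integer level codes. The sketch below uses invented data and selects columns by name, which assumes the columns were named as in option b) above.

```r
# Invented data for 3 subjects
d <- data.frame(id     = 1:3,
                dx     = factor(c("CHF", "HTx", "HTx")),
                age    = c(61, 54, 47),
                gender = factor(c("m", "f", "m")))

by.name <- d[, c("gender", "age", "dx")]   # still a data frame, types preserved
class(by.name)
levels(by.name$gender)                     # "f" "m"

m <- cbind(gender = d$gender, age = d$age, dx = d$dx)   # a numeric matrix
class(m)
m[, "gender"]                              # level codes 2 1 2, labels are lost
```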
Practice several conditional statements with this dataset. What are some questions you
would ask about your data?
How many people are older than 45?
> sum(age>45)
[1] 52
What percent of the patients have a mean arterial pressure between 95 and 125?
> sum(map>95 & map<125)/nrow(data)
[1] NA
We have missing data in the map variable.
> sum(!is.na(map)&map>95 & map<125)/nrow(data)
[1] 0.5
If is.na(map) tells you which values are missing, !is.na(map) tells you which ones are not
missing. Putting a ! in front of a true/false expression flips it. Think of the ! as
putting a "not" in front of the command: !is.na = is not missing
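Note that dividing by nrow(data) counts the subjects with a missing value in the denominator. Whether that is what you want depends on the question. The sketch below, on invented pressures, compares it with the proportion among subjects who actually have a measurement, using the fact that the mean of a logical vector is the proportion of TRUEs.

```r
# Invented mean arterial pressures for 8 subjects, two of them missing
map <- c(100, 130, NA, 96, 90, 110, NA, 120)

in.range <- map > 95 & map < 125   # NA wherever map is NA

sum(!is.na(map) & in.range) / length(map)   # 4/8: missing subjects count in the denominator
mean(in.range, na.rm = TRUE)                # 4/6: only subjects with a measurement
```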
Other helpful commands:
any( ): checks if there are any TRUEs in your logical vector; returns TRUE or FALSE
all( ): checks if all values in your logical vector are TRUE; returns TRUE or FALSE
Are there any women in the study?
> any(gender=="f")
[1] TRUE
Are there any missing values in pulmonary wedge pressure?
> all(!is.na(pwp))
[1] FALSE
These are helpful if you have a really long list of true/falses that you don’t want to scan.
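The question about missing values can be asked in two equivalent ways: all(!is.na( )) asks "is every value present?", so FALSE means something is missing, while any(is.na( )) asks the question directly. The sketch below uses invented values and also shows that any( ) and all( ) can return NA themselves when the vector contains missing values.

```r
# Invented wedge pressures, one missing
pwp <- c(12.5, NA, 16.3, 19.0)

any(is.na(pwp))     # TRUE: at least one value is missing
all(!is.na(pwp))    # FALSE: not every value is present
any(is.na(pwp)) == !all(!is.na(pwp))   # the two checks always agree

# With NA in the vector, any() and all() can themselves return NA
any(pwp > 20)                 # NA: no TRUE found, but the missing value might be one
any(pwp > 20, na.rm = TRUE)   # FALSE once NA's are ignored
```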